Comments (4)
I also encountered a similar issue while training an ImageNet classification model on a resource-limited PC. Specifically, I found that the prefetch_factor of StreamingDataLoader is set to 10 by default when num_workers > 0.
litdata/src/litdata/streaming/dataloader.py
Line 604 in d5eff39
This default value appears to differ from what is described in the StreamingDataLoader docstring and also PyTorch's DataLoader default.
Manually setting the prefetch_factor to a smaller number, like 2, significantly reduced the RAM usage.
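For anyone who wants to try the same workaround, here is a minimal sketch of passing prefetch_factor explicitly. The dataset path, batch size, and worker count are placeholders I made up for illustration, not values from my setup, and I am assuming a standard litdata install where StreamingDataset and StreamingDataLoader are importable from the top-level package.

```python
# Sketch only: batch size and worker count below are hypothetical placeholders.
LOADER_KWARGS = {
    "batch_size": 256,     # hypothetical value
    "num_workers": 8,      # hypothetical value
    "prefetch_factor": 2,  # explicit override instead of relying on the library default
}


def build_loader(input_dir):
    """Build a StreamingDataLoader with an explicit prefetch_factor."""
    # Imported lazily so this snippet can be read and loaded without litdata installed.
    from litdata import StreamingDataLoader, StreamingDataset

    dataset = StreamingDataset(input_dir=input_dir)
    return StreamingDataLoader(dataset, **LOADER_KWARGS)
```

Calling build_loader with the path to an optimized dataset should return a loader whose workers each keep at most 2 batches queued.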
from litdata.
Thanks for pointing out the wrong default value in the docstring; it'll be fixed soon.
Did you try values in the middle, like 5-6? If yes, please share your experience.
That would help us decide whether anything else needs to be changed.
I think the behavior of StreamingDataLoader is expected.
prefetch_factor (int, optional, keyword-only arg) – Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers. (default value depends on the set value for num_workers. If value of num_workers=0 default is None. Otherwise, if value of num_workers > 0 default is 2).
If the training_step consumes batches faster than the dataloader prepares them, the prefetch queue could be small, thereby consuming less RAM. However, if the training_step slows down and the dataloader gradually fills up the prefetch queue until it reaches the limit (prefetch_factor * num_workers), an increase in host memory usage can be observed. Therefore, I think the only problem is that the default prefetch_factor is too high and not in sync with well-known defaults, causing people to easily overlook this issue.
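To put rough numbers on that limit, here is a back-of-the-envelope sketch. All inputs are assumptions chosen for illustration (8 workers, batch size 256, one float32 224x224 RGB image per sample); it counts only the tensor payload of a full prefetch queue and ignores labels, pinned-memory copies, and per-process overhead, so real usage will differ.

```python
def prefetched_bytes(prefetch_factor, num_workers, batch_size, bytes_per_sample):
    """Upper bound on tensor payload held in a full prefetch queue."""
    max_batches = prefetch_factor * num_workers  # queue limit described above
    return max_batches * batch_size * bytes_per_sample


# Assumed sample: 3 x 224 x 224 float32 image (4 bytes per value).
BYTES_PER_SAMPLE = 3 * 224 * 224 * 4

for factor in (10, 2):
    total = prefetched_bytes(factor, num_workers=8, batch_size=256,
                             bytes_per_sample=BYTES_PER_SAMPLE)
    print(f"prefetch_factor={factor}: {total / 2**30:.2f} GiB")
# prefetch_factor=10: 11.48 GiB
# prefetch_factor=2: 2.30 GiB
```

Under these assumed numbers the full queue is five times larger with a factor of 10 than with 2, which is in line with the RAM difference reported above.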
Hey @jackcyc. Yes, it seems to be fine with LLM pre-training and ImageNet, but as you perfectly stated, it is data-specific. This was a tuning I made to accelerate training after I noticed lower GPU utilization.
If you feel like the value should be put back to something lower, feel free to make a PR and we will merge it.