
Comments (4)

jackcyc commented on September 26, 2024

I also encountered a similar issue while training an ImageNet classification model on a resource-limited PC. Specifically, I found that the prefetch_factor of StreamingDataLoader defaults to 10 when num_workers > 0:

prefetch_factor=(10 if num_workers > 0 else None) if prefetch_factor is None else prefetch_factor,

This default differs from what the StreamingDataLoader docstring describes and from PyTorch's DataLoader default.

Manually setting the prefetch_factor to a smaller number, like 2, significantly reduced the RAM usage.
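
For anyone hitting the same problem, here is a minimal sketch of that workaround (assuming the usual StreamingDataset/StreamingDataLoader constructor arguments; the input path, batch size, and worker count are placeholders, not values from this thread):

from litdata import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset(input_dir="s3://my-bucket/imagenet-train")  # placeholder path

# Passing prefetch_factor explicitly avoids falling back to litdata's default of 10.
dataloader = StreamingDataLoader(
    dataset,
    batch_size=256,
    num_workers=8,
    prefetch_factor=2,  # PyTorch DataLoader's usual default when num_workers > 0
)

for batch in dataloader:
    ...  # training_step(batch)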


deependujha commented on September 26, 2024

Thanks for pointing out the wrong default value in the docstring; it'll be fixed soon.

https://github.com/Lightning-AI/litdata/blob/d5eff393cd17ba4f789fa846788f40b5ca4d0779/src/litdata/streaming/dataloader.py#L533C1-L538C1

Did you try values in between, like 5 or 6? If so, please share your experience.

That would help in deciding whether something needs to change here.


jackcyc commented on September 26, 2024

I think the behavior of StreamingDataLoader is expected.

prefetch_factor (int, optional, keyword-only arg) – Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers. (default value depends on the set value for num_workers. If value of num_workers=0 default is None. Otherwise, if value of num_workers > 0 default is 2).

If the training_step consumes batches faster than the dataloader prepares them, the prefetch queue stays small and so consumes little RAM. But if the training_step slows down, the dataloader gradually fills the prefetch queue up to its limit (prefetch_factor * num_workers batches), and host memory usage rises accordingly. So I think the only real problem is that the default prefetch_factor is too high and out of sync with the well-known defaults, which makes this easy to overlook.
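
To put the difference in scale (illustrative back-of-the-envelope numbers, not measurements from this thread), the worst-case host RAM held by the prefetch queues is roughly batch_bytes * prefetch_factor * num_workers:

# Rough upper bound on host RAM held by prefetched batches.
# Numbers are illustrative; real usage depends on the data, collate copies, and the allocator.
batch_size = 256
bytes_per_image = 224 * 224 * 3 * 4          # float32 ImageNet-sized tensor
batch_bytes = batch_size * bytes_per_image   # ~147 MiB per batch

num_workers = 8
for prefetch_factor in (10, 2):
    queued = prefetch_factor * num_workers * batch_bytes
    print(f"prefetch_factor={prefetch_factor}: up to {queued / 2**30:.1f} GiB queued")

# prefetch_factor=10 -> up to ~11.5 GiB queued; prefetch_factor=2 -> up to ~2.3 GiB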


tchaton commented on September 26, 2024

Hey @jackcyc. Yes, it seems to be fine with LLM pre-training and ImageNet, but as you rightly stated, it is data specific. This was a tuning I made to accelerate training because I noticed lower GPU utilization.

If you feel the default should be set back to something lower, feel free to make a PR and we will merge it.
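
For reference, the change under discussion would amount to a one-line edit of the default quoted above (a sketch of a possible patch, not something that has been merged):

# Hypothetical change in src/litdata/streaming/dataloader.py: fall back to 2, matching PyTorch
prefetch_factor=(2 if num_workers > 0 else None) if prefetch_factor is None else prefetch_factor,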

