Comments (11)
That is exactly the problem. As you can read in the second issue I linked, even if the size IS exact, when specifying drop_last=True, PyTorch Lightning seems to skip the validation.
Also, the warning is still raised by PyTorch Lightning.
I'll try to provide an MWE for it when I get some time to spare.
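For context, the stock PyTorch behaviour under discussion can be shown in a few lines (this is a generic sketch of drop_last, not the LitData MWE):

```python
# With drop_last=True, a partial final batch is silently discarded,
# so the loader reports fewer batches than ceil(len(ds) / batch_size).
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(10))                             # 10 samples
assert len(DataLoader(ds, batch_size=4, drop_last=False)) == 3   # ceil(10 / 4)
assert len(DataLoader(ds, batch_size=4, drop_last=True)) == 2    # floor(10 / 4)
```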
from litdata.
@enrico-stauss I think I have a fix. Could you try this branch: #139? This will only work with the StreamingDataLoader.
Example of the issue: there are 300 samples, 2 workers, and a batch size of 4. That is 300 / (4 * 2) = 37.5 batches. Because the last batch is incomplete, StopIteration is triggered while fetching it and the validation is skipped.
My PR extends the StreamingDataLoader to pass the number of workers and the batch size to the dataset, so the shuffler can drop the extra 0.5 batch causing the issue.
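The truncation described above can be sketched roughly as follows (function name and structure are my own illustration, not the actual PR code):

```python
# Hypothetical sketch: drop the trailing samples that would form a
# partial global batch, so every worker sees whole batches only.
def usable_samples(num_samples: int, num_workers: int, batch_size: int) -> int:
    """Largest sample count divisible into full batches across all workers."""
    global_batch = num_workers * batch_size        # samples consumed per step
    full_batches = num_samples // global_batch     # whole global batches
    return full_batches * global_batch

# 300 samples, 2 workers, batch size 4 -> 37 full global batches
assert usable_samples(300, 2, 4) == 296            # the 4 extra samples are dropped
```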
Hey @enrico-stauss, can you confirm the PR works for you?
Hey @enrico-stauss, can you share a reproducible script for the problem? I'm not sure I fully follow it. The size of the StreamingDataset should be exact; if not, there is a bug.
@tchaton
Please have a look at the modified original post. You can change DROP_LAST_TRAIN_SAMPLE=False to see that it then does run the validation epoch.
Maybe switching to the standard Dataset base class could also help with this one: #135 (comment).
Hey @enrico-stauss, changing the base type is a very large task and not something I am planning to do.
I understand. Do you have any idea how to proceed, though, as it does severely break compatibility? I might have a look at it but can't promise anything.
In all honesty, I think the change should be made on the PyTorch Lightning side, but as mentioned here, it seems that is just not possible at the moment.
Sorry @tchaton, I did not find time to test it earlier. My MWE, however, still shows that no validation is performed, even with the updates that are not yet merged into main. I don't think it's possible to resolve this on the LitData side without either removing the __len__ method or switching to the standard Dataset base class.
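A toy illustration of the __len__ mismatch being described (the class and numbers are made up for illustration; this is not LitData code). When an IterableDataset advertises a length larger than it actually yields, the loader's reported batch count overshoots and iteration ends early:

```python
from torch.utils.data import DataLoader, IterableDataset

class OverpromisingDataset(IterableDataset):
    """An IterableDataset whose __len__ promises more items than it yields."""
    def __init__(self, real: int, claimed: int):
        self.real, self.claimed = real, claimed
    def __iter__(self):
        return iter(range(self.real))      # actually yields `real` items
    def __len__(self):
        return self.claimed                # consumers trusting this over-count

loader = DataLoader(OverpromisingDataset(real=6, claimed=9), batch_size=4)
assert len(loader) == 3                    # ceil(9 / 4), computed from __len__
assert sum(1 for _ in loader) == 2         # only 2 batches are actually produced
```

A training loop that trusts `len(loader)` waits for a third batch that never comes, which matches the skipped-validation symptom discussed above.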
Hey @enrico-stauss, trust me, we are going to figure this out. And I am one of the core devs of PyTorch Lightning, so we will find a way. But I think this is a LitData problem.
Would you be available to pair-debug this with me sometime next week?
Also, would you be interested in joining the core team of LitData?
Hi @tchaton
The reason I believe it's not a LitData problem is that the second issue I linked in the original post already reported the problem when using IterableDataset as the base class.
But with you being a core dev of PyTorch Lightning too, I'm confident we can figure it out.
I think we can schedule a meeting for next week; let's get in touch on Discord. Then we can also talk about what you proposed. :)
Related Issues (20)
- Explore about integrating homomorphic encryption
- Bug: Inconsistent Behavior with StreamingDataloader loading states (specific to StreamingDataset)
- Add support for multi sample item in optimize and yielding from the `__getitem__` of the StreamingDataset
- Expose max_pre_download in StreamingDataset
- Use different batch sizes in CombinedStreamingDataset
- Bug: Issues with Dataloader Batching Resulting in Uneven number of Batches and Streamed Items
- Bug: Inconsistent Behavior with StreamingDataloader loading states (specific to CombinedStreamingDataset)
- StreamingDataset intermittently fails due to lack of index.json
- Lazyload subsamples if subsample=1.0
- CombinedStreamingDataset causes NCCL timeout when using multiple nodes
- Bug: Loading compressed data fails silently (no error message, the application simply hangs up)
- Tests related to torchaudio fail
- Error Should Indicate Missing Folder Instead of Missing index.json File
- When using DDP, processes see truncated cached index.json when data is loaded from a mounted network filesystem
- A contributing.md for the project
- Failed to Resume Training w/ CombinedStreamingDataset
- Large number of chunks causes `OSError: [Errno 24] Too many open files`
- RuntimeError: All the chunks should have been deleted. Found ['chunk-0-0.bin']
- How can I shut down automatically distributing data when using StreamingDataset?
- The config isn't consistent between chunks