Comments (3)
Hey cgebbe,
Do you know why it crashed?
from litdata.
It was an out-of-memory error, but I don't have the logs anymore.
I found it a bit strange that it only happened after several hours. I didn't have any other tasks running.
from litdata.
Hey @cgebbe.
We can support this. The writer keeps track of the chunk info here: https://github.com/Lightning-AI/litdata/blob/main/src/litdata/streaming/writer.py#L253 and we already have some logic to merge the index JSON files:
litdata/src/litdata/streaming/writer.py
Line 395 in 26bf6b2
In practice, you could even process your dataset chunk by chunk and just combine the results at the end.
Would you be interested in trying to contribute this feature?
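To make the "combine them at the end" idea concrete, here is a minimal sketch of merging several per-part index files into one. It is not the logic in `writer.py`: the function name `merge_index_files` is hypothetical, and it assumes each index file is a JSON object with a `chunks` list and a shared `config` object. It also assumes chunk filenames are already unique across parts and that the chunk files themselves are moved into the output directory separately.

```python
import json
from pathlib import Path


def merge_index_files(index_paths, output_path):
    """Hypothetical helper: concatenate the chunk lists of several index files.

    Assumes every file looks like {"chunks": [...], "config": {...}} and that
    all parts were written with the same config.
    """
    merged_chunks = []
    config = None
    for path in index_paths:
        data = json.loads(Path(path).read_text())
        if config is None:
            config = data["config"]
        elif data["config"] != config:
            # Parts written with different settings cannot be combined safely.
            raise ValueError(f"Config mismatch in {path}")
        merged_chunks.extend(data["chunks"])

    merged = {"chunks": merged_chunks, "config": config}
    Path(output_path).write_text(json.dumps(merged, sort_keys=True))
    return merged
```

With something like this, a run that crashes after several hours would only lose the part it was working on, since the parts that already finished keep their own index files.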
from litdata.