Comments (7)
general data transforms
I am not one hundred percent sure about this. I would say we guarantee that the DataPipe graph (pipeline) is serializable together with user-provided functions. Our current approach is to pickle lambda functions using dill.
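To illustrate why lambdas need special handling, here is a minimal sketch using only the standard `pickle` module. The helper `is_picklable` and both example callables are hypothetical names introduced for illustration; this is not how torchdata performs its check, and dill itself is not used here.

```python
import functools
import operator
import pickle


def is_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# A lambda has no importable name, so standard pickle rejects it.
double_lambda = lambda x: x * 2

# A partial over a builtin is pickled by reference and round-trips fine.
double_partial = functools.partial(operator.mul, 2)

restored = pickle.loads(pickle.dumps(double_partial))
```

A library such as dill serializes the function body itself, which is what lets lambdas survive the round trip there.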
data splitting into train/validation sets
We provide a utility DataPipe that lets users split data into two separate pipelines. This may not be related, but I want to let you know: we will provide dynamic sharding, which means users don't need to hardcode sharding settings in their Dataset.
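As a rough, pure-Python picture of both ideas (splitting one stream into two pipelines, and choosing a shard at run time instead of hardcoding it), consider the sketch below. The functions `split_train_valid` and `shard` are hypothetical and are not the torchdata utilities mentioned above.

```python
import itertools


def split_train_valid(source, valid_every=5):
    """Split one iterable into (train, valid) lazy streams by item index."""
    # tee duplicates the stream so each branch can filter independently.
    indexed_a, indexed_b = itertools.tee(enumerate(source))
    train = (item for i, item in indexed_a if i % valid_every != 0)
    valid = (item for i, item in indexed_b if i % valid_every == 0)
    return train, valid


def shard(source, num_shards, shard_index):
    """Yield every num_shards-th item, starting at shard_index."""
    return itertools.islice(source, shard_index, None, num_shards)
```

Note that `itertools.tee` buffers items when one branch is consumed far ahead of the other, so this sketch suits small or evenly consumed streams.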
summary statistic computation
We currently have a way to retrieve the graph of a data pipeline, but better visualization is not done yet. https://github.com/pytorch/pytorch/blob/3202028ed1ca24c91dc7192ef69b305690db7abc/torch/utils/data/graph.py#L54
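The general idea of recovering a pipeline graph can be sketched by scanning each stage's attributes for upstream stages. Everything below (`Pipe`, `Source`, `Mapper`, `Zipper`, and this `traverse`) is a hypothetical stand-in written for illustration; it is not the code in the linked `graph.py` and makes no claim about its return format.

```python
class Pipe:
    """Hypothetical base class standing in for a DataPipe."""


class Source(Pipe):
    def __init__(self, items):
        self.items = list(items)


class Mapper(Pipe):
    def __init__(self, source, fn):
        self.source = source
        self.fn = fn


class Zipper(Pipe):
    def __init__(self, left, right):
        self.left = left
        self.right = right


def traverse(pipe):
    """Return (class name, [upstream graphs]) by scanning instance attributes."""
    upstream = [
        traverse(value)
        for value in vars(pipe).values()
        if isinstance(value, Pipe)
    ]
    return (type(pipe).__name__, upstream)


graph = traverse(Mapper(Zipper(Source([1, 2]), Source([3, 4])), str))
```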
Are DataPipes guaranteed to be pickle safe and is there anything that needs to be done to support that?
The DataPipes we provide are guaranteed to be serializable, but we can't guarantee that for users' own DataPipe implementations. However, if users choose to use DataLoader2 with their DataPipes, they will be notified whether their DataPipe is serializable.
I was also wondering if there's multiprocessing based datapipes and how that works since this seems comparable
We will provide multiprocessing. The functionality is in place, but we are still working with internal teams to align the API of DataLoaderV2.
should this be on the pytorch discussion forums instead?
I don't think this is the right time, as we have not officially released yet. Also, the RFC is tracked in PyTorch Core, not in this repo.
from data.
cc: @VitalyFedyunin to see if you want to supply other comments.
@ejguan regarding builtin datapipes being pickle-safe... is this the way you'd recommend folks implement checkpointing for datapipes?
regarding builtin datapipes being pickle-safe
IIRC, it's a requirement for both multiprocessing and checkpointing. @NivekT is working on checkpointing, so feel free to chime in.
Yes, though you can write custom __getstate__ and __setstate__ methods to accomplish that.
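A minimal sketch of such hooks, assuming a stateful iterator that holds one attribute pickle cannot serialize (a lock). The class `CheckpointableCounter` is hypothetical and is not a torchdata DataPipe; it only shows the general pattern of dropping transient fields on save and rebuilding them on load.

```python
import pickle
import threading


class CheckpointableCounter:
    """Hypothetical stateful iterator, used only to illustrate the hooks."""

    def __init__(self, limit):
        self.limit = limit
        self.position = 0
        self._lock = threading.Lock()  # locks cannot be pickled

    def __iter__(self):
        return self

    def __next__(self):
        with self._lock:
            if self.position >= self.limit:
                raise StopIteration
            value = self.position
            self.position += 1
            return value

    def __getstate__(self):
        # Copy the attributes and drop the transient, unpicklable one.
        state = self.__dict__.copy()
        del state["_lock"]
        return state

    def __setstate__(self, state):
        # Restore the saved attributes and recreate the lock.
        self.__dict__.update(state)
        self._lock = threading.Lock()
```

With these hooks in place, `pickle.dumps` and `pickle.loads` call `__getstate__` and `__setstate__` automatically, so a restored instance resumes from the saved `position`.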
IIUC, when num_workers > 1 the DataPipes are iterated in the DataLoader worker (child process). Therefore, the "state" of the DataPipe lives in the child process, not in the main parent process (where the training loop runs). How exactly does one get the pickled state of the DataPipe from the child process back to the parent for checkpointing?
Good question! The plan is to use PrototypeMultiprocessingReadingService to pass request/response messages, where the response will be the pickled state of the DataPipe.
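The request/response idea can be sketched roughly as follows. This is not the PrototypeMultiprocessingReadingService protocol: the message names (`"next"`, `"get_state"`, `"stop"`), `worker_loop`, and `run_demo` are all hypothetical, and a thread with `queue.Queue` stands in for the child process so the sketch stays small and portable. `multiprocessing.Queue` offers the same `put`/`get` interface for a real child process.

```python
import pickle
import queue
import threading


def worker_loop(requests, responses, limit):
    """Serve items and, on request, reply with the pickled iteration state."""
    position = 0
    while True:
        request = requests.get()
        if request == "next":
            if position < limit:
                responses.put(("item", position))
                position += 1
            else:
                responses.put(("done", None))
        elif request == "get_state":
            # The response payload is the pickled state, as bytes.
            responses.put(("state", pickle.dumps({"position": position})))
        elif request == "stop":
            responses.put(("stopped", None))
            return


def run_demo(limit=5, items_to_read=3):
    """Parent side: read a few items, then ask the worker for its state."""
    requests, responses = queue.Queue(), queue.Queue()
    worker = threading.Thread(
        target=worker_loop, args=(requests, responses, limit), daemon=True
    )
    worker.start()

    items = []
    for _ in range(items_to_read):
        requests.put("next")
        _kind, payload = responses.get(timeout=5)
        items.append(payload)

    requests.put("get_state")
    _kind, blob = responses.get(timeout=5)
    state = pickle.loads(blob)  # the parent can now store this in a checkpoint

    requests.put("stop")
    responses.get(timeout=5)
    worker.join(timeout=5)
    return items, state
```

The point of the design is that the parent never touches the worker's objects directly; it only receives bytes it can write into a checkpoint.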