I've been looking at how we might go about supporting torchdata within TorchX and with


datapipe serialization support / cloudpickle / parallel support about data HOT 7 OPEN

pytorch commented on August 18, 2024 1

datapipe serialization support / cloudpickle / parallel support

from data.

Comments (7)

ejguan commented on August 18, 2024 1

general data transforms

I am not a hundred percent sure about this. I would say we guarantee that the DataPipe graph (pipeline) is serializable together with user-provided functions. Our current approach is to pickle lambda functions using dill.
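To illustrate why lambdas need special handling, here is a minimal sketch that uses only the standard library. It does not call dill (the tool named above). It only shows that plain `pickle` rejects a lambda, and that a `functools.partial` over an importable function is one alternative that plain `pickle` accepts. The helper name `is_picklable` is made up for this example.

```python
import functools
import operator
import pickle


def is_picklable(fn):
    """Return True if the standard pickle module can serialize fn."""
    try:
        pickle.dumps(fn)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A lambda has no importable name, so pickle cannot store a reference to it.
double_lambda = lambda x: x * 2

# partial(operator.mul, 2) only references an importable function plus data.
double_partial = functools.partial(operator.mul, 2)

print(is_picklable(double_lambda))   # False
print(is_picklable(double_partial))  # True

# The partial survives a round trip and still works.
restored = pickle.loads(pickle.dumps(double_partial))
print(restored(21))                  # 42
```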

data splitting into train/validation sets

We provide a utility DataPipe that lets users split data into two separate pipelines. This may not be directly related, but for your information: we will also provide dynamic sharding, which means users don't need to hardcode sharding settings in their DataSet.
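As a rough illustration of what "not hardcoding the sharding setting" could look like, here is a pure-Python sketch in which the loader, not the dataset, decides the shard at iteration time. This is an assumption about the general idea, not the torchdata implementation; `shard` and `load` are hypothetical names.

```python
import itertools


def shard(source, worker_id, num_workers):
    """Yield every num_workers-th item of source, starting at worker_id."""
    return itertools.islice(source, worker_id, None, num_workers)


def load(make_source, num_workers):
    """Hypothetical loader: assigns one shard per worker at iteration time."""
    return [list(shard(make_source(), w, num_workers)) for w in range(num_workers)]


# The data source itself knows nothing about workers or shards.
print(load(lambda: range(10), num_workers=3))
# [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```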

summary statistic computation

We currently have a way to retrieve the graph of a data pipeline, but better visualization is not done yet: https://github.com/pytorch/pytorch/blob/3202028ed1ca24c91dc7192ef69b305690db7abc/torch/utils/data/graph.py#L54

Are DataPipes guaranteed to be pickle safe and is there anything that needs to be done to support that?

The DataPipes we provide are guaranteed to be serializable. We can't guarantee the same for users' own DataPipe implementations, but users who choose to use DataLoader2 with their DataPipes will be notified whether or not their DataPipe is serializable.

I was also wondering if there's multiprocessing based datapipes and how that works since this seems comparable

We will provide multiprocessing. The functionality is in place, but we are still working with internal teams to align the API of DataLoaderV2.

should this be on the pytorch discussion forums instead?

I don't think this is the right time, as we have not officially released yet. Also, the RFC is tracked in PyTorch Core, not in this repo.


ejguan commented on August 18, 2024

cc: @VitalyFedyunin to see if you want to supply other comments.


kiukchung commented on August 18, 2024

@ejguan regarding builtin datapipes being pickle-safe... is this the way you'd recommend folks implement checkpointing for datapipes?


ejguan commented on August 18, 2024

regarding builtin datapipes being pickle-safe

IIRC, it's a requirement for both multiprocessing and checkpointing. @NivekT, since you are working on checkpointing, feel free to chime in.


NivekT commented on August 18, 2024

Yes, though you can write custom __getstate__ and __setstate__ methods to accomplish that.
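For readers unfamiliar with these hooks, here is a minimal sketch of the pattern using plain Python rather than a real DataPipe. `CheckpointableCounter` is a hypothetical class invented for this example: it holds a lock (which cannot be pickled) and a position counter, and the two methods decide what goes into the checkpoint and how to rebuild the rest.

```python
import pickle
import threading


class CheckpointableCounter:
    """Hypothetical DataPipe-like iterator that yields 0..n-1."""

    def __init__(self, n):
        self.n = n
        self.pos = 0                   # progress we want in a checkpoint
        self._lock = threading.Lock()  # locks cannot be pickled

    def __iter__(self):
        return self

    def __next__(self):
        with self._lock:
            if self.pos >= self.n:
                raise StopIteration
            item = self.pos
            self.pos += 1
            return item

    def __getstate__(self):
        # Copy the instance dict and drop the unpicklable lock.
        state = self.__dict__.copy()
        del state["_lock"]
        return state

    def __setstate__(self, state):
        # Restore the saved fields and recreate the lock.
        self.__dict__.update(state)
        self._lock = threading.Lock()


pipe = CheckpointableCounter(5)
consumed = [next(pipe), next(pipe)]          # [0, 1]
restored = pickle.loads(pickle.dumps(pipe))  # checkpoint, then restore
remaining = list(restored)                   # [2, 3, 4]
print(consumed, remaining)
```

Without the two methods, `pickle.dumps(pipe)` would fail on the lock; with them, the restored object resumes where the original left off.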


kiukchung commented on August 18, 2024

IIUC, when num_workers > 1 the DataPipes are iterated in the dataloader worker (child process). Therefore, the "state" of the datapipe will reside in the child process, not in the parent (where the trainer loop runs). How exactly does one get the pickled state of the datapipe from the child process back to the parent for checkpointing?


NivekT commented on August 18, 2024

Good question! The plan is to use PrototypeMultiprocessingReadingService to pass request/response messages, where the response will be the pickled state of the DataPipe.
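The request/response idea can be sketched with the standard `multiprocessing` module. This is not the PrototypeMultiprocessingReadingService API. The message names (`"next"`, `"get_state"`, `"stop"`), the `CountingPipe` class, and `checkpoint_after` are all hypothetical, and the sketch assumes a POSIX system where the `fork` start method is available.

```python
import multiprocessing as mp


class CountingPipe:
    """Hypothetical stand-in for a DataPipe: yields 0..n-1, tracks position."""

    def __init__(self, n):
        self.n = n
        self.pos = 0

    def __iter__(self):
        while self.pos < self.n:
            item = self.pos
            self.pos += 1
            yield item

    def __getstate__(self):
        return {"n": self.n, "pos": self.pos}

    def __setstate__(self, state):
        self.__dict__.update(state)


def worker(conn, pipe):
    """Child process loop: answer requests from the parent until told to stop."""
    it = iter(pipe)
    while True:
        request = conn.recv()
        if request == "next":
            conn.send(("item", next(it, None)))
        elif request == "get_state":
            # Connection.send pickles its argument, so the parent
            # receives the state as pickled data over the pipe.
            conn.send(("state", pipe.__getstate__()))
        elif request == "stop":
            break


def checkpoint_after(n_items, total=10):
    """Consume n_items through the worker, then fetch the worker-side state."""
    ctx = mp.get_context("fork")  # assumption: POSIX only
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=worker, args=(child_conn, CountingPipe(total)))
    proc.start()
    items = []
    for _ in range(n_items):
        parent_conn.send("next")
        items.append(parent_conn.recv()[1])
    parent_conn.send("get_state")
    _, state = parent_conn.recv()
    parent_conn.send("stop")
    proc.join()
    return items, state


if __name__ == "__main__":
    print(checkpoint_after(3))  # ([0, 1, 2], {'n': 10, 'pos': 3})
```

The parent never touches the child's pipe object directly. It only sees the state that the child chose to send back, which it could then write into a checkpoint.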

