delta-incubator / deltatorch

License: Apache License 2.0

Python 99.55% Makefile 0.45%

deltatorch's Introduction

deltatorch

Concept

deltatorch allows users to directly use DeltaLake tables as a data source for training using PyTorch. Using deltatorch, users can create a PyTorch DataLoader to load the training data. We support distributed training using PyTorch DDP as well.

Why yet another data-loading framework?

Many Deep Learning projects struggle with efficient data loading, especially with tabular datasets or datasets containing many small images.
Classical Big Data formats like Parquet can help with this issue, but are hard to operate:
- Writers might block readers
- A failed write can make the whole dataset unreadable
- More complicated projects might ingest data all the time, even during training

Delta Lake storage format solves all these issues, but PyTorch has no direct support for DeltaLake datasets. deltatorch introduces such support and allows users to use DeltaLake for training Deep Learning models using PyTorch.

Usage

Requirements

Python Version > 3.8
pip or conda

Installation

with pip:

pip install git+https://github.com/delta-incubator/deltatorch

Create PyTorch DataLoader to read our DeltaLake table

To use deltatorch, we first need a DeltaLake table containing the data for training our PyTorch deep learning model. There is one requirement: this table must have an autoincrement ID field. deltatorch uses this field to shard the data and to parallelize loading. After that, we can use the create_pytorch_dataloader function to create a PyTorch DataLoader, which can be used directly during training. Below is an example of creating a DataLoader for the following table schema:

CREATE TABLE TRAINING_DATA 
(   
    image BINARY,   
    label BIGINT,   
    id INT
) 
USING delta LOCATION 'path'
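To illustrate why a dense autoincrement ID field helps with sharding, here is a minimal sketch that splits an ID range into contiguous, disjoint chunks, one per reader. The function name split_id_range and the chunking strategy are assumptions made for illustration only; deltatorch's actual sharding logic may differ.

```python
from typing import List, Tuple


def split_id_range(min_id: int, max_id: int, num_shards: int) -> List[Tuple[int, int]]:
    """Split the inclusive range [min_id, max_id] into contiguous shards.

    Hypothetical helper (not part of deltatorch): it shows how a dense ID
    column lets each reader claim a disjoint slice of the table.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    total = max_id - min_id + 1
    if total <= 0:
        return []
    # Spread the remainder over the first `extra` shards.
    base, extra = divmod(total, num_shards)
    shards = []
    start = min_id
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more shards than rows: skip empty shards
        shards.append((start, start + size - 1))
        start += size
    return shards
```

Each reader could then filter the table on its own (start, end) pair, which is only cheap and balanced when the IDs are unique and roughly gap-free.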

After the table is ready, we can use the create_pytorch_dataloader function to create a PyTorch DataLoader:

from deltatorch import create_pytorch_dataloader
from deltatorch import FieldSpec

def create_data_loader(path: str, batch_size: int, transform):
    # `transform` is the PyTorch transform applied to each decoded image
    # (for example, a torchvision transform pipeline).
    return create_pytorch_dataloader(
        # Path to the DeltaLake table
        path,
        # Autoincrement ID field
        id_field="id",
        # Fields which will be used during training
        fields=[
            FieldSpec("image",
                      # Load image using Pillow
                      load_image_using_pil=True,
                      # PyTorch Transform
                      transform=transform),
            FieldSpec("label"),
        ],
        # Number of readers
        num_workers=2,
        # Shuffle data inside the record batches
        shuffle=True,
        # Batch size
        batch_size=batch_size,
    )


deltatorch's Issues

Update README with "why Delta Lake for PyTorch analyses"

README should explain why Delta Lake is a good Lakehouse storage format for PyTorch analyses.

Should also include an explanation of when Delta Lake is a good fit.

Add Ruff with CI checks

Lint code with Ruff and add a CI check.

Fix CI/CD environment

Our CI/CD tests fail because the unit tests use the AG_NEWS dataset (part of PyTorch). By default, PyTorch now installs wheels with CUDA dependencies, which do not work in our CI/CD environment because it has no GPUs and therefore no CUDA drivers.
We should use the PyTorch CPU dependency during CI/CD tests.

Create ImageNet demo

Create a demo using ImageNet dataset.

Add black code formatting CI check

Add CI check to make sure the code uses black formatting.

Add helper function to generate id to partition data

It would be great to have a function that generates an ID for the dataset (with Spark rank(), and an alternative for users not using Spark?)
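For users not using Spark, such a helper could be as small as the following sketch. The name with_sequential_ids and its parameters are hypothetical and do not exist in deltatorch; the sketch only shows the idea of attaching a dense, unique ID to each record before it is written to the table.

```python
from typing import Any, Dict, Iterable, Iterator


def with_sequential_ids(
    rows: Iterable[Dict[str, Any]],
    id_field: str = "id",
    start: int = 0,
) -> Iterator[Dict[str, Any]]:
    """Yield copies of `rows` with a dense, unique integer ID attached.

    Hypothetical helper: IDs start at `start` and increase by one per row,
    so the resulting column has no gaps and no duplicates.
    """
    for offset, row in enumerate(rows):
        if id_field in row:
            raise ValueError(f"row already has a field named {id_field!r}")
        # Build a new dict so the caller's rows are left untouched.
        yield {**row, id_field: start + offset}
```

Note that a Spark-based variant would need a function that guarantees uniqueness: rank() assigns the same value to tied rows, so ties in the ordering column would produce duplicate IDs.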

Do we need a Spark dependency in DeltaIterableDataset?

From what I can see, Spark reader is only used to get a count in https://github.com/delta-incubator/deltatorch/blob/main/deltatorch/deltadataset.py

This can also be done with the more lightweight deltalake package (which is used elsewhere in the same class).

Can we remove the Spark dependency altogether?

Add helper to split stratify data with spark

Add built-in function to split the dataset in test/train (validate?) and keep proper distribution
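As an illustration of what "keep proper distribution" means, here is a minimal pure-Python sketch of a stratified split. The function name stratified_split and its parameters are assumptions; a Spark-based helper would look different, but the per-label logic is the same.

```python
import random
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def stratified_split(
    rows: Iterable[Row],
    label_field: str = "label",
    test_fraction: float = 0.2,
    seed: int = 42,
) -> Tuple[List[Row], List[Row]]:
    """Split rows into (train, test), sampling each label separately.

    Hypothetical helper: every label contributes roughly `test_fraction`
    of its rows to the test set, so class proportions are preserved.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_field]].append(row)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train: List[Row] = []
    test: List[Row] = []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```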

provide np conversion data type as input argument for decode_numpy_and_apply_shape

In the deltadataset.py implementation, decode_numpy_and_apply_shape converts a binary column of the delta table to numpy arrays with np.frombuffer(bytes, np.uint8).
The API would be more general if the target data type were configurable and could be passed as an input argument.
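A minimal sketch of the proposed behaviour, assuming a standalone function named decode_binary (hypothetical; the real logic lives inside deltadataset.py and may be structured differently). The dtype defaults to np.uint8 so existing callers would see no change.

```python
from typing import Tuple

import numpy as np


def decode_binary(data: bytes, shape: Tuple[int, ...], dtype=np.uint8) -> np.ndarray:
    """Decode a binary column value into an array of the given dtype and shape.

    Hypothetical helper: `dtype` is the configurable part requested above.
    """
    # np.frombuffer interprets the raw bytes without copying them.
    return np.frombuffer(data, dtype=dtype).reshape(shape)
```

For example, a column that stores float32 tensors could be read with decode_binary(value, (2, 3), dtype=np.float32), which is not possible when the dtype is hard-coded to np.uint8.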

add support for using Delta table name in create_pytorch_dataloader

For users who save the delta table in a metastore, it is more convenient to reference the data by table_name than through the path argument of create_pytorch_dataloader:

create_pytorch_dataloader(
        # Path to the DeltaLake table
        path,
        # Autoincrement ID field
        id_field="id",
        # Fields which will be used during training
        fields=[
            FieldSpec("image",
                      # Load image using Pillow
                      load_image_using_pil=True, 
                      # PyTorch Transform
                      transform=transform),
            FieldSpec("label"),
        ],
        # Number of readers 
        num_workers=2,
        # Shuffle data inside the record batches
        shuffle=True,
        # Batch size        
        batch_size=batch_size,
    )

missing requirements.txt in repo

The notebook examples/cv_caltech256_ptl_singlenode.py installs dependencies at the notebook scope with
%pip install -r ../requirements.txt
but the requirements.txt file does not exist in the repo.

Support table features / protocol versions

Document the table features / protocol version that are supported by this project for each release.

If a Delta table has a table feature enabled that deltatorch does not support (like deletion vectors), then deltatorch should error out.
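The "error out" behaviour could be sketched as follows. The function name check_reader_features, the exception class, and the way the feature lists are obtained are all assumptions; the supported set is passed in by the caller because the source does not say which features deltatorch supports.

```python
from typing import Iterable


class UnsupportedTableFeatureError(RuntimeError):
    """Raised when a table requires a feature the reader cannot handle."""


def check_reader_features(table_features: Iterable[str], supported: Iterable[str]) -> None:
    """Fail fast if the table enables any feature outside `supported`.

    Hypothetical helper: `table_features` would come from the table's
    protocol metadata, `supported` from the library release.
    """
    unsupported = sorted(set(table_features) - set(supported))
    if unsupported:
        raise UnsupportedTableFeatureError(
            "Table uses unsupported features: " + ", ".join(unsupported)
        )
```

Failing at load time is safer than reading silently: with deletion vectors, for instance, a reader that ignores the feature would return rows that have been logically deleted.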

Add an exception to be raised if a table is empty

Problem:
Currently, if a dataset is empty (the table contains no records), loading fails with an unrelated error about num_workers not being found. I would expect a clear error instead, such as: "Your table is empty. Please verify that your path is correct, or use a table that contains records."

Proposed Solution:
Do a count of records on a table and add an exception with a proper message.
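The proposed check could look roughly like this. The names EmptyTableError and ensure_table_not_empty are hypothetical, and the record count is passed in because how it is obtained is left open here.

```python
class EmptyTableError(ValueError):
    """Raised when the source table contains no records."""


def ensure_table_not_empty(record_count: int, path: str) -> None:
    """Raise a descriptive error before any workers are started.

    Hypothetical helper: `record_count` is the result of counting the
    records in the table located at `path`.
    """
    if record_count == 0:
        raise EmptyTableError(
            f"Delta table at '{path}' contains no records. "
            "Verify that the path is correct or use a table that contains records."
        )
```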

Can fix it.

Is there documentation to guide us on how to create the DataLoader?

Hi,

I would like to first express my sincere thanks to the author for providing us with such a great tool.

The example shown in the README is about loading an image dataset. However, my use case is just tabular data. How should I deal with the parameter 'load_image_using_pil' in FieldSpec? I would also like to know what FieldSpec does.

Therefore, I am wondering whether there is a user guide or documentation that explains how to use this great tool.

Thanks in advance.
Lucas

add API interface to support selection of delta table versions or timestamp

Besides the path of the delta table, users should also be able to load a delta table based on version_id or timestamp.

Add deltatorch examples to delta-examples project

The delta-examples project contains different examples of Delta Lake with other technologies.

See this issue for creating a PyTorch + Delta Lake example.

Update README to include latest installation instructions

The README should include pip install deltatorch

make create_pytorch_dataloader accept additional arguments of torch.utils.data.DataLoader

In deltatorch/pytorch.py, add **kwargs to create_pytorch_dataloader:

def create_pytorch_dataloader(
    path: str,
    id_field: str,
    fields: List[FieldSpec],
    batch_size: Optional[int] = None,
    collate_fn: Optional[Callable[[List], Any]] = None,
    use_fixed_rank: bool = False,
    rank: Optional[int] = None,
    num_ranks: Optional[int] = None,
    num_workers: int = 2,
    shuffle: bool = False,
    timeout: int = 15,
    queue_size: int = 25000,
    **kwargs,
):
    dataset = IDBasedDeltaDataset(
        path,
        id_field,
        fields,
        use_fixed_rank,
        rank,
        num_ranks,
        num_workers,
        shuffle,
        timeout,
        queue_size,
    )

    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=0,
        collate_fn=collate_fn,
        # Forward any remaining DataLoader options to torch.utils.data.DataLoader
        **kwargs,
    )