deltatorch's Introduction

deltatorch

Concept

deltatorch lets users use Delta Lake tables directly as a data source for training with PyTorch. With deltatorch, users can create a PyTorch DataLoader that loads the training data. Distributed training with PyTorch DDP is supported as well.

Why yet another data-loading framework?

  • Many Deep Learning projects struggle with efficient data loading, especially with tabular datasets or datasets containing many small images
  • Classical Big Data formats like Parquet can help with this issue, but are hard to operate:
    • Writers might block readers
    • A failed write can make the whole dataset unreadable
    • More complicated projects might ingest data continuously, even during training

The Delta Lake storage format solves all these issues, but PyTorch has no direct support for Delta Lake datasets. deltatorch adds this support and allows users to train Deep Learning models with PyTorch on Delta Lake data.

Usage

Requirements

  • Python Version > 3.8
  • pip or conda

Installation

  • with pip:
pip install git+https://github.com/delta-incubator/deltatorch

Create a PyTorch DataLoader to read our Delta Lake table

To use deltatorch, we first need a Delta Lake table containing the data we would like to use for training our PyTorch deep learning model. There is one requirement: this table must have an autoincrement ID field. deltatorch uses this field for sharding and for parallelizing loading. We can then use the create_pytorch_dataloader function to create a PyTorch DataLoader, which can be used directly during training. The example below creates a DataLoader for the following table schema:

CREATE TABLE TRAINING_DATA 
(   
    image BINARY,   
    label BIGINT,   
    id INT
) 
USING delta LOCATION 'path' 

Once the table is ready, we can use the create_pytorch_dataloader function to create a PyTorch DataLoader:

from deltatorch import create_pytorch_dataloader
from deltatorch import FieldSpec


def create_data_loader(path: str, batch_size: int, transform):
    # `transform` is a callable (for example, a torchvision transform)
    # that is applied to each loaded image.
    return create_pytorch_dataloader(
        # Path to the DeltaLake table
        path,
        # Autoincrement ID field
        id_field="id",
        # Fields which will be used during training
        fields=[
            FieldSpec("image",
                      # Load image using Pillow
                      load_image_using_pil=True,
                      # PyTorch Transform
                      transform=transform),
            FieldSpec("label"),
        ],
        # Number of readers
        num_workers=2,
        # Shuffle data inside the record batches
        shuffle=True,
        # Batch size
        batch_size=batch_size,
    )
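
The text above says the autoincrement ID field is used for sharding and parallel loading. As a rough illustration only, a contiguous ID range could be divided among readers as sketched below. This is not deltatorch's actual implementation, and split_id_range is a hypothetical helper introduced just for this example.

```python
from typing import List, Tuple


def split_id_range(min_id: int, max_id: int, num_shards: int) -> List[Tuple[int, int]]:
    """Split the inclusive ID range [min_id, max_id] into contiguous shards.

    Hypothetical helper for illustration; deltatorch may shard differently.
    A shard whose end is smaller than its start is empty.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    if max_id < min_id:
        raise ValueError("max_id must not be smaller than min_id")

    total = max_id - min_id + 1
    base, remainder = divmod(total, num_shards)

    shards = []
    start = min_id
    for shard_index in range(num_shards):
        # The first `remainder` shards take one extra ID,
        # so shard sizes differ by at most one.
        size = base + (1 if shard_index < remainder else 0)
        shards.append((start, start + size - 1))
        start += size
    return shards


# Example: IDs 0..9 divided among 4 readers (e.g. 2 ranks x 2 workers).
shards = split_id_range(0, 9, 4)
print(shards)
```

Each reader would then load only the rows whose ID falls inside its own range, which is why a dense, monotonically increasing ID column makes the split cheap to compute.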

deltatorch's People

Contributors

mshtelma, s-udhaya, souvik-databricks, mrpowers

deltatorch's Issues

Fix CI/CD environment

Our CI/CD tests fail because the unit tests use the AG_NEWS dataset (part of PyTorch). By default, PyTorch now installs wheels with CUDA dependencies, which do not work in our CI/CD environment because it has no GPUs and, therefore, no CUDA drivers.
We should use the CPU-only PyTorch dependency during CI/CD tests.
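
One way to do this, sketched under assumptions: PyTorch publishes CPU-only wheels on a separate package index, and pip can be pointed at it with --index-url. The helper name cpu_torch_install_command is hypothetical, and the project's real CI setup may pin versions or install additional packages.

```python
import sys
from typing import List

# Package index that serves CPU-only PyTorch wheels.
CPU_INDEX_URL = "https://download.pytorch.org/whl/cpu"


def cpu_torch_install_command(python: str = sys.executable) -> List[str]:
    """Build a pip command that installs CPU-only PyTorch wheels.

    Hypothetical helper for illustration; it only builds the command
    and does not run it.
    """
    return [python, "-m", "pip", "install", "torch", "--index-url", CPU_INDEX_URL]


command = cpu_torch_install_command("python")
print(" ".join(command))
```

A CI step could run the resulting command (for example with subprocess.run(command, check=True)) before installing the remaining test dependencies.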

Add support for using a Delta table name in create_pytorch_dataloader

For users who save the Delta table in a metastore, it is more convenient to reference the data by table_name than through the path argument of create_pytorch_dataloader:

create_pytorch_dataloader(
    # Path to the DeltaLake table
    path,
    # Autoincrement ID field
    id_field="id",
    # Fields which will be used during training
    fields=[
        FieldSpec("image",
                  # Load image using Pillow
                  load_image_using_pil=True,
                  # PyTorch Transform
                  transform=transform),
        FieldSpec("label"),
    ],
    # Number of readers
    num_workers=2,
    # Shuffle data inside the record batches
    shuffle=True,
    # Batch size
    batch_size=batch_size,
)
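
Until such an option exists, one possible workaround is to look up the table's storage location and pass it as path. The sketch below assumes a Spark session is available and uses the Delta Lake DESCRIBE DETAIL command, whose result includes a location column. resolve_table_path is a hypothetical helper, not part of deltatorch, and the stand-in session classes exist only so the sketch runs without Spark.

```python
from typing import Any, Dict, List


def resolve_table_path(spark: Any, table_name: str) -> str:
    """Return the storage location of a Delta table registered in a metastore.

    Hypothetical helper: `spark` is expected to behave like a SparkSession.
    """
    rows = spark.sql(f"DESCRIBE DETAIL {table_name}").collect()
    if not rows:
        raise ValueError(f"Table {table_name!r} was not found")
    return rows[0]["location"]


class _FakeResult:
    """Stand-in for a Spark DataFrame that only supports collect()."""

    def __init__(self, rows: List[Dict[str, str]]):
        self._rows = rows

    def collect(self) -> List[Dict[str, str]]:
        return self._rows


class _FakeSparkSession:
    """Stand-in for a SparkSession so the sketch runs without Spark."""

    def __init__(self, locations: Dict[str, str]):
        self._locations = locations

    def sql(self, query: str) -> _FakeResult:
        # The table name is the last token of "DESCRIBE DETAIL <name>".
        table_name = query.rsplit(" ", 1)[-1]
        if table_name in self._locations:
            return _FakeResult([{"location": self._locations[table_name]}])
        return _FakeResult([])


spark = _FakeSparkSession({"training_data": "/mnt/delta/training_data"})
path = resolve_table_path(spark, "training_data")
print(path)
```

With a real session, the returned location could be passed as the first argument of create_pytorch_dataloader. The returned URI scheme may need adjusting to whatever the reader expects.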

Missing requirements.txt in repo

The notebook examples/cv_caltech256_ptl_singlenode.py installs dependencies at notebook scope with
%pip install -r ../requirements.txt
but the requirements.txt file does not exist in the repo.

Support table features / protocol versions

Document the table features / protocol versions that are supported by this project for each release.

If a Delta table has a table feature enabled that deltatorch does not support (like deletion vectors), then deltatorch should error out.
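
A minimal sketch of the requested behavior, assuming the list of features enabled on the table has already been read from the table's protocol metadata. check_table_features and UnsupportedTableFeatureError are hypothetical names, and the supported set shown is a placeholder rather than a statement of what deltatorch supports.

```python
from typing import Iterable


class UnsupportedTableFeatureError(RuntimeError):
    """Raised when a table enables a feature the reader cannot handle."""


def check_table_features(enabled: Iterable[str], supported: Iterable[str]) -> None:
    """Fail fast if any enabled table feature is not in the supported set.

    Hypothetical helper for illustration only.
    """
    unsupported = sorted(set(enabled) - set(supported))
    if unsupported:
        raise UnsupportedTableFeatureError(
            "Delta table uses unsupported table features: " + ", ".join(unsupported)
        )


# Placeholder set; the real list would be documented per release.
supported_features = {"columnMapping"}

# Passes silently: every enabled feature is supported.
check_table_features(["columnMapping"], supported_features)

try:
    check_table_features(["columnMapping", "deletionVectors"], supported_features)
except UnsupportedTableFeatureError as error:
    message = str(error)
    print(message)
```

Raising before any data is read avoids silently returning wrong rows, for example rows that a deletion vector marks as deleted.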

Add an exception to be raised if a table is empty

Problem:
Currently, if a dataset is empty (the table contains no records), loading fails with an error about num_workers not being found. I would instead expect an error such as: "Your table is empty. Please verify that your path is correct, or use a table that contains records."

Proposed Solution:
Do a count of records on a table and add an exception with a proper message.

Can fix it.
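
The proposed solution could look roughly like the sketch below. It assumes the record count has already been obtained by whatever means the loader uses. EmptyTableError and ensure_table_not_empty are hypothetical names.

```python
class EmptyTableError(ValueError):
    """Raised when the source table contains no records."""


def ensure_table_not_empty(record_count: int, path: str) -> None:
    """Raise a descriptive error if the table at `path` has no records.

    Hypothetical helper for illustration only.
    """
    if record_count == 0:
        raise EmptyTableError(
            f"The Delta table at {path!r} is empty. Please verify that the path "
            "is correct, or use a table that contains records."
        )


# A non-empty table passes the check.
ensure_table_not_empty(128, "/mnt/delta/training_data")

try:
    ensure_table_not_empty(0, "/mnt/delta/training_data")
except EmptyTableError as error:
    message = str(error)
    print(message)
```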

Is there documentation on how to create the DataLoader?

Hi,

I would like to first express my sincere thanks to the author for providing us with such a great tool.

The example shown in the README is about loading an image dataset. However, my use case is just tabular data. How should I deal with the parameter 'load_image_using_pil' in FieldSpec? I would also like to know what FieldSpec does.

Therefore, I am wondering whether there is a user guide or documentation that explains how to use this great tool.

Thanks in advance.
Lucas

Make create_pytorch_dataloader accept additional arguments of torch.utils.data.DataLoader

In deltatorch/pytorch.py, add **kwargs to create_pytorch_dataloader:

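from typing import Any, Callable, List, Optional

from torch.utils.data import DataLoader


# FieldSpec and IDBasedDeltaDataset are defined elsewhere in the deltatorch package.
def create_pytorch_dataloader(
    path: str,
    id_field: str,
    fields: List[FieldSpec],
    batch_size: Optional[int] = None,
    collate_fn: Optional[Callable[[List], Any]] = None,
    use_fixed_rank: bool = False,
    rank: Optional[int] = None,
    num_ranks: Optional[int] = None,
    num_workers: int = 2,
    shuffle: bool = False,
    timeout: int = 15,
    queue_size: int = 25000,
    **kwargs,
):
    dataset = IDBasedDeltaDataset(
        path,
        id_field,
        fields,
        use_fixed_rank,
        rank,
        num_ranks,
        num_workers,
        shuffle,
        timeout,
        queue_size,
    )

    # Extra keyword arguments are forwarded unchanged to torch.utils.data.DataLoader.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=0,
        collate_fn=collate_fn,
        **kwargs,
    )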

Allow renaming field name in FieldSpec

For the FieldSpec class, using a target_name value different from name creates a new column with column name target_name but keeps the original column as well. A rename option would avoid duplicating the field in the created dataloader.
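
To make the difference concrete, here is a small sketch on a plain dictionary standing in for one loaded record. apply_target_name and its rename flag are hypothetical and only illustrate the current behavior (copy) next to the requested one (rename). They do not reflect how FieldSpec is implemented.

```python
from typing import Any, Dict, Optional


def apply_target_name(
    row: Dict[str, Any],
    name: str,
    target_name: Optional[str] = None,
    rename: bool = False,
) -> Dict[str, Any]:
    """Expose row[name] under target_name, optionally dropping the original key.

    Hypothetical helper for illustration only.
    """
    result = dict(row)  # Work on a copy so the input record is not mutated.
    if target_name is None or target_name == name:
        return result
    result[target_name] = result[name]
    if rename:
        # Requested behavior: keep only the new column name.
        del result[name]
    return result


row = {"image": b"\x00\x01", "label": 3}

copied = apply_target_name(row, "label", target_name="y")
renamed = apply_target_name(row, "label", target_name="y", rename=True)
print(sorted(copied))
print(sorted(renamed))
```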
