
fondant's People

Contributors

alexanderremmerie, andres-vv, anush008, carolineadam, christiaensbert, dependabot[bot], georgeslorre, hakimovich99, j-branigan, janvanlooy, khaerensml6, mrchtr, nielsrogge, nsff, philippemoussalli, philmod, robbesneyders, satishjasthi, shayorshay, shub-kris, tillwenke

fondant's Issues

Split dataframe to write different subsets

Currently, the user returns a single Fondant dataframe when loading or transforming data. However, there is a need to split this dataframe into multiple subsets based on component specifications and save each subset into a separate location.

This task involves splitting the dataframe efficiently and accurately while ensuring each subset is written to its designated location.
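
A minimal sketch of what the splitting could look like with Dask, assuming the component spec exposes a mapping from subset names to their columns (the names output_subsets, base_path, and write_subsets below are illustrative, not the actual Fondant API):

import dask.dataframe as dd


def write_subsets(dataframe: dd.DataFrame, output_subsets: dict, base_path: str) -> None:
    # Split the single Fondant dataframe into one dataframe per subset and write
    # each subset to its own location under the base path.
    for subset_name, columns in output_subsets.items():
        # Select only the columns belonging to this subset; the index is kept so
        # the subsets can later be joined back together.
        subset_df = dataframe[columns]
        subset_df.to_parquet(f"{base_path}/{subset_name}")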

Implement example pipelines & components

For the alpha release, we want to have example pipelines for the following use cases:

Example pipelines

  1. Components (8 of 8 tasks): NielsRogge
  2. Components (8 of 8 tasks): PhilippeMoussalli

For this we need the following example components:

Example components

  1. Components: NielsRogge
  2. Components: NielsRogge
  3. Components: NielsRogge
  4. Components: PhilippeMoussalli
  5. Components: NielsRogge
  6. Components: NielsRogge
  7. Components: PhilippeMoussalli
  8. Components: PhilippeMoussalli
  9. Components: ChristiaensBert
  10. Components: PhilippeMoussalli

Handle different kubeflow data types internally

Data types such as tuples, booleans, and lists are not always represented correctly after being passed to Kubeflow. We need to handle them internally after the user passes them and parse them as intended.
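
A possible way to handle this, sketched under the assumption that complex arguments are serialized to strings before being passed to Kubeflow and parsed back inside the component (the helper names are illustrative):

import json
import typing as t


def serialize_argument(value: t.Any) -> str:
    # Complex types (lists, tuples, dicts, booleans) are serialized to a JSON string
    # before being passed to Kubeflow, which only handles simple parameter types reliably.
    return json.dumps(value)


def parse_argument(raw: str, expected_type: type) -> t.Any:
    # Inside the component, the string is parsed back into the intended Python type.
    value = json.loads(raw)
    if expected_type is tuple and isinstance(value, list):
        value = tuple(value)  # JSON has no tuple type, so lists are converted back
    return value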

Stable Diffusion fine-tuning from a seed image dataset pipeline

Create an example pipeline to create a dataset for fine-tuning Stable Diffusion based on a seed dataset from the Hugging Face hub.

The components needed are:

Components

  1. Components: PhilippeMoussalli
  2. Components: PhilippeMoussalli
  3. Components: ChristiaensBert
  4. Components: NielsRogge
  5. Components: PhilippeMoussalli
  6. Components: NielsRogge
  7. Components: PhilippeMoussalli
  8. Components documentation: PhilippeMoussalli

The final dataset should ideally be registered to the Hugging Face hub.

Use manifest evolution in `Dataset`

#41 provides the functionality to generate the output manifest based on the input manifest and the component spec. We should use this in the Component or Dataset class to validate the output of the component.
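
A rough sketch of how this validation could look, assuming the functionality from #41 is exposed as a Manifest.evolve method and that manifests expose their subsets and fields (these attribute names are assumptions, not the confirmed API):

def validate_output(dataframe, input_manifest, component_spec) -> None:
    # Evolve the input manifest with the component spec to obtain the expected output manifest.
    expected_manifest = input_manifest.evolve(component_spec)

    # Collect the columns the evolved manifest expects, as "{subset}_{field}" column names.
    expected_columns = {
        f"{subset_name}_{field_name}"
        for subset_name, subset in expected_manifest.subsets.items()
        for field_name in subset.fields
    }

    # The dataframe produced by the component should contain at least these columns.
    missing = expected_columns - set(dataframe.columns)
    if missing:
        raise ValueError(f"Component output is missing expected columns: {missing}")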

Clean up args.metadata

Currently, each component takes a --metadata argument, which includes the run_id, component_id and base_path.

However, the run_id and base_path stay the same for all components in a pipeline, hence it might make sense to only have a loading component accept a --metadata argument, which is then added to the manifest and kept for the rest of the pipeline. The component_id can be inferred from the component spec so there's no need to pass that anymore (see #56).

Alternatively, perhaps the pipeline wrapper class could create the initial Manifest, add the metadata to it, and pass that to the loading component. This way, no component would require a --metadata argument.
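
A sketch of that alternative, assuming a Manifest.create classmethod that accepts these metadata fields (the exact signature and import path are assumptions):

from fondant.manifest import Manifest  # import path assumed


class FondantPipeline:
    def __init__(self, base_path: str, run_id: str):
        self.base_path = base_path
        self.run_id = run_id

    def create_initial_manifest(self, loading_component_spec) -> Manifest:
        # The pipeline wrapper creates the initial manifest and attaches the metadata,
        # so no component needs a --metadata argument anymore. The component_id is
        # inferred from the component spec (see #56).
        return Manifest.create(
            base_path=self.base_path,
            run_id=self.run_id,
            component_id=loading_component_spec.name,
        )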

Create one click deployment for GCP and AWS

The goal is to help users get started with Fondant.

There are multiple ways we can make this happen with different levels of user involvement.


Options from easiest to hardest (from the user's perspective):

1. Fondant as SaaS
tbd

2. Marketplace solution
Fondant is available on all the cloud providers' marketplaces. The user clicks a button and Kubernetes and Kubeflow (and other dependencies like storage and access rights) are all provisioned. (Note that this is a complex process that includes a review with each cloud provider.)

GCP documentation: https://cloud.google.com/marketplace/docs/partners/kubernetes/submitting

3. Makefile
We create Makefiles that set up a k8s cluster and deploy Kubeflow Pipelines on it. This is a multi-step approach and allows for some user customization.

4. Terraform
We create a Terraform module or example code that the user can apply (and maintain). Requires Terraform knowledge.

5. Detailed guide
We create documentation on how to set up everything (probably combining already existing guides).

Merge loaded subsets into a single dataframe

The loaded subsets need to be merged into a single dataframe by joining them on the index, which is a combination of the id and the source columns, before passing it to the user.
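
A minimal sketch of the merge with Dask, assuming each subset has already been read into its own dataframe indexed on the combined id/source index:

from functools import reduce

import dask.dataframe as dd


def merge_subsets(subset_dataframes: list) -> dd.DataFrame:
    # Each subset dataframe is indexed on the combined (id, source) index, so joining
    # on the index lines the rows up correctly before handing the result to the user.
    return reduce(
        lambda left, right: left.join(right, how="inner"),
        subset_dataframes,
    )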

ControlNet fine-tuning for interior design pipeline

Create an example pipeline to create a dataset for fine-tuning ControlNet for interior design.

The components needed are:

Components

  1. Components: NielsRogge
  2. Components: NielsRogge
  3. Components: NielsRogge
  4. Components: PhilippeMoussalli
  5. Components: NielsRogge
  6. Components: NielsRogge
  7. Components: PhilippeMoussalli
  8. Components documentation: ChristiaensBert, jamesbraniganml6

The final dataset should ideally be registered to the Hugging Face hub.

Write documentation

We should update the README and create the following detailed documentation pages:

Tasks

  1. documentation: RobbeSneyders
  2. documentation: GeorgesLorre
  3. documentation: RobbeSneyders
  4. documentation: RobbeSneyders
  5. documentation: NielsRogge
  6. documentation: PhilippeMoussalli
  7. documentation
  8. documentation: GeorgesLorre
  9. Components documentation: ChristiaensBert, jamesbraniganml6
  10. Components documentation: PhilippeMoussalli

Infer when a subset or the index needs to be written to a new location

#42 provides functionality to generate the output manifest from the input manifest and component spec, but it does not yet correctly update the location of the subsets or the index. It should deduce when this is necessary and update the location accordingly.

Based on our offline discussion:

  • A component should define any data it changes in its output_subsets. So a subset needs to be written to a new location if and only if it is defined in the component output_subsets. Since we only need the component specification, this can be done statically.
  • To know if the index changes, we need to compare the input index and output index of a component at runtime. We need to check how we can do this efficiently using Dask. As a first step we will always rewrite the index and implement this check as an optimization later (a sketch of this logic follows below).

Based on this, we get the following subtasks:

Tasks

  1. Core
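
A sketch of the static part of this logic, assuming the component spec exposes its produced subsets as output_subsets, that subset locations can be set on the manifest, and that the manifest passed in is the already-evolved output manifest (all of these names are assumptions):

def update_locations(manifest, component_spec, run_id: str, component_id: str):
    # A subset needs to be written to a new location if and only if the component
    # declares it in its output_subsets, so this can be decided statically.
    for subset_name in component_spec.output_subsets:
        manifest.subsets[subset_name].location = f"{run_id}/{component_id}/{subset_name}"
    # As a first step, the index is always rewritten; the runtime check of whether the
    # index actually changed can be added later as an optimization.
    manifest.index.location = f"{run_id}/{component_id}/index"
    return manifest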

Add coverage check

We should add a coverage check of our test suite to GitHub Actions. This provides a view of how well our test suite covers our code and can give indications on when / where to add tests. High coverage can help users trust the quality of Fondant.

We can look at Connexion for how to add this:
https://github.com/spec-first/connexion/blob/15fe2eda8fa1853a4d73a13839d7e35814ad7358/pyproject.toml#L76
https://github.com/spec-first/connexion/blob/15fe2eda8fa1853a4d73a13839d7e35814ad7358/tox.ini#L36
https://github.com/spec-first/connexion/blob/15fe2eda8fa1853a4d73a13839d7e35814ad7358/.github/workflows/pipeline.yml#L25

Add typing check in Github actions

The Fondant code is typed, but the typing is not checked. This is an issue if users want to leverage the Fondant typing, as there is no guarantee that it is correct. We should add a typing check to our GitHub Actions to validate this. This will require fixing the types, and might have to be done in multiple steps.

Generate output manifest based on input manifest and component spec

The output manifest of a component should be defined completely by the input manifest and the specification of the component.

This functionality can be used for validation of the data produced by the component, and for static validation of the pipeline before execution.

Importing mechanism for reusable components

We should provide an importing mechanism for our reusable components so they can easily be included in user pipelines.

Loading a local component is done as follows:

from fondant.pipeline import FondantComponentOp

component_op = FondantComponentOp(
    component_spec_path="...",
    arguments={...},
)

where the component_spec_path is a local path.

We no longer want the user to have to specify the component_spec_path, but still want them to be able to provide custom arguments.

We can provide a single ReusableComponentOp class which takes the name of the component to load:

from fondant.components import ReusableComponentOp

component_op = ReusableComponentOp(
    name="...",
    arguments={...},
)

We can optionally still wrap this as specific classes:

from fondant.components import CaptioningComponentOp

captioning_op = CaptioningComponentOp(
    arguments={...},
)

Redesign `component.transform` interface

Currently the interface of the component.transform method is:

def transform(
    self, args: argparse.Namespace, dataframe: dd.DataFrame
) -> dd.DataFrame:

Since the dataframe is the main argument, I would put it first. And instead of providing an argparse.Namespace, I would provide the arguments as keyword arguments.

def transform(
    self, dataframe: dd.DataFrame, **kwargs
) -> dd.DataFrame:

This way users can define the keyword arguments they are expecting in the method signature instead of having to unpack an argparse.Namespace. E.g. for a component that filters on image dimensions:

def transform(
    self, dataframe: dd.DataFrame, *, height: int, width: int
) -> dd.DataFrame:

Implement basic Fondant pipeline mechanism

Currently, compiling and executing pipelines requires users to have basic Kubeflow knowledge. Additionally, the compiled pipelines do not validate the component specifications of the different chained components (static validation before compiling the pipeline).

The ExpressPipeline class aims to compile the defined pipelines and validate that they are defined in the correct order (all subsets are present when needed, based on the ordering of the components).
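
A sketch of what that static validation could look like, assuming each component spec declares the subsets it consumes and produces and has a type and name (these attribute names are assumptions):

def validate_pipeline(component_specs: list) -> None:
    # Walk the chained components in order and check that every subset a component
    # consumes has been produced by an earlier component (or is loaded at the start).
    available_subsets = set()
    for spec in component_specs:
        missing = set(spec.input_subsets) - available_subsets
        if missing and spec.type != "load":
            raise ValueError(
                f"Component '{spec.name}' consumes subsets {missing} that are not "
                f"available at this point in the pipeline."
            )
        available_subsets |= set(spec.output_subsets)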

Update format subset locations

Currently the format of the location of subsets or the index is:

f"{run_id}/{component_id}/{subset_name}"

But since a subset is not rewritten for every run, and certainly not by every component, this leads to a directory structure with only sparse data.

This structure makes it hard to find the data versions of a single subset, without making it easy to find the data versions of a specific run either, since runs might use datasets created by previous runs once we have pipeline caching.

I would therefore propose to change it to

f"{subset_name}/{run_id}/{component_id}"

Add Pipeline integration tests

To make sure updates to the Fondant code base don't break existing pipelines, we need a test infrastructure on Kubernetes that spins up a test pipeline, runs some basic components and deletes the test pipeline. Specifically:

  • build docker images for all components
  • create a dedicated test namespace in the cluster
  • deploy the ML pipeline using the newly built components
  • run the test
  • delete the namespace
  • delete the images

This can be achieved similarly to how Kubeflow Pipelines does it: https://github.com/kubeflow/pipelines/tree/master/test.

Implement a WriteComponent class

Problem Statement

Currently we have two types of components:

  • LoadComponent: takes as input user arguments for loading a dataset (e.g. dataset_id) and loads the dataset from a remote location. This component also creates the initial manifest based on the metadata passed to the component and the component spec.
    The user is expected to change the column names to subset_field if the columns of the loaded dataset do not have this format (this might change).

  • TransformComponent: takes as input the evolved manifest based on the component spec and loads the dataset from the artifact registry. It also presents the dataframe to the user with subset_field column names.

Both components share some common functionalities based on the abstract run method:

  • loading the manifest
  • processing the dataset based on the load or transform function implemented by the user
  • evolving the manifest based on the component spec
  • writing the new datasets
  • uploading the manifest

What is still missing is a WriteComponent (this is currently implemented with a transform component). This component should take as an argument a write_path and write the dataset with an appropriate schema based on the component spec.

Proposed Approach

Let's take the write_to_hub component as an example. What we want is the following:

  • Allow users to write the dataset with custom column names
    This can be done by providing a mapping dict argument, similar to what Bert proposed for the loading component (link)

  • The user returns the dataset with the modified names and we take care of writing it to the appropriate location
    Two options here:

    • (Option A): The user has to implement the writing functions themselves, similar to how it's currently implemented here.
      Not ideal, since they have to handle the schema and the to_parquet call, which we normally handle in the backend.
    • (Option B): The user returns a URL of where to write the data as well as the dataframe, and we take care of writing the dataset to the appropriate location. However, this would require us to implement a DaskDataSink component similar to the DaskDataWriter (we have to change some names), where the biggest differences are that (see the sketch after this list):
      • The write location is the one returned by the user and not based on the manifest (run_id + subset)
      • We do not write any subsets in this component
        The only caveat here is that we're assuming that all the remote locations can be written to using the to_parquet method (example). This applies to both the cloud and the Hugging Face hub. Also, it will require injecting some dependencies into the Fondant code (writing to hf:// requires hf_hub as a dependency).
      • We might want to introduce specific classes for the data sinks, since it enables a nicer isolation of the behavior of the components. E.g. the dataset needs to have a specific schema when writing to the hub in order to render properly (see below). The specific classes would subclass a general DaskDataSink class. Wondering whether we should follow a similar approach for the LoadComponent if we want to introduce many sources.
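
A rough sketch of what Option B could look like, with a general DaskDataSink and a sink-specific subclass (all names here are illustrative, not the actual Fondant classes):

import dask.dataframe as dd
import pyarrow as pa


class DaskDataSink:
    # Writes the full dataframe to a user-provided location rather than to a
    # manifest-based location (run_id + subset); no subsets are written here.
    def __init__(self, write_path: str, schema: pa.Schema):
        self.write_path = write_path
        self.schema = schema

    def write(self, dataframe: dd.DataFrame) -> None:
        # Assumes the remote location (cloud storage, hf://) can be written to with
        # to_parquet; writing to hf:// would pull in hf_hub as a dependency.
        dataframe.to_parquet(self.write_path, schema=self.schema)


class HuggingFaceDataSink(DaskDataSink):
    # Example of a sink-specific subclass: writing to the hub may enforce a specific
    # schema so the dataset renders properly.
    pass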


Implementation Steps/Tasks

Depends on the choice of solution.

Another thing that we will have to implement is changing the schema passed to the to_parquet method from a dict to a pyarrow.Schema data type. This is mainly to support adding the additional metadata to the schema needed for the write_to_hub component.
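
A small sketch of that change, building a pyarrow.Schema with attached metadata instead of a plain dict (the field names and metadata content are illustrative):

import pyarrow as pa

# Instead of a plain dict such as {"images_data": "binary", "captions_text": "string"},
# build a pyarrow.Schema so additional metadata can be attached for write_to_hub.
schema = pa.schema(
    [
        pa.field("images_data", pa.binary()),
        pa.field("captions_text", pa.string()),
    ],
    metadata={"huggingface": '{"info": {"features": {}}}'},
)

# The schema can then be passed to Dask's to_parquet instead of the dict:
# dataframe.to_parquet(write_path, schema=schema)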


Potential Impact

If Option B is introduced:

The Component class will have to be adjusted by adding a new class and potentially modifying the run method (maybe we'll have to implement a separate run method per component type, since they will no longer share many common methods if the WriteComponent is introduced), as well as DataIo.

Documentation

We will have to document all three types of components in custom_component.md.

Split Component into Load and Transform subclasses

The Component class currently has a type attribute which indicates if the component is used to load data or transform data, and which is used in if / else checks in different places. It would be cleaner to split this into LoadComponent and TransformComponent subclasses.
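
A rough sketch of what that split could look like (the load signature is an assumption; transform follows the interface proposed above):

from abc import ABC, abstractmethod

import dask.dataframe as dd


class Component(ABC):
    # Shared logic lives here: loading the manifest, writing the data, uploading
    # the evolved manifest.
    ...


class LoadComponent(Component):
    @abstractmethod
    def load(self, **kwargs) -> dd.DataFrame:
        """Load data from an external source and return it as a Dask dataframe."""


class TransformComponent(Component):
    @abstractmethod
    def transform(self, dataframe: dd.DataFrame, **kwargs) -> dd.DataFrame:
        """Transform the loaded dataframe and return the result."""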

Simplify Fondant naming

Many objects within the framework have long names, for example FondantComponentOp, FondantPipeline, and FondantClient.
Simplify the naming, potentially by using short abbreviations (e.g. F).
