
DATAVERSE

The Universe of Data. All about Data, Data Science, and Data Engineering.

Welcome to Dataverse!

Dataverse is a freely accessible open-source project that supports your ETL (Extract, Transform, Load) pipeline in Python. We offer a simple, standardized, and user-friendly solution for data processing and management, catering to the needs of data scientists, analysts, and developers in the LLM era. Even if you don't know much about Spark, you can use it easily via Dataverse.

Why should I use Dataverse?

  • Integrated library: Dataverse streamlines your workflow by integrating multiple preprocessing libraries into one, eliminating the hassle of configuring and searching for the right tools. You can even load Hugging Face datasets from the Hub directly into the pipeline.
  • Simplified Spark usage: You don't have to be a Spark expert. With just a few configuration settings, you can take advantage of Spark's high performance.
  • Facilitated collaboration: Uniform preprocessing code ensures consistent results no matter who runs it. Dataverse also enables collaboration among users with varying levels of Spark proficiency.

Key Features of Dataverse

  • Block-Based: Dataverse lets you build Spark code like putting together puzzle pieces. You can easily add, take away, or rearrange pieces to get the results you want.
  • Configure-Based: Dataverse has a user-friendly setup where you don't need to know all the code. Just set up the options, and you're good to go.
  • Extensible: It's designed to meet your specific demands, allowing for custom features that fit perfectly with your project.
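
To illustrate the block-based and configuration-based ideas above, here is a minimal, self-contained sketch in plain Python. It is not Dataverse's implementation or API: `REGISTRY`, `register`, `run_pipeline`, and the two example blocks are hypothetical names, and plain lists of dicts stand in for Spark data. The triple-underscore naming only mirrors the step names shown in the Quickstart.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping a block name to the function that implements it.
REGISTRY: Dict[str, Callable[..., List[dict]]] = {}


def register(func):
    """Register a block under its function name (illustrative helper only)."""
    REGISTRY[func.__name__] = func
    return func


@register
def cleaning___text___lowercase(rows, column):
    """Lowercase one column in every row."""
    return [{**row, column: row[column].lower()} for row in rows]


@register
def deduplication___exact___column(rows, column):
    """Keep the first row for each distinct value of `column`."""
    seen = set()
    kept = []
    for row in rows:
        if row[column] not in seen:
            seen.add(row[column])
            kept.append(row)
    return kept


def run_pipeline(rows, config):
    """Apply each configured block in order, feeding each output to the next block."""
    for step in config:
        block = REGISTRY[step["name"]]
        rows = block(rows, **step.get("args", {}))
    return rows


# The pipeline is described as data: reorder, add, or drop steps without touching the blocks.
config = [
    {"name": "cleaning___text___lowercase", "args": {"column": "question"}},
    {"name": "deduplication___exact___column", "args": {"column": "question"}},
]
rows = [
    {"question": "What is ETL?"},
    {"question": "what is etl?"},
    {"question": "Why Spark?"},
]
result = run_pipeline(rows, config)
print(result)
```

Because each step is looked up by name, a custom block becomes available to a config as soon as it is registered, which is the extensibility idea in miniature.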

🌌 Installation

🌠 Prerequisites

To use this library, you need the following:

  • Python (version between 3.10 and 3.11)
  • JDK (version 11)
  • PySpark

A detailed installation guide for the prerequisites can be found here.
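
As a quick sanity check before installing, the prerequisites can be probed from Python. This is a rough sketch using only the standard library; the helper names are made up for illustration. It only checks that a `java` executable is on `PATH` and that `pyspark` is importable, and it does not verify that the JDK is version 11.

```python
import importlib.util
import shutil
import sys


def python_version_supported(version_info=sys.version_info):
    """Return True when the interpreter is Python 3.10 or 3.11."""
    return (3, 10) <= tuple(version_info[:2]) <= (3, 11)


def java_on_path():
    """Return True when a `java` executable can be found on PATH."""
    return shutil.which("java") is not None


def pyspark_installed():
    """Return True when the `pyspark` package can be located by the import system."""
    return importlib.util.find_spec("pyspark") is not None


if __name__ == "__main__":
    print("Python 3.10-3.11:", python_version_supported())
    print("java on PATH:", java_on_path())
    print("pyspark importable:", pyspark_installed())
```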

🌠 Install via PyPI

pip install dataverse

🌌 Quickstart

More detailed tutorials are available here.

  • add_new_etl_process.ipynb : To use a custom function, you must first register it with Dataverse. This notebook guides you from registering the function to applying it in a pipeline.
  • test_etl_process.ipynb : For when you want test (sample) data to quickly try out your ETL process, or need data from a certain point in the pipeline to test it.
  • scaleout_with_EMR.ipynb : For people who want to run their pipeline on an EMR cluster.

Details of the example ETL configuration:

  • data_ingestion___huggingface___hf2raw
      • Load the dataset from Hugging Face; it contains 2.59k rows in total.
  • utils___sampling___random
      • Randomly subsample 50% of the data to reduce the dataset size, using the default seed of 42. This reduces the dataset to about 1.29k rows.
  • deduplication___minhash___lsh_jaccard
      • Deduplicate on the question column using 5-gram MinHash with a Jaccard similarity threshold of 0.1.
  • data_load___parquet___ufl2parquet
      • Save the processed dataset as a Parquet file to ./guideline/etl/sample/quickstart.parquet. The final dataset contains around 1.14k rows.

    # 1. Define your ETL process as a config.

    from omegaconf import OmegaConf

    ETL_config = OmegaConf.create({
        'spark': {
            'appname': 'ETL',
            'driver': {'memory': '4g'},
        },
        'etl': [
            {
                # Extract: you can use a Hugging Face dataset from the Hub directly.
                'name': 'data_ingestion___huggingface___hf2raw',
                'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']},
            },
            {
                # Randomly keep 50% of the rows.
                'name': 'utils___sampling___random',
                'args': {'sample_n_or_frac': 0.5},
            },
            {
                # Transform: deduplicate on the 'question' column.
                'name': 'deduplication___minhash___lsh_jaccard',
                'args': {
                    'threshold': 0.1,
                    'ngram_size': 5,
                    'subset': 'question',
                },
            },
            {
                # Load: write the result as a Parquet file.
                'name': 'data_load___parquet___ufl2parquet',
                'args': {'save_path': './guideline/etl/sample/quickstart.parquet'},
            },
        ],
    })

    # 2. Run the ETL pipeline.

    from dataverse.etl import ETLPipeline

    etl_pipeline = ETLPipeline()
    spark, dataset = etl_pipeline.run(config=ETL_config, verbose=True)

    # 3. The result file is saved at save_path.
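
The deduplication step in the config above compares rows by n-gram Jaccard similarity estimated with MinHash. The sketch below shows the underlying idea in plain Python. It is an illustration under assumptions, not Dataverse's code: it shingles on words (the source does not say whether words or characters are used), uses SHA-1 with a seed prefix as the hash family, omits the LSH bucketing step, and all function names are hypothetical.

```python
import hashlib


def word_ngrams(text, n=5):
    """Return the set of word n-grams in `text` (the whole text if it has fewer than n words)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(shingles, num_perm=64):
    """For each of `num_perm` seeded hash functions, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


q1 = "Which of the following best explains why the moon appears to change shape"
q2 = "Which of the following best explains why the sun appears to change shape"
q3 = "A student heats a beaker of water until it starts to boil"

s1, s2, s3 = word_ngrams(q1), word_ngrams(q2), word_ngrams(q3)
exact_12 = jaccard(s1, s2)  # 4 shared 5-grams out of 14 distinct ones
exact_13 = jaccard(s1, s3)  # no shared 5-grams
approx_12 = estimate_jaccard(minhash_signature(s1), minhash_signature(s2))

THRESHOLD = 0.1  # same value as in the example config
print(exact_12, exact_13, approx_12, exact_12 >= THRESHOLD)
```

With a threshold as low as 0.1, rows sharing only a small fraction of their 5-grams are treated as near-duplicates, so `q1` and `q2` would be candidates for deduplication here while `q1` and `q3` would not.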

🌌 Contributors

🌌 Acknowledgements

Dataverse is an open-source project orchestrated by the Data-Centric LLM Team at Upstage, designed as a data ecosystem for LLMs (Large Language Models). Launched in March 2024, this initiative aims to advance data handling for LLMs.

🌌 License

Dataverse is fully open source, freely accessible, and licensed under the Apache-2.0 license.

🌌 Citation

If you want to cite our 🌌 Dataverse project, feel free to use the following BibTeX entry:

    @misc{dataverse,
      title = {Dataverse},
      author = {Hyunbyung Park and Sukyung Lee and Gyoungjin Gim and Yungi Kim and Seonghoon Yang and Jihoo Kim and Changbae Ahn and Chanjun Park},
      year = {2024},
      publisher = {GitHub, Upstage AI},
      howpublished = {\url{https://github.com/UpstageAI/dataverse}},
    }
