Scalable Synthetic Data Generation (SDG)

Introduction

Framework for scalable synthetic data generation (SDG).

Getting Started

Setup

We recommend using a Python virtual environment with Python 3.9+. Here is how to set up a virtual environment using Python venv:

python3 -m venv ssdg_venv
source ssdg_venv/bin/activate
pip install .

Note: If you use pyenv, Conda Miniforge, or another tool for Python version management, create the virtual environment with that tool instead. Otherwise, you may find that packages are installed but their modules cannot be found, because they are linked to your Python version management tool and not to venv.
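
If imports fail after installation, it can help to confirm which interpreter is actually running. The following is a minimal sketch, not part of the framework; the function name `in_virtual_env` is hypothetical. It relies on the standard behavior that `sys.prefix` differs from `sys.base_prefix` inside a venv-style environment. It does not detect Conda environments, which do not change `sys.base_prefix`.

```python
import sys


def in_virtual_env() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("virtual environment active:", in_virtual_env())
```

If the printed interpreter path is not inside ssdg_venv, the environment is probably not activated in the current shell.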

SDG uses Large Language Models (LLMs) to generate synthetic data, so Scalable SDG requires access to an LLM inference API. The following LLM inference APIs are supported: IBM Generative AI (GenAI), OpenAI, and vLLM.

Scalable SDG uses a .env file to specify the configuration for the IBM GenAI and OpenAI APIs. The .env file needs to be available in the directory from which the generate command is run. There is a template env file here.
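
Before running generation, it may be useful to check that the .env file in the current directory defines the keys for the API you plan to use. The sketch below is an illustration only and is not how the framework itself loads its configuration. The helper names `read_env_file` and `missing_keys` are hypothetical, and the parser assumes simple `KEY=VALUE` lines. The key names are the ones listed in the API subsections of this README.

```python
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def missing_keys(values, required):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        env = read_env_file(env_path)
        print("GenAI keys missing:", missing_keys(env, ["GENAI_KEY", "GENAI_API"]))
        print("OpenAI keys missing:", missing_keys(env, ["OPENAI_API_KEY"]))
    else:
        print("No .env file found in", Path.cwd())
```

Run the script from the same directory in which you intend to run the generate command, since that is where the .env file is expected.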

The subsections that follow explain how to set up each of the APIs.

IBM Generative AI (GenAI)

When using the IBM GenAI API, you need to:

  1. Add configuration to the .env file as follows:
GENAI_KEY=<genai key goes here>
GENAI_API=<genai api goes here>
  2. Install GenAI dependencies as follows:
pip install -e ".[genai]"

OpenAI

When using the OpenAI platform, you need to:

  1. Add configuration to the .env file as follows:
OPENAI_API_KEY=<openai api key goes here>
  2. Install OpenAI dependencies as follows:
pip install -e ".[openai]"

vLLM

When using vLLM batched inference, you need to:

  1. Install vLLM dependencies as follows:
pip install -e ".[vllm]"

Note: vLLM requires Linux OS and CUDA.
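
As a rough pre-check of these requirements, the sketch below reports whether the machine runs Linux and whether the `nvidia-smi` tool is on the PATH. This is only a heuristic and an assumption on our part: finding `nvidia-smi` suggests an NVIDIA driver is installed but does not prove that CUDA is usable by vLLM. The function name `vllm_prerequisites` is hypothetical.

```python
import platform
import shutil


def vllm_prerequisites():
    """Return (is_linux, has_nvidia_smi) as a rough readiness hint."""
    is_linux = platform.system() == "Linux"
    # nvidia-smi on PATH hints at an installed NVIDIA driver;
    # it is not proof that a working CUDA setup exists.
    has_nvidia_smi = shutil.which("nvidia-smi") is not None
    return is_linux, has_nvidia_smi


if __name__ == "__main__":
    linux, gpu_tool = vllm_prerequisites()
    print("Linux OS:", linux)
    print("nvidia-smi found:", gpu_tool)
```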

Testing out the Framework

To get started with this example, make sure you have followed the Setup instructions and configured IBM GenAI and/or vLLM.

In this example, we will use the preloaded data files as the seed data to generate the synthetic data.

Testing with GenAI

The default data builder runs with the GenAI API unless overridden. To execute data generation with GenAI, we therefore only need to run the following command from the root of the repository:

python -m fms_dgt.__main__ --data-path ./data/logical_reasoning/causal/qna.yaml

Alternatively, you can use the CLI:

fms_dgt --data-path ./data/logical_reasoning/causal/qna.yaml

Testing with vLLM

For convenience, we provide an additional configuration file that you can modify to test a local model with vLLM. First, open the config file and, in the model_id_or_path field, replace the <local-path-to-model> placeholder with the path to a model that has been downloaded locally. Then run:

python -m fms_dgt.__main__ --data-path ./data/logical_reasoning/causal/qna.yaml --include-config-path ./configs/demo.yaml

Note: vLLM requires Linux OS and CUDA.

Examine Outputs

The generated data will be output to the following directory: output/causal/data->logical_reasoning->causal/generated_instructions.json

This example uses the SimpleInstructDataBuilder as defined in ./fms_dgt/databuilders/simple/. For more information on data builders and other components of Scalable SDG, take a look at the SDG Design doc.
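
To take a quick look at the generated file, a small script along the following lines may help. It is a sketch under assumptions: the README does not specify the file's internal layout, so the hypothetical helper `load_records` accepts either a single JSON document or JSON Lines (one JSON value per line), and it prints only the record count and the first record rather than assuming any field names.

```python
import json
import sys
from pathlib import Path


def load_records(path):
    """Load records from a file holding one JSON document or JSON Lines."""
    text = Path(path).read_text(encoding="utf-8").strip()
    if not text:
        return []
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to JSON Lines: one JSON value per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return data if isinstance(data, list) else [data]


if __name__ == "__main__":
    # Pass the path to generated_instructions.json as the first argument.
    if len(sys.argv) > 1:
        records = load_records(sys.argv[1])
        print("records:", len(records))
        if records:
            print("first record:", json.dumps(records[0], indent=2))
    else:
        print("usage: python inspect_output.py <path-to-generated-file>")
```

The script name inspect_output.py is illustrative only.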

Contributing

Check out our contributing guide to learn how to contribute.

References

This repository is based on the Language Model Evaluation Harness, which uses an MIT license.

@misc{eval-harness,
    author = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
    title = {A framework for few-shot language model evaluation},
    month = 12,
    year = 2023,
    publisher = {Zenodo},
    version = {v0.4.0},
    doi = {10.5281/zenodo.10256836},
    url = {https://zenodo.org/records/10256836}
}
