
fms-dgt's Introduction

Scalable Synthetic Data Generation (SDG)

Introduction

Framework for scalable synthetic data generation (SDG).

Getting Started

Setup

We recommend using a Python virtual environment with Python 3.9+. Here is how to set up a virtual environment using Python venv:

python3 -m venv ssdg_venv
source ssdg_venv/bin/activate
pip install .

Note: If you use pyenv, Conda Miniforge, or another tool for Python version management, create the virtual environment with that tool instead. Otherwise, packages may install successfully while their modules cannot be found, because they are linked to your Python version management tool and not to venv.
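If modules are reported missing after installation, it can help to confirm which interpreter is actually running. The snippet below is a minimal sketch using only the standard library. It detects venv and virtualenv style environments only, so it will report False inside a Conda environment, and the function name in_virtualenv is our own.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when running inside a venv/virtualenv style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


# Show which interpreter is active and whether it belongs to a venv.
print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtualenv())
```

If the interpreter path does not point inside ssdg_venv after activation, pip is likely installing into a different Python than the one you are running.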

SDG uses Large Language Models (LLMs) to generate synthetic data, so Scalable SDG requires access to an LLM for inference. The following LLM inference APIs are supported:

  • IBM Generative AI (GenAI)
  • OpenAI
  • vLLM

Scalable SDG uses a .env file to specify the configuration for the IBM GenAI and OpenAI APIs. The .env file needs to be available in the directory from which the generate command is run. A template env file is provided.

The subsections that follow explain how to set up each of the APIs.

IBM Generative AI (GenAI)

When using the IBM GenAI API, you need to:

  1. Add configuration to the .env file as follows:
GENAI_KEY=<genai key goes here>
GENAI_API=<genai api goes here>
  2. Install GenAI dependencies as follows:
pip install -e ".[genai]"

OpenAI

When using the OpenAI platform, you need to:

  1. Add configuration to the .env file as follows:
OPENAI_API_KEY=<openai api key goes here>
  2. Install OpenAI dependencies as follows:
pip install -e ".[openai]"
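Before running a generation job, it can be useful to check that the .env file contains the keys listed above for the API you plan to use. The following is a minimal sketch, not the framework's own loader. It uses a simplified KEY=VALUE parser, and the names parse_env, missing_keys, and REQUIRED are hypothetical helpers introduced for illustration.

```python
# Keys taken from the GenAI and OpenAI setup steps in this README.
REQUIRED = {
    "genai": ["GENAI_KEY", "GENAI_API"],
    "openai": ["OPENAI_API_KEY"],
}


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str, backend: str) -> list:
    """Return the required keys that are absent or empty for a backend."""
    values = parse_env(text)
    return [key for key in REQUIRED[backend] if not values.get(key)]
```

To use it, read the file contents first, for example with pathlib.Path(".env").read_text(), and pass the text to missing_keys together with "genai" or "openai". An empty list means every expected key has a value.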

vLLM

When using vLLM batched inference, you need to:

  1. Install vLLM dependencies as follows:
pip install -e ".[vllm]"

Note: vLLM requires Linux OS and CUDA.

Testing out the Framework

To get started with this example, make sure you have followed the Setup instructions and configured IBM GenAI and/or vLLM.

In this example, we will use the preloaded data files as the seed data to generate the synthetic data.

Testing with GenAI

The default data builder runs with the GenAI API unless overridden. To execute data generation with GenAI, we therefore only need to run the following command from the root of the repository:

python -m fms_dgt.__main__ --data-path ./data/logical_reasoning/causal/qna.yaml

Alternatively, you can use the CLI:

fms_dgt --data-path ./data/logical_reasoning/causal/qna.yaml

Testing with vLLM

For convenience, we have provided an additional configuration file that can be modified to test a local model with vLLM. First, open the config file and, in the model_id_or_path field, replace the <local-path-to-model> placeholder with the path of a model that has been downloaded locally. Then run:

python -m fms_dgt.__main__ --data-path ./data/logical_reasoning/causal/qna.yaml --include-config-path ./configs/demo.yaml

Note: vLLM requires Linux OS and CUDA.

Examine Outputs

The generated data will be output to the following directory: output/causal/data->logical_reasoning->causal/generated_instructions.json
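To take a quick look at the generated file, a small loader such as the one below may help. This is a sketch under an assumption: the README does not state the file layout, so the code accepts either a single JSON document or one JSON object per line. The names parse_records and load_records are hypothetical.

```python
import json
from pathlib import Path


def parse_records(text: str) -> list:
    """Parse text that is either one JSON document or JSON lines."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Assumption: fall back to one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # Wrap a single top-level object so callers always get a list.
    return data if isinstance(data, list) else [data]


def load_records(path: str) -> list:
    """Read a generated output file and return its records as a list."""
    return parse_records(Path(path).read_text(encoding="utf-8"))
```

Calling load_records with the output path shown above and printing the length of the result gives a quick count of generated examples.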

This example uses the SimpleInstructDataBuilder as defined in ./fms_dgt/databuilders/simple/. For more information on data builders and other components of Scalable SDG, take a look at the SDG Design doc.

Contributing

Check out our contributing guide to learn how to contribute.

References

This repository is based on the Language Model Evaluation Harness which uses an MIT license.

@misc{eval-harness,
    author = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
    title = {A framework for few-shot language model evaluation},
    month = 12,
    year = 2023,
    publisher = {Zenodo},
    version = {v0.4.0},
    doi = {10.5281/zenodo.10256836},
    url = {https://zenodo.org/records/10256836}
}

fms-dgt's Issues

Maybe add support for GPT-4o?

Hi!
Trying to generate data with GPT-4o fails in

splitted_data = re.split(r"\*\*\s+(Instruction|Input|Output):?", inst)
since it doesn't recognize the format.

A couple of outputs for example:

2024-07-03T22:20:27 - Discarded instruction(didn't match expected format): '\nCertainly! Here are five diverse task instructions for the given document:\n\n1. **Instruction**\nName three films submitted for the 96th Academy Awards for Best Animated Feature.\n**Input**\n<noinput>\n**Output**\nThree of the films submitted for the 96th Academy Awards for Best Animated Feature are "Elemental," "Spider-Man: Across the Spider-Verse," and "Turning Red."\n\n2. **Instruction**\nProvide a brief summary of the main categories for which films can be submitted at the 96th Academy Awards based on the given document.\n**Input**\n<noinput>\n**Output**\nFilms can be submitted in categories such as Best Animated Feature, Best Documentary Feature, and Best International Feature Film at the 96th Academy Awards.\n\n3. **Instruction**\nList the types of features included in the "See also" section for the 96th Academy Awards.\n**Input**\n<noinput>\n**Output**\nThe "See also" section includes Best Animated Feature, Best Documentary Feature, and Best International Feature Film for the 96th Academy Awards.\n\n4. **Instruction**\nAre there submissions available for the 96th Academy Awards for Best Documentary Feature in the document?\n**Input**\n<noinput>\n**Output**\nYes, there are submissions available for the 96th Academy Awards for Best Documentary Feature as listed in the document.\n\n5. **Instruction**\nSummarize the contents of the "See also" section related to the 96th Academy Awards.\n**Input**\n<noinput>\n**Output**\nThe "See also" section provides links to lists of submissions for the 96th Academy Awards in three categories: Best Animated Feature, Best Documentary Feature, and Best International Feature Film.'

2024-07-03T22:20:33 - Discarded instruction(didn't match expected format): '\n1. **Instruction**\nWhich film won the Best Cinematography award?\n**Input**\n<noinput>\n**Output**\nOppenheimer won the Best Cinematography award, with the cinematography by Hoyte van Hoytema.\n\n2. **Instruction**\nGenerate a summary of the winners for Best Production Design and Best Costume Design.\n**Input**\n<noinput>\n**Output**\nThe winner for Best Production Design was Poor Things, with production design by James Price and Shona Heath, and set decoration by Zsuzsa Mihalek. The winner for Best Costume Design was Poor Things as well, with costume design by Holly Waddington.\n\n3. **Instruction**\nList the nominees for Best Makeup and Hairstyling.\n**Input**\n<noinput>\n**Output**\nThe nominees for Best Makeup and Hairstyling were:\n- Golda – Karen Hartley Thomas, Suzi Battersby, and Ashra Kelly-Blue\n- Maestro – Kazu Hiro, Kay Georgiou, and Lori McCoy-Bell\n- Oppenheimer – Luisa Abel\n- Society of the Snow – Ana López-Puigcerver, David Martí, and Montse Ribé\n\n4. **Instruction**\nWho won the Best Visual Effects award?\n**Input**\n<noinput>\n**Output**\nThe winner of the Best Visual Effects award was Godzilla Minus One, with visual effects by Takashi Yamazaki, Kiyoko Shibuya, Masaki Takahashi, and Tatsuji Nojima.\n\n5. **Instruction**\nDescribe the recipients of the Academy Honorary Awards at the 14th Governors Awards ceremony.\n**Input**\n<noinput>\n**Output**\nThe recipients of the Academy Honorary Awards at the 14th Governors Awards ceremony were Angela Bassett, Mel Brooks, and Carol Littleton.'

Using it from InstructLab 0.17.1 with the official container as built by RHELAI.
Thanks!!!
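The mismatch reported in this issue can be reproduced in isolation. The quoted pattern requires at least one whitespace character directly after `**`, whereas the logged GPT-4o output writes `**Instruction**` with no space and a closing `**`. The sketch below shows the original pattern finding nothing to split on, next to a more tolerant pattern. TOLERANT_PATTERN is a hypothetical alternative written for illustration, not a fix taken from the project, and the sample text is taken from the second log entry above.

```python
import re

# Pattern quoted in the issue: needs whitespace right after "**".
ORIGINAL_PATTERN = r"\*\*\s+(Instruction|Input|Output):?"

# Hypothetical tolerant variant: whitespace is optional and a closing "**"
# after the keyword is accepted.
TOLERANT_PATTERN = r"\*\*\s*(Instruction|Input|Output)(?:\*\*)?:?"

sample = (
    "\n1. **Instruction**\nWhich film won the Best Cinematography award?\n"
    "**Input**\n<noinput>\n"
    "**Output**\nOppenheimer won the Best Cinematography award, "
    "with the cinematography by Hoyte van Hoytema."
)

# With no match, re.split returns the whole string as a single element.
original_parts = re.split(ORIGINAL_PATTERN, sample)

# With the tolerant pattern, keywords and their bodies alternate in the list.
tolerant_parts = re.split(TOLERANT_PATTERN, sample)
fields = {
    tolerant_parts[i]: tolerant_parts[i + 1].strip()
    for i in range(1, len(tolerant_parts) - 1, 2)
}

print(len(original_parts), len(tolerant_parts))
print(fields["Instruction"])
```

The tolerant pattern still matches a spaced form such as `** Instruction:`, but it has not been tested against the project's prompts, so treat it as a starting point for discussion.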
