Giter Club home page Giter Club logo

gpt3datagen's Introduction

Python 3.7, 3.8, 3.9 Imports: isort CI pre-commit Docs

tsiameh twitter

GPT3DataGen

GPT3DataGen is a python package that generates fake data for fine-tuning your openai models.

               _      ___      _         _
              ( )_  /'_  )    ( )       ( )_
   __   _ _   | ,_)(_)_) |   _| |   _ _ | ,_)   _ _    __     __    ___
 /'_ `\( '_`\ | |   _(_ <  /'_` | /'_` )| |   /'_` ) /'_ `\ /'__`\/' _ `\
( (_) || (_) )| |_ ( )_) |( (_| |( (_| || |_ ( (_| |( (_) |(  ___/| ( ) |
`\__  || ,__/'`\__)`\____)`\__,_)`\__,_)`\__)`\__,_)`\__  |`\____)(_) (_)v0.1.0
( )_) || |                                          ( )_) |
 \___/'(_)                                           \___/'

Install with pip. See Install & Usage Guide

pip install -U gpt3datagen

Alternatively, the following command will pull and install the latest commit from this repository, along with its Python dependencies:

pip install git+https://github.com/donwany/gpt3datagen.git --use-pep517

Or git clone repository:

git clone https://github.com/donwany/gpt3datagen.git
cd gpt3datagen
make install && pip install -e .

To update the package to the latest version of this repository, please run:

pip install --upgrade --no-deps --force-reinstall git+https://github.com/donwany/gpt3datagen.git

Command-Line Usage

Run the following to view all available options:

gpt3datagen --help
gpt3datagen --version

Output formats: jsonl, json, csv, tsv, xlsx

gpt3datagen \
    --num_samples 500 \
    --max_length 2048 \
    --sample_type "classification" \
    --output_format "jsonl" \
    --output_dir .

gpt3datagen \
    --num_samples 500 \
    --max_length 2048 \
    --sample_type completion \
    --output_format csv \
    --output_dir .

gpt3datagen \
    --sample_type completion \
    --output_format jsonl \
    --output_dir .

gpt3datagen --sample_type completion -o . -f jsonl
gpt3datagen --sample_type news -o . -f jsonl

Data Format

{"prompt": "<prompt text> \n\n###\n\n", "completion": " <ideal generated text> END"}
{"prompt": "<prompt text> \n\n###\n\n", "completion": " <ideal generated text> END"}
{"prompt": "<prompt text> \n\n###\n\n", "completion": " <ideal generated text> END"}
                                    ...

Basic Usage

Only useful if you clone the repository

python prepare.py \
    --num_samples 500 \
    --max_length 2048 \
    --sample_type "classification" \
    --output_format "jsonl" \
    --output_dir .

python prepare.py \
    --num_samples 500 \
    --max_length 2048 \
    --sample_type "completion" \
    --output_format "csv" \
    --output_dir .

python prepare.py \
    --num_samples 500 \
    --max_length 2048 \
    --sample_type "completion" \
    --output_format "json" \
    --output_dir /Users/<tsiameh>/Desktop

Validate Sample Data

pip install --upgrade openai

export OPENAI_API_KEY="<OPENAI_API_KEY>"

# validate sample datasets generated
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.jsonl
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.csv
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.tsv
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.json
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.xlsx
openai tools fine_tunes.prepare_data -f /Users/<tsiameh>/Desktop/data_prepared.jsonl

# fine-tune
openai api fine_tunes.create \
  -t <DATA_PREPARED>.jsonl \
  -m <BASE_MODEL: davinci, curie, ada, babbage>

# List all created fine-tunes
openai api fine_tunes.list

Test Runs

# For multiclass classification
openai api fine_tunes.create \
  -t <TRAIN_FILE_ID_OR_PATH> \
  -v <VALIDATION_FILE_OR_PATH> \
  -m <MODEL> \
  --compute_classification_metrics \
  --classification_n_classes <N_CLASSES>

# For binary classification
openai api fine_tunes.create \
  -t <TRAIN_FILE_ID_OR_PATH> \
  -v <VALIDATION_FILE_OR_PATH> \
  -m <MODEL> \
  --compute_classification_metrics \
  --classification_n_classes 2 \
  --classification_positive_class <POSITIVE_CLASS_FROM_DATASET>

Contribute

Please see CONTRIBUTING.

License

GPT3DataGen is released under the MIT License. See the bundled LICENSE file for details.

BuyMeCoffee

Build

Credits

  • Theophilus Siameh

gpt3datagen's People

Contributors

aerogramme avatar donwany avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.