The gpt3datagen from donwany

GPT3DataGen

GPT3DataGen is a python package that generates fake data for fine-tuning your openai models.

               _      ___      _         _
              ( )_  /'_  )    ( )       ( )_
   __   _ _   | ,_)(_)_) |   _| |   _ _ | ,_)   _ _    __     __    ___
 /'_ `\( '_`\ | |   _(_ <  /'_` | /'_` )| |   /'_` ) /'_ `\ /'__`\/' _ `\
( (_) || (_) )| |_ ( )_) |( (_| |( (_| || |_ ( (_| |( (_) |(  ___/| ( ) |
`\__  || ,__/'`\__)`\____)`\__,_)`\__,_)`\__)`\__,_)`\__  |`\____)(_) (_)v0.1.0
( )_) || |                                          ( )_) |
 \___/'(_)                                           \___/'

Install with pip. See Install & Usage Guide

pip install -U gpt3datagen

Alternatively, the following command will pull and install the latest commit from this repository, along with its Python dependencies:

pip install git+https://github.com/donwany/gpt3datagen.git --use-pep517

Or git clone repository:

git clone https://github.com/donwany/gpt3datagen.git
cd gpt3datagen
make install && pip install -e .

To update the package to the latest version of this repository, please run:

pip install --upgrade --no-deps --force-reinstall git+https://github.com/donwany/gpt3datagen.git

Command-Line Usage

Run the following to view all available options:

gpt3datagen --help
gpt3datagen --version

Output formats: jsonl, json, csv, tsv, xlsx

gpt3datagen \
    --num_samples 500 \
    --max_length 2048 \
    --sample_type "classification" \
    --output_format "jsonl" \
    --output_dir .

gpt3datagen \
    --num_samples 500 \
    --max_length 2048 \
    --sample_type completion \
    --output_format csv \
    --output_dir .

gpt3datagen \
    --sample_type completion \
    --output_format jsonl \
    --output_dir .

gpt3datagen --sample_type completion -o . -f jsonl
gpt3datagen --sample_type news -o . -f jsonl

Data Format

{"prompt": "<prompt text> \n\n###\n\n", "completion": " <ideal generated text> END"}
{"prompt": "<prompt text> \n\n###\n\n", "completion": " <ideal generated text> END"}
{"prompt": "<prompt text> \n\n###\n\n", "completion": " <ideal generated text> END"}
                                    ...

Basic Usage

Only useful if you clone the repository

python prepare.py \
    --num_samples 500 \
    --max_length 2048 \
    --sample_type "classification" \
    --output_format "jsonl" \
    --output_dir .

python prepare.py \
    --num_samples 500 \
    --max_length 2048 \
    --sample_type "completion" \
    --output_format "csv" \
    --output_dir .

python prepare.py \
    --num_samples 500 \
    --max_length 2048 \
    --sample_type "completion" \
    --output_format "json" \
    --output_dir /Users/<tsiameh>/Desktop

Validate Sample Data

pip install --upgrade openai

export OPENAI_API_KEY="<OPENAI_API_KEY>"

# validate sample datasets generated
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.jsonl
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.csv
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.tsv
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.json
openai tools fine_tunes.prepare_data -f <SAMPLE_DATA>.xlsx
openai tools fine_tunes.prepare_data -f /Users/<tsiameh>/Desktop/data_prepared.jsonl

# fine-tune
openai api fine_tunes.create \
  -t <DATA_PREPARED>.jsonl \
  -m <BASE_MODEL: davinci, curie, ada, babbage>

# List all created fine-tunes
openai api fine_tunes.list

Test Runs

# For multiclass classification
openai api fine_tunes.create \
  -t <TRAIN_FILE_ID_OR_PATH> \
  -v <VALIDATION_FILE_OR_PATH> \
  -m <MODEL> \
  --compute_classification_metrics \
  --classification_n_classes <N_CLASSES>

# For binary classification
openai api fine_tunes.create \
  -t <TRAIN_FILE_ID_OR_PATH> \
  -v <VALIDATION_FILE_OR_PATH> \
  -m <MODEL> \
  --compute_classification_metrics \
  --classification_n_classes 2 \
  --classification_positive_class <POSITIVE_CLASS_FROM_DATASET>

Contribute

Please see CONTRIBUTING.

License

GPT3DataGen is released under the MIT License. See the bundled LICENSE file for details.

BuyMeCoffee

Credits

Theophilus Siameh

donwany / gpt3datagen Goto Github PK

gpt3datagen's Introduction

GPT3DataGen

Install with pip. See Install & Usage Guide

Command-Line Usage

Data Format

Basic Usage

Validate Sample Data

Test Runs

Contribute

License

BuyMeCoffee

Credits

gpt3datagen's People

Contributors

Stargazers

Watchers

Forkers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent