Deita

🤗 HF Repo   📄 Paper   📚 6K Data   📚 10K Data

Welcome to Deita (Data-Efficient Instruction Tuning for Alignment) Project!

We will continue to update this project, so please stay tuned!

What is Deita?

Deita is an open-sourced project designed to facilitate Automatic Data Selection for instruction tuning in Large Language Models (LLMs).

It includes:

  • Open-sourced Toolkits for automatic data selection in instruction tuning
  • Deita Datasets: A series of extremely lightweight, high-quality alignment SFT datasets. We release 6K-sized and 10K-sized datasets in the first release
  • Deita Models: A series of powerful models on par with SOTA chat LLMs, obtained through an extremely efficient instruction tuning process. Deita models can be trained with 10x less instruction tuning data compared with other SOTA LLMs

News

Performance

🔔 Still curious how far a small amount of high-quality data can take LLMs?

Deita may provide an answer for you:

🔦 Highlights

Model | Align | Data Size | MT-Bench | AlpacaEval(%)
Zephyr-7B-sft | SFT | 200K | 5.32 | 75.12
Zephyr-7B-β | SFT + DPO | 200K SFT + 60K DPO | 7.34 | 90.60
OpenChat-3.5 | C-RLFT | >> 70K C-RLFT | 7.81 | 88.51
Starling-7B | C-RLFT + APA | >> 70K C-RLFT + 183K APA | 8.09 | 91.99
Tulu-2-13B | SFT | 326K | 6.70 | 78.90
Tulu-2-13B+DPO | SFT + DPO | 326K SFT + 60K DPO | 7.00 | 89.50
LLaMA2-13B-Chat | SFT + PPO | -- | 6.65 | 81.09
WizardLM-13B-v1.2 | SFT | >70K | 7.09 | 89.17
Vicuna-13B-v1.5 | SFT | >125K | 6.57 | 78.80
DEITA-7B-v1.0 (6K) | SFT | 6K | 7.22 | 80.78
DEITA-7B-v1.0-sft | SFT | 10K | 7.32 | 81.67
DEITA-7B-v1.0 | SFT + DPO | 6K SFT + 10K DPO | 7.55 | 90.06

DEITA models are based on Mistral-7B-v0.1. 🔥

Please refer to the full evaluations table below, which also covers the Open LLM Leaderboard, DEITA models with LLaMA base models, and comparisons with other data selection approaches.

📈 Full Evaluations

Model | Align | Data Size | MT-Bench | AlpacaEval(%) | OpenLLM (Avg.)

Proprietary Models
GPT-4-Turbo | ? | -- | 9.32 | 97.70 | --
GPT-4 | SFT + PPO | -- | 8.99 | 95.03 | --
Claude-2 | SFT + PPO | -- | 8.06 | 91.36 | --
GPT-3.5-turbo | SFT + PPO | -- | 7.94 | 89.37 | --

Open-sourced Models based on LLaMA-1-13B
LIMA | SFT | 1K SFT | 4.29 | 41.98 | 59.82
WizardLM-13B | SFT | 70K SFT | 6.35 | 75.31 | 58.96
Vicuna-13B-v1.3 | SFT | 125K SFT | 6.39 | 82.11 | 60.01
Random | SFT | 10K SFT | 6.03 | 71.52 | 60.14
DEITA-LLaMA1-13B-v1.0-sft | SFT | 10K SFT | 6.60 | 78.01 | 64.27

Open-sourced Models based on LLaMA-2-13B
Tulu-2-13B | SFT | 326K SFT | 6.70 | 78.90 | --
Tulu-2-13B+DPO | SFT + DPO | 326K SFT + 60K DPO | 7.00 | 89.50 | --
LLaMA2-13B-Chat | SFT + PPO | -- | 6.65 | 81.09 | --
WizardLM-13B-v1.2 | SFT | >70K SFT | 7.09 | 89.17 | --
Vicuna-13B-v1.5 | SFT | 125K SFT | 6.57 | 78.80 | 61.63
Random | SFT | 10K SFT | 5.78 | 65.19 | 61.32
DEITA-LLaMA2-13B-v1.0-sft | SFT | 10K SFT | 6.79 | 81.09 | 62.71

Open-sourced Models based on Mistral-7B
Mistral-7B-Instruct-v0.1 | -- | -- | 6.84 | 69.65 | 60.45
Zephyr-7B-sft | SFT | 200K SFT | 5.32 | 75.12 | 60.93
Zephyr-7B-β | SFT + DPO | 200K SFT + 60K DPO | 7.34 | 90.60 | 66.36
OpenChat-3.5 | C-RLFT | >> 70K C-RLFT | 7.81 | 88.51 | --
Starling-7B | C-RLFT + APA | >> 70K C-RLFT + 183K APA | 8.09 | 91.99 | --
Random | SFT | 10K SFT | 5.89 | 56.90 | 61.72
DEITA-7B-v1.0-sft (6K) | SFT | 6K SFT | 7.22 | 80.78 | 64.94
DEITA-7B-v1.0-sft (10K) | SFT | 10K SFT | 7.32 | 81.67 | 64.00
DEITA-7B-v1.0 | SFT + DPO | 6K SFT + 10K DPO | 7.55 | 90.06 | 69.86

🚀 Deita Resources

Resource | Link | License

Deita Datasets
deita-6k-v0 | 🤗 HF Repo | MIT License
deita-10k-v0 | 🤗 HF Repo | MIT License
deita-complexity-scorer-data | 🤗 HF Repo | MIT License
deita-quality-scorer-data | 🤗 HF Repo | MIT License

Scorers
deita-complexity-scorer | 🤗 HF Repo | LLaMA License
deita-quality-scorer | 🤗 HF Repo | LLaMA License

Deita Models
DEITA-7B-v1.0-sft | 🤗 HF Repo | Apache-2.0
DEITA-7B-v1.0 | 🤗 HF Repo | Apache-2.0
DEITA-LLaMA2-13B-v1.0-sft | 🤗 HF Repo | LLaMA 2 License
DEITA-LLaMA1-13B-v1.0-sft | 🤗 HF Repo | LLaMA License

πŸƒβ€β™‚οΈ How to start?

Installation

  git clone https://github.com/hkust-nlp/deita.git
  cd deita
  pip install -e .
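
As a quick sanity check that the editable install worked, the import paths used throughout the examples below should resolve:

from deita.selection.scorer import Llama_Scorer  # scorer API used below
from deita.pipeline import Pipeline  # pipeline API used below

print("deita imports OK")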

Data Sample Scoring

If you wish to assess the quality of a response for a single sample, you can follow these steps:

from deita.selection.scorer import Llama_Scorer

model_name_or_path = "hkust-nlp/deita-quality-scorer"

scorer = Llama_Scorer(model_name_or_path)

# example instruction and response to score
input_text = "word to describe UI with helpful tooltips"  # example instruction
output_text = "User-friendly or intuitive UI"  # example response
quality_score = scorer.infer_quality(input_text, output_text)

print(quality_score)
# 2.0230105920381902

Deita also supports vLLM for faster inference. To use vLLM, install it first:

pip install vllm

and set is_vllm = True when initializing the scorer:

scorer = Llama_Scorer(model_name_or_path, is_vllm = True)

To assess other dimensions of data samples, please refer to examples/scoring.
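
For instance, complexity can be scored analogously to the quality example above; a minimal sketch, assuming infer_complexity is the complexity counterpart of infer_quality and takes only the instruction (see examples/scoring for the reference usage):

from deita.selection.scorer import Llama_Scorer

model_name_or_path = "hkust-nlp/deita-complexity-scorer"
scorer = Llama_Scorer(model_name_or_path)

input_text = "word to describe UI with helpful tooltips"  # example instruction
complexity_score = scorer.infer_complexity(input_text)  # assumed counterpart of infer_quality

print(complexity_score)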

Deita Pipelines

You can use Deita pipelines to perform a variety of operations on a dataset with only one line of code plus configurations; an end-to-end sketch follows the list below.

  • Dataset Scoring
from deita.pipeline import Pipeline

pipeline = Pipeline("score_pipeline", 
                    data_path = args.data_path,   # json file with sharegpt format
                    scorer = args.scorer,   # [mistral, llama]
                    scorer_name_or_path = args.scorer_name_or_path,  # scorer name or path e.g. hkust-nlp/deita-complexity-scorer
                    is_vllm = args.is_vllm,  # launch with vllm [True, False]
                    score_type = args.score_type, # [complexity, quality]
                    output_path = args.output_path)  # output path (json format)

pipeline.run()
  • Get Embeddings

We use Huggingface Accelerate to enhance efficiency:

from deita.pipeline import Pipeline

embed_pipeline = Pipeline("embed_pipeline", 
                          data_path = args.data_path,   # json file with sharegpt format
                          output_path = args.output_path,  # output path (pickle format)
                          model_name_or_path = args.model_name_or_path,  # model name or path e.g. mistralai/Mistral-7B-v0.1
                          max_length = args.max_length,
                          use_flash_attention = args.use_flash_attention,  
                          batch_size_per_device = args.batch_size_per_device,
                          conv_template = args.conv_template,
                          only_answer = args.only_answer,
                          random_shuffle = args.random_shuffle,
                          bfloat16 = True
                          )

embed_pipeline.run()

Launch the embedding step with Accelerate, for example:

CUDA_VISIBLE_DEVICES=$GPUIDX accelerate launch \
    --mixed_precision bf16 \
    --num_processes $NUMPROCESS \
    --num_machines 1 \
    examples/pipelines/embed_datasets.py \
    --use_flash_attention true \
    --data_path $DATAPATH \
    --output_path $OUTPUTPATH \
    --batch_size_per_device $BSZ
  • Score-first, Diversity-aware Selection
from deita.pipeline import Pipeline

filter_pipeline = Pipeline("filter_pipeline", 
                          data_path = args.data_path,  # json file with sharegpt format
                          other_data_path = args.other_data_path,  # embedding file path (pickle format)
                          threshold = args.threshold,  # filter threshold, default: 0.9
                          data_size = args.data_size,  # size of the selected data
                          chunk_size = args.chunk_size,  # chunk size for more efficient GPU computation, default: 100000
                          sort_key = args.sort_key,  # default: "complexity_scores,quality_scores"
                          output_path = args.output_path,  # output path (json format)
                          distance_metric = args.distance_metric,  # default: cosine
                          embedding_field = args.embedding_field,  # default: embedding
                          is_compression = args.is_compression,  # default: False
                          device = args.device  # GPU index, default: 0
                          )

filter_pipeline.run()
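
Putting the three pipelines together, here is a minimal end-to-end sketch of scoring, embedding, and selection; the file paths, batch size, and conversation template below are hypothetical values for illustration, and in practice the embed step is launched through Accelerate as shown above:

from deita.pipeline import Pipeline

raw = "data/sharegpt_train.json"  # hypothetical input in sharegpt format
scored = "data/scored.json"  # hypothetical intermediate output
embeddings = "data/embeddings.pkl"  # hypothetical intermediate output
selected = "data/deita_6k.json"  # hypothetical final output

# 1. Attach complexity scores (repeat with score_type = "quality"
#    to add the quality scores the sort_key below expects).
Pipeline("score_pipeline", data_path=raw, scorer="llama",
         scorer_name_or_path="hkust-nlp/deita-complexity-scorer",
         is_vllm=True, score_type="complexity", output_path=scored).run()

# 2. Compute sample embeddings.
Pipeline("embed_pipeline", data_path=scored, output_path=embeddings,
         model_name_or_path="mistralai/Mistral-7B-v0.1", max_length=2048,
         use_flash_attention=True, batch_size_per_device=4,
         conv_template="vicuna_v1.1", only_answer=False,
         random_shuffle=False, bfloat16=True).run()

# 3. Score-first, diversity-aware selection down to 6K samples.
Pipeline("filter_pipeline", data_path=scored, other_data_path=embeddings,
         threshold=0.9, data_size=6000, chunk_size=100000,
         sort_key="complexity_scores,quality_scores", output_path=selected,
         distance_metric="cosine", embedding_field="embedding",
         is_compression=False, device=0).run()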

You can refer to examples/pipelines for more details. Documentation will also be coming soon.

SFT Training

Please refer to examples/train/sft.sh

deepspeed --include localhost:${DEVICES} --master_port 29501 src/deita/alignment/train.py \
    --model_name_or_path ${MODELPATH} \
    --data_path ${DATAPATH} \
    --output_dir ${OUTPUTPATH}/${RUNNAME} \
    --num_train_epochs 6 \
    --per_device_train_batch_size ${BSZPERDEV} \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps ${GRADACC} \
    --eval_steps 50 \
    --save_strategy "no" \
    --save_steps 100 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --do_eval False \
    --evaluation_strategy "no" \
    --model_max_length 2048 \
    --lazy_preprocess True \
    --conv_template "vicuna_v1.1" \
    --mask_user True \
    --report_to "wandb" \
    --run_name ${RUNNAME} \
    --bf16 True \
    --deepspeed src/deita/ds_configs/deepspeed_config_zero2_no_offload.json

DPO Training

Please refer to examples/train/dpo.sh

deepspeed --include localhost:${DEVICES} --master_port 29502 src/deita/alignment/dpo_train.py \
    --model_name_or_path ${MODELPATH} \
    --json_path ${JSONPATH} \
    --data_split ${DATASPLIT} \
    --output_dir ${OUTPUTPATH}/${RUNNAME} \
    --num_train_epochs ${DPOEPOCH} \
    --beta 0.1 \
    --per_device_train_batch_size ${BSZPERDEV} \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps ${GRADACC} \
    --save_global_steps False \
    --eval_steps 50 \
    --save_strategy "no" \
    --save_steps 500 \
    --save_total_limit 1 \
    --learning_rate 5e-7 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type "linear" \
    --logging_steps 1 \
    --do_eval False \
    --evaluation_strategy "no" \
    --model_max_length 2048 \
    --conv_template "vicuna_v1.1" \
    --report_to "wandb" \
    --run_name ${RUNNAME} \
    --bf16 True \
    --gradient_checkpointing True \
    --deepspeed src/deita/ds_configs/stage3_no_offloading_accelerate.json

Evaluation

💪 What's more?

This is the preview version of the Deita project. We will continue to update it, including:

  • Releasing the data selection pipeline with an efficient implementation
  • More automatic data selection strategies
  • CLI interface support
  • Online demo

Citation

If you find the content of this project helpful, please cite our paper as follows:

@inproceedings{liu2024what,
  title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
  author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=BTKAeLqLMw}
}

Acknowledgement

For the training code, we use the code template of FastChat.


Deita's Issues

Questions about the "Pool=50K" in your paper.

Hi, thanks for your work! I have some questions about your experiment in training the complexity scorer.

"Pool=50K" denotes the data selection procedure is conducted in a 50K-sized subset due to the cost of using ChatGPT to annotate the entire pool.

1. Is the data used for "EVOL COMPLEXITY (Pool=50K)" sampled from the 50K subset, while that for "EVOL COMPLEXITY" is sampled from the original data pool?
2. How do you sample the data from the original data pool?
Looking forward to your reply!

Data for Deita's DPO + SFT

May I ask where the preference data used in your dialogue model during the DPO process comes from? Is there an open-source plan for it? Thank you.

Cosine distance computation

Hello,

While the paper and the code both say that cosine distance is used to promote diversity, it seems that the current implementation computes cosine similarity instead of distance:

matrix_norm = matrix / matrix.norm(dim=1)[:, None]
matrix_2_norm = matrix_2 / matrix_2.norm(dim=1)[:, None]
return torch.mm(matrix_norm, matrix_2_norm.t())

If cosine similarity is used, it would enforce data similarity rather than diversity. Any clarifications would be much appreciated!

Best,
Sang
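
For reference, cosine distance is conventionally defined as one minus cosine similarity, so a distance variant of the snippet above would look like this minimal sketch:

import torch

def cosine_distance(matrix: torch.Tensor, matrix_2: torch.Tensor) -> torch.Tensor:
    # Row-normalize, take pairwise dot products (cosine similarity),
    # then convert similarity to distance: d = 1 - sim.
    matrix_norm = matrix / matrix.norm(dim=1)[:, None]
    matrix_2_norm = matrix_2 / matrix_2.norm(dim=1)[:, None]
    return 1 - torch.mm(matrix_norm, matrix_2_norm.t())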

How did you train the complexity & quality scorer

First of all, thank you, and huge congrats on the paper release! Really enjoyed reading it.

I wanted to ask if you can share any details on how you trained your scorers. Was it simple next-token prediction on the collected data samples? 2K each?

Could you please publish the original data pool?

Hi,
First of all, thank you for your work and the great repo!

As stated in the title, could you please provide the original data pool used in your paper, especially $X_{sota}$? I have tried to obtain the dataset following the references in the paper. However, I cannot find versions of the ShareGPT and UltraChat Huggingface datasets that match the statistics stated in the paper. I would greatly appreciate it if you could provide the dataset or explain how to filter the two datasets out of existing Huggingface datasets.

Best regards

Question about which score to ultimately use for the filtering process.

As illustrated in the left part of Figure 1, we then ask ChatGPT to rank and score these 6 samples
(prompt in Appendix E.2), obtaining the complexity scores c corresponding to the instructions. We
emphasize that, distinct from direct scoring, we give ChatGPT all 6 samples within one prompt – these
samples represent different evolution stages of the same original sample and such a scoring scheme
helps ChatGPT capture the small complexity differences among them, which leads to complexity
scores to achieve finer-grained complexity differentiation among samples.

So is the score used for filtering the highest complexity score or the original score? And is the use of Evol meant to make the scores more reliable?

Questions about performance improvement in Open LLM leaderboard

Hi,
First of all, thank you for sharing your wonderful work!

I was searching for efficient ways of mining instructions used in instruction-tuning LLMs.
While reading the manuscript and investigating the open-sourced 6K & 10K datasets you provide, I could not intuitively understand why the SFT (6K) + DPO (10K) training method increases performance on multi-choice question answering tasks such as ARC-Challenge and MMLU.

In the dataset, the instances are composed of conversations between humans and GPT, which don't contain any clues about solving multi-choice QA problems.

Do you have any ideas why it worked?

Has the EVOL process of the instruction dataset been released?

This is very interesting work! Thanks for publishing the deita-complexity-scorer-data and deita-quality-scorer-data datasets.

According to Table 14 and Table 18 in this work (the prompts for ranking and scoring), capturing the small differences among EVOL variants is important. Has the EVOL process of the instruction dataset been released?
I found 9,481 training samples in deita-complexity-scorer-data and 9,276 training samples in deita-quality-scorer-data, but I cannot find the EVOL process for each instruction.

Question 1: Has the EVOL process (the relationship from M=1 to M=5) of the instruction dataset been released?
Question 2: Did deita-complexity-scorer-data and deita-quality-scorer-data undergo Elimination Evolving as described in WizardLM?

Thanks!

Reproducing the MT-Bench scores

Dear Authors,

Thank you for your great work! I'm trying to reproduce the reported MT-Bench scores with the released code and data.

Trying to reproduce:
DEITA-7B-v1.0 (6K) --> mt-bench: 7.22
DEITA-7B-v1.0-sft --> mt-bench: 7.32

Data I used:
hkust-nlp/deita-6k-v0
hkust-nlp/deita-10k-v0

Code I used:
https://github.com/hkust-nlp/deita/blob/main/examples/train/sft.sh

The scores I got for both 6K and 10K are around 7.06 (vs. 7.22 and 7.32). The difference seems larger than regular SFT and MT-Bench evaluation variability.

Any suggestions to resolve the discrepancy would be appreciated.

Thanks!

[Question] Regarding the order bias in sample scoring.

Hello, thank you for your great work!

Regarding the EVOL COMPLEXITY method in the paper, where ChatGPT ranks and scores the complexity of samples, I have recently observed that many LLMs tend to score samples in a descending order from high to low. For example, a sample sequence like ABCD tends to be scored from high to low, and when the order is adjusted (e.g., CDAB), the scoring trend remains similar.

Have you observed a similar phenomenon, and if so, have you made any corresponding adjustments in your experiments?

Scorer models on the hub are 7B, not 13B

Hello, it appears that the scorer models on the hub are 7B models rather than the 13B specified in the model card.

from transformers import AutoModelForCausalLM

model_name = "hkust-nlp/deita-quality-scorer"
model = AutoModelForCausalLM.from_pretrained(model_name)
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
num_params_in_billions = num_params / 1_000_000_000

print(f"Number of parameters in the {model_name} model: {num_params_in_billions} B") #6.73B

Do you have the 13B versions available?

The length of samples

It seems each sample in the Deita dataset consists of many turns and is very long (>10K tokens). Your paper mentions the max input length for SFT is 2048. Does that mean most of the text of each training sample is truncated and discarded?

Some questions about running the scorer with an arbitrary model

Hi, thanks for your great work! I notice that you used ChatGPT for the scorer, but it seems there is no place for us to insert our own token. Does this mean we cannot use this scorer with an arbitrary model?

Moreover, do you think it can be used as an evaluation metric for LLM output? Thanks.
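
In principle, the released scorer can already be run over arbitrary (instruction, response) pairs through the API shown earlier; a minimal sketch of using it as a crude quality metric for model outputs (the pairs below are made up for illustration):

from deita.selection.scorer import Llama_Scorer

scorer = Llama_Scorer("hkust-nlp/deita-quality-scorer")

# hypothetical (instruction, model output) pairs to evaluate
pairs = [
    ("word to describe UI with helpful tooltips", "User-friendly or intuitive UI"),
    ("summarize the water cycle in one sentence", "Water evaporates, condenses, and precipitates."),
]

scores = [scorer.infer_quality(inp, out) for inp, out in pairs]
print(sum(scores) / len(scores))  # average quality score as a rough metric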
