uclaml / sppo
The official implementation of Self-Play Preference Optimization (SPPO)
Home Page: https://uclaml.github.io/SPPO/
License: Apache License 2.0
Hello everyone,
I would like to know whether these scripts can run on a home lab setup with the following specifications:
I noticed that the scripts are configured to use 8 GPUs.
Thank you for your assistance.
Dear authors, may I know how we can train the iterative DPO baseline model using this repo? Is there a convenient way to modify the sppo code?
When using accelerate==0.23.0, as pinned in setup.py, I got the following error:
Accelerator.__init__() got an unexpected keyword argument 'use_seedable_sampler'
Upgrading accelerate to 0.31.0 fixed this error.
step 10:
{'loss': 119743.8516, 'grad_norm': 938286.7284407256, 'learning_rate': 2.0161290322580643e-09, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -128.30323791503906, 'logps/chosen': -178.66146850585938, 'logits/rejected': -0.7681801915168762, 'logits/chosen': -0.792536735534668, 'epoch': 0.0}
step 20:
{'loss': 119688.3056, 'grad_norm': 1090985.982531398, 'learning_rate': 2.0161290322580644e-08, 'rewards/chosen': -8.749030530452728e-05, 'rewards/rejected': 0.00024323315301444381, 'rewards/accuracies': 0.2222222238779068, 'rewards/margins': -0.00033072344376705587, 'logps/rejected': -102.9691390991211, 'logps/chosen': -104.48147583007812, 'logits/rejected': -0.30933287739753723, 'logits/chosen': -0.3230978548526764, 'epoch': 0.0}
step 30:
{'loss': 122734.3, 'grad_norm': 677227.7630694123, 'learning_rate': 4.032258064516129e-08, 'rewards/chosen': -0.00015188578981906176, 'rewards/rejected': 3.675480911624618e-05, 'rewards/accuracies': 0.20000000298023224, 'rewards/margins': -0.00018864059529732913, 'logps/rejected': -132.24008178710938, 'logps/chosen': -116.12632751464844, 'logits/rejected': -0.4473434388637543, 'logits/chosen': -0.4207238554954529, 'epoch': 0.01}
I am surprised by such a huge loss. Is this normal?
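A hedged note on the magnitude, assuming the run uses the paper's squared SPPO objective rather than the standard DPO loss: each example contributes terms of the form (log-ratio − η(P̂ − 1/2))² with η = 1/β. With a small β such as 0.001 and PairRM estimates P̂ that sit a few tenths away from 1/2, the regression targets are in the hundreds, so a freshly initialized model (log-ratios near zero) should report losses on the order of a few hundred squared, i.e. around 1e5. Large raw loss values are therefore not by themselves a sign of divergence; the trend over training matters more than the absolute scale.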
Line 39 in e524519
From my understanding of the code, the score list here is the output from the blender.rank(*, return_scores=True)
which should output, for each response, its average relative score of being better than the other responses. Please correct me if I am wrong.
For example, given three responses {y1, y2, y3}, the first element of the scores output by the blender model (s1, s2, s3) is s1 = P(y1 > y2) + P(y1 > y3), disregarding the constant coefficient, where P is a general preference score function, not a probability. [references from blender code and their paper]
Thus, subtracting two scores, i.e., s1 - s2, is also dependent on the third response y3 as well, which seems a bit different from what is described in the paper.
In summary, I feel it would be more appropriate to use the score output from the blender with just two responses (although I don't think this would make a significant difference in performance), e.g.,
scores = blender.rank([x], [[yj, yi]], return_scores=True)  # scores for (yj, yi)
prb[i][j] = 1 / (1 + np.exp(scores[0][0] - scores[0][1]))  # sigmoid of the pairwise score gap
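To make that concrete, here is a minimal runnable sketch of the pairwise-only variant, assuming llm-blender's public Blender API (Blender(), loadranker, rank); the function and variable names are illustrative, not the repo's:

import numpy as np
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # the ranker used by the repo's pipeline

def pairwise_prob_matrix(x, responses):
    # prb[i][j] estimates P(responses[i] beats responses[j] | prompt x),
    # scoring each pair in isolation so no third response influences it.
    n = len(responses)
    prb = np.full((n, n), 0.5)
    for i in range(n):
        for j in range(i + 1, n):
            scores = blender.rank([x], [[responses[j], responses[i]]], return_scores=True)
            s_j, s_i = scores[0][0], scores[0][1]
            prb[i][j] = 1.0 / (1.0 + np.exp(s_j - s_i))
            prb[j][i] = 1.0 - prb[i][j]
    return prb

Note the trade-off: this issues one ranker call per pair, so it is slower than ranking all candidates jointly, but the resulting probability depends only on the two responses involved, matching the paper's pairwise formulation.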
Thank you for sharing your code and making it open source.
However, I noticed that the training code still uses the DPO trainer, which seems inconsistent with SPPO as described in the paper. Can you explain why the code uses DPO instead of the proposed SPPO objective?
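A hedged note on how the two can coexist, based on the paper rather than a close reading of this repo: a DPO-style trainer already computes the policy and reference log-probabilities for the chosen and rejected responses, so the SPPO objective can be implemented as an alternative loss on those same quantities. A minimal sketch, where the beta value and the hard ±1/2 targets are assumptions (the repo's pipeline may use soft PairRM probabilities instead):

import torch

def sppo_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps, beta=0.001):
    # Squared-error form of SPPO: push the chosen log-ratio toward +1/(2*beta)
    # and the rejected log-ratio toward -1/(2*beta).
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    loss = (chosen_ratio - 0.5 / beta) ** 2 + (rejected_ratio + 0.5 / beta) ** 2
    return loss.mean()

# Example: near-zero log-ratios at initialization give a loss near 2 * 500**2.
loss = sppo_loss(torch.tensor([-104.5]), torch.tensor([-103.0]),
                 torch.tensor([-104.4]), torch.tensor([-103.1]))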
I found that the current repository configuration is not compatible with Gemma-2. The reason might be that the pinned transformers and vllm versions do not fully support Gemma-2. Could you share the package configurations needed to reproduce SPPO-Gemma-2?
Thanks for the great product.
I am so impressed with your research that I have tried it many times.
However, the results with Gemma-2-9B are very different from yours.
The Iter-3 score was even lower than that of the original Gemma-2-9B-it.
My question is: which datasets did you use,
UCLA-AGI/data-mistral-7b-instruct-sppo-iter[x]?
I am aware that these (or similar datasets) were based on UltraFeedback, and that the GitHub code was the same.
Sincerely, Kazuya
Hello everyone!
For those who want to run this algorithm in their own home lab, I am working on adapting this code to run on my setup:
If you're interested in testing and contributing, you can find my progress on my fork of this repository:
https://github.com/kaykyr/SPPO
Once I accomplish this objective, I intend to fully refactor this repository to allow both full-precision training on multi-GPUs and half-precision training on home GPUs with an easy-to-use script.
I appreciate all the contributions to the papers and the original code and will be even more grateful for any contributions from the community.
Thank you!
Hi! Your SPPO project caught my eye. Amazing work on this Python repository! Could you send me more details on Telegram? Also, please review my work and follow me on GitHub @nectariferous. Thanks!
Hi, when I follow the default steps to set up the environment:
pip install vllm
it automatically installs vllm 0.5.0.post1, which requires transformers>=4.40.0.
When installing SPPO (which requires transformers==4.36.2), I got the following errors:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
vllm 0.5.0.post1 requires tokenizers>=0.19.1, but you have tokenizers 0.15.2 which is incompatible.
vllm 0.5.0.post1 requires transformers>=4.40.0, but you have transformers 4.36.2 which is incompatible.
Should I downgrade the vllm version or ignore this error? How can I fix it?
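A hedged note on one way out, with version numbers as assumptions to verify against each package's requirements: either keep generation and training in two separate virtual environments so the vllm and SPPO pins never meet, or pin an older vllm whose transformers requirement matches the repo's pin (for example, pip install vllm==0.2.7, which required transformers>=4.36.0). The resolver message itself is only a warning, but actually mixing transformers 4.36.2 with vllm 0.5.0.post1 is likely to fail at runtime.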
Great work!
I commented out all the push_to_hub calls in the code. Is the synthetic_data_llama-3-8b-instruct-sppo-iter3_score dataset generated by PairRM?
[rank4]: Traceback (most recent call last):
[rank4]: File "/training-data/huangxing/software/SPPO/sppo/run_dpo.py", line 249, in <module>
[rank4]: main()
[rank4]: File "/training-data/huangxing/software/SPPO/sppo/run_dpo.py", line 43, in main
[rank4]: main_inner(model_args, data_args, training_args)
[rank4]: File "/training-data/huangxing/software/SPPO/sppo/run_dpo.py", line 78, in main_inner
[rank4]: raw_datasets = get_datasets(data_args, splits=["train"])
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/training-data/huangxing/software/SPPO/sppo/alignment/data.py", line 164, in get_datasets
[rank4]: raw_datasets = mix_datasets(dataset_mixer, splits=splits, shuffle=shuffle)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/training-data/huangxing/software/SPPO/sppo/alignment/data.py", line 189, in mix_datasets
[rank4]: dataset = load_dataset(ds, split=split)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/training-data/software/miniconda3/envs/mcts/lib/python3.11/site-packages/datasets/load.py", line 2129, in load_dataset
[rank4]: builder_instance = load_dataset_builder(
[rank4]: ^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/training-data/software/miniconda3/envs/mcts/lib/python3.11/site-packages/datasets/load.py", line 1815, in load_dataset_builder
[rank4]: dataset_module = dataset_module_factory(
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/training-data/software/miniconda3/envs/mcts/lib/python3.11/site-packages/datasets/load.py", line 1512, in dataset_module_factory
[rank4]: raise e1 from None
[rank4]: File "/training-data/software/miniconda3/envs/mcts/lib/python3.11/site-packages/datasets/load.py", line 1468, in dataset_module_factory
[rank4]: raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({type(e).__name__})")
[rank4]: ConnectionError: Couldn't reach 'synthetic_data_llama-3-8b-instruct-sppo-iter3_score' on the Hub (ConnectionError)
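A hedged guess at the cause, given the commented-out push_to_hub calls: the scored dataset now only exists locally, but the dataset_mixer entry is still a bare name that load_dataset resolves against the Hub. If the scoring step saved the data to disk, pointing the loader at the local copy avoids the Hub lookup (the paths below are assumptions):

from datasets import load_dataset, load_from_disk

# If the scoring step used save_to_disk:
ds = load_from_disk("synthetic_data_llama-3-8b-instruct-sppo-iter3_score")

# Or, if it wrote parquet/json shards:
ds = load_dataset(
    "parquet",
    data_files="synthetic_data_llama-3-8b-instruct-sppo-iter3_score/*.parquet",
    split="train",
)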
I'm trying to replicate the Llama-3-8B setup with one of our custom finetunes and stumbled across some questions:
The ranking model gets called with a batch size of 1, and increasing it didn't seem to make the ranking any faster. With my current setup, the ranking takes longer than the actual training. Is there a way to speed up the ranking part of the pipeline?
You mention in the paper that you train for 18 epochs per iteration; usually instruction tuning is done with a single epoch, since models can overfit on the data pretty quickly. Did you really train each iteration for 18 epochs, and didn't that lead to massive overfitting?
Could you provide some loss numbers for iterations 1/2/3, just so people have something to compare their runs against? The loss seems very high, but I'm not sure how the numbers are supposed to look, since SPPO uses a custom loss function.
Overall, a pretty nice pipeline that you built with the iterative generation -> ranking -> training setup.
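A hedged aside on the ranking-speed question above: llm-blender's rank call accepts all prompts at once, and (in the versions I have seen; worth verifying against your install) a batch_size argument that controls the ranker's internal batching. One thing to try is ranking the whole dataset in a single call instead of looping prompt by prompt:

import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

prompts = ["What is SPPO?", "Summarize DPO in one line."]   # N prompts
candidates = [["answer a", "answer b", "answer c"],          # k responses per prompt
              ["answer x", "answer y", "answer z"]]

scores = blender.rank(prompts, candidates, return_scores=True, batch_size=16)
# scores: one row of k candidate scores per prompt.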
How is the ShareGPT format handled with this workflow? I'm currently developing a dataset that could greatly benefit from this technique. However, I hate training on "User" and "Assistant" tokens; it goes against my intentions when working with language models. With Axolotl, there's a way to change the header IDs for ShareGPT datasets. I was wondering whether there is something similar I could do here, or perhaps I could just do some data processing to change the format...
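A hedged sketch of the data-processing route mentioned above: since the pipeline formats examples with the tokenizer's chat template, one option is to pre-convert ShareGPT records into plain role/content messages and let apply_chat_template emit whatever header tokens the target model uses (the ShareGPT field names and the model checkpoint are assumptions):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(example):
    # ShareGPT stores turns as {"from": ..., "value": ...}; map them to the
    # role/content schema that apply_chat_template expects.
    return [{"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in example["conversations"]]

record = {"conversations": [{"from": "human", "value": "Hello!"}]}
messages = sharegpt_to_messages(record)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)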
Hey guys! For anyone interested, I recently submitted a pull request implementing SPPO in the Axolotl trainer; you can follow the pull request here:
axolotl-ai-cloud/axolotl#1735
Original SPPO implementation fork:
https://github.com/kaykyr/axolotl
See the examples/llama3/sppo-qlora-8b.yml config file for an example of how to train with SPPO.
I'm running the Llama-3-Instruct-8B-SPPO-Iter3 model locally and am very impressed by the improved quality from the original model. I can't help but wonder what the results would be if this finetuning process were run on larger models.
Is it possible to run the code on these larger models, or are the smaller versions too different from their larger counterparts, requiring a rework of the training scripts?
Thank you for what you have contributed, this is great stuff!
Imagine how much better these models could perform with SPPO.