uclaml / sppo
The official implementation of Self-Play Preference Optimization (SPPO)
Home Page: https://uclaml.github.io/SPPO/
License: Apache License 2.0
Hello everyone,
I would like to know whether these scripts can run on a home lab setup with the following specifications:
I noticed that the scripts are configured to use 8 GPUs.
Thank you for your assistance.
Dear authors, may I know how we can train the iterative DPO baseline model using this repo? Is there a convenient way to modify the sppo code?
When using accelerate==0.23.0, as pinned in setup.py, I got the following error:
Accelerator.__init__() got an unexpected keyword argument 'use_seedable_sampler'
Upgrading accelerate to 0.31.0 fixed this error.
step 10:
{'loss': 119743.8516, 'grad_norm': 938286.7284407256, 'learning_rate': 2.0161290322580643e-09, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -128.30323791503906, 'logps/chosen': -178.66146850585938, 'logits/rejected': -0.7681801915168762, 'logits/chosen': -0.792536735534668, 'epoch': 0.0}
step 20:
{'loss': 119688.3056, 'grad_norm': 1090985.982531398, 'learning_rate': 2.0161290322580644e-08, 'rewards/chosen': -8.749030530452728e-05, 'rewards/rejected': 0.00024323315301444381, 'rewards/accuracies': 0.2222222238779068, 'rewards/margins': -0.00033072344376705587, 'logps/rejected': -102.9691390991211, 'logps/chosen': -104.48147583007812, 'logits/rejected': -0.30933287739753723, 'logits/chosen': -0.3230978548526764, 'epoch': 0.0}
step 30:
{'loss': 122734.3, 'grad_norm': 677227.7630694123, 'learning_rate': 4.032258064516129e-08, 'rewards/chosen': -0.00015188578981906176, 'rewards/rejected': 3.675480911624618e-05, 'rewards/accuracies': 0.20000000298023224, 'rewards/margins': -0.00018864059529732913, 'logps/rejected': -132.24008178710938, 'logps/chosen': -116.12632751464844, 'logits/rejected': -0.4473434388637543, 'logits/chosen': -0.4207238554954529, 'epoch': 0.01}
I am surprised by such a huge loss. Is this normal?
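A hedged note on the magnitude, assuming the run uses the paper's squared SPPO objective rather than the standard DPO loss: each example contributes terms of the form (log-ratio − η(P̂ − 1/2))² with η = 1/β. With a small β such as 0.001 and PairRM estimates P̂ that sit a few tenths away from 1/2, the regression targets are in the hundreds, so a freshly initialized model (log-ratios near zero) should report losses on the order of a few hundred squared, i.e. around 1e5. Large raw loss values are therefore not by themselves a sign of divergence; the trend over training matters more than the absolute scale.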
Line 39 in e524519
From my understanding of the code, the score list here is the output from the blender.rank(*, return_scores=True)
which should output, for each response, its average relative score of being better than the other responses. Please correct me if I am wrong.
For example, given three responses {y1, y2, y3}, the first element of the scores output by the blender model (s1, s2, s3) is s1 = P(y1 > y2) + P(y1 > y3), disregarding the constant coefficient, where P is a general preference score function, not a probability. [references from blender code and their paper]
Thus, subtracting two scores, i.e., s1 - s2, is also dependent on the third response y3 as well, which seems a bit different from what is described in the paper.
In summary, I feel it would be more appropriate to use the score output from the blender with just two responses (although I don't think this would make a significant difference in performance), e.g.,
scores = blender.rank([x], [[yj, yi]], return_scores=True)  # scores for (yj, yi)
prb[i][j] = 1 / (1 + np.exp(scores[0][0] - scores[0][1]))  # sigmoid of the pairwise score gap
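To make that concrete, here is a minimal runnable sketch of the pairwise-only variant, assuming llm-blender's public Blender API (Blender(), loadranker, rank); the function and variable names are illustrative, not the repo's:

import numpy as np
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # the ranker used by the repo's pipeline

def pairwise_prob_matrix(x, responses):
    # prb[i][j] estimates P(responses[i] beats responses[j] | prompt x),
    # scoring each pair in isolation so no third response influences it.
    n = len(responses)
    prb = np.full((n, n), 0.5)
    for i in range(n):
        for j in range(i + 1, n):
            scores = blender.rank([x], [[responses[j], responses[i]]], return_scores=True)
            s_j, s_i = scores[0][0], scores[0][1]
            prb[i][j] = 1.0 / (1.0 + np.exp(s_j - s_i))
            prb[j][i] = 1.0 - prb[i][j]
    return prb

Note the trade-off: this issues one ranker call per pair, so it is slower than ranking all candidates jointly, but the resulting probability depends only on the two responses involved, matching the paper's pairwise formulation.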
Thank you for sharing your code and making it open source.
However, I noticed that the training code still uses the DPO trainer, which seems inconsistent with SPPO as described in the paper. Can you explain why the code uses DPO instead of the proposed SPPO objective?
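A hedged note on how the two can coexist, based on the paper rather than a close reading of this repo: a DPO-style trainer already computes the policy and reference log-probabilities for the chosen and rejected responses, so the SPPO objective can be implemented as an alternative loss on those same quantities. A minimal sketch, where the beta value and the hard ±1/2 targets are assumptions (the repo's pipeline may use soft PairRM probabilities instead):

import torch

def sppo_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps, beta=0.001):
    # Squared-error form of SPPO: push the chosen log-ratio toward +1/(2*beta)
    # and the rejected log-ratio toward -1/(2*beta).
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    loss = (chosen_ratio - 0.5 / beta) ** 2 + (rejected_ratio + 0.5 / beta) ** 2
    return loss.mean()

# Example: near-zero log-ratios at initialization give a loss near 2 * 500**2.
loss = sppo_loss(torch.tensor([-104.5]), torch.tensor([-103.0]),
                 torch.tensor([-104.4]), torch.tensor([-103.1]))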
I found that the current repository configuration is not compatible with Gemma-2. The reason might be that the pinned transformers and vllm versions do not fully support Gemma-2. Could you share the package configurations needed to reproduce SPPO-Gemma-2?
Thanks for the great product.
I am so impressed with your research that I have tried it many times.
However, the results with Gemma-2-9B are very different from yours.
The Iter-3 score was even lower than that of the original Gemma-2-9B-it.
My question is: which datasets did you use,
UCLA-AGI/data-mistral-7b-instruct-sppo-iter[x]?
I am aware that these (or similar datasets) were based on UltraFeedback, and that the GitHub code was the same.
Sincerely, Kazuya
Hello everyone!
For those who want to run this algorithm in their own home lab, I am working on adapting this code to run on my setup:
If you're interested in testing and contributing, you can find my progress on my fork of this repository:
https://github.com/kaykyr/SPPO
Once I accomplish this objective, I intend to fully refactor this repository to allow both full-precision training on multi-GPUs and half-precision training on home GPUs with an easy-to-use script.
I appreciate all the contributions to the papers and the original code and will be even more grateful for any contributions from the community.
Thank you!
Hi! Your SPPO project caught my eye. Amazing work on this Python repository! Could you send me more details on Telegram? Also, please review my work and follow me on GitHub @nectariferous. Thanks!
Hi, when I follow the default steps to set up the environment:
pip install vllm
it automatically installs vllm 0.5.0.post1, which requires transformers>=4.40.0.
When installing SPPO (which requires transformers==4.36.2), I got the following errors:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
vllm 0.5.0.post1 requires tokenizers>=0.19.1, but you have tokenizers 0.15.2 which is incompatible.
vllm 0.5.0.post1 requires transformers>=4.40.0, but you have transformers 4.36.2 which is incompatible.
Should I downgrade the vllm version or ignore this error? How can I fix it?
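A hedged note on one way out, with version numbers as assumptions to verify against each package's requirements: either keep generation and training in two separate virtual environments so the vllm and SPPO pins never meet, or pin an older vllm whose transformers requirement matches the repo's pin (for example, pip install vllm==0.2.7, which required transformers>=4.36.0). The resolver message itself is only a warning, but actually mixing transformers 4.36.2 with vllm 0.5.0.post1 is likely to fail at runtime.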
Great work!
I commented out all the push_to_hub calls in the code. Is the synthetic_data_llama-3-8b-instruct-sppo-iter3_score dataset generated by PairRM?
[rank4]: Traceback (most recent call last):
[rank4]: File "/training-data/huangxing/software/SPPO/sppo/run_dpo.py", line 249, in <module>
[rank4]: main()
[rank4]: File "/training-data/huangxing/software/SPPO/sppo/run_dpo.py", line 43, in main
[rank4]: main_inner(model_args, data_args, training_args)
[rank4]: File "/training-data/huangxing/software/SPPO/sppo/run_dpo.py", line 78, in main_inner
[rank4]: raw_datasets = get_datasets(data_args, splits=["train"])
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/training-data/huangxing/software/SPPO/sppo/alignment/data.py", line 164, in get_datasets
[rank4]: raw_datasets = mix_datasets(dataset_mixer, splits=splits, shuffle=shuffle)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/training-data/huangxing/software/SPPO/sppo/alignment/data.py", line 189, in mix_datasets
[rank4]: dataset = load_dataset(ds, split=split)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/training-data/software/miniconda3/envs/mcts/lib/python3.11/site-packages/datasets/load.py", line 2129, in load_dataset
[rank4]: builder_instance = load_dataset_builder(
[rank4]: ^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/training-data/software/miniconda3/envs/mcts/lib/python3.11/site-packages/datasets/load.py", line 1815, in load_dataset_builder
[rank4]: dataset_module = dataset_module_factory(
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/training-data/software/miniconda3/envs/mcts/lib/python3.11/site-packages/datasets/load.py", line 1512, in dataset_module_factory
[rank4]: raise e1 from None
[rank4]: File "/training-data/software/miniconda3/envs/mcts/lib/python3.11/site-packages/datasets/load.py", line 1468, in dataset_module_factory
[rank4]: raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({type(e).__name__})")
[rank4]: ConnectionError: Couldn't reach 'synthetic_data_llama-3-8b-instruct-sppo-iter3_score' on the Hub (ConnectionError)
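A hedged guess at the cause, given the commented-out push_to_hub calls: the scored dataset now only exists locally, but the dataset_mixer entry is still a bare name that load_dataset resolves against the Hub. If the scoring step saved the data to disk, pointing the loader at the local copy avoids the Hub lookup (the paths below are assumptions):

from datasets import load_dataset, load_from_disk

# If the scoring step used save_to_disk:
ds = load_from_disk("synthetic_data_llama-3-8b-instruct-sppo-iter3_score")

# Or, if it wrote parquet/json shards:
ds = load_dataset(
    "parquet",
    data_files="synthetic_data_llama-3-8b-instruct-sppo-iter3_score/*.parquet",
    split="train",
)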
I'm trying to replicate the Llama-3-8B setup with one of our custom finetunes and stumbled across some questions:
The ranking model gets called with a batch size of 1, and increasing it didn't seem to make the ranking any faster. With my current setup, the ranking takes longer than the actual training. Is there a way to speed up the ranking part of the pipeline?
You mention in the paper that you train for 18 epochs per iteration; usually instruction tuning is done with a single epoch, since models can overfit on the data pretty quickly. Did you really train each iteration for 18 epochs, and didn't that lead to massive overfitting?
Could you provide some loss numbers for iterations 1/2/3, just so people have something to compare their runs against? The loss seems very high, but I'm not sure how the numbers are supposed to look, since SPPO uses a custom loss function.
Overall, a pretty nice pipeline that you built with the iterative generation -> ranking -> training setup.
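A hedged aside on the ranking-speed question above: llm-blender's rank call accepts all prompts at once, and (in the versions I have seen; worth verifying against your install) a batch_size argument that controls the ranker's internal batching. One thing to try is ranking the whole dataset in a single call instead of looping prompt by prompt:

import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

prompts = ["What is SPPO?", "Summarize DPO in one line."]   # N prompts
candidates = [["answer a", "answer b", "answer c"],          # k responses per prompt
              ["answer x", "answer y", "answer z"]]

scores = blender.rank(prompts, candidates, return_scores=True, batch_size=16)
# scores: one row of k candidate scores per prompt.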
How is the ShareGPT format handled with this workflow? I'm currently developing a dataset that could greatly benefit from this technique. However, I hate training on "User" and "Assistant" tokens; it goes against my intentions when working with language models. With Axolotl, there's a way to change the header IDs for ShareGPT datasets. I was wondering whether there is something similar I could do here, or perhaps I could just do some data processing to change the format...
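A hedged sketch of the data-processing route mentioned above: since the pipeline formats examples with the tokenizer's chat template, one option is to pre-convert ShareGPT records into plain role/content messages and let apply_chat_template emit whatever header tokens the target model uses (the ShareGPT field names and the model checkpoint are assumptions):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(example):
    # ShareGPT stores turns as {"from": ..., "value": ...}; map them to the
    # role/content schema that apply_chat_template expects.
    return [{"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in example["conversations"]]

record = {"conversations": [{"from": "human", "value": "Hello!"}]}
messages = sharegpt_to_messages(record)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)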
Hey guys! For anyone interested, I recently submitted a pull request implementing SPPO in the Axolotl trainer; you can follow the pull request here:
axolotl-ai-cloud/axolotl#1735
Original SPPO implementation fork:
https://github.com/kaykyr/axolotl
See the examples/llama3/sppo-qlora-8b.yml config file for an example of how to train with SPPO.
I'm running the Llama-3-Instruct-8B-SPPO-Iter3 model locally and am very impressed by the improved quality from the original model. I can't help but wonder what the results would be if this finetuning process were run on larger models.
Is it possible to run the code on these larger models, or are the smaller versions too different from their larger counterparts, requiring a rework of the training scripts?
Thank you for what you have contributed, this is great stuff!
Imagine how much better these models could perform with SPPO.