Sorry for the late reply @roshansridhar. I don't have access to the wandb logs anymore, but my coauthor should. @xwinxu, can you paste the plots from the last llama7b+ppo run, as well as the JSON dump of the config?
However, I'm experiencing negligible improvement with the PPO part of the training.
Your charts look pretty good to me. The policy loss is declining over time. The critic loss declines first but is flat afterwards; this is expected because the value network is pretty small (a 2-layer MLP), so its capacity is limited and the irreducible error will be fairly large. Mean reward is going up over time as well. The only suspicious part is the exploding loss/seqratio, but you can probably fix this by setting ++loss.cliprange to a value smaller than 0.5.
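To see why a smaller cliprange tames an exploding sequence ratio, here is a minimal sketch of the standard PPO clipped surrogate loss (the textbook formulation, not necessarily the repo's exact code; the function name and tensor shapes are illustrative):

```python
import torch

def clipped_policy_loss(logprobs, old_logprobs, advantages, cliprange=0.2):
    """Standard PPO clipped surrogate objective (illustrative sketch).

    Clamping the probability ratio to [1 - cliprange, 1 + cliprange]
    bounds how far a single update can move the policy, which is why
    lowering cliprange helps when loss/seqratio explodes.
    """
    ratio = torch.exp(logprobs - old_logprobs)  # pi_new / pi_old per sequence
    unclipped = -advantages * ratio
    clipped = -advantages * torch.clamp(ratio, 1.0 - cliprange, 1.0 + cliprange)
    # Pessimistic bound: take the worse (larger) of the two losses.
    return torch.max(unclipped, clipped).mean()
```

With a large ratio and positive advantage, the clipped term caps the gradient contribution at (1 + cliprange), so shrinking cliprange directly limits how large the effective sequence ratio can get.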
so I adjusted the PPO epochs to 3 in my experiments
IIRC, we did PPO for 1 epoch as well, the same as all the other alignment methods.
Could you elaborate on how and when to decide which checkpoint to use for downstream tasks, especially for the PPO, DPO, and KTO scenarios?
Sure. For KTO, we basically used the same hyperparameter settings recommended in the DPO paper, so no decision needed to be made there. For PPO, hyperparameters were chosen based on whether training was stable, whether rewards went up over time, and whether the GPT-4-judged winrate against the baseline went up over time.
When conducting SFT training, the code calculates train and validation losses using the dataset's train/validation splits. If I use the same dataset for PPO, how can I ensure that I am not inadvertently retraining on the train split?
Doing SFT on the positive examples in a preference/feedback dataset is fairly common practice, as done in the DPO paper. This isn't something to worry about IMO, but you can always do PPO/DPO/KTO without SFT, or do SFT on a separate dataset. All you need to do is change the ++model.load_from field.
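For reference, ++model.load_from is a Hydra-style command-line override. A hypothetical launch sketch (the script name and other flags here are illustrative, not taken from the repo; only the ++model.load_from field comes from the discussion above):

```shell
# Illustrative only: point the alignment run at a checkpoint of your choice
# (e.g., an SFT checkpoint trained on a separate dataset, or none at all).
python train.py loss=kto model=llama7b \
    ++model.load_from=/path/to/your/sft_checkpoint
```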
Based on the plots and results you shared, am I correct in understanding that you had a batch size of 32 and ran 200k steps of SFT training?
bs=32 sounds right. @xwinxu, can you double-check how many SFT steps there were for llama7b on [shp,hh,oasst]?
from halos.