Comments (1)

kawine commented on July 28, 2024

sorry for the late reply, @roshansridhar. I don't have access to the wandb logs anymore, but my coauthor should -- @xwinxu, can you paste the plots from the last llama7b+ppo run, as well as the JSON dump of the config?

> However, I'm experiencing negligible improvement with the PPO part of the training.

your charts look pretty good to me. the policy loss is declining over time. the critic loss declines at first but is flat afterwards -- this is expected because the value network is pretty small (a 2-layer MLP), so its capacity is limited and the irreducible error will be pretty large. mean reward is going up over time as well. the only suspicious part is the exploding loss/seqratio, but you can probably fix that by setting ++loss.cliprange to a value smaller than 0.5.
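for reference, a minimal sketch of how that override could be passed on the command line, assuming a Hydra-style launcher; the script name, config groups, and the concrete value are placeholders -- only the ++loss.cliprange key comes from the comment above:

```bash
# hypothetical launch command -- adjust the script and config names to your setup.
# tightening the PPO clip range bounds how far the new policy's sequence-level
# probability ratio can move away from the old policy in a single update,
# which usually tames an exploding loss/seqratio.
python train.py loss=ppo model=llama7b \
    ++loss.cliprange=0.2   # example value; anything well below 0.5, e.g. 0.1-0.2
```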

> so I adjusted the PPO epochs to 3 in my experiments

IIRC, we did PPO for 1 epoch as well, the same as all the other alignment methods.

> could you elaborate on how and when to decide which checkpoint to use for downstream tasks, especially for PPO, DPO, and KTO scenarios?

sure. for KTO, we basically used the same hyperparam settings recommended in the DPO paper, so no decision needed to be made there. for PPO, hyperparams were chosen based on whether training was stable, rewards went up over time, and the GPT-4-judged winrate against the baseline went up over time.

> When conducting sft training, it calculates train and validation losses using the train dataset split. If I use the same dataset for ppo, how can I ensure that I am not inadvertently retraining on the train split?

doing sft on the positive examples in a preference/feedback dataset is fairly common practice, as done in the DPO paper. this isn't something to worry about IMO, but you can always just do ppo/dpo/kto without SFT or just do SFT on a separate dataset. all you need to do is change the ++model.load_from field.
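as a rough illustration of that last point, the two variants might look something like the following; the script name, config groups, and checkpoint path are placeholders -- only the ++model.load_from key is from the comment:

```bash
# option A (hypothetical): run the alignment method (here KTO) straight from the
# base pretrained model, i.e. skip SFT by not pointing load_from at an SFT checkpoint.
python train.py loss=kto model=llama7b

# option B (hypothetical): do SFT on a separate dataset first, then point the
# alignment run at that checkpoint via the load_from field.
python train.py loss=kto model=llama7b \
    ++model.load_from=/path/to/separate_sft_checkpoint
```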

> Based on your plots and results shared, am I correct in understanding that you had a batch size of 32 and conducted 200k steps of sft training,

bs=32 sounds right. @xwinxu can you double-check how many SFT steps there were for llama7b on [shp,hh,oasst]?
