Comments (1)

kawine commented on July 28, 2024

sorry for the late reply, @roshansridhar. I don't have access to the wandb logs anymore, but my coauthor should -- @xwinxu, can you paste the plots from the last llama7b+ppo run, as well as the JSON dump of the config?

> However, I'm experiencing negligible improvement with the PPO part of the training.

your charts look pretty good to me. the policy loss is declining over time. the critic loss declines at first but is flat afterwards -- this is expected because the value network is pretty small (a 2-layer MLP), so its capacity is limited and the irreducible error will be pretty large. mean reward is going up over time as well. the only suspicious part is the exploding loss/seqratio, but you can probably fix that by setting ++loss.cliprange to a value smaller than 0.5.
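for reference, a minimal sketch of how that override could be passed on the command line, assuming a Hydra-style launcher; the script name, config groups, and the concrete value are placeholders -- only the ++loss.cliprange key comes from the comment above:

```bash
# hypothetical launch command -- adjust the script and config names to your setup.
# tightening the PPO clip range bounds how far the new policy's sequence-level
# probability ratio can move away from the old policy in a single update,
# which usually tames an exploding loss/seqratio.
python train.py loss=ppo model=llama7b \
    ++loss.cliprange=0.2   # example value; anything well below 0.5, e.g. 0.1-0.2
```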

> so I adjusted the PPO epochs to 3 in my experiments

IIRC, we did PPO for 1 epoch as well, the same as all the other alignment methods.

> could you elaborate on how and when to decide which checkpoint to use for downstream tasks, especially for PPO, DPO, and KTO scenarios?

sure. for KTO, we basically used the same hyperparam settings recommended in the DPO paper, so no decision needed to be made there. for PPO, hyperparams were chosen based on whether training was stable, rewards went up over time, and the GPT-4-judged winrate against the baseline went up over time.

> When conducting sft training, it calculates train and validation losses using the train dataset split. If I use the same dataset for ppo, how can I ensure that I am not inadvertently retraining on the train split?

doing sft on the positive examples in a preference/feedback dataset is fairly common practice, as done in the DPO paper. this isn't something to worry about IMO, but you can always just do ppo/dpo/kto without SFT or just do SFT on a separate dataset. all you need to do is change the ++model.load_from field.
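as a rough illustration of that last point, the two variants might look something like the following; the script name, config groups, and checkpoint path are placeholders -- only the ++model.load_from key is from the comment:

```bash
# option A (hypothetical): run the alignment method (here KTO) straight from the
# base pretrained model, i.e. skip SFT by not pointing load_from at an SFT checkpoint.
python train.py loss=kto model=llama7b

# option B (hypothetical): do SFT on a separate dataset first, then point the
# alignment run at that checkpoint via the load_from field.
python train.py loss=kto model=llama7b \
    ++model.load_from=/path/to/separate_sft_checkpoint
```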

> Based on your plots and results shared, am I correct in understanding that you had a batch size of 32 and conducted 200k steps of sft training,

bs=32 sounds right. @xwinxu can you double-check how many SFT steps there were for llama7b on [shp,hh,oasst]?
