
Comments (3)

ejmichaud commented on August 17, 2024

Got it! This all makes sense. I saw the discussion about this on the Discord, and I see that you've changed the model names on HuggingFace too. Seems reasonable to do this for the long term!


StellaAthena commented on August 17, 2024

@ejmichaud One other thing I forgot to point out: the reason you got numbers matching OPT when counting one, but not both, of the embed/unembed matrices is that Neel Nanda and some others told us that the standard practice of tying the weights of the two matrices is deleterious for interpretability research, so our models keep them as separate (untied) matrices. The architecture and the number of learnable parameters (the typical number to plot on the x-axis in scaling-laws work) are the same as the corresponding OPT models when they exist.
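To make the effect of untying concrete, here's a rough back-of-the-envelope sketch. The 12·L·d² non-embedding approximation and the OPT-125M-like config below are assumptions for illustration, not the exact GPT-NeoX accounting:

```python
# Rough parameter-count sketch: a decoder-only transformer has roughly
# 12 * L * d^2 non-embedding parameters (attention + MLP blocks), plus
# V * d for the input embedding and, when the unembedding is NOT tied
# to it, another V * d for the output head.

def approx_params(layers: int, d_model: int, vocab: int, tied: bool) -> int:
    transformer = 12 * layers * d_model ** 2       # attention + MLP blocks
    embedding = vocab * d_model                    # input embedding matrix
    unembedding = 0 if tied else vocab * d_model   # separate LM head if untied
    return transformer + embedding + unembedding

# Example with an OPT-125M-like config (assumed: 12 layers, d_model=768, ~50k vocab).
for tied in (True, False):
    n = approx_params(layers=12, d_model=768, vocab=50272, tied=tied)
    print(f"tied={tied}: ~{n / 1e6:.0f}M parameters")
```

Counting a single shared matrix lands you near the 125M you'd expect from OPT; counting both untied matrices adds another ~40M on top.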

We're also going to train a real 13B parameter model this week.


StellaAthena commented on August 17, 2024

@ejmichaud Thank you for raising this as an issue! I was able to track down the cause of each number and we can discuss what makes sense to do going forwards.

Firstly, some of our configs come from the GPT-3 and OPT papers. Specifically, all models other than 19M, 800M, and 13B use the same configs as models in those papers. We used the same nomenclature as those papers, which is to use the total number of parameters and round to a nice number. Based on personal communication with OpenAI employees, I know that the decision to do it this way wasn't considered particularly carefully, as for the largest models it barely makes any difference (as your calculations show). The paper's focus was on large models, and even for the 1.3B model the embedding parameters are only ~10% of the total, a fraction that goes down from there.
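To illustrate that shrinking fraction, here's a quick sketch using the published GPT-3/OPT layer counts and widths; the 12·L·d² non-embedding approximation and the ~50k vocabulary are assumptions for illustration:

```python
# Fraction of total parameters sitting in the embedding matrix for a few
# GPT-3/OPT-style configs, using the rough 12 * L * d^2 approximation for
# the non-embedding parameters.

VOCAB = 50257  # GPT-2/GPT-3 BPE vocabulary size (assumed here)

configs = {      # name: (layers, d_model)
    "125M": (12, 768),
    "1.3B": (24, 2048),
    "2.7B": (32, 2560),
    "6.7B": (32, 4096),
}

for name, (layers, d_model) in configs.items():
    non_embed = 12 * layers * d_model ** 2
    embed = VOCAB * d_model
    total = non_embed + embed
    print(f"{name}: embeddings are {embed / total:.1%} of ~{total / 1e9:.2f}B total")
```

By this rough count the embedding matrix is about a third of the 125M model, under a tenth of the 1.3B model, and the fraction keeps falling from there.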

The 19M and 800M models have custom config files that we created ourselves. We didn't think about the fact that GPT-3 and OPT used the aforementioned naming convention and instead named them by the number of trainable parameters, as that seemed more natural. The 13B model is the real problem: it was supposed to be the same as the 13B model in OPT and GPT-3, but it appears a transcription error was made at some point and we used 36 layers instead of 40. If you redo your calculations using 40 layers, you do in fact get 13B parameters.
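As a sanity check on the layer count, here's the same rough approximation applied to the 13B config with 36 versus 40 layers; the d_model of 5120 (the OPT-13B width) and the ~50k vocabulary are assumptions for illustration:

```python
# 13B sanity check: with d_model = 5120 and the rough 12 * L * d^2
# approximation, 40 layers lands near 13B parameters while 36 layers
# falls more than a billion parameters short.

D_MODEL = 5120
VOCAB = 50257  # assumed ~50k vocabulary

for layers in (36, 40):
    non_embed = 12 * layers * D_MODEL ** 2
    embeddings = 2 * VOCAB * D_MODEL   # untied embed + unembed matrices
    print(f"{layers} layers -> ~{(non_embed + embeddings) / 1e9:.1f}B parameters")
```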

The question is, what should we do going forward? We need to correct the 13B model for sure, but we should also choose a consistent naming convention. Since our models do in fact match many widely discussed models, it would be useful for them to carry the same names. The 19M and 125M models are the only ones where the distinction between embedding and non-embedding parameters is really that big a deal, I think, but I also don't like how the way this is currently presented in the literature regularly confuses people (including people who should know better, like us). We could use descriptive names instead of numeric ones, but I really hate that practice and the amount of work it takes to look up the sizes of models like T5 and BERT.

