
Comments (3)

ejmichaud commented on August 17, 2024

Got it! This all makes sense. I saw the discussion about this on the Discord, and I see that you've changed the model names on HuggingFace too. Seems reasonable to do this for the long term!


StellaAthena commented on August 17, 2024

@ejmichaud One other thing I forgot to point out: the reason you got numbers matching OPT when counting one, but not both, of the embed/unembed matrices is that Neel Nanda and some others told us that the standard practice of tying the weights of the two matrices is deleterious for interpretability research, so our models keep them as separate (untied) matrices. The architecture and the number of learnable parameters (the typical number to plot on the x-axis in scaling-laws work) are the same as the corresponding OPT models when they exist.
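To make the effect of untying concrete, here's a rough back-of-the-envelope sketch. The 12·L·d² non-embedding approximation and the OPT-125M-like config below are assumptions for illustration, not the exact GPT-NeoX accounting:

```python
# Rough parameter-count sketch: a decoder-only transformer has roughly
# 12 * L * d^2 non-embedding parameters (attention + MLP blocks), plus
# V * d for the input embedding and, when the unembedding is NOT tied
# to it, another V * d for the output head.

def approx_params(layers: int, d_model: int, vocab: int, tied: bool) -> int:
    transformer = 12 * layers * d_model ** 2       # attention + MLP blocks
    embedding = vocab * d_model                    # input embedding matrix
    unembedding = 0 if tied else vocab * d_model   # separate LM head if untied
    return transformer + embedding + unembedding

# Example with an OPT-125M-like config (assumed: 12 layers, d_model=768, ~50k vocab).
for tied in (True, False):
    n = approx_params(layers=12, d_model=768, vocab=50272, tied=tied)
    print(f"tied={tied}: ~{n / 1e6:.0f}M parameters")
```

Counting a single shared matrix lands you near the 125M you'd expect from OPT; counting both untied matrices adds another ~40M on top.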

We're also going to train a real 13B parameter model this week.


StellaAthena commented on August 17, 2024

@ejmichaud Thank you for raising this as an issue! I was able to track down the cause of each number and we can discuss what makes sense to do going forwards.

Firstly, some of our configs come from the GPT-3 and OPT papers. Specifically, all models other than 19M, 800M, and 13B use the same configs as models in those papers. We used the same nomenclature as those papers, which is to use the total number of parameters and round to a nice number. Based on personal communication with OpenAI employees, I know that the decision to do it this way wasn't considered particularly carefully, as for the largest models it barely makes any difference (as your calculations show). The paper's focus was on large models, and even for the 1.3B model the embedding parameters are only ~10% of the total, a fraction that goes down from there.
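To illustrate that shrinking fraction, here's a quick sketch using the published GPT-3/OPT layer counts and widths; the 12·L·d² non-embedding approximation and the ~50k vocabulary are assumptions for illustration:

```python
# Fraction of total parameters sitting in the embedding matrix for a few
# GPT-3/OPT-style configs, using the rough 12 * L * d^2 approximation for
# the non-embedding parameters.

VOCAB = 50257  # GPT-2/GPT-3 BPE vocabulary size (assumed here)

configs = {      # name: (layers, d_model)
    "125M": (12, 768),
    "1.3B": (24, 2048),
    "2.7B": (32, 2560),
    "6.7B": (32, 4096),
}

for name, (layers, d_model) in configs.items():
    non_embed = 12 * layers * d_model ** 2
    embed = VOCAB * d_model
    total = non_embed + embed
    print(f"{name}: embeddings are {embed / total:.1%} of ~{total / 1e9:.2f}B total")
```

By this rough count the embedding matrix is about a third of the 125M model, under a tenth of the 1.3B model, and the fraction keeps falling from there.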

The 19M and 800M models have custom config files that we created ourselves. We didn't think about the fact that GPT-3 and OPT used the aforementioned naming convention and instead named them by the number of trainable parameters, as that seemed more natural. The 13B model is the real problem: it was supposed to be the same as the 13B model in OPT and GPT-3, but it appears a transcription error was made at some point and we used 36 layers instead of 40. If you redo your calculations using 40 layers, you do in fact get 13B parameters.
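As a sanity check on the layer count, here's the same rough approximation applied to the 13B config with 36 versus 40 layers; the d_model of 5120 (the OPT-13B width) and the ~50k vocabulary are assumptions for illustration:

```python
# 13B sanity check: with d_model = 5120 and the rough 12 * L * d^2
# approximation, 40 layers lands near 13B parameters while 36 layers
# falls more than a billion parameters short.

D_MODEL = 5120
VOCAB = 50257  # assumed ~50k vocabulary

for layers in (36, 40):
    non_embed = 12 * layers * D_MODEL ** 2
    embeddings = 2 * VOCAB * D_MODEL   # untied embed + unembed matrices
    print(f"{layers} layers -> ~{(non_embed + embeddings) / 1e9:.1f}B parameters")
```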

The question is, what should we do going forward? We need to correct the 13B model for sure, but we should also choose a consistent naming convention. Since our models do in fact match many widely discussed models, it would be useful for them to carry the same names. The 19M and 125M models are the only ones where the distinction between embedding and non-embedding parameters is really that big a deal, I think, but I also don't like how the way this is currently presented in the literature regularly confuses people (including people who should know better, like us). We could use descriptive names instead of numeric ones, but I really hate that practice and the amount of work it takes to look up the sizes of models like T5 and BERT.

