minGrok

This repo is meant as a guide to how xAI's newly open-sourced model Grok-1 works. To see their original implementation, click here. To find the Google Colab notebook that walks through the architecture in excruciating detail as a demonstration for beginners, click here, and check out my YouTube video where I walk through it below. If you're not a beginner (i.e. you're already knowledgeable about decoder-only transformers), then I recommend skimming through model.py and config.py to see all the ways Grok-1 differs from other open-source models like Llama, Mistral, and Gemma.

[Video walkthrough]

Repo Contents

  • The Accompanying Colab Notebook - the teaching material I walk through in my YouTube video
  • minGrok_train-test.ipynb - the notebook where I actually trained the 1m-parameter model. The code here is essentially the same as what's in section 3 of the Colab notebook
  • model.py - contains the nn.Modules used to define minGrok. The code here is essentially the same as what's in section 2 of the Colab notebook
  • config.py - contains minGrok's configuration hyperparameters, as well as comments indicating what full-sized Grok uses
  • tokenizer.py - a very simple tokenizer with a 128-token vocabulary built off of TinyShakespeare's original 65-character vocabulary. By no means should anyone actually use this in production, but it's fine as a simple stand-in given that the purpose of this repo is not to teach tokenization (a quick sketch of the idea follows this list)
  • input.txt - just TinyShakespeare. If I weren't so lazy I would've set all this code to download it directly rather than storing a copy in this repo
  • models/ - a folder of 1m-parameter model(s) that I trained on my MacBook Air. Again, don't expect anything impressive; they're just here for teaching purposes so that you can load them rather than training your own. If you train something bigger, feel free to upload it I guess, but stick with my lazy practice of designating hyperparameters in the title
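
For reference, here is a minimal sketch of the idea behind tokenizer.py. It shows only the character-level core (the real file extends the 65-character base to 128 tokens), and the class and method names are my own illustration, not necessarily tokenizer.py's API:

class CharTokenizer:
    def __init__(self, text):
        self.chars = sorted(set(text))  # TinyShakespeare has 65 unique characters
        self.stoi = {ch: i for i, ch in enumerate(self.chars)}  # char -> id
        self.itos = {i: ch for i, ch in enumerate(self.chars)}  # id -> char

    def encode(self, s):
        return [self.stoi[ch] for ch in s]

    def decode(self, ids):
        return ''.join(self.itos[i] for i in ids)

# round-trip check on the repo's training text
with open('input.txt') as f:
    tok = CharTokenizer(f.read())
assert tok.decode(tok.encode('To be, or not to be')) == 'To be, or not to be'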

ToDo

  • A commenter pointed out that I'd left out MoE-specific training dynamics. In short, to encourage balanced expert utilization rather than over-reliance on a single expert, you need to both add randomness to the Router's logits and add a diversity loss that ensures every expert gets used in every batch (see the sketch after this list). The video will not be changing, but the code has been updated accordingly
  • Grok's FFN inner-dimension multiplier is effectively 5.33. They set it up in an odd way (see the issue below), which is why I missed it, but those comments have also been fixed
  • YouTube commenter @rpbmpn caught my silly brainfart in the attention normalization. Originally I wasn't sure where the 0.08838834764831845 scale factor came from, but they pointed out that it's just the reciprocal of the square root of the head dimension (1/sqrt(128)). I've only added comments and not actually updated the code, because I'm too lazy to train a new model over this one tiny change. If anything bigger comes up that's worth retraining minGrok for, I'll include this.
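
Below is a minimal sketch of those two MoE fixes: noisy router logits plus a Switch-Transformer-style load-balancing ("diversity") loss. The function name, defaults, and shapes here are my own illustration and don't necessarily match the updated model.py:

import torch
import torch.nn.functional as F

def noisy_topk_route(router_logits, top_k=2, noise_std=1.0, training=True):
    # router_logits: (num_tokens, num_experts) raw scores from the Router.
    # Returns each token's chosen experts, their mixing weights, and an
    # auxiliary load-balancing loss to add to the main training loss.
    num_experts = router_logits.shape[-1]

    # fix 1: jitter the logits during training so tokens don't always
    # collapse onto the same expert
    if training and noise_std > 0:
        router_logits = router_logits + noise_std * torch.randn_like(router_logits)

    probs = F.softmax(router_logits, dim=-1)       # (tokens, experts)
    weights, indices = probs.topk(top_k, dim=-1)   # each token's top-k experts

    # fix 2: load-balancing ("diversity") loss, minimized when both the
    # dispatch counts and the router probabilities are uniform over experts
    dispatch = F.one_hot(indices, num_experts).float().sum(dim=1)  # (tokens, experts)
    load = dispatch.mean(dim=0)      # how often each expert gets picked
    importance = probs.mean(dim=0)   # mean router probability per expert
    aux_loss = num_experts * (load * importance).sum()

    return indices, weights, aux_loss

During training you would add aux_loss, scaled by a small coefficient (e.g. 0.01), to the language-modeling loss.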

Check out my socials


minGrok's Issues

Expert / Dense layer inner size

Hello,

First, thank you for the nice implementation! I noticed a small difference from the actual model, which caught my eye when I looked at the original implementation myself.

The FFN inner size for the experts is computed in an unusual way; the actual factor is about 5.33:
https://github.com/xai-org/grok-1/blob/7050ed204b8206bb8645c7b7bbef7252f79561b0/model.py#L85-L89

def ffn_size(emb_size, widening_factor):
    _ffn_size = int(widening_factor * emb_size) * 2 // 3
    _ffn_size = _ffn_size + (8 - _ffn_size) % 8  # ensure it's a multiple of 8
    logger.debug(f"emd_size: {emb_size} adjusted ffn_size: {_ffn_size}")
    return _ffn_size

Regards,

Jan
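
To make the ~5.33 concrete: the 2/3 factor is the usual SwiGLU-style correction, so with Grok-1's widening_factor of 8 the effective multiplier is 8 * 2/3 ≈ 5.33. Plugging in an embedding size of 6144 (48 heads * 128 head dim; both values are my reading of the original repo, so treat them as assumptions):

def ffn_size(emb_size, widening_factor):
    _ffn_size = int(widening_factor * emb_size) * 2 // 3
    return _ffn_size + (8 - _ffn_size) % 8  # round up to a multiple of 8

emb = 6144                                # assumed Grok-1 embedding size
inner = ffn_size(emb, widening_factor=8)  # assumed Grok-1 widening factor
print(inner, inner / emb)                 # 32768 5.333... -> the "effective 5.33x"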
