
minGPT

A PyTorch re-implementation of GPT, both training and inference. minGPT tries to be small, clean, interpretable and educational, as most of the currently available GPT model implementations can be a bit sprawling. GPT is not a complicated model and this implementation is appropriately about 300 lines of code (see mingpt/model.py). All that's going on is that a sequence of indices feeds into a Transformer, and a probability distribution over the next index in the sequence comes out. The majority of the complexity is just being clever with batching (both across examples and over sequence length) for efficiency.

note (Jan 2023): though I may continue to accept and change some details, minGPT is in a semi-archived state. For more recent developments see my rewrite nanoGPT. Basically, minGPT became referenced across a wide variety of places (notebooks, blogs, courses, books, etc.) which made me less willing to make the bigger changes I wanted to make to move the code forward. I also wanted to change the direction a bit, from a sole focus on education to something that is still simple and hackable but has teeth (reproduces medium-sized industry benchmarks, accepts some tradeoffs to gain runtime efficiency, etc).

The minGPT library is three files: mingpt/model.py contains the actual Transformer model definition, mingpt/bpe.py contains a mildly refactored Byte Pair Encoder that translates between text and sequences of integers exactly like OpenAI did in GPT, mingpt/trainer.py is (GPT-independent) PyTorch boilerplate code that trains the model. Then there are a number of demos and projects that use the library in the projects folder:

  • projects/adder trains a GPT from scratch to add numbers (inspired by the addition section in the GPT-3 paper)
  • projects/chargpt trains a GPT to be a character-level language model on some input text file
  • demo.ipynb shows a minimal usage of the GPT and Trainer in a notebook format on a simple sorting example
  • generate.ipynb shows how one can load a pretrained GPT2 and generate text given some prompt (a rough sketch of this appears right below)
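
For example, the last two items roughly boil down to the following (a minimal sketch in the spirit of generate.ipynb; the helper names GPT.from_pretrained, BPETokenizer and model.generate come from mingpt/model.py and mingpt/bpe.py, but check the notebook for the authoritative signatures):

from mingpt.model import GPT
from mingpt.bpe import BPETokenizer

model = GPT.from_pretrained('gpt2')   # pulls the 124M OpenAI weights (needs the transformers package to download)
model.eval()

tokenizer = BPETokenizer()
idx = tokenizer("Andrej Karpathy, the")                             # (1, T) LongTensor of token indices
out = model.generate(idx, max_new_tokens=40, do_sample=True, top_k=40)
print(tokenizer.decode(out[0]))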

Library Installation

If you want to import mingpt into your project:

git clone https://github.com/karpathy/minGPT.git
cd minGPT
pip install -e .

Usage

Here's how you'd instantiate a GPT-2 (124M param version):

from mingpt.model import GPT
model_config = GPT.get_default_config()
model_config.model_type = 'gpt2'
model_config.vocab_size = 50257 # openai's model vocabulary
model_config.block_size = 1024  # openai's model block_size (i.e. input context length)
model = GPT(model_config)

And here's how you'd train it:

# your subclass of torch.utils.data.Dataset that emits example
# torch LongTensors of length up to 1024, with integers from [0,50257)
train_dataset = YourDataset()

from mingpt.trainer import Trainer
train_config = Trainer.get_default_config()
train_config.learning_rate = 5e-4 # many possible options, see the file
train_config.max_iters = 1000
train_config.batch_size = 32
trainer = Trainer(train_config, model, train_dataset)
trainer.run()

See demo.ipynb for a more concrete example.
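
For reference, here is a rough sketch of what such a dataset can look like, modeled loosely on the sorting task in demo.ipynb (the class below is illustrative, not the notebook's verbatim code):

import torch
from torch.utils.data import Dataset

class SortDataset(Dataset):
    """Toy problem: given 6 digits in [0, 3), emit them in sorted order.
    Each example is the concatenation of the problem and its solution;
    targets over the problem half are set to -1 so the loss ignores them."""
    def __init__(self, length=6, num_digits=3):
        self.length = length
        self.num_digits = num_digits
    def __len__(self):
        return 10000  # effectively an endless stream of random examples
    def get_block_size(self):
        return self.length * 2 - 1
    def __getitem__(self, idx):
        inp = torch.randint(self.num_digits, (self.length,), dtype=torch.long)
        sol = torch.sort(inp)[0]
        cat = torch.cat((inp, sol), dim=0)
        x = cat[:-1].clone()              # input to the transformer
        y = cat[1:].clone()               # next-token targets, shifted by one
        y[:self.length - 1] = -1          # only supervise the sorted half
        return x, y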

Unit tests

Coverage is not super amazing just yet but:

python -m unittest discover tests

todos

  • add gpt-2 finetuning demo on arbitrary given text file
  • add dialog agent demo
  • better docs of outcomes for existing projects (adder, chargpt)
  • add mixed precision and related training scaling goodies
  • distributed training support
  • reproduce some benchmarks in projects/, e.g. text8 or other language modeling
  • proper logging instead of print statement amateur hour haha
  • i probably should have a requirements.txt file...
  • it should be possible to load in many other model weights other than just gpt2-*

References

Code:

  • openai/gpt-2 has the model definition in TensorFlow, but not the training code
  • openai/image-gpt has some more modern gpt-3-like modifications in its code, good reference as well
  • huggingface/transformers has a language-modeling example. It is full-featured, but as a result also somewhat challenging to trace. E.g. some large functions have as much as 90% of their code behind various branching statements that is unused in the default setting of simple language modeling

Papers + some implementation notes:

Improving Language Understanding by Generative Pre-Training (GPT-1)

  • Our model largely follows the original transformer work
  • We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads). For the position-wise feed-forward networks, we used 3072 dimensional inner states.
  • Adam max learning rate of 2.5e-4. (later GPT-3 for this model size uses 6e-4)
  • LR decay: increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule
  • We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens.
  • Since layernorm is used extensively throughout the model, a simple weight initialization of N(0, 0.02) was sufficient
  • bytepair encoding (BPE) vocabulary with 40,000 merges
  • residual, embedding, and attention dropouts with a rate of 0.1 for regularization.
  • modified version of L2 regularization proposed in (37), with w = 0.01 on all non bias or gain weights
  • For the activation function, we used the Gaussian Error Linear Unit (GELU).
  • We used learned position embeddings instead of the sinusoidal version proposed in the original work
  • For finetuning: We add dropout to the classifier with a rate of 0.1. learning rate of 6.25e-5 and a batchsize of 32. 3 epochs. We use a linear learning rate decay schedule with warmup over 0.2% of training. λ was set to 0.5.
  • GPT-1 model is 12 layers and d_model 768, ~117M params

Language Models are Unsupervised Multitask Learners (GPT-2)

  • LayerNorm was moved to the input of each sub-block, similar to a pre-activation residual network
  • an additional layer normalization was added after the final self-attention block.
  • modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of 1/√N where N is the number of residual layers. (weird because in their released code I can only find a simple use of the old 0.02... in their release of image-gpt I found it used for c_proj, and even then only for attn, not for mlp. huh. https://github.com/openai/image-gpt/blob/master/src/model.py) A sketch of this scaled init appears after this list.
  • the vocabulary is expanded to 50,257
  • increase the context size from 512 to 1024 tokens
  • larger batchsize of 512 is used
  • GPT-2 used 48 layers and d_model 1600 (vs. original 12 layers and d_model 768). ~1.542B params
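
As referenced in the initialization bullet above, here is roughly what that scaled initialization looks like in code (this mirrors what minGPT's model.py does for the residual output projections, up to naming; model stands for the GPT module from the usage example):

import math
import torch

n_layer = 12  # depth of the model
for name, p in model.named_parameters():
    if name.endswith('c_proj.weight'):
        # residual-path output projections get their std scaled by 1/sqrt(N);
        # N = 2 * n_layer because each block contributes two residual additions
        torch.nn.init.normal_(p, mean=0.0, std=0.02 / math.sqrt(2 * n_layer))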

Language Models are Few-Shot Learners (GPT-3)

  • GPT-3: 96 layers, 96 heads, with d_model of 12,288 (175B parameters).
  • GPT-1-like: 12 layers, 12 heads, d_model 768 (125M)
  • We use the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization described therein
  • we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer
  • we always have the feedforward layer four times the size of the bottleneck layer, d_ff = 4 * d_model
  • all models use a context window of n_ctx = 2048 tokens.
  • Adam with β1 = 0.9, β2 = 0.95, and eps = 1e-8
  • All models use weight decay of 0.1 to provide a small amount of regularization. (NOTE: GPT-1 used 0.01 I believe, see above)
  • clip the global norm of the gradient at 1.0
  • Linear LR warmup over the first 375 million tokens, then cosine decay of the learning rate down to 10% of its value over 260 billion tokens (see the sketch after this list)
  • gradually increase the batch size linearly from a small value (32k tokens) to the full value over the first 4-12 billion tokens of training, depending on the model size.
  • full 2048-sized time context window is always used, with a special END OF DOCUMENT token delimiter
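
A small illustration of that schedule as a learning-rate multiplier over tokens processed (the constants come from the bullets above; this is not code from minGPT's trainer):

import math

def gpt3_lr_mult(tokens, warmup_tokens=375e6, decay_tokens=260e9):
    # linear warmup over the first 375M tokens ...
    if tokens < warmup_tokens:
        return tokens / max(1.0, warmup_tokens)
    # ... then cosine decay from 1.0 down to 0.1 over 260B tokens
    progress = min(1.0, (tokens - warmup_tokens) / decay_tokens)
    return 0.1 + 0.45 * (1.0 + math.cos(math.pi * progress))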

Generative Pretraining from Pixels (Image GPT)

  • When working with images, we pick the identity permutation π_i = i for 1 ≤ i ≤ n, also known as raster order.
  • we create our own 9-bit color palette by clustering (R, G, B) pixel values using k-means with k = 512.
  • Our largest model, iGPT-XL, contains L = 60 layers and uses an embedding size of d = 3072 for a total of 6.8B parameters.
  • Our next largest model, iGPT-L, is essentially identical to GPT-2 with L = 48 layers, but contains a slightly smaller embedding size of d = 1536 (vs 1600) for a total of 1.4B parameters.
  • We use the same model code as GPT-2, except that we initialize weights in the layer-dependent fashion as in Sparse Transformer (Child et al., 2019) and zero-initialize all projections producing logits.
  • We also train iGPT-M, a 455M parameter model with L = 36 and d = 1024
  • iGPT-S, a 76M parameter model with L = 24 and d = 512 (okay, and how many heads? looks like the Github code claims 8)
  • When pre-training iGPT-XL, we use a batch size of 64 and train for 2M iterations, and for all other models we use a batch size of 128 and train for 1M iterations.
  • Adam with β1 = 0.9 and β2 = 0.95
  • The learning rate is warmed up for one epoch, and then decays to 0
  • We did not use weight decay because applying a small weight decay of 0.01 did not change representation quality.
  • iGPT-S lr 0.003
  • No dropout is used.

License

MIT


mingpt's Issues

How to apply to time series?

I replaced the nn.Embedding layer (token embeddings) with a simple linear layer, but the model fails to model the data.
I also trained with many different configurations, but it fails to even overfit a really small time series (100 steps). Since it cannot overfit even a small dataset, I think something is missing or wrong.

Any idea how to properly apply this model to time series?
Maybe there is a problem with pos embeddings? or LayerNorms?
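
For what it's worth, the substitution described above would look roughly like this (a hedged sketch, not code from this repo; the layer names follow minGPT's conventions, and the regression head at the end is an assumption):

import torch
import torch.nn as nn

n_embd, block_size = 128, 100
value_emb = nn.Linear(1, n_embd)                      # replaces nn.Embedding(vocab_size, n_embd)
pos_emb = nn.Parameter(torch.zeros(1, block_size, n_embd))
head = nn.Linear(n_embd, 1)                           # hypothetical regression head instead of vocab logits

x = torch.randn(8, block_size, 1)                     # (B, T, 1) batch of continuous series
h = value_emb(x) + pos_emb                            # (B, T, n_embd) input to the transformer blocks
# ...run the blocks on h, then apply `head` and an MSE loss against the series shifted by one step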

Question about the CharDataset

Hello,

I was having a quick look at the play_char.ipynb notebook (great project by the way!) and I have a question regarding these two lines:

x = torch.tensor(dix[:-1], dtype=torch.long)
y = torch.tensor(dix[1:], dtype=torch.long)

In the README, you say that:

all that's going on is that a sequence of indices goes into a sequence of transformer blocks, and a probability distribution of the next index comes out

Regarding x, I understand that it is a sequence of characters (the last one is not included). Regarding y, I understand that it is supposed to be the last character (which is not included in x). So I would have expected something like this:

y = torch.tensor(dix[-1], dtype=torch.long)

I don't understand why you define y as a sequence, starting from the second character. Am I wrong?
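
(For context, a sketch of the standard next-token setup that CharDataset uses: every position gets a target, namely the character that follows it, so y is the input shifted by one rather than a single final character.)

chunk = "hello"
stoi = {ch: i for i, ch in enumerate(sorted(set(chunk)))}   # toy char -> id map
dix = [stoi[c] for c in chunk]
x = dix[:-1]     # inputs:  "h e l l"
y = dix[1:]      # targets: "e l l o" -- one target per position, shifted by one
# the model predicts y[t] from x[0..t], so a single chunk supervises
# len(chunk) - 1 next-character predictions at once, not just the last one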

Caching for generation

Currently, generation is done by recomputing every activation after a token is added to the prompt. Normally, one would want to cache the intermediate activations to avoid recomputing them every time. It doesn't compose as well with using the forward function, but that's precisely why a clean and simple implementation should be a part of minGPT. It's very surprising that this is not afforded by pytorch's native TransformerEncoder module either.

Possible Improvement to top_k_logits

I have a possible improvement to the mingpt.utils.top_k_logits function:

def top_k_logits(logits, k):
    v, ix = torch.topk(logits, k)
    out = logits.clone()
    out[out < v[:, [-1]]] = -float('Inf') # changed from 1e-10
    return out

I was using the play_char notebook to train against the IMDB dataset, but was getting really terrible samples out of it after training, unless I set the temperature very low. Looking into the sampling code I noticed the odd choice of 1e-10 in top_k_logits. It seemed odd since most logits are negative, so using 1e-10 may actually give many characters higher probability, not lower/zero. Replacing it with negative infinity vastly improved sampling for me. A demonstration follows below. I'm happy to open a pull request, just let me know.

Demo Code:

print("Original top_k:")
for _ in range(10):
    context = "O God  O God!"
    x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
    y = sample(model, x, 200, temperature=0.9, sample=True, top_k=5)[0]
    completion = ''.join([train_dataset.itos[int(i)] for i in y])
    print(completion)
    print()

print("\n\n\nModified top_k:")
for _ in range(10):
    context = "O God  O God!"
    x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
    y = my_sample(model, x, 200, temperature=0.9, sample=True, top_k=5)[0]
    completion = ''.join([train_dataset.itos[int(i)] for i in y])
    print(completion)
    print()

Output:

Original top_k:
O God  O God!  The DANGER of TERROR RATS GARDEN, 1959 5 79. The movie starts vZERO on quite 

ewtGPX79 end24? Was ChaveBoQwell 6 deserves gKX02 80101 HOURS OF THE WORST POSSIBLE SHOW ON THE LAST? Was 4 7 ratings o

O God  O God!" 9 !!! :  ? ;  #I stumbled aJes DansUn0 JeUS into the 1930 ; So'Pa#AeF7 7 I was so judged.' I'm not sure what tJ2" , The Harri4 X YeR S2ND  20010  Jackson 3 X 44 LD 703010. AT THE TRAIN   THE TRIED O

O God  O God! Don't 5 Dw8becks!#Thekno84 #Watched this movie, like great zombie 30's year Ze7 30 AQUATER EVIL ;1958 tSiff and 3.59Ja :  S47 episodes a5L60 Years and I didn't own REAL WAY very much 'squatreWeLs of 

O God  O God! DoWn Knox  has a WaSTE OF THE STAR TIME!!!  has a stellar point VHS ! No! knowing the 6 Hours.#I Am not The Best Picture or knowing #This Way ToP" do5 POST was a

1970D Very beautifulHbO" Midnight in

O God  O God!! JQ DfH lCnah just a; x's 80X year old Diw4od jokes z? fFf0 to video America4 Z Cou#This was onRIGHT 60 and by far the most past P!! 9 6 I was NOT the, and ;princeOus which made X Le4crophiles pick u

O God  O God!  could ma's 

? Some of you #I , Thed7R 3 not#I really enjoyed this movie how long I didn't...U: The x Newsu? It was '2001'.sLugQleJKxliE : Z63 piece

A 00 ,Vef BoIIled BX 700022 ,: artM dodgy788 pro

O God  O God! Tockzie both , you 8, tAUgues and the...just rjaZzro#; Finally Unstained #I'm 5 SR Movies 8PM keeping ! F" movies Thne? Firstly, I caught this film OK, 4. I QUEST 6 and struck to 9.j. ;3 Out of 10.  

O God  O God!!! a4. I VIETE p6H! iW's not a5 years old as daggerMib0kfijijSS noO.diedk#Wek4 :X m, because the actEMPTION AND The resolution Nazi LiRA ! When seeing Daniels kUNG! , DrNNG of Friends , "The Black Clu

O God  O God!!!! Very lUckK!#IN THE SECOND MANNERS #Izo ons61 is a hiQ :  u havegottCh?  Actually, bI'm the only one xing 50's Naked, 'Calpda7 kids1...iB has watched it WITTY THE BAD MOVIE withvy HIGHER 60WOODS 60

O God  O God!  Un35 U4 END THIS MOVIE . The USA ...Bw??? The sKW, X 1954, 1983I's BETTER....8 10 comes kids.Unless you're not 9 out of 10.!!!QUALITIES!!!!? just when it you a5PM now!#This coulder DVD ;Iun75 was #I




Modified top_k:
O God  O God!!  This is a mediocre film! I was wondering if you are that big fan of the show that you will like to watch.#I saw this movie and I had to be surprised by all the gay movies that take my excellent pla

O God  O God!!  Then, in fact, the movie dies work well together. The plot is all too bad. It is nice to see the movie without a lot of standards to change these comedies, but that is a masterpiece to the film's c

O God  O God!!!!! This is not a good film. It is a good movie, which is no established as a movie with a childhood that is a story that is no more sub than a horror film.#With the best part of the movie the words,

O God  O God!!!!!! ! ! That movie would be bad every silent and the one thing is so good... it is not the best. it's a simple show..it's not a better film. and there is one reason that the child's plots holes are 

O God  O God!

Another reviewer will say that this film is a big fan of my favorite actors. It was too long. I can't really believe the film together but the movie was a stinker. And the script also had always bee

O God  O God!  and then the scary plus in this movie with a bunch of serious comics they are not even mentioned in the cast. I have not seen summarise that they were not expecting a gun or a bad story. The plot wa

O God  O God!  They don't live them up a little bit, and there was much more to tell them. I was also the waters are all over again. They were so cool by the end of the movie because I don't really know what to do

O God  O God!!   And when I was a fan of Sonny Bruce and I was inspired by the scene where she is so stupid and not to mention these scenes with Sanji and her parents trying to stand out, she stays away for the so

O God  O God!  Another show is a meal through her mother and she was tried to be somewhat of her brain.

I highly disagree with her. I didn't even go back to the energy of the movie, and in the movie that it was m

O God  O God!!!!!!!! !!!!!!!!?  THIS SHOUT OF HER OUT!!!! . It is a great film. It is not a good movie but not a good film. I have no idea of hearing the man in a comet. It is a great movie to be a star. I would r

EDIT: Slight addendum: I just have to say how impressive the results of this model are with the fixed sampling, given that I only trained it for a few hours on a 2070.

Renaming transformer.h into transformer.l

In model.py, in the GPT class, the stack of transformer blocks self.transformer.h should be renamed to self.transformer.l (h is reminiscent of "heads" while l would be reminiscent of "layers").

play_char training is broken; CharDataset is not multiprocessing compatible

I discovered that the CharDataset implementation is broken and returns the same batch of data multiple times in a row. This causes massive overfitting and wasted cycles.

The root issue is that CharDataset is not multiprocessing compatible, but num_workers is >1 so it's used in multiprocessing mode.

Details

The source of the problem is this line in __getitem__:

        i = np.random.randint(0, len(self.data) - (self.block_size + 1))

CharDataset is fed into a DataLoader during training, with num_workers set greater than 1. This puts DataLoader into multiprocessing mode where it distributes the Dataset to multiple processes. The crux of the issue is that in doing so it copies over local state, including for example the state of random number generators. So that line above will return the exact same sequence of "random" indexes in every worker process. This results in the same batch of data being repeated four times in a row, before repeating the next batch of data four times, and so on.

Here is a notebook that simplifies play_char to demonstrate the issue: https://gist.github.com/fpgaminer/7737a9377e3379fe17dc5bb83d4db69c

In the simplified notebook __getitem__ returns i directly. In the last cell it iterates the loader and prints out the batches. As can be seen, batches are repeated four times.

Workaround

The workaround for me was to set num_workers to 1. Before the workaround the model showed signs of overfitting on WebText2 which shouldn't be possible. After the workaround, the model started to train correctly and test loss began dropping as expected.

Fix

I haven't worked with raw PyTorch much, so I don't know the idiomatic fix. I'm happy to research and propose a pull request if you would like. Perhaps the easiest fix is to use the workaround and drop an assert into CharDataset to throw if multiprocessing gets used. Since the dataset is in-memory there's little reason to use multiple workers. Larger datasets would really need a different Dataset implementation anyway.
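
One commonly used fix along these lines (a sketch of the standard PyTorch pattern, not code from this repo) is to give each DataLoader worker its own NumPy seed via worker_init_fn:

import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # each DataLoader worker gets its own torch base seed; derive a numpy seed
    # from it so np.random.randint in __getitem__ differs across workers
    np.random.seed(torch.initial_seed() % 2**32)

loader = DataLoader(train_dataset, shuffle=True, pin_memory=True,
                    batch_size=128, num_workers=4, worker_init_fn=seed_worker)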

Sharing Pretrained Checkpoints

Hi,

I am currently getting a broken pipe error while trying to train the model on images. While I try to solve this error on my machine, I would like to move on to the evaluation part of the process at the same time. However, I could not find pretrained weights, at least for the default model with default parameters. Is there any chance for me to retrieve these weight files?

Thanks a lot.

What hardware is supported?

I just tried running play_char.ipynb on Colab Pro "High mem" (Tesla P100) and ran out of GPU memory. I am wondering what hardware can run this?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    31W / 250W |  16273MiB / 16280MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
  0%|          | 0/2 [00:00<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-16-19d8cdcc2bb5> in <module>()
      6                       num_workers=4)
      7 trainer = Trainer(model, train_dataset, None, tconf)
----> 8 trainer.train()

14 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in linear(input, weight, bias)
   1674         ret = torch.addmm(bias, input, weight.t())
   1675     else:
-> 1676         output = input.matmul(weight.t())
   1677         if bias is not None:
   1678             output += bias

RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 15.90 GiB total capacity; 15.01 GiB already allocated; 7.88 MiB free; 15.12 GiB reserved in total by PyTorch)

Curious Question: Is LayerNorm in the wrong position or is that deliberate?

Code:

def forward(self, x):
    x = x + self.attn(self.ln1(x))
    x = x + self.mlp(self.ln2(x))
    return x

Paper says: Output of each sub-layer is LayerNorm(x + Sublayer(x))

I changed the code to the following and the results seem (slightly) better on play_char, albeit it's a qualitative assessment...

def forward(self, x):
    x = self.ln1(x + self.attn(x))
    x = self.ln2(x + self.feed_forward(x))
    return x

Stop words?

I realize that there isn't an implementation of stop words in the code yet, so the generation will go all the way till it reaches max length. Any idea how to add stop words in the generation process without fine-tuning the pretrained models?
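
One pattern that works without any fine-tuning is to replicate the repo's token-by-token sampling loop and break as soon as a designated stop token is produced (a sketch; it assumes GPT.forward returns (logits, loss) and that the model exposes block_size, as in the current code):

import torch
from torch.nn import functional as F

def generate_until(model, idx, stop_token, max_new_tokens, temperature=1.0, top_k=None):
    # sample one token at a time and stop early when stop_token is emitted
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx if idx.size(1) <= model.block_size else idx[:, -model.block_size:]
        with torch.no_grad():
            logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float('Inf')
        probs = F.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, next_tok), dim=1)
        if int(next_tok) == stop_token:
            break
    return idx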

play_image notebook doesn't work.

Running the third cell results in this error:


<ipython-input-7-0c6d9fef7d07> in <module>()
      1 # make deterministic
----> 2 from mingpt.utils import set_seed
      3 set_seed(42)

ModuleNotFoundError: No module named 'mingpt'

Meaning of "-1 because very last digit doesn't plug back"

Hi, this is an awesome repository. I was reading the AdditionDataset code and noticed the block size is calculated as follows:

# +1 due to potential carry overflow, but then -1 because very last digit doesn't plug back
self.block_size = ndigit + ndigit + ndigit + 1 - 1

The meaning of "then -1 because very last digit doesn't plug back" wasn't exactly clear to me. Did you mean "but then -1 because the last digit doesn't plug back and needs to be predicted"?

TPU/GPU training: KeyError 'pos_emb'

Hi,

I am currently testing the char notebook.
Everything works fine while CPU training, but if I try to execute the same code on a GPU/TPU the following error occurs:

Exception has occurred: KeyError 'pos_emb'

If I simply remove the problematic code line:

no_decay.add('pos_emb')

It then kind of works in GPU/TPU training too, but the loss gets stuck and practically no improvement (or even the opposite) happens during training, unlike CPU training, where the loss behaves as expected with the same code base.

Can anyone explain to me how it is possible to solve this KeyError without corrupting the no_decay set?
Thanks a lot! :)

play_math AdditionDataset.__getitem__ return value?

In the case of train_dataset[0], where self.permutation_array[0] contains 4717, why does __getitem__ return x, y as
(tensor([4, 7, 1, 7, 0, 6]), tensor([-100, -100, -100, 0, 6, 4]))
and not
(tensor([4, 7, 1, 7]), tensor([-100, -100, -100, -100, 0, 6, 4]))
or
(tensor([4, 7, 1, 7, -100, -100, -100]), tensor([-100, -100, -100, -100, 0, 6, 4]))

This question is not about the implementation of the function; rather, it is about how the return value is used by minGPT. Is minGPT only trying to predict the last digit, i.e. '4', and not the last 3 digits '064'? Why are the last 3 digits not entirely excluded from x? Why does y not also include a masked-out entry for the first digit?

How to determine `warmup_tokens` and `final_tokens`?

Hey folks,

Thanks a lot for this implementation @karpathy! I was wondering how you got the values in the addition example:

warmup_tokens=1024,
final_tokens=50 * len(train_dataset) * (ndigit + 1),

And how does one estimate these for a different task (e.g. based on vocabulary size, epochs, etc.)?

Cheers,
Florian
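
One plausible reading of those numbers (an interpretation, not something stated in the repo docs): the trainer counts only the tokens that actually contribute to the loss, i.e. the (ndigit + 1) answer digits per example, so final_tokens is roughly "50 epochs worth" of supervised tokens and warmup_tokens is a small fixed token budget rather than a step count. Sketched with the notebook's train_dataset:

ndigit = 2
tokens_per_example = ndigit + 1   # only the answer digits (incl. possible carry) contribute to the loss
final_tokens = 50 * len(train_dataset) * tokens_per_example   # ~50 epochs worth of supervised tokens
warmup_tokens = 1024              # warm up over the first few hundred examples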

Perfect training and evaluation loss, but terrible test-time performance

I encountered a pretty ridiculous failure mode in which:

  • I was getting almost 0 training and validation loss
  • I was getting very bad performance when feeding the model incomplete sequences (e.g. for test-time word-generation)

After much debugging, I found that the issue was the value of self.masked_bias (currently set to -1e4, i.e. -10,000): it is not negative enough. This large negative value is supposed to implement the "mask" of the causal masked attention.

For some high enough learning rates, the network is able to find a hack to copy the input to the output (getting around the causal masking): just drive the unmasked attention scores below -1e4, and the causal mask will effectively not be doing anything! This means the model can attend to the whole input when producing the output (even the future tokens), so it can simply copy it at training and validation time (getting ~0 loss).

What I found while debugging this line:

>>> attn_weights[0,0,:4,:4]
[[-15258.9805, -10000.0000, -10000.0000, -10000.0000],
        [-15044.7910, -16940.4766, -10000.0000, -10000.0000],
        [-11722.1553, -13301.4287,  -1438.0649, -10000.0000],
        [ -9711.6445, -11315.6006,  -1065.3066, -12052.6035]]

As one can see, some attention scores outside of the causal mask are even more negative than this fixed mask value, which leads the softmax to return non-zero weights for all positions, including the masked ones.

Documenting this in case it could be useful to others!

In terms of fixes, there should probably be an assertion that the attention scores never go below self.masked_bias, or it should be set to an even more negative value.

(I encountered this issue when using this code from the transformers library, but I'm guessing it also affects this library)
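
For reference, this failure mode disappears if the mask is implemented with an actual -inf fill instead of a large finite bias (roughly what minGPT's attention does); a tiny demonstration:

import torch

T = 4
scores = torch.randn(1, 1, T, T) * 1e5                        # exaggerated magnitudes, as in the issue
mask = torch.tril(torch.ones(T, T)).view(1, 1, T, T)          # causal (lower-triangular) mask
scores = scores.masked_fill(mask == 0, float('-inf'))         # -inf, not a finite value like -1e4
weights = torch.softmax(scores, dim=-1)
print(weights[0, 0])   # future positions get exactly zero weight, no matter how large the scores get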

Facilitating setup with popular tools

IMHO if the goal is to facilitate understanding then the setup phase is particularly important as it is a barrier to entry.

There are numerous solutions for that, from containerization (e.g. Docker or Podman), to dedicated AI tools (e.g. cog, or Hugging Face with a Gradio interface), to a Jupyter Notebook that could itself be part of an online setup (e.g. Gitpod).

I believe providing a Dockerfile with instructions on how to do the most basic steps, e.g training or a single inference, would help newcomers who can then focus on the actual core of the project.

Potential encoding issue in addition problem in play_math notebook?

Karpathy,

Thanks again for the (magic) code. Karpathy magic of simplifying the marvelous into tiny code.

I especially enjoyed the math notebook, since it opens up new applications beyond autoregressive prediction - into conditional response.

If my understanding is right, the way you have encoded the addition problem data only trains the model to learn how to predict the units digit. Please correct me if I got it wrong.

You provide x and y like this.

(tensor([4, 7, 1, 7, 0, 6]), tensor([-100, -100, -100, 0, 6, 4]))

This means you are providing the first two of the three answer digits in the example, so the model only needs to copy them.
For instance, it may not learn how to handle carry into the tens or hundreds place.

The proper encoding of x that will not 'leak' the answer should be one of two ways:

  1. masking the answer in the x vector:
    4,7,1,7,-100,-100 -> -100, -100, -100, 0, 6, 4
    This forces model to learn how to do all digits.

  2. variable length example encoding with operator tokens:
    <|add|>, 4, 7, <|with|>, 3, 2, 5, <|answer|>, 3, 7, 2, <|end|>
    with the corresponding loss mask turned on only for the last FOUR tokens (3, 7, 2, <|end|>), because the model needs to know how many digits to generate and when to stop.
    (Note: I had to insert pipes in the tags to keep the GitHub HTML renderer from interpreting them as markup.)

There are several advantages to the second encoding:

  1. you could teach add, subtract and possibly multiply.
  2. the input numbers and output numbers can be variable length
  3. Both of these mean we can check whether the model is able to generalize a representation of numbers that works for addition AND subtraction, and is invariant to the number of digits.
  4. Most importantly, you can create a long stream of inputs by placing examples back to back into a 'sentence' and use any length of input.
  5. The first encoding is a seq-to-seq frame of mind. The second one is the GPT way of thinking: a lot can be learned just by predicting one more token. Masking only keeps the model from trying to predict the problem itself, since this formulation is a QA mode.

It is possible that my understanding is incorrect; if so, kindly correct me.
Also, I write a lot of words to articulate what is in my mind, not to mansplain, so please read with that hint!

Thanks again,
Ravi

Error when I provide test dataset (custom minGPT)

@karpathy and other contributors

Hey guys,

I am loving your implementation. It's awesome, but I am getting an error when I try to use a test dataset during training with custom minGPT and play_char code.

I use a slightly modified minGPT and a modified version of the play_char notebook, so the problem may be on my end, but I could really use some help narrowing it down because I can't fix it for some reason.

Here is what I get:

epoch 1 iter 305: train loss 1.24552. lr 6.000000e-04: 100%|██████████| 306/306 [01:46<00:00,  2.87it/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-78d9c7cbef4b> in <module>()
      1 #@title (OPTION 1) Train the model
      2 get_ipython().magic('cd /content/')
----> 3 trainer.train()

1 frames
/content/tegridy-tools/tegridy-tools/minGPT.py in train(self)
    373             run_epoch('train')
    374             if self.test_dataset is not None:
--> 375                 test_loss = run_epoch('test')
    376 
    377             # supports early stopping based on the test loss, or just save always if no test set is provided

/content/tegridy-tools/tegridy-tools/minGPT.py in run_epoch(split)
    321             losses = []
    322             pbar = tqdm(enumerate(loader), total=len(loader)) if is_train else enumerate(loader)
--> 323             for it, (x, y) in pbar:
    324 
    325                 # place data on the correct device

ValueError: too many values to unpack (expected 2)

If someone can take a look at the code/notebook, I would really appreciate it.

Here is my version of minGPT and the notebook:

https://github.com/asigalov61/tegridy-tools/blob/main/tegridy-tools/minGPT.py

https://colab.research.google.com/drive/1erZa6Wk4Tvm1bHet2_BQ3Qk3ALHpi6Ix?usp=sharing

Thank you in advance for your time and help with this issue.

Alex

P.S. I thought it was related to my tqdm.auto.tqdm statement, but it gives a similar error. So it's not tqdm IMHO.

Is it more reasonable to only use causal attention in the first block of GPT

Dear @karpathy,

Thanks for this nice GPT implementation. Really helps a lot!

When comparing this GPT with other Transformers, I found that all the attention layers here use causal self-attention. I'm wondering whether that is really needed, or whether we could just use causal self-attention in the first block, so as to avoid using future tokens in prediction.

I'm not sure if my idea is correct. But ideally, only using causal attention in the first block should avoid using the future tokens, as there are no residual connections between blocks.

Thanks for your attention.

Crashed Encoder possible data corruption

Identified a bare except clause in the Encoder that could cause unexpected behavior (for example, a SystemExit raised inside the try block would be silently swallowed instead of propagating).

Created PR with simple fix to add explicit exception handling for these lines.

#110

Broken Pipe running "'play_char" notebook

When running Input 12 in the play_char jupyter notebook, I got the error:

BrokenPipeError Traceback (most recent call last)
in
6 num_workers=4)
7 trainer = Trainer(model, train_dataset, None, tconf)
----> 8 trainer.train()

~\Documents\GitHub\minGPT\mingpt\trainer.py in train(self)
123 for epoch in range(config.max_epochs):
124
--> 125 run_epoch('train')
126 if self.test_dataset is not None:
127 run_epoch('test')

~\Documents\GitHub\minGPT\mingpt\trainer.py in run_epoch(split)
77
78 losses = []
---> 79 pbar = tqdm(enumerate(loader), total=len(loader)) if is_train else enumerate(loader)
80 for it, (x, y) in pbar:
81

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in iter(self)
277 return _SingleProcessDataLoaderIter(self)
278 else:
--> 279 return _MultiProcessingDataLoaderIter(self)
280
281 @Property

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in init(self, loader)
717 # before it starts, and del tries to join but will get:
718 # AssertionError: can only join a started process.
--> 719 w.start()
720 self._index_queues.append(index_queue)
721 self._workers.append(w)

C:\ProgramData\Anaconda3\lib\multiprocessing\process.py in start(self)
110 'daemonic processes are not allowed to have children'
111 _cleanup()
--> 112 self._popen = self._Popen(self)
113 self._sentinel = self._popen.sentinel
114 # Avoid a refcycle if the target function holds an indirect

C:\ProgramData\Anaconda3\lib\multiprocessing\context.py in _Popen(process_obj)
221 @staticmethod
222 def _Popen(process_obj):
--> 223 return _default_context.get_context().Process._Popen(process_obj)
224
225 class DefaultContext(BaseContext):

C:\ProgramData\Anaconda3\lib\multiprocessing\context.py in _Popen(process_obj)
320 def _Popen(process_obj):
321 from .popen_spawn_win32 import Popen
--> 322 return Popen(process_obj)
323
324 class SpawnContext(BaseContext):

C:\ProgramData\Anaconda3\lib\multiprocessing\popen_spawn_win32.py in init(self, process_obj)
87 try:
88 reduction.dump(prep_data, to_child)
---> 89 reduction.dump(process_obj, to_child)
90 finally:
91 set_spawning_popen(None)

C:\ProgramData\Anaconda3\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
58 def dump(obj, file, protocol=None):
59 '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60 ForkingPickler(file, protocol).dump(obj)
61
62 #

BrokenPipeError: [Errno 32] Broken pipe

Question about memory usage for play_math

I've been experimenting with minGPT / play_math to see whether multiplication is possible. I've got a somewhat anemic GTX 1060 with only 6GB of memory. When attempting to expand the sequence width, ndigit = 3 works, but anything above that results in a SIGKILL, which I am assuming is GPU OOM. What's weird is that with ndigit = 3 the GPU only uses 705MiB, so why would ndigit = 4 result in OOM?

My target goal right now is simple multiplication with 24-bit integers, any advice on model refinements would be greatly appreciated.

Great project karpathy ;)

tests do not run in project as built

building the project as:

pip install -e .

and running the test command given in README.md

python -m unittest discover tests

gives an unexpected error:

ImportError: Failed to import test module: test_huggingface_import
Traceback (most recent call last):
  File "/usr/lib/python3.10/unittest/loader.py", line 436, in _find_test_path
    module = self._get_module_from_name(name)
  File "/usr/lib/python3.10/unittest/loader.py", line 377, in _get_module_from_name
    __import__(name)
  File "/home/slowpoke/work/fork/minGPT/tests/test_huggingface_import.py", line 7, in <module>
    from transformers import GPT2Tokenizer, GPT2LMHeadModel
ModuleNotFoundError: No module named 'transformers'

the cause appears to be a missing dependency in the requirements section of setup.py

Use PyTorch Lightning to handle the training (free checkpointing + logging + 16-bit precision)

Awesome repo!

However, I'm not sure why you'd go through the effort of implementing your own trainer again...

In lightning we already support:

  • automatic checkpoint loading/saving
  • multi-cpu
  • multi-gpu
  • multi-tpu core
  • 16-bit precision (amp and native)
  • accumulated gradients
  • and about 40+ more features.

Not to mention it's maintained by a team of 20+ full-time engineers and 200+ open-source contributors, and has been adopted by over 400 companies and research labs.

https://pytorch-lightning.readthedocs.io/en/latest/new-project.html

self.model.train() in fake_lighting.py fit

Hi,

Thank you for this wonderful code and for implementing lightning support too.
In the fit function there is a call to self.model.train(). I am not able to understand its purpose.
Sorry if it is my ignorance.

Use of amp.autocast does not improve performance

I'm experimenting with amp.autocast (automatic mixed precision) with torch version '1.7.0a0+7036e91'.
I find the performance has not improved (but slightly degraded: 36.2s for AMP vs 30.4s for FP32 with a single V100 GPU, converged in 40 epochs).
My changes are marked with comments below and follow the amp recipe.
Thank you for any explanation you can provide.

    def train(self):
        use_amp = self.config.precision == 'AMP'
        scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # Setup once at the beginning of training

        model, config = self.model, self.config
        raw_model = model.module if hasattr(self.model, "module") else model
        optimizer = raw_model.configure_optimizers(config)

        def run_epoch(split):
            is_train = split == 'train'
            model.train(is_train)
            data = self.train_dataset if is_train else self.test_dataset
            loader = DataLoader(data, shuffle=True, pin_memory=True,
                                batch_size=config.batch_size,
                                num_workers=config.num_workers)

            losses = []
            pbar = tqdm(enumerate(loader), total=len(loader)) if is_train else enumerate(loader)
            for it, (x, y) in pbar:

                # place data on the correct device
                x = x.to(self.device)
                y = y.to(self.device)

                # forward the model
                with torch.set_grad_enabled(is_train):
                    with torch.cuda.amp.autocast(enabled=use_amp):  # cast ops in mixed precision
                        logits, loss = model(x, y)
                    loss = loss.mean() # collapse all losses if they are scattered on multiple gpus
                    losses.append(loss.item())

                if is_train:
                    model.zero_grad()
                    # Scale loss, call backward() to create scaled gradients
                    scaler.scale(loss).backward()
                    # unscale gradient, call optimizer.step()
                    scaler.step(optimizer)
                    # Update the scale for next iteration
                    scaler.update()
                    # TBD deleted 'torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_norm_clip)'
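
For what it's worth, the gradient clipping does not have to be dropped under AMP; the standard recipe is to unscale the gradients before clipping (a sketch of the is_train branch above with clipping restored):

                if is_train:
                    model.zero_grad()
                    scaler.scale(loss).backward()
                    # unscale first so clip_grad_norm_ sees the true gradient magnitudes
                    scaler.unscale_(optimizer)
                    torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_norm_clip)
                    scaler.step(optimizer)      # the step is skipped internally if gradients are inf/nan
                    scaler.update()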

UnboundLocalError: local variable 'test_loss' referenced before assignment

When the training process finished the first epoch and saved the model, it threw this error:

epoch 1 iter 8713: train loss 0.25403. lr 3.000169e-04: 100%|██████████| 8714/8714 [4:12:50<00:00,  1.74s/it]  
Traceback (most recent call last):
  File "try.py", line 105, in <module>
    trainer.train()
  File "/home/ec2-user/minGPT/mingpt/trainer.py", line 129, in train
    best_loss = test_loss
UnboundLocalError: local variable 'test_loss' referenced before assignment

Environment: Python3 + pytorch 1.6
