mingpt-tf's Introduction

minGPT-TF

A TensorFlow re-implementation of minGPT.

Notebooks

play_math.ipynb and play_char.ipynb were trained in Colab. A link at the top of each notebook opens it in Colab so you can train the model there. In play_char.ipynb the batch_size is reduced to fit into Colab GPU memory; adjust the parameters to match the memory of your own GPU.

minGPT - Readme

A PyTorch re-implementation of GPT training. minGPT tries to be small, clean, interpretable and educational, as most of the currently available ones are a bit sprawling. GPT is not a complicated model and this implementation is appropriately about 300 lines of code, including boilerplate and a totally unnecessary custom causal self-attention module. Anyway, all that's going on is that a sequence of indices goes into a sequence of transformer blocks, and a probability distribution of the next index comes out. The rest of the complexity is just being clever with batching (both across examples and over sequence length) so that training is efficient.
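
To make the batching "over sequence length" concrete, here is a minimal sketch in plain Python (no PyTorch) of how one block of indices yields a prediction target at every position. The helper name `make_example` and the toy data are assumptions made for illustration; they are not part of minGPT.

```python
def make_example(tokens, start, block_size):
    """Cut one (x, y) training pair out of a stream of token indices.

    x is a window of block_size indices; y is the same window shifted one
    step to the right, so y[t] is the index that follows x[: t + 1].
    """
    chunk = tokens[start : start + block_size + 1]
    if len(chunk) < block_size + 1:
        raise ValueError("not enough tokens left for a full block")
    return chunk[:-1], chunk[1:]


tokens = [5, 1, 4, 2, 8, 3, 7]  # toy index sequence (hypothetical data)
x, y = make_example(tokens, start=0, block_size=4)

# With a causal mask, position t only sees x[0..t], so a single block
# provides block_size separate next-index prediction problems.
for t in range(len(x)):
    print(f"context {x[: t + 1]} -> target {y[t]}")
```

Because every position contributes its own loss term, one forward pass over a block trains on block_size examples at once, which is where much of the training efficiency comes from.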

The core minGPT "library" (hah) is two files: mingpt/model.py contains the actual Transformer model definition and mingpt/trainer.py is (GPT-independent) PyTorch boilerplate that trains the model. The attached Jupyter notebooks then show how the "library" (hah) can be used to train sequence models:

  • play_math.ipynb trains a GPT focused on addition (inspired by the addition section in the GPT-3 paper)
  • play_char.ipynb trains a GPT to be a character-level language model on arbitrary text, similar to my older char-rnn but with a transformer instead of an RNN
  • play_words.ipynb: a BPE version that does not yet exist

With a bpe encoder, distributed training and maybe fp16 this implementation may be able to reproduce GPT-1/GPT-2 results, though I haven't tried $$$. GPT-3 is likely out of reach as my understanding is that it does not fit into GPU memory and requires a more careful model-parallel treatment.

Example usage

This code is simple enough to just hack inline rather than "use" as a library, but the current API looks something like:

# you're on your own to define a class that returns individual examples as PyTorch LongTensors
from torch.utils.data import Dataset
train_dataset = MyDataset(...)
test_dataset = MyDataset(...)

# construct a GPT model
from mingpt.model import GPT, GPTConfig
mconf = GPTConfig(vocab_size, block_size, n_layer=12, n_head=12, n_embd=768) # a GPT-1
model = GPT(mconf)

# construct a trainer
from mingpt.trainer import Trainer, TrainerConfig
tconf = TrainerConfig(max_epochs=10, batch_size=256)
trainer = Trainer(model, train_dataset, test_dataset, tconf)
trainer.train()
# (... enjoy the show for a while... )

# sample from the model (the [None, ...] and [0] are to push/pop a needed dummy batch dimension)
import torch
from mingpt.utils import sample
x = torch.tensor([1, 2, 3], dtype=torch.long)[None, ...] # context conditioning
y = sample(model, x, steps=30, temperature=1.0, sample=True, top_k=5)[0]
print(y) # our model filled in the integer sequence with 30 additional likely integers
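
The `temperature` and `top_k` arguments above control how the next index is drawn from the model's scores. The following is a minimal plain-Python sketch of that idea, not the code in `mingpt.utils`; the function name `sample_next` and the toy logits are assumptions made for illustration.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one index from unnormalized scores (logits)."""
    # Lower temperature sharpens the distribution, higher flattens it.
    scaled = [v / temperature for v in logits]
    if top_k is not None:
        # Keep the top_k largest scores and mask out everything else.
        # (Ties at the cutoff are all kept in this simple version.)
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [v if v >= cutoff else -math.inf for v in scaled]
    # Numerically stable softmax weights; masked entries become 0.0.
    peak = max(scaled)
    weights = [math.exp(v - peak) for v in scaled]
    return rng.choices(range(len(scaled)), weights=weights, k=1)[0]


logits = [2.0, 0.5, -1.0, 3.0]  # toy scores for a 4-index vocabulary
rng = random.Random(0)
picks = {sample_next(logits, top_k=2, rng=rng) for _ in range(200)}
print(picks)  # only indices 0 and 3 can appear
```

In this sketch, `top_k=1` reduces to always picking the highest-scoring index, which is the greedy behaviour one would expect from turning sampling off.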

References

Code:

  • openai/gpt-2 has the model but not the training code, and in TensorFlow
  • openai/image-gpt has some more modern GPT-3-like modifications in its code and is a good reference as well
  • huggingface/transformers has a language-modeling example. It is full-featured but as a result also somewhat challenging to trace. E.g. some large functions have as much as 90% of their code behind various branching statements that is unused in the default setting of simple language modeling.

Papers + some implementation notes:

Improving Language Understanding by Generative Pre-Training (GPT-1)

  • Our model largely follows the original transformer work
  • We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads). For the position-wise feed-forward networks, we used 3072 dimensional inner states.
  • Adam max learning rate of 2.5e-4. (later GPT-3 for this model size uses 6e-4)
  • LR decay: increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule
  • We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens.
  • Since layernorm is used extensively throughout the model, a simple weight initialization of N(0, 0.02) was sufficient
  • bytepair encoding (BPE) vocabulary with 40,000 merges
  • residual, embedding, and attention dropouts with a rate of 0.1 for regularization.
  • modified version of L2 regularization proposed in (37), with w = 0.01 on all non bias or gain weights
  • For the activation function, we used the Gaussian Error Linear Unit (GELU).
  • We used learned position embeddings instead of the sinusoidal version proposed in the original work
  • For finetuning: We add dropout to the classifier with a rate of 0.1. learning rate of 6.25e-5 and a batchsize of 32. 3 epochs. We use a linear learning rate decay schedule with warmup over 0.2% of training. λ was set to 0.5.
  • GPT-1 model is 12 layers and d_model 768, ~117M params
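
The learning-rate schedule in the notes above (linear warmup from zero over the first 2000 updates, then cosine annealing to 0) can be sketched in a few lines of plain Python. The function name `lr_at` and the total number of updates are assumptions made for illustration; the paper notes above do not state a total step count.

```python
import math


def lr_at(step, total_steps, max_lr=2.5e-4, warmup_steps=2000):
    """Linear warmup from zero, then cosine annealing down to zero."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 100_000  # hypothetical number of updates
for s in (0, 1000, 2000, 51_000, 100_000):
    print(s, lr_at(s, total))
```

The rate peaks at `max_lr` exactly when warmup ends, passes through half of `max_lr` midway through the decay phase, and reaches zero at the final update.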

Language Models are Unsupervised Multitask Learners (GPT-2)

  • LayerNorm was moved to the input of each sub-block, similar to a pre-activation residual network
  • an additional layer normalization was added after the final self-attention block.
  • modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of 1/√N where N is the number of residual layers. (Weird, because in their released code I can only find a simple use of the old 0.02. In their release of image-gpt I found it used for c_proj, and even then only for attn, not for mlp. Huh. https://github.com/openai/image-gpt/blob/master/src/model.py)
  • the vocabulary is expanded to 50,257
  • increase the context size from 512 to 1024 tokens
  • larger batchsize of 512 is used
  • GPT-2 used 48 layers and d_model 1600 (vs. original 12 layers and d_model 768). ~1.542B params
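
As a sanity check on the parameter counts quoted for GPT-1 and GPT-2, here is a rough back-of-the-envelope estimate in plain Python. It is a sketch under stated assumptions: biases and LayerNorm parameters are ignored, the feed-forward width is taken as 4 × d_model, the output layer is assumed to share the token embedding, and GPT-1's vocabulary is approximated by its 40,000 merges. The function name `approx_params` is hypothetical.

```python
def approx_params(n_layer, d_model, vocab_size, n_ctx):
    """Rough parameter count; ignores biases and LayerNorm parameters."""
    attention = 4 * d_model * d_model            # q, k, v and output projections
    feed_forward = 2 * d_model * (4 * d_model)   # d_model -> 4*d_model -> d_model
    embeddings = (vocab_size + n_ctx) * d_model  # token + learned position tables
    return n_layer * (attention + feed_forward) + embeddings


gpt1 = approx_params(n_layer=12, d_model=768, vocab_size=40_000, n_ctx=512)
gpt2 = approx_params(n_layer=48, d_model=1600, vocab_size=50_257, n_ctx=1024)
print(f"GPT-1 estimate: {gpt1 / 1e6:.0f}M")  # 116M
print(f"GPT-2 estimate: {gpt2 / 1e6:.0f}M")  # 1557M
```

Both estimates land within a few percent of the ~117M and ~1.542B figures in the notes, which suggests that almost all of the parameters sit in the per-layer attention and feed-forward matrices.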

Language Models are Few-Shot Learners (GPT-3)

  • GPT-3: 96 layers, 96 heads, with d_model of 12,288 (175B parameters).
  • GPT-1-like: 12 layers, 12 heads, d_model 768 (125M)
  • We use the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization described therein
  • we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer
  • we always have the feedforward layer four times the size of the bottleneck layer, d_ff = 4 * d_model
  • all models use a context window of n_ctx = 2048 tokens.
  • Adam with β1 = 0.9, β2 = 0.95, and eps = 10^-8
  • All models use weight decay of 0.1 to provide a small amount of regularization. (NOTE: GPT-1 used 0.01 I believe, see above)
  • clip the global norm of the gradient at 1.0
  • Linear LR warmup over the first 375 million tokens. Then use cosine decay for learning rate down to 10% of its value, over 260 billion tokens.
  • gradually increase the batch size linearly from a small value (32k tokens) to the full value over the first 4-12 billion tokens of training, depending on the model size.
  • full 2048-sized time context window is always used, with a special END OF DOCUMENT token delimiter
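
Clipping "the global norm of the gradient at 1.0" means that all gradients are rescaled together by one common factor, rather than each tensor being clipped on its own. A minimal plain-Python sketch, using nested lists as stand-ins for parameter tensors (the function name `clip_by_global_norm` and the toy gradients are assumptions made for illustration):

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients together if their joint L2 norm exceeds max_norm."""
    global_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if global_norm <= max_norm:
        return [list(t) for t in grads], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in t] for t in grads], global_norm


grads = [[3.0, 0.0], [0.0, 4.0]]  # toy gradients for two parameter tensors
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)     # 5.0
print(clipped)  # same direction as before, joint norm now about 1.0
```

Because a single scale factor is applied everywhere, the direction of the overall update is preserved and only its length is capped.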

License

MIT

mingpt-tf's Issues

Saving checkpoints and final model

Hello, is there any way to save checkpoints to a specific file? Also, how could I save the model for future use in other TF code?

I'm still getting started with TensorFlow and Python, so please bear with me 😅!

Edit: the current setup doesn't save a checkpoint. Will it save only after each epoch?

[bug] Cast string to float unsupported

I am getting multiple errors when running the play_math file:

2023-06-30 20:04:36.575689: W tensorflow/core/framework/op_kernel.cc:1807] OP_REQUIRES failed at cast_op.cc:121 : UNIMPLEMENTED: Cast string to float is not supported
2023-06-30 20:04:36.575889: W tensorflow/core/framework/op_kernel.cc:1807] OP_REQUIRES failed at cast_op.cc:121 : UNIMPLEMENTED: Cast string to float is not supported
Traceback (most recent call last):
  File "play_math.py", line 96, in <module>
    trainer.train()
  File "/home/iccn/Desktop/minGPT-TF/mingpt/trainer.py", line 153, in train
    loss = train_step(inputs)
  File "/home/iccn/miniconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/iccn/miniconda3/envs/tf_gpu/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnimplementedError: Graph execution error:

Detected at node 'Cast' defined at (most recent call last):
    File "play_math.py", line 96, in <module>
      trainer.train()
    File "/home/iccn/Desktop/minGPT-TF/mingpt/trainer.py", line 153, in train
      loss = train_step(inputs)
    File "/home/iccn/Desktop/minGPT-TF/mingpt/trainer.py", line 115, in train_step
      per_example_losses = self.strategy.run(step_fn, args=(dist_inputs,))
    File "/home/iccn/Desktop/minGPT-TF/mingpt/trainer.py", line 112, in step_fn
      self.optimizer.apply_gradients(list(zip(grads, self.model.trainable_variables)))
    File "/home/iccn/Desktop/minGPT-TF/mingpt/optimization.py", line 71, in apply_gradients
      zip(grads, tvars),
    File "/home/iccn/miniconda3/envs/tf_gpu/lib/python3.8/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1140, in apply_gradients
      return super().apply_gradients(grads_and_vars, name=name)
    File "/home/iccn/miniconda3/envs/tf_gpu/lib/python3.8/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 632, in apply_gradients
      self._apply_weight_decay(trainable_variables)
    File "/home/iccn/miniconda3/envs/tf_gpu/lib/python3.8/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1159, in _apply_weight_decay
      tf.__internal__.distribute.interim.maybe_merge_call(
    File "/home/iccn/miniconda3/envs/tf_gpu/lib/python3.8/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1155, in distributed_apply_weight_decay
      distribution.extended.update(
    File "/home/iccn/miniconda3/envs/tf_gpu/lib/python3.8/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1151, in weight_decay_fn
      wd = tf.cast(self.weight_decay, variable.dtype)
Node: 'Cast'
2 root error(s) found.
  (0) UNIMPLEMENTED:  Cast string to float is not supported
	 [[{{node Cast}}]]
  (1) CANCELLED:  Function was cancelled before it was started
0 successful operations.
0 derived errors ignored. [Op:__inference_train_step_33982]

What can I do to resolve them?
