GPT-2

Recreating GPT-2 on my own at first, then pulling in optimizations from Andrej Karpathy's final YouTube video in the Zero to Hero deep learning series (from commit cc0a0c606d6c8de9a7cb4c0e7751d1d38c318563 onwards).

Setup

Dependencies

Set up a Python environment and install the requirements:

python3 -m venv .
source ./bin/activate
pip3 install -r requirements.txt

Datasets

Then download and prepare the training and validation datasets:

python3 ./data/prepare.py
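
The exact behavior of prepare.py lives in that script, but the core idea — encode raw text with the GPT-2 BPE and write flat binary token arrays — can be pictured with the minimal sketch below. The input path, 90/10 split, and .bin output names are assumptions for illustration, not necessarily what this repo does.

# minimal sketch: encode raw text into uint16 token arrays for training/validation
# (input path, 90/10 split, and .bin output names are assumptions)
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")

with open("data/input.txt", "r", encoding="utf-8") as f:
    text = f.read()

tokens = enc.encode_ordinary(text)             # plain BPE ids, no special tokens
split = int(0.9 * len(tokens))                 # hold out the last 10% for validation

np.array(tokens[:split], dtype=np.uint16).tofile("data/train.bin")   # vocab < 2**16, so uint16 fits
np.array(tokens[split:], dtype=np.uint16).tofile("data/val.bin")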

Training

You can train the model by calling:

python3 ./train.py

Or with DDP (highly recommended if you have multiple GPUs):

# DDP on 4 gpus on 1 node (for example)
torchrun --standalone --nproc_per_node=4 train.py

Note that this, by default, loads the training checkpoint located in out/*.pt. If there is no training checkpoint, it starts training the model from scratch.
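
In other words, startup behaves roughly like the sketch below. This is a minimal sketch with a stand-in model; the actual checkpoint keys and paths in train.py may differ.

# minimal sketch of resume-or-scratch logic (checkpoint keys/paths are assumptions)
import glob
import torch
import torch.nn as nn

model = nn.Linear(8, 8)                                    # stand-in for the GPT built in train.py
optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-4)

ckpts = sorted(glob.glob("out/*.pt"))
if ckpts:
    checkpoint = torch.load(ckpts[-1], map_location="cpu")
    model.load_state_dict(checkpoint["model"])             # resume model weights
    optimizer.load_state_dict(checkpoint["optimizer"])     # resume optimizer state
# if no checkpoint exists, the freshly initialized weights are kept and training starts from scratch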

Sampling

Sample from the model by calling:

# with no prompt and default max tokens
python3 ./sample.py

# with a prompt
python3 ./sample.py -p "Hello, I'm a language model,"

# with a prompt and setting the maximum tokens to 500
python3 ./sample.py -p "Hello, I'm a language model," -m 500

Note that this, by default, loads the committed checkpoint located in checkpoint/*.pt. If there is no committed checkpoint, it will sample from an untrained model.
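
Under the hood, sampling boils down to encoding the prompt with tiktoken and appending one sampled token at a time. A minimal sketch follows, with a random stand-in in place of the real model loaded from checkpoint/*.pt:

# minimal sketch of autoregressive sampling (the real script loads a GPT from checkpoint/*.pt)
import torch
import tiktoken

enc = tiktoken.get_encoding("gpt2")

def model(idx):                                  # stand-in: returns random logits of shape (B, T, vocab)
    return torch.randn(idx.size(0), idx.size(1), enc.n_vocab)

prompt = "Hello, I'm a language model,"
idx = torch.tensor([enc.encode_ordinary(prompt)], dtype=torch.long)   # shape (1, T)

for _ in range(500):                             # -m 500: maximum new tokens
    logits = model(idx)[:, -1, :]                # logits at the last position
    probs = torch.softmax(logits, dim=-1)
    next_id = torch.multinomial(probs, num_samples=1)
    idx = torch.cat([idx, next_id], dim=1)       # append the sampled token and continue

print(enc.decode(idx[0].tolist()))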

Build Details

Links

Data

GPT-2 was trained on the WebText dataset. That dataset is internal to OpenAI, so I will be using the OpenWebText dataset instead. You can find all data in the data directory.

Notably, the WebText dataset was scraped with the following constraints:

  • All outbound links from Reddit posts with at least 3 karma
  • All posts up until December 2017
  • ~8 million documents total
  • ~40GB of text
  • Removal of all Wikipedia documents and links, since Wikipedia is "a common data source for other datasets and could complicate the analysis due to overlapping training data with test evaluation tasks".

OpenAI avoided Common Crawl to sidestep the data-quality issues they would otherwise have had to surmount. Their main aim was to show that unsupervised learning on a large corpus could lead to meta-learning across multiple tasks.

Tokenization

OpenAI leveraged BPE (byte pair encoding) over raw UTF-8 byte sequences, rather than Unicode code points, to represent the text data. They tokenized on sub-word groupings with a vocab size of 50,257, and applied additional pre-processing steps to prevent things like BPE merging across character categories for any byte sequence.

Since the aim of this project is to just recreate the core of GPT-2, I will be leveraging tiktoken instead of implementing and training the tokenizer from scratch. This should also allow me to download the open source weights and know that my model can interop with whatever setup OpenAI used internally.
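
For reference, the GPT-2 encoding comes straight out of tiktoken and already exposes the 50,257-token vocabulary:

import tiktoken

enc = tiktoken.get_encoding("gpt2")              # the byte-level BPE used by GPT-2
print(enc.n_vocab)                               # 50257
ids = enc.encode("Hello, I'm a language model,") # sub-word token ids
print(enc.decode(ids))                           # round-trips back to the original string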

Model Architecture

GPT-2 largely follows the GPT-1 architecture, which consists of:

  • 12-layer decoder-only transformer
  • Masked self-attention with 768-dimensional states and 12 attention heads
  • Position-wise feed-forward networks with a 3072-dimensional inner state
  • Adam optimizer with a learning rate of ~2.5e-4
  • Dropout with a rate of 0.1 at the residual, embedding, and attention layers
  • A modified version of L2 regularization with w=0.01 on all non-bias or gain weights
  • GELU activation functions

With some modifications (summarized in the config sketch after this list):

  • LayerNorm was moved to the input of each sub-block
  • An additional LayerNorm was added after the final self-attention block
  • Modified initialization that accounts for accumulations on the residual path with model depth
  • Scaled weights of the residual layers by a factor of 1/sqrt(N) where N is the number of residual layers
  • Context size of 1024
  • Batch size of 512
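
Put together, the GPT-2 small hyperparameters above correspond to a configuration roughly like this. This is a minimal sketch; the field names are illustrative and may not match the ones used in train.py.

from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024     # context size
    vocab_size: int = 50257    # GPT-2 BPE vocabulary
    n_layer: int = 12          # decoder-only transformer blocks
    n_head: int = 12           # attention heads per block
    n_embd: int = 768          # hidden size (feed-forward inner dim is 4 * n_embd = 3072)
    dropout: float = 0.1       # residual, embedding, and attention dropout

# at init, residual-projection weights are additionally scaled by 1/sqrt(N),
# where N is the number of residual layers, per the GPT-2 modifications above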
