
GPT Neo


As of August 2021, this code is no longer maintained. It is preserved here in archival form for people who wish to continue to use it.

🎉 1T or bust my dudes 🎉

An implementation of model- and data-parallel GPT-3-like models using the mesh-tensorflow library.

If you're just here to play with our pre-trained models, we strongly recommend you try out the HuggingFace Transformers integration.
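For example, a minimal way to sample from one of the released checkpoints through the transformers library (the Hub model names EleutherAI/gpt-neo-1.3B and EleutherAI/gpt-neo-2.7B are assumed here, and the generation arguments are only illustrative):

from transformers import pipeline

# Download the 1.3B GPT-Neo checkpoint from the HuggingFace Hub and build a generation pipeline.
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

# Sample a short continuation from a prompt.
output = generator("EleutherAI is", do_sample=True, max_length=50)
print(output[0]["generated_text"])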

Training and inference are officially supported on TPU and should also work on GPU. This repository will be (mostly) archived as we move focus to our GPU-specific repo, GPT-NeoX.

In addition to the functionality offered by GPT-3, we also offer the following: local and linear attention, mixture of experts, and axial positional embeddings (see the Parameter Reference section for details).

NB: while Neo can technically run a training step at 200B+ parameters, it is very inefficient at those scales. This, as well as the fact that many GPUs became available to us, among other things, prompted us to move development over to GPT-NeoX.

Pretrained Models

Update 21/03/2021:

We're proud to release two pretrained GPT-Neo models trained on The Pile; the weights and configs can be freely downloaded from the-eye.eu.

1.3B: https://mystic.the-eye.eu/public/AI/gptneo-release/GPT3_XL/

2.7B: https://mystic.the-eye.eu/public/AI/gptneo-release/GPT3_2-7B/

For more information on how to get these set up, see the Colab notebook, or read through the rest of the README.

Model Evaluations

Linguistic Reasoning

| Model and Size | Pile BPB | Pile PPL | Wikitext PPL | Lambada PPL | Lambada Acc | Winogrande | Hellaswag |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-Neo 125M | ----- | ----- | 32.285 | 30.266 | 37.36% | 50.43% | 28.67% |
| GPT-3 125M | ----- | ----- | ----- | 18.6 | 42.7% | 52.0% | 33.7% |
| GPT-Neo 350M | ----- | ----- | 22.5657 | 13.876 | 47.27% | 51.14% | 32.16% |
| GPT-3 350M | ----- | ----- | ----- | 9.09 | 54.3% | 52.1% | 43.6% |
| GPT-3 Ada | 0.9631 | ----- | ----- | 9.954 | 51.60% | 52.90% | 35.93% |
| GPT-Neo 1.3B | 0.7527 | 6.159 | 13.10 | 7.498 | 57.23% | 55.01% | 38.66% |
| GPT-3 1.3B | ----- | ----- | ----- | 5.44 | 63.6% | 58.7% | 54.7% |
| GPT-2 1.5B | 1.0468 | ----- | 17.48 | 10.634 | 51.21% | 59.40% | 40.03% |
| GPT-Neo 2.7B | 0.7165 | 5.646 | 11.39 | 5.626 | 62.22% | 56.50% | 42.73% |
| GPT-3 2.7B | ----- | ----- | ----- | 4.60 | 67.1% | 62.3% | 62.8% |

Physical and Scientific Reasoning

| Model and Size | MathQA | PubMedQA | Piqa |
| --- | --- | --- | --- |
| GPT-Neo 125M | 22.78% | 55.10% | 63.06% |
| GPT-3 125M | ----- | ----- | 64.6% |
| GPT-Neo 350M | 23.45% | 53.80% | 65.07% |
| GPT-3 350M | ----- | ----- | 70.2% |
| GPT-3 Ada | 24.29% | 52.80% | 68.88% |
| GPT-Neo 1.3B | 24.05% | 54.40% | 71.11% |
| GPT-3 1.3B | ----- | ----- | 75.1% |
| GPT-2 1.5B | 23.64% | 58.33% | 70.78% |
| GPT-Neo 2.7B | 24.72% | 57.54% | 72.14% |
| GPT-3 2.7B | ----- | ----- | 75.6% |

Note: All evaluations were done using our evaluation harness. Some results for GPT-2 and GPT-3 are inconsistent with the values reported in the respective papers. We are currently looking into why, and would greatly appreciate feedback and further testing of our eval harness.

Setup

git clone https://github.com/EleutherAI/GPTNeo
cd GPTNeo
pip3 install -r requirements.txt

Training Setup

TPUs:

Sign up for Google Cloud Platform, and create a storage bucket.

Create your VM through a google shell (https://ssh.cloud.google.com/) with ctpu up --vm-only so that it can connect to your Google bucket and TPUs and install the requirements with pip (see above).

Google Colab provides TPU-v8s for free, which should be enough to finetune our models up to GPT3XL (1.5B parameter) sizes. Click Open In Colab to run through our example Colab notebook.

For more detailed instructions, run through our Training Guide below.

GPUs:

You can also choose to train GPT-Neo locally on your GPUs. To do so, you can omit the Google Cloud setup steps above and git clone the repo locally. Run through the Training Guide below, then when running main.py, simply omit the tpu flag and pass in GPU ids instead.

Note: Some users have reported having difficulty getting MTF to recognize their GPUs. See here for details and instructions on how to fix it.

Generating Text

Once you have a trained model, or you've downloaded one of our pre-trained models, generating text is as simple as running the main.py script with the --predict flag on. You can pass a path to your prompt txt file with the --prompt flag, like so:

python3 main.py --predict --prompt <example_prompt.txt> --tpu <tpu_name> --model <config_name>

or, if using GPUs:

python3 main.py --predict --prompt <example_prompt.txt> --gpu_ids <device:GPU:0 device:GPU:1> --model <config_name>

Training Guide

1. Create your Tokenizer (OPTIONAL)

We recommend you use Huggingface's pretrained GPT2 tokenizer with our repo (instructions provided below), but if you want to train a model with a different vocabulary size, we provide facilities to train your own tokenizer like so:

python data/train_tokenizer.py \
    --base_dir ./path/to/your/txt/files \
    --output_dir ./output/path \
    --file_type txt \
    --vocab_size 50257

# if it succeeded, you should see the message
# 'tokenizer saved at ./output/path/byte-level-bpe.tokenizer.json'
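To sanity-check the result, the saved file can be loaded back with the HuggingFace tokenizers library; a minimal sketch, assuming the default output path from the command above:

from tokenizers import Tokenizer

# Load the trained byte-level BPE tokenizer and round-trip a sample string.
tokenizer = Tokenizer.from_file("./output/path/byte-level-bpe.tokenizer.json")
ids = tokenizer.encode("Hello world").ids
print(ids)
print(tokenizer.decode(ids))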

2. Tokenizing your Dataset

If you just want to test training, you can skip this step and download some dummy data like so:

wget https://storage.googleapis.com/connors-datasets/bundestag/bundestag_0.tfrecords

Then copy the data to your bucket, or if using GPUs, a local directory:

gsutil cp bundestag_0.tfrecords gs://<your bucket>/

If using your own data to train, you can use the data/create_tfrecords.py script to encode your text data into tfrecords.

Your data must either be in the form of lots of normal .txt files (one document per file), or in any format supported by lm_dataformat.

You can run the script without parameters to see help for all options.

In document mode, each example in the tfrecords is one (variably sized) document. This is to be used with the documents_fixed and documents_random sampling modes (for more details see the parameters reference section). Document mode is the default mode.

The below command will tokenize all files in acceptable formats in input_dir using the GPT2 tokenizer and save them to output_dir:

python3 create_tfrecords.py --mode documents --input_dir <base> --name <name> --output_dir <output> --use_gpt2_tokenizer --minimum_size <min> 
  • input_dir: Defines the folder where your data is located. The script will encode all files present in this folder.
  • name: Name of output files will be name_i.tfrecords where i is the number of the file.
  • output_dir: Where to save the tfrecords to
  • use_gpt2_tokenizer: Whether to use the pretrained HuggingFace GPT2 tokenizer, in which case the separator will be set to [50256].
  • encoder_path: if not using the pretrained gpt2 tokenizer, use this flag to provide a path to your generated tokenizer json.
  • separator: Written in list format, the separator token(s) to insert between documents (e.g. "[0]"). Will depend on your encoder.
  • minimum_size: The minimum size (in tokens) a document must have, otherwise it is discarded. This is what will later determine your stitch parameter: stitch * minimum_size must always be greater than or equal to n_ctx (for more details see the parameters reference section, and the sanity check below).
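As a quick sanity check of the stitch / minimum_size relationship, here is a small sketch with illustrative values (n_ctx comes from your model config, stitch from the datasets entry described below):

n_ctx = 2048          # context window from the model config
minimum_size = 100    # value passed to --minimum_size above
stitch = 25           # value used in the "datasets" entry of the model config

# stitch documents of at least minimum_size tokens each must be able to fill one context window
assert stitch * minimum_size >= n_ctx, "increase stitch or minimum_size"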

4. Using a Dataset in a Model

To use a dataset in a model, you must first register that dataset under the ./configs/dataset_configs folder. First choose a filename with a .json extension. That filename will serve as the dataset identifier. The config should be filled out in the following manner.

If you have a dataset encoded using the pretrained gpt2 tokenizer, you can specify that like so:

{
    "n_vocab": 50257,
    "path": "gs://neo-datasets/openwebtext-documents/openwebtext_*.tfrecords",
    "eval_path": "gs://neo-datasets/openwebtext-documents/openwebtext_*.tfrecords",
    "tokenizer_is_pretrained": true,
    "tokenizer_path": "gpt2"
}

or if you've trained a custom tokenizer, like so:

{
    "n_vocab": 32768,
    "path": "./path/to/your/*.tfrecords",
    "eval_path": "./path/to/your/eval/*.tfrecords",
    "tokenizer_path": "./path/to/your/byte-level-bpe.tokenizer.json"
}

Finally, in your model config, add the filename that you created above to the datasets array.

The <dataset id> will be the filename, excluding the .json, that you created above

"datasets": [[<dataset id>, <stitch>, <datatype>, <weight>]] # datasets key defines at run time how each dataset is processed for training

5. Choose a model configuration

Once you have your datasets set up, find a suitable config in /configs.

Here we use a GPT3-XL sized model as an example, but there are many more in ./configs, all of which have short summaries in the Available Configs section.

All you need to do is edit the dataset id as described above, and edit model_path (where logs and checkpoints will be saved) to point to a cloud bucket you have write access to (or local path, if using GPUs).

{
    "n_head": 32,
    "n_vocab": 50257,
    "embed_dropout": 0.1,
    "lr": 0.0002,
    "lr_decay": "cosine",
    "warmup_steps": 3000,
    "beta1": 0.9,
    "beta2": 0.95,
    "epsilon": 1e-8,
    "opt_name": "adam",
    "weight_decay": 0.1,
    "train_batch_size": 512,
    "attn_dropout": 0.1,
    "train_steps": 286150,
    "eval_steps": 0,
    "predict_steps": 1,
    "res_dropout": 0.1,
    "eval_batch_size": 128,
    "predict_batch_size": 1,
    "iterations": 2500,
    "n_embd": 2048,
    "datasets": [["your_dataset_name", 25, "documents_random", 1.0]],
    "model_path": "gs://neo-models/GPT3_XL",
    "n_ctx": 2048,
    "n_layer": 24,
    "scale_by_depth": true,
    "scale_by_in": false,
    "attention_types" :  [[["global"],24]],
    "mesh_shape": "x:128,y:2",
    "layout": "batch:x,memory_length:y,embd:y",
    "activation_function": "gelu",
    "recompute_grad": true,
    "gradient_clipping": 1.0,
    "tokens_per_mb_per_replica": 2048
}

6. Run Training

python3 main.py --model <your_config_name> --steps_per_checkpoint <n> --tpu <tpu-name>
  • tpu: Name of the TPU to use.
  • steps_per_checkpoint: The frequency in steps at which to save checkpoints.
  • --auto_layout and --auto_layout_and_mesh_shape (Optional): Disable training and instead auto generate a memory efficient layout (and mesh_shape)
  • gpu_ids: if training using GPUs, omit the tpu flag and pass in the ids of your gpus. In the example below, we train on 3 GPUs, specifying their device ids delimited by spaces:
python3 main.py --model <your_config_name> --steps_per_checkpoint <n> --gpu_ids <device:GPU:0 device:GPU:1>

Available Configs

We have several model sizes available, but some of our configs require large TPUs and will need tweaking to run on smaller machines, or GPUs. Below is a short guide to each model in the configs directory:

TODO

Extra Features:

Training (with Sacred)

Sacred helps track experiments and is much nicer to work with than tensorboard.

To setup:

  1. Install Docker and Docker-compose

  2. Run docker-compose up

To use:

  1. Ensure model_dir doesn't have any metric logs in it (it trips up the metric stuff for tensorboard, which assumes that it's a continuation of the existing run). You can use gsutil rm -r ... to delete model dir

  2. Run python3 run_experiment.py --tpu sometpuhere --model someconfig.json Options are the same as main.py.

  3. You can go to http://server_ip_goes_here:8081/ to see the Omniboard overview. If you prefer to see a tensorboard, the script also spins one up and automatically assigns it a port. The script should print out the tensorboard port near the top of the log.

Peeking at a Dataset

If you are ever confused by the dataset of a particular config file, you can easily check the minimum and maximum token ids with a single command. This is useful for making sure that the vocabulary size of the model is at least as large as the maximum token id. Tensorflow will not error if you try to gather on a matrix with out of bounds indices, so you need to make sure your vocabulary size is sufficiently large.

python3 main.py --model <config_name> --check_dataset
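If you'd rather inspect a tfrecords file directly, something like the sketch below also works. Note this assumes each example stores its token ids in an int64 feature named "text"; the actual feature name depends on how your tfrecords were written, so treat this as illustrative:

import tensorflow as tf

# Scan one tfrecords file and report the smallest and largest token id found.
lo, hi = float("inf"), float("-inf")
for record in tf.data.TFRecordDataset("bundestag_0.tfrecords"):
    example = tf.train.Example()
    example.ParseFromString(record.numpy())
    ids = example.features.feature["text"].int64_list.value  # assumed feature name
    lo, hi = min(lo, min(ids)), max(hi, max(ids))

print(f"min token id: {lo}, max token id: {hi}")  # n_vocab must be larger than the max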

Masked Language Modeling

In addition to being able to train large GPTs, this repository also allows you to easily do masked language modeling (BERT, RoBERTa). In order to do so, you must follow two additional steps.

  1. When tokenizing your dataset, you must reserve a special id for the [mask] token.

  2. In the configs, you will have to define two additional fields

"mlm_training": true,                           # must be set to true
"mlm_mask_id": <mask id>                        # the mask id that you reserved from above

That's all you need to train a model with the MLM objective, good for any type of data that you have encoded properly. If you would like to tweak the other related hyperparameters, please continue reading.

"mlm_cls_token_id": <cls token id>,                # auto append specified CLS token id on the left
"mlm_mask_prob": 0.15,                             # the probability of masking a token, defaults to 15%
"mlm_same_token_prob": 0.10,                       # probability of keeping the token the same, defaults to 10%
"mlm_random_token_prob": 0.10,                     # probability of tokens that are replaced with random tokens, 10% was recommended by the BERT paper
"mlm_mask_ignore_ids": [<cls token>, <sep token>]  # ignore masking other special tokens, if any

Parameter Reference

Pick a valid config from /configs and tweak the parameters as needed:

  • n_head: The number of attention heads.
  • n_embd: Size of the hidden layers, must be divisible by n_head.
  • n_vocab: Vocabulary size.
  • embed_dropout, res_dropout, attn_dropout: Dropout probability for word embedding/residuals/attention
  • lr: Learning rate
  • warmup_steps: Number of steps before full learning rate is reached (linear ramp from 0 to lr).
  • lr_decay: cosine or linear.
  • opt_name: adam or adafactor.
  • beta1, beta2 and epsilon: adam optimizer params.
  • beta1, ada_epsilon1 and ada_epsilon2: adafactor optimizer params.
  • weight_decay: Weight decay parameter, if not present no weight decay is used (the weight decay fix for Adam is used) (default: 0.01) (optional).
  • train_batch_size: Batch size during training.
  • train_steps: Number of training steps (batches); set to roughly ~1 epoch for now (total number of tokens in your dataset / number of tokens per batch, where tokens per batch = train_batch_size * n_ctx). See the worked example after this list.
  • eval_steps: Number of steps to run for each evaluation. Set to 0 for no eval; i.e., after every checkpoint, the model is evaluated for eval_steps steps.
  • iterations: Number of steps queued to the TPU, must be smaller than steps_per_checkpoint. (default: 500)
  • datasets: List of tfrecords datasets to use. Each dataset is a list with the following parameters: [train glob , eval glob, stitch, sampling_mode, weight]. So for example for a single dataset (note the double list): [["bundestag_*.tfrecords", "", 10, "random_sample", 1.0]]
    • dataset_id: The name of a dataset configuration file in ./configs/dataset_configs
    • stitch: If sampling_mode random_sample is used, the input pipeline samples this amount of texts into one to sample from. You must select stitch so that stitch * minimum_document_length >= n_ctx
    • sampling_mode: chunks (tfrecords are preprocessed into the correct length and are read sequentially) or documents_random (stitch amount of documents are concatenated and then a n_ctx chunk is randomly subsampled)
    • weights: How much relative weight this dataset should have compared to others
  • model: Which model to train. Currently only GPT is supported, and it defaults to this if not present.
  • model_path: Google storage bucket location (or local path, if using GPUs) to save model checkpoints and logs.
  • n_ctx: Size of context window. Default is 2048
  • n_layer: Number of layers (blocks) in the model.
  • scale_by_depth: If true, the weight initialization of layers are scaled by their depth as in the GPT2 paper.
  • scale_by_in: If true, the weight initialization of layers are scaled by their number of inputs as in the GPT2 paper.
  • mesh_shape: A Mesh is an n-dimensional array of processors with named dimensions used for parallelism in the mesh-tensorflow library. Each Tensor is split evenly across mesh dimensions according to the layout (see below). The 'mesh_shape' is the shape of this array, and its total size must equal the number of processors. E.g., for a v3-128 TPU: "mesh_shape": "x:16,y:8".
  • layout: A Tensor is laid out on its mesh with one slice on each processor. A Tensor's "layout" is an injective partial map specifying which dimensions of the tensor are (evenly) split across which dimensions of the mesh. No dimension of a tensor may be split across two dimensions of its mesh, and no two dimensions of a tensor may be split across the same dimension of its mesh. The user defines a global set of layout rules in the form of (tensor-dimension-name, mesh-dimension-name) pairs. A dimension of a tensor is split across a dimension of its mesh if there is a matching rule, e.g. (for the above example mesh_shape): "layout": "batch:x,heads:y".
  • activation_function: selu (self normalizing) or gelu (used by OA), activation function used in feed-forward passes. (default: gelu)
  • attention_types: the type of attention for each layer in a list of the following format [[["attention_type"], n_layers]]. e.g. for a 12 layer net [[["global"], 12]] or [[["local"], 10], [["global"], 2]].
    • Choose from: linear, global, local or none. We have found a 50/50 mix of global and linear to work well. none allows you to create feed-forward only layers for more efficient PAR Transformer models.
  • precision: float32 or bfloat16.
  • tokens_per_mb_per_replica: If not None, will split the batch up into smaller microbatches containing tokens_per_mb_per_replica tokens to avoid OOMs. Gradients are accumulated locally and reduced once. IMPORTANT: mb refers to microbatch, not megabyte here.
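As a worked example of the train_steps arithmetic referenced above (the dataset size here is an assumption, used only for illustration):

train_batch_size = 512                       # sequences per batch, from the GPT3_XL example config
n_ctx = 2048                                 # tokens per sequence
tokens_per_batch = train_batch_size * n_ctx  # 1,048,576 tokens consumed per step
dataset_tokens = 300_000_000_000             # assumed ~300B-token dataset
train_steps = dataset_tokens // tokens_per_batch
print(train_steps)                           # ~286,000 steps for one epoch

This is in the same ballpark as the 286150 train_steps in the GPT3_XL example config above.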

Mixture of Experts

  • moe_layers: A list of layer numbers to append a mixture of experts layer onto, e.g. [2,4,6,8,10,12]. We have experimentally found an MoE layer for every two self-attention layers to work well.
  • moe_params: A dictionary of additional kwargs to pass in to the MoE layer, e.g. {"moe_dropout_rate": 0.0}.

Experimental features

  • axial_pos_emb_: If true, uses axial positional embedding (https://arxiv.org/abs/1912.12180).
  • mlp_glu: If true, uses a gated linear unit variant of feed forward layers.
  • scalenorm: If true, uses scalenorm instead of layernorm.
  • rezero: If true, uses rezero instead of layernorm.
  • num_mem_kv: adds memory / key values from the all-attention paper. Param is an int with the number of desired mem/key values.
  • macaron: If true, uses a macaron transformer for each layer block.

TODO:

  • finalize documentation
  • update configs

Citing GPT-Neo

If you have found GPT-Neo helpful in your work, you can cite this repository as

@software{gpt-neo,
  author       = {Black, Sid and
                  Gao, Leo and
                  Wang, Phil and
                  Leahy, Connor and
                  Biderman, Stella},
  title        = {{GPT-Neo: Large Scale Autoregressive Language 
                   Modeling with Mesh-Tensorflow}},
  month        = mar,
  year         = 2021,
  note         = {{If you use this software, please cite it using 
                   these metadata.}},
  publisher    = {Zenodo},
  version      = {1.0},
  doi          = {10.5281/zenodo.5297715},
  url          = {https://doi.org/10.5281/zenodo.5297715}
}

The version number should be replaced with the version number you are using, and the year corresponds to the project's open-source release.

If you are specifically interested in citing the GPT-Neo models trained on the Pile, we would appreciate also citing

@article{gao2020pile,
  title={The Pile: An 800GB Dataset of Diverse Text for Language Modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}


gpt-neo's Issues

GPT3XL training

It's not clear to me how to train the GPT3XL via GPU/Colab.
Could you add more details?

Thank you.

Pin Specific Requirement for lm_dataformat

For some reason, multithreading isn't playing nice with the latest version of lm_dataformat and throws an error when using create_tfrecords.py

"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "data/create_tfrecords.py", line 230, in create_file
    for d in data:
  File "data/create_tfrecords.py", line 221, in _archive_to_files
    for s in g:
  File "/usr/local/lib/python3.6/dist-packages/lm_dataformat/__init__.py", line 119, in stream_data
    p.start()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 103, in start
    'daemonic processes are not allowed to have children'
AssertionError: daemonic processes are not allowed to have children
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "data/create_tfrecords.py", line 278, in <module>
    for i in tqdm(pool.imap(create_file, enumerate(file_chunks)), total=len(file_chunks)):
  File "/usr/local/lib/python3.6/dist-packages/tqdm/std.py", line 1104, in __iter__
    for obj in iterable:
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
AssertionError: daemonic processes are not allowed to have children
  0% 0/104 [00:00<?, ?it/s]

The last version I was able to test and validate working is

pip install git+git://github.com/leogao2/lm_dataformat.git@f77f6e86d36306f0ecd4a8b47ee8223d5e491b4a

Great work so far - really excited to see GPTNeo progress!

Failed to install requirements

I got this error when installing the requirements. Can anyone help?

$ pip3 install -r requirements.txt
...
  Could not find a version that satisfies the requirement tensorflow==2.4.0 (from -r requirements.txt (line 10)) (from versions: 1.13.0rc1, 1.13.0rc2, 1.13.1, 1.13.2, 1.14.0rc0, 1.14.0rc1, 1.14.0, 2.0.0a0, 2.0.0b0, 2.0.0b1)
No matching distribution found for tensorflow==2.4.0 (from -r requirements.txt (line 10))

My system is: Debian 10.8 64bit, Python 3.7.3.

Fix Dropout

We should make sure all dropouts are working as expected. We had a problem with embd_dropout in particular.

Mainly opening this issue for documentation purposes as I believe I have this fixed already, I'm just running a test to see if my fix has worked.

Colab breaks when tokenizing NIH Dataset

Traceback (most recent call last):
File "data/create_tfrecords.py", line 205, in
results = create_tfrecords_mp(files, args)
File "data/create_tfrecords.py", line 186, in create_tfrecords_mp
files = split_list(files, len(files) // args.processes)
File "data/create_tfrecords.py", line 67, in split_list
return [l[i:i+n] for i in range(0, len(l), n)]
ValueError: range() arg 3 must not be zero

I'll look into why it breaks

UnboundLocalError: local variable 'skip_idx' referenced before assignment

Describe the bug
UnboundLocalError: local variable 'skip_idx' referenced before assignment

To Reproduce

  1. Trying running the example notebook in Colab
  2. Get to the training part (where main.py is called)
  3. Run into the error that ends in:
File "/content/GPTNeo/inputs.py", line 346, in sequential_input
    skip_idx, remainder = _get_skip_index(filenames, n_batches=global_step * params["train_batch_size"]) # TODO: fix for > 1 epoch
  File "/content/GPTNeo/inputs.py", line 295, in _get_skip_index
    return skip_idx, remainder
UnboundLocalError: local variable 'skip_idx' referenced before assignment

Expected behavior
Clean training on TPUs

Proposed solution
As I understand it, adding a line like count = 0 to inputs.py, line 279 should solve the problem.

Experiment Plan

Overview [DRAFT]

We have few resources available, so it would be good to have an experiment plan that lets us use them as fully and quickly as possible.
In the last few days there have been a few structural changes that should push the model to achieve new results.

Proposed Plan

Qualitative experiments

  • run a baseline test to determine the best lambada score we can get with a 256 setup

Efficiency experiments

  • Determine the best configuration that leverages the MOE and BFLOAT16 setup.

Efficient Sampling

Currently our sampling is incredibly inefficient, doesn't store past values for k / v, and instead recomputes them for every token.

We should look again at how sampling is done in mesh / T5 (maybe ask Colin?) and see if we can store k/v values to increase the efficiency of our sampling code.

The basic infrastructure for this is already in place, but commented out (the k/v values will be stored in this Context object: https://github.com/EleutherAI/GPTNeo/blob/master/sample.py#L148, https://github.com/EleutherAI/GPTNeo/blob/master/models/gpt2/gpt2.py#L138). The problem we ran into last time was that in the mesh code they seem to feed in the inputs a token at a time, but the token is missing a batch dimension, and we're not exactly sure where batch gets added on, and therefore how to replicate this. (see: https://github.com/tensorflow/mesh/blob/a3c05f705641dfe144f70b7b5230db4933ce8ca9/mesh_tensorflow/transformer/transformer.py#L1137)

This is (the last?) issue I'd like to get solved before the code is released publicly, so I will try to look into this soon, but any help would be appreciated.

Wrong mesh_shape

Trying out the notebook on colab TPUs with both "mesh_shape":"x:4,y:2" and "mesh_shape":"all:8" led to the following error:

Traceback (most recent call last):
  File "main.py", line 256, in <module>
    main(args)
  File "main.py", line 230, in main
    estimator.train(input_fn=partial(input_fn, global_step=current_step, eval=False), max_steps=next_checkpoint)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3130, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 150, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3125, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1204, in _train_model_default
    self.config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2962, in _call_model_fn
    config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3242, in _model_fn
    input_holders.generate_infeed_enqueue_ops_and_dequeue_fn())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1484, in generate_infeed_enqueue_ops_and_dequeue_fn
    self._invoke_input_fn_and_record_structure())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1547, in _invoke_input_fn_and_record_structure
    enqueue_ops.append(wrap_fn(device=host_device, op_fn=enqueue_ops_fn))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3771, in _wrap_computation_in_while_loop
    parallel_iterations=1)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2774, in while_loop
    return_same_structure)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2256, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2181, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3760, in computation
    with tf.control_dependencies(op_fn()):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1217, in enqueue_ops_fn
    placement_function=device_function_impl)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/tpu_feed.py", line 631, in generate_enqueue_ops
    for (shard, index) in zip(sharded_inputs, xrange(self.number_of_shards))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/tpu_feed.py", line 631, in <listcomp>
    for (shard, index) in zip(sharded_inputs, xrange(self.number_of_shards))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1162, in tpu_ordinal_function_impl
    if ctx.device_assignment:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 420, in device_assignment
    if self._model_parallelism_enabled else None)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 353, in _get_device_assignment
    num_replicas=self.num_replicas)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/device_assignment.py", line 374, in device_assignment
    topology = Topology(serialized=topology)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/topology.py", line 80, in __init__
    self._parse_topology(serialized)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/topology.py", line 111, in _parse_topology
    "entries; got {}".format(self._mesh_shape))
ValueError: `mesh_shape` must be a vector of size 4 with positive entries; got [2 2 2]

ETA: running sudo apt update && sudo apt -y upgrade solved the problem. But I guess it has something to do with the versions of tensorflow and mesh-tensorflow.

How to train GPT2 on multiple nodes and multiple GPUs

What should I pass for gpu_ids when training on multiple nodes, each with multiple GPUs? Is the below right?
node0 cmd: python3 main.py --model <your_config_name> --steps_per_checkpoint --gpu_ids 0,1,2,3
node1 cmd: python3 main.py --model <your_config_name> --steps_per_checkpoint --gpu_ids 0,1,2,3
node2 cmd: python3 main.py --model <your_config_name> --steps_per_checkpoint --gpu_ids 0,1,2,3

BFloat16

The current attention implementation uses float32. This is not efficient on TPUs.

Task: Change the logic to leverage the TPU's MXU units.

Make sure we handle short text correctly

As a cleaning heuristic, documents with fewer than 128 tokens were removed from the dataset. These shorter documents tended to be lower quality, as determined by text coherence. We release this dataset as the OpenWebTextCorpus

source

Colab notebook create_tfrecords.py: unknown argument "--mode documents"

Describe the bug
The argument parser seems to no longer support the "--mode documents" from the Colab notebook

To Reproduce
Steps to reproduce the behavior:

  1. Follow the Colab notebook

Proposed solution
Remove the argument in the Colab tutorial notebook

Environment (please complete the following information):
Google Colab

About MLM & fine-tuning

I found the MLM option, which was probably added as an experimental option. For this, I'd like to note a few things you may find useful, specifically concerning Roberta (cuz Bert sucks lol).

  • I may be misunderstanding this, but it appears that your process of MLM doesn't include replacing a token with a random token, i.e., the second option of the figure:

fig1

  • It would be awesome if you can add the option of fine-tuning on GLUE (found it in another repo by Eleuther). Since the current MLM training doesn't add [CLS] token at the beginning, you may want to fix that.

ValueError when predicting with pretrained models

Describe the bug
When using GPT3XL to perform inference with the --predict flag as shown in examples, the following error is thrown

ValueError: Argument not a list with same length as devices arg=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255] devices=['device:GPU:0']

This is with a single GTX 1070 GPU.

commands that both produced this error were:
python main.py --model=gpt3xl/config.json --predict --prompt=prompt.txt
python main.py --model=gpt3xl/config.json --predict --prompt=prompt.txt --gpu_ids=['device:GPU:0']

While loop in predict mode

Trying to wrap this block in the while loop

   if args.predict:
        # Predict
        predictions = estimator.predict(input_fn=pred_input_fn)
        logger.info("Predictions generated")
        enc = fetch_encoder(params)
        handle_pred_output_fn(predictions, logger, enc, params, out_name=f"predictions_{args.sacred_id}_{current_step}")

ends with OOM (model is allocated again on the GPU/TPU).

connect timeout

When I run:
!python3 main.py --predict --prompt prompt1.txt --tpu=[0,1,2,3,4,5,6,7] --model /content/the-eye.eu/eleuther_staging/gptneo-release/GPT3_XL/config.json

I got:

`2021-03-22 16:14:03.058026: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0

WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Current step 362000
Saving config to /content/the-eye.eu/eleuther_staging/gptneo-release/GPT3_XL
2021-03-22 16:14:07.810984: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2300000000 Hz
2021-03-22 16:14:07.811573: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d6e95b8f40 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-22 16:14:07.811629: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-03-22 16:14:07.814273: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-03-22 16:14:07.824584: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-03-22 16:14:07.824644: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (b30276c531da): /proc/driver/nvidia/version does not exist
2021-03-22 16:14:07.831001: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
Done!
params = defaultdict(<function fetch_model_params.. at 0x7f8fed3e0b00>, {'n_head': 16, 'n_vocab': 50257, 'embed_dropout': 0, 'lr': 0.0002, 'lr_decay': 'cosine', 'warmup_steps': 3000, 'beta1': 0.9, 'beta2': 0.95, 'epsilon': 1e-08, 'opt_name': 'adam', 'weight_decay': 0, 'train_batch_size': 512, 'attn_dropout': 0, 'train_steps': 400000, 'lr_decay_end': 300000, 'eval_steps': 10, 'predict_steps': 0, 'res_dropout': 0, 'eval_batch_size': 128, 'predict_batch_size': 128, 'iterations': 500, 'n_embd': 2048, 'datasets': [['pile', None, None, None]], 'model_path': '/content/the-eye.eu/eleuther_staging/gptneo-release/GPT3_XL', 'n_ctx': 2048, 'n_layer': 24, 'scale_by_depth': True, 'scale_by_in': False, 'attention_types': ['global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local'], 'mesh_shape': 'x:128,y:2', 'layout': 'batch:x,memory_length:y,embd:y', 'activation_function': 'gelu', 'recompute_grad': True, 'gradient_clipping': 1.0, 'tokens_per_mb_per_replica': 4096, 'precision': 'bfloat16', 'padding_id': 50257, 'eos_id': 50256, 'dataset_configs': {'pile': {'n_vocab': 50257, 'path': 'gs://neo-datasets/pile/pile_*.tfrecords', 'eval_path': 'gs://neo-datasets/pile_val.tfrecords', 'tokenizer_is_pretrained': True, 'tokenizer_path': 'gpt2', 'eos_id': 50256, 'padding_id': 50257}}, 'mlm_training': False, 'causal': True, 'num_cores': 256, 'auto_layout': False, 'auto_layout_and_mesh_shape': False, 'use_tpu': True, 'gpu_ids': ['device:GPU:0'], 'steps_per_checkpoint': 5000, 'predict': True, 'model': 'GPT', 'export': False, 'sampling_use_entmax': False, 'moe_layers': None, 'slow_sampling': False})
Traceback (most recent call last):
File "/usr/lib/python3.7/urllib/request.py", line 1350, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "/usr/lib/python3.7/http/client.py", line 1277, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.7/http/client.py", line 1323, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.7/http/client.py", line 1272, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.7/http/client.py", line 1032, in _send_output
self.send(msg)
File "/usr/lib/python3.7/http/client.py", line 972, in send
self.connect()
File "/usr/lib/python3.7/http/client.py", line 944, in connect
(self.host,self.port), self.timeout, self.source_address)
File "/usr/lib/python3.7/socket.py", line 728, in create_connection
raise err
File "/usr/lib/python3.7/socket.py", line 716, in create_connection
sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "main.py", line 262, in
main(args)
File "main.py", line 139, in main
tpu_cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(args.tpu) if params["use_tpu"] else None
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/cluster_resolver/tpu/tpu_cluster_resolver.py", line 207, in init
discovery_url=discovery_url)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/client/client.py", line 164, in init
self._project = _request_compute_metadata('project/project-id')
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/client/client.py", line 82, in _request_compute_metadata
resp = request.urlopen(req)
File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.7/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/usr/lib/python3.7/urllib/request.py", line 543, in _open
'_open', req)
File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/usr/lib/python3.7/urllib/request.py", line 1378, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/lib/python3.7/urllib/request.py", line 1352, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
`

Env: Colab TPU; code is from git clone https://github.com/EleutherAI/gpt-neo.git

Code Cleanup

The code is a massive mess. Before we release, we need to clean everything up and add comments where possible.

TODO:

  • delete old configs
  • delete / fix broken configs (which atm is anything larger than gpt3_XL, and maybe test.json & v8_test.json)
  • move dataset configs into configs folder
  • move all scripts into scripts (think this just applies to start_test_tb.sh)
  • remove unused code
  • clean up / simplify input pipeline
  • remove old steps from readme, and make it easier to follow (especially input)
  • move as much code as possible in model_fns / gpt2.py into functions to make the body of the code more readable (it should read like pytorch code where possible)

If anyone has any more concrete suggestions on how we should do this, go ahead. I think @ConnorJL wanted to make our code a bit more of a flexible, class-based thing, but I'll leave it open as to how we want it to look eventually.

HuggingFace compatibility

Hey,

If I train a GPT2 model from scratch using your codebase, can it be converted to models compatible with the HuggingFace library?

Thank you

Unrecognized arguments base_dir and use_gpt2_tokenizer

create_tfrecords.py: error: unrecognized arguments: --base_dir /content/GPTNeo/openwebtext --use_gpt2_tokenizer

I even tried using input_dir instead of base_dir and without gpt2_tokenizer. I think that worked, but when I got to copying data to the storage bucket it did this instead: No URLs matched: /content/GPTNeo/openwebtext_tokenized
I tried to make folders in the bucket to match the path but nothing worked.

ValueError in data/create_tfrecords.py after unzip openwebtext.tar.xz

Hi,

I'm trying to follow the GPTNeo_example_notebook. After tar xf openwebtext.tar.xz and python3 data/create_tfrecords.py --mode documents --input_dir /content/GPTNeo/openwebtext --name openwebtext --output_dir openwebtext_tokenized --write_dataset_config, I encountered ValueError

2021-02-09 08:57:16.075990: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-02-09 08:57:16.076052: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
File "data/create_tfrecords.py", line 205, in
results = create_tfrecords_mp(files, args)
File "data/create_tfrecords.py", line 186, in create_tfrecords_mp
files = split_list(files, len(files) // args.processes)
File "data/create_tfrecords.py", line 67, in split_list
return [l[i:i+n] for i in range(0, len(l), n)]
ValueError: range() arg 3 must not be zero

Any help will be appreciated!

Transformative Mediation

Is anybody discussing training models wisely? It seems it would even be best to train models based on crowd sourced human feedback on something like what leaves everyone understood and respected. Is this reasonable to do?

Here is tiny booklet on nonviolent communication, a mediation process that has reliably ended long-running wars and family squabbles with a little bit of talk: https://gateway.ipfs.io/ipfs/QmdFVjYwgeuUpw83hBB74Wy4js8SrmmNxt8U2MkdRA2f7m/Books/We%20Can%20Work%20It%20Out:%20Resolving%20Conflicts%20Peacefully%20and%20Powerfully.pdf . There are many long books on NVC, too.

Can't infer on the provided Colab

In the provided Colab (only using provided cells), after downloading a pre-trained GPT3_XL, I tried to infer from it, which resulted in the following output from the very last cell:

out.txt

The interesting part seems to be:

Starting infeed thread controller.
Starting outfeed thread controller.
Initialized dataset iterators in 0 seconds
Before copy master to slices.
Done with copy master to slices.
Enqueue next (1) batch(es) of data to infeed.
Dequeue next (1) batch(es) of data from outfeed.
Outfeed finished for iteration (0, 0)
An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: From /job:worker/replica:0/task:0:
DisableableBlockingRefcount is disabled.
	 [[node OutfeedDequeueTuple_7 (defined at /usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py:2261) ]]

Original stack trace for 'OutfeedDequeueTuple_7':
  File "main.py", line 256, in <module>
    main(args)
  File "main.py", line 184, in main
    handle_pred_output_fn(predictions, logger, enc, params, out_name=f"predictions_{args.sacred_id}_{current_step}")
  File "/content/GPTNeo/inputs.py", line 165, in handle_pred_output
    for i, p in enumerate(predictions):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3167, in predict
    yield_single_examples=yield_single_examples):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 613, in predict
    self.config)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2962, in _call_model_fn
    config)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3525, in _model_fn
    host_call_ret = host_calls.create_tpu_hostcall()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2261, in create_tpu_hostcall
    device_ordinal=ordinal_id)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_tpu_ops.py", line 3455, in outfeed_dequeue_tuple
    device_ordinal=device_ordinal, name=name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 3536, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 1990, in __init__
    self._traceback = tf_stack.extract_stack()

Graph was finalized.
Restoring parameters from gs://peppa-test-1/GPT3_XL/model.ckpt-362000
Closing session due to error From /job:worker/replica:0/task:0:
9 root error(s) found.
  (0) Resource exhausted: Failed to allocate request for 1.0KiB (1024B) on device ordinal 3
	 [[{{node ConstantFolding/split-folded-3}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[ConstantFolding/split-folded-4_G4895]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

...followed by many more similar OOM errors.

I'd be glad for any help with running the inference in Google Colab. Training actually seems to work and saves a new checkpoint, but I have not been able to run inference even on the provided pre-trained network.

truncating prompts > n_ctx

in our current prediction input function, when a prompt is larger than the context length, we truncate it like so:

if len(tokens) > params["n_ctx"]:
        tokens = tokens[:params["n_ctx"]]

this would input only the beginning of the prompt, if it's longer than n_ctx. Wouldn't it be preferable to truncate input from the beginning, like so?

if len(tokens) > params["n_ctx"]:
        tokens = tokens[len(tokens) - params["n_ctx"]:]

relevant lines of code here:

https://github.com/EleutherAI/GPTNeo/blob/master/inputs.py#L212
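A tiny illustration of the difference between the current head-keeping truncation and the proposed tail-keeping truncation, with toy values:

tokens = list(range(10))
n_ctx = 4

head = tokens[:n_ctx]                  # current behaviour: keeps [0, 1, 2, 3]
tail = tokens[len(tokens) - n_ctx:]    # proposed behaviour: keeps [6, 7, 8, 9]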

CrossShardOptimizer must be used for model training on TPUs

Running the example on a Colab TPU results in the following error:

File "main.py", line 256, in <module>
    main(args)
  File "main.py", line 230, in main
    estimator.train(input_fn=partial(input_fn, global_step=current_step, eval=False), max_steps=next_checkpoint)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3130, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 150, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3125, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1204, in _train_model_default
    self.config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2962, in _call_model_fn
    config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3386, in _model_fn
    _validate_tpu_training_graph(ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3817, in _validate_tpu_training_graph
    'CrossShardOptimizer must be used for model training on TPUs.')
ValueError: CrossShardOptimizer must be used for model training on TPUs.

Make repository public

The codebase has become mature enough that it seems reasonable to make everything public. It's not like we're releasing it yet.

It's a full model parallel GPT implementation with sampling and eval tasks. Definitely meets the quality threshold for "let people take a peek" imo.
