
GPT Neo


As of August 2021, this code is no longer maintained. It is preserved here in archival form for people who wish to continue to use it.

🎉 1T or bust my dudes 🎉

An implementation of model- and data-parallel GPT-3-like models using the mesh-tensorflow library.

If you're just here to play with our pre-trained models, we strongly recommend you try out the HuggingFace Transformers integration.
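For example, a minimal way to sample from one of the released checkpoints through the transformers library (the Hub model names EleutherAI/gpt-neo-1.3B and EleutherAI/gpt-neo-2.7B are assumed here, and the generation arguments are only illustrative):

from transformers import pipeline

# Download the 1.3B GPT-Neo checkpoint from the HuggingFace Hub and build a generation pipeline.
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

# Sample a short continuation from a prompt.
output = generator("EleutherAI is", do_sample=True, max_length=50)
print(output[0]["generated_text"])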

Training and inference are officially supported on TPU and should also work on GPU. This repository will be (mostly) archived as we move focus to our GPU-specific repo, GPT-NeoX.

In addition to the functionality offered by GPT-3, we also offer the following: local and linear attention, mixture of experts, and axial positional embeddings (see the Parameter Reference section for details).

NB: while Neo can technically run a training step at 200B+ parameters, it is very inefficient at those scales. This, as well as the fact that many GPUs became available to us, among other things, prompted us to move development over to GPT-NeoX.

Pretrained Models

Update 21/03/2021:

We're proud to release two pretrained GPT-Neo models trained on The Pile; the weights and configs can be freely downloaded from the-eye.eu.

1.3B: https://mystic.the-eye.eu/public/AI/gptneo-release/GPT3_XL/

2.7B: https://mystic.the-eye.eu/public/AI/gptneo-release/GPT3_2-7B/

For more information on how to get these set up, see the Colab notebook, or read through the rest of the README.

Model Evaluations

Linguistic Reasoning

| Model and Size | Pile BPB | Pile PPL | Wikitext PPL | Lambada PPL | Lambada Acc | Winogrande | Hellaswag |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-Neo 125M | ----- | ----- | 32.285 | 30.266 | 37.36% | 50.43% | 28.67% |
| GPT-3 125M | ----- | ----- | ----- | 18.6 | 42.7% | 52.0% | 33.7% |
| GPT-Neo 350M | ----- | ----- | 22.5657 | 13.876 | 47.27% | 51.14% | 32.16% |
| GPT-3 350M | ----- | ----- | ----- | 9.09 | 54.3% | 52.1% | 43.6% |
| GPT-3 Ada | 0.9631 | ----- | ----- | 9.954 | 51.60% | 52.90% | 35.93% |
| GPT-Neo 1.3B | 0.7527 | 6.159 | 13.10 | 7.498 | 57.23% | 55.01% | 38.66% |
| GPT-3 1.3B | ----- | ----- | ----- | 5.44 | 63.6% | 58.7% | 54.7% |
| GPT-2 1.5B | 1.0468 | ----- | 17.48 | 10.634 | 51.21% | 59.40% | 40.03% |
| GPT-Neo 2.7B | 0.7165 | 5.646 | 11.39 | 5.626 | 62.22% | 56.50% | 42.73% |
| GPT-3 2.7B | ----- | ----- | ----- | 4.60 | 67.1% | 62.3% | 62.8% |

Physical and Scientific Reasoning

| Model and Size | MathQA | PubMedQA | Piqa |
| --- | --- | --- | --- |
| GPT-Neo 125M | 22.78% | 55.10% | 63.06% |
| GPT-3 125M | ----- | ----- | 64.6% |
| GPT-Neo 350M | 23.45% | 53.80% | 65.07% |
| GPT-3 350M | ----- | ----- | 70.2% |
| GPT-3 Ada | 24.29% | 52.80% | 68.88% |
| GPT-Neo 1.3B | 24.05% | 54.40% | 71.11% |
| GPT-3 1.3B | ----- | ----- | 75.1% |
| GPT-2 1.5B | 23.64% | 58.33% | 70.78% |
| GPT-Neo 2.7B | 24.72% | 57.54% | 72.14% |
| GPT-3 2.7B | ----- | ----- | 75.6% |

Note: All evaluations were done using our evaluation harness. Some results for GPT-2 and GPT-3 are inconsistent with the values reported in the respective papers. We are currently looking into why, and would greatly appreciate feedback and further testing of our eval harness.

Setup

git clone https://github.com/EleutherAI/GPTNeo
cd GPTNeo
pip3 install -r requirements.txt

Training Setup

TPUs:

Sign up for Google Cloud Platform, and create a storage bucket.

Create your VM through a google shell (https://ssh.cloud.google.com/) with ctpu up --vm-only so that it can connect to your Google bucket and TPUs and install the requirements with pip (see above).

Google Colab provides TPU-v8s for free, which should be enough to finetune our models up to GPT3XL (1.5B parameter) sizes. Click Open In Colab to run through our example Colab notebook.

For more detailed instructions, run through our Training Guide below.

GPUs:

You can also choose to train GPT-Neo locally on your GPUs. To do so, you can omit the Google Cloud setup steps above and git clone the repo locally. Run through the Training Guide below, then when running main.py, simply omit the tpu flag and pass in GPU ids instead.

Note: Some users have reported having difficulty getting MTF to recognize their GPUs. See here for details and instructions on how to fix it.

Generating Text

Once you have a trained model, or you've downloaded one of our pre-trained models, generating text is as simple as running the main.py script with the --predict flag on. You can pass a path to your prompt txt file with the --prompt flag, like so:

python3 main.py --predict --prompt <example_prompt.txt> --tpu <tpu_name> --model <config_name>

or, if using GPUs:

python3 main.py --predict --prompt <example_prompt.txt> --gpu_ids <device:GPU:0 device:GPU:1> --model <config_name>

Training Guide

1. Create your Tokenizer (OPTIONAL)

We recommend you use Huggingface's pretrained GPT2 tokenizer with our repo (instructions provided below), but if you want to train a model with a different vocabulary size, we provide facilities to train your own tokenizer like so:

python data/train_tokenizer.py \
    --base_dir ./path/to/your/txt/files \
    --output_dir ./output/path \
    --file_type txt \
    --vocab_size 50257

# if it succeeded, you should see the message
# 'tokenizer saved at ./output/path/byte-level-bpe.tokenizer.json'
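To sanity-check the result, the saved file can be loaded back with the HuggingFace tokenizers library; a minimal sketch, assuming the default output path from the command above:

from tokenizers import Tokenizer

# Load the trained byte-level BPE tokenizer and round-trip a sample string.
tokenizer = Tokenizer.from_file("./output/path/byte-level-bpe.tokenizer.json")
ids = tokenizer.encode("Hello world").ids
print(ids)
print(tokenizer.decode(ids))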

2. Tokenizing your Dataset

If you just want to test training, you can skip this step and download some dummy data like so:

wget https://storage.googleapis.com/connors-datasets/bundestag/bundestag_0.tfrecords

Then copy the data to your bucket, or if using GPUs, a local directory:

gsutil cp bundestag_0.tfrecords gs://<your bucket>/

If using your own data to train, you can use the data/create_tfrecords.py script to encode your text data into tfrecords.

Your data must either be in the form of lots of normal .txt files (one document per file), or in any format supported by lm_dataformat.

You can run the script without parameters to see help for all options.

In document mode, each example in the tfrecords is one (variably sized) document. This is to be used with the documents_fixed and documents_random sampling modes (for more details see the parameters reference section). Document mode is the default mode.

The below command will tokenize all files in acceptable formats in input_dir using the GPT2 tokenizer and save them to output_dir:

python3 create_tfrecords.py --mode documents --input_dir <base> --name <name> --output_dir <output> --use_gpt2_tokenizer --minimum_size <min> 
  • input_dir: Defines the folder where your data is located. The script will encode all files present in this folder.
  • name: Name of output files will be name_i.tfrecords where i is the number of the file.
  • output_dir: Where to save the tfrecords to
  • use_gpt2_tokenizer: Whether to use the pretrained HuggingFace GPT2 tokenizer, in which case the separator will be set to [50256].
  • encoder_path: if not using the pretrained gpt2 tokenizer, use this flag to provide a path to your generated tokenizer json.
  • separator: Written in list format, the separator token(s) to insert between documents (e.g. "[0]"). Will depend on your encoder.
  • minimum_size: The minimum size (in tokens) a document must have, otherwise it is discarded. This is what will later determine your stitch parameter: stitch * minimum_size must always be greater than or equal to n_ctx (for more details see the parameters reference section, and the sanity check below).
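As a quick sanity check of the stitch / minimum_size relationship, here is a small sketch with illustrative values (n_ctx comes from your model config, stitch from the datasets entry described below):

n_ctx = 2048          # context window from the model config
minimum_size = 100    # value passed to --minimum_size above
stitch = 25           # value used in the "datasets" entry of the model config

# stitch documents of at least minimum_size tokens each must be able to fill one context window
assert stitch * minimum_size >= n_ctx, "increase stitch or minimum_size"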

4. Using a Dataset in a Model

To use a dataset in a model, you must first register that dataset under the ./configs/dataset_configs folder. First choose a filename with a .json extension. That filename will serve as the dataset identifier. The config should be filled out in the following manner.

If you have a dataset encoded using the pretrained gpt2 tokenizer, you can specify that like so:

{
    "n_vocab": 50257,
    "path": "gs://neo-datasets/openwebtext-documents/openwebtext_*.tfrecords",
    "eval_path": "gs://neo-datasets/openwebtext-documents/openwebtext_*.tfrecords",
    "tokenizer_is_pretrained": true,
    "tokenizer_path": "gpt2"
}

or if you've trained a custom tokenizer, like so:

{
    "n_vocab": 32768,
    "path": "./path/to/your/*.tfrecords",
    "eval_path": "./path/to/your/eval/*.tfrecords",
    "tokenizer_path": "./path/to/your/byte-level-bpe.tokenizer.json"
}

Finally, in your model config, add the filename that you created above to the datasets array.

The <dataset id> will be the filename, excluding the .json, that you created above

"datasets": [[<dataset id>, <stitch>, <datatype>, <weight>]] # datasets key defines at run time how each dataset is processed for training

5. Choose a model configuration

Once you have your datasets set up, find a suitable config in /configs.

Here we use a GPT3-XL sized model as an example, but there are many more in ./configs, all of which have short summaries in the Available Configs section.

All you need to do is edit the dataset id as described above, and edit model_path (where logs and checkpoints will be saved) to point to a cloud bucket you have write access to (or local path, if using GPUs).

{
    "n_head": 32,
    "n_vocab": 50257,
    "embed_dropout": 0.1,
    "lr": 0.0002,
    "lr_decay": "cosine",
    "warmup_steps": 3000,
    "beta1": 0.9,
    "beta2": 0.95,
    "epsilon": 1e-8,
    "opt_name": "adam",
    "weight_decay": 0.1,
    "train_batch_size": 512,
    "attn_dropout": 0.1,
    "train_steps": 286150,
    "eval_steps": 0,
    "predict_steps": 1,
    "res_dropout": 0.1,
    "eval_batch_size": 128,
    "predict_batch_size": 1,
    "iterations": 2500,
    "n_embd": 2048,
    "datasets": [["your_dataset_name", 25, "documents_random", 1.0]],
    "model_path": "gs://neo-models/GPT3_XL",
    "n_ctx": 2048,
    "n_layer": 24,
    "scale_by_depth": true,
    "scale_by_in": false,
    "attention_types" :  [[["global"],24]],
    "mesh_shape": "x:128,y:2",
    "layout": "batch:x,memory_length:y,embd:y",
    "activation_function": "gelu",
    "recompute_grad": true,
    "gradient_clipping": 1.0,
    "tokens_per_mb_per_replica": 2048
}

6. Run Training

python3 main.py --model <your_config_name> --steps_per_checkpoint <n> --tpu <tpu-name>
  • tpu: Name of the TPU to use.
  • steps_per_checkpoint: The frequency in steps at which to save checkpoints.
  • --auto_layout and --auto_layout_and_mesh_shape (Optional): Disable training and instead auto generate a memory efficient layout (and mesh_shape)
  • gpu_ids: if training using GPUs, omit the tpu flag and pass in the ids of your gpus. In the example below, we train on 3 GPUs, specifying their device ids delimited by spaces:
python3 main.py --model <your_config_name> --steps_per_checkpoint <n> --gpu_ids <device:GPU:0 device:GPU:1>

Available Configs

We have several model sizes available, but some of our configs require large TPUs and will need tweaking to run on smaller machines, or GPUs. Below is a short guide to each model in the configs directory:

TODO

Extra Features:

Training (with Sacred)

Sacred helps track experiments and is much nicer to work with than tensorboard.

To setup:

  1. Install Docker and Docker-compose

  2. Run docker-compose up

To use:

  1. Ensure model_dir doesn't have any metric logs in it (it trips up the metric stuff for tensorboard, which assumes that it's a continuation of the existing run). You can use gsutil rm -r ... to delete model dir

  2. Run python3 run_experiment.py --tpu sometpuhere --model someconfig.json Options are the same as main.py.

  3. You can go to http://server_ip_goes_here:8081/ to see the Omniboard overview. If you prefer to see a tensorboard, the script also spins one up and automatically assigns it a port. The script should print out the tensorboard port near the top of the log.

Peeking at a Dataset

If you are ever confused by the dataset of a particular config file, you can easily check the minimum and maximum token ids with a single command. This is useful for making sure that the vocabulary size of the model is at least as large as the maximum token id. Tensorflow will not error if you try to gather on a matrix with out of bounds indices, so you need to make sure your vocabulary size is sufficiently large.

python3 main.py --model <config_name> --check_dataset
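If you'd rather inspect a tfrecords file directly, something like the sketch below also works. Note this assumes each example stores its token ids in an int64 feature named "text"; the actual feature name depends on how your tfrecords were written, so treat this as illustrative:

import tensorflow as tf

# Scan one tfrecords file and report the smallest and largest token id found.
lo, hi = float("inf"), float("-inf")
for record in tf.data.TFRecordDataset("bundestag_0.tfrecords"):
    example = tf.train.Example()
    example.ParseFromString(record.numpy())
    ids = example.features.feature["text"].int64_list.value  # assumed feature name
    lo, hi = min(lo, min(ids)), max(hi, max(ids))

print(f"min token id: {lo}, max token id: {hi}")  # n_vocab must be larger than the max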

Masked Language Modeling

In addition to being able to train large GPTs, this repository also allows you to easily do masked language modeling (BERT, RoBERTa). In order to do so, you must follow two additional steps.

  1. When tokenizing your dataset, you must reserve a special id for the [mask] token.

  2. In the configs, you will have to define two additional fields

"mlm_training": true,                           # must be set to true
"mlm_mask_id": <mask id>                        # the mask id that you reserved from above

That's all you need to train a model with the MLM objective, good for any type of data that you have encoded properly. If you would like to tweak the other related hyperparameters, please continue reading.

"mlm_cls_token_id": <cls token id>,                # auto append specified CLS token id on the left
"mlm_mask_prob": 0.15,                             # the probability of masking a token, defaults to 15%
"mlm_same_token_prob": 0.10,                       # probability of keeping the token the same, defaults to 10%
"mlm_random_token_prob": 0.10,                     # probability of tokens that are replaced with random tokens, 10% was recommended by the BERT paper
"mlm_mask_ignore_ids": [<cls token>, <sep token>]  # ignore masking other special tokens, if any

Parameter Reference

Pick a valid config from /configs and tweak the parameters as needed:

  • n_head: The number of attention heads.
  • n_embd: Size of the hidden layers, must be divisible by n_head.
  • n_vocab: Vocabulary size.
  • embed_dropout, res_dropout, attn_dropout: Dropout probability for word embedding/residuals/attention
  • lr: Learning rate
  • warmup_steps: Number of steps before full learning rate is reached (linear ramp from 0 to lr).
  • lr_decay: cosine or linear.
  • opt_name: adam or adafactor.
  • beta1, beta2 and epsilon: adam optimizer params.
  • beta1, ada_epsilon1 and ada_epsilon2: adafactor optimizer params.
  • weight_decay: Weight decay parameter, if not present no weight decay is used (the weight decay fix for Adam is used) (default: 0.01) (optional).
  • train_batch_size: Batch size during training.
  • train_steps: Number of training steps (batches); set to roughly ~1 epoch for now (total number of tokens in your dataset / number of tokens per batch, where tokens per batch = train_batch_size * n_ctx). See the worked example after this list.
  • eval_steps: Number of steps to run for each evaluation. Set to 0 for no eval; i.e., after every checkpoint, the model is evaluated for eval_steps steps.
  • iterations: Number of steps queued to the TPU, must be smaller than steps_per_checkpoint. (default: 500)
  • datasets: List of tfrecords datasets to use. Each dataset is a list with the following parameters: [train glob , eval glob, stitch, sampling_mode, weight]. So for example for a single dataset (note the double list): [["bundestag_*.tfrecords", "", 10, "random_sample", 1.0]]
    • dataset_id: The name of a dataset configuration file in ./configs/dataset_configs
    • stitch: If sampling_mode random_sample is used, the input pipeline samples this amount of texts into one to sample from. You must select stitch so that stitch * minimum_document_length >= n_ctx
    • sampling_mode: chunks (tfrecords are preprocessed into the correct length and are read sequentially) or documents_random (stitch amount of documents are concatenated and then a n_ctx chunk is randomly subsampled)
    • weights: How much relative weight this dataset should have compared to others
  • model: Which model to train. Currently only GPT is supported, and it defaults to this if not present.
  • model_path: Google storage bucket location (or local path, if using GPUs) to save model checkpoints and logs.
  • n_ctx: Size of context window. Default is 2048
  • n_layer: Number of layers (blocks) in the model.
  • scale_by_depth: If true, the weight initialization of layers are scaled by their depth as in the GPT2 paper.
  • scale_by_in: If true, the weight initialization of layers are scaled by their number of inputs as in the GPT2 paper.
  • mesh_shape: A Mesh is an n-dimensional array of processors with named dimensions used for parallelism in the mesh-tensorflow library. Each Tensor is split evenly across mesh dimensions according to the layout (see below). The 'mesh_shape' is the shape of this array, and its total size must equal the number of processors. E.g., for a v3-128 TPU: "mesh_shape": "x:16,y:8".
  • layout: A Tensor is laid out on its mesh with one slice on each processor. A Tensor's "layout" is an injective partial map specifying which dimensions of the tensor are (evenly) split across which dimensions of the mesh. No dimension of a tensor may be split across two dimensions of its mesh, and no two dimensions of a tensor may be split across the same dimension of its mesh. The user defines a global set of layout rules in the form of (tensor-dimension-name, mesh-dimension-name) pairs. A dimension of a tensor is split across a dimension of its mesh if there is a matching rule, e.g. (for the above example mesh_shape): "layout": "batch:x,heads:y".
  • activation_function: selu (self normalizing) or gelu (used by OA), activation function used in feed-forward passes. (default: gelu)
  • attention_types: the type of attention for each layer in a list of the following format [[["attention_type"], n_layers]]. e.g. for a 12 layer net [[["global"], 12]] or [[["local"], 10], [["global"], 2]].
    • Choose from: linear, global, local or none. We have found a 50/50 mix of global and linear to work well. none allows you to create feed-forward only layers for more efficient PAR Transformer models.
  • precision: float32 or bfloat16.
  • tokens_per_mb_per_replica: If not None, will split the batch up into smaller microbatches containing tokens_per_mb_per_replica tokens to avoid OOMs. Gradients are accumulated locally and reduced once. IMPORTANT: mb refers to microbatch, not megabyte here.
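As a worked example of the train_steps arithmetic referenced above (the dataset size here is an assumption, used only for illustration):

train_batch_size = 512                       # sequences per batch, from the GPT3_XL example config
n_ctx = 2048                                 # tokens per sequence
tokens_per_batch = train_batch_size * n_ctx  # 1,048,576 tokens consumed per step
dataset_tokens = 300_000_000_000             # assumed ~300B-token dataset
train_steps = dataset_tokens // tokens_per_batch
print(train_steps)                           # ~286,000 steps for one epoch

This is in the same ballpark as the 286150 train_steps in the GPT3_XL example config above.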

Mixture of Experts

  • moe_layers: A list of layer numbers to append a mixture of experts layer onto, e.g. [2,4,6,8,10,12]. We have experimentally found an MoE layer for every two self-attention layers to work well.
  • moe_params: A dictionary of additional kwargs to pass in to the MoE layer, e.g. {"moe_dropout_rate": 0.0}.

Experimental features

  • axial_pos_emb_: If true, uses axial positional embedding (https://arxiv.org/abs/1912.12180).
  • mlp_glu: If true, uses a gated linear unit variant of feed forward layers.
  • scalenorm: If true, uses scalenorm instead of layernorm.
  • rezero: If true, uses rezero instead of layernorm.
  • num_mem_kv: adds memory / key values from the all-attention paper. Param is an int with the number of desired mem/key values.
  • macaron: If true, uses a macaron transformer for each layer block.

TODO:

  • finalize documentation
  • update configs

Citing GPT-Neo

If you have found GPT-Neo helpful in your work, you can cite this repository as

@software{gpt-neo,
  author       = {Black, Sid and
                  Gao, Leo and
                  Wang, Phil and
                  Leahy, Connor and
                  Biderman, Stella},
  title        = {{GPT-Neo: Large Scale Autoregressive Language 
                   Modeling with Mesh-Tensorflow}},
  month        = mar,
  year         = 2021,
  note         = {{If you use this software, please cite it using 
                   these metadata.}},
  publisher    = {Zenodo},
  version      = {1.0},
  doi          = {10.5281/zenodo.5297715},
  url          = {https://doi.org/10.5281/zenodo.5297715}
}

The version number should be replaced with the version number you are using, and the year corresponds to the project's open-source release.

If you are specifically interested in citing the GPT-Neo models trained on the Pile, we would appreciate also citing

@article{gao2020pile,
  title={The Pile: An 800GB Dataset of Diverse Text for Language Modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}


gpt-neo's Issues

GPT3XL training

It's not clear to me how to train the GPT3XL via GPU/Colab.
Could you add more details?

Thank you.

Pin Specific Requirement for lm_dataformat

For some reason, multithreading isn't playing nice with the latest version of lm_dataformat and throws an error when using create_tfrecords.py

"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "data/create_tfrecords.py", line 230, in create_file
    for d in data:
  File "data/create_tfrecords.py", line 221, in _archive_to_files
    for s in g:
  File "/usr/local/lib/python3.6/dist-packages/lm_dataformat/__init__.py", line 119, in stream_data
    p.start()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 103, in start
    'daemonic processes are not allowed to have children'
AssertionError: daemonic processes are not allowed to have children
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "data/create_tfrecords.py", line 278, in <module>
    for i in tqdm(pool.imap(create_file, enumerate(file_chunks)), total=len(file_chunks)):
  File "/usr/local/lib/python3.6/dist-packages/tqdm/std.py", line 1104, in __iter__
    for obj in iterable:
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
AssertionError: daemonic processes are not allowed to have children
  0% 0/104 [00:00<?, ?it/s]

The last version I was able to test and validate working is

pip install git+git://github.com/leogao2/lm_dataformat.git@f77f6e86d36306f0ecd4a8b47ee8223d5e491b4a

Great work so far - really excited to see GPTNeo progress!

Failed to install requirements

I got this error when installing the requirements. Can anyone help?

$ pip3 install -r requirements.txt
...
  Could not find a version that satisfies the requirement tensorflow==2.4.0 (from -r requirements.txt (line 10)) (from versions: 1.13.0rc1, 1.13.0rc2, 1.13.1, 1.13.2, 1.14.0rc0, 1.14.0rc1, 1.14.0, 2.0.0a0, 2.0.0b0, 2.0.0b1)
No matching distribution found for tensorflow==2.4.0 (from -r requirements.txt (line 10))

My system is: Debian 10.8 64bit, Python 3.7.3.

Fix Dropout

We should make sure all dropouts are working as expected. We had a problem with embd_dropout in particular.

Mainly opening this issue for documentation purposes as I believe I have this fixed already, I'm just running a test to see if my fix has worked.

Colab breaks when tokenizing NIH Dataset

Traceback (most recent call last):
File "data/create_tfrecords.py", line 205, in
results = create_tfrecords_mp(files, args)
File "data/create_tfrecords.py", line 186, in create_tfrecords_mp
files = split_list(files, len(files) // args.processes)
File "data/create_tfrecords.py", line 67, in split_list
return [l[i:i+n] for i in range(0, len(l), n)]
ValueError: range() arg 3 must not be zero

I'll look into why it breaks

UnboundLocalError: local variable 'skip_idx' referenced before assignment

Describe the bug
UnboundLocalError: local variable 'skip_idx' referenced before assignment

To Reproduce

  1. Trying running the example notebook in Colab
  2. Get to the training part (where main.py is called)
  3. Run into the error that ends in:
File "/content/GPTNeo/inputs.py", line 346, in sequential_input
    skip_idx, remainder = _get_skip_index(filenames, n_batches=global_step * params["train_batch_size"]) # TODO: fix for > 1 epoch
  File "/content/GPTNeo/inputs.py", line 295, in _get_skip_index
    return skip_idx, remainder
UnboundLocalError: local variable 'skip_idx' referenced before assignment

Expected behavior
Clean training on TPUs

Proposed solution
As I understand it, adding a line like count = 0 to inputs.py, line 279 should solve the problem.

Experiment Plan

Overview [DRAFT]

We have few resources available, so it would be good to have an experiment plan that lets us use them as fully and quickly as possible.
In the last few days there have been a few structural changes that should push the model to achieve new results.

Proposed Plan

Qualitative experiments

  • run a baseline test to determine the best lambada score we can get with a 256 setup

Efficiency experiments

  • Determine the best configuration that leverages the MOE and BFLOAT16 setup.

Efficient Sampling

Currently our sampling is incredibly inefficient, doesn't store past values for k / v, and instead recomputes them for every token.

We should look again at how sampling is done in mesh / T5 (maybe ask Colin?) and see if we can store k/v values to increase the efficiency of our sampling code.

The basic infrastructure for this is already in place, but commented out (the k/v values will be stored in this Context object: https://github.com/EleutherAI/GPTNeo/blob/master/sample.py#L148, https://github.com/EleutherAI/GPTNeo/blob/master/models/gpt2/gpt2.py#L138). The problem we ran into last time was that in the mesh code they seem to feed in the inputs a token at a time, but the token is missing a batch dimension, and we're not exactly sure where batch gets added on, and therefore how to replicate this. (see: https://github.com/tensorflow/mesh/blob/a3c05f705641dfe144f70b7b5230db4933ce8ca9/mesh_tensorflow/transformer/transformer.py#L1137)

This is (the last?) issue I'd like to get solved before the code is released publicly, so I will try to look into this soon, but any help would be appreciated.

Wrong mesh_shape

Trying out the notebook on colab TPUs with both "mesh_shape":"x:4,y:2" and "mesh_shape":"all:8" led to the following error:

Traceback (most recent call last):
  File "main.py", line 256, in <module>
    main(args)
  File "main.py", line 230, in main
    estimator.train(input_fn=partial(input_fn, global_step=current_step, eval=False), max_steps=next_checkpoint)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3130, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 150, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3125, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1204, in _train_model_default
    self.config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2962, in _call_model_fn
    config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3242, in _model_fn
    input_holders.generate_infeed_enqueue_ops_and_dequeue_fn())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1484, in generate_infeed_enqueue_ops_and_dequeue_fn
    self._invoke_input_fn_and_record_structure())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1547, in _invoke_input_fn_and_record_structure
    enqueue_ops.append(wrap_fn(device=host_device, op_fn=enqueue_ops_fn))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3771, in _wrap_computation_in_while_loop
    parallel_iterations=1)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2774, in while_loop
    return_same_structure)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2256, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2181, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3760, in computation
    with tf.control_dependencies(op_fn()):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1217, in enqueue_ops_fn
    placement_function=device_function_impl)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/tpu_feed.py", line 631, in generate_enqueue_ops
    for (shard, index) in zip(sharded_inputs, xrange(self.number_of_shards))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/tpu_feed.py", line 631, in <listcomp>
    for (shard, index) in zip(sharded_inputs, xrange(self.number_of_shards))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1162, in tpu_ordinal_function_impl
    if ctx.device_assignment:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 420, in device_assignment
    if self._model_parallelism_enabled else None)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 353, in _get_device_assignment
    num_replicas=self.num_replicas)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/device_assignment.py", line 374, in device_assignment
    topology = Topology(serialized=topology)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/topology.py", line 80, in __init__
    self._parse_topology(serialized)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/topology.py", line 111, in _parse_topology
    "entries; got {}".format(self._mesh_shape))
ValueError: `mesh_shape` must be a vector of size 4 with positive entries; got [2 2 2]

ETA: running sudo apt update && sudo apt -y upgrade solved the problem. But I guess it has something to do with the versions of tensorflow and mesh-tensorflow.

How to train GPT2 on multiple nodes and multiple GPUs

What should I pass for gpu_ids when training on multiple nodes, each with multiple GPUs? Is the below right?
node0 cmd: python3 main.py --model <your_config_name> --steps_per_checkpoint --gpu_ids 0,1,2,3
node1 cmd: python3 main.py --model <your_config_name> --steps_per_checkpoint --gpu_ids 0,1,2,3
node2 cmd: python3 main.py --model <your_config_name> --steps_per_checkpoint --gpu_ids 0,1,2,3

BFloat16

The current attention implementation uses float32. This is not efficient on TPUs.

Task: Change the logic to leverage the TPU's MXU units.

Make sure we handle short text correctly

As a cleaning heuristic, documents with fewer than 128 tokens were removed from the dataset. These shorter documents tended to be lower quality, as determined by text coherence. We release this dataset as the OpenWebTextCorpus

source

Colab notebook create_tfrecords.py: unknown argument "--mode documents"

Describe the bug
The argument parser seems to no longer support the "--mode documents" from the Colab notebook

To Reproduce
Steps to reproduce the behavior:

  1. Follow the Colab notebook

Proposed solution
Remove the argument in the Colab tutorial notebook

Environment (please complete the following information):
Google Colab

About MLM & fine-tuning

I found the MLM option, which was probably added as an experimental option. For this, I'd like to note a few things you may find useful, specifically concerning Roberta (cuz Bert sucks lol).

  • I may be misunderstanding this, but it appears that your process of MLM doesn't include replacing a token with a random token, i.e., the second option of the figure:

fig1

  • It would be awesome if you can add the option of fine-tuning on GLUE (found it in another repo by Eleuther). Since the current MLM training doesn't add [CLS] token at the beginning, you may want to fix that.

ValueError when predicting with pretrained models

Describe the bug
When using GPT3XL to perform inference with the --predict flag as shown in examples, the following error is thrown

ValueError: Argument not a list with same length as devices arg=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255] devices=['device:GPU:0']

This is with a single GTX 1070 GPU.

commands that both produced this error were:
python main.py --model=gpt3xl/config.json --predict --prompt=prompt.txt
python main.py --model=gpt3xl/config.json --predict --prompt=prompt.txt --gpu_ids=['device:GPU:0']

While loop in predict mode

Trying to wrap this block in the while loop

   if args.predict:
        # Predict
        predictions = estimator.predict(input_fn=pred_input_fn)
        logger.info("Predictions generated")
        enc = fetch_encoder(params)
        handle_pred_output_fn(predictions, logger, enc, params, out_name=f"predictions_{args.sacred_id}_{current_step}")

ends with OOM (model is allocated again on the GPU/TPU).

connect timeout

When I run:
!python3 main.py --predict --prompt prompt1.txt --tpu=[0,1,2,3,4,5,6,7] --model /content/the-eye.eu/eleuther_staging/gptneo-release/GPT3_XL/config.json

I got:

`2021-03-22 16:14:03.058026: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0

WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Current step 362000
Saving config to /content/the-eye.eu/eleuther_staging/gptneo-release/GPT3_XL
2021-03-22 16:14:07.810984: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2300000000 Hz
2021-03-22 16:14:07.811573: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d6e95b8f40 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-22 16:14:07.811629: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-03-22 16:14:07.814273: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-03-22 16:14:07.824584: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-03-22 16:14:07.824644: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (b30276c531da): /proc/driver/nvidia/version does not exist
2021-03-22 16:14:07.831001: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
Done!
params = defaultdict(<function fetch_model_params.. at 0x7f8fed3e0b00>, {'n_head': 16, 'n_vocab': 50257, 'embed_dropout': 0, 'lr': 0.0002, 'lr_decay': 'cosine', 'warmup_steps': 3000, 'beta1': 0.9, 'beta2': 0.95, 'epsilon': 1e-08, 'opt_name': 'adam', 'weight_decay': 0, 'train_batch_size': 512, 'attn_dropout': 0, 'train_steps': 400000, 'lr_decay_end': 300000, 'eval_steps': 10, 'predict_steps': 0, 'res_dropout': 0, 'eval_batch_size': 128, 'predict_batch_size': 128, 'iterations': 500, 'n_embd': 2048, 'datasets': [['pile', None, None, None]], 'model_path': '/content/the-eye.eu/eleuther_staging/gptneo-release/GPT3_XL', 'n_ctx': 2048, 'n_layer': 24, 'scale_by_depth': True, 'scale_by_in': False, 'attention_types': ['global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local'], 'mesh_shape': 'x:128,y:2', 'layout': 'batch:x,memory_length:y,embd:y', 'activation_function': 'gelu', 'recompute_grad': True, 'gradient_clipping': 1.0, 'tokens_per_mb_per_replica': 4096, 'precision': 'bfloat16', 'padding_id': 50257, 'eos_id': 50256, 'dataset_configs': {'pile': {'n_vocab': 50257, 'path': 'gs://neo-datasets/pile/pile_*.tfrecords', 'eval_path': 'gs://neo-datasets/pile_val.tfrecords', 'tokenizer_is_pretrained': True, 'tokenizer_path': 'gpt2', 'eos_id': 50256, 'padding_id': 50257}}, 'mlm_training': False, 'causal': True, 'num_cores': 256, 'auto_layout': False, 'auto_layout_and_mesh_shape': False, 'use_tpu': True, 'gpu_ids': ['device:GPU:0'], 'steps_per_checkpoint': 5000, 'predict': True, 'model': 'GPT', 'export': False, 'sampling_use_entmax': False, 'moe_layers': None, 'slow_sampling': False})
Traceback (most recent call last):
File "/usr/lib/python3.7/urllib/request.py", line 1350, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "/usr/lib/python3.7/http/client.py", line 1277, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.7/http/client.py", line 1323, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.7/http/client.py", line 1272, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.7/http/client.py", line 1032, in _send_output
self.send(msg)
File "/usr/lib/python3.7/http/client.py", line 972, in send
self.connect()
File "/usr/lib/python3.7/http/client.py", line 944, in connect
(self.host,self.port), self.timeout, self.source_address)
File "/usr/lib/python3.7/socket.py", line 728, in create_connection
raise err
File "/usr/lib/python3.7/socket.py", line 716, in create_connection
sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "main.py", line 262, in
main(args)
File "main.py", line 139, in main
tpu_cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(args.tpu) if params["use_tpu"] else None
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/cluster_resolver/tpu/tpu_cluster_resolver.py", line 207, in init
discovery_url=discovery_url)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/client/client.py", line 164, in init
self._project = _request_compute_metadata('project/project-id')
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/client/client.py", line 82, in _request_compute_metadata
resp = request.urlopen(req)
File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.7/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/usr/lib/python3.7/urllib/request.py", line 543, in _open
'_open', req)
File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/usr/lib/python3.7/urllib/request.py", line 1378, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/lib/python3.7/urllib/request.py", line 1352, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
`

Env: Colab TPU; code is from git clone https://github.com/EleutherAI/gpt-neo.git

Code Cleanup

The code is a massive mess. Before we release, we need to clean everything up and add comments where possible.

TODO:

  • delete old configs
  • delete / fix broken configs (which atm is anything larger than gpt3_XL, and maybe test.json & v8_test.json)
  • move dataset configs into configs folder
  • move all scripts into scripts (think this just applies to start_test_tb.sh)
  • remove unused code
  • clean up / simplify input pipeline
  • remove old steps from readme, and make it easier to follow (especially input)
  • move as much code as possible in model_fns / gpt2.py into functions to make the body of the code more readable (it should read like pytorch code where possible)

If anyone has any more concrete suggestions on how we should do this, go ahead. I think @ConnorJL wanted to make our code a bit more of a flexible, class-based thing, but I'll leave it open as to how we want it to look eventually.

HuggingFace compatibility

Hey,

If I train a GPT2 model from scratch using your codebase, can it be converted to models compatible with the HuggingFace library?

Thank you

Unrecognized arguments base_dir and use_gpt2_tokenizer

create_tfrecords.py: error: unrecognized arguments: --base_dir /content/GPTNeo/openwebtext --use_gpt2_tokenizer

I even tried using input_dir instead of base_dir and without gpt2_tokenizer. I think that worked, but when I got to copying data to the storage bucket it did this instead: No URLs matched: /content/GPTNeo/openwebtext_tokenized
I tried to make folders in the bucket to match the path but nothing worked.

ValueError in data/create_tfrecords.py after unzip openwebtext.tar.xz

Hi,

I'm trying to follow the GPTNeo_example_notebook. After tar xf openwebtext.tar.xz and python3 data/create_tfrecords.py --mode documents --input_dir /content/GPTNeo/openwebtext --name openwebtext --output_dir openwebtext_tokenized --write_dataset_config, I encountered ValueError

2021-02-09 08:57:16.075990: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-02-09 08:57:16.076052: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
File "data/create_tfrecords.py", line 205, in
results = create_tfrecords_mp(files, args)
File "data/create_tfrecords.py", line 186, in create_tfrecords_mp
files = split_list(files, len(files) // args.processes)
File "data/create_tfrecords.py", line 67, in split_list
return [l[i:i+n] for i in range(0, len(l), n)]
ValueError: range() arg 3 must not be zero

Any help will be appreciated!

Transformative Mediation

Is anybody discussing training models wisely? It seems it would even be best to train models based on crowd sourced human feedback on something like what leaves everyone understood and respected. Is this reasonable to do?

Here is tiny booklet on nonviolent communication, a mediation process that has reliably ended long-running wars and family squabbles with a little bit of talk: https://gateway.ipfs.io/ipfs/QmdFVjYwgeuUpw83hBB74Wy4js8SrmmNxt8U2MkdRA2f7m/Books/We%20Can%20Work%20It%20Out:%20Resolving%20Conflicts%20Peacefully%20and%20Powerfully.pdf . There are many long books on NVC, too.

Can't infer on the provided Colab

In the provided Colab (only using provided cells), after downloading a pre-trained GPT3_XL, I tried to infer from it, which resulted in the following output from the very last cell:

out.txt

The interesting part seems to be:

Starting infeed thread controller.
Starting outfeed thread controller.
Initialized dataset iterators in 0 seconds
Before copy master to slices.
Done with copy master to slices.
Enqueue next (1) batch(es) of data to infeed.
Dequeue next (1) batch(es) of data from outfeed.
Outfeed finished for iteration (0, 0)
An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: From /job:worker/replica:0/task:0:
DisableableBlockingRefcount is disabled.
	 [[node OutfeedDequeueTuple_7 (defined at /usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py:2261) ]]

Original stack trace for 'OutfeedDequeueTuple_7':
  File "main.py", line 256, in <module>
    main(args)
  File "main.py", line 184, in main
    handle_pred_output_fn(predictions, logger, enc, params, out_name=f"predictions_{args.sacred_id}_{current_step}")
  File "/content/GPTNeo/inputs.py", line 165, in handle_pred_output
    for i, p in enumerate(predictions):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3167, in predict
    yield_single_examples=yield_single_examples):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 613, in predict
    self.config)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2962, in _call_model_fn
    config)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3525, in _model_fn
    host_call_ret = host_calls.create_tpu_hostcall()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2261, in create_tpu_hostcall
    device_ordinal=ordinal_id)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_tpu_ops.py", line 3455, in outfeed_dequeue_tuple
    device_ordinal=device_ordinal, name=name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 3536, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 1990, in __init__
    self._traceback = tf_stack.extract_stack()

Graph was finalized.
Restoring parameters from gs://peppa-test-1/GPT3_XL/model.ckpt-362000
Closing session due to error From /job:worker/replica:0/task:0:
9 root error(s) found.
  (0) Resource exhausted: Failed to allocate request for 1.0KiB (1024B) on device ordinal 3
	 [[{{node ConstantFolding/split-folded-3}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[ConstantFolding/split-folded-4_G4895]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

...followed by many more similar OOM errors.

I'd be glad for any help with running the inference in Google Colab. Training actually seems to work and saves a new checkpoint, but I have not been able to run inference even on the provided pre-trained network.

truncating prompts > n_ctx

in our current prediction input function, when a prompt is larger than the context length, we truncate it like so:

if len(tokens) > params["n_ctx"]:
        tokens = tokens[:params["n_ctx"]]

this would input only the beginning of the prompt, if it's longer than n_ctx. Wouldn't it be preferable to truncate input from the beginning, like so?

if len(tokens) > params["n_ctx"]:
        tokens = tokens[len(tokens) - params["n_ctx"]:]

relevant lines of code here:

https://github.com/EleutherAI/GPTNeo/blob/master/inputs.py#L212
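A tiny illustration of the difference between the current head-keeping truncation and the proposed tail-keeping truncation, with toy values:

tokens = list(range(10))
n_ctx = 4

head = tokens[:n_ctx]                  # current behaviour: keeps [0, 1, 2, 3]
tail = tokens[len(tokens) - n_ctx:]    # proposed behaviour: keeps [6, 7, 8, 9]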

CrossShardOptimizer must be used for model training on TPUs

Running the example on a Colab TPU results in the following error:

File "main.py", line 256, in <module>
    main(args)
  File "main.py", line 230, in main
    estimator.train(input_fn=partial(input_fn, global_step=current_step, eval=False), max_steps=next_checkpoint)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3130, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 150, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3125, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1204, in _train_model_default
    self.config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2962, in _call_model_fn
    config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3386, in _model_fn
    _validate_tpu_training_graph(ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3817, in _validate_tpu_training_graph
    'CrossShardOptimizer must be used for model training on TPUs.')
ValueError: CrossShardOptimizer must be used for model training on TPUs.

Make repository public

The codebase has become mature enough that it seems reasonable to make everything public. It's not like we're releasing it yet.

It's a full model parallel GPT implementation with sampling and eval tasks. Definitely meets the quality threshold for "let people take a peek" imo.
