
megatron-lm's Introduction

Megatron-LM & Megatron-Core

GPU optimized techniques for training transformer models at-scale


Latest News

  • [2024/1 Announcement] NVIDIA has released the core capabilities in Megatron-LM into Megatron-Core in this repository. Megatron-Core expands upon Megatron-LM's GPU-optimized techniques with more cutting-edge innovations on system-level optimizations, featuring composable and modular APIs. Explore the Megatron-Core intro for more details.


Megatron Overview

This repository comprises two essential components: Megatron-LM and Megatron-Core. Megatron-LM serves as a research-oriented framework that leverages Megatron-Core for large language model (LLM) training. Megatron-Core, on the other hand, is a library of GPU-optimized training techniques that comes with formal product support, including versioned APIs and regular releases. You can use Megatron-Core alongside Megatron-LM or the NVIDIA NeMo Framework for an end-to-end, cloud-native solution. Alternatively, you can integrate Megatron-Core's building blocks into your preferred training framework.

Megatron-LM

First introduced in 2019, Megatron (1, 2, and 3) sparked a wave of innovation in the AI community, enabling researchers and developers to build on this library to further LLM advancements. Today, many of the most popular LLM developer frameworks have been inspired by and built directly on the open-source Megatron-LM library, spurring a wave of foundation models and AI startups. Some of the most popular LLM frameworks built on top of Megatron-LM include Colossal-AI, HuggingFace Accelerate, and the NVIDIA NeMo Framework. A list of projects that have directly used Megatron can be found here.

Megatron-Core

Megatron-Core is a newly released open-source PyTorch-based library that further expands the collection of GPU-optimized techniques inherited from Megatron-LM with more cutting-edge innovations on system-level optimizations. It abstracts them into composable and modular APIs, allowing full flexibility for developers and model researchers to train custom transformers at scale on NVIDIA accelerated computing infrastructure. This library is compatible with all NVIDIA Tensor Core GPUs, including FP8 acceleration support for the NVIDIA Hopper architecture.

Megatron-Core offers the core building blocks such as attention mechanisms, transformer blocks and layers, normalization layers, and embedding techniques. Additional functionality such as activation recomputation and distributed checkpointing is also natively built into the library. The building blocks and functionality are all GPU-optimized and can be combined with advanced parallelization strategies for optimal training speed and stability on NVIDIA accelerated computing infrastructure. Another key component of the Megatron-Core library is its set of advanced model parallelism techniques (tensor, sequence, and pipeline). Currently, popular LLM architectures based on decoders (e.g., GPT, Llama), encoders (e.g., BERT), encoder-decoders (e.g., T5), retrieval-enhanced transformers (e.g., RETRO), and Mixture of Experts (MoE) can easily be built with performance and efficiency at large compute scales. Developers can also use Megatron-Core's transformer blocks and functional APIs to build their own custom layers.

Training Speed and Scalability

Our codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs and model sizes, we consider GPT models from 1 billion all the way to 1 trillion parameters. All models use a vocabulary size of 51,200 and a sequence length of 2048. We vary hidden size, number of attention heads, and number of layers to arrive at a specific model size. As the model size increases, we also modestly increase the batch size. We leverage NVIDIA's Selene supercomputer to perform scaling studies and use up to 3072 A100 GPUs for the largest model. Each cluster node has 8 NVIDIA 80GB A100 GPUs. The graph below shows that we scale nearly linearly up to 1-trillion-parameter models running on 3072 GPUs. Note that these results are from benchmark runs and these models were not trained to convergence; however, the FLOPs are measured for end-to-end training, i.e., they include all operations, including data loading, optimization, and even logging.

Scaling Graph

The following table shows both model (MFU) and hardware (HFU) FLOPs utilization for select configurations up to 1T parameters (see our paper for a description of how these are calculated). As the model size increases, we achieve better GPU utilization. For the one trillion parameter model, we reach an MFU and HFU of 56.3% and 57.0%, respectively. Note that these numbers are also measured on benchmark runs, and in this case are measured using a data parallel size of one. Data parallelism introduces some overhead due to the gradient all-reduce required between the data parallel groups. However, for large transformer models, this overhead is not large and can be almost entirely eliminated by overlapping the gradient all-reduce with backpropagation.

Model Size | Model FLOPs Utilization | Hardware FLOPs Utilization
22B        | 41.5%                   | 43.7%
175B       | 51.4%                   | 52.8%
530B       | 56.0%                   | 57.0%
1T         | 56.3%                   | 57.0%

Setup

We strongly recommend using the latest release of NGC's PyTorch container with DGX nodes. If you can't use this for some reason, use the latest PyTorch, CUDA, NCCL, and NVIDIA APEX releases. Data preprocessing requires NLTK, though this is not required for training, evaluation, or downstream tasks.

You can launch an instance of the PyTorch container and mount Megatron, your dataset, and checkpoints with the following Docker commands:

docker pull nvcr.io/nvidia/pytorch:xx.xx-py3
docker run --gpus all -it --rm -v /path/to/megatron:/workspace/megatron -v /path/to/dataset:/workspace/dataset -v /path/to/checkpoints:/workspace/checkpoints nvcr.io/nvidia/pytorch:xx.xx-py3

Downloading Checkpoints

We have provided pretrained BERT-345M and GPT-345M checkpoints for evaluation or for finetuning on downstream tasks. To access these checkpoints, first sign up for and set up the NVIDIA GPU Cloud (NGC) Registry CLI. Further documentation for downloading models can be found in the NGC documentation.

Alternatively, you can directly download the checkpoints using:

BERT-345M-uncased: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_uncased/zip -O megatron_bert_345m_v0.1_uncased.zip
BERT-345M-cased: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_cased/zip -O megatron_bert_345m_v0.1_cased.zip
GPT-345M: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip

The models require vocabulary files to run. The BERT WordPiece vocab file can be extracted from Google's pretrained BERT models: uncased, cased. The GPT vocab file and merge table can be downloaded directly.
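
For orientation, a minimal sketch of fetching the GPT vocab file and merge table is shown below; the URLs are the commonly used mirrors and may change, so treat them as illustrative rather than authoritative.

wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt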

Usage

After installation, there are several possible workflows. The most comprehensive is:

  1. Data preprocessing
  2. Pretraining
  3. Finetuning (Optional for zero-shot tasks)
  4. Downstream task evaluation or text generation

However, steps 1 and 2 can be replaced by using one of the pretrained models mentioned above.

We've provided several scripts for pretraining both BERT and GPT in the examples directory, as well as scripts for both zero-shot and fine-tuned downstream tasks including MNLI, RACE, WikiText103, and LAMBADA evaluation. There is also a script for GPT interactive text generation.

Training

Data Preprocessing

The training data requires preprocessing. First, place your training data in a loose json format, with one json containing a text sample per line. For example:

{"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"}
{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}

The name of the text field in the json can be changed by using the --json-key flag in preprocess_data.py. The other metadata fields are optional and are not used in training.

The loose json is then processed into a binary format for training. To convert the json into mmap format use preprocess_data.py. An example script to prepare data for BERT training is:

python tools/preprocess_data.py \
       --input my-corpus.json \
       --output-prefix my-bert \
       --vocab-file bert-vocab.txt \
       --tokenizer-type BertWordPieceLowerCase \
       --split-sentences

The output will be two files named, in this case, my-bert_text_sentence.bin and my-bert_text_sentence.idx. The --data-path specified in later BERT training is the full path and new filename, but without the file extension.

For T5, use the same preprocessing as for BERT, perhaps changing the output prefix to:

       --output-prefix my-t5 \

Some minor modifications are required for GPT data preprocessing, namely, the addition of a merge table, an end-of-document token, removal of sentence splitting, and a change to the tokenizer type:

python tools/preprocess_data.py \
       --input my-corpus.json \
       --output-prefix my-gpt2 \
       --vocab-file gpt2-vocab.json \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file gpt2-merges.txt \
       --append-eod

Here the output files are named my-gpt2_text_document.bin and my-gpt2_text_document.idx. As before, in GPT training, use the longer name without the extension as --data-path.

Further command line arguments are described in the source file preprocess_data.py.

BERT Pretraining

The examples/pretrain_bert.sh script runs single GPU 345M parameter BERT pretraining. Debugging is the primary use for single GPU training, as the code base and command line arguments are optimized for highly distributed training. Most of the arguments are fairly self-explanatory. By default, the learning rate decays linearly over the training iterations, starting at --lr and reaching a minimum set by --min-lr over --lr-decay-iters iterations. The fraction of training iterations used for warmup is set by --lr-warmup-fraction. While this is single GPU training, the batch size specified by --micro-batch-size is the batch size of a single forward-backward pass, and the code will perform gradient accumulation steps until it reaches --global-batch-size, which is the batch size per iteration. The data is partitioned into a 949:50:1 ratio for training/validation/test sets (the default is 969:30:1). This partitioning happens on the fly, but is consistent across runs with the same random seed (1234 by default, or specified manually with --seed). We use --train-iters as the number of training iterations requested. Alternatively, one can provide --train-samples, which is the total number of samples to train on. If this option is present, then instead of providing --lr-decay-iters, one will need to provide --lr-decay-samples.
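
For reference, the schedule- and batch-related flags typically look like the following sketch; the values shown are illustrative examples, not defaults read from the script.

       --lr 0.0001 \
       --min-lr 0.00001 \
       --lr-decay-iters 990000 \
       --lr-warmup-fraction 0.01 \
       --train-iters 2000000 \
       --micro-batch-size 4 \
       --global-batch-size 8 \
       --seed 1234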

The logging, checkpoint-saving, and evaluation interval options are specified. Note that the --data-path now includes the additional _text_sentence suffix added in preprocessing, but does not include the file extensions.

Further command line arguments are described in the source file arguments.py.

To run examples/pretrain_bert.sh, make any desired modifications, including setting the environment variables for CHECKPOINT_PATH, VOCAB_FILE, and DATA_PATH. Make sure to set these variables to their paths in the container. Then launch the container with Megatron and the necessary paths mounted (as explained in Setup) and run the example script.
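
A minimal sketch of such a launch from inside the container might look like the following; the paths are examples only, and exactly how the script consumes these variables depends on the version of the example script.

export CHECKPOINT_PATH=/workspace/checkpoints/bert_345m
export VOCAB_FILE=/workspace/dataset/bert-vocab.txt
export DATA_PATH=/workspace/dataset/my-bert_text_sentence
bash examples/pretrain_bert.sh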

GPT Pretraining

The examples/pretrain_gpt.sh script runs single GPU 345M parameter GPT pretraining. As mentioned above, single GPU training is primarily intended for debugging purposes, as the code is optimized for distributed training.

It follows largely the same format as the previous BERT script with a few notable differences: the tokenization scheme used is BPE (which requires a merge table and a json vocabulary file) instead of WordPiece, the model architecture allows for longer sequences (note that the max position embedding must be greater than or equal to the maximum sequence length), and the --lr-decay-style has been set to cosine decay. Note that the --data-path now includes the additional _text_document suffix added in preprocessing, but does not include the file extensions.

Further command line arguments are described in the source file arguments.py.

examples/pretrain_gpt.sh can be launched the same way as described for BERT. Set the env vars and make any other modifications, launch the container with appropriate mounts, and run the script.

T5 Pretraining

Very similar to BERT and GPT, the examples/pretrain_t5.sh script runs single GPU "base" (~220M parameter) T5 pretraining. The primary difference from BERT and GPT is the addition of the following arguments to accommodate the T5 architecture:

  • --kv-channels sets the inner dimension of the "key" and "value" matrices of all attention mechanisms in the model. For BERT and GPT this defaults to the hidden size divided by the number of attention heads, but can be configured for T5.

  • --ffn-hidden-size sets the hidden size in the feed-forward networks within a transformer layer. For BERT and GPT this defaults to 4 times the transformer hidden size, but can be configured for T5.

  • --encoder-seq-length and --decoder-seq-length set the sequence length for the encoder and decoder separately.

All of the other arguments remain as they were for BERT and GPT pretraining. Run this example with the same steps described above for the other scripts.
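
As a reference, the T5-specific flags look roughly like the sketch below; the values are illustrative for a "base"-sized model and should be checked against examples/pretrain_t5.sh.

       --kv-channels 64 \
       --ffn-hidden-size 3072 \
       --encoder-seq-length 512 \
       --decoder-seq-length 128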

Distributed Pretraining

The examples/pretrain_{bert,gpt,t5}_distributed.sh scripts use the PyTorch distributed launcher for distributed training. As such, multi-node training can be achieved by properly setting environment variables. See the official PyTorch documentation for further description of these environment variables. By default, multi-node training uses the nccl distributed backend. A simple set of additional arguments and the use of the PyTorch distributed module with the torchrun elastic launcher (equivalent to python -m torch.distributed.run) are the only additional requirements to adopt distributed training. See any of examples/pretrain_{bert,gpt,t5}_distributed.sh for more details.
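
A minimal sketch of a two-node launch, assuming 8 GPUs per node, is shown below. The environment variable names mirror those used in the distributed example scripts; the training script name and its arguments are placeholders to be filled in from the single-GPU examples.

# Run on every node, changing NODE_RANK accordingly.
GPUS_PER_NODE=8
MASTER_ADDR=<ip of the rank-0 node>
MASTER_PORT=6000
NNODES=2
NODE_RANK=0   # 0 on the first node, 1 on the second
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

torchrun $DISTRIBUTED_ARGS pretrain_bert.py <same arguments as the single-GPU BERT script>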

We use two types of parallelism: data and model parallelism. We provide two distributed data parallel implementations: a simple one of our own that performs gradient all-reduce at the end of the back propagation step, and Torch's distributed data parallel wrapper that overlaps gradient reduction with back propagation computation. To switch between these two options, use --DDP-impl local or --DDP-impl torch, respectively. As expected, Torch distributed data parallelism is more efficient at larger model sizes. For example, for the 8.3 billion parameter model running on 512 GPUs, the scaling increases from 60% to 76% when Torch's distributed data parallel is used. However, the overlapping method requires more memory, and for some configurations (e.g., 2.5 billion parameters using 2-way model parallelism and 1.2 billion parameters with no model parallelism) it can make the overall training slower as a result. We have empirically found that using a smaller model in those cases improves the training time.

Second, we developed a simple and efficient two-dimensional model-parallel approach. To use the first dimension, tensor model parallelism (splitting execution of a single transformer module over multiple GPUs, see Section 3 of our paper), add the --tensor-model-parallel-size flag to specify the number of GPUs among which to split the model, along with the arguments passed to the distributed launcher as mentioned above. To use the second dimension, sequence parallelism, specify --sequence-parallel, which also requires tensor model parallelism to be enabled because it splits across the same GPUs (more details in Section 4.2.2 of our paper).

To use pipeline model parallelism (sharding the transformer modules into stages with an equal number of transformer modules on each stage, and then pipelining execution by breaking the batch into smaller microbatches, see Section 2.2 of our paper), use the --pipeline-model-parallel-size flag to specify the number of stages to split the model into (e.g., splitting a model with 24 transformer layers across 4 stages would mean each stage gets 6 transformer layers each).

We have examples of how to use these two different forms of model parallelism in the example scripts ending in distributed_with_mp.sh.
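
For orientation, a minimal sketch of the model-parallel flags added on top of the distributed launch above; the sizes are purely illustrative.

       --tensor-model-parallel-size 2 \
       --pipeline-model-parallel-size 2 \
       --sequence-parallel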

Other than these minor changes, the distributed training is identical to the training on a single GPU.

The interleaved pipelining schedule (more details in Section 2.2.2 of our paper) can be enabled using the --num-layers-per-virtual-pipeline-stage argument, which controls the number of transformer layers in a virtual stage (by default with the non-interleaved schedule, each GPU will execute a single virtual stage with NUM_LAYERS / PIPELINE_MP_SIZE transformer layers). The total number of layers in the transformer model should be divisible by this argument value. Additionally, the number of microbatches in the pipeline (computed as GLOBAL_BATCH_SIZE / (DATA_PARALLEL_SIZE * MICRO_BATCH_SIZE)) should be divisible by the PIPELINE_MP_SIZE when using this schedule (this condition is checked in an assertion in the code). The interleaved schedule is not supported for pipelines with 2 stages (PIPELINE_MP_SIZE=2).
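
As a worked divisibility check, using the GPT-3 example configuration described later in this README and assuming a micro batch size of 1:

# 1024 GPUs with 8-way tensor and 16-way pipeline parallelism leaves a data-parallel size of 8.
GLOBAL_BATCH_SIZE=1536; MICRO_BATCH_SIZE=1; DATA_PARALLEL_SIZE=8; PIPELINE_MP_SIZE=16
NUM_MICROBATCHES=$((GLOBAL_BATCH_SIZE / (DATA_PARALLEL_SIZE * MICRO_BATCH_SIZE)))   # 192
echo $((NUM_MICROBATCHES % PIPELINE_MP_SIZE))   # prints 0, so the schedule's assertion passes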

Activation Checkpointing and Recomputation

To reduce GPU memory usage when training a large model, we support various forms of activation checkpointing and recomputation. Instead of all activations being stored in memory to be used during backprop, as was traditionally the case in deep learning models, only activations at certain "checkpoints" in the model are retained (or stored) in memory, and the other activations are recomputed on-the-fly when needed for backprop. Note that this kind of checkpointing, activation checkpointing, is very different from the checkpointing of model parameters and optimizer state, which is mentioned elsewhere.

We support two levels of recompute granularity: selective and full. Selective recomputation is the default and is recommended in almost all cases. This mode retains in memory the activations that take less memory storage space and are more expensive to recompute and recomputes the activations that take more memory storage space but are relatively inexpensive to recompute. See our paper for details. You should find that this mode maximizes performance while minimizing the memory required to store activations. To enable selective activation recompute simply use --recompute-activations.

For cases where memory is very limited, full recompute saves just the inputs to a transformer layer, or to a group (block) of transformer layers, and recomputes everything else. To enable full activation recompute, use --recompute-granularity full. When using full activation recompute, there are two methods: uniform and block, chosen using the --recompute-method argument (typical flag combinations are sketched after the list below).

  • The uniform method uniformly divides the transformer layers into groups of layers (each group of size --recompute-num-layers) and stores the input activations of each group in memory. The baseline group size is 1 and, in this case, the input activation of each transformer layer is stored. When the GPU memory is insufficient, increasing the number of layers per group reduces the memory usage, enabling a bigger model to be trained. For example, when --recompute-num-layers is set to 4, only the input activation of each group of 4 transformer layers is stored.

  • The block method recomputes the input activations of a specific number (given by --recompute-num-layers) of individual transformer layers per pipeline stage and stores the input activations of the remaining layers in the pipeline stage. Reducing --recompute-num-layers results in storing the input activations to more transformer layers, which reduces the activation recomputation required in the backprop, thus improving training performance while increasing memory usage. For example, when we specify 5 layers to recompute of 8 layers per pipeline stage, the input activations of only the first 5 transformer layers are recomputed in the backprop step while the input activations for the final 3 layers are stored. --recompute-num-layers can be incrementally increased until the amount of memory storage space required is just small enough to fit in the available memory, thereby both maximally utilizing memory and maximizing performance.
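
The sketch below shows typical flag combinations for these modes; the layer counts are illustrative examples, not recommendations.

# Selective recompute (the default recommendation):
       --recompute-activations

# Full recompute, uniform method, storing inputs of every group of 4 layers (example value):
       --recompute-granularity full --recompute-method uniform --recompute-num-layers 4

# Full recompute, block method, recomputing 5 layers per pipeline stage (example value):
       --recompute-granularity full --recompute-method block --recompute-num-layers 5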

Distributed Optimizer

Usage: --use-distributed-optimizer. Compatible with all model and data types.

The distributed optimizer is a memory savings technique, whereby the optimizer state is evenly distributed across data parallel ranks (versus the traditional method of replicating the optimizer state across data parallel ranks). As described in ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, our implementation distributes all optimizer state that does not overlap with the model state. For example, when using fp16 model params, the distributed optimizer maintains its own separate copy of fp32 main params & grads, which are distributed across DP ranks. When using bf16 model params, however, the distributed optimizer's fp32 main grads are the same as the model's fp32 grads, and so the grads in this case are not distributed (although the fp32 main params are still distributed, as they are separate from the bf16 model params).

Theoretical memory savings vary depending on the combination of the model's param dtype and grad dtype. In our implementation, the theoretical number of bytes per parameter is (where 'd' is the data parallel size):

Param/grad dtypes      | Non-distributed optim | Distributed optim
fp16 param, fp16 grads | 20                    | 4 + 16/d
bf16 param, fp32 grads | 18                    | 6 + 12/d
fp32 param, fp32 grads | 16                    | 8 + 8/d
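
As a quick back-of-the-envelope check of the bf16 row, assuming a data parallel size of 8:

d=8
echo "6 + 12/$d" | bc -l   # 7.5 bytes per parameter, versus 18 without the distributed optimizer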

FlashAttention

Usage: --use-flash-attn. Supports attention head dimensions of at most 128.

FlashAttention is a fast and memory-efficient algorithm to compute exact attention. It speeds up model training and reduces the memory requirement.

To install FlashAttention:

pip install flash-attn

GPT-3 Example

In examples/pretrain_gpt3_175B.sh we have provided an example of how to configure Megatron to train GPT-3 with 175 billion parameters on 1024 GPUs. The script is designed for Slurm with the pyxis plugin but can be easily adapted to any other scheduler. It uses 8-way tensor parallelism and 16-way pipeline parallelism. With the options global-batch-size 1536 and rampup-batch-size 16 16 5859375, the training will start with a global batch size of 16 and linearly increase the global batch size to 1536 over 5,859,375 samples in increments of 16. The training dataset can be either a single dataset or multiple datasets combined with a set of weights.
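
A hedged sketch of the parallelism and batch-size flags described above follows; the architecture values are the commonly cited GPT-3 175B settings and should be verified against examples/pretrain_gpt3_175B.sh.

       --tensor-model-parallel-size 8 \
       --pipeline-model-parallel-size 16 \
       --global-batch-size 1536 \
       --rampup-batch-size 16 16 5859375 \
       --num-layers 96 \
       --hidden-size 12288 \
       --num-attention-heads 96 \
       --seq-length 2048 \
       --max-position-embeddings 2048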

With the full global batch size of 1536 on 1024 A100 GPUs, each iteration takes around 32 seconds, resulting in 138 teraFLOPs per GPU, which is 44% of the theoretical peak FLOPs.

Retro and InstructRetro

Retro (Borgeaud et al., 2022) is an autoregressive decoder-only language model (LM) pretrained with retrieval-augmentation. Retro features practical scalability to support large-scale pretraining from scratch by retrieving from trillions of tokens. Pretraining with retrieval provides a more efficient storage mechanism of factual knowledge, when compared to storing factual knowledge implicitly within the network's parameters, thus largely reducing model parameters while achieving lower perplexity than standard GPT. Retro also provides the flexibility to update the knowledge stored in LMs (Wang et al., 2023a) by updating the retrieval database without training LMs again.

InstructRetro (Wang et al., 2023b) further scales up the size of Retro to 48B, featuring the largest LLM pretrained with retrieval (as of December 2023). The obtained foundation model, Retro 48B, largely outperforms the GPT counterpart in terms of perplexity. With instruction tuning on Retro, InstructRetro demonstrates significant improvement over the instruction tuned GPT on downstream tasks in the zero-shot setting. Specifically, the average improvement of InstructRetro is 7% over its GPT counterpart across 8 short-form QA tasks, and 10% over GPT across 4 challenging long-form QA tasks. We also find that one can ablate the encoder from InstructRetro architecture and directly use the InstructRetro decoder backbone as GPT, while achieving comparable results.

In this repo, we provide an end-to-end reproduction guide to implement Retro and InstructRetro, covering

  • Retrieval database construction, which supports billions or even trillions of tokens as a large-scale retrieval database.
  • Pretraining with retrieval, which supports pretraining from scratch and pretraining from a pretrained GPT model (Retro-fitting).
  • Instruction tuning, where we provide an open-source instruction tuning dataset and the training recipe for instruction tuning on Retro.
  • Downstream task evaluation, where we provide the text generation and evaluation scripts for zero-shot question answering tasks.

Please see tools/retro/README.md for a detailed overview.

Evaluation and Tasks

We provide several command line arguments, detailed in the scripts listed below, to handle various zero-shot and fine-tuned downstream tasks. However, you can also finetune your model from a pretrained checkpoint on other corpora as desired. To do so, simply add the --finetune flag and adjust the input files and training parameters within the original training script. The iteration count will be reset to zero, and the optimizer and internal state will be reinitialized. If the fine-tuning is interrupted for any reason, be sure to remove the --finetune flag before continuing, otherwise the training will start again from the beginning.

Because evaluation requires substantially less memory than training, it may be advantageous to merge a model trained in parallel for use on fewer GPUs in downstream tasks. The following script accomplishes this. This example reads in a GPT model with 4-way tensor and 4-way pipeline model parallelism and writes out a model with 2-way tensor and 2-way pipeline model parallelism.

python tools/checkpoint/util.py \
        --model-type GPT \
        --load-dir checkpoints/gpt3_tp4_pp4 \
        --save-dir checkpoints/gpt3_tp2_pp2 \
        --target-tensor-parallel-size 2 \
        --target-pipeline-parallel-size 2

Several downstream tasks are described for both GPT and BERT models below. They can be run in distributed and model parallel modes with the same changes used in the training scripts.

GPT Text Generation

We have included a simple REST server for text generation in tools/run_text_generation_server.py. You run it much like you would start a pretraining job, specifying an appropriate pretrained checkpoint. There are also a few optional parameters: temperature, top-k, and top-p. See --help or the source file for more information. See examples/run_text_generation_server_345M.sh for an example of how to run the server.

Once the server is running, you can use tools/text_generation_cli.py to query it; it takes one argument, the host the server is running on.

tools/text_generation_cli.py localhost:5000

You can also use CURL or any other tools to query the server directly:

curl 'http://localhost:5000/api' -X 'PUT' -H 'Content-Type: application/json; charset=UTF-8'  -d '{"prompts":["Hello world"], "tokens_to_generate":1}'

See megatron/inference/text_generation_server.py for more API options.

Detoxify GPT via Self-generation

We include an example in examples/detxoify_lm/ to detoxify language models by leveraging the generative power of language models.

See examples/detxoify_lm/README.md for step-by-step tutorials on how to perform domain-adaptive training and detoxify LMs using a self-generated corpus.

GPT Evaluation

We include example scripts for GPT evaluation on WikiText perplexity evaluation and LAMBADA Cloze accuracy.

WikiText Perplexity Evaluation

For an even comparison with prior work, we evaluate perplexity on the word-level WikiText-103 test dataset and appropriately compute perplexity given the change in tokens when using our subword tokenizer.

We use the following command to run WikiText-103 evaluation on a 345M parameter model.

TASK="WIKITEXT103"

VALID_DATA=<wikitext path>.txt
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m

COMMON_TASK_ARGS="--num-layers 24 \
                  --hidden-size 1024 \
                  --num-attention-heads 16 \
                  --seq-length 1024 \
                  --max-position-embeddings 1024 \
                  --fp16 \
                  --vocab-file $VOCAB_FILE"

python tasks/main.py \
       --task $TASK \
       $COMMON_TASK_ARGS \
       --valid-data $VALID_DATA \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file $MERGE_FILE \
       --load $CHECKPOINT_PATH \
       --micro-batch-size 8 \
       --log-interval 10 \
       --no-load-optim \
       --no-load-rng

LAMBADA Cloze Accuracy

To compute LAMBADA cloze accuracy (the accuracy of predicting the last token given the preceding tokens) we utilize a detokenized, processed version of the LAMBADA dataset.

We use the following command to run LAMBADA evaluation on a 345M parameter model. Note that the --strict-lambada flag should be used to require whole word matching. Ensure that lambada is part of the file path.

TASK="LAMBADA"

VALID_DATA=<lambada path>.json
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m
COMMON_TASK_ARGS=<same as those in WikiText Perplexity Evaluation above>

python tasks/main.py \
       --task $TASK \
       $COMMON_TASK_ARGS \
       --valid-data $VALID_DATA \
       --tokenizer-type GPT2BPETokenizer \
       --strict-lambada \
       --merge-file $MERGE_FILE \
       --load $CHECKPOINT_PATH \
       --micro-batch-size 8 \
       --log-interval 10 \
       --no-load-optim \
       --no-load-rng

Further command line arguments are described in the source file main.py.

BERT Task Evaluation

RACE Evaluation

The following script finetunes the BERT model for evaluation on the RACE dataset. The TRAIN_DATA and VALID_DATA directories contain the RACE dataset as separate .txt files. Note that for RACE, the batch size is the number of RACE queries to evaluate. Since each RACE query has four samples, the effective batch size passed through the model will be four times the batch size specified on the command line.

TRAIN_DATA="data/RACE/train/middle"
VALID_DATA="data/RACE/dev/middle \
            data/RACE/dev/high"
VOCAB_FILE=bert-vocab.txt
PRETRAINED_CHECKPOINT=checkpoints/bert_345m
CHECKPOINT_PATH=checkpoints/bert_345m_race
COMMON_TASK_ARGS="--num-layers 24 \
                  --hidden-size 1024 \
                  --num-attention-heads 16 \
                  --seq-length 512 \
                  --max-position-embeddings 512 \
                  --fp16 \
                  --vocab-file $VOCAB_FILE"

COMMON_TASK_ARGS_EXT="--train-data $TRAIN_DATA \
                      --valid-data $VALID_DATA \
                      --pretrained-checkpoint $PRETRAINED_CHECKPOINT \
                      --save-interval 10000 \
                      --save $CHECKPOINT_PATH \
                      --log-interval 100 \
                      --eval-interval 1000 \
                      --eval-iters 10 \
                      --weight-decay 1.0e-1"

python tasks/main.py \
       --task RACE \
       $COMMON_TASK_ARGS \
       $COMMON_TASK_ARGS_EXT \
       --tokenizer-type BertWordPieceLowerCase \
       --epochs 3 \
       --micro-batch-size 4 \
       --lr 1.0e-5 \
       --lr-warmup-fraction 0.06

MNLI Evaluation

The following script finetunes the BERT model for evaluation with the MultiNLI sentence pair corpus. Because the matching tasks are quite similar, the script can be quickly tweaked to work with the Quora Question Pairs (QQP) dataset as well.

TRAIN_DATA="data/glue_data/MNLI/train.tsv"
VALID_DATA="data/glue_data/MNLI/dev_matched.tsv \
            data/glue_data/MNLI/dev_mismatched.tsv"
PRETRAINED_CHECKPOINT=checkpoints/bert_345m
VOCAB_FILE=bert-vocab.txt
CHECKPOINT_PATH=checkpoints/bert_345m_mnli
COMMON_TASK_ARGS=<same as those in RACE Evaluation above>
COMMON_TASK_ARGS_EXT=<same as those in RACE Evaluation above>

python tasks/main.py \
       --task MNLI \
       $COMMON_TASK_ARGS \
       $COMMON_TASK_ARGS_EXT \
       --tokenizer-type BertWordPieceLowerCase \
       --epochs 5 \
       --micro-batch-size 8 \
       --lr 5.0e-5 \
       --lr-warmup-fraction 0.065

Llama-2 Inference and Finetuning

The Llama-2 family of models is an open-source set of pretrained and finetuned (for chat) models that have achieved strong results across a wide set of benchmarks. At the time of release, Llama-2 models achieved among the best results for open-source models and were competitive with the closed-source GPT-3.5 model (see https://arxiv.org/pdf/2307.09288.pdf).

The Llama-2 checkpoints can be loaded into Megatron for inference and finetuning. See documentation here.

Model Optimization and Deployment

Megatron-Core (MCore) GPTModel family supports advanced quantization algorithms and high-performance inference through TensorRT-LLM.

Quantization and TensorRT-LLM Deployment

See Megatron Model Optimization and Deployment for llama2 and nemotron3 examples.

Datasets

We do not host any datasets for GPT or BERT training; however, we detail their collection so that our results may be reproduced.

Collecting Wikipedia Training Data

We recommend following the Wikipedia data extraction process specified by Google research: "the recommended pre-processing is to download the latest dump, extract the text with WikiExtractor.py, and then apply any necessary cleanup to convert it into plain text."

We recommend using the --json argument when using WikiExtractor, which will dump the Wikipedia data into loose json format (one json object per line), making it more manageable on the file system and also readily consumable by our codebase. We recommend further preprocessing this json dataset with nltk punctuation standardization. For BERT training, use the --split-sentences flag to preprocess_data.py as described above to include sentence breaks in the produced index. If you'd like to use Wikipedia data for GPT training you should still clean it with nltk/spacy/ftfy, but do not use the --split-sentences flag.
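
A minimal sketch of this extraction step is shown below; the exact WikiExtractor invocation varies by version, and the dump URL is the standard Wikimedia location, so treat both as illustrative.

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
python WikiExtractor.py --json enwiki-latest-pages-articles.xml.bz2 -o extracted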

Collecting GPT Webtext Data

We utilize the publicly available OpenWebText library from jcpeterson and eukaryote31's work to download URLs. We then filter, clean, and deduplicate all downloaded content according to the procedure described in our openwebtext directory. For Reddit URLs corresponding to content up to October 2018, we arrived at approximately 37GB of content.

Reproducibility

Megatron training is intended to be bitwise reproducible. This means that the same training config run twice in the same HW and SW environment should produce identical model checkpoints, losses and accuracy metric values (iteration time metrics may vary).

There are currently two known Megatron optimizations that break reproducibility whilst still producing almost identical training runs. The following workarounds should be applied in cases where reproducibility is required:

  1. When training using --bf16, reproducibility is only obtained when the checkpointing and resume schedule of training is identical. If the checkpointing schedule will change, i.e. checkpointing and resume will occur at different iterations, the option --no-bias-gelu-fusion should be used.
  2. Flash attention is nondeterministic. If reproducibility is required do not use --use-flash-attn.

These sources of nondeterminism are under active investigation. If you observe nondeterminism in Megatron training under other circumstances please open an issue.

Projects Using Megatron

Below are some of the projects where we have directly used Megatron:

megatron-lm's People

Contributors

aklife97, akoumpa, blahblahhhj, borisfom, boxiangw, boxin-wbx, deepakn94, ekmb, erhoo82, ericharper, fanshiqing, huvunvidia, jaredcasper, jiemingz, jon-barker, kantneel, ksivaman, kvareddy, lmcafee-nvidia, maanug-nv, maximumentropy, mikolajblaz, mpatwary, pytlab, sanandaraj5597, sudhakarsingh27, wdykas, xrennvidia, yanring, zliucr


megatron-lm's Issues

Can we get some samples?

Hi!

Out of interest in GPT-2 and the Megatron LM, can we get an idea of what the code outputs? I.e., some output samples of what the tool actually does, instead of having to run it just to see what it can do.

KeyError: running GPT text generation sample

Hello,

Running in the 20.12 PyTorch NGC container on V100, when I try to run examples/generate_text.sh I get a series of errors...

  1. It can't find tools/generate_samples_gpt2.py

If I change the name of tools/generate_samples_gpt.py to tools/generate_samples_gpt2.py it proceeds for a bit until...

  2. It doesn't like the batch size parameter:

AssertionError: --batch-size argument is no longer valid, use --micro-batch-size instead

If I change the argument name in generate_text.sh I can get it to proceed more until...

  3. I get a KeyError:

building GPT2BPETokenizer tokenizer ...
padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
initializing torch distributed ...
initializing tensor model parallel with size 1
initializing pipeline model parallel with size 1
setting random seeds to 1234 ...
initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT model ...
number of parameters on (tensor, pipeline) model parallel rank (0, 0): 354871296
WARNING: could not find the metadata file checkpoints/gpt2_345m/latest_checkpointed_iteration.txt
will not load any checkpoints and will start from random
Avg s/batch: 47.92085814476013
Traceback (most recent call last):
File "tools/generate_samples_gpt2.py", line 116, in
main()
File "tools/generate_samples_gpt2.py", line 111, in main
generate_and_write_samples_unconditional(model)
File "/workspace/megatron/text_generation_utils.py", line 339, in generate_and_write_samples_unconditional
for datum in generate_samples_unconditional(model):
File "/workspace/megatron/text_generation_utils.py", line 317, in generate_samples_unconditional
text = tokenizer.detokenize(tokens)
File "/workspace/megatron/tokenizer/tokenizer.py", line 216, in detokenize
return self.tokenizer.decode(token_ids)
File "/workspace/megatron/tokenizer/gpt2_tokenization.py", line 284, in decode
text = ''.join([self.decoder[token] for token in tokens])
File "/workspace/megatron/tokenizer/gpt2_tokenization.py", line 284, in
text = ''.join([self.decoder[token] for token in tokens])
KeyError: 50280

Any advice on how to proceed? Were the changes I made appropriate?

Thank you!

Apex dependency

In requirements.txt, there is no info about Apex. But when we run Megatron, it requires Apex compiled with the cpp extension to be installed. Could you update the README to include Apex installation/requirement information?

OOM when training the same size of gpt2(2.6B) with mp=2 dp=8 with 64*V100(32GB)

We train GPT-2 (2.6B) with the following parameters, but hit OOM:

using world size: 16 and model-parallel size: 2
using torch.float16 for parameters ...
-------------------- arguments --------------------
adam_beta1 ...................... 0.9
adam_beta2 ...................... 0.999
adam_eps ........................ 1e-08
adlr_autoresume ................. False
adlr_autoresume_interval ........ 1000
apply_query_key_layer_scaling ... False
apply_residual_connection_post_layernorm False
attention_dropout ............... 0.1
attention_softmax_in_fp32 ....... False
batch_size ...................... 8
bert_load ....................... None
bias_dropout_fusion ............. False
bias_gelu_fusion ................ False
block_data_path ................. None
checkpoint_activations .......... True
checkpoint_num_layers ........... 1
clip_grad ....................... 1.0
data_impl ....................... mmap
data_path ....................... /raid/gpt3-train-data/filterBy256-100G-notag_text_document
DDP_impl ........................ local
distribute_checkpointed_activations True
distributed_backend ............. nccl
dynamic_loss_scale .............. True
eod_mask_loss ................... False
eval_interval ................... 10000
eval_iters ...................... 10
exit_interval ................... None
faiss_use_gpu ................... False
finetune ........................ False
fp16 ............................ True
fp16_lm_cross_entropy ........... True
fp32_allreduce .................. False
hidden_dropout .................. 0.1
hidden_size ..................... 1920
hysteresis ...................... 2
ict_head_size ................... None
ict_load ........................ None
indexer_batch_size .............. 128
indexer_log_interval ............ 1000
init_method_std ................. 0.02
layernorm_epsilon ............... 1e-05
lazy_mpu_init ................... None
load ............................ checkpoints/gpt2_64_xxxM
local_rank ...................... 0
log_interval .................... 5
loss_scale ...................... None
loss_scale_window ............... 1000
lr .............................. 0.00015
lr_decay_iters .................. 70000
lr_decay_style .................. cosine
make_vocab_size_divisible_by .... 128
mask_prob ....................... 0.15
max_position_embeddings ......... 1024
merge_file ...................... bpe_3w_new/merges.txt
min_lr .......................... 1e-05
min_scale ....................... 1
mmap_warmup ..................... False
model_parallel_size ............. 2
no_load_optim ................... False
no_load_rng ..................... False
no_save_optim ................... False
no_save_rng ..................... False
num_attention_heads ............. 20
num_layers ...................... 54
num_unique_layers ............... None
num_workers ..................... 2
onnx_safe ....................... None
openai_gelu ..................... False
override_lr_scheduler ........... False
param_sharing_style ............. grouped
params_dtype .................... torch.float16
query_in_block_prob ............. 0.1
rank ............................ 0
report_topk_accuracies .......... []
reset_attention_mask ............ False
reset_position_ids .............. False
save ............................ checkpoints/gpt2_64_xxxM
save_interval ................... 10000
scaled_masked_softmax_fusion .... False
scaled_upper_triang_masked_softmax_fusion False
seed ............................ 1234
seq_length ...................... 1024
short_seq_prob .................. 0.1
split ........................... 950,49,1
tensorboard_dir ................. logs/gpt2_64_xxxxxxxxx
titles_data_path ................ None
tokenizer_type .................. GPT2BPETokenizer
train_iters ..................... 120000
use_checkpoint_lr_scheduler ..... False
use_cpu_initialization .......... True
use_one_sent_docs ............... False
vocab_file ...................... bpe_3w_new/vocab.json
warmup .......................... 0.01
weight_decay .................... 0.01
world_size ...................... 16
---------------- end of arguments ----------------

building GPT2BPETokenizer tokenizer ...
padded vocab (size: 30001) with 207 dummy tokens (new size: 30208)
setting tensorboard ...
initializing torch distributed ...
initializing model parallel with size 2
setting random seeds to 1234 ...
initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building the checkpointed activations memory buffer with 424673280 num elements and torch.float16 dtype (810.0 MB)...
building GPT2 model ...
number of parameters on model parallel rank 1: 1226348160
number of parameters on model parallel rank 0: 1226348160
learning rate decay style: cosine
WARNING: could not find the metadata file checkpoints/gpt2_64_xxxM/latest_checkpointed_iteration.txt
will not load any checkpoints and will start from random
building train, validation, and test datasets ...
datasets target sizes (minimum size):
train: 7680000
validation: 8320
test: 640
building train, validation, and test datasets for GPT2 ...
building dataset index ...
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
finished creating indexed dataset in 0.021106 seconds
number of documents: 48889798
dataset split:
train:
document indices in [0, 46445308) total of 46445308 documents
validation:
document indices in [46445308, 48840908) total of 2395600 documents
test:
document indices in [48840908, 48889798) total of 48890 documents


...


setting training data start iteration to 0
setting validation data start iteration to 0
done with setups ...
time (ms) | model and optimizer: 78841.91 | train/valid/test data iterators: 56110.31
training ...
iteration 5/ 120000 | elapsed time per iteration (ms): 16491.5 | learning rate: 0.000E+00 | loss scale: 268435456.0 | number of skipped iterations: 5 | number of nan iterations: 0 |
after 5 iterations memory (MB) | allocated: 10250.5693359375 | max allocated: 12606.06201171875 | reserved: 17482.0 | max reserved: 17482.0
time (ms) | forward: 11716.18 | backward: 4770.37 | backward-backward: 1742.88 | backward-allreduce: 195.31 | backward-master-grad: 2832.04 | backward-clip-grad: 0.03 | optimizer: 0.06 | batch generator: 2.61
iteration 10/ 120000 | elapsed time per iteration (ms): 1993.9 | learning rate: 0.000E+00 | loss scale: 8388608.0 | number of skipped iterations: 5 | number of nan iterations: 0 |
time (ms) | forward: 483.93 | backward: 1502.71 | backward-backward: 1395.29 | backward-allreduce: 103.87 | backward-master-grad: 3.43 | backward-clip-grad: 0.02 | optimizer: 0.04 | batch generator: 3.45
iteration 15/ 120000 | elapsed time per iteration (ms): 1978.1 | learning rate: 0.000E+00 | loss scale: 262144.0 | number of skipped iterations: 5 | number of nan iterations: 0 |
time (ms) | forward: 482.15 | backward: 1488.69 | backward-backward: 1395.11 | backward-allreduce: 88.19 | backward-master-grad: 5.28 | backward-clip-grad: 0.02 | optimizer: 0.04 | batch generator: 1.42
iteration 20/ 120000 | elapsed time per iteration (ms): 4540.0 | learning rate: 6.429E-07 | lm loss: 1.054034E+01 | loss scale: 65536.0 | number of skipped iterations: 2 | number of nan iterations: 0 |
time (ms) | forward: 1179.18 | backward: 1651.14 | backward-backward: 1399.57 | backward-allreduce: 106.88 | backward-master-grad: 107.87 | backward-clip-grad: 36.73 | optimizer: 1703.15 | batch generator: 36.27
iteration 25/ 120000 | elapsed time per iteration (ms): 2335.3 | learning rate: 1.286E-06 | lm loss: 9.976491E+00 | loss scale: 16384.0 | number of skipped iterations: 2 | number of nan iterations: 0 |
time (ms) | forward: 650.95 | backward: 1645.41 | backward-backward: 1397.91 | backward-allreduce: 140.40 | backward-master-grad: 74.19 | backward-clip-grad: 32.81 | optimizer: 34.49 | batch generator: 3.38
iteration 30/ 120000 | elapsed time per iteration (ms): 2325.9 | learning rate: 2.357E-06 | lm loss: 8.780370E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 556.28 | backward: 1708.32 | backward-backward: 1404.33 | backward-allreduce: 113.45 | backward-master-grad: 137.84 | backward-clip-grad: 52.61 | optimizer: 58.05 | batch generator: 1.82
iteration 35/ 120000 | elapsed time per iteration (ms): 2340.5 | learning rate: 3.429E-06 | lm loss: 8.286386E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 561.67 | backward: 1718.01 | backward-backward: 1396.69 | backward-allreduce: 166.71 | backward-master-grad: 102.91 | backward-clip-grad: 51.60 | optimizer: 58.24 | batch generator: 3.14
iteration 40/ 120000 | elapsed time per iteration (ms): 2320.2 | learning rate: 4.500E-06 | lm loss: 7.960351E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 496.76 | backward: 1760.00 | backward-backward: 1401.41 | backward-allreduce: 212.65 | backward-master-grad: 87.91 | backward-clip-grad: 57.93 | optimizer: 57.71 | batch generator: 7.84
iteration 45/ 120000 | elapsed time per iteration (ms): 2222.0 | learning rate: 5.571E-06 | lm loss: 7.752969E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 515.71 | backward: 1644.29 | backward-backward: 1401.37 | backward-allreduce: 93.53 | backward-master-grad: 97.84 | backward-clip-grad: 51.46 | optimizer: 57.72 | batch generator: 2.10
iteration 50/ 120000 | elapsed time per iteration (ms): 2333.7 | learning rate: 6.643E-06 | lm loss: 7.703540E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 547.30 | backward: 1724.37 | backward-backward: 1399.37 | backward-allreduce: 102.34 | backward-master-grad: 171.11 | backward-clip-grad: 51.46 | optimizer: 57.51 | batch generator: 4.48
iteration 55/ 120000 | elapsed time per iteration (ms): 2218.5 | learning rate: 7.714E-06 | lm loss: 7.548244E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 502.01 | backward: 1651.20 | backward-backward: 1399.92 | backward-allreduce: 109.33 | backward-master-grad: 88.23 | backward-clip-grad: 53.63 | optimizer: 59.16 | batch generator: 1.40
iteration 60/ 120000 | elapsed time per iteration (ms): 2236.7 | learning rate: 8.786E-06 | lm loss: 7.576575E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 525.56 | backward: 1647.52 | backward-backward: 1398.55 | backward-allreduce: 100.72 | backward-master-grad: 96.68 | backward-clip-grad: 51.47 | optimizer: 57.61 | batch generator: 1.38
iteration 65/ 120000 | elapsed time per iteration (ms): 2212.0 | learning rate: 9.857E-06 | lm loss: 7.420119E+00 | loss scale: 16384.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms) | forward: 506.30 | backward: 1643.29 | backward-backward: 1398.46 | backward-allreduce: 96.29 | backward-master-grad: 90.85 | backward-clip-grad: 57.61 | optimizer: 57.66 | batch generator: 1.42
Traceback (most recent call last):
File "pretrain_gpt2.py", line 115, in
args_defaults={'tokenizer_type': 'GPT2BPETokenizer'})
File "/userhome/megatron/megatron/training.py", line 109, in pretrain
train_data_iterator, valid_data_iterator)
File "/userhome/megatron/megatron/training.py", line 438, in train
lr_scheduler)
File "/userhome/megatron/megatron/training.py", line 300, in train_step
backward_step(optimizer, model, loss)
File "/userhome/megatron/megatron/training.py", line 265, in backward_step
fp32_allreduce=args.fp32_allreduce)
File "/userhome/megatron/megatron/model/distributed.py", line 53, in allreduce_params
coalesced = _flatten_dense_tensors(grads)
File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 229, in _flatten_dense_tensors
flat = torch.cat([t.contiguous().view(-1) for t in tensors], dim=0)
RuntimeError: CUDA out of memory. Tried to allocate 2.29 GiB (GPU 0; 31.72 GiB total capacity; 19.22 GiB already allocated; 1.88 GiB free; 28.32 GiB reserved in total by PyTorch)
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in
main()
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'pretrain_gpt2.py', '--local_rank=15', '--model-parallel-size', '2', '--num-layers', '54', '--hidden-size', '1920', '--num-attention-heads', '20', '--batch-size', '8', '--seq-length', '1024', '--max-position-embeddings', '1024', '--train-iters', '120000', '--lr-decay-iters', '70000', '--save', 'checkpoints/gpt2_64_xxxM', '--load', 'checkpoints/gpt2_64_xxxM', '--data-path', '/raid/gpt3-train-data/filterBy256-100G-notag_text_document', '--vocab-file', 'bpe_3w_new/vocab.json', '--merge-file', 'bpe_3w_new/merges.txt', '--data-impl', 'mmap', '--split', '950,49,1', '--distributed-backend', 'nccl', '--lr', '0.00015', '--lr-decay-style', 'cosine', '--min-lr', '1.0e-5', '--weight-decay', '1e-2', '--clip-grad', '1.0', '--warmup', '.01', '--checkpoint-activations', '--log-interval', '5', '--tensorboard-dir', 'logs/gpt2_64_xxxxxxxxx', '--save-interval', '10000', '--eval-interval', '10000', '--eval-iters', '10', '--checkpoint-num-layers', '1', '--fp16', '--checkpoint-activations', '--distribute-checkpointed-activations', '--fp16-lm-cross-entropy', '--use-cpu-initialization']' returned non-zero exit status 1.


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Error running bert pretraining example

I'm using the following data for my-corpus.json (as demonstrated in the README):

{"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"}
{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}

pre-processing the data as follows:

python tools/preprocess_data.py \
       --input my-corpus.json \
       --output-prefix my-bert \
       --vocab bert-vocab.txt \
       --dataset-impl mmap \
       --tokenizer-type BertWordPieceLowerCase \
       --split-sentences

And running the training script as follows (again from README):

CHECKPOINT_PATH=checkpoints/bert_345m
VOCAB_FILE=bert-vocab.txt
DATA_PATH=my-bert_text_sentence

BERT_ARGS="--num-layers 24 \
           --hidden-size 1024 \
           --num-attention-heads 16 \
           --seq-length 512 \
           --max-position-embeddings 512 \
           --lr 0.0001 \
           --lr-decay-iters 990000 \
           --train-iters 2000000 \
           --min-lr 0.00001 \
           --lr-warmup-fraction 0.01 \
	   --micro-batch-size 4 \	   
           --global-batch-size 8 \
           --vocab-file $VOCAB_FILE \
           --split 949,50,1 \
           --fp16"

OUTPUT_ARGS="--log-interval 10 \
             --save-interval 500 \
             --eval-interval 100 \
             --eval-iters 10 \
             --checkpoint-activations"

python pretrain_bert.py \
       $BERT_ARGS \
       $OUTPUT_ARGS \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH

However, I run into the following error during training:

Traceback (most recent call last):
  File "pretrain_bert.py", line 155, in <module>
    args_defaults={'tokenizer_type': 'BertWordPieceLowerCase'})
  File "/data/users/pritam/Megatron-LM/megatron/training.py", line 116, in pretrain
    train_valid_test_dataset_provider)
  File "/data/users/pritam/Megatron-LM/megatron/training.py", line 1000, in build_train_valid_test_data_iterators
    valid_ds, args.consumed_valid_samples)
  File "/data/users/pritam/Megatron-LM/megatron/data/data_loaders.py", line 38, in build_pretraining_data_loader
    data_parallel_size=mpu.get_data_parallel_world_size())
  File "/data/users/pritam/Megatron-LM/megatron/data/data_loaders.py", line 62, in __init__
    'no sample to consume: {}'.format(self.total_samples)
AssertionError: no sample to consume: 0

Full log: https://gist.github.com/pritamdamania87/7141eadd162ba672b465a7920e62508e
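One possible cause, an assumption on my part rather than something confirmed in the thread: with only two documents, the 949,50,1 split can leave the validation set with zero samples, which is exactly what the assertion reports. A quick sketch (hypothetical helper, same loose-JSON format) that pads out my-corpus.json so every split gets at least a few sentences:

```python
import json

# Hypothetical helper: repeat the README's two example documents enough times
# that the 949,50,1 split leaves every subset with at least a few sentences.
docs = [
    {"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "title": "First Part"},
    {"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "title": "Second Part"},
]
with open("my-corpus.json", "w", encoding="utf-8") as f:
    for i in range(1000):
        doc = dict(docs[i % len(docs)], id=str(i))
        f.write(json.dumps(doc) + "\n")
```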

FileNotFoundError Issues while running on 2 nodes

Hi there,
I want to run distributed training on two servers, each with 4 GPUs.
I have modified the examples/pretrain_bert_distributed.sh file accordingly, as follows:

# this is config on node-1
GPUS_PER_NODE=4
# Change for multinode config
MASTER_ADDR=<peer-ip>
MASTER_PORT=6000
NNODES=2
NODE_RANK=1
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

I can successfully run the master process on node-0, but when I launch bash examples/pretrain_bert_distributed.sh on node-1, I get the following error:

Traceback (most recent call last):
  File "pretrain_bert.py", line 122, in <module>
Traceback (most recent call last):
  File "pretrain_bert.py", line 122, in <module>
    args_defaults={'tokenizer_type': 'BertWordPieceLowerCase'})
  File "/home/ubuntu/Megatron-LM/megatron/training.py", line 85, in pretrain
    args_defaults={'tokenizer_type': 'BertWordPieceLowerCase'})
  File "/home/ubuntu/Megatron-LM/megatron/training.py", line 85, in pretrain
    train_valid_test_dataset_provider)
  File "/home/ubuntu/Megatron-LM/megatron/training.py", line 496, in build_train_valid_test_data_iterators
    train_valid_test_dataset_provider)
  File "/home/ubuntu/Megatron-LM/megatron/training.py", line 496, in build_train_valid_test_data_iterators
    train_val_test_num_samples)
  File "pretrain_bert.py", line 113, in train_valid_test_datasets_provider
    train_val_test_num_samples)
  File "pretrain_bert.py", line 113, in train_valid_test_datasets_provider
    skip_warmup=(not args.mmap_warmup))
  File "/home/ubuntu/Megatron-LM/megatron/data/bert_dataset.py", line 95, in build_train_valid_test_datasets
    skip_warmup=(not args.mmap_warmup))
  File "/home/ubuntu/Megatron-LM/megatron/data/bert_dataset.py", line 95, in build_train_valid_test_datasets
    train_dataset = build_dataset(0, 'train')
  File "/home/ubuntu/Megatron-LM/megatron/data/bert_dataset.py", line 86, in build_dataset
    train_dataset = build_dataset(0, 'train')
  File "/home/ubuntu/Megatron-LM/megatron/data/bert_dataset.py", line 86, in build_dataset
    seed=seed)
  File "/home/ubuntu/Megatron-LM/megatron/data/bert_dataset.py", line 125, in __init__
    seed=seed)
  File "/home/ubuntu/Megatron-LM/megatron/data/bert_dataset.py", line 125, in __init__
    self.name)
  File "/home/ubuntu/Megatron-LM/megatron/data/bert_dataset.py", line 282, in get_samples_mapping_
    self.name)
  File "/home/ubuntu/Megatron-LM/megatron/data/bert_dataset.py", line 282, in get_samples_mapping_
    samples_mapping = np.load(indexmap_filename, allow_pickle=True)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/numpy/lib/npyio.py", line 428, in load
    samples_mapping = np.load(indexmap_filename, allow_pickle=True)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/numpy/lib/npyio.py", line 428, in load
    fid = open(os_fspath(file), "rb")
    fid = open(os_fspath(file), "rb")
FileNotFoundError: [Errno 2] No such file or directory: 'my-bert_text_sentence_train_indexmap_32000000mns_512msl_0.10ssp_1234s.npy'
FileNotFoundError: [Errno 2] No such file or directory: 'my-bert_text_sentence_train_indexmap_32000000mns_512msl_0.10ssp_1234s.npy'

Does the program assume that the different servers share data over the network? I see that the missing file is actually generated on the master node. If I copy the *.npy files from node-0 to node-1, training runs.

Is model mp_rank_00 correct?

Following the steps, I downloaded the GPT model (wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip) along with the vocabulary and merge files, but the WikiText eval result is not very good. Is there something wrong?


validation results on WIKITEXT103 | avg loss: 1.3458E+01 | ppl: 6.9909E+05 | adjusted ppl: 2.7160E+06 | token ratio: 1.1008449901248143 |

What is more, the GPT-2 checkpoint I downloaded is the release version, and my bash script is:

TASK="WIKITEXT103"

# VALID_DATA=lambada.valid.tokens
VALID_DATA=../wikitext.test.tokens
VOCAB_FILE=checkpoints/gpt2-vocab.json
MERGE_FILE=checkpoints/gpt2-merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m

COMMON_TASK_ARGS="--num-layers 24 \
                  --hidden-size 1024 \
                  --num-attention-heads 16 \
                  --seq-length 1024 \
                  --max-position-embeddings 1024 \
                  --fp16 \
                  --vocab-file $VOCAB_FILE"


export CUDA_VISIBLE_DEVICES=7
python3 -u tasks/main.py \
       --task $TASK \
       $COMMON_TASK_ARGS \
       --valid-data $VALID_DATA \
       --strict-lambada \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file $MERGE_FILE \
       --load $CHECKPOINT_PATH \
       --batch-size 8 \
       --checkpoint-activations \
       --log-interval 10 \
       --no-load-optim \
       --no-load-rng

The log:

using world size: 1 and model-parallel size: 1 
using torch.float16 for parameters ...
-------------------- arguments --------------------
  adlr_autoresume ................. False
  adlr_autoresume_interval ........ 1000
  apply_query_key_layer_scaling ... False
  apply_residual_connection_post_layernorm  False
  attention_dropout ............... 0.1
  attention_softmax_in_fp32 ....... False
  batch_size ...................... 8
  bert_load ....................... None
  bias_dropout_fusion ............. False
  bias_gelu_fusion ................ False
  block_data_path ................. None
  checkpoint_activations .......... True
  checkpoint_num_layers ........... 1
  clip_grad ....................... 1.0
  data_impl ....................... infer
  data_path ....................... None
  DDP_impl ........................ local
  distribute_checkpointed_activations  False
  distributed_backend ............. nccl
  dynamic_loss_scale .............. True
  eod_mask_loss ................... False
  epochs .......................... None
  eval_interval ................... 1000
  eval_iters ...................... 100
  exit_interval ................... None
  faiss_use_gpu ................... False
  finetune ........................ False
  fp16 ............................ True
  fp16_lm_cross_entropy ........... False
  fp32_allreduce .................. False
  hidden_dropout .................. 0.1
  hidden_size ..................... 1024
  hysteresis ...................... 2
  ict_head_size ................... None
  ict_load ........................ None
  indexer_batch_size .............. 128
  indexer_log_interval ............ 1000
  init_method_std ................. 0.02
  keep_last ....................... False
  layernorm_epsilon ............... 1e-05
  lazy_mpu_init ................... None
  load ............................ checkpoints/gpt2_345m
  local_rank ...................... None
  log_interval .................... 10
  loss_scale ...................... None
  loss_scale_window ............... 1000
  lr .............................. None
  lr_decay_iters .................. None
  lr_decay_style .................. linear
  make_vocab_size_divisible_by .... 128
  mask_prob ....................... 0.15
  max_position_embeddings ......... 1024
  merge_file ...................... checkpoints/gpt2-merges.txt
  min_lr .......................... 0.0
  min_scale ....................... 1
  mmap_warmup ..................... False
  model_parallel_size ............. 1
  no_load_optim ................... True
  no_load_rng ..................... True
  no_save_optim ................... False
  no_save_rng ..................... False
  num_attention_heads ............. 16
  num_layers ...................... 24
  num_unique_layers ............... None
  num_workers ..................... 2
  onnx_safe ....................... None
  openai_gelu ..................... False
  overlapping_eval ................ 32
  override_lr_scheduler ........... False
  param_sharing_style ............. grouped
  params_dtype .................... torch.float16
  pretrained_checkpoint ........... None
  query_in_block_prob ............. 0.1
  rank ............................ 0
  report_topk_accuracies .......... []
  reset_attention_mask ............ False
  reset_position_ids .............. False
  save ............................ None
  save_interval ................... None
  scaled_upper_triang_masked_softmax_fusion  False
  seed ............................ 1234
  seq_length ...................... 1024
  short_seq_prob .................. 0.1
  split ........................... 969, 30, 1
  strict_lambada .................. True
  task ............................ WIKITEXT103
  tensorboard_dir ................. None
  titles_data_path ................ None
  tokenizer_type .................. GPT2BPETokenizer
  train_data ...................... None
  train_iters ..................... None
  use_checkpoint_lr_scheduler ..... False
  use_cpu_initialization .......... False
  use_one_sent_docs ............... False
  valid_data ...................... ['../wikitext.test.tokens']
  vocab_file ...................... checkpoints/gpt2-vocab.json
  warmup .......................... 0.01
  weight_decay .................... 0.01
  world_size ...................... 1
---------------- end of arguments ----------------
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initializing model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
 > number of parameters on model parallel rank 0: 354871296
global rank 0 is loading checkpoint checkpoints/gpt2_345m/release/mp_rank_00/model_optim_rng.pt
could not find arguments in the checkpoint ...
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
  successfully loaded checkpoints/gpt2_345m/release/mp_rank_00/model_optim_rng.pt
 > number of original tokens: 245566, number of detokenized tokens: 270330
> working on iteration: 0
> working on iteration: 10
> working on iteration: 20
> working on iteration: 30
> working on iteration: 40
> working on iteration: 50
> working on iteration: 60
> working on iteration: 70
> working on iteration: 80
> working on iteration: 90
> working on iteration: 100
> working on iteration: 110
> working on iteration: 120
> working on iteration: 130
> working on iteration: 140
> working on iteration: 150
> working on iteration: 160
> working on iteration: 170
> working on iteration: 180
> working on iteration: 190
> working on iteration: 200
> working on iteration: 210
> working on iteration: 220
> working on iteration: 230
> working on iteration: 240
> working on iteration: 250
> working on iteration: 260
> working on iteration: 270
> working on iteration: 280
> working on iteration: 290
> working on iteration: 300
> working on iteration: 310
> working on iteration: 320
> working on iteration: 330
> working on iteration: 340
> working on iteration: 350
> working on iteration: 360
> working on iteration: 370
> working on iteration: 380
> working on iteration: 390
> working on iteration: 400
> working on iteration: 410
> working on iteration: 420
> working on iteration: 430
> working on iteration: 440
> working on iteration: 450
> working on iteration: 460
> working on iteration: 470
> working on iteration: 480
> working on iteration: 490
> working on iteration: 500
> working on iteration: 510
> working on iteration: 520
> working on iteration: 530
> working on iteration: 540
> working on iteration: 550
> working on iteration: 560
> working on iteration: 570
> working on iteration: 580
> working on iteration: 590
> working on iteration: 600
> working on iteration: 610
> working on iteration: 620
> working on iteration: 630
> working on iteration: 640
> working on iteration: 650
> working on iteration: 660
> working on iteration: 670
> working on iteration: 680
> working on iteration: 690
> working on iteration: 700
> working on iteration: 710
> working on iteration: 720
> working on iteration: 730
> working on iteration: 740
> working on iteration: 750
> working on iteration: 760
> working on iteration: 770
> working on iteration: 780
> working on iteration: 790
> working on iteration: 800
> working on iteration: 810
> working on iteration: 820
> working on iteration: 830
> working on iteration: 840
> working on iteration: 850
> working on iteration: 860
> working on iteration: 870
> working on iteration: 880
> working on iteration: 890
> working on iteration: 900
> working on iteration: 910
> working on iteration: 920
> working on iteration: 930
> working on iteration: 940
> working on iteration: 950
> working on iteration: 960
> working on iteration: 970
> working on iteration: 980
> working on iteration: 990
> working on iteration: 1000
> working on iteration: 1010
> working on iteration: 1020
> working on iteration: 1030
> working on iteration: 1040
> working on iteration: 1050
-------------------------------------------------------------------------------------------------------------------------------------------
 validation results on WIKITEXT103 | avg loss: 1.3458E+01 | ppl: 6.9909E+05 | adjusted ppl: 2.7160E+06 | token ratio: 1.1008449901248143 |
-------------------------------------------------------------------------------------------------------------------------------------------
done :-)

A loaded model does not seem to run inference properly

Hello,

I fine-tuned a pre-trained BERT with RACE data and tried to inference.

While fine-tuning BERT, the accuracy was about 60-70% on the validation data. But during inference, the accuracy on the same validation data was only about 25%, which is almost random.

When saving/loading the model, I used the 'save_checkpoint' and 'load_checkpoint' functions in megatron/checkpointing.py and called model.eval() for evaluation as well. Also, I referenced "tasks/zeroshot_gpt2/evaluate.py" and modified it for my own purpose.

Is this a bug? Otherwise, it would be good if there were any example code or documentation regarding this issue.

Regards,

Data preprocessing Readme instructions fail

I put example data into a ./data/data.json file:

{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}

And run suggested command:

```
python tools/preprocess_data.py \
       --input ./data/data.json \
       --output-prefix wtever \
       --vocab bert-vocab.txt \
       --dataset-impl mmap \
       --tokenizer-type BertWordPieceLowerCase \
       --split-sentences
```

Which first results in:

NameError: name 'nltk' is not defined

After installing nltk, running the same script results in:

FileNotFoundError: [Errno 2] No such file or directory: 'bert-vocab.txt'

Is there a description of vocab.txt format and content? Can you provide an example vocab file for the example data?
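For the format question: the WordPiece vocab is a plain-text file with one token per line, where the line index is the token id; in practice you would point --vocab at a released BERT vocab file rather than writing your own. A toy sketch just to illustrate the layout (the file name and token list are made up):

```python
# Toy illustration of the WordPiece vocab layout only (plain text, one token per
# line, token id = line index); a real run should use a released BERT vocab,
# e.g. a bert-large uncased vocab, not this made-up list.
tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
          "the", "quick", "brown", "fox", "jump", "##s", "over", "lazy", "dog"]
with open("bert-vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(tokens) + "\n")
```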

No module named 'apex'

I get this when running generate_text.sh

Traceback (most recent call last):
  File "generate_samples.py", line 28, in <module>
    from utils import Timers
  File "/content/Megatron-LM/utils.py", line 25, in <module>
    from fp16 import FP16_Optimizer
  File "/content/Megatron-LM/fp16/__init__.py", line 15, in <module>
    from .fp16util import (
  File "/content/Megatron-LM/fp16/fp16util.py", line 21, in <module>
    import mpu
  File "/content/Megatron-LM/mpu/__init__.py", line 35, in <module>
    from .layers import ColumnParallelLinear
  File "/content/Megatron-LM/mpu/layers.py", line 28, in <module>
    from apex.normalization.fused_layer_norm import FusedLayerNorm as LayerNorm
ModuleNotFoundError: No module named 'apex'

Incorporating Megatron-LM with DeepSpeed

Microsoft incorporated Megatron-LM with their DeepSpeed project in the DeepSpeedExamples repository. The combination of the two projects offers increased speed and lower memory requirements compared to the standalone Megatron-LM model. However, the version of Megatron-LM used in DeepSpeedExamples dates to February 2020 and as such lacks the latest updates to your product. Have you considered supporting an up-to-date version of Megatron-LM integrated with the DeepSpeed project?

Possible solution for using torch.multiprocessing.spawn

I am using torch.multiprocessing.spawn for another large classification problem and apply the mpu functions to realize training with world size 8, data parallel size 4, and model parallel size 2. However, when I look at the data loader part, I find that you only load data when mpu.get_model_parallel_rank() == 0. What about the other GPUs where mpu.get_model_parallel_rank() != 0; how do they behave during training? The iteration loop in the code is the same for all GPUs, which is what confuses me.
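For context, a minimal sketch of the usual pattern (not Megatron's exact code) for how the non-zero model-parallel ranks get their data: rank 0 of each model-parallel group reads the batch and broadcasts it to the other ranks in that group, so every rank iterates over identical batches even though only one of them touches the data loader. `mp_group`, `mp_src_rank`, and the tensor shapes below are placeholders.

```python
import torch
import torch.distributed as dist

BATCH_SIZE, SEQ_LEN = 8, 1024   # assumed shapes, for illustration only

def get_batch(data_iterator, mp_group, mp_src_rank):
    """Only the source rank of the model-parallel group reads real data;
    everyone else receives the same batch via broadcast."""
    if dist.get_rank() == mp_src_rank:
        tokens = next(data_iterator)["text"].cuda()
    else:
        tokens = torch.empty(BATCH_SIZE, SEQ_LEN, dtype=torch.long, device="cuda")
    dist.broadcast(tokens, mp_src_rank, group=mp_group)
    return tokens
```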

RuntimeError when running Megatron-LM with recompute flag turned off

Hi Megatron team,

I am trying to evaluate the new Megatron-LM implementation on language-modeling tasks, and I was running the example script you provided here. I also want to use the GPT-2 modeling code for running inference on a different model, for which I need to turn off the recompute flag. However, after removing this flag from the example script, I ran into a runtime error due to a wrong matrix size. Here is the trace log of my test:

Traceback (most recent call last):
File "tools/generate_samples_gpt2.py", line 127, in
main()
File "tools/generate_samples_gpt2.py", line 122, in main
generate_and_write_samples_unconditional(model)
File "/home/reyazda/.local/lib/python3.6/site-packages/megatron/text_generation_utils.py", line 275, in generate_and_write_samples_unconditional
for datum in generate_samples_unconditional(model):
File "/home/reyazda/.local/lib/python3.6/site-packages/megatron/text_generation_utils.py", line 248, in generate_samples_unconditional
copy.deepcopy(context_tokens)):
File "/home/reyazda/.local/lib/python3.6/site-packages/megatron/text_generation_utils.py", line 314, in get_token_stream
for tokens, lengths in batch_token_iterator:
File "/home/reyazda/.local/lib/python3.6/site-packages/megatron/text_generation_utils.py", line 381, in sample_sequence_batch
forward_method_parallel_output=False)
File "/home/reyazda/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/reyazda/.local/lib/python3.6/site-packages/megatron/model/distributed.py", line 76, in forward
return self.module(*inputs, **kwargs)
File "/home/reyazda/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(input, **kwargs)
File "/home/reyazda/.local/lib/python3.6/site-packages/megatron/fp16/fp16.py", line 74, in forward
return fp16_to_fp32(self.module(
(fp32_to_fp16(inputs)), **kwargs))
File "/home/reyazda/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/reyazda/.local/lib/python3.6/site-packages/megatron/model/gpt2_model.py", line 63, in forward
get_key_value=get_key_value)
File "/home/reyazda/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/reyazda/.local/lib/python3.6/site-packages/megatron/model/language_model.py", line 309, in forward
get_key_value=get_key_value)
File "/home/reyazda/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/reyazda/.local/lib/python3.6/site-packages/megatron/model/transformer.py", line 584, in forward
get_key_value=get_key_value)
File "/home/reyazda/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/reyazda/.local/lib/python3.6/site-packages/megatron/model/transformer.py", line 422, in forward
get_key_value=get_key_value)
File "/home/reyazda/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/reyazda/.local/lib/python3.6/site-packages/megatron/model/transformer.py", line 322, in forward
context_layer = torch.bmm(attention_probs, value_layer.transpose(0,1))
RuntimeError: invalid argument 6: wrong matrix size at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:84

Can someone please tell me the reason for this behaviour and whether this is expected?

Thanks.
Reza

Compatibility with pytorch-transformers for fine-tuning

Hi,

Thanks for the great package! I wanted to check about the compatibility of the trained GPT-2 model/tokenizer with the pytorch-transformers package. Is it possible that, with a few changes, the trained model can be imported using that package, in order to perform additional fine-tuning there with different heads for example? I understand that there are some config files expected by that package, so I'm assuming these can be added. But I'm interested in knowing about the compatibility of the model/tokenizer mainly.

Thanks!

Error when running script pretrain_gpt2_distributed.sh

When I run
OMP_NUM_THREADS=10 bash scripts/pretrain_gpt2_distributed.sh

I got an error

> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
configuring data
Traceback (most recent call last):
  File "pretrain_gpt2.py", line 625, in <module>
    main()
  File "pretrain_gpt2.py", line 569, in main
    args.eod_token = get_train_val_test_data(args)
  File "pretrain_gpt2.py", line 515, in get_train_val_test_data
    args)
  File "/home/ubuntu/Megatron-LM/configure_data.py", line 34, in apply
    return make_loaders(args)
  File "/home/ubuntu/Megatron-LM/configure_data.py", line 170, in make_loaders
    train, tokenizer = data_utils.make_dataset(**data_set_args)
  File "/home/ubuntu/Megatron-LM/data_utils/__init__.py", line 114, in make_dataset
    ds = [GPT2Dataset(d, max_seq_len=seq_length) if d is not None else None for d in ds]
  File "/home/ubuntu/Megatron-LM/data_utils/__init__.py", line 114, in <listcomp>
    ds = [GPT2Dataset(d, max_seq_len=seq_length) if d is not None else None for d in ds]
  File "/home/ubuntu/Megatron-LM/data_utils/datasets.py", line 477, in __init__
    self.init_weighting()
  File "/home/ubuntu/Megatron-LM/data_utils/datasets.py", line 487, in init_weighting
    self.weighting = list(accumulate(lens))
TypeError: iteration over a 0-d array
(The same traceback is repeated, partially interleaved, by each of the remaining worker processes.)
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/Env/ml/lib/python3.6/site-packages/torch/distributed/launch.py", line 246, in <module>
    main()
  File "/home/ubuntu/Env/ml/lib/python3.6/site-packages/torch/distributed/launch.py", line 242, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/Env/ml/bin/python', '-u', 'pretrain_gpt2.py', '--local_rank=7', '--num-layers', '24', '--hidden-size', '1024', '--num-attention-heads', '16', '--batch-size', '8', '--seq-length', '1024', '--max-position-embeddings', '1024', '--train-iters', '320000', '--save', 'checkpoints/gpt2_345m', '--load', 'checkpoints/gpt2_345m', '--resume-dataloader', '--train-data', 'wikipedia', '--lazy-loader', '--tokenizer-type', 'GPT2BPETokenizer', '--cache-dir', 'cache', '--split', '949,50,1', '--distributed-backend', 'nccl', '--lr', '0.00015', '--lr-decay-style', 'cosine', '--weight-decay', '1e-2', '--clip-grad', '1.0', '--warmup', '.01', '--checkpoint-activations', '--fp16']' returned non-zero exit status 1.

Why does this error occur, and how can I fix it?

Malicious domain name in openwebtext URL list

I was following the instructions for preparing openwebtext dataset using instructions here: https://github.com/NVIDIA/Megatron-LM/blob/main/tools/openwebtext/README.md

In the URL list downloaded from the link in Step 1 of "Download the dataset", one of the domain names ("horsefucker.org") is associated with a known C&C server. This caused a security vulnerability on my system. The blacklist_urls list should be updated with this domain name so that it is filtered before the data download begins.

PyTorch 1.2 support?

I am seeing an error with PyTorch 1.2, has this been tested with Megatron?

Traceback (most recent call last):
File "pretrain_gpt2.py", line 752, in
main()
File "pretrain_gpt2.py", line 690, in main
set_random_seed(args.seed)
File "pretrain_gpt2.py", line 621, in set_random_seed
mpu.model_parallel_cuda_manual_seed(seed)
File "/data/users/jerasley/Megatron-LM/mpu/random.py", line 166, in model_parallel_cuda_manual_seed
model_parallel_seed)
File "/data/users/jerasley/Megatron-LM/mpu/random.py", line 99, in add
_set_cuda_rng_state(orig_rng_state)
File "/data/users/jerasley/Megatron-LM/mpu/random.py", line 49, in _set_cuda_rng_state
_lazy_call(cb)
File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 139, in _lazy_call
callable()
File "/data/users/jerasley/Megatron-LM/mpu/random.py", line 47, in cb
_C._cuda_setRNGState(new_state)
AttributeError: module 'torch._C' has no attribute '_cuda_setRNGState'

Is it safe in this case to just replace the _C._cuda_setRNGState call with torch.cuda.set_rng_state?
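A minimal sketch of that substitution, assuming the same callback shape as mpu/random.py; note that torch.cuda.set_rng_state already defers the state update lazily, so the explicit _lazy_call wrapper should not be needed:

```python
import torch

def _set_cuda_rng_state(new_state, device=-1):
    # Use the public API instead of the private torch._C._cuda_setRNGState;
    # torch.cuda.set_rng_state already defers the update lazily, so the
    # explicit _lazy_call wrapper should no longer be needed.
    if device == -1:
        device = torch.cuda.current_device()
    torch.cuda.set_rng_state(new_state, device)
```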

FileExistsError when training with a shared file-system

When training on a multi-node cluster with a shared file-system, I observe the following:

initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
configuring data
Traceback (most recent call last):
File "/fsx/DeepSpeedExamples/Megatron-LM/pretrain_gpt2.py", line 707, in <module>
    main()
File "/fsx/DeepSpeedExamples/Megatron-LM/pretrain_gpt2.py", line 652, in main
    args.eod_token = get_train_val_test_data(args)
  File "/fsx/DeepSpeedExamples/Megatron-LM/pretrain_gpt2.py", line 598, in get_train_val_test_data
    args)
  File "/fsx/DeepSpeedExamples/Megatron-LM/configure_data.py", line 34, in apply
    return make_loaders(args)
  File "/fsx/DeepSpeedExamples/Megatron-LM/configure_data.py", line 170, in make_loaders
    train, tokenizer = data_utils.make_dataset(**data_set_args)
  File "/fsx/DeepSpeedExamples/Megatron-LM/data_utils/__init__.py", line 93, in make_dataset
    datasets = [get_dataset_from_path(p) for p in path]
  File "/fsx/DeepSpeedExamples/Megatron-LM/data_utils/__init__.py", line 93, in <listcomp>
    datasets = [get_dataset_from_path(p) for p in path]
  File "/fsx/DeepSpeedExamples/Megatron-LM/data_utils/__init__.py", line 83, in get_dataset_from_path
    make_lazy(path_, text.X, data_type='data')
  File "/fsx/DeepSpeedExamples/Megatron-LM/data_utils/lazy_loader.py", line 51, in make_lazy
    os.makedirs(lazypath)
  File "/usr/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/fsx/datasets/openwebtext/openwebtext.lazy'

I don't see the same error when training on a single independent node (with its own file-system) though. Do you have a clean fix for this?
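A minimal sketch of the kind of guard that avoids this race on a shared filesystem (the path comes from the traceback above; where exactly to place it in data_utils/lazy_loader.py is left to the maintainers):

```python
import os

# Tolerate another rank having created the directory between the existence
# check and the mkdir call; the path comes from the traceback above.
lazypath = "/fsx/datasets/openwebtext/openwebtext.lazy"
os.makedirs(lazypath, exist_ok=True)
```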

perplexity too big for gpt2 wikitext evaluation

When running the wikitext evaluation of gpt2

python evaluate_gpt2.py 
    --valid-data wikitext-103-v1/wiki.test.tokens 
    --load-openai 
    --hidden-size 768 
    --vocab-size 50257 
    --tokenizer-type GPT2BPETokenizer 
    --max-position-embeddings 1024

the resulting perplexity is 2.9290E+02 -- why is the value so extremely high?

Here is the console output with logging level DEBUG:

Evaluate GPT2 model
WARNING: No training data specified
using world size: 1 and model-parallel size: 1 
 > using dynamic loss scaling
> initializing model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): s3.amazonaws.com:443
DEBUG:urllib3.connectionpool:https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/gpt2-vocab.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): s3.amazonaws.com:443
DEBUG:urllib3.connectionpool:https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/gpt2-merges.txt HTTP/1.1" 200 0
INFO:data_utils.tokenization_gpt2:loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json from cache at /braintree/home/msch/.pytorch_pretrained_bert/f2808208f9bec2320371a9f5f891c184ae0b674ef866b79c58177067d15732dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
INFO:data_utils.tokenization_gpt2:loading merges file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt from cache at /braintree/home/msch/.pytorch_pretrained_bert/d629f792e430b3c76a1291bb2766b0a047e36fae0588f9dbc1ae51decdff691b.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
wikitext
Original Tokens: 270330, Detokenized tokens: 245566
> padded vocab (size: 50257) with 0 dummy tokens (new size: 50257)
global rank: 0 | vocab size: 50257 | eod token: 50256 | num_examples: 8448 | num_original_tokens: 245566 | num_tokenized_tokens: 270330
building GPT2 model ...
 > number of parameters: 209494272
loading openai weights
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): s3.amazonaws.com:443
DEBUG:urllib3.connectionpool:https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/gpt2-pytorch_model.bin HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): s3.amazonaws.com:443
DEBUG:urllib3.connectionpool:https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/gpt2-config.json HTTP/1.1" 200 0
INFO:pytorch_pretrained_bert.modeling_gpt2:loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin from cache at gpt2_weights/4295d67f022061768f4adc386234dbdb781c814c39662dd1662221c309962c55.778cf36f5c4e5d94c8cd9cefcf2a580c8643570eb327f0d4a1f007fab2acbdf1
INFO:pytorch_pretrained_bert.modeling_gpt2:loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json from cache at gpt2_weights/4be02c5697d91738003fb1685c9872f284166aa32e061576bbe6aaeb95649fcf.085d5f6a8e7812ea05ff0e6ed0645ab2e75d80387ad55c1ad9806ee70d272f80
INFO:pytorch_pretrained_bert.modeling_gpt2:Model config {
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "n_positions": 1024,
  "vocab_size": 50257
}

global rank: 0 | max iters: 2112
global rank: 0 | iteration: 0
global rank: 0 | iteration: 100
...
global rank: 0 | iteration: 1900
global rank: 0 | iteration: 2000
global rank: 0 | iteration: 2100
----------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
 validation results on wiki | avg loss: 5.6798E+00 | ppl: 2.9290E+02 | adjusted ppl: 5.1937E+02 | token ratio: 1.1008449901248143 |
------------------------------------------------------------------------------------------------------------------------------------

training in fp32 results error "TypeError: zero_grad() got an unexpected keyword argument 'set_grads_to_None'"

Hi there,
I would like to enable float32 training, so I commented out the --fp16 option in the examples/pretrain_bert_distributed.sh file. At runtime, I got the following errors:

TypeError: zero_grad() got an unexpected keyword argument 'set_grads_to_None'
  File "/home/ubuntu/Megatron-LM/megatron/training.py", line 231, in backward_step
    lr_scheduler)
  File "/home/ubuntu/Megatron-LM/megatron/training.py", line 269, in train_step
    optimizer.zero_grad(set_grads_to_None=True)
TypeError: zero_grad() got an unexpected keyword argument 'set_grads_to_None'
    backward_step(optimizer, model, loss)
  File "/home/ubuntu/Megatron-LM/megatron/training.py", line 231, in backward_step
    optimizer.zero_grad(set_grads_to_None=True)
TypeError: zero_grad() got an unexpected keyword argument 'set_grads_to_None'
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)

I think this specific line of code is causing the error.
Can I safely add a conditional branch that calls zero_grad() without the keyword argument?
e.g.

    # Backward pass.
    if args.fp16:
        # Megatron's FP16_Optimizer understands this keyword.
        optimizer.zero_grad(set_grads_to_None=True)
    else:
        # The plain torch optimizer used for fp32 does not accept it,
        # so call zero_grad() without arguments.
        optimizer.zero_grad()

BERT model encoding error on Python 3.6.8

data_config: {'world_size': 1, 'rank': -1, 'persist_state': 0, 'lazy': False, 'transpose': False, 'data_set_type': 'supervised', 'seq_length': 256, 'eval_seq_length': 256, 'samples_per_shard': 100}
configuring data
Traceback (most recent call last):
  File "pretrain_bert.py", line 490, in <module>
    main()
  File "pretrain_bert.py", line 417, in main
    (train_data, val_data, test_data), tokenizer = data_config.apply(args)
  File "/ssd2/bert1/Megatron-LM/configure_data.py", line 33, in apply
    return make_loaders(args)
  File "/ssd2/bert1/Megatron-LM/configure_data.py", line 166, in make_loaders
    train, tokenizer = data_utils.make_dataset(**data_set_args)
  File "/ssd2/bert1/Megatron-LM/data_utils/__init__.py", line 93, in make_dataset
    datasets = [get_dataset_from_path(p) for p in path]
  File "/ssd2//bert1/Megatron-LM/data_utils/__init__.py", line 93, in <listcomp>
    datasets = [get_dataset_from_path(p) for p in path]
  File "/ssd2/bert1/Megatron-LM/data_utils/__init__.py", line 82, in get_dataset_from_path
    delim=delim, drop_unlabeled=drop_unlabeled, loose_json=loose)
  File "/ssd2/bert1/Megatron-LM/data_utils/__init__.py", line 50, in get_dataset
    text = json_dataset(path, **kwargs)
  File "/ssd2/bert1/Megatron-LM/data_utils/datasets.py", line 327, in __init__
    for j in self.load_json_stream(self.path):
  File "/ssd2/bert1/Megatron-LM/data_utils/datasets.py", line 436, in load_json_stream
    for j in generator:
  File "/ssd2/bert1/Megatron-LM/data_utils/datasets.py", line 432, in gen_helper
    for row in f:
  File "/ssd2/bert1/Megatron-LM/bert_env/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5507: ordinal not in range(128)
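A minimal sketch of the usual fix for this kind of UnicodeDecodeError, assuming the corpus is UTF-8: open the JSON file with an explicit encoding instead of relying on the locale default (ASCII here). The file name is a placeholder.

```python
import json

# Open the corpus with an explicit UTF-8 encoding instead of relying on the
# locale default (ASCII here). 'my-corpus.json' is a placeholder path.
with open("my-corpus.json", "r", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
```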

1: training loss decreases too fast. 2: learning rate did not change after the warmup iterations, staying at 1.5e-4. Is this normal?

Thank you very much! I have encountered some questions.
1: The training loss decreases too fast.
2: The learning rate did not change after the warmup iterations; it always stayed at 1.5e-4.
Is this normal?

In total, a 10 GB Chinese corpus with about 3,000,000 samples.
python3 -m torch.distributed.launch \
--nnodes 1 \
--nproc_per_node 2 \
pretrain_gpt2.py \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--max-position-embeddings 1024 \
--seq-length 1024 \
--batch-size 8 \
--train-iters 1000000 \
--save-interval 1000 \
--save checkpoints/gpt2_345m_hm10g \
--load checkpoints/gpt2_345m_hm10g \
--tensorboard-dir logs/gpt2_345m_hm10g \
--resume-dataloader \
--train-data corpus_data \
--lazy-loader \
--tokenizer-type SentencePieceTokenizer \
--tokenizer-path data/spm/corpus_bpe_32k.model \
--cache-dir cache \
--split 949,50,1 \
--distributed-backend nccl \
--lr 0.00015 \
--lr-decay-style cosine \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--warmup .01 \
--checkpoint-activations \
--fp16

(TensorBoard scalar tags: a_t, b_g_t, b_t, d_l_t, f_t, i_t, l_s, lr, o_t, t_l, v_l, v_ppl)
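On the learning-rate question, a generic sanity check you can plot (this is the usual linear-warmup plus cosine-decay shape, not Megatron's exact scheduler code): with --train-iters 1000000 and --warmup .01, warmup ends around step 10,000 and the cosine decay is spread over roughly the remaining 990,000 steps, so for tens of thousands of steps after warmup the LR is still ~1.5e-4 to within a fraction of a percent, which TensorBoard will display as a flat line.

```python
import math

def cosine_lr(step, max_lr=1.5e-4, min_lr=0.0, warmup_steps=10_000, total_steps=1_000_000):
    """Generic linear-warmup + cosine-decay schedule (illustrative only)."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(20_000))   # ~1.4996e-4: indistinguishable from 1.5e-4 on a plot
```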

merge_mp_partitions.py fails with an exception

When I run tools/merge_mp_partitions.py, it fails with an exception:

Traceback (most recent call last):
  File "merge_mp_partitions.py", line 286, in <module>
    main()
  File "merge_mp_partitions.py", line 212, in main
    merged_model = get_model(model_type)
  File "merge_mp_partitions.py", line 125, in get_model
    model = model_provider()
  File "/data/gcooper/nlg-evaluation/Megatron-LM/pretrain_gpt2.py", line 35, in model_provider
    model = GPT2Model(num_tokentypes=0, parallel_output=True)
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/model/gpt2_model.py", line 51, in __init__
    args.num_layers))
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/model/language_model.py", line 62, in get_language_model
    add_pooler=add_pooler)
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/model/language_model.py", line 283, in __init__
    self.num_tokentypes)
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/model/language_model.py", line 123, in __init__
    vocab_size, self.hidden_size, init_method=self.init_method)
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/mpu/layers.py", line 145, in __init__
    partition_dim=0, stride=1)
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/mpu/layers.py", line 58, in _initialize_affine_weight_gpu
    with get_cuda_rng_tracker().fork():
  File "/opt/conda/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/mpu/random.py", line 183, in fork
    raise Exception('cuda rng state {} is not added'.format(name))
Exception: cuda rng state model-parallel-rng is not added

When training, the RNG state gets set in initialize_megatron(), but that is not called in this case.
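For what it's worth, a sketch of a workaround under that assumption (not necessarily the right upstream fix): seed the model-parallel CUDA RNG tracker before building the merged model, mirroring what initialize_megatron() does during training.

```python
# Once torch.distributed and model parallelism have been initialized, seed the
# model-parallel CUDA RNG tracker before building the merged model, mirroring
# what initialize_megatron() does during training. The seed value is arbitrary
# here, since the weights are overwritten by the checkpoint partitions anyway.
from megatron import mpu

mpu.model_parallel_cuda_manual_seed(1234)
merged_model = get_model(model_type)   # get_model / model_type as in merge_mp_partitions.py
```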

Collecting Wikipedia Training Data issues

Hi!
I'm dealing with the work of your library and I have a misunderstanding of the point Collecting Wikipedia Training Data.
I run command
python WikiExtractor.py --json enwiki-latest-pages-articles.xml.bz2
and expected get one json file as output ( do I understand correctly that I need an available .json file to start learning the model ?), but got a folder with many inner folders like AA, AB, AC ..

Could you explain what I'm doing wrong and how I can start training a model?

dropout should be wrapped in `get_cuda_rng_tracker`

I think there are many places where the dropout call is not under the scope of get_cuda_rng_tracker.
Just curious: is that intentional, or did it get left out by mistake? Because, if I understand correctly, every dropout call should be wrapped in that scope.

Examples:
https://github.com/NVIDIA/Megatron-LM/blob/master/mpu/transformer.py#L155
https://github.com/NVIDIA/Megatron-LM/blob/master/mpu/transformer.py#L217
https://github.com/NVIDIA/Megatron-LM/blob/master/mpu/transformer.py#L560

etc
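For illustration, a minimal sketch of the wrapped pattern, using the same tracker that the linked files fork for other random ops; the helper name is made up:

```python
import torch
from mpu import get_cuda_rng_tracker   # same tracker the linked mpu/ files use

def dropout_with_tracker(x, p, training=True):
    # Draw the dropout mask under the model-parallel RNG tracker so each
    # model-parallel rank gets an independent but reproducible mask.
    with get_cuda_rng_tracker().fork():
        return torch.nn.functional.dropout(x, p=p, training=training)
```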

Calculate the TFLOPS performance with elapsed time per iteration

Hi!
I'm trying to study Megatron's model parallelism, and I was able to run the GPT-2 model on a V100 GPU using the PyTorch Docker container.
However, the benchmark log only prints out the elapsed time for every 100 iterations.
My question is: how can I calculate the TFLOPS performance from the elapsed time per iteration?
It would be helpful if you could shed some light on the formula for calculating TFLOPS/PFLOPS from the model configuration (hidden size, attention heads, and number of layers).

I'm now getting 528.9 ms per iteration on a single GPU, and I'm hoping that matches the single-GPU baseline performance mentioned in the paper (39 TeraFLOPS).
The following is the model I've been using:

| Config | Hidden size | Attention heads | Number of layers | Number of parameters (billion) | Model parallel GPUs |
|--------|-------------|-----------------|------------------|--------------------------------|---------------------|
| 1      | 1536        | 16              | 40               | 1.2                            | 1                   |

Thanks!
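Not an authoritative answer, but a back-of-the-envelope sketch of the kind of calculation usually used (roughly 6 FLOPs per parameter per token for forward plus backward; activation checkpointing adds roughly another forward pass). The batch size and sequence length below are placeholders, since they are not given in the issue:

```python
# Roughly 6 FLOPs per parameter per token for forward+backward; activation
# checkpointing (--checkpoint-activations) adds roughly one more forward pass.
n_params = 1.2e9        # from the config table above
batch_size = 8          # placeholder: not given in the issue
seq_len = 1024          # placeholder: not given in the issue
iter_time_s = 0.5289    # 528.9 ms per iteration, from the issue

tokens_per_iter = batch_size * seq_len
flops_per_iter = 6 * n_params * tokens_per_iter
print(f"~{flops_per_iter / iter_time_s / 1e12:.1f} TFLOPS with the assumed batch/seq values")
```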

Fully half precision optimizer

Hi,

Has anyone implemented or tried a fully half-precision optimizer with Megatron? I see from the GPT-3 paper:

... implemented an early version of the codebase, and developed the memory optimizations for fully
half-precision training.

It looks like Megatron FP16_Optimizer is still using mixed precision. Has anyone looked into this before? This would allow training these big models with much less memory ... Thanks!

Unintended error caused by compiling fused_kernels

https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/arguments.py#L186-L198
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/fused_kernels/__init__.py#L46-L72

When I tried to train GPT-3 on multiple nodes using torch.distributed.launch, the training process sometimes got stuck while compiling the fused_kernels.
This bug can occur due to a timing issue when multiple processes compile concurrently.
The simplest workaround is to remove ./fused_kernels/build and run the script again, but I don't think that solves the fundamental problem.

In my case, I resolved this issue by using torch.distributed.barrier and letting only the master rank (rank == 0) compile the fused_kernels.
If the authors think resolving this issue is necessary for the codebase, I will open a PR :)
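A minimal sketch of that pattern, with compile_fn standing in for whatever call triggers the JIT build of megatron/fused_kernels:

```python
import torch.distributed as dist

def build_fused_kernels_once(compile_fn):
    """compile_fn is a placeholder for whatever call triggers the JIT build of
    megatron/fused_kernels."""
    if dist.get_rank() == 0:
        compile_fn()        # only the master rank actually compiles
    dist.barrier()          # everyone else waits for the build to finish
    if dist.get_rank() != 0:
        compile_fn()        # now just loads the cached ./fused_kernels/build
```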

Pretrain and generate

If I run with
python -m torch.distributed.launch --nproc_per_node 16 pretrain_gpt2.py --model_parallel_size==16
and afterwards run generation:
python generate_samples.py
I get an error during initialization: size mismatch for transformer.layers.15.attention.dense.weight: copying a param with shape torch.Size([1024, 64]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
Can I load the model on one GPU after training it distributed on 16 GPUs with --model_parallel_size==16?
Thank you!

BERT Loss not decreasing

I trained the BERT_BASE model on 16 V100 GPUs using the English Wikipedia dataset. I found that the NSP loss decreases from 0.7 to 0.3 and the MLM loss decreases from 10.0 to 6.8. During training, I use create_pretraining_data.py from google-research/bert to pre-create BERT pretraining examples, since with the default setting that creates training samples inside the dataset the speed is quite slow, and it sometimes takes hours to train 100 iterations. With the pre-created pretraining data and the lazy dataloader, the training speed is normal, but the loss is a big problem for me. I pretrained the BERT_BASE model using the scheme from the BERT paper (batch size 256, per-GPU batch size 16, 900,000 steps with sequence length 128 and 100,000 steps with sequence length 512), then ran the SQuAD test, and the F1 score is quite low.

Could you please offer some help. Thanks!

Save checkpoint error with model parallel size > 1

There seems to be a bug in the save_checkpoint function.
Assume a data-parallel rank with two model-parallel ranks, noted as M1 and M2. Both M1 and M2 finish checking that the folder does not exist; then M1 creates a new one, but M2 also tries to create the same folder, which raises a FileExistsError.
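A minimal sketch of one possible guard (the path is a placeholder): let makedirs tolerate the directory already having been created by the other model-parallel rank.

```python
import os

# Let makedirs tolerate the directory already having been created by the other
# model-parallel rank; the path below is only a placeholder.
checkpoint_dir = "checkpoints/gpt2_345m/iter_0010000/mp_rank_01"
os.makedirs(checkpoint_dir, exist_ok=True)
```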

How to train on other corpora besides web/wiki text

For webtext, you mention the following step:
Merge the contents into one loose json file with 1 json per newline of the format {'text': text, 'url': unique_url}. It is important for the url to be unique.

What would be the best way to train on a different text corpus that exists as a single file without topic/page separators (unlike Wikipedia or WebText)? I am splitting such a file manually into N (say 1000) parts and creating a loose JSON file with 1000 JSON objects, one JSON per newline. This approach works for distributed training too.
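For concreteness, a sketch of that splitting approach under the stated assumptions (file names are placeholders): chop the single corpus file into N chunks and write them as loose JSON, one object per line, each with a unique 'url'.

```python
import json

# Split one large plain-text corpus into N chunks and write them as loose JSON,
# one object per line, each with a unique 'url'. File names are placeholders.
N = 1000
with open("corpus.txt", encoding="utf-8") as f:
    text = f.read()
chunk = max(1, len(text) // N)
with open("corpus_loose.json", "w", encoding="utf-8") as out:
    for i in range(0, len(text), chunk):
        out.write(json.dumps({"text": text[i:i + chunk],
                              "url": f"mycorpus-{i // chunk}"}) + "\n")
```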

However, if I feed the new corpus as a single JSON object {'text': <entire new corpus>, 'url': <some id>}, it fails with the following message:

> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
configuring data
Traceback (most recent call last):
  File "pretrain_gpt2.py", line 681, in <module>
    main()
  File "pretrain_gpt2.py", line 620, in main
    args.eod_token = get_train_val_test_data(args)
  File "pretrain_gpt2.py", line 550, in get_train_val_test_data
    args)
  File "Megatron-LM/configure_data.py", line 34, in apply
    return make_loaders(args)
  File "Megatron-LM/configure_data.py", line 171, in make_loaders
    train, tokenizer = data_utils.make_dataset(**data_set_args)
  File "Megatron-LM/data_utils/__init__.py", line 126, in make_dataset
    ds = [GPT2Dataset(d, max_seq_len=seq_length) if d is not None else None for d in ds]
  File "Megatron-LM/data_utils/__init__.py", line 126, in <listcomp>
    ds = [GPT2Dataset(d, max_seq_len=seq_length) if d is not None else None for d in ds]
  File "Megatron-LM/data_utils/datasets.py", line 479, in __init__
    self.init_weighting()
  File "Megatron-LM/data_utils/datasets.py", line 490, in init_weighting
    self.weighting = list(accumulate(lens))
TypeError: iteration over a 0-d array

I'm concerned if this manual splitting impacts the sampling/batching in any way.

I'm changing the PATH variable in Megatron-LM/data_utils/corpora.py inside the wikipedia/webtext class to use this new corpus.

Huggingface <-> Megatron-LM Compatibility

Looking for a way to convert model weights between huggingface and Megatron-LM.
(1): Continual pretraining from pretrained weights from huggingface
(2): Convert Megatron-LM model weights to huggingface

It shouldn't be too difficult to adjust layer names/weights, but I'm hoping someone has already done this.

Related #3 (already closed but couldn't find the solution)

GPT-2 generation samples error

There are 5 parameters in the get_masks_and_position_ids definition:
https://github.com/NVIDIA/Megatron-LM/blob/master/pretrain_gpt2.py#L162

def get_masks_and_position_ids(data,
                               eod_token,
                               reset_position_ids,
                               reset_attention_mask,
                               eod_mask_loss):

But there are only 4 in the function call:
https://github.com/NVIDIA/Megatron-LM/blob/master/generate_samples.py#L95

    attention_mask, loss_mask, position_ids = get_masks_and_position_ids(
        tokens,
        args.eod_token,
        args.reset_position_ids,
        args.reset_attention_mask)
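A minimal sketch of what the corrected call would presumably look like, passing the missing fifth argument (args.eod_mask_loss appears in the argument listings earlier on this page); this is an assumption about the intended fix rather than a confirmed patch:

```python
attention_mask, loss_mask, position_ids = get_masks_and_position_ids(
    tokens,
    args.eod_token,
    args.reset_position_ids,
    args.reset_attention_mask,
    args.eod_mask_loss)
```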

evaluation loss < training loss

I have pre-trained Megatron's BERT large and base with different batch sizes, and it always seems to be the case that the training loss is about .2 higher than the validation loss. Is this the behavior you observed and if yes, what is causing it?

I do not know how to solve it.

When I run
bash scripts/pretrain_bert.sh
I encounter the following error:

Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 528, in main
    args.tokenizer_num_type_tokens = get_train_val_test_data(args)
  File "pretrain_bert.py", line 475, in get_train_val_test_data
    (train_data, val_data, test_data), tokenizer = data_config.apply(args)
  File "/home/z00487393/Documents/Scripts/TensorFlow/Megatron/Megatron-LM-master/configure_data.py", line 34, in apply
    return make_loaders(args)
  File "/home/z00487393/Documents/Scripts/TensorFlow/Megatron/Megatron-LM-master/configure_data.py", line 170, in make_loaders
    train, tokenizer = data_utils.make_dataset(**data_set_args)
  File "/home/z00487393/Documents/Scripts/TensorFlow/Megatron/Megatron-LM-master/data_utils/__init__.py", line 101, in make_dataset
    pad_token, character_converage, **kwargs)
  File "/home/z00487393/Documents/Scripts/TensorFlow/Megatron/Megatron-LM-master/data_utils/tokenization.py", line 39, in make_tokenizer
    return BertWordPieceTokenizer(model_type, **kwargs)
  File "/home/z00487393/Documents/Scripts/TensorFlow/Megatron/Megatron-LM-master/data_utils/tokenization.py", line 703, in __init__
    self.text_tokenizer.max_len = int(1e12)
AttributeError: 'NoneType' object has no attribute 'max_len'

I cannot find the solution.
If someone can help me, thanks a lot.
