microsoft / DeepSpeedExamples
Example models using DeepSpeed
License: Apache License 2.0
I always get the error "RuntimeError: CUDA error: out of memory". Can you give a suggestion? Thanks
Hi, newbie here. Reading the API, I've noticed that the memory and speed improvements come from ZeRO. I also read that ZeRO only works with FP16, and that the GTX 1080 Ti has very low FP16 throughput, so using it is very inefficient. Would DeepSpeed provide a significant memory or speed improvement on a single GTX 1080 Ti? I'm using Transformers to model sequential data, with moderate sequence lengths (512-1024) and small batch sizes (4-6). Thanks.
We will need to update the example models to work with the new Apex APIs once the Apex pointer in the DeepSpeed repo is updated.
The documentation mentions that a TF, PT, or HF checkpoint will work for the Bing BERT fine-tuning example, but if I run the code as is with any of those checkpoints, it still runs this section of the code,
which ends up saying
"Unable to find model state in checkpoint"
Even if I try to insert an arg
args.ckpt_type = "HF"
it still tries to run that section of code somehow.
So it looks like you need a DeepSpeed checkpoint; otherwise you get a
ValueError: Should NOT use --preln if the loading checkpoint doesn't use pre-layer-norm.
error.
I also tried altering the code so it uses convert_ckpt_to_deepspeed, but then I run into
ValueError: Invalid ckpt_type.
So somehow that function is not able to use HF checkpoints.
Hi
I need help with using DeepSpeed for the Transformers zero-shot classification pipeline.
My dataset has 500K sentences and 52 labels.
I tried editing the GPT generation example, but I am not sure if I did it right.
import os
import torch
import deepspeed
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# build the zero-shot classification pipeline on the GPU
classifier = pipeline("zero-shot-classification",
                      model="typeform/distilbert-base-uncased-mnli",
                      device=0)

# wrap the underlying model with DeepSpeed inference
classifier.model = deepspeed.init_inference(classifier.model,
                                            mp_size=world_size,
                                            dtype=torch.float,
                                            replace_method='auto')

# prod_name_lst and tag_values are my own data: the sentences and the 52 candidate labels
res = classifier(prod_name_lst[:1000], tag_values)
if torch.distributed.get_rank() == 0:
    print(res)
Kindly help with an example.
Thanks,
Subham
The ZeRO 3 example does not run. The main problem appears to be that the InitContext function does not actually exist, despite being called by pretrain_gpt2.py. I tried to make some changes to get it to run (including changing the batch size, the initialization function, and some of the inputs to the initialization function), but gave up after it threw the error "variable beta1 is referenced before assignment". I think that has to do with something wonky in the optimizer?
In ds_pretrain_gpt2.sh we have
#config_json="$script_dir/ds_zero_stage_2_config.json"
config_json="$script_dir/ds_config.json"
Why is ZeRO 2 disabled when the tutorial for extremely efficient training says ZeRO 2 is 5x faster than ZeRO 1?
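For anyone else hitting this: a minimal ZeRO stage 2 config only needs a zero_optimization block in that JSON. The sketch below is written as a Python dict purely for illustration and uses made-up values, not values from this repo:

# Illustrative sketch of a ZeRO stage 2 DeepSpeed config, expressed as a Python dict;
# in the repo this would live in ds_zero_stage_2_config.json. All values are examples.
ds_zero_stage_2_config = {
    "train_batch_size": 256,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},          # ZeRO in these examples is used together with fp16
    "zero_optimization": {
        "stage": 2,                      # stage 2 partitions optimizer states and gradients
        "allgather_partitions": True,
        "reduce_scatter": True,
        "overlap_comm": True,
    },
}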
I ran the script under the DeepSpeed/DeepSpeedExamples/bing_bert directory with the following command:
sh ds_train_bert_nvidia_data_bsz64k_seq128.sh
I came across the following bug.
[2021-01-21 15:51:32,141] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
VOCAB SIZE: 30528
Traceback (most recent call last):
File "/home/jiaruifang/codes/DeepSpeed/DeepSpeedExamples/bing_bert/deepspeed_train.py", line 532, in
main()
File "/home/jiaruifang/codes/DeepSpeed/DeepSpeedExamples/bing_bert/deepspeed_train.py", line 521, in main
model, optimizer = prepare_model_optimizer(args)
File "/home/jiaruifang/codes/DeepSpeed/DeepSpeedExamples/bing_bert/deepspeed_train.py", line 396, in prepare_model_optimizer
model = BertMultiTask(args)
File "/home/jiaruifang/codes/DeepSpeed/DeepSpeedExamples/bing_bert/turing/models.py", line 123, in init
self.network = BertForPreTrainingPreLN(bert_config, args)
File "/home/jiaruifang/codes/DeepSpeed/DeepSpeedExamples/bing_bert/nvidia/modelingpreln.py", line 1117, in init
self.bert = BertModel(config, args)
File "/home/jiaruifang/codes/DeepSpeed/DeepSpeedExamples/bing_bert/nvidia/modelingpreln.py", line 1002, in init
config, args, sparse_attention_config=self.sparse_attention_config)
File "/home/jiaruifang/codes/DeepSpeed/DeepSpeedExamples/bing_bert/nvidia/modelingpreln.py", line 596, in init
for i in range(config.num_hidden_layers)
File "/home/jiaruifang/codes/DeepSpeed/DeepSpeedExamples/bing_bert/nvidia/modelingpreln.py", line 596, in
for i in range(config.num_hidden_layers)
File "/home/jiaruifang/anaconda3/lib/python3.7/site-packages/deepspeed/ops/transformer/transformer.py", line 487, in init
self.config.layer_id = DeepSpeedTransformerLayer.layer_id
AttributeError: 'int' object has no attribute 'layer_id'
I was trying to run Megatron with ZeRO 2 config when I encountered this error
> finished creating GPT2 datasets ...
setting training data start iteration to 0
setting validation data start iteration to 0
done with setups ...
time (ms) | model and optimizer: 1894.21 | train/valid/test data iterators: 357.88
training ...
Traceback (most recent call last):
File "pretrain_gpt2.py", line 156, in <module>
pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
File "/root/megatron-3d/megatron/training.py", line 97, in pretrain
iteration = train(forward_step_func,
File "/root/megatron-3d/megatron/training.py", line 481, in train
loss_dict, skipped_iter = train_step(forward_step_func,
File "/root/megatron-3d/megatron/training.py", line 324, in train_step
return train_step_pipe(model, data_iterator)
File "/root/megatron-3d/megatron/training.py", line 358, in train_step_pipe
loss = model.train_batch(data_iter=data_iterator)
File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 273, in train_batch
self._exec_schedule(sched)
File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 1162, in _exec_schedule
self._exec_instr(**cmd.kwargs)
File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 621, in _exec_load_micro_batch
batch = self._next_batch()
File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 480, in _next_batch
return self._next_batch()
File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 480, in _next_batch
return self._next_batch()
File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 480, in _next_batch
return self._next_batch()
[Previous line repeated 978 more times]
File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 469, in _next_batch
batch = self.batch_fn(batch)
File "pretrain_gpt2.py", line 110, in get_batch_pipe
return fp32_to_fp16((tokens, position_ids, attention_mask)), fp32_to_fp16((labels, loss_mask))
File "/root/megatron-3d/megatron/fp16/fp16.py", line 53, in fp32_to_fp16
return conversion_helper(val, half_conversion)
File "/root/megatron-3d/megatron/fp16/fp16.py", line 38, in conversion_helper
rtn = [conversion_helper(v, conversion) for v in val]
File "/root/megatron-3d/megatron/fp16/fp16.py", line 38, in <listcomp>
rtn = [conversion_helper(v, conversion) for v in val]
File "/root/megatron-3d/megatron/fp16/fp16.py", line 37, in conversion_helper
return conversion(val)
File "/root/megatron-3d/megatron/fp16/fp16.py", line 48, in half_conversion
if isinstance(val_typecheck, (Parameter, Variable)):
File "/root/anaconda3/lib/python3.8/site-packages/torch/autograd/variable.py", line 7, in __instancecheck__
return isinstance(other, torch.Tensor)
RecursionError: maximum recursion depth exceeded while calling a Python object
This doesn't occur with the following config
{
"train_batch_size": 224,
"train_micro_batch_size_per_gpu": 4,
"steps_per_print": 10,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.00015,
"max_grad_norm": 1.0,
"betas": [0.9, 0.95]
}
},
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"wall_clock_breakdown": true,
"zero_allow_untested_optimizer": false
}
I tried to use DeepSpeed with the examples in DeepSpeedExamples. However, I couldn't find any explanation of the input data, such as $SQUAD_DIR or $MODEL_FILE, in the shell script. Can you give me some details on how to run the code?
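In case it helps while waiting for an answer: my understanding (an assumption, not something documented in the repo) is that $SQUAD_DIR points at a directory holding the standard SQuAD v1.1 files and $MODEL_FILE at a pre-trained BERT checkpoint to fine-tune from. A small sanity-check sketch, with placeholder paths:

# Sketch of what the two shell variables are assumed to point at; the default paths
# below are placeholders, not paths from the repo.
import os

squad_dir = os.environ.get("SQUAD_DIR", "/data/squad")              # holds train-v1.1.json / dev-v1.1.json
model_file = os.environ.get("MODEL_FILE", "/models/bert_base.pt")   # pre-trained BERT checkpoint

for name in ("train-v1.1.json", "dev-v1.1.json"):
    path = os.path.join(squad_dir, name)
    print(path, "exists" if os.path.exists(path) else "MISSING")
print(model_file, "exists" if os.path.exists(model_file) else "MISSING")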
When segregating the parameters for the optimizer, the norm and bias parameters are not included:
Is this handled elsewhere, or was weight decay intentionally applied to all of the DeepSpeed transformer layer parameters?
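For reference, the usual pattern (as in the common BERT fine-tuning examples) is to put bias and norm parameters into a separate group with zero weight decay. A minimal sketch of that grouping, not code from this repo; the name patterns are the common convention and are an assumption here:

# Sketch: split a model's parameters into decay / no-decay groups before building the optimizer.
def group_parameters(model, weight_decay=0.01):
    no_decay = ("bias", "LayerNorm.weight")   # assumed name patterns for norm/bias parameters
    decay_params = [p for n, p in model.named_parameters()
                    if not any(nd in n for nd in no_decay)]
    no_decay_params = [p for n, p in model.named_parameters()
                       if any(nd in n for nd in no_decay)]
    return [
        {"params": decay_params, "weight_decay": weight_decay},
        {"params": no_decay_params, "weight_decay": 0.0},
    ]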
Following this GPT-2 tutorial (https://www.deepspeed.ai/tutorials/megatron/), I modified pretrain_bert to run with DeepSpeed. However, I got this message: RuntimeError: leaf variable has been moved into the graph interior.
Do you have any idea how I can fix the error?
The full error messages are below.
elsa-03-ib0: Traceback (most recent call last):
elsa-03-ib0: File "/home/soojeong/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 617, in
elsa-03-ib0: main()
elsa-03-ib0: File "/home/soojeong/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 595, in main
elsa-03-ib0: timers, args)
elsa-03-ib0: File "/home/soojeong/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 354, in train
elsa-03-ib0: args, timers)
elsa-03-ib0: File "/home/soojeong/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 310, in train_step
elsa-03-ib0: nsp_loss, args, timers)
elsa-03-ib0: File "/home/soojeong/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 255, in backward_step
elsa-03-ib0: model.backward(loss)
elsa-03-ib0: File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/deepspeed/pt/deepspeed_light.py", line 665, in backward
elsa-03-ib0: self.optimizer.backward(loss)
elsa-03-ib0: File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/deepspeed/pt/deepspeed_zero_optimizer.py", line 455, in backward
elsa-03-ib0: self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
elsa-03-ib0: File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/deepspeed/pt/loss_scaler.py", line 174, in backward
elsa-03-ib0: scaled_loss.backward(retain_graph=retain_graph)
elsa-03-ib0: File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
elsa-03-ib0: torch.autograd.backward(self, gradient, retain_graph, create_graph)
elsa-03-ib0: File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/autograd/init.py", line 93, in backward
elsa-03-ib0: allow_unreachable=True) # allow_unreachable flag
elsa-03-ib0: RuntimeError: leaf variable has been moved into the graph interior
Hi,
Thank you for providing DeepSpeed library. For BingBertSquad example, there is no pre-trained (deepspeed compatible) Bert model. Is there any plan to release it?
Collecting the datasets needed for pretraining is a bit of work, especially when downloading from lots of different URLs behind a firewall.
I see that some version of these seem to be available in HuggingFace datasets repo, like openwebtext.
https://huggingface.co/datasets/openwebtext
For the above, it's especially nice since @stas00 has a small subset one can use for testing:
https://huggingface.co/datasets/stas/openwebtext-10k
It's pretty straight-forward to extend the preprocessing script to use the HF datasets as a source rather than a json file. Would something like that be acceptable as a PR?
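To make the idea concrete, here is a rough sketch of the export side, using the small openwebtext-10k subset linked above; the output mimics the loose-JSON format (one JSON object per line) that the existing preprocessing script already reads. The file names and the "text" key are assumptions on my part:

# Rough sketch: dump an HF dataset to loose JSON so the existing Megatron
# preprocessing script can consume it unchanged.
import json
from datasets import load_dataset

ds = load_dataset("stas/openwebtext-10k", split="train")
with open("openwebtext-10k.json", "w", encoding="utf-8") as f:
    for sample in ds:
        f.write(json.dumps({"text": sample["text"]}, ensure_ascii=False) + "\n")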
I was trying to run the code with the following command
bash scripts/ds_zero2_pretrain_gpt2_model_parallel.sh
and I got the error below.
deepspeed --num_nodes 1 --num_gpus 4 pretrain_gpt2.py --model-parallel-size 4 --num-layers 24 --hidden-size 1024 --num-attention-heads 16 --batch-size 8 --seq-length 1024 --max-position-embeddings 1024 --train-iters 100000 --resume-dataloader --train-data wikipedia --lazy-loader --tokenizer-type GPT2BPETokenizer --split 949,50,1 --distributed-backend nccl --lr 0.00015 --no-load-optim --lr-decay-style cosine --weight-decay 1e-2 --clip-grad 1.0 --warmup .01 --checkpoint-activations --deepspeed-activation-checkpointing --fp16 --deepspeed --deepspeed_config /home/sdl/DeepSpeedExamples/Megatron-LM/scripts/ds_zero2_config.json
[2021-02-14 15:50:02,533] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-02-14 15:50:02,574] [INFO] [runner.py:355:main] cmd = /home/sdl/anaconda3/envs/deepspeed/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 pretrain_gpt2.py --model-parallel-size 4 --num-layers 24 --hidden-size 1024 --num-attention-heads 16 --batch-size 8 --seq-length 1024 --max-position-embeddings 1024 --train-iters 100000 --resume-dataloader --train-data wikipedia --lazy-loader --tokenizer-type GPT2BPETokenizer --split 949,50,1 --distributed-backend nccl --lr 0.00015 --no-load-optim --lr-decay-style cosine --weight-decay 1e-2 --clip-grad 1.0 --warmup .01 --checkpoint-activations --deepspeed-activation-checkpointing --fp16 --deepspeed --deepspeed_config /home/sdl/DeepSpeedExamples/Megatron-LM/scripts/ds_zero2_config.json
[2021-02-14 15:50:04,024] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2021-02-14 15:50:04,024] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=4, node_rank=0
[2021-02-14 15:50:04,024] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2021-02-14 15:50:04,024] [INFO] [launch.py:100:main] dist_world_size=4
[2021-02-14 15:50:04,024] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2021-02-14 15:50:06,570] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
using world size: 4 and model-parallel size: 4
> using dynamic loss scaling
[2021-02-14 15:50:06,582] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
[2021-02-14 15:50:06,582] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
[2021-02-14 15:50:06,583] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
> initializing model parallel with size 4
[2021-02-14 15:50:17,741] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': True, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2021-02-14 15:50:17,742] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': True, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
Pretrain GPT2 model
arguments:
pretrained_bert .............. False
attention_dropout ............ 0.1
num_attention_heads .......... 16
hidden_size .................. 1024
intermediate_size ............ None
num_layers ................... 24
layernorm_epsilon ............ 1e-05
hidden_dropout ............... 0.1
max_position_embeddings ...... 1024
vocab_size ................... 30522
deep_init .................... False
make_vocab_size_divisible_by . 128
cpu_optimizer ................ False
cpu_torch_adam ............... False
fp16 ......................... True
fp32_embedding ............... False
fp32_layernorm ............... False
fp32_tokentypes .............. False
fp32_allreduce ............... False
hysteresis ................... 2
loss_scale ................... None
loss_scale_window ............ 1000
min_scale .................... 1
batch_size ................... 8
weight_decay ................. 0.01
checkpoint_activations ....... True
checkpoint_num_layers ........ 1
deepspeed_activation_checkpointing True
clip_grad .................... 1.0
train_iters .................. 100000
log_interval ................. 100
exit_interval ................ None
seed ......................... 1234
reset_position_ids ........... False
reset_attention_mask ......... False
lr_decay_iters ............... None
lr_decay_style ............... cosine
lr ........................... 0.00015
warmup ....................... 0.01
[2021-02-14 15:50:17,742] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': True, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
save ......................... None
save_interval ................ 5000
no_save_optim ................ False
no_save_rng .................. False
load ......................... None
no_load_optim ................ True
no_load_rng .................. False
finetune ..................... False
resume_dataloader ............ True
distributed_backend .......... nccl
local_rank ................... 0
eval_batch_size .............. None
eval_iters ................... 100
eval_interval ................ 1000
eval_seq_length .............. None
eval_max_preds_per_seq ....... None
overlapping_eval ............. 32
cloze_eval ................... False
eval_hf ...................... False
load_openai .................. False
temperature .................. 1.0
top_p ........................ 0.0
top_k ........................ 0
out_seq_length ............... 256
model_parallel_size .......... 4
shuffle ...................... False
train_data ................... ['wikipedia']
use_npy_data_loader .......... False
train_data_path ..............
val_data_path ................
test_data_path ...............
input_data_sizes_file ........ sizes.txt
delim ........................ ,
text_key ..................... sentence
eval_text_key ................ None
valid_data ................... None
split ........................ 949,50,1
test_data .................... None
lazy_loader .................. True
loose_json ................... False
presplit_sentences ........... False
num_workers .................. 2
tokenizer_model_type ......... bert-large-uncased
tokenizer_path ............... tokenizer.model
tokenizer_type ............... GPT2BPETokenizer
cache_dir .................... None
use_tfrecords ................ False
seq_length ................... 1024
max_preds_per_seq ............ None
deepspeed .................... True
deepspeed_config ............. /home/sdl/DeepSpeedExamples/Megatron-LM/scripts/ds_zero2_config.json
deepscale .................... False
deepscale_config ............. None
deepspeed_mpi ................ False
cuda ......................... True
rank ......................... 0
world_size ................... 4
dynamic_loss_scale ........... True
[2021-02-14 15:50:17,742] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': True, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2021-02-14 15:50:17,743] [INFO] [checkpointing.py:256:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
configuring data
> padded vocab (size: 50257) with 431 dummy tokens (new size: 50688)
> found end-of-document token: 50256
building GPT2 model ...
> number of parameters on model parallel rank 3: 89714688
Optimizer = FusedAdam
> number of parameters on model parallel rank 1: 89714688
Optimizer = FusedAdam
> number of parameters on model parallel rank 2: 89714688
Optimizer = FusedAdam
Using /home/sdl/.cache/torch_extensions as PyTorch extensions root...
> number of parameters on model parallel rank 0: 89714688
Detected CUDA files, patching ldflags
Emitting ninja build file /home/sdl/.cache/torch_extensions/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /home/sdl/.cache/torch_extensions as PyTorch extensions root...
Optimizer = FusedAdam
learning rate decaying cosine
DeepSpeed is enabled.
[2021-02-14 15:50:30,238] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.10, git-hash=unknown, git-branch=unknown
Using /home/sdl/.cache/torch_extensions as PyTorch extensions root...
Using /home/sdl/.cache/torch_extensions as PyTorch extensions root...
[1/2] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -isystem /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/include -isystem /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/include/TH -isystem /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/sdl/anaconda3/envs/deepspeed/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -std=c++14 -c /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -isystem /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/include -isystem /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/include/TH -isystem /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/sdl/anaconda3/envs/deepspeed/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -std=c++14 -c /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134: required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6688:95: required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’ without object
__p->_M_set_sharable();
~~~~~~~~~^~
/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134: required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6693:95: required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’ without object
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1539, in _run_ninja_build
env=env)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "pretrain_gpt2.py", line 716, in <module>
main()
File "pretrain_gpt2.py", line 664, in main
model, optimizer, lr_scheduler = setup_model_and_optimizer(args)
File "pretrain_gpt2.py", line 176, in setup_model_and_optimizer
dist_init_required=False
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/__init__.py", line 119, in initialize
config_params=config_params)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 171, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 514, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 583, in _configure_basic_optimizer
optimizer = FusedAdam(model_parameters, **optimizer_parameters)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/adam/fused_adam.py", line 72, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 180, in load
return self.jit_load(verbose)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 216, in jit_load
verbose=verbose)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 997, in load
keep_intermediates=keep_intermediates)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1202, in _jit_compile
with_cuda=with_cuda)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1300, in _write_ninja_file_and_build_library
error_prefix="Error building extension '{}'".format(name))
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
Loading extension module fused_adam...
Traceback (most recent call last):
File "pretrain_gpt2.py", line 716, in <module>
main()
File "pretrain_gpt2.py", line 664, in main
model, optimizer, lr_scheduler = setup_model_and_optimizer(args)
File "pretrain_gpt2.py", line 176, in setup_model_and_optimizer
dist_init_required=False
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/__init__.py", line 119, in initialize
config_params=config_params)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 171, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 514, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 583, in _configure_basic_optimizer
optimizer = FusedAdam(model_parameters, **optimizer_parameters)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/adam/fused_adam.py", line 72, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 180, in load
return self.jit_load(verbose)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 216, in jit_load
verbose=verbose)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 997, in load
keep_intermediates=keep_intermediates)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1213, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1560, in _import_module_from_library
file, path, description = imp.find_module(module_name, [path])
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/imp.py", line 297, in find_module
raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named 'fused_adam'
(The other ranks fail with the same ImportError traceback, interleaved in the output.)
Can you please help me figure it out?
Thank you in advance
Hi,
Great repo! I have some questions about the bing_bert example: does it support ZeRO? I tried stage 1 and stage 2, but it kept running into errors. Without the DeepSpeed transformer kernels, I encountered
Invalid to reduce Param 76 with None gradient
and with the DeepSpeed transformer kernels, I encountered
RuntimeError: CUDA error: misaligned address
I would really appreciate some help. Thanks in advance!
I want to pretrain BERT-large with only a single V100-32G, to reproduce the result in the diagram below. However, the BERT pre-training tutorial has different hyper-parameters and a different script; maybe the code has been updated. Instead of the tutorial's method, I tried running ds_train_bert_nvidia_data_bsz64k_seq128.sh (changing train_batch_size=320 and train_micro_batch_size_per_gpu=32) on a single GPU, and enabled all the optimizations to reduce memory and improve speed.
python ${base_dir}/deepspeed_train.py \
--cf ${base_dir}/bert_large_lamb_nvidia_data.json \
--max_seq_length 512 \
--output_dir $OUTPUT_DIR \
--print_steps 1 \
--deepspeed \
--deepspeed_transformer_kernel \
--stochastic_mode \
--gelu_checkpoint \
--normalize_invertible \
--job_name $JOB_NAME \
--deepspeed_config ${base_dir}/deepspeed_bsz32k_lamb_config_seq512.json \
--data_path_prefix /workspace/bert \
--use_nvidia_dataset \
--rewarmup \
--lr_schedule "EE" \
--attention_dropout_checkpoint \
--lr_offset 0.0 \
--load_training_checkpoint ${CHECKPOINT_BASE_PATH} \
--load_checkpoint_id ${CHECKPOINT_EPOCH_NAME} \
That is the exact script I ran. However, the measured throughput is about 47 samples/second, which is quite a bit lower than 52. Can you give me some suggestions about this result? Also, I get the same kind of result with Megatron, so it is not a hardware problem. Thanks a lot!
Hello,
I am testing DeepSpeed performance. However, I can't run bing_bert without DeepSpeed.
I tried to recover train.py from deepspeed_train.py, but it's a little bit difficult.
So is there any reference code, or could you just upload train.py? Thanks.
Hi.
I am referring to the CIFAR-10 example. If we directly pass a torch.utils.data.Dataset while calling deepspeed.initialize(), what are my options for also passing augmentation transforms? Even though this line creates the torch data loader, it basically has no use other than visualization, if my understanding is correct.
Any pointers will be appreciated.
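In case it's useful to others, the workaround I can think of is to attach the transforms to the Dataset itself and pass that to deepspeed.initialize via training_data. A rough sketch follows; args and net are assumed to be the objects defined in cifar10_deepspeed.py, so treat this as an assumption rather than the official answer:

# Sketch: bake augmentation transforms into the Dataset before handing it to DeepSpeed.
import torchvision
import torchvision.transforms as transforms
import deepspeed

transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),                       # augmentation runs inside the Dataset
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
trainset = torchvision.datasets.CIFAR10(root="./data", train=True,
                                        download=True, transform=transform)

# deepspeed.initialize builds its own distributed DataLoader from training_data;
# `args` and `net` here are the ones defined in the CIFAR example script.
model_engine, optimizer, trainloader, _ = deepspeed.initialize(
    args=args, model=net, model_parameters=net.parameters(), training_data=trainset)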
Trying out the CIFAR-10 example, it appears that adding some arguments doesn't work because they are being overwritten somewhere.
I tried changing the number of epochs with the --epochs flag, but it looks like the cifar10_deepspeed.py script has the epoch count hard-coded to 2:
for epoch in range(2): # loop over the dataset multiple times
I also tried to change the learning rate to 0.0005 by editing the ds_config.json file, and it seems like that gets picked up in some parts but overwritten in others.
For example, I see
worker-0: [2020-10-01 22:41:49,395] [INFO] [config.py:624:print] optimizer_params ............. {'lr': 0.0005, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07}
Which seems to have picked it up, but when it actually runs the training it always says:
worker-0: [2020-10-01 22:42:57,190] [INFO] [logging.py:60:log_dist] [Rank 0] step=18000, skipped=0, lr=[0.001], mom=[[0.8, 0.999]]
Which suggests the lr change was not picked up (it also stays at 0.001 throughout, which suggests it's not doing any lr_warmup either). I haven't tracked down where in the script the learning rate gets overwritten, but it does seem to be happening.
Thank you very much for your contribution!
When I read the ZeRO-Offload tutorial, I found that I couldn't locate the part of the script pretrain_gpt2.py that corresponds to the 'cpu_optimizer' setting. What am I doing wrong? I can't thank you enough for your help!
I could not find any scripts in the examples folder that enable pipelining.
In the Megatron-LM-v1.1.5-ZeRO3 Megatron implementation, periodic evaluations do not happen during training. This is essential functionality for knowing that training is proceeding correctly. Evaluation can be re-enabled by removing the quotation marks around lines 484-491 of megatron/training.py. I have tested this with several DeepSpeed settings, and evaluation seems to work properly. But those lines were clearly quoted out intentionally, since the comment above says # XXX temporarily disabled for ZeRO-3. It would be helpful to at least know the reason for this. Does evaluation work incorrectly only when the ZeRO optimizer is set to stage 3? What would be needed to get evaluation working for the cases where it doesn't work?
Client should expect unscaled loss values from DeepSpeed forward()
I am trying to run the GPT-2 model on Google Colab. I have placed my training data in loose JSON format, with one JSON object containing a text sample per line. On running the preprocessing script, I get the error shown below. Kindly help me with this.
complete error below:
Opening my-corpus.json
building GPT2BPETokenizer tokenizer ...
Traceback (most recent call last):
File "/content/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/tools/preprocess_data.py", line 200, in
main()
File "/content/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/tools/preprocess_data.py", line 154, in main
tokenizer = build_tokenizer(args)
File "/usr/local/lib/python3.7/dist-packages/megatron/tokenizer/tokenizer.py", line 48, in build_tokenizer
args)
File "/usr/local/lib/python3.7/dist-packages/megatron/tokenizer/tokenizer.py", line 59, in _vocab_size_with_padding
args.tensor_model_parallel_size
AttributeError: 'Namespace' object has no attribute 'tensor_model_parallel_size'
Hi there,
The tutorial
https://www.deepspeed.ai/tutorials/bert-finetuning/#loading-huggingface-and-tensorflow-pretrained-models
makes it clear how to load HF and TF checkpoints into DeepSpeed. What if we want to load a DeepSpeed checkpoint, like one from the Bing BERT example?
Do we just load the "mp_rank_00_model_states.pt" file from the checkpoint directory?
I'm currently using fp16 and ZeRO-2, so I wonder if loading that way will lose some precision. Should I use zero_to_fp32 to convert the checkpoint to fp32 for loading?
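For what it's worth, one thing you can do while waiting for an answer is to inspect the model-states file directly; a minimal sketch (the 'module' key is my assumption about the checkpoint layout and may differ across DeepSpeed versions):

# Sketch: peek inside a DeepSpeed checkpoint's model-states file and pull out the weights.
import torch

ckpt = torch.load("mp_rank_00_model_states.pt", map_location="cpu")
print(ckpt.keys())                    # assumption: a 'module' entry holds the model state_dict
state_dict = ckpt.get("module", ckpt)
# model.load_state_dict(state_dict)   # load into a freshly constructed model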
Currently BingBertSquad is not using DeepSpeed Launcher.
All examples should use the launcher.
https://github.com/microsoft/DeepSpeedExamples/blob/master/BingBertSquad/run_squad_deepspeed.sh#L22
I was trying to run Megatron with ZeRO 2 config when I encountered this error
The code version is Megatron-LM-v1.1.5-3D_parallelism.
Traceback (most recent call last):
File "pretrain_gpt2.py", line 158, in <module>
args_defaults={'tokenizer_type': 'GPT2BPETokenizer'})
File "DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 100, in pretrain
train_data_iterator, valid_data_iterator)
File "DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 485, in train
lr_scheduler)
File "DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 325, in train_step
return train_step_pipe(model, data_iterator)
File "DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 359, in train_step_pipe
loss = model.train_batch(data_iter=data_iterator)
File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/engine.py", line 283, in train_batch
self._exec_schedule(sched)
File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/engine.py", line 1161, in _exec_schedule
self._exec_instr(**cmd.kwargs)
File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/engine.py", line 219, in _exec_reduce_tied_grads
self.module.allreduce_tied_weight_gradients()
File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/module.py", line 409, in allreduce_tied_weight_gradients
File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/module.py", line 409, in allreduce_tied_weight_gradients
dist.all_reduce(weight.grad, group=comm['group'])
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 890, in all_reduce
_check_single_tensor(tensor, "tensor")
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_single_tensor
"to be of type torch.Tensor.".format(param_name))
RuntimeError: Invalid function argument. Expected parameter `tensor` to be of type torch.Tensor.
It seems that in File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/module.py", line 409, in allreduce_tied_weight_gradients, at
dist.all_reduce(weight.grad, group=comm['group'])
weight.grad is not a Tensor. But this error doesn't occur with the ZeRO 0 and 1 configs.
My script is like this:
#! /bin/bash
GPUS_PER_NODE=16
# Change for multinode config
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6000
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
export DLWS_NUM_WORKER=${NNODES}
export DLWS_NUM_GPU_PER_WORKER=${GPUS_PER_NODE}
DATA_PATH=/userhome/ChineseCorpus/Megatron-training/all-sample100G-samplebyfile-combine10M/text_document
VOCAB_PATH=bpe_3w_new/vocab.json
MERGE_PATH=bpe_3w_new/merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m_ds
script_path=$(realpath $0)
script_dir=$(dirname $script_path)
config_json="$script_dir/ds_zero_stage_2_config.json"
# config_json="$script_dir/ds_config.json"
# Megatron Model Parallelism
mp_size=2
# DeepSpeed Pipeline parallelism
pp_size=2
NLAYERS=24
NHIDDEN=1024
BATCHSIZE=4
LOGDIR="tensorboard_data/${NLAYERS}l_${NHIDDEN}h_${NNODES}n_${GPUS_PER_NODE}g_${pp_size}pp_${mp_size}mp_${BATCHSIZE}b_ds4"
GAS=16
#ZeRO Configs
stage=2
reduce_scatter=true
contigious_gradients=true
rbs=50000000
agbs=5000000000
#Actication Checkpointing and Contigious Memory
chkp_layers=1
PA=true
PA_CPU=false
CC=true
SYNCHRONIZE=true
PROFILE=false
gpt_options=" \
--model-parallel-size ${mp_size} \
--pipe-parallel-size ${pp_size} \
--num-layers $NLAYERS \
--hidden-size $NHIDDEN \
--num-attention-heads 16 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--batch-size $BATCHSIZE \
--gas $GAS \
--train-iters 320000 \
--lr-decay-iters 320000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file $VOCAB_PATH \
--merge-file $MERGE_PATH \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 1.5e-4 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--warmup 0.01 \
--checkpoint-activations \
--log-interval 1 \
--save-interval 500 \
--eval-interval 100 \
--eval-iters 10 \
--fp16 \
--tensorboard-dir ${LOGDIR}
"
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${stage} \
--zero-reduce-bucket-size ${rbs} \
--zero-allgather-bucket-size ${agbs}
"
if [ "${contigious_gradients}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--zero-contigious-gradients"
fi
if [ "${reduce_scatter}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--zero-reduce-scatter"
fi
chkp_opt=" \
--checkpoint-activations \
--checkpoint-num-layers ${chkp_layers}"
if [ "${PA}" = "true" ]; then
chkp_opt="${chkp_opt} \
--partition-activations"
fi
if [ "${PA_CPU}" = "true" ]; then
chkp_opt="${chkp_opt} \
--checkpoint-in-cpu"
fi
if [ "${SYNCHRONIZE}" = "true" ]; then
chkp_opt="${chkp_opt} \
--synchronize-each-layer"
fi
if [ "${CC}" = "true" ]; then
chkp_opt="${chkp_opt} \
--contigious-checkpointing"
fi
if [ "${PROFILE}" = "true" ]; then
chkp_opt="${chkp_opt} \
--profile-backward"
fi
full_options="${gpt_options} ${deepspeed_options} ${chkp_opt}"
run_cmd="deepspeed --num_nodes ${DLWS_NUM_WORKER} --num_gpus ${DLWS_NUM_GPU_PER_WORKER} --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} pretrain_gpt2.py $@ ${full_options}"
echo ${run_cmd}
eval ${run_cmd}
set +x
The ZeRO 2 config is like this:
{
"train_batch_size":256,
"gradient_accumulation_steps": 1,
"steps_per_print": 1,
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"reduce_scatter": true,
"allgather_bucket_size": 50000000,
"reduce_bucket_size": 50000000,
"overlap_comm": true
},
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.00015,
"max_grad_norm": 1.0,
"betas": [0.9, 0.95]
}
},
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"wall_clock_breakdown": true,
"zero_allow_untested_optimizer": false
}
In the Megatron-LM docker README file, the following instructions can be found:
Note that as of now you need to have PySOL cloned to the directory here before building the container.
What is "PySOL" referring to? I assume we are not talking about http://www.pysol.org/, and I can't find any other relevant reference on a short Google search.
When I run DeepSpeedExamples-Megatron-LM-v1.1.5-ZeRO3 and Megatron-LM-v1.1.5-3D_parallelism, I encounter the same problem: TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'. Traceback (most recent call last):
File "pretrain_bert.py", line 123, in <module>
args_defaults={'tokenizer_type': 'BertWordPieceLowerCase'})
File "/home/DeepSpeedExamples-master/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 88, in pretrain
train_valid_test_dataset_provider)
File "/home/DeepSpeedExamples-master/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 619, in build_train_valid_test_data_iterators
global_batch_size = args.batch_size * data_parallel_size * args.gas
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
My Python version is 3.7.10, and my torch version is 1.8.1+cu101.
Hi,
I have successfully run the DCGAN training code using DeepSpeed, on the CelebA dataset.
But the problem is that when I run the baseline code, the training wall-clock time is the same as with the DeepSpeed-enabled code (137s). So I don't know whether I am missing something.
My system information is:
OS: Ubuntu 14.04
CUDA Toolkit 10.1.243
GPU: Single GPU - NVIDIA TiTanX
Pytorch version is 1.4.0, and also tested on 1.7.1
Thank you.
It seems to me that DeepSpeedExamples/Megatron-LM/pretrain_bert.py (lines 221 to 222 in fa1d1a7) does not take ignore_index=-1 into account, while the bing-bert example does:
As the title says, I want to enable activation checkpointing.
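For context (so the question is concrete): in the Megatron-style scripts here it is driven by the --checkpoint-activations / --deepspeed-activation-checkpointing flags plus an activation_checkpointing section in the DeepSpeed config. A sketch of that section, written as a Python dict with the same keys the checkpointing log lines earlier on this page print; values are illustrative:

# Sketch of the activation_checkpointing block of a DeepSpeed config (keys mirror the
# _configure_using_config_file log lines above; values are illustrative, not from the repo).
activation_checkpointing_config = {
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": False,
        "cpu_checkpointing": False,
        "number_checkpoints": None,
        "synchronize_checkpoint_boundary": False,
        "profile": False,
    }
}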
I try to run the BERT with pipeline parallelism, but I get an error:
File "DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/pretrain_bert.py", line 146, in
args_defaults={'tokenizer_type': 'BertWordPieceLowerCase'})
File "/DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 81, in pretrain
model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
File "/DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 252, in setup_model_and_optimizer
model.set_batch_fn(model.module._megatron_batch_fn)
File "/home/wwu/anaconda3/envs/sospx86/lib/python3.6/site-packages/torch/nn/modules/module.py", line 948, in getattr
type(self).name, name))
AttributeError: 'DeepSpeedEngine' object has no attribute 'set_batch_fn'
I dig into the code a little bit, it seems like the pipeline parallelism is not implemented for BERT.
When I use a Chinese corpus, I get an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 11: ordinal not in range(128)
but I can't find the bug.
Does anybody know why?
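That error usually means some file is being read with the default ASCII codec; a minimal illustration of the usual fix, with a placeholder file name, is to pass an explicit encoding wherever the corpus is opened:

# Minimal illustration: open the corpus with an explicit UTF-8 encoding instead of the
# locale default, so Chinese text decodes correctly. "corpus_zh.json" is a placeholder.
with open("corpus_zh.json", "r", encoding="utf-8") as f:
    for line in f:
        text = line.strip()   # decoded as UTF-8; byte 0xe7 is a lead byte of a multi-byte Chinese character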
Machine translation usually takes dynamically sized batches composed of X tokens, rather than X sentences, as training input. I'm wondering why DeepSpeed requires specifying train_batch_size and train_micro_batch_size_per_gpu, both of which refer to the number of samples. Is this a matter of implementation details? Or is it possible to support dynamic batch sizes, as in machine translation, without extra cost in efficiency and memory usage?
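For context, my understanding of the docs is that the three batch-size knobs are tied by a fixed relation that DeepSpeed checks at startup; a tiny sketch of that relation (the numbers are made up):

# Relation DeepSpeed enforces between the batch-size settings (my reading of the docs, not repo code):
# train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * number_of_gpus
train_micro_batch_size_per_gpu = 4
gradient_accumulation_steps = 8
number_of_gpus = 8
train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * number_of_gpus
assert train_batch_size == 256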
Situation: with different "train_micro_batch_size_per_gpu" values in deepspeed_bsz32k_lamb_config_seq512.json (and deepspeed_bsz64k_lamb_config_seq128.json if validation is enabled for seq128), the validation losses differ by as much as 0.8.
I created a branch including the test code I used: https://github.com/microsoft/DeepSpeedExamples/tree/conglli/validation_investigation. In deepspeed_train.py, I added another validation loss calculation ("Validation Loss split") that splits the micro batch into single items, so that we can compare the validation loss when using the defined micro batch size against the validation loss when using micro batch size 1. This also excludes any potential input data effect, since both calculations use the same validation data. With the test code, I tested seq512 with the DeepSpeed kernel and got different losses from the two calculations:
I also added a validation loss calculation for seq128 in the branch above. I then tested and got different validation losses for seq128 both with and without the DeepSpeed kernel (to test seq128 with the DeepSpeed kernel, you need to use this DeepSpeed branch: https://github.com/microsoft/DeepSpeed/tree/reyazda/support_dynamic_seqlength).
I had an internal discussion with Minjia, Samyam, Tunji, and Jeff on Fri Aug 28th, but we didn't reach a conclusion. It has not yet been verified whether this is a bug or correct behavior.
Hi there,
I would like to train a 20B model. To initialize the model for training, I noticed the context manager deepspeed.zero.Init()
described here: https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models .
However, injecting this context manager into the model initialization here, https://github.com/microsoft/DeepSpeedExamples/blob/master/bing_bert/turing/models.py#L123,
causes the training script to be killed at an early stage.
Here is part of the log. The total number of layers is 254, and the training script fails to initialize all of them.
I have monitored the CPU memory consumption, which is much less than the capacity.
10.0.36.222: layer #252 is created with date type [half].
10.0.60.215: Traceback (most recent call last):
10.0.60.215: File "/home/ec2-user/DeepSpeedExamples/bing_bert/zero_opt_experiments/scripts/../../deepspeed_train.py", line 607, in <module>
10.0.60.215: main()
10.0.60.215: File "/home/ec2-user/DeepSpeedExamples/bing_bert/zero_opt_experiments/scripts/../../deepspeed_train.py", line 596, in main
10.0.60.215: model, optimizer = prepare_model_optimizer(args)
10.0.60.215: File "/home/ec2-user/DeepSpeedExamples/bing_bert/zero_opt_experiments/scripts/../../deepspeed_train.py", line 469, in prepare_model_optimizer
10.0.60.215: model = BertMultiTask(args)
10.0.60.215: File "/home/ec2-user/DeepSpeedExamples/bing_bert/turing/models.py", line 125, in __init__
10.0.60.215: self.network = BertForPreTrainingPreLN(bert_config, args)
10.0.60.215: File "/home/ec2-user/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 266, in wrapper
10.0.60.215: f(module, *args, **kwargs)
10.0.60.215: File "/home/ec2-user/DeepSpeedExamples/bing_bert/nvidia/modelingpreln.py", line 1119, in __init__
10.0.60.215: config, self.bert.embeddings.word_embeddings.weight)
10.0.60.215: File "/home/ec2-user/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 266, in wrapper
10.0.60.215: f(module, *args, **kwargs)
10.0.60.215: File "/home/ec2-user/DeepSpeedExamples/bing_bert/nvidia/modelingpreln.py", line 760, in __init__
10.0.60.215: bert_model_embedding_weights)
10.0.60.215: File "/home/ec2-user/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 266, in wrapper
10.0.60.215: f(module, *args, **kwargs)
10.0.60.215: File "/home/ec2-user/DeepSpeedExamples/bing_bert/nvidia/modelingpreln.py", line 712, in __init__
10.0.60.215: self.decoder = nn.Linear(bert_model_embedding_weights.size(1),
10.0.60.215: IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
10.0.60.215: DeepSpeed Transformer config is {'layer_id': 42, 'batch_size': 32, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 40, 'attn_dropout_ratio': 0.1, 'hidden_dropout_ratio': 0.1, 'num_hidden_layers': 254, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': 3, 'seed': 42, 'normalize_invertible': False, 'gelu_checkpoint': True, 'adjust_init_range': True, 'test_gemm': False, 'layer_norm_eps': 1e-12, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': True, 'stochastic_mode': False, 'huggingface': False}
.....
10.0.60.215: layer #42 is created with date type [half].
10.0.60.215: Killing subprocess 11407
10.0.60.215: Killing subprocess 11408
10.0.60.215: Killing subprocess 11409
10.0.60.215: Killing subprocess 11410
10.0.60.215: Killing subprocess 11411
10.0.60.215: Killing subprocess 11413
10.0.60.215: Killing subprocess 11414
10.0.60.215: Killing subprocess 11417
10.0.60.215: Traceback (most recent call last):
10.0.60.215: File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
10.0.60.215: "__main__", mod_spec)
10.0.60.215: File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
10.0.60.215: exec(code, run_globals)
10.0.60.215: File "/home/ec2-user/DeepSpeed/deepspeed/launcher/launch.py", line 183, in <module>
10.0.60.215: main()
10.0.60.215: File "/home/ec2-user/DeepSpeed/deepspeed/launcher/launch.py", line 173, in main
10.0.60.215: sigkill_handler(signal.SIGTERM, None) # not coming back
10.0.60.215: File "/home/ec2-user/DeepSpeed/deepspeed/launcher/launch.py", line 151, in sigkill_handler
10.0.60.215: raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
10.0.60.215: subprocess.CalledProcessError: Command '['/bin/python3', '-u', '/home/ec2-user/DeepSpeedExamples/bing_bert/zero_opt_experiments/scripts/../../deepspeed_train.py', '--local_rank=7', '--max_seq_length', '512', '--print_steps', '10', '--deepspeed', '--data_path_prefix', '/home/ec2-user/small-data', '--use_nvidia_dataset', '--rewarmup', '--lr_schedule', 'EE', '--attention_dropout_checkpoint', '--lr_offset', '0.0', '--gelu_checkpoint', '--deepspeed_transformer_kernel', '--max_steps', '5', '--ckpt_to_save', '200', '--output_dir', '/home/ec2-user/DeepSpeedExamples/bing_bert/zero_opt_experiments/scripts/../outputs/zero3_2node_2021-09-14_01:19:59/', '--cf', '/home/ec2-user/DeepSpeedExamples/bing_bert/zero_opt_experiments/scripts/../configs/zero3_2nodes_profile.json', '--deepspeed_config', '/home/ec2-user/DeepSpeedExamples/bing_bert/zero_opt_experiments/scripts/../configs/zero3_2nodes_profile.json', '--job_name', 'zero3_2node_2021-09-14_01:19:59']' returned non-zero exit status 1.
10.0.36.222: DeepSpeed Transformer config is {'layer_id': 58, 'batch_size': 32, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 40, 'attn_dropout_ratio': 0.1, 'hidden_dropout_ratio': 0.1, 'num_hidden_layers': 254, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': 6, 'seed': 42, 'normalize_invertible': False, 'gelu_checkpoint': True, 'adjust_init_range': True, 'test_gemm': False, 'layer_norm_eps': 1e-12, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': True, 'stochastic_mode': False, 'huggingface': False}
10.0.36.222: layer #58 is created with date type [half].
I have seen two relevant issues in the DeepSpeed repo:
microsoft/DeepSpeed#907
microsoft/DeepSpeed#1041
I think this is more likely an issue with the bing_bert implementation than with DeepSpeed itself, so I brought it to this repo.
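One hedged explanation of the IndexError in the traceback (this is a hypothesis, not a confirmed fix): under deepspeed.zero.Init(), parameters are partitioned as soon as they are created, so on most ranks a weight such as bert.embeddings.word_embeddings.weight no longer carries its full 2-D shape, and the bert_model_embedding_weights.size(1) call in modelingpreln.py fails. Gathering the parameter before reading its dimensions avoids that. The sketch below assumes a process launched with the deepspeed launcher and a ZeRO-3 config; the dimensions are illustrative.

```python
# Hedged sketch: reading the shape of a weight created inside deepspeed.zero.Init().
import torch.nn as nn
import deepspeed

with deepspeed.zero.Init():
    embeddings = nn.Embedding(30528, 2560)

# Under ZeRO-3 the local tensor is partitioned, so embeddings.weight.size(1)
# can raise IndexError on ranks that hold only a flattened shard. Temporarily
# gathering the parameter restores its full shape for the duration of the block.
with deepspeed.zero.GatheredParameters(embeddings.weight, modifier_rank=None):
    hidden_size = embeddings.weight.size(1)  # 2560 inside the gathered context
```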
When I run the transformers example, this error occurs:
nvcc fatal : Unsupported gpu architecture 'compute_86'
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1533, in _run_ninja_build
subprocess.run(
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "./pretrain_bert_with_trainer.py", line 72, in
main()
File "./pretrain_bert_with_trainer.py", line 70, in main
Pretrain()
File "./pretrain_bert_with_trainer.py", line 67, in Pretrain
trainer.train()
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/transformers-4.4.2-py3.8.egg/transformers/trainer.py", line 903, in train
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/transformers-4.4.2-py3.8.egg/transformers/integrations.py", line 414, in init_deepspeed
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/deepspeed/init.py", line 116, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 186, in init
self._configure_optimizer(optimizer, model_parameters)
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 604, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 676, in _configure_basic_optimizer
optimizer = FusedAdam(model_parameters,
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 72, in init
fused_adam_cuda = FusedAdamBuilder().load()
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 215, in load
return self.jit_load(verbose)
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 243, in jit_load
op_module = load(
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 986, in load
return _jit_compile(
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1193, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1297, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
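For what it's worth, the "nvcc fatal : Unsupported gpu architecture 'compute_86'" line usually means the CUDA toolkit on the machine is older than 11.1 and cannot compile for an Ampere (sm_86) GPU. Upgrading the toolkit is the proper fix; below is a hedged stopgap sketch (the chosen architecture value is an assumption about your setup) that pins the build target to one the toolkit supports before DeepSpeed JIT-compiles fused_adam. Binaries built for sm_80 still run on sm_86 cards.

```python
# Hedged workaround sketch: pin TORCH_CUDA_ARCH_LIST before DeepSpeed builds its ops.
import os
import torch

major, minor = torch.cuda.get_device_capability()
print(f"GPU compute capability: {major}.{minor}")

# Must be set before the fused_adam extension is JIT-compiled.
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"
```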
Following the BingBertSQuAD fine-tuning tutorial, I want to test the baseline with the Hugging Face BERT. My script is:
#!/bin/bash
#1: number of GPUs
#2: Model File Address
#3: BertSquad Data Directory Address
#4: Output Directory Address
NGPU_PER_NODE=$1
MODEL_FILE=$2
SQUAD_DIR=$3
OUTPUT_DIR=$4
NUM_NODES=1
NGPU=$((NGPU_PER_NODE*NUM_NODES))
EFFECTIVE_BATCH_SIZE=24
MAX_GPU_BATCH_SIZE=6
PER_GPU_BATCH_SIZE=$((EFFECTIVE_BATCH_SIZE/NGPU))
if [[ $PER_GPU_BATCH_SIZE -lt $MAX_GPU_BATCH_SIZE ]]; then
GRAD_ACCUM_STEPS=1
else
GRAD_ACCUM_STEPS=$((PER_GPU_BATCH_SIZE/MAX_GPU_BATCH_SIZE))
fi
LR=3e-5
MASTER_PORT=$((NGPU+12345))
JOB_NAME="baseline_${NGPU}GPUs_${EFFECTIVE_BATCH_SIZE}batch_size"
run_cmd="deepspeed --num_nodes ${NUM_NODES} --num_gpus ${NGPU_PER_NODE} \
nvidia_run_squad_baseline.py \
--bert_model bert-large-uncased \
--do_train \
--do_lower_case \
--do_predict \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--train_batch_size $PER_GPU_BATCH_SIZE \
--learning_rate ${LR} \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir $OUTPUT_DIR \
--job_name ${JOB_NAME} \
--gradient_accumulation_steps ${GRAD_ACCUM_STEPS} \
--fp16 \
--model_file $MODEL_FILE \
--ckpt_type HF \
--origin_bert_config_file ./pre-trained-model/hugging-face/bert-large-uncased-whole-word-masking-config.json
"
echo ${run_cmd}
eval ${run_cmd}
When I run the script with this command:
bash run_squad_baseline_hf.sh 4 pre-trained-model/hugging-face/bert-large-uncased-whole-word-masking-pytorch_model.bin data/SQuAD/ ./tmp
this error occurs:
ValueError: Output directory () already exists and is not empty.
I did check that ./tmp is empty before running the command, so how can I solve this problem?
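One hedged observation: the error prints "Output directory ()" with an empty name, which suggests --output_dir never reached the argument parser as expected (possibly a quoting or expansion issue in the eval'd run_cmd). Below is a small sketch of the kind of guard such scripts use, with a debug print to narrow this down; the function name and overwrite flag are assumptions, not the exact code in nvidia_run_squad_baseline.py.

```python
# Hedged sketch of a typical output-directory guard with a debug print.
import os

def check_output_dir(output_dir: str, overwrite: bool = False) -> None:
    print(f"output_dir resolved to: {output_dir!r}")  # an empty string means the flag was lost
    if os.path.exists(output_dir) and os.listdir(output_dir) and not overwrite:
        raise ValueError(
            f"Output directory ({output_dir}) already exists and is not empty."
        )
    os.makedirs(output_dir, exist_ok=True)
```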
Hi guys,
I have been trying to run the Bing BERT experiment, but so far I can't.
"datasets": {
--
| "wiki_pretrain_dataset": "/data/bert/bnorick_format/128/wiki_pretrain",
| "bc_pretrain_dataset": "/data/bert/bnorick_format/128/bookcorpus_pretrain"
| },
These datasets appear to be missing, so I cannot fully validate the code.
I tried to replace the Megatron BERT model with a Hugging Face BERT model in the model_provider function.
However, the program cannot get past this assertion at deepspeed/runtime/zero/stage3.py:1896:
assert self.params_already_reduced[param_id] == False, \
    f"The parameter {param_id} has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported"
It fails with:
AssertionError: The parameter 102 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported
Parameter 102 is the embedding weights. Any suggestions?
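One hedged hypothesis (not confirmed): Hugging Face BERT ties the input embedding weight to the MLM decoder weight, so the same parameter belongs to two modules and receives gradient from two places in one backward pass, which can trip ZeRO-3's once-per-parameter reduction hooks. A minimal illustration of the tying pattern follows; the module names are stand-ins, not the Hugging Face code. Untying the weights as a test would help confirm or rule this out.

```python
# Hedged illustration of weight tying: the decoder shares storage with the
# embedding table, so a single Parameter gets gradient from two modules.
import torch.nn as nn

class TiedLMHead(nn.Module):
    def __init__(self, embedding: nn.Embedding):
        super().__init__()
        self.decoder = nn.Linear(
            embedding.embedding_dim, embedding.num_embeddings, bias=False
        )
        self.decoder.weight = embedding.weight  # tied: same Parameter object
```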
Hi, I've been running the demos in the Megatron-LM-v1.1.5-ZeRO3 folder and found some API breakage in /Megatron-LM-v1.1.5-ZeRO3/megatron/training.py:
line 327: see_memory_usage(f'before forward {model.global_steps}', force=True)
line 333: see_memory_usage(f'before backward {model.global_steps}', force=True)
line 340: see_memory_usage(f'before optimizer {model.global_steps}', force=True)
While running pretrain_bert.py, errors emerged saying that model has no attribute global_steps:
AttributeError: 'DistributedDataParallel' object has no attribute 'global_steps'
Therefore, I had to comment out these three lines.
line 330: loss, loss_reduced = forward_step_func(data_iterator, model, args.curriculum_learning)
While running this line, it said that forward_step() only takes two parameters:
TypeError: forward_step() takes 2 positional arguments but 3 were given
I checked the source code of pretrain_bert.py and found:
def forward_step(data_iterator, model):
So I removed args.curriculum_learning from the call, and it works.
I guess an upgrade of Megatron-LM or DeepSpeed caused this API breakage; please fix it. Thanks a lot!
I ran the same command as in README.md:
python pretrain_bert.py \
$BERT_ARGS \
$OUTPUT_ARGS \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH
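Here is a hedged sketch of how the two breakages above could be patched locally instead of commenting the lines out (names taken from the issue; this is not the upstream fix):

```python
# Hedged sketch: tolerate a model wrapper without global_steps, and accept the
# extra curriculum_learning argument that training.py passes positionally.
from deepspeed.runtime.utils import see_memory_usage

def log_memory(tag, model):
    # DeepSpeed engines expose global_steps; DistributedDataParallel does not.
    step = getattr(model, "global_steps", "n/a")
    see_memory_usage(f"{tag} {step}", force=True)

def forward_step(data_iterator, model, curriculum_learning=False):
    # Accept (and for now ignore) the third positional argument; the original
    # two-argument body of pretrain_bert.py would go here unchanged.
    ...
```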
Following the bing_bert tutorial, my deepspeed_config is:
{
"train_batch_size": 4096,
"train_micro_batch_size_per_gpu": 32,
"steps_per_print": 1000,
"prescale_gradients": false,
"optimizer": {
"type": "Adam",
"params": {
"lr": 6e-3,
"betas": [
0.9,
0.99
],
"eps": 1e-8,
"weight_decay": 0.01
}
},
"zero_optimization": {
"stage": 1,
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": false,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"contiguous_gradients": true,
"grad_hooks": true,
"round_robin_gradients": false
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 1e-8,
"warmup_max_lr": 6e-3
}
},
"gradient_clipping": 1.0,
"wall_clock_breakdown": false,
"fp16": {
"enabled": true,
"loss_scale": 0
},
"sparse_attention": {
"mode": "fixed",
"block": 16,
"different_layout_per_head": true,
"num_local_blocks": 4,
"num_global_blocks": 1,
"attention": "bidirectional",
"horizontal_global_attention": false,
"num_different_global_patterns": 4
}
}
CUDA memory usage for ZeRO stage 1 is 8900 MB per GPU.
CUDA memory usage for ZeRO stage 2 is 9600 MB per GPU.
And ZeRO-2 is much slower than ZeRO-1 in training speed.
Any help would be appreciated.
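For comparison, here is a hedged sketch (the values are assumptions to experiment with, not recommendations from the tutorial) of the stage-2 settings most often involved in this kind of gap: with overlap_comm disabled and 5e8 buckets, stage 2 pays for extra reduce-scatter communication and large transient buffers that stage 1 does not, which can show up as both higher memory and lower throughput on a model of this size.

```python
# Hedged sketch of stage-2 overrides worth experimenting with.
import json

zero2_overrides = {
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,        # overlap gradient reduce-scatter with backward
        "reduce_bucket_size": 2e8,   # smaller buckets -> smaller transient buffers
        "allgather_bucket_size": 2e8,
        "contiguous_gradients": True,
    }
}
print(json.dumps(zero2_overrides, indent=2))
```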
I am trying to follow the example here
https://www.deepspeed.ai/tutorials/bert-pretraining/
The section on getting the datasets says 'Note: Downloading and pre-processing instructions are coming soon.'.
I tried googling, but those datasets seem tricky to find. And even then, I'm not sure whether they would be the correct versions to use with the script.
Issue reported in DeepSpeed repo: microsoft/DeepSpeed#426
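In the meantime, here is a hedged sketch of one way to pull a comparable raw corpus with the Hugging Face datasets library. These are not necessarily the exact dump versions or preprocessing that the bing_bert scripts assume, so treat them only as a starting point.

```python
# Hedged sketch: fetch Wikipedia and BookCorpus text as raw pretraining data.
from datasets import load_dataset

wiki = load_dataset("wikipedia", "20220301.en", split="train")
books = load_dataset("bookcorpus", split="train")

print(wiki[0]["text"][:200])
print(books[0]["text"][:200])
```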