microsoft / DeepSpeedExamples
Example models using DeepSpeed
License: Apache License 2.0
I always get the error "RuntimeError: CUDA error: out of memory". Can you give a suggestion? Thanks
Hi, newbie here. Reading the API, I've noticed that the memory and speed improvements come from ZeRO. I also read that ZeRO only works with FP16, and that the GTX 1080 Ti has very low FP16 throughput, so using it is very inefficient. Would DeepSpeed provide a significant memory or speed improvement on a single GTX 1080 Ti? I'm using Transformers to model sequential data, with moderate sequence lengths (512-1024) and small batch sizes (4-6). Thanks.
We will need to update the example models to work with the new Apex APIs once the Apex pointer in the DeepSpeed repo is updated.
The documentation mentions that a TF, PT, or HF checkpoint will work for the Bing BERT fine-tuning example, but if I run the code as is with any of those checkpoints, it still runs this section of the code,
which ends up saying
"Unable to find model state in checkpoint"
Even if I try to insert an arg
args.ckpt_type = "HF"
it still tries to run that section of code somehow.
So it looks like you need a DeepSpeed checkpoint; otherwise you get a
ValueError: Should NOT use --preln if the loading checkpoint doesn't use pre-layer-norm.
error.
I also tried altering the code so it uses convert_ckpt_to_deepspeed, but then I run into
ValueError: Invalid ckpt_type.
So somehow that function is not able to use HF checkpoints.
Hi
I need help with using DeepSpeed for the Transformers zero-shot classification pipeline.
My dataset has 500K sentences and 52 labels.
I tried editing the GPT generation example, but I am not sure if I did it right.
import os
import torch
import deepspeed
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# build the zero-shot classification pipeline on the GPU
classifier = pipeline("zero-shot-classification",
                      model="typeform/distilbert-base-uncased-mnli",
                      device=0)

# wrap the underlying model with DeepSpeed inference
classifier.model = deepspeed.init_inference(classifier.model,
                                            mp_size=world_size,
                                            dtype=torch.float,
                                            replace_method='auto')

# prod_name_lst and tag_values are my own data: the sentences and the 52 candidate labels
res = classifier(prod_name_lst[:1000], tag_values)
if torch.distributed.get_rank() == 0:
    print(res)
Kindly help with an example.
Thanks,
Subham
The ZeRO 3 example does not run. The main problem appears to be that the InitContext function does not actually exist, despite being called by pretrain_gpt2.py. I tried to make some changes to get it to run (including changing the batch size, the initialization function, and some of the inputs to the initialization function), but gave up after it threw the error "variable beta1 is referenced before assignment". I think that has to do with something wonky in the optimizer?
In ds_pretrain_gpt2.sh we have
#config_json="$script_dir/ds_zero_stage_2_config.json"
config_json="$script_dir/ds_config.json"
Why is ZeRO 2 disabled when the tutorial for extremely efficient training says ZeRO 2 is 5x faster than ZeRO 1?
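For anyone else hitting this: a minimal ZeRO stage 2 config only needs a zero_optimization block in that JSON. The sketch below is written as a Python dict purely for illustration and uses made-up values, not values from this repo:

# Illustrative sketch of a ZeRO stage 2 DeepSpeed config, expressed as a Python dict;
# in the repo this would live in ds_zero_stage_2_config.json. All values are examples.
ds_zero_stage_2_config = {
    "train_batch_size": 256,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},          # ZeRO in these examples is used together with fp16
    "zero_optimization": {
        "stage": 2,                      # stage 2 partitions optimizer states and gradients
        "allgather_partitions": True,
        "reduce_scatter": True,
        "overlap_comm": True,
    },
}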
I ran the script under the DeepSpeed/DeepSpeedExamples/bing_bert directory with the following command:
sh ds_train_bert_nvidia_data_bsz64k_seq128.sh
I came across the following bug.
[2021-01-21 15:51:32,141] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
VOCAB SIZE: 30528
Traceback (most recent call last):
File "/home/jiaruifang/codes/DeepSpeed/DeepSpeedExamples/bing_bert/deepspeed_train.py", line 532, in
main()
File "/home/jiaruifang/codes/DeepSpeed/DeepSpeedExamples/bing_bert/deepspeed_train.py", line 521, in main
model, optimizer = prepare_model_optimizer(args)
File "/home/jiaruifang/codes/DeepSpeed/DeepSpeedExamples/bing_bert/deepspeed_train.py", line 396, in prepare_model_optimizer
model = BertMultiTask(args)
File "/home/jiaruifang/codes/DeepSpeed/DeepSpeedExamples/bing_bert/turing/models.py", line 123, in init
self.network = BertForPreTrainingPreLN(bert_config, args)
File "/home/jiaruifang/codes/DeepSpeed/DeepSpeedExamples/bing_bert/nvidia/modelingpreln.py", line 1117, in init
self.bert = BertModel(config, args)
File "/home/jiaruifang/codes/DeepSpeed/DeepSpeedExamples/bing_bert/nvidia/modelingpreln.py", line 1002, in init
config, args, sparse_attention_config=self.sparse_attention_config)
File "/home/jiaruifang/codes/DeepSpeed/DeepSpeedExamples/bing_bert/nvidia/modelingpreln.py", line 596, in init
for i in range(config.num_hidden_layers)
File "/home/jiaruifang/codes/DeepSpeed/DeepSpeedExamples/bing_bert/nvidia/modelingpreln.py", line 596, in
for i in range(config.num_hidden_layers)
File "/home/jiaruifang/anaconda3/lib/python3.7/site-packages/deepspeed/ops/transformer/transformer.py", line 487, in init
self.config.layer_id = DeepSpeedTransformerLayer.layer_id
AttributeError: 'int' object has no attribute 'layer_id'
I was trying to run Megatron with ZeRO 2 config when I encountered this error
> finished creating GPT2 datasets ...
setting training data start iteration to 0
setting validation data start iteration to 0
done with setups ...
time (ms) | model and optimizer: 1894.21 | train/valid/test data iterators: 357.88
training ...
Traceback (most recent call last):
File "pretrain_gpt2.py", line 156, in <module>
pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
File "/root/megatron-3d/megatron/training.py", line 97, in pretrain
iteration = train(forward_step_func,
File "/root/megatron-3d/megatron/training.py", line 481, in train
loss_dict, skipped_iter = train_step(forward_step_func,
File "/root/megatron-3d/megatron/training.py", line 324, in train_step
return train_step_pipe(model, data_iterator)
File "/root/megatron-3d/megatron/training.py", line 358, in train_step_pipe
loss = model.train_batch(data_iter=data_iterator)
File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 273, in train_batch
self._exec_schedule(sched)
File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 1162, in _exec_schedule
self._exec_instr(**cmd.kwargs)
File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 621, in _exec_load_micro_batch
batch = self._next_batch()
File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 480, in _next_batch
return self._next_batch()
File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 480, in _next_batch
return self._next_batch()
File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 480, in _next_batch
return self._next_batch()
[Previous line repeated 978 more times]
File "/root/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 469, in _next_batch
batch = self.batch_fn(batch)
File "pretrain_gpt2.py", line 110, in get_batch_pipe
return fp32_to_fp16((tokens, position_ids, attention_mask)), fp32_to_fp16((labels, loss_mask))
File "/root/megatron-3d/megatron/fp16/fp16.py", line 53, in fp32_to_fp16
return conversion_helper(val, half_conversion)
File "/root/megatron-3d/megatron/fp16/fp16.py", line 38, in conversion_helper
rtn = [conversion_helper(v, conversion) for v in val]
File "/root/megatron-3d/megatron/fp16/fp16.py", line 38, in <listcomp>
rtn = [conversion_helper(v, conversion) for v in val]
File "/root/megatron-3d/megatron/fp16/fp16.py", line 37, in conversion_helper
return conversion(val)
File "/root/megatron-3d/megatron/fp16/fp16.py", line 48, in half_conversion
if isinstance(val_typecheck, (Parameter, Variable)):
File "/root/anaconda3/lib/python3.8/site-packages/torch/autograd/variable.py", line 7, in __instancecheck__
return isinstance(other, torch.Tensor)
RecursionError: maximum recursion depth exceeded while calling a Python object
This doesn't occur with the following config
{
"train_batch_size": 224,
"train_micro_batch_size_per_gpu": 4,
"steps_per_print": 10,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.00015,
"max_grad_norm": 1.0,
"betas": [0.9, 0.95]
}
},
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"wall_clock_breakdown": true,
"zero_allow_untested_optimizer": false
}
I tried to use DeepSpeed with the examples in DeepSpeedExamples. However, I couldn't find any explanation of the input data, such as $SQUAD_DIR or $MODEL_FILE, in the shell script. Can you give me some details on how to run the code?
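In case it helps while waiting for an answer: my understanding (an assumption, not something documented in the repo) is that $SQUAD_DIR points at a directory holding the standard SQuAD v1.1 files and $MODEL_FILE at a pre-trained BERT checkpoint to fine-tune from. A small sanity-check sketch, with placeholder paths:

# Sketch of what the two shell variables are assumed to point at; the default paths
# below are placeholders, not paths from the repo.
import os

squad_dir = os.environ.get("SQUAD_DIR", "/data/squad")              # holds train-v1.1.json / dev-v1.1.json
model_file = os.environ.get("MODEL_FILE", "/models/bert_base.pt")   # pre-trained BERT checkpoint

for name in ("train-v1.1.json", "dev-v1.1.json"):
    path = os.path.join(squad_dir, name)
    print(path, "exists" if os.path.exists(path) else "MISSING")
print(model_file, "exists" if os.path.exists(model_file) else "MISSING")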
When segregating the parameters for the optimizer, the norm and bias parameters are not included:
Is this handled elsewhere, or was weight decay intentionally applied to all of the DeepSpeed transformer layer parameters?
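For reference, the usual pattern (as in the common BERT fine-tuning examples) is to put bias and norm parameters into a separate group with zero weight decay. A minimal sketch of that grouping, not code from this repo; the name patterns are the common convention and are an assumption here:

# Sketch: split a model's parameters into decay / no-decay groups before building the optimizer.
def group_parameters(model, weight_decay=0.01):
    no_decay = ("bias", "LayerNorm.weight")   # assumed name patterns for norm/bias parameters
    decay_params = [p for n, p in model.named_parameters()
                    if not any(nd in n for nd in no_decay)]
    no_decay_params = [p for n, p in model.named_parameters()
                       if any(nd in n for nd in no_decay)]
    return [
        {"params": decay_params, "weight_decay": weight_decay},
        {"params": no_decay_params, "weight_decay": 0.0},
    ]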
Following this GPT-2 tutorial (https://www.deepspeed.ai/tutorials/megatron/), I modified pretrain_bert to run with DeepSpeed. However, I got this message: RuntimeError: leaf variable has been moved into the graph interior.
Do you have any idea how I can fix the error?
The full error messages are below.
elsa-03-ib0: Traceback (most recent call last):
elsa-03-ib0: File "/home/soojeong/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 617, in
elsa-03-ib0: main()
elsa-03-ib0: File "/home/soojeong/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 595, in main
elsa-03-ib0: timers, args)
elsa-03-ib0: File "/home/soojeong/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 354, in train
elsa-03-ib0: args, timers)
elsa-03-ib0: File "/home/soojeong/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 310, in train_step
elsa-03-ib0: nsp_loss, args, timers)
elsa-03-ib0: File "/home/soojeong/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 255, in backward_step
elsa-03-ib0: model.backward(loss)
elsa-03-ib0: File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/deepspeed/pt/deepspeed_light.py", line 665, in backward
elsa-03-ib0: self.optimizer.backward(loss)
elsa-03-ib0: File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/deepspeed/pt/deepspeed_zero_optimizer.py", line 455, in backward
elsa-03-ib0: self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
elsa-03-ib0: File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/deepspeed/pt/loss_scaler.py", line 174, in backward
elsa-03-ib0: scaled_loss.backward(retain_graph=retain_graph)
elsa-03-ib0: File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
elsa-03-ib0: torch.autograd.backward(self, gradient, retain_graph, create_graph)
elsa-03-ib0: File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/autograd/init.py", line 93, in backward
elsa-03-ib0: allow_unreachable=True) # allow_unreachable flag
elsa-03-ib0: RuntimeError: leaf variable has been moved into the graph interior
Hi,
Thank you for providing DeepSpeed library. For BingBertSquad example, there is no pre-trained (deepspeed compatible) Bert model. Is there any plan to release it?
Collecting the datasets needed for pretraining is a bit of work, especially when downloading from lots of different URLs behind a firewall.
I see that some version of these seem to be available in HuggingFace datasets repo, like openwebtext.
https://huggingface.co/datasets/openwebtext
For the above, it's especially nice since @stas00 has a small subset one can use for testing:
https://huggingface.co/datasets/stas/openwebtext-10k
It's pretty straight-forward to extend the preprocessing script to use the HF datasets as a source rather than a json file. Would something like that be acceptable as a PR?
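To make the idea concrete, here is a rough sketch of the export side, using the small openwebtext-10k subset linked above; the output mimics the loose-JSON format (one JSON object per line) that the existing preprocessing script already reads. The file names and the "text" key are assumptions on my part:

# Rough sketch: dump an HF dataset to loose JSON so the existing Megatron
# preprocessing script can consume it unchanged.
import json
from datasets import load_dataset

ds = load_dataset("stas/openwebtext-10k", split="train")
with open("openwebtext-10k.json", "w", encoding="utf-8") as f:
    for sample in ds:
        f.write(json.dumps({"text": sample["text"]}, ensure_ascii=False) + "\n")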
I was trying to run the code with the following command
bash scripts/ds_zero2_pretrain_gpt2_model_parallel.sh
and I got the error below.
deepspeed --num_nodes 1 --num_gpus 4 pretrain_gpt2.py --model-parallel-size 4 --num-layers 24 --hidden-size 1024 --num-attention-heads 16 --batch-size 8 --seq-length 1024 --max-position-embeddings 1024 --train-iters 100000 --resume-dataloader --train-data wikipedia --lazy-loader --tokenizer-type GPT2BPETokenizer --split 949,50,1 --distributed-backend nccl --lr 0.00015 --no-load-optim --lr-decay-style cosine --weight-decay 1e-2 --clip-grad 1.0 --warmup .01 --checkpoint-activations --deepspeed-activation-checkpointing --fp16 --deepspeed --deepspeed_config /home/sdl/DeepSpeedExamples/Megatron-LM/scripts/ds_zero2_config.json
[2021-02-14 15:50:02,533] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-02-14 15:50:02,574] [INFO] [runner.py:355:main] cmd = /home/sdl/anaconda3/envs/deepspeed/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 pretrain_gpt2.py --model-parallel-size 4 --num-layers 24 --hidden-size 1024 --num-attention-heads 16 --batch-size 8 --seq-length 1024 --max-position-embeddings 1024 --train-iters 100000 --resume-dataloader --train-data wikipedia --lazy-loader --tokenizer-type GPT2BPETokenizer --split 949,50,1 --distributed-backend nccl --lr 0.00015 --no-load-optim --lr-decay-style cosine --weight-decay 1e-2 --clip-grad 1.0 --warmup .01 --checkpoint-activations --deepspeed-activation-checkpointing --fp16 --deepspeed --deepspeed_config /home/sdl/DeepSpeedExamples/Megatron-LM/scripts/ds_zero2_config.json
[2021-02-14 15:50:04,024] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2021-02-14 15:50:04,024] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=4, node_rank=0
[2021-02-14 15:50:04,024] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2021-02-14 15:50:04,024] [INFO] [launch.py:100:main] dist_world_size=4
[2021-02-14 15:50:04,024] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2021-02-14 15:50:06,570] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
using world size: 4 and model-parallel size: 4
> using dynamic loss scaling
[2021-02-14 15:50:06,582] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
[2021-02-14 15:50:06,582] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
[2021-02-14 15:50:06,583] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
> initializing model parallel with size 4
[2021-02-14 15:50:17,741] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': True, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2021-02-14 15:50:17,742] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': True, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
Pretrain GPT2 model
arguments:
pretrained_bert .............. False
attention_dropout ............ 0.1
num_attention_heads .......... 16
hidden_size .................. 1024
intermediate_size ............ None
num_layers ................... 24
layernorm_epsilon ............ 1e-05
hidden_dropout ............... 0.1
max_position_embeddings ...... 1024
vocab_size ................... 30522
deep_init .................... False
make_vocab_size_divisible_by . 128
cpu_optimizer ................ False
cpu_torch_adam ............... False
fp16 ......................... True
fp32_embedding ............... False
fp32_layernorm ............... False
fp32_tokentypes .............. False
fp32_allreduce ............... False
hysteresis ................... 2
loss_scale ................... None
loss_scale_window ............ 1000
min_scale .................... 1
batch_size ................... 8
weight_decay ................. 0.01
checkpoint_activations ....... True
checkpoint_num_layers ........ 1
deepspeed_activation_checkpointing True
clip_grad .................... 1.0
train_iters .................. 100000
log_interval ................. 100
exit_interval ................ None
seed ......................... 1234
reset_position_ids ........... False
reset_attention_mask ......... False
lr_decay_iters ............... None
lr_decay_style ............... cosine
lr ........................... 0.00015
warmup ....................... 0.01
[2021-02-14 15:50:17,742] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': True, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
save ......................... None
save_interval ................ 5000
no_save_optim ................ False
no_save_rng .................. False
load ......................... None
no_load_optim ................ True
no_load_rng .................. False
finetune ..................... False
resume_dataloader ............ True
distributed_backend .......... nccl
local_rank ................... 0
eval_batch_size .............. None
eval_iters ................... 100
eval_interval ................ 1000
eval_seq_length .............. None
eval_max_preds_per_seq ....... None
overlapping_eval ............. 32
cloze_eval ................... False
eval_hf ...................... False
load_openai .................. False
temperature .................. 1.0
top_p ........................ 0.0
top_k ........................ 0
out_seq_length ............... 256
model_parallel_size .......... 4
shuffle ...................... False
train_data ................... ['wikipedia']
use_npy_data_loader .......... False
train_data_path ..............
val_data_path ................
test_data_path ...............
input_data_sizes_file ........ sizes.txt
delim ........................ ,
text_key ..................... sentence
eval_text_key ................ None
valid_data ................... None
split ........................ 949,50,1
test_data .................... None
lazy_loader .................. True
loose_json ................... False
presplit_sentences ........... False
num_workers .................. 2
tokenizer_model_type ......... bert-large-uncased
tokenizer_path ............... tokenizer.model
tokenizer_type ............... GPT2BPETokenizer
cache_dir .................... None
use_tfrecords ................ False
seq_length ................... 1024
max_preds_per_seq ............ None
deepspeed .................... True
deepspeed_config ............. /home/sdl/DeepSpeedExamples/Megatron-LM/scripts/ds_zero2_config.json
deepscale .................... False
deepscale_config ............. None
deepspeed_mpi ................ False
cuda ......................... True
rank ......................... 0
world_size ................... 4
dynamic_loss_scale ........... True
[2021-02-14 15:50:17,742] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': True, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2021-02-14 15:50:17,743] [INFO] [checkpointing.py:256:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
configuring data
> padded vocab (size: 50257) with 431 dummy tokens (new size: 50688)
> found end-of-document token: 50256
building GPT2 model ...
> number of parameters on model parallel rank 3: 89714688
Optimizer = FusedAdam
> number of parameters on model parallel rank 1: 89714688
Optimizer = FusedAdam
> number of parameters on model parallel rank 2: 89714688
Optimizer = FusedAdam
Using /home/sdl/.cache/torch_extensions as PyTorch extensions root...
> number of parameters on model parallel rank 0: 89714688
Detected CUDA files, patching ldflags
Emitting ninja build file /home/sdl/.cache/torch_extensions/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /home/sdl/.cache/torch_extensions as PyTorch extensions root...
Optimizer = FusedAdam
learning rate decaying cosine
DeepSpeed is enabled.
[2021-02-14 15:50:30,238] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.10, git-hash=unknown, git-branch=unknown
Using /home/sdl/.cache/torch_extensions as PyTorch extensions root...
Using /home/sdl/.cache/torch_extensions as PyTorch extensions root...
[1/2] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -isystem /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/include -isystem /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/include/TH -isystem /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/sdl/anaconda3/envs/deepspeed/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -std=c++14 -c /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -isystem /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/include -isystem /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/include/TH -isystem /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/sdl/anaconda3/envs/deepspeed/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -std=c++14 -c /home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134: required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6688:95: required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’ without object
__p->_M_set_sharable();
~~~~~~~~~^~
/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134: required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6693:95: required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’ without object
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1539, in _run_ninja_build
env=env)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/subprocess.py", line 438, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "pretrain_gpt2.py", line 716, in <module>
main()
File "pretrain_gpt2.py", line 664, in main
model, optimizer, lr_scheduler = setup_model_and_optimizer(args)
File "pretrain_gpt2.py", line 176, in setup_model_and_optimizer
dist_init_required=False
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/__init__.py", line 119, in initialize
config_params=config_params)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 171, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 514, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 583, in _configure_basic_optimizer
optimizer = FusedAdam(model_parameters, **optimizer_parameters)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/adam/fused_adam.py", line 72, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 180, in load
return self.jit_load(verbose)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 216, in jit_load
verbose=verbose)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 997, in load
keep_intermediates=keep_intermediates)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1202, in _jit_compile
with_cuda=with_cuda)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1300, in _write_ninja_file_and_build_library
error_prefix="Error building extension '{}'".format(name))
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
Loading extension module fused_adam...
Traceback (most recent call last):
File "pretrain_gpt2.py", line 716, in <module>
main()
File "pretrain_gpt2.py", line 664, in main
model, optimizer, lr_scheduler = setup_model_and_optimizer(args)
File "pretrain_gpt2.py", line 176, in setup_model_and_optimizer
dist_init_required=False
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/__init__.py", line 119, in initialize
config_params=config_params)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 171, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 514, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 583, in _configure_basic_optimizer
optimizer = FusedAdam(model_parameters, **optimizer_parameters)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/adam/fused_adam.py", line 72, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 180, in load
return self.jit_load(verbose)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 216, in jit_load
verbose=verbose)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 997, in load
keep_intermediates=keep_intermediates)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1213, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1560, in _import_module_from_library
file, path, description = imp.find_module(module_name, [path])
File "/home/sdl/anaconda3/envs/deepspeed/lib/python3.6/imp.py", line 297, in find_module
raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named 'fused_adam'
(The other ranks fail with the same ImportError traceback, interleaved in the output.)
Can you please help me figure it out?
Thank you in advance
Hi,
Great repo! I have some questions about the bing_bert example: does it support ZeRO? I tried stage 1 and stage 2, but it kept running into errors. Without the DeepSpeed transformer kernels, I encountered
Invalid to reduce Param 76 with None gradient
and with the DeepSpeed transformer kernels, I encountered
RuntimeError: CUDA error: misaligned address
I would really appreciate some help. Thanks in advance!
I want to pretrain BERT-large with only a single V100-32G, to reproduce the result in the diagram below. However, the BERT pre-training tutorial has different hyper-parameters and a different script; maybe the code has been updated. Instead of the tutorial's method, I tried running ds_train_bert_nvidia_data_bsz64k_seq128.sh (changing train_batch_size=320 and train_micro_batch_size_per_gpu=32) on a single GPU, and enabled all the optimizations to reduce memory and improve speed.
python ${base_dir}/deepspeed_train.py \
--cf ${base_dir}/bert_large_lamb_nvidia_data.json \
--max_seq_length 512 \
--output_dir $OUTPUT_DIR \
--print_steps 1 \
--deepspeed \
--deepspeed_transformer_kernel \
--stochastic_mode \
--gelu_checkpoint \
--normalize_invertible \
--job_name $JOB_NAME \
--deepspeed_config ${base_dir}/deepspeed_bsz32k_lamb_config_seq512.json \
--data_path_prefix /workspace/bert \
--use_nvidia_dataset \
--rewarmup \
--lr_schedule "EE" \
--attention_dropout_checkpoint \
--lr_offset 0.0 \
--load_training_checkpoint ${CHECKPOINT_BASE_PATH} \
--load_checkpoint_id ${CHECKPOINT_EPOCH_NAME} \
That is the exact script I ran. However, the measured throughput is about 47 samples/second, which is quite a bit lower than 52. Can you give me some suggestions about this result? Also, I get the same kind of result with Megatron, so it is not a hardware problem. Thanks a lot!
Hello,
I am testing DeepSpeed performance. However, I can't run bing_bert without DeepSpeed.
I tried to recover train.py from deepspeed_train.py, but it's a little bit difficult.
So is there any reference code, or could you just upload train.py? Thanks.
Hi.
I am referring to the CIFAR-10 example. If we directly pass a torch.utils.data.Dataset while calling deepspeed.initialize(), what are my options for also passing augmentation transforms? Even though this line creates the torch data loader, it basically has no use other than visualization, if my understanding is correct.
Any pointers will be appreciated.
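In case it's useful to others, the workaround I can think of is to attach the transforms to the Dataset itself and pass that to deepspeed.initialize via training_data. A rough sketch follows; args and net are assumed to be the objects defined in cifar10_deepspeed.py, so treat this as an assumption rather than the official answer:

# Sketch: bake augmentation transforms into the Dataset before handing it to DeepSpeed.
import torchvision
import torchvision.transforms as transforms
import deepspeed

transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),                       # augmentation runs inside the Dataset
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
trainset = torchvision.datasets.CIFAR10(root="./data", train=True,
                                        download=True, transform=transform)

# deepspeed.initialize builds its own distributed DataLoader from training_data;
# `args` and `net` here are the ones defined in the CIFAR example script.
model_engine, optimizer, trainloader, _ = deepspeed.initialize(
    args=args, model=net, model_parameters=net.parameters(), training_data=trainset)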
Trying out the CIFAR-10 example, it appears that adding some arguments doesn't work because they are being overwritten somewhere.
I tried changing the number of epochs with the --epochs flag, but it looks like the cifar10_deepspeed.py script has the epoch count hard-coded to 2:
for epoch in range(2): # loop over the dataset multiple times
I also tried to change the learning rate to 0.0005 by editing the ds_config.json file, and it seems like that gets picked up in some parts but overwritten in others.
For example, I see
worker-0: [2020-10-01 22:41:49,395] [INFO] [config.py:624:print] optimizer_params ............. {'lr': 0.0005, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07}
Which seems to have picked it up, but when it actually runs the training it always says:
worker-0: [2020-10-01 22:42:57,190] [INFO] [logging.py:60:log_dist] [Rank 0] step=18000, skipped=0, lr=[0.001], mom=[[0.8, 0.999]]
Which suggests the lr change was not picked up (it also stays at 0.001 throughout, which suggests it's not doing any lr_warmup either). I haven't tracked down where in the script the learning rate gets overwritten, but it does seem to be happening.
Thank you very much for your contribution!
When I read the ZeRO-Offload tutorial, I found that I couldn't locate the part of the script pretrain_gpt2.py that corresponds to the 'cpu_optimizer' setting. What am I doing wrong? I can't thank you enough for your help!
I could not find any scripts in the examples folder that enable pipelining.
In the Megatron-LM-v1.1.5-ZeRO3 Megatron implementation, periodic evaluations do not happen during training. This is essential functionality for knowing that training is proceeding correctly. Evaluation can be re-enabled by removing the quotation marks around lines 484-491 of megatron/training.py. I have tested this with several DeepSpeed settings, and evaluation seems to work properly. But those lines were clearly quoted out intentionally, since the comment above says # XXX temporarily disabled for ZeRO-3. It would be helpful to at least know the reason for this. Does evaluation work incorrectly only when the ZeRO optimizer is set to stage 3? What would be needed to get evaluation working for the cases where it doesn't work?
Client should expect unscaled loss values from DeepSpeed forward()
I am trying to run the GPT-2 model on Google Colab. I have placed my training data in loose JSON format, with one JSON object containing a text sample per line. On running the preprocessing script, I get the error shown below. Kindly help me with this.
complete error below:
Opening my-corpus.json
building GPT2BPETokenizer tokenizer ...
Traceback (most recent call last):
File "/content/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/tools/preprocess_data.py", line 200, in
main()
File "/content/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/tools/preprocess_data.py", line 154, in main
tokenizer = build_tokenizer(args)
File "/usr/local/lib/python3.7/dist-packages/megatron/tokenizer/tokenizer.py", line 48, in build_tokenizer
args)
File "/usr/local/lib/python3.7/dist-packages/megatron/tokenizer/tokenizer.py", line 59, in _vocab_size_with_padding
args.tensor_model_parallel_size
AttributeError: 'Namespace' object has no attribute 'tensor_model_parallel_size'
Hi there,
The tutorial
https://www.deepspeed.ai/tutorials/bert-finetuning/#loading-huggingface-and-tensorflow-pretrained-models
makes it clear how to load HF and TF checkpoints into DeepSpeed. What if we want to load a DeepSpeed checkpoint, like one from the Bing BERT example?
Do we just load the "mp_rank_00_model_states.pt" file from the checkpoint directory?
I'm currently using fp16 and ZeRO-2, so I wonder if loading that way will lose some precision. Should I use zero_to_fp32 to convert the checkpoint to fp32 for loading?
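For what it's worth, one thing you can do while waiting for an answer is to inspect the model-states file directly; a minimal sketch (the 'module' key is my assumption about the checkpoint layout and may differ across DeepSpeed versions):

# Sketch: peek inside a DeepSpeed checkpoint's model-states file and pull out the weights.
import torch

ckpt = torch.load("mp_rank_00_model_states.pt", map_location="cpu")
print(ckpt.keys())                    # assumption: a 'module' entry holds the model state_dict
state_dict = ckpt.get("module", ckpt)
# model.load_state_dict(state_dict)   # load into a freshly constructed model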
Currently BingBertSquad is not using DeepSpeed Launcher.
All examples should use the launcher.
https://github.com/microsoft/DeepSpeedExamples/blob/master/BingBertSquad/run_squad_deepspeed.sh#L22
I was trying to run Megatron with ZeRO 2 config when I encountered this error
The code version is Megatron-LM-v1.1.5-3D_parallelism.
Traceback (most recent call last):
File "pretrain_gpt2.py", line 158, in <module>
args_defaults={'tokenizer_type': 'GPT2BPETokenizer'})
File "DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 100, in pretrain
train_data_iterator, valid_data_iterator)
File "DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 485, in train
lr_scheduler)
File "DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 325, in train_step
return train_step_pipe(model, data_iterator)
File "DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 359, in train_step_pipe
loss = model.train_batch(data_iter=data_iterator)
File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/engine.py", line 283, in train_batch
self._exec_schedule(sched)
File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/engine.py", line 1161, in _exec_schedule
self._exec_instr(**cmd.kwargs)
File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/engine.py", line 219, in _exec_reduce_tied_grads
self.module.allreduce_tied_weight_gradients()
File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/module.py", line 409, in allreduce_tied_weight_gradients
File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/module.py", line 409, in allreduce_tied_weight_gradients
dist.all_reduce(weight.grad, group=comm['group'])
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 890, in all_reduce
_check_single_tensor(tensor, "tensor")
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_single_tensor
"to be of type torch.Tensor.".format(param_name))
RuntimeError: Invalid function argument. Expected parameter `tensor` to be of type torch.Tensor.
It seems that in File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/pipe/module.py", line 409, in allreduce_tied_weight_gradients, at
dist.all_reduce(weight.grad, group=comm['group'])
weight.grad is not a Tensor. But this error doesn't occur with the ZeRO 0 and 1 configs.
My script is like this:
#! /bin/bash
GPUS_PER_NODE=16
# Change for multinode config
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6000
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
export DLWS_NUM_WORKER=${NNODES}
export DLWS_NUM_GPU_PER_WORKER=${GPUS_PER_NODE}
DATA_PATH=/userhome/ChineseCorpus/Megatron-training/all-sample100G-samplebyfile-combine10M/text_document
VOCAB_PATH=bpe_3w_new/vocab.json
MERGE_PATH=bpe_3w_new/merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m_ds
script_path=$(realpath $0)
script_dir=$(dirname $script_path)
config_json="$script_dir/ds_zero_stage_2_config.json"
# config_json="$script_dir/ds_config.json"
# Megatron Model Parallelism
mp_size=2
# DeepSpeed Pipeline parallelism
pp_size=2
NLAYERS=24
NHIDDEN=1024
BATCHSIZE=4
LOGDIR="tensorboard_data/${NLAYERS}l_${NHIDDEN}h_${NNODES}n_${GPUS_PER_NODE}g_${pp_size}pp_${mp_size}mp_${BATCHSIZE}b_ds4"
GAS=16
#ZeRO Configs
stage=2
reduce_scatter=true
contigious_gradients=true
rbs=50000000
agbs=5000000000
#Actication Checkpointing and Contigious Memory
chkp_layers=1
PA=true
PA_CPU=false
CC=true
SYNCHRONIZE=true
PROFILE=false
gpt_options=" \
--model-parallel-size ${mp_size} \
--pipe-parallel-size ${pp_size} \
--num-layers $NLAYERS \
--hidden-size $NHIDDEN \
--num-attention-heads 16 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--batch-size $BATCHSIZE \
--gas $GAS \
--train-iters 320000 \
--lr-decay-iters 320000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file $VOCAB_PATH \
--merge-file $MERGE_PATH \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 1.5e-4 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--warmup 0.01 \
--checkpoint-activations \
--log-interval 1 \
--save-interval 500 \
--eval-interval 100 \
--eval-iters 10 \
--fp16 \
--tensorboard-dir ${LOGDIR}
"
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${stage} \
--zero-reduce-bucket-size ${rbs} \
--zero-allgather-bucket-size ${agbs}
"
if [ "${contigious_gradients}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--zero-contigious-gradients"
fi
if [ "${reduce_scatter}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--zero-reduce-scatter"
fi
chkp_opt=" \
--checkpoint-activations \
--checkpoint-num-layers ${chkp_layers}"
if [ "${PA}" = "true" ]; then
chkp_opt="${chkp_opt} \
--partition-activations"
fi
if [ "${PA_CPU}" = "true" ]; then
chkp_opt="${chkp_opt} \
--checkpoint-in-cpu"
fi
if [ "${SYNCHRONIZE}" = "true" ]; then
chkp_opt="${chkp_opt} \
--synchronize-each-layer"
fi
if [ "${CC}" = "true" ]; then
chkp_opt="${chkp_opt} \
--contigious-checkpointing"
fi
if [ "${PROFILE}" = "true" ]; then
chkp_opt="${chkp_opt} \
--profile-backward"
fi
full_options="${gpt_options} ${deepspeed_options} ${chkp_opt}"
run_cmd="deepspeed --num_nodes ${DLWS_NUM_WORKER} --num_gpus ${DLWS_NUM_GPU_PER_WORKER} --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} pretrain_gpt2.py $@ ${full_options}"
echo ${run_cmd}
eval ${run_cmd}
set +x
The ZeRO 2 config is like this:
{
"train_batch_size":256,
"gradient_accumulation_steps": 1,
"steps_per_print": 1,
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"reduce_scatter": true,
"allgather_bucket_size": 50000000,
"reduce_bucket_size": 50000000,
"overlap_comm": true
},
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.00015,
"max_grad_norm": 1.0,
"betas": [0.9, 0.95]
}
},
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"wall_clock_breakdown": true,
"zero_allow_untested_optimizer": false
}
In the Megatron-LM docker README file, the following instructions can be found:
Note that as of now you need to have PySOL cloned to the directory here before building the container.
What is "PySOL" referring to? I assume we are not talking about http://www.pysol.org/, and I can't find any other relevant reference on a short Google search.
When I run DeepSpeedExamples-Megatron-LM-v1.1.5-ZeRO3 and Megatron-LM-v1.1.5-3D_parallelism, I encounter the same problem: TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'. Traceback (most recent call last):
File "pretrain_bert.py", line 123, in <module>
args_defaults={'tokenizer_type': 'BertWordPieceLowerCase'})
File "/home/DeepSpeedExamples-master/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 88, in pretrain
train_valid_test_dataset_provider)
File "/home/DeepSpeedExamples-master/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 619, in build_train_valid_test_data_iterators
global_batch_size = args.batch_size * data_parallel_size * args.gas
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
My Python version is 3.7.10, and my torch version is 1.8.1+cu101.
Hi,
I have successfully run the DCGAN training code using DeepSpeed, on the CelebA dataset.
But the problem is that when I run the baseline code, the training wall-clock time is the same as with the DeepSpeed-enabled code (137s). So I don't know whether I am missing something.
My system information is:
OS: Ubuntu 14.04
CUDA Toolkit 10.1.243
GPU: Single GPU - NVIDIA TiTanX
Pytorch version is 1.4.0, and also tested on 1.7.1
Thank you.
It seems to me that DeepSpeedExamples/Megatron-LM/pretrain_bert.py (lines 221 to 222 in fa1d1a7) does not take ignore_index=-1 into account, while the bing-bert example does:
As the title says, I want to enable activation checkpointing.
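For context (so the question is concrete): in the Megatron-style scripts here it is driven by the --checkpoint-activations / --deepspeed-activation-checkpointing flags plus an activation_checkpointing section in the DeepSpeed config. A sketch of that section, written as a Python dict with the same keys the checkpointing log lines earlier on this page print; values are illustrative:

# Sketch of the activation_checkpointing block of a DeepSpeed config (keys mirror the
# _configure_using_config_file log lines above; values are illustrative, not from the repo).
activation_checkpointing_config = {
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": False,
        "cpu_checkpointing": False,
        "number_checkpoints": None,
        "synchronize_checkpoint_boundary": False,
        "profile": False,
    }
}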
I try to run the BERT with pipeline parallelism, but I get an error:
File "DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/pretrain_bert.py", line 146, in
args_defaults={'tokenizer_type': 'BertWordPieceLowerCase'})
File "/DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 81, in pretrain
model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
File "/DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 252, in setup_model_and_optimizer
model.set_batch_fn(model.module._megatron_batch_fn)
File "/home/wwu/anaconda3/envs/sospx86/lib/python3.6/site-packages/torch/nn/modules/module.py", line 948, in getattr
type(self).name, name))
AttributeError: 'DeepSpeedEngine' object has no attribute 'set_batch_fn'
I dig into the code a little bit, it seems like the pipeline parallelism is not implemented for BERT.
When I use a Chinese corpus, I get an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 11: ordinal not in range(128)
but I can't find the bug.
Does anybody know why?
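That error usually means some file is being read with the default ASCII codec; a minimal illustration of the usual fix, with a placeholder file name, is to pass an explicit encoding wherever the corpus is opened:

# Minimal illustration: open the corpus with an explicit UTF-8 encoding instead of the
# locale default, so Chinese text decodes correctly. "corpus_zh.json" is a placeholder.
with open("corpus_zh.json", "r", encoding="utf-8") as f:
    for line in f:
        text = line.strip()   # decoded as UTF-8; byte 0xe7 is a lead byte of a multi-byte Chinese character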
Machine translation usually takes dynamically sized batches composed of X tokens, rather than X sentences, as training input. I'm wondering why DeepSpeed requires specifying train_batch_size and train_micro_batch_size_per_gpu, both of which refer to the number of samples. Is this a matter of implementation details? Or is it possible to support dynamic batch sizes, as in machine translation, without extra cost in efficiency and memory usage?
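For context, my understanding of the docs is that the three batch-size knobs are tied by a fixed relation that DeepSpeed checks at startup; a tiny sketch of that relation (the numbers are made up):

# Relation DeepSpeed enforces between the batch-size settings (my reading of the docs, not repo code):
# train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * number_of_gpus
train_micro_batch_size_per_gpu = 4
gradient_accumulation_steps = 8
number_of_gpus = 8
train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * number_of_gpus
assert train_batch_size == 256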
Situation: with different "train_micro_batch_size_per_gpu" values in deepspeed_bsz32k_lamb_config_seq512.json (and deepspeed_bsz64k_lamb_config_seq128.json if validation is enabled for seq128), the validation losses differ by as much as 0.8.
I created a branch including the test code I used: https://github.com/microsoft/DeepSpeedExamples/tree/conglli/validation_investigation. In deepspeed_train.py, I added another validation loss calculation ("Validation Loss split") that splits the micro batch into single items, so that we can compare the validation loss when using the defined micro batch size against the validation loss when using micro batch size 1. This also excludes any potential input data effect, since both calculations use the same validation data. With the test code, I tested seq512 with the DeepSpeed kernel and got different losses from the two calculations:
I also added a validation loss calculation for seq128 in the branch above. I then tested and got different validation losses for seq128 both with and without the DeepSpeed kernel (to test seq128 with the DeepSpeed kernel, you need to use this DeepSpeed branch: https://github.com/microsoft/DeepSpeed/tree/reyazda/support_dynamic_seqlength).
I had an internal discussion with Minjia, Samyam, Tunji, and Jeff on Fri Aug 28th, but we didn't reach a conclusion. It has not yet been verified whether this is a bug or correct behavior.
Hi there,
I would like to train a 20B model. To initialize the model for training, I noticed the context manager deepspeed.zero.Init()
described here: https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models .
However, injecting this context manager into the model initialization here, https://github.com/microsoft/DeepSpeedExamples/blob/master/bing_bert/turing/models.py#L123,
causes the training script to be killed at an early stage.
Here is part of the log. The total number of layers is 254, and the training script fails to initialize all of them.
I have monitored the CPU memory consumption, which is much less than the capacity.
10.0.36.222: layer #252 is created with date type [half].
10.0.60.215: Traceback (most recent call last):
10.0.60.215: File "/home/ec2-user/DeepSpeedExamples/bing_bert/zero_opt_experiments/scripts/../../deepspeed_train.py", line 607, in <module>
10.0.60.215: main()
10.0.60.215: File "/home/ec2-user/DeepSpeedExamples/bing_bert/zero_opt_experiments/scripts/../../deepspeed_train.py", line 596, in main
10.0.60.215: model, optimizer = prepare_model_optimizer(args)
10.0.60.215: File "/home/ec2-user/DeepSpeedExamples/bing_bert/zero_opt_experiments/scripts/../../deepspeed_train.py", line 469, in prepare_model_optimizer
10.0.60.215: model = BertMultiTask(args)
10.0.60.215: File "/home/ec2-user/DeepSpeedExamples/bing_bert/turing/models.py", line 125, in __init__
10.0.60.215: self.network = BertForPreTrainingPreLN(bert_config, args)
10.0.60.215: File "/home/ec2-user/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 266, in wrapper
10.0.60.215: f(module, *args, **kwargs)
10.0.60.215: File "/home/ec2-user/DeepSpeedExamples/bing_bert/nvidia/modelingpreln.py", line 1119, in __init__
10.0.60.215: config, self.bert.embeddings.word_embeddings.weight)
10.0.60.215: File "/home/ec2-user/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 266, in wrapper
10.0.60.215: f(module, *args, **kwargs)
10.0.60.215: File "/home/ec2-user/DeepSpeedExamples/bing_bert/nvidia/modelingpreln.py", line 760, in __init__
10.0.60.215: bert_model_embedding_weights)
10.0.60.215: File "/home/ec2-user/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 266, in wrapper
10.0.60.215: f(module, *args, **kwargs)
10.0.60.215: File "/home/ec2-user/DeepSpeedExamples/bing_bert/nvidia/modelingpreln.py", line 712, in __init__
10.0.60.215: self.decoder = nn.Linear(bert_model_embedding_weights.size(1),
10.0.60.215: IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
10.0.60.215: DeepSpeed Transformer config is {'layer_id': 42, 'batch_size': 32, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 40, 'attn_dropout_ratio': 0.1, 'hidden_dropout_ratio': 0.1, 'num_hidden_layers': 254, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': 3, 'seed': 42, 'normalize_invertible': False, 'gelu_checkpoint': True, 'adjust_init_range': True, 'test_gemm': False, 'layer_norm_eps': 1e-12, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': True, 'stochastic_mode': False, 'huggingface': False}
.....
10.0.60.215: layer #42 is created with date type [half].
10.0.60.215: Killing subprocess 11407
10.0.60.215: Killing subprocess 11408
10.0.60.215: Killing subprocess 11409
10.0.60.215: Killing subprocess 11410
10.0.60.215: Killing subprocess 11411
10.0.60.215: Killing subprocess 11413
10.0.60.215: Killing subprocess 11414
10.0.60.215: Killing subprocess 11417
10.0.60.215: Traceback (most recent call last):
10.0.60.215: File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
10.0.60.215: "__main__", mod_spec)
10.0.60.215: File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
10.0.60.215: exec(code, run_globals)
10.0.60.215: File "/home/ec2-user/DeepSpeed/deepspeed/launcher/launch.py", line 183, in <module>
10.0.60.215: main()
10.0.60.215: File "/home/ec2-user/DeepSpeed/deepspeed/launcher/launch.py", line 173, in main
10.0.60.215: sigkill_handler(signal.SIGTERM, None) # not coming back
10.0.60.215: File "/home/ec2-user/DeepSpeed/deepspeed/launcher/launch.py", line 151, in sigkill_handler
10.0.60.215: raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
10.0.60.215: subprocess.CalledProcessError: Command '['/bin/python3', '-u', '/home/ec2-user/DeepSpeedExamples/bing_bert/zero_opt_experiments/scripts/../../deepspeed_train.py', '--local_rank=7', '--max_seq_length', '512', '--print_steps', '10', '--deepspeed', '--data_path_prefix', '/home/ec2-user/small-data', '--use_nvidia_dataset', '--rewarmup', '--lr_schedule', 'EE', '--attention_dropout_checkpoint', '--lr_offset', '0.0', '--gelu_checkpoint', '--deepspeed_transformer_kernel', '--max_steps', '5', '--ckpt_to_save', '200', '--output_dir', '/home/ec2-user/DeepSpeedExamples/bing_bert/zero_opt_experiments/scripts/../outputs/zero3_2node_2021-09-14_01:19:59/', '--cf', '/home/ec2-user/DeepSpeedExamples/bing_bert/zero_opt_experiments/scripts/../configs/zero3_2nodes_profile.json', '--deepspeed_config', '/home/ec2-user/DeepSpeedExamples/bing_bert/zero_opt_experiments/scripts/../configs/zero3_2nodes_profile.json', '--job_name', 'zero3_2node_2021-09-14_01:19:59']' returned non-zero exit status 1.
10.0.36.222: DeepSpeed Transformer config is {'layer_id': 58, 'batch_size': 32, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 40, 'attn_dropout_ratio': 0.1, 'hidden_dropout_ratio': 0.1, 'num_hidden_layers': 254, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': 6, 'seed': 42, 'normalize_invertible': False, 'gelu_checkpoint': True, 'adjust_init_range': True, 'test_gemm': False, 'layer_norm_eps': 1e-12, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': True, 'stochastic_mode': False, 'huggingface': False}
10.0.36.222: layer #58 is created with date type [half].
I have seen two relevant issues in the DeepSpeed repo:
microsoft/DeepSpeed#907
microsoft/DeepSpeed#1041
I think this is more likely an issue with the bing_bert implementation than with DeepSpeed itself, so I brought it to this repo.
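One hedged explanation of the IndexError in the traceback (this is a hypothesis, not a confirmed fix): under deepspeed.zero.Init(), parameters are partitioned as soon as they are created, so on most ranks a weight such as bert.embeddings.word_embeddings.weight no longer carries its full 2-D shape, and the bert_model_embedding_weights.size(1) call in modelingpreln.py fails. Gathering the parameter before reading its dimensions avoids that. The sketch below assumes a process launched with the deepspeed launcher and a ZeRO-3 config; the dimensions are illustrative.

```python
# Hedged sketch: reading the shape of a weight created inside deepspeed.zero.Init().
import torch.nn as nn
import deepspeed

with deepspeed.zero.Init():
    embeddings = nn.Embedding(30528, 2560)

# Under ZeRO-3 the local tensor is partitioned, so embeddings.weight.size(1)
# can raise IndexError on ranks that hold only a flattened shard. Temporarily
# gathering the parameter restores its full shape for the duration of the block.
with deepspeed.zero.GatheredParameters(embeddings.weight, modifier_rank=None):
    hidden_size = embeddings.weight.size(1)  # 2560 inside the gathered context
```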
When I run the transformers example, this error occurs:
nvcc fatal : Unsupported gpu architecture 'compute_86'
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1533, in _run_ninja_build
subprocess.run(
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "./pretrain_bert_with_trainer.py", line 72, in
main()
File "./pretrain_bert_with_trainer.py", line 70, in main
Pretrain()
File "./pretrain_bert_with_trainer.py", line 67, in Pretrain
trainer.train()
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/transformers-4.4.2-py3.8.egg/transformers/trainer.py", line 903, in train
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/transformers-4.4.2-py3.8.egg/transformers/integrations.py", line 414, in init_deepspeed
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/deepspeed/init.py", line 116, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 186, in init
self._configure_optimizer(optimizer, model_parameters)
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 604, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 676, in _configure_basic_optimizer
optimizer = FusedAdam(model_parameters,
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 72, in init
fused_adam_cuda = FusedAdamBuilder().load()
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 215, in load
return self.jit_load(verbose)
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 243, in jit_load
op_module = load(
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 986, in load
return _jit_compile(
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1193, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1297, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/huazixu/anaconda3/envs/graph/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
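For what it's worth, the "nvcc fatal : Unsupported gpu architecture 'compute_86'" line usually means the CUDA toolkit on the machine is older than 11.1 and cannot compile for an Ampere (sm_86) GPU. Upgrading the toolkit is the proper fix; below is a hedged stopgap sketch (the chosen architecture value is an assumption about your setup) that pins the build target to one the toolkit supports before DeepSpeed JIT-compiles fused_adam. Binaries built for sm_80 still run on sm_86 cards.

```python
# Hedged workaround sketch: pin TORCH_CUDA_ARCH_LIST before DeepSpeed builds its ops.
import os
import torch

major, minor = torch.cuda.get_device_capability()
print(f"GPU compute capability: {major}.{minor}")

# Must be set before the fused_adam extension is JIT-compiled.
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"
```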
Following the BingBertSQuAD fine-tuning tutorial, I want to test the baseline with the Hugging Face BERT. My script is:
#!/bin/bash
#1: number of GPUs
#2: Model File Address
#3: BertSquad Data Directory Address
#4: Output Directory Address
NGPU_PER_NODE=$1
MODEL_FILE=$2
SQUAD_DIR=$3
OUTPUT_DIR=$4
NUM_NODES=1
NGPU=$((NGPU_PER_NODE*NUM_NODES))
EFFECTIVE_BATCH_SIZE=24
MAX_GPU_BATCH_SIZE=6
PER_GPU_BATCH_SIZE=$((EFFECTIVE_BATCH_SIZE/NGPU))
if [[ $PER_GPU_BATCH_SIZE -lt $MAX_GPU_BATCH_SIZE ]]; then
GRAD_ACCUM_STEPS=1
else
GRAD_ACCUM_STEPS=$((PER_GPU_BATCH_SIZE/MAX_GPU_BATCH_SIZE))
fi
LR=3e-5
MASTER_PORT=$((NGPU+12345))
JOB_NAME="baseline_${NGPU}GPUs_${EFFECTIVE_BATCH_SIZE}batch_size"
run_cmd="deepspeed --num_nodes ${NUM_NODES} --num_gpus ${NGPU_PER_NODE} \
nvidia_run_squad_baseline.py \
--bert_model bert-large-uncased \
--do_train \
--do_lower_case \
--do_predict \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--train_batch_size $PER_GPU_BATCH_SIZE \
--learning_rate ${LR} \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir $OUTPUT_DIR \
--job_name ${JOB_NAME} \
--gradient_accumulation_steps ${GRAD_ACCUM_STEPS} \
--fp16 \
--model_file $MODEL_FILE \
--ckpt_type HF \
--origin_bert_config_file ./pre-trained-model/hugging-face/bert-large-uncased-whole-word-masking-config.json
"
echo ${run_cmd}
eval ${run_cmd}
When I run the script with this command:
bash run_squad_baseline_hf.sh 4 pre-trained-model/hugging-face/bert-large-uncased-whole-word-masking-pytorch_model.bin data/SQuAD/ ./tmp
this error occurs:
ValueError: Output directory () already exists and is not empty.
I did check that ./tmp is empty before running the command, so how can I solve this problem?
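One hedged observation: the error prints "Output directory ()" with an empty name, which suggests --output_dir never reached the argument parser as expected (possibly a quoting or expansion issue in the eval'd run_cmd). Below is a small sketch of the kind of guard such scripts use, with a debug print to narrow this down; the function name and overwrite flag are assumptions, not the exact code in nvidia_run_squad_baseline.py.

```python
# Hedged sketch of a typical output-directory guard with a debug print.
import os

def check_output_dir(output_dir: str, overwrite: bool = False) -> None:
    print(f"output_dir resolved to: {output_dir!r}")  # an empty string means the flag was lost
    if os.path.exists(output_dir) and os.listdir(output_dir) and not overwrite:
        raise ValueError(
            f"Output directory ({output_dir}) already exists and is not empty."
        )
    os.makedirs(output_dir, exist_ok=True)
```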
Hi guys,
I have been trying to run the Bing BERT experiment, but so far I can't.
"datasets": {
--
| "wiki_pretrain_dataset": "/data/bert/bnorick_format/128/wiki_pretrain",
| "bc_pretrain_dataset": "/data/bert/bnorick_format/128/bookcorpus_pretrain"
| },
These datasets appear to be missing, so I cannot fully validate the code.
I tried to replace the Megatron BERT model with a Hugging Face BERT model in the model_provider function.
However, the program cannot get past this assertion at deepspeed/runtime/zero/stage3.py:1896:
assert self.params_already_reduced[param_id] == False, \
    f"The parameter {param_id} has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported"
It fails with:
AssertionError: The parameter 102 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported
Parameter 102 is the embedding weights. Any suggestions?
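One hedged hypothesis (not confirmed): Hugging Face BERT ties the input embedding weight to the MLM decoder weight, so the same parameter belongs to two modules and receives gradient from two places in one backward pass, which can trip ZeRO-3's once-per-parameter reduction hooks. A minimal illustration of the tying pattern follows; the module names are stand-ins, not the Hugging Face code. Untying the weights as a test would help confirm or rule this out.

```python
# Hedged illustration of weight tying: the decoder shares storage with the
# embedding table, so a single Parameter gets gradient from two modules.
import torch.nn as nn

class TiedLMHead(nn.Module):
    def __init__(self, embedding: nn.Embedding):
        super().__init__()
        self.decoder = nn.Linear(
            embedding.embedding_dim, embedding.num_embeddings, bias=False
        )
        self.decoder.weight = embedding.weight  # tied: same Parameter object
```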
Hi, I've been running the demos in the Megatron-LM-v1.1.5-ZeRO3 folder and found some API breakage in /Megatron-LM-v1.1.5-ZeRO3/megatron/training.py:
line 327: see_memory_usage(f'before forward {model.global_steps}', force=True)
line 333: see_memory_usage(f'before backward {model.global_steps}', force=True)
line 340: see_memory_usage(f'before optimizer {model.global_steps}', force=True)
While running pretrain_bert.py, errors emerged saying that model has no attribute global_steps:
AttributeError: 'DistributedDataParallel' object has no attribute 'global_steps'
Therefore, I had to comment out these three lines.
line 330: loss, loss_reduced = forward_step_func(data_iterator, model, args.curriculum_learning)
While running this line, it said that forward_step() only takes two parameters:
TypeError: forward_step() takes 2 positional arguments but 3 were given
I checked the source code of pretrain_bert.py and found:
def forward_step(data_iterator, model):
So I removed args.curriculum_learning from the call, and it works.
I guess an upgrade of Megatron-LM or DeepSpeed caused this API breakage; please fix it. Thanks a lot!
I ran the same command as in README.md:
python pretrain_bert.py \
$BERT_ARGS \
$OUTPUT_ARGS \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH
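Here is a hedged sketch of how the two breakages above could be patched locally instead of commenting the lines out (names taken from the issue; this is not the upstream fix):

```python
# Hedged sketch: tolerate a model wrapper without global_steps, and accept the
# extra curriculum_learning argument that training.py passes positionally.
from deepspeed.runtime.utils import see_memory_usage

def log_memory(tag, model):
    # DeepSpeed engines expose global_steps; DistributedDataParallel does not.
    step = getattr(model, "global_steps", "n/a")
    see_memory_usage(f"{tag} {step}", force=True)

def forward_step(data_iterator, model, curriculum_learning=False):
    # Accept (and for now ignore) the third positional argument; the original
    # two-argument body of pretrain_bert.py would go here unchanged.
    ...
```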
Following the bing_bert tutorial, my deepspeed_config is:
{
"train_batch_size": 4096,
"train_micro_batch_size_per_gpu": 32,
"steps_per_print": 1000,
"prescale_gradients": false,
"optimizer": {
"type": "Adam",
"params": {
"lr": 6e-3,
"betas": [
0.9,
0.99
],
"eps": 1e-8,
"weight_decay": 0.01
}
},
"zero_optimization": {
"stage": 1,
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": false,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"contiguous_gradients": true,
"grad_hooks": true,
"round_robin_gradients": false
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 1e-8,
"warmup_max_lr": 6e-3
}
},
"gradient_clipping": 1.0,
"wall_clock_breakdown": false,
"fp16": {
"enabled": true,
"loss_scale": 0
},
"sparse_attention": {
"mode": "fixed",
"block": 16,
"different_layout_per_head": true,
"num_local_blocks": 4,
"num_global_blocks": 1,
"attention": "bidirectional",
"horizontal_global_attention": false,
"num_different_global_patterns": 4
}
}
CUDA memory usage for ZeRO stage 1 is 8900 MB per GPU.
CUDA memory usage for ZeRO stage 2 is 9600 MB per GPU.
And ZeRO-2 is much slower than ZeRO-1 in training speed.
Any help would be appreciated.
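For comparison, here is a hedged sketch (the values are assumptions to experiment with, not recommendations from the tutorial) of the stage-2 settings most often involved in this kind of gap: with overlap_comm disabled and 5e8 buckets, stage 2 pays for extra reduce-scatter communication and large transient buffers that stage 1 does not, which can show up as both higher memory and lower throughput on a model of this size.

```python
# Hedged sketch of stage-2 overrides worth experimenting with.
import json

zero2_overrides = {
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,        # overlap gradient reduce-scatter with backward
        "reduce_bucket_size": 2e8,   # smaller buckets -> smaller transient buffers
        "allgather_bucket_size": 2e8,
        "contiguous_gradients": True,
    }
}
print(json.dumps(zero2_overrides, indent=2))
```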
I am trying to follow the example here
https://www.deepspeed.ai/tutorials/bert-pretraining/
The section on getting the datasets says 'Note: Downloading and pre-processing instructions are coming soon.'.
I tried googling, but those datasets seem tricky to find. And even then, I'm not sure whether they would be the correct versions to use with the script.
Issue reported in DeepSpeed repo: microsoft/DeepSpeed#426
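In the meantime, here is a hedged sketch of one way to pull a comparable raw corpus with the Hugging Face datasets library. These are not necessarily the exact dump versions or preprocessing that the bing_bert scripts assume, so treat them only as a starting point.

```python
# Hedged sketch: fetch Wikipedia and BookCorpus text as raw pretraining data.
from datasets import load_dataset

wiki = load_dataset("wikipedia", "20220301.en", split="train")
books = load_dataset("bookcorpus", split="train")

print(wiki[0]["text"][:200])
print(books[0]["text"][:200])
```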