The MLPerf™ Training v1.1 results.
Additionally, each organization has written approximately 300 words to help explain their submissions in the Supplemental discussion document.
This repository contains the results and code for the MLPerf™ Training v1.1 benchmark.
Home Page: https://mlcommons.org/en/training-normal-11/
License: Apache License 2.0
Some code files in the repository are empty, for example:
https://github.com/mlcommons/training_results_v1.1/blob/main/Azure/benchmarks/bert/implementations/pytorch/model/layers/attention.py
We tried to follow the Dell example to reproduce the BERT training benchmark on a server with 2 GPUs. We encountered an error when running the model encoder layer, and it is related to the fmhalib.fwd function: Expected dprops->major == 8 && dprops->minor == 0 to be true, but got false.
The error happens in the last line:
import torch
import fmhalib as mha

class FMHAFun(torch.autograd.Function):
    @staticmethod
    def forward(ctx, qkv, cu_seqlens, p_dropout, max_s, is_training):
        # Number of sequences in the batch (cu_seqlens holds cumulative lengths)
        b = cu_seqlens.numel() - 1
        if b < 4:
            max_s = 512
            context, S_dmask = mha.fwd_nl(qkv, cu_seqlens, p_dropout, max_s, is_training, None)
        else:
            # The RuntimeError is raised by this call
            context, S_dmask = mha.fwd(qkv, cu_seqlens, p_dropout, max_s, is_training, None)
It seems to be related to the error mentioned here, but I am not entirely sure about how to apply their fix (unpad the qkv).
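The assertion text suggests that the fmhalib kernels accept only devices with compute capability 8.0, while NVIDIA lists the RTX A5000 at 8.6. The following is a minimal sketch for checking this before launching a run. The helper names `fmha_supported` and `local_capability` are hypothetical, and the required capability is inferred only from the error message, not from fmhalib documentation.

```python
from typing import Optional, Tuple

# Assumption: this requirement is inferred only from the assertion text
# "dprops->major == 8 && dprops->minor == 0" in the error message.
REQUIRED_CAPABILITY = (8, 0)


def fmha_supported(capability: Tuple[int, int]) -> bool:
    """Return True if (major, minor) matches the capability the assertion expects."""
    return tuple(capability) == REQUIRED_CAPABILITY


def local_capability(device: int = 0) -> Optional[Tuple[int, int]]:
    """Query the compute capability through PyTorch, or None if it cannot be read."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(device)


if __name__ == "__main__":
    cap = local_capability()
    if cap is None:
        print("No CUDA device visible; cannot check.")
    else:
        print(f"capability={cap}, fmha_supported={fmha_supported(cap)}")
```

If the check fails, one option to try is running without --unpad_fmha in EXTRA_PARAMS; whether the remaining flags work unchanged on such a device is not confirmed here.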
CPU
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz
NUMA node0 CPU(s): 0-47
GPU
Driver Version: 510.47.03
CUDA Version: 11.6
NVIDIA RTX A5000 x2
System:
PyTorch v1.10.1
To reproduce the error, the following settings were used. We created two config files (config_SUT.sh, config_SUT_common.sh) and ran the code interactively within a docker container.
Configs in config_SUT.sh
## DL params
export BATCHSIZE=64
export GRADIENT_STEPS=1
export LR=3.5e-4
export MAX_SAMPLES_TERMINATION=4500000
export MAX_STEPS=7100
export OPT_LAMB_BETA_1=0.9
export OPT_LAMB_BETA_2=0.999
export START_WARMUP_STEP=0
export WARMUP_PROPORTION=0.0
export EXTRA_PARAMS="--dense_seq_output --unpad --unpad_fmha --exchange_padding"
export PHASE=2
export EVAL_ITER_START_SAMPLES=150000
export EVAL_ITER_SAMPLES=150000
## System run parms
export DGXNNODES=1
export DGXSYSTEM=$(basename $(readlink -f ${BASH_SOURCE[0]}) | sed 's/^config_//' | sed 's/\.sh$//' )
export WALLTIME=01:15:00
## System config params
source config_SUT_common.sh
Configs in config_SUT_common.sh
## System config params
export DGXNGPU=2
export DGXSOCKETCORES=24
export DGXNSOCKET=1
export DGXHT=2
export SLURM_NTASKS=${DGXNGPU}
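The CPU-related values in config_SUT_common.sh line up with the lscpu listing above (24 cores per socket, 1 socket, 2 threads per core). As a rough sketch, they could be derived from lscpu output as shown here. The mapping from lscpu fields to DGXSOCKETCORES, DGXNSOCKET and DGXHT is an assumption based on those matching values, and `parse_lscpu` and `dgx_cpu_settings` are hypothetical helper names.

```python
def parse_lscpu(text: str) -> dict:
    """Split 'Key: value' lines from lscpu output into a dictionary."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields


def dgx_cpu_settings(lscpu_text: str) -> dict:
    """Map lscpu fields to the DGX* variables (assumed mapping, see lead-in)."""
    fields = parse_lscpu(lscpu_text)
    return {
        "DGXSOCKETCORES": int(fields["Core(s) per socket"]),
        "DGXNSOCKET": int(fields["Socket(s)"]),
        "DGXHT": int(fields["Thread(s) per core"]),
    }


# Excerpt of the lscpu values reported in this issue.
SAMPLE = """\
CPU(s): 48
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
"""

if __name__ == "__main__":
    for name, value in dgx_cpu_settings(SAMPLE).items():
        print(f"export {name}={value}")
```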
After creating the docker image mlperf-nvidia:language_model, enter the docker container with the following command:
nvidia-docker run -it --privileged --network host \
--ipc=host -v /data/bert/phase1:/workspace/phase1 \
-v /data/bert/hdf5/training-4320/hdf5_4320_shards_varlength:/workspace/data_phase2 \
--name language_model mlperf-nvidia:language_model
Running the program:
export CUDA_VISIBLE_DEVICES=0,1
export NEXP=1
source config_SUT.sh
./run_and_time.sh
Error Log:
##binding cmd: ['/usr/bin/numactl', '--physcpubind=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46',
'--membind=0', '/opt/conda/bin/python', '-u', '/workspace/bert/run_pretraining.py', '--local_rank=0',
'--train_batch_size=64', '--learning_rate=3.5e-4', '--opt_lamb_beta_1=0.9', '--opt_lamb_beta_2=0.999',
'--warmup_proportion=0.0', '--warmup_steps=0.0', '--start_warmup_step=0', '--max_steps=7100', '--phase2',
'--max_seq_length=512', '--max_predictions_per_seq=76', '--input_dir=/workspace/data_phase2',
'--init_checkpoint=/workspace/phase1/model.ckpt-28252.pt', '--do_train', '--skip_checkpoint',
'--train_mlm_accuracy_window_size=0', '--target_mlm_accuracy=0.720', '--weight_decay_rate=0.01',
'--max_samples_termination=4500000', '--eval_iter_start_samples=150000', '--eval_iter_samples=150000',
'--eval_batch_size=16', '--eval_dir=/workspace/evaldata', '--num_eval_examples', '10000',
'--cache_eval_data','--output_dir=/results', '--fp16', '--fused_bias_fc', '--fused_bias_mha',
'--fused_dropout_add', '--distributed_lamb','--dwu-num-rs-pg=1', '--dwu-num-ar-pg=1', '--dwu-num-ag-pg=1',
'--dwu-num-blocks=1', '--gradient_accumulation_steps=1', '--log_freq=0',
'--bert_config_path=/workspace/phase1/bert_config.json', '--dense_seq_output', '--unpad', '--unpad_fmha',
'--exchange_padding', '--allreduce_post_accumulation', '--allreduce_post_accumulation_fp16', '--seed=15572']
##local_rank: 0
...
Traceback (most recent call last):
File "/workspace/bert/run_pretraining.py", line 1744, in <module>
args, final_loss, train_time_raw = main()
File "/workspace/bert/run_pretraining.py", line 1237, in main
model = fwd_loss_bwd_trainer.capture_bert_model_segment_graph(model, use_cuda_graph)
File "/workspace/bert/fwd_loss_bwd_trainer.py", line 99, in capture_bert_model_segment_graph
bert_model_segment = graph(bert_model_segment,
File "/workspace/bert/function.py", line 73, in graph
outputs = func_or_module(*sample_args)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/bert/modeling.py", line 1095, in forward
sequence_output, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, position_ids,
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/bert/modeling.py", line 987, in forward
encoded_layers = self.encoder(embedding_output,
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/bert/modeling.py", line 674, in forward
hidden_states = layer_module(hidden_states, cu_seqlens, maxseqlen_in_batch)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/bert/modeling.py", line 605, in forward
attention_output = self.attention(hidden_states, attention_mask, seqlen, batch)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/bert/modeling.py", line 494, in forward
self_output = self.self(input_tensor, cu_seqlens, max_s, is_training=self.training)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/bert/fmha.py", line 213, in forward
ctx = FMHAFun.apply(qkv.contiguous().view(-1, 3, self.h, self.d), cu_seqlens, p_dropout, max_s, is_training)
File "/workspace/bert/fmha.py", line 32, in forward
context, S_dmask = mha.fwd(qkv, cu_seqlens, p_dropout, max_s, is_training, None)
RuntimeError: Expected dprops->major == 8 && dprops->minor == 0 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
ENDING TIMING RUN AT 2022-04-20 10:46:14 AM
RESULT,bert,15572,13,,2022-04-20 10:46:01 AM
The prepare_data.sh script fails, producing the following error:
Traceback (most recent call last):
File "/workspace/bert/input_preprocessing/../convert_tf_checkpoint.py", line 86, in <module>
main()
File "/workspace/bert/input_preprocessing/../convert_tf_checkpoint.py", line 80, in main
model = prepare_model(args, device)
File "/workspace/bert/input_preprocessing/../convert_tf_checkpoint.py", line 72, in prepare_model
model = BertForPretraining.from_pretrained(args.tf_checkpoint, from_tf=True, config=config)
File "/workspace/bert/modeling.py", line 867, in from_pretrained
model = cls(config, *inputs, **kwargs)
File "/workspace/bert/modeling.py", line 1060, in __init__
self.cls = BertPreTrainingHeads(config, self.bert.embeddings.word_embeddings.weight)
File "/workspace/bert/modeling.py", line 791, in __init__
self.predictions = BertLMPredictionHead(config, bert_model_embedding_weights)
File "/workspace/bert/modeling.py", line 744, in __init__
self.fused_fc = config.fused_bias_fc_loss_head
AttributeError: 'BertConfig' object has no attribute 'fused_bias_fc_loss_head'
It appears that either convert_tf_checkpoint.py is incorrectly referencing this entry, or the downloaded bert_config.json is missing a key/value pair (specifically, the fused_bias_fc_loss_head key/value).
Clone the repo, browse to NVIDIA/benchmarks/bert/implementations/pytorch and run the following:
docker build --pull -t nickfraser/mlperf-nvidia:language_model .
docker run --rm -it --runtime=nvidia --ipc=host -v /<location on host>/bert_data/:/workspace/bert_data nickfraser/mlperf-nvidia:language_model
./input_preprocessing/prepare_data.sh --outputdir /workspace/bert_data
This eventually leads to the error in the last command of the prepare_data.sh script. Note that the md5sums of bert_config.json, vocab.txt, model.ckpt-28252.data-00000-of-00001, model.ckpt-28252.index, and model.ckpt-28252.meta match the expected values. Also, I added set -e at the top of the prepare_data.sh script to ensure no errors occurred in prior commands.
Since bert_config.json matches the expected md5sum, I expect that the issue is with the convert_tf_checkpoint.py script. Any help that can be provided is much appreciated.
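To tell the two possibilities apart, one could first check whether the key is present in the JSON file at all. The snippet below is a minimal sketch of such a check. `missing_keys` is a hypothetical helper, and the expected-key list contains only the one key named in the AttributeError; other keys the model code may require are not known from this report.

```python
import json
import tempfile
from pathlib import Path

# Only the key named in the AttributeError is known to be required here.
EXPECTED_KEYS = ("fused_bias_fc_loss_head",)


def missing_keys(config_path, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent from a JSON config file."""
    with open(config_path, encoding="utf-8") as fh:
        config = json.load(fh)
    return [key for key in expected if key not in config]


if __name__ == "__main__":
    # Demo on a throwaway file with a placeholder key; point the function at
    # the real bert_config.json to check an actual download.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "bert_config.json"
        path.write_text(json.dumps({"placeholder_key": 0}), encoding="utf-8")
        print(missing_keys(path))
```

If the key turns out to be absent from the file, that would point to the script or model code expecting a config that the download does not provide; whether adding the key by hand is a valid fix is not established here.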
In the official results dashboard, all links to code and systems for Habana Labs lead to https://github.com/mlcommons/training_results_v1.1/blob/master/HabanaLabs/ (broken), whereas they should lead to https://github.com/mlcommons/training_results_v1.1/blob/master/Intel-HabanaLabs/
/cc @bitfort
Hello,
I am following the NVIDIA folder and it seems that something is wrong. According to the README file, TF-1 is used, but when I run the docker commands, I get the following numpy error:
module 'numpy.random' has no attribute 'BitGenerator'
Upgrading to numpy 1.20 fixes that, but numpy 1.20 is not compatible with that TF version.
Can someone confirm that? Maybe I have missed something.
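A quick way to see whether the numpy inside the container exposes the attribute named in the error is sketched below. `has_bit_generator` is a hypothetical helper; it only tests for the attribute and says nothing about TensorFlow compatibility.

```python
import importlib


def has_bit_generator(random_module) -> bool:
    """Return True if the given module object exposes a BitGenerator attribute."""
    return hasattr(random_module, "BitGenerator")


if __name__ == "__main__":
    try:
        np = importlib.import_module("numpy")
    except ImportError:
        print("numpy is not installed")
    else:
        print(np.__version__, has_bit_generator(np.random))
```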