mlcommons / training_results_v1.1

This repository contains the results and code for the MLPerf™ Training v1.1 benchmark.

Home Page: https://mlcommons.org/en/training-normal-11/

License: Apache License 2.0

Dockerfile 0.25% Shell 4.57% Python 60.94% Cuda 2.64% Jupyter Notebook 24.04% C++ 6.13% C 0.07% Awk 0.06% Starlark 0.60% Makefile 0.04% TypeScript 0.50% HTML 0.09% CSS 0.01% JavaScript 0.06% Visual Basic 6.0 0.02%

training_results_v1.1's Introduction

The MLPerf™ Training v1.1 results.

Additionally, each organization has written approximately 300 words to help explain its submissions in the Supplemental discussion document.

training_results_v1.1's Issues

Missing files

Some code files are empty, for example:
https://github.com/mlcommons/training_results_v1.1/blob/main/Azure/benchmarks/bert/implementations/pytorch/model/layers/attention.py
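
To see how widespread the problem is, a local checkout can be scanned for zero-byte source files. The following is a minimal sketch, not part of the repository; the function name `find_empty_files` and its parameters are hypothetical, and it skips `__init__.py` on the assumption that those files are often empty on purpose.

```python
from pathlib import Path


def find_empty_files(root, suffixes=(".py",), skip_names=("__init__.py",)):
    """Return sorted paths under root that are zero bytes long.

    Only files whose suffix is in `suffixes` are considered, and file
    names listed in `skip_names` are ignored.
    """
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file()
        and path.suffix in suffixes
        and path.name not in skip_names
        and path.stat().st_size == 0
    )
```

Calling `find_empty_files("training_results_v1.1")` from the directory that contains the clone would list every affected Python file, which makes it easier to report them in one go.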

FMHA error when reproducing DELL BERT benchmark

We followed the Dell example to reproduce the BERT training benchmark on a server with 2 GPUs. Running the model's encoder layer fails inside the fmhalib.fwd function with: Expected dprops->major == 8 && dprops->minor == 0 to be true, but got false.

The error happens in the last line:

import torch
import fmhalib as mha

class FMHAFun(torch.autograd.Function):
    @staticmethod
    def forward(ctx, qkv, cu_seqlens, p_dropout, max_s, is_training):
        b = cu_seqlens.numel() - 1

        if b < 4:
            max_s = 512
            context, S_dmask = mha.fwd_nl(qkv, cu_seqlens, p_dropout, max_s, is_training, None)
        else:
            context, S_dmask = mha.fwd(qkv, cu_seqlens, p_dropout, max_s, is_training, None)

It seems to be related to the error mentioned here, but I am not sure how to apply their fix (unpadding the qkv).
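
The assertion text reads like a compute-capability check: `dprops->major == 8 && dprops->minor == 0` corresponds to capability 8.0, whereas NVIDIA's published tables list the RTX A5000 as 8.6. The sketch below mirrors that condition so it can be tested before launching a run. The helper names are hypothetical, and the claim that the check is exactly "8.0 only" is inferred from the error message rather than from the fmhalib source.

```python
def fmha_supported(major, minor):
    """Mirror the failing condition: dprops->major == 8 && dprops->minor == 0."""
    return major == 8 and minor == 0


def visible_gpu_capabilities():
    """Return [(index, name, (major, minor)), ...] for visible CUDA devices.

    Returns an empty list when PyTorch is not installed or CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        (i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
        for i in range(torch.cuda.device_count())
    ]


def report():
    """Print one line per GPU saying whether it would pass the check."""
    for index, name, (major, minor) in visible_gpu_capabilities():
        status = "ok" if fmha_supported(major, minor) else "would fail the check"
        print(f"GPU {index}: {name} (sm_{major}{minor}) {status}")
```

Running `report()` inside the container shows, per device, whether the condition in the error message would hold.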

System Used

CPU

CPU(s):                          48
On-line CPU(s) list:             0-47
Thread(s) per core:              2
Core(s) per socket:              24
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz
NUMA node0 CPU(s):               0-47

GPU

Driver Version: 510.47.03
CUDA Version: 11.6
NVIDIA RTX A5000 x2

System:

PyTorch v1.10.1

Error Reproduction

To reproduce the error, we used the following settings. We created two config files (config_SUT.sh, config_SUT_common.sh) and ran the code interactively within a docker container.

Configs in config_SUT.sh

## DL params
export BATCHSIZE=64
export GRADIENT_STEPS=1
export LR=3.5e-4
export MAX_SAMPLES_TERMINATION=4500000
export MAX_STEPS=7100
export OPT_LAMB_BETA_1=0.9
export OPT_LAMB_BETA_2=0.999
export START_WARMUP_STEP=0
export WARMUP_PROPORTION=0.0
export EXTRA_PARAMS="--dense_seq_output --unpad --unpad_fmha --exchange_padding"
export PHASE=2
export EVAL_ITER_START_SAMPLES=150000
export EVAL_ITER_SAMPLES=150000

## System run params
export DGXNNODES=1
export DGXSYSTEM=$(basename $(readlink -f ${BASH_SOURCE[0]}) | sed 's/^config_//' | sed 's/\.sh$//' )
export WALLTIME=01:15:00

## System config params
source config_SUT_common.sh
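
Because the failure is raised from fmhalib, one experiment is to rerun with `--unpad_fmha` removed from EXTRA_PARAMS. This rests on an unconfirmed assumption: that this flag alone selects the fused multi-head attention path and that the remaining flags still form a valid combination. The helper below (`drop_flags` is a hypothetical name) only edits the string; it does not validate the result against run_pretraining.py.

```python
import shlex


def drop_flags(extra_params, flags_to_drop):
    """Return extra_params with the listed flags removed, order preserved."""
    kept = [tok for tok in shlex.split(extra_params) if tok not in flags_to_drop]
    return shlex.join(kept)


# Value copied from config_SUT.sh above.
EXTRA_PARAMS = "--dense_seq_output --unpad --unpad_fmha --exchange_padding"
without_fmha = drop_flags(EXTRA_PARAMS, {"--unpad_fmha"})
```

The resulting string can be pasted back into the `export EXTRA_PARAMS=...` line for a trial run.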

Configs in config_SUT_common.sh

## System config params
export DGXNGPU=2
export DGXSOCKETCORES=24
export DGXNSOCKET=1
export DGXHT=2
export SLURM_NTASKS=${DGXNGPU}
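
The topology values above line up with the lscpu output in the System Used section (24 cores per socket, 1 socket, 2 threads per core). A small sketch of deriving them automatically follows. The mapping from lscpu fields to DGX* variables is an assumption based on the variable names, and `parse_lscpu` and `dgx_topology` are hypothetical helpers.

```python
def parse_lscpu(text):
    """Parse 'Key: value' lines of lscpu output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def dgx_topology(fields, ngpu):
    """Map lscpu fields to the variables set in config_SUT_common.sh."""
    return {
        "DGXNGPU": ngpu,
        "DGXSOCKETCORES": int(fields["Core(s) per socket"]),
        "DGXNSOCKET": int(fields["Socket(s)"]),
        "DGXHT": int(fields["Thread(s) per core"]),
    }


# Lines copied from the lscpu output reported in this issue.
LSCPU_SAMPLE = """\
CPU(s):                          48
Thread(s) per core:              2
Core(s) per socket:              24
Socket(s):                       1
"""

fields = parse_lscpu(LSCPU_SAMPLE)
topology = dgx_topology(fields, ngpu=2)
```

On a live system, the sample text could be replaced with the output of `subprocess.run(["lscpu"], capture_output=True, text=True).stdout`.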

After creating the docker image mlperf-nvidia:language_model, enter the docker container with the following command:

nvidia-docker run -it --privileged --network host \
--ipc=host -v /data/bert/phase1:/workspace/phase1 \
-v /data/bert/hdf5/training-4320/hdf5_4320_shards_varlength:/workspace/data_phase2 \
--name language_model mlperf-nvidia:language_model

Running the program:

export CUDA_VISIBLE_DEVICES=0,1
export NEXP=1
source config_SUT.sh
./run_and_time.sh

Error Log:

##binding cmd: ['/usr/bin/numactl', '--physcpubind=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46', 
'--membind=0', '/opt/conda/bin/python', '-u', '/workspace/bert/run_pretraining.py', '--local_rank=0', 
'--train_batch_size=64', '--learning_rate=3.5e-4', '--opt_lamb_beta_1=0.9', '--opt_lamb_beta_2=0.999',
'--warmup_proportion=0.0', '--warmup_steps=0.0', '--start_warmup_step=0', '--max_steps=7100', '--phase2',
'--max_seq_length=512', '--max_predictions_per_seq=76', '--input_dir=/workspace/data_phase2',
'--init_checkpoint=/workspace/phase1/model.ckpt-28252.pt', '--do_train', '--skip_checkpoint',
'--train_mlm_accuracy_window_size=0', '--target_mlm_accuracy=0.720', '--weight_decay_rate=0.01',
'--max_samples_termination=4500000', '--eval_iter_start_samples=150000', '--eval_iter_samples=150000',
'--eval_batch_size=16', '--eval_dir=/workspace/evaldata', '--num_eval_examples', '10000', 
'--cache_eval_data','--output_dir=/results', '--fp16', '--fused_bias_fc', '--fused_bias_mha', 
'--fused_dropout_add', '--distributed_lamb','--dwu-num-rs-pg=1', '--dwu-num-ar-pg=1', '--dwu-num-ag-pg=1',
'--dwu-num-blocks=1', '--gradient_accumulation_steps=1', '--log_freq=0', 
'--bert_config_path=/workspace/phase1/bert_config.json', '--dense_seq_output', '--unpad', '--unpad_fmha',
'--exchange_padding', '--allreduce_post_accumulation', '--allreduce_post_accumulation_fp16', '--seed=15572']
##local_rank: 0
...
Traceback (most recent call last):
  File "/workspace/bert/run_pretraining.py", line 1744, in <module>
    args, final_loss, train_time_raw = main()
  File "/workspace/bert/run_pretraining.py", line 1237, in main
    model = fwd_loss_bwd_trainer.capture_bert_model_segment_graph(model, use_cuda_graph)
  File "/workspace/bert/fwd_loss_bwd_trainer.py", line 99, in capture_bert_model_segment_graph
    bert_model_segment = graph(bert_model_segment,
  File "/workspace/bert/function.py", line 73, in graph
    outputs  = func_or_module(*sample_args)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 1095, in forward
    sequence_output, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, position_ids,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 987, in forward
    encoded_layers = self.encoder(embedding_output,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 674, in forward
    hidden_states = layer_module(hidden_states, cu_seqlens, maxseqlen_in_batch)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 605, in forward
    attention_output = self.attention(hidden_states, attention_mask, seqlen, batch)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 494, in forward
    self_output = self.self(input_tensor, cu_seqlens, max_s, is_training=self.training)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/fmha.py", line 213, in forward
    ctx = FMHAFun.apply(qkv.contiguous().view(-1, 3, self.h, self.d), cu_seqlens, p_dropout, max_s, is_training)
  File "/workspace/bert/fmha.py", line 32, in forward
    context, S_dmask = mha.fwd(qkv, cu_seqlens, p_dropout, max_s, is_training, None)
RuntimeError: Expected dprops->major == 8 && dprops->minor == 0 to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
ENDING TIMING RUN AT 2022-04-20 10:46:14 AM
RESULT,bert,15572,13,,2022-04-20 10:46:01 AM

[NVIDIA/benchmarks/bert/implementations/pytorch] prepare_data.sh fails - issue with BertConfig when `convert_tf_checkpoint.py` is called

The prepare_data.sh script fails, producing the following error:

Traceback (most recent call last):                                                                                                                                                                                                             
  File "/workspace/bert/input_preprocessing/../convert_tf_checkpoint.py", line 86, in <module>                                                                                                                                                 
    main()                                                                                                                                                                                                                                     
  File "/workspace/bert/input_preprocessing/../convert_tf_checkpoint.py", line 80, in main                                                                                                                                                     
    model = prepare_model(args, device)                                                                                                                                                                                                        
  File "/workspace/bert/input_preprocessing/../convert_tf_checkpoint.py", line 72, in prepare_model                                                                                                                                            
    model = BertForPretraining.from_pretrained(args.tf_checkpoint, from_tf=True, config=config)                                                                                                                                                
  File "/workspace/bert/modeling.py", line 867, in from_pretrained                                                                                                                                                                             
    model = cls(config, *inputs, **kwargs)                                                                                                                                                                                                     
  File "/workspace/bert/modeling.py", line 1060, in __init__                                                                                                                                                                                   
    self.cls = BertPreTrainingHeads(config, self.bert.embeddings.word_embeddings.weight)                                                                                                                                                       
  File "/workspace/bert/modeling.py", line 791, in __init__                                                                                                                                                                                    
    self.predictions = BertLMPredictionHead(config, bert_model_embedding_weights)                                                                                                                                                              
  File "/workspace/bert/modeling.py", line 744, in __init__                                                                                                                                                                                    
    self.fused_fc = config.fused_bias_fc_loss_head                                                                                                                                                                                             
AttributeError: 'BertConfig' object has no attribute 'fused_bias_fc_loss_head'

It appears that either convert_tf_checkpoint.py builds the model without setting this config attribute, or the downloaded bert_config.json is missing a key/value pair (specifically, the fused_bias_fc_loss_head key/value).

Steps to Reproduce

Clone the repo, browse to NVIDIA/benchmarks/bert/implementations/pytorch and run the following:

docker build --pull -t nickfraser/mlperf-nvidia:language_model .
docker run --rm -it --runtime=nvidia --ipc=host -v /<location on host>/bert_data/:/workspace/bert_data nickfraser/mlperf-nvidia:language_model
./input_preprocessing/prepare_data.sh --outputdir /workspace/bert_data

This eventually leads to the error in the last command of the prepare_data.sh script. Note that the md5sums of bert_config.json, vocab.txt, model.ckpt-28252.data-00000-of-00001, model.ckpt-28252.index and model.ckpt-28252.meta match the expected values. I also added set -e at the top of the prepare_data.sh script to ensure that no earlier command failed.

Since bert_config.json matches the expected md5sum, I expect that the issue is with the convert_tf_checkpoint.py script. Any help that can be provided is much appreciated.
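
One possible workaround, sketched here under stated assumptions, is to give the config object any attributes the model constructor reads before `BertForPretraining.from_pretrained` is called. The default value `False` is an assumption (it presumably selects the non-fused code path), and other attributes may turn out to be missing once this one is supplied. `apply_missing_defaults` is a hypothetical helper, and `SimpleNamespace` stands in for the repository's `BertConfig` so the sketch runs on its own.

```python
from types import SimpleNamespace


def apply_missing_defaults(config, defaults):
    """Set each default only if config lacks that attribute.

    Returns the list of attribute names that were added.
    """
    added = []
    for name, value in defaults.items():
        if not hasattr(config, name):
            setattr(config, name, value)
            added.append(name)
    return added


# Stand-in for BertConfig; hidden_size is an arbitrary placeholder field.
config = SimpleNamespace(hidden_size=1024)

# Assumed default: False, i.e. do not request the fused loss-head path.
added = apply_missing_defaults(config, {"fused_bias_fc_loss_head": False})
```

In convert_tf_checkpoint.py, the equivalent call would go between loading the config and constructing the model. Attributes that already exist are left untouched.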

Broken links to Habana Labs

In the official results dashboard, all links to code and systems for Habana Labs lead to https://github.com/mlcommons/training_results_v1.1/blob/master/HabanaLabs/ (broken), while they should lead to https://github.com/mlcommons/training_results_v1.1/blob/master/Intel-HabanaLabs/

/cc @bitfort

TensorFlow version used in NVIDIA docker image is not correct

Hello,
I am following the instructions in the NVIDIA folder and something seems to be wrong. According to the readme file, TF-1 is used, but when I run the docker commands, I get the following numpy error:

module 'numpy.random' has no attribute 'BitGenerator'

Upgrading to numpy 20 fixes that error, but numpy 20 is not compatible with that TF version.

Can someone confirm that? Maybe I have missed something.
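
To pin down the mismatch, it helps to record which versions are actually installed in the container and whether the attribute exists. The sketch below uses only the standard library plus an optional NumPy import. The helper names are hypothetical, and the package names passed in are assumptions about how the image installs them.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def has_bit_generator():
    """Return True if the installed NumPy exposes numpy.random.BitGenerator."""
    try:
        import numpy
    except ImportError:
        return False
    return hasattr(numpy.random, "BitGenerator")
```

Printing `installed_versions(["numpy", "tensorflow"])` together with `has_bit_generator()` from inside the image gives a compact report to attach to the issue.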
