
bert_test_1n8g's Introduction

bert_test_1n8g

GPU: NVIDIA_GeForce_RTX_3080_Ti
Case: LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP2_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc8_1n8g
  • master@59b64db: building graph Done! Cost time: 19.14s. building plan Done! Cost time: 18.08s. node0: 6029–6738 MiB
  • rank_per_process@59b64db: building plan Done! Cost time: 17.91s. building graph Done! Cost time: 18.9s. node0: 6026–6728 MiB
  • naive@59b64db: building plan Done! Cost time: 18.14s. building graph Done! Cost time: 18.99s. node0: 6598–6728 MiB
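The case names used throughout this report encode the model and parallelism configuration. The decoding below is inferred from the args_train.sh parameter list later in this document, so treat it as an assumed key rather than an official one:

# Assumed decoding of a case name such as
# LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP2_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc8_1n8g
#   nl24             -> NUM_LAYER=24 (Transformer layers)
#   nah16            -> NUM_ATT_HEADS=16 (attention heads)
#   hs1024           -> HIDDEN_SIZE=1024
#   FP16             -> USE_FP16=true (mixed-precision training)
#   actrue           -> ACTIVATION_CHECKPOINT=true
#   DP2_MP2_PP2      -> data-, model- and pipeline-parallel degrees
#   zerofalse_stage0 -> ZERO_ENABLE=false, ZERO_STAGE=0
#   mbs32            -> MICRO_BATCH_SIZE=32
#   gbs512           -> GLOBAL_BATCH_SIZE=512
#   acc8             -> 8 gradient-accumulation steps
#   1n8g             -> 1 node x 8 GPUs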

  • Global loss curve comparison

LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP2_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc8_1n8g.png

  • 50-step loss curve comparison

LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP2_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc8_1n8g_50_220.png

  • 100-step loss curve comparison

LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP2_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc8_1n8g_100_220.png

bert_test_1n8g's People

Contributors

tendo33

Watchers

Kostas Georgiou

bert_test_1n8g's Issues

LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc4_2n8g

case2

GPU: NVIDIA_GeForce_RTX_3080_Ti
Case: LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc4_2n8g
  • master@b51cb72: building graph Done! Cost time: 21.37s. building plan Done! Cost time: 20.65s. node0: 6582–6728 MiB, node1: 5126 MiB [master_output.log]
  • rank_per_process@a442869: building plan Done! Cost time: 17.52s. building graph Done! Cost time: 22.84s. node0: 6576–6722 MiB, node1: 5126 MiB [rank_per_process_output.log]
  • naive@a442869: building plan Done! Cost time: 21.07s. building graph Done! Cost time: 23.82s. node0: 6576–6722 MiB, node1: 5126 MiB [naive_output.log]


  • Global loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc4_2n8g

  • 50-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc4_2n8g_50-220

  • 100-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc4_2n8g_100-220

Libai vs. Megatron GPT test

  • GPT-2

Setup (libai vs. Megatron):
Dataset: both use loss_compara_content_sentence.bin and loss_compara_content_sentence.idx
Vocab: both use bert-base-chinese-vocab.txt
Test scripts: args_train.sh (libai) and megatron_args_pretrain_gpt2.sh (Megatron)
  • Test environment

    OneFlow (master branch) 9f08133, Libai (main branch) 247cbb7, Megatron (main branch) e156d2f
  • Test results

Three groups were tested: pure data parallelism, hybrid parallelism, and pure model parallelism.

NVIDIA_GeForce_RTX_3090 (Libai vs. Megatron):
gpt2_nl24_nah16_hs768_FP16_acfalse_DP8_MP1_PP1_zerofalse_stage2_mbs4_gbs32_acc1_1n8g: Libai 16514–16568 MiB / 112.17 samples/s; Megatron 16931 MiB / 84.7 samples/s
gpt2_nl24_nah16_hs1024_FP16_acfalse_DP8_MP1_PP1_zerofalse_stage2_mbs8_gbs64_acc1_1n8g: Libai OOM; Megatron OOM
gpt2_nl24_nah16_hs768_FP16_acfalse_DP2_MP2_PP2_zerofalse_stage2_mbs4_gbs16_acc2_1n8g: Libai 16066–16196 MiB / 37.44 samples/s; Megatron 8187 MiB / 45.8 samples/s
gpt2_nl24_nah16_hs1024_FP16_acfalse_DP2_MP2_PP2_zerofalse_stage2_mbs8_gbs16_acc1_1n8g: Libai 7987–10258 MiB / 22.40 samples/s; Megatron 9317 MiB / 27.7 samples/s
gpt2_nl24_nah16_hs768_FP16_acfalse_DP1_MP8_PP1_zerofalse_stage2_mbs32_gbs256_acc8_1n8g: Libai 18456 MiB / 14.94 samples/s; Megatron 23759 MiB / 14.4 samples/s
gpt2_graph_nl24_nah16_hs1024__acfalse_DP_MP2_PP2_zerofalse_stage2_mbs8_gbs32_acc_1n8g: Libai OOM; Megatron 11057 MiB / 35.9 samples/s
gpt2_eager_nl24_nah16_hs768__acfalse_DP_MP2_PP2_zerofalse_stage2_mbs8_gbs64_acc_1n8g: Libai OOM; Megatron 14248 MiB / 52.8 samples/s

ViT separate-compilation regression test

NVIDIA_GeForce_RTX_3080_Ti master + oneflow@6e019b7 + libai@d25f09c rank_per_process + oneflow@a442869 + libai@d25f09c naive + oneflow@a442869 + libai@d25f09c6f
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_acfalse_dp1_mp4_pp1_zerotrue_stage2_mbs256_gbs256_acc1_1n4g 11002 MiB / 219.67 samples/s 10994 MiB / 219.31 samples/s 10994 MiB / 239.39 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_acfalse_dp4_mp1_pp1_zerotrue_stage2_mbs64_gbs256_acc1_1n4g 7758 MiB / 899.03 samples/s 7742 MiB / 978.03 samples/s (throughput on the high side) 7742 MiB / 905.66 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp1_pp1_zerotrue_stage2_mbs128_gbs1024_acc8_1n1g 8017 MiB / 241.06 samples/s 8009 MiB / 255.8 samples/s 8009 MiB / 259.39 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp1_pp1_zerotrue_stage2_mbs256_gbs256_acc1_1n1g 6613 MiB / 309.55 samples/s 6605 MiB / 308.78 samples/s 6605 MiB / 308.95 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp1_pp4_zerotrue_stage2_mbs128_gbs1024_acc8_1n4g 8558 MiB / 234.22 samples/s 8550 MiB / 233.68 samples/s 8550 MiB / 231.84 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp1_pp4_zerotrue_stage2_mbs256_gbs256_acc1_1n4g 5402 MiB / 275.0 samples/s 5394 MiB / 274.05 samples/s 5394 MiB / 254.43 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp2_pp2_zerotrue_stage2_mbs128_gbs1024_acc8_1n4g 8034 MiB / 219.82 samples/s 7936 MiB / 220.09 samples/s 7936 MiB / 221.11 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp2_pp2_zerotrue_stage2_mbs256_gbs256_acc1_1n4g 4912 MiB / 203.63 samples/s 4830 MiB / 204.88 samples/s 4830 MiB / 203.06 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp4_pp1_zerotrue_stage2_mbs128_gbs1024_acc8_1n4g 6470 MiB / 180.8 samples/s 6370 MiB / 178.49 samples/s 6370 MiB / 177.04 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp1_pp2_zerotrue_stage2_mbs128_gbs256_acc1_1n4g 4008 MiB / 454.2 samples/s (throughput on the low side) 3996 MiB / 502.64 samples/s 3996 MiB / 511.25 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp1_pp2_zerotrue_stage2_mbs64_gbs1024_acc8_1n4g 5640 MiB / 441.66 samples/s 5510 MiB / 443.4 samples/s 5510 MiB / 466.89 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp2_pp1_zerotrue_stage2_mbs128_gbs256_acc1_1n4g 3828 MiB / 388.04 samples/s 3782 MiB / 393.13 samples/s 3782 MiB / 390.96 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp2_pp1_zerotrue_stage2_mbs64_gbs1024_acc8_1n4g 4728 MiB / 337.81 samples/s 4646 MiB / 330.97 samples/s 4646 MiB / 329.41 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp4_mp1_pp1_zerotrue_stage2_mbs32_gbs1024_acc8_1n4g 3958 MiB / 772.74 samples/s 3892 MiB / 757.1 samples/s 3882 MiB / 738.88 samples/s

ResNet_run_week separate-compilation regression test

NVIDIA_GeForce_RTX_3080_Ti master + oneflow@2c98eb12 + models@2654092c0 rank_per_process + oneflow@a442869 + models@2654092c0 naive + oneflow@a442869 + models@2654092c0
resnet50_graph_realdata_DCgpu_FP16_mb160_gb1280_acc1_1n8g [9836-10080] MiB / 1132.94 / 77.02 [9924-10168] MiB / 1090.68 / 77.17 [9922-10160] MiB / 1090.08 / 76.77
resnet50_graph_realdata_DCgpu_FP16_mb40_gb1280_acc4_1n8g [9838-10136] MiB / 281.92 / 77.1 [9920-10188] MiB / 266.86 / 76.84 [9922-10220] MiB / 272.22 / 76.94

Reproducing the case where lr shows N/A

  • case: bert actrue_DP4_MP2_PP2_zerotrue_stage2_acc4_2n8g
  • Servers: machines 26 and 28, NVIDIA_GeForce_RTX_3080_Ti
  • master@356829ec
  • plan_sep_compile_merge branch @a442869 + environment variable rank_per_process

LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerotrue_stage2_mbs32_gbs512_acc4_2n8g

case1

GPU: NVIDIA_GeForce_RTX_3080_Ti
Case: LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerotrue_stage2_mbs32_gbs512_acc4_2n8g
  • master@b51cb72: building graph Done! Cost time: 22.11s. building plan Done! Cost time: 21.92s. node0: 5774–5924 MiB, node1: 4262–4621 MiB [master_output.log]
  • rank_per_process@a442869: building plan Done! Cost time: 23.21s. building graph Done! Cost time: 20.52s. node0: 5774–5864 MiB, node1: 4262 MiB [rank_per_process_output.log]. While running this case the lr showed N/A (see output.log); in single-node tests the lr displayed normally.
  • naive@a442869: building plan Done! Cost time: 21.77s. building graph Done! Cost time: 23.8s. node0: 5736–5886 MiB, node1: 4242–4262 MiB [naive_output.log]

  • Global loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerotrue_stage2_mbs32_gbs512_acc4_2n8g

  • 50-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerotrue_stage2_mbs32_gbs512_acc4_2n8g_50-220

  • 100-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerotrue_stage2_mbs32_gbs512_acc4_2n8g_100-220

LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs32_gbs512_acc1_2n8g

GPU: NVIDIA_GeForce_RTX_3080_Ti
Case: LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs32_gbs512_acc1_2n8g
  • master@021e5e62: building plan Done! Cost time: 23.85s. building graph Done! Cost time: 11.94s. node0: 4536 MiB, node1: 3590 MiB
  • rank_per_process@021e5e62: building plan Done! Cost time: 24.18s. building graph Done! Cost time: 12.23s. node0: 8814–8960 MiB, node1: 3638 MiB
  • naive@021e5e62: building plan Done! Cost time: 24.34s. building graph Done! Cost time: 12.21s. node0: 5774–5864 MiB, node1: 4242–4262 MiB

  • Global loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs32_gbs512_acc1_2n8g

  • 50-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs32_gbs512_acc1_2n8g_50-220

  • 100-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs32_gbs512_acc1_2n8g_100-220

LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs8_gbs512_acc4_2n8g

case4

GPU: NVIDIA_GeForce_RTX_3080_Ti
Case: LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs32_gbs512_acc4_2n8g
  • master@b51cb72: building plan Done! Cost time: 24.16s. building graph Done! Cost time: 12.5s. node0: 4094 MiB, node1: 3148 MiB [master_output.log]
  • rank_per_process@021e5e62: building plan Done! Cost time: 14.09s. building graph Done! Cost time: 14.6s. node0: 4094 MiB, node1: 3148 MiB [rank_per_process_output.log]
  • naive@021e5e62: building plan Done! Cost time: 27s. building graph Done! Cost time: 14.91s. node0: 4094 MiB, node1: 3148 MiB [naive_output.log]
  • Global loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs8_gbs512_acc4_2n8g

  • 50-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs8_gbs512_acc4_2n8g_50-220

  • 100-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs8_gbs512_acc4_2n8g_100-220

Separate-compilation tasks

NVIDIA_GeForce_RTX_3080_Ti master+oneflow@e619579b + libai@2654092c rank_per_process+oneflow@a442869 + libai@2654092c naive+oneflow@a442869 + libai@2654092c
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp1_mp1_pp1_zerofalse_stage0_mbs16_gbs128_acc8_1n1g 7801 MiB / 32.06 samples/s 7795 MiB / 31.2 samples/s 7795 MiB / 31.2 samples/s
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp1_mp1_pp1_zerofalse_stage0_mbs32_gbs32_acc1_1n1g 7937 MiB (memory usage anomalous) / 32.94 samples/s 7039 MiB / 32.96 samples/s 7039 MiB / 32.94 samples/s
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp1_mp8_pp1_zerofalse_stage0_mbs32_gbs256_acc8_1n8g 3880 MiB / 15.41 samples/s 3874 MiB / 15.38 samples/s 3874 MiB / 15.38 samples/s
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp2_mp1_pp4_zerotrue_stage2_mbs32_gbs512_acc8_1n8g 7722 MiB / 167.49 samples/s 7684 MiB / 168.03 samples/s 7684 MiB / 168.77 samples/s
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp2_mp2_pp2_zerotrue_stage2_mbs32_gbs512_acc8_1n8g 6340 MiB / 57.29 samples/s 6334 MiB / 57.19 samples/s 6304 MiB / 57.29 samples/s
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp4_mp1_pp1_zerotrue_stage2_mbs32_gbs128_acc1_1n4g 5212 MiB / 96.02 samples/s 5206 MiB / 95.9 samples/s 5206 MiB / 95.8 samples/s lr=NA
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp4_mp2_pp1_zerotrue_stage2_mbs32_gbs1024_acc8_1n8g 6154 MiB / 71.38 samples/s 6148 MiB / 71.31 samples/s 6148 MiB / 71.45 samples/s
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp4_mp2_pp1_zerotrue_stage2_mbs32_gbs128_acc1_1n8g 5442 MiB / 63.83 samples/s 5436 MiB / 63.65 samples/s 5436 MiB / 63.74 samples/s
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp8_mp1_pp1_zerotrue_stage2_mbs32_gbs2048_acc8_1n8g 6060 MiB / 223.19 samples/s 6054 MiB / 223.52 samples/s lr=NA 6054 MiB / 223.14 samples/s
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp8_mp1_pp1_zerotrue_stage2_mbs32_gbs256_acc1_1n8g 4776 MiB / 187.71 samples/s 4770 MiB / 187.7 samples/s 4770 MiB / 187.37 samples/s
libai_bert_large_pretrain_graph_nl48_nah16_hs1024_fp16_actrue_dp1_mp1_pp4_zerofalse_stage0_mbs16_gbs128_acc8_1n4g 7540 MiB / 47.8 samples/s 7534 MiB / 47.83 samples/s 7534 MiB / 47.84 samples/s
libai_bert_large_pretrain_graph_nl48_nah16_hs1024_fp16_actrue_dp1_mp1_pp8_zerofalse_stage0_mbs24_gbs384_acc16_1n8g 7920 MiB / 72.88 samples/s 7914 MiB / 74.13 samples/s 7914 MiB / 73.36 samples/s
libai_bert_large_pretrain_graph_nl48_nah16_hs1024_fp16_actrue_dp1_mp4_pp1_zerofalse_stage0_mbs32_gbs256_acc8_1n4g 7150 MiB / 8.34 samples/s 7144 MiB / 8.38 samples/s 7144 MiB / 8.34 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp1_mp1_pp1_zerofalse_stage0_mbs8_gbs64_acc8_1n1g 9081 MiB / 13.14 samples/s lr=NA 9075 MiB / 13.12 samples/s lr=NA 9075 MiB / 13.28 samples/s lr=NA
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp1_mp1_pp1_zerofalse_stage0_mbs8_gbs8_acc1_1n1g 8235 MiB / 12.95 samples/s 8091 MiB / 12.79 samples/s 8091 MiB / 12.91 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp1_mp1_pp4_zerofalse_stage0_mbs12_gbs96_acc8_1n4g 8460 MiB / 39.62 samples/s loss=inf 8454 MiB / 39.64 samples/s 8454 MiB / 39.65 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp1_mp8_pp1_zerofalse_stage0_mbs8_gbs64_acc8_1n8g 3548 MiB / 7.53 samples/s 3542 MiB / 7.55 samples/s 3542 MiB / 7.53 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp2_mp1_pp4_zerotrue_stage2_mbs8_gbs128_acc8_1n8g 7252 MiB / 64.78 samples/s 7230 MiB / 64.73 samples/s 7230 MiB / 64.77 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp2_mp2_pp2_zerotrue_stage2_mbs8_gbs128_acc8_1n8g 5518 MiB / 24.44 samples/s 5512 MiB / 24.48 samples/s 5512 MiB / 24.44 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp4_mp1_pp1_zerotrue_stage2_mbs8_gbs256_acc8_1n4g 6594 MiB / 41.94 samples/s 6588 MiB / 41.95 samples/s 6588 MiB / 41.93 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp4_mp1_pp1_zerotrue_stage2_mbs8_gbs32_acc1_1n4g 5162 MiB / 32.87 samples/s 5156 MiB / 32.84 samples/s 5156 MiB / 32.87 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp4_mp2_pp1_zerotrue_stage2_mbs8_gbs256_acc8_1n8g 6176 MiB / 32.69 samples/s 6152 MiB / 32.49 samples/s 6152 MiB / 32.43 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp4_mp2_pp1_zerotrue_stage2_mbs8_gbs32_acc1_1n8g 5458 MiB / 27.01 samples/s 5412 MiB / 27.01 samples/s 5412 MiB / 27.0 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp8_mp1_pp1_zerotrue_stage2_mbs8_gbs512_acc8_1n8g 6070 MiB / 82.3 samples/s 6064 MiB / 82.18 samples/s 6064 MiB / 82.28 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp8_mp1_pp1_zerotrue_stage2_mbs8_gbs64_acc1_1n8g 4726 MiB / 63.62 samples/s 4720 MiB / 63.61 samples/s 4720 MiB / 63.61 samples/s
libai_gpt2_pretrain_graph_nl48_nah16_hs1024_fp16_actrue_dp1_mp1_pp8_zerofalse_stage0_mbs6_gbs96_acc16_1n8g 6822 MiB / 34.45 samples/s 6816 MiB / 34.77 samples/s 6816 MiB / 33.95 samples/s
libai_gpt2_pretrain_graph_nl48_nah16_hs1024_fp16_actrue_dp1_mp4_pp1_zerofalse_stage0_mbs8_gbs64_acc8_1n4g 6594 MiB / 4.07 samples/s 6588 MiB / 4.05 samples/s 6588 MiB / 4.03 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp1_pp1_zerotrue_stage2_mbs128_gbs1024_acc8_1n1g 7230 MiB / 108.92 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp1_pp1_zerotrue_stage2_mbs256_gbs256_acc1_1n1g 6692 MiB / 113.19 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp1_pp4_zerotrue_stage2_mbs128_gbs1024_acc8_1n4g 0 MiB / 0 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp1_pp4_zerotrue_stage2_mbs256_gbs256_acc1_1n4g 6570 MiB / 264.65 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp2_pp2_zerotrue_stage2_mbs128_gbs1024_acc8_1n4g 10414 MiB / 107.39 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp2_pp2_zerotrue_stage2_mbs256_gbs256_acc1_1n4g 5914 MiB / 100.94 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp4_pp1_zerotrue_stage2_mbs128_gbs1024_acc8_1n4g 6748 MiB / 198.7 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp4_pp1_zerotrue_stage2_mbs256_gbs256_acc1_1n4g 5484 MiB / 218.7 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp1_pp2_zerotrue_stage2_mbs128_gbs256_acc1_1n4g 4354 MiB / 495.47 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp1_pp2_zerotrue_stage2_mbs64_gbs1024_acc8_1n4g 6530 MiB / 214.09 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp2_pp1_zerotrue_stage2_mbs128_gbs256_acc1_1n4g 4098 MiB / 163.96 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp2_pp1_zerotrue_stage2_mbs64_gbs1024_acc8_1n4g 4644 MiB / 199.26 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp2_pp2_zerotrue_stage2_mbs128_gbs2048_acc8_1n8g 10452 MiB / 182.07 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp2_pp2_zerotrue_stage2_mbs256_gbs512_acc1_1n8g 5940 MiB / 233.05 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp4_mp1_pp1_zerotrue_stage2_mbs32_gbs1024_acc8_1n4g 3518 MiB / 215.72 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp4_mp1_pp1_zerotrue_stage2_mbs64_gbs256_acc1_1n4g 3296 MiB / 217.42 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp4_mp1_pp2_zerotrue_stage2_mbs64_gbs2048_acc8_1n8g 6234 MiB / 219.87 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp4_mp2_pp1_zerotrue_stage2_mbs128_gbs512_acc1_1n8g 4074 MiB / 350.99 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp4_mp2_pp1_zerotrue_stage2_mbs64_gbs2048_acc8_1n8g 4618 MiB / 213.33 samples/s

Tencent Cloud A800 Libai vs. Megatron comparison

A800 Libai vs. Megatron comparison test on GPT2

Docker environment

  1. Docker image: based on the NGC container nvcr.io/nvidia/pytorch:21.07-py3, with passwordless SSH set up, IB driver version 5.3, and the host IP list configured.
  2. TCCL plugin: nccl-rdma-sharp-plugins_1.1_amd64.deb

NCCL-Test

Set export HOME=/data_turbo/home/share/workspace; every occurrence of HOME below refers to this path.

  1. Start the Docker container: docker run --gpus all -itd --shm-size=16g --ulimit memlock=-1 --ulimit core=0 --ulimit stack=67108864 --privileged --cap-add=IPC_LOCK --name "gpt_test" --ipc host --net host -v "$HOME":"$HOME" "ngc/pytorch-21.07:ssh-ib5.4-config-py38" bash -c "sed -i 's/Port 62620/Port 10098/g' /root/.ssh/config && /usr/sbin/sshd -p 10098 && bash"
    This creates a container named gpt_test; every later mention of gpt_test refers to this container.
Parameter explanations:

--gpus all: give the container access to all available GPUs.

-itd: run the container interactively, in the background.

--shm-size=16g: set the container's shared memory size to 16 GB.

--ulimit memlock=-1: allow the container to lock an unlimited amount of memory.

--ulimit core=0: disable core-dump generation inside the container.

--ulimit stack=67108864: set the container's stack size to 64 MB.

--privileged: grant the container all privileges.

--cap-add=IPC_LOCK: allow the container to lock shared memory.

--name "gpt_test": name the container "gpt_test".

--ipc host: share the host's IPC namespace.

--net host: use the host's network namespace.

-v "$HOME":"$HOME": mount the host's $HOME directory into the container at the same path, so the container can access files on the host.

"ngc/pytorch-21.07:ssh-ib5.4-config-py38": the name and tag of the Docker image to use.

bash -c "sed -i 's/Port 62620/Port 10098/g' /root/.ssh/config && /usr/sbin/sshd -p 10098 && bash": the command executed at container start. It changes the port in the container's SSH config, starts an SSH server on port 10098, and finally opens a new bash shell.
  2. Run the container on machines 051, 052, 053, 054, 055, 056, 057, 058.
  3. Log in to the docker container on any GPU node: docker attach gpt_test
  4. Download nccl-tests: cd $HOME && git clone https://github.com/NVIDIA/nccl-tests.git
  5. Compile nccl-tests: cd nccl-tests && make MPI=1 MPI_HOME=/usr/local/mpi/
  6. Test across the 8 machines with message sizes growing by a factor of 2: mpirun -np 64 -H 051:8,052:8,053:8,054:8,055:8,056:8,057:8,058:8 --allow-run-as-root -bind-to none -map-by slot -x NCCL_IB_DISABLE=0 -x NCCL_IB_GID_INDEX=3 -x NCCL_GDR_LEVEL=2 -x NCCL_DEBUG=INFO -x NCCL_IB_QPS_PER_CONNECTION=4 -x NCCL_IB_TC=160 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl_tcp_if_include bond0 -mca btl ^openib $HOME/nccl-tests/build/all_reduce_perf -b 2G -e 4G -f 2 -g 1 | tee $HOME/nccl-tests/nccl_log/nccl_increace_16n8g.log
    Test across the 8 machines at a fixed size, repeated 100 times: mpirun -np 64 -H 051:8,052:8,053:8,054:8,055:8,056:8,057:8,058:8 --allow-run-as-root -bind-to none -map-by slot -x NCCL_IB_DISABLE=0 -x NCCL_IB_GID_INDEX=3 -x NCCL_GDR_LEVEL=2 -x NCCL_DEBUG=INFO -x NCCL_IB_QPS_PER_CONNECTION=4 -x NCCL_IB_TC=160 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl_tcp_if_include bond0 -mca btl ^openib $HOME/nccl-tests/build/all_reduce_perf -b 4G -e 4G -n 100 -g 1 | tee $HOME/nccl-tests/nccl_log/nccl_stable_16n8g.log
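A practical note, as an assumption about this setup: the nccl_log directory referenced by the tee commands above is not part of a fresh nccl-tests checkout, and tee does not create missing parent directories, so it likely has to be created before the first run:

mkdir -p $HOME/nccl-tests/nccl_log   # otherwise tee fails with "No such file or directory"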
Parameter details:

mpirun: the command that launches a parallel program with MPI.
-np 64: the number of processes to launch, 64 in this example.

-H 051:8,052:8,053:8,054:8,055:8,056:8,057:8,058:8: which nodes to launch processes on and how many per node; here, 8 processes on each of nodes 051 through 058.

--allow-run-as-root: allow running the program as the root user.

-bind-to none: do not bind processes to specific processing units.

-map-by slot: assign processes to processing units by slot, where each slot represents one processing unit.

-x NCCL_IB_DISABLE=0: export NCCL_IB_DISABLE=0, enabling InfiniBand communication.

-x NCCL_IB_GID_INDEX=3: set NCCL_IB_GID_INDEX to 3, the GID (global identifier) index used for InfiniBand communication.

-x NCCL_GDR_LEVEL=2: set NCCL_GDR_LEVEL to 2, enabling GPU Direct RDMA (remote direct memory access) for GPU-to-GPU communication.

-x NCCL_DEBUG=INFO: set NCCL_DEBUG to INFO, enabling debug output for NCCL (NVIDIA Collective Communications Library) operations.

-x NCCL_IB_QPS_PER_CONNECTION=4: set NCCL_IB_QPS_PER_CONNECTION to 4, the number of QPs (queue pairs) per IB (InfiniBand) connection.

-x NCCL_IB_TC=160: set NCCL_IB_TC to 160, the traffic class used for InfiniBand communication.

-x LD_LIBRARY_PATH: export the current LD_LIBRARY_PATH, the environment variable that specifies the shared-library search path.

-x PATH: export the current PATH, the environment variable that specifies the executable search path.

-mca pml ob1: select the PML (point-to-point messaging layer) module to use, Open MPI's ob1 module.

-mca btl_tcp_if_include bond0: restrict the TCP BTL (byte transfer layer) module to the bond0 network interface.

-mca btl ^openib: disable Open MPI's openib BTL module (InfiniBand via Open MPI).

$HOME/nccl-tests/build/all_reduce_perf: the path of the executable to run under MPI.

-b 2G: starting message size.

-e 4G: ending message size.

-f 2: multiplication factor between message sizes.

-n 100: number of iterations.

-g 1: the number of GPUs per process, 1 in this example.

| tee $HOME/nccl-tests/nccl_log/nccl_increace_16n8g.log: pipe the program's standard output through tee, which also writes it to the given file.
  7. Result: busbw is around 167 GB/s (full log). In the log, look for lines such as "Using devices", "GDRDMA", and "Using network IBext".
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
2147483648     536870912     float     sum      -1    25040   85.76  168.84      0    24828   86.49  170.28      0
4294967296    1073741824     float     sum      -1    52185   82.30  162.03      0    50414   85.19  167.73      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 167.221
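As a cross-check on the numbers above: for all_reduce, nccl-tests derives bus bandwidth from algorithm bandwidth as busbw = algbw * 2 * (n - 1) / n, where n is the number of ranks (64 here). A one-liner reproducing the first row:

awk 'BEGIN { n = 64; algbw = 85.76; print algbw * 2 * (n - 1) / n }'   # 168.84, matching the busbw column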

Running Libai training

  1. Check that all the containers are up: ansible 051,052,053,054,055,056,057,058 -m shell -a "docker ps | grep gpt_test"
  2. Environment setup and dataset download: run_libai_gpt.sh
  3. On any GPU machine, cd $HOME/libai and create a script for launching multi-node DDP training, run_train_libai.sh:
set -ex
# Parse the input arguments: a comma-separated host list and the master address
hosts=$(echo $1 | tr "," "\n")
master_addr=$2
# Count the hosts first, then derive the global batch size (16 samples per node)
num_hosts=$(echo $hosts | wc -w)
global_batch_size=$(($num_hosts * 16))

# Use ansible to launch the training command on every host
for (( i=0; i<$num_hosts; i++ )); do
  host=$(echo $hosts | cut -d " " -f $((i+1)))

  ansible $host -m shell -a "docker exec gpt_test bash -c 'cd $HOME/libai && bash tools/args_train.sh configs/gpt2_pretrain.py $num_hosts 8 $i $master_addr 1 1 true true true 2 $global_batch_size false 2 220 100 48 144 2304 9216'" &

done
  4. bash run_train_libai.sh 051,052,053,054,055,056,057,058 10.0.0.114 starts training on the 8 nodes. The first host must be the master node, and the second argument is the master node's address.
  5. Libai training script: args_train.sh. Its parameters are explained below:

* CONFIG: path to the config file.
* NNODES: number of nodes, default 1.
* GPUS_PER_NODE: number of GPUs per node, default 8.
* NODE_RANK: rank of the current node in the node group, default 0.
* MASTER_ADDR: master node address, default "127.0.0.1".
* MASTER_PORT: master node port, default 12345.
* MP: model-parallel group size, default 1.
* DP = (NNODES * GPUS_PER_NODE) / MP / PP (derived; see the sketch after this list)
* PP: pipeline-parallel group size, default 1.
* GRAPH_ENABLED: whether to enable graph mode, default true.
* USE_FP16: whether to train with FP16 mixed precision, default true.
* ACTIVATION_CHECKPOINT: whether to enable activation checkpointing, default false.
* MICRO_BATCH_SIZE: number of samples per micro-batch, default 4.
* GLOBAL_BATCH_SIZE: global batch size, default 4.
* ZERO_ENABLE: whether to enable ZeRO optimization, default false.
* ZERO_STAGE: ZeRO stage to use, default 2.
* TRAIN_ITERS: number of training iterations, default 220.
* LOG_PERIOD: logging period in iterations, default 100.
* NUM_LAYER: number of Transformer layers, default 12.
* NUM_ATT_HEADS: number of self-attention heads, default 12.
* HIDDEN_SIZE: hidden size, default 768.
* INTERMEDIATE_SIZE: intermediate size of the feedforward layer, default 3072.
* HEAD_SIZE: size of each head in multi-head attention, default 64.
* SAVE_MODEL: whether to save the model, default false.
* UNSET_DROPOUT: whether to disable dropout, default false.
* ACC = GLOBAL_BATCH_SIZE / (DP * MICRO_BATCH_SIZE) (derived; see the sketch after this list)
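As a sanity check on the two derived quantities DP and ACC, here is a minimal bash sketch using the values from the DP4_MP2_PP2 mbs32_gbs512_acc4_2n8g cases earlier in this report; the variable names follow the list above, and the script is illustrative rather than part of args_train.sh:

# Derive DP and ACC from the launch parameters
NNODES=2; GPUS_PER_NODE=8                               # 2n8g
MP=2; PP=2                                              # model- and pipeline-parallel sizes
MICRO_BATCH_SIZE=32; GLOBAL_BATCH_SIZE=512

DP=$(( NNODES * GPUS_PER_NODE / (MP * PP) ))            # 16 / 4 = 4
ACC=$(( GLOBAL_BATCH_SIZE / (DP * MICRO_BATCH_SIZE) ))  # 512 / 128 = 4
echo "DP=$DP ACC=$ACC"                                  # matches ..._DP4_MP2_PP2_..._acc4_2n8g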

Running Megatron training

  1. Check that all the containers are up: ansible 051,052,053,054,055,056,057,058 -m shell -a "docker ps | grep gpt_test"
  2. Environment setup and dataset download: run_megatron_ml_gpt.sh
  3. On any GPU machine, cd $HOME/Megatron-LM and create a script for launching multi-node DDP training, run_train_megatron.sh:
set -ex
# Parse the input arguments: a comma-separated host list and the master address
hosts=$(echo $1 | tr "," "\n")
master_addr=$2
# Count the hosts first, then derive the global batch size (16 samples per node)
num_hosts=$(echo $hosts | wc -w)
global_batch_size=$(($num_hosts * 16))

# Use ansible to launch the training command on every host
for (( i=0; i<$num_hosts; i++ )); do
  host=$(echo $hosts | cut -d " " -f $((i+1)))

  ansible $host -m shell -a "docker exec gpt_test bash -c 'cd $HOME/Megatron-LM && bash examples/megatron_args_pretrain_gpt2.sh $num_hosts 8 $i $master_addr 1 1 true true true 2 $global_batch_size false 2 220 100 48 144 2304 9216'" &

done
  4. bash run_train_megatron.sh 051,052,053,054,055,056,057,058 10.0.0.114 starts training on the 8 nodes. The first host must be the master node, and the second argument is the master node's address.
  5. Megatron training script: megatron_args_pretrain_gpt2.sh; its parameters are fully aligned with the libai training script.

Training results

Libai vs. Megatron parallelism test

NVIDIA Graphics Device A800 80G, OneFlow_eb3df25 vs. Megatron_e156d2f:

gpt2_pretrain_graph_nl48_nah144_hs2304_FP16_actrue_DP64_MP1_PP1_zerofalse_stage2_mbs2_gbs128_acc1_8n8g
  • OneFlow: building graph cost time: 22.23s / building plan cost time: 166.74s / 66997–67197 MiB / 138.98 samples/s
  • Megatron: 64754–64826 MiB / 132.6 samples/s

gpt2_pretrain_graph_nl64_nah144_hs2304_FP16_actrue_DP32_MP2_PP1_zerofalse_stage2_mbs32_gbs1024_acc1_8n8g
  • OneFlow: building graph cost time: 46.29s / building plan cost time: 247.49s / 55737–55811 MiB / 126.40 samples/s
  • Megatron: OOM

gpt2_pretrain_graph_nl80_nah144_hs2304_FP16_actrue_DP16_MP2_PP2_zerofalse_stage2_mbs64_gbs1024_acc1_8n8g
  • OneFlow: building graph cost time: 55.35s / building plan cost time: 168.52s / 66296–67969 MiB / 50.68 samples/s
  • Megatron: OOM

gpt2_pretrain_graph_nl64_nah144_hs2304_FP16_actrue_DP32_MP2_PP1_zerofalse_stage2_mbs16_gbs512_acc1_8n8g
  • OneFlow: building graph cost time: 49.1s / building plan cost time: 250.73s / 45741–45813 MiB / 118.90 samples/s
  • Megatron: 55190–55212 MiB / 121.5 samples/s

gpt2_pretrain_graph_nl80_nah144_hs2304_FP16_actrue_DP16_MP2_PP2_zerofalse_stage2_mbs16_gbs256_acc1_8n8g
  • OneFlow: building graph cost time: 54.59s / building plan cost time: 162.58s / 33052–34771 MiB / 46.69 samples/s
  • Megatron: 41326–41402 MiB / 51.3 samples/s

GPT2 script training notes

There are three branches in OneAutoTest (https://github.com/Oneflow-Inc/OneAutoTest/branches):

(screenshot of the branch list, taken 2023-04-29)

They are megatron_script_tecent, megatron_script_huoshan, and megatron_script_sahngtang, which hold the scripts adapted for the Tencent Cloud, Volcano Engine, and SenseTime platforms respectively.
The training scripts live under OneAutoTest/onebench/libai/; four of them need particular attention and use:
Make sure to use the correct branch when downloading (a checkout sketch follows the list below)!

  1. args_train.sh

    The main libai training script: it defines the training parameters and passes them to the training launcher.

  2. run_libai_gpt.sh

    Sets up the environment needed for libai training, including installing OneFlow and Libai and downloading the required datasets, then calls args_train.sh to start training. Only two example cases are configured in the script; adjust the parameters for custom needs.

  3. megatron_args_pretrain_gpt2.sh

    The main Megatron training script; its training parameters are aligned with libai's.

  4. run_megatron_ml_gpt.sh

    Sets up the environment needed for Megatron training, including downloading Megatron and the datasets, then calls megatron_args_pretrain_gpt2.sh to start training. Only one example case is configured in the script; adjust the parameters for custom needs.
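Since the branch matters, here is a minimal checkout sketch, assuming the Tencent Cloud scripts are the ones wanted (substitute one of the other branch names as needed):

git clone https://github.com/Oneflow-Inc/OneAutoTest.git
cd OneAutoTest
git checkout megatron_script_tecent   # or megatron_script_huoshan / megatron_script_sahngtang
ls onebench/libai/                    # args_train.sh, run_libai_gpt.sh, megatron_args_pretrain_gpt2.sh, run_megatron_ml_gpt.sh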

LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_acfalse_DP4_MP2_PP2_zerofalse_stage0_mbs4_gbs64_acc4_2n8g

case3

GPU: NVIDIA_GeForce_RTX_3080_Ti
Case: LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_acfalse_DP4_MP2_PP2_zerofalse_stage0_mbs4_gbs64_acc4_2n8g
  • master@b51cb72: building plan Done! Cost time: 18.92s. building graph Done! Cost time: 19.91s. node0: 8814–8960 MiB, node1: 3638 MiB [master_output.log]
  • rank_per_process@a442869: building plan Done! Cost time: 15.94s. building graph Done! Cost time: 22.09s. node0: 8808–8960 MiB, node1: 3638 MiB [rank_per_process_output.log]
  • naive@a442869: building plan Done! Cost time: 18.92s. building graph Done! Cost time: 22.16s. node0: 8808–8954 MiB, node1: 3638 MiB [naive_output.log]


  • Global loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_acfalse_DP4_MP2_PP2_zerofalse_stage0_mbs4_gbs64_acc4_2n8g

  • 50-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_acfalse_DP4_MP2_PP2_zerofalse_stage0_mbs4_gbs64_acc4_2n8g_50-220

  • 100-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_acfalse_DP4_MP2_PP2_zerofalse_stage0_mbs4_gbs64_acc4_2n8g_100-220

Tencent Cloud libai_gpt vs. megatron_gpt comparison test

Tencent Cloud Libai vs. Megatron comparison test on GPT2

  • GPT-2

Setup (libai vs. Megatron):
Dataset: both use loss_compara_content_sentence.bin and loss_compara_content_sentence.idx
Vocab: both use gpt2-vocab.json
Merges: both use gpt2-merges.txt
Test scripts: args_train.sh (libai) and megatron_args_pretrain_gpt2.sh (Megatron)
  • Test environment

    OneFlow (master branch) eb3df25, Libai (main branch) f728a5ec, Megatron (main branch) e156d2f
  • NCCL_TEST

Machines 028 and 029, 2n8g (full log):

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
a800-028:49082:49082 [0] NCCL INFO Launch mode Parallel
  2147483648     536870912     float     sum      -1    27260   78.78  147.71      0    28329   75.81  142.14      0
  4294967296    1073741824     float     sum      -1    53872   79.73  149.49      0    53867   79.73  149.50      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 147.207 

Machines 025, 026, 028, 029, 4n8g (full log):

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
a800-028:48888:48998 [1] NCCL INFO comm 0x7fe49c000fa0 rank 1 nranks 32 cudaDev 1 busId 24000 - Init COMPLETE
a800-028:48887:48887 [0] NCCL INFO Launch mode Parallel
  2147483648     536870912     float     sum      -1    43516   49.35   95.61      0    43263   49.64   96.17      0
  4294967296    1073741824     float     sum      -1    91891   46.74   90.56      0    94436   45.48   88.12      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 92.616 

Machines 051–058 (8 GPUs each), 8n8g (full log):

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
  2147483648     536870912     float     sum      -1    25432   84.44  166.24      0    25383   84.60  166.56      0
  4294967296    1073741824     float     sum      -1    53176   80.77  159.01      0    50261   85.45  168.24      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 165.013 
  • Test results

NVIDIA Graphics Device A800 80G, OneFlow_eb3df25 vs. Megatron_e156d2f:

gpt2_pretrain_graph_nl48_nah144_hs2304_FP16_actrue_DP16_MP1_PP1_zerofalse_stage2_mbs2_gbs32_acc1_2n8g
  • OneFlow: building graph cost time: 28.8s / building plan cost time: 50.61s / 66491–66801 MiB / 36.36 samples/s
  • Megatron: 64626–64882 MiB / 33.4 samples/s

gpt2_pretrain_graph_nl48_nah144_hs2304_FP16_actrue_DP16_MP1_PP1_zerofalse_stage2_mbs2_gbs128_acc4_2n8g
  • OneFlow: building graph cost time: 33.07s / building plan cost time: 56.48s / 64789–65229 MiB / 41.66 samples/s
  • Megatron: 64626–64882 MiB / 41.3 samples/s

gpt2_pretrain_graph_nl48_nah144_hs2304_FP16_actrue_DP32_MP1_PP1_zerofalse_stage2_mbs2_gbs64_acc1_4n8g
  • OneFlow: building graph cost time: 30.57s / building plan cost time: 84.44s / 66491–66803 MiB / 71.95 samples/s
  • Megatron: 64626–64882 MiB / 62.3 samples/s

gpt2_pretrain_graph_nl48_nah144_hs2304_FP16_actrue_DP32_MP1_PP1_zerofalse_stage2_mbs2_gbs256_acc4_4n8g
  • OneFlow: building graph cost time: 38.18s / building plan cost time: 85.25s / 64791–65229 MiB / 79.22 samples/s
  • Megatron: 64626–64882 MiB / 80.2 samples/s

gpt2_pretrain_graph_nl48_nah144_hs2304_FP16_actrue_DP64_MP1_PP1_zerofalse_stage2_mbs2_gbs128_acc1_8n8g
  • OneFlow: building graph cost time: 22.23s / building plan cost time: 166.74s / 66997–67197 MiB / 138.98 samples/s
  • Megatron: 64754–64826 MiB / 132.6 samples/s

gpt2_pretrain_graph_nl48_nah144_hs2304_FP16_actrue_DP64_MP1_PP1_zerofalse_stage2_mbs2_gbs512_acc4_8n8g
  • OneFlow: building graph cost time: 28.06s / building plan cost time: 167.93s / 65547–65819 MiB / 162.54 samples/s
  • Megatron: 64754–64826 MiB / 165.3 samples/s

gpt2_pretrain_graph_nl48_nah144_hs2304_FP16_actrue_DP128_MP1_PP1_zerofalse_stage2_mbs2_gbs256_acc1_16n8g
  • OneFlow: building graph cost time: 27.79s / building plan cost time: 394.68s / 66999–67197 MiB / 249.03 samples/s
  • Megatron: 64684–64828 MiB / 260.4 samples/s

LibAI_gpt2_pretrain_graph_nl48_nah144_hs2304_FP16_actrue_DP128_MP1_PP1_zerofalse_stage2_mbs2_gbs1024_acc4_16n8g
  • OneFlow: building graph cost time: 35.72s / building plan cost time: 365.7s / 65549–65819 MiB / 312.32 samples/s
  • Megatron: 64684–64828 MiB / 277.1 samples/s

Volcano Engine libai_gpt vs. megatron_gpt comparison test

  • GPT-2

Setup (libai vs. Megatron):
Dataset: both use loss_compara_content_sentence.bin and loss_compara_content_sentence.idx
Vocab: both use bert-base-chinese-vocab.txt
Test scripts: args_train.sh (libai) and megatron_args_pretrain_gpt2.sh (Megatron)
  • Test environment

    OneFlow (master branch) 0d4bc37, Libai (main branch) f728a5ec, Megatron (main branch) e156d2f
  • Test results

NVIDIA A100-SXM4-80GB, OneFlow_0d4bc37 vs. Megatron_e156d2f:

gpt2_pretrain_graph_nl60_nah144_hs2304_FP16_actrue_DP16_MP1_PP1_zerofalse_stage2_mbs2_gbs32_acc1_2n8g
  • OneFlow: 80290–80370 MiB / 29.15 samples/s
  • Megatron: 77944–78072 MiB / 26.4 samples/s

gpt2_pretrain_graph_nl60_nah144_hs2304_FP16_actrue_DP16_MP1_PP1_zerofalse_stage2_mbs2_gbs32_acc4_2n8g
  • OneFlow: 78316–78573 MiB / 33 samples/s
  • Megatron: 77944–78072 MiB / 33.8 samples/s
