
bert_test_1n8g's Introduction

bert_test_1n8g

GPU: NVIDIA_GeForce_RTX_3080_Ti
Case: LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP2_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc8_1n8g
  • master@59b64db: building graph Done! Cost time: 19.14s. building plan Done! Cost time: 18.08s. node0: 6029–6738 MiB
  • rank_per_process@59b64db: building plan Done! Cost time: 17.91s. building graph Done! Cost time: 18.9s. node0: 6026–6728 MiB
  • naive@59b64db: building plan Done! Cost time: 18.14s. building graph Done! Cost time: 18.99s. node0: 6598–6728 MiB
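The case names used throughout this report encode the model and parallelism configuration. The decoding below is inferred from the args_train.sh parameter list later in this document, so treat it as an assumed key rather than an official one:

# Assumed decoding of a case name such as
# LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP2_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc8_1n8g
#   nl24             -> NUM_LAYER=24 (Transformer layers)
#   nah16            -> NUM_ATT_HEADS=16 (attention heads)
#   hs1024           -> HIDDEN_SIZE=1024
#   FP16             -> USE_FP16=true (mixed-precision training)
#   actrue           -> ACTIVATION_CHECKPOINT=true
#   DP2_MP2_PP2      -> data-, model- and pipeline-parallel degrees
#   zerofalse_stage0 -> ZERO_ENABLE=false, ZERO_STAGE=0
#   mbs32            -> MICRO_BATCH_SIZE=32
#   gbs512           -> GLOBAL_BATCH_SIZE=512
#   acc8             -> 8 gradient-accumulation steps
#   1n8g             -> 1 node x 8 GPUs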

  • Global loss curve comparison

LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP2_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc8_1n8g.png

  • 50-step loss curve comparison

LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP2_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc8_1n8g_50_220.png

  • 100-step loss curve comparison

LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP2_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc8_1n8g_100_220.png

bert_test_1n8g's People

Contributors

tendo33

Watchers

Kostas Georgiou

bert_test_1n8g's Issues

LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc4_2n8g

case2

GPU: NVIDIA_GeForce_RTX_3080_Ti
Case: LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc4_2n8g
  • master@b51cb72: building graph Done! Cost time: 21.37s. building plan Done! Cost time: 20.65s. node0: 6582–6728 MiB, node1: 5126 MiB [master_output.log]
  • rank_per_process@a442869: building plan Done! Cost time: 17.52s. building graph Done! Cost time: 22.84s. node0: 6576–6722 MiB, node1: 5126 MiB [rank_per_process_output.log]
  • naive@a442869: building plan Done! Cost time: 21.07s. building graph Done! Cost time: 23.82s. node0: 6576–6722 MiB, node1: 5126 MiB [naive_output.log]


  • Global loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc4_2n8g

  • 50-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc4_2n8g_50-220

  • 100-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerofalse_stage0_mbs32_gbs512_acc4_2n8g_100-220

Libai vs. Megatron GPT test

  • GPT-2

Setup (libai vs. Megatron):
Dataset: both use loss_compara_content_sentence.bin and loss_compara_content_sentence.idx
Vocab: both use bert-base-chinese-vocab.txt
Test scripts: args_train.sh (libai) and megatron_args_pretrain_gpt2.sh (Megatron)
  • Test environment

    OneFlow (master branch) 9f08133, Libai (main branch) 247cbb7, Megatron (main branch) e156d2f
  • Test results

Three groups were tested: pure data parallelism, hybrid parallelism, and pure model parallelism.

NVIDIA_GeForce_RTX_3090 (Libai vs. Megatron):
gpt2_nl24_nah16_hs768_FP16_acfalse_DP8_MP1_PP1_zerofalse_stage2_mbs4_gbs32_acc1_1n8g: Libai 16514–16568 MiB / 112.17 samples/s; Megatron 16931 MiB / 84.7 samples/s
gpt2_nl24_nah16_hs1024_FP16_acfalse_DP8_MP1_PP1_zerofalse_stage2_mbs8_gbs64_acc1_1n8g: Libai OOM; Megatron OOM
gpt2_nl24_nah16_hs768_FP16_acfalse_DP2_MP2_PP2_zerofalse_stage2_mbs4_gbs16_acc2_1n8g: Libai 16066–16196 MiB / 37.44 samples/s; Megatron 8187 MiB / 45.8 samples/s
gpt2_nl24_nah16_hs1024_FP16_acfalse_DP2_MP2_PP2_zerofalse_stage2_mbs8_gbs16_acc1_1n8g: Libai 7987–10258 MiB / 22.40 samples/s; Megatron 9317 MiB / 27.7 samples/s
gpt2_nl24_nah16_hs768_FP16_acfalse_DP1_MP8_PP1_zerofalse_stage2_mbs32_gbs256_acc8_1n8g: Libai 18456 MiB / 14.94 samples/s; Megatron 23759 MiB / 14.4 samples/s
gpt2_graph_nl24_nah16_hs1024__acfalse_DP_MP2_PP2_zerofalse_stage2_mbs8_gbs32_acc_1n8g: Libai OOM; Megatron 11057 MiB / 35.9 samples/s
gpt2_eager_nl24_nah16_hs768__acfalse_DP_MP2_PP2_zerofalse_stage2_mbs8_gbs64_acc_1n8g: Libai OOM; Megatron 14248 MiB / 52.8 samples/s

ViT separate-compilation regression test

NVIDIA_GeForce_RTX_3080_Ti master + oneflow@6e019b7 + libai@d25f09c rank_per_process + oneflow@a442869 + libai@d25f09c naive + oneflow@a442869 + libai@d25f09c6f
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_acfalse_dp1_mp4_pp1_zerotrue_stage2_mbs256_gbs256_acc1_1n4g 11002 MiB / 219.67 samples/s 10994 MiB / 219.31 samples/s 10994 MiB / 239.39 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_acfalse_dp4_mp1_pp1_zerotrue_stage2_mbs64_gbs256_acc1_1n4g 7758 MiB / 899.03 samples/s 7742 MiB / 978.03 samples/s (throughput on the high side) 7742 MiB / 905.66 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp1_pp1_zerotrue_stage2_mbs128_gbs1024_acc8_1n1g 8017 MiB / 241.06 samples/s 8009 MiB / 255.8 samples/s 8009 MiB / 259.39 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp1_pp1_zerotrue_stage2_mbs256_gbs256_acc1_1n1g 6613 MiB / 309.55 samples/s 6605 MiB / 308.78 samples/s 6605 MiB / 308.95 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp1_pp4_zerotrue_stage2_mbs128_gbs1024_acc8_1n4g 8558 MiB / 234.22 samples/s 8550 MiB / 233.68 samples/s 8550 MiB / 231.84 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp1_pp4_zerotrue_stage2_mbs256_gbs256_acc1_1n4g 5402 MiB / 275.0 samples/s 5394 MiB / 274.05 samples/s 5394 MiB / 254.43 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp2_pp2_zerotrue_stage2_mbs128_gbs1024_acc8_1n4g 8034 MiB / 219.82 samples/s 7936 MiB / 220.09 samples/s 7936 MiB / 221.11 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp2_pp2_zerotrue_stage2_mbs256_gbs256_acc1_1n4g 4912 MiB / 203.63 samples/s 4830 MiB / 204.88 samples/s 4830 MiB / 203.06 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp4_pp1_zerotrue_stage2_mbs128_gbs1024_acc8_1n4g 6470 MiB / 180.8 samples/s 6370 MiB / 178.49 samples/s 6370 MiB / 177.04 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp1_pp2_zerotrue_stage2_mbs128_gbs256_acc1_1n4g 4008 MiB / 454.2 samples/s (throughput on the low side) 3996 MiB / 502.64 samples/s 3996 MiB / 511.25 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp1_pp2_zerotrue_stage2_mbs64_gbs1024_acc8_1n4g 5640 MiB / 441.66 samples/s 5510 MiB / 443.4 samples/s 5510 MiB / 466.89 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp2_pp1_zerotrue_stage2_mbs128_gbs256_acc1_1n4g 3828 MiB / 388.04 samples/s 3782 MiB / 393.13 samples/s 3782 MiB / 390.96 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp2_pp1_zerotrue_stage2_mbs64_gbs1024_acc8_1n4g 4728 MiB / 337.81 samples/s 4646 MiB / 330.97 samples/s 4646 MiB / 329.41 samples/s
libai_vit_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp4_mp1_pp1_zerotrue_stage2_mbs32_gbs1024_acc8_1n4g 3958 MiB / 772.74 samples/s 3892 MiB / 757.1 samples/s 3882 MiB / 738.88 samples/s

ResNet_run_week separate-compilation regression test

NVIDIA_GeForce_RTX_3080_Ti master + oneflow@2c98eb12 + models@2654092c0 rank_per_process + oneflow@a442869 + models@2654092c0 naive + oneflow@a442869 + models@2654092c0
resnet50_graph_realdata_DCgpu_FP16_mb160_gb1280_acc1_1n8g [9836-10080] MiB / 1132.94 / 77.02 [9924-10168] MiB / 1090.68 / 77.17 [9922-10160] MiB / 1090.08 / 76.77
resnet50_graph_realdata_DCgpu_FP16_mb40_gb1280_acc4_1n8g [9838-10136] MiB / 281.92 / 77.1 [9920-10188] MiB / 266.86 / 76.84 [9922-10220] MiB / 272.22 / 76.94

Reproducing the case where lr shows N/A

  • case: bert actrue_DP4_MP2_PP2_zerotrue_stage2_acc4_2n8g
  • Servers: machines 26 and 28, NVIDIA_GeForce_RTX_3080_Ti
  • master@356829ec
  • plan_sep_compile_merge branch @a442869 + environment variable rank_per_process

LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerotrue_stage2_mbs32_gbs512_acc4_2n8g

case1

GPU: NVIDIA_GeForce_RTX_3080_Ti
Case: LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerotrue_stage2_mbs32_gbs512_acc4_2n8g
  • master@b51cb72: building graph Done! Cost time: 22.11s. building plan Done! Cost time: 21.92s. node0: 5774–5924 MiB, node1: 4262–4621 MiB [master_output.log]
  • rank_per_process@a442869: building plan Done! Cost time: 23.21s. building graph Done! Cost time: 20.52s. node0: 5774–5864 MiB, node1: 4262 MiB [rank_per_process_output.log]. While running this case the lr showed N/A (see output.log); in single-node tests the lr displayed normally.
  • naive@a442869: building plan Done! Cost time: 21.77s. building graph Done! Cost time: 23.8s. node0: 5736–5886 MiB, node1: 4242–4262 MiB [naive_output.log]

  • Global loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerotrue_stage2_mbs32_gbs512_acc4_2n8g

  • 50-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerotrue_stage2_mbs32_gbs512_acc4_2n8g_50-220

  • 100-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP4_MP2_PP2_zerotrue_stage2_mbs32_gbs512_acc4_2n8g_100-220

LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs32_gbs512_acc1_2n8g

GPU: NVIDIA_GeForce_RTX_3080_Ti
Case: LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs32_gbs512_acc1_2n8g
  • master@021e5e62: building plan Done! Cost time: 23.85s. building graph Done! Cost time: 11.94s. node0: 4536 MiB, node1: 3590 MiB
  • rank_per_process@021e5e62: building plan Done! Cost time: 24.18s. building graph Done! Cost time: 12.23s. node0: 8814–8960 MiB, node1: 3638 MiB
  • naive@021e5e62: building plan Done! Cost time: 24.34s. building graph Done! Cost time: 12.21s. node0: 5774–5864 MiB, node1: 4242–4262 MiB

  • Global loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs32_gbs512_acc1_2n8g

  • 50-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs32_gbs512_acc1_2n8g_50-220

  • 100-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs32_gbs512_acc1_2n8g_100-220

LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs8_gbs512_acc4_2n8g

case4

GPU: NVIDIA_GeForce_RTX_3080_Ti
Case: LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs32_gbs512_acc4_2n8g
  • master@b51cb72: building plan Done! Cost time: 24.16s. building graph Done! Cost time: 12.5s. node0: 4094 MiB, node1: 3148 MiB [master_output.log]
  • rank_per_process@021e5e62: building plan Done! Cost time: 14.09s. building graph Done! Cost time: 14.6s. node0: 4094 MiB, node1: 3148 MiB [rank_per_process_output.log]
  • naive@021e5e62: building plan Done! Cost time: 27s. building graph Done! Cost time: 14.91s. node0: 4094 MiB, node1: 3148 MiB [naive_output.log]
  • Global loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs8_gbs512_acc4_2n8g

  • 50-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs8_gbs512_acc4_2n8g_50-220

  • 100-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_actrue_DP16_MP1_PP1_zerotrue_stage2_mbs8_gbs512_acc4_2n8g_100-220

Separate-compilation tasks

NVIDIA_GeForce_RTX_3080_Ti master+oneflow@e619579b + libai@2654092c rank_per_process+oneflow@a442869 + libai@2654092c naive+oneflow@a442869 + libai@2654092c
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp1_mp1_pp1_zerofalse_stage0_mbs16_gbs128_acc8_1n1g 7801 MiB / 32.06 samples/s 7795 MiB / 31.2 samples/s 7795 MiB / 31.2 samples/s
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp1_mp1_pp1_zerofalse_stage0_mbs32_gbs32_acc1_1n1g 7937 MiB (memory usage anomalous) / 32.94 samples/s 7039 MiB / 32.96 samples/s 7039 MiB / 32.94 samples/s
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp1_mp8_pp1_zerofalse_stage0_mbs32_gbs256_acc8_1n8g 3880 MiB / 15.41 samples/s 3874 MiB / 15.38 samples/s 3874 MiB / 15.38 samples/s
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp2_mp1_pp4_zerotrue_stage2_mbs32_gbs512_acc8_1n8g 7722 MiB / 167.49 samples/s 7684 MiB / 168.03 samples/s 7684 MiB / 168.77 samples/s
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp2_mp2_pp2_zerotrue_stage2_mbs32_gbs512_acc8_1n8g 6340 MiB / 57.29 samples/s 6334 MiB / 57.19 samples/s 6304 MiB / 57.29 samples/s
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp4_mp1_pp1_zerotrue_stage2_mbs32_gbs128_acc1_1n4g 5212 MiB / 96.02 samples/s 5206 MiB / 95.9 samples/s 5206 MiB / 95.8 samples/s lr=NA
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp4_mp2_pp1_zerotrue_stage2_mbs32_gbs1024_acc8_1n8g 6154 MiB / 71.38 samples/s 6148 MiB / 71.31 samples/s 6148 MiB / 71.45 samples/s
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp4_mp2_pp1_zerotrue_stage2_mbs32_gbs128_acc1_1n8g 5442 MiB / 63.83 samples/s 5436 MiB / 63.65 samples/s 5436 MiB / 63.74 samples/s
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp8_mp1_pp1_zerotrue_stage2_mbs32_gbs2048_acc8_1n8g 6060 MiB / 223.19 samples/s 6054 MiB / 223.52 samples/s lr=NA 6054 MiB / 223.14 samples/s
libai_bert_large_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp8_mp1_pp1_zerotrue_stage2_mbs32_gbs256_acc1_1n8g 4776 MiB / 187.71 samples/s 4770 MiB / 187.7 samples/s 4770 MiB / 187.37 samples/s
libai_bert_large_pretrain_graph_nl48_nah16_hs1024_fp16_actrue_dp1_mp1_pp4_zerofalse_stage0_mbs16_gbs128_acc8_1n4g 7540 MiB / 47.8 samples/s 7534 MiB / 47.83 samples/s 7534 MiB / 47.84 samples/s
libai_bert_large_pretrain_graph_nl48_nah16_hs1024_fp16_actrue_dp1_mp1_pp8_zerofalse_stage0_mbs24_gbs384_acc16_1n8g 7920 MiB / 72.88 samples/s 7914 MiB / 74.13 samples/s 7914 MiB / 73.36 samples/s
libai_bert_large_pretrain_graph_nl48_nah16_hs1024_fp16_actrue_dp1_mp4_pp1_zerofalse_stage0_mbs32_gbs256_acc8_1n4g 7150 MiB / 8.34 samples/s 7144 MiB / 8.38 samples/s 7144 MiB / 8.34 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp1_mp1_pp1_zerofalse_stage0_mbs8_gbs64_acc8_1n1g 9081 MiB / 13.14 samples/s lr=NA 9075 MiB / 13.12 samples/s lr=NA 9075 MiB / 13.28 samples/s lr=NA
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp1_mp1_pp1_zerofalse_stage0_mbs8_gbs8_acc1_1n1g 8235 MiB / 12.95 samples/s 8091 MiB / 12.79 samples/s 8091 MiB / 12.91 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp1_mp1_pp4_zerofalse_stage0_mbs12_gbs96_acc8_1n4g 8460 MiB / 39.62 samples/s loss=inf 8454 MiB / 39.64 samples/s 8454 MiB / 39.65 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp1_mp8_pp1_zerofalse_stage0_mbs8_gbs64_acc8_1n8g 3548 MiB / 7.53 samples/s 3542 MiB / 7.55 samples/s 3542 MiB / 7.53 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp2_mp1_pp4_zerotrue_stage2_mbs8_gbs128_acc8_1n8g 7252 MiB / 64.78 samples/s 7230 MiB / 64.73 samples/s 7230 MiB / 64.77 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp2_mp2_pp2_zerotrue_stage2_mbs8_gbs128_acc8_1n8g 5518 MiB / 24.44 samples/s 5512 MiB / 24.48 samples/s 5512 MiB / 24.44 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp4_mp1_pp1_zerotrue_stage2_mbs8_gbs256_acc8_1n4g 6594 MiB / 41.94 samples/s 6588 MiB / 41.95 samples/s 6588 MiB / 41.93 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp4_mp1_pp1_zerotrue_stage2_mbs8_gbs32_acc1_1n4g 5162 MiB / 32.87 samples/s 5156 MiB / 32.84 samples/s 5156 MiB / 32.87 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp4_mp2_pp1_zerotrue_stage2_mbs8_gbs256_acc8_1n8g 6176 MiB / 32.69 samples/s 6152 MiB / 32.49 samples/s 6152 MiB / 32.43 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp4_mp2_pp1_zerotrue_stage2_mbs8_gbs32_acc1_1n8g 5458 MiB / 27.01 samples/s 5412 MiB / 27.01 samples/s 5412 MiB / 27.0 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp8_mp1_pp1_zerotrue_stage2_mbs8_gbs512_acc8_1n8g 6070 MiB / 82.3 samples/s 6064 MiB / 82.18 samples/s 6064 MiB / 82.28 samples/s
libai_gpt2_pretrain_graph_nl24_nah16_hs1024_fp16_actrue_dp8_mp1_pp1_zerotrue_stage2_mbs8_gbs64_acc1_1n8g 4726 MiB / 63.62 samples/s 4720 MiB / 63.61 samples/s 4720 MiB / 63.61 samples/s
libai_gpt2_pretrain_graph_nl48_nah16_hs1024_fp16_actrue_dp1_mp1_pp8_zerofalse_stage0_mbs6_gbs96_acc16_1n8g 6822 MiB / 34.45 samples/s 6816 MiB / 34.77 samples/s 6816 MiB / 33.95 samples/s
libai_gpt2_pretrain_graph_nl48_nah16_hs1024_fp16_actrue_dp1_mp4_pp1_zerofalse_stage0_mbs8_gbs64_acc8_1n4g 6594 MiB / 4.07 samples/s 6588 MiB / 4.05 samples/s 6588 MiB / 4.03 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp1_pp1_zerotrue_stage2_mbs128_gbs1024_acc8_1n1g 7230 MiB / 108.92 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp1_pp1_zerotrue_stage2_mbs256_gbs256_acc1_1n1g 6692 MiB / 113.19 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp1_pp4_zerotrue_stage2_mbs128_gbs1024_acc8_1n4g 0 MiB / 0 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp1_pp4_zerotrue_stage2_mbs256_gbs256_acc1_1n4g 6570 MiB / 264.65 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp2_pp2_zerotrue_stage2_mbs128_gbs1024_acc8_1n4g 10414 MiB / 107.39 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp2_pp2_zerotrue_stage2_mbs256_gbs256_acc1_1n4g 5914 MiB / 100.94 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp4_pp1_zerotrue_stage2_mbs128_gbs1024_acc8_1n4g 6748 MiB / 198.7 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp1_mp4_pp1_zerotrue_stage2_mbs256_gbs256_acc1_1n4g 5484 MiB / 218.7 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp1_pp2_zerotrue_stage2_mbs128_gbs256_acc1_1n4g 4354 MiB / 495.47 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp1_pp2_zerotrue_stage2_mbs64_gbs1024_acc8_1n4g 6530 MiB / 214.09 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp2_pp1_zerotrue_stage2_mbs128_gbs256_acc1_1n4g 4098 MiB / 163.96 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp2_pp1_zerotrue_stage2_mbs64_gbs1024_acc8_1n4g 4644 MiB / 199.26 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp2_pp2_zerotrue_stage2_mbs128_gbs2048_acc8_1n8g 10452 MiB / 182.07 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp2_mp2_pp2_zerotrue_stage2_mbs256_gbs512_acc1_1n8g 5940 MiB / 233.05 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp4_mp1_pp1_zerotrue_stage2_mbs32_gbs1024_acc8_1n4g 3518 MiB / 215.72 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp4_mp1_pp1_zerotrue_stage2_mbs64_gbs256_acc1_1n4g 3296 MiB / 217.42 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp4_mp1_pp2_zerotrue_stage2_mbs64_gbs2048_acc8_1n8g 6234 MiB / 219.87 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp4_mp2_pp1_zerotrue_stage2_mbs128_gbs512_acc1_1n8g 4074 MiB / 350.99 samples/s
libai_swin_imagenet_graph_nl12_nah12_hs768_fp16_actrue_dp4_mp2_pp1_zerotrue_stage2_mbs64_gbs2048_acc8_1n8g 4618 MiB / 213.33 samples/s

Tencent Cloud A800 Libai vs. Megatron comparison

A800 Libai vs. Megatron comparison test on GPT2

Docker environment

  1. Docker image: based on the NGC container nvcr.io/nvidia/pytorch:21.07-py3, with passwordless SSH set up, IB driver version 5.3, and the host IP list configured.
  2. TCCL plugin: nccl-rdma-sharp-plugins_1.1_amd64.deb

NCCL-Test

Set export HOME=/data_turbo/home/share/workspace; every occurrence of HOME below refers to this path.

  1. Start the Docker container: docker run --gpus all -itd --shm-size=16g --ulimit memlock=-1 --ulimit core=0 --ulimit stack=67108864 --privileged --cap-add=IPC_LOCK --name "gpt_test" --ipc host --net host -v "$HOME":"$HOME" "ngc/pytorch-21.07:ssh-ib5.4-config-py38" bash -c "sed -i 's/Port 62620/Port 10098/g' /root/.ssh/config && /usr/sbin/sshd -p 10098 && bash"
    This creates a container named gpt_test; every later mention of gpt_test refers to this container.
Parameter explanations:

--gpus all: give the container access to all available GPUs.

-itd: run the container interactively, in the background.

--shm-size=16g: set the container's shared memory size to 16 GB.

--ulimit memlock=-1: allow the container to lock an unlimited amount of memory.

--ulimit core=0: disable core-dump generation inside the container.

--ulimit stack=67108864: set the container's stack size to 64 MB.

--privileged: grant the container all privileges.

--cap-add=IPC_LOCK: allow the container to lock shared memory.

--name "gpt_test": name the container "gpt_test".

--ipc host: share the host's IPC namespace.

--net host: use the host's network namespace.

-v "$HOME":"$HOME": mount the host's $HOME directory into the container at the same path, so the container can access files on the host.

"ngc/pytorch-21.07:ssh-ib5.4-config-py38": the name and tag of the Docker image to use.

bash -c "sed -i 's/Port 62620/Port 10098/g' /root/.ssh/config && /usr/sbin/sshd -p 10098 && bash": the command executed at container start. It changes the port in the container's SSH config, starts an SSH server on port 10098, and finally opens a new bash shell.
  2. Run the container on machines 051, 052, 053, 054, 055, 056, 057, 058.
  3. Log in to the docker container on any GPU node: docker attach gpt_test
  4. Download nccl-tests: cd $HOME && git clone https://github.com/NVIDIA/nccl-tests.git
  5. Compile nccl-tests: cd nccl-tests && make MPI=1 MPI_HOME=/usr/local/mpi/
  6. Test across the 8 machines with message sizes growing by a factor of 2: mpirun -np 64 -H 051:8,052:8,053:8,054:8,055:8,056:8,057:8,058:8 --allow-run-as-root -bind-to none -map-by slot -x NCCL_IB_DISABLE=0 -x NCCL_IB_GID_INDEX=3 -x NCCL_GDR_LEVEL=2 -x NCCL_DEBUG=INFO -x NCCL_IB_QPS_PER_CONNECTION=4 -x NCCL_IB_TC=160 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl_tcp_if_include bond0 -mca btl ^openib $HOME/nccl-tests/build/all_reduce_perf -b 2G -e 4G -f 2 -g 1 | tee $HOME/nccl-tests/nccl_log/nccl_increace_16n8g.log
    Test across the 8 machines at a fixed size, repeated 100 times: mpirun -np 64 -H 051:8,052:8,053:8,054:8,055:8,056:8,057:8,058:8 --allow-run-as-root -bind-to none -map-by slot -x NCCL_IB_DISABLE=0 -x NCCL_IB_GID_INDEX=3 -x NCCL_GDR_LEVEL=2 -x NCCL_DEBUG=INFO -x NCCL_IB_QPS_PER_CONNECTION=4 -x NCCL_IB_TC=160 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl_tcp_if_include bond0 -mca btl ^openib $HOME/nccl-tests/build/all_reduce_perf -b 4G -e 4G -n 100 -g 1 | tee $HOME/nccl-tests/nccl_log/nccl_stable_16n8g.log
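A practical note, as an assumption about this setup: the nccl_log directory referenced by the tee commands above is not part of a fresh nccl-tests checkout, and tee does not create missing parent directories, so it likely has to be created before the first run:

mkdir -p $HOME/nccl-tests/nccl_log   # otherwise tee fails with "No such file or directory"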
Parameter details:

mpirun: the command that launches a parallel program with MPI.
-np 64: the number of processes to launch, 64 in this example.

-H 051:8,052:8,053:8,054:8,055:8,056:8,057:8,058:8: which nodes to launch processes on and how many per node; here, 8 processes on each of nodes 051 through 058.

--allow-run-as-root: allow running the program as the root user.

-bind-to none: do not bind processes to specific processing units.

-map-by slot: assign processes to processing units by slot, where each slot represents one processing unit.

-x NCCL_IB_DISABLE=0: export NCCL_IB_DISABLE=0, enabling InfiniBand communication.

-x NCCL_IB_GID_INDEX=3: set NCCL_IB_GID_INDEX to 3, the GID (global identifier) index used for InfiniBand communication.

-x NCCL_GDR_LEVEL=2: set NCCL_GDR_LEVEL to 2, enabling GPU Direct RDMA (remote direct memory access) for GPU-to-GPU communication.

-x NCCL_DEBUG=INFO: set NCCL_DEBUG to INFO, enabling debug output for NCCL (NVIDIA Collective Communications Library) operations.

-x NCCL_IB_QPS_PER_CONNECTION=4: set NCCL_IB_QPS_PER_CONNECTION to 4, the number of QPs (queue pairs) per IB (InfiniBand) connection.

-x NCCL_IB_TC=160: set NCCL_IB_TC to 160, the traffic class used for InfiniBand communication.

-x LD_LIBRARY_PATH: export the current LD_LIBRARY_PATH, the environment variable that specifies the shared-library search path.

-x PATH: export the current PATH, the environment variable that specifies the executable search path.

-mca pml ob1: select the PML (point-to-point messaging layer) module to use, Open MPI's ob1 module.

-mca btl_tcp_if_include bond0: restrict the TCP BTL (byte transfer layer) module to the bond0 network interface.

-mca btl ^openib: disable Open MPI's openib BTL module (InfiniBand via Open MPI).

$HOME/nccl-tests/build/all_reduce_perf: the path of the executable to run under MPI.

-b 2G: starting message size.

-e 4G: ending message size.

-f 2: multiplication factor between message sizes.

-n 100: number of iterations.

-g 1: the number of GPUs per process, 1 in this example.

| tee $HOME/nccl-tests/nccl_log/nccl_increace_16n8g.log: pipe the program's standard output through tee, which also writes it to the given file.
  7. Result: busbw is around 167 GB/s (full log). In the log, look for lines such as "Using devices", "GDRDMA", and "Using network IBext".
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
2147483648     536870912     float     sum      -1    25040   85.76  168.84      0    24828   86.49  170.28      0
4294967296    1073741824     float     sum      -1    52185   82.30  162.03      0    50414   85.19  167.73      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 167.221
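As a cross-check on the numbers above: for all_reduce, nccl-tests derives bus bandwidth from algorithm bandwidth as busbw = algbw * 2 * (n - 1) / n, where n is the number of ranks (64 here). A one-liner reproducing the first row:

awk 'BEGIN { n = 64; algbw = 85.76; print algbw * 2 * (n - 1) / n }'   # 168.84, matching the busbw column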

Running Libai training

  1. Check that all the containers are up: ansible 051,052,053,054,055,056,057,058 -m shell -a "docker ps | grep gpt_test"
  2. Environment setup and dataset download: run_libai_gpt.sh
  3. On any GPU machine, cd $HOME/libai and create a script for launching multi-node DDP training, run_train_libai.sh:
set -ex
# Parse the input arguments: a comma-separated host list and the master address
hosts=$(echo $1 | tr "," "\n")
master_addr=$2
# Count the hosts first, then derive the global batch size (16 samples per node)
num_hosts=$(echo $hosts | wc -w)
global_batch_size=$(($num_hosts * 16))

# Use ansible to launch the training command on every host
for (( i=0; i<$num_hosts; i++ )); do
  host=$(echo $hosts | cut -d " " -f $((i+1)))

  ansible $host -m shell -a "docker exec gpt_test bash -c 'cd $HOME/libai && bash tools/args_train.sh configs/gpt2_pretrain.py $num_hosts 8 $i $master_addr 1 1 true true true 2 $global_batch_size false 2 220 100 48 144 2304 9216'" &

done
  4. bash run_train_libai.sh 051,052,053,054,055,056,057,058 10.0.0.114 starts training on the 8 nodes. The first host must be the master node, and the second argument is the master node's address.
  5. Libai training script: args_train.sh. Its parameters are explained below:

* CONFIG: path to the config file.
* NNODES: number of nodes, default 1.
* GPUS_PER_NODE: number of GPUs per node, default 8.
* NODE_RANK: rank of the current node in the node group, default 0.
* MASTER_ADDR: master node address, default "127.0.0.1".
* MASTER_PORT: master node port, default 12345.
* MP: model-parallel group size, default 1.
* DP = (NNODES * GPUS_PER_NODE) / MP / PP (derived; see the sketch after this list)
* PP: pipeline-parallel group size, default 1.
* GRAPH_ENABLED: whether to enable graph mode, default true.
* USE_FP16: whether to train with FP16 mixed precision, default true.
* ACTIVATION_CHECKPOINT: whether to enable activation checkpointing, default false.
* MICRO_BATCH_SIZE: number of samples per micro-batch, default 4.
* GLOBAL_BATCH_SIZE: global batch size, default 4.
* ZERO_ENABLE: whether to enable ZeRO optimization, default false.
* ZERO_STAGE: ZeRO stage to use, default 2.
* TRAIN_ITERS: number of training iterations, default 220.
* LOG_PERIOD: logging period in iterations, default 100.
* NUM_LAYER: number of Transformer layers, default 12.
* NUM_ATT_HEADS: number of self-attention heads, default 12.
* HIDDEN_SIZE: hidden size, default 768.
* INTERMEDIATE_SIZE: intermediate size of the feedforward layer, default 3072.
* HEAD_SIZE: size of each head in multi-head attention, default 64.
* SAVE_MODEL: whether to save the model, default false.
* UNSET_DROPOUT: whether to disable dropout, default false.
* ACC = GLOBAL_BATCH_SIZE / (DP * MICRO_BATCH_SIZE) (derived; see the sketch after this list)
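As a sanity check on the two derived quantities DP and ACC, here is a minimal bash sketch using the values from the DP4_MP2_PP2 mbs32_gbs512_acc4_2n8g cases earlier in this report; the variable names follow the list above, and the script is illustrative rather than part of args_train.sh:

# Derive DP and ACC from the launch parameters
NNODES=2; GPUS_PER_NODE=8                               # 2n8g
MP=2; PP=2                                              # model- and pipeline-parallel sizes
MICRO_BATCH_SIZE=32; GLOBAL_BATCH_SIZE=512

DP=$(( NNODES * GPUS_PER_NODE / (MP * PP) ))            # 16 / 4 = 4
ACC=$(( GLOBAL_BATCH_SIZE / (DP * MICRO_BATCH_SIZE) ))  # 512 / 128 = 4
echo "DP=$DP ACC=$ACC"                                  # matches ..._DP4_MP2_PP2_..._acc4_2n8g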

Running Megatron training

  1. Check that all the containers are up: ansible 051,052,053,054,055,056,057,058 -m shell -a "docker ps | grep gpt_test"
  2. Environment setup and dataset download: run_megatron_ml_gpt.sh
  3. On any GPU machine, cd $HOME/Megatron-LM and create a script for launching multi-node DDP training, run_train_megatron.sh:
set -ex
# Parse the input arguments: a comma-separated host list and the master address
hosts=$(echo $1 | tr "," "\n")
master_addr=$2
# Count the hosts first, then derive the global batch size (16 samples per node)
num_hosts=$(echo $hosts | wc -w)
global_batch_size=$(($num_hosts * 16))

# Use ansible to launch the training command on every host
for (( i=0; i<$num_hosts; i++ )); do
  host=$(echo $hosts | cut -d " " -f $((i+1)))

  ansible $host -m shell -a "docker exec gpt_test bash -c 'cd $HOME/Megatron-LM && bash examples/megatron_args_pretrain_gpt2.sh $num_hosts 8 $i $master_addr 1 1 true true true 2 $global_batch_size false 2 220 100 48 144 2304 9216'" &

done
  4. bash run_train_megatron.sh 051,052,053,054,055,056,057,058 10.0.0.114 starts training on the 8 nodes. The first host must be the master node, and the second argument is the master node's address.
  5. Megatron training script: megatron_args_pretrain_gpt2.sh; its parameters are fully aligned with the libai training script.

Training results

Libai vs. Megatron parallelism test

NVIDIA Graphics Device A800 80G, OneFlow_eb3df25 vs. Megatron_e156d2f:

gpt2_pretrain_graph_nl48_nah144_hs2304_FP16_actrue_DP64_MP1_PP1_zerofalse_stage2_mbs2_gbs128_acc1_8n8g
  • OneFlow: building graph cost time: 22.23s / building plan cost time: 166.74s / 66997–67197 MiB / 138.98 samples/s
  • Megatron: 64754–64826 MiB / 132.6 samples/s

gpt2_pretrain_graph_nl64_nah144_hs2304_FP16_actrue_DP32_MP2_PP1_zerofalse_stage2_mbs32_gbs1024_acc1_8n8g
  • OneFlow: building graph cost time: 46.29s / building plan cost time: 247.49s / 55737–55811 MiB / 126.40 samples/s
  • Megatron: OOM

gpt2_pretrain_graph_nl80_nah144_hs2304_FP16_actrue_DP16_MP2_PP2_zerofalse_stage2_mbs64_gbs1024_acc1_8n8g
  • OneFlow: building graph cost time: 55.35s / building plan cost time: 168.52s / 66296–67969 MiB / 50.68 samples/s
  • Megatron: OOM

gpt2_pretrain_graph_nl64_nah144_hs2304_FP16_actrue_DP32_MP2_PP1_zerofalse_stage2_mbs16_gbs512_acc1_8n8g
  • OneFlow: building graph cost time: 49.1s / building plan cost time: 250.73s / 45741–45813 MiB / 118.90 samples/s
  • Megatron: 55190–55212 MiB / 121.5 samples/s

gpt2_pretrain_graph_nl80_nah144_hs2304_FP16_actrue_DP16_MP2_PP2_zerofalse_stage2_mbs16_gbs256_acc1_8n8g
  • OneFlow: building graph cost time: 54.59s / building plan cost time: 162.58s / 33052–34771 MiB / 46.69 samples/s
  • Megatron: 41326–41402 MiB / 51.3 samples/s

GPT2 script training notes

There are three branches in OneAutoTest (https://github.com/Oneflow-Inc/OneAutoTest/branches):

(screenshot of the branch list, taken 2023-04-29)

They are megatron_script_tecent, megatron_script_huoshan, and megatron_script_sahngtang, which hold the scripts adapted for the Tencent Cloud, Volcano Engine, and SenseTime platforms respectively.
The training scripts live under OneAutoTest/onebench/libai/; four of them need particular attention and use:
Make sure to use the correct branch when downloading (a checkout sketch follows the list below)!

  1. args_train.sh

    The main libai training script: it defines the training parameters and passes them to the training launcher.

  2. run_libai_gpt.sh

    Sets up the environment needed for libai training, including installing OneFlow and Libai and downloading the required datasets, then calls args_train.sh to start training. Only two example cases are configured in the script; adjust the parameters for custom needs.

  3. megatron_args_pretrain_gpt2.sh

    The main Megatron training script; its training parameters are aligned with libai's.

  4. run_megatron_ml_gpt.sh

    Sets up the environment needed for Megatron training, including downloading Megatron and the datasets, then calls megatron_args_pretrain_gpt2.sh to start training. Only one example case is configured in the script; adjust the parameters for custom needs.
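Since the branch matters, here is a minimal checkout sketch, assuming the Tencent Cloud scripts are the ones wanted (substitute one of the other branch names as needed):

git clone https://github.com/Oneflow-Inc/OneAutoTest.git
cd OneAutoTest
git checkout megatron_script_tecent   # or megatron_script_huoshan / megatron_script_sahngtang
ls onebench/libai/                    # args_train.sh, run_libai_gpt.sh, megatron_args_pretrain_gpt2.sh, run_megatron_ml_gpt.sh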

LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_acfalse_DP4_MP2_PP2_zerofalse_stage0_mbs4_gbs64_acc4_2n8g

case3

GPU: NVIDIA_GeForce_RTX_3080_Ti
Case: LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_acfalse_DP4_MP2_PP2_zerofalse_stage0_mbs4_gbs64_acc4_2n8g
  • master@b51cb72: building plan Done! Cost time: 18.92s. building graph Done! Cost time: 19.91s. node0: 8814–8960 MiB, node1: 3638 MiB [master_output.log]
  • rank_per_process@a442869: building plan Done! Cost time: 15.94s. building graph Done! Cost time: 22.09s. node0: 8808–8960 MiB, node1: 3638 MiB [rank_per_process_output.log]
  • naive@a442869: building plan Done! Cost time: 18.92s. building graph Done! Cost time: 22.16s. node0: 8808–8954 MiB, node1: 3638 MiB [naive_output.log]


  • Global loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_acfalse_DP4_MP2_PP2_zerofalse_stage0_mbs4_gbs64_acc4_2n8g

  • 50-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_acfalse_DP4_MP2_PP2_zerofalse_stage0_mbs4_gbs64_acc4_2n8g_50-220

  • 100-step loss curve comparison
    LibAI_bert_large_pretrain_graph_nl24_nah16_hs1024_FP16_acfalse_DP4_MP2_PP2_zerofalse_stage0_mbs4_gbs64_acc4_2n8g_100-220

Tencent Cloud libai_gpt vs. megatron_gpt comparison test

Tencent Cloud Libai vs. Megatron comparison test on GPT2

  • GPT-2

Setup (libai vs. Megatron):
Dataset: both use loss_compara_content_sentence.bin and loss_compara_content_sentence.idx
Vocab: both use gpt2-vocab.json
Merges: both use gpt2-merges.txt
Test scripts: args_train.sh (libai) and megatron_args_pretrain_gpt2.sh (Megatron)
  • Test environment

    OneFlow (master branch) eb3df25, Libai (main branch) f728a5ec, Megatron (main branch) e156d2f
  • NCCL_TEST

Machines 028 and 029, 2n8g (full log):

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
a800-028:49082:49082 [0] NCCL INFO Launch mode Parallel
  2147483648     536870912     float     sum      -1    27260   78.78  147.71      0    28329   75.81  142.14      0
  4294967296    1073741824     float     sum      -1    53872   79.73  149.49      0    53867   79.73  149.50      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 147.207 

Machines 025, 026, 028, 029, 4n8g (full log):

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
a800-028:48888:48998 [1] NCCL INFO comm 0x7fe49c000fa0 rank 1 nranks 32 cudaDev 1 busId 24000 - Init COMPLETE
a800-028:48887:48887 [0] NCCL INFO Launch mode Parallel
  2147483648     536870912     float     sum      -1    43516   49.35   95.61      0    43263   49.64   96.17      0
  4294967296    1073741824     float     sum      -1    91891   46.74   90.56      0    94436   45.48   88.12      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 92.616 

Machines 051–058 (8 GPUs each), 8n8g (full log):

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
  2147483648     536870912     float     sum      -1    25432   84.44  166.24      0    25383   84.60  166.56      0
  4294967296    1073741824     float     sum      -1    53176   80.77  159.01      0    50261   85.45  168.24      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 165.013 
  • Test results

NVIDIA Graphics Device A800 80G, OneFlow_eb3df25 vs. Megatron_e156d2f:

gpt2_pretrain_graph_nl48_nah144_hs2304_FP16_actrue_DP16_MP1_PP1_zerofalse_stage2_mbs2_gbs32_acc1_2n8g
  • OneFlow: building graph cost time: 28.8s / building plan cost time: 50.61s / 66491–66801 MiB / 36.36 samples/s
  • Megatron: 64626–64882 MiB / 33.4 samples/s

gpt2_pretrain_graph_nl48_nah144_hs2304_FP16_actrue_DP16_MP1_PP1_zerofalse_stage2_mbs2_gbs128_acc4_2n8g
  • OneFlow: building graph cost time: 33.07s / building plan cost time: 56.48s / 64789–65229 MiB / 41.66 samples/s
  • Megatron: 64626–64882 MiB / 41.3 samples/s

gpt2_pretrain_graph_nl48_nah144_hs2304_FP16_actrue_DP32_MP1_PP1_zerofalse_stage2_mbs2_gbs64_acc1_4n8g
  • OneFlow: building graph cost time: 30.57s / building plan cost time: 84.44s / 66491–66803 MiB / 71.95 samples/s
  • Megatron: 64626–64882 MiB / 62.3 samples/s

gpt2_pretrain_graph_nl48_nah144_hs2304_FP16_actrue_DP32_MP1_PP1_zerofalse_stage2_mbs2_gbs256_acc4_4n8g
  • OneFlow: building graph cost time: 38.18s / building plan cost time: 85.25s / 64791–65229 MiB / 79.22 samples/s
  • Megatron: 64626–64882 MiB / 80.2 samples/s

gpt2_pretrain_graph_nl48_nah144_hs2304_FP16_actrue_DP64_MP1_PP1_zerofalse_stage2_mbs2_gbs128_acc1_8n8g
  • OneFlow: building graph cost time: 22.23s / building plan cost time: 166.74s / 66997–67197 MiB / 138.98 samples/s
  • Megatron: 64754–64826 MiB / 132.6 samples/s

gpt2_pretrain_graph_nl48_nah144_hs2304_FP16_actrue_DP64_MP1_PP1_zerofalse_stage2_mbs2_gbs512_acc4_8n8g
  • OneFlow: building graph cost time: 28.06s / building plan cost time: 167.93s / 65547–65819 MiB / 162.54 samples/s
  • Megatron: 64754–64826 MiB / 165.3 samples/s

gpt2_pretrain_graph_nl48_nah144_hs2304_FP16_actrue_DP128_MP1_PP1_zerofalse_stage2_mbs2_gbs256_acc1_16n8g
  • OneFlow: building graph cost time: 27.79s / building plan cost time: 394.68s / 66999–67197 MiB / 249.03 samples/s
  • Megatron: 64684–64828 MiB / 260.4 samples/s

LibAI_gpt2_pretrain_graph_nl48_nah144_hs2304_FP16_actrue_DP128_MP1_PP1_zerofalse_stage2_mbs2_gbs1024_acc4_16n8g
  • OneFlow: building graph cost time: 35.72s / building plan cost time: 365.7s / 65549–65819 MiB / 312.32 samples/s
  • Megatron: 64684–64828 MiB / 277.1 samples/s

Volcano Engine libai_gpt vs. megatron_gpt comparison test

  • GPT-2

Setup (libai vs. Megatron):
Dataset: both use loss_compara_content_sentence.bin and loss_compara_content_sentence.idx
Vocab: both use bert-base-chinese-vocab.txt
Test scripts: args_train.sh (libai) and megatron_args_pretrain_gpt2.sh (Megatron)
  • Test environment

    OneFlow (master branch) 0d4bc37, Libai (main branch) f728a5ec, Megatron (main branch) e156d2f
  • Test results

NVIDIA A100-SXM4-80GB, OneFlow_0d4bc37 vs. Megatron_e156d2f:

gpt2_pretrain_graph_nl60_nah144_hs2304_FP16_actrue_DP16_MP1_PP1_zerofalse_stage2_mbs2_gbs32_acc1_2n8g
  • OneFlow: 80290–80370 MiB / 29.15 samples/s
  • Megatron: 77944–78072 MiB / 26.4 samples/s

gpt2_pretrain_graph_nl60_nah144_hs2304_FP16_actrue_DP16_MP1_PP1_zerofalse_stage2_mbs2_gbs32_acc4_2n8g
  • OneFlow: 78316–78573 MiB / 33 samples/s
  • Megatron: 77944–78072 MiB / 33.8 samples/s
