Comments (7)
I also encountered the same problem with a non-MoE model.
I tried to run a training job for a Llama 13B model on two DGX A100 nodes, but the time breakdown shows:
forward-backward ...............................: (5662.82, 5666.72)
forward-compute ................................: (2146.30, 2210.56)
backward-compute ...............................: (3431.20, 3509.58)
batch-generator ................................: (17.31, 33.45)
layernorm-grads-all-reduce .....................: (5.24, 218.94)
embedding-grads-all-reduce .....................: (0.06, 0.11)
all-grads-sync .................................: (215891.91, 225072.22)
optimizer-copy-to-main-grad ....................: (9.13, 9.19)
optimizer-unscale-and-check-inf ................: (9.69, 9.88)
optimizer-clip-main-grad .......................: (14.55, 14.77)
optimizer-count-zeros ..........................: (0.02, 0.07)
optimizer-inner-step ...........................: (31.58, 32.33)
optimizer-copy-main-to-model-params ............: (9.36, 9.57)
optimizer ......................................: (77.15, 77.37)
(I disabled all overlap-* optimizations and the distributed optimizer for a more accurate time breakdown.) Gradient AllReduce (all-grads-sync) takes more than 200 seconds while forward-backward takes just 5.6 seconds.
The problem occurs regardless of whether the distributed optimizer is used:
forward-backward ...............................: (6640.79, 6647.08)
forward-compute ................................: (3118.90, 3181.81)
backward-compute ...............................: (3428.96, 3512.83)
batch-generator ................................: (16.72, 34.26)
layernorm-grads-all-reduce .....................: (4.97, 11.08)
embedding-grads-all-reduce .....................: (0.06, 0.12)
all-grads-sync .................................: (77025.69, 112368.28)
params-all-gather ..............................: (77461.61, 112343.53)
optimizer-copy-to-main-grad ....................: (4.65, 4.82)
optimizer-unscale-and-check-inf ................: (5.37, 5.39)
optimizer-clip-main-grad .......................: (7.70, 7.74)
optimizer-count-zeros ..........................: (0.02, 0.03)
optimizer-inner-step ...........................: (15.89, 16.30)
optimizer-copy-main-to-model-params ............: (4.53, 4.56)
optimizer ......................................: (77502.36, 112384.28)
When I run the same job on a single node, the problem disappears.
My environment is:
- Megatron commit ID: core_v0.4.0
- NGC PyTorch container version: 23.04
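A minimal cross-node AllReduce micro-benchmark can help confirm whether the interconnect itself, rather than Megatron, is the bottleneck. The sketch below assumes a torchrun launch across the two nodes (2 x 8 GPUs) and an illustrative 1 GiB fp16 buffer; the sizes and iteration counts are placeholders:

# allreduce_bench.py
# Launch with: torchrun --nnodes=2 --nproc_per_node=8 allreduce_bench.py
import os
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # 1 GiB fp16 buffer, roughly the size of one large gradient bucket
    numel = 512 * 1024 * 1024
    buf = torch.ones(numel, dtype=torch.float16, device="cuda")

    for _ in range(5):  # warm-up iterations
        dist.all_reduce(buf)
    torch.cuda.synchronize()

    iters = 20
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    elapsed = (time.time() - start) / iters

    if dist.get_rank() == 0:
        gb = buf.numel() * buf.element_size() / 1e9
        # a ring all-reduce moves roughly 2*(n-1)/n of the data per GPU
        n = dist.get_world_size()
        busbw = 2 * (n - 1) / n * gb / elapsed
        print(f"avg latency {elapsed * 1e3:.1f} ms, bus bandwidth {busbw:.1f} GB/s")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If the measured bus bandwidth is far below the expected InfiniBand rate, the slowdown is in the fabric or its configuration rather than in Megatron's code.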
My issue has been resolved by passing --device=/dev/infiniband in the docker run arguments.
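For reference, the flag goes into the container launch roughly like this; the image tag matches the NGC 23.04 container mentioned above, and the other flags are illustrative:

docker run --gpus all --ipc=host --device=/dev/infiniband \
    nvcr.io/nvidia/pytorch:23.04-py3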
ktaebum's issue is unrelated. We only notice a slowdown in some steps, and it is due to intra-node AllGather calls, which take surprisingly long in those steps.
I have encountered the same problem with MoE when the router type is sinkhorn and topK > 1.
From my log, I found that the main time consumption is in the sinkhorn function:
norm_logits = sinkhorn(logits.to(dtype=torch.float32))
When topk > 1 and the router type is sinkhorn, the inner loop of the sinkhorn function iterates thousands of times for some logits.
But I didn't find any clue in those logits; they look similar to normal ones.
@Teng-xu @dawson-chen
Thanks for reporting this issue. This could be due to too many iterations in Sinkhorn on some ranks. You can try adding an early stop to Sinkhorn or using aux_loss for load balancing.
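For example, a minimal sketch of a Sinkhorn normalization with such an early stop; this is illustrative rather than the actual Megatron implementation, and the tolerance, epsilon, and iteration cap are placeholder values:

import torch

def sinkhorn_with_early_stop(cost: torch.Tensor,
                             tol: float = 1e-4,
                             max_iter: int = 100) -> torch.Tensor:
    # Sinkhorn normalization of a (tokens x experts) cost matrix,
    # with an iteration cap so ill-conditioned logits cannot keep the
    # loop running for thousands of iterations.
    cost = torch.exp(cost)
    eps = 1e-8
    d0 = torch.ones(cost.size(0), device=cost.device, dtype=cost.dtype)
    d1 = torch.ones(cost.size(1), device=cost.device, dtype=cost.dtype)
    d1_old = d1
    for _ in range(max_iter):  # early stop: bounded number of iterations
        d0 = (1.0 / d0.size(0)) / (torch.sum(d1 * cost, dim=1) + eps)
        d1 = (1.0 / d1.size(0)) / (torch.sum(d0.unsqueeze(1) * cost, dim=0) + eps)
        error = torch.mean(torch.abs(d1_old - d1))
        d1_old = d1
        if error < tol:  # converged before hitting the cap
            break
    return d1 * cost * d0.unsqueeze(1)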
How can I get Model TFLOPS/GPU?
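One common way to estimate it is the analytical FLOPs-per-iteration formula from the Megatron-LM paper, divided by the logged iteration time and the number of GPUs. A minimal sketch follows; all values below are placeholders for your own run, and the factor of 96 assumes full activation recomputation:

# Estimate achieved Model TFLOPS/GPU from the logged elapsed time per iteration,
# using F = 96 * B * s * l * h^2 * (1 + s/(6h) + V/(16*l*h)).
seq_len      = 4096    # s: sequence length
hidden       = 5120    # h: hidden size
layers       = 40      # l: number of transformer layers
vocab        = 32000   # V: vocabulary size
global_batch = 256     # B: sequences per iteration
iter_time_s  = 10.0    # elapsed time per iteration, in seconds (from the log)
num_gpus     = 16

flops_per_iter = (
    96 * global_batch * seq_len * layers * hidden ** 2
    * (1 + seq_len / (6 * hidden) + vocab / (16 * layers * hidden))
)
tflops_per_gpu = flops_per_iter / (iter_time_s * num_gpus * 1e12)
print(f"Model TFLOPS/GPU ~ {tflops_per_gpu:.1f}")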