Comments (7)

ktaebum commented on June 21, 2024

Also encountered the same problem with a non-MoE model. I tried to run a Llama 13B training job on two DGX A100 nodes, but the time breakdown shows:

    forward-backward ...............................: (5662.82, 5666.72)
    forward-compute ................................: (2146.30, 2210.56)
    backward-compute ...............................: (3431.20, 3509.58)
    batch-generator ................................: (17.31, 33.45)
    layernorm-grads-all-reduce .....................: (5.24, 218.94)
    embedding-grads-all-reduce .....................: (0.06, 0.11)
    all-grads-sync .................................: (215891.91, 225072.22)
    optimizer-copy-to-main-grad ....................: (9.13, 9.19)
    optimizer-unscale-and-check-inf ................: (9.69, 9.88)
    optimizer-clip-main-grad .......................: (14.55, 14.77)
    optimizer-count-zeros ..........................: (0.02, 0.07)
    optimizer-inner-step ...........................: (31.58, 32.33)
    optimizer-copy-main-to-model-params ............: (9.36, 9.57)
    optimizer ......................................: (77.15, 77.37)

(I disabled all overlap-* optimizations and the distributed optimizer for a more accurate time breakdown.) Gradient all-reduce takes more than 200 seconds, while forward-backward takes only about 5.6 seconds.
The problem occurs regardless of whether the distributed optimizer is used:

    forward-backward ...............................: (6640.79, 6647.08)
    forward-compute ................................: (3118.90, 3181.81)
    backward-compute ...............................: (3428.96, 3512.83)
    batch-generator ................................: (16.72, 34.26)
    layernorm-grads-all-reduce .....................: (4.97, 11.08)
    embedding-grads-all-reduce .....................: (0.06, 0.12)
    all-grads-sync .................................: (77025.69, 112368.28)
    params-all-gather ..............................: (77461.61, 112343.53)
    optimizer-copy-to-main-grad ....................: (4.65, 4.82)
    optimizer-unscale-and-check-inf ................: (5.37, 5.39)
    optimizer-clip-main-grad .......................: (7.70, 7.74)
    optimizer-count-zeros ..........................: (0.02, 0.03)
    optimizer-inner-step ...........................: (15.89, 16.30)
    optimizer-copy-main-to-model-params ............: (4.53, 4.56)
    optimizer ......................................: (77502.36, 112384.28)

When I run the same job on a single node, the problem disappears.

My environment is:

  • Megatron commit ID: core_v0.4.0
  • NGC PyTorch container version: 23.04
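
Since the slowdown only appears when running across nodes, one quick way to check whether inter-node NCCL bandwidth is the bottleneck is a standalone all-reduce benchmark outside of Megatron. The sketch below is only illustrative (the function name, buffer size, and launch details are assumptions), run with torchrun across both nodes:

    import time
    import torch
    import torch.distributed as dist

    def allreduce_bandwidth(size_mb=256, iters=20):
        """Time a large all-reduce to estimate inter-node NCCL bandwidth."""
        rank = dist.get_rank()
        world = dist.get_world_size()
        device = torch.device("cuda", rank % torch.cuda.device_count())
        buf = torch.randn(size_mb * 1024 * 1024 // 4, device=device)  # fp32 elements

        for _ in range(5):                      # warm-up
            dist.all_reduce(buf)
        torch.cuda.synchronize(device)

        start = time.time()
        for _ in range(iters):
            dist.all_reduce(buf)
        torch.cuda.synchronize(device)
        elapsed = (time.time() - start) / iters

        algbw = buf.numel() * 4 / elapsed / 1e9        # GB/s seen by each rank
        busbw = algbw * 2 * (world - 1) / world        # ring all-reduce bus bandwidth
        if rank == 0:
            print(f"all-reduce {size_mb} MiB: {elapsed * 1e3:.1f} ms, "
                  f"algbw {algbw:.1f} GB/s, busbw {busbw:.1f} GB/s")

    if __name__ == "__main__":
        dist.init_process_group(backend="nccl")  # e.g. launched via torchrun on both nodes
        allreduce_bandwidth()
        dist.destroy_process_group()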

ktaebum commented on June 21, 2024

My issue was resolved by passing --device=/dev/infiniband as a docker run argument.

rahul003 commented on June 21, 2024

ktaebum's issue is unrelated to ours. We only notice a slowdown in some steps, and it is due to intra-node AllGather calls, which take surprisingly long in those steps.

dawson-chen commented on June 21, 2024

I have encountered the same problem with MoE, when the router type is sinkhorn and top-k > 1.

From my log, I found that the main time consumption is in the sinkhorn function:

    norm_logits = sinkhorn(logits.to(dtype=torch.float32))

dawson-chen commented on June 21, 2024

When top-k > 1 and the router type is sinkhorn, the inner loop of the sinkhorn function runs for thousands of iterations on some logits.

But I didn't find any clue in those logits; they look similar to normal ones.

yanring commented on June 21, 2024

@Teng-xu @dawson-chen
Thanks for reporting this issue. This could be due to too many iterations in Sinkhorn on some ranks. You can try adding an early stop to Sinkhorn or using aux_loss for load balancing.
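
For reference, a rough sketch of what such an early stop could look like, based on the standard Sinkhorn normalization (illustrative only; the function name, iteration cap, and tolerance are assumptions and may differ from the exact Megatron-LM implementation):

    import torch

    def sinkhorn_with_cap(cost, tol=1e-4, max_iters=100):
        """Sinkhorn normalization with an iteration cap as an early stop (sketch)."""
        cost = torch.exp(cost)  # cost: (num_tokens, num_experts) router logits
        d0 = torch.ones(cost.size(0), device=cost.device, dtype=cost.dtype)
        d1 = torch.ones(cost.size(1), device=cost.device, dtype=cost.dtype)

        eps = 1e-8
        d1_old = d1
        for _ in range(max_iters):  # hard cap instead of looping until convergence
            d0 = (1.0 / d0.size(0)) / (torch.sum(d1 * cost, dim=1) + eps)
            d1 = (1.0 / d1.size(0)) / (torch.sum(d0.unsqueeze(1) * cost, dim=0) + eps)
            if torch.mean(torch.abs(d1_old - d1)) < tol:  # usual convergence criterion
                break
            d1_old = d1
        return d1 * cost * d0.unsqueeze(1)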

wen020 commented on June 21, 2024

How do I get the Model TFLOPS/GPU number?
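
One rough way to estimate it is from the model configuration and the measured iteration time, using the per-iteration FLOP count from the Megatron-LM paper. The sketch below is an approximation for a dense GPT-style model without activation recomputation, and the example numbers are made up; recent Megatron-LM versions can also log throughput directly (e.g. a --log-throughput option), if your version supports it:

    def model_tflops_per_gpu(batch_size, seq_len, num_layers, hidden_size,
                             vocab_size, iter_time_s, num_gpus):
        """Approximate achieved model TFLOP/s per GPU for a dense GPT-style model.

        Uses the per-iteration FLOP count from the Megatron-LM paper
        (no activation recomputation); treat it as an estimate, not the
        exact value Megatron itself would log.
        """
        flops_per_iter = (
            96 * batch_size * seq_len * num_layers * hidden_size ** 2
            * (1 + seq_len / (6 * hidden_size)
                 + vocab_size / (16 * num_layers * hidden_size))
        )
        return flops_per_iter / (iter_time_s * num_gpus * 1e12)

    # Illustrative numbers only (not taken from this issue):
    print(model_tflops_per_gpu(batch_size=256, seq_len=2048, num_layers=40,
                               hidden_size=5120, vocab_size=32000,
                               iter_time_s=10.0, num_gpus=16))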
