Hi stuff, I observed mystery hangs with mpi + nccl. Basically, I follow the testca

mpi + nccl hangs in multi-process scenarios

nvidia commented on July 17, 2024

mpi + nccl hangs in multi-process scenarios

from nccl.

Comments (6)

Yangqing commented on July 17, 2024

FWIW, we found when developing Caffe2 that if you have multi-threading going on and one thread calls cudaMallocHost() or cudaFree while another thread is inside an NCCL call, NCCL can freeze. What solved the problem for us was guarding all malloc, free, and NCCL calls so they never overlap. That needed some careful work (thanks to @akyrola). An example can be seen here:

https://github.com/caffe2/caffe2/blob/master/caffe2/core/context_gpu.cu#L284
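
The guarding idea above can be sketched in a few lines. This is only an illustration of the locking pattern, not Caffe2's actual code: `GpuCallGuard`, `fake_alloc_and_free`, and `fake_all_reduce` are hypothetical names, and the two `fake_*` functions are stand-ins that sleep instead of calling CUDA or NCCL.

```python
import threading
import time


class GpuCallGuard:
    """Serializes allocation, free, and collective calls across threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inside = 0      # threads currently inside a guarded section
        self.max_inside = 0   # highest concurrency observed; should stay at 1

    def __enter__(self):
        self._lock.acquire()
        self._inside += 1
        self.max_inside = max(self.max_inside, self._inside)
        return self

    def __exit__(self, exc_type, exc, tb):
        self._inside -= 1
        self._lock.release()
        return False  # do not swallow exceptions


# One guard shared by every thread in the process.
guard = GpuCallGuard()


def fake_alloc_and_free():
    # Stand-in for cudaMallocHost()/cudaFree(); NOT a real CUDA binding.
    time.sleep(0.001)


def fake_all_reduce():
    # Stand-in for an NCCL collective; NOT a real NCCL binding.
    time.sleep(0.001)


def worker(iterations):
    for _ in range(iterations):
        with guard:
            fake_alloc_and_free()
        with guard:
            fake_all_reduce()


def run_demo(num_threads=4, iterations=5):
    """Run several workers and return the peak number of overlapping calls."""
    threads = [
        threading.Thread(target=worker, args=(iterations,))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return guard.max_inside
```

Calling `run_demo()` reports the peak number of threads that were inside a guarded section at once. The lock only coordinates threads within one process, so it does nothing for separate MPI ranks.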

sjeaugey commented on July 17, 2024

First, are you running all the processes on the same node?

In any case, you should not need a barrier in the case of MPI, only with multiple threads. A small code sample to reproduce could help.

hiyijian commented on July 17, 2024

Yes, it happens on a single node.
The project that hangs is Caffe, calling ncclAllReduce during the backward pass. I will try to reproduce it with a small demo.
Thanks.

hiyijian commented on July 17, 2024

Hi @Yangqing. Thank you very much. I am working on our own branch of a very old version of Caffe, with OpenMPI integrated. In other words, our case is multi-process on a single node, with one GPU per process. I think no thread-based lock is needed. Am I right?

hiyijian commented on July 17, 2024

PS: it hangs about once every 1 to 1.5 days. After each freeze I have to resume training from a snapshot. The trained model's performance is quite good.
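
For readers using the same resume-from-snapshot workaround, here is a rough sketch of picking the newest snapshot and building the restart command. It assumes stock Caffe conventions (snapshot files named `<prefix>_iter_<N>.solverstate`, resumed with `caffe train --solver=... --snapshot=...`) and a plain `mpirun -np` launch; the custom MPI branch described above may differ. `latest_snapshot` and `resume_command` are hypothetical helper names.

```python
import re
from pathlib import Path

# Assumed stock Caffe naming: <prefix>_iter_<N>.solverstate
_ITER_RE = re.compile(r"_iter_(\d+)\.solverstate$")


def latest_snapshot(snapshot_dir):
    """Return the .solverstate path with the highest iteration, or None."""
    best, best_iter = None, -1
    for path in Path(snapshot_dir).glob("*.solverstate"):
        match = _ITER_RE.search(path.name)
        # Compare iterations numerically, not as strings.
        if match and int(match.group(1)) > best_iter:
            best, best_iter = path, int(match.group(1))
    return best


def resume_command(solver, snapshot_dir, num_procs):
    """Build an argument list that restarts training from the newest snapshot.

    The mpirun/caffe invocation is an assumption about the launch setup.
    """
    cmd = ["mpirun", "-np", str(num_procs), "caffe", "train", f"--solver={solver}"]
    snapshot = latest_snapshot(snapshot_dir)
    if snapshot is not None:
        cmd.append(f"--snapshot={snapshot}")
    return cmd
```

The returned list can be passed to `subprocess.run` by a wrapper script. Detecting the hang itself is left out of this sketch.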

sjeaugey commented on July 17, 2024

Closing old issue. Please re-open if you still have issues with 2.3.
