Comments (5)

nluehr commented on August 15, 2024

NCCL should support multiple ranks per GPU. After a few dozen runs on my workstation, I have not been able to reproduce the bug. How frequently do you see it? Does it happen if all ranks use just a single GPU? What GPU/CPU and OS are you running on?

Thanks,
Nathan
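
For reference, here is a minimal sketch of the kind of initialization being discussed. This is not the broadcast_test source; the device ids and error handling are assumptions based on the two-GPU machine shown below.

/* Minimal sketch (not the actual broadcast_test code): create one
 * communicator rank per distinct GPU with ncclCommInitAll.
 * Device ids 0 and 1 are assumptions matching the two-K80 setup. */
#include <stdio.h>
#include <stdlib.h>
#include <nccl.h>

int main(void) {
  int devs[2] = {0, 1};              /* one rank per distinct GPU */
  ncclComm_t comms[2];

  ncclResult_t res = ncclCommInitAll(comms, 2, devs);
  if (res != ncclSuccess) {
    fprintf(stderr, "ncclCommInitAll failed: %s\n", ncclGetErrorString(res));
    return EXIT_FAILURE;
  }

  for (int i = 0; i < 2; ++i)
    ncclCommDestroy(comms[i]);
  printf("init/destroy OK\n");
  return EXIT_SUCCESS;
}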

amithr1 commented on August 15, 2024

Hi Nathan,

I still see the segfaults. The crashes appear to be in the ncclCommInitAll routine, since I exit immediately after it. I am also pasting the nvidia-smi output below.

nvidia-smi
Mon Mar 21 11:40:42 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.59 Driver Version: 352.59 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 0000:03:00.0 Off | 0 |
| N/A 27C P8 27W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 0000:04:00.0 Off | 0 |
| N/A 25C P8 28W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

amithr1 commented on August 15, 2024

I tried a simple loop of:

for (( i=1; i < 20; i++ )); do echo $i; ./build/test/single/broadcast_test 1024 2 0 1 ; done

and I see a couple of segfaults. Pasting the errors below:

*** Error in `./build/test/single/broadcast_test': free(): invalid pointer: 0x00003efff80008c0 ***
======= Backtrace: =========
/lib64/power8/libc.so.6(+0x8f284)[0x3fff76a7f284]
/usr/lib/nvidia/libcuda.so.1(+0xa7f22c)[0x3fff7b04f22c]
/usr/lib/nvidia/libcuda.so.1(+0x276f50)[0x3fff7a846f50]
/usr/lib/nvidia/libcuda.so.1(+0xa7ffbc)[0x3fff7b04ffbc]
/lib64/power8/libpthread.so.0(+0x8728)[0x3fff76eb8728]
/lib64/power8/libc.so.6(clone+0x98)[0x3fff76b07ae0]
======= Memory map: ========
10000000-10050000 r-xp 00000000 00:2d 26347551 /gpfs/ess2fs0/armamida/nccl/build/test/single/broadcast_test
10050000-10060000 r--p 00040000 00:2d 26347551 /gpfs/ess2fs0/armamida/nccl/build/test/single/broadcast_test
10060000-10070000 rw-p 00050000 00:2d 26347551 /gpfs/ess2fs0/armamida/nccl/build/test/single/broadcast_test
200000000-200100000 rw-s 55fbfa0000 00:05 104789 /dev/nvidiactl
200100000-200500000 rw-s 6ceccc0000 00:05 104789 /dev/nvidiactl
200500000-200900000 rw-s 47301c0000 00:05 104789 /dev/nvidiactl

amithr1 commented on August 15, 2024

I noticed that this doesn't happen on a separate machine. When I ran nvidia-smi, I saw that the machine with persistence mode enabled crashes, while the one without persistence mode enabled passes.

PASSES
-bash-4.2$ nvidia-smi
Tue May 31 16:27:16 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.59 Driver Version: 352.59 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:03:00.0 Off | 0 |
| N/A 41C P0 59W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:04:00.0 Off | 0 |
| N/A 32C P0 72W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0002:03:00.0 Off | 0 |
| N/A 36C P0 59W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0002:04:00.0 Off | 0 |
| N/A 27C P0 73W / 149W | 55MiB / 11519MiB | 99% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

DOES NOT PASS

Tue May 31 16:35:47 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.59 Driver Version: 352.59 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 0000:03:00.0 Off | 0 |
| N/A 27C P8 27W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 0000:04:00.0 Off | 0 |
| N/A 25C P8 28W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

spotluri commented on August 15, 2024

We have now clarified this in the NCCL documentation:
https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/usage/communicators.html

From the documentation: "Using the same CUDA device multiple times as different ranks of the same NCCL communicator is not supported and may lead to hangs."
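
To make the unsupported pattern concrete, a minimal sketch follows; the device id and function name are illustrative, not taken from the issue.

/* Sketch of the usage the documentation warns against: the same CUDA
 * device appears twice in the device list, so two ranks of one
 * communicator share GPU 0. Per the documentation this is unsupported
 * and may lead to hangs; it is shown only as the pattern to avoid. */
#include <nccl.h>

void unsupported_same_device_ranks(void) {
  int devs[2] = {0, 0};      /* GPU 0 listed for both ranks: not supported */
  ncclComm_t comms[2];
  /* May hang or crash; valid usage lists each device at most once
   * per communicator. */
  ncclCommInitAll(comms, 2, devs);
}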

Closing this as it is expected behavior. Please reopen if you have further questions.
