Comments (5)

nluehr commented on August 15, 2024

NCCL should support multiple ranks per GPU. After a few dozen runs on my workstation, I have not been able to reproduce the bug. How frequently do you see it? Does it happen if all ranks use just a single GPU? What GPU/CPU and OS are you running on?

Thanks,
Nathan
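
For reference, here is a minimal sketch of the kind of initialization being discussed. This is not the broadcast_test source; the device ids and error handling are assumptions based on the two-GPU machine shown below.

/* Minimal sketch (not the actual broadcast_test code): create one
 * communicator rank per distinct GPU with ncclCommInitAll.
 * Device ids 0 and 1 are assumptions matching the two-K80 setup. */
#include <stdio.h>
#include <stdlib.h>
#include <nccl.h>

int main(void) {
  int devs[2] = {0, 1};              /* one rank per distinct GPU */
  ncclComm_t comms[2];

  ncclResult_t res = ncclCommInitAll(comms, 2, devs);
  if (res != ncclSuccess) {
    fprintf(stderr, "ncclCommInitAll failed: %s\n", ncclGetErrorString(res));
    return EXIT_FAILURE;
  }

  for (int i = 0; i < 2; ++i)
    ncclCommDestroy(comms[i]);
  printf("init/destroy OK\n");
  return EXIT_SUCCESS;
}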

amithr1 commented on August 15, 2024

Hi Nathan,

I still see the segfaults. The crashes appear to be in the ncclCommInitAll routine, since I exit immediately after it. I am also pasting the nvidia-smi output below.

nvidia-smi
Mon Mar 21 11:40:42 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.59 Driver Version: 352.59 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 0000:03:00.0 Off | 0 |
| N/A 27C P8 27W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 0000:04:00.0 Off | 0 |
| N/A 25C P8 28W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

amithr1 commented on August 15, 2024

I tried a simple loop of:

for (( i=1; i < 20; i++ )); do echo $i; ./build/test/single/broadcast_test 1024 2 0 1 ; done

and I see a couple of segfaults. Pasting the errors below:

*** Error in `./build/test/single/broadcast_test': free(): invalid pointer: 0x00003efff80008c0 ***
======= Backtrace: =========
/lib64/power8/libc.so.6(+0x8f284)[0x3fff76a7f284]
/usr/lib/nvidia/libcuda.so.1(+0xa7f22c)[0x3fff7b04f22c]
/usr/lib/nvidia/libcuda.so.1(+0x276f50)[0x3fff7a846f50]
/usr/lib/nvidia/libcuda.so.1(+0xa7ffbc)[0x3fff7b04ffbc]
/lib64/power8/libpthread.so.0(+0x8728)[0x3fff76eb8728]
/lib64/power8/libc.so.6(clone+0x98)[0x3fff76b07ae0]
======= Memory map: ========
10000000-10050000 r-xp 00000000 00:2d 26347551 /gpfs/ess2fs0/armamida/nccl/build/test/single/broadcast_test
10050000-10060000 r--p 00040000 00:2d 26347551 /gpfs/ess2fs0/armamida/nccl/build/test/single/broadcast_test
10060000-10070000 rw-p 00050000 00:2d 26347551 /gpfs/ess2fs0/armamida/nccl/build/test/single/broadcast_test
200000000-200100000 rw-s 55fbfa0000 00:05 104789 /dev/nvidiactl
200100000-200500000 rw-s 6ceccc0000 00:05 104789 /dev/nvidiactl
200500000-200900000 rw-s 47301c0000 00:05 104789 /dev/nvidiactl

amithr1 commented on August 15, 2024

I noticed that this doesn't happen on a separate machine. When I ran nvidia-smi, I saw that the machine with persistence mode enabled crashes, while the one without persistence mode enabled passes.

PASSES
-bash-4.2$ nvidia-smi
Tue May 31 16:27:16 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.59 Driver Version: 352.59 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:03:00.0 Off | 0 |
| N/A 41C P0 59W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:04:00.0 Off | 0 |
| N/A 32C P0 72W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0002:03:00.0 Off | 0 |
| N/A 36C P0 59W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0002:04:00.0 Off | 0 |
| N/A 27C P0 73W / 149W | 55MiB / 11519MiB | 99% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

DOES NOT PASS

Tue May 31 16:35:47 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.59 Driver Version: 352.59 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 0000:03:00.0 Off | 0 |
| N/A 27C P8 27W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 0000:04:00.0 Off | 0 |
| N/A 25C P8 28W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

spotluri commented on August 15, 2024

We have now clarified this in the NCCL documentation:
https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/usage/communicators.html

From the documentation: "Using the same CUDA device multiple times as different ranks of the same NCCL communicator is not supported and may lead to hangs."
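
To make the unsupported pattern concrete, a minimal sketch follows; the device id and function name are illustrative, not taken from the issue.

/* Sketch of the usage the documentation warns against: the same CUDA
 * device appears twice in the device list, so two ranks of one
 * communicator share GPU 0. Per the documentation this is unsupported
 * and may lead to hangs; it is shown only as the pattern to avoid. */
#include <nccl.h>

void unsupported_same_device_ranks(void) {
  int devs[2] = {0, 0};      /* GPU 0 listed for both ranks: not supported */
  ncclComm_t comms[2];
  /* May hang or crash; valid usage lists each device at most once
   * per communicator. */
  ncclCommInitAll(comms, 2, devs);
}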

Closing this as it is expected behavior. Please reopen if you have further questions.
