Do we need to enable peer2peer communication between GPUs manually, or does NCCL do it?

Do we need to manually activate the peer2peer communication between GPUs? (nccl issue, closed)

nvidia commented on July 17, 2024

Do we need to manually activate the peer2peer communication between GPUs?

from nccl.

Comments (6)

sjeaugey commented on July 17, 2024

NCCL will enable P2P if needed, but will not fail if already enabled.
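For application code that also enables peer access itself, the same tolerant behavior can be sketched as below. This is a minimal sketch under stated assumptions, not NCCL's actual implementation: the helper names `peerEnableOk` and `enablePeerAccessIfPossible` are hypothetical, and only the CUDA runtime calls are real API.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: true when the result of cudaDeviceEnablePeerAccess
// means "peer access is on", i.e. the call succeeded or it was already enabled.
bool peerEnableOk(cudaError_t err) {
    return err == cudaSuccess || err == cudaErrorPeerAccessAlreadyEnabled;
}

// Hypothetical helper: let device `dev` access memory that lives on device `peer`.
// Returns false when the two devices have no P2P path or a call fails.
bool enablePeerAccessIfPossible(int dev, int peer) {
    int canAccess = 0;
    if (cudaDeviceCanAccessPeer(&canAccess, dev, peer) != cudaSuccess || !canAccess)
        return false;
    if (cudaSetDevice(dev) != cudaSuccess)
        return false;
    cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);  // flags must be 0
    if (err == cudaErrorPeerAccessAlreadyEnabled)
        cudaGetLastError();  // reset the recorded error so later checks stay clean
    return peerEnableOk(err);
}
```

Calling `enablePeerAccessIfPossible(i, j)` for every ordered pair of distinct devices before or after creating NCCL communicators should be harmless under this scheme, since an "already enabled" result is treated as success.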


nouiz commented on July 17, 2024

thanks!


svnoesis commented on July 17, 2024

I see no P2P communication in nvprof when using BVLC Caffe with NCCL in the multi-GPU case. With the Caffe version without NCCL, I could see P2P traffic between GPUs. Is there a reason why NCCL is not using P2P?


sjeaugey commented on July 17, 2024

P2P is used, but through CUDA kernels. So you will not see explicit P2P cudaMemcpy operations, but CUDA kernels doing computation as well as remote P2P writes.
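To illustrate what "P2P through CUDA kernels" can look like, here is a small sketch in which a kernel running on GPU 0 computes values and stores them directly into memory allocated on GPU 1. It is not NCCL's kernel; the names `scaleIntoPeer` and `runPeerWriteDemo` are hypothetical, and the sketch assumes at least two GPUs with a P2P path between them (it reports "skipped" otherwise). In a profiler, this should appear as a kernel launch rather than as a peer-to-peer cudaMemcpy.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Runs on the source GPU; `remote` points to memory allocated on the peer GPU.
// Each thread does a little compute and stores the result with a plain write.
__global__ void scaleIntoPeer(const float* local, float* remote, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) remote[i] = k * local[i];
}

// Hypothetical demo: returns 0 on success, 1 if skipped (no P2P), -1 on failure.
int runPeerWriteDemo() {
    int count = 0, canAccess = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) return 1;
    if (cudaDeviceCanAccessPeer(&canAccess, 0, 1) != cudaSuccess || !canAccess) return 1;

    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *src = nullptr, *dst = nullptr;
    std::vector<float> host(n, 1.0f);
    bool ok = true;

    cudaSetDevice(1);
    ok = ok && cudaMalloc(&dst, bytes) == cudaSuccess;   // destination lives on GPU 1
    cudaSetDevice(0);
    ok = ok && cudaMalloc(&src, bytes) == cudaSuccess;   // source lives on GPU 0

    // Let GPU 0 dereference GPU 1 pointers; "already enabled" counts as success.
    cudaError_t e = cudaDeviceEnablePeerAccess(1, 0);
    ok = ok && (e == cudaSuccess || e == cudaErrorPeerAccessAlreadyEnabled);
    cudaGetLastError();  // reset any recorded "already enabled" error

    if (ok) {
        ok = cudaMemcpy(src, host.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
        scaleIntoPeer<<<(n + 255) / 256, 256>>>(src, dst, 2.0f, n);  // remote P2P writes
        ok = ok && cudaDeviceSynchronize() == cudaSuccess;
        ok = ok && cudaMemcpy(host.data(), dst, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        ok = ok && host[0] == 2.0f && host[n - 1] == 2.0f;
    }

    cudaFree(src);
    cudaSetDevice(1);
    cudaFree(dst);
    return ok ? 0 : -1;
}
```

Calling `runPeerWriteDemo()` from `main` and running the program under a profiler is one way to check that no explicit peer copy shows up for the GPU 0 to GPU 1 transfer.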


pseudotensor commented on July 17, 2024

The problem is that cuda-memcheck still complains about peer access already being enabled, which makes it hard to use when debugging NCCL applications. It complains even when the application has no other problems, and it repeats this error message for every device communicator being initialized.

NCCL: Using devices

Rank 0 uses device 0 [0x01] GeForce GTX TITAN X

Rank 1 uses device 1 [0x02] GeForce GTX TITAN X

Rank 2 uses device 2 [0x03] GeForce GTX TITAN X

========= CUDA-MEMCHECK
========= Program hit cudaErrorPeerAccessAlreadyEnabled (error 50) due to "peer access is already enabled" on CUDA API call to cudaDeviceEnablePeerAccess.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2eea03]
========= Host Frame:/usr/local/cuda/lib64/libcudart.so.8.0 (cudaDeviceEnablePeerAccess + 0x1a9) [0x38f29]
========= Host Frame:/usr/local/cuda/lib64/libnccl.so.1 [0x56c2]
========= Host Frame:/usr/local/cuda/lib64/libnccl.so.1 (ncclCommInitAll + 0x646) [0x7a66]


cliffwoolley commented on July 17, 2024

@pseudotensor - You can tell cuda-memcheck to ignore those API error return values with an extra command line flag; see the --help for details.

(In reply to pseudotensor's comment of May 24, 2017, 10:34 PM, quoted in full above.)
