Comments (23)

sjeaugey avatar sjeaugey commented on July 17, 2024

The cudaDeviceEnablePeerAccess documentation says: "Each device can support a system-wide maximum of eight peer connections."

So if you try to connect to a 9th peer, you'll get this error. Note that this limit is system-wide for every single CUDA device, so if you use other libraries (e.g. a CUDA-aware MPI library) that enable peer access to other devices, NCCL may fail with fewer than 9 ranks.
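
To illustrate, here is a minimal sketch (independent of NCCL, hypothetical code): it enables peer access from device 0 to every other visible device, and on a machine with more than eight peer-capable GPUs the 9th enable call is the one expected to fail with "peer mapping resources exhausted".

#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: enable peer access from device 0 to every other device and
// report what each call returns. Past 8 peers, the enable call is expected to
// fail with the "peer mapping resources exhausted" error.
int main() {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  cudaSetDevice(0);
  for (int peer = 1; peer < ndev; ++peer) {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, peer);
    if (!canAccess) continue;                      // no P2P path to this device
    cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);
    printf("enable peer %d -> %s\n", peer, cudaGetErrorString(err));
  }
  return 0;
}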

extabgrad avatar extabgrad commented on July 17, 2024

Hi, can you explain this in more detail? I don't have a CUDA-aware MPI library other than NCCL; all I am doing is reducing between 16 GPUs in two groups of 8, so each process attempts to connect to just 7 peers. Once I am reduced to two arrays on rank 1 of each group, I do the final reduction using system memory, which doesn't involve any peer access. And I still get this error.

In addition, should NCCL not be taking care of this for me? At the very least it should error if my NCCL communicator has more than 9 ranks in it, rather than allowing some internal error to be thrown.

sjeaugey avatar sjeaugey commented on July 17, 2024

OK. Since you mentioned you were using MPI, I thought it could be a CUDA-aware version of MPI. Can you tell us which MPI implementation/version you are using?

We're thinking about improving NCCL to work around this limit internally and not fail in this case, but we've never seen it fail for fewer than 8 GPUs... something else must be preventing the P2P connections from being established.

extabgrad avatar extabgrad commented on July 17, 2024

I'm using MPICH2, but I don't think that's relevant here... the resource failure occurs while the two sets of 8 processes are carrying out their reductions in NCCL, not during any MPI operations.

It's worth noting that on an 8 GPU machine everything works fine. It's only when I have more than 8 GPUs that there's an issue. I'm guessing that two completely independent reductions would also fail whether or not I then carried out a final reduction, but I haven't checked that.

I can try and put together some repro code, but it's going to be difficult to handle because it requires a machine with 16 GPUs that can access each other peer-to-peer...

sjeaugey avatar sjeaugey commented on July 17, 2024

The error should happen when you create the NCCL communicator with ncclCommInitRank()... not during the nccl[All]Reduce operation. Can you confirm that's what happens in your case?

For the reproducer, I guess just taking the MPI example and creating two communicators instead of one would work. Here is a patch to do that:

diff --git a/test/mpi/mpi_test.cu b/test/mpi/mpi_test.cu
index fea6ae5..fdef16b 100644
--- a/test/mpi/mpi_test.cu
+++ b/test/mpi/mpi_test.cu
@@ -36,10 +36,17 @@ int main(int argc, char *argv[]) {
   CUDACHECK(cudaSetDevice(gpu));
   MPI_Barrier(MPI_COMM_WORLD);

+  // Split into groups of 8 GPUs
+  MPI_Comm mpicomm;
+  int group = rank/8;
+  MPI_Comm_split(MPI_COMM_WORLD, group, rank, &mpicomm);
+  MPI_Comm_size(mpicomm, &size);
+  MPI_Comm_rank(mpicomm, &rank);
+
   // NCCL Communicator creation
   ncclComm_t comm;
   NCCLCHECK(ncclGetUniqueId(&commId));
-  MPI_Bcast(&commId, NCCL_UNIQUE_ID_BYTES, MPI_CHAR, 0, MPI_COMM_WORLD);
+  MPI_Bcast(&commId, NCCL_UNIQUE_ID_BYTES, MPI_CHAR, 0, mpicomm);
   ret = ncclCommInitRank(&comm, size, commId, rank);
   if (ret != ncclSuccess) {
     printf("NCCL Init failed (%d) '%s'\n", ret, ncclGetErrorString(ret));
@@ -74,7 +81,7 @@ int main(int argc, char *argv[]) {
   for (int v=0; v<SIZE; v++) {
     if (val[v] != ref) {
       errors++;
-      printf("[%d] Error at %d : got %d instead of %d\n", rank, v, val[v], ref);
+      printf("[%d/%d] Error at %d : got %d instead of %d\n", group, rank, v, val[v], ref);
     }
   }
   CUDACHECK(cudaFree(dptr));
@@ -82,9 +89,9 @@ int main(int argc, char *argv[]) {
   MPI_Allreduce(MPI_IN_PLACE, &errors, 1, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD);
   if (rank == 0) {
     if (errors)
-      printf("%d errors. Test FAILED.\n", errors);
+      printf("%d : %d errors. Test FAILED.\n", group, errors);
     else
-      printf("Test PASSED.\n");
+      printf("%d : test PASSED.\n", group);
   }

   MPI_Finalize();

Can you confirm it reproduces the issue?

extabgrad avatar extabgrad commented on July 17, 2024

Hi, thanks for the very useful code. So far I haven't been able to reproduce the error, not even with an NCCL collective spanning all 16 GPUs, so clearly I'm missing something. I'm trying to change the example to be more like my case and hopefully I'll get it to reproduce.

sjeaugey avatar sjeaugey commented on July 17, 2024

I think you have 8 GPUs attached to each CPU, which means each GPU will not try to connect to more than 7 peers, since 8 of them are on the other side and we don't have peer access to them.

There is definitely something preventing the P2P connections from being established.

extabgrad avatar extabgrad commented on July 17, 2024

No, this is an Amazon P2 16x instance; it has 16 GPUs and 32 CPUs, and they can all communicate peer-to-peer (we've checked). However, I can instrument the code to check that all communication is happening peer-to-peer.
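
Something along these lines is the kind of check I have in mind (a minimal sketch, not the actual instrumentation): print the pairwise peer-access matrix that CUDA reports.

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: print which device pairs CUDA reports as peer-access capable.
int main() {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  for (int i = 0; i < ndev; ++i) {
    for (int j = 0; j < ndev; ++j) {
      int can = (i == j);                 // a device trivially accesses itself
      if (i != j) cudaDeviceCanAccessPeer(&can, i, j);
      printf("%d ", can);
    }
    printf("\n");
  }
  return 0;
}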

extabgrad avatar extabgrad commented on July 17, 2024

Okay, I have some more thorough information. In my application I only get an error when the array is non-scalar and of type double, and I only get it on the second call, not the first, third, or any subsequent call.

In the stand-alone I cannot yet get it to error, even when I do not split the communicator and I use all 16 GPUs. I've confirmed that peer-to-peer is enabled between all GPUs using the p2pBandwidthLatencyTest CUDA sample, although I haven't confirmed what's going on inside the stand-alone repro code yet. Is there a way to tell whether NCCL is using peer-to-peer or falling back to standard memcpy?

extabgrad avatar extabgrad commented on July 17, 2024

I don't know the details, but perhaps, since NCCL reductions are implemented as a ring collective, each GPU in a multi-process context only ever has a single peer at a time? So it shouldn't matter how many GPUs there are.

sjeaugey avatar sjeaugey commented on July 17, 2024

Setting NCCL_DEBUG=INFO should give you all the information about what NCCL does.
In your case, it should be using CUDA IPCs to map memory directly from peers.
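
For reference, the IPC mapping looks roughly like this (a simplified sketch with hypothetical helper names, not the actual NCCL code; the handle is exchanged between processes out of band, e.g. over MPI):

#include <cuda_runtime.h>

// Exporting process: produce a shareable handle for a device allocation.
cudaError_t exportBuffer(void* dptr, cudaIpcMemHandle_t* handle) {
  return cudaIpcGetMemHandle(handle, dptr);
}

// Importing process: map the peer allocation into this process's address
// space. This is where the peer mapping is set up under the hood.
cudaError_t importBuffer(cudaIpcMemHandle_t handle, void** peerPtr) {
  return cudaIpcOpenMemHandle(peerPtr, handle, cudaIpcMemLazyEnablePeerAccess);
}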

I'm not sure I understand the "second time you run it". Is it on the second run of your MPI app, on the second ncclInit, or on the second ncclAllReduce?

extabgrad avatar extabgrad commented on July 17, 2024

In each call I'm calling ncclCommInitRank, ncclAllReduce, and ncclCommDestroy, much like the example. The error actually occurs on the next CUDA call (or possibly kernel launch) after the first run, which means there is an uncleared CUDA error still in the system following that first run. I'm going to call cudaGetLastError at the end of my function to see if that gets rid of the issue (but then I'd still want to know why the error is thrown).
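
Roughly, the per-call sequence looks like this (a simplified sketch with placeholder buffer names, not my actual code), with the cudaGetLastError workaround at the end:

#include <cstdio>
#include <cuda_runtime.h>
#include <nccl.h>

// Sketch: create the communicator, reduce doubles, destroy it, then clear any
// CUDA error left behind so it can't surface on a later, unrelated CUDA call.
void reduceOnce(ncclUniqueId commId, int nranks, int rank,
                double* sendbuff, double* recvbuff, size_t count,
                cudaStream_t stream) {
  ncclComm_t comm;
  ncclResult_t res = ncclCommInitRank(&comm, nranks, commId, rank);
  if (res != ncclSuccess) {
    printf("ncclCommInitRank failed: %s\n", ncclGetErrorString(res));
    return;
  }
  res = ncclAllReduce(sendbuff, recvbuff, count, ncclDouble, ncclSum, comm, stream);
  if (res != ncclSuccess)
    printf("ncclAllReduce failed: %s\n", ncclGetErrorString(res));
  cudaStreamSynchronize(stream);
  ncclCommDestroy(comm);
  // Workaround under test: pop the sticky "peer mapping resources exhausted"
  // error here instead of letting it hit the next unrelated CUDA call.
  cudaGetLastError();
}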

extabgrad avatar extabgrad commented on July 17, 2024

Woohoo! I've reproduced the error using the standard unmodified example except for one extra call at the end to CUDACHECK(cudaGetLastError()). The "peer mapping resources exhausted" error is then thrown. Note that this is AFTER the result of the reduction is checked. I think what's happening is that NCCL is trapping the error and using a different strategy when there are >8 peers, but not clearing that error. Would that be fair?
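
Concretely, the only change to the example is something like this at the end of main() (exact placement approximate):

  CUDACHECK(cudaGetLastError());  // surfaces the latent "peer mapping resources exhausted" error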

I will now go back to the split-communicator example to see if that errors in the same way even for <=8 peers.

extabgrad avatar extabgrad commented on July 17, 2024

Okay, confirmed. Even when the size of the group is 8, the peer mapping resources error is thrown.

extabgrad avatar extabgrad commented on July 17, 2024

Sorry, scrap that, my test was wrong. It DIDN'T reproduce when I kept the group size down to 8. So I need to go back and confirm that it does still error for my own code and work out why.

So I get the correct answer for >8 GPUs without needing to keep them in groups of 8, although I have to discard this one-time error. Is NCCL successfully dividing the work up into smaller groups, or has it reverted to using no peer-to-peer at all? If the latter, I would definitely prefer the dividing-up option.

extabgrad avatar extabgrad commented on July 17, 2024

I've done some tests and it's clear that letting NCCL deal with 16 GPUs is twice as slow as splitting the communicator and tidying up afterwards. So I'm guessing my assumption is correct and NCCL is just switching off peer comms.

sjeaugey avatar sjeaugey commented on July 17, 2024

Were you able to confirm that by exporting NCCL_DEBUG=INFO?

extabgrad avatar extabgrad commented on July 17, 2024

Okay, I believe I have this under control now. The fact that NCCL works for >8 GPUs but just leaves this unhandled CUDA error threw me, but if I do a correct reduction-in-groups I can get the behaviour I want. The only remaining issues would be 1) NCCL should not leave this error queued when it resorts to the non-peer-to-peer version, and 2) a request that NCCL manage this 8-peer limitation internally.

I will give NCCL_DEBUG=INFO a go too. Is it an environment variable, a build variable for my own code, or do I need to rebuild NCCL with it?

nluehr avatar nluehr commented on July 17, 2024

NCCL_DEBUG is an environment variable that is available in every build configuration. No need to recompile.
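
For example, something like this in the shell before launching is enough, as long as the variable actually reaches every rank your launcher starts:

export NCCL_DEBUG=INFO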

extabgrad avatar extabgrad commented on July 17, 2024

Great, thanks. I ran out of time again today but I'll look at it next week. I'm still getting that peer mapping resources exhausted error occasionally and I don't understand why.

extabgrad avatar extabgrad commented on July 17, 2024

I can't see any output related to NCCL when I set NCCL_DEBUG - am I not supposed to see text in the terminal?

sjeaugey avatar sjeaugey commented on July 17, 2024

If you run with OpenMPI, make sure you export the NCCL_DEBUG env var with

mpirun -x NCCL_DEBUG -np ... ./mpi_test

or set it directly with:

mpirun -x NCCL_DEBUG=INFO -np ... ./mpi_test

Yes, you should see the output in the terminal.

sjeaugey avatar sjeaugey commented on July 17, 2024

Closing old issue.
