Comments (7)
My understanding is that since performance is fine with one GPU per node, the drop is not due to network limitations but rather to limitations within the node itself, correct?
Correct.
Why does the addition of network cards result in decreased performance? Are there any optimization methods available?
I didn't get that. Where did you add network cards and it degraded performance?
from nccl.
Thank you for your reply!
My point is, why is there a performance drop with two GPUs per node (20 GB/s) compared to one GPU per node (21.4 GB/s)? As we discussed earlier, we can assume this degradation is due to limitations within the node. What exactly causes this limitation? Possibly PCIe switching performance?
However, performance testing within the node with 2/3/4 GPUs does not show any degradation. From the perspective of a single server, the only difference between inter-node and intra-node is one additional data stream from the NIC. If the degradation were due to switching performance, it should also appear in the intra-node tests.
Or could there be other reasons?
Yes it could be the PCI switch degrading performance a bit, but it's not a big degradation; it's unlikely to matter much in real apps.
Can you see if NCCL_NET_GDR_READ=1 gets better results?
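To compare the two settings, one option is to run the same benchmark twice and change only `NCCL_NET_GDR_READ`. The sketch below is an illustration, not something posted in this thread: it only builds the environment for each run. The helper name `gdr_read_env` and the benchmark path in `BENCH_CMD` are assumptions (the command assumes an `all_reduce_perf` binary built from nccl-tests), and a multi-node run would still need your own launcher to forward the variable to every rank.

```python
import os


def gdr_read_env(value, base_env=None):
    """Return a copy of the environment with NCCL_NET_GDR_READ set to 0 or 1.

    gdr_read_env is a hypothetical helper name, not part of NCCL.
    """
    if value not in (0, 1):
        raise ValueError("NCCL_NET_GDR_READ is expected to be 0 or 1")
    # Copy so the caller's environment is never modified in place.
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_NET_GDR_READ"] = str(value)
    # NCCL_DEBUG=INFO makes NCCL log its setup, which helps confirm what was used.
    env["NCCL_DEBUG"] = "INFO"
    return env


# Assumed benchmark command: path and message sizes are placeholders to adapt.
BENCH_CMD = ["./build/all_reduce_perf", "-b", "8", "-e", "1G", "-f", "2", "-g", "1"]

# One environment per setting, so the two runs differ in this variable only.
ENVS = {value: gdr_read_env(value) for value in (0, 1)}
```

Each entry of `ENVS` can be passed as `env=` to `subprocess.run(BENCH_CMD, env=...)`, keeping everything else identical between the two runs.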
Alright, thank you for your advice.
Our servers are busy right now; I will try it once they are free.
Thank you very much for your advice! It's really helpful.
Actually, I tried NCCL_NET_GDR_READ=0 and got better results (22 GB/s). It seems that the default value in our environment is 1, although there is no NVLink in my setup. Our NCCL version is 2.19.4.
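To put the three bandwidth figures reported in this thread side by side, here is a small sketch of the arithmetic. The numbers are the ones quoted above; the function and variable names are made up for illustration.

```python
def relative_change(new, baseline):
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


# Bandwidths in GB/s as reported in this thread.
one_gpu_per_node = 21.4
two_gpus_default = 20.0       # NCCL_NET_GDR_READ left at its default
two_gpus_gdr_read_off = 22.0  # NCCL_NET_GDR_READ=0

drop = relative_change(two_gpus_default, one_gpu_per_node)
gain = relative_change(two_gpus_gdr_read_off, two_gpus_default)

print(f"2 GPUs/node vs 1 GPU/node: {drop:+.1f}%")
print(f"GDR read off vs default:   {gain:+.1f}%")
```

So the original drop is roughly 6.5%, and disabling GDR read recovers about 10% relative to the default two-GPU run.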
Ah, indeed, GDR Read was disabled by default only on Volta and earlier. Since Ampere, it is enabled by default. Disabling it may improve performance, but it will cause more traffic to the CPU and back, so in a real application it could slow down the CPU-GPU exchanges, and NCCL itself could be slowed down by those transfers. I would therefore advise making sure real applications actually see a benefit from this environment variable before setting it.
Got it. Thanks again.