Comments (12)
I don't have any explanation for the difference currently. I'd need to see full NCCL_DEBUG=INFO logs to diagnose it.
But the external plugin in NGC is this one: MLNX-SHARP-Plugin
The IB verbs code in the external plugin and NCCL internal plugin are kept pretty much in sync, so I cannot explain the difference in performance, unless you've enabled COLLNET and SHARP support in the IB network?
from nccl.
Many thanks for your response. I'm sorry, but I cannot currently access the testing environment to collect the detailed log messages. From what I saw, though, the process that creates the QP connections had no problems. Additionally, I'm sure COLLNET and SHARP are not enabled on the switch side, because the problem can also be reproduced on a RoCE network. In any case, this significant performance gap is a problem, and the issue is reproducible on both IB and RoCE networks.
@AddyLaddy additionally, are you sure the external plugin in NGC is this one: MLNX-SHARP-Plugin? It looks like that one targets the SHARP/COLLNET application scenario.
That Nvidia developed plugin is provided with the Nvidia HPCX SW stack, which I assume is part of the NGC containers?
But there are only a handful of NCCL plugins available, and most of the others are for supporting Slingshot or AWS/EFA over libfabric. I don't know of any other external plugins that support IB Verbs apart from the one I linked above.
NB: That plugin supports both IB Verbs and SHARP/COLLNET and will identify itself with "IBext_vX" in the INFO logs.
RoCE doesn't support SHARP and unless you've explicitly requested COLLNET support then it will be disabled by default.
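As a quick way to check which plugin a given run actually used, you can search the INFO log for the network name mentioned in the NB above. A minimal sketch; the sample log line is hypothetical, but the "IBext_vX" tag is what the external plugin reports, while the built-in transport reports plain "IB":

```shell
# Hypothetical INFO line standing in for output from a run with NCCL_DEBUG=INFO.
log='node0:117:117 [0] NCCL INFO Using network IBext_v7'
echo "$log" | grep -oE 'IBext_v[0-9]+'
# On a saved log file:  grep -E 'Using network' all_reduce_in_ngc.txt
```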
Thanks for the information @AddyLaddy. I suspect the Nvidia-developed plugin in NGC has some kind of enhancement that boosts performance somehow. It's really tricky. Anyway, as far as I understand, the bus bandwidth should reach 360+ GB/s for 4 H100 servers with 8x400Gbps network connections.
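The 360+ GB/s expectation can be sanity-checked with back-of-envelope arithmetic (a sketch; the ~90% wire-rate efficiency factor is an assumed typical value for well-tuned inter-node allreduce, not a measurement):

```shell
# 8 NICs x 400 Gb/s per server -> per-server injection bandwidth in GB/s.
nics=8; gbps=400
raw_gbs=$(( nics * gbps / 8 ))       # 400 GB/s raw per server
# nccl-tests busbw for inter-node allreduce is bounded by per-server network
# bandwidth; assume ~90% of line rate is achievable in practice.
expected=$(( raw_gbs * 90 / 100 ))   # 360 GB/s target
echo "$raw_gbs $expected"
```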
We do try to keep the external and internal IB plugin code in lockstep, so I have no idea why there is a performance difference with the NVLSTree protocol. The NCCL_DEBUG=INFO logs may give some clues. Also, adding NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH will give more information.
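A minimal sketch of setting those variables for a log-collection run (the mpirun line is a hypothetical example; adjust the rank count, hostfile, and nccl-tests binary path to the actual setup):

```shell
# Enable the verbose logging requested above.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH
# Then rerun the benchmark and capture the output, e.g.:
#   mpirun -np 32 --hostfile hosts ./all_reduce_perf -b 1G -e 8G -f 2 -g 1 2>&1 | tee nccl_debug.log
```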
OK. Let me check whether we can get an extra test window to collect the detailed log messages. If you have a chance to test this case, please try; the problem is easy to reproduce. Many thanks.
Hello @AddyLaddy, I'm not sure whether the information below, from this Nvidia page, is accurate.
The text above is not correct and I've asked the team responsible to update it.
The default P2P protocol for the SHARP/COLLNET plugin is still IB Verbs. You have to set an additional environment variable to switch to UCX.
All Nvidia sourced containers include HPCX so most Deep Learning training and testing is actually done with the IB Verbs code in the external plugin.
But we carry out extensive QA on each release, where we test both internal and external plugins on numerous platforms, for both correctness and performance.
Hello @AddyLaddy, FYI, attached are the detailed log messages from the host system and NGC, respectively.
all_reduce_in_host_system.txt
all_reduce_in_ngc.txt
Please let me know if you find anything.
Thanks.
Thanks for the log files. I haven't seen any obvious reason for the performance difference, apart from the NCCL environment variables being set differently between the "host" and "ngc" runs.
With the "ngc" run I see the following extra env vars being set:
NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
NCCL_NVLS_ENABLE set by environment to 1.
NCCL_CROSS_NIC set by environment to 0.
But only NCCL_CROSS_NIC=0 is overriding the default NCCL library setting.
However, I am a bit concerned as to why these env variables are being set as they can cause sub-optimal behaviour:
NCCL_CROSS_NIC set by environment to 0.
NCCL_PXN_DISABLE set by environment to 1.
NCCL_NET_GDR_LEVEL set by environment to SYS
Also, we have found that setting NCCL_IB_QPS_PER_CONNECTION=2 only helps on RoCE machines, and it should normally be used together with NCCL_IB_SPLIT_DATA_ON_QPS=0.
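A sketch of that suggested pairing for RoCE runs (assumes these two variables are the only changes; defaults can vary by NCCL version, so check the environment-variable docs for the release in use):

```shell
# Suggested pairing on RoCE, per the comment above.
export NCCL_IB_QPS_PER_CONNECTION=2   # spread each connection's traffic over 2 QPs
export NCCL_IB_SPLIT_DATA_ON_QPS=0    # round-robin messages across QPs rather than splitting each message
```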
Neither run appears to be using library binaries provided by Nvidia, so it's hard to tell what other changes may have been made to the source code. NCCL 2.19.44 is not an official version!
NCCL version 2.19.44+cuda12.2
Plugin Path : /workspace/infrawaves/nccl_2.19.44/build/lib/libnccl-net.so
Maybe as a next step you could unify the environment variable settings between the "host" and "ngc" runs to see whether that is the root cause of the performance difference?
But as you're not using official NCCL software versions, any further analysis may be difficult.
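One way to act on that suggestion is to extract the "set by environment" lines from each log and diff them. A sketch using hypothetical excerpts in place of the two attached logs (the commented commands show how to run it on the real files):

```shell
# Hypothetical excerpts standing in for the two attachments.
printf '%s\n' 'NCCL_IB_QPS_PER_CONNECTION set by environment to 2.' > host_env.txt
printf '%s\n' 'NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.' \
              'NCCL_NVLS_ENABLE set by environment to 1.' \
              'NCCL_CROSS_NIC set by environment to 0.' > ngc_env.txt
# On the real attachments:
#   grep -E 'set by environment' all_reduce_in_host_system.txt | sort > host_env.txt
#   grep -E 'set by environment' all_reduce_in_ngc.txt | sort > ngc_env.txt
diff host_env.txt ngc_env.txt || true   # lines prefixed < / > differ between runs
```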
Many thanks for your response and time, @AddyLaddy.
We will try unifying the environment variables if I can get a test window. Additionally, we did modify the NCCL code, but I don't think that could cause the performance gap between the host and NGC systems. I suspect the problem can also be reproduced with the original NCCL code.