Comments (6)
FWIW: while developing Caffe2, we found that if multiple threads are running and one of them calls cudaMallocHost() or cudaFree(), NCCL freezes. What solved the problem for us was guarding all malloc, free, and NCCL calls so that they never overlap. That took some care (thanks to @akyrola). An example can be seen here:
https://github.com/caffe2/caffe2/blob/master/caffe2/core/context_gpu.cu#L284
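A minimal sketch of the guarding idea described above, assuming a single process-wide mutex is acceptable. It is not the Caffe2 code from the link: `GpuCallMutex`, `Guarded`, `FakeAlloc`, and `FakeAllReduce` are hypothetical names, and the two `Fake*` functions are stand-ins so the example runs without a GPU. In real code the guarded callables would wrap `cudaMallocHost()`, `cudaFree()`, and the NCCL calls.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical process-wide mutex shared by every allocation, free, and
// collective call. The name is illustrative, not taken from Caffe2 or NCCL.
std::mutex& GpuCallMutex() {
  static std::mutex m;
  return m;
}

// Bookkeeping used only to check that guarded sections never overlap.
std::atomic<int> g_in_flight{0};
std::atomic<int> g_max_in_flight{0};

// Runs `fn` while holding the shared mutex, so guarded calls are serialized.
template <typename Fn>
void Guarded(Fn&& fn) {
  std::lock_guard<std::mutex> lock(GpuCallMutex());
  int now = ++g_in_flight;
  if (now > g_max_in_flight.load()) g_max_in_flight.store(now);
  fn();
  --g_in_flight;
}

// Stand-ins (assumptions) for the real calls: in a real program these would
// be cudaMallocHost()/cudaFree() and an NCCL collective such as ncclAllReduce().
void FakeAlloc() { std::this_thread::yield(); }
void FakeAllReduce() { std::this_thread::yield(); }

// Spawns four threads that mix "allocation" and "collective" calls and
// returns the largest number of guarded sections seen running at once.
int RunDemo() {
  std::vector<std::thread> workers;
  for (int t = 0; t < 4; ++t) {
    workers.emplace_back([t] {
      for (int i = 0; i < 1000; ++i) {
        if (t % 2 == 0) {
          Guarded(FakeAlloc);
        } else {
          Guarded(FakeAllReduce);
        }
      }
    });
  }
  for (auto& w : workers) w.join();
  return g_max_in_flight.load();
}
```

Calling `RunDemo()` from `main` should report that at most one guarded section ran at a time. A single coarse lock is the simplest way to guarantee no overlap; it trades some concurrency for that guarantee.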
from nccl.
First, are you running all the processes on the same node?
In any case, you should not need a barrier with MPI, only with multiple threads. A small code sample that reproduces the hang would help.
from nccl.
Yes, it happens on a single node.
The hang occurs in Caffe, in ncclAllReduce during the backward pass. I will try to reproduce it with a small demo.
Thanks.
from nccl.
Hi @Yangqing, thank you very much. I am working on our own branch of a very old version of Caffe, with OpenMPI integrated. In other words, our case is multi-process on a single node, with one GPU per process. I think no thread-based lock is needed. Am I right?
from nccl.
PS: it hangs roughly once every 1 to 1.5 days. After each freeze I have to resume training from a snapshot. The trained model's performance is quite good.
from nccl.
Closing old issue. Please re-open if you still have issues with 2.3.
from nccl.