Comments (6)
I've seen a similar issue when libcuda.so is not in the LD_LIBRARY_PATH (on
one of my systems, only libcuda.so.1 was there, but the usual libcuda.so
symlink was absent). Can you please check that? If it turns out that's
your issue, you can either simply run ln -s libcuda.so.1 libcuda.so in the
relevant directory or you can modify NCCL to dlopen libcuda.so.1 instead of
libcuda.so.
Thanks,
Cliff
On Thu, Jan 14, 2016 at 10:28 PM, Jerry Lin wrote:
I built NCCL with cuda-7.5:
make CUDA_HOME=/usr/local/cuda-7.5 test
and ran the test with the following commands:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./build/lib
./build/test/all_reduce_test
This causes a segmentation fault:
Using devices
Segmentation fault
But all tests run smoothly if I build NCCL with cuda-7.0.
Is the current version of NCCL not compatible with cuda-7.5?
Oh, and yes, NCCL is normally compatible with CUDA 7.5. It actually is a
bit more complete on CUDA 7.5 than on 7.0, since 7.0 lacked some of the
necessary support for the fp16 'half' datatype.
Side note: assuming this is the same issue with libcuda.so that I'm referring to, we should fix the tests to fail more gracefully when the communicator cannot be created. The segfault happens when we pass a NULL communicator to some subsequent routine.
@cliffwoolley Thanks for the explanation.
I created a symlink to libcuda.so.1 and now it works!
So it's the same issue with libcuda.so.
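The check discussed in this thread can be sketched as a small shell helper. This is only an illustration: `find_lib` is a hypothetical function, not part of NCCL or CUDA, and it looks only in the directories you pass it, whereas the dynamic loader also consults other locations. It tries each candidate name in order and prints the first match.

```shell
# find_lib DIRS NAME...  (hypothetical helper for illustration)
# DIRS is a colon-separated directory list such as $LD_LIBRARY_PATH.
# Prints the first existing DIR/NAME and returns 0, or returns 1 if
# none of the names exists in any of the directories.
find_lib() {
    search_path=$1
    shift
    for name in "$@"; do
        old_ifs=$IFS
        IFS=:
        # Unquoted on purpose so the list is split on colons.
        for dir in $search_path; do
            if [ -n "$dir" ] && [ -e "$dir/$name" ]; then
                IFS=$old_ifs
                printf '%s\n' "$dir/$name"
                return 0
            fi
        done
        IFS=$old_ifs
    done
    return 1
}

# Prefer the unversioned name, then fall back to the versioned one.
if path=$(find_lib "${LD_LIBRARY_PATH:-}" libcuda.so libcuda.so.1); then
    echo "found: $path"
else
    echo "neither libcuda.so nor libcuda.so.1 found in LD_LIBRARY_PATH" >&2
fi
```

If only libcuda.so.1 is reported, that matches the situation described above, and creating the libcuda.so symlink next to it is the fix that worked here.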
Great! Glad to hear it. We'll leave this issue open to deal with the libcuda.so[.1] loading (perhaps we could try both variants before giving up) as well as to detect communicator creation failure in the test apps without segfaulting. I believe @nluehr already has fixes pending for one or both of these issues.
These issues are resolved in change sets caa40b8 and 2758353.