nvidia / nccl Goto Github PK
Optimized primitives for collective multi-GPU communication
License: Other
The header file suggests that the number of ranks (or tasks) must be less than (or equal to) the number of devices. However, it would be convenient to have, say, two processes training their own copies of a neural net on the same GPU and then using the reduce and bcast functionality to transfer data between the models during an update. Specifically, using reduce to sum all the gradient parameters onto the master nnet, and then, after updating the master nnet's parameters, using bcast to send the updated parameters to the slave nnets. Is this already possible, or do I need to wait for an enhancement?
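For reference, here is a minimal sketch of the reduce-then-broadcast flow described above, assuming the communicators can be created for this setup at all; comm, stream, grad, param, count and myRank are placeholders that would be set up elsewhere:
// Hypothetical sketch (NCCL 1.x-style API): sum gradients onto rank 0,
// update the master parameters there, then broadcast them back out.
ncclReduce(grad, grad, count, ncclFloat, ncclSum, /*root=*/0, comm, stream);
cudaStreamSynchronize(stream);
if (myRank == 0) {
  // apply the optimizer update to param on the master copy
}
ncclBcast(param, count, ncclFloat, /*root=*/0, comm, stream);
cudaStreamSynchronize(stream);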
I was trying to install nccl on my PC:
Ubuntu 16.04 LTS
GTX 1080
CUDA 8.0
CUDNN 5.1
but got errors:
minedl@minedl-machine:~/nccl$ make install
Compiling src/core.cu > /home/minedl/nccl/build/obj/core.o
src/common_coll.h(96): error: class "ncclComm" has no member "buffSizePerRing"
src/common_coll.h(100): error: class "ncclMem" has no member "doneCount"
src/common_coll.h(105): error: class "ncclComm" has no member "nRings"
3 errors detected in the compilation of "/tmp/tmpxft_000037f6_00000000-5_core.cpp4.ii".
Makefile:109: recipe for target '/home/minedl/nccl/build/obj/core.o' failed
make: *** [/home/minedl/nccl/build/obj/core.o] Error 2
Has anyone encountered this error and solved it before?
Hi,
I have a single-GPU application that offloads data in segments, and I am now trying to extend it to work on multiple GPUs so it can scale both horizontally and vertically.
Q1: I am considering adopting NCCL for single-node, multi-GPU task distribution. I think this is the way to go at the moment. Is this where NCCL will help me?
Q2: When it comes to distribution across multi-GPU clusters (for example, two 8-GPU systems), does NCCL also help there, or is it better to look into frameworks like rCUDA or MPI?
Q3: Does NCCL aspire to become a generic layer into which I could potentially connect any system with any number of GPUs and distribute tasks?
Thanks for suggestions/answers.
Ladislav
Not really an issue.
With a view to moving toward CUDA 8.0 and the Pascal (PCIe/NVLink) architecture, is Nickel already available in CUDA 8.0?
Do I need to download it from the GitHub site?
Is Nickel already optimized for NVLink?
Thanks,
Franco
I noticed that there is only ncclReduceScatter, but no ncclScatter (it is commented out in the header file).
Why is that?
Hello,
I tried running the test/single/all_reduce test on M40 nodes and the test just hangs. The same test works fine on TitanX nodes. I'm running the test on 2 GPUs with cuda 7.5. The driver version for the TitanX node is 352.79. Here is the output from the TitanX nodes:
~/nccl$ ./build/test/single/all_reduce_test 100
# Using devices
# Rank 0 uses device 0 [0x04] GeForce GTX TITAN X
# Rank 1 uses device 1 [0x05] GeForce GTX TITAN X
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
100 100 char sum 0.038 0.00 0.00 0e+00 0.046 0.00 0.00 0e+00
100 100 char prod 0.032 0.00 0.00 0e+00 0.046 0.00 0.00 0e+00
100 100 char max 0.032 0.00 0.00 0e+00 0.045 0.00 0.00 0e+00
100 100 char min 0.037 0.00 0.00 0e+00 0.044 0.00 0.00 0e+00
100 25 int sum 0.033 0.00 0.00 0e+00 0.045 0.00 0.00 0e+00
100 25 int prod 0.032 0.00 0.00 0e+00 0.044 0.00 0.00 0e+00
100 25 int max 0.032 0.00 0.00 0e+00 0.044 0.00 0.00 0e+00
100 25 int min 0.032 0.00 0.00 0e+00 0.043 0.00 0.00 0e+00
100 50 half sum 0.033 0.00 0.00 0e+00 0.044 0.00 0.00 0e+00
100 50 half prod 0.048 0.00 0.00 0e+00 0.069 0.00 0.00 0e+00
100 50 half max 0.035 0.00 0.00 0e+00 0.053 0.00 0.00 0e+00
100 50 half min 0.036 0.00 0.00 0e+00 0.048 0.00 0.00 0e+00
100 25 float sum 0.034 0.00 0.00 0e+00 0.052 0.00 0.00 0e+00
100 25 float prod 0.035 0.00 0.00 0e+00 0.049 0.00 0.00 0e+00
100 25 float max 0.035 0.00 0.00 0e+00 0.050 0.00 0.00 0e+00
100 25 float min 0.036 0.00 0.00 0e+00 0.050 0.00 0.00 0e+00
96 12 double sum 0.033 0.00 0.00 0e+00 0.049 0.00 0.00 0e+00
96 12 double prod 0.034 0.00 0.00 0e+00 0.050 0.00 0.00 0e+00
96 12 double max 0.033 0.00 0.00 0e+00 0.049 0.00 0.00 0e+00
96 12 double min 0.046 0.00 0.00 0e+00 0.049 0.00 0.00 0e+00
96 12 int64 sum 0.034 0.00 0.00 0e+00 0.049 0.00 0.00 0e+00
96 12 int64 prod 0.033 0.00 0.00 0e+00 0.049 0.00 0.00 0e+00
96 12 int64 max 0.034 0.00 0.00 0e+00 0.052 0.00 0.00 0e+00
96 12 int64 min 0.034 0.00 0.00 0e+00 0.048 0.00 0.00 0e+00
96 12 uint64 sum 0.046 0.00 0.00 0e+00 0.049 0.00 0.00 0e+00
96 12 uint64 prod 0.033 0.00 0.00 0e+00 0.049 0.00 0.00 0e+00
96 12 uint64 max 0.034 0.00 0.00 0e+00 0.052 0.00 0.00 0e+00
96 12 uint64 min 0.034 0.00 0.00 0e+00 0.053 0.00 0.00 0e+00
I tried running the same binary on M40 nodes with drivers 352.79 and 352.93. However, the test just stalls:
~/nccl$ ./build/test/single/all_reduce_test 100
# Using devices
# Rank 0 uses device 0 [0x04] Tesla M40
# Rank 1 uses device 1 [0x05] Tesla M40
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
Can you advise regarding this problem? Have you tried running these tests on M40s?
I replaced MPI with NCCL in my procedure, and I'm surprised that it greatly outperforms MPI. Thank you very much for your wonderful work. I'm going to call Gather and Scatter in my program, but these two functions are not implemented yet. Could you put them on the agenda?
Hi,
I've built and run the mpi_test on 1 node with 8 TitanX GPUs successfully. I use srun to launch the MPI test and it passes. However, the test fails when run across 2 nodes with 8 TitanX GPUs per node. I use the following command line:
srun -N2 -n16 --gres=gpu:8 -p TitanXx8 build/test/mpi/mpi_test 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
The test fails with the following error:
WARN src/core.cu:225 failed to allocate 2101248 byte device buffer
WARN src/core.cu:596 rank 12 failed to allocate device buffer
WARN src/core.cu:683 rank 12 failed to allocate communicator
NCCL Init failed (10) 'cuda malloc failed'
Does NCCL run across multiple nodes?
Hello!
I'm trying to compile nccl, but I'm getting the following errors:
Compiling src/all_reduce.cu > build/obj/all_reduce.o
src/reduce_kernel.h(199): error: identifier "__half22float2" is undefined
src/reduce_kernel.h(203): error: identifier "__float22half2_rn" is undefined
src/reduce_kernel.h(214): error: identifier "__half22float2" is undefined
src/reduce_kernel.h(218): error: identifier "__float22half2_rn" is undefined
src/reduce_kernel.h(229): error: identifier "__half22float2" is undefined
src/reduce_kernel.h(233): error: identifier "__float22half2_rn" is undefined
src/reduce_kernel.h(248): error: identifier "__half22float2" is undefined
src/reduce_kernel.h(252): error: identifier "__float22half2_rn" is undefined
8 errors detected in the compilation of "/tmp/tmpxft_000004bb_00000000-13_all_reduce.compute_52.cpp1.ii".
make: *** [build/obj/all_reduce.o] Error 2
I have a TITAN X and cuda-7.5 installed. I ran make CUDA_HOME=/usr/local/cuda-7.5 test to build the library.
Do you have any idea why it fails? I've seen that these identifiers are defined in /usr/local/cuda-7.5/include/cuda_fp16.h, but it's not included in reduce_kernel.h. Also, they are guarded by a check for __CUDA_ARCH__ >= 530, but my GPU has compute capability 5.2. Since the TITAN X is a Maxwell card, it should be supported, right?
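For illustration, a sketch of the kind of architecture guard involved: the half2 conversion intrinsics are only defined for __CUDA_ARCH__ >= 530, so on an sm_52 build a fallback has to widen each half separately. The fallback below is my own sketch, not NCCL's actual code, and assumes cuda_fp16.h is included:
__device__ inline float2 loadHalf2AsFloat2(const __half2* p) {
#if __CUDA_ARCH__ >= 530
  return __half22float2(*p);                      // native half2 -> float2 on sm_53+
#else
  // Pre-sm_53 fallback: convert each half element individually.
  const __half* h = reinterpret_cast<const __half*>(p);
  float2 f;
  f.x = __half2float(h[0]);
  f.y = __half2float(h[1]);
  return f;
#endif
}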
Dear NCCL developers,
I got a confusing error message when trying to run mpi_test.
I built mpi_test using $ make CUDA_HOME=/usr/local/cuda MPI_HOME=$WORK_PATH/openmpi mpitest
And then $ export PATH=$PATH:./build/test/mpi/
And when I run the test using $ mpirun -np 2 mpi_test 0 1 (there are 4 GPUs in the machine), I get the following message:
*** stack smashing detected ***: mpi_test terminated
[snake07:16792] *** Process received signal ***
[snake07:16792] Signal: Aborted (6)
[snake07:16792] Signal code: (-6)
[snake07:16792] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36cb0)[0x7f1ffd7c7cb0]
[snake07:16792] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f1ffd7c7c37]
[snake07:16792] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f1ffd7cb028]
[snake07:16792] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x732a4)[0x7f1ffd8042a4]
[snake07:16792] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7f1ffd89bbbc]
[snake07:16792] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x0)[0x7f1ffd89bb60]
[snake07:16792] [ 6] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(+0x7e315)[0x7f1ffd0a5315]
[snake07:16792] [ 7] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(opal_hwloc191_hwloc_backends_notify_new_object+0x41)[0x7f1ffd0a0291]
[snake07:16792] [ 8] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(opal_hwloc191_hwloc_insert_pci_device_list+0x1b5)[0x7f1ffd0a47d5]
[snake07:16792] [ 9] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(+0x821fc)[0x7f1ffd0a91fc]
[snake07:16792] [10] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(opal_hwloc191_hwloc_topology_load+0x29d)[0x7f1ffd0c6abd]
[snake07:16792] [11] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(opal_hwloc_base_get_topology+0xe2)[0x7f1ffd0991d2]
[snake07:16792] [12] /net/mlfs01/export/users/leywang/openmpi/lib/libmpi.so(ompi_mpi_init+0x5dd)[0x7f20046f7b2d]
[snake07:16792] [13] /net/mlfs01/export/users/leywang/openmpi/lib/libmpi.so(MPI_Init+0x16b)[0x7f20047176eb]
[snake07:16792] [14] mpi_test[0x4014e4]
[snake07:16792] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f1ffd7b2f45]
[snake07:16792] [16] mpi_test[0x401c77]
[snake07:16792] *** End of error message ***
*** stack smashing detected ***: mpi_test terminated
[snake07:16793] *** Process received signal ***
[snake07:16793] Signal: Aborted (6)
[snake07:16793] Signal code: (-6)
[snake07:16793] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36cb0)[0x7f975342fcb0]
[snake07:16793] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f975342fc37]
[snake07:16793] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f9753433028]
[snake07:16793] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x732a4)[0x7f975346c2a4]
[snake07:16793] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7f9753503bbc]
[snake07:16793] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x0)[0x7f9753503b60]
[snake07:16793] [ 6] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(+0x7e315)[0x7f9752d0d315]
[snake07:16793] [ 7] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(opal_hwloc191_hwloc_backends_notify_new_object+0x41)[0x7f9752d08291]
[snake07:16793] [ 8] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(opal_hwloc191_hwloc_insert_pci_device_list+0x1b5)[0x7f9752d0c7d5]
[snake07:16793] [ 9] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(+0x821fc)[0x7f9752d111fc]
[snake07:16793] [10] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(opal_hwloc191_hwloc_topology_load+0x29d)[0x7f9752d2eabd]
[snake07:16793] [11] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(opal_hwloc_base_get_topology+0xe2)[0x7f9752d011d2]
[snake07:16793] [12] /net/mlfs01/export/users/leywang/openmpi/lib/libmpi.so(ompi_mpi_init+0x5dd)[0x7f975a35fb2d]
[snake07:16793] [13] /net/mlfs01/export/users/leywang/openmpi/lib/libmpi.so(MPI_Init+0x16b)[0x7f975a37f6eb]
[snake07:16793] [14] mpi_test[0x4014e4]
[snake07:16793] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f975341af45]
[snake07:16793] [16] mpi_test[0x401c77]
[snake07:16793] *** End of error message ***
I don't get any clue from the error message. Could you help me?
I am running a NCCL reduction across multiple GPUs on an Amazon P2 16x instance in a multi-process context (one MPI rank per GPU). When I added small arrays together across 16 workers I got the error "peer mapping resources exhausted". Looking online I determined that perhaps I was limited to 8 GPUs in a group and NCCL wasn't dealing with this limitation internally.
However, when I reduced between two groups of 8 GPUs using NCCL (by splitting MPI_COMM_WORLD into two separate communicators) and then did a standard MPI reduction in host memory to reduce the remaining two arrays, I got the same error. Same for 7 GPUs. I had to reduce the group size to 4 to get the correct behaviour.
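For what it's worth, a rough sketch of the split-communicator approach described above, with one process per GPU; names are illustrative and error checking is omitted:
// Split MPI_COMM_WORLD into groups of 8 ranks; each group gets its own NCCL comm.
int worldRank;
const int groupSize = 8;
MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);
MPI_Comm groupComm;
MPI_Comm_split(MPI_COMM_WORLD, worldRank / groupSize, worldRank, &groupComm);

int groupRank, groupRanks;
MPI_Comm_rank(groupComm, &groupRank);
MPI_Comm_size(groupComm, &groupRanks);

// One NCCL communicator per group, bootstrapped over the MPI sub-communicator.
ncclUniqueId id;
if (groupRank == 0) ncclGetUniqueId(&id);
MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, groupComm);
ncclComm_t comm;
ncclCommInitRank(&comm, groupRanks, id, groupRank);

// NCCL reduction inside each group, then a host-side MPI reduction across groups.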
It seems this is unrelated to the peer ensemble limitation but instead is related to other resources needed for multi-process reductions on a single node.
Joss Knight
The documentation specifies
"Test binaries are located in the subdirectories nccl/build/test and nccl/build/mpitest."
and
"./build/test/all_reduce_test"
Whereas when I built the library, the mpitest binary was absent and all_reduce_test was present in the ./build/test/single/ directory.
Am I missing something here?
When I use NCCL with the cifar10 example, I find the training time does not change. How should I use it?
Using $TOOLS/caffe train --solver=examples/cifar10/cifar10_full_solver.prototxt -gpu 0
I0302 03:19:23.473100 8238 caffe.cpp:197] Using GPUs 0
I0302 03:19:23.473930 8238 caffe.cpp:202] GPU 0: Tesla P100-SXM2-16GB
I0302 03:19:23.825523 8238 solver.cpp:48] Initializing solver from parameters:
I0302 03:28:01.841064 8238 solver.cpp:362] Iteration 55000, Testing net (#0)
I0302 03:28:02.094408 8238 solver.cpp:429] Test net output #0: accuracy = 0.7896
I0302 03:28:02.094431 8238 solver.cpp:429] Test net output #1: loss = 0.620957 (* 1 = 0.620957 loss)
I0302 03:28:02.104485 8238 solver.cpp:242] Iteration 55000 (102.234 iter/s, 1.9563s/200 iter), loss = 0.370225
I0302 03:28:02.104549 8238 solver.cpp:261] Train net output #0: loss = 0.370225 (* 1 = 0.370225 loss)
Using $TOOLS/caffe train --solver=examples/cifar10/cifar10_full_solver.prototxt -gpu all, the time hardly changes!
I0302 03:31:23.499303 8361 solver.cpp:362] Iteration 5000, Testing net (#0)
I0302 03:31:23.723893 8361 solver.cpp:429] Test net output #0: accuracy = 0.6884
I0302 03:31:23.723942 8361 solver.cpp:429] Test net output #1: loss = 0.890609 (* 1 = 0.890609 loss)
I0302 03:31:23.733755 8361 solver.cpp:242] Iteration 5000 (99.2158 iter/s, 2.01581s/200 iter), loss = 0.571789
I0302 03:31:23.733794 8361 solver.cpp:261] Train net output #0: loss = 0.571788 (* 1 = 0.571788 loss)
Hi, all
I receive the following warning when running test/mpi/mpi_test:
A process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: [[46957,1],3] (PID 2896)
If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
Test Passed!
The test passed, but does this matter, and how can I avoid it?
I use OpenMPI-2.0.0. It seems that I cannot receive any information when the program crashes, which is important in my application.
Hi staff,
I observed mysterious hangs with MPI + NCCL. Basically, I follow the test case nccl/test/mpi/mpi_test.cu. My scenario is also a single GPU per process (rank). According to issue #37, which discussed multi-threaded scenarios, the hang was resolved by adding a boost::barrier before the NCCL call. Similarly, I added MPI_Barrier() before the NCCL call in my case, but it still hangs.
Is this a known issue with NCCL? Maybe I am missing something. Do you have any suggestions about how to fix it?
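For context, this is roughly the pattern described above (buffer and communicator names are placeholders); note that the barrier only synchronizes the host processes and says nothing about GPU work already queued on the streams:
MPI_Barrier(MPI_COMM_WORLD);                      // host-side rendezvous before the collective
ncclAllReduce(sendbuff, recvbuff, count, ncclFloat,
              ncclSum, comm, stream);             // every rank must make this call
cudaStreamSynchronize(stream);                    // wait for the collective to complete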
I'm looking at the NV branch of Caffe with NCCL support. It uses a barrier before doing allreduce. Is it still necessary, or is NCCL tracking data dependencies already?
I used CentOS 7.0 and CUDA 7.5 on a server with 6 Tesla cards; it stops and gives no response when running ./all_reduce_test 10000000 under the single folder.
My GPU topo is as below
CPU 0 -- GPU0
-- GPU1
-- GPU2
CPU 1 -- GPU3
-- GPU4
-- GPU5
Even when I ran with ./all_reduce_test 2 0 1, it still didn't run.
Do I need to install MPI even if I use the tests in the single folder? Is the single test valid for a multi-CPU topology like the one above?
I checked ACSCtl; all entries are negative. I don't know what else I can do.
Does this tool support the POWER8 NVLink system with the P100 GPU? According to the blog below, this tool needs GPUDirect, but the OpenPOWER system doesn't have GPUDirect. That is why I ask whether this tool is ready for the OpenPOWER POWER8 NVLink system with the P100 GPU.
https://devblogs.nvidia.com/parallelforall/fast-multi-gpu-collectives-nccl/
"NCCL makes extensive use of GPUDirect Peer-to-Peer direct access to push data between processors."
Hello,
It has come to my attention that all the examples are single-threaded. I tried a multithreaded example with the following code:
void* GPU(void* threadid)
{
  const int size = 2;
  int tid = *((int*) threadid);
  cudaSetDevice(tid);
  ncclComm_t comm = comms[tid]; // initialized as a file-scope variable in the main thread
  PerThreadData* data = (PerThreadData*) malloc(sizeof(PerThreadData));
  int cudaDev;
  int rank;
  cudaDeviceProp prop;
  ncclCommCuDevice(comm, &cudaDev);
  ncclCommUserRank(comm, &rank);
  cudaGetDeviceProperties(&prop, cudaDev);
  // initialization
  cudaStreamCreate(&(data->stream));
  cudaMalloc(&(data->sendBuff), sizeof(double)*size);
  cudaMalloc(&(data->recvBuff), sizeof(double)*size);
  double temp[2] = {tid+1, tid+1};
  cudaMemcpy(data->sendBuff, temp, sizeof(double)*size, cudaMemcpyHostToDevice);
  cudaMemcpy(data->recvBuff, temp, sizeof(double)*size, cudaMemcpyHostToDevice);
  data->size = size;
  printf("# Rank %2d uses device %2d [0x%02x] %s\n", rank, cudaDev, prop.pciBusID, prop.name);
  printf("Hello World! It's me, thread #%d!\n", tid);
  // destruction
  pthread_exit(NULL);
}
Each thread is responsible for one GPU and runs the above code, and this creates a deadlock.
I assume NCCL is not thread-safe in this case. Is that true? Should we only use it from a single thread?
Thank you.
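For reference, a sketch of what the collective call in each worker thread might look like once the buffers are set up; this is my own completion of the snippet above, and it assumes every thread reaches the call, since a collective blocks until all ranks in the clique have launched it:
// Inside the per-thread function, after the buffers are initialized:
ncclAllReduce(data->sendBuff, data->recvBuff, data->size, ncclDouble,
              ncclSum, comm, data->stream);
cudaStreamSynchronize(data->stream);  // a thread that never gets here stalls all the others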
I followed the instructions from the readme, but I can't get the tests to run.
Can someone help me?
My steps:
1. git clone https://github.com/NVIDIA/nccl.git
2. cd nccl
3. sudo make CUDA_HOME=/usr/local/cuda-8.0 test
4. sudo make install
5. export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./build/lib
and then:
sudo ./build/test/single/all_reduce_test 10000000
which gives:
./build/test/single/all_reduce_test: error while loading shared libraries: libnccl.so.1: cannot open shared object file: No such file or directory
but:
zengraoli@zengraoli-desktop:/dl/nccl$ ll /usr/local/lib/libnccl.so*
lrwxrwxrwx 1 root root       12 Mar  4 18:51 /usr/local/lib/libnccl.so -> libnccl.so.1*
lrwxrwxrwx 1 root root       16 Mar  4 18:51 /usr/local/lib/libnccl.so.1 -> libnccl.so.1.3.3*
-rwxr-xr-x 1 root root 23126897 Mar  4 18:51 /usr/local/lib/libnccl.so.1.3.3*
hi, all
Is it safe to reuse a cudaStream_t object (say stream0) after cudaStreamSynchronize(stream0)? More specifically, is the following code safe:
cudaStream_t stream0;
CHECK_EQ(cudaStreamCreateWithFlags(&stream0, cudaStreamNonBlocking), cudaSuccess);
for (int i = 0; i < 1000000; ++i) {
  // communication
  CHECK_EQ(ncclAllReduce(src_ptr, dst_ptr, some_count, ncclFloat,
                         ncclSum, some_comm, stream0), ncclSuccess);
  // do something else
  ...
  ...
  // wait for it
  cudaStreamSynchronize(stream0);
  // without destroying stream0
}
This is an excellent and necessary library. My understanding is that each collective communication is implemented via ring communications. If this is the case, a large class of problems (e.g. halo communications) could benefit greatly from exposing the collective ring communication as another primitive.
I imagine this could look similar to MPI's virtual topology:
https://computing.llnl.gov/tutorials/mpi/#Virtual_Topologies
where the ncclComm (or a wrapper-like object) would be exposed as a ring_communicator that could be passed to ring_rank, ring_coord, ring_shift, send, recv, and sendrecv-like functions.
I was going to take a quick crack at this, but thought I would get some feedback from the experts first.
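To make the idea concrete, a purely hypothetical sketch of what such an interface could look like; none of these functions exist in NCCL today, and the names are invented for illustration:
/* Hypothetical ring-communicator primitives layered on an existing ncclComm. */
typedef struct ncclRingComm* ncclRingComm_t;

ncclResult_t ncclRingCommCreate(ncclComm_t comm, ncclRingComm_t* ring);
ncclResult_t ncclRingRank(ncclRingComm_t ring, int* rank);            /* my position on the ring */
ncclResult_t ncclRingShift(ncclRingComm_t ring, int displacement,
                           int* src, int* dst);                       /* neighbors, MPI_Cart_shift style */
ncclResult_t ncclRingSendRecv(const void* sendbuff, void* recvbuff,
                              int count, ncclDataType_t type,
                              ncclRingComm_t ring, cudaStream_t stream);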
I built nccl with cuda-7.5:
make CUDA_HOME=/usr/local/cuda-7.5 test
And run test with the following command:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./build/lib
./build/test/all_reduce_test
causes a segmentation fault:
# Using devices
Segmentation fault
But all tests run smoothly if I build nccl with cuda-7.0. Is the current version of nccl not compatible with cuda-7.5?
Dear All:
I have some questions from testing the nccl/test/mpi/nccl_test.cu code. Firstly, does it support communication across different workstations? Secondly, will it block the GPU computation during the NCCL data communication (such as ncclAllReduce)? Thanks a lot.
Yours
Sincerely
While using NCCL in my program, I found that every GPU uses extra memory. For example, I used 4 GPUs, and every GPU used an extra 3 chunks of 110 MB, as shown in the image. My question is: how can I reduce the 110 MB?
I followed the instructions from the readme, but I can't get the tests to run. Is there any additional advice someone can give me?
# make CUDA_HOME=/usr/local/cuda test
Compiling src/libwrap.cu > build/obj/libwrap.o
Compiling src/core.cu > build/obj/core.o
Compiling src/all_gather.cu > build/obj/all_gather.o
Compiling src/all_reduce.cu > build/obj/all_reduce.o
Compiling src/broadcast.cu > build/obj/broadcast.o
Compiling src/reduce.cu > build/obj/reduce.o
Compiling src/reduce_scatter.cu > build/obj/reduce_scatter.o
Linking build/lib/libnccl.so.1.2.2
Grabbing src/nccl.h > build/include/nccl.h
Building test/single/all_gather_test.cu > build/test/single/all_gather_test
Building test/single/all_reduce_test.cu > build/test/single/all_reduce_test
Building test/single/broadcast_test.cu > build/test/single/broadcast_test
Building test/single/reduce_test.cu > build/test/single/reduce_test
Building test/single/reduce_scatter_test.cu > build/test/single/reduce_scatter_test
# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./build/lib
# ./build/test/single/all_reduce_test
Error: must specify at least data size in bytes!
Tests nccl AllReduce with user supplied arguments.
Usage: all_reduce_test <data size in bytes> [number of GPUs] [GPU 0] [GPU 1] ...
# ./build/test/single/all_reduce_test 10000000
NCCL failure test/single/all_reduce_test.cu:259 'unhandled cuda error'
# nvidia-smi
Tue Jun 7 18:35:23 2016
+------------------------------------------------------+
| NVIDIA-SMI 361.42 Driver Version: 361.42 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:02:00.0 Off | N/A |
| 22% 35C P8 15W / 250W | 23MiB / 12287MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:04:00.0 Off | N/A |
| 22% 34C P8 14W / 250W | 23MiB / 12287MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:83:00.0 Off | N/A |
| 22% 34C P8 16W / 250W | 23MiB / 12287MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX TIT... Off | 0000:84:00.0 Off | N/A |
| 22% 32C P8 15W / 250W | 23MiB / 12287MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Hi, I am new to DIGITS and Linux, so please forgive me if the issue is too naive.
I built NV Caffe from the source code; my system has 4 K40c Tesla cards.
I built the NCCL lib according to the guideline on this page, but when I "cmake .." Caffe, the system tells me that NCCL was not found. I think it is some problem with the environment variable setup. Would you please tell me how to fix it? Thanks.
Distro: Linux 4.2.6-300.fc23.x86_64
/usr/lib/gcc/x86_64-redhat-linux/5.3.1/include/mwaitxintrin.h(36): error: identifier "__builtin_ia32_monitorx" is undefined
/usr/lib/gcc/x86_64-redhat-linux/5.3.1/include/mwaitxintrin.h(42): error: identifier "__builtin_ia32_mwaitx" is undefined
However when I add the CXX flags:
-D_MWAITXINTRIN_H_INCLUDED
-D_FORCE_INLINES
-D__STRICT_ANSI__
the build completes successfully; see here:
tensorflow/tensorflow#1066
Hi Nickel team,
I have introduced your library into my application. The integration was done in a multi-threaded scenario. Each thread uses allreduce, and in principle the allreduce is called in a loop.
The first part of the loop body computes intermediate data, and at the end of the loop body I call allreduce.
It works perfectly, but from time to time I fall into a deadlock. Attaching to the process with gdb, I can see that (N-1) threads are in cudaStreamSynchronize() (each allreduce has its own custom CUDA stream) while 1 thread is in cuMemFreeHost() (I use the CUDA allocator for GPU and CPU memory).
What happens is that, during the first part of the loop body, one thread needs to reallocate some memory before doing its processing, while the other (N-1) threads do their own processing and jump into the Nickel allreduce.
From time to time this creates a deadlock condition. My guess is that there is some timing condition under which the threads' actions produce the deadlock.
This is not deterministic: the need to reallocate happens deterministically after some iterations, but it does not always produce a deadlock.
Could you help me in some way?
It is not clear whether this is a CUDA issue, a Nickel/CUDA bug, or a CUDA limitation.
Does any memory management action (alloc/free, CPU/GPU) require that the GPUs are idle?
I use the Nickel allreduce GPU-based sync methods. No CPU-based barrier is introduced before entering allreduce().
Do I need to add a CPU-based barrier? Is there any safe C/C++ code to use in that case?
Thanks a lot,
Franco
Here are some details from gdb:
(gdb) where
#0 0x00007fffc6bffa11 in clock_gettime ()
#1 0x0000003ab7a03e46 in clock_gettime () from /lib64/librt.so.1
#2 0x00007fc415a821de in ?? () from /usr/lib64/libcuda.so.1
#3 0x00007fc4154377ab in ?? () from /usr/lib64/libcuda.so.1
#4 0x00007fc41538ffde in ?? () from /usr/lib64/libcuda.so.1
#5 0x00007fc415412916 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00007fc415412fa8 in ?? () from /usr/lib64/libcuda.so.1
#7 0x00007fc4153793fc in ?? () from /usr/lib64/libcuda.so.1
#8 0x00007fc415347392 in cuMemFreeHost () from /usr/lib64/libcuda.so.1
#9 0x00007fc41ac6284d in ?? () from /usr/local/cuda-7.5//lib64/libcudart.so.7.5
#10 0x00007fc41ac4782c in ?? () from /usr/local/cuda-7.5//lib64/libcudart.so.7.5
(gdb) where
#0 0x00007fffc6bffa11 in clock_gettime ()
#1 0x0000003ab7a03e46 in clock_gettime () from /lib64/librt.so.1
#2 0x00007fc415a821de in ?? () from /usr/lib64/libcuda.so.1
#3 0x00007fc4154377ab in ?? () from /usr/lib64/libcuda.so.1
#4 0x00007fc415414e33 in ?? () from /usr/lib64/libcuda.so.1
#5 0x00007fc415414f89 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00007fc415388c87 in ?? () from /usr/lib64/libcuda.so.1
#7 0x00007fc4153610c2 in cuStreamSynchronize () from /usr/lib64/libcuda.so.1
#8 0x00007fc41ac40d90 in ?? () from /usr/local/cuda-7.5//lib64/libcudart.so.7.5
#9 0x00007fc41ac781fd in cudaStreamSynchronize () from /usr/local/cuda-7.5//lib64/libcudart.so.7.5
Just out of curiosity, is there a reason we infer CUDA_VERSION out of libcudart.so, instead of __CUDACC_VER_MAJOR__ and __CUDACC_VER_MINOR__ defined by nvcc? I am curious mainly because having CUDA_VERSION figured out via shell script makes it kind of hard to compile in a separate environment without manually feeding in these macros.
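For illustration, a sketch of how the nvcc-defined macros mentioned above could be checked directly at compile time, instead of deriving the version from the libcudart.so filename; this is not the project's actual Makefile logic:
// Compile-time CUDA toolkit version check using nvcc's predefined macros.
#if defined(__CUDACC_VER_MAJOR__) && (__CUDACC_VER_MAJOR__ >= 8)
  // CUDA 8.0+ code path
#else
  // older-toolkit code path
#endif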
Hi All,
I have noticed crashes when I overload a device with more than one nccl comm. For example,
below I want to use 6 instances of the comm with only two devices, 0 and 1. I see crashes even with fewer instances, e.g. two instances of the comm per device. Does NCCL assume that only one comm is created per device? This is restrictive if that is the case.
./build/test/single/broadcast_test 10000000 6 0 0 0 0 0 1
INFO NCCL debug level set to INFO
INFO rank 0 using buffSize = 2097152
INFO rank 0 using device 0 (0000:03:00.0)
INFO rank 1 using buffSize = 2097152
INFO rank 1 using device 0 (0000:03:00.0)
INFO rank 2 using buffSize = 2097152
INFO rank 2 using device 0 (0000:03:00.0)
INFO rank 3 using buffSize = 2097152
INFO rank 3 using device 0 (0000:03:00.0)
INFO rank 4 using buffSize = 2097152
INFO rank 4 using device 0 (0000:03:00.0)
Segmentation fault
Amith
I could follow this library easily for memory allocated by cudaMalloc, since I know the exact count/datatype I requested. How can I use memory allocated by cudaMallocPitch?
Thanks,
Pranav
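Not an official answer, but one way to reason about it: cudaMallocPitch pads each row out to pitch bytes, so you can either include the padding elements in the count or run one collective per row. Below is a rough sketch of the first option; width, height, comm and stream are assumed to exist, the data is float, and every rank is assumed to zero-initialize its padding so reducing it is harmless:
// 2D buffer allocated with cudaMallocPitch; pitch is the padded row size in bytes.
float* d_buf;
size_t pitch;
cudaMallocPitch((void**)&d_buf, &pitch, width * sizeof(float), height);

// Treat the whole padded region as one contiguous array of floats.
int count = (int)(pitch / sizeof(float)) * height;
ncclAllReduce(d_buf, d_buf, count, ncclFloat, ncclSum, comm, stream);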
I have attached a script that used to work and now does not (parallel-nccl.lua); I have also attached an ancillary file (nccl_ffi.lua) which serves as a Torch/Lua wrapper for nccl.h and is needed to run parallel-nccl.lua
The runtime environment is a Linux laptop, Ubuntu 16.04, 2 x GTX 970M cards, driver version 367.35 (the current release level for the GTX 970M's). Cuda Toolkit v. 8rc.
I am using the Torch development environment.
The script tests the basic nccl operations. Instead of MPI, it uses a functionally similar package in Torch called "parallel", which is a multi-process harness, not to be confused with other Torch packages which also use the word "parallel" in their names.
The script hangs during the "reduce-out-of-place" test when running 2 workers per GPU.
I have seen similar behavior when training a neural net using the "AllReduce" function to consolidate gradient parameters between processes (each process trains a clone of a network), the code hangs at the first subsequent call to inspect/modify a (Torch) tensor.
Stopping the X Windows Server (sudo service lightdm stop) does not help.
Any insights would be much appreciated.
Save the attached files into the same folder, run the test script using "th parallel-nccl.lua".
The script hangs (for me, at any rate) during the reduce-out-of-place code block towards the beginning. Note that the script comprises a "parent" process and a "worker" process. The code for the worker is above the code for the parent.
Which gcc version is needed by NCCL?
/usr/bin/make64 MAC=64 CUDA_HOME=/home/work/cuda-7.5/ test
Compiling src/libwrap.cu > build/obj/libwrap.o
nvcc warning : The -c++11 flag is not supported with the configured host compiler. Flag will be ignored.
src/core.h(47): error: expected an identifier
src/core.h(61): warning: parsing restarts here after previous syntax error
src/core.h(111): error: DevRing is not a template
2 errors detected in the compilation of "/tmp/tmpxft_00004dea_00000000-13_libwrap.compute_52.cpp1.ii".
make64: *** [build/obj/libwrap.o] Error 2
When I compile this project, the error above occurs.
Dear NCCL team,
First of all, thanks very much for such a nice open-source project.
I just got to know about you through the Parallel-Forall Blog.
Currently, I'm testing your examples in a small production PC, and I noticed that the topology that I'm using is a little bit complex, namely:
[r1bsl@supermicro single]$ nvidia-smi topo --matrix
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 CPU Affinity
GPU0 X PIX SOC SOC SOC SOC 0-7,16-23
GPU1 PIX X SOC SOC SOC SOC 0-7,16-23
GPU2 SOC SOC X PIX PHB PHB 8-15,24-31
GPU3 SOC SOC PIX X PHB PHB 8-15,24-31
GPU4 SOC SOC PHB PHB X PIX 8-15,24-31
GPU5 SOC SOC PHB PHB PIX X 8-15,24-31
Legend:
X = Self
SOC = Path traverses a socket-level link (e.g. QPI)
PHB = Path traverses a PCIe host bridge
PXB = Path traverses multiple PCIe internal switches
PIX = Path traverses a PCIe internal switch
As you may see, I'm working with K80-type GPUs in this machine.
I've noticed that I have no problem running your tests using one of the internal GPUs, e.g.:
[r1bsl@supermicro single]$ ./all_gather_test 10000000 3 1 3 5
# Using devices
# Rank 0 uses device 1 [0x06] Tesla K80
# Rank 1 uses device 3 [0x85] Tesla K80
# Rank 2 uses device 5 [0x89] Tesla K80
# bytes N type time algbw busbw delta
10000000 10000000 char 5.247 3.81 3.81 0e+00
10000000 2500000 int 4.872 4.11 4.11 0e+00
10000000 5000000 half 4.802 4.16 4.16 0e+00
10000000 2500000 float 4.816 4.15 4.15 0e+00
10000000 1250000 double 4.793 4.17 4.17 0e+00
10000000 1250000 int64 4.766 4.20 4.20 0e+00
10000000 1250000 uint64 4.731 4.23 4.23 0e+00
However, if I want to run the test using both internal GPUs in a single K80 card, I get into trouble:
[r1bsl@supermicro single]$ ./all_gather_test 100000 2 2 3
# Using devices
# Rank 0 uses device 2 [0x84] Tesla K80
# Rank 1 uses device 3 [0x85] Tesla K80
# bytes N type time algbw busbw delta
[code stalls]
^C
The execution stalls and I have no option other than to kill it.
My question is: Can NCCL handle such a complex topology? And if so, what can I do to modify the examples so that I can run them with all 6 of my GPUs?
So we have some 8 gpu machines running Maxwell TitanX's and we decided to try swapping them out with the newer cards. The basic hardware architecture is a pair of 80 lane switches connected to the same CPU. The driver is version 367.35
When running your benchmark on the Maxwell cards I get pretty much the expected numbers:
./build/test/single/all_reduce_test 10000000 2 0 1
N type op time algbw busbw
10000000 char sum 0.886 11.28 11.28
./build/test/single/all_reduce_test 10000000 2 0 7
N type op time algbw busbw
10000000 char sum 1.215 8.23 8.23
I get about a 6x slowdown running on the exact same system but with Pascal instead of Maxwell cards. Also, the test that traverses the CPU runs at the same speed:
./build/test/single/all_reduce_test 10000000 2 0 1
N type op time algbw busbw
10000000 char sum 5.650 1.77 1.77
./build/test/single/all_reduce_test 10000000 2 0 7
N type op time algbw busbw
10000000 char sum 5.661 1.77 1.77
The slowdown is about 5x when running the test with all 8 gpus enabled.
Here are the results on an Intel Z170 chipset running two Pascal Titans on 8x PCIe. There doesn't seem to be an issue here (about 2x slower than the Maxwells running on 16x PCIe).
N type op time algbw busbw
10000000 char sum 1.674 5.97 5.97
When testing with the cuda sample p2pBandwidthLatencyTest program, I get nearly identical results with the 2 sets of cards. The exception is the latency numbers with peer access enabled:
Maxwell:
P2P=Enabled Latency Matrix (us)
D\D 0 1 2 3 4 5 6 7
0 4.10 8.03 8.22 7.46 7.25 6.95 7.47 6.76
1 7.37 4.49 7.26 7.25 7.05 6.66 7.25 6.72
2 7.27 7.33 4.24 7.34 7.16 6.66 7.49 6.85
3 7.19 7.05 7.38 4.04 6.94 6.47 6.86 6.73
4 7.03 6.85 6.89 7.30 3.90 6.52 7.19 6.72
5 7.42 7.24 6.94 7.02 7.00 4.22 7.17 7.09
6 8.68 7.32 7.11 7.24 7.07 7.10 4.41 6.39
7 7.77 7.76 7.20 7.68 8.09 6.77 7.55 4.01
Pascal:
P2P=Enabled Latency Matrix (us)
D\D 0 1 2 3 4 5 6 7
0 3.39 20.00 14.53 19.57 19.73 16.15 19.17 17.48
1 16.02 11.21 14.42 19.54 19.75 16.20 19.11 17.55
2 16.07 19.93 3.79 19.58 19.72 16.48 19.18 17.56
3 16.03 19.85 14.43 4.35 19.72 16.25 19.10 17.47
4 16.21 19.81 14.58 19.88 11.34 16.06 19.09 17.39
5 16.28 19.76 14.62 19.63 19.61 3.27 19.14 17.31
6 16.07 19.95 14.55 19.70 19.62 16.18 4.03 17.36
7 16.08 20.03 14.62 19.89 19.78 16.05 19.27 11.23
We see similar slowdowns when training large models in tensorflow (which is how this came to our attention). Your tool seemed like a good way to probe this issue. Is there a more appropriate place to submit this bug?
Hello everyone!
I tried compiling NCCL from source on Fedora 24 with GCC 6.1 and CUDA 8.0 on a server with 2 GPUs (a single Tesla K80 card). However, I ran into the following error while trying to compile:
/usr/local/cuda/include/cuda_fp16.h(2970): error: more than one instance of overloaded function "isinf" matches the argument list
I then tried adding the -std=c++98 flag to the CXXFLAGS in the Makefile, and it progressed further, albeit with the following warnings:
cc1: warning: command line option ‘-std=c++98’ is valid for C++/ObjC++ but not for C
The error it threw this time was:
error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support must be enabled with the -std=c++11 or -std=gnu++11 compiler options
So if I understand correctly, if I try to compile with GCC 6 without any flags, it throws the overloaded error, but if I try to compile with an older GCC standard, it can't compile because it needs the newer GCC functions.
Do you maybe have any suggestions or am I out of luck trying to get NCCL working in this environment? Is there perhaps any alternative to NCCL for multi-GPU usage for Deep Learning (specifically, Caffe)?
Thank you very much! :)
Do we need to enable peer-to-peer communication between GPUs manually, or does NCCL do it automatically (or not need it at all)?
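For reference, manually enabling peer access with the CUDA runtime looks like the sketch below; my understanding is that NCCL sets this up internally when the topology allows it, but treat that as an assumption rather than a documented guarantee:
// Manual peer-access check/enable between device 0 and device 1.
int canAccess = 0;
cudaDeviceCanAccessPeer(&canAccess, /*device=*/0, /*peerDevice=*/1);
if (canAccess) {
  cudaSetDevice(0);
  cudaDeviceEnablePeerAccess(1 /*peerDevice*/, 0 /*flags, must be 0*/);
}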
As more and more libraries are built on this project, support for the Windows platform is necessary. Although PR #31 already makes such an effort, it lags behind the newest version. I think official support for the Windows platform should be added in the future.
I have written up some notes and cookbook examples of using MPS and nccl with Torch, which may help Torch users who are new to multi-process, multi-GPU environments.
My notes can be found at:
https://github.com/CCorfield/Torch-parallel-nccl-MPS-Example
Please advise on corrections and additions.
Hi All,
I was having a difficult time understanding the latency of around 1.6 usec posted for a 10000000-byte allreduce, which gives a bandwidth of ~6 GB/sec. Where do all the other overheads, such as kernel launch and CUDA synchronization, go? These easily amount to more than 10 usec. I am missing something here; please help me understand.
thanks..
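A quick worked number may help, assuming the figure refers to the all_reduce_test output posted earlier in this thread (time 1.674, algbw 5.97): the time column there appears to be in milliseconds, not microseconds, so
algbw = bytes / time = 10,000,000 B / 1.674 ms ≈ 5.97 GB/s,
which matches the printed algbw column. At millisecond scale, tens of microseconds of launch and synchronization overhead are a small fraction of the total.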
Hi,
I included the Nickel library in my tool and I'm up and running.
So far I have used ncclAllReduce with float32 buffers and ncclSum reduction.
Is there an easy way to have the API work with float32 buffers but realize the transfer and the addition with FP16 floats? (The sum could even be emulated in float32, as long as the transfer is at FP16.)
Thanks,
franco
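Not an official answer, but one way to approximate this today is to convert explicitly around the collective. A rough sketch follows; float_to_half and half_to_float are assumed user-written kernels built on __float2half()/__half2float() from cuda_fp16.h, and note that the accumulation then also happens in FP16, which is slightly different from the float32-sum / FP16-transfer combination asked about:
// Convert float32 -> fp16, allreduce in half precision, convert back.
float_to_half<<<blocks, threads, 0, stream>>>(d_f32, d_f16, count);
ncclAllReduce(d_f16, d_f16, count, ncclHalf, ncclSum, comm, stream);
half_to_float<<<blocks, threads, 0, stream>>>(d_f16, d_f32, count);
cudaStreamSynchronize(stream);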
hi all
I got an error; I can't make nccl:
user@user-ProLiant-DL380-Gen9:~/nccl$ make CUDA_HOME=</usr/local/cuda-7.5> test
-bash: test: Is a directory
user@user-ProLiant-DL380-Gen9:~/nccl$ ls
debian fortran LICENSE.txt Makefile Makefile~ README.md src test
The test directory already exists; I don't know what my mistake is.
Some information:
I installed with the command below:
[@ppk_02 nccl-1.2.3-1-cuda7.5]$ make CUDA_HOME=/usr/local/cuda test
Compiling src/libwrap.cu > build/obj/libwrap.o
nvcc fatal : Value 'gnu++0x' is not defined for option 'std'
make: *** [build/obj/libwrap.o] Error 1
Is there anyone who can help me figure it out? Thanks very much.
Dear NCCL team:
I had no problem compiling against the OpenMPI 1.10.2 libraries and the CUDA toolkit 7.5 to execute your example on my workstation. The PC I'm using has the following topology:
[manuel@nhri single]$ nvidia-smi topo --matrix
GPU0 GPU1 CPU Affinity
GPU0 X PHB 0-7
GPU1 PHB X 0-7
Legend:
X = Self
SOC = Path traverses a socket-level link (e.g. QPI)
PHB = Path traverses a PCIe host bridge
PXB = Path traverses multiple PCIe internal switches
PIX = Path traverses a PCIe internal switch
I'm testing your MPI example but I'm running into Segmentation fault errors, namely:
[manuel@nhri mpi]$ ~/openMPI/bin/mpirun -np 1 mpi_test
[nhri:08445] *** Process received signal ***
[nhri:08445] Signal: Segmentation fault (11)
[nhri:08445] Signal code: Address not mapped (1)
[nhri:08445] Failing at address: (nil)
[nhri:08445] [ 0] /lib64/libpthread.so.0(+0xf100)[0x7f1b82419100]
[nhri:08445] [ 1] /lib64/libc.so.6(+0x3a167)[0x7f1b8165f167]
[nhri:08445] [ 2] mpi_test[0x40151d]
[nhri:08445] [ 3] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f1b81646b15]
[nhri:08445] [ 4] mpi_test[0x401c09]
[nhri:08445] *** End of error message ***
%--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 8445 on node nhri exited on signal 11 (Segmentation fault).
%--------------------------------------------------------------------------
All my machines use CentOS 7.0 and my CUDA toolkit is 7.5. I know my question is very naive, but is there any preferred machine configuration for running your examples?
architecture 6.1 is missing for the GTX1080 cards here:
https://github.com/NVIDIA/nccl/blob/master/Makefile#L41
as reported here: torch/cunn#289
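As a stop-gap until the Makefile is updated, the missing gencode can usually be supplied on the make command line; a sketch (the exact variable name, NVCC_GENCODE, should be checked against the current Makefile):
make CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_61,code=sm_61" test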
Hello,
What do out-of-place and in-place mean in the test routines? Thank you.
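Not an authoritative definition, but in the tests the distinction appears to be simply whether the result is written to a separate buffer or back into the input buffer; roughly:
// out-of-place: the result lands in a different buffer than the input
ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream);
// in-place: the same pointer is passed for input and output (sendbuff == recvbuff)
ncclAllReduce(buff, buff, count, ncclFloat, ncclSum, comm, stream);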
I am building the master version (at commit 2a974f5), with Mac OS 10.12 and got the following error:
jiayq-mbp:nccl jiayq$ make
ls: /usr/local/cuda/lib64/libcudart.so.: No such file or directory
ls: /usr/local/cuda/lib64/libcudart.so.: No such file or directory
ls: /usr/local/cuda/lib64/libcudart.so.: No such file or directory
ls: /usr/local/cuda/lib64/libcudart.so.: No such file or directory
Grabbing src/nccl.h > /Users/jiayq/Research/nccl/build/include/nccl.h
ls: /usr/local/cuda/lib64/libcudart.so.: No such file or directory
ls: /usr/local/cuda/lib64/libcudart.so.: No such file or directory
Compiling src/libwrap.cu > /Users/jiayq/Research/nccl/build/obj/libwrap.o
ls: /usr/local/cuda/lib64/libcudart.so.: No such file or directory
ls: /usr/local/cuda/lib64/libcudart.so.: No such file or directory
Compiling src/core.cu > /Users/jiayq/Research/nccl/build/obj/core.o
src/core.cu(717): error: expected an expression
src/core.cu(717): error: expected an expression
2 errors detected in the compilation of "/var/folders/4x/jpsdl58x643dsgw7tbq1zs5clp6v5p/T//tmpxft_00013b6e_00000000-11_core.compute_52.cpp1.ii".
make: *** [/Users/jiayq/Research/nccl/build/obj/core.o] Error 2
@slayton58 recommended that I open an issue - happy to provide more details :)
My build environment is nvcc 8.0.54 and Apple LLVM version 8.0.0 (clang-800.0.42.1).
I am facing an issue while installing 'nccl' for Tesla M2070 + CUDA library 7.0.
-> % ./build/test/single/all_gather_test 10000000
# Using devices
# Rank 0 uses device 0 [0x06] Tesla M2070
# Rank 1 uses device 1 [0x14] Tesla M2070
# Rank 2 uses device 2 [0x11] Tesla M2070
# bytes N type time algbw busbw delta
CuRAND error 204 at test/include/test_utilities.h:112
NCCL compiled without any errors, but when the test is run, a cuRAND error occurs.
What could be the reason and how can I fix it?
Thanks in advance!
Hi NCCL team,
I downloaded the NCCL library and compiled the whole suite (lib + sample code) under CUDA 7.5.
Then I tried one of your sample runs (./build/test/single/all_reduce_test 10000000) using my in-house hardware setup (K10/K80).
In the K10 run, I get this:
CuRAND error 204 at test/include/test_utilities.h:111
triggered by a Randomize() call in the sample code.
I can get around that simply by resetting the buffer to 0 instead of randomizing it, but weird elapsed-time and bandwidth estimation values are printed on screen, like this:
$ ./build/test/single/all_reduce_test 10000000
# Using devices
# Rank 0 uses device 0 [0x26] Tesla K10.G1.8GB
# Rank 1 uses device 1 [0x27] Tesla K10.G1.8GB
# Rank 2 uses device 2 [0x2a] Tesla K10.G1.8GB
# Rank 3 uses device 3 [0x2b] Tesla K10.G1.8GB
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
10000000 10000000 char sum 0.009 1153.20 1729.81 2e-316 0.020 493.75 740.63 7e-310
10000000 10000000 char prod 0.008 1239.88 1859.82 7e-310 0.020 510.67 766.01 7e-310
10000000 10000000 char max 0.008 1221.47 1832.21 7e-310 0.020 510.46 765.70 7e-310
10000000 10000000 char min 0.008 1244.91 1867.37 7e-310 0.020 511.95 767.93 7e-310
In the K80 case (with Randomize replaced by a memset to 0), it seems fine:
$ ./build/test/single/all_reduce_test 10000000
# Using devices
# Rank 0 uses device 0 [0x0c] Tesla K80
# Rank 1 uses device 1 [0x0d] Tesla K80
# Rank 2 uses device 2 [0x10] Tesla K80
# Rank 3 uses device 3 [0x11] Tesla K80
# Rank 4 uses device 4 [0x14] Tesla K80
# Rank 5 uses device 5 [0x15] Tesla K80
# Rank 6 uses device 6 [0x18] Tesla K80
# Rank 7 uses device 7 [0x19] Tesla K80
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
10000000 10000000 char sum 2.887 3.46 6.06 0e+00 2.895 3.45 6.05 0e+00
10000000 10000000 char prod 2.547 3.93 6.87 0e+00 2.583 3.87 6.77 0e+00
10000000 10000000 char max 2.151 4.65 8.14 0e+00 2.166 4.62 8.08 0e+00
10000000 10000000 char min 1.966 5.09 8.90 0e+00 1.994 5.02 8.78 0e+00
What is going on with the K10? It seems to me that it is covered by the NCCL requirements.
In any case, could you support the K10? Even though we're moving toward newer GPU architectures, my lab is largely populated with K10s.
Thanks,
Franco
Edited for legibility by @lukeyeager