Comments (9)
K10 has compute capability 3.0. By default, NCCL is built for sm_35 and newer. If you tweak the Makefile to include sm_30 support, NCCL should work on the K10. You will still need to forgo the random number generator, however, as cuRAND's MT19937 supports only sm_35 and newer.
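For reference, a build invocation along these lines should do it — assuming your copy of the NCCL Makefile exposes an `NVCC_GENCODE` variable (check the Makefile; the variable name and default targets may differ between versions):

```shell
# Add a compute_30 target alongside the default sm_35 one when building NCCL.
# CUDA_HOME and the gencode list are assumptions -- adjust to your setup.
make CUDA_HOME=/usr/local/cuda \
     NVCC_GENCODE="-gencode=arch=compute_30,code=sm_30 -gencode=arch=compute_35,code=sm_35"
```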
from nccl.
Indeed, recompiling with sm_30 support included fixes my issue on the K10 (with cuRAND still disabled).
Thanks.
I succeeded in running on the K10, so I tried profiling the all_reduce_test sample code.
Attached is the nvvp picture of all_reduce_test/float/sum "outplace/inplace".
From the GTC16 NCCL presentation, I expected to see the whole memory split into chunks, with each chunk sent concurrently between GPU pairs (ring algorithm).
What I see instead is a serial data movement across GPUs completing a loop.
What did I miss?
Is it something related to the K10?
Thanks,
Franco
Most of what you see there is bookkeeping for the test framework. The
actual reduction is the smaller kernel in the middle, which does run
simultaneously on all GPUs. Zoom in on the timeline and you'll see it.
PS: For sm_30, rather than forgoing the random number generation in the
test suite completely, you could just switch cuRAND to a different
generator. It has other ones that do support older GPUs.
-Cliff
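A minimal sketch of that generator switch, assuming the test suite uses the cuRAND host API. Philox (or XORWOW, the cuRAND default) runs on pre-sm_35 GPUs; error checks are omitted for brevity:

```c
#include <cuda_runtime.h>
#include <curand.h>

/* Fill a device buffer with uniform random floats using Philox instead of
 * MT19937 (which requires sm_35+). Sketch only; check curandStatus_t
 * return values in real code. */
void fill_random(float *d_buf, size_t n, unsigned long long seed) {
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_PHILOX4_32_10); /* not MT19937 */
    curandSetPseudoRandomGeneratorSeed(gen, seed);
    curandGenerateUniform(gen, d_buf, n);  /* d_buf must be a device pointer */
    curandDestroyGenerator(gen);
}
```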
Adding to Cliff's comments, all of the "chunking" and inter-gpu synchronizations are handled by direct peer memory accesses from within a single CUDA kernel (rather than, for example, using cudaEvents and separate cudaMemcpys). So the details of the algorithm won't show up in the nvvp profile.
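An illustrative sketch of the idea — this is not NCCL's actual code, just a toy kernel showing how, once peer access is enabled, a single kernel on one GPU can read another GPU's memory directly, so no separate memcpy operations appear in the profile:

```c
#include <cuda_runtime.h>

/* Grid-stride copy that dereferences a pointer into a peer GPU's memory.
 * The transfer happens inside the kernel, invisible to nvvp as a memcpy. */
__global__ void pull_from_peer(float *dst, const float *peer_src, size_t n) {
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        dst[i] = peer_src[i];  /* direct load over PCIe/NVLink from the peer */
}

/* d_local lives on GPU 0, d_remote on GPU 1; error checks omitted. */
void example(float *d_local, const float *d_remote, size_t n) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  /* let GPU 0 map GPU 1's memory */
    pull_from_peer<<<64, 256>>>(d_local, d_remote, n);
    cudaDeviceSynchronize();
}
```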
Thanks for the comments. I'll have a deeper look at the kernel that implements the core processing (chunk & transfer). It is quite interesting, since I have never written such a complex algorithm in a single kernel.
I'll learn a lot.
Next I'll integrate your NCCL library into a toy sample code that uses it in my target scenario.
I'll let you know my findings and any issues.
BTW, what about repeatability?
Do the collectives gather their contributions in a fixed order?
Thanks,
Franco
As long as the mapping of GPU IDs to ranks remains the same, then yes, the
calculations should be deterministic and therefore the output repeatable.
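For example, pinning the device list passed to `ncclCommInitAll` keeps the GPU-to-rank mapping — and hence the reduction order — identical across runs (sketch; error checks omitted):

```c
#include <nccl.h>

/* Rank i always uses devs[i]; keep this array identical across runs
 * to get repeatable all-reduce results. */
void init_fixed_mapping(ncclComm_t comms[4]) {
    const int devs[4] = {0, 1, 2, 3};
    ncclCommInitAll(comms, 4, devs);
}
```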
Hi NCCL team,
I've finished writing my sample code integrating NCCL into a multi-threaded application.
I followed your sample code: no problem at all. I can run with our in-house K10/K20/K80.
Just 2 comments:
1] I read about an 8-GPU limitation (GTC slides). In my trials, NCCL works with up to 16 GPUs.
Was I lucky? Did I misunderstand the GTC slide content?
2] I verified your comment on repeatability: when I change the comm/GPU allocation, I lose repeatability.
Is there no work-around for that? This matters because, on a cluster, my app will be allocated
to different machines on each run. Wouldn't it be possible to define the starting point of the
ring in your ring algorithm? Thread 0 could tell NCCL which GPU to use as the start/end
of the ring. Would that make repeatability achievable?
Thanks,
Franco
I'm glad you're up and running. The 8 GPU requirement was relaxed in revision Iaa1841036a7bfdad6ebec99fed0adcd2bbe6ffad. The GTC slide was just out of date.
At this point, we don't have a good solution for general reproducibility. To avoid contending PCIe links, GPUs must communicate in a specific hardware order. If the software changes the mapping of ranks to physical GPUs, the communication order must change too, and the (non-associative) floating-point operations get carried out in a different order. Without a significant performance penalty, we can only provide deterministic output for runs with the same configuration.