Comments (20)
For the programming model issue Simon mentioned, we have to wait for a later version of CUDA (after 8.0). The feature we want is in development by the CUDA team, so hopefully we're talking sooner rather than later, but we can't comment on CUDA release schedules or on which future version will contain which feature.
-Cliff
I don't think it is really necessary; it's just better while developing and debugging. I didn't experience much performance degradation with the barrier, but that may depend on the network/use case.
OK thanks. Also, did you see a significant difference between a layer-by-layer reduce vs. a single large reduce at the end? And between single-process and multi-process? I see multi-process involves another copy. Thanks.
@slayton58 to comment on that.
@cypof Layer-by-layer allows allReduce calls to overlap with other computation, so there can be large improvements in performance and scalability.
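For concreteness, here is a minimal sketch of the overlap pattern described above, assuming a hypothetical Layer struct holding each layer's gradient buffer (this is not Caffe's API): record an event when a layer's backward pass finishes, make a dedicated communication stream wait on it, and enqueue that layer's allreduce so it overlaps with the backward computation of the remaining layers. Per the rest of this thread, the CPU barrier would go right before each ncclAllReduce call.

```cpp
#include <cuda_runtime.h>
#include <nccl.h>

struct Layer {    // hypothetical stand-in for a framework's layer/blob
  float* grad;    // device pointer to this layer's gradients
  size_t count;   // number of elements
};

// Backward pass with per-layer allreduce overlapped on a second stream.
void backward_with_overlap(Layer* layers, int num_layers, ncclComm_t comm,
                           cudaStream_t compute, cudaStream_t comm_stream) {
  for (int l = num_layers - 1; l >= 0; --l) {
    // ... enqueue layer l's backward computation on `compute` here ...

    // Make the comm stream wait until layer l's gradients are ready.
    cudaEvent_t grads_ready;
    cudaEventCreateWithFlags(&grads_ready, cudaEventDisableTiming);
    cudaEventRecord(grads_ready, compute);
    cudaStreamWaitEvent(comm_stream, grads_ready, 0);

    // Reduce just this layer; earlier layers' backward keeps running.
    // (The CPU barrier across ranks would go right before this call.)
    ncclAllReduce(layers[l].grad, layers[l].grad, layers[l].count,
                  ncclFloat, ncclSum, comm, comm_stream);
    cudaEventDestroy(grads_ready);  // freed once the event completes
  }
  cudaStreamSynchronize(comm_stream);  // all gradients reduced
}
```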
OK, I could not see a significant difference, but I think I'm measuring it wrong. I will try again. Thanks.
It seems that, layer-wise or not, the barrier is necessary. Otherwise NCCL occasionally hangs, and some of the GPUs are pegged at 100%, as if some of them skipped a collective and were still polling on the others. Have you seen that behavior?
I guess leaving the barrier in is OK in single-process mode, as it only seems to cost a few percent, but I hope I don't hit this issue in multi-process mode, as we are trying to finish Caffe's Python version. Thanks.
Have you tried removing the barrier in the NV version of Caffe? It hangs pretty systematically for me. As we add more collectives in Caffe, e.g. for solver-state broadcast and support of both layer-by-layer and global all-reduce, it's becoming a bit tedious to pass the barrier around and call it before every collective.
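One way to stop passing the barrier around by hand is to wrap the communicator and the shared barrier in a small helper and route every collective through it. GuardedComm below is a hypothetical sketch, not part of NCCL or Caffe:

```cpp
#include <cuda_runtime.h>
#include <nccl.h>
#include <boost/thread/barrier.hpp>

// Hypothetical wrapper: every rank's thread calls collectives through it,
// and the CPU barrier guarantees no thread enqueues a collective before
// all peers have reached the same point.
class GuardedComm {
 public:
  GuardedComm(ncclComm_t comm, boost::barrier& bar) : comm_(comm), bar_(bar) {}

  ncclResult_t allReduce(const float* send, float* recv, size_t count,
                         cudaStream_t stream) {
    bar_.wait();  // rendezvous with all other ranks' threads first
    return ncclAllReduce(send, recv, count, ncclFloat, ncclSum, comm_, stream);
  }

  // broadcast(), reduce(), etc. would follow the same pattern.

 private:
  ncclComm_t comm_;
  boost::barrier& bar_;
};
```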
Hi folks,
I've just opened an issue (#48) since I ran into a deadlock condition using NCCL.
My case is different from yours: multiple allreduce calls hang, waiting for one that never gets enqueued somewhere.
I want to evaluate introducing a CPU-based barrier.
Do you have any suggestions for a specific C/C++ barrier implementation?
Franco
boost::barrier works great, but it's a large dependency if your app doesn't already use Boost.
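If Boost is too heavy a dependency, a reusable barrier is small enough to write with C++11 primitives alone; this untested sketch behaves like boost::barrier (C++20 later standardized std::barrier):

```cpp
#include <condition_variable>
#include <mutex>

// Reusable (cyclic) barrier for a fixed number of threads.
class CpuBarrier {
 public:
  explicit CpuBarrier(unsigned count) : threshold_(count), count_(count) {}

  void wait() {
    std::unique_lock<std::mutex> lock(mu_);
    unsigned gen = generation_;
    if (--count_ == 0) {       // last thread to arrive
      ++generation_;           // open a new generation
      count_ = threshold_;     // reset for the next use
      cv_.notify_all();
    } else {
      cv_.wait(lock, [&] { return gen != generation_; });
    }
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  const unsigned threshold_;
  unsigned count_;
  unsigned generation_ = 0;
};
```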
Thanks for the feedback... let me try.
If it's successful, I'll take care of the Boost dependency.
Is anybody looking at this? The only way to get full reliability is to call a barrier before each collective, which for Caffe means before each layer during backprop. If I only call the barrier once per iteration, perf is in some cases up to 20% better on 8 GPUs, but it occasionally hangs with some of the GPUs at 100%, apparently polling on something. If you have a plan to fix this, I can wait or leave it like that; otherwise I can try to switch to a separate thread for reduces and barriers, but that's more complicated.
@cypof Currently a barrier is required due to a programming model limitation - we expect this restriction to be relaxed in a future version of CUDA.
@cypof Thanks for the help. boost::barrier() works for me too; I stayed up and running all weekend.
@slayton58 In my case I observe the same side effects as cypof, that is, GPUs at 100%, apparently polling.
Could you tell us more about when the restriction could be relaxed?
Hi @cypof, I've come across the same issue. Does a barrier in a separate thread work? It seems like it could be better, since we could keep computing without any barrier in the main thread.
It might help, but I'm not sure. I tried playing a bit with multiple streams, waiting on events, etc., but it didn't seem to help vs. just making the main thread wait. It's hard to know exactly how the tasks interact internally. I'd like to remove the barrier for speed, but mostly to simplify the code.
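For illustration only, one shape the separate-thread idea could take (hypothetical names, and no claim it actually hides the barrier cost, per the comment above): the compute thread just enqueues jobs, and a dedicated thread per GPU does the CPU barrier and issues the allreduce.

```cpp
#include <cuda_runtime.h>
#include <nccl.h>
#include <boost/thread/barrier.hpp>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

struct ReduceJob { float* buf; size_t count; };  // in-place allreduce

class AsyncReducer {
 public:
  AsyncReducer(ncclComm_t comm, boost::barrier& bar, cudaStream_t stream)
      : comm_(comm), bar_(bar), stream_(stream), worker_([this] { loop(); }) {}

  ~AsyncReducer() {
    { std::lock_guard<std::mutex> lock(mu_); stop_ = true; }
    cv_.notify_one();
    worker_.join();
  }

  void submit(ReduceJob job) {  // called from the compute thread
    std::lock_guard<std::mutex> lock(mu_);
    jobs_.push(job);
    cv_.notify_one();
  }

 private:
  void loop() {
    for (;;) {
      std::unique_lock<std::mutex> lock(mu_);
      cv_.wait(lock, [&] { return stop_ || !jobs_.empty(); });
      if (jobs_.empty()) return;  // stopping and fully drained
      ReduceJob job = jobs_.front();
      jobs_.pop();
      lock.unlock();
      bar_.wait();  // CPU barrier happens off the compute thread
      ncclAllReduce(job.buf, job.buf, job.count, ncclFloat, ncclSum,
                    comm_, stream_);
    }
  }

  ncclComm_t comm_;
  boost::barrier& bar_;
  cudaStream_t stream_;
  bool stop_ = false;
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<ReduceJob> jobs_;
  std::thread worker_;
};
```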
Thanks @cypof. Do you think MPI_Barrier(MPI_COMM_WORLD) should work in multi-process mode too, the way boost::barrier works in multi-thread mode?
It should, but I never tried MPI on a single node.
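For completeness, the multi-process analogue follows the standard NCCL+MPI initialization pattern, with MPI_Barrier playing the role boost::barrier plays across threads. A minimal single-node sketch, assuming one GPU per rank and omitting error checking:

```cpp
#include <cuda_runtime.h>
#include <mpi.h>
#include <nccl.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);
  cudaSetDevice(rank);  // single node: rank doubles as device index

  // Rank 0 creates the NCCL id; everyone receives it over MPI.
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  size_t count = 1 << 20;
  float* grads;
  cudaMalloc(&grads, count * sizeof(float));
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  MPI_Barrier(MPI_COMM_WORLD);  // process-level barrier before the collective
  ncclAllReduce(grads, grads, count, ncclFloat, ncclSum, comm, stream);
  cudaStreamSynchronize(stream);

  cudaFree(grads);
  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}
```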
Why not?
Closing old issue. Should no longer be a problem since NCCL 2.1.