Comments (20)
For the programming model issue Simon mentioned, we have to wait for a later version of CUDA (after 8.0). The feature we want is in development by the CUDA team, so hopefully we're talking sooner rather than later, but we can't comment on CUDA release schedules or on which future version will contain which feature.
-Cliff
I don't think it is really necessary; it's just better while developing and debugging. I didn't experience much performance degradation with the barrier, but that may depend on the network/use case.
OK thanks. Also, did you see a significant difference between a layer-by-layer reduce vs. a single large reduce at the end? And between single-process and multi-process? I see multi-process involves another copy. Thanks.
@slayton58 to comment on that.
@cypof Layer-by-layer allows allReduce calls to overlap with other computation, so there can be large improvements in performance and scalability.
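For concreteness, here is a minimal sketch of the overlap pattern described above, assuming a hypothetical Layer struct holding each layer's gradient buffer (this is not Caffe's API): record an event when a layer's backward pass finishes, make a dedicated communication stream wait on it, and enqueue that layer's allreduce so it overlaps with the backward computation of the remaining layers. Per the rest of this thread, the CPU barrier would go right before each ncclAllReduce call.

```cpp
#include <cuda_runtime.h>
#include <nccl.h>

struct Layer {    // hypothetical stand-in for a framework's layer/blob
  float* grad;    // device pointer to this layer's gradients
  size_t count;   // number of elements
};

// Backward pass with per-layer allreduce overlapped on a second stream.
void backward_with_overlap(Layer* layers, int num_layers, ncclComm_t comm,
                           cudaStream_t compute, cudaStream_t comm_stream) {
  for (int l = num_layers - 1; l >= 0; --l) {
    // ... enqueue layer l's backward computation on `compute` here ...

    // Make the comm stream wait until layer l's gradients are ready.
    cudaEvent_t grads_ready;
    cudaEventCreateWithFlags(&grads_ready, cudaEventDisableTiming);
    cudaEventRecord(grads_ready, compute);
    cudaStreamWaitEvent(comm_stream, grads_ready, 0);

    // Reduce just this layer; earlier layers' backward keeps running.
    // (The CPU barrier across ranks would go right before this call.)
    ncclAllReduce(layers[l].grad, layers[l].grad, layers[l].count,
                  ncclFloat, ncclSum, comm, comm_stream);
    cudaEventDestroy(grads_ready);  // freed once the event completes
  }
  cudaStreamSynchronize(comm_stream);  // all gradients reduced
}
```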
OK, I could not see a significant difference, but I think I'm measuring it wrong. I will try again. Thanks.
It seems that, layer-wise or not, the barrier is necessary. Otherwise NCCL occasionally hangs, and some of the GPUs are pegged at 100%, as if some of them skipped a collective and were still polling on the others. Have you seen that behavior?
I guess leaving the barrier in is OK in single-process mode, as it only seems to cost a few percent, but I hope I don't hit this issue in multi-process mode, as we are trying to finish Caffe's Python version. Thanks.
Have you tried removing the barrier in the NV version of Caffe? It hangs pretty systematically for me. As we add more collectives in Caffe, e.g. for solver-state broadcast and support of both layer-by-layer and global all-reduce, it's becoming a bit tedious to pass the barrier around and call it before every collective.
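One way to stop passing the barrier around by hand is to wrap the communicator and the shared barrier in a small helper and route every collective through it. GuardedComm below is a hypothetical sketch, not part of NCCL or Caffe:

```cpp
#include <cuda_runtime.h>
#include <nccl.h>
#include <boost/thread/barrier.hpp>

// Hypothetical wrapper: every rank's thread calls collectives through it,
// and the CPU barrier guarantees no thread enqueues a collective before
// all peers have reached the same point.
class GuardedComm {
 public:
  GuardedComm(ncclComm_t comm, boost::barrier& bar) : comm_(comm), bar_(bar) {}

  ncclResult_t allReduce(const float* send, float* recv, size_t count,
                         cudaStream_t stream) {
    bar_.wait();  // rendezvous with all other ranks' threads first
    return ncclAllReduce(send, recv, count, ncclFloat, ncclSum, comm_, stream);
  }

  // broadcast(), reduce(), etc. would follow the same pattern.

 private:
  ncclComm_t comm_;
  boost::barrier& bar_;
};
```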
Hi folks,
I've just opened an issue (#48) since I ran into a deadlock condition using NCCL.
My case is different from yours: multiple allreduce calls hang, waiting for one that never gets enqueued somewhere.
I want to evaluate introducing a CPU-based barrier.
Do you have any suggestions for a specific C/C++ barrier implementation?
Franco
boost::barrier works great, but it's a large dependency if your app doesn't already use Boost.
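If Boost is too heavy a dependency, a reusable barrier is small enough to write with C++11 primitives alone; this untested sketch behaves like boost::barrier (C++20 later standardized std::barrier):

```cpp
#include <condition_variable>
#include <mutex>

// Reusable (cyclic) barrier for a fixed number of threads.
class CpuBarrier {
 public:
  explicit CpuBarrier(unsigned count) : threshold_(count), count_(count) {}

  void wait() {
    std::unique_lock<std::mutex> lock(mu_);
    unsigned gen = generation_;
    if (--count_ == 0) {       // last thread to arrive
      ++generation_;           // open a new generation
      count_ = threshold_;     // reset for the next use
      cv_.notify_all();
    } else {
      cv_.wait(lock, [&] { return gen != generation_; });
    }
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  const unsigned threshold_;
  unsigned count_;
  unsigned generation_ = 0;
};
```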
Thanks for the feedback... let me try.
If it's successful, I'll take care of the Boost dependency.
Is anybody looking at this? The only way to get full reliability is to call a barrier before each collective, which for Caffe means before each layer during backprop. If I only call the barrier once per iteration, perf is in some cases up to 20% better on 8 GPUs, but it occasionally hangs with some of the GPUs at 100%, apparently polling on something. If you have a plan to fix this, I can wait or leave it like that; otherwise I can try to switch to a separate thread for reduces and barriers, but that's more complicated.
@cypof Currently a barrier is required due to a programming model limitation - we expect this restriction to be relaxed in a future version of CUDA.
@cypof Thanks for the help. boost::barrier() works for me too; I stayed up and running all weekend.
@slayton58 In my case I observe the same side effects as cypof, that is, GPUs at 100%, apparently polling.
Could you tell us more about when the restriction could be relaxed?
Hi @cypof, I've come across the same issue. Does a barrier in a separate thread work? It seems like it could be better, since we could keep computing without any barrier in the main thread.
It might help, but I'm not sure. I tried playing a bit with multiple streams, waiting on events, etc., but it didn't seem to help vs. just making the main thread wait. It's hard to know exactly how the tasks interact internally. I'd like to remove the barrier for speed, but mostly to simplify the code.
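For illustration only, one shape the separate-thread idea could take (hypothetical names, and no claim it actually hides the barrier cost, per the comment above): the compute thread just enqueues jobs, and a dedicated thread per GPU does the CPU barrier and issues the allreduce.

```cpp
#include <cuda_runtime.h>
#include <nccl.h>
#include <boost/thread/barrier.hpp>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

struct ReduceJob { float* buf; size_t count; };  // in-place allreduce

class AsyncReducer {
 public:
  AsyncReducer(ncclComm_t comm, boost::barrier& bar, cudaStream_t stream)
      : comm_(comm), bar_(bar), stream_(stream), worker_([this] { loop(); }) {}

  ~AsyncReducer() {
    { std::lock_guard<std::mutex> lock(mu_); stop_ = true; }
    cv_.notify_one();
    worker_.join();
  }

  void submit(ReduceJob job) {  // called from the compute thread
    std::lock_guard<std::mutex> lock(mu_);
    jobs_.push(job);
    cv_.notify_one();
  }

 private:
  void loop() {
    for (;;) {
      std::unique_lock<std::mutex> lock(mu_);
      cv_.wait(lock, [&] { return stop_ || !jobs_.empty(); });
      if (jobs_.empty()) return;  // stopping and fully drained
      ReduceJob job = jobs_.front();
      jobs_.pop();
      lock.unlock();
      bar_.wait();  // CPU barrier happens off the compute thread
      ncclAllReduce(job.buf, job.buf, job.count, ncclFloat, ncclSum,
                    comm_, stream_);
    }
  }

  ncclComm_t comm_;
  boost::barrier& bar_;
  cudaStream_t stream_;
  bool stop_ = false;
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<ReduceJob> jobs_;
  std::thread worker_;
};
```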
Thanks @cypof. Do you think MPI_Barrier(MPI_COMM_WORLD) should work in multi-process mode too, the way boost::barrier works in multi-thread mode?
It should, but I never tried MPI on a single node.
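For completeness, the multi-process analogue follows the standard NCCL+MPI initialization pattern, with MPI_Barrier playing the role boost::barrier plays across threads. A minimal single-node sketch, assuming one GPU per rank and omitting error checking:

```cpp
#include <cuda_runtime.h>
#include <mpi.h>
#include <nccl.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);
  cudaSetDevice(rank);  // single node: rank doubles as device index

  // Rank 0 creates the NCCL id; everyone receives it over MPI.
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  size_t count = 1 << 20;
  float* grads;
  cudaMalloc(&grads, count * sizeof(float));
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  MPI_Barrier(MPI_COMM_WORLD);  // process-level barrier before the collective
  ncclAllReduce(grads, grads, count, ncclFloat, ncclSum, comm, stream);
  cudaStreamSynchronize(stream);

  cudaFree(grads);
  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}
```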
Why not?
Closing old issue. Should no longer be a problem since NCCL 2.1.