
Comments (20)

cliffwoolley commented on July 17, 2024

For the programming model issue Simon mentioned, we have to wait for a
later version of CUDA (after 8.0). The feature we're wanting is in
development by the CUDA team, so hopefully we're talking sooner rather than
later, but we can't comment on CUDA release schedules or roadmaps of which
future version will contain which feature.

-Cliff

sjeaugey commented on July 17, 2024

I don't think it is really necessary, just better while developing and debugging. I didn't experience much performance degradation with the barrier, but that may depend on the network/use case.

cypof commented on July 17, 2024

OK thanks. Also, did you see a significant difference between a layer-by-layer reduce vs. a single large reduce at the end? And between single-process and multi-process? I see multi-process involves another copy. Thanks.

cliffwoolley commented on July 17, 2024

@slayton58 to comment on that.

slayton58 commented on July 17, 2024

@cypof Layer-by-layer allows for overlapping of allReduce calls and other computation, so there can be large improvements in performance and scalability.
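
For reference, here is a minimal sketch of the two approaches being compared, assuming a hypothetical per-layer gradient descriptor and an already-initialized NCCL communicator (this is not code from this thread or from Caffe). In the layer-by-layer variant, each gradient is reduced on a dedicated communication stream as soon as its backward pass has been recorded, so the allreduce overlaps with the remaining layers' computation; per the rest of this thread, a CPU barrier across the per-GPU threads would still be needed before each ncclAllReduce.

```cpp
#include <cstddef>
#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical per-layer gradient descriptor (not part of NCCL or Caffe).
struct LayerGrad {
  float* data;   // device pointer to this layer's gradient
  size_t count;  // number of floats
};

// Layer-by-layer: call after each layer's backward launch. The gradient is
// reduced on a dedicated communication stream so the allreduce overlaps with
// the next layer's compute work.
void reduce_layer(const LayerGrad& g, ncclComm_t comm,
                  cudaStream_t compute_stream, cudaStream_t comm_stream) {
  cudaEvent_t grad_ready;
  cudaEventCreateWithFlags(&grad_ready, cudaEventDisableTiming);
  cudaEventRecord(grad_ready, compute_stream);      // gradient produced here
  cudaStreamWaitEvent(comm_stream, grad_ready, 0);  // comm waits only for it
  ncclAllReduce(g.data, g.data, g.count, ncclFloat, ncclSum, comm, comm_stream);
  cudaEventDestroy(grad_ready);  // freed automatically once the event completes
}

// Alternative: a single large reduce at the end, assuming all gradients live
// in one contiguous device buffer. Simpler, but no overlap with backprop.
void reduce_once(float* all_grads, size_t total_count, ncclComm_t comm,
                 cudaStream_t stream) {
  ncclAllReduce(all_grads, all_grads, total_count, ncclFloat, ncclSum,
                comm, stream);
  cudaStreamSynchronize(stream);
}
```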

cypof commented on July 17, 2024

OK, I could not see a significant difference, but I think I'm measuring it wrong. I will try again. Thanks.

cypof commented on July 17, 2024

It seems that, layer-wise or not, the barrier is necessary. Otherwise NCCL occasionally hangs, and some of the GPUs are pegged at 100%, as if some of them skipped a collective and were still polling on the other ones. Have you seen that behavior?

I guess leaving the barrier is OK in single-process mode, as it only seems to cost a few percent, but I hope I don't have this issue in multi-process mode, as we are trying to finish Caffe's Python version. Thanks.

cypof commented on July 17, 2024

Have you tried to remove the barrier in the NV version of Caffe? It's hanging pretty systematically for me. As we add more collectives in Caffe, e.g. for solver state broadcast and support of both layer-by-layer and global all-reduce, it's becoming a bit tedious to pass the barrier around and call it before every collective.

fmana commented on July 17, 2024

Hi folks,

I've just opened an issue (#48) since I ran into a deadlock condition using NCCL. My case is different from yours: multiple allreduce calls hang, each waiting for a peer collective that never gets enqueued.

I wanted to evaluate introducing a CPU-based barrier. Do you have any suggestions for a specific C/C++ barrier implementation?

Franco

cypof commented on July 17, 2024

boost::barrier works great. But it's a large dependency if your app doesn't already use Boost.
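
For anyone hitting the same hang, here is a minimal sketch of the pattern discussed in this thread: one host thread per GPU in a single process, with a boost::barrier rendezvous immediately before the collective so no thread launches ncclAllReduce while another has not reached it. The buffer size and the single allreduce per thread are illustrative only.

```cpp
#include <boost/thread/barrier.hpp>
#include <cuda_runtime.h>
#include <functional>
#include <nccl.h>
#include <thread>
#include <vector>

// One worker per GPU: wait at the CPU barrier, then launch the collective.
void worker(int dev, ncclComm_t comm, boost::barrier& bar,
            float* buf, size_t count, cudaStream_t stream) {
  cudaSetDevice(dev);
  bar.wait();  // every thread reaches this point before any collective starts
  ncclAllReduce(buf, buf, count, ncclFloat, ncclSum, comm, stream);
  cudaStreamSynchronize(stream);
}

int main() {
  int n = 0;
  cudaGetDeviceCount(&n);

  std::vector<int> devs(n);
  for (int i = 0; i < n; ++i) devs[i] = i;
  std::vector<ncclComm_t> comms(n);
  ncclCommInitAll(comms.data(), n, devs.data());  // single process, n GPUs

  const size_t count = 1 << 20;  // illustrative buffer size
  std::vector<float*> bufs(n);
  std::vector<cudaStream_t> streams(n);
  for (int i = 0; i < n; ++i) {
    cudaSetDevice(i);
    cudaMalloc(&bufs[i], count * sizeof(float));
    cudaMemset(bufs[i], 0, count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  boost::barrier bar(n);  // CPU-side rendezvous across the n GPU threads
  std::vector<std::thread> threads;
  for (int i = 0; i < n; ++i)
    threads.emplace_back(worker, i, comms[i], std::ref(bar),
                         bufs[i], count, streams[i]);
  for (std::thread& t : threads) t.join();

  for (int i = 0; i < n; ++i) {
    cudaFree(bufs[i]);
    cudaStreamDestroy(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}
```

In a training loop the bar.wait() would go immediately before every collective on every thread, which is exactly the bookkeeping described as tedious elsewhere in this thread.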

fmana commented on July 17, 2024

Thanks for the feedback... let me try.
If it's successful I'll take care of the Boost dependency.

cypof commented on July 17, 2024

Is anybody looking at this? The only way to get full reliability is to call a barrier before each collective, which for Caffe means before each layer during backprop. If I only call the barrier once per iteration, perf is in some cases up to 20% better on 8 GPUs, but it will occasionally hang with some of the GPUs at 100%, apparently polling on something. If you have a plan to fix this, I can wait or leave it like that; otherwise I can try to switch to a separate thread for reduces and barriers, but it's more complicated.

slayton58 commented on July 17, 2024

@cypof Currently a barrier is required due to a programming model limitation - we expect this restriction to be relaxed in a future version of CUDA.

fmana commented on July 17, 2024

@cypof Thanks for the help. boost::barrier() works for me too. I stayed up and running all weekend.

@slayton58 In my case I observe the same side effects as cypof, that is, GPUs at 100%, apparently polling.
Could you tell us more about when it could be relaxed?

hiyijian commented on July 17, 2024

Hi @cypof. I've come across the same issue. Does a barrier in a separate thread work? I think it may be better, since we could keep computing without any barrier in the main thread.

cypof commented on July 17, 2024

It might help, but I'm not sure. I tried playing a bit with multiple streams, waiting on events, etc., but it didn't seem to help vs. just making the main thread wait. It's hard to know exactly how the tasks interact internally. I'd like to remove the barrier for speed, but mostly to simplify the code.

hiyijian commented on July 17, 2024

Thanks @cypof. Do you think MPI_Barrier(MPI_COMM_WORLD) should also work in multi-process mode, the same way boost::barrier works in multi-thread mode?
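
For reference, a sketch of the multi-process analogue under the assumption of one MPI rank per GPU with an already-initialized NCCL communicator (the ncclUniqueId exchange over MPI is elided); MPI_Barrier plays the same role across processes that boost::barrier plays across threads.

```cpp
#include <cstddef>
#include <cuda_runtime.h>
#include <mpi.h>
#include <nccl.h>

// One MPI rank per GPU: CPU-side rendezvous across processes, then the collective.
void allreduce_with_barrier(ncclComm_t comm, cudaStream_t stream,
                            float* buf, size_t count) {
  MPI_Barrier(MPI_COMM_WORLD);  // no rank launches until every rank gets here
  ncclAllReduce(buf, buf, count, ncclFloat, ncclSum, comm, stream);
  cudaStreamSynchronize(stream);
}
```

As in the threaded case, the barrier would sit in front of every collective, not just the first one.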

cypof commented on July 17, 2024

It should, but I never tried MPI on a single node.

hiyijian commented on July 17, 2024

Why not?

sjeaugey commented on July 17, 2024

Closing old issue. Should no longer be a problem since NCCL 2.1.
