
Comments (7)

garrett361 commented on June 3, 2024

CC @jingxu10 @tye1, thank you!


YuningQiu commented on June 3, 2024

Hello, thanks for reporting this issue. I will look into it and get back to you.


garrett361 commented on June 3, 2024

Thank you @YuningQiu, greatly appreciated!


YuningQiu commented on June 3, 2024

Hello @garrett361, regarding the specific script mentioned in this GitHub issue: it currently does not achieve compute/communication overlap on PVC.

How it operates on the A100 GPU:

  1. The script dispatches a series of compute tasks followed by collective operations. These are issued to the GPU without blocking the host, so the compute kernels and collectives are all queued before most of them execute.
  2. On the A100, the compute and collective kernels are launched in an alternating pattern and execute concurrently. Additional information: on the A100, collectives execute within kernels that use only a few threads. As the first compute kernel nears completion and hardware resources free up, the first allreduce, which is independent and was issued on a separate stream, is scheduled, while the second compute kernel, which depends on the first, waits. Once the first compute kernel finishes, the second compute kernel's threads run concurrently with the collective, since the collective kernel occupies only a few streaming multiprocessors. A minimal sketch of this launch pattern follows this list.
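
Concretely (illustrative names, and the CUDA stream API for brevity; the XPU analogue would use torch.xpu.Stream):

```python
import torch
import torch.distributed as dist

def chained_compute_with_allreduce(x, weight, comm_stream, n_iters=2):
    """Queue a dependent chain of matmuls on the default stream and an
    independent allreduce per iteration on comm_stream, without blocking
    the host. Illustrative sketch only, not the original script."""
    handles = []
    for _ in range(n_iters):
        x = x @ weight  # compute kernel; each iteration depends on the last
        # Order the comm stream after the compute that produced x, then
        # queue the collective there.
        comm_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(comm_stream):
            y = x.clone()  # snapshot, so later compute stays independent
            # async_op=True returns to the host immediately; the collective
            # kernel can then run alongside the next compute kernel.
            handles.append(dist.all_reduce(y, async_op=True))
    for h in handles:
        h.wait()
    return x
```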

Reasons for incompatibility with PVC:

  1. By default, the initiation of the second allreduce is implicitly delayed until the first allreduce is complete. At this point, several compute tasks but only one collective have been sent to the PVC. Additional information: when using the default (scheduled) path in oneCCL, the destruction of the event at the end of the collective submission code snippet triggers an artificial wait for the collective to complete within the event destructor. This wait blocks the host thread from continuing; the timing probe sketched after this list is one way to observe it.
  2. On PVC, non-dependent kernels from multiple streams are executed in the order they were submitted. The reduction kernel in the first allreduce cannot start until the final compute kernel has finished. Note: Even though oneCCL might use the copy command for data transfer by default, the copy and reduction operations are still interdependent. Therefore, the possibility of overlapping is restricted to the last compute task and a portion of the first allreduce.
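
One hypothetical way to observe the host-side blocking described in point 1 is to time the collective call itself (a sketch, assuming a "ccl" process group set up via oneccl_bindings_for_pytorch; not from the original report):

```python
import time
import torch
import torch.distributed as dist

# Hypothetical probe: measure how long the allreduce call holds the host.
# If the launch is truly asynchronous (as on A100), this returns almost
# immediately; if the runtime waits in the event destructor as described
# above, the elapsed time approaches the full collective duration.
t = torch.ones(1 << 24, device="xpu")  # device="cuda" on NVIDIA
t0 = time.perf_counter()
work = dist.all_reduce(t, async_op=True)
print(f"host-side launch latency: {time.perf_counter() - t0:.4f} s")
work.wait()
```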


garrett361 commented on June 3, 2024

Hi @YuningQiu , thank you for the very detailed response! I have a few follow-ups.

> By default, the initiation of the second allreduce is implicitly delayed until the first allreduce is complete. At this point, several compute tasks but only one collective have been sent to the PVC

  1. Ah, you mean even the launch of the second allreduce kernel is delayed?

> the destruction of the event at the end of the collective submission code snippet triggers an artificial wait for the collective to complete within the event destructor. This wait blocks the host thread from continuing.

  1. And this means that the collective blocks any additional kernels being launched, irrespective of what Stream they were sent to?

> non-dependent kernels from multiple streams are executed in the order they were submitted.

  1. This means that kernels are executed in launch order regardless of what stream they are put into? If so, I don't understand the utility of Streams.

> Note: Even though oneCCL might use the copy command for data transfer by default, the copy and reduction operations are still interdependent. Therefore, the possibility of overlapping is restricted to the last compute task and a portion of the first allreduce.

  1. I didn't quite understand this. What is the importance of the copy operation here with respect to overlapping?

Finally: I am a little confused about where in the stack the issue lies. Is there an obstruction to overlapping compute and comms at the hardware level? Or is it something in ipex, torch-ccl, or elsewhere?


garrett361 commented on June 3, 2024

And for more color: all of the above seems consistent with what I have seen from the PyTorch profiler.

These are traces of a very similar workload where I attempted to overlap comms and compute for two iterations on cuda (A100) and xpu (1550).
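
For reference, traces like these are typically collected with torch.profiler along the following lines (a sketch; the exact activities and workload of the original run are assumptions):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# run_two_iterations() is a hypothetical stand-in for the overlap workload.
activities = [ProfilerActivity.CPU, ProfilerActivity.CUDA]
# On builds with XPU profiling support, ProfilerActivity.XPU replaces CUDA.

with profile(activities=activities) as prof:
    run_two_iterations()
    torch.cuda.synchronize()  # or torch.xpu.synchronize() on XPU

prof.export_chrome_trace("comms_compute_trace.json")
```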

CUDA

cuda: both compute and comms operations launch kernels and return immediately on the host, as seen in the minuscule vertical lines preceding the cudaDeviceSynchronize.

[trace image: cuda_comms_compute]

XPU

xpu: compute launches kernels and returns immediately, but each collective blocks on the host for a long span until the collective finishes.

[trace image: xpu_comms_compute]

Isolated Compute and Comms on XPU

I also isolated the xpu cases where I perform only the compute or the comms individually. The same effects can be seen.

Compute only:

[trace image: xpu_compute_only]

Comms only:

[trace image: xpu_comms_only]


YuningQiu commented on June 3, 2024

Hello @garrett361, thanks for providing more details. We will take them back and discuss internally. We will keep you posted with any updates.

Also, could you please share with us the PyTorch profiling files behind the traces shown above? Thanks a lot!

