Comments (7)
CC @jingxu10 @tye1, thank you!
Hello, thanks for reporting this issue. I will look into it and get back to you.
Thank you @YuningQiu , greatly appreciated!
Hello @garrett361, regarding the specific script mentioned in the GitHub issue: it currently does not achieve compute/communication overlap on PVC.
How it operates on the A100 GPU:
- The script dispatches a series of compute tasks followed by collective operations. These are issued to the GPU without blocking the host, so the compute kernels and collectives are all queued before most of them execute.
- On the A100 GPU, the compute and collective kernels are launched in an alternating pattern and execute concurrently. Additional information: on the A100, collectives run in kernels that use only a few threads. As the first compute kernel nears completion and hardware resources free up, the first independent allreduce from a separate stream is scheduled (while the second compute kernel, which depends on the first, waits for it to finish). Once the first compute kernel finishes, the second compute kernel runs simultaneously with the collective, since the collective kernel occupies only a limited number of streaming multiprocessors. A sketch of this submission pattern follows below.
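For concreteness, here is a minimal sketch of that submission pattern, assuming an initialized NCCL process group and a CUDA device; the function name, shapes, and stream handling are illustrative, not the exact script from this issue:

```python
import torch
import torch.distributed as dist

def overlap_iterations(x: torch.Tensor, w: torch.Tensor, n_iters: int = 4) -> None:
    # Side stream so collectives can run concurrently with later compute kernels.
    comm_stream = torch.cuda.Stream()
    for _ in range(n_iters):
        x = x @ w  # compute kernel: launched asynchronously, host returns immediately
        # Order the side stream after the default stream so the allreduce sees the
        # finished matmul, without serializing the *next* matmul behind it.
        comm_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(comm_stream):
            y = x.clone()                 # reduce a copy; the next matmul stays independent
            x.record_stream(comm_stream)  # keep the allocator from reusing x too early
            dist.all_reduce(y)            # few-SM kernel: free to overlap the next matmul
    torch.cuda.synchronize()
```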
Reasons for incompatibility with PVC:
- By default, the initiation of the second allreduce is implicitly delayed until the first allreduce is complete. At this point, several compute tasks but only one collective have been sent to the PVC. Additional information: when using the default (scheduled) path in oneCCL, the destruction of the event at the end of the collective submission code snippet triggers an artificial wait for the collective to complete within the event destructor. This wait blocks the host thread from continuing; a way to observe this is sketched after this list.
- On PVC, non-dependent kernels from multiple streams are executed in the order they were submitted, so the reduction kernel in the first allreduce cannot start until the final compute kernel has finished. Note: Even though oneCCL might use the copy command for data transfer by default, the copy and reduction operations are still interdependent. Therefore, the possibility of overlapping is restricted to the last compute task and a portion of the first allreduce.
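One way to observe that host-side blocking is simply to time the collective call from the host. A minimal sketch, assuming `dist.init_process_group(backend="ccl")` has already run and an `xpu` device is available (the tensor size is illustrative):

```python
import time
import torch
import torch.distributed as dist

x = torch.randn(1 << 24, device="xpu")

t0 = time.perf_counter()
dist.all_reduce(x)        # on the scheduled oneCCL path, this may not return
t1 = time.perf_counter()  # to the host until the collective has completed
print(f"all_reduce returned to the host after {t1 - t0:.4f} s")

# Contrast with a compute kernel launch, which should return almost immediately:
t0 = time.perf_counter()
y = x * 2.0
t1 = time.perf_counter()
print(f"elementwise kernel launch took {t1 - t0:.6f} s")
```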
Hi @YuningQiu , thank you for the very detailed response! I have a few follow-ups.
> By default, the initiation of the second allreduce is implicitly delayed until the first allreduce is complete. At this point, several compute tasks but only one collective have been sent to the PVC.
- Ah, you mean even the launch of the second allreduce kernel is delayed?
> the destruction of the event at the end of the collective submission code snippet triggers an artificial wait for the collective to complete within the event destructor. This wait blocks the host thread from continuing.
- And this means that the collective blocks any additional kernels being launched, irrespective of what `Stream` they were sent to?
> non-dependent kernels from multiple streams are executed in the order they were submitted.
- This means that kernels are executed in launch order regardless of what stream they are put into? If so, I don't understand the utility of `Stream`s.
> Note: Even though oneCCL might use the copy command for data transfer by default, the copy and reduction operations are still interdependent. Therefore, the possibility of overlapping is restricted to the last compute task and a portion of the first allreduce.
- I didn't quite understand this. What is the importance of the copy operation here with respect to overlapping?
Finally: I am a little confused about where in the stack the issue lies. Is there an obstruction to overlapping compute and comms at the hardware level? Or is it something in `ipex`, `torch-ccl`, or elsewhere?
And for more color, all of the above seems consistent with what I have seen from the PyTorch profiler. These are traces of a very similar workload where I attempted to overlap comms and compute for two iterations on `cuda` (A100) and `xpu` (1550).
CUDA

`cuda`: both compute and comms operations launch kernels and return immediately on the host, as seen in the minuscule vertical lines preceding the `cudaDeviceSynchronize`.
XPU

`xpu`: compute launches kernels and returns immediately, but collectives block and span a long time period until the collective finishes.
Isolated Compute and Comms on XPU
I also isolated the `xpu` cases where I perform only the compute or the comms individually. The same effects can be seen.
Compute only:
Comms only:
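For reference, traces like these can be captured with `torch.profiler`. A minimal sketch, where `run_workload` is a hypothetical stand-in for the overlap script above and `ProfilerActivity.XPU` is assumed to be available on the build in use (use `ProfilerActivity.CUDA` for the A100 runs):

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU]) as prof:
    run_workload()           # hypothetical: two iterations of compute + allreduce
    torch.xpu.synchronize()  # make sure all device work lands in the trace
prof.export_chrome_trace("xpu_trace.json")  # inspect in Perfetto / chrome://tracing
```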
Hello @garrett361, thanks for providing more details. We will discuss them internally and keep you posted with any updates.
Also, could you please share with us the PyTorch profiling file that you are showing above? Thanks a lot!