Comments (7)
CC @jingxu10 @tye1, thank you!
Hello, thanks for reporting this issue. I will look into it and get back to you.
Thank you @YuningQiu , greatly appreciated!
Hello @garrett361, regarding the specific script mentioned in the GitHub issue: it currently does not achieve compute/communication overlap on PVC.
How it operates on the A100 GPU:
- The script dispatches a series of compute tasks followed by collective operations. These are issued to the GPU without blocking the host, so the compute kernels and collectives are all queued before most of them execute.
- On the A100 GPU, the compute and collective kernels are launched in an alternating pattern and execute concurrently. Additional information: on the A100, collectives run in kernels that use only a few threads. As the first compute kernel nears completion and hardware resources free up, the first independent allreduce from a separate stream is scheduled (while the second compute kernel, which depends on the first, waits for it to finish). Once the first compute kernel finishes, the second compute kernel runs simultaneously with the collective, since the collective kernel occupies only a limited number of streaming multiprocessors. A sketch of this submission pattern follows below.
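For concreteness, here is a minimal sketch of that submission pattern, assuming an initialized NCCL process group and a CUDA device; the function name, shapes, and stream handling are illustrative, not the exact script from this issue:

```python
import torch
import torch.distributed as dist

def overlap_iterations(x: torch.Tensor, w: torch.Tensor, n_iters: int = 4) -> None:
    # Side stream so collectives can run concurrently with later compute kernels.
    comm_stream = torch.cuda.Stream()
    for _ in range(n_iters):
        x = x @ w  # compute kernel: launched asynchronously, host returns immediately
        # Order the side stream after the default stream so the allreduce sees the
        # finished matmul, without serializing the *next* matmul behind it.
        comm_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(comm_stream):
            y = x.clone()                 # reduce a copy; the next matmul stays independent
            x.record_stream(comm_stream)  # keep the allocator from reusing x too early
            dist.all_reduce(y)            # few-SM kernel: free to overlap the next matmul
    torch.cuda.synchronize()
```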
Reasons for incompatibility with PVC:
- By default, the initiation of the second allreduce is implicitly delayed until the first allreduce is complete. At this point, several compute tasks but only one collective have been sent to the PVC. Additional information: when using the default (scheduled) path in oneCCL, the destruction of the event at the end of the collective submission code snippet triggers an artificial wait for the collective to complete within the event destructor. This wait blocks the host thread from continuing; a way to observe this is sketched after this list.
- On PVC, non-dependent kernels from multiple streams are executed in the order they were submitted, so the reduction kernel in the first allreduce cannot start until the final compute kernel has finished. Note: Even though oneCCL might use the copy command for data transfer by default, the copy and reduction operations are still interdependent. Therefore, the possibility of overlapping is restricted to the last compute task and a portion of the first allreduce.
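One way to observe that host-side blocking is simply to time the collective call from the host. A minimal sketch, assuming `dist.init_process_group(backend="ccl")` has already run and an `xpu` device is available (the tensor size is illustrative):

```python
import time
import torch
import torch.distributed as dist

x = torch.randn(1 << 24, device="xpu")

t0 = time.perf_counter()
dist.all_reduce(x)        # on the scheduled oneCCL path, this may not return
t1 = time.perf_counter()  # to the host until the collective has completed
print(f"all_reduce returned to the host after {t1 - t0:.4f} s")

# Contrast with a compute kernel launch, which should return almost immediately:
t0 = time.perf_counter()
y = x * 2.0
t1 = time.perf_counter()
print(f"elementwise kernel launch took {t1 - t0:.6f} s")
```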
Hi @YuningQiu , thank you for the very detailed response! I have a few follow-ups.
> By default, the initiation of the second allreduce is implicitly delayed until the first allreduce is complete. At this point, several compute tasks but only one collective have been sent to the PVC.
- Ah, you mean even the launch of the second allreduce kernel is delayed?
> the destruction of the event at the end of the collective submission code snippet triggers an artificial wait for the collective to complete within the event destructor. This wait blocks the host thread from continuing.
- And this means that the collective blocks any additional kernels being launched, irrespective of what `Stream` they were sent to?
> non-dependent kernels from multiple streams are executed in the order they were submitted.
- This means that kernels are executed in launch order regardless of what stream they are put into? If so, I don't understand the utility of `Stream`s.
> Note: Even though oneCCL might use the copy command for data transfer by default, the copy and reduction operations are still interdependent. Therefore, the possibility of overlapping is restricted to the last compute task and a portion of the first allreduce.
- I didn't quite understand this. What is the importance of the copy operation here with respect to overlapping?
Finally: I am a little confused about where in the stack the issue lies. Is there an obstruction to overlapping compute and comms at the hardware level? Or is it something in `ipex`, `torch-ccl`, or elsewhere?
And for more color, all of the above seems consistent with what I have seen from the PyTorch profiler. These are traces of a very similar workload where I attempted to overlap comms and compute for two iterations on `cuda` (A100) and `xpu` (1550).
CUDA

`cuda`: both compute and comms operations launch kernels and return immediately on the host, as seen in the minuscule vertical lines preceding the `cudaDeviceSynchronize`.
XPU

`xpu`: compute launches kernels and returns immediately, but collectives block and span a long time period until the collective finishes.
Isolated Compute and Comms on XPU
I also isolated the `xpu` cases where I perform only the compute or the comms individually. The same effects can be seen.
Compute only:
Comms only:
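For reference, traces like these can be captured with `torch.profiler`. A minimal sketch, where `run_workload` is a hypothetical stand-in for the overlap script above and `ProfilerActivity.XPU` is assumed to be available on the build in use (use `ProfilerActivity.CUDA` for the A100 runs):

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU]) as prof:
    run_workload()           # hypothetical: two iterations of compute + allreduce
    torch.xpu.synchronize()  # make sure all device work lands in the trace
prof.export_chrome_trace("xpu_trace.json")  # inspect in Perfetto / chrome://tracing
```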
Hello @garrett361, thanks for providing more details. We will discuss them internally and keep you posted with any updates.
Also, could you please share with us the PyTorch profiling file that you are showing above? Thanks a lot!