
Comments (2)

yyinter commented on August 24, 2024

When you run a function for the first time, the GPU needs to initialize and load the necessary computational resources, which can result in a longer execution time. Subsequent runs of the same function benefit from cached resources and run faster. You can also record start and end time points with event.record() and then compute the difference with event.elapsed_time(); this gives a more accurate measurement of GPU runtime.
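
As a minimal sketch of that pattern (my_func, inp, and fc1 below are placeholders for whatever you are timing, taken from the code later in this thread), discard a few warm-up runs and only then read the events:

import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

# Warm-up runs absorb one-time costs (context init, kernel compilation, caches)
for _ in range(3):
    my_func(inp, fc1.weight, fc1.bias)
torch.cuda.synchronize()

start.record()
my_func(inp, fc1.weight, fc1.bias)
end.record()
torch.cuda.synchronize()  # events are recorded asynchronously; sync before reading
print(f'GPU time: {start.elapsed_time(end):.4f} ms')  # elapsed_time() is in milliseconds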


Ppaddington commented on August 24, 2024

> When you run a function for the first time, the GPU needs to initialize and load the necessary computational resources, which can result in a longer execution time. Subsequent runs of the same function benefit from cached resources and run faster. You can also record start and end time points with event.record() and then compute the difference with event.elapsed_time(); this gives a more accurate measurement of GPU runtime.

Thanks a lot!

import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for num in range(5):
    start.record()
    res1 = my_func(inp, fc1.weight, fc1.bias)
    end.record()
    torch.cuda.synchronize()  # wait for the GPU before reading the events
    elapsed_time = start.elapsed_time(end)  # milliseconds
    print(f'elapsed time {num}: ', elapsed_time)
elapsed time 0: 1194.6536865234375
elapsed time 1: 0.1934719979763031
elapsed time 2: 0.1515520066022873
elapsed time 3: 0.1443839967250824
elapsed time 4: 0.14035199582576752
The GPU-time and wall-clock measurements appear to be consistent.
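
For what it's worth, Triton also ships a small benchmarking helper that handles warm-up and repetition internally; exact defaults and keyword arguments vary across Triton versions, so treat this as a sketch:

import triton.testing

# do_bench runs the callable many times after a warm-up phase and returns a
# runtime in milliseconds (the aggregation mode depends on the Triton version).
ms = triton.testing.do_bench(lambda: my_func(inp, fc1.weight, fc1.bias))
print(f'steady-state kernel time: {ms:.4f} ms')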

Actually, I focus on wall-clock time because I am trying to implement an efficient operator for LLM inference.

For example, during the decoding stage of Llama-2-7B there are two MLP feed-forward operations (matrix multiplications) in each transformer layer.
After Triton compilation finished (1282.27764 ms for the first call), I measured 0.35153 ms of wall-clock time for the MLP forward operation.
However, I was hoping for 0.14346 ms; with that, I could accelerate end-to-end inference time!
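
Worth noting (a sketch, under the assumption that the ~1282 ms is Triton's one-time JIT compilation and autotuning cost): if you run the operator once per input shape at model-load time, decoding no longer pays the compile spike. This does not close the 0.35 ms vs. 0.14 ms per-call gap, but it removes the first-call overhead from end-to-end latency. prewarm, layers, and sample_inputs below are hypothetical names:

import torch

def prewarm(layers, sample_inputs):
    # Run the Triton op once for every (layer, input-shape) pair it will see
    # during decoding, so compilation/autotuning finish before serving starts.
    for inp in sample_inputs:
        for fc1 in layers:
            my_func(inp, fc1.weight, fc1.bias)
    torch.cuda.synchronize()  # make sure all compilation work has completed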

Could you provide some advice on this?

