
Comments (2)

yyinter commented on August 24, 2024

When you run a function for the first time, the GPU needs to initialize and load the necessary computational resources, which can result in a longer execution time. Subsequent runs of the same function benefit from cached resources and run faster. You can also record start and end time points with event.record() and then compute the difference with event.elapsed_time(); this gives a more accurate measurement of GPU runtime.
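
As a minimal sketch of that pattern (my_func, inp, and fc1 below are placeholders for whatever you are timing, taken from the code later in this thread), discard a few warm-up runs and only then read the events:

import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

# Warm-up runs absorb one-time costs (context init, kernel compilation, caches)
for _ in range(3):
    my_func(inp, fc1.weight, fc1.bias)
torch.cuda.synchronize()

start.record()
my_func(inp, fc1.weight, fc1.bias)
end.record()
torch.cuda.synchronize()  # events are recorded asynchronously; sync before reading
print(f'GPU time: {start.elapsed_time(end):.4f} ms')  # elapsed_time() is in milliseconds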


Ppaddington commented on August 24, 2024

> When you run a function for the first time, the GPU needs to initialize and load the necessary computational resources, which can result in a longer execution time. Subsequent runs of the same function benefit from cached resources and run faster. You can also record start and end time points with event.record() and then compute the difference with event.elapsed_time(); this gives a more accurate measurement of GPU runtime.

Thanks a lot!

import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for num in range(5):
    start.record()
    res1 = my_func(inp, fc1.weight, fc1.bias)
    end.record()
    torch.cuda.synchronize()  # wait for the GPU before reading the events
    elapsed_time = start.elapsed_time(end)  # milliseconds
    print(f'elapsed time {num}: ', elapsed_time)
elapsed time 0: 1194.6536865234375
elapsed time 1: 0.1934719979763031
elapsed time 2: 0.1515520066022873
elapsed time 3: 0.1443839967250824
elapsed time 4: 0.14035199582576752
The GPU-time and wall-clock measurements appear to be consistent.
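
For what it's worth, Triton also ships a small benchmarking helper that handles warm-up and repetition internally; exact defaults and keyword arguments vary across Triton versions, so treat this as a sketch:

import triton.testing

# do_bench runs the callable many times after a warm-up phase and returns a
# runtime in milliseconds (the aggregation mode depends on the Triton version).
ms = triton.testing.do_bench(lambda: my_func(inp, fc1.weight, fc1.bias))
print(f'steady-state kernel time: {ms:.4f} ms')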

Actually, I focus on wall-clock time because I am trying to implement an efficient operator for LLM inference.

For example, during the decoding stage of Llama-2-7B there are two MLP feed-forward operations (matrix multiplications) in each transformer layer.
After Triton compilation finished (1282.27764 ms for the first call), I measured 0.35153 ms of wall-clock time for the MLP forward operation.
However, I was hoping for 0.14346 ms; with that, I could accelerate end-to-end inference time!
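
Worth noting (a sketch, under the assumption that the ~1282 ms is Triton's one-time JIT compilation and autotuning cost): if you run the operator once per input shape at model-load time, decoding no longer pays the compile spike. This does not close the 0.35 ms vs. 0.14 ms per-call gap, but it removes the first-call overhead from end-to-end latency. prewarm, layers, and sample_inputs below are hypothetical names:

import torch

def prewarm(layers, sample_inputs):
    # Run the Triton op once for every (layer, input-shape) pair it will see
    # during decoding, so compilation/autotuning finish before serving starts.
    for inp in sample_inputs:
        for fc1 in layers:
            my_func(inp, fc1.weight, fc1.bias)
    torch.cuda.synchronize()  # make sure all compilation work has completed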

Could you provide some advice on this?

