Comments (4)
I'm not sure what is going on with your setup in Triton for the L40. That amount of overhead for Triton vs the api_server looks off.
Here are the specs for L40 vs A100. Memory bandwidth is the most important metric for LLM inference (since the decode phase of generation is memory bound). As we can see, the A100 has more memory bandwidth than the L40.
| GPU | FP16 FLOPs | Memory Bandwidth | Memory |
|---|---|---|---|
| L40S | 366 TFLOPs | ~850 GB/s | 48 GB |
| A100-40GB | 312 TFLOPs | ~1500 GB/s | 40 GB |
| A100-80GB | 312 TFLOPs | ~2000 GB/s | 80 GB |
We definitely have more to do to optimize for Hopper and Ada Lovelace, but I do not think the results here are too surprising given the specs of the GPUs.
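To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch (illustrative only, not from any benchmark script): at batch size 1, each decode step has to stream essentially all of the model weights from GPU memory, so a crude per-token ceiling is bandwidth divided by weight bytes. The ~14 GB figure assumes Llama-2-7B in FP16 and ignores KV-cache traffic and kernel overheads.

```python
# Crude batch-1 decode ceiling: every generated token streams all FP16 weights
# once from GPU memory (KV-cache traffic and kernel overheads are ignored).
WEIGHT_BYTES_7B_FP16 = 7e9 * 2  # ~14 GB for a 7B-parameter model in FP16

bandwidth_bytes_per_s = {
    "L40S": 850e9,
    "A100-40GB": 1500e9,
    "A100-80GB": 2000e9,
}

for gpu, bw in bandwidth_bytes_per_s.items():
    print(f"{gpu}: ~{bw / WEIGHT_BYTES_7B_FP16:.0f} tok/s ceiling at batch 1")
```

The A100-40GB ceiling (~107 tok/s) is in the same ballpark as the ~95 tok/sec batch-1 decode figure quoted further down, and the L40S ceiling is only about 60% of it, which points in the same direction as the gap being discussed here.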
I noticed the same issue. I previously switched from A100 to L40s and found a significant slowdown in vllm. I'm surprised to see someone has conducted benchmark tests.
Couple things:
* The `benchmark_throughput.py` script computes `(prompt_tokens + generation_tokens) / time`, so you are comparing apples and oranges here (see the sketch below).
* The shape of your workload will have a big impact on the percentage of time spent in prefill vs decode. Prefill can do ~15k tokens/second on an A100 for a 7B model, whereas decode is much, much slower (at batch 1 it's ~95 tok/sec; at batch 64 it's ~3000 tok/sec). So if more time is being spent in decode, your aggregate throughput will be lower. I see in `benchmark_throughput.py` that you are generating 128 tokens. I'm not sure what dataset you are using for `benchmark_serving.py`, but I have found in the past that ShareGPT averages ~225 generated tokens. That is a huge difference in the amount of time spent in prefill vs decode.

So please make sure you are doing a like-for-like comparison.

For offline use cases, you might also consider increasing the max batch size for more throughput.
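To illustrate the apples-and-oranges point, here is a minimal sketch (with made-up request counts and timings, not numbers from the actual runs) of how the offline metric, which counts prompt plus generated tokens, diverges from a serving-side metric that only counts generated tokens:

```python
# Hypothetical request log: (prompt_tokens, generated_tokens) per request.
requests = [(62, 128)] * 1000   # ~62-token prompts, 128 generated tokens each
elapsed_s = 60.0                # assumed wall-clock time for the whole run

prompt_toks = sum(p for p, _ in requests)
gen_toks = sum(g for _, g in requests)

# benchmark_throughput.py-style metric: all tokens processed per second.
offline_tps = (prompt_toks + gen_toks) / elapsed_s
# A serving-side metric that counts only generated tokens.
generation_tps = gen_toks / elapsed_s

print(f"offline-style tps:   {offline_tps:.0f}")    # ~3167
print(f"generation-only tps: {generation_tps:.0f}")  # ~2133
```

Even with these short prompts the two numbers differ by roughly 1.5x, and the gap grows with longer prompts or shorter generations, so one cannot be compared directly against the other. For the offline case, I believe the relevant knob for the maximum batch size in vLLM is `max_num_seqs`, which is forwarded to the engine when constructing `LLM`.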
Thanks for your reply.
I know that offline benchmark tests and online serving are very different. I just wonder whether there is some problem or under-optimization in vLLM when running on the L40S, since it shows lower performance than expected.
Let me explain the tests I have done in more detail:
- I use a dataset with more than 9000 prompts, averaging about 62 words each
- Sampling parameters: {"max_tokens": 128, "temperature": 0, "ignore_eos": False}
- Model llama-2-7b from Hugging Face (here)
- Serving with the Triton Inference Server using the vLLM backend (here) and with the vLLM API server
- I calculate the tps from the total latency to finish all prompts (a rough sketch of this calculation is shown after the results below)
With an L40S GPU, I got the results below:
- Using Triton: ~500 tps
- Using the API server: ~1300 tps
- Running benchmark_throughput with the ShareGPT dataset shows ~3100 tps
With an A100 PCIe 40G GPU:
- Using Triton: ~1500 tps
- Using the API server: ~1600 tps
- Running benchmark_throughput with the ShareGPT dataset shows ~4800 tps
Despite having more memory and more compute power, the L40S shows lower performance than the A100.
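For reference, here is a minimal sketch of how this kind of end-to-end tps number can be computed with vLLM's offline `LLM` API and the sampling parameters above. The model id and prompt loading are placeholders, and the real tests went through Triton and the API server, so this is only an approximation of the measurement, not the actual scripts used:

```python
import time

from vllm import LLM, SamplingParams

# Placeholder prompts: the real dataset had ~9000 prompts of ~62 words each.
prompts = ["Summarize the following paragraph: ..."] * 1000

sampling_params = SamplingParams(max_tokens=128, temperature=0, ignore_eos=False)
llm = LLM(model="meta-llama/Llama-2-7b-hf")  # assumed Hugging Face model id

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Generation-only tps over the whole run (total latency to finish all prompts).
gen_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"generation-only tps: {gen_tokens / elapsed:.0f}")
```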