Comments (4)
I'm not sure what is going on with your setup in Triton for the L40. That amount of overhead for Triton vs the api_server looks off.
Here are the specs for L40 vs A100. Memory bandwidth is the most important metric for LLM inference (since the decode phase of generation is memory bound). As we can see, the A100 has more memory bandwidth than the L40.
| GPU | FP16 FLOPs | Memory Bandwidth | Memory |
|---|---|---|---|
| L40S | 366 TFLOPs | ~850 GB/s | 48 GB |
| A100-40GB | 312 TFLOPs | ~1500 GB/s | 40 GB |
| A100-80GB | 312 TFLOPs | ~2000 GB/s | 80 GB |
We definitely have more to do to optimize for Hopper and Ada Lovelace, but I do not think the results here are too surprising given the specs of the GPUs.
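To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch (illustrative only, not from any benchmark script): at batch size 1, each decode step has to stream essentially all of the model weights from GPU memory, so a crude per-token ceiling is bandwidth divided by weight bytes. The ~14 GB figure assumes Llama-2-7B in FP16 and ignores KV-cache traffic and kernel overheads.

```python
# Crude batch-1 decode ceiling: every generated token streams all FP16 weights
# once from GPU memory (KV-cache traffic and kernel overheads are ignored).
WEIGHT_BYTES_7B_FP16 = 7e9 * 2  # ~14 GB for a 7B-parameter model in FP16

bandwidth_bytes_per_s = {
    "L40S": 850e9,
    "A100-40GB": 1500e9,
    "A100-80GB": 2000e9,
}

for gpu, bw in bandwidth_bytes_per_s.items():
    print(f"{gpu}: ~{bw / WEIGHT_BYTES_7B_FP16:.0f} tok/s ceiling at batch 1")
```

The A100-40GB ceiling (~107 tok/s) is in the same ballpark as the ~95 tok/sec batch-1 decode figure quoted further down, and the L40S ceiling is only about 60% of it, which points in the same direction as the gap being discussed here.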
I noticed the same issue. I previously switched from A100 to L40s and found a significant slowdown in vllm. I'm surprised to see someone has conducted benchmark tests.
Couple things:
* The `benchmark_throughput.py` script computes `(prompt_tokens + generation_tokens) / time`, so you are comparing apples and oranges here (see the sketch below).
* The shape of your workload will have a big impact on the percentage of time spent in prefill vs decode. Prefill can do ~15k tokens/second on an A100 for a 7B model, whereas decode is much, much slower (at batch 1 it's ~95 tok/sec; at batch 64 it's ~3000 tok/sec). So if more time is being spent in decode, your aggregate throughput will be lower. I see in `benchmark_throughput.py` that you are generating 128 tokens. I'm not sure what dataset you are using for `benchmark_serving.py`, but I have found in the past that ShareGPT averages ~225 generated tokens. That is a huge difference in the amount of time spent in prefill vs decode.

So please make sure you are doing a like-for-like comparison.

For offline use cases, you might also consider increasing the max batch size for more throughput.
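To illustrate the apples-and-oranges point, here is a minimal sketch (with made-up request counts and timings, not numbers from the actual runs) of how the offline metric, which counts prompt plus generated tokens, diverges from a serving-side metric that only counts generated tokens:

```python
# Hypothetical request log: (prompt_tokens, generated_tokens) per request.
requests = [(62, 128)] * 1000   # ~62-token prompts, 128 generated tokens each
elapsed_s = 60.0                # assumed wall-clock time for the whole run

prompt_toks = sum(p for p, _ in requests)
gen_toks = sum(g for _, g in requests)

# benchmark_throughput.py-style metric: all tokens processed per second.
offline_tps = (prompt_toks + gen_toks) / elapsed_s
# A serving-side metric that counts only generated tokens.
generation_tps = gen_toks / elapsed_s

print(f"offline-style tps:   {offline_tps:.0f}")    # ~3167
print(f"generation-only tps: {generation_tps:.0f}")  # ~2133
```

Even with these short prompts the two numbers differ by roughly 1.5x, and the gap grows with longer prompts or shorter generations, so one cannot be compared directly against the other. For the offline case, I believe the relevant knob for the maximum batch size in vLLM is `max_num_seqs`, which is forwarded to the engine when constructing `LLM`.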
Thanks for your reply.
I know that offline benchmark tests and online serving are very different. I just wonder whether there is some problem or under-optimization in vLLM when running on the L40S, since it shows lower performance than expected.
Let me explain the tests I have done in more detail:
- I use a dataset with more than 9000 prompts, averaging about 62 words each
- Sampling parameters: {"max_tokens": 128, "temperature": 0, "ignore_eos": False}
- Model llama-2-7b from Hugging Face (here)
- Serving with the Triton Inference Server using the vLLM backend (here) and with the vLLM API server
- I calculate the tps from the total latency to finish all prompts (a rough sketch of this calculation is shown after the results below)
With an L40S GPU, I got the results below:
- Using Triton: ~500 tps
- Using the API server: ~1300 tps
- Running benchmark_throughput with the ShareGPT dataset shows ~3100 tps
With an A100 PCIe 40G GPU:
- Using Triton: ~1500 tps
- Using the API server: ~1600 tps
- Running benchmark_throughput with the ShareGPT dataset shows ~4800 tps
Despite having more memory and more compute power, the L40S shows lower performance than the A100.
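For reference, here is a minimal sketch of how this kind of end-to-end tps number can be computed with vLLM's offline `LLM` API and the sampling parameters above. The model id and prompt loading are placeholders, and the real tests went through Triton and the API server, so this is only an approximation of the measurement, not the actual scripts used:

```python
import time

from vllm import LLM, SamplingParams

# Placeholder prompts: the real dataset had ~9000 prompts of ~62 words each.
prompts = ["Summarize the following paragraph: ..."] * 1000

sampling_params = SamplingParams(max_tokens=128, temperature=0, ignore_eos=False)
llm = LLM(model="meta-llama/Llama-2-7b-hf")  # assumed Hugging Face model id

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Generation-only tps over the whole run (total latency to finish all prompts).
gen_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"generation-only tps: {gen_tokens / elapsed:.0f}")
```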