
Comments (4)

robertgshaw2-neuralmagic commented on July 22, 2024

I'm not sure what is going on with your Triton setup on the L40S. That amount of overhead for Triton vs. the api_server looks off.

Here are the specs for the L40S vs. the A100. Memory bandwidth is the most important metric for LLM inference, since the decode phase of generation is memory bound. As the table shows, the A100 has substantially more memory bandwidth than the L40S.

| GPU | FP16 FLOPs | Memory Bandwidth | Memory |
|-----------|------------|------------------|--------|
| L40S | 366 TFLOPs | ~850 GB/s | 48 GB |
| A100-40GB | 312 TFLOPs | ~1500 GB/s | 40 GB |
| A100-80GB | 312 TFLOPs | ~2000 GB/s | 80 GB |
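To put rough numbers on the memory-bound intuition, here is a back-of-the-envelope sketch (assumed figures and a hypothetical helper, not measurements from this thread): during decode, every generated token has to stream the full FP16 weights from HBM, so bandwidth divided by model size gives a crude per-sequence ceiling.

```python
# Crude ceiling on batch-1 decode throughput for a dense FP16 model:
# each generated token reads all weights once, so the bound is roughly
# bandwidth / model_size_in_bytes. Real throughput is lower (KV-cache
# reads, kernel overheads) and aggregate throughput grows with batching.

def decode_tok_per_s_ceiling(bandwidth_gb_s: float, params_billions: float,
                             bytes_per_param: int = 2) -> float:
    model_gb = params_billions * bytes_per_param  # e.g. 7B params * 2 bytes (FP16) = 14 GB
    return bandwidth_gb_s / model_gb

for name, bw in [("L40S", 850), ("A100-40GB", 1500), ("A100-80GB", 2000)]:
    print(f"{name}: ~{decode_tok_per_s_ceiling(bw, 7):.0f} tok/s ceiling for a 7B FP16 model")
# Roughly ~60 tok/s on the L40S vs. ~105-145 tok/s on the A100s, which tracks the gap.
```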

We definitely have more to do to optimize for Hopper and Ada Lovelace, but I do not think the results here are too surprising given the specs of the GPUs.


rucnyz commented on July 22, 2024

I noticed the same issue. I previously switched from an A100 to an L40S and saw a significant slowdown in vLLM. I'm surprised to see that someone has already run benchmark tests on this.


robertgshaw2-neuralmagic commented on July 22, 2024

A couple of things:

  • The `benchmark_throughput.py` script computes `(prompt_tokens + generation_tokens) / time`, so you are comparing apples and oranges here.
  • The shape of your workload has a big impact on the fraction of time spent in prefill vs. decode. Prefill can run at ~15k tokens/second on an A100 for a 7B model, whereas decode is much slower (~95 tok/sec at batch size 1, ~3000 tok/sec at batch size 64). So the more time is spent in decode, the lower your aggregate throughput will be. I see in `benchmark_throughput.py` that you are generating 128 tokens; I'm not sure which dataset you are using for `benchmark_serving.py`, but I have found in the past that ShareGPT averages ~225 output tokens. That is a huge difference in the prefill/decode mix, as the sketch after this list illustrates.
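To make the second point concrete, here is a minimal sketch (using the assumed prefill/decode rates quoted above, not measurements from this issue) of how output length shifts the aggregate `(prompt_tokens + generation_tokens) / time` number:

```python
# Assumed A100 rates for a 7B model, taken from the figures above.
PREFILL_TOK_PER_S = 15_000   # prefill throughput
DECODE_TOK_PER_S = 3_000     # decode throughput at a large batch size

def aggregate_throughput(prompt_tokens: int, gen_tokens: int) -> float:
    """Tokens/sec counting prompt + generated tokens, as benchmark_throughput.py does."""
    time_s = prompt_tokens / PREFILL_TOK_PER_S + gen_tokens / DECODE_TOK_PER_S
    return (prompt_tokens + gen_tokens) / time_s

# Same prompt volume, different output lengths (128 vs. ~225 tokens per request):
print(aggregate_throughput(prompt_tokens=1_000_000, gen_tokens=128 * 9_000))  # higher
print(aggregate_throughput(prompt_tokens=1_000_000, gen_tokens=225 * 9_000))  # noticeably lower
```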

So please make sure you are doing a like-for-like comparison.

For offline use cases, you might also consider increasing the max batch size for more throughput.
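For the offline case, here is a minimal sketch of raising the batch limit with vLLM's Python entry point (assuming `max_num_seqs` is forwarded to the engine arguments, as in recent vLLM releases; the model id and values are placeholders):

```python
from vllm import LLM, SamplingParams

# Allow more sequences to be scheduled per step for an offline throughput run.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder model id
    max_num_seqs=512,                  # raise the per-step sequence limit for batch throughput
)
params = SamplingParams(temperature=0, max_tokens=128, ignore_eos=False)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```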


warlock135 commented on July 22, 2024


Thanks for your reply.
I know offline benchmarking and online serving are very different. I just wonder whether there is some problem or missing optimization in vLLM when running on the L40S, since it shows lower performance than expected.

Let me explain more details about the test I have done:

  • I use a dataset with more than 9000 prompts, averaging about 62 words each.
  • Sampling parameters: {"max_tokens": 128, "temperature": 0, "ignore_eos": False}
  • Model: llama-2-7b from Hugging Face here
  • Serving via the Triton Inference Server with the vLLM backend here, and via the vLLM api server.
  • I calculate the tps from the total latency to finish all prompts (a rough client-side sketch follows this list).
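Roughly, the measurement looks like the sketch below (the endpoint and request fields assume vLLM's demo api_server; the Triton client differs, and with ignore_eos=False the generated-token count is only approximated as max_tokens per prompt):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

PROMPTS = ["..."] * 16   # placeholder for the ~9000-prompt dataset
MAX_TOKENS = 128

def generate(prompt: str) -> None:
    # Assumed request format for the demo api_server's /generate endpoint.
    requests.post(
        "http://localhost:8000/generate",
        json={"prompt": prompt, "max_tokens": MAX_TOKENS,
              "temperature": 0, "ignore_eos": False},
        timeout=300,
    )

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(generate, PROMPTS))
elapsed = time.perf_counter() - start
print(f"~{len(PROMPTS) * MAX_TOKENS / elapsed:.0f} generated tok/s")
```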

With an L40S GPU, I got the results below:

  • Using Triton: ~500 tps
  • Using the api server: ~1300 tps
  • Running benchmark_throughput with the ShareGPT dataset: ~3100 tps

With an A100 PCIe 40GB GPU:

  • Using Triton: ~1500 tps
  • Using the api server: ~1600 tps
  • Running benchmark_throughput with the ShareGPT dataset: ~4800 tps

Despite having more memory and more compute, the L40S shows lower performance than the A100.

