Comments (2)
APC improves TPOT because there is less overhead on the system from processing prefills.

vllm has a central `LLMEngine`, which runs a `step` each timestep. A `step` can be either a `prefill` or a `decode`. So a good mental model for vllm is that it is constantly running decodes for all active requests, with "pauses" to process prefills from new requests and add them to the batch.

As a result: `TPOT = n_tokens_generated * time_per_decode_step + n_prefills_processed * time_per_prefill_step`

Since APC reduces `time_per_prefill_step`, TPOT is reduced. Additionally, APC indirectly reduces `n_prefills_processed` and `time_per_decode_step`, because it reduces E2E latency and therefore (a) there are on average fewer prefills to process while a specific request is running and (b) the average batch size is lower.
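The decomposition above can be sketched numerically. The per-step timings below are made up purely for illustration (they are not vLLM measurements); the point is only that shrinking the prefill term shrinks total generation time while the decode term is unchanged:

```python
# Illustrative sketch of the latency decomposition in the comment above.
# All timings are hypothetical, chosen only to show the shape of the formula.

def generation_time(n_tokens_generated: int,
                    time_per_decode_step: float,
                    n_prefills_processed: int,
                    time_per_prefill_step: float) -> float:
    """Total generation time per the formula:
    n_tokens_generated * time_per_decode_step
    + n_prefills_processed * time_per_prefill_step"""
    return (n_tokens_generated * time_per_decode_step
            + n_prefills_processed * time_per_prefill_step)

# Without APC: each prefill "pause" costs the full prefill time.
without_apc = generation_time(100, 0.02, 10, 0.15)  # 2.0 s decode + 1.5 s prefill

# With APC: cached prefixes make each prefill step cheaper;
# the decode term is the same in this first-order model.
with_apc = generation_time(100, 0.02, 10, 0.05)     # 2.0 s decode + 0.5 s prefill

print(without_apc, with_apc)
```

The second-order effects the comment mentions (fewer concurrent prefills, smaller average batch) would additionally lower `n_prefills_processed` and `time_per_decode_step`, which this sketch holds fixed.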
@robertgshaw2-neuralmagic Thank you for your reply~
However, from nsys analysis I see that in the decode stage only the paged attention kernel's time is reduced; the other kernels' times are nearly identical. Is this expected? Why?
kernel name:
void vllm::paged_attention_v1_kernel<unsigned short, unsigned short, (int)128, (int)16, (int)128, (bool)0>(T1 *, const T1 *, const T2 *, const T2 *, int, float, const int *, const int *, int, const float *, int, int, int, float)
with APC vs without APC time cost:
252.511 us vs 382.047 us
13B model, 300 input_len / 20 output_len, batch size 100, on A100 with TP=1
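For reference, the quoted timings correspond to roughly a 1.5x speedup on the paged attention kernel alone. A quick check, using only the numbers reported above:

```python
# paged_attention_v1_kernel times quoted in the comment above (microseconds).
with_apc_us = 252.511
without_apc_us = 382.047

# Relative speedup of this one kernel when APC is enabled.
speedup = without_apc_us / with_apc_us
print(f"paged_attention_v1 speedup with APC: {speedup:.2f}x")
```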
Related Issues (20)
- [Bug]: internvl2-8b question causes an infinite loop HOT 2
- [Feature]: Why vllm cli not provide a config arg? HOT 4
- Create speculative decode dynamic parallel strategy HOT 1
- [Bug]: CUDA out of memory for llama3.1 70gb gptq, while in llama3 70gb gptq doesn't HOT 2
- [Feature]: continuous batching for vllm.LLM HOT 3
- [Bug]: Using LLM Engine to infer the MiniCPM-V-2_6 model, the result is wrong HOT 2
- [Bug]: vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already. HOT 2
- [Bug]: `gemma-2-27b-it-GGUF`: `Architecture gemma2 not supported` HOT 5
- [RFC]: Encoder/decoder models & feature compatibility HOT 3
- [Usage]: how to use LLM class with AsyncLLMEngine HOT 2
- [Installation]: git clone cutlass fails HOT 7
- [Misc]: Improving VLLM KVCACHE Transfer Efficiency with NCCL P2P Communication HOT 2
- [Feature]: Support block manager v2 for chunked prefill HOT 3
- [Bug]: Phi-3-vision: ERROR 08-09 11:41:40 async_llm_engine.py:56] RuntimeError: stack expects each tensor to be equal size, but got [1933, 4096] at entry 0 and [2509, 4096] at entry 1 HOT 14
- [Bug]: Tensor Parallel > 1 causes desc_act=True GPTQ models to give bad output on ROCm
- [Usage]: Getting empty text using llm.generate of mixtral-8X7b-Instruct AWQ model HOT 1
- [Bug]: prefill/prefix FP8 triton kernel for opt-125m - an illegal memory access was encountered
- [Performance]: vllm inference in CPU instance has generation < 10 tokens / second HOT 4
- [Bug]: LLaMa 3.1 8B/70B/405B all behave poorly and differently using completions API as compared to good chat API HOT 18
- [Bug]: some questions regarding the usage of NCCL allreduce/broadcast/allgather/send/recv in VLLM using pycomm and torch's distributed. HOT 4