Comments (2)
I noticed that in the flash-attn backend, forward_prefix and forward_decode seem to execute serially. Does forward_decode wait for forward_prefix to finish before running? If so, can this still take advantage of the performance benefit of chunked prefill, i.e. having the prefill tokens and the decode tokens in the same batch?
Yeah, right now it runs serially. I think after #4681 it should be possible to run both in the same attention kernel, but in our past internal benchmarks it didn't make much difference (we can definitely measure how much improvement it would bring now). That could be different today.
Note that this should be done after we re-revert #4820, because we would use the prefix kernel to run both phases in one attention kernel, and the existing prefix kernel is too slow (flash-attn varlen is at least 3x faster than that kernel).
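To illustrate why a single varlen-style kernel can cover both phases, here is a minimal NumPy sketch (an illustration only, not the vLLM or flash-attn implementation): sequences with different query lengths are packed into one flat batch and delimited by cumulative-length offsets, so a prefill chunk (many query tokens) and decode steps (one query token each, attending over a longer KV history) share one call.

```python
import numpy as np

def varlen_attention(q, k, v, cu_seqlens_q, cu_seqlens_k):
    """Naive reference of a varlen attention call: packed sequences are
    delimited by cumulative-length offsets, so prefill chunks (many query
    tokens) and decode steps (one query token) run in the same call."""
    out = np.empty_like(q)
    for i in range(len(cu_seqlens_q) - 1):
        qs, qe = cu_seqlens_q[i], cu_seqlens_q[i + 1]
        ks, ke = cu_seqlens_k[i], cu_seqlens_k[i + 1]
        n_q, n_k = qe - qs, ke - ks
        scores = q[qs:qe] @ k[ks:ke].T / np.sqrt(q.shape[-1])
        # Causal mask aligned to the *end* of the key sequence: query j may
        # attend to keys up to position (n_k - n_q) + j. For a decode step
        # (n_q == 1) nothing is masked; for prefill it is the usual triangle.
        mask = np.triu(np.ones((n_q, n_k), dtype=bool), k=n_k - n_q + 1)
        scores[mask] = -np.inf
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[qs:qe] = w @ v[ks:ke]
    return out

rng = np.random.default_rng(0)
d = 4
# One prefill request (6 new query tokens over 6 keys) packed together with
# two decode requests (1 query token each, over KV histories of 9 and 5).
q_lens = [6, 1, 1]
k_lens = [6, 9, 5]
cu_q = np.concatenate([[0], np.cumsum(q_lens)])
cu_k = np.concatenate([[0], np.cumsum(k_lens)])
q = rng.standard_normal((cu_q[-1], d))
k = rng.standard_normal((cu_k[-1], d))
v = rng.standard_normal((cu_k[-1], d))
out = varlen_attention(q, k, v, cu_q, cu_k)
print(out.shape)  # (8, 4): prefill and decode handled in one call
```

The cu_seqlens bookkeeping is the same idea flash-attn's varlen interface uses; the real kernel fuses this loop on the GPU instead of iterating per sequence.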
Is there an Issue/PR for the "re-revert of #4820" that we can track?