Comments (5)
By the way, it would be great to measure the cost coefficient in vLLM and report it in the metrics. I can direct you to the relevant code sections if that interests you.
from vllm.
Speculative decoding might be slower than non-speculative if the drafter model is too slow or inaccurate. Using a simple simulation, we get the following heatmap:
(SI over non-SI. Pink marks slowdowns)
The distributed variation of the algorithm avoids slowdowns. It is always faster than the non-distributed version:
(DSI speedups over SI)
For a drafter latency of 68% (as you mentioned for LLaMa 1.1B), DSI offers up to 1.8x speedup compared to SI. The heatmap also shows that DSI’s speedup increases as the acceptance rate decreases.
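For context, the kind of simple model behind such a heatmap can be sketched in a few lines. The sketch below follows the standard expected-speedup analysis for (non-distributed) speculative decoding, not DSI; the names `alpha` (per-token acceptance rate), `c` (cost coefficient), and `k` (draft length) are my own labels, not vLLM parameters.

```python
def expected_speedup(alpha: float, c: float, k: int) -> float:
    """Expected speedup of speculative decoding over plain autoregressive
    decoding, under the standard analysis.

    alpha: per-token acceptance rate of the drafter (0 < alpha < 1)
    c:     cost coefficient = drafter forward time / target forward time
    k:     number of draft tokens proposed per verification step
    """
    # Expected tokens produced per verification step: the accepted draft
    # tokens plus the one "bonus" token from the target's forward pass.
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    # Wall-clock cost of one step, in units of a target forward pass:
    # k drafter passes plus one target pass.
    cost = k * c + 1
    return expected_tokens / cost

# A slow or inaccurate drafter yields a slowdown (speedup < 1),
# matching the pink region of the heatmap:
print(expected_speedup(alpha=0.3, c=0.68, k=5))  # ~0.32
# A fast, accurate drafter yields a genuine speedup:
print(expected_speedup(alpha=0.8, c=0.05, k=5))  # ~2.95
```

Sweeping `alpha` and `c` over a grid of values reproduces a heatmap of this shape.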
DSI is not yet supported in vLLM though. 🥲
@bong-furiosa the cost coefficient is more of a user-facing configuration, meaning you could use it as a criterion when selecting the draft model. vLLM just executes whatever you configure; it cannot select the draft model for you, after all.
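To illustrate using the cost coefficient as a selection criterion: under the standard expected-speedup model for speculative decoding, one can solve for the break-even cost coefficient of a candidate drafter. This is a hypothetical helper of my own, not a vLLM API; `alpha` is the per-token acceptance rate and `k` the draft length.

```python
def max_viable_cost_coefficient(alpha: float, k: int) -> float:
    """Largest drafter cost coefficient c at which speculative decoding
    still breaks even, i.e. solves
        ((1 - alpha**(k+1)) / (1 - alpha)) / (k*c + 1) = 1
    for c. A candidate draft model with a larger measured c than this
    is expected to be a net slowdown.
    """
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    return (expected_tokens - 1) / k

# E.g. with a 60% acceptance rate and 5 draft tokens, any drafter slower
# than ~0.28x the target per forward pass is expected to be a net loss:
print(round(max_viable_cost_coefficient(alpha=0.6, k=5), 2))  # 0.28
```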
As for your measurement: although it doesn't include other system overheads such as scheduling and sampling, I'd say it's generally OK, because @cadedaniel is leading community efforts to reduce such overheads as much as possible. Please keep an eye on related issues and PRs, and you should see clear improvements over time.
Hello @keyboardAnt! I have read your DSI paper before (and recently DISCO as well). At the time, I didn't pay close attention to it because I wasn't considering Speculative Decoding. I'm happy to be reminded of your paper and to meet the author.
The table you provided from the paper seems like it will be a great reference when using Speculative Decoding in vLLM!
@comaniac, Thank you for understanding my interest in the Cost Coefficient values for Speculative Decoding in vLLM!
Indeed, through PRs and issues, I have observed that many experts, including @cadedaniel, are making efforts to reduce serving overhead during Speculative Decoding. I look forward to seeing further improvements in vLLM.
Since I have received excellent responses to this issue, I will close it.
Hello @cadedaniel !
Thank you for understanding my interest in the Cost Coefficient impact on the vLLM serving system. 🙇
I have reached out to you via email regarding this concern.
I would appreciate it if you could check it.
Related Issues (20)
- [Bug]: deploy on V100, mma -> mma layout conversion is only supported on Ampere HOT 2
- [Bug]: RuntimeError: CUDA error: an illegal memory access was encountered HOT 1
- [Bug]: I get one word inconsistent responses using the v0.5.4 with LL HOT 1
- [Installation]: LGPL license in dependencies HOT 6
- [Usage]: How can I determine the maximum number of concurrent requests? HOT 1
- [Misc]: Question about Serving with Server API HOT 7
- [Feature]: Contribute T5 model to vLLM HOT 1
- [Performance]: Sampler is too slow? HOT 1
- [Installation]: Issues with installing vLLM on ROCM without sudo access HOT 1
- [Bug]: Inconsistent generation with guided_json, speculative decoding and temp > 0.0
- [Bug]: flakey test found in #7874
- [Usage]: Bad Request with multiple multimodal inputs when using vision LLM. HOT 1
- [Bug]: vllm0.4.3 guided decoding HOT 1
- [Bug]: vLLM hang at nccl step when trying to use multiple GPUs HOT 1
- [Usage]: Using TPU example with InternVL2 Model HOT 5
- [Bug]: TPU InternVL2 Model Error Graph break due to unsupported builtin _XLAC.PyCapsule._xla_get_replication_devices_count HOT 3
- [Feature]: Beam Search with Temperature > 0 HOT 2
- [Bug]: ValueError: could not broadcast input array from shape (513,) into shape (512,) HOT 7
- [Usage]: Does VLLM support starting multiple cards using mpirun? Want to bind different CPUs to each card. HOT 1
- [New Model]: FM9GForCausalLM HOT 1