Comments (5)
By the way, it would be great to measure the cost coefficient in vLLM and report it in the metrics. I can direct you to the relevant code sections if that interests you.
from vllm.
Speculative decoding might be slower than non-speculative if the drafter model is too slow or inaccurate. Using a simple simulation, we get the following heatmap:
(SI over non-SI. Pink marks slowdowns)
The distributed variation of the algorithm avoids slowdowns. It is always faster than the non-distributed version:
(DSI speedups over SI)
For a drafter latency of 68% (as you mentioned for LLaMa 1.1B), DSI offers up to 1.8x speedup compared to SI. The heatmap also shows that DSI’s speedup increases as the acceptance rate decreases.
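For context, the kind of simple model behind such a heatmap can be sketched in a few lines. The sketch below follows the standard expected-speedup analysis for (non-distributed) speculative decoding, not DSI; the names `alpha` (per-token acceptance rate), `c` (cost coefficient), and `k` (draft length) are my own labels, not vLLM parameters.

```python
def expected_speedup(alpha: float, c: float, k: int) -> float:
    """Expected speedup of speculative decoding over plain autoregressive
    decoding, under the standard analysis.

    alpha: per-token acceptance rate of the drafter (0 < alpha < 1)
    c:     cost coefficient = drafter forward time / target forward time
    k:     number of draft tokens proposed per verification step
    """
    # Expected tokens produced per verification step: the accepted draft
    # tokens plus the one "bonus" token from the target's forward pass.
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    # Wall-clock cost of one step, in units of a target forward pass:
    # k drafter passes plus one target pass.
    cost = k * c + 1
    return expected_tokens / cost

# A slow or inaccurate drafter yields a slowdown (speedup < 1),
# matching the pink region of the heatmap:
print(expected_speedup(alpha=0.3, c=0.68, k=5))  # ~0.32
# A fast, accurate drafter yields a genuine speedup:
print(expected_speedup(alpha=0.8, c=0.05, k=5))  # ~2.95
```

Sweeping `alpha` and `c` over a grid of values reproduces a heatmap of this shape.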
DSI is not yet supported in vLLM though. 🥲
@bong-furiosa the cost coefficient is more of a user-facing configuration, meaning you could use it as a criterion when selecting the draft model. vLLM just executes whatever you configure; it cannot select the draft model for you, after all.
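To illustrate using the cost coefficient as a selection criterion: under the standard expected-speedup model for speculative decoding, one can solve for the break-even cost coefficient of a candidate drafter. This is a hypothetical helper of my own, not a vLLM API; `alpha` is the per-token acceptance rate and `k` the draft length.

```python
def max_viable_cost_coefficient(alpha: float, k: int) -> float:
    """Largest drafter cost coefficient c at which speculative decoding
    still breaks even, i.e. solves
        ((1 - alpha**(k+1)) / (1 - alpha)) / (k*c + 1) = 1
    for c. A candidate draft model with a larger measured c than this
    is expected to be a net slowdown.
    """
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    return (expected_tokens - 1) / k

# E.g. with a 60% acceptance rate and 5 draft tokens, any drafter slower
# than ~0.28x the target per forward pass is expected to be a net loss:
print(round(max_viable_cost_coefficient(alpha=0.6, k=5), 2))  # 0.28
```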
As for your measurement: although it doesn't include other system overheads such as scheduling and sampling, I'd say it's generally OK, because @cadedaniel is leading community efforts to reduce such overheads as much as possible. Please keep an eye on related issues and PRs, and you should see clear improvements over time.
Hello @keyboardAnt! I have read your DSI paper before (and recently DISCO as well). At the time, I didn't pay close attention to it because I wasn't considering Speculative Decoding. I'm happy to be reminded of your paper and to meet the author.
The table you provided from the paper seems like it will be a great reference when using Speculative Decoding in vLLM!
@comaniac, Thank you for understanding my interest in the Cost Coefficient values for Speculative Decoding in vLLM!
Indeed, through PRs and issues, I have observed that many experts, including @cadedaniel, are making efforts to reduce serving overhead during Speculative Decoding. I look forward to seeing further improvements in vLLM.
Since I have received excellent responses to this issue, I will close it.
Hello @cadedaniel !
Thank you for understanding my interest in the Cost Coefficient impact on the vLLM serving system. 🙇
I have reached out to you via email regarding this concern.
I would appreciate it if you could check it.
Related Issues (20)
- [Bug]: deploy on V100, mma -> mma layout conversion is only supported on Ampere HOT 2
- [Bug]: RuntimeError: CUDA error: an illegal memory access was encountered HOT 1
- [Bug]: I get one word inconsistent responses using the v0.5.4 with LL HOT 1
- [Installation]: LGPL license in dependencies HOT 6
- [Usage]: How can I determine the maximum number of concurrent requests? HOT 1
- [Misc]: Question about Serving with Server API HOT 7
- [Feature]: Contribute T5 model to vLLM HOT 1
- [Performance]: Sampler is too slow? HOT 1
- [Installation]: Issues with installing vLLM on ROCM without sudo access HOT 1
- [Bug]: Inconsistent generation with guided_json, speculative decoding and temp > 0.0
- [Bug]: flakey test found in #7874
- [Usage]: Bad Request with multiple multimodal inputs when using vision LLM. HOT 1
- [Bug]: vllm0.4.3 guided decoding HOT 1
- [Bug]: vLLM hang at nccl step when trying to use multiple GPUs HOT 1
- [Usage]: Using TPU example with InternVL2 Model HOT 5
- [Bug]: TPU InternVL2 Model Error Graph break due to unsupported builtin _XLAC.PyCapsule._xla_get_replication_devices_count HOT 3
- [Feature]: Beam Search with Temperature > 0 HOT 2
- [Bug]: ValueError: could not broadcast input array from shape (513,) into shape (512,) HOT 7
- [Usage]: Does VLLM support starting multiple cards using mpirun? Want to bind different CPUs to each card. HOT 1
- [New Model]: FM9GForCausalLM HOT 1