Comments (5)

cadedaniel commented on September 18, 2024 2

By the way, it would be great to measure the cost coefficient in vLLM and report it in the metrics. I can direct you to the relevant code sections if that interests you.
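As a rough illustration of what such a measurement could look like, here is a minimal sketch that estimates the cost coefficient as the ratio of the drafter's forward latency to the target model's forward latency. It is not vLLM code: `median_latency`, `cost_coefficient`, `draft_step`, and `target_step` are hypothetical names, and the two callables stand in for "run one forward pass" of each model.

```python
import time
from statistics import median
from typing import Callable


def median_latency(step: Callable[[], object], warmup: int = 3, iters: int = 20) -> float:
    """Return the median wall-clock time (seconds) of one call to `step`."""
    # Warm-up calls are discarded so one-time setup costs do not skew the result.
    for _ in range(warmup):
        step()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        step()
        samples.append(time.perf_counter() - start)
    return median(samples)


def cost_coefficient(
    draft_step: Callable[[], object],
    target_step: Callable[[], object],
    warmup: int = 3,
    iters: int = 20,
) -> float:
    """Estimate c = (drafter forward latency) / (target forward latency).

    Both callables are hypothetical placeholders. On a GPU, each callable
    would need to synchronize the device before returning, because kernels
    are launched asynchronously and the timer would otherwise stop too early.
    """
    draft = median_latency(draft_step, warmup, iters)
    target = median_latency(target_step, warmup, iters)
    return draft / target


if __name__ == "__main__":
    # Stand-ins for real models: the "drafter" is much cheaper than the "target".
    c = cost_coefficient(lambda: time.sleep(0.001), lambda: time.sleep(0.02))
    print(f"estimated cost coefficient: {c:.2f}")
```

A measurement like this covers only model forward time; overheads such as scheduling and sampling would have to be timed separately.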

from vllm.

keyboardAnt commented on September 18, 2024 1

Speculative decoding can be slower than non-speculative decoding if the drafter model is too slow or too inaccurate. Using a simple simulation, we get the following heatmap:
image
(SI over non-SI. Pink marks slowdowns)
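To make the slowdown region concrete, here is a minimal sketch using the standard closed-form estimate from the speculative decoding literature (Leviathan et al., 2023): expected speedup = (1 - α^(γ+1)) / ((1 - α)(cγ + 1)). This is not the simulation behind the heatmap. It assumes each draft token is accepted independently with probability `alpha`, and it ignores system overheads. The names `expected_speedup`, `alpha`, `c`, and `gamma` are chosen here for illustration.

```python
def expected_speedup(alpha: float, c: float, gamma: int) -> float:
    """Expected wall-time speedup of speculative over non-speculative decoding.

    alpha: acceptance rate of draft tokens, in [0, 1)
    c:     cost coefficient (drafter forward latency / target forward latency)
    gamma: number of draft tokens proposed per iteration (lookahead)
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if c < 0.0 or gamma < 1:
        raise ValueError("c must be non-negative and gamma must be at least 1")
    # Expected number of tokens produced per iteration (accepted drafts plus
    # the one token the target model always contributes).
    expected_tokens = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    # Cost of one iteration in units of target forward passes:
    # gamma drafter passes plus one target verification pass.
    iteration_cost = c * gamma + 1.0
    return expected_tokens / iteration_cost


if __name__ == "__main__":
    # Fast, accurate drafter: a clear speedup.
    print(expected_speedup(alpha=0.8, c=0.1, gamma=5))
    # Slow drafter (68% of target latency) with a 50% acceptance rate: below 1.
    print(expected_speedup(alpha=0.5, c=0.68, gamma=5))
```

A value below 1 corresponds to a slowdown region in a heatmap like the one above: the drafter costs more than the tokens it saves.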

The distributed variation of the algorithm avoids slowdowns. It is always faster than the non-distributed version:
image
(DSI speedups over SI)

For a drafter latency of 68% (as you mentioned for LLaMa 1.1B), DSI offers up to 1.8x speedup compared to SI. The heatmap also shows that DSI’s speedup increases as the acceptance rate decreases.

DSI is not yet supported in vLLM though. 🥲

comaniac commented on September 18, 2024 1

@bong-furiosa the cost coefficient is more of a user-side configuration choice, meaning that you can use it as a criterion when selecting the draft model. vLLM just executes whatever you configured, because it cannot select the draft model for you, after all.

As for your measurement, although it doesn't include other system overheads such as scheduling and sampling, I'd say it's generally OK, because @cadedaniel is leading community efforts to reduce those overheads as much as possible. Please keep an eye on related issues and PRs; you should see clear improvements over time.

bong-furiosa commented on September 18, 2024 1

Hello @keyboardAnt! I have read your DSI paper before (and also DISCO, recently). At that time, I didn't pay close attention to it because I wasn't considering Speculative Decoding. I'm happy to be reminded of your paper and to meet the author.
The table you provided from the paper looks like it will be a great reference when using Speculative Decoding in vLLM!

@comaniac, thank you for understanding my interest in the Cost Coefficient values for Speculative Decoding in vLLM!
Indeed, through PRs and issues, I have observed that many experts, including @cadedaniel, are working to reduce serving overhead during Speculative Decoding. I look forward to seeing further improvements in vLLM.

Since I have received excellent responses to this issue, I will close it.

bong-furiosa commented on September 18, 2024

Hello @cadedaniel!
Thank you for understanding my interest in the Cost Coefficient impact on the vLLM serving system. 🙇
I have reached out to you via email regarding this concern.
I would appreciate it if you could check it.
