🚀 The feature, motivation and pitch Speculative decoding can achi


[Feature] [Spec decode]: Combine chunked prefill with speculative decoding

cadedaniel commented on July 22, 2024

[Feature] [Spec decode]: Combine chunked prefill with speculative decoding


Comments (2)

Dbxwz commented on July 22, 2024

Hello, you mentioned optimizations for scoring time in #4630

P1 (Large) Replace CPU-based batch expansion with multi-query attention kernel call

I think the multi-query attention kernel mentioned here is not the same thing as MQA; it is closer to the append stage in flashinfer. Am I right?
I also noticed that the append computation is similar to a single step of chunked prefill. So I used chunked prefill to implement an AppendTop1Scorer, which gets a 10% speedup compared to BatchExpansionTop1Scorer. It's a dirty solution, since I create a new SequenceGroupMetadata that turns the scoring sequence into a chunked prefill sequence. This implementation conflicts with recompute and chunked prefill.

So the ideal implementation would have ModelRunner and the Backend support the append stage directly. A Backend that supports chunked prefill should already be able to support it.
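To make the comparison concrete, here is a minimal pure-Python sketch of why an append step (one chunked-prefill-style call with a causal mask over the new tokens) should give the same attention outputs as batch expansion (one single-query sequence per speculative token). It is a toy, not vLLM or flashinfer code: `attend`, `score_batch_expansion`, and `score_append` are hypothetical names, and a single attention head over plain lists is assumed.

```python
import math
import random


def attend(q, keys, values):
    """Softmax attention for one query vector over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def score_batch_expansion(cache_k, cache_v, new_q, new_k, new_v):
    """Batch expansion: token i becomes its own single-query sequence
    whose context is the cached prefix plus new tokens 0..i."""
    outputs = []
    for i, q in enumerate(new_q):
        keys = cache_k + new_k[: i + 1]      # per-sequence copy of the prefix
        values = cache_v + new_v[: i + 1]
        outputs.append(attend(q, keys, values))
    return outputs


def score_append(cache_k, cache_v, new_q, new_k, new_v):
    """Append step: new tokens are appended to the cache once, and every
    query is handled in one pass under a causal mask."""
    keys = cache_k + new_k
    values = cache_v + new_v
    n_cached = len(cache_k)
    outputs = []
    for i, q in enumerate(new_q):
        visible = n_cached + i + 1           # causal mask: cache + tokens 0..i
        outputs.append(attend(q, keys[:visible], values[:visible]))
    return outputs


def random_vectors(rng, count, dim):
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
cache_k, cache_v = random_vectors(rng, 6, 4), random_vectors(rng, 6, 4)
new_q, new_k, new_v = (random_vectors(rng, 3, 4) for _ in range(3))

expanded = score_batch_expansion(cache_k, cache_v, new_q, new_k, new_v)
appended = score_append(cache_k, cache_v, new_q, new_k, new_v)
max_diff = max(abs(a - b) for ra, rb in zip(expanded, appended)
               for a, b in zip(ra, rb))
print("max abs difference:", max_diff)
```

The only difference between the two paths in this sketch is bookkeeping: batch expansion builds a separate context per speculative token, while the append path reuses one cache and relies on the mask.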

In addition, is this issue about solving the scheduling problem of speculative decoding? Could you describe in more detail what needs to be done in this issue?


cadedaniel commented on July 22, 2024

That's awesome. You should chat with @LiuXiaoxuanPKU who is removing batch expansion from vLLM.

FYI this issue is about combining the ITL improvements obtained from chunked prefill scheduling with spec decode.
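As a rough illustration of what combining the two could look like at the scheduling level, the sketch below splits a per-step token budget between decoding requests and prefill chunks. It is not vLLM's scheduler. `Request`, `plan_step`, `token_budget`, and `num_spec_tokens` are hypothetical names, and it assumes each decoding request needs `num_spec_tokens + 1` query positions when the target model scores its draft tokens.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    remaining_prefill: int  # prompt tokens still to process; 0 means decoding


def plan_step(requests, token_budget, num_spec_tokens):
    """Return {request name: tokens scheduled this step}.

    Decoding requests are served first so their latency stays bounded;
    each one is assumed to take num_spec_tokens + 1 slots for scoring.
    The leftover budget is given to prefills in chunks.
    """
    plan = {}
    budget = token_budget
    per_decode = num_spec_tokens + 1
    for req in requests:
        if req.remaining_prefill == 0 and budget >= per_decode:
            plan[req.name] = per_decode
            budget -= per_decode
    for req in requests:
        if req.remaining_prefill > 0 and budget > 0:
            chunk = min(req.remaining_prefill, budget)
            plan[req.name] = chunk
            budget -= chunk
    return plan


requests = [Request("a", 0), Request("b", 0), Request("c", 1000)]
plan = plan_step(requests, token_budget=64, num_spec_tokens=4)
print(plan)
```

In this toy, speculative tokens eat into the budget that would otherwise go to prefill chunks, so a larger `num_spec_tokens` leaves smaller chunks per step.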
