Comments (14)
Does the sequence length plus proposal length go over the max model length?
That was our suspicion as well, so we set --speculative-max-model-len lower than max-model-len minus num-speculative-tokens, but that doesn't seem to stop the issue:
--max-model-len 16384 \
--speculative-max-model-len 16000 \
--speculative-model [ngram] \
--num-speculative-tokens 128 \
--ngram-prompt-lookup-max 32 \
--ngram-prompt-lookup-min 16 \
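As a quick arithmetic check (plain Python; constants copied from the flags above, function name is illustrative), the speculative cap should in principle leave headroom:

```python
# Illustrative check, not vLLM code: can a sequence plus a full
# speculative proposal exceed the model's maximum length?
MAX_MODEL_LEN = 16384               # --max-model-len
SPECULATIVE_MAX_MODEL_LEN = 16000   # --speculative-max-model-len
NUM_SPECULATIVE_TOKENS = 128        # --num-speculative-tokens

def may_overflow(seq_len: int) -> bool:
    """True if seq_len plus a full proposal would exceed max-model-len."""
    return seq_len + NUM_SPECULATIVE_TOKENS > MAX_MODEL_LEN

# Worst case under these flags: 16000 + 128 = 16128 <= 16384.
print(may_overflow(SPECULATIVE_MAX_MODEL_LEN))  # False
```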
from vllm.
yep.
@njhill we ran into this error with 0.4.3 originally before we tried upgrading to 0.5.0.
Thanks for creating the issue. Two questions:
- Does the problem still occur if prefix caching is disabled?
- Does the problem still occur if cuda graphs are disabled?
@cadedaniel Thanks for the quick response!
- I tried without --enable-prefix-caching, and it eventually ran into the same error.
- Then I tried without --enable-prefix-caching and enabled --enforce-eager. This didn't error on the set of queries I ran.
Thanks for trying those out so fast :)
OK, the issue is very likely caused by CUDA graphs + batch expansion. This should be fixed, but since spec decode performance isn't good yet, the fix won't be prioritized until that improves.
FYI @LiuXiaoxuanPKU another issue with batch expansion + cuda graph
Do you recommend just using --enforce-eager until this is fixed?
If you are blocked by this issue, the fix shouldn't be very hard. I think we simply need to configure the CUDA graph max size to include the expanded batch size.
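As a sketch of that idea (illustrative only, not vLLM's actual code): with batch expansion, each of the B sequences in a batch is scored as k + 1 separate queries (k speculative tokens plus one bonus token), so the CUDA graph capture size would need to cover the expanded count rather than the raw batch size:

```python
# Illustrative sketch of the proposed fix, not vLLM's actual code:
# under batch expansion, each sequence becomes (k + 1) scoring
# queries, so CUDA graphs must be captured for the expanded size.
def expanded_batch_size(batch_size: int, num_speculative_tokens: int) -> int:
    """Number of scoring queries after batch expansion."""
    return batch_size * (num_speculative_tokens + 1)

# A raw batch of 8 with 128 speculative tokens expands to 1032 queries.
print(expanded_batch_size(8, 128))  # 1032
```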
The code that is breaking is:

    if use_captured_graph:
        # The shape of graph_block_tables is
        # [max batch size, max context len // block size].
        input_block_tables = self.graph_block_tables[:batch_size]
        for i, block_table in enumerate(block_tables):
            if block_table:
                input_block_tables[i, :len(block_table)] = block_table
The issue is that len(block_table) > input_block_tables.shape[1], where the second dimension corresponds to max context len // block size. Am I misunderstanding how this is a batch-size issue rather than a context-length issue?
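The failure mode described above can be reproduced in isolation with NumPy (a minimal sketch with made-up shapes, not vLLM's code):

```python
import numpy as np

# Minimal reproduction of the broadcast failure: writing a block table
# longer than the preallocated second dimension of graph_block_tables.
max_batch_size, max_blocks = 4, 8     # max_blocks ~ max context len // block size
graph_block_tables = np.zeros((max_batch_size, max_blocks), dtype=np.int32)

block_table = list(range(10))         # 10 blocks > max_blocks
input_block_tables = graph_block_tables[:1]
try:
    input_block_tables[0, :len(block_table)] = block_table
except ValueError as exc:
    # NumPy clips the slice to length 8, so 10 values cannot broadcast.
    print(f"ValueError: {exc}")
```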
Good point. I wonder why this is specific to spec decode, then.
Does the sequence length plus proposal length go over the max model length?
@Adhyyan1252 could you see if you also get this error with vLLM 0.4.3?
Try adding the parameter --max-seq-len-to-capture set equal to max_model_len.
Still the same issue
Still the same issue.