Comments (14)

Adhyyan1252 commented on August 26, 2024 2

Does the sequence length plus proposal length go over the max model length?

That was our suspicion as well, so we set speculative-max-model-len below max-model-len minus num-speculative-tokens, but that doesn't seem to prevent the issue.

--max-model-len 16384 \
--speculative-max-model-len 16000 \
--speculative-model [ngram] \
--num-speculative-tokens 128 \
--ngram-prompt-lookup-max 32 \
--ngram-prompt-lookup-min 16 \
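
For reference, the headroom these flags leave can be checked with a few lines of arithmetic. This is only a sketch: the variable names mirror the flags above, and it assumes the worst case is a sequence sitting right at speculative-max-model-len that then receives a full proposal.

```python
# Values copied from the flags above.
max_model_len = 16384
speculative_max_model_len = 16000
num_speculative_tokens = 128

# Assumption: the longest sequence that still gets speculated on is
# speculative_max_model_len tokens, and it can grow by one full proposal.
worst_case_len = speculative_max_model_len + num_speculative_tokens
headroom = max_model_len - worst_case_len

print(f"worst case length: {worst_case_len}, headroom: {headroom}")
```

So under that assumption the proposals alone should not push a sequence past max-model-len with this configuration.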

from vllm.

cadedaniel avatar cadedaniel commented on August 26, 2024 1

yep.

jeffreyling commented on August 26, 2024 1

@njhill we ran into this error with 0.4.3 originally before we tried upgrading to 0.5.0.

cadedaniel commented on August 26, 2024

Thanks for creating the issue. Two questions:

  1. Does the problem still occur if prefix caching is disabled?
  2. Does the problem still occur if cuda graphs are disabled?

jeffreyling commented on August 26, 2024

@cadedaniel Thanks for the quick response!

  1. I tried without --enable-prefix-caching and it eventually ran into the same error.
  2. Then I tried without --enable-prefix-caching, and enabled --enforce-eager. This didn't error on the set of queries I ran.

cadedaniel commented on August 26, 2024

Thanks for trying those out so fast :)

OK the issue is very likely caused by CUDA graphs + batch expansion. This should be fixed, but currently since spec decode performance isn't good, it won't be prioritized until after that.

FYI @LiuXiaoxuanPKU another issue with batch expansion + cuda graph

Adhyyan1252 commented on August 26, 2024

Do you recommend just using --enforce-eager until this is fixed?

cadedaniel commented on August 26, 2024

If you are blocked by this issue, the fix shouldn't be very hard. I think we simply need to configure the cuda graph max size to include the expanded batch size.

Adhyyan1252 commented on August 26, 2024

The code that is breaking is:

if use_captured_graph:
    # The shape of graph_block_tables is
    # [max batch size, max context len // block size].
    input_block_tables = self.graph_block_tables[:batch_size]
    for i, block_table in enumerate(block_tables):
        if block_table:
            input_block_tables[i, :len(block_table)] = block_table

The issue is that len(block_table) > input_block_tables.shape[1], and the second dimension corresponds to max context len // block size. Am I misunderstanding how this is a batch-size issue rather than a context-length issue?
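
To make the width check concrete, here is a minimal sketch of the condition described above. The helper names (blocks_needed, fits_graph_table) are hypothetical and not part of vLLM, the block size of 16 is an assumption, and max_context_len stands for whatever cap is used to size the second dimension of graph_block_tables.

```python
def blocks_needed(seq_len: int, block_size: int) -> int:
    """Number of KV-cache blocks a sequence of seq_len tokens occupies."""
    return -(-seq_len // block_size)  # ceiling division


def fits_graph_table(seq_len: int, max_context_len: int, block_size: int) -> bool:
    """True if the block table fits the preallocated row width.

    The width follows the comment in the snippet above:
    max context len // block size.
    """
    table_width = max_context_len // block_size
    return blocks_needed(seq_len, block_size) <= table_width


# Assumed values: block size 16, cap equal to --max-model-len.
print(fits_graph_table(16384, 16384, 16))  # exactly at the cap
print(fits_graph_table(16385, 16384, 16))  # one token over the cap
```

A single token over the cap needs one more block than a row can hold, which matches len(block_table) > input_block_tables.shape[1] and is independent of the batch size.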

cadedaniel commented on August 26, 2024

Good point. I wonder why this is specific to spec decode, then.

Does the sequence length plus proposal length go over the max model length?

njhill commented on August 26, 2024

@Adhyyan1252 could you see if you also get this error with vLLM 0.4.3?

Ximingwang-09 commented on August 26, 2024

Try adding the parameter --max-seq-len-to-capture, set equal to max_model_len.

Adhyyan1252 commented on August 26, 2024

Still the same issue

hitcoogle commented on August 26, 2024

Still the same issue.
