[Bug]: Enabling Prefix-Caching doesn't speed up inference (CLOSED)

yangelaboy commented on July 23, 2024

[Bug]: Enabling Prefix-Caching doesn't speed up inference

Comments (2)

youkaichao commented on July 23, 2024

cc @zhuohan123

yangelaboy commented on July 23, 2024

We ran many tests with Qwen1.5-4B, using 51 input tokens and 11 output tokens, on an RTX 3090. The results are:
1) Below 50 QPS, prefix caching doesn't speed up the prefill phase.
2) From 50 QPS to 100 QPS, first-token latency increases from 90 ms to 360 ms without prefix caching, but only slightly, from 90 ms to 110 ms, with prefix caching.
3) For GPU utilization, there is a 20%-50% benefit from 50 QPS to 100 QPS.
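
To put the latency figures in item 2 side by side, here is a small sketch that only does arithmetic on the numbers reported above. The dictionary layout and the helper names `growth` and `reduction_at_100qps` are made up for illustration; nothing here calls vLLM or reproduces the benchmark.

```python
# First-token latencies in milliseconds, copied from the results above.
ttft_ms = {
    "no_prefix_caching": {"50qps": 90, "100qps": 360},
    "prefix_caching": {"50qps": 90, "100qps": 110},
}


def growth(latency):
    """Ratio of first-token latency at 100 QPS to that at 50 QPS."""
    return latency["100qps"] / latency["50qps"]


def reduction_at_100qps(table):
    """Fraction of first-token latency saved by prefix caching at 100 QPS."""
    base = table["no_prefix_caching"]["100qps"]
    cached = table["prefix_caching"]["100qps"]
    return (base - cached) / base


for name, latency in ttft_ms.items():
    print(f"{name}: {growth(latency):.2f}x growth from 50 to 100 QPS")
print(f"reduction at 100 QPS: {reduction_at_100qps(ttft_ms):.0%}")
```

On these numbers, first-token latency grows 4x without prefix caching and about 1.22x with it, which is roughly a 69% reduction at 100 QPS. That fits the observation that the benefit only shows up once the server is under load.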
