Anything you want to discuss about vllm. Please excuse the naive q

[Misc]: PagedAttention + cudagraphs (closed)

jeromeku commented on July 4, 2024


Comments (2)

simon-mo commented on July 4, 2024

Yes. Although the KV cache for a particular request is dynamically allocated, vLLM allocates a contiguous memory space for the entire KV cache ahead of time.
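The idea in the answer above can be sketched as a toy block pool: one buffer is reserved up front and never reallocated, while each request only gains or loses indices into it. This is a minimal illustration, not vLLM's implementation; `KVBlockPool`, its methods, and the one-byte-per-token layout are hypothetical names and simplifications.

```python
from typing import Dict, List


class KVBlockPool:
    """Toy KV cache pool (hypothetical): reserved once, carved into fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size  # token slots per block
        # Single up-front allocation (one byte per token slot for simplicity).
        # This buffer is never resized or replaced afterwards.
        self.storage = bytearray(num_blocks * block_size)
        self.free_blocks: List[int] = list(range(num_blocks))
        # Block table: request id -> indices of the blocks it currently owns.
        self.block_tables: Dict[str, List[int]] = {}

    def reserve(self, request_id: str, total_tokens: int) -> List[int]:
        """Grow a request's block table until it can hold `total_tokens` tokens."""
        table = self.block_tables.setdefault(request_id, [])
        needed = -(-total_tokens // self.block_size)  # ceiling division
        while len(table) < needed:
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop())
        return table

    def release(self, request_id: str) -> None:
        """Return all of a request's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))


pool = KVBlockPool(num_blocks=8, block_size=16)
pool.reserve("req-a", 20)  # needs 2 blocks
pool.reserve("req-b", 5)   # needs 1 block
pool.reserve("req-a", 40)  # grows to 3 blocks; the pool itself is untouched
pool.release("req-b")      # blocks go back to the free list
```

Only the block tables change per request; the backing buffer keeps the same address and size for the lifetime of the pool, which is the property the answer relies on.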


tricky61 commented on July 4, 2024

> Yes. Although the KV cache for a particular request is dynamically allocated, vLLM allocates a contiguous memory space for the entire KV cache ahead of time.

PagedAttention has two versions, v1 and v2. During normal inference, as max_num_partitions increases, PagedAttention may switch from v1 to v2. Do CUDA graphs use both versions, or do they use v1 by default since the graph is compiled beforehand?
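To make the premise of the question concrete, here is a rough sketch of how a v1/v2 choice could depend on max_num_partitions. The function name, the partition size of 512, and both thresholds are assumptions for illustration; nothing in this thread confirms the exact rule vLLM uses.

```python
def choose_paged_attention_version(
    max_seq_len: int,
    num_seqs: int,
    num_heads: int,
    partition_size: int = 512,     # assumed partition length in tokens
    v1_max_seq_len: int = 8192,    # assumed upper bound for the v1 kernel
    min_parallel_work: int = 512,  # assumed threshold on num_seqs * num_heads
) -> str:
    """Hypothetical heuristic: pick 'v1' or 'v2' from the batch shape."""
    # Number of partitions the longest sequence would be split into.
    max_num_partitions = (max_seq_len + partition_size - 1) // partition_size
    # v1 (no partitioning) when the sequence fits in one partition, or when
    # the batch already provides enough parallel work without splitting.
    use_v1 = max_seq_len <= v1_max_seq_len and (
        max_num_partitions == 1 or num_seqs * num_heads > min_parallel_work
    )
    return "v1" if use_v1 else "v2"


short_ctx = choose_paged_attention_version(max_seq_len=256, num_seqs=1, num_heads=32)
long_ctx = choose_paged_attention_version(max_seq_len=4096, num_seqs=1, num_heads=32)
```

Under these assumed defaults, the same model can land on either kernel depending on the longest sequence in the batch, which is the situation the question asks about.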
