Anything you want to discuss about vllm.

[Misc]: CUDAGraph captured generation stuck with custom_all_reduce and tensor_parallel=2 (open, 2 comments)

nuzant commented on July 17, 2024

[Misc]: CUDAGraph captured generation stuck with custom_all_reduce and tensor_parallel=2


Comments (2)

hanzhi713 commented on July 17, 2024

You might want to share a minimal reproducible code snippet. The stage-selection behavior you mentioned is expected, so that shouldn't be the problem.

Also, please try the following two configurations first and see whether either one still hangs:

Disable CUDA graph but keep custom allreduce enabled, using your current strategy.
Enable CUDA graph but disable custom allreduce, using your current strategy.
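
The two checks above can be sketched as a pair of engine configurations. This is a minimal sketch, not the reporter's setup: it assumes the stock `vllm.LLM` entry point and its `enforce_eager` (skip CUDA graph capture) and `disable_custom_all_reduce` engine arguments, so a custom integration of custom allreduce would need the equivalent switches in its own code. The model name and the helper names `isolation_configs` and `run_config` are placeholders made up for illustration.

```python
from typing import Any, Dict

# Placeholder model name (an assumption; substitute the model that hangs).
MODEL_NAME = "your-model-name"


def isolation_configs(model: str = MODEL_NAME,
                      tp_size: int = 2) -> Dict[str, Dict[str, Any]]:
    """Build the two engine configurations suggested in the comment.

    Each configuration toggles exactly one of the two suspected features,
    so a hang can be attributed to CUDA graph capture or to custom allreduce.
    """
    base = {"model": model, "tensor_parallel_size": tp_size}
    return {
        # CUDA graph off (eager mode), custom allreduce on.
        "no_cuda_graph_custom_allreduce": {
            **base,
            "enforce_eager": True,
            "disable_custom_all_reduce": False,
        },
        # CUDA graph on, custom allreduce off (falls back to the default
        # collective backend).
        "cuda_graph_no_custom_allreduce": {
            **base,
            "enforce_eager": False,
            "disable_custom_all_reduce": True,
        },
    }


def run_config(kwargs: Dict[str, Any], prompt: str = "Hello") -> str:
    """Start an engine with the given kwargs and generate a few tokens.

    vllm is imported lazily so the configurations can be inspected on a
    machine without GPUs or without vllm installed.
    """
    from vllm import LLM, SamplingParams

    llm = LLM(**kwargs)
    outputs = llm.generate([prompt], SamplingParams(max_tokens=16))
    return outputs[0].outputs[0].text
```

It is probably safest to call `run_config` for each configuration in a separate process, since a tensor-parallel engine holds its GPUs until the process exits. If only one of the two configurations hangs, that points to the feature left enabled in it.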


youkaichao commented on July 17, 2024

It's quite difficult to help with custom usage of custom allreduce. I suggest asking @hanzhi713, who originally contributed this code, for help.



