
Comments (9)

G-z-w commented on August 24, 2024

[Bug] No available block found in 60 second in shm

Your current environment

/path/vllm/vllm/usage/usage_lib.py:19: RuntimeWarning: Failed to read commit hash: No module named 'vllm.commit_id'
  from vllm.version import __version__ as VLLM_VERSION
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.35
Python version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.18.0-425.3.1.el8.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A800-SXM4-80GB
GPU 1: NVIDIA A800-SXM4-80GB
GPU 2: NVIDIA A800-SXM4-80GB
GPU 3: NVIDIA A800-SXM4-80GB
GPU 4: NVIDIA A800-SXM4-80GB
GPU 5: NVIDIA A800-SXM4-80GB
GPU 6: NVIDIA A800-SXM4-80GB
GPU 7: NVIDIA A800-SXM4-80GB
Nvidia driver version: 525.60.13
cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.5 /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.5 /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.5 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.5 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.5 /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.5 /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.5

HIP runtime version: N/A

MIOpen runtime version: N/A

Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 2
Stepping: 6
CPU max MHz: 3400.0000
CPU min MHz: 800.0000
BogoMIPS: 5200.00

Flags:
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities

Virtualization: VT-x
L1d cache: 3 MiB (64 instances)
L1i cache: 2 MiB (64 instances)
L2 cache: 80 MiB (64 instances)
L3 cache: 96 MiB (2 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-127
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.4
[pip3] triton==2.3.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] torchvision 0.18.0 pypi_0 pypi
[conda] transformers 4.42.4 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.2
vLLM Build Flags:

CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled

GPU Topology:

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_4  mlx5_5  CPU Affinity    NUMA Affinity
GPU0     X      NV8     NV8     NV8     NV8     NV8     NV8     NV8     PXB     SYS     SYS     SYS     0-127           N/A
GPU1    NV8      X      NV8     NV8     NV8     NV8     NV8     NV8     PXB     SYS     SYS     SYS     0-127           N/A
GPU2    NV8     NV8      X      NV8     NV8     NV8     NV8     NV8     SYS     PXB     SYS     SYS     0-127           N/A
GPU3    NV8     NV8     NV8      X      NV8     NV8     NV8     NV8     SYS     PXB     SYS     SYS     0-127           N/A
GPU4    NV8     NV8     NV8     NV8      X      NV8     NV8     NV8     SYS     SYS     PXB     SYS     0-127           N/A
GPU5    NV8     NV8     NV8     NV8     NV8      X      NV8     NV8     SYS     SYS     PXB     SYS     0-127           N/A
GPU6    NV8     NV8     NV8     NV8     NV8     NV8      X      NV8     SYS     SYS     SYS     PXB     0-127           N/A
GPU7    NV8     NV8     NV8     NV8     NV8     NV8     NV8      X      SYS     SYS     SYS     PXB     0-127           N/A
mlx5_0  PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS
mlx5_1  SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS
mlx5_4  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS      X      SYS
mlx5_5  SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS      X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

To support our model, I built vLLM from source.
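For reference, a typical from-source install looks roughly like the following (a sketch only; the exact commit, CUDA toolchain, and build flags depend on the local setup):

# Sketch: editable from-source build of vLLM (paths/versions are illustrative)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .   # compiles the CUDA kernels against the locally installed toolkit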

🐛 Describe the bug

See #6614.

I printed out the rank information at the warning and found that one of the GPUs was stuck (out of 4 GPUs used for tensor parallelism). This bug is highly reproducible, especially when running models above 70B (like Qwen2) under a large number of requests.
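A quick way to confirm which tensor-parallel worker is hung is to inspect the worker processes from outside. A minimal sketch, assuming py-spy is installed and the PIDs are taken from the (VllmWorkerProcess pid=...) lines in the server log:

# Sketch: identify the stuck rank from outside the process (PIDs are illustrative)
nvidia-smi                          # per-GPU utilization; the stuck rank often sits at 0% or a constant 100%
for pid in 143 144 145 146; do      # replace with the real VllmWorkerProcess PIDs
    echo "=== worker $pid ==="
    py-spy dump --pid "$pid"        # Python stack of the worker; add --native for C/CUDA frames
done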



doYrobot commented on August 24, 2024
echo "Activating conda environment..."
source activate vllm_infer || { echo "Failed to activate conda environment"; exit 1; }

# Define variables
HOST="0.0.0.0"
PORT=11434
MODEL_PATH="xx/OpenGVLab/InternVL2-40B"
LOG_FILE="xx/internvl40b_autotriage-agent-server-backup.log"
TENSOR_PARALLEL_SIZE=4
MAX_NUM_SEQS=16
SERVE_MODEL_NAME="InternVL2-40B"
DISTRIBUTED_EXECUTOR_BACKEND="mp"

# Launch command: redirect output to the specified log file and run in the background
echo "Starting API server..."
nohup python -m vllm.entrypoints.openai.api_server \
    --host $HOST \
    --port $PORT \
    --model $MODEL_PATH \
    --tensor_parallel_size $TENSOR_PARALLEL_SIZE \
    --trust_remote_code \
    --max-num-seqs $MAX_NUM_SEQS \
    --distributed-executor-backend $DISTRIBUTED_EXECUTOR_BACKEND \
    --served-model-name $SERVE_MODEL_NAME \
    --gpu-memory-utilization 0.9 \
    --swap-space 20 \
    --block-size 16 \
    --use-v2-block-manager \
    --preemption-mode "recompute" \
    > $LOG_FILE 2>&1 &

echo "API server started. Logs are being written to $LOG_FILE"


I think it's because there are requests still in the queue, but since the process is stuck, no tokens have been generated, which eventually caused the server to crash.
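One way to check the "requests still in the queue" theory from outside is vLLM's Prometheus metrics endpoint. A minimal sketch, using the port from the script above (metric names may differ slightly across versions):

# Sketch: watch the scheduler queue while the server appears stuck
watch -n 5 'curl -s http://127.0.0.1:11434/metrics | grep -E "vllm:num_requests_(running|waiting|swapped)"'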

After the server died (screenshot omitted).


changshivek commented on August 24, 2024

Got the same issue across multiple versions of official vllm docker images. I first encountered this issue with version v0.4.3, then 0.5.0, 0.5.3.post1, now 0.5.4.

envs and launch commands

My environment info:

PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.31

Python version: 3.10.14 (main, Apr  6 2024, 18:45:05) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.45.1.el7.x86_64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A800-SXM4-80GB
GPU 1: NVIDIA A800-SXM4-80GB
GPU 2: NVIDIA A800-SXM4-80GB
GPU 3: NVIDIA A800-SXM4-80GB
GPU 4: NVIDIA A800-SXM4-80GB
GPU 5: NVIDIA A800-SXM4-80GB
GPU 6: NVIDIA A800-SXM4-80GB
GPU 7: NVIDIA A800-SXM4-80GB

Nvidia driver version: 535.104.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 57 bits virtual
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  1
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               106
Model name:          Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
Stepping:            6
Frequency boost:     enabled
CPU MHz:             3400.000
CPU max MHz:         3400.0000
CPU min MHz:         800.0000
BogoMIPS:            5200.00
Virtualization:      VT-x
L1d cache:           3 MiB
L1i cache:           2 MiB
L2 cache:            80 MiB
L3 cache:            96 MiB
NUMA node0 CPU(s):   0-31
NUMA node1 CPU(s):   32-63
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 invpcid_single intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq md_clear pconfig spec_ctrl intel_stibp flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] flashinfer==0.1.2+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pyzmq==26.1.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.43.4
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4@
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    CPU Affinity NUMA Affinity    GPU NUMA ID
GPU0     X      NV8     NV8     NV8     NV8     NV8     NV8     NV8     PXB     NODE    NODE    SYS     SYS     NODE    0-31    0       N/A
GPU1    NV8      X      NV8     NV8     NV8     NV8     NV8     NV8     PXB     NODE    NODE    SYS     SYS     NODE    0-31    0       N/A
GPU2    NV8     NV8      X      NV8     NV8     NV8     NV8     NV8     NODE    PXB     NODE    SYS     SYS     PXB     0-31    0       N/A
GPU3    NV8     NV8     NV8      X      NV8     NV8     NV8     NV8     NODE    PXB     NODE    SYS     SYS     PXB     0-31    0       N/A
GPU4    NV8     NV8     NV8     NV8      X      NV8     NV8     NV8     SYS     SYS     SYS     PXB     NODE    SYS     32-63   1       N/A
GPU5    NV8     NV8     NV8     NV8     NV8      X      NV8     NV8     SYS     SYS     SYS     PXB     NODE    SYS     32-63   1       N/A
GPU6    NV8     NV8     NV8     NV8     NV8     NV8      X      NV8     SYS     SYS     SYS     NODE    PXB     SYS     32-63   1       N/A
GPU7    NV8     NV8     NV8     NV8     NV8     NV8     NV8      X      SYS     SYS     SYS     NODE    PXB     SYS     32-63   1       N/A
NIC0    PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    SYS     SYS     NODE
NIC1    NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     NODE     X      NODE    SYS     SYS     PIX
NIC2    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      SYS     SYS     NODE
NIC3    SYS     SYS     SYS     SYS     PXB     PXB     NODE    NODE    SYS     SYS     SYS      X      NODE    SYS
NIC4    SYS     SYS     SYS     SYS     NODE    NODE    PXB     PXB     SYS     SYS     SYS     NODE     X      SYS
NIC5    NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     NODE    PIX     NODE    SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_4
  NIC3: mlx5_5
  NIC4: mlx5_6
  NIC5: mlx5_bond_0

My launch command is: (part of a k8s yaml file)

        command: ["/bin/bash", "-c"]
        args: [
        # "sudo sed -i '175,+2s/\"dns.google\"/\"8.8.8.8\"/g' /workspace/vllm/utils.py && \
        "nvidia-smi;python3 -m vllm.entrypoints.openai.api_server \
        --host 0.0.0.0 \
        --model /fl/nlp/common/plms/qwen2/Qwen2-72B-Instruct \
        --trust-remote-code \
        --enforce-eager \
        --max-model-len 32768 \
        --gpu-memory-utilization 0.9 \
        --served-model-name qwen2-72bc \
        --tensor-parallel-size 8"
         ]

This error seems to ALWAYS occur under a continuous inference workload. Before v0.5.4 the service could detect its own error state and restart automatically, but with v0.5.4 the auto-restart no longer works.
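Since the engine no longer restarts itself here, one workaround is to probe the server's /health endpoint from outside (for example as a Kubernetes liveness probe or a small sidecar loop) and restart it when the engine stops responding. A minimal sketch, with the port and the restart action as placeholders; if /health still answers while the engine is hung, probing with a tiny real completion request is an alternative:

# Sketch: external watchdog that restarts vLLM when /health stops answering
PORT=8000   # placeholder; match the port the server was launched with
while true; do
    if ! curl -sf --max-time 10 "http://127.0.0.1:${PORT}/health" > /dev/null; then
        echo "$(date) /health check failed, restarting vLLM" >&2
        pkill -f vllm.entrypoints.openai.api_server   # or simply exit and let Kubernetes restart the pod
        # relaunch command goes here
    fi
    sleep 30
done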

error log

  • First, this error log appears:
INFO 08-14 14:25:42 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
INFO 08-14 14:25:52 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
ERROR 08-14 14:25:53 async_llm_engine.py:663] Engine iteration timed out. This should never happen!
ERROR 08-14 14:25:53 async_llm_engine.py:57] Engine background task failed
ERROR 08-14 14:25:53 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 636, in run_engine_loop
ERROR 08-14 14:25:53 async_llm_engine.py:57]     done, _ = await asyncio.wait(
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 08-14 14:25:53 async_llm_engine.py:57]     return await _wait(fs, timeout, return_when, loop)
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
ERROR 08-14 14:25:53 async_llm_engine.py:57]     await waiter
ERROR 08-14 14:25:53 async_llm_engine.py:57] asyncio.exceptions.CancelledError
ERROR 08-14 14:25:53 async_llm_engine.py:57]
ERROR 08-14 14:25:53 async_llm_engine.py:57] During handling of the above exception, another exception occurred:
ERROR 08-14 14:25:53 async_llm_engine.py:57]
ERROR 08-14 14:25:53 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
ERROR 08-14 14:25:53 async_llm_engine.py:57]     return_value = task.result()
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
ERROR 08-14 14:25:53 async_llm_engine.py:57]     async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 08-14 14:25:53 async_llm_engine.py:57]     self._do_exit(exc_type)
ERROR 08-14 14:25:53 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 08-14 14:25:53 async_llm_engine.py:57]     raise asyncio.TimeoutError
ERROR 08-14 14:25:53 async_llm_engine.py:57] asyncio.exceptions.TimeoutError
ERROR:asyncio:Exception in callback _log_task_completion(error_callback=<bound method...7f0a5d6569b0>>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:37
handle: <Handle _log_task_completion(error_callback=<bound method...7f0a5d6569b0>>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:37>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 636, in run_engine_loop
    done, _ = await asyncio.wait(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
    return await _wait(fs, timeout, return_when, loop)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
    async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
    self._do_exit(exc_type)
INFO 08-14 14:25:53 async_llm_engine.py:181] Aborted request chat-afa7a52c064a466c952a2eaf29c376a9.
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 59, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for theactual cause.
INFO 08-14 14:25:53 async_llm_engine.py:181] Aborted request chat-fe65c4670df04192becd6af726e294ca.
INFO:     10.233.99.0:48827 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 189, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 185, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 436, in chat_completion_full_generator
    async for res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 216, in generate
    raise request_output
asyncio.exceptions.TimeoutError
INFO:     10.233.99.0:2747 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 189, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 185, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 436, in chat_completion_full_generator
    async for res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 216, in generate
    raise request_output
asyncio.exceptions.TimeoutError
(VllmWorkerProcess pid=143) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=146) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=149) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=145) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=147) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=144) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=148) WARNING 08-14 14:25:53 shm_broadcast.py:386] No available block found in 60 second.
  • Then, if a new request is posted, this error log appears:
INFO:     10.233.99.0:24460 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 189, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 185, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 436, in chat_completion_full_generator
    async for res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 216, in generate
    raise request_output
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.
  • After a long wait with no new requests, this happens:
[rank0]:[F814 14:47:39.862277231 ProcessGroupNCCL.cpp:1224] [PG 3 Rank 0] [PG 3 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 10
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Then the service does not restart; it just stays stuck.
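The watchdog message above names two knobs that can be tried before the next run. A sketch of setting them (values are illustrative; they only give the watchdog more slack or silence it, and do not fix the underlying hang):

# Sketch: relax the NCCL watchdog as suggested by the log message above
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=1800   # allow more time before the heartbeat monitor aborts
export TORCH_NCCL_ENABLE_MONITORING=0          # or disable the heartbeat monitor entirely
export NCCL_DEBUG=INFO                         # optional: extra NCCL logging while debugging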

question

  • Since this issue was first raised in June, with similar issues reported even earlier, is there any recent update on this error? Is it still being tracked?
  • What can I do to avoid this error? Has anyone found any effective practices?

Thanks!


etiennebonnafoux commented on August 24, 2024

Here is mine

vLLM is served with the command:

export CUDA_VISIBLE_DEVICES=0

python -m vllm.entrypoints.openai.api_server \
        --port 31002 \
        --model <some_path_on_my_computer>/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/ \
        --served-model-name llama3 \
        --gpu-memory-utilization 0.4 > worker.out &

The Input

(screenshot of the input omitted)

The error

    if sampling_params.seed is not None:
AttributeError: 'NoneType' object has no attribute 'seed'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/middleware/cors.py", line 93, in __call__
    await self.simple_response(scope, receive, send, request_headers=headers)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/middleware/cors.py", line 148, in simple_response
    await self.app(scope, receive, send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 158, in create_embedding
    generator = await openai_serving_embedding.create_embedding(
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_embedding.py", line 146, in create_embedding
    async for i, res in result_generator:
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/vllm/utils.py", line 329, in consumer
    raise e
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/vllm/utils.py", line 320, in consumer
    raise item
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/vllm/utils.py", line 304, in producer
    async for item in iterator:
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 850, in encode
    async for output in self._process_request(
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 873, in _process_request
    stream = await self.add_request(
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 676, in add_request
    self.start_background_loop()
  File "/home/ebonnafoux/.cache/pypoetry/virtualenvs/test-graphrag-DUgVK7v_-py3.10/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 516, in start_background_loop
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.


LIUKAI0815 commented on August 24, 2024

Traceback (most recent call last):
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 604, in run_engine_loop
done, _ = await asyncio.wait(
File "/home/ubuntu/miniconda3/lib/python3.10/asyncio/tasks.py", line 384, in wait
return await _wait(fs, timeout, return_when, loop)
File "/home/ubuntu/miniconda3/lib/python3.10/asyncio/tasks.py", line 491, in _wait
await waiter
asyncio.exceptions.CancelledError

File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in call
return await self.app(scope, receive, send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in call
await super().call(scope, receive, send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/applications.py", line 123, in call
await self.middleware_stack(scope, receive, send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in call
raise exc
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in call
await self.app(scope, receive, _send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in call
await self.app(scope, receive, send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in call
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/routing.py", line 756, in call
await self.middleware_stack(scope, receive, send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 124, in create_chat_completion
generator = await openai_serving_chat.create_chat_completion(
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 305, in create_chat_completion
return await self.chat_completion_full_generator(
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 505, in chat_completion_full_generator
async for res in result_generator:
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 765, in generate
async for output in self._process_request(
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 881, in _process_request
raise e
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 877, in _process_request
async for request_output in stream:
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 91, in anext
raise result
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 44, in _log_task_completion
return_value = task.result()
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 603, in run_engine_loop
async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 95, in aexit
self._do_exit(exc_type)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m vllm.entrypoints.openai.api_server \
    --model /data/qwen/ \
    --port 3004 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enforce-eager \
    --served-model-name Qwen2-72B-Instruct-awq


justinthelaw commented on August 24, 2024

The crash seems to happen after a cold start or a long pause before the next generation. Below is an example of the engine working, then failing after 30 minutes of multi-user (parallel-request) usage followed by a short pause.

vLLM version: 0.5.3.post1
OS/Machine: System76 Adder WS, Ubuntu 22.04 Desktop, eGPU NVIDIA GeForce RTX 4090 (24 GB VRAM)
Python version: 3.11.6 (isolated virtual environment)
Base LLM: Mistral-7B Instruct v0.3

INFO 08-05 16:38:20 async_llm_engine.py:140] Finished request d37eacc1e8fb411e99362648eab38666.
INFO:root:Generated 745 tokens in 45.47s
INFO:root:Finished request d37eacc1e8fb411e99362648eab38666
DEBUG:asyncio:Using selector: EpollSelector
INFO:root:Begin reading the output for request 84785d7df09b4128b11e2113fed6cde7
DEBUG:root:SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=False, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
INFO:root:Begin generation for request 84785d7df09b4128b11e2113fed6cde7
INFO:root:Begin iteration for request 84785d7df09b4128b11e2113fed6cde7
INFO 08-05 16:38:23 async_llm_engine.py:173] Added request 84785d7df09b4128b11e2113fed6cde7.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:25 metrics.py:396] Avg prompt throughput: 676.5 tokens/s, Avg generation throughput: 6.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.0%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:30 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.1%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:35 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.1%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:40 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.2%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:45 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.3%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:50 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.4%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:38:56 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.4%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:39:01 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 16.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.5%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:39:06 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 17.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 3.6%, CPU KV cache usage: 0.0%.
INFO 08-05 16:39:06 async_llm_engine.py:140] Finished request 84785d7df09b4128b11e2113fed6cde7.
INFO:root:Generated 711 tokens in 42.73s
INFO:root:Finished request 84785d7df09b4128b11e2113fed6cde7
DEBUG:asyncio:Using selector: EpollSelector
INFO:root:Begin reading the output for request 6476b49a79714e4d8bd6e5c65107b586
DEBUG:root:SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=False, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
INFO:root:Begin generation for request 6476b49a79714e4d8bd6e5c65107b586
INFO:root:Begin iteration for request 6476b49a79714e4d8bd6e5c65107b586
INFO 08-05 16:39:53 async_llm_engine.py:173] Added request 6476b49a79714e4d8bd6e5c65107b586.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:39:54 metrics.py:396] Avg prompt throughput: 1.2 tokens/s, Avg generation throughput: 0.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:39:59 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 17.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
(_AsyncLLMEngine pid=2189) INFO 08-05 16:40:04 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 17.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 08-05 16:40:05 async_llm_engine.py:140] Finished request 6476b49a79714e4d8bd6e5c65107b586.
INFO:root:Generated 177 tokens in 10.35s
INFO:root:Finished request 6476b49a79714e4d8bd6e5c65107b586
DEBUG:asyncio:Using selector: EpollSelector
INFO:root:Begin reading the output for request 512aeb24c57f4f1eba974ecaba0e9522
DEBUG:root:SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=False, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
INFO:root:Begin generation for request 512aeb24c57f4f1eba974ecaba0e9522
INFO:root:Begin iteration for request 512aeb24c57f4f1eba974ecaba0e9522
INFO 08-05 16:51:50 async_llm_engine.py:173] Added request 512aeb24c57f4f1eba974ecaba0e9522.
ERROR 08-05 16:51:50 async_llm_engine.py:658] Engine iteration timed out. This should never happen!
ERROR 08-05 16:51:50 async_llm_engine.py:56] Engine background task failed
ERROR 08-05 16:51:50 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 08-05 16:51:50 async_llm_engine.py:56] File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
ERROR 08-05 16:51:50 async_llm_engine.py:56] await asyncio.sleep(0)
ERROR 08-05 16:51:50 async_llm_engine.py:56] File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/tasks.py", line 640, in sleep
ERROR 08-05 16:51:50 async_llm_engine.py:56] await __sleep0()
ERROR 08-05 16:51:50 async_llm_engine.py:56] File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/tasks.py", line 634, in __sleep0
ERROR 08-05 16:51:50 async_llm_engine.py:56] yield
ERROR 08-05 16:51:50 async_llm_engine.py:56] asyncio.exceptions.CancelledError
ERROR 08-05 16:51:50 async_llm_engine.py:56]
ERROR 08-05 16:51:50 async_llm_engine.py:56] The above exception was the direct cause of the following exception:
ERROR 08-05 16:51:50 async_llm_engine.py:56]
ERROR 08-05 16:51:50 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 08-05 16:51:50 async_llm_engine.py:56] File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 08-05 16:51:50 async_llm_engine.py:56] return_value = task.result()
ERROR 08-05 16:51:50 async_llm_engine.py:56] ^^^^^^^^^^^^^
ERROR 08-05 16:51:50 async_llm_engine.py:56] File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 630, in run_engine_loop
ERROR 08-05 16:51:50 async_llm_engine.py:56] async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 08-05 16:51:50 async_llm_engine.py:56] File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
ERROR 08-05 16:51:50 async_llm_engine.py:56] raise TimeoutError from exc_val
ERROR 08-05 16:51:50 async_llm_engine.py:56] TimeoutError
ERROR:asyncio:Exception in callback _log_task_completion(error_callback=>)() at /home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py:36
handle: >)() at /home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py:36>
Traceback (most recent call last):
  File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
    await asyncio.sleep(0)
  File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/tasks.py", line 640, in sleep
    await __sleep0()
  File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/tasks.py", line 634, in __sleep0
    yield
asyncio.exceptions.CancelledError
 
The above exception was the direct cause of the following exception:
 
Traceback (most recent call last):
  File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 630, in run_engine_loop
    async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
  File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
    raise TimeoutError from exc_val
TimeoutError
 
The above exception was the direct cause of the following exception:
 
Traceback (most recent call last):
  File "/home/nonroot/.pyenv/versions/3.11.6/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/leapfrogai/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.


yitianlian commented on August 24, 2024

This error occurs randomly while my code is running. I was running the Llama 3.1 model, and the responses to the previous few requests were normal. My environment is:

Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.35

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.19.91-014-kangaroo.2.10.13.5c249cdaf.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 470.199.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          96
On-line CPU(s) list:             0-95
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Processor @ 2.90GHz
CPU family:                      6
Model:                           106
Thread(s) per core:              1
Core(s) per socket:              96
Socket(s):                       1
Stepping:                        6
BogoMIPS:                        5800.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd avx512vbmi umip pku avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear arch_capabilities
Virtualization:                  VT-x
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       4.5 MiB (96 instances)
L1i cache:                       3 MiB (96 instances)
L2 cache:                        120 MiB (96 instances)
L3 cache:                        48 MiB (1 instance)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-95
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.3
[pip3] triton==2.3.1
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] torch                     2.3.1                    pypi_0    pypi
[conda] torchvision               0.18.1                   pypi_0    pypi
[conda] transformers              4.43.3                   pypi_0    pypi
[conda] triton                    2.3.1                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95           N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95           N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95           N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95           N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    PHB     PHB     PHB     PHB     0-95           N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    PHB     PHB     PHB     PHB     0-95           N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    PHB     PHB     PHB     PHB     0-95           N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      PHB     PHB     PHB     PHB     0-95           N/A
mlx5_0  PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB      X      PHB     PHB     PHB
mlx5_1  PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB      X      PHB     PHB
mlx5_2  PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB      X      PHB
mlx5_3  PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB     PHB      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks



nivibilla avatar nivibilla commented on August 24, 2024

Hi,

I've been having this issue when using Llama 3.1 70B bf16 (no quantization) on an 8xL4 node. I am using guided decoding for all my requests.

What I noticed is that it seems to happen when the server is overloaded with many parallel requests. I was trying to make 64 calls using the async openai package, and in the error output I saw a CUDA OOM error alongside this error.
I reduced my number of parallel calls to 8 and it seems to be working fine now. I also reduced my max-model-len to 4096.
The same code works fine with the 8B model at batch size 64, so for me it looks like a GPU memory issue.
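For reference, a minimal sketch of that client-side cap using an asyncio.Semaphore around the async openai client, assuming a local vLLM OpenAI-compatible endpoint; the base URL, model name, and guided_choice payload below are illustrative placeholders, not the exact setup from this report:

```python
import asyncio

from openai import AsyncOpenAI

# Assumed local vLLM OpenAI-compatible server; adjust base_url/model to your deployment.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

MAX_CONCURRENCY = 8  # the concurrency that worked reliably in this report
semaphore = asyncio.Semaphore(MAX_CONCURRENCY)


async def classify(prompt: str) -> str:
    # The semaphore keeps at most MAX_CONCURRENCY requests in flight,
    # so the server-side batch (and guided-decoding overhead) stays bounded
    # even if the caller fires off 64 tasks at once.
    async with semaphore:
        resp = await client.chat.completions.create(
            model="meta-llama/Meta-Llama-3.1-70B-Instruct",      # placeholder model name
            messages=[{"role": "user", "content": prompt}],
            extra_body={"guided_choice": ["yes", "no"]},          # example guided-decoding constraint
        )
        return resp.choices[0].message.content


async def main(prompts: list[str]) -> list[str]:
    # All 64 tasks are created up front, but only 8 run against the server at a time.
    return await asyncio.gather(*(classify(p) for p in prompts))


if __name__ == "__main__":
    answers = asyncio.run(
        main([f"Question {i}: is this spam? Answer yes or no." for i in range(64)])
    )
    print(answers[:3])
```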

Hope this helps debug what's going on.


nivibilla avatar nivibilla commented on August 24, 2024

Since that was a controlled environment it worked, but in production I can't estimate how many parallel calls will be made, so this error might come up in real-world testing. Is there a way to test for this, and to artificially limit the queue when using guided decoding so that the server doesn't go OOM?
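One way to cap load without knowing the production call rate is to put a fixed-size worker pool between callers and the server, so the number of in-flight guided-decoding requests is bounded client-side no matter how many requests arrive; bursts simply queue up. (Server-side, vLLM also exposes --max-num-seqs to bound how many sequences are batched together, though whether that alone avoids the OOM here is untested.) A minimal sketch under that assumption; `send_request` is a placeholder for whatever coroutine actually calls the server, e.g. the guided-decoding call from the previous sketch:

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def run_with_worker_pool(
    items: Iterable[str],
    send_request: Callable[[str], Awaitable[str]],
    num_workers: int = 8,
) -> List[str]:
    """Run `send_request` over `items` with at most `num_workers` in flight.

    The worker count, not the number of callers, bounds server-side load,
    so an unpredictable burst of traffic waits in the client-side queue
    instead of overwhelming the vLLM server.
    """
    queue: asyncio.Queue[str] = asyncio.Queue()
    for item in items:
        queue.put_nowait(item)

    results: List[str] = []

    async def worker() -> None:
        while True:
            try:
                item = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # queue drained, this worker exits
            results.append(await send_request(item))

    # Exactly num_workers requests can be outstanding at any moment.
    await asyncio.gather(*(worker() for _ in range(num_workers)))
    return results
```

For load testing, the same helper can be run with an increasing `num_workers` against a staging server to find the concurrency at which the OOM appears, then the production cap set comfortably below it.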

