Comments (14)
Thanks to @mgoin for the mention.
#5036 has preliminarily addressed this issue; we have tested it on a TITAN RTX. You can clone that branch and build vLLM from it.
You should clone my repo using:
git clone -b refactor-punica-kernel https://github.com/jeejeelee/vllm.git
@emillykkejensen I can run AWQ + LoRA properly on a TITAN RTX. FYI: https://github.com/vllm-project/vllm/blob/main/csrc/quantization/awq/dequantize.cuh#L18
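Roughly what that looks like with the offline LLM API, in case it helps with reproducing (a minimal sketch; the model name matches the one in this thread, but the adapter directory is a placeholder that must point at a real local LoRA checkout):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# AWQ base model with LoRA enabled, mirroring the server flags used later in this thread.
llm = LLM(
    model="TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ",
    quantization="awq",
    dtype="half",
    enable_lora=True,
    enforce_eager=True,
)

# lora_request must point at a local adapter directory (placeholder path).
outputs = llm.generate(
    ["San Francisco is a"],
    SamplingParams(temperature=0.0, max_tokens=16),
    lora_request=LoRARequest("colorist-lora", 1, "/path/to/tinyllama-colorist-lora"),
)
print(outputs[0].outputs[0].text)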
Hi again @jeejeelee
Sorry for that, you are 100% right! If I do the above, but clone the correct branch (!!) it works.
Thanks for the fix, and hope it will be merged into master soon :)
I have the same issue; however, I am running it on an Azure VM with a T4 GPU using Docker.
Hi @rikitomo and @emillykkejensen, it is unfortunately the case that punica does not support T4 or V100, per #3197
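If you want to confirm what your GPU reports, a quick check along these lines should tell you (a small sketch using PyTorch; the stock punica kernels need compute capability 8.0 or newer, while T4 is sm_75 and V100 is sm_70):

import torch

# Punica's LoRA kernels require SM 8.0 (Ampere) or newer.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: sm_{major}{minor}")
if (major, minor) < (8, 0):
    print("This GPU is below SM 8.0, so the stock punica LoRA kernels will not run here.")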
Please follow up with this in the issue on their repo punica-ai/punica#44. Once it is addressed, we can pull in the updated kernels into vLLM - thanks!
On another note: perhaps this will be addressed by this recent work on using Triton for LoRA inference! #5036
Hi @jeejeelee
Thanks a lot for the proposed fix. However, when I try to build from your branch I get the same error. I'm building inside a Docker container, so I don't know if that is the issue.
What I did:
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3
# and then from within the container
git clone https://github.com/jeejeelee/vllm.git
cd vllm
export VLLM_INSTALL_PUNICA_KERNELS=1
pip install -e .
Once the build was done, I ran:
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ \
--quantization awq \
--dtype half \
--enable-lora \
--enforce-eager \
--gpu-memory-utilization 0.90 \
--lora-modules sql-lora=jashing/tinyllama-colorist-lora/
That gave me this output:
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
config.json: 100%|███████████████████████████████████████████████████████████████████████████| 854/854 [00:00<00:00, 11.6MB/s]
WARNING 06-11 08:57:14 config.py:192] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 06-11 08:57:14 llm_engine.py:103] Initializing an LLM engine (v0.4.2) with config: model='TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', speculative_config=None, tokenizer='TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ)
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████| 1.42k/1.42k [00:00<00:00, 25.9MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 35.2MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 18.1MB/s]
added_tokens.json: 100%|███████████████████████████████████████████████████████████████████| 69.0/69.0 [00:00<00:00, 1.30MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████| 96.0/96.0 [00:00<00:00, 1.90MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
generation_config.json: 100%|██████████████████████████████████████████████████████████████| 68.0/68.0 [00:00<00:00, 1.10MB/s]
INFO 06-11 08:57:16 selector.py:113] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-11 08:57:16 selector.py:44] Using XFormers backend.
INFO 06-11 08:57:18 weight_utils.py:206] Using model weights format ['*.safetensors']
model.safetensors: 100%|████████████████████████████████████████████████████████████████████| 766M/766M [00:02<00:00, 262MB/s]
INFO 06-11 08:57:22 model_runner.py:146] Loading model weights took 0.7370 GB
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 186, in <module>
[rank0]: engine = AsyncLLMEngine.from_engine_args(
[rank0]: File "/workspace/vllm/vllm/engine/async_llm_engine.py", line 382, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/workspace/vllm/vllm/engine/async_llm_engine.py", line 336, in __init__
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/workspace/vllm/vllm/engine/async_llm_engine.py", line 458, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/workspace/vllm/vllm/engine/llm_engine.py", line 178, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/workspace/vllm/vllm/engine/llm_engine.py", line 255, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: File "/workspace/vllm/vllm/executor/gpu_executor.py", line 75, in determine_num_available_blocks
[rank0]: return self.driver_worker.determine_num_available_blocks()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/workspace/vllm/vllm/worker/worker.py", line 154, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/workspace/vllm/vllm/worker/model_runner.py", line 787, in profile_run
[rank0]: self.execute_model(seqs, kv_caches)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/workspace/vllm/vllm/worker/model_runner.py", line 706, in execute_model
[rank0]: hidden_states = model_executable(**execute_model_kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/workspace/vllm/vllm/model_executor/models/llama.py", line 367, in forward
[rank0]: hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/workspace/vllm/vllm/model_executor/models/llama.py", line 292, in forward
[rank0]: hidden_states, residual = layer(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/workspace/vllm/vllm/model_executor/models/llama.py", line 231, in forward
[rank0]: hidden_states = self.self_attn(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/workspace/vllm/vllm/model_executor/models/llama.py", line 160, in forward
[rank0]: qkv, _ = self.qkv_proj(hidden_states)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/workspace/vllm/vllm/lora/layers.py", line 470, in forward
[rank0]: output_parallel = self.apply(input_, bias)
[rank0]: File "/workspace/vllm/vllm/lora/layers.py", line 853, in apply
[rank0]: output = self.base_layer.quant_method.apply(self.base_layer, x, bias)
[rank0]: File "/workspace/vllm/vllm/model_executor/layers/quantization/awq.py", line 168, in apply
[rank0]: out = ops.awq_dequantize(qweight, scales, qzeros, 0, 0, 0)
[rank0]: File "/workspace/vllm/vllm/_custom_ops.py", line 119, in awq_dequantize
[rank0]: return vllm_ops.awq_dequantize(qweight, scales, zeros, split_k_iters, thx,
[rank0]: RuntimeError: CUDA error: no kernel image is available for execution on the device
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank0]: Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
[rank0]: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x70ddb257a897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
[rank0]: frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x70ddb252ab25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
[rank0]: frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x70ddb29e1718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
[rank0]: frame #3: <unknown function> + 0x2ea76 (0x70ddb29bda76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
[rank0]: frame #4: <unknown function> + 0x343e4 (0x70ddb29c33e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
[rank0]: frame #5: <unknown function> + 0x35ca7 (0x70ddb29c4ca7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
[rank0]: frame #6: <unknown function> + 0x360e7 (0x70ddb29c50e7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
[rank0]: frame #7: <unknown function> + 0x1866589 (0x70dd9a7bb589 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
[rank0]: frame #8: at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, std::optional<c10::MemoryFormat>) + 0x14 (0x70dd9a7b51e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
[rank0]: frame #9: at::detail::empty_cuda(c10::ArrayRef<long>, c10::ScalarType, std::optional<c10::Device>, std::optional<c10::MemoryFormat>) + 0x111 (0x70dd660f6641 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[rank0]: frame #10: at::detail::empty_cuda(c10::ArrayRef<long>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, std::optional<c10::MemoryFormat>) + 0x36 (0x70dd660f6916 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[rank0]: frame #11: at::native::empty_cuda(c10::ArrayRef<long>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, std::optional<c10::MemoryFormat>) + 0x20 (0x70dd66334a30 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[rank0]: frame #12: <unknown function> + 0x329a789 (0x70dd6833f789 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[rank0]: frame #13: <unknown function> + 0x329a86b (0x70dd6833f86b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
[rank0]: frame #14: at::_ops::empty_memory_format::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, std::optional<c10::MemoryFormat>) + 0xe7 (0x70dd9b7b9be7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
[rank0]: frame #15: <unknown function> + 0x2c10def (0x70dd9bb65def in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
[rank0]: frame #16: at::_ops::empty_memory_format::call(c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, std::optional<c10::MemoryFormat>) + 0x1a0 (0x70dd9b801a00 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
[rank0]: frame #17: at::empty(c10::ArrayRef<long>, c10::TensorOptions, std::optional<c10::MemoryFormat>) + 0x150 (0x70dcec735c60 in /workspace/vllm/vllm/_C.cpython-310-x86_64-linux-gnu.so)
[rank0]: frame #18: torch::empty(c10::ArrayRef<long>, c10::TensorOptions, std::optional<c10::MemoryFormat>) + 0x8a (0x70dcec735dea in /workspace/vllm/vllm/_C.cpython-310-x86_64-linux-gnu.so)
[rank0]: frame #19: awq_dequantize(at::Tensor, at::Tensor, at::Tensor, int, int, int) + 0x249 (0x70dcec759609 in /workspace/vllm/vllm/_C.cpython-310-x86_64-linux-gnu.so)
[rank0]: frame #20: <unknown function> + 0xf5449 (0x70dcec74f449 in /workspace/vllm/vllm/_C.cpython-310-x86_64-linux-gnu.so)
[rank0]: frame #21: <unknown function> + 0xf123d (0x70dcec74b23d in /workspace/vllm/vllm/_C.cpython-310-x86_64-linux-gnu.so)
[rank0]: <omitting python frames>
@emillykkejensen It seems that the error is triggered by awq. It's possible that awq only supports SM80+. Have you tested LoRA using an FP16 model?
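For example, something along these lines would isolate it (a sketch; the FP16 checkpoint name is an assumption, any unquantized base model your adapter was trained on will do):

from vllm import LLM
from vllm.lora.request import LoRARequest

# Same setup as the AWQ run, but with an unquantized FP16 base model and no
# quantization flag; if LoRA works here, the failure is specific to the AWQ kernels.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v0.3",  # assumed FP16 checkpoint
    dtype="half",
    enable_lora=True,
    enforce_eager=True,
)
out = llm.generate(
    ["San Francisco is a"],
    lora_request=LoRARequest("test-lora", 1, "/path/to/local/adapter"),
)
print(out[0].outputs[0].text)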
So I tried to build a local Docker image using your branch (docker build -t my-vllm-image https://github.com/jeejeelee/vllm.git#refactor-punica-kernel).
It seems to load vLLM and also load the model okay, but when I call it I get the following error:
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
WARNING 06-13 11:09:54 config.py:192] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 06-13 11:09:54 llm_engine.py:103] Initializing an LLM engine (v0.4.2) with config: model='TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', speculative_config=None, tokenizer='TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-13 11:09:55 selector.py:120] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-13 11:09:55 selector.py:51] Using XFormers backend.
INFO 06-13 11:09:56 selector.py:120] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-13 11:09:56 selector.py:51] Using XFormers backend.
INFO 06-13 11:09:56 weight_utils.py:207] Using model weights format ['*.safetensors']
INFO 06-13 11:09:57 weight_utils.py:250] No model.safetensors.index.json found in remote.
INFO 06-13 11:10:08 model_runner.py:146] Loading model weights took 0.7370 GB
INFO 06-13 11:10:11 gpu_executor.py:83] # GPU blocks: 32795, # CPU blocks: 11915
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 06-13 11:10:14 serving_chat.py:83] No chat template provided. Chat API will not work.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 06-13 11:10:15 serving_embedding.py:131] embedding_mode is False. Embedding API will not work.
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 06-13 11:10:25 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-13 11:10:35 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-13 11:10:45 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-13 11:10:55 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-13 11:11:05 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-13 11:11:15 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-13 11:11:23 async_llm_engine.py:545] Received request cmpl-ca79698496dd4702a6e821afaef7b588-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1, 3087, 8970, 338, 263], lora_request: LoRARequest(lora_name='sql-lora', lora_int_id=1, lora_local_path='jashing/tinyllama-colorist-lora/', long_lora_max_len=None).
WARNING 06-13 11:11:23 tokenizer.py:142] No tokenizer found in jashing/tinyllama-colorist-lora/, using base model tokenizer instead. (Exception: Incorrect path_or_model_id: 'jashing/tinyllama-colorist-lora/'. Please provide either the path to a local folder or the repo_id of a model on the Hub.)
ERROR 06-13 11:11:23 async_llm_engine.py:44] Engine background task failed
ERROR 06-13 11:11:23 async_llm_engine.py:44] Traceback (most recent call last):
ERROR 06-13 11:11:23 async_llm_engine.py:44] File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 174, in _load_lora
ERROR 06-13 11:11:23 async_llm_engine.py:44] lora = self._lora_model_cls.from_local_checkpoint(
ERROR 06-13 11:11:23 async_llm_engine.py:44] File "/usr/local/lib/python3.10/dist-packages/vllm/lora/models.py", line 314, in from_local_checkpoint
ERROR 06-13 11:11:23 async_llm_engine.py:44] with open(lora_config_path) as f:
ERROR 06-13 11:11:23 async_llm_engine.py:44] FileNotFoundError: [Errno 2] No such file or directory: 'jashing/tinyllama-colorist-lora/adapter_config.json'
ERROR 06-13 11:11:23 async_llm_engine.py:44]
ERROR 06-13 11:11:23 async_llm_engine.py:44] The above exception was the direct cause of the following exception:
ERROR 06-13 11:11:23 async_llm_engine.py:44]
ERROR 06-13 11:11:23 async_llm_engine.py:44] Traceback (most recent call last):
ERROR 06-13 11:11:23 async_llm_engine.py:44] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 39, in _raise_exception_on_finish
ERROR 06-13 11:11:23 async_llm_engine.py:44] task.result()
ERROR 06-13 11:11:23 async_llm_engine.py:44] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 517, in run_engine_loop
ERROR 06-13 11:11:23 async_llm_engine.py:44] has_requests_in_progress = await asyncio.wait_for(
ERROR 06-13 11:11:23 async_llm_engine.py:44] File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
ERROR 06-13 11:11:23 async_llm_engine.py:44] return fut.result()
ERROR 06-13 11:11:23 async_llm_engine.py:44] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 491, in engine_step
ERROR 06-13 11:11:23 async_llm_engine.py:44] request_outputs = await self.engine.step_async()
ERROR 06-13 11:11:23 async_llm_engine.py:44] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 225, in step_async
ERROR 06-13 11:11:23 async_llm_engine.py:44] output = await self.model_executor.execute_model_async(
ERROR 06-13 11:11:23 async_llm_engine.py:44] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
ERROR 06-13 11:11:23 async_llm_engine.py:44] output = await make_async(self.driver_worker.execute_model
ERROR 06-13 11:11:23 async_llm_engine.py:44] File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 06-13 11:11:23 async_llm_engine.py:44] result = self.fn(*self.args, **self.kwargs)
ERROR 06-13 11:11:23 async_llm_engine.py:44] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-13 11:11:23 async_llm_engine.py:44] return func(*args, **kwargs)
ERROR 06-13 11:11:23 async_llm_engine.py:44] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 272, in execute_model
ERROR 06-13 11:11:23 async_llm_engine.py:44] output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR 06-13 11:11:23 async_llm_engine.py:44] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-13 11:11:23 async_llm_engine.py:44] return func(*args, **kwargs)
ERROR 06-13 11:11:23 async_llm_engine.py:44] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 689, in execute_model
ERROR 06-13 11:11:23 async_llm_engine.py:44] self.set_active_loras(lora_requests, lora_mapping)
ERROR 06-13 11:11:23 async_llm_engine.py:44] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 827, in set_active_loras
ERROR 06-13 11:11:23 async_llm_engine.py:44] self.lora_manager.set_active_loras(lora_requests, lora_mapping)
ERROR 06-13 11:11:23 async_llm_engine.py:44] File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 137, in set_active_loras
ERROR 06-13 11:11:23 async_llm_engine.py:44] self._apply_loras(lora_requests)
ERROR 06-13 11:11:23 async_llm_engine.py:44] File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 266, in _apply_loras
ERROR 06-13 11:11:23 async_llm_engine.py:44] self.add_lora(lora)
ERROR 06-13 11:11:23 async_llm_engine.py:44] File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 274, in add_lora
ERROR 06-13 11:11:23 async_llm_engine.py:44] lora = self._load_lora(lora_request)
ERROR 06-13 11:11:23 async_llm_engine.py:44] File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 187, in _load_lora
ERROR 06-13 11:11:23 async_llm_engine.py:44] raise RuntimeError(
ERROR 06-13 11:11:23 async_llm_engine.py:44] RuntimeError: Loading lora jashing/tinyllama-colorist-lora/ failed
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7328afd917e0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7328a5023160>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7328afd917e0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7328a5023160>>)>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 174, in _load_lora
lora = self._lora_model_cls.from_local_checkpoint(
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/models.py", line 314, in from_local_checkpoint
with open(lora_config_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'jashing/tinyllama-colorist-lora/adapter_config.json'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 39, in _raise_exception_on_finish
task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 517, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 491, in engine_step
request_outputs = await self.engine.step_async()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 225, in step_async
output = await self.model_executor.execute_model_async(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
output = await make_async(self.driver_worker.execute_model
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 272, in execute_model
output = self.model_runner.execute_model(seq_group_metadata_list,
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 689, in execute_model
self.set_active_loras(lora_requests, lora_mapping)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 827, in set_active_loras
self.lora_manager.set_active_loras(lora_requests, lora_mapping)
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 137, in set_active_loras
self._apply_loras(lora_requests)
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 266, in _apply_loras
self.add_lora(lora)
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 274, in add_lora
lora = self._load_lora(lora_request)
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 187, in _load_lora
raise RuntimeError(
RuntimeError: Loading lora jashing/tinyllama-colorist-lora/ failed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 46, in _raise_exception_on_finish
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 06-13 11:11:23 async_llm_engine.py:157] Aborted request cmpl-ca79698496dd4702a6e821afaef7b588-0.
INFO: 172.17.0.1:55552 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 174, in _load_lora
lora = self._lora_model_cls.from_local_checkpoint(
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/models.py", line 314, in from_local_checkpoint
with open(lora_config_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'jashing/tinyllama-colorist-lora/adapter_config.json'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 118, in create_completion
generator = await openai_serving_completion.create_completion(
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 155, in create_completion
async for i, res in result_generator:
File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 241, in consumer
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 234, in consumer
raise item
File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 218, in producer
async for item in iterator:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 662, in generate
async for output in self.process_request(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 780, in process_request
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 776, in process_request
async for request_output in stream:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 79, in __anext__
raise result
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 39, in _raise_exception_on_finish
task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 517, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 491, in engine_step
request_outputs = await self.engine.step_async()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 225, in step_async
output = await self.model_executor.execute_model_async(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
output = await make_async(self.driver_worker.execute_model
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 272, in execute_model
output = self.model_runner.execute_model(seq_group_metadata_list,
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 689, in execute_model
self.set_active_loras(lora_requests, lora_mapping)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 827, in set_active_loras
self.lora_manager.set_active_loras(lora_requests, lora_mapping)
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 137, in set_active_loras
self._apply_loras(lora_requests)
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 266, in _apply_loras
self.add_lora(lora)
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 274, in add_lora
lora = self._load_lora(lora_request)
File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 187, in _load_lora
raise RuntimeError(
RuntimeError: Loading lora jashing/tinyllama-colorist-lora/ failed
FileNotFoundError: [Errno 2] No such file or directory: 'jashing/tinyllama-colorist-lora/adapter_config.json'
ERROR 06-13 11:11:23 async_llm_engine.py:44]
Maybe you can try passing the LoRA path as a local absolute path.
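For example, you can download the adapter first and hand vLLM the resulting directory (a small sketch using huggingface_hub, which should already be installed alongside vLLM):

from huggingface_hub import snapshot_download

# Downloads the adapter once and returns the local cache directory it landed in.
local_path = snapshot_download(repo_id="jashing/tinyllama-colorist-lora")
print(local_path)

Then start the server with --lora-modules sql-lora=<that printed path> instead of the Hub repo id.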
@jeejeelee Hi, thank you so much for your work! If I just want to run LoRA on a T4, which of your previous commits should I build from?
@jeejeelee Hi, thank you so much for your work! If I just want to run LoRA on a T4, which of your previous commits should I build from?
You can build from the last commit. If you have any questions, please feel free to contact me.
I have the same problem applying LoRA to chatglm3-6b on a T4 GPU:
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 216, in <module>
[rank0]: engine = AsyncLLMEngine.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 431, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 360, in __init__
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 507, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 256, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 353, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 76, in determine_num_available_blocks
[rank0]: return self.driver_worker.determine_num_available_blocks()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 173, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 874, in profile_run
[rank0]: self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1243, in execute_model
[rank0]: hidden_or_intermediate_states = model_executable(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/chatglm.py", line 371, in forward
[rank0]: hidden_states = self.transformer(input_ids, positions, kv_caches,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/chatglm.py", line 319, in forward
[rank0]: hidden_states = self.encoder(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/chatglm.py", line 274, in forward
[rank0]: hidden_states = layer(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/chatglm.py", line 209, in forward
[rank0]: attention_output = self.self_attention(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/chatglm.py", line 108, in forward
[rank0]: context_layer = self.attn(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 94, in forward
[rank0]: return self.impl.forward(query, key, value, kv_cache, attn_metadata,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/xformers.py", line 279, in forward
[rank0]: output = torch.empty_like(query)
[rank0]: RuntimeError: CUDA error: no kernel image is available for execution on the device
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
@naturomics Hi, you can try #5036. It should be able to address your issue.