Comments (4)
Here's a diff of the telemetry code I used to generate the timings:
diff --git a/vllm/lora/models.py b/vllm/lora/models.py
index 3e828568..ad09ecc3 100644
--- a/vllm/lora/models.py
+++ b/vllm/lora/models.py
@@ -217,6 +217,8 @@ class LoRAModel:
         """Create a LoRAModel from a dictionary of tensors."""
         pin_memory = str(device) == "cpu" and is_pin_memory_available()
         loras: Dict[str, LoRALayerWeights] = {}
+        import time
+        a = time.time()
         for tensor_name, tensor in tensors.items():
             module_name, is_lora_a = parse_fine_tuned_lora_name(tensor_name)
             if module_name not in loras:
@@ -258,8 +260,12 @@ class LoRAModel:
                 loras[module_name].lora_b = loras[
                     module_name].lora_b.pin_memory()
+        b = time.time()
+        print("Loaded lora(s) in ", b - a)
+
         for lora in loras.values():
             lora.optimize()
+        print("Optimized lora(s) in ", time.time() - b)
         return cls(lora_model_id, rank, loras, scaling_factor=scaling_factor)

     @classmethod
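The ad-hoc `time.time()` calls in the diff work, but `time.perf_counter()` is Python's recommended clock for measuring elapsed intervals. A minimal, self-contained helper in that style (hypothetical, not part of vLLM) that times any callable:

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn, print how long it took, and return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label} took {time.perf_counter() - start:.4f} s")
    return result

# Example: time a dummy workload standing in for from_lora_tensors.
total = timed("sum", sum, range(1_000_000))
```

`perf_counter()` is monotonic, so the measurement is unaffected by system clock adjustments, unlike `time.time()`.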
Is it a general slowdown, or only a slowdown when LoRA is used?
I only see this slowdown with the LoRA loading functionality (the from_lora_tensors function).
@youkaichao Coming back to this issue, do you know why LoRAs are always loaded on the CPU?
For example, here:
vllm/vllm/lora/worker_manager.py
Lines 175 to 186 in 7c008c5
It's hardcoded to always use the CPU, even when the main model (self.device) is on CUDA.
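One plausible reason for the CPU default (an assumption, not confirmed in this thread) is that CPU-resident weights can be pinned, which the from_dict code above does when is_pin_memory_available() returns true, and pinned host memory allows asynchronous host-to-device copies later. A minimal sketch of that staging pattern (hypothetical helper, not vLLM's actual code), falling back gracefully when CUDA is unavailable:

```python
import torch

def stage_on_cpu_then_move(tensor: torch.Tensor, device: str) -> torch.Tensor:
    """Stage a CPU tensor, pin it if possible, then move it to the target device."""
    assert tensor.device.type == "cpu", "expected the tensor to be loaded on CPU"
    if device.startswith("cuda") and torch.cuda.is_available():
        # Pinned (page-locked) host memory enables non-blocking H2D copies.
        return tensor.pin_memory().to(device, non_blocking=True)
    return tensor  # no GPU target; keep the tensor on CPU

weights = torch.randn(4, 4)  # stands in for a LoRA weight loaded from disk
moved = stage_on_cpu_then_move(weights, "cpu")
```

Whether the pinning cost is paid eagerly at load time (as here) or deferred is a design choice; the timings in the diff above suggest the eager path is where the slowdown shows up.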