Comments (6)
The overhead from using CPU offloading outweighs the benefits. None of the mainstream frameworks have successfully implemented high-performance and effective CPU offloading. It is a low priority at the moment.
Also, as an aside, the docs say to use --cache-max-entry-count "to adjust the GPU mem ratio for k/v cache etc.", but this is a bit confusing and lacking in what I think are important details. TGI has --cuda-memory-fraction and vLLM has --gpu-memory-utilization, and the engine itself will determine how much to allocate to k/v cache, based on how much VRAM is available - this is simple and less confusing. I'm not sure whether this approach is compatible with lmdeploy's paradigm, but I'm just mentioning it as a point of comparison.
I eventually found this page: https://lmdeploy.readthedocs.io/en/v0.4.2/api/pipeline.html#turbomindengineconfig but it was a bit hard to find, and I think it should be linked from the above-mentioned point in the docs.
It says: "For lmdeploy versions greater than v0.2.1, it defaults to 0.8, signifying the percentage of FREE GPU memory to be reserved for the k/v cache."
I think a new parameter should be added with a more appropriate name. I'm also wondering whether it is safe to set this value to 1.0. Or is the default of 0.8 meant to prevent CUDA OOM errors? Or is it for testing on desktop (non-server) machines, where some VRAM must be left for the OS and desktop rendering? If so, is it safe to set it to 1.0 on servers?
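(For context, a minimal sketch of setting this value through the TurbomindEngineConfig page linked above; the model name is only an example, and the parameter's semantics should be checked against your lmdeploy version, since they changed around v0.2.1.)

```python
# Minimal sketch based on the TurbomindEngineConfig API referenced above.
# Assumptions: lmdeploy >= v0.2.1 semantics (value is a fraction of FREE
# GPU memory) and an example model name.
from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(
    cache_max_entry_count=0.8,  # fraction of free VRAM reserved for the k/v cache
)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
print(pipe(['Hi, what does the k/v cache store?']))
```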
You may refer to the design and implementation of the prefix cache here:
- design: #1407 (comment)
- implementation: #1450
The implementation is consistent with the overall design, and subsequent modifications were made based on the review comments in the PR.
I'm wondering whether it's possible to allocate a percentage of system RAM to store prefix cache?
Prefix Cache is a reuse of the existing KV cache blocks; there is no need to do this.
the engine itself will determine how much to allocate to k/v cache, based on how much VRAM is available
Yep. vLLM uses profile_run to check whether the gpu-memory-utilization setting is OK. I have mentioned using a similar approach before (#973 (comment)), but in the end it is still implemented as ratio * free.
I think a new parameter should be added with a more appropriate name.
In fact, it can be a ratio or a count.
See lmdeploy/src/turbomind/models/llama/BlockManager.cc, lines 30 to 39 at commit 735d9a3.
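(To illustrate the ratio-or-count idea, here is a rough Python paraphrase; the real logic is the C++ in BlockManager.cc referenced above, and the block_bytes parameter and exact thresholds here are assumptions for illustration only.)

```python
import torch

def kv_block_count(cache_max_entry_count: float, block_bytes: int) -> int:
    # Illustrative only: a value in (0, 1] is read as a ratio of the
    # currently FREE VRAM; a value > 1 is read as an absolute number
    # of k/v cache blocks.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    if 0 < cache_max_entry_count <= 1:
        return int(free_bytes * cache_max_entry_count) // block_bytes
    return int(cache_max_entry_count)
```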
Thank you very much for the details!
I'm wondering whether it's possible to allocate a percentage of system RAM to store prefix cache?
Prefix Cache is a reuse of the existing KV cache blocks; there is no need to do this.
I'm not sure what you mean by "there is no need" - wouldn't it be better to have more cache storage space if VRAM is limited?
For example: currently, with two RTX 4090s running Llama 2 70B, I can store about 32 prefixes in the free VRAM, since the 70B model consumes almost all of the 48 GB of VRAM. This means there's a high probability of cache eviction if, say, 100 people are using the service.
So if it were possible to use system/CPU RAM too, that would be awesome: I have more than 100 GB of system RAM available, so there would be a lower cache eviction probability for each request.
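(A hedged back-of-envelope check of where a figure like "~32 prefixes" can come from; the prefix length and leftover VRAM below are assumptions chosen to make the arithmetic land near that number, not values stated in this thread.)

```python
# Llama 2 70B architecture values (80 layers, 8 KV heads via GQA, head dim 128),
# fp16 k/v cache. Prefix length and leftover VRAM are assumptions.
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V, ~320 KiB
prefix_tokens = 1024
bytes_per_prefix = prefix_tokens * kv_bytes_per_token                # ~320 MiB

free_vram_for_cache = 10 * 1024**3    # assumed VRAM left after loading weights
system_ram_for_cache = 100 * 1024**3  # the >100 GB of system RAM mentioned above

print(free_vram_for_cache // bytes_per_prefix)   # -> 32 prefixes in VRAM
print(system_ram_for_cache // bytes_per_prefix)  # -> 320 prefixes if RAM were usable
```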
Ah, I see. Thanks for explaining! I'll close this issue now.