Comments (6)
The overhead from using CPU offloading outweighs the benefits. None of the mainstream frameworks have successfully implemented high-performance and effective CPU offloading. It is a low priority at the moment.
Also, as an aside, the docs say to use --cache-max-entry-count "to adjust the GPU mem ratio for k/v cache etc.", but this is a bit confusing and lacking in what I think are important details. TGI has --cuda-memory-fraction and vLLM has --gpu-memory-utilization, and the engine itself will determine how much to allocate to k/v cache, based on how much VRAM is available - this is simple and less confusing. I'm not sure whether this approach is compatible with lmdeploy's paradigm, but I'm just mentioning it as a point of comparison.
I eventually found this page: https://lmdeploy.readthedocs.io/en/v0.4.2/api/pipeline.html#turbomindengineconfig but it was a bit hard to find, and I think it should be linked from the above-mentioned point in the docs.
It says: "For lmdeploy versions greater than v0.2.1, it defaults to 0.8, signifying the percentage of FREE GPU memory to be reserved for the k/v cache."
I think a new parameter should be added with a more appropriate name. I'm also wondering whether it is safe to set this value to 1.0. Or is the default of 0.8 meant to prevent CUDA OOM errors? Or is it for testing on desktop (non-server) machines, where some VRAM must be left for the OS and desktop rendering? If so, is it safe to set it to 1.0 on servers?
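(For context, a minimal sketch of setting this value through the TurbomindEngineConfig page linked above; the model name is only an example, and the parameter's semantics should be checked against your lmdeploy version, since they changed around v0.2.1.)

```python
# Minimal sketch based on the TurbomindEngineConfig API referenced above.
# Assumptions: lmdeploy >= v0.2.1 semantics (value is a fraction of FREE
# GPU memory) and an example model name.
from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(
    cache_max_entry_count=0.8,  # fraction of free VRAM reserved for the k/v cache
)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
print(pipe(['Hi, what does the k/v cache store?']))
```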
You may refer to the design and implementation of the prefix cache here:
- design: #1407 (comment)
- implementation: #1450
The implementation is consistent with the overall design, and subsequent modifications were made based on the review comments in the PR.
I'm wondering whether it's possible to allocate a percentage of system RAM to store prefix cache?
Prefix Cache is a reuse of the existing KV cache blocks; there is no need to do this.
the engine itself will determine how much to allocate to k/v cache, based on how much VRAM is available
Yep. vLLM uses profile_run to check whether the gpu-memory-utilization setting is OK. I have mentioned using a similar approach before (#973 (comment)), but in the end it is still implemented as ratio * free.
I think a new parameter should be added with a more appropriate name.
In fact, it can be a ratio or a count.
See lmdeploy/src/turbomind/models/llama/BlockManager.cc, lines 30 to 39 at commit 735d9a3.
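(To illustrate the ratio-or-count idea, here is a rough Python paraphrase; the real logic is the C++ in BlockManager.cc referenced above, and the block_bytes parameter and exact thresholds here are assumptions for illustration only.)

```python
import torch

def kv_block_count(cache_max_entry_count: float, block_bytes: int) -> int:
    # Illustrative only: a value in (0, 1] is read as a ratio of the
    # currently FREE VRAM; a value > 1 is read as an absolute number
    # of k/v cache blocks.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    if 0 < cache_max_entry_count <= 1:
        return int(free_bytes * cache_max_entry_count) // block_bytes
    return int(cache_max_entry_count)
```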
Thank you very much for the details!
I'm wondering whether it's possible to allocate a percentage of system RAM to store prefix cache?
Prefix Cache is a reuse of the existing KV cache blocks; there is no need to do this.
I'm not sure what you mean by "there is no need" - wouldn't it be better to have more cache storage space if VRAM is limited?
For example: currently, with two RTX 4090s running Llama 2 70B, I can store about 32 prefixes in the free VRAM, since the 70B model consumes almost all of the 48 GB of VRAM. This means there's a high probability of cache eviction if, say, 100 people are using the service.
So if it were possible to use system/CPU RAM too, that would be awesome: I have more than 100 GB of system RAM available, so there would be a lower cache eviction probability for each request.
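(A hedged back-of-envelope check of where a figure like "~32 prefixes" can come from; the prefix length and leftover VRAM below are assumptions chosen to make the arithmetic land near that number, not values stated in this thread.)

```python
# Llama 2 70B architecture values (80 layers, 8 KV heads via GQA, head dim 128),
# fp16 k/v cache. Prefix length and leftover VRAM are assumptions.
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V, ~320 KiB
prefix_tokens = 1024
bytes_per_prefix = prefix_tokens * kv_bytes_per_token                # ~320 MiB

free_vram_for_cache = 10 * 1024**3    # assumed VRAM left after loading weights
system_ram_for_cache = 100 * 1024**3  # the >100 GB of system RAM mentioned above

print(free_vram_for_cache // bytes_per_prefix)   # -> 32 prefixes in VRAM
print(system_ram_for_cache // bytes_per_prefix)  # -> 320 prefixes if RAM were usable
```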
Ah, I see. Thanks for explaining! I'll close this issue now.