
Comments (6)

zhyncs commented on August 16, 2024

The overhead from using CPU offloading outweighs the benefits. None of the mainstream frameworks have successfully implemented high-performance and effective CPU offloading. It is a low priority at the moment.


josephrocca commented on August 16, 2024

Also, as an aside, the docs say:

--cache-max-entry-count to adjust the GPU mem ratio for k/v cache etc.

but this is a bit confusing and lacking in what I think are important details. TGI has --cuda-memory-fraction and vLLM has --gpu-memory-utilization, and the engine itself will determine how much to allocate to k/v cache, based on how much VRAM is available - this is simpler and less confusing. I'm not sure whether this approach is compatible with lmdeploy's paradigm, but I'm just mentioning it as a point of comparison.
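
For comparison, vLLM also exposes this as a constructor argument in its Python API. A minimal sketch (the model name here is just an example):

    # vLLM reserves a fraction of *total* GPU memory, then sizes the k/v cache
    # itself after profiling a dummy forward pass.
    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-2-7b-chat-hf",  # example model
        gpu_memory_utilization=0.90,            # fraction of total VRAM vLLM may use
    )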

I eventually found this page: https://lmdeploy.readthedocs.io/en/v0.4.2/api/pipeline.html#turbomindengineconfig but it was a bit hard to find, and I think it should be linked from the above-mentioned point in the docs.

For lmdeploy versions greater than v0.2.1, it defaults to 0.8, signifying the percentage of FREE GPU memory to be reserved for the k/v cache

I think a new parameter should be added with a more appropriate name. I'm also wondering whether it is safe to set this value to 1.0. Or is the default of 0.8 meant to prevent CUDA OOM errors? Or is it for testing on desktop (non-server) machines, where some VRAM must be left for the OS/desktop rendering? If so, is it safe to set it to 1.0 on servers?
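
For reference, the setting is exposed through TurbomindEngineConfig in the Python API. A minimal sketch (the model name is just an example):

    # Reserve 50% of FREE VRAM (i.e. what remains after the weights are loaded)
    # for the k/v cache, instead of the 0.8 default.
    from lmdeploy import pipeline, TurbomindEngineConfig

    backend_config = TurbomindEngineConfig(cache_max_entry_count=0.5)
    pipe = pipeline("internlm/internlm2-chat-7b", backend_config=backend_config)
    print(pipe(["Hello, how does the k/v cache work?"]))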


zhyncs commented on August 16, 2024

You may refer to the design and implementation of the prefix cache as follows:

design #1407 (comment)

implementation #1450

The implementation is consistent with the overall design, and subsequent modifications were made based on the review comments in the PR.

I'm wondering whether it's possible to allocate a percentage of system RAM to store prefix cache?

The prefix cache is a reuse of the existing KV cache blocks; there is no need to do this.
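
To unpack "reuse of the existing KV cache blocks": prefix caching generally works by hashing each full block of prompt tokens, chained with the hash of everything before it, and reusing any k/v block that already exists under the same hash. A rough sketch of the general technique (my own simplification, not lmdeploy's actual code):

    # Each full block of token ids is hashed together with the hash of the
    # preceding prefix; a request whose prompt starts with the same tokens
    # produces the same keys and can share the already-computed k/v blocks.
    BLOCK_SIZE = 64

    def block_hashes(token_ids):
        hashes, prev = [], None
        n_full = len(token_ids) // BLOCK_SIZE * BLOCK_SIZE
        for i in range(0, n_full, BLOCK_SIZE):
            prev = hash((prev, tuple(token_ids[i:i + BLOCK_SIZE])))
            hashes.append(prev)
        return hashes

    cached = {}  # block hash -> physical k/v block id

    def match_prefix(token_ids):
        """Return the ids of existing k/v blocks this prompt can reuse."""
        reused = []
        for h in block_hashes(token_ids):
            if h not in cached:
                break
            reused.append(cached[h])
        return reused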


zhyncs commented on August 16, 2024

the engine itself will determine how much to allocate to k/v cache, based on how much VRAM is available

Yep. vLLM uses profile_run to check whether the gpu-memory-utilization setting is OK.

https://github.com/vllm-project/vllm/blob/c96fc067479453b02e92d9378eeeaebb6b3816de/vllm/worker/worker.py#L135-L154

I have mentioned using a similar approach before (#973 (comment)), and in the end it is still implemented as ratio * free (#973 (comment)).
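
In outline, the profiling approach in the linked code works roughly like this. A paraphrased sketch (not vLLM's actual code; assumes a model runner object with a profile_run method, as in the linked worker):

    import torch

    def determine_num_gpu_blocks(model_runner, gpu_memory_utilization, block_bytes):
        # Run one worst-case forward pass, measure peak memory, and give the
        # remainder of the memory budget to the k/v cache.
        torch.cuda.empty_cache()
        model_runner.profile_run()                # dummy forward at max batch/seq len
        torch.cuda.synchronize()
        _, total = torch.cuda.mem_get_info()      # total device memory in bytes
        peak = torch.cuda.max_memory_allocated()  # weights + peak activations
        budget = total * gpu_memory_utilization - peak
        return max(int(budget // block_bytes), 0)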

I think a new parameter should be added with a more appropriate name.

In fact, it can be either a ratio or a count:

    BlockManager::BlockManager(
        size_t block_size, double block_count, int chunk_size, IAllocator* allocator, GetFreeMemSize get_free_size):
        block_size_(block_size), allocator_(allocator)
    {
        if (block_count < 1.) {
            // A value below 1.0 is treated as a ratio of free GPU memory.
            max_block_count_ = GetBlockCount(block_size, block_count, get_free_size);
        }
        else {
            // A value of 1 or more is treated as an absolute block count.
            max_block_count_ = block_count;
        }
        // ...
    }
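
So a value below 1.0 is interpreted as a fraction of free GPU memory, and a value of 1 or more as an absolute number of k/v cache blocks. This overloading is compact, but it is also part of why the parameter name reads oddly; an explicitly named count-valued parameter would arguably be clearer.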


josephrocca commented on August 16, 2024

Thank you very much for the details!

I'm wondering whether it's possible to allocate a percentage of system RAM to store prefix cache?

The prefix cache is a reuse of the existing KV cache blocks; there is no need to do this.

I'm not sure what you mean by "there is no need" - wouldn't it be better to have more cache storage space if VRAM is limited?

For example: currently with two RTX 4090s, using Llama 2 70B, I can store about 32 prefixes in the free VRAM, since the 70B model consumes almost all of the 48 GB of VRAM. This means there's a high probability of cache eviction if there are e.g. 100 people using the service.

So if it were possible to use system/CPU RAM too, that would be awesome: I have >100 GB of system RAM available, so there would be a lower cache eviction probability for each request.
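
For a rough sense of the numbers, a back-of-the-envelope sketch (my own estimate, assuming an fp16 k/v cache, Llama 2 70B's GQA layout, 4-bit quantized weights, and ~1,000-token prefixes):

    # Back-of-the-envelope k/v cache sizing for Llama 2 70B.
    layers, kv_heads, head_dim = 80, 8, 128      # Llama 2 70B: GQA with 8 KV heads
    bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # K+V in fp16
    # = 327,680 bytes, i.e. ~0.31 MB per token

    free_vram_gb = 10       # assumption: what remains of 48 GB after 4-bit weights
    prefix_tokens = 1000    # assumption: average prefix length
    tokens_in_cache = free_vram_gb * 1024**3 // bytes_per_token   # ~32,768 tokens
    print(tokens_in_cache // prefix_tokens)      # ~32 prefixes, matching the estimate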


josephrocca commented on August 16, 2024

Ah, I see. Thanks for explaining! I'll close this issue now.

