Comments (3)
This is not a bug. We pre-allocate memory for the KV cache based on the following formula:

- `(total memory * gpu_util) - weights - maximum_activation_size`

The `maximum_activation_size` is measured from the peak memory of a profile run with your `model_max_length`. So you are seeing memory go up during the profiling run and then drop back down.
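The budget formula above can be sketched in a few lines of Python. This is an illustrative sketch, not vLLM's actual code: the function name and all of the example numbers are assumptions.

```python
def kv_cache_budget_bytes(total_memory: float, gpu_util: float,
                          weights: float, max_activation: float) -> float:
    """Sketch of: (total memory * gpu_util) - weights - maximum_activation_size."""
    return total_memory * gpu_util - weights - max_activation

# Hypothetical example: 80 GiB GPU, gpu_memory_utilization=0.9,
# 14 GiB of weights, 6 GiB peak activation measured during the profile run.
GiB = 1024 ** 3
budget = kv_cache_budget_bytes(80 * GiB, 0.9, 14 * GiB, 6 * GiB)
print(f"{budget / GiB:.1f} GiB left for the KV cache")  # 52.0 GiB
```

Everything left of the budget after weights and peak activations are accounted for is pre-allocated for KV cache blocks up front, which is why memory usage settles at a high plateau after the profiling spike.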
> This is not a bug. We pre-allocate memory for the KV cache based on the following formula: `(total memory * gpu_util) - weights - maximum_activation_size` …

Small context lengths may consume more memory than large context lengths. What is the reason?
> Small context lengths may consume more memory than large context lengths. What is the reason?

We allocate memory for the KV cache and weights based on the maximum potential activation size:

- longer context ==> larger maximum activation size ==> less space for the KV cache ==> less memory allocated
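That chain can be made concrete with a toy calculation. All the numbers below and the linear activation model are assumptions for illustration, not measurements from vLLM:

```python
GiB = 1024 ** 3
TOTAL = 80 * GiB   # assumed total GPU memory
GPU_UTIL = 0.9     # gpu_memory_utilization
WEIGHTS = 14 * GiB # assumed model weight footprint

def peak_activation(max_model_len: int, bytes_per_token: int = 1024 ** 2) -> int:
    # Hypothetical linear model: the profile run's peak activation
    # memory grows with the context length being profiled.
    return max_model_len * bytes_per_token

for max_model_len in (2_048, 32_768, 131_072):
    kv_budget = TOTAL * GPU_UTIL - WEIGHTS - peak_activation(max_model_len)
    status = ("OOM during profiling" if kv_budget <= 0
              else f"{kv_budget / GiB:.0f} GiB for KV cache")
    print(max_model_len, "->", status)
```

Under these toy numbers, a longer profiled context length leaves a smaller KV-cache budget, and a large enough context length leaves none at all, which matches the OutOfMemoryError in this issue's title.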
Related Issues (20)
- [Bug]: flashinfer backend bug HOT 1
- [RFC] Changes to CI workflow for PRs HOT 6
- [Bug]: Exception during inference HOT 1
- [Feature]: Phi-3 vision -- allow multiple images as Microsoft shows can be done HOT 1
- [Bug]: AsyncEngineDeadError: Background loop is stopped after invalid parameter in request
- [Bug]: Request never returns if temperature > 2
- [RFC]: Classifier-Free Guidance
- [Bug]: Model architectures ['NVEmbedModel'] are not supported for now
- [Bug]: Internal Server Error when hosting Alibaba-NLP/gte-Qwen2-7B-instruct HOT 1
- [Bug]: unhandled system error with NCCL on v0.5.0.post1 HOT 3
- [Usage]: Multi-LoRA questions
- [Bug]: Inconsistent Output from OPT-x models
- [Feature]: Support in distributed speculative inference
- [Bug]: Neuron offline inference example assertion error
- [Installation]: ValueError: Quantization method specified in the model config (gptq) does not match the quantization method specified in the `quantization` argument (gptq_marlin). HOT 4
- [Bug]: VLM same chat different image results in serving_chat.py:238 Error in loading image data: HOT 5
- [Bug]: OutOfMemoryError when loading a small model with a huge context length HOT 3
- [CI] [Flaky test] distributed/test_shm_broadcast.py is flaky HOT 3
- [Bug]: server error when hosting TheBloke/Llama-2-7B-Chat-GPTQ with chunked-prefill HOT 1
- [Misc]: CUDAGraph captured generation stuck with custom_all_reduce and tensor_parallel=2 HOT 2