Comments (3)
This error occurs in the profile_run used to determine memory usage. If I patch the code to ignore the error, and a few lines below patch the model length to make raise_if_cache_size_invalid happy, the model starts. It won't reach the full 1M context, but it does work with a 200k context on an 80GB GPU.
This will become more pressing for users of small GPUs as popular models push their context lengths beyond 8k. I'm happy to submit my monkey patches if there isn't already a plan to support large-context models.
from vllm.
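For reference, the ceiling those patches run into can be estimated from KV-cache arithmetic alone. This is a hedged back-of-the-envelope sketch; the layer/head/dim counts are illustrative assumptions for a generic GQA model, not taken from any particular config:

```python
def max_kv_cache_tokens(free_gpu_bytes, num_layers, num_kv_heads,
                        head_dim, dtype_bytes=2):
    """Rough upper bound on context length: each token stores one key and
    one value vector per layer per KV head in the chosen dtype."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return free_gpu_bytes // bytes_per_token

# Assumed shapes: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 (2 bytes).
# With ~40 GB left over for KV cache on an 80 GB card:
tokens = max_kv_cache_tokens(40 * 1024**3, 32, 8, 128)
# → 327680 tokens, i.e. well short of 1M but comfortably above 200k,
# which is consistent with the 200k-on-80GB observation above.
```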
You can manually set --max-model-len to reduce the context length. Not sure whether it's a good idea to automatically limit the context length based on available memory. @simon-mo any thoughts?
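A minimal sketch of what an automatic cap could look like; this is purely illustrative, and choose_max_model_len plus the memory figures are hypothetical, not an existing vLLM API:

```python
def choose_max_model_len(model_max_len, free_kv_bytes, kv_bytes_per_token):
    """Clamp the model's advertised context length to the number of tokens
    the KV cache can actually hold in the remaining GPU memory."""
    fits = free_kv_bytes // kv_bytes_per_token
    return min(model_max_len, fits)

# A 1M-context model on a card whose leftover memory (~40 GB) holds
# roughly 327k tokens of KV cache at 128 KiB per token:
capped = choose_max_model_len(1_000_000, 40 * 1024**3, 131072)
# → 327680, so the engine would start with a reduced context
# instead of crashing during profiling.
```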
Agreed that a purely automatic setting may give folks the wrong impression that they can use the full context of the model even if their hardware won't allow it. One alternative is --max-model-len max, which would start the model no matter what and report the actual max context in the logs.
Right now someone must start vLLM, see the crash, parse the max context size out of the log, and set that with --max-model-len. But that only works if profile_run() doesn't OOM with the exception in the OP; in that case the user must guess at the max model len (the log message with the actual max is printed later, and depends on profile_run() succeeding).
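The --max-model-len max proposal could be handled as a sentinel value resolved before the existing cache-size validation runs. A hypothetical sketch (vLLM has no such flag value today; resolve_max_model_len and profiled_fit are made-up names):

```python
def resolve_max_model_len(cli_value, model_max_len, profiled_fit):
    """Treat the literal string 'max' as 'whatever the hardware allows',
    where profiled_fit is the token count the memory profiler says fits."""
    if cli_value == "max":
        return min(model_max_len, profiled_fit)
    return int(cli_value)

# User asks for the hardware maximum on a 1M-context model that only
# fits ~200k tokens of KV cache:
resolve_max_model_len("max", 1_000_000, 200_000)   # → 200000
# Explicit values keep today's behavior:
resolve_max_model_len("8192", 1_000_000, 200_000)  # → 8192
```

The sentinel would also sidestep the chicken-and-egg problem above, since the resolved value is only known after profile_run() succeeds.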
Related Issues (20)
- To make discussion easier, a multimodal LLM chat group has been created; everyone is welcome to join and learn together~ HOT 2
- [Usage]: Why are the first few characters of streaming output set to empty strings? Can't the model's generation be output directly? HOT 1
- [Bug]: Compiling FSM index high memory && subprocess OOM
- [Usage]: Does vllm support dynamic quantization
- [Feature]: support voice llm like cosyvoice HOT 1
- [Bug]: Extra body don't work when response_format is also sent for serving. HOT 6
- [Feature]: Small Model Large Latency Compared to SGLang and TensorRT-LLM HOT 1
- [Bug]: `ops.scaled_fp8_quant` returns wrong shape when input shape is () HOT 1
- [Bug]: LLama3 LoRA load failed HOT 1
- [Bug]:`vllm server` will get some error and `python3 -m vllm.entrypoints.openai.api_server` is correct HOT 1
- [Bug]: internvl2-8b answers questions in an infinite loop HOT 1
- [Bug]: internvl2-8b infinite loop on questions HOT 2
- [Feature]: Why vllm cli not provide a config arg? HOT 2
- Create speculative decode dynamic parallel strategy
- [Bug]: CUDA out of memory for llama3.1 70gb gptq, while in llama3 70gb gptq doesn't HOT 1
- [Feature]: continuous batching for vllm.LLM HOT 2
- [Bug]: Using LLM Engine to infer the MiniCPM-V-2_6 model, the result is wrong HOT 1
- [Bug]: vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already. HOT 1
- [Bug]: `gemma-2-27b-it-GGUF`: `Architecture gemma2 not supported` HOT 5
- [RFC]: Encoder/decoder models & feature compatibility