Comments (5)
Thanks for the reply @cadedaniel.
I tried now with `--use-v2-block-manager` (version 0.5.0.post1) and it still happens, unfortunately.
Edit: I also tried building the current main branch (commit e2b85cf), where #5364 is already merged, and the issue still happens (also with `--use-v2-block-manager`).
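Roughly what I'm running, as a minimal sketch (the model name and prompt are placeholders for my actual setup):

```python
from vllm import LLM, SamplingParams

# use_v2_block_manager is the engine-arg equivalent of the
# --use-v2-block-manager CLI flag.
llm = LLM(
    model="facebook/opt-125m",   # placeholder for my actual model
    use_v2_block_manager=True,
    enable_prefix_caching=True,
)

# Greedy decoding, so repeated runs should produce identical output.
params = SamplingParams(temperature=0, max_tokens=64)
out = llm.generate(["Hello, my name is"], params)
print(out[0].outputs[0].text)
```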
Thanks @colefranks.
I tried the workaround and it doesn't seem to help, but it does change the behavior. I tried several combinations (all with version 0.5.0.post1).
Already on the first iteration there is a difference in outputs between running with `VLLM_ATTENTION_BACKEND=XFORMERS` and without it. Even if we assume that's fine, when `--enable-prefix-caching` is used, the second iteration still differs from the first one.
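To make the second point concrete, a minimal sketch of the repro (model and prompt are placeholders; the prompt needs to be long enough to populate the prefix cache):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",   # placeholder
    enable_prefix_caching=True,
)
params = SamplingParams(temperature=0, max_tokens=64)
prompts = ["<some long shared prefix> question one"]  # placeholder prompt

# First iteration: prefix blocks get computed and cached.
first = llm.generate(prompts, params)[0].outputs[0].text
# Second iteration: the cached prefix is reused; with temperature=0
# this should match the first output exactly, but here it differs.
second = llm.generate(prompts, params)[0].outputs[0].text
print(first == second)  # expected True, observed False
```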
We have an improved block manager with better test coverage for prefix caching. We have tests that compare outputs with prefix caching against outputs without it, so this case shouldn't happen; if it is happening, we can diagnose the failure more easily. Note that the v2 block manager is not yet optimized for performance.
Can you see if it occurs with `--use-v2-block-manager`?
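Roughly the property those tests assert, as a sketch (not the actual test code; model and prompts are placeholders):

```python
from vllm import LLM, SamplingParams

params = SamplingParams(temperature=0, max_tokens=64)
prompts = ["<long prompt with a reusable prefix>"]  # placeholder

# Reference run: v2 block manager, prefix caching off.
ref = LLM(model="facebook/opt-125m", use_v2_block_manager=True)
expected = [o.outputs[0].text for o in ref.generate(prompts, params)]
del ref  # in practice you may need separate processes to free GPU memory

# Run under test: same engine with prefix caching on.
test = LLM(model="facebook/opt-125m",
           use_v2_block_manager=True,
           enable_prefix_caching=True)
actual = [o.outputs[0].text for o in test.generate(prompts, params)]

assert expected == actual  # greedy outputs must be identical
```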
I also built the branch from #5188, and it doesn't resolve the issue.
Possible workaround: #5376 (comment)
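Judging by the discussion above, the workaround amounts to forcing the xFormers attention backend; one way to apply it from Python (placeholder model):

```python
import os

# Must be set before vLLM selects an attention backend.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM  # import after setting the variable

llm = LLM(model="facebook/opt-125m")  # placeholder model
```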
Related Issues (20)
- [Bug]: Internal Server Error when hosting Salesforce/SFR-Embedding-Mistral
- [Bug]: TRACKING ISSUE: CUDA OOM with Logprobs
- [Feature]: Support for google/gemma-2-9b-it / gemma-2-27b-it
- [Bug]: FP8 checkpoints with fused linear modules fail to load scales correctly
- [Bug]: Model "talking to itself" and ignoring `<|im_end|>`
- [New Model]: Florence-2
- Virtual Office Hours: July 9 and July 25
- [Bug]: Illegal memory access for MoE kernel with large workloads
- [Bug]: "worker_use_ray" not working anymore in the latest version
- [Usage]: Can I save the log to a file?
- [Usage]: Is multi-node, multi-CPU inference possible?
- [Feature]: Way to use the LLM's last hidden state embedding vector
- [Bug]: Can't support Phi-3-medium-* models with more than 2 GPUs
- [Bug]: Chunked prefill vs. non-chunked output is different for a long prompt
- Gemma2 models from google
- [Bug]: qwen1.5-32b-chat no response
- [Feature]: `/info` endpoint for OpenAI-compatible API Server
- [Bug]: vLLM crash when running Phi-3-small-8k-instruct with enable-chunked-prefill
- [Bug]: Phi-3 vision crash: TypeError: only integer tensors of a single element can be converted to an index
- [Bug]: New bug in last few days for phi-3-vision. The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (50944)