Comments (15)
Hello @atineoSE, I installed vllm 0.5.0.post1 via pip: pip install vllm
It also installs the vllm-flash-attn package. However, when I run my script, I still get this message:
INFO 06-21 01:37:43 selector.py:150] Cannot use FlashAttention-2 backend due to sliding window.
INFO 06-21 01:37:43 selector.py:51] Using XFormers backend.
Should I do something different in my code to use FlashAttention-2? What does this message mean?
With vllm 0.4.2 FlashAttention-2 was working.
Since #4686, we can use vllm-flash-attn instead of flash-attn.
This is not yet available in the latest release, v0.4.2, but you can build a new vLLM wheel from source; here is how I did it:
git clone git@github.com:vllm-project/vllm.git
cd vllm
sudo docker build --target build -t vllm_build .
container_id=$(sudo docker create --name vllm_temp vllm_build:latest)
sudo docker cp ${container_id}:/workspace/dist .
This builds the container up to the build stage, which will contain the wheel for vllm in the /workspace/dist directory. We can then extract it with docker cp.
Then install with:
pip install vllm-flash-attn
pip install dist/vllm-0.4.2+cu124-cp310-cp310-linux_x86_64.whl
Now you can run vllm and get:
Using FlashAttention-2 backend.
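For a quick sanity check after installing the wheel, a minimal script along these lines (the model name is just an example, any supported model works) should print that backend line while the engine initializes:
# Minimal check: vLLM logs the chosen attention backend during engine start-up,
# e.g. "Using FlashAttention-2 backend." or "Using XFormers backend."
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # example model; swap in your own
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)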
I have the same problem on Linux (CentOS 7).
My env:
torch 2.3.0
xformers 0.0.26.post1
vllm 0.4.2
vllm-flash-attn 2.5.8.post2
vllm_nccl_cu12 2.18.1.0.4.0
CUDA (nvidia-smi): NVIDIA-SMI 525.85.12, Driver Version 525.85.12, CUDA Version 12.0; GPU 0: NVIDIA A100-SXM (40960MiB), MIG disabled.
Problem:
INFO 05-24 16:04:56 selector.py:81] Cannot use FlashAttention-2 backend because the flash_attn package is not found. Please install it for better performance.
INFO 05-24 16:04:56 selector.py:32] Using XFormers backend.
4070 Ti Super
Ubuntu 22
I ran into the same problem.
@atineoSE Thank you very much. By the way, does vllm-flash-attn support Turing-architecture GPUs like the 2080 Ti?
@atineoSE can you share the wheel somewhere? I cannot compile the wheel using this Docker setup. Thanks.
@mces89 you have to compile for your architecture, so it's not universal. You can use the steps above.
Alternatively, you can:
- use the Docker version of the current release, v0.4.2, as explained here (support for flash-attn-2 is built in)
- wait until the next version is released for the pip version, as explained here (support for vllm-flash-attn will be available)
There's no absolute need to go through Docker. I just looked at the instructions in the README to build from source and ran:
pip install vllm@git+https://github.com/vllm-project/vllm
That seemed to get me further (now I'm dealing with an unrelated error, so I can't confirm everything's peachy).
@ameza13 this is a new issue and not what the OP mentioned. I have encountered this when running vLLM with microsoft/Phi-3-medium-4k-instruct.
Indeed, it looks like the FlashAttention-2 backend does not support sliding-window attention, so such a model has to fall back to another backend (XFormers in this case). The model works just fine, though I'm not sure whether this implies a performance penalty.
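If you want to see or pin the backend choice explicitly, a rough sketch is below; it assumes vLLM's attention selector honors the VLLM_ATTENTION_BACKEND environment variable (true in recent versions, but verify on yours), and forcing FLASH_ATTN for a sliding-window model may simply be rejected rather than work around the limitation.
# Sketch: override the attention backend the selector would otherwise pick.
# The selector still logs its decision, e.g. "Using XFormers backend."
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"  # or "FLASH_ATTN"; set before importing vllm

from vllm import LLM

# Phi-3-medium uses a sliding window, so these versions fall back to XFormers anyway.
llm = LLM(model="microsoft/Phi-3-medium-4k-instruct")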
Same for me.
My env:
Driver Version: 555.42.06
CUDA Version: 12.1
python3.12.4
vllm 0.5.1.post1
flash_attn 2.5.9.post1
torch 2.3.1
I am confused too: we can't use the CUDA 12.x vllm_flash_attn due to driver restrictions, and vllm is complaining:
Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. pip install vllm-flash-attn for better performance.
ubuntu20.04+python3.10.14+cuda11.8+cudnn8.9.6+A100
vllm==0.5.4
torch==2.4.0+cu118
transformers==4.44.0
flash-attn == 2.6.1
vllm-flash-attn == 2.6.1
Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. pip install vllm-flash-attn for better performance.
INFO 08-07 15:31:39 selector.py:54] Using XFormers backend.
Same for me.
python==3.11.0
torch==2.3.0+cu118
torchvision==0.18.0+cu118
flash-attn==2.6.3
vllm-flash-attn==2.5.9
Hello,
we had the same issue and just used the prebuilt wheels from the vllm-project/flash-attention fork, which worked without issues.
Link:
https://github.com/vllm-project/flash-attention/releases/tag/v2.6.1
Snippet from Dockerfile:
...
# Install Flash Attention
RUN pip install https://github.com/vllm-project/flash-attention/releases/download/v${FLASH_ATTN_VERSION}/vllm_flash_attn-${FLASH_ATTN_VERSION}+cu${CUDA_VERSION_SHORT}-cp${PYTHON_VERSION_SHORT}-cp${PYTHON_VERSION_SHORT}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu${CUDA_VERSION_SHORT} \
&& rm -rf /root/.cache/pip \
&& python3 -m pip cache purge \
&& rm -rf /tmp/*