Comments (15)
Hello @atineoSE, I installed vllm 0.5.0.post1 via pip: pip install vllm
It also installs the vllm-flash-attn package. However, when I run my script, I still get this message:
INFO 06-21 01:37:43 selector.py:150] Cannot use FlashAttention-2 backend due to sliding window.
INFO 06-21 01:37:43 selector.py:51] Using XFormers backend.
Should I do something different in my code to use FlashAttention-2? What does this message mean?
With vllm 0.4.2 FlashAttention-2 was working.
Since #4686, we can use vllm-flash-attn instead of flash-attn.
This is not yet available in the latest release, v0.4.2, but you can build a new vLLM wheel from source; here is how I did it:
git clone git@github.com:vllm-project/vllm.git
cd vllm
sudo docker build --target build -t vllm_build .
container_id=$(sudo docker create --name vllm_temp vllm_build:latest)
sudo docker cp ${container_id}:/workspace/dist .
This builds the container up to the build stage, which will contain the wheel for vllm in the /workspace/dist directory. We can then extract it with docker cp.
Then install with:
pip install vllm-flash-attn
pip install dist/vllm-0.4.2+cu124-cp310-cp310-linux_x86_64.whl
Now you can run vllm and get:
Using FlashAttention-2 backend.
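For a quick sanity check after installing the wheel, a minimal script along these lines (the model name is just an example, any supported model works) should print that backend line while the engine initializes:
# Minimal check: vLLM logs the chosen attention backend during engine start-up,
# e.g. "Using FlashAttention-2 backend." or "Using XFormers backend."
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # example model; swap in your own
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)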
I have the same problem on Linux (CentOS 7).
My env:
torch 2.3.0
xformers 0.0.26.post1
vllm 0.4.2
vllm-flash-attn 2.5.8.post2
vllm_nccl_cu12 2.18.1.0.4.0
CUDA (nvidia-smi): NVIDIA-SMI 525.85.12, Driver Version 525.85.12, CUDA Version 12.0; GPU 0: NVIDIA A100-SXM (40960MiB), MIG disabled.
Problem:
INFO 05-24 16:04:56 selector.py:81] Cannot use FlashAttention-2 backend because the flash_attn package is not found. Please install it for better performance.
INFO 05-24 16:04:56 selector.py:32] Using XFormers backend.
4070 Ti Super
Ubuntu 22
I ran into the same problem.
@atineoSE Thank you very much. By the way, does vllm-flash-attn support Turing-architecture GPUs like the 2080 Ti?
@atineoSE can you share the wheel somewhere? I cannot compile the wheel using this Docker setup. Thanks.
@mces89 you have to compile for your architecture, so it's not universal. You can use the steps above.
Alternatively, you can:
- use the Docker version of the current release, v0.4.2, as explained here (support for flash-attn-2 is built in)
- wait until the next version is released for the pip version, as explained here (support for vllm-flash-attn will be available)
There's no absolute need to go through Docker. I just looked at the instructions in the README to build from source and ran:
pip install vllm@git+https://github.com/vllm-project/vllm
That seemed to get me further (now I'm dealing with an unrelated error, so I can't confirm everything's peachy).
@ameza13 this is a new issue and not what the OP mentioned. I have encountered this when running vLLM with microsoft/Phi-3-medium-4k-instruct.
Indeed, it looks like the FlashAttention-2 backend does not support sliding-window attention, so such a model has to fall back to another backend (XFormers in this case). The model works just fine, though I'm not sure whether this implies a performance penalty.
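If you want to see or pin the backend choice explicitly, a rough sketch is below; it assumes vLLM's attention selector honors the VLLM_ATTENTION_BACKEND environment variable (true in recent versions, but verify on yours), and forcing FLASH_ATTN for a sliding-window model may simply be rejected rather than work around the limitation.
# Sketch: override the attention backend the selector would otherwise pick.
# The selector still logs its decision, e.g. "Using XFormers backend."
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"  # or "FLASH_ATTN"; set before importing vllm

from vllm import LLM

# Phi-3-medium uses a sliding window, so these versions fall back to XFormers anyway.
llm = LLM(model="microsoft/Phi-3-medium-4k-instruct")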
Same for me.
My env:
Driver Version: 555.42.06
CUDA Version: 12.1
python3.12.4
vllm 0.5.1.post1
flash_attn 2.5.9.post1
torch 2.3.1
I am confused too: we can't use the CUDA 12.x vllm_flash_attn due to driver restrictions, and vllm is complaining:
Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. pip install vllm-flash-attn for better performance.
ubuntu20.04+python3.10.14+cuda11.8+cudnn8.9.6+A100
vllm==0.5.4
torch==2.4.0+cu118
transformers==4.44.0
flash-attn == 2.6.1
vllm-flash-attn == 2.6.1
Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. pip install vllm-flash-attn for better performance.
INFO 08-07 15:31:39 selector.py:54] Using XFormers backend.
Same for me.
python==3.11.0
torch==2.3.0+cu118
torchvision==0.18.0+cu118
flash-attn==2.6.3
vllm-flash-attn==2.5.9
Hello,
we had the same issue and just used the prebuilt wheels from the vllm-project/flash-attention fork, which worked without issues.
Link:
https://github.com/vllm-project/flash-attention/releases/tag/v2.6.1
Snippet from Dockerfile:
...
# Install Flash Attention
RUN pip install https://github.com/vllm-project/flash-attention/releases/download/v${FLASH_ATTN_VERSION}/vllm_flash_attn-${FLASH_ATTN_VERSION}+cu${CUDA_VERSION_SHORT}-cp${PYTHON_VERSION_SHORT}-cp${PYTHON_VERSION_SHORT}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu${CUDA_VERSION_SHORT} \
&& rm -rf /root/.cache/pip \
&& python3 -m pip cache purge \
&& rm -rf /tmp/*