Comments (15)

ameza13 avatar ameza13 commented on September 22, 2024 4

Hello @atineoSE, I installed vllm 0.5.0.post1 via pip: pip install vllm

It also installs the vllm-flash-attn package. However, when I run my script, I still get this message:

INFO 06-21 01:37:43 selector.py:150] Cannot use FlashAttention-2 backend due to sliding window.
INFO 06-21 01:37:43 selector.py:51] Using XFormers backend.

Should I do something different in my code to use FlashAttention-2? What does this message mean?

With vllm 0.4.2, FlashAttention-2 was working.

atineoSE avatar atineoSE commented on September 22, 2024 2

Since #4686, we can use vllm-flash-attn instead of flash-attn.

This is not yet available in the latest release, v0.4.2, but you can build a new vLLM wheel from source; here is how I did it.

git clone git@github.com:vllm-project/vllm.git
cd vllm
sudo docker build --target build -t vllm_build .
container_id=$(sudo docker create --name vllm_temp vllm_build:latest)
sudo docker cp ${container_id}:/workspace/dist .

This builds the image up to the build stage, which contains the wheel for vllm in the /workspace/dist directory. We can then extract it with docker cp.

Then install with:

pip install vllm-flash-attn
pip install dist/vllm-0.4.2+cu124-cp310-cp310-linux_x86_64.whl

Now you can run vllm and get:

Using FlashAttention-2 backend.

simonwei97 avatar simonwei97 commented on September 22, 2024 1

I have the same problem on Linux (CentOS 7).

My Env

torch                             2.3.0
xformers                          0.0.26.post1
vllm                              0.4.2
vllm-flash-attn                   2.5.8.post2
vllm_nccl_cu12                    2.18.1.0.4.0

CUDA

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:26:00.0 Off |                    0 |
| N/A   25C    P0    56W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |

Problem:

INFO 05-24 16:04:56 selector.py:81] Cannot use FlashAttention-2 backend because the flash_attn package is not found. Please install it for better performance.
INFO 05-24 16:04:56 selector.py:32] Using XFormers backend.

maxin9966 avatar maxin9966 commented on September 22, 2024

4070 ti super
ubuntu 22

bbeijy avatar bbeijy commented on September 22, 2024

I also ran into this problem.

maxin9966 avatar maxin9966 commented on September 22, 2024

@atineoSE Thank you very much. By the way, does vllm-flash-attn support Turing-architecture GPUs like the 2080 Ti?

mces89 avatar mces89 commented on September 22, 2024

@atineoSE can you share the wheel somewhere? I cannot compile it using this Docker setup. Thanks.

atineoSE avatar atineoSE commented on September 22, 2024

@mces89 you have to compile for your architecture, so it's not universal. You can use the steps above.

Alternatively, you can:

  • use the Docker version of the current release, v0.4.2, as explained here (support for flash-attn-2 is built in); see the example run below
  • wait until the next version is released on pip, as explained here (support for vllm-flash-attn will be available)
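
For the Docker route, something like this should work (a rough sketch; the model name and port are just examples, and the tag follows the vllm/vllm-openai images published on Docker Hub):

sudo docker run --gpus all -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:v0.4.2 \
    --model mistralai/Mistral-7B-Instruct-v0.2

If the kernel is picked up, the startup log should show "Using FlashAttention-2 backend."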

dymil avatar dymil commented on September 22, 2024

There's no absolute need to go through Docker. I just followed the instructions in the README to build from source and ran
pip install vllm@git+https://github.com/vllm-project/vllm
That seemed to get me further (now I'm dealing with an unrelated error, so I can't confirm everything's peachy).

atineoSE avatar atineoSE commented on September 22, 2024

@ameza13 this is a new issue and not what the OP mentioned. I have encountered it when running vLLM with microsoft/Phi-3-medium-4k-instruct.

Indeed, it looks like the FlashAttention-2 backend does not support sliding-window attention, so such a model has to fall back to another backend (XFormers in this case). The model works just fine, though I'm not sure whether this implies a performance penalty.
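
A quick way to check which backend the selector picks for a given model is to instantiate it and grep the logs, roughly like this (a sketch; the model name is the one from this thread, and loading it needs a GPU with enough memory):

python -c "from vllm import LLM; LLM(model='microsoft/Phi-3-medium-4k-instruct', trust_remote_code=True)" 2>&1 | grep -i backend

For this model it should print the "Cannot use FlashAttention-2 backend due to sliding window" line followed by "Using XFormers backend." If I remember correctly there is also a VLLM_ATTENTION_BACKEND environment variable to force a backend, but the limitation here is in the kernel itself, so forcing FLASH_ATTN is unlikely to help.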

plp38 avatar plp38 commented on September 22, 2024

Same for me.

My env:

Driver Version: 555.42.06 
CUDA Version: 12.1
python3.12.4
vllm 0.5.1.post1
flash_attn 2.5.9.post1
torch 2.3.1

ch9hn avatar ch9hn commented on September 22, 2024

I am confused too: we can't use the CUDA 12.x vllm_flash_attn due to driver restrictions, and vLLM is complaining:
Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. pip install vllm-flash-attn for better performance.

wikithink avatar wikithink commented on September 22, 2024

Ubuntu 20.04 + Python 3.10.14 + CUDA 11.8 + cuDNN 8.9.6 + A100

vllm==0.5.4
torch==2.4.0+cu118
transformers==4.44.0
flash-attn == 2.6.1
vllm-flash-attn == 2.6.1
Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. pip install vllm-flash-attn for better performance.
INFO 08-07 15:31:39 selector.py:54] Using XFormers backend.
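
As far as I can tell the selector message just means the import failed, so a direct import in the same environment should show the real error (just a sanity check, nothing vLLM-specific):

python -c "import vllm_flash_attn; print(vllm_flash_attn.__file__)"
pip show vllm-flash-attn

If the import raises (for example because the wheel was built against a different torch/CUDA combination than torch 2.4.0+cu118), that would explain why vLLM reports the package as not found.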

lavender010101 avatar lavender010101 commented on September 22, 2024

Same for me.

python==3.11.0
torch==2.3.0+cu118
torchvision==0.18.0+cu118
flash-attn==2.6.3
vllm-flash-attn==2.5.9

ch9hn avatar ch9hn commented on September 22, 2024

Hello,
we had the same issue and just used the prebuilt wheels from the vllm-project flash-attention fork, which worked without issues.
Link:
https://github.com/vllm-project/flash-attention/releases/tag/v2.6.1

Snippet from Dockerfile:

...
# Install Flash Attention
RUN pip install https://github.com/vllm-project/flash-attention/releases/download/v${FLASH_ATTN_VERSION}/vllm_flash_attn-${FLASH_ATTN_VERSION}+cu${CUDA_VERSION_SHORT}-cp${PYTHON_VERSION_SHORT}-cp${PYTHON_VERSION_SHORT}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu${CUDA_VERSION_SHORT} \
 && rm -rf /root/.cache/pip \
 && python3 -m pip cache purge  \
 && rm -rf /tmp/*
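
The ${...} values are defined earlier in the Dockerfile; if they are plain ARGs, the build can be parameterized roughly like this (the version numbers here are just examples that match the wheel naming pattern above):

docker build \
    --build-arg FLASH_ATTN_VERSION=2.6.1 \
    --build-arg CUDA_VERSION_SHORT=121 \
    --build-arg PYTHON_VERSION_SHORT=310 \
    -t vllm-flash-attn-image .

With those values the pip install URL resolves to vllm_flash_attn-2.6.1+cu121-cp310-cp310-manylinux1_x86_64.whl and the cu121 PyTorch extra index; pick whatever combination actually exists on the release page.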
 
