vllm-project / vllm Goto Github PK

View Code? Open in Web Editor NEW

21.0K 197.0 2.9K 13.96 MB

A high-throughput and memory-efficient inference and serving engine for LLMs

Home Page: https://docs.vllm.ai

License: Apache License 2.0

Python 81.17% C++ 3.30% Cuda 14.25% C 0.10% Shell 0.43% Dockerfile 0.10% Jinja 0.11% CMake 0.54%

gpt llm pytorch llmops mlops model-serving transformer llm-serving inference llama

vllm's Introduction

Easy, fast, and cheap LLM serving for everyone

Ray Summit CPF is Open (June 4th to June 20th)!

There will be a track for vLLM at the Ray Summit (09/30-10/02, SF) this year! If you have cool projects related to vLLM or LLM inference, we would love to see your proposals. This will be a great chance for everyone in the community to get together and learn. Please submit your proposal here

The Fourth vLLM Bay Area Meetup (June 11th 5:30pm-8pm PT)

We are thrilled to announce our fourth vLLM Meetup! The vLLM team will share recent updates and roadmap. We will also have vLLM collaborators from BentoML and Cloudflare coming up to the stage to discuss their experience in deploying LLMs with vLLM. Please register here and join us!

Latest News 🔥

[2024/04] We hosted the third vLLM meetup with Roblox! Please find the meetup slides here.
[2024/01] We hosted the second vLLM meetup in SF! Please find the meetup slides here.
[2024/01] Added ROCm 6.0 support to vLLM.
[2023/12] Added ROCm 5.7 support to vLLM.
[2023/10] We hosted the first vLLM meetup in SF! Please find the meetup slides here.
[2023/09] We created our Discord server! Join us to discuss vLLM and LLM serving! We will also post the latest announcements and updates there.
[2023/09] We released our PagedAttention paper on arXiv!
[2023/08] We would like to express our sincere gratitude to Andreessen Horowitz (a16z) for providing a generous grant to support the open-source development and research of vLLM.
[2023/07] Added support for LLaMA-2! You can run and serve 7B/13B/70B LLaMA-2s on vLLM with a single command!
[2023/06] Serving vLLM On any Cloud with SkyPilot. Check out a 1-click example to start the vLLM demo, and the blog post for the story behind vLLM development on the clouds.
[2023/06] We officially released vLLM! FastChat-vLLM integration has powered LMSYS Vicuna and Chatbot Arena since mid-April. Check out our blog post.

About

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Fast model execution with CUDA/HIP graph
Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
Optimized CUDA kernels

vLLM is flexible and easy to use with:

Seamless integration with popular Hugging Face models
High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
Tensor parallelism support for distributed inference
Streaming outputs
OpenAI-compatible API server
Support NVIDIA GPUs and AMD GPUs
(Experimental) Prefix caching support
(Experimental) Multi-lora support

vLLM seamlessly supports most popular open-source models on HuggingFace, including:

Transformer-like LLMs (e.g., Llama)
Mixture-of-Expert LLMs (e.g., Mixtral)
Multi-modal LLMs (e.g., LLaVA)

Find the full list of supported models here.

Getting Started

Install vLLM with pip or from source:

pip install vllm

Visit our documentation to learn more.

Contributing

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.

Citation

If you use vLLM for your research, please cite our paper:

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

vllm's People

Contributors

Stargazers

Watchers

Forkers

hanzz2007 prnake tomchapin mwarduni kokizzu mrcodechef worthmining touristshaun andreslavescu tuanbc evdcush hbcbh1999 ricklentz techthiyanes mslavescu allthingsllm liuxiaoxuanpku machinelearningsystem dotpyu jjhw woyang muximuxi mard1no brightxiaohan oo0-0-0oo jiancheng-ai yzs-lab sanbuphy onehr machinelearning-system goswamig liujuncn jmaigc mz0in librty ai-jie01 hubin858130 qqr1 felixzhang7 craii rock-you gaowudao qinyuanwu0710 chaojigang001 shenbozeng philhhhhhe hypersniper05 namelin2 troph-team zmvictor jxzhangjhu logp decentralised-ai llyx novellll adambear rioncarter developer-chat scv119 stanleyjacob michaelfeil stjordanis andy-yang-1 knowingnothing richardsonjf yufenglee shadowkun 0xlienid dosier hanwsf sleepcoo jiankunw jenifferchristo tic-top justina582 haluha132 nounchik tyydonkorleone sandovani hadryan hqqccgcg norman199887 pennyre 42424224 youzizi98 brewswang liming32 mingliu0 akankushjnvku zhjang3 mingyka mingpai3 alitoom001 thatvarsitylove kp-forks nddcpn yuwingm ragnarokk23 gl0r1ous carld2023

vllm's Issues

Add docstring

Add code formatting script & Add CI to check code format

Use mypy

Clean up Megatron-LM code

Add an option to disable Ray when using a single GPU

When working with a single GPU, Ray is not useful. Therefore, it would be beneficial to have an option to disable Ray in such scenarios.

Add CD to PyPI

Add a baseline with dynamic growing KV cache size for the paper

Support FP32

          Yes, it does. It is our attention kernel that does not support FP32. More precisely, our attention kernel currently does not support some block sizes when FP32 is used. I will fix this in the future.

Originally posted by @WoosukKwon in #70 (comment)

Profile memory usage

Documentation on Installation

Support custom models

We need to provide clean abstractions and interfaces so that users can easily plug in their custom models.

Support various sampling parameters

The parameters such as repetition_penalty and top_k are often used for sampling. It'd be nice to support them using the HuggingFace logit processors.

Add tests for models

We need tests for the models we support. The tests should ensure that the outputs of our models when using greedy sampling are equivalent to those of HF models.

Implement custom kernels for top-k and top-p sampling

As mentioned in #81 (comment), the current PyTorch-based top-k and top-p implementation is memory-inefficient. This can be improved by introducing custom kernels.

Support GPT-2

GPT-2 is a representative of Transformer-based generative models and is still the most downloaded model in huggingface. It'd be nice to support the model.

Decrease the default size of swap space

The current default swap space size (20 GiB per GPU) is a bit too large. It can lead to OOM especially for the machine with multiple GPUs.

Fix the rushed out multi-query kernel

Fix the correctness issue in the current FlashAttention-copy-based kernel. Make sure we call the FlashAttention kernel correctly. Evaluate the performance of this kernel.
Reduce the memory usage of the current kernel by limiting the buffer size and calling the kernel multiple times.

Check whether the input request is too long

Add performance comparison figures on A100, V100, T4

Use pytest for unit tests

Write README

Support string-based stopping conditions

Frontend Improvements

Current implementation of the FastAPI+asyncio+ray combination seems slow
Merge Hao’s throughput profiling code.
Make the frontend looks like OpenAI’s API.

Enhance model mapper

The current model mapper is hacky; it uses string matching based on the model name or path. Let's use HF-style model mapper that uses the architecture specified in model config and lazy-loads the target model only.

Implement client API

Do not initialize process group when using a single GPU

Currently we call torch.distributed.init_process_group even for a single GPU. This is redundant and causes errors when the LLM object is created multiple times.

Add support for Stable-LM and OpenAssistant

The two models are popularly used. As we support LLaMA, it'll not be difficult to support these models.

Documentation on running basic python server and FastAPI server

A critical bug in attention kernel after refactoring

It seems there's a critical bug introduced by #53
Running the single_query_cached_kv_attention kernel with certain configurations leads to CUDA illegal memory access errors. I found the bug in the unit tests.

Add documents on how to add new models

Build failure due to CUDA version mismatch

I failed to build the system with the latest NVIDIA PyTorch docker image. The reason is PyTorch installed by pip is built with CUDA 11.7 while the container uses CUDA 12.1.

RuntimeError:
The detected CUDA version (12.1) mismatches the version that was used to compile
PyTorch (11.7). Please make sure to use the same CUDA versions.

Publish wheels with pre-built CUDA binaries

Currently, pip installing our package takes 5-10 minutes because our CUDA kernels are compiled on the user machine. For better UX, we should include pre-built CUDA binaries in our PyPI distribution, just like PyTorch and xformers.

Support BLOOM

BLOOM is an open-source LLM developed by BigScience. The BLOOM models have achieved high rankings in HuggingFace downloads. It'd be great to have these models in our catalog.

Implement a system logger to print system status and warnings

We need a logger class that can print the system status, warnings, and debugging information.

Use O3 optimization instead of O2 for CUDA compilation?

We are currently using the -O2 flag in compiling our CUDA kernels. We need to investigate whether/how changing it to -O3 affects the system performance and compilation time.

Modify the current PyTorch model to C++

Expected gain: For 13B models, we should see a 20%-30% latency gain on a single GPU and 2-3x on 4 GPUs. For smaller models, the gain should be even higher.

Having a single iteration's computation being completely C++ should be enough for high performance. In this way, we can keep most complicated scheduling logics in Python, including weight loading.

Potential sources of overheads:

Python v.s. C++.
PyTorch (even in C++) v.s. FasterTransformer.

How to implement a C++ version:

(Fake C++) Torch compiler (torch.jit).
Libtorch, C++ version of PyTorch (easier to implement and extend, but can only solve overhead 1).
Prune out the useful single model code from FasterTransformer to CacheFlow. This solves both overheads but is harder to implement.

Tensor Parallel profiling result

Will update the profiling results in this PR.

BS=8, input_len=32, output_len=128

OPT-13B
TP 1: 3.5404738585154214 seconds
TP 2: 4.742188215255737 seconds
TP 4: 4.907034238179524 seconds

OPT-30B
TP 1: OOM
TP 2: 5.9848620891571045 seconds
TP 4: 5.943212985992432 seconds

Debug the optimal upper-bound performance for swapping (0-cost swapping).

Rerun the experiment comparing 0-cost swapping and recomputation. Recomputation should not be faster in any case. If recomputation is consistently faster, we should debug into this.

Turn shareGPT data into a standard benchmark

Extract out the lengths of the conversation rounds, and maybe have that data directly available from github.
The current L-shape evaluation with binary search for throughput is hard to run and not scalable. We should find an easier way to benchmark the performance.

Bug in LLaMA fast tokenizer

In my environment, using the LLaMA fast tokenizer raises an error about protobuf:

  File "/opt/conda/envs/dev/lib/python3.9/site-packages/transformers/convert_slow_tokenizer.py", line 445, in __init__
    from .utils import sentencepiece_model_pb2 as model_pb2
  File "/opt/conda/envs/dev/lib/python3.9/site-packages/transformers/utils/sentencepiece_model_pb2.py", line 91, in <module>
    _descriptor.EnumValueDescriptor(
  File "/opt/conda/envs/dev/lib/python3.9/site-packages/google/protobuf/descriptor.py", line 796, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

While downgrading the protobuf version removed the error, it slowed down the initialization time by ~8x.

Initialization with fast tokenizer & protobuf==3.20.3

real    4m18.476s
user    3m52.706s
sys     0m27.644s

Initialization with slow tokenizer

real    0m27.620s
user    0m8.011s
sys     0m19.237s

# opt-13b inference latency (bs 8, input 32, output 128)
Avg latency: 3.57 seconds
Tokenizer (fast): 0.14 seconds

# llama-13b inference latency (bs 8, input 32, output 128)
Avg latency: 5.28 seconds
Tokenizer (slow): 1.97 seconds

Dangerous floating point comparison

I noticed that we use conditions like this to check whether it is greedy sampling
https://github.com/WoosukKwon/cacheflow/blob/189ae231336857bcc4c6f6157bf7868cdf56fb5f/cacheflow/sampling_params.py#L45

However, I guess this will result in several problems

It is not recommended to use == for floating point numbers
A small temperature will result in inf/nan

I typically use something like this https://github.com/lm-sys/FastChat/blob/a94fd259a97128f7f4483ddb760690f467888d84/fastchat/serve/inference.py#L227

@WoosukKwon, @zhuohan123 What do you think? If you are happy, I can change all "==" to "<=".