vllm-project / vllm Goto Github PK

View Code? Open in Web Editor NEW

25.2K 221.0 3.6K 24.82 MB

A high-throughput and memory-efficient inference and serving engine for LLMs

Home Page: https://docs.vllm.ai

License: Apache License 2.0

Python 81.91% C++ 2.83% Cuda 12.78% C 1.00% Shell 1.01% Dockerfile 0.12% CMake 0.34%

gpt llm pytorch llmops mlops model-serving transformer llm-serving inference llama

vllm's Introduction

Easy, fast, and cheap LLM serving for everyone

vLLM & NVIDIA Triton User Meetup (Monday, September 9, 5pm-9pm PT) at Fort Mason, San Francisco

We are excited to announce our sixth vLLM Meetup, in collaboration with NVIDIA Triton Team. Join us to hear the vLLM's recent update about performance. Register now here and be part of the event!

Latest News 🔥

[2024/07] We hosted the fifth vLLM meetup with AWS! Please find the meetup slides here.
[2024/07] In partnership with Meta, vLLM officially supports Llama 3.1 with FP8 quantization and pipeline parallelism! Please check out our blog post here.
[2024/06] We hosted the fourth vLLM meetup with Cloudflare and BentoML! Please find the meetup slides here.
[2024/04] We hosted the third vLLM meetup with Roblox! Please find the meetup slides here.
[2024/01] We hosted the second vLLM meetup with IBM! Please find the meetup slides here.
[2023/10] We hosted the first vLLM meetup with a16z! Please find the meetup slides here.
[2023/08] We would like to express our sincere gratitude to Andreessen Horowitz (a16z) for providing a generous grant to support the open-source development and research of vLLM.
[2023/06] We officially released vLLM! FastChat-vLLM integration has powered LMSYS Vicuna and Chatbot Arena since mid-April. Check out our blog post.

About

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Fast model execution with CUDA/HIP graph
Quantizations: GPTQ, AWQ, INT4, INT8, and FP8.
Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
Speculative decoding
Chunked prefill

Performance benchmark: We include a performance benchmark that compares the performance of vLLM against other LLM serving engines (TensorRT-LLM, text-generation-inference and lmdeploy).

vLLM is flexible and easy to use with:

Seamless integration with popular Hugging Face models
High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
Tensor parallelism and pipeline parallelism support for distributed inference
Streaming outputs
OpenAI-compatible API server
Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Neuron.
Prefix caching support
Multi-lora support

vLLM seamlessly supports most popular open-source models on HuggingFace, including:

Transformer-like LLMs (e.g., Llama)
Mixture-of-Expert LLMs (e.g., Mixtral)
Embedding Models (e.g. E5-Mistral)
Multi-modal LLMs (e.g., LLaVA)

Find the full list of supported models here.

Getting Started

Install vLLM with pip or from source:

pip install vllm

Visit our documentation to learn more.

Contributing

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.

Citation

If you use vLLM for your research, please cite our paper:

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

vllm's People

Contributors

Stargazers

Watchers

Forkers

hanzz2007 prnake tomchapin mwarduni kokizzu mrcodechef worthmining touristshaun andreslavescu tuanbc evdcush hbcbh1999 ricklentz techthiyanes mslavescu allthingsllm liuxiaoxuanpku machinelearningsystem dotpyu jjhw woyang muximuxi mard1no brightxiaohan oo0-0-0oo jiancheng-ai yzs-lab sanbuphy onehr machinelearning-system goswamig liujuncn jmaigc mz0in librty ai-jie01 hubin858130 qqr1 felixzhang7 craii rock-you gaowudao qinyuanwu0710 chaojigang001 shenbozeng philhhhhhe hypersniper05 namelin2 troph-team zmvictor jxzhangjhu logp decentralised-ai llyx novellll adambear rioncarter developer-chat scv119 stanleyjacob michaelfeil stjordanis andy-yang-1 knowingnothing richardsonjf yufenglee shadowkun 0xlienid dosier sleepcoo jiankunw jenifferchristo tic-top justina582 haluha132 nounchik tyydonkorleone sandovani hadryan hqqccgcg norman199887 pennyre 42424224 youzizi98 brewswang liming32 mingliu0 akankushjnvku zhjang3 mingyka mingpai3 alitoom001 thatvarsitylove kp-forks nddcpn yuwingm ragnarokk23 gl0r1ous carld2023 wotundefi

vllm's Issues

Decrease the default size of swap space

The current default swap space size (20 GiB per GPU) is a bit too large. It can lead to OOM especially for the machine with multiple GPUs.

Add a baseline with dynamic growing KV cache size for the paper

Add performance comparison figures on A100, V100, T4

Build failure due to CUDA version mismatch

I failed to build the system with the latest NVIDIA PyTorch docker image. The reason is PyTorch installed by pip is built with CUDA 11.7 while the container uses CUDA 12.1.

RuntimeError:
The detected CUDA version (12.1) mismatches the version that was used to compile
PyTorch (11.7). Please make sure to use the same CUDA versions.

Add dependencies in setup.py

Add the complete list of dependencies to setup.py so that the package can be pip-installed at once.

Support various sampling parameters

The parameters such as repetition_penalty and top_k are often used for sampling. It'd be nice to support them using the HuggingFace logit processors.

Publish wheels with pre-built CUDA binaries

Currently, pip installing our package takes 5-10 minutes because our CUDA kernels are compiled on the user machine. For better UX, we should include pre-built CUDA binaries in our PyPI distribution, just like PyTorch and xformers.

Profile memory usage

Use pytest for unit tests

A critical bug in attention kernel after refactoring

It seems there's a critical bug introduced by #53
Running the single_query_cached_kv_attention kernel with certain configurations leads to CUDA illegal memory access errors. I found the bug in the unit tests.

Use O3 optimization instead of O2 for CUDA compilation?

We are currently using the -O2 flag in compiling our CUDA kernels. We need to investigate whether/how changing it to -O3 affects the system performance and compilation time.

Add documents on how to add new models

Support BLOOM

BLOOM is an open-source LLM developed by BigScience. The BLOOM models have achieved high rankings in HuggingFace downloads. It'd be great to have these models in our catalog.

Add support for Stable-LM and OpenAssistant

The two models are popularly used. As we support LLaMA, it'll not be difficult to support these models.

Implement a system logger to print system status and warnings

We need a logger class that can print the system status, warnings, and debugging information.

Support FP32

          Yes, it does. It is our attention kernel that does not support FP32. More precisely, our attention kernel currently does not support some block sizes when FP32 is used. I will fix this in the future.

Originally posted by @WoosukKwon in #70 (comment)

Clean up Megatron-LM code

Make sure the system can run on T4 and V100

Support GPT-2

GPT-2 is a representative of Transformer-based generative models and is still the most downloaded model in huggingface. It'd be nice to support the model.

Support custom tokenizer

We should provide a clean abstraction and interface so that users can use their custom tokenizer very easily.

Implement client API

Debug the optimal upper-bound performance for swapping (0-cost swapping).

Rerun the experiment comparing 0-cost swapping and recomputation. Recomputation should not be faster in any case. If recomputation is consistently faster, we should debug into this.

Tokenizer overhead is significant when use_fast=False

After #114 , the server decodes the running sequences every step. This leads to significant overhead, especially when the slow tokenizer is used (e.g., LLaMA).

# opt-13b inference latency (bs 8, input 32, output 128)
Avg latency: 3.57 seconds
Tokenizer (fast): 0.14 seconds

# llama-13b inference latency (bs 8, input 32, output 128)
Avg latency: 5.28 seconds
Tokenizer (slow): 1.97 seconds

Tensor Parallel profiling result

Will update the profiling results in this PR.

BS=8, input_len=32, output_len=128

OPT-13B
TP 1: 3.5404738585154214 seconds
TP 2: 4.742188215255737 seconds
TP 4: 4.907034238179524 seconds

OPT-30B
TP 1: OOM
TP 2: 5.9848620891571045 seconds
TP 4: 5.943212985992432 seconds

Bug in LLaMA fast tokenizer

In my environment, using the LLaMA fast tokenizer raises an error about protobuf:

  File "/opt/conda/envs/dev/lib/python3.9/site-packages/transformers/convert_slow_tokenizer.py", line 445, in __init__
    from .utils import sentencepiece_model_pb2 as model_pb2
  File "/opt/conda/envs/dev/lib/python3.9/site-packages/transformers/utils/sentencepiece_model_pb2.py", line 91, in <module>
    _descriptor.EnumValueDescriptor(
  File "/opt/conda/envs/dev/lib/python3.9/site-packages/google/protobuf/descriptor.py", line 796, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

While downgrading the protobuf version removed the error, it slowed down the initialization time by ~8x.

Initialization with fast tokenizer & protobuf==3.20.3

real    4m18.476s
user    3m52.706s
sys     0m27.644s

Initialization with slow tokenizer

real    0m27.620s
user    0m8.011s
sys     0m19.237s

Frontend Improvements

Current implementation of the FastAPI+asyncio+ray combination seems slow
Merge Hao’s throughput profiling code.
Make the frontend looks like OpenAI’s API.

Turn shareGPT data into a standard benchmark

Extract out the lengths of the conversation rounds, and maybe have that data directly available from github.
The current L-shape evaluation with binary search for throughput is hard to run and not scalable. We should find an easier way to benchmark the performance.

Enhance model mapper

The current model mapper is hacky; it uses string matching based on the model name or path. Let's use HF-style model mapper that uses the architecture specified in model config and lazy-loads the target model only.

Support string-based stopping conditions

Check whether the input request is too long

Add code formatting script & Add CI to check code format

Use mypy

Dangerous floating point comparison

I noticed that we use conditions like this to check whether it is greedy sampling
https://github.com/WoosukKwon/cacheflow/blob/189ae231336857bcc4c6f6157bf7868cdf56fb5f/cacheflow/sampling_params.py#L45

However, I guess this will result in several problems

It is not recommended to use == for floating point numbers
A small temperature will result in inf/nan

I typically use something like this https://github.com/lm-sys/FastChat/blob/a94fd259a97128f7f4483ddb760690f467888d84/fastchat/serve/inference.py#L227

@WoosukKwon, @zhuohan123 What do you think? If you are happy, I can change all "==" to "<=".

Modify the current PyTorch model to C++

Expected gain: For 13B models, we should see a 20%-30% latency gain on a single GPU and 2-3x on 4 GPUs. For smaller models, the gain should be even higher.

Having a single iteration's computation being completely C++ should be enough for high performance. In this way, we can keep most complicated scheduling logics in Python, including weight loading.

Potential sources of overheads:

Python v.s. C++.
PyTorch (even in C++) v.s. FasterTransformer.

How to implement a C++ version:

(Fake C++) Torch compiler (torch.jit).
Libtorch, C++ version of PyTorch (easier to implement and extend, but can only solve overhead 1).
Prune out the useful single model code from FasterTransformer to CacheFlow. This solves both overheads but is harder to implement.

Add CD to PyPI

Documentation on running basic python server and FastAPI server

Add an option to disable Ray when using a single GPU

When working with a single GPU, Ray is not useful. Therefore, it would be beneficial to have an option to disable Ray in such scenarios.

Clean up the scheduler code

Currently, the scheduler code includes the code for experimental purposes (e.g., collecting various system stats). The code should be removed or minimized.

Documentation on Installation

Add tests for models

We need tests for the models we support. The tests should ensure that the outputs of our models when using greedy sampling are equivalent to those of HF models.

Fix the rushed out multi-query kernel

Fix the correctness issue in the current FlashAttention-copy-based kernel. Make sure we call the FlashAttention kernel correctly. Evaluate the performance of this kernel.
Reduce the memory usage of the current kernel by limiting the buffer size and calling the kernel multiple times.