
vLLM 0.2.6 Endpoint | Serverless Worker

🚀 | This serverless worker utilizes vLLM behind the scenes and is integrated into RunPod's serverless environment. It supports dynamic auto-scaling using RunPod's built-in autoscaling feature.

Setting up the Serverless Worker

Option 1: Deploy Any Model Using Pre-Built Docker Image

We now offer a pre-built Docker Image for the vLLM Worker that you can configure entirely with Environment Variables when creating the RunPod Serverless Endpoint:

runpod/worker-vllm:dev

Environment Variables

  • Required:

    • MODEL_NAME: Hugging Face Model Repository (e.g., openchat/openchat-3.5-1210).
  • Optional:

    • MODEL_BASE_PATH: Model storage directory (default: /runpod-volume).
    • HF_TOKEN: Hugging Face token for private and gated models (e.g., Llama, Falcon).
    • NUM_GPU_SHARD: Number of GPUs to split the model across (default: 1).
    • QUANTIZATION: AWQ (awq) or SqueezeLLM (squeezellm) quantization.
    • MAX_CONCURRENCY: Max concurrent requests (default: 100).
    • DEFAULT_BATCH_SIZE: Token streaming batch size (default: 10). Batching reduces the number of HTTP calls, making streaming 8-10x faster than unbatched streaming and bringing it in line with non-streaming performance.
    • DISABLE_LOG_STATS: Set to 0 to enable vLLM stats logging or 1 to disable it.
    • DISABLE_LOG_REQUESTS: Set to 0 to enable request logging or 1 to disable it.
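
On RunPod, you set these variables in the Serverless Endpoint's environment variable settings. For a quick local smoke test of how the variables are passed, a minimal sketch (assuming a CUDA-capable host with Docker and the NVIDIA container toolkit; the model name and values are illustrative):

sudo docker run --rm --gpus all \
  -e MODEL_NAME="openchat/openchat-3.5-1210" \
  -e MAX_CONCURRENCY=50 \
  -e DISABLE_LOG_STATS=1 \
  runpod/worker-vllm:dev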

Option 2: Build Docker Image with Model Inside

To build an image with the model baked in, you must specify the following Docker build arguments when building the image:

Arguments:

  • Required
    • MODEL_NAME
  • Optional
    • MODEL_BASE_PATH: Defaults to /runpod-volume for network storage. Use /models for local container storage.
    • QUANTIZATION
    • HF_TOKEN
    • WORKER_CUDA_VERSION: 11.8 or 12.1 (default: 11.8, since a small number of workers do not yet support CUDA 12.1; 12.1 is recommended for optimal performance).

Example: Building an image with OpenChat-3.5

sudo docker build -t username/image:tag \
  --build-arg MODEL_NAME="openchat/openchat_3.5" \
  --build-arg MODEL_BASE_PATH="/models" .
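
A fuller variant using the optional build arguments might look like the sketch below. The tag, token placeholder, and quantized model are illustrative; only pass QUANTIZATION for a model that actually ships the matching quantized weights (here an AWQ checkpoint):

sudo docker build -t username/image:tag \
  --build-arg MODEL_NAME="TheBloke/openchat_3.5-AWQ" \
  --build-arg MODEL_BASE_PATH="/models" \
  --build-arg QUANTIZATION="awq" \
  --build-arg HF_TOKEN="<your_hf_token>" \
  --build-arg WORKER_CUDA_VERSION="12.1" .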

Compatible Models

  • LLaMA & LLaMA-2 (meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
  • Mistral (mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc.)
  • Mixtral (mistralai/Mixtral-8x7B-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1, etc.)
  • Aquila & Aquila2 (BAAI/AquilaChat2-7B, BAAI/AquilaChat2-34B, BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc.)
  • Baichuan & Baichuan2 (baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B, etc.)
  • BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
  • ChatGLM (THUDM/chatglm2-6b, THUDM/chatglm3-6b, etc.)
  • Falcon (tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc.)
  • GPT-2 (gpt2, gpt2-xl, etc.)
  • GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
  • GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
  • GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
  • InternLM (internlm/internlm-7b, internlm/internlm-chat-7b, etc.)
  • MPT (mosaicml/mpt-7b, mosaicml/mpt-30b, etc.)
  • OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
  • Phi (microsoft/phi-1_5, microsoft/phi-2, etc.)
  • Qwen (Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc.)
  • Yi (01-ai/Yi-6B, 01-ai/Yi-34B, etc.)

And any other models supported by vLLM 0.2.6.

Ensure that you have Docker installed and properly set up before running the docker build commands. Once built, you can deploy this serverless worker in your desired environment with confidence that it will automatically scale based on demand. For further inquiries or assistance, feel free to contact our support team.

Model Inputs

You may either use a prompt or a list of messages as input. If you use messages, the model's chat template will be applied to the messages automatically, so the model must have one. If you use prompt, you may optionally apply the model's chat template to the prompt by setting apply_chat_template to true.

  • prompt (str): Prompt string to generate text from.
  • messages (list[dict[str, str]]): List of messages, to which the model's chat template is applied automatically. Overrides prompt.
  • apply_chat_template (bool, default: False): Whether to apply the model's chat template to the prompt.
  • sampling_params (dict, default: {}): Sampling parameters that control generation, such as temperature, top_p, etc.
  • stream (bool, default: False): Whether to enable streaming of output. If True, responses are streamed as they are generated.
  • batch_size (int, default: DEFAULT_BATCH_SIZE): Number of responses to generate in one batch. Only applicable when stream is True.
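
For illustration, a request body sketch combining these inputs (RunPod serverless wraps the request payload in an "input" object; the prompt and parameter values are placeholders):

{
  "input": {
    "prompt": "Tell me a joke.",
    "apply_chat_template": true,
    "sampling_params": {
      "temperature": 0.7,
      "max_tokens": 100
    },
    "stream": false
  }
}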

Messages Format

Your list can contain any number of messages, and each message can have any role from the following list:

  • user
  • assistant
  • system

The model's chat template will be applied to the messages automatically.

Example:

[
  {
    "role": "system",
    "content": "..."
  },
  {
    "role": "user",
    "content": "..."
  },
  {
    "role": "assistant",
    "content": "..."
  }
]
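
As a sketch, such a message list could be sent to a deployed endpoint through RunPod's HTTP API, where runsync is the synchronous invocation route (the endpoint ID, API key, and message content are placeholders):

curl -X POST "https://api.runpod.ai/v2/<endpoint_id>/runsync" \
  -H "Authorization: Bearer <your_runpod_api_key>" \
  -H "Content-Type: application/json" \
  -d '{"input": {"messages": [{"role": "user", "content": "Hello!"}], "sampling_params": {"max_tokens": 50}}}'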

Sampling Parameters

  • best_of (Optional[int], default: None): Number of output sequences generated from the prompt. The top n sequences are returned from these best_of sequences. Must be ≥ n. Treated as beam width in beam search. Defaults to n.
  • presence_penalty (float, default: 0.0): Penalizes new tokens based on their presence in the generated text so far. Values > 0 encourage new tokens; values < 0 encourage repetition.
  • frequency_penalty (float, default: 0.0): Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage new tokens; values < 0 encourage repetition.
  • repetition_penalty (float, default: 1.0): Penalizes new tokens based on their appearance in the prompt and generated text. Values > 1 encourage new tokens; values < 1 encourage repetition.
  • temperature (float, default: 1.0): Controls the randomness of sampling. Lower values make it more deterministic; higher values make it more random. Zero means greedy sampling.
  • top_p (float, default: 1.0): Controls the cumulative probability of top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
  • top_k (int, default: -1): Controls the number of top tokens to consider. Set to -1 to consider all tokens.
  • min_p (float, default: 0.0): Minimum probability for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable.
  • use_beam_search (bool, default: False): Whether to use beam search instead of sampling.
  • length_penalty (float, default: 1.0): Penalizes sequences based on their length. Used in beam search.
  • early_stopping (Union[bool, str], default: False): Controls the stopping condition in beam search. Can be True, False, or "never".
  • stop (Union[None, str, List[str]], default: None): List of strings that stop generation when produced. The output will not contain these strings.
  • stop_token_ids (Optional[List[int]], default: None): List of token IDs that stop generation when produced. The output contains these tokens unless they are special tokens.
  • ignore_eos (bool, default: False): Whether to ignore the end-of-sequence token and continue generating tokens after it is produced.
  • max_tokens (int, default: 16): Maximum number of tokens to generate per output sequence.
  • skip_special_tokens (bool, default: True): Whether to skip special tokens in the output.
  • spaces_between_special_tokens (bool, default: True): Whether to add spaces between special tokens in the output.
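
For example, a sampling_params object tuned for fairly deterministic, bounded output might look like this (all values are illustrative, and the stop string depends on your model's tokenizer conventions):

{
  "temperature": 0.2,
  "top_p": 0.9,
  "repetition_penalty": 1.1,
  "max_tokens": 256,
  "stop": ["</s>"]
}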
