Comments (4)
What type of distributed inference do you plan to do? Is it model parallel or data parallel?
from vllm.
I just want to use online API serving based on an LLM like Qwen1.5-110B-Chat.
My main steps are as follows:
1. I built a Docker image that includes the OFED driver; "ibstat" shows my 200G InfiniBand card.
2. I created a YAML file like:
rayClusterConfig:
  rayVersion: '2.9.0' # should match the Ray version in the image of the containers
  ######################headGroupSpecs#################################
  # Ray head pod template.
  headGroupSpec:
    # The rayStartParams are used to configure the ray start command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of rayStartParams in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in rayStartParams.
    rayStartParams:
      dashboard-host: '0.0.0.0'
    # Pod template
    template:
      spec:
        containers:
          - name: ray-head
            image: repo:5000/harbor/rayvllm:v3
            resources:
              limits:
                nvidia.com/gpu: 8
                cpu: "8"
                memory: "64Gi"
              requests:
                nvidia.com/gpu: 8
                cpu: "8"
                memory: "64Gi"
            volumeMounts:
              - name: share
                mountPath: "/share"
              - name: shm
                mountPath: "/dev/shm"
            ports:
              - containerPort: 6379
                name: gcs-server
              - containerPort: 8265 # Ray dashboard
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
            env:
              - name: USE_RDMA
                value: "true"
        volumes:
          - name: share
            hostPath:
              path: "/share"
              type: Directory
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "64Gi"
  workerGroupSpecs:
    # The pod replicas in this group are typed as workers.
    - replicas: 1
      minReplicas: 1
      maxReplicas: 5
      # Logical group name; here called small-group, can also be functional.
      groupName: small-group
      # The rayStartParams are used to configure the ray start command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of rayStartParams in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in rayStartParams.
      rayStartParams: {}
      # Pod template
      template:
        spec:
          containers:
            - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
              image: repo:5000/harbor/rayvllm:v3
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "ray stop"]
              resources:
                limits:
                  nvidia.com/gpu: 8
                  cpu: "8"
                  memory: "64Gi"
                requests:
                  nvidia.com/gpu: 8
                  cpu: "8"
                  memory: "64Gi"
              volumeMounts:
                - name: share
                  mountPath: "/share"
                - name: shm
                  mountPath: "/dev/shm"
              env:
                - name: USE_RDMA
                  value: "true"
          volumes:
            - name: share
              hostPath:
                path: "/share"
                type: Directory
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "64Gi"
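One thing worth checking: if NCCL falls back to TCP sockets, the InfiniBand link goes unused no matter what USE_RDMA is set to. A possible addition (my assumption, not something the YAML above already does) is to pass NCCL's IB-related environment variables to both head and worker containers. The HCA name (mlx5_0) and interface name (eth0) below are placeholders and must match your own hardware:

```yaml
env:
  - name: USE_RDMA
    value: "true"
  - name: NCCL_IB_DISABLE   # "0" allows NCCL to use InfiniBand verbs
    value: "0"
  - name: NCCL_IB_HCA       # restrict NCCL to the 200G HCA (assumed device name)
    value: "mlx5_0"
  - name: NCCL_SOCKET_IFNAME # interface for NCCL's bootstrap traffic (assumed name)
    value: "eth0"
  - name: NCCL_DEBUG        # logs whether NCCL picks the [IB] or [Socket] transport
    value: "INFO"
```

With NCCL_DEBUG=INFO set, the api_server logs should show lines like "NET/IB" if InfiniBand is actually selected.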
3. I created a head node and a worker node with KubeRay using the image I built, and ran this command on the head node:
python -m vllm.entrypoints.openai.api_server \
    --model /path/Qwen1.5-110B-Chat \
    --tensor-parallel-size 16 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --port 8000 \
    --worker-use-ray
4. I ran a benchmark script like:
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /path/Qwen1.5-110B-Chat \
    --dataset benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json \
    --request-rate 5 \
    --num-prompts 100 \
    --host xxxx \
    --port 8000 \
    --trust-remote-code
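While the benchmark runs, one way to confirm whether traffic actually goes over InfiniBand (rather than TCP on Ethernet) is to watch the HCA port counters in sysfs. A minimal sketch; the counter path assumes an mlx5_0 device on port 1, so adjust it for your hardware:

```python
# Sketch: estimate InfiniBand throughput from sysfs port counters.
# The path below is an assumption (mlx5_0, port 1); adjust for your HCA.
COUNTER = "/sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data"

def read_counter(path=COUNTER):
    """Read the raw receive-data counter for the port."""
    with open(path) as f:
        return int(f.read())

def bandwidth_gbps(before, after, seconds):
    """Convert two counter samples into GB/s.

    port_rcv_data counts 4-byte words, so multiply by 4 for bytes.
    """
    return (after - before) * 4 / seconds / 1e9

# Usage while the benchmark is running, e.g.:
#   b = read_counter(); time.sleep(10); a = read_counter()
#   print(bandwidth_gbps(b, a, 10), "GB/s received over IB")
```

If this number stays near zero while the Ray dashboard still shows GB/s-level network traffic, the traffic is going over the Ethernet interface instead.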
I observed on the Ray cluster's dashboard that read/write throughput reaches up to 1.2 GB/s, but it does not utilize the InfiniBand network bandwidth.
So my plan is to use multiple nodes for distributed inference of large models, serve an OpenAI API server, and use the InfiniBand high-speed network for communication between nodes.
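For context on the 1.2 GB/s figure, here is a back-of-envelope estimate of how much all-reduce traffic tensor parallelism generates per node. The model dimensions and throughput below are illustrative assumptions (check the actual Qwen1.5-110B config), and the two-all-reduces-per-layer accounting is the usual Megatron-style cost model, not something measured from vLLM:

```python
# Rough cost model for tensor-parallel decode traffic.
# All "assumed" numbers are illustrative, not taken from the real config.
hidden_size = 8192     # assumed hidden dimension
num_layers = 80        # assumed number of decoder layers
tp = 16                # --tensor-parallel-size from the launch command
dtype_bytes = 2        # fp16/bf16 activations

def allreduce_bytes_per_token():
    """Bytes each rank moves per generated token.

    Each decoder layer does 2 all-reduces over a (1, hidden_size)
    activation; a ring all-reduce moves ~2*(tp-1)/tp of the payload
    per rank.
    """
    payload = hidden_size * dtype_bytes
    per_layer = 2 * (2 * (tp - 1) / tp) * payload
    return num_layers * per_layer

tokens_per_sec = 1000  # assumed aggregate decode throughput
print(f"~{allreduce_bytes_per_token() * tokens_per_sec / 1e9:.1f} GB/s per rank")
```

Under these assumptions each rank moves only a few GB/s, so a 200G (25 GB/s) link being far from saturated is not by itself proof that InfiniBand is unused; the transport NCCL actually selects still needs to be checked.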
I have similar use cases. I tested it on a DGX cluster and deliberately spread the falcon180b model across multiple nodes (and saw read/write of about 2-3 GB/s per node).
I didn't set USE_RDMA, though.
@xiphl @hetian127 I also have a similar use case.
How did you actually spread the model? Was it with the SPREAD placement strategy described here: https://docs.ray.io/en/latest/ray-core/scheduling/placement-group.html?
And did vLLM automatically shard the model across multiple machines?