Comments (3)
Optimization welcomed!
from vllm.
Just installing torch and ray in an empty environment produces a Docker image of ~2 GiB, so I think it's unrealistic to try to cut it down to < 1 GiB. The CUDA libraries and other pre-compiled wheels probably make up more than 50% of the total image size.
from vllm.
If it's 2 GB, that's already better than the 9 GB image published on Docker Hub. Can we find out why it's 9 GB and not 2? Maybe the GitHub Actions build is adding something.
from vllm.
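To see where the size actually goes, `docker history <image>` lists per-layer sizes. Inside the environment itself, a minimal sketch like the one below ranks installed packages by disk usage (the `site_packages` path is whatever your environment uses; nothing here is vLLM-specific):

```python
import os
import sysconfig


def dir_size_bytes(path):
    """Total size of all regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total


def largest_packages(site_packages, top=10):
    """Rank top-level entries in a site-packages directory by disk usage."""
    sizes = []
    for entry in os.scandir(site_packages):
        size = dir_size_bytes(entry.path) if entry.is_dir() else entry.stat().st_size
        sizes.append((entry.name, size))
    sizes.sort(key=lambda t: t[1], reverse=True)
    return sizes[:top]


if __name__ == "__main__":
    sp = sysconfig.get_paths()["purelib"]
    for name, size in largest_packages(sp):
        print(f"{name:40s} {size / 2**20:8.1f} MiB")
```

In a typical torch install, `nvidia_*` wheels and `torch/lib` dominate the listing, which is consistent with the >50% CUDA estimate above.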
Related Issues (20)
- [Bug]: p2p check in custom all reduce not working HOT 8
- [Bug]: Phi-3-small-128k-instruct on 4 T4 GPUs - Memory error: Tried to allocate 1024.00 GiB HOT 3
- [Performance]: vllm 0.5.4 with enable_chunked_prefill =True, throughput is slightly lower than 0.5.3~0.5.0. HOT 6
- [Bug]: Gemma 2 9b errors HOT 5
- [Bug]: Unusual memory usage on H100 with Meta Llama 8B: 72 GB; it should not be around 8x2x1.2 in bfloat16
- [Usage]: Seeing perf regression using chunked_prefill on VLLM 0.5.4 HOT 2
- [Feature]: Enable Prefix caching kernel on Pallas for TPU backend
- [Bug]: ModuleNotFoundError: No module named 'openai.types' HOT 3
- [Bug]: CUDA error: an illegal memory access was encountered when running autofp8 HOT 1
- [Performance]: Block manager v2 has low throughput with prefix caching warmup HOT 3
- [Bug]: After deploying base and LoRA models with vllm server, requests to the LoRA model fail HOT 3
- [Doc]: Has the offline chat inference function been updated? HOT 1
- [Bug]: AttributeError: Model BitsAndBytesModelLoader does not support BitsAndBytes quantization yet HOT 1
- [Bug]: The error is caused by: RuntimeError: out must have shape (total_q, num_heads, head_size_og), leading to the following error: vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already. HOT 1
- [Bug]: Speculative sampling does not exactly maintain the distribution HOT 3
- [Bug]: OpenGVLab/InternVL-Chat-V1-5 never stops properly HOT 8
- [Bug]: assert num_new_tokens > 0 crashes entire worker instead of just failing single API call HOT 1
- [Feature]: Exit on failures HOT 2
- [Misc]: TTFT profiling with respect to prompt length HOT 3