Comments (3)
Optimization welcomed!
from vllm.
Just installing torch and ray in an empty environment produces a Docker image of ~2 GiB, so I think it's unrealistic to try to cut it down to < 1 GiB. The CUDA libraries and other pre-compiled wheels probably make up more than 50% of the total image size.
from vllm.
If it's 2 GB, that's already better than the 9 GB image published on Docker Hub. Can we find out why it's 9 GB and not 2? Maybe the GitHub Actions build is adding something.
from vllm.
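To see where the size actually goes, `docker history <image>` lists per-layer sizes. Inside the environment itself, a minimal sketch like the one below ranks installed packages by disk usage (the `site_packages` path is whatever your environment uses; nothing here is vLLM-specific):

```python
import os
import sysconfig


def dir_size_bytes(path):
    """Total size of all regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total


def largest_packages(site_packages, top=10):
    """Rank top-level entries in a site-packages directory by disk usage."""
    sizes = []
    for entry in os.scandir(site_packages):
        size = dir_size_bytes(entry.path) if entry.is_dir() else entry.stat().st_size
        sizes.append((entry.name, size))
    sizes.sort(key=lambda t: t[1], reverse=True)
    return sizes[:top]


if __name__ == "__main__":
    sp = sysconfig.get_paths()["purelib"]
    for name, size in largest_packages(sp):
        print(f"{name:40s} {size / 2**20:8.1f} MiB")
```

In a typical torch install, `nvidia_*` wheels and `torch/lib` dominate the listing, which is consistent with the >50% CUDA estimate above.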
Related Issues (20)
- [Bug]: p2p check in custom all reduce not working HOT 8
- [Bug]: Phi-3-small-128k-instruct on 4 T4 GPUs - Memory error: Tried to allocate 1024.00 GiB HOT 3
- [Performance]: vllm 0.5.4 with enable_chunked_prefill =True, throughput is slightly lower than 0.5.3~0.5.0. HOT 6
- [Bug]: Gemma 2 9b errors HOT 5
- [Bug]: Unusual memory usage on H100 with Meta Llama 8B: 72 GB; it should not be around 8x2x1.2 in bfloat16
- [Usage]: Seeing perf regression using chunked_prefill on VLLM 0.5.4 HOT 2
- [Feature]: Enable Prefix caching kernel on Pallas for TPU backend
- [Bug]: ModuleNotFoundError: No module named 'openai.types' HOT 3
- [Bug]: CUDA error: an illegal memory access was encountered when running autofp8 HOT 1
- [Performance]: Block manager v2 has low throughput with prefix caching warmup HOT 3
- [Bug]: After deploying base and LoRA models with vllm server, requests to the LoRA model fail HOT 3
- [Doc]: Has the offline chat inference function been updated? HOT 1
- [Bug]: AttributeError: Model BitsAndBytesModelLoader does not support BitsAndBytes quantization yet HOT 1
- [Bug]: The error is caused by: RuntimeError: out must have shape (total_q, num_heads, head_size_og), leading to the following error: vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already. HOT 1
- [Bug]: Speculative sampling does not exactly maintain the distribution HOT 3
- [Bug]: OpenGVLab/InternVL-Chat-V1-5 never stops properly HOT 8
- [Bug]: assert num_new_tokens > 0 crashes entire worker instead of just failing single API call HOT 1
- [Feature]: Exit on failures HOT 2
- [Misc]: TTFT profiling with respect to prompt length HOT 3