Feature request Pipeline parallelism, with more detailed discussio

[Revised not duplicated] Pipeline Parallelism supporting. about text-generation-inference HOT 1 CLOSED

Healthcliff-Ding commented on May 25, 2024

[Revised not duplicated] Pipeline Parallelism supporting.

from text-generation-inference.

Comments (1)

Narsil commented on May 25, 2024

We're not interested in supporting PP at all.

This project is focused on good LATENCY; we're barely interested in throughput (it comes for free so we take it, but it's really not the focus of this project).
If there's an argument for making PP good for LATENCY we'll take it, but we've never seen anything like it, the issue being differently sized chunks of work on different GPUs.

Synchronization, while slow, is still much more efficient for the overall pipeline.

Other projects will be better suited for throughput most likely.

Inter-node is never going to be worth it compared to quantization currently.
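
The latency argument in the comment above can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not code from text-generation-inference: the function names, the microsecond figures, and the simplification that synchronized (tensor-parallel style) sharding divides compute evenly are all hypothetical. It shows why a pipeline does not shorten the path of a single token, and why unevenly sized stages cap the pipeline's output rate.

```python
# Toy cost model (all names and numbers are hypothetical, times in microseconds).

def tp_token_latency(compute_us, num_gpus, num_layers, sync_us_per_layer):
    """Sharded-with-synchronization case: every GPU works on every layer,
    so compute is divided across GPUs, but each layer pays a sync cost."""
    return compute_us / num_gpus + num_layers * sync_us_per_layer

def pp_token_latency(stage_us, hop_us):
    """Pipeline case: a single token must visit every stage in sequence,
    plus one transfer between each pair of adjacent stages."""
    return sum(stage_us) + (len(stage_us) - 1) * hop_us

def pp_steady_state_interval(stage_us):
    """With the pipeline kept full, a new result leaves once per
    slowest-stage time, so unbalanced stages waste the faster GPUs."""
    return max(stage_us)

if __name__ == "__main__":
    # Assumed workload: 40 ms of compute per token on one GPU, 4 GPUs, 32 layers.
    print("sync-sharded latency:", tp_token_latency(40_000, 4, 32, 50))
    print("pipeline latency:    ", pp_token_latency([10_000] * 4, 100))
    print("pipeline interval (uneven stages):",
          pp_steady_state_interval([8_000, 12_000, 9_000, 11_000]))
```

Under these assumed numbers the synchronized layout pays a per-layer sync cost but still finishes a token far sooner than the pipeline, which only gains throughput once many requests are in flight.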


