
Comments (10)

Venkat2811 commented on June 16, 2024

Good question.

In transformer inference, both the prefill and decode phases use GPU VRAM: processing happens by moving model weights, KV cache, embeddings, etc., from VRAM into the L2/L1 caches and registers of the GPU's processing cores, and intermediate states and results are written back to VRAM (see the FlashAttention 1 & 2 papers for more details). This is similar to how traditional programs (instructions & data) must be loaded into RAM before the CPU can process them. So in your pictorial representation, the Ideal Scenario is not possible.
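To make that memory movement concrete, here is a toy, single-head decode loop in NumPy. It is purely illustrative (hypothetical sizes, no real model): the point is that every generated token re-reads the weights and the entire, growing KV cache.

```python
# Toy single-head attention decode loop (NumPy), purely illustrative.
# It shows why decode re-reads the weights and the growing KV cache from
# memory on every generated token; sizes are hypothetical.
import numpy as np

d = 64            # hypothetical head dimension
steps = 8         # tokens to "generate"
rng = np.random.default_rng(0)

# "Model weights" live in (V)RAM and are read on every step.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []          # KV cache grows by one entry per token
x = rng.standard_normal(d)         # current token's hidden state

for t in range(steps):
    q = x @ W_q                    # reads W_q from memory
    k_cache.append(x @ W_k)        # reads W_k, writes a new K entry
    v_cache.append(x @ W_v)        # reads W_v, writes a new V entry

    K = np.stack(k_cache)          # reads the *entire* cache each step
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    x = probs @ V                  # attention output feeds the next step

print("decoded", steps, "tokens; KV cache length:", len(k_cache))
```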

In your pictorial representation, the Current Scenario is called Data Parallelism: the full model weights are loaded on each of the 2 GPUs (81% VRAM each), giving 2 independent model instances. This is widely supported by several serving frameworks and engines, including TGI.

What you are asking for is a single model sharded across multiple GPUs, e.g., 40.5% on GPU 1 and 40.5% on GPU 2, so that there is only one model instance and both GPUs are used to complete one inference. There are several parallelism techniques (pipeline, tensor, sequence). TGI supports Tensor Parallelism; it comes into play when a model is too big to fit into a single GPU's VRAM. It is not common to do this for models that fit within a single GPU's VRAM.
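For intuition, here is a minimal sketch of what 2-way tensor (column) parallelism does to a single linear layer, in plain PyTorch. The sizes and device placement are hypothetical, and real engines fuse this with NCCL all-gather/all-reduce kernels rather than the naive concat shown here.

```python
# Minimal sketch of tensor (column) parallelism for one linear layer.
# Device placement and sizes are hypothetical; falls back to CPU so it runs.
import torch

def pick_devices():
    if torch.cuda.device_count() >= 2:
        return [torch.device("cuda:0"), torch.device("cuda:1")]
    return [torch.device("cpu"), torch.device("cpu")]

devices = pick_devices()
d_in, d_out, batch = 1024, 4096, 8      # hypothetical layer shape

weight = torch.randn(d_out, d_in)       # full weight, as on a single GPU
shards = torch.chunk(weight, 2, dim=0)  # split output columns: one half per GPU
shards = [w.to(dev) for w, dev in zip(shards, devices)]

x = torch.randn(batch, d_in)

# Each GPU computes its slice of the output with only half the weights resident.
partials = [(x.to(dev) @ w.t()).cpu() for w, dev in zip(shards, devices)]
y_tp = torch.cat(partials, dim=1)       # the "gather" step over the interconnect

y_ref = x @ weight.t()                  # single-device reference
print("max abs error vs. single-device result:", (y_tp - y_ref).abs().max().item())
```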

That's because, generally speaking, inference is bandwidth limited, so inference engine implementations try to use memory bandwidth (VRAM <-> cache traffic) as efficiently as possible. If we shard a model that already fits in VRAM, the communication path gets longer (GPU 1 cache <-> GPU 1 VRAM <-> PCIe/NVLink <-> GPU 2 VRAM <-> GPU 2 cache), which adds overhead. We want to keep data movement per computed token as low as possible so the GPU cores stay busy.
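A rough back-of-envelope, with assumed (roughly A100-class) hardware numbers, of why decode is memory-bandwidth bound and what sharding adds in communication. Treat it as illustration, not a benchmark.

```python
# Back-of-envelope only: all hardware numbers below are assumptions.
params          = 7e9          # hypothetical 7B-parameter model
bytes_per_param = 2            # fp16
hbm_bw          = 2.0e12       # ~2 TB/s HBM bandwidth per GPU (assumed)
nvlink_bw       = 300e9        # ~300 GB/s GPU<->GPU link (assumed)
pcie_bw         = 32e9         # ~32 GB/s PCIe 4.0 x16, for comparison

# Decoding one token (batch size 1) must stream essentially all weights:
weight_bytes = params * bytes_per_param
t_weights_ms = weight_bytes / hbm_bw * 1e3
print(f"weight streaming per token: ~{t_weights_ms:.1f} ms  (memory-bound floor)")

# With 2-way tensor parallelism each GPU streams half the weights...
print(f"per-GPU streaming with TP=2: ~{t_weights_ms / 2:.1f} ms")

# ...but every layer's partial results cross the interconnect. Assuming a
# hidden size of 4096, 32 layers, fp16 activations, batch size 1:
hidden, layers, batch = 4096, 32, 1
comm_bytes = hidden * 2 * layers * batch * 2   # ~2 reductions per layer (attn + MLP)
print(f"activation traffic per token: ~{comm_bytes / 1e6:.2f} MB")
print(f"  over NVLink: ~{comm_bytes / nvlink_bw * 1e6:.1f} us (+ per-message latency)")
print(f"  over PCIe:   ~{comm_bytes / pcie_bw * 1e6:.1f} us (+ per-message latency)")
# At small batch sizes the fixed per-message latency of dozens of such
# transfers per token, not the raw bytes, is where the overhead shows up.
```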

With modern techniques like FlashAttention, PagedAttention, quantization, etc., memory bandwidth is used more efficiently. So for specific serving configurations (large batch sizes, quantized models, etc.), it could technically make sense to shard a model that fits on one GPU across 2 GPUs. This is similar to the multiprocessing paradigm in CPU workloads. I'm not sure this is mainstream, though, because single-GPU inference latency is still on the order of milliseconds to seconds. I also see benefits for ensemble model inference in local setups on M3 & M4 chips.

In any case, I am also curious to hear the TGI team's (@OlivierDehaene, et al.) thoughts on this. I came across AlpaServe, which discusses this.

Thanks,
Venkat


martinigoyanes commented on June 16, 2024

Maybe there is a terminology/communication gap here. The above statement is not correct. Higher throughput is achieved by increasing the inference batch size, which sharding enables by splitting the model's layers across several GPUs. This requires different layers of the model to be loaded on different GPUs (base memory); as a simple example, 50% of the compute (and KV cache memory) happens on GPU 1 and the rest on GPU 2. It is analogous to MapReduce.

Yes, you are right, I was not precise enough. I said "100%", but of course you still have to host a percentage of the model on each GPU when doing TP. What I mean is that you can utilize much more GPU VRAM overall, which lets you process larger batch sizes. I was neglecting the model's memory footprint when it is spread across 2 GPUs for the sake of making my point about the latency/throughput trade-off, sorry.

E.g., your 60 concurrent requests were still served, just with some added latency. The TGI router's queueing system + back-pressure management made that possible.

Yeah, indeed! However, the queueing system in TGI is a bit "naive" since it has no sense of prioritization. I would argue that in real-world scenarios, when you have multiple downstream clients, you would allow some of them to consume more total tokens per minute than others, depending on how critical each downstream task is, while TGI lets all requests fill the queue equally. That could also be an extension of TGI: some sense of prioritization in the queueing system based on API keys and each key's token usage (roughly along the lines of the sketch below). However, given that TGI is an open-source project, I think it is better to "keep it simple", and this kind of extension should/can be built around TGI. What do you think?
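Something like this hypothetical sketch; none of these names exist in TGI, it is only to illustrate weighting queued requests by each API key's token budget.

```python
# Hypothetical priority queue (not TGI's actual router): requests from API
# keys that have used a larger share of their token budget sink in the queue.
import heapq, itertools
from dataclasses import dataclass, field

@dataclass(order=True)
class QueuedRequest:
    priority: float                     # lower = served first
    seq: int                            # tie-breaker keeps FIFO order within a priority
    api_key: str = field(compare=False)
    prompt: str = field(compare=False)

class PriorityRouterQueue:
    def __init__(self, budgets_tokens_per_min):
        self._budgets = dict(budgets_tokens_per_min)   # api_key -> tokens/min budget
        self._used = {k: 0 for k in self._budgets}     # tokens used this minute
        self._heap = []
        self._seq = itertools.count()

    def push(self, api_key, prompt, est_tokens):
        # A real router would also decay `_used` every minute.
        self._used[api_key] += est_tokens
        used_frac = self._used[api_key] / self._budgets[api_key]
        heapq.heappush(self._heap,
                       QueuedRequest(used_frac, next(self._seq), api_key, prompt))

    def pop(self):
        return heapq.heappop(self._heap)

q = PriorityRouterQueue({"team-critical": 100_000, "team-batch": 10_000})
q.push("team-batch", "summarize ...", est_tokens=2_000)
q.push("team-critical", "answer user ...", est_tokens=500)
print(q.pop().api_key)   # -> "team-critical": smaller share of its budget used
```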


martinigoyanes commented on June 16, 2024

Thank you so much for such a well-written response! Maybe we could explore adding support for these features, since I do think that for cases with very large batch sizes it would really help to leverage both GPUs' VRAM at the same time, right?


Venkat2811 commented on June 16, 2024

No problem @martinigoyanes. vLLM supports this: vllm-project/vllm#2304

Maybe TGI already supports this natively or through its vLLM integration? I have to look into the TGI config to get more clarity on this.
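For reference, forcing tensor parallelism in vLLM for a model that already fits on one GPU looks roughly like the sketch below (the model name is only an example, and it assumes a machine with 2 visible GPUs).

```python
# Sketch of 2-way tensor parallelism in vLLM for a model that fits on one GPU.
# The model name is an example; requires 2 visible GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # example model; fits on a single GPU
    tensor_parallel_size=2,            # shard it across 2 GPUs anyway
)

params = SamplingParams(max_tokens=64, temperature=0.8)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)
```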


martinigoyanes commented on June 16, 2024

I am down to collaborate with you @Venkat2811 on supporting this! I think TGI supports TP when the model does not fit on 1 GPU, but it does not allow you to force it when the model fits on 1 GPU.


Venkat2811 commented on June 16, 2024

I think TGI supports TP when the model does not fit on 1 GPU, but it does not allow you to force it when the model fits on 1 GPU

Yes, I would like to validate this first to be sure. If that is the case, I would be happy to collaborate with you @martinigoyanes!


martinigoyanes commented on June 16, 2024

I think the increase in throughput is well worth it, since you can leverage 100% VRAM from the extra GPU, even after taking into account the added latency from GPU-to-GPU communication.

When serving LLMs for "real" use cases, you must put some kind of rate limiter in front of them. And most of the time, the downstream client of the LLM would rather have increased latency than be rate limited on total tokens used. With the increase in throughput from using the extra GPU's VRAM, you can offer your downstream clients a much higher number of total tokens per minute while trading off some latency from inter-GPU communication.

What do you think, @Venkat2811? I feel like this would be a very valuable feature for TGI.


Venkat2811 commented on June 16, 2024

Hey @martinigoyanes,

you can leverage 100% VRAM from the extra GPU

Maybe there is a terminology/communication gap here. The above statement is not correct. Higher throughput is achieved by increasing the inference batch size, which sharding enables by splitting the model's layers across several GPUs. This requires different layers of the model to be loaded on different GPUs (base memory); as a simple example, 50% of the compute (and KV cache memory) happens on GPU 1 and the rest on GPU 2. It is analogous to MapReduce.
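As a toy illustration of the "different layers on different GPUs" idea (a pipeline-style split): hypothetical sizes, with a CPU fallback so the sketch runs anywhere.

```python
# Toy pipeline-style layer split in plain PyTorch; sizes are hypothetical.
import torch
import torch.nn as nn

dev0 = torch.device("cuda:0" if torch.cuda.device_count() >= 2 else "cpu")
dev1 = torch.device("cuda:1" if torch.cuda.device_count() >= 2 else "cpu")

hidden, n_layers = 512, 8                      # hypothetical model shape

# First half of the layers lives on GPU 0, second half on GPU 1.
stage0 = nn.Sequential(*[nn.Linear(hidden, hidden) for _ in range(n_layers // 2)]).to(dev0)
stage1 = nn.Sequential(*[nn.Linear(hidden, hidden) for _ in range(n_layers // 2)]).to(dev1)

x = torch.randn(4, hidden, device=dev0)        # a micro-batch enters stage 0
h = stage0(x)                                  # 50% of the compute on GPU 0
h = h.to(dev1)                                 # activations cross the interconnect
y = stage1(h)                                  # remaining 50% on GPU 1
print(y.shape)                                 # torch.Size([4, 512])
```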

I think TGI supports TP when the model does not fit on 1 GPU, but it does not allow you to force it when the model fits on 1 GPU

We have to verify this before proceeding.

Resources:

Tensor Parallelism:

Different types of parallelism:

Multi-model inference on a GPU cluster of several machines:


Venkat2811 commented on June 16, 2024

When serving LLMs for "real" use cases, you must put some kind of rate limiter in front of them

vLLM, TGI & Triton are already powering several "real" use cases :)

E.g., your 60 concurrent requests were still served, just with some added latency. The TGI router's queueing system + back-pressure management made that possible.


Venkat2811 commented on June 16, 2024

No worries! I wanted to be precise & sure, considering the earlier discussions in this thread.

the queueing system in TGI is a bit "naive" since it has no sense of prioritization

Yes, prioritization, routing, etc., are not part of this, and rightfully so. My understanding of the current state of the project is that the router & inference server are meant for serving a single model.

