It would be worth providing the measured memory requirements for inference with the text model.

Memory requirements about lwm HOT 4 OPEN

largeworldmodel commented on July 20, 2024 13

Memory requirements

from lwm.

Comments (4)

wilson1yan commented on July 20, 2024 11

If using vLLM for inference (PyTorch model, FP16), I believe we used:

* 1 80GB A100 for 32K
* 2 80GB A100s for 128K
* 4 80GB A100s for 256K
* 8 80GB A100s for 512K

Each configuration above serves one model with tensor parallelism across the given number of devices. With 8 80GB A100s, I think the limit was around 650K to 700K tokens. At startup, vLLM prints the number of cache blocks it allocated, which tells you the maximum number of tokens supported, so it should be easy to check if you're using GPUs with different amounts of memory.
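As a rough illustration of reading that startup output, the sketch below converts a reported cache block count into a token capacity. It assumes a block size of 16 tokens (vLLM's usual default, configurable via its block size option); the block count used here is a made-up number, not a measurement from LWM.

```python
# Sketch: turn the KV cache block count reported at startup into a token capacity.
# ASSUMPTION: block_size=16 tokens per block; check your own vLLM configuration.

def max_cached_tokens(num_gpu_blocks: int, block_size: int = 16) -> int:
    """Total number of tokens the allocated KV cache blocks can hold."""
    return num_gpu_blocks * block_size


def fits(context_len: int, num_gpu_blocks: int, block_size: int = 16) -> bool:
    """True if a single sequence of context_len tokens fits in the cache."""
    return context_len <= max_cached_tokens(num_gpu_blocks, block_size)


# HYPOTHETICAL block count, for illustration only.
example_blocks = 40_000

print(max_cached_tokens(example_blocks))
print(fits(512 * 1024, example_blocks))
print(fits(1024 * 1024, example_blocks))
```

If the capacity is below the context length you need, the options are more devices, GPUs with more memory, or a smaller maximum sequence length.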

For JAX, I'm not too sure what the intermediate requirements were, but we needed a TPU v4-256 to run inference on 1M tokens (full FP32 inference). I think more optimizations can be made (e.g. half precision, quantization) to reduce the requirements. Even at full precision, the requirements seemed higher than I expected, and there might be some JAX / XLA optimizations to be made (e.g. keeping it from padding certain dimensions, which we originally had a lot of trouble with).
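To get a feel for why long contexts need so much memory, here is a back-of-the-envelope estimate of KV cache size. It assumes the text model has a LLaMA-2-7B-style shape (32 layers, 32 KV heads, head dimension 128); those values are an assumption, not something stated in this thread. The estimate ignores weights, activations, and framework overhead, so treat it as a lower bound only.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelShape:
    # ASSUMPTION: LLaMA-2-7B-style dimensions; verify against the model config.
    num_layers: int = 32
    num_kv_heads: int = 32
    head_dim: int = 128


def kv_cache_bytes(shape: ModelShape, num_tokens: int, bytes_per_elem: int) -> int:
    """Bytes needed to store keys and values for num_tokens tokens."""
    # Factor of 2 covers one key tensor and one value tensor per layer.
    per_token = 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * bytes_per_elem
    return per_token * num_tokens


shape = ModelShape()
for label, tokens in [("32K", 32 * 1024), ("128K", 128 * 1024),
                      ("512K", 512 * 1024), ("1M", 1024 * 1024)]:
    fp16 = kv_cache_bytes(shape, tokens, 2) / GIB
    fp32 = kv_cache_bytes(shape, tokens, 4) / GIB
    print(f"{label:>4}: FP16 {fp16:.0f} GiB, FP32 {fp32:.0f} GiB")
```

Under these assumptions the cache costs 0.5 MiB per token in FP16, which is why halving the precision or quantizing the cache would reduce the requirements noticeably.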

blazorin commented on July 20, 2024

Any recommendations for running the model on smaller GPUs (T4)? It runs out of memory with JAX.

Playerrrrr commented on July 20, 2024

@wilson1yan Can you share the shell/bash script for setting up the vLLM inference server for the PyTorch model in FP16?

xloem commented on July 20, 2024

I’m thinking an attention kernel optimization like top-k would be appropriate here. Could a user calculate their own position_ids and pass a subset of the tokens, maybe make multiple passes and drop tokens that don’t impact the results?
