python3.8-intel64 -m venv ~/code/virtualenvs/llm_inference_estimator
source ~/code/virtualenvs/llm_inference_estimator/bin/activate
pip install pip-tools==7.3.0
pip-compile requirements.in
pip-sync requirements.txt
Fundamentals, maybe too theoretical:
Inference:
- https://www.jinghong-chen.net/estimate-vram-usage-in-llm-inference/ (see the VRAM sketch after this list)
- https://linden-li.github.io/posts/inference-slides?ref=jinghong-chen.net
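Both links above reduce to the same arithmetic: inference VRAM ≈ weight memory + KV-cache memory, plus some slack for activations and framework overhead. A minimal sketch, assuming FP16 weights and Mistral-7B-like shapes (all the concrete numbers here are my assumptions, not taken from the articles):

```python
# Rough inference VRAM estimate: weights + KV cache.
# All parameter values are illustrative assumptions (Mistral-7B-like).

def inference_vram_gib(
    n_params: float,           # total model parameters
    n_layers: int,
    n_kv_heads: int,           # KV heads (GQA uses fewer than attention heads)
    head_dim: int,
    seq_len: int,              # prompt + generated tokens
    batch_size: int,
    bytes_per_param: int = 2,  # FP16/BF16
) -> float:
    weights = n_params * bytes_per_param
    # KV cache: 2 tensors (K and V) per layer, per KV head, per head_dim, per token.
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_param
    return (weights + kv_cache) / 1024**3

# Assumed Mistral-7B-ish shapes: 7.24e9 params, 32 layers, 8 KV heads, head_dim 128.
print(f"{inference_vram_gib(7.24e9, 32, 8, 128, seq_len=4096, batch_size=1):.1f} GiB")
```

This is a lower bound; real deployments need extra headroom for activations and the serving framework.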
For training:
Some tools out there:
- https://kipp.ly/transformer-inference-arithmetic/#latency-calculations (see the latency sketch after this list)
- https://github.com/adarshxs/TokenTally/tree/main
- https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator/blob/main/index.html
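The latency arithmetic in the kipp.ly post boils down to: at small batch sizes decoding is memory-bandwidth bound, because each generated token has to stream every weight byte from HBM once, so per-token latency ≈ weight bytes / memory bandwidth. A quick sketch of that arithmetic under assumed hardware and model numbers (not taken from the post):

```python
# Back-of-the-envelope decode latency, memory-bandwidth bound (small batch).
# Numbers are assumptions: Mistral-7B-like model in FP16, A100-80GB-class HBM.

MODEL_BYTES = 7.24e9 * 2   # ~7.24B params at 2 bytes each (assumed)
HBM_BANDWIDTH = 2.0e12     # ~2 TB/s memory bandwidth (assumed)

per_token_s = MODEL_BYTES / HBM_BANDWIDTH
print(f"~{per_token_s * 1e3:.2f} ms/token, ~{1 / per_token_s:.0f} tokens/s upper bound")
```

At large batch sizes (or during prefill) the bound flips to compute, and a FLOPs-based estimate applies instead.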
How TGI does its calculations:
Alternative to TGI. This repo has better-documented code, and its authors are the original creators of the algorithms; TGI just "copies" them over:
Issues I have open with TGI:
Reddit comment:
Very important article about PagedAttention:
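For reference, the core idea of PagedAttention: the KV cache is allocated in fixed-size blocks rather than one contiguous max-length region per sequence, so the waste per sequence is at most one partially filled block. A rough sketch of the block accounting, using vLLM's default block size of 16 tokens (the other shapes and the free-VRAM figure are assumptions):

```python
import math

# PagedAttention-style KV-cache block accounting.
# block_size=16 is vLLM's default; other numbers are assumed (Mistral-7B-like).

def kv_block_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                   block_size=16, dtype_bytes=2) -> int:
    # One logical block holds K and V for `block_size` tokens across all layers.
    return 2 * n_layers * n_kv_heads * head_dim * block_size * dtype_bytes

def blocks_for_sequence(seq_len: int, block_size: int = 16) -> int:
    # Only the last block can be partially filled, so waste < one block.
    return math.ceil(seq_len / block_size)

free_vram = 10 * 1024**3  # VRAM left over after loading weights (assumed)
total_blocks = free_vram // kv_block_bytes()
print(f"{total_blocks} blocks available, "
      f"{blocks_for_sequence(1000)} blocks for a 1000-token sequence")
```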
IMPORTANT: Communication time between GPUs increases latency quite a bit. For Mistral-7B with bs=1, in=100, out=4, latency goes from 15.90 ms on 1 GPU to 220.90 ms on 2 GPUs.
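Rough arithmetic on those numbers (assuming, loosely, that all the extra latency is communication overhead spread evenly over the prefill plus the decode steps):

```python
# Quoted measurements: Mistral-7B, bs=1, in=100, out=4.
lat_1gpu_ms = 15.90
lat_2gpu_ms = 220.90
out_tokens = 4

extra_ms = lat_2gpu_ms - lat_1gpu_ms  # ~205 ms of added latency
# Rough split: one prefill pass plus one decode pass per output token.
print(f"+{extra_ms:.1f} ms total, ~{extra_ms / (out_tokens + 1):.0f} ms per forward pass")
```

That is roughly 41 ms of overhead per forward pass, which is why tensor parallelism only pays off when the per-GPU compute savings outweigh the interconnect cost.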