mlc-ai / llm-perf-bench
License: Apache License 2.0
I think the tok/sec metric used in the README needs to be defined more clearly. For example, it's not clear whether it measures the token rate during generation only (e.g. the llama.cpp "eval" print) or over the total runtime (e.g. the Oobabooga webui print).
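The two definitions can diverge a lot, especially for long prompts with short generations. A minimal sketch of both, with purely hypothetical numbers:

# Hypothetical measurements, for illustration only:
GENERATED_TOKENS=256; PREFILL_SEC=0.9; DECODE_SEC=10.2
echo "decode-only tok/sec: $(echo "$GENERATED_TOKENS / $DECODE_SEC" | bc -l)"
echo "end-to-end tok/sec:  $(echo "$GENERATED_TOKENS / ($PREFILL_SEC + $DECODE_SEC)" | bc -l)"

The decode-only number corresponds to llama.cpp's "eval" rate; dividing by the total runtime instead also charges the prompt-processing (prefill) time against the result.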
Hi, I used the commands in the README to run this project.
Since the README doesn't specify the model repo, I downloaded the model from here: https://huggingface.co/daryl149/llama-2-7b-chat-hf
The build process completes fine, but when I run the built model it complains:
Use MLC config: "/mlc-llm/dist/llama-2-7b-chat-hf-q4f16_1/params/mlc-chat-config.json"
Use model weights: "/mlc-llm/dist/llama-2-7b-chat-hf-q4f16_1/params/ndarray-cache.json"
Use model library: "/mlc-llm/dist/llama-2-7b-chat-hf-q4f16_1/llama-2-7b-chat-hf-q4f16_1-cuda.so"
You can use the following special commands:
/help print the special commands
/exit quit the cli
/stats print out the latest stats (token/sec)
/reset restart a fresh chat
/reload [local_id] reload model `local_id` from disk, or reload the current model if `local_id` is not specified
Loading model...
Loading finished
Running system prompts...
CUDA error: no kernel image is available for execution on the device
Aborted
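"no kernel image is available for execution on the device" typically means the compiled model library targets a different GPU architecture (compute capability) than the GPU it runs on. A quick check, assuming a driver recent enough to expose the compute_cap query field:

nvidia-smi --query-gpu=name,compute_cap --format=csv
# If this doesn't match the architecture the .so was built for,
# rebuild the model library on (or targeting) the machine that runs it.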
Currently the README does not necessarily provide a like-for-like comparison, because 4-bit quantizations can be of different quality depending on the implementation details. For example, in llama.cpp q4_0 is faster than q4_K_M, but the quantization format is less efficient in terms of size. So it would be useful to include measurements of memory usage, as well as a measure of output quality (e.g. perplexity on a large corpus of text), to put the speed numbers into context.
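For llama.cpp specifically, the bundled perplexity tool could provide such a quality measurement. A sketch, assuming the WikiText-2 test set has been downloaded and with a hypothetical model file name:

./perplexity -m ./models/llama-2-7b.Q4_0.gguf -f ./wikitext-2-raw/wiki.test.raw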
Dear authors, thanks for your efforts on this great LLM benchmarking work. I came across this repository and ran into some problems deploying the project.
When I was running the command below:
docker build --no-cache -t llm-perf-exllama-v2:v0.1 -f ./docker/Dockerfile.cu121.exllama_v2 .
It would give me an error: ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects.
I previously benchmarked "MLC LLM" successfully, but I hit this error when working with "Exllama V2".
The "nvidia-smi" output is:
Fri Oct 27 01:08:12 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:06:00.0 Off | N/A |
| 30% 36C P8 16W / 350W| 1MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:41:00.0 Off | N/A |
| 30% 42C P8 22W / 350W| 1MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce RTX 3090 On | 00000000:61:00.0 Off | N/A |
| 39% 35C P8 16W / 350W| 1MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Here is the full log of step 7:
Step 7/7 : RUN source ~/.bashrc && micromamba activate python311 && MAX_JOBS=4 python -m pip install flash-attn --no-build-isolation
---> Running in 28d2c70206c1
Collecting flash-attn
Downloading flash_attn-2.3.3.tar.gz (2.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.3/2.3 MB 24.8 MB/s eta 0:00:00
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: torch in /root/micromamba/envs/python311/lib/python3.11/site-packages (from flash-attn) (2.2.0.dev20231026)
Collecting einops (from flash-attn)
Downloading einops-0.7.0-py3-none-any.whl.metadata (13 kB)
Requirement already satisfied: packaging in /root/micromamba/envs/python311/lib/python3.11/site-packages (from flash-attn) (22.0)
Collecting ninja (from flash-attn)
Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl.metadata (5.3 kB)
Requirement already satisfied: filelock in /root/micromamba/envs/python311/lib/python3.11/site-packages (from torch->flash-attn) (3.9.0)
Requirement already satisfied: typing-extensions in /root/micromamba/envs/python311/lib/python3.11/site-packages (from torch->flash-attn) (4.8.0)
Requirement already satisfied: sympy in /root/micromamba/envs/python311/lib/python3.11/site-packages (from torch->flash-attn) (1.12)
Requirement already satisfied: networkx in /root/micromamba/envs/python311/lib/python3.11/site-packages (from torch->flash-attn) (3.2)
Requirement already satisfied: jinja2 in /root/micromamba/envs/python311/lib/python3.11/site-packages (from torch->flash-attn) (3.1.2)
Requirement already satisfied: fsspec in /root/micromamba/envs/python311/lib/python3.11/site-packages (from torch->flash-attn) (2023.10.0)
Requirement already satisfied: MarkupSafe>=2.0 in /root/micromamba/envs/python311/lib/python3.11/site-packages (from jinja2->torch->flash-attn) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in /root/micromamba/envs/python311/lib/python3.11/site-packages (from sympy->torch->flash-attn) (1.2.1)
Downloading einops-0.7.0-py3-none-any.whl (44 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 kB 1.7 MB/s eta 0:00:00
Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 307.2/307.2 kB 12.3 MB/s eta 0:00:00
Building wheels for collected packages: flash-attn
Building wheel for flash-attn (setup.py): started
Building wheel for flash-attn (setup.py): finished with status 'error'
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [40 lines of output]
No CUDA runtime is found, using CUDA_HOME='/root/micromamba/envs/python311'
fatal: not a git repository (or any of the parent directories): .git
torch.__version__ = 2.2.0.dev20231026
running bdist_wheel
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-mjt1ot_6/flash-attn_d426c8d13b1a498c98aa462ee75f2537/setup.py", line 288, in <module>
setup(
File "/root/micromamba/envs/python311/lib/python3.11/site-packages/setuptools/__init__.py", line 103, in setup
return distutils.core.setup(**attrs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/micromamba/envs/python311/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 185, in setup
return run_commands(dist)
^^^^^^^^^^^^^^^^^^
File "/root/micromamba/envs/python311/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
dist.run_commands()
File "/root/micromamba/envs/python311/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
self.run_command(cmd)
File "/root/micromamba/envs/python311/lib/python3.11/site-packages/setuptools/dist.py", line 989, in run_command
super().run_command(command)
File "/root/micromamba/envs/python311/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/tmp/pip-install-mjt1ot_6/flash-attn_d426c8d13b1a498c98aa462ee75f2537/setup.py", line 265, in run
wheel_url, wheel_filename = get_wheel_url()
^^^^^^^^^^^^^^^
File "/tmp/pip-install-mjt1ot_6/flash-attn_d426c8d13b1a498c98aa462ee75f2537/setup.py", line 234, in get_wheel_url
torch_cuda_version = parse(torch.version.cuda)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/micromamba/envs/python311/lib/python3.11/site-packages/packaging/version.py", line 52, in parse
return Version(version)
^^^^^^^^^^^^^^^^
File "/root/micromamba/envs/python311/lib/python3.11/site-packages/packaging/version.py", line 195, in __init__
match = self._regex.search(version)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected string or bytes-like object, got 'NoneType'
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash-attn
Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects
The command '/bin/bash -ec source ~/.bashrc && micromamba activate python311 && MAX_JOBS=4 python -m pip install flash-attn --no-build-isolation' returned a non-zero code: 1
I would really appreciate it if anyone could take a look at this problem and give some advice. Thanks!
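Judging from the traceback, torch.version.cuda is None inside the build container (note the "No CUDA runtime is found" line), i.e. the nightly PyTorch that got installed is a CPU-only build, so flash-attn's setup.py fails when parsing the CUDA version. A quick diagnostic one could run in the image (a check, not a fix):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# If torch.version.cuda prints None, installing a CUDA-enabled torch wheel
# before building flash-attn should get past this particular error.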
To reproduce:
On a supported AMD machine, follow the ROCm instructions up to building the "lib", and the following crash occurs:
(python311) root@rumlinux:/workspace# rm -rf $PATH_TEST && mkdir $PATH_TEST && rm -rf $PATH_COMPILE && mkdir $PATH_COMPILE && ln -s ${WEIGHT_PATH} ${PATH_TEST}/params && cp $MODEL_CONFIG $PATH_COMPILE/config.json
(python311) root@rumlinux:/workspace# python -m mlc_llm.build \
--model $PATH_COMPILE \
--artifact-path $PATH_COMPILE \
--quantization $QUANTIZATION \
--max-seq-len 2048 \
--num-shards $NUM_SHARDS \
--target rocm --build-model-only
Traceback (most recent call last):
File "<frozen runpy>", line 189, in _run_module_as_main
File "<frozen runpy>", line 112, in _get_module_details
File "/mlc_llm/mlc_llm/__init__.py", line 6, in <module>
from . import core
File "/mlc_llm/mlc_llm/core.py", line 19, in <module>
from mlc_llm.relax_model import (
File "/mlc_llm/mlc_llm/relax_model/llama_batched_vllm.py", line 7, in <module>
from tvm.relax.op.nn import attention_var_len
ImportError: cannot import name 'attention_var_len' from 'tvm.relax.op.nn' (/root/micromamba/envs/python311/lib/python3.11/site-packages/tvm/relax/op/nn/__init__.py)
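This looks like a version mismatch: the mlc_llm checkout expects an attention_var_len op that the installed TVM wheel does not provide. A couple of hedged checks to confirm:

python -c "import tvm; print(tvm.__file__)"   # which TVM is actually being imported?
python -c "from tvm.relax.op import nn; print(hasattr(nn, 'attention_var_len'))"
# False here would confirm the installed TVM predates the op; updating the TVM
# wheel, or pinning mlc_llm to a commit matching it, may resolve the import.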
Which repo should I download from Hugging Face to try out the 7B / 13B models? I did git clone https://huggingface.co/mlc-ai/mlc-chat-Llama-2-13b-chat-hf-q4f16_1
but it only produced a 3.3 MB directory. Any help on this?
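A 3.3 MB clone usually means the weight files came down as Git LFS pointer stubs rather than the actual tensors. A sketch of the usual remedy:

git lfs install
git clone https://huggingface.co/mlc-ai/mlc-chat-Llama-2-13b-chat-hf-q4f16_1
# or, inside the existing clone that only has pointer files:
git lfs pull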
The benchmark container for llama.cpp does not seem to set the number of threads manually. As of right now, more than one thread is of no use in llama.cpp when all layers can be offloaded with CUDA (the extra threads only add overhead/CPU load), so manually setting the number of threads to 1 should yield better performance.
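For example, with the llama.cpp CLI of that period the thread count and GPU offload could be pinned roughly like this (model path hypothetical):

./main -m ./models/llama-2-7b.Q4_0.gguf -ngl 99 -t 1 -n 128 -p "Hello"
# -ngl 99 offloads all layers to the GPU; -t 1 avoids spinning up extra
# CPU threads that only add overhead once everything runs on the GPU.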
When running python build.py --model /models/Llama-2-7b-chat-hf --target cuda --quantization q4f16_1 --artifact-path "./dist" --use-cache 0
I encountered the following error:
Traceback (most recent call last):
File "/mlc_llm/build.py", line 4, in <module>
main()
File "/mlc_llm/mlc_llm/build.py", line 10, in main
core.build_model_from_args(parsed_args)
File "/mlc_llm/mlc_llm/core.py", line 584, in build_model_from_args
build(mod, args)
File "/mlc_llm/mlc_llm/core.py", line 467, in build
utils.debug_dump_benchmark_script(
File "/mlc_llm/mlc_llm/utils.py", line 108, in debug_dump_benchmark_script
from tvm.dlight.benchmark import extract_all_func_info_from_relax
File "/root/micromamba/envs/python311/lib/python3.11/site-packages/tvm/dlight/benchmark/__init__.py", line 18, in <module>
from .bench import benchmark, benchmark_prim_func, benchmark_relax_func
File "/root/micromamba/envs/python311/lib/python3.11/site-packages/tvm/dlight/benchmark/bench.py", line 28, in <module>
from tvm.testing import rpc_run
File "/root/micromamba/envs/python311/lib/python3.11/site-packages/tvm/testing/__init__.py", line 47, in <module>
from .utils import *
File "/root/micromamba/envs/python311/lib/python3.11/site-packages/tvm/testing/utils.py", line 85, in <module>
import pytest
ModuleNotFoundError: No module named 'pytest'
I was able to fix it by simply installing pytest via pip3 install pytest.
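As a more permanent fix, the dependency could be baked into the image, e.g. by adding a step in the style of the Dockerfile's existing RUN lines (a sketch, not the authors' actual layout):

RUN source ~/.bashrc && micromamba activate python311 && python -m pip install pytest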
Can an 8 GB RTX 3060 run the 13B model?
I followed the step-by-step instructions for the ROCm benchmark. The last step failed; the log is below.
The key line of the error log is "[2024-01-17 08:53:37] ERROR model_metadata.py:93: FAILED to read metadata section in legacy model lib."
Hi, I'm new to your work. I've built the Docker image and tried to run the python build.py command, and received this:
Traceback (most recent call last):
File "/mlc_llm/build.py", line 4, in <module>
main()
File "/mlc_llm/mlc_llm/build.py", line 9, in main
parsed_args = core._parse_args(parsed_args) # pylint: disable=protected-access
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mlc_llm/mlc_llm/core.py", line 192, in _parse_args
utils.parse_target(parsed)
File "/mlc_llm/mlc_llm/utils.py", line 370, in parse_target
raise ValueError("Cannot detect local CUDA GPU target!")
I'm not sure what went wrong; I can see my GPUs using nvidia-smi.
I'm so happy to try your work on accelerating llama inference speed.
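A working nvidia-smi on the host doesn't guarantee the build environment can see the GPU: the container has to be started with GPU access, and TVM has to detect the device. Two hedged checks (image name hypothetical):

docker run --gpus all --rm -it <image> nvidia-smi   # does the container see the GPUs?
python -c "import tvm; print(tvm.cuda(0).exist)"    # does TVM detect a CUDA device?
# If TVM prints False, parse_target cannot autodetect a local CUDA target
# and raises exactly this ValueError.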
llama.cpp has a compilation setting LLAMA_CUDA_MMV_Y which defaults to 1. However, on an RTX 3090, setting LLAMA_CUDA_MMV_Y=2 is ~2% faster, and I would expect the setting to also be beneficial for the hardware tested here.
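For reference, with the Makefile-based build of that period the flag could be set at compile time roughly like this (a sketch; llama.cpp's build flags have changed over time):

make clean
make LLAMA_CUBLAS=1 LLAMA_CUDA_MMV_Y=2 -j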
Hello, when possible, please add mobile device support to this project.