
llm-perf-bench's People

Contributors

junrushao, leshengjin, sing-li, yongjer, yzh119, zxybazh


llm-perf-bench's Issues

tok/sec metric is not clearly defined

I think the tok/sec metric used in the README needs to be defined more clearly. For example, it is not clear whether it measures the token rate during generation only (e.g. the llama.cpp "eval" print) or over the total runtime (e.g. the Oobabooga webui print).
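
As an illustration (not from the original README), the two readings can be compared directly from llama.cpp's own timing output; the model path and prompt below are placeholders:

# Hypothetical sketch: contrast generation-only vs end-to-end tok/sec in llama.cpp.
# Model path and prompt are placeholders, not taken from this repo.
./main -m ./llama-2-7b.Q4_0.gguf -ngl 99 -n 256 -p "What is the meaning of life?" 2> timings.txt
# llama.cpp reports "prompt eval time" and "eval time" separately on stderr:
#   the "eval time" tok/s is generation-only throughput, while
#   generated tokens / total wall-clock time gives the end-to-end figure, which is lower.
grep "eval time" timings.txt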

CUDA error: no kernel image is available for execution on the device

Hi, I used the commands in the README to run this project.

Since you don't specify the model repo, I downloaded the model from here: https://huggingface.co/daryl149/llama-2-7b-chat-hf

The build process is fine, but when I run the built model, it complains:

Use MLC config: "/mlc-llm/dist/llama-2-7b-chat-hf-q4f16_1/params/mlc-chat-config.json"
Use model weights: "/mlc-llm/dist/llama-2-7b-chat-hf-q4f16_1/params/ndarray-cache.json"
Use model library: "/mlc-llm/dist/llama-2-7b-chat-hf-q4f16_1/llama-2-7b-chat-hf-q4f16_1-cuda.so"
You can use the following special commands:
  /help               print the special commands
  /exit               quit the cli
  /stats              print out the latest stats (token/sec)
  /reset              restart a fresh chat
  /reload [local_id]  reload model `local_id` from disk, or reload the current model if `local_id` is not specified

Loading model...
Loading finished
Running system prompts...
CUDA error: no kernel image is available for execution on the device
Aborted
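
Not part of the original report, but "no kernel image is available" usually means the model library was compiled for a different SM architecture than the GPU it runs on. A quick sanity check (assuming a driver recent enough to support the compute_cap query) could be:

# Hypothetical check: confirm the GPU's compute capability before compiling.
nvidia-smi --query-gpu=name,compute_cap --format=csv
# Or via the PyTorch installed in the container:
python -c "import torch; print(torch.cuda.get_device_capability())"
# If this differs from the machine where the .so was built, rebuilding the
# model library on the target machine is the usual fix.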

Perplexity and memory use comparisons would be useful

Currently the README does not necessarily provide a like-for-like comparison, because 4-bit quantizations can be of different quality depending on the implementation details. For example, in llama.cpp q4_0 is faster than q4_K_M but the quantization format is less efficient in terms of size. So it would be useful to include measurements of memory usage as well as of output quality (e.g. perplexity on a large corpus of text) to put the speed numbers into context.
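
As a sketch of how such numbers could be collected for llama.cpp (assuming a local copy of wikitext-2; this is not part of the repo's scripts):

# Hypothetical measurement of output quality and VRAM use for one quantization.
# Model and dataset paths are placeholders.
./perplexity -m ./llama-2-7b.Q4_0.gguf -f ./wikitext-2-raw/wiki.test.raw -ngl 99
# Sample GPU memory use while the run is in progress:
nvidia-smi --query-gpu=memory.used --format=csv -l 1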

ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects

Dear authors, thanks for your efforts on such great LLM benchmarking work. I went through this repository and ran into some problems deploying the project.

When I was running the command below:

docker build --no-cache -t llm-perf-exllama-v2:v0.1        -f ./docker/Dockerfile.cu121.exllama_v2 .

It gave me the error: ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects.

I was able to benchmark "MLC LLM" successfully, but I ran into this error when working with "Exllama V2".

The output of nvidia-smi:

Fri Oct 27 01:08:12 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:06:00.0 Off |                  N/A |
| 30%   36C    P8               16W / 350W|      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090         On | 00000000:41:00.0 Off |                  N/A |
| 30%   42C    P8               22W / 350W|      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090         On | 00000000:61:00.0 Off |                  N/A |
| 39%   35C    P8               16W / 350W|      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Here is the full log of step 7:

Step 7/7 : RUN source ~/.bashrc && micromamba activate python311                       &&     MAX_JOBS=4 python -m pip install flash-attn --no-build-isolation
 ---> Running in 28d2c70206c1
Collecting flash-attn
  Downloading flash_attn-2.3.3.tar.gz (2.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.3/2.3 MB 24.8 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: torch in /root/micromamba/envs/python311/lib/python3.11/site-packages (from flash-attn) (2.2.0.dev20231026)
Collecting einops (from flash-attn)
  Downloading einops-0.7.0-py3-none-any.whl.metadata (13 kB)
Requirement already satisfied: packaging in /root/micromamba/envs/python311/lib/python3.11/site-packages (from flash-attn) (22.0)
Collecting ninja (from flash-attn)
  Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl.metadata (5.3 kB)
Requirement already satisfied: filelock in /root/micromamba/envs/python311/lib/python3.11/site-packages (from torch->flash-attn) (3.9.0)
Requirement already satisfied: typing-extensions in /root/micromamba/envs/python311/lib/python3.11/site-packages (from torch->flash-attn) (4.8.0)
Requirement already satisfied: sympy in /root/micromamba/envs/python311/lib/python3.11/site-packages (from torch->flash-attn) (1.12)
Requirement already satisfied: networkx in /root/micromamba/envs/python311/lib/python3.11/site-packages (from torch->flash-attn) (3.2)
Requirement already satisfied: jinja2 in /root/micromamba/envs/python311/lib/python3.11/site-packages (from torch->flash-attn) (3.1.2)
Requirement already satisfied: fsspec in /root/micromamba/envs/python311/lib/python3.11/site-packages (from torch->flash-attn) (2023.10.0)
Requirement already satisfied: MarkupSafe>=2.0 in /root/micromamba/envs/python311/lib/python3.11/site-packages (from jinja2->torch->flash-attn) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in /root/micromamba/envs/python311/lib/python3.11/site-packages (from sympy->torch->flash-attn) (1.2.1)
Downloading einops-0.7.0-py3-none-any.whl (44 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 kB 1.7 MB/s eta 0:00:00
Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 307.2/307.2 kB 12.3 MB/s eta 0:00:00
Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (setup.py): started
  Building wheel for flash-attn (setup.py): finished with status 'error'
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [40 lines of output]
      No CUDA runtime is found, using CUDA_HOME='/root/micromamba/envs/python311'
      fatal: not a git repository (or any of the parent directories): .git


      torch.__version__  = 2.2.0.dev20231026


      running bdist_wheel
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-mjt1ot_6/flash-attn_d426c8d13b1a498c98aa462ee75f2537/setup.py", line 288, in <module>
          setup(
        File "/root/micromamba/envs/python311/lib/python3.11/site-packages/setuptools/__init__.py", line 103, in setup
          return distutils.core.setup(**attrs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/root/micromamba/envs/python311/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 185, in setup
          return run_commands(dist)
                 ^^^^^^^^^^^^^^^^^^
        File "/root/micromamba/envs/python311/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
          dist.run_commands()
        File "/root/micromamba/envs/python311/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
          self.run_command(cmd)
        File "/root/micromamba/envs/python311/lib/python3.11/site-packages/setuptools/dist.py", line 989, in run_command
          super().run_command(command)
        File "/root/micromamba/envs/python311/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/tmp/pip-install-mjt1ot_6/flash-attn_d426c8d13b1a498c98aa462ee75f2537/setup.py", line 265, in run
          wheel_url, wheel_filename = get_wheel_url()
                                      ^^^^^^^^^^^^^^^
        File "/tmp/pip-install-mjt1ot_6/flash-attn_d426c8d13b1a498c98aa462ee75f2537/setup.py", line 234, in get_wheel_url
          torch_cuda_version = parse(torch.version.cuda)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/root/micromamba/envs/python311/lib/python3.11/site-packages/packaging/version.py", line 52, in parse
          return Version(version)
                 ^^^^^^^^^^^^^^^^
        File "/root/micromamba/envs/python311/lib/python3.11/site-packages/packaging/version.py", line 195, in __init__
          match = self._regex.search(version)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
      TypeError: expected string or bytes-like object, got 'NoneType'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for flash-attn
  Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects
The command '/bin/bash -ec source ~/.bashrc && micromamba activate python311                       &&     MAX_JOBS=4 python -m pip install flash-attn --no-build-isolation' returned a non-zero code: 1

I would really appreciate it if anyone could have a look at this problem and give some advice. Thanks!
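
Not from the original report, but the traceback shows that torch.version.cuda is None inside the build container, i.e. a CPU-only PyTorch build got installed and flash-attn's setup.py cannot parse the CUDA version. A hedged way to verify and work around this before retrying:

# Hypothetical check inside the image: flash-attn needs a CUDA-enabled torch.
python -c "import torch; print(torch.__version__, torch.version.cuda)"
# If the CUDA version prints as "None", reinstalling a CUDA build of torch
# (e.g. from the cu121 wheel index) before installing flash-attn should help:
python -m pip install --force-reinstall torch --index-url https://download.pytorch.org/whl/cu121
MAX_JOBS=4 python -m pip install flash-attn --no-build-isolation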

[bug] - build crashes when following the standard ROCm instructions, related to batching code

To reproduce:

On a supported AMD machine, follow the ROCm instructions up to the building of the "lib", and the following crash occurs:

(python311) root@rumlinux:/workspace# rm -rf $PATH_TEST && mkdir $PATH_TEST && rm -rf $PATH_COMPILE && mkdir $PATH_COMPILE && ln -s ${WEIGHT_PATH} ${PATH_TEST}/params && cp $MODEL_CONFIG $PATH_COMPILE/config.json
(python311) root@rumlinux:/workspace# python -m mlc_llm.build \
        --model $PATH_COMPILE \
        --artifact-path $PATH_COMPILE \
        --quantization $QUANTIZATION \
        --max-seq-len 2048 \
        --num-shards $NUM_SHARDS \
        --target rocm --build-model-only
Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 112, in _get_module_details
  File "/mlc_llm/mlc_llm/__init__.py", line 6, in <module>
    from . import core
  File "/mlc_llm/mlc_llm/core.py", line 19, in <module>
    from mlc_llm.relax_model import (
  File "/mlc_llm/mlc_llm/relax_model/llama_batched_vllm.py", line 7, in <module>
    from tvm.relax.op.nn import attention_var_len
ImportError: cannot import name 'attention_var_len' from 'tvm.relax.op.nn' (/root/micromamba/envs/python311/lib/python3.11/site-packages/tvm/relax/op/nn/__init__.py)
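
Not part of the original report, but the ImportError points at a version mismatch between the mlc_llm sources and the installed TVM wheel. One way to confirm which TVM is being picked up and whether it carries the symbol:

# Hypothetical check: locate the installed TVM and test for the missing symbol.
python -c "import tvm; print(tvm.__file__)"
python -c "import tvm.relax.op.nn as nn; print(hasattr(nn, 'attention_var_len'))"
# If this prints False, the installed TVM wheel predates the batched-attention
# support that llama_batched_vllm.py expects; updating to a matching nightly
# build is the likely fix.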

llama.cpp thread parameter is suboptimal

The benchmark container for llama.cpp does not seem to set the number of threads manually. As of right now, more than one thread is of no use in llama.cpp when all layers can be offloaded with CUDA (the extra threads only add overhead/CPU load). So manually setting the number of threads to 1 should yield better performance.
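
A minimal sketch of what that might look like (the model path and prompt are placeholders, not the repo's benchmark script):

# Hypothetical invocation: pin llama.cpp to a single CPU thread when all
# layers are offloaded to the GPU.
./main -m ./llama-2-7b.Q4_0.gguf -ngl 99 -t 1 -n 256 -p "What is the meaning of life?"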

Docker container seems to be missing a Python dependency

When running python build.py --model /models/Llama-2-7b-chat-hf --target cuda --quantization q4f16_1 --artifact-path "./dist" --use-cache 0, I encountered the following error:

Traceback (most recent call last):
  File "/mlc_llm/build.py", line 4, in <module>
    main()
  File "/mlc_llm/mlc_llm/build.py", line 10, in main
    core.build_model_from_args(parsed_args)
  File "/mlc_llm/mlc_llm/core.py", line 584, in build_model_from_args
    build(mod, args)
  File "/mlc_llm/mlc_llm/core.py", line 467, in build
    utils.debug_dump_benchmark_script(
  File "/mlc_llm/mlc_llm/utils.py", line 108, in debug_dump_benchmark_script
    from tvm.dlight.benchmark import extract_all_func_info_from_relax
  File "/root/micromamba/envs/python311/lib/python3.11/site-packages/tvm/dlight/benchmark/__init__.py", line 18, in <module>
    from .bench import benchmark, benchmark_prim_func, benchmark_relax_func
  File "/root/micromamba/envs/python311/lib/python3.11/site-packages/tvm/dlight/benchmark/bench.py", line 28, in <module>
    from tvm.testing import rpc_run
  File "/root/micromamba/envs/python311/lib/python3.11/site-packages/tvm/testing/__init__.py", line 47, in <module>
    from .utils import *
  File "/root/micromamba/envs/python311/lib/python3.11/site-packages/tvm/testing/utils.py", line 85, in <module>
    import pytest
ModuleNotFoundError: No module named 'pytest'

I was able to fix it by just installing pytest via pip3 install pytest.
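
A hedged Dockerfile-style fix, mirroring the micromamba environment path seen in the traceback (the exact Dockerfile in this repo may differ):

# Hypothetical Dockerfile addition: install pytest into the build environment.
RUN source ~/.bashrc && micromamba activate python311 && \
    python -m pip install pytest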

[BUG] mlc-llm benchmark failed with ROCm

I followed the rocm-benchmark instructions step by step. The last step failed; the log is below.

$ python -m mlc_chat.cli.benchmark --model ${PATH_TEST}/params --device "rocm:0" --prompt "What is the meaning of life?" --generate-length 256

The key line of the error log is "[2024-01-17 08:53:37] ERROR model_metadata.py:93: FAILED to read metadata section in legacy model lib."

[Screenshot: mlc-llm-rocm-bm-failed]

raise ValueError("Cannot detect local CUDA GPU target!")

Hi, I'm new to your work. I've built the Docker image, tried to run the python build.py command, and received this:

Traceback (most recent call last):
  File "/mlc_llm/build.py", line 4, in <module>
    main()
  File "/mlc_llm/mlc_llm/build.py", line 9, in main
    parsed_args = core._parse_args(parsed_args)  # pylint: disable=protected-access
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mlc_llm/mlc_llm/core.py", line 192, in _parse_args
    utils.parse_target(parsed)
  File "/mlc_llm/mlc_llm/utils.py", line 370, in parse_target
    raise ValueError("Cannot detect local CUDA GPU target!")

I'm not sure what is going wrong, and I can see my GPUs using nvidia-smi.
I'm very happy to try your work on accelerating llama inference speed.
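
Not from the original thread, but this error typically means TVM cannot see a CUDA device from inside the container. Two hedged checks (the image name is a placeholder):

# Hypothetical check 1: make sure the container was started with GPU access.
docker run --rm --gpus all <image-name> nvidia-smi
# Hypothetical check 2: confirm the TVM inside the environment sees a CUDA device.
python -c "import tvm; print(tvm.cuda(0).exist)"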

llama.cpp compilation settings are suboptimal

llama.cpp has a compilation setting LLAMA_CUDA_MMV_Y which defaults to 1. However, on an RTX 3090, setting LLAMA_CUDA_MMV_Y=2 is ~2% faster, and I would expect the setting to also be beneficial for the hardware tested here.
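
For example, a hedged build invocation (assuming the cuBLAS make path llama.cpp used at the time; flag names may differ in newer versions):

# Hypothetical build: enable the CUDA backend and raise LLAMA_CUDA_MMV_Y to 2.
make clean
make LLAMA_CUBLAS=1 LLAMA_CUDA_MMV_Y=2 -j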
