
llm-viewer's Introduction

LLM-Viewer

LLM-Viewer is a tool for visualizing large language models (LLMs) and analyzing their performance on different hardware platforms. It enables network-wise analysis, considering factors such as peak memory consumption and total inference time cost. With LLM-Viewer, you can gain valuable insights into LLM inference and performance optimization. You can use LLM-Viewer in a web browser or as a command line interface (CLI) tool. The web version provides a user-friendly interface for easy configuration and visualization; you can access it at LLM-Viewer Web.

We invite you to read our paper LLM Inference Unveiled: Survey and Roofline Model Insights. In this paper, we provide a comprehensive analysis of the latest advancements in efficient LLM inference using LLM-Viewer.

This is an ongoing project and will be updated. TODO list:

  • Show shape of tensors.
  • Pre-process and post-process for non-transformer layers.
  • Show the whole network.
  • Expand hardware platform compatibility and allow manual configuration of hardware parameters.
  • Increase support for more LLMs and enable manual configuration of model graphs.

Workflow

LLM-Viewer Workflow

As shown in the Figure, the workflow consists of the following steps:

  1. Input the LLM and gather essential information about each layer, including the computation count, input and output tensor shapes, and data dependencies.
  2. Provide the hardware specification and generate a roofline model based on its computation capacity and memory bandwidth.
  3. Configure the inference settings, such as the batch size, prompt token length, and generation token length.
  4. Configure the optimization settings, such as the quantization bitwidth, utilization of FlashAttention, decoding methods, and other system optimization techniques.
  5. Use the LLM-Viewer Analyzer to analyze the performance of each layer based on the roofline model and layer information (a minimal sketch of this estimate appears after this list). It also tracks the memory usage of each layer and calculates the peak memory consumption from the data dependencies. The overall network performance of the LLM is obtained by aggregating the results of all layers.
  6. Generate a report that provides information such as the maximum performance and performance bottlenecks of each layer and the network, as well as the memory footprint. The report can be used to analyze curves, such as batch size-performance and sequence length-performance curves, to understand how different settings impact performance.
  7. Access the LLM-Viewer web viewer for convenient visualization of the network architecture and analysis results. This tool facilitates easy configuration adjustment and provides access to various data for each layer.
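
As a rough illustration of the per-layer estimate in step 5 (a minimal sketch with assumed placeholder hardware numbers, not the actual code in model_analyzer.py): a layer's time is bounded by whichever is slower, its compute time or its memory-traffic time.

# Minimal roofline sketch: a layer's time is the slower of its compute time
# (OPs / peak throughput) and its memory time (bytes moved / bandwidth).

def roofline_time(ops, memory_bytes, peak_ops_per_s, bandwidth_bytes_per_s):
    compute_time = ops / peak_ops_per_s
    memory_time = memory_bytes / bandwidth_bytes_per_s
    bound = "compute" if compute_time > memory_time else "memory"
    return max(compute_time, memory_time), bound

# Assumed placeholder numbers, roughly an A6000-class GPU:
PEAK_FP16_OPS = 155e12      # OPs/s
BANDWIDTH = 768e9           # bytes/s

# Example: a 4096x4096 fp16 matrix-vector product during decoding.
ops = 2 * 4096 * 4096            # each multiply-accumulate counts as 2 OPs
bytes_moved = 4096 * 4096 * 2    # fp16 weight traffic dominates at batch size 1
time_s, bound = roofline_time(ops, bytes_moved, PEAK_FP16_OPS, BANDWIDTH)
print(f"{time_s * 1e6:.1f} us, {bound}-bound")  # memory-bound at batch size 1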

Web Usage

To use LLM-Viewer in a web browser, go to the website LLM-Viewer Web. Click a node to see the detailed analysis of that layer.

CLI Usage

Clone the LLM-Viewer repository from GitHub:

git clone https://github.com/hahnyuan/LLM-Viewer.git

Install the requirements:

pip install transformers flask flask_cors easydict

To analyze an LLM using LLM-Viewer in the command line interface (CLI), run one of the following commands:

python3 analyze_cli.py facebook/opt-125m nvidia_A6000
python3 analyze_cli.py meta-llama/Llama-2-7b-hf nvidia_A6000 --batchsize 1 --seqlen 2048
python3 analyze_cli.py meta-llama/Llama-2-13b-hf nvidia_A6000 --batchsize 16 --seqlen 2048
python3 analyze_cli.py meta-llama/Llama-2-13b-hf nvidia_A6000 --batchsize 1 --seqlen 8192

# DiT models
python3 analyze_cli.py DiT-XL/2 nvidia_A6000 --batchsize 1 --seqlen 256 --source DiT

NOTE: The time estimated by the roofline model represents the theoretical best performance the hardware can achieve. The purpose of this tool is to help readers gain a clearer understanding of the key factors that influence LLM inference; only relative comparisons should be relied upon.

Citation

If you are using LLM-Viewer in your research, please cite our paper:

@misc{yuan2024llm,
      title={LLM Inference Unveiled: Survey and Roofline Model Insights}, 
      author={Zhihang Yuan and Yuzhang Shang and Yang Zhou and Zhen Dong and Chenhao Xue and Bingzhe Wu and Zhikai Li and Qingyi Gu and Yong Jae Lee and Yan Yan and Beidi Chen and Guangyu Sun and Kurt Keutzer},
      year={2024},
      eprint={2402.16363},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}


llm-viewer's Issues

Error message when running the example command line

I am trying to run the second example in the README/documentation and I keep getting errors. Any idea how to resolve this issue?

Thanks!

python3 analyze_cli.py meta-llama/Llama-2-7b-hf nvidia_A6000 --batchsize 1 --seqlen 2048
use config file configs/Llama.py for meta-llama/Llama-2-7b-hf
/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
Traceback (most recent call last):
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
response.raise_for_status()
File "/home/amin/.local/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/meta-llama/Llama-2-7b-hf/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/amin/.local/lib/python3.10/site-packages/transformers/utils/hub.py", line 399, in cached_file
resolved_file = hf_hub_download(
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1221, in hf_hub_download
return _hf_hub_download_to_cache_dir(
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1325, in _hf_hub_download_to_cache_dir
_raise_on_head_call_error(head_call_error, force_download, local_files_only)
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1823, in _raise_on_head_call_error
raise head_call_error
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1722, in _get_metadata_or_catch_error
metadata = get_hf_file_metadata(url=url, proxies=proxies, timeout=etag_timeout, headers=headers)
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1645, in get_hf_file_metadata
r = _request_wrapper(
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 372, in _request_wrapper
response = _request_wrapper(
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 396, in _request_wrapper
hf_raise_for_status(response)
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 321, in hf_raise_for_status
raise GatedRepoError(message, response) from e
huggingface_hub.utils._errors.GatedRepoError: 403 Client Error. (Request ID: Root=1-6658fd47-219371367a394d82037e9ad9;078c35d6-a2b7-442a-89fb-a61e9da0b517)

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-7b-hf/resolve/main/config.json.
Access to model meta-llama/Llama-2-7b-hf is restricted and you are not in the authorized list. Visit https://huggingface.co/meta-llama/Llama-2-7b-hf to ask for access.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/amin/LLM-Viewer/analyze_cli.py", line 34, in
analyzer = ModelAnalyzer(args.model_id, args.hardware, args.config_file,source=args.source)
File "/home/amin/LLM-Viewer/model_analyzer.py", line 41, in init
self.model_params = AutoConfig.from_pretrained(
File "/home/amin/.local/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 934, in from_pretrained
config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/home/amin/.local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 632, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/home/amin/.local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 689, in _get_config_dict
resolved_config_file = cached_file(
File "/home/amin/.local/lib/python3.10/site-packages/transformers/utils/hub.py", line 417, in cached_file
raise EnvironmentError(
OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Llama-2-7b-hf.
403 Client Error. (Request ID: Root=1-6658fd47-219371367a394d82037e9ad9;078c35d6-a2b7-442a-89fb-a61e9da0b517)

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-7b-hf/resolve/main/config.json.
Access to model meta-llama/Llama-2-7b-hf is restricted and you are not in the authorized list. Visit https://huggingface.co/meta-llama/Llama-2-7b-hf to ask for access.
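
The traceback above comes from the gated meta-llama repository rather than from LLM-Viewer itself. A likely fix (assuming access has already been granted on the model page) is to authenticate the Hugging Face client before rerunning the command:

# Request access at https://huggingface.co/meta-llama/Llama-2-7b-hf first, then:
huggingface-cli login
# or, non-interactively, export a token before running the analysis:
export HF_TOKEN=<your access token>
python3 analyze_cli.py meta-llama/Llama-2-7b-hf nvidia_A6000 --batchsize 1 --seqlen 2048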

The error between LLM-viewer predicted results and TensorRT-LLM real performance is large.

I compared the LLM-Viewer results with the TensorRT-LLM A100 performance numbers provided by NVIDIA.

We can see that the estimated generation throughput is higher than the real results.

Model    | Batch Size | TP | Input Length | Output Length | TRT-LLM Throughput (out tok/s/GPU) | LLM-Viewer est. throughput
LLaMA 7B | 256        | 1  | 128          | 128           | 5,353                              | 8,934.54
LLaMA 7B | 32         | 1  | 128          | 2048          | 1,518                              | 2,796.58
LLaMA 7B | 32         | 1  | 2048         | 128           | 547                                | 788.73
LLaMA 7B | 16         | 1  | 2048         | 2048          | 613                                | 1,169.17

For the prefill time, the estimated first-token latency is lower than the real results.

Model    | bs | tp | input | TensorRT-LLM 1st latency (ms) | LLM-Viewer est. 1st latency (s) | est. (ms)
LLaMA 7B | 1  | 1  | 128   | 16.1                          | 0.006977                        | 6.976999894
LLaMA 7B | 1  | 1  | 2048  | 120.5                         | 0.10088071                      | 100.88071
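
A back-of-the-envelope check of the 128-token prefill row (assumed numbers, not from the report: roughly 6.7e9 parameters and an A100 FP16 dense peak of about 312 TFLOPS) suggests the roofline estimate already sits near the compute roof, so the gap to the measured latency mostly reflects real kernels running below peak utilization:

# Rough prefill check for LLaMA 7B at input length 128 (assumed numbers).
params = 6.7e9
seqlen = 128
prefill_flops = 2 * params * seqlen      # matmul-dominated approximation
peak_flops = 312e12                      # assumed A100 FP16 dense peak, FLOPs/s
print(prefill_flops / peak_flops * 1e3)  # ~5.5 ms at 100% utilization
# The 6.98 ms estimate above is close to this roof; the measured 16.1 ms
# implies achieved utilization well below the peak the roofline assumes.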

This error makes it impossible to use LLM-Viewer to compare the performance of two hardware devices on the same task.

I feel that estimating precise computation time with the roofline model based on operators is very unreliable, and I would like to hear your opinion, @hahnyuan.

GQA Correction for KV memory

Hey!

In model_analyzer.py, the current code is:

        if use_flashattention:
            name = f"fused_attention"
            bandwidth, max_OPS, onchip_buffer = self.get_hardware_info()
            # flashattention-2 https://arxiv.org/pdf/2307.08691.pdf
            block_size_r = min(math.ceil(onchip_buffer / (kv_byte * head_size)), head_size)
            n_blocks_r = math.ceil(1 / block_size_r)
            q_numel = (1) * head_size * batchsize * num_attention_heads * a_byte
            o_numel = 1 * seqlen * batchsize * num_attention_heads * a_byte
            self._analyze_to_results(
                "decode",
                name,
                OPs=qk_matmul_OPs + sv_matmul_OPs + softmax_OPs,
                load_weight=0,
                load_act=q_numel,
                store_act=o_numel * 2,  # initialize O and save O
                load_kv_cache=n_blocks_r * (seqlen) * head_size * batchsize * num_attention_heads * kv_byte * 2,
                store_kv_cache=0,
            )

To correctly support GQA models, the load_kv_cache should instead be:

            load_kv_cache=n_blocks_r * (seqlen) * head_size * batchsize * num_key_value_heads * kv_byte * 2,

In the current implementation, the KV-cache traffic of a GQA model is computed as if it were standard multi-head attention (MHA).
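
A rough illustration of the difference (assumed GQA configuration resembling Llama-2-70B: 64 attention heads, 8 key/value heads, head size 128, fp16 KV cache):

# Compare the current and proposed KV-cache traffic terms for a GQA model.
seqlen, batchsize, head_size = 2048, 1, 128
num_attention_heads, num_key_value_heads = 64, 8
kv_byte = 2       # fp16
n_blocks_r = 1    # per the ceil(1 / block_size_r) term above

current  = n_blocks_r * seqlen * head_size * batchsize * num_attention_heads * kv_byte * 2
proposed = n_blocks_r * seqlen * head_size * batchsize * num_key_value_heads * kv_byte * 2
print(current / proposed)  # 8.0: the current code overestimates GQA KV traffic 8x here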

Thanks!

How can I get throughput for a generative model

I would like to get the throughput measured as (generated tokens) / (overall latency = prefill + decode elapsed time).
Could you please provide an example of this?

The analyze() function does not have a parameter such as promp_len.
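
A rough way to compute it from per-stage results (a generic formula, not tied to the exact signatures in model_analyzer.py):

# Throughput = generated tokens / (prefill time + decode time for all generated tokens).
def generation_throughput(prefill_time_s, decode_time_per_token_s, gen_tokens, batchsize=1):
    total_time_s = prefill_time_s + decode_time_per_token_s * gen_tokens
    return batchsize * gen_tokens / total_time_s  # generated tokens per second

print(generation_throughput(0.1, 0.02, 128))  # ~48 tok/s with made-up latencies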

A40 MAC number

Hi hahnyuan,

Good to know you added A40 to this tool. I'm wondering why you divided the FP16 OPS by 2 in the hardware configuration? According to official information, it should be 149 TFLOPS.

Thanks
