
llm-viewer's Introduction

LLM-Viewer

LLM-Viewer is a tool for visualizing large language models (LLMs) and analyzing their performance on different hardware platforms. It enables network-wise analysis, considering factors such as peak memory consumption and total inference time cost. With LLM-Viewer, you can gain valuable insights into LLM inference and performance optimization. You can use LLM-Viewer in a web browser or as a command line interface (CLI) tool. The web version provides a user-friendly interface for easy configuration and visualization; you can access it at LLM-Viewer Web.

We invite you to read our paper LLM Inference Unveiled: Survey and Roofline Model Insights. In this paper, we provide a comprehensive analysis of the latest advancements in efficient LLM inference using LLM-Viewer.

This is an ongoing project and will be updated. TODO list:

  • Show shape of tensors.
  • Pre-process and post-process for non-transformer layers.
  • Show the whole network.
  • Expand hardware platform compatibility and allow manual configuration of hardware parameters.
  • Increase support for more LLMs and enable manual configuration of model graphs.

Workflow

LLM-Viewer Workflow

As shown in the Figure, the workflow consists of the following steps:

  1. Input the LLM and gather essential information about each layer, including the computation count, input and output tensor shapes, and data dependencies.
  2. Provide the hardware specification and generate a roofline model based on its computation capacity and memory bandwidth.
  3. Configure the inference settings, such as the batch size, prompt token length, and generation token length.
  4. Configure the optimization settings, such as the quantization bitwidth, utilization of FlashAttention, decoding methods, and other system optimization techniques.
  5. Use the LLM-Viewer Analyzer to analyze the performance of each layer based on the roofline model and layer information (a minimal sketch of this estimate appears after this list). It also tracks the memory usage of each layer and calculates the peak memory consumption from the data dependencies. The overall network performance of the LLM is obtained by aggregating the results of all layers.
  6. Generate a report that provides information such as the maximum performance and performance bottlenecks of each layer and the network, as well as the memory footprint. The report can be used to analyze curves, such as batch size-performance and sequence length-performance curves, to understand how different settings impact performance.
  7. Access the LLM-Viewer web viewer for convenient visualization of the network architecture and analysis results. This tool facilitates easy configuration adjustment and provides access to various data for each layer.
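
As a rough illustration of the per-layer estimate in step 5 (a minimal sketch with assumed placeholder hardware numbers, not the actual code in model_analyzer.py): a layer's time is bounded by whichever is slower, its compute time or its memory-traffic time.

# Minimal roofline sketch: a layer's time is the slower of its compute time
# (OPs / peak throughput) and its memory time (bytes moved / bandwidth).

def roofline_time(ops, memory_bytes, peak_ops_per_s, bandwidth_bytes_per_s):
    compute_time = ops / peak_ops_per_s
    memory_time = memory_bytes / bandwidth_bytes_per_s
    bound = "compute" if compute_time > memory_time else "memory"
    return max(compute_time, memory_time), bound

# Assumed placeholder numbers, roughly an A6000-class GPU:
PEAK_FP16_OPS = 155e12      # OPs/s
BANDWIDTH = 768e9           # bytes/s

# Example: a 4096x4096 fp16 matrix-vector product during decoding.
ops = 2 * 4096 * 4096            # each multiply-accumulate counts as 2 OPs
bytes_moved = 4096 * 4096 * 2    # fp16 weight traffic dominates at batch size 1
time_s, bound = roofline_time(ops, bytes_moved, PEAK_FP16_OPS, BANDWIDTH)
print(f"{time_s * 1e6:.1f} us, {bound}-bound")  # memory-bound at batch size 1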

Web Usage

To use LLM-Viewer in a web browser, go to the website LLM-Viewer Web. Click a node to see the detailed analysis of that layer.

CLI Usage

Clone the LLM-Viewer repository from GitHub:

git clone https://github.com/hahnyuan/LLM-Viewer.git

Install the requirements:

pip install transformers flask flask_cors easydict

To analyze an LLM using LLM-Viewer in the command line interface (CLI), run one of the following commands:

python3 analyze_cli.py facebook/opt-125m nvidia_A6000
python3 analyze_cli.py meta-llama/Llama-2-7b-hf nvidia_A6000 --batchsize 1 --seqlen 2048
python3 analyze_cli.py meta-llama/Llama-2-13b-hf nvidia_A6000 --batchsize 16 --seqlen 2048
python3 analyze_cli.py meta-llama/Llama-2-13b-hf nvidia_A6000 --batchsize 1 --seqlen 8192

# DiT models
python3 analyze_cli.py DiT-XL/2 nvidia_A6000 --batchsize 1 --seqlen 256 --source DiT

NOTE: The time estimated by the roofline model represents the theoretical best performance the hardware can achieve. The purpose of this tool is to help readers gain a clearer understanding of the key factors that influence LLM inference; only relative comparisons should be relied upon.

Citation

If you are using LLM-Viewer in your research, please cite our paper:

@misc{yuan2024llm,
      title={LLM Inference Unveiled: Survey and Roofline Model Insights}, 
      author={Zhihang Yuan and Yuzhang Shang and Yang Zhou and Zhen Dong and Chenhao Xue and Bingzhe Wu and Zhikai Li and Qingyi Gu and Yong Jae Lee and Yan Yan and Beidi Chen and Guangyu Sun and Kurt Keutzer},
      year={2024},
      eprint={2402.16363},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}


llm-viewer's Issues

Error message when running the example command line

I am trying to run the second example in the README/documentation and I keep getting errors. Any idea how to resolve this issue?

Thanks!

python3 analyze_cli.py meta-llama/Llama-2-7b-hf nvidia_A6000 --batchsize 1 --seqlen 2048
use config file configs/Llama.py for meta-llama/Llama-2-7b-hf
/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
Traceback (most recent call last):
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
response.raise_for_status()
File "/home/amin/.local/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/meta-llama/Llama-2-7b-hf/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/amin/.local/lib/python3.10/site-packages/transformers/utils/hub.py", line 399, in cached_file
resolved_file = hf_hub_download(
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1221, in hf_hub_download
return _hf_hub_download_to_cache_dir(
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1325, in _hf_hub_download_to_cache_dir
_raise_on_head_call_error(head_call_error, force_download, local_files_only)
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1823, in _raise_on_head_call_error
raise head_call_error
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1722, in _get_metadata_or_catch_error
metadata = get_hf_file_metadata(url=url, proxies=proxies, timeout=etag_timeout, headers=headers)
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1645, in get_hf_file_metadata
r = _request_wrapper(
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 372, in _request_wrapper
response = _request_wrapper(
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 396, in _request_wrapper
hf_raise_for_status(response)
File "/home/amin/.local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 321, in hf_raise_for_status
raise GatedRepoError(message, response) from e
huggingface_hub.utils._errors.GatedRepoError: 403 Client Error. (Request ID: Root=1-6658fd47-219371367a394d82037e9ad9;078c35d6-a2b7-442a-89fb-a61e9da0b517)

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-7b-hf/resolve/main/config.json.
Access to model meta-llama/Llama-2-7b-hf is restricted and you are not in the authorized list. Visit https://huggingface.co/meta-llama/Llama-2-7b-hf to ask for access.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/amin/LLM-Viewer/analyze_cli.py", line 34, in
analyzer = ModelAnalyzer(args.model_id, args.hardware, args.config_file,source=args.source)
File "/home/amin/LLM-Viewer/model_analyzer.py", line 41, in init
self.model_params = AutoConfig.from_pretrained(
File "/home/amin/.local/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 934, in from_pretrained
config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/home/amin/.local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 632, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/home/amin/.local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 689, in _get_config_dict
resolved_config_file = cached_file(
File "/home/amin/.local/lib/python3.10/site-packages/transformers/utils/hub.py", line 417, in cached_file
raise EnvironmentError(
OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Llama-2-7b-hf.
403 Client Error. (Request ID: Root=1-6658fd47-219371367a394d82037e9ad9;078c35d6-a2b7-442a-89fb-a61e9da0b517)

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-7b-hf/resolve/main/config.json.
Access to model meta-llama/Llama-2-7b-hf is restricted and you are not in the authorized list. Visit https://huggingface.co/meta-llama/Llama-2-7b-hf to ask for access.
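
The traceback above comes from the gated meta-llama repository rather than from LLM-Viewer itself. A likely fix (assuming access has already been granted on the model page) is to authenticate the Hugging Face client before rerunning the command:

# Request access at https://huggingface.co/meta-llama/Llama-2-7b-hf first, then:
huggingface-cli login
# or, non-interactively, export a token before running the analysis:
export HF_TOKEN=<your access token>
python3 analyze_cli.py meta-llama/Llama-2-7b-hf nvidia_A6000 --batchsize 1 --seqlen 2048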

The error between LLM-viewer predicted results and TensorRT-LLM real performance is large.

I compared the LLM-Viewer results with the TensorRT-LLM A100 performance numbers provided by NVIDIA.

We can see that the estimated generation throughput is higher than the real results.

Model    | Batch Size | TP | Input Length | Output Length | TRT-LLM Throughput (out tok/s/GPU) | LLM-Viewer est. throughput
LLaMA 7B | 256        | 1  | 128          | 128           | 5,353                              | 8,934.54
LLaMA 7B | 32         | 1  | 128          | 2048          | 1,518                              | 2,796.58
LLaMA 7B | 32         | 1  | 2048         | 128           | 547                                | 788.73
LLaMA 7B | 16         | 1  | 2048         | 2048          | 613                                | 1,169.17

For the prefill time, the estimated first-token latency is lower than the real results.

Model    | bs | tp | input | TensorRT-LLM 1st latency (ms) | LLM-Viewer est. 1st latency (s) | est. (ms)
LLaMA 7B | 1  | 1  | 128   | 16.1                          | 0.006977                        | 6.976999894
LLaMA 7B | 1  | 1  | 2048  | 120.5                         | 0.10088071                      | 100.88071
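
A back-of-the-envelope check of the 128-token prefill row (assumed numbers, not from the report: roughly 6.7e9 parameters and an A100 FP16 dense peak of about 312 TFLOPS) suggests the roofline estimate already sits near the compute roof, so the gap to the measured latency mostly reflects real kernels running below peak utilization:

# Rough prefill check for LLaMA 7B at input length 128 (assumed numbers).
params = 6.7e9
seqlen = 128
prefill_flops = 2 * params * seqlen      # matmul-dominated approximation
peak_flops = 312e12                      # assumed A100 FP16 dense peak, FLOPs/s
print(prefill_flops / peak_flops * 1e3)  # ~5.5 ms at 100% utilization
# The 6.98 ms estimate above is close to this roof; the measured 16.1 ms
# implies achieved utilization well below the peak the roofline assumes.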

This error makes it impossible to use LLM-Viewer to compare the performance of two hardware devices on the same task.

I feel that estimating precise computation time with the roofline model based on operators is very unreliable, and I would like to hear your opinion, @hahnyuan.

GQA Correction for KV memory

Hey!

In model_analyzer.py, the current code is:

        if use_flashattention:
            name = f"fused_attention"
            bandwidth, max_OPS, onchip_buffer = self.get_hardware_info()
            # flashattention-2 https://arxiv.org/pdf/2307.08691.pdf
            block_size_r = min(math.ceil(onchip_buffer / (kv_byte * head_size)), head_size)
            n_blocks_r = math.ceil(1 / block_size_r)
            q_numel = (1) * head_size * batchsize * num_attention_heads * a_byte
            o_numel = 1 * seqlen * batchsize * num_attention_heads * a_byte
            self._analyze_to_results(
                "decode",
                name,
                OPs=qk_matmul_OPs + sv_matmul_OPs + softmax_OPs,
                load_weight=0,
                load_act=q_numel,
                store_act=o_numel * 2,  # initialize O and save O
                load_kv_cache=n_blocks_r * (seqlen) * head_size * batchsize * num_attention_heads * kv_byte * 2,
                store_kv_cache=0,
            )

To correctly support GQA models, the load_kv_cache should instead be:

            load_kv_cache=n_blocks_r * (seqlen) * head_size * batchsize * num_key_value_heads * kv_byte * 2,

In the current implementation, the KV-cache traffic of a GQA model is computed as if it were standard multi-head attention (MHA).
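
A rough illustration of the difference (assumed GQA configuration resembling Llama-2-70B: 64 attention heads, 8 key/value heads, head size 128, fp16 KV cache):

# Compare the current and proposed KV-cache traffic terms for a GQA model.
seqlen, batchsize, head_size = 2048, 1, 128
num_attention_heads, num_key_value_heads = 64, 8
kv_byte = 2       # fp16
n_blocks_r = 1    # per the ceil(1 / block_size_r) term above

current  = n_blocks_r * seqlen * head_size * batchsize * num_attention_heads * kv_byte * 2
proposed = n_blocks_r * seqlen * head_size * batchsize * num_key_value_heads * kv_byte * 2
print(current / proposed)  # 8.0: the current code overestimates GQA KV traffic 8x here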

Thanks!

How can I get throughput for a generative model

I would like to get the throughput measured as (generated tokens) / (overall latency = prefill + decode elapsed time).
Could you please provide an example of this?

The analyze() function does not have a parameter such as promp_len.
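
A rough way to compute it from per-stage results (a generic formula, not tied to the exact signatures in model_analyzer.py):

# Throughput = generated tokens / (prefill time + decode time for all generated tokens).
def generation_throughput(prefill_time_s, decode_time_per_token_s, gen_tokens, batchsize=1):
    total_time_s = prefill_time_s + decode_time_per_token_s * gen_tokens
    return batchsize * gen_tokens / total_time_s  # generated tokens per second

print(generation_throughput(0.1, 0.02, 128))  # ~48 tok/s with made-up latencies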

A40 MAC number

Hi hahnyuan,

Good to know you added A40 to this tool. I'm wondering why you divided the FP16 OPS by 2 in the hardware configuration? According to official information, it should be 149 TFLOPS.

Thanks
