
llm-tpu's Introduction

Table of Contents

Introduction

This project deploys various open-source generative AI models, mainly LLMs, on the Sophgo BM1684X chip. Models are converted to bmodel with the TPU-MLIR compiler and then deployed with C++ code to either a PCIe or SoC environment. A walkthrough using ChatGLM2-6B as an example was published on Zhihu to help readers understand the source code: ChatGLM2流程解析与TPU-MLIR部署 (ChatGLM2 workflow analysis and TPU-MLIR deployment).
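For orientation, every Python demo in this repository follows the same loop: a Hugging Face tokenizer encodes the prompt on the host, the bmodel runs one prefill pass and then one forward pass per generated token on the TPU, and the tokenizer decodes the result. The sketch below illustrates that loop. The forward_first/forward_next names mirror the demos' pipeline.py, but the chat module import, constructor, and init signature are assumptions rather than the exact API.

# Minimal sketch of a python_demo generation loop (names partly assumed).
from transformers import AutoTokenizer
import chat  # compiled C++ extension (chat.cpython-*.so) built per model; assumed import

tokenizer = AutoTokenizer.from_pretrained("../token_config", trust_remote_code=True)
model = chat.Model()                                   # assumed wrapper class
model.init([0], "llama2-7b_int4_1dev.bmodel")          # device id list + bmodel path (assumed signature)

tokens = tokenizer.encode("Hello")
token = model.forward_first(tokens)                    # prefill: runs the whole prompt at once
answer = []
while token != tokenizer.eos_token_id and len(answer) < 512:
    answer.append(token)
    token = model.forward_next()                       # decode: one token per call
print(tokenizer.decode(answer))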

Model Overview

The models deployed so far are listed below (in alphabetical order):

Model INT4 INT8 FP16/BF16 Huggingface Link
Baichuan2-7B LINK
ChatGLM3-6B LINK
CodeFuse-7B LINK
DeepSeek-6.7B LINK
Falcon-40B LINK
Phi-3-mini-4k LINK
Qwen-7B LINK
Qwen-14B LINK
Qwen-72B LINK
Qwen1.5-0.5B LINK
Qwen1.5-1.8B LINK
Qwen1.5-7B LINK
Llama2-7B LINK
Llama2-13B LINK
Llama3-8B LINK
Llama3.1-8B LINK
LWM-Text-Chat LINK
Mistral-7B-Instruct LINK
Stable Diffusion LINK
Stable Diffusion XL LINK
WizardCoder-15B LINK
Yi-6B-chat LINK
Yi-34B-chat LINK
Qwen-VL-Chat LINK

For conversion details and source code, see the models subdirectory of this project, which documents how each model is deployed.

If you are interested in our chips, you can also contact us through the SOPHGO official website.

Quick Start

Clone the LLM-TPU repository and run the run.sh script:

git clone https://github.com/sophgo/LLM-TPU.git
./run.sh --model llama2-7b

See Quick Start for details.

Example Output

A successful run produces output like the screenshot below.

Command Table

The full set of commands for the models currently used for demos is shown in the table below:

Model SoC PCIE
ChatGLM3-6B ./run.sh --model chatglm3-6b --arch soc ./run.sh --model chatglm3-6b --arch pcie
Llama2-7B ./run.sh --model llama2-7b --arch soc ./run.sh --model llama2-7b --arch pcie
Llama3-7B ./run.sh --model llama3-7b --arch soc ./run.sh --model llama3-7b --arch pcie
Qwen-7B ./run.sh --model qwen-7b --arch soc ./run.sh --model qwen-7b --arch pcie
Qwen1.5-1.8B ./run.sh --model qwen1.5-1.8b --arch soc ./run.sh --model qwen1.5-1.8b --arch pcie
LWM-Text-Chat ./run.sh --model lwm-text-chat --arch soc ./run.sh --model lwm-text-chat --arch pcie
WizardCoder-15B ./run.sh --model wizardcoder-15b --arch soc ./run.sh --model wizardcoder-15b --arch pcie

Advanced Features

The advanced features are described below:

Feature Directory Description
Multi-chip ChatGLM3/parallel_demo ChatGLM3 on 2 chips
Llama2/demo_parallel Llama2 on 4/6/8 chips
Qwen/demo_parallel Qwen on 4/6/8 chips
Qwen1_5/demo_parallel Qwen1_5 on 4/6/8 chips
Speculative sampling Qwen/jacobi_demo LookaheadDecoding
Qwen1_5/speculative_sample_demo Speculative sampling
Prefill reuse Qwen/prompt_cache_demo Prefill reuse for a shared prefix
Qwen/share_cache_demo Prefill reuse for a shared prefix
Qwen1_5/share_cache_demo Prefill reuse for a shared prefix
Model encryption Qwen/share_cache_demo Model encryption
Qwen1_5/share_cache_demo Model encryption

FAQ

Please refer to the LLM-TPU FAQ.

Reference Links


llm-tpu's Issues

Qwen1.5-1.8B and Qwen2-7B give repetitive answers toward the end of inference

SoC environment
transformers: 4.42.4
torch: 2.3.1
LLM-TPU: 9a744f0/latest 2024.07.23
driver version: 0.5.1

linaro@bm1684:/usr/lib/cmake/libsophon$ bm_version
SophonSDK version: v24.04.01
sophon-soc-libsophon : 0.5.1
sophon-mw-soc-sophon-ffmpeg : 0.10.0
sophon-mw-soc-sophon-opencv : 0.10.0
BL2 v2.7(release):7b2c33d Built : 16:02:07, Jun 24 2024
BL31 v2.7(release):7b2c33d Built : 16:02:07, Jun 24 2024
U-Boot 2022.10 7b2c33d (Jun 24 2024 - 16:01:43 +0800) Sophon BM1684X
KernelVersion : Linux bm1684 5.4.217-bm1684-g27254622663c #1 SMP Mon Jun 24 16:02:21 CST 2024 aarch64 aarch64 aarch64 GNU/Linux
HWVersion: 0x00
MCUVersion: 0x01

(screenshots of the repetitive output)

Occasionally the answers are normal, but it happens like this most of the time.

Support for Llama 3.1 model

Are there instructions specific to creating a bmodel from ONNX for Llama 3.1 (not Llama 3)?

Running the following errors out:
python export_onnx.py --model_path ../../../../Meta-Llama-3.1-8B-Instruct/ --seq_length 1024

Convert block & block_cache
0%| | 0/32 [00:00<?, ?it/s]The attention layers in this model are transitioning from computing the RoPE embeddings internally through position_ids (2D tensor with the indexes of the tokens), to using externally computed position_embeddings (Tuple of tensors, containing cos and sin). In v4.45 position_ids will be removed and position_embeddings will be mandatory.

GPU memory allocation failure

While doing secondary development on LLM-TPU (adapting a niche domestic model), device memory allocation failed and the program crashed.
The error output is roughly as follows:

[BMRT][load_bmodel:1573] INFO:Loading bmodel from [../../bmodels/rwkv6-1b5_bf16_1dev.bmodel]. Thanks for your patience...
[BMRT][load_bmodel:1501] INFO:pre net num: 0, load net num: 1
[bmlib_memory][error] bm_alloc_gmem failed, dev_id = 0, size = 0xc0fb4000

However, the model I exported is noticeably smaller than the llama model file in the official example.
My model size:

du -shb ../../bmodels/rwkv6-1b5_bf16_1dev.bmodel
3244769616      ../../bmodels/rwkv6-1b5_bf16_1dev.bmodel

The llama model size:

du -shb ../../../LLM-TPU/bmodels/llama2-7b_int4_1dev.bmodel 
3888418068      ../../../LLM-TPU/bmodels/llama2-7b_int4_1dev.bmodel

llama3 is not available after conversion

log:

root@bm1684:/data/LLM-TPU/models/Llama3/python_demo# python3 pipeline.py -m /data/models/llama3-8b_int8_1dev_512.bmodel -t ../token_config/ --devid 0
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Load ../token_config/ ...
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Device [ 0 ] loading ....
[BMRT][bmcpu_setup:436] INFO:cpu_lib 'libcpuop.so' is loaded.
bmcpu init: skip cpu_user_defined
open usercpu.so, init user_cpu_init 
[BMRT][BMProfile:60] INFO:Profile For arch=3
[BMRT][BMProfileDeviceBase:190] INFO:gdma=0, tiu=0, mcu=0
Model[/data/models/llama3-8b_int8_1dev_512.bmodel] loading ....
[BMRT][load_bmodel:1696] INFO:Loading bmodel from [/data/models/llama3-8b_int8_1dev_512.bmodel]. Thanks for your patience...
[BMRT][load_bmodel:1583] INFO:Bmodel loaded, version 2.2+v1.6.beta.0-243-ga948d0acb-20240507
[BMRT][load_bmodel:1585] INFO:pre net num: 0, load net num: 69
[BMRT][load_tpu_module:1674] INFO:loading firmare in bmodel
[BMRT][preload_funcs:1876] INFO: core_id=0, multi_fullnet_func_id=27
[BMRT][preload_funcs:1879] INFO: core_id=0, dynamic_fullnet_func_id=28
Done!

=================================================================
1. If you want to quit, please enter one of [q, quit, exit]
2. To create a new chat session, please enter one of [clear, new]
=================================================================

Question: 你好 

Answer: You are Llama3, a helpful AI assistant.博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博士博

Llama3 web_demo code is outdated and the output never terminates

When I try to use web_demo for Llama3 with gradio 3.39.0 from requirements.txt, it raises an error:

AttributeError: 'Textbox' object has no attribute 'style'

After I remove the style call on the Textbox it runs, but the Llama3 output never stops until the input grows beyond the maximum length, which is very strange.
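For what it is worth, the GradioDeprecationWarning shown later on this page ("The `style` method is deprecated. Please set these arguments in the constructor instead.") points at the fix: newer gradio releases remove Textbox.style() entirely, so the styling options have to move into the constructor. A hedged sketch follows; container=False is only an illustrative argument, since the original .style(...) call is truncated in the log.

import gradio as gr

# gradio <= 3.x (deprecated): gr.Textbox(show_label=False, placeholder="Input...", lines=10).style(...)
# newer gradio: pass the same options to the constructor instead
user_input = gr.Textbox(show_label=False, placeholder="Input...", lines=10,
                        container=False)  # illustrative; use whatever the original .style() call set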

Standard issue format — please ask questions this way (important) (very important)

Standard example:

Environment:

SoC environment
transformers: 4.32.0
torch: 2.0.1+cpu
LLM-TPU: 6fcc8bf/latest 2024.06.30
tpu-mlir: d0cbae7 2024.06.30
driver version: 0.5.1
libsophon: #1 SMP Sun Jun 16 05:39:19 CST 2024

Path:

/workspace/LLM-TPU/models/Qwen1_5/python_demo

Command:

python3 pipeline.py --model_path ../compile/qwen1.5-1.8b_f16_seq4096_1dev.bmodel --tokenizer_path ../token_config/ --devid 12 --generation_mode penalty_sample

Problem:

(screenshot of the error)

Other:

The model I compiled myself does not run, while the one launched with ./run.sh --model llama2 --arch soc works.

Explanation of the example

Environment:

SoC environment (state whether it is a SoC or PCIe environment; the two are handled differently)
transformers: 4.32.0 (optional, but the transformers and torch versions are needed for ONNX- and tokenizer-related problems)
torch: 2.0.1+cpu (optional, but the transformers and torch versions are needed for ONNX- and tokenizer-related problems)
LLM-TPU: 6fcc8bf/latest 2024.06.30 (check with git log; the exact commit id is optional, but the date is required)
tpu-mlir: d0cbae7 2024.06.30 (if you compiled the model yourself, include the tpu-mlir version; the exact commit id is optional, but the date is required)
driver version: 0.5.1 (check with bm-smi)
libsophon: #1 SMP Sun Jun 16 05:39:19 CST 2024 (on SoC use uname -v; on PCIe use cat /proc/bmsophon/driver_version)
(well over half of all issues turn out to be version problems)

Path:

/workspace/LLM-TPU/models/Qwen1_5/python_demo

Command:

python3 pipeline.py --model_path ../compile/qwen1.5-1.8b_f16_seq4096_1dev.bmodel --tokenizer_path ../token_config/ --devid 12 --generation_mode penalty_sample

Problem:

(screenshot of the error)

(The screenshot must clearly show the command entered, the exact error, and the path used; if the path is sensitive, redact it, but keep everything after LLM-TPU.)
(Also attach a bm-smi screenshot; bm-smi shows the device memory usage.)

Other:

If you compiled the model yourself, state whether the pre-built model pulled from the repository runs correctly.

Llama3 pipeline outputs the � decoding error

When I run Llama3 following the instructions, it sometimes outputs the "�" replacement character. This happens more often when the output contains Chinese, such as "烽":

Question: my name is 张子烽

Answer: Nice to meet you, 张子���! I'm Llama3, your helpful AI assistant. How can I assist you today? Do you have any questions, topics you'd like to discuss, or tasks you'd like to accomplish? I'm here to help!
FTL: 0.747 s
TPS: 9.483 token/s
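A likely explanation (not confirmed against this repo's code) is that each generated token is decoded to text on its own: many Chinese characters span several tokens at the byte level, so decoding a partial UTF-8 sequence produces the replacement character �. A common workaround is to buffer tokens and only emit text once it decodes cleanly, as in this sketch, where generate_tokens() is a hypothetical stand-in for the model's token-by-token output and tokenizer is the Hugging Face tokenizer.

# Sketch: accumulate tokens and print only fully decodable UTF-8 text.
printed = ""
tokens = []
for token in generate_tokens():            # hypothetical token-by-token source
    tokens.append(token)
    text = tokenizer.decode(tokens)
    if text.endswith("\ufffd"):            # trailing bytes form an incomplete character
        continue                           # wait for the next token(s)
    print(text[len(printed):], end="", flush=True)
    printed = text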

Question about CV180x support

Hello, can the LLMs here be deployed to the TPU on CV180x? I would like to deploy one, but I am worried about hardware limitations; the BM series seems far more capable than the CV series.

Problems when converting Qwen1.5

The example mixes up the 1.8B and 7B models; please fix this. I am converting the 0.5B model and am currently stuck at the ONNX export stage.

Error 1: position shape mismatch (resolved)

The error is as follows:

  File "C:\ProgramData\Anaconda3\envs\hf\lib\site-packages\transformers\models\qwen2\modeling_qwen2.py", line 696, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
  File "C:\ProgramData\Anaconda3\envs\hf\lib\site-packages\transformers\models\qwen2\modeling_qwen2.py", line 172, in apply_rotary_pos_emb
    q_embed = (q * cos) + (rotate_half(q) * sin)
RuntimeError: The size of tensor a (16) must match the size of tensor b (512) at non-singleton dimension 1

I am not sure whether this is a 1.8B vs 0.5B difference, but I suspect the modified code is at fault. Let's look at the shapes.
The modified function is:

# the input cos and sin both have shape [512, 64]
# q and k both have shape [1, 16, 512, 64]
cos = cos.squeeze(1).squeeze(0)  # [seq_len, dim]
sin = sin.squeeze(1).squeeze(0)  # [seq_len, dim]
cos = cos[position_ids].unsqueeze(1)
sin = sin[position_ids].unsqueeze(1)  # [1, 1, 512, 64]
cos = cos.transpose(1, 2)
sin = sin.transpose(1, 2)
q_embed = (q * cos) + (rotate_half(q) * sin)  # error raised here: shapes do not match
k_embed = (k * cos) + (rotate_half(k) * sin)

Now compare with the original function:

# the input cos and sin are both [21, 64], because the sentence length is 21
# the input k and q both have shape [1, 16, 21, 64]
cos = cos[position_ids].unsqueeze(unsqueeze_dim)
sin = sin[position_ids].unsqueeze(unsqueeze_dim)  # [1, 1, 21, 64]
q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)

So the modification added two extra transposes: they turn cos/sin from [1, 1, 512, 64] into [1, 512, 1, 64], which no longer broadcasts against q's [1, 16, 512, 64] and gives the "16 vs 512 at dimension 1" mismatch. After deleting them, this error no longer appears.

Error 2: the first block does not output a cache

The first exported block has no KV output. Changing the relevant code as follows lets it run, but it does not feel right:
class QwenBlock(torch.nn.Module):

    def __init__(self, layer_id):
        super().__init__()
        self.layer_id = layer_id
        self.layer = layers[layer_id]

    # input_ids, attention_mask, position_ids
    def forward(self, hidden_states, position_ids, attention_mask):
        hidden_states, past_kv = self.layer(
            hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            use_cache=True)
        if past_kv is not None:
            present_k, present_v = past_kv
            return hidden_states.float(), present_k.float(), present_v.float()
        else:
            present_k, present_v = torch.tensor([0]), torch.tensor([0])
            return hidden_states.float(), present_k, present_v

Error 3: a Hugging Face cache object is expected as input

The error is:

    kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
AttributeError: 'tuple' object has no attribute 'get_usable_length'

From what I can tell, this class is Hugging Face's DynamicCache, but the input is a plain tuple, so the method does not exist? I am not sure why these problems occur.
It does not seem to be caused by 0.5B vs 1.8B, since their config files are basically identical apart from the linear channel sizes. Could you help me check?
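For reference, transformers 4.36 and later expect past_key_value to be a Cache object (e.g. DynamicCache) rather than a plain tuple, which is exactly why get_usable_length is missing here. One option is to pin transformers to the version the export script was written against (4.32.0 in the standard issue template above); another, assuming the standard transformers API, is to wrap the legacy tuples before calling the layer, as in this sketch:

from transformers.cache_utils import DynamicCache

# past_key_values: the legacy tuple of (key, value) pairs, one pair per layer
cache = DynamicCache.from_legacy_cache(past_key_values)   # Cache object with get_usable_length()
# ...then pass `cache` to the decoder layer instead of the raw tuple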

Bug in untils.h: when dumping a tensor for inspection, only 1/4 or 1/2 of the values are visible; the rest are 0

File

untils.h

Bug description

Incorrect API call (presumably a slip made in haste)

Affected functions:

dump_bf16_tensor,dump_fp16_tensor,dump_fp32_tensor,dump_int_tensor
These functions call the device-to-system copy function bm_memcpy_d2s_partial_offset, but the size is not scaled by the element width, so when a tensor is dumped and printed, only 1/4 or 1/2 of its values show up and the rest are 0.
Documentation of bm_memcpy_d2s_partial_offset:

bm_status_t bm_memcpy_d2s_partial_offset(bm_handle_t handle, void *dst, bm_device_mem_t src, unsigned int size, unsigned int offset)
To copy specified bytes of data from device memory to system memory with an offset in device memory address.
Parameters:
[in] – handle The device handle
[in] – dst The destination memory (system memory, a void* pointer)
[in] – src The source memory (device memory descriptor)
[in] – size The size of data to copy (in bytes)
[in] – offset The offset of the device memory address

Per this description, size should be multiplied by the number of bytes per element of the data type:

dtype            int32(int)  fp16  bf16  fp32
size multiplier  4           2     2     4

Suggested fix

void dump_bf16_tensor(bm_handle_t bm_handle, bm_device_mem_t mem, int offset,
                      int size) {
  std::vector<uint16_t> data(size);
  bm_memcpy_d2s_partial_offset(bm_handle, data.data(), mem, size * 2, offset);
  std::cout << "-------------------------------------" << std::endl;
  fp32 t;
  for (int i = 0; i < size; i++) {
    t.bits = bf16_to_fp32_bits(data[i]);
    std::cout << t.fval << std::endl;
  }
  std::cout << "-------------------------------------" << std::endl;
}

void dump_fp16_tensor(bm_handle_t bm_handle, bm_device_mem_t mem, int offset,
                      int size) {
  std::vector<uint16_t> data(size);
  bm_memcpy_d2s_partial_offset(bm_handle, data.data(), mem, size * 2, offset);
  std::cout << "-------------------------------------" << std::endl;
  fp32 t;
  for (int i = 0; i < size; i++) {
    t.bits = fp16_ieee_to_fp32_bits(data[i]);
    std::cout << t.fval << std::endl;
  }
  std::cout << "-------------------------------------" << std::endl;
}

void dump_fp32_tensor(bm_handle_t bm_handle, bm_device_mem_t mem, int offset,
                      int size) {
  std::vector<float> data(size);
  std::cout << "dump size " << data.size() << std::endl;
  bm_memcpy_d2s_partial_offset(bm_handle, data.data(), mem, size * 4, offset);
  std::cout << "-------------------------------------" << std::endl;
  for (int i = 0; i < size; i++) {
    std::cout << data[i] << std::endl;
  }
  std::cout << "-------------------------------------" << std::endl;
  auto ptr = data.data();
  ptr[0] = ptr[0];
}

void dump_int_tensor(bm_handle_t bm_handle, bm_device_mem_t mem, int offset,
                     int size) {
  std::vector<int> data(size);
  bm_memcpy_d2s_partial_offset(bm_handle, data.data(), mem, size * 4, offset);
  std::cout << "-------------------------------------" << std::endl;
  for (int i = 0; i < size; i++) {
    std::cout << data[i] << std::endl;
  }
  std::cout << "-------------------------------------" << std::endl;
  auto ptr = data.data();
  ptr[0] = ptr[0];
}

Qwen2-7B-Instruct ONNX export error

I have already replaced .venv/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py.
The following error occurs during ONNX export:

(.venv) root@f9ba65967758:/workspace/test/LLM-TPU/models/Qwen2/compile# python3 export_onnx.py --model_path /workspace/Qwen2-7B-Instruct/ --seq_length 1024 --device cpu
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:28<00:00,  7.02s/it]
Layers: 28
Hidden size: 3584

Convert block & block_cache
  0%|                                                                                                                                                                       | 0/1 [00:00<?, ?it/s]/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py:120: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if seq_len > self.max_seq_len_cached:
  0%|                                                                                                                                                                       | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/workspace/test/LLM-TPU/models/Qwen2/compile/export_onnx.py", line 251, in <module>
    convert_block(i)
  File "/workspace/test/LLM-TPU/models/Qwen2/compile/export_onnx.py", line 162, in convert_block
    torch.onnx.export(
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/torch/onnx/utils.py", line 516, in export
    _export(
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/torch/onnx/utils.py", line 1612, in _export
    graph, params_dict, torch_out = _model_to_graph(
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/torch/onnx/utils.py", line 1134, in _model_to_graph
    graph, params, torch_out, module = _create_jit_graph(model, args)
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/torch/onnx/utils.py", line 1010, in _create_jit_graph
    graph, torch_out = _trace_and_get_graph_from_model(model, args)
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/torch/onnx/utils.py", line 914, in _trace_and_get_graph_from_model
    trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/torch/jit/_trace.py", line 1310, in _get_trace_graph
    outs = ONNXTracedModule(
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/torch/jit/_trace.py", line 138, in forward
    graph, out = torch._C._create_graph_by_tracing(
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/torch/jit/_trace.py", line 129, in wrapper
    outs.append(self.inner(*trace_inputs))
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/workspace/test/LLM-TPU/models/Qwen2/compile/export_onnx.py", line 73, in forward
    hidden_states, past_kv = self.layer(
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 786, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 695, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
  File "/workspace/test/LLM-TPU/models/Qwen2/.venv/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 171, in apply_rotary_pos_emb
    q_embed = (q * cos) + (rotate_half(q) * sin)
RuntimeError: The size of tensor a (28) must match the size of tensor b (1024) at non-singleton dimension 

chat.cpp:141: void Qwen::init(const std::vector<int>&, std::string): Assertion `true == ret' failed.

Environment:

SoC environment
transformers: 4.42.4
torch: 2.3.1
LLM-TPU: 9a744f0/latest 2024.07.23
driver version: 0.5.1

linaro@bm1684:/usr/lib/cmake/libsophon$ bm_version
SophonSDK version: v24.04.01
sophon-soc-libsophon : 0.5.1
sophon-mw-soc-sophon-ffmpeg : 0.10.0
sophon-mw-soc-sophon-opencv : 0.10.0
BL2 v2.7(release):7b2c33d Built : 16:02:07, Jun 24 2024
BL31 v2.7(release):7b2c33d Built : 16:02:07, Jun 24 2024
U-Boot 2022.10 7b2c33d (Jun 24 2024 - 16:01:43 +0800) Sophon BM1684X
KernelVersion : Linux bm1684 5.4.217-bm1684-g27254622663c #1 SMP Mon Jun 24 16:02:21 CST 2024 aarch64 aarch64 aarch64 GNU/Linux
HWVersion: 0x00
MCUVersion: 0x01

Path:

/home/linaro/LLM-TPU/models/Qwen2/python_demo

Command:

python3 pipeline.py --model_path /data/qwen2-7b_int4_seq8192_1dev.bmodel --tokenizer_path ../support/token_config/ --devid 0 --generation_mode greedy

Problem:

Load ../support/token_config/ ...
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Device [ 0 ] loading ....
[BMRT][bmcpu_setup:498] INFO:cpu_lib 'libcpuop.so' is loaded.
[BMRT][bmcpu_setup:521] INFO:Not able to open libcustomcpuop.so
open usercpu.so, init user_cpu_init
[BMRT][BMProfileDeviceBase:190] INFO:gdma=0, tiu=0, mcu=0
Model[/data/qwen2-7b_int4_seq8192_1dev.bmodel] loading ....
[BMRT][load_bmodel:1939] INFO:Loading bmodel from [/data/qwen2-7b_int4_seq8192_1dev.bmodel]. Thanks for your patience...
[BMRT][load_bmodel:1704] INFO:Bmodel loaded, version 2.2+v1.8.beta.0-221-g902a8a4fe-20240704
[BMRT][load_bmodel:1706] INFO:pre net num: 0, load net num: 61
[BMRT][load_tpu_module:1802] INFO:loading firmare in bmodel
[BMRT][preload_funcs:2121] INFO: core_id=0, multi_fullnet_func_id=22
[BMRT][preload_funcs:2124] INFO: core_id=0, dynamic_fullnet_func_id=23
[bmlib_memory][error] bm_alloc_gmem failed, dev_id = 0, size = 0x7d93000
[BM_CHECK][error] BM_CHECK_RET fail /workspace/libsophon/bmlib/src/bmlib_memory.cpp: bm_malloc_device_byte_heap_mask_u64: 1121
[BMRT][Register:2019] FATAL:coeff alloc failed, size[0x7d93000]
python3: /home/linaro/LLM-TPU/models/Qwen2/python_demo/chat.cpp:141: void Qwen::init(const std::vector&, std::string): Assertion `true == ret' failed.
Aborted

(screenshot)

Before flashing the driver from the SD card, running Qwen with the command above worked, but it produced oddly repetitive answers. I then learned the driver was too old (0.4.9 at the time) and flashed the new driver (0.5.1) following the tutorial.

Now I cannot even enter the chat. Other issues suggested the bmodel might have been broken during download, so I re-downloaded it (#7 (comment)), but the problem persists. I also ran the front-page ./run.sh with qwen1.5-1.8b and hit the same problem, and I have rebooted the server.

Are there any follow-up steps required after flashing the driver? I also ran /home/linaro/bsp-debs/linux-headers-install.sh and installed the related dependencies.

Error when running the model

Scenario: two errors occur when calling ./run.sh --model llama2-7b --arch soc
Error 1:
The error says the library chat.cpython-310-x86_64-linux-gnu.so is missing.
Fix: the board is not x86, so the aarch64 library should be used; edit the library name in the run_demo.sh script to match what was actually built on your board.
Suggestion: fix run_demo.sh, or add a note in the FAQ so users know to adjust the library name themselves.

Error 2: the following error is reported:
Traceback (most recent call last):
  File "python_demo/pipeline.py", line 216, in <module>
    main(args)
  File "python_demo/pipeline.py", line 197, in main
    model = Llama2(args)
  File "python_demo/pipeline.py", line 14, in __init__
    self.tokenizer = AutoTokenizer.from_pretrained(
  File "/home/linaro/.local/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 676, in from_pretrained
    raise ValueError(
ValueError: Tokenizer class LlamaTokenizer does not exist or is not currently imported.
Solution: update the Transformers library by running pip3 install transformers --upgrade.
Suggestion: document this in the FAQ as a hint for users.

Unable to run llama2-7b according to readme

IVP03X-V2

uname -v

root@bm1684:/home/linaro/LLM-TPU# uname -v
#1 SMP Fri Mar 1 05:32:48 CST 2024

memory allocation:

root@bm1684:/home/linaro/LLM-TPU# cat /sys/kernel/debug/ion/bm_npu_heap_dump/summary | head -2
Summary:
[0] npu heap size:7717519360 bytes, used:0 bytes        usage rate:0%, memory usage peak 0 bytes
root@bm1684:/home/linaro/LLM-TPU# cat /sys/kernel/debug/ion/bm_vpu_heap_dump/summary | head -2
Summary:
[2] vpu heap size:1073741824 bytes, used:0 bytes        usage rate:0%, memory usage peak 0 bytes
root@bm1684:/home/linaro/LLM-TPU# cat /sys/kernel/debug/ion/bm_vpp_heap_dump/summary | head -2
Summary:
[1] vpp heap size:1073741824 bytes, used:0 bytes        usage rate:0%, memory usage peak 0 bytes

Error:

[BMRT][fix_gdma_addr:488] FATAL:gdma dst shouldn't be coeff, origin[0x1071d8000], ctx[0x1071f8000]
llama2: /home/linaro/LLM-TPU/models/Llama2/demo/demo.cpp:311: void LLama2::init(const std::vector<int>&, std::string, std::string, const float&, const float&, const float&, const int&, const int&, const string&, const string&): Assertion `true == ret' failed.
./run_demo.sh: line 26:  1890 Aborted                 ./demo/llama2 --model ../../bmodels/llama2-7b_int4_1dev.bmodel --tokenizer ./support/token_config/tokenizer.model --devid 0

Run log:

root@bm1684:/home/linaro/LLM-TPU#  ./run.sh --model llama2-7b
+ model_to_demo=(["chatglm2-6b"]="ChatGLM2" ["chatglm3-6b"]="ChatGLM3" ["llama2-7b"]="Llama2" ["qwen-7b"]="Qwen" ["qwen1.5-1.8b"]="Qwen1_5" ["wizardcoder-15b"]="WizardCoder")
+ declare -A model_to_demo
+ parse_args --model llama2-7b
+ [[ 2 -gt 0 ]]
+ key=--model
+ case $key in
+ model=llama2-7b
+ shift 2
+ [[ 0 -gt 0 ]]
+ compare_date=20240110
+ '[' == pcie ']'
./run.sh: line 45: [: ==: unary operator expected
+ '[' = soc ']'
./run.sh: line 48: [: =: unary operator expected
+ '[' '' -lt 20240110 ']'
./run.sh: line 52: [: : integer expression expected
+ echo 'Driver date is , which is up to date. Continuing...'
Driver date is , which is up to date. Continuing...
+ [[ ! -n Llama2 ]]
+ pushd ./models/Llama2
/home/linaro/LLM-TPU/models/Llama2 /home/linaro/LLM-TPU
+ ./run_demo.sh
Bmodel Exists!
llama2 file Exists!
/home/linaro/LLM-TPU/models/Llama2
Demo for Qwen in BM1684X
Init Environment ...
Load ./support/token_config/tokenizer.model ... Done!
Device [ 0 ] loading ....
[BMRT][bmcpu_setup:435] INFO:cpu_lib 'libcpuop.so' is loaded.
bmcpu init: skip cpu_user_defined
open usercpu.so, init user_cpu_init 
[BMRT][BMProfile:59] INFO:Profile For arch=3
[BMRT][BMProfileDeviceBase:190] INFO:gdma=0, tiu=0, mcu=0
Model[../../bmodels/llama2-7b_int4_1dev.bmodel] loading ....
[BMRT][load_bmodel:1573] INFO:Loading bmodel from [../../bmodels/llama2-7b_int4_1dev.bmodel]. Thanks for your patience...
[BMRT][load_bmodel:1501] INFO:pre net num: 0, load net num: 69
[BMRT][fix_gdma_addr:488] FATAL:gdma dst shouldn't be coeff, origin[0x1071d8000], ctx[0x1071f8000]
llama2: /home/linaro/LLM-TPU/models/Llama2/demo/demo.cpp:311: void LLama2::init(const std::vector<int>&, std::string, std::string, const float&, const float&, const float&, const int&, const int&, const string&, const string&): Assertion `true == ret' failed.
./run_demo.sh: line 26:  1890 Aborted                 ./demo/llama2 --model ../../bmodels/llama2-7b_int4_1dev.bmodel --tokenizer ./support/token_config/tokenizer.model --devid 0

Prompts that are too long cause replies to stop halfway

To make the answers more accurate, I included some reference material with my question to ChatGLM3. I found that if the reference material is too long (around two hundred characters), the reply often stops after only half an answer, frequently even cutting off mid-sentence. How can this be solved? Running the original ChatGLM3 model on my PC does not seem to have this problem.

Warning during ONNX export

/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py:1636: UserWarning: The exported ONNX model failed ONNX shape inference.The model will not be executable by the ONNX Runtime.If this is unintended and you believe there is a bug,please report an issue at https://github.com/pytorch/pytorch/issues.Error reported by strict ONNX shape inference: [ShapeInferenceError] (op_type:Add, node name: /Add): A typestr: T, has unsupported type: tensor(bool) (Triggered internally at ../torch/csrc/jit/serialization/export.cpp:1407.)
Is this warning important? If it affects model quality, how should it be resolved?
It appears when running the ChatGLM3 ONNX export script.

ChatGLM3-6B ONNX conversion error: torch.onnx.errors.CheckerError: The model does not have an ir_version set properly.

Environment:
onnx 1.14.0
onnxruntime 1.15.1
torch 2.0.1+cpu
torchvision 0.15.2+cpu
protobuf 3.20.3
LLM-TPU and tpu-mlir are both on the main branch

Commands:
cd LLM-TPU/models/ChatGLM3/compile
python3 export_onnx.py --model_path ./ZhipuAI/chatglm3-6b --seq_length 4096 --device cpu

Problem:
The ONNX conversion fails as follows. (Note: with --seq_length set to 512, 1024, or 2048 the conversion succeeds; it only fails at 4096.)
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:10<00:00, 1.56s/it]
Layers: 28
Hidden size: 4096

Convert block & block_cache
0%| | 0/28 [00:00<?, ?it/s][libprotobuf ERROR ../third_party/protobuf/src/google/protobuf/message_lite.cc:457] onnx_torch.ModelProto exceeded maximum protobuf size of 2GB: 2964400814
============== Diagnostic Run torch.onnx.export version 2.0.1+cpu ==============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

0%| | 0/28 [00:26<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py", line 1636, in _export
    _C._check_onnx_proto(proto)
RuntimeError: The model does not have an ir_version set properly.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/LLM-TPU/models/ChatGLM3/compile/export_onnx.py", line 335, in <module>
    convert_block(i)
  File "/workspace/LLM-TPU/models/ChatGLM3/compile/export_onnx.py", line 153, in convert_block
    torch.onnx.export(
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py", line 506, in export
    _export(
  File "/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py", line 1638, in _export
    raise errors.CheckerError(e) from e
torch.onnx.errors.CheckerError: The model does not have an ir_version set properly.

Web client not working

IVP03X-V2

uname -v

root@bm1684:/home/linaro/LLM-TPU# uname -v
#1 SMP Fri Mar 1 05:32:48 CST 2024

CMD:

python3 web_demo.py --model_path ./llama2-7b_int4_1dev.bmodel --tokenizer_path ../token_config --devid 0 --generation_mode greedy

Error log:

root@bm1684:/data/LLM-TPU/models/Llama2/python_demo# python3 web_demo.py --model_path ./llama2-7b_int4_1dev.bmodel --tokenizer_path ../token_config --devid 0 --generation_mode greedy
/usr/local/lib/python3.8/dist-packages/gradio_client/documentation.py:103: UserWarning: Could not get documentation group for <class 'gradio.mix.Parallel'>: No known documentation group for module 'gradio.mix'
  warnings.warn(f"Could not get documentation group for {cls}: {exc}")
/usr/local/lib/python3.8/dist-packages/gradio_client/documentation.py:103: UserWarning: Could not get documentation group for <class 'gradio.mix.Series'>: No known documentation group for module 'gradio.mix'
  warnings.warn(f"Could not get documentation group for {cls}: {exc}")
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Load ../token_config ...
Device [ 0 ] loading ....
[BMRT][bmcpu_setup:436] INFO:cpu_lib 'libcpuop.so' is loaded.
bmcpu init: skip cpu_user_defined
open usercpu.so, init user_cpu_init 
[BMRT][BMProfile:60] INFO:Profile For arch=3
[BMRT][BMProfileDeviceBase:190] INFO:gdma=0, tiu=0, mcu=0
Model[./llama2-7b_int4_1dev.bmodel] loading ....
[BMRT][load_bmodel:1696] INFO:Loading bmodel from [./llama2-7b_int4_1dev.bmodel]. Thanks for your patience...
[BMRT][load_bmodel:1583] INFO:Bmodel loaded, version 2.2+v1.7.beta.63-g04a8bad4c-20240409
[BMRT][load_bmodel:1585] INFO:pre net num: 0, load net num: 69
[BMRT][load_tpu_module:1674] INFO:loading firmare in bmodel
[BMRT][preload_funcs:1876] INFO: core_id=0, multi_fullnet_func_id=91
[BMRT][preload_funcs:1879] INFO: core_id=0, dynamic_fullnet_func_id=92
Done!
web_demo.py:103: GradioDeprecationWarning: The `style` method is deprecated. Please set these arguments in the constructor instead.
  user_input = gr.Textbox(show_label=False, placeholder="Input...", lines=10).style(
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/gradio/routes.py", line 442, in run_predict
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.8/dist-packages/gradio/blocks.py", line 1392, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.8/dist-packages/gradio/blocks.py", line 1111, in call_function
    prediction = await utils.async_iteration(iterator)
  File "/usr/local/lib/python3.8/dist-packages/gradio/utils.py", line 346, in async_iteration
    return await iterator.__anext__()
  File "/usr/local/lib/python3.8/dist-packages/gradio/utils.py", line 339, in __anext__
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.8/dist-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 2144, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 851, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.8/dist-packages/gradio/utils.py", line 322, in run_sync_iterator_async
    return next(iterator)
  File "/usr/local/lib/python3.8/dist-packages/gradio/utils.py", line 691, in gen_wrapper
    yield from f(*args, **kwargs)
  File "web_demo.py", line 82, in predict
    for response, history in model.stream_predict(input):
  File "/data/LLM-TPU/models/Llama2/python_demo/pipeline.py", line 172, in stream_predict
    for answer_cur, history in self._generate_predictions(tokens):
  File "/data/LLM-TPU/models/Llama2/python_demo/pipeline.py", line 180, in _generate_predictions
    next_token = self.forward_first(tokens)
AttributeError: 'Llama2' object has no attribute 'forward_first'

Qwen2 converts to ONNX, but the following problem appears when converting to bmodel

The SDK is v1.6.113-g7dc59c81-20240105. Converting to bmodel gives: model_deploy.py: error: unrecognized arguments: --addr_mode io_alone

After removing --addr_mode, the bmodel conversion reports the error below. The conversion command I used is: ./compile.sh --mode int8 --name qwen2-7b --seq_length 8192

Floating point exception (core dumped)
Traceback (most recent call last):
  File "/workspace/tpu-mlir_v1.6.113-g7dc59c81-20240105/python/tools/model_deploy.py", line 337, in <module>
    tool.build_model()
  File "/workspace/tpu-mlir_v1.6.113-g7dc59c81-20240105/python/tools/model_deploy.py", line 232, in build_model
    mlir_to_model(self.tpu_mlir, self.model, self.final_mlir, self.dynamic,
  File "/workspace/tpu-mlir_v1.6.113-g7dc59c81-20240105/python/utils/mlir_shell.py", line 169, in mlir_to_model
    _os_system(cmd)
  File "/workspace/tpu-mlir_v1.6.113-g7dc59c81-20240105/python/utils/mlir_shell.py", line 50, in _os_system
    raise RuntimeError("[!Error]: {}".format(cmd_str))
RuntimeError: [!Error]: tpuc-opt block_cache_9_bm1684x_w8bf16_final.mlir --codegen="model_file=block_cache_9.bmodel embed_debug_info=false model_version=latest" -o /dev/nu

Running llama3 produces garbled output

=================================================================

  1. If you want to quit, please enter one of [q, quit, exit]
  2. To create a new chat session, please enter one of [clear, new]
    =================================================================

Question: hello

Answer: fragisticskromkromkromkromkromkromkromkrom^CTraceback
