huawei-noah / bolt

Bolt is a deep learning library with high performance and heterogeneous flexibility.

Home Page: https://huawei-noah.github.io/bolt/

License: MIT License


bolt's Introduction

Introduction



Bolt is a light-weight library for deep learning. As a universal deployment tool for all kinds of neural networks, Bolt aims to automate the deployment pipeline and achieve extreme acceleration. Bolt has been widely deployed and used in many departments of HUAWEI, such as the 2012 Laboratory, CBG and HUAWEI Product Lines. If you have questions or suggestions, you can submit an issue. QQ group: 833345709

Why is Bolt what you need?


  • High Performance: 15%+ faster than existing open source acceleration libraries.
  • Rich Model Conversion: supports Caffe, ONNX, TFLite and TensorFlow.
  • Various Inference Precisions: supports FP32, FP16, INT8, 1-BIT.
  • Multiple platforms: ARM CPU(v7, v8, v8.2+, v9), X86 CPU(AVX2, AVX512), GPU(Mali, Qualcomm, Intel, AMD)
  • Bolt is the first to support NLP and also supports common CV applications.
  • Minimize ROM/RAM
  • Rich Graph Optimization
  • Efficient Thread Affinity Setting
  • Auto Algorithm Tuning
  • Time-Series Data Acceleration

See more excellent features and details here

Building Status


Here are some commonly used platforms for inference. More targets can be found in scripts/target.sh. Please choose the one that suits your environment. If you want to build the on-device training module, add the --train option. If you want to use multi-threaded parallelism, add the --openmp option. If you want to build for cortex-M or cortex-A7 with restricted ROM/RAM (sensors, MCUs), see docs/LITE.md.

Bolt links the static library by default, which may cause problems on some platforms. You can use the --shared option to link the shared library instead.
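For example, a build that combines these options might look like the following sketch; android-aarch64 is just one of the targets listed in the table below, and the options can be mixed as your platform requires:

./install.sh --target=android-aarch64 --gpu --openmp --shared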

| target platform | precision | build command | Linux | Windows | MacOS |
| --- | --- | --- | --- | --- | --- |
| Android(armv7) | fp32,int8 | ./install.sh --target=android-armv7 | Build Status | Build Status | Build Status |
| Android(armv8) | fp32,int8 | ./install.sh --target=android-aarch64 --fp16=off | Build Status | Build Status | Build Status |
| Android(armv8.2+) | fp32,fp16,int8,bnn | ./install.sh --target=android-aarch64 | Build Status | Build Status | Build Status |
| Android(armv9) | fp32,fp16,bf16,int8,bnn | ./install.sh --target=android-aarch64_v9 | Build Status | Build Status | Build Status |
| Android(gpu) | fp16 | ./install.sh --target=android-aarch64 --gpu | Build Status | Build Status | Build Status |
| Android(x86_64) | fp32,int8 | ./install.sh --target=android-x86_64 | Build Status | Build Status | Build Status |
| iOS(armv7) | fp32,int8 | ./install.sh --target=ios-armv7 | / | / | Build Status |
| iOS(armv8) | fp32,int8 | ./install.sh --target=ios-aarch64 --fp16=off | / | / | Build Status |
| iOS(armv8.2+) | fp32,fp16,int8,bnn | ./install.sh --target=ios-aarch64 | / | / | Build Status |
| Linux(armv7) | fp32,int8 | ./install.sh --target=linux-armv7_blank | Build Status | / | / |
| Linux(armv8) | fp32,int8 | ./install.sh --target=linux-aarch64_blank --fp16=off | Build Status | / | / |
| Linux(armv8.2+) | fp32,fp16,int8,bnn | ./install.sh --target=linux-aarch64_blank | Build Status | / | / |
| Linux(x86_64) | fp32,int8 | ./install.sh --target=linux-x86_64 | Build Status | / | / |
| Linux(x86_64_avx2) | fp32 | ./install.sh --target=linux-x86_64_avx2 | Build Status | / | / |
| Linux(x86_64_avx512) | fp32,int8 | ./install.sh --target=linux-x86_64_avx512 | Build Status | / | / |
| Windows(x86_64) | fp32,int8 | ./install.sh --target=windows-x86_64 | / | Build Status | / |
| Windows(x86_64_avx2) | fp32 | ./install.sh --target=windows-x86_64_avx2 | / | Build Status | / |
| Windows(gpu) | fp16 | ./install.sh --target=windows-x86_64_avx2 --gpu --fp16=on | / | Build Status | / |
| Windows(x86_64_avx512) | fp32,int8 | ./install.sh --target=windows-x86_64_avx512 | / | Build Status | / |
| Windows(armv8.2+) | fp32,fp16,int8,bnn | ./install.sh --target=windows-aarch64 | / | / | Build Status |
| MacOS(x86_64) | fp32,int8 | ./install.sh --target=macos-x86_64 | / | / | Build Status |
| MacOS(x86_64_avx2) | fp32 | ./install.sh --target=macos-x86_64_avx2 | / | / | Build Status |
| MacOS(x86_64_avx512) | fp32,int8 | ./install.sh --target=macos-x86_64_avx512 | / | / | Build Status |
| MacOS(armv8.2+) | fp32,fp16,int8,bnn | ./install.sh --target=macos-aarch64 | / | / | Build Status |

Quick Start


Two steps to get started with bolt.
  1. Conversion: use X2bolt to convert your model from Caffe, ONNX, TFLite or TensorFlow to a .bolt file;

  2. Inference: run benchmark with the .bolt model and input data to get the inference result.

    For more details about the usage of the X2bolt and benchmark tools, see docs/USER_HANDBOOK.md; a minimal example follows.
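As a sketch of the two steps: the X2bolt options -d, -m and -i (model directory, model name, inference precision) are assumptions based on docs/USER_HANDBOOK.md, the benchmark flags mirror a command shown later on this page, and the file names are placeholders.

# step 1: conversion (assumed X2bolt options, see docs/USER_HANDBOOK.md)
./X2bolt -d ./models/ -m resnet50 -i FP16
# step 2: inference on the converted model
./benchmark -a GPU -w 10 -l 10 -m ./models/resnet50_f16.bolt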

DL Applications in Bolt

Here we show some interesting and useful applications in bolt.

  • Image Classification: android, ios
  • Face Detection: ios, exe
  • Pose Detection: android
  • Semantics Analysis: android
  • Reading Comprehension: android
  • Chinese Speech Recognition: android, ios

Verified Networks


Bolt has shown its high performance in the inference of common CV, NLP and Recommendation neural networks. Some of the representative networks that we have verified are listed below. You can find detailed benchmark information in docs/BENCHMARK.md.

| Application | Models |
| --- | --- |
| CV | Resnet50, Shufflenet, Squeezenet, Densenet, Efficientnet, Mobilenet_v1, Mobilenet_v2, Mobilenet_v3, BiRealNet, ReActNet, Ghostnet, unet, LCNet, Pointnet, hair-segmentation, duc, fcn, retinanet, SSD, Faster-RCNN, Mask-RCNN, Yolov2, Yolov3, Yolov4, Yolov5, ViT, TNT, RepVGG, VitAE, CMT, EfficientFormer ... |
| NLP | Bert, Albert, Tinybert, Neural Machine Translation, Text To Speech (Tacotron, Tacotron2, FastSpeech+hifigan, melgan), Automatic Speech Recognition, DFSMN, Conformer, Tdnn, FRILL, T5, GPT-2, Roberta, Wenet ... |
| Recommendation | NFM, AFM, ONN, wide&deep, DeepFM, MMOE |
| More | DL Tasks ... |

More models than those mentioned above are supported; users are encouraged to explore further.

On-Device Training


On-device training has arrived. It is a beta version that supports LeNet, Mobilenet_v1 and Resnet18 for training on embedded devices and servers. Want more details about on-device training in bolt? See the official training tutorial.
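As a sketch, the beta training module is enabled with the --train build option mentioned in the Building Status section above; the target shown here is only an example:

./install.sh --target=linux-x86_64 --train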

Documentations


Everything you want to know about bolt is recorded in the detailed documentation stored in docs.

Articles


Tutorials


Acknowledgement


Bolt refers to the following projects: caffe, onnx, tensorflow, ncnn, mnn, dabnn.

License


The MIT License (MIT)


bolt's Issues

No third_party install shell script

The shell script for installing third-party libraries mentioned in INSTALL.md is not in the repo, so I have to install and configure all the dependencies manually.

Is full support for binarized convolution networks available?

Bolt supports both XNOR-style and DoReFa-style BNN networks. Just save the binary weights as FP32 in an ONNX model, and X2bolt will automatically convert the storage to 1-bit representations. So far, the floating-point portion of a BNN network can only use FP16 operations, so pass "FP16" as the precision parameter to X2bolt. The number of output channels for BNN convolution layers should be divisible by 32.
What does the FP16 mentioned here mean? Does it mean that support for binarized networks is actually implemented with FP16? And why must the number of output channels be divisible by 32?
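For illustration, a conversion call matching the answer above might look like this; the -d, -m and -i options are assumptions taken from docs/USER_HANDBOOK.md rather than from this page, and the model name is a placeholder:

# assumed X2bolt options: -d model directory, -m model name, -i inference precision
./X2bolt -d ./onnx_models/ -m my_bnn_model -i FP16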

Error while cloning tensorflow

Building with the command

./install.sh -t 12 -c llvm

I get this error:

  • [new tag] v2.1.0-rc1 -> v2.1.0-rc1
  • [new tag] v2.1.0-rc2 -> v2.1.0-rc2
  • [new tag] v2.2.0-rc0 -> v2.2.0-rc0
  • [new tag] v2.2.0-rc1 -> v2.2.0-rc1
    From https://github.com/tensorflow/tensorflow
  • branch master -> FETCH_HEAD
    error: Sparse checkout leaves no entry on working directory

llvm-ranlib problem

$ ./install.sh --target=android-aarch64 --gpu
[ERROR] please install llvm-ranlib tools and set shell environment PATH to find it

The llvm-ranlib tool used in the new version's build process does not exist in NDK r20; the tool there is llvm-ar. You can copy and rename it as a workaround; it is suggested to fix the install.sh script.

Model power-consumption analysis

Is there any power-consumption comparison data for FP16, INT8 and binary under the same network structure?

Are TorchScript-format models supported as input?

From the README and the code directory, there are two ways to convert a PyTorch model:

  1. PyTorch->ONNX
  2. PyTorch->Caffe
    Both of these approaches have significant limitations. For example, with PyTorch->ONNX, some of PyTorch's own Prim operators (IF, LOOP, LISTUNPACK ...) are poorly supported or unsupported by ONNX. So would the current runtime inference framework consider supporting TorchScript as a model input format?

https://pytorch.org/docs/stable/jit.html?highlight=script
https://github.com/pytorch/pytorch/tree/master/torch/csrc/jit

bolt build failed on windows, here's the error report

CMake Error at C:/Program Files/CMake/share/cmake-3.22/Modules/CMakeTestCCompiler.cmake:69 (message):
The C compiler

"D:/mingw64/bin/gcc.exe"

is not able to compile a simple test program.

It fails with the following output:

Change Dir: C:/Users/Q/Desktop/bolt-master/third_party/windows-x86_64/protobuf/protobuf-3.14.0/build/CMakeFiles/CMakeTmp

Run Build Command(s):D:/mingw64/bin/mingw32-make.exe -f Makefile cmTC_4fc87/fast && mingw32-make  -f CMakeFiles\cmTC_4fc87.dir\build.make CMakeFiles/cmTC_4fc87.dir/build
mingw32-make[1]: Entering directory 'C:/Users/Q/Desktop/bolt-master/third_party/windows-x86_64/protobuf/protobuf-3.14.0/build/CMakeFiles/CMakeTmp'
Building C object CMakeFiles/cmTC_4fc87.dir/testCCompiler.c.obj
D:\mingw64\bin\gcc.exe    -o CMakeFiles\cmTC_4fc87.dir\testCCompiler.c.obj -c C:\Users\Q\Desktop\bolt-master\third_party\windows-x86_64\protobuf\protobuf-3.14.0\build\CMakeFiles\CMakeTmp\testCCompiler.c
Linking C executable cmTC_4fc87.exe
"C:\Program Files\CMake\bin\cmake.exe" -E cmake_link_script CMakeFiles\cmTC_4fc87.dir\link.txt --verbose=1
"C:\Program Files\CMake\bin\cmake.exe" -E rm -f CMakeFiles\cmTC_4fc87.dir/objects.a
D:\mingw64\bin\ar.exe qc CMakeFiles\cmTC_4fc87.dir/objects.a @CMakeFiles\cmTC_4fc87.dir\objects1.rsp
D:\mingw64\bin\gcc.exe -Wl,--whole-archive CMakeFiles\cmTC_4fc87.dir/objects.a -Wl,--no-whole-archive -o cmTC_4fc87.exe -Wl,--out-implib,libcmTC_4fc87.dll.a -Wl,--major-image-version,0,--minor-image-version,0 @CMakeFiles\cmTC_4fc87.dir\linklibs.rsp
gcc.exe: error: CreateProcess: No such file or directory
mingw32-make[1]: *** [CMakeFiles\cmTC_4fc87.dir\build.make:100: cmTC_4fc87.exe] Error 1
mingw32-make[1]: Leaving directory 'C:/Users/Q/Desktop/bolt-master/third_party/windows-x86_64/protobuf/protobuf-3.14.0/build/CMakeFiles/CMakeTmp'
mingw32-make.exe: *** [Makefile:126: cmTC_4fc87/fast] Error 2

Converting tflite with X2bolt 1.2.1 fails (1.2.0 has no problem)

Version 1.2.1 replaced the schema with the tensorflow master version, and opcode parsing broke. I added some logging at line 260 of tflite_adaptee; all of the printed opcodes are 0:

[INFO] thread 12984 Start to convert ./xxx.tflite...
[parse_file] tfliteModel->operator_codes[0]->builtin_code : 0
[parse_file] tfliteModel->operator_codes[1]->builtin_code : 0
[parse_file] tfliteModel->operator_codes[2]->builtin_code : 0
[parse_file] tfliteModel->operator_codes[3]->builtin_code : 0
[parse_file] tfliteModel->operator_codes[4]->builtin_code : 0
[parse_file] tfliteModel->operator_codes[5]->builtin_code : 0
Segmentation fault

What features do you want to add to bolt?

We want to know what you would like to add to bolt. We will evaluate your suggestions and make a development plan, so please tell us your requirements.

  • add an app to demo bolt's ability
  • add support for mobile GPU
  • add support for fp32
  • add a benchmark tool
  • add support for DaVinci

benchmark issue

I benchmarked bolt with the profile feature disabled at the install stage, looking only at the model's total latency, and found that I cannot reach the performance reported in the article. I'm not sure where I went wrong; please help me take a look.
As shown in the screenshot (WXWorkCapture_16431661593291).
The article reports 3.949 ms for squeezenet 1.1 on a Snapdragon 888 in half precision; on a Xiaomi 11 (Snapdragon 888) I measured the fp16 case at avg_time: 7.443091 ms/data.
To verify, I also tested the resnet50 network mentioned in https://github.com/huawei-noah/bolt/blob/master/docs/USER_HANDBOOK.md, converted with the X2bolt tool. My command was: ./benchmark -a GPU -w 10 -l 10 -m ResNet-50_f16.bolt
(screenshot: WXWorkCapture_16431838464497)
The fp16 latency on the Snapdragon 888 is:
Benchmark Result:
Output Tensor prob desc: dt:DT_F16 memFormat:DF_NCHW stride(1000,1,1) offset(0,0,0) data: 0.000166 0.000330 0.000063 0.000110 0.000000 0.000508 0.000000 0.000000 sum: 0.992770
total_time:305.839355ms(loops=10)
avg_time:30.583936ms/data
min_time:29.903076ms/data
max_time:31.020020ms/data
Is the average latency of 30.58 ms here normal? Could you share resnet50 latency numbers, or provide the resnet50_v2 model file (the official article reports about 25 ms), so I can cross-check?

Raspberry Pi 4 compile error

Is there any documentation for Raspberry Pi 4?

I get the error below when running cmake ..


error: ‘Factory’ was not declared in this scope
     std::shared_ptr<Factory> factory;

The future of bolt

How is bolt positioned inside Huawei, and is there any chance that Hi (HiSilicon) chips will be supported in the future?

Cmake error (Protobuf_SHARED_LIBRARY NOT FOUND) occurs

Environment

  • Ubuntu 18.04
  • cmake version 3.10.2
  • Android(armv8+mali)
  • android-ndk-r20b

Command

./install.sh --target=android-aarch64 --mali -t 6

Log

-- CXXFLAGS: --target=aarch64-linux-android21 -W -Wall -Wextra -O3 -fPIC -fstack-protector-all -Wno-unused-command-line-argument -Wno-unused-parameter -Wno-unused-result -Wno-deprecated-declarations -Wno-unused-variable -pthread -D_USE_JNI -D_USE_ANDROID_LOG -llog -D_USE_GENERAL -D_USE_MALI -D_USE_FP32 -D_USE_NEON -D_USE_FP16 -D_USE_F16_MIX_PRECISION -D_USE_INT8 -march=armv8-a+fp16+dotprod -D_USE_CAFFE -D_USE_ONNX -D_USE_TFLITE -D_USE_TENSORFLOW -std=c++11 -W -Wall -Wextra -O3 -fPIC -fstack-protector-all -Wno-unused-command-line-argument -Wno-unused-parameter -Wno-unused-result -Wno-deprecated-declarations -Wno-unused-variable -pthread -D_USE_JNI -D_USE_ANDROID_LOG -llog -D_USE_GENERAL -D_USE_MALI -D_USE_FP32 -D_USE_NEON -D_USE_FP16 -D_USE_F16_MIX_PRECISION -D_USE_INT8 -march=armv8-a+fp16+dotprod -D_USE_CAFFE -D_USE_ONNX -D_USE_TFLITE -D_USE_TENSORFLOW -Wl,-allow-shlib-undefined -static-libstdc++

CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
Protobuf_SHARED_LIBRARY
    linked by target "model_tools_caffe" in directory /home/ubuntu/workspace/bolt/model_tools/src/caffe
    linked by target "model_tools_onnx" in directory /home/ubuntu/workspace/bolt/model_tools/src/onnx

-- Configuring incomplete, errors occurred!

An OpenCL bug

I found a bug while testing bolt's OpenCL backend.
Because bolt mixes the NCHW / NCHWC4 data layouts, and for OpenCL the blobs between layers mix buffer, image1d, image2d and image3d, while memory allocation also reuses memory, one of my models triggered a bug at a depth2space_ocl layer: the kernel's argument is declared as a buffer input type, but after memory reuse an image3d object was passed for that argument, so set_arg failed with CL_INVALID_MEM_OBJ. I am still tracking it down in the memory-reuse code.

tensorflow to caffe model, use caffe to infer: which caffe version is used?

[libprotobuf ERROR google/protobuf/text_format.cc:307] Error parsing text-format caffe.NetParameter: 80:14: Message type "caffe.EmbedParameter" has no field named "transpose".
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0225 15:25:28.170408 100496 upgrade_proto.cpp:90] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: tensorflow2caffe/tts/tts_encoder.prototxt
*** Check failure stack trace: ***

How to build on Raspberry?

We have successfully built the bolt inference library without the model converter on a Raspberry Pi 3 Model B (armv7).

#67

export CFLAGS="-march=armv7-a -mfpu=neon-vfpv4 "
export CXXFLAGS="-march=armv7-a -mfpu=neon-vfpv4 "
./install.sh --target=linux-armv7_blank --converter=off -t 4

#benchmark
./install_linux-armv7_blank/example/benchmark -m ./kit/assets/ImageClassification/ghostnet_f32.bolt

You can transfer your bolt model to Raspberry and run inference.

Demo for quick start

In many excellent open source projects, a demo is given as a quick start; however, we can't find a real example that can be run directly after compilation.

An example with input data and real models would be friendlier to beginners. Thanks :)

Typo in STORE_OUTPUT; Adreno GPU deconvolution f2s2 gives wrong results

./test_deconvolution_ocl 24 256 128 1 2 2 2 0

[DEBUG] thread 13883 OCLContext 0x61531c6278 constructor start
[DEBUG] thread 13883 try to dlopen libQUALCOMM_Adreno_660_map.so failed, dlopen failed: library "libQUALCOMM_Adreno_660_map.so" not found, create kernel from source code
[DEBUG] thread 13883 gcl_kernel_source 0xb40000714c3a1250 constructor
[DEBUG] thread 13883 OCLContext 0x61531c6278 constructor end
[DEBUG] thread 13883 KERNEL>>> unknow_deconv_gemm_f2s2_qc_iom_12 runInfo: ls <0 0 0> executeTime = 153.856000 us
[DEBUG] thread 13883 KERNEL>>> unknow_deconv_gemm_f2s2_qc_iom_22 runInfo: ls <0 0 0> executeTime = 130.816000 us
[DEBUG] thread 13883 KERNEL>>> unknow_deconv_gemm_f2s2_qc_iom_32 runInfo: ls <0 0 0> executeTime = 153.088000 us
[DEBUG] thread 13883 KERNEL>>> unknow_deconv_gemm_f2s2_qc_iom_42 runInfo: ls <0 0 0> executeTime = 122.880000 us
[DEBUG] thread 13883 KERNEL>>> unknow_deconv_gemm_f2s2_qc_iom_14 runInfo: ls <0 0 0> executeTime = 143.872000 us
[DEBUG] thread 13883 KERNEL>>> unknow_deconv_gemm_f2s2_qc_iom_24 runInfo: ls <0 0 0> executeTime = 102.144000 us
[DEBUG] thread 13883 KERNEL>>> unknow_deconv_gemm_f2s2_qc_iom_34 runInfo: ls <0 0 0> executeTime = 118.016000 us
[DEBUG] thread 13883 enqueue_fill_image runInfo: executeTime = 15.872000 us
[DEBUG] thread 13883 KERNEL>>> unknow_deconv_gemm_trans_fltbuf_44 runInfo: executeTime = 5.888000 us
[DEBUG] thread 13883 DATATRANS>>> enqueue_write_buffer runInfo: executeTime = 129.024000 us
[DEBUG] thread 13883 KERNEL>>> unknow_mem_trans_om_nchw_to_nchwc4 runInfo: executeTime = 113.920000 us
[INFO] thread 13883 warm up gpu:
[DEBUG] thread 13883 KERNEL>>> unknow_deconv_gemm_f2s2_qc_iom_24 runInfo: ls <0 0 0> executeTime = 102.912000 us
[DEBUG] thread 13883 KERNEL>>> unknow_deconv_gemm_f2s2_qc_iom_24 runInfo: ls <0 0 0> executeTime = 100.864000 us
[DEBUG] thread 13883 KERNEL>>> unknow_deconv_gemm_f2s2_qc_iom_24 runInfo: ls <0 0 0> executeTime = 98.048000 us
[DEBUG] thread 13883 KERNEL>>> unknow_mem_trans_im_nchwc4_to_nchw runInfo: executeTime = 51.968000 us
[DEBUG] thread 13883 DATATRANS>>> enqueue_read_buffer runInfo: executeTime = 16.896000 us
[INFO] thread 13883 16bit,         Deonvolution,                                    (1 24 256 128)+(24 1 2 2)/(2 0)=(1 1 512 256),    TIME    0.098ms,        GFLOPS   65.504
abs(diff) >= 1.000000e+00f, number = 23
abs(diff) >= 1.000000e-01f, number = 822
abs(diff) >= 1.000000e-02f, number = 164
abs(diff) >= 1.000000e-03f, number = 1084
abs(diff) >= 1.000000e-04f, number = 85300
abs(diff) >= 1.000000e-05f, number = 3176
abs(diff) >= 0.000000e+00f, number = 40503
maxabs = 1.530273, a = 0.000000, b = 1.530273 @ 428
maxrel = 976.562500, a = -0.000244, b = 0.000244 @ 73386
[DEBUG] thread 13883 OCLContext 0x61531c6278 deconstructor start
[DEBUG] thread 13883 gcl_kernel_source 0xb40000714c3a1250 constructor
[DEBUG] thread 13883 OCLContext 0x61531c6278 deconstructor end

Using the GPU algorithm-selection file to speed up model initialization: a corner case is not accelerated

The GPU algorithm file contains algorithmMap and kernelThreadMap. When a model contains only simple OPs (eltwise, power, etc.), no search over tiling and similar parameters is needed, so algorithmMap is empty, while kernelThreadMap still contains the local-work-size search results for these OPs.

Therefore there is a corner case: algorithmMap.size() == 0 && kernelThreadMap.size() > 0

In that case void saveMapToFile() has a bug, and the local-work-size search results for such a model are not saved to the algorithm file. As a result, the next time the model is initialized, even though this algorithm file is linked, the local search still has to be redone, and the model's first execution is very slow. Concretely, the execution times with -w 0 and -w 1 differ dramatically.
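As a reproduction sketch using the benchmark flags shown elsewhere on this page (the model file name is a placeholder):

# first execution without warm-up vs. with a single warm-up run
./benchmark -a GPU -w 0 -l 1 -m model_f16.bolt
./benchmark -a GPU -w 1 -l 1 -m model_f16.bolt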

batch inference

Does bolt support batch inference? Could I run inference on two or more sentences at the same time?

error adding symbols: file in wrong format

Hello, compiling with your instructions (using cross-compilation) I face this problem. How can I solve it?

[ 74%] Linking CXX executable ../../../image/bin/test_image_processing
/home/<user>/Downloads/gcc-arm-8.3-2019.03-x86_64-aarch64-linux-gnu/bin/../lib/gcc/aarch64-linux-gnu/8.3.0/../../../../aarch64-linux-gnu/bin/ld: ../../../image/dependency/png/lib/libpng.a(png.o): Relocations in generic ELF (EM: 62)
/home/<user>/Downloads/gcc-arm-8.3-2019.03-x86_64-aarch64-linux-gnu/bin/../lib/gcc/aarch64-linux-gnu/8.3.0/../../../../aarch64-linux-gnu/bin/ld: ../../../image/dependency/png/lib/libpng.a(png.o): Relocations in generic ELF (EM: 62)
/home/<user>/Downloads/gcc-arm-8.3-2019.03-x86_64-aarch64-linux-gnu/bin/../lib/gcc/aarch64-linux-gnu/8.3.0/../../../../aarch64-linux-gnu/bin/ld: ../../../image/dependency/png/lib/libpng.a(png.o): Relocations in generic ELF (EM: 62)
/home/<user>/Downloads/gcc-arm-8.3-2019.03-x86_64-aarch64-linux-gnu/bin/../lib/gcc/aarch64-linux-gnu/8.3.0/../../../../aarch64-linux-gnu/bin/ld: ../../../image/dependency/png/lib/libpng.a: error adding symbols: file in wrong format

Mali GPU errors in install.sh script

Hello,
how can I fix these

CANNOT LINK EXECUTABLE

errors? It is not running on either Kirin 980 or 990.

1: --- Network Test (LeNet)
1: CANNOT LINK EXECUTABLE "/data/local/tmp/uldra/lenet": cannot locate symbol "Mali_G76p_bin" referenced by "/data/local/tmp/uldra/libkernelbin.so"...
1: CANNOT LINK EXECUTABLE "/data/local/tmp/uldra/lenet": cannot locate symbol "Mali_G76p_bin" referenced by "/data/local/tmp/uldra/libkernelbin.so"...
1: [ 20%] /data/local/tmp/uldra/hdr_ocl
1: [ 40%] /data/local/tmp/uldra/hdr_ocl
1: [ 60%] /data/local/tmp/uldra/hdr_ocl
1: [ 80%] /data/local/tmp/uldra/hdr_ocl
1: [100%] /data/local/tmp/uldra/hdr_ocl
1: /home/yury/source/bolt-master/tests/bin/hdr_ocl: 1 file pushed. 5.3 MB/s (324480 bytes in 0.059s)
1:
1:
1: --- GPU Network Test (HDR_OCL)
1:
1: === Input FP16
1: CANNOT LINK EXECUTABLE "/data/local/tmp/uldra/hdr_ocl": cannot locate symbol "Mali_G76p_bin" referenced by "/data/local/tmp/uldra/libkernelbin.so"...
1:
1: === Input UCHAR
1: CANNOT LINK EXECUTABLE "/data/local/tmp/uldra/hdr_ocl": cannot locate symbol "Mali_G76p_bin" referenced by "/data/local/tmp/uldra/libkernelbin.so"...
1/1 Test #1: quick_benchmark .................. Passed 7.88 sec

TFLite not install success

MINGW64 /f/bolt-master
$ ./install.sh --target=android-aarch64
[INFO] use 8 threads to parallel build third party library on windows-x86_64 for target android-aarch64 in directory /f/bolt-master/third_party/android-aarch64...
[INFO] use c language compiler /c/Users/AppData/Local/Android/Sdk/ndk/20.1.5948944/toolchains/llvm/prebuilt/windows-x86_64/bin/clang
[INFO] use c++ language compiler /c/Users/AppData/Local/Android/Sdk/ndk/20.1.5948944/toolchains/llvm/prebuilt/windows-x86_64/bin/clang++
[INFO] generate environment file to /f/bolt-master/third_party/android-aarch64.sh...
[INFO] build TFLite in /f/bolt-master/third_party/android-aarch64/tflite...
[INFO] please source /f/bolt-master/third_party/android-aarch64.sh to use...
[INFO] use /f/bolt-master/third_party/android-aarch64.sh to set environment variable...
[ERROR] TFLite not install success

CANNOT LINK EXECUTABLE

Hi, everyone

Could you help me resolve an issue, please?

I've built bolt as described in INSTALL.md with a Kirin 980 device plugged in.
At the end of the installation I saw:

1: Test command: /root/bolt/quick_benchmark.sh "-b" "/root/bolt/tests/bin" "-p" "/data/local/tmp/uldra" "-l" "/root/bolt/install_llvm/lib"
1: Test timeout computed to be: 10000000
1: [INFO] run test in '/root/bolt/tests/bin'
1: [INFO] test on device directory `/data/local/tmp/uldra'
1: [INFO] use library in /root/bolt/install_llvm/lib
1: /root/bolt/install_llvm/lib/libBoltModel.so: 1 file pushed. 2.5 MB/s (1067120 bytes in 0.413s)
1: /root/bolt/install_llvm/lib/libblas-enhance.so: 1 file pushed. 1.8 MB/s (57456 bytes in 0.031s)
1: /root/bolt/install_llvm/lib/libimage.so: 1 file pushed. 2.5 MB/s (149856 bytes in 0.058s)
1: /root/bolt/install_llvm/lib/libinference.so: 1 file pushed. 2.6 MB/s (682352 bytes in 0.253s)
1: /root/bolt/install_llvm/lib/libmodel-tools.so: 1 file pushed. 2.5 MB/s (246248 bytes in 0.093s)
1: /root/bolt/install_llvm/lib/libmodel-tools_caffe.so: 1 file pushed. 1.9 MB/s (1439040 bytes in 0.710s)
1: /root/bolt/install_llvm/lib/libmodel-tools_onnx.so: 1 file pushed. 2.7 MB/s (486920 bytes in 0.169s)
1: /root/bolt/install_llvm/lib/libmodel-tools_tflite.so: 1 file pushed. 3.2 MB/s (279320 bytes in 0.083s)
1: /root/bolt/install_llvm/lib/libtensor_computing.so: 1 file pushed. 2.3 MB/s (709328 bytes in 0.291s)
1: /root/bolt/tests/bin/test_mmm_int8: 1 file pushed. 2.7 MB/s (131904 bytes in 0.047s)
1: /root/bolt/tests/bin/test_mmm: 1 file pushed. 2.0 MB/s (136400 bytes in 0.066s)
1:  
1: --- Matrix Matrix Multiplication
1: taskset: failed to set 25058's affinity: Invalid argument
1: taskset: failed to set 25061's affinity: Invalid argument
1: /root/bolt/tests/bin/test_convolution: 1 file pushed. 1.5 MB/s (30144 bytes in 0.019s)
1:  
1: --- Conv IC=3
1: taskset: failed to set 25065's affinity: Invalid argument
1: /root/bolt/tests/bin/test_convolution_bnn: 1 file pushed. 1.5 MB/s (30152 bytes in 0.019s)
1: /root/bolt/tests/bin/test_convolution_int8: 1 file pushed. 1.7 MB/s (30232 bytes in 0.017s)
1:  
1: --- Conv 5x5
1: taskset: failed to set 25070's affinity: Invalid argument
1: taskset: failed to set 25073's affinity: Invalid argument
1: taskset: failed to set 25076's affinity: Invalid argument
1:  
1: --- Conv 3x3
1: taskset: failed to set 25079's affinity: Invalid argument
1: taskset: failed to set 25082's affinity: Invalid argument
1: taskset: failed to set 25085's affinity: Invalid argument
1: /root/bolt/tests/bin/test_depthwise_convolution: 1 file pushed. 1.4 MB/s (30264 bytes in 0.021s)
1:  
1: --- Depthwise-Pointwise Conv
1: taskset: failed to set 25089's affinity: Invalid argument
1: /root/bolt/tests/bin/lenet: 1 file pushed. 2.1 MB/s (414384 bytes in 0.185s)
1:  
1:  
1: --- Network Test (LeNet)
1: taskset: failed to set 25093's affinity: Invalid argument
1: taskset: failed to set 25096's affinity: Invalid argument
1/1 Test #1: quick_benchmark ..................   Passed    5.42 sec

100% tests passed, 0 tests failed out of 1

But when I try to run the onnx2bolt binary I see an error:

CANNOT LINK EXECUTABLE "./tools/onnx2bolt": library "libprotobuf.so.11" not found

There was a different error before I exported LD_LIBRARY_PATH=/data/local/tmp/uldra.

Add a benchmark tool

Please add an easy-to-use benchmark tool that can run arbitrary models, so users can see the performance of popular models like MobileNet.

ONNX2Bolt what's available

Hello,

Could you please explain a couple of things that are unclear to me? I use ONNX models and would like to use the onnx2bolt tool. I've deployed a MaskRCNN network in ONNX, and the script fails.

  1. How can I get more information from an unsuccessful run than just a segfault?
  2. Are ROIAlign and NMS available to be converted from ONNX? Which ONNX opset is supported?
  3. And the last one: what is the "skip operators" parameter in the onnx2bolt tool?
