adamgallas / fpga_accelerator_yolov3tiny

License: Apache License 2.0

fpga_accelerator_yolov3tiny's Introduction

Introduction

First Prize Winner of the 2021 DIGILENT Cup, China College Integrated Circuit Competition.

Project Description

This project implements a convolutional neural network accelerator and deploys YOLOv3tiny on it. Combined with a loop of camera capture and display feedback, it forms a high-performance real-time object recognition and detection system.

Verification platform: Xilinx Zynq UltraScale+ XCZU3EG chip on the official Digilent Genesys ZU3EG board
Basic peripherals: Digilent PCAM 5C MIPI camera; the mini DisplayPort display interface that comes standard with the UltraScale+ device
Implementation: the convolution accelerator is designed in pure Verilog, the Zynq PS-side software is written in C, and the neural network is built and quantized in Python
Development tools: Vivado, Vitis, Python, PyTorch
Performance: YOLOv3tiny inference time under 50 ms, VGG16 backbone inference time under 200 ms, maximum clock frequency above 250 MHz, peak throughput above 172 GOPS, INT8 quantization
Resource usage: 24K LUTs, 23K FFs, 40 BRAM36Ks, 296 DSP48s
Demos included in the project: face mask recognition and helmet recognition, both based on YOLOv3tiny
Operations the convolution accelerator can perform: 1x1 Conv; 3x3 Conv; 2x2 max pooling with stride 1 or 2; arbitrary activation functions implemented through table lookup, such as ReLU, Tanh, sigmoid, and leaky ReLU
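
The table-lookup approach to activation functions can be illustrated in software. The sketch below is not taken from the repository: it assumes 8-bit unsigned codes with a scale and zero point on the input and output sides, and the values 0.05, 128, and 1/255 are made up for illustration. It precomputes a 256-entry table so that applying an activation at run time is a single indexed read, which is why one lookup mechanism can serve ReLU, leaky ReLU, Tanh, or sigmoid alike.

```python
import math


def build_activation_lut(fn, in_scale, in_zero_point, out_scale, out_zero_point):
    """Precompute fn for every possible 8-bit input code (256 entries)."""
    lut = []
    for code in range(256):
        x = (code - in_zero_point) * in_scale      # dequantize the input code
        y = fn(x)                                   # evaluate in floating point
        q = round(y / out_scale) + out_zero_point   # requantize the result
        lut.append(max(0, min(255, q)))             # clamp to the 8-bit range
    return lut


def leaky_relu(x, slope=0.1):
    return x if x >= 0 else slope * x


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Illustrative quantization parameters (assumptions, not the project's values).
leaky_lut = build_activation_lut(leaky_relu, 0.05, 128, 0.05, 128)
sigmoid_lut = build_activation_lut(sigmoid, 0.05, 128, 1 / 255, 0)

# At run time the activation is just a table read.
activated = [leaky_lut[code] for code in (108, 128, 148)]
print(activated)
```

Changing the activation function only changes the table contents, not the datapath that reads it.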

Q&A

Q1: Can this project only implement the YOLOv3tiny algorithm?
- A: The PL side implements a generic convolution accelerator that is independent of the network. The scheduling on the PS side, however, is coupled to YOLOv3tiny, so running a different network requires writing a scheduler on the PS side that matches that network's architecture. I do not recommend modifying it yourself because the engineering effort is high; I recommend using the project for study and reference.
Q2: Which convolutional neural network operations does this project support?
- A: 1x1 Conv; 3x3 Conv; 2x2 max pooling with stride 1 or 2; arbitrary activation functions through table lookup, such as ReLU, Tanh, sigmoid, and leaky ReLU.
Q3: Can I port it to my own development board?
- A: Yes, but you need to port it yourself according to the constraints of your camera, display, and board.
Q4: Does the implementation of this accelerator consume a lot of resources?
- A: Not really. It uses 24K LUTs, 23K FFs, and 40 BRAM36Ks, which almost all Xilinx boards can provide. The only concern is that DSP usage is relatively high. If the chip does not have enough DSP48s, you can map the multipliers to LUTs yourself.
Q5: Can it run on Artix or Virtex series FPGAs without a CPU?
- A: In theory, yes, but the accelerator must have a CPU for scheduling. You can try instantiating a MicroBlaze, a Cortex-M1/M3, or even a RISC-V soft core yourself.
Q6: How do I put the Python-trained weight data into the FPGA?
- A: Preprocess the weight data and put it on an SD card. On the PS side, call the built-in xilff.h SD card driver library to read the binary weight file from the SD card and load it into DDR. The PL side then accesses the data in DDR through the AXI DMA core for inference.
Q7: What is the architecture of this accelerator?
- A: The design of this accelerator is inspired by the papers "Angel-Eye: A Complete Design Flow for Mapping CNN Onto Embedded FPGA" and "Going Deeper with Embedded FPGA Platform for Convolutional Neural Network".
Q8: Can I use this project as a thesis, competition entry, paper, or other project?
- A: Not recommended. This project is no longer maintained, the comments are sparse, and the code style is not particularly standardized. It is recommended only for study and reference.
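
The weight-loading answer above describes writing preprocessed weights to a binary file that the PS side later reads from the SD card. As a rough illustration of the "Python tensor to raw bytes" step only, the sketch below serializes one layer as INT8 weights followed by little-endian INT32 biases. This layout, the function names, and the file name are assumptions made for the example; the repository's actual file format is defined by its own Python script and is not reproduced here.

```python
import struct


def serialize_layer(weights, biases):
    """Pack INT8 weights followed by little-endian INT32 biases (assumed layout)."""
    blob = struct.pack(f"<{len(weights)}b", *weights)   # one signed byte per weight
    blob += struct.pack(f"<{len(biases)}i", *biases)    # four bytes per bias
    return blob


def write_bin(path, blob):
    """Write the raw bytes to a file that could then be copied to an SD card."""
    with open(path, "wb") as f:
        f.write(blob)


# Tiny made-up layer: four weights and two biases.
blob = serialize_layer([1, -2, 127, -128], [1000, -1000])
print(len(blob), blob[:4].hex())
# write_bin("layer0.bin", blob)  # hypothetical file name
```

A raw layout like this can be read into a DDR buffer with a plain file read, without any parsing on the embedded side.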

Before Raising an Issue

This repository is no longer maintained, but I will try to reply to issues as far as possible. Before raising an issue, please check whether a related issue already exists in the history. From what I have seen, most issues concern neural network quantization. Quantization is not the focus of this project, and the Python project included in the repository is of poor quality, so please refer to more standard quantization flows and use more convenient quantization tools. Reproducing this project or porting it to other neural networks involves significant engineering difficulty. Please evaluate that difficulty carefully before investing time.

Citation

If you find this work useful, please cite

@inproceedings{chen2021hardware,
  title={Hardware Resource and Computational Density Efficient CNN Accelerator Design Based on FPGA},
  author={Chen, Xiang and Li, Jindong and Zhao, Yong},
  booktitle={2021 IEEE International Conference on Integrated Circuits, Technologies and Applications (ICTA)},
  pages={204--205},
  year={2021},
  organization={IEEE}
}

fpga_accelerator_yolov3tiny's Issues

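Several questions about inference at runtime

I have a few questions, if you don't mind:

In the hls_preprocess module there is a convert function that appears to downsample every pixel to the range 0-127. Does that mean all data subsequently fed to the accelerator is in 0-127 rather than 0-255? Is this determined by the quantization?
In lines 258-260 of main.c, inside Setup_Video_Resize(), the addresses of the three elements of ifmInbuf are set separately. My understanding is that arrays [0], [1], and [2] correspond to the R, G, and B channels of the input video stream. However, in ifm_ddr_base_addr_in_list at line 122 of accel_data.c, only ifmInBuf[1] seems to be used. Does that mean only the G channel is fed to the accelerator? Is that correct?
The input to the first layer should be a three-channel RGB feature map, but in yolov3tiny_quant.py the channel count stored for the first layer in the configuration bin file is 8 (the first element of ifm_list). I don't understand why it is 8; shouldn't it be 3?

Thanks.

How is the INT8-quantized model shown in Figure 2 of the Python project obtained?

Output problem

This is my output. I don't understand what causes it. I am using the 2023 version.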

Problem generating quantized weights

First of all, thank you to the author.
I ran into a problem while reproducing the weights.

import cv2
from torchvision import transforms

img_path='gao_mask.jpg'
img = cv2.imread(img_path)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # convert BGR to RGB

# Preprocess the image
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((640, 640)),  # assumes the model input size is 640x640
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])  # normalize
])
img_tensor = transform(img).unsqueeze(0)  # add the batch dimension

float_model,dataloader_iter=load_model_data('face_mask.yaml','yolov3tiny_facemask.pt',416,False)
generate_quant_model(float_model,dataloader_iter,'yolov3tiny_facemask_quant.pth')
quant_model=load_quant_model('yolov3tiny_facemask.pt','yolov3tiny_facemask_quant.pth')
quant_model_evaluate_show(img_tensor,quant_model)
yolov3tiny_infer_para_gen(quant_model,32,'F')

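The directory layout is:
python_prj
├── images
├── labels (does not seem to be used in the code)
└── yolov3tiny_quant.py

The error is finally raised in quant_model_evaluate_show, which I believe checks whether the split weights can still detect correctly.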
Input shape must be (N, C, H, W)!
File "D:\yolo_work\fpga_accelerator_yolov3tiny\python_prj\models\common.py", line 45, in fuseforward
return self.act(self.conv(x))
File "D:\yolo_work\fpga_accelerator_yolov3tiny\python_prj\yolov3tiny_quant.py", line 80, in quant_model_detect
x=quant_model[1].modelii
File "D:\yolo_work\fpga_accelerator_yolov3tiny\python_prj\yolov3tiny_quant.py", line 97, in quant_model_evaluate_show
res=quant_model_detect(x,quant_model)
File "D:\yolo_work\fpga_accelerator_yolov3tiny\python_prj\yolov3tiny_quant.py", line 320, in
quant_model_evaluate_show(img_tensor,quant_model)
ValueError: Input shape must be (N, C, H, W)!
Have you run into this problem?

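B8, MIPI_A_PWUP_LS

Hello. In the Vivado project, emio_gpio_o is bound to pin B8 in the I/O constraints, and the manual shows that B8 is connected to MIPI_A_PWUP_LS. I could not find any information about the MIPI_A_PWUP_LS pin online. What is its purpose?

Could a WeChat or QQ group be created for discussion?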

Questions about the cal_addtree_int16_x9 module

assign a1_d1={a1[15],a1[15],a1[15:0]};
In this line of code, is the 18th bit the sign bit? And what is the 17th bit? Could you explain?
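
In Verilog, a concatenation lists its most significant part first, so {a1[15],a1[15],a1[15:0]} is an 18-bit value whose top two bits are both copies of a1[15], followed by the original 16 bits. In other words, the line sign-extends a 16-bit two's-complement number to 18 bits. The Python sketch below models that bit pattern; the function names are made up for the example, and the remark that the extra bits give an adder tree headroom against overflow is a plausible reading rather than something the source states.

```python
def sign_extend_16_to_18(a1):
    """Model {a1[15], a1[15], a1[15:0]} for a 16-bit input."""
    a1 &= 0xFFFF                       # keep the low 16 bits
    sign = (a1 >> 15) & 1              # a1[15], the sign bit
    return (sign << 17) | (sign << 16) | a1


def to_signed(value, bits):
    """Interpret an unsigned bit pattern as a two's-complement number."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


for pattern in (0x0005, 0x7FFF, 0x8000, 0xFFFF):
    extended = sign_extend_16_to_18(pattern)
    # The numeric value is unchanged by the extension.
    print(hex(pattern), to_signed(pattern, 16), hex(extended), to_signed(extended, 18))
```

Both added bits carry the sign, so the value is preserved while the representable range grows.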

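Question about parallel convolution computation

Does this PL convolution module compute only one of the 10 YOLOv3tiny layers at a time, with layers 1 through 10 passing through the module in turn? Or is it pipelined, so that at a macro level 10 stages (from different frames) are computed at the same time?

The monitor shows nothing and the serial port prints the following error messages

How can this be resolved?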

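How the accelerator's GOP/s performance is calculated

I would like to ask the author how the peak and average figures are each calculated.

This is how I calculated it:

The factor 2 counts one multiplication and one addition, and 64*9 is 64 3x3 convolutions. Calculated this way, the result is 288 GOPS (taking the frequency as 250 MHz).

Also, how is the average calculated?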
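
The asker's arithmetic can be written out explicitly. The sketch below only reproduces the calculation described in the question (64 parallel 3x3 kernels, two operations per multiply-accumulate, 250 MHz); the function name is made up, and it does not show how the README's 172 GOPS figure was derived.

```python
def peak_gops(macs_per_cycle, clock_hz, ops_per_mac=2):
    """Peak throughput in GOPS, counting each MAC as a multiply plus an add."""
    return macs_per_cycle * ops_per_mac * clock_hz / 1e9


macs_per_cycle = 64 * 9          # 64 kernels of 3x3, as assumed in the question
print(peak_gops(macs_per_cycle, 250e6))  # 288.0
```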

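Contents of the source files

Could the author explain the relationship between the acceltop file and accel_conv in source? I tried porting to PYNQ, and with the whole source directory included the tools report insufficient resources.

Questions about studying your code and trying to port it to the ZedBoard

Hello. While studying your code, I found that the input you feed the neural network is 418x258x8. Isn't the YOLOv3tiny input 416x416x3? I'm not sure whether I am right. If so, could you tell me why the number of input channels was changed?

Font library question

Hello! How were the font glyphs extracted, and how should I extract them if I want to display digits?

How to find config_list

Hello, I am studying your SDK code, but I am confused because config_list does not appear to be defined. I hope I can get in touch with you. Thank you!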

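Questions about padding and valid signals

First of all, thank you very much for open-sourcing such a complete project. It helped me understand several techniques I did not know, but there are still parts I cannot follow.
1. Regarding padding: I see that the column counts in your image parameters are all the original image width plus 2, for example: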
localparam [8:0] FM_COL_0=LINEBUFFER_LEN1;//15->13+2?
localparam [8:0] FM_COL_1=LINEBUFFER_LEN1+LINEBUFFER_LEN2;//28
localparam [8:0] FM_COL_2=LINEBUFFER_LEN1+LINEBUFFER_LEN2+LINEBUFFER_LEN3;//54
localparam [8:0] FM_COL_3=LINEBUFFER_LEN1+LINEBUFFER_LEN2+LINEBUFFER_LEN3+LINEBUFFER_LEN4;//106
localparam [8:0] FM_COL_4=LINEBUFFER_LEN1+LINEBUFFER_LEN2+LINEBUFFER_LEN3+LINEBUFFER_LEN4+LINEBUFFER_LEN5;//210
localparam [8:0] FM_COL_5=LINEBUFFER_LEN1+LINEBUFFER_LEN2+LINEBUFFER_LEN3+LINEBUFFER_LEN4+LINEBUFFER_LEN5+LINEBUFFER_LEN6;//418
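However, I cannot work out how this accomplishes the padding.

2. Regarding global_zero: I see that its purpose is to force the current accumulation data (acc_curr_data_zero) to 0. I don't understand why the current accumulation data needs to be 0, and as a result I cannot understand why the valid count is written the way it is. (I only understand acc_prev_data_zero: for the first convolution pass, the previous accumulation data should be 0.)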
always@(posedge clk) begin
if(col_cnt<2 || (addr_cnt<conv_col_mult_2 || addr_cnt>=conv_addr_len)) begin //?????????
global_zero<=1'b1;
end else begin
global_zero<=1'b0;
end
end

3. Is the logic that decides whether to receive data related to the two questions above?
always@(posedge clk) begin
case(ofm_send_sel)
2'b00: begin // whole
no_pool_ofm_addr_start<=0;
no_pool_ofm_addr_end<=conv_addr_len;
pool_ofm_addr_start<=0;
pool_ofm_addr_end<=pool_addr_len;
end
2'b01: begin // no head // last multi-row block: drop the head convolution results?
no_pool_ofm_addr_start<=conv_col;
no_pool_ofm_addr_end<=conv_addr_len;
pool_ofm_addr_start<=pool_col;
pool_ofm_addr_end<=pool_addr_len;
end
2'b10: begin // no tail // first multi-row block: drop the trailing convolution results?
no_pool_ofm_addr_start<=0;
no_pool_ofm_addr_end<=conv_addr_len-conv_col;
pool_ofm_addr_start<=0;
pool_ofm_addr_end<=pool_addr_len-pool_col;
end
2'b11: begin // no head no tail: the middle blocks?
no_pool_ofm_addr_start<=conv_col;
no_pool_ofm_addr_end<=conv_addr_len-conv_col;
pool_ofm_addr_start<=pool_col;
pool_ofm_addr_end<=pool_addr_len-pool_col;
end
default: begin
no_pool_ofm_addr_start<=0;
no_pool_ofm_addr_end<=conv_addr_len;
pool_ofm_addr_start<=0;
pool_ofm_addr_end<=pool_addr_len;
end
endcase
end
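
The case statement above can be read as two independent switches: bit 0 of ofm_send_sel drops one row's worth of addresses from the start of the window, and bit 1 drops one row's worth from the end. The Python sketch below is a direct software model of that selection, written only to make the four cases easier to compare. It ignores the registered (clocked) timing of the original, and the numeric values in the example are made up.

```python
def ofm_windows(ofm_send_sel, conv_col, conv_addr_len, pool_col, pool_addr_len):
    """Return ((conv_start, conv_end), (pool_start, pool_end)) for a 2-bit selector."""
    drop_head = bool(ofm_send_sel & 0b01)   # 2'b01 and 2'b11: skip the first row
    drop_tail = bool(ofm_send_sel & 0b10)   # 2'b10 and 2'b11: skip the last row
    conv_start = conv_col if drop_head else 0
    conv_end = conv_addr_len - conv_col if drop_tail else conv_addr_len
    pool_start = pool_col if drop_head else 0
    pool_end = pool_addr_len - pool_col if drop_tail else pool_addr_len
    return (conv_start, conv_end), (pool_start, pool_end)


# Hypothetical sizes: 15 columns, 10 conv rows, 8 pooled columns, 5 pooled rows.
for sel in range(4):
    print(format(sel, "02b"), ofm_windows(sel, 15, 150, 8, 40))
```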

Finally, thank you very much again for open-sourcing such a complete project; it cleared up many things I did not understand.
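
The FM_COL parameters in question 1 are running sums of the LINEBUFFER_LEN values, and the inline comments give the resulting totals. The sketch below works backwards from those comments to the individual lengths and checks the asker's observation that each total is a feature map width plus 2. The individual LINEBUFFER_LEN values and the list of widths (13 up to 416, as for a 416-pixel-wide input) are inferred for illustration, not copied from the source, and the sketch says nothing about how the hardware actually inserts the padding.

```python
from itertools import accumulate

# Inferred from the inline comments 15, 28, 54, 106, 210, 418 (assumption).
linebuffer_len = [15, 13, 26, 52, 104, 208]

# FM_COL_k is the sum of the first k+1 line buffer lengths.
fm_col = list(accumulate(linebuffer_len))
print(fm_col)  # [15, 28, 54, 106, 210, 418]

# Assumed feature map widths, smallest to largest.
feature_map_widths = [13, 26, 52, 104, 208, 416]
padded_widths = [w + 2 for w in feature_map_widths]  # one extra column on each side
print(fm_col == padded_widths)  # True
```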

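Verilog

I would like to ask:
what is the relationship among the 8 input data values in this part of the code, and among the 8 weight values?

Software configuration for the DMA

In the software part I only see this section configuring the AXI DMA: the first two lines reset it, and the last two lines start it and configure the interrupt threshold. I don't see any code that actually uses the DMA, such as specifying a transfer address. Am I missing something?

Error when splitting weights

Hello author. I pulled the latest version of yolov3 with Docker; generating the weights and detecting images on the PC both work fine.
I then quantized and split those weights with your functions.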

if __name__ == '__main__':
    float_model,dataloader_iter=load_model_data('face-mask-detection.yaml','last.pt',416,False)
    generate_quant_model(float_model,dataloader_iter,'yolov3tiny_facemask_quant.pth')
    quant_model=load_quant_model('last.pt','yolov3tiny_facemask_quant.pth')
    yolov3tiny_infer_para_gen(quant_model,32,'F')

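The new version of yolov3 renamed some functions used in quantization, and I made the corresponding changes.
The function yolov3tiny_infer_para_gen raises the following error: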

Exception has occurred: KeyError
'1.model.0.act.scale'
  File "D:\fpga\app\test.py", line 132, in generate_para_list
    ascale=quant_model.state_dict()['1.model.'+str(index)+'.act.scale'].item()
  File "D:\fpga\app\test.py", line 239, in yolov3tiny_infer_para_gen
    w,b,wscale,ascale,cscale,azp,czp=generate_para_list(quant_model,index)
  File "D:\fpga\app\test.py", line 314, in <module>
    yolov3tiny_infer_para_gen(quant_model,32,'F')
KeyError: '1.model.0.act.scale'

Do you have any suggestions for resolving this?
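
A KeyError like the one above means that the exact key string is absent from the mapping returned by state_dict(). One way to investigate is to list the keys that are present and look for entries that were renamed. The sketch below uses a plain dict as a stand-in for a real state_dict, and every key in it is hypothetical; with a real model the same helper could be applied to quant_model.state_dict().

```python
def keys_ending_with(state_dict, suffix):
    """Return the sorted keys of a state_dict-like mapping that end with suffix."""
    return sorted(key for key in state_dict if key.endswith(suffix))


# Hypothetical stand-in for quant_model.state_dict(); real key names will differ.
example_state_dict = {
    "1.model.0.conv.weight": None,
    "1.model.0.conv.scale": 0.02,
    "1.model.0.conv.zero_point": 0,
}

wanted = "1.model.0.act.scale"
print(wanted in example_state_dict)                    # False, so indexing raises KeyError
print(keys_ending_with(example_state_dict, ".scale"))  # candidate keys to inspect
```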

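Algorithm part

Where is the algorithm implemented? I took a quick look at the project, and the IP cores built with HLS do not appear to be the algorithm part.

How to obtain the binary weight files on the SD card

Are the weights on the SD card split into ten parts by weight and bias respectively? How is the conversion done (from the trained .pt file to the bin files split into 10 parts)?

Problem when detecting fruit

After reproducing your source code, I replaced the bin files starting with F (mask detection) with bin files I generated myself (also starting with F) and left the helmet detection bin files untouched. The labels in FCS.bin are apple, banana, and orange, but oranges are never recognized. Inference in PyCharm detects oranges, yet on the board oranges are always recognized as apples. Could the author explain where the problem lies?

My understanding of the Detect function in your accel.c file

Hello author. After reading your Vitis code, my understanding is that the Detect function produces the box coordinates and other data that the hardware needs to draw the bounding box. In detail, I believe the three for loops traverse the feature map of one frame captured by the camera, and curr_p is a probability value for each pixel. The test if(curr_p>max_p) compares these values and keeps the data of the pixel with the highest probability (max_p, trec, irec, jrec). I believe the next for loop selects the class with the highest probability as the detected object class. A series of coordinate conversions then yields the box coordinates and other data needed for drawing. I would like to draw multiple boxes in hardware. So far I have changed if(curr_p>max_p) to if(curr_p>fixed value), so that every curr_p above the fixed value is saved and Detect returns multiple sets of box data. Is my understanding of the Detect function correct, and is changing the test to if(curr_p>fixed value) a workable way to obtain multiple sets of box data per frame?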
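
The difference between the two tests described above can be shown on a small list of candidates. This is a simplified sketch, not the repository's Detect function: the candidate records, field names, and threshold are made up. It contrasts keeping the single best score with keeping every score above a fixed value.

```python
def best_detection(candidates):
    """Keep only the highest-scoring candidate (the curr_p > max_p style test)."""
    best = None
    for cand in candidates:
        if best is None or cand["p"] > best["p"]:
            best = cand
    return best


def detections_above(candidates, threshold):
    """Keep every candidate whose score exceeds a fixed threshold."""
    return [cand for cand in candidates if cand["p"] > threshold]


# Made-up candidates: score p, anchor index t, grid position (i, j).
candidates = [
    {"p": 0.91, "t": 0, "i": 3, "j": 4},
    {"p": 0.88, "t": 0, "i": 3, "j": 5},
    {"p": 0.20, "t": 1, "i": 9, "j": 1},
]

print(best_detection(candidates))
print(detections_above(candidates, 0.5))
```

In this example the two candidates above the threshold sit in neighbouring grid cells, so they may describe the same object. Detection pipelines commonly follow a threshold with non-maximum suppression to merge such overlaps; whether that is needed here depends on the application.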

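Question about switching to a custom model

Hello, I am currently working on a face recognition project and have trained a custom model with yolov4 tiny. Please let me know where I need to import my model and what changes I need to make in this git project. Thank you.

Font library glyphs

How are the font library files in the HLS source extracted?

Program hangs at wait_ap done

The CPU never receives the interrupt from the convolution module. I have checked the interrupt configuration many times, so it is probably not an interrupt configuration problem. Debugging with the ILA shows that vdma1 writes data back to DDR, but the DMA sends no data to the convolution module, and the convolution module's s axis is never triggered during the whole process. I have not modified the DMA or convolution code, and the hardware module settings are also the same. Do you have any suggestions?

Problem encountered with the Vivado project

The program hangs in the Inference function inside YOLOV3_TINY

While debugging on the PS side, I found that the program hangs. Debugging showed that it hangs in the Inference function inside YOLOV3_TINY: the problem occurs when commands are sent to the convolution core over the AXI_Lite bus. I tried observing the AXI_Lite signals with the ILA, but no read or write commands or data seem to be sent. How can I resolve this?

Could you provide the simulation test files for this convolution accelerator? I am willing to pay.

Could the author leave contact information? I have questions to ask.

Problems I encountered while debugging this system

Below are some of the errors that appeared during debugging, along with the board wiring diagram.

Question about the Accel FM LEN settings

How were the feature map size settings in accel_parameter.h of the Vitis project (lines 93 to 107) chosen? Why is the row base 8 while the column base is 13, and why is 2 added?

Problem generating the SD card files

When I use the yolo3tiny_quant file to split the weights, I get the following error message:

I changed the file paths as follows:

The best.pt file is the weight file I trained, the .pth file was generated from the .pt file with torch, and the contents of mask.yaml are as follows:

Which parts of the source code need to be modified to recognize and classify other objects

I have already implemented fruit detection and classification in Python; training, inference, and quantization all run. I have also reproduced your source code and can recognize helmets through the board's camera. I now want to use your source code to detect and recognize fruit. Which parts of your source code need to be modified to achieve this on the board? Could you explain in detail?

Are there any related publications

My paper references the work in this project. Has the author published a paper on this work? I would like to cite it.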

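Question about BN layer normalization

Does this accelerator include BN layers for normalization?

Question about data stream transfer

What are the rules for storing feature map data in DDR? Is it as in Figure 1? For example, for a (224, 224, 3) image, is the order rows first, then columns, then channels? Is the data sent to the PL at consecutive addresses? And are the 8 IFMs read as in Figure 2?
Does the PL side unroll kernel parallelism as a systolic array and then extend it to IFM and OFM channel parallelism? Is each IFM multiplied with 8 different kernels, so that 8 different IFMs are multiplied in parallel with 64 different kernels?

Which versions of Vivado and Vitis were used

Question about the 256-byte minimum for IFM and OFM sizes

Sorry to bother you again. In the C code, both Set_Next_OFM_ADDR_LEN() and Set_Next_IFM_ADDR_LEN() contain logic that treats feature map sizes smaller than 256 as 256. What is the purpose of this? Thanks.

Could you provide the test data used in the project

I am trying to port this project's software and hardware to a RISC-V soft core. The software runs normally, but the accelerator always produces obviously wrong results. I would like to compare the intermediate data of the neural network to find which part of my system is at fault, but apart from generating the quantization files, the project contains no testbench or MATLAB code for simulating the accelerator to compare against. If you have such code from your own work, could you add it to the project? It would be a great help for debugging.

I want to read images from the SD card for detection. How should I do this

I modified the input image of load_pic(), uncommented it, and disabled the camera, but the screen still shows nothing. The detection information printed on the serial port is as follows:

Problem generating the quantized model

Hello. When running yolov3tiny_quant, the script directly loads an already-quantized model with load_quant_model. How do I obtain this quantized model (.pth)? Is it generated by generate_quant_model? I also got an error during generation. Could you help?

Hello author, how are the throughput and accuracy of this system determined?

The Python project cannot be reproduced

Hello, facemask.yaml and yolov3-tiny-facemask.yaml are mentioned in the Readme but not provided.

Feasibility of replacing yolov3-tiny with yolov5s to improve detection accuracy and detect multiple objects simultaneously

Hello author. I previously reproduced your project and saw this passage in your introduction:

That gave me the idea of switching to yolov5 on this basis and later improving v5 to raise the detection accuracy. However, a couple of days ago I saw you say under someone else's issue that this project is not a general-purpose accelerator and does not support yolov5. So I would like to confirm: is this accelerator suitable only for accelerating yolov3-tiny inference? If yolov5 is not supported and I instead modify the yolov3-tiny algorithm to improve accuracy, will the accelerator support the modified yolov3-tiny?
A second question concerns detecting multiple objects and drawing multiple boxes at once. The project currently draws only one box. Can I draw multiple boxes by modifying the hls_rect_ip code? If not, which parts need to be modified?

How to use the AXI DMA on the PS side?

Hello. I implemented a LeNet neural network on the PL side of the ZedBoard, ported your AXI interface code, and used the AXI DMA IP core. However, I don't know how to use the AXI DMA on the PS side. I ported part of your AXI DMA code and changed the interrupt number and addresses, but during on-board verification the PS side does not receive the interrupt signal (ap_done) from the PL side, and I don't know why. How did you write the AXI DMA part?

What do some of the parameters in the quantization file mean?

I want to quantize to 8 bits. What should I set the two parameters ofmch_t and first_chr to? I don't understand what these parameters mean. Could the author explain? Thank you very much.

Question about how the PS controls the input feature maps

Hello. I would like to learn from your code and control logic, but I have some questions about how eight channels of feature maps are transferred each time when controlling data transfer.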
void Set_Next_IFM_ADDR_LEN(){
    if(iter_ofm_post!=ofm_batch-1||iter_div_post!=fm_div_cnt-1||iter_ifm_post!=ifm_batch-1){
        ifm_send_task_enable=1;
        if(fm_size>256) ifm_addr_fmbase=iter_ifm_pre*fm_size;
        else ifm_addr_fmbase=iter_ifm_pre*256;
        if(fm_div_cnt==1){
            ifm_addr_offset=0;
            ifm_send_len=fm_size<<3;
        }
        else{
            if(iter_div_pre==fm_div_cnt-1){
                ifm_addr_offset=(fm_row-fm_res-2)*fm_col;
                ifm_send_len=((fm_res+2)*fm_col)<<3;
            }
            else{
                ifm_addr_offset=(iter_div_pre*fm_div)*fm_col;
                ifm_send_len=((fm_div+2)*fm_col)<<3;
            }
        }
        ifm_addr_send=ifm_ddr_base_addr+((ifm_addr_fmbase+ifm_addr_offset)<<3);
    }
    else{
        ifm_send_task_enable=0;
    }
    return;
}

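My reading of this code is that each batch sends only the base address and length of the corresponding block of the first feature map; I could not find any explicit transfer for the other seven maps in the same batch. Does the <<3 refer to the 8-bit data width of INT8? I have been reading the code on the assumption that <<3 refers to the INT8 width. Could you tell me how the transfer actually works? Thank you.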
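
As a point of arithmetic, x << 3 multiplies x by 8; it does not by itself denote a bit width. The sketch below ports the address and length calculation from the C function to Python so the numbers can be tried out. It assumes the products in the listing are iter_ifm_pre times fm_size (or 256), (fm_res+2) times fm_col, and iter_div_pre times fm_div; it omits the ifm_send_task_enable condition; and all example values are made up. One possible reading, which is an assumption and not confirmed by the source, is that each feature map position occupies 8 bytes because eight INT8 channels are stored side by side, in which case a single transfer would already cover all eight maps of a batch.

```python
def next_ifm_transfer(ifm_ddr_base_addr, fm_size, fm_row, fm_col, fm_div, fm_res,
                      fm_div_cnt, iter_ifm_pre, iter_div_pre):
    """Return (start address, length in bytes) of the next IFM transfer."""
    # Feature maps smaller than 256 positions are spaced as if they had 256.
    fmbase = iter_ifm_pre * (fm_size if fm_size > 256 else 256)
    if fm_div_cnt == 1:                       # whole map in one transfer
        offset = 0
        send_len = fm_size << 3
    elif iter_div_pre == fm_div_cnt - 1:      # last row block, plus 2 overlap rows
        offset = (fm_row - fm_res - 2) * fm_col
        send_len = ((fm_res + 2) * fm_col) << 3
    else:                                     # earlier row blocks
        offset = iter_div_pre * fm_div * fm_col
        send_len = ((fm_div + 2) * fm_col) << 3
    addr = ifm_ddr_base_addr + ((fmbase + offset) << 3)
    return addr, send_len


# Hypothetical 15x15 map sent whole, second group of channels.
print(next_ifm_transfer(0x1000, 225, 15, 15, 0, 0, 1, 1, 0))  # (6144, 1800)
print((225 << 3) == 225 * 8)                                   # True
```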

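Vivado cannot open the BD file in the project

I am using Vivado 2018.3, and after opening the project I cannot see the BD file. Before asking, I looked through earlier issues, which say to use version 2019.2 or later. Is there a solution for 2018.3? (I tried some approaches found online without much success, and installing 2019.2 would take a lot of time.)

ZCU104 porting question

Hello. I am trying to port the project to the ZCU104 and have a question about the constraint on dout_0[0]. Your constraints are: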
set_property PACKAGE_PIN B8 [get_ports {dout_0[0]}]
set_property IOSTANDARD LVCMOS12 [get_ports {dout_0[0]}]

What is the function of pin B8? Is there a pin with a similar function on the ZCU104?

Thank you!

A few questions about the RTL code. Thanks!

// global addr and global valid, zero, stride=1----------------------------------------
wire pool_col_zero_s1;
wire pool_row_zero_s1;
assign pool_col_zero_s1=((pool_col_addr==0)||(pool_col_addr==conv_col_minus_1)||(pool_col_addr==conv_col_minus_2));  // why are the last 2 columns of the matrix both zero-padded here?
assign pool_row_zero_s1=((pool_row_addr==0)||(pool_row_addr==1)||(pool_row_addr==conv_row_minus_1)||(pool_row_addr==conv_row));  // why are the last two rows of the matrix both zero-padded here?
wire gen_pool_valid_s1;
wire gen_pool_zero_s1;
assign gen_pool_valid_s1=((pool_row_addr>=1)&&(pool_row_addr<=conv_row)&&(!pool_idle));
assign gen_pool_zero_s1=pool_col_zero_s1||pool_row_zero_s1;

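Hello. After reading the code above, I have two questions. 1. The last two columns and the last two rows of the matrix after pooling are all set to 0. Why? 2. According to this code, once the data at (row=2,col=1) is read, it is pooled with (row=1,col=0), (row=1,col=1), and (row=2,col=0) from the line buffer. But the col=0 data is padding produced by the convolution, which seems to differ from the YOLOv3tiny network structure. Why is it handled this way?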

assign conv_row={1'b0,row};
assign pool_row_t1=conv_row-2;
assign pool_row_t2={1'b0,pool_row_t1[8:1]};
assign pool_row=pool_row_t2+2;

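In the code above, the pooled row count computed for stride=1 seems inconsistent with what actually happens: with stride=1 the number of rows in the pooling output should not be halved. What is the reasoning behind this code?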
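
Evaluating the four assign statements for a few values makes the observation concrete: the expression subtracts 2, halves, and adds 2 back. The Python sketch below is a direct model of those assignments for inputs of at least 2; the function name and the sample row counts are chosen for illustration, and the sketch does not show whether a separate path handles stride=1 elsewhere in the design.

```python
def pool_row_from_row(row):
    """Model pool_row = {1'b0, (row - 2)[8:1]} + 2 for row >= 2."""
    conv_row = row                     # {1'b0,row} only zero-extends the value
    pool_row_t1 = conv_row - 2
    pool_row_t2 = (pool_row_t1 >> 1) & 0xFF   # bits [8:1] of pool_row_t1
    return pool_row_t2 + 2


for row in (15, 28, 258, 418):
    print(row, pool_row_from_row(row))
```

For 418 the result is 210, which is (418 - 2) / 2 + 2, so the formula as written always halves the inner size.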

reg task_done;
always@(posedge clk) begin
	case(task_reg)
		3'b100: task_done<=recv_done;
		3'b010: task_done<=send_done;
		3'b001: task_done<=conv_done;
		3'b101: task_done<=recv_done;
		default: task_done<=0;
	endcase
end

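In the code above, task_reg=101 means that recv and conv run at the same time. If recv finishes first, the CPU receives ap_done immediately. Does this affect the normal operation of the system?

One last question: zero_point_act is the zero point of the data after the activation function. Why is this zero point fed only to the pool module for padding? When the computation bypasses pooling, the output data uses the quantized-output zero point zero_point_out. Why not use zero_point_act?

Please help clear up the questions above. Many thanks!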