
zeroq's Issues

Difference in Baseline FP32 accuracy numbers for MobileNetV2 and ResNet18 as compared to DFQ

Dear Author,
Can you please explain the reason for the difference in FP32 accuracy for the MobileNetV2 and ResNet18 models between your paper and the DFQ paper?
E.g., DFQ reports MobileNetV2 FP32 accuracy of 71.7%,
but ZeroQ reports MobileNetV2 FP32 accuracy of 73.03%.
If we assume the two are different baseline models, is it a fair comparison to claim a 2% improvement of ZeroQ over DFQ on MobileNetV2? The same question presumably applies to ResNet50 and other models compared against other papers.
Please explain.
Thank you.

Quantize tensor is not enough

Great job; as far as I know, you achieve the best quantization accuracy.
I'm very interested in your paper and code, but I have some questions about them.
To my knowledge, in a fully quantized model all tensors, weights, and biases should be quantized.

From your code, I have the following questions.

  1. Bias is not quantized. Using scale_bias = scale_input * scale_weight may work (with some difference between float and int8), but if a bias value is larger than scale_bias * 128 (in int8 mode), the error becomes very large. Don't you care about this? (A small sketch of this scheme follows the list.)

  2. Input/output tensor quantization is not sufficient in your model.
    Did you use "QuantAct" as the input/output quantizer?
    From https://github.com/amirgholami/ZeroQ/blob/master/utils/quantize_model.py#L46,
    only tensors after ReLU/ReLU6 are quantized.
    For the MobileNetV2 network:
    a. the quantization of the first input is missing;
    b. in every Bottleneck, the third convolution has no ReLU/ReLU6, so its output quantization is missing;
    c. none of the bypass-add structures are quantized, so the add cannot run as a quantized operation.

  3. Should BatchNorm be fused with convolution?
    In most CNN inference engines, BatchNorm is fused into the preceding convolution or fully connected layer before inference, so it is necessary to fuse BatchNorm with conv/FC before quantization. I have no idea whether the final accuracy will increase or decrease after fusion.

  4. What are the Act_max and Act_min values?
    According to my test (on MobileNetV2), after https://github.com/amirgholami/ZeroQ/blob/master/uniform_test.py#L89
    every Act_max is 6 and every Act_min is 0, which is just the range of ReLU6. For most other methods that calibrate act_min/act_max, 0 and 6 is the most common case, so I did not find any difference in the results.
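As a reference for point 1, here is a minimal sketch of what I mean by quantizing the bias with the combined scale scale_bias = scale_input * scale_weight. It is my own illustration with assumed names, not code from the ZeroQ repository; the usual practice is to keep the bias at a wider bit-width (e.g. int32), precisely because clamping it to the int8 range can introduce a large error.

    import torch

    def quantize_bias(bias_fp32, scale_input, scale_weight, num_bits=32):
        # combined scale: the bias is accumulated in the same units as x * w
        scale_bias = scale_input * scale_weight
        qmin, qmax = -2 ** (num_bits - 1), 2 ** (num_bits - 1) - 1
        # with num_bits=8, any bias larger than scale_bias * 127 would be clipped,
        # which is exactly the error described in point 1
        q_bias = torch.clamp(torch.round(bias_fp32 / scale_bias), qmin, qmax)
        return q_bias, scale_bias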

Thank you for your reply.

Where can I find the low-bit quantization code?

Thank you for sharing the great repo.
I successfully ran your code with the 8-bit options.

In the paper, there are experimental results with mixed precision (low bit-widths).
How can I run the mixed-precision mode?

Thank you

Reproduce Mixed Quantization Results on paper

Hello, it's good to see one of the greatest papers among quantization methods.
It is really easy to reproduce, and the algorithm is really efficient, without any re-training or validation calibration.

By the way, I have several questions following the paper.

  1. Is there any plan to release mixed quantization?

To reproduce the mixed-precision quantization method from the paper (the Pareto frontier), are the following steps correct?

  1. Measure each layer's sensitivity under the model-size constraint.
  2. Group the alpha blocks of the network and select the top-k configurations for each.
  3. Solve the optimization with a DP table (t * alpha) based on the sensitivity cost? (A rough sketch of such a DP is below.)
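For what it's worth, here is a rough, hypothetical sketch of the kind of knapsack-style DP I have in mind for step 3: choose a per-layer bit-width that minimizes the summed sensitivity under a model-size budget. The function and argument names are my own assumptions, not the authors' code.

    def pick_bitwidths(layer_params, candidate_bits, sensitivity, size_budget):
        # layer_params[i]   : number of parameters in layer i
        # candidate_bits    : e.g. [2, 4, 8]
        # sensitivity[i][b] : sensitivity of layer i when quantized to b bits
        # size_budget       : total weight-size budget in bits
        dp = {0: (0.0, [])}  # used size -> (total sensitivity, bits chosen so far)
        for i, params in enumerate(layer_params):
            new_dp = {}
            for used, (cost, choice) in dp.items():
                for b in candidate_bits:
                    size = used + params * b
                    if size > size_budget:
                        continue
                    cand = (cost + sensitivity[i][b], choice + [b])
                    if size not in new_dp or cand[0] < new_dp[size][0]:
                        new_dp[size] = cand
            dp = new_dp
        # best feasible configuration (lowest total sensitivity)
        return min(dp.values(), key=lambda t: t[0])[1] if dp else None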

Thank you again.

How to Train A Quantized SSD Detector?

Hi,

I have a question about training the quantized SSD detector. It seems there is no script to run training. How can I train the quantized detector using this framework?

Thanks,

Model remains float32 type after quantization

Hello, thank you for providing the open-source code. I encountered some issues while trying to reproduce the results.
I set up the environment according to your instructions and ran the code. However, I found that the model's weights remain float32 after quantization (I set weight_bit = 8 and activation_bit = 8). I am not sure where the problem is and would appreciate your help.
Thank you again for providing the code, and I look forward to your reply.

How is the mixed-precision bit setting automated?

Hi,

This solution looks very interesting and achieves impressive results on the quantized models.
Thank you so much for sharing the solution.

I have a small point of confusion about the source code.
I can see a fixed bit precision of 8 bits for weights and activations in the source code,
but I could not find the method that determines the exact bit-precision configuration for each layer of the model.
Could you please point me to exactly where this logic is implemented in the source code?

Regards,
Albin

Why do the weights need to be dequantized after one quantization?

    # quantize to integers on the given scale / zero-point grid
    new_quant_x = linear_quantize(x, scale, zero_point, inplace=False)
    # clamp to the signed k-bit range [-2^(k-1), 2^(k-1) - 1]
    n = 2 ** (k - 1)
    new_quant_x = torch.clamp(new_quant_x, -n, n - 1)
    # map the integers back to floating point ("fake" quantization)
    quant_x = linear_dequantize(new_quant_x,
                                scale,
                                zero_point,
                                inplace=False)

Doesn't this just give back floating-point weights?
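A small self-contained example (my own, not ZeroQ code) of why this round trip is still useful: the output is indeed a float tensor, but its values are snapped onto the k-bit grid, so the float model simulates the rounding and clipping error that real integer inference would introduce.

    import torch

    x = torch.tensor([0.013, -0.402, 0.250])
    scale, zero_point, k = 0.0157, 0.0, 8   # example values, chosen arbitrarily
    q = torch.clamp(torch.round(x / scale + zero_point),
                    -2 ** (k - 1), 2 ** (k - 1) - 1)
    x_fake = (q - zero_point) * scale
    # x_fake is still float32, but every entry is now an exact multiple of `scale`
    print(x_fake)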

Questions about quantization

Hi, thanks for the great paper and codes. I have 2 questions about quantization in the paper:

  1. Is the quantization used in the paper per-channel or per-tensor? (In quant_utils.py it seems to be per-channel; see the short illustration after this list.)
  2. Are the results tested with true int8 inference, or only with fake quantization?
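For reference, a quick illustration (my own sketch, not the repo's code) of the difference asked about in question 1: per-tensor quantization uses a single scale for the whole weight tensor, while per-channel quantization computes one scale per output channel.

    import torch

    w = torch.randn(16, 3, 3, 3)                # (out_ch, in_ch, kh, kw)
    per_tensor_scale = w.abs().max() / 127      # one scale for the whole tensor
    per_channel_scale = w.abs().view(16, -1).max(dim=1).values / 127  # one per out_ch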

Thanks again!

Fusing batch normalization and convolution

Many previous works show that it is more difficult to quantize a model once batch normalization has been fused into the convolution, because the fused weights have a wider range.
So, do you have any comment on quantizing a model with fused batch normalization and convolution, if only a smarter min/max range is chosen for activation quantization? (A sketch of the standard fusion is below for reference.)
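For context, this is the standard conv+BN folding I am referring to, written out as a small sketch (not code from this repo); the fused weights w * gamma / sqrt(var + eps) are what end up with the wider range.

    import torch

    def fold_bn_into_conv(conv_w, conv_b, bn):
        # conv_w: (out_ch, in_ch, kh, kw), conv_b: (out_ch,) or None
        if conv_b is None:
            conv_b = torch.zeros_like(bn.running_mean)
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # gamma / sqrt(var + eps)
        w_fold = conv_w * scale.reshape(-1, 1, 1, 1)
        b_fold = (conv_b - bn.running_mean) * scale + bn.bias
        return w_fold, b_fold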

Is there a bug in where the network activations are hooked during data generation?

Thanks for your excellent work. I have a small question:

In distill_data.py:

(screenshot of the hook code)

At line 87, the features extracted by the hooks are the values right after the convolution, but doesn't this implicitly assume that every convolution is followed by a BN layer?
If a network has some conv layers that are not followed by BN, then Equation 3 in the paper may end up being computed between the BN parameters of two different layers. (A sketch of one possible workaround follows.)
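A rough sketch of what I mean (my own illustration, not the repo's code): register the statistics hooks only on conv layers that are directly followed by a BatchNorm, so that each hooked activation really corresponds to a BN layer. This assumes module order matches execution order, which does not hold for every network.

    import torch.nn as nn

    def register_conv_bn_hooks(model, hook_fn):
        modules = list(model.modules())
        handles = []
        for m, nxt in zip(modules, modules[1:]):
            # only hook convs whose next module is a BatchNorm
            if isinstance(m, nn.Conv2d) and isinstance(nxt, nn.BatchNorm2d):
                handles.append(m.register_forward_hook(hook_fn))
        return handles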

Thanks!

Is the proposed method offline quantization or run-time quantization?

Dear Author,
Thank you for the great work! We were able to run your code successfully. From our observations, the proposed method appears to be a "run-time quantization" method, meaning that the quantization of the weights and activations happens at inference time for each layer, as opposed to an "offline quantization" method in which stored quantized weights and activation ranges are used during inference.

During inference, when we call test(quantized_model, test_loader) in uniform_test.py (line 114),
we observe that in Quant_Conv2d() of quant_modules.py (line 131) the control flow always goes to the if not self.full_precision_flag: branch of forward() (line 169), and never to the else branch.

If it is not a run-time quantization method, how are the stored weights and activation ranges, quantized using the distilled data, being used? In that case we would expect the control flow to reach the else branch of forward(). Please clarify.

Regards,
Tej.

bitwidth of each layer (discussion of MP)

Thanks for your great work.
I would like to know which layers are more sensitive to quantization. I would be very grateful if you could share the (MP4) per-layer bit-widths of MobileNet or ResNet.

How much calibration data is needed?

Hi,

Thanks for the great work. I am wondering how much calibration data is needed. According to here and here, are only 64 calibration images enough to calibrate the model?

Hi @yaohuicai, sorry to bother you. Any idea?

Thanks.

Update: even a single calibration image already gives an accuracy of 71.17%.

Welcome update to OpenMMLab 2.0

I am Vansin, the technical operator of OpenMMLab. In September of last year, we announced the release of OpenMMLab 2.0 at the World Artificial Intelligence Conference in Shanghai. We invite you to upgrade your algorithm library to OpenMMLab 2.0 using MMEngine, which can be used for both research and commercial purposes. If you have any questions, please feel free to join us on the OpenMMLab Discord at https://discord.gg/amFNsyUBvm or add me on WeChat (van-sin) and I will invite you to the OpenMMLab WeChat group.

Here are the OpenMMLab 1.0 and 2.0 branches for each repo:

                      OpenMMLab 1.0 branch    OpenMMLab 2.0 branch
  MMEngine            -                       0.x
  MMCV                1.x                     2.x
  MMDetection         0.x, 1.x, 2.x           3.x
  MMAction2           0.x                     1.x
  MMClassification    0.x                     1.x
  MMSegmentation      0.x                     1.x
  MMDetection3D       0.x                     1.x
  MMEditing           0.x                     1.x
  MMPose              0.x                     1.x
  MMDeploy            0.x                     1.x
  MMTracking          0.x                     1.x
  MMOCR               0.x                     1.x
  MMRazor             0.x                     1.x
  MMSelfSup           0.x                     1.x
  MMRotate            1.x                     1.x
  MMYOLO              -                       0.x

Attention: please create a new virtual environment for OpenMMLab 2.0.

Reproduction and Auto-Mixed Quantization?

Hey, Thanks for your great work.
When I try to reproduce your results, I get weird numbers. For the classification task on ImageNet, here are my results:

mobilenetv2 36.5%
shufflenet 32.12%
resnet50 38.83%
resnet18 35.7%

I used PyTorch 1.5 for testing. When I test resnet20_cifar10, the result looks normal (93.88%).
Another question: when will the auto mixed-precision quantization version be released?
Can anybody help me? @amirgholami @yaohuicai @Zhen-Dong

Backpropagation function for quantized model

I just observed that the backpropagation part for the quantized model has not yet been implemented in the quant_utils.py file. Any ideas on how to approach the implementation?
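One common way to approach this, sketched below, is a straight-through estimator: round in the forward pass and pass the gradient through unchanged in the backward pass. This is a generic sketch of that technique, not an implementation taken from this repo.

    import torch

    class RoundSTE(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            return torch.round(x)

        @staticmethod
        def backward(ctx, grad_output):
            # straight-through: treat rounding as the identity for gradients
            return grad_output

    # usage: y = RoundSTE.apply(x / scale) * scale keeps x differentiable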

Runtime error when running uniform_test.py

The same error happens when running with resnet18 as well. I would appreciate any suggestions to get this fixed.

$ python ./uniform_test.py --dataset cifar10 --model mobilenetv2_w1
****** Full precision model loaded ******
Files already downloaded and verified
Traceback (most recent call last):
File "./uniform_test.py", line 75, in
dataloader = getDistilData(
File "/home/user/ZeroQ/classification/distill_data.py", line 121, in getDistilData
output = teacher_model(gaussian_data)
File "/home/user/env3/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/env3/env/lib/python3.8/site-packages/pytorchcv/models/mobilenetv2.py", line 147, in forward
x = self.features(x)
File "/home/user/env3/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/env3/env/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/home/user/env3/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user/env3/env/lib/python3.8/site-packages/torch/nn/modules/pooling.py", line 616, in forward
return F.avg_pool2d(input, self.kernel_size, self.stride,
RuntimeError: Given input size: (1280x1x1). Calculated output size: (1280x-5x-5). Output size is too small

increased inference latency for quantized model

I have just reproduced the classification results on ResNet50 + ImageNet. The accuracy is excellent!

But there is a significant increase in inference latency for the quantized model.
Test results on ResNet50 + ImageNet + Tesla T4:

  • test(model, test_loader) takes 143 seconds
  • test(quantized_model, test_loader) takes 1442 seconds

Does anybody hit the same issue?
