antspy / quantized_distillation

Implements quantized distillation. Code for our paper "Model compression via distillation and quantization"

License: MIT License

Python 95.98% Perl 4.02%

quantized_distillation's Introduction

Model compression via distillation and quantization

This code has been written to experiment with quantized distillation and differentiable quantization, techniques developed in our paper "Model compression via distillation and quantization".

If you find this code useful in your research, please cite the paper:

@article{2018arXiv180205668P,
   author = {{Polino}, A. and {Pascanu}, R. and {Alistarh}, D.},
    title = "{Model compression via distillation and quantization}",
  journal = {ArXiv e-prints},
archivePrefix = "arXiv",
   eprint = {1802.05668},
 keywords = {Computer Science - Neural and Evolutionary Computing, Computer Science - Learning},
     year = 2018,
    month = feb,
}

The code is written in PyTorch 0.3 using Python 3.6. It is not backward compatible with Python 2.x.

Note: PyTorch 0.4 introduced some major breaking changes. To use this code, please use PyTorch 0.3.

Check for the compatible version of torchvision. To run the code, use torchvision 0.2.0.

pip install torchvision==0.2.0

This should be done after installing the requirements.

Getting started

Prerequisites

This code is mostly self-contained. Only a few additional libraries are required; they are specified in requirements.txt. The repository already contains a fork of the openNMT-py project. Note that, due to the rapidly changing nature of the openNMT-py codebase and the substantial time and effort required to make it compatible with our code, it is unlikely that we will support newer versions of openNMT-py.

Summary of folder's content

This is a short explanation of the contents of each folder:

datasets is a package that automatically downloads and processes several datasets, including CIFAR10, PennTreeBank, WMT2013, etc.
quantization contains the quantization functions that are used.
perl_scripts contains some perl scripts taken from the moses project to help with the translation task.
onmt contains the code from openNMT-py project. It is slightly modified to make it consistent with our codebase.
helpers contains some functions used across the whole project.
model_manager.py contains a useful class that implements common I/O operations on saved models. It is especially useful when training multiple similar models, as it keeps track of the options with which the models were trained and the results of each training run. Note: it does not support concurrent access to the same files. I am working on a version that does; if you are interested, drop me a line.
First-level files like cifar10_test.py are the main files that implement the experiments using the rest of the codebase.
Other folders contain model definitions and training routines, depending on the task.
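
The bookkeeping that model_manager.py is described as doing (remembering the options each model was created with and the results of each training run) can be sketched roughly as below. This is a hypothetical stand-in written for illustration, not the repository's ModelManager: the class name TinyModelRegistry, its methods, and the JSON file format are all assumptions. Like the class described above, it makes no attempt to handle concurrent access.

```python
import json
import os
import tempfile


class TinyModelRegistry:
    """Hypothetical illustration only; NOT the repository's ModelManager."""

    def __init__(self, registry_path, create_new=False):
        self.registry_path = registry_path
        if create_new:
            # First use: start from an empty registry and write it to disk.
            self.saved_models = {}
            self._save()
        else:
            # Later uses: the file must already exist, otherwise open()
            # raises FileNotFoundError.
            with open(registry_path) as f:
                self.saved_models = json.load(f)

    def _save(self):
        with open(self.registry_path, "w") as f:
            json.dump(self.saved_models, f)

    def add_new_model(self, name, options):
        if name in self.saved_models:
            raise ValueError('The model "{}" is already registered'.format(name))
        self.saved_models[name] = {"options": options, "runs": []}
        self._save()

    def record_run(self, name, results):
        if name not in self.saved_models:
            raise ValueError('The model "{}" is not in the registry'.format(name))
        self.saved_models[name]["runs"].append(results)
        self._save()


# Usage: create the registry once, then reopen it to retrieve what was stored.
registry_file = os.path.join(tempfile.mkdtemp(), "registry.json")
registry = TinyModelRegistry(registry_file, create_new=True)
registry.add_new_model("toy_teacher", {"useBatchNorm": True})
registry.record_run("toy_teacher", {"epochs": 2, "accuracy": 0.5})

reopened = TinyModelRegistry(registry_file)  # create_new=False: file must exist
print(reopened.saved_models["toy_teacher"])
```

In this sketch, opening a registry file that was never created fails with a "No such file or directory" error, which is why a "create new" flag is needed on first use.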

Running the code

The first thing to do is to import a dataset and create the train and test set loaders. Define a folder where you want to save all your datasets; they will be automatically downloaded and processed in the folder specified. The following example shows how to load the CIFAR10 dataset, then create and train a model.

import datasets
datasets.BASE_DATA_FOLDER = '/home/saved_datasets'

batch_size = 50
cifar10 = datasets.CIFAR10() #-> will be saved in /home/saved_datasets/cifar10
train_loader, test_loader = cifar10.getTrainLoader(batch_size), cifar10.getTestLoader(batch_size)

Now we can use train_loader and test_loader as generators from which to get the train and test data as PyTorch tensors.

At this point we just need to define a model and train it:

import os
import cnn_models.conv_forward_model as convForwModel
import cnn_models.help_fun as cnn_hf
teacherModel = convForwModel.ConvolForwardNet(**convForwModel.teacherModelSpec,
                                              useBatchNorm=True,
                                              useAffineTransformInBatchNorm=True)
convForwModel.train_model(teacherModel, train_loader, test_loader, epochs_to_train=200)

As mentioned before, it is often better to use the ModelManager class to be able to automatically save the results and retrieve them later. So we would typically write

import os
import cnn_models.conv_forward_model as convForwModel
import cnn_models.help_fun as cnn_hf
import model_manager
cifar10Manager = model_manager.ModelManager('model_manager_cifar10.tst',
                                            'model_manager', create_new_model_manager=False)  # set this to True the first time
model_name = 'cifar10_teacher'
cifar10modelsFolder = os.path.expanduser('~/quantized_distillation/')  # expand '~' explicitly; os.path.join does not
teacherModelPath = os.path.join(cifar10modelsFolder, model_name)
teacherModel = convForwModel.ConvolForwardNet(**convForwModel.teacherModelSpec,
                                              useBatchNorm=True,
                                              useAffineTransformInBatchNorm=True)
if model_name not in cifar10Manager.saved_models:
    cifar10Manager.add_new_model(model_name, teacherModelPath,
            arguments_creator_function={**convForwModel.teacherModelSpec,
                                        'useBatchNorm':True,
                                        'useAffineTransformInBatchNorm':True})
cifar10Manager.train_model(teacherModel, model_name=model_name,
                           train_function=convForwModel.train_model,
                           arguments_train_function={'epochs_to_train': 200},
                           train_loader=train_loader, test_loader=test_loader)

This is the general structure necessary to use the code. For more examples, please look at one of the main files that are used to run the experiments.

Authors

Antonio Polino
Razvan Pascanu
Dan Alistarh

License

The code is licensed under the MIT License. See the LICENSE.md file for details.

Acknowledgements

We would like to thank Ce Zhang (ETH Zürich), Hantian Zhang (ETH Zürich) and Martin Jaggi (EPFL) for their support with experiments and valuable feedback.

quantized_distillation's Issues

compute_loss problem

Hi, the compute_loss function in Loss.py contains the code below:

loss_distilled = nn.functional.kl_div(scores, prob_teacher,
                                      weight=self.criterion.weight,
                                      size_average=self.criterion.size_average)

but nn.functional.kl_div has no argument named "weight".
How come? Thanks a lot :)

[Errno 2] No such file or directory: 'model_manager_cifar10.tst'

What is 'model_manager_cifar10.tst', and where can I download it?

Why do you have to train Cifar10 two times?

Note: the code can be found in cifar10_test.py, around line 190.
In cifar10_test.py you use the code

if TRAIN_SMALLER_MODEL:
    for _ in range(2):

to train the model twice.
But as far as I can see, the first run already gets higher performance for small model 0, with no quantization or distillation.

smallerModelSpec0 = {'spec_conv_layers': [(75, 5, 5), (50, 5, 5), (50, 5, 5), (25, 5, 5)],
                    'spec_max_pooling': [(1, 2, 2), (3, 2, 2)],
                    'spec_dropout_rates': [(1, 0.2), (3, 0.3), (4, 0.4)],
                    'spec_linear': [500], 'width': 32, 'height': 32}

So why?

A question about the Absmax quantization function

Hi Antspy,

I have a question about the following code in the quantization function:

norm_scaling = getattr(tensor, norm_to_use)(p=2)

Could I check what p=2 means? I thought the code was equivalent to tensor.max(), but I cannot find any documentation for the p argument. Could you please advise?
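
For readers with the same question, here is a small sketch of what the quoted expression does mechanically. It assumes norm_to_use holds the string 'norm' (the issue does not show its value); in that case getattr looks the method up by name and p=2 selects the L2 (Euclidean) norm, which is not the same quantity as tensor.max(). ToyTensor is a hypothetical stand-in written for this example, not a PyTorch class.

```python
class ToyTensor:
    """Hypothetical stand-in for a tensor, used only to show the getattr pattern."""

    def __init__(self, values):
        self.values = list(values)

    def norm(self, p=2):
        # p-norm: (sum_i |x_i|^p)^(1/p); p=2 gives the Euclidean length.
        return sum(abs(v) ** p for v in self.values) ** (1.0 / p)

    def max(self):
        return max(self.values)


tensor = ToyTensor([3.0, -4.0])

norm_to_use = "norm"  # assumed value; the issue does not show it
norm_scaling = getattr(tensor, norm_to_use)(p=2)  # same call as tensor.norm(p=2)

print("L2 norm:", norm_scaling)   # Euclidean length of (3, -4)
print("max    :", tensor.max())   # largest entry, a different quantity
```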

ValueError: The model "cifar10_distilled_spec0" is not present in the list of saved models

Hi!
I ran cifar10_test.py. After the teacher network trains successfully, it gives the error above. I have set TRAIN_DISTILLED_MODEL = True, but this does not fix the issue. I also noticed a commented-out section of code labelled "train normal distilled", which I think may be the cause of the error.

Where is the quantization implemented?

Thank you for your excellent contribution.
I read some of your code, but I found that the format and the number of the model weights remain the same. The code 'quantize_weights_model(model)' maps the model weights to different scales.
Could you please give me some clues as to where the model weights are quantized?
Thank you.
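
One way to read the observation above: simulated quantization typically keeps the same number of weights and the same floating-point storage, but restricts each weight to a small set of allowed values. The sketch below assumes plain uniform quantization over the [min, max] range of a list of weights; it is not the repository's quantize_weights_model, and the function name uniform_quantize is made up for this example.

```python
def uniform_quantize(weights, bits):
    """Snap each weight to one of 2**bits evenly spaced levels in [min, max]."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)  # constant input: nothing to quantize
    intervals = 2 ** bits - 1          # 2**bits levels -> 2**bits - 1 gaps
    step = (hi - lo) / intervals
    # Index of the nearest level, then map the index back to a float value.
    return [lo + round((w - lo) / step) * step for w in weights]


weights = [-0.9, -0.31, 0.02, 0.4, 0.77, 1.1]
quantized = uniform_quantize(weights, bits=2)

print("number of weights :", len(weights), "->", len(quantized))
print("distinct values   :", len(set(weights)), "->", len(set(quantized)))
print("still floats      :", all(isinstance(q, float) for q in quantized))
```

Under these assumptions, a quick check that a tensor has been quantized is to count its distinct values: with b bits there should be at most 2**b of them (per scaling group, if the weights are scaled in groups). The storage saving only appears once those few levels are encoded compactly.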

ModuleNotFoundError: No module named 'spacy.en'

I ran "cifar10_test.py", but I get "ModuleNotFoundError: No module named 'spacy.en'".

How can I calculate the model size

Model "cifar10_smaller_spec0_quantized2bits" ==> Prediction accuracy: 85.280000% == Reported accuracy: 79.710000%
Effective bit Huffman: 1.7171958071948659 - Size reduction: 16.26680978221003

The text above is the output of the main training code.
I believe the effective bit count reflects the distribution of the parameters in the network, which affects how effective the Huffman code is (the smaller the better). Am I right?

And then, using the Huffman code, you can reduce the size of the network, as reported in the output.

However, I do not know how to calculate the original model size.
Could you give me a hand?
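
As a rough illustration of the arithmetic behind such numbers, the sketch below assumes the original model stores every parameter as a 32-bit float, and that the "effective bits" figure is the average Huffman code length per quantized weight. It ignores any extra bookkeeping the repository may account for, so it is not expected to reproduce the log above exactly. The function names and the toy data are made up for this example.

```python
import heapq
from collections import Counter


def size_mb(num_params, bits_per_param=32):
    """Size in megabytes when every parameter takes bits_per_param bits."""
    return num_params * bits_per_param / 8 / 1e6


def huffman_average_bits(symbols):
    """Average Huffman code length (bits per symbol) for a sequence of symbols."""
    if not symbols:
        raise ValueError("need at least one symbol")
    counts = list(Counter(symbols).values())
    if len(counts) == 1:
        return 1.0  # a single distinct symbol still needs one bit per entry
    heapq.heapify(counts)
    total_bits = 0
    # Each merge adds one bit to every symbol below the merged node, so the
    # total encoded length is the sum of all merged weights.
    while len(counts) > 1:
        a = heapq.heappop(counts)
        b = heapq.heappop(counts)
        total_bits += a + b
        heapq.heappush(counts, a + b)
    return total_bits / len(symbols)


# Toy quantized weights with 4 levels (2 bits), unevenly distributed.
levels = [0] * 4 + [1] * 2 + [2] + [3]
effective_bits = huffman_average_bits(levels)   # 14 bits / 8 symbols = 1.75

num_params = 1_000_000                           # hypothetical model size
original_mb = size_mb(num_params)                # 32-bit floats
compressed_mb = size_mb(num_params, effective_bits)

print("effective bits :", effective_bits)
print("original MB    :", original_mb)
print("compressed MB  :", compressed_mb)
print("size reduction :", original_mb / compressed_mb)
```

The more skewed the distribution of quantized values, the shorter the average code, which is why the effective bit count can fall below the nominal bit width.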

What is the setting for ImageNet?

Which file do you run for the ImageNet experiments, analogous to cifar10_test.py, and what batch size is used for training?

Are there any problems with these two lines?

When I tried to run DISTILLED_MODEL in cifar10_wideResNet.py, I found what may be errors in these two lines (in cnn_models/help_fun.py):

index_distillation_loss = torch.arange(0, outputs.size(0))[mask_distillation_loss.view(-1, 1)].long()
inverse_idx_distill_loss = torch.arange(0, outputs.size(0))[1-mask_distillation_loss.view(-1, 1)].long()

The tensor produced by torch.arange(0, outputs.size(0)) has size [outputs.size(0)], but mask_distillation_loss.view(-1, 1) has size [batch_size, 1]. So maybe there are some errors in these two lines.
I would appreciate a reply.

Discrepancy between reported accuracy and predicted accuracy

Hello,

Thank you for sharing your codebase.

So, I used this repo to train Resnet20 (teacher, smaller, etc. are all the same) on cifar10, and I see a discrepancy between reported and predicted accuracy in the last section of your evaluations.

Here are the final results:

Model cifar10_distilled_quantized8bits prediected accuracy tensor(91.8600) reported accuract 91.91

Effective bit Huffman: 7.585104101179689 - Size reduction: 4.084183131042501 - Size MB : 0.26617525000000014

Model cifar10_teacher prediected accuracy tensor(90.5300) reported accuract 90.53
Size MB : 1.08164

Model cifar10_distilled_quantized4bits prediected accuracy tensor(91.4400) reported accuract 88.03

Effective bit Huffman: 3.542860840945231 - Size reduction: 8.436903261661763 - Size MB : 0.129758

Model cifar10_distilled_quantized2bits prediected accuracy tensor(90.6000) reported accuract 12.23

Effective bit Huffman: 1.5554269442698125 - Size reduction: 17.72433944312385 - Size MB : 0.06269675

Model cifar10_distilled_model prediected accuracy tensor(92.2500) reported accuract 92.25
Size MB : 1.08164

PM quantization of model "cifar10_distilled_model" with "2" bits and bucketing 256: 21.850000%

Effective bit Huffman: 1.4938500795088938 - Size reduction: 18.350201302288497 - Size MB: 0.06062125

PM quantization of model "cifar10_distilled_model" with "4" bits and bucketing 256: 89.990005%

Effective bit Huffman: 2.690225953182204 - Size reduction: 10.883517290692039 - Size MB: 0.10095149999999999

PM quantization of model "cifar10_distilled_model" with "8" bits and bucketing 256: 92.169998%

Effective bit Huffman: 6.505658074775342 - Size reduction: 4.7367702222057995 - Size MB: 0.22969087499999996

There's a huge difference, especially in the case of cifar10_distilled_quantized2bits. It looks like there is an issue in the evaluation.
I see that pred_accuracy was hard-coded to 0. Is there any reason for that?

The model is evaluated after the full-precision weights are loaded

In conv_forward_model.py, the model is evaluated after the full-precision weights are loaded. I think you should call quantizeWeights again after all epochs.

How can I verify my model is quantized?

How to load the teacher model?

When saving the teacher model, you set path_to_save_model as self.saved_models[model_name][0].path_saved_model + str(continue_training_from+1)
But when you load the teacher model, you use only self.saved_models[model_name][0].path_saved_model as the path.
As a result, an error occurs.

Do you have time to fix it?

Question about the implementation of the KLDiv loss

As far as I can see in the file help_fun.py, you use SoftmaxFunction to normalize the teacher output but logSoftmaxFunction to normalize the student output.
Why are these two handled differently?
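
A short worked example may help with this question. As far as the PyTorch documentation describes it, kl_div expects its first argument as log-probabilities and its target as probabilities, which matches the log-softmax on the student and softmax on the teacher. The sketch below writes out that convention in plain Python; the function names are made up for this example, and it is not the code in help_fun.py.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits):
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def kl_divergence(student_log_probs, teacher_probs):
    """KL(teacher || student) = sum_i t_i * (log t_i - log s_i)."""
    return sum(
        t * (math.log(t) - log_s)
        for t, log_s in zip(teacher_probs, student_log_probs)
        if t > 0  # terms with t_i = 0 contribute nothing
    )


teacher_logits = [2.0, 1.0, 0.1]
student_logits = [0.1, 1.0, 2.0]

teacher_probs = softmax(teacher_logits)          # probabilities (target)
student_log_probs = log_softmax(student_logits)  # log-probabilities (input)

loss = kl_divergence(student_log_probs, teacher_probs)
same = kl_divergence(log_softmax(teacher_logits), teacher_probs)

print("KL(teacher || student):", loss)
print("KL(teacher || teacher):", same)
```

The formula needs log s_i for the student and t_i for the teacher, so taking the log-softmax of the student directly avoids computing a softmax and then a separate, less stable logarithm.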

typo of README

Hi, I found a typo in the sample code in README.md, as shown below.

- import dataset
- dataset.BASE_DATA_FOLDER = '/home/saved_datasets'
+ import datasets
+ datasets.BASE_DATA_FOLDER = '/home/saved_datasets'

batch_size = 50
cifar10 = datasets.CIFAR10() #-> will be saved in /home/saved_datasets/cifar10
train_loader, test_loader = cifar10.getTrainLoader(batch_size), cifar10.getTestLoader(batch_size)

If you prefer, I can create a PR.
Regards.

question on the file model_manager.py

In the file model_manager.py
if continue_training_from > 0:
    # if continue_training_from == 0 then the model has never been trained before
    # (it has just been added as a new model) so we load only when continue_training_from > 0
    print('Loading weights from train run number {}'.format(continue_training_from))
    state_dict = torch.load(self.saved_models[model_name][continue_training_from].path_saved_model)
    try:
        model.load_state_dict(state_dict)
    except KeyError:
        # this means that the weight were saved with a DataParallel but the current one is not
        # or vice-versa.
        if isinstance(model, torch.nn.parallel.DataParallel):
            state_dict = mhf.convert_state_dict_to_data_parallel(state_dict)
        else:
            state_dict = mhf.convert_state_dict_from_data_parallel(state_dict)
        model.load_state_dict(state_dict)

you load only the state_dict but not the learning rate and other parameters, so this is not really continuing training but training from a better initialization. Am I right, or am I mistaken?
This way of continuing training seems weird to me. Few papers (including yours) explain that a model should be trained again in this way.
Can you explain your code to me?
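
To make the distinction in this question concrete, here is a minimal sketch of a checkpoint that stores more than the weights, so that a later run can resume with the same optimizer settings and epoch counter. It uses plain dictionaries and pickle so it runs on its own; in PyTorch the two state entries would typically come from model.state_dict() and optimizer.state_dict(). The function names and the checkpoint layout are assumptions made for this example, not the repository's format.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, model_state, optimizer_state, epoch):
    """Store weights together with the optimizer state and the epoch reached."""
    checkpoint = {
        "model_state": model_state,
        "optimizer_state": optimizer_state,
        "epoch": epoch,
    }
    with open(path, "wb") as f:
        pickle.dump(checkpoint, f)


def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)


# Placeholder states standing in for real model and optimizer state.
model_state = {"layer1.weight": [0.1, -0.2], "layer1.bias": [0.0]}
optimizer_state = {"lr": 0.001, "momentum": 0.9}

checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pkl")
save_checkpoint(checkpoint_path, model_state, optimizer_state, epoch=120)

restored = load_checkpoint(checkpoint_path)
start_epoch = restored["epoch"] + 1  # resume after the last completed epoch
print("resuming at epoch", start_epoch, "with lr", restored["optimizer_state"]["lr"])
```

Loading only the weights, as in the code quoted in the issue, corresponds to restoring "model_state" alone and starting the optimizer and schedule afresh.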