
Patient Knowledge Distillation for BERT Model Compression

Knowledge distillation for BERT model

Installation

Run the commands below to set up the environment:

conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
pip install -r requirements.txt

Training

Objective Function

$$L = (1 - \alpha) L_{CE} + \alpha L_{DS} + \beta L_{PT},$$

where L_CE is the cross-entropy loss, L_DS is the usual distillation loss, and L_PT is the proposed patient loss. Please see our paper below for more details.
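
For concreteness, here is a minimal PyTorch sketch of this objective. It is an illustration based on the formula and the paper, not the repository's exact code; the layer matching, reductions, and default hyperparameters are placeholders.

import torch.nn.functional as F

# Hedged sketch of L = (1 - alpha) * L_CE + alpha * L_DS + beta * L_PT.
# student_hidden / teacher_hidden are matched lists of [batch, hidden]
# intermediate [CLS] states (the paper normalizes them before the MSE).
def pkd_loss(student_logits, teacher_logits, labels,
             student_hidden, teacher_hidden,
             alpha=0.7, beta=100.0, T=1.0):
    # L_CE: supervised cross-entropy on the gold labels
    loss_ce = F.cross_entropy(student_logits, labels)
    # L_DS: distillation against the teacher's softened logits
    loss_ds = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean',
    ) * (T * T)
    # L_PT: patient loss, normalized MSE over matched intermediate layers
    loss_pt = sum(
        F.mse_loss(F.normalize(hs, dim=-1), F.normalize(ht, dim=-1))
        for hs, ht in zip(student_hidden, teacher_hidden)
    )
    return (1 - alpha) * loss_ce + alpha * loss_ds + beta * loss_pt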

Data Preprocessing

Modify HOME_DATA_FOLDER in envs.py and put all data under it (by default it is ./data); a minimal envs.py sketch follows the list below. RTE data is uploaded for your convenience.

  • The folders under HOME_DATA_FOLDER should be:
    • data_raw: stores the raw data of all tasks, so put the downloaded raw data here
      • MRPC
      • RTE
      • ... (other tasks)
    • data_feat: stores the tokenized data (optional)
      • MRPC
      • RTE
      • ...
  • models
    • pretrained: put the downloaded pretrained model (bert-base-uncased) under this folder
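
A minimal envs.py sketch, assuming the layout above; the variable name HOME_DATA_FOLDER comes from the repository, everything else is a comment:

# envs.py (sketch): point HOME_DATA_FOLDER at your data root.
# Expected layout underneath, per the list above:
#   data_raw/<TASK>/    raw downloads (MRPC, RTE, ...)
#   data_feat/<TASK>/   cached tokenized features (optional)
#   models/pretrained/  the downloaded bert-base-uncased checkpoint
HOME_DATA_FOLDER = './data'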

Predefined Training

Run NLI_KD_training.py to start training; you can set DEBUG = True to run with pre-defined arguments (see the sketch after this list):

  • set argv = get_predefine_argv('glue', 'RTE', 'finetune_teacher') or argv = get_predefine_argv('glue', 'RTE', 'finetune_student') to start normal fine-tuning
  • run run_glue_benchmark.py to get the teacher's predictions for KD or PKD
    • set output_all_layers = True for a patient teacher
    • set output_all_layers = False for a normal teacher
  • set argv = get_predefine_argv('glue', 'RTE', 'kd') to start vanilla KD
  • set argv = get_predefine_argv('glue', 'RTE', 'kd.cls') to start Patient-KD (PKD)
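
A minimal sketch of how the DEBUG switch in NLI_KD_training.py selects one of these predefined argument lists; only get_predefine_argv and the mode strings come from the steps above, while the surrounding variable and parser names are assumptions:

# Sketch of the DEBUG flow in NLI_KD_training.py; exact names may differ.
DEBUG = True
if DEBUG:
    # Uncomment exactly one of the predefined settings:
    argv = get_predefine_argv('glue', 'RTE', 'finetune_teacher')    # teacher fine-tuning
    # argv = get_predefine_argv('glue', 'RTE', 'finetune_student')  # student fine-tuning
    # argv = get_predefine_argv('glue', 'RTE', 'kd')                # vanilla KD
    # argv = get_predefine_argv('glue', 'RTE', 'kd.cls')            # Patient-KD (PKD)
    args = parser.parse_args(argv)  # argparse accepts an explicit argument list
else:
    args = parser.parse_args()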

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Citation

If you find this code useful for your research, please consider citing:

@article{sun2019patient,
title={Patient Knowledge Distillation for BERT Model Compression},
author={Sun, Siqi and Cheng, Yu and Gan, Zhe and Liu, Jingjing},
journal={arXiv preprint arXiv:1908.09355},
year={2019}
}

The paper is available at https://arxiv.org/abs/1908.09355.


Issues

Not able to reproduce results

First, thank you for releasing your code.

I am trying to reproduce the results of your paper. I am running NLI_KD_training.py for MRPC with DEBUG=True.

The setting I am running is argv = get_predefine_argv('glue', 'MRPC', 'finetune_teacher').

After completing the training for 4 epochs, I get the following results:

05/10/2020 19:09:30 - INFO - __main__ -   ***** Eval results *****
05/10/2020 19:09:30 - INFO - __main__ -     acc = 0.27942028985507245
05/10/2020 19:09:30 - INFO - __main__ -     acc_and_f1 = 0.13971014492753622
05/10/2020 19:09:30 - INFO - __main__ -     eval_loss = 3.8775325307139643
05/10/2020 19:09:30 - INFO - __main__ -     f1 = 0.0

Also, the eval_log contains the following:

epoch,acc,loss
1,0.8259803921568627,0.35975449818831223
2,0.8700980392156863,0.3205762528456174
3,0.8774509803921569,0.3944101127294394
4,0.8578431372549019,0.4749428268808585

which suggests that training is probably fine, but something is wrong with the test evaluation.

I have consulted the hyperparameter files provided in results_summary, but I am not sure what might be wrong.

Result is different...

Thank you for your code.
However, I get the result below when I run the code fine-tuning only the teacher (BERT-base):

# run simple fine-tuning *teacher* by uncommenting below cmd
    argv = get_predefine_argv('glue', 'RTE', 'finetune_teacher')

In argument_parser.py,

    elif mode == 'glue':
        argv = [
                '--task_name', task_name,
                '--bert_model', 'bert-base-uncased',
                '--max_seq_length', '128',
                '--train_batch_size', '32',
                '--learning_rate', '2e-5',
                '--num_train_epochs', '4',
                '--eval_batch_size', '32',
                '--log_every_step', '1',
                '--output_dir', os.path.join(HOME_DATA_FOLDER, f'outputs/KD/{task_name}/teacher_12layer'),
                '--do_train', 'True',
                '--do_eval', 'True',
                '--fp16', 'True',
            ]
        if train_type == 'finetune_teacher':
            argv += [
                '--student_hidden_layers', '12',
                '--kd_model', 'kd',
                '--do_eval', 'True',
                '--alpha', '0.0',    # alpha = 0 is equivalent to fine-tuning for KD
            ]

Result

12/16/2019 06:48:04 - INFO - __main__ -   ***** Eval results *****
12/16/2019 06:48:04 - INFO - __main__ -     acc = 0.5983333333333334
12/16/2019 06:48:04 - INFO - __main__ -     eval_loss = 1.6177796708776595

What is the reason?

RuntimeError: Function AddBackward0 returned an invalid gradient at index 1 - expected type torch.cuda.HalfTensor but got torch.cuda.FloatTensor

I found a bug: when we set fp16=False and train RTE with kd.cls, we get this problem. The traceback is:

Traceback (most recent call last):
  File "NLI_KD_training.py", line 288, in <module>
    loss.backward()
  File "/home/vernon/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/vernon/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function AddBackward0 returned an invalid gradient at index 1 - expected type torch.cuda.HalfTensor but got torch.cuda.FloatTensor
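
A generic workaround, not necessarily the intended fix for this repository, is to cast all loss terms to a common dtype before calling backward():

# Hedged sketch: align dtypes before summing loss terms so autograd does not
# mix HalfTensor and FloatTensor gradients. Variable names are illustrative.
loss_dl = loss_dl.to(loss_ce.dtype)  # cast the distillation term to match
loss = (1 - alpha) * loss_ce + alpha * loss_dl
loss.backward()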

Trying to do distillation for regression task

Hi, I am trying to extend your research and compute accuracy for all GLUE tasks, but I am stuck with STS-B. Since it is a regression task, where do you think I should make changes to get the numbers?
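
For reference, one common adaptation for distilling a regression head (an assumption, not something this repository implements) is to replace both the cross-entropy term and the soft-target term with MSE, since the head emits a single score:

import torch.nn.functional as F

# Hedged sketch for a regression task such as STS-B: the supervised term and
# the distillation term both become MSE on the scalar scores.
def regression_kd_loss(student_scores, teacher_scores, labels, alpha=0.5):
    loss_sup = F.mse_loss(student_scores.squeeze(-1), labels.float())
    loss_kd = F.mse_loss(student_scores.squeeze(-1), teacher_scores.squeeze(-1))
    return (1 - alpha) * loss_sup + alpha * loss_kd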

How do I run student predictions?

Hey, I am trying to reproduce your results and am interested in training several students with different numbers of hidden layers. I want to submit the student predictions to the GLUE website. I have been able to train student models with the PKD-skip procedure.

My question is: how do I make predictions from the student model? I guess I should change run_glue_benchmark.py somehow. Any help in this regard will be appreciated.
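
For what it's worth, a generic evaluation loop for collecting student predictions might look like the sketch below; the encoder/classifier split mirrors the encoder.pkl and classifier.pkl checkpoints mentioned elsewhere on this page, but the call signatures and batch layout are assumptions:

import torch

# Hedged sketch: run the trained student over the test set and collect label
# predictions for a GLUE submission. How the encoder and classifier are
# restored (and their exact signatures) must be adapted from the repository.
encoder.eval()
classifier.eval()
all_preds = []
with torch.no_grad():
    for input_ids, segment_ids, attention_mask in test_dataloader:
        cls_state = encoder(input_ids, segment_ids, attention_mask)
        logits = classifier(cls_state)
        all_preds.extend(logits.argmax(dim=-1).tolist())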

question on pretrained/bert_config.json

Thanks for sharing your work! I'm training on a new dataset (a classification task just like the GLUE datasets) with the following steps and wanted to make sure I'm doing it right.

  1. get a BERT fine-tuned on the dataset (pytorch_model.bin) and put it into the pretrained directory

  2. run NLI_KD_training.py with the below to get encoder.pkl and classifier.pkl

# run simple fine-tuning *teacher* by uncommenting below cmd
argv = get_predefine_argv('glue', 'RTE', 'finetune_teacher')
  3. run run_glue_benchmark.py to get the teacher's predictions for KD

  4. run NLI_KD_training.py with the below to distill knowledge from the teacher to the student model

    # run Patient Teacher by uncommenting below cmd
    argv = get_predefine_argv('glue', 'RTE', 'kd.cls')

I'm confused, since step 2 redundantly fine-tunes the teacher (BERT base model), which was already done in step 1. Is it correct to place the fine-tuned version into the pretrained directory, or should I just use the plain pytorch_model.bin?

Why do you set fix_pooler=True for KD.Full?

Hi,

Thank you for your interesting work! I was just wondering why you use the pooler only for KD.Full, and if you do use the pooler, did you initialize it with the BERT teacher's weight and bias?

Thank you,
Sincerely,

Some questions about layer number (model size)

Hi,

Thank you for your interesting work! I have just started to learn BERT and distillation recently. I have some general questions regarding this topic.

  1. I want to compare the performance of BERT at different model sizes (numbers of transformer blocks). Is it necessary to do distillation? If I just train a 6-layer BERT without distillation, will the performance be bad?

  2. Do you have to redo pretraining every time you change the number of layers in BERT? Is it possible to just remove some layers from an existing pre-trained model and fine-tune on tasks?

  3. Why does BERT have 12 blocks, and not 11 or 13, etc.? I couldn't find any explanation.

Thanks,
ZLK

A question

There is a --teacher_prediction argument in the code. Where does it come from? Is it saved while training the teacher model? Why can't I find it?

Reproducing results

Nice paper! Thanks for sharing the code.
I was trying to reproduce your results. It would be great if you could share the best hyperparameters for each GLUE task.
For example with the command: $ python NLI_KD_training.py

For RTE I was able to get the following results:

acc = 0.6216666666666667
eval_loss = 1.3263624415118644

BUT

With the only change being line 34 set to argv = get_predefine_argv('glue', 'MRPC', 'finetune_student'),
I got:

acc = 0.28289855072463765
acc_and_f1 = 0.14144927536231883
eval_loss = 3.820818234373022
f1 = 0.0

Where to download the pretrained weights?

Thanks a lot for your impressive work; I want to reproduce the results from the paper. Now I have a question: where can I get the pretrained model (bert-base-uncased)? Thank you!
