
cofipruning's People

Contributors

caffreyr, danqi, eldarkurtic, ketulib, lectures2code, ther-nullptr, xiamengzhou, zhangzhenyu13


cofipruning's Issues

About comparison with other baselines

Nice work! I have two questions: 1) Why are only GLUE dev set results reported? 2) Some strong baselines are not compared, such as NAS-BERT and BERT-EMD.

Troubles reproducing the results

Hello, thank you for providing the code. But I have a question about how to reproduce the 95% sparsity results on MNLI with the following commands:

TASK=MNLI
SUFFIX=sparsity0.95
EX_CATE=CoFi
PRUNING_TYPE=structured_heads+structured_mlp+hidden+layer
SPARSITY=0.95
DISTILL_LAYER_LOSS_ALPHA=0.9
DISTILL_CE_LOSS_ALPHA=0.1
LAYER_DISTILL_VERSION=4
DISTILLATION_PATH=dynabert/MNLI
CUDA_VISIBLE_DEVICES=1 bash scripts/run_CoFi.sh $TASK $SUFFIX $EX_CATE $PRUNING_TYPE $SPARSITY $DISTILLATION_PATH $DISTILL_LAYER_LOSS_ALPHA $DISTILL_CE_LOSS_ALPHA $LAYER_DISTILL_VERSION

And I get the following results, with accuracy 78.20 on MNLI:

wandb: Run history:
wandb:                   eval/loss ▃▁▂▂▂▃██▆▆▅▅▅▅▆▅▅▄▄▅▄▅▅▅▄▄▄▄▄▄▄▄▄▄▄▄▅▄▄▄
wandb:              train/accuracy ▆█▇██▇▁▁▃▄▄▄▄▅▄▅▅▅▅▅▅▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆▆▆▆
wandb:     train/expected_sparsity ▁▃▄▆████████████████████████████████████
wandb:           train/global_step ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:           train/hidden_dims █████▁▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃
wandb:              train/lag_loss ▆▆▇▆▆█▁▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆
wandb:         train/learning_rate █████▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁▁
wandb:                  train/loss ▂▁▆▂▂▇▃█▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▆▅▆▆▆
wandb: train/pruned_model_sparsity ▁▃▄▆████████████████████████████████████
wandb:         train/pruned_params ▁▃▄▆████████████████████████████████████
wandb:              train/reg_loss ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:      train/remaining_params █▆▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:       train/target_sparsity ▁▂▄▆████████████████████████████████████
wandb:
wandb: Run summary:
wandb:                   eval/loss 0.66644
wandb:              train/accuracy 0.78197
wandb:     train/expected_sparsity 0.94999
wandb:           train/global_step 0
wandb:           train/hidden_dims 764
wandb:              train/lag_loss 1e-05
wandb:         train/learning_rate 0.0
wandb:                  train/loss 0.40625
wandb: train/pruned_model_sparsity 0.95561
wandb:         train/pruned_params 81243440
wandb:              train/reg_loss 0.0
wandb:      train/remaining_params 3774160
wandb:       train/target_sparsity 0.95

By the way, I found some issues while reproducing (see the sketch after this list):

  1. In evaluation.py:77, datasets["validation"] should be datasets["validation_matched"] for MNLI.
  2. Label map: dynabert and princeton-nlp/CoFi-MNLI-s95 use a different label map than MNLI loaded with datasets.load_dataset, so directly evaluating with python evaluation.py MNLI princeton-nlp/CoFi-MNLI-s95 gives a wrong result.
  3. Pruning is not applied to trained models in evaluation.py. For example, the model is not pruned according to zs.pt with python evaluation.py MNLI ./out/MNLI/CoFi/MNLI_sparsity0.95.
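A minimal sketch of the first two points, assuming the checkpoint's config.id2label holds real label names (if it only contains LABEL_0/1/2, the mapping has to be built by hand):

import datasets
from transformers import AutoConfig

raw = datasets.load_dataset("glue", "mnli")
eval_set = raw["validation_matched"]  # MNLI has no plain "validation" split

# Translate the checkpoint's prediction ids into the dataset's label ids before scoring.
label_names = eval_set.features["label"].names  # ['entailment', 'neutral', 'contradiction']
config = AutoConfig.from_pretrained("princeton-nlp/CoFi-MNLI-s95")
id_remap = {i: label_names.index(name.lower()) for i, name in config.id2label.items()}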

problem of loading from_pretrained('princeton-nlp/CoFi-XXX')

I just tested model = CoFiBertForSequenceClassification.from_pretrained("princeton-nlp/CoFi-MNLI-s95")
and received an error due to a dimension mismatch:

File "/root/token_prune/CoFiPruning-pretrain/test.py", line 16, in <module>
model = CoFiBertForSequenceClassification.from_pretrained("princeton-nlp/CoFi-MNLI-s95")
File "/opt/conda/lib/python3.7/site-packages/transformers/modeling_utils.py", line 2493, in from_pretrained
keep_in_fp32_modules=keep_in_fp32_modules,
File "/opt/conda/lib/python3.7/site-packages/transformers/modeling_utils.py", line 2844, in _load_pretrained_model
raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for CoFiBertForSequenceClassification:
size mismatch for bert.embeddings.word_embeddings.weight: copying a param with shape torch.Size([30522, 764]) from checkpoint, the shape in current model is torch.Size([30522, 768]).
size mismatch for bert.embeddings.position_embeddings.weight: copying a param with shape torch.Size([512, 764]) from checkpoint, the shape in current model is torch.Size([512, 768]).

Error in fine-tuning with pruned model -- AttributeError: 'NoneType' object has no attribute 'forward'

Hello @xiamengzhou! When I use your script to fine-tune the pruned model, there is an issue, but I have no idea what causes it. What's wrong with my setup?

[screenshot of the error]

TASK=MRPC
SUFFIX=sparsity0.95
EX_CATE=CoFi
SPARSITY=0.95
DISTILL_LAYER_LOSS_ALPHA=0.9
DISTILL_CE_LOSS_ALPHA=0.1
LAYER_DISTILL_VERSION=4
SPARSITY_EPSILON=0.01
DISTILLATION_PATH=/home/tt6232/KdQuant/teacher-model/bert-base-uncased/

PRUNED_MODEL_PATH=./out/$TASK/$EX_CATE/${TASK}_${SUFFIX}/best
PRUNING_TYPE=None # Setting the pruning type to be None for standard fine-tuning.
LEARNING_RATE=3e-5

bash scripts/run_CoFi.sh $TASK $SUFFIX $EX_CATE $PRUNING_TYPE $SPARSITY $DISTILLATION_PATH $DISTILL_LAYER_LOSS_ALPHA $DISTILL_CE_LOSS_ALPHA $LAYER_DISTILL_VERSION $PRUNED_MODEL_PATH $LEARNING_RATE &

Should I get a fine-tuned teacher model by setting `pruning_type=None`?

Should I get a teacher model by fine-tuning a bert-base-uncased model with pruning_type=None and pretrained_pruned_model=None, and with do_distill and do_layer_distill removed?
Then should I use the fine-tuned model as distillation_path (the teacher model) to get a pruned model?

Or should I fine-tune the model myself to get the teacher model?

In the code, pruning_type=None is used to fine-tune after pruning.

Thanks!

Cannot load the checkpoints

Hi, thanks for the great work. I tried to load your pruned checkpoints with the following commands:

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('princeton-nlp/CoFi-MNLI-s95')

However, I get the following errors:

RuntimeError: Error(s) in loading state_dict for BertForSequenceClassification:
        size mismatch for bert.embeddings.word_embeddings.weight: copying a param with shape torch.Size([30522, 764]) from checkpoint, the shape in current model is torch.Size([30522, 768]).
        size mismatch for bert.embeddings.position_embeddings.weight: copying a param with shape torch.Size([512, 764]) from checkpoint, the shape in current model is torch.Size([512, 768]).
        size mismatch for bert.embeddings.token_type_embeddings.weight: copying a param with shape torch.Size([2, 764]) from checkpoint, the shape in current model is torch.Size([2, 768]).
        size mismatch for bert.embeddings.LayerNorm.weight: copying a param with shape torch.Size([764]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for bert.embeddings.LayerNorm.bias: copying a param with shape torch.Size([764]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for bert.encoder.layer.0.attention.self.query.weight: copying a param with shape torch.Size([64, 764]) from checkpoint, the shape in current model is torch.Size([64, 768]).
        size mismatch for bert.encoder.layer.0.attention.self.key.weight: copying a param with shape torch.Size([64, 764]) from checkpoint, the shape in current model is torch.Size([64, 768]).
        size mismatch for bert.encoder.layer.0.attention.self.value.weight: copying a param with shape torch.Size([64, 764]) from checkpoint, the shape in current model is torch.Size([64, 768]).
        size mismatch for bert.encoder.layer.0.attention.output.dense.weight: copying a param with shape torch.Size([764, 64]) from checkpoint, the shape in current model is torch.Size([768, 64]).
        size mismatch for bert.encoder.layer.0.attention.output.dense.bias: copying a param with shape torch.Size([764]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for bert.encoder.layer.0.attention.output.LayerNorm.weight: copying a param with shape torch.Size([764]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for bert.encoder.layer.0.attention.output.LayerNorm.bias: copying a param with shape torch.Size([764]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for bert.encoder.layer.0.intermediate.dense.weight: copying a param with shape torch.Size([395, 764]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
        size mismatch for bert.encoder.layer.0.intermediate.dense.bias: copying a param with shape torch.Size([395]) from checkpoint, the shape in current model is torch.Size([3072]).
        size mismatch for bert.encoder.layer.0.output.dense.weight: copying a param with shape torch.Size([764, 395]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
        size mismatch for bert.encoder.layer.0.output.dense.bias: copying a param with shape torch.Size([764]) from checkpoint, the shape in current model is torch.Size([768]).

Could you please tell me how to load the pruned checkpoints? By the way, the code snippet in the README does not seem to work either:

from CoFiPruning.models import CoFiBertForSequenceClassification
model = CoFiBertForSequenceClassification.from_pretrained("princeton-nlp/CoFi-MNLI-s95") 
output = model(**inputs)

There is no setup.py in this repo; how can I install this package?
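A minimal workaround sketch, assuming the repository has been cloned locally (the path below is a placeholder, and whether the CoFiPruning.models import resolves depends on the repo layout): put the clone's parent directory on sys.path instead of installing a package.

import sys
sys.path.insert(0, "/path/to/parent_of_clone")  # directory that contains the CoFiPruning folder

from CoFiPruning.models import CoFiBertForSequenceClassification  # import path from the README snippet above

model = CoFiBertForSequenceClassification.from_pretrained("princeton-nlp/CoFi-MNLI-s95")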

potential bug loading a pruned model with no masks

load_pruned_model in the cofi_utils file seems to take a model as its first argument; however, load_model(..) calls load_pruned_model passing a string. In this case the program crashes because the string does not have, for example, a "config" attribute.
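A minimal sketch of one possible guard inside load_pruned_model (its exact signature is not shown here, so treat this as an assumption about where the check would go):

# inside load_pruned_model, before model.config is accessed
if isinstance(model, str):
    # a checkpoint path was passed instead of a model instance; materialize it first
    model = CoFiBertForSequenceClassification.from_pretrained(model)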

Prerequisites and the training process

I appreciate the great work @xiamengzhou, but I'm sorry that I cannot clearly understand the training process.

q1) Could you specify the versions of the packages, e.g. datasets, transformers, etc.?
q2) Can I get the fine-tuned original BERT by running run_FT.sh with only 'proj_dir' specified?

layer-distillation: teacher layer sets selection?

The original paper mentions: "Specifically, let T denote a set of teacher layers that we use to distill knowledge to the student model." The code in the trainer provides [2, 5, 8, 11] only, which is part of the settings in the Appendix.
Any suggestions on how to select such teacher layer sets for distillation?
4 layers at most?
Which 4 layers are proper?
How do we specify task-aware settings?
i.e., there are 12 layers in the student; why do we only select from the given 4 layers? What about 5, 6, or 12 layers for T?
I think this is critical for reproducing the results; right now I can barely reproduce any results that match the reported scores.

Something wrong with run_FT.sh and data_dir

When I use run_FT.sh, only [task_name] and [EX_NAME_SUFFIX] need to be input. I changed model_name_or_path to where bert-base-uncased is.

Firstly, an error appeared:
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--data_dir', './datasets/RTE']
Checking the log, I found that the model finds the datasets in the cache, so I deleted the 'data_dir' argument.

However, during pre-finetuning the dev accuracy is very small.
In the evaluation output file it is only 0.47, and I found the sparsity is 0.666.

Task: rte
Model path: /home/ykw/cofi/out-test/RTE/RTE_test_RTE/
Model size: 28385280
Sparsity: 0.6659999999999999
accuracy: 0.4729
seconds/example: 0.00093

Why did the pre-finetuning process prune the model? It doesn't even take a sparsity number as input. And the accuracy is much smaller than yours (0.70).

Where can I see the details of distillation?

Thank you for your work!
In the process of implementation, I have a small question: where can I find the details of the distillation part of the code? I hope you can reply; it would be very helpful to me.

Student model initialization

Hi, thanks for your great work on this project!

I'm curious why the student model starts from an untuned model rather than from the weights of the teacher. It would seem that reusing them could make training faster. Is that something you've explored?

Typo?

[screenshot of the README]

In the README, should structured_head be structured_heads?

How to prune a model from the very beginning?

Hi @xiamengzhou, thanks for your contribution. In your code, you use Model.from_pretrained to load the model architecture from the files you have already provided. But if I want to prune my own original model, for instance a T5 model, using the method from your paper, which code should I check? Many thanks :)

About the pre-pruning fine-tuning steps

Thank you for your amazing work!

I have some difficulty understanding the pre-pruning fine-tuning steps in the code. I found that in these steps only the layer and prediction distillation losses are calculated, but it seems that the teacher and student models are both bert-base models. Does this mean the distillation is between two identical models? If so, why should we do that?

Fatal Logic Error found in trainer.py

in the file: https://github.com/princeton-nlp/CoFiPruning/blob/main/trainer/trainer.py
line 279 specifies the following statement:

 if self.start_prune:
    zs = self.l0_module.forward(training=True)
    self.fill_inputs_with_zs(zs, inputs)

Only when this runs do we get gradients for the params in self.l0_optimizer, and this happens only when the condition below is satisfied (line 268):

  if self.prepruning_finetune_steps > 0 and self.global_step == self.prepruning_finetune_steps:
      self.start_prune = True

However, line 301 directly updates the params without checking whether the grads are ready:

  if self.l0_module is not None and self.l0_optimizer is not None:
      self.l0_optimizer.step()
      self.lagrangian_optimizer.step()

Therefore, AdamW raises an error: beta1/beta2 are referenced before being defined in its step method. Since the grads of these params are all None, the AdamW implementation skips defining the hyper-params via the self.group dict.
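A minimal sketch of one possible guard, gating the extra optimizer steps on self.start_prune as described above (attribute names are taken from the report):

# only step the pruning optimizers once pruning has actually started,
# so that their params are guaranteed to have gradients
if self.start_prune:
    if self.l0_optimizer is not None:
        self.l0_optimizer.step()
    if self.lagrangian_optimizer is not None:
        self.lagrangian_optimizer.step()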

Why prepruning distillation?

Hi, I have a question about the intuition behind the pre-pruning distillation step. Why are you not initializing the student model from the teacher weights, instead of initializing it from scratch (i.e., from the MLM-pretrained BERT checkpoint)?

Loading a finetuned model starts from scratch?

Hello,

Say I fine-tune a model with your script without any pruning. Running the evaluation.py script then gives accuracy matching the final accuracy of the fine-tuned model during training (so far so good).

However, when I attempt to train a new pruned model starting from that fine-tuned model, the accuracy of the first evaluation in the trainer.py script is much lower, e.g. 32%. Why is that the case? Shouldn't the initial evaluation in trainer.py match the evaluation of evaluation.py?

EDIT: I think I have figured out what's happening: the first evaluation corresponds to a BERT base model with untrained classification heads for the task of choice. Can you verify this?

What's the model I should prepare and the training process?

Hi, thanks for the great work @xiamengzhou, but I'm sorry that I'm not clear about the training process and the models I should prepare beforehand. Here's my understanding:

If I need to prune a BERT on MNLI and then test it, there are three stages: train (prune), fine-tune (pruning_type=None), evaluate.

Firstly, I need to download an original bert-base-uncased and apply the pruning_type=None fine-tuning to the original BERT?
But in the fine-tuning bash script there is a distillation_path:
bash scripts/run_CoFi.sh $TASK $SUFFIX $EX_CATE $PRUNING_TYPE $SPARSITY [DISTILLATION_PATH] $DISTILL_LAYER_LOSS_ALPHA $DISTILL_CE_LOSS_ALPHA $LAYER_DISTILL_VERSION $SPARSITY_EPSILON [PRUNED_MODEL_PATH] $LEARNING_RATE
Q1: If I use the pruning_type=None fine-tuning from the README, what should I put in distillation_path initially?

Then I get an MNLI-fine-tuned BERT; I will prune it and also regard it as the teacher model.
Q2: Is the MNLI-fine-tuned BERT the model to be pruned, or does the pruning process import a bert-base as the model to be pruned?

In the train and fine-tune stages,
the argument 'distillation_path' is the path of the MNLI-fine-tuned BERT,
and 'pretrained_pruned_model' is the path of the pruned MNLI BERT.

I don't really understand which original model I should use.
How can I get your fine-tuned BERT, or fine-tune the original BERT the way you did?

Detailed experiment results on RoBERTa?

Hello! I appreciate your great work. In the appendix of your paper, I saw figures illustrating the experimental results of applying CoFi to RoBERTa, but I cannot reproduce them based on this repository. Could you please provide more detailed experimental results? Thanks!

Generating predictions with CoFi models

Hi,
First of all, thanks a lot for open-sourcing your code and models!

I've been trying to use your code to generate predictions with CoFi models (with --do_predict, for example on the test split of GLUE tasks), but unfortunately the prediction loop always fails with a CUDA OOM exception (even on an 80GB A100 GPU). Could you also please try it and let me know if I did something wrong?

Why use 3 optimizers during training?

Hi! I want to ask why we should use 3 optimizers during training. I think self.optimizer.zero_grad() alone is enough.

self.optimizer.zero_grad()
if self.l0_optimizer is not None:
     self.l0_optimizer.zero_grad()
if self.lagrangian_optimizer is not None:
     self.lagrangian_optimizer.zero_grad()

Experimental results

Hello @xiamengzhou! The result on the SQuAD task is 79.74, which is quite different from the result (82.6) in the paper. Could you share the detailed parameters? The teacher model's F1 is 88.43. I would be very grateful!

Removing the already-pruned parts in the model may cause some changes in the outputs

Hi! I am trying to apply CoFi pruning to my own model, and I noticed that there might exist some edge cases where removing the already-pruned parts of my model causes some changes in the outputs. I think this happens when all the dims of the intermediate layer are removed.

I found that when intermediate_zs are all zero, intermediate.dense in the pruned model is set to None:

if len(keep_dims[layer]) == 0:
    bert.encoder.layer[layer].intermediate.dense = None
    bert.encoder.layer[layer].output.dense = None

and the FFN parts will then be skipped:

if self.intermediate.dense is None:
    layer_output = attention_output

But before pruning, intermediate.dense is not None, and these zero outputs still pass through CoFiBertOutput.dense, which adds a bias to the output:

hidden_states = self.dense(hidden_states)
if mlp_z is not None:
    hidden_states *= mlp_z
if not inference and hidden_states.sum().eq(0).item():
    return hidden_states + input_tensor

so the FFN parts are not skipped.

Should I change some part of my code to skip the FFN parts when intermediate_zs are all zero during training?
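A minimal sketch of the skip the question proposes, extending the check quoted above so the FFN sub-layer is also bypassed during training when the intermediate mask is entirely zero (the name intermediate_z is an assumption based on the intermediate_zs mentioned above):

if self.intermediate.dense is None or (
        intermediate_z is not None and intermediate_z.sum().eq(0).item()):
    # every intermediate dim is masked out: skip the FFN so its output bias
    # cannot leak into the residual stream
    layer_output = attention_output
else:
    ...  # run the FFN sub-layer as before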

Device incompatibility?

Hello,

In the following line:

indexes < last_aligned_layer) & existing_layers]

The existing_layers tensor is on the CPU and the result of indexes < last_aligned_layer is on the GPU. This throws an error as a result.

Is this a bug? Maybe existing_layers should first be moved to the GPU?
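A minimal sketch of that fix, assuming existing_layers, indexes, and last_aligned_layer are the torch tensors referenced in the excerpt above:

# move existing_layers onto the same device as indexes before combining the masks
existing_layers = existing_layers.to(indexes.device)
mask = (indexes < last_aligned_layer) & existing_layers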

Obtaining models having different target sparsity using single trained model

Hello.
First of all, thanks for your great work.
I have the following question:
If we train a model with, for example, ~90% target sparsity, is it possible to get variations of the already trained model with lower sparsity, like 75%, 50%, etc., or is the only way to obtain a different target sparsity to retrain the model with the needed sparsity?
Thanks in advance.

Performance reproduction

Hi,
I got a pretrained BERT by modifying script/run_FT.sh, which results in a decent 84.3% accuracy on MNLI.
Using this pretrained model as the teacher, I ran run.sh, where the only change is the path to the teacher model. The result is above 85% at 95% sparsity. Does this result make sense, or did I make some mistakes?

License of this repo

Hi, great work and thanks for sharing the code and models!

I'm trying to run this code myself. Could you please point me to the license of this repo? Thanks.

(expected_sparsity - target_sparsity) or (expected_sparsity - target_sparsity).abs()

Hi, we've recently been experimenting with compressing models based on CoFi, and we've found that on small datasets, using the Lagrangian term from the paper causes the model to converge to a size smaller than the target sparsity. However, taking the absolute value of (expected_sparsity - target_sparsity) in the Lagrangian term seems to ameliorate the problem. Do you think (expected_sparsity - target_sparsity).abs() would be a better choice for calculating the Lagrangian term?
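A minimal sketch of the two variants being compared, written against the Lagrangian term from the paper, L_c = λ1·(ŝ − t) + λ2·(ŝ − t)²; the variable names are illustrative and diff is assumed to be a tensor:

diff = expected_sparsity - target_sparsity

# term as in the paper: the linear part is signed
lagrangian_loss = lambda_1 * diff + lambda_2 * diff ** 2

# proposed variant: penalize deviation from the target in either direction
lagrangian_loss_abs = lambda_1 * diff.abs() + lambda_2 * diff ** 2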

Discrepancy between my evaluation results and README for MNLI in evaluation.py

Hi, I'm running evaluation.py on MNLI as described in the README, but I'm getting different results compared to what's displayed there. I'm using Google Colab for this, and you can find my notebook here: https://colab.research.google.com/drive/1UahAOTIwALfEC_DXE11mVOp5iSgwHoYH?usp=sharing

When I run evaluation.py, it shows the following results:
Task: mnli
Model path: ../CoFi-MNLI-s95
Model size: 4330279
Sparsity: 0.949
Accuracy: 0.091
Seconds/example: 0.000561

However, in the README file, the results for the same evaluation are different:
Task: MNLI
Model path: princeton-nlp/CoFi-MNLI-s95
Model size: 4920106
Sparsity: 0.943
mnli/acc: 0.8055
Seconds/example: 0.010151

I need help figuring out why there's a discrepancy between my results and what's described in the README. I've tried to follow the instructions in the README as closely as possible, but I may have missed something. Thank you for any assistance you can provide.

More numbers on other sparsities

CoFi is a great work that may benefit research in related areas.

However, I found that the task performance numbers at other sparsities are not available. Could you please provide these numbers in detail?

Besides, metrics other than accuracy scores on GLUE would also be appreciated.

An issue when reproducing the efficiency evaluation

Hi @xiamengzhou.
When reproducing the efficiency evaluation of the [CoFi-MNLI-s95] model on a single NVIDIA A100 graphics card, the model's speed is 8.8e-05 seconds/example, whereas the vanilla fine-tuned BERT's speed is 4.6e-04 seconds/example, meaning the speedup is only about 5.2× instead of 12.1×.
Could the decrease in speedup come from differences in hardware? Are there any other possible reasons that may cause the difference in the efficiency test? Many thanks!

The output for CoFi-MNLI-s95 testing: [screenshot]
The output for fine-tuned BERT testing: [screenshot]

Too low accuracy result compared with the expected result

Hi, thanks for your work.
I'm trying to test out the results of your work but found some difficulties reproducing similar accuracy numbers.

Below is the environment that I created:

channels:
  - default
dependencies:
  - python=3.9.7
  - pip
  - pip:
    - transformers==4.17.0
    - scipy==1.7.3
    - datasets==2.00.0
    - scikit-learn==1.0.2
    - torch==1.10.2
    - black
    - wandb
    - matplotlib

I used datasets==2.00.0 because installing datasets==1.14.0 results in the following conflict:
The conflict is caused by:
transformers 4.17.0 depends on huggingface-hub<1.0 and >=0.1.0
datasets 1.14.0 depends on huggingface-hub<0.1.0 and >=0.0.19

With datasets 2.00.0 I am able to run evaluation.py MNLI ../CoFi-MNLI-s95, but the results seem wrong.
What can I do to solve this problem? Thanks a lot!

../CoFi-MNLI-s95 is what is downloaded from https://huggingface.co/princeton-nlp/CoFi-MNLI-s95
Results I obtained:
Task: mnli
Model path: ../CoFi-MNLI-s95
Model size: 4330279
Sparsity: 0.949
accuracy: 0.091
seconds/example: 0.000531

Too low accuracy compared to the expected result:
Task: MNLI
Model path: princeton-nlp/CoFi-MNLI-s95
Model size: 4920106
Sparsity: 0.943
mnli/acc: 0.8055
seconds/example: 0.010151

Training error on QNLI

Great job! However, when I train to the 3rd epoch on the QNLI task, I encounter the following problem; the CoLA and SQuAD tasks do not have this problem. Do you have any suggestions? I would be very grateful!

[screenshot of the error]
The error may appear in the following code block
in the file https://github.com/princeton-nlp/CoFiPruning/blob/main/trainer/trainer.py, lines 680-685:

lagrangian_loss = None
if self.start_prune:
    lagrangian_loss, _, _ = \
        self.l0_module.lagrangian_regularization(
            self.global_step - self.prepruning_finetune_steps)
    loss += lagrangian_loss

The usage of L_c

[screenshot of the L_c term from the paper]
I do not understand how this loss works -- since $\lambda_1$ and $\lambda_2$ are 0 by default, I find that the loss is sometimes negative.
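For reference, the Lagrangian term in the paper has the form below, where ŝ is the expected sparsity and t the target sparsity. λ1 and λ2 start at 0 but are trainable and updated by the separate lagrangian optimizer (typically adversarially, to maximize the term), so once they move away from 0 the signed linear part can make the loss negative.

\mathcal{L}_c = \lambda_1 \cdot (\hat{s} - t) + \lambda_2 \cdot (\hat{s} - t)^2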

About the upper layer in your paper

Hi @xiamengzhou, many thanks for your contribution. I have a small question about your paper, in which you say:

CoFi tends to prune submodules more from upper layers than lower layers.

What does "upper layer" mean? Is it near the input or the output? Many thanks!

Bug or intent?

Hi! I have a question: why are you checking whether the MHA layer is not pruned in this line, unlike here, where you check for the FFN layer? Since distillation happens at the outputs of FFN layers, shouldn't the check be for the presence of the FFN instead of the MHA? Is this intentional or potentially a bug?

A Few issues with reproducing the code

Hello,

I am trying to run your codebase. I am having some issues however:

  1. The set of Python requirements cannot be installed due to incompatibilities. Are these requirements strict or can they be relaxed?

  2. After relaxing the above, evaluation runs fine, but training requires a --distillation_path. Could you provide an example of how to use this argument?

  3. To overcome 2, I set the variable additional_args.do_distill to False. This results in one epoch being trained but a crash at the end. The model loss successfully decreases, but reg loss and lag loss are 0.
    The error at the end is an assertion failure: assert "head" in self.types in the l0 module.

Could you help me or provide pointers for resolving the above?

Thank you

Error occurs when training is about to end

/opt/conda/conda-bld/pytorch_1634272068694/work/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [0,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1634272068694/work/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [1,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1634272068694/work/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [2,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1634272068694/work/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [3,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
Traceback (most recent call last):
  File "./run_glue_prune.py", line 394, in <module>
    main()
  File "./run_glue_prune.py", line 385, in main
    trainer.train()
  File "/bit_share//LLM/Fitune_LLM/model_pruning/CoFiPruning/trainer/trainer.py", line 285, in train
    loss_terms = self.training_step(model, inputs)
  File "/bit_share/zhangxiaolei/LLM/Fitune_LLM/model_pruning/CoFiPruning/trainer/trainer.py", line 704, in training_step
    loss.backward()
  File "/data03//anaconda3/envs/llmprune/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/data03//anaconda3/envs/llmprune/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: device-side assert triggered

How to get the loss of `lagrangian_regularization`

Hi! In your code you calculate L_c:
[screenshot]

https://github.com/princeton-nlp/CoFiPruning/blob/main/trainer/trainer.py#L682

And you use expected_size to calculate expected_sparsity, but does it match the equation in your paper?
[screenshot]

https://github.com/princeton-nlp/CoFiPruning/blob/main/models/l0_module.py#L267

Actually, you say that ŝ is the expected model sparsity calculated from z, but lagrangian_regularization() does not take z as an input.
Many thanks!

Maybe confusing description of the distillation constraint

Hi, I just noticed a confusing description of the distillation constraint. Intuitively, I (and probably many other readers) would imagine the distillation going from bottom to top, i.e., from layer 1 to layer 12. To handle layer mismatching, one would then expect higher student layers to be matched with higher teacher layers, so it is odd to see the constraint stated as "lower than the previous matched layer".

[screenshot of the paper's description]

After reading the code (trainer.py line 601), I see that the distillation is top-down, so the constraint "lower than the previous matched layer" makes sense, but I think the distillation direction needs to be clarified.

for search_index in range(3, -1, -1):

Questions about some code

Hi, thanks for the great work! I have some questions about the current code.


First, is the following line expected? https://github.com/princeton-nlp/CoFiPruning/blob/main/trainer/trainer.py#L667

zs = {key: inputs[key] for key in inputs if "_z" in inputs}

Should it be zs = {key: inputs[key] for key in inputs if "_z" in key} in order to extract zs from inputs?


Second, what is the last term self.hidden_size * 4 in the following line when calculating the params of an FFN layer? https://github.com/princeton-nlp/CoFiPruning/blob/main/models/l0_module.py#L44

self.params_per_mlp_layer = self.hidden_size * self.intermediate_size * 2 + self.hidden_size + self.hidden_size * 4

I guess it means the bias parameter of the intermediate dense layer, so it is equivalent to self.intermediate_size?
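A quick arithmetic check of that guess with BERT-base sizes, where intermediate_size = 4 × hidden_size, so the last term does equal self.intermediate_size:

hidden_size = 768
intermediate_size = 3072  # 4 * hidden_size in BERT-base

# up/down projection weights + output bias + intermediate bias (per the guess above)
params_per_mlp_layer = hidden_size * intermediate_size * 2 + hidden_size + hidden_size * 4
print(params_per_mlp_layer)  # 4722432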


Third, when initializing the loga params in l0_module, the structured_mlp uses a different mean compared with other components, as shown in the following line: https://github.com/princeton-nlp/CoFiPruning/blob/main/models/l0_module.py#L147

It seems the intermediate dimension has an initial sparsity of 0.5, even before any pruning. What is the intuition of setting it this way?

Thank you very much for your time!

About the diag() and distillation in your paper

Hi @xiamengzhou, many thanks for your contribution. I have some small questions about your paper. In the paper you say that

FFN pruning introduces a z_int

and there is an equation using diag. What is diag? Why do we have to put z_int into a diagonal matrix? Is diag(z_int) of size d_f × d_f?

[screenshot of the FFN equation from the paper]
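For reference, a sketch of the masked FFN consistent with the equation being asked about: z_int is a vector of length d_f, so diag(z_int) is indeed a d_f × d_f diagonal matrix that scales (or zeroes out) each intermediate dimension.

\mathrm{FFN}(X) = \mathrm{gelu}(X W_U) \cdot \mathrm{diag}(z_{\mathrm{int}}) \cdot W_D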

You also say:

Coarse-grained and fine-grained units (§3.1) with a layerwise distillation objective transferring knowledge from unpruned to pruned models (§3.2)

However, distilling intermediate layers during the pruning process is challenging as the model structure changes throughout training. (previous methods)

So are we pruning a student model during distillation?

Many thanks!!
