princeton-nlp / CoFiPruning
[ACL 2022] Structured Pruning Learns Compact and Accurate Models https://arxiv.org/abs/2204.00408
License: MIT License
Nice work! I have two questions: 1) Why are only the GLUE dev set results reported? 2) Some strong baselines, such as NasBERT and BERT-EMD, are not compared.
Hello, thank you for providing the code. I have a question about how to reproduce the results at 95% sparsity on MNLI. I used the following commands:
TASK=MNLI
SUFFIX=sparsity0.95
EX_CATE=CoFi
PRUNING_TYPE=structured_heads+structured_mlp+hidden+layer
SPARSITY=0.95
DISTILL_LAYER_LOSS_ALPHA=0.9
DISTILL_CE_LOSS_ALPHA=0.1
LAYER_DISTILL_VERSION=4
DISTILLATION_PATH=dynabert/MNLI
CUDA_VISIBLE_DEVICES=1 bash scripts/run_CoFi.sh $TASK $SUFFIX $EX_CATE $PRUNING_TYPE $SPARSITY $DISTILLATION_PATH $DISTILL_LAYER_LOSS_ALPHA $DISTILL_CE_LOSS_ALPHA $LAYER_DISTILL_VERSION
I get the following results, with an accuracy of 78.20 on MNLI:
wandb: Run history:
wandb: eval/loss ▃▁▂▂▂▃██▆▆▅▅▅▅▆▅▅▄▄▅▄▅▅▅▄▄▄▄▄▄▄▄▄▄▄▄▅▄▄▄
wandb: train/accuracy ▆█▇██▇▁▁▃▄▄▄▄▅▄▅▅▅▅▅▅▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆▆▆▆
wandb: train/expected_sparsity ▁▃▄▆████████████████████████████████████
wandb: train/global_step ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: train/hidden_dims █████▁▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃
wandb: train/lag_loss ▆▆▇▆▆█▁▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆
wandb: train/learning_rate █████▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁▁
wandb: train/loss ▂▁▆▂▂▇▃█▇▇▇▇▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▅▆▅▆▆▆
wandb: train/pruned_model_sparsity ▁▃▄▆████████████████████████████████████
wandb: train/pruned_params ▁▃▄▆████████████████████████████████████
wandb: train/reg_loss ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: train/remaining_params █▆▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: train/target_sparsity ▁▂▄▆████████████████████████████████████
wandb:
wandb: Run summary:
wandb: eval/loss 0.66644
wandb: train/accuracy 0.78197
wandb: train/expected_sparsity 0.94999
wandb: train/global_step 0
wandb: train/hidden_dims 764
wandb: train/lag_loss 1e-05
wandb: train/learning_rate 0.0
wandb: train/loss 0.40625
wandb: train/pruned_model_sparsity 0.95561
wandb: train/pruned_params 81243440
wandb: train/reg_loss 0.0
wandb: train/remaining_params 3774160
wandb: train/target_sparsity 0.95
By the way, I found some issues while reproducing:
1) In evaluation.py:77, datasets["validation"] should be datasets["validation_matched"] for MNLI.
2) dynabert and princeton-nlp/CoFi-MNLI-s95 use a different label map from MNLI in datasets.load_dataset, so directly evaluating with python evaluation.py MNLI princeton-nlp/CoFi-MNLI-s95 gives a wrong result.
3) evaluation.py: for example, the model is not pruned according to zs.pt with python evaluation.py MNLI ./out/MNLI/CoFi/MNLI_sparsity0.95.
I just tested model = CoFiBertForSequenceClassification.from_pretrained("princeton-nlp/CoFi-MNLI-s95") and received an error due to a dimension mismatch:
  File "/root/token_prune/CoFiPruning-pretrain/test.py", line 16, in <module>
    model = CoFiBertForSequenceClassification.from_pretrained("princeton-nlp/CoFi-MNLI-s95")
  File "/opt/conda/lib/python3.7/site-packages/transformers/modeling_utils.py", line 2493, in from_pretrained
    keep_in_fp32_modules=keep_in_fp32_modules,
  File "/opt/conda/lib/python3.7/site-packages/transformers/modeling_utils.py", line 2844, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for CoFiBertForSequenceClassification:
size mismatch for bert.embeddings.word_embeddings.weight: copying a param with shape torch.Size([30522, 764]) from checkpoint, the shape in current model is torch.Size([30522, 768]).
size mismatch for bert.embeddings.position_embeddings.weight: copying a param with shape torch.Size([512, 764]) from checkpoint, the shape in current model is torch.Size([512, 768]).
Hello, @xiamengzhou! When I use your script to fine-tune the pruned model, I run into an issue, and I have no idea what causes it. What's wrong with my code?
TASK=MRPC
SUFFIX=sparsity0.95
EX_CATE=CoFi
SPARSITY=0.95
DISTILL_LAYER_LOSS_ALPHA=0.9
DISTILL_CE_LOSS_ALPHA=0.1
LAYER_DISTILL_VERSION=4
SPARSITY_EPSILON=0.01
DISTILLATION_PATH=/home/tt6232/KdQuant/teacher-model/bert-base-uncased/
PRUNED_MODEL_PATH=./out/$TASK/$EX_CATE/${TASK}_${SUFFIX}/best
PRUNING_TYPE=None # Setting the pruning type to be None for standard fine-tuning.
LEARNING_RATE=3e-5
bash scripts/run_CoFi.sh $TASK $SUFFIX $EX_CATE $PRUNING_TYPE $SPARSITY $DISTILLATION_PATH $DISTILL_LAYER_LOSS_ALPHA $DISTILL_CE_LOSS_ALPHA $LAYER_DISTILL_VERSION $PRUNED_MODEL_PATH $LEARNING_RATE &
Should I get a teacher model by fine-tuning a bert-base-uncased model, setting pruning_type=None and pretrained_pruned_model=None and removing do_distill and do_layer_distill?
Then should I use the fine-tuned model as distillation_path (the teacher model) to get a pruned model?
Or should I fine-tune the model myself to get the teacher model?
In the code, pruning_type=None is used to fine-tune after pruning.
Thanks!
Hi, thanks for the great work. I tried to load your pruned checkpoints with the following commands:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('princeton-nlp/CoFi-MNLI-s95')
However, I get the following errors:
RuntimeError: Error(s) in loading state_dict for BertForSequenceClassification:
size mismatch for bert.embeddings.word_embeddings.weight: copying a param with shape torch.Size([30522, 764]) from checkpoint, the shape in current model is torch.Size([30522, 768]).
size mismatch for bert.embeddings.position_embeddings.weight: copying a param with shape torch.Size([512, 764]) from checkpoint, the shape in current model is torch.Size([512, 768]).
size mismatch for bert.embeddings.token_type_embeddings.weight: copying a param with shape torch.Size([2, 764]) from checkpoint, the shape in current model is torch.Size([2, 768]).
size mismatch for bert.embeddings.LayerNorm.weight: copying a param with shape torch.Size([764]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for bert.embeddings.LayerNorm.bias: copying a param with shape torch.Size([764]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for bert.encoder.layer.0.attention.self.query.weight: copying a param with shape torch.Size([64, 764]) from checkpoint, the shape in current model is torch.Size([64, 768]).
size mismatch for bert.encoder.layer.0.attention.self.key.weight: copying a param with shape torch.Size([64, 764]) from checkpoint, the shape in current model is torch.Size([64, 768]).
size mismatch for bert.encoder.layer.0.attention.self.value.weight: copying a param with shape torch.Size([64, 764]) from checkpoint, the shape in current model is torch.Size([64, 768]).
size mismatch for bert.encoder.layer.0.attention.output.dense.weight: copying a param with shape torch.Size([764, 64]) from checkpoint, the shape in current model is torch.Size([768, 64]).
size mismatch for bert.encoder.layer.0.attention.output.dense.bias: copying a param with shape torch.Size([764]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for bert.encoder.layer.0.attention.output.LayerNorm.weight: copying a param with shape torch.Size([764]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for bert.encoder.layer.0.attention.output.LayerNorm.bias: copying a param with shape torch.Size([764]) from checkpoint, the shape in current model is torch.Size([768]).
size mismatch for bert.encoder.layer.0.intermediate.dense.weight: copying a param with shape torch.Size([395, 764]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for bert.encoder.layer.0.intermediate.dense.bias: copying a param with shape torch.Size([395]) from checkpoint, the shape in current model is torch.Size([3072]).
size mismatch for bert.encoder.layer.0.output.dense.weight: copying a param with shape torch.Size([764, 395]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
size mismatch for bert.encoder.layer.0.output.dense.bias: copying a param with shape torch.Size([764]) from checkpoint, the shape in current model is torch.Size([768]).
Could you please tell me how to load the pruned checkpoints? By the way, the command in the README does not seem to work either:
from CoFiPruning.models import CoFiBertForSequenceClassification
model = CoFiBertForSequenceClassification.from_pretrained("princeton-nlp/CoFi-MNLI-s95")
output = model(**inputs)
There is no setup.py in this repo, so how can I install this package?
load_pruned_model in the cofi_utils file seems to take a model as its first argument; however, load_model(..) calls load_pruned_model with a string. In this case the program crashes because the string does not have, for example, a "config" property.
I appreciate the great work @xiamengzhou, but sorry that I cannot clearly understand the training process.
q1) could you specify the versions of packages, e.g. datasets, transformers, etc.?
q2) can I get the fine-tuned original BERT by running run_FT.sh with the specification of 'proj_dir' only?
The original paper says: "Specifically, let T denote a set of teacher layers that we use to distill knowledge to the student model." The code in the trainer provides only [2, 5, 8, 11], which is part of the settings in the Appendix.
Do you have any suggestions for selecting such teacher layer sets for distillation?
Are 4 layers the maximum?
Which 4 layers are appropriate?
How do we specify task-aware settings?
That is, the student has 12 layers, so why do we select only from the given 4 layers? What about using 5, 6, or 12 layers for T?
I think this is critical for reproducing the results; so far I can barely reproduce any results that match the reported scores.
When I use run_FT.sh, only [task_name] and [EX_NAME_SUFFIX] need to be provided. I changed model_name_or_path to the location of bert-base-uncased.
Firstly, an error appeared:
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--data_dir', './datasets/RTE']
Checking the log, I found that the model looks for datasets in the cache, so I deleted the 'data_dir' argument.
However, during pre-finetuning, the accuracy on the dev set is very low.
In the evaluation output file it is only 0.47, and I found that the sparsity is 0.666.
Task: rte
Model path: /home/ykw/cofi/out-test/RTE/RTE_test_RTE/
Model size: 28385280
Sparsity: 0.6659999999999999
accuracy: 0.4729
seconds/example: 0.00093
Why did the pre-finetuning process prune the model? It does not even take a sparsity value as input. And the accuracy is much lower than yours (0.70).
Thank you for your work!
While implementing this, I have a small question: where can I find the details of the distillation part of the code? I hope you can reply; it would be very helpful to me.
Hi, thanks for your great work on this project!
I'm curious why the student model starts from an untuned model rather than from the weights of the teacher? It would seem that reusing it could make the training faster. Is that something you've explored?
Hi @xiamengzhou, thanks for your contribution. In your code, you use Model.from_pretrained to load the model architecture and the files you have already provided. But if I want to prune my own original model, for instance a T5 model, using the method in your paper, which code should I check? Many thanks :)
Thank you for your amazing work!
I have some difficulty understanding the pre-pruning fine-tune steps in the code. I found that in pre-pruning fine-tune steps, only layer and prediction distillation losses are calculated, but it seems that the teacher and student models are both bert-base models. Does this mean that the distillation is between two same models? If so, why should we do that?
Hi,
Does CoFiPruning work on Encoder-Decoder Architectures for Seq2seq tasks such as translation?
Thanks!
In the file https://github.com/princeton-nlp/CoFiPruning/blob/main/trainer/trainer.py, line 279 has the following statement:
if self.start_prune:
    zs = self.l0_module.forward(training=True)
    self.fill_inputs_with_zs(zs, inputs)
Only when this runs do we get gradients for the params in self.l0_optimizer.
start_prune is set only when the condition below is satisfied (line 268):
if self.prepruning_finetune_steps > 0 and self.global_step == self.prepruning_finetune_steps:
    self.start_prune = True
However, line 301 directly updates the params without checking whether the grads are ready:
if self.l0_module is not None and self.l0_optimizer is not None:
    self.l0_optimizer.step()
    self.lagrangian_optimizer.step()
Therefore, AdamW fails in its step method with beta1/beta2 referenced before they are defined.
Because the grads of the params are all None, the AdamW implementation skips defining the hyperparameters from the self.group dict.
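One possible way to guard against this is sketched below, under stated assumptions: it only relies on the param_groups layout that torch.optim optimizers expose, and the helper names has_any_grad and step_if_ready, as well as the FakeOptimizer stand-in, are hypothetical and not part of the repository. It is not the maintainers' fix, only an illustration of checking for gradients before calling step().

```python
from types import SimpleNamespace


def has_any_grad(optimizer):
    """Return True if at least one parameter owned by the optimizer has a gradient."""
    return any(
        p.grad is not None
        for group in optimizer.param_groups
        for p in group["params"]
    )


def step_if_ready(optimizer):
    """Call optimizer.step() only when some gradient exists; report whether it ran."""
    if optimizer is None or not has_any_grad(optimizer):
        return False
    optimizer.step()
    return True


class FakeOptimizer:
    """Hypothetical stand-in that mimics the param_groups layout of torch.optim optimizers."""

    def __init__(self, grads):
        params = [SimpleNamespace(grad=g) for g in grads]
        self.param_groups = [{"params": params}]
        self.steps = 0

    def step(self):
        self.steps += 1


before_pruning = FakeOptimizer([None, None])  # gates not yet part of the graph
after_pruning = FakeOptimizer([0.1, None])    # at least one gate has a gradient
ran_before = step_if_ready(before_pruning)
ran_after = step_if_ready(after_pruning)
print(ran_before, ran_after)
```

In a trainer, the same check could wrap both the L0 optimizer and the Lagrangian optimizer so that neither steps before pruning has started.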
Hi, I have a question about the intuition behind the prepruning distillation step. Why are you not initializing the student model from the teacher weights, instead of initializing it from scratch (/pretrained on MLM BERT checkpoint)?
Hi, this is a clarification question. In l0_module.py, self.prunable_model_size is updated only in the initialize_structured_head function but not in initialize_structured_mlp or initialize_whole_mlp. Why is that? I believe the MLP layers can also be pruned later?
Hello,
Say I finetune a model with your script without any pruning. Then running the evaluation.py script seems to give accuracy matching the final accuracy of the finetuned model during training (so far so good).
However, when I attempt to train a new pruned model, starting from that fine-tuned model, the accuracy of the first evaluation in the trainer.py script is much lower, e.g. 32%. Why is that the case? Shouldn't the initial evaluation in trainer.py match the result of evaluation.py?
EDIT: I think I have figured out what's happening: The first evaluation corresponds to a bert base model with untrained classification heads for the task of choice. Can you verify this?
Hi, thanks for the great work @xiamengzhou, but I'm not clear about the training process and the models I should prepare beforehand. Here is my understanding:
If I need to prune a BERT on MNLI and then test, there are three stages: train (prune), fine-tune (prune_type=none), evaluate.
First, do I need to download an original BERT_base_uncased and apply the prune_type=none fine-tuning to the original BERT?
But the fine-tuning command has a distillation_path.
bash scripts/run_CoFi.sh $TASK $SUFFIX $EX_CATE $PRUNING_TYPE $SPARSITY [DISTILLATION_PATH] $DISTILL_LAYER_LOSS_ALPHA $DISTILL_CE_LOSS_ALPHA $LAYER_DISTILL_VERSION $SPARSITY_EPSILON [PRUNED_MODEL_PATH] $LEARNING_RATE
Q1: If I use the prune_type=none fine-tuning from the README, what can I put in distillation_path beforehand?
Then I get an MNLI-fine-tuned BERT, which I will prune and also use as the teacher model.
Q2: Is the MNLI-fine-tuned BERT the model to be pruned, or does the pruning process import a bert_base as the model to be pruned?
In the train and fine-tune stages, the argument 'distillation_path' is the path of the MNLI-fine-tuned BERT, and 'pretrained_pruned_model' is the path of the pruned MNLI BERT.
I don't really understand which original model I should use.
How can I get your fine-tuned BERT, or use your way to fine-tune the original BERT?
Hello! Appreciate your great work. In the appendix of your paper, I saw figures illustrating the experiment results of applying CoFi on RoBERTa, but I cannot reproduce the results based on this repository. Could you please provide more detailed experimental results? Thanks!
Hi,
First of all, thanks a lot for open-sourcing your code and models!
I've been trying to use your code to generate predictions with CoFi models (with --do_predict on, for example, the test split of GLUE tasks), but unfortunately the prediction loop always fails with a CUDA OOM exception (even on an 80GB A100 GPU). Could you please try it and let me know if I did something wrong?
Hi! I want to ask why we should use 3 optimizers during training. I think self.optimizer.zero_grad() is enough.
self.optimizer.zero_grad()
if self.l0_optimizer is not None:
    self.l0_optimizer.zero_grad()
if self.lagrangian_optimizer is not None:
    self.lagrangian_optimizer.zero_grad()
Hi,
I noticed that the initialization of 'intermediate_z' is different from the others, which introduces an initial sparsity in the MLP layers. I wonder why it is initialized differently.
Hello, @xiamengzhou! My result on the SQuAD task is 79.74, which is quite different from the result (82.6) in the paper. Could you share the detailed hyperparameters? The teacher model's F1 is 88.43. I would be very grateful!
Hi! I am trying to apply CoFi pruning to my own model, and I noticed that there might be some edge cases where removing the already-pruned parts of my model changes the outputs. I think this happens when all the dims of the intermediate layer are removed.
I found that when the intermediate_z values are all zero, intermediate.dense in the pruned model is set to None:
CoFiPruning/utils/cofi_utils.py, lines 229 to 231 in 5423094
CoFiPruning/models/modeling_bert.py, lines 364 to 365 in 5423094
But before pruning, intermediate.dense is not None, and these zero outputs still pass through CoFiBertOutput.dense, which adds a bias to the output:
CoFiPruning/models/modeling_bert.py, lines 562 to 566 in 5423094
Should I change some part of my code to skip the FFN parts when the intermediate_z values are all zero during training?
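The edge case described above can be illustrated with a toy calculation. This is a minimal sketch in plain Python, not the repository's code: the function names and numbers are hypothetical, and the residual connection and LayerNorm that follow the output dense layer in BERT are left out. It only shows that a fully masked intermediate layer still yields the output bias, whereas a skipped FFN yields zero.

```python
def dense(x, weight, bias):
    """Plain-Python linear layer: y[i] = sum_j weight[i][j] * x[j] + bias[i]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def ffn_output_masked(hidden, intermediate_z, out_weight, out_bias):
    """Unpruned path: mask the intermediate activations, then apply the output dense layer."""
    masked = [h * z for h, z in zip(hidden, intermediate_z)]
    return dense(masked, out_weight, out_bias)


def ffn_output_skipped(out_dim):
    """Pruned path (assumed): the whole FFN is skipped and contributes nothing."""
    return [0.0] * out_dim


# Toy numbers, chosen only for illustration.
hidden = [0.7, -1.2, 0.4]          # intermediate activations (d_f = 3)
out_weight = [[0.5, 0.1, -0.3],    # output dense weight (d = 2, d_f = 3)
              [0.2, -0.4, 0.6]]
out_bias = [0.05, -0.02]
all_zero_z = [0.0, 0.0, 0.0]

masked_out = ffn_output_masked(hidden, all_zero_z, out_weight, out_bias)
skipped_out = ffn_output_skipped(len(out_bias))
print(masked_out)   # equals the bias, because every weighted term is zero
print(skipped_out)
```

If the two paths are meant to agree, either the training-time forward pass has to skip the output bias when the mask is all zero, or the pruned model has to keep it.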
Hello,
In the following line:
CoFiPruning/trainer/trainer.py
Line 599 in 022847a
The existing_layers tensor is on the CPU and the result of indexes<last_aligned_layer is on the GPU, which throws an error.
Is this a bug? Maybe existing_layers should first be moved to the GPU?
Hello.
First of all thanks for your great work.
I have the following question.
If we have trained a model with, for example, ~90% target sparsity, is it possible to get variations of the already trained model with lower sparsity, such as 75% or 50%, or is the only way to obtain a different target sparsity to retrain the model with the needed sparsity?
Thanks in advance :)
Hi,
I got pretrained-Bert by modifying script/run_FT.sh, which gives a decent 84.3% accuracy on MNLI.
Using this pretrained model as a teacher, I ran run.sh, where the only change is the path to the teacher model. The result is above 85% at 95% sparsity. Does this result make sense? If not, did I make a mistake?
Hi, great work and thanks for sharing the code and models!
I'm trying to run this code myself. Could you please point me to the license of this repo? Thanks.
Hi, we've recently been experimenting with compression models based on CoFi, and we've found that on small datasets, using the Lagrangian term from the paper causes the model to converge to a size smaller than the target sparsity. However, taking an absolute value for (expected_sparsity - target_sparsity) in the Lagrangian term seems to ameliorate the problem. Do you think (expected_sparsity - target_sparsity).abs() would be a better choice for calculating the Lagrangian term?
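To make the comparison in this question concrete, here is a small sketch. It assumes the Lagrangian term has the form lambda_1 * (s - t) + lambda_2 * (s - t)^2, where s is the expected sparsity and t the target; the function name lagrangian_term, the use_abs flag, and the multiplier values are illustrative assumptions, not the repository's code.

```python
def lagrangian_term(expected_sparsity, target_sparsity, lambda_1, lambda_2, use_abs=False):
    """Penalty of the form lambda_1 * gap + lambda_2 * gap**2, with gap = expected - target."""
    gap = expected_sparsity - target_sparsity
    if use_abs:
        gap = abs(gap)
    return lambda_1 * gap + lambda_2 * gap ** 2


# Illustrative multipliers; in training they are learned, not fixed.
lambda_1, lambda_2 = 0.5, 1.0
target = 0.95

over_signed = lagrangian_term(0.98, target, lambda_1, lambda_2)                 # too sparse
under_signed = lagrangian_term(0.92, target, lambda_1, lambda_2)                # not sparse enough
under_abs = lagrangian_term(0.92, target, lambda_1, lambda_2, use_abs=True)

print(over_signed, under_signed, under_abs)
```

With a fixed positive lambda_1, the signed form goes negative when the model is below the target, while the absolute-value form penalizes both directions equally. Whether that changes where training converges also depends on how the multipliers themselves are updated, which this sketch does not model.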
Hi, I'm running evaluation.py on MNLI as described in the README, but I'm getting different results compared to what's displayed there. I'm using Google Colab for this, and you can find my notebook here: https://colab.research.google.com/drive/1UahAOTIwALfEC_DXE11mVOp5iSgwHoYH?usp=sharing
When I run evaluation.py, it shows the following results:
Task: mnli
Model path: ../CoFi-MNLI-s95
Model size: 4330279
Sparsity: 0.949
Accuracy: 0.091
Seconds/example: 0.000561
However, in the README file, the results for the same evaluation are different:
Task: MNLI
Model path: princeton-nlp/CoFi-MNLI-s95
Model size: 4920106
Sparsity: 0.943
mnli/acc: 0.8055
Seconds/example: 0.010151
I need help figuring out why there's a discrepancy between my results and what's described in the README. I've tried to follow the instructions in the README as closely as possible, but I may have missed something. Thank you for any assistance you can provide.
CoFi is great work that may benefit research in related areas.
However, I found that the task performance numbers at other sparsities are not available. Could you please provide these numbers in detail?
Metrics other than accuracy on GLUE would also be appreciated.
Hi @xiamengzhou.
When reproducing the efficiency evaluation of the [CoFi-MNLI-s95] model on a single NVIDIA A100 graphic card, it shows that the model's speed is 8.8e-05 seconds/example, where the vanilla fine-tuned BERT's speed is 4.6e-04 seconds/example, meaning that the speedup is only about 5.23× instead of 12.1×.
Could it be possible that the decrease in speedup comes from the difference in the hardware? Are there any other possible reasons that may cause the difference in efficiency testing? Many thanks!
The output for CoFi-MNLI-s95 testing:
The output for fine-tuned BERT testing:
Hi, thanks for your work.
I'm trying to test the results of your work but am having difficulty reproducing similar accuracy.
Below is the Environment that I created:
channels:
I used datasets==2.00.0 because installing datasets==1.14.0 results in the following conflict:
The conflict is caused by:
transformers 4.17.0 depends on huggingface-hub<1.0 and >=0.1.0
datasets 1.14.0 depends on huggingface-hub<0.1.0 and >=0.0.19
If I use datasets 2.00.0, I am able to run evaluation.py MNLI ../CoFi-MNLI-s95, but the results seem wrong.
What can I do to solve this problem? Thanks a lot!
../CoFi-MNLI-s95 is what I downloaded from https://huggingface.co/princeton-nlp/CoFi-MNLI-s95
Results I obtained:
Task: mnli
Model path: ../CoFi-MNLI-s95
Model size: 4330279
Sparsity: 0.949
accuracy: 0.091
seconds/example: 0.000531
The accuracy is too low compared to the expected result:
Task: MNLI
Model path: princeton-nlp/CoFi-MNLI-s95
Model size: 4920106
Sparsity: 0.943
mnli/acc: 0.8055
seconds/example: 0.010151
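The dependency conflict quoted in this report can be checked by hand: the two huggingface-hub ranges do not intersect. The sketch below is a simplified illustration; the helpers parse_version and ranges_overlap are hypothetical, and they handle only plain dotted numeric versions, not pre-releases or the full version-specifier rules that pip applies.

```python
def parse_version(text):
    """Turn '0.1.0' into (0, 1, 0) so versions compare as tuples."""
    return tuple(int(part) for part in text.split("."))


def ranges_overlap(range_a, range_b):
    """Each range is (inclusive lower bound, exclusive upper bound)."""
    low = max(parse_version(range_a[0]), parse_version(range_b[0]))
    high = min(parse_version(range_a[1]), parse_version(range_b[1]))
    return low < high


# Bounds copied from the pip message quoted above.
transformers_needs = ("0.1.0", "1.0")   # huggingface-hub >=0.1.0, <1.0
datasets_needs = ("0.0.19", "0.1.0")    # huggingface-hub >=0.0.19, <0.1.0

compatible = ranges_overlap(transformers_needs, datasets_needs)
print(compatible)
```

Because one range ends exactly where the other begins, no huggingface-hub version satisfies both packages, so one of the two pins has to change.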
I am trying to prune MarianMT model for example https://huggingface.co/Helsinki-NLP/opus-mt-en-ar. and the library seems not to support it. Is there any way to use it for pruning that model, or what parts of the code should I modify to make it compatible?
Great job. However, when I train up to the 3rd epoch on the QNLI task, I encounter the following problem, which does not occur on the CoLA or SQuAD tasks. Do you have any suggestions? I would be very grateful!
The error may come from the following code block in https://github.com/princeton-nlp/CoFiPruning/blob/main/trainer/trainer.py, lines 680-685:
lagrangian_loss = None
if self.start_prune:
    lagrangian_loss, _, _ = \
        self.l0_module.lagrangian_regularization(
            self.global_step - self.prepruning_finetune_steps)
    loss += lagrangian_loss
I do not understand how this loss works ---- since
Hi @xiamengzhou, many thanks for your contribution. I have a small question about your paper, which says that
CoFi tends to prune submodules more from upper layers than lower layers.
What does "upper layers" mean? Are they near the input or the output? Many thanks!
Hello,
I am trying to run your codebase, but I am having some issues:
1) The set of Python requirements cannot be installed due to incompatibilities. Are these requirements strict, or can they be relaxed?
2) After relaxing the above, evaluation runs fine, but training requires a --distillation_path. Could you provide an example of how to use this argument?
3) To overcome 2, I set the variable additional_args.do_distill to False. This results in an epoch being trained but crashing at the end. The model loss successfully decreases, but reg loss and lag loss are 0.
The error at the end is an assertion failure: "assert "head" in self.types" in the l0 module.
Could you help me or provide pointers on resolving the above?
Thank you
/opt/conda/conda-bld/pytorch_1634272068694/work/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1634272068694/work/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1634272068694/work/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [2,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1634272068694/work/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [3,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
  File "./run_glue_prune.py", line 394, in <module>
    main()
  File "./run_glue_prune.py", line 385, in main
    trainer.train()
  File "/bit_share//LLM/Fitune_LLM/model_pruning/CoFiPruning/trainer/trainer.py", line 285, in train
    loss_terms = self.training_step(model, inputs)
  File "/bit_share/zhangxiaolei/LLM/Fitune_LLM/model_pruning/CoFiPruning/trainer/trainer.py", line 704, in training_step
    loss.backward()
  File "/data03//anaconda3/envs/llmprune/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/data03//anaconda3/envs/llmprune/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: device-side assert triggered
Hi! In your code you calculate Lc here:
https://github.com/princeton-nlp/CoFiPruning/blob/main/trainer/trainer.py#L682
You use expected_size to calculate expected_sparsity, but does it match the equation in your paper?
https://github.com/princeton-nlp/CoFiPruning/blob/main/models/l0_module.py#L267
You said that ŝ is the expected model sparsity calculated from z, but lagrangian_regularization() does not take inputs or z as arguments.
Many thanks!
Hi, I just noticed a confusing description of the distillation constraint. Intuitively, I (and probably many other readers) would imagine the distillation from bottom to top, i.e., from layer 1 to layer 12. And to tackle layer mismatching, it is likely that we need higher student layer matched with higher teacher layer. Thus, it is weird to see the constraint as "lower than the previous matched layer".
After reading the code trainer.py line 601, I know the distillation is top-down, so the constraint is "lower than the previous matched layer", but I think the distillation direction needs to be clarified.
for search_index in range(3, -1, -1):
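The top-down constraint discussed above can be sketched as a greedy matching loop. This is a simplified illustration under assumptions, not a copy of trainer.py: the function name match_layers_top_down and the toy loss table are hypothetical, and details of the real implementation (such as skipping pruned layers) are omitted. It only shows why iterating from the top slot downward leads to the rule "lower than the previous matched layer".

```python
def match_layers_top_down(losses, num_student_layers):
    """Greedy top-down matching.

    losses[i][j] is the distillation loss between teacher layer slot i and
    student layer j. Slots are visited from the top (last) to the bottom
    (first); each slot takes the lowest-loss student layer that lies strictly
    below the student layer matched by the slot above it.
    """
    matches = {}
    upper_bound = num_student_layers  # the top slot may use any student layer
    for slot in range(len(losses) - 1, -1, -1):
        candidates = range(upper_bound)
        if len(candidates) == 0:
            break  # no student layer left below the previous match
        best = min(candidates, key=lambda j: losses[slot][j])
        matches[slot] = best
        upper_bound = best
    return matches


# Toy loss table: 4 teacher slots x 6 student layers (values are made up).
toy_losses = [
    [0.1, 0.9, 0.9, 0.9, 0.9, 0.9],
    [0.8, 0.2, 0.7, 0.9, 0.9, 0.9],
    [0.9, 0.8, 0.7, 0.3, 0.1, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.2, 0.5],
]
matches = match_layers_top_down(toy_losses, num_student_layers=6)
print(matches)
```

In the toy table, slot 2 would prefer student layer 4 on loss alone, but that layer is already taken by slot 3, so it falls back to the best layer below it.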
Hi, thanks for the great work! I have some questions about the current code.
First, is the following line expected? https://github.com/princeton-nlp/CoFiPruning/blob/main/trainer/trainer.py#L667
zs = {key: inputs[key] for key in inputs if "_z" in inputs}
Should it be zs = {key: inputs[key] for key in inputs if "_z" in key} in order to extract zs from inputs?
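The difference between the two conditions in the first question can be seen on a small dictionary. The keys below are illustrative assumptions (only intermediate_z is named elsewhere on this page); the point is just how Python evaluates `"_z" in inputs` versus `"_z" in key`.

```python
inputs = {
    "input_ids": [101, 2023, 102],
    "attention_mask": [1, 1, 1],
    "head_z": [1.0, 0.0],
    "intermediate_z": [0.5, 1.0],
}

# Condition tests the whole dict: is there a key that is exactly "_z"?
# The answer is the same for every key, so here nothing is selected.
zs_dict_test = {key: inputs[key] for key in inputs if "_z" in inputs}

# Condition tests each key string: does this key contain "_z"?
zs_key_test = {key: inputs[key] for key in inputs if "_z" in key}

print(sorted(zs_dict_test))
print(sorted(zs_key_test))
```

With a dict input, the first form therefore returns an empty dict unless a key named exactly "_z" exists, in which case it returns every entry.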
Second, what is the last term, self.hidden_size * 4, in the following line, which calculates the params of an FFN layer? https://github.com/princeton-nlp/CoFiPruning/blob/main/models/l0_module.py#L44
self.params_per_mlp_layer = self.hidden_size * self.intermediate_size * 2 + self.hidden_size + self.hidden_size * 4
I guess it is the bias parameter of the intermediate dense layer, so it is equivalent to self.intermediate_size?
Third, when initializing the loga params in l0_module, structured_mlp uses a different mean compared with the other components, as shown in the following line: https://github.com/princeton-nlp/CoFiPruning/blob/main/models/l0_module.py#L147
It seems the intermediate dimension has an initial sparsity of 0.5, even before any pruning. What is the intuition behind setting it this way?
Thank you very much for your time!
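For the second question above, the arithmetic can be worked through with the BERT-base sizes that appear in the error messages on this page (hidden size 768, intermediate size 3072). The breakdown into named terms is an interpretation of the formula quoted in the question, not a statement from the authors.

```python
hidden_size = 768          # BERT-base hidden size
intermediate_size = 3072   # BERT-base FFN width, equal to 4 * hidden_size

weights = hidden_size * intermediate_size * 2   # up- and down-projection matrices
output_bias = hidden_size                        # bias of the down-projection
last_term = hidden_size * 4                      # the term asked about

params_per_mlp_layer = weights + output_bias + last_term
print(params_per_mlp_layer)
print(last_term == intermediate_size)
```

Under these sizes the last term equals intermediate_size, which is consistent with reading it as the bias of the intermediate dense layer. The two would differ for a model whose FFN width is not four times its hidden size.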
Hi @xiamengzhou, many thanks for your contribution. I have a few small questions about your paper. You say that FFN pruning introduces Zint, and there is an equation in the paper, but what is diag? Why do we have to put Zint into a diagonal matrix? Is diag(Zint) of size df*df?
You also say:
Coarse-grained and Fine-grained units (§3.1) with a layerwise distillation objective transferring knowledge from unpruned to pruned models (§3.2)
However, distilling intermediate layers during the pruning process is challenging as the model structure changes throughout training. (previous method)
So are we pruning a student model during distillation?
Many thanks!!
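On the diag question above, a small numerical check may help. It assumes the usual meaning of diag: a square matrix with the mask on its diagonal, so a mask over d_f intermediate dimensions gives a d_f x d_f matrix. The helper names and numbers are hypothetical; the sketch only shows that multiplying by diag(z) is the same as element-wise masking.

```python
def diag(z):
    """Build a square matrix with z on the diagonal and zeros elsewhere."""
    n = len(z)
    return [[z[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[k] * m[k][j] for k in range(len(v))) for j in range(len(m[0]))]


activations = [0.7, -1.2, 0.4, 2.0]   # one token's intermediate activations, d_f = 4
z_int = [1.0, 0.0, 1.0, 0.0]          # mask over the d_f intermediate dimensions

d_matrix = diag(z_int)
via_diag = vec_mat(activations, d_matrix)
via_elementwise = [a * z for a, z in zip(activations, z_int)]
print(len(d_matrix), len(d_matrix[0]))
print(via_diag == via_elementwise)
```

Writing the mask as a diagonal matrix is a notational device that keeps the FFN expression a chain of matrix products; an implementation would normally use the element-wise form.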