
nn_pruning's Introduction

Neural Networks Block Movement Pruning

An interactive version of this site is available here.

Movement pruning has proven to be a very efficient method for pruning networks in an unstructured manner. High levels of sparsity can be reached with minimal accuracy loss. The resulting sparse networks can be compressed heavily, saving a lot of permanent storage space on servers or devices, as well as bandwidth, an important advantage for edge devices. But efficient inference with unstructured sparsity is hard: some degree of structure is necessary to exploit the intrinsically parallel nature of today's hardware. Block Movement Pruning extends the original method and explores semi-structured and structured variants of Movement Pruning. You can read more about block sparsity and why it matters for performance in these blog posts.
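
As a rough illustration of the block idea, here is a minimal sketch with a hypothetical block_mask helper (not the library's implementation): it uses a simple magnitude score where the real method learns movement scores during fine-tuning, and it keeps or drops whole tiles of a weight matrix together, which preserves enough structure for dense hardware to exploit.

import torch

def block_mask(weight: torch.Tensor, block: int = 32, keep_ratio: float = 0.5) -> torch.Tensor:
    """Illustrative block-level mask: score each (block x block) tile of `weight`
    and keep only the highest-scoring tiles. L1 magnitude is used as the score for
    simplicity; block movement pruning learns the scores during fine-tuning instead."""
    out, inp = weight.shape
    assert out % block == 0 and inp % block == 0
    tiles = weight.reshape(out // block, block, inp // block, block)
    scores = tiles.abs().sum(dim=(1, 3))                  # one score per tile
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = torch.topk(scores.flatten(), k).values.min()
    keep = (scores >= threshold).float()                  # binary tile mask
    return keep[:, None, :, None].expand_as(tiles).reshape(out, inp)

w = torch.randn(768, 768)
m = block_mask(w, block=64, keep_ratio=0.25)
print(m.mean().item())  # ~0.25: a quarter of the entries survive, in contiguous 64x64 blocks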

Documentation

The documentation is here.

Installation

User installation

You can install nn_pruning using pip as follows:

python -m pip install -U nn_pruning

Developer installation

To install the latest state of the source code, first clone the repository

git clone https://github.com/huggingface/nn_pruning.git

and then install the required dependencies:

cd nn_pruning
python -m pip install -e ".[dev]"

After the installation is completed, you can launch the test suite from the root of the repository

pytest nn_pruning

Results

SQuAD v1

The experiments were done first on SQuAD v1.

Two networks were tested: BERT-base, and BERT-large.

Very significant speedups were obtained with a limited drop in accuracy.

Here is a selection of the networks obtained through the different method variants.

The original "large" and "base" fine-tuned models are included in the table for comparison.

The "BERT version" column shows which base network was pruned. The parameter count column is relative to linear layers, which contain most of the model parameters (with the embeddings being most of the remaining parameters).

The F1 difference, speedups, and parameter counts are all relative to BERT-base to ease practical comparison.

Model  Type   Method         Params  F1     F1 diff  Speedup
#1     large  -              +166%   93.15  +4.65    0.35x
#2     large  hybrid-filled  -17%    91.03  +2.53    0.92x
#3     large  hybrid-filled  -40%    90.16  +1.66    1.03x
#4     base   hybrid-filled  -59%    88.72  +0.22    1.84x
#5     base   -              +0%     88.5   +0.00    1.00x
#6     base   hybrid-filled  -65%    88.25  -0.25    1.98x
#7     base   hybrid-filled  -74%    87.71  -0.79    2.44x
#8     base   hybrid-filled  -73%    87.23  -1.27    2.60x
#9     base   hybrid-filled  -74%    86.69  -1.81    2.80x
#10    base   struct         -86%    85.52  -2.98    3.64x

Main takeaways

  • network #2: pruned from BERT-large, it is significantly more accurate than BERT-base, but has a similar size and speed.
  • network #3: pruned from BERT-large, it is 40% smaller than BERT-base yet significantly more accurate, and still as fast.

This means that starting from a larger network is beneficial on all metrics, even absolute size, something also observed in the Train Large, Then Compress paper.

  • network #4: we can shrink BERT-base by ~60%, speed up inference by 1.8x, and still get a better network.
  • networks #N: we can select a tradeoff between speed and accuracy, depending on the final application.
  • last network: pruned using a slightly different "structured pruning" method that gives faster networks, but with a significant drop in F1.

Additional remarks

  • The parameter reduction of the BERT-large networks is actually even higher relative to the original network: 40% smaller than BERT-base actually means 77% smaller than BERT-large. We kept the comparison with BERT-base here, as that is what matters from a practical point of view.
  • The "theoretical speedup" is a speedup of linear layers (actual number of flops), something that seems to be equivalent to the measured speedup in some papers. The speedup here is measured on a 3090 RTX, using the HuggingFace transformers library, using Pytorch cuda timing features, and so is 100% in line with real-world speedup.

Example "Hybrid filled" Network

Here are some visualizations of the pruned network #7. It is using the "Hybrid filled" method:

  • Hybrid: prune using blocks for attention and rows/columns for the two large FFNs.
  • Filled: remove the empty heads and the empty rows/columns of the FFNs, then fine-tune the resulting network again, letting the zeros in the non-empty attention heads evolve; this regains some accuracy while keeping the same network speed (a rough sketch of this structural shrink follows the list).
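
Below is a rough sketch of what the structural shrink amounts to for an FFN pair, using a hypothetical shrink_linear_pair helper (not nn_pruning's optimize_model). It assumes the pruning zeroed each dropped hidden unit's row in the first linear layer together with the matching column of the second.

import torch
import torch.nn as nn

def shrink_linear_pair(fc1: nn.Linear, fc2: nn.Linear):
    """Illustrative shrink of an FFN pair (fc1 -> activation -> fc2): drop the hidden
    units whose fc1 row was pruned to zero, assuming the matching fc2 column was
    pruned together with it, and return smaller, fully dense layers."""
    keep = fc1.weight.abs().sum(dim=1) != 0               # hidden units that survived
    kept = int(keep.sum())
    new_fc1 = nn.Linear(fc1.in_features, kept, bias=fc1.bias is not None)
    new_fc2 = nn.Linear(kept, fc2.out_features, bias=fc2.bias is not None)
    with torch.no_grad():
        new_fc1.weight.copy_(fc1.weight[keep])
        new_fc2.weight.copy_(fc2.weight[:, keep])
        if fc1.bias is not None:
            new_fc1.bias.copy_(fc1.bias[keep])
        if fc2.bias is not None:
            new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2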

You can see that the resulting linear layers are actually all "dense" (hover over the graph to visualize them).

[Figure: Hybrid Filled density]

You can see here the pruned heads for each layer:

[Figure: Hybrid Filled head pruning]

Comparison with state of the art

If we plot the F1 of the full set of pruned networks against speedup, we can see that we outperform fine-tuned TinyBERT and DistilBERT by some margin. MobileBERT seems significantly better, even in the "no OPT" version presented here, which does not contain the LayerNorm optimization used in the much faster version of MobileBERT. An interesting direction for future work is to add those optimizations to the pruning tools.

[Figure: SQuAD v1 speedup]

Even in terms of saved size, we get smaller networks for the same accuracy (except for MobileBERT, which is better on size too):

[Figure: SQuAD v1 fill rate]

GLUE/MNLI

The experiments were done on BERT-base. Significant speedups were obtained, even if the results are a bit behind the SQuAD results. Here is a selection of networks, following the same conventions as the SQuAD table:

Model  Type  Method         Params  Accuracy  Accuracy diff  Speedup
#1     base  -              +0%     84.6      +0.00          1.00x
#2     base  hybrid-filled  -65%    83.71     -0.89          2.00x
#3     base  hybrid-filled  -74%    83.05     -1.55          2.40x
#4     base  hybrid-filled  -81%    82.69     -1.91          2.86x
#5     base  hybrid-filled  -87%    81.03     -3.57          3.44x

Comparison with state of the art

(This is a work in progress: some more runs are needed to check the performance versus MobileBERT and TinyBERT at the same level of speed. Better hyperparameters may help too.)

From the following graphs, we see that the speed is a bit lower than TinyBERT's, and roughly in line with MobileBERT's. In terms of sparsity, the accuracy is a bit lower than MobileBERT and TinyBERT. On both metrics it is better than DistilBERT by a significant margin.

[Figure: MNLI speedup]

[Figure: MNLI fill rate]

Related work

pytorch_block_sparse is a CUDA implementation of block-sparse kernels for linear-layer forward and backward propagation. It is not needed to run the models pruned by the nn_pruning tools, as it is not yet fast enough to be competitive with dense linear layers: just pruning heads is faster, even if those heads still contain some inner sparsity.

nn_pruning's People

Contributors

echarlaix, lewtun, madlag, michaelbenayoun


nn_pruning's Issues

Possible speed up for Sparse Block Computation

Dear @madlag,

Matthias Fey, the main author of PyTorch Geometric, implemented this: https://github.com/rusty1s/pytorch_sparse/blob/master/csrc/cuda/spmm_cuda.cu#L11

After benchmarking, he confirmed it was faster than cuSPARSE.
I think this algorithm could be adapted for block sparse computation to get some additional speedup.

From Matthias:

basically the same, just that you additionally parallelize over the block dimension
depends on the density of the sparse matrix I guess

Just an idea :)

Best,
T.C

ValueError: mutable default <class 'nn_pruning.modules.masked_nn.InitDirective'> for field mask_init is not allowed: use default_factory

python3.11 examples/command_line.py --model_name_or_path bert-base-uncased --dataset_name squad_v2 --dataset_config_name 2.1 --output_dir ./results --pruning_method topK --pruning_params '{"pruning_fraction": 0.5}' --do_train --do_eval --evaluation_strategy epoch --learning_rate 2e-5 --per_device_train_batch_size 16 --per_device_eval_batch_size 64 --num_train_epochs 3 --weight_decay 0.01
Traceback (most recent call last):
  File "/home/junfan/gitpod/bert_pruning/nn_pruning/examples/command_line.py", line 5, in <module>
    from question_answering.qa_sparse_xp import QASparseXP
  File "/home/junfan/gitpod/bert_pruning/nn_pruning/examples/question_answering/qa_sparse_xp.py", line 29, in <module>
    from nn_pruning.patch_coordinator import SparseTrainingArguments
  File "/home/junfan/.local/lib/python3.11/site-packages/nn_pruning/patch_coordinator.py", line 29, in <module>
    from .modules.masked_nn import (
  File "/home/junfan/.local/lib/python3.11/site-packages/nn_pruning/modules/masked_nn.py", line 69, in <module>
    @dataclass
     ^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 1220, in dataclass
    return wrap(cls)
           ^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 1210, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 958, in _process_class
    cls_fields.append(_get_field(cls, name, type, kw_only))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 815, in _get_field
    raise ValueError(f'mutable default {type(f.default)} for field '
ValueError: mutable default <class 'nn_pruning.modules.masked_nn.InitDirective'> for field mask_init is not allowed: use default_factory
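
For context, the error comes from a general dataclasses rule that became stricter in Python 3.11. Here is a minimal standalone illustration with placeholder fields (not the actual nn_pruning source) and the default_factory fix the message suggests.

from dataclasses import dataclass, field

@dataclass
class InitDirective:
    method: str = "constant"
    scale: float = 1.0

# Rejected on Python 3.11+ (a dataclass instance is an unhashable, mutable default):
#
#   @dataclass
#   class PruningConfig:
#       mask_init: InitDirective = InitDirective()   # ValueError: mutable default
#
# Accepted: build the default lazily with default_factory.
@dataclass
class PruningConfig:
    mask_init: InitDirective = field(default_factory=InitDirective)

print(PruningConfig())  # PruningConfig(mask_init=InitDirective(method='constant', scale=1.0))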

Dependencies and version numbers?

Hello François,

First of all, thank you for making your work on movement-pruning and sparse transformers available to the public!

I've finally found some time to start playing around with nn_pruning, and I noticed there appears to be a mismatch between the dependencies listed in setup.py (i.e. just click) and those visible in the source code (e.g. transformers, datasets, torch, etc.).

Would you mind sharing which package dependencies and version numbers you used to run your experiments?

Applying Block Movement Pruning for BART

Hi,
I am working on pruning a BART model for seq2seq purposes. Currently, I have replaced this code with BART-based functionality. After running it, I see a drop in the number of parameters for both attention and the FFN, but the dimension reduction happens only for the FFN, which results in slowness. My questions are the following:

  1. Is this the right code to refer to, or should I follow this command_line.py?
  2. Is there any existing code that works for BART-based models for conditional generation or seq2seq?

Experiments run slowly on V100

Hi, thanks for sharing this great work! I am trying to train a model on my own with the hybrid-filled setting for MNLI, and it seems to take more than 5 hours to train for one epoch. Is the training expected to be this slow, or might there be something wrong with what I am doing? I basically use the same hyperparameters provided in the analysis folder.

Is edge-popup the same as movement pruning with frozen weights?

In their "hidden networks" paper, Ramanujan et al. develop a method to find masks via optimization, called edge-popup. The algorithm is extremely similar to movement pruning: the masks are part of the computational graph and receive gradients for a negative gradient step. The main difference is that they freeze the weights and only train the scores (the mask), such that they can find well-performing subnetworks within randomly initialized models.

If I freeze the weights and apply movement pruning, is it the same as the above method? If not, what would be the difference?

From a theoretical standpoint, movement pruning is motivated by pruning the weights that move towards zero, as shown by the tendency of their gradients. In edge-popup, such behavior is never mentioned, but I assume it would be the same if both methods apply the same operations. Given that the method tracks the tendency of weights to move towards zero, it sounds counterintuitive to freeze the weights, since there is no movement tendency anymore. However, that is what they do in edge-popup and it works surprisingly well. Any thoughts about this?
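
To make the comparison concrete, below is a minimal, self-contained PyTorch sketch of a straight-through top-k mask (an illustration, not the reference code of either paper; the TopKMask and MaskedLinear names are made up). With freeze_weights=True only the scores are trained, as in edge-popup; with trainable weights the same score gradient gives the movement-pruning behaviour discussed above.

import torch
import torch.nn as nn

class TopKMask(torch.autograd.Function):
    """Binary top-k mask in the forward pass; straight-through (identity)
    gradient to the scores in the backward pass."""
    @staticmethod
    def forward(ctx, scores, keep_ratio):
        k = max(1, int(keep_ratio * scores.numel()))
        threshold = torch.topk(scores.flatten(), k).values.min()
        return (scores >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None      # no gradient for keep_ratio

class MaskedLinear(nn.Module):
    def __init__(self, in_features, out_features, keep_ratio=0.5, freeze_weights=False):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.scores = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight)
        nn.init.kaiming_uniform_(self.scores)
        self.keep_ratio = keep_ratio
        if freeze_weights:            # edge-popup style: weights stay at their init values
            self.weight.requires_grad_(False)

    def forward(self, x):
        mask = TopKMask.apply(self.scores, self.keep_ratio)
        # Through the straight-through estimator, dL/dscores = dL/d(W * mask) * W:
        # scores grow where keeping the weight helps, which matches the movement-pruning
        # intuition; freezing W removes the weight update, but not the score update.
        return nn.functional.linear(x, self.weight * mask)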

a bug for MNLI eval result?

Hi there @madlag,

Thanks for your great work!
It seems there is a problem with MNLI:
if we update text_classification/parameters.json with do_train: 0
and run the following,

mkdir result
export CUDA_VISIBLE_DEVICES=0; python command_line.py finetune --json-path  text_classification/parameters.json mnli result

we get the following result in eval_results_mnli.json:

{
    "eval_accuracy": 0.06622516556291391,
    "eval_loss": 15.527382850646973
}

I would really appreciate it if you could double-check the code; could there be a bug? Or am I missing something?

General help for custom token classification endtask

Hey all,

Thanks for the great repo, that's exactly what I was looking for 😄

I have been through the examples and documentation you provided, but I am attempting to use the library for token classification (specifically for NER).
I have my own datasets.Dataset, a custom BERT model, and I am not using an HF Trainer.

I have tried to follow the steps provided here, but they are quite confusing to me...
@madlag Could you by any chance give me further hints/notebooks on how I could use the library to reach my end goal?

Thanks a lot for your help,
Cheers,
Jules

Does this support TensorFlow?

Hello,

I am using HF and I have built my model using TensorFlow. I am interested in pruning my model, but I am unsure whether this library supports TensorFlow.
Any clarification would be appreciated.

unable to replicate results of example notebooks

Can you please provide a Dockerfile or requirements.txt to reproduce the results? I installed this library but had to make a few changes to run the example notebooks. On running the notebook, the number of parameters was the same as in the original model. No pruning was done.

nn_pruning doesn't seem to work for T5 models or RoBERTa-based models

Hi @madlag @julien-c @co42 @srush @Narsil

I am trying to use nn_pruning to prune different transformer models.

Code:

model_checkpoint = "t5-small"
t5small_model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to(device)
mpc.patch_model(t5small_model)

t5small_model.save_pretrained("models/patched")

Error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
[<ipython-input-47-602943fc51a1>](https://localhost:8080/#) in <module>()
     1 
     2 t5small_model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to(device)
----> 3 mpc.patch_model(t5small_model)
     4 
     5 t5small_model.save_pretrained("models/patched")

[/usr/local/lib/python3.7/dist-packages/nn_pruning/patch_coordinator.py](https://localhost:8080/#) in patch_model(self, model, trial)
   640             patched_count += 2 * layers_count
   641 
--> 642         assert (patcher.stats["patched"] == patched_count)
   643 
   644         if layer_norm_patch:

AssertionError:

Colab: https://colab.research.google.com/drive/1Gz7rozG8NbeBtsiWXjGNQ5wnVU7SE_Wl?usp=sharing

Descriptions of hybrid pruning

I am wondering where I can find a description of the hybrid pruning method, as it is not covered in the original movement pruning paper. Thanks so much!

No weights removed during fine-pruning?

Hello François,

I've put together a simple text classification example (link) using the SparseTrainer and it seems that no weights are being removed during fine-pruning.

From the nn_pruning docs my understanding is that I need to take the following steps:

Create a mixin with SparseTrainer and Trainer

Since I'm not doing anything fancy like question-answering, I created the following class:

from transformers import Trainer
from nn_pruning.sparse_trainer import SparseTrainer

class PruningTrainer(SparseTrainer, Trainer):
    def __init__(self, sparse_args, *args, **kwargs):
        Trainer.__init__(self, *args, **kwargs)
        SparseTrainer.__init__(self, sparse_args)
        
    def compute_loss(self, model, inputs, return_outputs=False):
        """
        We override the default loss in SparseTrainer because it throws an 
        error when run without distillation
        """
        outputs = model(**inputs)

        # Save past state if it exists
        # TODO: this needs to be fixed and made cleaner later.
        if self.args.past_index >= 0:
            self._past = outputs[self.args.past_index]

        # We don't use .loss here since the model may return tuples instead of ModelOutput.
        loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
        self.metrics["ce_loss"] += float(loss)
        self.loss_counter += 1
        return (loss, outputs) if return_outputs else loss

where I override the default compute_loss function because it throws a TypeError: iteration over a 0-d tensor error when a teacher model is not provided (I want to try without distillation first). I think the error is produced by distil_loss_combine, which returns a single value here, while compute_loss expects two values here.

Instantiate trainer with sparse training arguments

With the above mixin, my idea was to use the default values in SparseTrainingArguments along with the usual things we need in a HF Trainer:

sparse_args = SparseTrainingArguments()

trainer = PruningTrainer(
    sparse_args=sparse_args,
    args=args,
    model=bert_model,
    train_dataset=boolq_enc['train'],
    eval_dataset=boolq_enc['validation'],
    tokenizer=bert_tokenizer,
    compute_metrics=compute_metrics
)

By default SparseTrainingArguments has initial_threshold=1 and final_threshold=0.5 so my understanding is that by the end of fine-pruning we expect around 50% of the encoder weights to be removed.

Set the trainer's patch coordinator

Here I took a guess based on your instructions for fine-pruning without a trainer and set the patch coordinator as follows:

mpc = ModelPatchingCoordinator(
    sparse_args=sparse_args, 
    device=device, 
    cache_dir="checkpoints", 
    logit_names="logits", 
    teacher_constructor=AutoModelForSequenceClassification)

trainer.set_patch_coordinator(mpc)

Fine-tune

Running

trainer.train()

seems to show the model is learning, although curiously the mask threshold in the logs is already 0.5 after the first epoch.

Optimize the model for inference

Following your example I ran

prunebert_model = optimize_model(trainer.model.to("cpu"), "dense")

but find that no parameters are removed 😢. So it seems that although the model is learning, I have missed something to enable pruning during fine-tuning.

Any ideas on what step(s) I'm missing?

P.S. What is the meaning of "XP" in classes like SparseXP?

What is the difference between "finetune" and "final-finetune" in `/example`?

Hello,

Thanks for the amazing repo!

I'm wondering what the difference is between "finetune" and "final-finetune" in /example.
Do we train the model and the mask scores in the finetune stage, and only train the optimized model in the final-finetune stage?

Is there a way to directly save the optimized model and load it, instead of loading the patched model and optimizing it to get the pruned model?

Big thanks again for the great work!

Not seeing the inference speed up on cuda using the sparse trainer notebook

Hi @madlag ,
I have tried the notebook, which is very similar to the notebook you shared in issue #5, but I am not seeing any speedup at the end if we move the models to CUDA, although I can see about a 1.3x speedup on CPU. I am running this on an EC2 g4dn.2xlarge instance, which has a T4 card.

This is my training code and this is the inference code. I wonder if I am missing something here.

The parameter count shows the reduction, but the inference speed for both the pruned and non-pruned models is ~9 ms.

prunebert_model.num_parameters() / bert_model_original.num_parameters() = 0.6118184376136527

Thanks for your help and the great work.

NN_pruning module for Question Answering

Hi!

I am trying to run the launch_qa_sparse_single.py file from the question answering example of your nn_pruning library. I haven't changed anything in the original code and I get this error:

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
***** Running training *****
Num examples = 131754
Num Epochs = 20
Instantaneous batch size per device = 16
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 164700
0%| | 0/164700 [00:00<?, ?it/s]Traceback (most recent call last):
File "question_answering/launch_qa_sparse_single.py", line 33, in
main()
File "question_answering/launch_qa_sparse_single.py", line 23, in main
qa.run()
File "./question_answering/xp.py", line 324, in run
self.train()
File "./question_answering/xp.py", line 312, in train
model_path= model_path
File "/home/ines/NN_pruning/venv_nn_prun/lib/python3.7/site-packages/transformers/trainer.py", line 1120, in train
tr_loss += self.training_step(model, inputs)
File "/home/ines/NN_pruning/nn_pruning/nn_pruning/sparse_trainer.py", line 86, in training_step
return super().training_step(*args, **kwargs)
File "/home/ines/NN_pruning/venv_nn_prun/lib/python3.7/site-packages/transformers/trainer.py", line 1542, in training_step
loss.backward()
File "/home/ines/NN_pruning/venv_nn_prun/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/ines/NN_pruning/venv_nn_prun/lib/python3.7/site-packages/torch/autograd/init.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [16]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

I found several solutions to this problem on the internet, but all of them tell me to change something in the architecture of the model. Unfortunately, here we are using a Trainer from the transformers library, so I don't really know how to fix this issue. Thank you for your help.
