SparseGPT

This repository contains code to reproduce the key results of the paper SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot.

Specifically, it provides scripts and implementations to:

  • Evaluate baseline and pruned models on raw-WikiText2, PTB and C4-subset. (datautils.py, opt.py, bloom.py)
  • Perform unstructured, n:m and sparse + quantized SparseGPT compression on OPT and BLOOM models. (sparsegpt.py, opt.py, bloom.py)

We note that this SparseGPT implementation is based on our open-source GPTQ code.

Dependencies

  • torch: tested on v1.10.1+cu111
  • transformers: tested on v4.21.2
  • datasets: tested on v1.17.0

Usage

Here are some sample commands to run baselines and sparsification on OPT models, followed by perplexity evaluations on raw-WikiText2, PTB and C4. See also the CMD-argument documentation.

# Run dense baseline
python opt.py facebook/opt-125m c4

# Run magnitude baseline
python opt.py facebook/opt-125m c4 --sparsity .5 --gmp

# Prune to 50% uniform sparsity with SparseGPT
python opt.py facebook/opt-125m c4 --sparsity .5

# Prune to full 2:4 sparsity with SparseGPT
python opt.py facebook/opt-125m c4 --prunen 2 --prunem 4

# Prune to 50% + 4-bit with SparseGPT
python opt.py facebook/opt-125m c4 --sparsity .5 --wbits 4

To run on other OPT models, replace "facebook/opt-125m" with the HuggingFace name of the corresponding model. For the 175B model, access must first be requested from Meta and the checkpoint converted to HuggingFace format; its location can then simply be passed as a name to this script.

The BLOOM script bloom.py has a very similar interface; however, some features are currently only available for OPT. For example:

# Sparsify BLOOM-176B with SparseGPT
python bloom.py bigscience/bloom c4 --sparsity .5

We also provide a LLaMA pruning script with the very same interface:

# Sparsify LLaMA with SparseGPT
python llama.py LLAMA_HF_WEIGHTS_LOCATION c4 --sparsity 0.5

To save the sparsified model, specify a path for the checkpoint via the --save flag.
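For example, a minimal sketch of loading a saved sparsified model afterwards, assuming --save writes a standard HuggingFace checkpoint directory (the paths below are placeholders):

from transformers import AutoModelForCausalLM, AutoTokenizer

# "sparse-opt-125m" stands in for the path that was passed to --save
model = AutoModelForCausalLM.from_pretrained("sparse-opt-125m", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")  # tokenizer of the base model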

One can optionally log evaluation results to W&B with --log_wandb.

Demo

One can try SparseGPT via the Colab demo: demo.ipynb.

Cite

If you found this work useful, please consider citing:

@article{frantar-sparsegpt,
  title={{SparseGPT}: Massive Language Models Can Be Accurately Pruned in One-Shot}, 
  author={Elias Frantar and Dan Alistarh},
  year={2023},
  journal={arXiv preprint arXiv:2301.00774}
}

sparsegpt's Issues

transformers version is not correct

Running the code fails with an error because it cannot import LlamaTokenizer.

File "/local/home/.../SparseGPT/datautils.py", line 6, in <module>
    from transformers import AutoTokenizer, LlamaTokenizer
ImportError: cannot import name 'LlamaTokenizer' from 'transformers' 

If you update to the latest version, you run into a different issue:

OSError: Unable to load weights from pytorch checkpoint file for '/home/.../.cache/huggingface/hub/models--facebook--opt-125m/snapshots/27dcfa74d334bc871f3234de431e71c6eeba5dd6/pytorch_model.bin' at '/home/.../.cache/huggingface/hub/models--facebook--opt-125m/snapshots/27dcfa74d334bc871f3234de431e71c6eeba5dd6/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

The best workaround is to use transformers v4.21.2 and remove the LlamaTokenizer from line 6 of datautils.py.
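For example, assuming a pip-based environment and the versions listed under Dependencies:

pip install transformers==4.21.2 datasets==1.17.0
# then edit line 6 of datautils.py to drop LlamaTokenizer:
# from transformers import AutoTokenizer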

Why can the Hessian be obtained from activations ($H = XX^T$)?

I don't really understand: isn't the Hessian matrix a second-order derivative matrix? How can it be obtained by multiplying the activation matrix by its transpose ($H = XX^T$)?
Can you give me some more detailed instructions?
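For context, a hedged restatement of the standard layer-wise argument (my own summary, not taken from the repo): the pruning objective is a quadratic reconstruction error of the layer output, and the Hessian of a quadratic function of the weights is the input Gram matrix up to a constant factor. For one weight row $w$ with calibration inputs $X$ and original weights $w_0$:

$$E(w) = \lVert wX - w_0 X \rVert_2^2, \qquad \nabla_w^2 E(w) = 2\,XX^T \;\propto\; H = XX^T.$$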

Pruning log files

Hi, is there any chance you release the log files of your pruning experiments?
I'm trying to work with the code and wanted to make sure that the intermediate error values make sense.
Thanks in advance!

Out of memory issue

Hi, I am encountering OOM issues with opt-13b on a single A100 80GB GPU! Do you know of any workaround to fix the issue?

Gaussian elimination

I saw the "Hessian Synchronization" section in the paper, but it is not reflected in the code. Instead, it directly takes Hinv1 = Hinv[i1:i2, i1:i2]. What is the reason behind this?

how to use for Baichuan?

When I try this repo on Baichuan, I get an error:
Some weights of the model checkpoint at baichuan2_v1_5_s5_zh/ were not used when initializing LlamaForCausalLM
What should I do?

OOM: cannot load opt-30b, opt-66b

Hi, the code runs well for opt-2.7b and opt-13b, but when I run it for opt-30b and opt-66b on an A100 GPU with 80GB memory, it runs out of memory while loading the checkpoint shards.

Downloading shards: 100%|██████████| 7/7 [13:28<00:00, 93.22s/it]
Downloading shards: 100%|██████████| 7/7 [13:28<00:00, 115.44s/it]

Loading checkpoint shards: 0%| | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards: 14%|█▍ | 1/7 [00:03<00:20, 3.48s/it]
Loading checkpoint shards: 29%|██▊ | 2/7 [00:06<00:17, 3.46s/it]
Loading checkpoint shards: 43%|████▎ | 3/7 [00:10<00:13, 3.47s/it]
Loading checkpoint shards: 57%|█████▋ | 4/7 [00:13<00:10, 3.47s/it]
Loading checkpoint shards: 71%|███████▏ | 5/7 [00:17<00:06, 3.47s/it]slurmstepd: error: Detected 1 oom-kill event(s) in StepId=48677022.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: spartan-gpgpu148: task 0: Out Of Memory
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=48677022.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
Jul 2 12:14:40 spartan-gpgpu148 kernel: Task in /slurm/uid_14236/job_48677022/step_0 killed as a result of limit of /slurm/uid_14236/job_48677022
Jul 2 12:14:40 spartan-gpgpu148 kernel: Memory cgroup stats for /slurm/uid_14236/job_48677022: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Jul 2 12:14:40 spartan-gpgpu148 kernel: Memory cgroup stats for /slurm/uid_14236/job_48677022/step_extern: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB

Using llama.py silently fails and occasionally causes system instability

Silent Failure:

I'm attempting to use llama.py to reduce the size of the Airochronos L2 13B model.

sparsegpt on  master via 🐍 v3.10.12
❯ python ./llama.py kingbri/airochronos-l2-13B c4 --sparsity 0.5 --save ./airochronos-l2-13B-sparse --wbits 4
config.json: 100%|████████████████████████████████████████████████████████████████████████████| 640/640 [00:00<?, ?B/s]
pytorch_model.bin.index.json: 100%|███████████████████████████████████████████████████████| 29.9k/29.9k [00:00<?, ?B/s]
pytorch_model-00001-of-00003.bin: 100%|███████████████████████████████████████████| 12.9G/12.9G [05:56<00:00, 36.2MB/s]
pytorch_model-00002-of-00003.bin: 100%|███████████████████████████████████████████| 12.8G/12.8G [10:34<00:00, 20.2MB/s]
pytorch_model-00003-of-00003.bin: 100%|█████████████████████████████████████████████| 328M/328M [00:08<00:00, 40.3MB/s]
Downloading shards: 100%|███████████████████████████████████████████████████████████████| 3/3 [16:42<00:00, 334.24s/it]
Loading checkpoint shards:   0%|                                                                 | 0/3 [00:00<?, ?it/s]

Then it just quits without any error message. The same thing occurs without the --wbits 4 flag.

I am able to use opt.py perfectly well, and Task Manager shows that nothing is being maxed out, so it can't be a hardware limitation:
(screenshot of Task Manager usage graphs)

System instability:

When I was running the command mentioned above, trying to capture the image of the Task Manager usage graphs, an Electron app restarted itself and Task Manager displayed some black artifacts and closed. This is not a regular occurrence for this machine, and it only happened when I ran that command. On another occasion, Firefox froze for a few seconds while llama.py was attempting to run.
Unfortunately, I have been unable to reproduce either of these behaviours. Even so, I feel it's worth mentioning.

Different error between OBS and SparseGPT

Following OBS, we want to remove the weights with minimum error $w_m^2 / [H^{-1}]_{mm}$. But in the SparseGPT algorithm, $w_m^2 / [H^{-1}]_{mm}^2$ is used instead.

I'm not sure whether using $[H^{-1}]_{mm}$ and $[H^{-1}]_{mm}^2$ is equivalent, and I don't see $w_m^2 / [H^{-1}]_{mm}^2$ anywhere else in the paper. Am I missing something about this difference?

Below are the occurrences of this issue in the paper and code.

(screenshots comparing the OBS and SparseGPT weight-selection expressions from the paper)

tmp = W1 ** 2 / (torch.diag(Hinv1).reshape((1, -1))) ** 2
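For reference, the two selection metrics being compared, restated side by side (my notation; the second is what the code line above computes via the squared diagonal of Hinv1):

$$\varepsilon_m^{\mathrm{OBS}} = \frac{w_m^2}{[H^{-1}]_{mm}}, \qquad \varepsilon_m^{\mathrm{code}} = \frac{w_m^2}{\big([H^{-1}]_{mm}\big)^2}.$$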

AWQ alongside sparsegpt

Hi, I'm wondering whether it would be possible to use AWQ together with SparseGPT. I tried to make an AWQ model work with SparseGPT by finding the awq.modules.linear.gemv.WQLinear_GEMV layers of the model, but I'm still blocked at the add_batch step with this error.

Traceback (most recent call last):
File "C:\Users\mjarnier\travail2\sparse\sparsegpt-master\opt.py", line 320, in
opt_sequential(model, dataloader, DEV)
File "C:\Users\mjarnier\AppData\Local\anaconda3\envs\nouvel_env\Lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\travail2\sparse\sparsegpt-master\opt.py", line 101, in opt_sequential
outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\AppData\Local\anaconda3\envs\nouvel_env\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\AppData\Local\anaconda3\envs\nouvel_env\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\AppData\Local\anaconda3\envs\nouvel_env\Lib\site-packages\transformers\models\opt\modeling_opt.py", line 552, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\AppData\Local\anaconda3\envs\nouvel_env\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\AppData\Local\anaconda3\envs\nouvel_env\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\AppData\Local\anaconda3\envs\nouvel_env\Lib\site-packages\transformers\models\opt\modeling_opt.py", line 182, in forward
query_states = self.q_proj(hidden_states) * self.scaling
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\AppData\Local\anaconda3\envs\nouvel_env\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\AppData\Local\anaconda3\envs\nouvel_env\Lib\site-packages\torch\nn\modules\module.py", line 1574, in _call_impl
hook_result = hook(self, args, result)
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\travail2\sparse\sparsegpt-master\opt.py", line 95, in tmp
gpts[name].add_batch(inp[0].data, out.data)
File "C:\Users\mjarnier\travail2\sparse\sparsegpt-master\sparsegpt.py", line 52, in add_batch
self.H += inp.matmul(inp.t())
RuntimeError: The size of tensor a (96) must match the size of tensor b (768) at non-singleton dimension 1

I just want to know whether it's possible, or if someone has an idea?

finetuning sparsified LLaMa

Hello. First of all, thank you for sharing your great research.

I am trying to fine-tune the LLaMA-7B model on a specific task after sparsification.

Is it possible to simply load the model from the path saved via the --save option?

Thank you.

Hessian Inverse

Hello,

SparseGPT is really an amazing job.

I am wondering why, in https://github.com/IST-DASLab/sparsegpt/blob/master/sparsegpt.py#L77, a Cholesky decomposition is performed on the inverted Hessian matrix.

Thank you.

Adaptation for Pruning Conv2d or Conv3d Layers?

How would I proceed to adapt the add_batch function to make pruning possible on a Conv layer? Am I missing something here?

Any suggestions are greatly appreciated. Thanks in advance.

Dependencies are wrong

Hello, I have tried lots of different version combinations to make the LLaMA script work. It produces very bad results, which is
also what I observed with my own implementation and some other implementations of SparseGPT for LLaMA.

All three of these implementations produce exactly the same results, which is good, as it suggests we are doing everything correctly.
But the performance is then incredibly poor for LLaMA; it performs even worse than BLOOM or OPT.

If your results are better, can you please share the exact dependencies needed to repeat your experiments? The transformers
library version given in the README does not even have the LLaMA tokenizer.

Thank you

2:4 sparsity with to_sparse_semi_structured method from pytorch results in memory issue

I am trying to reduce the memory footprint of a 2:4 SparseGPT-pruned LLaMA2 model using the to_sparse_semi_structured method from PyTorch. However, when I apply this to change the way the sparse parameters are stored, I run out of memory. Note that I did not run out of memory for the original dense model.
Below is the code I was running, where model_path is the path to the pruned model.

import torch
import torch.nn as nn
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
from transformers import AutoModelForCausalLM

device = "cuda"  # assumed; not defined in the original snippet

model = AutoModelForCausalLM.from_pretrained(model_path)
model = model.to(device).half()

# Replace every dense Linear weight with a compressed 2:4 semi-structured tensor.
for fqn, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.weight = nn.Parameter(to_sparse_semi_structured(module.weight))

Inference Speedup

Great work!
I am trying to measure the inference speedup. Could you please share the code for inference speedup using 2:4 sparsity on Ampere GPUs? Thanks!

Why is LLaMA pruning difficult?

Hi all, I have seen this repo, which uses your algorithm to prune LLaMA. Pruning LLaMA makes the accuracy very bad. I am wondering if there is any explanation? Have you tried pruning LLaMA yourselves?

Why can't I reproduce the results of the paper?

Hi, I am very confused. I am reproducing opt-1.3b sparsification: I get the same dense-model perplexity of 14.62 on WikiText2, but the perplexity after pruning to 50% sparsity is 26.71, which is higher than the 17.46 reported in the paper. I didn't modify any code in this repo. I wonder if the hyperparameters in the paper's experiments are different?
Looking forward to your reply.


Question about multi-GPU inference

SparseGPT has shown good results on the A100, but for many commodity GPUs, VRAM and compute power are still a big constraint. Is there a need for, and an idea of how to do, multi-GPU inference?

How should I verify the speedup effect of the algorithm?

As shown in the paper, the CUTLASS library is used for speedup, but I did not find any code relying on it. How should I verify that SparseGPT is faster than dense models at inference? Even if end-to-end speedups are slightly lower, that would be fine. Thanks a lot for your excellent work!
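For what it's worth, a minimal sketch (not from this repo) of measuring the raw 2:4 matmul speedup with PyTorch's built-in semi-structured sparsity support, assuming PyTorch >= 2.1 and an Ampere or newer GPU; the layer sizes are illustrative:

import torch
import torch.utils.benchmark as benchmark
from torch.sparse import to_sparse_semi_structured

torch.manual_seed(0)
linear = torch.nn.Linear(10240, 10240, bias=False).half().cuda().eval()
x = torch.rand(3072, 10240, dtype=torch.float16, device="cuda")

# Impose a 2:4 pattern: keep the 2 largest-magnitude weights in every group of 4.
w = linear.weight.detach()
idx = w.abs().reshape(-1, 4).topk(2, dim=1).indices
mask = torch.zeros_like(w, dtype=torch.bool)
mask.reshape(-1, 4).scatter_(1, idx, True)
linear.weight = torch.nn.Parameter(w * mask)

def bench_ms():
    t = benchmark.Timer(stmt="linear(x)", globals={"linear": linear, "x": x})
    return t.blocked_autorange().median * 1e3

dense_ms = bench_ms()
# Convert the zero-filled 2:4 weight to the compressed semi-structured format.
linear.weight = torch.nn.Parameter(to_sparse_semi_structured(linear.weight))
sparse_ms = bench_ms()
print(f"dense: {dense_ms:.3f} ms | 2:4 sparse: {sparse_ms:.3f} ms | speedup: {dense_ms / sparse_ms:.2f}x")

End-to-end model speedups will be smaller than this isolated matmul comparison.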

Lack of comments in the code

Hello, thank you very much for the contribution to the community.

You've done impressive work, and lots of people are probably trying to implement your method for their own work.
It is very nice that you share the code, so that we can check the 'not so clear' parts of the paper.

Even though I can see that you've applied many smart tricks, and I can probably imagine the state you were in while writing the code,
it would be very nice if you added some comments to the code. Currently it is like an ancient codex that needs to be cracked;
I will probably crack it faster than waiting for an update, but for people in the future, please consider adding comments.
This code is one of the most cryptic I have seen in the field, which is understandable considering the nature of the problem: creating
something new almost always breaks mainstream frameworks.

Thanks a lot.

AttributeError: 'NoneType' object has no attribute 'shape'

(textgen) [root@pve-m7330 sparsegpt]# python llama.py ../text-generation-webui/models/TinyLlama-1.1B-Chat-v1.0/ wikitext2 --nsamples 10
Token indices sequence length is longer than the specified maximum sequence length for this model (2824491 > 2048). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2824491 > 2048). Running this sequence through the model will result in indexing errors
Dataset: wikitext2
Evaluating ...
0
Traceback (most recent call last):
  File "/home/user/sparsegpt/llama.py", line 335, in <module>
    llama_eval(model, testloader, DEV, dataset, args.log_wandb)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/sparsegpt/llama.py", line 211, in llama_eval
    outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask)[0]
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 739, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 644, in forward
    cos, sin = self.rotary_emb(value_states, position_ids)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 134, in forward
    inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
AttributeError: 'NoneType' object has no attribute 'shape'
(textgen) [root@pve-m7330 sparsegpt]#
