SparseGPT

This repository contains code to reproduce the key results of the paper SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot.

Specifically, it provides scripts and implementations to:

  • Evaluate baseline and pruned models on raw-WikiText2, PTB and C4-subset. (datautils.py, opt.py, bloom.py)
  • Perform unstructured, n:m and sparse + quantized SparseGPT compression on OPT and BLOOM models. (sparsegpt.py, opt.py, bloom.py)

We note that this SparseGPT implementation is based on our open-source GPTQ code.

Dependencies

  • torch: tested on v1.10.1+cu111
  • transformers: tested on v4.21.2
  • datasets: tested on v1.17.0

Usage

Here are some sample commands to run baselines and sparsification on OPT models, followed by perplexity evaluations on raw-WikiText2, PTB and C4. See also the CMD-argument documentation.

# Run dense baseline
python opt.py facebook/opt-125m c4

# Run magnitude baseline
python opt.py facebook/opt-125m c4 --sparsity .5 --gmp

# Prune to 50% uniform sparsity with SparseGPT
python opt.py facebook/opt-125m c4 --sparsity .5

# Prune to full 2:4 sparsity with SparseGPT
python opt.py facebook/opt-125m c4 --prunen 2 --prunem 4

# Prune to 50% + 4-bit with SparseGPT
python opt.py facebook/opt-125m c4 --sparsity .5 --wbits 4

To run on other OPT models, replace "facebook/opt-125m" with the HuggingFace name of the corresponding model. For the 175B model, access must first be requested from Meta and the checkpoint converted to HuggingFace format; its location can then simply be passed as a name to this script.

The BLOOM script bloom.py has a very similar interface; however, some features are currently only available for OPT. For example:

# Sparsify BLOOM-176B with SparseGPT
python bloom.py bigscience/bloom c4 --sparsity .5

We also provide a LLaMA pruning script with the very same interface:

# Sparsify LLaMA with SparseGPT
python llama.py LLAMA_HF_WEIGHTS_LOCATION c4 --sparsity 0.5

To save the sparsified model, specify a path for the checkpoint via the --save flag.
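For example, a minimal sketch of loading a saved sparsified model afterwards, assuming --save writes a standard HuggingFace checkpoint directory (the paths below are placeholders):

from transformers import AutoModelForCausalLM, AutoTokenizer

# "sparse-opt-125m" stands in for the path that was passed to --save
model = AutoModelForCausalLM.from_pretrained("sparse-opt-125m", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")  # tokenizer of the base model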

One can optionally log evaluation results to W&B with --log_wandb.

Demo

One can try SparseGPT via the Colab demo: demo.ipynb.

Cite

If you found this work useful, please consider citing:

@article{frantar-sparsegpt,
  title={{SparseGPT}: Massive Language Models Can Be Accurately Pruned in One-Shot}, 
  author={Elias Frantar and Dan Alistarh},
  year={2023},
  journal={arXiv preprint arXiv:2301.00774}
}

sparsegpt's Issues

transformers version is not correct

Running the code fails with an error because it cannot import LlamaTokenizer.

File "/local/home/.../SparseGPT/datautils.py", line 6, in <module>
    from transformers import AutoTokenizer, LlamaTokenizer
ImportError: cannot import name 'LlamaTokenizer' from 'transformers' 

If you update to the latest version, you run into a different issue:

OSError: Unable to load weights from pytorch checkpoint file for '/home/.../.cache/huggingface/hub/models--facebook--opt-125m/snapshots/27dcfa74d334bc871f3234de431e71c6eeba5dd6/pytorch_model.bin' at '/home/.../.cache/huggingface/hub/models--facebook--opt-125m/snapshots/27dcfa74d334bc871f3234de431e71c6eeba5dd6/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

The best workaround is to use transformers v4.21.2 and remove the LlamaTokenizer from line 6 of datautils.py.
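For example, assuming a pip-based environment and the versions listed under Dependencies:

pip install transformers==4.21.2 datasets==1.17.0
# then edit line 6 of datautils.py to drop LlamaTokenizer:
# from transformers import AutoTokenizer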

Why can the Hessian be obtained from activations ($H = XX^T$)?

I don't really understand: isn't the Hessian matrix a second-order derivative matrix? How can it be obtained by multiplying the activation matrix by its transpose ($H = XX^T$)?
Can you give me some more detailed instructions?
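For context, a hedged restatement of the standard layer-wise argument (my own summary, not taken from the repo): the pruning objective is a quadratic reconstruction error of the layer output, and the Hessian of a quadratic function of the weights is the input Gram matrix up to a constant factor. For one weight row $w$ with calibration inputs $X$ and original weights $w_0$:

$$E(w) = \lVert wX - w_0 X \rVert_2^2, \qquad \nabla_w^2 E(w) = 2\,XX^T \;\propto\; H = XX^T.$$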

Pruning log files

Hi, is there any chance you release the log files of your pruning experiments?
I'm trying to work with the code and wanted to make sure that the intermediate error values make sense.
Thanks in advance!

Out of memory issue

Hi, I am encountering OOM issues with opt-13b on a single A100 80GB GPU! Do you know of any workaround to fix the issue?

Gaussian elimination

I saw the "Hessian Synchronization" section in the paper, but it is not reflected in the code. Instead, it directly takes Hinv1 = Hinv[i1:i2, i1:i2]. What is the reason behind this?

how to use for Baichuan?

When I try this repo on Baichuan, I get an error:
Some weights of the model checkpoint at baichuan2_v1_5_s5_zh/ were not used when initializing LlamaForCausalLM
What should I do?

OOM: cannot load opt-30b, opt-66b

Hi, the code runs well for opt-2.7b and opt-13b, but when I run it for opt-30b and opt-66b on an A100 GPU with 80GB memory, it runs out of memory while loading the checkpoint shards.

Downloading shards: 100%|██████████| 7/7 [13:28<00:00, 93.22s/it]
Downloading shards: 100%|██████████| 7/7 [13:28<00:00, 115.44s/it]

Loading checkpoint shards: 0%| | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards: 14%|█▍ | 1/7 [00:03<00:20, 3.48s/it]
Loading checkpoint shards: 29%|██▊ | 2/7 [00:06<00:17, 3.46s/it]
Loading checkpoint shards: 43%|████▎ | 3/7 [00:10<00:13, 3.47s/it]
Loading checkpoint shards: 57%|█████▋ | 4/7 [00:13<00:10, 3.47s/it]
Loading checkpoint shards: 71%|███████▏ | 5/7 [00:17<00:06, 3.47s/it]slurmstepd: error: Detected 1 oom-kill event(s) in StepId=48677022.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: spartan-gpgpu148: task 0: Out Of Memory
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=48677022.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
Jul 2 12:14:40 spartan-gpgpu148 kernel: Task in /slurm/uid_14236/job_48677022/step_0 killed as a result of limit of /slurm/uid_14236/job_48677022
Jul 2 12:14:40 spartan-gpgpu148 kernel: Memory cgroup stats for /slurm/uid_14236/job_48677022: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Jul 2 12:14:40 spartan-gpgpu148 kernel: Memory cgroup stats for /slurm/uid_14236/job_48677022/step_extern: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB

Using llama.py silently fails and occasionally causes system instability

Silent Failure:

I'm attempting to use llama.py to reduce the size of the Airochronos L2 13B model.

sparsegpt on  master via 🐍 v3.10.12
❯ python ./llama.py kingbri/airochronos-l2-13B c4 --sparsity 0.5 --save ./airochronos-l2-13B-sparse --wbits 4
config.json: 100%|████████████████████████████████████████████████████████████████████████████| 640/640 [00:00<?, ?B/s]
pytorch_model.bin.index.json: 100%|███████████████████████████████████████████████████████| 29.9k/29.9k [00:00<?, ?B/s]
pytorch_model-00001-of-00003.bin: 100%|███████████████████████████████████████████| 12.9G/12.9G [05:56<00:00, 36.2MB/s]
pytorch_model-00002-of-00003.bin: 100%|███████████████████████████████████████████| 12.8G/12.8G [10:34<00:00, 20.2MB/s]
pytorch_model-00003-of-00003.bin: 100%|█████████████████████████████████████████████| 328M/328M [00:08<00:00, 40.3MB/s]
Downloading shards: 100%|███████████████████████████████████████████████████████████████| 3/3 [16:42<00:00, 334.24s/it]
Loading checkpoint shards:   0%|                                                                 | 0/3 [00:00<?, ?it/s]

Then it just quits without any error message. The same thing occurs without the --wbits 4 flag.

I am able to use opt.py perfectly well, and Task Manager shows that nothing is being maxed out, so it can't be a hardware limitation:
(screenshot of Task Manager usage graphs)

System instability:

When I was running the command mentioned above, trying to capture the image of the Task Manager usage graphs, an Electron app restarted itself and Task Manager displayed some black artifacts and closed. This is not a regular occurrence for this machine, and it only happened when I ran that command. On another occasion, Firefox froze for a few seconds while llama.py was attempting to run.
Unfortunately, I have been unable to reproduce either of these behaviours. Even so, I feel it's worth mentioning.

Different error between OBS and SparseGPT

Following OBS, we want to remove the weights with minimum error $w_m^2 / [H^{-1}]_{mm}$. But in the SparseGPT algorithm, $w_m^2 / [H^{-1}]_{mm}^2$ is used instead.

I'm not sure whether using $[H^{-1}]_{mm}$ and $[H^{-1}]_{mm}^2$ is equivalent, and I don't see $w_m^2 / [H^{-1}]_{mm}^2$ anywhere else in the paper. Am I missing something about this difference?

Below are the occurrences of this issue in the paper and code.

(screenshots comparing the OBS and SparseGPT weight-selection expressions from the paper)

tmp = W1 ** 2 / (torch.diag(Hinv1).reshape((1, -1))) ** 2
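For reference, the two selection metrics being compared, restated side by side (my notation; the second is what the code line above computes via the squared diagonal of Hinv1):

$$\varepsilon_m^{\mathrm{OBS}} = \frac{w_m^2}{[H^{-1}]_{mm}}, \qquad \varepsilon_m^{\mathrm{code}} = \frac{w_m^2}{\big([H^{-1}]_{mm}\big)^2}.$$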

AWQ alongside sparsegpt

Hi, I'm wondering whether it would be possible to use AWQ together with SparseGPT. I tried to make an AWQ model work with SparseGPT by finding the awq.modules.linear.gemv.WQLinear_GEMV layers of the model, but I'm still blocked at the add_batch step with this error.

Traceback (most recent call last):
File "C:\Users\mjarnier\travail2\sparse\sparsegpt-master\opt.py", line 320, in
opt_sequential(model, dataloader, DEV)
File "C:\Users\mjarnier\AppData\Local\anaconda3\envs\nouvel_env\Lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\travail2\sparse\sparsegpt-master\opt.py", line 101, in opt_sequential
outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\AppData\Local\anaconda3\envs\nouvel_env\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\AppData\Local\anaconda3\envs\nouvel_env\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\AppData\Local\anaconda3\envs\nouvel_env\Lib\site-packages\transformers\models\opt\modeling_opt.py", line 552, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\AppData\Local\anaconda3\envs\nouvel_env\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\AppData\Local\anaconda3\envs\nouvel_env\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\AppData\Local\anaconda3\envs\nouvel_env\Lib\site-packages\transformers\models\opt\modeling_opt.py", line 182, in forward
query_states = self.q_proj(hidden_states) * self.scaling
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\AppData\Local\anaconda3\envs\nouvel_env\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\AppData\Local\anaconda3\envs\nouvel_env\Lib\site-packages\torch\nn\modules\module.py", line 1574, in _call_impl
hook_result = hook(self, args, result)
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\mjarnier\travail2\sparse\sparsegpt-master\opt.py", line 95, in tmp
gpts[name].add_batch(inp[0].data, out.data)
File "C:\Users\mjarnier\travail2\sparse\sparsegpt-master\sparsegpt.py", line 52, in add_batch
self.H += inp.matmul(inp.t())
RuntimeError: The size of tensor a (96) must match the size of tensor b (768) at non-singleton dimension 1

I just want to know whether it's possible, or if someone has an idea?

finetuning sparsified LLaMa

Hello. First of all, thank you for sharing your great research.

I am trying to fine-tune the LLaMA-7B model on a specific task after sparsification.

Is it possible to simply load the model from the path saved via the --save option?

Thank you.

Hessian Inverse

Hello,

SparseGPT is really an amazing job.

I am wondering why, in https://github.com/IST-DASLab/sparsegpt/blob/master/sparsegpt.py#L77, a Cholesky decomposition is performed on the inverted Hessian matrix.

Thank you.

Adaptation for Pruning Conv2d or Conv3d Layers?

How would I proceed to adapt the add_batch function to make pruning possible on a Conv layer? Am I missing something here?

Any suggestions are greatly appreciated. Thanks in advance.

Dependencies are wrong

Hello, I have tried lots of different version combinations to make the LLaMA script work. It produces very bad results, which is
also what I observed with my own implementation and some other implementations of SparseGPT for LLaMA.

All three of these implementations produce exactly the same results, which is good, as it suggests we are doing everything correctly.
But the performance is then incredibly poor for LLaMA; it performs even worse than BLOOM or OPT.

If your results are better, can you please share the exact dependencies needed to repeat your experiments? The transformers
library version given in the README does not even have the LLaMA tokenizer.

Thank you

2:4 sparsity with to_sparse_semi_structured method from pytorch results in memory issue

I am trying to reduce the memory footprint of a 2:4 SparseGPT-pruned LLaMA2 model using the to_sparse_semi_structured method from PyTorch. However, when I apply this to change the way the sparse parameters are stored, I run out of memory. Note that I did not run out of memory for the original dense model.
Below is the code I was running, where model_path is the path to the pruned model.

import torch
import torch.nn as nn
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
from transformers import AutoModelForCausalLM

device = "cuda"  # assumed; not defined in the original snippet

model = AutoModelForCausalLM.from_pretrained(model_path)
model = model.to(device).half()

# Replace every dense Linear weight with a compressed 2:4 semi-structured tensor.
for fqn, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.weight = nn.Parameter(to_sparse_semi_structured(module.weight))

Inference Speedup

Great work!
I am trying to measure the inference speedup. Could you please share the code for inference speedup using 2:4 sparsity on Ampere GPUs? Thanks!

Why is LLaMA pruning difficult?

Hi all, I have seen this repo, which uses your algorithm to prune LLaMA. Pruning LLaMA makes the accuracy very bad. I am wondering if there is any explanation? Have you tried pruning LLaMA yourselves?

Why can't I reproduce the results of the paper?

Hi, I am very confused. I am reproducing opt-1.3b sparsification: I get the same dense-model perplexity of 14.62 on WikiText2, but the perplexity after pruning to 50% sparsity is 26.71, which is higher than the 17.46 reported in the paper. I didn't modify any code in this repo. I wonder if the hyperparameters in the paper's experiments are different?
Looking forward to your reply.


Question about multi-GPU inference

SparseGPT has shown good results on the A100, but for many commodity GPUs, VRAM and compute power are still a big constraint. Is there a need for, and an idea of how to do, multi-GPU inference?

How should I verify the speedup effect of the algorithm?

As shown in the paper, the CUTLASS library is used for speedup, but I did not find any code relying on it. How should I verify that SparseGPT is faster than dense models at inference? Even if end-to-end speedups are slightly lower, that would be fine. Thanks a lot for your excellent work!
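For what it's worth, a minimal sketch (not from this repo) of measuring the raw 2:4 matmul speedup with PyTorch's built-in semi-structured sparsity support, assuming PyTorch >= 2.1 and an Ampere or newer GPU; the layer sizes are illustrative:

import torch
import torch.utils.benchmark as benchmark
from torch.sparse import to_sparse_semi_structured

torch.manual_seed(0)
linear = torch.nn.Linear(10240, 10240, bias=False).half().cuda().eval()
x = torch.rand(3072, 10240, dtype=torch.float16, device="cuda")

# Impose a 2:4 pattern: keep the 2 largest-magnitude weights in every group of 4.
w = linear.weight.detach()
idx = w.abs().reshape(-1, 4).topk(2, dim=1).indices
mask = torch.zeros_like(w, dtype=torch.bool)
mask.reshape(-1, 4).scatter_(1, idx, True)
linear.weight = torch.nn.Parameter(w * mask)

def bench_ms():
    t = benchmark.Timer(stmt="linear(x)", globals={"linear": linear, "x": x})
    return t.blocked_autorange().median * 1e3

dense_ms = bench_ms()
# Convert the zero-filled 2:4 weight to the compressed semi-structured format.
linear.weight = torch.nn.Parameter(to_sparse_semi_structured(linear.weight))
sparse_ms = bench_ms()
print(f"dense: {dense_ms:.3f} ms | 2:4 sparse: {sparse_ms:.3f} ms | speedup: {dense_ms / sparse_ms:.2f}x")

End-to-end model speedups will be smaller than this isolated matmul comparison.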

Lack of comments in the code

Hello, thank you very much for the contribution to the community.

You've done impressive work, and lots of people are probably trying to implement your method for their own work.
It is very nice that you share the code, so that we can check the 'not so clear' parts of the paper.

Even though I can see that you've applied many smart tricks, and I can probably imagine the state you were in while writing the code,
it would be very nice if you added some comments to the code. Currently it is like an ancient codex that needs to be cracked;
I will probably crack it faster than waiting for an update, but for people in the future, please consider adding comments.
This code is one of the most cryptic I have seen in the field, which is understandable considering the nature of the problem: creating
something new almost always breaks mainstream frameworks.

Thanks a lot.

AttributeError: 'NoneType' object has no attribute 'shape'

(textgen) [root@pve-m7330 sparsegpt]# python llama.py ../text-generation-webui/models/TinyLlama-1.1B-Chat-v1.0/ wikitext2 --nsamples 10
Token indices sequence length is longer than the specified maximum sequence length for this model (2824491 > 2048). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2824491 > 2048). Running this sequence through the model will result in indexing errors
Dataset: wikitext2
Evaluating ...
0
Traceback (most recent call last):
  File "/home/user/sparsegpt/llama.py", line 335, in <module>
    llama_eval(model, testloader, DEV, dataset, args.log_wandb)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/sparsegpt/llama.py", line 211, in llama_eval
    outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask)[0]
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 739, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 644, in forward
    cos, sin = self.rotary_emb(value_states, position_ids)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 134, in forward
    inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
AttributeError: 'NoneType' object has no attribute 'shape'
(textgen) [root@pve-m7330 sparsegpt]#
