
Comments (10)

Godofnothing commented on August 18, 2024

@18140663659 this is a research repository for benchmarking and evaluating the efficacy of the pruning method.
We could, in principle, add a demo Colab (for smaller models) with generations from the sparse model.


18140663659 commented on August 18, 2024

@18140663659 this is a research repository for benchmarking and evaluating the efficacy of the pruning method.
We could, in principle, add a demo Colab (for smaller models) with generations from the sparse model.

Thank you for your reply. If you can add this demo Colab, it would be very helpful for me!
I would like to ask: if the weights are only set to 0 and the storage format is not changed, the model size should not decrease. Do you recommend any tools that support sparse inference in deployment? For example, DeepSparse?


Godofnothing commented on August 18, 2024

@18140663659 OK, I'll try to add a demo. Vanilla PyTorch cannot exploit the sparsity; as you said, the memory footprint and compute stay the same.

DeepSparse is a great tool for model compression and acceleration on CPU. In a recent blog post they claim to show some speedups with the OPT-2.7b model.
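For illustration, a minimal sketch of running an already-exported sparse model through DeepSparse's pipeline API. The model path is a placeholder, and the task name and output format are assumptions that depend on the installed deepsparse version:

```python
from deepsparse import Pipeline

# Placeholder path: assumes the pruned model was already exported to ONNX.
# The "text-generation" task is only available in recent deepsparse releases.
pipe = Pipeline.create(task="text-generation", model_path="./opt-2.7b-50sparse")
print(pipe("The quick brown fox"))
```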


18140663659 commented on August 18, 2024

@18140663659 OK, I'll try to add a demo. Vanilla PyTorch cannot exploit the sparsity; as you said, the memory footprint and compute stay the same.

DeepSparse is a great tool for model compression and acceleration on CPU. In a recent blog post they claim to show some speedups with the OPT-2.7b model.

Thank you for your answer. I would also like to ask: if I want to save the SparseGPT pruned-and-quantized model with its size actually reduced (e.g. 14 GB (7B) -> ~7 GB (7B + 50% sparse)) and run inference on it, what should I do? Is there a recommended toolchain for this?


Godofnothing commented on August 18, 2024

@18140663659 I've added a demo with a use case.
Concerning saving the SparseGPT model: we do not provide an option for saving the pruned + quantized model.
For quantization, one can use the code from the GPTQ repository.
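(For reference, the GPTQ repository exposes entry points analogous to this repo's; an invocation along the lines of python opt.py facebook/opt-1.3b c4 --wbits 4 --save opt-1.3b-4bit.pt should produce a quantized checkpoint, though the exact flags there may differ by version.)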


efrantar commented on August 18, 2024

See also my comment here for references to some other libraries for actually exploiting sparse models in practice.


xiao1228 commented on August 18, 2024

Hi @Godofnothing, I ran SparseGPT and saved the sparse model, then quantized the sparse model with GPTQ, and afterwards all the sparsity was gone. Is there another way of doing it? Thank you!


Godofnothing commented on August 18, 2024

Hi, @xiao1228. Note that when GPTQ quantizes a weight, it updates the remaining (not-yet-quantized) weights along the input dimension of the same row to compensate for the quantization error.
Unless you explicitly prevent the pruned weights from changing (the GPTQ implementation is not aware of the sparse weights), they can be overwritten by these updates. I would propose two solutions to prevent such an outcome:

  • You can merge the SparseGPT and GPTQ implementations and prune a fraction of the weights (say 50%) in the inner loop via SparseGPT, then process the remaining weights via GPTQ.
  • You can run the SparseGPT procedure first, save the masks (the locations of the zero weights), and then explicitly impose the sparsity in GPTQ (prevent these weights from being updated); a minimal sketch of this follows below.
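A sketch of the second option, assuming a simplified (unblocked) GPTQ column loop. W, Hinv, mask, and quantize are placeholders rather than names from either codebase; the real implementation works block-wise with a Cholesky factor of the inverse Hessian:

```python
import torch

def gptq_quantize_masked(W, Hinv, mask, quantize):
    # W        : (rows, cols) weight matrix, already zeroed by SparseGPT
    # Hinv     : (cols, cols) inverse Hessian used for error compensation
    # mask     : (rows, cols) bool, True where SparseGPT pruned a weight
    # quantize : callable rounding a column of weights to the quantization grid
    rows, cols = W.shape
    for j in range(cols):
        w = W[:, j].clone()
        q = quantize(w)
        q[mask[:, j]] = 0.0  # pruned weights quantize to exact zero
        err = (w - q) / Hinv[j, j]
        W[:, j] = q
        if j + 1 < cols:
            # propagate the quantization error to not-yet-quantized columns
            W[:, j + 1:] -= err.unsqueeze(1) * Hinv[j, j + 1:].unsqueeze(0)
            # re-impose the mask so compensation never revives pruned weights
            W[:, j + 1:][mask[:, j + 1:]] = 0.0
    return W
```

Re-imposing the mask after each update step simply discards any compensation that lands on pruned coordinates, which is why merging the two loops (the first option) can preserve accuracy better.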


efrantar commented on August 18, 2024

Note that sparse + quant, as discussed in the paper, is actually implemented in this repository as well (see gptq.py). You can test it via the --wbits option of opt.py. However, there is currently no code for exporting or running such a sparse + quantized model in compressed form (only in simulated sparse + quantized mode via FP16 weights).
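For example, mirroring the llama.py command quoted later in this thread, something like python opt.py facebook/opt-125m c4 --sparsity 0.5 --wbits 4 should run joint 50% pruning + 4-bit quantization in this simulated mode (exact defaults may differ).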


xiao1228 commented on August 18, 2024

Thank you @efrantar, I tried that option but didn't get a very good PPL for 50% sparse + 4-bit:

  • for GPTQ 4-bit on its own, PPL is 5.78 on wikitext2 (baseline 5.68)
  • for SparseGPT 50% sparse on its own, PPL is 7.21 on wikitext2

For 50% sparse + 4-bit, PPL is 14.54 on wikitext2 after using python llama.py ./llama-7b/ c4 --sparsity 0.5 --wbits 4 --save ./llama_pth_7B_50sparse_4bits
The saved model is then in simulated quantized mode via FP16 weights, with 50% sparsity in it, right?

Following option 2 suggested by @Godofnothing, I generated a 50% sparse model with SparseGPT and then tried to apply the mask in this function in GPTQ-for-LLaMa and run GPTQ: https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/triton/quant/quantizer.py#L28

Before evaluation I did a check for sparsity: there is sparsity in the layers, but not every layer is 50% sparse, so maybe there are some other operations inside GPTQ that I missed.
However, when I try to export the model, it calls llama_pack (https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/triton/llama.py#L265), and after loading the packed model back, all the zeros are gone...
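For reference, a sparsity check of this kind can be as simple as the sketch below (assuming a standard PyTorch model with nn.Linear layers; packed quantized layers would need their weights dequantized first):

```python
import torch

def report_layer_sparsity(model):
    # Fraction of exactly-zero weights per linear layer, e.g. to verify
    # that the 50% SparseGPT mask survived quantization and packing.
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.data
            print(f"{name}: {(w == 0).float().mean().item():.1%} zeros")
```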

Thank you for the help!

