How should I verify the speedup effect of the algorithm?

ist-daslab commented on August 18, 2024

How should I verify the speedup effect of the algorithm?

Comments (4)

efrantar commented on August 18, 2024

Hi, SparseGPT itself is only concerned with accurately sparsifying a model; acceleration comes from other software or hardware that can exploit sparse models (such as 2:4 sparsity on Ampere GPUs). Our layer-wise 2:4 speedup measurements were produced directly with the prebuilt kernels available in NVIDIA's CUTLASS profiler. We compiled all the available kernels and then ran a benchmark sweep with this profiler (on an A100 GPU) for FP16/FP16 SpGEMMs of the appropriate matrix shapes. The results of this sweep are the numbers we report. Observing those speedups during full inference will require integrating the corresponding CUTLASS kernels into PyTorch. (Though I think PyTorch is actually working on an official NVIDIA 2:4 integration, so hopefully running 2:4 models will be quite easy very soon.)
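For readers unfamiliar with the 2:4 pattern mentioned above: it allows at most two nonzero values in every contiguous group of four weights. The sketch below only illustrates that constraint. It picks the weights to drop by magnitude, which is a simple stand-in and not SparseGPT's selection rule, and the helper names `prune_2_4` and `is_2_4_sparse` are hypothetical.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude entries in every group of four.

    Magnitude-based selection is an assumption made for illustration;
    SparseGPT chooses which weights to drop differently.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = list(row)
    for start in range(0, len(row), 4):
        group = range(start, start + 4)
        # Indices of the two entries with the smallest absolute value.
        drop = sorted(group, key=lambda i: abs(row[i]))[:2]
        for i in drop:
            pruned[i] = 0.0
    return pruned


def is_2_4_sparse(row):
    """True if every contiguous group of four has at most two nonzeros."""
    if len(row) % 4 != 0:
        return False
    return all(
        sum(1 for v in row[start:start + 4] if v != 0) <= 2
        for start in range(0, len(row), 4)
    )


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
pruned = prune_2_4(weights)
print(pruned)                  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
print(is_2_4_sparse(pruned))   # True
print(is_2_4_sparse(weights))  # False
```

A check like `is_2_4_sparse` can be useful before handing weights to any 2:4 kernel, since those kernels expect the pattern to hold for every group.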

moonlightian commented on August 18, 2024

Thank you for your kind reply~

moonlightian commented on August 18, 2024

@efrantar Hi, following your explanation, I set up an environment for NVIDIA's CUTLASS profiler and compiled the kernels following the official guide. As for "Observing those speedups during full inference will require integrating the corresponding CUTLASS kernels into PyTorch" mentioned above, I'm unsure how to make that work. Would it be possible for you to share some code for speedup testing? Links to a related NVIDIA demo would be fine too. Thanks again.
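A minimal sketch of what such a speedup test could look like from inside PyTorch, offered as an illustration rather than the setup used for the numbers reported above (those came from the CUTLASS profiler). It assumes a recent PyTorch build that ships `torch.sparse.to_sparse_semi_structured` and an Ampere or newer GPU; the helper names `median_seconds` and `bench_2to4_matmul` and the default matrix sizes are hypothetical. It times a single FP16 matmul, so it says nothing about end-to-end inference.

```python
import statistics
import time


def median_seconds(fn, warmup=3, iters=10, sync=None):
    """Median wall-clock time of fn() over `iters` runs, after `warmup` runs.

    `sync` is an optional callable (e.g. torch.cuda.synchronize) that is
    invoked so asynchronous GPU work is finished before the clock is read.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def bench_2to4_matmul(m=4096, k=4096, n=4096):
    """Return dense_time / sparse_time for one FP16 matmul (needs a CUDA GPU)."""
    # Imported lazily so the timing helper above works without PyTorch.
    import torch
    from torch.sparse import to_sparse_semi_structured

    # Fixed mask keeping the last two of every four weights: a valid 2:4
    # pattern chosen only for illustration, not a SparseGPT-pruned layer.
    mask = torch.tensor([0.0, 0.0, 1.0, 1.0]).tile((m, k // 4)).half().cuda()
    w = torch.randn(m, k, dtype=torch.float16, device="cuda") * mask
    x = torch.randn(k, n, dtype=torch.float16, device="cuda")
    w_sparse = to_sparse_semi_structured(w)

    dense_s = median_seconds(lambda: torch.mm(w, x), sync=torch.cuda.synchronize)
    sparse_s = median_seconds(lambda: torch.mm(w_sparse, x), sync=torch.cuda.synchronize)
    return dense_s / sparse_s
```

Calling `bench_2to4_matmul()` on a supported GPU returns the measured ratio for that one shape; sweeping `m`, `k`, and `n` over the layer shapes of interest would be closer in spirit to the layer-wise measurements described above.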

kiucho commented on August 18, 2024

Hi, I'm someone who wants to validate the speedup of 2:4 sparsified models over dense models.
As I understand it, to properly use SpMM (sparse matrix times dense matrix multiplication) on NVIDIA's Ampere architecture GPUs (like the A6000 or A100), the cuSPARSELt library has to be integrated into PyTorch, which I think they are working on (cuSPARSELt Integration).
I have a few questions about this.

Does SparseGPT use the CUTLASS library only for speedup measurement, or does it also use it in place of cuSPARSELt to do SpMM?
Finally, integrating the profiler into PyTorch seems to be a complex task that requires a deep understanding of both the PyTorch framework and the profiler. I would also be grateful if I could get the profiler and the code used for the speedup measurements.

I look forward to hearing from you. Thank you.
