
llm-awq's Introduction

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

[Paper][Slides][Video]

Efficient and accurate low-bit weight quantization (INT3/4) for LLMs, supporting instruction-tuned models and multi-modal LMs.

overview

The current release supports:

  • AWQ search for accurate quantization.
  • Pre-computed AWQ model zoo for LLMs (Llama-1/2/3, OPT, CodeLlama, StarCoder, Vicuna, VILA, LLaVA; load to generate quantized weights).
  • Memory-efficient 4-bit Linear in PyTorch.
  • Efficient CUDA kernel implementation for fast inference (supporting both the context and decoding stages).
  • Examples on 4-bit inference of an instruction-tuned model (Vicuna) and multi-modal LM (VILA).

Thanks to AWQ, TinyChat can deliver more efficient responses with LLM/VLM chatbots through 4-bit inference.

  • TinyChat on RTX 4090 (3.4x faster than FP16):

TinyChat on RTX 4090: W4A16 is 3.4x faster than FP16

  • TinyChat on Jetson Orin (3.2x faster than FP16):

TinyChat on Orin: W4A16 is 3.2x faster than FP16

TinyChat also supports inference with vision language models (e.g., VILA, LLaVA). In the following examples, W4A16 quantized models from the VILA family are launched with TinyChat.

  • TinyChat with VILA-13B on RTX 4090 (multi-image inputs supported):

TinyChat with VILA on 4090

  • TinyChat with VILA-7B/13B on Jetson Orin:

TinyChat with VILA on Orin

Check out TinyChat, which offers a turn-key solution for on-device inference of LLMs and VLMs on resource-constrained edge platforms. With TinyChat, it is now possible to efficiently run large models on small, low-power devices even without an Internet connection!

News

  • [2024/05] πŸ”₯ The VILA-1.5 model family, which features video understanding, is now supported in AWQ and TinyChat. Check out our online demo powered by TinyChat here. An example is here.
  • [2024/04] πŸ”₯ We released AWQ and TinyChat support for the Llama-3 model family! Check out our example here.
  • [2024/03] πŸ”₯ AWQ has been widely adopted by industry, including NVIDIA, Google, Amazon, and Intel!
  • [2024/02] πŸ”₯ AWQ has been accepted to MLSys 2024!
  • [2024/02] πŸ”₯ We supported VILA Vision Language Models in AWQ & TinyChat! Check out our latest demos with multi-image inputs!
  • [2024/02] πŸ”₯ We released a new version of the quantized GEMM/GEMV kernels in TinyChat, reaching 38 tokens/second inference speed on NVIDIA Jetson Orin!
  • [2023/11] πŸ”₯ We added AWQ support and pre-computed search results for CodeLlama, StarCoder, and StableCode models. Check out our model zoo here!
  • [2023/11] πŸ”₯ AWQ is now integrated natively in Hugging Face transformers through from_pretrained. You can either load quantized models from the Hub or your own HF quantized models (see the example after this list).
  • [2023/10] AWQ is integrated into NVIDIA TensorRT-LLM.
  • [2023/09] AWQ is integrated into FastChat, vLLM, HuggingFace TGI, and LMDeploy.
  • [2023/09] ⚑ Check out our latest TinyChat, which is ~2x faster than the first release on Orin!
  • [2023/09] ⚑ Check out AutoAWQ, a third-party implementation that makes AWQ easier to extend to new models, improves inference speed, and integrates with Hugging Face.
  • [2023/07] πŸ”₯ We released TinyChat, an efficient and lightweight chatbot interface based on AWQ. TinyChat enables efficient LLM inference on both cloud and edge GPUs. Llama-2-chat models are supported! Check out our implementation here.
  • [2023/07] πŸ”₯ We added AWQ support and pre-computed search results for Llama-2 models (7B & 13B). Check out our model zoo here!
  • [2023/07] We extended the support for more LLM models including MPT, Falcon, and BLOOM.
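
As a quick illustration of the Hugging Face integration mentioned above, an AWQ checkpoint from the Hub can be loaded directly with from_pretrained. This is a minimal sketch, not an official example: the model id below is an example community checkpoint, and it assumes the autoawq package and a recent transformers release (around v4.35 or later) plus accelerate are installed.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-Chat-AWQ"  # example community AWQ checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "What is activation-aware weight quantization?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))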

Install

  1. Clone this repository and navigate to AWQ folder
git clone https://github.com/mit-han-lab/llm-awq
cd llm-awq
  2. Install Package
conda create -n awq python=3.10 -y
conda activate awq
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  • For edge devices like Orin, before running the commands above, please:

    1. Modify pyproject.toml by commenting out this line.
    2. Set this line to transformers==4.32.0.
    3. Manually install precompiled PyTorch binaries (>=2.0.0) from NVIDIA.
    4. Set the appropriate Python version for conda environment (e.g., conda create -n awq python=3.8 -y for JetPack 5).
  3. Install efficient W4A16 (4-bit weight, 16-bit activation) CUDA kernel and optimized FP16 kernels (e.g. layernorm, positional encodings).
cd awq/kernels
python setup.py install
  4. In order to run AWQ and TinyChat with the VILA-1.5 model family, please install VILA:
git clone git@github.com:Efficient-Large-Model/VILA.git
cd VILA
pip install -e .
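
After the steps above, a quick import check can confirm that the CUDA kernels built correctly. This is a minimal sketch: the extension module name is assumed to be awq_inference_engine in current releases, while some older builds exposed it as f16s4_gemm instead.

import torch
import awq_inference_engine  # raises ImportError if the CUDA kernels were not built

print("CUDA available:", torch.cuda.is_available())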

AWQ Model Zoo

We provide pre-computed AWQ search results for multiple model families, including LLaMA, OPT, Vicuna, and LLaVA. To get the pre-computed AWQ search results, run:

# git lfs install  # install git lfs if not already
git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache

The detailed support list:

Models Sizes INT4-g128 INT3-g128
VILA-1.5 3B/8B/13B/40B βœ… βœ…
Llama3 8B/70B βœ… βœ…
VILA 7B/13B βœ…
Llama2 7B/13B/70B βœ… βœ…
LLaMA 7B/13B/30B/65B βœ… βœ…
OPT 125m/1.3B/2.7B/6.7B/13B/30B βœ… βœ…
CodeLlama 7B/13B/34B βœ… βœ…
StarCoder 15.5B βœ… βœ…
Vicuna-v1.1 7B/13B βœ…
LLaVA-v0 13B βœ…

Note: The table above only lists models for which we have prepared pre-computed AWQ search results. AWQ also supports other models, such as LLaVA-v1.5 7B; you may need to run the AWQ search on your own to quantize them.

Examples

AWQ can be easily applied to various LMs thanks to its good generalization, including instruction-tuned models and multi-modal LMs. It provides an easy-to-use tool to reduce the serving cost of LLMs.

Here we provide two examples of applying AWQ, Vicuna-7B (chatbot) and LLaVA-13B (visual reasoning), under the ./examples directory. AWQ can easily reduce the GPU memory needed for model serving and speed up token generation, while providing accurate quantization that preserves the models' reasoning outputs. You should be able to observe memory savings when running the models with 4-bit weights.

Note that we perform AWQ using only textual calibration data, even though we run on multi-modal inputs. Please refer to ./examples for details.

overview

Usage

We provide several sample scripts to run AWQ (please refer to ./scripts). We use Llama3-8B as an example.

  1. Perform AWQ search and save search results (we already did it for you):
python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
    --w_bit 4 --q_group_size 128 \
    --run_awq --dump_awq awq_cache/llama3-8b-w4-g128.pt
  2. Evaluate the AWQ quantized model on WikiText-2 (simulated pseudo quantization; see the sketch after these steps):
python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/llama3-8b-w4-g128.pt \
    --q_backend fake
  3. Generate real quantized weights (INT4):
mkdir quant_cache
python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/llama3-8b-w4-g128.pt \
    --q_backend real --dump_quant quant_cache/llama3-8b-w4-g128-awq.pt
  4. Load and evaluate the real quantized model (now you can see smaller GPU memory usage):
python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_quant quant_cache/llama3-8b-w4-g128-awq.pt
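
For reference, the pseudo ("fake") quantization used in step 2 boils down to group-wise asymmetric INT4 quantization of the weights. The sketch below illustrates the arithmetic under the w4/g128 setting used above; it is a simplified stand-in, not the repository's exact pseudo_quantize_tensor implementation.

import torch

def pseudo_quantize_groupwise(w: torch.Tensor, n_bit: int = 4, group_size: int = 128) -> torch.Tensor:
    # Quantize each group of `group_size` consecutive input weights with its own
    # scale and zero point, then dequantize back to FP for "fake" quantization.
    out_features, in_features = w.shape
    w_g = w.reshape(-1, group_size)
    w_max = w_g.amax(dim=1, keepdim=True)
    w_min = w_g.amin(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / (2 ** n_bit - 1)
    zero = (-w_min / scale).round()
    q = (w_g / scale + zero).round().clamp(0, 2 ** n_bit - 1)  # INT4 codes
    return ((q - zero) * scale).reshape(out_features, in_features)

w = torch.randn(4096, 4096)
print((w - pseudo_quantize_groupwise(w)).abs().mean())  # small reconstruction error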

Results on Vision-Language Models (VILA-1.5)

AWQ also seamlessly supports large multi-modal models (LMMs). We demonstrate the results on the recent VILA-1.5 model family.

VILA-1.5-3B VQA-v2 GQA VizWiz ScienceQA TextVQA POPE MME MMBench MMBench-CN SEED
FP16 80.4 61.5 53.5 69.0 60.4 85.9 1442.4 63.4 52.7 60.9
AWQ-INT4 80.0 61.1 53.8 67.8 60.4 85.9 1437.3 63.3 51.4 59.8
VILA-1.5-8B VQA-v2 GQA VizWiz ScienceQA TextVQA POPE MME MMBench MMBench-CN SEED
FP16 80.9 61.9 58.7 79.9 66.3 84.4 1577.01 72.3 66.2 64.2
AWQ-INT4 80.3 61.7 59.3 79.0 65.4 82.9 1593.65 71.0 64.9 64.0
VILA-1.5-13B VQA-v2 GQA VizWiz ScienceQA TextVQA POPE MME MMBench MMBench-CN SEED
FP16 82.8 64.3 62.6 80.1 65.0 86.3 1569.55 74.9 66.3 65.1
AWQ-INT4 82.7 64.5 63.3 79.7 64.7 86.7 1531.35 74.7 66.7 65.1
VILA-1.5-40B VQA-v2 GQA VizWiz ScienceQA TextVQA POPE MME MMBench MMBench-CN SEED
FP16 84.3 64.6 62.2 87.2 73.6 87.3 1726.82 82.4 80.2 69.1
AWQ-INT4 84.1 64.4 61.3 86.7 73.2 88.2 1714.79 83.2 79.6 68.9

Inference speed (tokens/sec)

Model Precision A100 4090 Orin
VILA1.5-3B fp16 104.6 137.6 25.4
VILA1.5-3B-AWQ int4 182.8 215.5 42.5
VILA1.5-3B-S2 fp16 104.3 137.2 24.6
VILA1.5-3B-S2-AWQ int4 180.2 219.3 40.1
Llama-3-VILA1.5-8B fp16 74.9 57.4 10.2
Llama-3-VILA1.5-8B-AWQ int4 168.9 150.2 28.7
VILA1.5-13B fp16 50.9 OOM 6.1
VILA1.5-13B-AWQ int4 115.9 105.7 20.6
VILA1.5-40B fp16 OOM OOM --
VILA1.5-40B-AWQ int4 57.0 OOM --

Reference

If you find AWQ useful or relevant to your research, please kindly cite our paper:

@inproceedings{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song},
  booktitle={MLSys},
  year={2024}
}

Related Projects

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers

Vicuna and FastChat

LLaVA: Large Language and Vision Assistant

VILA: On Pre-training for Visual Language Models

llm-awq's People

Contributors

casper-hansen, eltociear, isaac-vidas, kentang-mit, louym, sakits, songhan, tonylins, younesbelkada, ys-2020


llm-awq's Issues

[Question/Feature] Skip initialization after quantization

Hi maintainers.

I have been developing models using your AWQ library, which has significantly increased the speed. I have noticed there is a challenge when loading the weights again after quantization because we need to run in init_only mode to load weights correctly and replace layers. This took me roughly 10-12 seconds on a 3090.

Have you thought of a way to 1) quantize weights, 2) save weights with blocks/layers replaced, and 3) load weights without needing to initialize again?

The rationale is that we can get loading times down since we would not need to re-initialize every time we need to load the model.

Question: Mismatched CUDA version (12.0)

When I was installing the efficient W4A16 (4-bit weight, 16-bit activation) CUDA kernel, I encountered the following error:
The detected CUDA version (12.0) mismatches the version that was used to compile
PyTorch (11.7). Please make sure to use the same CUDA versions.

There is no possibility to downgrade to another version. Does this mean I cannot use AWQ if I have to use CUDA 12? Thank you.

4-bit kernel error with large input sizes

import torch
import f16s4_gemm
in_features = 14336
out_features = 5376
w = torch.Tensor(in_features, out_features // (32 // 4)).to(torch.int32).cuda()
scales = torch.Tensor(in_features // 128, out_features).half().cuda()
qzeros = torch.Tensor(in_features // 128, out_features // (32 // 4)).to(torch.int32).cuda()
x = torch.Tensor(2048 * 8, 14336).cuda().half()                       
f16s4_gemm.gemm_forward_cuda(x, w, scales, qzeros, 8)

When M of the GEMM is 2048 x 8, it runs perfectly. If we double it:

x = torch.Tensor(2048 * 16, 14336).cuda().half()

I got an error like RuntimeError: CUDA error: invalid configuration argument.
I think my GPU memory is sufficient, since I ran this on an A100.
Is there anything I can do to solve this?
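
One possible workaround for launch-configuration limits like this (untested, so only a sketch) would be to split the activation along the M dimension and concatenate the partial outputs:

import torch
import f16s4_gemm

def chunked_gemm_forward(x, w, scales, qzeros, chunk_rows=2048 * 8):
    # Call the 4-bit kernel on row chunks small enough to avoid the
    # invalid-configuration error, then stitch the outputs back together.
    # The trailing 8 mirrors the call in the snippet above.
    outs = [f16s4_gemm.gemm_forward_cuda(chunk, w, scales, qzeros, 8)
            for chunk in torch.split(x, chunk_rows, dim=0)]
    return torch.cat(outs, dim=0)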

SmoothQuant vs. AWQ: which one is faster?

Question

We are very interested in these two post-training quantization papers from the HAN Lab!

SmoothQuant uses W8A8 for efficient GPU computation.
AWQ uses W4/3A16 for lower memory requirements and higher memory throughput.

But which one is faster in actual production?
If you have any data about this, could you share it with us?

[Bug] Memory leak in real_quantize_model_weight

Hi,

I have been trying to quantize bigger models (mpt-30b, falcon-40b) on a relatively smaller GPU (RTX 3060 with 12GB of VRAM) and have struggled with CUDA OOM errors.

First of all, I wrote a small utility function to get all tensors allocated on CUDA:

import gc
import torch


def get_cuda_tensors():
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
                if obj.is_cuda:
                    yield obj
        except:
            pass

It seems there are a lot of places where memory is being leaked.

For example, in real_quantize_model_weight, calling this function at the end of the for loop here and printing the tensors, it is clear that the number of tensors allocated on CUDA keeps growing every iteration of the loop, despite calling gc.collect(); torch.cuda.empty_cache() repeatedly.
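
For reference, the helper above can be turned into a per-iteration summary along these lines (illustrative sketch):

def report_cuda_tensors():
    # Summarize the live CUDA tensors found by get_cuda_tensors() above.
    tensors = list(get_cuda_tensors())
    total_mib = sum(t.numel() * t.element_size() for t in tensors) / 1024 ** 2
    print(f"{len(tensors)} CUDA tensors, {total_mib:.1f} MiB")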

Thanks!

Cannot install with 2080 Ti

It's an amazing package, and maybe the only CUDA 4-bit method for MPT for now. However, I cannot install it because of the following. My CUDA version is Cuda compilation tools, release 11.2, V11.2.152, and my graphics card is a 2080 Ti; maybe the problem is that the card is too old? On another machine with a 3090 and CUDA 11.3, V11.3.109, everything is fine. Is there any way to install the package with a 2080 Ti? The log is below:

(whisper) vitualwht@DESKTOP-DSTBT14:~$ cd llm-awq
(whisper) vitualwht@DESKTOP-DSTBT14:~/llm-awq$ cd awq/kernels
(whisper) vitualwht@DESKTOP-DSTBT14:~/llm-awq/awq/kernels$ python setup.py install
running install
/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!

    ********************************************************************************
    Please avoid running ``setup.py`` directly.
    Instead, use pypa/build, pypa/installer, pypa/build or
    other standards-based tools.

    See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
    ********************************************************************************

!!
self.initialize_options()
/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/setuptools/_distutils/cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!

    ********************************************************************************
    Please avoid running ``setup.py`` and ``easy_install``.
    Instead, use pypa/build, pypa/installer, pypa/build or
    other standards-based tools.

    See https://github.com/pypa/setuptools/issues/917 for details.
    ********************************************************************************

!!
self.initialize_options()
running bdist_egg
running egg_info
writing f16s4_gemm.egg-info/PKG-INFO
writing dependency_links to f16s4_gemm.egg-info/dependency_links.txt
writing requirements to f16s4_gemm.egg-info/requires.txt
writing top-level names to f16s4_gemm.egg-info/top_level.txt
/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/utils/cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
warnings.warn(msg.format('we could not find ninja.'))
reading manifest file 'f16s4_gemm.egg-info/SOURCES.txt'
writing manifest file 'f16s4_gemm.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_ext
/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/utils/cpp_extension.py:388: UserWarning: The detected CUDA version (11.2) has a minor version mismatch with the version that was used to compile PyTorch (11.7). Most likely this shouldn't be a problem.
warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
building 'f16s4_gemm' extension
creating build/temp.linux-x86_64-cpython-39
/usr/local/cuda/bin/nvcc -I/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/include -I/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/include/TH -I/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/vitualwht/anaconda3/envs/whisper/include/python3.9 -c gemm_cuda_gen.cu -o build/temp.linux-x86_64-cpython-39/gemm_cuda_gen.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -std=c++17 -keep -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -DTORCH_EXTENSION_NAME=f16s4_gemm -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75
gemm_cuda_gen.cu(23): warning: variable "scaling_factors_shared" was declared but never referenced

gemm_cuda_gen.cu(24): warning: variable "zeros_shared" was declared but never referenced

gemm_cuda_gen.cu(28): warning: variable "blockIdx_x" was declared but never referenced

gemm_cuda_gen.cu(42): warning: variable "ld_zero_flag" was declared but never referenced

/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/include/c10/util/irange.h(54): warning: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(77): here

/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/include/c10/util/irange.h(54): warning: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/include/ATen/core/qualified_name.h(73): here

gemm_cuda_gen.cu(10): warning: function "__pack_half2" was declared but never referenced

ptxas gemm_cuda_gen.ptx, line 911; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 915; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 919; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 923; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 927; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 931; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 935; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 939; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 983; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 987; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 991; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 995; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 999; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 1003; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 1007; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 1011; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas fatal : Ptx assembly aborted due to errors
error: command '/usr/local/cuda/bin/nvcc' failed with exit code 255

error on setup.py in kernels folder

(awq) C:\Users\caleb\Desktop\AI stuff\llm-awq\awq\kernels>python -m setup.py install
No CUDA runtime is found, using CUDA_HOME='C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1'
running install
C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!

    ********************************************************************************
    Please avoid running ``setup.py`` directly.
    Instead, use pypa/build, pypa/installer, pypa/build or
    other standards-based tools.

    See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
    ********************************************************************************

!!
self.initialize_options()
C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!

    ********************************************************************************
    Please avoid running ``setup.py`` and ``easy_install``.
    Instead, use pypa/build, pypa/installer, pypa/build or
    other standards-based tools.

    See https://github.com/pypa/setuptools/issues/917 for details.
    ********************************************************************************

!!
self.initialize_options()
running bdist_egg
running egg_info
writing f16s4_gemm.egg-info\PKG-INFO
writing dependency_links to f16s4_gemm.egg-info\dependency_links.txt
writing requirements to f16s4_gemm.egg-info\requires.txt
writing top-level names to f16s4_gemm.egg-info\top_level.txt
C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\torch\utils\cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
warnings.warn(msg.format('we could not find ninja.'))
reading manifest file 'f16s4_gemm.egg-info\SOURCES.txt'
writing manifest file 'f16s4_gemm.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_ext
C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\torch\utils\cpp_extension.py:359: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
warnings.warn(f'Error checking compiler version for {compiler}: {error}')
Traceback (most recent call last):
File "C:\Users\caleb\miniconda3\envs\awq\lib\runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "C:\Users\caleb\miniconda3\envs\awq\lib\runpy.py", line 110, in get_module_details
import(pkg_name)
File "C:\Users\caleb\Desktop\AI stuff\llm-awq\awq\kernels\setup.py", line 9, in
setup(
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_init
.py", line 107, in setup
return distutils.core.setup(**attrs)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\core.py", line 185, in setup
return run_commands(dist)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\core.py", line 201, in run_commands
dist.run_commands()
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 969, in run_commands
self.run_command(cmd)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\dist.py", line 1244, in run_command
super().run_command(command)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\command\install.py", line 80, in run
self.do_egg_install()
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\command\install.py", line 129, in do_egg_install
self.run_command('bdist_egg')
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\dist.py", line 1244, in run_command
super().run_command(command)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\command\bdist_egg.py", line 164, in run
cmd = self.call_command('install_lib', warn_dir=0)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\command\bdist_egg.py", line 150, in call_command
self.run_command(cmdname)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\dist.py", line 1244, in run_command
super().run_command(command)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\command\install_lib.py", line 11, in run
self.build()
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\command\install_lib.py", line 111, in build
self.run_command('build_ext')
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\dist.py", line 1244, in run_command
super().run_command(command)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\command\build_ext.py", line 84, in run
_build_ext.run(self)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\command\build_ext.py", line 345, in run
self.build_extensions()
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\torch\utils\cpp_extension.py", line 499, in build_extensions
_check_cuda_version(compiler_name, compiler_version)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\torch\utils\cpp_extension.py", line 383, in _check_cuda_version
torch_cuda_version = packaging.version.parse(torch.version.cuda)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\pkg_resources_vendor\packaging\version.py", line 52, in parse
return Version(version)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\pkg_resources_vendor\packaging\version.py", line 195, in init
match = self._regex.search(version)
TypeError: expected string or bytes-like object

I can't figure out why setup.py won't work; I can't finish installing this repo because of this error.

Quantization of larger models on smaller GPUs using CPU offloading

Hi,

I was trying to quantize facebook/opt-6.7b on RTX 3060 (12GB of VRAM) and was running into OOM errors.

I tried supplying my own device_map (instead of device_map="balanced"); the quantization progressed to the 3rd layer and then I got this error in accelerate:

NotImplementedError: Cannot copy out of meta tensor; no data!

I think that by carefully using accelerate and CPU offloading, it should be possible to quantize larger models on smaller GPUs.

Can you please look into this? Or provide some guidance as to where to make changes in the code?

Thanks!

Question about applying scales for fc2

Thanks for your great work, from SmoothQuant to AWQ.
I have a question about this line:

scales_list.append(_auto_get_scale(
    prev_op=module.fc1,
    layers=[module.fc2],
    inp=input_feat['fc2'],
))

Why can we transfer the scale from fc2 to fc1? There is a nonlinear activation function between the two FC layers.

[Feature Request] Support grouped-query attention

Hi,

The recent release of LLaMA 2 from Meta AI uses grouped-query attention (GQA) as opposed to multi-head attention (MHA) for the 70B model and the current AWQ search fails. Considering it is the best open-source model, please support GQA.

Thanks!

Does this work support compression and acceleration for mt0-xl or GPT models?

Hi, thanks for sharing the great work.
I'm wondering whether this work supports compression and acceleration for mt0-xl or GPT models.
If not currently, do you have plans to support these models? How can I adapt this work to support them? Could you please give some advice?
For example:
https://huggingface.co/bigscience/mt0-xl
https://huggingface.co/nvidia/nemo-megatron-mt5-3B
https://huggingface.co/nvidia/nemo-megatron-gpt-5B

bloom-176b CUDA out of memory on 8x A100 80GB

Thanks for your work on supporting the BLOOM model. I have already put the --parallel or --auto_parallel argument in my script, but I still can't compute AWQ on my 8x A100 80GB server.
python -m awq.entry_new_lambada --model_path $model_path/$MODEL \
    --w_bit 4 --q_group_size 128 \
    --run_awq --dump_awq awq_cache/$MODEL-w4-g128.pt --parallel

How can I fix this problem?

A question about the metrics in the paper

Hello, I'm reading AWQ and have a small question about the metrics. I found that the results for OPT on WikiText-2 in AWQ are different from those in GPTQ's paper:

image
(results from AWQ)

image
(results from GPTQ)

image
(results from SqPR, basically same with GPTQ)

Would that be a problem? Is it due to different experiment settings, or did I miss something?

Question about inference speed

I'm trying to compare the inference performance of GPTQ (reorder) and AWQ on an A100-40G. The table below shows the results of preliminary tests.

LLaMA-13B best (t/s) worst (t/s)
Exllama 47.13 41.86
Tinychat 23.04 21.35

It seems that the results here are inconsistent with those in the paper; AWQ is much slower than GPTQ.

Support for MPT models

Hi @kentang-mit, great work and research!

I would like to suggest implementing the MPT foundational models for AWQ.

MPT obstacles:

  • uses ALiBi
  • uses Triton with FlashAttention

What are your thoughts on supporting other architectures and foundational open-source models?

http.client.RemoteDisconnected: Remote end closed connection without response

Hi,
Thanks for your inspiring work. I tried to reproduce it following the example in the README; however, it raises an error about the network connection. Does the optimization process need to make HTTP requests? Can I run it locally?

I ran the code with:
python -m awq.entry --model_path /path/to/llama-7b-hf --w_bit 4 --q_group_size 128 --run_awq --dump_awq awq_cache/llama-7b-w4-g128.pt

And the error:

raise ConnectionError(err, request=request)

requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Bad result when running AWQ without GPU

Hi folks, I met a weird issue when reproducing the results shown in the paper. I can get the results below with a GPU visible, but cannot reproduce them with only a CPU. I set the dtype to torch.float to avoid losing precision from float16.

It's not an inference-device issue; the difference comes from the awq_results obtained with and without a GPU. Is there any workaround to handle it? Any suggestions would be helpful, thanks!

To disable GPU: export CUDA_VISIBLE_DEVICES=''

image

opt-125m FP32 group_size INT4 RTN asym on CPU AWQ on CPU AWQ on GPU
wikitext 31.95 G32 33.83 48.52 33.01
wikitext 31.95 G128 35.96 39.53 33.96

Which "activations" are assumed in this work?

Hello, thanks for your work.

We hypothesize that the input features with larger magnitudes are generally more important.

Do you mean "activations" = "input features" = encircled matrices in this picture (from blogpost http://jalammar.github.io/illustrated-transformer/)?
image
And if so, why did you decide to use input features, rather than the computed Q, K, V (and maybe Z, as it is referred to in the picture) values, as the measure of saliency of the weights?

Error Occurs When Quantizing LLaMA2-70B

I really appreciate the authors' fantastic work here.

When I tried to apply AWQ to LLaMA2-70B, however, the error below popped up:

Running AWQ...:   0%|                                                                                                                               | 0/80 [00:16<?, ?it/s]
Traceback (most recent call last):
  File "/home/vma/.conda/envs/awq/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/vma/.conda/envs/awq/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/proj/Projects/FastChat/repo/llm-awq/awq/entry.py", line 214, in <module>
    main()
  File "/proj/Projects/FastChat/repo/llm-awq/awq/entry.py", line 189, in main
    model, enc = build_model_and_enc(args.model_path)
  File "/proj/Projects/FastChat/repo/llm-awq/awq/entry.py", line 122, in build_model_and_enc
    awq_results = run_awq(
  File "/home/vma/.conda/envs/awq/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/proj/Projects/FastChat/repo/llm-awq/awq/quantize/pre_quant.py", line 149, in run_awq
    apply_scale(layers[i], scales_list, input_feat_dict=input_feat)
  File "/proj/Projects/FastChat/repo/llm-awq/awq/quantize/auto_scale.py", line 355, in apply_scale
    scale_fc_fc(prev_op, layers[0], scales)
  File "/home/vma/.conda/envs/awq/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/proj/Projects/FastChat/repo/llm-awq/awq/quantize/auto_scale.py", line 68, in scale_fc_fc
    fc1.weight[-scales.size(0):].div_(scales.view(-1, 1))
RuntimeError: The size of tensor a (1024) must match the size of tensor b (8192) at non-singleton dimension 0

This error is really unexpected, since AWQ works fine on LLaMA2-7B and LLaMA2-13B.

I wonder if anyone could give me a hint about solving this problem.

Thanks in advance.

AWQ uses more GPU memory than GPTQ

We tested the LLaMA model using AWQ and GPTQ. AWQ does have higher accuracy than GPTQ.

But we found that when using the AWQ code to run inference on the LLaMA model, it uses more GPU memory than GPTQ.

The following are the relevant test results:

For llama-7b w4 group_size=128, the quantized model size is 3.7G.

We use an A100 40GB and test on HumanEval.

GPTQ

  • use_cache=True Maximum memory used:9.505859375GB
  • use_cache=False Maximum memory used:9.115234375GB

AWQ

  • use_cache=True Maximum memory used:26.47265625GB
  • use_cache=False Maximum memory used:36.96484375GB

There are two points to pay attention to in the above results:

  1. In the inference stage, GPTQ can use less memory than AWQ
  2. For AWQ, use_cache=False uses more memory (usually use_cache=True requires more memory)

use_cache=False
We use the GPTQ script to run 4-bit llama-65b inference, which can run on a single GPU. When using AWQ, an OOM occurs.

I would like to ask if you encountered any of the above problems during testing. Could you please share your thoughts on these issues? Thank you so much.

I noticed that in the forward phase, the main difference between GPTQ and AWQ is that AWQ uses Tensor Cores (I am not familiar with Tensor Cores). Will Tensor Cores cause more memory usage?

Question about Activation-aware Scaling and its implementation

Hi,

Thank you for your outstanding work and for making it accessible to the public.

I would like to inquire about the correlation between the activation distribution and the chosen alpha through grid search.

According to the paper, AWQ aims to reduce quantization error by preserving significant weights identified by the activation distribution. However, in the implementation, alpha is selected based on the mean square error (MSE) between the original output and the output of the quantized layer.

Paper insights path: activation distribution -> salient weight -> keep the salient weight with high precision -> reduce quantization error
Implementation path: layer-wise MSE -> alpha -> migrate activation outlier -> reduce quantization error

Is there a connection between the activation distribution and the lowest MSE? In other words, does the alpha value determined by MSE reflect the underlying activation distribution?

For example, if we find the best alpha, does it reflect the activation distribution and identify the real salient weights?

Please let me know if I have missed anything or if there are any misunderstandings.

Your clarification would be greatly appreciated :).

Can AWQ be run on TPUs?

Hi,

Is it possible to run AWQ on cloud TPUs? Will CUDA kernels run correctly on those?

Thanks!

Cannot load pre-computed AWQ results for Bloom-7b

Thanks for the Bloom support added today! I tried with Bloom-7b but failed.
After performing the AWQ search and saving the search results, I tried to reload and evaluate with the command
'python -m awq.entry --model_path /tmp/bloom_orig_0608/ --w_bit 4 --q_group_size 128 --run_awq --load_awq awq_cache/bloom-7b-w4-g128.pt --tasks wikitext --q_backend fake'
but it raises NotImplementedError with awq.quantize.qmodule.ScaledActivation, like:
image

Open-Flamingo reference

In the paper you said the following. How can quantization be done for Open-Flamingo?

Thanks to better generalization, it also achieves good quantization
performance for instruction-tuned LMs (e.g., Vicuna) and, for the first time, multi-modal LMs (Open-Flamingo [2]). Thanks to our efficient kernels, AWQ achieves 1.45Γ— and 2Γ— speedup over GPTQ
and GPTQ with reordering on A100.

Need help with the auto_scale.scale_fc_fc function

Hello, I would like to apply AWQ to a GPTBigCodeForCausalLM object, and it has an unusual attention structure, as shown in this picture:
image

I added some necessary implementations and finally I got this:
image

It was caused here:
image

It seems like fc2's scale was applied to both fc1 and fc2, and because of the different shapes of my_fc1 and my_fc2, the entire program broke here.

It seems to be dividing the weight of the previous layer by the scale of fc2, instead of dividing the input x of fc2 by the scale. Right?

How can I fix this error? Or could you please tell me why we must apply the scale to both fc1 and fc2?

Thanks & Regards

Can I make a change like this?
image

[Bug] ValueError: OC is not multiple of cta_N = 128

Hi,

I encounter this error while using the quantized tiiuae/falcon-7b-instruct model:

python -m awq.entry --model_path tiiuae/falcon-7b-instruct \
  --max_memory 0:9GiB cpu:99GiB \
  --tasks wikitext \
  --w_bit 4 --q_group_size 64 \
  --load_quant quant_cache/falcon-7b-instruct-w4-g64-awq.pt
β”‚ /home/user/llm-awq/awq/quantize/qmodule.py:92 in forward             β”‚
β”‚                                                                                                  β”‚
β”‚   89 β”‚   @torch.no_grad()                                                                        β”‚
β”‚   90 β”‚   def forward(self, x):                                                                   β”‚
β”‚   91 β”‚   β”‚   out_shape = x.shape[:-1] + (self.out_features, )                                    β”‚
β”‚ ❱ 92 β”‚   β”‚   out = f16s4_gemm.gemm_forward_cuda(x.reshape(-1, x.shape[-1]), self.qweight, sel    β”‚
β”‚   93 β”‚   β”‚   out = out + self.bias if self.bias is not None else out                             β”‚
β”‚   94 β”‚   β”‚   return out.reshape(out_shape)                                                       β”‚
β”‚   95                                                                                             β”‚
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: OC is not multiple of cta_N = 128

Please note that group size 128 cannot be used for the tiiuae/falcon-7b-instruct model, as the input dimension of the Linear layers in the transformer block is not divisible by 128.
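
A quick way to see which Linear layers hit this constraint before quantizing is to check their dimensions (an illustrative sketch using the standard transformers API, not part of this repo):

import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True)
for name, module in model.named_modules():
    # print every Linear whose dimensions are not multiples of 128
    if isinstance(module, nn.Linear) and (module.in_features % 128 or module.out_features % 128):
        print(name, module.in_features, module.out_features)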

Thanks!

Version of Nvidia Jetson Orin used for TinyChat benchmarks

You report some benchmarking numbers for TinyChat running on an Nvidia Jetson Orin device, but it is not clear which version of the device you are using. Is it a Nano, NX or AGX and with how much memory? Please update the TinyChat benchmark with this information.

Bug of Load and evaluate the real quantized model

I encountered an error while loading and evaluating the model for OPT-6.7b. Here is my implementation code:
image
The error is displayed as follows:
image
It looks like the weights given in the awq-model-zoo mismatch the model defined in the code.

3-bit backward implementation

Thank you for your amazing work. Do you have any plans to implement a 3-bit backward pass (transposed matmul)?
I think this would allow applying LoRA to the model in 3 bits, like QLoRA.

W4A16 kernel error when group_size is not 128

Hi,

Thanks for your interesting work and clear open-source code.

I have been trying to test the W4A16 kernel with different quantization group sizes, and I have found that this kernel only produces correct outputs when group_size is set to 128.

For example, I tested the W4A16 kernel with the following code:

import torch
from awq.quantize.quantizer import pseudo_quantize_tensor, pseudo_quantize_model_weight
from awq.quantize.qmodule import WQLinear

w_bit = 4
q_group_size = 128
inputs = torch.randn((1, 4096, 4096)).cuda().half()
module = torch.nn.Linear(4096, 4096, True).cuda().half()

module.weight.data, scales, zeros = pseudo_quantize_tensor(module.weight.data, n_bit=w_bit, get_scale_zp=True, q_group_size=q_group_size)
fake_outputs = module(inputs)
scales = scales.t().contiguous()
zeros = zeros.t().contiguous()
q_linear = WQLinear.from_linear(module, w_bit, q_group_size, False, scales, zeros)
real_outputs = q_linear(inputs)

print(f"average dist:{(real_outputs-fake_outputs).abs().mean()}")

When q_group_size=128, the gap is negligible:

average dist:0.00014293193817138672

However, when q_group_size is set to another value, the gap becomes significant. Taking group_size=256 as an example, the output is:

average dist:0.32958984375

Is there anything I can do to resolve this?

awqlora

Excited about your great research. Can we combine QLoRA and AWQ, like gptqlora? If it's possible, would you consider releasing a version of awqlora?

Best regards.

TypeError: expected string or bytes-like object

(awq) C:\Users\Bhanu prakash\llm-awq\awq\kernels>python setup.py install
No CUDA runtime is found, using CUDA_HOME='C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0'
running install
E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!

    ********************************************************************************
    Please avoid running ``setup.py`` directly.
    Instead, use pypa/build, pypa/installer or other
    standards-based tools.

    See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
    ********************************************************************************

!!
self.initialize_options()
E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!

    ********************************************************************************
    Please avoid running ``setup.py`` and ``easy_install``.
    Instead, use pypa/build, pypa/installer or other
    standards-based tools.

    See https://github.com/pypa/setuptools/issues/917 for details.
    ********************************************************************************

!!
self.initialize_options()
running bdist_egg
running egg_info
writing awq_inference_engine.egg-info\PKG-INFO
writing dependency_links to awq_inference_engine.egg-info\dependency_links.txt
writing requirements to awq_inference_engine.egg-info\requires.txt
writing top-level names to awq_inference_engine.egg-info\top_level.txt
reading manifest file 'awq_inference_engine.egg-info\SOURCES.txt'
writing manifest file 'awq_inference_engine.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_ext
E:\Anaconda\envs\awq\lib\site-packages\torch\utils\cpp_extension.py:359: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
warnings.warn(f'Error checking compiler version for {compiler}: {error}')
Traceback (most recent call last):
File "C:\Users\Bhanu prakash\llm-awq\awq\kernels\setup.py", line 9, in
setup(
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_init_.py", line 107, in setup
return distutils.core.setup(**attrs)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\core.py", line 185, in setup
return run_commands(dist)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\core.py", line 201, in run_commands
dist.run_commands()
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 969, in run_commands
self.run_command(cmd)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\dist.py", line 1234, in run_command
super().run_command(command)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\command\install.py", line 80, in run
self.do_egg_install()
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\command\install.py", line 129, in do_egg_install
self.run_command('bdist_egg')
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\dist.py", line 1234, in run_command
super().run_command(command)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\command\bdist_egg.py", line 164, in run
cmd = self.call_command('install_lib', warn_dir=0)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\command\bdist_egg.py", line 150, in call_command
self.run_command(cmdname)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\dist.py", line 1234, in run_command
super().run_command(command)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\command\install_lib.py", line 11, in run
self.build()
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\command\install_lib.py", line 111, in build
self.run_command('build_ext')
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\dist.py", line 1234, in run_command
super().run_command(command)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\command\build_ext.py", line 84, in run
_build_ext.run(self)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\command\build_ext.py", line 345, in run
self.build_extensions()
File "E:\Anaconda\envs\awq\lib\site-packages\torch\utils\cpp_extension.py", line 499, in build_extensions
_check_cuda_version(compiler_name, compiler_version)
File "E:\Anaconda\envs\awq\lib\site-packages\torch\utils\cpp_extension.py", line 383, in _check_cuda_version
torch_cuda_version = packaging.version.parse(torch.version.cuda)
File "E:\Anaconda\envs\awq\lib\site-packages\pkg_resources_vendor\packaging\version.py", line 52, in parse
return Version(version)
File "E:\Anaconda\envs\awq\lib\site-packages\pkg_resources_vendor\packaging\version.py", line 196, in init
match = self._regex.search(version)
TypeError: expected string or bytes-like object

[Question/Feature] Fused attention/mlp/norm for MPT

I have had the great pleasure of testing out TinyChat today - it's blazing fast.

In particular, I was able to get 102 tokens/s (9.8ms/token) on a 4090 with the fused operations on LLaMa-2 7B, which is a 100% speed boost over the non-fused operations which ran at about 45-50 tokens/s.

How can we extend these fused operations to the MPT model series, i.e., fusing the torch implementation of multi-head attention plus their ALiBi implementation?

The main reason I want to use MPT models over LLaMa is licensing issues, but also that MPT has 7B models trained for 8k context.

Guidance on CUDA driver and runtime versions

Hi,

I have two different setups where I wanted to run AWQ - I managed to run it successfully on one, but not on the other.

The setup where I was able to run AWQ successfully has the following:

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
530.30.02

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

Whereas the unsuccessful setup has the following:

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
470.161.03

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

The second environment is a Kaggle notebook and the traceback is as follows:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/kaggle/working/llm-awq/awq/entry.py", line 9, in <module>
    from awq.quantize.pre_quant import run_awq, apply_awq
  File "/kaggle/working/llm-awq/awq/quantize/pre_quant.py", line 12, in <module>
    from .auto_scale import auto_scale_block, apply_scale
  File "/kaggle/working/llm-awq/awq/quantize/auto_scale.py", line 8, in <module>
    from .qmodule import ScaledActivation
  File "/kaggle/working/llm-awq/awq/quantize/qmodule.py", line 4, in <module>
    import f16s4_gemm  # with CUDA kernels
ModuleNotFoundError: No module named 'f16s4_gemm'

Thanks!

How to measure the speedup of W4A16 kernel like Figure 6?

Hi,

Thanks for your outstanding work. I have tested the quantized model using the W4A16 kernel on the WikiText2 dataset. Specifically, the WikiText2 validation set is split into non-overlapping segments of width 2048. I have observed that the W4A16 kernel significantly reduces memory usage. However, the actual speed is even slower than W16A16 in my setup.

For example, for LLaMA-30B, the test time of W16A16 on the WikiText2 validation set is 177 seconds, whereas the test time increases to 420 seconds when using the W4A16 kernel.

I would like to know how to accurately measure the speedup. Am I overlooking something?
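
For reference, one way I could time the kernel in isolation (a minimal sketch, not the paper's benchmark script) would be with CUDA events on decode-stage shapes (a few tokens per call) rather than on full 2048-token segments, where the context stage is compute-bound and may benefit less from W4A16:

import torch

def time_forward(fn, x, warmup=10, iters=100):
    # Time with CUDA events so asynchronous kernel launches are fully counted.
    for _ in range(warmup):
        fn(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per forward pass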

Thank you.
