squeezeailab / kvquant

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Home Page: https://arxiv.org/abs/2401.18079

Languages: Python 98.89%, Makefile 0.01%, Dockerfile 0.04%, Jsonnet 0.01%, Shell 0.12%, Jupyter Notebook 0.36%, C++ 0.05%, Cuda 0.51%, C 0.01%, Cython 0.01%
Topics: compression, efficient-inference, efficient-model, large-language-models, llama, llm, localllama, localllm, mistral, model-compression, natural-language-processing, quantization, small-models, text-generation, transformer

kvquant's Introduction

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [Paper]


KVQuant is a methodology for efficient KV cache quantization that incorporates several innovations to achieve accurate low-precision quantization, thereby enabling efficient long context length inference.

TLDR: KVQuant addresses the memory bottleneck of long context length inference by quantizing the KV cache to low precision. KVQuant achieves high accuracy with low-precision KV cache quantization by considering several consistent patterns observed in cached KV values across different LLMs, and by developing methods to exploit these patterns, including:

  • Per-channel, Pre-RoPE Key quantization to better match the outlier channels in Keys
  • Non-Uniform Quantization (NUQ) to better represent the non-uniform activations
  • Dense-and-Sparse Quantization to mitigate the impact of numerical outliers on quantization difficulty (a rough sketch of this idea follows the list)
  • Q-Norm to mitigate distribution shift at ultra low precisions (e.g., 2-bit)
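
To make the dense-and-sparse idea above concrete, here is a minimal sketch of per-channel key quantization with the outliers kept at full precision. This is an illustration only, not the KVQuant implementation: it uses uniform quantization levels rather than NUQ, and the function names, threshold choices, and tensor shapes are assumptions made for the example.

import torch

# Illustrative sketch only (not the KVQuant kernels): per-channel quantization
# of pre-RoPE keys, with the largest-magnitude values kept exactly in a sparse
# matrix (dense-and-sparse decomposition).
def dense_and_sparse_quantize(K, n_bits=3, outlier_frac=0.01):
    # K: [num_tokens, head_dim] pre-RoPE keys for one head, float32.
    # Per-channel thresholds; values outside them are treated as outliers.
    lo = torch.quantile(K, outlier_frac / 2, dim=0)
    hi = torch.quantile(K, 1 - outlier_frac / 2, dim=0)
    outlier_mask = (K < lo) | (K > hi)

    # Dense part: uniform per-channel quantization of the clipped values.
    scale = (hi - lo) / (2 ** n_bits - 1)
    codes = torch.round((K.clamp(lo, hi) - lo) / scale).to(torch.uint8)

    # Sparse part: the outliers are stored exactly.
    outliers = torch.where(outlier_mask, K, torch.zeros_like(K)).to_sparse()
    return codes, scale, lo, outliers

def dense_and_sparse_dequantize(codes, scale, lo, outliers):
    dense = codes.float() * scale + lo
    out = outliers.to_dense()
    return torch.where(out != 0, out, dense)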

KVQuant enables serving the LLaMA-7B model with 1M context length on a single A100-80GB GPU, or even the LLaMA-7B model with 10M context length on an 8-GPU system 🔥
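
As a rough back-of-envelope check on why KV cache quantization matters at these context lengths (illustrative numbers only, assuming LLaMA-7B-like dimensions of 32 layers and hidden size 4096, and ignoring model weights, activations, and the sparse-outlier/metadata overhead of the actual method):

# Rough KV cache size estimate for LLaMA-7B-like dimensions; illustrative only.
layers, hidden, tokens = 32, 4096, 1_000_000

def kv_cache_gib(bits_per_value):
    # Keys and Values: two [tokens, hidden] tensors per layer.
    total_bits = 2 * layers * hidden * tokens * bits_per_value
    return total_bits / 8 / 2**30

print(f"fp16 : {kv_cache_gib(16):6.1f} GiB")   # ~488 GiB
print(f"4-bit: {kv_cache_gib(4):6.1f} GiB")    # ~122 GiB
print(f"2-bit: {kv_cache_gib(2):6.1f} GiB")    # ~61 GiB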

[TLDR: Twitter Thread] [Paper]


Long Context Length Inference with Large World Model

Large World Model (LWM) is a recent work that enables training long context length models with up to 1M context length. However, running inference with these models is extremely resource-intensive due to the large KV cache that must be stored throughout inference. Using KVQuant, we can now run inference with these long context length models efficiently on a single A100!

The lwm/ directory contains scripts for running inference and evaluation using the quantized Large World Models.


Additional Method Improvements

To further improve our methodology for supporting long context length inference, we have made several improvements:

  • Parallel topK support on GPU and kernels for parallel prompt processing - we have augmented our open-source support with additional kernels to perform parallel packing with multiple input tokens, and also modified our inference code to utilize the GPU for parallel topK when appending many value tokens in parallel.
  • Capping Key Outliers - we have added support for running both calibration and inference with a fixed number of outliers per token for keys. This allows us to design more efficient kernels, since there is a maximum number of outliers per token for both keys and values, and it makes memory allocation easier for our method since we can allocate fixed-size memory for each key.
  • Attention Sink-Aware Quantization - based on the insight from the Attention Sink paper that the model concentrates its attention on the first token, we have added support during both calibration and inference for leaving a small number of initial keys and values (e.g., 5) in fp16. This can allow for significant performance gains, and was also introduced as a method for improving quantization performance in the concurrent work IntactKV. A simplified sketch of this idea follows the list. More detailed evaluation and analysis of these improvements will be added to the arXiv preprint shortly!
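
A simplified sketch of the attention-sink idea: keep the first few key/value tokens in fp16 and quantize only the rest. This is illustrative only; the function signature and parameter names are assumptions, not the repository's API.

import torch

def quantize_with_sink(kv, quantize_fn, n_sink=5):
    # kv: [num_tokens, head_dim] keys or values for one head.
    # The first n_sink tokens stay in fp16 (attention sink); the rest are quantized.
    sink_fp16 = kv[:n_sink].half()
    rest_quantized = quantize_fn(kv[n_sink:])
    return sink_fp16, rest_quantized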

Installation

The codebase contains three different subfolders, each of which has its own README with instructions for installing the required environment for that step.


How the code is structured

  • gradients - codebase for computing Fisher information, which is required in order to quantize a new model
  • quant - codebase for running simulated quantization + eval experiments (requires the Fisher information from the previous step)
  • deployment - codebase for running efficient inference with compressed vectors (requires the quantizers produced by the quant step)
  • lwm - code for running inference with and evaluating quantized LWM models

To reproduce the perplexity numbers reported in the paper, run gradients and then quant.
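
As a quick sanity check that the quant step produced its output before moving on to deployment, you can peek at the pickle file it writes. The exact structure of quantizers.pickle is defined by the quant codebase, so the snippet below only inspects it generically:

# Generic inspection of the quantizers produced by the quant step; the exact
# structure of quantizers.pickle is defined by that codebase.
import pickle

with open("quantizers.pickle", "rb") as f:
    quantizers = pickle.load(f)

print(type(quantizers))
if isinstance(quantizers, dict):
    for name in list(quantizers)[:4]:   # peek at a few entries
        print(name, type(quantizers[name]))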


Roadmap:

  • add deployment code
  • multi-GPU evaluation environment for long sequence length evaluation with simulated quantization
  • unify environments to simplify installation
  • optimized kernels for A100
  • additional evaluation on long context lengths + different downstream tasks
  • multi-GPU inference

Acknowledgement

This code reuses components from several libraries including GPTQ, GPTQ-For-LLaMA, and SqueezeLLM.


Citation

KVQuant has been developed as part of the following paper. If you find the library useful for your work, please cite:

@article{hooper2024kvquant,
  title={KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization},
  author={Hooper, Coleman and Kim, Sehoon and Mohammadzadeh, Hiva and Mahoney, Michael W and Shao, Yakun Sophia and Keutzer, Kurt and Gholami, Amir},
  journal={arXiv preprint arXiv:2401.18079},
  year={2024}
}

kvquant's People

Contributors

chooper1

kvquant's Issues

CUDA error: an illegal memory access was encountered

Thank you for your excellent work!

Currently, I am trying to reproduce KVQuant but have encountered some errors. Your assistance with this matter would be appreciated.

1. Reproduce the bug

I followed the provided instructions and set up the environment for gradient/quant/deployment. The gradient and quantization processes performed well; I successfully computed the gradient and built the quantizer. However, when I tested the deployment code using the following instructions, I encountered the error message "CUDA error: an illegal memory access was encountered."

cp ../quant/quantizers.pickle .

CUDA_VISIBLE_DEVICES=1 python llama.py JackFram/llama-160m wikitext2 \
    --abits 4 \
    --include_sparse \
    --sparsity-threshold 0.99 \
    --quantizer-path quantizers.pickle \
    --benchmark 128 \
    --check

2. Error logs

The detailed error logs are shown as follows:

/root/anaconda3/envs/deploy/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
/root/anaconda3/envs/deploy/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
splitting into 1 GPUs
/root/anaconda3/envs/deploy/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Load quantizers.
k:  model.layers.0.self_attn.k_proj
/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py:449: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.outlier_threshold_upper = torch.tensor(quantizer[0]).cuda().half().flatten()
/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py:450: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.outlier_threshold_lower = torch.tensor(quantizer[1]).cuda().half().flatten()
/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py:484: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  lut_tmp = torch.tensor(self.lut)
k:  model.layers.0.self_attn.v_proj
k:  model.layers.1.self_attn.k_proj
k:  model.layers.1.self_attn.v_proj
k:  model.layers.2.self_attn.k_proj
k:  model.layers.2.self_attn.v_proj
k:  model.layers.3.self_attn.k_proj
k:  model.layers.3.self_attn.v_proj
k:  model.layers.4.self_attn.k_proj
k:  model.layers.4.self_attn.v_proj
k:  model.layers.5.self_attn.k_proj
k:  model.layers.5.self_attn.v_proj
k:  model.layers.6.self_attn.k_proj
k:  model.layers.6.self_attn.v_proj
k:  model.layers.7.self_attn.k_proj
k:  model.layers.7.self_attn.v_proj
k:  model.layers.8.self_attn.k_proj
k:  model.layers.8.self_attn.v_proj
k:  model.layers.9.self_attn.k_proj
k:  model.layers.9.self_attn.v_proj
k:  model.layers.10.self_attn.k_proj
k:  model.layers.10.self_attn.v_proj
k:  model.layers.11.self_attn.k_proj
k:  model.layers.11.self_attn.v_proj
Model type : llama
Benchmarking ...
Traceback (most recent call last):
  File "/root/KVQuant/deployment/llama.py", line 224, in <module>
    benchmark(model, input_ids, check=args.check)
  File "/root/KVQuant/deployment/llama.py", line 82, in benchmark
    out = model(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 2683, in forward
    outputs = self.model(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 2565, in forward
    layer_outputs = decoder_layer(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1582, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 2250, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 1965, in forward
    attn_weights = self.kcache.forward_fused_sparse(query_states, key_states)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 710, in forward_fused_sparse
    outliers_rescaled = outliers_rescaled.cpu()
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

According to my understanding, the error appears to be related to the CUDA kernel implementation "vecquant4appendvecKsparse", which modifies the variable "outliers_rescaled".

3. Environment

  • OS: Ubuntu 20.04 LTS
  • GPU: Tesla P100-PCIE-16GB
  • Packages (pip list):
Package                  Version     Editable project location
------------------------ ----------- -------------------------------------
accelerate               0.29.3
aiohttp                  3.9.5
aiosignal                1.3.1
async-timeout            4.0.3
attrs                    23.2.0
certifi                  2024.2.2
charset-normalizer       3.3.2
datasets                 2.19.0
dill                     0.3.8
einops                   0.8.0
filelock                 3.14.0
flash-attn               2.5.8
frozenlist               1.4.1
fsspec                   2024.3.1
huggingface-hub          0.23.0
idna                     3.7
Jinja2                   3.1.3
kvquant                  0.1.0       /root/KVQuant/deployment
MarkupSafe               2.1.5
mpmath                   1.3.0
multidict                6.0.5
multiprocess             0.70.16
networkx                 3.2.1
ninja                    1.11.1.1
numpy                    1.26.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.20.5
nvidia-nvjitlink-cu12    12.4.127
nvidia-nvtx-cu12         12.1.105
packaging                24.0
pandas                   2.2.2
pip                      23.3.1
protobuf                 5.26.1
psutil                   5.9.8
pyarrow                  16.0.0
pyarrow-hotfix           0.6
python-dateutil          2.9.0.post0
pytz                     2024.1
PyYAML                   6.0.1
quant-cuda               0.0.0
regex                    2024.4.28
requests                 2.31.0
safetensors              0.4.3
sentencepiece            0.2.0
setuptools               68.2.2
six                      1.16.0
sympy                    1.12
tokenizers               0.15.2
torch                    2.3.0
tqdm                     4.66.4
transformers             4.38.0.dev0 /root/KVQuant/deployment/transformers
triton                   2.3.0
typing_extensions        4.11.0
tzdata                   2024.1
urllib3                  2.2.1
wheel                    0.43.0
xxhash                   3.4.1
yarl                     1.9.4

Due to hardware constraints, I intend to perform a quick test on the smaller model weights as indicated above. KVQuant is expected to work properly, as the smaller model differs from Llama-7B only in terms of weight size while sharing a similar architecture.

4. Related solutions that I have tried

As suggested in the discussion of this CUDA error at https://github.com/pytorch/pytorch/issues/21819, I have updated CUDA, torch, and other relevant components to the latest versions. However, I am still encountering the same error.

What could be causing this error, and how can I solve it?

Thanks in advance!

Problem when reproducing experiment

Thanks for the great work!
I'm having a little problem reproducing the PPL results in the paper. I used the code snippet from the GPTQ repo for measuring PPL and was able to reproduce the fp16 baselines for the LLaMA family in the paper, but I was unable to reproduce the fp16 baseline for Mistral-7B using the same test code:

# model, tokenizer, and input_len are assumed to be defined elsewhere
import torch
import torch.nn as nn
from datasets import load_dataset
from tqdm import tqdm

testdata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
testenc = tokenizer("\n\n".join(testdata['text']), return_tensors='pt')["input_ids"]

nsamples = testenc.numel() // input_len
nlls = []

loss_fct = nn.CrossEntropyLoss()
for i in tqdm(range(nsamples)):
    batch = testenc[:, (i * input_len) : ((i + 1) * input_len)].to(model.device)
    with torch.no_grad():
        outputs = model.model(batch)
        hidden_states = outputs[0]
        logits = model.lm_head(hidden_states)
    shift_logits = logits[:, :-1, :]
    shift_labels = batch[:, 1:].to(model.lm_head.weight.device)
    loss = loss_fct(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
    neg_log_likelihood = loss.float() * input_len
    nlls.append(neg_log_likelihood)

ppl = torch.exp(torch.stack(nlls).sum() / (nsamples * input_len)).item()

Specifically, I used Mistral-7B-v0.1 and tried seqlen=8000 as well as seqlen=8192; both results came out slightly lower than those in the paper, which gave us a bit of trouble.

Will you release the code you used for measuring PPL?

AttributeError: 'LlamaModel' object has no attribute 'split_gpus'

When I run:
CUDA_VISIBLE_DEVICES=0 python llama_simquant.py --abits 4 --nsamples 16 --seqlen 2048 --nuq --fisher --quantize --include_sparse --sparsity-threshold 0.99 --quantizer_path quantizers.pickle ;

I get this error:
AttributeError: 'LlamaModel' object has no attribute 'split_gpus'

What is the problem?

Question about storage

Thanks for your great work and the open-sourced code!
I have some questions about the storage of the sparse matrix. Could you please provide the code to reproduce Table 10 from the ablation experiments in your paper?
Thanks a lot!!!

PRE-ROPE quantization during inference

Thanks for the great work! I am curious about the time complexity of the pre-rope quantization.

In detail, I assume the operations occur in the following order with pre-RoPE quantization during inference: qkv_projection_matmul -> quantize_k -> write_cache_k -> load_cache_k -> dequantize_k -> rope_k -> transpose_k. However, in the decode phase the sequence length grows by one per step, making it necessary to apply rope_k to all the previous token features at each step. This gives O(m*m) time complexity overall, where m is the sequence length.

This differs from the post-RoPE case, because there the cache holds post-RoPE quantized keys, so the time complexity is O(m).

One workaround is saving the RoPE result to another cache, making the time complexity O(m), but this costs much more storage space. Another way, I suppose, is to overwrite the cache with post-RoPE keys (bfloat16/float16), but this conflicts with the default cache dtype (INT4/INT2).
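
For concreteness, a rough count of rotary applications under the two caching schemes (an illustration of the argument above, not the KVQuant kernels):

# Illustrative count of RoPE applications during decoding (not KVQuant code).
def total_rope_ops(num_steps, pre_rope_cache=True):
    ops = 0
    for m in range(1, num_steps + 1):
        # pre-RoPE cache: rotate all m cached keys at this step;
        # post-RoPE cache: rotate only the newly appended key.
        ops += m if pre_rope_cache else 1
    return ops

print(total_rope_ops(1024, pre_rope_cache=True))    # m*(m+1)/2 -> O(m^2) total
print(total_rope_ops(1024, pre_rope_cache=False))   # m         -> O(m) total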

Please correct me if anything above is wrong. Looking forward to your reply. Thanks.

reproduce the ablation results in Figure 1

Thanks for your great work!
I want to reproduce the ablation results presented in Figure 1 of the paper. According to Figure 1, Per-Channel Key Quantization + Pre-RoPE Key Quantization yields PPL=6.34 in the LLaMA-7B (3-bit) setting. However, I got PPL=6.71 by running the following command:
CUDA_VISIBLE_DEVICES=0 python llama_simquant.py <path-to-llama-7b-hf> --abits 4 --nsamples 16 --seqlen 2048 --quantize --quantizer_path quantizers.pickle ;
I can't figure out why. Would you please give me some advice? Thank you so much!

The value of self.include_sparse being 0 causes an assert(False) error

Excuse me, when executing cache-llama-activations.py in the deployment directory to generate activations.pickle, an assert(False) error is raised in the QuantK class's parallel_pack function in the deployment/transformers/src/transformers/models/llama/modeling_llama.py file, with self.include_sparse set to 0, as shown in the image. It seems that there is an issue with the workflow.

The quantizers.pickle file has been successfully generated. Should the instructions in the README file be adjusted in order to generate activations.pickle successfully?

Where is the code for "ATOM-4bit" in the KVQuant codebase?

Thank you for your great work!

Now I want to reproduce the perplexity of LLaMA-7B on WikiText-2 with the "ATOM-4bit" method, but I cannot find the code in KVQuant.
Should I clone the Atom repo and reproduce the perplexity there?
Looking forward to your reply. Thanks.
