Giter Club home page Giter Club logo

autoquip's Introduction

QuIP

Based on the great works of Cornell-RelaxML/quip-sharp (the original implementation) and chu-tianxiang/QuIP-for-all (majority of this repo).

AutoQuIP is designed to provide an easy API for quantizing and loading QuIP# models.

The checkpoints provided here are incompatible with the original team's, due to a few factors:

  • Every linear layer is quantized seperately without fusions (i.e. concatenating QKV layers)
  • The packing format is slightly different, weights are simply packed as (outdim, indim / codesz) shape without complex permuting.

Usage

Clone the repository, and install via pip:

git clone https://github.com/AlpinDale/AutoQuIP

pip install -e .

Quantize

Please refer to the author's blog for a thorough explanation of the QuIP# algorithm.

After installation, you can use the CLI app auto_quip to quantize a model. Simply run auto_quip --help for a list of options.

If you wish to do this programmatically via the API, you can use the following example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_quip import QuipQuantizer

model_name = "meta-llama/Llama-2-70b-hf"
quant_dir = "llama-70b_2bit_quip"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

quant = QuipQuantizer(codebook="E8P12", dataset="c4", nsamples=4096)
quant_model = quant.quantize_model(model, tokenizer)
quant.save(quant_model, quant_dir)
tokenizer.save_pretrained(quant_dir)

Arguments for QuipQuantizer includes:

  • codebook: the algorithm for quantization, including E8P12(2-bit), E8P12RVQ3B(3-bit), E8P12RVQ4B(4-bit), D4(2-bit), HI(4-bit). D4 and HI are relatively inferior to E8P12-based methods.
  • dataset: the data used for calibration, supports c4, ptb, wikitext2.
  • nsamples: number of samples used for calibration, larger is slower. By default 4096 samples of calibration data will be used as suggested by the author, which is very time consuming.
  • quip_tune_iters: Greedy update passes of the algorithm, higher is slower but yields slightly better quanlity. Default to 10.
  • use_rand: when the dim is not powers of 2, say dim = 2^n * base, use_rand will decompose it into 2^n Hadamard matrix and base x base random orthogonal matrices, instead of decomposing to two Hadamard matrix in the original implementation which is not always feasible. Default to true.
  • modules_to_not_convert: the name of layers not to quantize, useful for MOE models where gate layer is often unquantized.
  • merge_suv: trick to cancel out some vectors to reduce calculation. Only support llama, mixtral and qwen. Default to false.

Inference

from transformers import AutoTokenizer
from auto_quip import load_quantized_model

quant_dir = "llama-70b_2bit_quip"

quant_model = load_quantized_model(quant_dir).cuda()
tokenizer = AutoTokenizer.from_pretrained(quant_dir)

input_ids = tokenizer.encode("The capital of France is", return_tensors="pt").cuda()
print(tokenizer.decode(quant_model.generate(input_ids, do_sample=True)[0]))

autoquip's People

Contributors

chu-tianxiang avatar alpindale avatar

Stargazers

Nikolaus Schlemm avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.