
i-bert's Introduction


I-BERT: Integer-only BERT Quantization

HuggingFace Implementation

I-BERT is also available in the master branch of HuggingFace Transformers! Visit the following links for the HuggingFace implementation.

Github Link: https://github.com/huggingface/transformers/tree/master/src/transformers/models/ibert

Model Links:
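
As a quick sanity check of the HuggingFace port, the model can be loaded directly from the Hub. The snippet below is a minimal sketch: the kssteven/ibert-roberta-base checkpoint name and the quant_mode keyword are assumptions on our part, not items listed above.

# Minimal sketch (assumes a transformers version with I-BERT support and the
# kssteven/ibert-roberta-base checkpoint on the HuggingFace Hub).
from transformers import AutoTokenizer, IBertModel

tokenizer = AutoTokenizer.from_pretrained("kssteven/ibert-roberta-base")
model = IBertModel.from_pretrained("kssteven/ibert-roberta-base", quant_mode=False)

inputs = tokenizer("I-BERT runs integer-only inference.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)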

Installation & Requirements

You can find more detailed installation guides from the Fairseq repo: https://github.com/pytorch/fairseq

1. Fairseq Installation

Reference: Fairseq

  • PyTorch version >= 1.4.0
  • Python version >= 3.6
  • Currently, I-BERT only supports training on GPU
git clone https://github.com/kssteven418/I-BERT.git
cd I-BERT
pip install --editable ./

2. Download pre-trained RoBERTa models

Reference: Fairseq RoBERTa

Download pretrained RoBERTa models from the links and unzip them.

# In I-BERT (root) directory
mkdir models && cd models
wget {link}
tar -xvf roberta.{base|large}.tar.gz

3. Download GLUE datasets

Reference: Fairseq Finetuning on GLUE

First, download the data from the GLUE website. Make sure to download the dataset into the I-BERT (root) directory.

# In I-BERT (root) directory
wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
python download_glue_data.py --data_dir glue_data --tasks all

Then, preprocess the data.

# In I-BERT (root) directory
./examples/roberta/preprocess_GLUE_tasks.sh glue_data {task_name}

task_name can be one of the following: {ALL, QQP, MNLI, QNLI, MRPC, RTE, STS-B, SST-2, CoLA}. ALL will preprocess all the tasks. If the command runs properly, the preprocessed datasets will be stored in I-BERT/{task_name}-bin

Now, you have the models and the datasets ready, so you are ready to run I-BERT!

Task-specific Model Finetuning

Before quantizing the model, you first have to finetune the pre-trained models on a specific downstream task. Although you can finetune the model from the original Fairseq repo, we provide the ibert-base branch where you can train non-quantized models without having to install the original Fairseq. This branch is identical to the master branch of the original Fairseq repo, except for some logging and run scripts that are irrelevant to the functionality. If you already have finetuned models, you can skip this part.

Run the following commands to fetch and move to the ibert-base branch:

# In I-BERT (root) directory
git fetch
git checkout -t origin/ibert-base

Then, run the script:

# In I-BERT (root) directory
# CUDA_VISIBLE_DEVICES={device} python run.py --arch {roberta_base|roberta_large} --task {task_name}
CUDA_VISIBLE_DEVICES=0 python run.py --arch roberta_base --task MRPC

Checkpoints and validation logs will be stored in the ./outputs directory. You can change this output location by adding the option --output-dir OUTPUT_DIR. The exact output location will look something like: ./outputs/none/MRPC-base/wd0.1_ad0.1_d0.1_lr2e-5/1219-101427_ckpt/checkpoint_best.pt. By default, models are trained according to the task-specific hyperparameters specified in Fairseq Finetuning on GLUE. However, you can also specify the hyperparameters with the options (use the option -h for more details).

Quantization & Quantization-Aware Finetuning

Now, we come back to the ibert branch for quantization.

git checkout ibert

Then run the script. This will first quantize the model and then perform quantization-aware finetuning with the learning rate you specify via the option --lr {lr}.

# In I-BERT (root) directory
# CUDA_VISIBLE_DEVICES={device} python run.py --arch {roberta_base|roberta_large} --task {task_name} \
# --restore-file {ckpt_path} --lr {lr}
CUDA_VISIBLE_DEVICES=0 python run.py --arch roberta_base --task MRPC --restore-file ckpt-best.pt --lr 1e-6

NOTE: Our work is still in progress. Currently, all integer operations are executed in floating point.

Citation

I-BERT has been developed as part of the following paper. Please cite it if you find the library useful for your work:

@article{kim2021bert,
  title={I-BERT: Integer-only BERT Quantization},
  author={Kim, Sehoon and Gholami, Amir and Yao, Zhewei and Mahoney, Michael W and Keutzer, Kurt},
  journal={International Conference on Machine Learning (Accepted)},
  year={2021}
}

i-bert's People

Contributors

alexeib, cndn, edunov, freewym, halilakin, hitvoice, huihuifan, jhcross, jingfeidu, jma127, joshim5, kahne, kartikayk, kssteven418, lematt1991, liezl200, liuchen9494, louismartin, maigoakisame, multipath, myleott, ngoyal2707, nng555, pipibjc, skritika, stephenroller, tangyuq, theweiho, xianxl, xu-song


i-bert's Issues

Storing both float32 and int parameters

Hi

It looks like, at least in the HF code, you are storing both the float32 AND the int weights, which increases the memory footprint. Don't you want to load only one or the other, or at least have an option to quantize and send to CUDA that clears either the float32 or the int version, thus lowering the memory footprint? Alternatively, you could overload to() (or cuda(), or whatever method is used to move the model to CUDA) so that it only moves over the right parameters.

Thanks

Wrong script of downloading GLUE datasets

๐Ÿ› Bug

The script for downloading the GLUE datasets in the README (ibert branch) is broken.

"wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py"

It returns "403 Forbidden": as W4ngatang said, they are no longer hosting the GLUE data on their server (more details in the comments of https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/). There are also some bugs in the urllib calls.

🎈 Fix methods

I fixed it with the correct data URLs and I/O operations here: https://gist.github.com/Alexiazzf/67f8a3ba31bf489e6b4242e9c1f516a8

Why use 22 bit quantized activations for some layer norms (except in Embeddings)?

Hi,
I've noticed that the QuantAct layers preceding IntLayerNorm in the IBertSelfOutput and IbertOutput modules specify a 22 bit activation width while the QuantAct layer preceding IntLayerNorm in IBertEmbedding specifies a 16 bit activation.

I couldn't find any mention of these bit width choices in the paper. Could you please explain why these choices have been made?

Thank you!

Cannot run inference with the quantized model in int8 on my device

I want to test the accuracy and time consumption of I-BERT with int8 inference, so I installed transformers and tried to quantize the RoBERTa-base model to generate the weights. I have set quant_mode to true and the torch dtype to int8. However, the int8 inference time of I-BERT is similar to that of the RoBERTa-base model on my 1080 Ti. Is there a problem with my config.json or my device? Here is my config.json:
{
  "_name_or_path": "./outputs/checkpoint-1150/",
  "architectures": [
    "IBertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "finetuning_task": "mrpc",
  "force_dequant": "none",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": 0,
    "1": 1
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "0": 0,
    "1": 1
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "ibert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "quant_mode": true,
  "tokenizer_class": "RobertaTokenizer",
  "torch_dtype": "int8",
  "transformers_version": "4.11.0.dev0",
  "type_vocab_size": 1,
  "vocab_size": 50265
}


IBert problems with quant_mode=true

Dear Editor,
My first step is to do full-precision finetuning, and I set quant_mode: true. Then I carry out the integer-only finetuning. When I test the integer-only finetuned model on MRPC, the result is very bad. Could you give some guidance? (When I test an MRPC sample, the result is tensor([[0.5003, 0.4997]], grad_fn=))

{
  "_name_or_path": "/home/rram/storage/cailei/nlp_project/fine_tune/standard_ibert_weights/ibert-roberta-base",
  "architectures": [
    "IBertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "finetuning_task": "mrpc",
  "force_dequant": "none",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "not_equivalent",
    "1": "equivalent"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "equivalent": 1,
    "not_equivalent": 0
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "ibert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "quant_mode": true,
  "tokenizer_class": "RobertaTokenizer",
  "torch_dtype": "int8",
  "transformers_version": "4.12.0.dev0",
  "type_vocab_size": 1,
  "vocab_size": 50265
}

About scaling_factor

Dear Authors,
Thanks for sharing this valuable code.

I'm trying to use your code for vision transformer quantization.

About the scaling factor for this work, I have some questions.
If I want to swap some layers (e.g. GELU -> IntGELU), I have to provide the scaling factor as an input argument.

For this, I suppose that I can add QuantAct in the forward function of IntGELU,

class IntGELU(nn.Module):

    def forward(self, x, scaling_factor=None):
        if not self.quant_mode:
            return self.activation_fn(x), None
        x, scaling_factor = QuantAct(32, quant_mode=self.quant_mode)
Is this right? Could you please give me some advice?

Thanks in advance.
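
For reference, here is a minimal sketch of the pattern being asked about, assuming the QuantAct and IntGELU interfaces from transformers.models.ibert.quant_modules (an illustration, not the authors' recommended code). The key point is that QuantAct is registered as a submodule in __init__ rather than constructed inside forward, so its running activation range persists across training steps and is saved with the model.

import torch.nn as nn
# Sketch only: a 32-bit QuantAct in front of IntGELU produces the scaling
# factor that IntGELU expects as its second argument.
from transformers.models.ibert.quant_modules import IntGELU, QuantAct

class QuantGELU(nn.Module):
    def __init__(self, quant_mode=True):
        super().__init__()
        self.pre_act = QuantAct(32, quant_mode=quant_mode)
        self.act = IntGELU(quant_mode=quant_mode)

    def forward(self, x, scaling_factor=None):
        # QuantAct returns the (fake-)quantized activation and its scaling factor
        x, scaling_factor = self.pre_act(x, scaling_factor)
        # IntGELU likewise returns (output, scaling_factor)
        return self.act(x, scaling_factor)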

About CUDA out of memory

When I try the quantization step in this code, it cannot continue to run; an error occurs: CUDA out of memory. What can I do to solve this issue in the quantization step? Are there any parameters I can change? Thank you so much!

quantize other roberta model

I am trying to use I-BERT through transformers. Can I quantize other RoBERTa models with it?
If so, what parameters should I change, and how do I convert them?

Bugs in the code

๐Ÿ› Bug

  1. x_scaling_factor in transformer_sentence_encoder.py:
    I think the code here should change scaling_factor to x_scaling_factor. Or do we use the same scaling factor for all transformer encoder layers?
  2. freeze_model for IntSoftmax:
    I think we should implement a fix function in IntSoftmax because there is a QuantAct inside it. If we don't implement the fix function, we will skip fixing the QuantAct in IntSoftmax from here.

Problem

I have recently been trying to implement I-BERT on TVM, and I found that I need to add the FixedPointMul and SymmetricQuantFunction operators to TVM. Do you have any existing implementation? Thanks!

Pre-trained weights for specific tasks

Hi, thank you for releasing this project!

I was wondering if you happen to have the pre-trained weights for the models finetuned on the different downstream tasks (QQP, MNLI, etc.), i.e. the initialisation weights for the quantisation-aware fine-tuning stage. I only mention this because it would save me a lot of time and compute, and it may be helpful to others too.

Thanks

Quantization on trained model

โ“ Questions and Help

Hello,
Great paper! kudos!
After reading, I was wondering: is it possible to use these quantization methods on an already trained model from HuggingFace Transformers, or do we have to re-train the model with I-BERT?

Another setting for quantization

Thanks for the great work.

It uses 32-bit integer precision for the activations and the softmax.

However, the self-attention result cannot exceed 26 bits (8 bits x 8 bits x 10 bits (768 channels)).

I want to try 16-bit precision instead (16-bit quantization together with the softmax and GELU algorithms).
Would 16 bits cause any problems?
If not, I would like to know how to implement this.

Training in mixed precision

โ“ Questions and Help


What is your question?

Hi, thanks for the amazing contribution!
I'm trying to use IBert from Huggingface/transformers (4.4.2) in my own training pipeline, where I'm fine-tuning in quant mode with mixed precision (using PyTorch's cuda.amp module). This results in overflows in the QuantLinear layers, which causes subsequent training to break due to NaNs. I'm considering artificially clamping the weights to a smaller range to avoid this, or using a lower bit precision (from 8 down to, say, 4) while fine-tuning.

I'm wondering if you have tried this or have any suggestions about my approaches that could help me train effectively.

Thanks.

Code

with autocast(enabled=grad_scaler.is_enabled()):
    # TRAINING CODE...

I'm unable to post any more code (proprietary stuff, sorry!), but I can provide some specifics if you need them.
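
One workaround worth noting here (not something proposed in this thread, just a common AMP pattern): modules that overflow in float16 can be kept out of autocast by nesting a disabled autocast region around them, so the QuantLinear layers always see float32.

import torch

def forward_in_fp32(model, batch):
    # Sketch: disable autocast locally so the quantized model's forward pass
    # runs in float32 even inside a mixed-precision training step.
    # `model` and `batch` are placeholders for the user's own objects.
    with torch.cuda.amp.autocast(enabled=False):
        return model(**batch)

# Hypothetical usage inside the usual AMP training loop:
# with torch.cuda.amp.autocast(enabled=grad_scaler.is_enabled()):
#     outputs = forward_in_fp32(model, batch)
#     loss = outputs.loss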

What have you tried?

What's your environment?

  • fairseq Version (e.g., 1.0 or master):
  • PyTorch Version (e.g., 1.0): 1.8.0
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed fairseq (pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version: 10.1/7.6.5
  • GPU models and configuration:
  • Any other relevant information:

How is the scaling factor S implemented with integer?

Thanks for sharing the code. This is really a great paper with clear and impressive ideas. I'm sorry if I have misunderstood something; I'm a little confused about some details. In the paper, the authors claim the model is implemented with integer-only arithmetic, yet in Algorithms 1-4, as well as in Equations 1 and 2, the scaling factor is not an integer but appears to be a floating-point number. In other words, it needs to do floating-point scaling, which is also demonstrated on this line. I wonder if I have misunderstood this and how it is implemented with integer arithmetic. Thanks a lot!
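
For context, the paper's answer to this is dyadic arithmetic: a real-valued requantization scale is approximated as b / 2^c with integers b and c, so rescaling an integer tensor reduces to an integer multiply followed by a bit shift. A minimal sketch of that idea (illustrative values, not the repository's code):

import torch

def dyadic_approx(scale, c=31):
    # Approximate a real scale as b / 2**c with integer b.
    b = int(round(scale * (1 << c)))
    return b, c

x_int = torch.randint(-128, 128, (5,), dtype=torch.int64)
scale = 0.0123                      # e.g. a combined scale like S_in * S_w / S_out
b, c = dyadic_approx(scale)

# Round-to-nearest rescaling using only integer multiply, add, and shift.
y_int = (x_int * b + (1 << (c - 1))) >> c
print(y_int)
print(torch.round(x_int.float() * scale).long())  # floating-point reference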

Latency 20x with quant_mode = true

In the HuggingFace config, I set quant_mode = true.
The weight_integer buffer remains 0, and the result is wrong.
Moreover, the inference latency in integer mode is 20 times that of float mode.
Can you please explain the reason?

Possible bug in IntSoftmax

๐Ÿ› Bug


I've been trying to add the Ibert quantization modules to distilbert and ran into this issue.

scaling_factor = 1 / 2 ** self.output_bit

is a float value and is returned as is. I believe this should be converted to a tensor on the appropriate device before returning, e.g. scaling_factor = torch.tensor([1 / 2 ** self.output_bit], device=exp_int.device).

Please let me know your thoughts on this. Thanks!

Task name references in strings are wrong

Some inconsistencies arose due to previous commits, preventing fine-tuning for these tasks (SST-2 and STS-B).
It would be better to use an enumeration to prevent such issues in the future. For now, I'm providing two lightweight PRs (#27 and #28) to address this issue.
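
As an illustration of the enumeration idea (hypothetical names, not code from this repository):

from enum import Enum

class GlueTask(str, Enum):
    # One central place for the task-name strings, so scripts reference
    # GlueTask.SST_2.value instead of retyping "SST-2" by hand.
    SST_2 = "SST-2"
    STS_B = "STS-B"
    MRPC = "MRPC"
    COLA = "CoLA"

print(GlueTask.SST_2.value)  # "SST-2"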

Missing deployment part on TensorRt

โ“ Questions and Help

You refer to a TensorRT deployment in the paper and on HuggingFace, but I can't find the code.
Do you plan to share it too?

As far as I know, the NVIDIA repo only has examples for their own models (all BERT-based); it's a bit hard to try it on our own without an example.

python run.py --arch roberta_base --task STS-B

When I try " CUDA_VISIBLE_DEVICES=0 python run.py --arch roberta_base --task STS-B(or SST-2)", there are always "FileNotFoundError: [Errno 2] No such file or directory: 'STS-B-bin/label/dict.txt'", but the glue_dataset is downloaded.
How can I solve it?

(huggingface) The output of IBERT is float. Am I doing something wrong?

โ“ Questions and Help

What is your question?

I'm using the HuggingFace implementation. Even though I set quant_mode=True, I see that the output of IBert is float32. Am I using the model wrong, or is this expected?

Code

self.bert = AutoModel.from_pretrained(
    base_model, quant_mode=quant_mode, add_pooling_layer=False
)

...


def forward(
        self,
        input_ids: Tensor,
        attention_mask: Tensor,
        k: int = None,
        return_layers: List[int] = None,
        return_orig: bool = False,
    ):
        bert_out = self.bert(
            input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
            return_dict=True,
        )

        # the output dtype is float32!
        print(bert_out.hidden_states[0])

What's your environment?

  • PyTorch Version: 1.7.1
  • OS (e.g., Linux): Ubuntu 20
  • How you installed fairseq (pip, source): No
  • Python version: 3.8.5
  • CUDA/cuDNN version: 11.0
  • GPU models and configuration: RTX 3090

Arguments in run.py

๐Ÿ› Bug

The hyperparameters in run.py do not correspond to the task_speces in run.py, line 73.

Error when running download_glue_data.py

๐Ÿ› Bug

Running python download_glue_data.py --data_dir glue_data --tasks all gives errors;

it looks like the links in the code can't be accessed:

error:
  code: 403
  message: "Permission denied."

To Reproduce

Running python download_glue_data.py --data_dir glue_data --tasks all

errors

C:\work\AIBenchmarking\app\I-BERT>py download_glue_data.py --data_dir glue_data --tasks all
Downloading and extracting CoLA...
Traceback (most recent call last):
  File "C:\work\AIBenchmarking\app\I-BERT\download_glue_data.py", line 141, in <module>
    sys.exit(main(sys.argv[1:]))
  File "C:\work\AIBenchmarking\app\I-BERT\download_glue_data.py", line 137, in main
    download_and_extract(task, args.data_dir)
  File "C:\work\AIBenchmarking\app\I-BERT\download_glue_data.py", line 47, in download_and_extract
    urllib.request.urlretrieve(TASK2PATH[task], data_file)
  File "C:\Users\a00646069\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 241, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\Users\a00646069\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\a00646069\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 525, in open
    response = meth(req, response)
  File "C:\Users\a00646069\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 634, in http_response
    response = self.parent.error(
  File "C:\Users\a00646069\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 563, in error
    return self._call_chain(*args)
  File "C:\Users\a00646069\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 496, in _call_chain
    result = func(*args)
  File "C:\Users\a00646069\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Environment

Python 3.10.2


Where can I find the integer-sqrt kernel?

❓ Questions and Help

