mit-han-lab / hardware-aware-transformers


[ACL'20] HAT: Hardware-Aware Transformers for Efficient Natural Language Processing

Home Page: https://hat.mit.edu

License: Other

Python 95.70% Shell 2.96% C++ 0.51% Cuda 0.83%

hardware-aware transformer specialization efficient-model natural-language-processing machine-translation

hardware-aware-transformers's Introduction

HAT: Hardware-Aware Transformers for Efficient Natural Language Processing [paper] [website] [video]

@inproceedings{hanruiwang2020hat,
    title     = {HAT: Hardware-Aware Transformers for Efficient Natural Language Processing},
    author    = {Wang, Hanrui and Wu, Zhanghao and Liu, Zhijian and Cai, Han and Zhu, Ligeng and Gan, Chuang and Han, Song},
    booktitle = {Annual Conference of the Association for Computational Linguistics},
    year      = {2020}
}

Overview

We release the PyTorch code and 50 pre-trained models for HAT: Hardware-Aware Transformers. Within a Transformer supernet (SuperTransformer), we efficiently search for a specialized fast model (SubTransformer) for each target hardware platform, using latency feedback from that hardware. The search cost is reduced by over 10000×.

HAT Framework overview:

HAT models achieve up to 3× speedup and 3.7× smaller model size with no performance loss.

Usage

Installation

To install from source and develop locally:

git clone https://github.com/mit-han-lab/hardware-aware-transformers.git
cd hardware-aware-transformers
pip install --editable .

Data Preparation

Task	task_name	Train	Valid	Test
WMT'14 En-De	wmt14.en-de	WMT'16	newstest2013	newstest2014
WMT'14 En-Fr	wmt14.en-fr	WMT'14	newstest2012&2013	newstest2014
WMT'19 En-De	wmt19.en-de	WMT'19	newstest2017	newstest2018
IWSLT'14 De-En	iwslt14.de-en	IWSLT'14 train set	IWSLT'14 valid set	IWSLT14.TED.dev2010 IWSLT14.TEDX.dev2012 IWSLT14.TED.tst2010 IWSLT14.TED.tst2011 IWSLT14.TED.tst2012

To download and preprocess data, run:

bash configs/[task_name]/preprocess.sh

If you find preprocessing time-consuming, you can directly download the preprocessed data we provide:

bash configs/[task_name]/get_preprocessed.sh

Testing

We provide pre-trained models (SubTransformers) on the machine translation tasks for evaluation. The #Params and FLOPs exclude the embedding lookup table and the last output layers because those depend on the task.

Task	Hardware	Latency	#Params (M)	FLOPs (G)	BLEU	Sacre BLEU	model_name	Link
WMT'14 En-De	Raspberry Pi ARM Cortex-A72 CPU	3.5s	25.22	1.53	25.8	25.6	[email protected][email protected]	link
WMT'14 En-De	Raspberry Pi ARM Cortex-A72 CPU	4.0s	29.42	1.78	26.9	26.6	[email protected][email protected]	link
WMT'14 En-De	Raspberry Pi ARM Cortex-A72 CPU	4.5s	35.72	2.19	27.6	27.1	[email protected][email protected]	link
WMT'14 En-De	Raspberry Pi ARM Cortex-A72 CPU	5.0s	36.77	2.26	27.8	27.2	[email protected][email protected]	link
WMT'14 En-De	Raspberry Pi ARM Cortex-A72 CPU	6.0s	44.13	2.70	28.2	27.6	[email protected][email protected]	link
WMT'14 En-De	Raspberry Pi ARM Cortex-A72 CPU	6.9s	48.33	3.02	28.4	27.8	[email protected][email protected]	link
WMT'14 En-De	Intel Xeon E5-2640 CPU	137.9ms	30.47	1.87	25.8	25.6	[email protected][email protected]	link
WMT'14 En-De	Intel Xeon E5-2640 CPU	204.2ms	35.72	2.19	27.6	27.1	[email protected][email protected]	link
WMT'14 En-De	Intel Xeon E5-2640 CPU	278.7ms	40.97	2.54	27.9	27.3	[email protected][email protected]	link
WMT'14 En-De	Intel Xeon E5-2640 CPU	340.2ms	46.23	2.86	28.1	27.5	[email protected][email protected]	link
WMT'14 En-De	Intel Xeon E5-2640 CPU	369.6ms	51.48	3.21	28.2	27.6	[email protected][email protected]	link
WMT'14 En-De	Intel Xeon E5-2640 CPU	450.9ms	56.73	3.53	28.5	27.9	[email protected][email protected]	link
WMT'14 En-De	Nvidia TITAN Xp GPU	57.1ms	30.47	1.87	25.8	25.6	[email protected][email protected]	link
WMT'14 En-De	Nvidia TITAN Xp GPU	91.2ms	35.72	2.19	27.6	27.1	[email protected][email protected]	link
WMT'14 En-De	Nvidia TITAN Xp GPU	126.0ms	40.97	2.54	27.9	27.3	[email protected][email protected]	link
WMT'14 En-De	Nvidia TITAN Xp GPU	146.7ms	51.20	3.17	28.1	27.5	[email protected][email protected]	link
WMT'14 En-De	Nvidia TITAN Xp GPU	208.1ms	49.38	3.09	28.5	27.8	[email protected][email protected]	link
WMT'14 En-Fr	Raspberry Pi ARM Cortex-A72 CPU	4.3s	25.22	1.53	38.8	36.0	[email protected][email protected]	link
WMT'14 En-Fr	Raspberry Pi ARM Cortex-A72 CPU	5.3s	35.72	2.23	40.1	37.3	[email protected][email protected]	link
WMT'14 En-Fr	Raspberry Pi ARM Cortex-A72 CPU	5.8s	36.77	2.26	40.6	37.8	[email protected][email protected]	link
WMT'14 En-Fr	Raspberry Pi ARM Cortex-A72 CPU	6.9s	44.13	2.70	41.1	38.3	[email protected][email protected]	link
WMT'14 En-Fr	Raspberry Pi ARM Cortex-A72 CPU	7.8s	49.38	3.09	41.4	38.5	[email protected][email protected]	link
WMT'14 En-Fr	Raspberry Pi ARM Cortex-A72 CPU	9.1s	56.73	3.57	41.8	38.9	[email protected][email protected]	link
WMT'14 En-Fr	Intel Xeon E5-2640 CPU	154.7ms	30.47	1.84	39.1	36.3	[email protected][email protected]	link
WMT'14 En-Fr	Intel Xeon E5-2640 CPU	208.8ms	35.72	2.23	40.0	37.2	[email protected][email protected]	link
WMT'14 En-Fr	Intel Xeon E5-2640 CPU	329.4ms	44.13	2.70	41.1	38.2	[email protected][email protected]	link
WMT'14 En-Fr	Intel Xeon E5-2640 CPU	394.5ms	51.48	3.28	41.4	38.5	[email protected][email protected]	link
WMT'14 En-Fr	Intel Xeon E5-2640 CPU	442.0ms	56.73	3.57	41.7	38.8	[email protected][email protected]	link
WMT'14 En-Fr	Nvidia TITAN Xp GPU	69.3ms	30.47	1.84	39.1	36.3	[email protected][email protected]	link
WMT'14 En-Fr	Nvidia TITAN Xp GPU	94.9ms	35.72	2.23	40.0	37.2	[email protected][email protected]	link
WMT'14 En-Fr	Nvidia TITAN Xp GPU	132.9ms	40.97	2.51	40.7	37.8	[email protected][email protected]	link
WMT'14 En-Fr	Nvidia TITAN Xp GPU	168.3ms	46.23	2.90	41.1	38.3	[email protected][email protected]	link
WMT'14 En-Fr	Nvidia TITAN Xp GPU	208.3ms	51.48	3.25	41.7	38.8	[email protected][email protected]	link
WMT'19 En-De	Nvidia TITAN Xp GPU	55.7ms	36.89	2.27	42.4	41.9	[email protected][email protected]	link
WMT'19 En-De	Nvidia TITAN Xp GPU	93.2ms	42.28	2.63	44.4	43.9	[email protected][email protected]	link
WMT'19 En-De	Nvidia TITAN Xp GPU	134.5ms	40.97	2.54	45.4	44.7	[email protected][email protected]	link
WMT'19 En-De	Nvidia TITAN Xp GPU	176.1ms	46.23	2.86	46.2	45.6	[email protected][email protected]	link
WMT'19 En-De	Nvidia TITAN Xp GPU	204.5ms	51.48	3.18	46.5	45.7	[email protected][email protected]	link
WMT'19 En-De	Nvidia TITAN Xp GPU	237.8ms	56.73	3.53	46.7	46.0	[email protected][email protected]	link
IWSLT'14 De-En	Nvidia TITAN Xp GPU	45.6ms	16.82	0.78	33.4	32.5	[email protected][email protected]	link
IWSLT'14 De-En	Nvidia TITAN Xp GPU	74.5ms	19.98	0.93	34.2	33.3	[email protected][email protected]	link
IWSLT'14 De-En	Nvidia TITAN Xp GPU	109.0ms	23.13	1.13	34.5	33.6	[email protected][email protected]	link
IWSLT'14 De-En	Nvidia TITAN Xp GPU	137.8ms	27.33	1.32	34.7	33.8	[email protected][email protected]	link
IWSLT'14 De-En	Nvidia TITAN Xp GPU	168.8ms	31.54	1.52	34.8	33.9	[email protected][email protected]	link

Download models:

python download_model.py --model-name=[model_name]
# for example
python download_model.py [email protected][email protected]
# to download all models
python download_model.py --download-all

Test BLEU (SacreBLEU) score:

bash configs/[task_name]/test.sh \
    [model_file] \
    configs/[task_name]/subtransformer/[model_name].yml \
    [normal|sacre]
# for example
bash configs/wmt14.en-de/test.sh \
    ./downloaded_models/[email protected][email protected] \
    configs/wmt14.en-de/subtransformer/[email protected][email protected] \
    normal
# another example
bash configs/iwslt14.de-en/test.sh \
    ./downloaded_models/[email protected][email protected] \
    configs/iwslt14.de-en/subtransformer/[email protected][email protected] \
    sacre

Test Latency, model size and FLOPs

To profile the latency, model size and FLOPs (FLOPs profiling needs torchprofile), you can run the commands below. By default, only the model size is profiled:

python train.py \
    --configs=configs/[task_name]/subtransformer/[model_name].yml \
    --sub-configs=configs/[task_name]/subtransformer/common.yml \
    [--latgpu|--latcpu|--profile-flops]
# for example
python train.py \
    --configs=configs/wmt14.en-de/subtransformer/[email protected][email protected] \
    --sub-configs=configs/wmt14.en-de/subtransformer/common.yml --latcpu
# another example
python train.py \
    --configs=configs/iwslt14.de-en/subtransformer/[email protected][email protected] \
    --sub-configs=configs/iwslt14.de-en/subtransformer/common.yml --profile-flops
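
As a rough illustration of what a latency measurement involves, the sketch below times an arbitrary Python callable with untimed warm-up runs and reports the mean and standard deviation. It is not the code behind --latgpu or --latcpu: measure_latency, warmup, runs and dummy_workload are hypothetical names, and the workload only stands in for a model forward pass. On a GPU, the device would also need to be synchronized before the clock is read.

```python
import statistics
import time


def measure_latency(fn, warmup=3, runs=10):
    """Return (mean_ms, stdev_ms) of fn() over `runs` timed calls."""
    # Warm-up calls are not timed; they let caches and lazy initialization settle.
    for _ in range(warmup):
        fn()
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    # statistics.stdev needs at least two samples, so keep runs >= 2.
    return statistics.mean(samples_ms), statistics.stdev(samples_ms)


def dummy_workload():
    # Stand-in for a model forward pass (assumption for illustration only).
    return sum(i * i for i in range(10_000))


mean_ms, stdev_ms = measure_latency(dummy_workload)
print(f"latency: {mean_ms:.3f} ms +/- {stdev_ms:.3f} ms")
```

Averaging over several runs after a warm-up is a common way to reduce noise from one-off effects such as memory allocation on the first call.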

Training

1. Train a SuperTransformer

The SuperTransformer is a supernet that contains many SubTransformers with weight sharing. By default, we train WMT tasks on 8 GPUs. Please adjust --update-freq according to the number of GPUs (128/x for x GPUs). Note that for IWSLT, we train on only one GPU with --update-freq=1.

python train.py --configs=configs/[task_name]/supertransformer/[search_space].yml
# for example
python train.py --configs=configs/wmt14.en-de/supertransformer/space0.yml
# another example
CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --configs=configs/wmt14.en-fr/supertransformer/space0.yml --update-freq=32
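
The 128/x rule for WMT tasks can be written as a small helper. This is only a sketch of the arithmetic; update_freq_for is a hypothetical name and is not part of the repository. It rejects GPU counts that do not divide 128 evenly, which is an assumption made here to keep the effective batch size unchanged.

```python
def update_freq_for(num_gpus, base=128):
    """Apply the 128/x rule: gradient-accumulation steps for x GPUs."""
    if num_gpus <= 0 or base % num_gpus != 0:
        raise ValueError(f"{num_gpus} GPUs does not divide {base} evenly")
    return base // num_gpus


for gpus in (1, 2, 4, 8):
    print(f"{gpus} GPU(s): --update-freq={update_freq_for(gpus)}")
```

With 4 GPUs this gives 32, which matches the --update-freq=32 used in the four-GPU example above.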

The --configs file specifies the SuperTransformer model architecture, the SubTransformer search space, and the training settings.

We also provide pre-trained SuperTransformers for the four tasks as below. To download, run python download_model.py --model-name=[model_name].

Task	search_space	model_name	Link
WMT'14 En-De	space0	HAT_wmt14ende_super_space0	link
WMT'14 En-Fr	space0	HAT_wmt14enfr_super_space0	link
WMT'19 En-De	space0	HAT_wmt19ende_super_space0	link
IWSLT'14 De-En	space1	HAT_iwslt14deen_super_space1	link

2. Evolutionary Search

The second step of HAT is to perform an evolutionary search in the trained SuperTransformer with a hardware latency constraint in the loop. We train a latency predictor to get fast and accurate latency feedback.
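
To make the idea of a latency constraint in the loop concrete, here is a toy evolutionary search in plain Python. It is a minimal sketch, not the logic of evo_search.py: the search space, the linear predict_latency_ms stand-in for a trained predictor, and the proxy_score stand-in for validation quality are all made up for illustration.

```python
import random

# Hypothetical search space: each key lists the values a SubTransformer may take.
SEARCH_SPACE = {
    "embed_dim": [512, 640],
    "ffn_dim": [1024, 2048, 3072],
    "num_layers": [1, 2, 3, 4, 5, 6],
    "num_heads": [4, 8],
}


def predict_latency_ms(cfg):
    # Stand-in for a trained latency predictor (made-up linear model).
    return cfg["num_layers"] * (cfg["embed_dim"] + cfg["ffn_dim"]) / 100.0


def proxy_score(cfg):
    # Stand-in for the quality of a candidate (higher is better).
    return cfg["num_layers"] * 2 + cfg["ffn_dim"] / 1024 + cfg["embed_dim"] / 512


def sample(rng):
    return {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}


def mutate(cfg, rng, prob=0.3):
    # Resample each gene independently with probability `prob`.
    return {k: rng.choice(SEARCH_SPACE[k]) if rng.random() < prob else v
            for k, v in cfg.items()}


def crossover(a, b, rng):
    # Take each gene from one of the two parents at random.
    return {k: rng.choice([a[k], b[k]]) for k in a}


def feasible(make, limit_ms, max_tries=1000):
    # Rejection sampling: retry until the candidate meets the latency limit.
    for _ in range(max_tries):
        cfg = make()
        if predict_latency_ms(cfg) <= limit_ms:
            return cfg
    raise RuntimeError("no feasible candidate found")


def evo_search(limit_ms, population=20, parents=5, iterations=10, seed=0):
    rng = random.Random(seed)
    pop = [feasible(lambda: sample(rng), limit_ms) for _ in range(population)]
    for _ in range(iterations):
        # Keep the best candidates and refill the population from them.
        top = sorted(pop, key=proxy_score, reverse=True)[:parents]
        children = []
        while len(children) < population - parents:
            if rng.random() < 0.5:
                make = lambda: mutate(rng.choice(top), rng)
            else:
                make = lambda: crossover(rng.choice(top), rng.choice(top), rng)
            children.append(feasible(make, limit_ms))
        pop = top + children
    return max(pop, key=proxy_score)


best = evo_search(limit_ms=100.0)
print(best, predict_latency_ms(best))
```

Because every candidate is checked against the predictor before it enters the population, the search never spends effort scoring architectures that violate the constraint, which is why a fast predictor matters.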

2.1 Generate a latency dataset

python latency_dataset.py --configs=configs/[task_name]/latency_dataset/[hardware_name].yml
# for example
python latency_dataset.py --configs=configs/wmt14.en-de/latency_dataset/cpu_raspberrypi.yml

hardware_name can be cpu_raspberrypi, cpu_xeon, or gpu_titanxp. The --configs file contains the design space from which we sample models to get (model_architecture, real_latency) data pairs.

We provide the datasets we collected in the latency_dataset folder.
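
The shape of such a dataset can be sketched as follows. This is an illustration under stated assumptions rather than what latency_dataset.py does: DESIGN_SPACE, fake_model_run and the row layout are hypothetical, and the timed workload is a stand-in whose cost merely grows with the sampled architecture.

```python
import random
import time

# Hypothetical design space; the real one lives in the --configs file.
DESIGN_SPACE = {
    "encoder_embed_dim": [512, 640],
    "decoder_layers": [1, 2, 3, 4, 5, 6],
}


def fake_model_run(arch):
    # Stand-in workload whose cost grows with the architecture size.
    steps = arch["encoder_embed_dim"] * arch["decoder_layers"]
    return sum(range(steps))


def time_once_ms(arch):
    start = time.perf_counter()
    fake_model_run(arch)
    return (time.perf_counter() - start) * 1000.0


def build_latency_dataset(num_samples, seed=0):
    """Sample architectures and pair each with its measured latency."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num_samples):
        arch = {k: rng.choice(v) for k, v in DESIGN_SPACE.items()}
        rows.append({"arch": arch, "latency_ms": time_once_ms(arch)})
    return rows


dataset = build_latency_dataset(5)
for row in dataset:
    print(row)
```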

2.2 Train a latency predictor

Then train a predictor with the collected dataset:

python latency_predictor.py --configs=configs/[task_name]/latency_predictor/[hardware_name].yml
# for example
python latency_predictor.py --configs=configs/wmt14.en-de/latency_predictor/cpu_raspberrypi.yml

The --configs file contains the predictor's model architecture and training settings. We provide pre-trained predictors in the latency_dataset/predictors folder.

2.3 Run evolutionary search with a latency constraint

python evo_search.py --configs=[supertransformer_config_file].yml --evo-configs=[evo_settings].yml
# for example
python evo_search.py --configs=configs/wmt14.en-de/supertransformer/space0.yml --evo-configs=configs/wmt14.en-de/evo_search/wmt14ende_titanxp.yml

The --configs file points to the SuperTransformer training config file. The --evo-configs file contains the evolutionary search settings and specifies the desired latency constraint (latency-constraint). Note that feature-norm and lat-norm here should be the same as those used when training the latency predictor. --write-config-path specifies where to write the searched SubTransformer architecture.
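
The sketch below illustrates why the normalization settings have to match. It assumes that feature-norm and lat-norm act as divisors applied to the input features and to the target latency; that reading, the numeric values, and the toy linear predictor are all assumptions for illustration, not the repository's implementation.

```python
FEATURE_NORM = [640.0, 6.0]  # hypothetical per-feature divisors
LAT_NORM = 200.0             # hypothetical latency divisor, in ms


def normalize_features(features, feature_norm=FEATURE_NORM):
    return [f / n for f, n in zip(features, feature_norm)]


def toy_predictor(norm_features):
    # Stand-in for a trained predictor; it outputs a *normalized* latency.
    weights = [0.3, 0.6]
    return sum(w * x for w, x in zip(weights, norm_features))


def predict_latency_ms(features, feature_norm=FEATURE_NORM, lat_norm=LAT_NORM):
    # Normalize inputs, predict, then undo the latency normalization.
    return toy_predictor(normalize_features(features, feature_norm)) * lat_norm


arch = [640, 6]  # e.g. (embedding dim, number of decoder layers)
consistent = predict_latency_ms(arch)
mismatched = predict_latency_ms(arch, lat_norm=100.0)
print(consistent, mismatched)
```

In this toy setup, halving lat-norm at search time halves every predicted latency, so a constraint that looks satisfied during the search could be violated on real hardware.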

3. Train a Searched SubTransformer

Finally, we train the searched SubTransformer from scratch:

python train.py --configs=[subtransformer_architecture].yml --sub-configs=configs/[task_name]/subtransformer/common.yml
# for example
python train.py --configs=configs/wmt14.en-de/subtransformer/[email protected] --sub-configs=configs/wmt14.en-de/subtransformer/common.yml

--configs points to the --write-config-path in step 2.3. --sub-configs contains training settings for the SubTransformer.

After training a SubTransformer, you can test its performance with the methods in the Testing section.

Dependencies

Python >= 3.6
PyTorch >= 1.0.0
configargparse >= 0.14
New model training requires NVIDIA GPUs and NCCL

Related works on efficient deep learning

MicroNet for Efficient Language Modeling

Lite Transformer with Long-Short Range Attention

AMC: AutoML for Model Compression and Acceleration on Mobile Devices

Once-for-All: Train One Network and Specialize it for Efficient Deployment

ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware

Contact

If you have any questions, feel free to contact Hanrui Wang by email ([email protected]) or through GitHub issues. Pull requests are very welcome!

Licence

This repository is released under the MIT license. See LICENSE for more information.

Acknowledgements

We thank fairseq, which serves as the backbone of this repo.

hardware-aware-transformers's Issues

Training new SuperTransformer - calculating number of SubTransformer combinations?

Dear Authors,

Thanks for the great library. I am currently attempting to train a new SuperTransformer. The paper states that the default design space contains 10^15 SubTransformer configurations. Can you explain how this number is calculated, so I can work on calculating the number of SubTransformers in my new SuperTransformer?
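
One generic way to count the configurations in a layer-wise search space is to multiply the independent global choices and, for each allowed depth, raise the number of per-layer combinations to that depth. The sketch below does this for a small hypothetical space; it does not reproduce how the paper's 10^15 figure was derived, and SPACE and count_subnets are illustrative names only.

```python
# Hypothetical search space, described by how many options each choice has.
SPACE = {
    "embed_dim_choices": 2,                    # one global choice
    "layer_num_choices": [1, 2, 3, 4, 5, 6],   # allowed depths
    "per_layer_choices": {"ffn_dim": 3, "num_heads": 2},
}


def count_subnets(space):
    # Combinations available inside a single layer.
    per_layer = 1
    for options in space["per_layer_choices"].values():
        per_layer *= options
    # Each depth d contributes per_layer ** d independent layer settings.
    depth_total = sum(per_layer ** depth for depth in space["layer_num_choices"])
    return space["embed_dim_choices"] * depth_total


print(count_subnets(SPACE))
```

If some choices depend on others (for example, attention options that only exist for certain layers), the product for the affected layers would need to be adjusted accordingly.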

Used version of `fairseq`

Hi,

Would it be possible to disclose which version/commit of fairseq was used & modified for this implementation?

Thanks!

Error in step 2.3 (Evolutionary search with latency constraint)

Hi,

When following the code step by step, I get an error when running the evolutionary search. The error is:
"_th_admm_out not supported on CPUType for Half"

Do you know what could be causing this and how to fix it? I am currently running this for my i5 CPU. Does the config file need any change to avoid using the GPU when only the CPU is being tested?

Help with this would be highly appreciated. Thanks.

Quantization on HAT.

Hi, firstly thanks for the library.
I tried a few experiments and trained a SubTransformer with a latency constraint of 75 ms. The paper mentions that HAT is quantization friendly. Now I want to quantize the SubTransformer I trained.
Can you please share how you quantized the SubTransformer? It would be helpful.
Also, is there any latency comparison for the quantized model? Can we expect a latency reduction here?
Thanks.

Question on how to evaluate inherited SubTransformers.

Hi,

Table 5 in the paper mentions that the "Inherited" BLEU score is similar to the "From-Scratch" BLEU score.

Can you please specify which part of the code can be used to run inference/test the inherited SubTransformers (without training from scratch) or how to use the code to perform this task?

In other words, I would like to know how to test translations from a specific SubTransformer in the SuperTransformer design space without training the model again from scratch.

Hope you can point me in the right direction on this. Thanks for your help.

Latency predictor relative error instead of absolute error

Hello,

You mention in Fig. 6 in the paper that the average prediction error of the latency predictor is 0.1 secs.

Could you please provide how much error percentage is that on Raspberry Pi?

I checked your Raspberry Pi latency dataset file and I could see ground truth latency ("latency_mean_encoder" and "latency_mean_decoder") of the models used for test (last 200).

However I could not see the predicted values to calculate how much percentage is the error.

Thanks.
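
For reference, the difference between the two error measures can be sketched as below. The latency values are made up for illustration and are not taken from the paper or from the released Raspberry Pi dataset; the function names are hypothetical.

```python
def mean_abs_error(pred, true):
    """Average absolute error, in the same unit as the inputs."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def mean_relative_error_pct(pred, true):
    """Average of |pred - true| / true, expressed as a percentage."""
    return 100.0 * sum(abs(p - t) / t for p, t in zip(pred, true)) / len(true)


true_s = [4.0, 5.0, 8.0]  # made-up ground-truth latencies in seconds
pred_s = [4.2, 4.9, 8.4]  # made-up predictions in seconds

print(f"absolute error: {mean_abs_error(pred_s, true_s):.3f} s")
print(f"relative error: {mean_relative_error_pct(pred_s, true_s):.2f} %")
```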

What is the method used to sample training examples for the MLP latency predictor?

Hi,

The paper says it trains the MLP latency predictor on 2000 latency samples. May I ask what was your sampling method? That is, how did you choose those 2000 samples?

Thanks,
Mohamed.

RAM in the used Raspberry Pi

Hello,

I would just like to ask how large the RAM is in the Raspberry Pi you used: 4 GB or 8 GB? And was it sufficient on its own for your experiments?

Thanks.

Question about the SubTransformers sampling process.

Hi,

Thanks a lot for releasing this great project.
I have a question on the SubTransformer sampling process in the distributed training environment. I see you sample a random SubTransformer before each train step by doing the following. In the multi-GPU scenario, does each GPU have the same random SubTransformer, or does each have a different random subnetwork? Would reset_rand_seed force all GPUs to sample the same random SubTransformer from the SuperNet? And is trainer.get_num_updates() the same at each train step?

configs = [utils.sample_configs(utils.get_all_choices(args), reset_rand_seed=True, rand_seed=trainer.get_num_updates(), super_decoder_num_layer=args.decoder_layers)]

Thanks a lot for your help.
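
The general principle behind seeding by step count can be sketched in plain Python: if every worker seeds its random generator with the same number before sampling, every worker draws the same configuration. This does not confirm what utils.sample_configs does internally; sample_config and CHOICES below are hypothetical, and the sketch assumes the update counter is identical across workers.

```python
import random

# Hypothetical choices; the real ones come from utils.get_all_choices(args).
CHOICES = {"embed_dim": [512, 640], "layers": [1, 2, 3, 4, 5, 6]}


def sample_config(choices, rand_seed):
    # A fresh generator seeded with the step count gives a reproducible draw.
    rng = random.Random(rand_seed)
    return {k: rng.choice(v) for k, v in choices.items()}


num_updates = 1234  # assumed to be identical on every worker
worker_samples = [sample_config(CHOICES, rand_seed=num_updates) for _ in range(4)]
print(worker_samples[0])
print(all(s == worker_samples[0] for s in worker_samples))
```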

questions about the search & training process

Hi, I tried to run an evolutionary search on the IWSLT14.de-en dataset with a 1080 Ti GPU.

I modified the latency-constraint from 200 to 150 since the 1080 Ti is faster than the Titan XP.

But the best architecture (143 ms) didn't change after ten epochs while the max iteration is 30.

Then I trained the searched architecture using the same configuration file and got only 33.77 BLEU (normal).

My questions are:

Is this phenomenon normal? Does it mean that the search has encountered a local optimum?
How to get comparable scores reported in your list if I use other GPUs with similar latency?

Here is the search log:
iwlst.evo.gpu.log

Does the generated latency count in the embedding lookup table and the last output layers ?

According to the code, the generated latency should include the embedding lookup table and the last output layers. But I found a problem: I trained a predictor, and it is very accurate. Then I ran the evo search with a hardware latency constraint of 200ms. After the SubTransformer was trained, I tested the latency, and it is 270ms, which is much larger than the predicted latency. Why does this happen?

question about number of parameters

Hi, I trained some models using the pre-defined configurations but
the number of parameters is much larger than what you reported (55.1M vs. 31.5M).

Configuration:
HAT_iwslt14deen_titanxp@[email protected]

Here is the code I used to calculate the number of parameters:
(embedding layers are excluded)

import torch

m = torch.load('checkpoints/iwslt14.de-en/subtransformer/HAT_iwslt14deen_titanxp@\
[email protected]/checkpoint_best.pt', map_location='cpu')
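m = m['model']  # keep only the state dict (parameter name -> tensor)

# Sum the element counts of all tensors whose name does not contain 'emb'
n = 0
for k in m:
    if 'emb' not in k:
        n += m[k].numel()

print(n)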

lower loss but the BLEU is 0

I have trained a model, and the loss on the train and valid datasets is very low (lower than 2). But when I evaluated the BLEU, I found it was 0. I checked the translation results, and they are all the same, like "the the the the ...". This is very strange; what could be the reason?

One question

Thanks for releasing the great project!

One thing I'd like to make sure of is whether the head_num parameter has an effect on model latency. From your code, I find that qkv_dim is fixed, so I conjecture that head_num would not affect the model latency.

Thanks

how to use the processed data in your code?

As described in the title, I have downloaded the processed data ('*.tgz'), but I can't find how to use it. Please help me out.
I found the data path in each space0.yml. Is it for the preprocessed data?

Question about the latency on Raspberry Pi

Dear Authors,

Hi, thanks for the library!
I tested it successfully on my GPU, but the problem I am having now is that the latency on the Raspberry Pi I'm using is different from yours. I used WMT'14 En-Fr for testing, and below are my latency and the test command:

I'm not sure what the problem is, do I need additional settings?
And this is my environment:

Python == 3.7.3
pytorch == 1.4.0

Thanks.

About the Quantization Friendly.

Hi, firstly thanks for your code.
I tried the experiments on "Quantization Friendly". Could you tell me exactly how you implemented Transformer Float32?
Thanks!
