
fastseq's Introduction

FastSeq


Introduction

FastSeq provides efficient implementations of popular sequence models (e.g., BART, ProphetNet) for text generation, summarization, and translation tasks. It automatically optimizes inference speed on top of popular NLP toolkits (e.g., FairSeq and HuggingFace-Transformers) without accuracy loss. All of this requires no changes to code, models, or data when using our command-line tool, or just a one-line addition, import fastseq, when using the source code.

Features:

Speed Gain

The table below shows the generation speed gain from using FastSeq.

| Model | W/O FastSeq (samples/s) | W/ FastSeq (samples/s) | Speedup |
|---|---|---|---|
| ProphetNet (fs) | 2.8 | 11.9 | 4.3x |
| Bart (fs) | 3.3 | 25.1 | 7.7x |
| Bart (hf) | 4.5 | 12.4 | 2.8x |
| DistilBart (hf) | 5.5 | 19.1 | 3.5x |
| T5 (hf) | 9.5 | 31.7 | 3.3x |
| WMT16 En-De (fs) | 144.5 | 422.8 | 2.9x |
| GPT2 (hf) | 0.9 | 7.1 | 7.9x |
| ProphetNet (hf) | 3.4 | 6.2 | 1.8x |
  • All benchmarking experiments were run on an NVIDIA V100 16GB GPU with Docker. The highest speed for each model was recorded after tuning the batch size. For the parameter settings, click the link of the corresponding model.
  • The baseline (W/O FastSeq) for ProphetNet (fs) is run with fairseq 0.9.0, as ProphetNet has not yet been updated for compatibility with version 0.10.2.
  • fs stands for Fairseq version 0.10.2; hf stands for Huggingface Transformers version 4.12.0.
  • Optimizations are automatically applied to all generation/sequence models in Fairseq and Huggingface Transformers; the table above lists only a subset of them.

How it works

FastSeq develops multiple speedup techniques, including an attention cache optimization, an efficient algorithm for detecting repeated n-grams, and an asynchronous generation pipeline with parallel I/O. These optimizations support various Transformer-based model architectures, such as encoder-decoder, decoder-only, and encoder-only models. The more efficient implementations in FastSeq are automatically patched in to replace the corresponding ones in existing NLP toolkits (e.g., HuggingFace-Transformers and FairSeq), so no significant code changes are needed to integrate FastSeq with these toolkits.
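
The patching idea can be pictured with a minimal, hypothetical sketch (this is not FastSeq's actual code): on import, the library looks up a class owned by the host toolkit and swaps one of its methods for a faster drop-in replacement with the same signature, so callers keep using the same API.

# Hypothetical illustration of import-time patching (not FastSeq's real code).
class ToolkitGenerator:
    """Stand-in for a generator class owned by an existing toolkit."""
    def generate(self, prompt):
        return f"slow generation for: {prompt}"

def _fast_generate(self, prompt):
    # Drop-in replacement with the same signature and return type.
    return f"fast generation for: {prompt}"

def apply_patches():
    # Replacing the method on the class affects every existing and future
    # instance, so downstream code needs no changes.
    ToolkitGenerator.generate = _fast_generate

apply_patches()  # a library would call this at import time
print(ToolkitGenerator().generate("hello"))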

Installation

Requirements

If you use only fairseq or only transformers, you need to install just that one; if you use both, install both.
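
As a quick, illustrative check of which backend(s) are present (not an official FastSeq utility), something like the following can be used:

# Illustrative check: fastseq only needs the toolkit(s) you actually use.
import importlib.util

for pkg in ("fairseq", "transformers"):
    status = "installed" if importlib.util.find_spec(pkg) else "not installed"
    print(f"{pkg}: {status}")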

Building the Dockerfile

The Dockerfile requires a base image to be specified.

cd fastseq/docker
# pass the base image name as a build-arg when building the image from the dockerfile
docker build --build-arg BASE_IMAGE=nvcr.io/nvidia/pytorch:20.03-py3 .

Install from the source

# when fairseq and/or transformers has been installed
$ pip install git+https://github.com/microsoft/fastseq.git

# install fastseq + transformers
$ pip install git+https://github.com/microsoft/fastseq.git#egg=fastseq[transformers]

# install fastseq + fairseq
$ pip install git+https://github.com/microsoft/fastseq.git#egg=fastseq[fairseq]

# install fastseq + transformers + fairseq
$ pip install git+https://github.com/microsoft/fastseq.git#egg=fastseq[transformers,fairseq]

Usage

Use source code for speedup

Only one line of code change is needed to use the optimizations provided by FastSeq.

# import fastseq at the beginning of your program
import fastseq
import torch

# Download bart.large.cnn
bart = torch.hub.load('pytorch/fairseq', 'bart.large.cnn')

bart.cuda()  # use GPU
bart.eval()  # disable dropout for evaluation
bart.half()

slines = ['FastSeq provides efficient implementations of the popular sequence models. Please visit https://github.com/microsoft/fastseq for more details.']

hypotheses = bart.sample(
    slines, beam=4, lenpen=2.0, max_len_b=140, min_len=55, no_repeat_ngram_size=3)

print(hypotheses)
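
For HuggingFace-Transformers models, the same one-line import applies. Below is a minimal sketch (assuming Transformers 4.12.0 and the facebook/bart-large-cnn checkpoint); the generation parameters mirror the fairseq example above.

# import fastseq at the beginning of your program
import fastseq
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
model.cuda().eval().half()  # use GPU, disable dropout, fp16

text = ('FastSeq provides efficient implementations of the popular sequence models. '
        'Please visit https://github.com/microsoft/fastseq for more details.')
inputs = tokenizer([text], return_tensors='pt', truncation=True).to('cuda')

summary_ids = model.generate(
    inputs['input_ids'],
    num_beams=4,
    length_penalty=2.0,
    max_length=140,
    min_length=55,
    no_repeat_ngram_size=3)

print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))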

Use command line tool to speedup fairseq models

Example usage for the BART model on the CNN/DailyMail task.

$ fastseq-generate-for-fairseq \
    cnn_dnn/bin \
    --path bart.large.cnn/model.pt \
    --fp16 \
    --task translation \
    --batch-size 128 \
    --gen-subset valid \
    --truncate-source  \
    --bpe gpt2 \
    --beam 4 \
    --num-workers 4 \
    --min-len 55 \
    --max-len-b 140 \
    --no-repeat-ngram-size 3 \
    --lenpen 2.0

Both the model file and the task data file are the same as in the original Fairseq version.

Use command line tool to speedup transformers models

Example usage for the BART model on the CNN/DailyMail task.

$ fastseq-generate-for-transformers \
    facebook/bart-large-cnn \
    cnn_dm/val.source \
    out.summary \
    --reference_path cnn_dm/val.target \
    --device cuda \
    --bs 128 \
    --fp16 \
    --score_path out.score \
    --task summarization

Both the model file and the task data file are the same as in the original Transformers version.

Run tests

# run a single test.
$ python tests/optimizer/fairseq/test_fairseq_optimizer.py

# run all the tests.
$ python -m unittest discover -s tests/ -p '*.py'

# run all the benchmarks.
$ cd benchmarks && bash run_all_benchmarks.sh

Code Style

Python coding style

Changes to Python code should conform to PEP 8. Use yapf to help format the Python code and pylint to check your Python changes.

# format the code by yapf
$ yapf --style pep8 -i -r PYTHON_FILE/PACKAGE

# run pylint check
$ pylint --rcfile=.pylintrc  PYTHON_FILE/PACKAGE

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Citation

Please cite as:

@inproceedings{yan-etal-2021-fastseq,
    title = "{F}ast{S}eq: Make Sequence Generation Faster",
    author = "Yan, Yu and Hu, Fei and Chen, Jiusheng and Bhendawade, Nikhil and Ye, Ting and Gong, Yeyun and Duan, Nan and Cui, Desheng and Chi, Bingyu and Zhang, Ruofei",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
    year = "2021",
}

@InProceedings{pmlr-v139-yan21a,
    title = {EL-Attention: Memory Efficient Lossless Attention for Generation},
    author = {Yan, Yu and Chen, Jiusheng and Qi, Weizhen and Bhendawade, Nikhil and Gong, Yeyun and Duan, Nan and Zhang, Ruofei},
    booktitle = {Proceedings of the 38th International Conference on Machine Learning},
    pages = {11648--11658},
    year = {2021},
}

fastseq's People

Contributors

cep21, feihugis, fuliucansheng, jiushengchen, julianneknott, microsoftopensource, monologg, nicknickgo, yuyan2do


fastseq's Issues

fairseq/transformers unit tests modify the local environment

After running tests/run_fairseq_tests.py, the user's original Fairseq installation is deleted and replaced.

Before the run:
Location: /opt/conda/lib/python3.6/site-packages

After the run:
Location: /tmp/fairseq

The version may also change, and the same problem applies to Transformers.

This breaks the user's environment; the user then needs to reinstall the package.

Can we isolate the unit tests' pip environment from the user's local environment, e.g., with something like a virtual environment or conda?
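
One possible way to isolate the tests, sketched with the standard venv module (the path and package choices here are placeholders, not a documented FastSeq workflow):

# Hypothetical sketch: run the test suite in a throwaway virtual environment
# so the user's installed fairseq/transformers stay untouched.
import subprocess
import venv

ENV_DIR = "/tmp/fastseq-test-env"  # placeholder scratch path
venv.create(ENV_DIR, with_pip=True)

pip = f"{ENV_DIR}/bin/pip"
python = f"{ENV_DIR}/bin/python"

subprocess.check_call([pip, "install", "fairseq", "transformers"])
subprocess.check_call([pip, "install", "-e", "."])  # install fastseq from the repo root
subprocess.check_call([python, "-m", "unittest", "discover", "-s", "tests/", "-p", "*.py"])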

fairseq eval_lm

Have you guys looked at fairseq eval_lm?

It's not generative, but I was wondering whether any of these tricks would work.
Thanks in advance!

T5 speed

T5 speed with the latest code is lower than expected (Docker was used), which caused the benchmark test to fail.
$CUDA_VISIBLE_DEVICES=3 bash models/hf_t5.sh

| Util | Model | Task | Split | BatchSize | Bleu | Throughput (samples/s) | Expected |
|---|---|---|---|---|---|---|---|
| transformers_v3.0.2 | t5-base | wmt_en_ro/raw | val | 64 | 27.44 | 5 | 5~5.5 |
| transformers_v3.0.2+fastseq_v0.0.3 | t5-base | wmt_en_ro/raw | val | 64 | 27.43 | 6.3 | 7~7.5 |
| transformers_v3.0.2+fastseq_v0.0.3 | t5-base | wmt_en_ro/raw | val | 128 | 27.42 | 7.2 | 7.9~8.4 |

Not sure if it is due to docker.

Memory not released for large BS on FastSeq

I'd like to report an issue where GPU memory is not released after a crash with a large batch size (BS) on FastSeq.

Impact:
I can reproduce it every time. Since the memory is not released after the crash, I am afraid that if this package is released to users, they may hit the same issue and will not find it easy to handle.

How to reproduce:
I tested on the gpu0 machine.
Below are the detailed steps to reproduce this issue:

  • Docker run image:
    sudo docker run --gpus all --privileged --name fastseq_dev_py3_tiy -it adsbrainwestus2.azurecr.io/fastseq:dev-py3 /bin/bash

  • Inside the container:

  1. Create an RSA key and add it to your GitHub account (just to make it easy to download the code)
  2. mkdir tiy && cd tiy
  3. Install the latest fastseq:
    git clone git@github.com:microsoft/fastseq.git
    cd fastseq
    pip install --editable ./
  4. cd benchmarks
    Set LOOP in utils.sh to be 1
  5. Run nvidia-smi a first time; there is no memory occupation, which is expected:

[nvidia-smi screenshot]

  6. Run ./benchmark.sh fairseq+fastseq bart.large.cnn cnn_dm/len-1024.bin valid 256
    Failed because of Bus error:
    Processing Loop=1/1 Util=fairseq_v0.9.0+fastseq_v0.0.3 Model=bart.large.cnn Task=cnn_dm/len-1024.bin Split=valid BS=256
    benchmark_seq.sh: line 55: 533 Bus error (core dumped) $util $data_dir --path $model_path --fp16 --task translation --batch-size $bs --gen-subset $split --truncate-source --bpe gpt2 --beam 4 --num-workers 4 --min-len 55 --max-len-b 140 --no-repeat-ngram-size 3 --lenpen 2.0 #--print-alignment #--print-step # KeyError: steps --skip-invalid-size-inputs-valid-test $* > $STDOUT_FILE 2> $STDERR_FILE
    Failed at benchmark_seq.sh (line 80): $util $data_dir --path $model_path --fp16 --task translation --batch-size $bs --gen-subset $split --truncate-source --bpe gpt2 --beam 4 --num-workers 4 --min-len 55 --max-len-b 140 --no-repeat-ngram-size 3 --lenpen 2.0 #--print-alignment #--print-step # KeyError: steps --skip-invalid-size-inputs-valid-test $* > $STDOUT_FILE 2> $STDERR_FILE

  7. Run nvidia-smi a second time; there is memory occupation on GPU 0:

[nvidia-smi screenshot]

Other information:
I re-ran it 5 times to check whether there was any information in fastseq.stderr. Most of the time, there is no error message in fastseq.stderr.

  • 4 times, there was no error message in fastseq.stderr

root@6e86574394fb:/workspace/tiy/fastseq/benchmarks# cat /tmp/fastseq.stderr
/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn("torch.distributed.reduce_op is deprecated, please use "
/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn("torch.distributed.reduce_op is deprecated, please use "

  • 1 time, an EOFError was recorded in fastseq.stderr

root@6e86574394fb:/workspace/tiy/fastseq/benchmarks# cat /tmp/fastseq.stderr
/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn("torch.distributed.reduce_op is deprecated, please use "
/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn("torch.distributed.reduce_op is deprecated, please use "
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/resource_sharer.py", line 142, in _serve
with self._listener.accept() as conn:
File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 456, in accept
answer_challenge(c, self._authkey)
File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 732, in answer_challenge
message = connection.recv_bytes(256) # reject large message
File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 383, in recv
raise EOFError
EOFError

Any end-to-end inference example with Google Colab & HuggingFace

Hi Team,

Thanks a lot for this.

Few questions -

  1. Is the speedup only for GPU, or is inference on CPU also boosted?

  2. Wondering whether an inference example with T5/BART summarization from Hugging Face could be provided in a Colab notebook or similar, which would be easier to adopt.

Sorry if it is a bit of a stretch to request this. Appreciate you reading this.

Transformers unit tests failure

test_modeling_t5.py fails when fastseq is imported.

Steps to replicate:

  1. Clone from the test_wrapper branch and cd to the tests directory.
  2. CUDA_VISIBLE_DEVICES=<> bash run_transformers_tests.py

ModuleNotFoundError: No module named 'fastseq.models'

Hello,

The fastseq library, installed through pip, seems to have the models directory missing and crashes with the following exception:

File "/root/miniconda3/envs/Env37/lib/python3.7/site-packages/fastseq/init.py", line 9, in
import fastseq.models # pylint: disable=wrong-import-position
ModuleNotFoundError: No module named 'fastseq.models'

And while this directory can be found in the repository here, I can only find prophetnet, while I am looking for BART.
Could you give me a hint on how to install the corresponding models?

Thanks and happy holidays :-)

RuntimeError: CUDA error: no kernel image is available for execution on the device

I am trying to use your repeat ngram extension, but when I switch GPUs (without rebuilding the extension) it breaks with RuntimeError: CUDA error: no kernel image is available for execution on the device. If I rerun python setup.py build_ext --inplace, it works again. Any clues on how to build the extension so that it works on a different GPU (same CUDA version, same Python version, same torch) than the one it was built on?
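
One way to see why the prebuilt extension stops working is to compare the compute capability of the new GPU against the architecture(s) the extension was compiled for; if they differ, the extension has to be rebuilt with that architecture included (for example via the TORCH_CUDA_ARCH_LIST environment variable). A small check, offered as a sketch rather than an official diagnostic:

# Sketch: print the compute capability of the visible GPU. If the ngram
# extension was not built for this architecture, rebuild it with this
# value included (e.g. via TORCH_CUDA_ARCH_LIST).
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"GPU 0 compute capability: sm_{major}{minor}")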

Also, we're considering pulling some of these changes back into fairseq, if that's alright with you guys!

Running error with PyTorch 1.12.1

The following error occurs with PyTorch 1.12.1, but disappears with PyTorch 1.11.0

  File "/home/tangtianyi/transformers/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py", line 26, in <module>
    from ...modeling_utils import PreTrainedModel
  File "/home/tangtianyi/transformers/src/transformers/modeling_utils.py", line 41, in <module>
    from .generation_utils import GenerationMixin
  File "/home/tangtianyi/transformers/src/transformers/generation_utils.py", line 29, in <module>
    from .generation_logits_process import (
  File "/home/tangtianyi/transformers/src/transformers/generation_logits_process.py", line 25, in <module>
    import ngram_repeat_block_cuda
ImportError: /home/tangtianyi/miniconda3/lib/python3.8/site-packages/ngram_repeat_block_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK3c1010TensorImpl36is_contiguous_nondefault_policy_implENS_12MemoryFormatE

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tangtianyi/model/pretrained_models.py", line 31, in <module>
    from transformers import AutoConfig, AutoModelForCausalLM, AutoModelForSeq2SeqLM, EncoderDecoderModel
  File "<frozen importlib._bootstrap>", line 1039, in _handle_fromlist
  File "/home/tangtianyi/transformers/src/transformers/utils/import_utils.py", line 948, in __getattr__
    value = getattr(module, name)
  File "/home/tangtianyi/transformers/src/transformers/utils/import_utils.py", line 947, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/home/tangtianyi/transformers/src/transformers/utils/import_utils.py", line 959, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.encoder_decoder.modeling_encoder_decoder because of the following error (look up to see its traceback):
/home/tangtianyi/miniconda3/lib/python3.8/site-packages/ngram_repeat_block_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK3c1010TensorImpl36is_contiguous_nondefault_policy_implENS_12MemoryFormatE

Compatible with torch-1.6.0

fastseq-generate works on torch 1.5.0, but when running fastseq-generate on torch 1.6.0, I got the following error:

Traceback (most recent call last):
  File "/datadrive/jiuchen/src/git.fastseq/fastseq_cli/generate.py", line 14, in <module>
    cli_main()
  File "/usr/local/lib/python3.6/dist-packages/fairseq_cli/generate.py", line 199, in cli_main
    main(args)
  File "/datadrive/jiuchen/src/git.fastseq/fastseq/optimizer/fairseq/generate_v1.py", line 113, in main_v1
    prefix_tokens)
  File "/usr/local/lib/python3.6/dist-packages/fairseq/tasks/fairseq_task.py", line 265, in inference_step
    return generator.generate(models, sample, prefix_tokens=prefix_tokens)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/fairseq/sequence_generator.py", line 113, in generate
    return self._generate(model, sample, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/datadrive/jiuchen/src/git.fastseq/fastseq/optimizer/fairseq/beam_search_optimizer_v1.py", line 704, in _generate
    scores.view(bsz, beam_size, -1)[:, :, :step],
  File "/usr/local/lib/python3.6/dist-packages/fairseq/search.py", line 81, in step
    torch.div(self.indices_buf, vocab_size, out=self.beams_buf)
RuntimeError: Integer division of tensors using div or / is no longer supported, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.
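
The error message itself points at the fix: the integer tensor division in fairseq's search.py has to become an explicit floor division on torch 1.6.0. A minimal illustration of that change (in the spirit of the suggested fix, not a verbatim patch):

# Minimal illustration of the change suggested by the error message:
# replace integer tensor division with an explicit floor division.
import torch

indices_buf = torch.tensor([0, 7, 15, 23])
vocab_size = 8

# Old (raises on torch 1.6.0 for integer tensors):
#   beams_buf = torch.div(indices_buf, vocab_size)

# New: explicit floor division keeps the integer semantics.
beams_buf = torch.floor_divide(indices_buf, vocab_size)
print(beams_buf)  # tensor([0, 0, 1, 2])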

Support for current fairseq 0.10.2

The supported version of fairseq is 0.9.0, from late 2019. The current version has some breaking API changes. Is there a plan to update fastseq to support the current fairseq?
Thanks!

Where to read EL-Attention source code for huggingface-transformers

We are very interested in your work, thank you. We have read your paper "EL-Attention". More comprehensive examples can be found here for huggingface-transformers, but the self-attention there saves the key and value, not only the hidden_states. EL-Attention shows that saving only the hidden_states can halve the memory.
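
A back-of-the-envelope sketch of the memory argument (not FastSeq or paper code): a per-layer key/value cache stores two tensors the size of the hidden states, so caching only the hidden states roughly halves the cache memory.

# Rough element count: caching K and V stores two hidden-state-sized tensors
# per layer, while caching only hidden states stores one, i.e. about half.
batch, seq_len, hidden, num_layers = 32, 1024, 1024, 12

kv_cache = num_layers * 2 * batch * seq_len * hidden     # key + value per layer
hidden_cache = num_layers * batch * seq_len * hidden     # hidden states only

print(kv_cache / hidden_cache)  # -> 2.0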

Errors in test_fairseq_optimizer.py

After #27, we still see errors like the one below, and they are printed via logger.error. Should we make the test fail instead?

cd to project root dir
$CUDA_VISIBLE_DEVICES=3 python -m unittest discover tests/
...
ERROR 2020-09-02 16:33:06,101 test_fairseq_optimizer.py:119]
Mohammad Javad Zarif is the Iranian foreign minister. He has been U.S. Secretary of State John Kerry 's opposite number in nuclear talks. Zarif received a hero 's welcome as he arrived in Iran on a sunny Friday morning. The feds investigated him over his alleged role in controlling the Alavi Foundation.
 v.s.
Mohammad Javad Zarif is the Iranian foreign minister. He has been John Kerry 's opposite number in securing a breakthrough in nuclear discussions. Zarif received a hero 's welcome as he arrived in Iran on a sunny Friday morning. But there are some facts about Zarif that are less well-known.
....
----------------------------------------------------------------------
Ran 9 tests in 50.048s

OK

Does it support seq2seq models with an encoder and decoder based on LSTM or bi-LSTM?

Hi,
I want to run inference with a seq2seq model whose encoder and decoder are based on LSTM and bi-LSTM, and I found your project. All the models in the README used as examples of improved performance are Transformer-based, and I do not see other architectures like LSTM or conv. Can you confirm which types of model can get improved performance?
Thank you.

ACTION REQUIRED: Microsoft needs this private repository to complete compliance info

There are open compliance tasks that need to be reviewed for your fastseq repo.

Action required: 4 compliance tasks

To bring this repository to the standard required for 2021, we require administrators of this and all Microsoft GitHub repositories to complete a small set of tasks within the next 60 days. This is critical work to ensure the compliance and security of your Microsoft GitHub organization.

Please take a few minutes to complete the tasks at: https://repos.opensource.microsoft.com/orgs/microsoft/repos/fastseq/compliance

  • The GitHub AE (GitHub inside Microsoft) migration survey has not been completed for this private repository
  • No Service Tree mapping has been set for this repo. If this team does not use Service Tree, they can also opt-out of providing Service Tree data in the Compliance tab.
  • No repository maintainers are set. The Open Source Maintainers are the decision-makers and actionable owners of the repository, irrespective of administrator permission grants on GitHub.
  • Classification of the repository as production/non-production is missing in the Compliance tab.

You can close this work item once you have completed the compliance tasks, or it will automatically close within a day of taking action.

If you no longer need this repository, it might be quickest to delete the repo, too.

GitHub inside Microsoft program information

More information about GitHub inside Microsoft and the new GitHub AE product can be found at https://aka.ms/gim or by contacting [email protected]

FYI: current admins at Microsoft include @ruofeizhang, @yetingqiaqia, @JiushengChen, @yuyan2do, @feihugis, @NickNickGo

Illegal memory access when batch_size is between (128, 256)

tests/optimiser/fairseq/test_fairseq_optimiser.py works well when batch_size <= 128; however, when batch_size is set between 129 and 255, the error below is raised:

Traceback (most recent call last):
  File "/home/fhu/py-env/nlp/lib/python3.7/site-packages/absl/testing/parameterized.py", line 263, in bound_param_test
    test_method(self, **testcase_params)
  File "tests/optimiser/fairseq/test_fairseq_optimiser.py", line 101, in test_beam_search_optimiser
    no_repeat_ngram_size=no_repeat_ngram_size)
  File "/home/fhu/github/fairseq/fairseq/models/bart/hub_interface.py", line 107, in sample
    hypos = self.generate(input, beam, verbose, **kwargs)
  File "/home/fhu/github/fairseq/fairseq/models/bart/hub_interface.py", line 123, in generate
    prefix_tokens=sample['net_input']['src_tokens'].new_zeros((len(tokens), 1)).fill_(self.task.source_dictionary.bos()),
  File "/home/fhu/github/fairseq/fairseq/tasks/fairseq_task.py", line 361, in inference_step
    return generator.generate(models, sample, prefix_tokens=prefix_tokens)
  File "/home/fhu/py-env/nlp/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/home/fhu/github/fairseq/fairseq/sequence_generator.py", line 159, in generate
    return self._generate(sample, **kwargs)
  File "/home/fhu/github/fairseq/fairseq/sequence_generator.py", line 198, in _generate
    encoder_outs = self.model.forward_encoder(net_input)
  File "/home/fhu/github/fairseq/fairseq/sequence_generator.py", line 697, in forward_encoder
    for model in self.models
  File "/home/fhu/github/fairseq/fairseq/sequence_generator.py", line 697, in <listcomp>
    for model in self.models
  File "/home/fhu/github/fairseq/fairseq/models/fairseq_encoder.py", line 53, in forward_torchscript
    return self.forward_non_torchscript(net_input)
  File "/home/fhu/github/fairseq/fairseq/models/fairseq_encoder.py", line 62, in forward_non_torchscript
    return self.forward(**encoder_input)
  File "/home/fhu/github/fairseq/fairseq/models/transformer.py", line 411, in forward
    x = layer(x, encoder_padding_mask)
  File "/home/fhu/py-env/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fhu/github/fairseq/fairseq/modules/transformer_layer.py", line 122, in forward
    attn_mask=attn_mask,
  File "/home/fhu/py-env/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fhu/github/fastseq/fastseq/optimiser/fairseq/beam_search_optimiser_v2.py", line 200, in forward
    v_proj_weight=self.v_proj.weight,
  File "/home/fhu/py-env/nlp/lib/python3.7/site-packages/torch/nn/functional.py", line 3937, in multi_head_attention_forward
    float('-inf'),
RuntimeError: CUDA error: an illegal memory access was encountered

Support for HF's transformers 3.1+

Thanks for an excellent library.
Any plan to support the new model APIs introduced in transformers==3.1.0 and beyond?
Would it require major work, or are there pointers on how to adapt it?

Thank you!

NMT models speedup abnormally related to batch size

Hi, thanks for the great work. I just tested fairseq-generate on my test set (ZH-EN translation) using FastSeq and Fairseq, and the speedup is quite abnormal compared with the example link.
My test set has 1526 sentences with 5~150 Chinese characters each, and my experiment is on an NVIDIA Tesla T4. The translation model I used is the base Transformer architecture in fairseq, with 30 encoder layers.
I tested with the following commands:
for fairseq: fairseq-generate ../data-bin --path model_avg.pt --remove-bpe --batch-size 128
for fastseq: fastseq-generate-for-fairseq ../data-bin --path model_avg.pt --remove-bpe --batch-size 128 --postprocess-workers 5
I did not use --no-repeat-ngram-size in fastseq; the beam size is the default 5 and lenpen is 1.
My test result is as follows:

| BatchSize | not assigned | 128 | 10 | 5 | 1 |
|---|---|---|---|---|---|
| fairseq-0.10.2 | 65.79 sentences/s | 63.18 sentences/s | 19.06 sentences/s | 11.79 sentences/s | 3.06 sentences/s |
| above + fastseq | 75.55 sentences/s | 74.28 sentences/s | 17.38 sentences/s | 11.47 sentences/s | 2.92 sentences/s |

I found that when the batch size is large (such as 128 and above), FastSeq gives an obvious speedup (though not as much as 2x or more), but when the batch size is small (I tested this because I need the model for deployment in a real setting), FastSeq seems to give no speedup at all, and is even slower. I find this phenomenon quite abnormal and would like to ask for your help. Looking forward to your reply.
