
Attention Is All You Need

PyTorch implementation of the transformer architecture presented in "Attention Is All You Need" by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin.

Architecture

Positional Encoding

The position of each token in a sequence is encoded using the following formula and then added on top of the token's embedding vector.

$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i / d_{model}}\right)$$

$$PE_{(pos, 2i + 1)} = \cos\left(pos / 10000^{2i / d_{model}}\right)$$

A special property of this positional encoding method is that $PE_{x + k}$ can be represented as a linear function of $PE_{x}$, which allows the model to easily attend to tokens by their relative positions:

$$PE_{(x + k, 2i)} = \sin\left((x + k) / 10000^{2i / d_{model}}\right)$$

$$PE_{(x + k, 2i)} = \sin\left(x / 10000^{2i / d_{model}}\right) \cdot \cos\left(k / 10000^{2i / d_{model}}\right) + \cos\left(x / 10000^{2i / d_{model}}\right) \cdot \sin\left(k / 10000^{2i / d_{model}}\right)$$

$$PE_{(x + k, 2i)} = PE_{(x, 2i)} \cdot \cos\left(k / 10000^{2i / d_{model}}\right) + PE_{(x, 2i+1)} \cdot \sin\left(k / 10000^{2i / d_{model}}\right)$$
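
A minimal PyTorch sketch of this encoding (the function name and shapes are illustrative, not necessarily those used in this repository):

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) tensor of sinusoidal position encodings."""
    position = torch.arange(max_len).unsqueeze(1)        # (max_len, 1)
    # 10000^(2i / d_model) for each even dimension index 2i, computed in log space
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)         # even dimensions use sin
    pe[:, 1::2] = torch.cos(position * div_term)         # odd dimensions use cos
    return pe

# usage: added on top of token embeddings of shape (batch, seq_len, d_model)
# x = embeddings + positional_encoding(embeddings.size(1), embeddings.size(2))
```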

Multi-head Attention

In a multi-head attention sublayer, the input queries, keys, and values are each projected into num_heads vectors of size d_model / num_heads. Then, num_heads scaled dot-product attention operations are performed in parallel, and their outputs are concatenated and projected back into size d_model.
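
A rough PyTorch sketch of this sublayer (assuming d_model is divisible by num_heads; the class and variable names are illustrative, not necessarily those used in this repository):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # one projection each for queries, keys, values, plus the output projection
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        batch = q.size(0)

        # project, then split into (batch, num_heads, seq_len, d_head)
        def split(x, w):
            return w(x).view(batch, -1, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)

        # scaled dot-product attention, computed for all heads in parallel
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        if mask is not None:                       # mask must broadcast over scores
            scores = scores.masked_fill(mask == 0, float('-inf'))
        out = F.softmax(scores, dim=-1) @ v

        # concatenate the heads and project back to d_model
        out = out.transpose(1, 2).contiguous().view(batch, -1, self.num_heads * self.d_head)
        return self.w_o(out)
```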

Methods

Overall, this implementation follows the architecture and hyperparameters described in [1] almost exactly. However, due to limited resources, I trained on the smaller Multi30k machine translation dataset instead of the WMT 2014 dataset used in [1].

Learning Rate Schedule

The learning rate schedule used in [1] is shown below:
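
In equation form (from [1]), where $step$ is the current training step and $warmup$ is the number of warmup steps (4000 in the paper), the learning rate increases linearly during warmup and then decays proportionally to the inverse square root of the step number:

$$lrate = d_{model}^{-0.5} \cdot \min\left(step^{-0.5},\ step \cdot warmup^{-1.5}\right)$$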

However, during my experiments, models trained with this schedule failed to achieve BLEU scores above 0.01. Instead, I used PyTorch's ReduceLROnPlateau scheduler, which halves the learning rate (factor=0.5) each time the validation loss plateaus.
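
A minimal sketch of how such a scheduler is wired up in PyTorch (the optimizer settings and loop below are illustrative, not the exact ones used for the reported results):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder; the real model is the transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Halve the learning rate whenever the validation loss stops improving
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5)

for epoch in range(10):
    # ... train for one epoch, then compute the validation loss ...
    val_loss = 1.0 / (epoch + 1)     # stand-in for the real validation loss
    scheduler.step(val_loss)         # the scheduler reacts when val_loss plateaus
```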

Results

For English-to-German translation (EN-DE), my implementation achieved a maximum BLEU score of 27.0 on the test set, which is comparable to the score of 27.3 found in [1].

The trained model weights can be found here.

Notes

Transformers are trained using a technique called "teacher forcing", which is also used to train recurrent neural networks. During training, the model is given the ground-truth tokens[:n] as input and asked to predict the nth token.
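
Concretely, the decoder input is the target sequence shifted by one position, and the loss is computed against the unshifted target. A small sketch (the token IDs and the logits tensor below are stand-ins for a real batch and model output):

```python
import torch
import torch.nn.functional as F

vocab_size = 100

# toy ground-truth target sequence: <bos> w1 w2 w3 <eos>
target = torch.tensor([[1, 5, 7, 9, 2]])    # shape (batch=1, seq_len=5)

decoder_input = target[:, :-1]              # tokens[:n]: <bos> w1 w2 w3
labels = target[:, 1:]                      # shifted by one: w1 w2 w3 <eos>

# stand-in for model(source, decoder_input), which returns per-position logits
logits = torch.randn(1, decoder_input.size(1), vocab_size)

# each position n is trained to predict token n + 1 of the ground truth
loss = F.cross_entropy(logits.reshape(-1, vocab_size), labels.reshape(-1))
```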

Setup Instructions

  1. Install requirements:

```
python -m pip install -r requirements.txt
```

  2. Download spaCy language pipelines:

```
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm
```

References

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention Is All You Need. arXiv:1706.03762 [cs.CL]
