
neural-machine-translation's Introduction

Attention Is All You Need

TensorFlow Keras nVIDIA

Code style: black Checked with mypy Build Status

Note

I adapted the code from this awesome PyTorch version. Please check it out as well.

Important

I am using Python 3.9 with TensorFlow 2.10, because 2.10 is the last TensorFlow release with GPU support on native Windows.

Steps

  1. pip install -r requirements.txt
  2. download.py downloads all the data (en-de file pairs from Europarl, Common Crawl and News Commentary) to the folder passed as an argument.
  3. encode.py filters the data based on the arguments (origin, maximum length etc.) and trains the BPE model, saving it to a file.
  4. train.py runs the whole training pipeline with top-down logic found in the file. Everything is managed by the Trainer from trainer.py (logging embeddings, checkpointing etc.).
  5. translate.py runs the model inference and optionally evaluates it with sacrebleu using the Evaluator from evaluator.py.
  6. docs contains notes with svg drawings from the original repo and markdown files explaining the choices I had to make when adapting the code from one framework to another.
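To give a feel for what "training the BPE model" in step 3 means, here is a minimal sketch of the byte-pair-encoding merge loop. It illustrates the general technique only and is not the code in encode.py; the toy vocabulary and the function names (`count_pairs`, `merge_pair`, `learn_bpe`) are assumptions made up for this example.

```python
import re
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Fuse every standalone occurrence of `pair` into a single symbol."""
    merged = "".join(pair)
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub(lambda _: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Greedily merge the most frequent pair, num_merges times."""
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        # Ties are broken by first occurrence (dict insertion order).
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Hypothetical toy corpus: words split into characters, "</w>" marks word end.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

merges, merged_vocab = learn_bpe(toy_vocab, num_merges=3)
print(merges)
print(merged_vocab)
```

Each merge adds one subword to the vocabulary, so the number of merges controls the final vocabulary size.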

The code itself is heavily commented and you can get a feel for how language models work by looking at the tests.

Overfitting on one sentence

Input sequence:

"I declare resumed the session of the European Parliament "
"adjourned on Friday 17 December 1999, and I would like "
"once again to wish you a happy new year in the hope that "
"you enjoyed a pleasant festive period."

This results in the following generated hypotheses (the top one should be the exact label for this sentence):

Top generated sequence:

('Ich erkläre die am Freitag, dem 17. Dezember unterbrochene Sitzungsperiode '
 'des Europäischen Parlaments für wiederaufgenommen, wünsche Ihnen nochmals '
 'alles Gute zum Jahreswechsel und hoffe, daß Sie schöne Ferien hatten.')

All generated sequences in the beam (k=5) search:

[{'hypothesis': 'Ich die am Freitag, dem 17. Dezember unterbrochene '
                'Sitzungsperiode des Europäischen Parlaments für '
                'wiederaufgenommen, wünsche Ihnen nochmals alles Gute zum '
                'Jahreswechsel und hoffe, daß Sie schöne Ferien hatten.',
  'score': -3.3601136207580566},
 {'hypothesis': 'Ich erkläre die am Freitag, dem 17. Dezember unterbrochene '
                'Sitzungsperiode des Europäischen Parlaments für '
                'wiederaufgenommen, wünsche Ihnen nochmals alles Gute zum '
                'Jahreswechsel und hoffe, daß Sie schöne Ferien hatten.',
  'score': -1.4448045492172241},
 {'hypothesis': 'Ich Ich erkläre die am Freitag, dem 17. Dezember '
                'unterbrochene Sitzungsperiode des Europäischen Parlaments für '
                'wiederaufgenommen, wünsche Ihnen nochmals alles Gute zum '
                'Jahreswechsel und hoffe, daß Sie schöne Ferien hatten.',
  'score': -3.1513545513153076},
 {'hypothesis': 'Ich erkläre die die am Freitag, dem 17. Dezember '
                'unterbrochene Sitzungsperiode des Europäischen Parlaments für '
                'wiederaufgenommen, wünsche Ihnen nochmals alles Gute zum '
                'Jahreswechsel und hoffe, daß Sie schöne Ferien hatten.',
  'score': -3.3080737590789795},
 {'hypothesis': 'Ich erkläre erkläre die am Freitag, dem 17. Dezember '
                'unterbrochene Sitzungsperiode des Europäischen Parlaments für '
                'wiederaufgenommen, wünsche Ihnen nochmals alles Gute zum '
                'Jahreswechsel und hoffe, daß Sie schöne Ferien hatten.',
  'score': -3.3361663818359375}]

The scores are negative because they are log probabilities; the sequence whose score is closest to zero is the top one.
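As a small illustration of how the top sequence falls out of these scores, the sketch below sorts the five beam entries by score and converts each one back with `exp`. The scores are copied from the output above and the hypotheses are truncated for brevity. It assumes each score is a plain sum of token log probabilities; if the project applies length normalisation, the `exp` value is not a true sequence probability.

```python
import math

# Scores copied from the beam output above; hypotheses truncated for brevity.
beam = [
    {"hypothesis": "Ich die am Freitag, ...", "score": -3.3601136207580566},
    {"hypothesis": "Ich erkläre die am Freitag, ...", "score": -1.4448045492172241},
    {"hypothesis": "Ich Ich erkläre die am Freitag, ...", "score": -3.1513545513153076},
    {"hypothesis": "Ich erkläre die die am Freitag, ...", "score": -3.3080737590789795},
    {"hypothesis": "Ich erkläre erkläre die am Freitag, ...", "score": -3.3361663818359375},
]

# A log probability is <= 0, so "closest to zero" means "largest".
ranked = sorted(beam, key=lambda h: h["score"], reverse=True)
best = ranked[0]

for h in ranked:
    # exp() maps the log score back into the (0, 1] range.
    print(f"{h['score']:8.4f}  exp={math.exp(h['score']):.4f}  {h['hypothesis']}")
```

The gap of roughly two log units between the first and second entries means the model puts several times more mass on the correct sentence than on any of the variants with a dropped or repeated word.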

As a sanity check, the BLEU score should be a perfect 100/100 in all cases:

INFO:root:13a tokenization, cased
INFO:root:BLEU = 100.00 100.0/100.0/100.0/100.0 (BP = 1.000 ratio = 1.000 hyp_len = 34 ref_len = 34)
INFO:root:13a tokenization, caseless
INFO:root:BLEU = 100.00 100.0/100.0/100.0/100.0 (BP = 1.000 ratio = 1.000 hyp_len = 34 ref_len = 34)
INFO:root:International tokenization, cased
INFO:root:BLEU = 100.00 100.0/100.0/100.0/100.0 (BP = 1.000 ratio = 1.000 hyp_len = 34 ref_len = 34)
INFO:root:International tokenization, caseless
INFO:root:BLEU = 100.00 100.0/100.0/100.0/100.0 (BP = 1.000 ratio = 1.000 hyp_len = 34 ref_len = 34)
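To see why an exact match gives 100 regardless of tokenization or casing, here is a simplified sentence-level BLEU written with the standard library only. It is a sketch of the metric's idea (clipped n-gram precisions, geometric mean, brevity penalty) and not sacrebleu's implementation: it tokenizes on whitespace, applies no smoothing, and `simple_bleu` is a name made up for this example. Because of the whitespace tokenization, its token counts will differ from the `hyp_len` and `ref_len` values in the log above.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU on whitespace tokens, no smoothing, scaled to 0..100."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each n-gram count by how often it occurs in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Penalise hypotheses that are shorter than the reference.
    if len(hyp) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))
    return 100.0 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = (
    "Ich erkläre die am Freitag, dem 17. Dezember unterbrochene Sitzungsperiode "
    "des Europäischen Parlaments für wiederaufgenommen, wünsche Ihnen nochmals "
    "alles Gute zum Jahreswechsel und hoffe, daß Sie schöne Ferien hatten."
)
# One of the lower-ranked beam entries: the word "erkläre" is missing.
dropped_word = reference.replace("Ich erkläre die", "Ich die", 1)

exact_score = simple_bleu(reference, reference)
dropped_score = simple_bleu(dropped_word, reference)
print(exact_score, dropped_score)
```

With an exact match every n-gram precision is 1 and the brevity penalty is 1, so the score is 100 whichever tokenizer is used; a single missing word already lowers it.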

Embeddings

After training for a while, some interesting patterns arise. The project logs the embeddings so that they can be explored in the Embedding Projector.

In the vocabulary shared between the encoder (English) and the decoder (German), some tokens end up close to each other by cosine similarity:

British with britischen and nationaler

british

will and wollte (& konnte, mochte, wurde)

will

Bedenken (pondering) is closest to glaube (believe)

image

Entschließung (resolution) gets associated with completed

Resolution60k

gesammelt (collected) maps to decision and Bestimmung (determination) as well as verstärkt (strengthened)

strengthened_collected_consistent60k

Change also gets associated with neuer (new)

change_neuer_60k
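The neighbours listed above are ranked by cosine similarity between embedding vectors. A minimal sketch of that lookup follows; the three-dimensional vectors are invented toy values, not the trained embeddings, and `nearest` is a hypothetical helper for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(token, embeddings, k=3):
    """Return the k tokens most similar to `token`, best first."""
    query = embeddings[token]
    scored = (
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != token
    )
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Made-up toy vectors; real embeddings have hundreds of dimensions.
toy_embeddings = {
    "British": [0.9, 0.1, 0.0],
    "britischen": [0.8, 0.2, 0.1],
    "will": [0.0, 0.9, 0.3],
    "wollte": [0.1, 0.8, 0.4],
}

print(nearest("British", toy_embeddings, k=1))
print(nearest("will", toy_embeddings, k=1))
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for comparing embeddings.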

Contributors

andreimoraru123
