olivier-tl / low-resource-translation

Project 2 of the course IFT6759

License: MIT License

Python 42.62% Jupyter Notebook 57.29% Shell 0.10%

low-resource-translation's Introduction

low-resource-translation

Running the evaluation script

Install the dependencies in a virtual environment, then call the evaluator.py script with the appropriate input file and target file.

# create and source a clean virtual env
pip install -r requirements.txt

# run the evaluator
python evaluator.py --input-file-path /project/cq-training-1/project2/data/train.lang1 --target-file-path /project/cq-training-1/project2/data/train.lang2

Train models

Train on aligned data

Example to train a transformer on the aligned data.

python train.py --model_name=transformer --batch_size=128 --epochs=100

Train with pre-trained embeddings

Example to train a transformer with pre-trained embeddings.

python train.py --embedding=fasttext --embedding_dim=256 --model_name=transformer --batch_size=128 --epochs=100

Train with back-translation

Example to train a transformer using back-translation.

First, train a target->source model (fr->en):

python train.py --fr_to_en --model_name=transformer --batch_size=128 --epochs=100

Then train the source->target model (en->fr), using the first model for back-translation:

python train.py --back_translation=True --back_translation_model=<path_to_model> --back_translation_ratio=4 --model_name=transformer --batch_size=128 --epochs=100
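
The README does not say how --back_translation_ratio is interpreted. As a rough illustration of the general technique only, the sketch below assumes the ratio is the number of synthetic (back-translated) pairs added per aligned pair. The function name mix_back_translated and the back_translate callable are hypothetical and are not taken from train.py.

```python
import random


def mix_back_translated(real_pairs, mono_targets, back_translate, ratio, seed=0):
    """Combine aligned pairs with synthetic pairs built by back-translation.

    real_pairs:     list of (source, target) aligned sentence pairs.
    mono_targets:   list of monolingual target-language sentences.
    back_translate: callable mapping a target sentence to a synthetic source
                    sentence (stands in for the target->source model).
    ratio:          ASSUMED meaning: synthetic pairs per aligned pair.
    """
    rng = random.Random(seed)
    # Never request more synthetic pairs than there are monolingual sentences.
    n_synthetic = min(len(mono_targets), ratio * len(real_pairs))
    sampled = rng.sample(mono_targets, n_synthetic)
    # The synthetic source is noisy, but the target side is genuine text.
    synthetic = [(back_translate(t), t) for t in sampled]
    mixed = list(real_pairs) + synthetic
    rng.shuffle(mixed)
    return mixed
```

With one aligned pair, five monolingual sentences, and ratio=4, this returns five pairs: the original one plus four synthetic ones.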

Model configurations can be passed with the argument --config=<configuration_dict>

low-resource-translation's Issues

Naive Bayes

Model pretraining

Pretrain the model by doing language modeling (predicting the next word in the text) on our unlabeled data.

We want something general that could be used to pretrain most of our models.
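
A minimal sketch of how next-word prediction examples could be built from a tokenized sentence: the target sequence is the input shifted by one position. The function name next_word_pairs is hypothetical, and the sketch assumes sentences are already converted to lists of token ids.

```python
def next_word_pairs(token_ids):
    """Return (inputs, targets) where targets[i] is the token after inputs[i]."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to predict a next word")
    # Inputs drop the last token, targets drop the first one.
    return token_ids[:-1], token_ids[1:]
```

Because this only needs raw text, the same pairs can feed any model that outputs a distribution over the vocabulary at each position.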

GRU (encoder decoder)

Try some embeddings

Try some word embeddings on the unlabeled data (e.g. Word2Vec, GloVe).
Serialize the word embeddings and make a function that takes a token as input and returns an embedding.
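
One possible shape for the serialization and lookup described above, sketched with a plain JSON file. The function names, the JSON format, and the "<unk>" fallback token are all assumptions made for illustration; the embeddings themselves would come from whichever method is chosen.

```python
import json


def save_embeddings(embeddings, path):
    """Write a {token: [float, ...]} mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(embeddings, f)


def load_embedding_lookup(path, unk_token="<unk>"):
    """Load embeddings and return a function mapping a token to its vector."""
    with open(path, encoding="utf-8") as f:
        table = json.load(f)
    if not table:
        raise ValueError("embedding file is empty")
    dim = len(next(iter(table.values())))
    # Unknown tokens fall back to the <unk> vector, or zeros if it is absent.
    default = table.get(unk_token, [0.0] * dim)

    def lookup(token):
        return table.get(token, default)

    return lookup
```

A text format keeps the file readable; a binary format would be the natural swap if loading time becomes a problem.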

Multi-GPU training

Investigate learning rate warmup

This could be useful when using a transformer architecture.

https://huggingface.co/transformers/_images/warmup_constant_schedule.png
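
As a reference point for the investigation, here is a minimal sketch of a linear warmup followed by a constant learning rate, which is the shape the linked image appears to show. The function name and parameters are hypothetical, and the values in the usage note are placeholders rather than tuned settings.

```python
def warmup_constant_lr(step, base_lr, warmup_steps):
    """Learning rate at a given step: linear ramp to base_lr, then constant."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    # During warmup the rate grows proportionally to the step count.
    return base_lr * step / warmup_steps
```

For example, with base_lr=1e-3 and warmup_steps=100, the rate starts at 0, reaches half of base_lr at step 50, and stays at base_lr from step 100 onward.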

Investigate AWD-LSTM

Set up tensorflow training loop

Load tokens
Padding (sentence length should be a hyperparameter)
Turn them into vectors
Train
Valid
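
The padding step in the list above could look like the following sketch, written in plain Python so it does not depend on any particular TensorFlow utility. The function name pad_batch and the choice of 0 as the padding id are assumptions.

```python
def pad_batch(sequences, max_len, pad_id=0):
    """Truncate or right-pad each token-id sequence to exactly max_len."""
    padded = []
    for seq in sequences:
        clipped = list(seq[:max_len])
        # Fill the remaining positions with the padding id.
        padded.append(clipped + [pad_id] * (max_len - len(clipped)))
    return padded
```

Here max_len plays the role of the sentence-length hyperparameter: longer sentences are cut, shorter ones are filled, and the result is rectangular and ready to be turned into a tensor.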

Subword vocabulary

Byte-pair encoding (BPE)
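
For context, a minimal sketch of how BPE merges can be learned from word frequencies: repeatedly count adjacent symbol pairs and merge the most frequent one. This is an illustrative toy, not the project's tokenizer; the function name learn_bpe and the "</w>" end-of-word marker are choices made for this example.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn up to num_merges BPE merge rules from a {word: count} mapping."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Ties are broken by first occurrence in iteration order.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges
```

The number of merges controls the subword vocabulary size, so it is a natural hyperparameter alongside the word-level vocabulary size.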

Create vocabulary

Only for words
We need to have a vocab for english and french.
Could be serialized to a readable file (e.g. a txt file). If building it is very fast, there is no need to serialize.
Create a function that takes a word and returns a number, and vice versa.
The number of words in the vocab is a hyperparameter.
Add start and end of sentence tokens.
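
The requirements above could be sketched as follows. The special token strings, the function names, and whitespace tokenization are assumptions for illustration; one vocabulary would be built per language.

```python
from collections import Counter

PAD, UNK, SOS, EOS = "<pad>", "<unk>", "<sos>", "<eos>"


def build_vocab(sentences, max_words):
    """Keep the max_words most frequent words, after the special tokens."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    words = [word for word, _ in counts.most_common(max_words)]
    id_to_word = [PAD, UNK, SOS, EOS] + words
    word_to_id = {word: i for i, word in enumerate(id_to_word)}
    return word_to_id, id_to_word


def encode(sentence, word_to_id):
    """Map a sentence to ids, wrapped in start and end of sentence tokens."""
    ids = [word_to_id.get(word, word_to_id[UNK]) for word in sentence.split()]
    return [word_to_id[SOS]] + ids + [word_to_id[EOS]]


def decode(ids, id_to_word):
    """Map ids back to words."""
    return [id_to_word[i] for i in ids]
```

Words outside the max_words most frequent map to the unknown token, so decoding an encoded sentence is lossy for rare words.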

Single layer LSTM (encoder decoder)

Try with and without attention
It would be great to also try a simple RNN

Investigate small transformer

Original transformer paper:
https://arxiv.org/pdf/1706.03762.pdf
