microsoft / inmt-lite

Interactive Neural Machine Translation-lite (INMT-lite) is a framework to train and develop lite versions (.tflite) of models for neural machine translation (NMT) that can be run on embedded devices like mobile phones and tablets that have low computation power and space. The tflite models generated can be used to build the offline version of INMT mobile, a mobile version of INMT web.

License: MIT License

Python 3.24% Shell 0.06% Jupyter Notebook 6.00% CMake 0.43% C 0.02% C++ 89.53% Kotlin 0.38% Java 0.10% PowerShell 0.10% Roff 0.16%

inmt-lite's Introduction

INMT-Lite

Interactive Neural Machine Translation-lite (INMT-Lite) is an assistive translation service that can run on embedded devices such as mobile phones and tablets with low computation power, limited storage, and no internet connectivity. A detailed background of the compression techniques used to drive the assistive interfaces, the models' data and evaluation, and the interface design can be found in the linked works.

Collecting Data through community-oriented channels in under-resourced communities

Compression of Massively Multilingual Translation Models for Offline Operation

Assistive Interfaces for Enhancing and Evaluating Data Collection (Coming Soon!)

Table of Contents

  • Data
  • Models
  • Contributing
  • Trademarks

Data

Hindi-Gondi Parallel Corpus

INMT was developed to help expand digital datasets for low-resource languages and to support the development of other language tools for them. The models in this repository are trained on the first-ever Hindi-Gondi parallel corpus, released by CGNet Swara, which can be found here.

Models

You can access all of our transformer-architecture-based models through the scripts provided in the /models folder.

Transformer-Suite

This section provides the instructions for the Transformer dev-variants: model setup, training, inference, and edge deployment (preparing the model for Android-compatible training). Note that the code in this repository is heavily adapted from the code specified at this repository for generating lightweight NMT models.

Environment Information

The environment can be set up using the provided requirements file:

pip install -r requirements.txt

Training Procedure (Generic/Not Compatible with the Android Deployment Pipeline)

1. Run **preprocess.py** to convert the training data to the HF format and generate the tokenizer files for the vanilla transformer.
2. Run **train.py** to train and save the best model (the monitored metric is BLEU with the mt13eval tokenizer).
3. Run **split_saving_{model_architecture_type}.py** to quantize the encoder and decoder separately.
4. Run **inference.py** (with offline = True) for offline inference on the quantized graphs (a sketch of driving the split graphs is shown below).
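
Step 4 runs the two quantized graphs produced in step 3 back to back. The sketch below is only an illustration of driving split encoder/decoder TFLite graphs with tf.lite.Interpreter, not the actual inference.py logic; the file names and the decoder's input ordering are assumptions that depend on how split_saving_*.py exported the graphs.

import numpy as np
import tensorflow as tf

# Load the separately exported/quantized graphs (file names are hypothetical).
encoder = tf.lite.Interpreter(model_path="encoder_quant.tflite")
decoder = tf.lite.Interpreter(model_path="decoder_quant.tflite")
encoder.allocate_tensors()
decoder.allocate_tensors()

enc_in, enc_out = encoder.get_input_details(), encoder.get_output_details()
dec_in, dec_out = decoder.get_input_details(), decoder.get_output_details()

# Encode the source token ids (zeros here stand in for real tokenizer output).
src_ids = np.zeros(enc_in[0]["shape"], dtype=enc_in[0]["dtype"])
encoder.set_tensor(enc_in[0]["index"], src_ids)
encoder.invoke()
enc_states = encoder.get_tensor(enc_out[0]["index"])

# One decoder step: feed the encoder states plus the partial target sequence.
# Which input index takes which tensor depends on the export, so inspect dec_in first.
dec_ids = np.zeros(dec_in[0]["shape"], dtype=dec_in[0]["dtype"])
decoder.set_tensor(dec_in[0]["index"], dec_ids)
decoder.set_tensor(dec_in[1]["index"], enc_states)
decoder.invoke()
logits = decoder.get_tensor(dec_out[0]["index"])
next_id = int(np.argmax(logits[0, -1]))  # greedy pick for the next target token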

Note that to make the model Android-compatible, we use an entirely different tokenization procedure:

1. Run final_tokenizer_train.py - this creates the spm models that will be used for tokenization (a minimal sketch of this step follows the list).
2. Run spm_extractor.py - this creates the vocab files (required by the Hugging Face interface) from the serialized models.
3. Run make_vocab_from_extracted_files.py - to generate a concatenated vocab that will be used to instantiate the tokenizer. Make sure to edit start_idx to match the length of your source_vocab.
4. Run train.py with the required arguments (marian_tokenizer set to True, and the spm models provided) to start the training.
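
Step 1 essentially trains one SentencePiece model per language. The snippet below is only a sketch of that step, not final_tokenizer_train.py itself; the input file names and vocab sizes are placeholders, and only the model_prefix values echo the example files under marian/hi-gondi.

import sentencepiece as spm

# Train one spm model per language; everything here besides the prefixes is illustrative.
spm.SentencePieceTrainer.train(
    input="train.hi.txt",
    model_prefix="spiece_test_hi",
    vocab_size=8000,
    model_type="unigram",
)
spm.SentencePieceTrainer.train(
    input="train.gondi.txt",
    model_prefix="spiece_test_gondi",
    vocab_size=8000,
    model_type="unigram",
)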

Directory Structure

├── confidence_estimation.py                # Monitoring online models' logits - Softmax Entropy, Top-K probabilities Dispersion
├── confidence_visualization.ipynb          # Visualizing the models' confidence
├── inference.py                            # Inference for all models [mt5, vanilla] (for distillation models, see tflite_inference_distilled_models.py)
├── make_concatenated_vocab.py              # Making the vocab to train the Marian Tokenizer (Used for compatibility with Deployment goal)
├── mt5_inference.py                        # mt5-specific inferencing script 
├── preprocess.py                           # Creates train/test files in the format that is required by the dataloader + tokenizer training 
├── requirements.txt                        # package requirements for these scripts      
├── sc.py                                   # Script conversion before and after training for languages with unseen scripts 
├── split_saving_mt5.py                     # Converting finetuned mt5 models to offline graphs (split into encoder and decoder)
├── split_saving_tfb.py                     # Converting trained vanilla transformer models to offline graphs (split into encoder and decoder)
├── spm_extractor.py                        # Used to extract vocab/merges from the spm models (Used for compatibility with Deployment goal)
├── spm_model_generator.py                  # Generating the spm models for the Marian Tokenizer (Used for compatibility with Deployment goal)
├── student_labels.py                       # Generates distillation labels in batches using source-lang monolingual data
├── sweep.yaml                              # Yaml configuration file for running sweeps on Wandb
├── tflite_inference_distilled_models.py    # Sequential inferencing with the vanilla transformer models 
├── marian                                  # Marian tokenizer models (Used for compatibility with Deployment goal, Example files are provided as output ref)
│   └── hi-gondi
│       ├── merges_gondi.txt
│       ├── merges_hi.txt
│       ├── spiece_test_gondi.model
│       ├── spiece_test_gondi.vocab
│       ├── spiece_test_hi.model
│       └── spiece_test_hi.vocab 
├── LICENSE
├── README.md
├── SECURITY.md
└── train.py                                # Training script (Supports Continued Pretraining of mt5, Marian Tokenizer Training)

RNN-Suite

Directory Structure: 
├── RNN-Suite
│   ├── preprocess.py
│   ├── train.py
│   ├── translate.py
│   └── utils
│       ├── Model_architectures.py
│       └── Model_types.py
└── requirements.txt

Environment Information

Create a separate environment and install the necessary packages using the following command in the root path:

pip install -r requirements.txt

Training Procedure


1. **preprocess.py** - Code for preprocessing the input data for the models.
2. **train.py** - Code for training the models.
3. **translate.py** - Code for performing inference/testing on the trained models.
4. **utils/Model_architectures.py** - Code for defining the architecture of the Encoder and the Decoder blocks.
5. **utils/Model_types.py** - Code for building specific models for translation and partial mode.

Please refer to the readme in the RNN root folder for a detailed overview.
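
As an illustration of the procedure above, the sketch below chains the three scripts from the RNN-Suite directory. Only --src_word_vec_size and --tgt_word_vec_size are flags referenced elsewhere on this page; the values and any other arguments are placeholders, so check each script's own help for the real options.

import subprocess

# Hypothetical end-to-end driver for the RNN suite; run from the RNN-Suite directory.
subprocess.run(["python", "preprocess.py"], check=True)
subprocess.run(
    ["python", "train.py", "--src_word_vec_size", "256", "--tgt_word_vec_size", "256"],
    check=True,
)
subprocess.run(["python", "translate.py"], check=True)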

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

inmt-lite's People

Contributors

anuragshukla06, dependabot[bot], harshitadd, microsoft-github-operations[bot], microsoftopensource, mohdsanadzakirizvi, tanuja-ganu

inmt-lite's Issues

Build Data Pipelines for training on larger datasets

Presently, the model can train on 320,000 sentence pairs of 14 tokens each on a Tesla P100 GPU. The dataset is loaded into memory all at once.

Constructing data pipelines would allow loading only a batch's worth of data into memory at a time, allowing training on larger datasets.
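
One way to do this (a sketch only, with made-up file names and batch size) is to stream the parallel text with tf.data so that only the current batch is materialized in memory:

import tensorflow as tf

# Stream source/target lines from disk instead of loading the whole corpus.
src = tf.data.TextLineDataset("train.src.txt")
tgt = tf.data.TextLineDataset("train.tgt.txt")
pairs = tf.data.Dataset.zip((src, tgt))
pairs = pairs.shuffle(10_000).batch(64).prefetch(tf.data.AUTOTUNE)

for src_batch, tgt_batch in pairs.take(1):
    print(src_batch.shape)  # (64,) raw lines; tokenization would go in a map() step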

Plotting training graph

We have to provide a mechanism to plot training and validation graphs by storing the metrics
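
A minimal sketch of what this could look like, assuming the training loop writes per-epoch losses to a JSON file (the file name and schema here are invented):

import json
import matplotlib.pyplot as plt

# Load the stored metrics and plot training vs. validation loss per epoch.
with open("metrics.json") as f:
    metrics = json.load(f)  # e.g. {"train_loss": [...], "val_loss": [...]}

epochs = range(1, len(metrics["train_loss"]) + 1)
plt.plot(epochs, metrics["train_loss"], label="training loss")
plt.plot(epochs, metrics["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.savefig("loss_curve.png")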

Performance model

Hello experts,

Thank you for your contribution.

I tried to train a model with 90000 sentences from English to Spanish, but the performance of my model is not good.

I tried to change recurrent_hidden to 1024; I also changed --src_word_vec_size and --tgt_word_vec_size, as you recommended, but the problem is the same.

Below I show the results for 20 epochs. As you can see, the validation loss is bad.

Could you help me, please?

  Epoch 10 Batch 0 Loss 5.6960
  Epoch 10 Batch 100 Loss 5.3570
  Epoch 10 Training Loss 5.4667
  Time taken for 1 epoch 529.2065329551697 sec
  
  Epoch 10 Validation Loss 10.6312
  Time taken for validation 60.65908360481262 sec
  
  Epoch 11 Batch 0 Loss 4.7367
  Epoch 11 Batch 100 Loss 5.1832
  Epoch 11 Training Loss 5.1685
  Time taken for 1 epoch 529.9154849052429 sec
  
  Epoch 11 Validation Loss 10.8019
  Time taken for validation 59.725284814834595 sec
  
  Epoch 12 Batch 0 Loss 4.7126
  Epoch 12 Batch 100 Loss 5.1652
  Epoch 12 Training Loss 5.1701
  Time taken for 1 epoch 532.0332324504852 sec
  
  Epoch 12 Validation Loss 11.5809
  Time taken for validation 60.39210081100464 sec
  
  Epoch 13 Batch 0 Loss 4.7674
  Epoch 13 Batch 100 Loss 5.1166
  Epoch 13 Training Loss 5.0376
  Time taken for 1 epoch 527.922877073288 sec
  
  Epoch 13 Validation Loss 10.9454
  Time taken for validation 60.049590826034546 sec
  
  Epoch 14 Batch 0 Loss 4.6097
  Epoch 14 Batch 100 Loss 5.0026
  Epoch 14 Training Loss 4.9551
  Time taken for 1 epoch 533.555011510849 sec
  
  Epoch 14 Validation Loss 11.2711
  Time taken for validation 62.934301137924194 sec
  
  Epoch 15 Batch 0 Loss 4.5907
  Epoch 15 Batch 100 Loss 4.6141
  Epoch 15 Training Loss 4.8234
  Time taken for 1 epoch 530.599461555481 sec
  
  Epoch 15 Validation Loss 11.9423
  Time taken for validation 59.88873314857483 sec
  
  Epoch 16 Batch 0 Loss 4.5649
  Epoch 16 Batch 100 Loss 4.8182
  Epoch 16 Training Loss 4.7378
  Time taken for 1 epoch 533.0982737541199 sec
  
  Epoch 16 Validation Loss 11.9251
  Time taken for validation 62.00506854057312 sec
  
  Epoch 17 Batch 0 Loss 4.2148
  Epoch 17 Batch 100 Loss 4.1748
  Epoch 17 Training Loss 4.5810
  Time taken for 1 epoch 530.5870060920715 sec
  
  Epoch 17 Validation Loss 11.7330
  Time taken for validation 60.24014854431152 sec
  
  Epoch 18 Batch 0 Loss 4.2759
  Epoch 18 Batch 100 Loss 4.6324
  Epoch 18 Training Loss 4.5561
  Time taken for 1 epoch 534.9815211296082 sec
  
  Epoch 18 Validation Loss 12.2905
  Time taken for validation 61.87094283103943 sec
  
  Epoch 19 Batch 0 Loss 4.3838
  Epoch 19 Batch 100 Loss 4.7557
  Epoch 19 Training Loss 4.4805
  Time taken for 1 epoch 533.8275711536407 sec
  
  Epoch 19 Validation Loss 12.4593
  Time taken for validation 62.41018629074097 sec
  
  Epoch 20 Batch 0 Loss 4.0156
  Epoch 20 Batch 100 Loss 4.3589
  Epoch 20 Training Loss 4.3792
  Time taken for 1 epoch 533.2257282733917 sec
  
  Epoch 20 Validation Loss 12.6180
  Time taken for validation 63.08395957946777 sec

Training Convergence

Hi,
Thank you for your great work.
I followed your instructions and tried to train a TFLite model.
But it looks like the models do not converge well.

The loss is as follows:
Epoch 1 Batch 0 Loss 10.1416
Epoch 1 Batch 100 Loss 6.4002
Epoch 1 Batch 200 Loss 5.1489
Epoch 1 Training Loss 6.0853
Time taken for 1 epoch 76.00192999839783 sec

Epoch 1 Validation Loss 11.4574
Time taken for validation 2.3923656940460205 sec

......

Epoch 99 Batch 0 Loss 1.8126
Epoch 99 Batch 100 Loss 2.3800
Epoch 99 Batch 200 Loss 2.4746
Epoch 99 Training Loss 2.4855
Time taken for 1 epoch 52.326969385147095 sec

Epoch 99 Validation Loss 25.6011
Time taken for validation 2.445949077606201 sec

Epoch 100 Batch 0 Loss 2.5111
Epoch 100 Batch 100 Loss 2.7126
Epoch 100 Batch 200 Loss 2.5386
Epoch 100 Training Loss 2.5118
Time taken for 1 epoch 56.2775776386261 sec

Epoch 100 Validation Loss 25.9406
Time taken for validation 6.213912010192871 sec

And I tested the ACC of the model. It is only around 8.3.

What should I do to improve the performance of the model?

Missing files

You mentioned that to make the model Android-compatible, you use an entirely different tokenization procedure.
Could you let us know where these files are?

  1. Run final_tokenizer_train.py
  2. Run spm_extractor.py

I couldn't find them in GitHub.

Constructing a script for an automatic Android build given a model config

The generated model has to be manually copied into the app's assets folder, and some code also has to be manually changed to build the app.

A script that automatically builds the app from a model configuration providing all the parameters about the model, including the vocabulary and the model path, would smooth the process.
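
A rough sketch of the copy step such a script would automate, assuming a small config that points at the exported model files and the app's assets directory (all paths and keys here are hypothetical):

import shutil
from pathlib import Path

# Hypothetical model config; in practice this could be loaded from JSON/YAML.
config = {
    "encoder": "encoder_quant.tflite",
    "decoder": "decoder_quant.tflite",
    "vocab": "concatenated_vocab.txt",
    "assets_dir": "app/src/main/assets",
}

assets = Path(config["assets_dir"])
assets.mkdir(parents=True, exist_ok=True)
for key in ("encoder", "decoder", "vocab"):
    shutil.copy(config[key], assets / Path(config[key]).name)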
