microsoft / inmt-lite

Interactive Neural Machine Translation-lite (INMT-lite) is a framework to train and develop lite versions (.tflite) of models for neural machine translation (NMT) that can be run on embedded devices like mobile phones and tablets that have low computation power and space. The tflite models generated can be used to build the offline version of INMT mobile, a mobile version of INMT web.

License: MIT License

Python 3.24% Shell 0.06% Jupyter Notebook 6.00% CMake 0.43% C 0.02% C++ 89.53% Kotlin 0.38% Java 0.10% PowerShell 0.10% Roff 0.16%

inmt-lite's Introduction

INMT-Lite

Interactive Neural Machine Translation-lite (INMT-Lite) is an assistive translation service that can run on embedded devices such as mobile phones and tablets with low computation power, limited storage, and no internet connectivity. A detailed background of the compression techniques used to drive the assistive interfaces, the models' data and evaluation, and the interface design can be found in the linked works.

Collecting Data through community-oriented channels in under-resourced communities

Compression of Massively Multilingual Translation Models for Offline Operation

Assistive Interfaces for Enhancing and Evaluating Data Collection (Coming Soon!)

Table of Contents

  • Data
  • Models
  • Contributing
  • Trademarks

Data

Hindi-Gondi Parallel Corpus

INMT was developed to help expand digital datasets for low-resource languages and to support the development of other language tools for them. The models in this repository are trained on the first-ever Hindi-Gondi parallel corpus, released by CGNet Swara, which can be found here.

Models

You can access all of our transformer-architecture-based models through the scripts provided in the /models folder.

Transformer-Suite

This section provides the instructions for the Transformer dev-variants: model setup, training, inference, and edge deployment (preparing the model for Android-compatible training). Note that the code in this repository is heavily adapted from the code specified at this repository for generating lightweight NMT models.

Environment Information

The environment can be set up using the provided requirements file:

pip install -r requirements.txt

Training Procedure (Generic/Not Compatible with the Android Deployment Pipeline)

1. Run **preprocess.py** to convert the training data to the HF format and generate the tokenizer files for the vanilla transformer.
2. Run **train.py** to train and save the best model (the monitored metric is BLEU with the mt13eval tokenizer).
3. Run **split_saving_{model_architecture_type}.py** to quantize the encoder and decoder separately.
4. Run **inference.py** (with offline = True) for offline inference on the quantized graphs (a sketch of driving the split graphs is shown below).
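
Step 4 runs the two quantized graphs produced in step 3 back to back. The sketch below is only an illustration of driving split encoder/decoder TFLite graphs with tf.lite.Interpreter, not the actual inference.py logic; the file names and the decoder's input ordering are assumptions that depend on how split_saving_*.py exported the graphs.

import numpy as np
import tensorflow as tf

# Load the separately exported/quantized graphs (file names are hypothetical).
encoder = tf.lite.Interpreter(model_path="encoder_quant.tflite")
decoder = tf.lite.Interpreter(model_path="decoder_quant.tflite")
encoder.allocate_tensors()
decoder.allocate_tensors()

enc_in, enc_out = encoder.get_input_details(), encoder.get_output_details()
dec_in, dec_out = decoder.get_input_details(), decoder.get_output_details()

# Encode the source token ids (zeros here stand in for real tokenizer output).
src_ids = np.zeros(enc_in[0]["shape"], dtype=enc_in[0]["dtype"])
encoder.set_tensor(enc_in[0]["index"], src_ids)
encoder.invoke()
enc_states = encoder.get_tensor(enc_out[0]["index"])

# One decoder step: feed the encoder states plus the partial target sequence.
# Which input index takes which tensor depends on the export, so inspect dec_in first.
dec_ids = np.zeros(dec_in[0]["shape"], dtype=dec_in[0]["dtype"])
decoder.set_tensor(dec_in[0]["index"], dec_ids)
decoder.set_tensor(dec_in[1]["index"], enc_states)
decoder.invoke()
logits = decoder.get_tensor(dec_out[0]["index"])
next_id = int(np.argmax(logits[0, -1]))  # greedy pick for the next target token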

Note that to make the model Android-compatible, we use an entirely different tokenization procedure:

1. Run final_tokenizer_train.py - this creates the spm models that will be used for tokenization (a minimal sketch of this step follows the list).
2. Run spm_extractor.py - this creates the vocab files (required by the Hugging Face interface) from the serialized models.
3. Run make_vocab_from_extracted_files.py - to generate a concatenated vocab that will be used to instantiate the tokenizer. Make sure to edit start_idx to match the length of your source_vocab.
4. Run train.py with the required arguments (marian_tokenizer set to True, and the spm models provided) to start the training.
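
Step 1 essentially trains one SentencePiece model per language. The snippet below is only a sketch of that step, not final_tokenizer_train.py itself; the input file names and vocab sizes are placeholders, and only the model_prefix values echo the example files under marian/hi-gondi.

import sentencepiece as spm

# Train one spm model per language; everything here besides the prefixes is illustrative.
spm.SentencePieceTrainer.train(
    input="train.hi.txt",
    model_prefix="spiece_test_hi",
    vocab_size=8000,
    model_type="unigram",
)
spm.SentencePieceTrainer.train(
    input="train.gondi.txt",
    model_prefix="spiece_test_gondi",
    vocab_size=8000,
    model_type="unigram",
)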

Directory Structure

├── confidence_estimation.py                # Monitoring online models' logits - Softmax Entropy, Top-K probabilities Dispersion
├── confidence_visualization.ipynb          # Visualizing the models' confidence
├── inference.py                            # Inference for all models [mt5, vanilla] (for distillation models, see tflite_inference_distilled_models.py)
├── make_concatenated_vocab.py              # Making the vocab to train the Marian Tokenizer (Used for compatibility with Deployment goal)
├── mt5_inference.py                        # mt5-specific inferencing script 
├── preprocess.py                           # Creates train/test files in the format that is required by the dataloader + tokenizer training 
├── requirements.txt                        # package requirements for these scripts      
├── sc.py                                   # Script conversion before and after training for languages with unseen scripts 
├── split_saving_mt5.py                     # Converting finetuned mt5 models to offline graphs (split into encoder and decoder)
├── split_saving_tfb.py                     # Converting trained vanilla transformer models to offline graphs (split into encoder and decoder)
├── spm_extractor.py                        # Used to extract vocab/merges from the spm models (Used for compatibility with Deployment goal)
├── spm_model_generator.py                  # Generating the spm models for the Marian Tokenizer (Used for compatibility with Deployment goal)
├── student_labels.py                       # Generates distillation labels in batches using source-lang monolingual data
├── sweep.yaml                              # Yaml configuration file for running sweeps on Wandb
├── tflite_inference_distilled_models.py    # Sequential inferencing with the vanilla transformer models 
├── marian                                  # Marian tokenizer models (Used for compatibility with Deployment goal, Example files are provided as output ref)
│   └── hi-gondi
│       ├── merges_gondi.txt
│       ├── merges_hi.txt
│       ├── spiece_test_gondi.model
│       ├── spiece_test_gondi.vocab
│       ├── spiece_test_hi.model
│       └── spiece_test_hi.vocab 
├── LICENSE
├── README.md
├── SECURITY.md
└── train.py                                # Training script (Supports Continued Pretraining of mt5, Marian Tokenizer Training)

RNN-Suite

Directory Structure: 
├── RNN-Suite
│   ├── preprocess.py
│   ├── train.py
│   ├── translate.py
│   └── utils
│       ├── Model_architectures.py
│       └── Model_types.py
└── requirements.txt

Environment Information

Create a separate environment and install the necessary packages using the following command in the root path:

pip install -r requirements.txt

Training Procedure


1. **preprocess.py** - Code for preprocessing the input data for the models.
2. **train.py** - Code for training the models.
3. **translate.py** - Code for performing inference/testing on the trained models.
4. **utils/Model_architectures.py** - Code for defining the architecture of the Encoder and the Decoder blocks.
5. **utils/Model_types.py** - Code for building specific models for translation and partial mode.

Please refer to the readme in the RNN root folder for a detailed overview.
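
As an illustration of the procedure above, the sketch below chains the three scripts from the RNN-Suite directory. Only --src_word_vec_size and --tgt_word_vec_size are flags referenced elsewhere on this page; the values and any other arguments are placeholders, so check each script's own help for the real options.

import subprocess

# Hypothetical end-to-end driver for the RNN suite; run from the RNN-Suite directory.
subprocess.run(["python", "preprocess.py"], check=True)
subprocess.run(
    ["python", "train.py", "--src_word_vec_size", "256", "--tgt_word_vec_size", "256"],
    check=True,
)
subprocess.run(["python", "translate.py"], check=True)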

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

inmt-lite's People

Contributors

anuragshukla06, dependabot[bot], harshitadd, microsoft-github-operations[bot], microsoftopensource, mohdsanadzakirizvi, tanuja-ganu

inmt-lite's Issues

Build Data Pipelines for training on larger datasets

Presently, the model can train on 320,000 sentence pairs of 14 tokens each on a Tesla P100 GPU. The dataset is loaded into memory all at once.

Constructing data pipelines would allow loading only a batch's worth of data into memory at a time, allowing training on larger datasets.
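
One way to do this (a sketch only, with made-up file names and batch size) is to stream the parallel text with tf.data so that only the current batch is materialized in memory:

import tensorflow as tf

# Stream source/target lines from disk instead of loading the whole corpus.
src = tf.data.TextLineDataset("train.src.txt")
tgt = tf.data.TextLineDataset("train.tgt.txt")
pairs = tf.data.Dataset.zip((src, tgt))
pairs = pairs.shuffle(10_000).batch(64).prefetch(tf.data.AUTOTUNE)

for src_batch, tgt_batch in pairs.take(1):
    print(src_batch.shape)  # (64,) raw lines; tokenization would go in a map() step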

Plotting training graph

We have to provide a mechanism to plot training and validation graphs by storing the metrics
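
A minimal sketch of what this could look like, assuming the training loop writes per-epoch losses to a JSON file (the file name and schema here are invented):

import json
import matplotlib.pyplot as plt

# Load the stored metrics and plot training vs. validation loss per epoch.
with open("metrics.json") as f:
    metrics = json.load(f)  # e.g. {"train_loss": [...], "val_loss": [...]}

epochs = range(1, len(metrics["train_loss"]) + 1)
plt.plot(epochs, metrics["train_loss"], label="training loss")
plt.plot(epochs, metrics["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.savefig("loss_curve.png")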

Performance model

Hello experts,

Thank you for your contribution.

I tried to train a model with 90000 sentences from English to Spanish, but the performance of my model is not good.

I tried to change recurrent_hidden to 1024; I also changed --src_word_vec_size and --tgt_word_vec_size, as you recommended, but the problem is the same.

Below I show the results for 20 epochs. As you can see, the validation loss is bad.

Could you help me, please?

  Epoch 10 Batch 0 Loss 5.6960
  Epoch 10 Batch 100 Loss 5.3570
  Epoch 10 Training Loss 5.4667
  Time taken for 1 epoch 529.2065329551697 sec
  
  Epoch 10 Validation Loss 10.6312
  Time taken for validation 60.65908360481262 sec
  
  Epoch 11 Batch 0 Loss 4.7367
  Epoch 11 Batch 100 Loss 5.1832
  Epoch 11 Training Loss 5.1685
  Time taken for 1 epoch 529.9154849052429 sec
  
  Epoch 11 Validation Loss 10.8019
  Time taken for validation 59.725284814834595 sec
  
  Epoch 12 Batch 0 Loss 4.7126
  Epoch 12 Batch 100 Loss 5.1652
  Epoch 12 Training Loss 5.1701
  Time taken for 1 epoch 532.0332324504852 sec
  
  Epoch 12 Validation Loss 11.5809
  Time taken for validation 60.39210081100464 sec
  
  Epoch 13 Batch 0 Loss 4.7674
  Epoch 13 Batch 100 Loss 5.1166
  Epoch 13 Training Loss 5.0376
  Time taken for 1 epoch 527.922877073288 sec
  
  Epoch 13 Validation Loss 10.9454
  Time taken for validation 60.049590826034546 sec
  
  Epoch 14 Batch 0 Loss 4.6097
  Epoch 14 Batch 100 Loss 5.0026
  Epoch 14 Training Loss 4.9551
  Time taken for 1 epoch 533.555011510849 sec
  
  Epoch 14 Validation Loss 11.2711
  Time taken for validation 62.934301137924194 sec
  
  Epoch 15 Batch 0 Loss 4.5907
  Epoch 15 Batch 100 Loss 4.6141
  Epoch 15 Training Loss 4.8234
  Time taken for 1 epoch 530.599461555481 sec
  
  Epoch 15 Validation Loss 11.9423
  Time taken for validation 59.88873314857483 sec
  
  Epoch 16 Batch 0 Loss 4.5649
  Epoch 16 Batch 100 Loss 4.8182
  Epoch 16 Training Loss 4.7378
  Time taken for 1 epoch 533.0982737541199 sec
  
  Epoch 16 Validation Loss 11.9251
  Time taken for validation 62.00506854057312 sec
  
  Epoch 17 Batch 0 Loss 4.2148
  Epoch 17 Batch 100 Loss 4.1748
  Epoch 17 Training Loss 4.5810
  Time taken for 1 epoch 530.5870060920715 sec
  
  Epoch 17 Validation Loss 11.7330
  Time taken for validation 60.24014854431152 sec
  
  Epoch 18 Batch 0 Loss 4.2759
  Epoch 18 Batch 100 Loss 4.6324
  Epoch 18 Training Loss 4.5561
  Time taken for 1 epoch 534.9815211296082 sec
  
  Epoch 18 Validation Loss 12.2905
  Time taken for validation 61.87094283103943 sec
  
  Epoch 19 Batch 0 Loss 4.3838
  Epoch 19 Batch 100 Loss 4.7557
  Epoch 19 Training Loss 4.4805
  Time taken for 1 epoch 533.8275711536407 sec
  
  Epoch 19 Validation Loss 12.4593
  Time taken for validation 62.41018629074097 sec
  
  Epoch 20 Batch 0 Loss 4.0156
  Epoch 20 Batch 100 Loss 4.3589
  Epoch 20 Training Loss 4.3792
  Time taken for 1 epoch 533.2257282733917 sec
  
  Epoch 20 Validation Loss 12.6180
  Time taken for validation 63.08395957946777 sec

Training Convergence

Hi,
Thank you for your great work.
I followed your instructions and tried to train a TFLite model.
But it looks like the models do not converge well.

The loss is as follows:
Epoch 1 Batch 0 Loss 10.1416
Epoch 1 Batch 100 Loss 6.4002
Epoch 1 Batch 200 Loss 5.1489
Epoch 1 Training Loss 6.0853
Time taken for 1 epoch 76.00192999839783 sec

Epoch 1 Validation Loss 11.4574
Time taken for validation 2.3923656940460205 sec

......

Epoch 99 Batch 0 Loss 1.8126
Epoch 99 Batch 100 Loss 2.3800
Epoch 99 Batch 200 Loss 2.4746
Epoch 99 Training Loss 2.4855
Time taken for 1 epoch 52.326969385147095 sec

Epoch 99 Validation Loss 25.6011
Time taken for validation 2.445949077606201 sec

Epoch 100 Batch 0 Loss 2.5111
Epoch 100 Batch 100 Loss 2.7126
Epoch 100 Batch 200 Loss 2.5386
Epoch 100 Training Loss 2.5118
Time taken for 1 epoch 56.2775776386261 sec

Epoch 100 Validation Loss 25.9406
Time taken for validation 6.213912010192871 sec

And I tested the ACC of the model. It is only around 8.3.

What should I do to improve the performance of the model?

Missing files

You mentioned that to make the model Android-compatible, you use an entirely different tokenization procedure.
Could you let us know where these files are?

  1. Run final_tokenizer_train.py
  2. Run spm_extractor.py

I couldn't find them in GitHub.

Constructing a script for an automatic Android build given a model config

The generated model has to be manually copied into the app's assets folder, and some code also has to be manually changed to build the app.

A script that automatically builds the app from a model configuration providing all the parameters about the model, including the vocabulary and the model path, would smooth the process.
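
A rough sketch of the copy step such a script would automate, assuming a small config that points at the exported model files and the app's assets directory (all paths and keys here are hypothetical):

import shutil
from pathlib import Path

# Hypothetical model config; in practice this could be loaded from JSON/YAML.
config = {
    "encoder": "encoder_quant.tflite",
    "decoder": "decoder_quant.tflite",
    "vocab": "concatenated_vocab.txt",
    "assets_dir": "app/src/main/assets",
}

assets = Path(config["assets_dir"])
assets.mkdir(parents=True, exist_ok=True)
for key in ("encoder", "decoder", "vocab"):
    shutil.copy(config[key], assets / Path(config[key]).name)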
