Giter Club home page Giter Club logo

alp-mt's Introduction

ALP-MT

This project is aimed at aligning translations of the Bible for the task of Machine Translation. We consider the English version to be our point of reference so all other efforts at adding datasets for other languages work towards aligning them with the English version, hence by extension, with each other. NEVER EDIT THE ENGLISH TEXTS AS THEY ARE THE 'SOURCE OF TRUTH'.

Current Languages:

  • english
  • kamba
  • kikuyu
  • kinyarwanda
  • luganda
  • luo
  • sesotho_lso
  • setswana
  • shona
  • swahili
  • tsonga
  • xhosa

Software Requirements:

  • python3
  • click

The folder named datasets contains raw, preprocessed and aligned folders where different versions of the files we are working on will live.

This folder contains 4 scripts that we will use:

  • text_preprocessing.py
  • alignment_check.py
  • discrepancy_discovery.py
  • final_processing.py Running each of the above with the flag --help will give you a quick reminder of the options the script requires to run. example usage:
$ python3 text_preprocessing.py --help 

or

$ python text_preprocessing.py --help 

(depending on whether you have both python2 and python3 installed) The output will be as below:

Usage: text_preprocessing.py [OPTIONS]

Options:
  --language TEXT  language you are working with eg. swahili
  --book TEXT      number of book you are aligning eg. 12
  --help           Show this message and exit.

How to:

  1. the text_preprocessing function, as you might have guessed, does some preprocessing on the text. It requires the options --language that you are working with, and --book of the bible that you are aligning. It removes most punctuation and breaks sentences up at full-stop so that each is on a new line.

example usage:

$ python3 text_preprocessing.py --language swahili --book 4

NOTE:

-The text_preprocessing function may throw a UnicodeDecodeError. This will require you to edit line 10 of text_preprocessing.py, to explicitly state the encoding to be used, depending on the language you are working on.For example, in swahili, use utf-8, in kikuyu, use latin-1.

example usage:

   10.  file = open("datasets/raw/" + language + "/" + book + "." + bible_dictionaries.languages[language][book] + ".txt", "r",encoding="latin-1").read()
  1. the alignment_check.py function will check to see if the text you are working on has the same number of sentences as its English equivalent. Should it not, you will see the difference in the number of lines.

example usage:

$ python3 alignment_check.py --language swahili --book 4

example response if unaligned:

2 line(s) mismatch

example response if aligned:

All Good!

If you have run this script on a book for the first time and it is aligned, please also run the third script for good measure. Sometimes the number of sentences may be equal but they have not been split correctly. Also, as this process is iterative, as you continually work on a book, sometimes the difference in the number of sentences will rise as opposed to fall. Do not be discouraged, keep going, eventually it shall get to alignment.

  1. the discrepancy_discovery.py function will let you know where exactly sentences are not aligned when the alignment check has failed.

example usage:

$ python3 discrepancy_discovery.py --language swahili --book 4

example response:

1045

It returns an integer, which is the line with the problem. a) Open the preprocessed version of the book you have been working on in a text editor that shows line numbers. Navigate to the line indicated as being the problem.

b) If this line begins with a number, copy a small part of it. If does not begin with a number(the verse), instead navigate to the closest sentence before it that begins with a number and copy a small part of that instead.

c) Proceed to the raw version of the book of the Bible you are working on and search for the snippet you copied from the preprocessed text file. Note that this file has punctuation whereas the other does not so you may need to alter the snippet copied, make it shorter, before it matches to the equivalent here because of the differences in punctuation.

d) Identify the number of the line that has the issue on the raw text file of your language and then open the raw English equivalent text and navigate to the same sentence number.

e) Examine the differences in the two looking out particularly for differences in punctuation. One sentence may have more fullstops than the other, hence they are split into a different number of sentences. More often than not, all you have to do is change the punctuation in the version that you are working with to MATCH THE ENGLISH PUNCUATION. Remember, NEVER EDIT THE ENGLISH TEXTS AS THEY ARE THE 'SOURCE OF TRUTH'. More rare is having to add a short sentence or phrase for the two texts to match.

f) Once the change is made, save the raw text of the file you are working on and then proceed to repeat the process from step 1, the text_preprocessing script. Repeat this process, steps one to three, until the alignment_check is all good and the discrepancy_discovery gives you the 'IndexError: string index out of range' error.

  1. the final_processing.py removes all punctutation, does further processing and saves the output in the aligned folder.

example usage:

$ python3 final_processing.py --language swahili --book 4

Contributing to ALP-MT

1.Fork the project repository.

2.Make a topic branch. In your github form, keep the master branch clean. When you create a branch, it essentially will be a copy of the master.

Pull all changes, make sure your repository is up to date

$ cd ALP-MT
$ git pull origin master

Create a new branch as follows-> git checkout -b [language_you_are_contributing_to], e.g.

$ git checkout -b swahili master

See all branches created

$ git branch
* swahili
  master

Push the new branch to github

$ git push origin -u swahili

3.Remember to only make changes to your branch! Also, thanks for doing all this boring work and contributing! <3 <3

4.Commit the changes to your branch.

5.Make a pull request to the ALP-MT Repo.

alp-mt's People

Contributors

freshia avatar witty-kitty avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.