
wycliffe-urbana-2015


Wycliffe@Urbana15: #Hacktranslation

We aim to bring the power of natural language processing (NLP) to bear on the work of Bible translation.

The problem

Bible translation takes many years of dedicated, hard work from a large number of people. One of the major tasks, accounting for a large portion of the time required, is gathering properly defined vocabulary and identifying parts of speech. An existing process called Rapid Word Collection does an excellent job of collecting a large number of words in a very short period of time, but it is almost entirely manual and dependent on human intervention. We would like to develop processes leveraging NLP tools (or anything else; we aren't picky!) that would allow us to 'discover' more words in the target language in less time, using fewer resources.

Many of these languages have little written literature, so large sets of words are hard to come by; some are still largely unwritten. Almost all are spoken in areas where internet connectivity is scarce, slow, or unreliable. Translation work happens in remote areas where power is also unreliable and shipping equipment to the site is a gamble.

Vagrant Setup Instructions

I have provided Vagrant machines that will get you up and running with NLP tool sets and some sample data (though not in any of the languages where translation is taking place). Additionally, I will provide one or more "unknown" languages in the sample data to serve as proof-of-concept material for developing NLP tools against poorly documented languages.

Each of the tools can be found in one or more Vagrant machines; setup instructions are below.

Tools:

  • MGIZA: A multithreaded implementation of GIZA, a word alignment tool. This is rolled into the mgiza box.
  • Python's NLTK
  • TensorFlow, for the really smart people who understand neural networks.
  • Word2Vec using Gensim instead of the original.
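
MGIZA descends from GIZA++, which estimates the IBM translation models. As a minimal sketch of what the aligner is doing underneath (a toy IBM Model 1 trainer in pure Python, not MGIZA's actual implementation), the EM loop looks like this:

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Toy IBM Model 1 EM: estimate t(target_word | source_word)
    from a list of (source_tokens, target_tokens) sentence pairs."""
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    # Start from a uniform translation table
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected co-occurrence counts
        total = defaultdict(float)   # normalizer per source word
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(sw, tw)] for sw in src)
                for sw in src:
                    c = t[(sw, tw)] / norm
                    count[(sw, tw)] += c
                    total[sw] += c
        # M-step: renormalize into probabilities
        for (sw, tw), c in count.items():
            t[(sw, tw)] = c / total[sw]
    return t

# Three toy English/Spanish sentence fragments
pairs = [
    (["the", "house"], ["la", "casa"]),
    (["the", "book"], ["el", "libro"]),
    (["a", "house"], ["una", "casa"]),
]
table = ibm_model1(pairs)
```

After a few iterations the probability mass for "house" concentrates on "casa". MGIZA performs the same kind of estimation over full corpora, with the higher IBM models and multithreading.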

I will try to provide any additional tools requested, but the tight turnaround means I might not get to them during Urbana '15. As work on this project continues, I will maintain and support these boxes to the best of my ability. Issues and pull requests are welcome and encouraged.

Setup instructions

The only things you will have to set up on your own are Vagrant and a Vagrant provider (a hypervisor such as VirtualBox or VMware). Platform-specific instructions are on Vagrant's site. Generally speaking, once Vagrant is installed, open a terminal, navigate to the directory containing the Vagrantfile you need, and run vagrant up. A couple of minutes later you have a working VM with a development environment built for you. Once it's up, you can vagrant ssh into it and begin work.

How to Use it

cd /tools
./getcopora.sh /home/vagrant                                                 ## Pulls down the bibles from the cloud
./en-sp-align_words.sh /home/vagrant /home/vagrant/tools/mgiza_configfile    ## Pass the full config file name

The second script prints information about how the program ran and writes the final output to /home/vagrant/copora/src_trg.dict.A3.final.part000.

What the script is doing

When the en-sp-align_words.sh script runs, it does the following:

  • Converts the unbound format into plain strings, producing a file called asv_plain.
  • Runs tokenizer.py, which lowercases the tokens and writes them to a file named after the translation's root, for example asv for the English version or spanish for the Spanish version. This matters because the mgiza script files require a specific naming convention.
  • Runs mkcls to create the vocabulary .classes file, which lists each unique token and the cluster it belongs to. By default the clustering algorithm runs ten times; you can change this with the -n flag, and set the number of clusters with -c (for example, -c80 for 80 clusters).
  • Converts the text into files that mgiza can consume: a .vcb file pairing each unique token with an identifier and its frequency within the document, along with related files such as .vcb.classes.cats.
  • Writes a .snt file containing the entire document with each token replaced by its numeric identifier from the .vcb file.
  • Creates a cooccurrence file (.cooc) containing the cooccurrence matrix, using the /mgiza/snt2cooc binary.

The output file is /copora/src_trg.dict.A3.final.part000. It contains every line from the target bible, followed by the proposed translation.
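
The vocabulary and sentence files described above can be sketched in a few lines of Python (an illustrative approximation of the .vcb/.snt idea, not MGIZA's exact on-disk format; the identifier numbering here is an assumption):

```python
from collections import Counter

def make_vcb(tokens):
    """Build a .vcb-style table: token -> (unique id, frequency).
    Ids start at 2 here, assuming low ids are reserved."""
    freq = Counter(tokens)
    return {tok: (i + 2, n) for i, (tok, n) in enumerate(freq.most_common())}

def to_snt(tokens, vcb):
    """Rewrite the document as a .snt-style stream of numeric ids."""
    return [vcb[tok][0] for tok in tokens]

# Toy run on a lowercased, whitespace-tokenized verse
tokens = "In the beginning God created the heaven and the earth".lower().split()
vcb = make_vcb(tokens)   # e.g. "the" -> (2, 3): id 2, frequency 3
ids = to_snt(tokens, vcb)
```

The cooccurrence step then counts, for each aligned sentence pair, which source and target ids appear together, which is what snt2cooc produces from the two .snt files.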

Contributors

  • bbriggs

Watchers

  • Michael Montgomery