modernmt / modernmt

Neural Adaptive Machine Translation that adapts to context and learns from corrections.

Home Page: http://www.modernmt.eu/

License: Apache License 2.0

Python 17.21% Java 77.07% CMake 0.12% C++ 4.78% Shell 0.17% Roff 0.14% Dockerfile 0.14% ColdFusion 0.21% Cython 0.16%
translation mmt machine-translation machine-learning mt neural-network neural-machine-translation neural

modernmt's Introduction

Simple. Adaptive. Neural.


We think that artificial intelligence is going to be the next big thing in our near future. It will bring humanity to a new era of access to, and organization of, information. Language translation is probably the most complex human task for a machine to learn, but it is also the one with the greatest potential to make the world a single family.

With this project we want to give our contribution to the evolution of machine translation toward singularity. We want to consolidate the current state of the art into a single, easy-to-use product, evolve it, and keep it open to integrate the greatest opportunities in machine intelligence, like deep learning.

To achieve our goals we need a better MT technology that is able to extract more from data, adapt to context and be easy to deploy. We know that the challenge is big, but the reward is potentially so big that we think it is worth trying hard.

About ModernMT

ModernMT is a context-aware, incremental and distributed general-purpose Neural Machine Translation technology based on the Fairseq Transformer model. ModernMT is:

  • Easy to use and scale with respect to domains, data, and users.
  • Trained by pooling all available project/customer data and translation memories in one folder.
  • Queried by providing the sentence to be translated and optionally some context text.

ModernMT's goal is to deliver the quality of multiple custom engines by adapting on the fly to the provided context.

You can find more information on: http://www.modernmt.eu/

Your first translation with ModernMT

Installation

Read INSTALL.md

The distribution includes a small dataset (folder examples/data/train) to train and test translations from English to Italian.

Create an engine

We will now demonstrate how easy it is to train your first engine with ModernMT. Please note, however, that the provided training set is tiny and intended exclusively for this demo. If you wish to train a proper engine, please follow the instructions provided in this guide: Create an engine from scratch.

Creating an engine in ModernMT is this simple:

$ ./mmt create en it examples/data/train/ --train-steps 10000

This command will start a fast training process that lasts approximately 20 minutes; not enough to achieve good translation performance, but enough to demonstrate how it works. Please consider that a real training run will require much more time and parallel data.

Start the engine

$ ./mmt start

Starting engine "default"...OK
Loading models...OK

Engine "default" started successfully

You can try the API with:
	curl "http://localhost:8045/translate?q=world&source=en&target=it&context=computer" | python -mjson.tool
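The same request can be issued programmatically. Below is a minimal Python sketch using only the standard library; the helper names are hypothetical, and it assumes an engine running locally on the default port 8045, as in the curl example:

```python
import json
import urllib.parse
import urllib.request

API_BASE = "http://localhost:8045/translate"  # default ModernMT REST endpoint

def build_translate_url(q, source, target, context=None):
    # Encode query parameters safely (spaces, accents, newlines, etc.)
    params = {"q": q, "source": source, "target": target}
    if context:
        params["context"] = context
    return API_BASE + "?" + urllib.parse.urlencode(params)

def translate(q, source, target, context=None, timeout=30):
    # Call the running engine and return the parsed JSON response
    url = build_translate_url(q, source, target, context)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(build_translate_url("world", "en", "it", context="computer"))
```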

You can check the status of the engine with the status command like this:

$ ./mmt status

[Engine: "default"]
    REST API:   running - 8045/translate
    Cluster:    running - port 5016
    Binary log: running - localhost:9092
    Database:   running - localhost:9042

and finally, you can stop a running engine with the stop command.

Start translating

Let's now use the command-line tool mmt to query the engine with the sentence This is an example:

$ ./mmt translate "this is an example"
ad esempio, è un esempio

Why this translation? An engine trained with so little data, and for so little time, cannot output anything more than gibberish. Follow these instructions to create a proper engine: Create an engine from scratch

Note: You can query ModernMT directly via the REST API. To learn more about how to do it, visit the Translate API page in this project's Wiki.

How to import a TMX file

Importing a TMX file is very simple and fast. We will again use the command-line tool mmt:

$ ./mmt memory import -x /path/to/example.tmx
Importing example... [========================================] 100.0% 00:35
IMPORT SUCCESS
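If you need a small TMX file to try the import with, one can be generated from sentence pairs. The sketch below uses only the Python standard library; the header attribute values are a bare-bones TMX 1.4 skeleton chosen for this example, not taken from ModernMT itself:

```python
import xml.etree.ElementTree as ET

def write_minimal_tmx(path, pairs, src="en", tgt="it"):
    # Build a minimal TMX 1.4 document from (source, target) sentence pairs
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", srclang=src, datatype="plaintext",
                  segtype="sentence", adminlang="en", creationtool="demo",
                  creationtoolversion="1.0", **{"o-tmf": "demo"})
    body = ET.SubElement(tmx, "body")
    for s, t in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, seg_text in ((src, s), (tgt, t)):
            tuv = ET.SubElement(tu, "tuv")
            tuv.set("xml:lang", lang)
            ET.SubElement(tuv, "seg").text = seg_text
    ET.ElementTree(tmx).write(path, encoding="utf-8", xml_declaration=True)
```

The resulting file can then be fed to `./mmt memory import -x`.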

Evaluating quality

How is your engine performing compared to commercial state-of-the-art technologies? Should you use Google Translate or ModernMT given this data?

The evaluate command helps you answer these questions.

During engine training, ModernMT automatically removes a subset of sentences corresponding to 1% of the training set (up to a maximum of 1,200 lines). The evaluate command uses these sentences to compute the BLEU score and the Matecat Post-Editing Score for both the ModernMT and Google Translate engines.
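The stated heldout policy (1% of the training set, capped at 1,200 lines) can be sketched as a one-liner; `heldout_size` is a hypothetical name for illustration, not a ModernMT function:

```python
def heldout_size(train_lines: int, cap: int = 1200) -> int:
    # 1% of the training set, but never more than `cap` lines
    return min(train_lines // 100, cap)
```

For instance, a run reporting "Testing on 980 lines" is consistent with a training set of roughly 98,000 lines.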

With your engine running, just type:

./mmt evaluate --gt-key YOUR_GOOGLE_TRANSLATE_API_KEY

Typical output looks like the following:

============== EVALUATION ==============

Testing on 980 lines:

(1/5) Translating with ModernMT...                               DONE in 1m 27s
(2/5) Translating with Google Translate...                       DONE in 1m 3s
(3/5) Preparing data for scoring...                              DONE in 0s
(4/5) Scoring with Matecat Post-Editing Score...                 DONE in 3s
(5/5) Scoring with BLEU Score...                                 DONE in 0s

=============== RESULTS ================

Matecat Post-Editing Score:
  ModernMT            : 57.2 (Winner)
  Google Translate    : 53.9

BLEU Score:
  ModernMT            : 35.4 (Winner)
  Google Translate    : 33.1

Translation Speed:
  Google Translate    : 0.07s per sentence
  ModernMT            : 0.09s per sentence

If you want to test on a different test-set just type:

./mmt evaluate --gt-key YOUR_GT_API_KEY --path path/to/your/test-set

Note: To run evaluate you need a Google Translate API key and an internet connection, for both the Google Translate API and the Matecat Post-Editing Score API.

What's next?

Create an engine from scratch

By following this README you have learned the basic usage of ModernMT. You are now ready to create an engine with your own data; you can find more info in the Wiki page Create an engine from scratch.

See API Documentation

ModernMT comes with a built-in REST API that allows the user to control every feature of the tool via a simple and powerful interface. You can find the API Documentation in the ModernMT Wiki.

Run ModernMT cluster

You can set up a cluster of ModernMT nodes in order to load-balance translation requests. You can learn more on the Wiki page ModernMT Cluster.

Use advanced configurations

If you need to customize the properties and behaviour of your engines, you can specify advanced settings in their configuration files. You can learn how on the Wiki page Advanced Configurations.

ModernMT Enterprise Edition

ModernMT is free and Open Source, and it welcomes contributions and donations. ModernMT is sponsored by its funding members (Translated, FBK, UEDIN and TAUS) and the European Commission.

ModernMT Enterprise Edition is our cloud solution for professional translators and enterprises. It is proprietary, and it includes an improved adaptation algorithm, "crafted" with months of optimization and fine-tuning of the system. Moreover, our Enterprise Edition comes with top-quality baseline models trained on billions of words of high-quality training data.

In a nutshell ModernMT Enterprise Edition offers:

  • Higher quality. A top-notch adaptation algorithm refined with our inner knowledge of the tool.
  • Designed for intensive datacenter usage. 4x cheaper per MB of text translated.
  • Pre-trained generic and custom models in 60 language pairs, trained on billions of words of premium data.
  • Support for clusters of servers for higher throughput, load balancing and high availability.
  • Support for 71 file formats without format loss (Office, Adobe, Localization, etc.).
  • Enterprise customer support via video conference call, phone and email during business hours (CET), and optionally 24x7.
  • Custom developments billed per hour of work.

For any information please email us at [email protected]

modernmt's People

Contributors

alessandrocattelan, andrea-de-angelis, androssi, cattoni, cidermole, davc0n, davidecaroselli, dependabot[bot], ecly, luca-mastrostefano, marcellofederico, marcotrombetti, nicolabertoldi, patelrajnath, ugermann


modernmt's Issues

improvement of server status

I would suggest improving the output of ./server status in two respects:

  1. adding details for all servers (Context Analyzer, Moses Decoder, REST Interface):
  • ports
  • running/stopped status

similarly to what is output upon start
Starting Context Analyzer on port 56290... DONE
Starting Moses Decoder on port 56291... DONE
Starting REST Interface on port 56292... DONE

  2. adding the URL as well as the port

Out of Memory and/or Disk

I tried to create a few large engines and the aligner crashed.
After some debugging I discovered it was an out-of-memory problem in one case, and a full temporary disk in another.
The software crashed without providing any hint about the problem.

I suggest adding a training step to check hardware requirements and print a warning.
Example text:
Warning: >45GB of RAM required, you have 32GB
Warning: >125GB of storage required, you have 110GB

A hardware requirement calculation is available in INSTALL.md.
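The suggested pre-flight check could look like the sketch below. `check_resources` is a hypothetical helper, the RAM query via os.sysconf is Linux-specific, and the actual thresholds would come from the hardware requirement calculation mentioned above:

```python
import os
import shutil

def check_resources(required_ram_gb, required_disk_gb, path="."):
    # Warn (instead of crashing) when RAM or free disk fall short of
    # the estimated training requirements.
    warnings = []
    ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    disk_gb = shutil.disk_usage(path).free / 1024**3
    if ram_gb < required_ram_gb:
        warnings.append(f"Warning: >{required_ram_gb}GB of RAM required, "
                        f"you have {ram_gb:.0f}GB")
    if disk_gb < required_disk_gb:
        warnings.append(f"Warning: >{required_disk_gb}GB of storage required, "
                        f"you have {disk_gb:.0f}GB")
    return warnings
```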

Context Analyzer Crashes on 1B source words

I created a corpus made of

ls -l ../data/
total 14112548
-rw-r--r-- 1 ubuntu ubuntu  434655470 Jan 11  2013 commocrawl.en
-rw-r--r-- 1 ubuntu ubuntu  500374763 Jan 11  2013 commocrawl.fr
-rw-r--r-- 1 ubuntu ubuntu  301523301 Nov 21  2011 europarl7.en
-rw-r--r-- 1 ubuntu ubuntu  346919801 Nov 21  2011 europarl7.fr
-rw-r--r-- 1 ubuntu ubuntu 2085411017 Oct 12  2011 un.en
-rw-r--r-- 1 ubuntu ubuntu 2427161501 Oct 12  2011 un.fr
-rw-r--r-- 1 ubuntu ubuntu 3789880407 Dec  7  2008 wmt.en
-rw-r--r-- 1 ubuntu ubuntu 4565280284 Dec  7  2008 wmt.fr

After ~3 hours it crashed in this way:

=========== TRAINING STARTED ===========

ENGINE:  1Benfr
CORPORA: /home/ubuntu/data (4 documents)
LANGS:   en > fr

INFO: (1 of 6) Corpora tokenization...                                 DONE (in 10052s)
INFO: (2 of 6) Corpora cleaning...                                     DONE (in 1200s)
INFO: (3 of 6) Context Analyzer training...                            DONE (in 110s)
2016-01-08 00:02:55,699 [ERROR] - Command 'java -cp /home/ubuntu/mmt/build/mmt-0.11.jar eu.modernmt.cli.ContextAnalyzerMain -i /home/ubuntu/mmt/engines/1Benfr/data/context/index -c /home/ubuntu/data' failed with exit code 1
Traceback (most recent call last):
  File "/home/ubuntu/mmt/scripts/engine.py", line 279, in build
    self._analyzer.create_index(self._context_index, original_corpora[0].root, log_file)
  File "/home/ubuntu/mmt/scripts/mt/contextanalysis.py", line 34, in create_index
    shell.execute(command, stdout=log, stderr=log)
  File "/home/ubuntu/mmt/scripts/libs/shell.py", line 55, in execute
    raise ShellError(str_cmd, returncode, stderr_dump)
ShellError: Command 'java -cp /home/ubuntu/mmt/build/mmt-0.11.jar eu.modernmt.cli.ContextAnalyzerMain -i /home/ubuntu/mmt/engines/1Benfr/data/context/index -c /home/ubuntu/data' failed with exit code 1
Traceback (most recent call last):
  File "./mmt", line 370, in <module>
    main()
  File "./mmt", line 346, in main
    main_create(argv[1:])
  File "./mmt", line 329, in main_create
    engine.build(corpora, debug=args.debug, steps=args.training_steps)
  File "/home/ubuntu/mmt/scripts/engine.py", line 279, in build
    self._analyzer.create_index(self._context_index, original_corpora[0].root, log_file)
  File "/home/ubuntu/mmt/scripts/mt/contextanalysis.py", line 34, in create_index
    shell.execute(command, stdout=log, stderr=log)
  File "/home/ubuntu/mmt/scripts/libs/shell.py", line 55, in execute
    raise ShellError(str_cmd, returncode, stderr_dump)
scripts.libs.shell.ShellError: Command 'java -cp /home/ubuntu/mmt/build/mmt-0.11.jar eu.modernmt.cli.ContextAnalyzerMain -i /home/ubuntu/mmt/engines/1Benfr/data/context/index -c /home/ubuntu/data' failed with exit code 1

The log file says this:

ubuntu@ip-10-159-153-37:~/mmt$ cat  engines/1Benfr/logs/build.context.log 
[main] INFO eu.modernmt.context.ContextAnalyzer - Rebuild ContextAnalyzer index...
[main] INFO eu.modernmt.context.lucene.ContextAnalyzerIndex - Adding to index document un.fr
Exception in thread "main" java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=2147483645,endOffset=-2147483647
    at org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl.setOffset(PackedTokenAttributeImpl.java:107)
    at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:208)
    at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:57)
    at org.apache.lucene.analysis.util.ElisionFilter.incrementToken(ElisionFilter.java:52)
    at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:62)
    at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:90)
    at org.apache.lucene.analysis.fr.FrenchLightStemFilter.incrementToken(FrenchLightStemFilter.java:48)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:618)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:359)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:318)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:241)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:465)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1526)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1252)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1234)
    at eu.modernmt.context.lucene.ContextAnalyzerIndex.add(ContextAnalyzerIndex.java:90)
    at eu.modernmt.context.ContextAnalyzer.rebuild(ContextAnalyzer.java:30)
    at eu.modernmt.cli.ContextAnalyzerMain.main(ContextAnalyzerMain.java:36)

mert does not work as expected

I found that on the FBK cluster (Red Hat 4.4.6-4)
the symbol '*' (asterisk) inside a bash script is expanded to the contents of the current directory.

This is avoided if the symbol is surrounded by double quotation marks.

Changes in two lines:
OLD: for source_file in $( find $arg_train_data -type f -name *.$arg_source_lang ); do
NEW: for source_file in $( find $arg_train_data -type f -name "*.$arg_source_lang" ); do

OLD: for target_file in $( find $arg_train_data -type f -name *.$arg_target_lang ); do
NEW: for target_file in $( find $arg_train_data -type f -name "*.$arg_target_lang" ); do

Furthermore, although this is not an issue,
I also changed the command to extract the basename of the file.

Changes in two places:

OLD: filename=$(basename "$source_file")
     filename="${filename%.*}"
NEW: filename=$(basename "$source_file" ".$arg_source_lang")

OLD: filename=$(basename "$target_file")
     filename="${filename%.*}"
NEW: filename=$(basename "$target_file" ".$arg_target_lang")

Nicola

Problem in ContextAnalyzer query by Moses Decoder

Moses decoder is unable to retrieve the context-bias info from the ContextAnalyzer instance on the same machine. Steps to reproduce this bug on mmt.rocks:

  • Move to path /mnt/mvp/v/0.11/content/mmt_ca-0.2 and start monitoring bias server response with command tail -f engines/ca-0.2/logs/moses-decoder.log | grep "SERVER RESPONSE"
  • With another terminal launch this translation request: curl "http://localhost:9102/?text=In+the+%E2%80%9C+View+%E2%80%9D+field+%2C+click+the+drop+down+box+to+select+activity+options+to+view+.&context=What+other+advice+would+you+provide%0AThis+screen+will+tell+you+that+DiaryPRO+is+now+ready+for+a+new+subject+.%0AIt+may+take+up+to+2+business+days+to+process+your+Electronic+Signature+once+it+is+received+by+invivodata+.%0AStrongly+agree%0AV.A.C.+%C2%AE+Therapy+treated+wounds+show+a+rich+vascular+network+compared+to+control+sites+without+foam"
  • You can now observe from the log (on the first terminal) that moses could not get a response from Context Analyzer: SERVER RESPONSE:
  • Launch now the same context text directly to the context analyzer: curl "http://localhost:9100/context?language=en&context=What+other+advice+would+you+provide%0AThis+screen+will+tell+you+that+DiaryPRO+is+now+ready+for+a+new+subject+.%0AIt+may+take+up+to+2+business+days+to+process+your+Electronic+Signature+once+it+is+received+by+invivodata+.%0AStrongly+agree%0AV.A.C.+%C2%AE+Therapy+treated+wounds+show+a+rich+vascular+network+compared+to+control+sites+without+foam"
  • The ContextAnalyzer actually returns this result: {"ab-science-0":0.0011064556892961264,"empty-108":0.0012776250950992107,"empty-407691":0.044811900705099106,"empty-114":0.011597461998462677,"MyMemory_7210ae5d74790f6a99d3-493529":0.029371602460741997,"empty-39821":0.03270958364009857,"sistemi_informativi-7923":0.0025133315939456224,"anonymous-0":0.09060709178447723,"empty-383720":0.0017771938582882285,"MyMemory_0ece2c3cffa58c3f0e24-497689":0.7842276692390442}
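The long context parameter in the requests above is just a URL-encoded multi-line string (spaces become `+`, newlines become `%0A`). A small sketch of how to build such a query value from Python:

```python
from urllib.parse import quote_plus

# Each document line is separated by a newline, which encodes to %0A
context_lines = [
    "What other advice would you provide",
    "This screen will tell you that DiaryPRO is now ready for a new subject .",
]
encoded = quote_plus("\n".join(context_lines))
print(encoded)
```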

API merge - translate and nbest

Enhancement: nbest should be part of the "translate" endpoint.
Bug: nbest currently crashes when a context is passed.

How is it now

http://localhost:8045/translation/nbest?q=I+like+it&nbest=10

How it should be

http://localhost:8045/translate?q=+I+like+it+&context=president&context=computer&n=10

The nbest list should be added as an extra array in the results, without changing the standard structure.

The motivations:

  • We want to keep the number of endpoints small to make it easy for developers to join.
  • nbest is just a specific case of translate, not a different endpoint.
  • Less maintenance and fewer bugs (in fact, the nbest API was already misaligned with translate).

Port Already in Use will Crash

8000 and 5000, being round numbers, are used by many services that could be pre-installed on the machine (see Domenico's problem: he had all 3 ports in use).

Today, when this happens, MMT crashes.

I suggest we do 2 different things:

  1. Give an error message with the suggestion to change the port.

  2. To maximize the UX, minimize the problem by changing the defaults:
    8000 -> 8045
    5000 -> 5016
    5001 -> 5017

Reference:
http://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml?&page=107

Possible Tokenizer Memory Leak

The tokenizer's Java process uses 15GB of RAM and keeps growing...
Data: 1B words, en-fr WMT task (News + Europarl + UN + CommonCrawl)

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                       
 77162 ubuntu    20   0 14.031g 159668  14388 S  85.9  0.5  37:44.91 java -cp /home/ubuntu/mmt/build/mmt-0.11.jar -Dmmt.tokenizer.models.path=/home/ubuntu/mmt/opt/tokenizer/models eu.modernmt.cli.TokenizerMain +

Is it normal that the dispatcher uses so much RAM?

Individual startup and shutdown of servers

It would be helpful if the server script allowed start-up and shutdown of individual servers. For example, for debugging I may want to have only the context analyser running and start and monitor the moses server myself.

Bidirectional server

The same server should be able to translate both from source to target language and from the target to the source language, just specifying the direction for each query.

Filename bug

MMT is not able to handle filenames containing whitespace or starting with a dash:
"this is a test.txt" or "-this_is_a_test.txt"

missing jsvc

In FBK, probably because our Java installation is not standard,
we had to use a specific version of jsvc
(provided by Davide to Roldano),
copy it into "bin/context-analyzer/",
and change the "server" script as follows:

OLD:
jsvc=$(which jsvc)

NEW:
jsvc=${bin_dir}/context-analyzer/jsvc

impossible download of a release using wget

I don't know if it is a real issue,
but I noticed that from the release page https://github.com/ModernMT/MMT/releases,

if I copy the link for the download of the source code,
https://github.com/ModernMT/MMT/archive/v0.10-alpha.tar.gz

I fail to download the tar file using wget from a Linux shell.
Below is the error I get.

There is no problem, instead, if I just click on the link on the website:
it downloads the file.

$> wget http://github.com/ModernMT/MMT/archive/v0.10-alpha.zip
--2015-08-14 09:18:44-- http://github.com/ModernMT/MMT/archive/v0.10-alpha.zip
Resolving github.com... 192.30.252.129
Connecting to github.com|192.30.252.129|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/ModernMT/MMT/archive/v0.10-alpha.zip [following]
--2015-08-14 09:18:45-- https://github.com/ModernMT/MMT/archive/v0.10-alpha.zip
Connecting to github.com|192.30.252.129|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2015-08-14 09:18:45 ERROR 404: Not Found.

Improvement for tokenization

Not sure if this is the right place to post the issue

I found that the tokenizer
bin/tokenizer/tokenize.sh

could be modified as follows:

OLD:
perl ${script_dir}/detokenizer.perl -b -l $lang | perl ${script_dir}/tokenizer.perl -b -X -l $lang | perl ${script_dir}/deescape-special-chars.perl -b

NEW:
perl ${script_dir}/detokenizer.perl -b -l $lang | perl ${script_dir}/tokenizer.perl -b -X -l $lang -no-escape

It may be faster

A quick investigation shows that the two outcomes are identical (at least for EP, IBM, and MS) for both English and Italian

Problem with "scripts/EncodeContext.jar" with small files

The command to generate a file with context
java -jar ./scripts/EncodeContext.jar context_num_lines in_file out_file
gives problems when in_file is small.

Namely, a problem occurs when the number of lines of in_file is less than or equal to (context_num_lines + 1).

In details:

  • if the number of lines of in_file == context_num_lines + 1, then out_file contains only one line;
  • if the number of lines of in_file < context_num_lines + 1, then Java fails with a NullPointerException.

This problem was found with some files of the recent data set "data-100d-62Mw": files with 6 or fewer lines cannot be enriched with context.
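The boundary reported above can be summarized with a simple line-count model. This is a hypothetical reconstruction of the observed behaviour for illustration, not the EncodeContext code:

```python
def encode_context_output_lines(in_lines: int, context_num_lines: int) -> int:
    # Model of the reported behaviour: each output line needs
    # context_num_lines preceding lines as context, so only
    # in_lines - context_num_lines output lines can be produced.
    if in_lines < context_num_lines + 1:
        # Reported behaviour: java fails with a NullPointerException here
        raise ValueError("in_file too small for the requested context size")
    return in_lines - context_num_lines
```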

Better Process Management

Some of the MMT core processes (server and translation) remain up when the task is closed or killed.

Example 1
Start the engine, then mv the engine directory: you cannot stop the server anymore, and you have to kill the processes one by one.

Example 2
You request a translation, type Ctrl-C to stop it, and the process stays alive.

0.11.1 SNAPSHOT Fail to Run Example

ubuntu@ip-172-30-0-162:~/mmt$ ./mmt create en it examples/data/train/

=========== TRAINING STARTED ===========

ENGINE:  default
CORPORA: /home/ubuntu/mmt/examples/data/train (3 documents)
LANGS:   en > it

INFO: (1 of 6) Corpora tokenization...                                 DONE (in 5s)
INFO: (2 of 6) Corpora cleaning...                                     DONE (in 0s)
INFO: (3 of 6) Context Analyzer training...                            DONE (in 0s)
INFO: (4 of 6) Language Model training...                              DONE (in 1s)
2016-01-14 23:49:40,164 [ERROR] - Command '/home/ubuntu/mmt/opt/bin/irstlm-adaptivelm-v0.6/bin/compile-lm /home/ubuntu/mmt/runtime/default/training/tmp/lm/arpa /home/ubuntu/mmt/engines/default/data/lm/europarl.alm' failed with exit code -6
Traceback (most recent call last):
  File "/home/ubuntu/mmt/scripts/engine.py", line 258, in build
    self._engine.lm.train(tokenized_corpora, self._engine.target_lang, working_dir, log_file)
  File "/home/ubuntu/mmt/scripts/mt/lm.py", line 118, in train
    self._train_lm(cfile, os.path.join(models_folder, lm), working_dir, log)
  File "/home/ubuntu/mmt/scripts/mt/lm.py", line 146, in _train_lm
    shell.execute(command, stderr=log)
  File "/home/ubuntu/mmt/scripts/libs/shell.py", line 55, in execute
    raise ShellError(str_cmd, returncode, stderr_dump)
ShellError: Command '/home/ubuntu/mmt/opt/bin/irstlm-adaptivelm-v0.6/bin/compile-lm /home/ubuntu/mmt/runtime/default/training/tmp/lm/arpa /home/ubuntu/mmt/engines/default/data/lm/europarl.alm' failed with exit code -6
Traceback (most recent call last):
  File "./mmt", line 387, in <module>
    main()
  File "./mmt", line 363, in main
    main_create(argv[1:])
  File "./mmt", line 346, in main_create
    engine.builder.build(corpora, debug=args.debug, steps=args.training_steps)
  File "/home/ubuntu/mmt/scripts/engine.py", line 258, in build
    self._engine.lm.train(tokenized_corpora, self._engine.target_lang, working_dir, log_file)
  File "/home/ubuntu/mmt/scripts/mt/lm.py", line 118, in train
    self._train_lm(cfile, os.path.join(models_folder, lm), working_dir, log)
  File "/home/ubuntu/mmt/scripts/mt/lm.py", line 146, in _train_lm
    shell.execute(command, stderr=log)
  File "/home/ubuntu/mmt/scripts/libs/shell.py", line 55, in execute
    raise ShellError(str_cmd, returncode, stderr_dump)
scripts.libs.shell.ShellError: Command '/home/ubuntu/mmt/opt/bin/irstlm-adaptivelm-v0.6/bin/compile-lm /home/ubuntu/mmt/runtime/default/training/tmp/lm/arpa /home/ubuntu/mmt/engines/default/data/lm/europarl.alm' failed with exit code -6

print hostname:port in server log files

It would be useful to have in server log files (e.g. context-analyzer.err) not only the port but also the hostname, as in "hostname:port".
Currently, in a distributed environment, a client with only the port is missing the hostname needed to connect to a server.
(By the way, the current string contains a typo: "Staring server" instead of "Starting server".)

LM Training Speed

Similar to #29, language model training can be parallelized much more.
Using only 19% of CPU on a 16-core machine, 0% iowait.
Using only 9% of CPU on a 32-core machine, 0% iowait.

How do we choose the number of parallel processes for build-sublm.pl and gzip?

see below

top - 18:39:28 up 19:38,  1 user,  load average: 5.52, 22.65, 27.02
Tasks: 556 total,   4 running, 552 sleeping,   0 stopped,   0 zombie
%Cpu(s):  8.7 us,  0.1 sy,  0.0 ni, 91.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  61836572 total, 47466540 used, 14370032 free,    19964 buffers
KiB Swap:        0 total,        0 used,        0 free. 45557656 cached Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                       
 74022 ubuntu    20   0   53676  36872   2060 R 100.0  0.1  11:51.93 build-sublm.pl                                                                                                                                
 74181 ubuntu    20   0   22412   5652   2060 R  99.9  0.0  11:24.79 build-sublm.pl                                                                                                                                
 75204 ubuntu    20   0    4596    812    416 S  35.2  0.0   0:40.14 gzip                                                                                                                                          
 75210 ubuntu    20   0    4596    808    416 S  33.2  0.0   0:29.85 gzip                                                                                                                                          
 75211 ubuntu    20   0    4596    808    416 R  27.9  0.0   0:24.36 gzip                                                                                                                                          
 75205 ubuntu    20   0    4596    808    416 S  17.3  0.0   0:19.86 gzip                                                                                                                                          
 75201 ubuntu    20   0    4728    680    488 S   7.0  0.0   0:07.77 gzip                                                                                                                                          
 75206 ubuntu    20   0    4728    680    488 S   6.3  0.0   0:05.37 gzip                                                                                                                                          
 75207 ubuntu    20   0    4728    684    488 S   4.0  0.0   0:05.85 gzip                                                                                                                                          
 75200 ubuntu    20   0    4728    684    488 S   3.3  0.0   0:04.20 gzip                                                                                                                                          
   478 root      20   0       0      0      0 S   0.3  0.0   0:20.51 kworker

create-engine does not work as expected

I found that on the FBK cluster (Red Hat 4.4.6-4)
the symbol '*' (asterisk) inside a bash script is expanded to the contents of the current directory.

This is avoided if the symbol is surrounded by double quotation marks.

Changes in three lines:
OLD: for source_file in $( find $arg_train_data -type f -name *.$arg_source_lang ); do
NEW: for source_file in $( find $arg_train_data -type f -name "*.$arg_source_lang" ); do

OLD: for target_file in $( find $arg_train_data -type f -name *.$arg_target_lang ); do
NEW: for target_file in $( find $arg_train_data -type f -name "*.$arg_target_lang" ); do

OLD: for source_file in $( find $tokenizer_out -type f -name *.$arg_source_lang ); do
NEW: for source_file in $( find $tokenizer_out -type f -name "*.$arg_source_lang" ); do

Furthermore, although this is not an issue,
I also changed the command to extract the basename of the file.

Changes in three places:

OLD: filename=$(basename "$source_file")
     filename="${filename%.*}"
NEW: filename=$(basename "$source_file" ".$arg_source_lang")

OLD: filename=$(basename "$target_file")
     filename="${filename%.*}"
NEW: filename=$(basename "$target_file" ".$arg_target_lang")

OLD: filename=$(basename "$source_file")
     filename="${filename%.*}"
NEW: filename=$(basename "$source_file" ".$arg_source_lang")

Nicola

Tokenizers Speed

New parallel tokenization is working great (>4x faster), but it could be up to 12x faster with a better prediction of how many processes to spawn.

Today it may be
processes = num_cores / 32
but the right number could be
processes = num_cores * 2

With a 1B words model it only consumes 30% of CPU with still 0% iowait.
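The reporter's sizing proposal can be sketched as below; `tokenizer_workers` is a hypothetical name, and the oversubscription factor of 2 is the issue's suggestion, not a measured optimum:

```python
import os

def tokenizer_workers(oversubscribe: int = 2) -> int:
    # Proposed sizing: num_cores * 2 instead of the current num_cores / 32
    return (os.cpu_count() or 1) * oversubscribe
```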

context-analyzer server fails at startup

In the new MVP (0.10) the context-analyzer fails at startup with the following error message:

Cannot find daemon loader org/apache/commons/daemon/support/DaemonLoader
Service exit with a return value of 1

Maybe some class is missing in the context-analyzer.jar file?

Non-Amazon Ubuntu 14.04.2 LTS - Java Error

I got the original ISO of Ubuntu 14.04 LTS.
I installed it on a VirtualBox VM.
I followed the instructions in INSTALL.md (no errors).

It trains the model, but it crashes on translation.

See screenshot:
schermata 2016-01-10 alle 00 04 24

Debugging quality

@ugermann @davidecaroselli @nicolabertoldi @mfederico

I have many indicators that something is going wrong somewhere.
I was not able to spot the problem, so I am reporting my findings and hoping in your analytical skills.

I have trained an engine with only CommonCrawl data from the WMT15 task, so there is just one domain.

By design, I expect these 3 calls to return the same output, but they do not.

Context Passed

ubuntu@ip-172-30-0-162:~/mmt$ curl "http://localhost:8045/translate?q=I+like+it&context=president" | python -mjson.tool
{
    "context": [
        {
            "id": "commocrawl",
            "score": 0.017159795
        }
    ],
    "translation": "Je comme"
}

Context Not Passed

curl "http://localhost:8045/translate?q=I+like+it&context=" | python -mjson.tool
{
    "translation": "J'aime il"
}

Impossible Context

curl "http://localhost:8045/translate?q=I+like+it&context=XXYYSSDDSSWW" | python -mjson.tool
{
    "context": [],
    "translation": "J'aime il"
}

Critical bug in TUNING process

During the execution of the MERT algorithm, the feature weights are not updated. You can verify the bug by checking the diff between two different iterations of MERT:

diff runN.out runN+1.out

you will notice that the two files are identical.

Corpus Clean Crash

I was training on the news v10 corpus from the WMT task.

I attached the exact sentence that makes MMT crash.
test.zip

And I got this error:

./mmt create en fr ../test/ --debug

=========== TRAINING STARTED ===========

ENGINE:  default
CORPORA: /home/ubuntu/test (1 documents)
LANGS:   en > fr

INFO: (1 of 6) Corpora tokenization...                                2016-01-09 10:34:12,207 [DEBUG] - Shell exec: java -cp /home/ubuntu/mmt/build/mmt-0.11.jar -Dmmt.tokenizer.models.path=/home/ubuntu/mmt/opt/tokenizer/models eu.modernmt.cli.TokenizerMain en
2016-01-09 10:34:12,729 [DEBUG] - Shell exec: java -cp /home/ubuntu/mmt/build/mmt-0.11.jar -Dmmt.tokenizer.models.path=/home/ubuntu/mmt/opt/tokenizer/models eu.modernmt.cli.TokenizerMain fr
 DONE (in 1s)
INFO: (2 of 6) Corpora cleaning...                                    2016-01-09 10:34:13,264 [DEBUG] - Shell exec: perl /home/ubuntu/mmt/opt/bin/cleaner-mosesofficial/clean-corpus-n-ratio.perl -ratio 3 /home/ubuntu/mmt/engines/default/temp/tokenizer/news10 en fr /home/ubuntu/mmt/engines/default/temp/cleaner/news10 1 80
 DONE (in 0s)
2016-01-09 10:34:13,281 [ERROR] - Command 'perl /home/ubuntu/mmt/opt/bin/cleaner-mosesofficial/clean-corpus-n-ratio.perl -ratio 3 /home/ubuntu/mmt/engines/default/temp/tokenizer/news10 en fr /home/ubuntu/mmt/engines/default/temp/cleaner/news10 1 80' failed with exit code 255
Traceback (most recent call last):
  File "/home/ubuntu/mmt/scripts/engine.py", line 273, in build
    cleaned_corpora = self._cleaner.batch_clean(tokenized_corpora, cleaner_output)
  File "/home/ubuntu/mmt/scripts/mt/processing.py", line 45, in batch_clean
    [(corpus, ParallelCorpus(corpus.name, dest_folder, corpus.langs), langs) for corpus in corpora])
  File "/home/ubuntu/mmt/scripts/mt/processing.py", line 20, in _pool_exec
    return [job.get() for job in aync_jobs]
  File "/home/ubuntu/mmt/scripts/libs/multithread.py", line 289, in process
    result = self._func(*self._args, **self._kwds)
  File "/home/ubuntu/mmt/scripts/mt/processing.py", line 57, in clean_corpus
    shell.execute(command, stdout=shell.DEVNULL, stderr=shell.DEVNULL)
  File "/home/ubuntu/mmt/scripts/libs/shell.py", line 55, in execute
    raise ShellError(str_cmd, returncode, stderr_dump)
ShellError: Command 'perl /home/ubuntu/mmt/opt/bin/cleaner-mosesofficial/clean-corpus-n-ratio.perl -ratio 3 /home/ubuntu/mmt/engines/default/temp/tokenizer/news10 en fr /home/ubuntu/mmt/engines/default/temp/cleaner/news10 1 80' failed with exit code 255
Traceback (most recent call last):
  File "./mmt", line 370, in <module>
    main()
  File "./mmt", line 346, in main
    main_create(argv[1:])
  File "./mmt", line 329, in main_create
    engine.build(corpora, debug=args.debug, steps=args.training_steps)
  File "/home/ubuntu/mmt/scripts/engine.py", line 273, in build
    cleaned_corpora = self._cleaner.batch_clean(tokenized_corpora, cleaner_output)
  File "/home/ubuntu/mmt/scripts/mt/processing.py", line 45, in batch_clean
    [(corpus, ParallelCorpus(corpus.name, dest_folder, corpus.langs), langs) for corpus in corpora])
  File "/home/ubuntu/mmt/scripts/mt/processing.py", line 20, in _pool_exec
    return [job.get() for job in aync_jobs]
  File "/home/ubuntu/mmt/scripts/libs/multithread.py", line 289, in process
    result = self._func(*self._args, **self._kwds)
  File "/home/ubuntu/mmt/scripts/mt/processing.py", line 57, in clean_corpus
    shell.execute(command, stdout=shell.DEVNULL, stderr=shell.DEVNULL)
  File "/home/ubuntu/mmt/scripts/libs/shell.py", line 55, in execute
    raise ShellError(str_cmd, returncode, stderr_dump)
scripts.libs.shell.ShellError: Command 'perl /home/ubuntu/mmt/opt/bin/cleaner-mosesofficial/clean-corpus-n-ratio.perl -ratio 3 /home/ubuntu/mmt/engines/default/temp/tokenizer/news10 en fr /home/ubuntu/mmt/engines/default/temp/cleaner/news10 1 80' failed with exit code 255

Model Creation Crash - 2B words corpus (17GB)

tail -n 10000 -f build.tm.log

INITIAL PASS 
.................................................. [50000]
.................................................. [100000]
.................................................. [150000]
--- continue
.................................................. [16750000]
.................................................. [16800000]
.................................................. [16850000]
.................................................. [16900000]
..................................Error in line 16934554
The third objective is urban and rural development, within the scope of a balanced territorial policy.

Data available on Amazon (MMT High Performance instance, requires Translated pem)

missing execute permission

the following script has no execute permission:
scripts/getfreeports.py

hence the "server" script fails at line 70.

After enabling the permission, this step works properly.
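The fix itself is a one-liner (`chmod +x scripts/getfreeports.py` in the repository); this sketch verifies the effect on a temporary file instead of the real path:

```shell
# Grant execute permission and confirm it took effect
# (a temporary file stands in for scripts/getfreeports.py).
f=$(mktemp)
chmod +x "$f"
if test -x "$f"; then result="executable"; fi
echo "$result"   # executable
rm -f "$f"
```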

Ubuntu 15.10 - Python Import Error

I created a new Ubuntu 15.10 using VirtualBox.

I was able to install Java 8 using the instructions provided.

# ./mmt create en it examples/data/train

It reports an error at line 9 of the mmt script:

ImportError: No module named requests.

Draft release crashes

It took me some time to update the Java release to version 8 and to tell my computer to actually use it with

sudo update-alternatives --config java

Maybe some further instructions would be helpful.

Then, while running

./mmt create en it examples/data/train
the command crashes with the following message:

=========== TRAINING STARTED ===========

ENGINE: default
CORPORA: /home/marcello/Progetti/mmt/examples/data/train (3 documents)
LANGS: en > it

INFO: (1 of 6) Corpora tokenization... DONE (in 22s)
INFO: (2 of 6) Corpora cleaning... DONE (in 0s)
INFO: (3 of 6) Context Analyzer training... DONE (in 2s)
INFO: (4 of 6) Language Model training... DONE (in 0s)
2016-01-08 00:52:19,721 [ERROR] - [Errno 8] Exec format error
Traceback (most recent call last):
  File "/home/marcello/Progetti/mmt/scripts/engine.py", line 286, in build
    self._lm.train(tokenized_corpora, self.target_lang, working_dir, log_file)
  File "/home/marcello/Progetti/mmt/scripts/mt/lm.py", line 117, in train
    self._train_lm(file, os.path.join(models_folder, lm), working_dir, log)
  File "/home/marcello/Progetti/mmt/scripts/mt/lm.py", line 144, in _train_lm
    shell.execute(command, stderr=log)
  File "/home/marcello/Progetti/mmt/scripts/libs/shell.py", line 42, in execute
    process = subprocess.Popen(cmd, stdin=stdin, stdout=stdout, stderr=stderr, shell=(True if isinstance(cmd, basestring) else False))
  File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
    raise child_exception
OSError: [Errno 8] Exec format error
Traceback (most recent call last):
  File "./mmt", line 370, in <module>
    main()
  File "./mmt", line 346, in main
    main_create(argv[1:])
  File "./mmt", line 329, in main_create
    engine.build(corpora, debug=args.debug, steps=args.training_steps)
  File "/home/marcello/Progetti/mmt/scripts/engine.py", line 286, in build
    self._lm.train(tokenized_corpora, self.target_lang, working_dir, log_file)
  File "/home/marcello/Progetti/mmt/scripts/mt/lm.py", line 117, in train
    self._train_lm(file, os.path.join(models_folder, lm), working_dir, log)
  File "/home/marcello/Progetti/mmt/scripts/mt/lm.py", line 144, in _train_lm
    shell.execute(command, stderr=log)
  File "/home/marcello/Progetti/mmt/scripts/libs/shell.py", line 42, in execute
    process = subprocess.Popen(cmd, stdin=stdin, stdout=stdout, stderr=stderr, shell=(True if isinstance(cmd, basestring) else False))
  File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
    raise child_exception
OSError: [Errno 8] Exec format error
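Errno 8 usually means the kernel refused to exec the file: either a script with no shebang line or a binary built for a different architecture. A quick way to see what the kernel sees (paths are illustrative, not the real LM binary):

```shell
# A script without a "#!/bin/sh" shebang cannot be exec'd directly by
# subprocess.Popen; the file(1) tool shows it is plain text, not an
# executable the kernel recognizes.
f=$(mktemp)
printf 'echo hello\n' > "$f"   # note: no shebang line
chmod +x "$f"
out=$(file -b "$f")
echo "$out"
rm -f "$f"
```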

Possible Bug In Context and/or Suffix

Using MMT 0.10, if I don't provide a context the translation quality gets worse, but it should not.

With a fake context the translation is correct:

./translate "I often go to school"  "thisisafakeword"
je vais souvent à école 

With no context (which should be equivalent to a fake context), the translation is wrong:

./translate "I often go to school" 
Je vont souvent à l' école 

The training set is made of 2 domains: one with WMT data, the other a single word ('impossibleword'), so the context match for that domain is always 0.

-rw-rw-r-- 1 ubuntu ubuntu         13 Jan  2 21:15 app.en
-rw-rw-r-- 1 ubuntu ubuntu          8 Jan  2 21:15 app.fr
-rw-r--r-- 1 ubuntu ubuntu 3789880407 Jan  2 22:05 wmt.en
-rw-r--r-- 1 ubuntu ubuntu 4565280284 Jan  2 22:05 wmt.fr

If it is not a bug, then we should find a different sampling algorithm for the suffix.

@ugermann I am available to discuss sampling algorithms that replicate the way professional translators pick between alternatives.

@davidecaroselli can you add this kind of test in the integration tests?

"./mmt evaluate" and "./mmt tune" enhancement

Open for discussion:

As per discussion, we are missing the model evaluation functionality.

Here is the proposal:

  • When training, up to 1000 sentences will be removed from the corpus to create the dev set (500) and test set (500) in the folder 'engines/yourmodel/data/[dev | test]'. The size is min(1000, 1% of the corpus sentences).
  • Sentences should be extracted without an extra rewrite of the file.
  • Extraction positions should be spread across the corpus and fixed given the corpus. In short: two trainings on the same data will produce the same output.
  • Duplication management: by default we don't dedup test|dev against the corpus. If there are 2 equal sentences in the corpus, the test|dev set may contain a sentence that is also in the corpus. There should be an option to avoid duplicates.
  • Once the data is in the engines dir, it can be replaced by data provided by the user.
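The deterministic-positions requirement can be sketched with a simple modulo rule (take every k-th line); the corpus and k here are toy values, not the proposed implementation:

```shell
# Build a toy 100-line corpus.
corpus=$(mktemp)
seq 1 100 > "$corpus"

# Take every k-th line: positions are spread across the corpus and
# depend only on the corpus itself, so two runs on the same data
# yield the same held-out set.
k=10
heldout=$(awk -v k="$k" 'NR % k == 0' "$corpus")
n=$(printf '%s\n' "$heldout" | wc -l)
echo "$n"   # 10

rm -f "$corpus"
```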

For the user, at this point it should be as easy as:
Tuning
./mmt tune
Evaluating
./mmt evaluate

Auto Tuning, Self Evaluation and Time to complete

It would be great if MMT could provide:
1 - Tuning during training as an optional parameter.
2 - An estimate of the engine quality at the end of training.
3 - A training progress bar (time to complete the training).

Assuming a real-world scenario where a company trains an engine on their TM to translate similar future content.

@mfederico
I know how to produce a good estimate for 3.
Do you think 1 and 2 are feasible?
E.g. in this scenario, does it make sense to remove 2 blocks of ~1K random sentences to do a MERT and BLEU estimation? Or is there a better way to do it?

non standard installation for Java

if Java is not installed in the standard location, the server script fails.
At FBK we changed the script as follows:

OLD: java_home=$(dirname $(dirname $(readlink -e /usr/bin/javac)))

NEW: java_home=$(dirname $(dirname $(readlink -e ${JAVA_HOME}/bin/javac)))
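A combined lookup that prefers JAVA_HOME and falls back to the system javac could look like the sketch below. It is exercised here against a fake JDK tree so it runs anywhere; the real script would drop the scaffolding:

```shell
# Fake JDK layout, only to exercise the lookup logic.
fake=$(mktemp -d)
mkdir -p "$fake/jdk/bin"
touch "$fake/jdk/bin/javac" && chmod +x "$fake/jdk/bin/javac"
JAVA_HOME="$fake/jdk"

# Prefer JAVA_HOME when it points at a usable javac,
# otherwise fall back to the standard /usr/bin/javac.
if [ -n "$JAVA_HOME" ] && [ -x "$JAVA_HOME/bin/javac" ]; then
    java_home=$(dirname "$(dirname "$(readlink -e "$JAVA_HOME/bin/javac")")")
else
    java_home=$(dirname "$(dirname "$(readlink -e /usr/bin/javac)")")
fi
echo "${java_home##*/}"   # jdk

rm -rf "$fake"
```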

Problem with "scripts/EncodeContext.jar" with small files

related to issue #15

There is still a problem:

when the input document contains only one sentence, the process fails.

Assuming the "input" file contains one line, this is the error message I got:

java -jar EncodeContext.jar 5 input output
Exception in thread "main" java.util.NoSuchElementException: queue is empty
at CircularFifoQueue.remove(CircularFifoQueue.java:177)
at ContextLineIterator.next(ContextLineIterator.java:47)
at EncodeContext.main(EncodeContext.java:24)

Manually Changing Parameters

In 0.11 I can no longer find where to change the moses.ini parameters.

These files do not include the weights:

# find engines/|grep ini
engines/default/runtime/moses.ini
engines/default/engine.ini
engines/default/data/moses.ini

@davidecaroselli where is now the file?

If the file was removed on purpose, I think we should restore the ability to change them manually.
