helsinki-nlp / opus-mt Goto Github PK

Open neural machine translation models and web services

License: MIT License

Python 50.54% Makefile 38.68% Shell 5.56% Dockerfile 1.43% HTML 1.01% CSS 0.39% JavaScript 1.02% Perl 1.38%

natural-language-processing machine-translation neural-machine-translation language-technology machine-learning translation-service translation-interface language-translation-service

opus-mt's Introduction

Tools and resources for open translation services

based on Marian-NMT
trained on OPUS data using OPUS-MT-train (New: leaderboard)
mainly SentencePiece-based segmentation
mostly trained with guided alignment based on eflomal wordalignments
pre-trained downloadable translation models (matrix view), CC-BY 4.0 license
more freely available translation models from the Tatoeba translation challenge, CC-BY 4.0 license
543 live demo APIs of language variants available at Tiyaro.ai. For example, an English to German finetuned translator

This repository includes two setups:

Setup 1: a Tornado-based web application providing a web UI and API to work with multiple language pairs (developed by Santhosh Thottingal and his team at the Wikimedia Foundation)
Setup 2: a simple websocket service setup with some experimental API extensions

There are also scripts for training models, but those are currently only useful in the computing environment used by the University of Helsinki and CSC as the IT service provider.

Please cite the following paper if you use OPUS-MT software and models:

@InProceedings{TiedemannThottingal:EAMT2020,
  author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
  title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
  booktitle = {Proceedings of the 22nd Annual Conferenec of the European Association for Machine Translation (EAMT)},
  year = {2020},
  address = {Lisbon, Portugal}
 }

Installation of the Tornado-based Web-App

Download the latest version from github:

git clone https://github.com/Helsinki-NLP/Opus-MT.git

Option 1: Manual setup

Install Marian MT. Follow the documentation at https://marian-nmt.github.io/docs/ (don't forget to include the cmake option for compiling the server binary -DCOMPILE_SERVER=ON) After the installation, marian-server is expected to be present in path. If not, place it in /usr/local/bin

Install pre-requisites. Using a virtual environment is recommended.

pip install -r requirements.txt

Download the translation models from https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/models and place them in models directory.

Then edit the services.json to point to those models.

And start the webserver.

python server.py

By default, it will use port 8888. Launch your browser to localhost:8888 to get the web interface. The languages configured in services.json will be available.

Option 2: Using Docker

docker-compose up

docker build . -t opus-mt
docker run -p 8888:8888 opus-mt:latest

And launch your browser to localhost:8888

Option 2.1: Using Docker with CUDA GPU

docker build -f Dockerfile.gpu . -t opus-mt-gpu
nvidia-docker run -p 8888:8888 opus-mt-gpu:latest

And launch your browser to localhost:8888

Configuration

The server.py program accepts a configuration file in json format. By default it try to use services.json in the current directory. But you can give a custom one using -c flag.

An example configuration file looks like this:

{
    "en": {
        "es": {
            "configuration": "./models/en-es/decoder.yml",
            "host": "localhost",
            "port": "10001"
        },
        "fi": {
            "configuration": "./models/en-fi/decoder.yml",
            "host": "localhost",
            "port": "10002"
        },
    }
}

This example configuration can provide MT service for en->es and en->fi language pairs.

configuration points to a yaml file containing the decoder configuration usable by marian-server. If this value is not provided, Opus-MT will assume that the service is already running in a remote host and post as given in other options. If value is provided, a new subprocess will be created using marian-server
host: The host where the server is running.
port: The port to be listen for marian-server

Installation of a websocket service on Ubuntu

There is another option of setting up translation services using WebSockets and Linux services. Detailed information is available from doc/WebSocketServer.md.

Public MT models

We store public models (CC-BY 4.0 License) at https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/models They should all be compatible with the OPUS-MT services, and you can install them by specifying the language pair. The installation script takes the latest model in that directory. For additional customisation you need to adjust the installation procedures (in the Makefile or elsewhere).

There are also development versions of models, which are often a bit more experimental and of low quality. But there are additional language pairs and they can be downloaded from https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/work-spm/models

Train MT models

There is a Makefile for training new models from OPUS data in the Opus-MT-train repository, but this is heavily customized for the work environment at CSC and the University of Helsinki projects. This will (hopefully) be more generic in the future to be able to run in different environments and setups as well.

Known issues

most automatic evaluations are made on simple and short sentences from the Tatoeba data collection; those scores will be too optimistic when running the models with other more realistic data sets
Some (older) test results are not reliable as they use software localisation data (namely GNOME system messages) with a large overlap with other localisation data (i.e. Ubuntu system messages) that are included in the training data
All current models are trained without filtering, data augmentation (like backfanslation) and domain adaptation and other optimisation procedures; there is no quality control besides of the automatic evaluation based on automatically selected test sets; for some language pairs there are at least also benchmark scores from official WMT test sets
Most models are trained with a maximum of 72 training hours on 1 or 4 GPUs; not all of them converged before this time limit
Validation and early stopping is based on automatically selected validation data, often from Tatoeba; the validation data is not representative for many applications

To-Do and wish list

more languages and language pairs
better and more multilingual models
optimize translation performance
add backtranslation data
domain-specific models
GPU enabled container
dockerized fine-tuning
document-level models
load-balancing and other service optimisations
public MT service network
feedback loop and personalisation

Links and related work

OPUS-translator: implementation of a simple on-line translation interface
OPUS-CAT: an implementation of an NMT plugin for Trados Studio that can run OPUS-MT models
fiskmö: a project on the devlopment of resources and tools for translating between Finnish and Swedish
The Tatoeba MT Challenge with lots of pre-trained NMT models
The NMT map that plots the status of Tatoeba NMT models on a map
The OPUS-MT leaderboard
pre-trained multilingual models trained on OPUS-100 using the zero toolkit

Acknowledgements

The work is supported by the European Language Grid as pilot project 2866, by the FoTran project, funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 771113), and the MeMAD project, funded by the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No 780069. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland.

opus-mt's People

Contributors

Stargazers

Watchers

opus-mt's Issues

Arabic Diacritics "the meaning is lost in translation"

Opus-MT shows great potential with the Arabic language, though i have noticed an issue with the Arabic model caused by the removal of the Arabic Diacritics during preprocessing.
The Arabic Diacritics actually contains various information such as "who", "gender", "command", "time", and much more etc......
Removing the diacritics whether before training or before prediction actually removes lots of information and meaning from the sentence, and can create conflicts between the feature representations or even create destruction of meaning.

who example:
ذهبتُ إلى المدرسة
i went to the school

who+gender example:
ذهبتْ إلى المدرسة
she went to the school

command+gender "male"+ time "future" example:
إفتعِل شيئاً
you do something

gender "male"+ time "past" example:
إفتعَلَ شيئاً
he did something

How does the removal of the Diacritics create conflicts between the feature representations or even create destruction of the meaning:
Simply put, when training the model, it tries to create an understanding of the words, correlations, importance, sequences, dimensions, etc...
The model sees the same word in different sentences to make an understanding, the issue is that when removing the Arabic Diacritics the meaning is changed, lots of information is removed, "time past, present, future" "gender" "command" and much more.... so the model creates it's bases and "meaning, correlation,...." infrastructure on wrong concepts, this causes words that are supposed to be far from each other become closer, sentences structured wrongly and the meaning is "lost in translation".
A single Diacritic can convey lots of information in the same word, with just a single Diacritic!

translation app for android / iphone

Create apps for mobiles. First, use on-line services - later integrate translation engine in the app?
collaborate with bergamot project? (https://browser.mt/) (include also fine-tuning options and data donation)

'en-jap' model. What is 'jap'?

Hi, About 'en-jap' model: Could not find 'jap' ISO codes. Is it is the same as 'ja' (in Helsinki-NLP/opus-mt-en-jap)?
Is there a doc listing all the codes used here? Assuming that 'jap' is the same as 'ja', and using the demo huggingface.co/Helsinki-NLP/opus-mt- for en-jap and ja-en, I got several roundtrip translations pasted below.
From those, I am still not sure if 'jap' is the same as 'ja'.
Thankful for any pointer to a doc with the language codes you use.

_“My name is Wolfgang and I live in Berlin” -> “わが名はシナルといい , また " わたしは永遠に生きる " と .” -> “I say to you, " said Jesus, "I am going to make a helper for him, as a complement of him.”

“In winter we switch on the heating in our house.” -> “わたしたちは , 家にあって熱病にされているのである .” -> “We have a fever at home, " he said.”

“I want to go home” -> “わたしは , 家に帰り ,” -> “I went home,”_

upload and integration of MT models at ELG

Develop procedures for uploading MT models to ELG and adding information to their catalogue

docker-compose installation fails

Hello,

I am trying to install Opus-MT via the docker-compose solution, but I am encountering an error during the compilation step:

Marian compilation error

In file included from /usr/src/app/marian/src/3rd_party/sentencepiece/src/unigram_model.h:24,
                 from /usr/src/app/marian/src/3rd_party/sentencepiece/src/unigram_model.cc:15:
/usr/src/app/marian/src/3rd_party/sentencepiece/src/freelist.h: In instantiation of 'T* sentencepiece::model::FreeList<T>::Allocate() [with T = sentencepiece::unigram::Lattice::Node]':
/usr/src/app/marian/src/3rd_party/sentencepiece/src/unigram_model.cc:83:41:   required from here
/usr/src/app/marian/src/3rd_party/sentencepiece/src/freelist.h:62:13: warning: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct sentencepiece::unigram::Lattice::Node'; use assignment or value-initialization instead [-Wclass-memaccess]
       memset(chunk, 0, sizeof(*chunk) * chunk_size_);
       ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/src/app/marian/src/3rd_party/sentencepiece/src/unigram_model.cc:15:
/usr/src/app/marian/src/3rd_party/sentencepiece/src/unigram_model.h:38:10: note: 'struct sentencepiece::unigram::Lattice::Node' declared here
   struct Node {
          ^~~~
In file included from /usr/src/app/marian/src/3rd_party/sentencepiece/src/unigram_model.h:24,
                 from /usr/src/app/marian/src/3rd_party/sentencepiece/src/unigram_model.cc:15:
/usr/src/app/marian/src/3rd_party/sentencepiece/src/freelist.h: In instantiation of 'void sentencepiece::model::FreeList<T>::Free() [with T = sentencepiece::unigram::Lattice::Node]':
/usr/src/app/marian/src/3rd_party/sentencepiece/src/unigram_model.cc:93:24:   required from here
/usr/src/app/marian/src/3rd_party/sentencepiece/src/freelist.h:39:13: warning: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct sentencepiece::unigram::Lattice::Node'; use assignment or value-initialization instead [-Wclass-memaccess]
       memset(chunk, 0, sizeof(*chunk) * chunk_size_);
       ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/src/app/marian/src/3rd_party/sentencepiece/src/unigram_model.cc:15:
/usr/src/app/marian/src/3rd_party/sentencepiece/src/unigram_model.h:38:10: note: 'struct sentencepiece::unigram::Lattice::Node' declared here
   struct Node {
          ^~~~

Python package installation error

  ----------------------------------------
  Failed building wheel for tornado
  Running setup.py clean for tornado
  Running setup.py bdist_wheel for mosestokenizer: started
  Running setup.py bdist_wheel for mosestokenizer: finished with status 'error'
  Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-lughb366/mosestokenizer/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-thswx0b2 --python-tag cp37:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help

  error: invalid command 'bdist_wheel'

  ----------------------------------------
  Failed building wheel for mosestokenizer
  Running setup.py clean for mosestokenizer
  Running setup.py bdist_wheel for pycld2: started
  Running setup.py bdist_wheel for pycld2: finished with status 'error'
  Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-lughb366/pycld2/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-v5m0lqhx --python-tag cp37:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help

  error: invalid command 'bdist_wheel'

  ----------------------------------------
  Failed building wheel for pycld2
  Running setup.py clean for pycld2
  Running setup.py bdist_wheel for sqlitedict: started
  Running setup.py bdist_wheel for sqlitedict: finished with status 'error'
  Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-lughb366/sqlitedict/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-8em57psk --python-tag cp37:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help

  error: invalid command 'bdist_wheel'

  ----------------------------------------
  Failed building wheel for sqlitedict
  Running setup.py clean for sqlitedict
  Running setup.py bdist_wheel for docopt: started
  Running setup.py bdist_wheel for docopt: finished with status 'error'
  Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-lughb366/docopt/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-1536sh2e --python-tag cp37:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help

  error: invalid command 'bdist_wheel'

  ----------------------------------------
  Failed building wheel for docopt
  Running setup.py clean for docopt
  Running setup.py bdist_wheel for toolwrapper: started
  Running setup.py bdist_wheel for toolwrapper: finished with status 'error'
  Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-lughb366/toolwrapper/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-svf3gk7k --python-tag cp37:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help

  error: invalid command 'bdist_wheel'

  ----------------------------------------
  Failed building wheel for toolwrapper
  Running setup.py clean for toolwrapper
  Running setup.py bdist_wheel for uctools: started
  Running setup.py bdist_wheel for uctools: finished with status 'error'
  Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-lughb366/uctools/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-i2p1v3j0 --python-tag cp37:
  usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: -c --help [cmd1 cmd2 ...]
     or: -c --help-commands
     or: -c cmd --help

  error: invalid command 'bdist_wheel'

  ----------------------------------------
  Failed building wheel for uctools
  Running setup.py clean for uctools
Failed to build tornado mosestokenizer pycld2 sqlitedict docopt toolwrapper uctools
Installing collected packages: tornado, docopt, openfile, toolwrapper, uctools, mosestokenizer, pycld2, sqlitedict, sentencepiece
  Running setup.py install for tornado: started
    Running setup.py install for tornado: finished with status 'done'
  Running setup.py install for docopt: started
    Running setup.py install for docopt: finished with status 'done'
  Running setup.py install for toolwrapper: started
    Running setup.py install for toolwrapper: finished with status 'done'
  Running setup.py install for uctools: started
    Running setup.py install for uctools: finished with status 'done'
  Running setup.py install for mosestokenizer: started
    Running setup.py install for mosestokenizer: finished with status 'done'
  Running setup.py install for pycld2: started
    Running setup.py install for pycld2: finished with status 'done'
  Running setup.py install for sqlitedict: started
    Running setup.py install for sqlitedict: finished with status 'done'

Runtime error

λ docker-compose up
Creating network "opus-mt_default" with the default driver
Starting opus-mt_opus-mt_1 ... done
Attaching to opus-mt_opus-mt_1
opus-mt_1  | marian-server: error while loading shared libraries: libsentencepiece_train.so.0: cannot open shared object file: No such file or directory
opus-mt_1  | ERROR:asyncio:Future exception was never retrieved
opus-mt_1  | future: <Future finished exception=CalledProcessError(127, 'unknown')>
opus-mt_1  | Traceback (most recent call last):
opus-mt_1  |   File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 742, in run
opus-mt_1  |     yielded = self.gen.throw(*exc_info)  # type: ignore
opus-mt_1  |   File "server.py", line 49, in run
opus-mt_1  |     ret = yield self.p.wait_for_exit()
opus-mt_1  |   File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 735, in run
opus-mt_1  |     value = future.result()
opus-mt_1  | subprocess.CalledProcessError: Command 'unknown' returned non-zero exit status 127.

I am running Windows 10 with Docker on a WSL2 backend. Is there any pre-requisite step in order to build the container?
Thank you,

#Question, help. How to build a new language pair ?

Sorry for this question.
I am very interested in your repository, I want to use this repository in my project (translate from language A to language B).
however, some of the language pairs I need are not in your repository.
So, I am looking for a guide to building a new language pair like the language pairs you provide.
I have found a tutorial on building a new language from scratch but it isn't helping me.
This link: https://huggingface.co/blog/how-to-train

If you have any information, tutorials, or articles please feel free to share with me. Thank you.
I am just an outsider in the NLP field, sorry about this.

Unzipping model.zip fails

It looks like there is an issue with model.zip downloaded from pouta.csc.fi. This is a recent issue, as this worked fine yesterday:

joni@Joker:~/Opus-MT$ sudo make all
wget -O model-list.txt https://object.pouta.csc.fi/OPUS-MT-models
--2020-01-09 12:36:12--  https://object.pouta.csc.fi/OPUS-MT-models
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.19, 86.50.254.18
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/xml]
Saving to: ‘model-list.txt’

model-list.txt                          [ <=>                                                              ] 334,09K  --.-KB/s    in 0,08s   

2020-01-09 12:36:12 (4,18 MB/s) - ‘model-list.txt’ saved [342108]

wget -O model.zip \
	https://object.pouta.csc.fi/OPUS-MT-models/`tr "<>" "\n\n" < model-list.txt | \
	grep 'fi-en/opus' |\
	sort | tail -1`
--2020-01-09 12:36:12--  https://object.pouta.csc.fi/OPUS-MT-models/
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.19, 86.50.254.18
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/xml]
Saving to: ‘model.zip’

model.zip                               [ <=>                                                              ] 334,09K  --.-KB/s    in 0,09s   

2020-01-09 12:36:12 (3,56 MB/s) - ‘model.zip’ saved [342108]

mkdir -p model
cd model && unzip ../model.zip
Archive:  ../model.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of ../model.zip or
        ../model.zip.zip, and cannot find ../model.zip.ZIP, period.
Makefile:163: recipe for target '/usr/local/share/opusMT/models/fi-en/opus.npz' failed
make: *** [/usr/local/share/opusMT/models/fi-en/opus.npz] Error 9

many repetitions & duplicates in translation

Hi Guys,

Problem:
regardless of the model i use, there are situations where the translation is broken, and contains many repetitions.

One Example:

echo "30.4 C\nYaoundé\n \nLUNDI, 4 OCTOBRE 2021 11:46\nAFRIQUE CENTRALE\n \nAFRIQUE DE L’OUEST\n \nTÉLÉCOMS\n \nINNOVATION\n \nINTERNET\n \nENTRETIENS\n \nFRANÇAIS\nMORE\nAFRIQUE CENTRALE\n \nAFRIQUE DE L’OUEST\n \nTÉLÉCOMS\n \nINNOVATION\n" | ./opusMT-client.py -H localhost -s fr -t en

marian-opus-fr-en arguments

--alignment -p 11002 -b2 -n1 -m /usr/local/share/opusMT/models/fr-en/opus.npz -v /usr/local/share/opusMT/models/fr-en/opus.vocab.yml /usr/local/share/opusMT/models/fr-en/opus.vocab.yml

opusMT-opus-fr-en arguments

-p 20012 -c /var/cache/opusMT/opus.fr-en.cache.db --spm /usr/local/share/opusMT/models/fr-en/opus.fr.spm --mtport 11002 -s fr -t en

Result:

 { 
    "alignment": [
        "0-0 1-1 2-2 3-3 4-4 5-5 6-6 7-7 8-8 9-9 10-10 11-11 12-12 13-13 14-14 15-15 16-16 17-17 19-18 20-19 21-20 22-21 23-22 24-23 25-24",
        "0-0 2-105 4-3 7-1 8-2 10-4 10-184 10-194 11-5 11-45 11-55 11-60 11-65 11-110 11-115 11-120 11-125 11-130 11-135 11-140 11-145 12-6 12-81 12-86 12-91 12-141 12-151 12-156 12-161 12-166 12-231 12-236 12-241 13-50 13-70 13-75 13-80 13-85 13-90 13-95 13-100 13-150 13-155 13-160 13-165 13-170 13-175 13-180 13-185 13-190 13-195 13-200 13-205 13-210 13-215 13-220 13-225 13-230 13-235 13-240 13-245 13-250 13-255 13-260 13-265 15-8 15-13 15-83 15-88 15-138 15-143 15-148 15-153 15-158 15-163 15-168 15-173 15-178 15-203 15-208 15-213 15-218 15-223 15-228 15-233 15-238 15-243 15-248 15-253 15-258 15-263 19-259 20-261 21-7 21-72 21-77 21-82 21-227 21-232 21-237 21-242 21-247 22-9 22-14 22-84 22-89 22-94 22-99 22-139 22-144 22-149 22-154 22-159 22-164 22-169 22-174 22-179 22-219 22-224 22-229 22-234 22-239 22-244 22-249 22-254 22-264 23-10 24-11 24-16 24-251 24-256 32-15 47-96 47-171 47-176 57-47 57-52 57-62 57-67 67-18 67-43 67-48 67-53 67-58 67-63 67-68 67-73 67-78 67-93 67-98 67-103 67-108 67-113 67-118 67-123 67-128 67-133 67-183 67-188 67-193 67-198 70-12 70-252 70-257 70-262 73-19 74-20 76-25 78-23 78-28 83-21 83-266 84-17 84-22 84-27 84-57 84-87 84-92 84-97 84-102 84-107 84-112 84-117 84-122 84-127 84-132 84-137 84-142 84-147 84-152 84-157 84-162 84-167 84-172 84-177 84-182 84-187 84-192 84-197 84-202 84-207 84-212 84-217 84-222 85-24 85-104 85-109 85-114 85-119 85-189 85-199 86-30 87-26 87-31 87-36 87-41 87-46 87-51 87-56 87-61 87-66 87-71 87-101 87-106 87-111 87-116 87-121 87-126 87-131 87-136 87-146 87-181 87-186 87-191 87-196 87-201 87-206 87-211 87-216 89-33 89-38 93-32 93-37 93-42 94-29 94-34 94-39 94-44 94-49 94-54 94-59 94-64 95-35 95-40 96-76 96-221 96-226 96-246 102-69 102-74 102-79 102-124 102-129 102-134 102-204 102-209 102-214"
    ],
    "result": "30.4 C\\nYaound\u00e9\\n \\nLUNDI, 4 OCTOBER 2021 11: CENTRAL AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\NLAND",
    "segmentation": "spm",
    "server": "localhost:20012",
    "source": "fr",
    "source-segments": [
        "\u258130 .4 \u2581C \\ n Y a ound \u00e9 \\ n \u2581\\ n L UND I , \u25814 \u2581 OC TO BRE \u258120 21 \u258111 :",
        "\u258146 \\ n A FR IQUE \u2581C ENT R ALE \\ n \u2581\\ n A FR IQUE \u2581DE \u2581L ' OU EST \\ n \u2581\\ n T\u00c9 L \u00c9 COM S \\ n \u2581\\ n IN NO V ATION \\ n \u2581\\ n INTER NET \\ n \u2581\\ n ENT RET IENS \\ n \u2581\\ n FR AN \u00c7 AIS \\ n M ORE \\ n A FR IQUE \u2581C ENT R ALE \\ n \u2581\\ n A FR IQUE \u2581DE \u2581L ' OU EST \\ n \u2581\\ n T\u00c9 L \u00c9 COM S \\ n \u2581\\ n IN NO V ATION \\ n"
    ],
    "target": "en",
    "target-segments": [
        "\u258130 .4 \u2581C \\ n Y a ound \u00e9 \\ n \u2581\\ n LU ND I , \u25814 \u2581OCT O BER \u258120 21 \u258111 :",
        "\u2581C ENT RAL \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ N LAND"
    ]
}

As you can see the result contains many times:WEST AFRICA

Question:
Has anybody an idea why this happens?
Could it be related to marian-decoder or sentencepiece?

Kind Regards

Alex

88 BLEU score but result is not correct.

Hi, I am trying to do urdu transliterate like converting roman urdu to urdu font.
In 10 epochs, i got 88 bleu score.
Despite the good score, the results are bad.
Is there a way to add ngram LM model to boost the performance.

convert pretrained models to nematus / tensorflow

Hi,

I'm looking for pretrained machine translation models in tensorflow.
If I understand correctly, Opus-MT is based on marian-nmt, which in turn is a pure C++ implementation of nematus.
Would it be possible to convert an Opus-MT model to tensorflow?

Please excuse this strange question. This is definitely not a preference for tensorflow, just for the sake of tensorflow.
I would love to use Opus-MT. Your infrastructure and deployment options look super clean!

The reason I'm asking is that I'm looking for a way to deploy machine translation on the client-side.
I've used tensorflow-js and tensorflow-lite before, for custom image analysis tasks on android/web.
So with a pretrained tensorflow model, it should be quite straightforward to get it to run.

Then there would still be the text preprocessing/tokenization. But it seems that most Opus-MT models rely on sentencepiece. The python source is rather clean and I think I could get this ported to Typescript quickly.

Running client.py inside Docker does not give json output

Hi,

I spun up a quick and basic docker container with the default models (en-es, en-fi, en-ml,en-mr). I try to run the websocket api inside the docker.
echo "France passed the budget today morning." | ./opusMT-client.py -H localhost -P 10001 -s en -t es
it just gives me the tokenized translated result without any json, as below , any reason why :

also it looks like to run translations against other models(en-fi), i cannot directly pass raw text I need to pass tokenized input for better results (including example in docs might help)

Thanks !

How to add term intervention to Helsinki-NLP/opus-mt-en-zh model on huggingface? Specify translation results according to term dictionary

Issue with Marian Server on Colab

Hi, while trying to run server.py. I encountered this issue and couldn't figure out

Future exception was never retrieved
future: <Future finished exception=FileNotFoundError(2, "No such file or directory: 'marian-server'")>
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/tornado/gen.py", line 326, in wrapper
yielded = next(result)
File "server.py", line 65, in run
'--maxi-batch', '100',
File "/usr/local/lib/python3.7/dist-packages/tornado/process.py", line 240, in init
self.proc = subprocess.Popen(*args, **kwargs)
File "/usr/lib/python3.7/subprocess.py", line 800, in init
restore_signals, start_new_session)
File "/usr/lib/python3.7/subprocess.py", line 1551, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'marian-server': 'marian-server'

Traceback (most recent call last):
File "server.py", line 188, in
application = make_app(args)
File "server.py", line 166, in make_app
worker_pool = initialize_workers(services)
File "server.py", line 148, in initialize_workers
source_lang, target_lang, pair_config, models[pair_config['configuration']])
File "server.py", line 23, in init
targetspm=self.service.get('targetspm')
File "/content/Opus-MT/content_processor.py", line 18, in init
self.bpe_source = BPE(BPEcodes)
File "/content/Opus-MT/apply_bpe.py", line 37, in init
firstline = codes.readline()
File "/usr/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 54: invalid start byte

Docker build fails due to missing libprotobuf23

Building the docker image with the command docker-compose up fails with the following error:

E: Unable to locate package libprotobuf23

libprotobuf23 is not available in debian buster (it only has libprotobuf17), but it is available from debian buster-backports.

Docker compile for GPU fails due to missing CUDA installation

When compiling the docker image after changing the Dockerfile cmake line from cpu to gpu usage, as instructed by the comment in Dockerfile, the following error is raised during docker build:

CUDA_TOOLKIT_ROOT_DIR not found or specified
-- Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR CUDA_NVCC_EXECUTABLE CUDA_INCLUDE_DIRS CUDA_CUDART_LIBRARY) (Required is at least version "9.0")

Cannot find suitable CUDA libraries. Specify the path explicitly with
  -DCUDA_TOOLKIT_ROOT_DIR=/path/to/appropriate/cuda/installation
   (hint: try /usr/local/$(readlink /usr/local/cuda))
OR compile the CPU-only version of Marian with
  -DCOMPILE_CUDA=off

CMake Error at CMakeLists.txt:289 (message):
  FATAL ERROR: No suitable CUDA library found.

Is it possible to finetune a OPUS model that has once been finetuned?

Is it possible to finetune a model that has once been finetuned? Thus, incrementally improving the model gradually. If not, I will have several new finetuned models. If I have several already finetuned models is it possible to use them at the same time or do I have to compose 2 new .tmx into one .tmx file before the fine-tuning??

OPUS-MT docker for ELG

develop OPUS-MT docker for ELG (based on https://github.com/ugermann/marian-docker?). Information from Ullrich Germann:

The easiest way to get Marian-trained models into ELG is as follows:

Create a Docker image with an ELG-compatible REST-based translation server that incorporates the respective model. This is easy. See https://github.com/ugermann/marian-docker for code and details. In a nutshell:
Put all relevant files (vocab.spm, model.bin) into a separate model directory.
Create a decoder.yml file (with a few extra fields) for your setup in the model directory.
Copy the appropriate Dockerfile into the model directory.
Run docker build -t /path/to/your/model/directory
Push the image to a Docker repository of your choice
Announce the resource to ELG. For this, you'll need to provide a metadata record to ELG. This process is currently still a bit rough around the edges, but ILSP are working hard to make it much easier. Currently, you'll need to provide the metadata as an XML file that confirms to a specific DTD. Penny is the person to talk to about creating and ingesting resource metadata records into ELG.

docker-compose up fails with FileNotFoundError

docker-compose up

Results in a succesful build, but fails on attach with the following traceback:

WARNING: Image for service opus-mt was built because it did not already exist. To rebuild this image you must use `docker-compose build` or `docker-compose up --build`.
Creating opus-mt_opus-mt_1 ... done
Attaching to opus-mt_opus-mt_1
opus-mt_1  | Traceback (most recent call last):
opus-mt_1  |   File "server.py", line 150, in <module>
opus-mt_1  |     application = make_app(args)
opus-mt_1  |   File "server.py", line 130, in make_app
opus-mt_1  |     worker_pool = initialize_workers(services)
opus-mt_1  |   File "server.py", line 115, in initialize_workers
opus-mt_1  |     source_lang, target_lang, decoder_config)
opus-mt_1  |   File "server.py", line 26, in __init__
opus-mt_1  |     targetspm=self.service.get('targetspm')
opus-mt_1  |   File "/usr/src/app/content_processor.py", line 17, in __init__
opus-mt_1  |     BPEcodes = open(sourcebpe, 'r', encoding="utf-8")
opus-mt_1  | FileNotFoundError: [Errno 2] No such file or directory: './models/en-es/source.bpe'
opus-mt_opus-mt_1 exited with code 1

Any ideas how to fix?

missing keys for some low-resource language pairs

On the ber-es transformer, if I run:

spm_encode --model source.spm <<< "Bessif kanay."

you get:

▁Be ssif ▁kan ▁ay .

But ▁Be is not in opus.spm32k-spm32k.vocab.yml, so my python tokenizer raises a KeyError when it encounters these tokens.

This doesn't change if I run preprocess.sh first.
When I run the pieced sequence through marian_decoder I get a good translation, no error.

This happens for other model,character combos, here is a list of (pair, missing key) from a random sample of models I tested.

{'ha-en': '|',
 'ber-es': '▁Be',
 'pis-fi': '▁|',
 'es-mt': '|',
 'fr-he': '₫',
 'niu-sv': 'OGI',
 'fi-fse': '▁rentou',
 'fi-mh': '|',
 'hr-es': '|',
 'fr-ber': '▁devr',
 'ase-en': 'olos',
 'sv-uk': '|'}

Is this expected? Should my encoder use the id in these cases?

Missed pytorch_model.bin in https://huggingface.co/Helsinki-NLP/opus-mt-eng-ukr

Could you add the pytorch_model.bin please to the https://huggingface.co/Helsinki-NLP/opus-mt-eng-ukr

Without this file I cannot download the model, I get:

OSError: Can't load weights for 'Helsinki-NLP/opus-mt-eng-ukr'. Make sure that:

'Helsinki-NLP/opus-mt-eng-ukr' is a correct model identifier listed on 'https://huggingface.co/models'
or 'Helsinki-NLP/opus-mt-eng-ukr' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.

Thank you in advance

Unable to run using docker

Hi,
I've built Opus MT using docker when I try to run it, the following error shows up:

Starting opus-mt_opus-mt_1 ... done
Attaching to opus-mt_opus-mt_1
opus-mt_1  | Traceback (most recent call last):
opus-mt_1  |   File "server.py", line 150, in <module>
opus-mt_1  |     application = make_app(args)
opus-mt_1  |   File "server.py", line 130, in make_app
opus-mt_1  |     worker_pool = initialize_workers(services)
opus-mt_1  |   File "server.py", line 115, in initialize_workers
opus-mt_1  |     source_lang, target_lang, decoder_config)
opus-mt_1  |   File "server.py", line 26, in __init__
opus-mt_1  |     targetspm=self.service.get('targetspm')
opus-mt_1  |   File "/usr/src/app/content_processor.py", line 17, in __init__
opus-mt_1  |     BPEcodes = open(sourcebpe, 'r', encoding="utf-8")
opus-mt_1  | FileNotFoundError: [Errno 2] No such file or directory: './models/en-es/source.bpe'
opus-mt_opus-mt_1 exited with code 1

I've downloaded the pretrained models but it doesn't contain the bpe files only .npz files.
Kindly help.

Thanks!!!

Docker compile for CPU fails due to sentencepiece.pc missing

When compiling the dockerfile with docker build . error is raised at make install step of marian compilation:

CMake Error at src/3rd_party/sentencepiece/cmake_install.cmake:41 (file):
  file INSTALL cannot find "/usr/src/app/marian/sentencepiece.pc".
Call Stack (most recent call first):
  src/3rd_party/cmake_install.cmake:46 (include)
  src/cmake_install.cmake:42 (include)
  cmake_install.cmake:42 (include)

English to Korean

Is there an English to Korean MT model? There are only ko-{tgt} models

Thanks.

Use multiple CPU cores during decoding

Hello, I'm trying to use multiple cpu cores in decoding. I added the "cpu-threads: 8" to the decoder.yml, as per marian documentation.

This seems to recognize 8 cpus in loading time.

pus-mt_1 | [2020-10-16 12:08:54] [memory] Extending reserved space to 512 MB (device cpu0) opus-mt_1 | [2020-10-16 12:08:54] Loading scorer of type transformer as feature F0 opus-mt_1 | [2020-10-16 12:08:54] Loading model from /usr/src/app/models/en-es/opus.bpe32k-bpe32k.transformer.model1.npz.best-perplexity.npz opus-mt_1 | [2020-10-16 12:08:54] [memory] Extending reserved space to 512 MB (device cpu0) opus-mt_1 | [2020-10-16 12:08:54] Loading scorer of type transformer as feature F0 opus-mt_1 | [2020-10-16 12:08:54] Loading model from /usr/src/app/models/ar-en/opus.bpe32k-bpe32k.transformer.model1.npz.best-perplexity.npz opus-mt_1 | [2020-10-16 12:08:58] Server is listening on port 10001 opus-mt_1 | [2020-10-16 12:09:04] [memory] Extending reserved space to 512 MB (device cpu1) opus-mt_1 | [2020-10-16 12:09:04] Loading scorer of type transformer as feature F0 opus-mt_1 | [2020-10-16 12:09:04] Loading model from /usr/src/app/models/ar-en/opus.bpe32k-bpe32k.transformer.model1.npz.best-perplexity.npz opus-mt_1 | [2020-10-16 12:09:08] [memory] Extending reserved space to 512 MB (device cpu2) opus-mt_1 | [2020-10-16 12:09:08] Loading scorer of type transformer as feature F0 opus-mt_1 | [2020-10-16 12:09:08] Loading model from /usr/src/app/models/ar-en/opus.bpe32k-bpe32k.transformer.model1.npz.best-perplexity.npz opus-mt_1 | [2020-10-16 12:09:11] [memory] Extending reserved space to 512 MB (device cpu3) opus-mt_1 | [2020-10-16 12:09:12] Loading scorer of type transformer as feature F0 opus-mt_1 | [2020-10-16 12:09:12] Loading model from /usr/src/app/models/ar-en/opus.bpe32k-bpe32k.transformer.model1.npz.best-perplexity.npz opus-mt_1 | [2020-10-16 12:09:18] [memory] Extending reserved space to 512 MB (device cpu4) opus-mt_1 | [2020-10-16 12:09:19] Loading scorer of type transformer as feature F0 opus-mt_1 | [2020-10-16 12:09:19] Loading model from /usr/src/app/models/ar-en/opus.bpe32k-bpe32k.transformer.model1.npz.best-perplexity.npz opus-mt_1 | [2020-10-16 12:09:22] [memory] Extending reserved space to 512 MB (device cpu5) opus-mt_1 | [2020-10-16 12:09:22] Loading scorer of type transformer as feature F0 opus-mt_1 | [2020-10-16 12:09:22] Loading model from /usr/src/app/models/ar-en/opus.bpe32k-bpe32k.transformer.model1.npz.best-perplexity.npz opus-mt_1 | [2020-10-16 12:09:25] [memory] Extending reserved space to 512 MB (device cpu6) opus-mt_1 | [2020-10-16 12:09:26] Loading scorer of type transformer as feature F0 opus-mt_1 | [2020-10-16 12:09:26] Loading model from /usr/src/app/models/ar-en/opus.bpe32k-bpe32k.transformer.model1.npz.best-perplexity.npz opus-mt_1 | [2020-10-16 12:09:29] [memory] Extending reserved space to 512 MB (device cpu7) opus-mt_1 | [2020-10-16 12:09:29] Loading scorer of type transformer as feature F0 opus-mt_1 | [2020-10-16 12:09:29] Loading model from /usr/src/app/models/ar-en/opus.bpe32k-bpe32k.transformer.model1.npz.best-perplexity.npz opus-mt_1 | [2020-10-16 12:09:32] Server is listening on port 10002

But then in execution time I see it only uses 1 CPU and takes the same time as without the cpu-threads:8 config. It also just prints:

opus-mt_1 | [2020-10-16 12:10:30] [memory] Reserving 295 MB, device cpu0

Does any one know how to use multiple CPUs in decoding ?

Thanks.

integrate sentencepiece

Use sentencepiece instead of Moses tokenizer + BPE

get rid of tokenization/detokenization
sentencepiece internally in MarianNMT or as pre-processing (probably necessary for guided alignment + attention output)
integration in opusMT server through python interface https://pypi.org/project/sentencepiece/

Alternatively use https://pypi.org/project/youtokentome/ for BPE tokenizaion (does this also work like sentencepiece without tokenization?)

Persian pretrained models

I want to find any pretrained models paired with Persian (Farsi). In the matrix view of the downloadable models, I see that you have fa-fi and fi-fa pairs, but the links to their models do not work. How can I download them?

Experimental models link broken

The link mentioned in the Readme to experimental models seems broken or with restricted access: https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/work-spm/models

run ‘docker-compose up’ Errors

#0 145.7 Errors were encountered while processing:
#0 145.7 /tmp/apt-dpkg-install-i0Z5wU/29-intel-mkl-gnu-rt-2019.5-281_2019.5-281_amd64.deb
#0 145.7 /tmp/apt-dpkg-install-i0Z5wU/30-intel-mkl-gnu-2019.5-281_2019.5-281_amd64.deb
#0 145.7 /tmp/apt-dpkg-install-i0Z5wU/33-intel-mkl-gnu-f-2019.5-281_2019.5-281_amd64.deb
#0 145.7 /tmp/apt-dpkg-install-i0Z5wU/34-intel-mkl-pgi-rt-2019.5-281_2019.5-281_amd64.deb
#0 145.7 /tmp/apt-dpkg-install-i0Z5wU/35-intel-mkl-pgi-2019.5-281_2019.5-281_amd64.deb
#0 145.7 /tmp/apt-dpkg-install-i0Z5wU/38-intel-mkl-tbb-rt-2019.5-281_2019.5-281_amd64.deb
#0 145.7 /tmp/apt-dpkg-install-i0Z5wU/39-intel-mkl-tbb-2019.5-281_2019.5-281_amd64.deb
#0 145.7 E: Sub-process /usr/bin/dpkg returned an error code (1)

opusmt.wmflabs.org down

Someone reported in a Telegram chat that the instance at opusmt.wmflabs.org is down (502 Bad Gateway). Not sure if this is the correct place to report it, but hopefully someone here knows how to fix it.

Manually replacement of model with another model.

I am using the model which is working by default download. I want to replace with another model. I did it with simple replacemen,. But apparently didn't work in docker.

Is there any way to do this with some steps? If you can provide the proper way how to do this with the customized model / language.

Understanding model architecture

I was reading a bit about this project in order to determine the feasibility of porting these models to pytorch.

Is there a place I could read the modeling code/forward pass to see how the weights in the npz files are used? I got one from here

More general advice on what to read/understand is also appreciated.
Thanks in advance!

Installation via Docker fails (CMake error: file INSTALL cannot find "/usr/src/app/marian/sentencepiece.pc")

Installation via Docker fails. Output after following the README instructions:

$ docker-compose up
...
[ 97%] Building CXX object src/CMakeFiles/marian_scorer.dir/command/marian_scorer.cpp.o
[ 98%] Linking CXX executable ../marian-server
[ 98%] Built target marian_server
[100%] Linking CXX executable ../marian-scorer
[100%] Built target marian_scorer
Install the project...
-- Install configuration: "Release"
CMake Error at src/3rd_party/sentencepiece/cmake_install.cmake:41 (file):
  file INSTALL cannot find "/usr/src/app/marian/sentencepiece.pc".
Call Stack (most recent call first):
  src/3rd_party/cmake_install.cmake:46 (include)
  src/cmake_install.cmake:42 (include)
  cmake_install.cmake:42 (include)


make: *** [Makefile:108: install] Error 1
ERROR: Service 'opus-mt' failed to build : The command '/bin/sh -c set -eux; 	git clone https://github.com/marian-nmt/marian marian; 	cd marian;git checkout 1.9.0; 	cmake . -DCOMPILE_SERVER=on -DUSE_SENTENCEPIECE=on -DCOMPILE_CPU=on -DCOMPILE_CUDA=off -DUSE_STATIC_LIBS=on; 	make -j 2 install;' returned a non-zero code: 2

As a workaround, I did the following:

diff --git a/Dockerfile b/Dockerfile
index f5331c5..34fc2db 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -23,14 +23,16 @@ RUN set -eux; \
        rm -f GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB;
 
 # Install Marian MT
-RUN set -eux; \
-       git clone https://github.com/marian-nmt/marian marian; \
-       cd marian; \
-       git checkout 1.9.0; \
+RUN set -eux && \
+       git clone https://github.com/marian-nmt/marian marian && \
+       cd marian && \
+       git checkout 1.9.0 && \
+       mkdir build && \
+       cd build && \
        # Choose CPU or GPU(CUDA) from below lines.
        # cmake . -DCOMPILE_SERVER=on -DUSE_SENTENCEPIECE=on -DCOMPILE_CUDA=on -DUSE_STATIC_LIBS=on; \
-       cmake . -DCOMPILE_SERVER=on -DUSE_SENTENCEPIECE=on -DCOMPILE_CPU=on -DCOMPILE_CUDA=off -DUSE_STATIC_LIBS=on; \
-       make -j 2 install;
+       cmake . -DCOMPILE_SERVER=on -DUSE_SENTENCEPIECE=on -DCOMPILE_CPU=on -DCOMPILE_CUDA=off -DUSE_STATIC_LIBS=on .. && \
+       make -j4;
 
 COPY . .
 
@@ -40,14 +42,14 @@ RUN set -eux; \
        pip3 install -r requirements.txt
 
 # install services
-RUN install -m 755 marian/marian /usr/local/bin/; \
-       install -m 755 marian/marian-server /usr/local/bin/; \
-       install -m 755 marian/marian-server /usr/local/bin/; \
-       install -m 755 marian/marian-vocab /usr/local/bin/; \
-       install -m 755 marian/marian-decoder /usr/local/bin/; \
-       install -m 755 marian/marian-scorer /usr/local/bin/; \
-       install -m 755 marian/marian-conv /usr/local/bin/; \
-       install -m 644 marian/libmarian.a  /usr/local/lib/;
+RUN install -m 755 marian/build/marian /usr/local/bin/; \
+       install -m 755 marian/build/marian-server /usr/local/bin/; \
+       install -m 755 marian/build/marian-server /usr/local/bin/; \
+       install -m 755 marian/build/marian-vocab /usr/local/bin/; \
+       install -m 755 marian/build/marian-decoder /usr/local/bin/; \
+       install -m 755 marian/build/marian-scorer /usr/local/bin/; \
+       install -m 755 marian/build/marian-conv /usr/local/bin/; \
+       install -m 644 marian/build/libmarian.a  /usr/local/lib/;
 
 EXPOSE 80
 CMD python3 server.py -c services.json -p 80

Docker build fails after debian update

Hi,

I am trying to build docker image, following steps in readme. However, I am facing several issues.

The current stable docker image that we get from debian has been updated, and this image does not have libprotobuf17 so docker build raises following error :
E: Unable to locate package libprotobuf17

I have updated libprotobuf17 to libprotobuf23, this allows me to continue with building the docker, however, another error occurs. This happens in marian_server:

#13 873.6 /usr/src/app/marian/src/3rd_party/simple-websocket-server/server_ws.hpp: In instantiation of 'void SimpleWeb::SocketServerBase<socket_type>::Connection::set_timeout(long int) [with socket_type = boost::asio::basic_stream_socket<boost::asio::ip::tcp>]':
#13 873.6 /usr/src/app/marian/src/3rd_party/simple-websocket-server/server_ws.hpp:525:30:   required from 'void SimpleWeb::SocketServerBase<socket_type>::read_handshake(const std::shared_ptr<SimpleWeb::SocketServerBase<socket_type>::Connection>&) [with socket_type = boost::asio::basic_stream_socket<boost::asio::ip::tcp>]'
#13 873.6 /usr/src/app/marian/src/3rd_party/simple-websocket-server/server_ws.hpp:816:36:   required from here
#13 873.6 /usr/src/app/marian/src/3rd_party/simple-websocket-server/server_ws.hpp:190:84: error: 'class boost::asio::basic_stream_socket<boost::asio::ip::tcp>' has no member named 'get_io_service'
#13 873.6   190 |         timer = std::unique_ptr<asio::steady_timer>(new asio::steady_timer(socket->get_io_service()));
#13 873.6           |                                                                                                                                    ~~~~~~~~^~~~~~~~~~~~~~
#13 878.7 cc1plus: note: unrecognized command-line option '-Wno-unknown-warning-option' may have been intended to silence earlier diagnostics
#13 878.9 make[3]: *** [src/CMakeFiles/marian_server.dir/build.make:82: src/CMakeFiles/marian_server.dir/command/marian_server.cpp.o] Error 1
#13 878.9 make[2]: *** [CMakeFiles/Makefile2:549: src/CMakeFiles/marian_server.dir/all] Error 2
#13 878.9 make[1]: *** [CMakeFiles/Makefile2:556: src/CMakeFiles/marian_server.dir/rule] Error 2
#13 878.9 make: *** [Makefile:322: marian_server] Error 2
------
executor failed running [/bin/sh -c set -eux; 	git clone https://github.com/marian-nmt/marian marian; 	cd marian; 	git checkout 1.9.0; 	cmake . -DUSE_STATIC_LIBS=on -DCOMPILE_SERVER=on -DUSE_SENTENCEPIECE=on -DCOMPILE_CPU=on -DCOMPILE_CUDA=off;  	make -j4 marian_server ;]: exit code: 2
ERROR: Service 'opus-mt' failed to build

I have seen marian_server has released a new version, so I have also tried to update it from v1.9.0 to v1.10.0, making following change in Dockerfile:

git clone https://github.com/marian-nmt/marian marian; \
        cd marian; \
        git checkout 1.10.0; \
        cmake . -DUSE_STATIC_LIBS=on -DCOMPILE_SERVER=on -DUSE_SENTENCEPIECE=on -DCOMPILE_CPU=on -DCOMPILE_CUDA=off;  \
        make -j4 marian_server

This seems to fix the issue, however a different issues arises:

#13 1085.0 /usr/include/boost/asio/impl/executor.hpp:94:15: error: 'class boost::asio::execution::any_executor<boost::asio::execution::context_as_t<boost::asio::execution_context&>, boost::asio::execution::detail::blocking::never_t<0>, boost::asio::execution::prefer_only<boost::asio::execution::detail::blocking::possibly_t<0> >, boost::asio::execution::prefer_only<boost::asio::execution::detail::outstanding_work::tracked_t<0> >, boost::asio::execution::prefer_only<boost::asio::execution::detail::outstanding_work::untracked_t<0> >, boost::asio::execution::prefer_only<boost::asio::execution::detail::relationship::fork_t<0> >, boost::asio::execution::prefer_only<boost::asio::execution::detail::relationship::continuation_t<0> > >' has no member named 'dispatch'
#13 1085.0    94 |     executor_.dispatch(BOOST_ASIO_MOVE_CAST(function)(f), allocator_);
#13 1085.0       |     ~~~~~~~~~~^~~~~~~~
#13 1085.0 /usr/include/boost/asio/impl/executor.hpp: In instantiation of 'void boost::asio::executor::impl< <template-parameter-1-1>, <template-parameter-1-2> >::post(boost::asio::executor::function&&) [with Executor = boost::asio::execution::any_executor<boost::asio::execution::context_as_t<boost::asio::execution_context&>, boost::asio::execution::detail::blocking::never_t<0>, boost::asio::execution::prefer_only<boost::asio::execution::detail::blocking::possibly_t<0> >, boost::asio::execution::prefer_only<boost::asio::execution::detail::outstanding_work::tracked_t<0> >, boost::asio::execution::prefer_only<boost::asio::execution::detail::outstanding_work::untracked_t<0> >, boost::asio::execution::prefer_only<boost::asio::execution::detail::relationship::fork_t<0> >, boost::asio::execution::prefer_only<boost::asio::execution::detail::relationship::continuation_t<0> > >; Allocator = std::allocator<void>; boost::asio::executor::function = boost::asio::detail::executor_function]':
#13 1085.0 /usr/include/boost/asio/impl/executor.hpp:97:8:   required from here
#13 1085.0 /usr/include/boost/asio/impl/executor.hpp:99:15: error: 'class boost::asio::execution::any_executor<boost::asio::execution::context_as_t<boost::asio::execution_context&>, boost::asio::execution::detail::blocking::never_t<0>, boost::asio::execution::prefer_only<boost::asio::execution::detail::blocking::possibly_t<0> >, boost::asio::execution::prefer_only<boost::asio::execution::detail::outstanding_work::tracked_t<0> >, boost::asio::execution::prefer_only<boost::asio::execution::detail::outstanding_work::untracked_t<0> >, boost::asio::execution::prefer_only<boost::asio::execution::detail::relationship::fork_t<0> >, boost::asio::execution::prefer_only<boost::asio::execution::detail::relationship::continuation_t<0> > >' has no member named 'post'
#13 1085.0    99 |     executor_.post(BOOST_ASIO_MOVE_CAST(function)(f), allocator_);
#13 1085.0       |     ~~~~~~~~~~^~~~

I am very interested in using Opus-MT, how can I fix it? do you have a working docker image that I can use?

Bad translations using marian-decoder

Hi, I've loaded the models from the following directory: https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models/ru-en
When I tried some of them I often get translation like: "▁Y O O O O O O O O O O O O O O O O O O O O" or "I 'm b@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@ m@@"
Then I tried to load the model from the Hugging Face site: but get pretty similar outputs while using Hugging Face framework gives good translations. Probably something wrong with config.
I launch it using the Marian library. For example:

 echo "привет" | ./marian-decoder -c /path/to/opus_models/opus-2019-12-05-ru-en/decoder.yml

So what can be wrong?

Probably I somehow should do preprocessing and postprocessing?

chinese2English NMT

i want to find a chinese2English NMT ,but i can `t find in these model you provide,can you give me a link?

Support multiline text

If the text has several lines, then we get an error.

How Helsinki models (in the transformers library) are trained ?

Hello @jorgtied

It seems to me that there is no model to translate from french to wolof.
I'm trying to do it myself by training it from scratch using the Huggingface library.
I want to use the same class (MarianMT) as you did for your translation models.
I'm having difficulties with this model because I don't know how to initialize the tokenizer (MarianTokenizer). It requires SentencePiece files ( a .spm extension) file but in general, SentencePiece models are stored in a ".model" extension file and I haven't seen nowhere a sentencePiece model saved in a ".spm". So could you tell me how you did initialize the tokenizer class for your models Please?

Also, I've seen tutorials teaching the process to train translation models from scratch in Hugginface, and apparently, some people are struggling with it too. So code snippets or resources that you used to train the Helsinki models (in Hugginface) are welcome too?

thank you in advance

Reproduced crash on Opus-mt-en-de model using string "J" and "J-10"

Try any of the of the two and translation on web UI will return "J..........." or "J-10............" after 16 seconds but in fact, it caused a server crash.

https://huggingface.co/Helsinki-NLP/opus-mt-en-de?text=J-10
https://huggingface.co/Helsinki-NLP/opus-mt-en-de?text=J

Env: Conda Pytorch 1.13, Cuda 11.7, transformers on GPU

The crash is also happening on CPU only device.

An error occurred, model: en->de, translating: ['J']
Stacktrace:
Traceback (most recent call last):
  File "/raid0/translate/app.py", line 202, in trans
    translated.extend(translator.translate(sents))
  File "/raid0/translate/translator.py", line 60, in translate
    return self.translator.translate(input_text)
  File "/raid0/translate/model.py", line 114, in translate
    return self._translate(input_text)
  File "/raid0/translate/model.py", line 96, in _translate
    translated = self.model.generate(**tokens, max_new_tokens=50000)
  File "/root/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1577, in generate
    return self.beam_search(
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 2747, in beam_search
    outputs = self(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/marian/modeling_marian.py", line 1440, in forward
    outputs = self.model(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/marian/modeling_marian.py", line 1240, in forward
    decoder_outputs = self.decoder(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/marian/modeling_marian.py", line 1042, in forward
    layer_outputs = decoder_layer(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/marian/modeling_marian.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/marian/modeling_marian.py", line 195, in forward
    query_states = self.q_proj(hidden_states) * self.scaling
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())`

On opus-mt-es-fr model we saw another GPU crash with very same stack trace. The UI link (CPU) shows failed translation with gibberish output at end. On GPU it should stacktrace like previous.

https://huggingface.co/Helsinki-NLP/opus-mt-es-fr

- ¿Porqué crees que renté una habitación con una poza privada… Akane? – Le mordió el lóbulo. – Ella se encogió de hombros. – Para quitarte todas esas dudas de la cabeza.

Provide link to https://opusmt.wmflabs.org/

Is it a good idea to provide a link to https://opusmt.wmflabs.org/ in the README for trying out the web interface and api for a limited set of languages? Seeing the system in action is fun and encouraging :-)

en-ml Model and test data bug

(Apologies if this is not the right repository to report this issue)

The data prepared for the Malayalam language has an issue. Consistantly there is a space before and after the Virama ് (U+0D4D). It is a connecting character and there should not be space around it.

Here is an example https://object.pouta.csc.fi/OPUS-MT/eval/ml-en/Tatoeba.opus.bpe32k-bpe32k.mlen1.transformer.ml.en.test.txt

Here the first word "ഒന ് നാം" should be "ഒന്നാം" by removing spaces around ്.

No model translate English to Thai ?

Thank you for your contribution on providing over 1,000 pre-trained translation models. Great work !

I'm using models from https://huggingface.co/Helsinki-NLP
However, it seems that you don't provide model that can translates English to Thai language ?
Am i right ? Will you plan to release such an model ?

de-en model needs a lot more memory than de-cs

I have been using the de-en and de-cs model on the same dataset (a few hundred thousand texts), and noticed that the English model needs a lot more memory than the Czech one. I'm running on an A100 GPU (40 GB memory).

In practice, I ended up with a batch size for English smaller than half of the Czech batch, even though the model config says they are roughly the same size - the only difference being that actually the de-cs vocabulary is slightly larger.

On top of that, the English model gets the repeating nonsense subsequence issue a lot more often. I approximated that by a type to token ratio below 0.15, which gives 20 texts to Czech and around 70k in English. I don't see how this might relate to memory consumption but maybe there's something.

Websocket service: [Errno 111] Connection refused

I was successful in transforming from French to Finnish and Finnish to English. But French to English didn't work for me with an error that 'unsupported language pair'. I checked the models here /usr/local/share/opusMT and found that fr-en model is present.
Even for those pair which are working; It doesn't work when I mention host and port (default host and port works fine). It gives me an error that "ConnectionRefusedError: [Errno 111] Connection refused".

How to make language pair work with defined host and port (it works with default values)?
Some language pairs like fr-en don't work even with default host and port. It gives unsupported language pair error.

What is the best way to deploy models in terms of translation speed?

Hi,

Firstly, thanks for making your translation models publicly available. It is really helpful for the industry.

I have a question though, related to this question, if I am going to translate a large amount of text, what is the best way to use your models? Currently I am using the transformers library, but the speed is pretty slow even on gpu, which is not satisfying enough.

Weird results when translating english to finnish (using EasyNMT with opus-mt)

While translating English to Finnish using your model via EasyNMT, I noticed something weird. Check this code and the results.

from easynmt import EasyNMT
model = EasyNMT('opus-mt')

text='''Religion and theology is the study of religious beliefs, concepts, symbols, expressions and texts of spirituality.
Programmes and qualifications with the following main content are classified here:
Religious history
Study of sacred books
Study of different religions
Theology
=== Inclusions
Included in this detailed field are programmes for children and young people.'''

print(model.translate(text,target_lang='fi'))

The output is:

'Uskonto ja teologia tutkivat uskonnollisia käsityksiä, käsitteitä, symboleja, ilmaisuja ja tekstejä hengellisyydestä.
Ohjelmat ja tutkinnot, joiden pääsisältö on seuraava:
Uskonnollinen historia
Pyhien kirjojen tutkiminen
Eri uskontojen tutkiminen
Teologia
Suomennos: Michael T. Francis Pinmontagne SUBHEAVEN.ORG
Tähän yksityiskohtaiseen kenttään kuuluvat lasten ja nuorten ohjelmat.'

So "=== Inclusions" is translated into "Suomennos: Michael T. Francis Pinmontagne SUBHEAVEN.ORG".

What is going on here? Is this a problem with Opus-MT model or its EasyMT implementation?

PS. The sample text is from ESCO ontology

[RQUEST] TMX Generator - Bitextor

Hello developers
I suggest creating a GUI for this code for creating a tool to harvest multilanguage websites to create a TMX to train MT's
Kindly check this
Bitextor generates translation memories from multilingual websites.
https://github.com/bitextor/bitextor
you may extract parallel text from this Medical website
https://www.mayoclinic.org/
Eng-Arabic and other languages and train NMT, to increase the translation accuracy for testing
https://webisearch.com/
Regards--

Transformer-align models with opus-mt

I was wondering if it is possible to use transformer-align models, based on SentencePiece within this framework. Up to now I only got the .bpe from transformer models working.

Some language-specific models are not translating multi-sentence sequences

Hello

I put together a quick demo for using the open-source Opus MT models from the Hugging Face hub. I quickly found that for some languages, the model is not translating all sentences. You can reproduce the same issue when loading the models straight in Python, e.g., with these functions

from typing import Optional, Tuple

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, PreTrainedModel, PreTrainedTokenizer

def load_mt_pipeline(model_name: str) -> Optional[Tuple[PreTrainedModel, PreTrainedTokenizer]]:
    """Load an opus-mt model, download it if it has not been installed yet."""
    try:
        model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        return model, tokenizer
    except:
        return None


def translate(model: PreTrainedModel, tokenizer: PreTrainedTokenizer, src_text: str) -> str:
    translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
    translations = "".join([tokenizer.decode(tokens, skip_special_tokens=True) for tokens in translated])
    return translations

In the demo, you'll see that the default-selected model Helsinki-NLP/opus-mt-en-nl is used to translate the English sentences Grandma is baking cookies! I love her cookies.. Unfortunately, the model only seems to translate the first one into Oma bakt koekjes.. The second part I love her cookies. is not translated.

I verified that the tokenizer is correctly tokenizing the input, but it seems that generate is not correctly producing all the input. It is stopping prematurely. The issue does not occur for, e.g., Helsinki-NLP/opus-mt-en-fr.

Any thoughts on this?

How would I run these models on IOS mobile device locally?

How would I run these models on IOS mobile device locally? I am ok to convert the model to CoreML or just use Libtorch.

Something wrong with model or maybe bug?

I use this model for EasyNMT and found some interesting behavior, if I try to translate something related to translation or subtitles, I can got response like:

{"target_lang":"en","source_lang":"ru","translated":["== sync, corrected by elderman == @elder_man"],"translation_time":1.5843150615692139}

But source text was: Перевод субтитров in russian.

I try to found something in web and found videos and websites with this text, for example:
https://www.youtube.com/watch?v=cIAdPid3QHU
https://shopee.com.my/-sync-corrected-by-elderman-elder_man-i.512528608.12307149345

Can it be fixed somewhat?