ukplab / easynmt
Easy to use, state-of-the-art Neural Machine Translation for 100+ languages
License: Apache License 2.0
Hi there,
thanks for providing this great package!
I have encountered an issue with the translate_stream method of the EasyNMT class: if the number of input texts (n_texts) is not divisible by chunk_size (i.e. there is a remainder), translate_stream only yields the first (n_texts // chunk_size) * chunk_size translations.
A minimal working example can be found here: https://colab.research.google.com/drive/1coVyCXc8jnPdHVcFHaVEsTfv1KfhGkvm?usp=sharing
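The truncation is easy to reproduce without any models. A minimal sketch (my own reconstruction, not EasyNMT's actual code) of a chunked generator that forgets to flush the final partial chunk:

```python
def chunked_stream(texts, chunk_size):
    # Hypothetical reproduction of the bug: yields translations chunk by
    # chunk, but never flushes the final partial chunk.
    batch = []
    for t in texts:
        batch.append(t)
        if len(batch) == chunk_size:
            yield from batch
            batch = []
    # The fix would be to also `yield from batch` here for the remainder.

texts = [str(i) for i in range(10)]
out = list(chunked_stream(texts, chunk_size=4))
# Only (10 // 4) * 4 == 8 of the 10 inputs come back.
```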
Hi, I want to load the 300MB translation model in a different directory, let's say in my program folder instead of in the python/huggingface/pytorch folder. Is there any way I can do this? Thanks
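One way to do this (a sketch, assuming the model files come from the Hugging Face Hub): redirect the transformers cache before loading, e.g. via the TRANSFORMERS_CACHE environment variable; recent EasyNMT versions also expose a cache_folder constructor argument worth checking in your installed version.

```python
import os

# Redirect Hugging Face's model cache to a folder inside the program
# directory. This must be set BEFORE importing easynmt/transformers.
os.environ["TRANSFORMERS_CACHE"] = os.path.join(os.getcwd(), "models")

# EasyNMT itself also accepts a cache folder (check your version's
# constructor signature before relying on this):
# from easynmt import EasyNMT
# model = EasyNMT("opus-mt", cache_folder=os.environ["TRANSFORMERS_CACHE"])
```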
Hi, and thanks for the cool library!
I want to include the translation function in one of my data pipelines, which loops over thousands of text snippets. Without GPU support and on Windows, I followed the instructions in the other issue and successfully added the function.
from easynmt import EasyNMT
model = EasyNMT('opus-mt')
and I translate with:
from langdetect import detect_langs  # this import was missing above
language = detect_langs(text)
for each_lang in language:
    if each_lang.lang != "en":
        translated_text = model.translate(text, target_lang='en')
where text is a string.
However, after a few translations (2-3) I always run into this error:
OSError: Can't load tokenizer for 'Helsinki-NLP/opus-mt-ia-en'. Make sure that:
- 'Helsinki-NLP/opus-mt-ia-en' is a correct model identifier listed on 'https://huggingface.co/models'
Any idea what the problem could be?
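One possible explanation: langdetect occasionally reports rare codes such as 'ia' (Interlingua), and no Helsinki-NLP/opus-mt-ia-en model exists on the Hub, so the load fails mid-pipeline. A defensive wrapper (my own sketch; `model` is assumed to be an EasyNMT('opus-mt') instance) keeps the loop alive:

```python
def safe_translate(model, text, source_lang=None, target_lang="en"):
    # Fall back to the original text when no model exists for the
    # detected language pair (or any other translation failure).
    try:
        return model.translate(text, source_lang=source_lang,
                               target_lang=target_lang)
    except Exception:
        return text
```

Alternatively, passing an explicit source_lang for snippets you know are in a handled language avoids the detection step entirely.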
Hi,
I was trying to translate 19,203 sentences from German to English using the translate_stream method explained in the following link:
https://github.com/UKPLab/EasyNMT/blob/main/examples/translation_streaming.py
I set the chunk size to 32. After successfully translating 3 chunks and writing the output to file, it raised a model-loading error. Can you guide me with this issue? I am pasting the error below.
0%|▌ | 96/19203 [00:54<3:00:03, 1.77it/s]
Exception: Can't load tokenizer for 'Helsinki-NLP/opus-mt-nds-en'. Make sure that:
- 'Helsinki-NLP/opus-mt-nds-en' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'Helsinki-NLP/opus-mt-nds-en' is the correct path to a directory containing relevant tokenizer files
1%|▋ | 127/19203 [01:06<2:46:24, 1.91it/s]
Traceback (most recent call last):
File "translate.py", line 12, in <module>
for translation in model.translate_stream(sentences, chunk_size=32, target_lang='en'):
File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/EasyNMT.py", line 297, in translate_stream
translated = self.translate(batch, show_progress_bar=False, **kwargs)
File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/EasyNMT.py", line 124, in translate
translated_sentences = self.translate_sentences(splitted_sentences, target_lang=target_lang, source_lang=source_lang, show_progress_bar=show_progress_bar, beam_size=beam_size, batch_size=batch_size, **kwargs)
File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/EasyNMT.py", line 210, in translate_sentences
raise e
File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/EasyNMT.py", line 205, in translate_sentences
translated = self.translate_sentences(grouped_sentences, source_lang=lng, target_lang=target_lang, show_progress_bar=show_progress_bar, beam_size=beam_size, batch_size=batch_size, **kwargs)
File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/EasyNMT.py", line 222, in translate_sentences
output.extend(self.translator.translate_sentences(sentences_sorted[start_idx:start_idx+batch_size], source_lang=source_lang, target_lang=target_lang, beam_size=beam_size, device=self.device, **kwargs))
File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/models/OpusMT.py", line 46, in translate_sentences
tokenizer, model = self.load_model(model_name)
File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/models/OpusMT.py", line 28, in load_model
tokenizer = MarianTokenizer.from_pretrained(model_name)
File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1777, in from_pretrained
raise EnvironmentError(msg)
OSError: Can't load tokenizer for 'Helsinki-NLP/opus-mt-nds-en'. Make sure that:
- 'Helsinki-NLP/opus-mt-nds-en' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'Helsinki-NLP/opus-mt-nds-en' is the correct path to a directory containing relevant tokenizer files
Hey team,
Thank you again for the great library.
Today we translated 'id' (Indonesian) sentences and quite a few of them came out as variants of "I'm sorry I'm sorry I'm sorry I'm sorry I'm sorry I'm sorry" even though they did not mention 'sorry' in the text.
Any idea why this could be please? Could it be because I'm not performing sentence splitting prior to translation whilst using the 'translate_sentences' function?
Thanks!
I want to translate from English to Yoruba, but I am getting the OSError shown below.
Exception: Can't load tokenizer for 'Helsinki-NLP/opus-mt-en-yo'. Make sure that:
- 'Helsinki-NLP/opus-mt-en-yo' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'Helsinki-NLP/opus-mt-en-yo' is the correct path to a directory containing relevant tokenizer files
Traceback (most recent call last):
File "translatefile.py", line 19, in <module>
translations = model.translate(sentences, target_lang='yo')
File "/home/alabi/anaconda3/lib/python3.7/site-packages/easynmt/EasyNMT.py", line 122, in translate
translated_sentences = self.translate_sentences(splitted_sentences, target_lang=target_lang, source_lang=source_lang, show_progress_bar=show_progress_bar, beam_size=beam_size, batch_size=batch_size, **kwargs)
File "/home/alabi/anaconda3/lib/python3.7/site-packages/easynmt/EasyNMT.py", line 208, in translate_sentences
raise e
File "/home/alabi/anaconda3/lib/python3.7/site-packages/easynmt/EasyNMT.py", line 203, in translate_sentences
translated = self.translate_sentences(grouped_sentences, source_lang=lng, target_lang=target_lang, show_progress_bar=show_progress_bar, beam_size=beam_size, batch_size=batch_size, **kwargs)
File "/home/alabi/anaconda3/lib/python3.7/site-packages/easynmt/EasyNMT.py", line 220, in translate_sentences
output.extend(self.translator.translate_sentences(sentences_sorted[start_idx:start_idx+batch_size], source_lang=source_lang, target_lang=target_lang, beam_size=beam_size, device=self.device, **kwargs))
File "/home/alabi/anaconda3/lib/python3.7/site-packages/easynmt/models/OpusMT.py", line 46, in translate_sentences
tokenizer, model = self.load_model(model_name)
File "/home/alabi/anaconda3/lib/python3.7/site-packages/easynmt/models/OpusMT.py", line 28, in load_model
tokenizer = MarianTokenizer.from_pretrained(model_name)
File "/home/alabi/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1760, in from_pretrained
raise EnvironmentError(msg)
OSError: Can't load tokenizer for 'Helsinki-NLP/opus-mt-en-yo'. Make sure that:
- 'Helsinki-NLP/opus-mt-en-yo' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'Helsinki-NLP/opus-mt-en-yo' is the correct path to a directory containing relevant tokenizer files
Translating from Yoruba to English worked well. What am I doing wrong?
Hi, I like your package!
I wanted to propose sorting sentences by length in descending order before translation, so that the largest batches are translated first and a memory error surfaces at the start rather than suddenly at the end of the translation.
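The idea sketched out (my own illustration, not the library's code): translate in descending length order so the most memory-hungry batch runs first, then restore the original order afterwards.

```python
sentences = [
    "a short one",
    "a considerably longer sentence than all of the others in this list",
    "a mid-length sentence here",
]

# Sort indices by sentence length, longest first, so an out-of-memory
# error would occur in the very first batch instead of at the end.
order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]), reverse=True)
sorted_desc = [sentences[i] for i in order]

# After translating `sorted_desc`, results go back to their original
# positions via the saved indices.
def restore(translated, order):
    out = [None] * len(translated)
    for pos, idx in enumerate(order):
        out[idx] = translated[pos]
    return out
```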
Just got a new CentOS 7 server with a GeForce RTX 2080 GPU...
Using the API the translation was very slow so I tried the following:
Correct me if I am wrong, but it seems like my easynmt instance doesn't use the GPU. Did I do something wrong?
Thank you for your help...
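A quick way to check (assumes PyTorch is installed in the same environment): if CUDA isn't visible to PyTorch, EasyNMT can only run on CPU.

```python
import torch

# False here means the environment cannot see the GPU at all
# (driver/CUDA mismatch), regardless of EasyNMT's own settings.
print(torch.cuda.is_available())

# EasyNMT also lets you request a device explicitly (check your
# version's constructor signature):
# from easynmt import EasyNMT
# model = EasyNMT('opus-mt', device='cuda')
```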
Running this on a 2017 MacBook. The Docker image easynmt/api:2.0-cpu fails to start with exceptions, while easynmt/api:1.1-cpu was running fine with the same docker run command previously.
docker run -p 24081:80 -v /Users/agrim/Downloads/easynmt-models:/cache easynmt/api:2.0-cpu
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/local/lib/python3.8/site-packages/easynmt/EasyNMT.py", line 92, in __init__
self.translator = module_class(easynmt_path=model_path, **self.config['model_args'])
KeyError: 'model_args'
Checking for script in /app/prestart.sh
There is no script /app/prestart.sh
[2021-04-27 14:38:22 +0000] [13] [INFO] Starting gunicorn 20.1.0
[2021-04-27 14:38:22 +0000] [12] [INFO] Starting gunicorn 20.1.0
[2021-04-27 14:38:22 +0000] [12] [INFO] Listening at: http://0.0.0.0:8080 (12)
[2021-04-27 14:38:22 +0000] [13] [INFO] Listening at: http://0.0.0.0:80 (13)
[2021-04-27 14:38:22 +0000] [13] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2021-04-27 14:38:22 +0000] [12] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2021-04-27 14:38:22 +0000] [17] [INFO] Booting worker with pid: 17
[2021-04-27 14:38:22 +0000] [18] [INFO] Booting worker with pid: 18
[2021-04-27 14:38:22 +0000] [19] [INFO] Booting worker with pid: 19
[2021-04-27 14:38:24 +0000] [19] [INFO] Started server process [19]
[2021-04-27 14:38:24 +0000] [17] [INFO] Started server process [17]
[2021-04-27 14:38:24 +0000] [17] [INFO] Waiting for application startup.
[2021-04-27 14:38:24 +0000] [19] [INFO] Waiting for application startup.
[2021-04-27 14:38:24 +0000] [17] [INFO] Application startup complete.
[2021-04-27 14:38:24 +0000] [19] [INFO] Application startup complete.
{"loglevel": "info", "workers": "1", "bind": "0.0.0.0:8080", "graceful_timeout": 120, "timeout": 120, "keepalive": 5, "errorlog": "-", "accesslog": "-", "host": "0.0.0.0", "port": "8080"}
Booted as backend: True
Load model: opus-mt
[2021-04-27 14:38:25 +0000] [18] [ERROR] Exception in worker process
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/gunicorn/arbiter.py", line 589, in spawn_worker
worker.init_process()
File "/usr/local/lib/python3.8/site-packages/uvicorn/workers.py", line 63, in init_process
super(UvicornWorker, self).init_process()
File "/usr/local/lib/python3.8/site-packages/gunicorn/workers/base.py", line 134, in init_process
self.load_wsgi()
File "/usr/local/lib/python3.8/site-packages/gunicorn/workers/base.py", line 146, in load_wsgi
self.wsgi = self.app.wsgi()
File "/usr/local/lib/python3.8/site-packages/gunicorn/app/base.py", line 67, in wsgi
self.callable = self.load()
File "/usr/local/lib/python3.8/site-packages/gunicorn/app/wsgiapp.py", line 58, in load
return self.load_wsgiapp()
File "/usr/local/lib/python3.8/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
return util.import_app(self.app_uri)
File "/usr/local/lib/python3.8/site-packages/gunicorn/util.py", line 359, in import_app
mod = importlib.import_module(module)
File "/usr/local/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 783, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/app/main.py", line 36, in <module>
model = EasyNMT(model_name, load_translator=IS_BACKEND, **model_args)
File "/usr/local/lib/python3.8/site-packages/easynmt/EasyNMT.py", line 92, in __init__
self.translator = module_class(easynmt_path=model_path, **self.config['model_args'])
KeyError: 'model_args'
[2021-04-27 14:38:25 +0000] [18] [INFO] Worker exiting (pid: 18)
[2021-04-27 14:38:25 +0000] [12] [INFO] Shutting down: Master
[2021-04-27 14:38:25 +0000] [12] [INFO] Reason: Worker failed to boot.
{"loglevel": "info", "workers": "1", "bind": "0.0.0.0:8080", "graceful_timeout": 120, "timeout": 120, "keepalive": 5, "errorlog": "-", "accesslog": "-", "host": "0.0.0.0", "port": "8080"}
One of the processes has already exited.
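The KeyError suggests the 2.0 image is reading an easynmt.json left in the mounted /cache volume by the 1.1 image, which predates the model_args key. Clearing the cache volume so 2.0 re-downloads its config is the simplest fix; defensively, the lookup could also tolerate old configs (sketch, not the library's actual code):

```python
# A 1.x-style cached config has no 'model_args' key, so
# config['model_args'] raises KeyError in the 2.0 loader.
config = {"model_class": "easynmt.models.OpusMT"}  # old-style config

model_args = config.get("model_args", {})  # {} instead of a crash
```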
Hello. Where are the downloaded models stored on a Mac? (I am talking about the large, ~300 MB models.)
Hello, congrats on the initiative!
I've been using Helsinki-NLP models previously; the most commonly used models for Portuguese are 'opus-mt-en-ROMANCE' and 'opus-mt-ROMANCE-en'. So model.translate(sample, source_lang='pt', target_lang='en') won't work, but, as I've tested, model.translate(sample, source_lang='ROMANCE', target_lang='en') works.
So it would be nice to have some alias in the code for ROMANCE. :)
The error occurs because no model for that exact language pair is available in the repo.
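The proposed alias could be as small as a lookup table mapping individual Romance language codes to the 'ROMANCE' group code the multilingual Helsinki-NLP models use (the set below is illustrative, not the models' full language list):

```python
ROMANCE_ALIASES = {"pt", "es", "fr", "it", "ro", "ca", "gl"}  # illustrative subset

def resolve_opus_lang(code: str) -> str:
    # Map e.g. 'pt' to the 'ROMANCE' group; pass other codes through.
    return "ROMANCE" if code in ROMANCE_ALIASES else code

# model.translate(sample, source_lang=resolve_opus_lang('pt'), target_lang='en')
```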
First, I need to congratulate the team on your work; Nils is, IMHO, one of the best devs in the NLP community. Sentence-transformers and this NMT translator repo have been very helpful to us at Contents.com.
I use Opus-MT right now and it is great; I noticed that it even keeps HTML tags. But sometimes it makes the mistake of generating ">/strong>", "×/strong>" or "--/strong>" instead of "</strong>" (strong as an example; the same happens for other tags like h3 or li, as far as I know). I solved it by simply replacing: text = text.replace('×/', '</').replace('>/', '</').replace('--/', '</'). It is a very simple thing; I just wanted to let you know.
I wonder if you are aware of a translator model that is good at keeping proper nouns (people, cities, company names, etc.) unchanged, even when they consist of two or more words? I could use NER, but it would decrease speed and, to be honest, I don't find the free NER libraries reliable enough.
Thank you!
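The replace-chain above can also be written as a single regex that only rewrites the corrupted prefixes when they precede a known closing tag name (a slightly safer variant of the same idea; the tag list is just the ones mentioned above):

```python
import re

def fix_broken_close_tags(text: str) -> str:
    # Rewrites '>/strong>', '×/strong>', '--/strong>' (and the same for
    # h3/li) back to proper closing tags like '</strong>'.
    return re.sub(r"(?:>|×|--)/(strong|h3|li)>", r"</\1>", text)
```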
Hello guys, and thank you for your awesome library!
I'm currently struggling to get the test_multi_process_translation.py script working. When it comes to the multi-process part, the following error occurs:
2021-03-04 19:00:46 | INFO | easynmt.EasyNMT | Start multi-process pool on devices: cuda:0
Traceback (most recent call last):
File "test_multi_process_translation.py", line 80, in <module>
process_pool = model.start_multi_process_pool()
File "/data/home/k.kirillova/anaconda/envs/fairseq/lib/python3.7/site-packages/easynmt/EasyNMT.py", line 258, in start_multi_process_pool
p.start()
File "/data/home/k.kirillova/anaconda/envs/fairseq/lib/python3.7/multiprocessing/process.py", line 112, in start
self._popen = self._Popen(self)
File "/data/home/k.kirillova/anaconda/envs/fairseq/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/data/home/k.kirillova/anaconda/envs/fairseq/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/data/home/k.kirillova/anaconda/envs/fairseq/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/data/home/k.kirillova/anaconda/envs/fairseq/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/data/home/k.kirillova/anaconda/envs/fairseq/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'BaseFairseqModel.make_generation_fast_.<locals>.train'
Could you please give me any ideas on how to fix this? Thanks in advance!
Hello
I am running the following code
from easynmt import EasyNMT
model = EasyNMT('opus-mt')
print(model.translate("停", target_lang='en'))
The result of the code is just "停", which is exactly the same as the input. How can I fix this?
Hi, Is there any way to add custom rules for certain words so that they won't get translated in another meaning?
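EasyNMT has no built-in do-not-translate list as far as I know; a common workaround is to mask the protected terms with placeholder tokens before translation and restore them afterwards (sketch; the placeholder format is my own choice and may itself be altered by some models, so test it):

```python
def mask_terms(text, terms):
    # Replace each protected term with a stable placeholder token.
    mapping = {}
    for i, term in enumerate(terms):
        token = f"XTERM{i}X"
        if term in text:
            text = text.replace(term, token)
            mapping[token] = term
    return text, mapping

def unmask(text, mapping):
    # Put the protected terms back after translation.
    for token, term in mapping.items():
        text = text.replace(token, term)
    return text
```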
This is not an issue per se, but rather, a query. Could you please elaborate on how the models provided in EasyNMT have been sized down to their current sizes (300MB for Opus-MT, 1.2 GB for mBART and 2.4 GB for the 1.2bn M2M-100)?
If there is some model pruning or other downsizing techniques involved to reduce the size of the original models to their EasyNMT versions, how does it affect the performance of these models as compared to their original counterparts?
Thank you for your response in advance.
Hi, I found that M2M_100 supports direct translation between any pair of its 100 languages (9,900 pairs). But when I use EasyNMT with the M2M_100 model, it doesn't support all of these pairs.
Example: EasyNMT can't translate directly from 'th' (Thai) to 'en' (English), while the M2M_100 model does support this pair.
And when I tried to use Hugging Face to translate directly between Thai and English, it worked perfectly.
Can you please solve the problem? By the way, thank you for creating EasyNMT.
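For reference, the direct Hugging Face route that worked can be sketched as follows (hedged; it downloads the checkpoint on first call): M2M100 takes the source language on the tokenizer and the target language as a forced BOS token at generation time.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

def translate_th_en(text: str) -> str:
    # Downloads the facebook/m2m100_418M checkpoint on first use.
    tok = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
    mdl = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
    tok.src_lang = "th"                                  # source language
    batch = tok(text, return_tensors="pt")
    out = mdl.generate(**batch,
                       forced_bos_token_id=tok.get_lang_id("en"))  # target
    return tok.batch_decode(out, skip_special_tokens=True)[0]
```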
Hi Team,
I have a question. I am trying to translate a column which has blanks in between. I am using EasyNMT and it is giving an error. Won't it work if there are blanks or missing values between the rows of a column?
Thanks
Srinivas
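A workaround sketch until that's confirmed: filter out empty or missing cells first, translate only real strings, and write the results back by index.

```python
rows = ["Guten Tag", "", None, "Wie geht es dir?"]  # example column data

# Keep (index, text) pairs only for non-empty string cells.
to_translate = [(i, r) for i, r in enumerate(rows)
                if isinstance(r, str) and r.strip()]

# texts = [r for _, r in to_translate]
# translations = model.translate(texts, target_lang='en')
# then write each translation back to its row index, leaving blanks as-is.
```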
EasyNMT uses fastText to identify the language. Some Chinese phrases can be misidentified as Chinese variants, like 'yue' or 'wuu'. This causes EasyNMT to fail. Can you map the Chinese language variants 'yue', 'wuu' and 'min' to 'zh'?
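The requested mapping could look like this (sketch; the variant list is the codes named above plus common locale spellings):

```python
ZH_VARIANTS = {"yue", "wuu", "min", "zh-cn", "zh-tw"}

def normalize_lang(code: str) -> str:
    # Collapse Chinese variant codes onto 'zh' before model selection.
    return "zh" if code.lower() in ZH_VARIANTS else code
```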
Hi, I have found this tool very helpful; excellent effort.
I want to know whether, after translation, there is any way to replace a word with a synonym. Is there a function I can call?
Please guide me if you can.
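EasyNMT itself has no synonym function as far as I know, but a post-processing pass over the translated text can do the swap (minimal sketch with a hypothetical synonym table; a real setup might use WordNet or similar):

```python
SYNONYMS = {"big": "large", "quick": "fast"}  # illustrative mapping

def apply_synonyms(text: str) -> str:
    # Word-by-word replacement on the translated output.
    return " ".join(SYNONYMS.get(word, word) for word in text.split())
```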
OSError: Unable to load weights from pytorch checkpoint file for 'facebook/m2m100_1.2B' at '/cache/transformers/68002fb1a7773d8d8373f1a230588141964ef9f249db6987681f295dbe85356c.ee70663869b89be4f68eed03a21d5c3400b223cb544883f411e469aaea0a25f9'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
I am using Ubuntu 20.04 with 3 GPUs of RTX 2080. This is my docker-compose:
version: '2.3'
services:
  easynmt:
    image: easynmt/api:2.0-cuda11.1
    restart: always
    runtime: nvidia
    volumes:
      - ./cache:/cache
    ports:
      - "24080:80"
    env_file:
      - ./envfile
The envfile:
EASYNMT_MODEL=m2m_100_1.2B
The first problem is that the service always uses only the first GPU in my machine.
The second: one backend worker takes around 7 GB of GPU VRAM, so after a few translation requests CUDA runs out of memory and the code stops working (I think each spawned worker needs another 7 GB of VRAM).
I have my own solution for both: set MAX_WORKERS_BACKEND=1, start 3 easynmt services, each pinned to a specific GPU, and use nginx as a load balancer across the 3 services.
Did I do something wrong, or is there a better solution in this case?
Running Thai text through the pipeline gives the error "ModuleNotFoundError: No module named 'thai_segmenter'".
Does this dependency need to be specified explicitly in setup.py?
Thanks!
With the switch away from Fairseq to Hugging Face transformers in EasyNMT 2.0, there seem to have been substantial changes; e.g. the m2m_100_1.2B model has grown from 2.3 GB to 5 GB.
Do we need new benchmark tests for the m2m/mbart models after this change? In general, will this make translation inference slower or faster than before (by subjective opinion)?
I am trying to install EasyNMT on a semi-offline system. Python libraries are permitted to be installed, but accessing other URLs is not permitted from this system. Can you therefore advise on a way to manually install the Facebook models (m2m_100_418M and m2m_100_1.2B) so that EasyNMT can see them? I can see they can be manually downloaded from huggingface.co. If I were to download them on a second system, where should I then save them on the semi-offline Windows system on which I intend to test EasyNMT? Thanks
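One approach that should work (hedged sketch, assuming the models come from the Hugging Face Hub): on the online machine, download and export the model to a plain folder, copy that folder over, and point the loaders at the local path.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

def export_m2m(name: str = "facebook/m2m100_418M",
               target_dir: str = "m2m100_418M") -> None:
    # Run this on the ONLINE machine; it downloads the checkpoint and
    # writes tokenizer + weights into target_dir, which can then be
    # copied to the offline Windows box and loaded via its local path.
    M2M100Tokenizer.from_pretrained(name).save_pretrained(target_dir)
    M2M100ForConditionalGeneration.from_pretrained(name).save_pretrained(target_dir)
```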
@nreimers
Can an OPUS-MT model be inferenced on an ARM-based device, such as a Raspberry Pi?
Hi,
I'm sorry for this noobish question/issue; maybe it is easy to resolve (I'm not experienced with Docker). I've built a web app which uses EasyNMT in the back end via the Docker images and REST. When translating from Romanian to German, I noticed that the Docker image only uses the opus model, which does not provide this language direction. Executing the "/model_name" request also shows only "opus" for the Docker image.
So how can I get the other models? I have 3 Docker images of easynmt (7.7 GB, 6.02 GB and 3.8 GB), but it seems none of them contains the other models. Am I doing something wrong here?
Also, when they are part of the image, is there some kind of auto-selection if a language is not available in one of the models?
I installed the Docker images via the "build-docker-hub.sh" file.
Best regards,
André
Hi,
I'm using EasyNMT for translating customer reviews. During translation, I got this error
HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/Helsinki-NLP/opus-mt-ro-en
HTTPError Traceback (most recent call last)
in
1 for index, row in df_review['AnswerValue'].iteritems():
----> 2 translated_row = model.translate(row, target_lang='en')#translating each row
3 df_review.loc[index, 'Translate'] = translated_row
~/opt/anaconda3/lib/python3.8/site-packages/easynmt/EasyNMT.py in translate(self, documents, target_lang, source_lang, show_progress_bar, beam_size, batch_size, perform_sentence_splitting, paragraph_split, sentence_splitter, document_language_detection, **kwargs)
152 except Exception as e:
153 logger.warning("Exception: "+str(e))
--> 154 raise e
155
156 if is_single_doc and len(output) == 1:
~/opt/anaconda3/lib/python3.8/site-packages/easynmt/EasyNMT.py in translate(self, documents, target_lang, source_lang, show_progress_bar, beam_size, batch_size, perform_sentence_splitting, paragraph_split, sentence_splitter, document_language_detection, **kwargs)
147 method_args['documents'] = [documents[idx] for idx in ids]
148 method_args['source_lang'] = lng
--> 149 translated = self.translate(**method_args)
150 for idx, translated_sentences in zip(ids, translated):
151 output[idx] = translated_sentences
~/opt/anaconda3/lib/python3.8/site-packages/easynmt/EasyNMT.py in translate(self, documents, target_lang, source_lang, show_progress_bar, beam_size, batch_size, perform_sentence_splitting, paragraph_split, sentence_splitter, document_language_detection, **kwargs)
179 #logger.info("Translate {} sentences".format(len(splitted_sentences)))
180
--> 181 translated_sentences = self.translate_sentences(splitted_sentences, target_lang=target_lang, source_lang=source_lang, show_progress_bar=show_progress_bar, beam_size=beam_size, batch_size=batch_size, **kwargs)
182
183 # Merge sentences back to documents
~/opt/anaconda3/lib/python3.8/site-packages/easynmt/EasyNMT.py in translate_sentences(self, sentences, target_lang, source_lang, show_progress_bar, beam_size, batch_size, **kwargs)
276
277 for start_idx in iterator:
--> 278 output.extend(self.translator.translate_sentences(sentences_sorted[start_idx:start_idx+batch_size], source_lang=source_lang, target_lang=target_lang, beam_size=beam_size, device=self.device, **kwargs))
279
280 #Restore original sorting of sentences
~/opt/anaconda3/lib/python3.8/site-packages/easynmt/models/OpusMT.py in translate_sentences(self, sentences, source_lang, target_lang, device, beam_size, **kwargs)
38 def translate_sentences(self, sentences: List[str], source_lang: str, target_lang: str, device: str, beam_size: int = 5, **kwargs):
39 model_name = 'Helsinki-NLP/opus-mt-{}-{}'.format(source_lang, target_lang)
---> 40 tokenizer, model = self.load_model(model_name)
41 model.to(device)
42
~/opt/anaconda3/lib/python3.8/site-packages/easynmt/models/OpusMT.py in load_model(self, model_name)
20 else:
21 logger.info("Load model: "+model_name)
---> 22 tokenizer = MarianTokenizer.from_pretrained(model_name)
23 model = MarianMTModel.from_pretrained(model_name)
24 model.eval()
~/opt/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
1645 else:
1646 # At this point pretrained_model_name_or_path is either a directory or a model identifier name
-> 1647 fast_tokenizer_file = get_fast_tokenizer_file(
1648 pretrained_model_name_or_path, revision=revision, use_auth_token=use_auth_token
1649 )
~/opt/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in get_fast_tokenizer_file(path_or_repo, revision, use_auth_token)
3406 """
3407 # Inspect all files from the repo/folder.
-> 3408 all_files = get_list_of_files(path_or_repo, revision=revision, use_auth_token=use_auth_token)
3409 tokenizer_files_map = {}
3410 for file_name in all_files:
~/opt/anaconda3/lib/python3.8/site-packages/transformers/file_utils.py in get_list_of_files(path_or_repo, revision, use_auth_token)
1691 else:
1692 token = None
-> 1693 model_info = HfApi(endpoint=HUGGINGFACE_CO_RESOLVE_ENDPOINT).model_info(
1694 path_or_repo, revision=revision, token=token
1695 )
~/opt/anaconda3/lib/python3.8/site-packages/huggingface_hub/hf_api.py in model_info(self, repo_id, revision, token)
246 )
247 r = requests.get(path, headers=headers)
--> 248 r.raise_for_status()
249 d = r.json()
250 return ModelInfo(**d)
~/opt/anaconda3/lib/python3.8/site-packages/requests/models.py in raise_for_status(self)
941
942 if http_error_msg:
--> 943 raise HTTPError(http_error_msg, response=self)
944
945 def close(self):
HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/Helsinki-NLP/opus-mt-ro-en
Could you please review and fix the issue?
Thank you.
Is there a way to set "temperature" or "repetition_penalty" for the Facebook models?
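Judging by the **kwargs chain visible in the tracebacks elsewhere in these issues, extra keyword arguments to translate() are forwarded toward the underlying generate() call, so Hugging Face generation options may get through (unverified; whether a given option is honored depends on the backend model):

```python
# Hugging Face generate() options; note that temperature only takes
# effect when sampling is enabled via do_sample=True.
gen_kwargs = {"repetition_penalty": 1.3, "do_sample": True, "temperature": 0.9}

# model.translate(sentences, target_lang="en", **gen_kwargs)
```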
Hello, I'm running this simple code in PyCharm:
from easynmt import EasyNMT
model = EasyNMT("opus-mt")
print(model.translate("Hi", target_lang="fr"))
and it gives me this error
Traceback (most recent call last):
File "H:/Documents/Python/Random.py", line 2, in <module>
model = EasyNMT("opus-mt")
File "C:\Users\Jerz King\AppData\Local\Programs\Python\Python37\lib\site-packages\easynmt\EasyNMT.py", line 69, in __init__
module_class = import_from_string(self.config['model_class'])
File "C:\Users\Jerz King\AppData\Local\Programs\Python\Python37\lib\site-packages\easynmt\util.py", line 56, in import_from_string
module = importlib.import_module(module_path)
File "C:\Users\Jerz King\AppData\Local\Programs\Python\Python37\lib\importlib\__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'easynmt.models.OpusMT'
I had the same error addressed here when installing easynmt, and followed the steps, but nothing happened... How do I fix this?
Hi, I reviewed the code and want to give some suggestions.
As the code logic describes, if the user does not set source_lang in the translate method, the object will auto-infer the possible source_lang in translate_sentences.
This behaviour works well when the input sentences are short. When it comes to a long sentence, because of perform_sentence_splitting, the long input is split into small fragments, source_lang is inferred for every fragment, and a suitable model is chosen to translate each one (in grouped_sentences, grouped by detected source_lang).
This suits mixed-language inputs: fragments in different languages are translated by different models and joined back together.
But when the sentence_splitter unfortunately splits the long input badly, consider the following example:
import pandas as pd
input_ = 'How many times does the rebuilt data contain cannot handle non-empty timestamp argument! 1929 and scrapped data contain cannot handle non-empty timestamp argument! 1954?'
# this output will be ['en', 'en', 'eo'] because the last fragment is "1954?"
# and language_detection maps it to "eo"
pd.Series(sentence_splitter(input_)).map(model.language_detection)
When this is then used with the opus-mt model to translate the sentence from "eo" into "zh", it raises an error because no such model exists to load.
I understand that I can avoid this error by setting source_lang to "en" in the translate method, but I think the library also needs to deal with this problem.
If language_detection and sentence_splitter run fast enough, one option is to validate all possible translation mappings against lang_pairs in easynmt.json (in the opus-mt folder of the models dir) before running the translate method.
Another option: because the last fragment is too short to give a reliable language_detection result, apply a confidence filter that depends on fragment length.
Or use a regex (regular expression) to filter out symbols (in this example the '?' in "1954?") or other unhelpful tokens before language detection. And because inputs come in different formats, someone may pass an HTML document to the translate method; it would be useful to provide an interface that lets the user set a token filter (filtering out "?", "<br>", etc.) before language_detection.
The example above is one sample from a dataset I was translating, so when this error occurred I lost all previously translated results because of a single exception. Since the actual translation runs in batches, people may want to keep the successful batches by setting a small batch_size and collecting completed batches. I hope a future version will collect successful batches rather than losing everything when one long input (in list or string form) fails near the end of the translate method.
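The batch-level resilience asked for above can be sketched like this (my own wrapper, not library code; `model` is assumed to be an EasyNMT instance):

```python
def translate_resilient(model, docs, batch_size=8, target_lang="en"):
    # Translate in small batches; keep every successful batch and record
    # failures instead of losing all prior work to one exception.
    results, failed = [], []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        try:
            results.extend(model.translate(batch, target_lang=target_lang))
        except Exception as exc:
            failed.append((start, str(exc)))
            results.extend(batch)  # keep originals as placeholders
    return results, failed
```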
I have installed easynmt docker on Windows 7 using Docker Toolbox. I also mapped the /cache folder to my local drive so cached data remain persistent after reboot. I tested and it works fine, I can see hundreds of megabytes of data in the cache folder when I try to translate in a new language.
The problem is that after a reboot, using an already downloaded language set, there is always a pretty long delay on the first translation call. I disabled my network card to test, and indeed something is still downloaded from the internet even with the data/models already in the cache folder.
What is downloaded?
Is it possible to run in 100% offline mode?
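A guess at the startup delay: transformers pings the Hub for metadata/etag checks even when files are already cached. It honors TRANSFORMERS_OFFLINE=1, which skips those HTTP requests and serves everything from the local cache; worth trying inside the container:

```python
import os

# Must be set before transformers is imported (e.g. via the container's
# environment rather than in code that runs after startup).
os.environ["TRANSFORMERS_OFFLINE"] = "1"
```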
With large documents (several thousand characters), I see a crash in the URL parser. I tried different source and target languages. Also, for the same source text, the translation may sometimes succeed but then fail again; for example, for the sample request URL given below, the URL parser may sometimes succeed and sometimes fail with an exception.
I'm using the last 1.x commit of EasyNMT (specifically commit 61fcf7154f01f56c02be6d30b1c5d0921b91aa2e) as it has better benchmarks than 2.x for fairseq models, but I believe the same issue should exist in the latest version too, as I don't think the URL parser has changed. I'm using the m2m_100_418M model with a T4 GPU, if that matters at all.
EasyNMT error logs:
[2021-05-25 14:58:29 +0000] [16] [WARNING] Invalid HTTP request received.
Traceback (most recent call last):
File "httptools/parser/parser.pyx", line 245, in httptools.parser.parser.cb_on_url
File "/opt/conda/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 216, in on_url
parsed_url = httptools.parse_url(url)
File "httptools/parser/parser.pyx", line 468, in httptools.parser.parser.parse_url
httptools.parser.errors.HttpParserInvalidURLError: invalid url b'ividual+level+we+need+to+take+up+responsibility+to+curb+the+spread+of+this+virus%2C%22+said+Dr+Rao%0D%0A%0D%0APromotedListen+to+the+latest+songs%2C+only+on+JioSaavn.com%0D%0A%0D%0AWhen+it+comes+to+Bengaluru%2C+Dr+Rao+said+the+lockdown+had+reduced+the+number+of+emergency+oxygen+requirements+and+the+panic.+%22That+is+because+the+virus+has+stopped+moving+because+we+have+stopped+moving%2C%22+he+said.+%22Generally%2C+as+a+rule%2C+a+health+care+system+will+not+be+able+to+cope+with+a+sudden+rise+in+numbers%2C+emergency+oxygen+requirements+or+health+care.+The+other+big+concern+is+trained+manpower.%22%0D%0A%0D%0AComments%0D%0AMucormycosis%2C+commonly+known+as+Black+Fungus%2C+is+also+on+the+rise+in+the+state.+Dr+Rao+said%3A+%22At+HCG+we+are+treating+30+cases+and+the+number+is+on+the+rise.+In+Karnataka%2C+currently%2C+it+must+be+about+700+cases.+It+looks+like+an+epidemic+within+a+pandemic+at+this+juncture.+We+need+to+understand+the+source+of+this+infection%2C+have+early+detection+and+treatment.+A+committee+will+give+a+clear+strategy+for+the+state.+We+don%27t+need+to+scare+people+about+black+fungus%2C+we+need+to+create+awareness.+What+we+have+seen+in+the+patients+-+they+have+all+been+Covid+positive%2C+most+have+been+given+steroids%2C+majority+had+high+sugar.+30+to+40+per+cent+had+been+given+oxygen+and+most+important+-+none+of+them+had+been+vaccinated.%22%0D%0A%0D%0A'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 167, in data_received
self.parser.feed_data(data)
File "httptools/parser/parser.pyx", line 193, in httptools.parser.parser.HttpParser.feed_data
httptools.parser.errors.HttpParserCallbackError: the on_url callback failed
[2021-05-25 14:58:29 +0000] [18] [WARNING] Invalid HTTP request received.
Traceback (most recent call last):
File "httptools/parser/parser.pyx", line 245, in httptools.parser.parser.cb_on_url
File "/opt/conda/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 216, in on_url
parsed_url = httptools.parse_url(url)
File "httptools/parser/parser.pyx", line 468, in httptools.parser.parser.parse_url
httptools.parser.errors.HttpParserInvalidURLError: invalid url b'irus+is+now+being+reported+more+from+rural+Karnataka+with+often+a+weak+health+infrastructure.%0D%0A%0D%0ADr+Vishal+Rao+of+the+HCG+hospitals+and+a+member+of+the+Karnataka+Covid+task+force+said%2C+%22It+is+going+to+be+an+uphill+task+as+we+move+towards+the+districts+as+the+health+care+systems+get+overburdened+there.+Even+the+oxygen+management.+In+cities%2C+we+have+the+privilege+that+oxygen+comes+to+the+doorstep+of+the+hospital.+Whereas+in+villages+and+districts%2C+hospitals+have+to+carry+their+cylinders+to+refill+them.+Public+health+experts+and+virologists+are+repeatedly+trying+to+enhance+the+surveillance+in+villages+to+ensure+we+are+better+prepared+in+villages.+This+is+the+time+to+ramp+up+the+preparation+for+villages.%22%0D%0A%0D%0AHe+also+said+that+the+lockdown+%22definitely+had+a+very+significant+impact%22+on+the+daily+infections.+%22From+50%2C000+cases+everyday%2C+today+we+are+at+around+20%2C000+odd+cases.It+is+not+a+reassurance+that+once+the+lockdown+is+lifted%2C+we+will+continue+to+have+these+low+numbers.+But+what+is+of+concern+is+that+the+positivity+rate+still+sticks+at+around+20+per+cent+and+the+mortality+has+jumped+to+about+2+per+cent.+We+need+to+understand+that+when+the+waves+flatten%2C+it+is+not+that+the+virus+is+taking+rest.+It+is+a+socio-economic+virus+and+the+more+we+improve+interactions+without+safety%2C+we+are+going+to+explode+and+expand+the+spread+of+this+virus.+At+an+ind'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 167, in data_received
self.parser.feed_data(data)
File "httptools/parser/parser.pyx", line 193, in httptools.parser.parser.HttpParser.feed_data
httptools.parser.errors.HttpParserCallbackError: the on_url callback failed
Full request URL path + query params:
/translate?beam_size=2&source_lang=en&target_lang=de&text=Bengaluru%3A+Karnataka+-+one+of+the+worst+hit+states+in+the+country+in+the+second+wave+of+COVID-19%2C+has+been+witnessing+a+slump+in+the+new+case+numbers+over+the+last+few+weeks.+The+authorities%2C+however%2C+are+of+the+view+that+It+is+far+too+early+to+relax.%0D%0AThe+number+of+COVID-19+outnumbered+fresh+infections+in+Karnataka+yet+again+on+Tuesday%2C+as+the+state+reported+38%2C224+discharges+and+22%2C758+new+cases.+Of+the+new+cases+reported+today%2C+6%2C243+were+from+Bengaluru.%0D%0A%0D%0A%22If+you+look+at+the+numbers%2C+it+has+been+reducing+very+drastically.+Except+for+a+few+districts+where+the+numbers+are+not+coming+down.+In+most+of+the+districts+and+Bengaluru%2C+the+numbers+have+come+down.+The+number+should+come+down+drastically+so+that+we+can+unlock+from+the+lockdown%2C%22+Deputy+Chief+Minister+Dr+Ashwath+Narayan+told+NDTV.%0D%0A%0D%0AThe+state+is+in+the+middle+of+a+strict+shutdown.+But+that+doesn%27t+mean+any+major+reduction+in+the+demand+for+oxygen+in+Bengaluru+as+ICU+beds+still+remain+full.%0D%0A%0D%0A%22Since+the+number+has+come+down+very+drastically+-+now+it+is+5%2C000+odd+cases+in+Bengaluru+%28daily+infections%29+-+from+when+it+had+almost+reached+25%2C000%2C+it+is+a+great+relief.+When+it+comes+to+ICU+or+ventilator%2C+however%2C+there+is+still+a+lot+of+demand%2C%22+he+said.%0D%0A%0D%0AAn+extra+concern+is+that+the+virus+is+now+being+reported+more+from+rural+Karnataka+with+often+a+weak+health+infrastructure.%0D%0A%0D%0ADr+Vishal+Rao+of+the+HCG+hospitals+and+a+member+of+the+Karnataka+Covid+task+force+said%2C+%22It+is+going+to+be+an+uphill+task+as+we+move+towards+the+districts+as+the+health+care+systems+get+overburdened+there.+Even+the+oxygen+management.+In+cities%2C+we+have+the+privilege+that+oxygen+comes+to+the+doorstep+of+the+hospital.+Whereas+in+villages+and+districts%2C+hospitals+have+to+carry+their+cylinders+to+refill+them.+Public+health+experts+and+virologists+are+repeatedly+trying+to+enhance+the
+surveillance+in+villages+to+ensure+we+are+better+prepared+in+villages.+This+is+the+time+to+ramp+up+the+preparation+for+villages.%22%0D%0A%0D%0AHe+also+said+that+the+lockdown+%22definitely+had+a+very+significant+impact%22+on+the+daily+infections.+%22From+50%2C000+cases+everyday%2C+today+we+are+at+around+20%2C000+odd+cases.It+is+not+a+reassurance+that+once+the+lockdown+is+lifted%2C+we+will+continue+to+have+these+low+numbers.+But+what+is+of+concern+is+that+the+positivity+rate+still+sticks+at+around+20+per+cent+and+the+mortality+has+jumped+to+about+2+per+cent.+We+need+to+understand+that+when+the+waves+flatten%2C+it+is+not+that+the+virus+is+taking+rest.+It+is+a+socio-economic+virus+and+the+more+we+improve+interactions+without+safety%2C+we+are+going+to+explode+and+expand+the+spread+of+this+virus.+At+an+individual+level+we+need+to+take+up+responsibility+to+curb+the+spread+of+this+virus%2C%22+said+Dr+Rao%0D%0A%0D%0APromotedListen+to+the+latest+songs%2C+only+on+JioSaavn.com%0D%0A%0D%0AWhen+it+comes+to+Bengaluru%2C+Dr+Rao+said+the+lockdown+had+reduced+the+number+of+emergency+oxygen+requirements+and+the+panic.+%22That+is+because+the+virus+has+stopped+moving+because+we+have+stopped+moving%2C%22+he+said.+%22Generally%2C+as+a+rule%2C+a+health+care+system+will+not+be+able+to+cope+with+a+sudden+rise+in+numbers%2C+emergency+oxygen+requirements+or+health+care.+The+other+big+concern+is+trained+manpower.%22%0D%0A%0D%0AComments%0D%0AMucormycosis%2C+commonly+known+as+Black+Fungus%2C+is+also+on+the+rise+in+the+state.+Dr+Rao+said%3A+%22At+HCG+we+are+treating+30+cases+and+the+number+is+on+the+rise.+In+Karnataka%2C+currently%2C+it+must+be+about+700+cases.+It+looks+like+an+epidemic+within+a+pandemic+at+this+juncture.+We+need+to+understand+the+source+of+this+infection%2C+have+early+detection+and+treatment.+A+committee+will+give+a+clear+strategy+for+the+state.+We+don%27t+need+to+scare+people+about+black+fungus%2C+we+need+to+create+awareness.+What+we+have+seen+in+the+patients+-+they+have+all+be
en+Covid+positive%2C+most+have+been+given+steroids%2C+majority+had+high+sugar.+30+to+40+per+cent+had+been+given+oxygen+and+most+important+-+none+of+them+had+been+vaccinated.%22%0D%0A%0D%0ADelhi+received+144.8+mm+rainfall+in+May+this+year%2C+the+highest+for+the+month+in+13+years%2C+according+to+the+India+Meteorological+Department+%28IMD%29.%0D%0A%22No+rain+is+predicted+in+the+next+four+to+five+days.+So%2C+this+is+the+highest+rainfall+in+May+since+2008%2C%22+Kuldeep+Srivastava%2C+the+head+of+the+IMD%27s+regional+forecasting+centre%2C+said+today.%0D%0A%0D%0AThe+Safdarjung+Observatory%2C+considered+the+official+marker+for+the+city%2C+had+recorded+21.1+mm+rainfall+last+year%2C+26.9+mm+in+2019+and+24.2+mm+in+2018.%0D%0A%0D%0AIt+had+gauged+40.5+mm+precipitation+in+2017%3B+24.3+mm+in+2016%3B+3.1+mm+in+2015+and+100.2+mm+in+2014%2C+according+to+IMD+data.%0D%0A%0D%0A
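The logs above suggest the whole article was sent in the URL query string, which httptools rejects once it exceeds its URL length limit. A possible workaround (a sketch only, not tested against the actual EasyNMT frontend; the endpoint and field names are assumptions mirroring the query parameters in the logs) is to send the document in a POST JSON body instead:

```python
import json
import urllib.request

def build_translate_payload(text, source_lang, target_lang, beam_size=2):
    # Put the document in the request body instead of the URL query
    # string, so server-side URL length limits no longer apply.
    # Field names mirror the query parameters in the logs above, but
    # verify them against the EasyNMT docker frontend's documentation.
    return {"text": text, "source_lang": source_lang,
            "target_lang": target_lang, "beam_size": beam_size}

def post_translate(url, payload):
    # Hypothetical helper: POST the JSON payload to the /translate endpoint.
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With this, only a short URL reaches the parser and the article text travels in the body.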
Hi @nreimers
Will the translate method split or tokenize the input text into sentences?
model.translate(text, source_lang='es', target_lang='en')
I'm working with long sentences or documents that contain no commas or special characters, because I cleaned the text in previous steps. So I am sending sentences like the example below, but much longer:
"amor edificio casa perro" --> expected output ---> "love building house dog"
As you can notice, there are no commas or dots. Does your model need these characters to translate in chunks?
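EasyNMT generally splits long inputs into sentences before translation, so text with no sentence-ending punctuation may be treated as one very long sentence and exceed the model's length limit. As a defensive measure (my own sketch under that assumption, not EasyNMT's actual implementation), such text could be pre-chunked by word count before calling translate:

```python
def chunk_by_words(text, max_words=100):
    # Split punctuation-free text into fixed-size word chunks so each
    # chunk stays well under the model's token limit. The 100-word
    # default is an arbitrary assumption, not an EasyNMT constant.
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]
```

Each chunk would then be translated separately and the results joined back together.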
Hello, I got the following error when using PyTorch with a GPU (the code works fine using only the CPU):
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`
when I run a simple basic example such as
from easynmt import EasyNMT
model = EasyNMT('m2m_100_1.2B')
print(model.translate("Bonjour", target_lang='en'))
I have cuda 11.0 installed and a GPU 6Gb GeForce GTX 1660.
Has anyone encountered the same issue?
I have been unable to download opus-mt.
Every time I run model = EasyNMT('opus-mt'),
after downloading about 20%, the download speed drops from 2 Mbps to 1 kbps.
I found a similar project called ktrain that supports this,
located at
https://github.com/amaiya/ktrain/blob/5c9c6b333115be44433639c4bc4c091bd79ab65c/ktrain/text/translation/core.py
Having some accuracy measurements in the output to summarize the results would be even more interesting.
Could multilingual sentence embeddings be of some help here?
First -- thank you for the docker container. I could not get the Dockerfile to build. I am impressed with my results so far.
Second -- during startup, I get the following :
Booted as backend: True
Load model: opus-mt
[2021-03-30 00:17:10 +0000] [2255] [INFO] Started server process [2255]
[2021-03-30 00:17:10 +0000] [2255] [INFO] Waiting for application startup.
[2021-03-30 00:17:10 +0000] [2255] [INFO] Application startup complete.
/usr/local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py:3216: FutureWarning: `prepare_seq2seq_batch` is deprecated and will be removed in version 5 of 🤗 Transformers. Use the regular `__call__` method to prepare your inputs and the tokenizer under the `with_target_tokenizer` context manager to prepare your targets. See the documentation of your specific tokenizer for more details
warnings.warn(
Third -- some translations are humorous (using opus-mt). Consider sv to en (Swedish to English) of sitt, which translates to:
This medicine is not recommended for use in patients with severe hepatic impairment.
or stackers, which translates to
poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor.
(You can confirm these by using the $10 computer translation window and swedish to english).
The translation of än goes on for quite a few lines. Or you can try vågar.
Fourth -- I'm running on some pretty good iron and the performance seems a little slow. Are there some tuning issues that I can address? I have enough cpu and memory.
Fifth -- A sincere thanks for putting this together. I was using googletrans and it was giving me fits.
And you might care about what I'm using easyNMT for....I'm doing prep work for a student taking classes in a foreign language. I am pulling the English closed captions out of the recording of a broadcast TV program, translating them to Swedish, and then putting them back into the program before we watch it. I also build a new word list from the Swedish translations, eliminate words the student already knows, and create a flat file of the Swedish word and English meaning that can be input to a flash card program. This is working much better since I have started using easyNMT. Many, many thanks.
In my case, some 'zh' strings are detected as 'wuu', 'yue', 'eo', and so on; actually, they are all 'zh' strings.
First, thanks for this. It looks promising.
Can it be used to train your own model using your own training dataset?
Thanks,
Thanks again for this wonderful project.
We have a use case as follows. Is it possible to achieve this with EasyNMT? We haven't found any way to cancel ongoing translation processes yet.
For Case 3, is it possible to cancel the ongoing translation process + HTTP request in order to reclaim processor resources for the next request immediately?
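As far as I know, EasyNMT does not expose a cancellation API, so one hedged workaround is to run each translation in a child process the server can terminate. Everything below (the worker body, the timeout handling) is an illustrative sketch, not EasyNMT code; in a real worker you would call model.translate instead of the stand-in.

```python
import multiprocessing as mp

def _translate_worker(q, text):
    # Stand-in for model.translate(...); a real worker would load the
    # model once and put the translated text on the queue.
    q.put(text.upper())

def translate_with_cancel(text, timeout=None):
    # Run the translation in a separate process so the caller can kill
    # it and reclaim CPU/GPU resources for the next request.
    q = mp.Queue()
    p = mp.Process(target=_translate_worker, args=(q, text))
    p.start()
    p.join(timeout)
    if p.is_alive():
        p.terminate()  # cancel the ongoing translation
        p.join()
        return None
    return q.get()
```

The HTTP layer would call translate_with_cancel with a timeout, or terminate the process explicitly when the client disconnects. Note that killing a process mid-inference forfeits any partial output.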
Hi,
Thank you for the great library; it is very useful. I tried to install it on my machine and it ended in an error. Can you please let me know if I am missing anything?
ERROR: Command errored out with exit status 1:
command: 'C:\Program Files\Anaconda\python.exe' 'C:\Program Files\Anaconda\lib\site-packages\pip\_vendor\pep517\_in_process.py' get_requires_for_build_wheel 'C:\Users\Public\Documents\Wondershare\CreatorTemp\tmp90q37bss'
cwd: C:\Users\Public\Documents\Wondershare\CreatorTemp\pip-install-zxfcubdl\fairseq
Complete output (31 lines):
Traceback (most recent call last):
File "setup.py", line 214, in <module>
do_setup(package_data)
File "setup.py", line 136, in do_setup
setup(
File "C:\Users\Public\Documents\Wondershare\CreatorTemp\pip-build-env-hxrh3fi7\overlay\Lib\site-packages\setuptools\__init__.py", line 152, in setup
install_setup_requires(attrs)
File "C:\Users\Public\Documents\Wondershare\CreatorTemp\pip-build-env-hxrh3fi7\overlay\Lib\site-packages\setuptools\__init__.py", line 147, in _install_setup_requires
dist.fetch_build_eggs(dist.setup_requires)
File "C:\Users\Public\Documents\Wondershare\CreatorTemp\pip-build-env-hxrh3fi7\overlay\Lib\site-packages\setuptools\build_meta.py", line 60, in fetch_build_eggs
raise SetupRequirementsError(specifier_list)
setuptools.build_meta.SetupRequirementsError: ['cython', 'numpy', 'setuptools>=18.0']
During handling of the above exception, another exception occurred:
ERROR: Command errored out with exit status 1: 'C:\Program Files\Anaconda\python.exe' 'C:\Program Files\Anaconda\lib\site-packages\pip\_vendor\pep517\_in_process.py' get_requires_for_build_wheel 'C:\Users\Public\Documents\Wondershare\CreatorTemp\tmp90q37bss' Check the logs for full command output.
I ran into a bug where a model keeps getting loaded, even after being used right before.
I set max_loaded_models=1
I then did a call to translate a sentence from english to dutch, loading the EN-NL model
Then I did a translate call from german to dutch, unloading the EN-NL model and loading DE-NL.
However, when I repeat the german to dutch call, the model gets loaded again.
Printing the translator's self.models shows the following:
{'Helsinki-NLP/opus-mt-EN-NL': {'tokenizer': PreTrainedTokenizer(name_or_path='Helsinki-NLP/opus-mt-DE-NL'
In other words, it loads the German model under the English name.
I found the problem in the source code:
In load_model, the line for model_name in self.models:
overwrites the value of the model_name argument, which should indicate the model you want to load. Therefore, it stores the new model under the name of the last model iterated over in self.models.
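A minimal reproduction of the fix (my own simplified sketch, not the actual EasyNMT source; class and attribute names are illustrative): use a loop variable distinct from the model_name argument when evicting cached models, so the new model is stored under the correct key.

```python
class ModelCache:
    # Simplified stand-in for EasyNMT's model cache.
    def __init__(self, max_loaded_models=1):
        self.models = {}
        self.max_loaded_models = max_loaded_models

    def load_model(self, model_name):
        if model_name in self.models:
            return self.models[model_name]
        # Use a distinct variable ("oldest") instead of reusing
        # model_name as the loop variable, which is what caused the
        # bug described above (the argument was being overwritten).
        while len(self.models) >= self.max_loaded_models:
            oldest = next(iter(self.models))
            del self.models[oldest]
        self.models[model_name] = f"loaded:{model_name}"
        return self.models[model_name]
```

With this, repeating the German-to-Dutch call hits the cache instead of reloading the model.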
I want to ask if there is any way to rephrase a text using this library while keeping the same language.
How many characters is 512 word pieces?
language_detection by default uses fasttext's lid.176.ftz,
which has some confusion between upper-case English and Japanese.
The following is an example:
model = EasyNMT("opus-mt")
text_list = ["WHAT SCHOOL DOES HA'SEAN CLINTON-DIX BELONG TO?",
'WHAT IS THE LOWEST ROUND FOR CB POSITION?',
'IN THE ISSUE WHERE NICOLE NARAIN WAS THE CENTERFOLD MODEL, WHO WAS THE INTERVIEW SUBJECT?',
"WHAT WAS ESSENDON'S HIGHEST CROWD NUMBER WHEN PLAYING AS THE AWAY TEAM?",
'WHAT ARE THE HIGHEST POINTS WITH GOALS LARGER THAN 48, WINS LESS THAN 19, GOALS AGAINST SMALLER THAN 44, DRAWS LARGER THAN 5?',
'WHAT IS THE STROKE COUNT WITH RADICAL OF 生, FRQUENCY SMALLER THAN 22?',
'HOW MANY TEMPERATURE INTERVALS ARE POSSIBLE TO USE WITH ACRYL? ',
'WHAT ARE THE GOALS WITH DRAWS SMALLER THAN 6, AND LOSSES SMALLER THAN 7?',
'Name the least number for 君のそばで~ヒカリのテーマ~(popup.version)',
'WHAT WAS THE SURFACE MADE OF WHEN THE OPPONENT WAS GEORGE KHRIKADZE?',
'Name the english name for 福鼎市',
'WHAT IS THE LOWEST MONEY WITH 74-74-76-74=298 SCORE?',
'WHAT IS THE HIGHEST PICK WITH A TE POSITION, AND ROUND SMALLER THAN 2?',
'WHAT IS THE LOWEST MONEY WITH TO PAR LARGER THAN 26?']
pd.Series(text_list).map(model.language_detection).value_counts(),
pd.Series(text_list).map(lambda x: x.lower()).map(model.language_detection).value_counts()
The first series detects everything as ja; the second distinguishes en from ja.
Maybe add some pre-processing of the input, or change the default detector.
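One pre-processing step along these lines (my own sketch; how much it helps depends on the text) is to lowercase input before handing it to the detector, since the misclassification above disappears once the ALL-CAPS text is lowercased:

```python
def normalize_for_lid(text):
    # ALL-CAPS English can be misclassified by lid.176.ftz, so
    # lowercase before detection. Would be used as
    # model.language_detection(normalize_for_lid(text)) -- that
    # wrapping is an assumption about the call site, not EasyNMT API.
    return text.lower().strip()
```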