ukplab / easynmt
Easy to use, state-of-the-art Neural Machine Translation for 100+ languages
License: Apache License 2.0
Hi there,
thanks for providing this great package!
I have encountered an issue with the translate_stream method of the EasyNMT class: if the number of input texts (n_texts) is not divisible by chunk_size (i.e. there is a remainder), translate_stream only yields the first (n_texts // chunk_size) * chunk_size translations.
A minimal working example can be found here: https://colab.research.google.com/drive/1coVyCXc8jnPdHVcFHaVEsTfv1KfhGkvm?usp=sharing
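The truncation is easy to reproduce without any models. A minimal sketch (my own reconstruction, not EasyNMT's actual code) of a chunked generator that forgets to flush the final partial chunk:

```python
def chunked_stream(texts, chunk_size):
    # Hypothetical reproduction of the bug: yields translations chunk by
    # chunk, but never flushes the final partial chunk.
    batch = []
    for t in texts:
        batch.append(t)
        if len(batch) == chunk_size:
            yield from batch
            batch = []
    # The fix would be to also `yield from batch` here for the remainder.

texts = [str(i) for i in range(10)]
out = list(chunked_stream(texts, chunk_size=4))
# Only (10 // 4) * 4 == 8 of the 10 inputs come back.
```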
Hi, I want to load the 300MB translation model in a different directory, let's say in my program folder instead of in the python/huggingface/pytorch folder. Is there any way I can do this? Thanks
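One way to do this (a sketch, assuming the model files come from the Hugging Face Hub): redirect the transformers cache before loading, e.g. via the TRANSFORMERS_CACHE environment variable; recent EasyNMT versions also expose a cache_folder constructor argument worth checking in your installed version.

```python
import os

# Redirect Hugging Face's model cache to a folder inside the program
# directory. This must be set BEFORE importing easynmt/transformers.
os.environ["TRANSFORMERS_CACHE"] = os.path.join(os.getcwd(), "models")

# EasyNMT itself also accepts a cache folder (check your version's
# constructor signature before relying on this):
# from easynmt import EasyNMT
# model = EasyNMT("opus-mt", cache_folder=os.environ["TRANSFORMERS_CACHE"])
```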
Hi, and thanks for the cool library!
I want to include the translation function in one of my data pipelines, which loops over thousands of text snippets. Without GPU support and on Windows, I followed the instructions in the other issue and successfully added the function.
from easynmt import EasyNMT
model = EasyNMT('opus-mt')
and I translate with:
from langdetect import detect_langs  # this import was missing above
language = detect_langs(text)
for each_lang in language:
    if each_lang.lang != "en":
        translated_text = model.translate(text, target_lang='en')
where text is a string.
However, after a few translations (2-3) I always run into this error:
OSError: Can't load tokenizer for 'Helsinki-NLP/opus-mt-ia-en'. Make sure that:
- 'Helsinki-NLP/opus-mt-ia-en' is a correct model identifier listed on 'https://huggingface.co/models'
Any idea what the problem could be?
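One possible explanation: langdetect occasionally reports rare codes such as 'ia' (Interlingua), and no Helsinki-NLP/opus-mt-ia-en model exists on the Hub, so the load fails mid-pipeline. A defensive wrapper (my own sketch; `model` is assumed to be an EasyNMT('opus-mt') instance) keeps the loop alive:

```python
def safe_translate(model, text, source_lang=None, target_lang="en"):
    # Fall back to the original text when no model exists for the
    # detected language pair (or any other translation failure).
    try:
        return model.translate(text, source_lang=source_lang,
                               target_lang=target_lang)
    except Exception:
        return text
```

Alternatively, passing an explicit source_lang for snippets you know are in a handled language avoids the detection step entirely.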
Hi,
I was trying to translate 19,203 sentences from German to English using the translate_stream method explained in the following link:
https://github.com/UKPLab/EasyNMT/blob/main/examples/translation_streaming.py
I set the chunk size to 32. After successfully translating 3 chunks and writing the output to file, it raised a model-loading error. Can you guide me with this issue? I am pasting the error below.
0%|▌ | 96/19203 [00:54<3:00:03, 1.77it/s]
Exception: Can't load tokenizer for 'Helsinki-NLP/opus-mt-nds-en'. Make sure that:
- 'Helsinki-NLP/opus-mt-nds-en' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'Helsinki-NLP/opus-mt-nds-en' is the correct path to a directory containing relevant tokenizer files
1%|▋ | 127/19203 [01:06<2:46:24, 1.91it/s]
Traceback (most recent call last):
File "translate.py", line 12, in <module>
for translation in model.translate_stream(sentences, chunk_size=32, target_lang='en'):
File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/EasyNMT.py", line 297, in translate_stream
translated = self.translate(batch, show_progress_bar=False, **kwargs)
File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/EasyNMT.py", line 124, in translate
translated_sentences = self.translate_sentences(splitted_sentences, target_lang=target_lang, source_lang=source_lang, show_progress_bar=show_progress_bar, beam_size=beam_size, batch_size=batch_size, **kwargs)
File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/EasyNMT.py", line 210, in translate_sentences
raise e
File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/EasyNMT.py", line 205, in translate_sentences
translated = self.translate_sentences(grouped_sentences, source_lang=lng, target_lang=target_lang, show_progress_bar=show_progress_bar, beam_size=beam_size, batch_size=batch_size, **kwargs)
File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/EasyNMT.py", line 222, in translate_sentences
output.extend(self.translator.translate_sentences(sentences_sorted[start_idx:start_idx+batch_size], source_lang=source_lang, target_lang=target_lang, beam_size=beam_size, device=self.device, **kwargs))
File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/models/OpusMT.py", line 46, in translate_sentences
tokenizer, model = self.load_model(model_name)
File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/easynmt/models/OpusMT.py", line 28, in load_model
tokenizer = MarianTokenizer.from_pretrained(model_name)
File "/home/ulhaqi12/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1777, in from_pretrained
raise EnvironmentError(msg)
OSError: Can't load tokenizer for 'Helsinki-NLP/opus-mt-nds-en'. Make sure that:
- 'Helsinki-NLP/opus-mt-nds-en' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'Helsinki-NLP/opus-mt-nds-en' is the correct path to a directory containing relevant tokenizer files
Hey team,
Thank you again for the great library.
Today we translated 'id' (Indonesian) sentences and quite a few of them came out as variants of "I'm sorry I'm sorry I'm sorry I'm sorry I'm sorry I'm sorry" even though they did not mention 'sorry' in the text.
Any idea why this could be please? Could it be because I'm not performing sentence splitting prior to translation whilst using the 'translate_sentences' function?
Thanks!
I want to translate from English to Yoruba, but I am getting the OSError shown below.
Exception: Can't load tokenizer for 'Helsinki-NLP/opus-mt-en-yo'. Make sure that:
- 'Helsinki-NLP/opus-mt-en-yo' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'Helsinki-NLP/opus-mt-en-yo' is the correct path to a directory containing relevant tokenizer files
Traceback (most recent call last):
File "translatefile.py", line 19, in <module>
translations = model.translate(sentences, target_lang='yo')
File "/home/alabi/anaconda3/lib/python3.7/site-packages/easynmt/EasyNMT.py", line 122, in translate
translated_sentences = self.translate_sentences(splitted_sentences, target_lang=target_lang, source_lang=source_lang, show_progress_bar=show_progress_bar, beam_size=beam_size, batch_size=batch_size, **kwargs)
File "/home/alabi/anaconda3/lib/python3.7/site-packages/easynmt/EasyNMT.py", line 208, in translate_sentences
raise e
File "/home/alabi/anaconda3/lib/python3.7/site-packages/easynmt/EasyNMT.py", line 203, in translate_sentences
translated = self.translate_sentences(grouped_sentences, source_lang=lng, target_lang=target_lang, show_progress_bar=show_progress_bar, beam_size=beam_size, batch_size=batch_size, **kwargs)
File "/home/alabi/anaconda3/lib/python3.7/site-packages/easynmt/EasyNMT.py", line 220, in translate_sentences
output.extend(self.translator.translate_sentences(sentences_sorted[start_idx:start_idx+batch_size], source_lang=source_lang, target_lang=target_lang, beam_size=beam_size, device=self.device, **kwargs))
File "/home/alabi/anaconda3/lib/python3.7/site-packages/easynmt/models/OpusMT.py", line 46, in translate_sentences
tokenizer, model = self.load_model(model_name)
File "/home/alabi/anaconda3/lib/python3.7/site-packages/easynmt/models/OpusMT.py", line 28, in load_model
tokenizer = MarianTokenizer.from_pretrained(model_name)
File "/home/alabi/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1760, in from_pretrained
raise EnvironmentError(msg)
OSError: Can't load tokenizer for 'Helsinki-NLP/opus-mt-en-yo'. Make sure that:
- 'Helsinki-NLP/opus-mt-en-yo' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'Helsinki-NLP/opus-mt-en-yo' is the correct path to a directory containing relevant tokenizer files
Translating from Yoruba to English worked well. What am I doing wrong?
Hi, I like your package!
I wanted to propose sorting sentences by length in descending order before translation, so that the largest batches are translated first and a memory error surfaces at the start rather than suddenly at the end of the translation.
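The idea sketched out (my own illustration, not the library's code): translate in descending length order so the most memory-hungry batch runs first, then restore the original order afterwards.

```python
sentences = [
    "a short one",
    "a considerably longer sentence than all of the others in this list",
    "a mid-length sentence here",
]

# Sort indices by sentence length, longest first, so an out-of-memory
# error would occur in the very first batch instead of at the end.
order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]), reverse=True)
sorted_desc = [sentences[i] for i in order]

# After translating `sorted_desc`, results go back to their original
# positions via the saved indices.
def restore(translated, order):
    out = [None] * len(translated)
    for pos, idx in enumerate(order):
        out[idx] = translated[pos]
    return out
```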
Just got a new CentOS 7 server with a GeForce RTX 2080 GPU...
Using the API the translation was very slow so I tried the following:
Correct me if I am wrong, but it seems like my easynmt instance doesn't use the GPU. Did I do something wrong?
Thank you for your help...
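A quick way to check (assumes PyTorch is installed in the same environment): if CUDA isn't visible to PyTorch, EasyNMT can only run on CPU.

```python
import torch

# False here means the environment cannot see the GPU at all
# (driver/CUDA mismatch), regardless of EasyNMT's own settings.
print(torch.cuda.is_available())

# EasyNMT also lets you request a device explicitly (check your
# version's constructor signature):
# from easynmt import EasyNMT
# model = EasyNMT('opus-mt', device='cuda')
```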
Running this on a 2017 MacBook. The Docker image easynmt/api:2.0-cpu fails to start with exceptions, while easynmt/api:1.1-cpu was running fine with the same docker run command previously.
docker run -p 24081:80 -v /Users/agrim/Downloads/easynmt-models:/cache easynmt/api:2.0-cpu
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/local/lib/python3.8/site-packages/easynmt/EasyNMT.py", line 92, in __init__
self.translator = module_class(easynmt_path=model_path, **self.config['model_args'])
KeyError: 'model_args'
Checking for script in /app/prestart.sh
There is no script /app/prestart.sh
[2021-04-27 14:38:22 +0000] [13] [INFO] Starting gunicorn 20.1.0
[2021-04-27 14:38:22 +0000] [12] [INFO] Starting gunicorn 20.1.0
[2021-04-27 14:38:22 +0000] [12] [INFO] Listening at: http://0.0.0.0:8080 (12)
[2021-04-27 14:38:22 +0000] [13] [INFO] Listening at: http://0.0.0.0:80 (13)
[2021-04-27 14:38:22 +0000] [13] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2021-04-27 14:38:22 +0000] [12] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2021-04-27 14:38:22 +0000] [17] [INFO] Booting worker with pid: 17
[2021-04-27 14:38:22 +0000] [18] [INFO] Booting worker with pid: 18
[2021-04-27 14:38:22 +0000] [19] [INFO] Booting worker with pid: 19
[2021-04-27 14:38:24 +0000] [19] [INFO] Started server process [19]
[2021-04-27 14:38:24 +0000] [17] [INFO] Started server process [17]
[2021-04-27 14:38:24 +0000] [17] [INFO] Waiting for application startup.
[2021-04-27 14:38:24 +0000] [19] [INFO] Waiting for application startup.
[2021-04-27 14:38:24 +0000] [17] [INFO] Application startup complete.
[2021-04-27 14:38:24 +0000] [19] [INFO] Application startup complete.
{"loglevel": "info", "workers": "1", "bind": "0.0.0.0:8080", "graceful_timeout": 120, "timeout": 120, "keepalive": 5, "errorlog": "-", "accesslog": "-", "host": "0.0.0.0", "port": "8080"}
Booted as backend: True
Load model: opus-mt
[2021-04-27 14:38:25 +0000] [18] [ERROR] Exception in worker process
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/gunicorn/arbiter.py", line 589, in spawn_worker
worker.init_process()
File "/usr/local/lib/python3.8/site-packages/uvicorn/workers.py", line 63, in init_process
super(UvicornWorker, self).init_process()
File "/usr/local/lib/python3.8/site-packages/gunicorn/workers/base.py", line 134, in init_process
self.load_wsgi()
File "/usr/local/lib/python3.8/site-packages/gunicorn/workers/base.py", line 146, in load_wsgi
self.wsgi = self.app.wsgi()
File "/usr/local/lib/python3.8/site-packages/gunicorn/app/base.py", line 67, in wsgi
self.callable = self.load()
File "/usr/local/lib/python3.8/site-packages/gunicorn/app/wsgiapp.py", line 58, in load
return self.load_wsgiapp()
File "/usr/local/lib/python3.8/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
return util.import_app(self.app_uri)
File "/usr/local/lib/python3.8/site-packages/gunicorn/util.py", line 359, in import_app
mod = importlib.import_module(module)
File "/usr/local/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 783, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/app/main.py", line 36, in <module>
model = EasyNMT(model_name, load_translator=IS_BACKEND, **model_args)
File "/usr/local/lib/python3.8/site-packages/easynmt/EasyNMT.py", line 92, in __init__
self.translator = module_class(easynmt_path=model_path, **self.config['model_args'])
KeyError: 'model_args'
[2021-04-27 14:38:25 +0000] [18] [INFO] Worker exiting (pid: 18)
[2021-04-27 14:38:25 +0000] [12] [INFO] Shutting down: Master
[2021-04-27 14:38:25 +0000] [12] [INFO] Reason: Worker failed to boot.
{"loglevel": "info", "workers": "1", "bind": "0.0.0.0:8080", "graceful_timeout": 120, "timeout": 120, "keepalive": 5, "errorlog": "-", "accesslog": "-", "host": "0.0.0.0", "port": "8080"}
One of the processes has already exited.
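The KeyError suggests the 2.0 image is reading an easynmt.json left in the mounted /cache volume by the 1.1 image, which predates the model_args key. Clearing the cache volume so 2.0 re-downloads its config is the simplest fix; defensively, the lookup could also tolerate old configs (sketch, not the library's actual code):

```python
# A 1.x-style cached config has no 'model_args' key, so
# config['model_args'] raises KeyError in the 2.0 loader.
config = {"model_class": "easynmt.models.OpusMT"}  # old-style config

model_args = config.get("model_args", {})  # {} instead of a crash
```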
Hello. Where are the downloaded models stored on a Mac? (I am talking about the large, ~300 MB models.)
Hello, congrats on the initiative!
I've been using Helsinki-NLP models previously; the most commonly used models for Portuguese are 'opus-mt-en-ROMANCE' and 'opus-mt-ROMANCE-en'. So model.translate(sample, source_lang='pt', target_lang='en') won't work, but, as I've tested, model.translate(sample, source_lang='ROMANCE', target_lang='en') works.
So it would be nice to have some alias in the code for ROMANCE. :)
The error occurs because no model for that exact language pair is available in the repo.
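The proposed alias could be as small as a lookup table mapping individual Romance language codes to the 'ROMANCE' group code the multilingual Helsinki-NLP models use (the set below is illustrative, not the models' full language list):

```python
ROMANCE_ALIASES = {"pt", "es", "fr", "it", "ro", "ca", "gl"}  # illustrative subset

def resolve_opus_lang(code: str) -> str:
    # Map e.g. 'pt' to the 'ROMANCE' group; pass other codes through.
    return "ROMANCE" if code in ROMANCE_ALIASES else code

# model.translate(sample, source_lang=resolve_opus_lang('pt'), target_lang='en')
```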
First, I need to congratulate the team on your work; Nils is, IMHO, one of the best devs in the NLP community. Sentence-transformers and this NMT translator repo have been very helpful to us at Contents.com.
I use Opus-MT right now and it is great; I noticed that it even keeps HTML tags. But sometimes it makes the mistake of generating ">/strong>", "×/strong>" or "--/strong>" instead of "</strong>" (strong as an example; the same happens for other tags like h3 or li, as far as I know). I solved it by simply replacing: text = text.replace('×/', '</').replace('>/', '</').replace('--/', '</'). It is a very simple thing; I just wanted to let you know.
I wonder if you are aware of a translator model that is good at keeping proper nouns (people, cities, company names, etc.) unchanged, even when they consist of two or more words? I could use NER, but it would decrease speed and, to be honest, I don't find the free NER libraries reliable enough.
Thank you!
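The replace-chain above can also be written as a single regex that only rewrites the corrupted prefixes when they precede a known closing tag name (a slightly safer variant of the same idea; the tag list is just the ones mentioned above):

```python
import re

def fix_broken_close_tags(text: str) -> str:
    # Rewrites '>/strong>', '×/strong>', '--/strong>' (and the same for
    # h3/li) back to proper closing tags like '</strong>'.
    return re.sub(r"(?:>|×|--)/(strong|h3|li)>", r"</\1>", text)
```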
Hello guys, and thank you for your awesome library!
I'm currently struggling to get the test_multi_process_translation.py script working. When it comes to the multi-process part, the following error occurs:
2021-03-04 19:00:46 | INFO | easynmt.EasyNMT | Start multi-process pool on devices: cuda:0
Traceback (most recent call last):
File "test_multi_process_translation.py", line 80, in <module>
process_pool = model.start_multi_process_pool()
File "/data/home/k.kirillova/anaconda/envs/fairseq/lib/python3.7/site-packages/easynmt/EasyNMT.py", line 258, in start_multi_process_pool
p.start()
File "/data/home/k.kirillova/anaconda/envs/fairseq/lib/python3.7/multiprocessing/process.py", line 112, in start
self._popen = self._Popen(self)
File "/data/home/k.kirillova/anaconda/envs/fairseq/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/data/home/k.kirillova/anaconda/envs/fairseq/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/data/home/k.kirillova/anaconda/envs/fairseq/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/data/home/k.kirillova/anaconda/envs/fairseq/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/data/home/k.kirillova/anaconda/envs/fairseq/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'BaseFairseqModel.make_generation_fast_.<locals>.train'
Could you please give me any ideas on how to fix this? Thanks in advance!
Hello
I am running the following code
from easynmt import EasyNMT
model = EasyNMT('opus-mt')
print(model.translate("停", target_lang='en'))
The result of the code is just "停", which is exactly the same as the input. How can I fix this?
Hi, Is there any way to add custom rules for certain words so that they won't get translated in another meaning?
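EasyNMT has no built-in do-not-translate list as far as I know; a common workaround is to mask the protected terms with placeholder tokens before translation and restore them afterwards (sketch; the placeholder format is my own choice and may itself be altered by some models, so test it):

```python
def mask_terms(text, terms):
    # Replace each protected term with a stable placeholder token.
    mapping = {}
    for i, term in enumerate(terms):
        token = f"XTERM{i}X"
        if term in text:
            text = text.replace(term, token)
            mapping[token] = term
    return text, mapping

def unmask(text, mapping):
    # Put the protected terms back after translation.
    for token, term in mapping.items():
        text = text.replace(token, term)
    return text
```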
This is not an issue per se, but rather, a query. Could you please elaborate on how the models provided in EasyNMT have been sized down to their current sizes (300MB for Opus-MT, 1.2 GB for mBART and 2.4 GB for the 1.2bn M2M-100)?
If there is some model pruning or other downsizing techniques involved to reduce the size of the original models to their EasyNMT versions, how does it affect the performance of these models as compared to their original counterparts?
Thank you for your response in advance.
Hi, I found that M2M_100 supports direct translation between any pair of its 100 languages (9,900 pairs). But when I use EasyNMT with the M2M_100 model, it doesn't support all of these pairs.
Example: EasyNMT can't translate directly from 'th' (Thai) to 'en' (English), while the M2M_100 model does support this pair.
And when I tried to use Hugging Face to translate directly between Thai and English, it worked perfectly.
Can you please solve the problem? By the way, thank you for creating EasyNMT.
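For reference, the direct Hugging Face route that worked can be sketched as follows (hedged; it downloads the checkpoint on first call): M2M100 takes the source language on the tokenizer and the target language as a forced BOS token at generation time.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

def translate_th_en(text: str) -> str:
    # Downloads the facebook/m2m100_418M checkpoint on first use.
    tok = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
    mdl = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
    tok.src_lang = "th"                                  # source language
    batch = tok(text, return_tensors="pt")
    out = mdl.generate(**batch,
                       forced_bos_token_id=tok.get_lang_id("en"))  # target
    return tok.batch_decode(out, skip_special_tokens=True)[0]
```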
Hi Team,
I have a question. I am trying to translate a column which has blanks in between. I am using EasyNMT and it is giving an error. Won't it work if there are blanks or missing values between the rows of a column?
Thanks
Srinivas
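A workaround sketch until that's confirmed: filter out empty or missing cells first, translate only real strings, and write the results back by index.

```python
rows = ["Guten Tag", "", None, "Wie geht es dir?"]  # example column data

# Keep (index, text) pairs only for non-empty string cells.
to_translate = [(i, r) for i, r in enumerate(rows)
                if isinstance(r, str) and r.strip()]

# texts = [r for _, r in to_translate]
# translations = model.translate(texts, target_lang='en')
# then write each translation back to its row index, leaving blanks as-is.
```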
EasyNMT uses fastText to identify the language. Some Chinese phrases can be misidentified as Chinese variants, like 'yue' or 'wuu'. This causes EasyNMT to fail. Can you map the Chinese language variants 'yue', 'wuu' and 'min' to 'zh'?
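The requested mapping could look like this (sketch; the variant list is the codes named above plus common locale spellings):

```python
ZH_VARIANTS = {"yue", "wuu", "min", "zh-cn", "zh-tw"}

def normalize_lang(code: str) -> str:
    # Collapse Chinese variant codes onto 'zh' before model selection.
    return "zh" if code.lower() in ZH_VARIANTS else code
```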
Hi, I have found this tool very helpful; excellent effort.
I want to know whether, after translation, there is any way to replace a word with a synonym. Is there a function I can call?
Please guide me if you can.
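EasyNMT itself has no synonym function as far as I know, but a post-processing pass over the translated text can do the swap (minimal sketch with a hypothetical synonym table; a real setup might use WordNet or similar):

```python
SYNONYMS = {"big": "large", "quick": "fast"}  # illustrative mapping

def apply_synonyms(text: str) -> str:
    # Word-by-word replacement on the translated output.
    return " ".join(SYNONYMS.get(word, word) for word in text.split())
```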
OSError: Unable to load weights from pytorch checkpoint file for 'facebook/m2m100_1.2B' at '/cache/transformers/68002fb1a7773d8d8373f1a230588141964ef9f249db6987681f295dbe85356c.ee70663869b89be4f68eed03a21d5c3400b223cb544883f411e469aaea0a25f9'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
I am using Ubuntu 20.04 with 3 GPUs of RTX 2080. This is my docker-compose:
version: '2.3'
services:
  easynmt:
    image: easynmt/api:2.0-cuda11.1
    restart: always
    runtime: nvidia
    volumes:
      - ./cache:/cache
    ports:
      - "24080:80"
    env_file:
      - ./envfile
The envfile:
EASYNMT_MODEL=m2m_100_1.2B
The first problem is that the service always uses only the first GPU in my machine.
The second: one backend worker takes around 7 GB of GPU VRAM, so after a few translation requests CUDA runs out of memory and the code stops working (I think each spawned worker needs another 7 GB of VRAM).
I have my own solution for both: set MAX_WORKERS_BACKEND=1, start 3 easynmt services, each pinned to a specific GPU, and use nginx as a load balancer across the 3 services.
Did I do something wrong, or is there a better solution in this case?
Running Thai text through the pipeline gives the error "ModuleNotFoundError: No module named 'thai_segmenter'".
Does this dependency need to be specified explicitly in setup.py?
Thanks!
With the switch away from Fairseq to Hugging Face transformers in EasyNMT 2.0, there seem to have been substantial changes; e.g. the m2m_100_1.2B model has grown from 2.3 GB to 5 GB.
Do we need new benchmark tests for the m2m/mbart models after this change? In general, will this make translation inference slower or faster than before (by subjective opinion)?
I am trying to install EasyNMT on a semi-offline system. Python libraries are permitted to be installed, but accessing other URLs is not permitted from this system. Can you therefore advise on a way to manually install the Facebook models (m2m_100_418M and m2m_100_1.2B) so that EasyNMT can see them? I can see they can be manually downloaded from huggingface.co. If I were to download them on a second system, where should I then save them on the semi-offline Windows system on which I intend to test EasyNMT? Thanks
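One approach that should work (hedged sketch, assuming the models come from the Hugging Face Hub): on the online machine, download and export the model to a plain folder, copy that folder over, and point the loaders at the local path.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

def export_m2m(name: str = "facebook/m2m100_418M",
               target_dir: str = "m2m100_418M") -> None:
    # Run this on the ONLINE machine; it downloads the checkpoint and
    # writes tokenizer + weights into target_dir, which can then be
    # copied to the offline Windows box and loaded via its local path.
    M2M100Tokenizer.from_pretrained(name).save_pretrained(target_dir)
    M2M100ForConditionalGeneration.from_pretrained(name).save_pretrained(target_dir)
```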
@nreimers
Can an OPUS-MT model be inferenced on an ARM-based device, such as a Raspberry Pi?
Hi,
I'm sorry for this noobish question/issue; maybe it is easy to resolve (I'm not experienced with Docker). I've built a web app which uses EasyNMT in the back end via the Docker images and REST. When translating from Romanian to German, I noticed that the Docker image only uses the opus model, which does not provide this language direction. Executing the "/model_name" request also shows only "opus" for the Docker image.
So how can I get the other models? I have 3 Docker images of easynmt (7.7 GB, 6.02 GB and 3.8 GB), but it seems none of them contains the other models. Am I doing something wrong here?
Also, when they are part of the image, is there some kind of auto-selection if a language is not available in one of the models?
I installed the Docker images via the "build-docker-hub.sh" file.
Best regards,
André
Hi,
I'm using EasyNMT for translating customer reviews. During translation, I got this error
HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/Helsinki-NLP/opus-mt-ro-en
HTTPError Traceback (most recent call last)
in
1 for index, row in df_review['AnswerValue'].iteritems():
----> 2 translated_row = model.translate(row, target_lang='en')#translating each row
3 df_review.loc[index, 'Translate'] = translated_row
~/opt/anaconda3/lib/python3.8/site-packages/easynmt/EasyNMT.py in translate(self, documents, target_lang, source_lang, show_progress_bar, beam_size, batch_size, perform_sentence_splitting, paragraph_split, sentence_splitter, document_language_detection, **kwargs)
152 except Exception as e:
153 logger.warning("Exception: "+str(e))
--> 154 raise e
155
156 if is_single_doc and len(output) == 1:
~/opt/anaconda3/lib/python3.8/site-packages/easynmt/EasyNMT.py in translate(self, documents, target_lang, source_lang, show_progress_bar, beam_size, batch_size, perform_sentence_splitting, paragraph_split, sentence_splitter, document_language_detection, **kwargs)
147 method_args['documents'] = [documents[idx] for idx in ids]
148 method_args['source_lang'] = lng
--> 149 translated = self.translate(**method_args)
150 for idx, translated_sentences in zip(ids, translated):
151 output[idx] = translated_sentences
~/opt/anaconda3/lib/python3.8/site-packages/easynmt/EasyNMT.py in translate(self, documents, target_lang, source_lang, show_progress_bar, beam_size, batch_size, perform_sentence_splitting, paragraph_split, sentence_splitter, document_language_detection, **kwargs)
179 #logger.info("Translate {} sentences".format(len(splitted_sentences)))
180
--> 181 translated_sentences = self.translate_sentences(splitted_sentences, target_lang=target_lang, source_lang=source_lang, show_progress_bar=show_progress_bar, beam_size=beam_size, batch_size=batch_size, **kwargs)
182
183 # Merge sentences back to documents
~/opt/anaconda3/lib/python3.8/site-packages/easynmt/EasyNMT.py in translate_sentences(self, sentences, target_lang, source_lang, show_progress_bar, beam_size, batch_size, **kwargs)
276
277 for start_idx in iterator:
--> 278 output.extend(self.translator.translate_sentences(sentences_sorted[start_idx:start_idx+batch_size], source_lang=source_lang, target_lang=target_lang, beam_size=beam_size, device=self.device, **kwargs))
279
280 #Restore original sorting of sentences
~/opt/anaconda3/lib/python3.8/site-packages/easynmt/models/OpusMT.py in translate_sentences(self, sentences, source_lang, target_lang, device, beam_size, **kwargs)
38 def translate_sentences(self, sentences: List[str], source_lang: str, target_lang: str, device: str, beam_size: int = 5, **kwargs):
39 model_name = 'Helsinki-NLP/opus-mt-{}-{}'.format(source_lang, target_lang)
---> 40 tokenizer, model = self.load_model(model_name)
41 model.to(device)
42
~/opt/anaconda3/lib/python3.8/site-packages/easynmt/models/OpusMT.py in load_model(self, model_name)
20 else:
21 logger.info("Load model: "+model_name)
---> 22 tokenizer = MarianTokenizer.from_pretrained(model_name)
23 model = MarianMTModel.from_pretrained(model_name)
24 model.eval()
~/opt/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
1645 else:
1646 # At this point pretrained_model_name_or_path is either a directory or a model identifier name
-> 1647 fast_tokenizer_file = get_fast_tokenizer_file(
1648 pretrained_model_name_or_path, revision=revision, use_auth_token=use_auth_token
1649 )
~/opt/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in get_fast_tokenizer_file(path_or_repo, revision, use_auth_token)
3406 """
3407 # Inspect all files from the repo/folder.
-> 3408 all_files = get_list_of_files(path_or_repo, revision=revision, use_auth_token=use_auth_token)
3409 tokenizer_files_map = {}
3410 for file_name in all_files:
~/opt/anaconda3/lib/python3.8/site-packages/transformers/file_utils.py in get_list_of_files(path_or_repo, revision, use_auth_token)
1691 else:
1692 token = None
-> 1693 model_info = HfApi(endpoint=HUGGINGFACE_CO_RESOLVE_ENDPOINT).model_info(
1694 path_or_repo, revision=revision, token=token
1695 )
~/opt/anaconda3/lib/python3.8/site-packages/huggingface_hub/hf_api.py in model_info(self, repo_id, revision, token)
246 )
247 r = requests.get(path, headers=headers)
--> 248 r.raise_for_status()
249 d = r.json()
250 return ModelInfo(**d)
~/opt/anaconda3/lib/python3.8/site-packages/requests/models.py in raise_for_status(self)
941
942 if http_error_msg:
--> 943 raise HTTPError(http_error_msg, response=self)
944
945 def close(self):
HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/Helsinki-NLP/opus-mt-ro-en
Could you please review and fix the issue?
Thank you.
Is there a way to set "temperature" or "repetition_penalty" for the Facebook models?
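Judging by the **kwargs chain visible in the tracebacks elsewhere in these issues, extra keyword arguments to translate() are forwarded toward the underlying generate() call, so Hugging Face generation options may get through (unverified; whether a given option is honored depends on the backend model):

```python
# Hugging Face generate() options; note that temperature only takes
# effect when sampling is enabled via do_sample=True.
gen_kwargs = {"repetition_penalty": 1.3, "do_sample": True, "temperature": 0.9}

# model.translate(sentences, target_lang="en", **gen_kwargs)
```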
Hello, I'm running this simple code in PyCharm:
from easynmt import EasyNMT
model = EasyNMT("opus-mt")
print(model.translate("Hi", target_lang="fr"))
and it gives me this error
Traceback (most recent call last):
File "H:/Documents/Python/Random.py", line 2, in <module>
model = EasyNMT("opus-mt")
File "C:\Users\Jerz King\AppData\Local\Programs\Python\Python37\lib\site-packages\easynmt\EasyNMT.py", line 69, in __init__
module_class = import_from_string(self.config['model_class'])
File "C:\Users\Jerz King\AppData\Local\Programs\Python\Python37\lib\site-packages\easynmt\util.py", line 56, in import_from_string
module = importlib.import_module(module_path)
File "C:\Users\Jerz King\AppData\Local\Programs\Python\Python37\lib\importlib\__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'easynmt.models.OpusMT'
I had the same error addressed here when installing easynmt, and followed the steps, but nothing happened... How do I fix this?
Hi, I reviewed the code and want to give some suggestions.
As the code logic describes, if the user does not set source_lang in the translate method, the object will auto-infer the possible source_lang in translate_sentences.
This behaviour works well when the input sentences are short. When it comes to a long sentence, because of perform_sentence_splitting, the long input is split into small fragments, source_lang is inferred for every fragment, and a suitable model is chosen to translate each one (in grouped_sentences, grouped by detected source_lang).
This suits mixed-language inputs: fragments in different languages are translated by different models and joined back together.
But when the sentence_splitter unfortunately splits the long input badly, consider the following example:
import pandas as pd
input_ = 'How many times does the rebuilt data contain cannot handle non-empty timestamp argument! 1929 and scrapped data contain cannot handle non-empty timestamp argument! 1954?'
# this output will be ['en', 'en', 'eo'] because the last fragment is "1954?"
# and language_detection maps it to "eo"
pd.Series(sentence_splitter(input_)).map(model.language_detection)
When this is then used with the opus-mt model to translate the sentence from "eo" into "zh", it raises an error because no such model exists to load.
I understand that I can avoid this error by setting source_lang to "en" in the translate method, but I think the library also needs to deal with this problem.
If language_detection and sentence_splitter run fast enough, one option is to validate all possible translation mappings against lang_pairs in easynmt.json (in the opus-mt folder of the models dir) before running the translate method.
Another option: because the last fragment is too short to give a reliable language_detection result, apply a confidence filter that depends on fragment length.
Or use a regex (regular expression) to filter out symbols (in this example the '?' in "1954?") or other unhelpful tokens before language detection. And because inputs come in different formats, someone may pass an HTML document to the translate method; it would be useful to provide an interface that lets the user set a token filter (filtering out "?", "<br>", etc.) before language_detection.
The example above is one sample from a dataset I was translating, so when this error occurred I lost all previously translated results because of a single exception. Since the actual translation runs in batches, people may want to keep the successful batches by setting a small batch_size and collecting completed batches. I hope a future version will collect successful batches rather than losing everything when one long input (in list or string form) fails near the end of the translate method.
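The batch-level resilience asked for above can be sketched like this (my own wrapper, not library code; `model` is assumed to be an EasyNMT instance):

```python
def translate_resilient(model, docs, batch_size=8, target_lang="en"):
    # Translate in small batches; keep every successful batch and record
    # failures instead of losing all prior work to one exception.
    results, failed = [], []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        try:
            results.extend(model.translate(batch, target_lang=target_lang))
        except Exception as exc:
            failed.append((start, str(exc)))
            results.extend(batch)  # keep originals as placeholders
    return results, failed
```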
I have installed easynmt docker on Windows 7 using Docker Toolbox. I also mapped the /cache folder to my local drive so cached data remain persistent after reboot. I tested and it works fine, I can see hundreds of megabytes of data in the cache folder when I try to translate in a new language.
The problem is that after a reboot, using an already downloaded language set, there is always a pretty long delay on the first translation call. I disabled my network card to test, and indeed something is still downloaded from the internet even with the data/models already in the cache folder.
What is downloaded?
Is it possible to run in 100% offline mode?
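A guess at the startup delay: transformers pings the Hub for metadata/etag checks even when files are already cached. It honors TRANSFORMERS_OFFLINE=1, which skips those HTTP requests and serves everything from the local cache; worth trying inside the container:

```python
import os

# Must be set before transformers is imported (e.g. via the container's
# environment rather than in code that runs after startup).
os.environ["TRANSFORMERS_OFFLINE"] = "1"
```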
With large documents (several thousand characters), I see a crash in the URL parser. I tried different source and target languages. Also, for the same source text, the translation may sometimes succeed but then fail again; for example, for the sample request URL given below, the URL parser may sometimes succeed and sometimes fail with an exception.
I'm using the last 1.x commit of EasyNMT (specifically commit 61fcf7154f01f56c02be6d30b1c5d0921b91aa2e) as it has better benchmarks than 2.x for fairseq models, but I believe the same issue should exist in the latest version too, as I don't think the URL parser has changed. I'm using the m2m_100_418M model with a T4 GPU, if that matters at all.
EasyNMT error logs:
[2021-05-25 14:58:29 +0000] [16] [WARNING] Invalid HTTP request received.
Traceback (most recent call last):
File "httptools/parser/parser.pyx", line 245, in httptools.parser.parser.cb_on_url
File "/opt/conda/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 216, in on_url
parsed_url = httptools.parse_url(url)
File "httptools/parser/parser.pyx", line 468, in httptools.parser.parser.parse_url
httptools.parser.errors.HttpParserInvalidURLError: invalid url b'ividual+level+we+need+to+take+up+responsibility+to+curb+the+spread+of+this+virus%2C%22+said+Dr+Rao%0D%0A%0D%0APromotedListen+to+the+latest+songs%2C+only+on+JioSaavn.com%0D%0A%0D%0AWhen+it+comes+to+Bengaluru%2C+Dr+Rao+said+the+lockdown+had+reduced+the+number+of+emergency+oxygen+requirements+and+the+panic.+%22That+is+because+the+virus+has+stopped+moving+because+we+have+stopped+moving%2C%22+he+said.+%22Generally%2C+as+a+rule%2C+a+health+care+system+will+not+be+able+to+cope+with+a+sudden+rise+in+numbers%2C+emergency+oxygen+requirements+or+health+care.+The+other+big+concern+is+trained+manpower.%22%0D%0A%0D%0AComments%0D%0AMucormycosis%2C+commonly+known+as+Black+Fungus%2C+is+also+on+the+rise+in+the+state.+Dr+Rao+said%3A+%22At+HCG+we+are+treating+30+cases+and+the+number+is+on+the+rise.+In+Karnataka%2C+currently%2C+it+must+be+about+700+cases.+It+looks+like+an+epidemic+within+a+pandemic+at+this+juncture.+We+need+to+understand+the+source+of+this+infection%2C+have+early+detection+and+treatment.+A+committee+will+give+a+clear+strategy+for+the+state.+We+don%27t+need+to+scare+people+about+black+fungus%2C+we+need+to+create+awareness.+What+we+have+seen+in+the+patients+-+they+have+all+been+Covid+positive%2C+most+have+been+given+steroids%2C+majority+had+high+sugar.+30+to+40+per+cent+had+been+given+oxygen+and+most+important+-+none+of+them+had+been+vaccinated.%22%0D%0A%0D%0A'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 167, in data_received
self.parser.feed_data(data)
File "httptools/parser/parser.pyx", line 193, in httptools.parser.parser.HttpParser.feed_data
httptools.parser.errors.HttpParserCallbackError: the on_url callback failed
[2021-05-25 14:58:29 +0000] [18] [WARNING] Invalid HTTP request received.
Traceback (most recent call last):
File "httptools/parser/parser.pyx", line 245, in httptools.parser.parser.cb_on_url
File "/opt/conda/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 216, in on_url
parsed_url = httptools.parse_url(url)
File "httptools/parser/parser.pyx", line 468, in httptools.parser.parser.parse_url
httptools.parser.errors.HttpParserInvalidURLError: invalid url b'irus+is+now+being+reported+more+from+rural+Karnataka+with+often+a+weak+health+infrastructure.%0D%0A%0D%0ADr+Vishal+Rao+of+the+HCG+hospitals+and+a+member+of+the+Karnataka+Covid+task+force+said%2C+%22It+is+going+to+be+an+uphill+task+as+we+move+towards+the+districts+as+the+health+care+systems+get+overburdened+there.+Even+the+oxygen+management.+In+cities%2C+we+have+the+privilege+that+oxygen+comes+to+the+doorstep+of+the+hospital.+Whereas+in+villages+and+districts%2C+hospitals+have+to+carry+their+cylinders+to+refill+them.+Public+health+experts+and+virologists+are+repeatedly+trying+to+enhance+the+surveillance+in+villages+to+ensure+we+are+better+prepared+in+villages.+This+is+the+time+to+ramp+up+the+preparation+for+villages.%22%0D%0A%0D%0AHe+also+said+that+the+lockdown+%22definitely+had+a+very+significant+impact%22+on+the+daily+infections.+%22From+50%2C000+cases+everyday%2C+today+we+are+at+around+20%2C000+odd+cases.It+is+not+a+reassurance+that+once+the+lockdown+is+lifted%2C+we+will+continue+to+have+these+low+numbers.+But+what+is+of+concern+is+that+the+positivity+rate+still+sticks+at+around+20+per+cent+and+the+mortality+has+jumped+to+about+2+per+cent.+We+need+to+understand+that+when+the+waves+flatten%2C+it+is+not+that+the+virus+is+taking+rest.+It+is+a+socio-economic+virus+and+the+more+we+improve+interactions+without+safety%2C+we+are+going+to+explode+and+expand+the+spread+of+this+virus.+At+an+ind'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 167, in data_received
self.parser.feed_data(data)
File "httptools/parser/parser.pyx", line 193, in httptools.parser.parser.HttpParser.feed_data
httptools.parser.errors.HttpParserCallbackError: the on_url callback failed
Full request URL path + query params:
/translate?beam_size=2&source_lang=en&target_lang=de&text=Bengaluru%3A+Karnataka+-+one+of+the+worst+hit+states+in+the+country+in+the+second+wave+of+COVID-19%2C+has+been+witnessing+a+slump+in+the+new+case+numbers+over+the+last+few+weeks.+The+authorities%2C+however%2C+are+of+the+view+that+It+is+far+too+early+to+relax.%0D%0AThe+number+of+COVID-19+outnumbered+fresh+infections+in+Karnataka+yet+again+on+Tuesday%2C+as+the+state+reported+38%2C224+discharges+and+22%2C758+new+cases.+Of+the+new+cases+reported+today%2C+6%2C243+were+from+Bengaluru.%0D%0A%0D%0A%22If+you+look+at+the+numbers%2C+it+has+been+reducing+very+drastically.+Except+for+a+few+districts+where+the+numbers+are+not+coming+down.+In+most+of+the+districts+and+Bengaluru%2C+the+numbers+have+come+down.+The+number+should+come+down+drastically+so+that+we+can+unlock+from+the+lockdown%2C%22+Deputy+Chief+Minister+Dr+Ashwath+Narayan+told+NDTV.%0D%0A%0D%0AThe+state+is+in+the+middle+of+a+strict+shutdown.+But+that+doesn%27t+mean+any+major+reduction+in+the+demand+for+oxygen+in+Bengaluru+as+ICU+beds+still+remain+full.%0D%0A%0D%0A%22Since+the+number+has+come+down+very+drastically+-+now+it+is+5%2C000+odd+cases+in+Bengaluru+%28daily+infections%29+-+from+when+it+had+almost+reached+25%2C000%2C+it+is+a+great+relief.+When+it+comes+to+ICU+or+ventilator%2C+however%2C+there+is+still+a+lot+of+demand%2C%22+he+said.%0D%0A%0D%0AAn+extra+concern+is+that+the+virus+is+now+being+reported+more+from+rural+Karnataka+with+often+a+weak+health+infrastructure.%0D%0A%0D%0ADr+Vishal+Rao+of+the+HCG+hospitals+and+a+member+of+the+Karnataka+Covid+task+force+said%2C+%22It+is+going+to+be+an+uphill+task+as+we+move+towards+the+districts+as+the+health+care+systems+get+overburdened+there.+Even+the+oxygen+management.+In+cities%2C+we+have+the+privilege+that+oxygen+comes+to+the+doorstep+of+the+hospital.+Whereas+in+villages+and+districts%2C+hospitals+have+to+carry+their+cylinders+to+refill+them.+Public+health+experts+and+virologists+are+repeatedly+trying+to+enhance+the
+surveillance+in+villages+to+ensure+we+are+better+prepared+in+villages.+This+is+the+time+to+ramp+up+the+preparation+for+villages.%22%0D%0A%0D%0AHe+also+said+that+the+lockdown+%22definitely+had+a+very+significant+impact%22+on+the+daily+infections.+%22From+50%2C000+cases+everyday%2C+today+we+are+at+around+20%2C000+odd+cases.It+is+not+a+reassurance+that+once+the+lockdown+is+lifted%2C+we+will+continue+to+have+these+low+numbers.+But+what+is+of+concern+is+that+the+positivity+rate+still+sticks+at+around+20+per+cent+and+the+mortality+has+jumped+to+about+2+per+cent.+We+need+to+understand+that+when+the+waves+flatten%2C+it+is+not+that+the+virus+is+taking+rest.+It+is+a+socio-economic+virus+and+the+more+we+improve+interactions+without+safety%2C+we+are+going+to+explode+and+expand+the+spread+of+this+virus.+At+an+individual+level+we+need+to+take+up+responsibility+to+curb+the+spread+of+this+virus%2C%22+said+Dr+Rao%0D%0A%0D%0APromotedListen+to+the+latest+songs%2C+only+on+JioSaavn.com%0D%0A%0D%0AWhen+it+comes+to+Bengaluru%2C+Dr+Rao+said+the+lockdown+had+reduced+the+number+of+emergency+oxygen+requirements+and+the+panic.+%22That+is+because+the+virus+has+stopped+moving+because+we+have+stopped+moving%2C%22+he+said.+%22Generally%2C+as+a+rule%2C+a+health+care+system+will+not+be+able+to+cope+with+a+sudden+rise+in+numbers%2C+emergency+oxygen+requirements+or+health+care.+The+other+big+concern+is+trained+manpower.%22%0D%0A%0D%0AComments%0D%0AMucormycosis%2C+commonly+known+as+Black+Fungus%2C+is+also+on+the+rise+in+the+state.+Dr+Rao+said%3A+%22At+HCG+we+are+treating+30+cases+and+the+number+is+on+the+rise.+In+Karnataka%2C+currently%2C+it+must+be+about+700+cases.+It+looks+like+an+epidemic+within+a+pandemic+at+this+juncture.+We+need+to+understand+the+source+of+this+infection%2C+have+early+detection+and+treatment.+A+committee+will+give+a+clear+strategy+for+the+state.+We+don%27t+need+to+scare+people+about+black+fungus%2C+we+need+to+create+awareness.+What+we+have+seen+in+the+patients+-+they+have+all+be
en+Covid+positive%2C+most+have+been+given+steroids%2C+majority+had+high+sugar.+30+to+40+per+cent+had+been+given+oxygen+and+most+important+-+none+of+them+had+been+vaccinated.%22%0D%0A%0D%0ADelhi+received+144.8+mm+rainfall+in+May+this+year%2C+the+highest+for+the+month+in+13+years%2C+according+to+the+India+Meteorological+Department+%28IMD%29.%0D%0A%22No+rain+is+predicted+in+the+next+four+to+five+days.+So%2C+this+is+the+highest+rainfall+in+May+since+2008%2C%22+Kuldeep+Srivastava%2C+the+head+of+the+IMD%27s+regional+forecasting+centre%2C+said+today.%0D%0A%0D%0AThe+Safdarjung+Observatory%2C+considered+the+official+marker+for+the+city%2C+had+recorded+21.1+mm+rainfall+last+year%2C+26.9+mm+in+2019+and+24.2+mm+in+2018.%0D%0A%0D%0AIt+had+gauged+40.5+mm+precipitation+in+2017%3B+24.3+mm+in+2016%3B+3.1+mm+in+2015+and+100.2+mm+in+2014%2C+according+to+IMD+data.%0D%0A%0D%0A
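The logs above suggest the whole article was sent in the URL query string, which httptools rejects once it exceeds its URL length limit. A possible workaround (a sketch only, not tested against the actual EasyNMT frontend; the endpoint and field names are assumptions mirroring the query parameters in the logs) is to send the document in a POST JSON body instead:

```python
import json
import urllib.request

def build_translate_payload(text, source_lang, target_lang, beam_size=2):
    # Put the document in the request body instead of the URL query
    # string, so server-side URL length limits no longer apply.
    # Field names mirror the query parameters in the logs above, but
    # verify them against the EasyNMT docker frontend's documentation.
    return {"text": text, "source_lang": source_lang,
            "target_lang": target_lang, "beam_size": beam_size}

def post_translate(url, payload):
    # Hypothetical helper: POST the JSON payload to the /translate endpoint.
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With this, only a short URL reaches the parser and the article text travels in the body.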
Hi @nreimers
Will the translate method split or tokenize the input text into sentences?
model.translate(text, source_lang='es', target_lang='en')
I'm working with long sentences or documents that contain no commas or special characters, because I cleaned the text in previous steps. So I am sending sentences like the example below, but much longer:
"amor edificio casa perro" --> expected output ---> "love building house dog"
As you can notice, there are no commas or dots. Does your model need these characters to translate in chunks?
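EasyNMT generally splits long inputs into sentences before translation, so text with no sentence-ending punctuation may be treated as one very long sentence and exceed the model's length limit. As a defensive measure (my own sketch under that assumption, not EasyNMT's actual implementation), such text could be pre-chunked by word count before calling translate:

```python
def chunk_by_words(text, max_words=100):
    # Split punctuation-free text into fixed-size word chunks so each
    # chunk stays well under the model's token limit. The 100-word
    # default is an arbitrary assumption, not an EasyNMT constant.
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]
```

Each chunk would then be translated separately and the results joined back together.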
Hello, I got the following error when using PyTorch with a GPU (the code works fine using only the CPU):
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`
when I run a simple basic example such as
from easynmt import EasyNMT
model = EasyNMT('m2m_100_1.2B')
print(model.translate("Bonjour", target_lang='en'))
I have cuda 11.0 installed and a GPU 6Gb GeForce GTX 1660.
Has anyone encountered the same issue?
I have been unable to download opus-mt.
Every time I run model = EasyNMT('opus-mt'),
after downloading about 20%, the download speed drops from 2 Mbps to 1 kbps.
I found a similar project called ktrain that supports this,
located at
https://github.com/amaiya/ktrain/blob/5c9c6b333115be44433639c4bc4c091bd79ab65c/ktrain/text/translation/core.py
Having some accuracy measurements in the output to summarize the results would be even more interesting.
Could multilingual sentence embeddings be of some help here?
First -- thank you for the docker container. I could not get the Dockerfile to build. I am impressed with my results so far.
Second -- during startup, I get the following :
Booted as backend: True
Load model: opus-mt
[2021-03-30 00:17:10 +0000] [2255] [INFO] Started server process [2255]
[2021-03-30 00:17:10 +0000] [2255] [INFO] Waiting for application startup.
[2021-03-30 00:17:10 +0000] [2255] [INFO] Application startup complete.
/usr/local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py:3216: FutureWarning: `prepare_seq2seq_batch` is deprecated and will be removed in version 5 of 🤗 Transformers. Use the regular `__call__` method to prepare your inputs and the tokenizer under the `with_target_tokenizer` context manager to prepare your targets. See the documentation of your specific tokenizer for more details
warnings.warn(
Third -- some translations are humorous (using opus-mt). Consider sv to en (Swedish to English) of sitt, which translates to:
This medicine is not recommended for use in patients with severe hepatic impairment.
or stackers, which translates to
poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor, poor.
(You can confirm these by using the $10 computer translation window and swedish to english).
The translation of än goes on for quite a few lines. Or you can try vågar.
Fourth -- I'm running on some pretty good iron and the performance seems a little slow. Are there some tuning issues that I can address? I have enough cpu and memory.
Fifth -- A sincere thanks for putting this together. I was using googletrans and it was giving me fits.
And you might care about what I'm using easyNMT for....I'm doing prep work for a student taking classes in a foreign language. I am pulling the English closed captions out of the recording of a broadcast TV program, translating them to Swedish, and then putting them back into the program before we watch it. I also build a new word list from the Swedish translations, eliminate words the student already knows, and create a flat file of the Swedish word and English meaning that can be input to a flash card program. This is working much better since I have started using easyNMT. Many, many thanks.
In my case, some 'zh' strings are detected as 'wuu', 'yue', 'eo', and so on; actually, they are all 'zh' strings.
First, thanks for this. It looks promising.
Can it be used to train your own model using your own training dataset?
Thanks,
Thanks again for this wonderful project.
We have a use case as follows. Is it possible to achieve this with EasyNMT? We haven't found any way to cancel ongoing translation processes yet.
For Case 3, is it possible to cancel the ongoing translation process + HTTP request in order to reclaim processor resources for the next request immediately?
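As far as I know, EasyNMT does not expose a cancellation API, so one hedged workaround is to run each translation in a child process the server can terminate. Everything below (the worker body, the timeout handling) is an illustrative sketch, not EasyNMT code; in a real worker you would call model.translate instead of the stand-in.

```python
import multiprocessing as mp

def _translate_worker(q, text):
    # Stand-in for model.translate(...); a real worker would load the
    # model once and put the translated text on the queue.
    q.put(text.upper())

def translate_with_cancel(text, timeout=None):
    # Run the translation in a separate process so the caller can kill
    # it and reclaim CPU/GPU resources for the next request.
    q = mp.Queue()
    p = mp.Process(target=_translate_worker, args=(q, text))
    p.start()
    p.join(timeout)
    if p.is_alive():
        p.terminate()  # cancel the ongoing translation
        p.join()
        return None
    return q.get()
```

The HTTP layer would call translate_with_cancel with a timeout, or terminate the process explicitly when the client disconnects. Note that killing a process mid-inference forfeits any partial output.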
Hi,
Thank you for the great library; it is very useful. I tried to install it on my machine and it ended in an error. Can you please let me know if I am missing anything?
ERROR: Command errored out with exit status 1:
command: 'C:\Program Files\Anaconda\python.exe' 'C:\Program Files\Anaconda\lib\site-packages\pip\_vendor\pep517\_in_process.py' get_requires_for_build_wheel 'C:\Users\Public\Documents\Wondershare\CreatorTemp\tmp90q37bss'
cwd: C:\Users\Public\Documents\Wondershare\CreatorTemp\pip-install-zxfcubdl\fairseq
Complete output (31 lines):
Traceback (most recent call last):
File "setup.py", line 214, in <module>
do_setup(package_data)
File "setup.py", line 136, in do_setup
setup(
File "C:\Users\Public\Documents\Wondershare\CreatorTemp\pip-build-env-hxrh3fi7\overlay\Lib\site-packages\setuptools\__init__.py", line 152, in setup
install_setup_requires(attrs)
File "C:\Users\Public\Documents\Wondershare\CreatorTemp\pip-build-env-hxrh3fi7\overlay\Lib\site-packages\setuptools\__init__.py", line 147, in _install_setup_requires
dist.fetch_build_eggs(dist.setup_requires)
File "C:\Users\Public\Documents\Wondershare\CreatorTemp\pip-build-env-hxrh3fi7\overlay\Lib\site-packages\setuptools\build_meta.py", line 60, in fetch_build_eggs
raise SetupRequirementsError(specifier_list)
setuptools.build_meta.SetupRequirementsError: ['cython', 'numpy', 'setuptools>=18.0']
During handling of the above exception, another exception occurred:
ERROR: Command errored out with exit status 1: 'C:\Program Files\Anaconda\python.exe' 'C:\Program Files\Anaconda\lib\site-packages\pip\_vendor\pep517\_in_process.py' get_requires_for_build_wheel 'C:\Users\Public\Documents\Wondershare\CreatorTemp\tmp90q37bss' Check the logs for full command output.
I ran into a bug where a model keeps getting loaded, even after being used right before.
I set max_loaded_models=1
I then did a call to translate a sentence from english to dutch, loading the EN-NL model
Then I did a translate call from german to dutch, unloading the EN-NL model and loading DE-NL.
However, when I repeat the german to dutch call, the model gets loaded again.
Printing the translator's self.models shows the following:
{'Helsinki-NLP/opus-mt-EN-NL': {'tokenizer': PreTrainedTokenizer(name_or_path='Helsinki-NLP/opus-mt-DE-NL'
In other words, it loads the German model under the English name.
I found the problem in the source code:
In load_model, the line for model_name in self.models:
overwrites the value of the model_name argument, which should indicate the model you want to load. Therefore, it stores the new model under the name of the last model iterated over in self.models.
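A minimal reproduction of the fix (my own simplified sketch, not the actual EasyNMT source; class and attribute names are illustrative): use a loop variable distinct from the model_name argument when evicting cached models, so the new model is stored under the correct key.

```python
class ModelCache:
    # Simplified stand-in for EasyNMT's model cache.
    def __init__(self, max_loaded_models=1):
        self.models = {}
        self.max_loaded_models = max_loaded_models

    def load_model(self, model_name):
        if model_name in self.models:
            return self.models[model_name]
        # Use a distinct variable ("oldest") instead of reusing
        # model_name as the loop variable, which is what caused the
        # bug described above (the argument was being overwritten).
        while len(self.models) >= self.max_loaded_models:
            oldest = next(iter(self.models))
            del self.models[oldest]
        self.models[model_name] = f"loaded:{model_name}"
        return self.models[model_name]
```

With this, repeating the German-to-Dutch call hits the cache instead of reloading the model.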
I want to ask if there is any way to rephrase a text using this library while keeping the same language.
How many characters is 512 word pieces?
language_detection by default uses fasttext's lid.176.ftz,
which has some confusion between upper-case English and Japanese.
The following is an example:
model = EasyNMT("opus-mt")
text_list = ["WHAT SCHOOL DOES HA'SEAN CLINTON-DIX BELONG TO?",
'WHAT IS THE LOWEST ROUND FOR CB POSITION?',
'IN THE ISSUE WHERE NICOLE NARAIN WAS THE CENTERFOLD MODEL, WHO WAS THE INTERVIEW SUBJECT?',
"WHAT WAS ESSENDON'S HIGHEST CROWD NUMBER WHEN PLAYING AS THE AWAY TEAM?",
'WHAT ARE THE HIGHEST POINTS WITH GOALS LARGER THAN 48, WINS LESS THAN 19, GOALS AGAINST SMALLER THAN 44, DRAWS LARGER THAN 5?',
'WHAT IS THE STROKE COUNT WITH RADICAL OF 生, FRQUENCY SMALLER THAN 22?',
'HOW MANY TEMPERATURE INTERVALS ARE POSSIBLE TO USE WITH ACRYL? ',
'WHAT ARE THE GOALS WITH DRAWS SMALLER THAN 6, AND LOSSES SMALLER THAN 7?',
'Name the least number for 君のそばで~ヒカリのテーマ~(popup.version)',
'WHAT WAS THE SURFACE MADE OF WHEN THE OPPONENT WAS GEORGE KHRIKADZE?',
'Name the english name for 福鼎市',
'WHAT IS THE LOWEST MONEY WITH 74-74-76-74=298 SCORE?',
'WHAT IS THE HIGHEST PICK WITH A TE POSITION, AND ROUND SMALLER THAN 2?',
'WHAT IS THE LOWEST MONEY WITH TO PAR LARGER THAN 26?']
pd.Series(text_list).map(model.language_detection).value_counts(),
pd.Series(text_list).map(lambda x: x.lower()).map(model.language_detection).value_counts()
The first series detects everything as ja; the second distinguishes en from ja.
Maybe add some pre-processing of the input, or change the default detector.
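One pre-processing step along these lines (my own sketch; how much it helps depends on the text) is to lowercase input before handing it to the detector, since the misclassification above disappears once the ALL-CAPS text is lowercased:

```python
def normalize_for_lid(text):
    # ALL-CAPS English can be misclassified by lid.176.ftz, so
    # lowercase before detection. Would be used as
    # model.language_detection(normalize_for_lid(text)) -- that
    # wrapping is an assumption about the call site, not EasyNMT API.
    return text.lower().strip()
```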