laser's Introduction

LASER: Language-Agnostic SEntence Representations

LASER is a library to calculate and use multilingual sentence embeddings.

NEWS

  • 2023/11/30 Released P-xSIM, a dual approach extension to multilingual similarity search (xSIM)
  • 2023/11/16 Released laser_encoders, a pip-installable package supporting LASER-2 and LASER-3 models
  • 2023/06/26 xSIM++ evaluation pipeline and data released
  • 2022/07/06 Updated LASER models with support for over 200 languages are now available
  • 2022/07/06 Multilingual similarity search (xSIM) evaluation pipeline released
  • 2022/05/03 Librivox S2S is available: Speech-to-Speech translations automatically mined in Librivox [9]
  • 2019/11/08 CCMatrix is available: Mining billions of high-quality parallel sentences on the WEB [8]
  • 2019/07/31 Gilles Bodard and Jérémy Rapin provided a Docker environment to use LASER
  • 2019/07/11 WikiMatrix is available: bitext extraction for 1620 language pairs in Wikipedia [7]
  • 2019/03/18 switch to BSD license
  • 2019/02/13 The code to perform bitext mining is now available

CURRENT VERSION:

  • We now provide updated LASER models that support over 200 languages. Please see here for more details, including how to download the models and perform inference.

In our experience, the sentence encoder also supports code-switching, i.e. the same sentence can contain words in several different languages.

We also have some evidence that the encoder can generalize to languages that were not seen during training, provided they belong to a language family covered by the training languages.

A detailed description of how the multilingual sentence embeddings are trained can be found here, together with an experimental evaluation.

The core sentence embedding package: laser_encoders

We provide a package laser_encoders with minimal dependencies. It supports LASER-2 (a single encoder for the languages listed below) and LASER-3 (147 language-specific encoders described here).

The package can be installed simply with pip install laser_encoders and used as follows:

from laser_encoders import LaserEncoderPipeline
encoder = LaserEncoderPipeline(lang="eng_Latn")
embeddings = encoder.encode_sentences(["Hi!", "This is a sentence encoder."])
print(embeddings.shape)  # (2, 1024)

The laser_encoders readme file provides more examples of its installation and usage.
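As an illustration of how the embeddings can be used (a minimal sketch, not part of the package documentation; the language codes and sentences are just examples), embeddings produced by different language pipelines live in the same space, so cross-lingual similarity can be estimated with cosine similarity over L2-normalized vectors:

import numpy as np
from laser_encoders import LaserEncoderPipeline

# Both pipelines map into the same 1024-dimensional embedding space.
encoder_en = LaserEncoderPipeline(lang="eng_Latn")
encoder_fr = LaserEncoderPipeline(lang="fra_Latn")

emb_en = encoder_en.encode_sentences(["The weather is nice today."])
emb_fr = encoder_fr.encode_sentences(["Il fait beau aujourd'hui."])

# Cosine similarity = inner product of L2-normalized embeddings.
emb_en /= np.linalg.norm(emb_en, axis=1, keepdims=True)
emb_fr /= np.linalg.norm(emb_fr, axis=1, keepdims=True)
print(float((emb_en * emb_fr).sum()))  # close to 1.0 for a good translation pair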

The full LASER kit

Apart from laser_encoders, we also provide support for LASER-1 (the original multilingual encoder) and for the various LASER applications listed below.

Dependencies

  • Python >= 3.7
  • PyTorch 1.0
  • NumPy, tested with 1.15.4
  • Cython, needed by Python wrapper of FastBPE, tested with 0.29.6
  • Faiss, for fast similarity search and bitext mining
  • transliterate 1.10.2 (pip install transliterate)
  • jieba 0.39, Chinese segmenter (pip install jieba)
  • mecab 0.996, Japanese segmenter
  • Moses tokenization scripts (installed automatically)
  • FastBPE, fast C++ implementation of byte-pair encoding (installed automatically)
  • Fairseq, sequence modeling toolkit (pip install fairseq==0.12.1)
  • tabulate, pretty-print tabular data (pip install tabulate)
  • pandas, data analysis toolkit (pip install pandas)
  • Sentencepiece, subword tokenization (installed automatically)

Installation

  • install the laser_encoders package, e.g. with pip install -e . to install it in editable mode
  • set the environment variable 'LASER' to the root of the installation, e.g. export LASER="${HOME}/projects/laser"
  • download the encoders from Amazon S3, e.g. with bash ./nllb/download_models.sh
  • download third-party software with bash ./install_external_tools.sh
  • download the data used in the example tasks (see description for each task)

Applications

We showcase several applications of multilingual sentence embeddings with code to reproduce our results (in the directory "tasks").

For all tasks, we use exactly the same multilingual encoder, without any task-specific optimization or fine-tuning.

License

LASER is BSD-licensed, as found in the LICENSE file in the root directory of this source tree.

Supported languages

The original LASER model was trained on the following languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Aymara, Azerbaijani, Basque, Belarusian, Bengali, Berber languages, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central/Kadazan Dusun, Central Khmer, Chavacano, Chinese, Coastal Kadazan, Cornish, Croatian, Czech, Danish, Dutch, Eastern Mari, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Interlingua, Interlingue, Irish, Italian, Japanese, Kabyle, Kazakh, Korean, Kurdish, Latvian, Latin, Lingua Franca Nova, Lithuanian, Low German/Saxon, Macedonian, Malagasy, Malay, Malayalam, Maldivian (Divehi), Marathi, Norwegian (Bokmål), Occitan, Persian (Farsi), Polish, Portuguese, Romanian, Russian, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Wu Chinese and Yue Chinese.

We have also observed that the model seems to generalize well to other (minority) languages or dialects, e.g.

Asturian, Egyptian Arabic, Faroese, Kashubian, North Moluccan Malay, Nynorsk Norwegian, Piedmontese, Sorbian, Swabian, Swiss German or Western Frisian.

LASER3

Updated LASER models, referred to as LASER3, supplement the above list with support for 147 languages. The full list of supported languages can be seen here.

References

[1] Holger Schwenk and Matthijs Douze, Learning Joint Multilingual Sentence Representations with Neural Machine Translation, ACL workshop on Representation Learning for NLP, 2017.

[2] Holger Schwenk and Xian Li, A Corpus for Multilingual Document Classification in Eight Languages, LREC, pages 3548-3551, 2018.

[3] Holger Schwenk, Filtering and Mining Parallel Data in a Joint Multilingual Space, ACL, July 2018.

[4] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov, XNLI: Cross-lingual Sentence Understanding through Inference, EMNLP, 2018.

[5] Mikel Artetxe and Holger Schwenk, Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings, arXiv, Nov 3 2018.

[6] Mikel Artetxe and Holger Schwenk, Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond, arXiv, Dec 26 2018.

[7] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman, WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia, arXiv, July 11 2019.

[8] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin, CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB.

[9] Paul-Ambroise Duquenne, Hongyu Gong and Holger Schwenk, Multimodal and Multilingual Embeddings for Large-Scale Speech Mining, NeurIPS 2021, pages 15748-15761.

[10] Kevin Heffernan, Onur Celebi and Holger Schwenk, Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages.

laser's People

Contributors

avidale, cclauss, celebio, gillesbodart, gwenzek, harrry538, heffernankevin, hoschwenk, ivanvergiliev, jrapin, julianpollmann, kalyangvs, mayhewsw, mpuig, nixblack11, paulooh007, ritwik12, stefan-it, thehappylemon

laser's Issues

models as onnx files

Hello,
It looks like the pretrained models are currently distributed in PyTorch .pt format:
"bilstm.eparl21.2018-11-19.pt"
"bilstm.93langs.2018-12-26.pt"
Would it be possible to also offer them in ONNX format (https://onnx.ai)?
Kind regards

Trying to get Japanese tokenization to work

I have mecab set up in the right location as mentioned in the docs, but I am not able to get Japanese tokenization working. Has anyone seen this before?

!echo "雪の風景" | python3 ./LASER/source/embed.py \
    --encoder ./LASER/models/bilstm.93langs.2018-12-26.pt \
    --token-lang ja \
    --bpe-codes ./LASER/models/93langs.fcodes \
    --output /data/LASER/LASER-embeddings/jp-titles.vec \
    --verbose
- Encoder: loading ./LASER/models/bilstm.93langs.2018-12-26.pt
 - Tokenizer:  in language ja  
WARNING: No known abbreviations for language 'ja', attempting fall-back to English version...
Traceback (most recent call last):
  File "/home/ubuntu/projects/LASER/source/lib/romanize_lc.py", line 46, in <module>
    for line in args.input:
  File "/home/ubuntu/anaconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte
 - fast BPE: processing tok
 - Encoder: bpe to jp-titles.vec
 - Encoder: 0 sentences in 0s
CPU times: user 68 ms, sys: 116 ms, total: 184 ms
Wall time: 3.44 s

'ascii' codec can't decode byte

Hi!
I've tried to calculate sentence embeddings for a file in Russian and failed with this error: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

The problem is that the data is UTF-8 encoded, and the default Python open call uses an ASCII/locale encoding that cannot decode it.
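A minimal sketch of the usual workaround (the file name is hypothetical): pass the encoding explicitly instead of relying on the locale default when reading the input in Python:

# Read the Russian input with an explicit UTF-8 encoding (file name is hypothetical).
with open("sentences.ru.txt", "r", encoding="utf-8") as fin:
    sentences = [line.rstrip("\n") for line in fin]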

can not get the "embed.sh" example running

I followed the instructions in the README, but still can't get it running.

Installation
set the environment variable 'LASER' to the root of the installation, e.g. export LASER="${HOME}/projects/laser"
download encoders from Amazon s3 by bash ./install_models.sh
download third party software by bash ./install_external_tools.sh
./embed.sh ${LASER}/data/tatoeba/v1/tatoeba.fra-eng.fra fr my_embeddings.raw
 - Encoder: loading /bigdata/facebookresearch-LASER/LASER/models/bilstm.93langs.2018-12-26.pt
 - Tokenizer:  in language fr
 - fast BPE: processing tok
 - Encoder: bpe to my_embeddings.raw
Traceback (most recent call last):
  File "/bigdata/facebookresearch-LASER/LASER/source/embed.py", line 364, in <module>
    buffer_size=args.buffer_size)
  File "/bigdata/facebookresearch-LASER/LASER/source/embed.py", line 295, in EncodeFile
    fin = open(inp_fname, 'r') if len(inp_fname) > 0 else sys.stdin
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpn0xl23ug/bpe'

Thanks for such great work. There are very few resources for Cantonese (Yue Chinese); this model should be a great help.

Language abbreviation error

I tried to get embeddings for Hindi using ./embed.sh ${LASER}/data/tatoeba/v1/tatoeba.hin-eng.hin hi check.raw, but I got the following output on my console:

 - Encoder: loading ${LASER}/models/bilstm.93langs.2018-12-26.pt
 - Tokenizer:  in language hi
WARNING: No known abbreviations for language 'hi', attempting fall-back to English version...
 - fast BPE: processing tok
 - Encoder: bpe to check.raw
 - Encoder: 1000 sentences in 17s

I also tried different abbreviations like hn, hin, hindi etc., but none of them worked.

Also, if you can provide the mapping from all the languages to their respective abbreviations, that would be helpful.

[QUESTION] LASER and fastText

I'm using fastText both as an unsupervised language model and as a supervised text classifier. Currently, for cross-language models, fastText has several limitations due to its approach to sub-word creation: it is limited in terms of text tokenization and it does not handle languages without word boundaries properly. I was wondering how to combine LASER multilingual text embeddings with a fastText model: fastText supports pre-trained vectors, so would it be possible to load the LASER bi-LSTM embeddings for training? Or maybe a different approach could be followed.
Side note: in this recent paper we used a fastText+biLSTM+Attention model and found pretty interesting (monolingual) results, and I would like to extend these results to include LASER at some point!

issue with mine_bitexts.py

Hi,
I am trying to mine parallel sentences from two large monolingual corpora (over 40M sentences each). In the first step I encoded both sides and then called mine_bitexts.py to do the magic and extract the most probable sentence pairs. However, I ran into a memory issue, so I decided to load only the embeddings of the target side and, to keep the memory footprint minimal, encode one source sentence at a time and try to mine the candidates for that single sentence. But I still get the following error:

Faiss assertion 'err__ == cudaSuccess' failed in virtual void faiss::gpu::StandardGpuResources::initializeForDevice(int) at StandardGpuResources.cpp:168; details: CUDA error 2 go-align.sh: line 56: 23772 Aborted (core dumped) python ${LASER}/source/mine_bitexts.py

To reduce the memory usage even further, I decreased the batch size so that at each step it only reads a small batch of target embeddings and compares the source embedding against them. But still no success.
This issue seems to be related to FAISS and I found the following thread in the FAISS issues:
https://github.com/facebookresearch/faiss/issues/231
But I couldn't find a solution which works for me. Any ideas about this?
I am running my experiments on 4 Tesla K80 GPUs and the corpora contain about 50-60M sentences each.
The only other solution I could think of is to split the target corpus into smaller batches of, say, 10M sentences and, for each source sentence, get its most probable candidate in each batch. Then I would need to go through the list of all extracted candidates for each source sentence and return the best one as the most similar candidate.
May I ask if you ever faced this issue and if you have a better solution for it?

Thank you,
Amin

Compute Document Embedding from Sentence Embeddings

Hi,

Thank you for amazing work!

We have a cross-lingual document classification (binary) use case where each document is more than 500 words, and we have labelled data for 2 foreign languages along with English. A few questions:

  1. How are document embeddings currently computed? Does the entire sequence get passed through the biLSTM in one go? If so, can we compute them in a better way, especially for documents, since the semantic representation of a document might not be good enough when passing the entire sequence at once? (See also the sketch after this message.)
  2. Is training on combined multilingual data any better than training on English-only data and then doing inference on the other languages?
  3. Do we have to pass any explicit encoding when calling embed.sh for languages like French and Spanish? I see the default encoding is UTF-8; is there something I should be worried about, as pointed out in #39?

TIA! :)

Regards,
Ashvini
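On question 1, one common (unofficial) baseline is to embed each sentence separately and mean-pool the sentence vectors into a document vector; a minimal sketch using the laser_encoders package described above (the sentence splitting and example text are hypothetical):

import numpy as np
from laser_encoders import LaserEncoderPipeline

encoder = LaserEncoderPipeline(lang="eng_Latn")

def embed_document(sentences):
    # sentences: list of strings belonging to one document
    sentence_embeddings = encoder.encode_sentences(sentences)   # (n_sentences, 1024)
    document_embedding = sentence_embeddings.mean(axis=0)       # (1024,)
    # L2-normalize so that dot products behave like cosine similarity
    return document_embedding / np.linalg.norm(document_embedding)

doc = ["First sentence of the document.", "Second sentence of the document."]
print(embed_document(doc).shape)  # (1024,)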

Pre-trained embeddings for En-Zh

Hello,
Thanks for open-sourcing & maintaining this useful package. I've been using it well for gathering parallel sentences from Wikipedia.

I was wondering, would it be possible to make the pre-trained multilingual encoder model for Chinese and English (as used in the paper, "Learning joint multilingual sentence representations with neural machine translation") publicly available?

Training on unknown languages

I would like to test LASER with a set of very low-resource languages: Sanskrit, Classical Tibetan and Classical Chinese. Is it possible to train it with my own data? The data is already tokenized and I was able to compute fastText embeddings for it.
With best wishes,

Sebastian

MLDoc dataset unavailable

The TREC website has been down for several days, so we cannot access the MLDoc dataset, which was used in the paper for zero-shot text classification. Is there another way to download the data?

[Installation] Files downloaded are not always master.zip

While installing the external libraries (bash install_external_tools.sh) I found that wget somehow gave me 302 responses and redirected me to https://.../master, so I ended up downloading "master" instead of "master.zip". The installation therefore failed because the script was looking for a file named "master.zip", which doesn't exist. I think it would be better to add an -O option to the wget command.

get sentence embeddings in python

First of all, definitely a great framework, I would like to congratulate you and thank you also for sharing it, the idea of multilingual embeddings really is great.

My question is: I was able to run the sample to obtain sentence embeddings from the CLI; is there a way to get these embeddings in Python without having to write the sentences to one file and read the result back as a numpy array from another?

That is to call some function like:

encoder.embed("sentence") or something similar.

Thank you, kind regards
Alejandro.
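The laser_encoders package described at the top of this page now provides exactly this kind of in-Python call; a minimal sketch (the language code and sentence are only examples):

from laser_encoders import LaserEncoderPipeline

encoder = LaserEncoderPipeline(lang="spa_Latn")  # any supported language code
embedding = encoder.encode_sentences(["una frase de ejemplo"])  # numpy array, shape (1, 1024)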

LASER: calculation of sentence embeddings: ( Temporary file error. )

Hello LASER Team!
I am trying to run the "Sentence Embedding for text files" example for the fra language, but it gives me the following error. Could you please guide me?

Command: ./embed.sh ${LASER}/data/tatoeba/v1/tatoeba.fra-eng.fra fr my_embeddings.raw

  • Encoder: loading /home/farhat/LASER/models/bilstm.93langs.2018-12-26.pt
  • Tokenizer: in language fr
  • fast BPE: processing tok
  • Encoder: bpe to my_embeddings.raw
    Traceback (most recent call last):
    File "/home/farhat/LASER/source/embed.py", line 364, in
    buffer_size=args.buffer_size)
    File "/home/farhat/LASER/source/embed.py", line 295, in EncodeFile
    fin = open(inp_fname, 'r') if len(inp_fname) > 0 else sys.stdin
    FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp5mjbrkfz/bpe'

Sentence Similarity Disambiguation

For the sentence Que le dé, que le dé, que le dé (which is ES, Spanish) and a close EN, English, translation that reads To give it to her, to give it to her, to give it to her, the dot product of the embeddings (see #44) gives a score of 0.44. Typically, for similar sentences I get a score >0.5 (likely similar) or >0.7 (very likely similar), while everything <0.5 seems unlikely to be similar.

Now, in this specific case, it seems that at first glance Que le dé, que le dé, que le dé is classified as FR, French, by most of the langId models I run (fastText, Google Cloud, etc.). If so, this could mean that in the vector space the EN translation is not as close as it would be if the sentence were treated as ES.

The second consideration is that another possible translation, "to be given, to be given, to be given", gives a score of 0.79, which means the source and target are actually similar.

Finally, if I add more ES context to the sentence, like Que le dé, que le dé, que le dé, come estas, the similarity also gets closer, at 0.68.

So, what would be a reasonable approach to disambiguate sentences like this, which could have very close vectors in different languages?

Running Laser line by line with TokenLine, BPEfastApplyLine

We are using LASER line by line, i.e.:

line = lines[index]
        # romanize some languages
        romanize = True if lang in ROMANIZED_LANGUAGES else False
        
        # tokenize
        token = TokenLine(line,
                    lang=lang,
                    romanize=romanize,
                    lower_case=True)

        # bpe
        bpe = BPEfastApplyLine(token,bpe_codes)

        # encode
        encoded = EncodeLine(model,bpe)

where the new EncodeLine looks like:

# Encode sentences string
def EncodeLine(encoder, line):
    verbose=False
   
    print ( len(line), line )
    embedding = encoder.encode_sentences( line )

    print ( embedding.shape, embedding )
    return embedding

Now everything works fine for TokenLine and BPEfastApplyLine, but we are not sure about the shape of the ndarray returned by EncodeLine:

For a given sentence like donde podremos estar solos, solos, solos we get the BPE donde pod@@ remos estar sol@@ os , sol@@ os , sol@@ os (size: 55), and the encoded shape after EncodeLine is (55, 1024). Is that correct?

Full logging on two sentences:

line donde podremos estar solos, solos, solos
55 donde pod@@ remos estar sol@@ os , sol@@ os , sol@@ os

(55, 1024) [[-1.1423406e-03 -5.5885896e-05  2.2527664e-03 ...  3.7325646e-03
  -3.6964880e-03  1.1122560e-02]
 [ 2.3968022e-03 -2.3012199e-03  2.5090985e-03 ...  1.0885910e-02
  -1.7804311e-03  7.0607006e-03]
 [-5.9819245e-04  8.6139684e-04  1.0634706e-03 ...  2.4023859e-02
  -6.1983038e-03  2.8500563e-02]
 ...
 [ 2.3968022e-03 -2.3012199e-03  2.5090985e-03 ...  1.0885910e-02
  -1.7804311e-03  7.0607006e-03]
 [ 1.9327810e-02 -5.5067514e-05  1.7086942e-03 ...  2.1211503e-02
  -4.9151536e-03  3.5833105e-02]
 [ 6.4000086e-04  1.8474855e-04  9.5328852e-04 ... -1.3147421e-04
  -1.7696818e-03 -2.5204693e-03]]
line où nous pourrons être seuls,seuls,seuls
65 où nous pourr@@ ons être se@@ ul@@ s , se@@ ul@@ s , se@@ ul@@ s

(65, 1024) [[ 2.3968022e-03 -2.3012199e-03  2.5090985e-03 ...  1.0885910e-02
  -1.7804311e-03  7.0607006e-03]
 [ 4.0953397e-03 -1.6550833e-04 -2.3423209e-04 ...  1.4184720e-03
  -6.8216762e-03 -5.1830243e-03]
 [ 6.4000080e-04  1.8474844e-04  9.5328980e-04 ... -1.3147425e-04
  -1.7696820e-03 -2.5204695e-03]
 ...
 [ 6.4000080e-04  1.8474844e-04  9.5328980e-04 ... -1.3147425e-04
  -1.7696820e-03 -2.5204695e-03]
 [ 1.9327810e-02 -5.5067514e-05  1.7086942e-03 ...  2.1211503e-02
  -4.9151536e-03  3.5833105e-02]
 [ 6.4000045e-04  1.8474869e-04  9.5328881e-04 ... -1.3147409e-04
  -1.7696813e-03 -2.5204699e-03]]

Second question: do we still need to call IndexCreate before calculating the similarity between embeddings, like:

   d, idx = IndexCreate(path +'data_' + str(i) + '.enc.' + l ,'FlatL2', verbose=verbose, save_index=False)

Is this step actually needed? Given that we already have the embeddings from EncodeLine, do we actually have to run IndexCreate?
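A likely explanation for the first question (an assumption, not a confirmed answer): encode_sentences expects a list of sentences, so passing a plain string makes it iterate character by character, which would match the (55, 1024) and (65, 1024) shapes above. A minimal sketch of the fix:

# Pass a one-element list so the whole line is treated as a single sentence.
embedding = encoder.encode_sentences([bpe])   # expected shape: (1, 1024)

As for the second question, a FAISS index such as the one built by IndexCreate is only needed for large-scale nearest-neighbour search; for comparing a handful of embeddings, a dot product of the (normalized) vectors should be sufficient.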

How to train with sent_classif.py

Hi,
I am trying to understand how to organize my dataset for training using the sent_classif.py script.
I have 2 classes (spam/not-spam), and I have created the following files:

train_set_corpus.txt
val_set_corpus.txt
test_set_corpus.txt

train_set_labels.txt
val_set_labels.txt
test_set_labels.txt

The XX_corpus.txt files contain only the text, while the XX_labels.txt files contain the corresponding labels. First question: is this correct? Should the files be split like that?

I have also encoded my labels as numerical values, since I encountered the error ValueError: could not convert string to float: 'SPAM'. So now I have binary labels.

I am launching the script as follows:

python3.6 sent_classif.py -s data/spam_dataset/laser_output/ -t train_set_corpus.txt -T train_set_labels.txt -d val_set_corpus.txt -D val_set_labels.txt -e test_set_corpus.txt -E test_set_labels.txt -g -1 -b /src/data/spam_dataset

But I have this error:

 - base directory: /src/data/spam_dataset
 - read 93957x1024 elements in train_set_corpus.txt
 - read 278471 labels [0,1] in train_set_labels.txt
Traceback (most recent call last):
  File "sent_classif.py", line 188, in <module>
    dim=args.dim, bsize=args.bsize, shuffle=True)
  File "sent_classif.py", line 41, in LoadData
    D = data_utils.TensorDataset(torch.from_numpy(x), torch.from_numpy(lbl))
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataset.py", line 36, in __init__
    assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
AssertionError
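A likely cause (an assumption based on the log above): the assertion fires because the number of embedding rows and the number of labels differ (93957 vs 278471). A minimal sanity check, assuming the embeddings were written as raw float32 vectors of dimension 1024 (file names are hypothetical):

import numpy as np

x = np.fromfile("train_set_corpus.enc", dtype=np.float32).reshape(-1, 1024)
y = np.loadtxt("train_set_labels.txt", dtype=np.int64)
assert x.shape[0] == y.shape[0], f"{x.shape[0]} embeddings vs {y.shape[0]} labels"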

Source code for system training

Hi! Could you please open-source the code that was used to train the full encoder-decoder system?

It would be extremely helpful for those who work on improving embeddings itself and want to use your work as a starting point.

Thank you

Document similarity task using LASER

Hello,

Task: Given a document (each document has ~ 8-30 sentences), find k most similar documents from the dataset.

Language: English

Approach: Each document is converted to an embedding vector using LASER. Given N query vectors, I am using the FAISS library to find the k nearest neighbours. The experiment is then repeated using TFIDF (nltk TfidfVectorizer) to embed the documents instead of LASER.

Observation: The TFIDF approach performs better than LASER. I am using recall@k with k = 1..5 as the evaluation metric. However, I was expecting LASER to perform better.

Am I missing something? Can someone share some insight on how this might be the case?
Any relevant reference is also welcome.

Thanks.
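For reference, a minimal FAISS setup for this kind of k-nearest-neighbour search over normalized document embeddings (illustrative random data, not the pipeline described above) could look like this:

import faiss
import numpy as np

d = 1024
doc_embeddings = np.random.rand(1000, d).astype("float32")  # stand-in for LASER document embeddings
faiss.normalize_L2(doc_embeddings)                          # so inner product == cosine similarity

index = faiss.IndexFlatIP(d)
index.add(doc_embeddings)

queries = doc_embeddings[:5]
scores, neighbours = index.search(queries, 5)               # 5 nearest documents per query
print(neighbours.shape)                                     # (5, 5)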

Similarity between Embeddings

Hi,
I am trying to use LASER embeddings to calculate the similarity between sentences. I tried the inner product and the results are not bad, but maybe other measures are better.
What similarity measure would you recommend?
Thanks

unseen words

First of all thanks for providing LASER.
I am trying to get the embeddings of some sentences in Hindi.
The question is: how does the trained biLSTM model handle an unseen word?
If all the words in a sentence are new, how is the sentence embedding calculated?

My understanding is as follows:
The BPE encoder takes care of unknown words. For every word it tries to break the word into subwords, e.g. "loved" is broken into "lov" and "ed", and "loving" is broken into "lov" and "ing". So when a new word appears it is broken into known subwords and the output is computed from those subwords.

(attached image: bpe_example)

Correct me if I am wrong.

Cross-lingual Spam Detection

Hi, I am working on a cross-lingual spam detection task. I have a dataset containing spam/non-spam sentences in different languages. My goal is to detect spam no matter which language it is in.
I was wondering how to approach this problem using LASER. I saw a number of examples, but each of them requires the language for training/testing the model. Is there a way I can generalize the usage of LASER for my purpose?
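Since the embedding space is shared across languages, one common (unofficial) approach is to embed all sentences with LASER and train a single classifier on the embeddings, regardless of language; a minimal sketch (file names and classifier choice are illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

# X: LASER embeddings of training sentences in any mix of languages, shape (n, 1024)
# y: spam / not-spam labels, shape (n,)
X = np.load("train_embeddings.npy")
y = np.load("train_labels.npy")

clf = LogisticRegression(max_iter=1000).fit(X, y)
# Because the embedding space is language-agnostic, the same classifier can
# score embeddings of sentences from any supported language.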

Computing sentence embeddings for other sentences

This is such a cool project, thanks for putting it out! I was wondering what the process is to compute our own embeddings on text. For example, if I have a list of sentences in Portuguese, what format do they need to be in and what do I need to do to compute their embeddings?

About Greek romanization

In the Token function I see that the transliteration package is used

# Romanization (Greek only)
ROMAN_LC = 'python3.6 ' + os.path.join(CURRENT_PATH,'romanize_lc.py') + ' -l '

and piped to the output like:

tok = check_output(
            REM_NON_PRINT_CHAR
            + '|' + NORM_PUNC + lang
            + '|' + DESCAPE
            + '|' + MOSES_TOKENIZER + lang
            + ('| python3.6 -m jieba -d ' if lang == 'zh' else '')
            + ('|' + MECAB + '/bin/mecab -O wakati -b 50000 ' if lang == 'ja' else '')
            + '|' + ROMAN_LC + roman,
            input=line,
            encoding='UTF-8',
            shell=True)

While for zh and ja the extra step is conditional on the language, could this be improved by checking the input language for romanization here as well? Also, the transliterate package handles not only Greek but also Russian, Armenian and other languages, so it could be applied in those cases as well.

About Sentence Encoders

The install_models.sh file downloads 3 files: blstm.ep7.9langs-v1.bpej20k.model.py, ep7.9langs-v1.bpej20k.bin.9xx and ep7.9langs-v1.bpej20k.codes.9xx. mlenc.py says that bpe_codes is a "File with BPE codes (created by learn_bpe.py)", and the research paper describes it as a "20k joint vocabulary for all the nine languages". I created this with learn_bpe.py on my own data, but I don't quite understand how to create the other two: hash_table ("File with hash table for binarization.") and model ("File with trained model used for encoding.").
Any idea how I can create the hash_table and model? I couldn't find any documentation about them or code samples to train them. Thanks in advance.

Decoder Source Code

It seems that the multilingual decoder mentioned in the blog post is not currently in this repository. Do you have plans to open source it?

Embedding Killed, empty raw file

While trying to execute the example for sentence embeddings I get this console log:

./embed.sh ${LASER}/data/tatoeba/v1/tatoeba.fra-eng.fra fr 01.raw
 - Encoder: loading /home/ubuntu/LASER/models/bilstm.93langs.2018-12-26.pt
 - Tokenizer:  in language fr  
 - fast BPE: processing tok
 - Encoder: bpe to 01.raw
./embed.sh: line 43:  7999 Done                    cat $ifile
      8000 Killed                  | python3 ${LASER}/source/embed.py --encoder ${encoder} --token-lang ${lang} --bpe-codes ${bpe_codes} --output ${ofile} --verbose

The resulting raw file is empty (0 Bytes).

I also tried to invoke the Python script manually, to no avail. Unfortunately, no further explanation seems to be logged.

How can I fix this?

Btw: Many thanks for sharing your work and code!

multi-gpu in embed.py

Does embed.py support multi-GPU? I need to get sentence embeddings for a rather large corpus (40M+ sentences) and was looking into embed.py to see if it supports multi-GPU to speed up the process. But it seems that it can currently run only on a single GPU, right? May I ask if there is any script for this purpose?

Thank you,
Amin

embed task: number of embedding more than given sentences

I used ./embed.sh ./es_tass1.txt es es_tass1.raw to run the embedding example.
It executed successfully with the following info:

 - Encoder: loading /app//models/bilstm.93langs.2018-12-26.pt
 - Tokenizer:  in language es  
 - fast BPE: processing tok
 - Encoder: bpe to es_tass1.raw
 - Encoder: 7217 sentences in 6s

The encoder encoded the given 7217 sentences.

But while reading the raw embeddings as per the script given here, I'm getting 7219 embeddings.

What could be the reason for getting 2 extra embeddings?
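For reference, the raw output file is a flat array of float32 values, so the number of embeddings can be checked like this (a minimal sketch, assuming dimension 1024); a mismatch with the sentence count usually points to extra lines, e.g. empty ones, in the input file:

import numpy as np

dim = 1024
embeddings = np.fromfile("es_tass1.raw", dtype=np.float32)
assert embeddings.shape[0] % dim == 0
embeddings = embeddings.reshape(-1, dim)
print(embeddings.shape[0])  # compare against the number of input sentences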

assertion error

While running bash ./xnli.sh I am getting this error:
Traceback (most recent call last):
File "/home/appuser/sheshank.k/projects/laser/source/nli.py", line 228, in
dim=args.dim, bsize=args.bsize, shuffle=True)
File "/home/appuser/sheshank.k/projects/laser/source/nli.py", line 57, in LoadDataNLI
D = data_utils.TensorDataset(torch.from_numpy(nli), torch.from_numpy(lbl))
File "/home/appuser/.local/lib/python3.6/site-packages/torch/utils/data/dataset.py", line 36, in init
assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
AssertionError

Problems with replicating MLDoc results

Thank you for open-sourcing your work. We are trying to test MLDoc zero-shot performance, and simply running mldoc.sh does not reproduce the results. I've run it twice and got the same results, which are far from expected:

Train en de es fr it ru zh
en: 90.88 86.48 67.62 61.98 69.95 22.95 11.65
de: 73.23 92.90 77.23 74.05 72.30 24.80 9.93
es: 65.62 80.58 92.03 73.28 69.03 34.10 12.58
fr: 78.35 85.45 78.20 89.68 69.85 33.88 9.68
it: 73.93 84.58 79.23 76.73 84.03 34.48 11.83
ru: 57.33 63.78 45.80 52.78 51.15 66.08 36.28
zh: 26.15 28.13 21.88 29.33 30.58 34.38 75.62

Do you happen to have the original MLDoc models stored somewhere? If so, could you share them? It would help us debug the issue.

Code for BUCC task

Thank you for releasing the code.

Any idea when the code for bitext mining will be released?

separate initial loading and per-line operations

In order to support interactive mode and/or run-time querying from other languages, it would be ideal if the code under if __name__ == '__main__': in a task like embed had an initial load and then processed each line from stdin as soon as it arrived.
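The laser_encoders package described at the top of this page makes this pattern straightforward; a minimal sketch of an interactive loop (load once, then embed each stdin line as it arrives; the language code is an example):

import sys
from laser_encoders import LaserEncoderPipeline

encoder = LaserEncoderPipeline(lang="eng_Latn")   # initial load happens once

for line in sys.stdin:                            # then process each line as it arrives
    line = line.strip()
    if line:
        emb = encoder.encode_sentences([line])[0]
        sys.stdout.write(" ".join(f"{v:.6f}" for v in emb) + "\n")
        sys.stdout.flush()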

Error: Get BPE code in Chinese

Hi,
I can't apply the BPE codes to Chinese text:

./fast applybpe data/embd/lcqmc.dev.prem.bpe.zh data/embd/lcqmc.dev.prem.tok.zh LaserModel/models/93langs.fcodes LaserModel/models/93langs.fvocab

This results in the following error:
Read 2369862597 words (73636 unique) from vocabulary file.
Loading codes from LaserModel/models/93langs.fcodes ...
Read 50000 codes from the codes file.
Loading vocabulary from data/embd/lcqmc.dev.prem.tok.zh ...
Read 328980 words (235 unique) from text file.
fast: fast.cc:486: void decompose(std::__cxx11::string, std::vector<std::__cxx11::basic_string<char> >&, const std::unordered_map<std::__cxx11::basic_string<char>, std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> > >&, const std::unordered_map<std::__cxx11::basic_string<char>, unsigned int>&, bool): Assertion `count == 1' failed.
Aborted (core dumped)

But with English it works well:
./fast applybpe data/embd/mrpc.dev.prem.bpe.en data/embd/mrpc.dev.prem.tok.en LaserModel/models/93langs.fcodes LaserModel/models/93langs.fvocab
Read 2369862597 words (73636 unique) from vocabulary file.
Loading codes from LaserModel/models/93langs.fcodes ...
Read 50000 codes from the codes file.
Loading vocabulary from /data/embd/mrpc.dev.prem.tok.en ...
Read 38776 words (7710 unique) from text file.
Applying BPE to data/embd/mrpc.dev.prem.tok.en ...
Modified 38776 words from text file.

Thanks.

Missing files for xnli.sh

I tried running xnli.sh to replicate the XNLI results and got the following errors:

LASER: training and evaluation for XNLI

  • loading encoder /group/project/s1782911/LASER/models/bilstm.93langs.2018-12-26.pt

Processing train:

  • Tokenizer: xnli.train.prem.tok.en exists already
  • fast BPE: processing xnli.train.prem.tok.en
  • Encoder: xnli.train.prem.bpe.en to xnli.train.prem.enc.en
    Traceback (most recent call last):
    File "xnli.py", line 91, in
    buffer_size=args.buffer_size)
    File "/group/project/s1782911/LASER/source/embed.py", line 295, in EncodeFile
    fin = open(inp_fname, 'r') if len(inp_fname) > 0 else sys.stdin
    FileNotFoundError: [Errno 2] No such file or directory: 'embed/xnli.train.prem.bpe.en'

Training the classifier (see embed/xnli.log)
Traceback (most recent call last):
File "/group/project/s1782911/LASER/source/nli.py", line 228, in
dim=args.dim, bsize=args.bsize, shuffle=True)
File "/group/project/s1782911/LASER/source/nli.py", line 36, in LoadDataNLI
x = np.fromfile(fn1, dtype=np.float32, count=-1)
FileNotFoundError: [Errno 2] No such file or directory: 'embed/xnli.train.prem.enc.en'

Do you have an idea what might cause these errors and what I can do about them?
Thanks in advance!

Sentence embeddings – code typo

In the README.md section «LASER: calculation of sentence embeddings» there seems to be a small typo in the code:

Shouldn't this:

bash ./embed.sh INPUT-FILE LANGUGAE OUTPUT-FILE

be this?

bash ./embed.sh INPUT-FILE LANGUAGE OUTPUT-FILE

By the way: Thanks very much for sharing LASER! Much appreciated.

Appending the wrong path in tasks/xnli

Hi.

I found that LASER/tasks/xnli/xnli.py line 29 refers to a wrong path.

Currently:
sys.path.append(LASER + '/source/tools')

Maybe better:
sys.path.append(LASER + '/source/lib')

Python subprocess.Popen “OSError: [Errno 12] Cannot allocate memory”

This problem happens when running multiple concurrent calls to the embedding calculation in TokenLine:

def TokenLine(line, lang='en', lower_case=True, romanize=False):
    assert lower_case, 'lower case is needed by all the models'
    roman = lang if romanize else 'none'
    tok = check_output(
            REM_NON_PRINT_CHAR
            + '|' + NORM_PUNC + lang
            + '|' + DESCAPE
            + '|' + MOSES_TOKENIZER + lang
            + ('| python3.6 -m jieba -d ' if lang == 'zh' else '')
            + ('|' + MECAB + '/bin/mecab -O wakati -b 50000 ' if lang == 'ja' else '')
            + '|' + ROMAN_LC + roman,
            input=line,
            encoding='UTF-8',
            shell=True)
    return tok.strip()

Here is the detailed stack trace:

  File "/usr/lib/python3.6/subprocess.py", line 1275, in _execute_child
    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory
INFO:default:called LaserEmbeddingHandler...
ERROR:default:[Errno 12] Cannot allocate memory
Traceback (most recent call last):
  File "/tornado_api/handlers/embeddingLaserHandler.py", line 195, in post
    embeddings = vector_embedding.embedding_line(model=model_laser,lang=lang,bpe_codes=QUALITY_MODEL_PATH + "/93langs.fcodes",input_text=text)
  File "/tornado_api/deeplearning/vector_embedding.py", line 50, in embedding_line
    embeddings.append(t.result()[0].tolist() )
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/tornado_api/deeplearning/similarity_search.py", line 99, in pipeline
    lower_case=lower_case)
  File "/tornado_api/deeplearning/lib/text_processing.py", line 62, in TokenLine
    shell=True)
  File "/usr/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 403, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1275, in _execute_child
    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory

similarity.py is missing

I'm trying to install and run the similarity task but I got this error:

bash ./wmt.sh
python3: can't open file '/LASER//source/similarity_search.py': [Errno 2] No such file or directory
