
sent2vec's Introduction

Updates

Code and pre-trained models related to Bi-Sent2vec, the cross-lingual extension of Sent2vec, can be found here.

Sent2vec

TLDR: This library provides numerical representations (features) for words, short texts, or sentences, which can be used as input to any machine learning task.


Setup and Requirements

Our code builds upon Facebook's fastText library; see also their documentation and Python interfaces.

To compile the library, simply run a make command.

A Cython module allows you to keep the model in memory while inferring sentence embeddings. In order to compile and install the module, run the following from the project root folder:

pip install .

Note -

if you install sent2vec using

$ pip install sent2vec

you will get a different, unrelated package. Please follow the instructions in this README to install this library correctly.
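
To verify that the correct package is installed, a quick sanity check (not an official part of the library, just a suggestion) is to confirm that the module exposes the Sent2vecModel class:

import sent2vec

# The unrelated PyPI package does not provide Sent2vecModel; the Cython module
# built from this repository does.
assert hasattr(sent2vec, 'Sent2vecModel'), 'wrong sent2vec package installed'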

Sentence Embeddings

For the purpose of generating sentence representations, we introduce our sent2vec method and provide code and models. Think of it as an unsupervised version of FastText, and an extension of word2vec (CBOW) to sentences.

The method uses a simple but efficient unsupervised objective to train distributed representations of sentences. The algorithm outperforms state-of-the-art unsupervised models on most benchmark tasks, and on many tasks it even beats supervised models, highlighting the robustness of the produced sentence embeddings; see the paper for more details.

Generating Features from Pre-Trained Models

Directly from Python

If you've installed the Cython module, you can infer sentence embeddings while keeping the model in memory:

import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin')
emb = model.embed_sentence("once upon a time .") 
embs = model.embed_sentences(["first sentence .", "another sentence"])
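
A common downstream use is comparing two sentences by cosine similarity. The sketch below is one possible way to do this, assuming NumPy is installed; it builds on the model loaded in the snippet above, and .ravel() flattens the returned array regardless of its exact shape:

import numpy as np

emb_a = model.embed_sentence("once upon a time .").ravel()
emb_b = model.embed_sentence("once upon a time there was a king .").ravel()
# Cosine similarity; the vectors are not necessarily unit-normalized.
sim = float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
print(sim)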

Text preprocessing (tokenization and lowercasing) is not handled by the module; see wikiTokenize.py for tokenization using NLTK and Stanford NLP.

An alternative to the Cython module is the Python code provided in the get_sentence_embeddings_from_pre-trained_models notebook. It handles tokenization and can be given raw sentences, but does not keep the model in memory.

Running Inference with Multiple Processes

The Cython module also provides an 'inference' mode for loading the model: it loads the model's input matrix into a shared memory segment and skips the output matrix, which is not needed for inference. This is an optimization for the use case of running inference with multiple independent processes, which would otherwise each need to load a copy of the model into their own address space. To use it:

model.load_model('model.bin', inference_mode=True)

The model is loaded into a shared memory segment named after the model file. The model will stay in memory until you explicitly remove the shared memory segment. To do this from Python:

model.release_shared_mem('model.bin')
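
One possible multi-process pattern is sketched below; it assumes each worker loads the model in inference mode, that the shared segment is released once all workers have finished, and that 'model.bin' is a placeholder path to a pre-trained model:

import multiprocessing as mp
import sent2vec

MODEL_PATH = 'model.bin'  # placeholder path to a pre-trained model

def embed_batch(sentences):
    # Each worker loads the model in inference mode; the input matrix is
    # served from the named shared memory segment instead of being copied
    # into every process.
    model = sent2vec.Sent2vecModel()
    model.load_model(MODEL_PATH, inference_mode=True)
    return model.embed_sentences(sentences)

if __name__ == '__main__':
    batches = [["first sentence ."], ["another sentence ."]]
    with mp.Pool(processes=2) as pool:
        embeddings = pool.map(embed_batch, batches)
    # Remove the shared memory segment once no process needs the model anymore.
    model = sent2vec.Sent2vecModel()
    model.load_model(MODEL_PATH, inference_mode=True)
    model.release_shared_mem(MODEL_PATH)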

Using the Command-line Interface

Given a pre-trained model model.bin (see download links below), here is how to generate sentence features for an input text. Use the print-sentence-vectors command; the input text file must contain one sentence per line:

./fasttext print-sentence-vectors model.bin < text.txt

This will output sentence vectors (the features for each input sentence) to the standard output, one vector per line. This can also be used with pipes:

cat text.txt | ./fasttext print-sentence-vectors model.bin
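
If you redirect the command's output to a file (one whitespace-separated vector per line), it can be loaded back into Python with NumPy. This is just one way to consume the output, and 'vectors.txt' is a placeholder name:

import numpy as np

# Produced e.g. by: ./fasttext print-sentence-vectors model.bin < text.txt > vectors.txt
vectors = np.loadtxt('vectors.txt')
print(vectors.shape)  # (number of sentences, embedding dimension)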

Downloading Sent2vec Pre-Trained Models

(as used in the NAACL 2018 paper)

Note: users who downloaded models prior to this release will encounter compatibility issues when trying to use the old models with the latest commit. Those users can still use the code in the release to keep using old models.

Tokenizing

Both feature generation (as above) and training (as below) require the input texts (sentences) to be already tokenized. To tokenize and preprocess text for the above models, you can use

python3 tweetTokenize.py <tweets_folder> <dest_folder> <num_process>

for tweets, or the following for Wikipedia:

python3 wikiTokenize.py corpora > destinationFile

Note: for wikiTokenize.py, set the SNLP_TAGGER_JAR parameter to the path of stanford-postagger.jar, which you can download here.
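
If you only need a rough approximation of this preprocessing (the provided scripts remain the reference pipeline for the released models), a simple sketch using NLTK's TweetTokenizer might look like the following; the tokenizer choice and file names are assumptions, not the exact pipeline used here:

from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()

def preprocess(line):
    # Lowercase and tokenize, producing one space-separated sentence per line,
    # which is the input format sent2vec expects.
    return ' '.join(tokenizer.tokenize(line.lower()))

with open('raw.txt') as src, open('tokenized.txt', 'w') as dst:
    for line in src:
        dst.write(preprocess(line) + '\n')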

Train a New Sent2vec Model

To train a new sent2vec model, you first need a large training text file containing one sentence per line. The provided code does not perform tokenization or lowercasing; you have to preprocess your input data yourself, as described above.

You can then train a new model. Here is an example command (a Python sketch for launching training programmatically appears after the argument list below):

./fasttext sent2vec -input wiki_sentences.txt -output my_model -minCount 8 -dim 700 -epoch 9 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 20 -t 0.000005 -dropoutK 4 -minCountLabel 20 -bucket 4000000 -maxVocabSize 750000 -numCheckPoints 10

Here is a description of all available arguments:

sent2vec -input train.txt -output model

The following arguments are mandatory:
  -input              training file path
  -output             output file path

The following arguments are optional:
  -lr                 learning rate [0.2]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                dimension of word and sentence vectors [100]
  -epoch              number of epochs [5]
  -minCount           minimal number of word occurrences [5]
  -minCountLabel      minimal number of label occurrences [0]
  -neg                number of negatives sampled [10]
  -wordNgrams         max length of word ngram [2]
  -loss               loss function {ns, hs, softmax} [ns]
  -bucket             number of hash buckets for vocabulary [2000000]
  -thread             number of threads [2]
  -t                  sampling threshold [0.0001]
  -dropoutK           number of ngrams dropped when training a sent2vec model [2]
  -verbose            verbosity level [2]
  -maxVocabSize       vocabulary exceeding this size will be truncated [None]
  -numCheckPoints     number of intermediary checkpoints to save when training [1]
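
If you prefer to launch training from Python, one option is simply to shell out to the compiled binary. This is a sketch under the assumption that the fasttext executable has been built in the current directory; file names and hyperparameters are placeholders mirroring the example command above:

import subprocess

# Mirrors the example training command above; adjust paths and hyperparameters as needed.
subprocess.run([
    './fasttext', 'sent2vec',
    '-input', 'wiki_sentences.txt',
    '-output', 'my_model',
    '-minCount', '8', '-dim', '700', '-epoch', '9',
    '-lr', '0.2', '-wordNgrams', '2', '-loss', 'ns',
    '-neg', '10', '-thread', '20', '-t', '0.000005',
    '-dropoutK', '4', '-bucket', '4000000',
], check=True)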

Nearest Neighbour Search and Analogies

Given a pre-trained model model.bin, here is how to use these features. For the nearest-neighbour sentence feature, you need the model as well as a corpus in which to search for the nearest neighbouring sentence to your input sentence. We use cosine distance as our distance metric. Use the nnSent command; the input should be one sentence per line:

./fasttext nnSent model.bin corpora [k] 

k is optional and is the number of nearest sentences that you want to output.

For analogiesSent, the user inputs three sentences A, B, and C, and the tool finds the sentence D from the corpus that best completes the A:B::C:D analogy pattern.

./fasttext analogiesSent model.bin corpora [k]

k is optional and is the number of nearest sentences that you want to output.

Unigram Embeddings

For the purpose of generating word representations, we compared word embeddings obtained by training sent2vec models with other word embedding models, including a novel method we refer to as CBOW char + word ngrams (cbow-c+w-ngrams). This method augments FastText's character-augmented CBOW with word n-grams. You can see the full comparison of results in this paper.

Extracting Word Embeddings from Pre-Trained Models

If you have the Cython wrapper installed, the following functions let you work with word embeddings obtained from sent2vec or cbow-c+w-ngrams models:

import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin') # The model can be sent2vec or cbow-c+w-ngrams
vocab = model.get_vocabulary() # Return a dictionary with words and their frequency in the corpus
uni_embs, vocab = model.get_unigram_embeddings() # Return the full unigram embedding matrix
uni_embs = model.embed_unigrams(['dog', 'cat']) # Return unigram embeddings given a list of unigrams

Asking for a unigram embedding not present in the vocabulary returns a zero vector in the case of sent2vec. The cbow-c+w-ngrams method can use character n-grams to infer a representation for out-of-vocabulary words.
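
A minimal way to observe this behaviour, assuming NumPy and a sent2vec (not cbow-c+w-ngrams) model loaded as above; 'zzzunseenword' is a made-up token assumed to be out of vocabulary:

import numpy as np

embs = model.embed_unigrams(['dog', 'zzzunseenword'])
# For a sent2vec model, the out-of-vocabulary unigram maps to an all-zero vector.
print(np.allclose(embs[1], 0.0))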

Downloading Pre-Trained Models

Coming soon.

Train a CBOW Character and Word Ngrams Model

Very similar to the sent2vec instructions. A plausible command would be:

./fasttext cbow-c+w-ngrams -input wiki_sentences.txt -output my_model -lr 0.05 -dim 300 -ws 10 -epoch 9 -maxVocabSize 750000 -thread 20 -numCheckPoints 20 -t 0.0001 -neg 5 -bucket 4000000 -bucketChar 2000000 -wordNgrams 3 -minn 3 -maxn 6

References

When using this code or some of our pre-trained models for your application, please cite the following paper for sentence embeddings:

Matteo Pagliardini, Prakhar Gupta, Martin Jaggi, Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features, NAACL 2018

@inproceedings{pgj2017unsup,
  title = {{Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features}},
  author = {Pagliardini, Matteo and Gupta, Prakhar and Jaggi, Martin},
  booktitle={NAACL 2018 - Conference of the North American Chapter of the Association for Computational Linguistics},
  year={2018}
}

For word embeddings:

Prakhar Gupta, Matteo Pagliardini, Martin Jaggi, Better Word Embeddings by Disentangling Contextual n-Gram Information, NAACL 2019

@inproceedings{DBLP:conf/naacl/GuptaPJ19,
  author    = {Prakhar Gupta and
               Matteo Pagliardini and
               Martin Jaggi},
  title     = {Better Word Embeddings by Disentangling Contextual n-Gram Information},
  booktitle = {{NAACL-HLT} {(1)}},
  pages     = {933--939},
  publisher = {Association for Computational Linguistics},
  year      = {2019}
}

sent2vec's People

Contributors

abloch, fethbita, guptaprkhr, helmuthj, ivonindza, jbcdnr, jjoller, jllan, martinjaggi, mpagli, pavan-turlapati, pidugusundeep, vsolovyov


sent2vec's Issues

Does nnSent take word n-grams into account?

I trained the model with wordNgrams set to 2. I tried the same input sentence with permuted word order. I would expect the results to be at least slightly different, since the word n-grams are different, but they are precisely the same. Are the word n-grams not taken into account here?

Query sentence? 
the capital of austria is vienna .
0.721763 964960 the capital of anzoátegui is barcelona .  
0.695541 965465 herisau is the capital of the swiss canton of appenzell ausserrhoden .  
0.682082 964936 it is the capital of the swiss canton of basel - landschaft .  
0.675558 963181 altmann was born in vienna , austria .  
0.624039 965487 its capital is granada .  
0.71819 944282 he is from vienna , austria .  
0.643779 949365 it was the capital of the historic catalan comarque of conflent .  
0.64149 949344 it was the capital of the historic catalan comarque of vallespir .  
0.576923 925080 she was the wife of archduke franz ferdinand of austria .  
0.567474 922188 schell was born on 8 december 1930 in vienna , austria .  
0.562906 898471 it is the capital of the kapilvastu district , in the lumbini zone .  
0.820954 760517 the capital of tyrol is innsbruck .  

Query sentence? 
vienna is the capital of austria .
0.721763 964960 the capital of anzoátegui is barcelona .  
0.695541 965465 herisau is the capital of the swiss canton of appenzell ausserrhoden .  
0.682082 964936 it is the capital of the swiss canton of basel - landschaft .  
0.675558 963181 altmann was born in vienna , austria .  
0.624039 965487 its capital is granada .  
0.71819 944282 he is from vienna , austria .  
0.643779 949365 it was the capital of the historic catalan comarque of conflent .  
0.64149 949344 it was the capital of the historic catalan comarque of vallespir .  
0.576923 925080 she was the wife of archduke franz ferdinand of austria .  
0.567474 922188 schell was born on 8 december 1930 in vienna , austria .  
0.562906 898471 it is the capital of the kapilvastu district , in the lumbini zone .  
0.820954 760517 the capital of tyrol is innsbruck .  

Query sentence? 
capital austria vienna is of the .
0.721763 964960 the capital of anzoátegui is barcelona .  
0.695541 965465 herisau is the capital of the swiss canton of appenzell ausserrhoden .  
0.682082 964936 it is the capital of the swiss canton of basel - landschaft .  
0.675558 963181 altmann was born in vienna , austria .  
0.624039 965487 its capital is granada .  
0.71819 944282 he is from vienna , austria .  
0.643779 949365 it was the capital of the historic catalan comarque of conflent .  
0.64149 949344 it was the capital of the historic catalan comarque of vallespir .  
0.576923 925080 she was the wife of archduke franz ferdinand of austria .  
0.567474 922188 schell was born on 8 december 1930 in vienna , austria .  
0.562906 898471 it is the capital of the kapilvastu district , in the lumbini zone .  
0.820954 760517 the capital of tyrol is innsbruck .  

(I will reply regarding the sorting issue soon)

Comparison of Sent2Vec performance with FastText

@martinjaggi @mpagli @guptaprkhr @menshikh-iv I've been working on a native implementation of Sent2Vec in Gensim. During benchmarking, I came across some unexpected results. While Sent2Vec outperforms Doc2Vec, the average of FastText word vectors for a sentence gives better results on various supervised and unsupervised tasks than the Sent2Vec sentence vector for the same sentence. So, is there a problem with the hyperparameter values, or are these results to be expected?

module 'sent2vec' has no attribute 'Sent2vecModel'

Hi, I trained a sent2vec model last week and was able to use it for sentence similarity. Today I reran the code in jupyter notebook and the line "model = sent2vec.Sent2vecModel()" produces the error "module 'sent2vec' has no attribute 'Sent2vecModel'". Do you know what the problem is?

Install instructions for python

Hi,
I want to install the python version of sent2vec. Here are the provided instructions from your site:

python setup.py build_ext
sudo pip install .

But there is no setup.py file in the repository. Should I copy fastText's setup.py in order to install sent2vec's Python version?

How can I calculate the amount of RAM and hard-drive space needed to successfully train and save a model?

Hello! How can I calculate how much RAM and how much free hard-drive space I need during training?

For example:
I have a 100 GB .txt file (docs + tweets), nearly 355,000,000 docs. Does your model store in RAM one vector for each doc plus a dictionary for all words from these docs?

  1. Can you please explain what your model stores in RAM during training?

  2. And how does the final model size depend on the size of the input file or the number of docs in it?

  3. And how can I get an approximate estimate of the RAM/disk requirements?

  4. Can you please provide a clear explanation in terms like the answer from this post about doc2vec:
    https://stackoverflow.com/questions/45943832/gensim-doc2vec-finalize-vocab-memory-error

SegFault while training on Toronto Corpus

I'm encountering a segfault while training on the Toronto Book Corpus.
For training I'm using:
./fasttext sent2vec -input ../input.txt -output my_model -minCount 5 -dim 700 -epoch 12 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 4 -t 0.000005 -dropoutK 7 -bucket 2000000

It runs for a few minutes and then this shows up:
Progress: 0.0% words/sec/thread: 5160 lr: 0.199927 loss: 3.388311 eta: 170h56m Segmentation fault: 11

I'm working on a mid-2012 MacBook Air with 4 cores.

Training process fault

../sent2vec/fasttext sent2vec -input data/preprocessed_docs/result_0.txt -output my_model_s2v -minCount 10 -dim 700 -epoch 7 -lr 0.3 -wordNgrams
2 -loss ns -neg 10 -thread 48 -t 0.00005 -dropoutK 4 -bucket 10000
Read 227M words
Number of words: 374566
Number of labels: 0
Progress: 19.0% words/sec/thread: 59 lr: 0.242889 loss: 20.702034 eta: 125h56m Segmentation fault

I trained with these parameters, but the training process ended with the message:

"Segmentation fault"

Can you please explain why?

Machine parameters:
48 cores, 126Gb RAM. 1.5Tb of free disk space.

Size of the input file: 1.5Gb

Memory Size

When I use twitter_bigrams.bin, I got an error:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

It seems like it is out of memory, so does it mean I need 23 GB memory when I use twitter_bigrams.bin?

Thanks.

embed_sentence returns float64

The Python module returns float64 from the embed_sentence function and float32 from embed_sentences. I don't think this should be the expected behavior, as it makes certain operations harder. I have to write this:
v = sent2vec_model.embed_sentence(text).reshape(1, -1).astype(np.float32)
instead of this:
v = sent2vec_model.embed_sentence(text).reshape(1, -1)
I think it would be better if the two functions (embed_sentence and embed_sentences) were consistent.

How each sentence form should be?

Hello, I am confused about what form each sentence should take. For example, take the sentence "It is important for machine learning." After tokenizing and lowercasing, it becomes the list [ it, is, important, for, machine learning ]. I then want to write this result into a file, one sentence per line. Should it be "it, is, important, for, machine learning", i.e. should each word be separated by a comma?

How to save model on each epoch?

Is it possible to save the model during training (e.g. using callbacks at the end of each epoch)?

  1. And how can I enable logging in your model?
    With the verbosity option, not a lot of information is available.
  2. And is it possible to train a model on 2 or more documents as input at once?

Also:
3. Is it possible to trigger building a new model from Python, or only from the command line?

Thanks in advance.

Training fails without any error message for most hyperparameter combinations on Windows

I've built the package on Windows using GnuWin.

My first issue was with the pre-trained models that were on the GitHub page, when I tried to use them to encode sentences, I got an assertion failed error : Assertion failed: (counts.size() == osz_). I never got around this.

Later, I was trying to train a model on a subset of Wikipedia restricted to a certain domain, so that the embedding is domain-specific. The training fails without any error shown for most hyperparameter combinations, and no bin file is output. Only with very select hyperparameter combinations does the training complete, and even then it sometimes gets stuck after reaching 100% and nothing is output. The attached image below shows a training cycle which reached 100% and then hasn't output anything for 4 days.

(screenshot not available)

My input dataset is about 221 MB in size, with about 120k words having a word count > 5.

What should 'input.txt' look like?

Hello! I found your answer about how each sentence should look:

it is important for machine_learning

But if I want to build my own model, what should I feed to sent2vec?

../sent2vec/fasttext sent2vec -input data/documents_mixed.txt -output my_model -minCount 8 -dim 700 -epoch 10 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 36 -t 0.000005 -dropoutK 4 -minCountLabel 20 -bucket 4000000

How do the sentences need to be separated from each other?

For example:

1 sentence: "i love my dog"
2 sentence: "my dog loves me"

input.txt: "i love my dog", "my dog loves me"
or: "i love my dog" "my dog loves me"
or "i love my dog" (new line)
"my dog loves me"

Thank you in advance!

And one more question:
For a very large corpus (84 GB of tweets and web docs), what are your recommendations for the parameters t, dropoutK and bucket?
And is it possible to linearly decay the learning rate to a particular minimum value?

No similar sentences returned when using fasttext to search for nearest neighbours

When using fasttext to search for the nearest neighbours, there are cases where no similar sentence is retrieved even though a sentence identical to the one I am feeding as input already exists in the collection of sentences I am searching. What is the explanation for this, and in which scenarios does it occur? Is this a limitation of the model or a bug?

nnSent segfault?

I'm still getting this segfault. My machine seems to have lots of free memory and I have no idea what's happening. Any help?

$ ./fasttext nnSent torontobooks_unigrams.bin 1 < sentence
Pre-computing sentence vectors... done.
Segmentation fault (core dumped)

No module named Cython.Build

When I run sudo pip install . in src folder after running python setup.py build_ext, I got the following error:

Running setup.py (path:/tmp/pip-IRGssa-build/setup.py) egg_info for package from file:///home/hao/Workplace/HaoXu/Library/sent2vec/src
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/tmp/pip-IRGssa-build/setup.py", line 2, in <module>
        from Cython.Build import cythonize
    ImportError: No module named Cython.Build
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):

  File "<string>", line 17, in <module>

  File "/tmp/pip-IRGssa-build/setup.py", line 2, in <module>

    from Cython.Build import cythonize

ImportError: No module named Cython.Build

When I run python setup.py build_ext, I got the following warning:

Compiling sent2vec.pyx because it changed.
[1/1] Cythonizing sent2vec.pyx
running build_ext
building 'sent2vec' extension
creating build
creating build/temp.linux-x86_64-3.6
gcc -pthread -B /home/sunday/anaconda3/envs/py36/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/sunday/anaconda3/envs/py36/lib/python3.6/site-packages/numpy/core/include -I/home/sunday/anaconda3/envs/py36/include/python3.6m -c sent2vec.cpp -o build/temp.linux-x86_64-3.6/sent2vec.o -std=c++0x -Wno-cpp -pthread -Wno-sign-compare
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -B /home/sunday/anaconda3/envs/py36/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/sunday/anaconda3/envs/py36/lib/python3.6/site-packages/numpy/core/include -I/home/sunday/anaconda3/envs/py36/include/python3.6m -c fasttext.cc -o build/temp.linux-x86_64-3.6/fasttext.o -std=c++0x -Wno-cpp -pthread -Wno-sign-compare
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
fasttext.cc: In member function ‘void fasttext::FastText::findNNSent(const fasttext::Matrix&, const fasttext::Vector&, int32_t, const std::set<std::__cxx11::basic_string<char> >&, int64_t, const std::vector<std::__cxx11::basic_string<char> >&)’:
fasttext.cc:587:10: warning: variable ‘it’ set but not used [-Wunused-but-set-variable]
     auto it = banSet.find(heap.top().second);
          ^
gcc -pthread -B /home/sunday/anaconda3/envs/py36/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/sunday/anaconda3/envs/py36/lib/python3.6/site-packages/numpy/core/include -I/home/sunday/anaconda3/envs/py36/include/python3.6m -c args.cc -o build/temp.linux-x86_64-3.6/args.o -std=c++0x -Wno-cpp -pthread -Wno-sign-compare
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -B /home/sunday/anaconda3/envs/py36/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/sunday/anaconda3/envs/py36/lib/python3.6/site-packages/numpy/core/include -I/home/sunday/anaconda3/envs/py36/include/python3.6m -c dictionary.cc -o build/temp.linux-x86_64-3.6/dictionary.o -std=c++0x -Wno-cpp -pthread -Wno-sign-compare
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -B /home/sunday/anaconda3/envs/py36/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/sunday/anaconda3/envs/py36/lib/python3.6/site-packages/numpy/core/include -I/home/sunday/anaconda3/envs/py36/include/python3.6m -c matrix.cc -o build/temp.linux-x86_64-3.6/matrix.o -std=c++0x -Wno-cpp -pthread -Wno-sign-compare
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -B /home/sunday/anaconda3/envs/py36/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/sunday/anaconda3/envs/py36/lib/python3.6/site-packages/numpy/core/include -I/home/sunday/anaconda3/envs/py36/include/python3.6m -c qmatrix.cc -o build/temp.linux-x86_64-3.6/qmatrix.o -std=c++0x -Wno-cpp -pthread -Wno-sign-compare
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -B /home/sunday/anaconda3/envs/py36/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/sunday/anaconda3/envs/py36/lib/python3.6/site-packages/numpy/core/include -I/home/sunday/anaconda3/envs/py36/include/python3.6m -c model.cc -o build/temp.linux-x86_64-3.6/model.o -std=c++0x -Wno-cpp -pthread -Wno-sign-compare
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -B /home/sunday/anaconda3/envs/py36/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/sunday/anaconda3/envs/py36/lib/python3.6/site-packages/numpy/core/include -I/home/sunday/anaconda3/envs/py36/include/python3.6m -c real.cc -o build/temp.linux-x86_64-3.6/real.o -std=c++0x -Wno-cpp -pthread -Wno-sign-compare
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -B /home/sunday/anaconda3/envs/py36/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/sunday/anaconda3/envs/py36/lib/python3.6/site-packages/numpy/core/include -I/home/sunday/anaconda3/envs/py36/include/python3.6m -c utils.cc -o build/temp.linux-x86_64-3.6/utils.o -std=c++0x -Wno-cpp -pthread -Wno-sign-compare
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -B /home/sunday/anaconda3/envs/py36/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/sunday/anaconda3/envs/py36/lib/python3.6/site-packages/numpy/core/include -I/home/sunday/anaconda3/envs/py36/include/python3.6m -c vector.cc -o build/temp.linux-x86_64-3.6/vector.o -std=c++0x -Wno-cpp -pthread -Wno-sign-compare
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -B /home/sunday/anaconda3/envs/py36/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/sunday/anaconda3/envs/py36/lib/python3.6/site-packages/numpy/core/include -I/home/sunday/anaconda3/envs/py36/include/python3.6m -c real.cc -o build/temp.linux-x86_64-3.6/real.o -std=c++0x -Wno-cpp -pthread -Wno-sign-compare
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -B /home/sunday/anaconda3/envs/py36/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/sunday/anaconda3/envs/py36/lib/python3.6/site-packages/numpy/core/include -I/home/sunday/anaconda3/envs/py36/include/python3.6m -c productquantizer.cc -o build/temp.linux-x86_64-3.6/productquantizer.o -std=c++0x -Wno-cpp -pthread -Wno-sign-compare
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
creating build/lib.linux-x86_64-3.6
g++ -pthread -shared -B /home/sunday/anaconda3/envs/py36/compiler_compat -L/home/sunday/anaconda3/envs/py36/lib -Wl,-rpath=/home/sunday/anaconda3/envs/py36/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.6/sent2vec.o build/temp.linux-x86_64-3.6/fasttext.o build/temp.linux-x86_64-3.6/args.o build/temp.linux-x86_64-3.6/dictionary.o build/temp.linux-x86_64-3.6/matrix.o build/temp.linux-x86_64-3.6/qmatrix.o build/temp.linux-x86_64-3.6/model.o build/temp.linux-x86_64-3.6/real.o build/temp.linux-x86_64-3.6/utils.o build/temp.linux-x86_64-3.6/vector.o build/temp.linux-x86_64-3.6/real.o build/temp.linux-x86_64-3.6/productquantizer.o -o build/lib.linux-x86_64-3.6/sent2vec.cpython-36m-x86_64-linux-gnu.so

Python version: Python 3.6.6
GCC: 7.2.0

call is not working even with fasttext provided by this repository

The call is returning 2 as its output, so it's not running properly. When I checked the output of the call using the check_output method in subprocess, it says the command returned a non-zero exit status. I am attaching the trace of the error.

CalledProcessError Traceback (most recent call last)
in ()
1 sentences = ['Once upon a time.', 'And now for something completely different.']
2
----> 3 my_embeddings = get_sentence_embeddings(sentences, 'unigrams', 'wiki')
4 print(my_embeddings.shape)

in get_sentence_embeddings(sentences, ngram, model)
67
68 if ngram == 'unigrams':
---> 69 wiki_embeddings = get_embeddings_for_preprocessed_sentences(tokenized_sentences_SNLP, MODEL_WIKI_UNIGRAMS, FASTTEXT_EXEC_PATH)
70 print "came to if"
71 else:

in get_embeddings_for_preprocessed_sentences(sentences, model_path, fasttext_exec_path)
13 model_path + ' < '+
14 test_path + ' > ' +
---> 15 embeddings_path, shell=True)
16 embeddings = read_embeddings(embeddings_path)
17 os.remove(test_path)

/home/vaijenath/anaconda2/lib/python2.7/subprocess.pyc in check_output(*popenargs, **kwargs)
571 if cmd is None:
572 cmd = popenargs[0]
--> 573 raise CalledProcessError(retcode, cmd, output=output)
574 return output
575

CalledProcessError: Command '/home/vaijenath/Vaiju/Project/all code and data/My Code/sent2vec-master/src/fasttext print-sentence-vectors /home/vaijenath/Vaiju/Project/all code and data/My Code/TrainedModels/wiki_unigrams.bin < /home/vaijenath/Vaiju/Project/all code and data/My Code/sent2vec-master/1511975965.08_fasttext.test.txt > /home/vaijenath/Vaiju/Project/all code and data/My Code/sent2vec-master/1511975965.08_fasttext.embeddings.txt' returned non-zero exit status 2

where /home/vaijenath/Vaiju/Project/all code and data/My Code/sent2vec-master/ is the path from which I am running this file.

Need help on fasttext.h atomic error

Hi,

I know this is a silly problem, but I just can't figure out how to solve it.

In file included from sent2vec.cpp:315:

./fasttext.h:18:10: fatal error: 'atomic' file not found
#include <atomic>
         ^~~~~~~~

1 error generated.
error: command '/usr/bin/clang' failed with exit status 1

How could I search better parameters? (or add getLoss() to python interface)

I'm trying to train my unsupervised sent2vec model with Japanese text. Though the paper shows good parameters for English wikipedia, tweets, etc, they might not be good for my Japanese corpus. To search for good parameters, I'd like to know the loss of a trained model.

I guess the getLoss() method might be for this, but I could not find a Python equivalent. Is it possible to add this method to the Python interface?

Thank you.

Error when running "make"

Hello!
When I try to compile the library as Setup & Requirements suggests by running the make command, errors pop up as below:

c++ -pthread -std=c++0x -O3 -funroll-loops -c src/dictionary.cc
In file included from src/dictionary.cc:10:
src/dictionary.h:55: error: ISO C++ forbids initialization of member ‘pruneidx_size_’
src/dictionary.h:55: error: making ‘pruneidx_size_’ static
src/dictionary.h:55: error: ISO C++ forbids in-class initialization of non-const static member ‘pruneidx_size_’
src/dictionary.cc: In member function ‘void fasttext::Dictionary::threshold(int64_t, int64_t)’:
src/dictionary.cc:261: error: expected primary-expression before ‘[’ token
src/dictionary.cc:261: error: expected primary-expression before ‘]’ token
src/dictionary.cc:261: error: expected primary-expression before ‘const’
src/dictionary.cc:261: error: expected primary-expression before ‘const’
src/dictionary.cc:265: error: expected primary-expression before ‘[’ token
src/dictionary.cc:265: error: expected primary-expression before ‘]’ token
src/dictionary.cc:265: error: expected primary-expression before ‘const’
src/dictionary.cc:269: error: ‘class std::vector<fasttext::entry, std::allocatorfasttext::entry >’ has no member named ‘shrink_to_fit’
src/dictionary.cc: In member function ‘std::vector<long int, std::allocator > fasttext::Dictionary::getCounts(fasttext::entry_type) const’:
src/dictionary.cc:292: error: expected initializer before ‘:’ token
src/dictionary.cc:295: error: expected primary-expression before ‘return’
src/dictionary.cc:295: error: expected ‘;’ before ‘return’
src/dictionary.cc:295: error: expected primary-expression before ‘return’
src/dictionary.cc:295: error: expected ‘)’ before ‘return’
src/dictionary.cc: In member function ‘void fasttext::Dictionary::addNgrams(std::vector<int, std::allocator >&, int32_t, int32_t, std::minstd_rand&) const’:
src/dictionary.cc:322: error: ‘uniform_int_distribution’ is not a member of ‘std’
src/dictionary.cc:322: error: expected primary-expression before ‘>’ token
src/dictionary.cc:322: error: ‘uniform’ was not declared in this scope
src/dictionary.cc: In member function ‘int32_t fasttext::Dictionary::getLine(std::istream&, std::vector<int, std::allocator >&, std::vector<int, std::allocator >&, std::vector<int, std::allocator >&, std::minstd_rand&) const’:
src/dictionary.cc:357: error: ‘uniform_real_distribution’ is not a member of ‘std’
src/dictionary.cc:357: error: expected primary-expression before ‘>’ token
src/dictionary.cc:357: error: ‘uniform’ was not declared in this scope
src/dictionary.cc: In member function ‘void fasttext::Dictionary::save(std::ostream&) const’:
src/dictionary.cc:426: error: expected initializer before ‘:’ token
src/dictionary.cc:430: error: expected primary-expression before ‘}’ token
src/dictionary.cc:430: error: expected ‘;’ before ‘}’ token
src/dictionary.cc:430: error: expected primary-expression before ‘}’ token
src/dictionary.cc:430: error: expected ‘)’ before ‘}’ token
src/dictionary.cc:430: error: expected primary-expression before ‘}’ token
src/dictionary.cc:430: error: expected ‘;’ before ‘}’ token
src/dictionary.cc: In member function ‘void fasttext::Dictionary::prune(std::vector<int, std::allocator >&)’:
src/dictionary.cc:474: error: expected initializer before ‘:’ token
src/dictionary.cc:478: error: could not convert ‘((std::vector<int, std::allocator >)idx)->std::vector<_Tp, _Alloc>::insert [with _InputIterator = __gnu_cxx::__normal_iterator<int, std::vector<int, std::allocator > >, _Tp = int, _Alloc = std::allocator](((std::vector<int, std::allocator >*)idx)->std::vector<_Tp, _Alloc>::end with _Tp = int, _Alloc = std::allocator, ngrams.std::vector<_Tp, _Alloc>::begin with _Tp = int, _Alloc = std::allocator, ngrams.std::vector<_Tp, _Alloc>::end with _Tp = int, _Alloc = std::allocator)’ to ‘bool’
src/dictionary.cc:479: error: expected primary-expression before ‘}’ token
src/dictionary.cc:479: error: expected ‘)’ before ‘}’ token
src/dictionary.cc:479: error: expected primary-expression before ‘}’ token
src/dictionary.cc:479: error: expected ‘;’ before ‘}’ token
make: *** [dictionary.o] Error 1

I assume it's a C++ issue, but I'm not familiar with that; can someone help?

Also I tried running python setup.py build_ext from ./sent2vec/src, but again errors occur:

running build_ext
building 'sent2vec' extension
gcc -pthread -B /home/python/anaconda3/envs/termExtraction/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/python/anaconda3/envs/termExtraction/lib/python3.6/site-packages/numpy/core/include -I/home/python/anaconda3/envs/termExtraction/include/python3.6m -c sent2vec.cpp -o build/temp.linux-x86_64-3.6/sent2vec.o -std=c++0x -Wno-cpp -pthread -Wno-sign-compare
cc1plus: warning: command line option "-Wstrict-prototypes" is valid for Ada/C/ObjC but not for C++
In file included from /home/python/anaconda3/envs/termExtraction/lib/python3.6/site-packages/numpy/core/include/numpy/ndarraytypes.h:1816,
from /home/python/anaconda3/envs/termExtraction/lib/python3.6/site-packages/numpy/core/include/numpy/ndarrayobject.h:18,
from /home/python/anaconda3/envs/termExtraction/lib/python3.6/site-packages/numpy/core/include/numpy/arrayobject.h:4,
from sent2vec.cpp:663:
/home/python/anaconda3/envs/termExtraction/lib/python3.6/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION"
In file included from sent2vec.cpp:671:
fasttext.h:18:18: error: atomic: No such file or directory
In file included from fasttext.h:23,
from sent2vec.cpp:671:
dictionary.h:55: error: ISO C++ forbids initialization of member ‘pruneidx_size_’
dictionary.h:55: error: making ‘pruneidx_size_’ static
dictionary.h:55: error: ISO C++ forbids in-class initialization of non-const static member ‘pruneidx_size_’
In file included from qmatrix.h:25,
from fasttext.h:25,
from sent2vec.cpp:671:
productquantizer.h:26: error: ISO C++ forbids initialization of member ‘nbits_’
productquantizer.h:26: error: making ‘nbits_’ static
productquantizer.h:27: error: ISO C++ forbids initialization of member ‘ksub_’
productquantizer.h:27: error: making ‘ksub_’ static
productquantizer.h:28: error: ISO C++ forbids initialization of member ‘max_points_per_cluster_’
productquantizer.h:28: error: making ‘max_points_per_cluster_’ static
productquantizer.h:29: error: ISO C++ forbids initialization of member ‘max_points_’
productquantizer.h:29: error: making ‘max_points_’ static
productquantizer.h:30: error: ISO C++ forbids initialization of member ‘seed_’
productquantizer.h:30: error: making ‘seed_’ static
productquantizer.h:31: error: ISO C++ forbids initialization of member ‘niter_’
productquantizer.h:31: error: making ‘niter_’ static
productquantizer.h:32: error: ISO C++ forbids initialization of member ‘eps_’
productquantizer.h:32: error: making ‘eps_’ static
In file included from sent2vec.cpp:671:
fasttext.h:46: error: ISO C++ forbids declaration of ‘atomic’ with no type
fasttext.h:46: error: invalid use of ‘::’
fasttext.h:46: error: expected ‘;’ before ‘<’ token
cc1plus: warning: unrecognized command line option "-Wno-cpp"
error: command 'gcc' failed with exit status 1

Information about the OS:

Linux slaver01 2.6.32-696.el6.x86_64 #1 SMP Tue Mar 21 19:29:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Thanks in advance!

Merge FastText to master

Hello, thank you for this project 👍
It seems that the version of fastText in your codebase is pretty old; since then there have been various fixes and new features in the upstream repo. Any plan to merge the changes?
Thank you.

module 'sent2vec' has no attribute 'Sent2vecModel'

Hi, I used a new server to run the code in a Jupyter notebook, and the line "model = sent2vec.Sent2vecModel()" produces the error "module 'sent2vec' has no attribute 'Sent2vecModel'" once again. This time, recloning the repository doesn't work anymore. And yes, I reinstalled Cython, but it doesn't help.

Cannot import name "StanfordTokenizer"

Hello,
There is a line in wikiTokenize.py: "from nltk.tokenize import StanfordTokenizer"
But when I try to run the command python wikiTokenize.py filename.txt, it shows the error: cannot import name "StanfordTokenizer".
I installed nltk with pip install nltk. In the nltk installation path, /anaconda3/envs/py3/lib/python3.6/site-packages/nltk/tokenize, there are stanford_segmenter.py and stanford.py, but no StanfordTokenizer.py.
So can I ask how to fix this problem?

the predicted vectors are all 0.

I successfully installed the sent2vec python module globally. However, when I tried to predict a vector using the wiki_unigram model, the output is a 600-dimensional vector with all elements 0. Does anyone know what's wrong with it? Thanks a lot! Meanwhile, when I tried the command line,
./fasttext print-sentence-vectors wiki_unigrams.bin < test.txt,
There was an assertion error:
Assertion failed: (counts.size() == osz_), function setTargetCounts, file src/model.cc, line 226. Abort trap: 6

Code below:
import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('wiki_unigrams.bin')
print(model)
emb = model.embed_sentence(" I do not know why you said so . ")
print(emb)

output:
[[0. 0. 0. ... 0.]]  (all 600 elements are zero)

Training for other languages

I'm looking to train a model for the Urdu language.

https://github.com/epfml/sent2vec#training-new-models
./fasttext sent2vec -input wiki_sentences.txt -output my_model -minCount 8 -dim 700 -epoch 9 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 20 -t 0.000005 -dropoutK 4 -minCountLabel 20 -bucket 4000000

Is this command enough, or do we need to do some preprocessing, such as stemming each word in the file?

Using fastText to compute the sentence vectors

Hi,

Am I correct in thinking there is no difference between using sent2vec's python interface, and loading the sent2vec model directly into fastText to access the word vectors? E.g. using your Wikipedia unigram model I get:

>>> sent2vec_model.embed_sentence("hello world")

array([-8.04620683e-01,  8.93464625e-01, -3.09190452e-02, -1.37415811e-01,
       -3.63196850e-01,  2.66210616e-01, -7.63243794e-01,  8.93934608e-01,

and with fastText:

>>> fasttext_model = fastText.load_model("wiki_unigrams.bin")
>>> (fasttext_model.get_word_vector("hello") + fasttext_model.get_word_vector("world"))/2

array([-8.04620683e-01,  8.93464625e-01, -3.09190452e-02, -1.37415811e-01,
       -3.63196850e-01,  2.66210616e-01, -7.63243794e-01,  8.93934608e-01,

Thanks for the info.

Learned Embeddings

Hi,

How can I extract the embedding vectors after using ./fasttext sent2vec... on a training set of sentences?

Issues with code and wikipedia model files

I want to bring to your attention that 'print-vectors' is no longer a valid command in fasttext (there is either 'print-word-vectors' or 'print-sentence-vectors'). Additionally, please note that the Wikipedia model files sent2vec_wiki_bigrams and sent2vec_wiki_unigrams get 'Model file has wrong file format!' messages from fasttext (I haven't tested the Twitter files).

About the running speed

Hi, I am trying to train a model on 1421M words, and found that the words/sec/thread keep decreasing, e.g. from 100k at 0.0% down to 8k at 11%, and now the ETA is more than 60 hours.

Here is my script for training:

./fasttext sent2vec -input my_sentences.txt -output my_model -minCount 8 -dim 300 -epoch 9 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 6 -t 0.000005 -dropoutK 4 -minCountLabel 20 -bucket 4000000

[Improvement]: Add french and german pre-trained models

Hi,

First of all thanks for the great work!

I have trained unigrams models on the wikipedia corpus in german and french and would like to share them. German model is 7.3GB, french model is 4.4GB.

Both models have been trained on the latest (preprocessed) wiki dumps with the parameters found in the paper for training "Wiki Sent2Vec unigrams" models (dim:600, minCount:8, minCountLabel:20, lr:0.2, epoch:9, t:0.00001, dropoutK:0, neg:10). Let me know if you're interested and, if so, where I can upload them.

Weird results with nnSent

I tried to apply the pretrained Wikipedia bi-gram model to an existing corpus I have (Simple English Wikipedia sentences, tokenized and lowercased with another pipeline). I tried the nnSent command to see how well similar sentences can be retrieved, and I get weird/underwhelming results.

With most queries I try, the results are not (or only roughly) ordered by vector distance:

Query sentence?
neuropathy can happen with or without the presence of diabetes mellitus .
0.709152 994968 mody is often called monogenic diabetes .
0.659386 994966 maturity onset diabetes of the young , or mody is any of several hereditary forms of diabetes mellitus caused by gene mutations .
0.608413 994574 he died on may 18 , 2009 of complications of diabetes .
0.606619 994651 furst has had type 2 diabetes for many years .
0.679047 949868 it is an important part of blood glucose checking for people with diabetes mellitus or hypoglycemia .
1 842587 neuropathy can happen with or without the presence of diabetes mellitus .
0.710052 860388 neonatal diabetes mellitus ( ndm ) is a type of diabetes .
0.744731 80109 people with diabetes " mellitus " are called " diabetics " .
0.722052 80107 there are other kinds of diabetes , like diabetes insipidus .
0.719025 80132 gestational diabetes mellitus is like type 2 diabetes .

Query sentence?
there are other kinds of diabetes , like diabetes insipidus .
0.794162 994968 mody is often called monogenic diabetes .
0.793908 949871 the tests are used for type 1 diabetes , latent autoimmune diabetes of adults and type 2 diabetes .
0.792295 950147 he has type 1 diabetes .
0.652549 908504 this happened after many years of fighting with diabetes .
0.722051 842587 neuropathy can happen with or without the presence of diabetes mellitus .
1 80107 there are other kinds of diabetes , like diabetes insipidus .
0.874403 80132 gestational diabetes mellitus is like type 2 diabetes .
0.812321 80109 people with diabetes " mellitus " are called " diabetics " .
0.809479 647689 type 2 diabetes makes up around 90% of cases of diabetes , while type 1 diabetes and other types of diabetes make up the other 10% .
0.798721 80135 in the case of diabetes , there are two kinds of complications .

For some queries, results seem almost ordered (but still not perfectly so):

Query sentence?
slavery was eventually abolished in the united states .
0.853386 983766 slavery was abolished in 1851 .
0.819932 94713 the institution of slavery was abolished in 1925 .
0.872842 614734 sharecropping became more common in the southern united states when slavery was abolished .
0.765797 67662 following the war , slavery was outlawed everywhere in the united states .
0.76175 11557 slavery in united states was legally abolished by thirteenth amendment to the united states constitution in 1865 .
0.76154 82923 some founders wanted to abolish slavery everywhere in the united states .
0.751806 90399 the society would get slavery abolished in new york .
0.726364 712056 this gave slaves their freedom , even though slavery had been officially abolished in 1807 .
0.702602 85751 the compromise was to delay the slavery issue in the united states .
0.686827 1002030 meanwhile , some states still had slavery .
0.68088 837048 it was abolished in 1968 .
0.680816 842166 it was abolished in 1969 .
0.674035 897590 in 1784 she wrote a play that supported slavery being abolished ( gotten rid of ) .
0.67191 912262 it was abolished in 1974 .
0.667849 1001457 it was abolished in 1949 .

minCountLabel?

does the parameter minCountLabel actually do anything? This library looks to be unsupervised only.

Finding most similar words

Is there a possibility to find the most similar words?
model.most_similar("word")
for example?

Or could you add this feature in the future?

Thank you.

Getting a segfault for my data.

Hi! I wanted to train a new model on a dataset of news articles from various sources that I made. I cleaned and tokenized the dataset into one .txt file. I keep getting a segfault with the following output:

Siddharths-MacBook-Pro-2:sent2vec siddharth$ ./fasttext sent2vec -input all_sentences.txt -output my_model -epoch 9 -thread 4
Read 9M words
Number of words: 45770
Number of labels: 0
Progress: 42.2% words/sec/thread: 205950 lr: 0.115647 loss: 2.235114 eta: 0h1m Segmentation fault: 11

I'm not sure if this is a memory issue that requires me to use a machine with more RAM or an issue in my input data. As I understood, the input .txt file should contain lowercase sentences separated by new lines with no non-alphabetic chars. If this is wrong, what should the exact format of my data input be?

Possibility to keep the model loaded in memory?

Hello,

I am working on a QA system and I would like to use your embeddings to work with the sentences the user gives me. One problem is that when I get a new sentence and I want to transform it, sent2vec loads the whole model into memory and creates the embedding. When a new sentence comes, it repeats this process. Since the model is cached, the repeated loading of the vocabulary is fast after the first load, but I believe that it is not the best approach for a system that should work for a longer period of time. Is it somehow possible to load and keep the model in RAM, while just sending individual sentences to transform them? (I am using the python wrapper you provide as an example in the repo).

One other question (not worth creating a new post for that I believe). Can you please tell me what is the vocabulary size of the pre-trained models you provide? Unless I misunderstood the numbers in the paper, there are only the numbers of words in the training corpus and since I am using the python wrapper, it is kinda hard to work out the vocabulary sizes from the binary files.

Thank you :)

Getting embeddings in real time

I followed the instructions in get_sentence_embeddings_from_pre-trained_models.ipynb to get embeddings of sentences, and I found that every time I call the method get_sentence_embeddings(), it reloads the model parameters, which makes computing sentence embeddings several times a time-consuming affair.

Are there any solutions to make it more efficient?

Training does not complete but program exits

Hello, I'm using the library on windows and am trying to generate embeddings on my own data. Upon running:

fasttext.exe sent2vec -input C:\....\fasttext_formatted_data.txt -output sent2vec_model

the following is shown:

Read 2M words
Number of words:  10117
Number of labels: 0
Progress: 0.0%  words/sec/thread: 2016  lr: 0.199933  loss: 7.489615  eta: 0h13m

After a moment, the program exits and the progress jumps to 1%. No model is saved after the program exits. How can I get sentence embeddings on my data?

fatal error while installing

Hello !

while installing, I ran
python setup.py build_ext
I got this answer

(screenshot not available)

the installation of fastText itself did not cause any trouble ...

indeed the complete answer is

(screenshot not available)

any idea ?
thank you for your answers ;)

Can't load .bin file in gensim; is there a way to generate .vec instead?

I am trying to load the ".bin" model file in gensim (v3.3.0) from sent2vec, but I get this error:

/usr/local/lib/python2.7/dist-packages/gensim/models/utils_any2vec.py in _load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
    167     with utils.smart_open(fname) as fin:
    168         print "TEST=", fin.readline()
--> 169         header = utils.to_unicode(fin.readline(), encoding=encoding)
    170         vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
    171         if limit:

/usr/local/lib/python2.7/dist-packages/gensim/utils.pyc in any2unicode(text, encoding, errors)
    327     if isinstance(text, unicode):
    328         return text
--> 329     return unicode(text, encoding, errors=errors)
    330 
    331 

/usr/lib/python2.7/encodings/utf_8.pyc in decode(input, errors)
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeDecodeError: 'utf8' codec can't decode byte 0xf1 in position 31: invalid continuation byte

I looked for the plain-text format (.vec) of the model, but I can't find it; I presume sent2vec doesn't generate it.

I also tried "./fasttext print-word-vectors model.bin" but it just hangs.

How can I use the vectors in gensim?

How does sent2vec differ from fasttext-supervised

I'm trying to understand the exact difference between sent2vec and running fastText supervised on lines like this:

some sentence with words __label__some __label__sentence __label__with __label__words

Reading your paper and code, I think that on the lhs you hold out the word that you are trying to predict on the rhs. E.g. you run

train(sentence with words,  some)
train(some with words,  sentence)
train(some sentence words,  with)
train(some sentence with,  words)

whereas fastText supervised would include the rhs word in each of those four calls.
Is this correct, or am I missing other differences between the two systems?

Extracting parameters from model

How can we extract the parameters with which our model was trained?

For example, I have a model file and I want to know the vector size, the min count, how many epochs were performed, and so on.

Thank you!

ImportError: No module named Cython.Build

I already have Cython on my system, but when I execute the sudo pip install . command, the following error pops up:

The directory '/home/lokesh/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/home/lokesh/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Processing /home/lokesh/Desktop/Unsupervised_KeYEx/sent2vec/src
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-8ApShB-build/setup.py", line 2, in <module>
        from Cython.Build import cythonize
    ImportError: No module named Cython.Build

----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-8ApShB-build/


My python version is 2.7
