
fast_sentence_embeddings's People

Contributors

grantmwilliams, oborchers, zsiciarz


fast_sentence_embeddings's Issues

error: (lwork>=n||lwork==-1) failed for 1st keyword lwork: sorgqr:lwork=-980116480

2019-10-04 12:19:33,452 : MainThread : INFO : worker thread finished; awaiting finish of 1 more threads
2019-10-04 12:19:33,452 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-11-7bb656653df0> in <module>
----> 1 model.train(doc)

~/anaconda3/lib/python3.6/site-packages/fse/models/base_s2v.py in train(self, sentences, update, queue_factor, report_delay)
    640 
    641         # Preform post-tain calls (i.e principal component removal)
--> 642         self._post_train_calls()
    643 
    644         self._log_train_end(eff_sentences=eff_sentences, eff_words=eff_words, overall_time=overall_time)

~/anaconda3/lib/python3.6/site-packages/fse/models/usif.py in _post_train_calls(self)
     79         """ Function calls to perform after training, such as computing eigenvectors """
     80         if self.components > 0:
---> 81             self.svd_res = compute_principal_components(self.sv.vectors, components=self.components)
     82             self.svd_weights = (self.svd_res[0] ** 2) / (self.svd_res[0] ** 2).sum().astype(REAL)
     83             remove_principal_components(self.sv.vectors, svd_res=self.svd_res, weights=self.svd_weights, inplace=True)

~/anaconda3/lib/python3.6/site-packages/fse/models/utils.py in compute_principal_components(vectors, components)
     32     start = time()
     33     svd = TruncatedSVD(n_components=components, n_iter=7, random_state=42, algorithm="randomized")
---> 34     svd.fit(vectors)
     35     elapsed = time()
     36     logger.info(f"computing {components} principal components took {int(elapsed-start)}s")

~/anaconda3/lib/python3.6/site-packages/sklearn/decomposition/truncated_svd.py in fit(self, X, y)
    139             Returns the transformer object.
    140         """
--> 141         self.fit_transform(X)
    142         return self
    143 

~/anaconda3/lib/python3.6/site-packages/sklearn/decomposition/truncated_svd.py in fit_transform(self, X, y)
    176             U, Sigma, VT = randomized_svd(X, self.n_components,
    177                                           n_iter=self.n_iter,
--> 178                                           random_state=random_state)
    179         else:
    180             raise ValueError("unknown algorithm %r" % self.algorithm)

~/anaconda3/lib/python3.6/site-packages/sklearn/utils/extmath.py in randomized_svd(M, n_components, n_oversamples, n_iter, power_iteration_normalizer, transpose, flip_sign, random_state)
    332 
    333     Q = randomized_range_finder(M, n_random, n_iter,
--> 334                                 power_iteration_normalizer, random_state)
    335 
    336     # project M to the (k + p) dimensional space using the basis vectors

~/anaconda3/lib/python3.6/site-packages/sklearn/utils/extmath.py in randomized_range_finder(A, size, n_iter, power_iteration_normalizer, random_state)
    224     # Sample the range of A using by linear projection of Q
    225     # Extract an orthonormal basis
--> 226     Q, _ = linalg.qr(safe_sparse_dot(A, Q), mode='economic')
    227     return Q
    228 

~/anaconda3/lib/python3.6/site-packages/scipy/linalg/decomp_qr.py in qr(a, overwrite_a, lwork, mode, pivoting, check_finite)
    163     elif mode == 'economic':
    164         Q, = safecall(gor_un_gqr, "gorgqr/gungqr", qr, tau, lwork=lwork,
--> 165                       overwrite_a=1)
    166     else:
    167         t = qr.dtype.char

~/anaconda3/lib/python3.6/site-packages/scipy/linalg/decomp_qr.py in safecall(f, name, *args, **kwargs)
     19         ret = f(*args, **kwargs)
     20         kwargs['lwork'] = ret[-2][0].real.astype(numpy.int)
---> 21     ret = f(*args, **kwargs)
     22     if ret[-1] < 0:
     23         raise ValueError("illegal value in %d-th argument of internal %s"

error: (lwork>=n||lwork==-1) failed for 1st keyword lwork: sorgqr:lwork=-980116480

When executing

from gensim.models.keyedvectors import FastTextKeyedVectors as kv
from fse.models import uSIF
from fse import IndexedLineDocument
ft = kv.load("<path to pretrained fasttext>") 

from fse.models.average import FAST_VERSION, MAX_WORDS_IN_BATCH 
print(MAX_WORDS_IN_BATCH) 
print(FAST_VERSION)


doc = IndexedLineDocument("<very large list of sentences.txt>") 
model = uSIF(ft, workers=28, sv_mapfile_path="../tmp/sv_map", wv_mapfile_path="../tmp/wv_map") 

model.train(doc) 

Note that this also happens with regular SIF. The list of sentences is approx. 30 GB and the embedding dimension is 128. I am not entirely sure how to debug this further; any thoughts?

List of dependencies from Anaconda https://gist.github.com/joelkuiper/b43a9d1f0b422eadf3dee0a29476ad90

Hierarchical pooling

Could you say something more about hierarchical pooling?
I am interested in this feature, but I'm not sure what you mean.
I can try to implement this if given some guidance.

Does FSE guarantee ordering of vectors to be that of the input sentences?

For an example like:

import pandas as pd

from fse.models import uSIF
from fse import SplitIndexedList
from gensim.models.keyedvectors import FastTextKeyedVectors

fasttext_model_path = "models/fasttext-wiki-news-subwords-300.model"
ft = FastTextKeyedVectors.load(fasttext_model_path)

sent_fp = "data/sentences/sentences.csv.gz"
df = pd.read_csv(sent_fp)

sentences = df.sentence.values

indexed_sentences = SplitIndexedList(sentences)

model = uSIF(ft, workers=2, lang_freq="en")

sentence_count, word_count = model.train(indexed_sentences)

embeddings = model.sv.vectors

Where I read in an ordered list of sentences and then process them through a pre-trained model, does FSE guarantee the order of the model vectors to be the same order that the sentences were fed in?

I didn't see anything in the documentation or source code to suggest they wouldn't be, but I also haven't seen any explicit guarantee of ordering in the documentation either.
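As a quick empirical spot check (a sketch; it assumes the (tokens, index) tuple convention that fse's input classes display elsewhere in this document), one can verify that each sentence's index matches its row in model.sv.vectors:

for i in range(5):
    words, idx = indexed_sentences[i]        # SplitIndexedList yields (tokens, index)
    assert idx == i                          # indices follow input order
    print(words, model.sv.vectors[idx][:3])  # row idx should hold that sentence's vector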

Thanks!

Encounter "Divided 0 Error"

Hi Oliver,

Thanks for this great repo!
However, I found an issue when using it with SentEval.

Generally speaking, the problem is a divide-by-zero error when I use uSIF(glove, length=12, components=1). The error is raised when calculating a = (1 - alpha)/(alpha * 2) at fse/models/usif.py, line 126.

If you need a minimal running example to reproduce the error, please let me know.
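For context, a minimal sketch of the failing arithmetic (the trigger condition is my assumption, not a confirmed diagnosis): if alpha comes out as 0, the quoted line divides by zero.

alpha = 0.0                    # hypothetical: no word probability clears the threshold
a = (1 - alpha) / (alpha * 2)  # ZeroDivisionError: float division by zero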

Regarding comparison

Hi, Really great work!!

I just have one question.

Which of FastText, word2vec, and GloVe produces the better sentence embedding when the respective word vectors are averaged?

Which one, in your view, will give better results when I embed sentences for search?

Gensim version ImportError: cannot import name 'BaseKeyedVectors'

Dear fse creator,

The import below gives ImportError: cannot import name 'BaseKeyedVectors'.
from fse import SplitIndexedList

We think it is a gensim compatibility issue, so we were wondering which gensim version we should use
(we are using the latest, gensim 4.0.0).

This GitHub issue suggests it should be from gensim.models.keyedvectors import KeyedVectors.

Best,
-- Luke

uSIF does not work with small data?

I'm trying to test uSIF, but I get an error in the SVD step about NaN values in the vectors.
I took the Average example and changed it to uSIF:

from gensim.models import FastText
from fse.models import uSIF
from fse import IndexedList

sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
ft = FastText(sentences, min_count=1, size=10)

model = uSIF(ft, components=1)
model.train(IndexedList(sentences))

The following error occurs during the fit() of the TruncatedSVD:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I'm using Python 3.6.5 :: Anaconda, Inc.

Optimize for FastText supervised mode

Hi @oborchers ,
first, kudos for the superb work! 👍
I am exploring SIF embeddings with your library. Looking at the code, I noticed that there are some optimizations specific to FastText models. Since gensim does not seem to support supervised FT models in any way, I currently "force" it to load my model through the .vec file (instead of the .bin), like this:

ft = KeyedVectors.load_word2vec_format("model_ft.vec")

which means that the model I load is not an instance of FastTextKeyedVectors, and thus I will not benefit from any of these optimizations when I train the SIF model.
Indeed, the resulting model is quite heavy on RAM usage, so I am wondering whether there is a better way to do this, especially considering that the only function I will call after training is infer.

Also, can the memmap settings be changed at a later stage? I am thinking of something like training in RAM, writing the whole SIF model to disk, and then loading it later while keeping the word vectors on disk (or changing wv_mapfile_path to another location, for example when moving to a different machine).

Any kind of hint would be highly appreciated! And thanks again 😃
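Not an authoritative answer, but a sketch of one possible direction: base_s2v.py imports gensim's SaveLoad, whose load() accepts an mmap argument, so the persisted arrays can in principle be memory-mapped at load time (whether this interacts cleanly with fse's map-file paths is exactly the open question here; ft and doc are placeholders from the other examples in this document):

from fse.models import SIF

model = SIF(ft)          # train entirely in RAM
model.train(doc)
model.save("sif_model")  # persist the whole model to disk

# Later, possibly on another machine: memory-map the large numpy arrays
# instead of loading them into RAM (gensim SaveLoad convention).
model2 = SIF.load("sif_model", mmap="r")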

Docs: extend example with lookup

It would be great if you could show how to use the sentence embeddings, e.g., to find the nearest sentence, or the nearest N sentences, in the training corpus, i.e., how to write find_nearest() in the snippet below:

from gensim.models import Word2Vec
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, min_count=1)

from fse.models import Sentence2Vec
se = Sentence2Vec(model)
sentences_emb = se.train(sentences)

test_sentence = [["dog", "say", "moo"]]
find_nearest(sentences_emb, test_sentence)
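For reference, a minimal sketch of what find_nearest() could look like with plain numpy cosine similarity (find_nearest is the placeholder name from the snippet above; model.infer and model.sv.vectors are the current-API names used elsewhere in this document):

import numpy as np

def find_nearest(sentence_vectors, query_vector, topn=3):
    # Cosine similarity of the query against every training sentence vector.
    norms = np.linalg.norm(sentence_vectors, axis=1) * np.linalg.norm(query_vector) + 1e-9
    sims = sentence_vectors @ query_vector / norms
    best = np.argsort(-sims)[:topn]
    return [(int(i), float(sims[i])) for i in best]

# Usage with a trained fse model:
# query = model.infer([("dog say moo".split(), 0)])[0]
# print(find_nearest(model.sv.vectors, query))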

use normalized word vectors or not?

Hi, I am using a word2vec model to calculate sentence-level embeddings, and I was wondering whether I should use normalized word vectors to train the SIF model?

Speed

Hi, I am currently trying out your algorithm and I was wondering what speeds you achieve. On my machine (MacBook Pro), training on 200 sentences takes roughly 3 seconds. Is this normal or do you think there is something wrong? Your help would be much appreciated!

ParaNMT model

Is there any information on obtaining or setting up the ParaNMT model? The benchmarks show it as a great model to use with FSE and I was hoping to try it out, but I haven't been able to find it anywhere (just the training data). Is there somewhere we can access this model or its keyed vectors?

Reduce Model Size for Inference

Is it possible to reduce model size by discarding previously calculated sentence vectors? This would be for downstream inference-only uses.

Best,

Update setup.py to python<3.7 to fix "C extension not loaded, training/inferring will be slow. Install a C compiler and reinstall fse" warning

This is related to #18, which was closed but is not solved. Another user (@lucas-ubm) and I are experiencing this problem on macOS systems, so it is not limited to Windows. I have tried installing gensim through conda, to no avail. Any tips would be greatly appreciated.

Error message:

/opt/anaconda3/envs/sbir_covid/lib/python3.8/site-packages/fse/models/base_s2v.py:114: UserWarning: C extension not loaded, training/inferring will be slow. Install a C compiler and reinstall fse.
  warnings.warn(

Here is my machine setup:
macOS: MacBook Pro (15-inch, 2019), Version 11.2.3 (20D91)
Processor: 2.3 GHz 8-Core Intel Core i9
Memory: 32 GB 2400 MHz DDR4

Here is my conda env setup:

# packages in environment at /opt/anaconda3/envs/sbir_covid:
#
# Name                    Version                   Build  Channel
affinegap                 1.11                     pypi_0    pypi
aiohttp                   3.7.4            py38h96a0964_0    conda-forge
appdirs                   1.4.4              pyh9f0ad1d_0    conda-forge
appnope                   0.1.2            py38h50d1736_1    conda-forge
argon2-cffi               20.1.0           py38h5406a74_2    conda-forge
async-timeout             3.0.1                   py_1000    conda-forge
async_generator           1.10                       py_0    conda-forge
attrs                     20.3.0             pyhd3deb0d_0    conda-forge
backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
backports                 1.0                        py_2    conda-forge
backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
black                     20.8b1                     py_1    conda-forge
bleach                    3.3.0              pyh44b312d_0    conda-forge
boto                      2.49.0                     py_0    conda-forge
boto3                     1.17.55            pyhd8ed1ab_0    conda-forge
botocore                  1.20.55            pyhd8ed1ab_0    conda-forge
brotlipy                  0.7.0           py38h5406a74_1001    conda-forge
btrees                    4.8.0                    pypi_0    pypi
bz2file                   0.98                       py_0    conda-forge
c-ares                    1.17.1               h0d85af4_1    conda-forge
ca-certificates           2020.12.5            h033912b_0    conda-forge
cachetools                4.2.1              pyhd8ed1ab_0    conda-forge
catalogue                 2.0.3            py38h50d1736_0    conda-forge
categorical-distance      1.9                      pypi_0    pypi
certifi                   2020.12.5        py38h50d1736_1    conda-forge
cffi                      1.14.5           py38ha97d567_0    conda-forge
chardet                   4.0.0            py38h50d1736_1    conda-forge
click                     7.1.2              pyh9f0ad1d_0    conda-forge
colorama                  0.4.4              pyh9f0ad1d_0    conda-forge
coverage                  5.5              py38h96a0964_0    conda-forge
cryptography              3.4.7            py38h1fa4640_0    conda-forge
cycler                    0.10.0                     py_2    conda-forge
cymem                     2.0.5            py38h91a8764_1    conda-forge
cython-blis               0.7.4            py38ha1b04c9_0    conda-forge
dataclasses               0.8                pyhc8e2a94_1    conda-forge
datetime-distance         0.1.3                    pypi_0    pypi
decorator                 5.0.7              pyhd8ed1ab_0    conda-forge
dedupe                    2.0.8                    pypi_0    pypi
dedupe-hcluster           0.3.8                    pypi_0    pypi
dedupe-variable-datetime  0.1.5                    pypi_0    pypi
defusedxml                0.7.1              pyhd8ed1ab_0    conda-forge
doublemetaphone           0.1                      pypi_0    pypi
entrypoints               0.3             pyhd8ed1ab_1003    conda-forge
fastcluster               1.1.28                   pypi_0    pypi
filelock                  3.0.12             pyh9f0ad1d_0    conda-forge
flake8                    3.9.1              pyhd8ed1ab_0    conda-forge
freetype                  2.10.4               h4cff582_1    conda-forge
fse                       0.1.15                   pypi_0    pypi
future                    0.18.2           py38h50d1736_3    conda-forge
gensim                    3.8.3            py38ha048514_4    conda-forge
google-api-core           1.26.2             pyhd8ed1ab_0    conda-forge
google-auth               1.28.0             pyh44b312d_0    conda-forge
google-cloud-core         1.5.0              pyhd3deb0d_0    conda-forge
google-cloud-storage      1.35.1             pyh44b312d_0    conda-forge
google-crc32c             1.1.2            py38haea9d43_0    conda-forge
google-resumable-media    1.2.0              pyhd3deb0d_0    conda-forge
googleapis-common-protos  1.53.0           py38h50d1736_0    conda-forge
grpcio                    1.37.0           py38ha263829_0    conda-forge
haversine                 2.3.0                    pypi_0    pypi
highered                  0.2.1                    pypi_0    pypi
idna                      2.10               pyh9f0ad1d_0    conda-forge
importlib-metadata        4.0.1            py38h50d1736_0    conda-forge
iniconfig                 1.1.1              pyh9f0ad1d_0    conda-forge
ipykernel                 5.5.3            py38h6c79ece_0    conda-forge
ipython                   7.22.0           py38h6c79ece_0    conda-forge
ipython_genutils          0.2.0                      py_1    conda-forge
jedi                      0.18.0           py38h50d1736_2    conda-forge
jinja2                    2.11.3             pyh44b312d_0    conda-forge
jmespath                  0.10.0             pyh9f0ad1d_0    conda-forge
joblib                    1.0.1              pyhd8ed1ab_0    conda-forge
jpeg                      9d                   hbcb3906_0    conda-forge
jsonschema                3.2.0              pyhd8ed1ab_3    conda-forge
jupyter_client            6.1.12             pyhd8ed1ab_0    conda-forge
jupyter_core              4.7.1            py38h50d1736_0    conda-forge
jupyterlab_pygments       0.1.2              pyh9f0ad1d_0    conda-forge
kiwisolver                1.3.1            py38hd9c93a9_1    conda-forge
krb5                      1.17.2               h60d9502_0    conda-forge
langcodes                 3.1.0                    pypi_0    pypi
lcms2                     2.12                 h577c468_0    conda-forge
levenshtein-search        1.4.5                    pypi_0    pypi
libblas                   3.9.0                     8_mkl    conda-forge
libcblas                  3.9.0                     8_mkl    conda-forge
libcrc32c                 1.1.1                h1c7c35f_2    conda-forge
libcxx                    11.1.0               habf9029_0    conda-forge
libedit                   3.1.20191231         h0678c8f_2    conda-forge
libffi                    3.3                  h046ec9c_2    conda-forge
libgfortran               5.0.0           9_3_0_h6c81a4c_22    conda-forge
libgfortran5              9.3.0               h6c81a4c_22    conda-forge
liblapack                 3.9.0                     8_mkl    conda-forge
libpng                    1.6.37               h7cec526_2    conda-forge
libpq                     13.1                 h052a64a_2    conda-forge
libprotobuf               3.15.8               hcf210ce_0    conda-forge
libsodium                 1.0.18               hbcb3906_1    conda-forge
libtiff                   4.2.0                h355d032_0    conda-forge
libwebp-base              1.2.0                h0d85af4_2    conda-forge
llvm-openmp               11.1.0               hda6cdc1_1    conda-forge
lz4-c                     1.9.3                h046ec9c_0    conda-forge
markupsafe                1.1.1            py38h5406a74_3    conda-forge
matplotlib                3.4.1            py38h50d1736_0    conda-forge
matplotlib-base           3.4.1            py38h6152e83_0    conda-forge
mccabe                    0.6.1                      py_1    conda-forge
mistune                   0.8.4           py38h5406a74_1003    conda-forge
mkl                       2020.4             h08c4f10_301    conda-forge
more-itertools            8.7.0              pyhd8ed1ab_0    conda-forge
msgpack                   1.0.2                    pypi_0    pypi
multidict                 5.1.0            py38h5406a74_1    conda-forge
murmurhash                1.0.5            py38h91a8764_0    conda-forge
mypy                      0.812            py38h96a0964_2    conda-forge
mypy_extensions           0.4.3            py38h50d1736_3    conda-forge
nbclient                  0.5.3              pyhd8ed1ab_0    conda-forge
nbconvert                 6.0.7            py38h50d1736_3    conda-forge
nbformat                  5.1.3              pyhd8ed1ab_0    conda-forge
ncurses                   6.2                  h2e338ed_4    conda-forge
nest-asyncio              1.5.1              pyhd8ed1ab_0    conda-forge
ninja                     1.10.2               h9a9d8cb_0    conda-forge
nltk                      3.6.2              pyhd8ed1ab_0    conda-forge
notebook                  6.3.0              pyha770c72_1    conda-forge
numpy                     1.20.2           py38had91d27_0    conda-forge
olefile                   0.46               pyh9f0ad1d_1    conda-forge
openjpeg                  2.4.0                h6cbf5cd_0    conda-forge
openssl                   1.1.1k               h0d85af4_0    conda-forge
packaging                 20.9               pyh44b312d_0    conda-forge
pandas                    1.2.4            py38h1f261ad_0    conda-forge
pandoc                    2.13                 h0d85af4_0    conda-forge
pandocfilters             1.4.2                      py_1    conda-forge
parso                     0.8.2              pyhd8ed1ab_0    conda-forge
pathspec                  0.8.1              pyhd3deb0d_0    conda-forge
pathy                     0.4.0              pyhd8ed1ab_0    conda-forge
persistent                4.7.0                    pypi_0    pypi
pexpect                   4.8.0              pyh9f0ad1d_2    conda-forge
pickleshare               0.7.5                   py_1003    conda-forge
pillow                    8.1.2            py38h83525de_1    conda-forge
pip                       21.0.1             pyhd8ed1ab_0    conda-forge
pluggy                    0.13.1           py38h50d1736_4    conda-forge
preshed                   3.0.5            py38h91a8764_0    conda-forge
prometheus_client         0.10.1             pyhd8ed1ab_0    conda-forge
prompt-toolkit            3.0.18             pyha770c72_0    conda-forge
protobuf                  3.15.8           py38ha048514_0    conda-forge
psutil                    5.8.0            py38h96a0964_1    conda-forge
psycopg2                  2.8.6            py38hc775865_2    conda-forge
ptyprocess                0.7.0              pyhd3deb0d_0    conda-forge
py                        1.10.0             pyhd3deb0d_0    conda-forge
pyasn1                    0.4.8                      py_0    conda-forge
pyasn1-modules            0.2.7                      py_0    conda-forge
pycodestyle               2.7.0              pyhd8ed1ab_0    conda-forge
pycparser                 2.20               pyh9f0ad1d_2    conda-forge
pydantic                  1.7.3            py38h96a0964_1    conda-forge
pyflakes                  2.3.1              pyhd8ed1ab_0    conda-forge
pygments                  2.8.1              pyhd8ed1ab_0    conda-forge
pyhacrf-datamade          0.2.5                    pypi_0    pypi
pylbfgs                   0.2.0.13                 pypi_0    pypi
pyopenssl                 20.0.1             pyhd8ed1ab_0    conda-forge
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pyrsistent                0.17.3           py38h5406a74_2    conda-forge
pysocks                   1.7.1            py38h50d1736_3    conda-forge
pytest                    6.2.3            py38h50d1736_0    conda-forge
pytest-cov                2.11.1             pyh44b312d_0    conda-forge
python                    3.8.8           h4e93d89_0_cpython    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.8                      1_cp38    conda-forge
pytorch                   1.8.0           cpu_py38h561cec5_1    conda-forge
pytz                      2021.1             pyhd8ed1ab_0    conda-forge
pyyaml                    5.4.1            py38h5406a74_0    conda-forge
pyzmq                     22.0.3           py38hd3b92b6_1    conda-forge
readline                  8.1                  h05e3726_0    conda-forge
regex                     2021.4.4         py38h96a0964_0    conda-forge
requests                  2.25.1             pyhd3deb0d_0    conda-forge
reverse_geocoder          1.5.1              pyhd8ed1ab_0    conda-forge
rlr                       2.4.5                    pypi_0    pypi
rsa                       4.7.2              pyh44b312d_0    conda-forge
s3transfer                0.4.1              pyhd8ed1ab_0    conda-forge
sacremoses                0.0.43             pyh9f0ad1d_0    conda-forge
scikit-learn              0.24.1           py38hfd19401_0    conda-forge
scipy                     1.6.2            py38h431c0a8_0    conda-forge
send2trash                1.5.0                      py_0    conda-forge
sentence-transformers     0.4.1              pyhd3deb0d_0    conda-forge
setuptools                49.6.0           py38h50d1736_3    conda-forge
shellingham               1.4.0              pyh44b312d_0    conda-forge
simplecosine              1.2                      pypi_0    pypi
six                       1.15.0             pyh9f0ad1d_0    conda-forge
sleef                     3.5.1                h35c211d_1    conda-forge
smart_open                2.2.1              pyh9f0ad1d_0    conda-forge
spacy                     3.0.5            py38hc7464f7_0    conda-forge
spacy-alignments          0.8.3                    pypi_0    pypi
spacy-legacy              3.0.3              pyhd8ed1ab_0    conda-forge
spacy-lookups-data        1.0.0                    pypi_0    pypi
spacy-transformers        1.0.2                    pypi_0    pypi
sqlite                    3.35.4               h44b9ce1_0    conda-forge
srsly                     2.4.1            py38ha048514_0    conda-forge
terminado                 0.9.4            py38h50d1736_0    conda-forge
testpath                  0.4.4                      py_0    conda-forge
thinc                     8.0.3            py38he35c9cc_0    conda-forge
threadpoolctl             2.1.0              pyh5ca1d4c_0    conda-forge
tk                        8.6.10               h0419947_1    conda-forge
tokenizers                0.10.1           py38heb1ef26_0    conda-forge
toml                      0.10.2             pyhd8ed1ab_0    conda-forge
tornado                   6.1              py38h5406a74_1    conda-forge
tqdm                      4.60.0             pyhd8ed1ab_0    conda-forge
traitlets                 5.0.5                      py_0    conda-forge
transformers              4.5.1              pyhd8ed1ab_0    conda-forge
typed-ast                 1.4.3            py38h96a0964_0    conda-forge
typer                     0.3.2              pyhd8ed1ab_0    conda-forge
typing-extensions         3.7.4.3                       0    conda-forge
typing_extensions         3.7.4.3                    py_0    conda-forge
urllib3                   1.26.4             pyhd8ed1ab_0    conda-forge
wasabi                    0.8.2              pyh44b312d_0    conda-forge
wcwidth                   0.2.5              pyh9f0ad1d_2    conda-forge
webencodings              0.5.1                      py_1    conda-forge
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
wordfreq                  2.5.0                    pypi_0    pypi
xz                        5.2.5                haf1e3a3_1    conda-forge
yaml                      0.2.5                haf1e3a3_0    conda-forge
yarl                      1.5.1            py38h4d0b108_0    conda-forge
zeromq                    4.3.4                h1c7c35f_0    conda-forge
zipp                      3.4.1              pyhd8ed1ab_0    conda-forge
zlib                      1.2.11            h7795811_1010    conda-forge
zope-index                5.0.0                    pypi_0    pypi
zope-interface            5.4.0                    pypi_0    pypi
zstd                      1.4.9                h582d3a0_0    conda-forge

slow speed for SIF model for large corpus

Hi,
I have been experimenting with fse. For a small dataset of 200-300k sentences, embedding generation was very fast. But now I am training on a large corpus of 50 million sentences. I am using 12 workers and the training is still very slow: the logs report roughly 700 sentences/s. I am using gensim.models.FastText.
I also got a UserWarning, "C extension not loaded, training/inferring will be slow.", on Ubuntu 16.04. Is there any way to increase the speed?
Thank you

RuntimeError: You must first train the model to obtain SVD components

Hello

Model training works without problems, but after saving my model I cannot load it.

from fse.models import SIF
model = SIF(w2v, workers=8)
model.train(doc)
model.save("my_sıf_model")
model.load("my_sıf_model")

But when I try
similars = model.sv.similar_by_sentence("my sentence".split(), model=model, indexable=doc.items, topn=100)
I get this error:

(screenshot: RuntimeError: You must first train the model to obtain SVD components)
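One thing worth checking (a guess, assuming fse follows gensim's SaveLoad convention, where load is a classmethod that returns a new object): the return value of load must be assigned, otherwise the loaded model is discarded.

from fse.models import SIF

model = SIF.load("my_sıf_model")  # assign the result; model.load(...) on an instance discards it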

Handling out of vocabulary

Hello!

I am using this package to build reasonable sentence vectors, but for some short collections of words, all of my words are OOV. I tried using FastText, but I get:

*** RuntimeError: Model must be child of BaseWordEmbeddingsModel or BaseKeyedVectors. Received FastText(vocab=2519370, size=300, alpha=0.025)

Is it possible to use FastText and handle Out of vocabulary words?

Thank you!

cannot import name 'BaseKeyedVectors' from 'gensim.models.keyedvectors'

I have gensim 3.8 and Python 3.7 installed.

Traceback:

from fse.inputs import IndexedList, IndexedLineDocument
11
---> 12 from gensim.models.keyedvectors import BaseKeyedVectors
13
14 from numpy import dot, float32 as REAL, memmap as np_memmap, \

ImportError: cannot import name 'BaseKeyedVectors' from 'gensim.models.keyedvectors' (/usr/local/lib/python3.7/dist-packages/gensim/models/keyedvectors.py)


I can't find any class called "BaseKeyedVectors" in gensim. It looks like it has been renamed to just "KeyedVectors"?
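A likely workaround, assuming fse 0.1.x was written against the gensim 3.x API (where BaseKeyedVectors still exists), is to pin gensim below 4.0:

# Install an older gensim first: pip install "gensim<4.0"
import gensim
assert gensim.__version__.startswith("3."), "fse 0.1.x expects the gensim 3.x API"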

Add Features to Sentencevectors

[ ] Sentencevectors:
Global:
[ ] Remove normalized vector files and replace with NN
[ ] ANN --> (Annoy, with option for Google ScaNN?)
[ ] Only construct the index when calling the most_similar method
[ ] Logging of index speed
[ ] Save and load of the index
[ ] Assert that index and vectors are of equal size
[ ] Parameters must be tunable afterwards
[ ] Method to reconstruct the index
[ ] How does the index saving comply with SaveLoad?
[ ] Write unit tests?
Brute:
[ ] Keep access to the default method
[ ] Make ANN search the default?! --> Results?
[ ] Throw a warning for large datasets on vector norm init
[ ] Maybe throw a warning if the embedding + normalization exceeds RAM size
Other:
[ ] L2 distance
[ ] L1 distance
[ ] Correlation (power score correlation?)
[ ] Lookup functionality (via defaultdict)
[ ] Get vector: not really memory friendly
[ ] Show which words are in the vocabulary
[ ] Assess empty vectors (via EPS sum)
[ ] Z-score transformation from power-means embedding? --> Benefit?

Ordering of sentences trained on matters for the inferred vectors.

Hello,

First of all, thank you for a nice repository. I am, however, a bit troubled about one thing, which I hope to get answered here.

The order in which the data is input seems to matter for the outcome of the vectors, at least for the uSIF embedding function.

Consider the example below.

import numpy as np
import gensim.downloader as api

from fse.models import uSIF
from fse import IndexedList

def load_w2vec(vecs: str = "word2vec-google-news-300"):
    model = api.load(vecs)
    return model

glove = load_w2vec("glove-wiki-gigaword-100")
data = [["Hello", "there", "John"], ["Hi", "everyone", "good", "day"]]
input_1 = IndexedList(data)
model = uSIF(glove, lang_freq="en")
model.train(input_1)
vecs = model.infer(input_1)

model.train(input_1)
vecs2 = model.infer(input_1)

print(f"All vectors are the same: {np.all(vecs == vecs2)}")

# Feed the model the same data for training, but in another order.
input_2 = IndexedList(data[::-1])

model = uSIF(glove, lang_freq="en")
model.train(input_2)
vecs2 = model.infer(input_1)  # Infer the same sentences in the original order.
print(f"All vectors are the same: {np.all(vecs == vecs2)}")

This gives me the output:

All vectors are the same: True
All vectors are the same: False

Should this really be the case? Thank you in advance!

Saved sif_model.sv.vectors.npy file is very large?

I found that my sif_model.sv.vectors.npy file holds just a (758194, 100) matrix, but the file is 15 GB, whereas saving an (800000, 100) matrix to an .npy file takes only about 600 MB. Is this normal? I trained the SIF model on 30 million sentences.

-rw-r--r-- 1 ke ke  43M oct 11 19:09 sif_model
-rw-r--r-- 1 ke ke  15G oct 11 19:09 sif_model.sv.vectors.npy  <<----- this file very large
-rw-r--r-- 1 ke ke 290M oct 11 19:07 sif_model.wv.vectors.npy
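A quick plausibility check on the sizes (plain arithmetic, not a diagnosis of the cause): a (758194, 100) float32 matrix should occupy roughly 0.3 GB, whereas a 15 GB file with 100 float32 columns corresponds to about 37.5 million rows, close to the 30 million training sentences, so the file may still hold one vector per training sentence.

print(758_194 * 100 * 4 / 1e9)  # ~0.30 GB for the reported matrix shape
print(15e9 / (100 * 4) / 1e6)   # ~37.5 million rows would fit in 15 GB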

uSIF model

I got a NaN values error with the uSIF model.

Don't absorb KeyedVectors into BaseS2V class

Untangling the bad design decision of actually storing gensim's BaseKeyedVectors internally. If users want mmap, they can just load the vectors that way and pass them in. At the very least, we shouldn't store them with the model.

Rework Threading Input class

Reworking the threading (at least in my last experience, the input thread is the bottleneck, not the actual computation).

issue with fasttext model

The following code throws an error (TypeError: Cannot convert numpy.float32 to numpy.ndarray):

from gensim.models.fasttext import load_facebook_model
from fse.models import SIF

fb = load_facebook_model(path_to_model)
model = SIF(fb, alpha=1e-7, components=1)
model.train([IndexedSentence(s, i) for i, s in enumerate(sentences)])
model.sv.similar_by_sentence(['документы', 'бухгалтерия'], model=model, indexable=sentences)  # <-- this line fails

However, if we replace the full model with plain vectors, everything seems to work:

from gensim.models import KeyedVectors

ft = KeyedVectors.load_word2vec_format(path_to_vectors)
model = SIF(ft, alpha=1e-7, components=1)
model.train([IndexedSentence(s, i) for i, s in enumerate(sentences)])
model.sv.similar_by_sentence(['документы', 'бухгалтерия'], model=model, indexable=sentences)

This matters because the word counts (ft.wv.vocab) of the vectors-only model look as if they were automatically recovered from the vectors using cosine similarity (not sure about that), and they are not the same as those from the full model.

Infer only returns embedding of one sentence

Given a list of input tuples of the form Tuple[List[str], int], I initially expected to get back a numpy matrix of size (n, vector_size).
I suspect this is due to the following line:

output = zeros((statistics["max_index"], self.sv.vector_size), dtype=REAL)

Should it be something like this?

output = zeros((statistics["total_sentences"], self.sv.vector_size), dtype=REAL)

Reproducible example from the tutorial:

import gensim.downloader as api
from fse import IndexedList
from fse.models import SIF

data = api.load("quora-duplicate-questions")
glove = api.load("glove-wiki-gigaword-100")

sentences = []
for d in data:
    # Let's blow up the data a bit by replicating each sentence.
    for i in range(8):
        sentences.append(d["question1"].split())
        sentences.append(d["question2"].split())
s = IndexedList(sentences)

model = SIF(glove, workers=2)
model.train(s)
tmp = ("Hello my friends".split(), 0)
model.infer([tmp, tmp])

ImportError: cannot import name '_l2_norm' from 'gensim.models.keyedvectors'

ImportError                               Traceback (most recent call last)

in <module>()
----> 1 from fse import Vectors, Average, IndexedList
      2 vecs = Vectors.from_pretrained("fasttext-wiki-news-subwords-300")
      3 model = Average(vecs)

3 frames

/usr/local/lib/python3.7/dist-packages/fse/models/base_s2v.py in <module>()
     40
     41 from gensim.models.base_any2vec import BaseWordEmbeddingsModel
---> 42 from gensim.models.keyedvectors import BaseKeyedVectors, FastTextKeyedVectors, _l2_norm
     43 from gensim.utils import SaveLoad
     44 from gensim.matutils import zeros_aligned

ImportError: cannot import name '_l2_norm' from 'gensim.models.keyedvectors' (/usr/local/lib/python3.7/dist-packages/gensim/models/keyedvectors.py)

GENSIM KeyedVectors and downloadable Models

It appears that when I download any model from gensim's downloader API, or save a Word2Vec model and re-load it in KeyedVectors format, the vocab object stores a reverse index in the "count" variable. So, for example, if I have 10 words in the model, the first word has a count of 10 and an index of 0.

Using the following code:

word_vectors = api.load('glove-wiki-gigaword-100')
sif_model = uSIF(model=word_vectors)

word_vectors.wv.vocab shows the first word to be "the", with count = 400000 and index = 0.
For each succeeding word in the model, the count goes down by one and the index goes up by one.

Clearly, this is not the frequency information.

I took this example from your Jupyter notebook, so I am assuming that something has changed with the models themselves? Any guidance on this would be helpful. I CAN create my own word2vec models, and those have the frequency values as expected, and the precalculation works as expected.

Thanks for any thoughts or guidance on this. Perhaps it is normal that none of these models retain the word frequencies.

Thanks,

Michael Wade

dummy example doesn't work for me

Thanks for this package. Very much needed work :-)

se.train(sentences)
gives: TypeError: s2v_train() takes 3 positional arguments but 4 were given

I was trying it on Google Colab.

maintenance

Hi Oliver,
You have merged a few PRs but still have not issued a new version. Is there a chance of that in the near future? Or, if you are busy with other projects, could you add maintainers to your repo? Either I or the person who did the recent PR would be more than happy to contribute and help this package survive. @oborchers

question for output

Dear @oborchers

I'm investigating sentence vectors with your gensim example (data and glove).

When I check the most similar sentences for s[0], I get a good result:

[(10, 1.0),
 (2, 1.0),
 (4, 1.0),
 (14, 1.0),
 (6, 1.0),
 (8, 1.0),
 (12, 1.0),
 (15, 0.9294594526290894),
 (13, 0.9294594526290894),
 (1, 0.9294594526290894)]
0 (['What', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market', 'in', 'india?'], 10)
1 (['What', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market', 'in', 'india?'], 2)
2 (['What', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market', 'in', 'india?'], 4)
3 (['What', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market', 'in', 'india?'], 14)
4 (['What', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market', 'in', 'india?'], 6)
5 (['What', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market', 'in', 'india?'], 8)
6 (['What', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in', 'share', 'market', 'in', 'india?'], 12)

and checking the similarities works well:

0 10 similarity :  1.0
0 2 similarity :  1.0
0 4 similarity :  1.0
0 14 similarity :  1.0
0 6 similarity :  1.0
0 8 similarity :  1.0
0 12 similarity :  1.0

However, when I check s[100], the result is wrong, even though s[100] and s[102] are the same sentence:

100 (['Should', 'I', 'buy', 'tiago?'], 100)
102 (['Should', 'I', 'buy', 'tiago?'], 102)

yet the result is different:

model.sv.most_similar(100)
[(3949083, 1.0),
 (897678, 1.0),
 (4229890, 1.0),
 (3949079, 1.0),
 (3949081, 1.0),
 (2934317, 1.0),
 (4093542, 1.0),
 (3949075, 1.0),
 (4229889, 1.0),
 (2934319, 1.0)]
3949083 (['Should', 'I', 'buy', 'Asus', 'Zenfone', '5?'], 3949083)
897678 (['Why', "doesn't", 'Google', 'buy', 'Quora?'], 897678)

Do you have any idea?

Returning vectors with similarity above threshold for most_similar()

In sentencevectors.py, most_similar() can return the topn most similar sentences. However, it would be useful to be able to specify a similarity threshold above which sentences are returned. For this, topn could take a fractional value: if topn is strictly smaller than 1, it is treated as a threshold; otherwise it works the same way as it does now.
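Until something like this lands, a workaround sketch with the current API (most_similar returns (index, similarity) pairs, as shown elsewhere in this document): request a generous topn and filter by threshold afterwards.

threshold = 0.9
results = model.sv.most_similar(100, topn=1000)  # list of (index, similarity)
above = [(idx, sim) for idx, sim in results if sim >= threshold]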

train on data and predict on new data

Hi, I see that the train method of an fse object returns the sentence embedding.
Is there a predict method to apply the trained model to new data?
Or does train also stand in for predict?
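
For reference, other issues in this document use infer for exactly this purpose; a minimal sketch (the input format is the (tokens, index) tuple convention used in those examples):

new_data = [("this is a new sentence".split(), 0),
            ("and another one".split(), 1)]
new_embeddings = model.infer(new_data)  # one row per input sentence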

Best

C extension not loaded, training/inferring will be slow

Hi, I've installed fse on Windows and I get the following warning:

C:\Users\CARLOS\Anaconda3\lib\site-packages\fse\models\base_s2v.py:115: UserWarning: C extension not loaded, training/inferring will be slow. Install a C compiler and reinstall fse.
"C extension not loaded, training/inferring will be slow. "

I know that this type of warning also appears in gensim in some cases; however, in gensim I don't have this problem. Does anyone have any idea what I should do?

My results on gensim:

from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 1

My training speed is :

training on 571201 effective sentences with 4590776 effective words took 28s with 19703 sentences/s
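For comparison, fse exposes its own FAST_VERSION flag (the import appears in an earlier issue in this document); assuming the same convention as gensim, where -1 means the compiled routines are missing:

from fse.models.average import FAST_VERSION
print(FAST_VERSION)  # -1 would indicate the C extension did not load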

method save works (somehow) but load does not

Hi, when I call the .save method on a SIF model, it works, although, as I understand it, the only way to save/serialize the model to disk is by using pickle?

model_sif.save("model_sif2")

Trying to load the saved model, however, returns an error:

model_sif2= FT_gensim.load("model_sif2")

AttributeError: Can't get attribute 'FastTextKeyedVectors' on <module 'gensim.models.deprecated.keyedvectors' from '/j/miniconda3/envs/clean_unsup/lib/python3.7/site-packages/gensim/models/deprecated/keyedvectors.py'>
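A hedged guess at the fix: gensim's SaveLoad pickles the full object, so the model should be restored with the same class that saved it, rather than with FT_gensim:

from fse.models import SIF

model_sif2 = SIF.load("model_sif2")  # FT_gensim.load tries to unpickle it as a FastText model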

AttributeError: 'Word2Vec' object has no attribute 'infer'

Hi, when I run the line of code below, I get the error shown. Could you please advise?
I am using a word2vec model trained with the gensim package.

Code:
model.sv.similar_by_sentence("Is this really easy to learn".split(), model=wmodel, indexable=s.items)

Error:
File "C:\ProgramData\Anaconda3\lib\site-packages\fse\models\sentencevectors.py", line 347, in similar_by_sentence
vector = model.infer([(sentence, 0)])

AttributeError: 'Word2Vec' object has no attribute 'infer'
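A plausible explanation (an inference from the traceback above, not a confirmed answer): the model keyword of similar_by_sentence expects the trained fse model, which provides infer, not the underlying gensim Word2Vec. A sketch:

# Pass the fse model (model), not the gensim Word2Vec (wmodel);
# sentencevectors.py calls model.infer(...), which Word2Vec lacks.
model.sv.similar_by_sentence("Is this really easy to learn".split(), model=model, indexable=s.items)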
