
webvectors's Introduction

webvectors

WebVectors is a toolkit for serving vector semantic models (in particular, prediction-based word embeddings, as in word2vec or ELMo) over the web, making it easy to demonstrate their abilities to the general public. It requires Python >= 3.6 and uses Flask, Gensim and simple_elmo under the hood.

Working demos:

The service can be either integrated into Apache web server as a WSGI application or run as a standalone server using Gunicorn (we recommend the latter option).


Brief installation instructions

  1. Clone the WebVectors git repository (git clone https://github.com/akutuzov/webvectors.git) into a directory accessible by your web server.
  2. Install Apache (for the Apache integration) or Gunicorn (for the standalone server).
  3. Install all the Python requirements (pip3 install -r requirements.txt).
  4. If you want to use PoS tagging for user queries, install UDPipe, Stanford CoreNLP, Freeling or another PoS tagger of your choice.
  5. Configure the files:

For Apache installation variant

Add the following line to Apache configuration file:

WSGIScriptAlias /WEBNAME "PATH/syn.wsgi", where WEBNAME is the alias for your service relative to the server root (webvectors for http://example.com/webvectors), and PATH is your filesystem path to the WebVectors directory.
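For instance, if the repository lives at /var/www/webvectors (a hypothetical path) and the service should answer at http://example.com/webvectors, the directive would be:

```apache
# WEBNAME = webvectors, PATH = /var/www/webvectors (adjust both to your setup)
WSGIScriptAlias /webvectors "/var/www/webvectors/syn.wsgi"
```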

For all installation variants

In all *.wsgi and *.py files in your WebVectors directory, replace webvectors.cfg in the string config.read('webvectors.cfg') with the absolute path to the webvectors.cfg file.

Set up your service using the configuration file webvectors.cfg. Most important settings are:

  • `root` - absolute path to your _WebVectors_ directory (**NB: end it with a slash!**)
  • `temp` - absolute path to your temporary files directory
  • `font` - absolute path to a TTF font you want to use for plots (otherwise, the default system font will be used)
  • `detect_tag` - whether to use automatic PoS tagging
  • `default_search` - URL of search engine to use on individual word pages (for example, https://duckduckgo.com/?q=)
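Putting these together, a sketch of the relevant webvectors.cfg fragments might look as follows (section names follow the sample config shown later on this page; all paths are hypothetical):

```ini
[Files and directories]
root = /var/www/webvectors/
temp = /var/www/webvectors/tmp
font = /var/www/webvectors/fonts/OpenSans-Bold.ttf

[Tags]
detect_tag = True

[Other]
default_search = https://duckduckgo.com/?q=
```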

Tags

Models can use arbitrary tags assigned to words (for example, part-of-speech tags, as in boot_NOUN). If your models are trained on words with tags, you should switch this on in webvectors.cfg (the use_tags variable). Then WebVectors will allow users to filter their queries by tags. You should also specify the list of allowed tags (the tags_list variable in webvectors.cfg) and the list of tags that will be shown to the user (the tags.tsv file).

Models daemon

WebVectors uses a daemon which runs in the background and actually processes all embedding-related tasks. It can also run on a different machine if you want. Thus, in webvectors.cfg you should specify the host and port that this daemon will listen on. After that, start the actual daemon script word2vec_server.py. It will load the models and open a listening socket. This daemon must be active permanently, so you may want to launch it using screen or something similar.
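As a toy illustration of this client/daemon architecture (the actual wire protocol of word2vec_server.py is not shown here; the JSON exchange below is an assumption), here is a client talking to a stand-in daemon over a socket:

```python
import json
import socket
import threading
import time

HOST, PORT = "localhost", 12666  # must match the [Sockets] section of webvectors.cfg

def serve_once():
    # Stand-in for word2vec_server.py: accept one connection, answer one query.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((HOST, PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    request = json.loads(conn.recv(4096).decode("utf-8"))
    # A real daemon would look the word up in the loaded models here.
    reply = {"query": request["query"], "neighbors": []}
    conn.sendall(json.dumps(reply).encode("utf-8"))
    conn.close()
    srv.close()

threading.Thread(target=serve_once, daemon=True).start()
time.sleep(0.2)  # give the stand-in daemon a moment to bind its socket

cli = socket.create_connection((HOST, PORT))
cli.sendall(json.dumps({"query": "boot_NOUN"}).encode("utf-8"))
answer = json.loads(cli.recv(4096).decode("utf-8"))
cli.close()
print(answer["query"])
```

The point is only that the web frontend and the model daemon communicate over a plain TCP socket, which is why they can live on different machines.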

Models

The list of models you want to use is defined in the file models.tsv. It consists of tab-separated fields:

  • model identifier
  • model description
  • path to model
  • identifier of localized model name
  • is the model default or not
  • does the model contain PoS tags
  • training algorithm of the model (word2vec/fastText/etc)
  • size of the training corpus in words
  • language of the model

The model identifier will be used as the name for checkboxes on the web pages, and it is also important that the same identifier is used to denote model names in the strings.csv file. The language of the model is passed as an argument to the lemmatizer function; it is a simple string with the name of the language (e.g. "english", "russian", "french").
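As a sketch, the nine tab-separated fields could be read like this (the dictionary keys and the sample row are illustrative, not the actual WebVectors internals):

```python
import csv
import io

# Column order as listed above; the key names themselves are illustrative.
FIELDS = ["identifier", "description", "path", "string", "default",
          "has_tags", "algo", "size", "lang"]

# A made-up single-model models.tsv line.
sample = ("enwiki_upos_skipgram_300\tEnglish Wikipedia model\t/data/enwiki.bin\t"
          "enwiki_name\tTrue\tTrue\tword2vec\t2000000000\tenglish\n")

models = {}
for row in csv.reader(io.StringIO(sample), delimiter="\t"):
    entry = dict(zip(FIELDS, row))
    models[entry["identifier"]] = entry

print(models["enwiki_upos_skipgram_300"]["lang"])
```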

Models can currently be in 4 formats:

  • plain text _word2vec_ models (ending in `.vec`);
  • binary _word2vec_ models (ending in `.bin`);
  • Gensim format _word2vec_ models (ending in `.model`);
  • Gensim format _fastText_ models (ending in `.model`).

WebVectors will automatically detect the model format and load all the models into memory. Users will be able to choose among the loaded models.
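A minimal sketch of extension-based format detection (the mapping names are illustrative; the real detection logic lives in the WebVectors code):

```python
def detect_model_format(path: str) -> str:
    """Guess how to load a model from its file extension."""
    if path.endswith(".vec"):
        return "word2vec_text"    # plain-text word2vec
    if path.endswith(".bin"):
        return "word2vec_binary"  # binary word2vec
    if path.endswith(".model"):
        # Gensim-native format: may hold either word2vec or fastText
        return "gensim"
    raise ValueError(f"Unsupported model file: {path}")

print(detect_model_format("models/enwiki.bin"))
```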

Localization

WebVectors uses the strings.csv file as the source of localized strings. It is a comma-separated file with 3 fields:

  • identifier
  • string in language 1
  • string in language 2

By default, language 1 is English and language 2 is Russian. This can be changed in webvectors.cfg.
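A sketch of how such a file can be turned into a lookup table (the sample rows are made up):

```python
import csv
import io

# A hypothetical strings.csv fragment: identifier, language 1, language 2.
sample = "similar_words,Similar words,Похожие слова\nmodels,Models,Модели\n"

l10n = {}
for identifier, lang1, lang2 in csv.reader(io.StringIO(sample)):
    l10n[identifier] = {"en": lang1, "ru": lang2}

print(l10n["models"]["ru"])
```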

Templates

The actual web pages shown to the user are defined in the files templates/*.html. Tune them as you wish. The main menu is defined in base.html.

Static files

If your application does not find the static files (Bootstrap and JS scripts), edit the static_url_path variable in run_syn.py. Put the absolute path to the data folder there.

Query hints

If you want query hints to work, do not forget to compile your own list of hints (in JSON format). An example of such a list is given in data/example_vocab.json. The actual URL of this list should be set in data/hint.js.
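The authoritative schema is whatever data/example_vocab.json uses; purely as an illustration, a hints file could be as simple as a JSON array of query strings (this shape is an assumption — check the bundled example before relying on it):

```json
["boot_NOUN", "dark_ADJ", "run_VERB"]
```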

Running WebVectors

Once you have modified all the settings according to your workflow, made sure the templates are OK for you, and launched the models daemon, you are ready to actually start the service. If you use Apache integration, simply restart/reload Apache. If you prefer the standalone option, execute the following command in the root directory of the project:

gunicorn run_syn:app_syn -b address:port

where address is the address on which the service should listen (can be localhost), and port is the port to listen on (for example, 9999).

Support for contextualized embeddings

You can turn on support for contextualized embedding models (currently ELMo is supported). In order to do that:

  1. Install the simple_elmo package

  2. Download an ELMo model of your choice (for example, here).

  3. Create a type-based projection in the word2vec format for a limited set of words (for example 10 000), given the ELMo model and a reference corpus. For this, use the extract_elmo.py script we provide:

python3 extract_elmo.py --input CORPUS --elmo PATH_TO_ELMO --outfile TYPE_EMBEDDING_FILE --vocab WORD_SET_FILE

It will run the ELMo model over the provided corpus and generate static averaged type embeddings for each word in the word set. They will be used as lexical substitutes.

  4. Prepare a frequency dictionary to use with the contextualized visualizations, as a plain-text tab-separated file where the first column contains words and the second column contains their frequencies in the reference corpus of your choice. The first line of this file should contain one integer matching the size of the corpus in word tokens.

  5. In the [Token] section of the webvectors.cfg configuration file, switch use_contextualized to True and state the paths to your token_model (pre-trained ELMo), type_model (the type-based projection you created with our script) and freq_file (your frequency dictionary).

  6. In the ref_static_model field, specify any of your static word embedding models (just its name) to use as the target of hyperlinks from words on the contextualized visualization pages.

  7. The page with ELMo lexical substitutes will be available at http://YOUR_ROOT_URL/contextual/
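For illustration, the frequency dictionary file described above can be produced with a few lines of Python (the words and counts are made up; a real corpus size would come from the corpus itself, not from summing the listed counts):

```python
# Made-up frequencies for two tagged words (hypothetical values).
freqs = {"boot_NOUN": 120, "dark_ADJ": 300}

# First line: one integer, the corpus size in word tokens.
# In this toy example we pretend the listed words cover the whole corpus.
lines = [str(sum(freqs.values()))]
lines += [f"{word}\t{count}" for word, count in sorted(freqs.items())]
content = "\n".join(lines) + "\n"
print(content.splitlines()[0])  # the first line holds the token total
```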

Contacts

In case of any problems, please feel free to contact us:

References

  1. http://www.aclweb.org/anthology/E17-3025

  2. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

  3. http://flask.pocoo.org/

  4. http://radimrehurek.com/gensim/

  5. http://gunicorn.org/

webvectors's People

Contributors

akutuzov, ananastii, danilsko, karahans, konstantinbarkalov, kunilovskaya, leann-fraoigh, lizaku


webvectors's Issues

Not able to calculate the opposite

In the section to the right at http://vectors.nlpl.eu/explore/embeddings/en/calculator/ where you can do algebraic operations on vectors, it is not possible to calculate the "opposite" of a word by just entering that word in the negative field. I was curious to see e.g. what the opposite of dark_ADJ could be.

One way to get around this was adding a dummy word, e.g. boy_NOUN, in both the positive and negative field. The opposite of dark_ADJ, however, was not quite what I had expected. ;-)

Thanks a lot for this WebVectors site! I will take a closer look at the data and the coding when I find some time.

Show last 10 or so calculated similarities

"When users use the tool to calculate the similarity between two words, it would be better if the computing history could remain, so users will have a better idea to compare the index of similarity, instead of a single decimal."

Language config does not work

Setting interface_languages = en doesn't seem to have any effect on the frontend; it is still ru/en.
@ElizavetaKuzmenko

Similar words are not found

Hello, I am trying to use your models together with gensim:

from gensim.models import KeyedVectors

w2v_fpath = "news_upos_cbow_600_2_2018.vec.gz"

w2v = KeyedVectors.load_word2vec_format(w2v_fpath, binary=False, unicode_errors='ignore')
w2v.init_sims(replace=True)

for word, score in w2v.most_similar(positive=["депутат"], topn=5):
    print(word, score)

It is all quite simple, but as output I get an error:

Traceback (most recent call last):
  File "/Users/avorontsov/projects/python/gensim/main.py", line 8, in <module>
    for word, score in w2v.most_similar(positive=["депутат"], topn=5):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 531, in most_similar
    mean.append(weight * self.word_vec(word, use_norm=True))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gensim/models/keyedvectors.py", line 452, in word_vec
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'депутат' not in vocabulary"

But if I switch to one of the models from http://panchenko.me/data/dsl-backup/w2v-ru/

then similar words are found. What am I doing wrong?

Inline comments in templates are repeating

Inline HTML comments describing what's happening in the templates are great.
Unfortunately, some of them are located within loops (example).
As a result, in the final web page, these comments are repeated, sometimes dozens of times. This increases page size and makes its code less readable.

I think there should be no inline comments within loops. Instead, they can be moved to the beginning of the page, etc.

Question about udpipe tokenization

Hi! Thank you for the great tool!

I have found some strange PROPN usage in preprocessing.
In https://github.com/akutuzov/webvectors/blob/master/preprocessing/rus_preprocessing_udpipe.py#L182 and below, there is an additional space character at the end.
Maybe it shouldn't have the additional space and should look like + '_PROPN', not + '_PROPN '?

Also, udpipe returns a strange result for the word "Спасибо":

>>> up.tokenize_upos("Спасибо")
['спасибо_PROPN ']
>>> up.tokenize_upos("спасибо")
['спасибо_NOUN']
>>> up.tokenize_upos("Газпромбанк")
['газпромбанк_PROPN ']

Redirection to "Similar Words" After Clicking on the Words

Greetings,

I'm a computer engineering student from Turkey and am currently working on creating a Turkish version of WebVectors. However, I have encountered a problem. When I click on a word, it attempts to redirect me to a link where similar words are found for that specific word. For example, when I click on the word "adam", it attempts to open a URL like this: http://127.0.0.1:5000/en/word2vec_skipgram_300/adam/. The name of my model is correct. The problem is that there is no such page in the current version of the repository, so I couldn't find an easy way to redirect users to a similar-words page for the word "adam". Am I missing anything here?

Let me provide screenshots as well:

  1. Here I hover over the word "araba".

  2. This is the address it redirects me to upon clicking on it.

How to run the project with a model?

I downloaded ruwikiruscorpora_upos_skipgram_300_2_2019, extracted it and tried to add it to webvectors.

What's incorrect in my configs?

Inaccuracy in a comment in the file webvectors/preprocessing/rus_preprocessing_udpipe.py

Hello colleagues. Thank you for the project. I noticed an inaccuracy that might be worth fixing.
The file webvectors/preprocessing/rus_preprocessing_udpipe.py says "зеленый_NOUN трамвай_NOUN"; the correct version is "зеленый_ADJ трамвай_NOUN" (your model thinks so too :) ).

Compatibility with Python 3

Some scripts are not Python 3 compatible.
For instance, tsne.py with outdated print statements and word2vec_server.py with outdated except clauses.

Persistent model selection

If a user selects a model (or a set of models), this choice should persist after the results are shown.
For now, it is reset to the default.

win textfile encoding

I understand this is a trifle, but I'll write it anyway.
If you run the tutorial on a Windows machine (which is probably strange, yes), then cell 3

text = open(textfile).read()

will fail to open the file, because the cp1251 encoding is used by default.

Maybe this would be nicer?

text = open(textfile, encoding='utf-8').read()

word2vec_server.py not starting on tagger port

I tried to run word2vec_server.py. The logs show that it has started. When I try to connect to port 12666 with telnet, everything works. But when I try to connect to tagger_port 66666, there is no connection.

root@vmi342368:/var/www/webvectors# python3.7 word2vec_server.py
2021-03-19 04:05:00,333 : INFO : loading projection weights from /var/www/model/model.bin
2021-03-19 04:05:10,794 : INFO : loaded (248978, 300) matrix from /var/www/model/model.bin
2021-03-19 04:05:10,794 : INFO : precomputing L2-norms of word weight vectors
Model ruwikiruscorpora_upos_skipgram_300_2_2019 from file /var/www/model/model.bin loaded successfully.
Socket created
Socket bind complete
Socket now listening on port 12666

[Languages]
interface_languages = ru,en

[Files and directories]
temp = tmp
l10n = strings.csv
models = models.tsv
image_cache = images_cache.csv
font = /var/www/webvectors/fonts/OpenSans-Bold.ttf
root = /var/www/webvectors/

[Sockets]
host = 0.0.0.0
port = 12666
tagger_port = 66666

[Tags]
use_tags = True
detect_tag = True
tags_list = ADJ ADV INTJ NOUN PROPN VERB ADP AUX CCONJ DET NUM PART PRON SCONJ PUNCT SYM X
exposed_tags_list = tags.tsv

[Other]
url = /
dbpedia_images = true
default_search = https://duckduckgo.com/?q=
tensorflow_projector = false
git_username = YOUR_GITHUB_USERNAME
git_token = YOUR_GITHUB_API_TOKEN

Data preprocessing

Prepositions like po, v are excluded from consideration, and then some tokens make up MWEs. That means that prepositional MWEs such as по принципу, в принципе are tagged принцип_NOUN in texts. What is worse, в течение is probably treated the same as течение_NOUN.
The same issue concerns MWEs for conjunctions and particles and, to a lesser extent, adverbial MWEs (or that is another issue).
(If I am wrong and they are filtered out, where can I find their list?)

dependencies

Specify in the dependencies which versions of gensim, Flask and matplotlib are needed for the toolkit to work.

Can't download geowac_lemmas_none_fasttextskipgram_300_5_2020 following the tutorial

I unzipped the archive and realized that geowac_lemmas_none_fasttextskipgram_300_5_2020 isn't a binary model; that is, it only contained 'model.model' and no 'model.bin' like, for instance, ruscorpora_upos_skipgram_600_10_2017. So I tried this:

import zipfile
model_url = 'http://vectors.nlpl.eu/repository/20/213.zip'
m = wget.download(model_url)
model_file = model_url.split('/')[-1]
with zipfile.ZipFile(model_file, 'r') as archive:
    stream = archive.open('model.model')
    model = FastText.load_fasttext_format(datapath(stream))


TypeError Traceback (most recent call last)
in <cell line: 7>()
7 with zipfile.ZipFile(model_file, 'r') as archive:
8 stream = archive.open('model.model')
----> 9 model = FastText.load_fasttext_format(datapath(stream))

2 frames
/usr/lib/python3.10/genericpath.py in _check_arg_types(funcname, *args)
150 hasbytes = True
151 else:
--> 152 raise TypeError(f'{funcname}() argument must be str, bytes, or '
153 f'os.PathLike object, not {s.__class__.__name__!r}') from None
154 if hasstr and hasbytes:

TypeError: join() argument must be str, bytes, or os.PathLike object, not 'ZipExtFile'

Then I unzipped the archive, downloaded 'model.model' and 'model.model.vectors.npy' into Google Colab and tried to open each of them directly, either via KeyedVectors.load(), via FastText.load_fasttext_format(), via gensim.models.fasttext.load_facebook_model() or via gensim.models.fasttext.load_facebook_vectors().

model = KeyedVectors.load('model.model')

FileNotFoundError Traceback (most recent call last)
in <cell line: 1>()
----> 1 model = KeyedVectors.load('model.model')

4 frames
/usr/local/lib/python3.10/dist-packages/numpy/lib/npyio.py in load(file, mmap_mode, allow_pickle, fix_imports, encoding, max_header_size)
425 own_fid = False
426 else:
--> 427 fid = stack.enter_context(open(os_fspath(file), "rb"))
428 own_fid = True
429

FileNotFoundError: [Errno 2] No such file or directory: 'model.model.vectors_vocab.npy'

model = KeyedVectors.load('model.model.vectors.npy')

UnpicklingError Traceback (most recent call last)
in <cell line: 1>()
----> 1 model = KeyedVectors.load('model.model.vectors.npy')

1 frames
/usr/local/lib/python3.10/dist-packages/gensim/utils.py in unpickle(fname)
1459 """
1460 with open(fname, 'rb') as f:
-> 1461 return _pickle.load(f, encoding='latin1') # needed because loading from S3 doesn't support readline()
1462
1463

UnpicklingError: STACK_GLOBAL requires str

model = FastText.load_fasttext_format('model.model')

NotImplementedError Traceback (most recent call last)
in <cell line: 1>()
----> 1 model = FastText.load_fasttext_format('model.model')

5 frames
/usr/local/lib/python3.10/dist-packages/gensim/models/_fasttext_bin.py in _load_vocab(fin, new_format, encoding)
196 # Vocab stored by Dictionary::save
197 if nlabels > 0:
--> 198 raise NotImplementedError("Supervised fastText models are not supported")
199 logger.info("loading %s words for fastText model from %s", vocab_size, fin.name)
200

NotImplementedError: Supervised fastText models are not supported

model = FastText.load_fasttext_format('model.model.vectors.npy')

NotImplementedError Traceback (most recent call last)
in <cell line: 1>()
----> 1 model = FastText.load_fasttext_format('model.model.vectors.npy')

5 frames
/usr/local/lib/python3.10/dist-packages/gensim/models/_fasttext_bin.py in _load_vocab(fin, new_format, encoding)
196 # Vocab stored by Dictionary::save
197 if nlabels > 0:
--> 198 raise NotImplementedError("Supervised fastText models are not supported")
199 logger.info("loading %s words for fastText model from %s", vocab_size, fin.name)
200

NotImplementedError: Supervised fastText models are not supported

model = FastText.load_facebook_model('model.model')

AttributeError Traceback (most recent call last)
in <cell line: 1>()
----> 1 model = FastText.load_facebook_model('model.model.vectors.npy')

AttributeError: type object 'FastText' has no attribute 'load_facebook_model'

Should I try datapath from gensim.test.utils, api.load from gensim.downloader, or pip install fasttext instead of gensim's FastText?
How to download any pretrained fasttext model from rusvectores?

Add PCA visualizations

In addition to the tSNE projections, we need PCA projections as well. They might not be as visually attractive, but they are deterministic and stable.

Output word frequencies

Show word frequencies (and their frequency ranks), taken from the training corpus.
For that, we would need to load models in Gensim format.
