
wikipedia2vec's Introduction

Wikipedia2Vec


Wikipedia2Vec is a tool used for obtaining embeddings (or vector representations) of words and entities (i.e., concepts that have corresponding pages in Wikipedia) from Wikipedia. It is developed and maintained by Studio Ousia.

This tool enables you to learn embeddings of words and entities simultaneously, and places similar words and entities close to one another in a continuous vector space. Embeddings can be easily trained by a single command with a publicly available Wikipedia dump as input.

This tool implements the conventional skip-gram model to learn the embeddings of words, and its extension proposed in Yamada et al. (2016) to learn the embeddings of entities.

An empirical comparison between Wikipedia2Vec and existing embedding tools (i.e., FastText, Gensim, RDF2Vec, and Wiki2vec) is available here.

Documentation is available online at http://wikipedia2vec.github.io/.

Basic Usage

Wikipedia2Vec can be installed via PyPI:

% pip install wikipedia2vec

With this tool, embeddings can be learned by running a train command with a Wikipedia dump as input. For example, the following commands download the latest English Wikipedia dump and learn embeddings from this dump:

% wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
% wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 MODEL_FILE

Then, the learned embeddings are written to MODEL_FILE. Note that this command can take many optional parameters. Please refer to our documentation for further details.
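The trained model can then be loaded and queried from Python. A minimal sketch (the calls below follow the documented Python API; see the documentation for the full list of methods and options):

from wikipedia2vec import Wikipedia2Vec

# Load the model written to MODEL_FILE by the train command above.
wiki2vec = Wikipedia2Vec.load("MODEL_FILE")

# Word vectors (words are lowercased by default) and entity vectors.
word_vector = wiki2vec.get_word_vector("tokyo")
entity_vector = wiki2vec.get_entity_vector("Scarlett Johansson")

# The ten words and entities most similar to a given word.
for item, score in wiki2vec.most_similar(wiki2vec.get_word("tokyo"), 10):
    print(item, score)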

Pretrained Embeddings

Pretrained embeddings for 12 languages (i.e., English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Polish, Portuguese, Russian, and Spanish) can be downloaded from this page.

Use Cases

Wikipedia2Vec has been applied to a variety of natural language processing tasks.

References

If you use Wikipedia2Vec in a scientific publication, please cite the following paper:

Ikuya Yamada, Akari Asai, Jin Sakuma, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Yuji Matsumoto, Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia.

@inproceedings{yamada2020wikipedia2vec,
  title = "{W}ikipedia2{V}ec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from {W}ikipedia",
  author={Yamada, Ikuya and Asai, Akari and Sakuma, Jin and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu and Matsumoto, Yuji},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  year = {2020},
  publisher = {Association for Computational Linguistics},
  pages = {23--30}
}

The embedding model was originally proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation.

@inproceedings{yamada2016joint,
  title={Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation},
  author={Yamada, Ikuya and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu},
  booktitle={Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning},
  year={2016},
  publisher={Association for Computational Linguistics},
  pages={250--259}
}

The text classification model implemented in this example was proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, Neural Attentive Bag-of-Entities Model for Text Classification.

@inproceedings{yamada2019neural,
  title={Neural Attentive Bag-of-Entities Model for Text Classification},
  author={Yamada, Ikuya and Shindo, Hiroyuki},
  booktitle={Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)},
  year={2019},
  publisher={Association for Computational Linguistics},
  pages = {563--573}
}

License

Apache License 2.0

wikipedia2vec's People

Contributors

akariasai, ikuyamada, jyori112, l00mi, naoyat

wikipedia2vec's Issues

training the same model multiple times?

Thanks for releasing this. I have a question that I am hoping you can answer. If I were to train this model using one Wikipedia dump, and then after training, consecutively train the same model on a new Wikipedia dump, do you have any idea whether good performance would be maintained? In other words, could this model be continually trained on new Wikipedia dumps indefinitely without losing embedding quality?

https://en.wikipedia.org/wiki/Online_machine_learning

Entity detector model on custom dataset

I want to use the entity detector model on a dataset of tweets. The functions in data.py appear to load only one of two preloaded datasets. Is it currently possible to use custom datasets?

Code for Named Entity Disambiguation

Hi, is the code for the paper "Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation" (which also learns the Wikipedia embeddings using the anchor context model) available? Or do you plan to release it at some point? I am particularly interested in the NED model experiments.

Thanks

how to load model with gensim?

Hi,

I have downloaded the pretrained embedding model and loaded the file using Gensim's load_word2vec_format(). However, it failed with the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

So how can I correctly load the pretrained model using gensim? Can you help me?

Thanks
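For reference, the pretrained models are distributed in two formats: the *.pkl files are binary models meant for Wikipedia2Vec.load(), and only the *.txt files follow the word2vec text format that gensim can read. A minimal sketch, assuming a downloaded and decompressed text-format file:

from gensim.models import KeyedVectors

# Text-format pretrained file, e.g. enwiki_20180420_100d.txt (the .pkl files
# are a different, binary format and must be loaded with Wikipedia2Vec.load()).
kv = KeyedVectors.load_word2vec_format("enwiki_20180420_100d.txt", binary=False)

word_vector = kv["tokyo"]
# Entity rows in the text files carry an ENTITY/ prefix; check a few lines of
# the file to confirm the exact convention (e.g. underscores vs. spaces).
entity_vector = kv["ENTITY/Scarlett_Johansson"]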

ValueError: numpy.ufunc has the wrong size, try recompiling. Expected 192, got 216

Hi,

I'm trying to run wikipedia2vec on Python 3.7, on Windows 10, but I get problems with "numpy.ufunc has the wrong size".

Googling this points to errors with multiple versions of numpy installed, but I have verified that this is not the case. I'm executing this inside a separate virtualenv, and pip freeze shows only the latest version of numpy installed. I've also tried first installing numpy, and then wikipedia2vec.

The reason I think this might have something to do with wikipedia2vec is the strange path at the end of the stack trace: ".venv/lib/python3.6/site-packages/Cython/Includes/numpy/__init__.pxd". This is NOT the version of Python I have installed, nor a path that exists on my filesystem at all (I don't even have a .venv directory!). The only reference to this directory is inside a wikipedia2vec.cpp file, which is somehow created while wikipedia2vec is being built.

I did not get these errors when I tried Python 3.7 on macOS.

(wikipedia-embeddings) C:\Projects\wikipedia-embeddings>wikipedia2vec build-dump-db svwiki-latest-pages-articles.xml.bz2 svwiki-dump-db.bin
Traceback (most recent call last):
  File "c:\program files\python37\Lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\program files\python37\Lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Admin\Envs\wikipedia-embeddings\Scripts\wikipedia2vec.exe\__main__.py", line 5, in <module>
  File "c:\users\admin\envs\wikipedia-embeddings\lib\site-packages\wikipedia2vec\__init__.py", line 4, in <module>
    from .dictionary import Dictionary
  File ".venv/lib/python3.6/site-packages/Cython/Includes/numpy/__init__.pxd", line 872, in init wikipedia2vec.dictionary
ValueError: numpy.ufunc has the wrong size, try recompiling. Expected 192, got 216

Any ideas of what I should try next?

(wikipedia-embeddings) C:\Projects\wikipedia-embeddings>pip freeze
Click==7.0
jieba==0.39
joblib==0.13.0
lmdb==0.94
marisa-trie==0.7.5
mwparserfromhell==0.5.2
numpy==1.15.4
scipy==1.2.0
six==1.12.0
tqdm==4.28.1
wikipedia2vec==1.0.1

ImportError: No module named 'dictionary'

from wikipedia2vec import Wikipedia2Vec
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\root\Python35\lib\site-packages\wikipedia2vec\__init__.py", line 7, in <module>
    from dictionary import Dictionary
ImportError: No module named 'dictionary'

How to get most similar items to added/subtracted vectors?

I want to do something like this:

finalvec = wiki2vec.get_entity_vector("Scarlett Johansson") - wiki2vec.get_word_vector("american") + wiki2vec.get_word_vector("japanese")

Or if that doesn't quite work, then:

finalvec = wiki2vec.get_entity_vector("Scarlett Johansson") + wiki2vec.get_entity_vector("George Clooney")

And then:

mostsim = wiki2vec.most_similar(finalvec, 50)

I want to be able to add two vectors together, or do mathematical operations on them in general, and then get the entities nearest to the resulting vector. Is that possible?

When I try it, I get this error:

Traceback (most recent call last):
  File "wikipedia-finder2-withoutwhile.py", line 97, in <module>
    mostsim = wiki2vec.most_similar(finalvec, 50000)
TypeError: Argument 'item' has incorrect type (expected wikipedia2vec.dictionary.Item, got numpy.ndarray)
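The TypeError happens because most_similar expects a dictionary Item (a Word or Entity object) rather than a raw numpy array. One way around it, sketched under the assumption that the installed version provides most_similar_by_vector (check the API reference; otherwise the cosine similarities against the embedding matrix have to be computed manually):

import numpy as np
from wikipedia2vec import Wikipedia2Vec

wiki2vec = Wikipedia2Vec.load("MODEL_FILE")

finalvec = (wiki2vec.get_entity_vector("Scarlett Johansson")
            - wiki2vec.get_word_vector("american")
            + wiki2vec.get_word_vector("japanese"))

# most_similar_by_vector takes a raw vector rather than a dictionary Item.
for item, score in wiki2vec.most_similar_by_vector(np.asarray(finalvec), 50):
    print(item, score)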

Delete the token '\n' (and '\r' if any) in the txt format.

I think the embedding of the token '\n' is almost useless, and it causes I/O errors when we load the embeddings on an operating system that uses '\n' as the line-break symbol.

This problem exists when I use torchtext and gensim.

If I am doing something wrong, how can I solve this elegantly? For now, I work around it by deleting the token '\n' and its embedding.
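Until the exporter skips such tokens, one workaround is to rewrite the text file and drop the malformed rows. A minimal sketch, assuming a word2vec-style text file whose first line is a "<count> <dimension>" header (adjust if your file has no header); any row with an empty token or the wrong number of fields, which is what a literal '\n' or '\r' token produces, is dropped:

def clean_embedding_file(src_path, dst_path):
    with open(src_path, encoding="utf-8") as src:
        header = src.readline().split()
        dim = int(header[1])
        kept = []
        for line in src:
            fields = line.rstrip("\n").split(" ")
            # Rows produced by the '\n' token show up as an empty token or a
            # line with the wrong number of vector components; skip them.
            if len(fields) != dim + 1 or not fields[0].strip():
                continue
            kept.append(line)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(f"{len(kept)} {dim}\n")
        dst.writelines(kept)

clean_embedding_file("enwiki_300d.txt", "enwiki_300d.cleaned.txt")

The cleaned file can then be loaded with torchtext or gensim as usual.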

How to add another tokenizer?

Hi, thanks for this great repo!

I was wondering if I can add another tokenizer to Wikipedia2Vec. Currently, it supports specific tokenizers such as MeCab for Japanese and jieba for Chinese, but I'd like to test other tokenizers for other languages.

  1. I tried to tweak the source code with pip install -e ., but it gave a complaint like from dictionary import Dictionary followed by ImportError: No module named 'dictionary'.
  2. So instead I installed it with pip install wikipedia2vec and added a new_tokenizer.pyx file in the utils/tokenizer folder. I also added it to the __init__ file, but the system failed to recognize it.

I'm not familiar with Cython. Could you help me with this?

Out-of-vocabulary words

Hello,

Is it possible to train and provide a model that is trained on subwords, like fastText?
We need this to deal with the out-of-vocabulary (OOV) words issue.

ValueError: mmap length is greater than file size

attempting basic usage:

/usr/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/usr/lib/python3.5/contextlib.py:59: UserWarning: mmap_mode "c" is not compatible with compressed file /mnt/big_data/enwiki_20180420_100d.pkl.bz2. "c" flag will be ignored.
  return next(self.gen)
Traceback (most recent call last):
  File "words.py", line 3, in <module>
    wiki2vec = Wikipedia2Vec.load('/mnt/big_data/enwiki_20180420_100d.pkl.bz2')
  File "wikipedia2vec/wikipedia2vec.pyx", line 157, in wikipedia2vec.wikipedia2vec.Wikipedia2Vec.load
  File "/home/.local/lib/python3.5/site-packages/joblib/numpy_pickle.py", line 596, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/home/.local/lib/python3.5/site-packages/joblib/numpy_pickle.py", line 524, in _unpickle
    obj = unpickler.load()
  File "/usr/lib/python3.5/pickle.py", line 1039, in load
    dispatch[key[0]](self)
  File "/home/.local/lib/python3.5/site-packages/joblib/numpy_pickle.py", line 352, in load_build
    self.stack.append(array_wrapper.read(self))
  File "/home/.local/lib/python3.5/site-packages/joblib/numpy_pickle.py", line 193, in read
    array = self.read_mmap(unpickler)
  File "/home/.local/lib/python3.5/site-packages/joblib/numpy_pickle.py", line 171, in read_mmap
    offset=offset)
  File "/home/.local/lib/python3.5/site-packages/joblib/backports.py", line 23, in make_memmap
    shape=shape, order=order)
  File "/home/.local/lib/python3.5/site-packages/numpy/core/memmap.py", line 264, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
ValueError: mmap length is greater than file size
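This is not a confirmed diagnosis, but an mmap error of this kind often indicates either a truncated download or an attempt to memory-map a still-compressed file (note the warning above about mmap_mode and the .pkl.bz2 file). One thing worth trying, sketched as an assumption rather than an official fix, is decompressing the model before loading it:

import bz2
import shutil

from wikipedia2vec import Wikipedia2Vec

# Decompress the pretrained model first so that it can be memory-mapped.
with bz2.open("enwiki_20180420_100d.pkl.bz2", "rb") as src, \
        open("enwiki_20180420_100d.pkl", "wb") as dst:
    shutil.copyfileobj(src, dst)

wiki2vec = Wikipedia2Vec.load("enwiki_20180420_100d.pkl")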

Embedding for <unknown> token.

Does this model provide an embedding for an unknown token (i.e., a token that represents the infrequent words not included in the vocabulary)? Thanks in advance.

wikipedia id other than title

Hi team, great work! I have a quick question and hope you could help me out.

I'm trying to match Wikipedia2Vec entity embeddings to each annotation in the TAC and CoNLL datasets. To my understanding, the only way to look up an entity in Wikipedia2Vec is through the title of its Wikipedia page. However, the datasets are annotated with other IDs, such as:

  1. wikipedia url: https://en.wikipedia.org/wiki/Scarlett_Johansson
  2. wikipedia page id: 20913246

or some other ids.

How do you convert between titles and the IDs that these datasets use in the paper?
For example, for case 1, do you use a utility function that normalizes the URL to the title in order to look up the Wikipedia2Vec entity embedding?
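For the URL case, a simple normalization (strip everything up to /wiki/, URL-decode, and replace underscores with spaces) usually recovers the title that Wikipedia2Vec expects; mapping a numeric page ID to a title needs an external source such as the dump's page table or the MediaWiki API. A minimal sketch (the helper name is made up for illustration):

from urllib.parse import unquote

def url_to_title(url):
    # "https://en.wikipedia.org/wiki/Scarlett_Johansson" -> "Scarlett Johansson"
    title = url.rsplit("/wiki/", 1)[-1]
    return unquote(title).replace("_", " ")

# wiki2vec is a model loaded beforehand via Wikipedia2Vec.load(...)
entity = wiki2vec.get_entity(url_to_title("https://en.wikipedia.org/wiki/Scarlett_Johansson"))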

suspicious leading and trailing space in title

Some Wikipedia2Vec entity titles have suspicious extra leading or trailing spaces. This affects some common entities such as May and India:

wikipedia2vec.get_entity("India") # with index 2131751
wikipedia2vec.get_entity(" India") # with index 2011310
wikipedia2vec.get_entity("May") # with index 1938987
wikipedia2vec.get_entity(" May") # with index 2011219

Loading pretrained model failed

Hi, when I tried to load the pretrained model using the following command:

wiki2vec = Wikipedia2Vec.load("enwiki_20180420_win10_500d.txt.bz2")

I got the following error:

users/anaconda3/lib/python3.6/contextlib.py:81: UserWarning: mmap_mode "c" is not compatible with compressed file /users/grad/ting/tr.ting/OIE/KBE/enwiki_20180420_win10_500d.txt.bz2. "c" flag will be ignored.
  return next(self.gen)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "wikipedia2vec/wikipedia2vec.pyx", line 172, in wikipedia2vec.wikipedia2vec.Wikipedia2Vec.load
  File "/user/anaconda3/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 598, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/users/grad/ting/tr.ting/anaconda3/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 526, in _unpickle
    obj = unpickler.load()
  File "/users/grad/ting/tr.ting/anaconda3/lib/python3.6/pickle.py", line 1050, in load
    dispatch[key[0]](self)
KeyError: 52

Do you know the reason?

Thanks.

IndexError when training

Hi, we are attempting to run wikipedia2vec on the 20190101 English Wikipedia dump with the following options:

wikipedia2vec train --min-link-prob=0.0 --min-prior-prob=0.0 --min-entity-count=0 --dim-size=300 --iteration=10 --negative=15 dumps/enwiki-20190101-pages-articles.xml.bz2 trained/enwiki_20190101_300d.pkl

We get through 16.1 million processed pages. Then there is no log output for several hours, until the process dies with the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/bin/wikipedia2vec", line 11, in <module>
    sys.exit(cli())
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 52, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 68, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 97, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 34, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 130, in train
    invoke(build_dump_db, out_file=dump_db_file)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 126, in invoke
    ctx.invoke(cmd, **cmd_kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 34, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 168, in build_dump_db
    DumpDB.build(dump_reader, out_file, **kwargs)
  File "wikipedia2vec/dump_db.pyx", line 156, in wikipedia2vec.dump_db.DumpDB.build
  File "wikipedia2vec/dump_db.pyx", line 182, in wikipedia2vec.dump_db.DumpDB.build
  File "wikipedia2vec/dump_db.pyx", line 186, in wikipedia2vec.dump_db.DumpDB.build
  File "/home/ubuntu/anaconda3/lib/python3.7/multiprocessing/pool.py", line 354, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/home/ubuntu/anaconda3/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
IndexError: string index out of range

This is running on Ubuntu 18.04.

ImportError: No module named 'click_graph'

from wikipedia2vec import Wikipedia2Vec
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\root\Python35\lib\site-packages\wikipedia2vec\__init__.py", line 8, in <module>
    from click_graph import ClickGraph
ImportError: No module named 'click_graph'

KeyError when trying get_entity_vector on some Wikipedia titles

I came across a number of Wikipedia titles, obtained from the last part of Wikipedia URLs, that I couldn't get embeddings for using get_entity_vector. For example, I try model.get_entity_vector("New business development") or model.get_entity_vector("Personal effectiveness") and I get a KeyError. But the corresponding pages exist: https://en.wikipedia.org/wiki/New_business_development, https://en.wikipedia.org/wiki/Personal_effectiveness.

Do you know why this could be happening? Thank you.

suspicious / in title containing '

Hey team,

This could be a silly question, but I found some Wikipedia2Vec titles with a weird extra \ in front of the ' character.

For example, the entity South Bird's Head languages has the title South Bird\'s Head languages.

In a similar fashion:

Ch\'olan languages
Central Alaskan Yup\'ik language
Pa\'o language
...

I'm using Python 3.6.8; I think it could be an encoding problem.

Embeddings exist for entities that do not have pages in Wikipedia

While using Wikipedia2Vec (which is an amazing tool, by the way; thanks so much for making it available!) I occasionally find that model.get_entity_vector gives embeddings for titles that do not have corresponding pages in Wikipedia. For example, I can retrieve embedding vectors for "Cultural awareness" or "Business communications", but the https://en.wikipedia.org/wiki/Cultural_awareness page does not exist. I found a Wikidata item for "Cultural awareness", but nothing for "Business communications". Does this mean that the Wikipedia dump I used for training the model contained these pages, but they have since been removed from Wikipedia? Thank you very much!

Parsing disambiguation page

The dump DB has wrongly classified some pages with respect to whether they are disambiguation pages.

For example:
http://en.wikipedia.org/wiki/Ashta is an entity in AIDA dataset

db.is_disambiguation("Ashta") => False

is_disambiguation says it's not a disambiguation page,

list(map(lambda p : p.text, db.get_paragraphs("Ashta"))) =>
<class 'list'>: ['Ashta,Madhya Pradesh may refer to:', 'Ashta, Bangladesh', 'Ashta, Madhya Pradesh, a municipality in Sehore district in the state of Madhya Pradesh, India', 'Ashta, Maharashtra, a city in Sangli district in the state of Maharashtra, India']

but the content is actually a disambiguation page.

I think it's due to how the page is parsed, somewhere here

Installation in the readme?

It might be helpful for some users to have the pip installation instructions in the README.

Thanks for working on this! It's a really cool piece of software

What Japanese text pre-processing method is used?

Hi,

Thanks for your great repository and providing pre-trained models.

I want to use the Japanese pre-trained embeddings, and I would like to know what method was used to pre-process the Wikipedia text (both the language-agnostic preprocessing and the Japanese-specific preprocessing).

Knowing the word segmentation method is very important for me. Was MeCab used? If so, which version of MeCab, and which dictionary (and dictionary version) was used by the word segmenter?

Thanks

Availability of the Wikipedia dump of 20-04-2018 (dd-mm-yyyy)

Hi,

For my project, I would like to use the pre-trained embeddings that are available on your GitHub [0]. However, I would then also need access to the original dump for the remainder of my project. The history of dumps for Wikimedia does not go back more than a few months, so I am curious whether you still have this dump available and whether you could share it.

Thanks in advance, and thanks for publicly sharing the pre-trained embeddings :)!

[0] https://wikipedia2vec.github.io/wikipedia2vec/pretrained/

Pretrained embeddings for more languages?

Hi 👋
This library really does look fantastic! I really appreciate the list of pre-trained embeddings that can be downloaded.

Before I go through the tutorial for creating embeddings for the language I'm looking for: would you consider adding more pre-trained languages? I'm specifically looking for Swedish, which does not seem to be part of your list.

New dump for 2019/2020

Hi,

I noticed that the latest pre-trained vectors are from 2018; would it be possible to release vectors built from a more recent dump?

wiki id for wiki entities

Is there a method to get the Wikipedia ID for any given entity?

I am looking for id("Scarlett Johansson").
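For what it's worth, the Entity objects returned by get_entity carry an internal dictionary index (the index values quoted in other issues above), which is not the same thing as the Wikipedia page ID; recovering the page ID requires an external mapping such as the dump's page table or the MediaWiki API. A short sketch of reading the internal index:

from wikipedia2vec import Wikipedia2Vec

wiki2vec = Wikipedia2Vec.load("MODEL_FILE")
entity = wiki2vec.get_entity("Scarlett Johansson")
if entity is not None:
    # entity.index is Wikipedia2Vec's own dictionary index, not the page ID.
    print(entity.title, entity.index)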

Command not found after installation

I installed the project via pip install --user wikipedia2vec and it works well on the Python side (for instance, import wikipedia2vec succeeds in a Python shell).
However, when I type "wikipedia2vec train enwiki.xml.bz2 model_file" in my bash shell, it keeps reporting "command not found". How can I run the "train", "build_dictionary", and "train_embedding" commands after installing via pip? Thanks.

progressively updating model with new text?

Hi! Would it be possible to progressively update a model (learned via skip-gram or CBOW) by passing new text to wikipedia2vec? If so, how? This would be useful for some online learning use cases we are working on.

Entity Extraction using Wikipedia2Vec

Is there any way to extract all the Wikipedia entities from a text using Wikipedia2Vec, or is there another way to do this? Please have a look at the example given below.

Example:
Text : "Scarlett Johansson is an American actress."
Entities : [ 'Scarlett Johansson' , 'American' ]

NOTE : I want to do it in Python

Thanks

category flag fails to filter all category pages

Hey team,

I use the latest enwiki dump and train a model with the following command:

wikipedia2vec train enwiki-20190701-pages-articles.xml.bz2 enwiki-20190701-300d \
--dim-size=300 \
--no-lowercase \
--min-word-count=30 \
--min-entity-count=10

To my understanding, the category flag is False by default, so it should filter out all Wikipedia category pages.

However, by examining the titles in the Wikipedia2Vec model, I found the following titles:

:Category:American actors
:Category:American architects
:Category:American film actors
...

so categories are not fully filtered out.

We could change the code here by adding more filters and checking whether the title starts with :Category:.

In a similar fashion, you might also want to filter out non-entity titles that start with:

:wikt: 
:Category: 
:Image: 
:category: 
:commons: 
:Template: 
:File:
...

There are quite a few suspicious non-entity titles that start with ':'. A possible filter is sketched below.
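A minimal post-hoc sketch of the kind of prefix filter suggested above, applied to titles read from a trained model (the prefix list mirrors the one in this issue and is not exhaustive):

# Namespace-style prefixes to treat as non-entities (extend as needed).
NON_ENTITY_PREFIXES = (":Category:", ":category:", ":wikt:", ":Image:",
                       ":commons:", ":Template:", ":File:")

def is_proper_entity_title(title):
    return not title.startswith(NON_ENTITY_PREFIXES)

titles = [":Category:American actors", "Scarlett Johansson"]
entity_titles = [t for t in titles if is_proper_entity_title(t)]  # ["Scarlett Johansson"]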

dump_db problem

Hi all,

I have tried to use this project to train new embeddings for entities and words. However, after running wikipedia2vec build_dump_db DUMP_FILE OUT_FILE, some problems occurred.

  1. The OUT_FILE is about 93.1GB, not 15GB as described.
  2. If I run wikipedia2vec build_phrase_dictionary DUMP_DB_FILE OUT_FILE on Windows Server 2012, a lot of Windows warnings saying "python has stopped working" pop up.

My Python version is 3.6 and I haven't installed any BLAS library.

I am wondering whether there is a problem with my setup.

Thank you very much!

corpus

If I use a .txt file as the input, how can I train the model?
