
wikipedia2vec's Introduction

Wikipedia2Vec


Wikipedia2Vec is a tool used for obtaining embeddings (or vector representations) of words and entities (i.e., concepts that have corresponding pages in Wikipedia) from Wikipedia. It is developed and maintained by Studio Ousia.

This tool enables you to learn embeddings of words and entities simultaneously, and places similar words and entities close to one another in a continuous vector space. Embeddings can be easily trained by a single command with a publicly available Wikipedia dump as input.

This tool implements the conventional skip-gram model to learn the embeddings of words, and its extension proposed in Yamada et al. (2016) to learn the embeddings of entities.

An empirical comparison between Wikipedia2Vec and existing embedding tools (i.e., FastText, Gensim, RDF2Vec, and Wiki2vec) is available here.

Documentation is available online at http://wikipedia2vec.github.io/.

Basic Usage

Wikipedia2Vec can be installed via PyPI:

% pip install wikipedia2vec

With this tool, embeddings can be learned by running a train command with a Wikipedia dump as input. For example, the following commands download the latest English Wikipedia dump and learn embeddings from this dump:

% wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
% wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 MODEL_FILE

Then, the learned embeddings are written to MODEL_FILE. Note that this command can take many optional parameters. Please refer to our documentation for further details.
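The trained model can then be loaded and queried from Python. A minimal sketch (the calls below follow the documented Python API; see the documentation for the full list of methods and options):

from wikipedia2vec import Wikipedia2Vec

# Load the model written to MODEL_FILE by the train command above.
wiki2vec = Wikipedia2Vec.load("MODEL_FILE")

# Word vectors (words are lowercased by default) and entity vectors.
word_vector = wiki2vec.get_word_vector("tokyo")
entity_vector = wiki2vec.get_entity_vector("Scarlett Johansson")

# The ten words and entities most similar to a given word.
for item, score in wiki2vec.most_similar(wiki2vec.get_word("tokyo"), 10):
    print(item, score)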

Pretrained Embeddings

Pretrained embeddings for 12 languages (i.e., English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Polish, Portuguese, Russian, and Spanish) can be downloaded from this page.

Use Cases

Wikipedia2Vec has been applied to a variety of natural language processing tasks.

References

If you use Wikipedia2Vec in a scientific publication, please cite the following paper:

Ikuya Yamada, Akari Asai, Jin Sakuma, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Yuji Matsumoto, Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia.

@inproceedings{yamada2020wikipedia2vec,
  title = "{W}ikipedia2{V}ec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from {W}ikipedia",
  author={Yamada, Ikuya and Asai, Akari and Sakuma, Jin and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu and Matsumoto, Yuji},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  year = {2020},
  publisher = {Association for Computational Linguistics},
  pages = {23--30}
}

The embedding model was originally proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation.

@inproceedings{yamada2016joint,
  title={Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation},
  author={Yamada, Ikuya and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu},
  booktitle={Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning},
  year={2016},
  publisher={Association for Computational Linguistics},
  pages={250--259}
}

The text classification model implemented in this example was proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, Neural Attentive Bag-of-Entities Model for Text Classification.

@inproceedings{yamada2019neural,
  title={Neural Attentive Bag-of-Entities Model for Text Classification},
  author={Yamada, Ikuya and Shindo, Hiroyuki},
  booktitle={Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)},
  year={2019},
  publisher={Association for Computational Linguistics},
  pages = {563--573}
}

License

Apache License 2.0

wikipedia2vec's People

Contributors

akariasai, ikuyamada, jyori112, l00mi, naoyat

wikipedia2vec's Issues

training the same model multiple times?

Thanks for releasing this. I have a question that I am hoping you can answer. If I were to train this model using one Wikipedia dump, and then after training, consecutively train the same model on a new Wikipedia dump, do you have any idea whether good performance would be maintained? In other words, could this model be continually trained on new Wikipedia dumps indefinitely without losing embedding quality?

https://en.wikipedia.org/wiki/Online_machine_learning

Entity detector model on custom dataset

I want to use the entity detector model on a dataset of tweets. The functions in data.py appear to load only one of two preloaded datasets. Is it currently possible to use custom datasets?

Code for Named Entity Disambiguation

Hi, is the code for the paper "Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation" (which also learns the Wikipedia embeddings using the anchor context model) available? Or do you plan to release it at some point? I am particularly interested in the NED model experiments.

Thanks

how to load model with gensim?

Hi,

I have downloaded the pretrained embedding model and loaded the file using Gensim's load_word2vec_format(). However, it failed with the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

So how can I correctly load the pretrained model using gensim? Can you help me?

Thanks
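For reference, the pretrained models are distributed in two formats: the *.pkl files are binary models meant for Wikipedia2Vec.load(), and only the *.txt files follow the word2vec text format that gensim can read. A minimal sketch, assuming a downloaded and decompressed text-format file:

from gensim.models import KeyedVectors

# Text-format pretrained file, e.g. enwiki_20180420_100d.txt (the .pkl files
# are a different, binary format and must be loaded with Wikipedia2Vec.load()).
kv = KeyedVectors.load_word2vec_format("enwiki_20180420_100d.txt", binary=False)

word_vector = kv["tokyo"]
# Entity rows in the text files carry an ENTITY/ prefix; check a few lines of
# the file to confirm the exact convention (e.g. underscores vs. spaces).
entity_vector = kv["ENTITY/Scarlett_Johansson"]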

ValueError: numpy.ufunc has the wrong size, try recompiling. Expected 192, got 216

Hi,

I'm trying to run wikipedia2vec on Python 3.7, on Windows 10, but I get problems with "numpy.ufunc has the wrong size".

Googling this points to errors with multiple versions of numpy installed, but I have verified that this is not the case. I'm executing this inside a separate virtualenv, and pip freeze shows only the latest version of numpy installed. I've also tried first installing numpy, and then wikipedia2vec.

The reason I think this might have something to do with wikipedia2vec is the strange path at the end of the stack trace: ".venv/lib/python3.6/site-packages/Cython/Includes/numpy/__init__.pxd". This is NOT the version of Python I have installed, nor a path that exists on my filesystem at all (I don't even have a .venv directory!). The only reference to this directory is inside a wikipedia2vec.cpp file, which is somehow created while wikipedia2vec is being built.

I did not get these errors when I tried Python 3.7 on macOS.

(wikipedia-embeddings) C:\Projects\wikipedia-embeddings>wikipedia2vec build-dump-db svwiki-latest-pages-articles.xml.bz2 svwiki-dump-db.bin
Traceback (most recent call last):
  File "c:\program files\python37\Lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\program files\python37\Lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Admin\Envs\wikipedia-embeddings\Scripts\wikipedia2vec.exe\__main__.py", line 5, in <module>
  File "c:\users\admin\envs\wikipedia-embeddings\lib\site-packages\wikipedia2vec\__init__.py", line 4, in <module>
    from .dictionary import Dictionary
  File ".venv/lib/python3.6/site-packages/Cython/Includes/numpy/__init__.pxd", line 872, in init wikipedia2vec.dictionary
ValueError: numpy.ufunc has the wrong size, try recompiling. Expected 192, got 216

Any ideas of what I should try next?

(wikipedia-embeddings) C:\Projects\wikipedia-embeddings>pip freeze
Click==7.0
jieba==0.39
joblib==0.13.0
lmdb==0.94
marisa-trie==0.7.5
mwparserfromhell==0.5.2
numpy==1.15.4
scipy==1.2.0
six==1.12.0
tqdm==4.28.1
wikipedia2vec==1.0.1

ImportError: No module named 'dictionary'

from wikipedia2vec import Wikipedia2Vec
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\root\Python35\lib\site-packages\wikipedia2vec\__init__.py", line 7, in <module>
    from dictionary import Dictionary
ImportError: No module named 'dictionary'

How to get most similar items to added/subtracted vectors?

I want to do something like this:

finalvec = wiki2vec.get_entity_vector("Scarlett Johansson") - wiki2vec.get_word_vector("american") + wiki2vec.get_word_vector("japanese")

Or if that doesn't quite work, then:

finalvec = wiki2vec.get_entity_vector("Scarlett Johansson") + wiki2vec.get_entity_vector("George Clooney")

And then:

mostsim = wiki2vec.most_similar(finalvec, 50)

I want to be able to add two vectors together, or do mathematical operations on them in general, and then get the entities nearest to the resulting vector. Is that possible?

When I try it, I get this error:

Traceback (most recent call last):
  File "wikipedia-finder2-withoutwhile.py", line 97, in <module>
    mostsim = wiki2vec.most_similar(finalvec, 50000)
TypeError: Argument 'item' has incorrect type (expected wikipedia2vec.dictionary.Item, got numpy.ndarray)
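The TypeError happens because most_similar expects a dictionary Item (a Word or Entity object) rather than a raw numpy array. One way around it, sketched under the assumption that the installed version provides most_similar_by_vector (check the API reference; otherwise the cosine similarities against the embedding matrix have to be computed manually):

import numpy as np
from wikipedia2vec import Wikipedia2Vec

wiki2vec = Wikipedia2Vec.load("MODEL_FILE")

finalvec = (wiki2vec.get_entity_vector("Scarlett Johansson")
            - wiki2vec.get_word_vector("american")
            + wiki2vec.get_word_vector("japanese"))

# most_similar_by_vector takes a raw vector rather than a dictionary Item.
for item, score in wiki2vec.most_similar_by_vector(np.asarray(finalvec), 50):
    print(item, score)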

Delete the token '\n' (and '\r' if any) in the txt format.

I think the embedding of the token '\n' is almost useless, and it causes I/O errors when we load the embeddings on an operating system that uses '\n' as the line-break symbol.

This problem exists when I use torchtext and gensim.

If I am doing something wrong, how can I solve this elegantly? For now, I work around it by deleting the token '\n' and its embedding.
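Until the exporter skips such tokens, one workaround is to rewrite the text file and drop the malformed rows. A minimal sketch, assuming a word2vec-style text file whose first line is a "<count> <dimension>" header (adjust if your file has no header); any row with an empty token or the wrong number of fields, which is what a literal '\n' or '\r' token produces, is dropped:

def clean_embedding_file(src_path, dst_path):
    with open(src_path, encoding="utf-8") as src:
        header = src.readline().split()
        dim = int(header[1])
        kept = []
        for line in src:
            fields = line.rstrip("\n").split(" ")
            # Rows produced by the '\n' token show up as an empty token or a
            # line with the wrong number of vector components; skip them.
            if len(fields) != dim + 1 or not fields[0].strip():
                continue
            kept.append(line)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(f"{len(kept)} {dim}\n")
        dst.writelines(kept)

clean_embedding_file("enwiki_300d.txt", "enwiki_300d.cleaned.txt")

The cleaned file can then be loaded with torchtext or gensim as usual.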

How to add another tokenizer?

Hi, thanks for this great repo!

I was wondering if I can add another tokenizer to Wikipedia2Vec. Currently, it supports specific tokenizers such as MeCab for Japanese and jieba for Chinese, but I'd like to test other tokenizers for other languages.

  1. I tried to tweak the source code with pip install -e ., but it gave a complaint like from dictionary import Dictionary followed by ImportError: No module named 'dictionary'.
  2. So instead I installed it with pip install wikipedia2vec and added a new_tokenizer.pyx file in the utils/tokenizer folder. I also added it to the __init__ file, but the system failed to recognize it.

I'm not familiar with Cython. Could you help me with this?

Out-of-vocabulary words

Hello,

Is it possible to train and provide a model that is trained on subwords, like fastText?
We need this to deal with the out-of-vocabulary (OOV) words issue.

ValueError: mmap length is greater than file size

attempting basic usage:

/usr/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/usr/lib/python3.5/contextlib.py:59: UserWarning: mmap_mode "c" is not compatible with compressed file /mnt/big_data/enwiki_20180420_100d.pkl.bz2. "c" flag will be ignored.
  return next(self.gen)
Traceback (most recent call last):
  File "words.py", line 3, in <module>
    wiki2vec = Wikipedia2Vec.load('/mnt/big_data/enwiki_20180420_100d.pkl.bz2')
  File "wikipedia2vec/wikipedia2vec.pyx", line 157, in wikipedia2vec.wikipedia2vec.Wikipedia2Vec.load
  File "/home/.local/lib/python3.5/site-packages/joblib/numpy_pickle.py", line 596, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/home/.local/lib/python3.5/site-packages/joblib/numpy_pickle.py", line 524, in _unpickle
    obj = unpickler.load()
  File "/usr/lib/python3.5/pickle.py", line 1039, in load
    dispatch[key[0]](self)
  File "/home/.local/lib/python3.5/site-packages/joblib/numpy_pickle.py", line 352, in load_build
    self.stack.append(array_wrapper.read(self))
  File "/home/.local/lib/python3.5/site-packages/joblib/numpy_pickle.py", line 193, in read
    array = self.read_mmap(unpickler)
  File "/home/.local/lib/python3.5/site-packages/joblib/numpy_pickle.py", line 171, in read_mmap
    offset=offset)
  File "/home/.local/lib/python3.5/site-packages/joblib/backports.py", line 23, in make_memmap
    shape=shape, order=order)
  File "/home/.local/lib/python3.5/site-packages/numpy/core/memmap.py", line 264, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
ValueError: mmap length is greater than file size
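This is not a confirmed diagnosis, but an mmap error of this kind often indicates either a truncated download or an attempt to memory-map a still-compressed file (note the warning above about mmap_mode and the .pkl.bz2 file). One thing worth trying, sketched as an assumption rather than an official fix, is decompressing the model before loading it:

import bz2
import shutil

from wikipedia2vec import Wikipedia2Vec

# Decompress the pretrained model first so that it can be memory-mapped.
with bz2.open("enwiki_20180420_100d.pkl.bz2", "rb") as src, \
        open("enwiki_20180420_100d.pkl", "wb") as dst:
    shutil.copyfileobj(src, dst)

wiki2vec = Wikipedia2Vec.load("enwiki_20180420_100d.pkl")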

Embedding for <unknown> token.

Does this model provide an embedding for an unknown token (i.e., a token that represents the infrequent words not included in the vocabulary)? Thanks in advance.

wikipedia id other than title

Hi team, great work! I have a quick question and hope you could help me out.

I'm trying to match Wikipedia2Vec entity embeddings to each annotation in the TAC and CoNLL datasets. To my understanding, the only way to look up an entity in Wikipedia2Vec is through the title of its Wikipedia page. However, the datasets are annotated with other IDs, such as:

  1. wikipedia url: https://en.wikipedia.org/wiki/Scarlett_Johansson
  2. wikipedia page id: 20913246

or some other ids.

How do you convert between titles and the IDs that these datasets use in the paper?
For example, for case 1, do you use a utility function that normalizes the URL to the title in order to look up the Wikipedia2Vec entity embedding?
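For the URL case, a simple normalization (strip everything up to /wiki/, URL-decode, and replace underscores with spaces) usually recovers the title that Wikipedia2Vec expects; mapping a numeric page ID to a title needs an external source such as the dump's page table or the MediaWiki API. A minimal sketch (the helper name is made up for illustration):

from urllib.parse import unquote

def url_to_title(url):
    # "https://en.wikipedia.org/wiki/Scarlett_Johansson" -> "Scarlett Johansson"
    title = url.rsplit("/wiki/", 1)[-1]
    return unquote(title).replace("_", " ")

# wiki2vec is a model loaded beforehand via Wikipedia2Vec.load(...)
entity = wiki2vec.get_entity(url_to_title("https://en.wikipedia.org/wiki/Scarlett_Johansson"))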

suspicious leading and trailing space in title

Some Wikipedia2Vec entity titles have suspicious extra leading or trailing spaces. This affects some common entities such as May and India:

wikipedia2vec.get_entity("India") # with index 2131751
wikipedia2vec.get_entity(" India") # with index 2011310
wikipedia2vec.get_entity("May") # with index 1938987
wikipedia2vec.get_entity(" May") # with index 2011219

Loading pretrained model failed

Hi, when I tried to load the pretrained model using the following command:

wiki2vec = Wikipedia2Vec.load("enwiki_20180420_win10_500d.txt.bz2")

I got the following error:

users/anaconda3/lib/python3.6/contextlib.py:81: UserWarning: mmap_mode "c" is not compatible with compressed file /users/grad/ting/tr.ting/OIE/KBE/enwiki_20180420_win10_500d.txt.bz2. "c" flag will be ignored.
  return next(self.gen)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "wikipedia2vec/wikipedia2vec.pyx", line 172, in wikipedia2vec.wikipedia2vec.Wikipedia2Vec.load
  File "/user/anaconda3/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 598, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/users/grad/ting/tr.ting/anaconda3/lib/python3.6/site-packages/joblib/numpy_pickle.py", line 526, in _unpickle
    obj = unpickler.load()
  File "/users/grad/ting/tr.ting/anaconda3/lib/python3.6/pickle.py", line 1050, in load
    dispatch[key[0]](self)
KeyError: 52

Do you know the reason?

Thanks.

IndexError when training

Hi, we are attempting to run wikipedia2vec on the 20190101 English Wikipedia dump with the following options:

wikipedia2vec train --min-link-prob=0.0 --min-prior-prob=0.0 --min-entity-count=0 --dim-size=300 --iteration=10 --negative=15 dumps/enwiki-20190101-pages-articles.xml.bz2 trained/enwiki_20190101_300d.pkl

We get through 16.1 million processed pages. Then there is no log output for several hours, until the process dies with the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/bin/wikipedia2vec", line 11, in <module>
    sys.exit(cli())
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 52, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 68, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 97, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 34, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 130, in train
    invoke(build_dump_db, out_file=dump_db_file)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 126, in invoke
    ctx.invoke(cmd, **cmd_kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 34, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/wikipedia2vec/cli.py", line 168, in build_dump_db
    DumpDB.build(dump_reader, out_file, **kwargs)
  File "wikipedia2vec/dump_db.pyx", line 156, in wikipedia2vec.dump_db.DumpDB.build
  File "wikipedia2vec/dump_db.pyx", line 182, in wikipedia2vec.dump_db.DumpDB.build
  File "wikipedia2vec/dump_db.pyx", line 186, in wikipedia2vec.dump_db.DumpDB.build
  File "/home/ubuntu/anaconda3/lib/python3.7/multiprocessing/pool.py", line 354, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/home/ubuntu/anaconda3/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
IndexError: string index out of range

This is running on Ubuntu 18.04.

ImportError: No module named 'click_graph'

from wikipedia2vec import Wikipedia2Vec
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\root\Python35\lib\site-packages\wikipedia2vec\__init__.py", line 8, in <module>
    from click_graph import ClickGraph
ImportError: No module named 'click_graph'

KeyError when trying get_entity_vector on some Wikipedia titles

I came across a number of Wikipedia titles, obtained from the last part of Wikipedia URLs, that I couldn't get embeddings for using get_entity_vector. For example, I try model.get_entity_vector("New business development") or model.get_entity_vector("Personal effectiveness") and I get a KeyError. But the corresponding pages exist: https://en.wikipedia.org/wiki/New_business_development, https://en.wikipedia.org/wiki/Personal_effectiveness.

Do you know why this could be happening? Thank you.

suspicious / in title containing '

Hey team,

This could be a silly question, but I found some Wikipedia2Vec titles with a weird extra \ in front of the ' character.

For example, the entity South Bird's Head languages has the title South Bird\'s Head languages.

In a similar fashion:

Ch\'olan languages
Central Alaskan Yup\'ik language
Pa\'o language
...

I'm using Python 3.6.8; I think it could be an encoding problem.

Embeddings exist for entities that do not have pages in Wikipedia

While using Wikipedia2Vec (which is an amazing tool, by the way; thanks so much for making it available!) I occasionally find that model.get_entity_vector gives embeddings for titles that do not have corresponding pages in Wikipedia. For example, I can retrieve embedding vectors for "Cultural awareness" or "Business communications", but the https://en.wikipedia.org/wiki/Cultural_awareness page does not exist. I found a Wikidata item for "Cultural awareness", but nothing for "Business communications". Does this mean that the Wikipedia dump I used for training the model contained these pages, but they have since been removed from Wikipedia? Thank you very much!

Parsing disambiguation page

The dump DB has wrongly classified some pages with respect to whether they are disambiguation pages.

For example:
http://en.wikipedia.org/wiki/Ashta is an entity in AIDA dataset

db.is_disambiguation("Ashta") => False

is_disambiguation says it's not a disambiguation page,

list(map(lambda p : p.text, db.get_paragraphs("Ashta"))) =>
<class 'list'>: ['Ashta,Madhya Pradesh may refer to:', 'Ashta, Bangladesh', 'Ashta, Madhya Pradesh, a municipality in Sehore district in the state of Madhya Pradesh, India', 'Ashta, Maharashtra, a city in Sangli district in the state of Maharashtra, India']

but the content is actually a disambiguation page.

I think it's due to how the page is parsed, somewhere here

Installation in the readme?

It might be helpful for some users to have the pip installation instructions in the README.

Thanks for working on this! It's a really cool piece of software

What Japanese text pre-processing method is used?

Hi,

Thanks for your great repository and providing pre-trained models.

I want to use the Japanese pre-trained embeddings, and I would like to know what method was used to pre-process the Wikipedia text (both the language-agnostic preprocessing and the Japanese-specific preprocessing).

Knowing the word segmentation method is very important for me. Was MeCab used? If so, which version of MeCab, and which dictionary (and dictionary version) was used by the word segmenter?

Thanks

Availability of the Wikipedia dump of 20-04-2018 (dd-mm-yyyy)

Hi,

For my project, I would like to use the pre-trained embeddings that are available on your GitHub [0]. However, I would then also need access to the original dump for the remainder of my project. The history of dumps for Wikimedia does not go back more than a few months, so I am curious whether you still have this dump available and whether you could share it.

Thanks in advance, and thanks for publicly sharing the pre-trained embeddings :)!

[0] https://wikipedia2vec.github.io/wikipedia2vec/pretrained/

Pretrained embeddings for more languages?

Hi 👋
This library really does look fantastic! I really appreciate the list of pre-trained embeddings that can be downloaded.

Before I go through the tutorial for creating embeddings for the language I'm looking for: would you consider adding more pre-trained languages? I'm specifically looking for Swedish, which does not seem to be part of your list.

New dump for 2019/2020

Hi,

I noticed that the latest pre-trained vectors are from 2018; would it be possible to release vectors built from a more recent dump?

wiki id for wiki entities

Is there a method to get the Wikipedia ID for any given entity?

I am looking for id("Scarlett Johansson").
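For what it's worth, the Entity objects returned by get_entity carry an internal dictionary index (the index values quoted in other issues above), which is not the same thing as the Wikipedia page ID; recovering the page ID requires an external mapping such as the dump's page table or the MediaWiki API. A short sketch of reading the internal index:

from wikipedia2vec import Wikipedia2Vec

wiki2vec = Wikipedia2Vec.load("MODEL_FILE")
entity = wiki2vec.get_entity("Scarlett Johansson")
if entity is not None:
    # entity.index is Wikipedia2Vec's own dictionary index, not the page ID.
    print(entity.title, entity.index)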

Command not found after installation

I installed the project via pip install --user wikipedia2vec and it works well on the Python side (for instance, import wikipedia2vec succeeds in a Python shell).
However, when I type "wikipedia2vec train enwiki.xml.bz2 model_file" in my bash shell, it keeps reporting "command not found". How can I run the "train", "build_dictionary", and "train_embedding" commands after installing via pip? Thanks.

progressively updating model with new text?

Hi! Would it be possible to progressively update a model (learned via skip-gram or CBOW) by passing new text to wikipedia2vec? If so, how? This would be useful for some online learning use cases we are working on.

Entity Extraction using Wikipedia2Vec

Is there any way to extract all the Wikipedia entities from a text using Wikipedia2Vec, or is there another way to do this? Please have a look at the example given below.

Example:
Text : "Scarlett Johansson is an American actress."
Entities : [ 'Scarlett Johansson' , 'American' ]

NOTE : I want to do it in Python

Thanks

category flag fails to filter all category pages

Hey team,

I use the latest enwiki dump and train a model with the following command:

wikipedia2vec train enwiki-20190701-pages-articles.xml.bz2 enwiki-20190701-300d \
--dim-size=300 \
--no-lowercase \
--min-word-count=30 \
--min-entity-count=10

To my understanding, the category flag is False by default, so it should filter out all Wikipedia category pages.

However, by examining the titles in the Wikipedia2Vec model, I found the following titles:

:Category:American actors
:Category:American architects
:Category:American film actors
...

so categories are not fully filtered out.

We could change the code here by adding more filters and checking whether the title starts with :Category:.

In a similar fashion, you might also want to filter out non-entity titles that start with:

:wikt: 
:Category: 
:Image: 
:category: 
:commons: 
:Template: 
:File:
...

There are quite a few suspicious non-entity titles that start with ':'. A possible filter is sketched below.
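A minimal post-hoc sketch of the kind of prefix filter suggested above, applied to titles read from a trained model (the prefix list mirrors the one in this issue and is not exhaustive):

# Namespace-style prefixes to treat as non-entities (extend as needed).
NON_ENTITY_PREFIXES = (":Category:", ":category:", ":wikt:", ":Image:",
                       ":commons:", ":Template:", ":File:")

def is_proper_entity_title(title):
    return not title.startswith(NON_ENTITY_PREFIXES)

titles = [":Category:American actors", "Scarlett Johansson"]
entity_titles = [t for t in titles if is_proper_entity_title(t)]  # ["Scarlett Johansson"]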

dump_db problem

Hi all,

I have tried to use this project to train new embeddings for entities and words. However, after running wikipedia2vec build_dump_db DUMP_FILE OUT_FILE, some problems occurred.

  1. The OUT_FILE is about 93.1GB, not 15GB as described.
  2. If I run wikipedia2vec build_phrase_dictionary DUMP_DB_FILE OUT_FILE on Windows Server 2012, a lot of Windows warnings saying "python has stopped working" pop up.

My Python version is 3.6 and I haven't installed any BLAS library.

I am wondering whether there is a problem with my setup.

Thank you very much!

corpus

If I use a .txt file as the input, how can I train the model?
