
japanese-words-to-vectors's Introduction

Japanese Word2Vec

こんにちは! (Hello!)

About

A word2vec (word-to-vectors) approach for the Japanese language using Gensim (skip-gram and CBOW models). The model is trained on the Japanese version of Wikipedia, available as jawiki-latest-pages-articles.xml.bz2.

Definition: Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a high-dimensional space (typically of several hundred dimensions), with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

Further reading on word embeddings (GloVe, a closely related approach): http://nlp.stanford.edu/projects/glove/

Usage

Generating the vectors from a Wikipedia dump takes about 2~3 hours on a Core i5 with the default parameters.

git clone https://github.com/philipperemy/japanese-word-to-vectors.git
cd japanese-word-to-vectors
pip3 install -r requirements.txt # optionally create a virtual env first
wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2 # 2.4GB; can take a while depending on your internet speed

# uses TinySegmenter3 for tokenization (easy to install but less accurate)
python3 generate_vectors.py

# recommended: uses the MeCab tokenizer. Installation instructions: http://www.robfahey.co.uk/blog/japanese-text-analysis-in-python/
# the "Tokenize the text" section below also covers the installation.
python3 generate_vectors.py --mecab

If generate_vectors.py does not detect the file jawiki-latest-pages-articles.xml.bz2, it downloads it automatically before starting the (lengthy) vector generation.

Convert Wiki dump to text

The first step is to extract the text and the sentences from the dump. This is done by the following function:

INPUT_FILENAME = 'jawiki-latest-pages-articles.xml.bz2' # This is the only input filename
JA_WIKI_TEXT_FILENAME = 'jawiki-latest-text.txt' # first output file of the function
JA_WIKI_SENTENCES_FILENAME = 'jawiki-latest-text-sentences.txt' # second output file of the function
process_wiki_to_text(INPUT_FILENAME, JA_WIKI_TEXT_FILENAME, JA_WIKI_SENTENCES_FILENAME)
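
For reference, here is a minimal sketch of what process_wiki_to_text could look like, assuming Gensim's WikiCorpus under the hood; the repository's actual implementation may differ:

# Minimal sketch, assuming Gensim's WikiCorpus; not the repository's exact code.
from gensim.corpora import WikiCorpus

def process_wiki_to_text(input_filename, output_text_filename,
                         output_sentences_filename):
    # dictionary={} skips the slow (and here unneeded) dictionary-building step
    wiki = WikiCorpus(input_filename, dictionary={})
    with open(output_text_filename, 'w', encoding='utf-8') as out_text, \
            open(output_sentences_filename, 'w', encoding='utf-8') as out_sentences:
        for tokens in wiki.get_texts():  # one article = one list of text chunks
            out_text.write(' '.join(tokens) + '\n')  # one line per article
            for chunk in tokens:                     # one line per chunk
                out_sentences.write(chunk + '\n')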

The output consists of two files:

  • JA_WIKI_TEXT_FILENAME, whose content looks like trebuchet msフォントアンパサンドとはを意味する記号である, where each line corresponds to one article.

  • JA_WIKI_SENTENCES_FILENAME, where each line corresponds to a sentence or chunk of words in the text. This file is not used by the word2vec algorithm, but it can be useful for training a sentence-to-vector model (e.g. skip-thoughts, available at https://github.com/ryankiros/skip-thoughts/).

Tokenize the text

Tokenizing means splitting the running text into words separated by spaces (Japanese does not use spaces as word delimiters). Two approaches are available here:

TinySegmenter3 (easy to install but less accurate)

For this, we use a library called TinySegmenter3, which can tokenize a Japanese corpus with more than 95% accuracy (source: http://lilyx.net/tinysegmenter-in-python/).

The output is JA_WIKI_TEXT_TOKENS_FILENAME. It looks like this: trebuchet ms フォント アンパサンド と は を 意味 する
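
As a quick illustration, tokenizing a string could look like this (assuming the tinysegmenter PyPI package; TinySegmenter3 exposes a similar API):

# Hedged sketch: tokenize one string with TinySegmenter
import tinysegmenter

segmenter = tinysegmenter.TinySegmenter()
tokens = segmenter.tokenize('フォントアンパサンドとはを意味する記号である')
print(' '.join(tokens))  # roughly: フォント アンパサンド と は を 意味 する ...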

MeCab (more involved to install but very accurate)

I strongly advise you to read this tutorial first: How to install MeCab (http://www.robfahey.co.uk/blog/japanese-text-analysis-in-python/).

The installation depends on your OS:

MacOS
brew install mecab
brew install mecab-ipadic
brew install git curl xz
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
cd mecab-ipadic-neologd
./bin/install-mecab-ipadic-neologd -n
pip3 install mecab-python3
Ubuntu
sudo apt-get install mecab mecab-ipadic libmecab-dev
sudo apt-get install mecab-ipadic-utf8
sudo apt-get install git curl
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
cd mecab-ipadic-neologd
sudo ./bin/install-mecab-ipadic-neologd -n
pip3 install mecab-python3
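
Once installed, tokenization with mecab-python3 is straightforward. A minimal sketch using MeCab's -Owakati (space-separated) output format:

# Hedged sketch: tokenize one string with mecab-python3
import MeCab

tagger = MeCab.Tagger('-Owakati')  # wakati-gaki: space-separated output
print(tagger.parse('フォントアンパサンドとはを意味する記号である').strip())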

Infer the vectors

Finally, the Gensim library is used to run the word2vec algorithm with the following parameters (a training sketch follows the list):

  • size of 50 (dimensionality of the feature vectors)
  • window of 5 (maximum distance between the current and predicted word within a sentence)
  • min count of 5 (ignore all words with total frequency lower than this)
  • iter of 5 (number of iterations or epochs over the corpus)
  • number of workers equal to number of cores
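
A minimal training sketch under these settings, assuming the older Gensim parameter names (size, iter); Gensim 4+ renamed them to vector_size and epochs:

import multiprocessing
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# JA_WIKI_TEXT_TOKENS_FILENAME is the tokenized file produced in the
# previous step (its exact value is defined in the repository's code)
sentences = LineSentence(JA_WIKI_TEXT_TOKENS_FILENAME)
model = Word2Vec(sentences, size=50, window=5, min_count=5, iter=5,
                 workers=multiprocessing.cpu_count())
model.save('ja-gensim.50d.data.model')
model.wv.save_word2vec_format('ja-gensim.50d.data.txt', binary=False)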

While training, the console output looks like:

2016-09-04 02:54:38,354 : INFO : PROGRESS: at 99.74% examples, 482630 words/s, in_qsize 5, out_qsize 4
2016-09-04 02:54:39,346 : INFO : PROGRESS: at 99.82% examples, 482644 words/s, in_qsize 7, out_qsize 0
2016-09-04 02:54:40,356 : INFO : PROGRESS: at 99.90% examples, 482643 words/s, in_qsize 7, out_qsize 1
2016-09-04 02:54:41,390 : INFO : PROGRESS: at 99.98% examples, 482630 words/s, in_qsize 8, out_qsize 0

Once it's finished, 4 new files are generated:

  • ja-gensim.50d.data.model. This file contains the model in binary format. Use model = Word2Vec.load(fname) to get your word2vec model back (see the sketch after this list).
  • ja-gensim.50d.data.txt. This file contains the model's vectors in text format and can be used in any other script, without the Gensim library!
  • ja-gensim.50d.data.model.syn1neg.npy and ja-gensim.50d.data.model.wv.syn0.npy. These files are generated automatically and contain numpy arrays (weights and other parameters). They must stay in the same directory as the model.
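
As an example, a minimal sketch that loads the saved model back and queries it (the query word 日本 is purely illustrative):

from gensim.models import Word2Vec

model = Word2Vec.load('ja-gensim.50d.data.model')
print(model.wv.most_similar('日本', topn=5))  # 5 nearest neighbours of 日本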

Finally, let's inspect ja-gensim.50d.data.txt:

の 0.128774 3.631298 -3.058414 -0.434418 -0.300449 -1.211774 0.608027 -5.561740 -1.186208 -0.035129 1.709353 1.252130 -3.849393 0.390795 4.260262 0.209959 2.316592 -2.880473 -0.427741 -1.335913 4.500565 0.556813 0.585122 -0.739895 1.034633 3.786435 -1.032835 -5.697092 1.436553 -1.689847 -4.953261 -3.883135 1.730590 -3.211419 -2.154781 -1.915586 -0.283341 0.332927 -2.281737 0.440092 1.535507 0.925073 -4.101060 0.634421 -4.230011 -0.313288 -3.955676 0.009256 2.931253 -0.500217
に -2.019490 4.359702 -1.845176 -2.663986 1.774256 0.147722 1.484422 -2.984465 2.262582 -0.861214 0.804603 1.007627 -4.322638 -0.173283 2.905254 0.803300 2.850667 -3.859382 -0.214240 -1.914028 5.640825 -0.139551 0.243700 -3.234274 1.844652 6.613075 -2.586612 -7.520448 4.413483 -3.270162 -2.952101 -2.278936 7.161888 -6.830038 -2.042799 -0.559094 -2.270651 2.744259 -2.250800 0.269468 -0.153715 3.831476 -2.068467 1.833452 -4.605278 3.756418 -4.275790 1.822912 1.606565 -2.918230
は 0.296134 4.136690 -3.184480 -0.817397 0.555022 -1.181827 0.933714 -4.486689 -0.429983 0.427427 0.089208 1.415648 -2.763912 1.310283 5.143843 1.778646 2.280496 -4.852800 -1.581973 -1.364721 3.240205 1.227000 0.931791 -2.009395 1.856946 3.401864 -1.741597 -6.626904 -0.016503 -3.313225 -2.302027 -3.208004 4.541845 -4.704424 -2.073442 -1.192726 0.880771 -1.584695 0.450757 1.645549 1.212130 1.006536 -3.576060 0.142494 -4.799853 0.906162 -3.141263 1.762820 2.482034 -1.188599

Here we can see the vectors for の, に and は. Further down the file, we find longer words such as 文献. The size of the vocabulary is the number of lines in this file (one line equals one word and its vector representation).

wc -l ja-gensim.50d.data.txt yields 1200627 lines, i.e. a vocabulary of 1200627 words.
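
Because the .txt file is plain text, the vectors can also be read back without Gensim. A minimal sketch, which skips the "vocab_size dimensions" header line that save_word2vec_format may prepend:

import numpy as np

vectors = {}
with open('ja-gensim.50d.data.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        if len(parts) == 2:  # skip a possible "vocab_size dim" header line
            continue
        vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)

print(vectors['の'][:5])  # first five dimensions of the vector for の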

References

  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.


japanese-words-to-vectors's Issues

Writing bytes

Using Python 3.5, errors such as AttributeError: 'str' object has no attribute 'decode' are thrown.
Changing the file read and write modes from r to rb and from w to wb, respectively, fixes the issue.
Thanks for the great work!
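
To illustrate the reported fix, a hypothetical snippet: opening the file in binary mode and decoding explicitly restores the bytes/str behaviour the code expects.

# 'rb' instead of 'r': readline() then returns bytes, which has .decode()
with open('jawiki-latest-text.txt', 'rb') as f:
    line = f.readline().decode('utf-8')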

Input file to generate_vectors

Hi @philipperemy,
Thanks for your great work.
I have a question about the input to generate_vectors.
I assume the input to this function should be tokenized, but I see that in the script you use JA_WIKI_SENTENCES_FILENAME, which is not tokenized.
Could you clear up my confusion?
Thank you so much.
