mecab-ipadic-neologd's Introduction

mecab-ipadic-NEologd : Neologism dictionary for MeCab

For Japanese

README.ja.md is written in Japanese.

Documentation

You can find more detailed documentation and examples in the project wiki.

Overview

mecab-ipadic-NEologd is a customized system dictionary for MeCab.

This dictionary includes many neologisms (new words) extracted from a variety of language resources on the Web.

When you analyze Web documents, it is better to use this system dictionary together with the default one (ipadic).

Pros and Cons

Pros

  • It records about 3.22 million surface/furigana pairs (including duplicate entries; furigana is kana indicating the pronunciation of kanji) for words, such as named entities, that cannot be tokenized correctly with MeCab's default system dictionary.
  • The update process for this dictionary runs automatically on a development server.
    • I plan to update this dictionary at least twice a week
      • Every Monday and Thursday
  • Because each update draws on language resources on the Web, new named entities can be recorded.
    • The resources currently in use are as follows.
      • Dump data of hatena keyword
      • Japanese postal code number data download (ken_all.lzh)
      • The list of station names for the whole of Japan
      • The entry data of person names (last name / first name)
      • The entry data of emojis from Unicode 10.0 and Emoji 5.0
      • The entry data of Kaomoji strings
      • The entry data of adverbs
      • The entry data of adjectives
      • The entry data of adjective verbs
      • The entry data of interjections
      • The entry data of orthographic variants of general nouns
      • The entry data of orthographic variants of nouns used as verb forms
  • A large number of documents crawled from the Web
    • I plan to record words, such as named entities, extracted from other new language resources.

Cons

  • Classification of named entities is insufficient
    • Ex. Some person names and product names are classified into the same named entity category.
  • Some words that are not named entities are recorded as named entities.
  • Because not every named entity can be checked manually, the surface form of a named entity may be matched with the wrong furigana.
  • Only UTF-8 is supported
    • You should install the UTF-8 version of MeCab

Getting started

Memory requirements

  • Required: 2GB of RAM
  • Recommended: 6GB of RAM
    • Current maximum binary size is 1.2GB

Dependencies

We build mecab-ipadic-NEologd from the source code of mecab-ipadic during the installation phase.

You should install the following libraries using apt, yum, homebrew, or from source code.

  • C++ Compiler

    • Operation is confirmed on GCC-4.4.7 and Apple LLVM version 6.0
  • iconv (libiconv)

    • Used to convert the character encoding of IPADIC
  • mecab

    • We use bin/mecab and bin/mecab-config
  • mecab-ipadic

    • One of the dictionaries for MeCab
      • Used for testing at installation time
      • If you install it from source code, you should select UTF-8 as the character encoding

    ./configure --with-charset=utf8; make; sudo make install

  • xz

    • Used to decompress the xz package containing the seed of mecab-ipadic-NEologd

Please install any other missing libraries as needed.

Examples

  • On CentOS

    $ sudo rpm -ivh http://packages.groonga.org/centos/groonga-release-1.1.0-1.noarch.rpm

    $ sudo yum install mecab mecab-devel mecab-ipadic git make curl xz patch

  • On Fedora

    $ sudo yum install mecab mecab-devel mecab-ipadic git make curl xz

  • On Ubuntu

    $ sudo aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file

  • On MacOSX

    $ brew install mecab mecab-ipadic git curl xz

Preparing for installation

The seed data of the dictionary is distributed via the GitHub repository.

The first time, you should run the following command to 'git clone' the repository.

$ git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git

OR

$ git clone --depth 1 git@github.com:neologd/mecab-ipadic-neologd.git

If you need the full commit log of mecab-ipadic-neologd.git, you should clone the repository without '--depth 1'.

How to install/update mecab-ipadic-NEologd

Step.1

Move to the directory of the repository that you cloned in the preparation step above.

$ cd mecab-ipadic-neologd

Step.2

You can install mecab-ipadic-NEologd, or update (overwrite) it to the latest version, with the following command.

$ ./bin/install-mecab-ipadic-neologd -n

You can check the directory path where it will be installed.

$ echo `mecab-config --dicdir`"/mecab-ipadic-neologd"

If you have more than one mecab-config installed, you should set the path of the mecab-config that you want to use.

The following command shows the available command line options.

$ ./bin/install-mecab-ipadic-neologd -h

How to use mecab-ipadic-NEologd

When you want to use mecab-ipadic-NEologd, you should pass the path of the custom system dictionary (*/lib/mecab/dic/mecab-ipadic-neologd/) to the -d option of MeCab.

Example (on CentOS)

$ mecab -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd/
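As a rough sketch of how the -d path could be assembled in a script, the following Python helper joins the output of `mecab-config --dicdir` with the `mecab-ipadic-neologd` directory name, mirroring the `echo` command shown in Step.2. The function names `neologd_dicdir` and `mecab_command` are hypothetical, and the sketch assumes that `mecab-config` is on the PATH whenever no directory is passed explicitly.

```python
import posixpath
import subprocess
from typing import List, Optional


def neologd_dicdir(dicdir: Optional[str] = None) -> str:
    """Return <dicdir>/mecab-ipadic-neologd.

    When dicdir is omitted, ask `mecab-config --dicdir` (assumed to be on PATH).
    """
    if dicdir is None:
        result = subprocess.run(
            ["mecab-config", "--dicdir"],
            stdout=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
        dicdir = result.stdout.strip()
    return posixpath.join(dicdir, "mecab-ipadic-neologd")


def mecab_command(dicdir: Optional[str] = None) -> List[str]:
    """Build the argument list for running MeCab with the NEologd dictionary."""
    return ["mecab", "-d", neologd_dicdir(dicdir)]
```

Calling `mecab_command("/usr/local/lib/mecab/dic")` builds the same invocation as the CentOS example above, apart from the trailing slash; the list can then be passed to `subprocess.run` together with the text to analyze.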

Example of output of MeCab

When you use mecab-ipadic-NEologd

$ echo "8月3日に放送された「中居正広の金曜日のスマイルたちへ」(TBS系)で、1日たった5分でぽっこりおなかを解消するというダイエット方法を紹介。キンタロー。のダイエットにも密着。" | mecab -d /usr/local/lib/mecab/dic/mecab
8月3日	名詞,固有名詞,一般,*,*,*,8月3日,ハチガツミッカ,ハチガツミッカ
に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
放送	名詞,サ変接続,*,*,*,*,放送,ホウソウ,ホーソー
さ	動詞,自立,*,*,サ変・スル,未然レル接続,する,サ,サ
れ	動詞,接尾,*,*,一段,連用形,れる,レ,レ
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
「	記号,括弧開,*,*,*,*,「,「,「
中居正広の金曜日のスマイルたちへ	名詞,固有名詞,一般,*,*,*,中居正広の金曜日のスマイルたちへ,ナカイマサヒロノキンヨウビノスマイルタチヘ,ナカイマサヒロノキンヨービノスマイルタチヘ
」(	記号,一般,*,*,*,*,*
TBS	名詞,固有名詞,一般,*,*,*,TBS,ティービーエス,ティービーエス
系	名詞,接尾,一般,*,*,*,系,ケイ,ケイ
)	記号,一般,*,*,*,*,*
で	助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ
、	記号,読点,*,*,*,*,、,、,、
1日	名詞,固有名詞,一般,*,*,*,1日,ツイタチ,ツイタチ
たった	副詞,助詞類接続,*,*,*,*,たった,タッタ,タッタ
5分	名詞,固有名詞,一般,*,*,*,5分,ゴフン,ゴフン
で	助詞,格助詞,一般,*,*,*,で,デ,デ
ぽっこり	副詞,一般,*,*,*,*,ぽっこり,ポッコリ,ポッコリ
おなか	名詞,一般,*,*,*,*,おなか,オナカ,オナカ
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
解消	名詞,サ変接続,*,*,*,*,解消,カイショウ,カイショー
する	動詞,自立,*,*,サ変・スル,基本形,する,スル,スル
という	助詞,格助詞,連語,*,*,*,という,トイウ,トユウ
ダイエット方法	名詞,固有名詞,一般,*,*,*,ダイエット方法,ダイエットホウホウ,ダイエットホウホー
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
紹介	名詞,サ変接続,*,*,*,*,紹介,ショウカイ,ショーカイ
。	記号,句点,*,*,*,*,。,。,。
キンタロー。	名詞,固有名詞,一般,*,*,*,キンタロー。,キンタロー,キンタロー
に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
も	助詞,係助詞,*,*,*,*,も,モ,モ
密着	名詞,サ変接続,*,*,*,*,密着,ミッチャク,ミッチャク
。	記号,句点,*,*,*,*,。,。,。
EOS

What the above result shows

  • MeCab can correctly parse the words that are recorded in mecab-ipadic-NEologd
    • "中居正広の金曜日のスマイルたちへ(To Friday's smile by Masahiro Nakai)" is a new word
      • This word can be parsed correctly because the dictionary is updated from language resources on the Web
  • Almost all entries of mecab-ipadic-NEologd have a value in the furigana field
  • Frequent time expressions and quantitative expressions are recorded in advance
  • Expressions that appear frequently in news and on SNS are treated as one word
  • Names of famous people are actively recorded as named entities
  • The well-known "。 is the end of sentence" problem is solved
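To work with output like the listing above in a program, the lines can be split into a surface form and a feature list. The following is a minimal Python sketch that assumes the default output format seen in the example (surface, a tab, comma-separated features, and `EOS` closing each sentence). The function names are hypothetical, and the sketch does not handle features that themselves contain commas.

```python
from typing import List, Tuple

Token = Tuple[str, List[str]]


def parse_mecab_output(text: str) -> List[Token]:
    """Split MeCab's default output into (surface, features) pairs.

    Each token line is <surface><TAB><comma-separated features>;
    a line consisting of EOS ends a sentence and is skipped.
    """
    tokens = []
    for line in text.splitlines():
        if not line or line == "EOS":
            continue
        surface, _, features = line.partition("\t")
        tokens.append((surface, features.split(",")))
    return tokens


def readings(tokens: List[Token]) -> List[str]:
    """Collect the furigana field (8th feature) from entries that provide one."""
    return [feats[7] for _, feats in tokens if len(feats) > 7]
```

Entries with fewer than eight features, such as the symbol lines that end in `*`, are simply skipped by `readings`.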

When you use the default system dictionary

$ echo "8月3日に放送された「中居正広の金曜日のスマイルたちへ」(TBS系)で、1日たった5分でぽっこりおなかを解消するというダイエット方法を紹介。キンタロー。にも密着。" | mecab
8	名詞,数,*,*,*,*,*
月	名詞,一般,*,*,*,*,月,ツキ,ツキ
3	名詞,数,*,*,*,*,*
日	名詞,接尾,助数詞,*,*,*,日,ニチ,ニチ
に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
放送	名詞,サ変接続,*,*,*,*,放送,ホウソウ,ホーソー
さ	動詞,自立,*,*,サ変・スル,未然レル接続,する,サ,サ
れ	動詞,接尾,*,*,一段,連用形,れる,レ,レ
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
「	記号,括弧開,*,*,*,*,「,「,「
中居	名詞,固有名詞,人名,姓,*,*,中居,ナカイ,ナカイ
正広	名詞,固有名詞,人名,名,*,*,正広,マサヒロ,マサヒロ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
金曜日	名詞,副詞可能,*,*,*,*,金曜日,キンヨウビ,キンヨービ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
スマイル	名詞,一般,*,*,*,*,スマイル,スマイル,スマイル
たち	名詞,接尾,一般,*,*,*,たち,タチ,タチ
へ	助詞,格助詞,一般,*,*,*,へ,ヘ,エ
」(	名詞,サ変接続,*,*,*,*,*
TBS	名詞,一般,*,*,*,*,*
系	名詞,接尾,一般,*,*,*,系,ケイ,ケイ
)	名詞,サ変接続,*,*,*,*,*
で	助詞,格助詞,一般,*,*,*,で,デ,デ
、	記号,読点,*,*,*,*,、,、,、
1	名詞,数,*,*,*,*,*
日	名詞,接尾,助数詞,*,*,*,日,ニチ,ニチ
たっ	動詞,自立,*,*,五段・タ行,連用タ接続,たつ,タッ,タッ
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
5	名詞,数,*,*,*,*,*
分	名詞,接尾,助数詞,*,*,*,分,フン,フン
で	助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ
ぽ	形容詞,接尾,*,*,形容詞・アウオ段,ガル接続,ぽい,ポ,ポ
っ	動詞,非自立,*,*,五段・カ行促音便,連用タ接続,く,ッ,ッ
こり	動詞,自立,*,*,一段,連用形,こりる,コリ,コリ
おなか	名詞,一般,*,*,*,*,おなか,オナカ,オナカ
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
解消	名詞,サ変接続,*,*,*,*,解消,カイショウ,カイショー
する	動詞,自立,*,*,サ変・スル,基本形,する,スル,スル
という	助詞,格助詞,連語,*,*,*,という,トイウ,トユウ
ダイエット	名詞,サ変接続,*,*,*,*,ダイエット,ダイエット,ダイエット
方法	名詞,一般,*,*,*,*,方法,ホウホウ,ホーホー
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
紹介	名詞,サ変接続,*,*,*,*,紹介,ショウカイ,ショーカイ
。	記号,句点,*,*,*,*,。,。,。
キンタロー	名詞,一般,*,*,*,*,*
。	記号,句点,*,*,*,*,。,。,。
に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
も	助詞,係助詞,*,*,*,*,も,モ,モ
密着	名詞,サ変接続,*,*,*,*,密着,ミッチャク,ミッチャク
。	記号,句点,*,*,*,*,。,。,。
EOS

To evaluate or to reproduce a research result

We release tags mainly for research purposes.

They are very useful for the following applications.

  • Experiments to evaluate a research result
  • Reproducing the experimental results of others
  • Creating morphological analysis results that will never change with dictionary updates

If you are a beginner in NLP, I recommend that you use the latest version of the master branch.

Bibtex

Please use the following BibTeX entries when you refer to mecab-ipadic-NEologd in your papers.

@INPROCEEDINGS{sato2017mecabipadicneologdnlp2017,
    author    = {Toshinori Sato and Taiichi Hashimoto and Manabu Okumura},
    title     = {Implementation of a word segmentation dictionary called mecab-ipadic-NEologd and study on how to use it effectively for information retrieval (in Japanese)},
    booktitle = "Proceedings of the Twenty-third Annual Meeting of the Association for Natural Language Processing",
    year      = "2017",
    pages     = "NLP2017-B6-1",
    publisher = "The Association for Natural Language Processing",
}
@INPROCEEDINGS{sato2016neologdipsjnl229,
    author    = {Toshinori Sato and Taiichi Hashimoto and Manabu Okumura},
    title     = {Operation of a word segmentation dictionary generation system called NEologd (in Japanese)},
    booktitle = "Information Processing Society of Japan, Special Interest Group on Natural Language Processing (IPSJ-SIGNL)",
    year      = "2016",
    pages     = "NL-229-15",
    publisher = "Information Processing Society of Japan",
}
@misc{sato2015mecabipadicneologd,
    title  = {Neologism dictionary based on the language resources on the Web for Mecab},
    author = {Toshinori Sato},
    url    = {https://github.com/neologd/mecab-ipadic-neologd},
    year   = {2015}
}

Star please !!

Please star this GitHub repository if mecab-ipadic-NEologd is useful to your project ;)

Copyrights

Copyright (c) 2015-2019 Toshinori Sato (@overlast) All rights reserved.

This project is licensed under the 'Apache License, Version 2.0'. Please check the license text in the repository.

mecab-ipadic-neologd's People

Contributors

felixonmars, neologd, overlast, pecorarista

mecab-ipadic-neologd's Issues

mecab-ipadic-NEologd won't be updated when running the installer with the full path

I ran the installer from a cron job using its full path and the -n option, like this:

00 03 * * 2 /opt/mecab/mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -y

Then the following errors occurred.

fatal: Not a git repository (or any of the parent directories): .git
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

This occurred in the following code because the current directory was not a git repository.

if [ `git log refs/heads/master --pretty=%H | head -1` = `git ls-remote origin -h refs/heads/master |cut -f1` ]; then
    echo "$ECHO_PREFIX mecab-ipadic-NEologd is already up-to-date"

In this case, the condition is always true because both of the results are empty.
Therefore, the message "mecab-ipadic-NEologd is already up-to-date" is always displayed.
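One possible workaround, sketched here in Python rather than in the installer's own shell code, is to launch the installer from a small wrapper that sets the working directory to the repository root, and to treat an empty hash as "unknown" instead of "equal". All function names are hypothetical, and the sketch assumes the installer lives in `<repo>/bin/` as in the cron line above.

```python
import subprocess
from pathlib import Path


def repo_dir_for(installer: str) -> Path:
    """Return the repository root, assuming the installer is <repo>/bin/<script>."""
    return Path(installer).resolve().parent.parent


def is_up_to_date(local_hash: str, remote_hash: str) -> bool:
    """Compare two commit hashes; an empty hash (failed git call) never matches."""
    if not local_hash or not remote_hash:
        return False
    return local_hash == remote_hash


def run_installer(installer: str) -> int:
    """Run the installer with the repository root as the working directory."""
    result = subprocess.run(
        [installer, "-n", "-y"],
        cwd=str(repo_dir_for(installer)),
    )
    return result.returncode
```

With a wrapper like this, the cron entry would call the wrapper instead of the installer, so that the `git` commands inside the installer run inside the cloned repository.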

A small script to find wrong yomigana entries

Hello,

First of all, your mecab-ipadic-neologd is amazing.
Thank you so much!

I wrote a small script and found some wrong yomigana entries.
find-neologd-error-entries.rb.txt
neologd-error-entries.txt

$ ruby find-neologd-error-entries.rb mecab-user-dict-seed.20160111.csv
It generates "neologd-error-entries.txt".

e.g.

  • 京都市上京区西町,1293,1293,-1319,名詞,固有名詞,地域,一般,*,*,京都市上京区西町,キョウトシカミギョウクニシマチニシマチニシマチニシマチニシマチニシマチニシマチ,キョートシカミギョークニシマチニシマチニシマチニシマチニシマチニシマチニシマチ
  • 神津小学校,1288,1288,352,名詞,固有名詞,一般,*,*,*,神津小学校,カミツショウガッコウコウヅショウガッコウ,カミツショーガッコーコーズショウガッコー

I know we can't get yomigana perfectly, but neologd may have some errors in zip code data splitting.
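The attached Ruby script is not reproduced here. As an independent illustration of one way such entries could be flagged, the following Python sketch marks a seed line as suspicious when its reading contains the same run of four or more characters twice. It assumes that the reading is the 12th CSV field, as in the entries shown above; the names and the threshold of four are arbitrary choices, and the heuristic only produces candidates for manual review.

```python
import csv
import re
from typing import Optional

# A run of four or more characters that occurs twice in the same reading.
REPEATED_RUN = re.compile(r"(.{4,}).*\1")


def reading_of(seed_line: str) -> Optional[str]:
    """Return the reading (12th CSV field) of a seed entry, or None if absent."""
    fields = next(csv.reader([seed_line]))
    return fields[11] if len(fields) > 11 else None


def looks_suspicious(seed_line: str) -> bool:
    """Flag entries whose reading repeats a long run of characters."""
    reading = reading_of(seed_line)
    return bool(reading and REPEATED_RUN.search(reading))
```

Both example entries above are flagged by this check, because ニシマチ and ショウガッコウ each occur more than once in their readings.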

Wrong entry for ササキ

佐々木貞清,1289,1289,2337,名詞,固有名詞,人名,一般,*,*,佐々木貞清,ササキ,ササキ

Nothing happened after "Download original mecab-ipadic file"

I tried to install ipadic-neologd on a Mac and followed the steps, but after the "Download original mecab-ipadic file" step nothing happened and the program seems to have stopped. Can you help? Thanks

./bin/install-mecab-ipadic-neologd -n
[install-mecab-ipadic-NEologd] : Start..
[install-mecab-ipadic-NEologd] : Check the existance of libraries
[install-mecab-ipadic-NEologd] : find => ok
[install-mecab-ipadic-NEologd] : sort => ok
[install-mecab-ipadic-NEologd] : head => ok
[install-mecab-ipadic-NEologd] : cut => ok
[install-mecab-ipadic-NEologd] : egrep => ok
[install-mecab-ipadic-NEologd] : mecab => ok
[install-mecab-ipadic-NEologd] : mecab-config => ok
[install-mecab-ipadic-NEologd] : make => ok
[install-mecab-ipadic-NEologd] : curl => ok
[install-mecab-ipadic-NEologd] : sed => ok
[install-mecab-ipadic-NEologd] : cat => ok
[install-mecab-ipadic-NEologd] : diff => ok
[install-mecab-ipadic-NEologd] : tar => ok
[install-mecab-ipadic-NEologd] : unxz => ok
[install-mecab-ipadic-NEologd] : xargs => ok
[install-mecab-ipadic-NEologd] : grep => ok
[install-mecab-ipadic-NEologd] : iconv => ok
[install-mecab-ipadic-NEologd] : patch => ok
[install-mecab-ipadic-NEologd] : which => ok
[install-mecab-ipadic-NEologd] : file => ok
[install-mecab-ipadic-NEologd] : openssl => ok
[install-mecab-ipadic-NEologd] : awk => ok

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd is already up-to-date

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd will be install to /usr/local/lib/mecab/dic/mecab-ipadic-neologd

[install-mecab-ipadic-NEologd] : Make mecab-ipadic-NEologd
[make-mecab-ipadic-NEologd] : Start..
[make-mecab-ipadic-NEologd] : Check local seed directory
[make-mecab-ipadic-NEologd] : Check local seed file
[make-mecab-ipadic-NEologd] : Check local build directory
[make-mecab-ipadic-NEologd] : Download original mecab-ipadic file
MacBook-Pro:mecab-ipadic-neologd User$

Installation fails

After cloning the repository, running the installation with the following command fails with an error because the hash value of mecab-ipadic-2.7.0-20070801.tar.gz does not match.

$ ./bin/install-mecab-ipadic-neologd -n

・Log at the time of the error

[install-mecab-ipadic-NEologd] : Start..
[install-mecab-ipadic-NEologd] : Check the existance of libraries
[install-mecab-ipadic-NEologd] :     find => ok
[install-mecab-ipadic-NEologd] :     sort => ok
[install-mecab-ipadic-NEologd] :     head => ok
[install-mecab-ipadic-NEologd] :     cut => ok
[install-mecab-ipadic-NEologd] :     egrep => ok
[install-mecab-ipadic-NEologd] :     mecab => ok
[install-mecab-ipadic-NEologd] :     mecab-config => ok
[install-mecab-ipadic-NEologd] :     make => ok
[install-mecab-ipadic-NEologd] :     curl => ok
[install-mecab-ipadic-NEologd] :     sed => ok
[install-mecab-ipadic-NEologd] :     cat => ok
[install-mecab-ipadic-NEologd] :     diff => ok
[install-mecab-ipadic-NEologd] :     tar => ok
[install-mecab-ipadic-NEologd] :     unxz => ok
[install-mecab-ipadic-NEologd] :     xargs => ok
[install-mecab-ipadic-NEologd] :     grep => ok
[install-mecab-ipadic-NEologd] :     iconv => ok
[install-mecab-ipadic-NEologd] :     patch => ok
[install-mecab-ipadic-NEologd] :     which => ok
[install-mecab-ipadic-NEologd] :     file => ok
[install-mecab-ipadic-NEologd] :     openssl => ok
[install-mecab-ipadic-NEologd] :     awk => ok

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd is already up-to-date

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd will be install to /usr/lib/mecab/dic/mecab-ipadic-neologd

[install-mecab-ipadic-NEologd] : Make mecab-ipadic-NEologd
[make-mecab-ipadic-NEologd] : Start..
[make-mecab-ipadic-NEologd] : Check local seed directory
[make-mecab-ipadic-NEologd] : Check local seed file
[make-mecab-ipadic-NEologd] : Check local build directory
[make-mecab-ipadic-NEologd] : create /mecab-ipadic-neologd/libexec/../build
[make-mecab-ipadic-NEologd] : Download original mecab-ipadic file
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3435    0  3435    0     0   7797      0 --:--:-- --:--:-- --:--:--  7896
[make-mecab-ipadic-NEologd] : Fail to download /mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801.tar.gz
[make-mecab-ipadic-NEologd] : You should remove /mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801.tar.gz before retrying to install mecab-ipadic-NEologd
[make-mecab-ipadic-NEologd] :        rm -rf /mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801
[make-mecab-ipadic-NEologd] :        rm /mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801.tar.gz

Accessing the following URL, where the tar file in question is hosted, shows a Google Drive error, which may be what is causing this.

https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7MWVlSDBCSXZMTXM
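The log above shows that only 3435 bytes were received, which would be consistent with an error page having been saved in place of the archive, although the log alone does not prove it. As a minimal sketch of how a downloaded file could be checked before unpacking, the following Python helper tests for the two-byte gzip magic number. The function names are hypothetical, and this check is not the hash comparison that the installer performs.

```python
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(head: bytes) -> bool:
    """Return True when the given leading bytes start with the gzip magic number."""
    return head[:2] == GZIP_MAGIC


def file_looks_like_gzip(path: str) -> bool:
    """Read the first two bytes of a file and test them for the gzip magic number."""
    with open(path, "rb") as handle:
        return looks_like_gzip(handle.read(2))
```

If `file_looks_like_gzip` returns False for the downloaded mecab-ipadic-2.7.0-20070801.tar.gz, the file is not a gzip archive and should be removed before retrying, as the installer message suggests.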

`Android標準ブラウザ` related entries

Motivation

Fix incorrect entries

$ grep 'Android.*ブラウザ' mecab-user-dict-seed.20200709.csv
Android標準ブラウザ,1288,1288,4545,名詞,固有名詞,一般,*,*,*,Android標準ブラウザ,ブラウザ,ブラウザ
Android標準ブラウザー,1288,1288,5229,名詞,固有名詞,一般,*,*,*,Android標準ブラウザ,ブラウザー,ブラウザー
android標準ブラウザ,1288,1288,4545,名詞,固有名詞,一般,*,*,*,Android標準ブラウザ,ブラウザ,ブラウザ
ブラウザ,1288,1288,6395,名詞,固有名詞,一般,*,*,*,Android標準ブラウザ,ブラウザ,ブラウザ

$ mecab -d /usr/lib/mecab/dic/mecab-ipadic-neologd
ブラウザ
ブラウザ        名詞,固有名詞,一般,*,*,*,Android標準ブラウザ,ブラウザ,ブラウザ

ブラウザ is a generic word, but neologd treats it as Android標準ブラウザ :(

installer reports error

installer reports fatal: ambiguous argument '...refs/heads/master^': unknown revision or path not in the working tree.

system env:

~% $SHELL '--version'
zsh 5.0.2 (x86_64-pc-linux-gnu)

~% git --version
git version 2.5.0

detail logs:

....

fatal: ambiguous argument '...refs/heads/master^': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
[install-mecab-ipadic-neologd] : Get the newest updated information using git
./bin/install-mecab-ipadic-neologd: line 199: [: =: unary operator expected
HEAD is now at f987dff Fix typo

[install-mecab-ipadic-neologd] : mecab-ipadic-neologd will be install to /usr/lib64/mecab/dic/mecab-ipadic-neologd

....

Is there a mistake in normalize_neologd.py?

This concerns normalize_neologd.py, which is listed on the Regexp.ja page of the Wiki. In the following line:

s = unicode_normalize('0−9A-Za-z。-゚', s)

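the character between 0 and 9 is MINUS SIGN rather than HYPHEN-MINUS.

I would expect normalizing "10日" written with full-width digits to give "10日" with half-width digits, but with the current source code some full-width digits remain in the result (confirmed with Python 2.7.9).

Could you merge the following change?

arosh/mecab-ipadic-neologd-wiki@0e5534d

(I did not know how to send a Pull Request against the Wiki, so I am asking via an Issue.)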
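The effect of the two characters can be illustrated with a small Python 3 sketch. It is not the wiki's `unicode_normalize` function: `nfkc_matches` is a hypothetical helper that applies NFKC only to the substrings matched by a character class, and the two classes differ only in whether HYPHEN-MINUS (U+002D) or MINUS SIGN (U+2212) sits between the full-width digits.

```python
import re
import unicodedata

# Full-width digits written as a range with HYPHEN-MINUS (U+002D).
RANGE_CLASS = "[\uff10-\uff19]+"
# The same text with MINUS SIGN (U+2212): three literal characters, not a range.
LITERAL_CLASS = "[\uff10\u2212\uff19]+"


def nfkc_matches(pattern: str, text: str) -> str:
    """Apply NFKC normalization only to the substrings matched by pattern."""
    return re.sub(
        pattern,
        lambda match: unicodedata.normalize("NFKC", match.group(0)),
        text,
    )
```

With `RANGE_CLASS`, both full-width digits in the input are converted to half-width. With `LITERAL_CLASS`, only full-width 0 and full-width 9 match, so a full-width 1 is left unchanged.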

Download failed in China

Hi. Thank you for sharing a great dictionary!
Currently, we are using your dict for Japanese text-to-speech system in our project.

Users in China have reported that the download fails because the Google Drive service is blocked there.
espnet/espnet#606
Is there any plan to provide another download source for the installation?

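Numeric expressions are registered as proper nouns

$100,1288,1288,7806,名詞,固有名詞,一般,*,*,*,$100,ヒャクドル,ヒャクドル
昭和10年,1288,1288,6518,名詞,固有名詞,一般,*,*,*,昭和10年,ショウワジュウネン,ショーワジュウネン
10 years,1288,1288,4569,名詞,固有名詞,一般,*,*,*,10 years,テンイヤーズ,テンイヤーズ

Dictionary entries for numeric expressions like these have the part of speech 固有名詞 (proper noun), but are they really proper nouns?
Could they be changed to another part of speech, such as 一般 (general)?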
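To get a sense of how many entries are affected, candidates could be listed with a short script. The following Python sketch flags seed lines whose surface contains a digit and whose part of speech is 名詞,固有名詞. It assumes the field order seen in the entries above (surface first, part of speech in the 5th and 6th fields), and `is_numeric_proper_noun` is a hypothetical name. Genuine proper nouns that contain digits would also be flagged, so the result is only a list for review.

```python
import csv


def is_numeric_proper_noun(seed_line: str) -> bool:
    """Flag entries tagged as proper nouns whose surface contains a digit."""
    fields = next(csv.reader([seed_line]))
    if len(fields) < 6:
        return False
    surface, pos, pos_detail = fields[0], fields[4], fields[5]
    has_digit = any(ch.isdigit() for ch in surface)
    return has_digit and pos == "名詞" and pos_detail == "固有名詞"
```

Applied line by line to a seed CSV file, this returns True for entries such as the three listed above.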

How to produce mecab-user-dict-seed.YYYYMMDD.csv.xz?

Hi, I love and appreciate this helpful dictionary!

A quick question: how do you produce seed file mecab-user-dict-seed.YYYYMMDD.csv.xz?
I suppose you use some scripts for it; if so, are the scripts also uploaded to the git repo?

I'm looking for the way to build a bit customized-version of the dic.

Thanks in advance!

「世界の秘密」

Hello,
This dictionary is very good!
I use it every day.
Thank you so much.

By the way, the phrase "世界の秘密" is analyzed as a single token by this dictionary.
The phrase is the name of a quiz TV program.
However, that TV program ended after only five months.
I think that the phrase should be analyzed as "世界" + "の" + "秘密".

"株式会社" should be split.

I think these characters should be split:
(株), (株), 株式会社

neologd has 5 "あおい電子工業" variants.

  • あおい電子工業 株式会社,1292,1292,-14635,名詞,固有名詞,組織,*,*,*,あおい電子工業株式会社,アオイデンシコウギョウカブシキガイシャ,アオイデンシコーギョーカブシキガイシャ
  • あおい電子工業 (株),1292,1292,-10826,名詞,固有名詞,組織,*,*,*,あおい電子工業株式会社,アオイデンシコウギョウカブシキガイシャ,アオイデンシコーギョーカブシキガイシャ
  • あおい電子工業(株),1292,1292,6301,名詞,固有名詞,組織,*,*,*,あおい電子工業株式会社,アオイデンシコウギョウカブシキガイシャ,アオイデンシコーギョーカブシキガイシャ
  • あおい電子工業株式会社,1292,1292,-9787,名詞,固有名詞,組織,*,*,*,あおい電子工業株式会社,アオイデンシコウギョウカブシキガイシャ,アオイデンシコーギョーカブシキガイシャ
  • あおい電子工業(株),1292,1292,-6382,名詞,固有名詞,組織,*,*,*,あおい電子工業株式会社,アオイデンシコウギョウカブシキガイシャ,アオイデンシコーギョーカブシキガイシャ

I don't think we need 5 variants for "あおい電子工業",
and, more importantly, neologd doesn't have basic "あおい電子工業 アオイデンシコウギョウ".

I think these entries are enough and we can reduce the dictionary size.
あおい電子工業 アオイデンシコウギョウ
株式会社 カブシキガイシャ
(株) カブシキガイシャ
(株) カブシキガイシャ

Regards.
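To see which surfaces differ only in their company designator, the variants could be grouped by a normalized core name. The following Python sketch is one way to do that, not something neologd itself does: it applies NFKC normalization, which maps full-width parentheses, the ideographic space, and the single character U+3231 to their ASCII-based forms, and then removes 株式会社 and (株). The name `company_core_name` is hypothetical.

```python
import re
import unicodedata

# Company designators as they appear after NFKC normalization.
DESIGNATORS = re.compile(r"株式会社|\(株\)")


def company_core_name(surface: str) -> str:
    """Strip company designators and surrounding spaces from a surface form."""
    normalized = unicodedata.normalize("NFKC", surface)
    return DESIGNATORS.sub("", normalized).strip()
```

Surfaces that share the same `company_core_name` are candidates for being replaced by one core entry plus separate entries for the designators, as proposed above.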

Cannot Install mecab-python3 (unable to execute swig: no such file in directory)

Motivation

Hello,
I've successfully installed MeCab, mecab-ipadic, and the neological dictionary. However, I cannot install mecab-python3 to get MeCab talking with Python. Each time I've tried, I receive the following error:

unable to execute 'swig': No such file or directory
error: command 'swig' failed with exit status 1

From what I've gathered looking into the issue on Google, it seems to be an issue that resulted from the most recent update. Was wondering if there was a temporary fix until the issue gets resolved?

I ran across this online:

https://qiita.com/siraasagi/items/e07e0b271cb7cd679a70

but as I'm using a Mac, I cannot run apt in the command line. Brew also does not recognize the formulae when I substitute it with apt-get. Any help would be much appreciated!

Thanks for your time!

Goal

Goal is to use MeCab with Python to tokenize some Japanese text for NLP purposes.

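Error "line 525: 6288 Killed ${MECAB_LIBEXEC_DIR}/mecab-dict-index -f UTF8 -t UTF8" during build

Error description

After running git clone on the latest repository, an error is displayed and the installation fails.
It looks as if the wrong directory is being referenced; I would appreciate any advice.

Situation

・I am using a Dockerfile.
・The build runs after git clone inside the Dockerfile.

Code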

# Dockerfile

FROM python:3.6
WORKDIR /code
ENV PYTHONUNBUFFERED 1
COPY requirements.txt /code/
RUN apt-get update -y&&\
    apt-get upgrade -y&&\
    apt-get install mecab libmecab-dev mecab-ipadic mecab-ipadic-utf8 sudo -y&&\
    apt-get install git make curl xz-utils file&&\
    git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git&&\
    /code/mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -y &&\
    mkdir /code/media && \
    mkdir /code/static &&\
    python -m pip install --upgrade pip &&\
    pip install -r requirements.txt
COPY . /code/
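A "Killed" message like the one in the title means the process received SIGKILL, and one possible cause (not confirmed by this report) is that the build ran out of memory; the README lists 2GB of RAM as required. As a minimal sketch of a check that could be run before building, the following Python code reads the `MemAvailable` value from the text of `/proc/meminfo` on Linux. The function names are hypothetical, and inside a container `/proc/meminfo` may report the host's memory rather than the container's limit.

```python
from typing import Optional

REQUIRED_KB = 2 * 1024 * 1024  # 2GB, the "Required" figure in the README


def mem_available_kb(meminfo_text: str) -> Optional[int]:
    """Extract MemAvailable (in kB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    return None


def enough_memory(meminfo_text: str, required_kb: int = REQUIRED_KB) -> bool:
    """Return True when MemAvailable is present and at least required_kb."""
    available = mem_available_kb(meminfo_text)
    return available is not None and available >= required_kb
```

On a Linux host this could be called as `enough_memory(open("/proc/meminfo").read())` before starting the installer.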

Full error output

The output above this point is unrelated and has been omitted.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
emitting double-array: 100% |###########################################| 
/usr/share/mecab/dic/ipadic/model.def is not found. skipped.
reading /usr/share/mecab/dic/ipadic/Adnominal.csv ... 135
reading /usr/share/mecab/dic/ipadic/Postp-col.csv ... 91
reading /usr/share/mecab/dic/ipadic/Filler.csv ... 19
reading /usr/share/mecab/dic/ipadic/Others.csv ... 2
reading /usr/share/mecab/dic/ipadic/Noun.nai.csv ... 42
reading /usr/share/mecab/dic/ipadic/Noun.others.csv ... 151
reading /usr/share/mecab/dic/ipadic/Noun.verbal.csv ... 12146
reading /usr/share/mecab/dic/ipadic/Noun.proper.csv ... 27328
reading /usr/share/mecab/dic/ipadic/Conjunction.csv ... 171
reading /usr/share/mecab/dic/ipadic/Adj.csv ... 27210
reading /usr/share/mecab/dic/ipadic/Postp.csv ... 146
reading /usr/share/mecab/dic/ipadic/Noun.number.csv ... 42
reading /usr/share/mecab/dic/ipadic/Noun.name.csv ... 34202
reading /usr/share/mecab/dic/ipadic/Noun.place.csv ... 72999
reading /usr/share/mecab/dic/ipadic/Noun.csv ... 60477
reading /usr/share/mecab/dic/ipadic/Interjection.csv ... 252
reading /usr/share/mecab/dic/ipadic/Auxil.csv ... 199
reading /usr/share/mecab/dic/ipadic/Noun.demonst.csv ... 120
reading /usr/share/mecab/dic/ipadic/Adverb.csv ... 3032
reading /usr/share/mecab/dic/ipadic/Noun.adverbal.csv ... 795
reading /usr/share/mecab/dic/ipadic/Verb.csv ... 130750
reading /usr/share/mecab/dic/ipadic/Prefix.csv ... 221
reading /usr/share/mecab/dic/ipadic/Suffix.csv ... 1393
reading /usr/share/mecab/dic/ipadic/Symbol.csv ... 208
reading /usr/share/mecab/dic/ipadic/Noun.adjv.csv ... 3328
reading /usr/share/mecab/dic/ipadic/Noun.org.csv ... 16668
emitting double-array: 100% |###########################################| 
reading /usr/share/mecab/dic/ipadic/matrix.def ... 1316x1316
emitting matrix      : 100% |###########################################| 

done!
Setting up mecab-ipadic-utf8 (2.7.0-20070801+main-2.1) ...
Compiling IPA dictionary for Mecab.  This takes long time...
reading /usr/share/mecab/dic/ipadic/unk.def ... 40
emitting double-array: 100% |###########################################| 
/usr/share/mecab/dic/ipadic/model.def is not found. skipped.
reading /usr/share/mecab/dic/ipadic/Adnominal.csv ... 135
reading /usr/share/mecab/dic/ipadic/Postp-col.csv ... 91
reading /usr/share/mecab/dic/ipadic/Filler.csv ... 19
reading /usr/share/mecab/dic/ipadic/Others.csv ... 2
reading /usr/share/mecab/dic/ipadic/Noun.nai.csv ... 42
reading /usr/share/mecab/dic/ipadic/Noun.others.csv ... 151
reading /usr/share/mecab/dic/ipadic/Noun.verbal.csv ... 12146
reading /usr/share/mecab/dic/ipadic/Noun.proper.csv ... 27328
reading /usr/share/mecab/dic/ipadic/Conjunction.csv ... 171
reading /usr/share/mecab/dic/ipadic/Adj.csv ... 27210
reading /usr/share/mecab/dic/ipadic/Postp.csv ... 146
reading /usr/share/mecab/dic/ipadic/Noun.number.csv ... 42
reading /usr/share/mecab/dic/ipadic/Noun.name.csv ... 34202
reading /usr/share/mecab/dic/ipadic/Noun.place.csv ... 72999
reading /usr/share/mecab/dic/ipadic/Noun.csv ... 60477
reading /usr/share/mecab/dic/ipadic/Interjection.csv ... 252
reading /usr/share/mecab/dic/ipadic/Auxil.csv ... 199
reading /usr/share/mecab/dic/ipadic/Noun.demonst.csv ... 120
reading /usr/share/mecab/dic/ipadic/Adverb.csv ... 3032
reading /usr/share/mecab/dic/ipadic/Noun.adverbal.csv ... 795
reading /usr/share/mecab/dic/ipadic/Verb.csv ... 130750
reading /usr/share/mecab/dic/ipadic/Prefix.csv ... 221
reading /usr/share/mecab/dic/ipadic/Suffix.csv ... 1393
reading /usr/share/mecab/dic/ipadic/Symbol.csv ... 208
reading /usr/share/mecab/dic/ipadic/Noun.adjv.csv ... 3328
reading /usr/share/mecab/dic/ipadic/Noun.org.csv ... 16668
emitting double-array: 100% |###########################################| 
reading /usr/share/mecab/dic/ipadic/matrix.def ... 1316x1316
emitting matrix      : 100% |###########################################| 

done!
update-alternatives: using /var/lib/mecab/dic/ipadic-utf8 to provide /var/lib/mecab/dic/debian (mecab-dictionary) in auto mode
Processing triggers for libc-bin (2.28-10) ...
Reading package lists...
Building dependency tree...
Reading state information...
curl is already the newest version (7.64.0-4+deb10u1).
file is already the newest version (1:5.35-4+deb10u1).
git is already the newest version (1:2.20.1-2+deb10u3).
make is already the newest version (4.2.1-1.2).
xz-utils is already the newest version (5.2.4-1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Cloning into 'mecab-ipadic-neologd'...
[install-mecab-ipadic-NEologd] : Start..
[install-mecab-ipadic-NEologd] : Check the existance of libraries
[install-mecab-ipadic-NEologd] :     find => ok
[install-mecab-ipadic-NEologd] :     sort => ok
[install-mecab-ipadic-NEologd] :     head => ok
[install-mecab-ipadic-NEologd] :     cut => ok
[install-mecab-ipadic-NEologd] :     egrep => ok
[install-mecab-ipadic-NEologd] :     mecab => ok
[install-mecab-ipadic-NEologd] :     mecab-config => ok
[install-mecab-ipadic-NEologd] :     make => ok
[install-mecab-ipadic-NEologd] :     curl => ok
[install-mecab-ipadic-NEologd] :     sed => ok
[install-mecab-ipadic-NEologd] :     cat => ok
[install-mecab-ipadic-NEologd] :     diff => ok
[install-mecab-ipadic-NEologd] :     tar => ok
[install-mecab-ipadic-NEologd] :     unxz => ok
[install-mecab-ipadic-NEologd] :     xargs => ok
[install-mecab-ipadic-NEologd] :     grep => ok
[install-mecab-ipadic-NEologd] :     iconv => ok
[install-mecab-ipadic-NEologd] :     patch => ok
[install-mecab-ipadic-NEologd] :     which => ok
[install-mecab-ipadic-NEologd] :     file => ok
[install-mecab-ipadic-NEologd] :     openssl => ok
[install-mecab-ipadic-NEologd] :     awk => ok

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd is already up-to-date

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd will be install to /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd

[install-mecab-ipadic-NEologd] : Make mecab-ipadic-NEologd
[make-mecab-ipadic-NEologd] : Start..
[make-mecab-ipadic-NEologd] : Check local seed directory
[make-mecab-ipadic-NEologd] : Check local seed file
[make-mecab-ipadic-NEologd] : Check local build directory
[make-mecab-ipadic-NEologd] : create /code/mecab-ipadic-neologd/libexec/../build
[make-mecab-ipadic-NEologd] : Download original mecab-ipadic file
[make-mecab-ipadic-NEologd] : Try to access to https://ja.osdn.net
[make-mecab-ipadic-NEologd] : Try to download from https://ja.osdn.net/frs/g_redir.php?m=kent&f=mecab%2Fmecab-ipadic%2F2.7.0-20070801%2Fmecab-ipadic-2.7.0-20070801.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 11.6M  100 11.6M    0     0  7350k      0  0:00:01  0:00:01 --:--:-- 7731k
Hash value of /code/mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801.tar.gz matched
[make-mecab-ipadic-NEologd] : Decompress original mecab-ipadic file
mecab-ipadic-2.7.0-20070801/
mecab-ipadic-2.7.0-20070801/README
mecab-ipadic-2.7.0-20070801/AUTHORS
mecab-ipadic-2.7.0-20070801/COPYING
mecab-ipadic-2.7.0-20070801/ChangeLog
mecab-ipadic-2.7.0-20070801/INSTALL
mecab-ipadic-2.7.0-20070801/Makefile.am
mecab-ipadic-2.7.0-20070801/Makefile.in
mecab-ipadic-2.7.0-20070801/NEWS
mecab-ipadic-2.7.0-20070801/aclocal.m4
mecab-ipadic-2.7.0-20070801/config.guess
mecab-ipadic-2.7.0-20070801/config.sub
mecab-ipadic-2.7.0-20070801/configure
mecab-ipadic-2.7.0-20070801/configure.in
mecab-ipadic-2.7.0-20070801/install-sh
mecab-ipadic-2.7.0-20070801/missing
mecab-ipadic-2.7.0-20070801/mkinstalldirs
mecab-ipadic-2.7.0-20070801/Adj.csv
mecab-ipadic-2.7.0-20070801/Adnominal.csv
mecab-ipadic-2.7.0-20070801/Adverb.csv
mecab-ipadic-2.7.0-20070801/Auxil.csv
mecab-ipadic-2.7.0-20070801/Conjunction.csv
mecab-ipadic-2.7.0-20070801/Filler.csv
mecab-ipadic-2.7.0-20070801/Interjection.csv
mecab-ipadic-2.7.0-20070801/Noun.adjv.csv
mecab-ipadic-2.7.0-20070801/Noun.adverbal.csv
mecab-ipadic-2.7.0-20070801/Noun.csv
mecab-ipadic-2.7.0-20070801/Noun.demonst.csv
mecab-ipadic-2.7.0-20070801/Noun.nai.csv
mecab-ipadic-2.7.0-20070801/Noun.name.csv
mecab-ipadic-2.7.0-20070801/Noun.number.csv
mecab-ipadic-2.7.0-20070801/Noun.org.csv
mecab-ipadic-2.7.0-20070801/Noun.others.csv
mecab-ipadic-2.7.0-20070801/Noun.place.csv
mecab-ipadic-2.7.0-20070801/Noun.proper.csv
mecab-ipadic-2.7.0-20070801/Noun.verbal.csv
mecab-ipadic-2.7.0-20070801/Others.csv
mecab-ipadic-2.7.0-20070801/Postp-col.csv
mecab-ipadic-2.7.0-20070801/Postp.csv
mecab-ipadic-2.7.0-20070801/Prefix.csv
mecab-ipadic-2.7.0-20070801/Suffix.csv
mecab-ipadic-2.7.0-20070801/Symbol.csv
mecab-ipadic-2.7.0-20070801/Verb.csv
mecab-ipadic-2.7.0-20070801/char.def
mecab-ipadic-2.7.0-20070801/feature.def
mecab-ipadic-2.7.0-20070801/left-id.def
mecab-ipadic-2.7.0-20070801/matrix.def
mecab-ipadic-2.7.0-20070801/pos-id.def
mecab-ipadic-2.7.0-20070801/rewrite.def
mecab-ipadic-2.7.0-20070801/right-id.def
mecab-ipadic-2.7.0-20070801/unk.def
mecab-ipadic-2.7.0-20070801/dicrc
mecab-ipadic-2.7.0-20070801/RESULT
[make-mecab-ipadic-NEologd] : Configure custom system dictionary on /code/mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801-neologd-20200813
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking whether make sets $(MAKE)... yes
checking for working aclocal-1.4... missing
checking for working autoconf... found
checking for working automake-1.4... missing
checking for working autoheader... found
checking for working makeinfo... missing
checking for a BSD-compatible install... /usr/bin/install -c
checking for mecab-config... /usr/bin/mecab-config
configure: creating ./config.status
config.status: creating Makefile
[make-mecab-ipadic-NEologd] : Encode the character encoding of system dictionary resources from EUC_JP to UTF-8
./../../libexec/iconv_euc_to_utf8.sh ./Adnominal.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Postp-col.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Filler.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Others.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.nai.csv
./../../libexec/iconv_euc_to_utf8.sh ./Noun.others.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.verbal.csv
./../../libexec/iconv_euc_to_utf8.sh ./Noun.proper.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Conjunction.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Adj.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Postp.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.number.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.name.csv
./../../libexec/iconv_euc_to_utf8.sh ./Noun.place.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Interjection.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Auxil.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.demonst.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Adverb.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.adverbal.csv
./../../libexec/iconv_euc_to_utf8.sh ./Verb.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Prefix.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Suffix.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Symbol.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.adjv.csv
./../../libexec/iconv_euc_to_utf8.sh ./Noun.org.csv 
rm ./Adnominal.csv 
rm ./Postp-col.csv 
rm ./Filler.csv 
rm ./Others.csv
rm ./Noun.nai.csv 
rm ./Noun.others.csv
rm ./Noun.verbal.csv
rm ./Noun.proper.csv
rm ./Conjunction.csv
rm ./Adj.csv
rm ./Postp.csv
rm ./Noun.number.csv
rm ./Noun.name.csv 
rm ./Noun.place.csv
rm ./Noun.csv
rm ./Interjection.csv
rm ./Auxil.csv
rm ./Noun.demonst.csv
rm ./Adverb.csv
rm ./Noun.adverbal.csv
rm ./Verb.csv 
rm ./Prefix.csv
rm ./Suffix.csv 
rm ./Symbol.csv
rm ./Noun.adjv.csv
rm ./Noun.org.csv
./../../libexec/iconv_euc_to_utf8.sh ./right-id.def 
./../../libexec/iconv_euc_to_utf8.sh ./left-id.def 
./../../libexec/iconv_euc_to_utf8.sh ./feature.def 
./../../libexec/iconv_euc_to_utf8.sh ./unk.def 
./../../libexec/iconv_euc_to_utf8.sh ./rewrite.def 
./../../libexec/iconv_euc_to_utf8.sh ./pos-id.def
./../../libexec/iconv_euc_to_utf8.sh ./matrix.def 
./../../libexec/iconv_euc_to_utf8.sh ./char.def 
rm ./right-id.def 
rm ./left-id.def 
rm ./feature.def
rm ./unk.def
rm ./rewrite.def
rm ./pos-id.def 
rm ./matrix.def
rm ./char.def
mv ./Postp.csv.utf8 ./Postp.csv 
mv ./Noun.org.csv.utf8 ./Noun.org.csv 
mv ./Prefix.csv.utf8 ./Prefix.csv
mv ./Noun.demonst.csv.utf8 ./Noun.demonst.csv
mv ./rewrite.def.utf8 ./rewrite.def
mv ./Others.csv.utf8 ./Others.csv 
mv ./matrix.def.utf8 ./matrix.def
mv ./pos-id.def.utf8 ./pos-id.def
mv ./Noun.others.csv.utf8 ./Noun.others.csv
mv ./Noun.adjv.csv.utf8 ./Noun.adjv.csv 
mv ./Interjection.csv.utf8 ./Interjection.csv
mv ./Adj.csv.utf8 ./Adj.csv
mv ./unk.def.utf8 ./unk.def
mv ./Auxil.csv.utf8 ./Auxil.csv
mv ./Noun.number.csv.utf8 ./Noun.number.csv 
mv ./char.def.utf8 ./char.def
mv ./Conjunction.csv.utf8 ./Conjunction.csv
mv ./feature.def.utf8 ./feature.def
mv ./Filler.csv.utf8 ./Filler.csv
mv ./Symbol.csv.utf8 ./Symbol.csv 
mv ./Postp-col.csv.utf8 ./Postp-col.csv
mv ./Noun.csv.utf8 ./Noun.csv
mv ./Adnominal.csv.utf8 ./Adnominal.csv 
mv ./Adverb.csv.utf8 ./Adverb.csv
mv ./Noun.nai.csv.utf8 ./Noun.nai.csv
mv ./Noun.name.csv.utf8 ./Noun.name.csv
mv ./Noun.adverbal.csv.utf8 ./Noun.adverbal.csv
mv ./Noun.proper.csv.utf8 ./Noun.proper.csv 
mv ./Noun.place.csv.utf8 ./Noun.place.csv
mv ./Suffix.csv.utf8 ./Suffix.csv
mv ./left-id.def.utf8 ./left-id.def
mv ./right-id.def.utf8 ./right-id.def
mv ./Noun.verbal.csv.utf8 ./Noun.verbal.csv
mv ./Verb.csv.utf8 ./Verb.csv 
[make-mecab-ipadic-NEologd] : Fix yomigana field of IPA dictionary
patching file Noun.csv
patching file Noun.place.csv
patching file Verb.csv
patching file Noun.verbal.csv
patching file Noun.name.csv
patching file Noun.adverbal.csv
patching file Noun.csv
patching file Noun.name.csv
patching file Noun.org.csv
patching file Noun.others.csv
patching file Noun.place.csv
patching file Noun.proper.csv
patching file Noun.verbal.csv
patching file Prefix.csv
patching file Suffix.csv
patching file Noun.proper.csv
patching file Noun.csv
patching file Noun.name.csv
patching file Noun.org.csv
patching file Noun.place.csv
patching file Noun.proper.csv
patching file Noun.verbal.csv
patching file Noun.name.csv
patching file Noun.org.csv
patching file Noun.place.csv
patching file Noun.proper.csv
patching file Suffix.csv
patching file Noun.demonst.csv
patching file Noun.csv
patching file Noun.name.csv
[make-mecab-ipadic-NEologd] : Copy user dictionary resource
[make-mecab-ipadic-NEologd] : Install adverb entries using /code/mecab-ipadic-neologd/libexec/../seed/neologd-adverb-dict-seed.20150623.csv.xz
[make-mecab-ipadic-NEologd] : Install interjection entries using /code/mecab-ipadic-neologd/libexec/../seed/neologd-interjection-dict-seed.20170216.csv.xz
[make-mecab-ipadic-NEologd] : Install noun orthographic variant entries using /code/mecab-ipadic-neologd/libexec/../seed/neologd-common-noun-ortho-variant-dict-seed.20170228.csv.xz
[make-mecab-ipadic-NEologd] : Install noun orthographic variant entries using /code/mecab-ipadic-neologd/libexec/../seed/neologd-proper-noun-ortho-variant-dict-seed.20161110.csv.xz
[make-mecab-ipadic-NEologd] : Install entries of orthographic variant of a noun used as verb form using /code/mecab-ipadic-neologd/libexec/../seed/neologd-noun-sahen-conn-ortho-variant-dict-seed.20160323.csv.xz
[make-mecab-ipadic-NEologd] : Install frequent adjective orthographic variant entries using /code/mecab-ipadic-neologd/libexec/../seed/neologd-adjective-std-dict-seed.20151126.csv.xz
[make-mecab-ipadic-NEologd] : Not install /code/mecab-ipadic-neologd/libexec/../seed/neologd-adjective-exp-dict-seed.20151126.csv.xz
[make-mecab-ipadic-NEologd] :     When you install neologd-adjective-exp-dict-seed.20151126.csv.xz, please set --install_adjective_exp option

[make-mecab-ipadic-NEologd] : Install adjective verb orthographic variant entries using /code/mecab-ipadic-neologd/libexec/../seed/neologd-adjective-verb-dict-seed.20160324.csv.xz
[make-mecab-ipadic-NEologd] : Not install /code/mecab-ipadic-neologd/libexec/../seed/neologd-date-time-infreq-dict-seed.20190415.csv.xz
[make-mecab-ipadic-NEologd] :     When you install neologd-date-time-infreq-dict-seed.20190415.csv.xz, please set --install_infreq_datetime option

[make-mecab-ipadic-NEologd] : Not install /code/mecab-ipadic-neologd/libexec/../seed/neologd-quantity-infreq-dict-seed.20190415.csv.xz
[make-mecab-ipadic-NEologd] :     When you install neologd-quantity-infreq-dict-seed.20190415.csv.xz, please set --install_infreq_quantity option

[make-mecab-ipadic-NEologd] : Install entries of ill formed words using /code/mecab-ipadic-neologd/libexec/../seed/neologd-ill-formed-words-dict-seed.20170127.csv.xz
[make-mecab-ipadic-NEologd] : Re-Index system dictionary
reading ./unk.def ... 40
emitting double-array: 100% |###########################################|
./model.def is not found. skipped.
reading ./Adnominal.csv ... 135
reading ./Postp-col.csv ... 91
reading ./Filler.csv ... 19
reading ./Others.csv ... 2
reading ./Noun.nai.csv ... 42
reading ./neologd-ill-formed-words-dict-seed.20170127.csv ... 60616
reading ./neologd-proper-noun-ortho-variant-dict-seed.20161110.csv ... 138379
reading ./Noun.others.csv ... 153
reading ./Noun.verbal.csv ... 12150
reading ./Noun.proper.csv ... 27493
reading ./Conjunction.csv ... 171
reading ./Adj.csv ... 27210
reading ./neologd-common-noun-ortho-variant-dict-seed.20170228.csv ... 152869
reading ./Postp.csv ... 146
reading ./Noun.number.csv ... 42
reading ./Noun.name.csv ... 34215
reading ./Noun.place.csv ... 73194
reading ./neologd-noun-sahen-conn-ortho-variant-dict-seed.20160323.csv ... 26058
reading ./Symbol.csv ... 208
reading ./neologd-adjective-verb-dict-seed.20160324.csv ... 20268
reading ./Noun.adjv.csv ... 3328
reading ./Noun.org.csv ... 17149
/code/mecab-ipadic-neologd/bin/../libexec/make-mecab-ipadic-neologd.sh: line 525:  6288 Killed
         ${MECAB_LIBEXEC_DIR}/mecab-dict-index -f UTF8 -t UTF8
ERROR: Service 'python-django' failed to build: The command '/bin/sh -c apt-get update -y&&    apt-get upgrade -y&&    apt-get install mecab libmecab-dev mecab-ipadic mecab-ipadic-utf8 sudo -y&&    apt-get install git make curl xz-utils file&&    git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git&&    /code/mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -y &&    mkdir /code/media &&     mkdir /code/static &&    python -m pip install --upgrade pip &&    pip install -r requirements.txt' returned a non-zero code: 137
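
A note on reading the final status in this log: by the usual shell convention, an exit status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9 (SIGKILL), which matches the "Killed" line printed for mecab-dict-index. Inside a container build this often points to the kernel's out-of-memory killer, although the log alone does not prove that. A minimal Python sketch of the decoding follows; the helper name `describe_exit_status` is made up for illustration.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Decode a shell-style exit status (hypothetical helper, not part of any tool)."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The signal is not defined on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with code {status}"


print(describe_exit_status(137))
```

If memory is indeed the cause, giving the build more memory is one thing to try before digging further.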

Improper proper nouns

I found that some clauses ending with "。" are incorrectly registered as 固有名詞 (proper noun).

$ echo '好きだ。' | mecab -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd
好きだ。	名詞,固有名詞,一般,*,*,*,好きだ。,スキダ,スキダ

Examples:

  • 好きだ。
  • 元気です。
  • おはよう。
  • あなた。
  • またね。
  • 娘。
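
Entries of this kind can be listed mechanically from a decompressed seed file. The sketch below assumes the 13-column layout seen in the seed rows quoted elsewhere on this page (surface, left id, right id, cost, six part-of-speech fields, base form, reading, pronunciation). The function name and the sample rows, including their ids and costs, are invented for illustration.

```python
import csv
import io


def find_sentence_like_entries(lines):
    """Yield surfaces that end with "。" and are tagged 名詞,固有名詞."""
    for row in csv.reader(lines):
        if len(row) < 13:
            continue  # skip rows that do not follow the assumed layout
        surface, pos, pos1 = row[0], row[4], row[5]
        if surface.endswith("。") and pos == "名詞" and pos1 == "固有名詞":
            yield surface


# Invented sample rows; the ids and costs are placeholders, not real dictionary values.
sample = io.StringIO(
    "好きだ。,1288,1288,5000,名詞,固有名詞,一般,*,*,*,好きだ。,スキダ,スキダ\n"
    "東京,1293,1293,3000,名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー\n"
)
print(list(find_sentence_like_entries(sample)))
```

Some hits may be legitimate names that really end in "。", so the output is a list of candidates to review, not a list of errors.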

Negative cost

Thanks first for the great database.

Motivation

I find some words in the data are assigned negative costs.

$ cat mecab-ipadic-neologd/build/mecab-ipadic-2.7.0-20070801-neologd-20191111/mecab-user-dict-seed.20191111.csv | grep "ファニチャーロウ"
ファニチャーロウレーシング,1288,1288,-5111,名詞,固有名詞,一般,*,*,*,ファニチャー・ロウ・レーシング,ファニチャーロウレーシング,ファニチャーロウレーシング
ファニチャー・ロウ・レーシング,1288,1288,-9029,名詞,固有名詞,一般,*,*,*,ファニチャー・ロウ・レーシング,ファニチャーロウレーシング,ファニチャーロウレーシング

Costs are lower for more frequent words. But the examples above do not seem frequent enough to be assigned such a low cost. I suspect this could be the result of an integer overflow or of sorting.

Goal

I would like to know:
(1) if this is a correct/intended result or a bug
(2) if correct/intended, how negative costs should be interpreted.

Can someone help me with this?
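
One way to narrow the question down is to check whether any cost falls outside the signed 16-bit range. As far as I know MeCab stores word costs as a signed 16-bit value, so negative costs are representable, but please treat that as an assumption to verify. The sketch below lists negative costs and out-of-range costs from seed rows; `summarize_costs` is a hypothetical helper, and whether these particular values are intended is still a question for the maintainers.

```python
import csv

INT16_MIN, INT16_MAX = -32768, 32767


def summarize_costs(lines):
    """Return (negative, out_of_range) lists of (surface, cost) pairs."""
    negatives, out_of_range = [], []
    for row in csv.reader(lines):
        if len(row) < 4:
            continue
        surface, cost = row[0], int(row[3])  # column 4 holds the word cost
        if cost < 0:
            negatives.append((surface, cost))
        if not INT16_MIN <= cost <= INT16_MAX:
            out_of_range.append((surface, cost))
    return negatives, out_of_range


# The two rows quoted in this issue.
rows = [
    "ファニチャーロウレーシング,1288,1288,-5111,名詞,固有名詞,一般,*,*,*,ファニチャー・ロウ・レーシング,ファニチャーロウレーシング,ファニチャーロウレーシング",
    "ファニチャー・ロウ・レーシング,1288,1288,-9029,名詞,固有名詞,一般,*,*,*,ファニチャー・ロウ・レーシング,ファニチャーロウレーシング,ファニチャーロウレーシング",
]
negatives, out_of_range = summarize_costs(rows)
print(negatives)
print(out_of_range)
```

Both quoted costs are negative but well inside the 16-bit range, so the values themselves do not look like a wrapped-around overflow.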

Release new version

Motivation

I hope people can install the latest package with fresh data.

I have packaged version 0.0.5 of mecab-ipadic-neologd, released on 2016-05-02, for Debian and derived distributions such as Ubuntu, and published the packaging files on both Launchpad PPA and Bintray, so people can install it easily with the command apt-get install mecab-ipadic-neologd.

Goal

  • Release new version for updated data

How to use on Windows10 and Python?

I am not a Japanese speaker. First I installed MeCab from this page:
https://pypi.org/project/mecab-python3/

although it did not create a mecab folder on my PC.

In my Python file I wrote wakati = MeCab.Tagger("-Owakati") and it worked well. However, I have read that mecab-ipadic-neologd is better and that I should use it, but all the guides are written for Linux and macOS. Please help.

Pronunciations for 1日間 ~ 10日間 are wrong

The readings for 1日間 through 10日間 are wrong.
The 'カン' for '間' is missing.
Also, the furigana of "1日間" should be "イチニチカン", not "ツイタチカン".

1日間   名詞,固有名詞,一般,*,*,*,1日間,ツイタチ,ツイタチ
2日間   名詞,固有名詞,一般,*,*,*,2日間,フツカ,フツカ
3日間
4日間
...
10日間  名詞,固有名詞,一般,*,*,*,10日間,トオカ,トオカ

11日間 is correct.

11日間  名詞,固有名詞,一般,*,*,*,11日間,ジュウイチニチカン,ジュウイチニチカン
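
A quick consistency check for this class of error: any surface ending in 日間 should have a reading ending in カン. The sketch below applies that rule to the (surface, reading) pairs reported above; `missing_kan` is a hypothetical helper name. The rule only catches the dropped カン, not the separate ツイタチ versus イチニチ question.

```python
def missing_kan(entries):
    """Return surfaces ending in 日間 whose reading does not end in カン."""
    return [
        surface
        for surface, reading in entries
        if surface.endswith("日間") and not reading.endswith("カン")
    ]


# (surface, reading) pairs taken from the analysis results in this issue.
entries = [
    ("1日間", "ツイタチ"),
    ("2日間", "フツカ"),
    ("10日間", "トオカ"),
    ("11日間", "ジュウイチニチカン"),
]
print(missing_kan(entries))
```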

Some wrong yomigana/hyouki entries

// hyouki
mecab-user-dict-seed.20160222.csv:387971: ウグイスタケ,1288,1288,-1686,名詞,固有名詞,一般,*,*,*,鶯〓,ウグイスタケ,ウグイスタケ
mecab-user-dict-seed.20160222.csv:388991: ウチダヒャッケン,1288,1288,-5999,名詞,固有名詞,一般,*,*,*,内田百〓@6BE1@,ウチダヒャッケン,ウチダヒャッケン
(+87 "〓" entries)

// yomigana
mecab-user-dict-seed.20160222.csv:268129: けけ,1289,1289,7587,名詞,固有名詞,人名,一般,*,*,けけ,ケケヶ,ケケヶ
mecab-user-dict-seed.20160222.csv:274205: ずヾや株式会社,1288,1288,4587,名詞,固有名詞,一般,*,*,*,ずヾや株式会社,ズヾヤカブシキガイシャ,ズヾヤカブシキガイシャ
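
Both kinds of defect in these listings can be detected mechanically. The sketch below flags a base form (column 11) containing the geta mark "〓" and a reading (column 12) containing anything other than ァ through ヴ and the long-vowel mark ー. That character set is my own assumption about what a clean reading looks like, not a rule taken from the project, and `check_entry` is a hypothetical helper. The sample rows are adapted from the listing above, with the placeholder fields written as `*`.

```python
import csv
import re

# Assumed definition of a "clean" reading: katakana ァ..ヴ plus the long-vowel mark.
KATAKANA_READING = re.compile(r"[ァ-ヴー]+")
GETA_MARK = "\u3013"  # 〓, typically a stand-in for an unrepresentable character


def check_entry(row):
    """Return a list of problems found in one 13-column seed row."""
    problems = []
    base, reading = row[10], row[11]
    if GETA_MARK in base:
        problems.append("geta mark in base form")
    if not KATAKANA_READING.fullmatch(reading):
        problems.append("non-katakana character in reading")
    return problems


rows = [
    "ウグイスタケ,1288,1288,-1686,名詞,固有名詞,一般,*,*,*,鶯〓,ウグイスタケ,ウグイスタケ",
    "けけ,1289,1289,7587,名詞,固有名詞,人名,一般,*,*,けけ,ケケヶ,ケケヶ",
]
for row in csv.reader(rows):
    print(row[0], check_entry(row))
```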

"ヶ" and "ヾ" are not good for Japanese yomigana.

Some entries have wrong yomi and pronunciation

Some entries have wrong yomi and pronunciation.
For example, after building the dictionary:

$ cd mecab_ipadic_neologd
$ grep '高橋みなみ,' ./**/*.csv
./build/mecab-ipadic-2.7.0-20070801-neologd-20170227/mecab-user-dict-seed.20170227.csv:5代目高橋みなみ,1289,1289,4078,名詞,固有名詞,人名,一般,*,*,5代目高橋みなみ,ゴダイメタカハシミナミ,ゴダイメタカハシミナミ
./build/mecab-ipadic-2.7.0-20070801-neologd-20170227/mecab-user-dict-seed.20170227.csv:ゴダイメタカハシミナミ,1289,1289,-951,名詞,固有名詞,人名,一般,*,*,5代目高橋みなみ,ゴダイメタカハシミナミ,ゴダイメタカハシミナミ
./build/mecab-ipadic-2.7.0-20070801-neologd-20170227/mecab-user-dict-seed.20170227.csv:高橋みなみ,1289,1289,273,名詞,固有名詞,人名,一般,*,*,高橋みなみ,タカハシミナミ,タカハシミナミ
./build/mecab-ipadic-2.7.0-20070801-neologd-20170227/mecab-user-dict-seed.20170227.csv:高橋みなみ,1289,1289,273,名詞,固有名詞,人名,一般,*,*,高橋みなみ,タカハシミナミエーケービーフォーティエイト,タカハシミナミエーケービーフォーティエイト
$ grep '日本料理,' ./**/*.csv
./build/mecab-ipadic-2.7.0-20070801-neologd-20170227/mecab-user-dict-seed.20170227.csv:日本料理,1288,1288,3024,名詞,固有名詞,一般,*,*,*,日本料理,ニホンリョウリ,ニホンリョーリ
./build/mecab-ipadic-2.7.0-20070801-neologd-20170227/mecab-user-dict-seed.20170227.csv:日本料理,1288,1288,3024,名詞,固有名詞,一般,*,*,*,日本料理,ニホンリョウリニッポンリョウリ,ニホンリョーリニッポンリョウリ

Why?
It looks to me that 日本料理 has concatenated yomi and pronunciation.
Why does 高橋みなみ have エーケービーフォーティエイト?

My version is 20170228-01, but older versions have the same issues.

Thanks.

e-mail and URL tokenization

Motivation and Goal

Instead of breaking an email address or a URL into pieces, it would be a desirable option to identify each email address or URL as a single token. See the example below, which compares the current behavior with the suggested one.

Sample code

import MeCab
mecab = MeCab.Tagger("-Ochasen -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd")

text = "中川さんのメールはnakagawa@xxxx.co.jpです"
print(mecab.parse(text))

Output

中川    ナカガワ中川    名詞-固有名詞-人名-
さん    サン    さん    名詞-接尾-人名
                  助詞-連体化
メール   メール   メール   名詞-サ変接続
                  助詞-係助詞
nakagawa        nakagawa        nakagawa        名詞-固有名詞-組織
@       @       @       記号-一般
xxxx    イエナイXXXX    名詞-固有名詞-一般
.       .       .       記号-一般
co.jp   シーオージェイピー co.jp   名詞-固有名詞-一般
です    デス    です    助動詞  特殊 基本形
EOS

Desirable output

中川    ナカガワ中川    名詞-固有名詞-人名-
さん    サン    さん    名詞-接尾-人名
                  助詞-連体化
メール   メール   メール   名詞-サ変接続
                  助詞-係助詞
nakagawa@xxxx.co.jp        [...]
です    デス    です    助動詞  特殊 基本形
EOS
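
Until the dictionary or MeCab itself supports this, one possible workaround is to cut email addresses and URLs out of the text before analysis and pass only the remaining segments to the tagger. The sketch below shows just that splitting step with deliberately simple regular expressions; the patterns and the name `split_protected` are my own assumptions, not part of MeCab or NEologd, and real-world address and URL grammars are more permissive than this.

```python
import re

# Simplified patterns: a URL runs until whitespace or a Japanese/CJK character;
# an email address is local-part@domain.tld using common ASCII characters only.
PROTECTED = re.compile(
    r"https?://[^\s\u3000-\u9fff]+"
    r"|[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
)


def split_protected(text):
    """Split text into (segment, is_protected) pairs, keeping emails and URLs whole."""
    segments = []
    pos = 0
    for match in PROTECTED.finditer(text):
        if match.start() > pos:
            segments.append((text[pos:match.start()], False))
        segments.append((match.group(), True))
        pos = match.end()
    if pos < len(text):
        segments.append((text[pos:], False))
    return segments


print(split_protected("中川さんのメールはnakagawa@xxxx.co.jpです"))
```

Segments flagged `True` would be emitted as single tokens by the caller, and segments flagged `False` would go through `mecab.parse` as usual. Note that splitting the input this way can change how MeCab analyzes the text next to each boundary.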

'三重県' and '群馬県' are parsed as name of person

Both 三重県 and 群馬県 are names of prefectures. Other prefectures are correctly analyzed as 名詞-固有名詞-地域-一般.

But these two prefectures are analyzed as 名詞-固有名詞-人名-一般, and I found that they are tagged this way in the seed file. As far as I could find, there are no famous people named 三重県 or 群馬県.

I think both words should be analyzed as 名詞-固有名詞-地域-一般.

Result of analysis

茨城県  名詞,固有名詞,地域,一般,*,*,茨城県,イバラキケン,イバラキケン
栃木県  名詞,固有名詞,地域,一般,*,*,栃木県,トチギケン,トチギケン
群馬県  名詞,固有名詞,人名,一般,*,*,群馬県,グンマケン,グンマケン
愛知県  名詞,固有名詞,地域,一般,*,*,愛知県,アイチケン,アイチケン
岐阜県  名詞,固有名詞,地域,一般,*,*,岐阜県,ギフケン,ギフケン
三重県  名詞,固有名詞,人名,一般,*,*,三重県,ミエケン,ミエケン

Seed file

./build/mecab-ipadic-2.7.0-20070801-neologd-20190812/mecab-user-dict-seed.20190812.csv:三重県,1289,1289,-2894,名詞,固有名詞,人名,一般,*,*,三重県,ミエケン,ミエケン
./build/mecab-ipadic-2.7.0-20070801-neologd-20190812/mecab-user-dict-seed.20190812.csv:群馬県,1289,1289,1138,名詞,固有名詞,人名,一般,*,*,群馬県,グンマケン,グンマケン

What is the correct way to customize the pos-id.def file in mecab-ipadic-neologd?

Hi,
I'm trying to modify the pos-id.def that ships with the neologd dictionary. After changing that file, I run either
sudo ./mecab-dict-index -f UTF8 -t UTF8 -d /usr/lib/mecab/dic/mecab-ipadic-neologd
or
sudo ./mecab-dict-index -f UTF8 -t UTF8 -d .../build/mecab-ipadic-2.7.0-20070801-neologd-20170710
and I get the error

dictionary_compiler.cpp(133) [dic.size()] no dictionaries are specified
or
char_property.cpp(236) [unk.find(it->first) != unk.end()] category [ALPHA] is undefined in ...../mecab-ipadic-neologd/build/mecab-ipadic-2.7.0-20070801-neologd-20170710/unk.def

respectively.

So could anyone tell me the correct way to compile the new pos-id.def for the neologd dictionary? Any hint is appreciated. Thanks.

Unnecessary variants for single address

grep -a "愛知県名古屋市南区豊田町" mecab-user-dict-seed.20160225.csv

名古屋市豊田町,1293,1293,-5820,名詞,固有名詞,地域,一般,*,*,愛知県名古屋市南区豊田町,ナゴヤシトヨダチョウ,ナゴヤシトヨダチョー
愛知県南区豊田町,1293,1293,-1981,名詞,固有名詞,地域,一般,*,*,愛知県名古屋市南区豊田町,アイチケンミナミクトヨダチョウ,アイチケンミナミクトヨダチョー
愛知県名古屋市南区豊田町,1293,1293,-19354,名詞,固有名詞,地域,一般,*,*,愛知県名古屋市南区豊田町,アイチケンナゴヤシミナミクトヨダチョウ,アイチケンナゴヤシミナミクトヨダチョー
愛知県名古屋市豊田町,1293,1293,-18608,名詞,固有名詞,地域,一般,*,*,愛知県名古屋市南区豊田町,アイチケンナゴヤシトヨダチョウ,アイチケンナゴヤシトヨダチョー

I think we don't need "名古屋市豊田町", "愛知県南区豊田町", or "愛知県名古屋市豊田町".
https://www.google.co.jp/search?q="名古屋市豊田町"
4 results
https://www.google.co.jp/search?q="愛知県南区豊田町"
0 results
https://www.google.co.jp/search?q="愛知県名古屋市豊田町"
0 results

Download common-nouns.csv of specific date

Motivation

  • Extract the nouns newly added to the dictionary by comparing the current common-nouns.csv with last year's common-nouns.csv

Goal

  • Download the latest common-nouns.csv and last year's common-nouns.csv.
  • Is there any way to download the common-nouns.csv of a specific date?
    I have looked in the /seed/ directory, but it seems to contain only the 2017/02 common-nouns.csv.
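
Once two snapshots are available, the comparison itself is a set difference on the surface column. The sketch below assumes both files are already decompressed, UTF-8 encoded CSV with the surface form in the first column; the function names and the file names in the usage comment are hypothetical. How to obtain an older snapshot (for example from the repository history) is a separate question that this sketch does not answer.

```python
import csv


def load_surfaces(path):
    """Read the first column (surface form) of every non-empty row."""
    with open(path, encoding="utf-8", newline="") as f:
        return {row[0] for row in csv.reader(f) if row}


def newly_added(old_path, new_path):
    """Return surfaces present in the new file but not in the old one."""
    return sorted(load_surfaces(new_path) - load_surfaces(old_path))


# Usage with hypothetical file names:
#   for surface in newly_added("common-nouns.last-year.csv", "common-nouns.current.csv"):
#       print(surface)
```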

With best regards

It cannot parse である correctly

When I use mecab with the default dictionary, it parses this sentence correctly.

対象者はゼロであるが、実施する。
対象    名詞,一般,*,*,*,*,対象,タイショウ,タイショー
者      名詞,接尾,一般,*,*,*,者,シャ,シャ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
ゼロ    名詞,数,*,*,*,*,ゼロ,ゼロ,ゼロ
で      助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ
ある    助動詞,*,*,*,五段・ラ行アル,基本形,ある,アル,アル
が      助詞,接続助詞,*,*,*,*,が,ガ,ガ
、      記号,読点,*,*,*,*,、,、,、
実施    名詞,サ変接続,*,*,*,*,実施,ジッシ,ジッシ
する    動詞,自立,*,*,サ変・スル,基本形,する,スル,スル
。      記号,句点,*,*,*,*,。,。,。

But when I use mecab with the neologd dictionary (commit 0700f47), 「である」 is treated as 固有名詞 (proper noun).

対象者はゼロであるが、実施する。
対象者  名詞,固有名詞,一般,*,*,*,対象者,タイショウシャ,タイショーシャ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
ゼロ    名詞,数,*,*,*,*,ゼロ,ゼロ,ゼロ
である  名詞,固有名詞,一般,*,*,*,である,デアル,デアル
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
、      記号,読点,*,*,*,*,、,、,、
実施    名詞,サ変接続,*,*,*,*,実施,ジッシ,ジッシ
する    動詞,自立,*,*,サ変・スル,基本形,する,スル,スル
。      記号,句点,*,*,*,*,。,。,。

Is this a bug, or is the sentence grammatically wrong?
Thanks.

Most "}" entries are unnecessary

I think most "}" entries are unnecessary.

ag } mecab-user-dict-seed.20200130.csv 
47452:2015年{{lang|zh|深圳}}土砂崩事故,1288,1288,3753,名詞,固有名詞,一般,*,*,*,2015年{{lang|zh|深圳}}土砂崩事故,ニセンジュウゴネンシンセンドシャクズレジコ,ニセンジュウゴネンシンセンドシャクズレジコ
388354:}★,1288,1288,8142,名詞,固有名詞,一般,*,*,*,}★,ワルグチトワルコメボクメツダンタ,ワルグチトワルコメボクメツダンタ
655423:カジタタカアキ,1289,1289,4374,名詞,固有名詞,人名,一般,*,*,梶田隆章{{R|nichigai}},カジタタカアキ,カジタタカーキ
842080:ザウィドウ{{}}真実を求めて{{}},1288,1288,4068,名詞,固有名詞,一般,*,*,*,ザ・ウィドウ{{~}}真実を求めて{{~}},ザウィドウシンジツヲモトメテ,ザウィドウシンジツオモトメテ

Thank you for providing and keeping a good dictionary.

Wide "," is included in 原形

$ ag アースウィンドアンドファイアー mecab-user-dict-seed.20200123.csv 
151101:Earth Wind & Fire,1288,1288,4131,名詞,固有名詞,一般,*,*,*,Earth,Wind&Fire,アースウィンドアンドファイアー,アースウィンドアンドファイアー

I think this is better.

- Earth,Wind&Fire
+ Earth, Wind&Fire

Add Left double quotation mark to Regexp.ja

Motivation

This issue is about Regexp.ja in a wiki.

Replace the following full-width symbols with half-width symbols (以下の全角記号は半角記号に置換)
/!”#$%&’()*+,−./:;<>?@[¥]^_`{|}

It recommends replacing Right double quotation mark (U+201D) with Quotation mark (U+0022), but not replacing Left double quotation mark (U+201C) with Quotation mark. I would prefer both the Right and Left double quotation marks to be replaced with Quotation mark in sentences like the one below.

ダブルクォテーションは日本語では“強調”のために使われる。
→ ダブルクォテーションは日本語では"強調"のために使われる。

Sorry if there is a specific reason why Left double quotation mark is not included in the rule.

Goal

My suggestion might look like this.

Replace the following full-width symbols with half-width symbols (以下の全角記号は半角記号に置換)
!“”#$%&’()*+,−./:;<>?@[¥]^_`{|}

In addition to adding Left double quotation mark (U+201C), I omitted the Slash (U+002F) at the head of the line, because it is a half-width character. I guess it was included by mistake.
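
The proposed rule is easy to express in code. The sketch below maps both U+201C and U+201D to U+0022 with `str.translate`; it covers only these two characters, not the full symbol list from the wiki, and the function name is made up for illustration.

```python
# Map both curly double quotation marks to the plain ASCII quotation mark.
QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"'})


def normalize_double_quotes(text):
    """Replace U+201C and U+201D with U+0022."""
    return text.translate(QUOTE_MAP)


print(normalize_double_quotes("ダブルクォテーションは日本語では\u201c強調\u201dのために使われる。"))
```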

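About the BibTeX entries in README.md

This concerns the author field of the papers for the 2017 meeting of the Association for Natural Language Processing (言語処理学会) and the 2016 meeting of the Information Processing Society of Japan (情報処理学会).
I believe the name of Taiichi Hashimoto (橋本泰一) is misspelled as Taiichi Hashimoro.
Taiichi Hashimoto should be correct.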

hatena keyword does not have complete yomigana of 16 or more characters.

hatena keyword does not have complete yomigana when the yomigana is 16 characters or longer.

e.g.

  • うごめもしゅうへんのはてなでのも うごメモ周辺のはてなでの問題
  • おしえてはてなだいありーでんごん 教えてはてなダイアリー伝言板
  • しんはてなだいあらーえいがひゃく 真・はてなダイアラー映画百選

hatena keyword has proper yomigana when the yomigana is 15 characters or shorter.

すもももももももものうち

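Training a dictionary myself is a chore, so I was looking for a reasonably recent dictionary and came across mecab-ipadic-neologd. Things that used to be chopped into small pieces are now recognized as single words, and it seems to work well. However, there is one problem: when I analyze 「すもももももももものうち」, the whole string is analyzed as the single common noun 「すもももももももものうち」.

Is this because something was missing when I ran make on the dictionary? Or is this the intended behavior?

I have confirmed that mecab-unidic-neologd likewise treats it as a common noun.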

Failed to build lucene-kuromoji because mecab-user-dict-seed.20190930.csv contains an invalid row.

It has been a year. How have you been?
mecab-user-dict-seed.20190930.csv contains an invalid CSV row, as follows.

line 1378761:
マスストランディング,1288,1288,-141,名詞,固有名詞,一般,*,**,,マス・ストランディング,マスストランディング,マスストランディング
マスストランディング,1288,1288,-141,名詞,固有名詞,一般,*,*,*,マス・ストランディング,マスストランディング,マスストランディング
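
A check along the following lines could catch such rows before they reach downstream builds. It is a minimal sketch that assumes every valid row has exactly 13 columns, no empty column, and part-of-speech placeholders written as a single `*`. These rules are inferred from the rows shown on this page, not from a specification, and `find_malformed_rows` is a hypothetical helper.

```python
import csv

EXPECTED_FIELDS = 13  # assumption based on the seed rows shown on this page


def find_malformed_rows(lines):
    """Return (line_number, reason) pairs for rows that break the assumed layout."""
    problems = []
    for lineno, row in enumerate(csv.reader(lines), start=1):
        if len(row) != EXPECTED_FIELDS:
            problems.append((lineno, f"expected {EXPECTED_FIELDS} fields, got {len(row)}"))
        elif any(field == "" for field in row):
            problems.append((lineno, "empty field"))
        elif any(f != "*" and set(f) == {"*"} for f in row[4:10]):
            problems.append((lineno, "malformed placeholder"))
    return problems


rows = [
    "マスストランディング,1288,1288,-141,名詞,固有名詞,一般,*,**,,マス・ストランディング,マスストランディング,マスストランディング",
    "マスストランディング,1288,1288,-141,名詞,固有名詞,一般,*,*,*,マス・ストランディング,マスストランディング,マスストランディング",
]
print(find_malformed_rows(rows))
```

Only the first reason found is reported for each row, which keeps the output short when a single slip (here a misplaced comma) causes several symptoms.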

Missing Japanese names

These names are missing in mecab-user-dict-seed.20181112.csv and mecab-ipadic-2.7.0-20070801.
I think they are famous/common names.

サンペイ 三瓶
ソウシゲル 宗茂
タケユタカ 武豊
ユウト 勇人
リンカ 梨花

Needs `yum install patch` on CentOS 7

On a CentOS 7 Minimal install, the patch command has to be installed before running the installer.

./bin/install-mecab-ipadic-neologd -n
[install-mecab-ipadic-NEologd] : Start..
[install-mecab-ipadic-NEologd] : Check the existance of libraries
[install-mecab-ipadic-NEologd] :     find => ok
[install-mecab-ipadic-NEologd] :     sort => ok
[install-mecab-ipadic-NEologd] :     head => ok
[install-mecab-ipadic-NEologd] :     cut => ok
[install-mecab-ipadic-NEologd] :     egrep => ok
[install-mecab-ipadic-NEologd] :     mecab => ok
[install-mecab-ipadic-NEologd] :     mecab-config => ok
[install-mecab-ipadic-NEologd] :     make => ok
[install-mecab-ipadic-NEologd] :     curl => ok
[install-mecab-ipadic-NEologd] :     sed => ok
[install-mecab-ipadic-NEologd] :     cat => ok
[install-mecab-ipadic-NEologd] :     diff => ok
[install-mecab-ipadic-NEologd] :     tar => ok
[install-mecab-ipadic-NEologd] :     unxz => ok
[install-mecab-ipadic-NEologd] :     xargs => ok
[install-mecab-ipadic-NEologd] :     grep => ok
[install-mecab-ipadic-NEologd] :     iconv => ok
which: no patch in ($HOME/perl5/perlbrew/bin:/usr/local/bin:/usr/bin:$HOME/bin:/usr/local/sbin:/usr/sbin)
[install-mecab-ipadic-NEologd] :     patch is not found.

So the installation instructions should be rewritten as follows:

$ sudo yum install mecab mecab-devel mecab-ipadic git make curl xz patch
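
To find out in advance which commands are missing on a fresh machine, a small pre-flight check can mirror what the installer log shows. The sketch below uses Python's `shutil.which`; the command list is copied from the installer output quoted on this page and may not match what the current installer checks, and the helper name is hypothetical.

```python
import shutil

# Commands that the installer log on this page reports checking.
REQUIRED_COMMANDS = [
    "find", "sort", "head", "cut", "egrep", "mecab", "mecab-config", "make",
    "curl", "sed", "cat", "diff", "tar", "unxz", "xargs", "grep", "iconv",
    "patch", "which", "file", "openssl", "awk",
]


def missing_commands(commands):
    """Return the commands that cannot be found on PATH."""
    return [name for name in commands if shutil.which(name) is None]


print(missing_commands(REQUIRED_COMMANDS))
```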

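Organization names 「日生協」 / 「日本生活協同組合連合会」

I have a request to add some words.
Please add 「日生協」, the abbreviation used for the organization commonly known as 「生協」 (COOP), and its official name 「日本生活協同組合連合会」.

Currently the dictionary contains only 「日生」, as a family name, so 「日生協」 is split into 「日生」 and 「協」.

I could not think of a good way to turn this into a patch, so I am reporting it as an issue.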

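Some thoughts after trying neologd

Hello.
Thank you for making this dictionary available.
Entries that mix digits, Latin letters, or symbols are tagged as 名詞・固有名詞 (noun, proper noun).
In ipadic, the feature fields show that a token is a numeric element.
In the same way, could you add an extra element, for example 度量衡 (weights and measures), to the nouns and proper nouns that contain digits and the like?
If there is a better idea, another approach is fine too.
On its own, ¥ is a symbol, while the others (%, kg, cm, and Ⅼ, which cannot be read as "liter") are nouns.
For example:
    4カ月  名詞、固有名詞、一般、度量衡
    5つ   名詞、一般、    、度量衡
    A型   名詞、固有名詞、一般、度量衡
    35℃  名詞、固有名詞、一般、度量衡
    70%  名詞、固有名詞、一般、度量衡
    65kg  名詞、固有名詞、一般、度量衡
    180cm  名詞、固有名詞、一般、度量衡
    500m3  (this one gets split)
    8個(ケ) (this one gets split)
    5l  (can be read as "liters", but it gets split)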

Morphological analysis result of "夫婦" is wrong

Motivation

I think the morphological analysis result of "夫婦" is wrong.
(build version: mecab-ipadic-2.7.0-20070801-neologd-20190919)

echo "夫婦" | mecab -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd/
夫婦    名詞,固有名詞,一般,*,*,*,夫婦。,フウフ,フーフ
  1. The original form (原形) of "夫婦" should be 夫婦, not 夫婦。
  2. The noun type (品詞細分類1) of "夫婦" should be 一般, not 固有名詞

Goal

  1. Change the original form from 夫婦。 to 夫婦
  2. Change the noun type from 固有名詞 to 一般

Could you look into this issue?

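Specifying the output encoding

I use the dictionary in a Windows environment (C#, NMeCab), but because the output encoding is UTF8 it cannot be used there without some modification.

Building on Unix is fine for now, so it would help if the output encoding could be specified with an installer option.

Reference (my own blog): mecab-ipadic-neologdをNMeCab用にshift-jisでコンパイルした - 雲行きそらゆきココロイキ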

Make it possible to install mecab-ipadic-NEologd to a user directory without sudo privileges

Currently, I have to set the --asuser option to install mecab-ipadic-NEologd to a user directory without sudo privileges.
But I would like mecab-ipadic-NEologd to detect whether sudo privileges are required.

So I will implement the following features:

  • A process that compares the uid of the current user with the uid of the target directory
  • An assudo option
    • It is required when I want to install using sudoer privileges
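
The uid comparison described in the first bullet can be sketched as follows. The real installer is a shell script, so this Python version only illustrates the logic. The function name `needs_sudo` is hypothetical, and the fallback to the nearest existing parent directory is my own assumption about how a not-yet-created target should be handled. It relies on `os.getuid`, which is available on POSIX systems only.

```python
import os


def needs_sudo(target_dir):
    """Guess whether installing into target_dir requires another user's privileges.

    Compares the current uid with the owner of target_dir, or of its nearest
    existing parent if target_dir has not been created yet (POSIX only).
    """
    path = os.path.abspath(target_dir)
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return os.stat(path).st_uid != os.getuid()
```

For example, `needs_sudo(os.path.expanduser("~/mecab-dic"))` would normally be `False` for a regular user. A plain uid comparison ignores group write permission, so `os.access(path, os.W_OK)` might be a useful additional check.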

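Not really an issue, but...

I am very sorry to ask, but the documentation says that using this dictionary together with the existing MeCab dictionary is recommended. Could you tell me how to use both?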
