fugashi's Introduction

fugashi

fugashi by Irasutoya

fugashi is a Cython wrapper for MeCab, a Japanese tokenizer and morphological analysis tool. Wheels are provided for Linux, OSX (Intel), and Win64, and UniDic is easy to install.

Issues do not need to be written in English.

Check out the interactive demo, see the blog post for background on why fugashi exists and some of the design decisions, or see this guide for a basic introduction to Japanese tokenization.

If you are on a platform for which wheels are not provided, you'll need to install MeCab first. It's recommended you install from source. If you need to build from source on Windows, @chezou's fork is recommended; see issue #44 for an explanation of the problems with the official repo.

Known platforms without wheels:

  • musl-based distros like alpine #77
  • PowerPC
  • Windows 32bit

Usage

from fugashi import Tagger

tagger = Tagger('-Owakati')
text = "麩菓子は、麩を主材料とした日本の菓子。"
tagger.parse(text)
# => '麩 菓子 は 、 麩 を 主材 料 と し た 日本 の 菓子 。'
for word in tagger(text):
    print(word, word.feature.lemma, word.pos, sep='\t')
    # "feature" is the Unidic feature data as a named tuple

Installing a Dictionary

fugashi requires a dictionary. UniDic is recommended, and two easy-to-install versions are provided.

  • unidic-lite, a slightly modified version 2.1.2 of Unidic (from 2013) that's relatively small
  • unidic, the latest UniDic 3.1.0, which is 770MB on disk and requires a separate download step

If you just want to make sure things work you can start with unidic-lite, but for more serious processing unidic is recommended. For production use you'll generally want to generate your own dictionary too; for details see the MeCab documentation.

To get either of these dictionaries, you can install them directly using pip or do the below:

pip install 'fugashi[unidic-lite]'

# The full version of UniDic requires a separate download step
pip install 'fugashi[unidic]'
python -m unidic download

For more information on the different MeCab dictionaries available, see this article.
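Before constructing a Tagger, it can help to confirm which dictionary package is actually importable. The following is a minimal sketch, assuming the pip packages expose the module names `unidic` and `unidic_lite` (as the imports elsewhere on this page suggest). The helper name `available_dictionaries` is hypothetical and not part of fugashi.

```python
from importlib.util import find_spec


def available_dictionaries(candidates=("unidic", "unidic_lite")):
    """Return the candidate dictionary modules that can be imported.

    `candidates` are module names assumed from this page; this helper is
    illustrative and not part of fugashi.
    """
    return [name for name in candidates if find_spec(name) is not None]


if __name__ == "__main__":
    found = available_dictionaries()
    print("Importable dictionary packages:", found or "none")
```

Note that for the full unidic package, an importable module does not by itself show that `python -m unidic download` has been run.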

Dictionary Use

fugashi is written with the assumption you'll use Unidic to process Japanese, but it supports arbitrary dictionaries.

If you're using a dictionary besides Unidic you can use the GenericTagger like this:

from fugashi import GenericTagger
tagger = GenericTagger()

# parse can be used as normal
tagger.parse('something')
# features from the dictionary can be accessed by field numbers
for word in tagger(text):
    print(word.surface, word.feature[0])

You can also create a dictionary wrapper to get feature information as a named tuple.

from fugashi import GenericTagger, create_feature_wrapper
CustomFeatures = create_feature_wrapper('CustomFeatures', 'alpha beta gamma')
tagger = GenericTagger(wrapper=CustomFeatures)
for word in tagger.parseToNodeList(text):
    print(word.surface, word.feature.alpha)

Citation

If you use fugashi in research, it would be appreciated if you cite this paper. You can read it at the ACL Anthology or on Arxiv.

@inproceedings{mccann-2020-fugashi,
    title = "fugashi, a Tool for Tokenizing {J}apanese in Python",
    author = "McCann, Paul",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.7",
    pages = "44--51",
    abstract = "Recent years have seen an increase in the number of large-scale multilingual NLP projects. However, even in such projects, languages with special processing requirements are often excluded. One such language is Japanese. Japanese is written without spaces, tokenization is non-trivial, and while high quality open source tokenizers exist they can be hard to use and lack English documentation. This paper introduces fugashi, a MeCab wrapper for Python, and gives an introduction to tokenizing Japanese.",
}

Alternatives

If you have a problem with fugashi feel free to open an issue. However, there are some cases where it might be better to use a different library.

  • If you don't want to deal with installing MeCab at all, try SudachiPy.
  • If you need to work with Korean, try pymecab-ko or KoNLPy.

License and Copyright Notice

fugashi is released under the terms of the MIT license. Please copy it far and wide.

fugashi is a wrapper for MeCab, and fugashi wheels include MeCab binaries. MeCab is copyrighted free software by Taku Kudo <[email protected]> and Nippon Telegraph and Telephone Corporation, and is redistributed under the BSD License.

fugashi's People

Contributors

chezou, koichiyasuoka, lambdadog, nikitalita, odidev, polm, ronnypfannschmidt, tamuhey, teowenshen, yihong0618, zdyh

fugashi's Issues

Error when running with a lot of long sentences

Hi @polm,

Thanks for creating this efficient library for Japanese tokenization. 👏👏👏

I encountered an error when using fugashi with a list of 1000 long sentences (avg. 50-100 words per sentence). ⚠⚠⚠

------------------- ERROR DETAILS ------------------------
arguments: [b'fugashi', b'-C', b'-r', b'/opt/conda/lib/python3.7/site-packages/unidic/dicdir/mecabrc', b'-d', b'/opt/conda/lib/python3.7/site-packages/unidic/dicdir', b'-Owakati']
error message: viterbi.cpp(54) [connector_->open(param)] connector.cpp(24) [cmmap_->open(filename, mode)] cannot open: /opt/conda/lib/python3.7/site-packages/unidic/dicdir/matrix.bin 
Exception in thread Thread-58:
Traceback (most recent call last):
    tagger = Tagger('-Owakati')
  File "fugashi/fugashi.pyx", line 313, in fugashi.fugashi.Tagger.__init__
  File "fugashi/fugashi.pyx", line 220, in fugashi.fugashi.GenericTagger.__init__
RuntimeError: Failed initializing MeCab
------------------- ERROR DETAILS ------------------------
arguments: [b'fugashi', b'-C', b'-r', b'/opt/conda/lib/python3.7/site-packages/unidic/dicdir/mecabrc', b'-d', b'/opt/conda/lib/python3.7/site-packages/unidic/dicdir', b'-Owakati']
error message: viterbi.cpp(50) [tokenizer_->open(param)] tokenizer.cpp(105) [property_.open(param)] char_property.cpp(82) [cmmap_->open(filename, "r")]  

Could you advise me on how to solve this error? Thanks a lot. 🙏🙏🙏

Packages Version:

fugashi==1.1.0
mecab-python3==1.0.3
unidic==1.0.3
unidic-lite==1.0.8
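
The traceback shows a Tagger being constructed inside a worker thread. One possible cause, offered as an assumption and not as a diagnosis, is that creating a new Tagger for every sentence or thread exhausts the files the process can open. Below is a minimal sketch of reusing one tagger per thread. `get_tagger`, `tokenize_all` and the `factory` parameter are hypothetical names; in real use the factory would be something like `lambda: Tagger('-Owakati')`.

```python
import threading

# Per-thread storage, so each thread builds its tagger at most once.
_local = threading.local()


def get_tagger(factory):
    """Return this thread's tagger, creating it with `factory` on first use."""
    tagger = getattr(_local, "tagger", None)
    if tagger is None:
        tagger = factory()
        _local.tagger = tagger
    return tagger


def tokenize_all(sentences, factory):
    """Tokenize many sentences while building the tagger only once per thread."""
    tagger = get_tagger(factory)
    return [tagger.parse(sentence) for sentence in sentences]
```

The factory is passed in so the sketch does not depend on fugashi being installed; any object with a `parse` method works.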

Offer bundled MeCab

Some users have difficulty installing and configuring MeCab, so it would be good to provide wheels with it bundled. The way mecab-python3 did this may be a useful reference.

On the other hand, this shouldn't be done by default since it requires including a dictionary, which makes the install very large.

method for preserving half-width spaces?

Not sure if this is a MeCab thing or a Unidic thing, but full-width spaces are properly output while half-width spaces are simply swallowed:

>>> from fugashi import Tagger
>>> TAGGER = Tagger("-Owakati")
>>> TAGGER("ハロー　ジャパン")  # full-width space (U+3000)
[ハロー,  , ジャパン]
>>> TAGGER("ハロー ジャパン")  # half-width space (U+0020)
[ハロー, ジャパン]

Do you know of any way to prevent this? Losing characters in the output means having to do extra processing to match input text spans against output text tokens.
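
One way to cope with swallowed spaces, sketched here with the standard library only, is to recover each token's character span by searching for its surface from the end of the previous token. The function name `align_tokens` is hypothetical, and the sketch assumes every token surface appears verbatim in the input, in order.

```python
def align_tokens(text, surfaces):
    """Map token surfaces back to (start, end) character spans in `text`.

    Characters skipped between spans (e.g. half-width spaces) are left
    uncovered, so they can be recovered from the gaps.
    """
    spans = []
    cursor = 0
    for surface in surfaces:
        start = text.find(surface, cursor)
        if start == -1:
            raise ValueError(f"token {surface!r} not found after offset {cursor}")
        end = start + len(surface)
        spans.append((start, end))
        cursor = end
    return spans


spans = align_tokens("ハロー ジャパン", ["ハロー", "ジャパン"])
print(spans)  # [(0, 3), (4, 8)]
```

With fugashi, the surfaces would come from something like `[word.surface for word in tagger(text)]`.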

Wheel package not available

When I try to install fugashi, fugashi-lite (or even the other pip MeCab wrapper), I get an error saying the wheel package is not available. I've installed Microsoft Visual Studio 2019, updated wheel with pip, and even downloaded the dictionary first.

 ERROR: Command errored out with exit status 1:
  command: 'c:\users\X\appdata\local\programs\python\python38-32\python.exe' -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\X\\AppData\\Local\\Temp\\pip-install-vzjexnbf\\fugashi_005a8225e52444168e8f5bbcb78d0b6a\\setup.py'"'"'; __file__='"'"'C:\\Users\\X\\AppData\\Local\\Temp\\pip-install-vzjexnbf\\fugashi_005a8225e52444168e8f5bbcb78d0b6a\\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\X\AppData\Local\Temp\pip-record-gw5a9tiu\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\X\appdata\local\programs\python\python38-32\Include\fugashi'
      cwd: C:\Users\X\AppData\Local\Temp\pip-install-vzjexnbf\fugashi_005a8225e52444168e8f5bbcb78d0b6a\
 Complete output (20 lines):
 WARNING: The wheel package is not available.
 WARNING: The wheel package is not available.
 WARNING: The wheel package is not available.
 running install
 running build
 running build_py
 creating build\lib.win32-3.8
 creating build\lib.win32-3.8\fugashi
 copying fugashi\cli.py -> build\lib.win32-3.8\fugashi
 copying fugashi\__init__.py -> build\lib.win32-3.8\fugashi
 running build_ext
 cythoning fugashi/fugashi.pyx to fugashi\fugashi.c
 building 'fugashi.fugashi' extension
 creating build\temp.win32-3.8
 creating build\temp.win32-3.8\Release
 creating build\temp.win32-3.8\Release\fugashi
 C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.28.29910\bin\HostX86\x86\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\mecab -Ic:\users\X\appdata\local\programs\python\python38-32\include -Ic:\users\X\appdata\local\programs\python\python38-32\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.28.29910\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" /Tcfugashi\fugashi.c /Fobuild\temp.win32-3.8\Release\fugashi\fugashi.obj
 fugashi.c
 fugashi\fugashi.c(610): fatal error C1083: Cannot open include file: 'mecab.h': No such file or directory
 error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\BuildTools\\VC\\Tools\\MSVC\\14.28.29910\\bin\\HostX86\\x86\\cl.exe' failed with exit status 2

Error: Failed initializing MeCab

RuntimeError:
Failed initializing MeCab. Please see the README for possible solutions:

https://github.com/polm/fugashi

If you are still having trouble, please file an issue here, and include the
ERROR DETAILS below:

https://github.com/polm/fugashi/issues

issueを英語で書く必要はありません。

------------------- ERROR DETAILS ------------------------
arguments: [b'fugashi', b'-C', b'-r', b'/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/unidic_lite/dicdir/mecabrc', b'-d', b'/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/unidic_lite/dicdir']
viterbi.cpp(50) [tokenizer_->open(param)] tokenizer.cpp(109) [sysdic->open (create_filename(prefix, SYS_DIC_FILE).c_str())] dictionary.cpp(79) [dmmap_->open(file, mode)] no such file or directory: /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-p

Update Installation Instructions for Apple Silicon MacOS?

I am installing fugashi on an M1 MacBook Air, and I had to install MeCab manually through Homebrew before I was able to install fugashi. The error was "fugashi/fugashi.c:618:10: fatal error: 'mecab.h' file not found".

My suggestion would be to update the instructions to say that M1 MacBook users should install MeCab manually first. I am happy to work on a small PR to update the instructions, but wanted to check with you first whether that is necessary or whether you have other approaches in mind.

Failed initializing MeCab.

I want to use "fugashi" to tokenize Japanese text. I have already installed "fugashi", "unidic-lite" and "unidic" successfully. This is what I have written:

import fugashi
tagger = fugashi.Tagger('-Owakati')
words = [word.surface for word in tagger(text)]

I have also tried this one:

import fugashi
tagger = fugashi.Tagger()
words = [word.surface for word in tagger(text)]

This is the error I got:

Failed initializing MeCab. Please see the README for possible solutions:

    https://github.com/polm/fugashi

If you are still having trouble, please file an issue here, and include the
ERROR DETAILS below:

    https://github.com/polm/fugashi/issues

issueを英語で書く必要はありません。

------------------- ERROR DETAILS ------------------------
arguments: [b'fugashi', b'-C', b'-r', b'/Users/PycharmProjects/pythonProject/venv/lib/python3.9/site-packages/unidic/dicdir/mecabrc', b'-d', b'/Users/PycharmProjects/pythonProject/venv/lib/python3.9/site-packages/unidic/dicdir', b'-Owakati']
error message: param.cpp(69) [ifs] no such file or directory: /Usersi/PycharmProjects/pythonProject/venv/lib/python3.9/site-packages/unidic/dicdir/mecabrc
Traceback (most recent call last):
  File "/Users/PycharmProjects/pythonProject/Python file.py", line 4, in <module>
    tagger = fugashi.Tagger('-Owakati')
  File "fugashi/fugashi.pyx", line 313, in fugashi.fugashi.Tagger.__init__
  File "fugashi/fugashi.pyx", line 220, in fugashi.fugashi.GenericTagger.__init__
RuntimeError: Failed initializing MeCab
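
The error details name the dictionary directory passed with `-d` and the `mecabrc` passed with `-r`. A quick way to narrow this down is to check that the files MeCab tries to open actually exist. This is a minimal sketch, assuming only the file names seen in error messages and output on this page (`mecabrc`, `sys.dic`, `matrix.bin`). The helper `missing_dictionary_files` is hypothetical.

```python
from pathlib import Path

# File names taken from error messages on this page; a real dictionary
# directory may contain more files than these.
EXPECTED_FILES = ("mecabrc", "sys.dic", "matrix.bin")


def missing_dictionary_files(dicdir, expected=EXPECTED_FILES):
    """Return the expected dictionary files that are absent from `dicdir`."""
    base = Path(dicdir)
    return [name for name in expected if not (base / name).is_file()]
```

For the full unidic package, a non-empty result from `missing_dictionary_files(unidic.DICDIR)` would suggest that `python -m unidic download` has not completed.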

Supporting N-best Paths

Hi,

Due to my use case which parses a lot of slangy, colloquial Japanese, the resulting best path often isn't the most sensible path.

Hence, I need to extract N-best paths from the decoding lattice in order to have other paths to back off to in case my algorithm deems that the 1best isn't making sense.

From what I noticed, fugashi's only support for nbest currently returns a raw string. So I have edited mecab.pxd and fugashi.py to return nbest results using the same namedtuples as 1best. Although I am using "deprecated" interfaces from MeCab, so far my add-on seems to be working.

My questions are as follows:

  1. Will there be (or is there) official nbest support for fugashi in the future? If yes, then I will keep tabs on the project.
  2. If you don't mind me using the deprecated methods of MeCab, I would be interested in contributing to the project too. What kind of interface/design do you have in mind for the method?

Korean Support

It'd be nice to support Korean. A simple way to do this would be to subclass the tagger with a KoreanTagger and overwrite the field names, or allow fields to be passed in at creation time.

The tagspec for mecab-ko-dict is here. 2.0 seems to be the most recent one so I guess it makes sense to support that.

Field names and meaning based on Google translate:

Original | English
품사 태그 | part of speech tag
의미 부류 | meaning type
종성 유무 | patchim presence (T or F)
읽기 | reading (pronunciation, for hanja?)
타입 | type (*/Inflected/Compound/Preanalysis)
첫번째 품사 | first pos (for compounds?)
마지막 품사 | last pos
표현 | notes(?) (seems to specify composition of compounds, uses / as delimiter)

For Korean a fork of MeCab is used; it looks like one difference is how whitespace is handled. Not sure if fugashi will just work with it, but since natto-py seems to work there should be a way to support it.

Wheel for Windows?

It's hard to build fugashi on Windows because:

  • long_description in setup.py doesn't have encoding so that parsing setup.py fails
  • lacking include_dirs for Extension prevents finding mecab.h
  • Unable to find mecab.dll without having library_dirs for Extension
    etc etc...

When I built mecab python binding wheel for Windows I included everything (dll, header) in a wheel, not sure it is a good way though. https://github.com/chezou/mecab/blob/master/mecab/python/setup.py#L23-L34

It would be appreciated if you could provide a Windows wheel, which would reduce compilation issues.

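DLL load failed: 指定されたモジュールが見つかりません。 (The specified module could not be found.) [windows]

I downloaded the Google Colab code from this article and am trying to run it in Jupyter Notebook.
https://qiita.com/sonoisa/items/1df94d0a98cd4f209051
(I confirmed that it runs without problems on Google Colab.)

I installed with the following commands.
pip install fugashi[unidic]
python -m unidic download

However, the error below occurs and fugashi appears to be unusable.

My environment is as follows.
Windows 10 Pro
python 3.7.8
fugashi 1.1.1

Could you tell me what I should do?

Error details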

ImportError                               Traceback (most recent call last)
<ipython-input-2-5dae7eb78200> in <module>
----> 1 model = SentenceBertJapanese("sonoisa/sentence-bert-base-ja-mean-tokens")

<ipython-input-1-1cb8424fffdf> in __init__(self, model_name_or_path, device)
      4 class SentenceBertJapanese:
      5     def __init__(self, model_name_or_path, device=None):
----> 6         self.tokenizer = BertJapaneseTokenizer.from_pretrained(model_name_or_path)
      7         self.model = BertModel.from_pretrained(model_name_or_path)
      8         self.model.eval()

~\AppData\Roaming\Python\Python37\site-packages\transformers\tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1718 
   1719         return cls._from_pretrained(
-> 1720             resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
   1721         )
   1722 

~\AppData\Roaming\Python\Python37\site-packages\transformers\tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs)
   1789         # Instantiate tokenizer.
   1790         try:
-> 1791             tokenizer = cls(*init_inputs, **init_kwargs)
   1792         except OSError:
   1793             raise OSError(

~\AppData\Roaming\Python\Python37\site-packages\transformers\models\bert_japanese\tokenization_bert_japanese.py in __init__(self, vocab_file, do_lower_case, do_word_tokenize, do_subword_tokenize, word_tokenizer_type, subword_tokenizer_type, never_split, unk_token, sep_token, pad_token, cls_token, mask_token, mecab_kwargs, **kwargs)
    150             elif word_tokenizer_type == "mecab":
    151                 self.word_tokenizer = MecabTokenizer(
--> 152                     do_lower_case=do_lower_case, never_split=never_split, **(mecab_kwargs or {})
    153                 )
    154             else:

~\AppData\Roaming\Python\Python37\site-packages\transformers\models\bert_japanese\tokenization_bert_japanese.py in __init__(self, do_lower_case, never_split, normalize_text, mecab_dic, mecab_option)
    229 
    230         try:
--> 231             import fugashi
    232         except ModuleNotFoundError as error:
    233             raise error.__class__(

~\AppData\Roaming\Python\Python37\site-packages\fugashi\__init__.py in <module>
----> 1 from .fugashi import *
      2 

ImportError: DLL load failed: 指定されたモジュールが見つかりません。

The kernel dies when parsing text with a user-defined dictionary

I have just created a user-defined dictionary with fugashi-build-dict. I created a tagger using that dictionary as follows:

from fugashi import Tagger
import unidic_lite

tagger = Tagger(r"-u /path to usrdic /user.dic")

When I then run the code below, the kernel dies. For this test I have registered 「薬価改定」 in the user dictionary.

text = "薬価改定は、麩を主材料とした日本の菓子。"
tagger.parse(text)
for word in tagger(text):
    print(word, word.feature.lemma, word.pos, sep='\t')
    # "feature" is the Unidic feature data as a named tuple

Incidentally, when I use unidic_lite instead of the user dictionary, the following code runs fine.

text = "麩菓子は、麩を主材料とした日本の菓子。"
tagger.parse(text)
for word in tagger(text):
    print(word, word.feature.lemma, word.pos, sep='\t')

I am doing the above in Jupyter Notebook on my company's RStudio Server.
If there is a solution, I would appreciate it if you could let me know. Thank you.

Couldn't create Tagger error

/home/martin/nlp/my-env/spacy/bin/python /home/martin/nlp/spaCy/examples/ja_seg.py
Traceback (most recent call last):
  File "/home/martin/nlp/spaCy/examples/ja_seg.py", line 2, in <module>
    nlp = Japanese()  # use directly
  File "/home/martin/nlp/spaCy/spacy/language.py", line 173, in __init__
    make_doc = factory(self, **meta.get("tokenizer", {}))
  File "/home/martin/nlp/spaCy/spacy/lang/ja/__init__.py", line 111, in create_tokenizer
    return JapaneseTokenizer(cls, nlp)
  File "/home/martin/nlp/spaCy/spacy/lang/ja/__init__.py", line 84, in __init__
    self.tokenizer = try_fugashi_import().Tagger()
  File "fugashi/fugashi.pyx", line 229, in fugashi.Tagger.__init__
  File "fugashi/fugashi.pyx", line 176, in fugashi.GenericTagger.__init__
RuntimeError: Couldn't create Tagger. Maybe your arguments are invalid?

Hi, I installed fugashi from source, with Python 3.7. I am running it on Ubuntu 18 but got the above error message. Testing code:

from spacy.lang.ja import Japanese
nlp = Japanese()  # use directly
doc = nlp("りんごが大好きです。")
for token in doc:
    print(token.text, token.tag_)

tokenization corner case

Hello,

I have encountered a corner case with MeCab and was hoping you could spend one or two sentences advising how to proceed -- whether to say "this is a known issue and it's hopeless" or "you are using the wrong dictionary" etc.

The following examples are with the full version of unidic-2.3.0

>>> t = fugashi.Tagger('-Owakati')
>>> t.parse('スムースストレッチコットンクルーネックT').split()
['スムースストレッチコットンクルーネック T']
>>> t.parse('スムースコットンフレンチスリーブロングワンピース').split()
['スムースコットンフレンチスリーブロングワンピース']
>>> katsu.romaji('スムースストレッチコットンクルーネックT')
'Sumuususutoretchikottonkuruunekku T'

And yet, sub-phrases seem to parse OK:

>>> t.parse('スムースストレッチ').split()
['スムース', 'ストレッチ']
>>> t.parse('コットンクルーネック').split()
['コットン', 'クルー', 'ネック']

Thank you!

Backslashes in the BERT tokenizer

While using BERT with Anaconda on Windows 10, the tokenizer appears to have been switched to fugashi, and I got the error below.

from sentence_transformers import SentenceTransformer
from sentence_transformers import models
transformer = models.BERT('cl-tohoku/bert-base-japanese-whole-word-masking')

------------------- ERROR DETAILS ------------------------
arguments: [b'fugashi', b'-C', b'-d', b'C:UsersnwAnaconda3envsPyTorchCUDA10_1libsite-packagesipadicdicdir', b'-r', b'C:UsersnwAnaconda3envsPyTorchCUDA10_1libsite-packagesipadicdicdirmecabrc']
error message: param.cpp(69) [ifs] no such file or directory: C:UsersnwAnaconda3envsPyTorchCUDA10_1libsite-packagesipadicdicdirmecabrc
----------------------------------------------------------
RuntimeError: Failed initializing MeCab

To work around this, I added a replace call at line 252 of tokenization_bert_japanese.py in the transformers package:
mecabrc = os.path.join(dic_dir, "mecabrc")
mecab_option = "-d {} -r {} ".format(dic_dir, mecabrc) + mecab_option
mecab_option = mecab_option.replace('\\', '/')
This seems to have resolved the problem.
Is this the right fix?
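
An alternative to a string replace, shown as a sketch and not as the fix adopted by transformers, is to convert the paths to forward slashes before building the option string. `build_mecab_option` is a hypothetical helper and the example path is made up for illustration; `PureWindowsPath.as_posix()` is from the standard library.

```python
from pathlib import PureWindowsPath


def build_mecab_option(dic_dir, extra=""):
    """Build a '-d ... -r ...' option string using forward slashes only."""
    dic = PureWindowsPath(dic_dir)
    mecabrc = dic / "mecabrc"
    option = f"-d {dic.as_posix()} -r {mecabrc.as_posix()}"
    return f"{option} {extra}".strip()


# Hypothetical path, used only to show the conversion.
print(build_mecab_option(r"C:\Users\nw\ipadic\dicdir"))
```

This sketch does not quote the paths, so a directory containing spaces would still need extra handling.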

posid is 1 for all tokens with Unidic

A Tagger instantiated with the default parameters (using Unidic) gives tokens a posid of 1, no matter what the part of speech actually is.

import fugashi

tagger = fugashi.Tagger()
test_string = "これから会議があります。"

print(tagger.dictionary_info[0]['filename'])
for token in tagger(test_string):
    print(token.surface + ": " + str(token.posid))

gives the output:

C:\Users\[omitted]\AppData\Local\Programs\Python\Python38\lib\site-packages\unidic\dicdir\sys.dic
これ: 1
から: 1
会議: 1
が: 1
あり: 1
ます: 1
。: 1

I tried instantiating a GenericTagger with Unidic, and it had the same problem.

import unidic
uni_generic_tagger = fugashi.GenericTagger(f'-d "{unidic.DICDIR}"')

However, a GenericTagger instantiated in the same way with IPAdic (from the ipadic PyPI package) does provide proper posids, so maybe this is an issue with Unidic.

Cygwin64 support

Hi, I've just tried to pip3.7 install fugashi on Cygwin64 with my mecab-cygwin64, but failed with the error message below:

  gcc -shared -Wl,--enable-auto-image-base build/temp.cygwin-3.0.7-x86_64-3.7/fugashi/fugashi.o -L/usr/lib/python3.7/config -L/usr/lib -lmecab -lpython3.7m -o build/lib.cygwin-3.0.7-x86_64-3.7/fugashi.cpython-37m-x86_64-cygwin.dll
  /usr/lib/gcc/x86_64-pc-cygwin/7.4.0/../../../../x86_64-pc-cygwin/bin/ld: cannot find -lmecab

In my Cygwin64, libmecab.a and libmecab.la are in /usr/local/lib and mecab-config --libs-only-L returns /usr/local/lib, but the fugashi installer does not recognize it. How do I add /usr/local/lib to the library path?

failed to import in python 3.6

When I import fugashi, it will raise an error like this:

from fugashi import GenericTagger
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "fugashi/fugashi.pyx", line 9, in init fugashi
TypeError: namedtuple() got an unexpected keyword argument 'defaults'

My python version is 3.6.7, and fugashi version is 0.1.8.

The error occurs because namedtuple doesn't have a 'defaults' argument in Python 3.6. The 'defaults' argument was added in Python 3.7.

I searched Stack Overflow, and a possible fix looks like this:

from collections import namedtuple
Node = namedtuple('Node', 'val left right')
Node.__new__.__defaults__ = (None,) * len(Node._fields)

Or you could make fugashi require Python >=3.7.
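
The workaround above can also be wrapped so that one code path works on both versions. This is a sketch with the hypothetical helper name `namedtuple_with_defaults`; it uses the `defaults` keyword where available and falls back to setting `__new__.__defaults__` otherwise.

```python
import sys
from collections import namedtuple


def namedtuple_with_defaults(name, fields, default=None):
    """Create a namedtuple whose fields all default to `default`."""
    if isinstance(fields, str):
        field_names = fields.replace(",", " ").split()
    else:
        field_names = list(fields)
    defaults = (default,) * len(field_names)
    if sys.version_info >= (3, 7):
        return namedtuple(name, field_names, defaults=defaults)
    # Python 3.6: the `defaults` keyword does not exist yet.
    cls = namedtuple(name, field_names)
    cls.__new__.__defaults__ = defaults
    return cls


Node = namedtuple_with_defaults("Node", "val left right")
print(Node(1))  # Node(val=1, left=None, right=None)
```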

Add M1 / OSX arm64 wheels

It's not clear how complicated this is.

It might be as simple as using cibuildwheel to cross-compile the wheel, which means it could be done right away.

However, it might be the case that this only handles things the wheel builds directly, and won't take care of the build artifacts of MeCab itself. In that case it might require tweaking the MeCab build for cross compilation. In the worst case it would require an OSX arm64 environment to build MeCab directly.

I have tried the cibuildwheel solution and test wheels are available via pip install fugashi==1.1.2a6. If someone confirms they work I can do a release.

Cache in Windows Wheel build has issues

Windows wheel builds are failing because curl finds the file it's downloading already exists and prompts for user input on whether to overwrite it. There's not a clear flag to force overwriting or not overwriting the existing file.

I tried various methods to get around the prompt for input, such as deleting the file before downloading it or using a redirect, but they failed. Some of that was because it's bash on Windows and I'm not very familiar with how that works.

@chezou Sorry, but could you take a look at this when you get a chance? I will take another look at it later but I'm kind of stumped right now.

One issue is it looks like the cache is restored successfully, but the download isn't skipped. Not sure why that would be the case...

Redistribution of Mecab requires some conditions and they are not met

According to https://github.com/taku910/mecab/blob/master/mecab/COPYING,

MeCab is copyrighted free software by Taku Kudo [email protected] and
Nippon Telegraph and Telephone Corporation, and is released under
any of the GPL (see the file GPL), the LGPL (see the file LGPL), or the
BSD License (see the file BSD).

In the case of BSD, https://github.com/taku910/mecab/blob/master/mecab/BSD says

  • Redistributions in binary form must reproduce the above
    copyright notice, this list of conditions and the
    following disclaimer in the documentation and/or other
    materials provided with the distribution.

I think this applies to fugashi, which redistributes libmecab (a binary file).

Disclaimer: I'm not a lawyer and this is not legal advice.

DLL Load Failed on import

I am getting an error after uninstalling and re-installing fugashi.

The error is as follows:
image

I tried to uninstall and re-install (to upgrade) as follows:

pip uninstall fugashi
pip install fugashi[unidic]
pip -m unidic download

I did this both inside an Anaconda environment and the global Python environment on my Windows 10 machine. Any idea what might be causing this or how to fix it?

[Question] About alternatives to tokenizations

Thanks for your post about how to tokenize Japanese.
Currently my solution is to use the ICU tokenizer with a word break iterator and a customized locale, as shown here:

code
Repl.it
My question is whether this approach always produces the same results as fugashi for Japanese text.

Thank you!

Unable to install (Windows x64, Python 3.10)

In general, this should be an issue of not having pre-built wheels for Python 3.10.
mecab.h and libmecab.lib cannot be found, causing building errors.

Eventually I noticed that I have to download mecab-msvc-x64.zip [1] and extract it to the very specific location recorded in fugashi_util.py [2] in order to build the wheel and install successfully.
(ofc before this, have Build Tools and Windows SDK installed)

What I don't understand is why I have to use this specific (forked) version of MeCab instead of the installer from the official website (mainly because the installed version has subfolders bin and sdk that store the dll, h and lib files separately). And the hardcoded path does look strange.

fugashi-1.1.1-cp310-cp310-win_amd64.zip

idk if this should be an issue, sorry for any inconvenience

Footnotes

  1. https://github.com/polm/fugashi/blob/9ba7b3013680e359aadfe57c2213c7df040d13ca/.github/workflows/windows.yml#L49

  2. https://github.com/polm/fugashi/blob/9ba7b3013680e359aadfe57c2213c7df040d13ca/fugashi_util.py#L18

Invalid Tagger args are silently ignored

If you do something like fugashi.Tagger("d /asdf") (no hyphen), the invalid arguments will be ignored and you'll get a working tagger. That's weird and unhelpful. It seems to be the way the MeCab API works, but the actual MeCab command line behaves reasonably (gives an error), so there should be a way to modify the behavior.
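
Until the underlying behavior changes, a caller could run a rough sanity check on the argument string before passing it to the Tagger. The sketch below is a heuristic only and is not how MeCab itself parses options: it rejects any bare value that does not directly follow a hyphenated option. The function name `looks_like_mecab_args` is hypothetical.

```python
import shlex


def looks_like_mecab_args(arg_string):
    """Heuristically check that bare values only follow a hyphenated option."""
    previous_was_option = False
    for token in shlex.split(arg_string):
        is_option = token.startswith("-")
        if not is_option and not previous_was_option:
            # A value with no option in front of it, e.g. "d /asdf".
            return False
        previous_was_option = is_option
    return True


print(looks_like_mecab_args("d /asdf"))   # False
print(looks_like_mecab_args("-d /asdf"))  # True
```

The check cannot tell whether an option actually takes a value, so something like `-C stray` still passes. Because `shlex.split` treats backslashes as escapes, Windows paths would need quoting or forward slashes.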

Wheel for macOS requires Mecab to run

https://github.com/polm/fugashi/releases/tag/v0.2.0 said

it's possible to install fugashi without MeCab

I confirmed that it is possible to install fugashi and run without Mecab on Linux (Ubuntu 18.04).
However, I was not able to run fugashi without Mecab on macOS, though it is possible to install fugashi without it.
Is this by design or an unexpected behaviour?

Without Mecab

(venv) fugashi $ mecab --version
-bash: mecab: command not found
(venv) fugashi $ pip list | grep fugashi
fugashi    0.2.2
(venv) fugashi $ python
Python 3.7.7 (default, Mar 10 2020, 15:43:33) 
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import fugashi
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: dlopen(/Users/hiromu/workspace/fugashi/venv/lib/python3.7/site-packages/fugashi.cpython-37m-darwin.so, 2): Library not loaded: /usr/local/lib/libmecab.2.dylib
  Referenced from: /Users/hiromu/workspace/fugashi/venv/lib/python3.7/site-packages/fugashi.cpython-37m-darwin.so
  Reason: image not found

With Mecab

(venv) fugashi $ brew install mecab
==> Downloading https://homebrew.bintray.com/bottles/mecab-0.996.catalina.bottle.3.tar.gz
Already downloaded: /Users/hiromu/Library/Caches/Homebrew/downloads/152cd5889822b8a9cf8247aab5a4b333f1206e1bf56aa3217c9e5b6928318258--mecab-0.996.catalina.bottle.3.tar.gz
==> Pouring mecab-0.996.catalina.bottle.3.tar.gz
🍺  /usr/local/Cellar/mecab/0.996: 20 files, 4.2MB
(venv) fugashi $ mecab --version
mecab of 0.996

(venv) fugashi $ pip list | grep fugashi
fugashi    0.2.2
(venv) fugashi $ python
Python 3.7.7 (default, Mar 10 2020, 15:43:33) 
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import fugashi
>>> fugashi.__file__
'/Users/hiromu/workspace/fugashi/venv/lib/python3.7/site-packages/fugashi.cpython-37m-darwin.so'

OS: macOS (10.15.4)
Python: 3.7.7
fugashi: 0.2.2

Clean up setup.py

Hi,

this might be the same as #42 and #44, but the platform is different. I'm unable to load the .so file through fugashi when running from the base Docker image for Python 3.10. Minimal breaking example below:

Dockerfile:

FROM python:3.10.1-bullseye
RUN pip install -vvv fugashi
RUN echo "import fugashi" | python

Output of running docker build 2>&1 ..

If you know a quick workaround for this, I'd be happy to use that for now. I'd be happy to help if you need a hand with this, as well.

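Using a user-defined dictionary from fugashi

I am currently using fugashi for morphological analysis, and I now need to use a user-defined dictionary, as is possible with MeCab. I did a quick search but could not find answers to the following:
(1) Can user-defined dictionaries be used at all?
(2) If so, how do I build the dictionary, and how do I specify it in code?

I would appreciate an answer.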

Can't be built on macOS

fugashi fails to build on macOS, so it cannot be installed.

$ pip install fugashi
Collecting fugashi
  Using cached https://files.pythonhosted.org/packages/2e/88/156c51c78ee4ccfd54000e720f0c9814d073993b4e1f5d400d01416ddb6d/fugashi-0.1.10.tar.gz
Requirement already satisfied: Cython in /Users/hiromu/miniconda3/envs/fonduer-dev/lib/python3.7/site-packages (from fugashi) (0.29.16)
Building wheels for collected packages: fugashi
  Building wheel for fugashi (setup.py) ... error
  ERROR: Complete output from command /Users/hiromu/miniconda3/envs/fonduer-dev/bin/python -u -c 'import setuptools, tokenize;__file__='"'"'/private/var/folders/6j/mctnhv6n2zx4zf8c657hbr2w0000gn/T/pip-install-adei_m45/fugashi/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /private/var/folders/6j/mctnhv6n2zx4zf8c657hbr2w0000gn/T/pip-wheel-7supwlz0 --python-tag cp37:
  ERROR: running bdist_wheel
  running build
  running build_ext
  cythoning fugashi/fugashi.pyx to fugashi/fugashi.c
  /Users/hiromu/miniconda3/envs/fonduer-dev/lib/python3.7/site-packages/Cython/Compiler/Main.py:369: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: /private/var/folders/6j/mctnhv6n2zx4zf8c657hbr2w0000gn/T/pip-install-adei_m45/fugashi/fugashi/fugashi.pyx
    tree = Parsing.p_module(s, pxd, full_module_name)
  building 'fugashi' extension
  creating build/temp.macosx-10.7-x86_64-3.7
  creating build/temp.macosx-10.7-x86_64-3.7/fugashi
  gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/hiromu/miniconda3/envs/fonduer-dev/include -arch x86_64 -I/Users/hiromu/miniconda3/envs/fonduer-dev/include -arch x86_64 -I/usr/local/Cellar/mecab/0.996/include -I/Users/hiromu/miniconda3/envs/fonduer-dev/include/python3.7m -c fugashi/fugashi.c -o build/temp.macosx-10.7-x86_64-3.7/fugashi/fugashi.o
  In file included from fugashi/fugashi.c:598:
  /usr/local/Cellar/mecab/0.996/include/mecab.h:380:47: warning: this function declaration is not a prototype [-Wstrict-prototypes]
    MECAB_DLL_EXTERN const char*   mecab_version();
                                                ^
                                                 void
  /usr/local/Cellar/mecab/0.996/include/mecab.h:520:54: warning: this function declaration is not a prototype [-Wstrict-prototypes]
    MECAB_DLL_EXTERN mecab_lattice_t *mecab_lattice_new();
                                                       ^
                                                        void
  fugashi/fugashi.c:4828:13: warning: assigning to 'char *' from 'const char *' discards qualifiers [-Wincompatible-pointer-types-discards-qualifiers]
    __pyx_t_5 = mecab_nbest_sparse_tostr(__pyx_v_self->c_tagger, __pyx_t_3, __pyx_t_4);
              ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  fugashi/fugashi.c:9121:26: warning: code will never be executed [-Wunreachable-code]
                  module = PyImport_ImportModuleLevelObject(
                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  4 warnings generated.
  creating build/lib.macosx-10.7-x86_64-3.7
  gcc -bundle -undefined dynamic_lookup -L/Users/hiromu/miniconda3/envs/fonduer-dev/lib -arch x86_64 -L/Users/hiromu/miniconda3/envs/fonduer-dev/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.7-x86_64-3.7/fugashi/fugashi.o -L/usr/local/Cellar/mecab/0.996/lib -lmecab -lstdc++ -o build/lib.macosx-10.7-x86_64-3.7/fugashi.cpython-37m-darwin.so
  clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
  ld: library not found for -lstdc++
  clang: error: linker command failed with exit code 1 (use -v to see invocation)
  error: command 'gcc' failed with exit status 1
  ----------------------------------------
  ERROR: Failed building wheel for fugashi
  Running setup.py clean for fugashi
Failed to build fugashi
Installing collected packages: fugashi
  Running setup.py install for fugashi ... error
    ERROR: Complete output from command /Users/hiromu/miniconda3/envs/fonduer-dev/bin/python -u -c 'import setuptools, tokenize;__file__='"'"'/private/var/folders/6j/mctnhv6n2zx4zf8c657hbr2w0000gn/T/pip-install-adei_m45/fugashi/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/6j/mctnhv6n2zx4zf8c657hbr2w0000gn/T/pip-record-y_d6t5au/install-record.txt --single-version-externally-managed --compile:
    ERROR: running install
    running build
    running build_ext
    skipping 'fugashi/fugashi.c' Cython extension (up-to-date)
    building 'fugashi' extension
    creating build/temp.macosx-10.7-x86_64-3.7
    creating build/temp.macosx-10.7-x86_64-3.7/fugashi
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/hiromu/miniconda3/envs/fonduer-dev/include -arch x86_64 -I/Users/hiromu/miniconda3/envs/fonduer-dev/include -arch x86_64 -I/usr/local/Cellar/mecab/0.996/include -I/Users/hiromu/miniconda3/envs/fonduer-dev/include/python3.7m -c fugashi/fugashi.c -o build/temp.macosx-10.7-x86_64-3.7/fugashi/fugashi.o
    In file included from fugashi/fugashi.c:598:
    /usr/local/Cellar/mecab/0.996/include/mecab.h:380:47: warning: this function declaration is not a prototype [-Wstrict-prototypes]
      MECAB_DLL_EXTERN const char*   mecab_version();
                                                  ^
                                                   void
    /usr/local/Cellar/mecab/0.996/include/mecab.h:520:54: warning: this function declaration is not a prototype [-Wstrict-prototypes]
      MECAB_DLL_EXTERN mecab_lattice_t *mecab_lattice_new();
                                                         ^
                                                          void
    fugashi/fugashi.c:4828:13: warning: assigning to 'char *' from 'const char *' discards qualifiers [-Wincompatible-pointer-types-discards-qualifiers]
      __pyx_t_5 = mecab_nbest_sparse_tostr(__pyx_v_self->c_tagger, __pyx_t_3, __pyx_t_4);
                ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    fugashi/fugashi.c:9121:26: warning: code will never be executed [-Wunreachable-code]
                    module = PyImport_ImportModuleLevelObject(
                             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    4 warnings generated.
    creating build/lib.macosx-10.7-x86_64-3.7
    gcc -bundle -undefined dynamic_lookup -L/Users/hiromu/miniconda3/envs/fonduer-dev/lib -arch x86_64 -L/Users/hiromu/miniconda3/envs/fonduer-dev/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.7-x86_64-3.7/fugashi/fugashi.o -L/usr/local/Cellar/mecab/0.996/lib -lmecab -lstdc++ -o build/lib.macosx-10.7-x86_64-3.7/fugashi.cpython-37m-darwin.so
    clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
    ld: library not found for -lstdc++
    clang: error: linker command failed with exit code 1 (use -v to see invocation)
    error: command 'gcc' failed with exit status 1
    ----------------------------------------
ERROR: Command "/Users/hiromu/miniconda3/envs/fonduer-dev/bin/python -u -c 'import setuptools, tokenize;__file__='"'"'/private/var/folders/6j/mctnhv6n2zx4zf8c657hbr2w0000gn/T/pip-install-adei_m45/fugashi/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/6j/mctnhv6n2zx4zf8c657hbr2w0000gn/T/pip-record-y_d6t5au/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/6j/mctnhv6n2zx4zf8c657hbr2w0000gn/T/pip-install-adei_m45/fugashi/

I installed Mecab using Homebrew

$ brew info mecab
mecab: stable 0.996 (bottled)
Yet another part-of-speech and morphological analyzer
https://taku910.github.io/mecab/
Conflicts with:
  mecab-ko (because both install mecab binaries)
/usr/local/Cellar/mecab/0.996 (20 files, 4.2MB) *
  Poured from bottle on 2019-03-25 at 14:19:36
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/mecab.rb
==> Analytics
install: 7,671 (30 days), 24,244 (90 days), 50,937 (365 days)
install-on-request: 653 (30 days), 2,051 (90 days), 10,150 (365 days)
build-error: 0 (30 days)

fugashi>=1.0.2 tarballs do not have their versions

When installing from source, fugashi==1.0.1 reports its version as 1.0.1, but the fugashi>=1.0.2 tarballs install as version 0.0.0:

$ pip3 install fugashi==1.0.2 --no-binary fugashi
Collecting fugashi==1.0.2
  Downloading fugashi-1.0.2.tar.gz (334 kB)
     |████████████████████████████████| 334 kB 5.8 MB/s
  WARNING: Requested fugashi==1.0.2 from https://files.pythonhosted.org/packages/75/c0/5eb732b1b490a7bae2e22ab8653cc693143d411e7f2e61df613bf7e06dc2/fugashi-1.0.2.tar.gz#sha256=846148fbdd5d46a5b1b3aa31c2a0ea467d7bd62b0842f6b45f0af8dcb9ca8570, but installing version 0.0.0
Installing collected packages: fugashi
   Running setup.py install for fugashi ... done
Successfully installed fugashi-0.0.0
$ pip3 list | fgrep fugashi
fugashi                       0.0.0

It works, even in Cygwin, though the version metadata is broken. I suspect use_scm_version in setup.py, but I'm not sure.

Pickling error when multiprocessing

When I tried to use fugashi for multiprocessing, I got the following error.

File "stringsource", line 2, in fugashi.fugashi.GenericTagger.reduce_cython
self.c_tagger cannot be converted to a Python object for pickling
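
The message suggests that a tagger object is being sent to the worker processes, which would require pickling the underlying C tagger. A minimal sketch of one common workaround, assuming each worker can build its own tagger: create the tagger in a pool initializer so it never crosses a process boundary. The names fugashi_factory, init_worker, tokenize and tokenize_all are made up for illustration.

```python
from concurrent.futures import ProcessPoolExecutor

# One tagger per worker process; it is created in the worker and never pickled.
_tagger = None


def fugashi_factory():
    # Imported lazily so the import happens inside each worker.
    from fugashi import Tagger
    return Tagger()


def init_worker(factory):
    """Pool initializer: build this process's tagger once."""
    global _tagger
    _tagger = factory()


def tokenize(text):
    # Return plain strings, which pickle without trouble; the node
    # objects wrap C data and are kept inside the worker.
    return [str(word) for word in _tagger(text)]


def tokenize_all(texts, factory=fugashi_factory, workers=2):
    with ProcessPoolExecutor(
        max_workers=workers, initializer=init_worker, initargs=(factory,)
    ) as pool:
        return list(pool.map(tokenize, texts))
```

With fugashi and a dictionary installed, tokenize_all(["麩菓子は、麩を主材料とした日本の菓子。"]) would run the parsing in the workers. On platforms that start workers with spawn, the call needs to sit under an if __name__ == "__main__": guard.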

installation failed from pipenv.

Installing dependencies from Pipfile.lock (039fd1)…
An error occurred while installing fugashi==0.1.4 --hash=sha256:0dbb394b9d21bf48f3c1772fe247da9c8fe3a53d257a8d10e23941eed86b768d! Will try again.
  🐍   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 23/23 — 00:00:46
Installing initially failed dependencies…
[pipenv.exceptions.InstallError]:   File "/Users/user/.pyenv/versions/3.7.5/envs/env-sample/lib/python3.7/site-packages/pipenv/core.py", line 1874, in do_install
[pipenv.exceptions.InstallError]:       keep_outdated=keep_outdated
[pipenv.exceptions.InstallError]:   File "/Users/user/.pyenv/versions/3.7.5/envs/env-sample/lib/python3.7/site-packages/pipenv/core.py", line 1253, in do_init
[pipenv.exceptions.InstallError]:       pypi_mirror=pypi_mirror,
[pipenv.exceptions.InstallError]:   File "/Users/user/.pyenv/versions/3.7.5/envs/env-sample/lib/python3.7/site-packages/pipenv/core.py", line 859, in do_install_dependencies
[pipenv.exceptions.InstallError]:       retry_list, procs, failed_deps_queue, requirements_dir, **install_kwargs
[pipenv.exceptions.InstallError]:   File "/Users/user/.pyenv/versions/3.7.5/envs/env-sample/lib/python3.7/site-packages/pipenv/core.py", line 763, in batch_install
[pipenv.exceptions.InstallError]:       _cleanup_procs(procs, not blocking, failed_deps_queue, retry=retry)
[pipenv.exceptions.InstallError]:   File "/Users/user/.pyenv/versions/3.7.5/envs/env-sample/lib/python3.7/site-packages/pipenv/core.py", line 681, in _cleanup_procs
[pipenv.exceptions.InstallError]:       raise exceptions.InstallError(c.dep.name, extra=err_lines)
[pipenv.exceptions.InstallError]: ['Collecting fugashi==0.1.4 (from -r /var/folders/91/7npclg6s2mn6z73kqn22fxj00000gn/T/pipenv-j3baxj38-requirements/pipenv-x686vbno-requirement.txt (line 1))', '  Using cached https://files.pythonhosted.org/packages/41/68/a9c829e26a7d5c058a482c542f16f2e8b63e12a51110d55418cf6e1dbe67/fugashi-0.1.4.tar.gz']
[pipenv.exceptions.InstallError]: ['ERROR: Command errored out with exit status 1:', '     command: /Users/user/.pyenv/versions/3.7.5/envs/env-sample/bin/python -c \'import sys, setuptools, tokenize; sys.argv[0] = \'"\'"\'/private/var/folders/91/7npclg6s2mn6z73kqn22fxj00000gn/T/pip-install-csbpfj56/fugashi/setup.py\'"\'"\'; __file__=\'"\'"\'/private/var/folders/91/7npclg6s2mn6z73kqn22fxj00000gn/T/pip-install-csbpfj56/fugashi/setup.py\'"\'"\';f=getattr(tokenize, \'"\'"\'open\'"\'"\', open)(__file__);code=f.read().replace(\'"\'"\'\\r\\n\'"\'"\', \'"\'"\'\\n\'"\'"\');f.close();exec(compile(code, __file__, \'"\'"\'exec\'"\'"\'))\' egg_info --egg-base pip-egg-info', '         cwd: /private/var/folders/91/7npclg6s2mn6z73kqn22fxj00000gn/T/pip-install-csbpfj56/fugashi/', '    Complete output (5 lines):', '    Traceback (most recent call last):', '      File "<string>", line 1, in <module>', '      File "/private/var/folders/91/7npclg6s2mn6z73kqn22fxj00000gn/T/pip-install-csbpfj56/fugashi/setup.py", line 5, in <module>', '        from Cython.Build import cythonize', "    ModuleNotFoundError: No module named 'Cython'", '    ----------------------------------------', 'ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.']
ERROR: ERROR: Package installation failed...

Multi-dictionary with fugashi.GenericTagger

Hello,
I want to create a custom dictionary and apply it to fugashi.
I succeeded in adding a custom dictionary in MeCab.

import MeCab 
import sys
m = MeCab.Tagger("-Ochasen -u /usr/local/lib/mecab/dic/userdic/mydic.dic")
text = m.parse("ユーザ設定")
print(text)

output

ユーザ設定	ユーザセッテイ	ユーザ設定	名詞-一般		
EOS

I am trying to use fugashi.GenericTagger to use the dictionary created in mecab for fugashi.
Like the mecab code above, I want to define both sysdic and userdic in fugashi.GenericTagger.
Is it possible?
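
A sketch of how this might look, assuming GenericTagger hands its argument string to MeCab in the same way MeCab.Tagger does, so that the -d (system dictionary) and -u (user dictionary) options from the code above carry over. The helper name mecab_args and the system dictionary path are hypothetical; as far as I know, MeCab accepts several user dictionaries as a comma-separated list.

```python
def mecab_args(sysdic=None, userdics=(), output_format=None):
    """Build a MeCab option string with -O, -d and -u parts."""
    parts = []
    if output_format:
        parts.append(f"-O{output_format}")
    if sysdic:
        parts.append(f"-d {sysdic}")
    if userdics:
        # Several user dictionaries are joined with commas.
        parts.append("-u " + ",".join(userdics))
    return " ".join(parts)


args = mecab_args(
    sysdic="/usr/local/lib/mecab/dic/ipadic",  # hypothetical path
    userdics=["/usr/local/lib/mecab/dic/userdic/mydic.dic"],
)
print(args)

# With fugashi and both dictionaries installed, the tagger would be built as:
#   from fugashi import GenericTagger
#   tagger = GenericTagger(args)
```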

Windows DLL Weirdness

Via email I have a report of a Windows user who installed fugashi via pip without errors, but didn't get libmecab.dll in their site-packages/fugashi directory, which led to errors at import time like this:

ImportError: DLL load failed while importing fugashi: the specified module could not be found

For what it's worth, the dll is definitely in the wheel file, and when I install it on Windows the dll ends up in the site-packages/fugashi package as expected.

This thread has some info on DLLs and Python on Windows:

Toblerity/Fiona#851

One thing that we could potentially do is check for ImportErrors, and if the code is being executed on Windows, check if libmecab.dll is present and give a very specific error if not. On the other hand, since it's not clear how this happened in the first place, maybe just having an FAQ entry (or this issue) is enough for now.
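
The check described above could look roughly like the following. This is only a sketch of the idea, not code from fugashi; explain_missing_dll and its message text are hypothetical.

```python
import os
import sys


def explain_missing_dll(package_dir, platform=sys.platform, dll_name="libmecab.dll"):
    """Return a specific hint if the MeCab DLL is missing on Windows, else None."""
    if not platform.startswith("win"):
        return None
    dll_path = os.path.join(package_dir, dll_name)
    if os.path.exists(dll_path):
        return None
    return (
        f"{dll_name} was not found in {package_dir}. "
        "The wheel should include it; try reinstalling fugashi."
    )


# Intended use, wrapped around the import of the compiled extension:
#   try:
#       from .fugashi import *
#   except ImportError as err:
#       hint = explain_missing_dll(os.path.dirname(__file__))
#       if hint:
#           raise ImportError(hint) from err
#       raise
```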

Bug in getting POS of proper noun.

It looks like there's a bug when using UniDic and getting the pos of a proper noun.

In [18]: from fugashi import Tagger, GenericTagger

In [19]: tagger = Tagger('-Owakati')

In [20]: tokens = tagger.parseToNodeList("むかし丹波の国に稻村屋源助という金持ちの商人が住んでいた。")

In [21]: tokens
Out[21]: [むかし, 丹波, の, 国, に, 稻村, 屋, 源助, と, いう, 金持ち, の, 商人, が, 住ん, で, い, た, 。]

In [22]: tokens[5]
Out[22]: 稻村

In [23]: tokens[5].pos
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: __new__() takes from 1 to 27 positional arguments but 28 were given
Exception ignored in: 'fugashi.Node.set_feature'
TypeError: __new__() takes from 1 to 27 positional arguments but 28 were given
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-7fc5ef581d7d> in <module>
----> 1 tokens[5].pos

fugashi/fugashi.pyx in fugashi.UnidicNode.pos.__get__()

TypeError: 'NoneType' object is unsliceable

`mecabrc` location

I am using fugashi 1.1.0 and have installed MeCab from the Arch Linux user repository. This package installs mecabrc at /etc/mecabrc. As a result, the following code snippet fails:

from fugashi import GenericTagger

tagger = GenericTagger()

This issue can be fixed by creating a link where fugashi expects mecabrc to be: ln -s /etc/mecabrc /usr/local/etc/mecabrc.

Is this behavior expected? Is there something wrong with how fugashi looks for mecabrc, or is the problem where the package installs it? Also, why is fugashi sensitive to the location of the default configuration file anyway?
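
As an alternative to the symlink, the rc file can probably be named explicitly. The sketch below assumes that MeCab's -r (--rcfile) option and the MECABRC environment variable behave as they do for the mecab command line tool, and that GenericTagger passes its arguments through to MeCab. find_mecabrc and rcfile_args are hypothetical helper names.

```python
import os


def find_mecabrc(candidates=("/usr/local/etc/mecabrc", "/etc/mecabrc")):
    """Return the first existing mecabrc, checking MECABRC before the candidates."""
    env_path = os.environ.get("MECABRC")
    paths = ([env_path] if env_path else []) + list(candidates)
    for path in paths:
        if os.path.isfile(path):
            return path
    return None


def rcfile_args(path):
    # -r / --rcfile is MeCab's option for choosing the resource file.
    return f"-r {path}" if path else ""


# With fugashi installed:
#   from fugashi import GenericTagger
#   tagger = GenericTagger(rcfile_args(find_mecabrc()))
```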

bizarre print behavior

Hello,

When using the tokenizer as part of a loop I get different outputs depending on whether a print call is in the for loop or not.
With the following for loop, printing articles gives this output:


import fugashi

tagger = fugashi.Tagger()
text = ['未来ある子どもたちを、たばこがもたらす健康被害',
        '◆生命の尊厳\u3000立法化検討13年\u3000党議拘束見送り「死」をどう考えるか。']

articles = []
for art in text:
    tokenized = tagger(art)
    articles.append(tokenized)
print(articles)
[[生, の, 厳, 立法, 討, 年, 党議, を, どう, る, か], [◆, 生命, の, 尊厳,  , 立法, 化, 検討, 13, 年,  , 党議, 拘束, 見送り, 「, 死, 」, を, どう, 考える, か, 。]]

While the following for loop gives a more correct result:


import fugashi

tagger = fugashi.Tagger()
text = ['未来ある子どもたちを、たばこがもたらす健康被害',
        '◆生命の尊厳\u3000立法化検討13年\u3000党議拘束見送り「死」をどう考えるか。']

articles = []
for art in text:
    print(art)
    tokenized = tagger(art)
    print(tokenized)
    articles.append(tokenized)
print(articles)
[[未来, ある, 子ども, たち, を, 、, たばこ, が, もたらす, 健康, 被害], [◆, 生命, の, 尊厳,  , 立法, 化, 検討, 13, 年,  , 党議, 拘束, 見送り, 「, 死, 」, を, どう, 考える, か, 。]]
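
Since the second loop reads the tokens right after each call and gets the correct answer, one workaround worth trying is to copy every token into a plain Python string before the tagger is called again. This is a sketch based on an assumption about the cause (that a node's text depends on state which a later call changes), not a confirmed diagnosis. surfaces and tokenize_articles are made-up helper names.

```python
def surfaces(tagger, text):
    """Tokenize text and immediately copy each token out as a plain str."""
    return [str(word) for word in tagger(text)]


def tokenize_articles(tagger, texts):
    # Each article is fully converted before the tagger sees the next one.
    return [surfaces(tagger, text) for text in texts]


# With fugashi installed:
#   import fugashi
#   articles = tokenize_articles(fugashi.Tagger(), text)
```

The lists then hold strings rather than node objects, so any feature data that is needed later (such as word.feature.lemma) would have to be copied out at the same point.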

Failed initializing MeCab with GenericTagger

from fugashi import GenericTagger
tagger = GenericTagger()

Then it caused:

Failed initializing MeCab. Please see the README for possible solutions:
    https://github.com/polm/fugashi

The code below runs successfully. I have already run pip install mecab-python3.

from fugashi import Tagger

tagger = Tagger('-Owakati')
text = "麩菓子は、麩を主材料とした日本の菓子。"
tagger.parse(text)
# => '麩 菓子 は 、 麩 を 主材 料 と し た 日本 の 菓子 。'
for word in tagger(text):
    print(word, word.feature.lemma, word.pos, sep='\t')

I am running Anaconda Python 3.8 on Windows. Issue #32 (comment) looks similar to mine, but I don't know where fugashi expects mecabrc to be located.
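
One thing to try, as far as I understand the difference between the two classes: Tagger locates an installed UniDic package by itself, while GenericTagger with no arguments leaves MeCab to find a mecabrc and dictionary on its own. The sketch below builds explicit -r and -d options for a dictionary directory. It assumes that the directory contains a file named mecabrc and that quoted paths are accepted in the argument string; dicdir_args is a hypothetical helper.

```python
import os


def dicdir_args(dicdir):
    """Build MeCab options pointing at a dictionary directory and the mecabrc inside it."""
    mecabrc = os.path.join(dicdir, "mecabrc")
    # Paths are quoted in case they contain spaces.
    return f'-r "{mecabrc}" -d "{dicdir}"'


# With fugashi and unidic-lite installed (assumption: the unidic_lite
# package exposes its dictionary directory as DICDIR):
#   import unidic_lite
#   from fugashi import GenericTagger
#   tagger = GenericTagger(dicdir_args(unidic_lite.DICDIR))
```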

python3.6 support

Hi, thank you for this great project!
I want to use fugashi with python3.6.
Any plan to support it?

Add support to release linux aarch64 wheels

Problem

On aarch64, ‘pip install fugashi’ builds the wheel from source and then installs it. This requires the user to have a development environment installed, and building the wheel takes longer than downloading and extracting a prebuilt one from PyPI.

Resolution

On aarch64, ‘pip install fugashi’ should download the wheels from pypi.

@polm and team, please let me know whether you are interested in releasing aarch64 wheels. I can help with this.

Pronounce without 「ー」 (kana only)

How do I get the pronunciation of a word without 「ー」? For example, I would like the following to print 「ケッコウ」.

from fugashi import Tagger
tagger = Tagger()
print([word.feature.pron for word in tagger('結構')])

On a related note, where can I find documentation for the library (for example, about the fields of word.feature)?
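
One thing to try, assuming a UniDic dictionary: UniDic features include a kana field alongside pron, and kana should give the reading spelled out rather than written with the long-vowel mark. The helper below is a sketch. readings is a hypothetical name, and which fields exist depends on the dictionary version.

```python
def readings(tagger, text, field="kana"):
    """Collect one feature field per token; None where the field is missing."""
    return [getattr(word.feature, field, None) for word in tagger(text)]


# With fugashi and a UniDic dictionary installed:
#   from fugashi import Tagger
#   print(readings(Tagger(), '結構'))
```

On the second question, word.feature is a named tuple, so word.feature._fields lists the field names available for the installed dictionary.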

Transition to manylinux2014

The manylinux project is dropping support for the manylinux1 Docker image, which is used for fugashi wheel builds, on January 1.

pypa/manylinux#994

I tried switching to manylinux2014 and it didn't just work. It probably isn't very complicated, but it needs some more fiddling.

type stubs

Using PyLance to inspect my code, I get errors when importing from fugashi.

I believe the issue is the same as this one for lxml: because the types are in a native library, PyLance can't analyze them for type information.

The solution is to create a type stubs file. For lxml, there's a separate lxml-stubs package.

Don't know if this is a lot of work or a little bit of work, but I'll open the ticket here in case anyone else encounters the same thing.

Issue using fugashi with transformers[ja]==3.1.0

Hi,
I was importing transformers[ja] on Colab without any issue, but when I moved my code to my local machine it started failing.

I am on a Mac (Catalina) and use PyCharm.

In my requirements.txt file I have transformers[ja]==3.1.0 and when I run some code I get the below error message

Failed initializing MeCab. Please see the README for possible solutions:
...
self.mecab = fugashi.GenericTagger(mecab_option)
  File "fugashi/fugashi.pyx", line 220, in fugashi.fugashi.GenericTagger.__init__
RuntimeError: Failed initializing MeCab

Any suggestions?

I tried adding the below to the requirements.txt but without success.

mecab-python3==1.0.1
unidic-lite==1.0.7
ipadic==1.0.0
unidic==1.0.2
