
Comments (5)

polm commented on June 12, 2024

It sounds like you have three separate issues, so to address them...


A. Applying taku910/mecab#70

I'll consider it, but it may take me a while to get to it. If you build fugashi from source you can use a local version of MeCab as the base, which would allow you to resolve your issue immediately. Based on my understanding of the issue, the resulting dictionary should work fine with unpatched MeCab.


B. Comments on UniDic 3.2 data

Thank you for pointing out the difference in the fields. I had a little trouble understanding what you were saying, so for my reference:

  • pron represents pronunciation, and uses a long vowel marker for long vowels
  • kana represents how the word is written in context, which can differ from pron due to long vowels and historical or irregular kana usage (this I had not noticed)
  • form is like kana but with mostly standard kana usage (??)
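For reference, here is a minimal sketch of how these fields might be inspected side by side with fugashi. It assumes fugashi is installed together with a full UniDic 3.x dictionary (unidic-lite ships an older UniDic that lacks some of these fields). `field_summary` and `print_fields` are hypothetical helper names for illustration, not part of fugashi.

```python
# Fields discussed above, using the attribute names of fugashi's UniDic features.
FIELDS = ("pron", "kana", "form", "lForm", "lemma")


def field_summary(word):
    """Return the surface and selected feature fields of one token as a dict.

    Fields that the installed dictionary does not provide come back as None.
    """
    summary = {"surface": word.surface}
    for name in FIELDS:
        summary[name] = getattr(word.feature, name, None)
    return summary


def print_fields(text):
    """Tokenize text and print one field summary per token."""
    # Imported here so field_summary can be used without fugashi installed.
    from fugashi import Tagger

    tagger = Tagger()
    for word in tagger(text):
        print(field_summary(word))
```

Calling `print_fields("クリエイティヴ")` should then show where pron, kana and form differ for that token, provided the dictionary exposes all of these fields.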

I am surprised that form and lform differ for クリエイティヴ; I'm not entirely sure what the logic is there.

If you have any further insight, it would be appreciated, and I'll look into this. It would probably be best to mail the UniDic maintainers for clarification though, unless this is already in the manual.


C. Adding access to further node fields

I would consider it but wouldn't prioritize it. Like most advanced features in MeCab, I've never known anyone to use it.

I would be happy to take a look at a PR.


Thank you for taking the time to create a GitHub account and post this. However, I will note that asking multiple questions in one issue makes it a little hard to follow. For now, I consider A. resolved, B. to require further investigation, and C. to be open. I made a new issue for C at #76, and we can use this thread to continue discussing B.

from fugashi.

mewnd commented on June 12, 2024

Regarding form (語形出現形, the word form as it appears) in point B:

UniDic's hierarchical headword structure (階層的な見出し構造):
ref: https://clrd.ninjal.ac.jp/unidic/glossary.html#kaisouteki

語形 (word form) is written in katakana.
It groups together different written expressions of the same word, such as 「大きい」 and 「おおきい」.

「大きい」 and 「おおきい」
The 書字形基本形 (orthographic base forms) 「大きい」 and 「おおきい」 fall under the same 語形基本形 (base word form), 「オオキイ」.

書字形基本形 「大きい」

  • 大きい
  • 大きく (連用形, continuative form)
  • 大きけれ (仮定形, hypothetical form)

書字形基本形 「おおきい」 (written in hiragana)

  • おおきい
  • おおきく
  • おおきけれ

form

The casual expression 「おっきい」 has a different pronunciation, so it is regarded as a separate 語形.
The same rule applies to other colloquial and written variants whose conjugation types differ.

These variations are grouped under 語彙素 (lemma), the highest level of the hierarchy.

lemma

「回」,「下位」and「貝」
They are different words with different meanings, but they share the same kana spelling,
「かい」 in hiragana and 「カイ」 in katakana, so they have the same 語形.

This hierarchy from 書字形 (orthographic form) up to 語彙素 (lemma) makes it possible to distinguish
queries for 「回」, 「下位」 and 「貝」 even when they are written in hiragana.
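The grouping described above can be sketched as a small index keyed by the katakana 語形. This is only an illustration of the idea, not how UniDic or MeCab store entries. The entry list is built from the examples in this comment, and the lemma label 大きい for the おおきい/おっきい group, as well as all function names, are assumptions made for the sketch.

```python
from collections import defaultdict

# (orthBase 書字形基本形, formBase 語形基本形, lemma 語彙素), from the examples above.
# The lemma label "大きい" is an assumption for illustration.
ENTRIES = [
    ("大きい", "オオキイ", "大きい"),
    ("おおきい", "オオキイ", "大きい"),
    ("おっきい", "オッキイ", "大きい"),
    ("回", "カイ", "回"),
    ("下位", "カイ", "下位"),
    ("貝", "カイ", "貝"),
]


def hira_to_kata(text):
    """Convert hiragana to katakana; other characters pass through unchanged."""
    return "".join(
        chr(ord(ch) + 0x60) if "ぁ" <= ch <= "ゖ" else ch for ch in text
    )


def build_index(entries):
    """Map each formBase to the lemmas and written forms filed under it."""
    index = defaultdict(lambda: {"lemmas": set(), "orth_bases": set()})
    for orth_base, form_base, lemma in entries:
        index[form_base]["lemmas"].add(lemma)
        index[form_base]["orth_bases"].add(orth_base)
    return index


def lemmas_for_reading(index, reading):
    """Look up candidate lemmas for a reading given in hiragana or katakana."""
    key = hira_to_kata(reading)
    return index[key]["lemmas"] if key in index else set()
```

With `index = build_index(ENTRIES)`, the reading かい maps to three separate lemmas, while 大きい and おおきい share the single form オオキイ and おっきい sits under its own form オッキイ.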

Above is my understanding from the glossary page.

For creative:

| surface | pron | form | kana | lform | lemma |
|---|---|---|---|---|---|
| クリエイティブ | クリエーティブ | クリエイティブ | クリエイティブ | クリエーティブ | クリエーティブ-creative |
| クリエイティヴ | クリエーティブ | クリエイティブ | クリエイティヴ | クリエーティブ | クリエーティブ-creative |
| クリエーティブ | クリエーティブ | クリエーティブ | クリエーティブ | クリエーティブ | クリエーティブ-creative |
| Creative | クリエーティブ | クリエーティブ | クリエーティブ | クリエーティブ | クリエーティブ-creative |

Its lform (語彙素読み, the lemma reading) is クリエーティブ.
It has two different forms (語形出現形):

  • クリエイティブ (for クリエイティブ and クリエイティヴ)
  • クリエーティブ (for クリエーティブ and Creative)
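The same grouping can be reproduced from the rows listed above. The following is a small sketch that only uses the values quoted in this comment; `group_surfaces` is a hypothetical helper name.

```python
from collections import defaultdict

COLUMNS = ("surface", "pron", "form", "kana", "lform", "lemma")

# Rows copied from the comparison above.
ROWS = [
    ("クリエイティブ", "クリエーティブ", "クリエイティブ", "クリエイティブ", "クリエーティブ", "クリエーティブ-creative"),
    ("クリエイティヴ", "クリエーティブ", "クリエイティブ", "クリエイティヴ", "クリエーティブ", "クリエーティブ-creative"),
    ("クリエーティブ", "クリエーティブ", "クリエーティブ", "クリエーティブ", "クリエーティブ", "クリエーティブ-creative"),
    ("Creative", "クリエーティブ", "クリエーティブ", "クリエーティブ", "クリエーティブ", "クリエーティブ-creative"),
]


def group_surfaces(rows, column):
    """Group surface forms by the value they have in the given column."""
    idx = COLUMNS.index(column)
    groups = defaultdict(list)
    for row in rows:
        groups[row[idx]].append(row[0])
    return dict(groups)


by_form = group_surfaces(ROWS, "form")    # two groups, one per 語形出現形
by_lform = group_surfaces(ROWS, "lform")  # a single group for the whole lemma
```

Grouping by form yields the two groups listed above, while grouping by lform puts all four surfaces under クリエーティブ.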

I hope it helps clarify the logic a bit.

Thank you for your reply.


polm commented on June 12, 2024

Thank you for the clarification; it helps my understanding.

I do have one question - are you just clarifying this, or do you propose a change to fugashi (or maybe my UniDic docs) somewhere?


mewnd commented on June 12, 2024

I suggest the following changes in README.md of unidic-py:

Modify:
For more information see the UniDic FAQ and its Hierarchy,

Add a description for type (please copy the code below to keep the list folded):

type: seems to be the lemma type (語彙素類)
<details>
    <summary>A list of the fields in unidic-cwj-202302</summary>
    <pre>
type,pos1,pos2,pos3,pos4
人名,名詞,固有名詞,人名,一般
他,感動詞,フィラー,*,*
他,感動詞,一般,*,*
他,接続詞,*,*,*
体,代名詞,*,*,*
体,名詞,助動詞語幹,*,*
体,名詞,普通名詞,サ変可能,*
体,名詞,普通名詞,サ変形状詞可能,*
体,名詞,普通名詞,一般,*
体,名詞,普通名詞,副詞可能,*
体,名詞,普通名詞,助数詞可能,*
体,名詞,普通名詞,形状詞可能,*
係助,助詞,係助詞,*,*
副助,助詞,副助詞,*,*
助動,助動詞,*,*,*
助動,形状詞,助動詞語幹,*,*
助数,接尾辞,名詞的,助数詞,*
名,名詞,固有名詞,人名,名
固有名,名詞,固有名詞,一般,*
国,名詞,固有名詞,地名,国
地名,名詞,固有名詞,地名,一般
姓,名詞,固有名詞,人名,姓
接助,助詞,接続助詞,*,*
接尾体,接尾辞,名詞的,サ変可能,*
接尾体,接尾辞,名詞的,一般,*
接尾体,接尾辞,名詞的,副詞可能,*
接尾用,接尾辞,動詞的,*,*
接尾相,接尾辞,形容詞的,*,*
接尾相,接尾辞,形状詞的,*,*
接頭,接頭辞,*,*,*
数,名詞,数詞,*,*
格助,助詞,格助詞,*,*
準助,助詞,準体助詞,*,*
用,動詞,一般,*,*
用,動詞,非自立可能,*,*
相,副詞,*,*,*
相,形容詞,一般,*,*
相,形容詞,非自立可能,*,*
相,形状詞,タリ,*,*
相,形状詞,一般,*,*
相,連体詞,*,*,*
終助,助詞,終助詞,*,*
補助,空白,*,*,*
補助,補助記号,一般,*,*
補助,補助記号,句点,*,*
補助,補助記号,括弧閉,*,*
補助,補助記号,括弧開,*,*
補助,補助記号,読点,*,*
補助,補助記号,AA,一般,*
補助,補助記号,AA,顔文字,*
記号,記号,一般,*,*
記号,記号,文字,*,*
    </pre>
</details>
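Since the list above is plain CSV, it can be loaded to see which parts of speech each type value covers. The sketch below uses only a few rows excerpted from the list, and `pos1_by_type` is a hypothetical helper name.

```python
import csv
from collections import defaultdict
from io import StringIO

# A few rows excerpted from the list above; the full list has the same layout.
SAMPLE = """\
type,pos1,pos2,pos3,pos4
体,名詞,普通名詞,一般,*
体,代名詞,*,*,*
用,動詞,一般,*,*
相,形容詞,一般,*,*
相,副詞,*,*,*
格助,助詞,格助詞,*,*
"""


def pos1_by_type(csv_text):
    """Map each type value to the set of top-level POS tags (pos1) it appears with."""
    mapping = defaultdict(set)
    for row in csv.DictReader(StringIO(csv_text)):
        mapping[row["type"]].add(row["pos1"])
    return dict(mapping)
```

For the excerpt, `pos1_by_type(SAMPLE)` maps 体 to both 名詞 and 代名詞, which suggests that type is a coarser grouping than pos1.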

Add descriptions for form and formBase:
form: 語形出現形, the form of the word as it appears. It groups together different written expressions of the same word.
formBase: 語形基本形, the uninflected form of the word. For example, the formBase オオキイ groups the orthBase values (書字形基本形) 大きい and おおきい together. The casual expression おっきい has a different pronunciation, so it is regarded as a separate formBase, オッキイ. (See the UniDic Hierarchy for details.)

Add an example for lid: 語彙表ID.
For example, クリエイティブ, クリエイティヴ, クリエーティブ and Creative share the same lemma_id.


polm commented on June 12, 2024

Thank you for the clarification. I have added your suggestions to the README, so I will mark this as resolved.

