tshatrov / ichiran
Linguistic tools for texts in Japanese language
License: MIT License
I'm on Docker Desktop on Windows 10, trying to build the Docker containers.
I'm getting a "/bin/sh: 1: entrypoint.sh: not found" error inside ichiran-main-1 when I start the container.
I did get this to run just fine on a Debian Linux system, so I think it might be something specific on Docker Desktop on Windows.
This may be impossible to do anything about, but is there a way to increase the speed of the parser?
This is by far the most accurate Japanese parser I've seen, and I'm trying to use it to create a frequency list. I have tens of millions of sentences I want to analyze, but it took three days just to analyze 1 million. By comparison, something like MeCab took only half an hour.
I am currently running the cli on each line, getting the JSON output, and working with that in NodeJS.
Let me know if there's anything I can do to speed up the process.
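Since each ichiran-cli invocation pays the full SBCL startup cost, one caller-side workaround is to keep several CLI processes running at once. This doesn't reduce per-sentence cost, but throughput scales roughly with worker count. A minimal Python sketch (the `ichiran-cli` name and `-f` flag are as used elsewhere in these issues; the injectable `runner` parameter is just a testing convenience, not part of any ichiran API):

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_cli(line):
    # One ichiran-cli process per line; each call pays the full startup cost.
    out = subprocess.run(["ichiran-cli", "-f", line],
                         capture_output=True, text=True, check=True)
    return out.stdout

def analyze(line, runner=run_cli):
    # runner is injectable so the JSON handling can be tested without the CLI.
    return json.loads(runner(line))

def analyze_many(lines, workers=8, runner=run_cli):
    # Threads are fine here: each worker blocks on an external process,
    # so several CLI processes run concurrently across cores.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda l: analyze(l, runner), lines))
```

A longer-lived server process would be faster still, but this needs no changes on the ichiran side.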
Hello
I was trying to initialize the database following the instructions in the README but got stuck at this part
$ rlwrap sbcl --noinform --load /etc/default/quicklisp
* (ql:quickload :ichiran)
To load "ichiran":
Load 1 ASDF system:
ichiran
; Loading "ichiran"
.
;;; Checking for wide character support... WARNING: Lisp implementation doesn't use UTF-16, but accepts surrogate code points.
yes, using code points.
..
;;; Checking for wide character support... WARNING: Lisp implementation doesn't use UTF-16, but accepts surrogate code points.
yes, using code points.
;;; Building Closure with CHARACTER RUNES
.......
(:ICHIRAN)
* (ichiran/maintenance:full-init)
Initializing ichiran/dict...
debugger invoked on a CL-POSTGRES-ERROR:COLUMNS-ERROR in thread
#<THREAD "main thread" RUNNING {1000898083}>:
Database error 42601: syntax error at or near "table"
QUERY: DROP TABLE IF EXISTS table
Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.
restarts (invokable by number or by possibly-abbreviated name):
0: [ABORT] Exit debugger, returning to top level.
(CL-POSTGRES::GET-ERROR #<SB-SYS:FD-STREAM for "socket 127.0.0.1:56920, peer: 127.0.0.1:5432" {1008725E13}>)
source: (ERROR (CL-POSTGRES-ERROR::GET-ERROR-TYPE CODE) :CODE CODE :MESSAGE
(GET-FIELD #\M) :DETAIL (GET-FIELD #\D) :HINT (GET-FIELD #\H)
:CONTEXT (GET-FIELD #\W) ...)
0]
*
Here's my settings.lisp:
Since the JMdict repository has been reorganized since the README was written I had to guess where the paths are supposed to point.
(in-package #:ichiran/conn)
(defparameter *connection* '("jmdict" "postgres" "" "localhost"))
; (defparameter *connections* '((:old "jmdict_old" "postgres" "" "localhost")
; (:test "jmdict_test" "postgres" "" "localhost")))
(in-package #:ichiran/dict)
(defparameter *jmdict-path* #p"/home/.../JMdict_e")
(defparameter *jmdict-data* #p"/home/.../jmdictdb/jmdictdb/data")
(in-package #:ichiran/kanji)
(defparameter *kanjidic-path* #P"/home/.../kanjidic2.xml")
I'm a bit confused on the entries:
(defparameter *jmdict-path* #p"/home/you/dump/JMdict_e")
I found JMdict_e here: http://ftp.monash.edu/pub/nihongo/JMdict_e.gz
Is this the file?
Also
(defparameter *jmdict-data* #p"/home/you/dump/jmdict-data/")
Where is jmdict-data? Is it supposed to be jmdictdb/data?
Thanks
Lately the few placenames etc. that exist in jmdict are being moved to jmnedict. If this continues, ichi.moe won't be able to recognize stuff like Tokyo etc., which is unacceptable. We need to incorporate jmnedict names without messing up the segmenting algorithm. Kanji names should be top priority, katakana names are not important and can be ignored for now. They should score lower than regular words so as not to pollute the results.
I noticed that there are some inconsistencies in how whitespace and punctuation are treated, and it causes some precision issues when trying to correlate with the original sentence. For example, a Japanese comma 、 is converted to a standard comma plus a space, and the combination 。」 is converted to . " (period, space, quote). I'm wondering if there is a reason why punctuation is converted and why spaces are added, and also if there is a way to preserve the information so that I could correlate each token index perfectly with the original sentence.
My use case is pretty common, generating the furigana for a sentence, but I want to know precisely the index in the original sentence. Another thing that might help this case is to include index locations for everything.
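Until index locations are included upstream, one caller-side workaround is to scan the original sentence left to right for each token's text. This assumes, as in the JSON examples elsewhere in these issues, that each token's 'text' field preserves the original characters even though the romanized output normalizes punctuation. A rough Python sketch:

```python
def align_tokens(sentence, token_texts):
    """Recover (start, end) indices for each token by scanning the original
    sentence left to right. Tokens that were normalized and no longer appear
    verbatim come back as None and need separate handling."""
    spans, pos = [], 0
    for text in token_texts:
        start = sentence.find(text, pos)
        if start == -1:
            spans.append(None)  # token not found verbatim in the original
            continue
        spans.append((start, start + len(text)))
        pos = start + len(text)
    return spans
```

This is only reliable while the token texts appear in order, which holds for segmentation output.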
Looking at the code, it seems that the kanji-break list is assumed to have different content constraints in different places.
Specifically, first in join-substring-words*, for every start-end range for which at least one word is found, kanji breaks within that range are found and added to the kanji-breaks list. There is a single list for all the ranges, so if the same two kanjis are parts of more than one word (found with different ranges), the position of the break will be added to the list more than once.
Then, just before gen-score call, we have a:
for kb = (mapcar (lambda (n) (- n start)) (intersection (list start end) kanji-break))
which will put duplicates in the kb list if it finds them in the kanji-break list. So far, it looks like you are counting places that break more than one found word range as many times as there are different word ranges found. This code also reindexes the breaks so that they start from the beginning of the word that's about to be scored. So the resulting kb list could have, for example, contents like nil, (0), (1), (2), (0 1), (0 2) (typical), but it could also have (0 0), (1 1), (0 0 1 1) (albeit rarely, most likely).
But then, when the kanji-break list contents are actually used, at the very beginning of kanji-break-penalty, we have:
(let ((end (cond ((cdr kanji-break) :both)
((eql (car kanji-break) 0) :beg)
(t :end)))
which seems to squarely assume that the kanji-break list (taken from kb) will have either (0), (word-size), or (0 word-size) as its content. It can't be nil, or kanji-break-penalty wouldn't ever be called.
So there can be words whose beginning breaks nothing, but whose end breaks more than one (start end) range for which a word was found, so the kb list will contain word-size more than once. For example, for a 1-character/mora word it'll be (1 1), and then this 'let' will set the end variable to :both, even though it should have picked :end.
Is this by design? I seem to be missing something here, as I'd assume that if it's a bug, it'll have wide-ranging effects on segmentation.
Hi,
I followed your guide for the windows installation, but it just won't work. Everything goes fine until I run the command (ichiran/mnt:add-errata) through sbcl. I always get the error: alien function "gai_strerror" is undefined. The same error appears when I directly run (ichiran/test:run-all-tests). Do you have any idea? I'm on Windows 10 64bit and tried sbcl versions 2.1.9, 2.1.6.
thanks in advance
I'm curious if the maintainers would have any interest in a conversion from PostgreSQL to SQLite. The idea would be to make Ichiran much more operationally simple, allowing it to be integrated into developer workflows more easily. Rather than needing a separate DB process and configuration parameters, SQLite could integrate much more seamlessly.
Not sure how common this structure is, but I thought I'd share regardless.
I've noticed that [verb in te-form]ろくに is often segmented as [verb in te-form]ろ + くに, but I think [verb in te-form] + ろくに would actually be the correct segmentation?
Examples where I think ichiran gives incorrect results:
去年 は 忙しくて ろくに 更新 も できず
去年 は 忙しくてろ くに 更新 も できず
その 犬 は 弱っていて ろくに 歩けない
その 犬 は 弱っていてろ くに 歩けない
An example where ichiran does give correct results: (from tatoeba)
Hello hello! Congratulations on the big 2024 release 😁 I'm trying to modify the Docker setup for this release, but I'm running into this Postgres error on the docker compose up step, and I'm wondering if this is an issue with the release or if it's Docker-related:
ERROR: relation "sense_prop" does not exist at character 16
STATEMENT: (SELECT * FROM sense_prop WHERE ((seq = 2089020) and (tag = E'pos') and (text = E'cop-da')))
I had to change a couple of small things to the Docker setup (see my diff at https://github.com/tshatrov/ichiran/compare/master...fasiha:ichiran:2024-jan-release?expand=1). Thanks for any tips! Thanks for continuing to support this great project 🫶
Hello and thank you for your work!!
Is it possible somehow to use only the word splitting functionality of this package?
Hello! I'm following the nice Docker instructions and docker compose build works great, but when I run docker compose up, I get a bunch of errors from pg_restore:
ichiran-pg-1 | /usr/local/bin/docker-entrypoint.sh: sourcing /docker-entrypoint-initdb.d/ichiran-db.sh
ichiran-pg-1 | =========================
ichiran-pg-1 | Starting ichiran DB init!
ichiran-pg-1 | =========================
ichiran-pg-1 | pg_restore: error: could not execute query: ERROR: role "ichiran" does not exist
ichiran-pg-1 | Command was: ALTER DATABASE jmdict0122 OWNER TO ichiran;
ichiran-pg-1 |
ichiran-pg-1 | 2023-11-06 04:47:46.668 UTC [61] ERROR: role "ichiran" does not exist
ichiran-pg-1 | 2023-11-06 04:47:46.668 UTC [61] STATEMENT: ALTER DATABASE jmdict0122 OWNER TO ichiran;
ichiran-pg-1 |
ichiran-pg-1 |
ichiran-pg-1 | pg_restore: error: could not execute query: ERROR: schema "public" already exists
ichiran-pg-1 | Command was: CREATE SCHEMA public;
This continues for a while, with lots of these role "ichiran" does not exist errors.
I can run psql on the Postgres container, e.g.,
$ docker exec -it ichiran-pg-1 bash
root@daa513dbd005:/# psql -d jmdict0122 -U postgres
psql (16.0 (Debian 16.0-1.pgdg120+1))
Type "help" for help.
jmdict0122=# \d
List of relations
Schema | Name | Type | Owner
--------+----------------------------+----------+----------
public | conj_prop | table | postgres
public | conj_prop_id_seq | sequence | postgres
public | conj_source_reading | table | postgres
public | conj_source_reading_id_seq | sequence | postgres
public | conjugation | table | postgres
public | conjugation_id_seq | sequence | postgres
public | entry | table | postgres
public | gloss | table | postgres
public | gloss_id_seq | sequence | postgres
public | kana_text | table | postgres
public | kana_text_id_seq | sequence | postgres
public | kanji | table | postgres
public | kanji_id_seq | sequence | postgres
public | kanji_text | table | postgres
public | kanji_text_id_seq | sequence | postgres
public | meaning | table | postgres
public | meaning_id_seq | sequence | postgres
public | okurigana | table | postgres
public | okurigana_id_seq | sequence | postgres
public | reading | table | postgres
jmdict0122=# select * from reading limit 5;
id | kanji_id | type | text | suffixp | prefixp | stat_common
-----+----------+--------+------+---------+---------+-------------
397 | 90 | ja_kun | ひ | f | f | 102
402 | 91 | ja_on | いん | f | f | 10
404 | 91 | ja_kun | の | t | f | 17
457 | 105 | ja_kun | あま | f | t | 9
523 | 125 | ja_on | うん | f | f | 43
(5 rows)
so a lot definitely got loaded, but I'm not sure whether the data restored correctly, because when I try to run the test suite, SBCL errors:
$ docker exec -it ichiran-main-1 test-suite
This is SBCL 2.2.4, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.
SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses. See the CREDITS and COPYING files in the
distribution for more information.
could not open file "/root/ichiran.core"
open: No such file or directory
and a similar error when I try running docker exec -it ichiran-main-1 ichiran-sbcl. Trying to run the ichiran-cli app fails in a different way:
docker exec -it ichiran-main-1 ichiran-cli -i "一覧は最高だぞ"
OCI runtime exec failed: exec failed: unable to start container process: exec: "ichiran-cli": executable file not found in $PATH: unknown
I tried your tip of rm docker/pgdata and rerunning docker compose up, but encountered the same issues (unable to run the test suite or CLI). I'm a Docker noob, so forgive me if the above has an easy solution 🙇 thanks for any tips you can share!
Hi,
I noticed that ichiran will currently segment "どこから見ても" as three separate words "どこ", "から" and "見ても", rather than the expression which has a JMdict entry. Same is true for some other expressions as well, such as 負の遺産 or 取り留めも無い. Other expressions like どう見ても do get detected as such. I don't know how difficult it would be, but it would be great if ichiran was able to detect these expressions more reliably.
As of "Last updated: 2018-11-03 09:27:30", queries like "元気じゃなかったです" are split into "元気", "じゃ", and "なかったです". I'm not sure if this is intended, but I would have expected it to be split into "元気", "じゃなかった", and "です". The pattern is fairly elementary (it's in Lesson 5 of Genki I) so I think this should take priority over the former segmentation.
"じゃなかったです" by itself also splits into "じゃ" and "なかったです".
"元気じゃなかった" splits as expected.
"元気じゃないです" splits as expected.
EDIT: Just read that it's possible to select one of the five best segmentations generated; there is a correct segmentation there but I think it should still be the first one generated.
https://news.ycombinator.com/item?id=22033309
:-) Hello. Yes, the romaji support sucks. Can you help me a little? I'm tokenising with Mecab through a python lib, then the furigana of tokens is converted on the front-end with wanakana.js. You can mail me on [email protected] , maybe I can call you? David
Hello!
I am trying to convert the database to SQLite to get a better understanding on how it works and to try to embed it. What Postgres version do you use exactly?
Thanks so much for making this :)
I noticed that ichi.moe separates ん and だ, and ん and です, at the end of the sentence when it often seems like they should remain together.
Consider:
トムって毎朝ひげ剃ってるんだ
https://ichi.moe/cl/qr/?q=%E3%83%88%E3%83%A0%E3%81%A3%E3%81%A6%E6%AF%8E%E6%9C%9D%E3%81%B2%E3%81%92%E5%89%83%E3%81%A3%E3%81%A6%E3%82%8B%E3%82%93%E3%81%A0&r=htr
JMDict has the following entry for んだ
のだ, んだ (exp) the expectation is that ...; the reason is that ...; the fact is that ...; the explanation is that ...; it is that ... (の and ん add emphasis)
んです:
のです, んです (exp) the expectation is that ...; the reason is that ...; the fact is that ...; the explanation is that ...; it is that ... (の and ん add emphasis)
So just wanted to report it!
First of all: thank you very much for this amazing project and the work you put into the blog post to make the Common Lisp side of things accessible to people who have never done anything with it. This project is truly a game changer!
I set up the CLI yesterday and had it running in no time. Today I started making a simple frontend for my project and noticed that the conj.prop.type array is always empty. To make sure, I tried the same example directly in the Lisp command line and it worked fine.
As a simple example I used できなかった:
[[[[["dekinakatta",{"reading":"\u3067\u304D\u306A\u304B\u3063\u305F","text":"\u3067\u304D\u306A\u304B\u3063\u305F","kana":"\u3067\u304D\u306A\u304B\u3063\u305F","score":525,"seq":10027599,"conj":[{"prop":[{"pos":"v1","type":[],"neg":true}],"reading":"\u51FA\u6765\u308B \u3010\u3067\u304D\u308B\u3011","gloss":[{"pos":"[v1,vi]","gloss":"to be able (in a position) to do; to be up to the task"},{"pos":"[v1,vi]","gloss":"to be ready; to be completed"},{"pos":"[v1,vi]","gloss":"to be made; to be built"},{"pos":"[v1,vi]","gloss":"to be good at; to be permitted (to do)"},{"pos":"[v1,vi]","gloss":"to become intimate; to take up (with somebody)"},{"pos":"[v1,vi]","gloss":"to grow; to be raised"},{"pos":"[vi,v1]","gloss":"to become pregnant"}],"readok":true}]},[]]],525]]]
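For reference, the nested conj/prop structure in output like the above can be walked as follows. The shape, including the nested 'via' chains that appear in other outputs in these issues, is inferred from examples rather than from a documented schema. A Python sketch:

```python
def conj_props(word_info):
    """Collect (pos, type, neg) tuples from a word's 'conj' entries,
    following nested 'via' chains. Shape inferred from ichiran-cli JSON."""
    props = []
    def walk(conjs):
        for conj in conjs:
            for p in conj.get("prop", []):
                props.append((p.get("pos"), p.get("type"), p.get("neg", False)))
            walk(conj.get("via", []))  # indirect conjugations nest here
    walk(word_info.get("conj", []))
    return props
```

With the できなかった output above, a correctly populated entry would yield a non-empty type string (e.g. a past-negative label) rather than the empty list being reported.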
First off, thanks for your work on this project!
I'd like to know how the seq number is generated for words that have conjugations applied to them. For example, 得られ returns
[[[[["erare",{"reading":"\u5F97\u3089\u308C \u3010\u3048\u3089\u308C\u3011","text":"\u5F97\u3089\u308C","kana":"\u3048\u3089\u308C","score":240,"seq":11132474,"conj":[{"prop":[{"pos":"v1","type":"Continuative (~i)"}],"via":[{"prop":[{"pos":"v1","type":"Potential"},{"pos":"v1","type":"Passive"}],"reading":"\u5F97\u308B \u3010\u3048\u308B\u3011","gloss":[{"pos":"[v1,vt]","gloss":"to get; to earn; to acquire; to procure; to gain; to secure; to attain; to obtain; to win","info":"\u7372\u308B esp. refers to catching wild game, etc."},{"pos":"[v1,vt]","gloss":"to understand; to comprehend"},{"pos":"[v1,vt]","gloss":"to receive something undesirable (e.g. a punishment); to get (ill)"},{"pos":"[aux-v,v1,vt]","gloss":"to be able to ..., can ...","info":"after the -masu stem of a verb"}],"readok":true}],"readok":true}]},[]]],240]]]
Specifically, a seq number of 11132474. 得る has an edict number of 1588760, so I'm wondering if there is a way to extract or reveal the edict number of the dictionary form of conjugated terms.
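The root entry's own seq does not appear in this JSON (the seq values shown for conjugated tokens, such as 11132474 here, are far above normal JMdict seq ranges, which suggests they are generated internally). What is available is the dictionary-form 'reading' on the innermost conj or via entry, which can then be used to look the root entry up separately. A Python sketch of that walk, with the shape inferred from the output above:

```python
def root_readings(word_info):
    """Dictionary-form 'reading' strings reachable through the conj/via
    chain; for the 得られ output above this yields ['得る 【える】']."""
    found = []
    def walk(conjs):
        for c in conjs:
            if c.get("via"):
                walk(c["via"])        # indirect conjugation: recurse to the root
            elif c.get("reading"):
                found.append(c["reading"])
    walk(word_info.get("conj", []))
    return found
```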
I'm using Ichiran because it is, by far, the best parser/tokenizer when it comes to reasonable word boundaries in Japanese. It is so awesome! I have noticed, however, that there is a single area where it seems to underperform, and that is with katakana names. The behavior I see from other tokenizers (which I think would be the generally desired behavior) would be to prefer the longest names/words possible. It appears to me that Ichiran might be finding the shortest. Here's an example.
I would expect フレッド in おはよう、フレッド to tokenize to フレッド, not フ and レッド.
Hello, is this the right place to report segmentation issues? If so, here's one:
I can mostly find this expression (何だと) used before 思う (i.e. 何だと思う), but I think in principle it should also apply to other verbs that take と.
Example sentences:
http://tatoeba.org/eng/sentences/show/189060 [Shortened for clarity]
最良の方法は何だと思いますか。 - What do you think is the best way?
Expected segmentation: 最良 の 方法 は 何 だ と 思います か
Ichiran segmentation: 最良 の 方法 は 何だと 思います か
http://tatoeba.org/eng/sentences/show/81700
僕を何だと思っているのか。 - What do you take me for? / "What do you think I am."
Expected segmentation: 僕 を 何 だ と 思っている のか
Ichiran segmentation: 僕 を 何だと 思っている のか
My knowledge of Japanese is very limited, but when I was talking to a native speaker they said that 「1週間後」 means one week in the future, or one week later, and is pronounced "isshuukango".
But ichiran-cli gives me this output:
$ ichiran-cli -i '1週間後'
isshuukan ato
* isshuukan 1週間 【いっしゅうかん】
Value: 1 [duration]
1. [n-suf,ctr,n] week
* ato 後 【あと】
1. [n,adj-no] behind; rear
2. [n,adj-no] after; later
3. [n,adj-no] remainder; the rest
4. [n,adv] more (e.g. five more minutes); left
5. [n,adv] also; in addition
6. [n,adj-no] descendant; successor; heir
7. [n,adj-no] after one's death
8. [adj-no,n] past; previous
Links:
And I'm pretty sure the server is otherwise OK, because other searches work fine.
I looked at the DB dump in the latest release and noticed those two words are missing kana_text rows.
So, there was probably a bug with parsing JMDict.
The entry content clearly shows that there should be kana.
一箇年 (いっかねん)
<?xml version="1.0" encoding="UTF-8"?>
<entry>
<ent_seq>1161240</ent_seq>
<k_ele>
<keb>一箇年</keb>
</k_ele>
<r_ele>
<reb>いっかねん</reb>
<re_inf>ok</re_inf>
</r_ele>
<sense>
<pos>n</pos>
<gloss xml:lang="eng">one year</gloss>
</sense>
</entry>
堪へる (たへる)
<?xml version="1.0" encoding="UTF-8"?>
<entry>
<ent_seq>2209300</ent_seq>
<k_ele>
<keb>堪へる</keb>
</k_ele>
<r_ele>
<reb>たへる</reb>
<re_inf>ok</re_inf>
</r_ele>
<sense>
<pos>v1</pos>
<pos>vi</pos>
<pos>vt</pos>
<xref>堪える・1</xref>
<gloss xml:lang="eng">to bear</gloss>
<gloss xml:lang="eng">to stand</gloss>
<gloss xml:lang="eng">to endure</gloss>
<gloss xml:lang="eng">to put up with</gloss>
</sense>
</entry>
There might be other inconsistencies in the database (like entry.n_kana, entry.n_kanji, kana_text.nokanji, etc. being wrong) but I didn't check.
I'm using ichiran-cli with the -f argument to provide my scripts with the full segmentation of a sentence. I'm writing my own little parser for the returned JSON, but I'm having problems since there's lots of structure in it, and I don't know where to find everything or what exactly I should expect from it. Here is an example:
['itsuni', {'reading': '一に 【いつに】', 'text': '一に', 'kana': 'いつに', 'score': 128, 'seq': 1160930, 'gloss': [{'pos': '[adv]', 'gloss': 'solely; entirely; only; or'}], 'conj': []}, []]
Sometimes the 'gloss' list is empty, sometimes the 'conj' list is, and sometimes both, e.g. in the case of suffixes. I don't know where to find the tense in the case of verbs.
Where can I find an explanation of the structure of the JSON that is returned by ichiran-cli?
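I'm not aware of a written schema, but the structure can at least be inferred from the examples in these issues: the top level is a list of segments; a Japanese segment is a list of scored alternatives of the form [word_list, score], best first; and each word is [romaji, info, components]. A Python sketch of a walker built on those inferred assumptions:

```python
def best_words(result):
    """Flatten ichiran-cli -f output to (romaji, info) pairs for the
    top-scored segmentation. Nesting inferred from examples, not a schema."""
    for segment in result:
        if isinstance(segment, str):    # non-Japanese text passes through as a string
            continue
        word_list, _score = segment[0]  # alternatives appear best-first
        for romaji, info, _components in word_list:
            yield romaji, info

def glosses(info):
    """Glosses live either directly on the word (unconjugated forms) or
    inside each conj entry (conjugated forms)."""
    if info.get("gloss"):
        return info["gloss"]
    return [g for c in info.get("conj", []) for g in c.get("gloss", [])]
```

Tense-like information, when present, appears under info['conj'][i]['prop'][j]['type'].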
Hi,
Thanks for your work on ichiran.
Would it be possible to include more information about the root word for conjugated words in the JSON from ichiran-cli -f
? For my use case, ideally the text
, kana
and seq
fields from the root word would be included. For example
見て => 見る, みる, 1259290
観て => 観る, みる, 1259290
みて => みる, みる, 1259290
I'm looking to generate anki cards from sentences, using ichiran to detect the individual words in the sentence. Currently it seems tricky to programmatically determine that みてみる is really the same word twice, for example:
[
[
[
[
[
"mite",
{
"reading": "みて",
"text": "みて",
"kana": "みて",
"score": 40,
"seq": 10591144,
"conj": [
{
"prop": [
{
"pos": "v1",
"type": "Conjunctive (~te)"
}
],
"reading": "見る 【みる】",
"gloss": [
{
"pos": "[v1,vt]",
"gloss": "to see; to look; to watch; to view; to observe"
},
{
"pos": "[v1,vt]",
"gloss": "to examine; to look over; to assess; to check; to judge"
},
{
"pos": "[v1,vt]",
"gloss": "to look after; to attend to; to take care of; to keep an eye on"
},
{
"pos": "[v1,vt]",
"gloss": "to experience; to meet with (misfortune, success, etc.)"
},
{
"pos": "[aux-v,v1]",
"gloss": "to try ...; to have a go at ...; to give ... a try",
"info": "after the -te form of a verb"
},
{
"pos": "[aux-v,v1]",
"gloss": "to see (that) ...; to find (that) ...",
"info": "as 〜てみると, 〜てみたら, 〜てみれば, etc."
}
],
"readok": true
}
]
},
[]
],
[
"miru",
{
"reading": "みる",
"text": "みる",
"kana": "みる",
"score": 40,
"seq": 1259290,
"gloss": [
{
"pos": "[v1,vt]",
"gloss": "to see; to look; to watch; to view; to observe"
},
{
"pos": "[v1,vt]",
"gloss": "to examine; to look over; to assess; to check; to judge"
},
{
"pos": "[v1,vt]",
"gloss": "to look after; to attend to; to take care of; to keep an eye on"
},
{
"pos": "[v1,vt]",
"gloss": "to experience; to meet with (misfortune, success, etc.)"
},
{
"pos": "[aux-v,v1]",
"gloss": "to try ...; to have a go at ...; to give ... a try",
"info": "after the -te form of a verb"
},
{
"pos": "[aux-v,v1]",
"gloss": "to see (that) ...; to find (that) ...",
"info": "as 〜てみると, 〜てみたら, 〜てみれば, etc."
}
],
"conj": []
},
[]
]
],
80
]
]
]
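One heuristic for treating both tokens above as the same word is to compare kana: for conjugated tokens, take the kana from the 【】 part of the conj 'reading'; otherwise use the token's own 'reading'. With the output above, both みて and みる then map to みる. This is a sketch of a caller-side workaround based on the reading format shown, not an ichiran API:

```python
import re

def kana_key(word_info):
    """Heuristic 'same dictionary word' key: the kana of the root entry.
    For the mite token above the conj reading is '見る 【みる】' -> 'みる';
    for the miru token the reading is already 'みる'."""
    def kana_of(reading):
        m = re.search(r"【(.+?)】", reading)
        return m.group(1) if m else reading
    for conj in word_info.get("conj", []):
        if conj.get("reading"):
            return kana_of(conj["reading"])
    return kana_of(word_info.get("reading", ""))
```

Kana collisions between distinct words are possible, so this is a grouping heuristic rather than an exact identity.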
Hello!
Is it somehow possible to print the queries that the CLI does when segmenting a sentence? I wrote a Rust implementation based on SQLite and want to verify that I properly query my underlying database.
I have no Lisp experience and reading the source code is really hard for me, to be honest...
I just wrote on your blog, but I couldn't find your email. This is to let you know how you can answer me.
Hi again!
I recently came across a tweet with the following text:
もーすぐ できるよン お皿だしといてー
https://twitter.com/ktandoku/status/1350350938347696130 (fully SFW)
Looks like on ichi.moe, だしといて (and the uncontracted だしておいて too) is incorrectly segmented as だ + しといて instead of a compound word だしといて.
On the other hand, using the kanji form of だす (出しといて and 出しておいて) works fine.
Test string: それはすごいね
Command used: ./ichiran-cli -f "それはすごいね" (using the 202201 dictionary dump)
Results: [[[[["sore wa",{"reading":"\u305D\u308C\u306F","text":"\u305D\u308C\u306F","kana":"\u305D\u308C\u200B\u200C\u306F","score":144,"seq":2134680,"gloss":[{"pos":"[adv]","gloss":"very; extremely"},{"pos":"[exp]","gloss":"that is"}],"conj":[]},[]],["sugoi",{"reading":"\u3059\u3054\u3044","text":"\u3059\u3054\u3044","kana":"\u3059\u3054\u3044","score":144,"seq":1374550,"gloss":[{"pos":"[adj-i]","gloss":"terrible; dreadful"},{"pos":"[adj-i]","gloss":"amazing (e.g. of strength); great (e.g. of skills); wonderful; terrific"},{"pos":"[adj-i]","gloss":"to a great extent; vast (in numbers)"},{"pos":"[adv]","gloss":"awfully; very; immensely"}],"conj":[]},[]],["ne",{"reading":"\u306D","text":"\u306D","kana":"\u306D","score":16,"seq":2029080,"gloss":[{"pos":"[prt]","gloss":"right?; isn't it?; doesn't it?; don't you?; don't you think?","info":"at sentence end; used as a request for confirmation or agreement"},{"pos":"[int]","gloss":"hey; say; listen; look; come on"},{"pos":"[prt]","gloss":"you know; you see; I must say; I should think","info":"at sentence end; used to express one's thoughts or feelings"},{"pos":"[prt]","gloss":"will you?; please","info":"at sentence end; used to make an informal request"},{"pos":"[prt]","gloss":"so, ...; well, ...; you see, ...; you understand?","info":"at the end of a non-final clause; used to draw the listener's attention to something"},{"pos":"[prt]","gloss":"I'm not sure if ...; I have my doubts about whether ...","info":"at sentence end after the question marker \u304B"}],"conj":[]},[]]],304]]]
More specifically, note the values of the first term:
"reading":"\u305D\u308C\u306F",
"text":"\u305D\u308C\u306F",
"kana":"\u305D\u308C\u200B\u200C\u306F"
The 3rd and 4th characters on kana do not appear in this text box, but do appear when viewing on a site such as https://jsonformatter.org/json-pretty-print
They appear to be a "zero width space" and "zero width non-joiner"
You noted in another post that you're not working actively on the project at the moment, but I wanted to note this in case you have a chance to look at it!
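Until the zero-width characters are removed upstream, stripping them on the caller's side is straightforward. A small Python helper covering the two characters observed above:

```python
def clean_kana(kana):
    """Strip the zero-width characters observed in the 'kana' field:
    U+200B (zero width space) and U+200C (zero width non-joiner)."""
    return kana.translate({0x200B: None, 0x200C: None})
```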
Edict, Kanjidic2, jmdict-data, quicklisp and ichiran pulled from the Net yesterday.
Did full-init.
Had to comment out 2209300 additions in the errata, because the entire entry was deleted in jmdict. Then applied errata again.
macOS 13.6.1 Intel, Postgres and SBCL installed through Brew.
Results:
Unit Test Summary
| 707 assertions total
| 676 passed
| 31 failed
| 2 execution errors
| 0 missing tests
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("猫" "は" "しっぽ" "を" "ぴんと" "立てて" "歩いた")
| but saw ("猫" "は" "しっぽ" "を" "ぴんと立てて" "歩いた")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("わかりきった") but saw ("わ" "かりきった")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("おとめ" "に" "ふさわしい" "振る舞い") but saw ("お" "とめ" "に" "ふさわしい" "振る舞い")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("折りたたみ" "式" "ついたて") but saw ("折りたたみ" "式" "ついた" "て")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("使い物" "に" "ならん" "だろ") but saw ("使い" "物にならん" "だろ")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("雪" "が" "ない" "ため") but saw ("雪" "が" "な" "いため")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("バラしちゃってる") but saw ("バラ" "しちゃってる")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("何も" "口" "に" "せぬ") but saw ("何も" "口" "にせぬ")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("工夫" "が" "される") but saw ("工夫" "がされる")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("だめ" "だったら") but saw ("だ" "めだったら")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("彼女" "は" "苦しげ" "に" "うめいて" "横たわった")
| but saw ("彼女" "は" "苦しげ" "に" "うめ" "いて" "横たわった")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("共感" "性") but saw ("共感性")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("それ" "ただ" "の" "怪しい" "人" "です" "し")
| but saw ("それた" "だの" "怪しい" "人" "です" "し")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("出したい" "とき" "は") but saw ("出した" "いと" "き" "は")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("旅行" "に" "いきたい") but saw ("旅行" "にい" "きたい")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("しない" "かい") but saw ("し" "ないかい")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("てか" "最近" "ファン" "層" "は" "円盤" "すら" "買わない" "から" "そいつら" "から" "金" "とる"
"ってのは" "無謀")
| but saw ("てか" "最近" "ファン層" "は" "円盤" "すら" "買わない" "から" "そいつら" "から" "金" "とる" "ってのは"
"無謀")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("なんというか" "すみません") but saw ("なんという" "かすみません")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("そう" "したい" "から" "した" "だけ" "だ") but saw ("そうした" "いからした" "だけ" "だ")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("手にとって" "いただき" "やすくなる") but saw ("手にとっていた" "だ" "きやすくなる")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("大事" "に" "なります") but saw ("大" "事になります")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("奴" "が" "まとも" "に" "見られない") but saw ("奴" "が" "まともに" "見られない")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("といった" "ところ" "でしょうか") but saw ("と" "いった" "ところ" "でしょうか")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("言い方" "も" "します") but saw ("言い方" "もします")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("届け" "したら") but saw ("届" "けしたら")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("全く" "と" "いって" "いい") but saw ("全く" "と" "いっていい")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("仲良し" "に" "なったら") but saw ("仲良し" "になったら")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("体" "に" "悪い" "と" "知り" "ながら" "タバコをやめる" "こと" "は" "できない")
| but saw ("体に悪い" "と" "知り" "ながら" "タバコをやめる" "こと" "は" "できない")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("雨" "が" "降りそう" "な" "気がします") but saw ("雨が降りそう" "な" "気がします")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("そういう" "お" "隣" "どうし") but saw ("そういう" "お" "隣どうし")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("みんな" "土足で" "おいで") but saw ("みんな" "土足で" "おい" "で")
|
SEGMENTATION-TEST: 451 assertions passed, 31 failed, and an execution error.
| Execution error:
| Database error 42P01: relation "kanji" does not exist
QUERY: (SELECT r.text, r.type FROM kanji AS k INNER JOIN reading AS r ON (r.kanji_id = k.id) WHERE ((k.text = E'取') and (not (r.type IN (E'ja_na')))))
|
MATCH-READINGS-TEST: 0 assertions passed, 0 failed, and an execution error.
| Execution error:
| Database error 42P01: relation "kanji" does not exist
QUERY: (SELECT r.text, r.type FROM kanji AS k INNER JOIN reading AS r ON (r.kanji_id = k.id) WHERE ((k.text = E'気') and (not (r.type IN (E'ja_na')))))
|
SEGMENTATION-TEST: 451 assertions passed, 31 failed, and an execution error.
#<TEST-RESULTS-DB Total(707) Passed(676) Failed(31) Errors(2)>
Requesting a flag which turns the input of a sentence like: "昨日すき焼きを食べました" into "昨日;すき焼き;を;食べる".
Needing this for automation with a command-line dictionary. -i and -f already kinda do this, but getting the dictionary forms from their output is a bit of a hassle, and they take quite a bit longer to execute than a normal 'simple-segment' because of the included definitions (at least that's what I'm guessing). Also, this seems like a common enough use case to warrant a flag.
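Until such a flag exists, the dictionary forms can be recovered by post-processing the -f JSON: take the headword before the 【かな】 in the root conj 'reading' (following 'via' chains), falling back to the token's own text for unconjugated tokens. A Python sketch under those assumptions, with the reading format as in other outputs in these issues:

```python
import re

def dict_form(word_info):
    """Best-effort dictionary form for one token: headword of the root conj
    reading (e.g. '食べる 【たべる】' -> '食べる'), else the token's own text.
    A post-processing sketch, not an ichiran-cli flag."""
    conjs = word_info.get("conj", [])
    while conjs:
        c = conjs[0]
        if c.get("via"):
            conjs = c["via"]           # follow indirect conjugation to the root
            continue
        reading = c.get("reading")
        if reading:
            return re.sub(r"\s*【.+?】", "", reading)
        break
    return word_info.get("text", "")
```

Joining the results with ";" then gives output in the requested 昨日;すき焼き;を;食べる shape.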
This is really a good project; may I know if the website https://ichi.moe/ is open source?
The publicly available conjo.csv is incorrect, so it would be convenient to include the patched version in this project.
Need to check if JMdict data files have a compatible license. Otherwise upload them somewhere else.
Hi,
(this doesn't really belong in a bug report but I'd still like to take a second to say that what you've done here is fabulous, amazing, and incredibly helpful. Thank you!).
I'm not sure I understand completely what goes on in romanize.lisp, but under certain circumstances, it ends up merging an "o" and a "u" that it shouldn't. This issue is mentioned here and 追う is given as an example. The correct reading of 追 is お, so that in hiragana, the word comes out as おう. This transformation is lossy/ambiguous, however: Here, お and う are pronounced separately, in contrast to 王, which, too, is romanised as おう but pronounced as a long お. To romanise 追う as ō is misleading, I think.
I believe that the general rule (and this might make for an easy fix) is: Merging of お and う cannot occur across kanji boundaries. In the presence of kanji, the breakup into hiragana and merging of お and う needs to occur before those tokens are thrown together.
Since I'm not a native speaker (quite the opposite), I checked forvo.com and found a recording that supports the claim that お and う are not joined in 追う: In the recording by the user strawberrybrown, the お and the う can be made out quite distinctly. In contrast, I found a few examples of もう, ぽう, ちょう, and どう that she pronounces as mō, pō, chō, and dō, respectively, just as expected. Which is to say, this user does not generally pronounce お and う sounds separately (as could be the case in a dialect, maybe?) but only when they're really meant to be separate.
There is another recording by the user smime in the same place as linked to earlier where the pronunciation of 追う is more difficult to make out, which corresponds to casual speaking.
Finally, please see also wiktionary for romaji of 追う and 王.
Update: 子牛 is another example that showcases this problem. The romanisation is currently incorrectly given as kōshi.
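The boundary rule proposed above can be sketched in a few lines. This is a toy illustration only, with a tiny stand-in kana table and hand-made segmentations; ichiran's actual romanization tables and data structures are not reproduced here. Long-vowel merging is applied inside each orthographic unit (one kanji's reading, or a run of okurigana), never across units.

```python
# Stand-in kana-to-romaji table, not ichiran's real one.
KANA2ROMAJI = {"お": "o", "う": "u", "し": "shi", "こ": "ko"}

def unit_to_romaji(reading):
    """Romanize one unit, merging o+u / u+u into a macron vowel."""
    out = []
    for kana in reading:
        s = KANA2ROMAJI[kana]
        if out and s == "u" and out[-1][-1] in "ou":
            # Long vowel inside a single unit: o+u -> ō, u+u -> ū.
            out[-1] = out[-1][:-1] + {"o": "ō", "u": "ū"}[out[-1][-1]]
        else:
            out.append(s)
    return "".join(out)

def romanize(units):
    """units: list of (surface, reading) pairs from segmentation."""
    return "".join(unit_to_romaji(reading) for _, reading in units)

# 追う: お and う belong to different units, so no merging.
print(romanize([("追", "お"), ("う", "う")]))    # -> ou
# 王: おう inside a single unit merges to a long vowel.
print(romanize([("王", "おう")]))                # -> ō
# 子牛: こ / うし sit across a kanji boundary, so "koushi", not "kōshi".
print(romanize([("子", "こ"), ("牛", "うし")]))  # -> koushi
```

With this split, the 追う and 子牛 cases fall out correctly while 王 still gets its long vowel.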
Hiya,
I noticed that there doesn't seem to be support for the [verb masu stem] + がい structure; ichi.moe instead tries to parse this as everything except 甲斐 😅
Here's an example from the 転生したら剣でした manga chapter 58. This chapter is available on sites like klmanga[.]com if more context is needed.
...だが
それでこそ
狩りがいのある獣よ!
It would be really awesome if ichi.moe could parse this structure somehow. One quick fix I thought of was to treat the whole がいのある as a variant of 甲斐がある so that JMDict's entry for 甲斐がある appears. But then again, I realize that several other structures involving かい/がい also exist and aren't parsed properly by ichi.moe, so there are likely far more robust approaches than my suggestion here.
Thank you :)
Hi,
Thanks for your work on ichiran.
I was able to prepare the ichiran database from the dump, following the instructions in the blog post (https://readevalprint.tumblr.com/post/639359547843215360/ichiranhome-2021-the-ultimate-guide), and all the tests passed. But when I built the CLI and tried to use it, it could not find the database tables.
❯ ./ichiran-cli "こんばんは"
ERROR: Database error 42P01: relation "kana_text" does not exist
QUERY: (SELECT * FROM kana_text WHERE (text IN (E'は', E'んは', E'ん', E'ばんは', E'ばん', E'ば', E'んばんは', E'んばん', E'んば', E'ん', E'こんばんは', E'こんばん', E'こんば', E'こん', E'こ')))
Since the tests passed and I could confirm with psql that the relation does exist, it appeared to me that the CLI might be using some default connection details instead of the ones I gave in the settings.lisp file. I tried removing the following lines, which fixed the issue:
Lines 95 to 96 in 3c8891f
I don't know the purpose of these lines, so I decided to make an issue instead of a PR removing them.
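If removing those lines turns out to be undesirable, the other direction is to make sure settings.lisp pins the connection explicitly so the CLI cannot fall back to a default. A sketch, with the package and variable name assumed from the repo's settings template rather than quoted from it; verify against your checkout:

```lisp
;; settings.lisp -- assumed shape; check against settings.lisp.template.
(in-package :ichiran/conn)

;; (database user password host)
(defparameter *connection* '("jmdict0122" "postgres" "password" "localhost"))
```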
Tested on ichi.moe, but the same thing happens on the cli version.
Input: やってもいい
The kana/romaji is correct, but the breakup (compound) only shows やって and いい.
Interestingly, the definition given for いい is "it's ok if ... / is it ok if ...?", which suggests that at some layer, the system is looking at the てもいい structure.
Link: https://ichi.moe/cl/word/?q=%E3%82%84%E3%81%A3%E3%81%A6%E3%82%82%E3%81%84%E3%81%84
I'd like to integrate segmentation into an app I wrote. As far as I know, it's not possible to run Lisp on Android.
Is there some kind of documentation that explains how the segmentation works? Even if it takes more time, I'd like to port it over to Kotlin.
Postgres and Lisp aren't my main wheelhouse, so this is a suggestion for a note that may help other noobs setting up:
Both
https://github.com/tshatrov/ichiran/releases and
https://readevalprint.tumblr.com/post/639359547843215360/ichiranhome-2021-the-ultimate-guide
mention database_name, which made me think I could set it to whatever I wanted (e.g. ichiran), but it looks like the dump specifies it as jmdict0122, so I was getting errors at (ichiran/mnt:add-errata). The rest worked completely as expected once I put jmdict0122 in settings.lisp.
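For anyone hitting the same thing, a setup sketch that keeps the name the dump expects. The dump file name and the locale are assumptions here; substitute whatever you downloaded from the releases page and whatever Japanese locale your system actually provides.

```shell
# Create the database under the expected name, then restore into it.
# "ichiran.pgdump" and ja_JP.utf8 are assumptions; adjust as needed.
createdb -U postgres -E 'UTF8' -l 'ja_JP.utf8' -T template0 jmdict0122
pg_restore -U postgres -d jmdict0122 ichiran.pgdump
```

The same name then goes into settings.lisp, so the Lisp side and the restored database agree.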
I really appreciate all the work you've put into this! Thank you!
When I run ichiran-cli -h (or any other command) I get this:
Internal error #11 "Object is of the wrong type." at 0000000021ecb191
SC: 0, Offset: 9 $1= 0x04f8a09f: other pointer
SC: 3, Offset: 14 $2= 0x00003627: list pointer
fatal error encountered in SBCL pid 722019620:
internal error too early in init, can't recover
Even though there was no indication of errors in the previous steps: (ichiran/mnt:add-errata) performed just fine, and (ichiran/test:run-all-tests) returns Passed(748) Failed(0) Errors(0).
Also, when using ichiran from SBCL, the first request after restarting SBCL and running (ql:quickload :ichiran) always yields an error, but all the following ones (even the same request repeated) work as expected.
I'm running on Windows 10 with SBCL 2.3.0.