allan-simon / sinoparserd

A service to convert Chinese languages (Mandarin, Cantonese, Shanghainese...) into their transliterated form, to segment them, etc.

License: Other

C++ 57.89% C 41.64% CMake 0.46%

sinoparserd's Issues

Script incorrectly detected as simplified

Tommy spotted an error in sentence #3718946: it’s detected as simplified but should be displayed as traditional. It contains only characters that are used in both traditional and simplified scripts.

@allan-simon Can you think of a way to solve that problem?
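
One possible direction, sketched here only as an illustration, is to make the detector three-valued, so that a sentence made only of shared characters is reported as ambiguous instead of defaulting to simplified. The names `Script`, `detectScript`, `kSimplifiedOnly` and `kTraditionalOnly` are hypothetical and are not taken from sinoparserd; the two tables are tiny samples that a real implementation would build from its conversion data.

```cpp
#include <set>
#include <string>

// Hypothetical result type: "Ambiguous" means every character is valid
// in both scripts, so the caller has to decide from other information.
enum class Script { Simplified, Traditional, Ambiguous };

// Hypothetical sample tables of characters that exist in only one script.
const std::set<char32_t> kSimplifiedOnly = {U'们', U'编', U'书'};
const std::set<char32_t> kTraditionalOnly = {U'們', U'編', U'書'};

// Returns the script of the first script-specific character found,
// or Ambiguous when the text contains none.
Script detectScript(const std::u32string& text) {
    for (char32_t c : text) {
        if (kSimplifiedOnly.count(c)) return Script::Simplified;
        if (kTraditionalOnly.count(c)) return Script::Traditional;
    }
    return Script::Ambiguous;
}
```

With such a result, the caller could fall back to a user preference or a per-sentence flag when the answer is `Script::Ambiguous`.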

Index.h points to the wrong tato header file

Index.h should point to TatoTreeStr.h instead of tree_str.h in order for sinoparserd to compile.

Fix 唔 and 冇 in Jyutping

sarefo reported that 冇 is mou5, not mou2, and 唔 is m4, not ng4.

Errors in simplified/traditional conversion

verdastelo9604 reported some errors in the simplified/traditional characters converter.

周 is not converted to 週 from simplified to traditional;
著 is not converted to 着 from traditional to simplified when used as a particle;
甚 in 甚麼 is not converted to 什 from traditional to simplified;
里 is sometimes converted to 裏 from simplified to traditional; in the Taiwan standard it should be 裡.
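
Cases such as 甚麼 depend on the surrounding characters, so a plain character-to-character table cannot handle them. A minimal sketch of one common remedy, assuming a phrase table that is consulted before the character table, is shown below. The tables and the name `toSimplified` are hypothetical samples and do not reflect how sinoparserd stores its data.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical sample tables (traditional -> simplified).
const std::map<std::u32string, std::u32string> kPhrases = {{U"甚麼", U"什么"}};
const std::map<char32_t, char32_t> kChars = {{U'麼', U'么'}, {U'們', U'们'}};

// Assumed maximum phrase length, in characters.
const std::size_t kMaxPhraseLength = 4;

std::u32string toSimplified(const std::u32string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        bool matched = false;
        // Try the longest phrase first, down to two characters.
        for (std::size_t len = std::min(kMaxPhraseLength, in.size() - i);
             len >= 2; --len) {
            auto it = kPhrases.find(in.substr(i, len));
            if (it != kPhrases.end()) {
                out += it->second;
                i += len;
                matched = true;
                break;
            }
        }
        if (!matched) {
            // Fall back to the single-character table, or copy as-is.
            auto it = kChars.find(in[i]);
            out += (it != kChars.end()) ? it->second : in[i];
            ++i;
        }
    }
    return out;
}
```

The particle use of 著 cannot be decided by a phrase table alone and would need grammatical context, which this sketch does not attempt.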

Memory leaks

Sinoparserd’s memory usage grows little by little every day and ends up reaching several gigabytes. As a result, we need to restart it once in a while.

Run

Here is my command (I'm using your default files):

./sinoparserd -m doc/mandarin.xml

Requests

And here is what I got when hitting http://localhost:8080/trad?str=*:

<root>
<trad><![CDATA[*]]></trad>
</root>

I was expecting all trad entries matching the * glob. I got similar behavior with /pinyin?str=*

<root>
<romanization><![CDATA[*]]></romanization>
</root>

Doesn't compile on x64 architecture

The references to the C++ libraries are all hardcoded paths; pkg-config should be used instead.

Does sinoparserd support character segmentation ?

Your application was recommended as a good Chinese segmenter, yet the only segmentation that seems available is in the <romanization> element (space separated words):

<romanization>ren2ren2 ke3 bian1ji2 de5 zi4you2 bai3ke1quan2shu1</romanization>
<alternateScript>人人可编辑的自由百科全书</alternateScript>

As you already seem to be able to segment, why not provide an API or an option to do it on Chinese script?

review tokenizer algorithm

Right now we're doing a simple greedy algorithm that does not look backward.

Instead we should:

1 - try to get all possible tokenizations

2 - apply filtering rules to them to remove impossible combinations (based on an "impossible tokenization" rules file?)

3 - weight the remaining tokenizations using a weighting function (maybe step 2 can be merged with this step, given that 'impossible' tokenizations would get a very low score)

4 - keep the highest score (add a mechanical arm to the server to flip a coin for tokenizations with the same weight; you see, I've thought about everything, smartass)
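
The four steps above can be sketched as follows. This is only an illustration under stated assumptions: the dictionary is a plain set of words, the scoring function is supplied by the caller, and filtering is merged into scoring as step 3 suggests. The names `Tokens`, `enumerate`, `fewerTokensScore` and `bestTokenization` are hypothetical, and `fewerTokensScore` is a placeholder weighting, not the one the project intends to use.

```cpp
#include <cstddef>
#include <functional>
#include <limits>
#include <set>
#include <string>
#include <vector>

using Tokens = std::vector<std::u32string>;

// Step 1: collect every way to split `text` (from `pos`) into dictionary words.
void enumerate(const std::u32string& text, std::size_t pos,
               const std::set<std::u32string>& dict,
               Tokens& current, std::vector<Tokens>& out) {
    if (pos == text.size()) {
        out.push_back(current);
        return;
    }
    for (std::size_t len = 1; pos + len <= text.size(); ++len) {
        std::u32string word = text.substr(pos, len);
        if (!dict.count(word)) continue;
        current.push_back(word);
        enumerate(text, pos + len, dict, current, out);
        current.pop_back();
    }
}

// Placeholder weighting (an assumption): prefer tokenizations with fewer tokens.
double fewerTokensScore(const Tokens& t) {
    return t.empty() ? 0.0 : 1.0 / static_cast<double>(t.size());
}

// Steps 2 to 4: score every candidate and keep the best one.
// "Impossible" candidates are expected to receive a very low score.
// On a tie the first candidate found wins (no coin flip here).
Tokens bestTokenization(const std::u32string& text,
                        const std::set<std::u32string>& dict,
                        const std::function<double(const Tokens&)>& score) {
    std::vector<Tokens> all;
    Tokens current;
    enumerate(text, 0, dict, current, all);

    Tokens best;
    double bestScore = -std::numeric_limits<double>::infinity();
    for (const Tokens& candidate : all) {
        double s = score(candidate);
        if (s > bestScore) {
            bestScore = s;
            best = candidate;
        }
    }
    return best;
}
```

Enumerating every tokenization is exponential in the worst case, so a real implementation would probably bound the sentence length or switch to dynamic programming over the same scoring function.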

create a rules file for n-grams of token

A rules file will be a list of rules (ordered?), described by a loose BNF grammar (I need to review my language theory lessons...). Note that right now I don't specify how it's going to be written: XML, JSON, whatever.

 rule   ->  anchor tokens
 anchor ->  TOKEN
 tokens ->  TOKEN tokens
         |  TOKEN

TODO: complete TOKEN description in a bnf or xslt way

The anchor will be the token that needs to be matched in order to trigger the rule.

if a TOKEN is present, it must match a token in the data given as input
a TOKEN is a set of key -> values (note the plural: a key may have several values)
a rule can be considered as matched or not matched (one MAY implement other values such as 'partially matched')
a rule MUST be considered as matched only if all the TOKENs in it are matched with tokens in the data given as input; if not, the rule MAY be considered as not matched (as one may have implemented 'partially matched')
a TOKEN that does not specify a key is considered as matching any value of that key in an input token (it means that a TOKEN does not need to specify all the keys a token has)
a TOKEN cannot match a token in the data given as input that is already matched by another TOKEN
a TOKEN is considered as matched if each of its keys has ONE of its values equal to the value of that key in one of the tokens in the data given as input
a TOKEN MAY have a special key id that should be unique among the other tokens of one rule (just so that we can make a reference from one token to another inside a rule); however, the key id MUST NOT be used for any other purpose, and it MUST be 0 or anchor for the TOKEN that is supposed to represent the anchor of that rule
a TOKEN MAY have a special key proxymity whose value is a list of pairs of the following form (if it is not used as described, it MUST NOT be present):
- from (the id of another token)
- distance, the distance (starting at 1 / -1) between the current token and the one referenced by from, with the following possible values (note that the syntax is chosen to make it easy to parse with a simple split, a conversion to int and a one-byte read):
  - an integer, a positive one meaning that the from token should be "before", and a negative one meaning that the from token should be after
  - [optional] * to mean any distance is valid
  - [optional] X+ to mean 'X' or more (as an absolute value)
  - [optional] X- to mean 'X' or less (as an absolute value)
  - [optional] X|Y to mean 'between X and Y included'
  - [optional] X,Y,Z to mean either X, Y or Z
  - if the reader does not understand a value, it MUST consider it as meaning any distance
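
The distance syntax above could be checked along these lines. This is a sketch under assumptions, not a settled design: the function names `parseInt`, `split` and `distanceMatches` are hypothetical, X|Y is read as an inclusive range on the signed distance, and any value that cannot be parsed is treated as "any distance", as the last rule requires.

```cpp
#include <cstddef>
#include <cstdlib>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Parses the whole string as a signed integer; returns false otherwise.
bool parseInt(const std::string& s, int& out) {
    if (s.empty()) return false;
    try {
        std::size_t used = 0;
        int value = std::stoi(s, &used);
        if (used != s.size()) return false;
        out = value;
        return true;
    } catch (const std::exception&) {
        return false;
    }
}

std::vector<std::string> split(const std::string& s, char sep) {
    std::vector<std::string> parts;
    std::stringstream stream(s);
    std::string part;
    while (std::getline(stream, part, sep)) parts.push_back(part);
    return parts;
}

// Returns true when `distance` satisfies the distance value `spec`.
bool distanceMatches(const std::string& spec, int distance) {
    int x = 0, y = 0;
    if (spec == "*") return true;                       // any distance
    if (parseInt(spec, x)) return distance == x;        // exact, signed
    if (spec.size() > 1 &&
        parseInt(spec.substr(0, spec.size() - 1), x)) { // X+ or X-
        if (spec.back() == '+') return std::abs(distance) >= x;
        if (spec.back() == '-') return std::abs(distance) <= x;
    }
    if (spec.find('|') != std::string::npos) {          // X|Y range
        std::vector<std::string> p = split(spec, '|');
        if (p.size() == 2 && parseInt(p[0], x) && parseInt(p[1], y))
            return x <= distance && distance <= y;
    }
    if (spec.find(',') != std::string::npos) {          // X,Y,Z alternatives
        bool valid = true, hit = false;
        for (const std::string& p : split(spec, ',')) {
            if (!parseInt(p, x)) { valid = false; break; }
            if (distance == x) hit = true;
        }
        if (valid) return hit;
    }
    return true;  // value not understood: treat as any distance
}
```

Note that a leading minus sign (as in -2) is parsed as a negative integer, while a trailing one (as in 2-) means "2 or less", which is why the exact-integer check runs first.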

Pinyin romanization doesn’t convert punctuation characters

There is some code to convert Chinese punctuation to Latin, but it’s not working because all the pairs contain multibyte characters as keys, while the input string is segmented on bytes, not characters. It’s not trivial to fix since the C++ standard library offers no convenient way to iterate over UTF-8 characters.

Reproduce with:

$ curl http://127.0.0.1:8042/pinyin?str=`php -r 'print urlencode("？。、");'`
<?xml version="1.0" encoding="UTF-8"?>
<root>
<romanization><![CDATA[？。、]]></romanization>
</root>

Expected: <romanization><![CDATA[?.,]]></romanization>
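
One way to get the expected result without a full Unicode library is to walk the input one UTF-8 sequence at a time and look up each whole sequence in the punctuation table. The sketch below assumes valid UTF-8 input; the names `utf8Length` and `convertPunctuation` are hypothetical, and the table contains only the three characters from the reproduction above.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Number of bytes in the UTF-8 sequence starting with `lead`.
// Invalid lead bytes are treated as single bytes (an assumption).
std::size_t utf8Length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// Replaces known Chinese punctuation with its Latin counterpart,
// copying every other character unchanged.
std::string convertPunctuation(const std::string& in) {
    static const std::map<std::string, std::string> table = {
        {"\xEF\xBC\x9F", "?"},  // U+FF1F fullwidth question mark
        {"\xE3\x80\x82", "."},  // U+3002 ideographic full stop
        {"\xE3\x80\x81", ","},  // U+3001 ideographic comma
    };
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        std::size_t len = utf8Length(static_cast<unsigned char>(in[i]));
        if (len > in.size() - i) len = in.size() - i;  // truncated input
        std::string character = in.substr(i, len);
        auto it = table.find(character);
        out += (it != table.end()) ? it->second : character;
        i += len;
    }
    return out;
}
```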

Source of `mandarin.xml` and `cantonese.xml`

This came up in Tatoeba/tatoeba2#2189. There are a few pinyin errors in mandarin.xml that I currently plan to fix by using CC-CEDICT as the source instead, but maybe they have been corrected in the original source of mandarin.xml as well.

@allan-simon do you remember where you got those files from?