allan-simon / sinoparserd
A service to convert Chinese languages (Mandarin, Cantonese, Shanghainese...) into their transliterated form, to segment them, etc.
License: Other
Tommy spotted an error for sentence #3718946: it’s detected as simplified but should be displayed as traditional. It only contains characters used in both traditional and simplified.
@allan-simon Can you think of a way to solve that problem?
Index.h should point to TatoTreeStr.h instead of tree_str.h in order for sinoparserd to compile.
sarefo reported that 冇 is mou5, not mou2, and 唔 is m4, not ng4.
verdastelo9604 reported some errors in the simplified/traditional character converter:
周 is not converted to 週 from simplified to traditional;
著 is not converted to 着 from traditional to simplified when used as a particle;
甚 in 甚麼 is not converted to 什 from traditional to simplified;
里 is sometimes converted to 裏 from simplified to traditional; it should be 裡 in the Taiwan standard.
Sinoparserd’s memory usage grows little by little every day and ends up using several gigabytes. As a result, we need to restart it once in a while.
Here is my command (I'm using your default files):
./sinoparserd -m doc/mandarin.xml
And here is what I got when hitting http://localhost:8080/trad?str=*:
<root>
<trad><![CDATA[*]]></trad>
</root>
I was expecting all trad matching the * glob. I got similar behavior with /pinyin?str=*:
<root>
<romanization><![CDATA[*]]></romanization>
</root>
The references to the C++ libs are all hardcoded paths; pkg-config should be used instead.
Your application was recommended as a good Chinese segmenter, yet the only segmentation that seems available is in the <romanization> element (space-separated words):
<romanization>ren2ren2 ke3 bian1ji2 de5 zi4you2 bai3ke1quan2shu1</romanization>
<alternateScript>人人可编辑的自由百科全书</alternateScript>
As you already seem to be able to segment, why not provide an API or an option to do it on Chinese scripts?
Right now we're doing a simple greedy algorithm that does not look backward.
Instead we should:
1 - try to get all possible tokenizations
2 - apply filtering rules to them to remove impossible combinations (based on an "impossible tokenization" rules file?)
3 - weight the remaining tokenizations using a weighting function (maybe step 2 can be mixed with this step, assuming that 'impossible' stuff would get a very low score)
4 - keep the highest score (add a mechanical arm to the server to flip a coin for tokenizations with the same weight, you see I've thought about everything, smartass)
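The four steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not sinoparserd's code: the word list `dict`, the `isImpossible` filter and the `score` function are hypothetical placeholders supplied by the caller, tokens are plain byte strings, and ties simply keep the first candidate found instead of flipping a coin. Enumerating every tokenization is exponential in the worst case, so a real implementation would probably prune or use dynamic programming.

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <vector>

using Tokenization = std::vector<std::string>;

// Step 1: enumerate every way to split `text` into words found in `dict`.
// `dict` is a hypothetical word list standing in for the real index.
void enumerate(const std::string& text, std::size_t pos,
               const std::set<std::string>& dict,
               Tokenization& current, std::vector<Tokenization>& out) {
    if (pos == text.size()) {
        out.push_back(current);
        return;
    }
    for (std::size_t len = 1; pos + len <= text.size(); ++len) {
        const std::string piece = text.substr(pos, len);
        if (dict.count(piece) == 0) continue;
        current.push_back(piece);
        enumerate(text, pos + len, dict, current, out);
        current.pop_back();
    }
}

// Steps 2 to 4: drop candidates rejected by the filter, score the rest,
// and keep the highest score. Ties keep the first candidate found.
Tokenization bestTokenization(
    const std::string& text,
    const std::set<std::string>& dict,
    const std::function<bool(const Tokenization&)>& isImpossible,
    const std::function<double(const Tokenization&)>& score) {
    std::vector<Tokenization> candidates;
    Tokenization current;
    enumerate(text, 0, dict, current, candidates);

    Tokenization best;
    bool found = false;
    double bestScore = 0.0;
    for (const Tokenization& t : candidates) {
        if (isImpossible(t)) continue;  // step 2
        const double s = score(t);      // step 3
        if (!found || s > bestScore) {  // step 4
            best = t;
            bestScore = s;
            found = true;
        }
    }
    return best;  // empty when no tokenization survives
}
```

Folding step 2 into step 3, as suggested above, would amount to having `score` return a very low value for "impossible" candidates and passing a filter that never rejects anything.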
A rules file will be a list of rules (ordered?). Here is a loose BNF grammar (I need to review my language theory lessons...). Note that right now I don't specify how it's going to be written: XML, JSON, whatever.
rule -> anchor tokens
anchor -> TOKEN
tokens -> TOKEN tokens
        | TOKEN
TODO: complete TOKEN description in a bnf or xslt way
The anchor will be the token that needs to be matched in order to trigger the rule.
- TOKEN: if present, it must match a token in the data given as input. A rule is matched when all the TOKEN in it are matched with tokens in the data given as input; if not, the rule MAY be considered as not matched (as an implementation can support "partial matches").
- id: should be unique among the tokens of one rule (just so that we can make a reference from one token to another inside a rule). However, the key id MUST NOT be used for any other purpose, and it MUST be 0 or anchor for the TOKEN that is supposed to represent the anchor of that rule.
- proxymity: has for value a list of pairs of the following keys (if not used as described, it MUST NOT be present):
  - from: the id of another token
  - distance: the distance (starting at 1 / -1) between the current token and the one referenced by from, with the following possible values (note the syntax is chosen to make it easy to parse with a simple split, convert to int, and read one byte):
    - an integer: a positive value meaning that the from token should be "before", and a negative value meaning that the from token should be after
    - * to mean any distance is valid
    - X+ to mean 'X' or more (as an absolute value)
    - X- to mean 'X' or less (as an absolute value)
    - X|Y to mean 'between X and Y included'
    - X,Y,Z to mean either X, Y or Z
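To check that the distance syntax really is parseable with a split, an integer conversion and a one-byte read, here is a minimal sketch. The function name `distanceMatches` is hypothetical, and a few points the proposal leaves open are decided arbitrarily here: X is assumed to be non-negative in `X+` and `X-`, `X|Y` is compared against the signed distance, and malformed input is not handled (`std::stoi` throws on it).

```cpp
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>

// Returns true when the signed distance `d` satisfies the value `spec`.
// Hypothetical helper: it only illustrates the proposed syntax.
bool distanceMatches(const std::string& spec, int d) {
    if (spec.empty()) return false;
    if (spec == "*") return true;  // any distance is valid

    // "X+" / "X-": read the last byte, convert the rest to int.
    const char last = spec.back();
    if (last == '+' || last == '-') {
        const int x = std::stoi(spec.substr(0, spec.size() - 1));
        return last == '+' ? std::abs(d) >= x : std::abs(d) <= x;
    }

    // "X|Y": inclusive range on the signed distance (an assumption).
    const std::size_t bar = spec.find('|');
    if (bar != std::string::npos) {
        const int lo = std::stoi(spec.substr(0, bar));
        const int hi = std::stoi(spec.substr(bar + 1));
        return lo <= d && d <= hi;
    }

    // Plain integer or "X,Y,Z": split on commas, compare each value.
    std::stringstream ss(spec);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (std::stoi(item) == d) return true;
    }
    return false;
}
```

Note that a leading minus sign (`-3`) is read as a negative integer, while a trailing one (`3-`) means "3 or less", which is why the last byte is inspected before anything else.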
There is some code to convert Chinese punctuation to Latin, but it’s not working because all the pairs contain multibyte characters as keys, while the input string is segmented on bytes, not characters. It’s not trivial to fix since C++ totally lacks UTF-8 support.
Reproduce with:
$ curl http://127.0.0.1:8042/pinyin?str=`php -r 'print urlencode("?。、");'`
<?xml version="1.0" encoding="UTF-8"?>
<root>
<romanization><![CDATA[?。、]]></romanization>
</root>
Expected: <romanization><![CDATA[?.,]]></romanization>
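One possible way to approach this without an external library is to walk the input one UTF-8 sequence at a time (the lead byte gives the sequence length) and look up each whole sequence in the table. The sketch below makes several assumptions: the function names are hypothetical, the table entries are illustrative rather than the pairs actually used in sinoparserd, and the input is not validated beyond guarding against a truncated tail.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Length in bytes of the UTF-8 sequence starting with `lead`.
// Invalid lead bytes count as one byte so the caller always advances.
std::size_t utf8Length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// Replaces Chinese punctuation with a Latin equivalent, one code point
// at a time. Keys are written as escaped UTF-8 bytes so the example does
// not depend on the source file encoding.
std::string latinizePunctuation(const std::string& input) {
    static const std::unordered_map<std::string, std::string> table = {
        {"\xE3\x80\x82", "."},  // U+3002 ideographic full stop
        {"\xE3\x80\x81", ","},  // U+3001 ideographic comma
        {"\xEF\xBC\x9F", "?"},  // U+FF1F fullwidth question mark
        {"\xEF\xBC\x8C", ","},  // U+FF0C fullwidth comma
    };
    std::string out;
    for (std::size_t i = 0; i < input.size();) {
        std::size_t len = utf8Length(static_cast<unsigned char>(input[i]));
        if (i + len > input.size()) len = input.size() - i;  // truncated tail
        const std::string cp = input.substr(i, len);
        const auto it = table.find(cp);
        out += (it != table.end()) ? it->second : cp;
        i += len;
    }
    return out;
}
```

Because the lookup key is always a complete sequence, multibyte keys in the table can match, which is what the byte-by-byte segmentation described above prevents.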
This came up in Tatoeba/tatoeba2#2189. There are a few pinyin errors in mandarin.xml that I currently plan to fix by using CC-CEDICT as the source instead, but maybe they have been corrected in the original source of mandarin.xml as well.
@allan-simon do you remember where you got those files from?