sinoparserd's People

Contributors

allan-simon, jiru

sinoparserd's Issues

Errors in simplified/traditional conversion

verdastelo9604 reported some errors in the simplified/traditional characters converter.

  • 周 is not converted to 週 from simplified to traditional;
  • 著 is not converted to 着 from traditional to simplified when used as a particle;
  • 甚 in 甚麼 is not converted to 什 from traditional to simplified;
  • 里 is sometimes converted to 裏 from simplified to traditional; it should be 裡 in the Taiwan standard.
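
Cases like 甚麼 depend on the neighbouring characters, so one common approach is to try phrase-level entries before falling back to single characters. A minimal sketch of that idea follows; the tables `PHRASES_T2S` and `CHARS_T2S` and the function `convert` are hypothetical names with a few illustrative entries, not sinoparserd's actual data or code:

```python
# Illustrative tables only (assumption): a real converter needs far larger ones.
PHRASES_T2S = {"甚麼": "什么"}                # phrase-level overrides
CHARS_T2S = {"麼": "么", "這": "这", "裡": "里"}  # per-character fallback


def convert(text, phrases, chars):
    """Longest-match phrase lookup first, then per-character lookup."""
    max_len = max((len(key) for key in phrases), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in phrases:
                out.append(phrases[chunk])
                i += n
                break
        else:
            # No phrase matched: convert a single character, or keep it as-is.
            ch = text[i]
            out.append(chars.get(ch, ch))
            i += 1
    return "".join(out)


print(convert("這是甚麼", PHRASES_T2S, CHARS_T2S))
```

With these tables, 甚 is only rewritten inside 甚麼 and is left alone elsewhere (for example in 甚至). The particle use of 著 would need more context than a phrase table alone can give.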

Memory leaks

Sinoparserd’s memory usage grows little by little every day and ends up at several gigabytes. As a result, we need to restart it once in a while.

Empty response

Run

Here is my command (I'm using your default files):

./sinoparserd -m doc/mandarin.xml

Requests

And here is what I got when hitting http://localhost:8080/trad?str=*:

<root>
<trad><![CDATA[*]]></trad>
</root>

I was expecting all trad entries matching the * glob. I got similar behavior with /pinyin?str=*:

<root>
<romanization><![CDATA[*]]></romanization>
</root>
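
For anyone reproducing this, here is a small client sketch. It assumes only what the report shows: a server on localhost:8080, the /trad and /pinyin paths with a `str` parameter, and an XML answer with one CDATA element under `<root>`. The helper names (`build_url`, `first_text`, `query`) are made up for the example:

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "http://localhost:8080"  # host and port as used in the report


def build_url(endpoint, text):
    """Build e.g. http://localhost:8080/trad?str=... with the text percent-encoded."""
    return f"{BASE_URL}/{endpoint}?str={urllib.parse.quote(text)}"


def first_text(xml_payload, tag):
    """Return the text of the first <tag> child of the root element, or None."""
    node = ET.fromstring(xml_payload).find(tag)
    return None if node is None else node.text


def query(endpoint, text, tag):
    """Send one request to a running server and extract the element of interest."""
    with urllib.request.urlopen(build_url(endpoint, text)) as response:
        return first_text(response.read(), tag)
```

Calling `query("trad", "*", "trad")` against the setup above should give back `*`, matching the response shown in this report.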

Does sinoparserd support character segmentation?

Your application was recommended as a good Chinese segmenter, yet the only segmentation that seems to be available is in the <romanization> element (space-separated words):

<romanization>ren2ren2 ke3 bian1ji2 de5 zi4you2 bai3ke1quan2shu1</romanization>
<alternateScript>人人可编辑的自由百科全书</alternateScript>

As you already seem to be able to segment, why not provide an API or an option to do it on Chinese scripts?
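
In the meantime, a possible client-side workaround is to carry the word boundaries from the romanization over to the script. This is only a sketch under strong assumptions: every syllable ends with a tone digit from 1 to 5 (as in the output above) and corresponds to exactly one character, which will not hold for text containing punctuation or Latin letters. The function name `segment_from_romanization` is hypothetical:

```python
import re


def segment_from_romanization(romanization, script):
    """Split `script` into words using the syllable count of each pinyin word."""
    words = []
    pos = 0
    for token in romanization.split():
        # Assumption: one tone digit per syllable, one character per syllable.
        syllables = len(re.findall(r"[1-5]", token))
        words.append(script[pos:pos + syllables])
        pos += syllables
    if pos != len(script):
        raise ValueError("syllable count does not match the number of characters")
    return words


romanization = "ren2ren2 ke3 bian1ji2 de5 zi4you2 bai3ke1quan2shu1"
script = "人人可编辑的自由百科全书"
print(" ".join(segment_from_romanization(romanization, script)))
```

On the example from this issue the sketch yields 人人 / 可 / 编辑 / 的 / 自由 / 百科全书.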

Review tokenizer algorithm

Right now we're doing a simple greedy algorithm that does not look backward.

Instead we should:

1 - try to get all possible tokenizations

2 - apply filtering rules to them to remove impossible combinations (based on an "impossible tokenization" rules file?)

3 - weight the remaining tokenizations using a weighting function (maybe step 2 can be merged into this step, assuming 'impossible' stuff would get a very low score)

4 - keep the highest score (add a mechanical arm to the server to flip a coin for tokenizations with the same weight; you see, I've thought about everything, smartass)
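
The four steps could be sketched roughly as below. Everything here is an assumption made for illustration: the toy dictionary, the names `all_tokenizations`, `best_tokenization` and `toy_weight`, and the weighting itself (fewer tokens, with an extra penalty for single-character tokens). Enumerating every tokenization is exponential in the worst case, so a real implementation would likely need dynamic programming instead.

```python
def all_tokenizations(text, dictionary):
    """Step 1: every way to split `text` into words found in `dictionary`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            for rest in all_tokenizations(text[end:], dictionary):
                results.append([word] + rest)
    return results


def toy_weight(tokens):
    """Step 3 (hypothetical): prefer fewer tokens, penalise single characters."""
    singles = sum(1 for token in tokens if len(token) == 1)
    return -len(tokens) - singles


def best_tokenization(text, dictionary, is_impossible=lambda tokens: False,
                      weight=toy_weight):
    """Steps 2 and 4: drop impossible candidates, keep the highest score."""
    candidates = [t for t in all_tokenizations(text, dictionary)
                  if not is_impossible(t)]
    # No mechanical arm here: on equal weights, max() keeps the first candidate.
    return max(candidates, key=weight, default=None)


TOY_DICTIONARY = {"研究", "研究生", "生", "命", "生命", "起源"}
print(best_tokenization("研究生命起源", TOY_DICTIONARY))
```

With this toy dictionary a greedy longest-match pass would take 研究生 first and be left with 命 on its own, while the scored version keeps 研究 / 生命 / 起源.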

Create a rules file for n-grams of tokens

A rules file will be a list of rules (ordered?). Here is a loose BNF grammar (I need to review my language theory lessons...); note that right now I don't specify how it's going to be written: XML, JSON, whatever.

 rule   ->  anchor tokens
 anchor ->  TOKEN
 tokens ->  TOKEN tokens
         |  TOKEN

TODO: complete TOKEN description in a bnf or xslt way

The anchor will be the token that needs to be matched in order to trigger the rule.

  • if a TOKEN is present, it must match a token in the data given as input
  • a TOKEN is a set of key -> values (note the plural: a key may list several values)
  • a rule can be considered as matched or not matched (one MAY implement other values such as 'partially matched')
  • a rule MUST be considered as matched only if all the TOKENs in it are matched with tokens in the data given as input; if not, the rule MAY be considered as not matched (as one can implement 'partially matched')
  • a TOKEN that does not specify a key is considered as matching whatever value that key has in an input token (it means that a TOKEN does not need to specify all the keys a token has)
  • a TOKEN cannot match a token in the input data that is already matched by another TOKEN
  • a TOKEN is considered as matched if each of its keys has ONE of its values equal to the value of that key in one of the tokens in the input data
  • a TOKEN MAY have a special key id that should be unique among the other tokens of one rule (just so that we can refer from one token to another inside a rule); however, the key id MUST NOT be used for any other purpose, and it MUST be 0 or anchor for the TOKEN that is supposed to represent the anchor of that rule
  • a TOKEN MAY have a special key proxymity whose value is a list of pairs of the following (if not used as described, it MUST NOT be present):
    • from (the id of another token)
    • distance, the distance (starting at 1 / -1) between the current token and the one referenced by from, with the following possible values (note the syntax is chosen to make it easy to parse with a simple split, a conversion to int and a one-byte read):
      • an integer, a positive one meaning that the from token should be before, and a negative one meaning that the from token should be after
      • [optional] * to mean any distance is valid
      • [optional] X+ to mean 'X' or more (as an absolute value)
      • [optional] X- to mean 'X' or less (as an absolute value)
      • [optional] X|Y to mean 'between X and Y included'
      • [optional] X,Y,Z to mean either X, Y or Z
      • if the reader does not understand a value, it MUST consider it as meaning any distance
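
To check that the rules above are easy to implement, here is a rough sketch of two pieces: matching one TOKEN against one input token, and reading a distance value. It assumes tokens are plain dictionaries and that "as an absolute value" means the comparison is done on |distance|. The function names `token_matches` and `parse_distance` are invented for the example, since the file format is still undecided:

```python
SPECIAL_KEYS = ("id", "proxymity")  # keys that are never compared against input


def token_matches(rule_token, data_token):
    """A TOKEN matches if each of its keys lists the input token's value."""
    for key, accepted_values in rule_token.items():
        if key in SPECIAL_KEYS:
            continue
        if data_token.get(key) not in accepted_values:
            return False
    return True  # keys the TOKEN does not mention are not checked


def parse_distance(value):
    """Turn one distance value into a predicate over integer distances."""
    value = value.strip()
    try:
        if value == "*":
            return lambda d: True
        if "," in value:                      # X,Y,Z: any of the listed values
            allowed = {int(v) for v in value.split(",")}
            return lambda d: d in allowed
        if "|" in value:                      # X|Y: inclusive range
            low, high = sorted(int(v) for v in value.split("|"))
            return lambda d: low <= d <= high
        if value.endswith("+"):               # X+: |d| is X or more
            bound = abs(int(value[:-1]))
            return lambda d: abs(d) >= bound
        if value.endswith("-"):               # X-: |d| is X or less
            bound = abs(int(value[:-1]))
            return lambda d: abs(d) <= bound
        exact = int(value)                    # plain signed integer
        return lambda d: d == exact
    except ValueError:
        # A value the reader does not understand means any distance.
        return lambda d: True
```

For example, `parse_distance("2+")(-3)` is true and `parse_distance("1|3")(4)` is false, while a value such as `"??"` accepts every distance.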

Pinyin romanization doesn’t convert punctuation characters

There is some code to convert Chinese punctuation to Latin, but it’s not working because all the pairs have multibyte characters as keys, while the input string is segmented on bytes, not characters. It’s not trivial to fix since the C++ standard library offers very little UTF-8 support.

Reproduce with:

$ curl http://127.0.0.1:8042/pinyin?str=`php -r 'print urlencode("?。、");'`
<?xml version="1.0" encoding="UTF-8"?>
<root>
<romanization><![CDATA[?。、]]></romanization>
</root>

Expected: <romanization><![CDATA[?.,]]></romanization>
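
The failure mode can be illustrated outside of C++. The sketch below is written in Python only to show the difference between looking up single bytes and looking up whole code points. It is not the sinoparserd code, and the `PUNCTUATION` table and function names are assumptions for the example:

```python
# Hypothetical mapping; every key is a multibyte character in UTF-8.
PUNCTUATION = {"。": ".", "、": ",", "？": "?"}


def convert_bytewise(data: bytes) -> bytes:
    """Look up one byte at a time: a 3-byte key can never be found."""
    table = {k.encode("utf-8"): v.encode("utf-8") for k, v in PUNCTUATION.items()}
    out = b""
    for i in range(len(data)):
        unit = data[i:i + 1]
        out += table.get(unit, unit)
    return out


def convert_by_code_point(text: str) -> str:
    """Look up one decoded character at a time: the keys are found."""
    return "".join(PUNCTUATION.get(ch, ch) for ch in text)


sample = "?。、"
print(convert_bytewise(sample.encode("utf-8")).decode("utf-8"))  # unchanged
print(convert_by_code_point(sample))                             # converted
```

The byte-wise version returns its input untouched, as in the output above, while the code-point version produces the expected `?.,`. A fix on the C++ side would need an equivalent decoding step before the lookup.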
