chinese_dictionary's Issues

Query results are returned with the lifetime of the input str reference.

For example:

pub fn query(raw: &str) -> Option<Vec<&WordEntry>> {

is equivalent, through lifetime elision, to

pub fn query<'a>(raw: &'a str) -> Option<Vec<&'a WordEntry>> {

However, returning the result with the same lifetime as the input str reference is not necessary: the actual WordEntries are &'static.

Right now, it's hard to do something like cache the results of a lookup, since the result only lives as long as the input.
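
A minimal sketch of what the requested signature would allow, assuming the entries really do live in static data. The WordEntry fields, the ENTRIES table and the cached_lookup helper are hypothetical stand-ins for illustration, not the crate's actual definitions:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the crate's WordEntry (the real struct differs).
#[derive(Debug, PartialEq)]
pub struct WordEntry {
    pub simplified: &'static str,
    pub english: &'static str,
}

// Hypothetical static dictionary data.
static ENTRIES: [WordEntry; 2] = [
    WordEntry { simplified: "我", english: "I; me" },
    WordEntry { simplified: "苹果", english: "apple" },
];

// Returning &'static WordEntry decouples the result from the lifetime of `raw`.
pub fn query(raw: &str) -> Option<Vec<&'static WordEntry>> {
    let hits: Vec<&'static WordEntry> =
        ENTRIES.iter().filter(|e| e.simplified == raw).collect();
    if hits.is_empty() { None } else { Some(hits) }
}

// The cache stores results that outlive every input string passed in.
pub fn cached_lookup(
    cache: &mut HashMap<String, Vec<&'static WordEntry>>,
    raw: &str,
) -> Option<Vec<&'static WordEntry>> {
    if let Some(hit) = cache.get(raw) {
        return Some(hit.clone());
    }
    let found = query(raw)?;
    cache.insert(raw.to_owned(), found.clone());
    Some(found)
}
```

With the elided signature, the Vec stored in the cache would borrow from `raw`, so a lookup on a temporary String could not be cached past the end of that String's scope.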

Set of changes to this and your other associated crates

Hi,
I'm not quite sure where to put this since it touches on several other crates you made as well.
I've been working on a tool to annotate Chinese text: https://github.com/KerfuffleV2/mandarin-webutil
One of the main things it needs to do is segment text and perform dictionary lookups. Right now it's using your crates for that function, but they don't quite fit my needs so I'm looking to either make some changes, fork those repos or possibly just roll my own.

Since the changes I'd want to make are fairly significant (and sometimes backward incompatible), I want to feel out whether you'd be likely to accept those sorts of pull requests.

One thing I'd like to do is use a build.rs to generate Rust source that compiles the dictionary definitions in directly, instead of loading them as bincode and then deserializing. This would change the values in WordEntry from String to &'static str. (I suppose it would be possible to use a different struct internally and build a bunch of Strings at runtime. That isn't worse than the status quo, but it also isn't ideal. Possibly it could happen lazily, only if someone uses that interface.)
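
A rough sketch of the code-generation half of that idea. RawEntry, render_entries and the generated ENTRIES name are hypothetical; an actual build.rs would write the returned string to a file under Cargo's OUT_DIR and the crate would pull it in with include!:

```rust
use std::fmt::Write;

// Hypothetical raw record, as a build script might parse it from the
// dictionary source data.
pub struct RawEntry {
    pub traditional: String,
    pub simplified: String,
    pub english: String,
}

/// Renders Rust source that declares a static table of entries.
/// `{:?}` is used so each string is emitted as an escaped Rust literal.
pub fn render_entries(entries: &[RawEntry]) -> String {
    let mut src = String::new();
    writeln!(src, "pub static ENTRIES: [WordEntry; {}] = [", entries.len()).unwrap();
    for e in entries {
        writeln!(
            src,
            "    WordEntry {{ traditional: {:?}, simplified: {:?}, english: {:?} }},",
            e.traditional, e.simplified, e.english
        )
        .unwrap();
    }
    src.push_str("];\n");
    src
}
```

The generated WordEntry would need &'static str fields for this to compile, which is the String to &'static str change described above.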

Loading the dictionary and deserializing it is pretty expensive, especially in a WASM web application, and the time it takes is noticeable (especially when not building in release mode). It probably uses more memory as well.

The structure for the main data in the dictionary builder ( https://github.com/sotch-pr35mac/syng-dictionary-creator/blob/master/src/dictionary_utils.rs#L42 ) can just be a Vec, since the word id is exactly the same as the position in the list. This will simplify building it, but anything else using those data files would also need to be changed. Making it a flat Vec or array also means you can just index it directly.
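
To illustrate the flat layout, here is a small sketch assuming ids are dense and start at 0. The word_id field, build_table and by_id are hypothetical names, not taken from the dictionary builder:

```rust
// Hypothetical entry type; only the fields needed for the example.
pub struct WordEntry {
    pub word_id: u32,
    pub simplified: &'static str,
}

/// Sorts entries by id and returns them only if id == position holds
/// for every entry, which is the property the flat layout relies on.
pub fn build_table(mut entries: Vec<WordEntry>) -> Option<Vec<WordEntry>> {
    entries.sort_by_key(|e| e.word_id);
    let dense = entries
        .iter()
        .enumerate()
        .all(|(pos, e)| e.word_id as usize == pos);
    if dense { Some(entries) } else { None }
}

/// Looks an entry up by id with a plain bounds-checked index.
pub fn by_id(table: &[WordEntry], id: u32) -> Option<&WordEntry> {
    table.get(id as usize)
}
```

A lookup is then a single slice index, with no hashing and no separate key storage.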

It should also be possible to use this so that segmenting also collects the ids of the matched words. Right now the character_converter crate only uses the FST to check for existence. What I'd like is a function that takes some text, lets me control whether it matches Traditional or Simplified, and returns a list (or iterator) of the non-matched sections together with the actual matched words.

For example, given: "abc我喜欢苹果!" I'd like to get something like:

[ 
  Segment::Raw("abc"), 
  Segment::Word(1), // 我
  Segment::Word(2), // 喜欢
  Segment::Word(3), // 苹果
  Segment::Raw("!"),
]
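
One way such an interface could look, as a sketch only. It swaps the FST for a plain HashMap with greedy longest-match so that it stays self-contained, and it leaves out the Traditional vs Simplified switch. The segment function, its max_chars parameter and the word ids are all hypothetical:

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum Segment<'a> {
    Raw(&'a str), // text that matched no dictionary word
    Word(usize),  // id of a matched dictionary word
}

/// Greedy longest-match segmentation. `max_chars` is the longest word
/// length (in chars) to try at each position.
pub fn segment<'a>(
    text: &'a str,
    words: &HashMap<&str, usize>,
    max_chars: usize,
) -> Vec<Segment<'a>> {
    // Byte offsets of every char boundary, including the end of the text.
    let bounds: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();
    let n = bounds.len() - 1; // number of chars
    let mut out = Vec::new();
    let mut raw_start: Option<usize> = None; // char index where a raw run began
    let mut i = 0;
    while i < n {
        // Try the longest candidate first, then shrink.
        let longest = max_chars.min(n - i);
        let hit = (1..=longest).rev().find_map(|len| {
            words
                .get(&text[bounds[i]..bounds[i + len]])
                .map(|&id| (len, id))
        });
        match hit {
            Some((len, id)) => {
                // Flush any pending unmatched run before the word.
                if let Some(s) = raw_start.take() {
                    out.push(Segment::Raw(&text[bounds[s]..bounds[i]]));
                }
                out.push(Segment::Word(id));
                i += len;
            }
            None => {
                raw_start.get_or_insert(i);
                i += 1;
            }
        }
    }
    if let Some(s) = raw_start {
        out.push(Segment::Raw(&text[bounds[s]..]));
    }
    out
}
```

With a word list mapping 我 to 1, 喜欢 to 2 and 苹果 to 3, and max_chars set to 2, this produces the sequence shown above for "abc我喜欢苹果!". Raw segments borrow from the input text, while Word segments carry only an id, so they could index straight into a flat entry table.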

Absolutely no hard feelings if these changes are too extensive or not what you're looking for. You can of course wait to see if I make anything worth yoinking and integrating (although you'd have to do the work yourself, probably).
One thing I do want to avoid though is ending up in a situation where only some changes are accepted and I still need to either fork or write my own implementation. Hopefully that's understandable.
