sotch-pr35mac / chinese_dictionary
A searchable Chinese / English dictionary with helpful utilities.
Home Page: https://crates.io/crates/chinese_dictionary
License: MIT License
For example:
pub fn query(raw: &str) -> Option<Vec<&WordEntry>> {
means
pub fn query<'a>(raw: &'a str) -> Option<Vec<&'a WordEntry>> {
However, returning the result with the same lifetime as the input str reference is not necessary: the actual WordEntry values are &'static.
Right now, it's hard to do something like cache the results of a lookup, since the result only lives as long as the input.
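To make the difference concrete, here is a minimal sketch, assuming the entries live in a static table. The `WordEntry` fields, the `ENTRIES` table, and `lookup_owned` are hypothetical stand-ins for illustration, not the crate's actual definitions.

```rust
// Hypothetical stand-in for the crate's entry type.
#[derive(Debug, PartialEq)]
struct WordEntry {
    simplified: &'static str,
    english: &'static str,
}

// Hypothetical static table; the real crate stores its data differently.
static ENTRIES: [WordEntry; 2] = [
    WordEntry { simplified: "我", english: "I; me" },
    WordEntry { simplified: "苹果", english: "apple" },
];

// The explicit `'static` in the return type decouples the result
// from the lifetime of `raw`.
fn query(raw: &str) -> Option<Vec<&'static WordEntry>> {
    let hits: Vec<&'static WordEntry> =
        ENTRIES.iter().filter(|e| e.simplified == raw).collect();
    if hits.is_empty() { None } else { Some(hits) }
}

// `input` is dropped when this function returns, yet the result escapes.
// With the elided signature (result tied to `raw`), this would not compile.
fn lookup_owned(input: String) -> Option<Vec<&'static WordEntry>> {
    query(&input)
}
```

With a signature like this, results can be stored in a cache keyed by an owned String without keeping the original input alive.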
Hi,
I'm not quite sure where to put this since it touches on several other crates you made as well.
I've been working on a tool to annotate Chinese text: https://github.com/KerfuffleV2/mandarin-webutil
One of the main things it needs to do is segment text and perform dictionary lookups. Right now it's using your crates for that function, but they don't quite fit my needs so I'm looking to either make some changes, fork those repos or possibly just roll my own.
Since the changes I'd want to make are fairly significant (and sometimes backward incompatible), I want to feel out whether you'd be likely to accept those sorts of pull requests.
One thing I'd like to do is use a build.rs to generate Rust source that compiles the dictionary definitions in directly, instead of loading them as bincode and then deserializing. This would result in the values in WordEntry changing from String to &'static str. (I suppose it would be possible to use a different struct internally and build a bunch of Strings at runtime, which isn't worse than the status quo but also isn't really ideal. Possibly this could happen lazily, only if someone uses that interface.)
Loading the dictionary and deserializing it is pretty expensive, especially in a WASM web application, and the time it takes is noticeable (especially when not building in release mode). It probably uses more memory as well.
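As a rough sketch of the code-generation idea, the function below renders Rust source for a static table of entries. `RawEntry`, `render_entries`, and the generated `ENTRIES` / `WordEntry` names are all hypothetical. A real build.rs would read the records from the dictionary's data files and write the returned string to a file under the directory named by the `OUT_DIR` environment variable, which the crate would then pull in with `include!`.

```rust
/// Hypothetical raw record, as a build script might parse it from a data file.
struct RawEntry {
    traditional: String,
    simplified: String,
    english: Vec<String>,
}

/// Renders Rust source defining `pub static ENTRIES: [WordEntry; N]`.
/// Assumes the including crate defines a `WordEntry` struct whose fields are
/// `&'static str` (and `&'static [&'static str]` for `english`).
fn render_entries(entries: &[RawEntry]) -> String {
    let mut out = String::new();
    out.push_str(&format!(
        "pub static ENTRIES: [WordEntry; {}] = [\n",
        entries.len()
    ));
    for e in entries {
        // `{:?}` on a string emits a quoted, escaped literal; on a
        // Vec<String> it emits a bracketed list of such literals.
        out.push_str(&format!(
            "    WordEntry {{ traditional: {:?}, simplified: {:?}, english: &{:?} }},\n",
            e.traditional, e.simplified, e.english
        ));
    }
    out.push_str("];\n");
    out
}
```

Because the generated table consists only of string literals, nothing has to be allocated or deserialized at startup.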
The structure for the main data in the dictionary builder ( https://github.com/sotch-pr35mac/syng-dictionary-creator/blob/master/src/dictionary_utils.rs#L42 ) can just be a Vec, since the word id is exactly the same as the position in the list. This will simplify building it, but anything else using those data files would need to be changed as well. Making it a flat Vec or array also means you can just index it directly.
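A minimal sketch of the flat layout, assuming ids are zero-based and assigned in insertion order. The `Dictionary` type and this `WordEntry` shape are hypothetical, not the builder's actual types.

```rust
// Hypothetical entry type for illustration.
#[derive(Debug, PartialEq)]
struct WordEntry {
    simplified: String,
    english: String,
}

// Entries are stored in id order, so no separate id -> entry map is needed.
#[derive(Default)]
struct Dictionary {
    entries: Vec<WordEntry>,
}

impl Dictionary {
    /// Appends an entry and returns its id, which is simply its index.
    fn push(&mut self, entry: WordEntry) -> usize {
        self.entries.push(entry);
        self.entries.len() - 1
    }

    /// Looks an entry up by id with a bounds-checked index.
    fn get(&self, id: usize) -> Option<&WordEntry> {
        self.entries.get(id)
    }
}
```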
It should also be possible to use this so that segmenting also collects the ids of the matched words. Right now the character_converter crate only uses the FST to check for existence. What I'd like to have is a function where I can control whether it's matching Traditional vs Simplified, and which returns a list (or iterator) of the non-matched sections along with the actual matched words when I pass it some text.
For example, given: "abc我喜欢苹果!" I'd like to get something like:
[
Segment::Raw("abc"),
Segment::Word(1), // 我
Segment::Word(2), // 喜欢
Segment::Word(3), // 苹果
Segment::Raw("!"),
]
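One way the output above could be produced is sketched below. This uses greedy longest-match against a plain HashMap rather than an FST, purely to keep the example short. The `Segment` enum, the `segment` function, and the `max_chars` parameter are hypothetical. Choosing Traditional vs Simplified would amount to passing in a different word table.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Segment<'a> {
    Raw(&'a str),  // text that matched no dictionary word
    Word(usize),   // id of a matched dictionary word
}

/// Greedy longest-match segmentation. `words` maps a word to its id and
/// `max_chars` bounds the length (in chars) of any candidate word.
fn segment<'a>(text: &'a str, words: &HashMap<&str, usize>, max_chars: usize) -> Vec<Segment<'a>> {
    // Byte offset of every char boundary, including the end of the text.
    let bounds: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();
    let n = bounds.len() - 1; // number of chars in `text`
    let mut out = Vec::new();
    let mut raw_start = 0; // char index where the pending unmatched run began
    let mut i = 0;
    while i < n {
        // Try the longest candidate first, then progressively shorter ones.
        let longest = max_chars.min(n - i);
        let hit = (1..=longest).rev().find_map(|len| {
            words.get(&text[bounds[i]..bounds[i + len]]).map(|&id| (len, id))
        });
        match hit {
            Some((len, id)) => {
                // Flush any unmatched text that preceded this word.
                if raw_start < i {
                    out.push(Segment::Raw(&text[bounds[raw_start]..bounds[i]]));
                }
                out.push(Segment::Word(id));
                i += len;
                raw_start = i;
            }
            None => i += 1,
        }
    }
    if raw_start < n {
        out.push(Segment::Raw(&text[bounds[raw_start]..]));
    }
    out
}
```

Returning the Raw pieces as borrowed slices of the input avoids copying, while Word carries only the id, so the caller can index straight into the entry table.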
Absolutely no hard feelings if these changes are too extensive or not what you're looking for. You can of course wait to see if I make anything worth yoinking and integrating (although you'd probably have to do the work yourself).
One thing I do want to avoid though is ending up in a situation where only some changes are accepted and I still need to either fork or write my own implementation. Hopefully that's understandable.