Hi, I'm not quite sure where to put this since it touches on several other crates


Set of changes to this and your other associated crates about chinese_dictionary HOT 3 CLOSED

sotch-pr35mac commented on August 16, 2024

Set of changes to this and your other associated crates

from chinese_dictionary.

Comments (3)

sotch-pr35mac commented on August 16, 2024 1

No worries about the response time, life happens, especially for open source and side projects.

Thank you for providing an update and for sharing more information about your project. I haven’t seen a tool like that and I can see how that’d be helpful. I love to see the language learning tools becoming more robust. I hope it all goes well~

I’m going to close this issue for now but feel free to leave comments as necessary with any updates / changes.

from chinese_dictionary.

sotch-pr35mac commented on August 16, 2024

Hello,

TL;DR: I'm open to the type of changes you're describing, as long as current behavior can be maintained in some way. It might be worth coordinating on some of the specifics before jumping in.

I would be interested in learning more about your mandarin-webutil project and what your upstream needs are more specifically. It's hard to say if I'd be likely to merge a PR without knowing a little bit more about it. As I'm sure you've already gathered, like you, I'm using these crates for a project of mine downstream; so I'd like to maintain much of the functionality that exists. With that being said, I'm certainly open to additions and breaking changes as long as the behavior my downstream project relies on can still be achieved in one way or another.

Regarding the two examples you provided:

1. Using a build.rs instead of deserializing is something that I have been wanting to do for a while, and I'm completely fine with most changes in the interface that result from that. In fact, there are several breaking changes I've been planning to roll up into the next major release, including revisiting some of the WordEntry struct and adding multiple data sources.
2. While I think what you're describing here is possible and reasonable, I would suggest keeping this function local to the chinese_dictionary crate, as there's no guarantee that the vectors between the two will be the same, nor do I think there necessarily should be since they're separate crates. But that shouldn't impact making the desired segment function available in chinese_dictionary.

Everything you're describing here seems reasonable and I could see a clear path forward, but I hope you understand I'm not comfortable agreeing to changes carte blanche.

from chinese_dictionary.

KerfuffleV2 commented on August 16, 2024

@sotch-pr35mac I sincerely apologize for leaving this issue open for so long without a response.

At the point you replied, I'd already started working on my own Chinese dictionary/tools crate and I wanted to see what developed from that. It seemed like it would be only a few days, so I thought holding off would be reasonable. Unfortunately, I ran into complications with compiling in the data and even after solving those the scope has grown. It just seemed like it was a couple days away from being able to possibly share for most of this time! My time estimation skills truly are absolute garbage.

Even just compiling in the data isn't as easy as one might expect. My initial approach was to just generate a whole bunch of nested constant slices (since the number of definitions/readings can vary), but the Rust compiler has a lot of trouble there and turned out to be excruciatingly slow at compiling that type of code. (I gave up after 15 or so minutes.) It turned out to be necessary to flatten the data. Slices were also a bit inefficient for memory usage/binary size, since each one takes 128 bits for the pointer + length on top of the actual data.
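To illustrate the flattening idea, here is a minimal sketch of one possible layout. It is not the actual code from the crate being described: the names `Entry`, `DEFINITIONS`, `ENTRIES`, and `lookup` are hypothetical, and the tables are written by hand where a real build would generate them. Each entry keeps a start index and a count into one shared table instead of holding its own nested slice.

```rust
/// One dictionary entry. Instead of a nested `&'static [&'static str]`
/// (pointer + length), it stores a start index and a count into the
/// shared DEFINITIONS table.
#[derive(Clone, Copy)]
pub struct Entry {
    pub word: &'static str,
    def_start: u32, // index of the first definition in DEFINITIONS
    def_len: u16,   // number of definitions belonging to this entry
}

// Hand-written, illustrative data. A generator would emit these tables.
static DEFINITIONS: [&str; 4] = ["to leave", "to be away from", "good", "well"];

static ENTRIES: [Entry; 2] = [
    Entry { word: "离", def_start: 0, def_len: 2 },
    Entry { word: "好", def_start: 2, def_len: 2 },
];

impl Entry {
    /// Rebuilds the per-entry slice on demand from the flat table.
    pub fn definitions(&self) -> &'static [&'static str] {
        let start = self.def_start as usize;
        &DEFINITIONS[start..start + self.def_len as usize]
    }
}

/// Linear scan, enough for a sketch; a real table would likely be sorted
/// or hashed.
pub fn lookup(word: &str) -> Option<&'static Entry> {
    ENTRIES.iter().find(|e| e.word == word)
}
```

With this layout the compiler sees two flat arrays rather than many small nested constants, and a variable number of definitions per entry costs a few bytes of indices instead of an extra pointer + length pair.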

Right now, although it's very messy, it can import/compile in CEDICT-format dictionaries and perform segmentation with multiple dictionaries at the same time. The main point of the latter is to allow segmentation based on more common readings even if there's a rare/archaic reading that may be longer. To give an example with individual characters: there isn't really a way to know whether 离 should be read as li2 (distance, a very common word) or chi1 (archaic term for a mystical beast). I only glanced at your syng dictionary tool but it seems like this might also be a problem you'd have encountered there.
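One way multi-dictionary segmentation could work is sketched below. This is an assumption about the general idea, not the algorithm actually used in the crate being described: a greedy forward longest-match that consults dictionaries in priority order, so a match in an earlier (more common) dictionary wins over a longer match in a later (rarer) one. The function name `segment` and its parameters are hypothetical.

```rust
use std::collections::HashSet;

/// Greedy forward longest-match over a priority-ordered list of dictionaries.
/// Earlier dictionaries win even if a later one has a longer match.
/// Text not found in any dictionary is emitted one character at a time.
pub fn segment<'a>(
    text: &'a str,
    dictionaries: &[HashSet<&str>],
    max_word_chars: usize,
) -> Vec<&'a str> {
    // Byte offsets of every character boundary, including the end of the text.
    let bounds: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();

    let mut out = Vec::new();
    let mut pos = 0; // index into `bounds`, i.e. a character position
    while pos + 1 < bounds.len() {
        let longest = (bounds.len() - 1 - pos).min(max_word_chars.max(1));
        let mut taken = 1; // fallback: a single character
        'search: for dict in dictionaries {
            for len in (1..=longest).rev() {
                let candidate = &text[bounds[pos]..bounds[pos + len]];
                if dict.contains(candidate) {
                    taken = len;
                    break 'search;
                }
            }
        }
        out.push(&text[bounds[pos]..bounds[pos + taken]]);
        pos += taken;
    }
    out
}
```

As an illustration with made-up dictionary contents: if the first dictionary holds 研究 and 生命 and the second holds 研究生, then segmenting 研究生命 yields 研究 / 生命. Merging all three words into a single dictionary would instead produce 研究生 / 命, because the longer match is taken first.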

The next thing I have to do is add support for importing dictionaries with difficulty/frequency information (i.e. HSK) and it will basically be at feature parity (for my use case) with your dictionary crate that I was already using in my webutil thing. There would still be a lot of cleanup required for it to be something that could really be released to the public. I'm hoping to have something I could at least put in a public repo by next weekend.

> I hope you understand I'm not comfortable agreeing to changes carte blanche.

100%, your response was completely fair and reasonable. I was mainly just trying to see if we were on the same page before putting work into it. (2) in your list is probably the only thing we'd disagree on for the approach. I'd really want to avoid including duplicate data as much as possible, and with separate crates and basically the same data needed by both it seems like that would be difficult to avoid.

> I would be interested in learning more about your mandarin-webutil project and what your upstream needs are more specifically.

It doesn't really have a clear scope at the moment. It's a testbed for stuff I feel like I needed while trying to learn but couldn't find elsewhere. One of the issues I have is that a lot of the time learning tools will just give you too much information instead of making you figure things out on your own. So it's an option between no pinyin or showing pinyin, or no definition or showing the whole definition. I want to try to add more subtle hints like tone colors, just showing the initial character of the phonetics, maybe showing the category of word (like animal/furniture/grammar/color/etc).
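A rough sketch of how such partial hints might be derived from numbered pinyin (e.g. `li2`) follows. Everything here is hypothetical and not taken from the webutil project: the function names, the numbered-pinyin input format, and the color assignments, which are arbitrary.

```rust
/// Splits numbered pinyin such as "li2" into ("li", Some(2)).
/// Syllables without a trailing tone digit 1-5 are returned unchanged.
pub fn split_tone(syllable: &str) -> (&str, Option<u8>) {
    match syllable.chars().last().and_then(|c| c.to_digit(10)) {
        // `to_digit(10)` only accepts ASCII digits, so the tone mark is
        // exactly one byte and can be sliced off safely.
        Some(t) if (1..=5).contains(&t) => {
            (&syllable[..syllable.len() - 1], Some(t as u8))
        }
        _ => (syllable, None),
    }
}

/// Maps a tone to a display color. The palette is an arbitrary choice.
pub fn tone_color(tone: Option<u8>) -> &'static str {
    match tone {
        Some(1) => "red",
        Some(2) => "orange",
        Some(3) => "green",
        Some(4) => "blue",
        _ => "gray", // neutral tone or unknown
    }
}

/// Shows only the first character of the phonetics, hiding the rest.
pub fn initial_hint(syllable: &str) -> String {
    let (base, _) = split_tone(syllable);
    match base.chars().next() {
        Some(first) => format!("{}…", first),
        None => String::new(),
    }
}
```

A front end could then color the character 离 using `tone_color(split_tone("li2").1)` and show `initial_hint("li2")` as a nudge, without revealing the full reading.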

Right now it's mainly aimed at inputting some Chinese text and annotating it in various ways. With the segmentation stuff I'm working on, I can kind of see it developing some IME-like capabilities - allowing one to enter pinyin and have it get transformed to characters, looking up English words (or maybe just entering them).

If you want to, you can close this issue. I can give you an update when I have something worth sharing, just let me know.

I tend to use a permissive license (MIT generally), so it will be possible for you to reuse whatever code seems like it could be useful should you choose to do so. It may eventually be suitable as a backend for your crates or front-end dictionary tool.

Sorry again for the inexcusably delayed response!

from chinese_dictionary.
