The emoji-rs from richardanaya

searching

I'm trying to figure out the easiest way to find all skin tone variants of a particular emoji, clapping hands for instance. Any tips?
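One possible approach, independent of this crate's API, is to build the variants directly. Unicode defines five skin tone modifiers (U+1F3FB through U+1F3FF) that are appended to a base emoji. The sketch below assumes the base is a single codepoint that accepts modifiers, as clapping hands (U+1F44F) does. The helper name `skin_tone_variants` is hypothetical, and ZWJ sequences or multi-person emoji would need more care.

```rust
/// The five Unicode skin tone modifiers (Fitzpatrick scale), U+1F3FB..=U+1F3FF.
const SKIN_TONE_MODIFIERS: [char; 5] = [
    '\u{1F3FB}', // light skin tone
    '\u{1F3FC}', // medium-light skin tone
    '\u{1F3FD}', // medium skin tone
    '\u{1F3FE}', // medium-dark skin tone
    '\u{1F3FF}', // dark skin tone
];

/// Hypothetical helper: returns the five toned variants of `base`.
/// Assumes `base` is a single-codepoint emoji that accepts modifiers;
/// this function does not check that property.
pub fn skin_tone_variants(base: char) -> Vec<String> {
    SKIN_TONE_MODIFIERS
        .iter()
        .map(|modifier| {
            let mut variant = String::with_capacity(8);
            variant.push(base);
            variant.push(*modifier);
            variant
        })
        .collect()
}
```

Calling `skin_tone_variants('\u{1F44F}')` yields the five toned clapping hands strings, which can then be compared against whatever lookup the crate exposes.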

Adding Extra Data

The problem

As I first brought up in this Reddit comment, I think this crate would really benefit from storing additional data with each emoji.

Here I'll present a draft of what kind of data could be included, how to scrape that data, and how the code could be generated. If this idea is accepted, it can be workshopped before implementation.

Source Files

Currently, this repo pulls from the emoji-test page. That page covers the basics, but it leaves out a lot of useful information. I propose gathering data from the Unicode CLDR instead. Not only is it what major projects typically build emoji libraries from, it also has much more data.

Some data that can be scraped with this method is as follows:

Codepoint characters
- Ex: 😂
Canonical name
- Ex: face with tears of joy
Category name
- Ex: Smileys & People
Subcategory name
- Ex: face-positive
Keywords
- Ex: face, face with tears of joy, joy, laugh, tear
Qualification
- Ex: fully-qualified
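To make the list above concrete, here is a rough sketch of how one record might look in Rust. The type and field names (`EmojiRecord`, `glyph`, `has_keyword`, and so on) are placeholders of mine, not anything the crate currently defines. The `Qualification` variants reflect the statuses I understand recent emoji-test.txt files to use, so older files may differ.

```rust
/// Qualification status of an emoji sequence.
/// Assumption: these four statuses match recent emoji-test.txt files.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Qualification {
    FullyQualified,
    MinimallyQualified,
    Unqualified,
    Component,
}

/// Hypothetical per-emoji record holding the fields listed above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EmojiRecord {
    pub glyph: &'static str,
    pub name: &'static str,
    pub category: &'static str,
    pub subcategory: &'static str,
    pub keywords: &'static [&'static str],
    pub qualification: Qualification,
}

impl EmojiRecord {
    /// Case-insensitive (ASCII only) keyword lookup.
    pub fn has_keyword(&self, keyword: &str) -> bool {
        self.keywords.iter().any(|k| k.eq_ignore_ascii_case(keyword))
    }
}

/// The example values from the list above, written as a constant.
pub const FACE_WITH_TEARS_OF_JOY: EmojiRecord = EmojiRecord {
    glyph: "\u{1F602}",
    name: "face with tears of joy",
    category: "Smileys & People",
    subcategory: "face-positive",
    keywords: &["face", "face with tears of joy", "joy", "laugh", "tear"],
    qualification: Qualification::FullyQualified,
};
```

Using `&'static` data throughout means every record can live in the binary as a constant, which fits the idea of generating everything at build time.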

Scraping Method

Gathering data can be done in a few steps:

Categorize emoji-specific codepoints
Parsing basic info
Cross referencing keywords

Gathering the emoji-specific codepoints and initial data is easy. It's all in cldr/tools/java/org/unicode/cldr/util/data/emoji/emoji-test.txt, reachable via the CLDR link above. I believe this is the same data this project currently pulls from, but this should be double checked and a link to the latest version should be found.

Parsing basic info is done directly from the above file. This gives codepoint, string representation, qualification, and canonical name.
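As a rough illustration of that parsing step, the sketch below handles one data line of the form `codepoints ; status # glyph name`. The names `TestLine` and `parse_test_line` are made up for this example. It also assumes that newer files may put a version tag such as `E0.6` between the glyph and the name, and drops that tag when present.

```rust
/// One parsed data line from emoji-test.txt (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct TestLine {
    pub codepoints: Vec<u32>,
    pub qualification: String,
    pub glyph: String,
    pub name: String,
}

/// Parses a single line; returns None for blank lines, comment lines
/// (including group/subgroup headers), and anything malformed.
pub fn parse_test_line(line: &str) -> Option<TestLine> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }

    // Split at the first ';' and then at the first '#' after it.
    let (codepoint_field, rest) = line.split_once(';')?;
    let (status_field, comment) = rest.split_once('#')?;

    // Codepoints are space-separated hexadecimal values.
    let codepoints = codepoint_field
        .split_whitespace()
        .map(|hex| u32::from_str_radix(hex, 16).ok())
        .collect::<Option<Vec<u32>>>()?;
    if codepoints.is_empty() {
        return None;
    }

    // The comment holds the glyph first, then the canonical name.
    let mut words = comment.split_whitespace();
    let glyph = words.next()?.to_string();
    let mut name_words: Vec<&str> = words.collect();

    // Assumption: a leading word like "E0.6" is a version tag, not part of the name.
    let has_version_tag = name_words.first().map_or(false, |word| {
        let mut chars = word.chars();
        chars.next() == Some('E') && chars.next().map_or(false, |c| c.is_ascii_digit())
    });
    if has_version_tag {
        name_words.remove(0);
    }
    if name_words.is_empty() {
        return None;
    }

    Some(TestLine {
        codepoints,
        qualification: status_field.trim().to_string(),
        glyph,
        name: name_words.join(" "),
    })
}
```

Category and subcategory are not on the data lines themselves, so a full parser would also need to track the most recent group and subgroup header lines while iterating.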

Cross referencing is done by examining files within common/annotations/*.xml.
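A minimal sketch of the cross referencing step follows. It assumes an XML parser has already pulled out `(cp, text)` pairs from the annotation elements, and that the text is a `|`-separated keyword list, which is how I understand the CLDR annotation files to be laid out. `split_keywords` and `cross_reference` are illustrative names only.

```rust
use std::collections::HashMap;

/// Splits a '|'-separated annotation string into trimmed keywords.
pub fn split_keywords(annotation_text: &str) -> Vec<String> {
    annotation_text
        .split('|')
        .map(str::trim)
        .filter(|keyword| !keyword.is_empty())
        .map(str::to_string)
        .collect()
}

/// Attaches keywords to the glyphs already known from emoji-test.txt.
/// `annotations` holds (cp attribute, element text) pairs that are assumed
/// to have been extracted from common/annotations/*.xml beforehand.
pub fn cross_reference(
    known_glyphs: &[&str],
    annotations: &[(&str, &str)],
) -> HashMap<String, Vec<String>> {
    let mut keywords_by_glyph = HashMap::new();
    for (cp, text) in annotations {
        // Skip annotations for characters that are not in the emoji list.
        if known_glyphs.contains(cp) {
            keywords_by_glyph
                .entry(cp.to_string())
                .or_insert_with(Vec::new)
                .extend(split_keywords(text));
        }
    }
    keywords_by_glyph
}
```

The linear `contains` check is fine for a sketch; a real build step would probably put the known glyphs in a `HashSet` first.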

Packaging

Interpreting the scraped data and dumping it into a Rust crate should be done with great care. There is a lot of data here, and I think a lot of room for improvement over the current method. I propose the following method:

Use build.rs to download files and generate Rust code. This removes the dependency on JavaScript that this crate currently has, and would allow for a very small footprint -- all generation is done during build time.
After scraping data, use build.rs to dump pre-formatted Rust code into OUT_DIR to be included directly in lib.rs.
In addition to recording each codepoint, include a final piece of metadata -- a compile-time hashmap in lib.rs to help in searching and filtering emoji.
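The generation step could look roughly like the sketch below. For brevity it emits a plain static slice of `(glyph, name)` pairs rather than the compile-time hashmap proposed above. The function names and the output file name `emoji_table.rs` are placeholders.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Renders Rust source for a static table of (glyph, name) pairs.
pub fn render_emoji_table(entries: &[(&str, &str)]) -> String {
    let mut source = String::from("pub static EMOJI: &[(&str, &str)] = &[\n");
    for (glyph, name) in entries {
        // `{:?}` on a str produces a quoted, escaped string literal.
        source.push_str(&format!("    ({:?}, {:?}),\n", glyph, name));
    }
    source.push_str("];\n");
    source
}

/// Writes the rendered table into `out_dir`.
/// In build.rs this would be called with the directory named by the
/// OUT_DIR environment variable that Cargo sets for build scripts.
pub fn write_generated(out_dir: &Path, entries: &[(&str, &str)]) -> io::Result<()> {
    fs::write(out_dir.join("emoji_table.rs"), render_emoji_table(entries))
}
```

On the lib.rs side, the generated file can then be pulled in with `include!(concat!(env!("OUT_DIR"), "/emoji_table.rs"));`.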

Localization

Good news is that CLDR gives annotations in a large number of languages. Bad news is this project should eventually account for that. I propose we stick with English for now and work that out later.

However, here is a rough idea of what I was thinking:

Use crate features for each localization. This will require semi-manual updating when new CLDR localizations come out, but I think it's worth it.
Each feature is the name of the annotations/*.xml file. For example, English localization would be enabled via the en feature.
en should be a default feature

Dependencies

I recommend some of the following crates while working on this project:

phf to create perfect compile-time hashtables
xml-rs to parse the annotations
quote to generate the rust library code
proc_use for separating large modules into different files (these files will get seriously huge if not separated)

More will be needed obviously, but I've had positive experiences with the ones above.

update crate?

Hey, I tried using this crate expecting the functionality of the #1 merge to be present and it's not. I don't see it on docs.rs either. Did you need to push up the new version of the crate?
