
icu4x's Introduction


Welcome to the home page for the ICU4X project.

ICU4X provides components enabling a wide range of software internationalization. It draws deeply from the experience of ICU4C, ICU4J, and ECMA-402, and relies on data from the CLDR project.

The design goals of ICU4X are:

  • Small and modular code
  • Pluggable locale data
  • Availability and ease of use in multiple programming languages
  • Written by internationalization experts to encourage best practices

Stay informed! Join our public, low-traffic mailing list: [email protected]. Note: After subscribing, check your spam folder for a confirmation.

Documentation

For an introduction to the project, please visit the "Introduction to ICU4X for Rust" tutorial. Further tutorials can be found in the tutorial index.

For technical information on how to use ICU4X, visit our API docs (latest stable) or API docs (tip of main).

More information about the project can be found in the docs subdirectory.

Quick Start

An example ICU4X-powered application in Rust may look like the following:

Cargo.toml:

[dependencies]
icu = "1.3.0"

src/main.rs:

use icu::calendar::DateTime;
use icu::datetime::{options::length, DateTimeFormatter};
use icu::locid::locale;

fn main() {
    let options =
        length::Bag::from_date_time_style(length::Date::Long, length::Time::Medium).into();

    let dtf = DateTimeFormatter::try_new(&locale!("es").into(), options)
        .expect("locale should be present in compiled data");

    let date = DateTime::try_new_iso_datetime(2020, 9, 12, 12, 35, 0)
        .expect("datetime should be valid");
    // DateTimeFormatter works with dates in any calendar, so convert the ISO date.
    let date = date.to_any();

    let formatted_date = dtf.format_to_string(&date).expect("formatting should succeed");

    assert_eq!(formatted_date, "12 de septiembre de 2020, 12:35:00");
}

Development

ICU4X is developed by the ICU4X Technical Committee (ICU4X-TC) in the Unicode Consortium. The ICU4X-TC leads strategy and development of internationalization solutions for modern platforms and ecosystems, including client-side and resource-constrained environments. See unicode.org for more information on our governance.

ICU4X-TC convenes approximately once per quarter in advance of ICU4X releases. Most work in the interim takes place in the ICU4X Working Group (ICU4X WG), which makes technical recommendations, lands them in the repository, and records them in CHANGELOG.md. The recommendations of ICU4X WG are subject to approval by the ICU4X-TC.

Please subscribe to this repository to participate in discussions. If you want to contribute, see our contributing.md.

Charter

For the full charter, including answers to frequently asked questions, see charter.md.

ICU4X is a new project whose objective is to solve the needs of clients who wish to provide client-side internationalization for their products in resource-constrained environments.

ICU4X, or "ICU for X", will be built from the start with several key design constraints:

  1. Small and modular code.
  2. Pluggable locale data.
  3. Availability and ease of use in multiple programming languages.
  4. Written by internationalization experts to encourage best practices.

ICU4X will provide an ECMA-402-compatible API surface in the target client-side platforms, including the web platform, iOS, Android, WearOS, WatchOS, Flutter, and Fuchsia, supported in programming languages including Rust, JavaScript, Objective-C, Java, Dart, and C++.

Licensing and Copyright

Copyright © 2020-2024 Unicode, Inc. Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.

The project is released under LICENSE, the free and open-source Unicode License, which is based on the well-known MIT license, with the primary difference being that the Unicode License expressly covers data and data files, as well as code. For further information please see The Unicode Consortium Intellectual Property, Licensing, and Technical Contribution Policies.

A CLA is required to contribute to this project - please refer to the CONTRIBUTING.md file (or start a Pull Request) for more information.


icu4x's Issues

Shall we require no_std compatibility?

Being compatible with #![no_std] is important for running on low-resource devices. The benefits of no_std include:

  1. Makes it harder to accidentally pull in large standard library dependencies, reducing code size
  2. Adds the ability to remove expensive debugging machinery (in terms of code size) from the standard library in release builds, such as stack traces and pretty-print error messages
  3. Compiler settings such as Cargo's -Z build-std can be used to recompile standard library code with the same settings as the application

In a no_std environment, we would still depend on the alloc crate. A lightweight allocator such as wee_alloc can be used when necessary.
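
As a rough sketch of what this could look like in a component crate (the optional "std" feature name and the helper function are illustrative assumptions, not the actual ICU4X setup):

#![cfg_attr(not(feature = "std"), no_std)]

// When the standard library is disabled we still rely on heap allocation,
// so pull in the `alloc` crate explicitly.
#[cfg(not(feature = "std"))]
extern crate alloc;

#[cfg(not(feature = "std"))]
use alloc::{string::String, vec::Vec};

/// A toy function that compiles identically with or without `std`.
pub fn join_subtags(subtags: &[&str]) -> String {
    let parts: Vec<&str> = subtags.to_vec();
    parts.join("-")
}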

Consider using the RACI framework for responsibility assignment

RACI is described here: https://en.wikipedia.org/wiki/Responsibility_assignment_matrix

tl;dr: this allows a distinction between people who are accountable for the outcome, and people who actually do the work (can be the same as accountable, but not necessary); as well as calling out explicitly the folks who can provide info vs people who are informed only.

https://github.com/unicode-org/icu4x/blob/master/docs/triaging.md#assignee talks about a "champion". This would be "accountable". Using prior art in responsibility assignment allows us not to spend time reinventing that.

[locale] Potential parsing optimization based on subtag length

In #43 I was able to improve the parsing performance by ~24% by separating length measurement from parsing and returning Err before parsing the subtag into a TinyStr.

The reason this works so well is that in multiple places we use parsing to test whether a subtag is of a given type, but in most cases we can determine that from the length of the subtag alone.

A potential optimization would be to either separate length measurement in the parser (take the length, check whether it's in 2..=4, and if not don't even try to parse it as X), or to add an internal constructor that takes a length (from_bytes_with_length) so that the length of the subtag is measured only once.

Depending on the cost of constructing the Result::Err and the cost of taking the length of a subtag (we do this once per from_bytes call), we may get additional wins there.
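
A minimal sketch of the idea (hypothetical function and error names, loosely modeled on the tinystr 0.x API rather than the actual unic-langid internals):

use tinystr::TinyStr4;

// Hypothetical error type for illustration.
#[derive(Debug)]
enum ParserError {
    InvalidSubtag,
}

// Reject candidates by length before attempting the full parse.
fn parse_region(subtag: &[u8]) -> Result<TinyStr4, ParserError> {
    // Region subtags are either 2 alphabetic or 3 numeric bytes; anything
    // else can be rejected without constructing a TinyStr at all.
    if !matches!(subtag.len(), 2 | 3) {
        return Err(ParserError::InvalidSubtag);
    }
    // Only now pay the cost of building the TinyStr (character-class
    // validation is omitted here for brevity).
    TinyStr4::from_bytes(subtag).map_err(|_| ParserError::InvalidSubtag)
}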

Import unicode-normalization or re-write from scratch?

@markusicu has done a great deal of work on ICU4C's normalizer. It depends on low-level and highly optimized data structures such as UCPTrie.

Writing normalization code from a clean room would allow us to:

  1. Use the same core algorithms as ICU4C, allowing better interop of code, data, and clients
  2. Build in proper string handling (#14)
  3. Integrate it with ICU4X's locale data pipeline (including UCD data)

What are the key low-level data structures we need to support ECMA-402?

ICU4C/ICU4J is built on top of a large, internal standard library of low-level data structures. We will likely want at least a subset of those in OmnICU.

My question is, which ones do we need in order to support the ECMA-402 featureset?

Examples:

  1. UCharsTrie: map strings to integers. Useful for parsing well-defined syntaxes like number skeletons and measurement unit identifiers.
  2. UCPTrie: map code points to integers. Useful as a basis for UnicodeSet.
  3. UnicodeSet: a set of code points and strings (usually graphemes). Useful for a wide range of operations in ICU.
  4. FormattedStringBuilder: A string builder that supports field positions (formatToParts) and is optimized for dealing with LDML string patterns.

CC @markusicu

Add macros for locale/langid

One of the features of unic-locale is that it allows for "free" encoding of language identifiers, locales and subtags thanks to proc macros.

In the current model it's quite quirky, and I didn't want to include it in the initial landing. For that code to work, we only need to add two methods per subtag, and at least one of them doesn't have to be public; they just need to make it possible to create a subtag from a pre-computed u64/u32 (which is what the proc macro will produce).

The good news is that Rust is about to stabilize function-like proc macros in expression position (targeting Rust 1.45), which will allow us to get this feature without the multi-crate hacks required before.
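
For illustration, a hedged sketch of the kind of usage this enables, written against the icu::locid macros as they exist today (the unic-locale names may differ):

use icu::locid::{langid, LanguageIdentifier};

fn main() {
    // Compile-time construction: the tag is parsed and validated by the
    // macro, so no parsing work happens at runtime.
    let compile_time: LanguageIdentifier = langid!("en-US");

    // Equivalent runtime construction via parsing.
    let runtime: LanguageIdentifier = "en-US".parse().expect("valid language identifier");

    assert_eq!(compile_time, runtime);
}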

Problems with multiple variant subtags (use cases?)

The unic-langid code @zbraniecki is importing supports multiple variant subtags, as is required by the BCP47 standard. However, I had previously found that the requirement to store and sort a variable-length list of subtags doubles the code size of unic-langid (zbraniecki/unic-locale#49).

Claim: Most language tags don't have variant subtags, and of the ones that do, they usually only have one. I don't have data to back up this claim.

Given that (1) unic-langid is very low-level and (2) multiple variant subtags are uncommon (a claim which could be refuted by evidence), I was thinking about changing the data model:

  • Instead of using a Vec to store the variant subtags, store them as a TinyStrAuto or TinyStr16*
    • * If stored as a TinyStr16, we would just fail to parse if there are too many variants to fit. No, this is not BCP47-compliant, but can you point to real-world use cases we would break?
  • The variants string would be of the form "variant1-variant2-variant3", in alphabetical order

We could still have helper methods to split out the different variants, but the data model would be significantly lighter weight, especially if we use TinyStr16 and reject language codes with too many variants.

To be clear, this would only need to affect LanguageIdentifier, not Locale, since Locale needs to carry the extra machinery in order to handle Unicode extension subtags.
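
A rough sketch of the proposed lighter-weight data model (field and type names are hypothetical):

use tinystr::{TinyStr16, TinyStr4, TinyStr8};

// Hypothetical model: all variants packed into one fixed-size field.
struct LanguageIdentifier {
    language: Option<TinyStr8>,
    script: Option<TinyStr4>,
    region: Option<TinyStr4>,
    // "variant1-variant2-variant3", alphabetically sorted; parsing fails if
    // the combined variants do not fit in 16 bytes.
    variants: Option<TinyStr16>,
}

impl LanguageIdentifier {
    // Helper that splits the packed field back into individual variants.
    fn variant_subtags(&self) -> impl Iterator<Item = &str> + '_ {
        self.variants
            .as_deref()
            .unwrap_or("")
            .split('-')
            .filter(|v| !v.is_empty())
    }
}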

[locale] Add a second component owner/peer

The Locale component is currently owned by me, and I get a lot of Rust-specific help from Manish, who has been my reviewer for this codebase.

Per today's conversation, it would be good to have a second person to co-own the component with me from the implementation/API etc. point of view.

I don't think it's blocking or urgent, but I'd like to file this issue so that we make sure to get proper ownership coverage per component as we move forward.

Make standard library dependencies pluggable

In order to reduce code size when shipping binaries to other environments, such as WebAssembly, it may be desirable to give a way to leverage the environment's standard library instead of building the Rust version into the OmnICU binary.

For example, JavaScript has a Map type. If we build OmnICU to target WebAssembly, it would be nice if OmnICU could import JavaScript Map instead of shipping hashbrown.

CC @kripken

Performance testing against ICU4C

The discussion in #40 regarding implementation touches on performance of existing ICU4C code vs. the pre-existing Rust module unicode-normalization.

Performance results enable comparison between implementations, which in turn enables a decision on future implementation strategy. Performance results can also be useful in their own right as a measure of Rust code across changes, independent of Rust vs. C comparisons.

Some of the aspects of performance testing:

  • Test input data set
    • Where to gather info?
    • Should the test data be representative of typical use cases or a stress test (worst-case)?
    • How much is enough?
  • Testing framework
    • Rust-only: cargo bench provides a way of running benchmarks over Rust code
    • Rust vs. C comparisons - how is a fair comparison achieved? Is calling ICU4C from Rust via FFI acceptable?

Crate naming case convention

I've seen both kebab case and snake case used in crate names.

Which convention do we want to adopt?

  • icu-locale
  • icu_locale

Ergonomic API & Data Providers

Based on pull request #28, I would like to discuss ways we can deal with data providers and the end-client API (referred to in the document as the ergonomic API).

I feel that the average developer shouldn't care where the data comes from, but should be aware of the async nature of the request, as long as the project as a whole can set that up for them. Think about Chrome, where the browser/renderer processes arrange for data to be fetched from disk or, if missing, from a service. An ordinary developer wouldn't need to make that decision at every point of interaction with our API.

A similar approach to what @zbraniecki proposed for caching can be applied to data providers. We can have a simple DataProviderCache object that's globally available to all constructors/methods. I don't expect a single instance of our library to have more than a handful of different providers (if that), so the cache would be fairly small.

An example of DataProviderCache initialization:

data_provider_cache = DataProviderCache()
data_provider_cache.insert('static_data', static_provider[, preference_level_0])
data_provider_cache.insert('aws_data', aws_provider[, preference_level_1])
data_provider_cache.insert('slow_data', slow_provider[, preference_level_2])
...

The preference level was added in case two providers can supply the same data set, but at different costs in terms of speed, dollar amount, etc.

Each data provider would know which locales it can handle and which data it can provide for each. It would also be able to tell whether it already has that data, so a new fetch is not necessary.

Our ergonomic API in that case would take the shape of:

Intl.NumberFormat(locale, options)

or if we want to enable developers to enforce specific data sources:

Intl.NumberFormat(locale, options, ['static_data', 'aws_data'])
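
A hedged Rust restatement of the pseudocode above (all names are hypothetical; this is a sketch of the idea, not a proposed API):

// Hypothetical provider trait; the real trait design is discussed in the
// "Data provider trait definition" issue below.
trait DataProvider {
    fn has_data_for(&self, locale: &str, key: &str) -> bool;
}

// Providers are kept ordered by preference level (lower = preferred).
struct DataProviderCache {
    providers: std::collections::BTreeMap<u8, (String, Box<dyn DataProvider>)>,
}

impl DataProviderCache {
    fn insert(&mut self, name: &str, provider: Box<dyn DataProvider>, preference: u8) {
        self.providers.insert(preference, (name.to_string(), provider));
    }

    // Return the most-preferred provider that can serve the request.
    fn select(&self, locale: &str, key: &str) -> Option<&dyn DataProvider> {
        self.providers
            .values()
            .find(|(_, p)| p.has_data_for(locale, key))
            .map(|(_, p)| &**p)
    }
}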

Data provider trait definition

In data-pipeline.md, I explain the need for a data provider trait, with an open question being exactly what the definition of this trait should be. The purpose of this thread is to reach a cohesive solution for that trait.

Requirements list (evolving):

  1. Should not require unsafe code in Rust
  2. Should be fast when loading data from static memory (avoid requirement to clone)
  3. Should be amenable to data providers being connected in a chain, performing transformations on certain keys and passing through others
  4. Should be ergonomic to use across an FFI boundary
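
As a starting point for discussion, a minimal sketch of what such a trait could look like (all names are hypothetical; the borrowed-vs-owned payload is what makes requirement 2 interesting):

use std::borrow::Cow;

// Hypothetical request/response/error types for illustration only.
struct Request<'a> {
    locale: Cow<'a, str>,
    key: &'static str,
}

struct Response<'d> {
    // Borrowed when served from static memory, owned when loaded at runtime
    // (requirement 2: no mandatory clone).
    payload: Cow<'d, [u8]>,
}

#[derive(Debug)]
struct DataError;

// Object-safe, so providers can be boxed, chained (requirement 3), and
// wrapped for FFI (requirement 4) without unsafe code (requirement 1).
trait DataProvider<'d> {
    fn load(&self, req: &Request) -> Result<Response<'d>, DataError>;
}

// A chaining provider that tries a primary source and falls back to another.
struct ForkingProvider<'d> {
    primary: Box<dyn DataProvider<'d> + 'd>,
    fallback: Box<dyn DataProvider<'d> + 'd>,
}

impl<'d> DataProvider<'d> for ForkingProvider<'d> {
    fn load(&self, req: &Request) -> Result<Response<'d>, DataError> {
        self.primary.load(req).or_else(|_| self.fallback.load(req))
    }
}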

[locale] Consider using a parsing iterator for comparing

At the moment PartialEq<&str> relies on to_string of the LanguageIdentifier, but as we can see from benchmarks, parsing is much faster than serializing.
Therefore it might be nice to do the reverse - start parsing the &str and compare subtags as we go.
The nice thing about this is that if we encounter mismatching tags early on, we can stop parsing.
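
A hedged sketch of the idea (a hypothetical free function rather than the actual PartialEq impl; variants and extensions are omitted):

// Compare a parsed language/script/region triple against a raw string
// without serializing the parsed side.
fn eq_str(lang: &str, script: Option<&str>, region: Option<&str>, other: &str) -> bool {
    // Walk the raw string's subtags lazily.
    let mut subtags = other.split('-');

    // The language subtag must be present and match (ASCII case-insensitively).
    match subtags.next() {
        Some(s) if s.eq_ignore_ascii_case(lang) => {}
        _ => return false,
    }

    // Compare the optional script and region in order, stopping at the first
    // mismatch instead of serializing the whole identifier.
    for expected in [script, region].into_iter().flatten() {
        match subtags.next() {
            Some(s) if s.eq_ignore_ascii_case(expected) => {}
            _ => return false,
        }
    }

    // No trailing, unmatched subtags allowed.
    subtags.next().is_none()
}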

IDL approach

Complementary to #1.

In case it is useful, we could consider using an IDL to auto-generate libraries from a common core implementation.

Pros:

  • No need to transpile.

Cons:

  • Won't work for languages that don't use FFI.

Add CI for linting

Things we should cover in a linting CI:

  1. Code format is consistent with cargo fmt
  2. Copyright headers are correct

Running tests with different feature sets / architectures

We should think about how we test different feature sets and architectures. By default, cargo test only tests your default architecture and the crate's default features.

Examples of things we want to test:

  • std vs. no_std environments (by enabling or disabling the std feature)
  • Building for the wasm32-unknown-unknown target
  • Building on non-Linux platforms and non-x86 architectures: Windows, macOS, ARM

Note: rust-lang/cargo#2911 is a feature request to allow integration tests to choose different feature sets.
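
For example (assuming the crate exposes an optional std feature), a CI matrix could run something along these lines:

cargo test                                    # default features, host architecture
cargo test --no-default-features              # no_std configuration
cargo test --all-features
rustup target add wasm32-unknown-unknown
cargo build --target wasm32-unknown-unknown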

Where do data struct definitions live?

My current WIP for the data provider (#61) passes around abstract hunks of data (Any in Rust), which can be cast to specific structs using dynamic type checking at runtime.

// Build a request for the decimal symbols of the root locale.
let request = datap::Request {
    locale: "root".to_string(),
    category: datap::Category::Decimal,
    key: datap::Key::Decimal(Key::SymbolsV1),
    payload: None,
};
let response = my_data_provider.load(request).unwrap();
// Downcast the abstract payload to the concrete struct at runtime.
let decimal_data = response.borrow_payload::<&SymbolsV1>().unwrap();

My question is: where does the struct definition SymbolsV1 live in code? It could live:

  1. In the number crate, near where the struct is being consumed.
    • Pro: Data definitions are close to where they are used. No need to touch the common crate when adding or updating a struct.
    • Con: Data provider implementations (e.g., a JSON or Bincode data provider) need to depend on dozens of crates in order to pull in all the struct definitions they need.
  2. In a common ICU4X crate, alongside other struct definitions.
    • Pro: Everything the data provider needs is in one place. We can release canonical "macro" structs that are a collection of several smaller ones, useful for JSON serialization.
    • Con: Not as extensible to people who want to build their own structs on top of the ICU4X data pipeline machinery.

I'm doing Option 2 in #61. I noticed Elango is doing Option 1 in #86. Because of the extensibility argument, I have a slight preference for Option 1, but I fear it could get unwieldy for data provider implementations.

One big crate or many small crates?

We've agreed to start out with a monorepo. However, the question remains about whether we ship artifacts as one large crate or many small crates.

Design a cohesive solution for supported locales

I very often see clients who want to use ICU as a default behavior, but fall back to custom logic if ICU does not support a given locale.

The main problem, of course, is that the locale fallback chain is an essential piece of whether or not a locale is supported. If you have locale data for "en" and "en_001", but request "en_US" or "en_GB", the answer is that both of those locales are supported, even though they both load their data from a fallback locale.

I'm not 100% confident, but I think the prevailing use case is that programmers want to know whether the locale falls back all the way to root. If it gets "caught" by an intermediate language, then that's fine, as long as we don't use the stub data in root.

ECMA-402 has the concept of supportedLocalesOf. Although it's not clear in the MDN, it appears that this method has the ability to check for locale fallbacks. This is better than ICU4C's behavior of getAvailableLocales, which returns a string list and requires the user to figure out how to do fallbacks and matching on that list.

We could consider whether this use case fits in with the data provider, or whether we want to put it on APIs directly.

Document and address WebAssembly constraints

@hagbard wrote a doc describing some constraints that WebAssembly brings which make it not a viable option for porting ICU4X to the Web Platform. We should:

  1. Add this document to GitHub
  2. Raise these concerns to the WebAssembly standards group to make progress on addressing those problems in future WebAssembly releases

Automate design doc approval rules

The Bylaws state that official decisions for design docs should be made at meetings with consensus.

In order to better stick to that process, we can have a Github PR bot that checks if reviewers from at least 2 member companies have reviewed the PR. We can reuse or extend a similar Github PR bot used for ICU PRs.

OmnICU Rust Style Guide

@hagbard has been drafting a style guide for OmnICU code in Rust. We should clean up the style guide and check it in to the docs/ folder as a Markdown file.

Document intent for regular expressions

ecosystem.md mentions icu::Regex. The Rust regex crate already exists and is very performant (in part due to not supporting some Perl-popularized features that aren't actually regular and hinder performance).

It might be useful to signal intent in this area at some point.

Does the project seek to provide regular expressions that operate on UTF-8 for Rust apps? If so, what would be the elevator pitch relative to the regex crate?

Does the project seek to provide regular expressions that operate on UTF-16 and Latin1 and conform to ECMAScript regular expressions for use in JavaScript engines? If so, what would be the elevator pitch relative to what SpiderMonkey and V8 already have?

Does the project seek to provide regular expressions that Dart or Go programs would use? If so, what would be the elevator pitch relative to what the standard libraries of these languages provide?

Does the project seek to provide regular expressions that C or C++ apps would use via FFI? If so, would this just be an FFI wrapper around the regex crate (i.e., UTF-8), something new, or something operating on UTF-16?

Naming convention for locale variable

There are a few conventions for how to name a locale variable. The two I've seen the most are:

  1. locale
  2. loc

Which convention should we adopt in ICU4X, e.g., in argument names?

Should we include IDNA in ICU4X?

IDNA is in UTS 46, but it is a very specific subject area. The existing IDNA Rust crate lives alongside other URL-related functionality.

ICU's IDNA has had bugs piling up, with no clear owner, since IDNA is largely tangential to the core ICU functionality of localized text processing. UAX 31 and UTS 39 (identifiers and confusables) are largely in the same bucket.

I personally would like to see us focus on ECMA-402, and declare UAX 31, UTS 39, and UTS 46 out of scope of ICU4X, at least for the time being.

@markusicu @macchiati @kpozin

Decide on exposing LanguageIdentifier

This is a follow-up to #43 (comment).

There are some compelling arguments that we should expose only Locale and not LanguageIdentifier as public API in ICU4X. This issue is a reminder to revisit this discussion once the rest of unic-locale is rolled in and we are able to perform more testing.

Async or sync APIs

Our data loading involves reading files on demand, and sometimes even requesting data from a REST provider over HTTP.

Is the plan to make all the APIs asynchronous? Or to have two versions of each method?

CC: @sffc @zbraniecki

Document common commands that are available for each component

Spinning this off from #43.

Rust provides a number of commands that work universally with each component. @nciric suggested dropping the plain listing of those commands from README.md, which I did, but I think it would be nice to have them documented somewhere.

Those commands are:

cargo test
cargo bench
cargo doc --open
cargo build --release
cargo fmt
cargo clippy

Define traits for Unicode and ECMA 402 APIs

This issue is a companion of unicode-org/rust-discuss#14.

How interested are we, in general, in having a common set of trait-specified API surfaces to program against?
The idea is to allow multiple implementations, say omnicu (this work) and one based on icu4c (like https://github.com/google/rust_icu). The theory is great, but practice is proving to be a bit more difficult than I thought.

Trying my hand at this, I found that Rust puts constraints on what the traits can look like, specifically because it is involved to return some useful forms of iterator.

Here's my attempt to work this out for https://github.com/zbraniecki/unic-locale/tree/master/unic-langid. For example, having a trait that returns an iterator trait (e.g. ExactSizeIterator) seems quite complicated because of the need to arrange the correct lifetimes of the iterator, the iterated-over objects themselves, and the owner of the iterated-over objects. I got to this, but I'm not very happy about the outcome: https://github.com/unicode-org/rust-discuss/pull/19/files

FYI @zbraniecki

How should we build static data?

When we don't want to perform file or network I/O, it may be necessary to build the data into a binary format that can be built into the Rust static memory space.

We can use a solution like CBOR or bincode.

However, these solutions require parsing the data structure into Rust objects at runtime. That should be pretty quick, and string buffers could point into static memory, but a pure Rust static structure would likely be the fastest. I have not yet done code size or performance testing.
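
A small sketch of the bincode variant (hypothetical struct and file names, assuming the bincode 1.x and serde derive APIs):

use serde::Deserialize;

// Hypothetical data struct for illustration.
#[derive(Deserialize)]
struct DecimalSymbols {
    decimal_separator: String,
    grouping_separator: String,
}

// The serialized blob lives in the binary's static memory; no file or
// network I/O is needed at runtime.
static DECIMAL_DATA: &[u8] = include_bytes!("decimal_symbols.bincode");

fn load_decimal_symbols() -> DecimalSymbols {
    // Deserialization into Rust objects still happens at runtime; a pure
    // Rust static structure would avoid even this step.
    bincode::deserialize(DECIMAL_DATA).expect("static data should be valid")
}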

String encodings (UTF-8/UTF-16)

We want to make OmnICU able to be ported to other environments where strings are not necessarily UTF-8. For example, if called via FFI from Dart or WebAssembly, we may want to process strings that are UTF-16. How should we deal with strings in OmnICU in a way that makes it easy to target UTF-16 environments while not hurting performance in UTF-8 environments?

Code coverage

I find continuous code coverage very useful for finding gaps in test coverage.

I have only ever used Coveralls, but it works quite well. Here's an example for fluent-rs.

I think we should set up some CI + code coverage, either via GitHub Actions, Travis, or another solution. tarpaulin, the crate I'm using, also supports Codecov.

C++/Rust FFI

One of the questions we had from clients was how hard it is to use the library through FFI.

I played around with C++, Rust & cbindgen, based on Rust FFI Omnibus and other sites.

Here are results in my experimental repo - files of interest are main.cc, lib.rs.

I could document findings so far if people feel it's useful. We can expand it over time.
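
Roughly, the Rust side of such an experiment looks like this (the function below is a hypothetical stand-in, not the code in the experimental repo); cbindgen then generates a C/C++ header from the #[no_mangle] extern "C" items:

// lib.rs: a minimal function exposed over the C ABI.
#[no_mangle]
pub extern "C" fn icu4x_langid_is_ascii(ptr: *const u8, len: usize) -> bool {
    if ptr.is_null() {
        return false;
    }
    // Safety: the caller promises that `ptr` points to `len` valid bytes.
    let bytes = unsafe { std::slice::from_raw_parts(ptr, len) };
    // Stand-in for real validation logic.
    std::str::from_utf8(bytes).map(|s| s.is_ascii()).unwrap_or(false)
}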

Data-driven testing

One part of the larger testing strategy for ICU4X would be to have a clean, consistent way of organizing the unit tests for the business logic (i18n algorithms). In particular, it would be nice to have a data-oriented style of testing, as exemplified already in some parts of @zbraniecki's unic-locale repo. ICU unit tests tend to be written in a parameterized style, but the idea here is to take the data-driven nature further.

Pros and Cons

Pros:

  • ICU algorithms are mostly (all?) stateless, so describing in terms of data condenses test code to essential inputs and outputs
  • Using a data representation makes tests language agnostic and reusable in target languages
  • Representing tests as data would mirror the essence of user inputs and outputs described in the wrapping layer
  • A concise notation would incentivize uniformity and concision of testing

Cons:

  • Need to create a test harness/mini-framework to parse and execute test cases written as data
  • Target languages wanting to reuse tests-as-data on ICU4X client libraries would need to write a native harness/mini-framework
  • Tests that depend on stateful resources (ex: data stores) with setup/teardown phases would have to be handled in some other way
  • Representing tests in a new way would require some initial adjustment

Existing libraries

Most searches for "data driven testing" produce results for databases, spreadsheets, and automated web UI testing. Links to more relevant pre-existing libraries are welcome.

Some examples of test libraries written to reduce the cognitive load when testing, especially when testing data collections:

  • Truth - Fluent assertions for Java and Android
  • Expectations - error messages that do a nested data diff on expected vs. actual

Testing features

Beyond just asserting that the actual return value matches the provided expected value, we should also consider the following testing aspects:

  • matching functions/operators (equal, not equal, contains, does not contain, equal contents in {any, same} order, ...)
  • test name
  • test type (corresponding to ICU4X feature)
  • message on error
  • expected failure modes (ex: exceptions/panics)
  • parameterization (single test input vs. list of related test inputs)
  • APIs to read/construct test cases (ex: to enable interop with fuzz testing?)
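
As one possible shape (a hypothetical schema and harness using serde_json, not an agreed-upon design), test cases could be written as data and executed by a small native harness in each target language:

use serde::Deserialize;

// Hypothetical test-case schema shared across implementations.
#[derive(Deserialize)]
struct Case {
    name: String,
    input: String,
    expected: String,
}

// The Rust harness parses the cases and runs them against the function under
// test; other target languages would ship their own small harness.
fn run_cases(json: &str, function_under_test: impl Fn(&str) -> String) {
    let cases: Vec<Case> = serde_json::from_str(json).expect("test data should be valid JSON");
    for case in cases {
        assert_eq!(function_under_test(&case.input), case.expected, "case: {}", case.name);
    }
}

#[test]
fn canonicalize_cases() {
    let data = r#"[{"name": "already canonical", "input": "en-US", "expected": "en-US"}]"#;
    // Identity function as a stand-in for the real canonicalizer under test.
    run_cases(data, |s| s.to_string());
}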

Transpilation approach

One approach to accomplish the main goals in the OmnICU Charter is to have a transpiler that converts input source code into the equivalent source code for each target language (or platform).

The intent of this approach is to decouple the code deliverable from the target language toolchain and allow the toolchain to optimize code on a per-application basis. The result would allow the target code to run with minimal dependencies and code size.

The input source code should represent the i18n functionality that is in the scope for OmnICU. It should also be able to pass unit tests that check the logical correctness of the input code itself, and do so before any transpilation occurs. This ensures an authoritative source of truth for logical correctness tests, decoupled from target language/runtime effects.

Diplomat: Investigate using Rust as a transpilation source

The "X" part of ICU4X is still not fully defined. WebAssembly has been tossed out as one potential solution (main issue: #73). However, we should also investigate other approaches. One of those approaches could be treating Rust as a source language for transpilation.

Rust has certain advantages to serve as a source language for transpilation:

  1. To transpile to most other languages, you need to "remove" things (lifetimes, etc) rather than add things. The high amount of static analysis that can be done on native Rust is a treasure trove of information to use Rust as a source for transpilation.
  2. Since we are writing Rust from scratch, we can write it in a way to be "transpiler-friendly": for example, abstracting out standard library dependencies.

Obviously, the big hurdle is that we would have to figure out how to write this Rust transpiler, which, as far as I can tell, does not exist yet. For both Lisp and Rust, since these are uncommon languages at Google, it is uncertain whether we could find another team who would have the expertise to own the project.

However, given the strong community around Rust, I have a bit of hope that if we (at Google and Mozilla, two industry leaders) were to write such a tool, that it would obtain community adoption and be used in other projects.

Consider a cargo option to ingest JIS X 0208, Big5, and GB2312 orders from encoding_rs

The CLDR collation order for Japanese includes a chunk of data that's just Level 1 Kanji followed by Level 2 Kanji from JIS X 0208. (Starts with &[last regular]<*亜)

CLDR includes alternative collation orders for Chinese, gb2312han (starts with &[last regular]<*啊) and big5han (starts with &[last regular]<*兙) that appear to be Level 1 Hanzi followed by Level 2 Hanzi from GB2312 and Big5, respectively.

We might want to consider whether it makes sense as a binary size optimization (depending on what data layout the collator needs and whether the gb2312han and big5han orders are in actual use) to provide a cargo option on ICU4X collator and a cargo option on encoding_rs to use the data that already exists in the data segment of apps that depend on encoding_rs for constructing (the relevant parts of) these collation orders.

(This is just "writing this down". I don't expect us to act on this anytime soon.)

Make scope more explicit in the charter

The charter currently says:

OmnICU will provide an ECMA-402-compatible API surface in the target client-side platforms

and:

What if clients need a feature that is not in ECMA-402?
Clients of OmnICU may need features beyond those recommended by ECMA-402. The subcommittee is not ruling out the option of adding additional features in the same style as ECMA-402 to cover additional client needs. The details for how to determine what features belong in OmnICU that aren't already in ECMA-402 will be discussed at a future time.

The caption in ecosystem.md says:

This document tracks the crates that already exist in the ecosystem that cover functionality that we may wish to cover in OmnICU.

I added the word "may" in #41.

I think it's important that we be more explicit about the use cases that ICU4X is supporting. This will guide our discussions, such as #43 (comment), when deciding whether a certain API or functional unit belongs in ICU4X.

We could start by making an explicit list of use cases that warrant APIs and functional units not covered by 402, and adding that list to the charter. It might be best to do this on a case-by-case basis: if proposing a feature not explicitly sanctioned by the charter, then also propose a charter change adding the corresponding use case, so that we can agree on that change in the subcommittee meeting.

@zbraniecki

Crate Namespace

How do we name our crates? This depends a bit on the answer to #13, but we have two general options:

  1. Prefix the names with "icu"
  2. Prefix the names with "omnicu"

@zbraniecki had some arguments for preferring "icu". Can you lay those out here?

Data pipeline doc miscellaneous

There are outstanding comments on the Data Pipeline doc, which has already been merged. I would like to hear feedback first to determine whether they merit changes.

  • definition of "data version" - This definition describes the version of data as something that is "abstracted away from" the versions of the format and schema. I would think that the relationship is actually dependent on the schema, but still independent of the format. Since schema is the structure of the data, if the schema version changes, I would expect it to force the data version to change.

  • Data version - To the extent that this matters or makes sense, it would be more readable if the keys delineate "key segments" differently from multi-word segments. CLDR_37_alpha1 and FOO_1_1 are parsed differently, whereas CLDR-37-alpha1 and FOO-1_1 would be unambiguous.

  • Schema version / Data version - If we allow the data provider to choose which version(s)' worth of data to hold, then it's possible for a user to call data for a key+version which is not supported (maybe the version is too old/new, or the key has changed due to schema change). Do we have a description of how we handle that? We could just make it easy and return null / throw error. I suppose a data provider can be configured to fetch from an authoritative service with all versions of all data (depicted in the diagram?), which makes it a data provider decision/configuration.
