
icu4x's Introduction


Welcome to the home page for the ICU4X project.

ICU4X provides components enabling a wide range of software internationalization. It draws deeply from the experience of ICU4C, ICU4J, and ECMA-402, and relies on data from the CLDR project.

The design goals of ICU4X are:

  • Small and modular code
  • Pluggable locale data
  • Availability and ease of use in multiple programming languages
  • Written by internationalization experts to encourage best practices

Stay informed! Join our public, low-traffic mailing list: [email protected]. Note: After subscribing, check your spam folder for a confirmation.

Documentation

For an introduction to the project, please visit the "Introduction to ICU4X for Rust" tutorial. Further tutorials can be found in the tutorial index.

For technical information on how to use ICU4X, visit our API docs (latest stable) or API docs (tip of main).

More information about the project can be found in the docs subdirectory.

Quick Start

An example ICU4X-powered application in Rust may look like the following:

Cargo.toml:

[dependencies]
icu = "1.3.0"

src/main.rs:

use icu::calendar::DateTime;
use icu::datetime::{options::length, DateTimeFormatter};
use icu::locid::locale;

fn main() {
    let options =
        length::Bag::from_date_time_style(length::Date::Long, length::Time::Medium).into();

    let dtf = DateTimeFormatter::try_new(&locale!("es").into(), options)
        .expect("locale should be present in compiled data");

    let date = DateTime::try_new_iso_datetime(2020, 9, 12, 12, 35, 0)
        .expect("datetime should be valid");
    // DateTimeFormatter works with dates in any calendar, so convert the ISO date.
    let date = date.to_any();

    let formatted_date = dtf.format_to_string(&date).expect("formatting should succeed");

    assert_eq!(formatted_date, "12 de septiembre de 2020, 12:35:00");
}

Development

ICU4X is developed by the ICU4X Technical Committee (ICU4X-TC) in the Unicode Consortium. The ICU4X-TC leads strategy and development of internationalization solutions for modern platforms and ecosystems, including client-side and resource-constrained environments. See unicode.org for more information on our governance.

ICU4X-TC convenes approximately once per quarter in advance of ICU4X releases. Most work in the interim takes place in the ICU4X Working Group (ICU4X WG), which makes technical recommendations, lands them in the repository, and records them in CHANGELOG.md. The recommendations of ICU4X WG are subject to approval by the ICU4X-TC.

Please subscribe to this repository to participate in discussions. If you want to contribute, see our contributing.md.

Charter

For the full charter, including answers to frequently asked questions, see charter.md.

ICU4X is a new project whose objective is to solve the needs of clients who wish to provide client-side internationalization for their products in resource-constrained environments.

ICU4X, or "ICU for X", will be built from the start with several key design constraints:

  1. Small and modular code.
  2. Pluggable locale data.
  3. Availability and ease of use in multiple programming languages.
  4. Written by internationalization experts to encourage best practices.

ICU4X will provide an ECMA-402-compatible API surface in the target client-side platforms, including the web platform, iOS, Android, WearOS, WatchOS, Flutter, and Fuchsia, supported in programming languages including Rust, JavaScript, Objective-C, Java, Dart, and C++.

Licensing and Copyright

Copyright © 2020-2024 Unicode, Inc. Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.

The project is released under LICENSE, the free and open-source Unicode License, which is based on the well-known MIT license, with the primary difference being that the Unicode License expressly covers data and data files, as well as code. For further information please see The Unicode Consortium Intellectual Property, Licensing, and Technical Contribution Policies.

A CLA is required to contribute to this project - please refer to the CONTRIBUTING.md file (or start a Pull Request) for more information.


icu4x's Issues

Shall we require no_std compatibility?

Being compatible with #![no_std] is important for running on low-resource devices. The benefits of no_std include:

  1. Makes it harder to accidentally pull in large standard library dependencies, reducing code size
  2. Adds the ability to remove expensive debugging machinery (in terms of code size) from the standard library in release builds, such as stack traces and pretty-print error messages
  3. Compiler settings such as Cargo's -Z build-std can be used to recompile standard library code with the same settings as the application

In a no_std environment, we would still depend on the alloc crate. A lightweight allocator such as wee_alloc can be used when necessary.
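
As a rough sketch of what this could look like in a component crate (the optional "std" feature name and the helper function are illustrative assumptions, not the actual ICU4X setup):

#![cfg_attr(not(feature = "std"), no_std)]

// When the standard library is disabled we still rely on heap allocation,
// so pull in the `alloc` crate explicitly.
#[cfg(not(feature = "std"))]
extern crate alloc;

#[cfg(not(feature = "std"))]
use alloc::{string::String, vec::Vec};

/// A toy function that compiles identically with or without `std`.
pub fn join_subtags(subtags: &[&str]) -> String {
    let parts: Vec<&str> = subtags.to_vec();
    parts.join("-")
}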

Consider using the RACI framework for responsibility assignment

RACI is described here: https://en.wikipedia.org/wiki/Responsibility_assignment_matrix

tl;dr: this allows a distinction between people who are accountable for the outcome, and people who actually do the work (can be the same as accountable, but not necessary); as well as calling out explicitly the folks who can provide info vs people who are informed only.

https://github.com/unicode-org/icu4x/blob/master/docs/triaging.md#assignee talks about a "champion". This would be "accountable". Using prior art in responsibility assignment allows us not to spend time reinventing that.

[locale] Potential parsing optimization based on subtag length

In #43 I was able to improve the parsing performance by ~24% by separating length measurement from parsing and returning Err before parsing the subtag into a TinyStr.

The reason this works so well is that in multiple places we use parsing to test whether a subtag is of a given type, but in most cases we can determine that from the length of the subtag alone.

A potential optimization would be to either separate length measurement in the parser (take the length, check whether it's in 2..=4, and if not don't even try to parse it as X), or to add an internal constructor that takes a length (from_bytes_with_length) so that the length of the subtag is measured only once.

Depending on the cost of constructing the Result::Err and the cost of taking the length of a subtag (we do this once per from_bytes call), we may get additional wins there.
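
A minimal sketch of the idea (hypothetical function and error names, loosely modeled on the tinystr 0.x API rather than the actual unic-langid internals):

use tinystr::TinyStr4;

// Hypothetical error type for illustration.
#[derive(Debug)]
enum ParserError {
    InvalidSubtag,
}

// Reject candidates by length before attempting the full parse.
fn parse_region(subtag: &[u8]) -> Result<TinyStr4, ParserError> {
    // Region subtags are either 2 alphabetic or 3 numeric bytes; anything
    // else can be rejected without constructing a TinyStr at all.
    if !matches!(subtag.len(), 2 | 3) {
        return Err(ParserError::InvalidSubtag);
    }
    // Only now pay the cost of building the TinyStr (character-class
    // validation is omitted here for brevity).
    TinyStr4::from_bytes(subtag).map_err(|_| ParserError::InvalidSubtag)
}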

Import unicode-normalization or re-write from scratch?

@markusicu has done a great deal of work on ICU4C's normalizer. It depends on low-level and highly optimized data structures such as UCPTrie.

Writing normalization code from a clean room would allow us to:

  1. Use the same core algorithms as ICU4C, allowing better interop of code, data, and clients
  2. Build in proper string handling (#14)
  3. Integrate it with ICU4X's locale data pipeline (including UCD data)

What are the key low-level data structures we need to support ECMA-402?

ICU4C/ICU4J is built on top of a large, internal standard library of low-level data structures. We will likely want at least a subset of those in OmnICU.

My question is, which ones do we need in order to support the ECMA-402 featureset?

Examples:

  1. UCharsTrie: map strings to integers. Useful for parsing well-defined syntaxes like number skeletons and measurement unit identifiers.
  2. UCPTrie: map code points to integers. Useful as a basis for UnicodeSet.
  3. UnicodeSet: a set of code points and strings (usually graphemes). Useful for a wide range of operations in ICU.
  4. FormattedStringBuilder: A string builder that supports field positions (formatToParts) and is optimized for dealing with LDML string patterns.

CC @markusicu

Add macros for locale/langid

One of the features of unic-locale is that it allows for "free" encoding of language identifiers, locales and subtags thanks to proc macros.

In the current model it's quite quirky, and I didn't want to include it in the initial landing. For that code to work, we only need to add two methods per subtag, and at least one of them doesn't have to be public; they just need to make it possible to create a subtag from a pre-computed u64/u32 (which is what the proc macro will produce).

The good news is that Rust is about to stabilize function-like proc macros in expression position (targeting Rust 1.45), which will allow us to get this feature without the multi-crate hacks required before.
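
For illustration, a hedged sketch of the kind of usage this enables, written against the icu::locid macros as they exist today (the unic-locale names may differ):

use icu::locid::{langid, LanguageIdentifier};

fn main() {
    // Compile-time construction: the tag is parsed and validated by the
    // macro, so no parsing work happens at runtime.
    let compile_time: LanguageIdentifier = langid!("en-US");

    // Equivalent runtime construction via parsing.
    let runtime: LanguageIdentifier = "en-US".parse().expect("valid language identifier");

    assert_eq!(compile_time, runtime);
}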

Problems with multiple variant subtags (use cases?)

The unic-langid code @zbraniecki is importing supports multiple variant subtags, as is required by the BCP47 standard. However, I had previously found that the requirement to store and sort a variable-length list of subtags doubles the code size of unic-langid (zbraniecki/unic-locale#49).

Claim: Most language tags don't have variant subtags, and of the ones that do, they usually only have one. I don't have data to back up this claim.

Given that (1) unic-langid is very low-level and (2) multiple variant subtags are uncommon (a claim which could be refuted by evidence), I was thinking about changing the data model:

  • Instead of using a Vec to store the variant subtags, store them as a TinyStrAuto or TinyStr16*
    • * If stored as a TinyStr16, we would just fail to parse if there are too many variants to fit. No, this is not BCP47-compliant, but can you point to real-world use cases we would break?
  • The variants string would be of the form "variant1-variant2-variant3", in alphabetical order

We could still have helper methods to split out the different variants, but the data model would be significantly lighter weight, especially if we use TinyStr16 and reject language codes with too many variants.

To be clear, this would only need to affect LanguageIdentifier, not Locale, since Locale needs to carry the extra machinery in order to handle Unicode extension subtags.
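
A rough sketch of the proposed lighter-weight data model (field and type names are hypothetical):

use tinystr::{TinyStr16, TinyStr4, TinyStr8};

// Hypothetical model: all variants packed into one fixed-size field.
struct LanguageIdentifier {
    language: Option<TinyStr8>,
    script: Option<TinyStr4>,
    region: Option<TinyStr4>,
    // "variant1-variant2-variant3", alphabetically sorted; parsing fails if
    // the combined variants do not fit in 16 bytes.
    variants: Option<TinyStr16>,
}

impl LanguageIdentifier {
    // Helper that splits the packed field back into individual variants.
    fn variant_subtags(&self) -> impl Iterator<Item = &str> + '_ {
        self.variants
            .as_deref()
            .unwrap_or("")
            .split('-')
            .filter(|v| !v.is_empty())
    }
}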

[locale] Add a second component owner/peer

The Locale component is currently owned by me, and I get a lot of Rust-specific help from Manish, who has been my reviewer for this codebase.

Per today's conversation, it would be good to have a second person to co-own the component with me from the implementation/API etc. point of view.

I don't think it's blocking or urgent, but I'd like to file this issue so that we make sure to get proper ownership coverage per component as we move forward.

Make standard library dependencies pluggable

In order to reduce code size when shipping binaries to other environments, such as WebAssembly, it may be desirable to give a way to leverage the environment's standard library instead of building the Rust version into the OmnICU binary.

For example, JavaScript has a Map type. If we build OmnICU to target WebAssembly, it would be nice if OmnICU could import JavaScript Map instead of shipping hashbrown.

CC @kripken

Performance testing against ICU4C

The discussion in #40 regarding implementation touches on performance of existing ICU4C code vs. the pre-existing Rust module unicode-normalization.

Performance results enable comparison between implementations, which in turn enables a decision on future implementation strategy. Performance results can also be useful in their own right as a measure of Rust code across changes, independent of Rust vs. C comparisons.

Some of the aspects of performance testing:

  • Test input data set
    • Where to gather info?
    • Should the test data be representative of typical use cases or a stress test (worst-case)?
    • How much is enough?
  • Testing framework
    • Rust-only: cargo bench provides a way of running benchmarks over Rust code
    • Rust vs. C comparisons - how is a fair comparison achieved? Is calling ICU4C from Rust via FFI acceptable?

Crate naming case convention

I've seen both kebab case and snake case used in crate names.

Which convention do we want to adopt?

  • icu-locale
  • icu_locale

Ergonomic API & Data Providers

Based on pull request #28, I would like to discuss ways we can deal with data providers and the end-client API (referred to in the document as the ergonomic API).

I feel that the average developer shouldn't care where the data comes from, but should be aware of the async nature of the request, as long as the project as a whole can set that up for them. Think about Chrome, where the browser/renderer processes arrange for data to be fetched from disk or, if missing, from a service. An ordinary developer wouldn't need to make that decision at every point of interaction with our API.

A similar approach to what @zbraniecki proposed for caching can be applied to data providers. We can have a simple DataProviderCache object that's globally available to all constructors/methods. I don't expect a single instance of our library to have more than a handful of different providers (if that), so the cache would be fairly small.

An example of DataProviderCache initialization:

data_provider_cache = DataProviderCache()
data_provider_cache.insert('static_data', static_provider[, preference_level_0])
data_provider_cache.insert('aws_data', aws_provider[, preference_level_1])
data_provider_cache.insert('slow_data', slow_provider[, preference_level_2])
...

The preference level was added in case two providers can supply the same data set, but at different costs in terms of speed, dollar amount, etc.

Each data provider would know which locales it can handle and which data it can provide for each. It would also be able to tell whether it already has that data, so a new fetch is not necessary.

Our ergonomic API in that case would take the shape of:

Intl.NumberFormat(locale, options)

or if we want to enable developers to enforce specific data sources:

Intl.NumberFormat(locale, options, ['static_data', 'aws_data'])
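
A hedged Rust restatement of the pseudocode above (all names are hypothetical; this is a sketch of the idea, not a proposed API):

// Hypothetical provider trait; the real trait design is discussed in the
// "Data provider trait definition" issue below.
trait DataProvider {
    fn has_data_for(&self, locale: &str, key: &str) -> bool;
}

// Providers are kept ordered by preference level (lower = preferred).
struct DataProviderCache {
    providers: std::collections::BTreeMap<u8, (String, Box<dyn DataProvider>)>,
}

impl DataProviderCache {
    fn insert(&mut self, name: &str, provider: Box<dyn DataProvider>, preference: u8) {
        self.providers.insert(preference, (name.to_string(), provider));
    }

    // Return the most-preferred provider that can serve the request.
    fn select(&self, locale: &str, key: &str) -> Option<&dyn DataProvider> {
        self.providers
            .values()
            .find(|(_, p)| p.has_data_for(locale, key))
            .map(|(_, p)| &**p)
    }
}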

Data provider trait definition

In data-pipeline.md, I explain the need for a data provider trait, with an open question being exactly what the definition of this trait should be. The purpose of this thread is to reach a cohesive solution for that trait.

Requirements list (evolving):

  1. Should not require unsafe code in Rust
  2. Should be fast when loading data from static memory (avoid requirement to clone)
  3. Should be amenable to data providers being connected in a chain, performing transformations on certain keys and passing through others
  4. Should be ergonomic to use across an FFI boundary
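
As a starting point for discussion, a minimal sketch of what such a trait could look like (all names are hypothetical; the borrowed-vs-owned payload is what makes requirement 2 interesting):

use std::borrow::Cow;

// Hypothetical request/response/error types for illustration only.
struct Request<'a> {
    locale: Cow<'a, str>,
    key: &'static str,
}

struct Response<'d> {
    // Borrowed when served from static memory, owned when loaded at runtime
    // (requirement 2: no mandatory clone).
    payload: Cow<'d, [u8]>,
}

#[derive(Debug)]
struct DataError;

// Object-safe, so providers can be boxed, chained (requirement 3), and
// wrapped for FFI (requirement 4) without unsafe code (requirement 1).
trait DataProvider<'d> {
    fn load(&self, req: &Request) -> Result<Response<'d>, DataError>;
}

// A chaining provider that tries a primary source and falls back to another.
struct ForkingProvider<'d> {
    primary: Box<dyn DataProvider<'d> + 'd>,
    fallback: Box<dyn DataProvider<'d> + 'd>,
}

impl<'d> DataProvider<'d> for ForkingProvider<'d> {
    fn load(&self, req: &Request) -> Result<Response<'d>, DataError> {
        self.primary.load(req).or_else(|_| self.fallback.load(req))
    }
}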

[locale] Consider using a parsing iterator for comparing

At the moment PartialEq<&str> relies on to_string of the LanguageIdentifier, but as we can see from benchmarks, parsing is much faster than serializing.
Therefore it might be nice to do the reverse - start parsing the &str and compare subtags as we go.
The nice thing about this is that if we encounter mismatching tags early on, we can stop parsing.
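
A hedged sketch of the idea (a hypothetical free function rather than the actual PartialEq impl; variants and extensions are omitted):

// Compare a parsed language/script/region triple against a raw string
// without serializing the parsed side.
fn eq_str(lang: &str, script: Option<&str>, region: Option<&str>, other: &str) -> bool {
    // Walk the raw string's subtags lazily.
    let mut subtags = other.split('-');

    // The language subtag must be present and match (ASCII case-insensitively).
    match subtags.next() {
        Some(s) if s.eq_ignore_ascii_case(lang) => {}
        _ => return false,
    }

    // Compare the optional script and region in order, stopping at the first
    // mismatch instead of serializing the whole identifier.
    for expected in [script, region].into_iter().flatten() {
        match subtags.next() {
            Some(s) if s.eq_ignore_ascii_case(expected) => {}
            _ => return false,
        }
    }

    // No trailing, unmatched subtags allowed.
    subtags.next().is_none()
}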

IDL approach

Complementary to #1.

In case it is useful, we could consider using an IDL to auto-generate libraries from a common core implementation.

Pros:

  • No need to transpile.

Cons:

  • Won't work for languages that don't use FFI.

Add CI for linting

Things we should cover in a linting CI:

  1. Code format is consistent with cargo fmt
  2. Copyright headers are correct

Running tests with different feature sets / architectures

We should think about how we test different feature sets and architectures. By default, cargo test only tests your default architecture and the crate's default features.

Examples of things we want to test:

  • std vs. no_std environments (by enabling or disabling the std feature)
  • Building for the wasm32-unknown-unknown target
  • Building on non-Linux platforms and non-x86 architectures: Windows, macOS, ARM

Note: rust-lang/cargo#2911 is a feature request to allow integration tests to choose different feature sets.
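
For example (assuming the crate exposes an optional std feature), a CI matrix could run something along these lines:

cargo test                                    # default features, host architecture
cargo test --no-default-features              # no_std configuration
cargo test --all-features
rustup target add wasm32-unknown-unknown
cargo build --target wasm32-unknown-unknown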

Where do data struct definitions live?

My current WIP for the data provider (#61) passes around abstract hunks of data (Any in Rust), which can be cast to specific structs using dynamic type checking at runtime.

// Build a request for the decimal symbols of the root locale.
let request = datap::Request {
    locale: "root".to_string(),
    category: datap::Category::Decimal,
    key: datap::Key::Decimal(Key::SymbolsV1),
    payload: None,
};
let response = my_data_provider.load(request).unwrap();
// Downcast the abstract payload to the concrete struct at runtime.
let decimal_data = response.borrow_payload::<&SymbolsV1>().unwrap();

My question is: where does the struct definition SymbolsV1 live in code? It could live:

  1. In the number crate, near where the struct is being consumed.
    • Pro: Data definitions are close to where they are used. No need to touch the common crate when adding or updating a struct.
    • Con: Data provider implementations (e.g., a JSON or Bincode data provider) need to depend on dozens of crates in order to pull in all the struct definitions they need.
  2. In a common ICU4X crate, alongside other struct definitions.
    • Pro: Everything the data provider needs is in one place. We can release canonical "macro" structs that are a collection of several smaller ones, useful for JSON serialization.
    • Con: Not as extensible to people who want to build their own structs on top of the ICU4X data pipeline machinery.

I'm doing Option 2 in #61. I noticed Elango is doing Option 1 in #86. Because of the extensibility argument, I have a slight preference for Option 1, but I fear it could get unwieldy for data provider implementations.

One big crate or many small crates?

We've agreed to start out with a monorepo. However, the question remains about whether we ship artifacts as one large crate or many small crates.

Design a cohesive solution for supported locales

I very often see clients who want to use ICU as a default behavior, but fall back to custom logic if ICU does not support a given locale.

The main problem, of course, is that the locale fallback chain is an essential piece of whether or not a locale is supported. If you have locale data for "en" and "en_001", but request "en_US" or "en_GB", the answer is that both of those locales are supported, even though they both load their data from a fallback locale.

I'm not 100% confident, but I think the prevailing use case is that programmers want to know whether the locale falls back all the way to root. If it gets "caught" by an intermediate language, then that's fine, as long as we don't use the stub data in root.

ECMA-402 has the concept of supportedLocalesOf. Although it's not clear in the MDN, it appears that this method has the ability to check for locale fallbacks. This is better than ICU4C's behavior of getAvailableLocales, which returns a string list and requires the user to figure out how to do fallbacks and matching on that list.

We could consider whether this use case fits in with the data provider, or whether we want to put it on APIs directly.

Document and address WebAssembly constraints

@hagbard wrote a doc describing some constraints that WebAssembly brings which make it not a viable option for porting ICU4X to the Web Platform. We should:

  1. Add this document to GitHub
  2. Raise these concerns to the WebAssembly standards group to make progress on addressing those problems in future WebAssembly releases

Automate design doc approval rules

The Bylaws state that official decisions for design docs should be made at meetings with consensus.

In order to better stick to that process, we can have a Github PR bot that checks if reviewers from at least 2 member companies have reviewed the PR. We can reuse or extend a similar Github PR bot used for ICU PRs.

OmnICU Rust Style Guide

@hagbard has been drafting a style guide for OmnICU code in Rust. We should clean up the style guide and check it in to the docs/ folder as a Markdown file.

Document intent for regular expressions

ecosystem.md mentions icu::Regex. The Rust regex crate already exists and is very performant (in part due to not supporting some Perl-popularized features that aren't actually regular and hinder performance).

It might be useful to signal intent in this area at some point.

Does the project seek to provide regular expressions that operate on UTF-8 for Rust apps? If so, what would be the elevator pitch relative to the regex crate?

Does the project seek to provide regular expressions that operate on UTF-16 and Latin1 and conform to ECMAScript regular expressions for use in JavaScript engines? If so, what would be the elevator pitch relative to what SpiderMonkey and V8 already have?

Does the project seek to provide regular expressions that Dart or Go programs would use? If so, what would be the elevator pitch relative to what the standard libraries of these languages provide?

Does the project seek to provide regular expressions that C or C++ apps would use via FFI? If so, would this just be an FFI wrapper around the regex crate (i.e., UTF-8), something new, or something operating on UTF-16?

Naming convention for locale variable

There are a few conventions for how to name a locale variable. The two I've seen the most are:

  1. locale
  2. loc

Which convention should we adopt in ICU4X, e.g., in argument names?

Should we include IDNA in ICU4X?

IDNA is in UTS 46, but it is a very specific subject area. The existing IDNA Rust crate lives alongside other URL-related functionality.

ICU's IDNA has had bugs piling up, with no clear owner, since IDNA is largely tangential to the core ICU functionality of localized text processing. UAX 31 and UTS 39 (identifiers and confusables) are largely in the same bucket.

I personally would like to see us focus on ECMA-402, and declare UAX 31, UTS 39, and UTS 46 out of scope of ICU4X, at least for the time being.

@markusicu @macchiati @kpozin

Decide on exposing LanguageIdentifier

This is a follow-up to #43 (comment).

There are some compelling arguments that we should expose only Locale and not LanguageIdentifier as public API in ICU4X. This issue is a reminder to revisit this discussion once the rest of unic-locale is rolled in and we are able to perform more testing.

Async or sync APIs

Our data loading involves reading files on demand, and sometimes even requesting data from a REST provider over HTTP.

Is the plan to make all the APIs asynchronous? Or to have two versions of each method?

CC: @sffc @zbraniecki

Document common commands that are available for each component

Spinning this off from #43.

Rust provides a number of commands that work universally with each component. @nciric suggested dropping the plain listing of those commands from README.md, which I did, but I think it would be nice to have them documented somewhere.

Those commands are:

cargo test
cargo bench
cargo doc --open
cargo build --release
cargo fmt
cargo clippy

Define traits for Unicode and ECMA 402 APIs

This issue is a companion of unicode-org/rust-discuss#14.

How interested are we, in general, in having a common set of trait-specified API surfaces to program against?
The idea is to allow multiple implementations, say omnicu (this work) and one based on icu4c (like https://github.com/google/rust_icu). The theory is great, but practice is proving to be a bit more difficult than I thought.

Trying my hand at this, I found that Rust puts constraints on what the traits can look like, specifically because it is involved to return some useful forms of iterator.

Here's my attempt to work this out for https://github.com/zbraniecki/unic-locale/tree/master/unic-langid. For example, having a trait that returns an iterator trait (e.g. ExactSizeIterator) seems quite complicated because of the need to arrange the correct lifetimes of the iterator, the iterated-over objects themselves, and the owner of the iterated-over objects. I got to this, but I'm not very happy about the outcome: https://github.com/unicode-org/rust-discuss/pull/19/files

FYI @zbraniecki

How should we build static data?

When we don't want to perform file or network I/O, it may be necessary to build the data into a binary format that can be built into the Rust static memory space.

We can use a solution like CBOR or bincode.

However, these solutions require parsing the data structure into Rust objects at runtime. That should be pretty quick, and string buffers could point into static memory, but a pure Rust static structure would likely be the fastest. I have not yet done code size or performance testing.
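
A small sketch of the bincode variant (hypothetical struct and file names, assuming the bincode 1.x and serde derive APIs):

use serde::Deserialize;

// Hypothetical data struct for illustration.
#[derive(Deserialize)]
struct DecimalSymbols {
    decimal_separator: String,
    grouping_separator: String,
}

// The serialized blob lives in the binary's static memory; no file or
// network I/O is needed at runtime.
static DECIMAL_DATA: &[u8] = include_bytes!("decimal_symbols.bincode");

fn load_decimal_symbols() -> DecimalSymbols {
    // Deserialization into Rust objects still happens at runtime; a pure
    // Rust static structure would avoid even this step.
    bincode::deserialize(DECIMAL_DATA).expect("static data should be valid")
}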

String encodings (UTF-8/UTF-16)

We want to make OmnICU able to be ported to other environments where strings are not necessarily UTF-8. For example, if called via FFI from Dart or WebAssembly, we may want to process strings that are UTF-16. How should we deal with strings in OmnICU in a way that makes it easy to target UTF-16 environments while not hurting performance in UTF-8 environments?

Code coverage

I find continuous code coverage very useful for finding gaps in test coverage.

I have only ever used Coveralls, but it works quite well. Here's an example for fluent-rs.

I think we should set up some CI + code coverage, either via GitHub Actions, Travis, or another solution. tarpaulin, the crate I'm using, also supports Codecov.

C++/Rust FFI

One of the questions we had from clients was how hard it is to use the library through FFI.

I played around with C++, Rust & cbindgen, based on Rust FFI Omnibus and other sites.

Here are results in my experimental repo - files of interest are main.cc, lib.rs.

I could document findings so far if people feel it's useful. We can expand it over time.
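
Roughly, the Rust side of such an experiment looks like this (the function below is a hypothetical stand-in, not the code in the experimental repo); cbindgen then generates a C/C++ header from the #[no_mangle] extern "C" items:

// lib.rs: a minimal function exposed over the C ABI.
#[no_mangle]
pub extern "C" fn icu4x_langid_is_ascii(ptr: *const u8, len: usize) -> bool {
    if ptr.is_null() {
        return false;
    }
    // Safety: the caller promises that `ptr` points to `len` valid bytes.
    let bytes = unsafe { std::slice::from_raw_parts(ptr, len) };
    // Stand-in for real validation logic.
    std::str::from_utf8(bytes).map(|s| s.is_ascii()).unwrap_or(false)
}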

Data-driven testing

One part of the larger testing strategy for ICU4X would be to have a clean, consistent way of organizing the unit tests for the business logic (i18n algorithms). In particular, it would be nice to have a data-oriented style of testing, as exemplified already in some parts of @zbraniecki's unic-locale repo. ICU unit tests tend to be written in a parameterized style, but the idea here is to take the data-driven nature further.

Pros and Cons

Pros:

  • ICU algorithms are mostly (all?) stateless, so describing in terms of data condenses test code to essential inputs and outputs
  • Using a data representation makes tests language agnostic and reusable in target languages
  • Representing tests as data would mirror the essence of user inputs and outputs described in the wrapping layer
  • A concise notation would incentivize uniformity and concision of testing

Cons:

  • Need to create a test harness/mini-framework to parse and execute test cases written as data
  • Target languages wanting to reuse tests-as-data on ICU4X client libraries would need to write a native harness/mini-framework
  • Tests that depend on stateful resources (ex: data stores) with setup/teardown phases would have to be handled in some other way
  • Representing tests in a new way would require some initial adjustment

Existing libraries

Most searches for "data driven testing" produce results for databases, spreadsheets, and automated web UI testing. Links to more relevant pre-existing libraries are welcome.

Some examples of test libraries written to reduce the cognitive load when testing, especially when testing data collections:

  • Truth - Fluent assertions for Java and Android
  • Expectations - error messages that do a nested data diff on expected vs. actual

Testing features

Beyond just asserting that the actual return value matches the provided expected value, we should also consider the following testing aspects:

  • matching functions/operators (equal, not equal, contains, does not contain, equal contents in {any, same} order, ...)
  • test name
  • test type (corresponding to ICU4X feature)
  • message on error
  • expected failure modes (ex: exceptions/panics)
  • parameterization (single test input vs. list of related test inputs)
  • APIs to read/construct test cases (ex: to enable interop with fuzz testing?)
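
As one possible shape (a hypothetical schema and harness using serde_json, not an agreed-upon design), test cases could be written as data and executed by a small native harness in each target language:

use serde::Deserialize;

// Hypothetical test-case schema shared across implementations.
#[derive(Deserialize)]
struct Case {
    name: String,
    input: String,
    expected: String,
}

// The Rust harness parses the cases and runs them against the function under
// test; other target languages would ship their own small harness.
fn run_cases(json: &str, function_under_test: impl Fn(&str) -> String) {
    let cases: Vec<Case> = serde_json::from_str(json).expect("test data should be valid JSON");
    for case in cases {
        assert_eq!(function_under_test(&case.input), case.expected, "case: {}", case.name);
    }
}

#[test]
fn canonicalize_cases() {
    let data = r#"[{"name": "already canonical", "input": "en-US", "expected": "en-US"}]"#;
    // Identity function as a stand-in for the real canonicalizer under test.
    run_cases(data, |s| s.to_string());
}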

Transpilation approach

One approach to accomplish the main goals in the OmnICU Charter is to have a transpiler that converts input source code into the equivalent source code for each target language (or platform).

The intent of this approach is to decouple the code deliverable from the target language toolchain and allow the toolchain to optimize code on a per-application basis. The result would allow the target code to run with minimal dependencies and code size.

The input source code should represent the i18n functionality that is in the scope for OmnICU. It should also be able to pass unit tests that check the logical correctness of the input code itself, and do so before any transpilation occurs. This ensures an authoritative source of truth for logical correctness tests, decoupled from target language/runtime effects.

Diplomat: Investigate using Rust as a transpilation source

The "X" part of ICU4X is still not fully defined. WebAssembly has been tossed out as one potential solution (main issue: #73). However, we should also investigate other approaches. One of those approaches could be treating Rust as a source language for transpilation.

Rust has certain advantages to serve as a source language for transpilation:

  1. To transpile to most other languages, you need to "remove" things (lifetimes, etc) rather than add things. The high amount of static analysis that can be done on native Rust is a treasure trove of information to use Rust as a source for transpilation.
  2. Since we are writing Rust from scratch, we can write it in a way to be "transpiler-friendly": for example, abstracting out standard library dependencies.

Obviously, the big hurdle is that we would have to figure out how to write this Rust transpiler, which, as far as I can tell, does not exist yet. For both Lisp and Rust, since these are uncommon languages at Google, it is uncertain whether we could find another team who would have the expertise to own the project.

However, given the strong community around Rust, I have a bit of hope that if we (at Google and Mozilla, two industry leaders) were to write such a tool, that it would obtain community adoption and be used in other projects.

Consider a cargo option to ingest JIS X 0208, Big5, and GB2312 orders from encoding_rs

The CLDR collation order for Japanese includes a chunk of data that's just Level 1 Kanji followed by Level 2 Kanji from JIS X 0208. (Starts with &[last regular]<*亜)

CLDR includes alternative collation orders for Chinese, gb2312han (starts with &[last regular]<*啊) and big5han (starts with &[last regular]<*兙) that appear to be Level 1 Hanzi followed by Level 2 Hanzi from GB2312 and Big5, respectively.

We might want to consider whether it makes sense as a binary size optimization (depending on what data layout the collator needs and whether the gb2312han and big5han orders are in actual use) to provide a cargo option on ICU4X collator and a cargo option on encoding_rs to use the data that already exists in the data segment of apps that depend on encoding_rs for constructing (the relevant parts of) these collation orders.

(This is just "writing this down". I don't expect us to act on this anytime soon.)

Make scope more explicit in the charter

The charter currently says:

OmnICU will provide an ECMA-402-compatible API surface in the target client-side platforms

and:

What if clients need a feature that is not in ECMA-402?
Clients of OmnICU may need features beyond those recommended by ECMA-402. The subcommittee is not ruling out the option of adding additional features in the same style as ECMA-402 to cover additional client needs. The details for how to determine what features belong in OmnICU that aren't already in ECMA-402 will be discussed at a future time.

The caption in ecosystem.md says:

This document tracks the crates that already exist in the ecosystem that cover functionality that we may wish to cover in OmnICU.

I added the word "may" in #41.

I think it's important that we be more explicit about the use cases that ICU4X is supporting. This will guide our discussions, such as #43 (comment), when deciding whether a certain API or functional unit belongs in ICU4X.

We could start by making an explicit list of use cases that warrant APIs and functional units not covered by 402, and adding that list to the charter. It might be best to do this on a case-by-case basis: if proposing a feature not explicitly sanctioned by the charter, then also propose a charter change adding the corresponding use case, so that we can agree on that change in the subcommittee meeting.

@zbraniecki

Crate Namespace

How do we name our crates? This depends a bit on the answer to #13, but we have two general options:

  1. Prefix the names with "icu"
  2. Prefix the names with "omnicu"

@zbraniecki had some arguments for preferring "icu". Can you lay those out here?

Data pipeline doc miscellaneous

There are outstanding comments on the Data Pipeline doc, which has already been merged. I would like to hear feedback first to determine whether they merit changes.

  • definition of "data version" - This definition describes the version of data as something that is "abstracted away from" the versions of the format and schema. I would think that the relationship is actually dependent on the schema, but still independent of the format. Since schema is the structure of the data, if the schema version changes, I would expect it to force the data version to change.

  • Data version - To the extent that this matters or makes sense, it would be more readable if the keys delineate "key segments" differently from multi-word segments. CLDR_37_alpha1 and FOO_1_1 are parsed differently, whereas CLDR-37-alpha1 and FOO-1_1 would be unambiguous.

  • Schema version / Data version - If we allow the data provider to choose which version(s)' worth of data to hold, then it's possible for a user to call data for a key+version which is not supported (maybe the version is too old/new, or the key has changed due to schema change). Do we have a description of how we handle that? We could just make it easy and return null / throw error. I suppose a data provider can be configured to fetch from an authoritative service with all versions of all data (depicted in the diagram?), which makes it a data provider decision/configuration.
