Comments (12)
Conversion is probably fine, but in the end they are just script codes, so it also makes sense to define the full set once and have Unicode APIs use a subset of the values.
The ones in the UCD are a subset of the full set.
And only the ones in the UCD have Unicode-defined long value names (identifiers).
from icu4x.
There are a few differences, the most well-known one being that `Hani` is a property script with `Hans` and `Hant` the corresponding subtag scripts.
Probably worth adding a conversion anyway, though, with the caveats listed.
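A rough sketch of what "conversion with caveats" could look like. Everything here is hypothetical: `PropertyScript` and `SubtagScript` are stand-ins rather than the real ICU4X types, the numeric values are placeholders, and folding `Hans`/`Hant` into Han is just one possible policy.

```rust
/// Hypothetical stand-in for a Script property value (an open enum over u16).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PropertyScript(pub u16);

impl PropertyScript {
    // Placeholder numeric values chosen for this sketch.
    pub const HAN: Self = Self(17);
    pub const LATIN: Self = Self(25);
}

/// Hypothetical stand-in for a four-letter script subtag such as "Latn".
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SubtagScript(pub [u8; 4]);

/// Property value -> subtag. Fallible because the property type is an open
/// enum and may hold values this toy table does not know about.
pub fn property_to_subtag(p: PropertyScript) -> Option<SubtagScript> {
    match p {
        PropertyScript::HAN => Some(SubtagScript(*b"Hani")),
        PropertyScript::LATIN => Some(SubtagScript(*b"Latn")),
        _ => None,
    }
}

/// Subtag -> property value. Caveat: this sketch folds Hans and Hant into
/// Han, so the simplified/traditional distinction is lost on the way.
pub fn subtag_to_property(s: SubtagScript) -> Option<PropertyScript> {
    match &s.0 {
        b"Hani" | b"Hans" | b"Hant" => Some(PropertyScript::HAN),
        b"Latn" => Some(PropertyScript::LATIN),
        _ => None,
    }
}
```

Note that under this policy a round trip is not the identity: `Hans` converts to Han, which converts back to `Hani`.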
from icu4x.
CC our SAH and PAG fellows, @Manishearth @eggrobin @markusicu
from icu4x.
I think "conversion with caveats" might be fine. Yes, they represent different things.
from icu4x.
Our name lookups are already fallible, so they could just return `None` on non-UCD scripts.
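Roughly like this, as a minimal sketch. The table is a tiny hand-picked sample and `long_name` is a hypothetical helper, not an existing API; `Hans` and `Hant` are treated as having no UCD long name, following the earlier comments.

```rust
/// (short code, UCD long value name if one exists). Toy sample data only.
const SCRIPTS: &[(&str, Option<&str>)] = &[
    ("Latn", Some("Latin")),
    ("Hani", Some("Han")),
    ("Arab", Some("Arabic")),
    ("Hans", None), // assumed here to be a code without a UCD long name
    ("Hant", None),
];

/// Returns the long name, or None when the code is unknown to the table
/// or is known but has no UCD long name.
pub fn long_name(short: &str) -> Option<&'static str> {
    SCRIPTS
        .iter()
        .find(|(code, _)| *code == short)
        .and_then(|(_, name)| *name)
}
```

Callers already have to handle `None` for unknown input, so non-UCD scripts would not add a new failure mode.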
from icu4x.
Is this table available in data, or do we need to create it from the spec/Wikipedia?
from icu4x.
See https://unicode.org/iso15924/iso15924.txt, linked from https://unicode.org/iso15924/codelists.html.
The PVA column is from https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt.
from icu4x.
Also https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
look for Type: script
which becomes this in CLDR:
https://github.com/unicode-org/cldr/blob/main/common/validity/script.xml
Note that the CLDR list includes one or more private use script subtags:
- https://www.unicode.org/reports/tr35/#unicode_script_subtag_validity
- https://www.unicode.org/reports/tr35/#Private_Use_Codes
Qaag is current but yucky... Don't include Qaai, which has become an alias for Zinh.
from icu4x.
- @sffc - We already have a table for mapping the script numeric ID to the string ID. It's called the enum to TinyStr4 name mapper.
- @robertbastian - Can we make the `properties::Script` type use the TinyStr4-underlying repr internally?
- @sffc - That would require loading the table all the time, as well as an indirection to load things from the table. It seems unnecessary. Not all clients need or want the TinyStr4 representation.
- @robertbastian - I don't think we should make clients convert between the two script types on their end. We should choose a single representation. The only value I see in exposing the u16s is interaction with ICU4C, which I'm not even sure is a use case.
- @sffc - It's a rough corner of the API, but I think we should err on the side of modularity. We can make the conversion functions be nicer to use.
- @echeran - LocaleDirectionality spans two crates, which is weird.
- @sffc - Both of these "script" types are context-specific. Are these even things we want to promote to be the canonical representation of a script?
- @zbraniecki - The `icu_locid` representation is definitely designed as a script subtag.
- @robertbastian - Unless we include the whole IANA registry it's always going to be open. Also properties only ever return scripts, it's currently an open enum already.
- @sffc - My position is still that what we have now is the most modular, efficient solution. We shouldn't deviate from that. We can make nice conversion functions, even `From` and `Into` gated on `#[cfg(feature = "compiled_data")]`. But I don't think we should force all users to use the slower, bigger code path. I don't think the motivation is compelling enough for that.
propnames/to/short/linear4/sc@1, und, 802B, 55c3455e15d1d2ae
- @robertbastian - My original proposal was to use the tinystr representation in the data structs themselves, this way there is no conversion cost. It will make the CPT slightly bigger, but this could be offset by conversion code size even.
- @sffc - Maybe that would work... it would change the value size from 2 to 4, which overall is less data than the additional lookup table. However, we lose the ability to return the ICU4C enumerated integers, which I believe is something that we should support so we can be a drop-in replacement.
- @robertbastian - I think that can be modularly added; ICU4C compatibility is not universally required.
- @sffc - It would be easiest and most self-consistent to just keep all of the properties APIs returning integers.
No conclusion yet.
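To make the trade-off in the notes above concrete, here is a hedged sketch of the two representations and a table-backed conversion between them. `IntScript`, `CodeScript`, `UnknownScript`, and `SHORT_NAMES` are hypothetical names, the two table rows are placeholders, and `TryFrom` is used instead of `From` only because the toy table is incomplete.

```rust
use std::convert::TryFrom;
use std::mem::size_of;

/// Hypothetical integer-backed value (2 bytes), like the current property type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct IntScript(pub u16);

/// Hypothetical four-ASCII-letter value (4 bytes), like a TinyStr4-style code.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CodeScript(pub [u8; 4]);

/// Error returned when a value is missing from the toy table.
#[derive(Debug, PartialEq, Eq)]
pub struct UnknownScript;

/// Toy stand-in for the enum-to-short-name table discussed above.
const SHORT_NAMES: &[(u16, [u8; 4])] = &[(17, *b"Hani"), (25, *b"Latn")];

// In the proposal these impls would sit behind
// #[cfg(feature = "compiled_data")]; the attribute is left out here so the
// sketch compiles on its own.
impl TryFrom<IntScript> for CodeScript {
    type Error = UnknownScript;
    fn try_from(s: IntScript) -> Result<Self, Self::Error> {
        SHORT_NAMES
            .iter()
            .find(|(n, _)| *n == s.0)
            .map(|(_, c)| CodeScript(*c))
            .ok_or(UnknownScript)
    }
}

impl TryFrom<CodeScript> for IntScript {
    type Error = UnknownScript;
    fn try_from(s: CodeScript) -> Result<Self, Self::Error> {
        SHORT_NAMES
            .iter()
            .find(|(_, c)| *c == s.0)
            .map(|(n, _)| IntScript(*n))
            .ok_or(UnknownScript)
    }
}

/// Per-value sizes of the two representations, in bytes.
pub fn value_sizes() -> (usize, usize) {
    (size_of::<IntScript>(), size_of::<CodeScript>())
}
```

The sketch only illustrates the "2 versus 4 bytes per value" point and the extra table lookup on conversion; it says nothing about how the real data structs compress.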
from icu4x.
Separately from the discussion of performance, I do think these are two different kinds of things and I would overall prefer us to have an explicit separation of types even if it may be annoying.
I don't think this is an ICU4C compat thing as much as it is a property thing. We implement the Unicode standard, which has specific property values for this property. Even if our numbers were different, I'd still want us to use an open enum here rather than strings.
from icu4x.
I would argue that the concept of a "script" is different than either a Script Subtag or a Script Property.
A principled approach would be to introduce a new `Script` type with a private inner representation. The type can have conversions to and from both `subtags::Script` and `properties::Script`. This new type would also have a function to get the script directionality.
However, I'm not convinced at this time that we have a clear need for that type, nor do I have an idea where such a type would live. Therefore, I tend to think that we should keep the two types context-specific and just focus on the conversion between them.
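For reference, the third-type idea could be sketched roughly as follows. This is purely illustrative: `UnifiedScript`, `SubtagScript`, `PropertyScript`, and `Direction` are hypothetical stand-ins, the numeric values are placeholders, and the directionality table covers only three scripts.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for the subtag type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SubtagScript(pub [u8; 4]);

/// Hypothetical stand-in for the property type (open enum over u16).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PropertyScript(pub u16);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Direction {
    LeftToRight,
    RightToLeft,
}

/// Hypothetical unified type; the field is private so the representation
/// could change without affecting callers.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct UnifiedScript {
    code: [u8; 4],
}

/// Toy mapping between placeholder property values and four-letter codes.
const PROPERTY_CODES: &[(u16, [u8; 4])] =
    &[(2, *b"Arab"), (19, *b"Hebr"), (25, *b"Latn")];

impl From<SubtagScript> for UnifiedScript {
    fn from(s: SubtagScript) -> Self {
        UnifiedScript { code: s.0 }
    }
}

impl From<UnifiedScript> for SubtagScript {
    fn from(s: UnifiedScript) -> Self {
        SubtagScript(s.code)
    }
}

impl TryFrom<PropertyScript> for UnifiedScript {
    type Error = ();
    fn try_from(p: PropertyScript) -> Result<Self, ()> {
        PROPERTY_CODES
            .iter()
            .find(|(n, _)| *n == p.0)
            .map(|(_, c)| UnifiedScript { code: *c })
            .ok_or(())
    }
}

impl TryFrom<UnifiedScript> for PropertyScript {
    type Error = ();
    fn try_from(s: UnifiedScript) -> Result<Self, ()> {
        PROPERTY_CODES
            .iter()
            .find(|(_, c)| *c == s.code)
            .map(|(n, _)| PropertyScript(*n))
            .ok_or(())
    }
}

impl UnifiedScript {
    /// Directionality from a toy table; None when the script is not listed.
    pub fn direction(&self) -> Option<Direction> {
        match &self.code {
            b"Arab" | b"Hebr" => Some(Direction::RightToLeft),
            b"Latn" => Some(Direction::LeftToRight),
            _ => None,
        }
    }
}
```

Even in this toy form the type mostly forwards to the two existing representations, which fits the hesitation about whether it earns its place.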
from icu4x.
I generally agree. I'm not really sure I can see any way a third type would make sense, but I think having the split is somewhat valuable, provided it's easy to convert.
from icu4x.