
markusicu commented on May 28, 2024

Conversion is probably fine, but in the end they are just script codes, so it also makes sense to define the full set once and have Unicode APIs use a subset of the values.

The ones in the UCD are a subset of the full set.

And only the ones in the UCD have Unicode-defined long value names (identifiers).

from icu4x.

sffc commented on May 28, 2024

There are a few differences, the most well known one being that Hani is a property script with Hans and Hant the corresponding subtag scripts.

Probably worth adding a conversion anyway, though, with the caveats listed.
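A minimal sketch of what such a "conversion with caveats" could look like, assuming two hypothetical stand-in types (`SubtagScript`, `PropertyScript`) rather than the real icu4x types. The table entries and numeric values are placeholders for illustration only. The exact conversion returns `None` for Hans and Hant, while a separate lossy conversion folds them into Hani.

```rust
/// Hypothetical stand-in for a script subtag (four ASCII letters, e.g. "Latn").
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SubtagScript(pub [u8; 4]);

/// Hypothetical stand-in for a Script property value (an open enum over u16).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PropertyScript(pub u16);

// Illustrative table only; the numeric values are placeholders.
const TABLE: &[(&[u8; 4], u16)] = &[
    (b"Zyyy", 0),
    (b"Arab", 2),
    (b"Hani", 17),
    (b"Latn", 25),
];

/// Exact conversion: `None` when the subtag has no entry in the table.
pub fn to_property(subtag: SubtagScript) -> Option<PropertyScript> {
    TABLE
        .iter()
        .find(|(name, _)| **name == subtag.0)
        .map(|&(_, value)| PropertyScript(value))
}

/// Lossy conversion: folds Hans and Hant into Hani before the lookup.
pub fn to_property_lossy(subtag: SubtagScript) -> Option<PropertyScript> {
    let folded = match &subtag.0 {
        b"Hans" | b"Hant" => SubtagScript(*b"Hani"),
        _ => subtag,
    };
    to_property(folded)
}
```

Keeping the lossy variant as a separate function makes the caveat visible at the call site instead of hiding it inside a single conversion.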


sffc commented on May 28, 2024

CC our SAH and PAG fellows, @Manishearth @eggrobin @markusicu


Manishearth commented on May 28, 2024

I think "conversion with caveats" might be fine. Yes, they represent different things.


robertbastian commented on May 28, 2024

Our name lookups are already fallible, so they could just return None on non-UCD scripts.
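As a rough illustration of that idea, assuming a hypothetical `ScriptValue` open enum and a tiny placeholder table (not the real icu4x name mapper or its data), a fallible lookup simply has no entry for values outside the table:

```rust
/// Hypothetical open enum: any u16 is representable, whether or not it is named.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ScriptValue(pub u16);

// Illustrative entries only, sorted by value so that binary search is valid.
const SHORT_NAMES: &[(u16, &str)] = &[
    (0, "Zyyy"),
    (2, "Arab"),
    (17, "Hani"),
    (25, "Latn"),
];

/// Returns the short name, or `None` when the value has no entry in the table.
pub fn short_name(script: ScriptValue) -> Option<&'static str> {
    SHORT_NAMES
        .binary_search_by_key(&script.0, |&(value, _)| value)
        .ok()
        .map(|index| SHORT_NAMES[index].1)
}
```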


robertbastian commented on May 28, 2024

Is this table available in data, or do we need to create it from the spec/Wikipedia?


eggrobin commented on May 28, 2024

See https://unicode.org/iso15924/iso15924.txt, linked from https://unicode.org/iso15924/codelists.html.

The PVA column is from https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt.


markusicu commented on May 28, 2024

Also https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
look for Type: script

which becomes this in CLDR:
https://github.com/unicode-org/cldr/blob/main/common/validity/script.xml

Note that the CLDR list includes one or more private use script subtags:

Qaag is current but yucky... Don't include Qaai which has become an alias for Zinh


sffc commented on May 28, 2024
  • @sffc - We already have a table for mapping the script numeric ID to the string ID. It's called the enum to TinyStr4 name mapper.
  • @robertbastian - Can we make the properties::Script type use the TinyStr4-underlying repr internally?
  • @sffc - That would require loading the table all the time, as well as an indirection to load things from the table. It seems unnecessary. Not all clients need or want the TinyStr4 representation.
  • @robertbastian - I don't think we should make clients convert between the two script types on their end. We should choose a single representation. The only value I see in exposing the u16s is interaction with ICU4C, which I'm not even sure is a use case.
  • @sffc - It's a rough corner of the API, but I think we should err on the side of modularity. We can make the conversion functions be nicer to use.
  • @echeran - LocaleDirectionality spans two crates, which is weird.
  • @sffc - Both of these "script" types are context-specific. Are these even things we want to promote to be the canonical representation of a script?
  • @zbraniecki - The icu_locid representation is definitely designed as a script subtag.
  • @robertbastian - Unless we include the whole IANA registry it's always going to be open. Also properties only ever return scripts, it's currently an open enum already.
  • @sffc - My position is still that what we have now is the most modular, efficient solution. We shouldn't deviate from that. We can make nice conversion functions, even From and Into gated on #[cfg(feature = "compiled_data")]. But I don't think we should force all users to use the slower, bigger code path. I don't think the motivation is compelling enough for that.
propnames/to/short/linear4/sc@1, und, 802B, 55c3455e15d1d2ae
  • @robertbastian - My original proposal was to use the tinystr representation in the data structs themselves, this way there is no conversion cost. It will make the CPT slightly bigger, but this could be offset by conversion code size even.
  • @sffc - Maybe that would work... it would change the value size from 2 to 4, which overall is less data than the additional lookup table. However, we lose the ability to return the ICU4C enumerated integers, which I believe is something that we should support so we can be a drop-in replacement.
  • @robertbastian - I think that can be added modularly; ICU4C compatibility is not universally required.
  • @sffc - It would be easiest and most self-consistent to just keep all of the properties APIs returning integers.

No conclusion yet.
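To make the "value size from 2 to 4" point in the notes above concrete, here is a small sketch using two hypothetical wrapper types (`NumericScript`, `NamedScript`); neither is the real icu4x type, and the helper functions exist only for this illustration.

```rust
use std::mem::size_of;

/// Hypothetical 2-byte representation: a numeric property value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NumericScript(pub u16);

/// Hypothetical 4-byte representation: the four ASCII letters of the code.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NamedScript(pub [u8; 4]);

/// Size in bytes of one stored value in each representation.
pub fn per_value_sizes() -> (usize, usize) {
    (size_of::<NumericScript>(), size_of::<NamedScript>())
}

/// Extra bytes needed to store `entries` values as names instead of numbers.
pub fn extra_bytes_for_names(entries: usize) -> usize {
    let (numeric, named) = per_value_sizes();
    entries * (named - numeric)
}
```

Whether that extra storage is smaller than a separate numeric-to-name lookup table depends on how many distinct values the data actually stores, which this sketch does not attempt to measure.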


Manishearth commented on May 28, 2024

Separately from the discussion of performance, I do think these are two different kinds of things and I would overall prefer us to have an explicit separation of types even if it may be annoying.

I don't think this is an ICU4C compat thing as much as it is a property thing. We implement the Unicode standard, which has specific property values for this property; even if our numbers were different, I'd still want us to use an open enum here rather than strings.


sffc commented on May 28, 2024

I would argue that the concept of a "script" is different than either a Script Subtag or a Script Property.

A principled approach would be to introduce a new Script type with a private inner representation. The type can have conversions to and from both subtags::Script and properties::Script. This new type would also have a function to get the script directionality.
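A sketch of that shape, purely for illustration: every type here (`SubtagScript`, `PropertyScript`, `Script`, `Direction`) is a hypothetical stand-in, the table holds placeholder values, and nothing in this thread says such a type was adopted.

```rust
/// Hypothetical stand-ins for the two existing, context-specific types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SubtagScript(pub [u8; 4]);
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PropertyScript(pub u16);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Direction {
    LeftToRight,
    RightToLeft,
}

/// Hypothetical unified type; the inner representation stays private.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Script {
    code: [u8; 4],
}

// Illustrative table: (code, placeholder property value, direction).
const KNOWN: &[([u8; 4], u16, Direction)] = &[
    (*b"Arab", 2, Direction::RightToLeft),
    (*b"Latn", 25, Direction::LeftToRight),
];

impl From<SubtagScript> for Script {
    fn from(subtag: SubtagScript) -> Self {
        Script { code: subtag.0 }
    }
}

impl Script {
    /// Fallible: a property value missing from the table has no code.
    pub fn from_property(property: PropertyScript) -> Option<Self> {
        KNOWN
            .iter()
            .find(|entry| entry.1 == property.0)
            .map(|entry| Script { code: entry.0 })
    }

    pub fn to_subtag(self) -> SubtagScript {
        SubtagScript(self.code)
    }

    /// Fallible: a code missing from the table has no property value.
    pub fn to_property(self) -> Option<PropertyScript> {
        KNOWN
            .iter()
            .find(|entry| entry.0 == self.code)
            .map(|entry| PropertyScript(entry.1))
    }

    /// Directionality, when the table knows the script.
    pub fn direction(self) -> Option<Direction> {
        KNOWN
            .iter()
            .find(|entry| entry.0 == self.code)
            .map(|entry| entry.2)
    }
}
```

In this sketch the subtag direction is infallible and the property direction is not, which mirrors the observation earlier in the thread that only a subset of script codes have property values.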

However, I'm not convinced at this time that we have a clear need for that type, nor do I have an idea where such a type would live. Therefore, I tend to think that we should keep the two types context-specific and just focus on the conversion between them.


Manishearth commented on May 28, 2024

I generally agree. I'm not really sure I can see any way a third type would make sense, but I think having the split is somewhat valuable, provided it's easy to convert.
