
opentype-shaping-documents's People

Contributors

adrianwong, alfiedotwtf, chrissimpkins, n8willis, rajeeshknambiar

opentype-shaping-documents's Issues

Mark tagging

Steps 2-4 in section 2.8 state the following:

(2) All remaining marks must be tagged with the same positioning tag as the closest non-mark character the mark has affinity with, so that they move together during the sorting step.

(3) For all marks preceding the base consonant, the mark must be tagged with the same positioning tag as the closest preceding non-mark consonant.

(4) For all marks occurring after the base consonant, the mark must be tagged with the same positioning tag as the closest subsequent consonant.

Does (2) effectively cover (3)?

Also, in (3), what is a "non-mark consonant"?

Pre-base-reordering "Ra" in Telugu

As part of lengthier discussions in #32 and #41, it's been mentioned that (for Indic scripts at least) pre-base-reordering "Ra" only exists in Malayalam.

However, the Nirmala font has encoded it for Telugu too, and this OpenType entry mentions that Telugu "may display a pre-base form of "Ra"".

E.g. "Ga, Halant, Ra" using Nirmala:
[Screenshot: "Ga, Halant, Ra" rendered with Nirmala]

"Consonant, Halant, ZWJ"

Our state machine recognises a Consonant, Halant, ZWJ sequence as a valid consonant syllable.

Is there such a thing as a consonant syllable that exists in half form?

Our spec states that it's only a Consonant, Halant, ZWJ, Consonant sequence that should receive the half form treatment.

Base Consonant Position

Can the BASE_POS_LAST algorithm be described for Indic in general, or does it actually differ for each script?

Also does the base consonant always have shaping class "CONSONANT" and not "CONSONANT_DEAD"? (There is ambiguity due to a reference to consonants having this shaping class in 2.7: Post-base consonants).

For Sinhala, can the first consonant be preceded by a ZWJ and still be the last consonant?

Base consonant algorithm

Bengali Ya-Phalaa

Consider the Bengali "Ra, ZWJ, Halant, Ya" sequence, where the ZWJ is inserted immediately after the "Ra" to obtain the ya-phalaa (Unicode 11, page 472).

Based on our spec, we exclude "Ra" from being considered for the base consonant and make "Ya" our base, which is incorrect as "Ra" should be the base, with "Ya" taking on post-base form.

Should we instead modify the first step of our algorithm so it says:

If the syllable starts with a "Ra, Halant" sequence and the syllable contains more than one consonant, exclude the starting "Ra" from the list of consonants to be considered.
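
A minimal sketch of the proposed first step, to make the difference concrete. The string labels, the `CONSONANTS` set, and the function name `base_candidates` are hypothetical and only illustrate the wording above; they are not taken from the spec or from any shaping engine.

```python
# Hypothetical labels standing in for shaping classes.
CONSONANTS = {"Ka", "Ra", "Ya", "Ba"}

def base_candidates(syllable):
    """Return the indices of consonants still eligible to be the base.

    Proposed rule: the starting "Ra" is excluded only if the syllable
    starts with "Ra, Halant" AND contains more than one consonant.
    """
    consonant_idx = [i for i, c in enumerate(syllable) if c in CONSONANTS]
    starts_with_ra_halant = syllable[:2] == ["Ra", "Halant"]
    if starts_with_ra_halant and len(consonant_idx) > 1:
        return consonant_idx[1:]
    return consonant_idx

# "Ra, ZWJ, Halant, Ya": the ZWJ breaks up "Ra, Halant", so "Ra" stays eligible.
print(base_candidates(["Ra", "ZWJ", "Halant", "Ya"]))
# "Ra, Halant, Ka": "Ra" is excluded as a potential reph.
print(base_candidates(["Ra", "Halant", "Ka"]))
```

With this wording, the ya-phalaa sequence keeps "Ra" in the candidate list; whether "Ra" then wins the base search still depends on the later steps of the algorithm.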

Half Forms

Consider the Bengali "Ka, Halant, ZWJ, Ya" sequence. We skip "Ya" because it has a post-base form and make "Ka" the base consonant, even though the sequence "Ka, Halant, ZWJ" produces a half-form.

The algorithm should contain another condition where we terminate the base consonant search on coming across a "Halant, ZWJ" sequence.
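
One possible reading of that extra condition, as a rough sketch. It assumes the search walks from the end of the syllable towards the start, and that when the search terminates early the consonant visited most recently becomes the base; the issue does not spell out that second assumption. All names (`find_base`, `POSTBASE_FORMS`, the `stop_at_halant_zwj` flag) are hypothetical.

```python
# Hypothetical labels; only "Ya" is given a post-base form for this example.
CONSONANTS = {"Ka", "Ya"}
POSTBASE_FORMS = {"Ya"}

def find_base(syllable, stop_at_halant_zwj=True):
    """Walk backwards and return the index of the base consonant.

    Consonants with a post-base form are skipped. If the flag is set,
    the search ends when it reaches a ZWJ that follows a Halant, and the
    most recently visited consonant is kept as the base (an assumption).
    """
    base = None
    for i in range(len(syllable) - 1, -1, -1):
        item = syllable[i]
        if (stop_at_halant_zwj and item == "ZWJ"
                and i > 0 and syllable[i - 1] == "Halant"):
            break
        if item in CONSONANTS:
            base = i
            if item not in POSTBASE_FORMS:
                break
    return base

seq = ["Ka", "Halant", "ZWJ", "Ya"]
print(find_base(seq, stop_at_halant_zwj=False))  # current behaviour: "Ka"
print(find_base(seq, stop_at_halant_zwj=True))   # with the extra condition: "Ya"
```

Under these assumptions "Ka" is left free to take its half form, and a plain "Ka, Halant, Ya" sequence is unaffected by the new condition.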

Shaping Indic non-consonant syllables

I don't believe the spec covers how to shape vowel/standalone/broken/symbol syllables.

What HarfBuzz appears to do is to:

  • treat independent vowels, placeholders, and dotted circles as consonants, and
  • insert a dotted circle after a possible reph in a broken syllable so it can be treated like a standalone syllable.

This allows HarfBuzz to run the consonant syllable shaping logic on vowel, standalone, and broken syllables. (There is some additional logic with the initial reordering of standalone clusters to conform to Uniscribe behaviour).

For symbol syllables, HarfBuzz appears to skip the initial reordering step but not the subsequent ones. I'm not sure about this one.

Tagging Bengali post-base consonants

Hi @n8willis,

I've got a couple of questions about post-base consonant tagging:

  1. Section 2.7 in the Bengali spec mentions that any non-base consonants that occur after a matra should be tagged with POS_POSTBASE_CONSONANT. HarfBuzz appears to tag them with (their version of) POS_FINAL_CONSONANT instead, plus there is a comment mentioning that this only occurs in Sinhala. Highlighted HarfBuzz code here. Are we taking a different approach here? (The syllables we scraped from Wikipedia contain a fair number of "Ya", "Ba" and "Ra" consonants that occur after the base consonant but do not occur after a matra, thus leaving them untagged).

  2. The same section mentions that Bengali "includes one post-base consonant" ("Ya"), but Section 1 contradicts that by saying "three consonants in Bengali are allowed to occur in post-base position: "Ya", "Ba", and "Ra"." Is the statement in Section 1 the correct one? These same scraped syllables imply that it is.

"Consonant, Matra, Halant"

Our state machine recognises a Consonant, Matra, Halant sequence as a valid consonant syllable.

Please forgive my ignorance here, but how should this sequence be interpreted?

  • The Consonant in this sequence is the base consonant, as it is the only consonant in the sequence.
  • The base consonant carries the syllable's vowel sound, which is provided by the Matra and not the base consonant's inherent vowel.
  • A Halant is meant to strip a consonant of its inherent vowel, but here it is placed after the Matra. What is the significance of the Halant?

Tagging Bengali below-base consonants

Consider the sequence "Ka, Halant, Ba, U" (U+0995, U+09CD, U+09AC, U+09C1). Based on our tagging rules:

  • "Ka" == POS_BASE_CONSONANT
  • "U" == POS_AFTER_SUBJOINED (as it is a below-base dependent vowel)

What should the tag for "Ba" be? It's a codepoint that has a below-base form, but is in post-base position.

If we tag it as POS_POSTBASE_CONSONANT (i.e. by position relative to the base) (related issue: #38), the initial reorder results in "Ka, U, Halant, Ba" (the dependent vowel moves in front of the consonant, as POS_AFTER_SUBJOINED < POS_POSTBASE_CONSONANT). This gives us:

[Screenshot of the rendered result]

If we tag it as POS_BELOWBASE_CONSONANT, we get the correct behaviour:

[Screenshot of the rendered result]

The Bengali spec doesn't currently cover the tagging of POS_BELOWBASE_CONSONANTs.
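
The two outcomes above can be reproduced with a small sketch of the initial reorder as a stable sort on positioning tags. The numeric values are hypothetical; only their relative order matters, and the ordering BELOWBASE < AFTER_SUBJOINED is inferred from the "correct behaviour" case. The sketch also assumes the "Halant" carries the same tag as the "Ba" that follows it, which is inferred from the reported "Ka, U, Halant, Ba" result rather than stated in the spec.

```python
from enum import IntEnum

class Pos(IntEnum):
    # Hypothetical values; only the relative order is meaningful here.
    BASE_CONSONANT = 1
    BELOWBASE_CONSONANT = 2
    AFTER_SUBJOINED = 3
    POSTBASE_CONSONANT = 4

def initial_reorder(tagged):
    """Stable-sort (name, tag) pairs by tag and return the names."""
    return [name for name, _ in sorted(tagged, key=lambda pair: pair[1])]

def tag_sequence(ba_tag):
    # Assumption: the Halant moves together with the following "Ba".
    return [
        ("Ka", Pos.BASE_CONSONANT),
        ("Halant", ba_tag),
        ("Ba", ba_tag),
        ("U", Pos.AFTER_SUBJOINED),
    ]

print(initial_reorder(tag_sequence(Pos.POSTBASE_CONSONANT)))
# ['Ka', 'U', 'Halant', 'Ba']
print(initial_reorder(tag_sequence(Pos.BELOWBASE_CONSONANT)))
# ['Ka', 'Halant', 'Ba', 'U']
```

Python's `sorted` is stable, so elements with equal tags keep their original relative order, which is what keeps "Halant" in front of "Ba" in both cases.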

[Indic] Final Reph reordering

There are a few things that have come up from our implementation of the final Reph reordering that I thought I'd capture here:

b. If the reph repositioning class is not after post-base: target position is after the first explicit halant glyph between the first post-reph consonant and last main consonant. If ZWJ or ZWNJ are following this halant, position is moved after it. If such position is found, this is the target position. Otherwise, proceed to the next step.
Note: in old-implementation fonts, where classifications were fixed in shaping engine, there was no case where reph position will be found on this step.

  • The quote above is step 2 in OpenType's final Reph reordering algorithm. Our spec lacks a similar step. HarfBuzz has it implemented, and it appears that CoreText does too. Should we also include it, or is there a good reason not to?
  • For scripts that incorporate the REPH_POS_BEFORE_POST characteristic:
    • There is no mention of the default/fallback Reph position being at the end of the syllable.
    • There is no mention of what to do in the event the Reph's final position is after a "matra, Halant" subsequence.
    • If no post-base consonants exist, our spec states that our "final Reph position is immediately before the first post-base matra, syllable modifier, or vedic sign." HarfBuzz and CoreText narrow that criterion to "...before the first post-base matra, syllable modifier, or vedic sign that has a reordering class after the intended Reph position." ***

*** What we do is synonymous with OpenType's final Reph reordering step 4, but HarfBuzz and CoreText skip straight to step 5. This makes sense to me, as the second sentence in step 4:

If no consonant is found, the target position should be before the first matra, syllable modifier sign or vedic sign.

is redundant when followed by step 5 anyway.
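
For reference, step (b) quoted at the top of this issue could be sketched roughly as follows. This is only an illustration of the quoted wording, not of any engine's actual code: glyphs are represented by hypothetical string labels, and the function name and parameters are made up for the example. The return value is the index of the glyph after which the Reph would be placed, or `None` when the step finds no position and the algorithm should move on.

```python
JOINERS = ("ZWJ", "ZWNJ")

def reph_target_after_halant(glyphs, first_post_reph, last_main):
    """Find the step (b) target position for the Reph.

    glyphs          -- list of hypothetical glyph labels for one syllable
    first_post_reph -- index of the first consonant after the Reph
    last_main       -- index of the last main consonant
    """
    # Look for the first explicit Halant strictly between the two consonants.
    for i in range(first_post_reph + 1, last_main):
        if glyphs[i] == "Halant":
            # If a joiner follows the Halant, the position moves after it.
            if i + 1 < len(glyphs) and glyphs[i + 1] in JOINERS:
                return i + 1
            return i
    return None  # no explicit Halant found; proceed to the next step

print(reph_target_after_halant(["Reph", "Ka", "Halant", "ZWNJ", "Ga"], 1, 4))
print(reph_target_after_halant(["Reph", "Ka", "Halant", "Ga"], 1, 3))
print(reph_target_after_halant(["Reph", "Ka", "Aa"], 1, 1))
```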

[Gujarati] Invalid cluster example

Hello! This is a great resource, thanks and congratulations!

The following cluster would never occur:

[Screenshot of the cluster in question]

You can use the following instead (text is ર્હ ):

[Two screenshots of the suggested cluster]

Best,
Kalapi

[Tibetan] Tibetan character tables document

The paragraph at the end of this page, just under the Miscellaneous Character Table, references shaping using Halant (and blocking this using ZWJ). It appears to be in error with regard to the Tibetan script: the Halant character should not normally be used in Tibetan shaping, since Tibetan consonant conjuncts are formed using the set of explicit Subjoined Consonants (U+0F90-U+0FBC) without any need for Halant (called Virama in the Tibetan block).
In Tibetan, all occurrences of the Halant / VIRAMA character (U+0F84) would normally be displayed with the glyph for that character.

The same applies to the whole next paragraph: "Note, however, that the "consonant,Halant" subsequence in the above example may still trigger a half-forms feature....." Again, this paragraph doesn't make sense for Tibetan, since Halant (VIRAMA) should never be needed to trigger a half-forms feature in Tibetan (there are explicit Subjoined Consonants in the Tibetan encoding instead).

The same applies to the paragraph after that, which says that the zero-width joiner is used to prevent the formation of "Reph". In Tibetan, RA followed by any subjoined consonant normally shapes as "ra-go", an abbreviated form of RA (there are a few exceptions, which differ depending on the particular style of Tibetan script). To prevent this shaping behavior in Tibetan, U+0F6A "Fixed form Ra" is normally used. In Tibetan there is no "Reph".

Tamil Visarga

Currently Tamil U+0B83 is given shaping class MODIFYING_LETTER, should it be VISARGA?

Wikipedia issues

To make progress on Indic shaping, we've assembled a corpus of words and syllables by scraping Wikipedia for the ten Indic languages we plan to support (hi.wikipedia.org, bn.wikipedia.org, etc.).

That has given us 22803 unique syllables for Hindi, 10404 for Bengali, and so on, which we can use as test cases for shaping.

The code for this is located at https://github.com/yeslogic/corpus

However we have found some oddities in the Wikipedia text, such as the use of many Indic codepoints that are officially unassigned:

Bengali:

\u{9b1}
\u{9b3}
\u{9c9}
\u{9e4}
\u{9e5}

Gurmukhi:

\u{a0b}
\u{a0c}
\u{a11}
\u{a37}
\u{a3b}
\u{a3d}
\u{a43}
\u{a52}
\u{a53}
\u{a54}
\u{a58}
\u{a5f}
\u{a60}
\u{a61}
\u{a64}

Gujarati:

\u{a92}
\u{aa9}
\u{ad8}
\u{add}
\u{ae4}
\u{ae5}
\u{af3}
\u{af5}

Oriya:

\u{b34}
\u{b49}
\u{b54}
\u{b58}
\u{b5a}
\u{b5b}
\u{b5e}
\u{b64}
\u{b65}

Tamil:

\u{b8b}
\u{b96}
\u{b97}
\u{b98}
\u{b9b}
\u{b9d}
\u{ba0}
\u{ba1}
\u{ba2}
\u{ba5}
\u{ba6}
\u{ba7}
\u{bab}
\u{bac}
\u{bad}
\u{bbc}
\u{bc9}
\u{be0}

Telugu:

\u{c50}
\u{c5b}
\u{c64}

Kannada:

\u{cbb}
\u{cc9}
\u{cf5}

Malayalam:

\u{d49}

Sinhala:

\u{d80}
\u{d81}
\u{d84}
\u{d97}
\u{d98}
\u{d99}
\u{db2}
\u{dbc}
\u{dbe}
\u{dbf}
\u{dc7}
\u{dc8}
\u{dc9}
\u{dcb}
\u{dcc}
\u{dcd}
\u{dce}
\u{dd5}
\u{dd7}
\u{de0}
\u{de1}
\u{de2}
\u{de3}
\u{de4}
\u{de5}
\u{df0}
\u{df1}
\u{df5}
\u{df6}
\u{df7}
\u{df8}
\u{df9}
\u{dfa}
\u{dfb}
\u{dfc}
\u{dfd}
\u{dfe}
\u{dff}
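
A quick way to double-check lists like these is to ask a Unicode database for each codepoint's general category; `Cn` means unassigned. The sketch below uses Python's standard `unicodedata` module rather than the corpus code linked above, and the helper name `unassigned` is made up for the example. Results depend on the Unicode version bundled with the interpreter (see `unicodedata.unidata_version`), so a codepoint assigned in a newer Unicode release may be reported differently by different Python versions.

```python
import unicodedata

def unassigned(codepoints):
    """Return the codepoints whose general category is Cn (unassigned)."""
    return [cp for cp in codepoints if unicodedata.category(chr(cp)) == "Cn"]

# U+0995 is BENGALI LETTER KA; U+09B1 is a reserved slot in the Bengali block.
sample = [0x0995, 0x09B1]
print([f"U+{cp:04X}" for cp in unassigned(sample)])
print("Unicode data version:", unicodedata.unidata_version)
```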
