n8willis / opentype-shaping-documents
Documentation of OpenType shaping behavior
Just noticed that our character tables haven't yet been updated to include new characters introduced in Unicode 11. These should need updating:
Gurmukhi doesn't have a "Ssa" consonant, so the "Ka, Halant, Ssa" sequence in section 3.3 shouldn't be possible.
Steps 2-4 in section 2.8 state the following:
(2) All remaining marks must be tagged with the same positioning tag as the closest non-mark character the mark has affinity with, so that they move together during the sorting step.
(3) For all marks preceding the base consonant, the mark must be tagged with the same positioning tag as the closest preceding non-mark consonant.
(4) For all marks occurring after the base consonant, the mark must be tagged with the same positioning tag as the closest subsequent consonant.
Does (2) effectively cover (3)?
Also, in (3), what is a "non-mark consonant"?
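To make the question concrete, here is a minimal sketch of a literal reading of steps (3) and (4). The category names, the list-based representation, and the `tag_marks` helper are all hypothetical, and `POS_PREBASE_CONSONANT` is used purely for illustration. It is not taken from the spec or from any shaping engine.

```python
from typing import List, Optional

CONSONANT = "CONSONANT"  # hypothetical category label
MARK = "MARK"            # hypothetical category label


def tag_marks(categories: List[str],
              tags: List[Optional[str]],
              base: int) -> List[Optional[str]]:
    """Copy positioning tags onto marks, reading steps (3) and (4) literally.

    Marks before the base take the tag of the closest preceding consonant;
    marks after the base take the tag of the closest subsequent consonant.
    A mark with no consonant in the searched direction keeps its old tag
    (an assumption of this sketch; the quoted steps do not say).
    """
    out = list(tags)
    for i, cat in enumerate(categories):
        if cat != MARK:
            continue
        if i < base:
            search = range(i - 1, -1, -1)           # step (3): look backwards
        else:
            search = range(i + 1, len(categories))  # step (4): look forwards
        for j in search:
            if categories[j] == CONSONANT:
                out[i] = tags[j]
                break
    return out


categories = [CONSONANT, MARK, CONSONANT, MARK, CONSONANT]
tags = ["POS_PREBASE_CONSONANT", None, "POS_BASE_CONSONANT",
        None, "POS_POSTBASE_CONSONANT"]
result = tag_marks(categories, tags, base=2)
```

In this reading the two steps differ only in search direction, which is why it matters whether step (2) is meant to subsume them.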
Could be emulated with another mark-positioning feature, given that mset usage is discouraged.
Our state machine recognises a Consonant, Halant, ZWJ sequence as a valid consonant syllable.
Is there such a thing as a consonant syllable that exists in half form?
Our spec states that it's only a Consonant, Halant, ZWJ, Consonant sequence that should receive the half form treatment.
It is not clear from the MS OpenType shaping site whether init is defined for Gurmukhi.
Can the BASE_POS_LAST algorithm be described for Indic in general, or does it actually differ for each script?
Also does the base consonant always have shaping class "CONSONANT" and not "CONSONANT_DEAD"? (There is ambiguity due to a reference to consonants having this shaping class in 2.7: Post-base consonants).
For Sinhala, can the first consonant be preceded by a ZWJ and still be the last consonant?
Consider the Bengali "Ra, ZWJ, Halant, Ya" sequence, where the ZWJ is inserted immediately after the "Ra" to obtain the ya-phalaa (Unicode 11, page 472).
Based on our spec, we exclude "Ra" from being considered for the base consonant and make "Ya" our base, which is incorrect as "Ra" should be the base, with "Ya" taking on post-base form.
Should we instead modify the first step of our algorithm so it says:
If the syllable starts with a "Ra, Halant" sequence and the syllable contains more than one consonant, exclude the starting "Ra" from the list of consonants to be considered.
Consider the Bengali "Ka, Halant, ZWJ, Ya" sequence. We skip "Ya" because it has a post-base form and make "Ka" the base consonant, even though the sequence "Ka, Halant, ZWJ" produces a half-form.
The algorithm should contain another condition where we terminate the base consonant search on coming across a "Halant, ZWJ" sequence.
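The two proposed changes can be sketched together as follows. This is only an illustration of the proposal in these issues, not the spec's or HarfBuzz's algorithm. The glyph names, the `CONSONANTS` and `POST_BASE_FORMS` sets, and the `find_base` helper are hypothetical. As a simplification, a consonant counts as "having a post-base form" purely by set membership.

```python
from typing import List, Optional

RA, HALANT, ZWJ = "Ra", "Halant", "ZWJ"
CONSONANTS = {"Ka", "Ra", "Ya", "Ba"}  # hypothetical, just for the examples
POST_BASE_FORMS = {"Ya"}               # assumption: only Ya, to keep it small


def find_base(syllable: List[str]) -> Optional[int]:
    """Search backwards for the base consonant (BASE_POS_LAST style)."""
    consonants = [i for i, g in enumerate(syllable) if g in CONSONANTS]
    if not consonants:
        return None

    # Proposed step 1: exclude a starting Ra only when it is immediately
    # followed by Halant and the syllable has more than one consonant.
    start = 0
    if syllable[:2] == [RA, HALANT] and len(consonants) > 1:
        start = 2

    base = None
    for i in range(len(syllable) - 1, start - 1, -1):
        glyph = syllable[i]
        # Proposed extra condition: a "Halant, ZWJ" sequence ends the search,
        # because the consonant before it takes a half form.
        if glyph == ZWJ and i > 0 and syllable[i - 1] == HALANT:
            break
        if glyph in CONSONANTS:
            base = i  # best candidate so far
            if glyph not in POST_BASE_FORMS:
                break
    return base


ya_phalaa = find_base(["Ra", "ZWJ", "Halant", "Ya"])
half_form = find_base(["Ka", "Halant", "ZWJ", "Ya"])
reph = find_base(["Ra", "Halant", "Ka"])
```

Under these assumptions the ya-phalaa sequence keeps "Ra" (index 0) as its base, while "Ka, Halant, ZWJ, Ya" stops at the "Halant, ZWJ" pair and keeps "Ya" (index 3).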
"Fourth, any subsequences of adjacent marks ("Halant"s, "Nukta"s, syllable modifiers, and Vedic signs) must be reordered so that they appear in canonical order."
that is to say, Unicode canonical order?
Note that the relevant glyph (below-base "La") exists in the Noto Sans font. Activating it is the issue.
I don't believe the spec covers how to shape vowel/standalone/broken/symbol syllables.
What HarfBuzz appears to do is to:
reph in a broken syllable so it can be treated like a standalone syllable.
This allows HarfBuzz to run the consonant syllable shaping logic on vowel, standalone, and broken syllables. (There is some additional logic with the initial reordering of standalone clusters to conform to Uniscribe behaviour).
For symbol syllables, they appear to skip the initial reordering step but not subsequent ones...? Not sure about this one.
Hi @n8willis,
I've got a couple of questions about post-base consonant tagging:
Section 2.7 in the Bengali spec mentions that any non-base consonants that occur after a matra should be tagged with POS_POSTBASE_CONSONANT. HarfBuzz appears to tag them with (their version of) POS_FINAL_CONSONANT instead, plus there is a comment mentioning that this only occurs in Sinhala. Highlighted HarfBuzz code here. Are we taking a different approach here? (The syllables we scraped from Wikipedia contain a fair number of "Ya", "Ba" and "Ra" consonants that occur after the base consonant but do not occur after a matra, thus leaving them untagged).
The same section mentions that Bengali "includes one post-base consonant" ("Ya"), but Section 1 contradicts that by saying "three consonants in Bengali are allowed to occur in post-base position: "Ya", "Ba", and "Ra"." Is the statement in Section 1 the correct one? These same scraped syllables imply that it is.
Our state machine recognises a Consonant, Matra, Halant sequence as a valid consonant syllable.
Please forgive my ignorance here, but how should this sequence be interpreted?
Consonant in this sequence is the base consonant, as it is the only consonant in the sequence.
Matra and not the base consonant's inherent vowel.
Halant is meant to strip a consonant of its inherent vowel, but here it is placed after the Matra. What is the significance of the Halant?
Consider the sequence "Ka, Halant, Ba, U" (U+0995, U+09CD, U+09AC, U+09C1). Based on our tagging rules:
"Ka" is tagged POS_BASE_CONSONANT
"U" is tagged POS_AFTER_SUBJOINED (as it is a below-base dependent vowel)
What should the tag for "Ba" be? It's a codepoint that has a below-base form, but is in post-base position.
If we tag it as POS_POSTBASE_CONSONANT (i.e. by position relative to the base) (related issue: #38), the initial reorder results in "Ka, U, Halant, Ba" (the dependent vowel moves in front of the consonant, as POS_AFTER_SUBJOINED < POS_POSTBASE_CONSONANT). This gives us:
If we tag it as POS_BELOWBASE_CONSONANT, we get the correct behaviour:
The Bengali spec doesn't currently cover the tagging of POS_BELOWBASE_CONSONANTs.
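The effect of the two tagging choices can be reproduced with a stable sort on the positioning tags. This is a sketch under stated assumptions, not the spec's reordering code. The numeric tag values are invented and only their relative order matters. The "Halant" is assumed to carry the same tag as the "Ba" that follows it. The placement of POS_BELOWBASE_CONSONANT before POS_AFTER_SUBJOINED is also an assumption, consistent with the behaviour described above.

```python
from typing import List, Tuple

# Hypothetical numeric values; only the relative order is meaningful.
POS_BASE_CONSONANT = 0
POS_BELOWBASE_CONSONANT = 1
POS_AFTER_SUBJOINED = 2
POS_POSTBASE_CONSONANT = 3


def initial_reorder(tagged: List[Tuple[str, int]]) -> List[str]:
    """Stable-sort glyph names by positioning tag (sorted() is stable)."""
    return [name for name, _ in sorted(tagged, key=lambda item: item[1])]


# "Ba" tagged by position relative to the base.
as_postbase = initial_reorder([
    ("Ka", POS_BASE_CONSONANT),
    ("Halant", POS_POSTBASE_CONSONANT),  # assumed to move with "Ba"
    ("Ba", POS_POSTBASE_CONSONANT),
    ("U", POS_AFTER_SUBJOINED),
])

# "Ba" tagged by the form it takes.
as_belowbase = initial_reorder([
    ("Ka", POS_BASE_CONSONANT),
    ("Halant", POS_BELOWBASE_CONSONANT),  # assumed to move with "Ba"
    ("Ba", POS_BELOWBASE_CONSONANT),
    ("U", POS_AFTER_SUBJOINED),
])
```

With the post-base tag the vowel sign jumps ahead of "Halant, Ba". With the below-base tag the original order is preserved.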
"All single-part matras can be tagged based on their Mark-positioning subclass."
Given that the mark placement subclass is left/right/top/bottom, how does one determine the appropriate sorting tag from this? (eg. POS_PREBASE_MATRA).
There are a few things that have come up from our implementation of the final Reph reordering that I thought I'd capture here:
b. If the reph repositioning class is not after post-base: target position is after the first explicit halant glyph between the first post-reph consonant and last main consonant. If ZWJ or ZWNJ are following this halant, position is moved after it. If such position is found, this is the target position. Otherwise, proceed to the next step.
Note: in old-implementation fonts, where classifications were fixed in shaping engine, there was no case where reph position will be found on this step.
REPH_POS_BEFORE_POST characteristic:
*** What we do is synonymous with OpenType's final Reph reordering step 4, but HarfBuzz and CoreText skip straight to step 5. This makes sense to me, as the second sentence in step 4:
If no consonant is found, the target position should be before the first matra, syllable modifier sign or vedic sign.
is redundant when followed by step 5 anyway.
The Indic general spec still references BASE_POS_FIRST, but no Indic scripts use this?
For Indic scripts that have the BLWF_MODE_PRE_AND_POST characteristic, HarfBuzz doesn't apply BLWF substitutions to pre-base consonants under the old shaping model, but our spec does not mention such a restriction.
The paragraph at the end of this page, just under the Miscellaneous Character Table, which references shaping using Halant (and blocking this using ZWJ), appears to be in error with regard to the Tibetan script. The Halant character should not normally be used in Tibetan shaping, since Tibetan consonant conjuncts are formed using the set of explicit Subjoined Consonants (U+0F90-U+0FBC) without any need of Halant (called Virama in Tibetan block).
In Tibetan all occurrences of the Halant / VIRAMA character (U+0F84) would normally be displayed with the glyph for that character.
The same applies to the whole next paragraph: "Note, however, that the "consonant,Halant" subsequence in the above example may still trigger a half-forms feature....." Again, this paragraph doesn't make sense for Tibetan, since Halant (VIRAMA) should never be needed to trigger a half-forms feature in Tibetan (there are those explicit Subjoined Consonants in the Tibetan encoding).
Similarly with the next paragraph, which says that the zero-width joiner is used to prevent the formation of "Reph". In Tibetan, RA followed by any subjoined consonant normally shapes as "ra-go" (an abbreviated form of RA in Tibetan), with a few exceptions that depend on the particular style of Tibetan script. To prevent this shaping behavior in Tibetan, U+0F6A "Fixed form Ra" is normally used. In Tibetan there is no "Reph".
Currently Tamil U+0B83 is given shaping class MODIFYING_LETTER. Should it be VISARGA?
To make progress on Indic shaping, we've assembled a corpus of words and syllables by scraping Wikipedia for the ten Indic languages we plan to support (hi.wikipedia.org, bn.wikipedia.org, etc.).
That has given us 22803 unique syllables for Hindi, 10404 for Bengali, and so on, which we can use as test cases for shaping.
The code for this is located at https://github.com/yeslogic/corpus
However we have found some oddities in the Wikipedia text, such as the use of many Indic codepoints that are officially unassigned:
Bengali:
\u{9b1}
\u{9b3}
\u{9c9}
\u{9e4}
\u{9e5}
Gurmukhi:
\u{a0b}
\u{a0c}
\u{a11}
\u{a37}
\u{a3b}
\u{a3d}
\u{a43}
\u{a52}
\u{a53}
\u{a54}
\u{a58}
\u{a5f}
\u{a60}
\u{a61}
\u{a64}
Gujarati:
\u{a92}
\u{aa9}
\u{ad8}
\u{add}
\u{ae4}
\u{ae5}
\u{af3}
\u{af5}
Oriya:
\u{b34}
\u{b49}
\u{b54}
\u{b58}
\u{b5a}
\u{b5b}
\u{b5e}
\u{b64}
\u{b65}
Tamil:
\u{b8b}
\u{b96}
\u{b97}
\u{b98}
\u{b9b}
\u{b9d}
\u{ba0}
\u{ba1}
\u{ba2}
\u{ba5}
\u{ba6}
\u{ba7}
\u{bab}
\u{bac}
\u{bad}
\u{bbc}
\u{bc9}
\u{be0}
Telugu:
\u{c50}
\u{c5b}
\u{c64}
Kannada:
\u{cbb}
\u{cc9}
\u{cf5}
Malayalam:
\u{d49}
Sinhala:
\u{d80}
\u{d81}
\u{d84}
\u{d97}
\u{d98}
\u{d99}
\u{db2}
\u{dbc}
\u{dbe}
\u{dbf}
\u{dc7}
\u{dc8}
\u{dc9}
\u{dcb}
\u{dcc}
\u{dcd}
\u{dce}
\u{dd5}
\u{dd7}
\u{de0}
\u{de1}
\u{de2}
\u{de3}
\u{de4}
\u{de5}
\u{df0}
\u{df1}
\u{df5}
\u{df6}
\u{df7}
\u{df8}
\u{df9}
\u{dfa}
\u{dfb}
\u{dfc}
\u{dfd}
\u{dfe}
\u{dff}
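One way to flag such codepoints in scraped text is to check the Unicode general category. The sketch below is not the code from the corpus repository; the `unassigned_codepoints` helper is hypothetical. The result depends on the Unicode version bundled with the Python interpreter (see `unicodedata.unidata_version`), so a codepoint assigned in a newer Unicode release may still be reported by an older interpreter.

```python
import unicodedata
from typing import List


def unassigned_codepoints(text: str) -> List[str]:
    """Return the distinct characters in `text` whose general category is
    Cn (unassigned), sorted by codepoint."""
    return sorted({ch for ch in text if unicodedata.category(ch) == "Cn"})


# Bengali Ka and Halant are assigned; U+09B1 and U+09B3 are the first two
# Bengali entries in the list above.
sample = "\u0995\u09b1\u09cd\u09b3"
found = [f"\\u{{{ord(ch):x}}}" for ch in unassigned_codepoints(sample)]
```

Formatting the hits in the same `\u{...}` notation as the list above makes it easy to compare the output against the corpus findings.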