n8willis / opentype-shaping-documents

Documentation of OpenType shaping behavior

opentype-shaping-documents's Issues

Update character tables with new Unicode 11 characters

Just noticed that our character tables haven't yet been updated to include the new characters introduced in Unicode 11. The following tables need updating:

Arabic (Extended-A)
Bengali
Devanagari (Extended)
Gurmukhi
Hebrew
Kannada
Mongolian
Nko
Telugu

Unicode's delta code charts

[Gurmukhi] Example image missing for `pres`

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/gurmukhi/gurmukhi-image-generation-log.md#5-pres

[Gurmukhi] `akhn` substitution feature

Gurmukhi doesn't have a "Ssa" consonant, so the "Ka, Halant, Ssa" sequence in section 3.3 shouldn't be possible.

Mark tagging

Steps 2-4 in section 2.8 state the following:

(2) All remaining marks must be tagged with the same positioning tag as the closest non-mark character the mark has affinity with, so that they move together during the sorting step.

(3) For all marks preceding the base consonant, the mark must be tagged with the same positioning tag as the closest preceding non-mark consonant.

(4) For all marks occurring after the base consonant, the mark must be tagged with the same positioning tag as the closest subsequent consonant.

Does (2) effectively cover (3)?

Also, in (3), what is a "non-mark consonant"?
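
One way to see how step (2) could subsume step (3) is to read "affinity" as "the closest preceding non-mark character". The sketch below implements only that reading; the tuple layout and the function name are hypothetical and are not taken from the spec.

```python
from typing import List, Optional, Tuple

# Hypothetical item layout: (name, is_mark, tag). A tag of None means "not yet tagged".
Item = Tuple[str, bool, Optional[str]]


def tag_marks_like_preceding(items: List[Item]) -> List[Item]:
    """Give every untagged mark the tag of the closest preceding non-mark.

    Assumption: "affinity" means "closest preceding non-mark character".
    Marks that were tagged in an earlier step are left alone.
    """
    out: List[Item] = []
    last_tag: Optional[str] = None
    for name, is_mark, tag in items:
        if is_mark and tag is None:
            tag = last_tag      # inherit, so the mark moves with its carrier
        elif not is_mark:
            last_tag = tag      # remember the most recent non-mark's tag
        out.append((name, is_mark, tag))
    return out


syllable = [
    ("Ka", False, "POS_BASE_CONSONANT"),
    ("Nukta", True, None),
    ("Ya", False, "POS_POSTBASE_CONSONANT"),
    ("Nukta", True, None),
]
tagged = tag_marks_like_preceding(syllable)
```

Under this reading, a mark before the base and a mark after the base are handled by the same rule, which is why (3) looks redundant.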

Could be emulated with another mark-positioning feature, given the fact that mset usage is discouraged.

[Tamil] Example image missing for `blws`

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/tamil/tamil-image-generation-log.md#5-blws

[Tamil] Example image missing for `blwm`

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/tamil/tamil-image-generation-log.md#5-blwm

Pre-base-reordering "Ra" in Telugu

As part of lengthier discussions in #32 and #41, it's been mentioned that (for Indic scripts at least) pre-base-reordering "Ra" only exists in Malayalam.

However, the Nirmala font has encoded it for Telugu too, and this OpenType entry mentions that Telugu "may display a pre-base form of "Ra"".

E.g. "Ga, Halant, Ra" using Nirmala:

"Consonant, Halant, ZWJ"

Our state machine recognises a Consonant, Halant, ZWJ sequence as a valid consonant syllable.

Is there such a thing as a consonant syllable that exists in half form?

Our spec states that it's only a Consonant, Halant, ZWJ, Consonant sequence that should receive the half form treatment.

[Oriya] Example image missing for `half`

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/oriya/oriya-image-generation-log.md#39-half

[Gurmukhi] Example image missing for `init`

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/gurmukhi/gurmukhi-image-generation-log.md#5-init

It is not clear from the MS OpenType shaping site whether `init` is defined for Gurmukhi.

[Gurmukhi] Example image missing for `vatu`

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/gurmukhi/gurmukhi-image-generation-log.md#311-vatu

Base Consonant Position

Can the BASE_POS_LAST algorithm be described for Indic in general, or does it actually differ for each script?

Also does the base consonant always have shaping class "CONSONANT" and not "CONSONANT_DEAD"? (There is ambiguity due to a reference to consonants having this shaping class in 2.7: Post-base consonants).

For Sinhala, can the first consonant be preceded by a ZWJ and still be the last consonant?

Base consonant algorithm

Bengali Ya-Phalaa

Consider the Bengali "Ra, ZWJ, Halant, Ya" sequence, where the ZWJ is inserted immediately after the "Ra" to obtain the ya-phalaa (Unicode 11, page 472).

Based on our spec, we exclude "Ra" from being considered for the base consonant and make "Ya" our base, which is incorrect as "Ra" should be the base, with "Ya" taking on post-base form.

Should we instead modify the first step of our algorithm so it says:

If the syllable starts with a "Ra, Halant" sequence and the syllable contains more than one consonant, exclude the starting "Ra" from the list of consonants to be considered.
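
The proposed wording can be sketched as follows. Character names stand in for code points, and the function name and arguments are hypothetical; this only illustrates the suggested first step, not the rest of the base consonant search.

```python
from typing import List, Set


def base_candidates(syllable: List[str], consonants: Set[str]) -> List[int]:
    """Return the indices of consonants that may be considered for the base.

    Proposed rule: drop the starting "Ra" only when the syllable literally
    starts with "Ra, Halant" and contains more than one consonant.
    """
    indices = [i for i, name in enumerate(syllable) if name in consonants]
    starts_with_ra_halant = (
        len(syllable) >= 2 and syllable[0] == "Ra" and syllable[1] == "Halant"
    )
    if starts_with_ra_halant and len(indices) > 1:
        return indices[1:]  # the starting "Ra" is excluded (it may form a Reph)
    return indices


CONSONANTS = {"Ra", "Ya", "Ka"}

# "Ra, ZWJ, Halant, Ya": the ZWJ breaks the "Ra, Halant" prefix, so "Ra" stays eligible.
ya_phalaa = base_candidates(["Ra", "ZWJ", "Halant", "Ya"], CONSONANTS)
# "Ra, Halant, Ka": the starting "Ra" is excluded.
reph_case = base_candidates(["Ra", "Halant", "Ka"], CONSONANTS)
```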

Half Forms

Consider the Bengali "Ka, Halant, ZWJ, Ya" sequence. We skip "Ya" because it has a post-base form and make "Ka" the base consonant, even though the sequence "Ka, Halant, ZWJ" produces a half-form.

The algorithm should contain another condition where we terminate the base consonant search on coming across a "Halant, ZWJ" sequence.
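
A minimal sketch of a backward base search with that extra termination condition is shown below. It is deliberately simplified: names stand in for code points, `has_post_or_below_form` is a hypothetical set supplied by the caller, and the other checks a real implementation would make are omitted.

```python
from typing import List, Optional, Set


def find_base_last(
    syllable: List[str],
    consonants: Set[str],
    has_post_or_below_form: Set[str],
) -> Optional[int]:
    """Search backward for the base consonant.

    Returns the index of the base, or None if the search ends before any
    consonant has been seen.
    """
    base: Optional[int] = None
    i = len(syllable) - 1
    while i >= 0:
        name = syllable[i]
        if name in consonants:
            base = i
            if name not in has_post_or_below_form:
                break  # a consonant with no post-base/below-base form is the base
        elif name == "ZWJ" and i > 0 and syllable[i - 1] == "Halant":
            break  # proposed condition: "Halant, ZWJ" requests a half form; stop here
        i -= 1
    return base


CONSONANTS = {"Ka", "Ya"}
POST_FORMS = {"Ya"}

with_zwj = find_base_last(["Ka", "Halant", "ZWJ", "Ya"], CONSONANTS, POST_FORMS)
without_zwj = find_base_last(["Ka", "Halant", "Ya"], CONSONANTS, POST_FORMS)
```

With the ZWJ present the search stops at "Ya", leaving "Ka, Halant, ZWJ" in front of the base where it can take a half form; without it, "Ka" remains the base.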

[Kannada] Example image missing for `abvm`

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/kannada/kannada-image-generation-log.md#6-abvm

[Kannada] Example images for `psts` may not be appropriate

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/kannada/kannada-image-generation-log.md#5-psts

[Gurmukhi] Example image missing for `psts`

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/gurmukhi/gurmukhi-image-generation-log.md#5-psts

Adjacent marks

https://github.com/n8willis/opentype-shaping-documents/blob/master/opentype-shaping-indic-general.md#24-adjacent-marks

"Fourth, any subsequences of adjacent marks ("Halant"s, "Nukta"s, syllable modifiers, and Vedic signs) must be reordered so that they appear in canonical order."

That is to say, Unicode canonical order?
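
If Unicode canonical ordering is what is meant, it amounts to a stable sort of each run of marks by canonical combining class (ccc). The sketch below assumes that interpretation and uses Python's standard `unicodedata` module; it reorders without decomposing anything, unlike a full NFD normalization.

```python
import unicodedata


def reorder_marks(text: str) -> str:
    """Stable-sort each run of characters with non-zero ccc by their ccc."""
    out = []
    run = []
    for ch in text:
        if unicodedata.combining(ch) > 0:
            run.append(ch)
        else:
            # A character with ccc 0 ends the current run of reorderable marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


# Devanagari "Ka, Halant, Nukta": Nukta (ccc 7) sorts before Halant (ccc 9).
reordered = reorder_marks("\u0915\u094D\u093C")
```

Note that marks with ccc 0, such as many syllable modifiers, are never moved by this procedure, so the answer affects which of the listed mark types can actually be reordered.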

[Malayalam] Example image missing for `blws`

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/malayalam/malayalam-image-generation-log.md#5-blws

Note that the relevant glyph (below-base "La") exists in the Noto Sans font. Activating it is the issue.

Shaping Indic non-consonant syllables

I don't believe the spec covers how to shape vowel/standalone/broken/symbol syllables.

What HarfBuzz appears to do is to:

treat independent vowels, placeholders, and dotted circles as consonants, and
insert a dotted circle after a possible reph in a broken syllable so it can be treated like a standalone syllable.

This allows HarfBuzz to run the consonant syllable shaping logic on vowel, standalone, and broken syllables. (There is some additional logic with the initial reordering of standalone clusters to conform to Uniscribe behaviour).

For symbol syllables, they appear to skip the initial reordering step but not subsequent ones...? Not sure about this one.
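
The dotted-circle insertion described above could be sketched like this. It is an illustration of the described behaviour only, not HarfBuzz's code: the function name is hypothetical, and a "possible reph" is simplified to a literal "Ra, Halant" prefix.

```python
DOTTED_CIRCLE = "\u25CC"


def insert_dotted_circle(broken: str, ra: str, halant: str) -> str:
    """Insert a dotted circle into a broken syllable, after a leading "Ra, Halant" if present.

    Assumption: the caller has already classified `broken` as a broken syllable.
    """
    has_possible_reph = broken.startswith(ra + halant)
    at = 2 if has_possible_reph else 0
    return broken[:at] + DOTTED_CIRCLE + broken[at:]


DEVA_RA, DEVA_HALANT, DEVA_AA = "\u0930", "\u094D", "\u093E"

lone_matra = insert_dotted_circle(DEVA_AA, DEVA_RA, DEVA_HALANT)
after_reph = insert_dotted_circle(DEVA_RA + DEVA_HALANT + DEVA_AA, DEVA_RA, DEVA_HALANT)
```

After the insertion, the dotted circle can play the role of the base, so the consonant syllable logic has something to attach marks to.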

[Gurmukhi] Example image missing for Reph positioning

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/gurmukhi/gurmukhi-image-generation-log.md#43-reph-position

Tagging Bengali post-base consonants

Hi @n8willis,

I've got a couple of questions about post-base consonant tagging:

Section 2.7 in the Bengali spec mentions that any non-base consonants that occur after a matra should be tagged with POS_POSTBASE_CONSONANT. HarfBuzz appears to tag them with (their version of) POS_FINAL_CONSONANT instead, plus there is a comment mentioning that this only occurs in Sinhala. Highlighted HarfBuzz code here. Are we taking a different approach here? (The syllables we scraped from Wikipedia contain a fair number of "Ya", "Ba" and "Ra" consonants that occur after the base consonant but do not occur after a matra, thus leaving them untagged).
The same section mentions that Bengali "includes one post-base consonant" ("Ya"), but Section 1 contradicts that by saying "three consonants in Bengali are allowed to occur in post-base position: "Ya", "Ba", and "Ra"." Is the statement in Section 1 the correct one? These same scraped syllables imply that it is.

[Oriya] Example image for `cjct` may be unclear

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/oriya/oriya-image-generation-log.md#312-cjct

[Kannada] Example images for `abvs` may not be appropriate

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/kannada/kannada-image-generation-log.md#5-abvs

"Consonant, Matra, Halant"

Our state machine recognises a Consonant, Matra, Halant sequence as a valid consonant syllable.

Please forgive my ignorance here, but how should this sequence be interpreted?

The Consonant in this sequence is the base consonant, as it is the only consonant in the sequence.
The base consonant carries the syllable's vowel sound, which is provided by the Matra and not the base consonant's inherent vowel.
A Halant is meant to strip a consonant of its inherent vowel, but here it is placed after the Matra. What is the significance of the Halant?

Tagging Bengali below-base consonants

Consider the sequence "Ka, Halant, Ba, U" (U+0995, U+09CD, U+09AC, U+09C1). Based on our tagging rules:

"Ka" == POS_BASE_CONSONANT
"U" == POS_AFTER_SUBJOINED (as it is a below-base dependent vowel)

What should the tag for "Ba" be? It's a codepoint that has a below-base form, but is in post-base position.

If we tag it as POS_POSTBASE_CONSONANT (i.e. by position relative to the base) (related issue: #38), the initial reorder results in "Ka, U, Halant, Ba" (the dependent vowel moves in front of the consonant, as POS_AFTER_SUBJOINED < POS_POSTBASE_CONSONANT). This gives us:

If we tag it as POS_BELOWBASE_CONSONANT, we get the correct behaviour:

The Bengali spec doesn't currently cover the tagging of POS_BELOWBASE_CONSONANTs.
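
The two outcomes can be reproduced with a stable sort over position tags. The relative order of the tags below is an assumption: POS_AFTER_SUBJOINED < POS_POSTBASE_CONSONANT is stated above, while placing POS_BELOWBASE_CONSONANT before POS_AFTER_SUBJOINED is inferred from the "correct behaviour" result. The sketch also assumes the "Halant" carries the same tag as the following "Ba".

```python
# Assumed relative order of the position tags used in this example.
ORDER = [
    "POS_BASE_CONSONANT",
    "POS_BELOWBASE_CONSONANT",
    "POS_AFTER_SUBJOINED",
    "POS_POSTBASE_CONSONANT",
]
RANK = {tag: i for i, tag in enumerate(ORDER)}


def tag_syllable(ba_tag):
    """Tag "Ka, Halant, Ba, U", with "Halant" and "Ba" sharing ba_tag (assumption)."""
    return [
        ("Ka", "POS_BASE_CONSONANT"),
        ("Halant", ba_tag),
        ("Ba", ba_tag),
        ("U", "POS_AFTER_SUBJOINED"),
    ]


def initial_reorder(tagged):
    """Stable sort by tag rank; items with equal tags keep their relative order."""
    return [name for name, tag in sorted(tagged, key=lambda item: RANK[item[1]])]


as_postbase = initial_reorder(tag_syllable("POS_POSTBASE_CONSONANT"))
as_belowbase = initial_reorder(tag_syllable("POS_BELOWBASE_CONSONANT"))
```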

[Telugu] Example image missing for `cjct`

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/telugu/telugu-image-generation-log.md#312-cjct

[Tamil] Example image missing for `nukt`

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/tamil/tamil-image-generation-log.md#32-nukt

[Tamil] Reph shaping characteristic

Should it be REPH_POS_AFTER_POST as written in an earlier section here, instead of REPH_POS_BEFORE_POST as written here?

Tag decomposed matras

https://github.com/n8willis/opentype-shaping-documents/blob/master/opentype-shaping-indic-general.md#23-tag-decomposed-matras

"All single-part matras can be tagged based on their Mark-positioning subclass."

Given that the mark placement subclass is left/right/top/bottom, how does one determine the appropriate sorting tag from this (e.g. POS_PREBASE_MATRA)?

[Malayalam] Example images used for `blwf` may not be appropriate

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/malayalam/malayalam-image-generation-log.md#37-blwf

[Telugu] Example image missing for `pstf`

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/telugu/telugu-image-generation-log.md#310-pstf

[Telugu] Example image missing for `abvm`

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/telugu/telugu-image-generation-log.md#abvm

[Telugu] Example image missing for Reph positioning

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/telugu/telugu-image-generation-log.md#43-reph-position

[Indic] Final Reph reordering

There are a few things that have come up from our implementation of the final Reph reordering that I thought I'd capture here:

b. If the reph repositioning class is not after post-base: target position is after the first explicit halant glyph between the first post-reph consonant and last main consonant. If ZWJ or ZWNJ are following this halant, position is moved after it. If such position is found, this is the target position. Otherwise, proceed to the next step.
Note: in old-implementation fonts, where classifications were fixed in shaping engine, there was no case where reph position will be found on this step.

The quote above is step 2 in OpenType's final Reph reordering algorithm. Our spec lacks a similar step. HarfBuzz has it implemented, and it appears that CoreText does too. Should we also include it, or is there a good reason not to?
For scripts that incorporate the REPH_POS_BEFORE_POST characteristic:
- There is no mention of the default/fallback Reph position being at the end of the syllable.
- There is no mention of what to do in the event the Reph's final position is after a "matra, Halant" subsequence.
- If no post-base consonants exist, our spec states that our "final Reph position is immediately before the first post-base matra, syllable modifier, or vedic sign." HarfBuzz and CoreText extend that criterion to "...before the first post-base matra, syllable modifier, or vedic sign that has a reordering class after the intended Reph position." ***

*** What we do is synonymous with OpenType's final Reph reordering step 4, but HarfBuzz and CoreText skip straight to step 5. This makes sense to me, as the second sentence in step 4:

If no consonant is found, the target position should be before the first matra, syllable modifier sign or vedic sign.

is redundant when followed by step 5 anyway.
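
For reference, the quoted step b could be sketched as below. This is a reading of the quoted text only, under stated assumptions: glyphs are represented by hypothetical class names, `first` is the index of the first glyph after the Reph, `base` is the index of the main consonant, and "between" is taken to exclude the base itself.

```python
from typing import List, Optional


def reph_target_after_halant(glyphs: List[str], first: int, base: int) -> Optional[int]:
    """Return the index the Reph should be placed after, or None if step b finds nothing."""
    for i in range(first, base):
        if glyphs[i] == "Halant":  # an explicit halant glyph that survived substitution
            # If a ZWJ or ZWNJ follows the halant, the target moves after the joiner.
            if i + 1 < base and glyphs[i + 1] in ("ZWJ", "ZWNJ"):
                return i + 1
            return i
    return None  # no target found; proceed to the next step


with_joiner = reph_target_after_halant(["Reph", "Ka", "Halant", "ZWJ", "Ta"], 1, 4)
plain_halant = reph_target_after_halant(["Reph", "Ka", "Halant", "Ta"], 1, 3)
no_halant = reph_target_after_halant(["Reph", "Ka"], 1, 1)
```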

[Telugu] Example image missing for JNya akhand form

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/telugu/telugu-image-generation-log.md#jnya

BASE_POS_FIRST

https://github.com/n8willis/opentype-shaping-documents/blob/master/opentype-shaping-indic-general.md#21-base-consonant

The Indic general spec still references BASE_POS_FIRST, but no Indic scripts use this?

[Gurmukhi] Example image missing for `cjct`

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/gurmukhi/gurmukhi-image-generation-log.md#312-cjct

BLWF substitutions in Indic scripts with BLWF_MODE_PRE_AND_POST

For Indic scripts that have the BLWF_MODE_PRE_AND_POST characteristic, HarfBuzz doesn't apply BLWF substitutions to pre-base consonants under the old shaping model, but our spec does not mention such a restriction.

[Arabic] Example image missing for `rclt`

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/arabic/arabic-image-generation-log.md#410-rclt

[Gurmukhi] Example image missing for `rphf`

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/gurmukhi/gurmukhi-image-generation-log.md#34-rphf

[Gujarati] Invalid cluster example

Hello! This is a great resource, thanks and congratulations!

The following cluster would never occur:

You can use the following instead (text is ર્હ ):

Best,
Kalapi

Tibetan character tables document

The paragraph at the end of this page, just under the Miscellaneous Character Table, which describes shaping with Halant (and blocking it with ZWJ), appears to be in error with regard to the Tibetan script. The Halant character should not normally be used in Tibetan shaping, since Tibetan consonant conjuncts are formed using the set of explicit Subjoined Consonants (U+0F90-U+0FBC) without any need for Halant (called Virama in the Tibetan block).
In Tibetan, all occurrences of the Halant / VIRAMA character (U+0F84) would normally be displayed with the glyph for that character.

The same applies to the whole next paragraph: "Note, however, that the "consonant,Halant" subsequence in the above example may still trigger a half-forms feature.....". Again, this paragraph doesn't make sense for Tibetan, since Halant (VIRAMA) should never be needed to trigger a half-forms feature in Tibetan (there are explicit Subjoined Consonants in the Tibetan encoding).

Similarly, the next paragraph says that the zero-width joiner is used to prevent the formation of "Reph". In Tibetan, RA followed by any subjoined consonant normally shapes as "ra-go", an abbreviated form of RA (with a few exceptions that depend on the particular style of Tibetan script). To prevent this shaping behavior in Tibetan, U+0F6A "Fixed form Ra" is normally used. In Tibetan there is no "Reph".

Tamil Visarga

Currently Tamil U+0B83 is given shaping class MODIFYING_LETTER, should it be VISARGA?

[Kannada] Example images for `blws` may not be appropriate

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/kannada/kannada-image-generation-log.md#5-blws

[Gurmukhi] Example image missing for `half`

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/gurmukhi/gurmukhi-image-generation-log.md#39-half

Wikipedia issues

To make progress on Indic shaping, we've assembled a corpus of words and syllables by scraping Wikipedia for the ten Indic languages we plan to support (hi.wikipedia.org, bn.wikipedia.org, etc.).

That has given us 22803 unique syllables for Hindi, 10404 for Bengali, and so on, which we can use as test cases for shaping.

The code for this is located at https://github.com/yeslogic/corpus

However we have found some oddities in the Wikipedia text, such as the use of many Indic codepoints that are officially unassigned:

Bengali:

\u{9b1}
\u{9b3}
\u{9c9}
\u{9e4}
\u{9e5}

Gurmukhi:

\u{a0b}
\u{a0c}
\u{a11}
\u{a37}
\u{a3b}
\u{a3d}
\u{a43}
\u{a52}
\u{a53}
\u{a54}
\u{a58}
\u{a5f}
\u{a60}
\u{a61}
\u{a64}

Gujarati:

\u{a92}
\u{aa9}
\u{ad8}
\u{add}
\u{ae4}
\u{ae5}
\u{af3}
\u{af5}

Oriya:

\u{b34}
\u{b49}
\u{b54}
\u{b58}
\u{b5a}
\u{b5b}
\u{b5e}
\u{b64}
\u{b65}

Tamil:

\u{b8b}
\u{b96}
\u{b97}
\u{b98}
\u{b9b}
\u{b9d}
\u{ba0}
\u{ba1}
\u{ba2}
\u{ba5}
\u{ba6}
\u{ba7}
\u{bab}
\u{bac}
\u{bad}
\u{bbc}
\u{bc9}
\u{be0}

Telugu:

\u{c50}
\u{c5b}
\u{c64}

Kannada:

\u{cbb}
\u{cc9}
\u{cf5}

Malayalam:

\u{d49}

Sinhala:

\u{d80}
\u{d81}
\u{d84}
\u{d97}
\u{d98}
\u{d99}
\u{db2}
\u{dbc}
\u{dbe}
\u{dbf}
\u{dc7}
\u{dc8}
\u{dc9}
\u{dcb}
\u{dcc}
\u{dcd}
\u{dce}
\u{dd5}
\u{dd7}
\u{de0}
\u{de1}
\u{de2}
\u{de3}
\u{de4}
\u{de5}
\u{df0}
\u{df1}
\u{df5}
\u{df6}
\u{df7}
\u{df8}
\u{df9}
\u{dfa}
\u{dfb}
\u{dfc}
\u{dfd}
\u{dfe}
\u{dff}
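
A quick way to re-check a list like this is to look up each code point's general category: unassigned code points report `Cn`. The sketch below uses Python's standard `unicodedata` module and is not the corpus code itself; the result depends on the Unicode version bundled with the interpreter (see `unicodedata.unidata_version`), so code points assigned in later Unicode releases will drop out of the list.

```python
import unicodedata
from typing import Iterable, List


def unassigned(codepoints: Iterable[int]) -> List[int]:
    """Return the code points whose general category is Cn (unassigned)."""
    return [cp for cp in codepoints if unicodedata.category(chr(cp)) == "Cn"]


# U+0995 BENGALI LETTER KA is assigned; U+09B1 is a gap in the Bengali block.
sample = unassigned([0x0995, 0x09B1])
print(unicodedata.unidata_version, [f"U+{cp:04X}" for cp in sample])
```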

[Devanagari] Example image missing for `init`

https://github.com/n8willis/opentype-shaping-documents/blob/master/images/devanagari/devanagari-image-generation-log.md#5-init