Base Consonant Position about opentype-shaping-documents HOT 10 CLOSED

n8willis commented on June 2, 2024

Base Consonant Position

from opentype-shaping-documents.

Comments (10)

mikeday commented on June 2, 2024

Devanagari, Gujarati, Gurmukhi, Tamil, Malayalam, Kannada, and Telugu all have identical descriptions of the base consonant algorithm. The only differences are in the notes: all but Malayalam have a note about lacking a pre-base-reordering Ra, and Kannada and Telugu have a note about all consonants having a post-base form. (Does this mean we don't need to check the font for these?)

Bengali and Oriya are different: they write "stand-alone" instead of "standalone" and "Ra" instead of "Ra,Halant". I'm not sure if this difference is intentional.

Sinhala has its own algorithm.

n8willis commented on June 2, 2024

> Can the BASE_POS_LAST algorithm be described for Indic in general, or does it actually differ for each script?

It's the same for all (in HarfBuzz). I left it out of the "general" document because there were (initially) other base-positioning rules and it seemed wrong to describe one algorithm but not the others. Subsequently, HarfBuzz extracted Khmer into a separate shaper and only BASE_POS_LAST and BASE_POS_LAST_SINHALA remain.

I think I would recommend leaving the BASE_POS_LAST algorithm description in each individual script page because, as you noted in the second comment, there are some minor differences at a practical level (like whether or not anything can actually take on a post-base form), so covering those all in one spot could get confusing. It would also add a lot of length, considering that it still needs to be in each script doc. But I'm open to persuasion.
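
For anyone comparing the per-script pages, the shared backward search can be sketched roughly as follows. This is a simplified illustration and not HarfBuzz's code: the `Glyph` type, the `form` labels, and `find_base_last` are hypothetical names, and details such as ZWJ/ZWNJ handling are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Glyph:
    name: str                  # label used only for illustration
    is_consonant: bool = False
    # Hypothetical labels: "none", "below", "post", or "prebase_ra".
    form: str = "none"


def find_base_last(syllable: List[Glyph], start: int = 0) -> Optional[int]:
    """Walk backward from the end of the syllable to pick a base consonant.

    `start` lets a caller exclude a leading "Ra,Halant" from the search.
    Returns None if there is no consonant at or after `start`.
    """
    base = None
    for i in range(len(syllable) - 1, start - 1, -1):
        glyph = syllable[i]
        if not glyph.is_consonant:
            continue
        # Remember the earliest consonant visited so far; if nothing stops
        # the search, the first consonant ends up as the base.
        base = i
        if glyph.form in ("below", "post", "prebase_ra"):
            continue  # this consonant can attach to an earlier base
        break  # a consonant with no such form is the base
    return base


syllable = [
    Glyph("Ka", is_consonant=True),
    Glyph("Halant"),
    Glyph("Ya", is_consonant=True, form="post"),
]
print(find_base_last(syllable))  # 0, the index of "Ka"
```

Under this sketch, a script in which every consonant has a post-base form always walks back to the first consonant, which is the kind of practical difference the per-script notes capture.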

> Also does the base consonant always have shaping class "CONSONANT" and not "CONSONANT_DEAD"? (There is ambiguity due to a reference to consonants having this shaping class in 2.7: Post-base consonants).

So, based on my analysis of the Ragel machines, it is possible for a CONSONANT_DEAD to be identified as the base consonant. Whether or not real words do this is, naturally, a different matter. I think they don't.

But because CONSONANT_DEADs can occur in pre-base position, the classes are merged for the syllable-identification algorithm. So a shaper using the algorithm might identify a dead-consonant codepoint as the base in a nonsense syllable. Then again, it's "buyer beware" on nonsense syllables already, I would think.

> For Sinhala, can the first consonant be preceded by a ZWJ and still be the last consonant?

My read is 'no'. The only possible beginnings for a valid consonant-based syllable are

repha
Consonant
consonantwithstacker

(And, for Sinhala, repha and consonantwithstacker don't exist in the script.) The "broken syllable" expression can match potential syllable sequences starting with a ZWJ, but the shaper offers no guarantee of how they'll turn out.

> Devanagari, Gujarati, Gurmukhi, Tamil, Malayalam, Kannada, and Telugu all have identical descriptions of the base consonant algorithm. The only differences are in the notes: all but Malayalam have a note about lacking a pre-base-reordering Ra, and Kannada and Telugu have a note about all consonants having a post-base form. (Does this mean we don't need to check the font for these?)

Right. Malayalam has a pre-base-reordering Ra; the others don't. Kannada & Telugu both allow any consonant to be a post-base form. As to whether the shaper needs to check the font, I guess that's a "level of trust" issue. If the font doesn't provide any post-base-form glyph variants through its GSUB, the user is probably not going to be able to read the resulting text since the shaping will look terrible. So an expensive check might not be worth it.
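
If a shaper did want a cheap sanity check instead, one coarse option is to ask whether the font declares the `pstf` (post-base forms) feature in GSUB at all. The sketch below assumes a fontTools `TTFont`-style object; `has_gsub_feature` is a hypothetical helper, and it does not test whether any particular consonant actually has a post-base glyph.

```python
def has_gsub_feature(font, tag: str) -> bool:
    """Return True if the font's GSUB table lists a feature with this tag.

    `font` is assumed to behave like fontTools' TTFont: it supports
    `"GSUB" in font` and `font["GSUB"].table.FeatureList.FeatureRecord`.
    """
    if "GSUB" not in font:
        return False
    feature_list = font["GSUB"].table.FeatureList
    if feature_list is None:
        return False
    return any(rec.FeatureTag == tag for rec in feature_list.FeatureRecord)


# Example usage (needs fontTools and a real font file; the path is a placeholder):
# from fontTools.ttLib import TTFont
# print(has_gsub_feature(TTFont("SomeTeluguFont.ttf"), "pstf"))
```

A feature-presence test like this only reads the feature list, so it is much cheaper than testing substitutions per consonant, at the cost of saying nothing about coverage.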

> Bengali and Oriya are different: they write "stand-alone" instead of "standalone" and "Ra" instead of "Ra,Halant". I'm not sure if this difference is intentional.
>
> Sinhala has its own algorithm.

Correct; because it has its own base-positioning rule, BASE_POS_LAST_SINHALA.

n8willis commented on June 2, 2024

Whoops; forgot to add: the differences in Bengali and Oriya are not intentional; will update.

mikeday commented on June 2, 2024

> I think I would recommend leaving the BASE_POS_LAST algorithm description in each individual script page because, as you noted in the second comment, there are some minor differences at a practical level (like whether or not anything can actually take on a post-base form), so covering those all in one spot could get confusing. It would also add a lot of length, considering that it still needs to be in each script doc. But I'm open to persuasion.

Another possibility would be to state explicitly which scripts have the same algorithm, so that the reader doesn't need to carefully check them all and compare?

adrianwong commented on June 2, 2024

If we could have a table that summarises these similarities/differences, that would be great!

n8willis commented on June 2, 2024

@adrianwong Do you mean the differences between the scripts, or the differences between the base-consonant algorithms?

adrianwong commented on June 2, 2024

@n8willis Making one for the base-consonant algorithms would be a nice start, but ultimately having a summary of differences between all six stages of processing Indic2 texts in a table would be really useful.

This is an alternate approach to @mikeday's, but the motivations are the same - it'll make it easier for the reader to compare.

n8willis commented on June 2, 2024

I'm not sure that the full shaping processes or even the base-consonant-locating algorithms would fit into a table format. They're algorithms (stating the obvious); the steps are of different lengths & complexities ... not to mention the real-world problem that GitHub-rendered Markdown pages are of a fixed width. We already have problems with the latter issue in several of the character tables.

Putting in a more explicit listing like in @mikeday's comment seems more feasible.

adrianwong commented on June 2, 2024

Valid points! Thanks for giving it some thought.

n8willis commented on June 2, 2024

Merged tables for all script-shaping characteristics in 711b4a7.
