Comments (10)

mikeday commented on June 2, 2024

Devanagari, Gujarati, Gurmukhi, Tamil, Malayalam, Kannada, and Telugu all have identical descriptions of the base consonant algorithm, although all but Malayalam have a note about lacking pre-base-reordering Ra and Kannada and Telugu have a note about all consonants having post-base form (does this mean we don't need to check the font for these?)

Bengali and Oriya are different: they write stand-alone instead of standalone and "Ra" instead of "Ra,Halant". I'm not sure if this difference is intentional.

Sinhala has its own algorithm.

from opentype-shaping-documents.

n8willis commented on June 2, 2024

Can the BASE_POS_LAST algorithm be described for Indic in general, or does it actually differ for each script?

It's the same for all (in HarfBuzz). I left it out of the "general" document because there were (initially) other base-positioning rules and it seemed wrong to describe one algorithm but not the others. Subsequently, HarfBuzz extracted Khmer into a separate shaper and only BASE_POS_LAST and BASE_POS_LAST_SINHALA remain.

I think I would recommend leaving the BASE_POS_LAST algorithm description in each individual script page because, as you noted in the second comment, there are some minor differences at a practical level (like whether or not anything can actually take on a post-base form), so covering those all in one spot could get confusing. It would also add a lot of length, considering that it still needs to be in each script doc. But I'm open to persuasion.
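For readers who have not opened the per-script pages, here is a rough sketch of what a "last consonant" base search looks like. It is a simplification under stated assumptions, not the HarfBuzz code or the exact wording of any script document: the `Glyph` type, the `has_below_or_post_base_form` flag, and the `starts_with_reph` argument are all hypothetical, and details such as ZWJ/ZWNJ handling and pre-base-reordering Ra are left out.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    # Hypothetical record for one codepoint in a syllable.
    name: str
    is_consonant: bool = False
    # Assumed to be known in advance (for example, from the script's rules
    # or from probing the font); this sketch does not compute it.
    has_below_or_post_base_form: bool = False


def find_base_last(syllable, starts_with_reph=False):
    """Search backward for the base consonant; return its index or None."""
    consonant_count = sum(1 for g in syllable if g.is_consonant)
    start = 0
    # An initial "Ra,Halant" is excluded from the search when the syllable
    # contains at least one other consonant.
    if starts_with_reph and consonant_count > 1:
        start = 2

    base = None
    for i in range(len(syllable) - 1, start - 1, -1):
        glyph = syllable[i]
        if not glyph.is_consonant:
            continue
        base = i  # best candidate so far
        if not glyph.has_below_or_post_base_form:
            break  # cannot be a below-base or post-base form, so it is the base
    # If every consonant had such a form, the earliest one searched is the base.
    return base
```

With `Ka, Halant, Ya` where Ya is flagged as having a post-base form, the search skips Ya and settles on Ka. If the flag is set for every consonant (as the Kannada and Telugu notes suggest), the search simply runs back to the first eligible consonant.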

Also does the base consonant always have shaping class "CONSONANT" and not "CONSONANT_DEAD"? (There is ambiguity due to a reference to consonants having this shaping class in 2.7: Post-base consonants).

So, based on my analysis of the Ragel machines, it is possible for a CONSONANT_DEAD to be identified as the base consonant. Whether or not real words do this is, naturally, a different matter. I think they don't.

But because CONSONANT_DEADs can occur in pre-base position, the classes are merged for the syllable-identification algorithm. So a shaper using the algorithm might identify a dead-consonant codepoint as the base in a nonsense syllable; but then again, it's "buyer beware" on nonsense syllables already, I would think.
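As a small illustration of what "merged" means here, the sketch below maps both shaping classes to one symbol before syllable matching. The symbol `"C"` and the function name are hypothetical; the actual Ragel machines use their own category names.

```python
# Hypothetical mapping from shaping class to the symbol the syllable
# matcher sees. Only the two classes discussed above are merged; every
# other class passes through unchanged.
MERGED_CLASS = {
    "CONSONANT": "C",
    "CONSONANT_DEAD": "C",  # indistinguishable from CONSONANT when matching
}


def syllable_symbols(shaping_classes):
    """Return the matcher symbols for a sequence of shaping classes."""
    return [MERGED_CLASS.get(cls, cls) for cls in shaping_classes]
```

Because the matcher only ever sees `"C"`, nothing in the syllable-identification step can rule out a CONSONANT_DEAD ending up as the base.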

For Sinhala, can the first consonant be preceded by a ZWJ and still be the last consonant?

My read is 'no'. The only possible beginnings for a valid consonant-based syllable are

  • repha
  • Consonant
  • consonantwithstacker

(And, for Sinhala, repha and consonantwithstacker don't exist in the script). The "broken syllable" expression can match potential-syllable-sequences starting with a ZWJ, but the shaper offers no guarantee of how they'll turn out.
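The same reasoning can be written down as a tiny lookup. This is only a sketch of the argument above, with the element names lowercased and the per-script exclusion table assumed from this comment rather than taken from the Ragel source.

```python
# The three possible beginnings of a valid consonant-based syllable,
# as listed above (names lowercased for this sketch).
SYLLABLE_STARTS = ("repha", "consonant", "consonantwithstacker")

# Elements that do not exist in a given script (assumption: only the
# Sinhala case mentioned in this thread is recorded here).
NOT_IN_SCRIPT = {"Sinhala": {"repha", "consonantwithstacker"}}


def valid_starts(script):
    """Return the syllable beginnings that can occur in the given script."""
    missing = NOT_IN_SCRIPT.get(script, set())
    return [s for s in SYLLABLE_STARTS if s not in missing]


def may_start_consonant_syllable(first_element, script):
    """True if first_element can begin a valid consonant-based syllable."""
    return first_element in valid_starts(script)
```

For Sinhala, `valid_starts` leaves only `"consonant"`, so a sequence beginning with a ZWJ can only fall through to the "broken syllable" expression.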

Devanagari, Gujarati, Gurmukhi, Tamil, Malayalam, Kannada, and Telugu all have identical descriptions of the base consonant algorithm, although all but Malayalam have a note about lacking pre-base-reordering Ra and Kannada and Telugu have a note about all consonants having post-base form (does this mean we don't need to check the font for these?)

Right. Malayalam has a pre-base-reordering Ra, the others don't. Kannada & Telugu both allow any consonant to be a post-base form. As to whether the shaper needs to check the font, I guess that's a "level of trust" issue. If the font doesn't provide any post-base-form glyph variants through its GSUB, the user is probably not going to be able to read the resulting text since the shaping will look terrible. So an expensive check might not be worth it.

Bengali and Oriya are different: they write stand-alone instead of standalone and "Ra" instead of "Ra,Halant". I'm not sure if this difference is intentional.

Sinhala has its own algorithm.

Correct; because it has its own base-positioning rule, BASE_POS_LAST_SINHALA.

n8willis commented on June 2, 2024

Whoops; forgot to add: the differences in Bengali and Oriya are not intentional; will update.

mikeday commented on June 2, 2024

I think I would recommend leaving the BASE_POS_LAST algorithm description in each individual script page because, as you noted in the second comment, there are some minor differences at a practical level (like whether or not anything can actually take on a post-base form), so covering those all in one spot could get confusing. It would also add a lot of length, considering that it still needs to be in each script doc. But I'm open to persuasion.

Another possibility would be to state explicitly which scripts have the same algorithm, so that the reader doesn't need to carefully check them all and compare?
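One way to picture such an explicit listing is as a small data table grouped by base-positioning rule. The sketch below only records what has been said in this thread (Bengali and Oriya are assumed to share BASE_POS_LAST, since their differences were described as unintentional); the field layout and names are hypothetical.

```python
from collections import defaultdict

# script: (base-positioning rule, has pre-base-reordering Ra,
#          all consonants can take a post-base form)
# Values are taken from the discussion above, not from the shaper source.
SCRIPTS = {
    "Devanagari": ("BASE_POS_LAST", False, False),
    "Bengali":    ("BASE_POS_LAST", False, False),
    "Gujarati":   ("BASE_POS_LAST", False, False),
    "Gurmukhi":   ("BASE_POS_LAST", False, False),
    "Oriya":      ("BASE_POS_LAST", False, False),
    "Tamil":      ("BASE_POS_LAST", False, False),
    "Malayalam":  ("BASE_POS_LAST", True, False),
    "Kannada":    ("BASE_POS_LAST", False, True),
    "Telugu":     ("BASE_POS_LAST", False, True),
    "Sinhala":    ("BASE_POS_LAST_SINHALA", False, False),
}


def group_by_rule(scripts):
    """Group script names by their base-positioning rule."""
    groups = defaultdict(list)
    for name, (rule, _pre_base_ra, _all_post_base) in scripts.items():
        groups[rule].append(name)
    return dict(groups)
```

Calling `group_by_rule(SCRIPTS)` puts Sinhala in a group of its own and the other nine scripts together, which is the comparison a reader would otherwise have to make by hand.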

adrianwong commented on June 2, 2024

If we could have a table that summarises these similarities/differences, that would be great!

n8willis commented on June 2, 2024

@adrianwong Do you mean the differences between the scripts, or the differences between the base-consonant algorithms?

adrianwong commented on June 2, 2024

@n8willis Making one for the base-consonant algorithms would be a nice start, but ultimately having a summary of differences between all six stages of processing Indic2 texts in a table would be really useful.

This is an alternate approach to @mikeday's, but the motivations are the same - it'll make it easier for the reader to compare.

n8willis commented on June 2, 2024

I'm not sure that the full shaping processes or even the base-consonant-locating algorithms would fit into a table format. They're algorithms (stating the obvious); the steps are of different lengths & complexities ... not to mention the real-world problem that GitHub-rendered Markdown pages are of a fixed width. We already have problems with the latter issue in several of the character tables.

Putting in a more explicit listing like in @mikeday's comment seems more feasible.

adrianwong commented on June 2, 2024

Valid points! Thanks for giving it some thought.

n8willis commented on June 2, 2024

Merged tables for all script-shaping characteristics in 711b4a7.
