Can shaping be reused for different line breaks?

linebender commented on July 21, 2024

Can shaping be reused for different line breaks?

from skribo.

Comments (10)

behdad commented on July 21, 2024

> harfbuzz/harfbuzz#1463 (comment) looks relevant.

Indeed..

I haven't been able to reason about their model yet. It's an interesting model (stop as soon as one "cluster" is shaped the same as before...). I will think about it and implement something in HarfBuzz in the next couple of months or so.

Manishearth commented on July 21, 2024

What about the case of soft hyphens, especially with Arabic? (Soft hyphens work with Arabic and leave the letter in its joined form even when the line is broken there.)

Also kashida justification, but we're not planning on supporting that (yet?).

SimonSapin commented on July 21, 2024

It’s quite possible that my question comes from a Latin-centric view and this doesn’t work at all in the general case, or not as easily as I’d hope.

raphlinus commented on July 21, 2024

Ok, this is a good question, because it touches on a lot of issues. The way Android solves the problem is a good model, but it has different constraints (legacy API) and it might be possible to do better. I'll briefly describe the Android approach, then ideas for how to do things better. We do want to treat shaping as expensive and avoid duplication of work as much as possible.

First, Android makes the assumption that shaping can be split on "word boundaries", that shaping results can be cached, and that layouts can be assembled from these pieces. The definition of "word boundary" for this purpose is interesting because it doesn't correspond to anything in UAX 29 or similar. Basically, it's a space character or the boundary between two ideographs. This essentially assumes that shaping does not happen across a space, and also that there's no kerning of ideographs. These assumptions are maybe 99.9% valid; we might want to tweak them (Android doesn't).
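
A minimal sketch of that word-cache model, not Android's actual code. It assumes a stand-in shaper (`shape_unit`, which simply charges 10 units per character) and a deliberately narrow `is_ideograph` test; all names here are hypothetical.

```python
from functools import lru_cache


def is_ideograph(ch):
    # Assumption: only the main CJK Unified Ideographs block counts here.
    return 0x4E00 <= ord(ch) <= 0x9FFF


def cache_units(text):
    """Split text at spaces and at the boundary between two ideographs."""
    units, start = [], 0
    for i, ch in enumerate(text):
        if ch == " ":
            if start < i:
                units.append(text[start:i])
            units.append(" ")  # the space is its own cache unit
            start = i + 1
        elif i > start and is_ideograph(ch) and is_ideograph(text[i - 1]):
            units.append(text[start:i])
            start = i
    if start < len(text):
        units.append(text[start:])
    return units


@lru_cache(maxsize=None)
def shape_unit(unit):
    # Stand-in for a real (expensive) shaper call; returns a width.
    return 10.0 * len(unit)


def measure(text):
    """Assemble a width from individually cached units."""
    return sum(shape_unit(u) for u in cache_units(text))
```

Measuring "hello world" and then "hello there" shapes "hello" and the space only once; the second call hits the cache for both.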

Line break opportunities overlap with this concept, but not entirely. For non-hyphenated Latin text (and other similar scripts), and also CJK, they're similar, but there are exceptions. Hyphens are one, as a hyphen can (and often does) kern with the adjoining text. There are two schools of thought here: approximation based on adding the width of the hyphen, and exact measurement. Android does the latter. Note also that, even in Latin, there are line break opportunities caused by punctuation, and these also affect shaping.

So assuming the cache is fast, I think the Android approach is reasonable: just shape the text three times (min, max, and actual), and expect very high cache hit rates. The exceptions, like punctuation, will be different cache entries.

There is another possibility, based on retaining layout objects rather than relying on the cache for persistence. Doing this would involve exposing the boundaries used for shape-caching. Then, a layout driver would iterate along both shape-caching and line-break boundaries, adding the widths to get the max intrinsic width, taking the max of the widths between break opportunities to get the min intrinsic width, and potentially persisting the layout objects once line breaks are determined, reusing them when the boundaries align.
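
The width bookkeeping in such a driver could look roughly like this. It is a sketch under simplifying assumptions: each piece is already shaped, is represented as a hypothetical `(width, can_break_after)` pair, and trailing spaces are counted in the run they end.

```python
def intrinsic_widths(pieces):
    """Return (min_intrinsic, max_intrinsic) for a sequence of shaped pieces.

    pieces: iterable of (width, can_break_after) pairs, in text order.
    """
    max_width = 0.0   # everything on one line: sum of all pieces
    min_width = 0.0   # widest run between two break opportunities
    run = 0.0         # width of the current unbreakable run
    for width, can_break_after in pieces:
        max_width += width
        run += width
        if can_break_after:
            min_width = max(min_width, run)
            run = 0.0
    # The text may end without a final break opportunity.
    return max(min_width, run), max_width
```

For "hello world" with 50-unit words and a 10-unit space, the pieces `[(50.0, False), (10.0, True), (50.0, False)]` give a min intrinsic width of 60.0 and a max of 110.0.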

This would be a somewhat more complex API, so the obvious question is whether it's worth it. I think a lot of that has to do with the cost of cache lookup. One candidate for the API is for the layout method to return a sequence (can be an iterator) of individually cached shaped layouts, and the higher level driver reusing layouts when the boundaries align.

The challenge varies by script, though Latin shares many of its problems with other scripts. One of the most challenging is Thai (and similar Southeast Asian scripts), as line break opportunities are more frequent than shaping-cache boundaries. For these scripts, it's likely that three shaping passes will be needed, and caching is not likely to be effective for max-intrinsic and actual.

One possible optimization that comes to mind is the underlying shaping engine reporting shaping boundaries (i.e. a guarantee that breaking the string at that boundary preserves shaping when the shaped substrings are concatenated). I'll bring this up with @behdad when we meet later in the week.

emilio commented on July 21, 2024

cc @jfkthame, do you know what Gecko does here? I'm pretty sure Gecko has a word cache, but that's pretty much all I know about it :)

behdad commented on July 21, 2024

Firefox and current Chrome / Blink use a word cache as well, but disable the word cache if they detect that the space glyph interacts with the lookups for the script. That's done using hb_ot_layout_lookup_collect_glyphs().
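
As a rough illustration of that check (not Firefox's or Blink's actual code), assume each lookup has already been reduced to the four glyph sets that `hb_ot_layout_lookup_collect_glyphs()` fills in: before, input, after, and output. `LookupGlyphs`, `space_interacts`, `word_cache_allowed`, and the glyph ids below are hypothetical names for this sketch, and ignoring the output set is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class LookupGlyphs:
    """Glyph-id sets collected for one lookup (hypothetical container)."""
    before: set = field(default_factory=set)
    inputs: set = field(default_factory=set)
    after: set = field(default_factory=set)
    outputs: set = field(default_factory=set)


def space_interacts(lookups, space_gid):
    """True if the space glyph can take part in matching any lookup."""
    return any(
        space_gid in lk.before or space_gid in lk.inputs or space_gid in lk.after
        for lk in lookups
    )


def word_cache_allowed(lookups, space_gid):
    # Splitting at spaces is only sound if no lookup looks at the space glyph.
    return not space_interacts(lookups, space_gid)
```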

Blink's layout-ng rewrite removes caching and instead retains the shaping result on a per paragraph basis. That's where assistance from HarfBuzz to avoid reshaping comes in.

There's some relevant discussion in harfbuzz/harfbuzz#1463. I'll summarize here:

HarfBuzz already provides a flag called UNSAFE_TO_BREAK. That flag basically tells you, after shaping, which positions in the text have the property that shaping the two sides separately and concatenating results in the same output. However, as I explained in harfbuzz/harfbuzz#1463 (comment), this doesn't tell us anything about shaping of any other string.
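
To make the guarantee concrete, here is a toy model rather than HarfBuzz itself. It assumes a made-up shaper (`toy_shape`) with a single hypothetical kerning pair, which marks the second glyph of the pair as unsafe to break before. `concat_matches` then checks the property directly: shape the two sides separately and compare against shaping the whole string.

```python
# Hypothetical kerning table: "A" followed by "V" tightens A's advance.
KERN = {("A", "V"): -2.0}


def toy_shape(text):
    """Return one (advance, unsafe_to_break) pair per character."""
    out = []
    for i, ch in enumerate(text):
        # Breaking before this character is unsafe if it kerns with the previous one.
        unsafe = i > 0 and (text[i - 1], ch) in KERN
        nxt = text[i + 1] if i + 1 < len(text) else ""
        out.append((10.0 + KERN.get((ch, nxt), 0.0), unsafe))
    return out


def concat_matches(text, offset):
    """True if shaping both sides of offset separately gives the same advances."""
    whole = [adv for adv, _ in toy_shape(text)]
    parts = [adv for adv, _ in toy_shape(text[:offset]) + toy_shape(text[offset:])]
    return whole == parts
```

In this model, every offset whose flag is clear satisfies `concat_matches`, while splitting "AV" at offset 1 does not. As noted above, this says nothing about how a different string would shape.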

It's possible to design stronger flags. I'll continue in harfbuzz/harfbuzz#1463 when I have a better proposal.

raphlinus commented on July 21, 2024

Yes, @behdad and I had a very good conversation on Friday where we went deep into this. I'm now encouraged to rely less heavily on caching (it has its own issues) and see if we can reuse more.

Since then, I've had an idea for an API which might be both nice and efficient. Basically, you pass in the entire string (so the max intrinsic calculation) and you get a width and an object back. From the object you get back, you can query for width and/or layout of an arbitrary substring (or, in the case of hyphenation, an arbitrary substring plus an additional hyphen character).

Internally, today that would use the HarfBuzz UNSAFE_TO_BREAK flag. Based on our conversation, I believe that if both start and end offsets have !UNSAFE_TO_BREAK, then that sublayout can be reused. Otherwise, a new layout is done of the substring.
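
A minimal sketch of how such an object might behave, assuming a toy shaper in place of HarfBuzz. `ShapedParagraph`, `Glyph`, `toy_shape`, and the kerning table are all hypothetical; the `UNSAFE_TO_BREAK` constant only models the HarfBuzz flag, and hyphen insertion is left out.

```python
from dataclasses import dataclass

UNSAFE_TO_BREAK = 0x1             # models the HarfBuzz glyph flag in this toy
KERN = {("A", "V"): -2.0}         # hypothetical kerning table


@dataclass(frozen=True)
class Glyph:
    cluster: int      # offset of the source character
    advance: float
    flags: int = 0


def toy_shape(text):
    glyphs = []
    for i, ch in enumerate(text):
        flags = UNSAFE_TO_BREAK if i > 0 and (text[i - 1], ch) in KERN else 0
        nxt = text[i + 1] if i + 1 < len(text) else ""
        glyphs.append(Glyph(i, 10.0 + KERN.get((ch, nxt), 0.0), flags))
    return glyphs


class ShapedParagraph:
    def __init__(self, text, shape=toy_shape):
        self.text, self.shape = text, shape
        self.glyphs = shape(text)          # one full shaping pass
        self.reshape_count = 0
        self.safe = {g.cluster for g in self.glyphs
                     if not g.flags & UNSAFE_TO_BREAK} | {0, len(text)}

    @property
    def width(self):
        return sum(g.advance for g in self.glyphs)

    def sub_glyphs(self, start, end):
        if start in self.safe and end in self.safe:
            # Both ends are safe: slice the retained result.
            return [Glyph(g.cluster - start, g.advance, g.flags)
                    for g in self.glyphs if start <= g.cluster < end]
        self.reshape_count += 1            # fall back to shaping the substring
        return self.shape(self.text[start:end])

    def sub_width(self, start, end):
        return sum(g.advance for g in self.sub_glyphs(start, end))
```

With `p = ShapedParagraph("AV AV")`, querying `p.sub_width(0, 2)` or `p.sub_width(3, 5)` reuses the retained glyphs, while `p.sub_width(0, 1)` cuts through the kerned pair and triggers a reshape.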

My rough analysis is that this would work pretty well for Latin (most of the time, spaces and thus line breaks would be safe to break), and especially well for CJK.

This logic can then be refined as HarfBuzz evolves. I think the nice thing about the API I propose is that it's not too dependent on the details of how it's done under the hood. For example, even with a highly cache-centered approach, the style parameters (the "paint") can be interned in the first call, and then the interned paint can be used as part of the cache key. Doing it this way avoids the need to have an explicit API for interning (with associated lifetime and thread-safety concerns).
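
One way the implicit interning could work, as a sketch only: the `Paint` fields, `LayoutCache`, and the `shape` callback signature are assumptions for illustration, not a proposed skribo API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paint:
    """Style parameters; the fields are placeholders for this sketch."""
    font: str
    size: float
    locale: str = "en"


class LayoutCache:
    def __init__(self, shape):
        self._shape = shape        # callable (paint, text) -> layout result
        self._paint_ids = {}       # Paint -> small integer id
        self._entries = {}         # (paint id, text) -> layout result
        self.misses = 0

    def _intern(self, paint):
        # The first call with a given paint assigns it the next id;
        # callers never see or manage the id themselves.
        return self._paint_ids.setdefault(paint, len(self._paint_ids))

    def layout(self, paint, text):
        key = (self._intern(paint), text)
        if key not in self._entries:
            self.misses += 1
            self._entries[key] = self._shape(paint, text)
        return self._entries[key]
```

Because the id is assigned inside `layout`, the caller only ever passes a `Paint` value, which is the property the paragraph above is after.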

SimonSapin commented on July 21, 2024

> From the object you get back, you can query for width and/or layout

By the way, is computing the width cheaper than doing “full” shaping and layout? If not in amount of computation, perhaps in memory space for storing results?

raphlinus commented on July 21, 2024

> By the way, is computing the width cheaper than doing “full” shaping and layout? If not in amount of computation, perhaps in memory space for storing results?

Yes, though this needs careful empirical measurement. At the very least, it can be done without allocating, while full layout requires allocating the result buffer. I think it's also a question of how heavily we rely on caching - if we go through HarfBuzz, the difference might not be that much, but if we hit in the cache the relative cost of writing the layout result might be significant.

SimonSapin commented on July 21, 2024

harfbuzz/harfbuzz#1463 (comment) looks relevant.
