Giter Club home page Giter Club logo

Comments (17)

AllanCameron avatar AllanCameron commented on July 24, 2024 1

It looks like the design decisions here were vindicated @teunbrand - see this twitter link

from geomtextpath.

teunbrand avatar teunbrand commented on July 24, 2024

Yes this is indeed an issue. We can use the textshaping library to do bidirectional text, which uses fridibi for that, but we'd need to figure out how to reliably translate glyphs back into strings.

In the example below, glyphs 9-16 are flipped because it a run of RTL text.

test <- "Sarah is \u05e9\u05e8\u05d4, \u05e9\u05e8\u05d4 with \u05e9 on R"
x <- textshaping::shape_text(test)$shape
x[9:18, ]
#> # A tibble: 10 x 7
#>    glyph index metric_id string_id x_offset y_offset x_midpoint
#>    <int> <int>     <int>     <int>    <dbl>    <dbl>      <dbl>
#>  1     8     3         1         1     44.0        0       1.66
#>  2    16   676         1         1     47.3        0       3.61
#>  3    15   696         1         1     54.6        0       3.05
#>  4    14   697         1         1     60.7        0       4.16
#>  5    13     3         1         1     69          0       1.66
#>  6    12    15         1         1     72.3        0       1.66
#>  7    11   676         1         1     75.7        0       3.61
#>  8    10   696         1         1     82.9        0       3.05
#>  9     9   697         1         1     89.0        0       4.16
#> 10    17     3         1         1     97.3        0       1.66

Created on 2021-12-08 by the reprex package (v2.0.1)

Alternatively, we could homebrew some simple flipping mechanism for stretches of RTL text, for which you could get some class information like below:

ints <- utf8ToInt(enc2utf8(string))
bidi <- Unicode::u_char_properties(ints, "Bidi_Class")$Bidi_Class

Either way, we'd end up with an extra dependency if we go this route. Any thoughts about this, Alan?

from geomtextpath.

AllanCameron avatar AllanCameron commented on July 24, 2024

Yes, this is down to the mechanism of how we calculate glyph positions via systemfonts::shape_string, which seems not to handle RtL text. If the string is passed as a whole block to grid (e.g. by setting keep_straight = TRUE), it renders correctly:

ggplot(iris, aes(x = Sepal.Length))+
  geom_textpath(aes(label = "Sarah is \u05e9\u05e8\u05d4 with ש on R"),   
                          stat = "density", vjust = -0.2, size =8, fontface = 1, 
                          keep_straight = TRUE) +
  ylim(0.1,0.5) +
  annotate("text", label = "expect: Sarah is \u05e9\u05e8\u05d4 with ש on R", 
                 y = 0.2, x = 5.75)

image

The easiest way to do this would be to add a dependency to textshaping which will automatically order characters correctly (the glyphs can be recovered by matching the index column in the output of shape_string, but I prefer the homebrew method using the Unicode package, since it is more flexible for future development.

from geomtextpath.

byteit101 avatar byteit101 commented on July 24, 2024

Alternatively, we could homebrew some simple flipping mechanism for stretches of RTL text, for which you could get some class information like below:

I would like to caution you against re-implementing the BiDi algorithm as there are some hairy cases (see some examples and more examples) though there are some tests that exist.

from geomtextpath.

teunbrand avatar teunbrand commented on July 24, 2024

the glyphs can be recovered by matching the index column in the output of shape_string

The glyphs cannot be that easily recovered when you have combined characters as in #31. I accidentally figured that out when I was investigating whether I could use textshaping() to solve both that issue and this one --and in theory it could-- but reverse engineering the glyphs isn't as straightforward.

I'm more in favour of letting {textshaping} handle shaping text (if we can) than using a homebrew method because there are all kinds of exceptions that somebody else has already figured out, it is less of a hassle to maintain and we can defer responsibility in that department.

The state of that line of investigation is currently as follows. Let's say we have some string that we want to pull through textshaping::shape_text(), and reconstruct (in the real case we'll be extracting position parameters too). This works fine for simple strings.

test <- "Simple string"

reconstruct <- function(string) {
  shape <- textshaping::shape_text(string)$shape
  chars <- strsplit(string, "")[[1]]
  paste0(chars[shape$glyph + 1], collapse = "")
}

reconstruct(test)
#> [1] "Simple string"

This also works fine for RtL text:

test <- "Sarah is \u05e9\u05e8\u05d4 with \u05e9 on R"

reconstruct(test)
#> [1] "Sarah is <U+05D4><U+05E8><U+05E9> with <U+05E9> on R"

However, it seems to break whenever there are compound characters:

test <- "Composed: \u00ea, DeC: e\u0302, \u05aa\u05d0\u05aa"

reconstruct(test) # prints to console as ê, doesn't reprex well
#> [1] "Composed: ê, DeC: e,  <U+05D0><U+05D0>"

Created on 2021-12-08 by the reprex package (v2.0.1)

It works slightly better* if we use stringi::stri_split_boundaries(string, type = "character") instead of a regular strsplit(), but it isn't perfect either.

* The first two cases work, the third one also has compound characters now, but also some NAs.

from geomtextpath.

AllanCameron avatar AllanCameron commented on July 24, 2024

I think this is definitely something to explore further, and hopefully we can encapsulate a simple solution nicely inside the measure_text function.

I'm a little apprehensive about introducing a lot of extra overhead (either in terms of processing, extra dependencies or coding effort) for what might be very rare use cases.

there are all kinds of exceptions that somebody else has already figured out

I don't have a great feel for how often a plot really needs a curved label comprising mixed Hebrew and Latin characters, but my guess is we would struggle to find many examples in the wild.

from geomtextpath.

teunbrand avatar teunbrand commented on July 24, 2024

but my guess is we would struggle to find many examples in the wild.

Well I totally agree with you there, its been a while since I even saw a plot annotated in my own native tongue.

encapsulate a simple solution nicely inside the measure_text function

Apart from some minor discrepancies (the retrieval of glyphs being one of the major ones), it should theoretically be a sort of drop-in replacement for systemfonts::shape_string()

I'm a little apprehensive about introducing a lot of extra overhead (either in terms of processing, extra dependencies or coding effort) for what might be very rare use cases.

Me too, and that is why it would be great if we can just get this to work with textshaping (meaning we can trade-in systemfonts as a dependency).

Do you think it is a good idea if I post a feature request at textshaping to ask whether they can add the character glyphs?

from geomtextpath.

AllanCameron avatar AllanCameron commented on July 24, 2024

Interestingly, the docs for textshaping::shape_text say that glyph should be actual glyphs (though it also looks as if the authors have copy / pasted their docs from systemfonts::shape_string, If you look at the code, the line that looks as if it should convert the glyph numbers to characters is commented out.. That line certainly doesn't convert the glyph number in the output of the C function to glyphs, so the commenting out is obviously purposeful.

Strangely enough, the index seems to be the actual unicode code point minus 29, so:

intToUtf8(textshaping::shape_text(string)$shape$index + 29, multiple = TRUE)

might work

from geomtextpath.

AllanCameron avatar AllanCameron commented on July 24, 2024

It looks like perhaps something like this could work:

index_to_utf8 <- function(ints)
{
  nums <- lapply(ints + 29, function(codepoint) {
    if (codepoint <= 0x7f) return(codepoint)
    
    if (codepoint <= 0x7ff) 
      return(c(bitwOr(bitwShiftR(codepoint + 787, 6), 0xc0),
               bitwOr(bitwAnd(codepoint + 787, 0x3f), 0x80)))})
  
  `Encoding<-`(rawToChar(as.raw(unlist(nums))), "UTF-8")
}

Which seems to do the trick of harvesting the glyphs from the index column in shape_text in the right order (on this example at least):

string <- "Sarah is \u05e9\u05e8\u05d4 with ש on R"

string
#> [1] "Sarah is שרה with ש on R"

index_to_utf8(textshaping::shape_text(string)$shape$index)
#> [1] "Sarah is הרש with ש on R"

from geomtextpath.

teunbrand avatar teunbrand commented on July 24, 2024

This looks quite promising! I think the docs mention that the index is an index into the font file character table, so I tested all fonts on my machine:

index_to_utf8 <- function(ints)
{
  nums <- lapply(ints + 29, function(codepoint) {
    if (codepoint <= 0x7f) return(codepoint)
    
    if (codepoint <= 0x7ff) 
      return(c(bitwOr(bitwShiftR(codepoint + 787, 6), 0xc0),
               bitwOr(bitwAnd(codepoint + 787, 0x3f), 0x80)))})
  
  `Encoding<-`(rawToChar(as.raw(unlist(nums))), "UTF-8")
}

string <- "Sarah is \u05e9\u05e8\u05d4 with ? on R"

# 549 of them
all_fonts <- systemfonts::system_fonts()$path

all_indices <- lapply(all_fonts, function(path) {
  textshaping::shape_text(string, path = path)$shape$index
})

all_strings <- vapply(all_indices, index_to_utf8, character(1))

print(all_strings[[1]]) # Arial (correctly displayed, but not in reprex)
#> [1] "Sarah is <U+05D4><U+05E8><U+05E9> with ? on R"

# Most common cases
head(sort(table(all_strings), decreasing = TRUE))
#> all_strings
#>                                                                Sarah is \035\035\035 with ? on R 
#>                                                                                              296 
#> \035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035 
#>                                                                                               53 
#>                                                                R`q`g hr \035\035\035 vhsg > nm Q 
#>                                                                                               26 
#> 3<U+03E2><U+03F3><U+03E2><U+03E9> <U+03EA><U+03F4> \035\035\035 <U+03F8><U+03EA><U+03F5><U+03E9> <U+05BE> <U+03F0><U+03EF> 2 
#>                                                                                               24 
#> 3<U+03B3>t<U+03B3><U+03BA> <U+03BB><U+03C5> \035\035\035 <U+03C9><U+03BB>f<U+03BA> <U+051B> <U+03C1>p 2 
#>                                                                                               17 
#>                                                    Sarah is <U+05D4><U+05E8><U+05E9> with ? on R 
#>                                                                                               13

Created on 2021-12-09 by the reprex package (v2.0.0)

So the assumptions don't appear to hold for every font. However, we don't need these assumptions to hold for every font, because we can use the shaping once on a font for which we know it works, and then have the glyphs we need to do shaping on the true font. This is essentially shaping the text twice, but I think the trade-off seems good as we no longer measure the text during drawing/window resizing time.

from geomtextpath.

AllanCameron avatar AllanCameron commented on July 24, 2024

A more robust method would be to get the font table for, say, Arial, and create a lookup table. I found a parseable json version online.

library(jsonlite)
library(textshaping)

url <- paste0("http://zuga.net/articles/",
              "unicode-all-characters-supported-by-the-font-arial/",
              "d/p/1000-x.json?v=41")

codes <- read_json(url)
codes <- codes[-(1:2)]
codes <- lapply(codes, function(x) x$I)
glyphs <- unlist(sapply(codes, function(y) sapply(y, function(x) x$P)))
index <- unlist(sapply(codes, function(y) sapply(y, function(x) x$G)))
lookup <- numeric(max(index))
lookup[index] <- intToUtf8(glyphs, multiple = TRUE)

string <- "Sarah is \u05e9\u05e8\u05d4 with \u05e9 on R"

string
#> [1] "Sarah is שרה with ש on R"

paste0(lookup[shape_text(string)$shape$index], collapse = "")
#> [1] "Sarah is הרש with ש on R"

We could have the lookup data frame hard-coded as an R file in the package.

from geomtextpath.

teunbrand avatar teunbrand commented on July 24, 2024

Yes that does make sense to me, the only caveat I can spot is that non every operating system has the Arial font. For example, a unit test in {systemfonts} also tests different fonts per system. A possible remedy is to include a font ourselves with this package, that can act as a reference font. We'd have to carefully read the licence of any font we consider for that purpose, because not all are free to distribute for personal and commercial use.

One candidate is the Roboto font which is relatively popular, has an Apache v2.0 license that (this is no legal advice, I'm no lawyer) appears to be pretty permissive. A precedent for including fonts is in the hrbrthemes package.

from geomtextpath.

AllanCameron avatar AllanCameron commented on July 24, 2024

Actually, we may be trying to reinvent the wheel here. systemfonts::glyph_info(string)$index gives you the index of each glyph in string, so doing something like systemfonts::glyph_info(intToUtf8(1:64000, multiple = TRUE)) gives you the full Unicode lookup table for glyphs in whichever font you choose.

from geomtextpath.

teunbrand avatar teunbrand commented on July 24, 2024

That seems pretty handy! We might simply cache the result of that until a new font is requested. But it does mean that we can't trade-in systemfonts for textshaping (which I'm ok with).

from geomtextpath.

AllanCameron avatar AllanCameron commented on July 24, 2024

We can't trade in systemfonts for textshaping anyway - systemfonts is a dependency of textshaping, so textshaping would be an addition rather than a substitution. If we accept that we want bidi enough that we need the overhead of an extra package, then textshaping is fine, since it has a minimal dependency chain (just cpp11, which doesn't have any strict dependencies of its own)

from geomtextpath.

teunbrand avatar teunbrand commented on July 24, 2024

I'm in favour of adopting textshaping because not only does it do bidi text, it also does composite characters and perhaps other finnicky things that I am keen to not have to worry about myself. Also your investigations makes it more feasible than I realized at first.

from geomtextpath.

teunbrand avatar teunbrand commented on July 24, 2024

Alright this should be fixed now, so I'll be closing this.

from geomtextpath.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.