Comments (17)
It looks like the design decisions here were vindicated @teunbrand - see this twitter link
from geomtextpath.
Yes, this is indeed an issue. We can use the textshaping library to do bidirectional text, which uses fribidi for that, but we'd need to figure out how to reliably translate glyphs back into strings.
In the example below, glyphs 9-16 are flipped because they form a run of RTL text.
test <- "Sarah is \u05e9\u05e8\u05d4, \u05e9\u05e8\u05d4 with \u05e9 on R"
x <- textshaping::shape_text(test)$shape
x[9:18, ]
#> # A tibble: 10 x 7
#>    glyph index metric_id string_id x_offset y_offset x_midpoint
#>    <int> <int>     <int>     <int>    <dbl>    <dbl>      <dbl>
#>  1     8     3         1         1     44.0        0       1.66
#>  2    16   676         1         1     47.3        0       3.61
#>  3    15   696         1         1     54.6        0       3.05
#>  4    14   697         1         1     60.7        0       4.16
#>  5    13     3         1         1     69          0       1.66
#>  6    12    15         1         1     72.3        0       1.66
#>  7    11   676         1         1     75.7        0       3.61
#>  8    10   696         1         1     82.9        0       3.05
#>  9     9   697         1         1     89.0        0       4.16
#> 10    17     3         1         1     97.3        0       1.66
Created on 2021-12-08 by the reprex package (v2.0.1)
Alternatively, we could homebrew some simple flipping mechanism for stretches of RTL text, for which you could get some class information like below:
ints <- utf8ToInt(enc2utf8(string))
bidi <- Unicode::u_char_properties(ints, "Bidi_Class")$Bidi_Class
Either way, we'd end up with an extra dependency if we go this route. Any thoughts about this, Alan?
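To make the "homebrew flipping" idea concrete, here is a minimal base R sketch. It is not the Unicode Bidirectional Algorithm: the helper names `is_rtl()` and `flip_rtl_runs()` are made up for illustration, and the code-point ranges (Hebrew and Arabic blocks only) are an assumption standing in for proper `Bidi_Class` data. Neutral characters such as the comma and space between two Hebrew words are treated as LTR here, whereas the shaped output above flips glyphs 9-16 as a single run, so this sketch would not reproduce that case.

```r
# Assumption: only the Hebrew (U+0590-U+05FF) and Arabic (U+0600-U+06FF)
# blocks count as RTL; real code would use Bidi_Class data instead.
is_rtl <- function(cp) {
  (cp >= 0x0590 & cp <= 0x05FF) | (cp >= 0x0600 & cp <= 0x06FF)
}

# Reverse every maximal run of RTL code points, leave everything else alone.
flip_rtl_runs <- function(string) {
  cp     <- utf8ToInt(enc2utf8(string))
  runs   <- rle(is_rtl(cp))
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  for (i in which(runs$values)) {
    span <- starts[i]:ends[i]
    cp[span] <- rev(cp[span])
  }
  intToUtf8(cp)
}

flipped <- flip_rtl_runs("Sarah is \u05e9\u05e8\u05d4 with \u05e9 on R")
# Code points 10-12 are the Hebrew word, now in reversed order
sprintf("U+%04X", utf8ToInt(flipped)[10:12])
```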
Yes, this is down to the mechanism of how we calculate glyph positions via systemfonts::shape_string, which seems not to handle RtL text. If the string is passed as a whole block to grid (e.g. by setting keep_straight = TRUE), it renders correctly:
library(ggplot2)
library(geomtextpath)

ggplot(iris, aes(x = Sepal.Length)) +
  geom_textpath(aes(label = "Sarah is \u05e9\u05e8\u05d4 with ש on R"),
                stat = "density", vjust = -0.2, size = 8, fontface = 1,
                keep_straight = TRUE) +
  ylim(0.1, 0.5) +
  annotate("text", label = "expect: Sarah is \u05e9\u05e8\u05d4 with ש on R",
           y = 0.2, x = 5.75)
The easiest way to do this would be to add a dependency on textshaping, which will automatically order characters correctly (the glyphs can be recovered by matching the index column in the output of shape_string), but I prefer the homebrew method using the Unicode package, since it is more flexible for future development.
Alternatively, we could homebrew some simple flipping mechanism for stretches of RTL text, for which you could get some class information like below:
I would like to caution you against re-implementing the BiDi algorithm as there are some hairy cases (see some examples and more examples) though there are some tests that exist.
the glyphs can be recovered by matching the index column in the output of shape_string
The glyphs cannot be recovered that easily when you have combined characters as in #31. I accidentally figured that out when I was investigating whether I could use textshaping to solve both that issue and this one. In theory it could, but reverse engineering the glyphs isn't as straightforward.
I'm more in favour of letting {textshaping} handle shaping text (if we can) than using a homebrew method because there are all kinds of exceptions that somebody else has already figured out, it is less of a hassle to maintain and we can defer responsibility in that department.
The state of that line of investigation is currently as follows. Let's say we have some string that we want to pull through textshaping::shape_text(), and reconstruct (in the real case we'll be extracting position parameters too). This works fine for simple strings.
test <- "Simple string"
reconstruct <- function(string) {
  shape <- textshaping::shape_text(string)$shape
  chars <- strsplit(string, "")[[1]]
  paste0(chars[shape$glyph + 1], collapse = "")
}
reconstruct(test)
#> [1] "Simple string"
This also works fine for RtL text:
test <- "Sarah is \u05e9\u05e8\u05d4 with \u05e9 on R"
reconstruct(test)
#> [1] "Sarah is <U+05D4><U+05E8><U+05E9> with <U+05E9> on R"
However, it seems to break whenever there are compound characters:
test <- "Composed: \u00ea, DeC: e\u0302, \u05aa\u05d0\u05aa"
reconstruct(test) # prints to console as ê, doesn't reprex well
#> [1] "Composed: ê, DeC: e, <U+05D0><U+05D0>"
Created on 2021-12-08 by the reprex package (v2.0.1)
It works slightly better* if we use stringi::stri_split_boundaries(string, type = "character") instead of a regular strsplit(), but it isn't perfect either.
* The first two cases work, the third one also has compound characters now, but also some NAs.
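The reason a plain strsplit() struggles with compound characters can be seen without any shaping at all: the decomposed form is two code points, so splitting on "" yields two elements where a reader sees one character. The base R sketch below illustrates this and then groups combining marks with their preceding base character. The helper `group_marks()` is hypothetical, and its two hard-coded ranges (U+0300-U+036F and the Hebrew marks U+0591-U+05BD) are an assumption for this example; it is not how stringi determines character boundaries.

```r
composed   <- "\u00ea"   # precomposed e with circumflex: one code point
decomposed <- "e\u0302"  # e followed by a combining circumflex: two code points

n_composed   <- length(utf8ToInt(composed))
n_decomposed <- length(utf8ToInt(decomposed))

# Naive clustering: a combining mark joins the cluster of the character
# before it. Only two mark ranges are recognised (assumption, see above).
group_marks <- function(string) {
  cp      <- utf8ToInt(enc2utf8(string))
  is_mark <- (cp >= 0x0300 & cp <= 0x036F) | (cp >= 0x0591 & cp <= 0x05BD)
  cluster <- cumsum(!is_mark)
  vapply(split(cp, cluster), intToUtf8, character(1), USE.NAMES = FALSE)
}

clusters <- group_marks("DeC: e\u0302")
length(clusters)  # one element per visible character
```

A string that starts with a combining mark, like the third test case above, ends up with that mark in a cluster of its own, which hints at why that case is harder.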
I think this is definitely something to explore further, and hopefully we can encapsulate a simple solution nicely inside the measure_text function.
I'm a little apprehensive about introducing a lot of extra overhead (either in terms of processing, extra dependencies or coding effort) for what might be very rare use cases.
there are all kinds of exceptions that somebody else has already figured out
I don't have a great feel for how often a plot really needs a curved label comprising mixed Hebrew and Latin characters, but my guess is we would struggle to find many examples in the wild.
but my guess is we would struggle to find many examples in the wild.
Well, I totally agree with you there; it's been a while since I even saw a plot annotated in my own native tongue.
encapsulate a simple solution nicely inside the measure_text function
Apart from some minor discrepancies (the retrieval of glyphs being one of the major ones), textshaping::shape_text() should theoretically be a sort of drop-in replacement for systemfonts::shape_string().
I'm a little apprehensive about introducing a lot of extra overhead (either in terms of processing, extra dependencies or coding effort) for what might be very rare use cases.
Me too, and that is why it would be great if we can just get this to work with textshaping (meaning we can trade-in systemfonts as a dependency).
Do you think it is a good idea if I post a feature request at textshaping to ask whether they can add the character glyphs?
Interestingly, the docs for textshaping::shape_text say that glyph should be actual glyphs (though it also looks as if the authors have copy / pasted their docs from systemfonts::shape_string). If you look at the code, the line that looks as if it should convert the glyph numbers to characters is commented out. That line certainly doesn't convert the glyph number in the output of the C function to glyphs, so the commenting out is obviously purposeful.
Strangely enough, the index seems to be the actual Unicode code point minus 29, so:
intToUtf8(textshaping::shape_text(string)$shape$index + 29, multiple = TRUE)
might work
It looks like perhaps something like this could work:
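# Map font glyph indices back to a UTF-8 string. The offsets (29 for ASCII,
# a further 787 for two-byte characters) are empirical and font-specific.
index_to_utf8 <- function(ints) {
  nums <- lapply(ints + 29, function(codepoint) {
    # ASCII range: the shifted index is already the code point
    if (codepoint <= 0x7f) return(codepoint)
    # Two-byte range: shift again, then encode as UTF-8 by hand
    if (codepoint <= 0x7ff)
      return(c(bitwOr(bitwShiftR(codepoint + 787, 6), 0xc0),
               bitwOr(bitwAnd(codepoint + 787, 0x3f), 0x80)))
    # Larger values fall through and are dropped (NULL)
  })
  `Encoding<-`(rawToChar(as.raw(unlist(nums))), "UTF-8")
}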
Which seems to do the trick of harvesting the glyphs from the index column in shape_text in the right order (on this example at least):
string <- "Sarah is \u05e9\u05e8\u05d4 with ש on R"
string
#> [1] "Sarah is שרה with ש on R"
index_to_utf8(textshaping::shape_text(string)$shape$index)
#> [1] "Sarah is הרש with ש on R"
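The two offsets can be checked by hand against the index values printed in the shape_text() output earlier in the thread (3, 15, 676, 696 and 697). The snippet below is only a worked example of that arithmetic in base R; the variable names are illustrative, and the offsets are whatever happened to hold for the font used in that reprex, not a property of fonts in general.

```r
# Index values copied from the shape_text() output shown earlier
ascii_index  <- c(3, 15)
hebrew_index <- c(676, 696, 697)

# ASCII: index + 29 gives the code point directly
ascii_cp <- ascii_index + 29
intToUtf8(ascii_cp, multiple = TRUE)   # space and comma

# Hebrew: the extra 787 brings the total offset to 816
hebrew_cp <- hebrew_index + 29 + 787
sprintf("U+%04X", hebrew_cp)
```

The Hebrew values come out as U+05D4, U+05E8 and U+05E9, which are the three letters of the word in visual (reversed) order, matching the flipped glyph column.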
This looks quite promising! I think the docs mention that the index is an index into the font file character table, so I tested all fonts on my machine:
index_to_utf8 <- function(ints) {
  nums <- lapply(ints + 29, function(codepoint) {
    if (codepoint <= 0x7f) return(codepoint)
    if (codepoint <= 0x7ff)
      return(c(bitwOr(bitwShiftR(codepoint + 787, 6), 0xc0),
               bitwOr(bitwAnd(codepoint + 787, 0x3f), 0x80)))
  })
  `Encoding<-`(rawToChar(as.raw(unlist(nums))), "UTF-8")
}
string <- "Sarah is \u05e9\u05e8\u05d4 with ? on R"
# 549 of them
all_fonts <- systemfonts::system_fonts()$path
all_indices <- lapply(all_fonts, function(path) {
  textshaping::shape_text(string, path = path)$shape$index
})
all_strings <- vapply(all_indices, index_to_utf8, character(1))
print(all_strings[[1]]) # Arial (correctly displayed, but not in reprex)
#> [1] "Sarah is <U+05D4><U+05E8><U+05E9> with ? on R"
# Most common cases
head(sort(table(all_strings), decreasing = TRUE))
#> all_strings
#> Sarah is \035\035\035 with ? on R
#> 296
#> \035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035\035
#> 53
#> R`q`g hr \035\035\035 vhsg > nm Q
#> 26
#> 3<U+03E2><U+03F3><U+03E2><U+03E9> <U+03EA><U+03F4> \035\035\035 <U+03F8><U+03EA><U+03F5><U+03E9> <U+05BE> <U+03F0><U+03EF> 2
#> 24
#> 3<U+03B3>t<U+03B3><U+03BA> <U+03BB><U+03C5> \035\035\035 <U+03C9><U+03BB>f<U+03BA> <U+051B> <U+03C1>p 2
#> 17
#> Sarah is <U+05D4><U+05E8><U+05E9> with ? on R
#> 13
Created on 2021-12-09 by the reprex package (v2.0.0)
So the assumptions don't appear to hold for every font. However, we don't need these assumptions to hold for every font, because we can use the shaping once on a font for which we know it works, and then have the glyphs we need to do shaping on the true font. This is essentially shaping the text twice, but I think the trade-off seems good as we no longer measure the text during drawing/window resizing time.
A more robust method would be to get the font table for, say, Arial, and create a lookup table. I found a parseable json version online.
library(jsonlite)
library(textshaping)
url <- paste0("http://zuga.net/articles/",
"unicode-all-characters-supported-by-the-font-arial/",
"d/p/1000-x.json?v=41")
codes <- read_json(url)
codes <- codes[-(1:2)]
codes <- lapply(codes, function(x) x$I)
glyphs <- unlist(sapply(codes, function(y) sapply(y, function(x) x$P)))
index <- unlist(sapply(codes, function(y) sapply(y, function(x) x$G)))
lookup <- character(max(index))  # character vector, since it stores glyph strings
lookup[index] <- intToUtf8(glyphs, multiple = TRUE)
string <- "Sarah is \u05e9\u05e8\u05d4 with \u05e9 on R"
string
#> [1] "Sarah is שרה with ש on R"
paste0(lookup[shape_text(string)$shape$index], collapse = "")
#> [1] "Sarah is הרש with ש on R"
We could have the lookup data frame hard-coded as an R file in the package.
Yes, that does make sense to me. The only caveat I can spot is that not every operating system has the Arial font. For example, a unit test in {systemfonts} also tests different fonts per system. A possible remedy is to include a font ourselves with this package that can act as a reference font. We'd have to carefully read the licence of any font we consider for that purpose, because not all are free to distribute for personal and commercial use.
One candidate is the Roboto font which is relatively popular, has an Apache v2.0 license that (this is no legal advice, I'm no lawyer) appears to be pretty permissive. A precedent for including fonts is in the hrbrthemes package.
Actually, we may be trying to reinvent the wheel here. systemfonts::glyph_info(string)$index gives you the index of each glyph in string, so doing something like systemfonts::glyph_info(intToUtf8(1:64000, multiple = TRUE)) gives you the full Unicode lookup table for glyphs in whichever font you choose.
That seems pretty handy! We might simply cache the result of that until a new font is requested. But it does mean that we can't trade-in systemfonts for textshaping (which I'm ok with).
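A rough sketch of what such caching could look like, in base R. Everything here is hypothetical: `lookup_cache`, `get_lookup()` and the `build_lookup` argument are illustrative names, and `fake_builder()` is a stand-in for whatever actually produces the index-to-character table (for example a call to systemfonts::glyph_info()). This version keeps one table per font rather than discarding the old one when a new font is requested, which is a design choice made for the example and not something settled in the discussion.

```r
# Environment used as a simple key-value store, keyed by font identifier
lookup_cache <- new.env(parent = emptyenv())

# Return the cached table for `font`, building it on first request only.
# `build_lookup` is any function taking a font identifier (hypothetical).
get_lookup <- function(font, build_lookup) {
  if (!exists(font, envir = lookup_cache, inherits = FALSE)) {
    assign(font, build_lookup(font), envir = lookup_cache)
  }
  get(font, envir = lookup_cache, inherits = FALSE)
}

# Stand-in builder that counts how often it is actually invoked
calls <- 0
fake_builder <- function(font) {
  calls <<- calls + 1
  c("3" = " ", "15" = ",")   # toy index-to-character table
}

first  <- get_lookup("some-font.ttf", fake_builder)
second <- get_lookup("some-font.ttf", fake_builder)
calls   # the builder only ran for the first request
```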
We can't trade in systemfonts for textshaping anyway: systemfonts is a dependency of textshaping, so textshaping would be an addition rather than a substitution. If we accept that we want bidi enough that we need the overhead of an extra package, then textshaping is fine, since it has a minimal dependency chain (just cpp11, which doesn't have any strict dependencies of its own).
I'm in favour of adopting textshaping because not only does it do bidi text, it also does composite characters and perhaps other finicky things that I am keen to not have to worry about myself. Also, your investigations make it more feasible than I realized at first.
Alright this should be fixed now, so I'll be closing this.