Comments (19)

wch avatar wch commented on July 21, 2024 3

This seems to work:

RUN echo "en_US.UTF-8 UTF-8" >> /etc/locale.gen \
   && locale-gen en_US.utf8 \
   && /usr/sbin/update-locale LANG=en_US.UTF-8

The starting state for /etc/locale.gen has en_US.UTF-8 commented out, along with all the other entries. Running dpkg-reconfigure locales interactively and selecting en_US.UTF-8 has the same effect as this set of commands, I think.
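
Since the entry is already present but commented out, an alternative to appending is to uncomment it. The following is only a sketch: `enable_locale` is a hypothetical helper (not part of any Debian tool), and it assumes entries look exactly like `# en_US.UTF-8 UTF-8` with no trailing whitespace. It takes the file path as a parameter so it can be tried on a scratch copy rather than the real /etc/locale.gen.

```shell
# Hypothetical helper: enable a locale in a locale.gen-style file.
# Uncomments a matching "# en_US.UTF-8 UTF-8" line if present,
# otherwise appends the entry.
enable_locale() {
    file=$1
    entry=$2    # e.g. "en_US.UTF-8 UTF-8"
    # Escape dots so they match literally in the regular expressions.
    pattern=$(printf '%s' "$entry" | sed 's/\./\\./g')
    if grep -q "^${pattern}\$" "$file"; then
        return 0    # already enabled, nothing to do
    elif grep -q "^#[[:space:]]*${pattern}\$" "$file"; then
        # Write to a temp file and move it into place (avoids non-portable sed -i).
        sed "s/^#[[:space:]]*\(${pattern}\)\$/\1/" "$file" > "$file.tmp" \
            && mv "$file.tmp" "$file"
    else
        printf '%s\n' "$entry" >> "$file"
    fi
}

# Try it on a scratch file that mimics the commented-out starting state.
scratch=$(mktemp)
printf '# de_DE.UTF-8 UTF-8\n# en_US.UTF-8 UTF-8\n' > "$scratch"
enable_locale "$scratch" "en_US.UTF-8 UTF-8"
cat "$scratch"
rm -f "$scratch"
```

In a Dockerfile this would still need to be followed by the locale-gen and update-locale steps shown above; the helper only edits the list of locales to generate.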

Edit: FWIW, I found another Dockerfile that uses a similar strategy: https://registry.hub.docker.com/u/etna/drone-debian/dockerfile/

from rocker.

eddelbuettel avatar eddelbuettel commented on July 21, 2024

Another thing to add to r-base so that it bubbles up.

[ That said, I am a 7-bit snob now and rarely ever set these... But we probably should. ]

cboettig avatar cboettig commented on July 21, 2024

Note, I set locale to C.UTF-8 as in the second example, rather than en_US.UTF-8 as in the first example; and just set the Debian base. (A summary of C.UTF-8 vs en_US.UTF-8 here, but happy for input on which locale @wch had in mind).

wch avatar wch commented on July 21, 2024

I'm not an expert in this stuff, but I think that en_US.UTF-8 would be better, since it defines proper sorting for non-ASCII characters, while C.UTF-8 does not -- it probably just uses the Unicode code point value for sorting.

For example, in en_US.UTF-8, all the a's with accents come before b:

> sort(c('A', 'a', 'Ä', 'ä', 'À', 'à', 'b'))
[1] "a" "A" "à" "À" "ä" "Ä" "b"

But it's not true in C.UTF-8:

> sort(c('A', 'a', 'Ä', 'ä', 'À', 'à', 'b'))
[1] "A" "a" "b" "À" "Ä" "à" "ä"

So I think that, despite the provincial-sounding label, en_US actually supports non-English languages better than C.
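
The same difference can be seen outside R with the `sort` command. This is a rough sketch: the first line uses `LC_ALL=C`, which sorts by byte value (for UTF-8 text that matches code point order), and the second part only runs if en_US.UTF-8 has actually been generated on the system, which is an assumption that may not hold in a minimal image.

```shell
# Byte (code point) order, as in the C locale. The C locale is always available.
printf '%s\n' A a Ä ä À à b | LC_ALL=C sort | tr '\n' ' '; echo

# Language-aware order, only if en_US.UTF-8 exists on this system.
if locale -a 2>/dev/null | grep -qi '^en_US\.utf-\{0,1\}8$'; then
    printf '%s\n' A a Ä ä À à b | LC_ALL=en_US.UTF-8 sort | tr '\n' ' '; echo
else
    echo "en_US.UTF-8 is not installed; skipping the comparison"
fi
```

The first line should reproduce the C.UTF-8 ordering shown above, with the accented letters after b.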

cboettig avatar cboettig commented on July 21, 2024

@wch sounds reasonable to me.

For reasons that are not obvious to me, just switching C.UTF-8 to en_US.UTF-8 in this Dockerfile results in an error:

*** update-locale: Error: invalid locale settings:  LANG=en_US.UTF-8
2014/10/07 20:12:37 The command [/bin/sh -c dpkg-reconfigure locales     && locale-gen en_US.UTF-8     && /usr/sbin/update-locale LANG=en_US.UTF-8] returned a non-zero code: 255

No idea why, en_US.UTF-8 is on the list of locales returned by the command...

eddelbuettel avatar eddelbuettel commented on July 21, 2024

+1 -- I don't think I have ever seen C.UTF-8 in the wild anywhere. Not that I pay much attention though...

eddelbuettel avatar eddelbuettel commented on July 21, 2024

Blech:

root@e5b38b5f638c:/# du -csh /usr/share/locale/
87M     /usr/share/locale/
87M     total
root@e5b38b5f638c:/# 

wch avatar wch commented on July 21, 2024

Doesn't seem so bad when I do it:

$ docker run --rm -ti eddelbuettel/debian-r-base /bin/bash

root@1dbe56be3aa1:/# du -csh /usr/share/locale
43M /usr/share/locale
43M total

root@1dbe56be3aa1:/# apt-get install -qq -y locales

root@1dbe56be3aa1:/# du -csh /usr/share/locale
47M /usr/share/locale
47M total

root@1dbe56be3aa1:/# echo "en_US.UTF-8 UTF-8" >> /etc/locale.gen \
>    && locale-gen en_US.utf8 \
>    && /usr/sbin/update-locale LANG=en_US.UTF-8
Generating locales (this might take a while)...
  en_US.UTF-8... done
Generation complete.

root@1dbe56be3aa1:/# du -csh /usr/share/locale
47M /usr/share/locale
47M total

eddelbuettel avatar eddelbuettel commented on July 21, 2024

I was using the 'drd' (i.e. daily r-devel) image, which has more packages and hence more po files. Anyway, on my home system it is 177 MB so ... that's just a cost of doing business.

I learned something new which may help shrink the image some more.

cboettig avatar cboettig commented on July 21, 2024

Testing: docker run -it rocker/r-base R

> Sys.getlocale(category = "LC_ALL")

[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"

@wch Look good?

cboettig avatar cboettig commented on July 21, 2024

For some reason, the rstudio image (and thus hadleyverse) objects to the locale settings.

The container throws a warning on startup:

$ docker run --rm -it rocker/rstudio R
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)

And likewise R complains as well:

R version 3.1.1 (2014-07-10) -- "Sock it to Me"
...
Type 'q()' to quit R.

During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C" 
2: Setting LC_COLLATE failed, using "C" 
3: Setting LC_TIME failed, using "C" 
4: Setting LC_MESSAGES failed, using "C" 
5: Setting LC_MONETARY failed, using "C" 
6: Setting LC_PAPER failed, using "C" 
7: Setting LC_MEASUREMENT failed, using "C" 

and then defaults to the "C" locale:

> Sys.getlocale(category = "LC_ALL")
[1] "C"
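
One way to narrow down warnings like these is to check whether the requested locale was actually generated inside the image. The cause in this case is not confirmed, so treat this as a diagnostic sketch only. `locale_in_list` is a hypothetical helper; it reads a list of locale names from stdin (for example the output of `locale -a`) and compares names case-insensitively with hyphens removed, since glibc's `locale -a` typically prints `en_US.utf8` for `en_US.UTF-8`.

```shell
# Hypothetical helper: succeed if the named locale appears in the
# list of locale names read from stdin.
locale_in_list() {
    wanted=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -d '-')
    # Normalize the list the same way, then look for an exact line match.
    tr '[:upper:]' '[:lower:]' | tr -d '-' | grep -qxF -- "$wanted"
}

# Usage inside a container (assumes the `locale` command is present).
if command -v locale >/dev/null 2>&1; then
    if locale -a | locale_in_list "en_US.UTF-8"; then
        echo "en_US.UTF-8 is available"
    else
        echo "en_US.UTF-8 is missing; programs will likely fall back to C"
    fi
fi
```

If the locale turns out to be missing in a derived image, the environment variables are asking for something that was never generated there.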

eddelbuettel avatar eddelbuettel commented on July 21, 2024

That rings a bell but I don't quite recall what to do. Should be a generic issue for Debian-based VMs etc. though. Maybe as simple as setting it in /etc/bash.bashrc, or profile or ...

jangorecki avatar jangorecki commented on July 21, 2024

I get the C locale on a recent r-base image.

docker run -it r-base
Sys.getlocale(category = "LC_ALL")
# [1] "C"

According to the discussion here, I should get en_US.UTF-8, so it looks like this issue needs to be reopened.

jangorecki avatar jangorecki commented on July 21, 2024

I used this SO answer to solve that issue on Ubuntu 14.04.

RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8

Sys.getlocale(category = "LC_ALL")
# [1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"

I tried the same on the official Debian r-base, but it throws a lot of locale warnings during the build and in the R console after running. So it cannot be applied directly to Debian.
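
For context on why the snippet sets both LANG and LC_ALL: in the usual POSIX lookup order, LC_ALL overrides every per-category variable, and LANG is only the fallback. The sketch below illustrates that order with a hypothetical helper, `effective_locale`. It only inspects the variables; it does not check whether the named locale exists, and individual programs may apply their own overrides on top.

```shell
# Hypothetical helper: report which value wins for one locale category,
# following the lookup order LC_ALL, then the category variable itself
# (e.g. LC_COLLATE), then LANG, then the "C" default.
effective_locale() {
    category=$1
    # Indirect lookup of the variable named by $category (POSIX sh compatible).
    eval "category_value=\${$category:-}"
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$category_value" ]; then
        printf '%s\n' "$category_value"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf '%s\n' "C"
    fi
}

# With only LANG set, LANG decides; once LC_ALL is set, it wins.
( unset LC_ALL LC_COLLATE; LANG=en_US.UTF-8; effective_locale LC_COLLATE )
( LC_ALL=C; LANG=en_US.UTF-8; effective_locale LC_COLLATE )
```

LANGUAGE is a separate GNU gettext setting that affects message translations rather than the locale categories, so it is left out of the helper.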

cboettig avatar cboettig commented on July 21, 2024

Really? On r-base I see:


> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux stretch/sid

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
> Sys.getenv("LC_ALL")
[1] "en_US.UTF-8"
> Sys.getlocale(category="LC_ALL")
[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
> 

Are you sure you have the latest r-base image? (Not sure what you mean by 'official debian's r-base' or what warnings you're seeing either)

Yes, Ubuntu and Debian set locales differently; both are described in the link above. (And of course the Debian way is also illustrated at the top of the r-base Dockerfile.)

Does anyone else still see the C locale in r-base?

eddelbuettel avatar eddelbuettel commented on July 21, 2024

I get the same as Carl:

$ docker run --rm -ti r-base R -e 'sessionInfo()'

R version 3.2.2 (2015-08-14) -- "Fire Safety"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux stretch/sid

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
> 
> 
$

wch avatar wch commented on July 21, 2024

@cboettig, I get the same result as you.

jangorecki avatar jangorecki commented on July 21, 2024

Heh, I cannot reproduce it anymore... so it was likely some issue on my side, maybe a clashing name with an image I built a while ago.

hakanai avatar hakanai commented on July 21, 2024

So I think that, despite the provincial-sounding label, en_US actually supports non-English languages better than C.

Not sure how you jump to that conclusion.

What about languages where, for instance, "ä" or "å" are supposed to sort after "z"?
