
Comments (14)

cboettig commented on June 25, 2024

@januz yes, the MRAN snapshots are a convenient way to pin the version (you don't need to specify a version when installing from MRAN, since the 'latest' version is already fixed to the date). Both the versioned rocker images and the standard binder R config use this MRAN snapshot configuration.
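As an illustration, the snapshot pinning the versioned rocker images rely on looks roughly like this in a Dockerfile (the base image tag and snapshot date here are only examples):

```dockerfile
# rocker/r-ver images pin options("repos") to an MRAN snapshot matched
# to the R version, so no explicit package version is needed at install time.
FROM rocker/r-ver:3.5.1

# Resolves against the frozen snapshot, not today's CRAN:
RUN Rscript -e "install.packages('dplyr')"

# Equivalently, you can point at a snapshot date yourself (date illustrative):
RUN Rscript -e "install.packages('dplyr', repos = 'https://mran.microsoft.com/snapshot/2018-12-01')"
```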

For system libraries installed by apt-get, things are relatively stable on the rocker-versioned stack as well, since these are always installed from the same release. (Technically these can change in minor ways due to security updates, but the basic version is fixed. Most Linux distros work more like Bioconductor than CRAN: all software in the distro is effectively pinned at a version for the lifespan of that distribution.)

Not trying to deter discussion, but I'm going to mark this as closed, since I believe the OP's question is resolved with the re-triggered builds.

from binder.

cboettig commented on June 25, 2024

@januz Thanks for the bug report and sorry for the trouble. This does indeed sound very weird; I'll have to poke around. It sounds like something funny has happened on the Docker Hub end, if the same Dockerfile builds fine locally. Possibly something in the post-build hook configuration? I'll tickle the hub to rebuild and then poke around.


cboettig commented on June 25, 2024

@januz can you get your image to rebuild on Binder? No idea where this went wrong, but everything seems to be working on my fork of your 'binder-fails' example: https://github.com/cboettig/binder-fails


januz commented on June 25, 2024

@cboettig Yes, indeed. After making a commit to the repo, the container builds successfully on Binder! @betatim's assumption that a cached layer with the old version of nbrsessionproxy is responsible sounds plausible. Thanks for looking into it; I hope you can find a mechanism to prevent this from happening (apparently at random).


cboettig commented on June 25, 2024

If binder builds with docker build --pull, it should always have the latest version, and binder's server should see the same thing we see when we run docker locally (e.g. docker run -p 8888:8888 rocker/binder).
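For reference, the two invocations being compared are (this assumes a local Docker daemon; the image tag for the local build is illustrative):

```shell
# Force-refresh the base image layers rather than reusing a stale cached copy
docker build --pull -t my-binder-repo .

# Run the published image directly, exposing Jupyter on port 8888
docker run -p 8888:8888 rocker/binder
```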


januz commented on June 25, 2024

> binder's server should be seeing the same thing that we see when we just run docker locally (e.g. run docker run -p 8888:8888 rocker/binder).

Hm, I am not 100% sure anymore, but I think that during my tests for the above problems I had the same thing happen to me (i.e., RStudio not opening) when I built/ran my Docker container from Docker Hub. But the same fault (having a cached layer from an earlier build with the outdated nbrsessionproxy version) could have happened locally, I guess.


betatim commented on June 25, 2024

I think if long-term runnability is your goal, the best thing to do is to rebuild and run your image at regular intervals. From watching people use mybinder.org, and from using some of the repos in talks/demos over many months, my takeaway is that it is surprisingly hard to make something that works now and will still work in 6 months. Mostly this comes down to pinning the right kinds of dependencies at the right level.

Keeping the current Docker image is a good start, but if you want to keep open the option of ever re-building it, I'd attempt rebuilding the image once a month or so (via a cron job).
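A minimal sketch of such a cron job (the image name, path, and schedule are hypothetical):

```cron
# Rebuild and push on the 1st of each month at 03:00, pulling fresh base
# layers and skipping the cache so stale layers are never reused.
0 3 1 * * docker build --pull --no-cache -t myuser/compendium /path/to/repo && docker push myuser/compendium
```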

TL;DR: this is really hard :D


cboettig commented on June 25, 2024

Well said; I'm 💯 with Tim on this being a remarkably hard (and remarkably under-appreciated how hard) problem.

Rocker's tagged images (e.g. 3.5.0) are rebuilt once a month; latest is rebuilt daily using a cron job. You can also have CI do this (e.g. Circle-CI will let you set a cron table to rebuild regularly without needing to make a new commit, so you don't need to keep a server running cron all the time). Unless your codebase is very compute-intensive, this lets you confirm that the code still runs, and not just that everything still installs...
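For instance, a CircleCI 2.0 config can schedule a rebuild on a cron table without any new commit (workflow and image names below are illustrative; check CircleCI's docs for current syntax):

```yaml
version: 2
jobs:
  build:
    docker:
      - image: docker:stable
    steps:
      - checkout
      - setup_remote_docker
      # Rebuild from scratch, pulling fresh base layers
      - run: docker build --pull --no-cache -t myuser/compendium .
workflows:
  version: 2
  monthly-rebuild:
    triggers:
      - schedule:
          cron: "0 3 1 * *"   # 1st of each month, 03:00 UTC
          filters:
            branches:
              only: master
    jobs:
      - build
```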


januz commented on June 25, 2024

Thanks to you two for taking the time to investigate and for your tips!

So, if I understand you correctly, the best thing to do if I want long-term usability (which I do, as this is for a reproducible research compendium that might be relevant to somebody some months/years down the road) is to rebuild and try out my Docker image on Docker Hub regularly (ideally without needing a commit to the repo, as @cboettig describes).

But how does this translate to mybinder.org? There, the image is built based on my Dockerfile, not on the image I provide at DockerHub, correct?

Also, how does it translate to reproducibility of the computational environment in general? I was hoping to "pin" the complete environment by using Docker. From what I understand, you say that you can't really pin everything, so a build now is always somehow different from a build in a year or so.


cboettig commented on June 25, 2024

My-Binder and Rocker both try to pin things as well as possible, but nothing is perfect. E.g. R packages come from MRAN snapshots, and Microsoft has done a great job keeping these (though it's hard to externally validate that everything in a snapshot comes from the date claimed); the only failures I've seen are temporary server downtime. Of course, MRAN could vanish in the future. System libraries are pinned by the Linux distro, but can get backported security patches. And of course some aspects of 'reproducibility' are contingent on hardware, which is clearly beyond the scope here.

Right, I believe Binder looks at your Dockerfile and tries to build it, which is a good check on reproducibility. Of course, using "your own" Dockerfile from Binder's perspective means you've taken responsibility for ensuring (or not) that it's a stably reproducible build (e.g. r-base is, intentionally, not stable). Successfully building the Dockerfile is a good check, though of course it doesn't guarantee that your code actually runs. That's why I suggest something like Travis or Circle-CI (both of which can run Docker, so if you want, you can check your code in the identical environment that you get on Binder).

Tim may have more insight on this, so I'd be curious on his take too.


betatim commented on June 25, 2024

Even doing something that is conceptually simple, like "pin all the things", turns out to be tricky to get right if you have a sufficiently large project (I'd say this is the fundamental reason this issue was created :) ).

For example, packageA (which you explicitly pin to version 3) will depend on packageB (which you might not pin because you missed it, or whatnot). Most packages don't specify the exact version of their dependencies; they just say "I need B". So if packageB releases a new version between two builds, your new build will pick it up. In principle that should be fine, unless packageB made some breaking changes (on purpose or accidentally). Maybe we can use something like SemVer to deal with intentional breaking changes.

We could pin everything to exactly the versions we use. This would prevent accidental breakage, probably, once you find everything and pin it all (which takes time, because you won't notice the one thing you missed until a few months later, when it suddenly breaks). But now suppose there is a bug fix in a package you were using. You need to decide whether to update (your result becomes more correct) or not (repeatability: you keep reproducing the result you know is incorrect, but it is the same as it has always been).

A lot of this is only hard because humans are building software and make mistakes in the process. If you only rely on two other things, chances are you won't be caught up in a mistake. However, if you depend on a large stack (everything from matplotlib to the Linux kernel via some Docker magic containerisation stuff), I bet you will be the victim of a mistake being made somewhere :)

Hence, I would set up a monthly (or so) rebuild-and-re-run job. It costs nearly nothing, and at least you get a timely notification when something breaks. The hypothesis is that fixing something close to when it breaks is much easier than trying to fix the accumulation of all breakages over 12 or 24 months. Or you decide that "nope, we won't fix this; it is OK that it is now broken."


betatim commented on June 25, 2024

As a physicist, I assume "spherical cow in a vacuum without friction". Everything is nice and easy to calculate. In reality, cows are a weird shape, there is friction and an atmosphere. Now something that was a nice simple problem you could solve with pen and paper has turned into something requiring complicated numerical approximations.

I see reproducibility a bit like that. In theory it should be simple, in practice there are so many factors that make it more complicated than you first thought :-/


cboettig commented on June 25, 2024

Tim, I think this is great, but maybe overstating the goal slightly. As you note, the real catch in this scenario is wanting to update some part of your stack to a newer version that you didn't actually use, because perhaps some bug was fixed in your software and you want to see if it changed your result. That's an important use case; but it is also very distinct from the use case of "wanting to reproduce your original results in the original environment, bugs and all".


januz commented on June 25, 2024

Thank you two so much for your insights!!

> For example packageA (which you explicitly pin to version 3) will depend on packageB (which you might not pin because you missed it or whatnot). Most packages don't specify the exact version of their dependencies, they just say "I need B". So if between two builds packageB releases a new version your new build will pick that up. Now in principle that should be fine, unless packageB made some breaking changes (on purpose or accidentally). Now maybe we can use something like SemVer to deal with on-purpose breaking changes.

@cboettig At least for the R side of things, that is solvable, correct (at least assuming MRAN works reliably)? All packages that are installed into the Docker image are installed from an MRAN snapshot that is fixed to a specific date by the base image. If one wants to install newer versions of specific packages, I found that there is the risk that @betatim describes if one just uses

RUN Rscript -e "devtools::install_version('package', version = '1.2.3')"

But if one instead specifies the MRAN snapshot a package should be installed from, the dependencies should also be reliably installed from that snapshot, correct? For example:

RUN Rscript -e "devtools::install_version('package', version = '1.2.3', repos = 'https://mran.microsoft.com/snapshot/2018-12-01')"

For everything outside of R (including non-R dependencies of R packages), there is less control, though, as you both point out.

