
pangeo-binder's Introduction

Pangeo's BinderHub

Deployment and configuration documentation files for the public binder.pangeo.io service.

Deployment Status

Branch    Build      Docs                   Deployment
staging   CircleCI   Documentation Status   https://staging.binder.pangeo.io/
prod      CircleCI   Documentation Status   https://binder.pangeo.io/

Binder Monitoring

Monitoring of the binderhubs is done with Prometheus and Grafana.

About Pangeo's Binder

Much like mybinder.org, Pangeo's BinderHub deployment (binder.pangeo.io) allows users to create and share custom computing environments. The main distinction between the two BinderHubs is that Pangeo's BinderHub allows users to perform scalable computations using Dask.

For more information on the Pangeo project, check out the online documentation.

How to Deploy:

This repo is continuously deployed using CircleCI. It can also be deployed from a local computer using the following command:

deployment={staging,prod}  # choose one
helm upgrade --wait --install --namespace=${deployment} ${deployment} \
    pangeo-binder --version=v0.2.0 \
    -f ./deploy/${deployment}.yaml \
    -f ./secrets/${deployment}.yaml

More Information

The setup and configuration of Pangeo's BinderHub largely follows the Zero to BinderHub instructions, with a few key distinctions:

  1. We define our own chart, pangeo-binder, so that we can mix in some Pangeo-specific bits (e.g. RBAC rules for dask-kubernetes). The pangeo-binder chart is based on the binderhub chart, so most settings are nested one level down, under the binderhub key. The mybinder.org project does the same thing in mybinder.org-deploy.
  2. The deployment steps in section 3.4 of the Zero to BinderHub documentation are modified slightly to account for the pangeo-binder chart. The steps are briefly outlined below:
    # get current versions of the chart dependencies
    cd pangeo-binder
    helm dependency update
    
    # install the pangeo-binder chart
    helm install pangeo-binder --version=0.2.0-...  --name=<choose-name> --namespace=<choose-namespace> -f secret.yaml -f config.yaml
    

Additional Links and Related Projects

Testing Binder Updates

We deploy staging and prod BinderHubs. Before re-deploying prod by merging staging into prod, it's a good idea to test the deployment on some real-world examples.

Create an empty pull request against the staging branch with the text test-staging in the commit message.

$ git checkout -b test-staging
$ git commit --allow-empty -m 'test staging'
$ git push -u origin

Once that pull request is opened, the script in .github/workflows/scripts/run_staging.py is run. That will run the latest versions of the notebooks in http://gallery.pangeo.io/ against

  1. The staging binder deployment (e.g. https://staging.binder.pangeo.io/)
  2. The staging Docker image (e.g. https://github.com/pangeo-gallery/default-binder/tree/staging)
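
For local debugging, the core of that script is just driving the BinderHub build endpoint. A minimal sketch of that call (this is not the actual run_staging.py; the org/repo/ref arguments below are placeholders taken from the links above):

import json
import requests

BINDER_URL = "https://staging.binder.pangeo.io"

def launch(org="pangeo-gallery", repo="default-binder", ref="staging"):
    """Stream BinderHub build events until the launch is ready or fails."""
    url = f"{BINDER_URL}/build/gh/{org}/{repo}/{ref}"
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            # the endpoint returns a server-sent-event stream of JSON payloads
            if not line or not line.startswith("data:"):
                continue
            event = json.loads(line[len("data:"):])
            print(event.get("phase"), event.get("message", "").strip())
            if event.get("phase") in ("ready", "failed"):
                return event

if __name__ == "__main__":
    launch()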

pangeo-binder's People

Contributors

andersy005, dsludwig, minrk, mogulano, pangeo-bot, rabernat, salvis2, scottyhq, tomaugspurger, tomyun


pangeo-binder's Issues

Verify that environment's dask-gateway matches binder's dask-gateway

We can get into poor situations when launching an image that's incompatible with the version of dask-gateway-server running on the binder. This happened recently with https://github.com/TomAugspurger/pangeo-dask-gateway where

  1. That was built initially with dask-gateway 0.6.1
  2. We bumped the binder to dask-gateway 0.7.1 (incompatible update)
  3. We tried to launch a binder session with the old image

Two possible solutions

  1. Clear our binder image cache (this doesn't work if the user has pinned dask-gateway).
  2. Insert a hook when spawning a user to check that if Dask Gateway is installed, that it's compatible with the dask-gateway-server running on the binder.

We should be able to hack together a KubeSpawner subclass that does a simple check for the dask-gateway version in the image when we start. We would check that version against the minimum supported version (0.7.1 for now; bump it whenever there's an incompatible upgrade to dask-gateway) and raise an informative error message if there's a mismatch.
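
A minimal sketch of just the version check (the KubeSpawner wiring is the open part; the 0.7.1 floor is the value mentioned above and would be bumped by hand):

from packaging.version import Version

MIN_DASK_GATEWAY = Version("0.7.1")  # bump whenever the server makes an incompatible upgrade

def check_dask_gateway_version():
    """Raise an informative error if the image pins an incompatible dask-gateway."""
    try:
        import dask_gateway
    except ImportError:
        return  # dask-gateway not installed in the image; nothing to check
    installed = Version(dask_gateway.__version__)
    if installed < MIN_DASK_GATEWAY:
        raise RuntimeError(
            f"This image has dask-gateway {installed}, but the binder's "
            f"dask-gateway-server requires at least {MIN_DASK_GATEWAY}. "
            "Please rebuild the image with a newer dask-gateway pinned."
        )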

Drop kube-lego for https

I'm trying to stand up a new binderhub on AWS and have been running into errors such as NET::ERR_CERT_AUTHORITY_INVALID

@jhamman and @TomAugspurger - I think we can get rid of kube-lego right? See jupyterhub/binderhub#1001 and new binderhub docs for enabling HTTPS https://binderhub.readthedocs.io/en/latest/https.html

kube-lego:
  config:
    LEGO_EMAIL: [email protected]
    LEGO_URL: https://acme-v01.api.letsencrypt.org/directory
  rbac:
    create: true
  image:
    tag: 0.1.7

kubectl logs binder-prod-kube-lego-574687f898-gvwpl -n binder-prod

time="2020-03-25T00:16:11Z" level=info msg="process certificate requests for ingresses" context=kubelego
time="2020-03-25T00:16:11Z" level=info msg="no cert associated with ingress" context=ingress_tls name=jupyterhub namespace=binder-prod
time="2020-03-25T00:16:11Z" level=info msg="requesting certificate for hub.aws-test-binder.pangeo.io" context=ingress_tls name=jupyterhub namespace=binder-prod
time="2020-03-25T00:16:11Z" level=error msg="Error while processing certificate requests: 403 urn:acme:error:unauthorized: Account creation on ACMEv1 is disabled. Please upgrade your ACME client to a version that supports ACMEv2 / RFC 8555. See https://community.letsencrypt.org/t/end-of-life-plan-for-acmev1/88430 for details." context=kubelego
time="2020-03-25T00:16:11Z" level=error msg="worker: error processing item, requeuing after rate limit: 403 urn:acme:error:unauthorized: Account creation on ACMEv1 is disabled. Please upgrade your ACME client to a version that supports ACMEv2 / RFC 8555. See https://community.letsencrypt.org/t/end-of-life-plan-for-acmev1/88430 for details." context=kubelego

Also see the PR that did away with kube-lego in zero-to-jupyterhub: jupyterhub/zero-to-jupyterhub-k8s#1539

Update chart to match binderhub

Our chart has grown a bit stale and we're now a few months behind the binderhub chart. It would be good to take advantage of some of the recent improvements upstream, in particular the recent release of the jupyterhub 0.8 chart.

Image renewal

Hi,
I love to work with the binder.pangeo.io on my teaching repo:
https://github.com/obidam/m2poc2018
https://github.com/obidam/m2poc2019
https://github.com/obidam/pyxpcm-examples

However, when I frequently update my repo with new notebooks (or binder settings) during my class preparation, I notice that the cached image takes over and I don't see the updates.

So my questions are:

  • how long is an image cached?
  • is there a mechanism to force an image rebuild (during development, not regular use)?

Thanks for your answer !
gm

New Kubernetes Cluster for binder.pangeo.io

I'd like to transition our current deployment to a new cluster with the following structure.

  • default pool: 2 n1-standard-2 (2 vCPUs, 7.5 GB memory) nodes to cover binderhub/jupyterhub/proxy deployments
  • notebook pool: autoscaling/preemptible [0-N] n1-highmem-4 (4 vCPUs, 26 GB memory) to cover user notebook sessions
  • comp pool: autoscaling/preemptible [0-N] n1-standard-4 (4 vCPUs, 15 GB memory) to cover user sessions that need to scale beyond a single node (e.g. dask-kubernetes).

A few other things we will want to change:

  • Each pool should use SSD boot disks (which provide much faster spin-ups)
  • move to us-central1-b region where the main pangeo deployment is
  • add local ssds to the notebook and comp pools
  • use nodeSelectors for each pool

add pangeo branding to binder.pangeo.io

Currently, the main binderhub page on binder.pangeo.io looks a lot like mybinder.org. You actually have to read the text to see that there are some differences. It would be great to update the layout of the page to highlight the difference and to give pangeo a bigger presence. Visual/web design is not really a skill set I have so any help/ideas from others would be appreciated here.

This repo is where we can override style things: https://github.com/pangeo-data/pangeo-custom-binderhub-templates

KubeCluster error: "Expected a kubernetes.client.V1Pod object, got {'metadata': ..."

When I try to create a KubeCluster in my first test on pangeo-binder, I'm getting:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-6c80d44e9da6> in <module>()
----> 1 cluster = KubeCluster(n_workers=10)

/srv/conda/lib/python3.6/site-packages/dask_kubernetes/core.py in __init__(self, pod_template, name, namespace, n_workers, host, port, env, **kwargs)
    181                            **os.environ)
    182 
--> 183         self.pod_template = clean_pod_template(pod_template)
    184         # Default labels that can't be overwritten
    185         self.pod_template.metadata.labels['dask.pydata.org/cluster-name'] = name

/srv/conda/lib/python3.6/site-packages/dask_kubernetes/objects.py in clean_pod_template(pod_template)
    193                'If trying to pass a dictionary specification then use '
    194                'KubeCluster.from_dict')
--> 195         raise TypeError(msg % str(pod_template))
    196 
    197     pod_template = copy.deepcopy(pod_template)

TypeError: Expected a kubernetes.client.V1Pod object, got {'metadata': None, 'spec': {'restartPolicy': 'Never', 'containers': [{'args': ['dask-worker', '--nthreads', '2', '--no-bokeh', '--memory-limit', '6GB', '--death-timeout', '60'], 'image': 'gcr.io/pangeo-181919/pangeo-binderrsignell-2dusgs-2dadcirc-5fcloud-ef2345:cf56eda50c89f5030fee8ecc8c50963838daae38', 'name': 'dask-worker', 'resources': {'limits': {'cpu': '1.75', 'memory': '2G'}, 'requests': {'cpu': '1.75', 'memory': '2G'}}}]}}If trying to pass a dictionary specification then use KubeCluster.from_dict

@jhamman, should I be poking at this, or is it too soon?
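
For reference, the dictionary entry point that the error message points at looks roughly like this (the image and resource values below are placeholders, not the binder's actual worker template):

from dask_kubernetes import KubeCluster

pod_spec = {
    "metadata": {},
    "spec": {
        "restartPolicy": "Never",
        "containers": [
            {
                "name": "dask-worker",
                "image": "daskdev/dask:latest",  # placeholder image
                "args": ["dask-worker", "--nthreads", "2",
                         "--memory-limit", "6GB", "--death-timeout", "60"],
                "resources": {
                    "limits": {"cpu": "1.75", "memory": "6G"},
                    "requests": {"cpu": "1.75", "memory": "6G"},
                },
            }
        ],
    },
}

cluster = KubeCluster.from_dict(pod_spec)  # dict specs go through from_dict
cluster.scale(10)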

[DOC] Builds are failing

Latest example: https://readthedocs.org/projects/pangeo-binder/builds/10596649/

Running Sphinx v1.6.5
making output directory...
loading translations [en]... done
Adding copy buttons to code blocks...

Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/user_builds/pangeo-binder/envs/staging/lib/python3.7/site-packages/sphinx/cmdline.py", line 305, in main
    opts.warningiserror, opts.tags, opts.verbosity, opts.jobs)
  File "/home/docs/checkouts/readthedocs.org/user_builds/pangeo-binder/envs/staging/lib/python3.7/site-packages/sphinx/application.py", line 196, in __init__
    self.setup_extension(extension)
  File "/home/docs/checkouts/readthedocs.org/user_builds/pangeo-binder/envs/staging/lib/python3.7/site-packages/sphinx/application.py", line 456, in setup_extension
    self.registry.load_extension(self, extname)
  File "/home/docs/checkouts/readthedocs.org/user_builds/pangeo-binder/envs/staging/lib/python3.7/site-packages/sphinx/registry.py", line 207, in load_extension
    metadata = mod.setup(app)
  File "/home/docs/checkouts/readthedocs.org/user_builds/pangeo-binder/envs/staging/lib/python3.7/site-packages/sphinx_copybutton/__init__.py", line 25, in setup
    app.add_js_file('clipboard.min.js')
AttributeError: 'Sphinx' object has no attribute 'add_js_file'

Exception occurred:
  File "/home/docs/checkouts/readthedocs.org/user_builds/pangeo-binder/envs/staging/lib/python3.7/site-packages/sphinx_copybutton/__init__.py", line 25, in setup
    app.add_js_file('clipboard.min.js')
AttributeError: 'Sphinx' object has no attribute 'add_js_file'
The full traceback has been saved in /tmp/sphinx-err-2doxmucw.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!

pangeo-cloud-federation pods ended up here?

It looks like someone (most likely me) deployed pangeo-cloud-federation resources to this binder cluster.


Is that correct? We shouldn't have any namespaces other than prod and staging on the binder k8s cluster, right?

If I can get a +1 to that, I'll delete those namespaces.

Dask Permission error on Amazon Binder

When trying to run this Binder session: https://hub.aws-uswest2-binder.pangeo.io/user/ncar-cesm-lens-aws-uvea1vit/lab

I am getting this error when trying to create a Dask cluster:

Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py:623> exception=ApiException()>
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/deploy/spec.py", line 50, in _
    await self.start()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_kubernetes/core.py", line 78, in start
    raise e
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_kubernetes/core.py", line 69, in start
    self.namespace, self.pod_template
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/kubernetes_asyncio/client/api_client.py", line 166, in __call_api
    _request_timeout=_request_timeout)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/kubernetes_asyncio/client/rest.py", line 230, in POST
    body=body))
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/kubernetes_asyncio/client/rest.py", line 181, in request
    raise ApiException(http_resp=r)
kubernetes_asyncio.client.rest.ApiException: (403)
Reason: Forbidden
HTTP response headers: <CIMultiDictProxy('Audit-Id': '93439412-c66a-42b7-8069-c10509eadf62', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Mon, 04 May 2020 19:45:28 GMT', 'Content-Length': '277')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods is forbidden: User \"system:serviceaccount:prod:pangeo\" cannot create resource \"pods\" in API group \"\" in the namespace \"prod\"","reason":"Forbidden","details":{"kind":"pods"},"code":403}

Is this a simple auth problem, or some other issue entirely? If this is the wrong place to report the problem, apologies.

To reproduce the problem, click the following link and try running the kay-et-al-2015-v3.ipynb notebook:

https://aws-uswest2-binder.pangeo.io/v2/gh/NCAR/cesm-lens-aws/master?urlpath=lab

Move aws-uswest2-binder.pangeo.io config to this repo

@jhamman @TomAugspurger @salvis2 - I'm going to create a PR to integrate the AWS binder configuration into this repository, to keep it in sync with the deployment on Google. I'll basically follow the pangeo-cloud-federation setup, so a few things need to happen:

  • add deploy-aws/ and secrets-aws/ folders (later could refactor structure and ci to use hubploy?)
  • add aws secrets with git-crypt
  • add aws-deploy steps to circleci
  • create a production aws deployment (have just been running on a single 'staging' deployment to-date)

Workspace loading hang

Pangeo+Binder=SuperAwesome

I've used it successfully for a couple of notebook demonstrations. I've been trying it out in the JupyterLab configuration, but once the container is up and running I get the following dialog box, which never clears:


I am using it with the following repo:
https://github.com/informatics-lab/windshear_hero_demo/tree/binder

At the following Pangeo Binder url:
https://binder.pangeo.io/v2/gh/informatics-lab/windshear_hero_demo/binder?urlpath=lab/tree/wind_shear.ipynb

Any advice?

Production deployment is down

After merging #47 the production deployment has gone down. The current pod listing is:

$ kubectl get pods --namespace prod --output wide
NAME                                                           READY     STATUS             RESTARTS   AGE       IP            NODE                                    NOMINATED NODE
binder-d6c8bbf58-59s9j                                         1/1       Running            0          9h        10.50.43.35   gke-binder-default-pool-109da6b7-387f   <none>
hub-bd6fd466c-qjkzh                                            1/1       Running            0          9h        10.50.43.37   gke-binder-default-pool-109da6b7-387f   <none>
jupyter-pangeo-2ddata-2dpan-2d-5focean-5fexamples-2dpg26eu8d   1/1       Running            0          54m       10.49.48.36   gke-binder-jupyter-pool-6f85590c-b7wr   <none>
jupyter-xgcm-2dxgcm-5fexamples-2dnz5o013j                      1/1       Running            0          1h        10.49.48.35   gke-binder-jupyter-pool-6f85590c-b7wr   <none>
prod-dind-96vw6                                                0/1       CrashLoopBackOff   116        9h        10.49.48.26   gke-binder-jupyter-pool-6f85590c-b7wr   <none>
prod-image-cleaner-8xw9d                                       0/1       CrashLoopBackOff   120        9h        10.49.48.27   gke-binder-jupyter-pool-6f85590c-b7wr   <none>
prod-kube-lego-574f9948c6-vxjt8                                1/1       Running            0          23d       10.50.43.4    gke-binder-default-pool-109da6b7-387f   <none>
prod-nginx-ingress-controller-589f47d979-6rscr                 1/1       Running            0          9h        10.49.48.29   gke-binder-jupyter-pool-6f85590c-b7wr   <none>
prod-nginx-ingress-controller-589f47d979-xszqv                 1/1       Running            0          9h        10.49.48.28   gke-binder-jupyter-pool-6f85590c-b7wr   <none>
prod-nginx-ingress-default-backend-66b8f956d6-f4g56            1/1       Running            0          22d       10.50.44.12   gke-binder-default-pool-109da6b7-3gzs   <none>
proxy-5487465f49-97xpp                                         1/1       Running            0          9h        10.50.43.36   gke-binder-default-pool-109da6b7-387f   <none>

Details from the dind logs:

$ kubectl logs --namespace prod prod-dind-96vw6
time="2019-05-01T15:39:33.431943017Z" level=warning msg="could not change group /var/run/dind/prod/docker.sock/docker.sock to docker: group docker not found"
time="2019-05-01T15:39:33.433283517Z" level=info msg="libcontainerd: started new containerd process" pid=25
time="2019-05-01T15:39:33.433433078Z" level=info msg="parsed scheme: \"unix\"" module=grpc
time="2019-05-01T15:39:33.433488005Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
time="2019-05-01T15:39:33.433593327Z" level=info msg="ccResolverWrapper: sending new addresses to cc: [{unix:///var/run/docker/containerd/containerd.sock 0  }]" module=grpc
time="2019-05-01T15:39:33.433649071Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
time="2019-05-01T15:39:33.433742959Z" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc4201956e0, CONNECTING" module=grpc
time="2019-05-01T15:39:33.455220961Z" level=info msg="starting containerd" revision=9754871865f7fe2f4e74d43e2fc7ccd237edcbce version=v1.2.2
time="2019-05-01T15:39:33.455703260Z" level=info msg="loading plugin "io.containerd.content.v1.content"..." type=io.containerd.content.v1
time="2019-05-01T15:39:33.455760366Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.btrfs"..." type=io.containerd.snapshotter.v1
time="2019-05-01T15:39:33.456024013Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.btrfs" error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.btrfs must be a btrfs filesystem to be used with the btrfs snapshotter"
time="2019-05-01T15:39:33.456084268Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.aufs"..." type=io.containerd.snapshotter.v1
time="2019-05-01T15:39:33.462416022Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.aufs" error="modprobe aufs failed: "ip: can't find device 'aufs'\nmodprobe: can't change directory to '/lib/modules': No such file or directory\n": exit status 1"
time="2019-05-01T15:39:33.462469424Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.native"..." type=io.containerd.snapshotter.v1
time="2019-05-01T15:39:33.462544329Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.overlayfs"..." type=io.containerd.snapshotter.v1
time="2019-05-01T15:39:33.462642643Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.zfs"..." type=io.containerd.snapshotter.v1
time="2019-05-01T15:39:33.462881282Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.zfs" error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter"
time="2019-05-01T15:39:33.462913517Z" level=info msg="loading plugin "io.containerd.metadata.v1.bolt"..." type=io.containerd.metadata.v1
time="2019-05-01T15:39:33.462927137Z" level=warning msg="could not use snapshotter btrfs in metadata plugin" error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.btrfs must be a btrfs filesystem to be used with the btrfs snapshotter"
time="2019-05-01T15:39:33.462952533Z" level=warning msg="could not use snapshotter aufs in metadata plugin" error="modprobe aufs failed: "ip: can't find device 'aufs'\nmodprobe: can't change directory to '/lib/modules': No such file or directory\n": exit status 1"
time="2019-05-01T15:39:33.462972241Z" level=warning msg="could not use snapshotter zfs in metadata plugin" error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter"
time="2019-05-01T15:39:43.433833795Z" level=error msg="failed connecting to containerd" error="failed to dial \"/var/run/docker/containerd/containerd.sock\": context deadline exceeded" module=libcontainerd
time="2019-05-01T15:39:43.534327107Z" level=info msg="killing and restarting containerd" module=libcontainerd pid=25
time="2019-05-01T15:39:43.535051235Z" level=info msg="=== BEGIN goroutine stack dump ===
goroutine 25 [running]:
github.com/containerd/containerd/cmd/containerd/command.dumpStacks()
	/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/cmd/containerd/command/main_unix.go:78 +0x8c
github.com/containerd/containerd/cmd/containerd/command.handleSignals.func1(0xc4202e4180, 0xc4202e4120, 0x190aba0, 0xc4200ca010, 0xc420090180)
	/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/cmd/containerd/command/main_unix.go:53 +0x274
created by github.com/containerd/containerd/cmd/containerd/command.handleSignals
	/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/cmd/containerd/command/main_unix.go:43 +0x8b

goroutine 1 [sleep]:
time.Sleep(0x2faf080)
/usr/local/go/src/runtime/time.go:102 +0x16c
github.com/containerd/containerd/vendor/go.etcd.io/bbolt.flock(0xc4203f4f00, 0x1, 0x0, 0x1a4, 0xc4203b6330)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/bolt_unix.go:40 +0x88
github.com/containerd/containerd/vendor/go.etcd.io/bbolt.Open(0xc420121400, 0x48, 0x1a4, 0x20cf680, 0xc4200f1048, 0x1, 0x0)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/db.go:199 +0x165
github.com/containerd/containerd/services/server.LoadPlugins.func2(0xc420370c40, 0xc42025b4a0, 0x21, 0xc4200422a0, 0x1e)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/services/server/server.go:264 +0x49d
github.com/containerd/containerd/plugin.(*Registration).Init(0xc4200d14f0, 0xc420370c40, 0xc4200d14f0)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/plugin/plugin.go:100 +0x3a
github.com/containerd/containerd/services/server.New(0x190aba0, 0xc4200ca010, 0xc420354000, 0x1, 0xc4204e1c80, 0x0)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/services/server/server.go:120 +0x557
github.com/containerd/containerd/cmd/containerd/command.App.func1(0xc420352000, 0xc420352000, 0xc4204e1d07)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/cmd/containerd/command/main.go:141 +0x67e
github.com/containerd/containerd/vendor/github.com/urfave/cli.HandleAction(0x16f0900, 0x18e7090, 0xc420352000, 0xc4202e40c0, 0x0)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/vendor/github.com/urfave/cli/app.go:502 +0xca
github.com/containerd/containerd/vendor/github.com/urfave/cli.(*App).Run(0xc42034a000, 0xc4200d00f0, 0x5, 0x5, 0x0, 0x0)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/vendor/github.com/urfave/cli/app.go:268 +0x60e
main.main()
github.com/containerd/containerd/cmd/containerd/main.go:33 +0x51

goroutine 19 [syscall]:
os/signal.signal_recv(0x18fa520)
/usr/local/go/src/runtime/sigqueue.go:139 +0xa8
os/signal.loop()
/usr/local/go/src/os/signal/signal_unix.go:22 +0x24
created by os/signal.init.0
/usr/local/go/src/os/signal/signal_unix.go:28 +0x43

goroutine 20 [chan receive]:
github.com/containerd/containerd/vendor/github.com/golang/glog.(*loggingT).flushDaemon(0x20aa9a0)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/vendor/github.com/golang/glog/glog.go:879 +0x8d
created by github.com/containerd/containerd/vendor/github.com/golang/glog.init.0
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/vendor/github.com/golang/glog/glog.go:410 +0x205

goroutine 26 [select, locked to thread]:
runtime.gopark(0x18e9ed0, 0x0, 0x115f5ff, 0x6, 0x18, 0x1)
/usr/local/go/src/runtime/proc.go:291 +0x120
runtime.selectgo(0xc4203bbf50, 0xc420090240)
/usr/local/go/src/runtime/select.go:392 +0xe56
runtime.ensureSigM.func1()
/usr/local/go/src/runtime/signal_unix.go:549 +0x1f6
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:2361 +0x1

goroutine 11 [select]:
github.com/containerd/containerd/vendor/github.com/docker/go-events.(*Broadcaster).run(0xc4200d1540)
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/vendor/github.com/docker/go-events/broadcast.go:117 +0x3c4
created by github.com/containerd/containerd/vendor/github.com/docker/go-events.NewBroadcaster
/tmp/tmp.y5M0SKgnNM/src/github.com/containerd/containerd/vendor/github.com/docker/go-events/broadcast.go:39 +0x1b1

=== END goroutine stack dump ==="
time="2019-05-01T15:39:43.636338550Z" level=error msg="containerd did not exit successfully" error="signal: killed" module=libcontainerd
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1525f50]

goroutine 26 [running]:
github.com/docker/docker/vendor/github.com/containerd/containerd.(*Client).Close(0x0, 0x0, 0x0)
/go/src/github.com/docker/docker/vendor/github.com/containerd/containerd/client.go:536 +0x30
github.com/docker/docker/libcontainerd/supervisor.(*remote).monitorDaemon(0xc4207081a0, 0x26a5600, 0xc420682fc0)
/go/src/github.com/docker/docker/libcontainerd/supervisor/remote_daemon.go:321 +0x262
created by github.com/docker/docker/libcontainerd/supervisor.Start
/go/src/github.com/docker/docker/libcontainerd/supervisor/remote_daemon.go:90 +0x3fa

Details from the image cleaner pod:

$ kubectl logs --namespace prod prod-image-cleaner-8xw9d

2019-05-01 15:40:05,798 Pruning docker images when /var/lib/docker has 80.0% inodes or blocks used
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.6/http/client.py", line 964, in send
    self.connect()
  File "/usr/local/lib/python3.6/site-packages/docker/transport/unixconn.py", line 46, in connect
    sock.connect(self.unix_socket)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 367, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.6/site-packages/urllib3/packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 354, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/lib/python3.6/http/client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.6/http/client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.6/http/client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.6/http/client.py", line 1026, in _send_output
self.send(msg)
File "/usr/local/lib/python3.6/http/client.py", line 964, in send
self.connect()
File "/usr/local/lib/python3.6/site-packages/docker/transport/unixconn.py", line 46, in connect
sock.connect(self.unix_socket)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionRefusedError(111, 'Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 171, in _retrieve_server_version
return self.version(api_version=False)["ApiVersion"]
File "/usr/local/lib/python3.6/site-packages/docker/api/daemon.py", line 179, in version
return self._result(self._get(url), json=True)
File "/usr/local/lib/python3.6/site-packages/docker/utils/decorators.py", line 46, in inner
return f(self, *args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 194, in _get
return self.get(url, **self._set_request_timeout(kwargs))
File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 537, in get
return self.request('GET', url, **kwargs)
File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 524, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 637, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.6/site-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionRefusedError(111, 'Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/bin/image-cleaner.py", line 198, in
main()
File "/usr/local/bin/image-cleaner.py", line 121, in main
client = docker.from_env(version='auto')
File "/usr/local/lib/python3.6/site-packages/docker/client.py", line 81, in from_env
**kwargs_from_env(**kwargs))
File "/usr/local/lib/python3.6/site-packages/docker/client.py", line 38, in init
self.api = APIClient(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 154, in init
self._version = self._retrieve_server_version()
File "/usr/local/lib/python3.6/site-packages/docker/api/client.py", line 179, in _retrieve_server_version
'Error while fetching server API version: {0}'.format(e)
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', ConnectionRefusedError(111, 'Connection refused'))

@yuvipanda - does this still look like staging/prod are racing for the dind socket?

AWS build failing

https://app.circleci.com/pipelines/github/pangeo-data/pangeo-binder/136/workflows/6ffd20d5-9965-45bc-b357-bbd263c4756d/jobs/144

#!/bin/bash -eo pipefail
aws eks update-cluster-config --name pangeo-binder --resources-vpc-config publicAccessCidrs=${AWS_IP_WHITELIST} > /dev/null



An error occurred (InvalidParameterException) when calling the UpdateClusterConfig operation: Cluster is already at the desired configuration with endpointPrivateAccess: true , endpointPublicAccess: true, and Public Endpoint Restrictions: [*************]


Exited with code exit status 255

CircleCI received exit code 255

cc @scottyhq

Optimizing nodegroup instance types and resource requests across Cloud providers

As discussed in #141, there are a number of things to consider for the nodegroup instance types that run the various binderhub components. For starters: cost/hour, CPU, RAM, and the max number of pods.

I don't think there is a 1:1 mapping of instance specs across cloud providers, but it would be good to check in on these to simplify common config and resource requests.

Here is a simplified table for the current AWS config (https://github.com/pangeo-data/pangeo-binder/tree/staging/k8s-aws):

nodegroup   min size   max size   desired capacity   instance type   vCPU   RAM (GB)   max pods
core        1          1          1                  t3.xlarge       4      16         58
scheduler   0          20         0                  t3.medium       2      4          17
user        0          10         0                  m5.2xlarge      8      32         58
worker      0          10         0                  r5.2xlarge      8      64         58

@TomAugspurger @jhamman - what does this table look like for GCP?

For starters, it seems we can converge on the core node requests and limits for binderhub:

Non-terminated Pods:         (30 in total)
  Namespace                  Name                                                      CPU Requests  CPU Limits   Memory Requests  Memory Limits  AGE
  ---------                  ----                                                      ------------  ----------   ---------------  -------------  ---
  cert-manager               cert-manager-6b78b7c997-r5lpd                             10m (0%)      0 (0%)       32Mi (0%)        0 (0%)         23m
  cert-manager               cert-manager-cainjector-54c4796c5d-zrpkk                  0 (0%)        0 (0%)       0 (0%)           0 (0%)         23m
  cert-manager               cert-manager-webhook-77ccf5c8b4-n4g9c                     0 (0%)        0 (0%)       0 (0%)           0 (0%)         23m
  kube-system                aws-node-cml2p                                            10m (0%)      0 (0%)       0 (0%)           0 (0%)         24m
  kube-system                aws-node-termination-handler-bp582                        50m (2%)      100m (5%)    64Mi (0%)        128Mi (1%)     24m
  kube-system                cluster-autoscaler-78fb96cfd5-6cghx                       100m (5%)     100m (5%)    300Mi (3%)       300Mi (3%)     23m
  kube-system                coredns-86d5cbb4bd-5pzb7                                  100m (5%)     0 (0%)       70Mi (0%)        170Mi (2%)     23m
  kube-system                coredns-86d5cbb4bd-f4cb5                                  100m (5%)     0 (0%)       70Mi (0%)        170Mi (2%)     23m
  kube-system                kube-proxy-zdcpj                                          100m (5%)     0 (0%)       0 (0%)           0 (0%)         24m
  prod                       api-prod-dask-gateway-8f6b4896d-nhjwh                     0 (0%)        0 (0%)       0 (0%)           0 (0%)         23m
  prod                       binder-6f55f6c6c6-s9lwd                                   200m (10%)    0 (0%)       2Gi (26%)        0 (0%)         20m
  prod                       controller-prod-dask-gateway-64ff4648c-9tln7              0 (0%)        0 (0%)       0 (0%)           0 (0%)         20m
  prod                       hub-545cfd67c8-r5hm7                                      200m (10%)    1250m (62%)  1Gi (13%)        3Gi (39%)      20m
  prod                       prod-kube-lego-5dffd7fdc-mqjqh                            0 (0%)        0 (0%)       0 (0%)           0 (0%)         20m
  prod                       prod-nginx-ingress-controller-8d4c768fd-zqq9l             250m (12%)    500m (25%)   240Mi (3%)       240Mi (3%)     20m
  prod                       prod-nginx-ingress-default-backend-fc4bb9c47-dx5z7        0 (0%)        0 (0%)       0 (0%)           0 (0%)         20m
  prod                       proxy-745d9f9645-lclss                                    250m (12%)    500m (25%)   256Mi (3%)       512Mi (6%)     20m
  prod                       traefik-prod-dask-gateway-7b8ddc7cf-2z7cp                 0 (0%)        0 (0%)       0 (0%)           0 (0%)         20m
  prod                       user-scheduler-d8dbb46c8-mmm2z                            50m (2%)      0 (0%)       256Mi (3%)       0 (0%)         20m
  staging                    api-staging-dask-gateway-76c6bd8848-w2q9k                 0 (0%)        0 (0%)       0 (0%)           0 (0%)         2m2s
  staging                    binder-6b757d85d8-jw2kv                                   50m (2%)      0 (0%)       100Mi (1%)       0 (0%)         2m2s
  staging                    controller-staging-dask-gateway-76b4d4dbc8-ndshx          0 (0%)        0 (0%)       0 (0%)           0 (0%)         2m2s
  staging                    hub-6c5c76dc9f-448fs                                      50m (2%)      1250m (62%)  100Mi (1%)       1Gi (13%)      2m2s
  staging                    proxy-6fb5f57d5c-4fqzk                                    250m (12%)    500m (25%)   256Mi (3%)       512Mi (6%)     2m2s
  staging                    staging-dind-c7tvn                                        0 (0%)        0 (0%)       0 (0%)           0 (0%)         2m2s
  staging                    staging-image-cleaner-27kjv                               0 (0%)        0 (0%)       0 (0%)           0 (0%)         2m2s
  staging                    staging-kube-lego-5f847b9787-dfqbs                        0 (0%)        0 (0%)       0 (0%)           0 (0%)         2m2s
  staging                    staging-nginx-ingress-default-backend-78c77d877f-gsntx    0 (0%)        0 (0%)       0 (0%)           0 (0%)         2m2s
  staging                    traefik-staging-dask-gateway-6c7dd5645c-7fbql             0 (0%)        0 (0%)       0 (0%)           0 (0%)         2m2s
  staging                    user-scheduler-66fc777965-mrb9r                           50m (2%)      0 (0%)       256Mi (3%)       0 (0%)         2m2s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1820m (91%)   4200m (210%)
  memory                      5072Mi (65%)  6128Mi (78%)
  ephemeral-storage           0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0

CircleCI Timeouts

We've been getting periodic timeouts in the helm upgrade ... step on CircleCI. The same command completes quickly locally. I'm not sure what is going on, but the CircleCI badges are now out of sync with the status of the production and staging deployments.

Use pangeo-bot to update dependencies

@charlesbluca set up @pangeo-bot at the #pangeo2019 sprint. This bot mimics the functionality of @henchbot, creating pull requests to this repo every time there is a new version of repo2docker or BinderHub available. This seems to work well (#69).

This issue is meant to:

  1. coordinate the final implementation of the bot, notably the Heroku deployment we were working on at the end of the sprint.
  2. brainstorm other things we want this bot to do (for example, it may be nice if the bot automatically opened a new PR to prod after each merge to staging)
  3. thank @charlesbluca for his help on this
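
Roughly the kind of upstream check the bot performs, as a sketch (not the bot's actual code; it assumes the public JupyterHub helm chart index as the source of truth for BinderHub chart releases):

import requests
import yaml

INDEX_URL = "https://jupyterhub.github.io/helm-chart/index.yaml"

def latest_binderhub_chart():
    """Return the version of the newest binderhub chart in the JupyterHub helm repo."""
    index = yaml.safe_load(requests.get(INDEX_URL).text)
    newest = max(index["entries"]["binderhub"], key=lambda entry: entry["created"])
    return newest["version"]

print("latest binderhub chart:", latest_binderhub_chart())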

user-pool not scaling down completely


Following the recent cluster updates and associated scheduling changes (affinity/taints/tolerations), most things in the cluster seem to be working really well. One issue still seems to be present: the user-pool doesn't scale to zero when not in use. Here's an example from a few minutes ago where a node hasn't scaled down despite hours of inactivity.

$ kubectl get pods --all-namespaces --output wide | grep user-pool
kube-system   fluentd-gcp-v3.1.1-p4g8t                                    2/2     Running   0          2d7h    10.128.0.64    gke-binder-user-pool-45fded5e-qlnh   <none>
kube-system   kube-proxy-gke-binder-user-pool-45fded5e-qlnh               1/1     Running   0          2d7h    10.128.0.64    gke-binder-user-pool-45fded5e-qlnh   <none>
kube-system   stackdriver-metadata-agent-cluster-level-8449cbd999-p667l   1/1     Running   1          2d7h    10.49.224.3    gke-binder-user-pool-45fded5e-qlnh   <none>
prod          prod-dind-4s2m7                                             1/1     Running   0          42h     10.49.224.29   gke-binder-user-pool-45fded5e-qlnh   <none>
prod          prod-image-cleaner-lqh4z                                    1/1     Running   0          42h     10.49.224.30   gke-binder-user-pool-45fded5e-qlnh   <none>
staging       staging-dind-dqr4h                                          1/1     Running   0          42h     10.49.224.26   gke-binder-user-pool-45fded5e-qlnh   <none>
staging       staging-image-cleaner-c5dl9                                 1/1     Running   0          42h     10.49.224.27   gke-binder-user-pool-45fded5e-qlnh   <none>

So there are three system pods, and dind/cleaner daemonset pods for each namespace deployment. My understanding is that none of these should keep the node alive in the absence of a notebook/build pod. @consideRatio, am I misinterpreting the scheduling behavior?

scheduling:
  userScheduler:
    enabled: true
    replicas: 2
  podPriority:
    enabled: true
  userPlaceholder:
    enabled: false
  userPods:
    nodeAffinity:
      matchNodePurpose: require

New DockerHub Image retention policies will delete unused images after 6 months

I think only the AWS binderhub is using DockerHub for storage (instead of AWS ECR), but I'm cross-posting this issue here.

Cross-posting pangeo-data/pangeo-docker-images#121

Given the new policies, if a binder link isn't used for 6 months, the linked image on DockerHub will be deleted, eventually forcing a rebuild. For binders using a Dockerfile this will simply fail; for binders using repo2docker environment.yml files, this will force a rebuild.

Problem downloading notebook from JupyterLab on pangeo binder

@marisolgr and I are showing a jupyter notebook at the PICES meeting on 10/24 (Thursday). We would like to show it using the pangeo binder for a number of reasons. When I run the image, it opens up a JupyterLab environment with the example we want to show, which creates some figures and data files. Everything runs, no problem, but we are getting an error when trying to download any file. I think it may have something to do with this: jupyter/notebook#4541. To download a file I just right-click on it and select ‘download’. I then get a Jupyter error ‘Blocking request from unknown origin’.

Here is a link to the pangeo binder image:
https://binder.pangeo.io/v2/gh/python4oceanography/PICES-tools/master

(I can download if I create the image using mybinder, but I’d rather use pangeo)

(Posted here at request of @rabernat)
Any help appreciated. Thanks, Chelle

Use of pangeo-binder and ocean.pangeo.io in the classroom

hi all,
I'm not sure if this is the right place to be asking this,
I'm going to give an Ocean Data Science class in January (30 students):

Are there any restrictions or preparations needed for that specific and planned increase in computational resources?
(I just want to make sure this is ok with you guys and that it's going to work as expected)

AWS BinderHub deploy failed on staging

From https://app.circleci.com/pipelines/github/pangeo-data/pangeo-binder/154/workflows/242b4d43-abb5-4b07-b5fa-518ec8e7a09f/jobs/161, which was deploying #159

#!/bin/bash -eo pipefail
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.38/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagers.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.38/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.38/example/prometheus-operator-crd/monitoring.coreos.com_prometheuses.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.38/example/prometheus-operator-crd/monitoring.coreos.com_prometheusrules.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.38/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.38/example/prometheus-operator-crd/monitoring.coreos.com_thanosrulers.yaml
helm upgrade --wait --install \
  ${CIRCLE_BRANCH} pangeo-binder \
  --namespace=${CIRCLE_BRANCH} --version=v0.2.0 \
  -f ./deploy-aws/${CIRCLE_BRANCH}.yaml \
  -f ./secrets-aws/${CIRCLE_BRANCH}.yaml
#helm history ${CIRCLE_BRANCH} -n ${CIRCLE_BRANCH}

Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com configured
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/podmonitors.monitoring.coreos.com configured
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com configured
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/prometheusrules.monitoring.coreos.com configured
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com configured
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/thanosrulers.monitoring.coreos.com configured
Error: UPGRADE FAILED: rendered manifests contain a new resource that already exists. Unable to continue with update: existing resource conflict: namespace: staging, name: binderhub, existing_kind: networking.k8s.io/v1beta1, Kind=Ingress, new_kind: networking.k8s.io/v1beta1, Kind=Ingress

Exited with code exit status 1

CircleCI received exit code 1

cc @salvis2. I don't know if the changes in #159 would cause that. Perhaps the change at https://github.com/pangeo-data/pangeo-binder/pull/159/files#diff-590371a4912f3ac827ca1b75524c8582L23, though deploy-aws should already override that.

Accessing S3 Buckets on Dask Gateway Workers

In testing out existing notebooks with dask gateway we ran into a KilledWorker running calculations on data in S3 (https://nbviewer.jupyter.org/gist/rsignell-usgs/71c4eac62d88aca2c8d765bc7dcd11de ):

KilledWorker: ('zarr-3a22ea2f96e23b46760ac4cb26937e9e', <Worker 'tls://192.168.156.109:44151', name: 88453c6cd7bc47aab1dfc10e8733b750, memory: 0, processing: 1>)

Turns out this was due to an error behind the scenes linking AWS IAM permissions to the service account for dask worker pods.

kubectl logs -n staging dask-gateway-scottyhq-pageodev-binder-image-kfup6kzl-worker-1fe6735a73bd41529b4575eb80ade649

  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/botocore/utils.py", line 1734, in __call__
    with self._open(self._web_identity_token_path) as token_file:
PermissionError: [Errno 13] Permission denied: '/var/run/secrets/eks.amazonaws.com/serviceaccount/token'

The dask worker pods and containers are not currently assigned a securityContext and at the very least need to set fsGroup for cloud provider credentials to be linked:
https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-technical-overview.html

cc @TomAugspurger @jhamman @rsignell-usgs

Migrating away from daskkubernetes service account

@jhamman @TomAugspurger - In light of #128, as we transition to using dask-gateway, we still need to decide what service account to give dask worker pods and jupyter notebook pods. Right now they get the overly permissive daskkubernetes service account. If we take that away, they get the default service account.

I'm thinking we should move to a pangeo service account for dask workers and jupyter notebooks. That way you can more easily link access to specific cloud resources like s3 buckets - see pangeo-data/pangeo-cloud-federation#485.

Linking those resources to the default service account doesn't feel right because so many pods use it.

Maximum runtime?

I've noticed that occasionally we have jupyter pods that have been around for 6+ hours. I wonder if we should cap runtime at some limit, perhaps the maximum that we expect for a tutorial?
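
A quick audit for long-running user pods, as a sketch (the 6-hour cutoff is just the number mentioned above, and actually enforcing a cap would presumably go through the hub's culler settings rather than a script like this):

from datetime import datetime, timedelta, timezone
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
cutoff = datetime.now(timezone.utc) - timedelta(hours=6)

# singleuser-server is the label the chart puts on user notebook pods
pods = v1.list_namespaced_pod("prod", label_selector="component=singleuser-server")
for pod in pods.items:
    if pod.status.start_time and pod.status.start_time < cutoff:
        print(pod.metadata.name, "running since", pod.status.start_time)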

Losing connection to my binder notebook

Hi everyone,

Recently, when preparing a Pangeo tutorial, I've experienced some disconnections from my Binder-launched notebook. I had to restart the whole binder notebook, losing the current notebook state.

I'm not sure what is causing this; it just happened at around 9 UTC.

404 error when accessing the dask-gateway

When I try to start up a cluster using the recommended way in the binder documentation, I get a cluster with zero workers. The dashboard link that is provided is dead, and all of the dask-related pages say "404 not found."

Here is my environment.yml file:
name: pangeo
channels:
  - conda-forge
dependencies:
  - python=3.6
  - xarray
  - dask
  - distributed
  - act-atmos
  - dask-kubernetes
  - dask-gateway==0.7.1
  - gcsfs
  - matplotlib
  - numcodecs
  - python-blosc
  - lz4
  - nomkl
  - nbserverproxy
  - pangeo-notebook
  - jupyter
  - jupyterlab=0.35
  - jupyterlab_launcher
  - jupyter_client
  - ipywidgets
  - graphviz
  - nodejs
  - pip:

Simply typing https://hub.binder.pangeo.io/services/dask-gateway into a web browser gives me a 404 not found error. Is the dask-gateway down? I've tried this on multiple machines to rule out anything network-related.
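
For context, the flow being attempted is roughly the following (this assumes the binder image ships dask configuration pointing the client at the gateway; without that, Gateway() would need an explicit address):

from dask_gateway import Gateway

gateway = Gateway()             # picks up the gateway address from dask config
print(gateway.list_clusters())  # a cheap connectivity check against the gateway API

cluster = gateway.new_cluster()
cluster.scale(2)
client = cluster.get_client()
print(client.dashboard_link)    # this is the link that currently 404s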

Lingering dask-gateway pods

We have some dask-gateway-* pods hanging around.

$ kubectl -n prod get pod | grep '^dask-gateway-'
dask-gateway-2qrzm                                   1/1     Running   0          11h
dask-gateway-2vwpc                                   1/1     Running   0          11h
dask-gateway-2xsd4                                   1/1     Running   0          11h
dask-gateway-411c06d8dc9e452bbdb2a0e6f9ca96d0        1/1     Running   0          12h
dask-gateway-4llf6                                   1/1     Running   0          11h
dask-gateway-4tmmd                                   1/1     Running   0          11h
dask-gateway-5sn95                                   1/1     Running   0          11h
dask-gateway-7vvk9                                   1/1     Running   0          11h
dask-gateway-8xqgc                                   1/1     Running   0          11h
dask-gateway-bqrs8                                   1/1     Running   0          11h
dask-gateway-cdc2n                                   1/1     Running   0          11h
dask-gateway-cq7k9                                   1/1     Running   0          11h
dask-gateway-dnshn                                   1/1     Running   0          11h
dask-gateway-dq2l5                                   1/1     Running   0          11h
dask-gateway-f7qlb                                   1/1     Running   0          11h
dask-gateway-fpvfs                                   1/1     Running   0          11h
dask-gateway-g8x9n                                   1/1     Running   0          11h
dask-gateway-gvft9                                   1/1     Running   0          11h
dask-gateway-h4sdl                                   1/1     Running   0          11h
dask-gateway-j62lb                                   1/1     Running   0          11h
dask-gateway-jtm6n                                   1/1     Running   0          11h
dask-gateway-kkcc8                                   1/1     Running   0          11h
dask-gateway-m9r6c                                   1/1     Running   0          11h
dask-gateway-mnpmd                                   1/1     Running   0          11h
dask-gateway-nqk25                                   1/1     Running   0          11h
dask-gateway-ps957                                   1/1     Running   0          11h
dask-gateway-rbgx8                                   1/1     Running   0          11h
dask-gateway-vxwr8                                   1/1     Running   0          11h
dask-gateway-wv8bl                                   1/1     Running   0          11h
dask-gateway-xfgkd                                   1/1     Running   0          11h
dask-gateway-zjczn                                   1/1     Running   0          11h

We should figure out why they weren't automatically removed when the user session went away (I don't see a user pod anywhere).

Problem when creating cluster: "suggest closing worker"

Hi everyone,

I created my own repository and run it on binder.pangeo.io:
https://binder.pangeo.io/v2/gh/lwchao-vivian/Pangeo_cluster/master

However, when I tried to create a Dask cluster in Pangeo_calculate_ECS_and_feedback.ipynb, it kept showing warning messages asking me to close the workers. The message looks like:
distributed.scheduler - INFO - Suggest closing workers

This message didn't show up when I ran hello_world.ipynb. Does anyone know how to solve this issue? Thank you.

Add documentation on how to use binder.pangeo.io

We should add a documentation page somewhere (probably pangeo.io) describing how to use binder.pangeo.io. There are a few steps that are not documented in binder's main docs that are specific to our use case so those should be documented clearly. For example:

  • how to configure dask-kubernetes. We should provide an example config file
  • required dependencies to support dask diagnostics (jupyterlab extensions)
  • how to access datasets stored on GCS or S3

Setting worker pool taint for dask workers

@mrocklin had set up the worker pool for binder.pangeo.io with a taint (dask-worker: True). I'm now curious how he did that: from the GKE web interface or via the command line? In either event, it would be good to document how this was done so that when we rebuild the node pools, we can repeat this step.

(Note, the taint is no longer in action as I have redeployed the worker pool this morning).

Configure local ssds for worker pools

The current deployment cluster includes local SSDs on the worker nodes. This may be quite nice for complicated dask graphs that want to spill to disk. Although I've added local SSDs to the worker node pools, those disks are not mounted by default. If anyone knows how to mount those disks, please do tell.

better support/tuning for nodepool selectors

I have the binder.pangeo.io cluster set up with multiple node pools. I have our helm chart pointing the hub/binder to the "default pool", notebook servers to a "user pool", and dask workers to a variety of "worker pools". This will help us provide custom computing environments and make sure the dask workers don't keep new binders from spinning up. However, I'm not sure this is fully working, so some stress testing and tuning is likely required.

"Environment.yml should contain a dictionary" error?

When I try my notebook on pangeo-binder, I'm getting a build error:

Waiting for build to start...
Cloning into '/tmp/repo2dockerhvdmfjgu'...
HEAD is now at d860787 Update environment.yml
Error during build: environment.yml should contain a dictionary. Got <class 'ruamel.yaml.comments.CommentedMap'>

which is strange, since the environment.yml works fine locally.

@jhamman, probably something simple, right? 😸
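
A quick way to reproduce the check locally before pushing, as a rough sketch (repo2docker's actual validation is stricter than this):

import yaml

with open("environment.yml") as f:
    env = yaml.safe_load(f)

# repo2docker expects the file to parse as a mapping (name/channels/dependencies)
assert isinstance(env, dict), f"expected a mapping, got {type(env)}"
print("top-level keys:", sorted(env))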

Redo dask pod labels

In #145 I added the username to the Dask Gateway pods as a label, like:

Labels:
  jupyterhub/username: <username here>

I just discovered that jupyterhub does something similar for its pods with annotations:

Name:                 jupyter-tomaugspurger-2dpangeo-2dbinder-2dtest-2dga6arn36
Namespace:            prod
Priority:             0
Priority Class Name:  prod-default-priority
Node:                 gke-binder-user-pool-3a8deb5e-5v53/10.128.0.76
Start Time:           Fri, 24 Apr 2020 14:02:59 -0500
Labels:               app=jupyterhub
                      chart=jupyterhub-0.9.0-beta.4
                      component=singleuser-server
                      heritage=jupyterhub
                      hub.jupyter.org/network-access-hub=true
                      release=prod
Annotations:          hub.jupyter.org/username: tomaugspurger-pangeo-binder-test-ga6arn36

It's probably best to align our method with JupyterHub's.

enable https / ssl authentication

Currently, pangeo-binder is fairly insecure. I've tried adding the https config to the binderhub chart, but it seems like there is a kubernetes API version mismatch: jupyterhub/zero-to-jupyterhub-k8s#475

I'm recording what I tried here; we'll come back to this.

  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
    https:
      enabled: true
      type: "kube-lego"
  kube-lego:
    config:
      LEGO_EMAIL: $EMAIL
      LEGO_URL: https://acme-v01.api.letsencrypt.org/directory
    rbac:
      create: true
