esipfed / esiphub-dev
Development JupyterHub on AWS targeting pangeo environment for National Water Model exploration
License: MIT License
@dbuscombe-usgs has about 20GB of data that he would like students in the CDI workshop to be able to access. He has loaded it to google drive and shared with my USGS google account.
I'm currently copying it to AWS S3 using this rclone
command:
rclone sync gdrive-usgs:imageclass aws:cdi-workshop --checksum --fast-list --transfers 16 &
When @abburgess gets the CNAME pangeo-dev.esipfed.org pointing to our instance, we need to modify the secret-config.yaml, right?
https://github.com/pangeo-data/pangeo/blob/master/gce/secret-config.yaml#L10
@jreadey , do you know if there are additional steps required?
Is there any way to launch a web browser from within a notebook, to view localhost?
http://jupyter-dbuscombe-2dusgs:6006
It is for accessing TensorBoard outputs.
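One possible approach (a sketch, not tested against this deployment: whether the hub proxies that host/port to your browser is an open question) is to embed the page in a notebook output cell with IPython:

```python
from IPython.display import IFrame

# Embed the TensorBoard page in a notebook output cell.
# Note: your browser must be able to reach this host/port; behind
# JupyterHub that typically requires some form of proxying.
tb = IFrame("http://jupyter-dbuscombe-2dusgs:6006", width=900, height=600)
tb  # display as the last expression of a cell
```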
We would like to have a cluster built using kops so that we have a better chance of using autoscaling, and because the AWS landsat data is on US-EAST.
@rsignell-usgs it appears the python bindings to opencv no longer work.
python -c "import cv2"
returns
ImportError: libGL.so.1: cannot open shared object file: No such file or directory
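Errors like this usually mean the Docker image lacks the system libraries that conda-forge's opencv links against. A possible fix, assuming a Debian-based notebook image (the package names are the usual suspects for libGL/libXext, not verified against this particular image), is to install them during the image build:

```shell
# In the notebook image's Dockerfile (run as root during the build):
apt-get update && apt-get install -y --no-install-recommends \
    libgl1-mesa-glx libxext6 libsm6 libxrender1 \
 && rm -rf /var/lib/apt/lists/*
```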
For his deep learning workshop, @dbuscombe-usgs needs the user to be able to draw polylines on an image and capture the pixel coordinates for further use in the Notebook.
@ocefpaf suggests trying matplotlib's plt.imshow and plt.ginput, using %matplotlib notebook instead of the usual %matplotlib inline.
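A minimal sketch of that suggestion (assumptions: a Jupyter session with the notebook backend; the helper name and point count are ours):

```python
import matplotlib.pyplot as plt

def capture_polyline(image, n_points=4, timeout=120):
    """Show an image and collect clicked polyline vertices as (x, y) pixels.

    Requires an interactive backend (%matplotlib notebook in Jupyter);
    plt.ginput blocks until n_points clicks or the timeout expires.
    """
    fig, ax = plt.subplots()
    ax.imshow(image)
    points = plt.ginput(n_points, timeout=timeout)
    plt.close(fig)
    return points
```

In a notebook cell, `points = capture_polyline(img)` would then give the pixel coordinates for further use.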
Right now our Kubernetes nodes are always running, regardless of whether anyone is using them.
We really need to implement the Kubernetes Cluster Autoscaler on AWS.
I think we should try to implement this ASAP.
@jreadey, does this make sense to you?
Or do we need to try to get help?
@dbuscombe-usgs, I'd like to get the resources set up properly for the CDI workshop.
Do you know how much memory each user will need to execute the workflows?
Will each student need multiple Dask workers to work in parallel?
(and if so, how many would you like to use)?
@dbuscombe-usgs needs opencv for the USGS CDI workshop on deep learning, and after installing it from conda-forge, we are getting:
ImportError Traceback (most recent call last)
<ipython-input-1-c8ec22b3e787> in <module>()
----> 1 import cv2
ImportError: libXext.so.6: cannot open shared object file: No such file or directory
When we installed opencv, we got this, which looked okay to me:
jovyan@jupyter-rsignell-2dusgs:~$ conda install opencv
Solving environment: done
## Package Plan ##
environment location: /opt/conda
added / updated specs:
- opencv
The following packages will be downloaded:
package | build
---------------------------|-----------------
libwebp-0.5.2 | 7 688 KB conda-forge
jasper-1.900.1 | 4 275 KB conda-forge
scipy-1.1.0 |py36_blas_openblas_200 40.2 MB conda-forge
openblas-0.2.20 | 8 17.0 MB conda-forge
blas-1.1 | openblas 1 KB conda-forge
graphite2-1.3.11 | 0 119 KB conda-forge
x264-20131218 | 0 4.9 MB conda-forge
opencv-3.4.1 |py36_blas_openblas_200 43.6 MB conda-forge
numpy-1.14.5 |py36_blas_openblash24bf2e0_200 9.0 MB conda-forge
harfbuzz-1.7.6 | 0 5.6 MB conda-forge
ffmpeg-3.2.4 | 3 58.1 MB conda-forge
------------------------------------------------------------
Total: 179.5 MB
The following NEW packages will be INSTALLED:
blas: 1.1-openblas conda-forge
ffmpeg: 3.2.4-3 conda-forge
graphite2: 1.3.11-0 conda-forge
harfbuzz: 1.7.6-0 conda-forge
jasper: 1.900.1-4 conda-forge
libwebp: 0.5.2-7 conda-forge
openblas: 0.2.20-8 conda-forge
opencv: 3.4.1-py36_blas_openblas_200 conda-forge [blas_openblas]
x264: 20131218-0 conda-forge
The following packages will be UPDATED:
numpy: 1.14.2-py36hdbf6ddf_1 defaults --> 1.14.5-py36_blas_openblash24bf2e0_200 conda-forge [blas_openblas]
scipy: 1.1.0-py36hfc37229_0 defaults --> 1.1.0-py36_blas_openblas_200 conda-forge [blas_openblas]
Proceed ([y]/n)? y
@ocefpaf, when I just built the new pangeo.esipfed notebook container, I got this message:
The following packages will be DOWNGRADED:
bleach: 2.1.3-py_0 conda-forge --> 1.5.0-py36_0 conda-forge
html5lib: 1.0.1-py_0 conda-forge --> 0.9999999-py36_0 conda-forge
Does that mean those conda-forge packages need upgrading?
Over the past few days the web interface http://pangeo.esipfed.org/ has been slow. Today it is really slow: it often takes several minutes to start the server and load an .ipynb file, and I've had to close and restart several times.
I just heard on a pangeo web meeting that the Met Office developed a FUSE toolbox that you can use to mount all your S3 content as a file system:
https://github.com/informatics-lab/s3-fuse-flex-volume/blob/master/README.md
We should enable this so we can compare this baseline to other approaches like zarr and HSDS.
following https://zero-to-jupyterhub.readthedocs.io/en/latest/amazon/step-zero-aws.html
Instructions say:
Create an IAM Role
This role will be used to give your CI host permission to create and destroy resources on AWS:
AmazonEC2FullAccess
IAMFullAccess
AmazonS3FullAccess
AmazonVPCFullAccess
Route53FullAccess (Optional)
I created this using the aws cli following the instructions on
https://github.com/kubernetes/kops/blob/master/docs/aws.md
I skipped the DNS step because I'll have a "gossip-based cluster".
Enable versioning and encryption on the $KOPS_STATE_STORE:
aws s3api put-bucket-versioning --bucket esip-pangeo-kops-state-store --versioning-configuration Status=Enabled
aws s3api put-bucket-encryption --bucket esip-pangeo-kops-state-store --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
create cluster:
$ kops create cluster kopscluster.k8s.local \
--zones us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f \
--authorization RBAC \
--master-size t2.small \
--master-volume-size 10 \
--node-size m4.2xlarge \
--master-count 3 \
--networking cni \
--node-count 2 \
--node-volume-size 120 \
--image kope.io/k8s-1.8-debian-stretch-amd64-hvm-ebs-2018-02-08 \
--yes
which produced this output:
I0206 15:13:48.491663 23761 create_cluster.go:496] Inferred --cloud=aws from zone "us-east-1a"
I0206 15:13:48.558748 23761 subnets.go:184] Assigned CIDR 172.20.32.0/19 to subnet us-east-1a
I0206 15:13:48.558862 23761 subnets.go:184] Assigned CIDR 172.20.64.0/19 to subnet us-east-1b
I0206 15:13:48.558913 23761 subnets.go:184] Assigned CIDR 172.20.96.0/19 to subnet us-east-1c
I0206 15:13:48.558962 23761 subnets.go:184] Assigned CIDR 172.20.128.0/19 to subnet us-east-1d
I0206 15:13:48.559028 23761 subnets.go:184] Assigned CIDR 172.20.160.0/19 to subnet us-east-1e
I0206 15:13:48.559071 23761 subnets.go:184] Assigned CIDR 172.20.192.0/19 to subnet us-east-1f
I0206 15:13:48.854296 23761 create_cluster.go:1407] Using SSH public key: /home/ec2-user/.ssh/id_rsa.pub
I0206 15:13:49.238073 23761 apply_cluster.go:542] Gossip DNS: skipping DNS validation
I0206 15:13:49.769488 23761 executor.go:103] Tasks: 0 done / 97 total; 34 can run
I0206 15:13:50.195611 23761 vfs_castore.go:736] Issuing new certificate: "ca"
I0206 15:13:50.509655 23761 vfs_castore.go:736] Issuing new certificate: "apiserver-aggregator-ca"
I0206 15:13:50.892007 23761 executor.go:103] Tasks: 34 done / 97 total; 29 can run
I0206 15:13:51.917987 23761 vfs_castore.go:736] Issuing new certificate: "kubecfg"
I0206 15:13:52.134623 23761 vfs_castore.go:736] Issuing new certificate: "kubelet-api"
I0206 15:13:52.333038 23761 vfs_castore.go:736] Issuing new certificate: "kube-proxy"
I0206 15:13:52.766449 23761 vfs_castore.go:736] Issuing new certificate: "kube-scheduler"
I0206 15:13:52.885908 23761 vfs_castore.go:736] Issuing new certificate: "kube-controller-manager"
I0206 15:13:53.122200 23761 vfs_castore.go:736] Issuing new certificate: "apiserver-proxy-client"
I0206 15:13:53.230130 23761 vfs_castore.go:736] Issuing new certificate: "apiserver-aggregator"
I0206 15:13:53.490835 23761 vfs_castore.go:736] Issuing new certificate: "kubelet"
I0206 15:13:53.513488 23761 vfs_castore.go:736] Issuing new certificate: "kops"
I0206 15:13:53.793454 23761 executor.go:103] Tasks: 63 done / 97 total; 26 can run
I0206 15:13:54.027747 23761 launchconfiguration.go:380] waiting for IAM instance profile "masters.kopscluster.k8s.local" to be ready
I0206 15:13:54.030642 23761 launchconfiguration.go:380] waiting for IAM instance profile "nodes.kopscluster.k8s.local" to be ready
I0206 15:13:54.066847 23761 launchconfiguration.go:380] waiting for IAM instance profile "masters.kopscluster.k8s.local" to be ready
I0206 15:13:54.179909 23761 launchconfiguration.go:380] waiting for IAM instance profile "masters.kopscluster.k8s.local" to be ready
I0206 15:14:04.639432 23761 executor.go:103] Tasks: 89 done / 97 total; 5 can run
I0206 15:14:04.997866 23761 vfs_castore.go:736] Issuing new certificate: "master"
I0206 15:14:05.726978 23761 executor.go:103] Tasks: 94 done / 97 total; 3 can run
I0206 15:14:06.331247 23761 executor.go:103] Tasks: 97 done / 97 total; 0 can run
I0206 15:14:06.406582 23761 update_cluster.go:290] Exporting kubecfg for cluster
kops has set your kubectl context to kopscluster.k8s.local
Cluster is starting. It should be ready in a few minutes.
Suggestions:
* validate cluster: kops validate cluster
* list nodes: kubectl get nodes --show-labels
* ssh to the master: ssh -i ~/.ssh/id_rsa admin@api.kopscluster.k8s.local
* the admin user is specific to Debian. If not using Debian please use the appropriate user based on your OS.
* read about installing addons at: https://github.com/kubernetes/kops/blob/master/docs/addons.md.
Don't try to validate the cluster yet.
First enable networking:
kubectl create -f https://git.io/weave-kube-1.6
Validate cluster. This will fail for several minutes before it works:
[ec2-user@ip-172-31-34-163 ~]$ kops validate cluster
Using cluster from kubectl context: kopscluster.k8s.local
Validating cluster kopscluster.k8s.local
INSTANCE GROUPS
NAME ROLE MACHINETYPE MIN MAX SUBNETS
master-us-east-1a Master t2.small 1 1 us-east-1a
master-us-east-1b Master t2.small 1 1 us-east-1b
master-us-east-1c Master t2.small 1 1 us-east-1c
nodes Node m4.2xlarge 2 2 us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f
NODE STATUS
NAME ROLE READY
ip-172-20-148-127.ec2.internal node True
ip-172-20-41-39.ec2.internal master True
ip-172-20-77-127.ec2.internal master True
ip-172-20-85-76.ec2.internal node True
ip-172-20-97-64.ec2.internal master True
Your cluster kopscluster.k8s.local is ready
Then enable storage:
(aws) [ec2-user@ip-172-31-34-163 ~]$ more storageclass.yml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
annotations:
storageclass.beta.kubernetes.io/is-default-class: "true"
name: gp2
provisioner: kubernetes.io/aws-ebs
parameters:
type: gp2
kubectl apply -f storageclass.yml
Create a Kubernetes secret with a Weave network password:
openssl rand -hex 128 >weave-passwd
kubectl create secret -n kube-system generic weave-passwd --from-file=./weave-passwd
kubectl patch --namespace=kube-system daemonset/weave-net --type json -p '[ { "op": "add", "path": "/spec/template/spec/containers/0/env/0", "value": { "name": "WEAVE_PASSWORD", "valueFrom": { "secretKeyRef": { "key": "weave-passwd", "name": "weave-passwd" } } } } ]'
zero-to-jupyterhub step 0 complete!
curl https://raw.githubusercontent.com/kubernetes/helm/master/scripts/get | bash
kubectl --namespace kube-system create serviceaccount tiller
kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account tiller --wait
kubectl patch deployment tiller-deploy --namespace=kube-system --type=json --patch='[{"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["/tiller", "--listen=localhost:44134"]}]'
test:
helm version
following
https://akomljen.com/kubernetes-cluster-autoscaling-on-aws/
create node instance groups for each availability zone:
I first created an IG template:
(aws) [ec2-user@ip-172-31-34-163 ~]$ more node_ig_template.yaml
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: 2019-02-01T14:32:59Z
labels:
kops.k8s.io/cluster: kopscluster.k8s.local
name: nodes-us-east-#SUBZONE#-m4-2xlarge.kopscluster.k8s.local
spec:
cloudLabels:
k8s.io/cluster-autoscaler/enabled: ""
k8s.io/cluster-autoscaler/node-template/label: ""
kubernetes.io/cluster/kopscluster.k8s.local: owned
image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
machineType: m4.2xlarge
maxPrice: "0.38"
maxSize: 50
minSize: 0
nodeLabels:
kops.k8s.io/instancegroup: nodes-us-east-#SUBZONE#-m4-2xlarge.kopscluster.k8s.local
role: Node
rootVolumeSize: 120
subnets:
- us-east-#SUBZONE#
and then I ran this script to create the IG in all six availability zones:
#!/bin/bash
for SUBZONE in 1a 1b 1c 1d 1e 1f
do
sed 's/#SUBZONE#/'"$SUBZONE"'/' node_ig_template.yaml > ig.yaml
kops create -f ig.yaml
done
Then update cluster:
kops update cluster kopscluster.k8s.local --yes
Now add IAM policy rules for the nodes:
kops edit cluster
and add the additionalPolicies group to the spec:
spec:
additionalPolicies:
node: |
[
{
"Effect": "Allow",
"Action": [
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:DescribeAutoScalingInstances",
"autoscaling:SetDesiredCapacity",
"autoscaling:DescribeLaunchConfigurations",
"autoscaling:DescribeTags",
"autoscaling:TerminateInstanceInAutoScalingGroup"
],
"Resource": ["*"]
}
]
and apply configuration:
kops update cluster --yes
Check what version of kubernetes we are using:
kubectl version
and note the ServerVersion=>GitVersion
(e.g. 1.11.6).
Then go to https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#releases and find the right CA version corresponding to your Kubernetes version (e.g. 1.11.X => 1.3.X)
Then go to:
https://github.com/kubernetes/autoscaler/releases
and find the most recent version of your CA version (e.g. 1.3.5)
Specify this in your autoscaling helm chart:
helm install --name autoscaler \
--namespace kube-system \
--set image.tag=v1.3.5 \
--set autoDiscovery.clusterName=kopscluster.k8s.local \
--set extraArgs.balance-similar-node-groups=false \
--set extraArgs.expander=random \
--set rbac.create=true \
--set rbac.pspEnabled=true \
--set awsRegion=us-east-1 \
--set nodeSelector."node-role\.kubernetes\.io/master"="" \
--set tolerations[0].effect=NoSchedule \
--set tolerations[0].key=node-role.kubernetes.io/master \
--set cloudProvider=aws \
stable/cluster-autoscaler
verify it's running:
kubectl --namespace=kube-system get pods -l "app=aws-cluster-autoscaler,release=autoscaler"
install pangeo helm chart:
helm upgrade --install esip-pangeo pangeo/pangeo --namespace esip-pangeo --version=0.1.1-ce2f7f5 -f jupyter-config-noscratch.yaml -f secret-config.yaml
find the IP:
kubectl --namespace=esip-pangeo get svc proxy-public
which in my case, produced:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
proxy-public LoadBalancer 100.64.166.60 ada21177b295e11e9a0ee0eef77e790b-963275451.us-east-1.elb.amazonaws.com 80:32541/TCP 40s
set the default namespace:
kubectl config set-context $(kubectl config current-context) --namespace=esip-pangeo
After logging into JH and verifying that the cluster scaled up using the CA-enabled IGs to meet the dask workers requested, I deleted the IG for the original 2 nodes from the initial cluster creation:
kops delete ig nodes --yes
Partially completed on gamone:
ssh gamone.whoi.edu
cd github/helm-chart/docker-images/notebook
docker images should show the newly built images.
@dbuscombe-usgs, @csherwood-usgs alerted me to the instructions for installing the Python environment for the CDI class, and I'm quite worried: combining defaults, conda, and pip in the way you are advocating is very problematic, as Filipe (@ocefpaf) can attest.
I would recommend instead that you follow Filipe's Python instructions for IOOS, but instead of using the IOOS environment.yml file, use your own environment file:
conda env create -f tf_environment.yml
where the tf_environment.yml file is:
name: tf
channels:
- conda-forge
- defaults
dependencies:
- python=3.6
- pydensecrf
- cython
- numpy
- scipy
- matplotlib
- s3fs
- scikit-image
- scikit-learn
- joblib
- tensorflow
- opencv
- ipython
- tensorflow-hub
- tqdm
I've created this environment locally. Is there a notebook I can test on? (or you can just try it. It took 5 minutes to create)
The base conda environment is missing nb_conda_kernels, which would allow us to pick alternate kernels (environments) from Jupyter.
It also might be helpful if the base environment included:
conda install xarray boto3 zarr s3fs nb_conda_kernels gcsfs
As part of the NOAA Big Data Project, the National Water Model data is now on AWS S3:
From Conor Delaney:
Just checked with CICs, the data is all there; I just didn't understand how it was structured. To get to a particular section of the data sets, use the Prefix parameter. The archive data is collated by year and the forecast is collated by day.
http://nwm-archive.s3.amazonaws.com/?prefix=2017
or
http://noaa-nwm-pds.s3.amazonaws.com/?prefix=nwm.20180416
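As a minimal sketch of the Prefix convention above (bucket and prefix names come from those URLs; the helper function is ours):

```python
# Build the browse URL for a given bucket and prefix, as in the examples above.
def nwm_prefix_url(bucket, prefix):
    return "http://{}.s3.amazonaws.com/?prefix={}".format(bucket, prefix)

archive_url = nwm_prefix_url("nwm-archive", "2017")            # archive: by year
forecast_url = nwm_prefix_url("noaa-nwm-pds", "nwm.20180416")  # forecast: by day
```

With boto3 the same prefix would presumably be passed as `s3.list_objects_v2(Bucket="nwm-archive", Prefix="2017")`.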
@jreadey what is the best way to convert some of this data to HSDS?
Here's the list of files that I would like to create the HSDS dataset from (using the RENCI opendap to illustrate the dataset):
import pandas as pd
import xarray as xr
root = 'http://tds.renci.org:8080/thredds/dodsC/nwm/forcing_short_range/' # OPeNDAP
dates = pd.date_range(start='2018-04-01T00:00', end='2018-04-07T23:00', freq='H')
urls = ['{}{}/nwm.t{}z.short_range.forcing.f001.conus.nc'.format(root,a.strftime('%Y%m%d'),a.strftime('%H')) for a in dates]
print('\n'.join(urls))
ds = xr.open_mfdataset(urls, concat_dim='time')
print(ds)
Could we just modify this somehow (perhaps using FUSE) to read the NetCDF files from S3?
We should add s3fs and h5pyd to the root conda environment for pangeo.
But let's also leave gcsfs in there in case we want to access Google Cloud Storage.
I'm here in DC at the NOAA Environmental Management Meeting, where @zflamig gave a nice presentation of the Open Commons Consortium's work on the National Water Model data. He showed this notebook which brings in the vector data necessary to marry with the NWM output to make maps:
https://github.com/occ-data/nwm-jupyter/blob/master/NWM.ipynb
I think it would be cool to rewrite the data ingest part of this to read NWM data via Zarr.
@dblodgett-usgs, is there a way to obtain the NHD vector data using services?
I'm running into an error importing the rasterio package on the current deployment of http://pangeo.esipfed.org:
import rasterio
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-df51458539a9> in <module>()
1 get_ipython().run_line_magic('matplotlib', 'inline')
----> 2 import rasterio
3 import numpy as np
4 import xarray as xr
5 import matplotlib.pyplot as plt
/opt/conda/lib/python3.6/site-packages/rasterio/__init__.py in <module>()
21 pass
22
---> 23 from rasterio._base import gdal_version
24 from rasterio.drivers import is_blacklisted
25 from rasterio.dtypes import (
ImportError: libncurses.so.6: cannot open shared object file: No such file or directory
@dbuscombe-usgs, I created a cdi-workshops github org and made you owner, so anyone you invite to the org (and who accepts) will show up on this list https://github.com/orgs/cdi-workshops/people and will have access to pangeo.esipfed.org. Sound like a plan?
To create a custom environment:
Ensure that nb_conda_kernels
is installed in the base conda environment. If it's not, ask the JupyterHub provider to include it!
In JupyterHub, edit your ~/.condarc to specify a persisted directory (like /home/jovyan/my-conda-envs) for your environments:
channels:
- conda-forge
envs_dirs:
- /home/jovyan/my-conda-envs
Create your new environment, making sure to include the ipykernel package:
conda create -n my_new_custom_env my_package_1 my_package_2 ipykernel
Stop and start your server.
You should now be able to see your custom env on the kernel pick list:
It's a bit awkward using http://<38-characters>.us-west-2.elb.amazonaws.com/hub/home
I was unable to start my JupyterHub server on pangeo.esipfed.org last night. It was telling me to try restarting from the hub (http://pangeo.esipfed.org/hub/home) but it didn't work, giving me 500 and 503 errors.
When I checked the pods, I saw this:
(IOOS3) rsignell@gamone:~> kubectl get pods -n esip-dev
NAME READY STATUS RESTARTS AGE
hub-67cf8994b6-9qbm2 1/1 Running 0 1h
jupyter-marasophia 1/1 Running 0 33m
jupyter-rsignell-2dusgs 0/1 PodInitializing 0 23m
jupyter-srharrison 1/1 Running 0 1h
proxy-749c488cd4-nd2b9 1/1 Running 0 1h
but although my pod said PodInitializing, it seemed hung.
Write 80GB or so of NWM data to Zarr format on S3. This should be sufficient for initial testing and demos.