esipfed / esiphub-dev
Development JupyterHub on AWS targeting pangeo environment for National Water Model exploration
License: MIT License
@dbuscombe-usgs has about 20GB of data that he would like students in the CDI workshop to be able to access. He has loaded it to google drive and shared with my USGS google account.
I'm currently copying it to AWS S3 using this rclone
command:
rclone sync gdrive-usgs:imageclass aws:cdi-workshop --checksum --fast-list --transfers 16 &
When @abburgess gets the CNAME pangeo-dev.esipfed.org pointing to our instance, we need to modify the secret-config.yaml, right?
https://github.com/pangeo-data/pangeo/blob/master/gce/secret-config.yaml#L10
@jreadey , do you know if there are additional steps required?
Is there any way to launch a web browser from within a notebook, to view localhost?
http://jupyter-dbuscombe-2dusgs:6006
It is for accessing TensorBoard outputs.
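One possible approach (a sketch, not tested against this deployment: whether the hub proxies that host/port to your browser is an open question) is to embed the page in a notebook output cell with IPython:

```python
from IPython.display import IFrame

# Embed the TensorBoard page in a notebook output cell.
# Note: your browser must be able to reach this host/port; behind
# JupyterHub that typically requires some form of proxying.
tb = IFrame("http://jupyter-dbuscombe-2dusgs:6006", width=900, height=600)
tb  # display as the last expression of a cell
```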
We would like to have a cluster built using kops so that we have a better chance of using autoscaling, and because the AWS landsat data is on US-EAST.
@rsignell-usgs it appears the python bindings to opencv no longer work.
python -c "import cv2"
returns
ImportError: libGL.so.1: cannot open shared object file: No such file or directory
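Errors like this usually mean the Docker image lacks the system libraries that conda-forge's opencv links against. A possible fix, assuming a Debian-based notebook image (the package names are the usual suspects for libGL/libXext, not verified against this particular image), is to install them during the image build:

```shell
# In the notebook image's Dockerfile (run as root during the build):
apt-get update && apt-get install -y --no-install-recommends \
    libgl1-mesa-glx libxext6 libsm6 libxrender1 \
 && rm -rf /var/lib/apt/lists/*
```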
For his deep learning workshop, @dbuscombe-usgs needs the user to be able to draw polylines on an image and capture the pixel coordinates for further use in the Notebook.
@ocefpaf suggests trying matplotlib's plt.imshow and plt.ginput, using %matplotlib notebook instead of the usual %matplotlib inline.
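A minimal sketch of that suggestion (assumptions: a Jupyter session with the notebook backend; the helper name and point count are ours):

```python
import matplotlib.pyplot as plt

def capture_polyline(image, n_points=4, timeout=120):
    """Show an image and collect clicked polyline vertices as (x, y) pixels.

    Requires an interactive backend (%matplotlib notebook in Jupyter);
    plt.ginput blocks until n_points clicks or the timeout expires.
    """
    fig, ax = plt.subplots()
    ax.imshow(image)
    points = plt.ginput(n_points, timeout=timeout)
    plt.close(fig)
    return points
```

In a notebook cell, `points = capture_polyline(img)` would then give the pixel coordinates for further use.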
Right now our Kubernetes nodes are always running, regardless of whether anyone is using them.
We really need to implement the Kubernetes Cluster Autoscaler on AWS.
I think we should try to implement this ASAP.
@jreadey, does this make sense to you?
Or do we need to try to get help?
@dbuscombe-usgs, I'd like to get the resources set up properly for the CDI workshop.
Do you know how much memory each user will need to execute the workflows?
Will each student need multiple Dask workers to work in parallel?
(and if so, how many would you like to use)?
@dbuscombe-usgs needs opencv for the USGS CDI workshop on deep learning, and after installing it from conda-forge, we are getting:
ImportError Traceback (most recent call last)
<ipython-input-1-c8ec22b3e787> in <module>()
----> 1 import cv2
ImportError: libXext.so.6: cannot open shared object file: No such file or directory
When we installed opencv, we got this, which looked okay to me:
jovyan@jupyter-rsignell-2dusgs:~$ conda install opencv
Solving environment: done
## Package Plan ##
environment location: /opt/conda
added / updated specs:
- opencv
The following packages will be downloaded:
package | build
---------------------------|-----------------
libwebp-0.5.2 | 7 688 KB conda-forge
jasper-1.900.1 | 4 275 KB conda-forge
scipy-1.1.0 |py36_blas_openblas_200 40.2 MB conda-forge
openblas-0.2.20 | 8 17.0 MB conda-forge
blas-1.1 | openblas 1 KB conda-forge
graphite2-1.3.11 | 0 119 KB conda-forge
x264-20131218 | 0 4.9 MB conda-forge
opencv-3.4.1 |py36_blas_openblas_200 43.6 MB conda-forge
numpy-1.14.5 |py36_blas_openblash24bf2e0_200 9.0 MB conda-forge
harfbuzz-1.7.6 | 0 5.6 MB conda-forge
ffmpeg-3.2.4 | 3 58.1 MB conda-forge
------------------------------------------------------------
Total: 179.5 MB
The following NEW packages will be INSTALLED:
blas: 1.1-openblas conda-forge
ffmpeg: 3.2.4-3 conda-forge
graphite2: 1.3.11-0 conda-forge
harfbuzz: 1.7.6-0 conda-forge
jasper: 1.900.1-4 conda-forge
libwebp: 0.5.2-7 conda-forge
openblas: 0.2.20-8 conda-forge
opencv: 3.4.1-py36_blas_openblas_200 conda-forge [blas_openblas]
x264: 20131218-0 conda-forge
The following packages will be UPDATED:
numpy: 1.14.2-py36hdbf6ddf_1 defaults --> 1.14.5-py36_blas_openblash24bf2e0_200 conda-forge [blas_openblas]
scipy: 1.1.0-py36hfc37229_0 defaults --> 1.1.0-py36_blas_openblas_200 conda-forge [blas_openblas]
Proceed ([y]/n)? y
@ocefpaf, when I just built the new pangeo.esipfed notebook container, I got this message:
The following packages will be DOWNGRADED:
bleach: 2.1.3-py_0 conda-forge --> 1.5.0-py36_0 conda-forge
html5lib: 1.0.1-py_0 conda-forge --> 0.9999999-py36_0 conda-forge
Does that mean those conda-forge packages need upgrading?
Over the past few days the web interface http://pangeo.esipfed.org/ has been slow. Today it is really slow: it often takes several minutes to start the server and load an .ipynb file, and I've had to close and restart several times.
I just heard on a pangeo web meeting that the Met Office developed a FUSE toolbox that you can use to mount all your S3 content as a file system:
https://github.com/informatics-lab/s3-fuse-flex-volume/blob/master/README.md
We should enable this so we can compare this baseline to other approaches like zarr and HSDS.
following https://zero-to-jupyterhub.readthedocs.io/en/latest/amazon/step-zero-aws.html
Instructions say:
Create an IAM Role
This role will be used to give your CI host permission to create and destroy resources on AWS:
AmazonEC2FullAccess
IAMFullAccess
AmazonS3FullAccess
AmazonVPCFullAccess
Route53FullAccess (Optional)
I created this using the aws cli following the instructions on
https://github.com/kubernetes/kops/blob/master/docs/aws.md
I skipped the DNS step because I'll have a "gossip-based cluster".
Enable versioning and encryption on the $KOPS_STATE_STORE:
aws s3api put-bucket-versioning --bucket esip-pangeo-kops-state-store --versioning-configuration Status=Enabled
aws s3api put-bucket-encryption --bucket esip-pangeo-kops-state-store --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
create cluster:
$ kops create cluster kopscluster.k8s.local \
--zones us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f \
--authorization RBAC \
--master-size t2.small \
--master-volume-size 10 \
--node-size m4.2xlarge \
--master-count 3 \
--networking cni \
--node-count 2 \
--node-volume-size 120 \
--image kope.io/k8s-1.8-debian-stretch-amd64-hvm-ebs-2018-02-08 \
--yes
which produced this output:
I0206 15:13:48.491663 23761 create_cluster.go:496] Inferred --cloud=aws from zone "us-east-1a"
I0206 15:13:48.558748 23761 subnets.go:184] Assigned CIDR 172.20.32.0/19 to subnet us-east-1a
I0206 15:13:48.558862 23761 subnets.go:184] Assigned CIDR 172.20.64.0/19 to subnet us-east-1b
I0206 15:13:48.558913 23761 subnets.go:184] Assigned CIDR 172.20.96.0/19 to subnet us-east-1c
I0206 15:13:48.558962 23761 subnets.go:184] Assigned CIDR 172.20.128.0/19 to subnet us-east-1d
I0206 15:13:48.559028 23761 subnets.go:184] Assigned CIDR 172.20.160.0/19 to subnet us-east-1e
I0206 15:13:48.559071 23761 subnets.go:184] Assigned CIDR 172.20.192.0/19 to subnet us-east-1f
I0206 15:13:48.854296 23761 create_cluster.go:1407] Using SSH public key: /home/ec2-user/.ssh/id_rsa.pub
I0206 15:13:49.238073 23761 apply_cluster.go:542] Gossip DNS: skipping DNS validation
I0206 15:13:49.769488 23761 executor.go:103] Tasks: 0 done / 97 total; 34 can run
I0206 15:13:50.195611 23761 vfs_castore.go:736] Issuing new certificate: "ca"
I0206 15:13:50.509655 23761 vfs_castore.go:736] Issuing new certificate: "apiserver-aggregator-ca"
I0206 15:13:50.892007 23761 executor.go:103] Tasks: 34 done / 97 total; 29 can run
I0206 15:13:51.917987 23761 vfs_castore.go:736] Issuing new certificate: "kubecfg"
I0206 15:13:52.134623 23761 vfs_castore.go:736] Issuing new certificate: "kubelet-api"
I0206 15:13:52.333038 23761 vfs_castore.go:736] Issuing new certificate: "kube-proxy"
I0206 15:13:52.766449 23761 vfs_castore.go:736] Issuing new certificate: "kube-scheduler"
I0206 15:13:52.885908 23761 vfs_castore.go:736] Issuing new certificate: "kube-controller-manager"
I0206 15:13:53.122200 23761 vfs_castore.go:736] Issuing new certificate: "apiserver-proxy-client"
I0206 15:13:53.230130 23761 vfs_castore.go:736] Issuing new certificate: "apiserver-aggregator"
I0206 15:13:53.490835 23761 vfs_castore.go:736] Issuing new certificate: "kubelet"
I0206 15:13:53.513488 23761 vfs_castore.go:736] Issuing new certificate: "kops"
I0206 15:13:53.793454 23761 executor.go:103] Tasks: 63 done / 97 total; 26 can run
I0206 15:13:54.027747 23761 launchconfiguration.go:380] waiting for IAM instance profile "masters.kopscluster.k8s.local" to be ready
I0206 15:13:54.030642 23761 launchconfiguration.go:380] waiting for IAM instance profile "nodes.kopscluster.k8s.local" to be ready
I0206 15:13:54.066847 23761 launchconfiguration.go:380] waiting for IAM instance profile "masters.kopscluster.k8s.local" to be ready
I0206 15:13:54.179909 23761 launchconfiguration.go:380] waiting for IAM instance profile "masters.kopscluster.k8s.local" to be ready
I0206 15:14:04.639432 23761 executor.go:103] Tasks: 89 done / 97 total; 5 can run
I0206 15:14:04.997866 23761 vfs_castore.go:736] Issuing new certificate: "master"
I0206 15:14:05.726978 23761 executor.go:103] Tasks: 94 done / 97 total; 3 can run
I0206 15:14:06.331247 23761 executor.go:103] Tasks: 97 done / 97 total; 0 can run
I0206 15:14:06.406582 23761 update_cluster.go:290] Exporting kubecfg for cluster
kops has set your kubectl context to kopscluster.k8s.local
Cluster is starting. It should be ready in a few minutes.
Suggestions:
* validate cluster: kops validate cluster
* list nodes: kubectl get nodes --show-labels
* ssh to the master: ssh -i ~/.ssh/id_rsa admin@api.kopscluster.k8s.local
* the admin user is specific to Debian. If not using Debian please use the appropriate user based on your OS.
* read about installing addons at: https://github.com/kubernetes/kops/blob/master/docs/addons.md.
Don't try to validate the cluster yet.
First enable networking:
kubectl create -f https://git.io/weave-kube-1.6
Validate cluster. This will fail for several minutes before it works:
[ec2-user@ip-172-31-34-163 ~]$ kops validate cluster
Using cluster from kubectl context: kopscluster.k8s.local
Validating cluster kopscluster.k8s.local
INSTANCE GROUPS
NAME ROLE MACHINETYPE MIN MAX SUBNETS
master-us-east-1a Master t2.small 1 1 us-east-1a
master-us-east-1b Master t2.small 1 1 us-east-1b
master-us-east-1c Master t2.small 1 1 us-east-1c
nodes Node m4.2xlarge 2 2 us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f
NODE STATUS
NAME ROLE READY
ip-172-20-148-127.ec2.internal node True
ip-172-20-41-39.ec2.internal master True
ip-172-20-77-127.ec2.internal master True
ip-172-20-85-76.ec2.internal node True
ip-172-20-97-64.ec2.internal master True
Your cluster kopscluster.k8s.local is ready
Then enable storage:
(aws) [ec2-user@ip-172-31-34-163 ~]$ more storageclass.yml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
annotations:
storageclass.beta.kubernetes.io/is-default-class: "true"
name: gp2
provisioner: kubernetes.io/aws-ebs
parameters:
type: gp2
kubectl apply -f storageclass.yml
Create a Kubernetes secret with a Weave network password:
openssl rand -hex 128 >weave-passwd
kubectl create secret -n kube-system generic weave-passwd --from-file=./weave-passwd
kubectl patch --namespace=kube-system daemonset/weave-net --type json -p '[ { "op": "add", "path": "/spec/template/spec/containers/0/env/0", "value": { "name": "WEAVE_PASSWORD", "valueFrom": { "secretKeyRef": { "key": "weave-passwd", "name": "weave-passwd" } } } } ]'
zero-to-jupyterhub step 0 complete!
curl https://raw.githubusercontent.com/kubernetes/helm/master/scripts/get | bash
kubectl --namespace kube-system create serviceaccount tiller
kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account tiller --wait
kubectl patch deployment tiller-deploy --namespace=kube-system --type=json --patch='[{"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["/tiller", "--listen=localhost:44134"]}]'
test:
helm version
following
https://akomljen.com/kubernetes-cluster-autoscaling-on-aws/
create node instance groups for each availability zone:
I first created an IG template:
(aws) [ec2-user@ip-172-31-34-163 ~]$ more node_ig_template.yaml
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: 2019-02-01T14:32:59Z
labels:
kops.k8s.io/cluster: kopscluster.k8s.local
name: nodes-us-east-#SUBZONE#-m4-2xlarge.kopscluster.k8s.local
spec:
cloudLabels:
k8s.io/cluster-autoscaler/enabled: ""
k8s.io/cluster-autoscaler/node-template/label: ""
kubernetes.io/cluster/kopscluster.k8s.local: owned
image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
machineType: m4.2xlarge
maxPrice: "0.38"
maxSize: 50
minSize: 0
nodeLabels:
kops.k8s.io/instancegroup: nodes-us-east-#SUBZONE#-m4-2xlarge.kopscluster.k8s.local
role: Node
rootVolumeSize: 120
subnets:
- us-east-#SUBZONE#
and then I ran this script to create the IG in all six availability zones:
#!/bin/bash
for SUBZONE in 1a 1b 1c 1d 1e 1f
do
sed 's/#SUBZONE#/'"$SUBZONE"'/' node_ig_template.yaml > ig.yaml
kops create -f ig.yaml
done
Then update cluster:
kops update cluster kopscluster.k8s.local --yes
Now add IAM policy rules for the nodes:
kops edit cluster
and add the additionalPolicies group to the spec:
spec:
additionalPolicies:
node: |
[
{
"Effect": "Allow",
"Action": [
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:DescribeAutoScalingInstances",
"autoscaling:SetDesiredCapacity",
"autoscaling:DescribeLaunchConfigurations",
"autoscaling:DescribeTags",
"autoscaling:TerminateInstanceInAutoScalingGroup"
],
"Resource": ["*"]
}
]
and apply configuration:
kops update cluster --yes
Check what version of kubernetes we are using:
kubectl version
and note the ServerVersion=>GitVersion
(e.g. 1.11.6).
Then go to https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#releases and find the right CA version corresponding to your Kubernetes version (e.g. 1.11.X => 1.3.X)
Then go to:
https://github.com/kubernetes/autoscaler/releases
and find the most recent version of your CA version (e.g. 1.3.5)
Specify this in your autoscaling helm chart:
helm install --name autoscaler \
--namespace kube-system \
--set image.tag=v1.3.5 \
--set autoDiscovery.clusterName=kopscluster.k8s.local \
--set extraArgs.balance-similar-node-groups=false \
--set extraArgs.expander=random \
--set rbac.create=true \
--set rbac.pspEnabled=true \
--set awsRegion=us-east-1 \
--set nodeSelector."node-role\.kubernetes\.io/master"="" \
--set tolerations[0].effect=NoSchedule \
--set tolerations[0].key=node-role.kubernetes.io/master \
--set cloudProvider=aws \
stable/cluster-autoscaler
verify it's running:
kubectl --namespace=kube-system get pods -l "app=aws-cluster-autoscaler,release=autoscaler"
install pangeo helm chart:
helm upgrade --install esip-pangeo pangeo/pangeo --namespace esip-pangeo --version=0.1.1-ce2f7f5 -f jupyter-config-noscratch.yaml -f secret-config.yaml
find the IP:
kubectl --namespace=esip-pangeo get svc proxy-public
which in my case, produced:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
proxy-public LoadBalancer 100.64.166.60 ada21177b295e11e9a0ee0eef77e790b-963275451.us-east-1.elb.amazonaws.com 80:32541/TCP 40s
set the default namespace:
kubectl config set-context $(kubectl config current-context) --namespace=esip-pangeo
After logging into JH and verifying that the cluster scaled up using the CA-enabled IGs to meet the dask workers requested, I deleted the IG for the original 2 nodes from the initial cluster creation:
kops delete ig nodes --yes
Partially completed on gamone:
ssh gamone.whoi.edu
cd github/helm-chart/docker-images/notebook
docker images should show the newly built images.
@dbuscombe-usgs, @csherwood-usgs alerted me to the instructions for installing the Python environment for the CDI class, and I'm quite worried: combining defaults, conda, and pip in the way you are advocating is very problematic, as Filipe (@ocefpaf) can attest.
I would recommend instead that you follow Filipe's Python instructions for IOOS, but instead of using the IOOS environment.yml file, use your own environment file:
conda env create -f tf_environment.yml
where the tf_environment.yml file is:
name: tf
channels:
- conda-forge
- defaults
dependencies:
- python=3.6
- pydensecrf
- cython
- numpy
- scipy
- matplotlib
- s3fs
- scikit-image
- scikit-learn
- joblib
- tensorflow
- opencv
- ipython
- tensorflow-hub
- tqdm
I've created this environment locally. Is there a notebook I can test on? (or you can just try it. It took 5 minutes to create)
The base conda environment is missing nb_conda_kernels, which would allow us to pick alternate kernels (environments) from Jupyter.
It also might be helpful if the base environment included:
conda install xarray boto3 zarr s3fs nb_conda_kernels gcsfs
As part of the NOAA Big Data Project, the National Water Model data is now on AWS S3:
From Conor Delaney:
Just checked with CICs, the data is all there; I just didn't understand how it was structured. To get to a particular section of the data sets, use the Prefix parameter. The archive data is collated by year and the forecast is collated by day.
http://nwm-archive.s3.amazonaws.com/?prefix=2017
or
http://noaa-nwm-pds.s3.amazonaws.com/?prefix=nwm.20180416
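As a minimal sketch of the Prefix convention above (bucket and prefix names come from those URLs; the helper function is ours):

```python
# Build the browse URL for a given bucket and prefix, as in the examples above.
def nwm_prefix_url(bucket, prefix):
    return "http://{}.s3.amazonaws.com/?prefix={}".format(bucket, prefix)

archive_url = nwm_prefix_url("nwm-archive", "2017")            # archive: by year
forecast_url = nwm_prefix_url("noaa-nwm-pds", "nwm.20180416")  # forecast: by day
```

With boto3 the same prefix would presumably be passed as `s3.list_objects_v2(Bucket="nwm-archive", Prefix="2017")`.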
@jreadey what is the best way to convert some of this data to HSDS?
Here's the list of files that I would like to create the HSDS dataset from (using the RENCI opendap to illustrate the dataset):
import pandas as pd
import xarray as xr
root = 'http://tds.renci.org:8080/thredds/dodsC/nwm/forcing_short_range/' # OPeNDAP
dates = pd.date_range(start='2018-04-01T00:00', end='2018-04-07T23:00', freq='H')
urls = ['{}{}/nwm.t{}z.short_range.forcing.f001.conus.nc'.format(root,a.strftime('%Y%m%d'),a.strftime('%H')) for a in dates]
print('\n'.join(urls))
ds = xr.open_mfdataset(urls, concat_dim='time')
print(ds)
Could we just modify this somehow (perhaps using FUSE) to read the NetCDF files from S3?
We should add s3fs and h5pyd to the root conda environment for pangeo.
But let's also leave gcsfs in there in case we want to access Google Cloud Storage.
I'm here in DC at the NOAA Environmental Management Meeting, where @zflamig gave a nice presentation of the Open Commons Consortium's work on the National Water Model data. He showed this notebook which brings in the vector data necessary to marry with the NWM output to make maps:
https://github.com/occ-data/nwm-jupyter/blob/master/NWM.ipynb
I think it would be cool to rewrite the data ingest part of this to read NWM data via Zarr.
@dblodgett-usgs, is there a way to obtain the NHD vector data using services?
I'm running into an error importing the rasterio package on the current deployment of http://pangeo.esipfed.org:
import rasterio
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-df51458539a9> in <module>()
1 get_ipython().run_line_magic('matplotlib', 'inline')
----> 2 import rasterio
3 import numpy as np
4 import xarray as xr
5 import matplotlib.pyplot as plt
/opt/conda/lib/python3.6/site-packages/rasterio/__init__.py in <module>()
21 pass
22
---> 23 from rasterio._base import gdal_version
24 from rasterio.drivers import is_blacklisted
25 from rasterio.dtypes import (
ImportError: libncurses.so.6: cannot open shared object file: No such file or directory
@dbuscombe-usgs, I created a cdi-workshops github org and made you owner, so anyone you invite to the org (and who accepts) will show up on this list https://github.com/orgs/cdi-workshops/people and will have access to pangeo.esipfed.org. Sound like a plan?
To create a custom environment:
Ensure that nb_conda_kernels
is installed in the base conda environment. If it's not, ask the JupyterHub provider to include it!
In JupyterHub, edit your ~/.condarc to specify a persisted directory (like /home/jovyan/my-conda-envs) for your environments:
channels:
- conda-forge
envs_dirs:
- /home/jovyan/my-conda-envs
Create your new environment, making sure to include the ipykernel package:
conda create -n my_new_custom_env my_package_1 my_package_2 ipykernel
Stop and start your server.
You should now be able to see your custom env on the kernel pick list:
It's a bit awkward using http://<38-characters>.us-west-2.elb.amazonaws.com/hub/home
I was unable to start my JupyterHub server on pangeo.esipfed.org last night. It was telling me to try restarting from the hub (http://pangeo.esipfed.org/hub/home) but it didn't work, giving me 500 and 503 errors.
When I checked the pods, I saw this:
(IOOS3) rsignell@gamone:~> kubectl get pods -n esip-dev
NAME READY STATUS RESTARTS AGE
hub-67cf8994b6-9qbm2 1/1 Running 0 1h
jupyter-marasophia 1/1 Running 0 33m
jupyter-rsignell-2dusgs 0/1 PodInitializing 0 23m
jupyter-srharrison 1/1 Running 0 1h
proxy-749c488cd4-nd2b9 1/1 Running 0 1h
but although my pod said PodInitializing, it seemed hung.
Write 80GB or so of NWM data to Zarr format on S3. This should be sufficient for initial testing and demos.