Comments (14)
In a moment of desperation, I tore the user-pool down and rebuilt it from scratch. Lo and behold, the user-pool scaled up and back down to zero nodes without incident. The stackdriver-metadata-agent-cluster-level pod never came up in the new node-pool, so I'm wondering (not sure) if that was the offending pod. In any event, I think I'm going to close this since we seem to be operating as expected now.
Thanks @consideRatio and @yuvipanda for the input.
from pangeo-binder.
Working with the cloud is 1% inspiration, 49% perspiration, and 50% desperation, so I approve of this solution.
from pangeo-binder.
Something I often do to try to debug these is to cordon the node, SSH in, and just restart it. This always moves the critical pods to other nodes without you having to figure out which ones they are.
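A rough sketch of that workflow on GKE; the node name and zone below are placeholders, not values from this issue:
# stop new pods from being scheduled on the node
kubectl cordon <node-name>
# restart the underlying GCE instance (assumes the k8s node name matches the instance name, as it does on GKE)
gcloud compute ssh <node-name> --zone <zone> --command 'sudo reboot'
# allow scheduling again once it is back, or leave it cordoned for the autoscaler to remove
kubectl uncordon <node-name>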
from pangeo-binder.
@jhamman hmmmm hmmmm, I'm somewhat new to the BinderHub world, and dind especially. They are daemonsets, right? I don't understand why this does not scale down then.
I would inspect the cluster autoscaler's status:
# look under the NodeGroups section for the relevant node-pool
# does it identify scale-down candidates, for example?
kubectl get cm -n kube-system cluster-autoscaler-status -o yaml
NodeGroups:
Name: https://content.googleapis.com/compute/v1/projects/ds-platform/zones/europe-west4-a/instanceGroups/gke-ds-platform-users-e99bd23e-grp
Health: Healthy (ready=1 unready=0 notStarted=0 longNotStarted=0 registered=1 longUnregistered=0 cloudProviderTarget=1 (minSize=0, maxSize=3))
LastProbeTime: 2019-05-26 16:06:29.451821679 +0000 UTC m=+24264.965654891
LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
ScaleUp: NoActivity (ready=1 cloudProviderTarget=1)
LastProbeTime: 2019-05-26 16:06:29.451821679 +0000 UTC m=+24264.965654891
LastTransitionTime: 2019-05-26 15:47:49.0528279 +0000 UTC m=+23144.566661089
ScaleDown: NoCandidates (candidates=0)
LastProbeTime: 2019-05-26 16:06:29.451821679 +0000 UTC m=+24264.965654891
LastTransitionTime: 2019-05-26 15:47:49.0528279 +0000 UTC m=+23144.566661089
I'd also learn more about the image cleaner and the dind pods (a few commands for checking each point are sketched after this list):
- Are they from daemonsets?
- Do they have a PDB targeting them?
- Do they have cluster-autoscaler-related annotations on them saying something like system-critical or safe-to-evict (that last annotation is deprecated, I think)?
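A hedged sketch of how those three checks might look; the pod and namespace names are placeholders:
# are the dind / image-cleaner pods owned by daemonsets?
kubectl get daemonsets --all-namespaces
# is any PodDisruptionBudget targeting them?
kubectl get pdb --all-namespaces
# do the pods carry cluster-autoscaler related annotations?
kubectl describe pod <dind-pod-name> -n <namespace> | grep -A3 'Annotations'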
from pangeo-binder.
For some reason, the cloudProviderTarget is 1 node in this state:
apiVersion: v1
data:
status: |+
Cluster-autoscaler status at 2019-05-26 21:58:05.063879692 +0000 UTC:
Cluster-wide:
Health: Healthy (ready=4 unready=0 notStarted=0 longNotStarted=0 registered=4 longUnregistered=0)
LastProbeTime: 2019-05-26 21:58:04.287003384 +0000 UTC m=+862953.793784676
LastTransitionTime: 2019-05-16 22:15:58.534748738 +0000 UTC m=+28.041530042
ScaleUp: NoActivity (ready=4 registered=4)
LastProbeTime: 2019-05-26 21:58:04.287003384 +0000 UTC m=+862953.793784676
LastTransitionTime: 2019-05-26 04:47:45.767323699 +0000 UTC m=+801135.274104985
ScaleDown: NoCandidates (candidates=0)
LastProbeTime: 2019-05-26 21:58:04.287003384 +0000 UTC m=+862953.793784676
LastTransitionTime: 2019-05-26 05:13:13.608632799 +0000 UTC m=+802663.115414109
NodeGroups:
Name: https://content.googleapis.com/compute/v1/projects/pangeo-181919/zones/us-central1-b/instanceGroups/gke-binder-user-pool-45fded5e-grp
Health: Healthy (ready=1 unready=0 notStarted=0 longNotStarted=0 registered=1 longUnregistered=0 cloudProviderTarget=1 (minSize=0, maxSize=25))
LastProbeTime: 2019-05-26 21:58:04.287003384 +0000 UTC m=+862953.793784676
LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
ScaleUp: NoActivity (ready=1 cloudProviderTarget=1)
LastProbeTime: 2019-05-26 21:58:04.287003384 +0000 UTC m=+862953.793784676
LastTransitionTime: 2019-05-24 09:16:46.00873405 +0000 UTC m=+644475.515515363
ScaleDown: NoCandidates (candidates=0)
LastProbeTime: 2019-05-26 21:58:04.287003384 +0000 UTC m=+862953.793784676
LastTransitionTime: 2019-05-24 09:47:19.168821679 +0000 UTC m=+646308.675602967
Are they from daemonsets?
AFAICT, both the image-cleaner and dind pods are daemonsets.
Do they have a PDB targeting them?
How would I check this? Googling...
I don't think so.
NAMESPACE     NAME                                    MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
kube-system   heapster-pdb                            N/A             1                 1                     233d
kube-system   metrics-server-pdb                      N/A             1                 1                     233d
kube-system   tiller-deploy-pdb                       N/A             1                 1                     233d
prod          binderhub                               1               N/A               0                     76d
prod          hub                                     1               N/A               0                     76d
prod          prod-nginx-ingress-controller           1               N/A               1                     76d
prod          prod-nginx-ingress-default-backend      0               N/A               1                     76d
prod          proxy                                   1               N/A               0                     76d
prod          user-scheduler                          1               N/A               1                     76d
staging       binderhub                               1               N/A               0                     3d1h
staging       hub                                     1               N/A               0                     3d1h
staging       proxy                                   1               N/A               0                     3d1h
staging       staging-nginx-ingress-controller        1               N/A               1                     3d1h
staging       staging-nginx-ingress-default-backend   0               N/A               1                     3d1h
staging       user-scheduler                          1               N/A               0                     3d1h
Do they have cluster-autoscaler-related annotations on them saying something like system-critical or safe-to-evict (that last annotation is deprecated, I think)?
Same. Googling...
I don't think so.
dind & image cleaner:
Annotations: <none>
from pangeo-binder.
Ah, but the following kube-system pods do have this annotation:
Annotations: scheduler.alpha.kubernetes.io/critical-pod:
- fluentd-gcp-v3.1.1
- kube-proxy-gke
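If it helps, a quick-and-dirty way to see which kube-system pods carry that annotation (just grepping the describe output):
# print each pod's Name line plus any critical-pod annotation lines
kubectl describe pods -n kube-system | grep -E '^Name:|critical-pod'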
from pangeo-binder.
Hmmm, but I think these don't block scale-down for me... Hmmm, I recall there was a way to read the cluster-autoscaler binary's logs...
I found:
- https://cloud.google.com/compute/docs/autoscaler/viewing-autoscaler-logs
- https://cloud.google.com/compute/docs/autoscaler/managing-autoscalers#update_an_autoscaler
- https://cloud.google.com/compute/docs/autoscaler/understanding-autoscaler-decisions
Oh, autoscaling is off, but this is the wrong kind of autoscaling... this is not Kubernetes-based autoscaling... WHOOPS! Sorry for the distraction. This was irrelevant, as the k8s-based autoscaler is another abstraction layer higher than this.
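For what it's worth, on GKE the Kubernetes cluster autoscaler is configured on the node pool rather than on the managed instance group, so that setting is the one to look at. Something like this should show it; the cluster, pool, and zone names are guesses based on the node-group name above:
# show the node pool's autoscaling settings as GKE sees them
gcloud container node-pools describe user-pool --cluster binder --zone us-central1-b --format='value(autoscaling)'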
from pangeo-binder.
This is relevant, kinda what I tried writing up from memory before:
https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler#scheduling-and-disruption
from pangeo-binder.
Regarding the system-critical pods, hmmm...
They are kube-system pods that are run on the node by default, though... Otherwise they could block scale-down by being:
Kube-system pods that:
- are not run on the node by default, *
from pangeo-binder.
I don't get this either. I wonder if it relates to having a lot of unschedulable dask-jovyan pods around for some reason... Hmmm
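If that hunch is worth checking, a quick sketch for spotting stuck pods:
# list pods that are still Pending across all namespaces
kubectl get pods --all-namespaces --field-selector=status.phase=Pending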
from pangeo-binder.
jupyterhub/mybinder.org-deploy#901
from pangeo-binder.
The API server logs are attached.
kube-apiserver.log.zip
I'm starting to wonder if the presence of daemonsets from two namespaces (prod and staging) is confusing the scheduler here. I say this because mybinder.org-deploy uses two separate clusters for its prod and staging deployments, so it never has daemonsets from both namespaces in one cluster. I wonder if this is why we're hitting it? Maybe @betatim or @yuvipanda can comment.
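A quick, hedged way to compare what daemonsets the two releases are contributing:
# list the daemonsets deployed in each namespace
kubectl get daemonsets -n prod
kubectl get daemonsets -n staging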
from pangeo-binder.
MyBinder.org runs on two clusters because I was young and idealistic. It might be related to this issue, but no idea!
from pangeo-binder.
Oh, I wonder why that pod was allowed to live on our tainted node without being evictable; sounds weird.
from pangeo-binder.