
Comments (14)

jhamman commented on July 30, 2024

In a moment of desperation, I tore the user-pool down and rebuilt it from scratch. Lo and behold, the user-pool scaled up and back down to zero nodes without incident. The stackdriver-metadata-agent-cluster-level pod never came up in the new node-pool so I'm wondering (not sure) if that was the offending pod. In any event, I think I'm going to close this since we seem to be operating as expected now.

Thanks @consideRatio and @yuvipanda for the input.


yuvipanda commented on July 30, 2024

Working with the cloud is 1% inspiration, 49% perspiration and 50% desperation so I approve of this solution.


yuvipanda commented on July 30, 2024

Something I often do to try to debug these is to cordon the node, SSH in, and just restart the node. This always moves critical pods to other nodes without you having to figure out which ones they are.
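Roughly, on GKE that looks something like this (node name and zone below are placeholders):

# mark the node unschedulable so no new pods land on it
kubectl cordon gke-binder-user-pool-45fded5e-xxxx

# ssh to the underlying GCE instance (placeholder name/zone) and restart it
gcloud compute ssh gke-binder-user-pool-45fded5e-xxxx --zone us-central1-b
sudo reboot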


consideRatio commented on July 30, 2024

@jhamman hmmmm, I'm somewhat new to the BinderHub world, and to dind especially. They are DaemonSets, right? I don't understand why this doesn't scale down, then.

I would inspect the cluster autoscaler's status:

# look under the NodeGroups title for the node-pool of relevance
# does it identify scale-down candidates, for example?
kubectl get cm -n kube-system cluster-autoscaler-status -o yaml
    NodeGroups:
      Name:        https://content.googleapis.com/compute/v1/projects/ds-platform/zones/europe-west4-a/instanceGroups/gke-ds-platform-users-e99bd23e-grp
      Health:      Healthy (ready=1 unready=0 notStarted=0 longNotStarted=0 registered=1 longUnregistered=0 cloudProviderTarget=1 (minSize=0, maxSize=3))
                   LastProbeTime:      2019-05-26 16:06:29.451821679 +0000 UTC m=+24264.965654891
                   LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
      ScaleUp:     NoActivity (ready=1 cloudProviderTarget=1)
                   LastProbeTime:      2019-05-26 16:06:29.451821679 +0000 UTC m=+24264.965654891
                   LastTransitionTime: 2019-05-26 15:47:49.0528279 +0000 UTC m=+23144.566661089
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2019-05-26 16:06:29.451821679 +0000 UTC m=+24264.965654891
                   LastTransitionTime: 2019-05-26 15:47:49.0528279 +0000 UTC m=+23144.566661089

I'd also learn more about the image cleaner and the dind pods (some ways to check each point are sketched after this list):

  • Are they from DaemonSets?
  • Do they have a PDB targeting them?
  • Do they have any cluster-autoscaler-related annotations on them, saying something like system-critical or safe-to-evict (this last annotation is deprecated, I think)?
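A rough sketch of how I'd check each of those, assuming the pods live in the prod namespace and are named something like dind-xxxxx (placeholder):

# 1. owned by a DaemonSet? inspect the pod's ownerReferences
kubectl get pod dind-xxxxx -n prod -o jsonpath='{.metadata.ownerReferences[*].kind}'

# 2. any PodDisruptionBudget targeting them?
kubectl get pdb --all-namespaces

# 3. any cluster-autoscaler related annotations on the pod?
kubectl get pod dind-xxxxx -n prod -o jsonpath='{.metadata.annotations}'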


jhamman commented on July 30, 2024

For some reason, the cloudProviderTarget is 1 node in this state:

apiVersion: v1
data:
  status: |+
    Cluster-autoscaler status at 2019-05-26 21:58:05.063879692 +0000 UTC:
    Cluster-wide:
      Health:      Healthy (ready=4 unready=0 notStarted=0 longNotStarted=0 registered=4 longUnregistered=0)
                   LastProbeTime:      2019-05-26 21:58:04.287003384 +0000 UTC m=+862953.793784676
                   LastTransitionTime: 2019-05-16 22:15:58.534748738 +0000 UTC m=+28.041530042
      ScaleUp:     NoActivity (ready=4 registered=4)
                   LastProbeTime:      2019-05-26 21:58:04.287003384 +0000 UTC m=+862953.793784676
                   LastTransitionTime: 2019-05-26 04:47:45.767323699 +0000 UTC m=+801135.274104985
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2019-05-26 21:58:04.287003384 +0000 UTC m=+862953.793784676
                   LastTransitionTime: 2019-05-26 05:13:13.608632799 +0000 UTC m=+802663.115414109

    NodeGroups:
      Name:        https://content.googleapis.com/compute/v1/projects/pangeo-181919/zones/us-central1-b/instanceGroups/gke-binder-user-pool-45fded5e-grp
      Health:      Healthy (ready=1 unready=0 notStarted=0 longNotStarted=0 registered=1 longUnregistered=0 cloudProviderTarget=1 (minSize=0, maxSize=25))
                   LastProbeTime:      2019-05-26 21:58:04.287003384 +0000 UTC m=+862953.793784676
                   LastTransitionTime: 0001-01-01 00:00:00 +0000 UTC
      ScaleUp:     NoActivity (ready=1 cloudProviderTarget=1)
                   LastProbeTime:      2019-05-26 21:58:04.287003384 +0000 UTC m=+862953.793784676
                   LastTransitionTime: 2019-05-24 09:16:46.00873405 +0000 UTC m=+644475.515515363
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2019-05-26 21:58:04.287003384 +0000 UTC m=+862953.793784676
                   LastTransitionTime: 2019-05-24 09:47:19.168821679 +0000 UTC m=+646308.675602967

Are they from DaemonSets?

AFAICT, both the image-cleaner and dind pods are from DaemonSets.

Do they have a PDB targeting them?

How would I check this? Googling...

I don't think so.
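Listing the PDBs across all namespaces (roughly, with something like this):

kubectl get pdb --all-namespaces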

NAMESPACE     NAME                                    MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
kube-system   heapster-pdb                            N/A             1                 1                     233d
kube-system   metrics-server-pdb                      N/A             1                 1                     233d
kube-system   tiller-deploy-pdb                       N/A             1                 1                     233d
prod          binderhub                               1               N/A               0                     76d
prod          hub                                     1               N/A               0                     76d
prod          prod-nginx-ingress-controller           1               N/A               1                     76d
prod          prod-nginx-ingress-default-backend      0               N/A               1                     76d
prod          proxy                                   1               N/A               0                     76d
prod          user-scheduler                          1               N/A               1                     76d
staging       binderhub                               1               N/A               0                     3d1h
staging       hub                                     1               N/A               0                     3d1h
staging       proxy                                   1               N/A               0                     3d1h
staging       staging-nginx-ingress-controller        1               N/A               1                     3d1h
staging       staging-nginx-ingress-default-backend   0               N/A               1                     3d1h
staging       user-scheduler                          1               N/A               0                     3d1h

Do they have any cluster-autoscaler-related annotations on them, saying something like system-critical or safe-to-evict (this last annotation is deprecated, I think)?

Same. Googling...

I don't think so.

dind & image cleaner:

Annotations:        <none>


jhamman commented on July 30, 2024

Ah, but the following kube-system pods do have this annotation:

Annotations:        scheduler.alpha.kubernetes.io/critical-pod:
  • fluentd-gcp-v3.1.1
  • kube-proxy-gke


consideRatio commented on July 30, 2024

Hmmm, but I think these don't block scale-down for me... Hmmm, I recall there was a way to read the cluster-autoscaler binary's logs...
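If I remember right, on GKE the autoscaler runs on the managed master, so you can't read its logs from inside the cluster, but it does record events you can look at, something like:

# events emitted by the cluster autoscaler (scale-up / scale-down decisions)
kubectl get events -n kube-system | grep -i cluster-autoscaler

# the status configmap also gets events attached to it
kubectl describe configmap cluster-autoscaler-status -n kube-system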

I found:

Oh, autoscaling is off, but this is the wrong kind of autoscaling... this is not Kubernetes-based autoscaling... Whoops! Sorry for the distraction. This was irrelevant, as the k8s-based autoscaler is another abstraction layer higher than this.



consideRatio commented on July 30, 2024

This is relevant; it's kinda what I tried to write up from memory before:
https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler#scheduling-and-disruption
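The gist of that page: the autoscaler won't remove a node while it runs pods it isn't allowed to evict, e.g. kube-system pods without a PDB, pods not backed by a controller, or pods marked as not safe to evict. If the dind / image-cleaner pods turn out to be the blockers, one thing to try (a sketch; the pod name is a placeholder) would be to explicitly mark them as evictable:

# allow the cluster autoscaler to evict this pod during scale-down
# (for a DaemonSet you'd put the annotation on the pod template instead)
kubectl annotate pod dind-xxxxx -n prod \
  cluster-autoscaler.kubernetes.io/safe-to-evict=true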


consideRatio commented on July 30, 2024

Regarding the system-critical pods, hmmm...

They are kube-system pods that are run on the node by default, though... Otherwise they could block scale-down by being:

Kube-system pods that:
- are not run on the node by default, *


consideRatio commented on July 30, 2024

I don't get this either. I wonder if it relates to having a lot of unschedulable dask-jovyan pods around for some reason... Hmmm
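A quick way to check for that would be to list pods stuck in Pending, something like:

# any unschedulable pods hanging around?
kubectl get pods --all-namespaces --field-selector status.phase=Pending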


consideRatio commented on July 30, 2024

jupyterhub/mybinder.org-deploy#901


jhamman commented on July 30, 2024

The API server logs are attached:
kube-apiserver.log.zip

I'm starting to wonder if the presence of DaemonSets from two namespaces (prod and staging) is confusing the scheduler here. I say this because mybinder.org-deploy uses two separate clusters for its prod and staging deployments, and I wonder if this is why. Maybe @betatim or @yuvipanda can comment.


yuvipanda commented on July 30, 2024

MyBinder.org runs on two clusters because I was young and idealistic. It might be related to this issue, but no idea!


consideRatio commented on July 30, 2024

Oh, I wonder why that pod was allowed to live on our tainted node without being evictable; sounds weird.

