
Kube Descheduler Operator

Run the descheduler in your OpenShift cluster to move pods based on specific strategies.

Releases

| kdo version | ocp version | k8s version | golang |
| --- | --- | --- | --- |
| 5.0.0 | 4.15, 4.16 | 1.28 | 1.20 |
| 5.0.1 | 4.15, 4.16 | 1.29 | 1.21 |

Deploy the operator

Quick Development

  1. Build and push the operator image to a registry (the commands are the same as in the OperatorHub section below; see the sketch after this list).
  2. Ensure the image spec in deploy/05_deployment.yaml refers to the operator image you pushed.
  3. Run oc create -f deploy/.
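
A minimal end-to-end sketch of these three steps, assuming QUAY_USER and IMAGE_TAG are exported as in the next section and that the sed expression matches the image line in the deployment manifest:

    # 1. Build and push the operator image
    podman build -t quay.io/${QUAY_USER}/cluster-kube-descheduler-operator:${IMAGE_TAG} -f Dockerfile.rhel7
    podman push quay.io/${QUAY_USER}/cluster-kube-descheduler-operator:${IMAGE_TAG}

    # 2. Point deploy/05_deployment.yaml at the image you just pushed
    sed -i "s|image:.*|image: quay.io/${QUAY_USER}/cluster-kube-descheduler-operator:${IMAGE_TAG}|" deploy/05_deployment.yaml

    # 3. Create the operator resources
    oc create -f deploy/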

OperatorHub install with custom index image

This process builds the operator so that it can be installed locally via OperatorHub with a custom index image.

  1. Build and push the operator image to a registry:

    export QUAY_USER=${your_quay_user_id}
    export IMAGE_TAG=${your_image_tag}
    podman build -t quay.io/${QUAY_USER}/cluster-kube-descheduler-operator:${IMAGE_TAG} -f Dockerfile.rhel7
    podman login quay.io -u ${QUAY_USER}
    podman push quay.io/${QUAY_USER}/cluster-kube-descheduler-operator:${IMAGE_TAG}
  2. Export your desired/current version:

    export OPERATOR_VERSION=${your_version}
  3. Update the .spec.install.spec.deployments[0].spec.template.spec.containers[0].image field in the CSV under ./manifests/${OPERATOR_VERSION}/cluster-kube-descheduler-operator.v${OPERATOR_VERSION}.0.clusterserviceversion.yaml to point to the newly built image.

  4. Build and push the metadata image to a registry (e.g. https://quay.io):

    podman build -t quay.io/${QUAY_USER}/cluster-kube-descheduler-operator-metadata:${IMAGE_TAG} -f Dockerfile.metadata .
    podman push quay.io/${QUAY_USER}/cluster-kube-descheduler-operator-metadata:${IMAGE_TAG}
  5. Build and push the image index for operator-registry (pull and build https://github.com/operator-framework/operator-registry/ to get the opm binary):

    opm index add --bundles quay.io/${QUAY_USER}/cluster-kube-descheduler-operator-metadata:${IMAGE_TAG} --tag quay.io/${QUAY_USER}/cluster-kube-descheduler-operator-index:${IMAGE_TAG}
    podman push quay.io/${QUAY_USER}/cluster-kube-descheduler-operator-index:${IMAGE_TAG}

    Don't forget to increase the number of open files, e.g. ulimit -n 100000, in case the current limit is insufficient.

  6. Create and apply the catalogsource manifest (remember to change <<QUAY_USER>> and <<IMAGE_TAG>> to your own values):

    apiVersion: operators.coreos.com/v1alpha1
    kind: CatalogSource
    metadata:
      name: cluster-kube-descheduler-operator
      namespace: openshift-marketplace
    spec:
      sourceType: grpc
      image: quay.io/<<QUAY_USER>>/cluster-kube-descheduler-operator-index:<<IMAGE_TAG>>
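
    For example, after saving the manifest (with the placeholders substituted) as catalogsource.yaml:

    oc apply -f catalogsource.yaml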
  7. Create the openshift-kube-descheduler-operator namespace:

    $ oc create ns openshift-kube-descheduler-operator
    
  8. Open the console, navigate to Operators -> OperatorHub, search for the descheduler operator, and install it. Alternatively, install from the CLI as sketched below.
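
A CLI-based alternative to the console install, as a sketch only: the package and channel names below are assumptions, so verify them first with oc get packagemanifests -n openshift-marketplace.

oc apply -f - <<'EOF'
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: descheduler-operator-group
  namespace: openshift-kube-descheduler-operator
spec:
  targetNamespaces:
  - openshift-kube-descheduler-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: cluster-kube-descheduler-operator
  namespace: openshift-kube-descheduler-operator
spec:
  source: cluster-kube-descheduler-operator  # the CatalogSource created above
  sourceNamespace: openshift-marketplace
  name: cluster-kube-descheduler-operator    # assumed package name
  channel: mainline                          # assumed channel name
EOF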

Sample CR

A sample CR definition looks like below (the operator expects a CR named cluster in the openshift-kube-descheduler-operator namespace):

apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  deschedulingIntervalSeconds: 1800
  profiles:
  - AffinityAndTaints
  - LifecycleAndUtilization
  profileCustomizations:
    podLifetime: 5m
    namespaces:
      included:
      - ns1
      - ns2

The operator spec provides a profiles field, which allows users to set one or more descheduling profiles to enable.

These profiles map to preconfigured policy definitions that enable several descheduler strategies grouped by intent; all enabled profiles are merged.
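
As a usage sketch, assuming the sample CR above is saved as kubedescheduler.yaml:

oc apply -f kubedescheduler.yaml

# the operator should render a policy configmap and a descheduler deployment
oc get kubedescheduler,configmap,deployment -n openshift-kube-descheduler-operator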

Profiles

The following profiles are currently provided:

  • AffinityAndTaints
  • TopologyAndDuplicates
  • SoftTopologyAndDuplicates
  • LifecycleAndUtilization

Along with the following profiles, which are in development and may change:

  • DevPreviewLongLifecycle
  • EvictPodsWithPVC
  • EvictPodsWithLocalStorage

Each of these enables cluster-wide descheduling (excluding openshift and kube-system namespaces) based on certain goals.

AffinityAndTaints

This is the most basic descheduling profile; it removes running pods that violate pod and node affinity rules and node taints.

This profile enables the RemovePodsViolatingInterPodAntiAffinity, RemovePodsViolatingNodeAffinity, and RemovePodsViolatingNodeTaints strategies.

TopologyAndDuplicates

This profile attempts to balance pod distribution based on topology constraint definitions, and evicts duplicate copies of the same pod running on the same node. It enables the RemovePodsViolatingTopologySpreadConstraints and RemoveDuplicates strategies.

SoftTopologyAndDuplicates

This profile is the same as TopologyAndDuplicates, however it will also consider pods with "soft" topology constraints for eviction (i.e., whenUnsatisfiable: ScheduleAnyway).
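
For reference, a minimal pod with a soft topology spread constraint looks like this (a sketch; the names, labels, and image are illustrative only):

oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: soft-spread-demo
  labels:
    app: soft-spread-demo
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway  # "soft"; DoNotSchedule would make it "hard"
    labelSelector:
      matchLabels:
        app: soft-spread-demo
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9
EOF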

LifecycleAndUtilization

This profile focuses on pod lifecycles and node resource consumption. It will evict any running pod older than 24 hours and attempts to evict pods from "high utilization" nodes that can fit onto "low utilization" nodes. A high utilization node is any node consuming more than 50% of its available cpu, memory, or pod capacity. A low utilization node is any node with less than 20% of its available cpu, memory, and pod capacity.

This profile enables the LowNodeUtilization, RemovePodsHavingTooManyRestarts, and PodLifeTime strategies. In the future, more configuration may be made available through the operator for these strategies based on user feedback.

DevPreviewLongLifecycle

This profile provides cluster resource balancing similar to LifecycleAndUtilization for longer-running clusters. It does not evict pods based on the 24 hour lifetime used by LifecycleAndUtilization.

EvictPodsWithPVC

By default, the operator prevents pods with PVCs from being evicted. Enabling this profile in combination with any of the above profiles allows pods with PVCs to be eligible for eviction.

EvictPodsWithLocalStorage

By default, pods with local storage are not eligible to be considered for eviction by any profile. Using this profile allows them to be evicted if necessary. A pod is defined as using local storage if any of its volumes have HostPath or EmptyDir set (note that a pod that only uses PVCs does not fit this definition, and will need the EvictPodsWithPVC profile instead. Pods that use both will need both profiles to be evicted).
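
For example, a sketch of a CR that makes pods with PVCs or local storage eligible for eviction under the LifecycleAndUtilization profile:

oc apply -f - <<'EOF'
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  deschedulingIntervalSeconds: 1800
  profiles:
  - LifecycleAndUtilization
  - EvictPodsWithPVC           # pods with PVCs become eligible
  - EvictPodsWithLocalStorage  # pods with HostPath/EmptyDir volumes become eligible
EOF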

Profile Customizations

Some profiles expose options which may be used to configure the underlying Descheduler strategy parameters. These are available under the profileCustomizations field:

| Name | Type | Description |
| --- | --- | --- |
| podLifetime | time.Duration | Sets the lifetime value for pods evicted by the LifecycleAndUtilization profile |
| thresholdPriorityClassName | string | Sets the priority threshold, by priority class name, for all strategies |
| thresholdPriority | string | Sets the priority threshold, by value, for all strategies |
| namespaces.included, namespaces.excluded | []string | Sets the included/excluded namespaces for all strategies (included namespaces may not contain the protected namespaces, which consist of kube-system, hypershift, and all openshift- prefixed namespaces) |
| devLowNodeUtilizationThresholds | string | Sets experimental thresholds for the LowNodeUtilization strategy of the LifecycleAndUtilization profile, in the following ratios: Low for 10%:30%, Medium for 20%:50%, High for 40%:70% |
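
A sketch combining several of these customizations (the values are illustrative; podLifetime takes a duration string as in the sample CR, and devLowNodeUtilizationThresholds takes Low, Medium, or High as described above):

oc apply -f - <<'EOF'
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  deschedulingIntervalSeconds: 1800
  profiles:
  - LifecycleAndUtilization
  profileCustomizations:
    podLifetime: 48h                         # evict pods older than 48 hours instead of 24
    namespaces:
      excluded:
      - my-quiet-namespace                   # never deschedule pods here
    devLowNodeUtilizationThresholds: Medium  # 20%:50% under/over-utilization thresholds
EOF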

Descheduling modes

The operator provides two modes of eviction:

  • Predictive: configures the descheduler to only simulate eviction
  • Automatic: configures the descheduler to evict pods

Predictive mode is the default. In either mode the descheduler still produces metrics (unless metrics are disabled); when predictive mode is configured, the reported metrics serve as an estimate of how many pods would be evicted in the cluster.
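
Assuming the mode field accepts the two values named above (a sketch; check the CRD on your cluster to confirm):

oc apply -f - <<'EOF'
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  deschedulingIntervalSeconds: 1800
  profiles:
  - AffinityAndTaints
  mode: Automatic  # actually evict pods; Predictive (the default) only simulates
EOF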

How does the descheduler operator work?

At a high level, the descheduler operator is responsible for watching the above CR and, in response:

  • Creating a configmap to be consumed by the descheduler.
  • Running the descheduler as a deployment, mounting the configmap as a policy file in the pod.

The configmap created from the above sample CR definition looks like this:

apiVersion: descheduler/v1alpha1
kind: DeschedulerPolicy
strategies:
  RemovePodsViolatingInterPodAntiAffinity:
    enabled: true
    ...
  RemovePodsViolatingNodeAffinity:
    enabled: true
    params:
      ...
      nodeAffinityType:
      - requiredDuringSchedulingIgnoredDuringExecution
  RemovePodsViolatingNodeTaints:
    enabled: true
    ...

(Some generated parameters omitted.)
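
To inspect what was generated on a live cluster (the configmap name may vary by operator version, so list first and then fetch by name):

oc get configmap -n openshift-kube-descheduler-operator
oc get configmap <name-from-the-list-above> -n openshift-kube-descheduler-operator -o yaml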

Parameters

The Descheduler operator exposes the following parameters in its CRD:

  • deschedulingIntervalSeconds - sets the number of seconds between descheduler runs
  • profiles - sets which descheduler strategy profiles are enabled
  • profileCustomizations - contains various parameters for modifying the default behavior of certain profiles
  • mode - configures the descheduler to either evict pods or to simulate the eviction


cluster-kube-descheduler-operator's Issues

Run unit and e2e tests in CI

As of now, we don't have proper gating for changes going into the repo. We'd like to at least be at a stage where

  • Travis CI is running all the unit tests.
  • e2e's are running in openshift CI.

As of now, there are 2 issues blocking e2e CI setup.

  • Operator SDK supporting local testing. We'd like to have e2es run locally (without building container images) using the Operator SDK. While this is not a complete blocker, it becomes difficult to manage the registry to which we push images for every PR.

Ref: operator-framework/operator-sdk#745

  • How to integrate with CI? How can OLM pull the bits provided in the PR for running tests? Quoting Evan from the OLM team:

We’re planning to add some easier methods to inject things into catalogs (e.g. just write out a new CR in a cluster describing the catalog entry)

Both of the above items are WIP with the respective teams.

/cc @sjenning

system-cluster-critical pod forbidden to run

I just deployed the descheduler operator from OperatorHub and got this event from the job:

Error creating: pods "example-descheduler-1-1571692440-" is forbidden: pods with system-cluster-critical priorityClass is not permitted in descheduler namespace

descheduler is a normal project I created to run the operator. There was no special instruction on where the operator should be run. What am I doing wrong?
Also, as a result of this issue I now have several pending jobs; this should probably not be happening.

Resources are not configurable

The OpenShift descheduler resources are preconfigured:

containers:
- resources:
    limits:
      cpu: 100m
      memory: 500Mi
    requests:
      cpu: 100m
      memory: 500Mi

On large clusters this leads to OOM pod restarts. It would be good to at least be able to set our own limits with the KubeDescheduler custom resource.

LowNodeUtilization "NumberOfNodes" not working

In the example and in the documentation, the parameter "NumberOfNodes" is used in many places for the LowNodeUtilization strategy, but it does not work: the code switches on the lowercased parameter name yet tests against a value that is not lowercase. Using "nodes" as the parameter name works (lowercasing the case value to "numberofnodes" would fix it).

This is due to these lines:

switch strings.ToLower(param.Name) {
case "cputhreshold":
    thresholds[v1.ResourceCPU] = deschedulerapi.Percentage(value)
case "memorythreshold":
    thresholds[v1.ResourceMemory] = deschedulerapi.Percentage(value)
case "podsthreshold":
    thresholds[v1.ResourcePods] = deschedulerapi.Percentage(value)
case "cputargetthreshold":
    targetThresholds[v1.ResourceCPU] = deschedulerapi.Percentage(value)
case "memorytargetthreshold":
    targetThresholds[v1.ResourceMemory] = deschedulerapi.Percentage(value)
case "podstargetthreshold":
    targetThresholds[v1.ResourcePods] = deschedulerapi.Percentage(value)
case "nodes", "numberOfNodes":
    utilizationThresholds.NumberOfNodes = value
}

Switch to actual upstream Descheduler policy

We currently have our own config API in the operator that differs from the upstream Descheduler API. For example, the operator needs to be configured with a strategies field like:

apiVersion: operator.openshift.io/v1beta1
kind: KubeDescheduler
metadata:
  name: config
  namespace: openshift-kube-descheduler-operator
spec:
  strategies:
    - name: "RemoveDuplicates"
    - name: "RemovePodsHavingTooManyRestarts"
      params:
       - name: "PodRestartThreshold"
         value: "10"
       - name: "IncludingInitContainers"
         value: "false"

which just gets internally translated into:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
     enabled: true
  "RemovePodsHavingTooManyRestarts":
     enabled: true
     params:
       podsHavingTooManyRestarts:
         podRestartThreshold: 10
         includingInitContainers: false

So, while our own API is slightly simpler in its definition, in practice it must be manually converted, which adds complexity to the codebase. It also means that we need to constantly update our own operator code to support new strategies and parameters as they are added upstream, doubling the work required to add a new feature.

In addition, it is confusing to users that the config API differs when using our operator versus running the descheduler on their own, which could inhibit adoption of the operator. It would be much simpler to just point to the upstream docs for configuring the descheduler.

This is why I propose a field called policy in the operator spec, which would point to a configmap containing an actual descheduler policy (along with an optional namespace field, which defaults to openshift-config). This would match the design of the scheduler operator, which has a Policy field pointing to a configmap with a regular scheduler policy (see the OpenShift docs on how to deploy the scheduler operator with a custom policy; this would be the exact same design).

I think we will still need to support the current v1beta1 config API until it can be fully deprecated, but this shift will save us effort and reduce potential failure points.
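
A purely hypothetical sketch of what the proposed spec could look like (nothing here exists yet; the field names are illustrative only):

apiVersion: operator.openshift.io/v1beta1
kind: KubeDescheduler
metadata:
  name: config
  namespace: openshift-kube-descheduler-operator
spec:
  policy:
    name: my-descheduler-policy  # configmap containing an upstream DeschedulerPolicy
    namespace: openshift-config  # optional, defaulting to openshift-config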

Update: I've opened these PRs to begin the work required for this:

Why is policy customization removed in 4.7?

In 4.6 we could configure the descheduler policies with the strategies field, since the defaults don't work for us, but now in 4.7 the field is deprecated and we can only enable the default profiles with no configuration options. Our only choices now are to keep using the 4.6 operator or to remove it completely and run the descheduler ourselves.

descheduler pod OOM on large clusters

On our largest OpenShift cluster, the descheduler pod runs out of memory. Is there a way to set the pod resources in deployment.apps/descheduler?

I tried setting the operator to "unmanaged" and changing deployment.apps/descheduler manually, but the operator keeps restoring the defaults, so I had to remove the operator.

Thanks

LowNodeUtilization: "TargetThreshold" params not translated correctly, overridden by "Threshold" values

The "TargetThreshold" values are not correctly translated in the cluster Configmap. The "TargetThreshold" values are taken from the "Threshold" values
As is, the Descheduler operator is not usable, except if we update the generated cluster ConfighMap by hand and we don't touch theDeschedulerinstance..

This "strategy":

strategies:
  - name: "LowNodeUtilization"
    params:
      - name: "CPUThreshold"
        value: "10"
      - name: "MemoryThreshold"
        value: "20"
      - name: "PodsThreshold"
        value: "30"
      - name: "CPUTargetThreshold"
        value: "40"
      - name: "MemoryTargetThreshold"
        value: "50"
      - name: "PodsTargetThreshold"
        value: "60"

Is translated in the "cluster" ConfigMap to the following (note that targetThresholds should have been cpu: 40, memory: 50, pods: 60):

 strategies:
  LowNodeUtilization:
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        targetThresholds:
          cpu: 10
          memory: 20
          pods: 30
        thresholds:
          cpu: 10
          memory: 20
          pods: 30

Descheduler should parse IMAGE env var for development

For QA and development, there is a need to be able to set a custom descheduler image in the operator (often the latest ART build, in order to verify bug fixes). This used to be the image field in the operator spec, but because we are removing that from the supported user config, this should be added as an undocumented/unsupported flag.

For example, the kube scheduler operator (and others) read the env var in the operator deployment and then substitute it in with the config reconciler.

README inconsistencies

In testing the operator, found two minor inconsistencies from the README file.

  1. The README references the openshift-descheduler-operator namespace twice, but it's actually the openshift-kube-descheduler-operator namespace.
  2. The Sample CR section says that the operator expects the name config, but the name is cluster when created.

Service monitors are scraped by user workload monitoring

I have a customer that has installed this operator on OpenShift 4.10.

Now the alert PrometheusOperatorRejectedResources is firing.

Checking the Prometheus operator in user workload monitoring shows that the service monitor:

openshift-kube-descheduler-operator/kube-descheduler

is being scraped by user workload monitoring instead of by openshift-monitoring.

The label:

openshift.io/cluster-monitoring: "true"

is not set on the namespace, which seems to be managed by the operator.

Support evict annotation for namespaces

The operator currently auto-excludes all namespaces with openshift-* or kube-* prefixes from eviction. This makes sense to prevent users from breaking their cluster with the Descheduler, and those are reserved prefixes so users should not be able to create their own namespaces that match the pattern.

However, it may be useful for administrators and support to be able to include certain system namespaces for rebalancing (for example, during and after upgrades). Perhaps we could add a check for the same descheduler.alpha.kubernetes.io/evict annotation on namespaces before assuming they should be excluded. Pods within that namespace would still be subject to the same eviction rules.

cc @ingvagabund wdyt?

Operator Hub installation does not create openshift-kube-descheduler-operator project and install inside it

I realized the issue when I first attempted to install the descheduler through OperatorHub. The operator does not create and install into the hardcoded openshift-kube-descheduler-operator project. This project does not exist ahead of time, and a cluster-admin cannot create it because an admission controller prevents new projects prefixed with openshift- from being created.

Once you deploy the descheduler into a user-managed namespace, the pods complain of a missing cluster CR in openshift-kube-descheduler-operator.


Support proper strategy names

Currently we have this shorthand approach to descheduler strategies that maps a new name to the actual upstream name:

| Operator param | Descheduler strategy |
| --- | --- |
| duplicates | RemoveDuplicates |
| interpodantiaffinity | RemovePodsViolatingInterPodAntiAffinity |
| lownodeutilization | LowNodeUtilization |
| nodeaffinity | RemovePodsViolatingNodeAffinity |
| nodetaints | RemovePodsViolatingNodeTaints |

This seems confusing and adds a translation step when configuring the operator. These are handled in a simple switch statement, so it should be relatively easy to add support for the real upstream strategy names and make those the primary names. We can silently support these shorthands for backward compatibility and phase them out eventually.

"unknown conversion" with descheduler policy post-1.14

Using a descheduler image built after the go1.14 bump upstream results in the following error when used with our operator:

$ oc logs pod/cluster-64fd56cddf-c4mf7
E0717 15:33:52.633020       1 server.go:46] failed converting versioned policy to internal policy version: converting (v1alpha1.DeschedulerPolicy) to (api.DeschedulerPolicy): unknown conversion

I haven't verified yet if this is only an error with our operator and how we generate the policy, or if it's an issue with the descheduler's api itself. The error arises from here: https://github.com/kubernetes-sigs/descheduler/blob/267b0837dc3085c387d1ee6bf76050bf0db91c9a/pkg/descheduler/policyconfig.go#L51

/kind bug
/priority critical-urgent

Make descheduler run as cron job

As of now, the descheduler runs as a Job. In order to avoid regressions from 3.10 and 3.11, we need to make it a CronJob.

Will post a PR soon.

Update readme with new manual deployment options

Just opening this to track these updates. With the switch to the OperatorHub setup, I'd still like to know how to manually deploy the operator from source, if possible (does oc create -f manifests/. still work?).
