

kube-janitor's People

Contributors

adheipsingh, aermakov-zalando, angelbarrera92, antban, breunigs, cookiecomputing, estahn, hjacobs, mayppong, songgithub, sryabkov, twz123

kube-janitor's Issues

Support week as time unit ("w")

It might be convenient to specify the TTL in weeks. Support this by introducing w as time unit, e.g. 2w would define a time to live of 2 weeks.

This should be easily possible as the "w" is unambiguous and weeks have a fixed length of 7 days (not accounting for timezones/leap seconds).
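
As an illustration, here is a minimal TTL-parsing sketch (in Python, which kube-janitor is written in) that accepts "w" alongside the existing units; the function and constant names are made up for this example and are not the project's actual code:

import re

# Hypothetical sketch: accept "w" (weeks) in addition to s/m/h/d.
TTL_PATTERN = re.compile(r"^(\d+)([smhdw])$")
FACTORS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 7 * 86400}

def parse_ttl(value: str) -> int:
    """Return the TTL in seconds, e.g. '2w' -> 1209600."""
    match = TTL_PATTERN.match(value)
    if not match:
        raise ValueError(f"invalid TTL value: {value!r}")
    amount, unit = match.groups()
    return int(amount) * FACTORS[unit]

print(parse_ttl("2w"))  # 1209600 seconds = 14 days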

How to disable TLS verification (self-signed certs)

Trying to run kube-janitor on a cluster with self-signed certificates, I'm getting this error:

urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='', port=443): Max retries exceeded with url: /api/v1/namespaces (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get issuer certificate (_ssl.c:1056)')))
Is there a way to ignore invalid certs?
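
One possible workaround (not an official kube-janitor flag; treat the details as assumptions about pykube) is to disable certificate verification on the requests session that pykube uses. Adding the cluster's CA certificate to the client configuration is the cleaner fix:

import pykube

# Workaround sketch: disable TLS verification on pykube's underlying
# requests session. The `session` attribute is assumed to be exposed by
# your pykube version; prefer configuring the proper CA bundle instead.
config = pykube.KubeConfig.from_env()
api = pykube.HTTPClient(config)
api.session.verify = False  # accept self-signed certificates

for ns in pykube.Namespace.objects(api):
    print(ns.name)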

Support wildcard patterns in include/exclude lists

It would be helpful to support a wildcard pattern when specifying namespaces to include/exclude.

Example:

--include-namespaces=devspace-*
--exclude-namespaces=kube-*

Use Case:

This is useful when namespaces serve as the separation between deployment instances, particularly for personal developer or team namespaces.
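
A rough sketch of how such matching could work using Python's fnmatch; the helper name and the comma-separated option format are assumptions for illustration, not the actual CLI implementation:

import fnmatch

# Sketch: wildcard matching for --include-namespaces/--exclude-namespaces.
def namespace_matches(namespace: str, patterns: str) -> bool:
    """Return True if the namespace matches any comma-separated pattern."""
    return any(fnmatch.fnmatch(namespace, p.strip()) for p in patterns.split(","))

print(namespace_matches("devspace-alice", "devspace-*"))   # True
print(namespace_matches("kube-system", "kube-*"))          # True
print(namespace_matches("prod", "devspace-*,kube-*"))      # False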

At least some version tags in this repo are placed on the wrong commits

The 20.2.0 tag is placed on commit f395fbc instead of 86e1dea, which follows it.

When one does git checkout 20.2.0, they get the 20.2.0 bits with the 20.1.0 version number.

Please see the screenshot (omitted here).

It looks like this is an issue that persists across versions (second screenshot also omitted).

I can see three possible remedies:

  • re-do the tags and releases and start making tags/releases on the right commits in future
  • leave everything as-is and make a note of this issue in README
  • leave this issue open so that consumers can see it in the list of open issues

Others have been confused by this, as can be seen in this comment.

Pods are orphaned when Deployment deleted by kube-janitor

What happened:

I created a Deployment and annotated it:

kubectl run temp-nginx --image=nginx
kubectl annotate deploy temp-nginx janitor/ttl=1m

The pod had valid ownerReferences:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2019-03-27T14:02:19Z"
  generateName: temp-nginx-68498674c5-
  labels:
    pod-template-hash: 68498674c5
    run: temp-nginx
  name: temp-nginx-68498674c5-2mxfc
  namespace: dev
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: temp-nginx-68498674c5
    uid: ef5dbd98-5098-11e9-999b-02405219215c

As expected, the Deployment was deleted after 1m; here are the logs of kube-janitor:

2019-03-27 14:03:37,720 DEBUG: Deployment temp-nginx with 1m TTL is 1m18s old
2019-03-27 14:03:37,720 INFO: Deployment temp-nginx with 1m TTL is 1m18s old and will be deleted (annotation janitor/ttl is set)
2019-03-27 14:03:37,731 DEBUG: https://10.96.0.1:443 "POST /api/v1/namespaces/dev/events HTTP/1.1" 201 836
2019-03-27 14:03:37,732 INFO: Deleting Deployment dev/temp-nginx..
2019-03-27 14:03:37,738 DEBUG: https://10.96.0.1:443 "DELETE /apis/extensions/v1beta1/namespaces/dev/deployments/temp-nginx HTTP/1.1" 200 1712
2019-03-27 14:03:37,739 DEBUG: ReplicaSet temp-nginx-68498674c5 with 1m TTL is 1m18s old
2019-03-27 14:03:37,739 INFO: ReplicaSet temp-nginx-68498674c5 with 1m TTL is 1m18s old and will be deleted (annotation janitor/ttl is set)
2019-03-27 14:03:37,745 DEBUG: https://10.96.0.1:443 "POST /api/v1/namespaces/dev/events HTTP/1.1" 201 858
2019-03-27 14:03:37,745 INFO: Deleting ReplicaSet dev/temp-nginx-68498674c5..
2019-03-27 14:03:37,752 DEBUG: https://10.96.0.1:443 "DELETE /apis/extensions/v1beta1/namespaces/dev/replicasets/temp-nginx-68498674c5 HTTP/1.1" 200 1546
2019-03-27 14:03:37,754 INFO: Clean up run completed: resources-processed=472, deployments-with-ttl=1, deployments-deleted=1, replicasets-with-ttl=1, replicasets-deleted=1

But the Pod wasn't deleted and was orphaned; its ownerReferences were removed:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2019-03-27T14:02:19Z"
  generateName: temp-nginx-68498674c5-
  labels:
    pod-template-hash: 68498674c5
    run: temp-nginx
  name: temp-nginx-68498674c5-2mxfc
  namespace: dev
  resourceVersion: "66358563"
  selfLink: /api/v1/namespaces/dev/pods/temp-nginx-68498674c5-2mxfc
  uid: ef60a8b0-5098-11e9-999b-02405219215c

What I expected:

That deletion of resources cascades. Otherwise there is little sense in deleting Deployments while leaving all the Pods behind.
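
For reference, cascading deletion can be requested explicitly through the Kubernetes DeleteOptions propagationPolicy. A sketch using pykube is below; whether delete() accepts a propagation_policy argument depends on the pykube version, so treat that as an assumption:

import pykube

# Sketch only: ask the API server to cascade the deletion so the ReplicaSet
# and Pods are removed together with the Deployment. If your pykube version
# has no propagation_policy parameter, the equivalent is a DELETE request
# whose body is a DeleteOptions object with "propagationPolicy": "Foreground"
# (or "Background").
api = pykube.HTTPClient(pykube.KubeConfig.from_env())
deployment = pykube.Deployment.objects(api, namespace="dev").get(name="temp-nginx")
deployment.delete(propagation_policy="Foreground")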

Context

kubectl version:

Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:37:52Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:30:26Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

kube-janitor version and flags:

Janitor v0.6 started with debug=True, delete_notification=None, dry_run=False, exclude_namespaces=kube-system, exclude_resources=events,controllerrevisions, include_namespaces=all, include_resources=all, interval=30, once=False, rules_file=/config/rules.yaml

rules.yaml:

# example rules configuration to set TTL for arbitrary objects
# see https://github.com/hjacobs/kube-janitor for details
rules:
  - id: require-application-label
    # remove deployments and statefulsets without a label "application"
    resources:
      # resources are prefixed with "XXX" to make sure they are not active by accident
      # modify the rule as needed and remove the "XXX" prefix to activate
      - XXXdeployments
      - XXXstatefulsets
    # see http://jmespath.org/specification.html
    jmespath: "!(spec.template.metadata.labels.application)"
    ttl: 4d
  - id: temporary-pr-namespaces
    # delete all namespaces with a name starting with "pr-*"
    resources:
      # resources are prefixed with "XXX" to make sure they are not active by accident
      # modify the rule as needed and remove the "XXX" prefix to activate
      - XXXnamespaces
    # this uses JMESPath's built-in "starts_with" function
    # see http://jmespath.org/specification.html#starts-with
    jmespath: "starts_with(metadata.name, 'pr-')"
    ttl: 4h

Remove unmounted/unused PVCs

Kubernetes Janitor should allow deleting PersistentVolumeClaims which are no longer mounted or referenced, e.g. because the StatefulSet was deleted.

PVCs are not automatically deleted and it's easy to forget them. From https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/:

Deleting and/or scaling a StatefulSet down will not delete the volumes associated with the StatefulSet. This is done to ensure data safety, which is generally more valuable than an automatic purge of all related StatefulSet resources.

Idea: add additional context properties which can be used in the rule jmespath.
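
A sketch of how such a context property could be consumed by a rule, using the jmespath library that the rules file already relies on; the property name pvc_is_not_mounted is hypothetical here (a later issue mentions _context.pvc_is_not_referenced):

import jmespath

# Sketch: evaluate a rule's JMESPath expression against a resource dict that
# has an injected "_context" object with extra, computed properties.
pvc = {
    "metadata": {"name": "data-my-db-0", "namespace": "default"},
    "_context": {"pvc_is_not_mounted": True},  # hypothetical property
}
expression = "_context.pvc_is_not_mounted"
print(jmespath.search(expression, pvc))  # True -> the rule's TTL would apply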

Support for delete notifications

Kube-janitor is a very good tool for removing dangling resources. It would be good if it could also integrate a resource deletion notification feature, e.g. if kube-janitor deletes a namespace, the user should get a notification about it.
Start with notification channels like email and Slack; more channels can be added later.

Delete notification

Hi thanks for this awesome project.

I am thinking of some notification mechanism for when a resource is going to be deleted soon.

Is this possible? Something like webhooks or callbacks to a specific endpoint. Does that make sense to you?

When using `include_resources`, it should not check other resources

I understand that, if you set the option include_resources, it will only work with the resources given.

I've seen, when debugging, that even though the application is set to work (in my case) with namespaces only, it asks the API for all the resources and then skips them.

Wouldn't it be better not to call the API for all the resources and only query the required ones? This would avoid many unneeded calls and prevent congesting the API server for no reason.

Log (cropped):

2019-08-09 11:01:52,197 INFO: Janitor v0.7 started with debug=True, delete_notification=None, dry_run=False, exclude_namespaces=kube-system, exclude_resources=events,controllerrevisions, include_namespaces=all, include_resources=namespaces, interval=30, once=True, rules_file=None
2019-08-09 11:01:52,200 DEBUG: Starting new HTTPS connection (1): <API_SERVER_IP>:443
2019-08-09 11:01:52,213 DEBUG: https://<API_SERVER_IP>:443 "GET /api/v1/namespaces HTTP/1.1" 200 None
2019-08-09 11:01:52,219 DEBUG: Namespace julio will expire on 2019-08-09T22:55
2019-08-09 11:01:52,219 DEBUG: Skipping Namespace kube-system
2019-08-09 11:01:52,222 DEBUG: https://<API_SERVER_IP>:443 "GET /api/v1/ HTTP/1.1" 200 None
2019-08-09 11:01:52,229 DEBUG: https://<API_SERVER_IP>:443 "GET /api/v1/configmaps HTTP/1.1" 200 None
2019-08-09 11:01:52,230 DEBUG: Skipping ConfigMap *** ... # Multiple lines
2019-08-09 11:01:52,235 DEBUG: https://<API_SERVER_IP>:443 "GET /api/v1/endpoints HTTP/1.1" 200 None
2019-08-09 11:01:52,237 DEBUG: Skipping Endpoints *** ... # Multiple lines
2019-08-09 11:01:52,241 DEBUG: https://<API_SERVER_IP>:443 "GET /api/v1/limitranges HTTP/1.1" 200 711
2019-08-09 11:01:52,241 DEBUG: Skipping LimitRange *** ... # Multiple lines
2019-08-09 11:01:52,256 DEBUG: https://<API_SERVER_IP>:443 "GET /api/v1/pods HTTP/1.1" 200 None
2019-08-09 11:01:52,279 DEBUG: Skipping Pod *** ... # Multiple lines
2019-08-09 11:01:52,299 DEBUG: https://<API_SERVER_IP>:443 "GET /api/v1/secrets HTTP/1.1" 200 None
2019-08-09 11:01:52,302 DEBUG: Skipping Secret *** ... # Multiple lines
2019-08-09 11:01:52,308 DEBUG: https://<API_SERVER_IP>:443 "GET /api/v1/serviceaccounts HTTP/1.1" 200 None
2019-08-09 11:01:52,310 DEBUG: Skipping ServiceAccount *** ... # Multiple lines
2019-08-09 11:01:52,316 DEBUG: https://<API_SERVER_IP>:443 "GET /api/v1/services HTTP/1.1" 200 None
2019-08-09 11:01:52,318 DEBUG: Skipping Service *** ... # Multiple lines

etc etc

The application is run with these options: --once --include-resources=namespaces --debug
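
A minimal sketch of the suggested behaviour: filter the discovered resource types against --include-resources/--exclude-resources before issuing any list calls. The data structures and names are illustrative only, not kube-janitor's internal functions:

# Sketch: decide which resource types to list instead of listing everything
# and skipping afterwards.
def filter_resource_types(discovered, include_resources, exclude_resources):
    """discovered: iterable of plural names, e.g. ["namespaces", "pods", ...]."""
    for plural in discovered:
        if plural in exclude_resources:
            continue
        if "all" in include_resources or plural in include_resources:
            yield plural

discovered = ["namespaces", "configmaps", "endpoints", "pods", "secrets"]
wanted = list(filter_resource_types(discovered, {"namespaces"}, {"events"}))
print(wanted)  # ['namespaces'] -> only one list call is needed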

Deleting CRDS like EtcdCluster is not working

I am trying to delete CRDs like EtcdCluster via kube-janitor, specified like so:

rulesFile:
    rules:
      - id: namespace-cleanup
        resources:
        - deployments
        - services
        - etcdcluster
        jmespath: "!(spec.template.metadata.labels.janitorIgnore)"
        ttl: 12h

It does not delete the etcdcluster resource. This resource is introduced by the etcd Operator from CoreOS. Do I have to specify the CRDs in some special way?

Not iterating over all namespaces

It looks like this is not iterating over all namespaces.

I see this

[kube-janitor-7cc797f987-5pgjz] 2020-05-04 00:26:44,482 INFO: Clean up run completed: resources-processed=3012

and looking over the previous logs, it only looked at resources in the namespace it is deployed in (kube-system):

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: kube-janitor
    version: 20.4.0
  name: kube-janitor
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-janitor
  template:
    metadata:
      labels:
        app: kube-janitor
        version: 20.4.0
    spec:
      containers:
      - args:
        - --dry-run
        - --debug
        - --interval=60
        image: hjacobs/kube-janitor:20.4.0
        name: janitor
        resources:
          limits:
            cpu: 500m
            memory: 110Mi
          requests:
            cpu: 5m
            memory: 100Mi
        securityContext:
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
      serviceAccountName: kube-janitor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app: kube-janitor
  name: kube-janitor
rules:
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
- apiGroups:
  - '*'
  resources:
  - '*'
  verbs:
  - get
  - watch
  - list
  - delete

Not sure why

Provide a special value for janitor/ttl that blocks the wipe process

Sometimes people use the same file to deploy to different environments (live, staging, PR), as deployments to different stacks may differ only in parameter values (and parameters are taken from config maps or some other template engine).
Unfortunately, the presence of janitor/ttl forces people to create several YAML files whose only difference is the janitor/ttl annotation.
I believe that the ability to set a special value (janitor/ttl: undead, for example) that blocks the janitor in specific environments would allow people to create unified deployments.

Additional rule matching with janitor/ttl?

Are there any plans to add additional rule matching (e.g. whitelisting a specific instance of a resource) alongside the janitor/ttl annotation? Or is the design based on the premise that if the resource is included and the annotation is present, it will be removed, with rule matching only applied when no annotation is set?

--include-resources not working as expected

I've been having issues trying to get --include-resources to work as expected.

I tried two variants of this spec but didn't get anything to work:

spec:
      containers:
      - args:
        - --debug
        - --interval=20
        - --rules-file=/config/rules.yaml
        - --include-resources=deployment
spec:
      containers:
      - args:
        - --debug
        - --interval=20
        - --rules-file=/config/rules.yaml
        - --include-resources=deployment
        - --include-namespaces=default

I ran kubectl run temp-nginx --image=nginx and annotated it with kubectl annotate deploy temp-nginx janitor/ttl=5s, but the deployment doesn't disappear after a loop run. Any ideas on what to do?

how to run unittests

Running pipenv run coverage run --source=kube_janitor -m py.test, I got No module named 'py'.

Also, I find that it doesn't really run the unit tests. Do you agree?

Delete notification: Operation cannot be fulfilled on replicasets.extensions, the object has been modified

An exception occurs in v0.6 when --delete-notification is set to a value > 0 and a deployment with a TTL annotation is processed. Note that everything works (events emitted, resources deleted) and the annotations are set as expected, but the exception is a bit annoying:

kube-janitor-69f4484d9b-djc9m janitor 2019-03-23 22:20:28,730 INFO: Deployment nginx will be deleted at 2019-03-23T22:24:30Z (annotation janitor/ttl is set)
kube-janitor-69f4484d9b-djc9m janitor 2019-03-23 22:20:28,756 INFO: ReplicaSet nginx-dbddb74b8 will be deleted at 2019-03-23T22:24:30Z (annotation janitor/ttl is set)
kube-janitor-69f4484d9b-djc9m janitor 2019-03-23 22:20:28,773 ERROR: Failed to clean up: Operation cannot be fulfilled on replicasets.extensions "nginx-dbddb74b8": the object has been modified; please apply your changes to the latest version and try again
kube-janitor-69f4484d9b-djc9m janitor Traceback (most recent call last):
kube-janitor-69f4484d9b-djc9m janitor   File "/usr/local/lib/python3.7/site-packages/pykube/http.py", line 239, in raise_for_status
kube-janitor-69f4484d9b-djc9m janitor     resp.raise_for_status()
kube-janitor-69f4484d9b-djc9m janitor   File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 940, in raise_for_status
kube-janitor-69f4484d9b-djc9m janitor     raise HTTPError(http_error_msg, response=self)
kube-janitor-69f4484d9b-djc9m janitor requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://10.3.0.1:443/apis/extensions/v1beta1/namespaces/default/replicasets/nginx-dbddb74b8
kube-janitor-69f4484d9b-djc9m janitor 
kube-janitor-69f4484d9b-djc9m janitor During handling of the above exception, another exception occurred:
kube-janitor-69f4484d9b-djc9m janitor 
kube-janitor-69f4484d9b-djc9m janitor Traceback (most recent call last):
kube-janitor-69f4484d9b-djc9m janitor   File "/kube_janitor/main.py", line 51, in run_loop
kube-janitor-69f4484d9b-djc9m janitor     dry_run=dry_run)
kube-janitor-69f4484d9b-djc9m janitor   File "/kube_janitor/janitor.py", line 231, in clean_up
kube-janitor-69f4484d9b-djc9m janitor     counter.update(handle_resource_on_ttl(resource, rules, delete_notification, dry_run))
kube-janitor-69f4484d9b-djc9m janitor   File "/kube_janitor/janitor.py", line 156, in handle_resource_on_ttl
kube-janitor-69f4484d9b-djc9m janitor     send_delete_notification(resource, reason, expiry_time, dry_run=dry_run)
kube-janitor-69f4484d9b-djc9m janitor   File "/kube_janitor/janitor.py", line 75, in send_delete_notification
kube-janitor-69f4484d9b-djc9m janitor     add_notification_flag(resource, dry_run=dry_run)
kube-janitor-69f4484d9b-djc9m janitor   File "/kube_janitor/janitor.py", line 52, in add_notification_flag
kube-janitor-69f4484d9b-djc9m janitor     resource.update()
kube-janitor-69f4484d9b-djc9m janitor   File "/usr/local/lib/python3.7/site-packages/pykube/objects.py", line 142, in update
kube-janitor-69f4484d9b-djc9m janitor     self.api.raise_for_status(r)
kube-janitor-69f4484d9b-djc9m janitor   File "/usr/local/lib/python3.7/site-packages/pykube/http.py", line 246, in raise_for_status
kube-janitor-69f4484d9b-djc9m janitor     raise HTTPError(resp.status_code, payload["message"])
kube-janitor-69f4484d9b-djc9m janitor pykube.exceptions.HTTPError: Operation cannot be fulfilled on replicasets.extensions "nginx-dbddb74b8": the object has been modified; please apply your changes to the latest version and try again
kube-janitor-69f4484d9b-djc9m janitor 2019-03-23 22:21:30,727 INFO: Clean up run completed: resources-processed=2941, deployments-with-ttl=1, replicasets-with-ttl=1
kube-janitor-69f4484d9b-djc9m janitor 2019-03-23 22:22:32,614 INFO: Clean up run completed: resources-processed=2941, deployments-with-ttl=1, replicasets-with-ttl=1
kube-janitor-69f4484d9b-djc9m janitor 2019-03-23 22:23:34,703 INFO: Clean up run completed: resources-processed=2941, deployments-with-ttl=1, replicasets-with-ttl=1
kube-janitor-69f4484d9b-djc9m janitor 2019-03-23 22:24:36,777 INFO: Deployment nginx with 10m TTL is 10m6s old and will be deleted (annotation janitor/ttl is set)
kube-janitor-69f4484d9b-djc9m janitor 2019-03-23 22:24:36,789 INFO: Deleting Deployment default/nginx..
kube-janitor-69f4484d9b-djc9m janitor 2019-03-23 22:24:36,808 INFO: ReplicaSet nginx-dbddb74b8 with 10m TTL is 10m6s old and will be deleted (annotation janitor/ttl is set)
kube-janitor-69f4484d9b-djc9m janitor 2019-03-23 22:24:36,817 INFO: Deleting ReplicaSet default/nginx-dbddb74b8..
kube-janitor-69f4484d9b-djc9m janitor 2019-03-23 22:24:36,834 INFO: Clean up run completed: resources-processed=2941, deployments-with-ttl=1, deployments-deleted=1, replicasets-with-ttl=1, replicasets-deleted=1
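
A possible mitigation (a sketch, not the project's actual fix) would be to refresh the object and retry the annotation update when the API server answers 409 Conflict; pykube's reload()/update() methods and the annotation key used here are assumptions:

import pykube
from pykube.exceptions import HTTPError

# Sketch: retry the delete-notification annotation update once after a 409
# Conflict by reloading the object to pick up the latest resourceVersion.
def add_notification_flag_with_retry(resource, retries: int = 1):
    for attempt in range(retries + 1):
        resource.annotations["janitor/notified"] = "yes"  # illustrative key
        try:
            resource.update()
            return
        except HTTPError as e:
            if e.code != 409 or attempt == retries:
                raise
            resource.reload()  # fetch latest version, then retry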

Deleting Helm releases

Kube-janitor seems perfect for auto-cleaning temporary feature deploys. But we use Helm, so it would remove the resources and leave the release data (Secret objects) dangling.

Possible ways to clean up Helm releases:

Determine release and use helm to uninstall

  • Determine helm release name from the resources found, e.g. based on labels heritage: Helm and release: <release-name>.
  • Instead of removing individual resources, shell out to helm, executing helm uninstall <release-name>.

Some pros/cons:

  • Pro: Applications installed by helm are probably best uninstalled by helm. There might be hooks, resources not annotated, etc. It would be consistent by design.
  • Con: Major change in kube-janitor. Shell commands and a helm binary requirement; violates the Unix mantra 'do one thing well'.

Find helm release secret object and remove that as well

  • Determine helm release name from the resources found, e.g. based on labels heritage: Helm and release: <release-name>.
  • Helm release data would be in a secret having label name: <release-name> and a name like sh.helm.release.v1.<release-name>.v1.

Some pros/cons:

  • Pro: Relatively easy to add to kube-janitor
  • Con: Might leave behind resources that aren't annotated. This is already the case with the current operation of kube-janitor, but when using Helm charts you didn't author yourself, you depend on the chart's ability to annotate all resources. Note that this restriction doesn't seem to apply when using kube-janitor with a rules file; if I'm not mistaken, that doesn't require annotations on each resource.

Any implementation would obviously be 'opt in' and might require some additional config options or logic, e.g. an annotation specifying the label to extract the helm release from.

I'd like to hear your thoughts. Personally I think the 2nd approach would be a fit for kube-janitor, while the 1st approach risks embedding what would become a parallel, completely new implementation.

In the coming days (given time, 🎄 after all) I'll try to run some tests to see how Helm copes with possibly dangling resources once the release data has been removed.
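
A rough sketch of the second approach (find the Helm release Secret and delete it as well); the label and Secret naming follow the Helm 3 conventions described above, but treat the details as assumptions:

import pykube

# Sketch: locate the Helm 3 release Secret(s) (sh.helm.release.v1.<name>.vN)
# in the same namespace as a resource that carries Helm's release labels.
def find_helm_release_secrets(api, resource):
    release = resource.labels.get("release") or resource.labels.get(
        "app.kubernetes.io/instance"
    )
    if not release:
        return []
    secrets = pykube.Secret.objects(api, namespace=resource.namespace).filter(
        selector={"name": release, "owner": "helm"}
    )
    return [s for s in secrets if s.name.startswith(f"sh.helm.release.v1.{release}.")]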

A scale question

Hey,
Apologies, this is more of a question than an issue... I tried to get the answer from the code, but Python isn't my strong suit :)

I'm thinking about kube-janitor at scale. We have 500+ namespaces, and I believe (please correct me if I'm wrong) the approach the janitor takes is to iterate over them all, pulling all the resources and inspecting the annotations, every minute.

That feels like an expensive operation, and I'm wondering if you've considered either:

  • Being able to restrict the resources to those which have a label of janitor=true, as well as the relevant annotation?

or

  • Watching the API for changes rather than polling every minute?

Cheers
Karl
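
For the first suggestion above, a rough illustration of what a label-based restriction could look like with pykube's label selectors; this is not an existing kube-janitor option, just a sketch:

import pykube

# Sketch: let the API server filter by a janitor=true label instead of the
# janitor listing every resource and inspecting annotations client-side.
api = pykube.HTTPClient(pykube.KubeConfig.from_env())
deployments = pykube.Deployment.objects(api).filter(
    namespace=pykube.all, selector={"janitor": "true"}
)
for d in deployments:
    print(d.namespace, d.name, d.annotations.get("janitor/ttl"))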

PVCs: also check for references by CronJob

Some CronJobs use Persistent Volumes which should not be deleted between CronJob runs, e.g.:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "0 23 * * *"
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            application: "foobar"
        spec:
          restartPolicy: Never
          containers:
            - name: cont
              image: "my-docker-image"
              volumeMounts:
                - mountPath: "/data"
                  name: "foobar-data"
          volumes:
            - name: "foobar-data"
              persistentVolumeClaim:
                claimName: "foobar-data"

_context.pvc_is_not_referenced should be false for the PVC foobar-data in this case.

dockerised development setup

It would be good if developers and the Travis agent didn't have to install dependencies such as pipenv and could just use Docker.

Docker can help with dependency isolation, just as virtualenv does. Not having to install a bunch of dependencies makes the development environment more portable.

Support "expires" time as annotation

The current TTL annotation (janitor/ttl) always denotes a maximum time to live counted from the resource's creation time. Sometimes it's desirable to mark resources for deletion at arbitrary times in the future. Example use cases:

  • some stray resources are found in the test environment, nobody knows who owns/needs them --> send an email to the team asking about it and at the same time mark the resources as "delete in 4 days" (so the team has enough time to answer/fix and if no response --> delete)
  • a project or hackathon has a fixed time frame (e.g. 1 week), i.e. the end date is known --> mark the used namespace as "delete at end of project date"

Proposal: support a new annotation janitor/expires which accepts an absolute timestamp in the format YYYY-MM-DDTHH:MM:SSZ (same format as Kubernetes creationTimestamp). Resources should be deleted if their janitor/expires timestamp lies in the past.
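
A small sketch of how the proposed annotation could be evaluated; the helper is illustrative, but the timestamp format matches the proposal (and Kubernetes creationTimestamp):

from datetime import datetime, timezone

# Sketch: a resource is expired if its janitor/expires timestamp lies in the past.
def is_expired(annotations: dict, now: datetime = None) -> bool:
    value = annotations.get("janitor/expires")
    if not value:
        return False
    expires = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return (now or datetime.now(timezone.utc)) > expires

print(is_expired({"janitor/expires": "2020-01-01T00:00:00Z"}))  # True (in the past)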

Deleting Object successful but it does not remove respective objects

I have a custom resource with ~770 instances across all namespaces. I ran the janitor in debug mode and it prints:

  • X xxx-20191127-122231162 with 2m TTL is 6d19h16m42s old and will be deleted (rule require-completed-or-failed-on-not-default matches)
  • https://172.20.0.1:443 "POST /api/v1/namespaces/default/events HTTP/1.1" 201 942
  • Deleting X default/xxx-20191127-122231162..
  • https://172.20.0.1:443 "DELETE /apis/xxx.xxx.dev/v1/namespaces/default/X/xxx-20191127-122231162 HTTP/1.1" 200 None

But it does not delete that object. I exported the object to my docker-for-desktop cluster; it prints the same output, but there the object is actually deleted.

It is not about permissions, because when I modified the ClusterRole the output became:

  • Deleting X default/xxx-20191127-122231162..
  • https://172.20.0.1:443 "DELETE /apis/xxx.xxx.dev/v1/namespaces/default/X/xxx-20191127-122231162 HTTP/1.1" 403 475
  • Could not delete X default/xxx-20191127-122231162: X.xxx.xxx.dev "xxx-20191127-122231162" is forbidden: User "system:serviceaccount:non-default:kube-janitor" cannot delete resource "X" in API group "xxx.xxx.dev" in the namespace "default"

Here is the last line from the main cluster:
Clean up run completed: resources-processed=3762, X-with-ttl=17, X-deleted=17, rule-require-completed-or-failed-on-not-default-matches=16

Kube-janitor doesn't work at all if one of the API's is inaccessible

When running kube-janitor on our cluster, we got 503s when attempting to access the API metrics/v1alpha1.

2019-08-12 05:06:00,534 DEBUG: Collecting resources for metrics/v1alpha1..
2019-08-12 05:06:00,734 DEBUG: https://172.20.0.1:443 "GET /apis/metrics/v1alpha1/ HTTP/1.1" 503 20
2019-08-12 05:06:00,734 ERROR: Failed to clean up: 503 Server Error: Service Unavailable for url: https://172.20.0.1:443/apis/metrics/v1alpha1/
Traceback (most recent call last):
  File "/kube_janitor/main.py", line 51, in run_loop
    dry_run=dry_run)
  File "/kube_janitor/janitor.py", line 215, in clean_up
    for _type in resource_types:
  File "/kube_janitor/resources.py", line 39, in get_namespaced_resource_types
    for api_version, resource in discover_namespaced_api_resources(api):
  File "/kube_janitor/resources.py", line 32, in discover_namespaced_api_resources
    r2.raise_for_status()
  File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://172.20.0.1:443/apis/metrics/v1alpha1/

This prevented kube-janitor from running at all on our cluster.

Is it by design supposed to fail when one API is inaccessible?

In addition (and I think this has been raised in another issue), this would also be avoided in our specific use case if we didn't iterate through all APIs but only the required ones. That, of course, doesn't address the root cause.
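
A sketch of the requested resilience (illustrative names, not the project's actual code): skip an API group whose discovery call fails instead of aborting the whole clean-up run:

import logging
import requests

# Sketch: tolerate unavailable API groups (e.g. metrics/v1alpha1 returning 503)
# during resource discovery. The api_client.get(version=...) interface is an
# assumption modelled loosely on pykube's HTTPClient.
def discover_resources(api_client, api_versions):
    for api_version in api_versions:
        try:
            response = api_client.get(version=api_version)
            response.raise_for_status()
        except requests.exceptions.HTTPError as e:
            logging.warning("Skipping API group %s: %s", api_version, e)
            continue
        yield api_version, response.json().get("resources", [])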

403 Client Error: Forbidden for url

Hello,
I've installed kube-janitor as instructed: git clone, kubectl apply -f deploy/common/, then kubectl apply -f deploy/deployment/. However, I'm getting the following error in the pod logs. I'm not sure where 10.222.0.1 is coming from, as there aren't any nodes in the cluster (including masters) with that IP.
Any ideas?
Best,
Greg

(This says 0.6, but I've also tried master.)

2019-08-06 05:39:49,669 INFO: Janitor v0.6 started with debug=True, delete_notification=None, dry_run=True, exclude_namespaces=kube-system, exclude_resources=events,controllerrevisions, include_namespaces=all, include_resources=all, interval=60, once=False, rules_file=/config/rules.yaml
2019-08-06 05:39:49,669 INFO: **DRY-RUN**: no deletions will be performed!
2019-08-06 05:39:49,688 INFO: Loaded 2 rules from file /config/rules.yaml
2019-08-06 05:39:49,697 DEBUG: Starting new HTTPS connection (1): 10.222.0.1:443
2019-08-06 05:39:49,714 DEBUG: https://10.222.0.1:443 "GET /api/v1/namespaces HTTP/1.1" 403 294
2019-08-06 05:39:49,715 ERROR: Failed to clean up: 403 Client Error: Forbidden for url: https://10.222.0.1:443/api/v1/namespaces
Traceback (most recent call last):
  File "/kube_janitor/main.py", line 51, in run_loop
    dry_run=dry_run)
  File "/kube_janitor/janitor.py", line 201, in clean_up
    for namespace in Namespace.objects(api):
  File "/usr/local/lib/python3.7/site-packages/pykube/query.py", line 148, in __iter__
    return iter(self.query_cache["objects"])
  File "/usr/local/lib/python3.7/site-packages/pykube/query.py", line 138, in query_cache
    cache["response"] = self.execute().json()
  File "/usr/local/lib/python3.7/site-packages/pykube/query.py", line 123, in execute
    r.raise_for_status()
  File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://10.222.0.1:443/api/v1/namespaces

Not all Custom Resources are listed by default

We have a custom resource which does not get deleted by kube-janitor.

The CRD (without schema) is defined as:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: eventstreams.zalando.org
spec:
  conversion:
    strategy: None
  group: zalando.org
  names:
    kind: EventStream
    listKind: EventStreamList
    plural: eventstreams
    shortNames:
    - fes
    singular: eventstream
  preserveUnknownFields: true
  scope: Namespaced
  versions:
  - name: v1alpha1
    served: true
    storage: true
    subresources:
      status: {}

An instance of the CRD uses a kube-janitor annotation:

# kubectl get fes my-fes -o yaml
apiVersion: zalando.org/v1alpha1
kind: EventStream
metadata:
  annotations:
    deployment-time: "2020-02-20T12:13:49Z"
    janitor/ttl: 1d

But even after the time has passed, it is not deleted.

From an initial investigation, it appears that kube-janitor simply doesn't list the resources of this CRD when listing all resources. If it were failing to delete them, there would be error messages in the logs.

Support `deployment-time` annotation

This annotation is injected at deployment time, so the janitor should prefer it over creationTimestamp. Currently the janitor will remove a deployment 2 days after it was created even if users redeploy every hour, which is definitely not what anyone would expect.
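
A sketch of the requested behaviour: prefer the deployment-time annotation over metadata.creationTimestamp when computing a resource's age. The annotation key matches the deployment-time example shown earlier; the helper itself is illustrative:

from datetime import datetime, timezone

# Sketch: compute a resource's age from deployment-time if present,
# falling back to creationTimestamp otherwise.
def resource_age_seconds(resource: dict, now: datetime = None) -> float:
    metadata = resource["metadata"]
    reference = metadata.get("annotations", {}).get("deployment-time") or metadata["creationTimestamp"]
    started = datetime.strptime(reference, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return ((now or datetime.now(timezone.utc)) - started).total_seconds()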

RFC: Remove resources based on github issue/PR/branch

This is a proposal for an additional feature, so I wanted to get feedback on the idea before starting on an implementation.

Our use of kube-janitor at Ecosia is as follows: we create a QA/review environment as a k8s namespace for every PR, and after a TTL has been reached (e.g. 7 days), we delete the namespace. We would rather automatically delete the namespace based on the PR status change (e.g. closed or merged); we've tried a number of techniques for this (CI on the merge commit, GitHub Actions, etc.) and none are particularly consistent or clean, yet.

It occurred to me the other day that for our use case, kube-janitor could have a different kind of annotation, e.g. janitor/github-pr or janitor/github-branch, that would use the GitHub API to check whether the PR is open or the branch exists, and remove the annotated resource when that condition is no longer met. In summary, if the annotation janitor/github-pr: "ecosia/example-repo/101" existed, kube-janitor would use the GitHub API to check whether PR number 101 on the repo 'ecosia/example-repo' was still in 'open' status.
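
A sketch of the proposed check against the public GitHub REST API (GET /repos/{owner}/{repo}/pulls/{number}); authentication and rate limiting are omitted, and the helper name is made up:

import requests

# Sketch: given an annotation value like "ecosia/example-repo/101", return
# whether that pull request is still open.
def pr_is_open(annotation_value: str, token: str = None) -> bool:
    owner, repo, number = annotation_value.rsplit("/", 2)
    headers = {"Authorization": f"token {token}"} if token else {}
    response = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}", headers=headers
    )
    response.raise_for_status()
    return response.json()["state"] == "open"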

Please let me know if this is a feature you'd be willing to include and if so, I can try, sometime in the near future, to take a crack at an implementation.

Cheers! 🙂

Resources not being deleted

I'm having a problem with kube-janitor where the debug log shows success on deleting a resource:

2020-01-02 20:01:38,929 DEBUG: Rule backtest-configmaps with JMESPath "starts_with(metadata.name, 'backtest-params-')" evaluated for ConfigMap backtest/backtest-params-536746: True
2020-01-02 20:01:38,929 DEBUG: Rule backtest-configmaps applies 48h TTL to ConfigMap backtest/backtest-params-536746
2020-01-02 20:01:38,929 DEBUG: ConfigMap backtest-params-536746 with 48h TTL is 2d19h34m29s old
2020-01-02 20:01:38,929 INFO: ConfigMap backtest-params-536746 with 48h TTL is 2d19h34m29s old and will be deleted (rule backtest-configmaps matches)
2020-01-02 20:01:38,934 DEBUG: https://172.20.0.1:443 "POST /api/v1/namespaces/backtest/events HTTP/1.1" 201 867
2020-01-02 20:01:38,934 INFO: Deleting ConfigMap backtest/backtest-params-536746..
2020-01-02 20:01:38,941 DEBUG: https://172.20.0.1:443 "DELETE /api/v1/namespaces/backtest/configmaps/backtest-params-536746 HTTP/1.1" 200 None

but the resource doesn't get deleted. Any idea of what might be happening?

Rules file:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-janitor
  namespace: backtest
data:
  rules.yaml: |-
    rules:
      - id: backtest-jobs
        resources:
          - jobs
        jmespath: "starts_with(metadata.name, 'backtest-')"
        ttl: 48h
      - id: backtest-configmaps
        resources:
          - configmaps
        jmespath: "starts_with(metadata.name, 'backtest-params-')"
        ttl: 48h

kube-janitor leaves behind replicaset relics

kube-janitor will leave behind an old ReplicaSet when you edit the Deployment for kube-janitor.

Steps to reproduce:

  • kubectl apply -f deploy/
  • kubectl edit deploy kube-janitor
  • (some edit of the deployment)
  • kubectl get replicasets

Generic/global TTL rules for resources without annotation

For test or prototyping environments/clusters, it can be desirable to automatically calculate a TTL for resources based on certain rules, e.g.:

  • a user is allowed to deploy some random Docker image from an untrusted source (like Docker Hub) to the prototyping environment, but the deployment/pods should be automatically deleted after 4 days
  • resources without certain labels (e.g. pointing to a registered application, team, department, service unit, etc.) should be deleted automatically after 7 days
  • namespaces with a certain name pattern (e.g. "pr-*" created by CI/CD PR deployments) should automatically be removed after 8 hours

ClusterRole vs namespaced permissions

Looks like kube-janitor expects ClusterRole-level permissions.

However, for our least-privilege approach we cannot grant ClusterRole-level permissions.
@hjacobs, could namespace-limited access via a Role be supported instead?

2020-03-11T15:53:13.726691299Z requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://10.100.0.1:443/api/v1/namespaces
2020-03-11T15:53:23.731598165Z 2020-03-11 15:53:23,731 DEBUG: Starting new HTTPS connection (1): 10.100.0.1
2020-03-11T15:53:23.73769914Z 2020-03-11 15:53:23,737 DEBUG: https://10.100.0.1:443 "GET /api/v1/namespaces HTTP/1.1" 403 297
2020-03-11T15:53:23.738242474Z 2020-03-11 15:53:23,737 ERROR: Failed to clean up: 403 Client Error: Forbidden for url: https://10.100.0.1:443/api/v1/namespaces
2020-03-11T15:53:23.738259476Z Traceback (most recent call last):
2020-03-11T15:53:23.738264047Z   File "/kube_janitor/main.py", line 66, in run_loop
2020-03-11T15:53:23.738267899Z     clean_up(
2020-03-11T15:53:23.738271363Z   File "/kube_janitor/janitor.py", line 279, in clean_up
2020-03-11T15:53:23.738274853Z     for namespace in Namespace.objects(api):
2020-03-11T15:53:23.738278123Z   File "/usr/local/lib/python3.8/site-packages/pykube/query.py", line 196, in __iter__
2020-03-11T15:53:23.738282166Z     return iter(self.query_cache["objects"])
2020-03-11T15:53:23.738285887Z   File "/usr/local/lib/python3.8/site-packages/pykube/query.py", line 186, in query_cache
2020-03-11T15:53:23.738297474Z     cache["response"] = self.execute().json()
2020-03-11T15:53:23.738301192Z   File "/usr/local/lib/python3.8/site-packages/pykube/query.py", line 161, in execute
2020-03-11T15:53:23.738304959Z     r.raise_for_status()
2020-03-11T15:53:23.738308315Z   File "/usr/local/lib/python3.8/site-packages/requests/models.py", line 940, in raise_for_status
2020-03-11T15:53:23.738312089Z     raise HTTPError(http_error_msg, response=self)

Support ARM/ARM64

I have a small k3s cluster running on a couple of Raspberry Pis, and given the size of the cluster, some automatic clean-up of stuff I forget would be very helpful.

Any chance you can build a multi-arch image?

PS: You could try out https://pypi.org/project/dockerma/ for a simple way to build multi-arch images.
Disclaimer: I'm the author of dockerma, and as far as I know I'm the only one who has used it so far 😄.
