Summary In our deployment the csi-driver will slowly raise its CPU


CPU Usage slowly increasing about csi-driver HOT 12 CLOSED

hetznercloud commented on June 28, 2024

CPU Usage slowly increasing

from csi-driver.

Comments (12)

apricote commented on June 28, 2024

I was able to quickly reproduce this issue in a fresh cluster using 5 cronjobs on a 1-minute schedule.

manifest for cronjobs

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-vol-1
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi

---

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: test-job-1
  namespace: default
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: test
            image: busybox
            command: ["echo", "Hello"]
            volumeMounts:
              - mountPath: /test
                name: test
          volumes:
            - name: test
              persistentVolumeClaim:
                claimName: test-vol-1

---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-vol-2
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi

---

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: test-job-2
  namespace: default
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: test
            image: busybox
            command: ["echo", "Hello"]
            volumeMounts:
              - mountPath: /test
                name: test
          volumes:
            - name: test
              persistentVolumeClaim:
                claimName: test-vol-2

---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-vol-3
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi

---

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: test-job-3
  namespace: default
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: test
            image: busybox
            command: ["echo", "Hello"]
            volumeMounts:
              - mountPath: /test
                name: test
          volumes:
            - name: test
              persistentVolumeClaim:
                claimName: test-vol-3


---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-vol-4
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi

---

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: test-job-4
  namespace: default
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: test
            image: busybox
            command: ["echo", "Hello"]
            volumeMounts:
              - mountPath: /test
                name: test
          volumes:
            - name: test
              persistentVolumeClaim:
                claimName: test-vol-4

---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-vol-5
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi

---

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: test-job-5
  namespace: default
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: test
            image: busybox
            command: ["echo", "Hello"]
            volumeMounts:
              - mountPath: /test
                name: test
          volumes:
            - name: test
              persistentVolumeClaim:
                claimName: test-vol-5
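
The manifest above repeats one PVC/CronJob pair five times, with only the index changing. A minimal sketch for generating the same pairs for any count follows; the names `PAIR_TEMPLATE` and `render_manifest` and the `count` parameter are illustrative and not part of the original report.

```python
# Sketch: render N PersistentVolumeClaim + CronJob pairs like the manifest above.
# PAIR_TEMPLATE, render_manifest and count are illustrative names only.

PAIR_TEMPLATE = """\
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-vol-{i}
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi

---

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: test-job-{i}
  namespace: default
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: test
            image: busybox
            command: ["echo", "Hello"]
            volumeMounts:
              - mountPath: /test
                name: test
          volumes:
            - name: test
              persistentVolumeClaim:
                claimName: test-vol-{i}
"""


def render_manifest(count: int = 5) -> str:
    """Return a multi-document YAML string with `count` PVC/CronJob pairs."""
    pairs = [PAIR_TEMPLATE.format(i=i) for i in range(1, count + 1)]
    # Separate the pairs with the same document marker used inside each pair.
    return "\n---\n\n".join(pairs)


if __name__ == "__main__":
    print(render_manifest(5), end="")
```

The output can be piped to `kubectl apply -f -`. The template keeps `batch/v1beta1`, which matches the Kubernetes v1.16 cluster in this thread; that API version was removed in Kubernetes 1.25, so newer clusters would need `batch/v1` instead.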


thcyron commented on June 28, 2024

Maybe it would make sense to export some Go runtime metrics via Prometheus to help debug this.


LKaemmerling commented on June 28, 2024

@apricote could you try it with version 1.2.0 again?

We have had the same deployment running for a few days in one of our test clusters and can't see CPU usage like that in your screenshots.


apricote commented on June 28, 2024

@LKaemmerling I will do that once I have some free time.


apricote commented on June 28, 2024

I can still reproduce the issue (CSI Driver v1.2.1, K8s v1.16.3).

I set up dashboards with data from #67 today, so perhaps I will have some results tomorrow.


apricote commented on June 28, 2024

I could not detect any noticeable changes in the metrics that correlate with the increased CPU usage.

For reference, I used the dashboards Go gRPC1 and Go Runtime, as well as some self-built charts to visualize the metrics.


LKaemmerling commented on June 28, 2024

I think we have found the issue 🎉 It was a combination of returning the wrong error codes and the k8s retry mechanism running out of control.

Legend:

Before our change
Removal of the "Aborted" error code
Addition of a dedicated error code for "volume already attached"

I will prepare the PR shortly. Thank you for your help!

Could you try this specific container: lkdevelopment/csi-driver:1.2.2? It contains the fix (the image will be removed after the release).

cc @apricote
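
To illustrate how an error code combined with a retry loop can produce slowly growing load, here is a toy model. It is not the driver's code and makes strong assumptions: every attach request fails because the volume is already attached, the caller retries once per tick with no backoff, and `RETRYABLE` / `TERMINAL` are hypothetical labels for how the caller interprets the returned code.

```python
from typing import List

# Hypothetical labels: how the caller treats the error returned by the driver.
RETRYABLE = "retryable"   # caller keeps the request queued and tries again
TERMINAL = "terminal"     # caller gives up on the request


def calls_per_tick(ticks: int, new_requests: int, error_kind: str) -> List[int]:
    """Count the attach calls handled in each tick of the toy model."""
    pending = 0            # requests currently waiting for an attempt
    history = []
    for _ in range(ticks):
        pending += new_requests      # new attach requests arrive
        history.append(pending)      # every pending request is attempted once
        if error_kind == TERMINAL:
            pending = 0              # failed requests are dropped, not re-queued
        # RETRYABLE: failed requests stay pending and are attempted next tick
    return history


if __name__ == "__main__":
    print(calls_per_tick(5, 5, RETRYABLE))  # [5, 10, 15, 20, 25]
    print(calls_per_tick(5, 5, TERMINAL))   # [5, 5, 5, 5, 5]
```

In this simplified model the retryable case grows linearly while the terminal case stays flat. Real retry mechanisms usually apply backoff, so the actual growth would be slower than shown here.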


apricote commented on June 28, 2024

I re-ran the test with the new versions and the problem still persists for me ☹️.


LKaemmerling commented on June 28, 2024

Okay, could you give us some more information about your setup?

Which k8s version?
Which server types (and location)?

GRPC related:
Which codes are returned? (https://grafana.com/grafana/dashboards/9186) --> Graph "gRPC request code"
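
Without the Grafana dashboard, the same breakdown could be estimated from the raw metrics text. This is a rough sketch that assumes the driver exposes go-grpc-prometheus style counters named `grpc_server_handled_total` with a `grpc_code` label; the sample values are made up for illustration and are not measurements from this issue.

```python
import re
from collections import Counter
from typing import Dict

# Assumption: counters follow the go-grpc-prometheus naming scheme.
# The numbers below are invented sample data, not real measurements.
SAMPLE_METRICS = '''\
# TYPE grpc_server_handled_total counter
grpc_server_handled_total{grpc_code="OK",grpc_method="ControllerPublishVolume"} 90
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="ControllerPublishVolume"} 150
grpc_server_handled_total{grpc_code="OK",grpc_method="ControllerUnpublishVolume"} 60
'''

SAMPLE_LINE = re.compile(r'^grpc_server_handled_total\{([^}]*)\}\s+([0-9.eE+-]+)$')
CODE_LABEL = re.compile(r'grpc_code="([^"]*)"')


def code_shares(metrics_text: str) -> Dict[str, float]:
    """Return the fraction of handled requests per gRPC status code."""
    totals = Counter()
    for line in metrics_text.splitlines():
        sample = SAMPLE_LINE.match(line.strip())
        if not sample:
            continue                      # skip comments and other metrics
        code = CODE_LABEL.search(sample.group(1))
        if code:
            totals[code.group(1)] += float(sample.group(2))
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {name: value / grand_total for name, value in totals.items()}


if __name__ == "__main__":
    for name, share in sorted(code_shares(SAMPLE_METRICS).items()):
        print(f"{name}: {share:.0%}")
```

The counters are cumulative since process start, so comparing two scrapes taken some minutes apart gives a better picture of the current ratio than a single scrape.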


apricote commented on June 28, 2024

We deploy our Kubernetes Cluster using Rancher with the docker-machine-driver-hetzner and ui-driver-hetzner.

Rancher: v2.3.2
Kubernetes: v1.16.3
Docker: v18.9.9
Server Type: cx21
Server Location: nbg1-dc3 (today's test), fsn1-dc14 (prod)

gRPC Responses are 50/50 OK/Unavailable (v1.2.2):


LKaemmerling commented on June 28, 2024

Okay, thank you. We will try to reproduce this. My first thought is that this is normal behavior.

The process of attaching/detaching volumes is really CPU intensive for the controller. We saw on our test cluster (with CCX31) that the controller uses ~0.5% of CPU. A CCX31 has 8 (dedicated) cores.

As a general note, we recommend using CCX types for k8s workloads.


LKaemmerling commented on June 28, 2024

@apricote I have now monitored this on some cx21 servers over a month and cannot see any abnormal behavior. The controller needs some resources because it issues a lot of attach and detach requests (and watches for the results). Therefore I will close this issue.
