Comments (12)

apricote commented on June 28, 2024

I was able to quickly reproduce this issue in a fresh cluster using 5 cronjobs on a 1-minute schedule.

grafana-csi-cpu-usage-2_000

manifest for cronjobs
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-vol-1
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi

---

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: test-job-1
  namespace: default
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: test
            image: busybox
            command: ["echo", "Hello"]
            volumeMounts:
              - mountPath: /test
                name: test
          volumes:
            - name: test
              persistentVolumeClaim:
                claimName: test-vol-1

---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-vol-2
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi

---

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: test-job-2
  namespace: default
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: test
            image: busybox
            command: ["echo", "Hello"]
            volumeMounts:
              - mountPath: /test
                name: test
          volumes:
            - name: test
              persistentVolumeClaim:
                claimName: test-vol-2

---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-vol-3
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi

---

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: test-job-3
  namespace: default
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: test
            image: busybox
            command: ["echo", "Hello"]
            volumeMounts:
              - mountPath: /test
                name: test
          volumes:
            - name: test
              persistentVolumeClaim:
                claimName: test-vol-3


---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-vol-4
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi

---

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: test-job-4
  namespace: default
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: test
            image: busybox
            command: ["echo", "Hello"]
            volumeMounts:
              - mountPath: /test
                name: test
          volumes:
            - name: test
              persistentVolumeClaim:
                claimName: test-vol-4

---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-vol-5
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi

---

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: test-job-5
  namespace: default
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: test
            image: busybox
            command: ["echo", "Hello"]
            volumeMounts:
              - mountPath: /test
                name: test
          volumes:
            - name: test
              persistentVolumeClaim:
                claimName: test-vol-5
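
The five PVC/CronJob pairs above differ only in their numeric suffix, so the manifest can also be generated rather than copied by hand. The following is a minimal sketch, not part of the original report: the function names `render_pair` and `render_manifest` are made up for illustration, and the template simply mirrors the fields shown above.

```python
def render_pair(index: int, storage: str = "10Gi",
                schedule: str = "*/1 * * * *") -> str:
    """Return the YAML for one PVC plus the CronJob that mounts it.

    Hypothetical helper; field values are copied from the manifest above.
    """
    return f"""\
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-vol-{index}
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: {storage}

---

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: test-job-{index}
  namespace: default
spec:
  schedule: "{schedule}"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: test
            image: busybox
            command: ["echo", "Hello"]
            volumeMounts:
              - mountPath: /test
                name: test
          volumes:
            - name: test
              persistentVolumeClaim:
                claimName: test-vol-{index}
"""


def render_manifest(count: int) -> str:
    """Join `count` PVC/CronJob pairs with YAML document separators."""
    return "\n---\n\n".join(render_pair(i) for i in range(1, count + 1))


if __name__ == "__main__":
    # Print the same five pairs used in the reproduction.
    print(render_manifest(5))
```

Assuming the script is saved as `gen.py` (a hypothetical filename), the output can be piped straight into the cluster with `python gen.py | kubectl apply -f -`. Raising the count is a quick way to check whether the CPU usage scales with the number of attach/detach cycles per minute.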

from csi-driver.

thcyron commented on June 28, 2024

Maybe it would make sense to export some Go runtime metrics via Prometheus to help debug this.

LKaemmerling commented on June 28, 2024

@apricote could you try it with version 1.2.0 again?

We have had the same deployment running in one of our test clusters for a few days and can't see CPU usage like that in your screenshots.

apricote commented on June 28, 2024

@LKaemmerling I will do that once I have some free time.

apricote commented on June 28, 2024

I can still reproduce the issue (CSI Driver v1.2.1, K8s v1.16.3).

I set up dashboards with data from #67 today, so perhaps I will have some results tomorrow.

apricote commented on June 28, 2024

I could not detect any noticeable changes in the metrics that correlate with the increased CPU usage.

For reference, I used the dashboards Go gRPC1 and Go Runtime, as well as some self-built charts to visualize the metrics.

LKaemmerling commented on June 28, 2024

Screenshot 2019-12-10 at 08:10:30

I think we have found the issue 🎉 It was a combination of returning the wrong error codes and the k8s retry mechanism, which ran out of control...

Legend:

  1. Before our change
  2. Removal of the "Aborted" error code
  3. Addition of a dedicated error code for "volume already attached"

I will prepare the PR shortly. Thank you for your help!

Could you try this specific container: lkdevelopment/csi-driver:1.2.2? It contains the fix (it will be removed after the release).

cc @apricote
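
To illustrate why a retry loop can dominate CPU time when every call keeps failing, here is a minimal sketch that counts how many calls fit into a time window with and without exponential backoff. It is not the driver's or Kubernetes' actual retry logic: the function name `attempts_in_window` and all delay values are assumptions chosen for illustration only.

```python
def attempts_in_window(window_s: float, initial_delay_s: float,
                       factor: float = 1.0,
                       max_delay_s: float = float("inf")) -> int:
    """Count the calls a retry loop makes within `window_s` seconds
    when every call fails.

    factor=1.0 models a fixed delay (no backoff); factor>1.0 models
    exponential backoff, capped at `max_delay_s`. Hypothetical model.
    """
    elapsed = 0.0
    delay = initial_delay_s
    attempts = 0
    while elapsed <= window_s:
        attempts += 1      # one failing call is made at time `elapsed`
        elapsed += delay   # wait before the next retry
        delay = min(delay * factor, max_delay_s)
    return attempts


if __name__ == "__main__":
    # Fixed 0.5 s delay versus doubling the delay after each failure.
    print("fixed delay:", attempts_in_window(60, 0.5))
    print("with backoff:", attempts_in_window(60, 0.5, factor=2.0))
```

Under these assumed numbers, a caller that retries at a fixed short interval issues many times more requests per minute than one that backs off, which is the kind of difference an error code can trigger if it changes how the caller schedules its retries.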

apricote commented on June 28, 2024

I re-ran the test with the new versions and the problem still persists for me ☹️.

hcloud-csi-cpu-usage

LKaemmerling commented on June 28, 2024

Okay, could you give us some more information about your setup?

Which k8s version?
Which server types (and location)?

gRPC related:
Which codes are returned? (https://grafana.com/grafana/dashboards/9186) --> Graph "gRPC request code"

apricote commented on June 28, 2024

We deploy our Kubernetes cluster using Rancher with the docker-machine-driver-hetzner and ui-driver-hetzner.

Rancher: v2.3.2
Kubernetes: v1.16.3
Docker: v18.9.9
Server Type: cx21
Server Location: nbg1-dc3 (today's test), fsn1-dc14 (prod)

gRPC responses are 50/50 OK/Unavailable (v1.2.2):

hcloud-csi-cpu-response-code

LKaemmerling commented on June 28, 2024

Okay, thank you. We will try to reproduce this. My first thought is that this is normal behavior.

Attaching and detaching volumes is really CPU intensive for the controller. On our test cluster (with CCX31) we saw the controller use ~0.5% of CPU. A CCX31 has 8 (dedicated) cores.

As a general thought, we recommend using CCX types for k8s workloads.

LKaemmerling commented on June 28, 2024

@apricote I have now monitored this on some cx21 servers over a month and cannot see any abnormal behavior. The controller needs some resources because it makes a lot of attach and detach requests (and watches the results). Therefore I will close this issue.
