
Comments (19)

Philipp-Reisner avatar Philipp-Reisner commented on July 27, 2024 4

Fixed on the DRBD side with LINBIT/drbd@857db82 and LINBIT/drbd@343e077.

from piraeus.

WanzenBug avatar WanzenBug commented on July 27, 2024 3

Just wanted to let you know that we think we have tracked down the issue. No fix yet, but we should have something ready for the next DRBD release.

WanzenBug avatar WanzenBug commented on July 27, 2024 2

Yes, there will be a 1.5.1 with that. We still intend to fix the issue in DRBD, too.

ksyblast avatar ksyblast commented on July 27, 2024

vmcore-dmesg.txt.tar.gz
Full log attached
Any tips or ideas are highly appreciated

ksyblast avatar ksyblast commented on July 27, 2024

Looks like this is similar to LINBIT/drbd#86

WanzenBug avatar WanzenBug commented on July 27, 2024

Hello! Thanks for the report. I guess it would be a good idea to add that information to the DRBD issue, as that seems to be the root cause.

We have seen it internally, but never been able to reproduce it reliably. Adding more context seems like a good idea.

ksyblast avatar ksyblast commented on July 27, 2024

Thanks for the answer. Should I add more details about how I reproduced it?

ksyblast avatar ksyblast commented on July 27, 2024

Also, does it make sense to try an older piraeus version? The issue also reproduces with DRBD 9.2.6 and piraeus v2.3.0.

WanzenBug avatar WanzenBug commented on July 27, 2024

You could try DRBD 9.1.18.

That does mean you have to use host networking, but you are already using that.
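For reference, a quick way to check which DRBD module version a node is currently running (standard DRBD tooling; output layout varies by version):

```shell
# Version of the currently loaded DRBD kernel module (if DRBD is active)
cat /proc/drbd

# Or read the version from the module metadata without loading it
modinfo drbd | grep '^version'
```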

duckhawk avatar duckhawk commented on July 27, 2024

@WanzenBug hello. Here are our reproduction steps:

We have a 5-node k8s cluster with SSD storage pools of 100 GB each (thin LVM)

All queues are processed with a single worker each:
csiAttacherWorkerThreads: 1
csiProvisionerWorkerThreads: 1
csiSnapshotterWorkerThreads: 1
csiResizerWorkerThreads: 1

  • 30 StatefulSets are created with 3 replicas each, and each replica has a 5 GiB PV
  • When the pods are up, we resize all PVCs to 5.1 GiB
  • After all PV resizes have finished, we delete the namespace
  • After that we restart the process from the beginning

When this scheme runs in a continuous cycle, we almost invariably see several node reboots per day. The operating system does not seem to matter; we have hit the same problem with various 5.x and 6.x kernels from different distributions, but the issue is definitely reproducible on the current LTS Ubuntu 22.04.
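The cycle above can be sketched roughly as follows. This is an illustrative sketch, not the exact tooling we used: the namespace, the manifest file names, and the timeout are all assumptions.

```shell
# Rough sketch of the reproduction cycle described above.
# Namespace, manifest names, and timeouts are illustrative.
while true; do
  kubectl create namespace test1

  # 30 StatefulSets with 3 replicas each, one 5 Gi PVC per replica
  for i in $(seq 0 29); do
    kubectl -n test1 apply -f "sts-${i}.yaml"   # hypothetical manifest files
  done
  kubectl -n test1 wait pod --all --for=condition=Ready --timeout=30m

  # Grow every PVC from 5Gi to 5.1Gi to trigger an online resize
  for pvc in $(kubectl -n test1 get pvc -o name); do
    kubectl -n test1 patch "$pvc" --type merge \
      -p '{"spec":{"resources":{"requests":{"storage":"5.1Gi"}}}}'
  done

  # After the resizes have finished, tear everything down and repeat
  kubectl delete namespace test1 --wait=true
done
```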

STS spec:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: flog-generator-0
  namespace: test1
spec:
  podManagementPolicy: OrderedReady
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: flog-generator-0
  serviceName: ""
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/name: flog-generator-0
    spec:
      containers:
      - args:
        - -c
        - /srv/flog/run.sh 2>&1 | tee -a /var/log/flog/fake.log
        command:
        - /bin/sh
        env:
        - name: FLOG_BATCH_SIZE
          value: "1024000"
        - name: FLOG_TIME_INTERVAL
          value: "1"
        image: ex42zav/flog:0.4.3
        imagePullPolicy: IfNotPresent
        name: flog-generator
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/log/flog
          name: flog-pv
      - env:
        - name: LOGS_DIRECTORIES
          value: /var/log/flog
        - name: LOGROTATE_INTERVAL
          value: hourly
        - name: LOGROTATE_COPIES
          value: "2"
        - name: LOGROTATE_SIZE
          value: 500M
        - name: LOGROTATE_CRONSCHEDULE
          value: 0 2 * * * *
        image: blacklabelops/logrotate
        imagePullPolicy: Always
        name: logrotate
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/log/flog
          name: flog-pv
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 30
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: flog-pv
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: linstor-r2
      volumeMode: Filesystem

WanzenBug avatar WanzenBug commented on July 27, 2024

Thanks! You could also try switching to DRBD 9.1.18. We suspect there is a race condition introduced in the 9.2 branch.

WanzenBug avatar WanzenBug commented on July 27, 2024

Another idea on what might be causing the issue, with a work around in the CSI driver: piraeusdatastore/linstor-csi#256

You might try that by using the v1.5.0-2-g16c206a tag for the CSI image. You can edit the piraeus-operator-image-config to change the image.
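One possible way to make that change (the namespace and the exact layout of the ConfigMap are assumptions; verify them against your deployment before editing):

```shell
# Open the image-config ConfigMap mentioned above for editing.
# The namespace "piraeus-datastore" is an assumption; adjust as needed.
kubectl -n piraeus-datastore edit configmap piraeus-operator-image-config

# In the editor, point the CSI image entry at the test tag,
# e.g. a line such as:
#   tag: v1.5.0-2-g16c206a
```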

ksyblast avatar ksyblast commented on July 27, 2024

We have tested with DRBD 9.1.18. The issue does not reproduce with this version.

duckhawk avatar duckhawk commented on July 27, 2024

I'm also testing 9.1.18 now.
Could you please tell me whether it is safe to move an existing installation from 9.2.5 to 9.1.18?

WanzenBug avatar WanzenBug commented on July 27, 2024

Could you please tell me whether it is safe to move an existing installation from 9.2.5 to 9.1.18?

Yes, it is safe.

duckhawk avatar duckhawk commented on July 27, 2024

@WanzenBug it looks like v1.5.0-2-g16c206a solves the node restart problem. Could you please create a tagged release with it (maybe 1.5.1)?
Also, it looks like there is still a problem inside DRBD that causes a crash under some conditions. Will you fix that as well? If you can't reproduce it, I can give you SSH access to a cluster where I can reproduce the situation.

WanzenBug avatar WanzenBug commented on July 27, 2024

Thank you for testing! So just to confirm, you tested with DRBD 9.2.8 and the above CSI version and did not observe the crash?

Then it must have something to do with removing a volume from a resource, as I expected. I will use that to try to reproduce the behaviour.

duckhawk avatar duckhawk commented on July 27, 2024

We tested this with 9.2.5 and 9.2.8, together with the above CSI version. Yes, there were no crashes anymore.

Thank you, I'll wait for your solution.

Can you tell us whether the fix from v1.5.0-2-g16c206a will be included in 1.5.1?

ksyblast avatar ksyblast commented on July 27, 2024

We will also test with 1.5.1 and DRBD 9.2.8 once 1.5.1 is released
