Comments (19)
Fixed on the DRBD side with LINBIT/drbd@857db82 and LINBIT/drbd@343e077.
Just wanted to let you know that we think we have tracked down the issue. No fix yet, but we should have something ready for the next DRBD release.
Yes, there will be a 1.5.1 with that. We still intend to fix the issue in DRBD, too.
vmcore-dmesg.txt.tar.gz
Full log attached.
Any tips or ideas are highly appreciated.
Looks like this is similar to LINBIT/drbd#86
Hello! Thanks for the report. I guess it would be a good idea to add that information to the DRBD issue, as that seems to be the root cause.
We have seen it internally, but have never been able to reproduce it reliably. Adding more context seems like a good idea.
Thanks for the answer. Should I add more details on how I reproduced it?
Also, does it make sense to try an older piraeus version? The issue also reproduces with DRBD 9.2.6 and piraeus v2.3.0.
You could try DRBD 9.1.18.
That does mean you have to use host networking, but you already do use that.
@WanzenBug hello. Here are our reproduction steps:
We have a 5-node k8s cluster with SSD storage pools of 100 GB each (thin LVM).
All CSI worker queues are limited to 1 parallel operation:
csiAttacherWorkerThreads: 1
csiProvisionerWorkerThreads: 1
csiSnapshotterWorkerThreads: 1
csiResizerWorkerThreads: 1
- 30 STS are created with 3 replicas each, and each replica has a 5 GB PV
- When the pods are up, we resize all PVCs to 5.1 GB
- After all PV resizes have finished, we delete the namespace.
- Then we restart the process from the beginning.
When this scheme runs in a continuous loop, we almost invariably see several node reboots per day. The operating system does not seem to matter; we have hit the same problem with various 5.x and 6.x kernels from different distributions, but the issue is definitely reproducible on the current LTS Ubuntu 22.04. (A rough script of the cycle is sketched after the spec below.)
STS spec:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: flog-generator-0
  namespace: test1
spec:
  podManagementPolicy: OrderedReady
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: flog-generator-0
  serviceName: ""
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/name: flog-generator-0
    spec:
      containers:
      - args:
        - -c
        - /srv/flog/run.sh 2>&1 | tee -a /var/log/flog/fake.log
        command:
        - /bin/sh
        env:
        - name: FLOG_BATCH_SIZE
          value: "1024000"
        - name: FLOG_TIME_INTERVAL
          value: "1"
        image: ex42zav/flog:0.4.3
        imagePullPolicy: IfNotPresent
        name: flog-generator
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/log/flog
          name: flog-pv
      - env:
        - name: LOGS_DIRECTORIES
          value: /var/log/flog
        - name: LOGROTATE_INTERVAL
          value: hourly
        - name: LOGROTATE_COPIES
          value: "2"
        - name: LOGROTATE_SIZE
          value: 500M
        - name: LOGROTATE_CRONSCHEDULE
          value: 0 2 * * * *
        image: blacklabelops/logrotate
        imagePullPolicy: Always
        name: logrotate
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/log/flog
          name: flog-pv
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 30
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: flog-pv
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: linstor-r2
      volumeMode: Filesystem
```
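For reference, the cycle above can be scripted roughly like the sketch below. It is a minimal outline, not our exact tooling: the manifest file name (flog-sts.yaml), the sleep-based waiting and the way the 30 StatefulSets are generated are all assumptions.

```bash
#!/usr/bin/env bash
# Rough sketch of the reproduction cycle (assumption: the STS spec above is
# saved as flog-sts.yaml; the real tooling polls pod/PVC status instead of sleeping).
set -euo pipefail

NS=test1

while true; do
  kubectl create namespace "$NS"

  # 1. Create 30 StatefulSets, 3 replicas each, one 5Gi PV per replica.
  for i in $(seq 0 29); do
    sed "s/flog-generator-0/flog-generator-$i/g" flog-sts.yaml | kubectl apply -f -
  done
  sleep 60  # give the controllers a moment to create the pods
  kubectl -n "$NS" wait pod --all --for=condition=Ready --timeout=30m

  # 2. Grow every PVC from 5Gi to roughly 5.1 GB.
  for pvc in $(kubectl -n "$NS" get pvc -o name); do
    kubectl -n "$NS" patch "$pvc" --type merge \
      -p '{"spec":{"resources":{"requests":{"storage":"5200Mi"}}}}'
  done

  # 3. Let the resizes finish, then tear everything down and start over.
  sleep 600
  kubectl delete namespace "$NS" --wait
done
```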
Thanks! You could also try switching to DRBD 9.1.18. We suspect there is a race condition introduced in the 9.2 branch.
Another idea about what might be causing the issue, with a workaround in the CSI driver: piraeusdatastore/linstor-csi#256
You could try that by using the v1.5.0-2-g16c206a tag for the CSI image. You can edit the piraeus-operator-image-config to change the image.
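Roughly, assuming a default operator v2 deployment (the namespace, the ConfigMap layout and the image name below are assumptions; adjust to your setup):

```bash
# Edit the operator's image override ConfigMap...
kubectl -n piraeus-datastore edit configmap piraeus-operator-image-config
# ...and point the linstor-csi image at the test build, e.g.
#   quay.io/piraeusdatastore/piraeus-csi:v1.5.0-2-g16c206a
# The operator should then roll out CSI controller/node pods using that image.
```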
We have tested with DRBD 9.1.18. It looks like the issue does not reproduce with this version.
I'm also testing 9.1.18 now.
Could you please tell me, is it safe to move an existing installation from 9.2.5 to 9.1.18?
> Could you please tell me, is it safe to move an existing installation from 9.2.5 to 9.1.18?
Yes, it is safe.
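If it helps, here is a quick way to check which DRBD version a node is actually running after the switch (plain shell on the node or inside the satellite pod):

```bash
# Confirm the loaded kernel module and the userspace tools.
cat /proc/drbd      # first line shows e.g. "version: 9.1.18 (api:.../proto:...)"
drbdadm --version   # prints userspace and kernel module version information
```

Keep in mind the new module is typically only picked up once the old one can be unloaded, e.g. after the satellite restarts with no DRBD resources up or after a node reboot.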
@WanzenBug it looks like v1.5.0-2-g16c206a solves the node restart problem. Could you please create a tagged release with it (maybe 1.5.1)?
Also, it looks like there is also a problem inside DRBD that causes a crash under some conditions? Will you fix it? If you can't reproduce the situation, I think I can give you SSH access to a cluster where I can reproduce it for you.
Thank you for testing! So just to confirm, you tested with DRBD 9.2.8 and the above CSI version and did not observe the crash?
Then it must have something to do with removing a volume from a resource, as I expected. I will use that to try to reproduce the behaviour.
We tested this with 9.2.5 and 9.2.8, and the above CSI version. Yes, there were no crashes anymore.
Thank you, I'll wait for your solution.
Can you tell me, will the fix from v1.5.0-2-g16c206a be included in 1.5.1?
We will also test with 1.5.1 and DRBD 9.2.8 when 1.5.1 is released.