Comments (10)
Hi - Thanks for using the project and the detailed bug report.
Given the error Text file busy (os error 26), which usually occurs when the OS tries to run a binary while it is being copied, and the fact that it only happens on 2-3 nodes, it looks like some sort of race condition during the deploy.
This is unusual, as the core agent deploys the composer before changing the sysctls.
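For anyone who wants to see the error outside of the agent, here is a minimal sketch of how it can be provoked on Linux. It copies a binary, runs the copy, and then tries to open the copy for writing while it is still running. The path /bin/sleep, the temporary file name, and the function names are assumptions for illustration only; none of this is taken from the agent code, and some kernels or sandboxes may behave differently.

```rust
use std::fs::{self, OpenOptions};
use std::io;
use std::os::unix::process::CommandExt;
use std::path::Path;
use std::process::{Command, Stdio};

/// ETXTBSY on Linux; Rust prints it as "Text file busy (os error 26)".
const ETXTBSY: i32 = 26;

/// True if `err` is the "Text file busy" error discussed in this issue.
fn is_text_file_busy(err: &io::Error) -> bool {
    err.raw_os_error() == Some(ETXTBSY)
}

/// Copies `src_binary` to `dest`, starts the copy, then tries to open the
/// copy for writing while it is still running.
/// Returns the error from that open, or None if the open succeeded.
fn open_for_write_while_running(
    src_binary: &Path,
    dest: &Path,
) -> io::Result<Option<io::Error>> {
    fs::copy(src_binary, dest)?; // fs::copy also copies the permission bits
    let mut child = Command::new(dest)
        .arg0("sleep") // keep argv[0] so multi-call binaries still act as sleep
        .arg("30")
        .stdout(Stdio::null())
        .spawn()?;
    let attempt = OpenOptions::new().write(true).open(dest);
    // Clean up the helper process and the temporary copy.
    child.kill()?;
    child.wait()?;
    fs::remove_file(dest)?;
    Ok(attempt.err())
}

fn main() -> io::Result<()> {
    // Hypothetical scratch location; it must allow executing files.
    let dest = std::env::temp_dir().join(format!("etxtbsy-demo-{}", std::process::id()));
    match open_for_write_while_running(Path::new("/bin/sleep"), &dest)? {
        Some(err) if is_text_file_busy(&err) => println!("reproduced: {}", err),
        Some(err) => println!("different error: {}", err),
        None => println!("open succeeded; this kernel did not block the write"),
    }
    Ok(())
}
```

The same check on the raw OS error code could be used to tell this failure apart from other I/O errors when a copy fails.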
A couple of questions:
It seems very unlikely but is it possible that there are core dumps happening in the production environment as the chart is being deployed?
Did this happen on the very first install or part of an upgrade?
If it is an upgrade, can you confirm that you are removing the chart and then installing it when upgrading, as outlined here?
I'll try and replicate this week but it's likely to be Friday before I get to do an install.
from core-dump-handler.
It seems very unlikely but is it possible that there are core dumps happening in the production environment as the chart is being deployed?
Other than this project, we have nothing else that is in place that interacts with the kernel to generate coredumps on our GKE workers. I'm not sure if they are even enabled by default.
Did this happen on the very first install or part of an upgrade?
It happened on the second install (I initially tried to upgrade v8.0.3 to v8.3.1 due to the agents not picking up on the segfaulter test container crashing).
If it is an upgrade can you confirm that you are removing the chart and installing it when upgrading as outlined here
I didn't upgrade this way; I used the helm upgrade command.
I've just tried it using the instructions, without success (tried with v8.0.3 and v8.3.1).
from core-dump-handler.
OK, the fact that it happened on the second try and that upgrade was used is starting to give a little insight.
When I tested helm upgrade previously it didn't work, hence the update note in the README.
Specifically the patch strategy of helm/k8s seemed to create order issues between the termination of the old container cleanup code and the new container deployment code.
Current recommendation is to remove the chart and then install as per docs.
The long term fix would be to put the deploy of the host config and the agent into two separate containers and manage them through an operator but that is not in the current road map.
As an aside: on the question about core dumps happening, I was looking to find out whether you have other containers crashing, not whether you have something else installed.
from core-dump-handler.
On the question about core dumps happening, I was looking to find out whether you have other containers crashing, not whether you have something else installed.
Yes, I confirm that some containers were crashing before and during the installation of coredump helm chart.
from core-dump-handler.
OK, other containers crashing during the install will cause the text file busy error.
The code tries to mitigate against it, but it can't guarantee that it won't happen.
If the uninstall and reinstall doesn't work during a period when there are no crashing containers, then please update this issue.
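The thread does not say how the agent mitigates this, so the following is not the project's method. It is a sketch of one common technique for replacing a binary that may be running: copy the new file to a temporary name in the same directory, then rename it over the target. A rename swaps the directory entry without writing to the file that is being executed, so it does not hit Text file busy. The function name and the .new suffix are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Replaces `dest` with a copy of `src` by copying to a sibling temp file
/// and renaming it into place. Processes still running the old file keep
/// the old inode; new executions see the new file.
fn replace_binary(src: &Path, dest: &Path) -> io::Result<()> {
    let file_name = dest
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "dest has no file name"))?;
    // The temp file lives next to dest so the rename stays on one filesystem.
    let mut tmp_name = file_name.to_os_string();
    tmp_name.push(".new");
    let tmp = dest.with_file_name(tmp_name);

    fs::copy(src, &tmp)?;
    fs::rename(&tmp, dest).map_err(|e| {
        // Best-effort cleanup of the temp file if the rename fails.
        let _ = fs::remove_file(&tmp);
        e
    })
}

fn main() -> io::Result<()> {
    // Demo in a scratch directory; the file names are illustrative only.
    let dir = std::env::temp_dir().join(format!("replace-demo-{}", std::process::id()));
    fs::create_dir_all(&dir)?;
    let src = dir.join("cdc-new-build");
    let dest = dir.join("cdc");
    fs::write(&src, b"new contents")?;
    fs::write(&dest, b"old contents")?;

    replace_binary(&src, &dest)?;
    println!("{}", fs::read_to_string(&dest)?);
    fs::remove_dir_all(&dir)
}
```

This only avoids the failed copy. Processes that are still running the old file keep running it until they exit.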
from core-dump-handler.
Just tried again (following the install/update notes) in the prod cluster when no containers were crashing, and the result is the same. It's actually the same three nodes that have this issue.
Is there some element holding a "state" in the faulty nodes that could be cleared/reset?
from core-dump-handler.
It's likely the kernel thinks that something is using the cdc exe.
Judging by where the deploy process has got to, the error is raised when copying the cdc file to the host machine:
https://github.com/IBM/core-dump-handler/blob/main/core-dump-agent/src/main.rs#L435
If you can get onto the host machine and run lsof | grep cdc you might be able to see if a process is holding on to the cdc file.
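If lsof is not available on the node, a rough equivalent can be sketched by walking /proc and comparing each process's exe link with the binary path. This is only an illustration: the function name is hypothetical, it needs root to read other users' processes, and it assumes the path seen inside the process matches the path you pass in.

```rust
use std::fs;
use std::path::Path;

/// Returns the PIDs whose /proc/<pid>/exe link points at `target`.
/// Processes whose exe link cannot be read are skipped. If the file was
/// deleted or replaced, the kernel reports the link with a " (deleted)"
/// suffix, and such processes will not match.
fn pids_running(target: &Path) -> Vec<u32> {
    let mut pids = Vec::new();
    let entries = match fs::read_dir("/proc") {
        Ok(entries) => entries,
        Err(_) => return pids,
    };
    for entry in entries.flatten() {
        // Only numeric directory names under /proc are processes.
        let pid = match entry
            .file_name()
            .to_str()
            .and_then(|name| name.parse::<u32>().ok())
        {
            Some(pid) => pid,
            None => continue,
        };
        if let Ok(exe) = fs::read_link(entry.path().join("exe")) {
            if exe.as_path() == target {
                pids.push(pid);
            }
        }
    }
    pids
}

fn main() {
    // Path taken from this thread; adjust it for your own nodes.
    let target = Path::new("/home/kubernetes/bin/cdc");
    for pid in pids_running(target) {
        println!("pid {} is running {}", pid, target.display());
    }
}
```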
from core-dump-handler.
I used toolbox for CoreOS to troubleshoot the issue after connecting to one of the GKE nodes. The command ps was used to find processes using /home/kubernetes/bin/cdc.
The issue I'm seeing is that some segfaulter processes that I ran weeks ago got stuck when handled by /home/kubernetes/bin/cdc.
Examples:
# ps aux | grep cdc
root 2070765 0.0 0.0 3080 896 ? S+ 15:54 0:00 grep cdc
root 3423665 0.0 0.0 13584 2480 ? S Apr26 0:00 /home/kubernetes/bin/cdc -c=18446744073709551615 -e=segfaulter -p=1 -s=4 -t=1650983080 -d=/home/kubernetes/cores -h=segfaulter -E=!usr!local!bin!segfaulter
root 3425383 0.0 0.0 13584 2492 ? S Apr26 0:00 /home/kubernetes/bin/cdc -c=18446744073709551615 -e=segfaulter -p=1 -s=4 -t=1650983162 -d=/home/kubernetes/cores -h=segfaulter -E=!usr!local!bin!segfaulter
root 3427276 0.0 0.0 13584 2360 ? S Apr26 0:00 /home/kubernetes/bin/cdc -c=18446744073709551615 -e=segfaulter -p=1 -s=4 -t=1650983257 -d=/home/kubernetes/cores -h=segfaulter -E=!usr!local!bin!segfaulter
Some production workloads that crashed were also stuck.
Killing the processes with kill <pid> and then restarting the core-dump pod for that worker node fixes the issue of the pod not starting.
Since this issue is resolved, should we close this thread and open a new one for the issue of core dumps freezing when being handled by the cdc process?
from core-dump-handler.
Glad you have it resolved.
It would be great to know what is causing cdc to hang. Does this issue reoccur now that you have upgraded?
I think the mitigation for this would be to have cdc spawn a supervised task that has a timeout.
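As a rough idea of what a supervised task with a timeout could look like, here is a sketch using only the Rust standard library. It polls the child process and kills it once a deadline passes. The function name, the 50 ms poll interval, and the sleep command used in the demo are assumptions for illustration; this is not how cdc works today.

```rust
use std::io;
use std::process::{Child, Command, ExitStatus};
use std::thread;
use std::time::{Duration, Instant};

/// Waits for `child` to exit for at most `timeout`.
/// Returns Some(status) if it exited in time. Otherwise the child is
/// killed and reaped, and None is returned.
fn wait_with_timeout(child: &mut Child, timeout: Duration) -> io::Result<Option<ExitStatus>> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            child.kill()?; // sends SIGKILL on Unix
            child.wait()?; // reap the process so no zombie is left behind
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(50));
    }
}

fn main() -> io::Result<()> {
    // Stand-in for a task that hangs: sleep for longer than the timeout.
    let mut child = Command::new("sleep").arg("5").spawn()?;
    match wait_with_timeout(&mut child, Duration::from_millis(500))? {
        Some(status) => println!("task finished: {}", status),
        None => println!("task timed out and was killed"),
    }
    Ok(())
}
```

A supervisor along these lines would stop stale handler processes from piling up on the node and holding the binary.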
Please feel free to close this and open up an issue for cdc freezing issues.
from core-dump-handler.
Closing this and linking it to the systemd requirement as that would provide supervised process support if required.
from core-dump-handler.