Hello, Context We're deploying core-dump-handler

I've used <a href="https://cloud.google.com/container-optimized-os/docs/how-to/toolbox

Error: Text file busy (os error 26) - (GKE COS w/ v1.20.11-gke.1300) about core-dump-handler HOT 10 CLOSED

ibm commented on May 31, 2024

Error: Text file busy (os error 26) - (GKE COS w/ v1.20.11-gke.1300)

from core-dump-handler.

Comments (10)

No9 commented on May 31, 2024

Hi - Thanks for using the project and the detailed bug report.

Given the Error: Text file busy (os error 26) which is usually caused when the OS is trying to run a binary when it's being copied and the fact it's only happening on 2-3 nodes it looks like it's some sort of race condition on the deploy.

This is unusual as the core agent deploys the composer before changing the sysctls

A couple of questions:

It seems very unlikely but is it possible that there are core dumps happening in the production environment as the chart is being deployed?

Did this happen on the very first install or part of an upgrade?

If it is an upgrade can you confirm that you are removing the chart and installing it when upgrading as outlined here

I'll try and replicate this week but it's likely to be Friday before I get to do an install.

from core-dump-handler.

dsaintilma-flinks commented on May 31, 2024

It seems very unlikely but is it possible that there are core dumps happening in the production environment as the chart is being deployed?

Other than this project, we have nothing else that is in place that interacts with the kernel to generate coredumps on our GKE workers. I'm not sure if they are even enabled by default.

Did this happen on the very first install or part of an upgrade?

It happened on the second install (I initially tried to upgrade v8.0.3 to v8.3.1 due to the agents to picking up on the segfaulter test container crashing.)

If it is an upgrade can you confirm that you are removing the chart and installing it when upgrading as outlined here

I didn't upgraded this way. I've used the helm upgrade command.
I've just tried it using the instructions without success (tried with v8.0.3 and v8.3.1)

from core-dump-handler.

No9 commented on May 31, 2024

OK the fact it happened on the second try and upgrade was used is starting to give a little insight.
When testing helm upgrade previously it didn't work hence the update note in the README.

Specifically the patch strategy of helm/k8s seemed to create order issues between the termination of the old container cleanup code and the new container deployment code.

Current recommendation is to remove the chart and then install as per docs.

The long term fix would be to put the deploy of the host config and the agent into two separate containers and manage them through an operator but that is not in the current road map.

As an aside: For the question on core dumps happening I was looking to find out if you have other containers crashing apart from not have you installed

from core-dump-handler.

dsaintilma-flinks commented on May 31, 2024

For the question on core dumps happening I was looking to find out if you have other containers crashing apart from not have you installed

Yes, I confirm that some containers were crashing before and during the installation of coredump helm chart.

from core-dump-handler.

No9 commented on May 31, 2024

OK other containers crashing during the install will cause the text file busy error.
The code tries to mitigate against it but it doesn't guarantee that it won't happen.
If the uninstall and reinstall doesn't work during a period where there are no crashing containers then please update this issue

from core-dump-handler.

dsaintilma-flinks commented on May 31, 2024

Just tried again (following the install/update notes) in the prod cluster when there's no container crashing and the result is the same. It's actually the same three nodes that has this issue.

Is there some element holding a "state" in the faulty nodes that could be cleared/reset?

from core-dump-handler.

No9 commented on May 31, 2024

It's likely the kernel thinks that something is using the cdc exe.
From where the deploy process has got to the error is raised when copying the cdc file to the host machine
https://github.com/IBM/core-dump-handler/blob/main/core-dump-agent/src/main.rs#L435
If you can get onto the host machine and run lsof | grep cdc you might be able to see if a process is holding on to the cdc file.

from core-dump-handler.

dsaintilma-flinks commented on May 31, 2024

I've used toolbox for CoreOs in order to troubleshoot the issue after connecting inside one of the GKE nodes. The command ps was used to find processes using /home/kubernetes/bin/cdc

The issue I'm seeing is that some segfaulter process that I ran weeks ago gets stuck when I executed by /home/kubernetes/bin/cdc.

Examples:

# ps aux | grep cdc
root     2070765  0.0  0.0   3080   896 ?        S+   15:54   0:00 grep cdc
root     3423665  0.0  0.0  13584  2480 ?        S    Apr26   0:00 /home/kubernetes/bin/cdc -c=18446744073709551615 -e=segfaulter -p=1 -s=4 -t=1650983080 -d=/home/kubernetes/cores -h=segfaulter -E=!usr!local!bin!segfaulter
root     3425383  0.0  0.0  13584  2492 ?        S    Apr26   0:00 /home/kubernetes/bin/cdc -c=18446744073709551615 -e=segfaulter -p=1 -s=4 -t=1650983162 -d=/home/kubernetes/cores -h=segfaulter -E=!usr!local!bin!segfaulter
root     3427276  0.0  0.0  13584  2360 ?        S    Apr26   0:00 /home/kubernetes/bin/cdc -c=18446744073709551615 -e=segfaulter -p=1 -s=4 -t=1650983257 -d=/home/kubernetes/cores -h=segfaulter -E=!usr!local!bin!segfaulter

Some production workloads that crashed were also stuck.

Killing the processes with kill <pid> then restarting the core-dump pod for that worker node fixes the issue related to the pod not starting.

Since this issue is resolved, should we close this thread and open a new one for the issue regarding core-dumps freezing when being executed by the cdc process?

from core-dump-handler.

No9 commented on May 31, 2024

Glad you have it resolved.
It would be great to know what is causing cdc to hang. Does this issue reoccur now you have upgraded?
I think the mitigation for this would be to have cdc spawn a supervised task that has a timeout.
Please feel free to close this and open up an issue for cdc freezing issues.

from core-dump-handler.

No9 commented on May 31, 2024

Closing this and linking it to the systemd requirement as that would provide supervised process support if required.

from core-dump-handler.

Error: Text file busy (os error 26) - (GKE COS w/ v1.20.11-gke.1300) about core-dump-handler HOT 10 CLOSED

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent