Comments (10)

No9 commented on May 31, 2024

Hi - Thanks for using the project and the detailed bug report.

The error is Error: Text file busy (os error 26), which is usually raised when the OS tries to run a binary while it is still being copied. Given that, and the fact that it only happens on 2-3 nodes, it looks like some sort of race condition in the deploy.

This is unusual, as the core agent deploys the composer before changing the sysctls.

A couple of questions:

It seems very unlikely but is it possible that there are core dumps happening in the production environment as the chart is being deployed?

Did this happen on the very first install or part of an upgrade?

If it is an upgrade can you confirm that you are removing the chart and installing it when upgrading as outlined here

I'll try and replicate this week but it's likely to be Friday before I get to do an install.
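For readers unfamiliar with the error, here is a minimal Rust sketch of how os error 26 (ETXTBSY on Linux) can be recognised, along with one common way to avoid it when replacing a binary: copy to a temporary sibling file and rename it over the destination, so the destination is never opened for writing. The helper names `is_text_file_busy` and `replace_binary` are hypothetical and are not taken from core-dump-agent; this is an illustration under those assumptions, not a description of how the agent deploys cdc.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Linux errno for ETXTBSY ("Text file busy"), shown by Rust as "os error 26".
const ETXTBSY: i32 = 26;

/// Returns true when an I/O error is the "Text file busy" error.
fn is_text_file_busy(err: &io::Error) -> bool {
    err.raw_os_error() == Some(ETXTBSY)
}

/// Hypothetical helper: installs `src` at `dest` without opening `dest` for writing.
/// The data is copied to a temporary sibling file, which is then renamed over `dest`.
fn replace_binary(src: &Path, dest: &Path) -> io::Result<()> {
    // Same directory as `dest`, so the rename stays on one filesystem.
    let tmp = dest.with_extension("tmp");
    fs::copy(src, &tmp)?;
    fs::rename(&tmp, dest).map_err(|e| {
        // Best-effort cleanup of the temporary file if the rename fails.
        let _ = fs::remove_file(&tmp);
        e
    })
}
```

Processes that are already executing the old file keep running it after the rename, so this avoids the copy error but does nothing about processes that are stuck.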

from core-dump-handler.

dsaintilma-flinks commented on May 31, 2024

It seems very unlikely but is it possible that there are core dumps happening in the production environment as the chart is being deployed?

Other than this project, we have nothing else that is in place that interacts with the kernel to generate coredumps on our GKE workers. I'm not sure if they are even enabled by default.

Did this happen on the very first install or part of an upgrade?

It happened on the second install. (I initially tried to upgrade from v8.0.3 to v8.3.1 because the agents were not picking up on the segfaulter test container crashing.)

If it is an upgrade can you confirm that you are removing the chart and installing it when upgrading as outlined here

I didn't upgrade this way. I used the helm upgrade command.
I've just tried it following the instructions, without success (tried with v8.0.3 and v8.3.1).

No9 commented on May 31, 2024

OK, the fact that it happened on the second try and that helm upgrade was used is starting to give a little insight.
When helm upgrade was tested previously it didn't work, hence the update note in the README.

Specifically, the patch strategy of helm/k8s seemed to create ordering issues between the old container's cleanup code running at termination and the new container's deployment code.

Current recommendation is to remove the chart and then install as per docs.

The long term fix would be to put the deploy of the host config and the agent into two separate containers and manage them through an operator but that is not in the current road map.

As an aside: with the question on core dumps happening, I was looking to find out whether you have other containers crashing, rather than what else you have installed.

dsaintilma-flinks commented on May 31, 2024

With the question on core dumps happening, I was looking to find out whether you have other containers crashing, rather than what else you have installed.

Yes, I confirm that some containers were crashing before and during the installation of the coredump helm chart.

No9 commented on May 31, 2024

OK, other containers crashing during the install will cause the text file busy error.
The code tries to mitigate this, but it can't guarantee that it won't happen.
If uninstalling and reinstalling doesn't work during a period when no containers are crashing, please update this issue.

dsaintilma-flinks commented on May 31, 2024

Just tried again (following the install/update notes) in the prod cluster while no containers were crashing, and the result is the same. It's actually the same three nodes that have this issue.

Is there some element holding a "state" in the faulty nodes that could be cleared/reset?

No9 commented on May 31, 2024

It's likely the kernel thinks that something is using the cdc exe.
Judging by where the deploy process has got to, the error is raised when copying the cdc file to the host machine:
https://github.com/IBM/core-dump-handler/blob/main/core-dump-agent/src/main.rs#L435
If you can get onto the host machine and run lsof | grep cdc you might be able to see if a process is holding on to the cdc file.
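If lsof is not available on the host, a similar check can be approximated by walking /proc and comparing each process's exe symlink with the path of the binary. The sketch below is a minimal illustration; the function name `pids_executing` is hypothetical, and the `proc_root` parameter (normally /proc) exists only so the function can be exercised against a fake directory.

```rust
use std::fs;
use std::path::Path;

/// Returns the PIDs under `proc_root` whose `exe` symlink points at `target`.
/// `proc_root` is normally "/proc".
fn pids_executing(proc_root: &Path, target: &Path) -> Vec<u32> {
    let mut pids = Vec::new();
    let entries = match fs::read_dir(proc_root) {
        Ok(entries) => entries,
        Err(_) => return pids,
    };
    for entry in entries.flatten() {
        // Only directories with numeric names are processes.
        let pid = match entry.file_name().to_str().and_then(|n| n.parse::<u32>().ok()) {
            Some(pid) => pid,
            None => continue,
        };
        // Reading another user's exe link needs privileges; unreadable links are skipped.
        if let Ok(exe) = fs::read_link(entry.path().join("exe")) {
            if exe.as_path() == target {
                pids.push(pid);
            }
        }
    }
    pids.sort_unstable();
    pids
}
```

On a node this would be called as `pids_executing(Path::new("/proc"), Path::new("/home/kubernetes/bin/cdc"))`, run as root. If the binary has been replaced or deleted since the process started, the link target usually carries a " (deleted)" suffix and will not match.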

dsaintilma-flinks commented on May 31, 2024

I connected to one of the GKE nodes and used toolbox for CoreOS to troubleshoot the issue. I used the ps command to find processes using /home/kubernetes/bin/cdc.

The issue I'm seeing is that cdc invocations for some segfaulter processes that I ran weeks ago got stuck and are still running /home/kubernetes/bin/cdc.

Examples:

# ps aux | grep cdc
root     2070765  0.0  0.0   3080   896 ?        S+   15:54   0:00 grep cdc
root     3423665  0.0  0.0  13584  2480 ?        S    Apr26   0:00 /home/kubernetes/bin/cdc -c=18446744073709551615 -e=segfaulter -p=1 -s=4 -t=1650983080 -d=/home/kubernetes/cores -h=segfaulter -E=!usr!local!bin!segfaulter
root     3425383  0.0  0.0  13584  2492 ?        S    Apr26   0:00 /home/kubernetes/bin/cdc -c=18446744073709551615 -e=segfaulter -p=1 -s=4 -t=1650983162 -d=/home/kubernetes/cores -h=segfaulter -E=!usr!local!bin!segfaulter
root     3427276  0.0  0.0  13584  2360 ?        S    Apr26   0:00 /home/kubernetes/bin/cdc -c=18446744073709551615 -e=segfaulter -p=1 -s=4 -t=1650983257 -d=/home/kubernetes/cores -h=segfaulter -E=!usr!local!bin!segfaulter

Some production workloads that crashed were also stuck.

Killing the processes with kill <pid> then restarting the core-dump pod for that worker node fixes the issue related to the pod not starting.

Since this issue is resolved, should we close this thread and open a new one for the issue regarding core-dumps freezing when being executed by the cdc process?

No9 commented on May 31, 2024

Glad you have it resolved.
It would be great to know what is causing cdc to hang. Does this issue reoccur now that you have upgraded?
I think the mitigation for this would be to have cdc spawn a supervised task that has a timeout.
Please feel free to close this and open up an issue for cdc freezing issues.
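As a rough illustration of the supervised-task-with-timeout idea, the sketch below waits for a child process up to a deadline and kills it if the deadline passes. It uses only the Rust standard library. The function name `wait_with_timeout` and the idea of a supervisor spawning a separate worker process are assumptions made for the example; this is not how cdc is currently structured.

```rust
use std::io;
use std::process::{Child, ExitStatus};
use std::thread;
use std::time::{Duration, Instant};

/// Waits for `child` for at most `timeout`.
/// Returns Ok(Some(status)) if the child exited in time, or Ok(None) if the
/// deadline passed and the child was killed and reaped.
fn wait_with_timeout(child: &mut Child, timeout: Duration) -> io::Result<Option<ExitStatus>> {
    let deadline = Instant::now() + timeout;
    loop {
        // try_wait returns immediately: Some(status) once the child has exited.
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            child.kill()?;
            child.wait()?; // reap the child so no zombie is left behind
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(20));
    }
}
```

A supervisor would spawn the worker with `std::process::Command`, pass the resulting `Child` to this function, and log or clean up when `Ok(None)` comes back. Polling keeps the example dependency-free; a real implementation might prefer a signal-based or async wait.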

No9 commented on May 31, 2024

Closing this and linking it to the systemd requirement as that would provide supervised process support if required.
