
core-dump-handler's Introduction

Core Dump Handler

This Helm chart deploys functionality that automatically saves core dumps from most public cloud Kubernetes service providers, as well as private Kubernetes instances, to an S3-compatible storage service.


Contributions

Please read CONTRIBUTING.md; it has some important notes. Pay specific attention to the coding style guidelines and the Developer Certificate of Origin.

Code Of Conduct

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.

The full code of conduct is available here

Install

Please refer to the chart README.md for full details.

Kubernetes Service Compatibility

This is a matrix of confirmed test targets. Please open a PR to add environments that are also known to work.

Provider        Product              Version            Validated?   Working?
AWS             EKS                  1.21               Yes          Yes
AWS             ROSA                 4.8                Yes          Yes
Custom Build    K8S                  N/A                Yes          Yes
Digital Ocean   K8S                  1.21.5-do.0        Yes          Yes
Google          GKE-cos_containerd   1.20.10-gke.1600   Yes          Yes
Google          GKE-Ubuntu           1.20.10-gke.1600   Yes          Yes
IBM             IKS                  1.19-1.21          Yes          Yes
IBM             ROKS                 4.6-4.8            Yes          Yes
Microsoft       AKS                  1.19               Yes          Yes
Microsoft       ARO                  4.8                Yes          Yes
RedHat          On-Premises          4.8                Yes          Yes

Background

Core Dumps are a critical part of observability.

As systems become more distributed, core dumps offer teams a non-invasive approach to understanding why programs are malfunctioning in any environment they are deployed to.

Core dumps are useful in a wide range of scenarios, but they are particularly relevant in the following cases:

  • The process exits without a useful stack trace

  • The process runs out of memory

  • An application doesn’t behave as expected

The traditional problems with core dumps are:

  • Overhead of managing the dumps

  • Dump analysis requires specific tooling that isn't readily available on the developer's machine.

  • Managing access to the dumps, as they can contain sensitive information.

This chart aims to tackle the problems surrounding core dumps by leveraging common platforms (K8s, ROKS and Object Storage) in a cloud environment to do the heavy lifting.

Chart Details

The chart deploys two processes:

  1. The agent manages the updating of the /proc/sys/kernel/* configuration, deploys the composer service and uploads the core dump zip files created by the composer to an object storage instance.

  2. The composer handles the processing of a core dump, creating runtime, container, core dump and image JSON documents from crictl and inserting them into a single zip file. The zip file is stored on the local file system of the node for the agent to upload.
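
For reference, the agent configures the kernel to pipe dumps straight to the composer. On a node using the default host directory, the resulting setting (taken from the agent log shown in the Troubleshooting section below) can be inspected with sysctl:

    # Inspect the core pattern applied by the agent (default host directory assumed)
    sysctl kernel.core_pattern
    # kernel.core_pattern = |/var/mnt/core-dump-handler/cdc -c=%c -e=%e -p=%p -s=%s -t=%t -d=/var/mnt/core-dump-handler/core -h=%h -E=%E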

When you install the IBM Cloud Core Dump Handler Helm chart, the following Kubernetes resources are deployed into your Kubernetes cluster:

  • Namespace: A specific namespace is created to install the components into - defaults to ibm-observe

  • Handler Daemonset: The daemonset deploys a pod on every worker node in your cluster. The daemonset contains the configuration that enables the elevated process to define the core pattern, place the core dump into object storage and gather pod information if available.

  • Privileged Policy: The daemonset configures the host node, so elevated privileges are required.

  • Service Account: Standard Service account to run the daemonset

  • Volume Claims: For copying the composer to the host and enabling access to the generated core dumps

  • Cluster Role: Created with an event resource and create verb and associated with the service account.
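
A quick way to sanity-check the role binding after install is to ask the API server whether the chart's service account can create events. The service account name below is an assumption; substitute the one created by your release:

    # Hypothetical service account name - list the real ones with: kubectl get sa -n ibm-observe
    kubectl auth can-i create events -n ibm-observe \
      --as=system:serviceaccount:ibm-observe:core-dump-handler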

Component Diagram

Service Component Layout (diagram omitted)

Permissions

To install the Helm chart in your cluster, you must have the Administrator platform role.

Security implications

This chart deploys a privileged Kubernetes daemonset with the following implications:

  1. A privileged container is automatically created on each Kubernetes node that is capable of reading core files and querying crictl for pod info.

  2. The daemonset uses the hostPath feature to interact with the underlying Linux OS.

  3. The composer binary is deployed and run on the host server.

  4. Core dumps can contain sensitive runtime data and the storage bucket access must be managed accordingly.

  5. Object storage keys are stored as secrets and used as environment variables in the daemonset.

Resources Required

The IBM Cloud Core Dump Handler requires the following resources on each worker node to run successfully:

  • CPU: 0.2 vCPU
  • Memory: 128MB

Updating the Chart

  1. Delete the chart. Don't worry, this won't impact the data stored in object storage.
$ helm delete core-dump-handler --namespace observe
  2. Ensure the persistent volumes for host-name are deleted before continuing.
$ kubectl get pvc -n observe
  3. Install the chart using the same bucket name as per the first install but tell the chart not to create it.
$ helm install core-dump-handler . --namespace observe
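
If the claims are still present after the delete, they can be removed explicitly before reinstalling. The claim names below are assumptions based on the defaults shown elsewhere in this document; adjust them to match the output of the get command above:

    # Assumed default claim names - verify with: kubectl get pvc -n observe
    kubectl delete pvc host-storage-pvc core-storage-pvc -n observe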

Removing the Chart

helm delete core-dump-handler -n observe

Build and Deploy a Custom Version

  1. Build the image docker build -t YOUR_TAG_NAME .

  2. Push the image to your container registry

  3. Update the image section in the values.yaml file to use it.

image:
  registry: YOUR_REGISTRY
  repository: YOUR_REPOSITORY
  tag: YOUR_TAG

Or run the helm install command with the following settings:

--set image.registry=YOUR_REGISTRY \
--set image.repository=YOUR_REPOSITORY \
--set image.tag=YOUR_TAG
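
Combined into a single command, a custom-image install might look like this (registry, repository and tag are placeholders):

    helm install core-dump-handler . --namespace observe \
      --set image.registry=YOUR_REGISTRY \
      --set image.repository=YOUR_REPOSITORY \
      --set image.tag=YOUR_TAG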

Testing

Unit Testing

  1. The services are written in Rust using rustup.

  2. Local unit tests can be run using cargo test in the base folder.

Integration Testing

  1. Currently only IBM Cloud ROKS and IKS are supported. We are happy to carry integration tests for other services, but we can't run them before release.

  2. To run the integration tests, build the image by following the instructions for a custom build.

  3. In the root of the project folder, create a file called .env with the following configuration:

    S3_ACCESS_KEY=XXXX
    S3_SECRET=XXXX
    S3_BUCKET_NAME=XXXX
    S3_REGION=XXXX
    
  4. Change directory to the integration folder and run the test

    cd integration
    ./run-ibm.sh
    

Releases

Releases are built on a pre-release branch, e.g. pre-8.5.0. Integration tests are run manually and a release is generated when the branch is merged to main.

It currently isn't possible to automate this as the Kubernetes integration in GitHub Actions is not reliable enough.

If you wish to test a pre-release with your own integration testing then please raise an issue and we can collaborate on your test run.

Troubleshooting

The first place to look for issues is the agent console. A successful install should look like this:


[2021-09-08T22:28:43Z INFO core_dump_agent] Setting host location to: /var/mnt/core-dump-handler
[2021-09-08T22:28:43Z INFO core_dump_agent] Current Directory for setup is /app
[2021-09-08T22:28:43Z INFO core_dump_agent] Copying the composer from ./vendor/default/cdc to /var/mnt/core-dump-handler/cdc
[2021-09-08T22:28:43Z INFO core_dump_agent] Starting sysctl for kernel.core_pattern /var/mnt/core-dump-handler/core_pattern.bak
[2021-09-08T22:28:43Z INFO core_dump_agent] Created Backup of /var/mnt/core-dump-handler/core_pattern.bak
[2021-09-08T22:28:43Z INFO core_dump_agent] Starting sysctl for kernel.core_pipe_limit /var/mnt/core-dump-handler/core_pipe_limit.bak
[2021-09-08T22:28:43Z INFO core_dump_agent] Created Backup of /var/mnt/core-dump-handler/core_pipe_limit.bak
[2021-09-08T22:28:43Z INFO core_dump_agent] Starting sysctl for fs.suid_dumpable /var/mnt/core-dump-handler/suid_dumpable.bak
[2021-09-08T22:28:43Z INFO core_dump_agent] Created Backup of /var/mnt/core-dump-handler/suid_dumpable.bak
[2021-09-08T22:28:43Z INFO core_dump_agent] Created sysctl of kernel.core_pattern=|/var/mnt/core-dump-handler/cdc -c=%c -e=%e -p=%p -s=%s -t=%t -d=/var/mnt/core-dump-handler/core -h=%h -E=%E
kernel.core_pattern = |/var/mnt/core-dump-handler/cdc -c=%c -e=%e -p=%p -s=%s -t=%t -d=/var/mnt/core-dump-handler/core -h=%h -E=%E
kernel.core_pipe_limit = 128
[2021-09-08T22:28:43Z INFO core_dump_agent] Created sysctl of kernel.core_pipe_limit=128
fs.suid_dumpable = 2
[2021-09-08T22:28:43Z INFO core_dump_agent] Created sysctl of fs.suid_dumpable=2
[2021-09-08T22:28:43Z INFO core_dump_agent] Creating /var/mnt/core-dump-handler/.env file with LOG_LEVEL=info
[2021-09-08T22:28:43Z INFO core_dump_agent] Executing Agent with location : /var/mnt/core-dump-handler/core
[2021-09-08T22:28:43Z INFO core_dump_agent] Dir Content []

If the agent is running successfully then there may be a problem with the composer configuration. To check the logs for the composer, open a shell into the agent and cat composer.log to see if there are any error messages.

cat /var/mnt/core-dump-handler/composer.log

If there are no errors then you should change the default log level from error to debug in the values.yaml and redeploy the chart. Create a core dump again and /var/mnt/core-dump-handler/composer.log should contain specific detail on each upload.
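
The log level can also be switched with a helm upgrade rather than editing values.yaml. The composer.logLevel key matches the values shown in the issues later in this document, but verify it against the chart's values.yaml for your version:

    # Assumes the chart was installed into the observe namespace from the chart directory
    helm upgrade core-dump-handler . --namespace observe \
      --set composer.logLevel=Debug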

core-dump-handler's People

Contributors: 0xdiba, antonwiklund99, gonzalesraul, imgbotapp, mikeage, nktomhaygarth, no9, pacevedom, pkit, razielgn, splitice, stonemaster, tanmaykm, timbuchwaldt, tjungblu


core-dump-handler's Issues

Custom / Templated Filenames

(follow-up to the conversations that triggered #44)

We talked with the customer today and it turned out that they have a multi-tenant system where each tenant is separated by namespaces. They would like to have a way to distinguish the core-dumps by namespaces, so they can later on send them to the right tenant.

I can see the namespace is in the pod-info.json already, what do you think about making the filename configurable with an "sprintf"-style pattern? It's already pretty close if I look into it: https://github.com/IBM/core-dump-handler/blob/main/core-dump-composer/src/main.rs#L190-L193

Of course just adding the namespace to the existing format is easiest.

We are continuing talks with their customers as well (it's a vendor/partner scenario), so take this with a grain of salt - will update accordingly in a comment.
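
As an illustration, later chart versions expose a filenameTemplate value (see the "Get podname and namespace" issue further down); a namespace-aware install could set it via a values override like this. The exact location of the key in values.yaml is an assumption - check the chart README for your version:

    # Hypothetical values override; the template variables are those quoted in the issue below
    cat > my-values.yaml <<'EOF'
    filenameTemplate: "{uuid}-dump-{timestamp}-{hostname}-{exe_name}-{pid}-{signal}-{podname}-{namespace}"
    EOF
    helm install core-dump-handler . --namespace observe -f my-values.yaml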

Enable the cdcli to accept custom debug images

While prebuilt images are meant to cover most use cases there may be a need to bring some custom tooling or support a language that isn't covered yet.
The feature should provide an additional flag for the cli that takes an image location as its argument.

Watch filesystem rather than polling

The agent takes a polling approach to check whether core dumps exist. In practice dumps are rarely generated, so in most cases polling is a wasted operation.

Therefore, I think it would be better to optimize this by using a filesystem watch instead of polling. Most OSs support this feature (e.g. Linux's inotify, BSD's kqueue) and there are open source Rust crates for it.

There are two things to consider when using a watch:

  1. Linux's inotify and BSD's kqueue lack a recursive watch feature, so if we want to watch directories recursively an application-level recursive watch is needed, and we should release that watch before deleting the directory.
    • But currently we don't need a recursive watch, so this isn't a problem.
  2. There might be some missing events (due to an agent crash or similar), so the agent should scan the target directory and upload any existing dumps once on init.
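
As a rough illustration of the watch approach, the same behaviour can be mimicked from a shell with inotifywait (from inotify-tools); the directory below assumes the default core location used elsewhere in this document:

    # Sketch only: react to completed zip files instead of polling the directory
    inotifywait -m -e close_write --format '%w%f' /var/mnt/core-dump-handler/cores |
    while read -r zipfile; do
        echo "new core dump archive: $zipfile"   # an agent would upload the file here
    done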

Add support to take a core dump of a running process

In some scenarios, such as memory leaks or hung processes, it would be useful to request a core dump from a running exe rather than logging in and sending a signal to the process.
One possible route would be to add code and tools to the agent pod that would allow gcore or comparable tools to be executed from within that pod against a specific pid.
This will need some investigation before a specific route is decided.
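
As a sketch of what this could look like, gdb's gcore can already be run manually from a suitably privileged pod against a target pid (the output path and pid below are placeholders):

    # Hypothetical manual example: dump a running process without killing it
    # Requires gdb on the pod and ptrace access to the target pid
    gcore -o /var/mnt/core-dump-handler/cores/manual-dump 12345
    # Writes /var/mnt/core-dump-handler/cores/manual-dump.12345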

Add support for GKE cos_containerd node type

Installation on GKE currently fails because multi-node clusters require local rather than HOST storage.

The persistent volume needs a conditional based on a new daemonset.storage flag that creates the required object.

The main application shouldn't need any further modification.

OpenShift compatibility

Hey,

I'm currently looking into this project for OpenShift, you mention for ARO/ROSA:

building compatible binaries seems to be the next step

what is missing to get this to work with RHCOS?

Cheers,
Thomas

Use systemd management facilities

As part of the discussion in #44 (comment) it was suggested that on systems using systemd the agent could open a socket directly to systemd-coredump via its socket (/run/systemd/coredump) and process the results. That would require less kernel configuration and enable collection from a container.

While this isn't an immediate requirement for any of our target platforms, this could be an option for someone who is hesitant about deploying a binary onto a host or doesn't wish to modify the kernel.core_pattern tunable.

There is no plan to develop this as yet but PRs with this functionality will be accepted.
Just let us know you're starting so that others are aware the work is happening.

Add values files for different xKS environments

As part of the RHCOS support it was identified that having an additional values file for different flavours of k8s could simplify the install process. This issue is to track the creation and documentation of this functionality.

configure coredump.conf

Hey @No9,

sorry to annoy you again - another day, another requirement :)
Apparently there are coredumps that can be above 2G and there is a limit in systemd coredump.conf that needs to be configurable.

Specifically they need to set the following in /etc/systemd/coredump.conf:

ProcessSizeMax=2G
ExternalSizeMax=2G

Here's a MAN page for all the respective configuration values:
https://www.systutorials.com/docs/linux/man/5-coredump/

What do you think about adding a small config map that contains such values and, on pod start, writes them into the conf file?
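
A minimal sketch of what that startup step might do, assuming the host's /etc/systemd directory were exposed to the pod via a hostPath mount (the mount point below is hypothetical):

    # Hypothetical mount point; the chart would need a hostPath volume for /etc/systemd
    mkdir -p /host-etc-systemd/coredump.conf.d
    cat > /host-etc-systemd/coredump.conf.d/50-core-dump-handler.conf <<'EOF'
    [Coredump]
    ProcessSizeMax=2G
    ExternalSizeMax=2G
    EOF
    # Per the KCS article mentioned below, RHEL 7/8 may additionally require a daemon restart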


Looking a bit into this KCS article, for RHEL 7/8 I actually see we would need to restart a daemon:
https://access.redhat.com/solutions/649193

@travier apologies to pull you in here again, is there anything specific we need to pay attention to on CoreOS for those settings? I'm also happy to update the KCS article with information for it.

Cheers,
Thomas

8.0.0 fails to take crashdump

Did an upgrade to 8.0.0

Started getting the following in dmesg (and no dump handled)

[1073776.504905] Core dump to |/var/mnt/core-dump-handler/cdc pipe failed

Any ideas?

Optionally separate S3 object directory by metadata

As far as I know, the dump file name is {CORE_UUID}-dump-{CORE_TIMESTAMP}-{CORE_HOSTNAME}-{CORE_EXE_NAME}-{CORE_PID}-{CORE_SIGNAL} and this is equal to the S3 object name.

When someone uses a single S3 bucket and core-dump-handler is used across multiple clouds/clusters/regions, it's hard to figure out where a dump file comes from. Since this metadata is statically settled when deploying the daemonset, it could be injected via environment variables and used to separate S3 upload directories.

But maybe separating S3 buckets is enough, so it would be great if we could choose between:

  1. Multiple S3 buckets + flat S3 directory (by vendor/cluster-name)
  2. Single S3 bucket + nested S3 directories

Additionally, it would be nice to be able to distinguish the namespace of crashed containers. From core-dump-handler's point of view, the crashed container's namespace is runtime metadata, so it will be harder to figure out. I guess modifying core_pattern to |{HOST_LOCATION}/{CORE_DUMP_COMPOSER_NAME} -c=%c -e=%e -p=%p -s=%s -t=%t -d={HOST_DIR}/core -h=%h -E=%E -NS=$NAMESPACE might work, since %h is the crashed container's hostname, not the core-dump-handler agent's hostname.

Alternative upload sink

(yet another follow-up to the conversations that triggered #44 + #47)

Our customer is in the Telco space and already has a so-called "Element Management System" that allows them to observe crashes and manage crash dumps. They would like to upload the dump there instead of to an ObjectStore.

I don't really have a good understanding of what they can upload and how that works in detail yet, I just wanted to figure out if we can make the upload system more composable.

One solution that immediately came to me would be to have a configurable sidecar container in the DS (basically just config in helm and values.yaml) that can run inotify or poll on the coreDirectory and upload to where it should go.

arm64 builds are missing (and impossible to build)

I tried to build the image on arm64, but ubi7 does not provide such arch (and won't).
Also the crictl binary would need to be pulled for arm64.
From what I understood, the two builds are due to libc issues. Would it be a viable solution to build binaries with musl, in order to have just one static binary for both?
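
For reference, a static musl build of a Rust binary usually just needs the target added and the build pointed at it; whether all of the project's dependencies build against musl (and whether a cross-linker is available for the non-native architecture) is the open question here:

    # Sketch only: static musl builds for both architectures (cross toolchain assumed installed)
    rustup target add x86_64-unknown-linux-musl aarch64-unknown-linux-musl
    cargo build --release --target x86_64-unknown-linux-musl
    cargo build --release --target aarch64-unknown-linux-musl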

Port install to an operator

Currently this project uses helm as its deployment approach and has a bash daemon to ensure the file management configuration stays in place. This would likely be better implemented as an operator.
This issue is to track the investigation and work.

Crash Loop due to out of memory

Been seeing OOMs with the default helm settings (Exit Code: 137).

$ kubectl logs core-dump-handler-p42wl -n coredump -f
[2022-01-27T12:01:25Z INFO  core_dump_agent] no .env file found
     That's ok if running in kubernetes
[2022-01-27T12:01:25Z INFO  core_dump_agent] Setting host location to: /var/mnt/core-dump-handler
[2022-01-27T12:01:25Z INFO  core_dump_agent] Current Directory for setup is /app
[2022-01-27T12:01:25Z INFO  core_dump_agent] Copying the crictl from ./crictl to /var/mnt/core-dump-handler/crictl
[2022-01-27T12:01:25Z INFO  core_dump_agent] Copying the composer from ./vendor/default/cdc to /var/mnt/core-dump-handler/cdc

[2022-01-27T12:01:30Z INFO  core_dump_agent] Starting sysctl for kernel.core_pattern /var/mnt/core-dump-handler/core_pattern.bak
[2022-01-27T12:01:30Z INFO  core_dump_agent] Getting sysctl for kernel.core_pattern
[2022-01-27T12:01:30Z INFO  core_dump_agent] kernel.core_pattern with value |/var/mnt/core-dump-handler/cdc -c=%c -e=%e -p=%p -s=%s -t=%t -d=/var/mnt/core-dump-handler/cores -h=%h -E=%E is already applied
[2022-01-27T12:01:30Z INFO  core_dump_agent] Starting sysctl for kernel.core_pipe_limit /var/mnt/core-dump-handler/core_pipe_limit.bak
[2022-01-27T12:01:30Z INFO  core_dump_agent] Getting sysctl for kernel.core_pipe_limit
[2022-01-27T12:01:30Z INFO  core_dump_agent] kernel.core_pipe_limit with value 128 is already applied
[2022-01-27T12:01:30Z INFO  core_dump_agent] Starting sysctl for fs.suid_dumpable /var/mnt/core-dump-handler/suid_dumpable.bak
[2022-01-27T12:01:30Z INFO  core_dump_agent] Getting sysctl for fs.suid_dumpable
[2022-01-27T12:01:30Z INFO  core_dump_agent] fs.suid_dumpable with value 2 is already applied
[2022-01-27T12:01:30Z INFO  core_dump_agent] Creating /var/mnt/core-dump-handler/.env file with LOG_LEVEL=Warn
[2022-01-27T12:01:30Z INFO  core_dump_agent] Writing composer .env
    LOG_LEVEL=Warn
    IGNORE_CRIO=false
    CRIO_IMAGE_CMD=img
    USE_CRIO_CONF=false

[2022-01-27T12:01:30Z INFO  core_dump_agent] Executing Agent with location : /var/mnt/core-dump-handler/cores
[2022-01-27T12:01:30Z INFO  core_dump_agent] Setting s3 endpoint location to: https://**
[2022-01-27T12:01:30Z INFO  core_dump_agent] Dir Content ["/var/mnt/core-dump-handler/cores/241421d3-9096-4cf0-923c-562be16ed2e1-dump-1643279286-qtunnel-0-memcheck-amd64--981-11.zip", "/var/mnt/core-dump-handler/cores/fcda701e-cb59-4282-a145-dcb934f12ae7-dump-1643280363-qtunnel-0-memcheck-amd64--2154-11.zip", "/var/mnt/core-dump-handler/cores/a1a08db0-73e1-4552-808e-ed1925f49b7c-dump-1643279954-qtunnel-0-memcheck-amd64--1702-11.zip", "/var/mnt/core-dump-handler/cores/63c727ee-4ad0-4158-bad0-80abc51c4a5e-dump-1643280260-qtunnel-0-memcheck-amd64--2036-11.zip", "/var/mnt/core-dump-handler/cores/09e97bfc-f949-4daa-b9cd-1a1daec14840-dump-1643279647-qtunnel-0-memcheck-amd64--1371-11.zip", "/var/mnt/core-dump-handler/cores/c58fa3aa-47f9-4544-8b63-f7ace6c120c6-dump-1643279340-qtunnel-0-memcheck-amd64--1040-11.zip", "/var/mnt/core-dump-handler/cores/47a0000b-6245-4e88-b659-3a070c168772-dump-1643280056-qtunnel-0-memcheck-amd64--1797-11.zip", "/var/mnt/core-dump-handler/cores/bd63aed0-86f8-49f5-b4db-ef6e2772eae0-dump-1643280311-qtunnel-0-memcheck-amd64--2095-11.zip", "/var/mnt/core-dump-handler/cores/9f11b2b7-7798-41e6-833f-951722360c85-dump-1643279596-qtunnel-0-memcheck-amd64--1309-11.zip", "/var/mnt/core-dump-handler/cores/eb02c2e7-bebc-44b6-b02b-4cfa4227c48d-dump-1643279698-qtunnel-0-memcheck-amd64--1430-11.zip", "/var/mnt/core-dump-handler/cores/80bb8158-3833-4e92-afa5-5a8bb8798bbb-dump-1643279494-qtunnel-0-memcheck-amd64--1197-11.zip", "/var/mnt/core-dump-handler/cores/8720fa81-1c8e-4223-8be5-a83560122c84-dump-1643280107-qtunnel-0-memcheck-amd64--1856-11.zip", "/var/mnt/core-dump-handler/cores/a827e6f5-62be-404a-86f6-9feb6d1c955f-dump-1643279545-qtunnel-0-memcheck-amd64--1253-11.zip", "/var/mnt/core-dump-handler/cores/1ed2d493-73c5-4183-8666-3d69293cb886-dump-1643279749-qtunnel-0-memcheck-amd64--1466-11.zip", "/var/mnt/core-dump-handler/cores/ac8fef42-87d7-4df9-b6bb-bc4e68c1a7fa-dump-1643279902-qtunnel-0-memcheck-amd64--1643-11.zip", "/var/mnt/core-dump-handler/cores/d65ac2ae-44b3-434c-80b5-740ca5291567-dump-1643280158-qtunnel-0-memcheck-amd64--1918-11.zip", "/var/mnt/core-dump-handler/cores/319b49e7-6f2d-412c-a693-40d0fa2861cb-dump-1643279851-qtunnel-0-memcheck-amd64--1587-11.zip", "/var/mnt/core-dump-handler/cores/5d144582-d9f3-4cbc-a614-9d1113077fd1-dump-1643280209-qtunnel-0-memcheck-amd64--1977-11.zip", "/var/mnt/core-dump-handler/cores/43b72baf-0db0-4837-804d-9dc1a5d257b8-dump-1643279800-qtunnel-0-memcheck-amd64--1525-11.zip", "/var/mnt/core-dump-handler/cores/ebe5aa07-34eb-482d-a601-2510c604ca14-dump-1643279443-qtunnel-0-memcheck-amd64--1135-11.zip"]
[2022-01-27T12:01:30Z INFO  core_dump_agent] Uploading: /var/mnt/core-dump-handler/cores/241421d3-9096-4cf0-923c-562be16ed2e1-dump-1643279286-qtunnel-0-memcheck-amd64--981-11.zip
[2022-01-27T12:01:30Z INFO  core_dump_agent] zip size is 43810758

I attempted to increase the memory limit to 256MB but this too failed. This seems suspicious given that it's only uploading a file; it shouldn't need anywhere near this amount of memory.

It repeats this crash in a loop.

Generate CRICTL YAML

For Digital Ocean the node also needs the following crictl client config:

runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 2
debug: true
pull-image-on-create: false

Place into /etc/crictl.yaml
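
A hedged example of applying this on a node (run as root on the host; the content is exactly the config above):

    # Write the crictl client config to the expected location on the node
    cat > /etc/crictl.yaml <<'EOF'
    runtime-endpoint: unix:///run/containerd/containerd.sock
    image-endpoint: unix:///run/containerd/containerd.sock
    timeout: 2
    debug: true
    pull-image-on-create: false
    EOF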

Core Dumping large memory applications

Hey @No9,

from #69 we had the customer requirements to configure limits - we came to the conclusion that there are indeed no limits on the OS side.

I tried to adapt the segfaulter script today with a bit more memory than necessary (6 gigs worth of 42s https://github.com/tjungblu/segfaulter) and run it as usual with

kubectl run -i -t segfaulter --image=quay.io/tjungblu/segfaulter_6g --restart=Never

The core dump triggered, and albeit only 4MB in size (which is fine, I guess 6 gigs worth of 42s compress really well), I do have issues unzipping it:

Archive:  test.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of test.zip or
        test.zip.zip, and cannot find test.zip.ZIP, period.

Here is the zip in the pod:

[root@core-dump-handler-rq2ng cores]# ls -l
total 4076
-rw-r--r--. 1 root root 4173458 Mar 14 12:31 29a453fb-0c89-4d84-b618-6ad3523e74ee-dump-1647261075-segfaulter-segfaulter-1-4.zip

I assume we're running into issues with the limit of the original zip format, where this big of a dump would require a ZIP64: https://en.wikipedia.org/wiki/ZIP_(file_format)#ZIP64

There were no errors during collection, the upload didn't work as it wasn't configured, I pulled it via rsync:

[2022-03-14T12:30:03Z INFO  core_dump_agent] Executing Agent with location : /mnt/core-dump-handler/cores
[2022-03-14T12:30:03Z INFO  core_dump_agent] Dir Content []
[2022-03-14T12:30:03Z INFO  core_dump_agent] INotify Starting...
[2022-03-14T12:30:03Z INFO  core_dump_agent] INotify Initialised...
[2022-03-14T12:30:03Z INFO  core_dump_agent] INotify watching : /mnt/core-dump-handler/cores
[2022-03-14T12:31:28Z INFO  core_dump_agent] Uploading: /mnt/core-dump-handler/cores/29a453fb-0c89-4d84-b618-6ad3523e74ee-dump-1647261075-segfaulter-segfaulter-1-4.zip
[2022-03-14T12:31:28Z INFO  core_dump_agent] zip size is 4173458
[2022-03-14T12:31:28Z ERROR core_dump_agent] Upload Failed error sending request for url (https://xxx/XXX/29a453fb-0c89-4d84-b618-6ad3523e74ee-dump-1647261075-segfaulter-segfaulter-1-4.zip?uploads): error trying to connect: dns error: failed to lookup address information: Name or service not known

EDIT, actually looking into the composer log, there IS an error:

DEBUG - 2022-03-14T12:31:15.932866037+00:00 - Arguments: Args { inner: ["/mnt/core-dump-handler/cdc", "-c=18446744073709551615", "-e=segfaulter", "-p=1", "-s=4", "-t=1647261075", "-d=/mnt/core-dump-handler/cores", "-h=segfaulter", "-E=!usr!local!bin!segfaulter"] }
INFO - 2022-03-14T12:31:15.932992397+00:00 - Environment config:
 IGNORE_CRIO=false
CRIO_IMAGE_CMD=img
USE_CRIO_CONF=false
INFO - 2022-03-14T12:31:15.932997217+00:00 - Loading .env
INFO - 2022-03-14T12:31:15.933000150+00:00 - Set logfile to: "\"/var/mnt/core-dump-handler/composer.log\""
DEBUG - 2022-03-14T12:31:15.933019681+00:00 - Creating dump for 29a453fb-0c89-4d84-b618-6ad3523e74ee-dump-1647261075-segfaulter-segfaulter-1-4
DEBUG - 2022-03-14T12:31:15.933023182+00:00 - running ["pods", "--name", "segfaulter", "-o", "json"] "/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/home/kubernetes/bin:/var/mnt/core-dump-handler"
DEBUG - 2022-03-14T12:31:16.126926097+00:00 - Create a JSON file to store the dump meta data
29a453fb-0c89-4d84-b618-6ad3523e74ee-dump-1647261075-segfaulter-segfaulter-1-4-dump-info.json
ERROR - 2022-03-14T12:31:28.576238180+00:00 - Error writing core file 
***Large file option has not been set***

Here's the file, if you want to look a bit into it:
test.zip

Any clue?

Create demo video

It would be great to get a demo of installing the tool and running the sample segfault app for people interested in using the project.
Link to the video should be available from the top of the project page.
This can be done on any hosting provider or your own cluster. Some suggested steps are:

  1. Show the configuration of the object store
  2. Run the helm chart with the params
  3. Wait for the agent pod to be running
  4. Run the sample test container
    kubectl run -i -t segfaulter --image=quay.io/icdh/segfaulter --restart=Never
    
  5. Download the generated zip file
  6. Possibly show using gdb to get the stack trace?

Add support for getting metadata from EKS

Currently only the core file and the metadata that is sent to composer process from the OS is captured on EKS.
This is because the OS on the node doesn't provide access to the crictl client.
Investigation is needed to confirm if the crictl client is available on the machine but not in the PATH.
If it's not available the two options are to copy a version of crictl to the machine as part of the deployment or to integrate with the crio service using a library.

Error: Text file busy (os error 26) - (GKE COS w/ v1.20.11-gke.1300)

Hello,

Context

We're deploying core-dump-handler. So far it has worked with our dev cluster which is using the version 1.21.9-gke.1002 of Kubernetes. No luck with our prod cluster. Relevant specs are below:

GKE control plane version: 1.20.15-gke.3400
GKE worker nodes version: 1.20.11-gke.1300
GKE workers image type: Container-Optimized OS with containerd (cos_containerd)

Otherwise, the parameters used to configure both our dev and prod cluster are the same.

Helm non-sensitive values:

daemonset:
  hostDirectory: "/home/kubernetes/bin"
  coreDirectory: "/home/kubernetes/cores"
  extraEnvVars: |-
    - name: S3_ENDPOINT 
      value: "https://storage.googleapis.com"
image:
  tag: v8.0.3 # also tried with v8.3.1
composer:
  logLevel: "Error"

Description of the issue.

Some pods on 2 to 3 of our 14 worker nodes in prod (again using version 1.20.11-gke.1300 for the worker nodes) fall into the CrashLoopBackOff state. Here are the logs when the issue happens:

[2022-04-26T20:58:18Z INFO  core_dump_agent] no .env file found 
     That's ok if running in kubernetes
[2022-04-26T20:58:18Z INFO  core_dump_agent] Setting host location to: /home/kubernetes/bin
[2022-04-26T20:58:18Z INFO  core_dump_agent] Current Directory for setup is /app
[2022-04-26T20:58:18Z INFO  core_dump_agent] Copying the composer from ./vendor/default/cdc to /home/kubernetes/bin/cdc
Error: Text file busy (os error 26)
Stream closed EOF for observe/core-dump-handler-2kbvh (coredump-container)

Let me know if you require additional info from my end.

Question: Who runs Composer?

I'm interested in this chart and am currently looking into it, and there is something I don't understand.

The docs say a DaemonSet agent deploys the composer, but when I look at the agent code, I can't find code that runs the composer.
As far as I currently understand:

  1. The core-dump-handler image's entrypoint is the agent. This can be confirmed from the Dockerfile and daemonset.yaml.
  2. The core-dump-handler image contains the composer binary, so the agent can run the composer. OK.
  3. But the agent just copies the composer binary and some .bak files to 'host_location' and does not run it.
  4. I checked whether composer execution is included in the command defined in the yaml, etc., but I could only find that
    the composer binary is removed at postStop.

Please let me know if I am missing anything.

Allow setting resources

Many clusters specify default resource limits; those need to be overridden with fitting values by the user.
It would be helpful to allow setting resource requests and limits for the daemonsets pods.

Add support for a local debug session using podman

Podman allows you to build pod definitions where containers can share resources in a similar fashion to kubernetes.
The idea here is to mimic the k8s deploy so a local pod is started in a two container configuration.
This issue is to track the work being done on it.
Items to address:
The first iteration may use the download capability in the debug containers
https://github.com/IBM/core-dump-containers
However the ability to use an already downloaded zip file may also be required.
This might be complicated as the pods should not require root access but a local mount is required.

Adding the pod logs

(yet another follow-up to the conversations that triggered #44 + #47 + #48 - last one for now)

Our customer would like to correlate pod logs. This is potentially easily done in OpenShift, as you can take the pod info from the dump and search in the logging system. However, because our customer is a vendor, the cluster logging system might not be accessible to the respective tenants very easily.

Another "easy" solution would be to include the pod log (up until the crash) into the zip file. My naive thought would include a simply cp from the configured log directory on the host. Alternatively this could be done with kubectl log going through the API server.

What's your take on the story?

Add separate core location to deployment

With the introduction of scheduled uploads, core dumps may accumulate, so this issue is to track the creation of a separate PV and PVC for the storage of the cores.

zip files are corrupted

Hello,

I have deployed core-dump-handler in a k8s cluster and it's currently sending coredump to s3 as intended.

The only issue is that all the files that get sent to s3 are corrupted.

core-dump-handler-gclrk coredump-container [2021-08-12T09:10:42Z INFO  core_dump_agent] Dir Content ["/var/mnt/core-dump-handler/core/4739e2b1-565f-4336-99d9-f25f5d4c8516-dump-1628759416-busybox-sh-6-11.zip"]
core-dump-handler-gclrk coredump-container [2021-08-12T09:10:42Z INFO  core_dump_agent] Uploading: /var/mnt/core-dump-handler/core/4739e2b1-565f-4336-99d9-f25f5d4c8516-dump-1628759416-busybox-sh-6-11.zip
core-dump-handler-gclrk coredump-container [2021-08-12T09:10:42Z INFO  core_dump_agent] zip size is 368
core-dump-handler-gclrk coredump-container [2021-08-12T09:10:42Z INFO  core_dump_agent] S3 Returned: 200
➜ unzip 4739e2b1-565f-4336-99d9-f25f5d4c8516-dump-1628759416-busybox-sh-6-11.zip
Archive:  4739e2b1-565f-4336-99d9-f25f5d4c8516-dump-1628759416-busybox-sh-6-11.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of 4739e2b1-565f-4336-99d9-f25f5d4c8516-dump-1628759416-busybox-sh-6-11.zip or
        4739e2b1-565f-4336-99d9-f25f5d4c8516-dump-1628759416-busybox-sh-6-11.zip.zip, and cannot find 4739e2b1-565f-4336-99d9-f25f5d4c8516-dump-1628759416-busybox-sh-6-11.zip.ZIP, period.

This happens to all files. Tried downloading via browser and via the aws CLI; same result.

pvc -> pv mismatch when binding

Now, moving from dev to staging, I ran into the following issue:

$ kubectl get pv core-volume
NAME          CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                        STORAGECLASS   REASON   AGE
core-volume   10Gi       RWO            Retain           Bound    core-dump/host-storage-pvc   hostclass               36m
$ kubectl get pv host-volume
NAME          CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
host-volume   1Gi        RWO            Retain           Available           hostclass               16m
$ kubectl get pvc -n core-dump
NAME               STATUS    VOLUME        CAPACITY   ACCESS MODES   STORAGECLASS   AGE
core-storage-pvc   Pending                                           hostclass      7m40s
host-storage-pvc   Bound     core-volume   10Gi       RWO            hostclass      37m

As you can see, the host-storage-pvc is bound to the core-volume PV, which makes it impossible for core-storage-pvc to bind (1Gi vs 10Gi).
Looks like you either need to pre-bind or use some other creative idea.
Will think about something, but I don't have anything rn.

Possible bug - coredump hangs

We had a production outage some time ago but reporting this here got lost during other tasks.
I pinpointed the issue to a large number of running coredump-composer processes that never finished processing the coredump. This in turn led to the crashed processes being kept alive (they can't be killed until the coredump handling finishes), and in turn containerd became very unhappy about pods not terminating.

During the outage I unfortunately didn't debug further where the composer was stuck, but I could see the processes clearly running for a long time not doing active work.

My suggestion would be to extend the coredump composer with an upper bound on processing time, after which it terminates itself, to prevent those stuck processes staying around, and potentially to guard the different parts of the file creation process with timeouts (I'm imagining crictl being stuck after we have already written out the dump itself, and I'd rather have the dump without lots of extra info than not have it at all).

NOTICE - Splitting of debugging cli into a separate repo

During a couple of conversations on the route to release 5.0.0 it became apparent that the debugging cli should be partitioned away from the main core-dump collection. This is due to the shared configuration between the client and server not being ideal and the release cadence of the cli being very different from the server's.
The likely location for the cli will be https://github.com/IBM/core-dump-containers/ which itself will be renamed.
This will happen as part of the 6.0.0 release due in November.

v8.0.3 image is missing from quay.io

Installing v8.0.3 gave me an ImagePullBackOff:

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  3m52s                  default-scheduler  Successfully assigned observe/core-dump-handler-rg9jv to ip-10-7-5-141.eu-west-1.compute.internal
  Normal   Pulling    2m25s (x4 over 3m51s)  kubelet            Pulling image "quay.io/icdh/core-dump-handler:v8.0.3"
  Warning  Failed     2m23s (x4 over 3m50s)  kubelet            Failed to pull image "quay.io/icdh/core-dump-handler:v8.0.3": rpc error: code = Unknown desc = Error response from daemon: manifest for quay.io/icdh/core-dump-handler:v8.0.3 not found: manifest unknown: manifest unknown
  Warning  Failed     2m23s (x4 over 3m50s)  kubelet            Error: ErrImagePull
  Warning  Failed     116s (x6 over 3m49s)   kubelet            Error: ImagePullBackOff
  Normal   BackOff    102s (x7 over 3m49s)   kubelet            Back-off pulling image "quay.io/icdh/core-dump-handler:v8.0.3"

And indeed, the image isn't there; see https://quay.io/repository/icdh/core-dump-handler?tab=tags&tag=latest

Did something go wrong with the release process?

Get podname and namespace "unknown"

Hi, I set filenameTemplate: "{uuid}-dump-{timestamp}-{hostname}-{exe_name}-{pid}-{signal}-{podname}-{namespace}", but I get a filename like this: "9a1fc79c-758c-4599-a22d-2e94444a3250-dump-1657867608-segfaulter-segfaulter-1-4-unknown-unknown.zip". How can I fix it?

Azure Storage Support

This tool looks like exactly what I need for my AKS clusters, but unfortunately I'm not able to find any option to send the core dump to a native Azure solution. I see S3 is supported; is there any way to send those dumps over to Azure Blob Storage or a file share? Azure doesn't natively support the S3 protocol.

Cloud Event Support

Following on from the discussion in #61
Opening this issue to track this feature for release in 9.0.0

Agent Mocks

Currently the agent requires root access to make the sysctl calls; this makes it hard to develop with locally and very difficult to test in CI environments.
Logging this issue to investigate creating a mock sysctl, similar to the crictl in the composer, that could be used to validate most of the agent functionality without having to run the integration tests.

Document installation on different cloud providers

It would be great if there were markdown files for each of the working cloud providers that contained documentation for setting the project up on the different k8s services and object storage environments. This may be as simple as finding the different links to the documentation for setting up a cluster, configuring the kubectl client and configuring object storage.

enhancement: ability to set fs.suid_dumpable

Lots of applications start as root user, and setuid in order to run with lower privileges. This is usually to do with opening log files or binding to privileged ports.

Segfaults from these setuid'd processes do not create core dumps by default, and require the sysctl config fs.suid_dumpable set to 1 or 2.

Would it be possible to have an env var for the core-dump-agent to set this sysctl setting? 2 should be best, as it prevents the process from being able to read it (if the host volume were even mounted in the container).

Processes that do this include:

  • nginx
  • apache
  • php-fpm
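
For context, the agent install log earlier in this document shows fs.suid_dumpable being set to 2; the setting can also be checked or set manually on a node:

    # Check the current value and set it to 2 ("suidsafe") if needed
    sysctl fs.suid_dumpable
    sysctl -w fs.suid_dumpable=2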

Release to chart repo

It's a best practice to upload the released helm chart to a helm repo; this allows users like us to define a dependency on it and, e.g., automatically update it.

AWS Bottlerocket support

When deploying the helm chart on bottlerocket nodes, it doesn't appear the containerd socket is available under /run. I made a modification to the helm chart locally to mount it, and everything seemed to work. I can make a PR with my changes wrapped in a conditional, but I'm not sure if there are other use cases for customizing where the socket exists based on the OS.

psact@10b6c33
