
chaos-controller's Introduction

Oldest Kubernetes version supported: 1.16

⚠️ Kubernetes version 1.20.x is not supported! This Kubernetes issue prevents the controller from running properly on Kubernetes 1.20.0-1.20.4. Earlier versions of Kubernetes as well as 1.20.5 and later are still supported.

Datadog Chaos Controller

💣 Disclaimer 💣

The Chaos Controller allows you to disrupt your Kubernetes infrastructure through various means including but not limited to: bringing down resources you have provisioned and preventing critical data from being transmitted between resources. The use of Chaos Controller on your production system is done at your own discretion and risk.

The Chaos Controller is a Kubernetes controller with which you can inject various systemic failures, at scale, and without caring about the implementation details of your Kubernetes infrastructure. It was created with a specific mindset to answer Datadog's internal needs:

  • 🐇 Be fast and operate at scale
    • At Datadog, we are running experiments injecting and cleaning failures to/from thousands of targets within a few minutes.
  • 🚑 Be safe and operate in highly disrupted environments
    • The controller is built to limit the blast radius of failures, but also to recover by itself in catastrophic scenarios.
  • 💡 Be smart and operate in various technical environments
    • With Kubernetes, all environments are built differently.
    • Whatever your cluster configuration and implementation detail choices, the controller is able to inject failures by relying on low-level Linux kernel features such as cgroups, tc, or even eBPF.
  • 🪙 Be simple and operate at low cost
    • Most of the time, your Chaos Engineering platform is waiting and doing nothing.
    • We built this project so it uses resources only when it is really doing something:
      • No DaemonSet or any always-running processes on your nodes for injection, no reserved resources when it's not needed.
      • Injection pods are created only when needed, killed once the experiment is done, and built to be evicted if necessary to free resources.
      • A single long-running pod, the controller, and nothing else!

Getting Started

💡 Read the latest release quick installation guide and the configuration guide to know how to deploy the controller.

Disruptions are built as short-lived resources which should be manually created and removed once your experiments are done. They should not be part of any application deployment. The Disruption resource is immutable: once applied, you can't edit it. If you need to change the disruption definition, delete the existing resource and re-create it.

Getting started is as simple as creating a Kubernetes resource:

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: node-failure
  namespace: chaos-demo # it must be in the same namespace as targeted resources
spec:
  selector: # a label selector used to target resources
    app: demo-curl
  count: 1 # the number of resources to target, can be a percentage
  duration: 1h # the amount of time before your disruption automatically terminates itself, for safety
  nodeFailure: # trigger a kernel panic on the target node
    shutdown: false # do not force the node to be kept down

To disrupt your cluster, run kubectl apply -f <disruption_file>.yaml. You can clean up the disruption with kubectl delete -f <disruption_file>.yaml. For your safety, we recommend you get started with the dry-run mode enabled.
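
For example, dry-run mode goes through the whole targeting lifecycle without actually injecting the failure. A minimal sketch, assuming the field is named dryRun (check the features guide for the exact syntax):

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: node-failure-dry-run
  namespace: chaos-demo
spec:
  dryRun: true # assumed field name for dry-run mode; see the features guide
  selector:
    app: demo-curl
  count: 1
  duration: 1h
  nodeFailure:
    shutdown: false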

📖 The features guide details all the features of the Chaos Controller.

📖 The examples guide contains a list of various disruption files that you can use.

Check out Chaosli if you want some help understanding/creating disruption configurations.

Chaos Scheduling

New feature in 8.0.0

The Chaos Controller has expanded its capabilities by introducing disruption scheduling, enhancing your ability to automate and test system resilience consistently. Instead of manual creation and deletion, use DisruptionCron to regularly disrupt long-lived Kubernetes resources like Deployments and StatefulSets.

Example:

apiVersion: chaos.datadoghq.com/v1beta1
kind: DisruptionCron
metadata:
  name: node-failure
  namespace: chaos-demo
spec:
  schedule: "*/15 * * * *" # every 15 minutes
  targetResource: 
    kind: deployment
    name: demo-curl
  disruptionTemplate:
    count: 1 
    duration: 1h 
    nodeFailure:
      shutdown: false

To schedule disruptions in your cluster, run kubectl apply -f <disruption_cron_file>.yaml. To stop, run kubectl delete -f <disruption_cron_file>.yaml.

🔎 Check out the DisruptionCron guide for more detailed information on how to schedule disruptions.

Contributing

Chaos Engineering is necessarily different from system to system. We encourage you to try out this tool, and extend it for your own use cases. If you want to run the source code locally to make and test implementation changes, visit the Contributing Doc. By the way, we welcome Pull Requests.

chaos-controller's People

Contributors

aymericdd, azoam, blazebissar, brandon-dd, clairecng, craig-seeman, dd-adn, dependabot[bot], devatoria, diyarab, emmanuel-ferdman, ethan-lowman-dd, expflower, gaetan-deputier, github-actions[bot], griffin, guyboltonking, hochristinawuiyan, jvanbrunschot, kathy-huang, luphaz, nathantournant, nikos912000, ptnapoleon, taihuynh167, takakonishimura, wdhif


chaos-controller's Issues

User Request: HTTP Disruptions

Note: While chaos-controller is open to the public and we consider all suggestions for improvement, we prioritize feature development that is immediately applicable to chaos engineering initiatives within Datadog. We encourage users to contribute ideas to the repository directly in the form of pull requests!

Is your feature request related to a problem? Please describe.
@scottjr632 and I are interested in disrupting some web services at the application HTTP layer.

Describe the solution you'd like
If this is a desirable feature for Datadog, we are willing to design and implement an HTTP disruption capability. We would provide the design for review by early July to ensure that it lines up with Datadog's plans for the controller, and then proceed with implementation. We are currently looking into how feasible this is at the interface/pod level, though it may require some application-level capability due to the protocol's layer and the encryption of HTTPS.

Describe alternatives you've considered
A network disruption could approximate this functionality, but would be too coarse to target specific HTTP request types or URI endpoints without also disrupting other requests.
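
To make the coarseness concrete: a network disruption matches traffic at the host level, not at the HTTP method or URI level. A minimal sketch, reusing only fields that appear in the network disruption examples later in this document (values are illustrative):

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: approximate-http-failure
  namespace: chaos-demo
spec:
  level: pod
  selector:
    app: demo-curl
  count: 1
  duration: 5m
  network:
    drop: 100 # drops all traffic to the host below, regardless of HTTP method or URI
    hosts:
      - host: demo.chaos-demo.svc.cluster.local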

Inaccurate comment in node_failure?

Hi, quick question.

In node_failure.yaml the namespace metadata has a comment about a disruption resource needing to be in the same namespace as a targeted pod. I don't think this is accurate, because we're not targeting a pod, right? Or does chaos-controller require you to target pods with a label selector and then the nodes backing those pods are put into failure mode?

metadata:
  name: disruption-sample
  namespace: chaos-engineering # disruption resource must be in the same namespace as targeted pods
spec:
  selector: # label selector to target pods
    app: demo

For node level disruptions, can disruptions be run in any namespace?

Thanks, a little confused.
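
For reference, the node-level example quoted later in this document (the DNS disruption in the cgroupPaths issue) suggests that with level: node the selector matches node labels rather than pod labels. A minimal sketch assembled from that example, not an authoritative answer to the namespace question:

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: node-failure-node-level
  namespace: chaos-engineering
spec:
  level: node # at node level, the selector below matches node labels, not pod labels
  selector:
    kubernetes.io/hostname: some-node # illustrative node label
  count: 1
  nodeFailure:
    shutdown: false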

User Request: Support for percentage in CPU stress

Is your feature request related to a problem? Please describe.
A common use case for CPU stress is testing CPU-based autoscaling. Right now the controller only supports applying 100% CPU pressure.
It would be great to be able to specify percentages.

Describe the solution you'd like
Looking at the implementation, the injector Pod creates a goroutine per core. I'm wondering if the lack of support for percentage-based stress is due to that.
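
For context, the manifests quoted in later issues in this document show a cpuPressure.count field that accepts a percentage, so the requested spec could plausibly look like the following sketch (illustrative only; whether count means a share of cores or of the CPU allocation should be checked against the features guide):

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: cpu-pressure-partial
  namespace: chaos-demo
spec:
  level: pod
  selector:
    app: demo-curl
  count: 100%
  duration: 15m
  cpuPressure:
    count: 50% # desired partial pressure instead of the current all-or-nothing behavior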

Describe alternatives you've considered
N/A

User Request: store failures in the Custom Resource's status

Is your feature request related to a problem? Please describe.
To improve developer experience we are integrating the controller with our Continuous Delivery platform. Since any failures are only logged in the controller, showing that feedback in the UI is not straightforward.

Describe the solution you'd like
It would be helpful to store any errors related to a specific experiment in the Custom Resource's status. Any clients can then fetch these errors from the CR (and show these in the UI).

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
...
status:
  injectionStatus: NotInjected
  errors: "invalid disruption name"
  targets:
    - demo-nginx-68954564c-pfxc8
spec:
  containerFailure:
    forced: true
  count: 1
  duration: 1m30s
  selector:
    app: demo-nginx

Describe alternatives you've considered
We are open to ideas if there is a better way to show that information in a user interface. An alternative is to associate logs with experiments, but this is still not straightforward.

User Issue: injector panics due to cgroupPaths containing an empty string

Describe the bug
While working on #405 I came across an issue in the main branch where the injector panics when retrieving the cgroupPaths.

To Reproduce
Steps to reproduce in minikube:

  1. make minikube-start
  2. make minikube-build; make sure the new images are not cached by either destroying minikube or using the old Makefile. In the main branch's Makefile images are always cached for me.
  3. make install
  4. make restart
  5. Deploy the following Disruption:
apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: dns
  namespace: chaos-engineering
spec:
  level: node
  selector:
    kubernetes.io/hostname: minikube
  count: 100%
  dns:
    - hostname: demo.chaos-demo.svc.cluster.local
      record:
        type: A
        value: 10.0.0.154,10.0.0.13

The chaos-dns pod gets created but panics:

panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
main.initConfig()
/Users/nkatirtzis/github/chaos-controller/cli/injector/main.go:206 +0x16df

Expected behavior
The chaos-dns pod successfully injects the failure.

Environment:

  • Kubernetes version: 1.19.14
  • Controller version: latest (feaeb6e)
  • Cloud provider (or local): minikube
  • Base OS for Kubernetes: minikube with containerd

Additional context
From the logs I could see the cgroupPaths being a list with only an empty string.

User Issue: Unable to gracefully terminate a pod's containers

Describe the bug
While executing a container-failure-graceful experiment, I noticed that the injector pod starts and identifies the target pods, but does not terminate the application container inside the pod.

To Reproduce
Steps to reproduce the behavior:

  1. I applied the following "graceful container termination" Disruption (the manifest is only captured in the screenshot below; a comparable sketch follows this list):

Screenshot 2023-02-21 at 16 43 21

  2. Delete the disruption once it's succeeded.
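
Since the applied manifest is only available as a screenshot, here is a comparable sketch based on the containerFailure spec quoted in another issue in this document; it is illustrative, not the reporter's exact manifest:

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: container-failure-all-graceful
  namespace: chaos-engineering-framework
spec:
  level: pod
  selector:
    app: demo-curl
  count: 100%
  duration: 5m
  containerFailure:
    forced: false # graceful termination; forced: true would kill the containers immediately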

Expected behavior
I injected the disruption into a pod with 2 containers:

  1. Application container (name: demo-curl)
  2. Istio sidecar container (name: istio-proxy)

The expectation is for both containers inside the pod to restart. The actual behavior observed is that only the istio container restarts; the application container in the pod does nothing.

Screenshots
We can see the injector pod spin up:
Screenshot 2023-02-21 at 16 31 45

It identifies the target pod:
Screenshot 2023-02-22 at 12 22 31

The timestamps show that nothing has happened:
Screenshot 2023-02-21 at 16 21 23

Injector logs

{"level":"info","ts":1677068244553.9192,"caller":"injector/main.go:219","message":"injector targeting container","disruptionName":"container-failure-all-graceful","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-75585bc5dd-j7zhd","targetNodeName":"ip-0-xxx-xx","containerID":"docker://e8c649289fe1a8ec8547cadc487cb5cea5dd369f743d1b3f802bd7e3bda906b1","container name":"/k8s_curl_demo-curl-75585bc5dd-j7zhd_chaos-engineering-framework_ab0c72c0-b295-4c31-93e9-1790e7ac9dde_0"}

Screenshot 2023-02-22 at 12 32 24

Application Logs
This was checked, but showed no updates.

Environment:
  • Kubernetes version: 1.21
  • Controller version: 7.10.0
  • Cloud provider (or local): EKS
  • Base OS for Kubernetes: Amazon Linux (5.4.209-116.363.amzn2.x86_64)

User Issue: Traffic surge once a packet drop network failure finishes

Describe the bug
We have been testing the packet drop failures internally. One thing we noticed is that once an experiment finishes there is a significant traffic surge in the targeted applications, which can harm them.

I may be missing implementation details but aren't packets dropped completely instead of piling up?

To Reproduce
In minikube:

  1. Deploy the following Disruption:
apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: network-egress-drop
  namespace: chaos-demo
spec:
  level: pod
  selector:
    app: "demo-curl"
  count: 100%
  containers:
    - "curl"
  duration: 3m0s
  network:
    drop: 100
    hosts:
      - host: demo.chaos-demo.svc.cluster.local

  2. During the experiment, we can see the connection failures:
Failed to connect to demo.chaos-demo.svc.cluster.local port 8080 after 31757 ms: Operation timed out
  3. Once the experiment finishes, all these requests/responses show up within a short time window:
2022-03-21T16:31:34.033598753Z 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 10.107.44.112:8080...
2022-03-21T16:31:49.405143841Z 
...
  0     0    0     0    0     0      0      0 --:--:--  0:00:15 --:--:--     0* Connected to demo.chaos-demo.svc.cluster.local (10.107.44.112) port 8080 (#0)
2022-03-21T16:31:49.405213092Z > GET / HTTP/1.1
2022-03-21T16:31:49.405227382Z > Host: demo.chaos-demo.svc.cluster.local:8080
...
2022-03-21T16:31:55.472746326Z * Connected to demo.chaos-demo.svc.cluster.local (10.107.44.112) port 8080 (#0)
2022-03-21T16:31:55.472960280Z > GET / HTTP/1.1
2022-03-21T16:31:55.472984764Z > Host: demo.chaos-demo.svc.cluster.local:8080
...

Expected behavior
No traffic surge as packets/requests should be dropped rather than piling up.

Screenshots
This is from our internal test cluster:
Screenshot 2022-03-18 at 19 15 11

Environment:

  • Kubernetes version: 1.20.11
  • Controller version: 6.0.0
  • Cloud provider (or local): EKS
  • Base OS for Kubernetes: Amazon Linux 2
  • Container runtime: containerd

User Issue: CPU Pressure experiment could not inject the disruption successfully

Describe the bug
While executing a CPU Pressure disruption, the injector fails to inject the disruption successfully.

To Reproduce
Steps to reproduce the behavior:

  1. Apply the following disruption on an EKS cluster:
apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: cpu-pressure-pod
  namespace: chaos-engineering-framework
  annotations:
    chaos.datadoghq.com/environment: dev
spec:
  level: pod
  duration: 15m
  selector:
    app.kubernetes.io/name: "fault-injection-showcase"
  count: 100%
  cpuPressure:
    count: 100%
  staticTargeting: true

Expected behavior
The expectation is that the chaos injector finds the targets and successfully injects pressure to containers.

Screenshots
The chaos-controller correctly identifies the targets:

Name:         cpu-pressure-pod
Namespace:    chaos-engineering-framework
Labels:       <none>
Annotations:  chaos.datadoghq.com/environment: dev
API Version:  chaos.datadoghq.com/v1beta1
Kind:         Disruption
Metadata:
  Creation Timestamp:  2023-06-28T10:42:56Z
  Finalizers:
    finalizer.chaos.datadoghq.com
  Generation:  1
  Managed Fields:
    ...
Spec:
  Count:  100%
  Cpu Pressure:
    Count:   100%
  Duration:  15m0s
  Level:     pod
  Selector:
    app.kubernetes.io/name:  fault-injection-showcase
  Static Targeting:          true
  Triggers:
    Create Pods:
      Not Before:  <nil>
    Inject:
      Not Before:  <nil>
Status:
  Desired Targets Count:   4
  Ignored Targets Count:   0
  Injected Targets Count:  0
  Injection Status:        PreviouslyPartiallyInjected
  Selected Targets Count:  4
  Target Injections:
    demo-curl-66dd49f77-cwrcd:
      Injection Status:  NotInjected
    demo-nginx-b764fb599-2ncrz:
      Injection Status:  NotInjected
    fault-injection-showcase-plugin-master-template-79cfc667bd6phhg:
      Injection Status:  NotInjected
    fault-injection-showcase-plugin-master-template-79cfc667bd77b55:
      Injection Status:  NotInjected
Events:                  <none>

Pod describe:

Name:             chaos-cpu-pressure-pod-hggrb
Namespace:        chaos-engineering-framework
Priority:         0
Service Account:  chaos-injector
Node:             ip-10-72-87-212.us-west-2.compute.internal/10.72.87.212
Start Time:       Wed, 28 Jun 2023 11:42:58 +0100
Labels:           chaos.datadoghq.com/disruption-kind=cpu-pressure
                  chaos.datadoghq.com/disruption-name=cpu-pressure-pod
                  chaos.datadoghq.com/disruption-namespace=chaos-engineering-framework
                  chaos.datadoghq.com/target=demo-curl-66dd49f77-cwrcd
Annotations:      kubernetes.io/psp: eks.privileged
                  sidecar.istio.io/inject: false
Status:           Succeeded
IP:               10.72.117.226
IPs:
  IP:  10.72.117.226
Containers:
  injector:
    Container ID:  docker://c09f06f1d55d74f521ec421946beab71b14352a65cab7ef8f8d2d697736b14db
    Image:         xxxxxxx/datadog/chaos-injector:7.18.0
    Image ID:      docker-pullable://xxxxxxx/datadog/chaos-injector@sha256:28148f9f2decad765d8ef70c2d5dd9b1e1ba808741e339721e012822b2ea8743
    Port:          <none>
    Host Port:     <none>
    Args:
      cpu-pressure
      --count
      100%
      --metrics-sink
      noop
      --level
      pod
      --target-containers
      curl;docker://a7fb85a4599d94d18fa91da244a789efae96ce5bf4034126e59f551aecd94201,dummy;docker://4edab9f9b56cc00eb48e1ee298eabd6866e1b67b2bc3babccd3cbeb370e7b048,istio-proxy;docker://bd8fd379fe3eadab07e6b9f9b37788df754b47fe96d121123ce0983b13f04e4b
      --target-pod-ip
      10.72.83.133
      --chaos-namespace
      chaos-engineering-framework
      --log-context-disruption-name
      cpu-pressure-pod
      --log-context-disruption-namespace
      chaos-engineering-framework
      --log-context-target-name
      demo-curl-66dd49f77-cwrcd
      --log-context-target-node-name
      ip-10-72-87-212.us-west-2.compute.internal
      --not-injected-before
      2023-06-28T10:42:56Z
      --deadline
      2023-06-28T10:57:55Z
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 28 Jun 2023 11:42:59 +0100
      Finished:     Wed, 28 Jun 2023 11:57:55 +0100
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     0
      memory:  0
    Requests:
      cpu:      0
      memory:   0
    Readiness:  exec [cat /tmp/readiness_probe] delay=0s timeout=1s period=1s #success=1 #failure=5
    Environment:
      DD_ENTITY_ID:                         (v1:metadata.uid)
      DD_AGENT_HOST:                        (v1:status.hostIP)
      TARGET_POD_HOST_IP:                   (v1:status.hostIP)
      CHAOS_POD_IP:                         (v1:status.podIP)
      INJECTOR_POD_NAME:                   chaos-cpu-pressure-pod-hggrb (v1:metadata.name)
      CHAOS_INJECTOR_MOUNT_HOST:           /mnt/host/
      CHAOS_INJECTOR_MOUNT_PROC:           /mnt/host/proc/
      CHAOS_INJECTOR_MOUNT_SYSRQ:          /mnt/sysrq
      CHAOS_INJECTOR_MOUNT_SYSRQ_TRIGGER:  /mnt/sysrq-trigger
      CHAOS_INJECTOR_MOUNT_CGROUP:         /mnt/cgroup/
    Mounts:
      /mnt/cgroup from cgroup (rw)
      /mnt/host from host (ro)
      /mnt/sysrq from sysrq (rw)
      /mnt/sysrq-trigger from sysrq-trigger (rw)
      /run from run (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l8lnd (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run:
    Type:          HostPath (bare host directory volume)
    Path:          /run
    HostPathType:  Directory
  proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:  Directory
  sysrq:
    Type:          HostPath (bare host directory volume)
    Path:          /proc/sys/kernel/sysrq
    HostPathType:  File
  sysrq-trigger:
    Type:          HostPath (bare host directory volume)
    Path:          /proc/sysrq-trigger
    HostPathType:  File
  cgroup:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/cgroup
    HostPathType:  Directory
  host:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  Directory
  kube-api-access-l8lnd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/arch=amd64
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

AWS node docker info (full docker inspect output collected and saved if needed)

sh-4.2$ sudo docker ps | grep k8s_dummy_demo-curl-66dd49f77-cwrcd_chaos-engineering-framework_0894918f-93c6-45c3-b842-978574d0dbfe_0
4edab9f9b56c   xxxxxxxxx/ubuntu                                                                             "/bin/bash -c 'while…"   17 hours ago    Up 17 hours              k8s_dummy_demo-curl-66dd49f77-cwrcd_chaos-engineering-framework_0894918f-93c6-45c3-b842-978574d0dbfe_0

Environment:

  • Kubernetes version: v1.23.17-eks-c12679a
  • Controller version: 7.18.0
  • Cloud provider (or local): AWS EKS
  • Base OS for Kubernetes: Amazon Linux: 5.4.242-156.349.amzn2.x86_64

Additional context

Injector logs:

{"level":"info","ts":1687948979446.6792,"caller":"injector/main.go:244","message":"injector targeting container","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","containerID":"docker://4edab9f9b56cc00eb48e1ee298eabd6866e1b67b2bc3babccd3cbeb370e7b048","container name":"/k8s_dummy_demo-curl-66dd49f77-cwrcd_chaos-engineering-framework_0894918f-93c6-45c3-b842-978574d0dbfe_0"}
{"level":"info","ts":1687948979449.3525,"caller":"injector/main.go:244","message":"injector targeting container","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","containerID":"docker://bd8fd379fe3eadab07e6b9f9b37788df754b47fe96d121123ce0983b13f04e4b","container name":"/k8s_istio-proxy_demo-curl-66dd49f77-cwrcd_chaos-engineering-framework_0894918f-93c6-45c3-b842-978574d0dbfe_0"}
{"level":"debug","ts":1687948979450.1873,"caller":"netns/netns.go:52","message":"Retrieved root namespace and target namespace","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","rootns":9,"targetns":10,"targetnsPath":"/mnt/host/proc/25830/ns/net"}
{"level":"debug","ts":1687948979450.2866,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"memory","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/a7fb85a4599d94d18fa91da244a789efae96ce5bf4034126e59f551aecd94201"}
{"level":"debug","ts":1687948979450.3376,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"devices","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/a7fb85a4599d94d18fa91da244a789efae96ce5bf4034126e59f551aecd94201"}
{"level":"debug","ts":1687948979450.347,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"net_cls","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/a7fb85a4599d94d18fa91da244a789efae96ce5bf4034126e59f551aecd94201"}
{"level":"debug","ts":1687948979450.3552,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"net_prio","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/a7fb85a4599d94d18fa91da244a789efae96ce5bf4034126e59f551aecd94201"}
{"level":"debug","ts":1687948979450.363,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"cpuset","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/a7fb85a4599d94d18fa91da244a789efae96ce5bf4034126e59f551aecd94201"}
{"level":"debug","ts":1687948979450.3704,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"pids","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/a7fb85a4599d94d18fa91da244a789efae96ce5bf4034126e59f551aecd94201"}
{"level":"debug","ts":1687948979450.391,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"blkio","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/a7fb85a4599d94d18fa91da244a789efae96ce5bf4034126e59f551aecd94201"}
{"level":"debug","ts":1687948979450.3997,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"perf_event","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/a7fb85a4599d94d18fa91da244a789efae96ce5bf4034126e59f551aecd94201"}
{"level":"debug","ts":1687948979450.4202,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"hugetlb","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/a7fb85a4599d94d18fa91da244a789efae96ce5bf4034126e59f551aecd94201"}
{"level":"debug","ts":1687948979450.4287,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"cpu","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/a7fb85a4599d94d18fa91da244a789efae96ce5bf4034126e59f551aecd94201"}
{"level":"debug","ts":1687948979450.4355,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"cpuacct","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/a7fb85a4599d94d18fa91da244a789efae96ce5bf4034126e59f551aecd94201"}
{"level":"debug","ts":1687948979450.4429,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"freezer","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/a7fb85a4599d94d18fa91da244a789efae96ce5bf4034126e59f551aecd94201"}
{"level":"debug","ts":1687948979450.45,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"name=systemd","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/a7fb85a4599d94d18fa91da244a789efae96ce5bf4034126e59f551aecd94201"}
{"level":"debug","ts":1687948979450.5627,"caller":"netns/netns.go:52","message":"Retrieved root namespace and target namespace","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","rootns":11,"targetns":12,"targetnsPath":"/mnt/host/proc/26062/ns/net"}
{"level":"debug","ts":1687948979450.6387,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"cpu","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/4edab9f9b56cc00eb48e1ee298eabd6866e1b67b2bc3babccd3cbeb370e7b048"}
{"level":"debug","ts":1687948979450.6533,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"freezer","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/4edab9f9b56cc00eb48e1ee298eabd6866e1b67b2bc3babccd3cbeb370e7b048"}
{"level":"debug","ts":1687948979450.661,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"net_cls","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/4edab9f9b56cc00eb48e1ee298eabd6866e1b67b2bc3babccd3cbeb370e7b048"}
{"level":"debug","ts":1687948979450.684,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"net_prio","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/4edab9f9b56cc00eb48e1ee298eabd6866e1b67b2bc3babccd3cbeb370e7b048"}
{"level":"debug","ts":1687948979450.692,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"cpuset","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/4edab9f9b56cc00eb48e1ee298eabd6866e1b67b2bc3babccd3cbeb370e7b048"}
{"level":"debug","ts":1687948979450.6987,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"pids","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/4edab9f9b56cc00eb48e1ee298eabd6866e1b67b2bc3babccd3cbeb370e7b048"}
{"level":"debug","ts":1687948979450.7134,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"memory","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/4edab9f9b56cc00eb48e1ee298eabd6866e1b67b2bc3babccd3cbeb370e7b048"}
{"level":"debug","ts":1687948979450.7344,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"blkio","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/4edab9f9b56cc00eb48e1ee298eabd6866e1b67b2bc3babccd3cbeb370e7b048"}
{"level":"debug","ts":1687948979450.742,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"perf_event","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/4edab9f9b56cc00eb48e1ee298eabd6866e1b67b2bc3babccd3cbeb370e7b048"}
{"level":"debug","ts":1687948979450.7493,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"devices","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/4edab9f9b56cc00eb48e1ee298eabd6866e1b67b2bc3babccd3cbeb370e7b048"}
{"level":"debug","ts":1687948979450.7686,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"hugetlb","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/4edab9f9b56cc00eb48e1ee298eabd6866e1b67b2bc3babccd3cbeb370e7b048"}
{"level":"debug","ts":1687948979450.7764,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"cpuacct","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/4edab9f9b56cc00eb48e1ee298eabd6866e1b67b2bc3babccd3cbeb370e7b048"}
{"level":"debug","ts":1687948979450.7834,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"name=systemd","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/4edab9f9b56cc00eb48e1ee298eabd6866e1b67b2bc3babccd3cbeb370e7b048"}
{"level":"debug","ts":1687948979450.8774,"caller":"netns/netns.go:52","message":"Retrieved root namespace and target namespace","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","rootns":13,"targetns":14,"targetnsPath":"/mnt/host/proc/25517/ns/net"}
{"level":"debug","ts":1687948979450.9482,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"blkio","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/bd8fd379fe3eadab07e6b9f9b37788df754b47fe96d121123ce0983b13f04e4b"}
{"level":"debug","ts":1687948979450.9734,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"net_cls","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/bd8fd379fe3eadab07e6b9f9b37788df754b47fe96d121123ce0983b13f04e4b"}
{"level":"debug","ts":1687948979450.9814,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"hugetlb","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/bd8fd379fe3eadab07e6b9f9b37788df754b47fe96d121123ce0983b13f04e4b"}
{"level":"debug","ts":1687948979450.9883,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"cpuacct","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/bd8fd379fe3eadab07e6b9f9b37788df754b47fe96d121123ce0983b13f04e4b"}
{"level":"debug","ts":1687948979450.9993,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"freezer","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/bd8fd379fe3eadab07e6b9f9b37788df754b47fe96d121123ce0983b13f04e4b"}
{"level":"debug","ts":1687948979451.0063,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"pids","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/bd8fd379fe3eadab07e6b9f9b37788df754b47fe96d121123ce0983b13f04e4b"}
{"level":"debug","ts":1687948979451.0137,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"memory","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/bd8fd379fe3eadab07e6b9f9b37788df754b47fe96d121123ce0983b13f04e4b"}
{"level":"debug","ts":1687948979451.0205,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"devices","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/bd8fd379fe3eadab07e6b9f9b37788df754b47fe96d121123ce0983b13f04e4b"}
{"level":"debug","ts":1687948979451.0273,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"net_prio","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/bd8fd379fe3eadab07e6b9f9b37788df754b47fe96d121123ce0983b13f04e4b"}
{"level":"debug","ts":1687948979451.0347,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"cpu","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/bd8fd379fe3eadab07e6b9f9b37788df754b47fe96d121123ce0983b13f04e4b"}
{"level":"debug","ts":1687948979451.0562,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"name=systemd","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/bd8fd379fe3eadab07e6b9f9b37788df754b47fe96d121123ce0983b13f04e4b"}
{"level":"debug","ts":1687948979451.072,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"cpuset","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/bd8fd379fe3eadab07e6b9f9b37788df754b47fe96d121123ce0983b13f04e4b"}
{"level":"debug","ts":1687948979451.0916,"caller":"cgroup/manager_linux.go:37","message":"adding cgroup subsystem path to manager","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","subsystem":"perf_event","path":"/kubepods/burstable/pod0894918f-93c6-45c3-b842-978574d0dbfe/bd8fd379fe3eadab07e6b9f9b37788df754b47fe96d121123ce0983b13f04e4b"}
{"level":"info","ts":1687948979451.3286,"caller":"injector/main.go:498","message":"waiting for synchronized start to begin","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","timeUntilNotInjectedBefore":"-3.451324844s"}
{"level":"info","ts":1687948979451.4324,"caller":"injector/main.go:523","message":"injecting the disruption","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","kind":"cpu-pressure"}
{"level":"info","ts":1687948979451.4646,"caller":"injector/cpu_pressure.go:55","message":"creating processes to stress target","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","targetLevel":"pod","target":"/k8s_curl_demo-curl-66dd49f77-cwrcd_chaos-engineering-framework_0894918f-93c6-45c3-b842-978574d0dbfe_0","count":"100%"}
{"level":"info","ts":1687948979451.4963,"caller":"injector/cpu_pressure.go:74","message":"percentage calculated from percentage","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","targetLevel":"pod","target":"/k8s_curl_demo-curl-66dd49f77-cwrcd_chaos-engineering-framework_0894918f-93c6-45c3-b842-978574d0dbfe_0","provided_value":"100%","percentage":100}
{"level":"debug","ts":1687948979451.5222,"caller":"noop/noop.go:41","message":"NOOP: MetricInjected false\n","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal"}
{"level":"error","ts":1687948979451.549,"caller":"injector/main.go:356","message":"disruption injection failed","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","error":"unable to create new process definition for injector: targeted container does not exists: /k8s_curl_demo-curl-66dd49f77-cwrcd_chaos-engineering-framework_0894918f-93c6-45c3-b842-978574d0dbfe_0","stacktrace":"main.inject\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/cli/injector/main.go:356\nmain.injectAndWait\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/cli/injector/main.go:525\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/vendor/github.com/spf13/cobra/command.go:944\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/vendor/github.com/spf13/cobra/command.go:1068\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/vendor/github.com/spf13/cobra/command.go:992\nmain.main\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/cli/injector/main.go:134\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
{"level":"info","ts":1687948979451.5876,"caller":"injector/cpu_pressure.go:55","message":"creating processes to stress target","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","targetLevel":"pod","target":"/k8s_dummy_demo-curl-66dd49f77-cwrcd_chaos-engineering-framework_0894918f-93c6-45c3-b842-978574d0dbfe_0","count":"100%"}
{"level":"info","ts":1687948979451.595,"caller":"injector/cpu_pressure.go:74","message":"percentage calculated from percentage","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","targetLevel":"pod","target":"/k8s_dummy_demo-curl-66dd49f77-cwrcd_chaos-engineering-framework_0894918f-93c6-45c3-b842-978574d0dbfe_0","provided_value":"100%","percentage":100}
{"level":"debug","ts":1687948979451.6191,"caller":"noop/noop.go:41","message":"NOOP: MetricInjected false\n","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal"}
{"level":"error","ts":1687948979451.6248,"caller":"injector/main.go:356","message":"disruption injection failed","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","error":"unable to create new process definition for injector: targeted container does not exists: /k8s_dummy_demo-curl-66dd49f77-cwrcd_chaos-engineering-framework_0894918f-93c6-45c3-b842-978574d0dbfe_0","stacktrace":"main.inject\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/cli/injector/main.go:356\nmain.injectAndWait\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/cli/injector/main.go:525\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/vendor/github.com/spf13/cobra/command.go:944\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/vendor/github.com/spf13/cobra/command.go:1068\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/vendor/github.com/spf13/cobra/command.go:992\nmain.main\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/cli/injector/main.go:134\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
{"level":"info","ts":1687948979451.6455,"caller":"injector/cpu_pressure.go:55","message":"creating processes to stress target","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","targetLevel":"pod","target":"/k8s_istio-proxy_demo-curl-66dd49f77-cwrcd_chaos-engineering-framework_0894918f-93c6-45c3-b842-978574d0dbfe_0","count":"100%"}
{"level":"info","ts":1687948979451.6523,"caller":"injector/cpu_pressure.go:74","message":"percentage calculated from percentage","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","targetLevel":"pod","target":"/k8s_istio-proxy_demo-curl-66dd49f77-cwrcd_chaos-engineering-framework_0894918f-93c6-45c3-b842-978574d0dbfe_0","provided_value":"100%","percentage":100}
{"level":"debug","ts":1687948979451.6636,"caller":"noop/noop.go:41","message":"NOOP: MetricInjected false\n","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal"}
{"level":"error","ts":1687948979451.669,"caller":"injector/main.go:356","message":"disruption injection failed","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","error":"unable to create new process definition for injector: targeted container does not exists: /k8s_istio-proxy_demo-curl-66dd49f77-cwrcd_chaos-engineering-framework_0894918f-93c6-45c3-b842-978574d0dbfe_0","stacktrace":"main.inject\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/cli/injector/main.go:356\nmain.injectAndWait\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/cli/injector/main.go:525\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/vendor/github.com/spf13/cobra/command.go:944\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/vendor/github.com/spf13/cobra/command.go:1068\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/vendor/github.com/spf13/cobra/command.go:992\nmain.main\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/cli/injector/main.go:134\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
{"level":"error","ts":1687948979451.7002,"caller":"injector/main.go:375","message":"an injector could not inject the disruption successfully, please look at the logs above for more details","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","stacktrace":"main.inject\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/cli/injector/main.go:375\nmain.injectAndWait\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/cli/injector/main.go:525\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/vendor/github.com/spf13/cobra/command.go:944\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/vendor/github.com/spf13/cobra/command.go:1068\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/vendor/github.com/spf13/cobra/command.go:992\nmain.main\n\t/go/src/github.com/ujyL7oZF/0/DataDog/chaos-controller/cli/injector/main.go:134\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
{"level":"warn","ts":1687948979451.7363,"caller":"injector/main.go:644","message":"waiting for system signals...","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal"}
{"level":"info","ts":1687949875000.3848,"caller":"injector/main.go:661","message":"disruption duration has expired, exiting","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal"}
{"level":"info","ts":1687949875000.5095,"caller":"injector/main.go:443","message":"disruption cpu-pressure cleaned","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal"}
{"level":"debug","ts":1687949875000.5195,"caller":"noop/noop.go:55","message":"NOOP: MetricCleaned true\n","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal"}
{"level":"info","ts":1687949875000.5322,"caller":"injector/main.go:443","message":"disruption cpu-pressure cleaned","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal"}
{"level":"debug","ts":1687949875000.5383,"caller":"noop/noop.go:55","message":"NOOP: MetricCleaned true\n","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal"}
{"level":"info","ts":1687949875000.5452,"caller":"injector/main.go:443","message":"disruption cpu-pressure cleaned","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal"}
{"level":"debug","ts":1687949875000.5557,"caller":"noop/noop.go:55","message":"NOOP: MetricCleaned true\n","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal"}
{"level":"info","ts":1687949875056.8525,"caller":"injector/main.go:828","message":"disruption(s) cleaned, now exiting","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal"}
{"level":"info","ts":1687949875056.9182,"caller":"injector/main.go:126","message":"closing metrics sink client before exiting","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"demo-curl-66dd49f77-cwrcd","targetNodeName":"ip-10-72-87-212.us-west-2.compute.internal","sink":"noop"}

User Issue: OnInit Chaos handler loop

Describe the bug
When validating the oninit feature within our environment we see that the following scenario can happen:
The chaos handler container is successfully attached to the pod spec and the handler runs as expected; however, if a SIGUSR1 signal is not received within the configured timeout, the chaos-handler exits unsuccessfully (exit code 1).

Looking at the documentation (https://kubernetes.io/docs/concepts/workloads/pods/init-containers/#detailed-behavior), this means that if your pod has a restart policy of Always or OnFailure, the pod will never successfully start unless a disruption is injected and the handler receives the SIGUSR1 signal.

To Reproduce
Steps to reproduce the behavior:

  1. Pod restart policy is set to Always for a deployment
  2. The deployment pod contains the label chaos.datadoghq.com/disrupt-on-init (see the sketch after this list)
  3. onInit is enabled within the controller
  4. A disruption is not created before the configured timeout elapses.
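
A minimal sketch of a Deployment matching the steps above, assuming only that the chaos.datadoghq.com/disrupt-on-init label goes on the pod template; the "true" value, name, selector, and image are placeholders for illustration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-on-init # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-on-init
  template:
    metadata:
      labels:
        app: demo-on-init
        chaos.datadoghq.com/disrupt-on-init: "true" # opts the pod into the chaos handler init container
    spec:
      restartPolicy: Always # the only valid value for Deployments, which is what triggers the loop described above
      containers:
        - name: app # placeholder container
          image: nginx:1.25 # placeholder image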

Expected behavior
If the timeout has been exceeded and no disruption was created, the pod should start up as normal.

Screenshots
If applicable, add screenshots to help explain your problem.

Environment:

  • Kubernetes version: v1.25.16
  • Controller version: 8.4.1
  • Cloud provider (or local): aws
  • Base OS for Kubernetes: amazon linux

Additional context
The suggested fix would be to change the exit code here:

case <-timer.C:
	logger.Info("timed out, SIGUSR1 was never received, exiting")
	os.Exit(1) // suggested change: exit with code 0 so the init container does not fail the pod

User Request: Add support for tolerations on injector pods

Is your feature request related to a problem? Please describe.
Currently the injector pods that are created do not support having tolerations added to them, making it hard or impossible to have injector pods scheduled on nodes that have taints applied to them.

Describe the solution you'd like
Add the ability for tolerations to be added to the injector pods, defined within values.yaml, much like annotations and labels.
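
A purely illustrative sketch of the requested values.yaml knob, assuming tolerations would sit next to the existing annotation and label settings; the key names below are hypothetical, not existing chart values:

injector:
  annotations: {} # existing-style settings, shown only for context
  labels: {}
  tolerations: # the requested addition
    - key: dedicated
      operator: Equal
      value: chaos
      effect: NoSchedule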

Describe alternatives you've considered
N/A

Additional context
Add any other context or screenshots about the feature request here.

Guidance running CPU pressure experiments

Describe the bug
Hi DataDog Chaos Team! I am just moving this Discussion post to an issue for some more visibility.

I wanted to reach out to see if you can help us better understand how to troubleshoot the issue we are having. We aren't able to see the effects of CPU pressure experiments and would like some guidance. We are running the controller on version 7.13.1.

Our application pod, fault-injection-showcase, consists of two containers: the application container fault-injection-showcase and an istio-proxy container.

To Reproduce
Running a cpu pressure experiment on multiple containers using the following manifest.
apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: cpu-pressure-pod
  namespace: chaos-engineering-framework
spec:
  level: pod
  duration: 5m
  selector:
    app.kubernetes.io/name: "fault-injection-showcase"
    app: "fault-injection-showcase--plugin-master"
  count: 100%
  staticTargeting: true
  cpuPressure:
    count: 100%
I find no evidence that the fault-injection-showcase container is under increased CPU pressure; only the istio-proxy container shows 100% CPU pressure.
The injector pod also reaches a Failed state after the duration. This only happens with the cpu-pressure experiments (not with state or network experiments).

status:
  phase: Failed
  conditions:
    - type: Initialized
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2023-03-23T11:09:06Z'
    - type: Ready
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2023-03-23T11:12:54Z'
      reason: PodFailed
    - type: ContainersReady
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2023-03-23T11:12:54Z'
      reason: PodFailed
    - type: PodScheduled
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2023-03-23T11:09:06Z'
  message: Pod was active on the node longer than the specified deadline
  reason: DeadlineExceeded
  hostIP:
  startTime: '2023-03-23T11:09:06Z'
  containerStatuses:
    - name: injector
      state:
        terminated:
          exitCode: 137
          reason: Error
          startedAt: '2023-03-23T11:0

I'm not sure if this failure is related/similar to this old issue that is now resolved: #491.
However, if I specify the container in the manifest below:

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: cpu-pressure-pod
  namespace: chaos-engineering-framework
spec:
  level: pod
  duration: 5m
  selector:
    app.kubernetes.io/name: "fault-injection-showcase"
    app: "fault-injection-showcase--plugin-master"
  containers:
    - "fault-injection-showcase"
  count: 100%
  staticTargeting: true
  cpuPressure:
    count: 100%
It correctly injects 100% CPU pressure into the fault-injection-showcase container, though the pod still fails.
I also ran one experiment where I specified both containers, istio-proxy and fault-injection-showcase, in the manifest.
I find that it targets all the necessary pods, but in one pod it injects fault-injection-showcase (and not istio-proxy) and in the other pod it injects istio-proxy (and not fault-injection-showcase).

Expected behavior
The metrics should show increased CPU usage on the targeted containers.

Screenshots
istio-proxy being injected (first screenshot) but not fault-injection-showcase (second screenshot)

Environment:

  • Kubernetes version: 1.23
  • Controller version: 7.13.1

If you have any ideas on where to go about solving this, it would be greatly appreciated!

User Issue: CPU disruption not working

Describe the bug
Attempting to run a CPU stress for a targeted pod, but unfortunately no high CPU usage is observed for the targeted pod when calling kubectl top pod <namespace>, nor when viewing the CPU cgroup files in the targeted pod.

To Reproduce
Below are the steps to reproduce this:

  1. Deploy the test app pod to a new namespace
  2. Create the cpu stress Custom Resource (see the sketch after this list)
  3. Controller spins up the injector pod
  4. Injector pod states it has successfully injected the disruption in both the CR and the injector logs (see screenshot section below)
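
A minimal sketch of what the CPU stress Custom Resource in step 2 could look like, modeled on the cpu-pressure manifests shown elsewhere in this document; the selector is a placeholder and the exact field shape may differ on controller 4.0.1:

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: cpu-pressure-xma-test # matches the disruption name appearing in the logs below
  namespace: xma
spec:
  level: pod
  selector:
    app: xma # placeholder label selector, not taken from the report
  count: 1
  cpuPressure: {} # field shape may vary between controller versions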

Expected behavior
High cpu usage is expected to be seen for the targeted pod via kubectl top pod -A

Screenshots/logs

Status for CR

Status:
  Injection Status:  Injected
  Targets:
    xma-78565cf58c-sn882
Events:
  Type    Reason   Age   From                   Message
  ----    ------   ----  ----                   -------
  Normal  Created  55s   disruption-controller  Created disruption injection pod for "cpu-pressure-xma-test"

Injector logs

{"level":"info","ts":1623835317808.8613,"caller":"injector/main.go:154","message":"injector targeting container","disruptionName":"cpu-pressure-xma-test","disruptionNamespace":"xma","targetName":"xma-78565cf58c-sn882","containerID":"docker://1c09ce702adbe374fd314ed609662d4a4ad19276f7f0cd89c8e95b1df1d8261f","container name":"/k8s_mystiqueapp_xma-78565cf58c-sn882_xma_44c505dd-154a-4ea9-b356-197378a262be_0"}
NOOP: MetricInjected true
{"level":"info","ts":1623835317809.961,"caller":"injector/main.go:228","message":"injecting the disruption","disruptionName":"cpu-pressure-xma-test","disruptionNamespace":"xma","targetName":"xma-78565cf58c-sn882"}
{"level":"info","ts":1623835317809.9905,"caller":"injector/cpu_pressure.go:62","message":"joining target CPU cgroup","disruptionName":"cpu-pressure-xma-test","disruptionNamespace":"xma","targetName":"xma-78565cf58c-sn882"}
{"level":"info","ts":1623835317810.0571,"caller":"injector/cpu_pressure.go:69","message":"highering current process priority","disruptionName":"cpu-pressure-xma-test","disruptionNamespace":"xma","targetName":"xma-78565cf58c-sn882"}
{"level":"info","ts":1623835317810.0723,"caller":"injector/cpu_pressure.go:77","message":"initializing load generator routines","disruptionName":"cpu-pressure-xma-test","disruptionNamespace":"xma","targetName":"xma-78565cf58c-sn882","routines":4}
{"level":"info","ts":1623835317810.1936,"caller":"injector/main.go:243","message":"disruption injected, now waiting for an exit signal","disruptionName":"cpu-pressure-xma-test","disruptionNamespace":"xma","targetName":"xma-78565cf58c-sn882"}

Target pod cpu cgroups files

user@xma-78565cf58c-sn882:/sys/fs/cgroup/cpu$ cat cpuacct.usage_user cpuacct.usage_sys
8006554114
0
user@xma-78565cf58c-sn882:/sys/fs/cgroup/cpu$ cat cpuacct.usage_user
8014955389
user@xma-78565cf58c-sn882:/sys/fs/cgroup/cpu$ cat cpuacct.usage_sys
0
user@xma-78565cf58c-sn882:/sys/fs/cgroup/cpu$ cat cpuacct.usage_percpu
1833014320 2026983842 2057257605 2104251594
user@xma-78565cf58c-sn882:/sys/fs/cgroup/cpu$ cat cpuacct.usage_all
cpu user system
0 1834258358 0
1 2027148888 0
2 2057677520 0
3 2105304304 0
user@xma-78565cf58c-sn882:/sys/fs/cgroup/cpu$ cat cpuacct.usage
8028460937
user@xma-78565cf58c-sn882:/sys/fs/cgroup/cpu$ cat cpuacct.stat
user 679
system 229
user@xma-78565cf58c-sn882:/sys/fs/cgroup/cpu$ cat cpu.stat
nr_periods 13571
nr_throttled 2
throttled_time 40545836
user@xma-78565cf58c-sn882:/sys/fs/cgroup/cpu$ cat cpu.shares
512
user@xma-78565cf58c-sn882:/sys/fs/cgroup/cpu$ cat cgroup.procs
1
49
77

kubectl top pod

NAMESPACE              NAME                                         CPU(cores)   MEMORY(bytes)
--
chaos-engineering      chaos-cpu-pressure-xma-test-v6cnb            7m           19Mi
chaos-engineering      datadog-chaos-controller-8667c45b9b-hzv6s    1m           20Mi
xma                    xma-78565cf58c-sn882                         1m           47Mi

injector pod cgroup mnt

root@chaos-cpu-pressure-xma-test-v6cnb:/mnt# ls -ltrash
--
total 0
0 drwxr-xr-x 13 root root 340 Jun 11 15:30 cgroup
0 -rw-r--r--  1 root root   0 Jun 11 15:30 sysrq
0 --w-------  1 root root   0 Jun 11 15:30 sysrq-trigger
0 dr-xr-xr-x 19 root root 269 Jun 14 15:08 host
0 drwxr-xr-x  1 root root  28 Jun 16 13:54 ..
0 drwxr-xr-x  1 root root  66 Jun 16 13:54 .
root@chaos-cpu-pressure-xma-test-v6cnb:/mnt# pwd
/mnt
root@chaos-cpu-pressure-xma-test-v6cnb:/mnt# cd cgroup/
root@chaos-cpu-pressure-xma-test-v6cnb:/mnt/cgroup# ls -ltrash
total 0
0 dr-xr-xr-x  5 root root   0 Jun 11 15:30 systemd
0 dr-xr-xr-x  5 root root   0 Jun 11 15:30 pids
0 dr-xr-xr-x  3 root root   0 Jun 11 15:30 hugetlb
0 dr-xr-xr-x  3 root root   0 Jun 11 15:30 freezer
0 dr-xr-xr-x  5 root root   0 Jun 11 15:30 devices
0 dr-xr-xr-x  5 root root   0 Jun 11 15:30 blkio
0 dr-xr-xr-x  3 root root   0 Jun 11 15:30 perf_event
0 lrwxrwxrwx  1 root root  16 Jun 11 15:30 net_prio -> net_cls,net_prio
0 dr-xr-xr-x  3 root root   0 Jun 11 15:30 net_cls,net_prio
0 lrwxrwxrwx  1 root root  16 Jun 11 15:30 net_cls -> net_cls,net_prio
0 dr-xr-xr-x  5 root root   0 Jun 11 15:30 memory
0 dr-xr-xr-x  3 root root   0 Jun 11 15:30 cpuset
0 lrwxrwxrwx  1 root root  11 Jun 11 15:30 cpuacct -> cpu,cpuacct
0 dr-xr-xr-x  5 root root   0 Jun 11 15:30 cpu,cpuacct
0 lrwxrwxrwx  1 root root  11 Jun 11 15:30 cpu -> cpu,cpuacct
0 drwxr-xr-x 13 root root 340 Jun 11 15:30 .
0 drwxr-xr-x  1 root root  66 Jun 16 13:54 ..
root@chaos-cpu-pressure-xma-test-v6cnb:/mnt/cgroup# pwd
/mnt/cgroup

Injector pod process list
The java -server -Xms1G -Xmx1G process below is the target pod's app process.

root@chaos-cpu-pressure-xma-test-v6cnb:/mnt/cgroup/cpu# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Jun11 ?        00:03:16 /usr/lib/systemd/systemd --switched-root --system --deserialize 21
root         2     0  0 Jun11 ?        00:00:00 [kthreadd]
root         3     2  0 Jun11 ?        00:00:00 [rcu_gp]
root         4     2  0 Jun11 ?        00:00:00 [rcu_par_gp]
root         6     2  0 Jun11 ?        00:00:00 [kworker/0:0H-kblockd]
root         8     2  0 Jun11 ?        00:00:00 [mm_percpu_wq]
root         9     2  0 Jun11 ?        00:00:04 [ksoftirqd/0]
root        10     2  0 Jun11 ?        00:02:23 [rcu_sched]
root        11     2  0 Jun11 ?        00:00:01 [migration/0]
root        13     2  0 Jun11 ?        00:00:00 [cpuhp/0]
root        14     2  0 Jun11 ?        00:00:00 [cpuhp/1]
root        15     2  0 Jun11 ?        00:00:01 [migration/1]
root        16     2  0 Jun11 ?        00:00:03 [ksoftirqd/1]
root        18     2  0 Jun11 ?        00:00:00 [kworker/1:0H-kblockd]
root        19     2  0 Jun11 ?        00:00:00 [cpuhp/2]
root        20     2  0 Jun11 ?        00:00:01 [migration/2]
root        21     2  0 Jun11 ?        00:00:03 [ksoftirqd/2]
root        23     2  0 Jun11 ?        00:00:00 [kworker/2:0H-xfs-log/nvme0n1p1]
root        24     2  0 Jun11 ?        00:00:00 [cpuhp/3]
root        25     2  0 Jun11 ?        00:00:01 [migration/3]
root        26     2  0 Jun11 ?        00:00:03 [ksoftirqd/3]
root        28     2  0 Jun11 ?        00:00:00 [kworker/3:0H-xfs-log/nvme0n1p1]
root        30     2  0 Jun11 ?        00:00:00 [kdevtmpfs]
root        31     2  0 Jun11 ?        00:00:00 [netns]
root        35     2  0 Jun11 ?        00:00:00 [kauditd]
root       209     2  0 Jun11 ?        00:00:00 [khungtaskd]
root       224     2  0 Jun11 ?        00:00:00 [oom_reaper]
root       225     2  0 Jun11 ?        00:00:00 [writeback]
root       227     2  0 Jun11 ?        00:00:00 [kcompactd0]
root       228     2  0 Jun11 ?        00:00:00 [ksmd]
root       229     2  0 Jun11 ?        00:00:01 [khugepaged]
root       285     2  0 Jun11 ?        00:00:00 [kintegrityd]
root       286     2  0 Jun11 ?        00:00:00 [kblockd]
root       288     2  0 Jun11 ?        00:00:00 [blkcg_punt_bio]
root       398     2  0 Jun11 ?        00:00:00 [tpm_dev_wq]
root       403     2  0 Jun11 ?        00:00:00 [md]
root       407     2  0 Jun11 ?        00:00:00 [edac-poller]
root       412     2  0 Jun11 ?        00:00:00 [watchdogd]
root       549     2  0 Jun11 ?        00:00:00 [kswapd0]
root       667     2  0 Jun11 ?        00:00:00 [kthrotld]
root       729     2  0 Jun11 ?        00:00:00 [kstrp]
root       755     2  0 Jun11 ?        00:00:00 [ipv6_addrconf]
root      1205     2  0 Jun11 ?        00:00:00 [nvme-wq]
root      1206     2  0 Jun11 ?        00:00:00 [nvme-reset-wq]
root      1209     2  0 Jun11 ?        00:00:00 [nvme-delete-wq]
root      1257     2  0 Jun11 ?        00:00:00 [xfsalloc]
root      1258     2  0 Jun11 ?        00:00:00 [xfs_mru_cache]
root      1263     2  0 Jun11 ?        00:00:00 [xfs-buf/nvme0n1]
root      1264     2  0 Jun11 ?        00:00:00 [xfs-conv/nvme0n]
root      1265     2  0 Jun11 ?        00:00:00 [xfs-cil/nvme0n1]
root      1266     2  0 Jun11 ?        00:00:00 [xfs-reclaim/nvm]
root      1267     2  0 Jun11 ?        00:00:00 [xfs-eofblocks/n]
root      1268     2  0 Jun11 ?        00:00:00 [xfs-log/nvme0n1]
root      1269     2  0 Jun11 ?        00:01:19 [xfsaild/nvme0n1]
root      1270     2  0 Jun11 ?        00:00:01 [kworker/3:1H-kblockd]
root      1332     1  0 Jun11 ?        00:00:41 /usr/lib/systemd/systemd-journald
root      1537     1  0 Jun11 ?        00:00:00 /usr/sbin/lvmetad -f
root      1582     1  0 Jun11 ?        00:00:01 /usr/lib/systemd/systemd-udevd
root      2207     2  0 Jun11 ?        00:00:00 [kworker/2:1H-kblockd]
root      2208     2  0 Jun11 ?        00:00:00 [ena]
root      2397     2  0 Jun11 ?        00:00:00 [cryptd]
root      2627     2  0 Jun11 ?        00:00:00 [rpciod]
root      2628     2  0 Jun11 ?        00:00:00 [kworker/u9:0]
root      2629     2  0 Jun11 ?        00:00:00 [xprtiod]
root      2639     1  0 Jun11 ?        00:00:00 /sbin/auditd
root      2710     1  0 Jun11 ?        00:00:11 /usr/lib/systemd/systemd-logind
81        2786     1  0 Jun11 ?        00:00:23 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
32        2825     1  0 Jun11 ?        00:00:00 /sbin/rpcbind -w
root      2869     1  0 Jun11 ?        00:00:07 /usr/sbin/irqbalance --foreground
998       2890     1  0 Jun11 ?        00:00:11 /sbin/rngd -f --fill-watermark=0 --exclude=jitter
root      2958  3485  0 09:56 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/9427212baddee46b87443b939446198ae7bf412aa35ccd5d25cb61b44e133c43 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
999       2986     1  0 Jun11 ?        00:00:04 /usr/sbin/chronyd
root      3004  2958  0 09:56 ?        00:00:00 /pause
root      3008     1  0 Jun11 ?        00:00:00 /usr/sbin/gssproxy -D
root      3132  3485  0 09:56 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/9228fed179b3ab3f531dcd6471ef49025b3c5f8ad3074bbdd5a21cac4c5700c8 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
608       3165  3132  0 09:56 ?        00:00:07 java -server -Xms1G -Xmx1G -XX:InitialCodeCacheSize=50m -XX:ReservedCodeCacheSize=50m -XX:MetaspaceSize=20m -XX:MaxMetaspaceSize=50m -XX:MinMetaspaceFreeRatio=0 -XX:MaxMetaspaceFreeRatio=100 -XX:CompressedClassSpaceSize=20m -XX:+UsePar
root      3289     1  0 Jun11 ?        00:00:00 /sbin/dhclient -q -lf /var/lib/dhclient/dhclient--eth0.lease -pf /var/run/dhclient-eth0.pid eth0
root      3381     1  0 Jun11 ?        00:00:00 /sbin/dhclient -6 -nw -lf /var/lib/dhclient/dhclient6--eth0.lease -pf /var/run/dhclient6-eth0.pid eth0
root      3485     1  0 Jun11 ?        00:20:49 /usr/bin/containerd
root      3563     2  0 Jun11 ?        00:00:00 [kworker/1:1H-kblockd]
root      3680     1  0 Jun11 ?        00:00:00 /usr/libexec/postfix/master -w
89        3695  3680  0 Jun11 ?        00:00:00 qmgr -l -t unix -u
root      3708     2  0 Jun11 ?        00:00:00 [kworker/0:1H-kblockd]
root      3774     1  0 Jun11 ?        00:00:33 /usr/sbin/rsyslogd -n
root      3834     1  0 Jun11 tty1     00:00:00 /sbin/agetty --noclear tty1 linux
root      3837     1  0 Jun11 ?        00:00:00 /usr/sbin/crond -n
root      3840     1  0 Jun11 ttyS0    00:00:00 /sbin/agetty --keep-baud 115200,38400,9600 ttyS0 vt220
root      4132     1  0 Jun11 ?        00:00:00 /usr/sbin/sshd -D
root      4231     1  1 Jun11 ?        01:17:26 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
root      4492     2  0 Jun11 ?        00:00:01 bpfilter_umh
root      4604     1  2 Jun11 ?        03:29:12 /usr/bin/kubelet --cloud-provider aws --config /etc/kubernetes/kubelet/kubelet-config.json --kubeconfig /var/lib/kubelet/kubeconfig --container-runtime docker --network-plugin cni --node-ip=10.1.1.138 --pod-infra-container-image=602401
root      5068  3485  0 Jun11 ?        00:00:10 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/e60eb8ae173960808b9f4bdc0611f68684b3f517bebbbb1c2f3a820619d217a5 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root      5070  3485  0 Jun11 ?        00:00:10 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/a8aa28122f8e156267d686002fce5adedbb2b1f62438b781a2f93193c3f5e6b1 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root      5103  5068  0 Jun11 ?        00:00:00 /pause
root      5142  5070  0 Jun11 ?        00:00:00 /pause
root      5372  3485  0 Jun11 ?        00:00:09 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/1ed215f2ada061816af66aa7c094b343296b4c7ddf0773d0ada06b3484b0a62b -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root      5405  5372  0 Jun11 ?        00:02:08 kube-proxy --v=2 --config=/var/lib/kube-proxy-config/config
root      7146  3485  0 Jun11 ?        00:04:57 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/bbeec24043921227fb919bc7f1ff3cbd4ac122c20d093742dd55bdc07360813e -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root      7163  7146  0 Jun11 ?        00:00:00 bash /app/entrypoint.sh
root      7214  7163  0 Jun11 ?        00:04:51 ./aws-k8s-agent
root      7215  7163  0 Jun11 ?        00:00:00 tee -i aws-k8s-agent.log
root      7948  3485  0 Jun15 ?        00:00:01 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/554c8290a037a8ba680f2b18fcb89a18e9bd0ac2cd6f66ce51cbe1f565a89422 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root      7985  7948  0 Jun15 ?        00:00:00 /pause
root      8303  3485  0 Jun15 ?        00:00:01 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/a3e38431fd13d4a07f0f7898adfee7bc2f416b03cfde1160fe1e6b1d5dafd969 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
1000      8337  8303  0 Jun15 ?        00:00:00 ruby details.rb 9080
root      8369  3485  0 Jun15 ?        00:00:01 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/e50c740e4980ff06bb99fc6c8f518684a932316d6122f4ebe943f00c92acc333 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
1337      8386  8369  0 Jun15 ?        00:00:29 /usr/local/bin/pilot-agent proxy sidecar --domain default.svc.cluster.local --serviceCluster details.default --proxyLogLevel=warning --proxyComponentLogLevel=misc:error --log_output_level=default:info --concurrency 2
1337      8450  8386  0 Jun15 ?        00:01:47 /usr/local/bin/envoy -c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --drain-strategy immediate --parent-shutdown-time-s 60 --service-cluster details.default --service-node sidecar~10.1.13.182~details-v1-79f774bd
root     11412     2  0 14:05 ?        00:00:00 [kworker/u8:1-events_unbound]
root     15722     2  0 13:53 ?        00:00:00 [kworker/2:2-events]
root     15734     2  0 13:53 ?        00:00:00 [kworker/u8:0-events_unbound]
root     15736  3485  0 13:53 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/c69f4d3376fe9d2969bd517fb46c73b7a7bce9947c9cf2ee0047857e0649e4e6 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     15773 15736  0 13:53 ?        00:00:00 /pause
root     15794     2  0 13:53 ?        00:00:00 [kworker/u8:3-events_unbound]
root     16029  3485  0 13:54 ?        00:00:03 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/42a6ef904ff7eb2e887a3192025a838170087de3e20b8a4f0971597fd8df71cd -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     16069 16029 99 13:54 ?        00:31:35 /usr/local/bin/injector cpu-pressure --metrics-sink noop --level pod --containers-id docker://1c09ce702adbe374fd314ed609662d4a4ad19276f7f0cd89c8e95b1df1d8261f --log-context-disruption-name cpu-pressure-xma-test --log-context-disruption
root     16488     2  0 14:07 ?        00:00:00 [kworker/3:0-events]
root     16520     2  0 12:34 ?        00:00:00 [kworker/0:9-rcu_gp]
root     16611     2  0 13:54 ?        00:00:00 [kworker/3:1-events]
root     16911     2  0 13:54 ?        00:00:00 [kworker/1:2-rcu_gp]
root     16915     2  0 13:54 ?        00:00:00 [kworker/1:6-events]
root     17589  3485  0 12:35 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/91000e1cdf9e53639bb0a6b5231bc7349affac1220ee7daf8296f7f5626e40e2 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     17629 17589  0 12:35 ?        00:00:00 /pause
root     17708  3485  0 12:35 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/0bd7f32ba1caa029fc2effb2730cb6774c963424dd368388a3c66e1ecf778e20 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     17726 17708  0 12:35 ?        00:00:00 ./kube-rbac-proxy --secure-listen-address=0.0.0.0:8443 --upstream=http://127.0.0.1:8080/ --logtostderr=true --v=10
root     17774  3485  0 12:35 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/f8a75fefbb836783bd4380577c62072f529eb6098bbe949e75d82ab6618f9f2d -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
65532    17791 17774  0 12:35 ?        00:00:05 /usr/local/bin/manager --metrics-addr=127.0.0.1:8080 --enable-leader-election --metrics-sink=noop --injector-image=hub-docker-remote.aHost/datadog/chaos-injector:4.0.1 --image-pull-secrets=aHost --admission-
root     18103 16029  0 13:54 pts/0    00:00:00 /bin/bash
root     19895  3485  0 Jun14 ?        00:00:03 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/626f017d72407408ae6ecf0a20665abda03a8f55839860c590fb863cc984c8d1 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     19975 19895  0 Jun14 ?        00:00:00 /pause
89       20009  3680  0 13:55 ?        00:00:00 pickup -l -t unix -u
root     20015  3485  0 Jun14 ?        00:00:03 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/22f98a46806821da63aae44e16af3bb7cf208da97aae6ed4a8b3175df2ca4c60 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     20060 20015  0 Jun14 ?        00:00:00 /pause
root     20443  3485  0 Jun14 ?        00:00:03 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/12fdc488bc76fde517eb34760b97024415f863a656d5a1876869e4a4a269cac7 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     20452  3485  0 Jun14 ?        00:00:03 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/5105a46a1bde8419e7adfb6472e91dacbcb3662bed5508221cf833cf827d66ce -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     20489 20443  0 Jun14 ?        00:00:00 /bin/sh -c node server.js
root     20504 20452  0 Jun14 ?        00:00:00 /bin/sh -c node server.js
root     20559 20489  0 Jun14 ?        00:00:00 node server.js
root     20560 20504  0 Jun14 ?        00:00:00 node server.js
root     20572  3485  0 Jun14 ?        00:00:03 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/24564c8389f0dc3fdd2169ac09d1a810384f04a1d854f3069ffcdc3f8b27f353 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     20586  3485  0 Jun14 ?        00:00:03 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/9622c9c327315a3200564faa2ae46d85f08a7fb9244674b5db301f74566edaba -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
1337     20627 20572  0 Jun14 ?        00:00:59 /usr/local/bin/pilot-agent proxy sidecar --domain chaos.svc.cluster.local --serviceCluster echoserver2.chaos --proxyLogLevel=warning --proxyComponentLogLevel=misc:error --log_output_level=default:info --concurrency 2
1337     20646 20586  0 Jun14 ?        00:01:03 /usr/local/bin/pilot-agent proxy sidecar --domain chaos.svc.cluster.local --serviceCluster echoserver2.chaos --proxyLogLevel=warning --proxyComponentLogLevel=misc:error --log_output_level=default:info --concurrency 2
1337     20717 20627  0 Jun14 ?        00:03:48 /usr/local/bin/envoy -c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --drain-strategy immediate --parent-shutdown-time-s 60 --service-cluster echoserver2.chaos --service-node sidecar~10.1.21.252~echoserver2-65444
1337     20718 20646  0 Jun14 ?        00:03:54 /usr/local/bin/envoy -c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --drain-strategy immediate --parent-shutdown-time-s 60 --service-cluster echoserver2.chaos --service-node sidecar~10.1.31.110~echoserver2-65444
root     21153  7146  0 14:09 ?        00:00:00 [runc] <defunct>
root     21162 16029  0 14:09 ?        00:00:00 [containerd] <defunct>
root     21165  7146  0 14:09 ?        00:00:00 /app/grpc-health-probe -addr=:50051
root     21175 18103  0 14:09 pts/0    00:00:00 ps -ef
root     24400  3485  0 Jun14 ?        00:00:03 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/e2a7086f4b23bb9d440f45afd6ad21304740ab58c107f67d1174ea4ea61bbc5d -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     24505 24400  0 Jun14 ?        00:00:00 /pause
root     24584  3485  0 Jun14 ?        00:00:03 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/1db24e0e57dc846c71b334990d4ab4279e3d3c45ebe9d6b5062b5cb54811c6ec -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     24623 24584  0 Jun14 ?        00:00:00 /pause
root     24993  3485  0 Jun14 ?        00:00:03 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/74121f0e5766d8028a9a85b292800e8c2eb299196135e556cd00bc65127b0926 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     25000  3485  0 Jun14 ?        00:00:03 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/fd1da6a360defe0d9a95d782fb1aed7ac3f160de234d2686e0119c232e78395b -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     25035 25000  0 Jun14 ?        00:00:00 /bin/sh -c node server.js
root     25067 24993  0 Jun14 ?        00:00:00 /bin/sh -c node server.js
root     25110 25035  0 Jun14 ?        00:00:00 node server.js
root     25111 25067  0 Jun14 ?        00:00:00 node server.js
root     25124  3485  0 Jun14 ?        00:00:03 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/410176dbb54a396cbded49967d764793cd10853241c68f24180cf8c7b097186a -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     25131  3485  0 Jun14 ?        00:00:03 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/6db71eee65831fc3ed96816fd4edfda8f8ad3404046ebb657afed7d389b23e63 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
1337     25157 25131  0 Jun14 ?        00:01:02 /usr/local/bin/pilot-agent proxy sidecar --domain chaos.svc.cluster.local --serviceCluster echoserver2.chaos --proxyLogLevel=warning --proxyComponentLogLevel=misc:error --log_output_level=default:info --concurrency 2
1337     25158 25124  0 Jun14 ?        00:01:01 /usr/local/bin/pilot-agent proxy sidecar --domain chaos.svc.cluster.local --serviceCluster echoserver2.chaos --proxyLogLevel=warning --proxyComponentLogLevel=misc:error --log_output_level=default:info --concurrency 2
1337     25267 25157  0 Jun14 ?        00:03:54 /usr/local/bin/envoy -c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --drain-strategy immediate --parent-shutdown-time-s 60 --service-cluster echoserver2.chaos --service-node sidecar~10.1.28.236~echoserver2-65444
1337     25273 25158  0 Jun14 ?        00:03:49 /usr/local/bin/envoy -c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --drain-strategy immediate --parent-shutdown-time-s 60 --service-cluster echoserver2.chaos --service-node sidecar~10.1.0.234~echoserver2-654447
root     26023  3485  0 09:19 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/3475e811e68e132d78afc70841b3322d0ae44369faa794a6593434bac9e9c6c0 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     26063 26023  0 09:19 ?        00:00:00 /pause
root     26144  3485  0 09:19 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/1c09ce702adbe374fd314ed609662d4a4ad19276f7f0cd89c8e95b1df1d8261f -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
608      26161 26144  0 09:19 ?        00:00:07 java -server -Xms1G -Xmx1G -XX:InitialCodeCacheSize=50m -XX:ReservedCodeCacheSize=50m -XX:MetaspaceSize=20m -XX:MaxMetaspaceSize=50m -XX:MinMetaspaceFreeRatio=0 -XX:MaxMetaspaceFreeRatio=100 -XX:CompressedClassSpaceSize=20m -XX:+UsePar
root     26317  3485  0 12:07 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/352e7e5d0f024c005c7858989401a2fda389e63d6814380363b9ef62affd0710 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
nobody   26353 26317  0 12:07 ?        00:00:00 /pause
root     26436  3485  0 12:07 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/fcf44b1b5813949b96277bbc57a093a02d12436cc3c7f2af5528058686ee4cb7 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
nobody   26453 26436  0 12:07 ?        00:00:00 /configmap-reload --volume-dir=/etc/config --webhook-url=http://127.0.0.1:9090/-/reload
root     26500  3485  0 12:07 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/0924234f72dadf748eb1f407bf9c9a87d3519b3929a42e092d67551588544b80 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
nobody   26518 26500  1 12:07 ?        00:02:21 /bin/prometheus --storage.tsdb.retention.time=15d --config.file=/etc/config/prometheus.yml --storage.tsdb.path=/data --web.console.libraries=/etc/prometheus/console_libraries --web.console.templates=/etc/prometheus/consoles --web.enabl
root     27012     2  0 13:58 ?        00:00:00 [kworker/2:3-events]
root     27161  3485  0 13:58 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/a90ddb582b3ce80299a50162d6b2f9abb45500818205857529f005e6bf1fd829 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     27186  3485  0 13:58 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/5fdfe25106d8fea36f609e27c1635597ca5f3271a6e98ed16b316a431cd00e07 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     27206 27186  0 13:58 ?        00:00:00 /pause
root     27315 27161  0 13:58 ?        00:00:00 /pause
root     27339  3485  0 13:58 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/c46dedb5310ecd3535ea7199f2fdd4782db6c2d623ef77c284f0af752f4975d2 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     27390 27339  0 13:58 ?        00:00:00 /pause
root     27987  3485  0 13:58 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/68cf1800d684063cf330e43481b7c4e7415f6e8db1e35c7808e57c02dc9e9290 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     28000  3485  0 13:58 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/f8f580a2c94f47b7732b2e31378cf4e39c292309fa0096000f3723a849e63271 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     28001  3485  0 13:58 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/e9bffcae49cfde6607595e9bcc36389cbf17e10437eac92162bad5a05436c61d -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     28054 27987  0 13:58 ?        00:00:00 /bin/sh -c node server.js
root     28087 28001  0 13:58 ?        00:00:00 /bin/sh -c node server.js
root     28118 28000  0 13:58 ?        00:00:00 /bin/sh -c node server.js
root     28155 28087  0 13:58 ?        00:00:00 node server.js
root     28166 28118  0 13:58 ?        00:00:00 node server.js
root     28172 28054  0 13:58 ?        00:00:00 node server.js
root     28187  3485  0 13:58 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/27c6e7783ee0154029dafb4f99f75d00e008c6fdaeaaac1f59046f95d30734e8 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     28192  3485  0 13:58 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/041259c0999e2695691e68ad9e435b5e2f78bf6d92b6c2b16d92868bd501bdb1 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     28201  3485  0 13:58 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/e9193bae033aaa676a6b402d06c6649968d41c1050cb02fd9e087dd211688e82 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
1337     28268 28187  0 13:58 ?        00:00:00 /usr/local/bin/pilot-agent proxy sidecar --domain chaos.svc.cluster.local --serviceCluster echoserver.chaos --proxyLogLevel=warning --proxyComponentLogLevel=misc:error --log_output_level=default:info --concurrency 2
1337     28306 28192  0 13:58 ?        00:00:00 /usr/local/bin/pilot-agent proxy sidecar --domain chaos.svc.cluster.local --serviceCluster echoserver.chaos --proxyLogLevel=warning --proxyComponentLogLevel=misc:error --log_output_level=default:info --concurrency 2
1337     28330 28201  0 13:58 ?        00:00:00 /usr/local/bin/pilot-agent proxy sidecar --domain chaos.svc.cluster.local --serviceCluster echoserver.chaos --proxyLogLevel=warning --proxyComponentLogLevel=misc:error --log_output_level=default:info --concurrency 2
1337     28429 28306  0 13:58 ?        00:00:01 /usr/local/bin/envoy -c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --drain-strategy immediate --parent-shutdown-time-s 60 --service-cluster echoserver.chaos --service-node sidecar~10.1.23.46~echoserver-5bd6cdd4
1337     28432 28268  0 13:58 ?        00:00:01 /usr/local/bin/envoy -c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --drain-strategy immediate --parent-shutdown-time-s 60 --service-cluster echoserver.chaos --service-node sidecar~10.1.12.22~echoserver-5bd6cdd4
1337     28437 28330  0 13:58 ?        00:00:01 /usr/local/bin/envoy -c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --drain-strategy immediate --parent-shutdown-time-s 60 --service-cluster echoserver.chaos --service-node sidecar~10.1.10.218~echoserver-5bd6cdd
root     28827  3485  0 13:58 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/fd01499e7923c8d3cecfe09b3a56971dfe3273e432099e310ef3ee8ecbb03342 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     28879 28827  0 13:58 ?        00:00:00 /pause
root     29048     2  0 13:58 ?        00:00:00 [kworker/0:3-events]
root     29138  3485  0 13:58 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/069d95212a440fb6dda0166378706a5264954600c87b6cb51f7d39d839b85da8 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     29179 29138  0 13:58 ?        00:00:00 /pause
root     29372  3485  0 13:58 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/7272cbc820d2aad245a4b49fd1968a008d3aef78b2ea4fa17b203b7d1f270301 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     29415 29372  0 13:58 ?        00:00:00 /pause
root     29575  3485  0 13:58 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/5d8980c1a54ed7eac743eef36861550f46a25d79b5f3ea1d8a75a2cc6f2faf2c -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     29602 29575  0 13:58 ?        00:00:00 /bin/sh -c node server.js
root     29670 29602  0 13:58 ?        00:00:00 node server.js
root     29722  3485  0 13:58 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/fc9d941c1f60ba789ab369fd6a2a771da602423348e1959d9a4939da11399819 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
1337     29783 29722  0 13:58 ?        00:00:00 /usr/local/bin/pilot-agent proxy sidecar --domain chaos.svc.cluster.local --serviceCluster echoserver.chaos --proxyLogLevel=warning --proxyComponentLogLevel=misc:error --log_output_level=default:info --concurrency 2
1337     29879 29783  0 13:58 ?        00:00:01 /usr/local/bin/envoy -c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --drain-strategy immediate --parent-shutdown-time-s 60 --service-cluster echoserver.chaos --service-node sidecar~10.1.8.117~echoserver-5bd6cdd4
root     29944  3485  0 13:59 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/fed8f00bda44549742242e49d50c56eec630c07092f73ae231e133060df8dbd5 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     29980 29944  0 13:59 ?        00:00:00 /bin/sh -c node server.js
root     29999  3485  0 13:59 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/0c8a5e9f16fc6eb21cec37549aed1aaad5b9f3bd8ce74aaf3da68e209480fb8d -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
root     30021 29980  0 13:59 ?        00:00:00 node server.js
root     30047 29999  0 13:59 ?        00:00:00 /bin/sh -c node server.js
root     30063  3485  0 13:59 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/5cd9d3be33fa9761738d0f6a25aef8eb32ebbc3e4b129f768218b2610fcd5a6f -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
1337     30109 30063  0 13:59 ?        00:00:00 /usr/local/bin/pilot-agent proxy sidecar --domain chaos.svc.cluster.local --serviceCluster echoserver.chaos --proxyLogLevel=warning --proxyComponentLogLevel=misc:error --log_output_level=default:info --concurrency 2
root     30112 30047  0 13:59 ?        00:00:00 node server.js
root     30158  3485  0 13:59 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/a94b56d18a5a9775c06a1fce0661f16369d18c92302f75d66e45fe9479b18e85 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/cont
1337     30211 30158  0 13:59 ?        00:00:00 /usr/local/bin/pilot-agent proxy sidecar --domain chaos.svc.cluster.local --serviceCluster echoserver.chaos --proxyLogLevel=warning --proxyComponentLogLevel=misc:error --log_output_level=default:info --concurrency 2
1337     30238 30109  0 13:59 ?        00:00:01 /usr/local/bin/envoy -c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --drain-strategy immediate --parent-shutdown-time-s 60 --service-cluster echoserver.chaos --service-node sidecar~10.1.18.83~echoserver-5bd6cdd4
1337     30264 30211  0 13:59 ?        00:00:01 /usr/local/bin/envoy -c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --drain-strategy immediate --parent-shutdown-time-s 60 --service-cluster echoserver.chaos --service-node sidecar~10.1.17.186~echoserver-5bd6cdd
root     30330     2  0 13:59 ?        00:00:00 [kworker/3:7-xfs-buf/nvme0n1p1]

Injector pod cgroup.procs

cat cgroup.procs
--
1
2
3
4
6
8
9
10
11
13
14
15
16
18
19
20
21
23
24
25
26
28
30
31
35
209
224
225
227
228
229
285
286
288
398
403
407
412
549
667
729
755
1205
1206
1209
1257
1258
1263
1264
1265
1266
1267
1268
1269
1270
2207
2208
2397
2627
2628
2629
3563
3708
4492
11412
15722
15734
15794
16520
16611
16911
16915
27012
29048
30330

Environment:

  • Kubernetes version: EKS 1.19
  • Controller version: 4.0.1
  • Cloud provider (or local): AWS
  • Base OS for Kubernetes: docker://19.3.13 ?

Additional context

User Issue: Security vulnerabilities flagged in Docker images with Go 1.16

Describe the bug
Hi 👋

We have container scanning in place and it is flagging certain security vulnerabilities that apply when Go 1.16 is used. The images we scan are the ones from DockerHub.

I was checking the codebase and my assumption was that Go 1.18 is used to build the binaries and in the Docker images (e.g. here and here).

I rebuilt the binaries and images with a GitHub Actions workflow that uses Go 1.18 and the scanner stopped complaining.

Do you know if there is anything in the release workflow that sets the Go version to 1.16?

To Reproduce
I am not sure how to reproduce that locally as it seems to have to do with the release workflow.

Expected behavior
Go 1.18 is used as the runtime.

Screenshots

User Issue: :latest tag did not work for datadog/chaos-injector, but :4.2.1 did

Describe the bug
While trying to use the container_failure_all_forced.yaml disruption, the pod created to inject the disruption failed to pull the image from Docker Hub. Adding the tag :4.2.1 to image: datadog/chaos-injector in the install.yml file resolved this.
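
For illustration, the workaround amounts to pinning an explicit injector image tag in the install manifest instead of relying on :latest (the exact location of this line inside the file may differ):

# pin an explicit release tag instead of relying on the :latest default
image: datadog/chaos-injector:4.2.1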

To Reproduce
Steps to reproduce the behavior:

  1. Do kubectl apply -f https://raw.githubusercontent.com/DataDog/chaos-controller/main/chart/install.yaml on the master node of a cluster.
  2. Download the disruption and apply with kubectl apply -f container_failure_all_forced.yaml
  3. Do kubectl get pods --all-namespaces or kubectl get pods -n chaos-engineering to view the disruption pod.
  4. See the error: Disruption pod failed to pull the image, with a status of ErrImagePull, then ImagePullBackOff

Expected behavior
It should have created a running pod with a working container to inject the disruption into the target

Screenshots
Events list from kubectl describe pod <DISRUPTOR-PODNAME> -n chaos-engineering

Environment:

  • Kubernetes version: v1.21.3+k3s1
  • Controller version:
  • Cloud provider (or local): AWS
  • Base OS for Kubernetes: Amazon Linux 2

Additional context
Performed on a K3s cluster made up of Amazon EC2 instances, with one master and two worker nodes. The error was encountered while running commands on master.

User Request: Feature flag to disable deletion of Disruption

Is your feature request related to a problem? Please describe.
Our team is working on integrating the chaos controller with our CI/CD platform.

The setup we have is as follows:

  • Kubefed acts as the control plane and propagates any changes to worker clusters.
  • Worker clusters are the ones where workloads get deployed. These are the clusters where we run Disruptions.
  • The Spinnaker pipeline deploys Disruptions to the worker clusters through Kubefed.

Right now the deletion of Disruptions is done by the controller, once the Disruption expires and after the GC period has passed.
This is great, but with a control plane we tend to have Kubefed handle the lifecycle of any resources deployed to worker clusters. Ideally the responsibility to delete Disruptions would fall on Spinnaker and Kubefed.

Describe the solution you'd like
Introduce a toggle in the controller's ConfigMap which disables the deletion.
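
A purely illustrative sketch of the kind of toggle being requested; the key name and its location are hypothetical, not an existing configuration option:

controller:
  # Hypothetical flag: when true, the controller would skip garbage-collecting
  # expired Disruptions and leave their deletion to an external system
  # (Spinnaker/Kubefed in the setup described above).
  disableDisruptionDeletion: true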

Describe alternatives you've considered
The other option for us is to set the GC to a very large value so Spinnaker/Kubefed are always the first ones to do the deletion.

Additional context
Let us know if you have any other ideas. If this can be solved in a different way happy to consider it!

Injector pods can be OOM killed

If the injector pod is scheduled on a node with high memory usage, it can be selected to be OOM killed to free some memory. This may be a real issue since it can stop the injection or the cleanup before the end, leading to unexpected behaviors.

On top of that, we are likely to hit this issue again when we add something like a memory pressure feature.

Setting a limit of 0 is not a solution since it would make the pod even easier to evict (as explained here). Setting a request and a limit can also be an issue since it means the pod might not be scheduled (but that may be the least bad option...).
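
For illustration, the trade-off above comes down to whether the injector pod spec carries a resources block like the following; the values are placeholders, and setting them means the pod may fail to schedule on a busy node:

resources:
  requests:
    cpu: 100m # placeholder values, not recommendations
    memory: 64Mi
  limits:
    memory: 128Mi # requests and limits move the pod out of the BestEffort QoS class, making it a less likely OOM/eviction victim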

Unable to dynamically target and experiments end in PreviouslyPartiallyInjected on 7.22

Describe the bug
Hi team! We recently tried to upgrade from version 7.13.1 to 7.22 and we are facing issues trying to achieve the same behaviour.
Here is what we have observed:

  • On dynamic targeting experiments, the controller correctly targets the pods
  • The injector pods inject the pods they target
  • The application autoscales, creating new pods
  • The controller can then "see" the new pods (a DisruptionTargetHandler ADD event is logged), but no new injector pod is created
  • The experiment duration expires and the end status is "PreviouslyPartiallyInjected"
  • The new injector pods THEN spin up, but don't inject since the time has expired

We would appreciate any ideas you may have as to why this behaviour could be occurring.

To Reproduce
Steps to reproduce the behavior:
Here is an example of an experiment we ran:

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: cpu-pressure-pod
  namespace: chaos-engineering-framework
  annotations:
    chaos.datadoghq.com/environment: dev
spec:
  level: pod
  duration: 3m
  selector:
    app.kubernetes.io/name: "fault-injection-showcase"
    app: "fault-injection-showcase--plugin-master"
  count: 100%
  staticTargeting: false
  cpuPressure: 
    count: 100%

This happens to all disruption types. The experiment does not reach PreviouslyInjected, even with staticTargeting on.

Expected behavior

  • The experiment should inject the fault into the targeted application, and when the application scales, it should also target the new pods that are spun up due to autoscaling
  • The experiment should end in PreviouslyInjected when the experiment is complete and all pods are injected.

Environment:
Kubernetes version: v1.23.17-eks-c12679a
Controller version: 7.22.0
Cloud provider (or local): AWS EKS
Base OS for Kubernetes: Amazon Linux: 5.4.242-156.349.amzn2.x86_64

Disruption

      manager: manager
      operation: Update
      subresource: status
      time: '2023-07-19T14:06:16Z'
  name: cpu-pressure-pod
  namespace: chaos-engineering-framework
  resourceVersion: '965396852'
  uid: d91615ec-52b7-4a44-a71b-fc07f713059d
  selfLink: >-
    /apis/chaos.datadoghq.com/v1beta1/namespaces/chaos-engineering-framework/disruptions/cpu-pressure-pod
status:
  desiredTargetsCount: 5
  ignoredTargetsCount: 0
  injectedTargetsCount: 0
  injectionStatus: PreviouslyPartiallyInjected
  selectedTargetsCount: 5
  targetInjections:
    fault-injection-showcase-plugin-master-template-57c65cb44966t9p:
      injectionStatus: Injected
      injectorPodName: chaos-cpu-pressure-pod-c76zj
      since: '2023-07-19T14:03:33Z'
    fault-injection-showcase-plugin-master-template-57c65cb449hkbhq:
      injectionStatus: NotInjected
    fault-injection-showcase-plugin-master-template-57c65cb449n6f2z:
      injectionStatus: NotInjected
    fault-injection-showcase-plugin-master-template-57c65cb449qw6t9:
      injectionStatus: NotInjected
    fault-injection-showcase-plugin-master-template-57c65cb449v5nwc:
      injectionStatus: Injected
      injectorPodName: chaos-cpu-pressure-pod-8fskc
      since: '2023-07-19T14:03:32Z'
spec:
  count: 100%
  cpuPressure:
    count: 100%
  duration: 3m0s
  level: pod
  selector:
    app: fault-injection-showcase--plugin-master
    app.kubernetes.io/name: fault-injection-showcase
  triggers:
    createPods:
      notBefore: null
    inject:
      notBefore: null

Injector Pod describe

Name:         chaos-cpu-pressure-pod-c76zj
Namespace:    chaos-engineering-framework
Priority:     0
Node:         ip-10-72-119-118.us-west-2.compute.internal/10.72.119.118
Start Time:   Wed, 19 Jul 2023 15:03:31 +0100
Labels:       chaos.datadoghq.com/disruption-kind=cpu-pressure
              chaos.datadoghq.com/disruption-name=cpu-pressure-pod
              chaos.datadoghq.com/disruption-namespace=chaos-engineering-framework
              chaos.datadoghq.com/target=fault-injection-showcase-plugin-master-template-57c65cb44966t9p
Annotations:  kubernetes.io/psp: eks.privileged
              sidecar.istio.io/inject: false
Status:       Succeeded
IP:           10.72.90.185
IPs:
  IP:  10.72.90.185
Containers:
  injector:
    Container ID:  docker://3854a32d0ae2ab2c3aaaeaebba57bcec2106e8bf00277fda9bfe50a0634d396c
    Image:         hub-docker-remote.artylab.expedia.biz/datadog/chaos-injector:7.22.0
    Image ID:      docker-pullable://hub-docker-remote.artylab.expedia.biz/datadog/chaos-injector@sha256:abb12e9b634c1d073f6ec8825547b2a3d81f0a74d239ea83efe15cd4d59609c0
    Port:          <none>
    Host Port:     <none>
    Args:
      cpu-pressure
      --count
      100%
      --metrics-sink
      noop
      --level
      pod
      --target-containers
      fault-injection-showcase;docker://d4ab23bfbe27d489c0587a1a8c9deccada81ad7f7be02d523e3feefe8a166d1b,istio-proxy;docker://b1a854586b8762592ac747c2c3a918aae1a6aa4d19639a4e3ced722738c25109
      --target-pod-ip
      10.72.64.26
      --chaos-namespace
      chaos-engineering-framework
      --log-context-disruption-name
      cpu-pressure-pod
      --log-context-disruption-namespace
      chaos-engineering-framework
      --log-context-target-name
      fault-injection-showcase-plugin-master-template-57c65cb44966t9p
      --log-context-target-node-name
      ip-10-72-119-118.us-west-2.compute.internal
      --not-injected-before
      2023-07-19T14:03:17Z
      --deadline
      2023-07-19T14:06:16Z
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 19 Jul 2023 15:03:32 +0100
      Finished:     Wed, 19 Jul 2023 15:06:16 +0100
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     0
      memory:  0
    Requests:
      cpu:      0
      memory:   0
    Readiness:  exec [cat /tmp/readiness_probe] delay=0s timeout=1s period=1s #success=1 #failure=5
    Environment:
      DD_ENTITY_ID:                         (v1:metadata.uid)
      DD_AGENT_HOST:                        (v1:status.hostIP)
      TARGET_POD_HOST_IP:                   (v1:status.hostIP)
      CHAOS_POD_IP:                         (v1:status.podIP)
      INJECTOR_POD_NAME:                   chaos-cpu-pressure-pod-c76zj (v1:metadata.name)
      CHAOS_INJECTOR_MOUNT_HOST:           /mnt/host/
      CHAOS_INJECTOR_MOUNT_PROC:           /mnt/host/proc/
      CHAOS_INJECTOR_MOUNT_SYSRQ:          /mnt/sysrq
      CHAOS_INJECTOR_MOUNT_SYSRQ_TRIGGER:  /mnt/sysrq-trigger
      CHAOS_INJECTOR_MOUNT_CGROUP:         /mnt/cgroup/
    Mounts:
      /mnt/cgroup from cgroup (rw)
      /mnt/host from host (ro)
      /mnt/sysrq from sysrq (rw)
      /mnt/sysrq-trigger from sysrq-trigger (rw)
      /run from run (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rj8bm (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run:
    Type:          HostPath (bare host directory volume)
    Path:          /run
    HostPathType:  Directory
  proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:  Directory
  sysrq:
    Type:          HostPath (bare host directory volume)
    Path:          /proc/sys/kernel/sysrq
    HostPathType:  File
  sysrq-trigger:
    Type:          HostPath (bare host directory volume)
    Path:          /proc/sysrq-trigger
    HostPathType:  File
  cgroup:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/cgroup
    HostPathType:  Directory
  host:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  Directory
  kube-api-access-rj8bm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/arch=amd64
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason   Age   From     Message
  ----    ------   ----  ----     -------
  Normal  Pulled   28m   kubelet  Container image "hub-docker-remote.artylab.expedia.biz/datadog/chaos-injector:7.22.0" already present on machine
  Normal  Created  28m   kubelet  Created container injector
  Normal  Started  28m   kubelet  Started container injector
 

Logs
Example of injector being able to see the new pods that spin up (but this pod is not injected)
{"level":"debug","ts":1689775459195.1533,"caller":"watchers/target_pod.go:43","message":"DisruptionTargetHandler ADD","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"fault-injection-showcase-plugin-master-template-57c65cb449qw6t9","targetKind":"pod"}

{ "level": "warn", "ts": 1689781304975.5027, "caller": "watchers/target_pod.go:121", "message": "couldn't get the list of events from the target. Might not be able to notify on error changes: non-exact field matches are not supported by the cache" }

{"level":"warn","ts":1689775576793.182,"caller":"injector/main.go:850","message":"couldn't remove this pod's finalizer","disruptionName":"cpu-pressure-pod","disruptionNamespace":"chaos-engineering-framework","targetName":"fault-injection-showcase-plugin-master-template-57c65cb449n6f2z","targetNodeName":"ip-10-72-168-5.us-west-2.compute.internal","pod":"chaos-cpu-pressure-pod-f7rh6","error":"Operation cannot be fulfilled on pods \"chaos-cpu-pressure-pod-f7rh6\": the object has been modified; please apply your changes to the latest version and try again"}
(chaos-cpu-pressure-pod-f7rh6 refers to a new injector pod that did not inject anything)

I attached the controller logs here
fullLog.txt

We would appreciate any support you can provide!

User Request: allow targeting subset of destination service

Is your feature request related to a problem? Please describe.
A typical scenario I want to test is how service A responds to its direct dependency service B being partially unavailable; I basically want to verify that A has proper timeouts and retries in place to gracefully handle e.g. a single B pod being overloaded or in a bad state.

Describe the solution you'd like
I see that it's possible to scope a network disruption to just a list of specific IP addresses with the network.hosts field. However, I do not know the IP addresses of the B pods at the time of writing the Disruption. I would like to instead be able to provide a count of the destination service's pods that should be in scope for the disruption, with a percentage allowed. This would be dynamically translated to a list of IPs.
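
For illustration, this is roughly what the workaround looks like today with network.hosts, together with a hypothetical field sketching the requested behaviour (destinationPodsCount below does not exist in the CRD and is only meant to illustrate the idea; the exact per-entry hosts syntax should be checked against the network disruption documentation):

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: partial-dependency-outage
  namespace: chaos-demo
spec:
  level: pod
  selector:
    app: service-a # disrupt only A's egress so other clients of B stay unaffected
  count: 100%
  network:
    drop: 100
    hosts: # today: a static list of B pod IPs that must be known up front
      - host: 10.0.0.12
        port: 8080
    # hypothetical: have the controller resolve a subset of B's pods to IPs at injection time
    # destinationPodsCount: 50%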

Describe alternatives you've considered
I can create a disruption on B instead of A, and set the count as I wish. However, that causes a disruption to all clients of B, whereas I want to limit the scope to A, which is the subject under test. We do not have dedicated environments for this, so limiting the impact of disruptions is key to staying popular with my colleagues :D

User Request: Support for duration

Is your feature request related to a problem? Please describe.
Usually chaos engineering experiments run over a predetermined period of time. This has many benefits:

  • Users don't need to terminate the experiments manually.
  • It acts as a safety net; it is fairly easy to forget to terminate experiments.
  • Many times users of chaos engineering frameworks don't even have access to tools like kubectl to run a kubectl delete. This is the case for most of our users internally.
  • CI/CD platforms like Spinnaker can be used to run experiments. The entire lifecycle of an experiment needs to be handled in that case. Examples of integrations with CI/CD platforms include Chaos Monkey, Litmus, and more.

Describe the solution you'd like
The duration of the experiment (in seconds) would be defined in the CRD. The controller would sleep for duration seconds and send an exit signal (SIGINT/SIGTERM) to the injector pods once this is exceeded.
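
A sketch of how this could surface on the resource; note that the duration field used by other manifests in this document takes a Go-style duration string (e.g. 5m0s) rather than raw seconds:

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: timed-experiment
  namespace: chaos-demo
spec:
  selector:
    app: demo-curl
  count: 1
  duration: 10m # experiment is terminated automatically once this period is exceeded
  network:
    drop: 100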

Describe alternatives you've considered
The alternative is to handle the lifecycle of an experiment manually, but this is not always possible or desirable, as mentioned earlier.

User Issue: Error pulling image with 7.19.0

Describe the bug

When deploying, the images can't be pulled. At first I thought this was a mistake on my side while trying to deploy chaos-controller on minikube.

  Warning  Failed     8s    kubelet            Error: ErrImagePull
  Normal   BackOff    7s    kubelet            Back-off pulling image "k8s.io/chaos-controller:latest"
  Warning  Failed     7s    kubelet            Error: ImagePullBackOff

To Reproduce
Steps to reproduce the behavior:

  1. Follow the steps to install chaos-controller:
kubectl apply -f https://github.com/DataDog/chaos-controller/releases/download/7.19.0/install.yaml
  2. The pod errors with ErrImagePull.
  3. Tried with 7.18.0 and it works. Noticed that 7.19.0 changed the image registry used:
Pulling image "datadog/chaos-controller:7.18.0"

whereas the 7.19.0 installer shows:

Back-off pulling image "k8s.io/chaos-controller:latest"

This can also be reproduced with docker pull.

Expected behavior

I'm not sure whether 7.19.0 requires a new setup for pulling the images that I missed.

Environment:

  • Kubernetes version: v1.26.1
  • Controller version: 7.19.0
  • Cloud provider (or local): minikube
  • Base OS for Kubernetes: -

7.26.0 Upgrade Issues - InjectionStatus PreviouslyPartiallyInjected

Describe the bug
Hi team, we recently tried to upgrade from version 7.13.1 to 7.26 and we are facing issues trying to achieve the same behaviour. We previously reached out to you about issues with dynamic targeting; we have since upgraded to your newest release and are hoping to get some further guidance.
We wanted to understand the expected behaviour of the injection status. For example, when running CPU pressure, the final status ends up being PreviouslyPartiallyInjected despite the experiment injecting all the desired pods.

To Reproduce
Steps to reproduce the behavior:

  1. This is an example of a cpu-pressure experiment
apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: cpu-pressure-pod-rohab
  namespace: chaos-engineering-framework
  annotations:
    chaos.datadoghq.com/environment: dev
spec:
  level: pod
  duration: 6m
  selector:
    app.kubernetes.io/name: "fault-injection-showcase"
    app: "fault-injection-showcase--plugin-master"
  count: 100%
  staticTargeting: true
  cpuPressure: 
    count: 100%

The CPU pressure correctly identifies targets and injects the failure. We believe the logs attached below for cpu-pressure show that the targeted pods are not seen as ready and later go into a terminated state.
Flow:
  1. Disruption created.
  2. 2 targets identified.
  3. 2 chaos pods created.
  4. No status, straight to "PartiallyInjected".
  5. 0 pods ready, desired 2.
  6. 1 chaos pod ready, desired is 2. Status "PartiallyInjected".
  7. 2 chaos pods ready, desired is 2. Status "Injected".
  8. 1 chaos pod goes into a not Ready status.
  9. 1 chaos pod ready, desired is 2. Status "PartiallyInjected" from "Injected".
  10. Error:

"caller": "controllers/disruption_controller.go:148",
  "message": "a retryable error occurred in reconcile loop",
  "disruptionName": "cpu-pressure-pod-rohab",
  "disruptionNamespace": "chaos-engineering-framework",
  "error": "error handling chaos pods termination: Operation cannot be fulfilled on disruptions.chaos.datadoghq.com \"cpu-pressure-pod-rohab\": the object has been modified; please apply your changes to the latest version and try again"
  11. 0 chaos pods ready, desired is 2, termination 1. Status "PartiallyInjected" from "PausedPartiallyInjected".
  12. 0 chaos pods ready, desired is 2, termination 2. Status "PausedPartiallyInjected" from "PreviouslyPartiallyInjected".

Expected behavior
We expect the experiment to finish on PreviouslyInjected if all the desired pods are injected.

Logs
chaos-controller-27-06.txt
cpu_pressure_partially_injection_debug_chaos_pod_log.txt

Describe output of the pods during CPU pressure (from a different experiment run but the same experiment manifest)
injector_pod_describe.txt
target_pod_describe.txt

If you have any ideas, please let us know.

Environment:
Kubernetes version: v1.24.15-eks-a5565ad
Controller version: 7.26.0
Cloud provider (or local): AWS EKS
Base OS for Kubernetes: Amazon Linux: 5.4.242-156.349.amzn2.x86_64

User Request: Dead man's switch

Is your feature request related to a problem? Please describe.
At the moment the "big red button" for single and multiple experiments relies on the operator having access to the cluster.
For example, the operator can run a kubectl delete Disruption <name> for one or more disruptions.

It would be great if the controller had a dead man's switch. In case connection to the cluster is lost the controller would automatically stop all the running experiments.

Describe the solution you'd like

I think the implementation of a dead man's switch could use a heartbeat and a watchdog timer for remediation.
I'm still not sure what the heartbeat would look like. Can we check whether the controller is still up and running and whether the connection to the cluster is lost?

Describe alternatives you've considered
Introducing support for duration with a default expiry period is a good first step for mitigating the risks. However, it is not enough.

User Request: Support for pod state disruptions

Is your feature request related to a problem? Please describe.
Pod state failures (e.g. graceful/non-graceful deletion) are common disruptions in the Chaos Engineering community.

The reasoning behind pod failures is that Kubernetes pods are ephemeral resources; they get destroyed, restarted, recreated.

This happens in many cases:

  • When deploying a new version of an application
  • In case the liveness probe of any container running inside the pod fails
  • As a consequence of draining a node
  • When the autoscaler updates the number of replicas of a deployment

Pod state disruptions can expose a number of reliability concerns including:

  • Long-living pods and all the issues that may arise from them
  • Cold start issues
  • Scalability issues (e.g. autoscaling misconfigurations)
  • Inconsistent/unknown startup times
  • Uneven traffic distribution across pods
  • Non-graceful shutdown
  • Issues related to Java's DNS cache TTL leading to terminated pods still receiving requests
  • Cascading failures
  • We also wrote a blogpost on issues we found when using Kube Monkey

Describe the solution you'd like
Pod deletions can be executed in many different ways. The easiest is through the Kubernetes client which supports graceful and non-graceful deletions through its gracePeriodSeconds parameter. This is how tools like Kube Monkey and our internal controller execute that disruption.

The other option would be to do this at container level which provides more granularity. This is how Pumba executes these disruptions.

A few more implementation details/ideas:

  • The level will always be pod for these disruptions.
  • In the CRD this might get a bit confusing, but one option is to introduce a podFailure field, similar to nodeFailure, with options (e.g. graceful/non-graceful deletion); see the sketch after this list.
  • There is already a containers field which would allow targeting containers.
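
A minimal sketch of what such a resource could look like, assuming a hypothetical podFailure field (it does not exist in the CRD today; the field name and options only illustrate the proposal):

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: pod-failure
  namespace: chaos-demo
spec:
  level: pod # always pod level for this kind of disruption
  selector:
    app: demo-nginx
  count: 1
  podFailure: # hypothetical field, analogous to nodeFailure
    graceful: false # non-graceful deletion, i.e. a gracePeriodSeconds of 0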

User Request: Compatibility with Kubernetes 1.22

The Kubernetes 1.22 release brought a set of backwards incompatible changes.
One that affects this project is the removal of deprecated beta APIs.

Is your feature request related to a problem? Please describe.
The provided charts are not compatible with Kubernetes 1.22.

Describe the solution you'd like
Update any deprecated APIs. There is a migration guide for this.
Depending on the Kubernetes versions we'd like to support in this project we need to check in which version the stable APIs were introduced.
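
As an illustration of the kind of change involved (whether the chart pins these exact API versions is an assumption; the migration guide lists the full set of removals), 1.22 removed the beta CustomResourceDefinition and admission webhook APIs in favour of their stable versions, available since Kubernetes 1.16:

# before: removed in Kubernetes 1.22
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
# after: stable API, available since Kubernetes 1.16
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition

# likewise for webhook configurations
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration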

User Issue: Unable to terminate node level network experiments

Describe the bug
While executing Availability Zone network Disruptions we noticed that deleting the Disruption resource does not terminate the experiment.
I suspect this is expected. However, I am curious to understand if there is anything we could do, other than manually killing the affected Nodes in case of an emergency.

To Reproduce
Steps to reproduce the behavior:

  1. Apply the following Disruption on an EKS cluster:
    apiVersion: chaos.datadoghq.com/v1beta1
    kind: Disruption
    metadata:
      name: network-filters
      namespace: chaos-engineering
    spec:
      duration: "5m0s"
      level: node
      selector:
        topology.kubernetes.io/zone: us-west-2a
      count: 100%
      network:
        drop: 100
    
  2. Delete the Disruption before it expires.

Expected behavior
The expectation is that the chaos Pods get cleaned up and that the targeted Nodes are no longer impacted.
The actual behavior is that the chaos Pods are stuck in the Terminating state and the Node is still affected, until either the Disruption expires or the Node gets replaced (whichever happens first).

Screenshots
The status gets updated to PartiallyInjected:

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  ...
status:
  desiredTargetsCount: 3
  ignoredTargetsCount: 0
  injectedTargetsCount: 0
  injectionStatus: PartiallyInjected
  selectedTargetsCount: 3
  targets:
    - ip-xx.us-west-2.compute.internal
    - ip-xx.us-west-2.compute.internal
    - ip-xx.us-west-2.compute.internal
spec:
  count: 100%
  duration: 5m0s
  level: node
  network:
    drop: 100
  selector:
    topology.kubernetes.io/zone: us-west-2a
  staticTargeting: true

Experiment in progress
experiment_in_progress

Chaos Pods terminating
chaos_pods_terminating

Controller logs

{"level":"info","ts":1663261098328.3938,"caller":"controllers/disruption_controller.go:386","message":"starting targets injection","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework","targets":["ip-xx.us-west-2.compute.internal","ip-xx.us-west-2.compute.internal","ip-xx.us-west-2.compute.internal","ip-xx.us-west-2.compute.internal"]}
{"level":"info","ts":1663261098328.5518,"caller":"controllers/disruption_controller.go:311","message":"checking if injection status needs to be updated","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework","injectionStatus":"PartiallyInjected"}
{"level":"info","ts":1663261098328.6228,"caller":"controllers/disruption_controller.go:348","message":"chaos pod is not ready yet","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework","chaosPod":"chaos-network-filters-sp5xs"}
{"level":"info","ts":1663261098328.6345,"caller":"controllers/disruption_controller.go:348","message":"chaos pod is not ready yet","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework","chaosPod":"chaos-network-filters-l2rwh"}
{"level":"info","ts":1663261098328.64,"caller":"controllers/disruption_controller.go:348","message":"chaos pod is not ready yet","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework","chaosPod":"chaos-network-filters-npnfq"}
{"level":"info","ts":1663261098328.6453,"caller":"controllers/disruption_controller.go:348","message":"chaos pod is not ready yet","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework","chaosPod":"chaos-network-filters-tbdzh"}
{"level":"info","ts":1663261098352.3354,"caller":"controllers/disruption_controller.go:285","message":"disruption is not fully injected yet, requeuing","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework","injectionStatus":"PartiallyInjected"}
{"level":"info","ts":1663261100329.6738,"caller":"controllers/disruption_controller.go:868","message":"terminating chaos pod to trigger cleanup","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework","chaosPod":"chaos-network-filters-sp5xs"}
{"level":"info","ts":1663261100473.4058,"caller":"controllers/disruption_controller.go:868","message":"terminating chaos pod to trigger cleanup","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework","chaosPod":"chaos-network-filters-l2rwh"}
{"level":"info","ts":1663261100641.4067,"caller":"controllers/disruption_controller.go:868","message":"terminating chaos pod to trigger cleanup","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework","chaosPod":"chaos-network-filters-npnfq"}
{"level":"info","ts":1663261100784.24,"caller":"controllers/disruption_controller.go:868","message":"terminating chaos pod to trigger cleanup","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework","chaosPod":"chaos-network-filters-tbdzh"}
{"level":"info","ts":1663261100947.5776,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 7s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261100961.5576,"caller":"controllers/disruption_controller.go:663","message":"target is not likely to be cleaned (either it does not exist anymore or it is not ready), the injector will TRY to clean it but will not take care about any failures","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework","target":"ip-xx.us-west-2.compute.internal"}
{"level":"info","ts":1663261100961.5896,"caller":"controllers/disruption_controller.go:722","message":"chaos pod completed, removing finalizer","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework","target":"ip-xx.us-west-2.compute.internal","chaosPod":"chaos-network-filters-sp5xs"}
{"level":"info","ts":1663261101118.2463,"caller":"controllers/disruption_controller.go:663","message":"target is not likely to be cleaned (either it does not exist anymore or it is not ready), the injector will TRY to clean it but will not take care about any failures","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework","target":"ip-xx.us-west-2.compute.internal"}
{"level":"info","ts":1663261101118.2778,"caller":"controllers/disruption_controller.go:722","message":"chaos pod completed, removing finalizer","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework","target":"ip-xx.us-west-2.compute.internal","chaosPod":"chaos-network-filters-l2rwh"}
{"level":"info","ts":1663261101261.8376,"caller":"controllers/disruption_controller.go:663","message":"target is not likely to be cleaned (either it does not exist anymore or it is not ready), the injector will TRY to clean it but will not take care about any failures","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework","target":"ip-xx.us-west-2.compute.internal"}
{"level":"info","ts":1663261101261.8674,"caller":"controllers/disruption_controller.go:722","message":"chaos pod completed, removing finalizer","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework","target":"ip-xx.us-west-2.compute.internal","chaosPod":"chaos-network-filters-npnfq"}
{"level":"info","ts":1663261101415.6448,"caller":"controllers/disruption_controller.go:663","message":"target is not likely to be cleaned (either it does not exist anymore or it is not ready), the injector will TRY to clean it but will not take care about any failures","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework","target":"ip-xx.us-west-2.compute.internal"}
{"level":"info","ts":1663261101415.6882,"caller":"controllers/disruption_controller.go:722","message":"chaos pod completed, removing finalizer","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework","target":"ip-xx.us-west-2.compute.internal","chaosPod":"chaos-network-filters-tbdzh"}
{"level":"info","ts":1663261101609.2085,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 7s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261101635.3896,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 6s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261107362.9827,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 7s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261114383.3926,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 7s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261119848.4282,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 7s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261121400.5457,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 5s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261126417.5464,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 7s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261133438.4202,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 6s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261139457.891,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 7s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261146475.627,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 7s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261153494.1755,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 7s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261160511.7751,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 7s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261167531.0684,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 6s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261173549.154,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 8s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261181566.9873,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 9s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261190585.6133,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 6s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261196604.5437,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 9s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261205622.0989,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 5s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261210641.6821,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 9s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261219657.941,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 5s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261224674.3186,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 7s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261231694.8792,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 5s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261236716.0718,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 8s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261244733.3901,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 9s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261253752.3855,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 8s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261261770.0288,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 6s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261267786.8157,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 8s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261275804.2466,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 7s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261282825.0564,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 7s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261289841.469,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 7s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261296858.5693,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 7s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261299221.6409,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 6s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261299639.019,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 5s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261299926.3545,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 9s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261303875.483,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 6s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261306981.6267,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 5s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261307316.6602,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 7s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261307778.8533,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 6s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261307851.4666,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 8s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261308847.198,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 8s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261309361.5854,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 8s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261309407.5864,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 9s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261309564.1873,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 7s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261309860.5098,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 5s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261309904.886,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 7s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261310352.0588,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 7s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261311872.6145,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 9s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261312201.1887,"caller":"controllers/disruption_controller.go:156","message":"disruption has not been fully cleaned yet, re-queuing in 6s","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
{"level":"info","ts":1663261312375.1292,"caller":"controllers/disruption_controller.go:166","message":"all chaos pods are cleaned up; removing disruption finalizer","disruptionName":"network-filters","disruptionNamespace":"chaos-engineering-framework"}
NOOP: Notifier Info: Finished - Disruption finished for disruption network-filters

Environment:

  • Kubernetes version: 1.21
  • Controller version: 7.1.1
  • Cloud provider (or local): EKS
  • Base OS for Kubernetes: Amazon Linux (5.4.209-116.363.amzn2.x86_64)

User Request: Push 4.0.1 images to DockerHub

Is your feature request related to a problem? Please describe.
Now that the 4.0.1 release is out, would it be possible to push the latest images to DockerHub?

Describe the solution you'd like
It'd be great if the latest Docker images get pushed to DockerHub on every release. We do something similar with GitHub Actions in our projects (example).
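
A rough sketch of such a workflow, assuming DockerHub credentials are stored as repository secrets; the workflow name, secret names, and image tag below are illustrative and not the project's actual release pipeline:

name: push-release-images
on:
  release:
    types: [published]
jobs:
  push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: datadog/chaos-controller:${{ github.event.release.tag_name }}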

Describe alternatives you've considered
The alternative is to keep pushing the images manually but it adds overhead.

count: -1 does not work

We were trying to do a gameday with mostly network disruptions, but the Disruption resource would keep failing to create the disruption pods when we had the count set to -1, e.g.

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: PLACEHOLDER
  namespace: PLACEHOLDER
spec:
  selector: # label selector to target pods
    app: PLACEHOLDER
  count: -1
  network:
    port: 9042
    protocol: tcp
    flow: egress
    drop: 100

Here is the disruption description for events:

Events:
  Type     Reason    Age   From                   Message
  ----     ------    ----  ----                   -------
  Warning  NoTarget  15s   disruption-controller  The given label selector did not target any pods

User Request: Documentation around Disruption Rollout Controller

Is your feature request related to a problem? Please describe.
When looking to change a setting in the Helm config YAML, we noticed the 'disruptionRolloutEnabled' setting and were curious what it actually does. There was no documentation around it and we were only able to glean what it does from the PR that initially introduced it. This is an AMAZING feature we had no idea about 😄 It would be great if this was documented and advertised in the repo a bit more as it is truly useful!

User Issue: Node level DNS disruptions impact the DNS records of all cluster's Kubernetes Services

Describe the bug
While testing the DNS disruptions at node level I noticed a couple of critical issues.

In summary, these impact the DNS records of all of the cluster's Kubernetes Services (*.svc.cluster.local).

To Reproduce
Steps to reproduce the behavior:

  1. Start the provided minikube setup
    • make minikube-start
    • make minikube-build
    • make install
  2. Do a nslookup before applying the Disruption:
    / # cat /etc/resolv.conf
    search chaos-demo.svc.cluster.local svc.cluster.local cluster.local
    nameserver 10.96.0.10
    options ndots:5
    
    Name:      demo.chaos-demo.svc.cluster.local
    Address 1: 10.108.230.236 demo.chaos-demo.svc.cluster.local
    
    Name:      dashboard-metrics-scraper.kubernetes-dashboard.svc.cluster.local
    Address 1: 10.107.89.68 dashboard-metrics-scraper.kubernetes-dashboard.svc.cluster.local
    
  3. Curling the 2 Services from the curl pod works as expected. The same applies to any external hostnames like google.com.
  4. Apply the following Custom Resource:
    apiVersion: chaos.datadoghq.com/v1beta1
    kind: Disruption
    metadata:
      name: dns
      namespace: chaos-engineering
    spec:
      level: node
      selector:
        kubernetes.io/hostname: minikube
      count: 100%
      dns:
        - hostname: demo.chaos-demo.svc.cluster.local
          record:
            type: A
            value: 10.0.0.154,10.0.0.13  
    
  5. Do a nslookup again:
    / # nslookup demo.chaos-demo.svc.cluster.local
    Name:      demo.chaos-demo.svc.cluster.local
    Address 1: 10.0.0.154
    
    / # nslookup dashboard-metrics-scraper.kubernetes-dashboard.svc.cluster.local
    nslookup: can't resolve 'dashboard-metrics-scraper.kubernetes-dashboard.svc.cluster.local': Name does not resolve
    
  6. Curling the 2 Services from the curl pod no longer works. External hostnames are still accessible.

Expected behavior
The disruption should only impact the provided hostname (demo.chaos-demo.svc.cluster.local).

Environment:

  • Kubernetes version: 1.21.1 in the provided minikube setup
  • Controller version:
  • Cloud provider (or local): minikube
  • Base OS for Kubernetes: N/A

Additional context
In minikube there is only one node which means all outgoing calls to the cluster's Kubernetes Services are affected. In a multi-node setup this affects the nodes which are targeted using the label selector.

User Request: Release Dynamic Targeting behind a feature flag in controller

Note: While chaos-controller is open to the public and we consider all suggestions for improvement, we prioritize feature development that is immediately applicable to chaos engineering initiatives within Datadog. We encourage users to contribute ideas to the repository directly in the form of pull requests!

Is your feature request related to a problem? Please describe.

After deploying controller version 7.2.0, which has dynamic targeting enabled by default, we observed that the controller enters a restart loop (Back-off restart) while running experiments (error attached at the bottom). We were on controller version 6.1.0 and upgraded to 7.2.0. However, when turning off dynamic targeting by adding staticTargeting: true in the definition, the controller works as expected.
We are in the process of evaluating the dynamic targeting feature; however, as this feature is enabled by default in new versions, there is a risk of teams running it without knowing the complete impact.
It would be great if dynamic targeting could be released behind a controller configuration option so that we can block it until teams are confident and/or until we iron out the issues we see in our cluster due to dynamic targeting. This would also enable us to use newer versions of the controller while we sort out dynamic targeting.

Describe the solution you'd like
Release dynamic targeting behind a feature flag. A configuration to enable/disable dynamic targeting in the configmap.yaml so we can disable/enable this feature from the controller side.
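
For example, something along these lines (the key name below is hypothetical and only illustrates the request; the actual name and placement in configmap.yaml would be up to the maintainers):

# hypothetical controller configuration key
# when false, the controller would behave as if staticTargeting: true were set on every Disruption
dynamicTargetingEnabled: false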

Describe alternatives you've considered
Open to any other ideas that would enable/disable dynamic targeting from controller

Errors seen when running experiments with dynamic targeting

{"level":"info","ts":1660818040268.9556,"caller":"chaos-controller/main.go:267","message":"loading configuration file","config":"/etc/chaos-controller/config.yaml"}
I0818 10:20:41.321192       1 request.go:665] Waited for 1.031103301s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/pkg.crossplane.io/v1?timeout=32s
{"level":"info","ts":1660818045975.2166,"caller":"eventbroadcaster/notifiersink.go:40","message":"notifier noop enabled"}
{"level":"info","ts":1660818045978.2427,"caller":"chaos-controller/main.go:424","message":"restarting chaos-controller"}
I0818 10:20:45.978442       1 leaderelection.go:248] attempting to acquire leader lease chaos-engineering-framework/75ec2fa4.datadoghq.com...
I0818 10:21:02.864515       1 leaderelection.go:258] successfully acquired lease chaos-engineering-framework/75ec2fa4.datadoghq.com
I0818 10:21:04.017267       1 request.go:665] Waited for 1.046628643s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/athena.aws.crossplane.io/v1alpha1?timeout=32s
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1444dba]

goroutine 691 [running]:
github.com/DataDog/chaos-controller/controllers.(*DisruptionReconciler).manageInstanceSelectorCache(0xc000824000, 0xc0004e0240)
	/go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/controllers/cache_handler.go:514 +0x63a
github.com/DataDog/chaos-controller/controllers.(*DisruptionReconciler).Reconcile(0xc000824000, {0x1b730d8?, 0xc0010c1a70?}, {{{0xc000a4de60?, 0x1b?}, {0xc000a4de20?, 0x20?}}})
	/go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/controllers/disruption_controller.go:124 +0x4c5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc00082d040, {0x1b730d8, 0xc0010c19b0}, {{{0xc000a4de60?, 0x17c79c0?}, {0xc000a4de20?, 0xc000716d40?}}})
	/go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114 +0x222
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00082d040, {0x1b73030, 0xc000698580}, {0x175aa00?, 0xc00064d940?})
	/go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311 +0x2e9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00082d040, {0x1b73030, 0xc000698580})
	/go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x309

User Issue / Suggestion: Controller arguments do not supersede config file

Describe the bug
Small bug found: when using a configuration file along with command-line arguments, the arguments are overridden by the contents of the config file, which caused a bit of confusion on our end during setup.

To Reproduce
Steps to reproduce the behavior:

  1. Set a value within the configmap/config file (ex: notifiers-common-clustername)
  2. Add an argument while launching the binary (ex: notifiers-common-clustername)

Expected behavior
When setting a command line argument, the value provided will take precedence over the same setting in the configuration file.

Environment:

  • Kubernetes version: N/A
  • Controller version: N/A
  • Cloud provider (or local): N/A
  • Base OS for Kubernetes: N/A

Additional context
It looks like a potential fix would be to call the code block containing viper.SetConfigFile(configPath) directly after loading only the configPath flag variable and calling pflag.Parse(), and then continue on with the rest of the flags.

User Issue: CPU pressure does not consume 100% of the pods allocated CPU

Describe the bug
"CPU pressure" attack does not consume 100% of the CPU allocated to a Kubernetes pod.

To Reproduce
Steps to reproduce the behavior:
I injected a "CPU pressure" fault into a pod with 2 containers:

  1. Application container (name: fault-injection-showcase, CPU requests: 16 cores, CPU limits: 16 cores)
  2. Istio sidecar container (name: istio-proxy, CPU requests: 100 millicores, CPU limits: 2 cores)

Expected behavior
The pod should consume 18 cores (100% of the CPU allocated to both the containers in the pod)

Actual behavior

  1. The pod CPU usage fluctuates anywhere between 7 and 18 cores
  2. Within the pod, the "istio-proxy" container was constantly at 100% CPU usage, but the "fault-injection-showcase" container was only using ~60% the whole time

Screenshots
CPU_Usage_Over_Time

CPU_Usage_Containers_In_The_Pod

CPU_Usage_Application_Container

CPU_Usage_Istio_Proxy_Container

Environment:

  • Kubernetes version: v1.21.9
  • Controller version: 6.0.0
  • Cloud provider (or local): AWS
  • Base OS for Kubernetes: Amazon Linux (5.4.181-99.354.amzn2.x86_64)

User Request: Debugging instructions

Is your feature request related to a problem? Please describe.
It would be nice to provide instructions on how to debug the controller against an existing cluster.
Running this locally, in an IDE, requires some effort but is feasible.

Describe the solution you'd like
Any required files and documentation on how to run the controller (in an IDE) against an existing cluster.

Describe alternatives you've considered
Deploying the controller to a cluster and relying on logs/metrics.

Additional context
Quick overview of running it locally:

  • Generate the webhook certificates; one can get these from the cluster as long as the controller has been installed there before. Or we can provide dummy ones?
  • Create the config.yaml file
    • This is similar to the one provided as a ConfigMap
    • Leader election and safeguards should be disabled
    • Provide the directory with the certificates
  • Set the --config argument, and the CONTROLLER_NODE_NAME environment variable when running the controller.

Would it help if we introduce a local folder with a sample config.yaml and dummy certificates?
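
As a rough illustration of what such a sample could contain (the keys below are placeholders rather than the controller's actual configuration keys; the real names are in the configuration guide):

# config.yaml for a local run (placeholder keys)
leaderElection: false      # leader election disabled for local debugging
safeguardsEnabled: false   # safeguards disabled
webhookCertDir: ./certs    # directory containing the (dummy or cluster-extracted) webhook certificates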

User Request: Dashboard in Datadog

Does anyone have experience creating a Datadog dashboard that combines the chaos.* metrics with application metrics,
so that SREs can easily monitor and compare the chaos injection against steady state?
Thanks in advance.

User Issue: Injector Pod stuck when executing kernel panic disruption

Describe the bug
I have been executing kernel panic attacks and the injector Pod gets stuck in the Terminating state (exit code: 255).
The controller is not able to remove its finalizer and eventually clean up the injector Pod and the Custom Resource. It logs the following:

{"level":"info","ts":1643818006815.9573,"caller":"controllers/disruption_controller.go:628","message":"instance seems stuck on removal for this target, please check manually","instance":"node-failure","namespace":"chaos-engineering-framework","target":"ip-10-72-9-223.us-west-2.compute.internal","chaosPod":"chaos-node-failure-2z6tw"}

NOOP: Notifier Warning: StuckOnRemoval - Instance is stuck on removal because of chaos pods not being able to terminate correctly, please check pods logs before manually removing their

I was wondering if this is expected.

In node terminations this is not an issue as the Pod will get evicted once the Node gets replaced.

Deployed Disruption

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: node-failure
  namespace: chaos-engineering-framework
spec:
  duration: 3m
  level: node
  selector:
    kubernetes.io/hostname: <hostname>
  count: 100%
  nodeFailure:
    shutdown: false

Disruption specs

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  ...
  creationTimestamp: '2022-02-04T17:01:20Z'
  deletionGracePeriodSeconds: 0
  deletionTimestamp: '2022-02-04T17:14:26Z'
  finalizers:
    - finalizer.chaos.datadoghq.com
  generation: 7
  ...
status:
  ignoredTargets:
    - <hostname>
  injectionStatus: PreviouslyInjected
  isStuckOnRemoval: true
  ...
spec:
  count: 100%
  duration: 3m0s
  level: node
  nodeFailure: {}
  selector:
    kubernetes.io/hostname: <hostname>

Injector Pod specs

apiVersion: v1
kind: Pod
metadata:
  ...
  creationTimestamp: '2022-02-04T17:01:20Z'
  deletionTimestamp: '2022-02-04T17:14:26Z'
  deletionGracePeriodSeconds: 0
  ...
  finalizers:
    - finalizer.chaos.datadoghq.com/chaos-pod
  ...
status:
  phase: Failed
  ...
  startTime: '2022-02-04T17:01:20Z'
  ...
  containerStatuses:
    - name: injector
      state:
        terminated:
          exitCode: 255
          reason: Error
          startedAt: '2022-02-04T17:01:23Z'
          finishedAt: '2022-02-04T17:01:54Z'
          containerID: >-
            docker://<id>
      lastState: {}
      ready: false
      restartCount: 0
      ...
      started: false
    ...
  qosClass: Burstable
spec:
  ...
  containers:
    - name: injector
      ...
      args:
        - node-failure
        - inject
        - '--metrics-sink'
        - datadog
        - '--level'
        - node
        - '--target-container-ids'
        - ''
        - '--target-pod-ip'
        - ''
        - '--chaos-namespace'
        - chaos-engineering-framework
        - '--log-context-disruption-name'
        - node-failure
        - '--log-context-disruption-namespace'
        - chaos-engineering-framework
        - '--log-context-target-name'
        - <hostname>
        - '--log-context-target-node-name'
        - <hostname>
        - '--deadline'
        - '2022-02-04T17:04:19Z'
      env:
        - name: TARGET_POD_HOST_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: CHAOS_POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: INJECTOR_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: CHAOS_INJECTOR_MOUNT_HOST
          value: /mnt/host/
        - name: CHAOS_INJECTOR_MOUNT_PROC
          value: /mnt/host/proc/
        - name: CHAOS_INJECTOR_MOUNT_SYSRQ
          value: /mnt/sysrq
        - name: CHAOS_INJECTOR_MOUNT_SYSRQ_TRIGGER
          value: /mnt/sysrq-trigger
        - name: CHAOS_INJECTOR_MOUNT_CGROUP
          value: /mnt/cgroup/
        ...
      ...
      readinessProbe:
        exec:
          command:
            - cat
            - /tmp/readiness_probe
        timeoutSeconds: 1
        periodSeconds: 1
        successThreshold: 1
        failureThreshold: 5
      ...
    ...
  restartPolicy: Never
  terminationGracePeriodSeconds: 60
  activeDeadlineSeconds: 189
  dnsPolicy: ClusterFirst
  serviceAccountName: chaos-injector
  serviceAccount: chaos-injector
  nodeName: <hostname>
  ...

Expected behavior
Injector Pods and Disruption get cleaned up once the Disruption expires.

Environment:

  • Kubernetes version: 1.20.11
  • Controller version: 5.2.1
  • Cloud provider (or local): EKS
  • Base OS for Kubernetes: Amazon Linux 2
  • Container runtime: docker

User Issue: Start in private cluster without public internet access

Describe the bug

kubectl logs -f chaos-controller-5f897979d-fxv9z -n chaos-engineering

{"level":"fatal","ts":1710340783015.9324,"caller":"chaos-controller/main.go:154","message":"error initializing CloudProviderManager","error":"could not get the new ip ranges from provider AWS: Get "https://ip-ranges.amazonaws.com/ip-ranges.json\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\ncould not get the new ip ranges from provider GCP: Get "https://www.gstatic.com/ipranges/goog.json\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\ncould not get the new ip ranges from provider Datadog: Get "https://ip-ranges.datadoghq.com/\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\n","stacktrace":"main.main\n\t/go/src/github.com/BkfwVXeB/0/DataDog/chaos-controller/main.go:154\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}

We have an EKS cluster without Internet access; what would be the correct way to start? I have a proxy to the Internet; is there a config option to set it up?
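
Since the controller is written in Go and Go's default HTTP transport honours the standard proxy environment variables, one thing worth trying (this is an assumption, not a documented configuration option) is to set them on the controller Deployment so that the cloud provider IP-range fetches go through your proxy:

# fragment of the controller Deployment's pod template (illustrative; container name may differ)
    spec:
      containers:
        - name: manager
          env:
            - name: HTTPS_PROXY
              value: http://proxy.internal:3128 # placeholder proxy address
            - name: NO_PROXY
              value: 172.20.0.1,.svc,.cluster.local # keep in-cluster and API server traffic off the proxy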

User Issue: Pod level network disruptions not working

Describe the bug
I have been trying the pod level network disruptions over the last few days but I can't see any impact.
I have Istio installed but I disabled it for the time being to rule that factor out.

To Reproduce
First, I deployed the demo resources from this repo. I only changed the namespace to chaos as I'm using that one.

Logs from the demo-curl pod before the disruption
Screenshot 2021-06-23 at 15 43 09

Disruption
If I understand correctly, the disruption below should impact egress traffic, i.e. the call from demo-curl to Nginx:

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: network-egress
  namespace: chaos
spec:
  level: pod
  selector:
    app: demo-curl
  count: 100%
  network:
    drop: 100
    hosts:
      - port: 80

Logs from the injector pod
The disruption gets injected:

{"level":"info","ts":1624459766542.497,"caller":"injector/main.go:154","message":"injector targeting container","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss","containerID":"docker://ce9300286ab80bd1c76d7deeb1b0a4883cefb33788fc005cd8d4a7992e888d80","container name":"/k8s_curl_demo-curl-f7f7d86c-w9zss_chaos_d935dd80-5b1c-4bfe-a913-c593b91aff29_0"}
{"level":"info","ts":1624459766544.775,"caller":"injector/main.go:154","message":"injector targeting container","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss","containerID":"docker://4b533ab35f2e2c0e897cdfcbbc3505d819850cd22848b5e1e4da036745eb8176","container name":"/k8s_dummy_demo-curl-f7f7d86c-w9zss_chaos_d935dd80-5b1c-4bfe-a913-c593b91aff29_0"}
{"level":"info","ts":1624459766545.6682,"caller":"injector/main.go:228","message":"injecting the disruption","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766545.6929,"caller":"injector/network_disruption.go:77","message":"adding network disruptions","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss","drop":100,"duplicate":0,"corrupt":0,"delay":0,"delayJitter":0,"bandwidthLimit":0}
{"level":"info","ts":1624459766546.0093,"caller":"injector/network_disruption.go:193","message":"detected default gateway IP 169.254.1.1 on interface eth0","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766546.0254,"caller":"injector/network_disruption.go:201","message":"target pod node IP is 10.1.77.165","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766547.3503,"caller":"injector/network_disruption.go:220","message":"setting tx qlen for interface eth0","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766547.503,"caller":"network/tc.go:63","message":"running tc command: /sbin/tc qdisc add dev eth0 root handle 1: prio bands 4 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766548.8694,"caller":"network/tc.go:63","message":"running tc command: /sbin/tc qdisc add dev lo root handle 1: prio bands 4 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766550.0566,"caller":"network/tc.go:63","message":"running tc command: /sbin/tc qdisc add dev eth0 parent 1:4 handle 2: prio bands 2 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766551.1255,"caller":"network/tc.go:63","message":"running tc command: /sbin/tc qdisc add dev lo parent 1:4 handle 2: prio bands 2 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766552.1704,"caller":"network/tc.go:63","message":"running tc command: /sbin/tc filter add dev eth0 parent 2:0 handle 2: cgroup","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766553.1016,"caller":"network/tc.go:63","message":"running tc command: /sbin/tc filter add dev lo parent 2:0 handle 2: cgroup","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766553.9832,"caller":"network/tc.go:63","message":"running tc command: /sbin/tc qdisc add dev eth0 parent 2:2 handle 3: netem loss 100%","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766554.9204,"caller":"network/tc.go:63","message":"running tc command: /sbin/tc qdisc add dev lo parent 2:2 handle 3: netem loss 100%","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766555.9028,"caller":"network/tc.go:63","message":"running tc command: /sbin/tc filter add dev eth0 parent 1:0 u32 match ip dst 0.0.0.0/0 match ip dport 80 0xffff flowid 1:4","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766556.871,"caller":"network/tc.go:63","message":"running tc command: /sbin/tc filter add dev lo parent 1:0 u32 match ip dst 0.0.0.0/0 match ip dport 80 0xffff flowid 1:4","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766557.8008,"caller":"network/tc.go:63","message":"running tc command: /sbin/tc filter add dev eth0 parent 1:0 u32 match ip dst 169.254.1.1/32 flowid 1:1","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766558.7136,"caller":"network/tc.go:63","message":"running tc command: /sbin/tc filter add dev eth0 parent 1:0 u32 match ip dst 10.1.77.165/32 flowid 1:1","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766559.6082,"caller":"network/tc.go:63","message":"running tc command: /sbin/tc filter add dev lo parent 1:0 u32 match ip dst 10.1.77.165/32 flowid 1:1","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766560.5198,"caller":"injector/network_disruption.go:229","message":"clearing tx qlen for interface eth0","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766560.6428,"caller":"injector/network_disruption.go:97","message":"operations applied successfully","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766560.682,"caller":"injector/network_disruption.go:100","message":"editing pod net_cls cgroup to apply a classid to target container packets","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
NOOP: MetricInjected true
{"level":"info","ts":1624459766560.8398,"caller":"injector/main.go:243","message":"disruption injected, now waiting for an exit signal","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
{"level":"info","ts":1624459766560.8535,"caller":"injector/network_disruption.go:77","message":"adding network disruptions","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss","drop":0,"duplicate":0,"corrupt":0,"delay":0,"delayJitter":0,"bandwidthLimit":0}
{"level":"info","ts":1624459766560.8652,"caller":"injector/network_disruption.go:100","message":"editing pod net_cls cgroup to apply a classid to target container packets","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}
NOOP: MetricInjected true
{"level":"info","ts":1624459766562.023,"caller":"injector/main.go:243","message":"disruption injected, now waiting for an exit signal","disruptionName":"network-ingress","disruptionNamespace":"chaos","targetName":"demo-curl-f7f7d86c-w9zss"}

I cannot see anything different though. The demo-curl pod can still call Nginx.

I tried many different pod-level disruptions, including the one above but for ingress traffic, and even disrupting the Service:

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: network-filters
  namespace: chaos-demo
spec:
  level: pod
  selector:
    app: demo-curl
  count: 100%
  network:
    drop: 100
    # with and without flow: ingress (see the sketch after this manifest)
    services:
      - name: demo
        namespace: chaos
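
To make the ingress attempt concrete, here is a minimal sketch of the flow: ingress variant I tried (illustrative only; the exact placement of the flow field may differ between controller versions, so check the docs for your release):

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: network-ingress
  namespace: chaos-demo
spec:
  level: pod
  selector:
    app: demo-curl
  count: 100%
  network:
    drop: 100      # drop all matching packets
    flow: ingress  # apply the drop to incoming traffic instead of outgoing traffic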

I also attempted to verify the impact by port-forwarding the Nginx server and the curl-demo app (via their Service, Deployment, or Pod) and then hitting the URL, or by using nc, but still saw no impact.

Expected behavior
Impact is visible.

Environment:

  • Kubernetes version: EKS 1.19
  • Controller version: 4.0.1
  • Cloud provider (or local): AWS
  • Container runtime: docker://19.3.13

User Issue: Injector Pod stuck in Failed phase with reason DeadlineExceeded

Describe the bug
We have been intermittently running into an issue where the injector Pod gets stuck in the Failed phase with the following context:

status:
  phase: Failed
  message: Pod was active on the node longer than the specified deadline
  reason: DeadlineExceeded

Deployed Disruption
Note: We didn't set the duration when applying the Disruption. It defaults to 15 minutes, which is reflected in the CR.

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: container-failure
  namespace: chaos-engineering
spec:
  selector:
    app: demo-nginx
  count: 1
  containerFailure:
    forced: true

Disruption specs

apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  creationTimestamp: '2022-02-01T15:28:39Z'
  deletionGracePeriodSeconds: 0
  deletionTimestamp: '2022-02-01T15:58:43Z'
  finalizers:
    - finalizer.chaos.datadoghq.com
  generation: 7
  ...
status:
  ignoredTargets:
    - XXX
  injectionStatus: PreviouslyInjected
spec:
  containerFailure:
    forced: true
  count: 1
  duration: 15m0s
  selector:
    app: XXX

Injector Pod specs
The finalizer gets successfully removed, but the Pod is still there.

apiVersion: v1
kind: Pod
metadata:
  ...
  creationTimestamp: '2022-02-01T15:28:40Z'
  deletionTimestamp: '2022-02-01T15:44:39Z'
  deletionGracePeriodSeconds: 60
  ...
status:
  phase: Failed
  message: Pod was active on the node longer than the specified deadline
  reason: DeadlineExceeded
  startTime: '2022-02-01T15:28:40Z'
  containerStatuses:
    - name: injector
      state:
        running:
          startedAt: '2022-02-01T15:28:42Z'
      lastState: {}
      ready: false
      restartCount: 0
    - name: XXX
      state:
        running:
          startedAt: '2022-02-01T15:28:42Z'
      lastState: {}
      ready: false
      restartCount: 0
  qosClass: Burstable
spec:
  ...
  terminationGracePeriodSeconds: 60
  activeDeadlineSeconds: 899
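  # 899s lines up with the 15m disruption duration; once activeDeadlineSeconds is exceeded,
  # the kubelet marks the pod Failed with reason DeadlineExceeded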
  ...

Injector Pod logs
(reconstructed from our logging platform)

injector targeting container
injector targeting container
injecting the disruption
injecting a container failure
disruption container-failure injected
injecting a container failure
disruption container-failure injected
disruption(s) injected, now waiting for an exit signal
an exit signal has been received
disruption container-failure cleaned
disruption container-failure cleaned
disruption(s) cleaned, now exiting
closing metrics sink client before exiting

To Reproduce
Unfortunately I don't have a way to reproduce this consistently...

Expected behavior
The injector Pod and the Disruption get deleted.

Environment:

  • Kubernetes version: 1.20.11
  • Controller version: 5.2.1
  • Cloud provider (or local): EKS
  • Base OS for Kubernetes: Amazon Linux 2
  • Container runtime: docker

User Request: Release cloudProviders behind a feature flag in controller

Note: While chaos-controller is open to the public and we consider all suggestions for improvement, we prioritize feature development that is immediately applicable to chaos engineering initiatives within Datadog. We encourage users to contribute ideas to the repository directly in the form of pull requests!

Is your feature request related to a problem? Please describe.

  • #583 introduced the cloud managed service disruptions to address the limitations of dynamic IPs changing between DNS requests.
  • However, this has introduced a technical limitation for us, as our clusters are hosted behind a firewall and there is no rule in place allowing us to communicate with the external cloud providers.
  • While we could potentially look at opening these outbound connections, that might not be approved, and we would then not be able to run any version of the controller beyond 7.8.0.

Describe the solution you'd like

  • Release cloudProviders behind a feature flag: a configuration flag to enable/disable this feature in the controller's configmap.yaml (see the sketch after this list).
  • Allow IPRangesURL to be configurable so that the IP ranges can be fetched from an internal endpoint behind a proxy.
  • We would like both solutions to be implemented. Happy to raise another feature request if required.
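
As a rough illustration of the first point, a hypothetical shape for such a flag in the controller's configmap.yaml (the keys below are purely illustrative, not the controller's actual configuration schema):

# purely illustrative keys -- not the controller's actual configuration schema
controller:
  cloudProviders:
    # hypothetical kill switch: when disabled, the controller would not fetch
    # any cloud provider IP ranges at startup
    enabled: false
    aws:
      # hypothetical override so the IP ranges can be pulled from an internal
      # mirror/proxy instead of the public endpoint
      ipRangesURL: https://internal-mirror.example.com/aws/ip-ranges.json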

Describe alternatives you've considered

  • Open to ideas on other ways to toggle this feature in the controller.

User Issue: issue with cgroups in EKS

Hi folks ๐Ÿ‘‹

Thanks for open-sourcing this project. Great work.

Describe the bug

I am trying to run Disruptions in an EKS cluster and I'm facing issues with the cgroup path.

The error I am seeing when deploying a Disruption is the following:

{"level":"fatal","ts":1621863086297.7078,"caller":"injector/main.go:151","message":"can't create container object","disruptionName":"containers-targeting","disruptionNamespace":"chaos","targetName":"echoserver-7bfc76f9bd-pxtl6","error":"error getting cgroup path: unexpected cgroup format: /kubepods/besteffort/podcbd5ece0-ec07-4de2-a37f-54c4fc753f7e","stacktrace":"main.initConfig\n\t/go/src/github.com/DataDog/chaos-controller/cli/injector/main.go:151\ngithub.com/spf13/cobra.(*Command).preRun\n\t/go/pkg/mod/github.com/spf13/[email protected]/command.go:856\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/[email protected]/command.go:792\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/[email protected]/command.go:914\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/[email protected]/command.go:864\nmain.main\n\t/go/src/github.com/DataDog/chaos-controller/cli/injector/main.go:92\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204"}
The reported cgroup parent is /kubepods/besteffort/podcbd5ece0-ec07-4de2-a37f-54c4fc753f7e.

To Reproduce
These are the steps I followed on my setup.

  1. I installed all resources through the install.yaml file.
  2. I also deployed an echoserver app in a separate chaos namespace.
  3. Finally, I deployed the following Disruption:
apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: containers-targeting
  namespace: chaos
spec:
  level: pod
  selector:
    app: echoserver
  count: 100%
  network:
    drop: 100
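    # injecting this drop requires the injector to resolve the target pod's cgroup path
    # and write a net_cls classid, which is where the cgroup errors in this report come from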

Expected behavior
The injector pod runs the Disruption without errors.

Screenshots
N/A

Environment:

  • Kubernetes version: EKS 1.18
  • Controller version:
  • Cloud provider (or local): AWS
  • Base OS for Kubernetes: Amazon Linux 2
  • Container runtime: docker://19.3.6

Additional context
Based on the error message, the CGroupParent, which is parsed here, does not match the format the controller expects.

I then tried something like that (I am not sure if that makes sense?), but when the Disruption was injected I got the following error message:

{"level":"error","ts":1622047742022.986,"caller":"injector/main.go:235","message":"disruption injection failed","disruptionName":"containers-targeting","disruptionNamespace":"chaos","targetName":"echoserver-7bfc76f9bd-qt4bs","error":"error writing classid to pod net_cls cgroup: error opening cgroup file /mnt/cgroup/net_cls/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podd8ab9bfd_581f_4191_9b4d_90dab3356e0d.slice/net_cls.classid: open /mnt/cgroup/net_cls/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podd8ab9bfd_581f_4191_9b4d_90dab3356e0d.slice/net_cls.classid: no such file or directory","stacktrace":"main.injectAndWait\n\t/Users/nkatirtzis/github/chaos-controller/cli/injector/main.go:235\ngithub.com/spf13/cobra.(*Command).execute\n\t/Users/nkatirtzis/go/pkg/mod/github.com/spf13/[email protected]/command.go:830\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/Users/nkatirtzis/go/pkg/mod/github.com/spf13/[email protected]/command.go:914\ngithub.com/spf13/cobra.(*Command).Execute\n\t/Users/nkatirtzis/go/pkg/mod/github.com/spf13/[email protected]/command.go:864\nmain.main\n\t/Users/nkatirtzis/github/chaos-controller/cli/injector/main.go:92\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}

User Issue: Status not being a subresource causes issues with control planes

Describe the bug
Right now, the status is not defined as a subresource in the CRD. This means that the entire object gets updated, including its generation.

This causes issues, especially when integrating with control planes, like Kubefed; the control plane sees the object's generation getting updated and reconciles, causing an infinite loop.

I was curious if there was a particular reason behind the implementation. Is it possible to change this?

To Reproduce
The issue is more obvious when a control plane such as Kubefed is managing the resource.

Expected behavior
The status is defined as a subresource in the CRD. This would allow updating just the status rather than the entire object.

Additional information
Subresources in Kubebuilder: https://book.kubebuilder.io/reference/generating-crd.html?highlight=subresourc#subresources
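
For illustration, enabling the status subresource is a small change to the generated CRD manifest (with Kubebuilder it comes from adding the +kubebuilder:subresource:status marker on the API type); the relevant fragment looks roughly like this:

# excerpt of an apiextensions.k8s.io/v1 CustomResourceDefinition
spec:
  versions:
    - name: v1beta1
      served: true
      storage: true
      subresources:
        # with this block, status writes go through the /status endpoint and no
        # longer bump metadata.generation, so control planes like Kubefed stop
        # seeing spurious generation changes on every status update
        status: {}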

User Issue: 7.26.0 Upgrade Issues - CrashLoopBackOff

Describe the bug
Hi team, we recently tried to upgrade from version 7.13.1 to 7.26 and we are facing issues trying to achieve the same behaviour. We previously reached out to you about issues with dynamic targeting; we have since upgraded to your newest release and are hoping to get some further guidance.
We found that running a container failure (graceful and forceful, with dynamic and static targeting) causes the targeted pods to go into a CrashLoopBackOff for the duration of the experiment.

To Reproduce
Steps to reproduce the behavior:

  1. This is an example of a forceful container failure experiment
apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: container-failure-all-forced
  # Namespace should be your application namespace.
  namespace: chaos-engineering-framework
  annotations:
    chaos.datadoghq.com/environment: "dev"
spec:
  level: pod
  duration: 6m0s
  selector:
    app.kubernetes.io/name: "fault-injection-showcase"
    app: "fault-injection-showcase--plugin-master"
  count: 100%
  # setting staticTargeting to true is optional and only required if you want to run experiments with static targeting
  staticTargeting: true
  containerFailure:
    forced: true

For the forced container failure experiment, we found the following in the injector logs:

{"level":"warn","ts":1690464275041.3286,"caller":"injector/main.go:620","message":"couldn't watch targeted pod, retrying error an error occurred during reinjection: unable to reinitialize netns manager and cgroup manager for containerID 0809c40b517ade8875f9b3b55b9793474914b462e4a1ebe82c1c9832bd18c296 (PID: 15431): error creating network namespace manager: error getting given PID (15431) network namespace from path /mnt/host/proc/15431/ns/net: no such file or directory retrying 589.973741ms","disruptionName":"container-failure-all-forced","disruptionNamespace":"chaos-engineering-framework","targetName":"fault-injection-showcase-plugin-master-template-b785fd95b-lzbzh","targetNodeName":"ip-10-72-65-218.us-west-2.compute.internal"}

This could be related to our shift from dockershim to containerd as part of our Kubernetes upgrade to 1.24.
Expected behavior
We expect the targeted pods not to go into a CrashLoopBackOff during a container failure disruption.
Let us know if our expectations are incorrect.

Logs
chaos-controller-27-06.txt

injector_container-failure-forced.txt

Environment:

  • Kubernetes version: v1.24.15-eks-a5565ad
  • Controller version: 7.26.0
  • Cloud provider (or local): AWS EKS
  • Base OS for Kubernetes: Amazon Linux (kernel 5.4.242-156.349.amzn2.x86_64)

User Request: Self-signed certificates generated through Helm

Is your feature request related to a problem? Please describe.
Internally, we don't have cert-manager installed in certain clusters, which requires us to configure the admission webhook ourselves.

There are many ways to generate certificates, but it'd be nice for the charts to provide a solution that works out of the box for both cases: when cert-manager is installed and when it is not.

Describe the solution you'd like
I added some logic to our internal charts to support this via a certManager.enabled property in values.yaml. When cert-manager is not installed, we use Helm template functions to generate the certificates, very similar to the kubefed webhook implementation.

Would you be open to a PR for that? The only problem I can think of is that this is Helm-specific.
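
For context, a minimal sketch of the kind of template logic I mean, using Sprig's genCA/genSignedCert helpers when cert-manager is disabled (the secret name, namespace, and service DNS name below are placeholders, not the chart's actual values):

{{- if not .Values.certManager.enabled }}
{{- /* Hypothetical sketch: generate a throwaway CA and a serving certificate for the webhook service. */}}
{{- $svc := "chaos-controller-webhook-service.chaos-engineering.svc" }}
{{- $ca := genCA "chaos-controller-webhook-ca" 3650 }}
{{- $cert := genSignedCert $svc (list) (list $svc) 3650 $ca }}
apiVersion: v1
kind: Secret
metadata:
  name: chaos-controller-webhook-secret  # placeholder name
  namespace: chaos-engineering           # placeholder namespace
type: kubernetes.io/tls
data:
  tls.crt: {{ $cert.Cert | b64enc }}
  tls.key: {{ $cert.Key | b64enc }}
  ca.crt: {{ $ca.Cert | b64enc }}
{{- end }}

The same generated CA certificate would also have to be set as the caBundle on the webhook configuration, since cert-manager's CA injector is not available in that case.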
