
Comments (10)

ksatchit commented on July 24, 2024

@rbalagopalakrishna thanks for trying litmus. Can you share the following details:

  • K8s & LitmusChaos versions with container runtime being used in your cluster
  • ChaosEngine spec
  • Logs of the <chaosengine-name>-runner pod, the experiment pod (pod-network-latency-%hash%-%hash%) & the helper pod (pod-network-latency-%longer hash%)

To retain these pods (i.e., prevent their deletion at the end of the experiment), set spec.jobCleanUpPolicy: retain in the ChaosEngine definition instead of the default delete.
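
For reference, a minimal fragment showing where this field sits in the engine spec (engine name and field taken from the full spec shared in the next comment):

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: busybox-network-chaos
  namespace: default
spec:
  # keep the experiment & helper pods around after the run instead of deleting them (default: 'delete')
  jobCleanUpPolicy: 'retain'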

rbalagopalakrishna commented on July 24, 2024

Hi @ksatchit

K8s version: Client Version: v1.20.0, Server Version: v1.20.0
Litmus operator version: chaos-operator: v1.10.0
Container runtime: containerd://1.4.3

ChaosEngine Spec:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: busybox-network-chaos
  namespace: default
spec:
  jobCleanUpPolicy: 'retain'
  annotationCheck: 'false'
  engineState: 'active'
  auxiliaryAppInfo: ''
  monitoring: false
  appinfo:
    appns: 'default'
    applabel: 'name=networkchaos'
    appkind: 'deployment'
  chaosServiceAccount: pod-network-latency-sa
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_INTERFACE
              value: 'eth0'
            - name: LIB_IMAGE
              value: 'litmuschaos/go-runner:1.10.0'
            - name: NETWORK_LATENCY
              value: '500'
            - name: TOTAL_CHAOS_DURATION
              value: '300'
            - name: PODS_AFFECTED_PERC
              value: '100'
            - name: CONTAINER_RUNTIME
              value: 'containerd'
            - name: SOCKET_PATH
              value: '/run/containerd/containerd.sock'

Logs of the runner pod:

root@litmus:~# kubectl logs -f busybox-network-chaos-runner
W1215 03:42:21.039823       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2020-12-15T03:42:21Z" level=info msg="Experiments details are as follows" Service Account Name=pod-network-latency-sa Engine Namespace=default Experiments List="[pod-network-latency]" Engine Name=busybox-network-chaos appLabels="name=networkchaos" appKind=deployment
time="2020-12-15T03:42:51Z" level=error msg="Unable to send GA event, error: Post \"https://www.google-analytics.com/collect\": dial tcp 172.217.1.238:443: i/o timeout"
time="2020-12-15T03:42:51Z" level=info msg="Getting the ENV Variables"
time="2020-12-15T03:42:51Z" level=info msg="Preparing to run Chaos Experiment: pod-network-latency"
time="2020-12-15T03:42:51Z" level=info msg="Validating configmaps specified in the ChaosExperiment & ChaosEngine"
time="2020-12-15T03:42:52Z" level=info msg="Validating secrets specified in the ChaosExperiment & ChaosEngine"
time="2020-12-15T03:42:52Z" level=info msg="Validating HostFileVolumes details specified in the ChaosExperiment"
time="2020-12-15T03:42:52Z" level=error msg="Unable to create event, err: events \"ExperimentDependencyCheckpod-network-latencycc46e45e-6402-4699-9456-0ca98ece4a54\" already exists"
time="2020-12-15T03:42:52Z" level=error msg="Unable to create event, err: events \"ExperimentJobCreatepod-network-latencycc46e45e-6402-4699-9456-0ca98ece4a54\" already exists"
time="2020-12-15T03:42:52Z" level=info msg="Started Chaos Experiment Name: pod-network-latency, with Job Name: pod-network-latency-bdut9i"

Logs of the experiment pods:

# kubectl logs -f pod-network-latency-p68ci0-zx72z
W1215 03:39:37.462647       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2020-12-15T03:39:37Z" level=info msg="Experiment Name: pod-network-latency"
time="2020-12-15T03:39:37Z" level=info msg="[PreReq]: Getting the ENV for the  experiment"
time="2020-12-15T03:39:37Z" level=info msg="[PreReq]: Updating the chaos result of pod-network-latency experiment (SOT)"
time="2020-12-15T03:39:37Z" level=info msg="The application information is as follows\n" Namespace=default Label="name=networkchaos" Ramp Time=0
time="2020-12-15T03:39:37Z" level=info msg="[Status]: Verify that the AUT (Application Under Test) is running (pre-chaos)"
time="2020-12-15T03:39:37Z" level=info msg="[Status]: Checking whether application containers are in ready state"
time="2020-12-15T03:39:37Z" level=info msg="[Status]: The Container status are as follows" container=busybox Pod=busybox-deployment-2-78c56d468c-p8mnv Readiness=true
time="2020-12-15T03:39:39Z" level=info msg="[Status]: Checking whether application pods are in running state"
time="2020-12-15T03:39:39Z" level=info msg="[Status]: The running status of Pods are as follows" Status=Running Pod=busybox-deployment-2-78c56d468c-p8mnv
time="2020-12-15T03:39:41Z" level=info msg="[Chaos]:Number of pods targeted: 1"
time="2020-12-15T03:39:41Z" level=info msg="[Status]: Checking the status of the helper pods"
time="2020-12-15T03:39:41Z" level=info msg="[Status]: Checking whether application containers are in ready state"
time="2020-12-15T03:39:43Z" level=info msg="[Status]: Checking whether application pods are in running state"
time="2020-12-15T03:40:04Z" level=info msg="[Status]: The running status of Pods are as follows" Pod=pod-network-latency-qnembj Status=Running
time="2020-12-15T03:40:06Z" level=info msg="[Wait]: waiting till the completion of the helper pod"
time="2020-12-15T03:40:06Z" level=info msg="helper pod status: Failed"
time="2020-12-15T03:40:06Z" level=info msg="[Status]: The running status of Pods are as follows" Pod=pod-network-latency-qnembj Status=Failed
time="2020-12-15T03:40:07Z" level=error msg="Chaos injection failed, err: helper pod failed due to, err: <nil>"
root@litmus:~# kubectl logs -f pod-network-latency-bdut9i-mtksv
W1215 03:42:53.144587       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2020-12-15T03:42:53Z" level=info msg="Experiment Name: pod-network-latency"
time="2020-12-15T03:42:53Z" level=info msg="[PreReq]: Getting the ENV for the  experiment"
time="2020-12-15T03:42:53Z" level=info msg="[PreReq]: Updating the chaos result of pod-network-latency experiment (SOT)"
time="2020-12-15T03:42:53Z" level=info msg="The application information is as follows\n" Ramp Time=0 Namespace=default Label="name=networkchaos"
time="2020-12-15T03:42:53Z" level=info msg="[Status]: Verify that the AUT (Application Under Test) is running (pre-chaos)"
time="2020-12-15T03:42:53Z" level=info msg="[Status]: Checking whether application containers are in ready state"
time="2020-12-15T03:42:53Z" level=info msg="[Status]: The Container status are as follows" container=busybox Pod=busybox-deployment-2-78c56d468c-q4784 Readiness=true
time="2020-12-15T03:42:55Z" level=info msg="[Status]: Checking whether application pods are in running state"
time="2020-12-15T03:42:55Z" level=info msg="[Status]: The running status of Pods are as follows" Status=Running Pod=busybox-deployment-2-78c56d468c-q4784
time="2020-12-15T03:42:57Z" level=info msg="[Chaos]:Number of pods targeted: 1"
time="2020-12-15T03:42:57Z" level=info msg="[Status]: Checking the status of the helper pods"
time="2020-12-15T03:42:57Z" level=info msg="[Status]: Checking whether application containers are in ready state"
time="2020-12-15T03:42:59Z" level=info msg="[Status]: Checking whether application pods are in running state"
time="2020-12-15T03:43:26Z" level=info msg="[Status]: The running status of Pods are as follows" Pod=pod-network-latency-rfybkz Status=Running
time="2020-12-15T03:43:28Z" level=info msg="[Wait]: waiting till the completion of the helper pod"
time="2020-12-15T03:43:28Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:43:30Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:43:31Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:43:32Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:43:33Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:43:34Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:43:35Z" level=info msg="helper pod status: Running"
.
.
.
.
.
time="2020-12-15T03:45:11Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:45:12Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:45:13Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:45:14Z" level=info msg="helper pod status: Running"

Logs of the helper pod:

# kubectl logs -f pod-network-latency-rfybkz
W1215 03:43:25.991105    1962 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2020-12-15T03:43:26Z" level=info msg="[PreReq]: Getting the ENV variables"
time="2020-12-15T03:43:26Z" level=info msg="containerid: dbc1b8d2ea82cba1730ece4c9c5b6c3179aaee8e24e130bcf4de61e6217db89e"
time="2020-12-15T03:43:26Z" level=info msg="[cri]: Container ID=dbc1b8d2ea82cba1730ece4c9c5b6c3179aaee8e24e130bcf4de61e6217db89e has process PID=721"
time="2020-12-15T03:43:26Z" level=info msg="/bin/bash -c nsenter -t 721 -n tc qdisc add dev eth0 root netem delay 500ms"
time="2020-12-15T03:43:26Z" level=info msg="[Chaos]: Waiting for 300s"
time="2020-12-15T03:45:15Z" level=info msg="[Chaos]: Killing process started because of terminated signal received"

In the helper pod logs I see the line time="2020-12-15T03:45:15Z" level=info msg="[Chaos]: Killing process started because of terminated signal received". This is because I applied a patch to stop the chaos, using the command:
# kubectl patch chaosengine busybox-network-chaos -n default --type merge --patch '{"spec":{"engineState":"stop"}}'

Ping results after applying the patch; the injected 500 ms delay is still present:

PING 108.250.140.21 (108.250.140.21): 56 data bytes
64 bytes from 108.250.140.21: seq=0 ttl=63 time=502.495 ms
64 bytes from 108.250.140.21: seq=1 ttl=63 time=500.516 ms
64 bytes from 108.250.140.21: seq=2 ttl=63 time=500.510 ms
64 bytes from 108.250.140.21: seq=3 ttl=63 time=501.827 ms
64 bytes from 108.250.140.21: seq=4 ttl=63 time=503.103 ms
64 bytes from 108.250.140.21: seq=5 ttl=63 time=500.505 ms
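
One way to confirm that the netem rule is still installed (a sketch: the PID is the one reported in the helper log above, and in practice you would look it up again for the current target container; tc may not be available inside a busybox image, hence the nsenter form run from the node):

# from the node hosting the target pod, entering its network namespace the same way the helper did
nsenter -t 721 -n tc qdisc show dev eth0
# a leftover injection shows up as: qdisc netem ... delay 500ms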

If I try running the experiment again on the same target pod/interface, the helper pod goes into an error state (which in turn fails the experiment) with the logs below:

# kubectl logs -f pod-network-latency-qnembj
W1215 03:40:04.493066   26229 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2020-12-15T03:40:04Z" level=info msg="[PreReq]: Getting the ENV variables"
time="2020-12-15T03:40:04Z" level=info msg="containerid: 46b5e0c66222f3481acb376a2b76f294265a7b7368709827b7faa56b366619f9"
time="2020-12-15T03:40:04Z" level=info msg="[cri]: Container ID=46b5e0c66222f3481acb376a2b76f294265a7b7368709827b7faa56b366619f9 has process PID=12741"
time="2020-12-15T03:40:04Z" level=info msg="/bin/bash -c nsenter -t 12741 -n tc qdisc add dev eth0 root netem delay 500ms"
time="2020-12-15T03:40:04Z" level=error msg="RTNETLINK answers: File exists\n"
time="2020-12-15T03:40:04Z" level=fatal msg="helper pod failed, err: exit status 2"

rbalagopalakrishna commented on July 24, 2024

Any fix coming for this?

ksatchit commented on July 24, 2024

Hi @rbalagopalakrishna . Sorry for the delayed response. Yes, this is being looked into by the team. There have been some developments around a feature to enable retries for "chaos revert" operations to guarantee chaos rollback - in cases where the initial attempts fail.

To elaborate, this is being addressed in a few ways:

  • (A) Under normal execution (w/o intentional abort ops), retries of chaos revert are carried out before bailing out and ending the experiment:

    • To aid this, the helper pods are signaling the state of chaos (injected v/s reverted) via annotations to the chaosresult
    • The chaosresult status schema has been enhanced with fields to signal the eventual success/failure of the revert operation, helping users learn what has transpired before re-executing the experiment (which may otherwise hit the error you saw about existing tc rules: RTNETLINK answers: File exists)
  • (B) For an intentional abort (which causes an immediate force removal of all chaos pods): the ability to define terminationSeconds for the helper pods via the chaosengine schema, giving the revert more time/chance to execute successfully.

(A) is being trialed at this point (via these PRs: litmuschaos/chaos-operator#328 & #275). The changes are available in the image litmuschaos/go-runner:1.12.0-revert; you can give it a go to see the updated flow. Having said that, this is a TP (tech-preview) artifact.

(B) is scheduled to be part of a 1.13.x patch release towards the end of February.
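
Purely as an illustration of what (B) could look like once released (the field name below is an assumption modeled on the standard pod terminationGracePeriodSeconds; check the 1.13.x release notes for the actual schema):

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: busybox-network-chaos
spec:
  # hypothetical illustration: extra time granted to the chaos/helper pods on abort
  # so the revert (tc qdisc del) can complete before the pods are force-removed
  terminationGracePeriodSeconds: 100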

The workaround for the problem of continuing latency/loss is to restart (delete) the pod (as long as the pod is managed by a deployment/sts/daemonset, etc.); the newly created/scheduled pod gets a new network namespace, effectively removing the older processes/rules (see the sketch below). If you are running a pure unmanaged pod, you may need to kill the tc process on the sandbox/pause container associated with this pod.
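
For a deployment-managed pod, the workaround is a plain pod delete (names taken from the spec and logs above); the replacement pod comes up with a fresh network namespace and hence without the netem rule:

# delete the target pod; the Deployment recreates it with a clean network namespace
kubectl delete pod busybox-deployment-2-78c56d468c-q4784 -n default
# wait for the replacement to come back up
kubectl get pods -n default -l name=networkchaos -w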

Having said all this, I'm interested to know whether you face this issue upon every abort, or if it is inconsistent.

rbalagopalakrishna commented on July 24, 2024

Hi @ksatchit. Thank you for the update. I will try the litmuschaos/go-runner:1.12.0-revert image. I faced the issue upon every abort.

rbalagopalakrishna commented on July 24, 2024

Hi @ksatchit, I created a new litmus setup with the litmuschaos/go-runner:1.12.0-revert image. Now I see that the runner pod is not creating any of the other (experiment/helper) pods.

logs:

Runner pod spec:
$ kubectl describe pod nginx-network-chaos-runner -n litmus
Containers:
  chaos-runner:
    Container ID:  containerd://b09d67fb5343ec2e2a524b45aa5ad2c04b3a6a8b447d59782f003347884bccfb
    Image:         litmuschaos/go-runner:1.12.0-revert
    Image ID:      docker.io/litmuschaos/go-runner@sha256:e390bcfdd521a76b8fc932a8169035533e9d78e221d80f19a7abe85a430b0476
    Port:

The runner pod goes to Completed state without emitting any log info.
(screenshot attached: runner pod status, 2021-02-10)

ispeakc0de commented on July 24, 2024

Hi @rbalagopalakrishna,

Can you please try with the following manifests?

  1. chaos-operator
  2. chaosengine
  3. chaosexperiment

ispeakc0de commented on July 24, 2024

Hi @rbalagopalakrishna ,

We have added chaos revert logic to the network chaos experiments for cases where the experiment fails or is aborted.
You can use the following images to try it:

  1. chaos-operator: litmuschaos/chaos-operator:1.13.1-revert
  2. chaos-runner: litmuschaos/chaos-runner:1.13.0
  3. go-runner: litmuschaos/go-runner:1.13.1-revert

PR reference for this: PR1, PR2
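
A sketch of where these images plug in, following the spec style used earlier in this thread: the chaos-operator image is changed on the operator's own Deployment, while the runner and go-runner images can be referenced from the ChaosEngine (field names here follow the 1.x ChaosEngine schema; verify against the linked manifests):

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: busybox-network-chaos
spec:
  components:
    runner:
      # chaos-runner image used by the operator for this engine
      image: 'litmuschaos/chaos-runner:1.13.0'
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            # helper/lib image with the revert logic
            - name: LIB_IMAGE
              value: 'litmuschaos/go-runner:1.13.1-revert'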

rbalagopalakrishna commented on July 24, 2024

Hi @ispeakc0de,

+1

The *-revert images are able to revert the network chaos successfully after applying the abort patch.
(screenshots attached: the abort patch command and the post-revert ping results)

Attaching ping logs (duration 60 seconds; the patch was applied 10 s after the chaos started).

ksatchit commented on July 24, 2024

With the referred images, revert & reruns are successful in both 1.13.1 & 1.13.1-revert (the latter having a few more indicators in the result for a successful revert).

Closing this issue, please feel free to re-open or raise a new one upon any problems.
