
Comments (10)

ksatchit commented on July 24, 2024

@rbalagopalakrishna thanks for trying litmus. Can you share the following details:

  • K8s & LitmusChaos versions with container runtime being used in your cluster
  • ChaosEngine spec
  • Logs of the <chaosengine-name>-runner pod, the experiment pod (pod-network-latency-%hash%-%hash%) & the helper pod (pod-network-latency-%longer hash%)

To retain these pods (i.e., prevent their deletion at the end of the experiment), set spec.jobCleanUpPolicy: retain in the ChaosEngine definition instead of the default delete.
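
For reference, a minimal fragment showing where this field sits in the engine spec (engine name and field taken from the full spec shared in the next comment):

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: busybox-network-chaos
  namespace: default
spec:
  # keep the experiment & helper pods around after the run instead of deleting them (default: 'delete')
  jobCleanUpPolicy: 'retain'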

rbalagopalakrishna commented on July 24, 2024

Hi @ksatchit

K8s version: Client Version: v1.20.0, Server Version: v1.20.0
Litmus operator version: chaos-operator: v1.10.0
Container runtime: containerd://1.4.3

ChaosEngine Spec:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: busybox-network-chaos
  namespace: default
spec:
  jobCleanUpPolicy: 'retain'
  annotationCheck: 'false'
  engineState: 'active'
  auxiliaryAppInfo: ''
  monitoring: false
  appinfo:
    appns: 'default'
    applabel: 'name=networkchaos'
    appkind: 'deployment'
  chaosServiceAccount: pod-network-latency-sa
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_INTERFACE
              value: 'eth0'
            - name: LIB_IMAGE
              value: 'litmuschaos/go-runner:1.10.0'
            - name: NETWORK_LATENCY
              value: '500'
            - name: TOTAL_CHAOS_DURATION
              value: '300'
            - name: PODS_AFFECTED_PERC
              value: '100'
            - name: CONTAINER_RUNTIME
              value: 'containerd'
            - name: SOCKET_PATH
              value: '/run/containerd/containerd.sock'

Logs of the runner pod:

root@litmus:~# kubectl logs -f busybox-network-chaos-runner
W1215 03:42:21.039823       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2020-12-15T03:42:21Z" level=info msg="Experiments details are as follows" Service Account Name=pod-network-latency-sa Engine Namespace=default Experiments List="[pod-network-latency]" Engine Name=busybox-network-chaos appLabels="name=networkchaos" appKind=deployment
time="2020-12-15T03:42:51Z" level=error msg="Unable to send GA event, error: Post \"https://www.google-analytics.com/collect\": dial tcp 172.217.1.238:443: i/o timeout"
time="2020-12-15T03:42:51Z" level=info msg="Getting the ENV Variables"
time="2020-12-15T03:42:51Z" level=info msg="Preparing to run Chaos Experiment: pod-network-latency"
time="2020-12-15T03:42:51Z" level=info msg="Validating configmaps specified in the ChaosExperiment & ChaosEngine"
time="2020-12-15T03:42:52Z" level=info msg="Validating secrets specified in the ChaosExperiment & ChaosEngine"
time="2020-12-15T03:42:52Z" level=info msg="Validating HostFileVolumes details specified in the ChaosExperiment"
time="2020-12-15T03:42:52Z" level=error msg="Unable to create event, err: events \"ExperimentDependencyCheckpod-network-latencycc46e45e-6402-4699-9456-0ca98ece4a54\" already exists"
time="2020-12-15T03:42:52Z" level=error msg="Unable to create event, err: events \"ExperimentJobCreatepod-network-latencycc46e45e-6402-4699-9456-0ca98ece4a54\" already exists"
time="2020-12-15T03:42:52Z" level=info msg="Started Chaos Experiment Name: pod-network-latency, with Job Name: pod-network-latency-bdut9i"

Logs of the experiment pods:

# kubectl logs -f pod-network-latency-p68ci0-zx72z
W1215 03:39:37.462647       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2020-12-15T03:39:37Z" level=info msg="Experiment Name: pod-network-latency"
time="2020-12-15T03:39:37Z" level=info msg="[PreReq]: Getting the ENV for the  experiment"
time="2020-12-15T03:39:37Z" level=info msg="[PreReq]: Updating the chaos result of pod-network-latency experiment (SOT)"
time="2020-12-15T03:39:37Z" level=info msg="The application information is as follows\n" Namespace=default Label="name=networkchaos" Ramp Time=0
time="2020-12-15T03:39:37Z" level=info msg="[Status]: Verify that the AUT (Application Under Test) is running (pre-chaos)"
time="2020-12-15T03:39:37Z" level=info msg="[Status]: Checking whether application containers are in ready state"
time="2020-12-15T03:39:37Z" level=info msg="[Status]: The Container status are as follows" container=busybox Pod=busybox-deployment-2-78c56d468c-p8mnv Readiness=true
time="2020-12-15T03:39:39Z" level=info msg="[Status]: Checking whether application pods are in running state"
time="2020-12-15T03:39:39Z" level=info msg="[Status]: The running status of Pods are as follows" Status=Running Pod=busybox-deployment-2-78c56d468c-p8mnv
time="2020-12-15T03:39:41Z" level=info msg="[Chaos]:Number of pods targeted: 1"
time="2020-12-15T03:39:41Z" level=info msg="[Status]: Checking the status of the helper pods"
time="2020-12-15T03:39:41Z" level=info msg="[Status]: Checking whether application containers are in ready state"
time="2020-12-15T03:39:43Z" level=info msg="[Status]: Checking whether application pods are in running state"
time="2020-12-15T03:40:04Z" level=info msg="[Status]: The running status of Pods are as follows" Pod=pod-network-latency-qnembj Status=Running
time="2020-12-15T03:40:06Z" level=info msg="[Wait]: waiting till the completion of the helper pod"
time="2020-12-15T03:40:06Z" level=info msg="helper pod status: Failed"
time="2020-12-15T03:40:06Z" level=info msg="[Status]: The running status of Pods are as follows" Pod=pod-network-latency-qnembj Status=Failed
time="2020-12-15T03:40:07Z" level=error msg="Chaos injection failed, err: helper pod failed due to, err: <nil>"
root@litmus:~# kubectl logs -f pod-network-latency-bdut9i-mtksv
W1215 03:42:53.144587       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2020-12-15T03:42:53Z" level=info msg="Experiment Name: pod-network-latency"
time="2020-12-15T03:42:53Z" level=info msg="[PreReq]: Getting the ENV for the  experiment"
time="2020-12-15T03:42:53Z" level=info msg="[PreReq]: Updating the chaos result of pod-network-latency experiment (SOT)"
time="2020-12-15T03:42:53Z" level=info msg="The application information is as follows\n" Ramp Time=0 Namespace=default Label="name=networkchaos"
time="2020-12-15T03:42:53Z" level=info msg="[Status]: Verify that the AUT (Application Under Test) is running (pre-chaos)"
time="2020-12-15T03:42:53Z" level=info msg="[Status]: Checking whether application containers are in ready state"
time="2020-12-15T03:42:53Z" level=info msg="[Status]: The Container status are as follows" container=busybox Pod=busybox-deployment-2-78c56d468c-q4784 Readiness=true
time="2020-12-15T03:42:55Z" level=info msg="[Status]: Checking whether application pods are in running state"
time="2020-12-15T03:42:55Z" level=info msg="[Status]: The running status of Pods are as follows" Status=Running Pod=busybox-deployment-2-78c56d468c-q4784
time="2020-12-15T03:42:57Z" level=info msg="[Chaos]:Number of pods targeted: 1"
time="2020-12-15T03:42:57Z" level=info msg="[Status]: Checking the status of the helper pods"
time="2020-12-15T03:42:57Z" level=info msg="[Status]: Checking whether application containers are in ready state"
time="2020-12-15T03:42:59Z" level=info msg="[Status]: Checking whether application pods are in running state"
time="2020-12-15T03:43:26Z" level=info msg="[Status]: The running status of Pods are as follows" Pod=pod-network-latency-rfybkz Status=Running
time="2020-12-15T03:43:28Z" level=info msg="[Wait]: waiting till the completion of the helper pod"
time="2020-12-15T03:43:28Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:43:30Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:43:31Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:43:32Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:43:33Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:43:34Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:43:35Z" level=info msg="helper pod status: Running"
.
.
.
.
.
time="2020-12-15T03:45:11Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:45:12Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:45:13Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:45:14Z" level=info msg="helper pod status: Running"

Logs of the helper pod:

# kubectl logs -f pod-network-latency-rfybkz
W1215 03:43:25.991105    1962 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2020-12-15T03:43:26Z" level=info msg="[PreReq]: Getting the ENV variables"
time="2020-12-15T03:43:26Z" level=info msg="containerid: dbc1b8d2ea82cba1730ece4c9c5b6c3179aaee8e24e130bcf4de61e6217db89e"
time="2020-12-15T03:43:26Z" level=info msg="[cri]: Container ID=dbc1b8d2ea82cba1730ece4c9c5b6c3179aaee8e24e130bcf4de61e6217db89e has process PID=721"
time="2020-12-15T03:43:26Z" level=info msg="/bin/bash -c nsenter -t 721 -n tc qdisc add dev eth0 root netem delay 500ms"
time="2020-12-15T03:43:26Z" level=info msg="[Chaos]: Waiting for 300s"
time="2020-12-15T03:45:15Z" level=info msg="[Chaos]: Killing process started because of terminated signal received"

In the helper pod logs I see the line time="2020-12-15T03:45:15Z" level=info msg="[Chaos]: Killing process started because of terminated signal received". This is because I applied a patch to stop the chaos, using the command:
# kubectl patch chaosengine busybox-network-chaos -n default --type merge --patch '{"spec":{"engineState":"stop"}}'

Ping results after applying the patch; the injected 500 ms delay is still present:

PING 108.250.140.21 (108.250.140.21): 56 data bytes
64 bytes from 108.250.140.21: seq=0 ttl=63 time=502.495 ms
64 bytes from 108.250.140.21: seq=1 ttl=63 time=500.516 ms
64 bytes from 108.250.140.21: seq=2 ttl=63 time=500.510 ms
64 bytes from 108.250.140.21: seq=3 ttl=63 time=501.827 ms
64 bytes from 108.250.140.21: seq=4 ttl=63 time=503.103 ms
64 bytes from 108.250.140.21: seq=5 ttl=63 time=500.505 ms
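
One way to confirm that the netem rule is still installed (a sketch: the PID is the one reported in the helper log above, and in practice you would look it up again for the current target container; tc may not be available inside a busybox image, hence the nsenter form run from the node):

# from the node hosting the target pod, entering its network namespace the same way the helper did
nsenter -t 721 -n tc qdisc show dev eth0
# a leftover injection shows up as: qdisc netem ... delay 500ms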

If I try running the experiment again on the same target pod/interface, the helper pod goes into an error state (which in turn fails the experiment) with the logs below:

# kubectl logs -f pod-network-latency-qnembj
W1215 03:40:04.493066   26229 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2020-12-15T03:40:04Z" level=info msg="[PreReq]: Getting the ENV variables"
time="2020-12-15T03:40:04Z" level=info msg="containerid: 46b5e0c66222f3481acb376a2b76f294265a7b7368709827b7faa56b366619f9"
time="2020-12-15T03:40:04Z" level=info msg="[cri]: Container ID=46b5e0c66222f3481acb376a2b76f294265a7b7368709827b7faa56b366619f9 has process PID=12741"
time="2020-12-15T03:40:04Z" level=info msg="/bin/bash -c nsenter -t 12741 -n tc qdisc add dev eth0 root netem delay 500ms"
time="2020-12-15T03:40:04Z" level=error msg="RTNETLINK answers: File exists\n"
time="2020-12-15T03:40:04Z" level=fatal msg="helper pod failed, err: exit status 2"

rbalagopalakrishna commented on July 24, 2024

Any fix coming for this?

ksatchit commented on July 24, 2024

Hi @rbalagopalakrishna . Sorry for the delayed response. Yes, this is being looked into by the team. There have been some developments around a feature to enable retries for "chaos revert" operations to guarantee chaos rollback - in cases where the initial attempts fail.

To elaborate, this is being addressed in a few ways:

  • (A) Under normal execution (w/o intentional abort ops), retries of chaos revert are carried out before bailing out and ending the experiment:

    • To aid this, the helper pods are signaling the state of chaos (injected v/s reverted) via annotations to the chaosresult
    • The chaosresult status schema has been enhanced with fields to signal the eventual success/failure of the revert operation, helping users learn what has transpired before re-executing the experiment (which may otherwise hit the error you saw about existing tc rules: RTNETLINK answers: File exists)
  • (B) For an intentional abort (which causes an immediate force removal of all chaos pods): the ability to define terminationSeconds for the helper pods via the chaosengine schema, giving the revert more time/chance to execute successfully.

(A) is being trialed at this point (via these PRs: litmuschaos/chaos-operator#328 & #275). The changes are available in the image litmuschaos/go-runner:1.12.0-revert; you can give it a go to see the updated flow. Having said that, this is a TP (tech-preview) artifact.

(B) is scheduled to be part of a 1.13.x patch release towards the end of February.
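
Purely as an illustration of what (B) could look like once released (the field name below is an assumption modeled on the standard pod terminationGracePeriodSeconds; check the 1.13.x release notes for the actual schema):

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: busybox-network-chaos
spec:
  # hypothetical illustration: extra time granted to the chaos/helper pods on abort
  # so the revert (tc qdisc del) can complete before the pods are force-removed
  terminationGracePeriodSeconds: 100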

The workaround for the problem of continuing latency/loss is to restart (delete) the pod (as long as the pod is managed by a deployment/sts/daemonset, etc.); the newly created/scheduled pod gets a new network namespace, effectively removing the older processes/rules (see the sketch below). If you are running a pure unmanaged pod, you may need to kill the tc process on the sandbox/pause container associated with this pod.
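
For a deployment-managed pod, the workaround is a plain pod delete (names taken from the spec and logs above); the replacement pod comes up with a fresh network namespace and hence without the netem rule:

# delete the target pod; the Deployment recreates it with a clean network namespace
kubectl delete pod busybox-deployment-2-78c56d468c-q4784 -n default
# wait for the replacement to come back up
kubectl get pods -n default -l name=networkchaos -w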

Having said all this, I'm interested to know whether you face this issue upon every abort, or if it is inconsistent.

rbalagopalakrishna commented on July 24, 2024

Hi @ksatchit. Thank you for the update. I will try the litmuschaos/go-runner:1.12.0-revert image. I faced the issue upon every abort.

rbalagopalakrishna commented on July 24, 2024

Hi @ksatchit, I created a new litmus setup with the litmuschaos/go-runner:1.12.0-revert image. Now I see that the runner pod is not creating any of the other (experiment/helper) pods.

logs:

Runner pod spec:
$ kubectl describe pod nginx-network-chaos-runner -n litmus
Containers:
  chaos-runner:
    Container ID:  containerd://b09d67fb5343ec2e2a524b45aa5ad2c04b3a6a8b447d59782f003347884bccfb
    Image:         litmuschaos/go-runner:1.12.0-revert
    Image ID:      docker.io/litmuschaos/go-runner@sha256:e390bcfdd521a76b8fc932a8169035533e9d78e221d80f19a7abe85a430b0476
    Port:

The runner pod goes to Completed state without emitting any log info.
(screenshot attached: runner pod status, 2021-02-10)

ispeakc0de commented on July 24, 2024

Hi @rbalagopalakrishna,

Can you please try with the following manifests?

  1. chaos-operator
  2. chaosengine
  3. chaosexperiment

ispeakc0de commented on July 24, 2024

Hi @rbalagopalakrishna ,

We have added chaos revert logic to the network chaos experiments for cases where the experiment fails or is aborted.
You can use the following images to try it:

  1. chaos-operator: litmuschaos/chaos-operator:1.13.1-revert
  2. chaos-runner: litmuschaos/chaos-runner:1.13.0
  3. go-runner: litmuschaos/go-runner:1.13.1-revert

PR reference for this: PR1, PR2
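
A sketch of where these images plug in, following the spec style used earlier in this thread: the chaos-operator image is changed on the operator's own Deployment, while the runner and go-runner images can be referenced from the ChaosEngine (field names here follow the 1.x ChaosEngine schema; verify against the linked manifests):

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: busybox-network-chaos
spec:
  components:
    runner:
      # chaos-runner image used by the operator for this engine
      image: 'litmuschaos/chaos-runner:1.13.0'
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            # helper/lib image with the revert logic
            - name: LIB_IMAGE
              value: 'litmuschaos/go-runner:1.13.1-revert'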

rbalagopalakrishna commented on July 24, 2024

Hi @ispeakc0de,

+1

The *-revert images are able to revert the network chaos successfully after applying the abort patch.
(screenshots attached: the abort patch command and the post-revert ping results)

Attaching ping logs (duration 60 seconds; the patch was applied 10 s after the chaos started).

ksatchit commented on July 24, 2024

With the referred images, revert & reruns are successful in both 1.13.1 & 1.13.1-revert (the latter having a few more indicators in the result for a successful revert).

Closing this issue, please feel free to re-open or raise a new one upon any problems.
