Comments (10)
@rbalagopalakrishna thanks for trying Litmus. Can you share the following details:
- K8s & LitmusChaos versions, along with the container runtime used in your cluster
- The ChaosEngine spec
- Logs of the <chaosengine-name>-runner pod, the experiment pod (pod-network-latency-%hash%-%hash%) & the helper pod (pod-network-latency-%longer hash%)
To retain these pods (i.e., prevent them from being deleted at the end of the experiment), set spec.jobCleanUpPolicy: retain
in the ChaosEngine definition instead of the default delete.
from litmus-go.
Hi @ksatchit
- K8s version: Client Version: v1.20.0, Server Version: v1.20.0
- Litmus operator version: chaos-operator: v1.10.0
- Container runtime: containerd://1.4.3
ChaosEngine Spec:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: busybox-network-chaos
  namespace: default
spec:
  jobCleanUpPolicy: 'retain'
  annotationCheck: 'false'
  engineState: 'active'
  auxiliaryAppInfo: ''
  monitoring: false
  appinfo:
    appns: 'default'
    applabel: 'name=networkchaos'
    appkind: 'deployment'
  chaosServiceAccount: pod-network-latency-sa
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_INTERFACE
              value: 'eth0'
            - name: LIB_IMAGE
              value: 'litmuschaos/go-runner:1.10.0'
            - name: NETWORK_LATENCY
              value: '500'
            - name: TOTAL_CHAOS_DURATION
              value: '300'
            - name: PODS_AFFECTED_PERC
              value: '100'
            - name: CONTAINER_RUNTIME
              value: 'containerd'
            - name: SOCKET_PATH
              value: '/run/containerd/containerd.sock'
Logs of the runner pod:
root@litmus:~# kubectl logs -f busybox-network-chaos-runner
W1215 03:42:21.039823 1 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2020-12-15T03:42:21Z" level=info msg="Experiments details are as follows" Service Account Name=pod-network-latency-sa Engine Namespace=default Experiments List="[pod-network-latency]" Engine Name=busybox-network-chaos appLabels="name=networkchaos" appKind=deployment
time="2020-12-15T03:42:51Z" level=error msg="Unable to send GA event, error: Post \"https://www.google-analytics.com/collect\": dial tcp 172.217.1.238:443: i/o timeout"
time="2020-12-15T03:42:51Z" level=info msg="Getting the ENV Variables"
time="2020-12-15T03:42:51Z" level=info msg="Preparing to run Chaos Experiment: pod-network-latency"
time="2020-12-15T03:42:51Z" level=info msg="Validating configmaps specified in the ChaosExperiment & ChaosEngine"
time="2020-12-15T03:42:52Z" level=info msg="Validating secrets specified in the ChaosExperiment & ChaosEngine"
time="2020-12-15T03:42:52Z" level=info msg="Validating HostFileVolumes details specified in the ChaosExperiment"
time="2020-12-15T03:42:52Z" level=error msg="Unable to create event, err: events \"ExperimentDependencyCheckpod-network-latencycc46e45e-6402-4699-9456-0ca98ece4a54\" already exists"
time="2020-12-15T03:42:52Z" level=error msg="Unable to create event, err: events \"ExperimentJobCreatepod-network-latencycc46e45e-6402-4699-9456-0ca98ece4a54\" already exists"
time="2020-12-15T03:42:52Z" level=info msg="Started Chaos Experiment Name: pod-network-latency, with Job Name: pod-network-latency-bdut9i"
Logs of the experiment pod (per the naming convention above, pod-network-latency-%hash%-%hash%):
# kubectl logs -f pod-network-latency-p68ci0-zx72z
W1215 03:39:37.462647 1 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2020-12-15T03:39:37Z" level=info msg="Experiment Name: pod-network-latency"
time="2020-12-15T03:39:37Z" level=info msg="[PreReq]: Getting the ENV for the experiment"
time="2020-12-15T03:39:37Z" level=info msg="[PreReq]: Updating the chaos result of pod-network-latency experiment (SOT)"
time="2020-12-15T03:39:37Z" level=info msg="The application information is as follows\n" Namespace=default Label="name=networkchaos" Ramp Time=0
time="2020-12-15T03:39:37Z" level=info msg="[Status]: Verify that the AUT (Application Under Test) is running (pre-chaos)"
time="2020-12-15T03:39:37Z" level=info msg="[Status]: Checking whether application containers are in ready state"
time="2020-12-15T03:39:37Z" level=info msg="[Status]: The Container status are as follows" container=busybox Pod=busybox-deployment-2-78c56d468c-p8mnv Readiness=true
time="2020-12-15T03:39:39Z" level=info msg="[Status]: Checking whether application pods are in running state"
time="2020-12-15T03:39:39Z" level=info msg="[Status]: The running status of Pods are as follows" Status=Running Pod=busybox-deployment-2-78c56d468c-p8mnv
time="2020-12-15T03:39:41Z" level=info msg="[Chaos]:Number of pods targeted: 1"
time="2020-12-15T03:39:41Z" level=info msg="[Status]: Checking the status of the helper pods"
time="2020-12-15T03:39:41Z" level=info msg="[Status]: Checking whether application containers are in ready state"
time="2020-12-15T03:39:43Z" level=info msg="[Status]: Checking whether application pods are in running state"
time="2020-12-15T03:40:04Z" level=info msg="[Status]: The running status of Pods are as follows" Pod=pod-network-latency-qnembj Status=Running
time="2020-12-15T03:40:06Z" level=info msg="[Wait]: waiting till the completion of the helper pod"
time="2020-12-15T03:40:06Z" level=info msg="helper pod status: Failed"
time="2020-12-15T03:40:06Z" level=info msg="[Status]: The running status of Pods are as follows" Pod=pod-network-latency-qnembj Status=Failed
time="2020-12-15T03:40:07Z" level=error msg="Chaos injection failed, err: helper pod failed due to, err: <nil>"
root@litmus:~# kubectl logs -f pod-network-latency-bdut9i-mtksv
W1215 03:42:53.144587 1 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2020-12-15T03:42:53Z" level=info msg="Experiment Name: pod-network-latency"
time="2020-12-15T03:42:53Z" level=info msg="[PreReq]: Getting the ENV for the experiment"
time="2020-12-15T03:42:53Z" level=info msg="[PreReq]: Updating the chaos result of pod-network-latency experiment (SOT)"
time="2020-12-15T03:42:53Z" level=info msg="The application information is as follows\n" Ramp Time=0 Namespace=default Label="name=networkchaos"
time="2020-12-15T03:42:53Z" level=info msg="[Status]: Verify that the AUT (Application Under Test) is running (pre-chaos)"
time="2020-12-15T03:42:53Z" level=info msg="[Status]: Checking whether application containers are in ready state"
time="2020-12-15T03:42:53Z" level=info msg="[Status]: The Container status are as follows" container=busybox Pod=busybox-deployment-2-78c56d468c-q4784 Readiness=true
time="2020-12-15T03:42:55Z" level=info msg="[Status]: Checking whether application pods are in running state"
time="2020-12-15T03:42:55Z" level=info msg="[Status]: The running status of Pods are as follows" Status=Running Pod=busybox-deployment-2-78c56d468c-q4784
time="2020-12-15T03:42:57Z" level=info msg="[Chaos]:Number of pods targeted: 1"
time="2020-12-15T03:42:57Z" level=info msg="[Status]: Checking the status of the helper pods"
time="2020-12-15T03:42:57Z" level=info msg="[Status]: Checking whether application containers are in ready state"
time="2020-12-15T03:42:59Z" level=info msg="[Status]: Checking whether application pods are in running state"
time="2020-12-15T03:43:26Z" level=info msg="[Status]: The running status of Pods are as follows" Pod=pod-network-latency-rfybkz Status=Running
time="2020-12-15T03:43:28Z" level=info msg="[Wait]: waiting till the completion of the helper pod"
time="2020-12-15T03:43:28Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:43:30Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:43:31Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:43:32Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:43:33Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:43:34Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:43:35Z" level=info msg="helper pod status: Running"
.
.
.
.
.
time="2020-12-15T03:45:11Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:45:12Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:45:13Z" level=info msg="helper pod status: Running"
time="2020-12-15T03:45:14Z" level=info msg="helper pod status: Running"
Logs of the helper pod (per the naming convention above, pod-network-latency-%hash%):
# kubectl logs -f pod-network-latency-rfybkz
W1215 03:43:25.991105 1962 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2020-12-15T03:43:26Z" level=info msg="[PreReq]: Getting the ENV variables"
time="2020-12-15T03:43:26Z" level=info msg="containerid: dbc1b8d2ea82cba1730ece4c9c5b6c3179aaee8e24e130bcf4de61e6217db89e"
time="2020-12-15T03:43:26Z" level=info msg="[cri]: Container ID=dbc1b8d2ea82cba1730ece4c9c5b6c3179aaee8e24e130bcf4de61e6217db89e has process PID=721"
time="2020-12-15T03:43:26Z" level=info msg="/bin/bash -c nsenter -t 721 -n tc qdisc add dev eth0 root netem delay 500ms"
time="2020-12-15T03:43:26Z" level=info msg="[Chaos]: Waiting for 300s"
time="2020-12-15T03:45:15Z" level=info msg="[Chaos]: Killing process started because of terminated signal received"
In the logs above I see time="2020-12-15T03:45:15Z" level=info msg="[Chaos]: Killing process started because of terminated signal received". This is because I applied a patch to stop the chaos using the command:
# kubectl patch chaosengine busybox-network-chaos -n default --type merge --patch '{"spec":{"engineState":"stop"}}'
Ping results after applying the patch show that the 500ms delay is still in place:
PING 108.250.140.21 (108.250.140.21): 56 data bytes
64 bytes from 108.250.140.21: seq=0 ttl=63 time=502.495 ms
64 bytes from 108.250.140.21: seq=1 ttl=63 time=500.516 ms
64 bytes from 108.250.140.21: seq=2 ttl=63 time=500.510 ms
64 bytes from 108.250.140.21: seq=3 ttl=63 time=501.827 ms
64 bytes from 108.250.140.21: seq=4 ttl=63 time=503.103 ms
64 bytes from 108.250.140.21: seq=5 ttl=63 time=500.505 ms
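For context on why the delay persists: the helper injects latency by attaching a tc netem qdisc inside the target container's network namespace, and the revert deletes that qdisc when the chaos duration ends. An abort that force-kills the helper before the revert step leaves the qdisc in place. A minimal sketch of the inject/revert pair (the PID 721 is taken from the logs above; running this requires root on the node):

```shell
# Inject: add 500ms delay on eth0 inside the target's network namespace.
nsenter -t 721 -n tc qdisc add dev eth0 root netem delay 500ms

# Revert: delete the root qdisc; this is the step that never ran
# because the helper was force-killed on abort.
nsenter -t 721 -n tc qdisc del dev eth0 root netem
```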
If I try running the experiment again on the same target pod interface, the helper pod goes into an error state with the logs below:
# kubectl logs -f pod-network-latency-qnembj
W1215 03:40:04.493066 26229 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2020-12-15T03:40:04Z" level=info msg="[PreReq]: Getting the ENV variables"
time="2020-12-15T03:40:04Z" level=info msg="containerid: 46b5e0c66222f3481acb376a2b76f294265a7b7368709827b7faa56b366619f9"
time="2020-12-15T03:40:04Z" level=info msg="[cri]: Container ID=46b5e0c66222f3481acb376a2b76f294265a7b7368709827b7faa56b366619f9 has process PID=12741"
time="2020-12-15T03:40:04Z" level=info msg="/bin/bash -c nsenter -t 12741 -n tc qdisc add dev eth0 root netem delay 500ms"
time="2020-12-15T03:40:04Z" level=error msg="RTNETLINK answers: File exists\n"
time="2020-12-15T03:40:04Z" level=fatal msg="helper pod failed, err: exit status 2"
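The RTNETLINK answers: File exists error comes from tc itself: tc qdisc add fails when a root qdisc is already attached to the interface, which is exactly the leftover netem from the aborted run. Assuming node access and that crictl and jq are available (both are assumptions here), a manual cleanup could look like:

```shell
# Resolve the target container's PID from the runtime, then delete the
# stale root qdisc in its network namespace. <container-id> is a placeholder.
PID=$(crictl inspect <container-id> | jq '.info.pid')
nsenter -t "$PID" -n tc qdisc del dev eth0 root netem

# Verify eth0 is back on the default qdisc (no netem entry expected).
nsenter -t "$PID" -n tc qdisc show dev eth0
```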
Any fix coming for this?
Hi @rbalagopalakrishna . Sorry for the delayed response. Yes, this is being looked into by the team. There have been some developments around a feature to enable retries for "chaos revert" operations to guarantee chaos rollback - in cases where the initial attempts fail.
To elaborate, this is being addressed in a few ways:

(A) Under normal execution (without intentional abort operations), retries of the chaos revert are carried out before bailing out and ending the experiment:
- To aid this, the helper pods signal the state of chaos (injected vs. reverted) via annotations on the chaosresult
- The chaosresult status schema has been enhanced with fields that indicate the eventual success/failure of the revert operation, helping users learn what has transpired before re-executing the experiment (which may otherwise hit the error you saw about pre-existing tc rules: RTNETLINK answers: File exists)

(B) For intentional aborts (which cause an immediate force removal of all chaos pods): the ability to define terminationSeconds for the helper pods via the chaosengine schema, giving the revert more time/chance to execute successfully.

(A) is being trialed at this point (via these PRs: litmuschaos/chaos-operator#328 & #275). The changes are available in the image litmuschaos/go-runner:1.12.0-revert; you can give it a go to see the updated flow. Note, however, that this is a TP (tech-preview) artifact.

(B) is scheduled to be part of a 1.13.x patch release towards the end of February.
The workaround for the problem of continuing latency/loss is to restart (delete) the pod (as long as the pod is managed by a deployment/statefulset/daemonset, etc.); the newly created/scheduled pod gets a new network namespace, effectively removing the older processes/rules. If you are running a pure unmanaged pod, you may need to remove the tc rules in the network namespace of the sandbox/pause container associated with this pod.
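Sketching that workaround with the pod from the logs above (the deployment name busybox-deployment-2 is inferred from the pod name, so treat it as an assumption):

```shell
# Deleting the managed pod makes its controller recreate it with a fresh
# network namespace, which drops the leftover netem qdisc.
kubectl delete pod busybox-deployment-2-78c56d468c-p8mnv -n default

# Wait for the replacement pod to become available.
kubectl rollout status deployment/busybox-deployment-2 -n default
```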
Having said all this, I am interested to know whether you face this issue upon every abort, or only inconsistently?
Hi @ksatchit. Thank you for the update. I will try the litmuschaos/go-runner:1.12.0-revert image. I faced the issue upon every abort.
Hi @ksatchit, I created a new Litmus setup with the litmuschaos/go-runner:1.12.0-revert image. Now I see that the runner pod is not creating any helper pods.
Runner pod spec:
$ kubectl describe pod nginx-network-chaos-runner -n litmus
Containers:
chaos-runner:
Container ID: containerd://b09d67fb5343ec2e2a524b45aa5ad2c04b3a6a8b447d59782f003347884bccfb
Image: litmuschaos/go-runner:1.12.0-revert
Image ID: docker.io/litmuschaos/go-runner@sha256:e390bcfdd521a76b8fc932a8169035533e9d78e221d80f19a7abe85a430b0476
Port:
The runner pod goes to the Completed state without emitting any log output.
Can you please try with the following manifests?
Hi @rbalagopalakrishna ,
We have added the chaos revert logic to the network chaos experiments for cases where the experiment fails or is aborted.
You can use the following images to try it:
- chaos-operator: litmuschaos/chaos-operator:1.13.1-revert
- chaos-runner: litmuschaos/chaos-runner:1.13.0
- go-runner: litmuschaos/go-runner:1.13.1-revert
PR reference for this: PR1, PR2
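One way to wire these images in (a sketch: the operator deployment name chaos-operator-ce, the litmus namespace, and the CHAOS_RUNNER_IMAGE env var follow a typical Litmus install and may differ in yours; the go-runner image is set via the LIB_IMAGE env in the ChaosEngine, as in the spec earlier in this thread):

```shell
# Swap the operator image for the revert build.
kubectl set image deployment/chaos-operator-ce \
  chaos-operator=litmuschaos/chaos-operator:1.13.1-revert -n litmus

# Point the operator at the matching runner image.
kubectl set env deployment/chaos-operator-ce \
  CHAOS_RUNNER_IMAGE=litmuschaos/chaos-runner:1.13.0 -n litmus
```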
Hi @ispeakc0de,
+1
The *-revert images are able to revert the network chaos successfully after applying the abort patch.
Attaching ping logs (duration: 60 seconds; the patch was applied 10s after the chaos started).
With the images referred to above, revert and reruns are successful in both 1.13.1 and 1.13.1-revert (the latter having a few more indicators in the chaosresult for a successful revert).
Closing this issue, please feel free to re-open or raise a new one upon any problems.