The tool injects failures but doesn't check if the targeted components recovered from


Check if the targeted components recovered from failures about krkn HOT 7 CLOSED

krkn-chaos commented on May 26, 2024

Check if the targeted components recovered from failures

from krkn.

Comments (7)

chaitanyaenr commented on May 26, 2024 1

It would be nice to have the post_action option native to the Powerfulseal/PowerfulSeal config; we created an issue for it: powerfulseal/powerfulseal#203. It looks like the latest enhancement can help us with a couple of checks: powerfulseal/powerfulseal#242. For anything that is not possible there, we can go the postAction way, as @paigerube14 mentioned, IMHO.


mffiedler commented on May 26, 2024

Tools like Cerberus can do intelligent monitoring of different dimensions of a cluster and provide a go/no-go signal. When the tools observe changes, though, they do not have any knowledge of the cause of the changes. It is up to tools like Kraken to interpret signal changes from Cerberus and correlate them with the scenario being executed or to do deeper component checks if go/no-go is too simplistic.

A tricky thing here is that a scenario might not simply cause the cluster to transition from good to bad. Some of the scenarios, especially on large clusters, might cause the cluster to oscillate/bounce from go to no-go repeatedly in Cerberus or require more detailed component checks.

Each scenario should interpret the signal from Cerberus or perform additional checks to determine pass/fail.

As a starting point I'm thinking of a common class or other mechanism for reporting pass/fail which has a default implementation of a simple Cerberus go/no-go check as the most basic form of pass/fail. Scenarios can then override the simple check with something more applicable to the scenario as needed.
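A minimal sketch of that idea, not Kraken's actual code: a base class whose default check is the Cerberus go/no-go signal, plus one scenario-specific override. The names `RecoveryCheck`, `ReplicaCountCheck`, `fetch_signal` and `count_ready` are hypothetical, and the signal is assumed to arrive as the text "True" or "False".

```python
from typing import Callable


class RecoveryCheck:
    """Default pass/fail check: trust the Cerberus go/no-go signal."""

    def __init__(self, fetch_signal: Callable[[], str]):
        # fetch_signal is a hypothetical hook returning the raw signal text,
        # assumed here to be "True" (go) or "False" (no-go).
        self.fetch_signal = fetch_signal

    def passed(self) -> bool:
        return self.fetch_signal().strip().lower() == "true"


class ReplicaCountCheck(RecoveryCheck):
    """Scenario-specific override: also require an expected replica count."""

    def __init__(self, fetch_signal: Callable[[], str],
                 count_ready: Callable[[], int], expected: int):
        super().__init__(fetch_signal)
        # count_ready is a hypothetical hook returning the number of ready replicas.
        self.count_ready = count_ready
        self.expected = expected

    def passed(self) -> bool:
        # Check the targeted component first, then the cluster-wide signal.
        return self.count_ready() >= self.expected and super().passed()
```

A scenario that needs nothing special would use `RecoveryCheck` directly; one that kills a replicated component could swap in the subclass.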

@chaitanyaenr @paigerube14 @yashashreesuresh Thoughts? Better approach?


paigerube14 commented on May 26, 2024

I was thinking of doing something where, at the end of each iteration (after all scenarios listed in the config are done), it would look at the status of Cerberus but also check whether the killed pods/nodes have come back and are running/ready. I would keep track of the pods and/or nodes that were crashed during each scenario, along with the time and the namespace, to track how long those pods/nodes are down and whether they recovered.

For pods, I have seen that they come back with the same name. Is that always the case?
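The bookkeeping described above could look roughly like the following. This is only an illustrative sketch: `KilledTarget` and `KillTracker` are hypothetical names, and how a target is detected as recovered is left to the caller.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class KilledTarget:
    kind: str                      # "pod" or "node"
    name: str
    namespace: str                 # empty string for nodes
    killed_at: float
    recovered_at: Optional[float] = None

    def downtime(self) -> Optional[float]:
        """Seconds between kill and recovery, or None if still down."""
        if self.recovered_at is None:
            return None
        return self.recovered_at - self.killed_at


class KillTracker:
    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self.clock = clock
        self.targets: List[KilledTarget] = []

    def record_kill(self, kind: str, name: str, namespace: str = "") -> KilledTarget:
        target = KilledTarget(kind, name, namespace, self.clock())
        self.targets.append(target)
        return target

    def mark_recovered(self, target: KilledTarget) -> None:
        target.recovered_at = self.clock()

    def unrecovered(self) -> List[KilledTarget]:
        return [t for t in self.targets if t.recovered_at is None]
```

At the end of an iteration, a non-empty `unrecovered()` list would indicate a failure, and `downtime()` gives the per-target outage duration.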


chaitanyaenr commented on May 26, 2024

@mffiedler @paigerube14 Agreed that consuming just the Cerberus signal for pass/fail is tricky. So the plan is to have explicit logic that checks whether the targeted component recovered, and then consume the go/no-go signal from Cerberus after the scenario-specific health check to make sure the other components weren't affected by the failure. We will track the pods/nodes killed during the scenario and check on them with a timeout in place, as @paigerube14 mentioned, and if the scenario succeeds, we then check Cerberus at the end of the iteration to see if the cluster is sane as a whole. Thoughts?

@paigerube14 The pods might not come back with the same name, so I think it's better to check the replica count. For example, if we are killing apiserver, we can use the label to check whether 3 replicas of the pods are up and running within a certain period of time. Thoughts?
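A rough sketch of such a replica-count check with a timeout. The function name and parameters are hypothetical; `count_running` stands in for whatever returns the number of running pods matching the label (in practice it might wrap a label-selector query such as `oc get pods -l <label>`).

```python
import time
from typing import Callable


def wait_for_replicas(count_running: Callable[[], int], expected: int,
                      timeout: float = 120.0, interval: float = 5.0,
                      clock: Callable[[], float] = time.monotonic,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until at least `expected` replicas are running or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if count_running() >= expected:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Counting by label rather than by pod name sidesteps the question of whether a recreated pod keeps its name.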


chaitanyaenr commented on May 26, 2024

I think we should add support in the config for a post action (a simple oc command or a script) that runs at the end of each scenario and compares the current output against the expected output to report True or False, which we can use to determine pass/fail. This way, we can specify the check and the expected pod count (3 in the case of apiserver and etcd, for example) in the config instead of baking the logic into the code, since the scenario can change with the config (it can be an app or a system component, 3 replicas or 1 replica, etc.). Thoughts?
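For illustration, a post action of this kind could be evaluated roughly as below. `run_post_action` is a hypothetical helper, not an existing Kraken function, and treating a non-zero exit code or a timeout as a failure is an assumption of this sketch.

```python
import subprocess


def run_post_action(command: str, expected_output: str, timeout: float = 60.0) -> bool:
    """Run a shell command and report whether its stdout matches the expected output."""
    try:
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Assumption: a check that hangs counts as a failed check.
        return False
    if result.returncode != 0:
        return False
    # Compare ignoring leading/trailing whitespace such as the final newline.
    return result.stdout.strip() == expected_output.strip()
```

With this shape, the command and the expected value (for example a pod count of 3) live in the config and the code stays scenario-agnostic.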


paigerube14 commented on May 26, 2024

I think having a separate post action might be more helpful in this situation than just finding the pod/node that was killed. There would be a lot of extra searching through all the currently running pods/nodes to see if they came back.

I'm not exactly sure how we would do this, though. I think we would need to restructure the config so that you know which post action script to run after which scenario.

Maybe it could look like the following:

scenarios:                                             # List of policies/chaos scenarios to load
- - scenarios/etcd.yml
  - postAction/etcd.sh
- - scenarios/openshift-kube-apiserver.yml
  - postAction/openshift-kube-apiserver.sh
- - scenarios/openshift-apiserver.yml
  - postAction/openshift-apiserver.sh

Thoughts?
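Assuming the nested list above is loaded from YAML into a list of lists, pairing each scenario with its post action might be sketched like this. `pair_scenarios` is a hypothetical helper, and accepting a bare string or a one-element list for scenarios without a post action is an assumption, not part of the proposal.

```python
from typing import List, Optional, Sequence, Tuple, Union


def pair_scenarios(
    entries: Sequence[Union[str, Sequence[str]]]
) -> List[Tuple[str, Optional[str]]]:
    """Turn [[scenario, post_action], ...] into (scenario, post_action) tuples."""
    pairs = []
    for entry in entries:
        if isinstance(entry, str):
            entry = [entry]  # scenario with no post action
        if not 1 <= len(entry) <= 2:
            raise ValueError(
                f"expected [scenario] or [scenario, post_action], got {entry!r}")
        scenario = entry[0]
        post_action = entry[1] if len(entry) == 2 else None
        pairs.append((scenario, post_action))
    return pairs


# The structure the proposed config would produce once parsed.
loaded = [
    ["scenarios/etcd.yml", "postAction/etcd.sh"],
    ["scenarios/openshift-kube-apiserver.yml", "postAction/openshift-kube-apiserver.sh"],
    ["scenarios/openshift-apiserver.yml", "postAction/openshift-apiserver.sh"],
]
pairs = pair_scenarios(loaded)
```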


chaitanyaenr commented on May 26, 2024

Closing this since we now have support in Kraken to check for recovery of the targeted components as well as the cluster.

