Comments (7)

chaitanyaenr commented on May 26, 2024

It would be nice to have the post_action option native to the PowerfulSeal config; we created an issue for it: powerfulseal/powerfulseal#203. It looks like the latest enhancement can help us with a couple of checks: powerfulseal/powerfulseal#242. For whatever is not possible there, IMHO we can go with the postAction approach that @paigerube14 mentioned.

from krkn.

mffiedler commented on May 26, 2024

Tools like Cerberus can do intelligent monitoring of different dimensions of a cluster and provide a go/no-go signal. When the tools observe changes, though, they do not have any knowledge of the cause of the changes. It is up to tools like Kraken to interpret signal changes from Cerberus and correlate them with the scenario being executed or to do deeper component checks if go/no-go is too simplistic.

A tricky thing here is that a scenario might not simply cause the cluster to transition from good to bad. Some of the scenarios, especially on large clusters, might cause the cluster to oscillate/bounce from go to no-go repeatedly in Cerberus or require more detailed component checks.

Each scenario should interpret the signal from Cerberus or perform additional checks to determine pass/fail.

As a starting point, I'm thinking of a common class or other mechanism for reporting pass/fail, with a default implementation that does a simple Cerberus go/no-go check as the most basic form of pass/fail. Scenarios can then override the simple check with something more applicable to the scenario as needed.
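
A minimal Python sketch of that idea, under stated assumptions: the names `ScenarioCheck`, `EtcdCheck`, `read_signal` and `running_replicas` are hypothetical, and the Cerberus signal is assumed to be already reduced to a boolean by whatever callable is passed in. It only illustrates the default-plus-override shape, not an actual Kraken implementation.

```python
from typing import Callable


class ScenarioCheck:
    """Base pass/fail reporter; the default is a plain go/no-go check."""

    def __init__(self, read_signal: Callable[[], bool]):
        # read_signal is a hypothetical callable returning True for "go".
        self.read_signal = read_signal

    def passed(self) -> bool:
        return self.read_signal()


class EtcdCheck(ScenarioCheck):
    """Hypothetical override: component check first, then the cluster signal."""

    def __init__(self, read_signal: Callable[[], bool],
                 running_replicas: Callable[[], int], expected: int = 3):
        super().__init__(read_signal)
        # running_replicas is a hypothetical callable counting healthy pods.
        self.running_replicas = running_replicas
        self.expected = expected

    def passed(self) -> bool:
        # Fail fast if the targeted component did not recover.
        return self.running_replicas() == self.expected and super().passed()
```

A scenario that needs nothing special would use `ScenarioCheck` directly; one that needs a deeper check overrides `passed()`.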

@chaitanyaenr @paigerube14 @yashashreesuresh Thoughts? Better approach?

paigerube14 commented on May 26, 2024

I was thinking of doing something where, at the end of each iteration (after all scenarios listed in the config are done), it would look at the status of Cerberus and also check whether the killed pods/nodes have come back and are running/ready. I would keep track of the pods and/or nodes that were crashed during each scenario, along with the time and the namespace, to see how long those pods/nodes were down and whether they recovered.

For pods, I have seen that they come back with the same name. Is that always the case?

chaitanyaenr commented on May 26, 2024

@mffiedler @paigerube14 Agreed that consuming just the Cerberus signal for pass/fail is tricky. So the plan is to explicitly check whether the targeted component recovered, and then consume the go/no-go signal from Cerberus after that scenario-specific health check to make sure other components were not affected by the failure. We will track the pods/nodes killed during the scenario and check on them with a timeout in place, like @paigerube14 mentioned. If the scenario succeeds, we then check Cerberus at the end of the iteration to see if the cluster as a whole is sane. Thoughts?

@paigerube14 The pods might not come back with the same name, so I think it's better to check the replica count. For example, if we are killing apiserver, we can use the label to check whether 3 replicas of the pod are up and running within a certain period of time. Thoughts?
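
A rough Python sketch of the replica-count check with a timeout. The function name `wait_for_replicas` and the `count_running` callable are assumptions made for illustration; how the running pods are actually counted for a label is left to the caller.

```python
import time
from typing import Callable


def wait_for_replicas(count_running: Callable[[], int], expected: int,
                      timeout: float = 120.0, interval: float = 5.0) -> bool:
    """Poll until `expected` replicas are running or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        # count_running is a hypothetical callable, e.g. one that counts
        # Running pods matching a label selector.
        if count_running() == expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In practice `count_running` could wrap something like an `oc get pods -l <label>` call and count the pods reported as Running. Checking by count rather than by name sidesteps the question of whether a recreated pod keeps its name.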

chaitanyaenr commented on May 26, 2024

I think we should add support in the config for specifying a post action (a simple oc command or a script) that runs at the end of each scenario and compares the current output against the expected output to report True or False, which we can use to determine pass/fail. This way, we can specify the check and the expected pod count (3 in the case of apiserver and etcd, for example) in the config instead of baking the logic into the code, since the scenario can change with the config (it can target an app or a system component, 3 replicas or 1 replica, etc.). Thoughts?
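
One way such a post action could be evaluated, sketched in Python. The function name `run_post_action` and its parameters are hypothetical; the sketch assumes the command prints the value to compare on stdout and that a non-zero exit code should count as a failure.

```python
import subprocess


def run_post_action(command: str, expected_output: str,
                    timeout: float = 60.0) -> bool:
    """Run a shell command and report True when its stdout matches."""
    # shell=True so the config can hold a one-line command or a script path.
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    # Assumption: surrounding whitespace is not significant for the check.
    return (result.returncode == 0
            and result.stdout.strip() == expected_output.strip())
```

For example, `run_post_action("echo 3", "3")` reports True, while a command that prints a different count or exits with an error reports False. Note that `subprocess.run` raises `subprocess.TimeoutExpired` if the command exceeds the timeout, so a caller may want to treat that as a failure too.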

paigerube14 commented on May 26, 2024

I think a separate post action would be more helpful here than just looking for the pod/node that was killed. Going through all the currently running pods/nodes to see whether it came back would take a lot of extra searching.

I'm not exactly sure how we would do this, though. I think we would need to restructure the config so that it is clear which post action script to run after which scenario.

Maybe it could look like the following:

scenarios:                                             # List of policies/chaos scenarios to load
- - scenarios/etcd.yml
  - postAction/etcd.sh
- - scenarios/openshift-kube-apiserver.yml
  - postAction/openshift-kube-apiserver.sh
- - scenarios/openshift-apiserver.yml
  - postAction/openshift-apiserver.sh

Thoughts?
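
A small Python sketch of how a runner might consume that nested layout. The `scenarios` literal mirrors what a YAML loader would return for the proposed config; the helper `pair_up` and the rule that a bare string means "no post action" are assumptions for illustration only.

```python
from typing import List, Optional, Tuple

# The structure a YAML loader would produce for the proposed layout.
scenarios = [
    ["scenarios/etcd.yml", "postAction/etcd.sh"],
    ["scenarios/openshift-kube-apiserver.yml",
     "postAction/openshift-kube-apiserver.sh"],
    ["scenarios/openshift-apiserver.yml",
     "postAction/openshift-apiserver.sh"],
]


def pair_up(entries) -> List[Tuple[str, Optional[str]]]:
    """Return (scenario, post_action) pairs; post_action may be None."""
    pairs = []
    for entry in entries:
        if isinstance(entry, str):
            # Assumption: a bare string is a scenario with no post action.
            pairs.append((entry, None))
        else:
            scenario = entry[0]
            post_action = entry[1] if len(entry) > 1 else None
            pairs.append((scenario, post_action))
    return pairs
```

Accepting both forms would let existing configs that list only scenario files keep working while new entries add a post action.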

chaitanyaenr commented on May 26, 2024

Closing this since we now have support in Kraken to check for recovery of the targeted components as well as the cluster.
