Comments (7)
It would be nice to have the post_action option native to the Powerfulseal/PowerfulSeal config; we created an issue for it: powerfulseal/powerfulseal#203. It looks like the latest enhancement can help us with a couple of checks: powerfulseal/powerfulseal#242. For the things that are not possible there, we can go the postAction way like @paigerube14 mentioned, IMHO.
from krkn.
Tools like Cerberus can do intelligent monitoring of different dimensions of a cluster and provide a go/no-go signal. When the tools observe changes, though, they do not have any knowledge of the cause of the changes. It is up to tools like Kraken to interpret signal changes from Cerberus and correlate them with the scenario being executed or to do deeper component checks if go/no-go is too simplistic.
A tricky thing here is that a scenario might not simply cause the cluster to transition from good to bad. Some of the scenarios, especially on large clusters, might cause the cluster to oscillate/bounce from go to no-go repeatedly in Cerberus or require more detailed component checks.
Each scenario should interpret the signal from Cerberus or perform additional checks to determine pass/fail.
As a starting point I'm thinking of a common class or other mechanism for reporting pass/fail which has a default implementation of a simple Cerberus go/no-go check as the most basic form of pass/fail. Scenarios can then override the simple check with something more applicable to the scenario as needed.
@chaitanyaenr @paigerube14 @yashashreesuresh Thoughts? Better approach?
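A minimal sketch of the common pass/fail class proposed above, assuming the Cerberus signal is available through a callable that returns True for "go". The names `ScenarioCheck`, `StableGoCheck`, and `read_signal` are hypothetical and not part of Kraken or Cerberus. The subclass shows one way a scenario could override the default when the signal bounces between go and no-go.

```python
from typing import Callable


class ScenarioCheck:
    """Default pass/fail: a single Cerberus go/no-go reading."""

    def __init__(self, read_signal: Callable[[], bool]):
        # read_signal is a hypothetical callable returning True for "go".
        self.read_signal = read_signal

    def passed(self) -> bool:
        return self.read_signal()


class StableGoCheck(ScenarioCheck):
    """Override for scenarios where the signal may oscillate: pass only
    after `required` consecutive "go" readings within `max_reads` polls."""

    def __init__(self, read_signal: Callable[[], bool],
                 required: int = 3, max_reads: int = 10):
        super().__init__(read_signal)
        self.required = required
        self.max_reads = max_reads

    def passed(self) -> bool:
        streak = 0
        for _ in range(self.max_reads):
            # A "no-go" reading resets the streak.
            streak = streak + 1 if self.read_signal() else 0
            if streak >= self.required:
                return True
        return False
```

A real implementation would likely sleep between polls; that is left out here to keep the sketch short.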
I was thinking about doing something where, at the end of each iteration (after all scenarios listed in the config are done), it would look at the status of Cerberus and also check whether the killed pods/nodes have come back and are running/ready. I would keep track of the pods and/or nodes that were crashed during each scenario, along with the time and the namespace, to track how long those pods/nodes are down and whether they recovered.
For pods, I have seen that they come back with the same name. Is that always the case?
@mffiedler @paigerube14 Agree that consuming just the Cerberus signal for pass/fail is tricky. So the plan is to explicitly check whether the targeted component recovered, and then consume the go/no-go signal from Cerberus after the scenario-specific health check to make sure the other components were not affected by the failure. We will track the pods/nodes killed during the scenario and check on them with a timeout in place, like @paigerube14 mentioned. If the scenario succeeds, we then check Cerberus at the end of the iteration to see if the cluster is sane as a whole. Thoughts?
@paigerube14 The pods might not come back with the same name, so I think it's better to check the replica count. For example, if we are killing apiserver, we can use the label to check whether 3 replicas of the pod are up and running within a certain period of time. Thoughts?
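The replica-count check with a timeout could be sketched roughly as below. This is only an illustration: `wait_for_replicas` and `count_ready` are hypothetical names, and `count_ready` stands in for whatever query returns the number of ready pods matching the label.

```python
import time
from typing import Callable


def wait_for_replicas(count_ready: Callable[[], int], expected: int,
                      timeout: float = 120.0, interval: float = 5.0) -> bool:
    """Poll until at least `expected` pods are ready or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        # count_ready is a hypothetical callable, e.g. a label-selector query.
        if count_ready() >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Counting by label rather than by pod name means the check still works when the replacement pods come back under new names.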
I think we should add support in the config for specifying a post action (a simple oc command or a script) that runs at the end of each scenario and compares the current output against the expected output, reporting True or False, which we can use to determine pass/fail. This way, the check and the expected pod count (3 for apiserver and etcd, for example) live in the config instead of being baked into the code, since the scenario can change with the config (it can be an app or a system component, 3 replicas or 1 replica, etc.). Thoughts?
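As a rough sketch of that idea, assuming the post action is a shell command whose standard output is compared against an expected string, something like the following could work. The function name `run_post_action` and its parameters are hypothetical, not Kraken's actual implementation.

```python
import subprocess


def run_post_action(command: str, expected: str, timeout: float = 60.0) -> bool:
    """Run `command` in a shell; return True if it exits 0 and its
    trimmed stdout equals the trimmed `expected` string."""
    try:
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # A post action that hangs is treated as a failed check.
        return False
    return result.returncode == 0 and result.stdout.strip() == expected.strip()
```

For instance, `run_post_action("echo 3", "3")` returns True, with `echo 3` standing in for a command that prints the current pod count.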
I think having a separate post action might be more helpful in this situation than just finding the pod/node that was killed. There would be a lot of extra searching through all the currently running pods/nodes to see if they came back.
I'm not exactly sure how we would do this, though. I think we would need to restructure the config so that it is clear which post action script to run after which scenario.
Maybe it could look like the following:

    scenarios:  # List of policies/chaos scenarios to load
      - - scenarios/etcd.yml
        - postAction/etcd.sh
      - - scenarios/openshift-kube-apiserver.yml
        - postAction/openshift-kube-apiserver.sh
      - - scenarios/openshift-apiserver.yml
        - postAction/openshift-apiserver.sh
Thoughts?
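One way the nested list above might be consumed, once the YAML is loaded into Python lists, is sketched here. The helper `pair_scenarios` is hypothetical, and accepting a bare string as a scenario with no post action is an added assumption, not something proposed in the thread.

```python
from typing import List, Optional, Tuple


def pair_scenarios(entries: list) -> List[Tuple[str, Optional[str]]]:
    """Normalize each config entry to a (scenario, post_action) tuple."""
    pairs = []
    for entry in entries:
        if isinstance(entry, str):
            # Assumption: a bare string is a scenario without a post action.
            pairs.append((entry, None))
        else:
            post_action = entry[1] if len(entry) > 1 else None
            pairs.append((entry[0], post_action))
    return pairs


# The structure below mirrors what a YAML loader would produce for the
# proposed config; "scenarios/other.yml" is a made-up example entry.
config = {
    "scenarios": [
        ["scenarios/etcd.yml", "postAction/etcd.sh"],
        "scenarios/other.yml",
    ]
}
pairs = pair_scenarios(config["scenarios"])
```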
Closing this since we now have support in Kraken to check for recovery of the targeted components as well as the cluster.