krkn-chaos / krkn
Chaos and resiliency testing tool for Kubernetes with a focus on improving performance under failure conditions. A CNCF sandbox project.
License: Apache License 2.0
Scenario: Enable node scenarios for an OCP cluster on a non-aws cloud type
I updated cloud_type to openstack in the scenarios/node_scenarios_example.yml file:
node_scenarios:
  - actions:                                        # node chaos scenarios to be injected
    # - node_stop_start_scenario
    - stop_start_kubelet_scenario
    # - node_crash_scenario
    node_name: worker-0                             # node on which scenario has to be injected
    label_selector: node-role.kubernetes.io/worker  # when node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection
    instance_kill_count: 1                          # number of times to inject each scenario under actions
    timeout: 120                                    # duration to wait for completion of node scenario injection
    cloud_type: openstack                           # cloud type on which Kubernetes/OpenShift runs
Issue seen: Script fails when any value other than aws is specified in cloud_type.
[root@rails1 kraken]# python3 run_kraken.py --config config/config.yaml
_ _
| | ___ __ __ _| | _____ _ __
| |/ / '__/ _` | |/ / _ \ '_ \
| <| | | (_| | < __/ | | |
|_|\_\_| \__,_|_|\_\___|_| |_|
2020-11-09 06:23:01,238 [INFO] Starting kraken
2020-11-09 06:23:01,244 [INFO] Initializing client to talk to the Kubernetes cluster
2020-11-09 06:23:03,448 [INFO] Fetching cluster info
2020-11-09 06:23:04,360 [INFO]
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.1 True False 11d Cluster version is 4.6.1
Kubernetes master is running at https://api.chaos-arc46.redhat.com:6443
2020-11-09 06:23:04,361 [INFO] Daemon mode not enabled, will run through 1 iterations
2020-11-09 06:23:04,361 [INFO] Executing scenarios for iteration 0
Traceback (most recent call last):
File "run_kraken.py", line 298, in <module>
main(options.cfg)
File "run_kraken.py", line 244, in main
node_scenario_object)
File "run_kraken.py", line 48, in inject_node_scenario
node_scenario_object.stop_start_kubelet_scenario(instance_kill_count, node, timeout)
AttributeError: 'NoneType' object has no attribute 'stop_start_kubelet_scenario'
Expected result: The documentation should specify that the node scenarios run only on the aws cloud type. As per the current documentation, the kubelet stop/start node scenario should run on any cloud platform.
I came across two files, abstract_node_scenarios.py and general_cloud_node_scenarios.py, which I think serve the same purpose, and I think they could be combined into a single base/general class.
Thoughts? Correct me if I haven't gotten the use case for having two different base classes :)
cc: @paigerube14
I enabled Grafana in the config.yaml file and received the following message in the OCP console saying that jq was not found. envsubst also cannot be found. I have jq installed locally on the machine I'm using to run the oc apply -f kraken.yml command to initialize the deployment.
Using k8s command: oc
Using namespace: dittybopper
Using default grafana password: admin
Getting environment vars...
./deploy.sh: line 111: jq: command not found
./deploy.sh: line 112: jq: command not found
Prometheus URL is:
Prometheus password collected.
Creating namespace...
namespace/dittybopper created
Deploying Grafana...
./deploy.sh: line 122: envsubst: command not found
error: no objects passed to apply
First, thanks for making this tool available to the general public, it looks really awesome!
Is there already a roadmap for supporting different clouds, in our case specifically Azure?
If so can you provide any rough indication on when this would be added?
We use OpenShift on Azure and this tool looks really awesome, so that's why we would love to use this in our setup!
It is useful to continuously run chaos scenarios while running other workloads. Add a config option to loop through the configured scenarios a set number of times or infinitely. Also, provide a configuration for a "sleep time" between loops.
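A minimal sketch of what such a config option could look like (the key names here are illustrative, not Kraken's actual schema):

```yaml
# Hypothetical config fragment: loop through scenarios with a sleep between loops
tunings:
  iterations: 5        # number of times to loop through the configured scenarios
  daemon_mode: false   # when true, loop indefinitely regardless of iterations
  wait_duration: 60    # "sleep time" in seconds between loops
```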
Add a scenario to kill a random pod selected from pods matching a regexp. We have a simple, but effective shell script in OpenShift testing that randomly takes out infra pods. Combined with Cerberus (https://github.com/openshift-scale/cerberus) we can detect non-transient issues caused by such a scenario.
Example (in shell scripting):
#!/bin/bash
# Delete $1 random running pods from openshift-* namespaces every $2 seconds.
NUM=$1
SLEEP=$2
while true; do
  echo "============"
  oc get pods --all-namespaces | grep openshift | grep Running \
    | awk '{print $1" "$2}' | shuf -n "$NUM" \
    | while read -r ns pod; do
        oc delete pod -n "$ns" "$pod" --wait=false
      done
  sleep "$SLEEP"
done
Issue:
Currently, Kraken's node-level chaos tests are unsupported on an OpenStack-based OCP cluster.
Solution:
We have enabled support by writing code to connect to and run the node-level chaos tests on an OpenStack-based cluster.
We have also updated the node scenarios documentation with instructions for the OpenStack cloud type.
I am trying to build an image to run in DEBUG mode, but it fails when I attempt to build in kraken/containers using the Dockerfile. I am receiving a Python error:
Collecting azure-keyvault-certificates~=4.1 (from azure-keyvault->-r requirements.txt (line 9))
Downloading https://files.pythonhosted.org/packages/23/2b/e3219ae48391263b4657817b61eb5104ac1190d360e288e4546a18e61ca5/azure_keyvault_certificates-4.2.1-py2.py3-none-any.whl (235kB)
Collecting azure-keyvault-keys~=4.1 (from azure-keyvault->-r requirements.txt (line 9))
Downloading https://files.pythonhosted.org/packages/a5/7f/a626704af6f0fb36bdc16397b49ab313a7e7a801ded5b43c6a74391f536a/azure_keyvault_keys-4.3.1-py2.py3-none-any.whl (248kB)
Collecting azure-keyvault-secrets~=4.1 (from azure-keyvault->-r requirements.txt (line 9))
Downloading https://files.pythonhosted.org/packages/43/ae/4ad0c67e54f820d0ead249ce48e8cf498cfee42034dcd837d8f218ba9e76/azure_keyvault_secrets-4.2.0-py2.py3-none-any.whl (206kB)
Collecting azure-core<2.0.0,>=1.0.0 (from azure-identity->-r requirements.txt (line 10))
Downloading https://files.pythonhosted.org/packages/19/18/21cfd7faf7ab24c35689c9f199179081cee8fec44668c7f090e1db61226d/azure_core-1.13.0-py2.py3-none-any.whl (133kB)
Collecting msal<2.0.0,>=1.6.0 (from azure-identity->-r requirements.txt (line 10))
Downloading https://files.pythonhosted.org/packages/68/a9/b534f1158ffce8c551dea86d90981e9bd892f310c4c27d079d6b4b88849a/msal-1.11.0-py2.py3-none-any.whl (63kB)
Collecting msal-extensions~=0.3.0 (from azure-identity->-r requirements.txt (line 10))
Downloading https://files.pythonhosted.org/packages/49/cb/c833ffa0f97c3098b146ac19bb2266c2d84b2119ffff83fdf001bb59d3ae/msal_extensions-0.3.0-py2.py3-none-any.whl
Collecting cryptography>=2.1.4 (from azure-identity->-r requirements.txt (line 10))
Downloading https://files.pythonhosted.org/packages/9b/77/461087a514d2e8ece1c975d8216bc03f7048e6090c5166bc34115afdaa53/cryptography-3.4.7.tar.gz (546kB)
Complete output from command python setup.py egg_info:
=============================DEBUG ASSISTANCE==========================
If you are seeing an error here please try the following to
successfully install cryptography:
Upgrade to the latest pip and try again. This will fix errors for most
users. See: https://pip.pypa.io/en/stable/installing/#upgrading-pip
=============================DEBUG ASSISTANCE==========================
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-build-4q3jv9uv/cryptography/setup.py", line 14, in <module>
from setuptools_rust import RustExtension
ModuleNotFoundError: No module named 'setuptools_rust'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-4q3jv9uv/cryptography/
The command '/bin/sh -c yum install -y git python36 python3-pip && git clone https://github.com/openshift-scale/kraken /root/kraken && mkdir -p /root/.kube && cd /root/kraken && pip3 install -r requirements.txt' returned a non-zero code: 1
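The failure comes from building cryptography 3.4+ from source under an old pip without setuptools-rust. A possible fix, sketched as a change to the Dockerfile's install step (the exact layout of the upstream Dockerfile may differ):

```dockerfile
# Upgrade pip first so a prebuilt cryptography wheel is used instead of a
# source build that requires setuptools-rust and a Rust toolchain.
RUN yum install -y git python36 python3-pip && \
    pip3 install --upgrade pip setuptools && \
    pip3 install setuptools-rust && \
    git clone https://github.com/openshift-scale/kraken /root/kraken && \
    mkdir -p /root/.kube && \
    cd /root/kraken && \
    pip3 install -r requirements.txt
```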
There are some calls to exec that shell out to kubectl; instead, we should be leveraging the k8s Python client.
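As a sketch of the direction (assuming the official kubernetes Python package; the helper names are illustrative, not Kraken's actual API):

```python
def running_pods(pod_list):
    """Filter a V1PodList (anything with .items) down to (namespace, name) pairs of running pods."""
    return [(p.metadata.namespace, p.metadata.name)
            for p in pod_list.items
            if p.status.phase == "Running"]

def list_running_pods(label_selector=None):
    """List running pods via the official client instead of shelling out to kubectl."""
    # Deferred import so running_pods stays usable without a cluster connection.
    from kubernetes import client, config
    config.load_kube_config()  # use config.load_incluster_config() when running in a pod
    v1 = client.CoreV1Api()
    return running_pods(v1.list_pod_for_all_namespaces(label_selector=label_selector))
```

The split into a pure filter plus a thin client wrapper keeps the selection logic unit-testable without a live cluster.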
Task:
This task is for providing a sample scenario for executing pod-level chaos tests for a custom application (say, the acme-air app) deployed on the OpenShift Container Platform. The sample scenario will be added to the existing scenarios directory.
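Since Kraken pulls in PowerfulSeal (visible in the requirements elsewhere in this thread), such a sample might look roughly like this PowerfulSeal-style policy (the acme-air namespace and label selector are assumptions):

```yaml
# Hypothetical PowerfulSeal-style policy targeting the acme-air app
config:
  runStrategy:
    runs: 1
scenarios:
  - name: "kill a random acme-air pod"
    steps:
      - podAction:
          matches:
            - labels:
                namespace: "acme-air"
                selector: "app=acme-air"
          filters:
            - randomSample:
                size: 1
          actions:
            - kill:
                probability: 1
```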
From PR #38 : This might be a good time to also separate out each different type of scenario into its own function. This might make it easier to read and add the many more types of scenarios that are being worked on as well.
This should be done after #37 merges and #38 is finalized. @yashashreesuresh @paigerube14 @chaitanyaenr
I am trying to deploy the image and it is not working on OpenShift.
[root@ip-00-000-00-00 kraken]# ls -lrt
total 15988
-rw-r-----. 1 root root 14421964 Feb 25 2015 Python-3.4.3.tar.xz
drwxr-x---. 3 root root 22 Nov 10 15:36 CI
-rw-r-----. 1 root root 11357 Nov 10 15:36 LICENSE
-rw-r-----. 1 root root 2352 Nov 10 15:36 README.md
drwxr-x---. 4 root root 84 Nov 10 15:36 ansible
drwxr-x---. 2 root root 88 Nov 10 15:36 containers
drwxr-x---. 2 root root 4096 Nov 10 15:36 docs
drwxr-x---. 6 root root 74 Nov 10 15:36 kraken
drwxr-x---. 2 root root 32 Nov 10 15:36 media
-rw-r-----. 1 root root 450 Nov 10 15:36 tox.ini
-rw-r-----. 1 root root 18 Nov 10 15:36 test-requirements.txt
-rw-r-----. 1 root root 297 Nov 10 15:36 setup.py
-rw-r-----. 1 root root 1099 Nov 10 15:36 setup.cfg
drwxr-x---. 2 root root 4096 Nov 10 15:36 scenarios
drwxr-x---. 2 root root 6 Nov 12 12:57 docker
drwxr-x---. 2 root root 46 Nov 12 14:27 config
-rw-r-----. 1 root root 13859 Nov 12 14:36 run_kraken.py
drwxr-x---. 5 root root 69 Nov 12 17:35 py36-venv
-rw-r-----. 1 root root 1886796 Nov 12 17:44 get-pip.py
-rw-r-----. 1 root root 93 Nov 12 17:52 requirements.txt
[root@ip-00-000-00-00 kraken]# python3 -V
Python 3.4.3
[root@ip-00-000-00-00 kraken]#
[root@ip-00-000-00-00 kraken]# pip3 -V
pip 20.2.4 from /usr/local/lib/python3.6/site-packages/pip (python 3.6)
[root@ip-00-000-00-00 kraken]# python3 -V
Python 3.4.3
[root@ip-00-000-00-00 kraken]# python -V
Python 3.6.8
[root@ip-00-000-00-00 kraken]#
[root@ip-00-000-00-00 kraken]#
[root@ip-00-000-00-00 kraken]# pip3 -V
pip 20.2.4 from /usr/local/lib/python3.6/site-packages/pip (python 3.6)
[root@ip-00-000-00-00 kraken]# pip3 install -r requirements.txt
Collecting git+https://github.com/powerfulseal/powerfulseal.git (from -r requirements.txt (line 4))
Cloning https://github.com/powerfulseal/powerfulseal.git to /tmp/pip-req-build-ipizx782
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing wheel metadata ... done
Requirement already satisfied (use --upgrade to upgrade): powerfulseal==3.2.0 from git+https://github.com/powerfulseal/powerfulseal.git in /usr/local/lib/python3.6/site-packages (from -r requirements.txt (line 4))
Requirement already satisfied: datetime in /usr/local/lib64/python3.6/site-packages (from -r requirements.txt (line 1)) (4.3)
Requirement already satisfied: pyfiglet in /usr/local/lib/python3.6/site-packages (from -r requirements.txt (line 2)) (0.8.post1)
Requirement already satisfied: PyYAML in /usr/local/lib64/python3.6/site-packages (from -r requirements.txt (line 3)) (5.3.1)
Requirement already satisfied: requests in /usr/local/lib/python3.6/site-packages (from -r requirements.txt (line 5)) (2.25.0)
Requirement already satisfied: boto3 in /usr/local/lib/python3.6/site-packages (from -r requirements.txt (line 6)) (1.16.16)
Requirement already satisfied: datadog<1.0.0,>=0.29.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.39.0)
Requirement already satisfied: python-dateutil<2.7.0,>=2.5.3 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (2.6.1)
Requirement already satisfied: coloredlogs<11.0.0,>=10.0.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (10.0)
Requirement already satisfied: azure-mgmt-resource<3.0.0,>=2.2.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (2.2.0)
Requirement already satisfied: google-api-python-client>=1.7.8 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.12.5)
Requirement already satisfied: google-auth>=1.6.2 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.23.0)
Requirement already satisfied: ConfigArgParse<1,>=0.11.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.15.2)
Requirement already satisfied: kubernetes==12.0.0a1 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (12.0.0a1)
Requirement already satisfied: future<1,>=0.16.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.18.2)
Requirement already satisfied: openstacksdk<1,>=0.10.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.51.0)
Requirement already satisfied: oauth2client>=4.1.3 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (4.1.3)
Requirement already satisfied: jsonschema<4,>=3.0.2 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (3.2.0)
Requirement already satisfied: Flask<2,>=1.0.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.1.2)
Requirement already satisfied: prometheus-client<0.4.0,>=0.3.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.3.1)
Requirement already satisfied: paramiko<3,>=2.5.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (2.7.2)
Requirement already satisfied: termcolor<2,>=1.1.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.1.0)
Requirement already satisfied: azure-common<2.0.0,>=1.1.23 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.1.26)
Requirement already satisfied: google-auth-httplib2>=0.0.3 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.0.4)
Requirement already satisfied: flask-cors<4,>=3.0.6 in /usr/local/lib64/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (3.0.9)
Requirement already satisfied: azure-mgmt-network<3.0.0,>=2.7.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (2.7.0)
Requirement already satisfied: spur<1,>=0.3.20 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.3.22)
Requirement already satisfied: azure-mgmt-compute<5.0.0,>=4.6.2 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (4.6.2)
Requirement already satisfied: zope.interface in /usr/local/lib64/python3.6/site-packages (from datetime->-r requirements.txt (line 1)) (5.2.0)
Requirement already satisfied: pytz in /usr/local/lib/python3.6/site-packages (from datetime->-r requirements.txt (line 1)) (2020.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.6/site-packages (from requests->-r requirements.txt (line 5)) (1.26.1)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/site-packages (from requests->-r requirements.txt (line 5)) (2020.11.8)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/site-packages (from requests->-r requirements.txt (line 5)) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/site-packages (from requests->-r requirements.txt (line 5)) (2.10)
Requirement already satisfied: botocore<1.20.0,>=1.19.16 in /usr/local/lib/python3.6/site-packages (from boto3->-r requirements.txt (line 6)) (1.19.16)
Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /usr/local/lib/python3.6/site-packages (from boto3->-r requirements.txt (line 6)) (0.3.3)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/site-packages (from boto3->-r requirements.txt (line 6)) (0.10.0)
Requirement already satisfied: decorator>=3.3.2 in /usr/local/lib/python3.6/site-packages (from datadog<1.0.0,>=0.29.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (4.4.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/site-packages (from python-dateutil<2.7.0,>=2.5.3->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.15.0)
Requirement already satisfied: humanfriendly>=4.7 in /usr/local/lib/python3.6/site-packages (from coloredlogs<11.0.0,>=10.0.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (8.2)
Requirement already satisfied: msrest>=0.5.0 in /usr/local/lib/python3.6/site-packages (from azure-mgmt-resource<3.0.0,>=2.2.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.6.19)
Requirement already satisfied: msrestazure<2.0.0,>=0.4.32 in /usr/local/lib/python3.6/site-packages (from azure-mgmt-resource<3.0.0,>=2.2.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.6.4)
Requirement already satisfied: google-api-core<2dev,>=1.21.0 in /usr/local/lib/python3.6/site-packages (from google-api-python-client>=1.7.8->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.23.0)
Requirement already satisfied: httplib2<1dev,>=0.15.0 in /usr/local/lib/python3.6/site-packages (from google-api-python-client>=1.7.8->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.18.1)
Requirement already satisfied: uritemplate<4dev,>=3.0.0 in /usr/local/lib/python3.6/site-packages (from google-api-python-client>=1.7.8->powerfulseal==3.2.0->-r requirements.txt (line 4)) (3.0.1)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.2->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.2.8)
Requirement already satisfied: setuptools>=40.3.0 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.2->powerfulseal==3.2.0->-r requirements.txt (line 4)) (50.3.2)
Requirement already satisfied: rsa<5,>=3.1.4; python_version >= "3.5" in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.2->powerfulseal==3.2.0->-r requirements.txt (line 4)) (4.6)
Requirement already satisfied: cachetools<5.0,>=2.0.0 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.2->powerfulseal==3.2.0->-r requirements.txt (line 4)) (4.1.1)
Requirement already satisfied: requests-oauthlib in /usr/local/lib/python3.6/site-packages (from kubernetes==12.0.0a1->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.3.0)
Requirement already satisfied: websocket-client!=0.40.0,!=0.41.,!=0.42.,>=0.32.0 in /usr/local/lib/python3.6/site-packages (from kubernetes==12.0.0a1->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.57.0)
Requirement already satisfied: keystoneauth1>=3.18.0 in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (4.2.1)
Requirement already satisfied: pbr!=2.1.0,>=2.0.0 in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (5.5.1)
Requirement already satisfied: munch>=2.1.0 in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (2.5.0)
Requirement already satisfied: importlib-metadata>=1.7.0; python_version < "3.8" in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (2.0.0)
Requirement already satisfied: dogpile.cache>=0.6.5 in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.0.2)
Requirement already satisfied: netifaces>=0.10.4 in /usr/local/lib64/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.10.9)
Requirement already satisfied: jsonpatch!=1.20,>=1.16 in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.26)
Requirement already satisfied: appdirs>=1.3.0 in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.4.4)
Requirement already satisfied: iso8601>=0.1.11 in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.1.13)
Requirement already satisfied: requestsexceptions>=1.2.0 in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.4.0)
Requirement already satisfied: cryptography>=2.7 in /usr/local/lib64/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (3.2.1)
Requirement already satisfied: os-service-types>=1.7.0 in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.7.0)
Requirement already satisfied: pyasn1>=0.1.7 in /usr/local/lib/python3.6/site-packages (from oauth2client>=4.1.3->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.4.8)
Requirement already satisfied: attrs>=17.4.0 in /usr/local/lib/python3.6/site-packages (from jsonschema<4,>=3.0.2->powerfulseal==3.2.0->-r requirements.txt (line 4)) (20.3.0)
Requirement already satisfied: pyrsistent>=0.14.0 in /usr/local/lib64/python3.6/site-packages (from jsonschema<4,>=3.0.2->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.17.3)
Requirement already satisfied: Jinja2>=2.10.1 in /usr/local/lib/python3.6/site-packages (from Flask<2,>=1.0.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (2.11.2)
Requirement already satisfied: Werkzeug>=0.15 in /usr/local/lib/python3.6/site-packages (from Flask<2,>=1.0.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.0.1)
Requirement already satisfied: itsdangerous>=0.24 in /usr/local/lib/python3.6/site-packages (from Flask<2,>=1.0.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.1.0)
Requirement already satisfied: click>=5.1 in /usr/local/lib/python3.6/site-packages (from Flask<2,>=1.0.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (7.1.2)
Requirement already satisfied: pynacl>=1.0.1 in /usr/local/lib64/python3.6/site-packages (from paramiko<3,>=2.5.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.4.0)
Requirement already satisfied: bcrypt>=3.1.3 in /usr/local/lib64/python3.6/site-packages (from paramiko<3,>=2.5.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (3.2.0)
Requirement already satisfied: isodate>=0.6.0 in /usr/local/lib/python3.6/site-packages (from msrest>=0.5.0->azure-mgmt-resource<3.0.0,>=2.2.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.6.0)
Requirement already satisfied: adal<2.0.0,>=0.6.0 in /usr/local/lib/python3.6/site-packages (from msrestazure<2.0.0,>=0.4.32->azure-mgmt-resource<3.0.0,>=2.2.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.2.5)
Requirement already satisfied: protobuf>=3.12.0 in /usr/local/lib64/python3.6/site-packages (from google-api-core<2dev,>=1.21.0->google-api-python-client>=1.7.8->powerfulseal==3.2.0->-r requirements.txt (line 4)) (3.13.0)
Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /usr/local/lib/python3.6/site-packages (from google-api-core<2dev,>=1.21.0->google-api-python-client>=1.7.8->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.52.0)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.6/site-packages (from requests-oauthlib->kubernetes==12.0.0a1->powerfulseal==3.2.0->-r requirements.txt (line 4)) (3.1.0)
Requirement already satisfied: stevedore>=1.20.0 in /usr/local/lib/python3.6/site-packages (from keystoneauth1>=3.18.0->openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (3.2.2)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.6/site-packages (from importlib-metadata>=1.7.0; python_version < "3.8"->openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (3.4.0)
Requirement already satisfied: jsonpointer>=1.9 in /usr/local/lib/python3.6/site-packages (from jsonpatch!=1.20,>=1.16->openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (2.0)
Requirement already satisfied: cffi!=1.11.3,>=1.8 in /usr/local/lib64/python3.6/site-packages (from cryptography>=2.7->openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.14.3)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib64/python3.6/site-packages (from Jinja2>=2.10.1->Flask<2,>=1.0.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.1.1)
Requirement already satisfied: PyJWT>=1.0.0 in /usr/local/lib/python3.6/site-packages (from adal<2.0.0,>=0.6.0->msrestazure<2.0.0,>=0.4.32->azure-mgmt-resource<3.0.0,>=2.2.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.7.1)
Requirement already satisfied: pycparser in /usr/local/lib/python3.6/site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.7->openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (2.20)
Building wheels for collected packages: powerfulseal
Building wheel for powerfulseal (PEP 517) ... done
Created wheel for powerfulseal: filename=powerfulseal-3.2.0-py3-none-any.whl size=156211 sha256=adf9ad79cc8f30cb888ec0fc63bb69bdbbd9bc964980469319342c94c303c0bc
Stored in directory: /tmp/pip-ephem-wheel-cache-rk3agc6k/wheels/2d/cf/51/8de7d61c30ff637fca2222e4ba67d69cc9f1f8c6f419a0ee0d
Successfully built powerfulseal
[root@ip-00-000-00-00 ~]# vi myfile
[root@ip-00-000-00-00 ~]# python
Python 3.6.8 (default, Aug 13 2020, 07:46:32)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import yaml
>>> f = open('myfile')
>>> y = yaml.load(f)
__main__:1: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
>>> print(y)
this is satish
[root@ip-00-000-00-00 kraken]# python3 run_kraken.py --config=config/config.yaml
Traceback (most recent call last):
File "run_kraken.py", line 7, in <module>
import yaml
ImportError: No module named 'yaml'
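The root cause above is an interpreter mismatch: python3 resolves to 3.4.3 while pip3 installs into the Python 3.6 site-packages, so yaml is invisible to the interpreter actually running run_kraken.py. A quick, environment-independent way to see what a given interpreter can find:

```python
import sys

def interpreter_summary():
    """Report which interpreter is running and where it searches for installed packages."""
    return {
        "executable": sys.executable,
        "version": ".".join(map(str, sys.version_info[:3])),
        "site_packages": [p for p in sys.path if "site-packages" in p],
    }

print(interpreter_summary())
```

Running this with `python3` versus `python` would show the two different versions and search paths side by side.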
https://github.com/cloud-bulldozer/kraken/blob/master/run_kraken.py has functions that can be moved into modules to make it modular and reusable, and to make it easy for folks to add the changes needed in the main function file to orchestrate new modules/code additions.
Kraken currently has the ability to check if the targeted component recovered from the failures injected in addition to checking the health of the cluster as a whole to make sure it didn't get impacted because of the chaos. Kraken needs to take into account the performance and scalability of the component under test as well to expose issues where the component/cluster doesn't perform well after the recovery.
It supports performance monitoring which helps with tracking the performance and scale metrics: https://github.com/cloud-bulldozer/kraken#performance-monitoring but we need a way for it to analyze things without having to manually check on the cluster.
The proposal is to give Kraken the ability to accept a profile consisting of metrics to scrape from Prometheus, evaluate them against a gold set of values found from the OpenShift/Kubernetes performance and scale runs, and pass/fail the run based on that. This is inspired by kube-burner, which we use heavily for performance/scale testing OpenShift: https://github.com/cloud-bulldozer/kube-burner/blob/master/docs/alerting.md.
More on the importance of doing this can be found in the blog we published recently: https://www.openshift.com/blog/making-chaos-part-of-kubernetes/openshift-performance-and-scalability-tests.
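A sketch of what such a profile could look like, loosely modeled on kube-burner's alerting format (the queries and thresholds are illustrative, not agreed gold values):

```yaml
# Hypothetical metrics profile evaluated against Prometheus after the chaos run
metrics:
  - query: histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket[5m]))
    threshold: 1          # seconds; gold value from perf/scale runs
    severity: error       # exceeding this fails the run
  - query: sum(kube_pod_status_phase{phase="Pending"})
    threshold: 10
    severity: warning     # exceeding this only warns
```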
The CNV team looked into it and found that the Litmus framework (https://github.com/litmuschaos/litmus) has interesting chaos scenarios that Kraken is missing today. We need to investigate whether it makes sense to consume it under the hood to improve coverage.
On the containers page, the link under "Run containerized version" just goes to the Kraken home page.
It is supposed to go to this page, correct?
Right now, we inject failures and check for cluster health and recovery on an idle cluster. However, in production that is hardly the case, as the cluster will have some workloads actively running. To simulate production use cases as closely as possible, it might make sense to run load on the cluster before triggering faults. We could use something like kube-burner (https://github.com/cloud-bulldozer/kube-burner) to load the cluster. The idea would be to first start the kube-burner run and then inject faults during the run.
Kraken needs to support injecting a scenario which skews the system time on the nodes/pods, checks that it gets corrected by NTP, and makes sure there's no effect on the running services.
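A hypothetical scenario definition for this, following the style of the existing node scenario configs (the keys are illustrative, not an implemented schema):

```yaml
# Hypothetical time-skew scenario config
time_scenarios:
  - action: skew_time
    object_type: node              # node or pod
    label_selector: node-role.kubernetes.io/worker
    skew: "+400h"                  # offset applied before checking NTP correction
    timeout: 120                   # duration to wait for NTP to resync
```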
Further improve the ease of use of containerized Kraken by providing an image on quay.io and creating a Kubernetes or OpenShift deployment to run it. Config can be provided via a ConfigMap.
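Sketching the deployment half of this (the image path and mount point are assumptions):

```yaml
# Hypothetical Deployment mounting the Kraken config from a ConfigMap
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kraken
spec:
  replicas: 1
  selector:
    matchLabels: {app: kraken}
  template:
    metadata:
      labels: {app: kraken}
    spec:
      containers:
        - name: kraken
          image: quay.io/openshift-scale/kraken:latest   # image path assumed
          volumeMounts:
            - name: kraken-config
              mountPath: /root/kraken/config             # mount point assumed
      volumes:
        - name: kraken-config
          configMap:
            name: kraken-config
```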
I see that you have the following command to stop kubelets in an oc environment:
oc debug node/" + node + " -- chroot /host systemctl stop kubelet
What would be the kubectl equivalent of that?
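For context, the quoted fragment is presumably concatenated into a command string in Python, along the lines of this sketch (the helper name is illustrative, not Kraken's actual code):

```python
def kubelet_command(node, action="stop"):
    """Build the `oc debug` command string used to stop/start the kubelet on a node."""
    return "oc debug node/" + node + " -- chroot /host systemctl " + action + " kubelet"

print(kubelet_command("worker-0"))
```

A kubectl analogue isn't one-to-one since `oc debug` is OpenShift-specific; newer kubectl versions offer `kubectl debug node/<name>`, which is broadly similar, though availability depends on the cluster and client version.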
Linters need to be added, as they improve code quality. Travis CI needs to be enabled to run linters on every pull request to make sure it follows best practices and doesn't break the tool.
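Since the repo already carries a tox.ini, a minimal CI hook could look like this (whether tox runs the linters depends on the envs defined there):

```yaml
# Hypothetical .travis.yml running the tox envs on every pull request
language: python
python:
  - "3.6"
install:
  - pip install tox
script:
  - tox
```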
#80 references this issue, so that node options become available in IBM Cloud. Really interesting tool, thanks for all your efforts!
When trying to run the latest Kraken version (both using Docker and deploying on OpenShift) I get the same error:
2020-12-01 09:46:25,027 [INFO] Starting kraken
2020-12-01 09:46:25,032 [INFO] Initializing client to talk to the Kubernetes cluster
_ _
| | ___ __ __ _| | _____ _ __
| |/ / '__/ _` | |/ / _ \ '_ \
| <| | | (_| | < __/ | | |
|_|\_\_| \__,_|_|\_\___|_| |_|
Traceback (most recent call last):
File "run_kraken.py", line 316, in <module>
main(options.cfg)
File "run_kraken.py", line 238, in main
kubecli.find_kraken_node()
File "/root/kraken/kraken/kubernetes/client.py", line 180, in find_kraken_node
pod_json = json.loads(pod_json_str)
File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Here are some details on my environment:
$ docker version
Client: Docker Engine - Community
Version: 19.03.13
API version: 1.40
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:02:55 2020
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.13
API version: 1.40 (minimum version 1.12)
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:01:25 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.3.7
GitCommit: 8fba4e9a7d01810a393d5d25a3621dc101981175
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683
$ oc version
Client Version: 4.4.6
Server Version: 4.5.19
Kubernetes Version: v1.18.3+10e5708
My kubeconfig appears to be fine:
$ oc get pods
NAME READY STATUS RESTARTS AGE
kraken-deployment-d857c4d4f-btxx8 0/1 CrashLoopBackOff 15 61m
Here's how I tried to run it:
$ docker run --name=kraken --net=host -v $HOME/.kube/config:/root/.kube/config -v $PWD/chaos/config.yaml:/root/kraken/config/config.yaml quay.io/openshift-scale/kraken:latest
I've set up a standalone cluster for Kraken and followed the steps in the README to get started. To isolate the problem, I changed the Kraken config file to include only one scenario, openshift-apiserver. This is what the kraken deployment pod logs after starting the deployment:
2021-04-12 14:16:22,450 [INFO] Starting kraken
2021-04-12 14:16:22,455 [INFO] Initializing client to talk to the Kubernetes cluster
/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py:1020: InsecureRequestWarning: Unverified HTTPS request is being made to host 'api.kraken-test-04.cp.fyre.ibm.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning,
2021-04-12 14:16:23,455 [INFO] Fetching cluster info
2021-04-12 14:16:47,718 [INFO]
I0412 14:16:24.777729 13 request.go:668] Waited for 1.168597466s due to client-side throttling, not priority and fairness, request: GET:https://api.kraken-test-04.cp.fyre.ibm.com:6443/apis/policy/v1beta1?timeout=32s
I0412 14:16:34.778149 13 request.go:668] Waited for 11.168923785s due to client-side throttling, not priority and fairness, request: GET:https://api.kraken-test-04.cp.fyre.ibm.com:6443/apis/coordination.k8s.io/v1beta1?timeout=32s
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.16 True False 21h Cluster version is 4.6.16
I0412 14:16:36.903365 31 request.go:668] Waited for 1.155303512s due to client-side throttling, not priority and fairness, request: GET:https://api.kraken-test-04.cp.fyre.ibm.com:6443/apis/networking.k8s.io/v1?timeout=32s
I0412 14:16:47.103691 31 request.go:668] Waited for 11.353934532s due to client-side throttling, not priority and fairness, request: GET:https://api.kraken-test-04.cp.fyre.ibm.com:6443/apis/admissionregistration.k8s.io/v1beta1?timeout=32s
Kubernetes control plane is running at https://api.kraken-test-04.cp.fyre.ibm.com:6443
2021-04-12 14:16:47,718 [INFO] Daemon mode enabled, kraken will cause chaos forever
2021-04-12 14:16:47,718 [INFO] Ignoring the iterations set
2021-04-12 14:16:47,718 [INFO] Executing scenarios for iteration 0
2021-04-12 14:16:48,211 [INFO] Scenario: scenarios/openshift-apiserver.yml has been successfully injected!
2021-04-12 14:16:48,212 [INFO] Waiting for the specified duration: 60
My expectation based on the message is that I would be able to see in real time a pod in the openshift-apiserver namespace be taken down, and a new one initialized. However, I don't see anything happening to the pods. This was with the default openshift-apiserver.yml scenario.
I've also tried changing the selector in the scenario to app=openshift-apiserver-a to match our targeted pod's label in its YAML, as well as getting rid of the selector line altogether.
Are there additional steps that need to be done to get this scenario to work?
I run it on my local OpenShift cluster, which cannot connect to the internet. The error is:
[root@helper ~]# podman logs -f kraken
2021-03-31 06:33:30,085 [INFO] Starting kraken
2021-03-31 06:33:30,091 [INFO] Initializing client to talk to the Kubernetes cluster
2021-03-31 06:33:31,060 [INFO] Fetching cluster info
2021-03-31 06:33:31,977 [INFO]
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.0 True False 6d17h Cluster version is 4.7.0
Kubernetes control plane is running at https://api.ocp4.example.com:6443
2021-03-31 06:33:31,978 [INFO] Daemon mode not enabled, will run through 1 iterations
2021-03-31 06:33:31,978 [INFO] Executing scenarios for iteration 0
2021-03-31 06:33:32,481 [INFO] Scenario: scenarios/etcd.yml has been successfully injected!
2021-03-31 06:33:32,481 [INFO] Waiting for the specified duration: 60
2021-03-31 06:34:32,965 [INFO] Scenario: scenarios/regex_openshift_pod_kill.yml has been successfully injected!
2021-03-31 06:34:32,965 [INFO] Waiting for the specified duration: 60
_ _
| | ___ __ __ _| | _____ _ __
| |/ / '__/ _` | |/ / _ \ '_ \
| <| | | (_| | < __/ | | |
|_|\_\_| \__,_|_|\_\___|_| |_|
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 570, in _build_master
ws.require(__requires__)
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 888, in require
needed = self.resolve(parse_requirements(requirements))
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 779, in resolve
raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (setuptools 39.2.0 (/usr/lib/python3.6/site-packages), Requirement.parse('setuptools>=40.3.0'), {'google-auth'})
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/powerfulseal", line 6, in <module>
from pkg_resources import load_entry_point
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3095, in <module>
@_call_aside
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3079, in _call_aside
f(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3108, in _initialize_master_working_set
working_set = WorkingSet._build_master()
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 572, in _build_master
return cls._build_from_requirements(__requires__)
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 585, in _build_from_requirements
dists = ws.resolve(reqs, Environment())
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 774, in resolve
raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'setuptools>=40.3.0' distribution was not found and is required by google-auth
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 570, in _build_master
ws.require(__requires__)
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 888, in require
needed = self.resolve(parse_requirements(requirements))
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 779, in resolve
raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (setuptools 39.2.0 (/usr/lib/python3.6/site-packages), Requirement.parse('setuptools>=40.3.0'), {'google-auth'})
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/powerfulseal", line 6, in <module>
from pkg_resources import load_entry_point
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3095, in <module>
@_call_aside
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3079, in _call_aside
f(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3108, in _initialize_master_working_set
working_set = WorkingSet._build_master()
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 572, in _build_master
return cls._build_from_requirements(__requires__)
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 585, in _build_from_requirements
dists = ws.resolve(reqs, Environment())
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 774, in resolve
raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'setuptools>=40.3.0' distribution was not found and is required by google-auth
Traceback (most recent call last):
File "run_kraken.py", line 403, in <module>
main(options.cfg)
File "run_kraken.py", line 351, in main
node_scenarios(scenarios_list, config)
File "run_kraken.py", line 212, in node_scenarios
node_scenario_object = get_node_scenario_object(node_scenario)
File "run_kraken.py", line 32, in get_node_scenario_object
return aws_node_scenarios()
File "/root/kraken/kraken/node_actions/aws_node_scenarios.py", line 66, in __init__
self.aws = AWS()
File "/root/kraken/kraken/node_actions/aws_node_scenarios.py", line 12, in __init__
self.boto_client = boto3.client('ec2')
File "/usr/local/lib/python3.6/site-packages/boto3/__init__.py", line 93, in client
return _get_default_session().client(*args, **kwargs)
return _get_default_session().client(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/boto3/session.py", line 263, in client
aws_session_token=aws_session_token, config=config)
File "/usr/local/lib/python3.6/site-packages/botocore/session.py", line 838, in create_client
client_config=config, api_version=api_version)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 87, in create_client
verify, credentials, scoped_config, client_config, endpoint_bridge)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 328, in _get_client_args
verify, credentials, scoped_config, client_config, endpoint_bridge)
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 73, in get_client_args
endpoint_url, is_secure, scoped_config)
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 154, in compute_client_args
s3_config=s3_config,
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 220, in _compute_endpoint_config
return self._resolve_endpoint(**resolve_endpoint_kwargs)
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 303, in _resolve_endpoint
service_name, region_name, endpoint_url, is_secure)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 402, in resolve
service_name, region_name)
File "/usr/local/lib/python3.6/site-packages/botocore/regions.py", line 134, in construct_endpoint
partition, service_name, region_name)
File "/usr/local/lib/python3.6/site-packages/botocore/regions.py", line 148, in _endpoint_for_partition
raise NoRegionError()
botocore.exceptions.NoRegionError: You must specify a region.
^C
[root@helper ~]# cat config.yaml
kraken:
    kubeconfig_path: /data/install/auth/kubeconfig  # Path to kubeconfig
    exit_on_failure: False  # Exit when a post action scenario fails
    chaos_scenarios:  # List of policies/chaos scenarios to load
        - pod_scenarios:  # List of chaos pod scenarios to load
            - - scenarios/etcd.yml
            - - scenarios/regex_openshift_pod_kill.yml
              - scenarios/post_action_regex.py
        - node_scenarios:  # List of chaos node scenarios to load
            - scenarios/node_scenarios_example.yml
        - pod_scenarios:
            - - scenarios/openshift-apiserver.yml
            - - scenarios/openshift-kube-apiserver.yml
        - time_scenarios:  # List of chaos time scenarios to load
            - scenarios/time_scenarios_example.yml

cerberus:
    cerberus_enabled: False  # Enable it when cerberus is previously installed
    cerberus_url:  # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal

performance_monitoring:
    deploy_dashboards: False  # Install a mutable grafana and load the performance dashboards. Enable this only when running on OpenShift
    repo: "https://github.com/cloud-bulldozer/performance-dashboards.git"

tunings:
    wait_duration: 60  # Duration to wait between each chaos scenario
    iterations: 1  # Number of times to execute the scenarios
    daemon_mode: False  # Iterations are set to infinity which means that the kraken will cause chaos forever
Do you have any suggestions?
Hey all,
We've been doing some chaos testing using kraken and we've noticed some signalling issues from cerberus when running litmus scenarios. In our case we are running a pod network loss experiment that reports a successful result, but cerberus returns a no-go signal immediately afterwards because a liveness check failure restarted the pod during the experiment.
After looking at the code it seems the ordering for checking the signal is wrong in the litmus_scenarios function:
cerberus_integration(config)
logging.info("Waiting for the specified duration: %s" % wait_duration)
time.sleep(wait_duration)
In the other functions cerberus is checked after the wait duration:
logging.info("Waiting for the specified duration: %s" % (wait_duration))
time.sleep(wait_duration)
cerberus_integration(config)
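The fix is just that reordering; for illustration, the corrected tail can be sketched as a small standalone function (the names here are stand-ins for kraken's actual helpers):

```python
import logging
import time

def scenario_tail(config, wait_duration, cerberus_integration):
    # Corrected ordering: let the cluster settle for the wait duration
    # first, then ask cerberus for its go/no-go signal.
    logging.info("Waiting for the specified duration: %s", wait_duration)
    time.sleep(wait_duration)
    cerberus_integration(config)

# With a recording stub, the cerberus check runs only after the sleep:
calls = []
scenario_tail({"cerberus": {}}, 0, lambda cfg: calls.append("checked"))
```

This way a pod restarted mid-experiment has the wait duration to become healthy again before the signal is sampled.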
When I try to run the command on a regular host I see the issue:
[root@ip-10-201-36-93 cerberus]# python3 start_cerberus.py --config=config/config.yaml
Traceback (most recent call last):
File "start_cerberus.py", line 5, in
import yaml
ImportError: No module named 'yaml'
[root@ip-10-201-36-93 cerberus]#
The scenario is to run Kraken tests on a host with Power (ppc64le) architecture. This would be beneficial for testing clusters in environments that have only Power and no VMs with other architectures.
I've run into many pip and yum package dependency issues while installing Kraken directly on a RHEL host running on Power. The Dockerfile currently available only builds for the Intel architecture.
To solve this, I have created a separate Dockerfile for ppc64le so that the tool can be run without any dependency problems.
It's important to check how Kubernetes/OpenShift deals with node failures; a failure can be a node termination or a stop/shutdown. Nodes can be terminated by hitting the API of the cloud platform on top of which OpenShift has been deployed (AWS, GCP, Azure, etc.).
Kraken can accept a config from the user with details including the role/selector of the node to kill (Master/node-role.kubernetes.io/master, Worker/node-role.kubernetes.io/worker, Infra/node-role.kubernetes.io/infra), the number of nodes to kill, and the wait between each kill, and act accordingly.
AFAIU kraken currently kills pods by deleting them. However, this is a rather graceful termination for the process running in the pod.
It would reflect real-life scenarios much more closely if the processes in the pod's containers were killed with kill -9.
This would simulate applications terminating arbitrarily, and then we could see how the system reacts to it.
We do have issues filed around the features and enhancements that are open or being worked on, but it might be useful to have a roadmap in the README with pointers to the major ones, like Litmus integration (#31), to help users understand where we need help as well as where the project is going.
The tool injects failures but doesn't check whether the targeted components recovered from them; it's important to add such checks to determine the pass/fail criteria for a run.
How do we make sure that the targeted components recovered from the chaos scenarios without taking a performance hit? The on-cluster-latest.json and etcd-on-cluster-dashboard.json dashboards can be loaded in Dittybopper to analyze most of the metrics needed to evaluate the performance hit on the components. We could create a new dashboard for the additional metrics that are not collected in the above dashboards. We can recommend that users install Dittybopper, as this would enable them to look into all the metrics and analyze the performance of the targeted components.
Investigate https://github.com/JohnStrunk/ocs-monkey and see if we can leverage it for storage testing. Had a discussion with Badre Tejado-Imam about this tool and was told by the developers that it can be extended beyond storage testing.
I have some general questions.
I want to run Kraken in one OpenShift cluster and point it at another cluster for the chaos testing.
- pod_scenarios:  # List of chaos pod scenarios to load
    - - scenarios/etcd.yml
    - - scenarios/regex_openshift_pod_kill.yml
      - scenarios/post_action_regex.py
- node_scenarios:  # List of chaos node scenarios to load
    - scenarios/node_scenarios_example.yml
- pod_scenarios:
    - - scenarios/openshift-apiserver.yml
    - - scenarios/openshift-kube-apiserver.yml
Thanks.
Currently it looks like we support AWS and GCP for node scenarios like node power off, power on, and reboot.
If we have an ipmitool-based library for the node scenarios, we can operate against baremetal OpenShift.
Some raw calls for power on/off:
ipmitool -I lanplus -H <host> -U <user> -P <password> chassis power off
ipmitool -I lanplus -H <host> -U <user> -P <password> chassis power on
Keep in mind, if a node is powered off already, the power off command will return an error, so you need to check the power status first.
https://pypi.org/project/python-ipmi/ is a library on which you can base your module.
Also, the mapping between node name and baremetal host FQDN is not straightforward on baremetal installs.
[kni@e16-h18-b03-fc640 ~]$ oc project openshift-machine-api
Now using project "openshift-machine-api" on server "https://api.test155.myocp4.com:6443".
[kni@e16-h18-b03-fc640 ~]$ oc get bmh
NAME STATUS PROVISIONING STATUS CONSUMER BMC HARDWARE PROFILE ONLINE ERROR
master-0 OK externally provisioned test155-77wr8-master-0 ipmi://mgmt-e16-h18-b04-fc640.rdu2.scalelab.redhat.com:623 true
master-1 OK externally provisioned test155-77wr8-master-1 ipmi://mgmt-e16-h20-b02-fc640.rdu2.scalelab.redhat.com:623 true
master-2 OK externally provisioned test155-77wr8-master-2 ipmi://mgmt-e16-h20-b04-fc640.rdu2.scalelab.redhat.com:623 true
storage001-r640 OK provisioned test155-77wr8-worker-0-p62h6 ipmi://mgmt-f04-h10-000-r640.rdu2.scalelab.redhat.com unknown true
storage002-r640 OK provisioned test155-77wr8-worker-0-4wwqz ipmi://mgmt-f04-h11-000-r640.rdu2.scalelab.redhat.com unknown true
storage003-r640 OK provisioned test155-77wr8-worker-0-rb6dt ipmi://mgmt-f04-h12-000-r640.rdu2.scalelab.redhat.com unknown true
storage004-r640 OK provisioned test155-77wr8-worker-0-7frns ipmi://mgmt-f04-h13-000-r640.rdu2.scalelab.redhat.com unknown true
storage005-r640 OK provisioned test155-77wr8-worker-0-bnnls ipmi://mgmt-f04-h33-000-r640.rdu2.scalelab.redhat.com unknown true
worker-0 OK provisioned test155-77wr8-worker-0-jjrq6 ipmi://mgmt-e16-h24-b01-fc640.rdu2.scalelab.redhat.com:623 unknown true
worker-1 OK provisioned test155-77wr8-worker-0-jbnsq ipmi://mgmt-e16-h24-b02-fc640.rdu2.scalelab.redhat.com:623 unknown true
is how you get the mapping between node name and management IP FQDN.
At the minimum we should support
node_start_scenario
node_stop_scenario
node_stop_start_scenario
node_reboot_scenario
in https://github.com/openshift-scale/kraken/blob/master/docs/node_scenarios.md
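The status-check-before-power-off logic above could be sketched as follows. This is a hypothetical wrapper around the raw ipmitool calls shown earlier, not kraken's actual code; the runner is injected so the logic can be exercised without real hardware:

```python
import subprocess

def ipmitool_runner(host, user, password):
    # Build a runner that shells out to ipmitool over lanplus
    # (assumes ipmitool is installed and the BMC is reachable).
    def run(*args):
        cmd = ["ipmitool", "-I", "lanplus", "-H", host,
               "-U", user, "-P", password, *args]
        return subprocess.run(cmd, capture_output=True, text=True).stdout
    return run

def power_off_if_on(run):
    # ipmitool reports e.g. "Chassis Power is on" / "Chassis Power is off";
    # only issue the power off when the chassis is actually on, since
    # powering off an already-off node returns an error.
    status = run("chassis", "power", "status")
    if "is on" in status:
        run("chassis", "power", "off")
        return True
    return False

# Exercised with a fake runner standing in for a powered-off node:
assert power_off_if_on(lambda *a: "Chassis Power is off") is False
```

Against a real node you would pass `power_off_if_on(ipmitool_runner(host, user, password))` with the BMC FQDN obtained from `oc get bmh` as shown above.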
Kraken currently supports deploying mutable Grafana with pre-loaded dashboards to help monitor the cluster state as well as various performance metrics, but the user needs to know the start and end time of a run to narrow down the metrics. This is difficult especially in CI/automation runs, where no one is monitoring the run and the cluster will be terminated afterwards: https://github.com/cloud-bulldozer/kraken#performance-monitoring.
To solve this, the proposal is to add the ability for Kraken to accept a profile consisting of metrics to scrape from Prometheus for the test-run duration and store them in Elasticsearch; this way we can use Grafana to visualize them. Each test run can have a uuid associated with it to differentiate runs.
Kube-burner indexing feature can be leveraged to achieve this: https://kube-burner.readthedocs.io/en/latest/indexers/.
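As a rough sketch of the scraping side: the endpoint path is Prometheus' standard query_range API, but the metric name, host, and uuid handling here are only illustrative assumptions about what such a profile might drive:

```python
import urllib.parse
import uuid

def range_query_url(prom_url, query, start, end, step="30s"):
    # Build a /api/v1/query_range URL covering the chaos-run window
    # (start/end are unix timestamps captured around the test run).
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step})
    return "%s/api/v1/query_range?%s" % (prom_url, params)

run_uuid = str(uuid.uuid4())  # tag every indexed document with the run id
url = range_query_url("https://prometheus.example:9091",
                      "etcd_disk_wal_fsync_duration_seconds_bucket",
                      1618000000, 1618000600)
```

Each query result would then be indexed into Elasticsearch alongside `run_uuid`, so Grafana can filter a dashboard down to a single run.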
Kraken needs to support scenarios that introduce communication delays/network latency to simulate degradation or outages in a network and verify that the system, especially critical components like etcd, handles it well in terms of leader election and receiving heartbeats from the members.
We also need to support injecting failures such as dropping packets, blocking DNS, etc.
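One common way to implement such latency injection on a node is `tc netem`. A sketch that only builds the commands is below; the interface name and values are examples, and this is not kraken's actual implementation:

```python
def netem_commands(interface, delay_ms, jitter_ms=0, loss_pct=0.0):
    # Build the tc commands to add, and later remove, a netem qdisc that
    # injects latency (and optionally packet loss) on the given interface.
    add = ["tc", "qdisc", "add", "dev", interface, "root", "netem",
           "delay", "%dms" % delay_ms]
    if jitter_ms:
        add += ["%dms" % jitter_ms]
    if loss_pct:
        add += ["loss", "%.1f%%" % loss_pct]
    cleanup = ["tc", "qdisc", "del", "dev", interface, "root", "netem"]
    return add, cleanup

add_cmd, del_cmd = netem_commands("eth0", delay_ms=100, jitter_ms=10)
```

Running `add_cmd` on a node (with root privileges), waiting out the scenario duration, and then running `del_cmd` would give the inject/recover cycle the scenario needs; the etcd leader-election behavior could then be observed during the window.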
The bastion or helper node is useful for deploying OpenShift clusters and hosts a number of important services such as dhcpd, haproxy, etc. Running node-level tests on this node would be useful for checking its stability and that of the services running on it.
I attempted to run the base AWS node scenario provided, and I received an error message in the OCP Console pod log for the Kraken deployment. It says that I need to specify a region, but there isn't a place or option to specify a region in the config.yaml or node_scenarios_example.yml files.
Here is the output from the OCP Console:
Traceback (most recent call last):
File "run_kraken.py", line 428, in <module>
main(options.cfg)
File "run_kraken.py", line 376, in main
node_scenarios(scenarios_list, config)
File "run_kraken.py", line 231, in node_scenarios
node_scenario_object = get_node_scenario_object(node_scenario)
File "run_kraken.py", line 33, in get_node_scenario_object
return aws_node_scenarios()
File "/root/kraken/kraken/node_actions/aws_node_scenarios.py", line 66, in __init__
self.aws = AWS()
File "/root/kraken/kraken/node_actions/aws_node_scenarios.py", line 12, in __init__
self.boto_client = boto3.client('ec2')
File "/usr/local/lib/python3.6/site-packages/boto3/__init__.py", line 93, in client
return _get_default_session().client(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/boto3/session.py", line 263, in client
aws_session_token=aws_session_token, config=config)
File "/usr/local/lib/python3.6/site-packages/botocore/session.py", line 851, in create_client
client_config=config, api_version=api_version)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 87, in create_client
verify, credentials, scoped_config, client_config, endpoint_bridge)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 328, in _get_client_args
verify, credentials, scoped_config, client_config, endpoint_bridge)
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 73, in get_client_args
endpoint_url, is_secure, scoped_config)
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 154, in compute_client_args
s3_config=s3_config,
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 220, in _compute_endpoint_config
return self._resolve_endpoint(**resolve_endpoint_kwargs)
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 303, in _resolve_endpoint
service_name, region_name, endpoint_url, is_secure)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 402, in resolve
service_name, region_name)
File "/usr/local/lib/python3.6/site-packages/botocore/regions.py", line 134, in construct_endpoint
partition, service_name, region_name)
File "/usr/local/lib/python3.6/site-packages/botocore/regions.py", line 148, in _endpoint_for_partition
raise NoRegionError()
botocore.exceptions.NoRegionError: You must specify a region.
Here is the kraken-config configmap:
Name: kraken-config
Namespace: kraken-aws
Labels: <none>
Annotations: <none>
Data
====
config.yaml:
----
kraken:
    kubeconfig_path: /root/.kube/config  # Path to kubeconfig
    exit_on_failure: False  # Exit when a post action scenario fails
    litmus_version: v1.10.0  # Litmus version to install
    litmus_uninstall: False  # If you want to uninstall litmus if failure
    chaos_scenarios:  # List of policies/chaos scenarios to load
        - node_scenarios:  # List of chaos node scenarios to load
            - scenarios/node_scenarios_example.yml

cerberus:
    cerberus_enabled: False  # Enable it when cerberus is previously installed
    cerberus_url:  # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal

performance_monitoring:
    deploy_dashboards: False  # Install a mutable grafana and load the performance dashboards. Enable this only when running on OpenShift
    repo: "https://github.com/cloud-bulldozer/performance-dashboards.git"

tunings:
    wait_duration: 60  # Duration to wait between each chaos scenario
    iterations: 1  # Number of times to execute the scenarios
    daemon_mode: False  # Iterations are set to infinity which means that the kraken will cause chaos forever
Events: <none>
I am using the default node_scenarios_example.yml file.
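boto3 resolves the region from the environment (AWS_DEFAULT_REGION) or from ~/.aws/config when none is passed explicitly, so one workaround is to make sure the variable is set in the kraken container before the EC2 client is created. The helper name and region value below are just examples:

```python
import os

def ensure_aws_region(default="us-west-2"):
    # Hypothetical helper: boto3 falls back to AWS_DEFAULT_REGION when no
    # region is configured elsewhere, so exporting one before boto3.client('ec2')
    # is called avoids the NoRegionError above. "us-west-2" is only an example.
    os.environ.setdefault("AWS_DEFAULT_REGION", default)
    return os.environ["AWS_DEFAULT_REGION"]

region = ensure_aws_region()
```

In a deployment, the equivalent is adding AWS_DEFAULT_REGION (plus credentials) to the pod's environment, e.g. via the configmap/secret that already carries the kraken config.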
Cerberus (https://github.com/openshift-scale/cerberus) needs to be run after each scenario specified in the config to monitor the cluster under test, and the aggregated go/no-go signal it generates needs to be consumed by Kraken to determine pass/fail, i.e. to make sure the whole Kubernetes/OpenShift cluster is healthy after the failure injection, not just the targeted component.
The container might want to:
- log to stdout/stderr
- use more than one process and more than one thread in a process

Users need an option to run kraken as a container; this eliminates the need to deal with installing dependencies and package conflicts.
While I was looking at trying to add code to powerfulseal for the regex namespace issue I opened (powerfulseal/powerfulseal#274) I noticed that the latest python pip package is very outdated compared to the code in github.
There are a couple of RC releases that you are able to install if given the version numbers. Seen here
Trying to test my code, I had to completely rewrite the pod kill scenario. As you can see in the documentation here, it no longer has the podScenario and nodeScenario sections; there is just one common scenarios section.
These changes are not too huge but will have to be done at some point in the future.
Examples:
2.9.0 scenario
3.0.* scenario
We will need to make updates to our scenarios when the pip package updates from the current 2.9.0 version.
Scenario: Node scenario using kraken tool
cloud_type is set to aws on a non-AWS cloud provider.
File: scenarios/node_scenarios_example.yml
node_scenarios:
    - actions:  # node chaos scenarios to be injected
        # - node_stop_start_scenario
        - stop_start_kubelet_scenario
        # - node_crash_scenario
      node_name: worker-0  # node on which scenario has to be injected
      label_selector: node-role.kubernetes.io/worker  # when node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection
      instance_kill_count: 1  # number of times to inject each scenario under actions
      timeout: 120  # duration to wait for completion of node scenario injection
      cloud_type: aws  # cloud type on which Kubernetes/OpenShift runs
Issue seen:
[root@rails1 kraken]# python3 run_kraken.py --config config/config.yaml
_ _
| | ___ __ __ _| | _____ _ __
| |/ / '__/ _` | |/ / _ \ '_ \
| <| | | (_| | < __/ | | |
|_|\_\_| \__,_|_|\_\___|_| |_|
2020-11-09 06:50:42,571 [INFO] Starting kraken
2020-11-09 06:50:42,577 [INFO] Initializing client to talk to the Kubernetes cluster
2020-11-09 06:50:44,678 [INFO] Fetching cluster info
2020-11-09 06:50:46,239 [INFO]
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.1 True False 11d Cluster version is 4.6.1
Kubernetes master is running at https://api.chaos-arc46.redhat.com:6443
2020-11-09 06:50:46,240 [INFO] Daemon mode not enabled, will run through 1 iterations
2020-11-09 06:50:46,240 [INFO] Executing scenarios for iteration 0
Traceback (most recent call last):
File "run_kraken.py", line 298, in <module>
main(options.cfg)
File "run_kraken.py", line 240, in main
node_scenario_object = get_node_scenario_object(node_scenario)
File "run_kraken.py", line 21, in get_node_scenario_object
return aws_node_scenarios()
File "/root/chaos/kraken/kraken/node_actions/aws_node_scenarios.py", line 66, in __init__
self.aws = AWS()
File "/root/chaos/kraken/kraken/node_actions/aws_node_scenarios.py", line 12, in __init__
self.boto_client = boto3.client('ec2')
File "/usr/local/lib/python3.6/site-packages/boto3/__init__.py", line 91, in client
return _get_default_session().client(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/boto3/session.py", line 263, in client
aws_session_token=aws_session_token, config=config)
File "/usr/local/lib/python3.6/site-packages/botocore/session.py", line 838, in create_client
client_config=config, api_version=api_version)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 87, in create_client
verify, credentials, scoped_config, client_config, endpoint_bridge)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 328, in _get_client_args
verify, credentials, scoped_config, client_config, endpoint_bridge)
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 73, in get_client_args
endpoint_url, is_secure, scoped_config)
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 153, in compute_client_args
s3_config=s3_config,
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 218, in _compute_endpoint_config
return self._resolve_endpoint(**resolve_endpoint_kwargs)
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 301, in _resolve_endpoint
service_name, region_name, endpoint_url, is_secure)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 402, in resolve
service_name, region_name)
File "/usr/local/lib/python3.6/site-packages/botocore/regions.py", line 134, in construct_endpoint
partition, service_name, region_name)
File "/usr/local/lib/python3.6/site-packages/botocore/regions.py", line 148, in _endpoint_for_partition
raise NoRegionError()
botocore.exceptions.NoRegionError: You must specify a region.
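The traceback shows get_node_scenario_object constructing aws_node_scenarios unconditionally, regardless of cloud_type. A dispatch sketch along these lines (the classes here are stand-ins for kraken's actual scenario classes) would select by cloud type and fail fast with a clearer message for unsupported clouds:

```python
class AwsNodeScenarios:
    """Stand-in for kraken's aws_node_scenarios class."""

class GcpNodeScenarios:
    """Stand-in for a GCP implementation."""

# Map the config's cloud_type value to an implementation.
SCENARIO_CLASSES = {"aws": AwsNodeScenarios, "gcp": GcpNodeScenarios}

def get_node_scenario_object(node_scenario):
    # Pick the implementation by cloud_type instead of assuming AWS.
    cloud_type = node_scenario.get("cloud_type", "aws")
    try:
        return SCENARIO_CLASSES[cloud_type]()
    except KeyError:
        raise NotImplementedError(
            "node scenarios are not supported on cloud_type %r yet" % cloud_type)
```

With this shape, a config that sets cloud_type: openstack would report the unsupported cloud up front instead of dying deep inside boto3.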
Addition of functional tests to ensure that PRs don't break the existing functionality of Kraken, similar to the functional tests in Cerberus. This eases code review and ensures that commits are non-breaking.
(PS: It would also enable contributors without a cluster to confidently open PRs for minor enhancements :D)