krkn-chaos / krkn
Chaos and resiliency testing tool for Kubernetes with a focus on improving performance under failure conditions. A CNCF sandbox project.
License: Apache License 2.0
Scenario: Enable node scenarios for an OCP cluster on a non-aws cloud type
I updated cloud_type to openstack in the scenarios/node_scenarios_example.yml file:
node_scenarios:
  - actions:                                        # node chaos scenarios to be injected
    # - node_stop_start_scenario
    - stop_start_kubelet_scenario
    # - node_crash_scenario
    node_name: worker-0                             # node on which scenario has to be injected
    label_selector: node-role.kubernetes.io/worker  # when node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection
    instance_kill_count: 1                          # number of times to inject each scenario under actions
    timeout: 120                                    # duration to wait for completion of node scenario injection
    cloud_type: openstack                           # cloud type on which Kubernetes/OpenShift runs
Issue seen: Script fails when any value other than aws is specified in cloud_type.
[root@rails1 kraken]# python3 run_kraken.py --config config/config.yaml
_ _
| | ___ __ __ _| | _____ _ __
| |/ / '__/ _` | |/ / _ \ '_ \
| <| | | (_| | < __/ | | |
|_|\_\_| \__,_|_|\_\___|_| |_|
2020-11-09 06:23:01,238 [INFO] Starting kraken
2020-11-09 06:23:01,244 [INFO] Initializing client to talk to the Kubernetes cluster
2020-11-09 06:23:03,448 [INFO] Fetching cluster info
2020-11-09 06:23:04,360 [INFO]
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.1 True False 11d Cluster version is 4.6.1
Kubernetes master is running at https://api.chaos-arc46.redhat.com:6443
2020-11-09 06:23:04,361 [INFO] Daemon mode not enabled, will run through 1 iterations
2020-11-09 06:23:04,361 [INFO] Executing scenarios for iteration 0
Traceback (most recent call last):
File "run_kraken.py", line 298, in <module>
main(options.cfg)
File "run_kraken.py", line 244, in main
node_scenario_object)
File "run_kraken.py", line 48, in inject_node_scenario
node_scenario_object.stop_start_kubelet_scenario(instance_kill_count, node, timeout)
AttributeError: 'NoneType' object has no attribute 'stop_start_kubelet_scenario'
Expected result: The documentation should specify that the node scenarios run only on the aws cloud type. As per the current documentation, the kubelet stop/start node scenario should run on any cloud platform.
I came across two files, abstract_node_scenarios.py and general_cloud_node_scenarios.py, which I think serve the same purpose, and I think they could be combined into a single base/general class.
Thoughts? Correct me if I haven't gotten the use case for having two different base classes :)
cc: @paigerube14
I enabled Grafana in the config.yaml file and received the following message in the OCP console saying that jq was not found. envsubst also cannot be found. I have jq installed locally on the machine I'm using to run the oc apply -f kraken.yml command to initialize the deployment.
Using k8s command: oc
Using namespace: dittybopper
Using default grafana password: admin
Getting environment vars...
./deploy.sh: line 111: jq: command not found
./deploy.sh: line 112: jq: command not found
Prometheus URL is:
Prometheus password collected.
Creating namespace...
namespace/dittybopper created
Deploying Grafana...
./deploy.sh: line 122: envsubst: command not found
error: no objects passed to apply
First, thanks for making this tool available to the general public, it looks really awesome!
Is there already a roadmap for supporting different clouds, in our case specifically Azure?
If so can you provide any rough indication on when this would be added?
We use OpenShift on Azure and this tool looks really awesome, so that's why we would love to use this in our setup!
It is useful to continuously run chaos scenarios while running other workloads. Add a config option to loop through the configured scenarios a set number of times or infinitely. Also, provide a configuration for a "sleep time" between loops.
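A minimal sketch of what such a config option could look like (the key names here are illustrative, not Kraken's actual schema):

```yaml
# Hypothetical config fragment: loop through scenarios with a sleep between loops
tunings:
  iterations: 5        # number of times to loop through the configured scenarios
  daemon_mode: false   # when true, loop indefinitely regardless of iterations
  wait_duration: 60    # "sleep time" in seconds between loops
```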
Add a scenario to kill a random pod selected from pods matching a regexp. We have a simple, but effective shell script in OpenShift testing that randomly takes out infra pods. Combined with Cerberus (https://github.com/openshift-scale/cerberus) we can detect non-transient issues caused by such a scenario.
Example (in shell scripting):
#!/bin/bash
# Delete $1 random running pods from openshift-* namespaces every $2 seconds.
NUM=$1
SLEEP=$2
while true; do
  echo "============"
  oc get pods --all-namespaces | grep openshift | grep Running \
    | awk '{print $1" "$2}' | shuf -n "$NUM" \
    | while read -r ns pod; do
        oc delete pod -n "$ns" "$pod" --wait=false
      done
  sleep "$SLEEP"
done
Issue:
Currently, Kraken's node-level chaos tests are unsupported on an OpenStack-based OCP cluster.
Solution:
We have enabled support by writing code to connect to and run the node-level chaos tests on an OpenStack-based cluster.
We have also updated the node scenarios documentation with instructions for the OpenStack cloud type.
I am trying to build an image to run in DEBUG mode, but it fails when I attempt to build in kraken/containers using the Dockerfile. I am receiving a Python error:
Collecting azure-keyvault-certificates~=4.1 (from azure-keyvault->-r requirements.txt (line 9))
Downloading https://files.pythonhosted.org/packages/23/2b/e3219ae48391263b4657817b61eb5104ac1190d360e288e4546a18e61ca5/azure_keyvault_certificates-4.2.1-py2.py3-none-any.whl (235kB)
Collecting azure-keyvault-keys~=4.1 (from azure-keyvault->-r requirements.txt (line 9))
Downloading https://files.pythonhosted.org/packages/a5/7f/a626704af6f0fb36bdc16397b49ab313a7e7a801ded5b43c6a74391f536a/azure_keyvault_keys-4.3.1-py2.py3-none-any.whl (248kB)
Collecting azure-keyvault-secrets~=4.1 (from azure-keyvault->-r requirements.txt (line 9))
Downloading https://files.pythonhosted.org/packages/43/ae/4ad0c67e54f820d0ead249ce48e8cf498cfee42034dcd837d8f218ba9e76/azure_keyvault_secrets-4.2.0-py2.py3-none-any.whl (206kB)
Collecting azure-core<2.0.0,>=1.0.0 (from azure-identity->-r requirements.txt (line 10))
Downloading https://files.pythonhosted.org/packages/19/18/21cfd7faf7ab24c35689c9f199179081cee8fec44668c7f090e1db61226d/azure_core-1.13.0-py2.py3-none-any.whl (133kB)
Collecting msal<2.0.0,>=1.6.0 (from azure-identity->-r requirements.txt (line 10))
Downloading https://files.pythonhosted.org/packages/68/a9/b534f1158ffce8c551dea86d90981e9bd892f310c4c27d079d6b4b88849a/msal-1.11.0-py2.py3-none-any.whl (63kB)
Collecting msal-extensions~=0.3.0 (from azure-identity->-r requirements.txt (line 10))
Downloading https://files.pythonhosted.org/packages/49/cb/c833ffa0f97c3098b146ac19bb2266c2d84b2119ffff83fdf001bb59d3ae/msal_extensions-0.3.0-py2.py3-none-any.whl
Collecting cryptography>=2.1.4 (from azure-identity->-r requirements.txt (line 10))
Downloading https://files.pythonhosted.org/packages/9b/77/461087a514d2e8ece1c975d8216bc03f7048e6090c5166bc34115afdaa53/cryptography-3.4.7.tar.gz (546kB)
Complete output from command python setup.py egg_info:
=============================DEBUG ASSISTANCE==========================
If you are seeing an error here please try the following to
successfully install cryptography:
Upgrade to the latest pip and try again. This will fix errors for most
users. See: https://pip.pypa.io/en/stable/installing/#upgrading-pip
=============================DEBUG ASSISTANCE==========================
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-build-4q3jv9uv/cryptography/setup.py", line 14, in <module>
from setuptools_rust import RustExtension
ModuleNotFoundError: No module named 'setuptools_rust'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-4q3jv9uv/cryptography/
The command '/bin/sh -c yum install -y git python36 python3-pip && git clone https://github.com/openshift-scale/kraken /root/kraken && mkdir -p /root/.kube && cd /root/kraken && pip3 install -r requirements.txt' returned a non-zero code: 1
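The failure comes from building cryptography 3.4+ from source under an old pip without setuptools-rust. A possible fix, sketched as a change to the Dockerfile's install step (the exact layout of the upstream Dockerfile may differ):

```dockerfile
# Upgrade pip first so a prebuilt cryptography wheel is used instead of a
# source build that requires setuptools-rust and a Rust toolchain.
RUN yum install -y git python36 python3-pip && \
    pip3 install --upgrade pip setuptools && \
    pip3 install setuptools-rust && \
    git clone https://github.com/openshift-scale/kraken /root/kraken && \
    mkdir -p /root/.kube && \
    cd /root/kraken && \
    pip3 install -r requirements.txt
```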
There are some calls to exec that shell out to kubectl; instead, we should be leveraging the k8s Python client.
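As a sketch of the direction (assuming the official kubernetes Python package; the helper names are illustrative, not Kraken's actual API):

```python
def running_pods(pod_list):
    """Filter a V1PodList (anything with .items) down to (namespace, name) pairs of running pods."""
    return [(p.metadata.namespace, p.metadata.name)
            for p in pod_list.items
            if p.status.phase == "Running"]

def list_running_pods(label_selector=None):
    """List running pods via the official client instead of shelling out to kubectl."""
    # Deferred import so running_pods stays usable without a cluster connection.
    from kubernetes import client, config
    config.load_kube_config()  # use config.load_incluster_config() when running in a pod
    v1 = client.CoreV1Api()
    return running_pods(v1.list_pod_for_all_namespaces(label_selector=label_selector))
```

The split into a pure filter plus a thin client wrapper keeps the selection logic unit-testable without a live cluster.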
Task:
This task is for providing a sample scenario for executing pod-level chaos tests for a custom application (say, the acme-air app) deployed on the OpenShift Container Platform. The sample scenario will be added to the existing scenarios directory.
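Since Kraken pulls in PowerfulSeal (visible in the requirements elsewhere in this thread), such a sample might look roughly like this PowerfulSeal-style policy (the acme-air namespace and label selector are assumptions):

```yaml
# Hypothetical PowerfulSeal-style policy targeting the acme-air app
config:
  runStrategy:
    runs: 1
scenarios:
  - name: "kill a random acme-air pod"
    steps:
      - podAction:
          matches:
            - labels:
                namespace: "acme-air"
                selector: "app=acme-air"
          filters:
            - randomSample:
                size: 1
          actions:
            - kill:
                probability: 1
```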
From PR #38 : This might be a good time to also separate out each different type of scenario into its own function. This might make it easier to read and add the many more types of scenarios that are being worked on as well.
This should be done after #37 merges and #38 is finalized. @yashashreesuresh @paigerube14 @chaitanyaenr
I am trying to deploy the image and it is not working on OpenShift.
[root@ip-00-000-00-00 kraken]# ls -lrt
total 15988
-rw-r-----. 1 root root 14421964 Feb 25 2015 Python-3.4.3.tar.xz
drwxr-x---. 3 root root 22 Nov 10 15:36 CI
-rw-r-----. 1 root root 11357 Nov 10 15:36 LICENSE
-rw-r-----. 1 root root 2352 Nov 10 15:36 README.md
drwxr-x---. 4 root root 84 Nov 10 15:36 ansible
drwxr-x---. 2 root root 88 Nov 10 15:36 containers
drwxr-x---. 2 root root 4096 Nov 10 15:36 docs
drwxr-x---. 6 root root 74 Nov 10 15:36 kraken
drwxr-x---. 2 root root 32 Nov 10 15:36 media
-rw-r-----. 1 root root 450 Nov 10 15:36 tox.ini
-rw-r-----. 1 root root 18 Nov 10 15:36 test-requirements.txt
-rw-r-----. 1 root root 297 Nov 10 15:36 setup.py
-rw-r-----. 1 root root 1099 Nov 10 15:36 setup.cfg
drwxr-x---. 2 root root 4096 Nov 10 15:36 scenarios
drwxr-x---. 2 root root 6 Nov 12 12:57 docker
drwxr-x---. 2 root root 46 Nov 12 14:27 config
-rw-r-----. 1 root root 13859 Nov 12 14:36 run_kraken.py
drwxr-x---. 5 root root 69 Nov 12 17:35 py36-venv
-rw-r-----. 1 root root 1886796 Nov 12 17:44 get-pip.py
-rw-r-----. 1 root root 93 Nov 12 17:52 requirements.txt
[root@ip-00-000-00-00 kraken]# python3 -V
Python 3.4.3
[root@ip-00-000-00-00 kraken]#
[root@ip-00-000-00-00 kraken]# pip3 -V
pip 20.2.4 from /usr/local/lib/python3.6/site-packages/pip (python 3.6)
[root@ip-00-000-00-00 kraken]# python3 -V
Python 3.4.3
[root@ip-00-000-00-00 kraken]# python -V
Python 3.6.8
[root@ip-00-000-00-00 kraken]#
[root@ip-00-000-00-00 kraken]#
[root@ip-00-000-00-00 kraken]# pip3 -V
pip 20.2.4 from /usr/local/lib/python3.6/site-packages/pip (python 3.6)
[root@ip-00-000-00-00 kraken]# pip3 install -r requirements.txt
Collecting git+https://github.com/powerfulseal/powerfulseal.git (from -r requirements.txt (line 4))
Cloning https://github.com/powerfulseal/powerfulseal.git to /tmp/pip-req-build-ipizx782
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing wheel metadata ... done
Requirement already satisfied (use --upgrade to upgrade): powerfulseal==3.2.0 from git+https://github.com/powerfulseal/powerfulseal.git in /usr/local/lib/python3.6/site-packages (from -r requirements.txt (line 4))
Requirement already satisfied: datetime in /usr/local/lib64/python3.6/site-packages (from -r requirements.txt (line 1)) (4.3)
Requirement already satisfied: pyfiglet in /usr/local/lib/python3.6/site-packages (from -r requirements.txt (line 2)) (0.8.post1)
Requirement already satisfied: PyYAML in /usr/local/lib64/python3.6/site-packages (from -r requirements.txt (line 3)) (5.3.1)
Requirement already satisfied: requests in /usr/local/lib/python3.6/site-packages (from -r requirements.txt (line 5)) (2.25.0)
Requirement already satisfied: boto3 in /usr/local/lib/python3.6/site-packages (from -r requirements.txt (line 6)) (1.16.16)
Requirement already satisfied: datadog<1.0.0,>=0.29.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.39.0)
Requirement already satisfied: python-dateutil<2.7.0,>=2.5.3 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (2.6.1)
Requirement already satisfied: coloredlogs<11.0.0,>=10.0.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (10.0)
Requirement already satisfied: azure-mgmt-resource<3.0.0,>=2.2.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (2.2.0)
Requirement already satisfied: google-api-python-client>=1.7.8 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.12.5)
Requirement already satisfied: google-auth>=1.6.2 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.23.0)
Requirement already satisfied: ConfigArgParse<1,>=0.11.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.15.2)
Requirement already satisfied: kubernetes==12.0.0a1 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (12.0.0a1)
Requirement already satisfied: future<1,>=0.16.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.18.2)
Requirement already satisfied: openstacksdk<1,>=0.10.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.51.0)
Requirement already satisfied: oauth2client>=4.1.3 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (4.1.3)
Requirement already satisfied: jsonschema<4,>=3.0.2 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (3.2.0)
Requirement already satisfied: Flask<2,>=1.0.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.1.2)
Requirement already satisfied: prometheus-client<0.4.0,>=0.3.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.3.1)
Requirement already satisfied: paramiko<3,>=2.5.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (2.7.2)
Requirement already satisfied: termcolor<2,>=1.1.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.1.0)
Requirement already satisfied: azure-common<2.0.0,>=1.1.23 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.1.26)
Requirement already satisfied: google-auth-httplib2>=0.0.3 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.0.4)
Requirement already satisfied: flask-cors<4,>=3.0.6 in /usr/local/lib64/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (3.0.9)
Requirement already satisfied: azure-mgmt-network<3.0.0,>=2.7.0 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (2.7.0)
Requirement already satisfied: spur<1,>=0.3.20 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.3.22)
Requirement already satisfied: azure-mgmt-compute<5.0.0,>=4.6.2 in /usr/local/lib/python3.6/site-packages (from powerfulseal==3.2.0->-r requirements.txt (line 4)) (4.6.2)
Requirement already satisfied: zope.interface in /usr/local/lib64/python3.6/site-packages (from datetime->-r requirements.txt (line 1)) (5.2.0)
Requirement already satisfied: pytz in /usr/local/lib/python3.6/site-packages (from datetime->-r requirements.txt (line 1)) (2020.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.6/site-packages (from requests->-r requirements.txt (line 5)) (1.26.1)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/site-packages (from requests->-r requirements.txt (line 5)) (2020.11.8)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/site-packages (from requests->-r requirements.txt (line 5)) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/site-packages (from requests->-r requirements.txt (line 5)) (2.10)
Requirement already satisfied: botocore<1.20.0,>=1.19.16 in /usr/local/lib/python3.6/site-packages (from boto3->-r requirements.txt (line 6)) (1.19.16)
Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /usr/local/lib/python3.6/site-packages (from boto3->-r requirements.txt (line 6)) (0.3.3)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/site-packages (from boto3->-r requirements.txt (line 6)) (0.10.0)
Requirement already satisfied: decorator>=3.3.2 in /usr/local/lib/python3.6/site-packages (from datadog<1.0.0,>=0.29.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (4.4.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/site-packages (from python-dateutil<2.7.0,>=2.5.3->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.15.0)
Requirement already satisfied: humanfriendly>=4.7 in /usr/local/lib/python3.6/site-packages (from coloredlogs<11.0.0,>=10.0.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (8.2)
Requirement already satisfied: msrest>=0.5.0 in /usr/local/lib/python3.6/site-packages (from azure-mgmt-resource<3.0.0,>=2.2.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.6.19)
Requirement already satisfied: msrestazure<2.0.0,>=0.4.32 in /usr/local/lib/python3.6/site-packages (from azure-mgmt-resource<3.0.0,>=2.2.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.6.4)
Requirement already satisfied: google-api-core<2dev,>=1.21.0 in /usr/local/lib/python3.6/site-packages (from google-api-python-client>=1.7.8->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.23.0)
Requirement already satisfied: httplib2<1dev,>=0.15.0 in /usr/local/lib/python3.6/site-packages (from google-api-python-client>=1.7.8->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.18.1)
Requirement already satisfied: uritemplate<4dev,>=3.0.0 in /usr/local/lib/python3.6/site-packages (from google-api-python-client>=1.7.8->powerfulseal==3.2.0->-r requirements.txt (line 4)) (3.0.1)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.2->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.2.8)
Requirement already satisfied: setuptools>=40.3.0 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.2->powerfulseal==3.2.0->-r requirements.txt (line 4)) (50.3.2)
Requirement already satisfied: rsa<5,>=3.1.4; python_version >= "3.5" in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.2->powerfulseal==3.2.0->-r requirements.txt (line 4)) (4.6)
Requirement already satisfied: cachetools<5.0,>=2.0.0 in /usr/local/lib/python3.6/site-packages (from google-auth>=1.6.2->powerfulseal==3.2.0->-r requirements.txt (line 4)) (4.1.1)
Requirement already satisfied: requests-oauthlib in /usr/local/lib/python3.6/site-packages (from kubernetes==12.0.0a1->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.3.0)
Requirement already satisfied: websocket-client!=0.40.0,!=0.41.,!=0.42.,>=0.32.0 in /usr/local/lib/python3.6/site-packages (from kubernetes==12.0.0a1->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.57.0)
Requirement already satisfied: keystoneauth1>=3.18.0 in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (4.2.1)
Requirement already satisfied: pbr!=2.1.0,>=2.0.0 in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (5.5.1)
Requirement already satisfied: munch>=2.1.0 in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (2.5.0)
Requirement already satisfied: importlib-metadata>=1.7.0; python_version < "3.8" in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (2.0.0)
Requirement already satisfied: dogpile.cache>=0.6.5 in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.0.2)
Requirement already satisfied: netifaces>=0.10.4 in /usr/local/lib64/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.10.9)
Requirement already satisfied: jsonpatch!=1.20,>=1.16 in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.26)
Requirement already satisfied: appdirs>=1.3.0 in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.4.4)
Requirement already satisfied: iso8601>=0.1.11 in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.1.13)
Requirement already satisfied: requestsexceptions>=1.2.0 in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.4.0)
Requirement already satisfied: cryptography>=2.7 in /usr/local/lib64/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (3.2.1)
Requirement already satisfied: os-service-types>=1.7.0 in /usr/local/lib/python3.6/site-packages (from openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.7.0)
Requirement already satisfied: pyasn1>=0.1.7 in /usr/local/lib/python3.6/site-packages (from oauth2client>=4.1.3->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.4.8)
Requirement already satisfied: attrs>=17.4.0 in /usr/local/lib/python3.6/site-packages (from jsonschema<4,>=3.0.2->powerfulseal==3.2.0->-r requirements.txt (line 4)) (20.3.0)
Requirement already satisfied: pyrsistent>=0.14.0 in /usr/local/lib64/python3.6/site-packages (from jsonschema<4,>=3.0.2->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.17.3)
Requirement already satisfied: Jinja2>=2.10.1 in /usr/local/lib/python3.6/site-packages (from Flask<2,>=1.0.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (2.11.2)
Requirement already satisfied: Werkzeug>=0.15 in /usr/local/lib/python3.6/site-packages (from Flask<2,>=1.0.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.0.1)
Requirement already satisfied: itsdangerous>=0.24 in /usr/local/lib/python3.6/site-packages (from Flask<2,>=1.0.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.1.0)
Requirement already satisfied: click>=5.1 in /usr/local/lib/python3.6/site-packages (from Flask<2,>=1.0.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (7.1.2)
Requirement already satisfied: pynacl>=1.0.1 in /usr/local/lib64/python3.6/site-packages (from paramiko<3,>=2.5.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.4.0)
Requirement already satisfied: bcrypt>=3.1.3 in /usr/local/lib64/python3.6/site-packages (from paramiko<3,>=2.5.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (3.2.0)
Requirement already satisfied: isodate>=0.6.0 in /usr/local/lib/python3.6/site-packages (from msrest>=0.5.0->azure-mgmt-resource<3.0.0,>=2.2.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (0.6.0)
Requirement already satisfied: adal<2.0.0,>=0.6.0 in /usr/local/lib/python3.6/site-packages (from msrestazure<2.0.0,>=0.4.32->azure-mgmt-resource<3.0.0,>=2.2.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.2.5)
Requirement already satisfied: protobuf>=3.12.0 in /usr/local/lib64/python3.6/site-packages (from google-api-core<2dev,>=1.21.0->google-api-python-client>=1.7.8->powerfulseal==3.2.0->-r requirements.txt (line 4)) (3.13.0)
Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /usr/local/lib/python3.6/site-packages (from google-api-core<2dev,>=1.21.0->google-api-python-client>=1.7.8->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.52.0)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.6/site-packages (from requests-oauthlib->kubernetes==12.0.0a1->powerfulseal==3.2.0->-r requirements.txt (line 4)) (3.1.0)
Requirement already satisfied: stevedore>=1.20.0 in /usr/local/lib/python3.6/site-packages (from keystoneauth1>=3.18.0->openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (3.2.2)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.6/site-packages (from importlib-metadata>=1.7.0; python_version < "3.8"->openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (3.4.0)
Requirement already satisfied: jsonpointer>=1.9 in /usr/local/lib/python3.6/site-packages (from jsonpatch!=1.20,>=1.16->openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (2.0)
Requirement already satisfied: cffi!=1.11.3,>=1.8 in /usr/local/lib64/python3.6/site-packages (from cryptography>=2.7->openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.14.3)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib64/python3.6/site-packages (from Jinja2>=2.10.1->Flask<2,>=1.0.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.1.1)
Requirement already satisfied: PyJWT>=1.0.0 in /usr/local/lib/python3.6/site-packages (from adal<2.0.0,>=0.6.0->msrestazure<2.0.0,>=0.4.32->azure-mgmt-resource<3.0.0,>=2.2.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (1.7.1)
Requirement already satisfied: pycparser in /usr/local/lib/python3.6/site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.7->openstacksdk<1,>=0.10.0->powerfulseal==3.2.0->-r requirements.txt (line 4)) (2.20)
Building wheels for collected packages: powerfulseal
Building wheel for powerfulseal (PEP 517) ... done
Created wheel for powerfulseal: filename=powerfulseal-3.2.0-py3-none-any.whl size=156211 sha256=adf9ad79cc8f30cb888ec0fc63bb69bdbbd9bc964980469319342c94c303c0bc
Stored in directory: /tmp/pip-ephem-wheel-cache-rk3agc6k/wheels/2d/cf/51/8de7d61c30ff637fca2222e4ba67d69cc9f1f8c6f419a0ee0d
Successfully built powerfulseal
[root@ip-00-000-00-00 ~]# vi myfile
[root@ip-00-000-00-00 ~]# python
Python 3.6.8 (default, Aug 13 2020, 07:46:32)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import yaml
>>> f = open('myfile')
>>> y = yaml.load(f)
__main__:1: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
>>> print(y)
this is satish
[root@ip-00-000-00-00 kraken]# python3 run_kraken.py --config=config/config.yaml
Traceback (most recent call last):
File "run_kraken.py", line 7, in <module>
import yaml
ImportError: No module named 'yaml'
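The root cause above is an interpreter mismatch: python3 resolves to 3.4.3 while pip3 installs into the Python 3.6 site-packages, so yaml is invisible to the interpreter actually running run_kraken.py. A quick, environment-independent way to see what a given interpreter can find:

```python
import sys

def interpreter_summary():
    """Report which interpreter is running and where it searches for installed packages."""
    return {
        "executable": sys.executable,
        "version": ".".join(map(str, sys.version_info[:3])),
        "site_packages": [p for p in sys.path if "site-packages" in p],
    }

print(interpreter_summary())
```

Running this with `python3` versus `python` would show the two different versions and search paths side by side.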
https://github.com/cloud-bulldozer/kraken/blob/master/run_kraken.py has functions that can be moved into modules to make it modular and reusable, and to make it easy for folks to add the changes needed in the main function file to orchestrate new modules/code additions.
Kraken currently has the ability to check if the targeted component recovered from the failures injected in addition to checking the health of the cluster as a whole to make sure it didn't get impacted because of the chaos. Kraken needs to take into account the performance and scalability of the component under test as well to expose issues where the component/cluster doesn't perform well after the recovery.
It supports performance monitoring which helps with tracking the performance and scale metrics: https://github.com/cloud-bulldozer/kraken#performance-monitoring but we need a way for it to analyze things without having to manually check on the cluster.
The proposal is to give Kraken the ability to accept a profile consisting of metrics to scrape from Prometheus, evaluate them against a gold set of values found from the OpenShift/Kubernetes performance and scale runs, and pass/fail the run based on that. This is inspired by kube-burner, which we use heavily for performance/scale testing OpenShift: https://github.com/cloud-bulldozer/kube-burner/blob/master/docs/alerting.md.
More on the importance of doing this can be found in the blog we published recently: https://www.openshift.com/blog/making-chaos-part-of-kubernetes/openshift-performance-and-scalability-tests.
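A sketch of what such a profile could look like, loosely modeled on kube-burner's alerting format (the queries and thresholds are illustrative, not agreed gold values):

```yaml
# Hypothetical metrics profile evaluated against Prometheus after the chaos run
metrics:
  - query: histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket[5m]))
    threshold: 1          # seconds; gold value from perf/scale runs
    severity: error       # exceeding this fails the run
  - query: sum(kube_pod_status_phase{phase="Pending"})
    threshold: 10
    severity: warning     # exceeding this only warns
```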
The CNV team looked into it and found that the Litmus framework (https://github.com/litmuschaos/litmus) has interesting chaos scenarios that Kraken is missing today. We need to investigate whether it makes sense to consume it under the hood to improve coverage.
On the containers page, the link under "Run containerized version" just goes to the Kraken home page.
It is supposed to go to this page, correct?
Right now, we inject failures and check for cluster health and recovery on an idle cluster. However, in production that is hardly the case, as the cluster will have some workloads actively running. To simulate production use cases as closely as possible, it might make sense to run load on the cluster before triggering faults. We could use something like kube-burner (https://github.com/cloud-bulldozer/kube-burner) to load the cluster. The idea would be to first start the kube-burner run and then inject faults during the run.
Kraken needs to support injecting a scenario which skews the system time on the nodes/pods, checks that it gets corrected by NTP, and makes sure there's no effect on the running services.
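A hypothetical scenario definition for this, following the style of the existing node scenario configs (the keys are illustrative, not an implemented schema):

```yaml
# Hypothetical time-skew scenario config
time_scenarios:
  - action: skew_time
    object_type: node              # node or pod
    label_selector: node-role.kubernetes.io/worker
    skew: "+400h"                  # offset applied before checking NTP correction
    timeout: 120                   # duration to wait for NTP to resync
```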
Further improve the ease of use of containerized Kraken by providing an image on quay.io and creating a Kubernetes or OpenShift deployment to run it. Config can be provided via a ConfigMap.
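Sketching the deployment half of this (the image path and mount point are assumptions):

```yaml
# Hypothetical Deployment mounting the Kraken config from a ConfigMap
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kraken
spec:
  replicas: 1
  selector:
    matchLabels: {app: kraken}
  template:
    metadata:
      labels: {app: kraken}
    spec:
      containers:
        - name: kraken
          image: quay.io/openshift-scale/kraken:latest   # image path assumed
          volumeMounts:
            - name: kraken-config
              mountPath: /root/kraken/config             # mount point assumed
      volumes:
        - name: kraken-config
          configMap:
            name: kraken-config
```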
I see that you have the following command to stop kubelets in an oc environment:
oc debug node/" + node + " -- chroot /host systemctl stop kubelet
What would be the kubectl equivalent of that?
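For context, the quoted fragment is presumably concatenated into a command string in Python, along the lines of this sketch (the helper name is illustrative, not Kraken's actual code):

```python
def kubelet_command(node, action="stop"):
    """Build the `oc debug` command string used to stop/start the kubelet on a node."""
    return "oc debug node/" + node + " -- chroot /host systemctl " + action + " kubelet"

print(kubelet_command("worker-0"))
```

A kubectl analogue isn't one-to-one since `oc debug` is OpenShift-specific; newer kubectl versions offer `kubectl debug node/<name>`, which is broadly similar, though availability depends on the cluster and client version.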
Linters need to be added, as they improve code quality. Travis CI needs to be enabled to run linters on every pull request to make sure it follows best practices and doesn't break the tool.
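Since the repo already carries a tox.ini, a minimal CI hook could look like this (whether tox runs the linters depends on the envs defined there):

```yaml
# Hypothetical .travis.yml running the tox envs on every pull request
language: python
python:
  - "3.6"
install:
  - pip install tox
script:
  - tox
```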
#80 references this issue, so that node options become available in IBM Cloud. Really interesting tool, thanks for all your efforts!
When trying to run the latest Kraken version (both using Docker and deploying on OpenShift) I get the same error:
2020-12-01 09:46:25,027 [INFO] Starting kraken
2020-12-01 09:46:25,032 [INFO] Initializing client to talk to the Kubernetes cluster
_ _
| | ___ __ __ _| | _____ _ __
| |/ / '__/ _` | |/ / _ \ '_ \
| <| | | (_| | < __/ | | |
|_|\_\_| \__,_|_|\_\___|_| |_|
Traceback (most recent call last):
File "run_kraken.py", line 316, in <module>
main(options.cfg)
File "run_kraken.py", line 238, in main
kubecli.find_kraken_node()
File "/root/kraken/kraken/kubernetes/client.py", line 180, in find_kraken_node
pod_json = json.loads(pod_json_str)
File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Here are some details on my environment:
$ docker version
Client: Docker Engine - Community
Version: 19.03.13
API version: 1.40
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:02:55 2020
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.13
API version: 1.40 (minimum version 1.12)
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:01:25 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.3.7
GitCommit: 8fba4e9a7d01810a393d5d25a3621dc101981175
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683
$ oc version
Client Version: 4.4.6
Server Version: 4.5.19
Kubernetes Version: v1.18.3+10e5708
My kubeconfig appears to be fine:
$ oc get pods
NAME READY STATUS RESTARTS AGE
kraken-deployment-d857c4d4f-btxx8 0/1 CrashLoopBackOff 15 61m
Here's how I tried to run it:
$ docker run --name=kraken --net=host -v $HOME/.kube/config:/root/.kube/config -v $PWD/chaos/config.yaml:/root/kraken/config/config.yaml quay.io/openshift-scale/kraken:latest
I've set up a standalone cluster for Kraken and followed the steps in the README to get started. To isolate the problem, I changed the Kraken config file to include only one scenario, openshift-apiserver. This is what the kraken deployment pod logs after starting the deployment:
2021-04-12 14:16:22,450 [INFO] Starting kraken
2021-04-12 14:16:22,455 [INFO] Initializing client to talk to the Kubernetes cluster
/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py:1020: InsecureRequestWarning: Unverified HTTPS request is being made to host 'api.kraken-test-04.cp.fyre.ibm.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning,
2021-04-12 14:16:23,455 [INFO] Fetching cluster info
2021-04-12 14:16:47,718 [INFO]
I0412 14:16:24.777729 13 request.go:668] Waited for 1.168597466s due to client-side throttling, not priority and fairness, request: GET:https://api.kraken-test-04.cp.fyre.ibm.com:6443/apis/policy/v1beta1?timeout=32s
I0412 14:16:34.778149 13 request.go:668] Waited for 11.168923785s due to client-side throttling, not priority and fairness, request: GET:https://api.kraken-test-04.cp.fyre.ibm.com:6443/apis/coordination.k8s.io/v1beta1?timeout=32s
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.16 True False 21h Cluster version is 4.6.16
I0412 14:16:36.903365 31 request.go:668] Waited for 1.155303512s due to client-side throttling, not priority and fairness, request: GET:https://api.kraken-test-04.cp.fyre.ibm.com:6443/apis/networking.k8s.io/v1?timeout=32s
I0412 14:16:47.103691 31 request.go:668] Waited for 11.353934532s due to client-side throttling, not priority and fairness, request: GET:https://api.kraken-test-04.cp.fyre.ibm.com:6443/apis/admissionregistration.k8s.io/v1beta1?timeout=32s
Kubernetes control plane is running at https://api.kraken-test-04.cp.fyre.ibm.com:6443
2021-04-12 14:16:47,718 [INFO] Daemon mode enabled, kraken will cause chaos forever
2021-04-12 14:16:47,718 [INFO] Ignoring the iterations set
2021-04-12 14:16:47,718 [INFO] Executing scenarios for iteration 0
2021-04-12 14:16:48,211 [INFO] Scenario: scenarios/openshift-apiserver.yml has been successfully injected!
2021-04-12 14:16:48,212 [INFO] Waiting for the specified duration: 60
My expectation based on the message is that I would be able to see in real time a pod in the openshift-apiserver namespace be taken down, and a new one initialized. However, I don't see anything happening to the pods. This was with the default openshift-apiserver.yml scenario.
I've also tried changing the selector in the scenario to app=openshift-apiserver-a to match our targeted pod's label in its YAML, as well as getting rid of the selector line altogether.
Are there additional steps that need to be done to get this scenario to work?
I run it on my local OpenShift cluster, which cannot connect to the internet. The error is:
[root@helper ~]# podman logs -f kraken
2021-03-31 06:33:30,085 [INFO] Starting kraken
2021-03-31 06:33:30,091 [INFO] Initializing client to talk to the Kubernetes cluster
2021-03-31 06:33:31,060 [INFO] Fetching cluster info
2021-03-31 06:33:31,977 [INFO]
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.0 True False 6d17h Cluster version is 4.7.0
Kubernetes control plane is running at https://api.ocp4.example.com:6443
2021-03-31 06:33:31,978 [INFO] Daemon mode not enabled, will run through 1 iterations
2021-03-31 06:33:31,978 [INFO] Executing scenarios for iteration 0
2021-03-31 06:33:32,481 [INFO] Scenario: scenarios/etcd.yml has been successfully injected!
2021-03-31 06:33:32,481 [INFO] Waiting for the specified duration: 60
2021-03-31 06:34:32,965 [INFO] Scenario: scenarios/regex_openshift_pod_kill.yml has been successfully injected!
2021-03-31 06:34:32,965 [INFO] Waiting for the specified duration: 60
_ _
| | ___ __ __ _| | _____ _ __
| |/ / '__/ _` | |/ / _ \ '_ \
| <| | | (_| | < __/ | | |
|_|\_\_| \__,_|_|\_\___|_| |_|
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 570, in _build_master
ws.require(__requires__)
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 888, in require
needed = self.resolve(parse_requirements(requirements))
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 779, in resolve
raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (setuptools 39.2.0 (/usr/lib/python3.6/site-packages), Requirement.parse('setuptools>=40.3.0'), {'google-auth'})
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/powerfulseal", line 6, in <module>
from pkg_resources import load_entry_point
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3095, in <module>
@_call_aside
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3079, in _call_aside
f(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3108, in _initialize_master_working_set
working_set = WorkingSet._build_master()
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 572, in _build_master
return cls._build_from_requirements(__requires__)
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 585, in _build_from_requirements
dists = ws.resolve(reqs, Environment())
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 774, in resolve
raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'setuptools>=40.3.0' distribution was not found and is required by google-auth
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 570, in _build_master
ws.require(__requires__)
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 888, in require
needed = self.resolve(parse_requirements(requirements))
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 779, in resolve
raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (setuptools 39.2.0 (/usr/lib/python3.6/site-packages), Requirement.parse('setuptools>=40.3.0'), {'google-auth'})
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/powerfulseal", line 6, in <module>
from pkg_resources import load_entry_point
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3095, in <module>
@_call_aside
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3079, in _call_aside
f(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3108, in _initialize_master_working_set
working_set = WorkingSet._build_master()
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 572, in _build_master
return cls._build_from_requirements(__requires__)
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 585, in _build_from_requirements
dists = ws.resolve(reqs, Environment())
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 774, in resolve
raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'setuptools>=40.3.0' distribution was not found and is required by google-auth
Traceback (most recent call last):
File "run_kraken.py", line 403, in <module>
main(options.cfg)
File "run_kraken.py", line 351, in main
node_scenarios(scenarios_list, config)
File "run_kraken.py", line 212, in node_scenarios
node_scenario_object = get_node_scenario_object(node_scenario)
File "run_kraken.py", line 32, in get_node_scenario_object
return aws_node_scenarios()
File "/root/kraken/kraken/node_actions/aws_node_scenarios.py", line 66, in __init__
self.aws = AWS()
File "/root/kraken/kraken/node_actions/aws_node_scenarios.py", line 12, in __init__
self.boto_client = boto3.client('ec2')
File "/usr/local/lib/python3.6/site-packages/boto3/__init__.py", line 93, in client
return _get_default_session().client(*args, **kwargs)
return _get_default_session().client(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/boto3/session.py", line 263, in client
aws_session_token=aws_session_token, config=config)
File "/usr/local/lib/python3.6/site-packages/botocore/session.py", line 838, in create_client
client_config=config, api_version=api_version)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 87, in create_client
verify, credentials, scoped_config, client_config, endpoint_bridge)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 328, in _get_client_args
verify, credentials, scoped_config, client_config, endpoint_bridge)
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 73, in get_client_args
endpoint_url, is_secure, scoped_config)
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 154, in compute_client_args
s3_config=s3_config,
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 220, in _compute_endpoint_config
return self._resolve_endpoint(**resolve_endpoint_kwargs)
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 303, in _resolve_endpoint
service_name, region_name, endpoint_url, is_secure)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 402, in resolve
service_name, region_name)
File "/usr/local/lib/python3.6/site-packages/botocore/regions.py", line 134, in construct_endpoint
partition, service_name, region_name)
File "/usr/local/lib/python3.6/site-packages/botocore/regions.py", line 148, in _endpoint_for_partition
raise NoRegionError()
botocore.exceptions.NoRegionError: You must specify a region.
^C
[root@helper ~]# cat config.yaml
kraken:
    kubeconfig_path: /data/install/auth/kubeconfig  # Path to kubeconfig
    exit_on_failure: False  # Exit when a post action scenario fails
    chaos_scenarios:  # List of policies/chaos scenarios to load
        - pod_scenarios:  # List of chaos pod scenarios to load
            - - scenarios/etcd.yml
            - - scenarios/regex_openshift_pod_kill.yml
              - scenarios/post_action_regex.py
        - node_scenarios:  # List of chaos node scenarios to load
            - scenarios/node_scenarios_example.yml
        - pod_scenarios:
            - - scenarios/openshift-apiserver.yml
            - - scenarios/openshift-kube-apiserver.yml
        - time_scenarios:  # List of chaos time scenarios to load
            - scenarios/time_scenarios_example.yml

cerberus:
    cerberus_enabled: False  # Enable it when cerberus is previously installed
    cerberus_url:  # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal

performance_monitoring:
    deploy_dashboards: False  # Install a mutable grafana and load the performance dashboards. Enable this only when running on OpenShift
    repo: "https://github.com/cloud-bulldozer/performance-dashboards.git"

tunings:
    wait_duration: 60  # Duration to wait between each chaos scenario
    iterations: 1  # Number of times to execute the scenarios
    daemon_mode: False  # Iterations are set to infinity which means that the kraken will cause chaos forever
Do you have any suggestions?
Hey all,
We've been doing some chaos testing using kraken and we've noticed some signalling issues from cerberus when running litmus scenarios. In our case we are running a pod network loss experiment that reports a successful result, but cerberus returns a no-go signal immediately afterwards because a liveness check failure restarted the pod during the experiment.
After looking at the code it seems the ordering for checking the signal is wrong in the litmus_scenarios function:
cerberus_integration(config)
logging.info("Waiting for the specified duration: %s" % wait_duration)
time.sleep(wait_duration)
In the other functions cerberus is checked after the wait duration:
logging.info("Waiting for the specified duration: %s" % (wait_duration))
time.sleep(wait_duration)
cerberus_integration(config)
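The fix is just that reordering; for illustration, the corrected tail can be sketched as a small standalone function (the names here are stand-ins for kraken's actual helpers):

```python
import logging
import time

def scenario_tail(config, wait_duration, cerberus_integration):
    # Corrected ordering: let the cluster settle for the wait duration
    # first, then ask cerberus for its go/no-go signal.
    logging.info("Waiting for the specified duration: %s", wait_duration)
    time.sleep(wait_duration)
    cerberus_integration(config)

# With a recording stub, the cerberus check runs only after the sleep:
calls = []
scenario_tail({"cerberus": {}}, 0, lambda cfg: calls.append("checked"))
```

This way a pod restarted mid-experiment has the wait duration to become healthy again before the signal is sampled.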
When I try to run the command on a regular host I see the issue:
[root@ip-10-201-36-93 cerberus]# python3 start_cerberus.py --config=config/config.yaml
Traceback (most recent call last):
File "start_cerberus.py", line 5, in
import yaml
ImportError: No module named 'yaml'
[root@ip-10-201-36-93 cerberus]#
The scenario is to run Kraken tests on a host with Power (ppc64le) architecture. This would be beneficial for testing clusters in environments that have only Power and no VMs with other architectures.
I've run into many pip and yum package dependency issues while installing Kraken directly on a RHEL host running on Power. The Dockerfile currently available only builds for the Intel architecture.
To solve this, I have created a separate Dockerfile for ppc64le so that the tool can be run without any dependency problems.
It's important to check how Kubernetes/OpenShift deals with node failures; a failure can be a node termination or a stop/shutdown. Nodes can be terminated by hitting the API of the cloud platform on top of which OpenShift has been deployed (AWS, GCP, Azure, etc.).
Kraken can accept a config from the user with details including the role/selector of the node to kill (Master/node-role.kubernetes.io/master, Worker/node-role.kubernetes.io/worker, Infra/node-role.kubernetes.io/infra), the number of nodes to kill, and the wait between each kill, and act accordingly.
AFAIU kraken currently kills pods by deleting them. However, this is a rather graceful termination for the process running in the pod.
It would reflect real-life scenarios much more closely if the processes in the pod's containers were killed with kill -9.
This would simulate applications terminating arbitrarily, and then we could see how the system reacts to it.
We do have issues filed around the features and enhancements that are open or being worked on, but it might be useful to have a roadmap in the README with pointers to the major ones, like Litmus integration (#31), to help users understand where we need help as well as where the project is going.
The tool injects failures but doesn't check whether the targeted components recovered from them; it's important to add such checks to determine the pass/fail criteria for a run.
How do we make sure that the targeted components recovered from the chaos scenarios without taking a performance hit? The on-cluster-latest.json and etcd-on-cluster-dashboard.json dashboards can be loaded in Dittybopper to analyze most of the metrics needed to evaluate the performance hit on the components. We could create a new dashboard for the additional metrics that are not collected in the above dashboards. We can recommend that users install Dittybopper, as this would enable them to look into all the metrics and analyze the performance of the targeted components.
Investigate https://github.com/JohnStrunk/ocs-monkey and see if we can leverage it for storage testing. Had a discussion with Badre Tejado-Imam about this tool and was told by the developers that it can be extended beyond storage testing.
I have some general questions.
I want to run Kraken in one OpenShift cluster and point it at another cluster for the chaos testing.
- pod_scenarios:  # List of chaos pod scenarios to load
    - - scenarios/etcd.yml
    - - scenarios/regex_openshift_pod_kill.yml
      - scenarios/post_action_regex.py
- node_scenarios:  # List of chaos node scenarios to load
    - scenarios/node_scenarios_example.yml
- pod_scenarios:
    - - scenarios/openshift-apiserver.yml
    - - scenarios/openshift-kube-apiserver.yml
Thanks.
Currently it looks like we support AWS and GCP for node scenarios like node power off, power on, and reboot.
If we have an ipmitool-based library for the node scenarios, we can operate against baremetal OpenShift.
Some raw calls for power on/off:
ipmitool -I lanplus -H <host> -U <user> -P <password> chassis power off
ipmitool -I lanplus -H <host> -U <user> -P <password> chassis power on
Keep in mind, if a node is powered off already, the power off command will return an error, so you need to check the power status first.
https://pypi.org/project/python-ipmi/ is a library on which you can base your module.
Also, the mapping between node name and baremetal host FQDN is not straightforward on baremetal installs.
[kni@e16-h18-b03-fc640 ~]$ oc project openshift-machine-api
Now using project "openshift-machine-api" on server "https://api.test155.myocp4.com:6443".
[kni@e16-h18-b03-fc640 ~]$ oc get bmh
NAME STATUS PROVISIONING STATUS CONSUMER BMC HARDWARE PROFILE ONLINE ERROR
master-0 OK externally provisioned test155-77wr8-master-0 ipmi://mgmt-e16-h18-b04-fc640.rdu2.scalelab.redhat.com:623 true
master-1 OK externally provisioned test155-77wr8-master-1 ipmi://mgmt-e16-h20-b02-fc640.rdu2.scalelab.redhat.com:623 true
master-2 OK externally provisioned test155-77wr8-master-2 ipmi://mgmt-e16-h20-b04-fc640.rdu2.scalelab.redhat.com:623 true
storage001-r640 OK provisioned test155-77wr8-worker-0-p62h6 ipmi://mgmt-f04-h10-000-r640.rdu2.scalelab.redhat.com unknown true
storage002-r640 OK provisioned test155-77wr8-worker-0-4wwqz ipmi://mgmt-f04-h11-000-r640.rdu2.scalelab.redhat.com unknown true
storage003-r640 OK provisioned test155-77wr8-worker-0-rb6dt ipmi://mgmt-f04-h12-000-r640.rdu2.scalelab.redhat.com unknown true
storage004-r640 OK provisioned test155-77wr8-worker-0-7frns ipmi://mgmt-f04-h13-000-r640.rdu2.scalelab.redhat.com unknown true
storage005-r640 OK provisioned test155-77wr8-worker-0-bnnls ipmi://mgmt-f04-h33-000-r640.rdu2.scalelab.redhat.com unknown true
worker-0 OK provisioned test155-77wr8-worker-0-jjrq6 ipmi://mgmt-e16-h24-b01-fc640.rdu2.scalelab.redhat.com:623 unknown true
worker-1 OK provisioned test155-77wr8-worker-0-jbnsq ipmi://mgmt-e16-h24-b02-fc640.rdu2.scalelab.redhat.com:623 unknown true
is how you get the mapping between node name and management IP FQDN.
At the minimum we should support
node_start_scenario
node_stop_scenario
node_stop_start_scenario
node_reboot_scenario
in https://github.com/openshift-scale/kraken/blob/master/docs/node_scenarios.md
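The status-check-before-power-off logic above could be sketched as follows. This is a hypothetical wrapper around the raw ipmitool calls shown earlier, not kraken's actual code; the runner is injected so the logic can be exercised without real hardware:

```python
import subprocess

def ipmitool_runner(host, user, password):
    # Build a runner that shells out to ipmitool over lanplus
    # (assumes ipmitool is installed and the BMC is reachable).
    def run(*args):
        cmd = ["ipmitool", "-I", "lanplus", "-H", host,
               "-U", user, "-P", password, *args]
        return subprocess.run(cmd, capture_output=True, text=True).stdout
    return run

def power_off_if_on(run):
    # ipmitool reports e.g. "Chassis Power is on" / "Chassis Power is off";
    # only issue the power off when the chassis is actually on, since
    # powering off an already-off node returns an error.
    status = run("chassis", "power", "status")
    if "is on" in status:
        run("chassis", "power", "off")
        return True
    return False

# Exercised with a fake runner standing in for a powered-off node:
assert power_off_if_on(lambda *a: "Chassis Power is off") is False
```

Against a real node you would pass `power_off_if_on(ipmitool_runner(host, user, password))` with the BMC FQDN obtained from `oc get bmh` as shown above.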
Kraken currently supports deploying mutable Grafana with pre-loaded dashboards to help monitor the cluster state as well as various performance metrics, but the user needs to know the start and end time of a run to narrow down the metrics. This is difficult especially in CI/automation runs, where no one is monitoring the run and the cluster will be terminated afterwards: https://github.com/cloud-bulldozer/kraken#performance-monitoring.
To solve this, the proposal is to add the ability for Kraken to accept a profile consisting of metrics to scrape from Prometheus for the test-run duration and store them in Elasticsearch; this way we can use Grafana to visualize them. Each test run can have a uuid associated with it to differentiate runs.
Kube-burner indexing feature can be leveraged to achieve this: https://kube-burner.readthedocs.io/en/latest/indexers/.
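As a rough sketch of the scraping side: the endpoint path is Prometheus' standard query_range API, but the metric name, host, and uuid handling here are only illustrative assumptions about what such a profile might drive:

```python
import urllib.parse
import uuid

def range_query_url(prom_url, query, start, end, step="30s"):
    # Build a /api/v1/query_range URL covering the chaos-run window
    # (start/end are unix timestamps captured around the test run).
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step})
    return "%s/api/v1/query_range?%s" % (prom_url, params)

run_uuid = str(uuid.uuid4())  # tag every indexed document with the run id
url = range_query_url("https://prometheus.example:9091",
                      "etcd_disk_wal_fsync_duration_seconds_bucket",
                      1618000000, 1618000600)
```

Each query result would then be indexed into Elasticsearch alongside `run_uuid`, so Grafana can filter a dashboard down to a single run.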
Kraken needs to support scenarios that introduce communication delays/network latency to simulate degradation or outages in a network and verify that the system, especially critical components like etcd, handles it well in terms of leader election and receiving heartbeats from the members.
We also need to support injecting failures such as dropping packets, blocking DNS, etc.
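One common way to implement such latency injection on a node is `tc netem`. A sketch that only builds the commands is below; the interface name and values are examples, and this is not kraken's actual implementation:

```python
def netem_commands(interface, delay_ms, jitter_ms=0, loss_pct=0.0):
    # Build the tc commands to add, and later remove, a netem qdisc that
    # injects latency (and optionally packet loss) on the given interface.
    add = ["tc", "qdisc", "add", "dev", interface, "root", "netem",
           "delay", "%dms" % delay_ms]
    if jitter_ms:
        add += ["%dms" % jitter_ms]
    if loss_pct:
        add += ["loss", "%.1f%%" % loss_pct]
    cleanup = ["tc", "qdisc", "del", "dev", interface, "root", "netem"]
    return add, cleanup

add_cmd, del_cmd = netem_commands("eth0", delay_ms=100, jitter_ms=10)
```

Running `add_cmd` on a node (with root privileges), waiting out the scenario duration, and then running `del_cmd` would give the inject/recover cycle the scenario needs; the etcd leader-election behavior could then be observed during the window.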
The bastion or helper node is useful for deploying OpenShift clusters and hosts a number of important services such as dhcpd, haproxy, etc. Running node-level tests on this node would be useful for checking its stability and that of the services running on it.
I attempted to run the base AWS node scenario provided, and I received an error message in the OCP Console pod log for the Kraken deployment. It says that I need to specify a region, but there isn't a place or option to specify a region in the config.yaml or node_scenarios_example.yml files.
Here is the output from the OCP Console:
Traceback (most recent call last):
File "run_kraken.py", line 428, in <module>
main(options.cfg)
File "run_kraken.py", line 376, in main
node_scenarios(scenarios_list, config)
File "run_kraken.py", line 231, in node_scenarios
node_scenario_object = get_node_scenario_object(node_scenario)
File "run_kraken.py", line 33, in get_node_scenario_object
return aws_node_scenarios()
File "/root/kraken/kraken/node_actions/aws_node_scenarios.py", line 66, in __init__
self.aws = AWS()
File "/root/kraken/kraken/node_actions/aws_node_scenarios.py", line 12, in __init__
self.boto_client = boto3.client('ec2')
File "/usr/local/lib/python3.6/site-packages/boto3/__init__.py", line 93, in client
return _get_default_session().client(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/boto3/session.py", line 263, in client
aws_session_token=aws_session_token, config=config)
File "/usr/local/lib/python3.6/site-packages/botocore/session.py", line 851, in create_client
client_config=config, api_version=api_version)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 87, in create_client
verify, credentials, scoped_config, client_config, endpoint_bridge)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 328, in _get_client_args
verify, credentials, scoped_config, client_config, endpoint_bridge)
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 73, in get_client_args
endpoint_url, is_secure, scoped_config)
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 154, in compute_client_args
s3_config=s3_config,
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 220, in _compute_endpoint_config
return self._resolve_endpoint(**resolve_endpoint_kwargs)
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 303, in _resolve_endpoint
service_name, region_name, endpoint_url, is_secure)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 402, in resolve
service_name, region_name)
File "/usr/local/lib/python3.6/site-packages/botocore/regions.py", line 134, in construct_endpoint
partition, service_name, region_name)
File "/usr/local/lib/python3.6/site-packages/botocore/regions.py", line 148, in _endpoint_for_partition
raise NoRegionError()
botocore.exceptions.NoRegionError: You must specify a region.
Here is the kraken-config configmap:
Name: kraken-config
Namespace: kraken-aws
Labels: <none>
Annotations: <none>
Data
====
config.yaml:
----
kraken:
    kubeconfig_path: /root/.kube/config  # Path to kubeconfig
    exit_on_failure: False  # Exit when a post action scenario fails
    litmus_version: v1.10.0  # Litmus version to install
    litmus_uninstall: False  # If you want to uninstall litmus if failure
    chaos_scenarios:  # List of policies/chaos scenarios to load
        - node_scenarios:  # List of chaos node scenarios to load
            - scenarios/node_scenarios_example.yml

cerberus:
    cerberus_enabled: False  # Enable it when cerberus is previously installed
    cerberus_url:  # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal

performance_monitoring:
    deploy_dashboards: False  # Install a mutable grafana and load the performance dashboards. Enable this only when running on OpenShift
    repo: "https://github.com/cloud-bulldozer/performance-dashboards.git"

tunings:
    wait_duration: 60  # Duration to wait between each chaos scenario
    iterations: 1  # Number of times to execute the scenarios
    daemon_mode: False  # Iterations are set to infinity which means that the kraken will cause chaos forever
Events: <none>
I am using the default node_scenarios_example.yml file.
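boto3 resolves the region from the environment (AWS_DEFAULT_REGION) or from ~/.aws/config when none is passed explicitly, so one workaround is to make sure the variable is set in the kraken container before the EC2 client is created. The helper name and region value below are just examples:

```python
import os

def ensure_aws_region(default="us-west-2"):
    # Hypothetical helper: boto3 falls back to AWS_DEFAULT_REGION when no
    # region is configured elsewhere, so exporting one before boto3.client('ec2')
    # is called avoids the NoRegionError above. "us-west-2" is only an example.
    os.environ.setdefault("AWS_DEFAULT_REGION", default)
    return os.environ["AWS_DEFAULT_REGION"]

region = ensure_aws_region()
```

In a deployment, the equivalent is adding AWS_DEFAULT_REGION (plus credentials) to the pod's environment, e.g. via the configmap/secret that already carries the kraken config.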
Cerberus (https://github.com/openshift-scale/cerberus) needs to be run after each scenario specified in the config to monitor the cluster under test, and the aggregated go/no-go signal it generates needs to be consumed by Kraken to determine pass/fail, i.e. to make sure the whole Kubernetes/OpenShift cluster is healthy after the failure injection, not just the targeted component.
The container might want to:
- log to stdout/stderr
- use more than one process and more than one thread in a process

Users need an option to run kraken as a container; this eliminates the need to deal with installing dependencies and package conflicts.
While I was looking at trying to add code to powerfulseal for the regex namespace issue I opened (powerfulseal/powerfulseal#274) I noticed that the latest python pip package is very outdated compared to the code in github.
There are a couple of RC releases that you are able to install if given the version numbers. Seen here
Trying to test my code, I had to completely rewrite the pod kill scenario. As you can see in the documentation here, it no longer has the podScenario and nodeScenario sections; there is just one common scenarios section.
These changes are not too huge but will have to be done at some point in the future.
Examples:
2.9.0 scenario
3.0.* scenario
We will need to make updates to our scenarios when the pip package updates from the current 2.9.0 version.
Scenario: Node scenario using kraken tool
cloud_type is set to aws on a non-AWS cloud provider.
File: scenarios/node_scenarios_example.yml
node_scenarios:
    - actions:  # node chaos scenarios to be injected
        # - node_stop_start_scenario
        - stop_start_kubelet_scenario
        # - node_crash_scenario
      node_name: worker-0  # node on which scenario has to be injected
      label_selector: node-role.kubernetes.io/worker  # when node_name is not specified, a node with matching label_selector is selected for node chaos scenario injection
      instance_kill_count: 1  # number of times to inject each scenario under actions
      timeout: 120  # duration to wait for completion of node scenario injection
      cloud_type: aws  # cloud type on which Kubernetes/OpenShift runs
Issue seen:
[root@rails1 kraken]# python3 run_kraken.py --config config/config.yaml
_ _
| | ___ __ __ _| | _____ _ __
| |/ / '__/ _` | |/ / _ \ '_ \
| <| | | (_| | < __/ | | |
|_|\_\_| \__,_|_|\_\___|_| |_|
2020-11-09 06:50:42,571 [INFO] Starting kraken
2020-11-09 06:50:42,577 [INFO] Initializing client to talk to the Kubernetes cluster
2020-11-09 06:50:44,678 [INFO] Fetching cluster info
2020-11-09 06:50:46,239 [INFO]
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.1 True False 11d Cluster version is 4.6.1
Kubernetes master is running at https://api.chaos-arc46.redhat.com:6443
2020-11-09 06:50:46,240 [INFO] Daemon mode not enabled, will run through 1 iterations
2020-11-09 06:50:46,240 [INFO] Executing scenarios for iteration 0
Traceback (most recent call last):
File "run_kraken.py", line 298, in <module>
main(options.cfg)
File "run_kraken.py", line 240, in main
node_scenario_object = get_node_scenario_object(node_scenario)
File "run_kraken.py", line 21, in get_node_scenario_object
return aws_node_scenarios()
File "/root/chaos/kraken/kraken/node_actions/aws_node_scenarios.py", line 66, in __init__
self.aws = AWS()
File "/root/chaos/kraken/kraken/node_actions/aws_node_scenarios.py", line 12, in __init__
self.boto_client = boto3.client('ec2')
File "/usr/local/lib/python3.6/site-packages/boto3/__init__.py", line 91, in client
return _get_default_session().client(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/boto3/session.py", line 263, in client
aws_session_token=aws_session_token, config=config)
File "/usr/local/lib/python3.6/site-packages/botocore/session.py", line 838, in create_client
client_config=config, api_version=api_version)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 87, in create_client
verify, credentials, scoped_config, client_config, endpoint_bridge)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 328, in _get_client_args
verify, credentials, scoped_config, client_config, endpoint_bridge)
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 73, in get_client_args
endpoint_url, is_secure, scoped_config)
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 153, in compute_client_args
s3_config=s3_config,
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 218, in _compute_endpoint_config
return self._resolve_endpoint(**resolve_endpoint_kwargs)
File "/usr/local/lib/python3.6/site-packages/botocore/args.py", line 301, in _resolve_endpoint
service_name, region_name, endpoint_url, is_secure)
File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 402, in resolve
service_name, region_name)
File "/usr/local/lib/python3.6/site-packages/botocore/regions.py", line 134, in construct_endpoint
partition, service_name, region_name)
File "/usr/local/lib/python3.6/site-packages/botocore/regions.py", line 148, in _endpoint_for_partition
raise NoRegionError()
botocore.exceptions.NoRegionError: You must specify a region.
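The traceback shows get_node_scenario_object constructing aws_node_scenarios unconditionally, regardless of cloud_type. A dispatch sketch along these lines (the classes here are stand-ins for kraken's actual scenario classes) would select by cloud type and fail fast with a clearer message for unsupported clouds:

```python
class AwsNodeScenarios:
    """Stand-in for kraken's aws_node_scenarios class."""

class GcpNodeScenarios:
    """Stand-in for a GCP implementation."""

# Map the config's cloud_type value to an implementation.
SCENARIO_CLASSES = {"aws": AwsNodeScenarios, "gcp": GcpNodeScenarios}

def get_node_scenario_object(node_scenario):
    # Pick the implementation by cloud_type instead of assuming AWS.
    cloud_type = node_scenario.get("cloud_type", "aws")
    try:
        return SCENARIO_CLASSES[cloud_type]()
    except KeyError:
        raise NotImplementedError(
            "node scenarios are not supported on cloud_type %r yet" % cloud_type)
```

With this shape, a config that sets cloud_type: openstack would report the unsupported cloud up front instead of dying deep inside boto3.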
Addition of functional tests to ensure that PRs don't break the existing functionality of Kraken, similar to the functional tests in Cerberus. This eases code review and ensures that commits are non-breaking.
(PS: It would also enable contributors without a cluster to confidently open PRs for minor enhancements :D)