chaostoolkit-incubator / chaostoolkit-azure
Chaos Toolkit Extension for Azure
Home Page: https://chaostoolkit.org/
License: Apache License 2.0
The title speaks for itself. Extension is on 1.1.2 whereas ct-lib is 1.9.0
Delete actions are invasive: they can harm your infrastructure. We should reconsider whether the Azure extension should offer this type of action. At the moment the machine and aks modules offer delete actions. Maybe we should remove them; powering off instead of deleting should be sufficient here.
The Azure CLI provides full support to manage the Azure infrastructure. Only some advantages:
Switch from azure-mgmt-* modules to the Azure CLI.
The Chaostoolkit Azure makes use of the Azure CLI. The Azure CLI eases tasks like collection of Azure resources, doing actions on them and filtering for resources.
The Azure CLI requires a number of sub-dependencies, and those sub-dependencies need a specific combination of versions for the Azure CLI to run. In some cases the combination pinned in requirements.txt does not work, and Chaostoolkit Azure cannot be built.
We need to find another solution for this.
Instead of providing resource groups in the configuration file, the user
is now able to provide filter queries. The filter queries are passed
locally to the action. With this solution the user is able to use a
fine-granular filter.
Requirement: The resource-graph extension is a prerequisite. The
application will install the resource-graph extension if it is not
installed.
Introduce a new feature stressing the I/O operations per second of the hard drive for Linux. Windows is not supported yet.
The ct-azure extension hosts actions for i) the Azure Service Fabric and ii) the Azure infrastructure.
The Azure Service Fabric is an orchestration framework, comparable to K8s. Microsoft states that Service Fabric is portable, and can be deployed on for example AWS.
Some main points for the discussion of separating those two parts:
From the ct clients' point of view, it could also be beneficial to see that ct supports both Azure and Azure Service Fabric.
What do you think?
Currently, we expect this:
subscription_id = configuration['azure']['subscription_id']
in https://github.com/chaostoolkit-incubator/chaostoolkit-azure/blob/master/chaosazure/__init__.py#L122
with the config as follows:
"configuration": {
  "azure": {
    "subscription_id": "xxx"
  }
},
But that means the key cannot be interpolated from an environment variable. It should have been like this:
subscription_id = configuration['azure_subscription_id']
with the config as follows:
"configuration": {
  "azure_subscription_id": {
    "type" : "env",
    "key": "SUBSCRIPTION_ID"
  }
},
so that the value is read from the environment variable.
Configurations are flat by default, not nested as the README suggests.
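The lookup proposed above could be sketched as follows. This is only an illustration: the helper name get_subscription_id is hypothetical, the fallback to the nested layout is an assumption added for backward compatibility, and the sketch assumes the toolkit has already replaced the env declaration with the variable's value before the extension sees the configuration.

```python
from typing import Any, Mapping


def get_subscription_id(configuration: Mapping[str, Any]) -> str:
    """Return the subscription id from a resolved configuration.

    Hypothetical helper: prefers the flat key proposed in the issue and
    falls back to the nested layout currently shown in the README.
    """
    # Flat key, which can be declared as an env-backed value in the experiment
    if "azure_subscription_id" in configuration:
        return configuration["azure_subscription_id"]
    # Nested layout; raises KeyError if neither layout is present
    return configuration["azure"]["subscription_id"]


flat = {"azure_subscription_id": "sub-from-env"}
nested = {"azure": {"subscription_id": "sub-inline"}}
print(get_subscription_id(flat))
print(get_subscription_id(nested))
```

Keeping the nested fallback would let existing experiments continue to run while new ones move to the flat key.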
Until now, we do not have any actions for the managed Azure Kubernetes Services (AKS). Since the AKS is quite popular it would be nice to have some features for it. For example deleting, powering off, or restarting a node from the AKS cluster.
Chaos Toolkit Azure does not have any feature for virtual machine scale sets (vmss). The actions poweroff, restart, delete and deallocate enable a way to interact with the VMSS.
Implement a new feature for Linux devices that increases the I/O operations per second of the hard drive.
Provide actions to stop, restart and start Azure Web Apps.
We have two AUTH functions, one for the management of the service fabric and another one for the management of the Azure infrastructure. To avoid repetitions we move the two AUTH functions one module up.
For now the activities in the Azure extension are asynchronous and do not wait for the activity result. According to the Chaostoolkit specification they should be synchronous.
If a user still wants the activities to run asynchronously, they can do so by setting the background = true flag in the chaos experiment itself.
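As a rough illustration of the background flag, the sketch below builds an activity as a Python dictionary (the same structure an experiment file holds as JSON) and marks it to run in the background. The helper as_background is hypothetical; the module and function names are the ones used in the AKS experiment further down this page.

```python
import copy
from typing import Any, Dict


def as_background(activity: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the activity flagged to run in the background.

    Hypothetical helper; the original dictionary is left untouched.
    """
    result = copy.deepcopy(activity)
    result["background"] = True
    return result


stop_node = {
    "type": "action",
    "name": "stop-node",
    "provider": {
        "type": "python",
        "module": "chaosazure.aks.actions",
        "func": "stop_node",
        "secrets": ["azure"],
        "config": ["azure"],
    },
}

async_stop_node = as_background(stop_node)
print(async_stop_node["background"])
```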
Implement a new feature on Linux devices allowing the user to modify its network latency. Windows is not supported yet.
Instead of providing resource groups in the configuration file, the user shall be able to provide filter queries in probes.
The MS Azure REST API introduced a new version 2019-04-01. This version does not work with the chaostoolkit-azure.
FillDisk simulates full disk behaviour on Virtual Machines. Real case scenarios could be undeleted log files or archives. This function is inspired by the Simian Army feature also called FillDisk.
is not working, I am getting some error
[2019-11-07 13:36:26 DEBUG] [python:34] Activity 'restart-webb-app' loaded from 'c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaosazure\webapp\actions.py'
[2019-11-07 13:36:26 DEBUG] [actions:53] Start restart_webapp: configuration='{'azure': {'subscription_id': 'xxx'}}', filter='where resourceGroup == 'GF-RG-Base-d' and name = 'gf-web-dynamicscfs-d-azwe''
[2019-11-07 13:36:26 DEBUG] [activity:230] Activity failed
Traceback (most recent call last):
File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaoslib\provider\python.py", line 55, in run_python_activity
return func(**arguments)
File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaosazure\webapp\actions.py", line 57, in restart_webapp
choice = __fetch_webapp_at_random(filter, configuration, secrets)
File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaosazure\webapp\actions.py", line 133, in __fetch_webapp_at_random
webapps = fetch_webapps(filter, configuration, secrets)
File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaosazure\webapp\actions.py", line 118, in fetch_webapps
webapps = fetch_resources(filter, RES_TYPE_WEBAPP, secrets, configuration)
File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaosazure\rgraph\resource_graph.py", line 13, in fetch_resources
resources = client.resources(query)
File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\azure\mgmt\resourcegraph\operations\_resource_graph_client_operations.py", line 64, in resources
raise models.ErrorResponseException(self._deserialize, response)
azure.mgmt.resourcegraph.models._models_py3.ErrorResponseException: (BadRequest) Please provide below info when asking for support: timestamp = 2019-11-07T12:36:26.9590396Z, correlationId = 456f97ab-5b44-42e9-9b96-ad5b4ce504b1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaoslib\activity.py", line 223, in run_activity
result = run_python_activity(activity, configuration, secrets)
File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaoslib\provider\python.py", line 57, in run_python_activity
raise ActivityFailed(
File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaoslib\provider\python.py", line 55, in run_python_activity
return func(**arguments)
File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaosazure\webapp\actions.py", line 57, in restart_webapp
choice = __fetch_webapp_at_random(filter, configuration, secrets)
File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaosazure\webapp\actions.py", line 133, in __fetch_webapp_at_random
webapps = fetch_webapps(filter, configuration, secrets)
File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaosazure\webapp\actions.py", line 118, in fetch_webapps
webapps = fetch_resources(filter, RES_TYPE_WEBAPP, secrets, configuration)
File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaosazure\rgraph\resource_graph.py", line 13, in fetch_resources
resources = client.resources(query)
File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\azure\mgmt\resourcegraph\operations\_resource_graph_client_operations.py", line 64, in resources
raise models.ErrorResponseException(self._deserialize, response)
chaoslib.exceptions.ActivityFailed: azure.mgmt.resourcegraph.models._models_py3.ErrorResponseException: (BadRequest) Please provide below info when asking for support: timestamp =
Within the Chaos Toolkit Azure extension we want to make use of the Azure CLI. Using the Azure CLI requires several common steps, e.g. logging in with the provided service principal first and then executing the Azure CLI commands. Commands may be executed as a single command, as a batch of multiple commands, or as an asynchronous step. After the commands are executed, a logout is recommended.
Currently we import the complete Azure Python SDK stack to manage resources. However, we only use a small percentage of the SDK. The pip install process takes very long and the packages are quite large. I think we can trim the requirements so that we only download the packages we really need.
Consider this feature https://docs.microsoft.com/en-us/azure/kusto/query/sampleoperator and let it take place for every resource interaction.
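A minimal sketch of how the sample operator could be appended to a user-supplied filter before the query is sent. The helper name with_sample is hypothetical and not part of the extension; only the `| sample N` syntax comes from the linked Kusto documentation.

```python
def with_sample(query: str, count: int = 1) -> str:
    """Append a Kusto `sample` stage so that at most `count` rows are returned.

    Hypothetical helper for illustration only.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return f"{query.rstrip()} | sample {count}"


print(with_sample("where resourceGroup == 'my-rg'"))
```

Applying the operator in one shared place, rather than in every action, would keep the "pick at random" behaviour consistent across resource types.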
Currently, the setup.py module doesn't expose all the necessary subpackages exported by the chaosazure package.
Until now the start_machine action offers a way to start previously stopped machines. The action provides a filter mechanism to start specific machines, which may not be fine-grained enough or may be too complicated to use.
Instead of filtering, the user could use a tagging mechanism. When the user stops one or more machines, she can tag them. The start_machine action shall make use of those tags and start only machines that carry a specific tag.
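The tag-based selection could look roughly like the sketch below. It assumes that each machine is available as a dictionary with an optional "tags" mapping; the helper name select_by_tag and the tag key "stopped_by" are made up for this example.

```python
from typing import Any, Dict, List


def select_by_tag(
    machines: List[Dict[str, Any]], key: str, value: str
) -> List[Dict[str, Any]]:
    """Keep only the machines whose tag `key` equals `value`.

    Hypothetical helper; machines without tags are skipped.
    """
    return [m for m in machines if (m.get("tags") or {}).get(key) == value]


machines = [
    {"name": "vm-1", "tags": {"stopped_by": "chaos"}},
    {"name": "vm-2", "tags": {"stopped_by": "operator"}},
    {"name": "vm-3", "tags": None},
    {"name": "vm-4"},
]

to_start = select_by_tag(machines, "stopped_by", "chaos")
print([m["name"] for m in to_start])
```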
Actions at the moment are executed asynchronously.
I wonder whether there are reasons for a consumer to execute actions in synchronous mode. I guess that waiting for an action to complete could make sense in some cases. For example, stopping a virtual machine takes a noticeable amount of time (1-2 minutes). When executing chaos experiments, the consumer may want to make sure that the machine has really stopped.
The VMSS and the Machine actions each use their own run_command feature.
It would be beneficial to have a single, central place to run commands, because maintenance would be easier.
The list of scripts that run on *nix-based machines but not on Windows-based machines needs to be updated.
This issue arose from chaostoolkit/chaostoolkit-lib#121. The README needs to be updated.
This commit enables us to roll back all machines stopped during a chaos experiment and start them again.
For now the experiment description is not 100% compliant with the documentation in https://docs.chaostoolkit.org/reference/api/experiment/#secrets. The environment variables are checked in the extension although they are already checked by the chaostoolkit-lib.
The sub-dependency of the azure-mgmt-compute module results in the following error
error: msrestazure 0.5.0 is installed but msrestazure~=0.4.7 is required by {'azure-mgmt-compute'}
So we update the used module and make the dependency explicit in the requirements.txt
@Lawouach I would like to do a refactoring on the module structure.
Modules such as vm, vmss, and webapp have more or less similar (but not the same) actions. I would like to structure
from
.
├── vm
│   ├── actions.py
├── vmss
│   ├── actions.py
├── webapp
│   ├── actions.py
to
.
├── compute
│   ├── vm
│   │   ├── actions.py
│   ├── vmss
│   │   ├── actions.py
│   ├── webapp
│   │   ├── actions.py
Would such a structure hurt somewhere e.g. the documentation generation?
We have a new handling for the authentication with the Azure infrastructure. Documenting this feature is essential.
Is there any documentation on how to use chaos toolkit azure for postgres db instances?
The chaostoolkit-azure driver has scripts that can be executed on virtual machines and instances. Those scripts generate some noise on CPU, I/O, disk and so on.
This capability is nice, but it would be even nicer to let users execute custom scripts that they prepared themselves. For example, the scripts could be provided via a private GitHub repository, a publicly available script repository, or the local file system.
That's the path taken by the AWS extension. No point in going forward when credentials are not provided to the experiment.
We do not yet have any pytest tests for the machine, aks, and vmss modules.
When calling the command
chaos discover chaostoolkit-azure
we get the error message
package 'chaosazure' does not export a `discover` function
So we need to implement the discovery function for the Azure extension.
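The core idea behind such a discover function is to enumerate the activities that the package exports. The sketch below shows only that idea with the standard library; it does not use the discovery helpers from chaostoolkit-lib, which a real implementation would most likely rely on, and the shape of the returned dictionary is an assumption made for this example.

```python
import inspect
import types
from typing import Any, Dict, Iterable, List


def list_public_functions(module: types.ModuleType) -> List[str]:
    """Return the names of public functions defined in `module` itself."""
    names = []
    for name, obj in inspect.getmembers(module, inspect.isfunction):
        if name.startswith("_"):
            continue  # private helpers are not activities
        if obj.__module__ != module.__name__:
            continue  # skip functions imported from elsewhere
        names.append(name)
    return names


def discover(
    modules: Iterable[types.ModuleType], name: str = "chaostoolkit-azure"
) -> Dict[str, Any]:
    """Collect the public functions of the given modules.

    Illustrative stand-in: the result layout is hypothetical.
    """
    activities = []
    for module in modules:
        for func_name in list_public_functions(module):
            activities.append({"mod": module.__name__, "name": func_name})
    return {"extension": {"name": name}, "activities": activities}
```

A caller would pass the action and probe modules of the extension, for example `discover([chaosazure.aks.actions])` once that module is imported.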
Azure API allows the use of a Credentials File in addition to the explicit ENV vars approach.
This is an easier way to deal with Azure credentials, especially in the case where you are given access to a previously running AKS cluster.
Here is the example from Azure python API doc: https://docs.microsoft.com/en-us/python/azure/python-sdk-azure-authenticate?view=azure-python#mgmt-auth-file
[ ] #35
[ ] Apply this for AKS
[ ] Apply this for VMSS
[ ] Apply this for Web Apps
The documentation states that the restart_node and stop_node actions stop a node at random. However, all nodes in my cluster stopped when I ran stop_node.
I also tried adding a "sample" operator to the filter argument, like this:
"arguments":{
  "filter": "'where resourceGroup=='Staging' and name='dmc-staging-1' | sample 1'"
},
but it doesn't work.
My sample JSON is below. Do you have any idea what causes this issue?
{
  "version": "1.0.0",
  "title": "...",
  "description": "...",
  "tags": [
    "azure",
    "kubernetes",
    "aks",
    "node"
  ],
  "configuration": {
    "azure": {
      "subscription_id": "xx"
    }
  },
  "secrets": {
    "azure": {
      "client_id": "xx",
      "client_secret": "xx",
      "tenant_id": "xx"
    }
  },
  "steady-state-hypothesis": {
    "title": "Services are all available and healthy",
    "probes": [
      {
        "name": "all-microservices-healthy",
        "type": "probe",
        "tolerance": true,
        "provider": {
          "func": "all_microservices_healthy",
          "type": "python",
          "module": "chaosk8s.probes"
        }
      }
    ]
  },
  "method": [
    {
      "name": "stop-node",
      "type": "action",
      "provider": {
        "func": "stop_node",
        "argument":{
          "filter": "'where resourceGroup=='Staging' and name='dmc-staging-1' | sample 1'"
        },
        "type": "python",
        "module": "chaosazure.aks.actions",
        "secrets": [
          "azure"
        ],
        "config": [
          "azure"
        ]
      }
    }
  ],
  "rollbacks": [
  ]
}
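Two details in the experiment above may be worth checking, although the source does not confirm that they cause the reported behaviour. The provider uses the key "argument", whereas Chaos Toolkit providers read their parameters from "arguments" (as the deallocate_vmss experiment further down this page does). The filter value is also wrapped in an extra pair of single quotes and compares the name with a single `=`, while Kusto's equality operator is `==`. A sketch of a provider built without those issues, using the hypothetical helpers build_filter and stop_node_provider:

```python
from typing import Any, Dict


def build_filter(resource_group: str, name: str, sample: int = 1) -> str:
    """Build a filter with `==` comparisons and no surrounding quotes."""
    return (
        f"where resourceGroup=='{resource_group}' "
        f"and name=='{name}' | sample {sample}"
    )


def stop_node_provider(resource_group: str, name: str) -> Dict[str, Any]:
    """Return a provider block that passes the filter under `arguments`."""
    return {
        "type": "python",
        "module": "chaosazure.aks.actions",
        "func": "stop_node",
        "arguments": {"filter": build_filter(resource_group, name)},
        "secrets": ["azure"],
        "config": ["azure"],
    }


provider = stop_node_provider("Staging", "dmc-staging-1")
print(provider["arguments"]["filter"])
```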
We work in Azure China, so the current login with the credential settings always fails: the base_url of each client is None and therefore defaults to msrestazure.azure_cloud.AZURE_PUBLIC_CLOUD. As a result, our Azure China credentials fail. You can see that the login URL points to the global endpoint.
[2019-09-18 17:58:13 ERROR] => failed: msrest.exceptions.AuthenticationError: , AdalError: Get Token request returned http error: 400 and server response: {"error":"invalid_request","error_description":"AADSTS90002: Tenant 'adsfadfa' not found. This may happen if there are no active subscriptions for the tenant. Check with your subscription administrator.\r\nTrace ID: adsfadsf\r\nCorrelation ID: dadfadfa\r\nTimestamp: 2019-09-18 09:58:13Z","error_codes":[90002],"timestamp":"2019-09-18 09:58:13Z","trace_id":"dddddd","correlation_id":"xxxxxxxxx","error_uri":"https://login.microsoftonline.com/error?code=90002"}
By checking the Azure Python SDK, we successfully pointed the client to the Azure China URL as shown below, but it is not a good solution because I have to modify every .py file where a client is declared.
E.g.
from msrestazure.azure_cloud import AZURE_CHINA_CLOUD
...
def __compute_mgmt_client(secrets, configuration):
    with auth(secrets) as cred:
        subscription_id = configuration['azure']['subscription_id']
        client = ComputeManagementClient(
            credentials=cred, subscription_id=subscription_id,
            base_url=AZURE_CHINA_CLOUD.endpoints.resource_manager)
        return client
...
Furthermore, could we read the target cloud from the experiment's configuration, so that other Azure clouds can be selected as well?
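One way the cloud could be chosen from the configuration is sketched below. The configuration key "azure_cloud" and the helper cloud_constant_name are assumptions made for this example; the two cloud names are the msrestazure.azure_cloud constants mentioned in this issue. The sketch only resolves the constant name so that it stays free of SDK imports.

```python
from typing import Any, Mapping, Optional

DEFAULT_CLOUD = "AZURE_PUBLIC_CLOUD"
# Names of constants in msrestazure.azure_cloud that appear in this issue
KNOWN_CLOUDS = ("AZURE_PUBLIC_CLOUD", "AZURE_CHINA_CLOUD")


def cloud_constant_name(configuration: Optional[Mapping[str, Any]]) -> str:
    """Return the cloud constant name selected in the configuration.

    "azure_cloud" is a hypothetical key; the public cloud stays the default.
    """
    name = (configuration or {}).get("azure_cloud", DEFAULT_CLOUD)
    if name not in KNOWN_CLOUDS:
        raise ValueError(f"unknown Azure cloud: {name!r}")
    return name


print(cloud_constant_name({"azure_cloud": "AZURE_CHINA_CLOUD"}))
print(cloud_constant_name({}))
```

A single shared function could then look the constant up with `getattr(msrestazure.azure_cloud, name)` and pass its `endpoints.resource_manager` as base_url, so that individual client factories no longer need to be edited.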
chaosazure.vmss.actions/deallocate-vmss works when the secrets are injected explicitly, but not when they are passed through environment variables.
The following experiment (secrets injected explicitly) works fine.
{
  <<<<snip>>>>
  "configuration": {
    "azure": {
      "subscription_id": "xxxx"
    },
    "service_url": {
      "type": "env",
      "key": "APPLICATION_ENTRYPOINT_URL"
    }
  },
  "secrets": {
    "azure": {
      "client_id": "xxxx",
      "client_secret": "xxxx",
      "tenant_id": "xxxx"
    }
  },
  <<<<snip>>>>
  "method": [
    {
      "type": "action",
      "name": "deallocate-vmss",
      "provider": {
        "module": "chaosazure.vmss.actions",
        "type": "python",
        "func": "deallocate_vmss",
        "secrets": [
          "azure"
        ],
        "config": [
          "azure"
        ],
        "arguments": {
          "filter": ""
        }
      },
      "pauses": {
        "after": 2
      }
    }
  ]
}
[2019-07-26 23:40:45 INFO] Validating the experiment's syntax
[2019-07-26 23:40:45 INFO] Experiment looks valid
[2019-07-26 23:40:45 INFO] Running experiment: My application is resilient to pod death
[2019-07-26 23:40:45 INFO] Steady state hypothesis: Application is normal
[2019-07-26 23:40:45 INFO] Probe: application-must-respond-normally
[2019-07-26 23:40:45 INFO] Steady state hypothesis is met!
[2019-07-26 23:40:45 INFO] Action: deallocate-vmss
additional_properties is not a known attribute of class <class 'azure.mgmt.resourcegraph.models._models_py3.QueryRequest'> and will be ignored
[2019-07-26 23:40:48 INFO] Pausing after activity for 2s...
[2019-07-26 23:40:50 INFO] Steady state hypothesis: Application is normal
[2019-07-26 23:40:50 INFO] Probe: application-must-respond-normally
[2019-07-26 23:40:50 INFO] Steady state hypothesis is met!
[2019-07-26 23:40:50 INFO] Let's rollback...
[2019-07-26 23:40:50 INFO] No declared rollbacks, let's move on.
[2019-07-26 23:40:50 INFO] Experiment ended with status: completed
But the following one (environment variables) fails.
{
  <<<<snip>>>>
  "configuration": {
    "azure": {
      "subscription_id": {
        "type": "env",
        "key": "AZURE_SUBSCRIPTION_ID"
      }
    },
    "service_url": {
      "type": "env",
      "key": "APPLICATION_ENTRYPOINT_URL"
    }
  },
  "secrets": {
    "azure": {
      "client_id": {
        "type": "env",
        "key": "AZURE_CLIENT_ID"
      },
      "client_secret": {
        "type": "env",
        "key": "AZURE_CLIENT_SECRET"
      },
      "tenant_id": {
        "type": "env",
        "key": "AZURE_TENANT_ID"
      }
    }
  },
  <<<<snip>>>>
  "method": [
    {
      "type": "action",
      "name": "deallocate-vmss",
      "provider": {
        "module": "chaosazure.vmss.actions",
        "type": "python",
        "func": "deallocate_vmss",
        "secrets": [
          "azure"
        ],
        "config": [
          "azure"
        ],
        "arguments": {
          "filter": ""
        }
      },
      "pauses": {
        "after": 2
      }
    }
  ]
}
[2019-07-26 23:57:27 INFO] Validating the experiment's syntax
[2019-07-26 23:57:27 INFO] Experiment looks valid
[2019-07-26 23:57:27 INFO] Running experiment: My application is resilient to node death
[2019-07-26 23:57:27 INFO] Steady state hypothesis: Application is normal
[2019-07-26 23:57:27 INFO] Probe: application-must-respond-normally
[2019-07-26 23:57:27 INFO] Steady state hypothesis is met!
[2019-07-26 23:57:27 INFO] Action: deallocate-vmss
additional_properties is not a known attribute of class <class 'azure.mgmt.resourcegraph.models._models_py3.QueryRequest'> and will be ignored
[2019-07-26 23:57:28 ERROR] => failed: azure.mgmt.resourcegraph.models._models_py3.ErrorResponseException: (BadRequest) Please provide below info when asking for support: timestamp = 2019-07-26T14:57:28.0913916Z, correlationId = 974dabf8-0163-4637-bac5-f63d65f71318.
[2019-07-26 23:57:28 INFO] Pausing after activity for 2s...
[2019-07-26 23:57:30 INFO] Steady state hypothesis: Application is normal
[2019-07-26 23:57:30 INFO] Probe: application-must-respond-normally
[2019-07-26 23:57:30 INFO] Steady state hypothesis is met!
[2019-07-26 23:57:30 INFO] Let's rollback...
[2019-07-26 23:57:30 INFO] No declared rollbacks, let's move on.
[2019-07-26 23:57:30 INFO] Experiment ended with status: completed
I would appreciate it if you would give me some advice.
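One possible explanation, consistent with the note earlier on this page that configurations are flat rather than nested, is that the env declaration nested under "azure" is not resolved, so the extension receives a dictionary instead of the subscription id. The source does not confirm this. The sketch below shows what resolving such declarations at any depth could look like; the helper resolve_env is hypothetical and not part of chaostoolkit-lib.

```python
import os
from typing import Any, Mapping


def resolve_env(value: Any, environ: Mapping[str, str] = os.environ) -> Any:
    """Replace {"type": "env", "key": NAME} declarations at any depth.

    Hypothetical helper; raises KeyError when a variable is not set.
    """
    if isinstance(value, dict):
        if value.get("type") == "env" and "key" in value:
            return environ[value["key"]]
        # Recurse into ordinary nested sections such as "azure"
        return {k: resolve_env(v, environ) for k, v in value.items()}
    return value


configuration = {
    "azure": {
        "subscription_id": {"type": "env", "key": "AZURE_SUBSCRIPTION_ID"}
    }
}
fake_environ = {"AZURE_SUBSCRIPTION_ID": "sub-123"}
print(resolve_env(configuration, fake_environ))
```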