
chaostoolkit-azure's People

Contributors

amiosci, aravindarc, botobako, buderre, bugra-derre, gavinhc, hemantahk, jbblache, lawouach, lucasholzmann, lunik, matsch55, maximmold, mkaszub, russmiles, seriousscorpion, supdavid, torumakabe, xpdable

chaostoolkit-azure's Issues

Reconsider deleting actions

Delete actions are invasive and can potentially harm your infrastructure. We should reconsider whether the Azure extension should offer those types of actions. At the moment the machine and aks modules offer such delete actions. Maybe we should remove them. Powering off instead of deleting should be sufficient here.

Switch requirements for Azure Management

The Azure CLI provides full support for managing the Azure infrastructure. Some of its advantages:

  • it can list all available Azure CosmosDB accounts
  • it can filter Azure resources with the Azure Resource Graph query language

Switch from azure-mgmt-* modules to the Azure CLI.

Is the Azure CLI the right choice?

The Chaostoolkit Azure makes use of the Azure CLI. The Azure CLI eases tasks such as collecting Azure resources, acting on them, and filtering them.

The Azure CLI requires a bunch of sub-dependencies, and those sub-dependencies only work in a specific constellation of versions. In some cases the constellation declared in requirements.txt does not resolve, and the Chaostoolkit Azure cannot be built.

We need to find another solution for this.

Enable filtering for virtual machines

Instead of providing resource groups in the configuration file, the user
is now able to provide filter queries. The filter queries are passed
directly to the action. With this solution the user is able to use a
fine-grained filter.

Requirement: The resource-graph extension is a prerequisite. The
application will install the resource-graph extension if it is not
installed.
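
As a rough illustration of how a user-supplied filter could be combined with a resource type into one Resource Graph query, here is a minimal sketch. The helper name `build_query`, the resource group `my-rg` and the exact query layout are assumptions made for illustration, not the extension's actual implementation.

```python
from typing import Optional

# Resource type string for virtual machines in Azure Resource Graph.
RES_TYPE_VM = "Microsoft.Compute/virtualMachines"


def build_query(resource_type: str, user_filter: Optional[str] = None) -> str:
    """Combine a fixed resource type clause with an optional user filter.

    Hypothetical helper: the extension may assemble its query differently.
    """
    query = "where type =~ '{}'".format(resource_type)
    if user_filter and user_filter.strip():
        # The user filter is appended as one more pipeline stage.
        query += " | " + user_filter.strip()
    return query


print(build_query(RES_TYPE_VM, "where resourceGroup == 'my-rg'"))
```

An empty or missing filter leaves only the type clause, so every resource of that type stays eligible.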

Introduce new feature: burn I/O

Introduce a new feature stressing the I/O operations per second of the hard drive for Linux. Windows is not supported yet.

Discussion: Separate the chaostoolkit-azure extension in two extensions

The ct-azure extension hosts actions for i) the Azure Service Fabric and ii) the Azure infrastructure.

The Azure Service Fabric is an orchestration framework, comparable to K8s. Microsoft states that Service Fabric is portable, and can be deployed on for example AWS.

Some main points for the discussion of separating those two parts:

  • The Azure Service Fabric differs in its business scope from the Azure infrastructure
  • The Azure Service Fabric differs in its authentication logic from the Azure infrastructure authentication
  • The Azure Service Fabric, as an orchestration framework, should perhaps be treated like the K8s extension

From the ct clients' point of view, it could also be beneficial to see that ct supports both Azure and Azure Service Fabric.

What do you think?

The subscription_id section in the configuration is wrong

Currently, we expect this:

subscription_id = configuration['azure']['subscription_id']

in https://github.com/chaostoolkit-incubator/chaostoolkit-azure/blob/master/chaosazure/__init__.py#L122

with the config as follows:

"configuration": {
    "azure": {
        "subscription_id": "xxx"
    }
},

But that means, the key cannot be interpolated from the environment variable. It should have been like this:

subscription_id = configuration['azure_subscription_id']

with the config as follows:

"configuration": {
    "azure_subscription_id": {
        "type": "env",
        "key": "SUBSCRIPTION_ID"
    }
},

so that the value is read from the environment variable.

Configurations are flat by default, not nested as the README suggests.
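
The difference between the two layouts can be sketched in a few lines of Python. The function `resolve_flat_configuration` is a simplified stand-in written for this illustration, assuming a loader that only inspects top-level entries as the issue implies; it is not the Chaos Toolkit's own loading code, and the value `0000-example` is a placeholder.

```python
import os


def resolve_flat_configuration(configuration: dict) -> dict:
    """Resolve top-level entries declared as {"type": "env", "key": ...}.

    Simplified stand-in for a configuration loader: only the top level is
    inspected, nested dictionaries are returned untouched.
    """
    resolved = {}
    for name, value in configuration.items():
        if isinstance(value, dict) and value.get("type") == "env":
            resolved[name] = os.environ.get(value["key"])
        else:
            resolved[name] = value
    return resolved


os.environ["SUBSCRIPTION_ID"] = "0000-example"

flat = resolve_flat_configuration(
    {"azure_subscription_id": {"type": "env", "key": "SUBSCRIPTION_ID"}}
)
nested = resolve_flat_configuration(
    {"azure": {"subscription_id": {"type": "env", "key": "SUBSCRIPTION_ID"}}}
)

print(flat["azure_subscription_id"])       # the value from the environment
print(nested["azure"]["subscription_id"])  # still the raw {"type": "env", ...} dict
```

Under that assumption, the nested layout hands the action a dictionary where it expects a subscription ID string.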

Introduce actions for the managed Azure Kubernetes Services

Until now, we do not have any actions for the managed Azure Kubernetes Services (AKS). Since the AKS is quite popular it would be nice to have some features for it. For example deleting, powering off, or restarting a node from the AKS cluster.

Refactor authentication functions

We have two AUTH functions, one for the management of the service fabric and another one for the management of the Azure infrastructure. To avoid repetitions we move the two AUTH functions one module up.

Activities are asynchronous - they should be blocking

For now the activities in the Azure extension are asynchronous and are not waiting for the activity result. According to the Chaostoolkit specification they should be synchronous.

If users still want an activity to run asynchronously, they can achieve this by setting the background = true flag on that activity in the chaos experiment itself.
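
For illustration, an activity carrying that flag could look like the following sketch, built as a Python dictionary and printed as JSON. The module and function names are taken from an experiment quoted elsewhere on this page; the empty filter is a placeholder.

```python
import json

# An action that the experiment runner should start without waiting for it
# to finish. Only the "background" property differs from a blocking activity.
activity = {
    "type": "action",
    "name": "deallocate-vmss",
    "background": True,
    "provider": {
        "type": "python",
        "module": "chaosazure.vmss.actions",
        "func": "deallocate_vmss",
        "secrets": ["azure"],
        "arguments": {"filter": ""},
    },
}

print(json.dumps(activity, indent=2))
```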

Databases actions

We'd like to have more experiments at the database level, such as restarting a database instance.
We currently use the service marked in red in the screenshot attached to the original issue. We have run into real-world restarts many times, especially with MySQL, and they have forced us to work on weekends :-(

Introduce a new feature: FillDisk

FillDisk simulates full-disk behaviour on virtual machines. Real-world causes could be undeleted log files or archives. This function is inspired by the Simian Army feature of the same name.

restart_webapp does not work

The restart_webapp action, configured as in the screenshot attached to the original issue, is not working. I am getting the following error:

[2019-11-07 13:36:26 DEBUG] [python:34] Activity 'restart-webb-app' loaded from 'c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaosazure\webapp\actions.py'
[2019-11-07 13:36:26 DEBUG] [actions:53] Start restart_webapp: configuration='{'azure': {'subscription_id': 'xxx'}}', filter='where resourceGroup == 'GF-RG-Base-d' and name = 'gf-web-dynamicscfs-d-azwe''
[2019-11-07 13:36:26 DEBUG] [activity:230] Activity failed
Traceback (most recent call last):
  File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaoslib\provider\python.py", line 55, in run_python_activity
    return func(**arguments)
  File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaosazure\webapp\actions.py", line 57, in restart_webapp
    choice = __fetch_webapp_at_random(filter, configuration, secrets)
  File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaosazure\webapp\actions.py", line 133, in __fetch_webapp_at_random
    webapps = fetch_webapps(filter, configuration, secrets)
  File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaosazure\webapp\actions.py", line 118, in fetch_webapps
    webapps = fetch_resources(filter, RES_TYPE_WEBAPP, secrets, configuration)
  File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaosazure\rgraph\resource_graph.py", line 13, in fetch_resources
    resources = client.resources(query)
  File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\azure\mgmt\resourcegraph\operations\_resource_graph_client_operations.py", line 64, in resources
    raise models.ErrorResponseException(self._deserialize, response)
azure.mgmt.resourcegraph.models._models_py3.ErrorResponseException: (BadRequest) Please provide below info when asking for support: timestamp = 2019-11-07T12:36:26.9590396Z, correlationId = 456f97ab-5b44-42e9-9b96-ad5b4ce504b1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaoslib\activity.py", line 223, in run_activity
    result = run_python_activity(activity, configuration, secrets)
  File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaoslib\provider\python.py", line 57, in run_python_activity
    raise ActivityFailed(
  File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaoslib\provider\python.py", line 55, in run_python_activity
    return func(**arguments)
  File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaosazure\webapp\actions.py", line 57, in restart_webapp
    choice = __fetch_webapp_at_random(filter, configuration, secrets)
  File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaosazure\webapp\actions.py", line 133, in __fetch_webapp_at_random
    webapps = fetch_webapps(filter, configuration, secrets)
  File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaosazure\webapp\actions.py", line 118, in fetch_webapps
    webapps = fetch_resources(filter, RES_TYPE_WEBAPP, secrets, configuration)
  File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\chaosazure\rgraph\resource_graph.py", line 13, in fetch_resources
    resources = client.resources(query)
  File "c:\users\910329\appdata\local\programs\python\python38\lib\site-packages\azure\mgmt\resourcegraph\operations\_resource_graph_client_operations.py", line 64, in resources
    raise models.ErrorResponseException(self._deserialize, response)
chaoslib.exceptions.ActivityFailed: azure.mgmt.resourcegraph.models._models_py3.ErrorResponseException: (BadRequest) Please provide below info when asking for support: timestamp = 

Enable the utilization of the Azure CLI within the Chaos Toolkit Azure extension

Within the Chaos Toolkit Azure extension we want to make use of the Azure CLI. Using the Azure CLI involves several common steps, e.g. logging in with the provided service principal first and then executing the Azure CLI commands. Commands may be executed as a single command, as a batch consisting of multiple commands, or as an asynchronous step. After the commands are executed a logout is recommended.
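
The login, execute, logout sequence could be wrapped roughly as follows. This is a minimal sketch under stated assumptions: the helper names `login_command`, `az_session` and `run_batch` are invented for this illustration, the secrets are placeholders, and the runner is injected so that the example records commands instead of calling the real `az` binary.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List, Sequence

# A runner receives one command as an argument list and executes it.
Runner = Callable[[List[str]], None]


def login_command(secrets: dict) -> List[str]:
    """Build the service principal login command for the Azure CLI."""
    return [
        "az", "login", "--service-principal",
        "--username", secrets["client_id"],
        "--password", secrets["client_secret"],
        "--tenant", secrets["tenant_id"],
    ]


@contextmanager
def az_session(secrets: dict, run: Runner) -> Iterator[Runner]:
    """Log in before the block and always log out afterwards."""
    run(login_command(secrets))
    try:
        yield run
    finally:
        run(["az", "logout"])


def run_batch(secrets: dict, commands: Sequence[List[str]], run: Runner) -> None:
    """Execute several CLI commands inside one login session."""
    with az_session(secrets, run) as execute:
        for command in commands:
            execute(["az", *command])


# Record the commands instead of executing them (placeholder secrets).
recorded: List[List[str]] = []
run_batch(
    {"client_id": "id", "client_secret": "secret", "tenant_id": "tenant"},
    [["vm", "list"], ["group", "list"]],
    recorded.append,
)
for command in recorded:
    print(command[:3])
```

To run against a real subscription, the recording runner could be swapped for something like `lambda cmd: subprocess.run(cmd, check=True)`.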

Purge the requirements

Currently we import the complete Azure Python SDK stack to manage resources. However, we only use a small percentage of the SDK. The pip install process lasts very long and the packages are quite large. I think there is a way to improve the requirements that we just download the packages that we really need.

Introduce tagging

Until now the start_machine action offers a way to start previously stopped machines. The action provides a filter mechanism to start specific machines, which may not be fine-grained enough or may be too complicated to use.

Instead of filtering, the user could use a tagging mechanism. When users stop machines, they can tag them. The start_machine action shall make use of those tags and start only machines that carry a specific tag.
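
The selection step of such a tagging mechanism might look like the sketch below. The helper `select_by_tag`, the tag name `stopped_by` and the machine records are all hypothetical; machines are modelled as plain dictionaries with an optional `tags` mapping.

```python
from typing import List


def select_by_tag(machines: List[dict], key: str, value: str) -> List[dict]:
    """Return only the machines whose tags contain key == value."""
    return [
        machine
        for machine in machines
        # Machines without tags may report None or an empty mapping.
        if (machine.get("tags") or {}).get(key) == value
    ]


machines = [
    {"name": "vm-1", "tags": {"stopped_by": "chaos"}},
    {"name": "vm-2", "tags": {}},
    {"name": "vm-3", "tags": None},
]

to_start = select_by_tag(machines, "stopped_by", "chaos")
print([machine["name"] for machine in to_start])
```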

Introduce parameter option to execute actions in synchronous or asynchronous mode?

Actions at the moment are executed asynchronously.

I wonder if there are reasons why a consumer would want to execute actions in synchronous mode. I guess that waiting for an action to complete could make sense in some cases. For example, stopping a virtual machine takes a considerable amount of time (1-2 minutes). When executing chaos experiments the consumer may want to make sure that the machine has really stopped.
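
One possible shape for such an option is sketched below. The `wait` parameter and the `stop_machine` helper are hypothetical, and a thread pool future stands in for the long-running Azure operation; the point is only that the caller decides whether to block on the result.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


def stop_machine(executor: ThreadPoolExecutor, name: str,
                 wait: bool = True, duration: float = 0.1) -> Future:
    """Start a simulated long-running stop and optionally block on it."""

    def long_running_stop() -> str:
        time.sleep(duration)  # stands in for the 1-2 minute operation
        return "{} stopped".format(name)

    future = executor.submit(long_running_stop)
    if wait:
        # Synchronous mode: return only once the operation has finished.
        future.result()
    return future


with ThreadPoolExecutor(max_workers=2) as executor:
    blocking = stop_machine(executor, "vm-1", wait=True)
    print(blocking.done())

    non_blocking = stop_machine(executor, "vm-2", wait=False)
    # The caller can still collect the result later if needed.
    print(non_blocking.result())
```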

Violating the DRY principle

The VMSS and the Machine actions each use their own run_command implementation.

It would be beneficial to have one centralized place for running commands, because maintenance would be easier.
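
A centralized helper could, for example, build the command parameters once and leave only the target-specific call to the caller. The sketch below is an assumption about how that might be organised: `run_command` here is a hypothetical shared function, the parameter dictionary is only loosely shaped after a run-command request, and the two executors are fakes standing in for the VM and VMSS client calls.

```python
from typing import Callable, Dict

# A callable that sends run-command parameters to one concrete target
# (a single VM or a single scale set instance) and returns the response.
Executor = Callable[[Dict], Dict]


def run_command(execute: Executor, script: str) -> Dict:
    """Build the run-command parameters once and hand them to `execute`."""
    parameters = {
        "command_id": "RunShellScript",
        "script": [script],
    }
    return execute(parameters)


# Fakes standing in for the VM and VMSS client calls.
def fake_vm_execute(parameters: Dict) -> Dict:
    return {"target": "vm", "sent": parameters}


def fake_vmss_execute(parameters: Dict) -> Dict:
    return {"target": "vmss-instance", "sent": parameters}


vm_result = run_command(fake_vm_execute, "echo hello")
vmss_result = run_command(fake_vmss_execute, "echo hello")
print(vm_result["target"], vmss_result["target"])
```

Both modules would then share the parameter handling and differ only in the executor they pass in.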

Straighten dependencies in requirements.txt

The sub-dependency of the azure-mgmt-compute module results in the following error

error: msrestazure 0.5.0 is installed but msrestazure~=0.4.7 is required by {'azure-mgmt-compute'}

So we update the used module and make the dependency explicit in the requirements.txt

Refactoring the module structure

@Lawouach I would like to do a refactoring on the module structure.

Modules such as vm, vmss, and webapp have more or less similar (but not the same) actions. I would like to structure

from

.
├── vm
│  ├── actions.py
├── vmss
│  ├── actions.py
├── webapp
│  ├── actions.py

to

.
├── compute
│  ├── vm
│  │  ├── actions.py
│  ├── vmss
│  │  ├── actions.py
│  ├── webapp
│  │  ├── actions.py

Would such a structure hurt somewhere e.g. the documentation generation?

Update documentation

We have a new handling for the authentication with the Azure infrastructure. Documenting this feature is essential.

Users may have custom scripts (from a URL or a local file) that they want to execute

The chaostoolkit-azure driver has scripts that can be executed on virtual machines and instances. Those scripts generate some noise on CPU, I/O, disk and so on.

This capability is nice, but it would be even nicer to allow users to execute custom scripts that they have prepared themselves. For example, the scripts could be provided via a private GitHub repository, a publicly available script repository, or the local file system.

@Lawouach @mkaszub Do you have opinions on this?

Introduce tests

We do not yet have any pytest tests for the machine, aks, and vmss modules.

How to define the number of restart / stop_node on AKS Cluster?

Stop a node at random from a managed Azure Kubernetes Service.

The documentation says that the restart_node and stop_node actions stop a node at random. However, all nodes in my cluster stopped when I ran stop_node.
I also tried adding a "sample" clause to the filter argument, like this:

        "arguments":{
          "filter": "'where resourceGroup=='Staging' and name='dmc-staging-1' | sample 1'"
        },

but it doesn't work.
My sample JSON is below. Do you have any idea what causes this?

{
  "version": "1.0.0",
  "title": "...",
  "description": "...",
  "tags": [
    "azure",
    "kubernetes",
    "aks",
    "node"
  ],
  "configuration": {
    "azure": {
        "subscription_id": "xx"
    }
  },
  "secrets": {
    "azure": {
        "client_id": "xx",
        "client_secret": "xx",
        "tenant_id": "xx"
    }
  },
  "steady-state-hypothesis": {
    "title": "Services are all available and healthy",
    "probes": [
        {
          "name": "all-microservices-healthy",
          "type": "probe",
          "tolerance": true,
          "provider": {
            "func": "all_microservices_healthy",
            "type": "python",
            "module": "chaosk8s.probes"
          }
        }
    ]
  },
  "method": [
    {
      "name": "stop-node",
      "type": "action",
      "provider": {
        "func": "stop_node",
        "argument":{
          "filter": "'where resourceGroup=='Staging' and name='dmc-staging-1' | sample 1'"
        },
        "type": "python",
        "module": "chaosazure.aks.actions",
        "secrets": [
          "azure"
        ],
        "config": [
          "azure"
        ]
      }
    }
  ],
  "rollbacks": [
  ]
}

Support Azure China or multiple regions

We work in Azure China, so logging in with the current credential settings always fails: the base_url of each client is None and therefore defaults to msrestazure.azure_cloud.AZURE_PUBLIC_CLOUD. As a result, our Azure China credentials are rejected. The error below shows that the login URL points to the global endpoint.

[2019-09-18 17:58:13 ERROR] => failed: msrest.exceptions.AuthenticationError: , AdalError: Get Token request returned http error: 400 and server response: {"error":"invalid_request","error_description":"AADSTS90002: Tenant 'adsfadfa' not found. This may happen if there are no active subscriptions for the tenant. Check with your subscription administrator.\r\nTrace ID: adsfadsf\r\nCorrelation ID: dadfadfa\r\nTimestamp: 2019-09-18 09:58:13Z","error_codes":[90002],"timestamp":"2019-09-18 09:58:13Z","trace_id":"dddddd","correlation_id":"xxxxxxxxx","error_uri":"https://login.microsoftonline.com/error?code=90002"}

By checking the Azure Python SDK, we managed to point the client at the Azure China URL as shown below, but having to modify every .py file in which a client is declared is not a good solution.
E.g.

from msrestazure.azure_cloud import AZURE_CHINA_CLOUD
...
def __compute_mgmt_client(secrets, configuration):
    with auth(secrets) as cred:
        subscription_id = configuration['azure']['subscription_id']
        client = ComputeManagementClient(
            credentials=cred, subscription_id=subscription_id, 
            base_url=AZURE_CHINA_CLOUD.endpoints.resource_manager)

        return client
...

Furthermore, could the cloud be selected through the experiment configuration, so that other Azure clouds can be chosen as well?
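
A configuration-driven choice could be sketched as follows. The configuration key `azure_cloud` and the helper `resolve_base_url` are hypothetical, and the endpoint table is written out as plain strings so the example runs without the SDK; in the extension, the cloud objects from msrestazure.azure_cloud (used in the snippet above) could supply the same values through endpoints.resource_manager.

```python
# Resource manager endpoints of three Azure clouds, keyed by the names
# used in msrestazure.azure_cloud.
RESOURCE_MANAGER_ENDPOINTS = {
    "AZURE_PUBLIC_CLOUD": "https://management.azure.com/",
    "AZURE_CHINA_CLOUD": "https://management.chinacloudapi.cn",
    "AZURE_US_GOV_CLOUD": "https://management.usgovcloudapi.net/",
}


def resolve_base_url(configuration: dict) -> str:
    """Pick the resource manager URL from a hypothetical 'azure_cloud' key.

    Falls back to the public cloud when the key is absent, which mirrors
    the current default behaviour described in this issue.
    """
    cloud = configuration.get("azure_cloud", "AZURE_PUBLIC_CLOUD")
    try:
        return RESOURCE_MANAGER_ENDPOINTS[cloud]
    except KeyError:
        raise ValueError("Unknown Azure cloud: {}".format(cloud))


print(resolve_base_url({"azure_cloud": "AZURE_CHINA_CLOUD"}))
print(resolve_base_url({}))
```

The returned URL would then be passed as base_url wherever a management client is created, so the choice lives in one place.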

Does chaosazure.vmss.actions/deallocate-vmss with environment variables work?

chaosazure.vmss.actions/deallocate-vmss works when the secrets are injected explicitly, but not when they are passed through environment variables.

The following experiment (secrets injected explicitly) works fine.

{
<<<<snip>>>>
    "configuration": {
        "azure": {
            "subscription_id": "xxxx"
        },
        "service_url": {
            "type": "env",
            "key": "APPLICATION_ENTRYPOINT_URL"
        }
    },
    "secrets": {
        "azure": {
            "client_id": "xxxx",
            "client_secret": "xxxx",
            "tenant_id": "xxxx"
        }
    },
<<<<snip>>>>
    "method": [
        {
            "type": "action",
            "name": "deallocate-vmss",
            "provider": {
                "module": "chaosazure.vmss.actions",
                "type": "python",
                "func": "deallocate_vmss",
                "secrets": [
                    "azure"
                ],
                "config": [
                    "azure"
                ],
                "arguments": {
                    "filter": ""
                }
            },
            "pauses": {
                "after": 2
            }
        }
    ]
}
[2019-07-26 23:40:45 INFO] Validating the experiment's syntax
[2019-07-26 23:40:45 INFO] Experiment looks valid
[2019-07-26 23:40:45 INFO] Running experiment: My application is resilient to pod death
[2019-07-26 23:40:45 INFO] Steady state hypothesis: Application is normal
[2019-07-26 23:40:45 INFO] Probe: application-must-respond-normally
[2019-07-26 23:40:45 INFO] Steady state hypothesis is met!
[2019-07-26 23:40:45 INFO] Action: deallocate-vmss
additional_properties is not a known attribute of class <class 'azure.mgmt.resourcegraph.models._models_py3.QueryRequest'> and will be ignored
[2019-07-26 23:40:48 INFO] Pausing after activity for 2s...
[2019-07-26 23:40:50 INFO] Steady state hypothesis: Application is normal
[2019-07-26 23:40:50 INFO] Probe: application-must-respond-normally
[2019-07-26 23:40:50 INFO] Steady state hypothesis is met!
[2019-07-26 23:40:50 INFO] Let's rollback...
[2019-07-26 23:40:50 INFO] No declared rollbacks, let's move on.
[2019-07-26 23:40:50 INFO] Experiment ended with status: completed

But the following one (environment variables) fails.

{
<<<<snip>>>>
    "configuration": {
        "azure": {
            "subscription_id": {
                "type": "env",
                "key": "AZURE_SUBSCRIPTION_ID"
            }
        },
        "service_url": {
            "type": "env",
            "key": "APPLICATION_ENTRYPOINT_URL"
        }
    },
    "secrets": {
        "azure": {
            "client_id": {
                "type": "env",
                "key": "AZURE_CLIENT_ID"
            },
            "client_secret": {
                "type": "env",
                "key": "AZURE_CLIENT_SECRET"
            },
            "tenant_id": {
                "type": "env",
                "key": "AZURE_TENANT_ID"
            }
        }
    },
<<<<snip>>>>
    "method": [
        {
            "type": "action",
            "name": "deallocate-vmss",
            "provider": {
                "module": "chaosazure.vmss.actions",
                "type": "python",
                "func": "deallocate_vmss",
                "secrets": [
                    "azure"
                ],
                "config": [
                    "azure"
                ],
                "arguments": {
                    "filter": ""
                }
            },
            "pauses": {
                "after": 2
            }
        }
    ]
}
[2019-07-26 23:57:27 INFO] Validating the experiment's syntax
[2019-07-26 23:57:27 INFO] Experiment looks valid
[2019-07-26 23:57:27 INFO] Running experiment: My application is resilient to node death
[2019-07-26 23:57:27 INFO] Steady state hypothesis: Application is normal
[2019-07-26 23:57:27 INFO] Probe: application-must-respond-normally
[2019-07-26 23:57:27 INFO] Steady state hypothesis is met!
[2019-07-26 23:57:27 INFO] Action: deallocate-vmss
additional_properties is not a known attribute of class <class 'azure.mgmt.resourcegraph.models._models_py3.QueryRequest'> and will be ignored
[2019-07-26 23:57:28 ERROR]   => failed: azure.mgmt.resourcegraph.models._models_py3.ErrorResponseException: (BadRequest) Please provide below info when asking for support: timestamp = 2019-07-26T14:57:28.0913916Z, correlationId = 974dabf8-0163-4637-bac5-f63d65f71318.
[2019-07-26 23:57:28 INFO] Pausing after activity for 2s...
[2019-07-26 23:57:30 INFO] Steady state hypothesis: Application is normal
[2019-07-26 23:57:30 INFO] Probe: application-must-respond-normally
[2019-07-26 23:57:30 INFO] Steady state hypothesis is met!
[2019-07-26 23:57:30 INFO] Let's rollback...
[2019-07-26 23:57:30 INFO] No declared rollbacks, let's move on.
[2019-07-26 23:57:30 INFO] Experiment ended with status: completed

I would appreciate it if you would give me some advice.
