CWHD is a custom Azure monitoring solution leveraging Grafana to monitor the following aspects:
Color-coded signals in Grafana dashboards, shown as Green, Amber and Red tiles depending on:
- overall resource health from Azure Resource Health signals
- health of all apps, from Application Insights Standard Test (HTTP ping) web app availability signals
- for VMs only: configurable thresholds for CPU, Memory and Disk usage; the tile turns Amber when a threshold is reached
- dashboard tiles use the Green, Amber and Red color code to show the overall availability of an application, aggregated from the Resource Health of one or more Azure resources
The dashboards are organized into Level 0 and Level 1, which reflect the "depth" of monitoring.
- Level 0 - shows the availability status of all apps.
- Level 1 - drills into the Resource Health of each Azure resource used by the app.
Required Telemetry / Logs
- for App Service and Web App health signals - all workspace-based Application Insights Standard Test results are sent to a single Log Analytics Workspace
- for Virtual Machine health signals - enable VM Insights
- all PaaS resources under monitoring must have a Diagnostic Setting configured to send logs to one central Log Analytics Workspace. For example, API Management sends its resource logs to the workspace
Azure Resources Required
- a "central" Log Analytics Workspace
- Azure Managed Grafana
  - enable Managed Identity
  - add an Azure role assignment (RBAC) giving the Grafana Managed Identity the Monitor Reader role on:
    - subscriptions containing resources under monitoring
    - the Log Analytics Workspace (if the workspace is in a different subscription from the above)
- Azure Function - App Service Plan S1
  - enable Managed Identity
  - add an Azure role assignment (RBAC) giving the Function Managed Identity the Monitor Reader role on:
    - subscriptions containing resources under monitoring
    - the Log Analytics Workspace (if the workspace is in a different subscription from the above)
- All Application Insights resources must be linked to the same central Log Analytics Workspace
- Create App Insights Standard Tests to perform availability tests for all App Services and Web Apps. (Standard Test results are stored in the AppAvailabilityResults table.)
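
To illustrate how the latest Standard Test result per test could be read from the AppAvailabilityResults table, here is a minimal sketch that only builds a KQL query string. It assumes the table's standard `TimeGenerated`, `Success`, `Name` and `_ResourceId` columns; the function name and the lookback window are hypothetical and not taken from CWHD.

```python
def build_latest_availability_query(lookback_minutes: int = 30) -> str:
    """Build a KQL query returning the most recent Standard Test result
    for each test name and resource within the lookback window.

    The lookback default is an illustrative assumption, not a CWHD setting.
    """
    if lookback_minutes <= 0:
        raise ValueError("lookback_minutes must be positive")
    return (
        "AppAvailabilityResults\n"
        f"| where TimeGenerated > ago({lookback_minutes}m)\n"
        # arg_max keeps the row with the newest TimeGenerated per group.
        "| summarize arg_max(TimeGenerated, Success) by Name, _ResourceId"
    )


if __name__ == "__main__":
    print(build_latest_availability_query(15))
```

The resulting string could be run against the central workspace by whichever Log Analytics client the function uses; that part is left out because it depends on the deployment.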
Assumption
- a Log Analytics Workspace already exists, and "all" Application Insights resources are linked to it
CWHD uses a variety of Azure resources, including a core Azure Function named Resource Health Retriever. It acts as a health status aggregator: it retrieves and aggregates metrics and health statuses from different data sources, depending on the types of resources under monitoring.
For health status, the Resource Health Retriever function supports the following:
- "General" resource types (all non-App Service types): get their health status from Azure Resource Health via the Resource Health REST API.
- App Service: the function runs a log query against the Log Analytics AppAvailabilityResults table to get the latest Standard Test result. The health status is not taken from the Resource Health API because Resource Health still shows "Available" when an App Service is stopped; this behaviour is by design. The requirement is to show "Unavailable" when an App Service is stopped.
- VM: health status is determined by 2 factors:
  - Resource Health availability status determines whether the VM is available or not, giving the Green or Red status.
  - If the Resource Health status is Available/Green, 3 additional metrics (CPU, Memory and Disk usage percentage) are monitored against a set of configurable thresholds. In Grafana, the VM Stat visualization shows Amber status if one or more of the 3 metrics reaches its threshold.
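
The three rules above can be sketched as small pure functions. This is an illustrative sketch, not the actual Resource Health Retriever code: the function names, the `VmThresholds` type and its default values are hypothetical, and treating every state other than "Available" (and a missing test result) as Red is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Sequence, Tuple

GREEN, AMBER, RED = "Green", "Amber", "Red"


@dataclass(frozen=True)
class VmThresholds:
    # Hypothetical default values; in CWHD the thresholds are configurable.
    cpu_percent: float = 80.0
    memory_percent: float = 80.0
    disk_percent: float = 80.0


def general_status(availability_state: str) -> str:
    """Non-App Service resources: Resource Health decides Green or Red.

    Assumption: any state other than "Available" is shown as Red.
    """
    return GREEN if availability_state == "Available" else RED


def app_service_status(results: Sequence[Tuple[datetime, bool]]) -> str:
    """App Service: the most recent Standard Test result decides the status.

    Each item is (timestamp, success). Assumption: no results means Red.
    """
    if not results:
        return RED
    _, success = max(results, key=lambda row: row[0])
    return GREEN if success else RED


def vm_status(availability_state: str, cpu: float, memory: float, disk: float,
              thresholds: VmThresholds = VmThresholds()) -> str:
    """VM: Red if unavailable, Amber if any metric reaches its threshold."""
    if general_status(availability_state) == RED:
        return RED
    breached = (
        cpu >= thresholds.cpu_percent
        or memory >= thresholds.memory_percent
        or disk >= thresholds.disk_percent
    )
    return AMBER if breached else GREEN
```

With these placeholder thresholds, `vm_status("Available", 50, 85, 10)` evaluates to Amber because memory usage has reached its threshold, while an unavailable VM is Red regardless of its metrics.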
The overall availability status (Green) of an app depends on the Azure resources that the app uses. If any Azure resource used by an app such as Cloud Crafty or Pocket Geeks has a Resource Health status of "Unavailable", the overall health status at Level 0 is Unavailable. For example, Cloud Crafty uses 3 Azure resources: App Service, Key Vault and APIM. Its overall availability status is Green only when the Resource Health status of all 3 resources and the App Insights Standard Test availability status are available.
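
The Level 0 roll-up described above can be sketched as follows. The function name is hypothetical, and two behaviours are assumptions rather than documented CWHD rules: an Amber resource makes the overall tile Amber, and an app with no resources is rejected.

```python
from typing import Mapping


def overall_status(resource_statuses: Mapping[str, str]) -> str:
    """Aggregate per-resource statuses into one Level 0 status.

    Green only when every resource is Green; Red as soon as one is Red.
    """
    if not resource_statuses:
        # Assumption: an app without monitored resources is a config error.
        raise ValueError("at least one resource status is required")
    statuses = list(resource_statuses.values())
    if any(status == "Red" for status in statuses):
        return "Red"
    if any(status == "Amber" for status in statuses):
        # Assumption: Amber propagates to Level 0.
        return "Amber"
    return "Green"


if __name__ == "__main__":
    cloud_crafty = {"App Service": "Green", "Key Vault": "Red", "APIM": "Green"}
    print(overall_status(cloud_crafty))
```

In the example, one unavailable Key Vault is enough to turn the Cloud Crafty tile Red even though the App Service and APIM are healthy.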
Proposed: distributed tracing with an OpenTelemetry Collector that collects OpenTelemetry traces from the apps and sends them to Jaeger, backed by Azure Managed Cassandra. Grafana uses Jaeger as a data source to display traces centrally, in addition to viewing traces in the Jaeger UI.