alerts-validator

This project aims to determine whether the alerting rules currently used in Victoria Metrics are valid.

By valid, we mean that an alerting rule only uses metrics that are still available, so that it can still fire when needed.

A metric can disappear or be renamed after an infrastructure component is upgraded. Alerts-validator helps us keep alerting rules up to date.

This project runs a continuous loop which gets alerting rules from the VM Alert endpoint, analyzes them against the VM Select endpoint, and exposes the results as Prometheus metrics.

Example

Let's say you are using this alerting rule (from Awesome Prometheus Alerts):

- alert: HostOutOfMemory
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10

Let's say node_memory_MemAvailable_bytes has been renamed because of an upgrade of Prometheus Node Exporter last week.

Alerts-validator will analyze every metric used in the rule above by querying the VM Select endpoint:

present_over_time(node_memory_MemAvailable_bytes[167h]) offset 1h   --> No datapoints
present_over_time(node_memory_MemTotal_bytes[167h]) offset 1h       --> Some datapoints

If one of those metrics doesn't have any datapoints, the alerting rule HostOutOfMemory will be marked as invalid for the time range [1h-168h].

You can now be alerted that the HostOutOfMemory alerting rule needs to be updated.

Build

CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o alerts-validator-linux-amd64

Compatibility

Works with:

  • Victoria Metrics from v1.76.0 to the latest release
  • Prometheus from v2.33.0 to the latest release

Other products and versions have not been tested yet.

Configuration

Command line

Usage of alerts-validator:
  -conf string
        Config path (default "config.yaml")
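
For illustration, a flag with this name, default, and description can be declared with Go's standard `flag` package as sketched below. The helper name `parseConfPath` is an assumption made for this example. The project's own flag handling may be organized differently.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseConfPath parses command-line arguments and returns the config
// path, which defaults to "config.yaml". (Hypothetical helper.)
func parseConfPath(args []string) (string, error) {
	fs := flag.NewFlagSet("alerts-validator", flag.ContinueOnError)
	conf := fs.String("conf", "config.yaml", "Config path")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *conf, nil
}

func main() {
	path, err := parseConfPath(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println("using config:", path)
}
```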

YAML

listenAddress: 0.0.0.0      # Listen address for metrics endpoint
listenPort: 9345            # Listen port for metrics endpoint
computeInterval: 168h       # Interval between 2 computes (Valid time units are "s", "m", "h".)
validityCheckIntervals:     # Check if there are datapoints between 2 intervals (Valid time units are "s", "m", "h". Truncated to minute.)
- 1h
- 168h
labelKeys:
- tenant
servers:
  - labelValues:                                                  # Added in metric label
    - my_tenant
    ruleUrl: https://vmalert.cluster.local.                       # Endpoint to get rules           /api/v1/rules
    queryUrl: https://vm.cluster.local./select/000/prometheus     # Endpoint to validate metrics    /api/v1/query

Metrics

With this config:

validityCheckIntervals:
- 1h
- 168h
- 672h

alerts-validator exposes these metrics:

alertsvalidator_validity_range{range_from="1h",range_to="168h",alertname="my_alert",status="valid"}     0
alertsvalidator_validity_range{range_from="1h",range_to="168h",alertname="my_alert",status="invalid"}   1 # Invalid


alertsvalidator_validity_range{range_from="168h",range_to="672h",alertname="my_alert",status="valid"}   1 # Valid
alertsvalidator_validity_range{range_from="168h",range_to="672h",alertname="my_alert",status="invalid"} 0

alertsvalidator_external_api_error{type="rule", server="https://vmalert.cluster.local."} 0
alertsvalidator_external_api_error{type="query", server="https://vm.cluster.local./select/000/prometheus"} 0
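
The way consecutive `validityCheckIntervals` entries become `range_from`/`range_to` pairs, and how each pair yields one `valid` and one `invalid` series, can be sketched like this. The type and function names are hypothetical. The real exporter presumably uses a Prometheus client library rather than formatting strings by hand.

```go
package main

import "fmt"

// validityRange is one pair of consecutive check intervals.
type validityRange struct{ From, To string }

// pairRanges turns [1h 168h 672h] into [{1h 168h} {168h 672h}].
func pairRanges(intervals []string) []validityRange {
	var out []validityRange
	for i := 0; i+1 < len(intervals); i++ {
		out = append(out, validityRange{intervals[i], intervals[i+1]})
	}
	return out
}

// seriesFor renders the "valid" and "invalid" series for one alert and
// one range. Exactly one of the two carries the value 1.
func seriesFor(r validityRange, alertname string, valid bool) []string {
	value := map[bool]int{true: 1, false: 0}
	const format = `alertsvalidator_validity_range{range_from=%q,range_to=%q,alertname=%q,status=%q} %d`
	return []string{
		fmt.Sprintf(format, r.From, r.To, alertname, "valid", value[valid]),
		fmt.Sprintf(format, r.From, r.To, alertname, "invalid", value[!valid]),
	}
}

func main() {
	ranges := pairRanges([]string{"1h", "168h", "672h"})
	// Same situation as the sample output: invalid for the first
	// range, valid for the second.
	validity := []bool{false, true}
	for i, r := range ranges {
		for _, line := range seriesFor(r, "my_alert", validity[i]) {
			fmt.Println(line)
		}
	}
}
```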

Logs

Logs are also available. They give more information about the alerts-validator metrics described above.

{
    "level": "info",
    "alertname": "myInvalidAlert",
    "is_vector_present": {
        "1h-168h": {
            "up{instance=\"my_server\"}": true
        },
        "168h-672h": {
            "up{instance=\"my_server\"}": false
        }
    },
    "is_valid": {
        "1h-168h": true,
        "168h-672h": false
    },
    "time": ---
}
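
In this log, `is_valid` for a range appears to be the conjunction of the `is_vector_present` entries for that range. A minimal sketch of that aggregation follows. The function name is hypothetical, and treating a range with no vectors as valid is an assumption.

```go
package main

import "fmt"

// rangeValidity marks a range as valid only if every vector queried
// for that range is present. A range with no vectors is reported as
// valid (assumption).
func rangeValidity(present map[string]map[string]bool) map[string]bool {
	valid := make(map[string]bool, len(present))
	for rng, vectors := range present {
		ok := true
		for _, isPresent := range vectors {
			if !isPresent {
				ok = false
				break
			}
		}
		valid[rng] = ok
	}
	return valid
}

func main() {
	// Same data as the "is_vector_present" field of the sample log.
	present := map[string]map[string]bool{
		"1h-168h":   {`up{instance="my_server"}`: true},
		"168h-672h": {`up{instance="my_server"}`: false},
	}
	fmt.Println(rangeValidity(present))
}
```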

Alerting & Monitoring

You can find a few Grafana panels in grafana_panels/.

Example of alerting with alerts-validator metrics:

Groups:
    Interval:  300s
    Name:      bedrock-alerts-validator.rules
    Rules:
      Alert:  newlyInvalidAlert
      Annotations:
        Description:  {{ $labels.alertname }}-{{ $labels.alertid }} alerting rule has been recently detected as invalid by alerts-validator
        Summary:      Newly Invalid Alert detected
      Expr:           sum(alertsvalidator_validity_range{status="invalid"}) by (alertname, alertid) + sum(alertsvalidator_validity_range{status="valid"}) by (alertname, alertid) offset 2h == 2
      For:            1h
      Labels:
        Severity:  notice
