magenta-aps / check_prometheus_metric

This project forked from prometheus/nagios_plugins

Nagios plugin for alerting on Prometheus query results.

License: Apache License 2.0

Shell 100.00%

check_prometheus_metric's Introduction

`check_prometheus_metric`

Nagios plugin for alerting on Prometheus query results.

Usage

  check_prometheus_metric.sh - Nagios plugin for checking Prometheus metrics.
  
  Usage:
    check_prometheus_metric.sh -H HOST -q QUERY -w FLOAT[:FLOAT] -c FLOAT[:FLOAT]
                               -n NAME [-m METHOD] [-C CURL_OPTS]... [-O] [-E] [-i] [-p]

  Options:
    -H HOST          URL of Prometheus host to query.
    -q QUERY         Prometheus query, in single quotes, that returns a float.
    -w FLOAT[:FLOAT] Warning level value (must be a float or nagios-interval).
    -c FLOAT[:FLOAT] Critical level value (must be a float or nagios-interval).
    -n NAME          A name for the metric being checked.
    -m METHOD        Comparison method, one of gt, ge, lt, le, eq, ne.
                     (Defaults to ge unless otherwise specified).
    -C CURL_OPTS     Additional flags to curl. Can be passed multiple times. 
                     Options and option values must be passed separately.
                     e.g. -C --connect-timeout -C 10 -C --cacert -C /path/to/ca.crt
    -O               Accept NaN as an "OK" result.
    -E               Accept an empty vector (null) as an "OK" result.
    -i               Print the extra metric information into the Nagios message.
    -p               Add perfdata to check output.

  Examples:
    check_prometheus_metric -q 'up{job="job_name"}' -w :1 -c :1
    # Check that job is up. If not, critical.
    
    check_prometheus_metric -q 'node_load1' -w :0.05 -c :0.1
    # Check load is below 0.05 (warning) and 0.1 (critical).

    check_prometheus_metric -q 'go_threads' -w 15:25 -c :
    # Check thread count is between 15-25, warning if outside this interval.

  Dependencies:
    Requires bash, curl, cut, echo, grep, jq and sed to be in $PATH.

Note: nagios-interval refers to this syntax
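
The interval form used by `-w` and `-c` can be illustrated with a small membership test. This is a minimal sketch, not the plugin's own code: the function name `in_interval` is hypothetical, the bounds are assumed to be inclusive, and an empty bound is assumed to mean "unbounded", as the usage examples suggest. Which side of the interval raises an alert is decided by the plugin and is not modelled here.

```bash
#!/usr/bin/env bash
# Sketch only: not the plugin's actual code. Handles the "LOW:HIGH" form,
# with an empty bound treated as unbounded (assumption).

# in_interval VALUE INTERVAL -> exit 0 if LOW <= VALUE <= HIGH, else 1.
in_interval() {
    local value=$1 interval=$2
    local low=${interval%%:*}   # text before the first colon
    local high=${interval#*:}   # text after the first colon
    # awk does the float comparison; bash arithmetic is integer-only.
    awk -v v="$value" -v lo="$low" -v hi="$high" 'BEGIN {
        if (lo != "" && v + 0 < lo + 0) exit 1
        if (hi != "" && v + 0 > hi + 0) exit 1
        exit 0
    }'
}

in_interval 20 15:25 && echo "20 is inside 15:25"
in_interval 30 15:25 || echo "30 is outside 15:25"
in_interval 0.03 :0.05 && echo "0.03 is inside :0.05"
```

The sketch uses awk for the comparison only because it is widely available; the plugin itself lists a different set of dependencies.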

Icinga configuration

To use the script with Icinga 2, take the following four steps:

The CheckCommand definition must be added to /etc/icinga2/conf.d/check_prometheus_metric.conf:

object CheckCommand "check_prometheus_metric" {
  import "plugin-check-command"
  command = [ "/usr/lib/nagios/plugins/check_prometheus_metric" ]

  arguments = {
        "-H" = {
                value = "$check_prometheus_metric_url$"
                description = "URL of Prometheus host to query."
        }
        "-q" = {
                value = "$check_prometheus_metric_query$"
                description = "Prometheus query, that returns a float."
        }
        "-w" = {
                value = "$check_prometheus_metric_warning$"
                description = "Warning level value (float or nagios-interval)."
        }
        "-c" = {
                value = "$check_prometheus_metric_critical$"
                description = "Critical level value (float or nagios-interval)."
        }
        "-n" = {
                value = "$check_prometheus_metric_name$"
                description = "A name for the metric being checked."
        }
    }
}

The checking script must be added to /usr/lib/nagios/plugins/check_prometheus_metric:

GIT_REPO=https://github.com/magenta-aps/check_prometheus_metric
# Look up the tag name of the latest release
VERSION=$(curl -sL -H "Accept: application/json" "${GIT_REPO}/releases/latest" | jq -r .tag_name)
# Download the release asset, make it executable and install it
curl -sL -O "${GIT_REPO}/releases/download/${VERSION}/check_prometheus_metric.sh"
chmod +x check_prometheus_metric.sh
sudo mv check_prometheus_metric.sh /usr/lib/nagios/plugins/check_prometheus_metric

Install the required software to run the script:

sudo apt-get update
sudo apt-get install -y bash coreutils curl grep jq sed

Add Service definitions for whatever is to be monitored:

/etc/icinga2/conf.d/check_pi_service.conf:

apply Service "pi" {
  import "generic-service"

  check_command = "check_prometheus_metric"

  vars.check_prometheus_metric_url = "prometheus:9090"
  vars.check_prometheus_metric_query = "pi"
  vars.check_prometheus_metric_warning = "3:4"
  vars.check_prometheus_metric_critical = "1:6"
  vars.check_prometheus_metric_name = "pi"
  
  command_endpoint = host.vars.client_endpoint
  assign where host.name == NodeName
}

Nagios configuration

To use the plugin with Nagios, add the following command definition to your Nagios configuration:

define command {
    command_name check_prometheus
    command_line $USER1$/check_prometheus_metric.sh -H '$ARG1$' -q '$ARG2$' -w '$ARG3$' -c '$ARG4$' -n '$ARG5$' -m '$ARG6$'
}
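
Nagios fills `$ARG1$` to `$ARG6$` from the `!`-separated parts of a service's `check_command`. The sketch below mimics that substitution in bash to show which part ends up behind which flag. The helper name `expand_check_command` and all service values are hypothetical, and the `$USER1$/` path prefix is left out for brevity.

```bash
#!/usr/bin/env bash
# Sketch: Nagios splits a service's check_command on "!" and substitutes
# the pieces for $ARG1$..$ARG6$. All values below are hypothetical.

expand_check_command() {
    local fields
    IFS='!' read -r -a fields <<< "$1"
    # fields[0] is the command name; fields[1..6] become ARG1..ARG6.
    printf "check_prometheus_metric.sh -H '%s' -q '%s' -w '%s' -c '%s' -n '%s' -m '%s'\n" \
        "${fields[@]:1:6}"
}

expand_check_command 'check_prometheus!http://prometheus:9090!node_load1!0.05!0.1!load1!ge'
```

With the values above, the host lands in `-H`, the query in `-q`, the two thresholds in `-w` and `-c`, the name in `-n` and the comparison method in `-m`.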

check_prometheus_metric's Issues

Comparison with negative values in result

I have a metric with positive warning and critical thresholds. When the query returns a negative value, the check reports CRITICAL, even though a negative value is smaller than any positive threshold and should therefore be OK.

As a workaround I added the condition "> 0" to the PromQL query so that only positive results are returned. That produces an empty (null) vector, so I also had to add "on() vector(0)" to return a zero vector. I would still like to display the negative value, because it carries some information even when it is not an alarm.

Note: Waiting for the next release with the -E option so that "on() vector(0)" can be removed.
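
The workaround described in this issue might look like the following. It is only a sketch: the metric name, host and thresholds are hypothetical, and the query uses the common PromQL idiom `or on() vector(0)` to fall back to zero when the filter leaves no samples.

```bash
#!/usr/bin/env bash
# Sketch of the workaround; metric name, host and thresholds are hypothetical.
METRIC='my_metric'

# Keep only positive samples; fall back to a single 0 when nothing is left.
QUERY="(${METRIC} > 0) or on() vector(0)"
echo "$QUERY"

# The query would then be passed to the plugin like this:
# ./check_prometheus_metric.sh -H 'http://prometheus:9090' -q "$QUERY" -w 10 -c 20 -n "$METRIC"
```

The drawback the reporter mentions remains: negative samples are filtered out, so the check shows 0 instead of the real value.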

Allow extra get / post data to be passed through

We're trying to grab Prometheus data from GitLab, which requires a token to be added to the URL:

https://gitlab.your.company/-/metrics?token=TOKENTOKENTOKEN

When using this script, we've found that specifying the above as the URL doesn't appear to work, possibly because the check_prometheus_metric script appends a path to the URL it is given.

We tried passing the key through as extra data:

./check_prometheus_metric -H 'https://gitlab.your.company/-/metrics' -q 'http_requests_total{method="get"}' -w '0:40000' -c '0:42000' -n 'total http requests' -p -C "-d '{ \"token\": \"TOKENTOKENTOKEN\" }'"

But still got an error - presumably you can't have two data entries passed to curl.
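
The usage text says that curl options and their values must each be passed with their own `-C`. A minimal sketch of how repeated `-C` flags could be collected into a bash array is shown below; `parse_curl_opts` is a hypothetical name and this is not the plugin's own parser.

```bash
#!/usr/bin/env bash
# Sketch only: gathers repeated -C flags into a bash array with getopts.
# parse_curl_opts is a hypothetical name, not the plugin's function.

CURL_OPTS=()

parse_curl_opts() {
    local opt OPTIND=1
    while getopts ':C:' opt; do
        case "$opt" in
            C) CURL_OPTS+=("$OPTARG") ;;   # one array element per -C
            *) ;;                           # other flags are ignored in this sketch
        esac
    done
}

parse_curl_opts -C --connect-timeout -C 10 -C --cacert -C /path/to/ca.crt
printf '<%s>\n' "${CURL_OPTS[@]}"
```

If the array is later expanded as `"${CURL_OPTS[@]}"`, each element reaches curl as exactly one argument. Under that assumption, a single `-C "-d '{ ... }'"` would arrive as one word containing both the flag and its value, which curl would not treat as `-d` followed by data.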

Feature request/discussion: Show additional labels/results

Scenario

Sometimes we want to check a list of metrics to make sure that they are all below/above a certain threshold. For this case the plugin works really well. However, when one of the metrics crosses the threshold, it takes a bit of work to find out which one.

Example: Monitor disk space for a group of pods/nodes. When the threshold is breached we would like to know quickly which pod/node is over the threshold. Currently, we have to switch to the Prometheus UI or Grafana to find out, which is additional work.

Proposal

Simplest solution: Print out the raw/prettified result from Prometheus in the status message. That would be a huge improvement already.

Extended solution: Provide a secondary prometheus query to reduce the amount of information displayed. For instance this could extract only certain labels from the resulting metric (i.e. disk free and node name).

Alternative extended solution: Filter the result from the original query for certain labels.

Add a Docker container

One interesting use of this script would be to use it as a Kubernetes liveness check exec function.

This would allow Kubernetes to pull Prometheus metrics and act as a metrics-based pod health check.
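
A Kubernetes exec probe treats exit code 0 as success and anything else as failure, so the plugin's status would need to be mapped onto that. The wrapper below is a sketch under the assumption that the plugin follows the usual Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); `liveness_from_nagios` is a hypothetical helper, and treating WARNING as healthy is a policy choice rather than something the project prescribes.

```bash
#!/usr/bin/env bash
# Sketch of a liveness wrapper, assuming the usual Nagios exit codes
# (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

liveness_from_nagios() {
    "$@" > /dev/null 2>&1
    case $? in
        0|1) return 0 ;;   # OK or WARNING: keep the pod alive
        *)   return 1 ;;   # CRITICAL, UNKNOWN or anything else: fail the probe
    esac
}

# Stand-in commands so the sketch runs without a Prometheus server.
liveness_from_nagios sh -c 'exit 1' && echo "WARNING -> alive"
liveness_from_nagios sh -c 'exit 2' || echo "CRITICAL -> restart"
```

In a real probe, the stand-in commands would be replaced by the full `check_prometheus_metric.sh` invocation.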

Shouldn't CURL_OPTS also be applied to test run?

I have a case where I need to pass --key and --cert to curl. Without the following change, I'm getting a "UNKNOWN - no response from prometheus" response.

diff --git a/src/check_prometheus_metric.sh b/src/check_prometheus_metric.sh
index a35deae..ea38bcd 100644
--- a/src/check_prometheus_metric.sh
+++ b/src/check_prometheus_metric.sh
@@ -183,7 +183,7 @@ function process_command_line {
 
 function check_prometheus_server {
     # Check that the server URL responds
-    curl --silent --head ${PROMETHEUS_SERVER}/api/v1/query > /dev/null
+    curl -s "${CURL_OPTS[@]}" --head ${PROMETHEUS_SERVER}/api/v1/query > /dev/null
     SERVER_RESPONDED=$?
     if [ "${SERVER_RESPONDED}" -ne 0 ]; then
         NAGIOS_STATUS=UNKNOWN

UNKNOWN - unable to parse prometheus response | query_result=null

I'm trying to run some checks against the Prometheus "node_exporter".
The query works fine when run from Grafana, but it does not work when I run it through the script.
For example:

From Grafana:
rate(node_network_receive_bytes_total{device="eth0",instance="damian-sr",job="prometheus"}[1m])

gives me results.
This command, run from the Icinga server:
./check_prometheus_metric.sh '-H' '192.168.201.200:9090' '-w' '1:1024' '-c' '1:1024' '-n' 'network_receive_bytes' '-q' 'rate(node_network_receive_bytes_total{device="eth0",instance="damian-sr:9100"}[1m])'
gives me this result:
network_receive_bytes is null
UNKNOWN - unable to parse prometheus response

Where am I going wrong?
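
A selector that matches no series makes Prometheus return an empty result vector, and an empty vector is one way a check can end up with "null". The sketch below illustrates this with canned responses in the shape of the `/api/v1/query` output. The jq filter, the helper name `extract_value` and the sample values are assumptions for illustration and are not taken from the plugin.

```bash
#!/usr/bin/env bash
# Sketch: why an empty result can surface as "null". The jq filter is an
# assumption for illustration, not copied from the plugin.

extract_value() {
    jq -r '.data.result[0].value[1]'
}

# Canned responses shaped like Prometheus' /api/v1/query output.
match='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"instance":"damian-sr:9100"},"value":[1700000000,"512.5"]}]}}'
empty='{"status":"success","data":{"resultType":"vector","result":[]}}'

echo "$match" | extract_value   # prints 512.5
echo "$empty" | extract_value   # prints null
```

To see what Prometheus really returns for the failing query, the raw response can be fetched with `curl -sG --data-urlencode 'query=<your query>' http://192.168.201.200:9090/api/v1/query` and compared with the label values that Grafana shows.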

preamble.bash / parse.bash / usage.bash : No such file or directory

I have a problem running the command on Ubuntu 18.04.
I'm using it with Icinga, but the error also occurs when I run it directly at OS level.
I have copied these files:
preamble.bash
parse.bash
usage.bash
into the same folder as "check_prometheus_metric.sh".
The folder looks like:

/usr/lib/nagios/plugins/check_prometheus_metric/check_prometheus_metric.sh: line 29: preamble.bash: No such file or directory
/usr/lib/nagios/plugins/check_prometheus_metric/check_prometheus_metric.sh: line 30: parse.bash: No such file or directory
/usr/lib/nagios/plugins/check_prometheus_metric/check_prometheus_metric.sh: line 31: usage.bash: No such file or directory
/usr/lib/nagios/plugins/check_prometheus_metric/check_prometheus_metric.sh: line 68: is_float_or_interval: command not found
/usr/lib/nagios/plugins/check_prometheus_metric/check_prometheus_metric.sh: line 70: usage: command not found
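
One possible cause, assuming the script loads its helpers with a bare `source preamble.bash`: bash looks such a name up in `$PATH` and then in the current working directory, not in the directory that holds the script. The sketch below shows a common way to resolve helper files relative to the script itself. It is an illustration, not the project's code.

```bash
#!/usr/bin/env bash
# Sketch of one possible fix, assuming helpers are sourced by bare name.

# Directory containing this script, regardless of the caller's cwd.
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

for helper in preamble.bash parse.bash usage.bash; do
    if [ -f "${SCRIPT_DIR}/${helper}" ]; then
        # shellcheck source=/dev/null
        source "${SCRIPT_DIR}/${helper}"
    else
        echo "missing helper: ${SCRIPT_DIR}/${helper}" >&2
    fi
done
```

Without changing the script, running it after `cd`-ing into the folder that contains the three helper files would test the same hypothesis.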

Tricky queries UNKNOWN - unable to parse prometheus response

Hi, I have an issue when trying to run some tricky queries. I am trying to use Icinga to check metrics on my services, all of which are containers. With a simple query such as "sum(up)" I get the data and can create critical and warning notifications, but the problem appears when I configure a more complex query that works with labels and GROUP BYs. I think there may be some character I need to escape, but I'm not sure how (I've tried a lot of options).

If I execute the query from the command line I get a response, but I do not when I configure the command as a service.

The query:

scalar((sum(rate(container_cpu_usage_seconds_total{name=~".*grafana.*-grafana-.*"}[1m])) BY (instance, name, env) * 100))

Output on icinga web:

UNKNOWN - unable to parse prometheus response
or the check doesn't appear at all.

Output on CLI:

/usr/lib/nagios/plugins/check_prometheus_metric.sh -H http://myprometheusinstance:port -c 5: -w 10: -n check_back_cpu -q 'scalar((sum(rate(container_cpu_usage_seconds_total{name=~".*grafana.*-grafana-.*"}[1m])) BY (instance, name, env) * 100))'

CRITICAL - check_grafana_cpu is 0.2051226219923568

object Service "check_grafana_cpu" {
    import "generic-service"
    host_name = "icinga2-magnet-pro"
    check_command = "check_prometheus_metric"
    command_endpoint = "icinga2-pro"
    vars.check_prometheus_metric_url = "myprometheusinstance:port"
    vars.check_prometheus_metric_query = "scalar((sum(rate(container_cpu_usage_seconds_total{name=~\".*grafana.*-grafana-.*\"}[1m]))BY(instance, name, env)*100))"
    vars.check_prometheus_metric_name = "grafana cpu"
    vars.check_prometheus_metric_warning = "60"
    vars.check_prometheus_metric_critical = "80"
}

Any help with this? Thanks a lot!
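
One way to narrow down quoting problems like this is to look at the arguments exactly as the plugin would receive them. The helper below is a hypothetical debugging stub, not part of the project: temporarily pointing the CheckCommand at a script containing it would show whether the query arrives as a single, unmodified argument.

```bash
#!/usr/bin/env bash
# Hypothetical debugging stub: print each received argument on its own line.

show_args() {
    local i=1 arg
    for arg in "$@"; do
        printf 'arg %d: [%s]\n' "$i" "$arg"
        i=$((i + 1))
    done
}

show_args -H 'myprometheusinstance:port' -q 'scalar(sum(up))'
```

Comparing this output between the Icinga run and the working CLI run would also reveal other differences, such as the URL scheme or the threshold values.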