check_prometheus_metric

Nagios plugin for alerting on Prometheus query results.

Usage

  check_prometheus_metric.sh - Nagios plugin for checking Prometheus metrics.
  
  Usage:
    check_prometheus_metric.sh -H HOST -q QUERY -w FLOAT[:FLOAT] -c FLOAT[:FLOAT]
                               -n NAME [-m METHOD] [-O] [-i] [-p]

  Options:
    -H HOST          URL of Prometheus host to query.
    -q QUERY         Prometheus query, in single quotes, that returns a float.
    -w FLOAT[:FLOAT] Warning level value (must be a float or nagios-interval).
    -c FLOAT[:FLOAT] Critical level value (must be a float or nagios-interval).
    -n NAME          A name for the metric being checked.
    -m METHOD        Comparison method, one of gt, ge, lt, le, eq, ne.
                     (Defaults to ge unless otherwise specified).
    -C CURL_OPTS     Additional flags to curl. Can be passed multiple times. 
                     Options and option values must be passed separately.
                     e.g. -C --connect-timeout -C 10 -C --cacert -C /path/to/ca.crt
    -O               Accept NaN as an "OK" result.
    -E               Accept an empty vector (null) as an "OK" result.
    -i               Print the extra metric information into the Nagios message.
    -p               Add perfdata to check output.

  Examples:
    check_prometheus_metric -q 'up{job="job_name"}' -w :1 -c :1
    # Check that job is up. If not, critical.
    
    check_prometheus_metric -q 'node_load1' -w :0.05 -c :0.1
    # Check load is below 0.05 (warning) and 0.1 (critical).

    check_prometheus_metric -q 'go_threads' -w 15:25 -c :
    # Check thread count is between 15-25, warning if outside this interval.

  Dependencies:
    Requires bash, curl, cut, echo, grep, jq and sed to be in $PATH.

Note: nagios-interval refers to this syntax
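
The `-m` option in the usage text selects one of six comparison methods. Bash has no built-in floating-point arithmetic, so the sketch below shows one way such a comparison could be done. It is not the plugin's own implementation: the function name `compare_floats` is hypothetical, and it uses `awk`, which is not in the plugin's dependency list.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare two floats with one of the -m method names.
# Exit status 0 means the comparison holds, 1 means it does not,
# and 2 means the method name was not recognised.
compare_floats() {
    local value=$1 method=$2 threshold=$3 op
    case "$method" in
        gt) op='>' ;;
        ge) op='>=' ;;
        lt) op='<' ;;
        le) op='<=' ;;
        eq) op='==' ;;
        ne) op='!=' ;;
        *) echo "unknown method: $method" >&2; return 2 ;;
    esac
    # Adding 0 forces awk to treat both operands as numbers, so negative
    # values and values with decimals are compared numerically.
    awk -v a="$value" -v b="$threshold" "BEGIN { exit !((a + 0) $op (b + 0)) }"
}

if compare_floats -3.5 ge 10; then
    echo "threshold reached"
else
    echo "below threshold"
fi
```

With the default method `ge`, a negative result compared against a positive threshold takes the second branch.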

Icinga configuration

To use the script with Icinga 2, four steps are required:

  1. The CheckCommand definition must be added to /etc/icinga2/conf.d/check_prometheus_metric.conf:
object CheckCommand "check_prometheus_metric" {
  import "plugin-check-command"
  command = [ "/usr/lib/nagios/plugins/check_prometheus_metric" ]

  arguments = {
        "-H" = {
                value = "$check_prometheus_metric_url$"
                description = "URL of Prometheus host to query."
        }
        "-q" = {
                value = "$check_prometheus_metric_query$"
                description = "Prometheus query, that returns a float."
        }
        "-w" = {
                value = "$check_prometheus_metric_warning$"
                description = "Warning level value (float or nagios-interval)."
        }
        "-c" = {
                value = "$check_prometheus_metric_critical$"
                description = "Critical level value (float or nagios-interval)."
        }
        "-n" = {
                value = "$check_prometheus_metric_name$"
                description = "A name for the metric being checked."
        }
    }
}
  2. The check script must be added to /usr/lib/nagios/plugins/check_prometheus_metric:
GIT_REPO=https://github.com/magenta-aps/check_prometheus_metric
# Look up the tag name of the latest release.
VERSION=$(curl -sL -H "Accept: application/json" "${GIT_REPO}/releases/latest" | jq -r .tag_name)
# Download that release's script and install it as an executable plugin.
curl -sL -O "${GIT_REPO}/releases/download/${VERSION}/check_prometheus_metric.sh"
chmod +x check_prometheus_metric.sh
sudo mv check_prometheus_metric.sh /usr/lib/nagios/plugins/check_prometheus_metric
  3. Install the required software to run the script:
sudo apt-get update
sudo apt-get install -y bash coreutils curl grep jq sed
  4. Add Service definitions for whatever is to be monitored:

/etc/icinga2/conf.d/check_pi_service.conf:

apply Service "pi" {
  import "generic-service"

  check_command = "check_prometheus_metric"

  vars.check_prometheus_metric_url = "prometheus:9090"
  vars.check_prometheus_metric_query = "pi"
  vars.check_prometheus_metric_warning = "3:4"
  vars.check_prometheus_metric_critical = "1:6"
  vars.check_prometheus_metric_name = "pi"
  
  command_endpoint = host.vars.client_endpoint
  assign where host.name == NodeName
}
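
After the installation steps above, it can be useful to confirm that every tool named in the usage text's dependency list is reachable on `$PATH`. The following is a minimal sketch; the function name `missing_dependencies` is made up for this example and is not part of the plugin.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each named tool that cannot be found on $PATH.
missing_dependencies() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tool list taken from the "Dependencies" line of the usage text.
missing=$(missing_dependencies bash curl cut echo grep jq sed)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All dependencies found."
fi
```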

Nagios configuration

To use the plugin with Nagios, add the following command definition to your Nagios configuration:

define command {
    command_name check_prometheus
    command_line $USER1$/check_prometheus_metric.sh -H '$ARG1$' -q '$ARG2$' -w '$ARG3$' -c '$ARG4$' -n '$ARG5$' -m '$ARG6$'
}

check_prometheus_metric's Issues

Comparison with negative values in result

I have a metric with positive warning and critical thresholds. When the query returns a negative value, the check reports CRITICAL, even though a negative value is smaller than any positive threshold and should therefore be an OK result.

As a workaround I added the condition "> 0" to the PromQL query so that only positive results are returned, but that yields a NULL vector, so I also had to add "on() vector(0)" to return the zero vector. I would still like to display the negative value, because it carries some information even when it is not an alarm.

Note: Waiting for the next release with the -E option, so that "on() vector(0)" can be removed.
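
The workaround described in this issue can be written as a small query wrapper. This is only a sketch: the function name `positive_or_zero` is hypothetical, and it assumes the fallback is attached with PromQL's `or` operator, which the issue text does not spell out.

```bash
#!/usr/bin/env bash
# Hypothetical helper: wrap a PromQL query so that only positive samples are
# kept, and a constant 0 is returned when nothing is left.
positive_or_zero() {
    printf '(%s) > 0 or on() vector(0)\n' "$1"
}

positive_or_zero 'my_metric'
# prints: (my_metric) > 0 or on() vector(0)
```

The drawback named in the issue still applies: the negative value itself is filtered out and never reaches the check output.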

Allow extra get / post data to be passed through

We're trying to grab Prometheus data from GitLab; this requires a token to be added to the URL:

https://gitlab.your.company/-/metrics?token=TOKENTOKENTOKEN

When using this script, we found that specifying the above as the URL does not work, possibly because the check_prometheus_metric script appends a path to the URL it is given.

We tried passing the key through as extra data:

./check_prometheus_metric -H 'https://gitlab.your.company/-/metrics' -q 'http_requests_total{method="get"}' -w '0:40000' -c '0:42000' -n 'total http requests' -p -C "-d '{ \"token\": \"TOKENTOKENTOKEN\" }'"

But we still got an error; presumably you can't pass two data entries to curl.
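
The suspicion about the appended path matches the `curl` call quoted in the CURL_OPTS issue on this page, which requests `${PROMETHEUS_SERVER}/api/v1/query`. The sketch below reproduces only that string concatenation, to show what happens to a URL that already carries a query string. The function name `build_query_url` is hypothetical, and the real script may build its request differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: mimic appending the query API path to the -H value.
build_query_url() {
    printf '%s/api/v1/query\n' "$1"
}

build_query_url 'prometheus:9090'
# prints: prometheus:9090/api/v1/query

build_query_url 'https://gitlab.your.company/-/metrics?token=TOKENTOKENTOKEN'
# prints: https://gitlab.your.company/-/metrics?token=TOKENTOKENTOKEN/api/v1/query
```

In the second case the appended path ends up inside the token value, so the server never sees a request for `/api/v1/query`.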

Feature request/discussion: Show additional labels/results

Scenario

Sometimes we want to check a list of metrics to make sure that they are all below or above a certain threshold. For this case the plugin works really well. However, when one of the metrics crosses the threshold, it takes a bit of work to find out which one.

Example: Monitor disk space for a group of pods/nodes. When the threshold is breached we would like to know quickly which pod/node is over the threshold. Currently, we have to switch to the Prometheus UI or Grafana to find out, which is additional work.

Proposal

Simplest solution: Print the raw or prettified result from Prometheus in the status message. That would already be a huge improvement.

Extended solution: Provide a secondary Prometheus query to reduce the amount of information displayed. For instance, this could extract only certain labels from the resulting metric (e.g. disk free and node name).

Alternative extended solution: Filter the result from the original query for certain labels.
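
As an illustration of the "simplest solution", the sketch below turns an instant-query response from the Prometheus HTTP API into one line per series, with the value followed by its labels. It is not part of the plugin. The function name `list_series` and the sample data are made up, and the filter assumes a `vector` result in which every series has at least one label.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read a Prometheus instant-query response on stdin and
# print "VALUE<TAB>label=value,label=value" for every series in the result.
list_series() {
    jq -r '.data.result[]
           | [.value[1], (.metric | to_entries | map(.key + "=" + .value) | join(","))]
           | @tsv'
}

# Made-up sample response with two series.
SAMPLE_RESPONSE='{"status":"success","data":{"resultType":"vector","result":[
  {"metric":{"instance":"node-a","mountpoint":"/"},"value":[1700000000,"91.5"]},
  {"metric":{"instance":"node-b","mountpoint":"/"},"value":[1700000000,"42"]}]}}'
```

Running `printf '%s' "$SAMPLE_RESPONSE" | list_series` prints one tab-separated line per series, which would make it clear that `node-a` is the one over a threshold of, say, 90.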

Shouldn't CURL_OPTS also be applied to test run?

I have a case where I need to pass --key and --cert to curl. Without the following change, I get an "UNKNOWN - no response from prometheus" response.

diff --git a/src/check_prometheus_metric.sh b/src/check_prometheus_metric.sh
index a35deae..ea38bcd 100644
--- a/src/check_prometheus_metric.sh
+++ b/src/check_prometheus_metric.sh
@@ -183,7 +183,7 @@ function process_command_line {
 
 function check_prometheus_server {
     # Check that the server URL responds
-    curl --silent --head ${PROMETHEUS_SERVER}/api/v1/query > /dev/null
+    curl -s "${CURL_OPTS[@]}" --head ${PROMETHEUS_SERVER}/api/v1/query > /dev/null
     SERVER_RESPONDED=$?
     if [ "${SERVER_RESPONDED}" -ne 0 ]; then
         NAGIOS_STATUS=UNKNOWN
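
The patch above expands `"${CURL_OPTS[@]}"`, which implies that the `-C` values are kept in a bash array. The plugin's own option parsing is not shown here, so the following is only a sketch of how repeated `-C` flags could be collected with `getopts`; the function name `parse_curl_opts` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: collect every -C argument into the CURL_OPTS array.
CURL_OPTS=()
parse_curl_opts() {
    local opt OPTIND=1
    CURL_OPTS=()
    while getopts ':C:' opt; do
        case "$opt" in
            C) CURL_OPTS+=("$OPTARG") ;;   # one array element per -C
            *) ;;                          # ignore anything else in this sketch
        esac
    done
}

parse_curl_opts -C --connect-timeout -C 10 -C --cacert -C /path/to/ca.crt
printf '%s\n' "${CURL_OPTS[@]}"
# prints four lines: --connect-timeout, 10, --cacert, /path/to/ca.crt
```

Because each `-C` contributes exactly one array element, quoting the expansion as `"${CURL_OPTS[@]}"` hands curl each flag and each value as a separate argument. That is why the usage text asks for options and option values to be passed separately. A single `-C` holding both a flag and its value would reach curl as one argument.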

UNKNOWN - unable to parse prometheus response | query_result=null

I'm trying to run some checks against the Prometheus "node_exporter".
A query that works fine when run from Grafana does not work when I run it through the script.
For example:

From Grafana:
rate(node_network_receive_bytes_total{device=eth0,instance="damian-sr,job="prometheus"}[1m]

returns results.
This command, run from the Icinga server:
./check_prometheus_metric.sh '-H' '192.168.201.200:9090' '-w' '1:1024' '-c' '1:1024' '-n' 'network_receive_bytes' '-q' 'rate(node_network_receive_bytes_total{device="eth0",instance="damian-sr:9100"}[1m])'
gives this result:
network_receive_bytes is null
UNKNOWN - unable to parse prometheus response

What am I doing wrong?
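
The usage text describes an empty vector as "null", so one plausible reading of `query_result=null` is that the query matched no series, for example because a label value differs from what Prometheus has stored. The sketch below tells an empty result apart from a real value in the raw API response. It assumes a `vector` result type; the function name `first_value` and the sample data are made up, and this is not the plugin's own parsing code.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read a Prometheus instant-query response on stdin and
# print the first sample value, or the word EMPTY when no series matched.
first_value() {
    jq -r '.data.result | if length == 0 then "EMPTY" else .[0].value[1] end'
}

# Made-up sample responses: one with no matching series, one with a value.
EMPTY_RESPONSE='{"status":"success","data":{"resultType":"vector","result":[]}}'
VALUE_RESPONSE='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"device":"eth0"},"value":[1700000000,"512.25"]}]}}'
```

Piping the response for the failing query through a filter like this shows whether Prometheus returned no series at all, in which case the label matchers are the first thing to compare against the Grafana query.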

preamble.bash / parse.bash / usage.bash : No such file or directory

I have a problem running the command on Ubuntu 18.04.
I'm using it with Icinga, but the error also occurs when I run it directly at the OS level.
I have copied these files:
preamble.bash
parse.bash
usage.bash
into the same folder as "check_prometheus_metric.sh".
The folder looks like:
[image]

/usr/lib/nagios/plugins/check_prometheus_metric/check_prometheus_metric.sh: line 29: preamble.bash: No such file or directory
/usr/lib/nagios/plugins/check_prometheus_metric/check_prometheus_metric.sh: line 30: parse.bash: No such file or directory
/usr/lib/nagios/plugins/check_prometheus_metric/check_prometheus_metric.sh: line 31: usage.bash: No such file or directory
/usr/lib/nagios/plugins/check_prometheus_metric/check_prometheus_metric.sh: line 68: is_float_or_interval: command not found
/usr/lib/nagios/plugins/check_prometheus_metric/check_prometheus_metric.sh: line 70: usage: command not found
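
These errors are consistent with the helper files being sourced by bare file name. In that case bash searches `$PATH` and then the current working directory, not the directory that holds the script, so the files are not found when the check is started from elsewhere. The sketch below shows a common way to source helpers relative to the script's own location. It is an assumption about the cause, not the project's actual fix, and `source_helpers` is a hypothetical name. The installation steps in the README download a single `check_prometheus_metric.sh` from the release page, which may be a simpler route.

```bash
#!/usr/bin/env bash
# Directory that contains this script, independent of the caller's
# current working directory.
SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" && pwd)

# Hypothetical helper: source the named files from the given directory,
# stopping at the first one that cannot be read.
source_helpers() {
    local dir=$1 helper
    shift
    for helper in "$@"; do
        if [ ! -r "$dir/$helper" ]; then
            echo "missing helper: $dir/$helper" >&2
            return 1
        fi
        # shellcheck source=/dev/null
        source "$dir/$helper"
    done
}

source_helpers "$SCRIPT_DIR" preamble.bash parse.bash usage.bash \
    || echo "helpers not loaded" >&2
```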

Tricky queries UNKNOWN - unable to parse prometheus response

Hi, I have an issue when trying to run some tricky queries. I am trying to use Icinga to check metrics on my services, all of which are containers. If I configure a simple query such as "sum(up)", I get the data and can create critical and warning notifications, but the problem appears when I configure a more complex query that works with labels and GROUP BYs. I think there may be some character I need to escape, but I'm not sure how (I've tried a lot of options).

If I execute the query from the command line I get a response, but not when I configure the command as a service.

The query:

scalar((sum(rate(container_cpu_usage_seconds_total{name=~".*grafana.*-grafana-.*"}[1m])) BY (instance, name, env) * 100))

Output in Icinga Web:

UNKNOWN - unable to parse prometheus response
or the check doesn't appear at all.

Output on CLI:

/usr/lib/nagios/plugins/check_prometheus_metric.sh -H http://myprometheusinstance:port -c 5: -w 10: -n check_back_cpu -q 'scalar((sum(rate(container_cpu_usage_seconds_total{name=~".*grafana.*-grafana-.*"}[1m])) BY (instance, name, env) * 100))'

CRITICAL - check_grafana_cpu is 0.2051226219923568
object Service "check_grafana_cpu" {
    import "generic-service"
    host_name = "icinga2-magnet-pro"
    check_command = "check_prometheus_metric"
    command_endpoint = "icinga2-pro"
    vars.check_prometheus_metric_url = "myprometheusinstance:port"
    vars.check_prometheus_metric_query = "scalar((sum(rate(container_cpu_usage_seconds_total{name=~\".*grafana.*-grafana-.*\"}[1m]))BY(instance, name, env)*100))"
    vars.check_prometheus_metric_name = "grafana cpu"
    vars.check_prometheus_metric_warning = "60"
    vars.check_prometheus_metric_critical = "80"
}

Any help with this? Thanks a lot!
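
When a query works on the command line but not as a service, one way to narrow it down is to compare the exact arguments the plugin receives in both cases. The following is a minimal debugging sketch; the function name `show_args` is hypothetical and nothing like it is known to ship with the plugin.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each argument on its own line, shell-quoted,
# so that stray backslashes, quotes or spaces become visible.
show_args() {
    local i=1 arg
    for arg in "$@"; do
        printf 'arg[%d]=%q\n' "$i" "$arg"
        i=$((i + 1))
    done
}

show_args -n 'grafana cpu' -q 'sum(up)'
```

Placing a function like this in a temporary wrapper script that the check command calls, and writing its output to a log file, shows what the monitoring system actually passes, including the thresholds and the URL, which also differ between the two invocations in this report.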
