check_es_system's Introduction

check_es_system

This is a monitoring plugin to check the status of an Elasticsearch cluster node. Besides the classical status check (green, yellow, red), the plugin can also monitor the disk or memory usage of Elasticsearch. This is especially helpful when running Elasticsearch in the cloud (e.g. Elasticsearch as a service): because ES does not run on your own server, you cannot monitor its disk or memory usage directly. This is where this plugin comes in. Tell the plugin how many resources (disk space, memory capacity) you have available (-d) and it will alert you when usage reaches a threshold. The plugin also offers additional (advanced) checks of an Elasticsearch node/cluster (Java Threads, Thread Pool Statistics, Master Verification, Read-Only Indexes, ...).

Please refer to https://www.claudiokuenzler.com/monitoring-plugins/check_es_system.php for full documentation and usage examples.

Requirements

  • The following commands must be available: curl, expr
  • One of the following JSON parsers must be available: jshon or jq (defaults to jq)

Usage

./check_es_system.sh -H ESNode [-P port] [-S] [-u user] [-p pass] [-d available] -t check [-o unit] [-i indexes] [-w warn] [-c crit] [-m max_time] [-e node] [-X jq|jshon]

check_es_system's People

Contributors

deric, fbomj, napsty, tatref, thenetworkisdown

check_es_system's Issues

Plugin doesn't like alternative name in SSL certificate

$ ./check_es_system.sh -H elasticsearch.example.com -P 9243 -S  -u user -p pass -d 128 -t disk
json read error: line 2 column 0: '[' or '{' expected near end of file
expr: syntax error
expr: syntax error
./check_es_system.sh: line 159: [: -ge: unary operator expected
./check_es_system.sh: line 162: [: -ge: unary operator expected
ES SYSTEM OK - Disk usage is at % ( G from 128 G)|es_disk=B;109951162777;130567005798;;

The underlying reason is a missing "-k" parameter.
When the plugin detects that authentication is required, it fires a second curl request, but unlike the first request this one does not have the -k parameter. curl then reports the following error and refuses to continue:

$ curl -s --basic -u user:password https://elasticsearch.example.com:9243/_cluster/stats -v
* Hostname was NOT found in DNS cache
*   Trying 34.246.11.54...
* Connected to elasticsearch.example.com (34.246.11.54) port 9243 (#0)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server key exchange (12):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSL connection using ECDHE-RSA-AES128-GCM-SHA256
* Server certificate:
* 	 subject: CN=*.eu-west-1.aws.found.io
* 	 start date: 2018-06-04 00:00:00 GMT
* 	 expire date: 2019-07-04 12:00:00 GMT
* 	 subjectAltName does not match elasticsearch.example.com
* SSL: no alternative certificate subject name matches target host name 'elasticsearch.example.com'
* Closing connection 0
* SSLv3, TLS alert, Client hello (1):
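
One way to keep the flags of both requests in sync is to send every request through a single wrapper. The following is only a sketch: the function name es_curl and the CURL_BIN variable are made up for illustration and are not part of the plugin. Note that -k disables certificate verification, which is what the first request already does.

```shell
# Hypothetical wrapper (not the plugin's actual code): all requests share
# the same TLS and timeout flags, with or without credentials.
es_curl() {
  # $1 = URL, $2 = user (optional), $3 = password (optional)
  if [ -n "${2:-}" ] && [ -n "${3:-}" ]; then
    "${CURL_BIN:-curl}" -k -s --max-time 30 --basic -u "$2:$3" "$1"
  else
    "${CURL_BIN:-curl}" -k -s --max-time 30 "$1"
  fi
}
```

The first probe would then be `es_curl "$esurl"` and the authenticated retry `es_curl "$esurl" "$user" "$pass"`, so neither call can lose the -k parameter.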

Set several hostnames for cluster checks (like check_icmp can)

Hi Claudio,
I am using this awesome plugin to check several Elasticsearch clusters in Icinga 2. I have both local and "cluster" services: the local checks are configured in the host objects themselves, and for the cluster checks I define a "cluster" Host object. Because check_icmp can check several hosts and alert when hosts_alive < x, it fits nicely as a general host check_command. It would be nice if I could also set several hostnames in es_system to get the cluster state from any of my configured nodes, without needing an HAProxy etc. What do you think about this?

Have a nice weekend.
Cheers,
Marcus
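
A rough sketch of how such a fallback could look, assuming a comma-separated host list (the plugin does not accept one today). The function first_reachable_host and the probe command passed to it are hypothetical names chosen for this example.

```shell
# Hypothetical helper: print the first host from a comma-separated list
# for which the given probe command succeeds; fail if none does.
first_reachable_host() {
  hosts="$1"; shift              # remaining arguments form the probe command
  old_ifs="$IFS"; IFS=','
  for h in $hosts; do            # the list is split on commas here
    IFS="$old_ifs"
    if "$@" "$h" >/dev/null 2>&1; then
      printf '%s\n' "$h"
      return 0
    fi
  done
  IFS="$old_ifs"
  return 1
}
```

The probe could be a small function that runs curl against /_cluster/health on the given host; the plugin would then continue with whichever node answered first.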

Catch "Connection refused" error

When ES is not running on the given port, the plugin currently simply outputs this:

# /usr/lib/nagios/plugins/check_es_system.sh -H "myesnode" -t status
ES SYSTEM UNKNOWN - Authentication required but missing username and password

But using curl itself clearly shows that the connection to port 9200 cannot be established:

# curl http://myesnode:9200/_cluster/stats
curl: (7) Failed to connect to myesnode port 9200: Connection refused

The plugin cannot see this because it runs curl in silent mode (-s); however, curl's exit status is still available:

# curl -k -s --max-time 30 http://myesnode:9200/_cluster/stats
# echo $?
7
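
A minimal sketch of mapping curl's exit status to a plugin state. Exit status 7 (failed to connect) and 28 (operation timed out) are documented curl exit codes; the function name, the message wording and the chosen severities are assumptions of this example.

```shell
# Hypothetical helper: translate a curl exit status into a Nagios-style
# message and return code (0 OK, 2 CRITICAL, 3 UNKNOWN).
curl_rc_to_state() {
  # $1 = curl exit status, $2 = host, $3 = port
  case "$1" in
    0)  return 0 ;;
    7)  echo "ES SYSTEM CRITICAL - Failed to connect to $2 port $3"; return 2 ;;
    28) echo "ES SYSTEM CRITICAL - No response from $2 port $3 within the time limit"; return 2 ;;
    *)  echo "ES SYSTEM UNKNOWN - curl exited with status $1"; return 3 ;;
  esac
}
```

It would be called right after the curl command, for example `curl_rc_to_state "$?" "$host" "$port" || exit $?`.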

Plugin returns OK when Elasticsearch cluster is "not there"

The plugin simply returns OK with empty output when the cluster is not there, for example when a "wrong" URL is specified using the -H parameter:

$ ./check_es_system.sh -H wrong-cluster-url.example.com -P 9243 -S -u user -p pass -t status
$ echo $?
0

Wrong URL in this case means that something responds but no stats can be read from it. In this particular case an outdated (archived?) Elasticsearch URL from Elastic.co was used.

Debugging (manually running the curl command that the plugin executes in the background) shows:

$ curl -k -s --max-time 30 --basic -u user:pass https://wrong-cluster-url.example.com:9243/_cluster/stats
{"ok":false,"message":"Unknown resource."}
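
One possible guard is to verify that the response looks like cluster stats before any value is extracted. The sketch below uses a plain substring test for the cluster_name key, which appears in the stats output; a check with jq or jshon would be more robust. Both function names are hypothetical.

```shell
# Hypothetical sanity check: a stats response is expected to contain the
# "cluster_name" key; anything else is reported as UNKNOWN (exit code 3).
looks_like_cluster_stats() {
  case "$1" in
    *'"cluster_name"'*) return 0 ;;
    *) return 1 ;;
  esac
}

check_stats_body() {
  # $1 = response body as returned by curl
  if looks_like_cluster_stats "$1"; then
    return 0
  fi
  echo "ES SYSTEM UNKNOWN - Unexpected response: $1"
  return 3
}
```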

Allow more parsers

We are using SLES, which provides jq, for example, but not jshon.
I will provide a patch that allows the parser to be specified.

503 errors on correctly terminated requests

The latest commit periodically produces unexpected 503 errors on requests that completed correctly.

bash -xv check_es_system_new.sh -S -H host -t status -u user -p password
(skipped to output)
+ case $checktype in
+ getstatus
+ esurl=https://host:9200/_cluster/stats
+ eshealthurl=https://host:9200/_cluster/health
+ [[ -z user ]]
+ [[ -n user ]]
+ authlogic
+ [[ -z user ]]
+ [[ -n user ]]
+ [[ -z password ]]
+ [[ -n password ]]
+ [[ -z user ]]
++ curl -k -s --max-time 30 --basic -u user:password https://host:9200/_cluster/stats
+ esstatus='{"_nodes":{"total":8,"successful":8,"failed":0},"cluster_name":"esams","cluster_uuid":"cluster_uuid","timestamp":1586415748436,"status":"green","indices":{"count":469,"shards":{"total":938,"primaries":469,"replication":1.0,"index":{"shards":{"min":2,"max":2,"avg":2.0},"primaries":{"min":1,"max":1,"avg":1.0},"replication":{"min":1.0,"max":1.0,"avg":1.0}}},"docs":{"count":27839887667,"deleted":4108286},"store":{"size_in_bytes":35468749382128},"fielddata":{"memory_size_in_bytes":2684088,"evictions":0},"query_cache":{"memory_size_in_bytes":801738829,"total_count":229132,"hit_count":54569,"miss_count":174563,"cache_size":2070,"cache_count":3881,"evictions":1811},"completion":{"size_in_bytes":0},"segments":{"count":25555,"memory_in_bytes":32587211695,"terms_memory_in_bytes":12727150395,"stored_fields_memory_in_bytes":16864340512,"term_vectors_memory_in_bytes":0,"norms_memory_in_bytes":4480,"points_memory_in_bytes":2904223928,"doc_values_memory_in_bytes":91492380,"index_writer_memory_in_bytes":427070700,"version_map_memory_in_bytes":20169864,"fixed_bit_set_memory_in_bytes":5151176,"max_unsafe_auto_id_timestamp":1586390402189,"file_sizes":{}}},"nodes":{"count":{"total":8,"coordinating_only":0,"data":4,"ingest":1,"master":3,"ml":8,"voting_only":0},"versions":["7.5.0"],"os":{"available_processors":304,"allocated_processors":304,"names":[{"name":"Linux","count":8}],"pretty_names":[{"pretty_name":"CentOS Linux 7 (Core)","count":8}],"mem":{"total_in_bytes":538575011840,"free_in_bytes":38617698304,"used_in_bytes":499957313536,"free_percent":7,"used_percent":93}},"process":{"cpu":{"percent":33},"open_file_descriptors":{"min":836,"max":7527,"avg":3762}},"jvm":{"max_uptime_in_millis":10323568464,"versions":[{"version":"13.0.1","vm_name":"OpenJDK 64-Bit Server 
VM","vm_version":"13.0.1+9","vm_vendor":"AdoptOpenJDK","bundled_jdk":true,"using_bundled_jdk":true,"count":8}],"mem":{"heap_used_in_bytes":88419242368,"heap_max_in_bytes":272994402304},"threads":2249},"fs":{"total_in_bytes":48628980883456,"free_in_bytes":13135674953728,"available_in_bytes":13135674953728},"plugins":[],"network_types":{"transport_types":{"security4":8},"http_types":{"security4":8}},"discovery_types":{"zen":8},"packaging_types":[{"flavor":"default","type":"rpm","count":8}]}}'
+ esstatusrc=0
+ [[ 0 -eq 7 ]]
+ [[ 0 -eq 28 ]]
+ [[ {"_nodes":{"total":8,"successful":8,"failed":0},"cluster_name":"cluster_name","cluster_uuid":"cluster_uuid","timestamp":1586415748436,"status":"green","indices":{"count":469,"shards":{"total":938,"primaries":469,"replication":1.0,"index":{"shards":{"min":2,"max":2,"avg":2.0},"primaries":{"min":1,"max":1,"avg":1.0},"replication":{"min":1.0,"max":1.0,"avg":1.0}}},"docs":{"count":27839887667,"deleted":4108286},"store":{"size_in_bytes":35468749382128},"fielddata":{"memory_size_in_bytes":2684088,"evictions":0},"query_cache":{"memory_size_in_bytes":801738829,"total_count":229132,"hit_count":54569,"miss_count":174563,"cache_size":2070,"cache_count":3881,"evictions":1811},"completion":{"size_in_bytes":0},"segments":{"count":25555,"memory_in_bytes":32587211695,"terms_memory_in_bytes":12727150395,"stored_fields_memory_in_bytes":16864340512,"term_vectors_memory_in_bytes":0,"norms_memory_in_bytes":4480,"points_memory_in_bytes":2904223928,"doc_values_memory_in_bytes":91492380,"index_writer_memory_in_bytes":427070700,"version_map_memory_in_bytes":20169864,"fixed_bit_set_memory_in_bytes":5151176,"max_unsafe_auto_id_timestamp":1586390402189,"file_sizes":{}}},"nodes":{"count":{"total":8,"coordinating_only":0,"data":4,"ingest":1,"master":3,"ml":8,"voting_only":0},"versions":["7.5.0"],"os":{"available_processors":304,"allocated_processors":304,"names":[{"name":"Linux","count":8}],"pretty_names":[{"pretty_name":"CentOS Linux 7 (Core)","count":8}],"mem":{"total_in_bytes":538575011840,"free_in_bytes":38617698304,"used_in_bytes":499957313536,"free_percent":7,"used_percent":93}},"process":{"cpu":{"percent":33},"open_file_descriptors":{"min":836,"max":7527,"avg":3762}},"jvm":{"max_uptime_in_millis":10323568464,"versions":[{"version":"13.0.1","vm_name":"OpenJDK 64-Bit Server 
VM","vm_version":"13.0.1+9","vm_vendor":"AdoptOpenJDK","bundled_jdk":true,"using_bundled_jdk":true,"count":8}],"mem":{"heap_used_in_bytes":88419242368,"heap_max_in_bytes":272994402304},"threads":2249},"fs":{"total_in_bytes":48628980883456,"free_in_bytes":13135674953728,"available_in_bytes":13135674953728},"plugins":[],"network_types":{"transport_types":{"security4":8},"http_types":{"security4":8}},"discovery_types":{"zen":8},"packaging_types":[{"flavor":"default","type":"rpm","count":8}]}} =~ 503 ]]
+ echo 'ES SYSTEM CRITICAL - Elasticsearch not available: host:9200 return error 503'
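
The trace does not say why the match succeeded, but the last test compares the whole response body against the pattern 503. Such a test is true whenever those three digits appear anywhere in the body, for instance inside a byte counter. The sketch below contrasts it with a comparison against the HTTP status code alone; the function names and the sample value 15036 are made up for illustration.

```shell
# Substring test on the body: matches "503" anywhere, including inside numbers.
body_mentions_503() {
  case "$1" in
    *503*) return 0 ;;
    *) return 1 ;;
  esac
}

# Comparison against the HTTP status code only (obtained separately from curl).
http_code_is_503() {
  [ "$1" = "503" ]
}

# A made-up healthy response whose byte counter happens to contain "503".
sample='{"status":"green","store":{"size_in_bytes":15036}}'
if body_mentions_503 "$sample" && ! http_code_is_503 200; then
  echo "body test reports a 503 although the HTTP status is 200"
fi
```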

line 155: [: =: unary operator expected when check type not defined

When the plugin is only executed with the -H parameter, the following error shows up:

# /usr/lib/nagios/plugins/check_es_system.sh -H "ElasticNodeIP"
/usr/lib/nagios/plugins/check_es_system.sh: line 155: [: =: unary operator expected
ES SYSTEM UNKNOWN - Authentication required but missing username and password

Make sure -t is mandatory, as shown in the help and the documentation.
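
The "unary operator expected" message is what the [ command prints when an unquoted, empty variable leaves the operator without a left-hand operand. A minimal sketch of an explicit check follows; the function name and the message text are assumptions of this example.

```shell
# Hypothetical guard: refuse to continue when no check type was given.
require_checktype() {
  # $1 = value passed with -t (may be empty or missing)
  if [ -z "${1:-}" ]; then
    echo "ES SYSTEM UNKNOWN - Missing check type (-t)"
    return 3
  fi
  return 0
}
```

Called as `require_checktype "$checktype" || exit $?` before the first comparison, it turns the shell error into a regular UNKNOWN result. Quoting the variable in later tests (`[ "$checktype" = "status" ]`) avoids the error as well.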

Proposal: Monitoring which indexes are in readonly

Abstract

When you want to monitor all read-only indexes, you can't see which index(es) is/are read-only and have to search for them manually.
For example:

# ./check_es_system.sh -H ${host} -P ${port} -S -u ${user} -p ${pass} -t readonly
ES SYSTEM CRITICAL -  1 index(es) found read-only -

Proposal

If you parse the JSON, you can extract the index(es) that is/are read-only.
For example, this code snippet achieves that with jq. I tried it with the json_parse wrapper but it doesn't work; I guess the wrapper has to be adapted.

      if [[ $rocount -gt 0 ]]; then
        if [[ "$index" = "_all" ]]; then
          roindexes=$(jq -r '.[].settings.index |select(.blocks.read_only == "true").provided_name' <<< ${settings})
          output[${icount}]=" $rocount index(es) found read-only. ${roindexes} -"

Output of tps too long

Hello,

This probably only affects larger clusters, but when I run check_es_system.sh with the tps check type, I get a huge perfdata output. I'm not sure what a good workaround would be; maybe only return the sum of all queues (as an option)?

Grtz

Willem

Handle local 503 error (curl exit code 0)

When the destination URL returns a 503 error, curl (might) exit with a status code 0:

++ curl -k -s --max-time 30 --basic -u user:secret http://localhost:9200/_cluster/stats
+ esstatus='<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>'
+ esstatusrc=0

Manual verification:

root@es:/# curl -s -o /dev/null -w "%{http_code}" -k -s --max-time 30 --basic -u user:secret http://localhost:9200/_cluster/stats
503root@es:/# echo $?
0

This situation will cause the plugin to fail with the following error:

root@es:/# /opt/check_es_system.sh -H localhost -P 9200 -t status -u user -p secret
json read error: line 1 column 1: '[' or '{' expected near '<'
json read error: line 1 column 1: '[' or '{' expected near '<'
json read error: line 1 column 1: '[' or '{' expected near '<'
json read error: line 1 column 1: '[' or '{' expected near '<'
json read error: line 1 column 1: '[' or '{' expected near '<'
json read error: line 1 column 1: '[' or '{' expected near '<'
json read error: line 1 column 1: '[' or '{' expected near '<'
json read error: line 1 column 1: '[' or '{' expected near '<'
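
A sketch of one way to get both the body and the HTTP status from a single request: let curl append the status on its own line with -w '\n%{http_code}' and split the result afterwards. The helper names and the chosen severities are assumptions of this example.

```shell
# Expected input: the output of
#   curl -k -s --max-time 30 -w '\n%{http_code}' "$esurl"
# i.e. the response body followed by the HTTP status on the last line.
http_code_of() { printf '%s\n' "$1" | tail -n 1; }
http_body_of() { printf '%s\n' "$1" | sed '$d'; }

# Hypothetical mapping of the HTTP status to a plugin state.
http_code_to_state() {
  case "$1" in
    2??) return 0 ;;
    503) echo "ES SYSTEM CRITICAL - Elasticsearch not available (HTTP 503)"; return 2 ;;
    *)   echo "ES SYSTEM UNKNOWN - Unexpected HTTP status $1"; return 3 ;;
  esac
}
```

Only when http_code_to_state succeeds would the body be handed to the JSON parser, so an HTML error page never reaches jshon or jq.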

Handle incorrect authentication

An incorrect username or password is not handled correctly by the plugin.

# /usr/lib/nagios/plugins/check_es_system.sh -H myes.example.com -u myesuser -p wrongpass -t disk -d 128
parse error: type 'object' has no elements to extract (arg 2)
expr: syntax error
expr: syntax error
/usr/lib/nagios/plugins/check_es_system.sh: line 147: [: -ge: unary operator expected
/usr/lib/nagios/plugins/check_es_system.sh: line 150: [: -ge: unary operator expected
ES SYSTEM OK - Disk usage is at % ( G from 128 G)|es_disk=B;109951162777;130567005798;;

Debug:

++ grep -i authentication
++ echo '{"error":{"root_cause":[{"type":"security_exception","reason":"action' '[cluster:monitor/stats]' requires 'authentication","header":{"WWW-Authenticate":"Basic' 'realm=\"shield\"' 'charset=\"UTF-8\""}}],"type":"security_exception","reason":"action' '[cluster:monitor/stats]' requires 'authentication","header":{"WWW-Authenticate":"Basic' 'realm=\"shield\"' 'charset=\"UTF-8\""}},"status":401}'
+ [[ -n {"error":{"root_cause":[{"type":"security_exception","reason":"action [cluster:monitor/stats] requires authentication","header":{"WWW-Authenticate":"Basic realm=\"shield\" charset=\"UTF-8\""}}],"type":"security_exception","reason":"action [cluster:monitor/stats] requires authentication","header":{"WWW-Authenticate":"Basic realm=\"shield\" charset=\"UTF-8\""}},"status":401} ]]
+ authlogic
+ [[ -z myesuser ]]
+ [[ -n myesuser ]]
+ [[ -z wrongpass ]]
+ [[ -n wrongpass ]]
+ [[ -z myesuser ]]
curl -s --basic -u ${user}:${pass} $esurl
++ curl -s --basic -u myesuser:wrongpass http://myes.example.com:9200/_cluster/stats
+ esstatus='{"error":{"root_cause":[{"type":"security_exception","reason":"unable to authenticate user [myesuser] for REST request [/_cluster/stats]","header":{"WWW-Authenticate":"Basic realm=\"shield\" charset=\"UTF-8\""}}],"type":"security_exception","reason":"unable to authenticate user [myesuser] for REST request [/_cluster/stats]","header":{"WWW-Authenticate":"Basic realm=\"shield\" charset=\"UTF-8\""}},"status":401}'
echo $esstatus | grep -i authentication
++ echo '{"error":{"root_cause":[{"type":"security_exception","reason":"unable' to authenticate user '[myesuser]' for REST request '[/_cluster/stats]","header":{"WWW-Authenticate":"Basic' 'realm=\"shield\"' 'charset=\"UTF-8\""}}],"type":"security_exception","reason":"unable' to authenticate user '[myesuser]' for REST request '[/_cluster/stats]","header":{"WWW-Authenticate":"Basic' 'realm=\"shield\"' 'charset=\"UTF-8\""}},"status":401}'
++ grep -i authentication
+ [[ -n '' ]]
+ case $checktype in
echo $esstatus | jshon -e indices -e store -e "size_in_bytes"
++ jshon -e indices -e store -e size_in_bytes
++ echo '{"error":{"root_cause":[{"type":"security_exception","reason":"unable' to authenticate user '[myesuser]' for REST request '[/_cluster/stats]","header":{"WWW-Authenticate":"Basic' 'realm=\"shield\"' 'charset=\"UTF-8\""}}],"type":"security_exception","reason":"unable' to authenticate user '[myesuser]' for REST request '[/_cluster/stats]","header":{"WWW-Authenticate":"Basic' 'realm=\"shield\"' 'charset=\"UTF-8\""}},"status":401}'
parse error: type 'object' has no elements to extract (arg 2)

Wrong Data Nodes Output with elastic 7.11.1

Hi Claudio,
the plugin reports 0 data nodes instead of X with Elasticsearch 7.11.1:

./check_es_system [...]
ES SYSTEM OK - Elasticsearch Cluster "el-cl-100" is green (14 nodes, 0 data nodes, 124 shards, 448066911 docs)|total_nodes=14;;;; data_nodes=0;;;; total_shards=124;;;; relocating_shards=0;;;; initializing_shards=0;;;; unassigned_shards=0;;;; docs=448066911;;;;

curl -k -s --basic -u remote_monitoring_user:xxx https://xxx:9200/_cluster/health
{"cluster_name":"el-cl-100","status":"green","timed_out":false,"number_of_nodes":14,"number_of_data_nodes":13,"active_primary_shards":62,"active_shards":124,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
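
The health response above does contain the expected count in number_of_data_nodes. As a rough, dependency-free sketch, the value could be read as follows; with jq the equivalent filter would be `.number_of_data_nodes`. The helper json_int_field is a hypothetical name and only handles flat integer fields.

```shell
# Hypothetical helper: print the integer value of a top-level JSON key.
# Stand-in for a real JSON parser; works only for "key":<digits> pairs.
json_int_field() {
  # $1 = JSON text, $2 = key name
  printf '%s\n' "$1" | sed -n "s/.*\"$2\":\([0-9][0-9]*\).*/\1/p" | head -n 1
}

health='{"cluster_name":"el-cl-100","status":"green","number_of_nodes":14,"number_of_data_nodes":13}'
json_int_field "$health" number_of_data_nodes
```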

By the way:
A very good improvement would be to check how many hot, warm, ingest, etc. nodes one expects in the cluster.
With Elastic's "new" role concept, roles can be parsed easily from /_cat/nodes:
https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-nodes.html

Cheers,
Marcus

Minimal role to get the command working

Hello,

It's more a suggestion than an issue.

Maybe you could add to the documentation the permissions that the user needs to be able to use this script?

Regards,

Johan

Catch 403 error

Currently, when a user does not have the privileges to access the stats, the plugin just returns meaningless output:

# ./check_es_system.sh -H myescluster.eu-central-1.aws.cloud.es.io -P 9243 -S -t status -u auser -p apassword
parse error: type 'object' has no elements to extract (arg 2)
parse error: type 'object' has no elements to extract (arg 2)
parse error: type 'object' has no elements to extract (arg 2)
parse error: type 'object' has no elements to extract (arg 2)
parse error: type 'object' has no elements to extract (arg 2)
parse error: type 'object' has no elements to extract (arg 2)
parse error: type 'object' has no elements to extract (arg 2)

What happens in the background is the following (seen with bash -xv):

++ echo '{"error":{"root_cause":[{"type":"security_exception","reason":"action' '[cluster:monitor/health]' is unauthorized for user '[auser]"}],"type":"security_exception","reason":"action' '[cluster:monitor/health]' is unauthorized for user '[auser]"},"status":403}'
parse error: type 'object' has no elements to extract (arg 2)

This needs to be handled, too.
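
Both this response and the 401 response shown in the incorrect-authentication issue end with a numeric status field, so a simple check on the body could catch them before parsing. This is only a sketch: the function name, the messages and the choice of CRITICAL as severity are assumptions.

```shell
# Hypothetical check: detect Elasticsearch security errors in the body.
auth_error_in_body() {
  # $1 = response body as returned by curl
  case "$1" in
    *'"status":401'*) echo "ES SYSTEM CRITICAL - Authentication failed (HTTP 401)"; return 2 ;;
    *'"status":403'*) echo "ES SYSTEM CRITICAL - User is not authorized for this request (HTTP 403)"; return 2 ;;
  esac
  return 0
}
```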

Plugin silently fails but returns OK when cluster failed

When an Elasticsearch cluster fails and certain system indexes have not yet been re-assigned, the plugin runs into an issue:

# /usr/lib/nagios/plugins/check_es_system.sh -H myESnode -u monitoring -p secret -t status
parse error: type 'object' has no elements to extract (arg 2)
json read error: line 2 column 0: '[' or '{' expected near end of file
json read error: line 2 column 0: '[' or '{' expected near end of file
json read error: line 2 column 0: '[' or '{' expected near end of file

Yet the exit code is 0, indicating everything's OK:

# echo $?
0

In the graphical monitoring (here Icingaweb2) view this will be shown:

image

Expected behaviour: Plugin should alert that something's not right with the cluster.
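
One way to get there, sketched under the assumption that every value taken from the response is validated before it is used in a comparison: if a value is empty or not a number, the plugin exits with UNKNOWN instead of falling through to OK. The function names are hypothetical.

```shell
# True only for non-empty strings made of digits.
is_integer() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical guard: report UNKNOWN (exit code 3) for unusable values.
require_integer() {
  # $1 = extracted value, $2 = human-readable label
  if is_integer "$1"; then
    return 0
  fi
  echo "ES SYSTEM UNKNOWN - Could not read $2 from the Elasticsearch response"
  return 3
}
```

A call such as `require_integer "$shards" "shard count" || exit $?` after each extraction (the variable name is illustrative) would stop the run at the first value the parser could not deliver.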

readonly check does not work as expected

Expected Behavior

The readonly check must produce ES SYSTEM CRITICAL output when an index has the setting index.blocks.read_only set to true.

Current Behavior

The readonly check produces ES SYSTEM OK output when an index has the setting index.blocks.read_only set to true.

Steps to Reproduce

  1. curl -XPUT -k -u ${user}:${pass} ${proto}://${host}:${port}/${index}/_settings -H 'Content-Type: application/json' -d '{ "index": { "blocks.read_only": true } }'
  2. chmod +x check_es_system.sh
  3. ./check_es_system.sh -H ${host} -P ${port} -S -u ${user} -p ${pass} -t readonly -i ${index}

Shards & ILM feature for monitoring

The issue we keep coming up against is that we exceed the maximum shard count and Elasticsearch stops ingesting. So we need to make sure that checking against total_shards will provide the level of alerting we're looking for.

We need to define thresholds for other metrics, so can we hopefully extend that to cover shards and ILM?

Can shard count be added to the script for alerting purposes?
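
A sketch of what a threshold comparison on the shard count could look like, using the usual plugin return codes (0 OK, 1 WARNING, 2 CRITICAL) and the perfdata layout seen in the outputs above. The function name and the threshold values in the example call are made up; how the plugin would expose this is not defined here.

```shell
# Hypothetical threshold check for a single integer metric.
check_threshold() {
  # $1 = label, $2 = current value, $3 = warning threshold, $4 = critical threshold
  if [ "$2" -ge "$4" ]; then
    echo "ES SYSTEM CRITICAL - $1 is $2|$1=$2;$3;$4;;"
    return 2
  elif [ "$2" -ge "$3" ]; then
    echo "ES SYSTEM WARNING - $1 is $2|$1=$2;$3;$4;;"
    return 1
  fi
  echo "ES SYSTEM OK - $1 is $2|$1=$2;$3;$4;;"
  return 0
}

# Example: 124 shards against made-up thresholds of 900 and 1000.
check_threshold total_shards 124 900 1000
```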
