clusterlabs / ha_cluster_exporter
Prometheus exporter for Pacemaker based Linux HA clusters
License: Apache License 2.0
Implement a metric which gives the date when the configuration was last updated.
A pseudo-metric implementation would look more or less like:
ha_cluster_pacemaker_resource_configuration_changes = Mon Oct 28 11:03:29 2019
The value should be a timestamp.
We already have the crm_mon command, which exposes this information:
<crm_mon version="2.0.0">
<summary>
<stack type="corosync" />
<current_dc present="true" version="1.1.18+20180430.b12c320f5-3.15.1-b12c320f5" name="damadog-hana01" id="1084780051" with_quorum="true" />
<last_update time="Mon Oct 28 11:03:34 2019" />
<last_change time="Mon Oct 28 11:03:29 2019" user="root" client="crm_attribute" origin="damadog-hana02" />
`last_change` is the field we are looking for.
We need to find a way to convert Mon Oct 28 11:03:29 2019
into a Prometheus-compatible value, probably a Unix timestamp; see the sketch below.
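A minimal conversion sketch in Go, assuming the last_change time attribute has already been extracted from the crm_mon XML (the metric wiring is left out):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// crm_mon prints times in the ANSIC layout ("Mon Jan _2 15:04:05 2006").
	lastChange := "Mon Oct 28 11:03:29 2019"
	t, err := time.Parse(time.ANSIC, lastChange)
	if err != nil {
		panic(err)
	}
	// A Unix timestamp is a Prometheus-friendly gauge value.
	fmt.Println(t.Unix()) // 1572260609
}
```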
Right now we print to the terminal.
Since the exporter runs as a service, the Go output should go to systemd (the journal).
The following file has some unit tests: https://github.com/ClusterLabs/ha_cluster_exporter/blob/master/pacemaker_metrics_test.go#L76
It would be nice to extend these tests following the same pattern, just testing other attributes.
The readme contains an outdated text:
This prometheus exporter is used to serve metrics for pacemaker https://github.com/ClusterLabs/pacemaker. It should run inside a node of the cluster or both.
We now export more metrics, covering SBD, DRBD and other components,
so we should write a more generic description. An example (it might need improvement and research):
the exporter is used to serve metrics for HA clusters and their components.
A second usage would be that the exporter is also intended to work as an exporter for a single component only; in that case it will just serve the metrics for that component.
I have hardcoded the number because I still need to implement the way we look it up.
https://github.com/MalloZup/ha_cluster_exporter/blob/master/main.go#L412
We need to build the exporter for other distros.
A good first one would be CentOS 7, since we already have the repo.
For Arch Linux or Debian, we could add repos there.
We build the pkg with:
https://build.opensuse.org/package/show/server:monitoring/prometheus-ha_cluster_exporter
https://duncan.codes/tutorials/rpm-packaging/
where you will learn about packaging.
This issue is hacker-level. Are you ready to be one?
If you want to take this issue, we will help along the way with any info required.
Currently:
# TYPE cluster_nodes gauge
cluster_nodes{type="DC"} 1
cluster_nodes{type="configured"} 2
cluster_nodes{type="expected_up"} 2
cluster_nodes{type="maintenance"} 0
cluster_nodes{type="member"} 2
cluster_nodes{type="online"} 2
cluster_nodes{type="pending"} 0
cluster_nodes{type="ping"} 0
cluster_nodes{type="remote"} 0
cluster_nodes{type="shutdown"} 0
cluster_nodes{type="standby"} 0
cluster_nodes{type="standby_onfail"} 0
cluster_nodes{type="unclean"} 0
cluster_nodes{type="unknown"} 0
We should add a node label:
cluster_nodes{node="foobar1", type="unknown"} 0
In the ha_cluster_nodes metric, differentiate node "type" and "status":
type: member/remote
status: combinations of "online/standby/standby_onfail/maintenance/pending/unclean/shutdown/expected_up/is_dc".
follow-up from
#58 (comment)
Similar to #28 - an error is seen when DRBD is not set up on the system:
2019/10/16 14:34:28 fork/exec /sbin/drbdsetup: no such file or directory
Rename ha_cluster_node_resources to ha_cluster_resources.
Depending on the distro or the service pack, the cluster binaries' locations can vary, causing the exporter not to find the executable. For example, between SLE-12-SP3 and SLE-15 the drbdsetup binary
has changed location, and the exporter shows the following error:
Oct 31 13:32:14 dev-1 ha_cluster_exporter[73317]: time="2019-10-31T13:32:14Z" level=warning msg="Could not initialize DRBD collector: '/usr/sbin/drbdsetup' not found: stat /usr/sbin/drbdsetup: n...r directory\n"
And here we see the real location:
dev-1:~ # which drbdsetup
/sbin/drbdsetup
The exporter should allow different locations and ideally should try to find it by itself.
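A minimal lookup sketch (not the exporter's actual code): honor an explicitly configured path first, then fall back to a $PATH search with exec.LookPath:

```go
package main

import (
	"fmt"
	"os/exec"
)

// findBinary resolves a cluster binary. configuredPath is a hypothetical
// flag/config value; an empty string means "search $PATH".
func findBinary(name, configuredPath string) (string, error) {
	if configuredPath != "" {
		// LookPath tries a path containing a slash directly.
		return exec.LookPath(configuredPath)
	}
	// Searches $PATH, so /sbin vs /usr/sbin no longer matters.
	return exec.LookPath(name)
}

func main() {
	path, err := findBinary("drbdsetup", "")
	if err != nil {
		fmt.Println("drbdsetup not found:", err)
		return
	}
	fmt.Println("using", path)
}
```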
We should check whether each metric needs a cluster or ha_cluster prefix.
The metric should look like:
ha_cluster_pacemaker_constraint{type="ban|prefer", id="foobar"}
crm resource ban rsc_ip_PRD_HDB00
After a command like the one above, crm_mon
already shows this info in the <bans> field,
so it will show the following:
<bans>
<ban id="cli-ban-rsc_ip_PRD_HDB00-on-damadog-hana02" resource="rsc_ip_PRD_HDB00" node="damadog-hana02" weight="-1000000" master_only="false" />
</bans>
And we can expose the resource ID and add a type label.
Good idea. We should also do this for all the other ones in the future, because actually none of the *_total
metrics document their value.
Originally posted by @stefanotorresi in #92
The metric should report 0 if no SBD device is configured, and otherwise the integer count of configured devices; a counting sketch follows.
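A minimal counting sketch, assuming devices are listed semicolon-separated in the SBD_DEVICE variable of /etc/sysconfig/sbd (the helper is hypothetical and the parsing deliberately naive):

```go
package main

import (
	"fmt"
	"strings"
)

// countSBDDevices counts the devices in an SBD_DEVICE= line of the
// given sysconfig content; it returns 0 when none is configured.
func countSBDDevices(sysconfig string) int {
	for _, line := range strings.Split(sysconfig, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "SBD_DEVICE=") {
			value := strings.Trim(strings.TrimPrefix(line, "SBD_DEVICE="), `"`)
			if value == "" {
				return 0
			}
			return len(strings.Split(value, ";"))
		}
	}
	return 0
}

func main() {
	fmt.Println(countSBDDevices(`SBD_DEVICE="/dev/vdb;/dev/vdc"`)) // 2
}
```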
We should have a Markdown index (table of contents) covering each central point.
Use the Register method (https://godoc.org/github.com/openshift/origin/vendor/github.com/prometheus/client_golang/prometheus#Registry.Register).
This allows us to catch the error and print an error message;
MustRegister just panics without giving us the opportunity to catch errors.
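A minimal sketch of the difference (the metric here is illustrative):

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	c := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "ha_cluster_example_metric", // hypothetical name
		Help: "Example gauge used to demonstrate Register.",
	})
	// Register returns the error; MustRegister would panic instead.
	if err := prometheus.Register(c); err != nil {
		log.Printf("could not register collector: %v", err)
	}
}
```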
Release a new RPM of the latest release.
Add basic CI for Go here.
Currently we have a function for resetting some metrics with labels.
Historically I didn't find any API for doing this job in the Prometheus client. This issue is more a placeholder for researching whether we can use some API to remove the metrics or reset their state via the Prometheus client library.
Currently, it is not possible to alter which address the exporter will listen on; I propose the addition of a -ip4
flag.
E.g. ./ha_cluster_exporter --ip4 127.0.0.1
Pull request to follow :)
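A minimal sketch of what the flag could look like, using the proposed --ip4 name and the exporter's 9002 port (the wiring is illustrative):

```go
package main

import (
	"flag"
	"fmt"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	addr := flag.String("ip4", "0.0.0.0", "IPv4 address the exporter listens on")
	flag.Parse()
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(fmt.Sprintf("%s:9002", *addr), nil))
}
```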
Go has a built-in feature to create and display code coverage of unit tests.
Task:
research this interesting feature
create a Makefile target like make coverage
which will execute the tests with code coverage
By default we don't need it, so we will have make coverage,
and make test
will run without coverage. make coverage will run go test
with some extra flags.
This task should create an HTML Go coverage report and then open a window in the web browser to visualize it.
We don't want to enable code coverage or similar in Travis or CI, because code coverage can be a really misleading metric when run on PRs or in automation.
We just want to use code coverage from time to time to see what we could perhaps cover, but there should be a human analyzing it.
A use-case could be:
as a dev I run make coverage,
and this automatically opens a browser window with the Go HTML code-coverage report.
Then I can analyze from time to time what perhaps isn't covered by tests (we can't cover everything with tests, since we don't have the binaries).
The code-coverage check should be a manual, human process; a sketch of the targets follows.
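A minimal sketch of the two targets, assuming standard Go tooling; the profile file name and flags are illustrative, not the project's actual Makefile:

```make
# Run the unit tests without coverage.
test:
	go test ./...

# Run the unit tests with a coverage profile, then render the HTML
# report; `go tool cover -html` opens it in the default web browser.
coverage:
	go test -coverprofile=coverage.out ./...
	go tool cover -html=coverage.out
```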
It would be nice to have a logo for this project.
The logo should mix, or remind of, some elements of Prometheus and Grafana, plus something conveying Cluster and HA. This is a starting point; any imagination is welcome!
We should check whether the maps we are accessing are empty, to avoid possible errors:
if len(m) == 0 {
	// nothing to record; skip
}
The DRBD PR introduces a refactor of the Pacemaker collector (#26).
This metric with types is similar to the DRBD metrics.
We should follow the same pattern. Also add:
We currently have 4 jobs, 2 for each stage (gofmt and build), but something must be off: we should have only 1 job per stage.
We have too many metrics.
We can easily consolidate them using labels.
For example, rather than http_responses_500_total and http_responses_403_total, create a single metric called http_responses_total with a code label for the HTTP response code. You can then process the entire metric as one in rules and graphs.
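A minimal sketch of the labelled approach from the quoted guideline (the names come from the quote, not from our exporter):

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// One vector with a "code" label replaces http_responses_500_total,
// http_responses_403_total, and so on.
var httpResponses = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_responses_total",
		Help: "Total HTTP responses, partitioned by status code.",
	},
	[]string{"code"},
)

func main() {
	prometheus.MustRegister(httpResponses)
	httpResponses.WithLabelValues("500").Inc()
	httpResponses.WithLabelValues("403").Inc()
}
```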
It would be very nice to see log levelling implemented, enabling specification of which log levels are output; this is a useful feature to have in a production environment.
Currently, the exporter continually outputs info-level logs.
Example:
2019/10/11 11:10:10 [INFO]: Reading quorum status with corosync-quorumtool...
2019/10/11 11:10:10 [INFO]: Reading cluster configuration with crm_mon..
2019/10/11 11:10:15 [INFO]: Reading cluster configuration with crm_mon..
2019/10/11 11:10:15 [INFO]: Reading quorum status with corosync-quorumtool...
2019/10/11 11:10:20 [INFO]: Reading cluster configuration with crm_mon..
2019/10/11 11:10:20 [INFO]: Reading quorum status with corosync-quorumtool...
... (the same pair of lines repeats every five seconds)
From these 2 metrics, create 1:
cluster_node_resources{node="ayoub-monitoring-hana01",resource_name="rsc_saphana_prd_hdb00",role="master"} 1
cluster_resources_status{status="master"} 1
cluster_node_resources{node="ayoub-monitoring-hana01",resource_name="rsc_saphana_prd_hdb00",role="master",status="orphaned"} 1
I have the feeling that these 2 metrics provide the same info that a user could get by reading the other cluster metrics, computing a total, etc.
For the sake of minimalism we could perhaps just remove them:
ha_cluster_nodes_configured_total
ha_cluster_resources_configured_total
I think historically they were added during the migration from the original hawk-apiserver metrics, but I was also doubting their need to exist, because imho one could write a PromQL query for these 2 metrics.
Imho I would vote to remove them, if we can, for the sake of simplicity and minimalism.
Add a CLI option to print the version of the exporter.
follow-up from #25
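Returning to the version flag: a minimal sketch, assuming the version string is injected at build time (the flag and variable names are illustrative):

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// version is typically overridden via -ldflags "-X main.version=1.2.3".
var version = "dev"

func main() {
	showVersion := flag.Bool("version", false, "print the exporter version and exit")
	flag.Parse()
	if *showVersion {
		fmt.Println("ha_cluster_exporter", version)
		os.Exit(0)
	}
}
```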
Implement something like (label names use underscores, since Prometheus label names cannot contain hyphens):
ha_cluster_drbd_resource_remote_connection{resource_name="1-single-0", role="primary", volume="0", peer_node_id="2", peer_role="Secondary"} 1
I have noticed that when operating the exporter on a machine that does not have SBD
set up, the following set of errors is seen repeatedly:
2019/10/11 15:14:12 [INFO]: Reading cluster SBD configuration..
2019/10/11 15:14:12 [ERROR] Could not open sbd config file open /etc/sysconfig/sbd: no such file or directory
There should maybe be a check such as:
if _, err := os.Stat("/etc/sysconfig/sbd"); os.IsNotExist(err) {
	log.Println("[INFO]: SBD is not present... skipping.")
} else {
	// read and parse the SBD configuration as usual
}
This would give a single output of 0000/00/00 00:00:00 [INFO]: SBD is not present... skipping.
instead of continually repeating an error in the absence of sbd. :)
Currently, if you visit the exporter on http://localhost:9002/
you get a 404
error, which is not an issue per se; however, it would be nice to have a simple landing page which just explains what the exporter is, similar to various others such as node_exporter.
This could be done using something such as:
http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
w.Write([]byte(`<html>
<head>
<title>HACluster Exporter</title>
</head>
<body>
<h1>HACluster Exporter</h1>
<p><a href="/metrics">Metrics</a></p>
<br />
<h2>More information:</h2>
<p><a href="https://github.com/ClusterLabs/ha_cluster_exporter">github.com/ClusterLabs/ha_cluster_exporter</a></p>
</body>
</html>`))
})
We have errors/warnings like:
log.Warnln(err)
Imho we should use log.Warnf("error while doing x: %s", err)
so that we have more context info.
In the past we used the configured metric. We should expose this metric separately.
We should just add a flag for the refresh interval in seconds, so it can be changed without rebuilding the binary.
Discuss which format to adopt for the changelog, and add one if needed.
This issue follows a discussion between me, @diegoakechi and @stefanotorresi.
We agreed that we should find a compromise between generating docs automatically and maintaining them manually.
This task might also depend on whether we want to refactor the exporter to use collectors, which will change the API a bit, so the automatic generation could be impacted as well. We should wait a little before doing this.
We should add a check in Travis verifying that the Go code is properly formatted.
Check whether crm_mon
returns the same data when running on different systems;
run this on older systems.
The readme might contain some English phrases which need rewording, or better statements, to improve the UX.
In case of multiple clusters, we should add a label with the cluster name, so we can see which cluster a metric belongs to.
This might need research, and we should perhaps do it at the dashboard level rather than in the exporter.
I think that in case of error we should set empty metrics, but not error/panic.
We should log the error but not interrupt.
When the exporter actually needs to panic is something to be researched.
https://github.com/ClusterLabs/ha_cluster_exporter/blob/master/.travis.yml#L10 points to the wrong GitHub project.
Change the path to the actual repository, ClusterLabs/ha_cluster_exporter.
ha_cluster_drbd_resource{resource_name="1-single-0", role="primary", volume="0", disk_state="uptodate", replication_state="established", peer_disk_state="uptodate"}
ha_cluster_drbd_resource{resource_name="vg2", role="secondary", volume="0", disk_state="uptodate", replication_state="established", peer_disk_state="uptodate"}
ha_cluster_drbd_resource{resource_name="vg2", role="secondary", volume="1", disk_state="outdated", replication_state="established", peer_disk_state="uptodate"}
Output of drbdsetup status --json:
[
{
"name": "1-single-0",
"node-id": 1,
"role": "Primary",
"suspended": false,
"write-ordering": "flush",
"devices": [
{
"volume": 0,
"minor": 2,
"disk-state": "UpToDate",
"client": false,
"quorum": true,
"size": 409600,
"read": 8457,
"written": 536513,
"al-writes": 4,
"bm-writes": 0,
"upper-pending": 0,
"lower-pending": 0
} ],
"connections": [
{
"peer-node-id": 2,
"name": "SLE15-sp1-gm-drbd1145296-node2",
"connection-state": "Connected",
"congested": false,
"peer-role": "Secondary",
"ap-in-flight": 0,
"rs-in-flight": 0,
"peer_devices": [
{
"volume": 0,
"replication-state": "Established",
"peer-disk-state": "UpToDate",
"peer-client": false,
"resync-suspended": "no",
"received": 8202,
"sent": 534390,
"out-of-sync": 0,
"pending": 0,
"unacked": 0,
"has-sync-details": false,
"has-online-verify-details": false,
"percent-in-sync": 100.00
} ]
} ]
}
]
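For context, a minimal sketch of Go types that would decode the output above (only the fields needed for the proposed metric; this is illustrative, not the exporter's actual code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// drbdResource mirrors one element of the drbdsetup status --json array.
type drbdResource struct {
	Name    string `json:"name"`
	Role    string `json:"role"`
	Devices []struct {
		Volume    int    `json:"volume"`
		DiskState string `json:"disk-state"`
	} `json:"devices"`
	Connections []struct {
		PeerNodeID  int    `json:"peer-node-id"`
		PeerRole    string `json:"peer-role"`
		PeerDevices []struct {
			Volume           int    `json:"volume"`
			ReplicationState string `json:"replication-state"`
			PeerDiskState    string `json:"peer-disk-state"`
		} `json:"peer_devices"`
	} `json:"connections"`
}

func main() {
	raw := []byte(`[{"name":"1-single-0","role":"Primary",
		"devices":[{"volume":0,"disk-state":"UpToDate"}],
		"connections":[{"peer-node-id":2,"peer-role":"Secondary",
		"peer_devices":[{"volume":0,"replication-state":"Established",
		"peer-disk-state":"UpToDate"}]}]}]`)
	var resources []drbdResource
	if err := json.Unmarshal(raw, &resources); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", resources[0])
}
```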
Create a metric:
corosync_ring_errors{ring_id="1"} 1
corosync_ring_errors{ring_id="42"} 0
so we can point out which ring is failing: 0 means healthy, 1 means failure.
We should use logrus (https://github.com/sirupsen/logrus) for more fine-grained control over logs; a sketch follows.
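A minimal sketch of what logrus gives us, assuming the level comes from a hypothetical --log-level flag:

```go
package main

import log "github.com/sirupsen/logrus"

func main() {
	// With the level set to Warn, the periodic Info lines are suppressed.
	log.SetLevel(log.WarnLevel)
	log.Infoln("Reading cluster configuration with crm_mon..") // suppressed
	log.Warnln("Could not initialize DRBD collector")          // printed
}
```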