clusterlabs / ha_cluster_exporter
Prometheus exporter for Pacemaker based Linux HA clusters
License: Apache License 2.0
Implement a metric which gives the date when the configuration was last updated.
A pseudo-metric implementation would look more or less like:
ha_cluster_pacemaker_resource_configuration_changes = Mon Oct 28 11:03:29 2019
The value should be a timestamp.
We already have the crm_mon command, which exposes this information:
<crm_mon version="2.0.0">
<summary>
<stack type="corosync" />
<current_dc present="true" version="1.1.18+20180430.b12c320f5-3.15.1-b12c320f5" name="damadog-hana01" id="1084780051" with_quorum="true" />
<last_update time="Mon Oct 28 11:03:34 2019" />
<last_change time="Mon Oct 28 11:03:29 2019" user="root" client="crm_attribute" origin="damadog-hana02" />
`last_change` is the field we are looking for.
We need to find a way to convert Mon Oct 28 11:03:29 2019
into a Prometheus-compatible value, probably a Unix timestamp; see the sketch below.
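A minimal conversion sketch in Go, assuming the last_change time attribute has already been extracted from the crm_mon XML (the metric wiring is left out):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// crm_mon prints times in the ANSIC layout ("Mon Jan _2 15:04:05 2006").
	lastChange := "Mon Oct 28 11:03:29 2019"
	t, err := time.Parse(time.ANSIC, lastChange)
	if err != nil {
		panic(err)
	}
	// A Unix timestamp is a Prometheus-friendly gauge value.
	fmt.Println(t.Unix()) // 1572260609
}
```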
Right now we print to the terminal.
Since the exporter runs as a service, the Go output should go to systemd (the journal).
The following file has some unit tests: https://github.com/ClusterLabs/ha_cluster_exporter/blob/master/pacemaker_metrics_test.go#L76
It would be nice to extend these tests following the same pattern, just testing other attributes.
The readme contains an outdated text:
This prometheus exporter is used to serve metrics for pacemaker https://github.com/ClusterLabs/pacemaker. It should run inside a node of the cluster or both.
We now export more metrics, covering SBD, DRBD and other components,
so we should write a more generic description. An example (it might need improvement and research):
the exporter is used to serve metrics for HA clusters and their components.
A second usage would be that the exporter is also intended to work as an exporter for a single component only; in that case it will just serve the metrics for that component.
I have hardcoded the number because I still need to implement the way we look it up.
https://github.com/MalloZup/ha_cluster_exporter/blob/master/main.go#L412
We need to build the exporter for other distros.
A good first one would be CentOS 7, since we already have the repo.
For Arch Linux or Debian, we could add repos there.
We build the pkg with:
https://build.opensuse.org/package/show/server:monitoring/prometheus-ha_cluster_exporter
https://duncan.codes/tutorials/rpm-packaging/
where you will learn about packaging.
This issue is hacker-level. Are you ready to be one?
If you want to take this issue, we will help along the way with any info required.
Currently:
# TYPE cluster_nodes gauge
cluster_nodes{type="DC"} 1
cluster_nodes{type="configured"} 2
cluster_nodes{type="expected_up"} 2
cluster_nodes{type="maintenance"} 0
cluster_nodes{type="member"} 2
cluster_nodes{type="online"} 2
cluster_nodes{type="pending"} 0
cluster_nodes{type="ping"} 0
cluster_nodes{type="remote"} 0
cluster_nodes{type="shutdown"} 0
cluster_nodes{type="standby"} 0
cluster_nodes{type="standby_onfail"} 0
cluster_nodes{type="unclean"} 0
cluster_nodes{type="unknown"} 0
We should add a node label:
cluster_nodes{node="foobar1", type="unknown"} 0
In the ha_cluster_nodes metric, differentiate node "type" and "status":
type: member/remote
status: combinations of "online/standby/standby_onfail/maintenance/pending/unclean/shutdown/expected_up/is_dc".
follow-up from
#58 (comment)
Similar to #28 - an error is seen when DRBD is not set up on the system:
2019/10/16 14:34:28 fork/exec /sbin/drbdsetup: no such file or directory
Rename ha_cluster_node_resources to ha_cluster_resources.
Depending on the distro or the service pack, the cluster binaries' locations can vary, causing the exporter not to find the executable. For example, between SLE-12-SP3 and SLE-15 the drbdsetup binary
has changed location, and the exporter shows the following error:
Oct 31 13:32:14 dev-1 ha_cluster_exporter[73317]: time="2019-10-31T13:32:14Z" level=warning msg="Could not initialize DRBD collector: '/usr/sbin/drbdsetup' not found: stat /usr/sbin/drbdsetup: n...r directory\n"
And here we see the real location:
dev-1:~ # which drbdsetup
/sbin/drbdsetup
The exporter should allow different locations and ideally should try to find it by itself.
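A minimal lookup sketch (not the exporter's actual code): honor an explicitly configured path first, then fall back to a $PATH search with exec.LookPath:

```go
package main

import (
	"fmt"
	"os/exec"
)

// findBinary resolves a cluster binary. configuredPath is a hypothetical
// flag/config value; an empty string means "search $PATH".
func findBinary(name, configuredPath string) (string, error) {
	if configuredPath != "" {
		// LookPath tries a path containing a slash directly.
		return exec.LookPath(configuredPath)
	}
	// Searches $PATH, so /sbin vs /usr/sbin no longer matters.
	return exec.LookPath(name)
}

func main() {
	path, err := findBinary("drbdsetup", "")
	if err != nil {
		fmt.Println("drbdsetup not found:", err)
		return
	}
	fmt.Println("using", path)
}
```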
We should check whether each metric needs a cluster or ha_cluster prefix.
The metric should look like:
ha_cluster_pacemaker_constraint{type="ban|prefer", id="foobar"}
crm resource ban rsc_ip_PRD_HDB00
After a command like the one above, crm_mon
already shows this info in the <bans> field,
so it will show the following:
<bans>
<ban id="cli-ban-rsc_ip_PRD_HDB00-on-damadog-hana02" resource="rsc_ip_PRD_HDB00" node="damadog-hana02" weight="-1000000" master_only="false" />
</bans>
And we can expose the resource ID and add a type label.
Good idea. We should also do this for all the other ones in the future, because actually none of the *_total
metrics document their value.
Originally posted by @stefanotorresi in #92
The metric should report 0 if no SBD device is configured, and otherwise the integer count of configured devices; a counting sketch follows.
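A minimal counting sketch, assuming devices are listed semicolon-separated in the SBD_DEVICE variable of /etc/sysconfig/sbd (the helper is hypothetical and the parsing deliberately naive):

```go
package main

import (
	"fmt"
	"strings"
)

// countSBDDevices counts the devices in an SBD_DEVICE= line of the
// given sysconfig content; it returns 0 when none is configured.
func countSBDDevices(sysconfig string) int {
	for _, line := range strings.Split(sysconfig, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "SBD_DEVICE=") {
			value := strings.Trim(strings.TrimPrefix(line, "SBD_DEVICE="), `"`)
			if value == "" {
				return 0
			}
			return len(strings.Split(value, ";"))
		}
	}
	return 0
}

func main() {
	fmt.Println(countSBDDevices(`SBD_DEVICE="/dev/vdb;/dev/vdc"`)) // 2
}
```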
We should have a Markdown index (table of contents) covering each central point.
Use the Register method (https://godoc.org/github.com/openshift/origin/vendor/github.com/prometheus/client_golang/prometheus#Registry.Register).
This allows us to catch the error and print an error message;
MustRegister just panics without giving us the opportunity to catch errors.
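A minimal sketch of the difference (the metric here is illustrative):

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	c := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "ha_cluster_example_metric", // hypothetical name
		Help: "Example gauge used to demonstrate Register.",
	})
	// Register returns the error; MustRegister would panic instead.
	if err := prometheus.Register(c); err != nil {
		log.Printf("could not register collector: %v", err)
	}
}
```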
Release a new RPM of the latest release.
Add basic CI for Go here.
Currently we have a function for resetting some metrics with labels.
Historically I didn't find any API for doing this job in the Prometheus client. This issue is more a placeholder for researching whether we can use some API to remove the metrics or reset their state via the Prometheus client library.
Currently, it is not possible to alter which address the exporter will listen on; I propose the addition of a -ip4
flag.
E.g. ./ha_cluster_exporter --ip4 127.0.0.1
Pull request to follow :)
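A minimal sketch of what the flag could look like, using the proposed --ip4 name and the exporter's 9002 port (the wiring is illustrative):

```go
package main

import (
	"flag"
	"fmt"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	addr := flag.String("ip4", "0.0.0.0", "IPv4 address the exporter listens on")
	flag.Parse()
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(fmt.Sprintf("%s:9002", *addr), nil))
}
```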
Go has a built-in feature to create and display code coverage of unit tests.
Task:
research this interesting feature
create a Makefile target like make coverage
which will execute the tests with code coverage
By default we don't need it, so we will have make coverage,
and make test
will run without coverage. make coverage will run go test
with some extra flags.
This task should create an HTML Go coverage report and then open a window in the web browser to visualize it.
We don't want to enable code coverage or similar in Travis or CI, because code coverage can be a really misleading metric when run on PRs or in automation.
We just want to use code coverage from time to time to see what we could perhaps cover, but there should be a human analyzing it.
A use-case could be:
as a dev I run make coverage,
and this automatically opens a browser window with the Go HTML code-coverage report.
Then I can analyze from time to time what perhaps isn't covered by tests (we can't cover everything with tests, since we don't have the binaries).
The code-coverage check should be a manual, human process; a sketch of the targets follows.
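A minimal sketch of the two targets, assuming standard Go tooling; the profile file name and flags are illustrative, not the project's actual Makefile:

```make
# Run the unit tests without coverage.
test:
	go test ./...

# Run the unit tests with a coverage profile, then render the HTML
# report; `go tool cover -html` opens it in the default web browser.
coverage:
	go test -coverprofile=coverage.out ./...
	go tool cover -html=coverage.out
```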
It would be nice to have a logo for this project.
The logo should mix, or remind of, some elements of Prometheus and Grafana, plus something conveying Cluster and HA. This is a starting point; any imagination is welcome!
We should check whether the maps we are accessing are empty, to avoid possible errors:
if len(m) == 0 {
	// nothing to record; skip
}
The DRBD PR introduces a refactor of the Pacemaker collector (#26).
This metric with types is similar to the DRBD metrics.
We should follow the same pattern. Also add:
We currently have 4 jobs, 2 for each stage (gofmt and build), but something must be off: we should have only 1 job per stage.
We have too many metrics.
We can easily consolidate them using labels.
For example, rather than http_responses_500_total and http_responses_403_total, create a single metric called http_responses_total with a code label for the HTTP response code. You can then process the entire metric as one in rules and graphs.
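A minimal sketch of the labelled approach from the quoted guideline (the names come from the quote, not from our exporter):

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// One vector with a "code" label replaces http_responses_500_total,
// http_responses_403_total, and so on.
var httpResponses = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_responses_total",
		Help: "Total HTTP responses, partitioned by status code.",
	},
	[]string{"code"},
)

func main() {
	prometheus.MustRegister(httpResponses)
	httpResponses.WithLabelValues("500").Inc()
	httpResponses.WithLabelValues("403").Inc()
}
```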
It would be very nice to see log levelling implemented, enabling specification of which log levels are output; this is a useful feature to have in a production environment.
Currently, the exporter continually outputs info-level logs.
Example:
2019/10/11 11:10:10 [INFO]: Reading quorum status with corosync-quorumtool...
2019/10/11 11:10:10 [INFO]: Reading cluster configuration with crm_mon..
2019/10/11 11:10:15 [INFO]: Reading cluster configuration with crm_mon..
2019/10/11 11:10:15 [INFO]: Reading quorum status with corosync-quorumtool...
2019/10/11 11:10:20 [INFO]: Reading cluster configuration with crm_mon..
2019/10/11 11:10:20 [INFO]: Reading quorum status with corosync-quorumtool...
... (the same pair of lines repeats every five seconds)
From these 2 metrics, create 1:
cluster_node_resources{node="ayoub-monitoring-hana01",resource_name="rsc_saphana_prd_hdb00",role="master"} 1
cluster_resources_status{status="master"} 1
cluster_node_resources{node="ayoub-monitoring-hana01",resource_name="rsc_saphana_prd_hdb00",role="master",status="orphaned"} 1
I have the feeling that these 2 metrics provide the same info that a user could get by reading the other cluster metrics, computing a total, etc.
For the sake of minimalism we could perhaps just remove them:
ha_cluster_nodes_configured_total
ha_cluster_resources_configured_total
I think historically they were added during the migration from the original hawk-apiserver metrics, but I was also doubting their need to exist, because imho one could write a PromQL query for these 2 metrics.
Imho I would vote to remove them, if we can, for the sake of simplicity and minimalism.
Add a CLI option to print the version of the exporter.
follow-up from #25
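Returning to the version flag: a minimal sketch, assuming the version string is injected at build time (the flag and variable names are illustrative):

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// version is typically overridden via -ldflags "-X main.version=1.2.3".
var version = "dev"

func main() {
	showVersion := flag.Bool("version", false, "print the exporter version and exit")
	flag.Parse()
	if *showVersion {
		fmt.Println("ha_cluster_exporter", version)
		os.Exit(0)
	}
}
```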
Implement something like (label names use underscores, since Prometheus label names cannot contain hyphens):
ha_cluster_drbd_resource_remote_connection{resource_name="1-single-0", role="primary", volume="0", peer_node_id="2", peer_role="Secondary"} 1
I have noticed that when operating the exporter on a machine that does not have SBD
set up, the following set of errors is seen repeatedly:
2019/10/11 15:14:12 [INFO]: Reading cluster SBD configuration..
2019/10/11 15:14:12 [ERROR] Could not open sbd config file open /etc/sysconfig/sbd: no such file or directory
There should maybe be a check such as:
if _, err := os.Stat("/etc/sysconfig/sbd"); os.IsNotExist(err) {
	log.Println("[INFO]: SBD is not present... skipping.")
} else {
	// read and parse the SBD configuration as usual
}
This would give a single output of 0000/00/00 00:00:00 [INFO]: SBD is not present... skipping.
instead of continually repeating an error in the absence of sbd. :)
Currently, if you visit the exporter on http://localhost:9002/
you get a 404
error, which is not an issue per se; however, it would be nice to have a simple landing page which just explains what the exporter is, similar to various others such as node_exporter.
This could be done using something such as:
http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
w.Write([]byte(`<html>
<head>
<title>HACluster Exporter</title>
</head>
<body>
<h1>HACluster Exporter</h1>
<p><a href="/metrics">Metrics</a></p>
<br />
<h2>More information:</h2>
<p><a href="https://github.com/ClusterLabs/ha_cluster_exporter">github.com/ClusterLabs/ha_cluster_exporter</a></p>
</body>
</html>`))
})
We have errors/warnings like:
log.Warnln(err)
Imho we should use log.Warnf("error while doing x: %s", err)
so that we have more context info.
In the past we used the configured metric. We should expose this metric separately.
We should just add a flag for the refresh interval in seconds, so it can be changed without rebuilding the binary.
Discuss which format to adopt for the changelog, and add one if needed.
This issue follows a discussion between me, @diegoakechi and @stefanotorresi.
We agreed that we should find a compromise between generating docs automatically and maintaining them manually.
This task might also depend on whether we want to refactor the exporter to use collectors, which will change the API a bit, so the automatic generation could be impacted as well. We should wait a little before doing this.
We should add a check in Travis verifying that the Go code is properly formatted.
Check whether crm_mon
returns the same data when running on different systems;
run this on older systems.
The readme might contain some English phrases which need rewording, or better statements, to improve the UX.
In case of multiple clusters, we should add a label with the cluster name, so we can see which cluster a metric belongs to.
This might need research, and we should perhaps do it at the dashboard level rather than in the exporter.
I think that in case of error we should set empty metrics, but not error/panic.
We should log the error but not interrupt.
When the exporter actually needs to panic is something to be researched.
https://github.com/ClusterLabs/ha_cluster_exporter/blob/master/.travis.yml#L10 points to the wrong GitHub project.
Change the path to the actual repository, ClusterLabs/ha_cluster_exporter.
ha_cluster_drbd_resource{resource_name="1-single-0", role="primary", volume="0", disk_state="uptodate", replication_state="established", peer_disk_state="uptodate"}
ha_cluster_drbd_resource{resource_name="vg2", role="secondary", volume="0", disk_state="uptodate", replication_state="established", peer_disk_state="uptodate"}
ha_cluster_drbd_resource{resource_name="vg2", role="secondary", volume="1", disk_state="outdated", replication_state="established", peer_disk_state="uptodate"}
Output of drbdsetup status --json:
[
{
"name": "1-single-0",
"node-id": 1,
"role": "Primary",
"suspended": false,
"write-ordering": "flush",
"devices": [
{
"volume": 0,
"minor": 2,
"disk-state": "UpToDate",
"client": false,
"quorum": true,
"size": 409600,
"read": 8457,
"written": 536513,
"al-writes": 4,
"bm-writes": 0,
"upper-pending": 0,
"lower-pending": 0
} ],
"connections": [
{
"peer-node-id": 2,
"name": "SLE15-sp1-gm-drbd1145296-node2",
"connection-state": "Connected",
"congested": false,
"peer-role": "Secondary",
"ap-in-flight": 0,
"rs-in-flight": 0,
"peer_devices": [
{
"volume": 0,
"replication-state": "Established",
"peer-disk-state": "UpToDate",
"peer-client": false,
"resync-suspended": "no",
"received": 8202,
"sent": 534390,
"out-of-sync": 0,
"pending": 0,
"unacked": 0,
"has-sync-details": false,
"has-online-verify-details": false,
"percent-in-sync": 100.00
} ]
} ]
}
]
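For context, a minimal sketch of Go types that would decode the output above (only the fields needed for the proposed metric; this is illustrative, not the exporter's actual code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// drbdResource mirrors one element of the drbdsetup status --json array.
type drbdResource struct {
	Name    string `json:"name"`
	Role    string `json:"role"`
	Devices []struct {
		Volume    int    `json:"volume"`
		DiskState string `json:"disk-state"`
	} `json:"devices"`
	Connections []struct {
		PeerNodeID  int    `json:"peer-node-id"`
		PeerRole    string `json:"peer-role"`
		PeerDevices []struct {
			Volume           int    `json:"volume"`
			ReplicationState string `json:"replication-state"`
			PeerDiskState    string `json:"peer-disk-state"`
		} `json:"peer_devices"`
	} `json:"connections"`
}

func main() {
	raw := []byte(`[{"name":"1-single-0","role":"Primary",
		"devices":[{"volume":0,"disk-state":"UpToDate"}],
		"connections":[{"peer-node-id":2,"peer-role":"Secondary",
		"peer_devices":[{"volume":0,"replication-state":"Established",
		"peer-disk-state":"UpToDate"}]}]}]`)
	var resources []drbdResource
	if err := json.Unmarshal(raw, &resources); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", resources[0])
}
```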
Create a metric:
corosync_ring_errors{ring_id="1"} 1
corosync_ring_errors{ring_id="42"} 0
so we can point out which ring is failing: 0 means healthy, 1 means failure.
We should use logrus (https://github.com/sirupsen/logrus) for more fine-grained control over logs; a sketch follows.
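A minimal sketch of what logrus gives us, assuming the level comes from a hypothetical --log-level flag:

```go
package main

import log "github.com/sirupsen/logrus"

func main() {
	// With the level set to Warn, the periodic Info lines are suppressed.
	log.SetLevel(log.WarnLevel)
	log.Infoln("Reading cluster configuration with crm_mon..") // suppressed
	log.Warnln("Could not initialize DRBD collector")          // printed
}
```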