ZFS Exporter


Prometheus exporter for ZFS (pools, filesystems, snapshots and volumes). Other implementations exist; however, their performance can be quite variable, producing occasional timeouts (and associated alerts). This exporter was built with a few features aimed at letting users avoid collecting more than they need, ensuring timeouts cannot occur, and still eventually returning useful data:

  • Pool selection - allow the user to select which pools are collected
  • Multiple collectors - allow the user to select which data types are collected (pools, filesystems, snapshots and volumes)
  • Property selection - allow the user to select which properties are collected per data type (enabling only required properties improves collector performance by reducing metadata queries)
  • Collection deadline and caching - if the collection duration exceeds the configured deadline, cached data from the last run is returned for any metrics that have not yet been collected, and the current collection run continues in the background. Collections do not run concurrently, so that when a system is running slowly we don't compound the problem: if an existing collection is still running, cached data is returned.

Installation

Download the latest release for your platform, and unpack it somewhere on your filesystem.

You may also build the latest version using Go v1.11 - 1.17 via go get:

go get -u github.com/pdf/zfs_exporter

Installation can also be accomplished using go install:

version=latest # or a specific version tag
go install github.com/pdf/zfs_exporter@$version

Usage

usage: zfs_exporter [<flags>]

Flags:
  -h, --help                 Show context-sensitive help (also try --help-long and --help-man).
      --collector.dataset-filesystem
                             Enable the dataset-filesystem collector (default: enabled)
      --properties.dataset-filesystem="available,logicalused,quota,referenced,used,usedbydataset,written"
                             Properties to include for the dataset-filesystem collector, comma-separated.
      --collector.dataset-snapshot
                             Enable the dataset-snapshot collector (default: disabled)
      --properties.dataset-snapshot="logicalused,referenced,used,written"
                             Properties to include for the dataset-snapshot collector, comma-separated.
      --collector.dataset-volume
                             Enable the dataset-volume collector (default: enabled)
      --properties.dataset-volume="available,logicalused,referenced,used,usedbydataset,volsize,written"
                             Properties to include for the dataset-volume collector, comma-separated.
      --collector.pool       Enable the pool collector (default: enabled)
      --properties.pool="allocated,dedupratio,fragmentation,free,freeing,health,leaked,readonly,size"
                             Properties to include for the pool collector, comma-separated.
      --web.listen-address=":9134"
                             Address on which to expose metrics and web interface.
      --web.telemetry-path="/metrics"
                             Path under which to expose metrics.
      --web.disable-exporter-metrics
                             Exclude metrics about the exporter itself (promhttp_*, process_*, go_*).
      --deadline=8s          Maximum duration that a collection should run before returning cached data. Should
                             be set to a value shorter than your scrape timeout duration. The current
                             collection run will continue and update the cache when complete (default: 8s)
      --pool=POOL ...        Name of the pool(s) to collect, repeat for multiple pools (default: all pools).
      --exclude=EXCLUDE ...  Exclude datasets/snapshots/volumes that match the provided regex (e.g.
                             '^rpool/docker/'), may be specified multiple times.
      --log.level=info       Only log messages with the given severity or above. One of: [debug, info, warn,
                             error]
      --log.format=logfmt    Output format of log messages. One of: [logfmt, json]
      --version              Show application version.

Collectors that are enabled by default can be negated by prefixing the flag with --no-*, i.e.:

zfs_exporter --no-collector.dataset-filesystem
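For example, to collect only two pools, skip Docker datasets, and enable the snapshot collector (pool and dataset names here are placeholders):

zfs_exporter --pool=rpool --pool=tank --exclude='^rpool/docker/' --collector.dataset-snapshot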

TLS endpoint

EXPERIMENTAL

The exporter supports TLS via a new web configuration file.

./zfs_exporter --web.config.file=web-config.yml

See the exporter-toolkit https package for more details.
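A minimal web-config.yml enabling TLS might look like the following (certificate paths are placeholders; see the exporter-toolkit documentation for the full schema):

tls_server_config:
  cert_file: /path/to/server.crt
  key_file: /path/to/server.key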

Caveats

The collector may need to be run as root on some platforms (e.g. Linux prior to ZFS v0.7.0).

Whilst inspiration was taken from some of the alternative ZFS collectors, metric names may not be compatible.



Issues

Feature Request: Include filesystem for snapshots

I could not find a simple way to generate per-filesystem graphs of snapshot usage.

The metric currently looks like this:

zfs_dataset_logical_used_bytes{name="rpool/application@zrepl_20220407_154137_000",pool="rpool",type="snapshot"} 0

If we could add a filesystem label, we could build easy statistics of snapshots for each filesystem:

zfs_dataset_logical_used_bytes{name="zrepl_20220407_154137_000",filesystem="rpool/application",pool="rpool",type="snapshot"} 0

I think this would add clear benefit, and it is also consistent with how we handle the pool label, for example.
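Until such a label exists, a hedged workaround is to derive it at scrape time with Prometheus metric relabelling, splitting the dataset name at the @ (job name and target are placeholders):

scrape_configs:
  - job_name: zfs
    static_configs:
      - targets: ['localhost:9134']
    metric_relabel_configs:
      # copy the dataset portion of "fs@snap" into a filesystem label
      - source_labels: [name]
        regex: '(.+)@.+'
        target_label: filesystem
        replacement: '${1}'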

Get additional ZFS properties

Would it be possible to add a few more ZFS properties to the zfs_dataset.go file? In particular:

  • Recordsize (zfs get recordsize)
  • Mount point (zfs get mountpoint)
  • Compression type (zfs get compression)
  • Primary Cache (zfs get primarycache)
  • Secondary Cache (zfs get secondarycache)

Thank you.

Feature request: publish detailed health metrics

Currently, only the overall zpool health status is published. It would be useful if the individual vdev and disk status were also published.

For example, this status

$ zpool status
...
config:
        NAME                        STATE     READ WRITE CKSUM
        zpool1                      ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            wwn-0x5000c500b3b2f8c0  ONLINE       0     0     0
            wwn-0x5000c500b3b53463  ONLINE       0     0     0
            wwn-0x5000c500b3b33354  ONLINE       0     0     0

could lead to these metrics being published:

zfs_pool_health{pool="zpool1"} 0
zfs_pool_vdev_health{pool="zpool1", vdev="raidz1-0"} 0
zfs_pool_disk_health{pool="zpool1", vdev="raidz1-0", disk="wwn-0x5000c500b3b2f8c0"} 0
zfs_pool_disk_health{pool="zpool1", vdev="raidz1-0", disk="wwn-0x5000c500b3b53463"} 0
zfs_pool_disk_health{pool="zpool1", vdev="raidz1-0", disk="wwn-0x5000c500b3b33354"} 0

This would make it possible to build more informative dashboards. With only the pool health status, the operator still has to log into the server to find out which/how many devices are faulted.

Improve documentation on installation and alerts

Hi @pdf I've been working with your awesome exporter the whole day, and I've created some docs for myself on how to install and configure it. You can find them here. Feel free to copy-paste any part that you'd like to use.

I'm fine with you closing the issue once you've read it; this was just the easiest way to reach out.

Feature Request: Allow for all permutations of configurations per each dataset

Love this project, very useful. However, the control over which features (collectors and properties) are enabled/collected per dataset is very limited (coarse). I need finer-grained control over what is enabled per dataset.

It should be possible to allow for all permutations of all configuration options per each dataset. I propose the following:

  • Add flag:
    --include=INCLUDE
    That acts as the opposite of --exclude

  • Add finer-grained control over which types of datasets have their snapshots collected
    Instead of (or in addition to) the flag:
    --collector.dataset-snapshot
    Use these:
    --collector.dataset-volume-snapshot
    --collector.dataset-filesystem-snapshot

  • Add ability to enable bookmark data collection:
    Add argument:
    --collector.dataset-bookmark
    --collector.dataset-volume-bookmark
    --collector.dataset-filesystem-bookmark

  • Add ability to enable snapshot holds data collection:
    Add argument:
    --collector.dataset-snapshot-holds
    --collector.dataset-volume-snapshot-holds
    --collector.dataset-filesystem-snapshot-holds

  • Add ability to selectively enable properties or collectors per zpool/dataset by:
    Most importantly for the --collector.dataset-snapshot (and other snapshot/bookmark) options, since snapshot/bookmark data collection is VERY expensive, you may want to limit it to a specific set of datasets.
    Add argument:
    --dataset XYZ
    That can be specified multiple times. Any following --collector* or --properties* arguments will then only be applied to the specified dataset(s)
    Also:
    --include and --exclude should be able to be specified multiple times and behave in the same manner.
    --pool should behave in the same manner (but for pools)
    or add:
    --pools should allow a comma-delimited list of pool names
    e.g.
    ./zfs_exporter --pools bpool,rpool \
    --collector.dataset-volume-snapshot --properties.dataset-snapshot="logicalused,referenced,used,written" \
    --pools data --datasets "^data/my/important/dataset" \
    --collector.dataset-snapshot \
    --properties.dataset-snapshot="creation,logicalused,referenced,used,written,compressratio,refcompressratio"

  • Also, one more thing... add the ability to query all zfs module parameters (e.g. /sys/module/zfs/parameters/*)

This will allow for total zfs information awareness :-)

Export scrub/resilver status

I was curious whether this exports scrub and/or resilvering status? I briefly dug through the code and didn't see anything that stuck out. ZED is very flaky in general, and it would be so awesome to be able to dump it and have alerts through Prometheus.

go.mod is missing /v2 version suffix

https://go.dev/ref/mod#major-version-suffixes

This has some other side effects apart from making it inconvenient to import this code in another Go module. Installation using go get is deprecated as of Go 1.17, and will not work starting with Go 1.18 (see https://go.dev/doc/go-get-install-deprecation). For example:

% go get -u github.com/pdf/zfs_exporter
go get: installing executables with 'go get' in module mode is deprecated.
        Use 'go install pkg@version' instead.
        For more information, see https://golang.org/doc/go-get-install-deprecation
        or run 'go help get' or 'go help install'.

The go install method doesn't currently work because of the missing module suffix:

go install github.com/pdf/zfs_exporter@v2.x.x
go install: github.com/pdf/zfs_exporter@v2.x.x: github.com/pdf/zfs_exporter@v2.x.x: invalid version: module contains a go.mod file, so major version must be compatible: should be v0 or v1, not v2

Updating go.mod to include the /v2 version suffix will fix this (starting with the next release).
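For reference, the proposed fix is the standard one-line change to go.mod:

module github.com/pdf/zfs_exporter/v2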

PR incoming...

Renormalized values are counterintuitive (case compressratio)

It seems like the changes made in version 2.0.0 to renormalize values to the range 0..1 produce wrong(?) results.

Case in point, compressratio:

$ zfs list -po name,refcompressratio,logicalreferenced,referenced rpool/crypt/var/log
NAME                 REFRATIO      LREFER       REFER
rpool/crypt/var/log      4.26  6032219136  1445105664

It's clear how the ratio is calculated (REFRATIO = LREFER/REFER) and it's clear what it means: the "real" data is 4.26 times larger than what's written to disk ... i.e. compression shrunk the data to roughly 1/4 (~24%) of its original size.

But activating the new metric gives numbers like this:

zfs_dataset_compression_ratio{name="rpool/crypt/var/log",pool="rpool",type="filesystem"} 0.7830802603036877

I was confused 😕 After looking at the formula (i.e. transformMultiplier()) I could rationalize that this is the fraction of data that was "compressed away" (i.e. ~ 1 - 0.24) ... which is "not so useful". I'm OK with renormalizing percentages, but "reformulating" ratios is maybe going too far ... while mathematically correct, they change their "meaning" (i.e. the label <-> number correspondence).

My suggestion would be to remove ratios completely (when they can be calculated from other data: compression, capacity, etc.) or to report them "as is" (i.e. to keep the mental model close to what the CLI reports). Not all ratios are between 0 and 1, nor do they need to be. Whoever uses those ratios will probably know what to do with them (including transformations). 😉
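For readers following along, a plausible reconstruction of the transform from the description above (not the exporter's verbatim source):

// Assumed behaviour of transformMultiplier, per the discussion: a CLI
// multiplier r >= 1 (e.g. compressratio 4.26x) is renormalized to the
// fraction of logical data "compressed away".
func transformMultiplier(r float64) float64 {
	return 1 - 1/r // 4.26 -> ~0.765
}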

Add more usedby properties

It would be awesome to have access to usedbysnapshots and usedbychildren. I have no Go experience, but if you guide me a little I can try to make a pull request.
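Note that the --properties flags documented above take an arbitrary comma-separated list, so if the running version accepts these property names, something like the following may already work:

zfs_exporter --properties.dataset-filesystem="available,used,usedbydataset,usedbysnapshots,usedbychildren"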

Question - Problems Adding Pools

I have an issue filtering/adding the pools I want to collect data from. I think I might be using the flag incorrectly. From the Readme I understood the usage like this:

    --pool="fastpool-1" \
    --pool="slowpool-1" \

Exporter does not recover from problems calling the zpool binary

When the collector fails to call the local binaries, it does not export any metrics indicating the problem and does not recover on its own.

I'm hitting https://github.com/pdf/zfs_exporter/blob/master/collector/zfs.go#L124-L128, where getPools fails for some reason, e.g. where the zpool binary is missing or exits with a nonzero code. The former happened while switching between zfs-kmod and zfs-dkms repositories on a RHEL-based distribution; the latter occurred in a pacemaker-controlled cluster where pools move between machines (setup roughly based on zfs-ha), which may sometimes cause a race condition in which zpool seems to exit 1. This is not easy to reproduce, but one can simply remove the zpool binary, e.g. by starting with an empty $PATH like so:

PATH= ./zfs_exporter

Error messages I got were:

  • caller=zfs.go:126 msg="Error finding pools" err="exit status 1" (pacemaker-cluster)
  • caller=zfs.go:126 msg="Error finding pools" err="exec: \"zpool\": executable file not found in $PATH" (empty PATH)

In both cases, the exporter won't export anything but the basic go-metrics and its own build version, so the scrape itself still works and the only indicator that something's happened is the sudden absence of metrics, including the zfs_scrape_collector_success metrics. The only way I've found that resolves this is to restart the exporter.
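Until the exporter surfaces the failure itself, a hedged mitigation is to alert on the disappearance of that success metric, e.g. with a Prometheus rule like this (group and alert names are placeholders):

groups:
  - name: zfs_exporter
    rules:
      - alert: ZfsCollectorMetricsMissing
        expr: absent(zfs_scrape_collector_success)
        for: 5m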

Error parsing Pool fragmentation

Hi,

With --collector.pool enabled, on ZoF or ZoL (before OpenZFS 2) I get:

level=error ts=2023-01-27T16:10:06.365Z caller=zfs.go:219 msg="Executing collector" status=error collector=pool durationSeconds=0.005495284 err="strconv.ParseFloat: parsing \"0%\": invalid syntax"
level=error ts=2023-01-27T16:11:31.411Z caller=zfs.go:219 msg="Executing collector" status=error collector=pool durationSeconds=0.005649248 err="strconv.ParseFloat: parsing \"1.00x\": invalid syntax"

because:

# zpool get -Hpo name,property,value dedupratio,fragmentation
backup	dedupratio	1.00x
backup	fragmentation	0%

No problem on OpenZFS 2.

I propose documenting the minimum supported ZFS version in the README, or adding support for older versions.
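Should support for older versions be added, a minimal sketch (a hypothetical helper using the standard strconv and strings packages, not the exporter's actual code) could strip the unit suffixes before parsing:

// parseRatio tolerates pre-OpenZFS-2 output such as "1.00x" and "0%" by
// trimming any trailing unit characters before parsing the number.
func parseRatio(s string) (float64, error) {
	return strconv.ParseFloat(strings.TrimRight(s, "x%"), 64)
}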

Thanks

Inconsistent results with zfs list

Hi, I'm comparing the results of the metrics with the result of zfs list -o name,used,avail,refer,usedds,usedsnap,lused and I don't know why the results are different.

# zfs list -o name,used,avail,refer,usedds,usedsnap,lused 
NAME          USED  AVAIL     REFER  USEDDS  USEDSNAP  LUSED
main/lyz     61.5G  41.6G     37.4G   37.4G     24.1G  86.9G

The dataset main/lyz doesn't have any children.

So far I think the correspondence between the metrics exposed by the exporter and the zfs list columns is:

  • USED: zfs_dataset_used_bytes, shows 66.02G instead of 61.5G
  • AVAIL: zfs_dataset_available_bytes{name="main/lyz",type="filesystem"}, shows 44.5G instead of 41.6G
  • LUSED: zfs_dataset_logical_used_bytes{type="filesystem"}, shows 93.34G instead of 86.9G
  • USEDDS: zfs_dataset_used_by_dataset_bytes{type="filesystem"}, shows 40.11G instead of 37.4G
  • USEDSNAP: zfs_dataset_used_bytes - zfs_dataset_used_by_dataset_bytes shows 25G instead of 24.1G. This doesn't work for datasets that have children. An alternative metric could be sum by (hostname,filesystem) (zfs_dataset_used_bytes{type='snapshot'}) which shows 13.1G.

What can be the reason behind these differences? I've repeated the commands over time and the results are the same, as the data in the dataset has not changed between runs.
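For what it's worth, the figures are consistent with a units mismatch: zfs list prints binary (GiB) values, while dividing the exporter's byte counts by 10^9 yields decimal GB. A quick check in Go:

// 61.5 GiB expressed in decimal GB: prints 66.035..., closely matching the
// observed 66.02G (the remaining gap is within zfs list's rounding).
fmt.Println(61.5 * (1 << 30) / 1e9)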

Official container image for this exporter?

Hi,

I just noticed that this exporter provides no official container image, and I wasn't sure if I could grab this task.

Although it might not make sense to run this exporter in a container at first glance, it seems to be possible when running a container with ZFS installed in privileged mode; mounting /dev/zfs alone did not do the trick.
I would assume this is mostly safe, as the commands executed in https://github.com/pdf/zfs_exporter/blob/master/zfs/pool.go are read-only, but I don't know for sure what is happening in the background.
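For concreteness, a hypothetical invocation along those lines (the image name is a placeholder for an image with the ZFS userland installed; this is not an official image):

docker run --privileged -p 9134:9134 example/zfs_exporter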

Did you think about this already and is there maybe a good reason to not run the exporter in a container?

Thanks!

Disabling go metric collection

Is there a way to not collect certain metrics, like go metrics?

# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
...

I think this is possible in Node Exporter (see prometheus/node_exporter#1148).
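The flag list above includes exactly this: --web.disable-exporter-metrics excludes the promhttp_*, process_* and go_* series.

zfs_exporter --web.disable-exporter-metrics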

Can we have more dataset properties?

I was hoping there's a way to add more dataset properties when scraping. Especially compressratio, usedby*, and snapshot_count would be of value to me.

Mucking about in the code I was hoping to get to something like this:

type datasetCollector struct {
	kind string
	// all datasets
	compressionRatio       desc
	logicalReferencedBytes desc
	logicalUsedBytes       desc
	usedBytes              desc
	refCompressionRatio    desc
	referencedBytes        desc
	writtenBytes           desc
	// volumes only
	volumeSizeBytes desc
	// filesystems only
	quotaBytes desc
	// volumes and filesystems only (i.e. no snapshots)
	availableBytes            desc
	usedByChildrenBytes       desc
	usedByDatasetBytes        desc
	usedByRefReservationBytes desc
	usedBySnapshotsBytes      desc
	snapshotCount             desc
}

I'm not fluent in Go, so I'm unfamiliar with (and slow to understand) certain things: is it correct that the ZFS library used has the current set of properties hard-coded, with no way of extending the list? 😢
I'm assuming this from dsPropListOptions in https://github.com/mistifyio/go-zfs/blob/d3ead99562a85f25248b082c86644ef61e6798f0/utils.go#L287, which looks "private" and "static".

failed under supervisor

Hello

I've written a fairly simple supervisor config:

[program:zfs_exporter]
command=/usr/local/bin/zfs_exporter 
user=nobody

and it starts, but fails upon the first request:

ts=2022-01-15T15:04:13.594Z caller=zfs.go:126 level=error msg="Error finding pools" err="exit status 1"

Without supervisor, it runs just fine.

I can't even imagine what I've managed to hit. Hoping for any advice.
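Per the Caveats section above, one hedged guess is that user=nobody lacks the privileges needed to run the ZFS tools; a variant worth trying:

[program:zfs_exporter]
command=/usr/local/bin/zfs_exporter
; the Caveats above note the collector may need root on some platforms
user=root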

Include Mountpoint

Could you include dataset mountpoint as a label or as a separate metric?

context deadline exceeded

I see many logs like this:

ts=2024-01-23T09:46:48.905Z caller=zfs.go:229 level=warn msg="Executing collector" status=delayed collector=dataset-volume durationSeconds=0.24057311 err="context deadline exceeded"
ts=2024-01-23T09:46:48.911Z caller=zfs.go:229 level=warn msg="Executing collector" status=delayed collector=pool durationSeconds=0.247038288 err="context deadline exceeded"
ts=2024-01-23T09:46:48.934Z caller=zfs.go:229 level=warn msg="Executing collector" status=delayed collector=dataset-filesystem durationSeconds=0.269771076 err="context deadline exceeded"

What's the meaning of those messages? The --deadline option is set to the default 8s.

Full command and output:

./zfs_exporter --web.listen-address 127.0.0.1:9134 --properties.dataset-filesystem="available,logicalused,quota,referenced,used,usedbydataset,written,usedsnap,logicalreferenced,usedchild,compressratio" --exclude='/docker' --deadline=8s
ts=2024-01-23T09:45:21.298Z caller=zfs_exporter.go:39 level=info msg="Starting zfs_exporter" version="(version=2.3.2, branch=master, revision=30f2d5a069cfff3021f498e8bf4bf2cf82097af0)"
ts=2024-01-23T09:45:21.298Z caller=zfs_exporter.go:40 level=info msg="Build context" context="(go=go1.20.10, platform=linux/amd64, user=root@c4294b1b2f1d, date=20231013-21:14:42, tags=netgo)"
ts=2024-01-23T09:45:21.299Z caller=dataset.go:179 level=warn msg="Unsupported dataset property, results are likely to be undesirable" help="Please file an issue at https://github.com/pdf/zfs_exporter/issues" collector=filesystem property=usedsnap err="unsupported property"
ts=2024-01-23T09:45:21.299Z caller=dataset.go:179 level=warn msg="Unsupported dataset property, results are likely to be undesirable" help="Please file an issue at https://github.com/pdf/zfs_exporter/issues" collector=filesystem property=usedchild err="unsupported property"
ts=2024-01-23T09:45:21.299Z caller=zfs_exporter.go:66 level=info msg="Enabling pools" pools=(all)
ts=2024-01-23T09:45:21.299Z caller=zfs_exporter.go:75 level=info msg="Enabling collectors" collectors="pool, dataset-filesystem, dataset-volume"
ts=2024-01-23T09:45:21.299Z caller=tls_config.go:274 level=info msg="Listening on" address=127.0.0.1:9134
ts=2024-01-23T09:45:21.299Z caller=tls_config.go:277 level=info msg="TLS is disabled." http2=false address=127.0.0.1:9134
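Per the deadline semantics described under Usage, status=delayed means the collection could not complete within --deadline, so cached data was served while the run continued in the background. If fresher data is needed, one option is to raise the deadline while keeping it below the Prometheus scrape_timeout, e.g.:

./zfs_exporter --deadline=15s   # keep this below the job's scrape_timeout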
