tarantool / grafana-dashboard


Dashboard for Tarantool application and database server monitoring with Grafana

License: MIT License

Languages: Jsonnet 85.81%, Lua 11.10%, Shell 1.31%, Makefile 1.29%, Dockerfile 0.49%

Topics: grafana, grafonnet, jsonnet, tarantool, grafana-dashboard, prometheus

grafana-dashboard's Issues

Add LICENSE

Separate load from example app

The example app is luatest code that both creates the cluster and generates load so that the graphs are non-empty. It would be better to have an example cluster and a load generator separately. For example, the cluster could be bootstrapped with cartridge-cli (#34).

Monitor event loop

It would be helpful to have some event loop information on the dashboard.

Blocked by tarantool/metrics#123; may be partially solved by #72.

Missing `level` label in example Telegraf config

The Telegraf config in the documentation (and in the example cluster) lacks the label_pairs_level tag needed to pass the cartridge_issues level label to InfluxDB.

Add CPU panels

Add CPU panels based on tarantool/metrics#218 feature

Use single docker-compose file

It is possible to separate docker-compose file sections with the --profile flag: https://docs.docker.com/compose/reference/#use---profile-to-specify-one-or-more-active-profiles . Maybe it can be used to reduce copy-paste in the project and to organize the docker infrastructure better.

Add configurable InfluxDB policy

The default InfluxDB policy is used on all panels now. This should be reworked into a configurable policy (preferably set on import, like measurement).

Add cluster issues panel

The cluster issues metric is important for a cluster overview. For example, the original "issues" button is placed at the top of the Cartridge UI. I think it should be added as part of the cluster overview panels.

Add network memory panels

HTTP is not the only way of interacting with Tarantool instances. Binary protocol connections (iproto) are also popular. You can monitor them with the group of tnt_net_ metrics. We should decide which tnt_net_ metrics are most helpful and then add them as panels to the dashboard.

Bootstrap example cluster with cartridge-cli

Luatest is not a tool designed for bootstrapping a configured cluster (with both roles and some clusterwide config), but it can do it, while cartridge-cli can't. We should consider moving to cartridge-cli bootstrap in the example cluster once the CLI is able to do this.

Add CHANGELOG.md

Add CHANGELOG.md and describe all previously added features in it

Introduce grid generation

Use (and maybe improve) the code from grafana/grafonnet-lib#223 to build up the dashboard. It would be convenient to have two different dashboards (like in #26).

Add customization guide

It is hard for beginners to understand how to build their own dashboard. The source code is a bit complicated, and there is no convenient way to add custom panels to an existing dashboard -- users need to copy-paste the entire dashboard. This is also inconvenient if one wants to add their own panels to the tail of a dashboard and get all source dashboard updates automatically.

We need to add a guide on building custom panels (maybe in the form of an article) and support adding custom panels to an already existing dashboard in code.

Rework fill(none) to fill(null)

Consider reworking fill(none) to fill(null) on InfluxDB dashboard panels

Add some illustrations and descriptions

The dashboard should be described somewhere. Currently one has to compile the dashboard or read through the code to find out what it consists of. The description should be illustrated with screenshots.

Alert rules documentation

Since #59 we have a set of Prometheus alert rules, but it is just an example yml file with some comments. It would be more convenient for customers who are interested in "how to set up alerts for a Tarantool cluster" to have a documentation page describing what to monitor and how. Of course, we may base it on the #59 results.

Add Lua memory panel

Lua memory is one of the most important metrics among the Tarantool default metrics. At the very least, it is the one that should be covered by an alert because of the strict 2 GB limit per instance.

Add box.stat().CALL panel

Currently select, insert and delete panels are present, but call is not.

Cluster overview panels for InfluxDB

Consider how to add cluster overview panels (#19) to the InfluxDB dashboard. https://habr.com/ru/company/raiffeisenbank/blog/490764/ may help.

Publish alerts on Prometheus.io

There is a rumor that Prometheus.io has a section with user configurations where we could publish our alert rules. We should check it and publish them there if it's possible.

Cluster overview panel

We need some kind of cluster overview panel.

Provide Graphite example

metrics module supports Graphite. We should provide an example with Graphite monitoring stack, similar to Prometheus and InfluxDB ones.

Add fiber panels

Check which fiber metrics may be useful and add them to the dashboard.

Fix Prometheus rps computation

If the Prometheus scrape_interval is 1m or more, a rate() computation over a 1-minute range vector will fail, because the range will not contain the two samples that rate() needs:

rate(tnt_stats_op_total{job=~"[[job]]",operation="upsert"}[1m])

scrape_interval is 1m by default, so we should increase the default range vector interval to at least 2m ("at the very minimum it should be two times the scrape interval" -- https://www.metricfire.com/blog/understanding-the-prometheus-rate-function/?GAID=231802381.1611061565&GAID=231802381.1611061565) and make it configurable.
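The failure mode can be illustrated with a minimal Python sketch. It assumes a simplified model of rate() that needs at least two samples inside a left-open range window and ignores the extrapolation real Prometheus performs; `simple_rate`, `min_rate_window` and the sample data are hypothetical and exist only for illustration.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample], now: float, window: float) -> Optional[float]:
    """Per-second increase over the samples that fall inside (now - window, now].

    Simplified model (assumption): returns None when fewer than two samples
    are inside the window, which is the situation described in the issue.
    """
    in_window = [(t, v) for t, v in samples if now - window < t <= now]
    if len(in_window) < 2:
        return None
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


def min_rate_window(scrape_interval: float, factor: float = 2.0) -> float:
    """Smallest range window suggested by the 'two times the scrape interval' rule."""
    return scrape_interval * factor


# Illustrative counter scraped every 60 seconds, growing by 120 per scrape.
scraped = [(0.0, 0.0), (60.0, 120.0), (120.0, 240.0), (180.0, 360.0)]

rate_1m = simple_rate(scraped, now=180.0, window=60.0)                   # one sample in the window
rate_2m = simple_rate(scraped, now=180.0, window=min_rate_window(60.0))  # two samples in the window
```

With a 60-second scrape interval the 1-minute window holds a single sample, so no rate can be computed, while the 2-minute window yields a value.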

Consider reworking README screenshots width

README screenshots look fine when README.md is expanded, but a bit ugly on the repo page.

Build dashboard with pre-defined data source

It's impossible to build JSON with a fixed datasource, since all dashboards choose the target type based on the template variable value:

grafana-dashboard/dashboard/common.libsonnet

Lines 75 to 88 in 845e88c

    if datasource == '${DS_PROMETHEUS}' then
      prometheus.target(
        expr=std.format('rate(%s{job=~"%s"}[%s])',
                        [metric_name, job, rate_time_range]),
        legendFormat='{{alias}}',
      )
    else if datasource == '${DS_INFLUXDB}' then
      influxdb.target(
        policy=policy,
        measurement=measurement,
        group_tags=['label_pairs_alias'],
        alias='$tag_label_pairs_alias',
      ).where('metric_name', '=', metric_name)
      .selectField('value').addConverter('mean').addConverter('non_negative_derivative', ['1s']),

If someone decides to build a compiled dashboard with a datasource named myinflux, it won't work.
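One possible direction, sketched in Python rather than Jsonnet, is to pick the target kind from an explicit datasource type instead of comparing the datasource name with a template variable. This is only an illustration of the idea: `build_target` and the shape of the returned dictionaries are hypothetical and are not the grafonnet API or the repository's code.

```python
def build_target(datasource_name: str, datasource_type: str, metric_name: str,
                 job: str = "tarantool", rate_time_range: str = "2m",
                 policy: str = "default", measurement: str = "tarantool_http") -> dict:
    """Choose the query kind by datasource type, so any datasource name works.

    All default values above are placeholders chosen for the example.
    """
    if datasource_type == "prometheus":
        return {
            "datasource": datasource_name,
            # Mirrors the std.format() call from the snippet above.
            "expr": 'rate(%s{job=~"%s"}[%s])' % (metric_name, job, rate_time_range),
            "legendFormat": "{{alias}}",
        }
    if datasource_type == "influxdb":
        return {
            "datasource": datasource_name,
            "policy": policy,
            "measurement": measurement,
            "group_tags": ["label_pairs_alias"],
            "alias": "$tag_label_pairs_alias",
            "where": [("metric_name", "=", metric_name)],
            "select": ["value", "mean", ("non_negative_derivative", "1s")],
        }
    raise ValueError("unsupported datasource type: %s" % datasource_type)


# A fixed datasource name no longer has to match a template variable.
influx_target = build_target("myinflux", "influxdb", "tnt_stats_op_total")
prom_target = build_target("myprom", "prometheus", "tnt_stats_op_total", job="app")
```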

Add vinyl panels

The current version of the dashboard monitors only memtx engine metrics. It would be a good idea to support tarantool/metrics#178 metrics in the dashboard.

Set alias on example cluster

Since moving to the metrics role (#10), alias should be set through env, but now it's missing.

Default dashboard doesn't work as it should

Fix Prometheus in example docker cluster

Prometheus instance in example cluster can't extract Tarantool metrics because of the following error:

Monitor replication

Replication status and replication lag are important information about the cluster state. There are some replication metrics in the Tarantool default metrics, but they may not be sufficient. We should study replication to find out which metrics are needed to monitor it and then (maybe after reworking the current replication metrics) add them as panels to the dashboard.

Build dashboard with Prometheus datasource

Build dashboard with Prometheus datasource based on existing panels.

Example cluster doesn't start

example_project_1  | .///cluster/integration/bootstrap_test.lua:110: attempt to index field 'cluster' (a nil value)
example_project_1  | stack traceback:
example_project_1  | 	/app/.rocks/share/tarantool/luatest/capture.lua:139: in function '__index'
example_project_1  | 	.///cluster/integration/bootstrap_test.lua:110: in main chunk
example_project_1  | 	[C]: in function 'require'
example_project_1  | 	/app/.rocks/share/tarantool/luatest/loader.lua:43: in function 'load_tests'
example_project_1  | 	/app/.rocks/share/tarantool/luatest/runner.lua:25: in function </app/.rocks/share/tarantool/luatest/runner.lua:14>
example_project_1  | 	[C]: in function 'xpcall'
example_project_1  | 	/app/.rocks/share/tarantool/luatest/utils.lua:32: in function 'load_tests'
example_project_1  | 	/app/.rocks/share/tarantool/luatest/runner.lua:56: in function </app/.rocks/share/tarantool/luatest/runner.lua:40>
example_project_1  | 	[C]: in function 'xpcall'
example_project_1  | 	/app/.rocks/share/tarantool/luatest/runner.lua:40: in function 'fn'
example_project_1  | 	/app/.rocks/share/tarantool/luatest/sandboxed_runner.lua:14: in function 'run'
example_project_1  | 	/app/.rocks/share/tarantool/luatest/cli_entrypoint.lua:4: in function </app/.rocks/share/tarantool/luatest/cli_entrypoint.lua:3>
example_project_1  | 	....rocks/share/tarantool/rocks/luatest/0.5.0-1/bin/luatest:3: in main chunk
grafana-dashboard_example_project_1 exited with code 255

CPU time panels format

The s (seconds) format of the CPU time panels is a bit confusing. It may be more convenient for users to inspect a panel with a percentage format.
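The conversion from a CPU time counter to a percentage can be sketched as follows. This is a minimal Python illustration, assuming the metric is a monotonically growing counter of CPU seconds; `cpu_percent` and the sample values are hypothetical.

```python
from typing import Tuple

Sample = Tuple[float, float]  # (wall-clock timestamp in seconds, CPU seconds consumed so far)


def cpu_percent(earlier: Sample, later: Sample) -> float:
    """Share of one CPU core used between two samples of a CPU time counter."""
    wall_delta = later[0] - earlier[0]
    if wall_delta <= 0:
        raise ValueError("samples must be ordered in time")
    cpu_delta = later[1] - earlier[1]
    return 100.0 * cpu_delta / wall_delta


# 3 CPU seconds consumed over 60 wall-clock seconds.
usage = cpu_percent((0.0, 10.0), (60.0, 13.0))
```

Here 3 CPU seconds over a 60-second interval corresponds to 5% of one core, which is likely easier to read on a panel than a raw seconds value.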

Describe Tarantool requirements

Dashboard ver. 0.2.0 requires Tarantool to use metrics 0.5.0, but it's not described explicitly. Part of #2

Rework average latency panels to percentiles

Move to the new metrics version and use percentiles instead of average latency, since the latter is deprecated (and a bit broken). The average collector also breaks Prometheus metrics (#13).
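Why percentiles are more informative than an average can be shown with a small Python sketch. It uses a plain nearest-rank percentile over made-up latency samples; it is not how the metrics module computes its quantiles, and `percentile` and the data are assumptions for illustration only.

```python
import math
from typing import List


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100.0))
    return ordered[rank - 1]


# Illustrative data: 95 fast requests and 5 very slow ones (milliseconds).
latencies_ms = [10.0] * 95 + [1000.0] * 5

average = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

The average (59.5 ms) describes neither the typical request (10 ms) nor the slow tail (1000 ms), whereas the 50th and 99th percentiles show both.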

Add simple CI with tests

Run tests in a CI container. Maybe we could also try building artifacts.

Clean up code

Remove traces of old code: https://github.com/tarantool/grafana-dashboard/blob/master/tarantool/space.libsonnet#L2 , https://github.com/tarantool/grafana-dashboard/blob/master/tarantool/slab.libsonnet#L2

Documentation on tarantool.io

It would be nice to add dashboard documentation (setup, usage) to tarantool.io

Maybe something from the README and something from https://habr.com/ru/company/mailru/blog/534826/

Maybe the documentation should be placed in the tarantool/metrics repo with the rest of the monitoring documentation

Health check

There is no health check in either the dashboard or the default Cartridge/metrics tools. A health check is required for overview panels.

Proposition: "if instance is sending any metrics, it is alive; otherwise, instance is not running (or something else is unhealthy)".
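The proposition can be sketched in Python as a staleness check over the time each instance last reported a metric. The names `instance_health` and `max_silence`, the threshold value and the sample data are all hypothetical; the threshold in particular would have to be chosen relative to the metrics collection interval.

```python
from typing import Dict


def instance_health(last_seen: Dict[str, float], now: float,
                    max_silence: float = 120.0) -> Dict[str, bool]:
    """Mark an instance alive if it sent any metric within the last max_silence seconds."""
    return {alias: (now - ts) <= max_silence for alias, ts in last_seen.items()}


# Timestamps (seconds) of the most recent metric received from each instance.
last_seen = {"router": 995.0, "storage_1": 990.0, "storage_2": 700.0}
health = instance_health(last_seen, now=1000.0)
```

In this example storage_2 has been silent for 300 seconds and is reported as unhealthy, while the other two instances are considered alive.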

Overall load panels are red on low load

If the load is low, overall load panels show values in red. I think it's not a good idea to scare the user like that, since this may be expected (for example, when the HTTP handle is only used to control the state of an app).

Release on Grafana Official & community built dashboards

We should inspect the requirements to get into https://grafana.com/grafana/dashboards and publish the artifact on the Grafana Official & community built dashboards list.

Support luajit monitoring

Check which luajit metrics may be useful and add them to the dashboard.

Release on Grafana Dashboards with collaborative account

There is currently no way to share edit and publish access to a dashboard in Grafana Official and community built dashboards. We need to create some kind of collaborative account so that our team can edit and upload new versions of the dashboard.

Autoscreenshooting tool

Making screenshots of a new dashboard each time is an annoying and monotonous job. Maybe there are tools to do this with a script.

Rework Tarantool docker example on Cartridge metrics role

Rework the Tarantool docker example to use the Cartridge metrics role (metrics 0.3.0) and check that the current dashboard works fine with the role. Fix panels if needed.

"example" project is not helpful for beginners

There is an example (https://github.com/tarantool/grafana-dashboard/tree/master/example) in our repo. It's not trivial to understand how it works for a person who runs Grafana and Prometheus for the first time.

First of all, the "project" application doesn't expose any metrics.

The user needs to manually add the following to init.lua:

local metrics = require('cartridge.roles.metrics')
metrics.set_export({
    {
        path = '/metrics/prometheus',
        format = 'prometheus',
    }
})

Secondly, it's not obvious how to install and run Prometheus and Grafana.
I think what's missing is a simple readme.md file that says:

brew install prometheus
brew install grafana
brew services start grafana # grafana will be available on :3000
prometheus --config.file="prometheus/prometheus.yml" # prometheus will be available on :9090
# Here we need to verify "Status" -> "Targets" page.

Also, it's better to change the target URLs from "example_project:" to "localhost:", because the user will typically run the application with cartridge start, which starts applications on localhost.

Please add a replicasets.yml file to the test project. It will spare the user some extra actions: all that would be needed is to call cartridge replicasets setup.

Finally, it would be great to add a short description of the steps to import the dashboard from grafana.com. It's easier if the user understands what the "job" option means. That wasn't so in my case, and I spent some time figuring out what was wrong.

Tech writer creates PR with a documentation update (rst sources).
- (push-translation on pull_request) GitHub workflow builds po files from rst and sends them to the same name branch in Crowdin
Pull translations back
- (pull-translation on workflow_dispatch) When translation of the corresponding changes is ready, tech writer manually triggers GitHub workflow for the translated branch (PR branch)
Tech writer or someone responsible merges PR
- (upload-translation on push to master) GitHub workflow uploads merged translations from po files into the Crowdin master (or main) branch

