tarantool / grafana-dashboard
Dashboard for Tarantool application and database server monitoring with Grafana
License: MIT License
The example app is luatest code that both creates the cluster and generates load so that the graphs are non-empty. It would be better to have the example cluster and the load generator separately. For example, the cluster could be bootstrapped with cartridge-cli (#34).
It would be helpful to have some event loop information on the dashboard.
Blocked by tarantool/metrics#123; may be partially solved by #72.
The Telegraf config in the documentation (and in the example cluster) lacks the label_pairs_level tag needed to pass the cartridge_issues level label to InfluxDB.
Add CPU panels based on tarantool/metrics#218 feature
It is possible to separate docker-compose file sections with the --profile flag: https://docs.docker.com/compose/reference/#use---profile-to-specify-one-or-more-active-profiles. Maybe it can be used to reduce copy-paste in the project and better organize the Docker infrastructure.
The default InfluxDB policy is used on all panels now. This should be reworked so that the policy is configurable (preferably on import, like measurement).
The cluster issues metric is important for the cluster overview. For example, the original "issues" button is placed at the top of the Cartridge UI. I think it should be added as part of the cluster overview panels.
HTTP is not the only way of interacting with Tarantool instances. Binary protocol connections (iproto) are also popular. You can monitor them with the tnt_net_ group of metrics. We should decide which tnt_net_ metrics are most helpful and then add them as panels to the dashboard.
Luatest is not a tool created for bootstrapping a configured cluster (with both roles and some clusterwide config), but it can do it while cartridge-cli can't. We should consider moving the example cluster to a cartridge-cli bootstrap once the CLI is able to do this.
Add CHANGELOG.md and describe all previously added features in it
Use (and maybe improve) the code from grafana/grafonnet-lib#223 to build the dashboard. It would be convenient to have two different dashboards (like in #26).
It is hard for beginners to understand how to build their own dashboard. The source code is a bit complicated, and there is no convenient way to add custom panels to an existing dashboard; users need to copy-paste the entire dashboard. This approach is also inconvenient if one wants to add their own panels to the tail of a dashboard and get all source dashboard updates automatically.
We need to add a guide on building custom panels (maybe in the form of an article) and support adding custom panels to an already existing dashboard in code.
Consider reworking fill(none) to fill(null) on InfluxDB dashboard panels
The dashboard should be described somewhere. Right now, one has to compile the dashboard or read through the code to find out what it consists of. The description should be illustrated with screenshots.
Since #59 we have a set of Prometheus alert rules. But it is just an example yml file with some comments. It would be more convenient for customers who are interested in "how to set up alerts for a Tarantool cluster" to have a documentation page describing what to monitor and how. Of course, we may base it on the #59 results.
Lua memory is one of the most important metrics among the Tarantool default metrics. At the very least, it is the one that should be covered by an alert because of the strict 2 GB limit per instance.
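As a rough illustration of such an alert condition, the sketch below compares Lua memory usage against the 2 GB limit. It is not taken from the existing alert rules: the function name and the 50% warning ratio are assumptions chosen for the example, and the limit is interpreted here as 2 GiB.

```python
# The strict per-instance Lua memory limit mentioned above (taken as 2 GiB).
LUA_MEMORY_LIMIT_BYTES = 2 * 1024**3


def lua_memory_alert(used_bytes: int, warn_ratio: float = 0.5) -> bool:
    """Return True when Lua memory usage reaches warn_ratio of the limit.

    warn_ratio is a hypothetical threshold, not a recommended value.
    """
    return used_bytes / LUA_MEMORY_LIMIT_BYTES >= warn_ratio


print(lua_memory_alert(512 * 1024**2))   # False: 0.5 GiB is 25% of the limit
print(lua_memory_alert(1536 * 1024**2))  # True: 1.5 GiB is 75% of the limit
```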
Currently select-insert-delete panels are present, but call is not
Consider how cluster overview panels (#19) could be added to the InfluxDB dashboard. https://habr.com/ru/company/raiffeisenbank/blog/490764/ may help
There is a rumor that Prometheus.io has a section for user configurations that we could use to publish our alert rules. We should check this and publish them there if it's possible.
We need some kind of cluster overview panel.
The metrics module supports Graphite. We should provide an example with a Graphite monitoring stack, similar to the Prometheus and InfluxDB ones.
Check which fiber metrics may be useful and add them to the dashboard.
If the Prometheus scrape interval is 1m or more, a rate() computation over a 1-minute range vector will fail:
rate(tnt_stats_op_total{job=~\"[[job]]\",operation=\"upsert\"}[1m])
The scrape interval is 1m by default, so we should increase the default range vector interval to at least 2m ("at the very minimum it should be two times the scrape interval", see https://www.metricfire.com/blog/understanding-the-prometheus-rate-function/?GAID=231802381.1611061565&GAID=231802381.1611061565) and make it configurable.
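The "two times the scrape interval" rule can be checked with a little arithmetic. The sketch below is only an illustration and not part of the dashboard: it assumes perfectly regular scrapes and a half-open lookback window, and the function names are hypothetical. The key fact is that rate() needs at least two samples inside the range to return a value.

```python
def guaranteed_samples(window_s: int, scrape_interval_s: int) -> int:
    """Fewest samples that can land in a half-open window of window_s seconds.

    Assumes scrapes arrive exactly every scrape_interval_s seconds;
    real scrapes jitter, so this is an optimistic bound.
    """
    return window_s // scrape_interval_s


def rate_window_is_safe(window_s: int, scrape_interval_s: int) -> bool:
    """rate() needs at least two samples in the range to return a value."""
    return guaranteed_samples(window_s, scrape_interval_s) >= 2


# With a 1m scrape interval, a [1m] range may hold a single sample,
# while a [2m] range always holds two.
print(rate_window_is_safe(60, 60))   # False
print(rate_window_is_safe(120, 60))  # True
```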
It's impossible to build JSON with a fixed datasource, since all dashboards choose the target type based on a template variable value (see grafana-dashboard/dashboard/common.libsonnet, lines 75 to 88 in 845e88c).
If someone decides to build a compiled dashboard with the myinflux datasource, it won't work.
The current version of the dashboard monitors only memtx engine metrics. It would be a good idea to support the tarantool/metrics#178 metrics in the dashboard.
Since the move to the metrics role (#10), the alias should be set through env, but now it's missing.
The default dashboard doesn't work as it should.
Replication status and replication lag are important information about cluster state. There are some replication metrics among the Tarantool default metrics, but they may not be sufficient. We should study replication to find out which metrics are needed to monitor it and then (maybe after reworking the current replication metrics) add them as panels to the dashboard.
Build dashboard with Prometheus datasource based on existing panels.
example_project_1 | .///cluster/integration/bootstrap_test.lua:110: attempt to index field 'cluster' (a nil value)
example_project_1 | stack traceback:
example_project_1 | /app/.rocks/share/tarantool/luatest/capture.lua:139: in function '__index'
example_project_1 | .///cluster/integration/bootstrap_test.lua:110: in main chunk
example_project_1 | [C]: in function 'require'
example_project_1 | /app/.rocks/share/tarantool/luatest/loader.lua:43: in function 'load_tests'
example_project_1 | /app/.rocks/share/tarantool/luatest/runner.lua:25: in function </app/.rocks/share/tarantool/luatest/runner.lua:14>
example_project_1 | [C]: in function 'xpcall'
example_project_1 | /app/.rocks/share/tarantool/luatest/utils.lua:32: in function 'load_tests'
example_project_1 | /app/.rocks/share/tarantool/luatest/runner.lua:56: in function </app/.rocks/share/tarantool/luatest/runner.lua:40>
example_project_1 | [C]: in function 'xpcall'
example_project_1 | /app/.rocks/share/tarantool/luatest/runner.lua:40: in function 'fn'
example_project_1 | /app/.rocks/share/tarantool/luatest/sandboxed_runner.lua:14: in function 'run'
example_project_1 | /app/.rocks/share/tarantool/luatest/cli_entrypoint.lua:4: in function </app/.rocks/share/tarantool/luatest/cli_entrypoint.lua:3>
example_project_1 | ....rocks/share/tarantool/rocks/luatest/0.5.0-1/bin/luatest:3: in main chunk
grafana-dashboard_example_project_1 exited with code 255
The s (seconds) format of the CPU time panels is a bit confusing. It may be more convenient for users to inspect a panel with a percentage format.
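The conversion from CPU seconds to a percentage is simple arithmetic: CPU time consumed between two samples, divided by the wall-clock time between them. The sketch below illustrates this under the assumption that the metric is a cumulative CPU-seconds counter; the function and parameter names are hypothetical and not taken from the dashboard code.

```python
def cpu_percent(cpu_s_start: float, cpu_s_end: float,
                wall_s_start: float, wall_s_end: float) -> float:
    """Share of one CPU core used between two samples, in percent."""
    elapsed = wall_s_end - wall_s_start
    if elapsed <= 0:
        raise ValueError("samples must be ordered in time")
    return 100.0 * (cpu_s_end - cpu_s_start) / elapsed


# 3 seconds of CPU time consumed over a 60-second interval.
print(cpu_percent(10.0, 13.0, 0.0, 60.0))  # 5.0
```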
Dashboard ver. 0.2.0 requires Tarantool to use metrics 0.5.0, but it's not described explicitly. Part of #2
Run tests with CI container. Maybe we also could try artifacts build
It would be nice to add dashboard documentation (setup, usage) to tarantool.io
Maybe something from the README and something from https://habr.com/ru/company/mailru/blog/534826/
Maybe the documentation should be placed in the tarantool/metrics repo with the rest of the monitoring documentation
There is no health check in either the dashboard or the default Cartridge/metrics tools. A health check is required for the overview panels.
Proposition: "if instance is sending any metrics, it is alive; otherwise, instance is not running (or something else is unhealthy)".
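The proposition can be sketched as a staleness check on the time each instance last reported metrics. This is only an illustration: the function name, the instance aliases, and the 60-second silence threshold are assumptions made for the example.

```python
from typing import Dict


def instance_health(last_seen: Dict[str, float], now: float,
                    max_silence_s: float = 60.0) -> Dict[str, bool]:
    """Mark an instance alive if it sent metrics within max_silence_s seconds.

    last_seen maps an instance alias to the timestamp of its latest metrics.
    """
    return {alias: (now - ts) <= max_silence_s
            for alias, ts in last_seen.items()}


# Hypothetical instances: one reported 5 seconds ago, one 100 seconds ago.
last_seen = {"router": 995.0, "storage_1": 900.0}
print(instance_health(last_seen, now=1000.0))
# {'router': True, 'storage_1': False}
```

An instance that has never sent metrics would not appear in last_seen at all, so a real check would also need the list of expected instances.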
We should inspect the requirements for getting into https://grafana.com/grafana/dashboards and publish the artifact on the Grafana Official & community built dashboards list.
Check which luajit metrics may be useful and add them to the dashboard.
There is currently no way to share access for editing and publishing the dashboard on Grafana Official & community built dashboards. We need to create some kind of collaborative account so that our team can edit and upload new versions of the dashboard.
It is an annoying and monotonous job to make screenshots of the new dashboard each time. Maybe there are tools to do this with a script.
Rework Tarantool docker example on Cartridge metrics role (metrics 0.3.0) and check that current dashboard works fine with role. Fix panels if needed.
There is an example (https://github.com/tarantool/grafana-dashboard/tree/master/example) in our repo. It's not trivial to understand how it works for a person running Grafana and Prometheus for the first time.
First of all, the "project" application doesn't expose any metrics.
The user needs to manually add the following to init.lua:
-- Expose metrics in Prometheus format at /metrics/prometheus
local metrics = require('cartridge.roles.metrics')
metrics.set_export({
    {
        path = '/metrics/prometheus',
        format = 'prometheus',
    }
})
Secondly, it's not obvious how to install and run Prometheus and Grafana.
I think what's missing is a simple readme.md file that says:
brew install prometheus
brew install grafana
brew services start grafana # Grafana will be available on :3000
prometheus --config.file="prometheus/prometheus.yml" # Prometheus will be available on :9090
# Then verify the "Status" -> "Targets" page in the Prometheus UI.
Also, it would be better to change the target URLs from "example_project:" to "localhost:", because the user will typically run the application with cartridge start, which starts the applications on localhost.
Please add a replicasets.yml file to the test project. It would save the user some extra steps: all they would need to do is call cartridge replicasets setup.
Finally, it would be great to add a short description of how to import the dashboard from grafana.com. It is easier if the user understands what the "job" option means. That wasn't so in my case, and I spent some time figuring out what was wrong.
One of the questions customers ask most often is "What thresholds/alert rules should I use to monitor Tarantool?". Most of them ask about it in terms of Prometheus alerting rules.
Documentation: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
It would be convenient to publish new versions of the dashboard to the Grafana Official & community built dashboards page with some sort of CI/CD, if that is possible.
It would be convenient to have multiple instances running in the default cluster for testing, e.g., the future panel #19.
The documentation translation process should be automated. We need to implement a set of tasks at each stage of tech writer's workflow to maintain 100% translation for each repository containing documentation.