bahmanm / lemmy-meter

A web application that tracks the performance of Lemmy instances and presents the results visually

Home Page: https://lemmy-meter.info

License: GNU General Public License v3.0

Languages: Makefile 66.52%, Jinja 1.48%, Perl 31.48%, Dockerfile 0.52%

Topics: fediverse, lemmy, observability

lemmy-meter's Issues

Deploy new versions (tags) via Travis

Broken data link on Overall Health panel

The link points to localhost:3000

Externally embeddable gauges

Investigate if it is possible to embed the health indicator gauges for a given instance in another website, the way a usual health "badge" works.

Thanks @unruffled for bringing this up.

Investigate alerts and notifications

Explore whether viewers can sign up for notifications when their favourite instances become (partially) unavailable.

This may be helpful for admins as well.

For this to happen:

- There should be an un/subscribe form.
- lemmy-meter should be able to send e-mails, probably plenty of them.
- Reasonable alerts should be configured.

Configure alerts for slow DNS resolution

There have already been a couple of incidents in which Blackbox Exporter took a very long time (10s+) to finish the "resolve" phase.

One suspect is that the connection between the Docker daemon and the provider's nameserver can become stale 🤷. I patched the configuration to always use DNS servers outside the internal network.

However, I'd like to be alerted the next time this happens so I can start investigating right away.

Create alerting rules for latency and availability

As the first iteration, the following values and durations should be enough:

Availability:
- WARN if < 70% for > 15m
- ERROR if < 70% for > 30m
Latency:
- WARN if > 30% for > 15m
- ERROR if > 30% for > 30m
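
The thresholds above can be sketched as a small evaluator. This is only an illustration of the "condition held for longer than a duration" logic, not the project's actual Prometheus rules; the `Rule` class, the `severity` function and the sample format are hypothetical, and the issue does not say what the latency percentage is measured against, so the value is used as-is.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical container for one row of the thresholds listed in the issue."""
    threshold: float          # fraction, e.g. 0.70 for 70%
    warn_after_s: int         # breach must last longer than this for WARN
    error_after_s: int        # breach must last longer than this for ERROR
    breach_when_below: bool   # True for availability, False for latency


# Values taken from the issue text; the names are assumptions.
AVAILABILITY = Rule(0.70, 15 * 60, 30 * 60, breach_when_below=True)
LATENCY = Rule(0.30, 15 * 60, 30 * 60, breach_when_below=False)


def severity(samples: Sequence[Tuple[int, float]], rule: Rule) -> Optional[str]:
    """Return None, "WARN" or "ERROR" for (unix_ts, value) samples sorted by time."""
    breach_start = None
    for ts, value in samples:
        if rule.breach_when_below:
            breached = value < rule.threshold
        else:
            breached = value > rule.threshold
        if breached:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None    # any healthy sample resets the timer
    if breach_start is None:
        return None
    held_for = samples[-1][0] - breach_start
    if held_for > rule.error_after_s:
        return "ERROR"
    if held_for > rule.warn_after_s:
        return "WARN"
    return None
```

In a Prometheus setup the same idea is normally expressed with an alerting rule and its `for` clause, which avoids keeping the timer logic in application code.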

Integrate Alertmanager w/ Gotify

Monitor the Prometheus volume size and act accordingly

Grafana admin password should be configurable in `deploy-remote` playbook

When deploying, restart/recreate only the services impacted

Currently, during a deploy, the cluster is torn down and recreated, which is very inefficient. It should be possible to instruct lemmy-meter to restart/recreate only the changed services.

Enable Renovate

Set the default availability warning and error alert durations

Based on the admins' feedback and real-world experience, default durations of 5m to trigger the warning and 10m to trigger the error are quite reasonable.

Calculate instance availability via recording rules

Configure Alertmanager

Configure Prometheus alerts and Alertmanager to notify instance admins/communities of outages/degraded performance, eg in a Matrix channel/chat or a Discord server.

Enable load balancing for Grafana

Use volumes for Prometheus and Grafana

Currently, all the cluster nodes use bind mounts, which are not totally reliable. Use volumes for at least Prometheus and Grafana.

Additional meta tags in the header for SEO and embeddability

Follow up from a conversation in #lemmy-meter:matrix.org

Try out Kamal instead of Compose

Kamal v1.0.0 which has just been released seems to be an interesting alternative to Docker Compose. It's worth trying it out while lemmy-meter is in its early stages.

Import/export Grafana dashboards w/ zero downtime

It should be possible to transfer the changes between local lemmy-meter and lemmy-meter.info w/o requiring the cluster to be stopped.

One workflow is:

- Grab the latest dashboards from remote
- Experiment and make changes locally
- Upload the changes to remote

Even better would be to store the relevant Grafana configuration, such as data sources, users and dashboards, so that it can be versioned in git.

Integrate Alertmanager with ntfy

- Configure ntfy to run as a component in the cluster.
- Write a webhook receiver which translates the Alertmanager payload to the ntfy model.
- Configure Alertmanager to use the said receiver.
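
The translation step could look roughly like the sketch below. It is not the receiver used by lemmy-meter; the function name `alertmanager_to_ntfy`, the topic argument, and the choice of priorities and tags are assumptions. It relies only on the documented shape of the Alertmanager webhook payload (an `alerts` list whose entries carry `status`, `labels` and `annotations`) and on ntfy's JSON publishing fields (`topic`, `title`, `message`, `priority`, `tags`).

```python
from typing import Any, Dict, List


def alertmanager_to_ntfy(payload: Dict[str, Any], topic: str) -> List[Dict[str, Any]]:
    """Turn one Alertmanager webhook payload into a list of ntfy JSON messages."""
    messages = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        firing = alert.get("status") == "firing"
        name = labels.get("alertname", "alert")
        # "summary" and "description" are common annotation names, not required ones.
        text = annotations.get("summary") or annotations.get("description") or name
        messages.append({
            "topic": topic,
            "title": f"[{'FIRING' if firing else 'RESOLVED'}] {name}",
            "message": text,
            "priority": 4 if firing else 3,   # ntfy priorities range from 1 to 5
            "tags": ["warning"] if firing else ["white_check_mark"],
        })
    return messages
```

Each returned dictionary could then be sent as the JSON body of an HTTP POST to the root URL of the ntfy server.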

Tune the frequency/number of HTTP requests for the default configuration

Currently, the default configuration sends a large number of HTTP requests to each instance (~30 req/min).

Tune it down to 2-4 req/min.
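
As a rough way to reason about the target, the request rate per instance is the number of probes multiplied by how often they run. The helper names and the figure of four probes per instance below are purely hypothetical; the issue only gives the current (~30 req/min) and desired (2-4 req/min) rates.

```python
def requests_per_minute(probes_per_instance: int, scrape_interval_s: float) -> float:
    """Requests sent to one instance per minute for a given scrape interval."""
    return probes_per_instance * 60.0 / scrape_interval_s


def interval_for_rate(probes_per_instance: int, target_rpm: float) -> float:
    """Scrape interval (seconds) needed to stay at the target request rate."""
    return probes_per_instance * 60.0 / target_rpm


# Assuming 4 probes per instance (a made-up number):
current = requests_per_minute(4, 8)   # an 8s interval gives 30 req/min
slowest = interval_for_rate(4, 2)     # interval needed for 2 req/min
fastest = interval_for_rate(4, 4)     # interval needed for 4 req/min
```

Under that assumption, the target range corresponds to a scrape interval between 60 and 120 seconds.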

Allow instance owners to schedule planned downtime

As it stands, this project is very good at detecting unplanned service outages, but there is currently no way to distinguish between planned and unplanned outages.

Run matrix-webhook in the cluster

Currently, matrix-webhook, which is used for alert notifications, is run as a separate user. Move it to the same cluster as the other services to ensure fail-over and consistency.

Increase data retention period to 120 days

There are calculations and metrics which rely on at least 90 days of data being available. The current retention period is 60 days.

Endpoint to validate scheduled downtime file

It would be helpful to implement an endpoint to assist admins in validating scheduled-downtime.json.

For example:

$ curl -X GET 'https://lemmy-meter.info/.metadata/validate-json?instance=<INSTANCE>'
Invalid
<detailed error message>

Re-organise the project into subprojects

Migrate Grafana to PostgreSQL

Consider a special 5xx response to a probe as maintenance mode

Investigate whether it's possible to assume the site is in maintenance mode if it responds to probes w/ a special 5xx response such as 503.
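
A minimal sketch of how such a classification might work is shown below. The function name, the three state names and the decision to treat every other non-2xx/3xx status as "down" are assumptions made for illustration; only the idea of mapping 503 to maintenance comes from the issue.

```python
from typing import FrozenSet, Optional


def classify_probe(
    status_code: Optional[int],
    maintenance_codes: FrozenSet[int] = frozenset({503}),
) -> str:
    """Map a probe's HTTP status to "up", "maintenance" or "down".

    status_code is None when the probe got no HTTP response at all
    (for example a timeout or a DNS failure).
    """
    if status_code is None:
        return "down"
    if status_code in maintenance_codes:
        return "maintenance"
    if 200 <= status_code < 400:
        return "up"
    return "down"
```

One caveat with this approach is that a 503 can also be returned during a genuine overload, so the signal may need to be combined with something more explicit before it is trusted.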

Probe Lemmy instances from different geo regions

Cron syntax for recurring scheduled periods

Follow up from #22

In the case of the planned downtime Google sheet, there should be two new columns for cron schedules.

Scheduled downtime does not show up in metrics

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

Update dependency molecule to v24

Detected dependencies

cpanfile

cluster/downtime-processor/cpanfile

perl 5.39.8

Mojolicious 9.35

Net::Prometheus 0.12

Data::Dump 1.25

Schedule::Cron::Events 1.96

Text::CSV 2.04

Moose 2.2207

JSON 4.10

JSON::Validator 5.14

File::Slurper 0.014

Data::UUID 1.226

Log::Log4perl 1.57

docker-compose

cluster/docker-compose.yml

prom/prometheus v2.50.1

grafana/grafana 10.3.3

prom/blackbox-exporter v0.24.0

prometheuscommunity/json-exporter v0.6.0

nginx 1.25

postgres 16.2

prom/alertmanager v0.27.0

ixdotai/smtp v0.5.2

binwiederhier/ntfy v2.8.0

dockerfile

cluster/downtime-processor/Dockerfile

perl 5.39.8

pip_requirements

ansible/requirements.txt

ansible ==9.3.0

molecule ==6.0.3

molecule-plugins ==23.5.3

passlib == 1.7.4

Scrape downtime schedules off instances

Follow up on #22

It should be possible to scrape downtime schedules off predefined URLs from instances. For example, https://INSTANCE/.well-known/host-metadata.json or https://INSTANCE/.well-known/scheduled-downtime.json
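
The lookup could be sketched as follows, using only the two example paths from the issue. The function names, the fallback order, the timeout and the decision to return the first document that parses as JSON are assumptions; the schema of the downtime file is not described here, so the sketch does not validate it.

```python
import json
from typing import Any, List, Optional
from urllib.request import urlopen

# The two example locations mentioned in the issue.
WELL_KNOWN_PATHS = (
    "/.well-known/host-metadata.json",
    "/.well-known/scheduled-downtime.json",
)


def candidate_urls(instance: str) -> List[str]:
    """Build the URLs to try for an instance given as a host name or URL."""
    host = instance.strip().rstrip("/")
    if "://" in host:
        host = host.split("://", 1)[1]
    return [f"https://{host}{path}" for path in WELL_KNOWN_PATHS]


def fetch_schedule(instance: str, timeout: float = 5.0) -> Optional[Any]:
    """Return the first schedule document that can be fetched and parsed, else None."""
    for url in candidate_urls(instance):
        try:
            with urlopen(url, timeout=timeout) as response:
                return json.load(response)
        except (OSError, ValueError):
            # OSError covers network and HTTP errors; ValueError covers bad JSON.
            continue
    return None
```

Calling `fetch_schedule("lemmy.example")` would try both locations in order and return `None` if neither yields valid JSON.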

Write tests for the deploy playbook

Automate the rollout of a new version

The current process for deploying a new version is quite laborious and involves scp, wget and unzip, which is just not right 😅

Ideally, there should be one or more Ansible playbooks to automate all or most aspects of that:

- Deploying a new version of lemmy-meter
- Deploying Grafana dashboards
- Restarting the cluster
- Restarting a particular service

For the sake of simplicity, the task of deploying a cluster to a new machine can be skipped.

Configure alerts

It should be possible to subscribe to a particular instance's alerts and receive a notification (eg an e-mail) whenever the alert is triggered.

Expose stats via APIs

It'd be useful to expose the health check results that lemmy-meter collects via some API to interested parties.

For example, uptime.lemmings.world could use such stats to generate uptime badges.

Things to note at the first pass:

- The API shouldn't be public, at least not for now, as lemmy-meter simply hasn't got the infrastructure for that.
- There are two types of data that lemmy-meter ingests and stores: snapshot and time-series. Again, for infrastructural reasons, the focus for the time being should be on the snapshot data.

Thanks @RikudouSage for bringing this up.

Deploy cluster configuration w/o restarting it

In cases like changes to Prometheus service discovery files (eg adding an instance), there's no need to restart the cluster, as Prometheus will pick the changes up OOTB.
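
To illustrate why no restart is needed: with file-based service discovery, Prometheus watches the target files and reloads them when they change, so adding an instance amounts to rewriting a JSON file. The sketch below assumes a JSON target file in the standard `file_sd` format (a list of objects with `targets` and `labels`); the function name and the file layout are hypothetical and may not match how lemmy-meter organises its files.

```python
import json
import os
import tempfile
from typing import Dict, Optional


def add_target(sd_file: str, target: str, labels: Optional[Dict[str, str]] = None) -> bool:
    """Add a target to a file_sd JSON file; return False if it is already listed."""
    try:
        with open(sd_file) as handle:
            groups = json.load(handle)
    except FileNotFoundError:
        groups = []
    if any(target in group.get("targets", []) for group in groups):
        return False
    groups.append({"targets": [target], "labels": labels or {}})

    # Write to a temporary file and rename it into place, so that a reader
    # never sees a half-written target file.
    directory = os.path.dirname(os.path.abspath(sd_file))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file readable by the owner only
    os.replace(tmp_path, sd_file)
    return True
```

A deploy step built along these lines would only need to copy or rewrite the target file on the host, leaving the running services untouched.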

Retire matrix-webhook

With Prometheus alerts in place, there's no more need for the Grafana-Matrix bridge and it can be safely retired.