
Pyrra: SLOs with Prometheus

Making SLOs with Prometheus manageable, accessible, and easy to use for everyone!

Screenshot of Pyrra

Dashboards to visualize SLOs in Grafana:

Pyrra Grafana dashboard

Watch the 5min lightning talk at Prometheus Day 2022:

PrometheusDay 2022: Lightning Talk Pyrra

Features

  • Support for Kubernetes, Docker, and reading from the filesystem
  • Alerting: Generates 4 Multi Burn Rate Alerts with different severities
  • Page listing all Service Level Objectives
    • Search through names and labels
    • Sorted by remaining error budget to see the worst ones quickly
    • All columns sortable
    • View and hide individual columns
    • Clicking on labels to filter SLOs that contain the label
    • Tool-tips when hovering for extra context
  • Page with details for a Service Level Objective
    • Objective, Availability, Error Budget highlighted as 3 most important numbers
    • Graph to see how the error budget develops over time
    • Time range picker to change graphs
    • Switch between absolute and relative chart scales
    • Request, Errors, Duration (RED) graphs for the underlying service
    • Multi Burn Rate Alerts overview table
  • Caching of Prometheus query results
  • Thanos: Disabling of partial responses and downsampling to 5m and 1h
  • Protobuf APIs generated with connect-go and connect-web
  • Grafana dashboards based on recording rules generated via --generic-rules

Feedback & Support

If you have any feedback, please open a discussion in the GitHub Discussions of this project.
We would love to learn what you think!

Demo

Check out our live demo on demo.pyrra.dev!
Grafana dashboards are available as demo on demo.pyrra.dev/grafana!

Feel free to give it a try there!

How It Works

There are three components of Pyrra, all of which work through a single binary:

  • The UI displays SLOs, error budgets, burn rates, etc.
  • The API delivers information about SLOs from a backend (like Kubernetes) to the UI.
  • A backend watches for new SLO objects and then creates Prometheus recording rules for each.
    • For Kubernetes, there is a Kubernetes Operator available
    • For everything else, there is a filesystem-based Operator available

For the backend/operator to do its work, an SLO object has to be provided in YAML format:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: pyrra-api-errors
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
    pyrra.dev/team: operations # Any labels prefixed with 'pyrra.dev/' will be propagated as Prometheus labels, while stripping the prefix.
spec:
  target: "99"
  window: 2w
  description: Pyrra's API requests and response errors over time grouped by route.
  indicator:
    ratio:
      errors:
        metric: http_requests_total{job="pyrra",code=~"5.."}
      total:
        metric: http_requests_total{job="pyrra"}
      grouping:
        - route

Depending on your mode of operation, this information is provided through an object in Kubernetes, or read from a static file.

To calculate error budget burn rates, Pyrra then creates Prometheus recording rules for each SLO.

The following rules would be created for the above example:

http_requests:increase2w

http_requests:burnrate3m
http_requests:burnrate15m
http_requests:burnrate30m
http_requests:burnrate1h
http_requests:burnrate3h
http_requests:burnrate12h
http_requests:burnrate2d

The recording rule names are based on the originally provided metric. Each recording rule carries the labels needed to uniquely identify it in case multiple rules share the same name.
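
As a rough illustration (a hypothetical sketch, not Pyrra's exact output), one of the generated burn rate rules for the example above could look roughly like this in a Prometheus rule file; the exact expressions and labels Pyrra emits may differ:

groups:
  - name: pyrra-api-errors
    rules:
      - record: http_requests:burnrate1h
        expr: |
          sum by (route) (rate(http_requests_total{job="pyrra",code=~"5.."}[1h]))
          /
          sum by (route) (rate(http_requests_total{job="pyrra"}[1h]))
        labels:
          slo: pyrra-api-errors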

Running inside a Kubernetes cluster

An example for this mode of operation can be found in examples/kubernetes.

Kubernetes Architecture

Here two deployments are needed: one for the API / UI and one for the operator. For the first deployment, start the binary with the api argument.

When starting the binary with the kubernetes argument, the service will watch the apiserver for ServiceLevelObjectives. Once a new SLO is picked up, Pyrra will create PrometheusRule objects that are automatically picked up by the Prometheus Operator.

If you're unable to run the Prometheus Operator inside your cluster, you can add the --config-map-mode=true flag after the kubernetes argument. This will save each recording rule in a separate ConfigMap.
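
As a minimal sketch (container names are illustrative; see examples/kubernetes for the real manifests), the two deployments differ mainly in the argument passed to the binary:

# API / UI deployment (sketch)
containers:
  - name: pyrra-api
    image: ghcr.io/pyrra-dev/pyrra:v0.7.0
    args:
      - api

# Operator deployment (sketch)
containers:
  - name: pyrra-kubernetes
    image: ghcr.io/pyrra-dev/pyrra:v0.7.0
    args:
      - kubernetes
      # add --config-map-mode=true here if the Prometheus Operator is unavailable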

Applying YAML

This repository contains generated YAML files in the examples/kubernetes/manifests folder. You can use the following commands to deploy them to a cluster right away.

kubectl apply --server-side -f ./examples/kubernetes/manifests/setup
kubectl apply --server-side -f ./examples/kubernetes/manifests
kubectl apply --server-side -f ./examples/kubernetes/manifests/slos

Applying YAML and validating webhooks via cert-manager

This repository contains more generated YAML files in the examples/kubernetes/manifests-webhook folder.

This example deployment additionally applies a self-signed Issuer and requests a certificate via cert-manager, so that the Kubernetes API server can connect to Pyrra to validate any configuration object before applying it to the cluster.

kubectl apply --server-side -f ./examples/kubernetes/manifests-webhook/setup
kubectl apply --server-side -f ./examples/kubernetes/manifests-webhook
kubectl apply --server-side -f ./examples/kubernetes/manifests-webhook/slos

kube-prometheus

The underlying jsonnet code is imported by the kube-prometheus project. If you want to install an entire monitoring stack including Pyrra, we highly recommend using kube-prometheus.

Install with Helm

Thanks to @rlex there is a Helm chart for deploying Pyrra too.

Running inside Docker / Filesystem

An example for this mode of operation can be found in examples/docker-compose.

Filesystem Architecture

You can easily start Pyrra on its own via the provided Docker image:

docker pull ghcr.io/pyrra-dev/pyrra:v0.7.0

When running Pyrra outside of Kubernetes, the SLO object can be provided through a YAML file read from the file system. For this, one container or binary needs to be started with the api argument and the reconciler with the filesystem argument.

Here, Pyrra will save the generated recording rules to disk where they can be picked up by a Prometheus instance. While running Pyrra on its own works, there won't be any SLO configured, nor will there be any data from a Prometheus to work with. It's designed to work alongside a Prometheus.
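
A rough docker-compose sketch of this setup (service names, mount paths, and the exact prometheus-url flag spelling are assumptions; see examples/docker-compose for the real configuration):

services:
  pyrra-api:
    image: ghcr.io/pyrra-dev/pyrra:v0.7.0
    command: ["api"]          # serves the UI and API
  pyrra-filesystem:
    image: ghcr.io/pyrra-dev/pyrra:v0.7.0
    command: ["filesystem", "--prometheus-url=http://prometheus:9090"]
    volumes:
      - ./pyrra:/etc/pyrra                        # SLO definitions in YAML (path assumed)
      - ./prometheus/rules:/etc/prometheus/rules  # generated recording rules picked up by Prometheus (path assumed)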

Tech Stack

Client: TypeScript with React, Bootstrap, and uPlot.

Server: Go with libraries such as chi, ristretto, xxhash, and client-go.

Generated protobuf APIs with connect-go for Go and connect-web for TypeScript.

Roadmap

Best to check the Projects board, and if you cannot find what you're looking for, feel free to open an issue!

Contributing

Contributions are always welcome!

See CONTRIBUTING.md for ways to get started.

Please adhere to this project's code of conduct.

Maintainers

Name            Area         GitHub          Twitter         Company
Nadine Vehling  UX/UI        @nadinevehling  @nadinevehling  Grafana Labs
Matthias Loibl  Engineering  @metalmatze     @metalmatze     Polar Signals

We are mostly maintaining Pyrra in our free time.

Acknowledgements

@aditya-konarde, @brancz, @cbrgm, @codesome, @ekeih, @guusvw, @jzelinskie, @kakkoyun, @lilic, @markusressel, @morremeyer, @mxinden, @numbleroot, @paulfantom, @RiRa12621, @tboerger, and Maria Franke.

While we were working on Pyrra in private, these amazing people helped us with a lot of feedback, and some even took an extra hour for in-depth testing! Thank you all so much!

Additionally, @metalmatze would like to thank Polar Signals for allowing him to work on this project in his 20% time.

FAQ

Why not use Grafana in this particular use case?

Right now, you could indeed use Grafana for this. In upcoming releases, we plan to add more interactive features to give you better context when coming up with new SLOs. This is something we couldn't do with Grafana.

Do I still need Grafana?

Yes, Grafana is an amazing data visualization tool for Prometheus metrics. You can create your own custom dashboards and dive a lot deeper into each component while debugging.

Does it work with Thanos too?

Yes, in fact I've been developing this against my little Thanos cluster most of the time.
The queries even dynamically add headers for downsampling and disable partial responses.

How many instances should I deploy?

It depends on the topology of your infrastructure, however, we think that alerting should still happen within each individual Prometheus and therefore running one instance with one Prometheus (pair) makes the most sense. Pyrra itself only needs one instance per Prometheus (pair).

Why don't you support more complex SLOs?

For now, we try to accomplish an easy-to-set-up workflow for the most common SLOs. It is still possible to write more complex SLOs manually and deploy them to Prometheus alongside the generated ones. You can base more complex SLOs on the output of one SLO from this tool.

Why is the objective target a string not a float?

Kubebuilder doesn't support floats in CRDs...
Therefore, we need to pass it as a string and internally convert it from string to float64.

Related

Here are some related projects:


pyrra's Issues

Error budget graph gradient is slightly wrong

If the error budget crosses 0 we want to color the filled area either in red or green.
To make that work we calculate a gradient that vertically runs from the top to the bottom of the graph and should split right where the 0 is.
I tried for quite some time to get it right, but in the end I wasn't able to make it work; something about the offsets, paddings, etc.

errorbudget

I left some TODOs in the code and would be happy to get that reviewed by others:

// TODO: This seems "good enough" but sometimes the gradient still reaches the wrong side.
// Maybe it's a floating point thing?
// Fraction of the chart height (0 = top, 1 = bottom) at which the 0 line sits.
const zeroPercentage = 1 - (0 - min) / (max - min)
// Vertical gradient spanning the plotting area of the uPlot canvas.
const gradient = u.ctx.createLinearGradient(width / 2, canvasPadding - 2, width / 2, height - canvasPadding)
// Two color stops at the same offset create a hard edge: green above the 0 line, red below it.
gradient.addColorStop(0, `#${greens[0]}`)
gradient.addColorStop(zeroPercentage, `#${greens[0]}`)
gradient.addColorStop(zeroPercentage, `#${reds[0]}`)
gradient.addColorStop(1, `#${reds[0]}`)
return gradient

UI: Show warning if volume is below error budget

If someone sets an objective of 99% that means that out of 100 events only 1 can fail.
If the volume/amount of events that happened for that SLO is less than 100, it becomes problematic.

I'd even say that anything below 1000 requests in total is probably problematic. Although then it becomes hard to draw the line for how few events are still acceptable.

At least we should show a warning if (error budget) * (volume) < 1 or (1 - objective) * (volume) < 1.

A concrete example could be:
The objective is to have 99.5% over 4w. Now, in 4w the service only had 135 requests.
Therefore (1 - 0.995) * 135 = 0.675, which is less than 1. It means that just one bad request exhausts the entire error budget.
In that case, we should show a warning that the objective's target is too high for the few events the service has.

Fix prometheus-url flag for filesystem

The Prometheus client that's then used by the Prom API client passed here doesn't use the correct Prometheus API URL.

pyrra/main.go

Line 112 in 3a58b4e

code = cmdFilesystem(logger, reg, client, CLI.Filesystem.ConfigFiles, CLI.Filesystem.PrometheusFolder)

Release arm64 image

Hello!

Would it be possible to release a multi-arch image which includes an arm64 build? We're currently migrating our clusters to arm64 nodes and this would fit very well on there.

Panic reloading Prometheus

pyrra-filesystem_1  | goroutine 70 [running]:
pyrra-filesystem_1  | main.cmdFilesystem.func7()
pyrra-filesystem_1  | 	/workspace/filesystem.go:237 +0x3ef
pyrra-filesystem_1  | github.com/oklog/run.(*Group).Run.func1({0xc00052a440, 0xc0005400f0})
pyrra-filesystem_1  | 	/go/pkg/mod/github.com/oklog/[email protected]/group.go:38 +0x2f
pyrra-filesystem_1  | created by github.com/oklog/run.(*Group).Run
pyrra-filesystem_1  | 	/go/pkg/mod/github.com/oklog/[email protected]/group.go:37 +0x22f
pyrra-filesystem_1  | level=info ts=2022-03-23T00:57:50.446907165Z caller=main.go:100 msg="using Prometheus" url=http://localhost:9090
pyrra-filesystem_1  | level=info ts=2022-03-23T00:57:50.447024576Z caller=filesystem.go:113 msg="watching directory for changes" directory=/etc/pyrra
pyrra-filesystem_1  | level=info ts=2022-03-23T00:57:50.447638782Z caller=filesystem.go:265 msg="starting up HTTP API" address=:9444
pyrra-filesystem_1  | level=debug ts=2022-03-23T00:57:50.451548833Z caller=filesystem.go:155 msg=reading file=/etc/pyrra/caddy-response-errors.yaml
pyrra-filesystem_1  | level=debug ts=2022-03-23T00:57:50.458684776Z caller=filesystem.go:155 msg=reading file=/etc/pyrra/caddy-response-latency.yaml
pyrra-filesystem_1  | level=debug ts=2022-03-23T00:57:50.462586082Z caller=filesystem.go:155 msg=reading file=/etc/pyrra/parca-grpc-profilestore-errors.yaml
pyrra-filesystem_1  | level=debug ts=2022-03-23T00:57:50.465282016Z caller=filesystem.go:155 msg=reading file=/etc/pyrra/parca-grpc-profilestore-latency.yaml
pyrra-filesystem_1  | level=debug ts=2022-03-23T00:57:50.468134747Z caller=filesystem.go:155 msg=reading file=/etc/pyrra/prometheus-http-errors.yaml
pyrra-filesystem_1  | level=debug ts=2022-03-23T00:57:50.47298123Z caller=filesystem.go:155 msg=reading file=/etc/pyrra/prometheus-rule-evaluation-failures.yaml
pyrra-filesystem_1  | level=debug ts=2022-03-23T00:57:50.475449587Z caller=filesystem.go:155 msg=reading file=/etc/pyrra/pyrra-demo-hourly.yaml
pyrra-filesystem_1  | level=debug ts=2022-03-23T00:57:50.477591209Z caller=filesystem.go:155 msg=reading file=/etc/pyrra/pyrra-demo-random.yaml
pyrra-filesystem_1  | level=debug ts=2022-03-23T00:57:55.480652213Z caller=filesystem.go:231 msg="reloading Prometheus now"
pyrra-filesystem_1  | level=warn ts=2022-03-23T00:57:55.482150924Z caller=filesystem.go:235 msg="failed to reload Prometheus"
pyrra-filesystem_1  | panic: runtime error: invalid memory address or nil pointer dereference
pyrra-filesystem_1  | [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x14647cf]

Instrument Pyrra itself with metrics

So far we don't have any metrics...
We should add some Go and HTTP default metrics and can then start adding more application specific ones later on.

Additionally, it'd be great to provide SLOs for Pyrra itself based on these metrics.

running locally using `filesystem` fails

$ ./bin/filesystem
2021/08/02 16:37:23 lstat /etc/pyrra: no such file or directory

This may be a little bit of an issue in general, but specifically on macOS it is rather uncommon to use /etc for anything non-system.
Should probably be changed from a hardcoded path here

filenames, err := filepath.Glob("/etc/pyrra/*.yaml")
to something configurable.

Additionally, it would be good to add a note to the documentation about what exactly is required in the YAML file.

Support for existing SLOs present in k8s env from record rules.

Tried the application; nice interface!

I have a question regarding importing/adding support for existing SLOs into the interface.

E.g., we have a CI GitOps setup for defining SLOs which generates Prometheus recording rules in a k8s cluster (using Sloth). Is it possible to visualize those already-present SLOs in the Pyrra interface?

Pyrra `ListObjectives` route returns 500 if SLO is created with invalid metrics

We tried creating an SLO with the following spec:

spec:
  target: "99.9"
  description: "Success ratio of workspace backups"
  window: 4w
  indicator:
    ratio:
      errors:
        metric: gitpod_ws_manager_workspace_backups_failure_total
      total:
        metric: (gitpod_ws_manager_workspace_backups_failure_total + gitpod_ws_manager_workspace_backups_success_total)

We made a mistake here when we assumed that a query could work instead of a single metric.

The problem is that the admission controller accepted the SLO, and after that all other SLOs we had stopped showing up in the ListObjectives route. We got confused at first, but after some time we noticed the 500s showing up in the logs. We deleted this problematic SLO and the 500s disappeared.


Accepting queries instead of a specific metric might be reasonable in some use cases, but that is not the point of this issue. I believe it would be a better experience if the admission controller rejected the SLO during creation time, or if the Pyrra UI could handle invalid SLOs without returning 500s.

Panic in GetObjectiveStatus

2022/03/25 15:51:49 http: panic serving 100.64.0.184:43928: runtime error: invalid memory address or nil pointer dereference
goroutine 3147 [running]:
net/http.(*conn).serve.func1()
	/usr/local/go/src/net/http/server.go:1802 +0xb9
panic({0x15c0de0, 0x27f68b0})
	/usr/local/go/src/runtime/panic.go:1047 +0x266
main.(*ObjectivesServer).GetObjectiveStatus(0xc00026a0a0, {0x1bd3af8, 0xc003bd2510}, {0xc0038ecd40, 0x36}, {0xc003b17bfe, 0x0})
	/workspace/main.go:491 +0xed6
github.com/pyrra-dev/pyrra/openapi/server/go.(*ObjectivesApiController).GetObjectiveStatus(0xc00026e078, {0x1bc4208, 0xc00026e3c0}, 0xc003b64a00)
	/workspace/openapi/server/go/api_objectives.go:141 +0x164
net/http.HandlerFunc.ServeHTTP(0x4, {0x1bc4208, 0xc00026e3c0}, 0x0)
	/usr/local/go/src/net/http/server.go:2047 +0x2f
github.com/pyrra-dev/pyrra/openapi/server/go.Logger.func1({0x1bc4208, 0xc00026e3c0}, 0xc003b64a00)
	/workspace/openapi/server/go/logger.go:22 +0x9e
net/http.HandlerFunc.ServeHTTP(0xc003bd2510, {0x1bc4208, 0xc00026e3c0}, 0x8)
	/usr/local/go/src/net/http/server.go:2047 +0x2f
github.com/pyrra-dev/pyrra/openapi.MiddlewareLogger.func1.1({0x1bc4208, 0xc00026e3a8}, 0xc003b64a00)
	/workspace/openapi/server.go:176 +0xf4
net/http.HandlerFunc.ServeHTTP(0x7f5c167ada68, {0x1bc4208, 0xc00026e3a8}, 0xc0038e3dd0)
	/usr/local/go/src/net/http/server.go:2047 +0x2f
github.com/pyrra-dev/pyrra/openapi.MiddlewareMetrics.func1.1({0x1bceca8, 0xc003f73ea0}, 0xc003bd2510)
	/workspace/openapi/server.go:161 +0xf4
net/http.HandlerFunc.ServeHTTP(0xc003b64900, {0x1bceca8, 0xc003f73ea0}, 0x4)
	/usr/local/go/src/net/http/server.go:2047 +0x2f
github.com/gorilla/mux.(*Router).ServeHTTP(0xc000242e40, {0x1bceca8, 0xc003f73ea0}, 0xc003b64800)
	/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:210 +0x1cf
net/http.StripPrefix.func1({0x1bceca8, 0xc003f73ea0}, 0xc003b64700)
	/usr/local/go/src/net/http/server.go:2090 +0x330
net/http.HandlerFunc.ServeHTTP(0xc003bd23f0, {0x1bceca8, 0xc003f73ea0}, 0xc003b1c628)
	/usr/local/go/src/net/http/server.go:2047 +0x2f
github.com/go-chi/chi/v5.(*Mux).Mount.func1({0x1bceca8, 0xc003f73ea0}, 0xc003b64700)
	/go/pkg/mod/github.com/go-chi/chi/[email protected]/mux.go:314 +0x19c
net/http.HandlerFunc.ServeHTTP(0x15b2c40, {0x1bceca8, 0xc003f73ea0}, 0xc003b1c620)
	/usr/local/go/src/net/http/server.go:2047 +0x2f
github.com/go-chi/chi/v5.(*Mux).routeHTTP(0xc00017e4e0, {0x1bceca8, 0xc003f73ea0}, 0xc003b64700)
	/go/pkg/mod/github.com/go-chi/chi/[email protected]/mux.go:442 +0x216
net/http.HandlerFunc.ServeHTTP(0xc003bd23f0, {0x1bceca8, 0xc003f73ea0}, 0x450194)
	/usr/local/go/src/net/http/server.go:2047 +0x2f
github.com/go-chi/chi/v5.(*Mux).ServeHTTP(0xc00017e4e0, {0x1bceca8, 0xc003f73ea0}, 0xc003b64700)
	/go/pkg/mod/github.com/go-chi/chi/[email protected]/mux.go:71 +0x48d
github.com/go-chi/chi/v5.(*Mux).Mount.func1({0x1bceca8, 0xc003f73ea0}, 0xc003b64700)
	/go/pkg/mod/github.com/go-chi/chi/[email protected]/mux.go:314 +0x19c
net/http.HandlerFunc.ServeHTTP(0x15b2c40, {0x1bceca8, 0xc003f73ea0}, 0xc003ae97a4)
	/usr/local/go/src/net/http/server.go:2047 +0x2f
github.com/go-chi/chi/v5.(*Mux).routeHTTP(0xc00017e480, {0x1bceca8, 0xc003f73ea0}, 0xc003b64700)
	/go/pkg/mod/github.com/go-chi/chi/[email protected]/mux.go:442 +0x216
net/http.HandlerFunc.ServeHTTP(0xc00037cbe0, {0x1bceca8, 0xc003f73ea0}, 0xc003b64700)
	/usr/local/go/src/net/http/server.go:2047 +0x2f
github.com/go-chi/cors.(*Cors).Handler.func1({0x1bceca8, 0xc003f73ea0}, 0xc003b64700)
	/go/pkg/mod/github.com/go-chi/[email protected]/cors.go:228 +0x1bd
net/http.HandlerFunc.ServeHTTP(0x1bd3a50, {0x1bceca8, 0xc003f73ea0}, 0x27f63c0)
	/usr/local/go/src/net/http/server.go:2047 +0x2f
github.com/go-chi/chi/v5.(*Mux).ServeHTTP(0xc00017e480, {0x1bceca8, 0xc003f73ea0}, 0xc003b64600)
	/go/pkg/mod/github.com/go-chi/chi/[email protected]/mux.go:88 +0x442
net/http.serverHandler.ServeHTTP({0xc003bd2330}, {0x1bceca8, 0xc003f73ea0}, 0xc003b64600)
	/usr/local/go/src/net/http/server.go:2879 +0x43b
net/http.(*conn).serve(0xc0036d1b80, {0x1bd3af8, 0xc00001d320})
	/usr/local/go/src/net/http/server.go:1930 +0xb08
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:3034 +0x4e8

pyrra/main.go

Line 491 in a7ccfce

s.Availability.Errors = float64(v.Value)

A first idea is that NaN can't be cast?

Generate alert message annotation

We should generate a helpful message as an annotation for alerts.

In an ideal scenario, I'd like to see the current amount of error budget left and how quickly the error budget would be exhausted given the burn rate. Maybe some extra information I can't think of right now.

Wrap errors with more context

I think we could provide a bit more context around the returned errors. Right now there are a lot of these kinds of error returns:

	if err := r.Create(ctx, newRule); err != nil {
		return ctrl.Result{}, err
	}

// ...

	if err := r.Update(ctx, newRule); err != nil {
		return ctrl.Result{}, err
	}

In this specific example it might not be immediately clear in the logs whether an error occurred during the Create or the Update step.

My suggestion would be to wrap errors with minimal context.

Feature: Support for dark mode

Hi! Great job so far! Just wanted to check if support for dark mode is something that could be considered in a future release.

PR Check "Docker Push" Failing

The PR check to push an image to ghcr is failing like in #219.
The permissions seem not to be set correctly for the account that's used as part of the workflow.

Standalone cli tool able to transform SLO definitions into prom rules

I'd like to be able to use pyrra's functionality in an environment that does not match how pyrra is supposed to be deployed. To that end I'm interested in the following feature: a build of (parts of) pyrra as a standalone CLI binary capable (at least initially) of transforming files containing SLO definitions into files containing prom rules.

Example usage:
pyrratool slo2prom -i slo.yaml -o promrules.yaml

Team label in URL doesn't remove illegal chars

Hi!

Found a bug after we added the team-label to SLOs.

pyrra.dev/team: team

When you select the SLO from the frontpage the URL you're sent to looks like this:
https://pyrra.dev/objectives?expr={__name__=%22example-error-rate%22,%20namespace=%22example%22,%20pyrra.dev/team=%22team%22}&grouping={}

This results in an error because pyrra.dev/team contains a .
"1:115: parse error: unexpected character inside braces: '.'"

Removing the . from the URL results in the same error, only now the problem is /.
"1:118: parse error: unexpected character inside braces: '/'"
Removing / as well from the URL resolves the problem.

The URL that worked:
https://pyrra.dev/objectives?expr={__name__=%22example-error-rate%22,%20namespace=%22example%22,%20pyrradevteam=%22team%22}&grouping={}

Is it possible to filter this out from the generated URLs from the front page of Pyrra? Or maybe do the same as for alerts? Removing the pyrra.dev/ prefix that is.

If we click the label team on the frontpage to filter on team it works without any issues, but the filter doesn't use braces so that might be why it's not affected. The URL looks like this:
https://pyrra.dev/?filter=%7Bpyrra.dev/team=%22team%22%7D

Add label propagation to Pyrra

We want to support propagating the labels all the way from the SLO CRD through Pyrra until they show up in the alerts in alertmanager so the alerts can be routed to the correct receivers.

Let's imagine this label is added in here:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: your-slo-name-here
  labels:
    prometheus: k8s
    role: alert-rules
+   team: pyrra
spec:
  target: '99.0'
  window: 7d
  indicator:
    ratio:
      errors:
        metric: frontend_request_counter_total{status=~"5..",app="app-name"}
      total:
        metric: frontend_request_counter_total{app="app-name"}

Then it should show up in the list page of Pyrra too as team=pyrra.

In the end the same label needs to be part of the alert

ALERTS{alertname="ErrorBudgetBurn1w", alertstate="firing", long="1d", severity="warning", short="1h30m", slo="your-slo-name-here", namespace="your-namespace", exhaustion="1w", threshold="0.010", team="pyrra"}

At last alertmanager needs to be configured to route the alerts correctly based on the team label.

Original comment: #38 (comment)

Open questions

I'm not 100% sure if we really want to add the label to metadata.labels. It could be better to have something similar to what Deployments and StatefulSets do, where we would have spec.metadata.labels. That way we could have separate labels for Kubernetes and for the SLO itself.
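
To make the open question more concrete, a hypothetical shape for that alternative could look like this (just a sketch, not implemented):

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: your-slo-name-here
  labels:
    prometheus: k8s        # labels used by Kubernetes / the Prometheus Operator
    role: alert-rules
spec:
  metadata:
    labels:
      team: pyrra          # labels propagated into recording rules and alerts
  target: '99.0'
  window: 7d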

Show firing alerts for SLOs in list/table view

In addition to listing the SLOs in the table, we should also query ALERTS and match the firing alerts with the SLOs, so the overview/list view can already show which SLOs might have firing alerts.
This should make it very handy if there's an ongoing incident.

Automatically refresh on Detail page

Especially during development, I often find myself refreshing the Detail page to get the latest numbers. Basically, when developing a new feature or improving performance, I want to see how those changes are doing against an SLO.

It's kind of a waste to reload the entire page, so having something that refreshes graphs and numbers every so often would be helpful.

Support basePath other than `/`

Sometimes you don't want to run Pyrra's UI on the / base path but on something like /pyrra so it can be run behind a proxy.
We need to support that in the React UI and pass a flag from the CLI to the UI via HTML <script> tags, probably.

SLOs with latency should have their latency objective shown

Right now, if you have a latency SLO, the target latency is only to be found in the config of that SLO.
We can do better and bring that latency objective forward, like 99% within 1s, and show it more prominently on the detail and list pages.

Delete generated PrometheusRule and rule files

Both for filesystem and Kubernetes we need to reconcile the generated files so that not only are new ones created and existing ones updated, but ones that no longer exist are also deleted.

The tricky part is figuring out the difference of what's gone, but it should be doable nonetheless.

Improve 'Installation' docs

Some things I wish that were more readily available in the README.md:

  • Short top-level explanation about how Pyrra works
  • Short description about the different modes (filesystem vs kubernetes vs kubernetes & config-map-mode)
    • Note about dependency on the Prometheus Operator
    • Link to kubernetes example

I don't think it's a massive issue since everything can be figured out, but it took me some time and digging through the repo. I think we can make it a bit easier for new users when we extend the docs in that regard.

Graphs for burn rate alerts

The table at the bottom of the detail page has the multiple burn rate alerts listed. These are each made of 2 alerts and will alert if the burn rate gets above a certain threshold.
We should add a graph showing both, short and long, burn rates with the threshold, so it's easy to tell how bad the error budget burn really is.
Additionally, we can have a dynamic text explaining what it means if the alert is firing. Something along the lines of:

This alert firing means that both the 6 hour and 3 day burn rates are above a threshold of 1. If the error budget continues to be burned at this rate, all the error budget will be burned after 4w

Given 4w is the objective's window.

By default the graphs aren't shown or queried; however, users can expand the rows in the table to show the individual ones. Plus, if any of the alerts are firing, we should show these graphs right away.

Gauge metrics support by Pyrra

At the moment, Pyrra only supports counter metrics. It would be great to have support for gauge metrics as well.
For example, the blackbox exporter's "probe_success" metric returns 1 (success) or 0 (fail). Is it possible to add formulas to Pyrra to create an SLO graph based on such data?
Perhaps this could be achieved using the "count_over_time" function for the total number of attempts, and "sum_over_time" for the number of successful attempts?
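
For illustration, the success ratio of such a 0/1 gauge could be pre-computed along these lines (a hypothetical recording rule sketch, not something Pyrra generates today):

groups:
  - name: blackbox-probe-slo-sketch
    rules:
      # ratio of successful probes over the last 5 minutes (1 = all succeeded)
      - record: probe_success:ratio_rate5m
        expr: |
          sum_over_time(probe_success[5m])
          /
          count_over_time(probe_success[5m])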

Thank you!

Add a CONTRIBUTING.md

This would be a small quality of life improvement: add a file which quickly explains how to set up the project for development.

Update examples manifests

The CRD spec has been changed but the files in the examples folder haven't been updated.

kubectl apply -f examples/nginx.yaml

Error from server (Invalid): error when creating "examples/nginx.yaml": ServiceLevelObjective.pyrra.dev "nginx-api-errors" is invalid: spec.indicator: Required value
Error from server (Invalid): error when creating "examples/nginx.yaml": ServiceLevelObjective.pyrra.dev "nginx-api-latency" is invalid: spec.indicator: Required value

Add recording rules for availability pre-computation

It would be really helpful to create a recording rule for error budgets.
Right now, the availability and error budget are always calculated from scratch (and then cached), which takes a lot of time when done across 2w or 4w, for example (depending on the series cardinality and sample size).

We should explore creating a recording rule for each SLO that is then super lightweight to query (one series, where availability and error budget can be read with an instant query).
The downside is that if an SLO is changed, the recording rule's history will most likely differ drastically. This might throw off users. Would that be a problem?

As for an implementation, we should look at the same approach that the kubernetes-mixin uses for the apiserver SLO. It effectively splits the recording rule into two levels: the first evaluates the average across a short time range, and the second simply reads these pre-aggregated recording rule series to get the overall availability and error budget.
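
A hypothetical sketch of that two-level approach, reusing the ratio SLI from the README example above (not what Pyrra generates today):

groups:
  - name: slo-availability-precompute-sketch
    rules:
      # level 1: error ratio over a short window, evaluated continuously
      - record: http_requests:error_ratio_rate30m
        expr: |
          sum(rate(http_requests_total{job="pyrra",code=~"5.."}[30m]))
          /
          sum(rate(http_requests_total{job="pyrra"}[30m]))
      # level 2: average the pre-aggregated series over the whole SLO window
      - record: http_requests:error_ratio_rate2w
        expr: avg_over_time(http_requests:error_ratio_rate30m[2w])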

Design graph hover

Currently, it's the theme's default state. We want to properly design it.
Bildschirmfoto 2021-07-31 um 14 56 05

Show loading state for availability and error budget

While loading availability and error budget (which is the same request, so they always finish together) we don't show a spinner or any indication of loading happening in the background. Instead it looks somewhat broken.

Screenshot from 2021-08-06 16-12-09

We can probably re-use the same spinner as for the other components on the UI and make it 2x bigger than the others and that should be good for now!

Use Kubernetes Recommended Labels

Hello, and thanks for a great tool!
I think it needs two things:

  1. Deletion of generated rules after pyrra object removal. Right now they are being left as is. What about adding labels like
managed-by=pyrra
pyrra-slo-name=prometheus-http-error

And some controller to ensure rules are in place / removed?

  2. Adjust the name of the alert, adding a prefix/suffix to the alertname. At the moment all alerts are named "ErrorBudgetBurn".

Interactive SLO creation & editing form

Rather than having users always start with an abstract configuration file, we want to have an interactive form where users can configure the objective, the objective's window and the underlying SLI.
We want to have graphs and stats for availability and thus error budget live update, when the configuration is changed.
This will be most helpful if the Prometheus instance already has historical data for the SLO that is to be created. Additionally, updating SLOs will make it much more visible how the update would have influenced the SLO in the past and thus maybe how it will behave in the future.

The form is to be designed and discussed in the future.

Propagating labels already containing a subdomain prefix causes an error

Labels in Kubernetes conform to RFC 1123 (docs). This means that using the prefix pyrra.dev/ disables the option of propagating a label that already contains a subdomain prefix since this breaks the RFC constraints.

Could a possible option be to use pyrra.dev- or simply pyrra- as the prefix for propagating labels?

Read SLO from Kubernetes but write rules to filesystem

We've built a huge Thanos and Prometheus cluster to run the monitoring stack for our production without using the operator.
We do not have the CRD for writing rules. Is there a way to read the configs from Kubernetes but write the resulting rules to the filesystem?

Pyrra deployment with Helm

Hi!

Are there any plans to offer Pyrra as a single Helm chart? If not, would you accept contributions in this area? I would be glad to contribute.

My main motivation would be to offer more deployment capabilities for the Pyrra application on Kubernetes, making it easier to distribute and re-use the Helm chart across multiple environments.

Thanks in advance!
