Comments (10)
Hi Anton, long time no see 😄
Would it be okay to share the SLO spec?
The errors displayed there are fetched from a recording rule generated by Pyrra called "increase:error metric name:slo window"
To start troubleshooting I would look for log errors in both Pyrra and Prometheus that might correlate with this metric
from pyrra.
While I agree that this is probably the best way to implement an SLO, would it make sense to not be so restrict about it?
Yes, fair enough, and also for those circumstances where you aren't in control of the metrics. I guess a relabeling rule might work indeed.
I think this issue is mostly about how we tried labels and how they cancel out in the generic rules (for Grafana etc), right? In that case, we might be able to come up with a test case and implement a fix.
Sadly, I've prioritized some other work for Pyrra right now and might not get to this soon. If anyone wants to try I'm happy to help!
from pyrra.
Hey @ArthurSens , @metalmatze here's our SLO rules
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
...
spec:
alerting:
name: VSCodeDesktopLocalSSHConnection
description: 99% of connection connection over local ssh proxy is succeed
indicator:
ratio:
errors:
metric: vscode_desktop_local_ssh_total{phase="failed", failure_code!~"NoRunningInstance|FailedToGetAuthInfo:UNAVAILABLE|FailedToGetAuthInfo:CANCELLED"}
grouping: null
total:
metric: vscode_desktop_local_ssh_total{phase!="connecting"}
target: "99.0"
window: 4w
Long time no see @ArthurSens 🙌 😄
from pyrra.
The way I read your configuration and how the total metric is supposed to work you shouldn't filter for the connecting at all. The total metric should include all counters.
total:
- metric: vscode_desktop_local_ssh_total{phase!="connecting"}
+ metric: vscode_desktop_local_ssh_total
The resulting recording rule won't include the phase labels as those are part of both the total and error metric.
from pyrra.
Yeah, as @ArthurSens said, it seems like the recording rules aren't properly running in the background in Prometheus.
from pyrra.
It seems the label phase
is not included by pyrra, then reset related data in pyrra should work
from pyrra.
Hey @metalmatze , we resolve this problem temporarily by make total metric query with vscode_desktop_local_ssh_total{phase!~"connecting"}
☝️ #794 (comment)
And we found in code base, for ratio
type, if we have error (phase~=xxx, failure_code)
and total (phase!=xx)
, it will
- list labels from error -> current phase, failure_code
- delete labels from total [1] -> current failure_code
- add labels back if they are regexp [2] -> current failure_code
So result labels will contain failure_code
only, it looks like a bug of pyrra?
Maybe check if label value is equal before delete is required?
from pyrra.
It's about how we report metrics, we have phase connecting
connected
failed
, so we need a label matcher vscode_desktop_local_ssh_total{phase!="connecting"}
from pyrra.
The way I read your configuration and how the total metric is supposed to work you shouldn't filter for the connecting at all. The total metric should include all counters.
While I agree that this is probably the best way to implement an SLO, would it make sense to not be so restrict about it?
I can see the case where metrics exposed by external services just don't follow this pattern, so building an SLO with them becomes extra difficult.
It's about how we report metrics, we have phase
connecting
connected
failed
, so we need a label matchervscode_desktop_local_ssh_total{phase!="connecting"}
@mustard-mh, as a workaround, do you need the metric with phase=connecting
label for any other reason that is not the SLO? Would it be fine to add a relabeling rule to your prometheus to drop the metric? Or even better, if you have access to the codebase that exposes this metric, could you update the instrumentation to not increment the counter during this phase?
from pyrra.
@mustard-mh, as a workaround, do you need the metric with phase=connecting label for any other reason that is not the SLO?
It's a part of user flow telemetry before: connecting -> connected
, connecting -> failed
, connecting->abort
, but yes if it's metric then we can have only connected
and failed
.
Right now as a workaround, use {phase!~"connecting"}
or {phase~="connected|failed"}
can make it work
from pyrra.
Related Issues (20)
- Using Kubernetes labels/annotation as Prometheus alert labels and annotations is limiting HOT 4
- 0s burnrate generated HOT 3
- Unexpected grouping for ratios HOT 1
- Issue with Grouping in Latency indicator HOT 3
- Google Managed Prometheus breaks on `partial_reponse` in queries. HOT 2
- Allow customisation of the name of the recording rules HOT 1
- Error Budget showing strange skew HOT 1
- Surface Pyrra version HOT 1
- Feature - Add link to Pyrra in alerts HOT 1
- Grafana Dashboard HOT 5
- Feature: Allow customisation of warning/critical alert levels HOT 2
- [Feature] Link Grafana Dashboard to Pyrra equivalent HOT 1
- Environment variables support in Pyrra HOT 1
- SLO dependencies / levels / hierarchy HOT 3
- Custom PropagationLabelsPrefix HOT 2
- No data Budget and Availability HOT 1
- Doubt - Error Budget
- Issue when showing some SLOs (white screen)
- Pyrra causes instability of Prometheus HOT 1
- Fix typo in Kubernetes example readme HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from pyrra.