We have a weird issue that SLO does not reflect errors properly. Availability says 0 e

Hey <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url=

Yeah, as <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard

Hey <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url=

<a class="user-mention notranslate" data-hovercard-type="user" data-hover

SLO does not pick up errors about pyrra HOT 10 OPEN

akosyakov commented on June 16, 2024

SLO does not pick up errors

from pyrra.

Comments (10)

ArthurSens commented on June 16, 2024 4

Hi Anton, long time no see 😄

Would it be okay to share the SLO spec?

The errors displayed there are fetched from a recording rule generated by Pyrra called "increase:error metric name:slo window"

To start troubleshooting I would look for log errors in both Pyrra and Prometheus that might correlate with this metric

from pyrra.

metalmatze commented on June 16, 2024 2

While I agree that this is probably the best way to implement an SLO, would it make sense to not be so restrict about it?

Yes, fair enough, and also for those circumstances where you aren't in control of the metrics. I guess a relabeling rule might work indeed.

I think this issue is mostly about how we tried labels and how they cancel out in the generic rules (for Grafana etc), right? In that case, we might be able to come up with a test case and implement a fix.

Sadly, I've prioritized some other work for Pyrra right now and might not get to this soon. If anyone wants to try I'm happy to help!

from pyrra.

mustard-mh commented on June 16, 2024 1

Hey @ArthurSens , @metalmatze here's our SLO rules

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
...
spec:
  alerting:
    name: VSCodeDesktopLocalSSHConnection
  description: 99% of connection connection over local ssh proxy is succeed
  indicator:
    ratio:
      errors:
        metric: vscode_desktop_local_ssh_total{phase="failed", failure_code!~"NoRunningInstance|FailedToGetAuthInfo:UNAVAILABLE|FailedToGetAuthInfo:CANCELLED"}
      grouping: null
      total:
        metric: vscode_desktop_local_ssh_total{phase!="connecting"}
  target: "99.0"
  window: 4w

Long time no see @ArthurSens 🙌 😄

from pyrra.

metalmatze commented on June 16, 2024 1

The way I read your configuration and how the total metric is supposed to work you shouldn't filter for the connecting at all. The total metric should include all counters.

      total:
-        metric: vscode_desktop_local_ssh_total{phase!="connecting"}
+        metric: vscode_desktop_local_ssh_total

The resulting recording rule won't include the phase labels as those are part of both the total and error metric.

from pyrra.

metalmatze commented on June 16, 2024

Yeah, as @ArthurSens said, it seems like the recording rules aren't properly running in the background in Prometheus.

from pyrra.

mustard-mh commented on June 16, 2024

It seems the label phase is not included by pyrra, ~~then reset related data in pyrra should work~~

from pyrra.

mustard-mh commented on June 16, 2024

Hey @metalmatze , we resolve this problem temporarily by make total metric query with vscode_desktop_local_ssh_total{phase!~"connecting"} ☝️ #794 (comment)

And we found in code base, for ratio type, if we have error (phase~=xxx, failure_code) and total (phase!=xx), it will

list labels from error -> current phase, failure_code
delete labels from total [1] -> current failure_code
add labels back if they are regexp [2] -> current failure_code

So result labels will contain failure_code only, it looks like a bug of pyrra?

Maybe check if label value is equal before delete is required?

from pyrra.

mustard-mh commented on June 16, 2024

It's about how we report metrics, we have phase connecting connected failed, so we need a label matcher vscode_desktop_local_ssh_total{phase!="connecting"}

from pyrra.

ArthurSens commented on June 16, 2024

The way I read your configuration and how the total metric is supposed to work you shouldn't filter for the connecting at all. The total metric should include all counters.

While I agree that this is probably the best way to implement an SLO, would it make sense to not be so restrict about it?

I can see the case where metrics exposed by external services just don't follow this pattern, so building an SLO with them becomes extra difficult.

It's about how we report metrics, we have phase connecting connected failed, so we need a label matcher vscode_desktop_local_ssh_total{phase!="connecting"}

@mustard-mh, as a workaround, do you need the metric with phase=connecting label for any other reason that is not the SLO? Would it be fine to add a relabeling rule to your prometheus to drop the metric? Or even better, if you have access to the codebase that exposes this metric, could you update the instrumentation to not increment the counter during this phase?

from pyrra.

mustard-mh commented on June 16, 2024

@mustard-mh, as a workaround, do you need the metric with phase=connecting label for any other reason that is not the SLO?

It's a part of user flow telemetry before: connecting -> connected, connecting -> failed, connecting->abort, but yes if it's metric then we can have only connected and failed.

Right now as a workaround, use {phase!~"connecting"} or {phase~="connected|failed"} can make it work

from pyrra.

SLO does not pick up errors about pyrra HOT 10 OPEN

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent