Make OptimizedSLIRecordGenerator optional (sloth issue, closed)

slok commented on July 21, 2024

Make OptimizedSLIRecordGenerator optional

Comments (5)

tokheim commented on July 21, 2024

Hi @slok,

Sorry for causing trouble again. Unfortunately, we have some services that are only in use during, e.g., Formula 1 events (a few hours every Sunday or so), and they would suffer a bit from this. I actually see it a bit differently, that it could be an ok approximation for alerting, while the slo-period is the actual contract the team commits to. Alerting is just there to help the team reach the contract.

I completely agree that you are better off using >0 or vector(1) tricks to avoid NaNs in the first place. The only real shortcoming I see that the library could solve is that, if you mess up the query, you have to wait 30d until the NaNs age out of Prometheus. So maybe the optimized rule should instead do something like

sum_over_time(x > 0) /
count_over_time(x > 0)

to simply disregard NaNs.
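The idea of dividing a sum of valid samples by a count of valid samples can be sketched outside PromQL. The following is a minimal Python illustration, not Sloth's generated rule; the function names and the sample values are hypothetical. It also shows one thing worth checking before adopting a strict `> 0` filter: such a filter drops exact zeros as well as NaNs, which shifts the average.

```python
import math

def nan_skipping_average(samples):
    """Average only the samples that are real numbers (sum of valid / count of valid)."""
    valid = [s for s in samples if not math.isnan(s)]
    if not valid:
        return math.nan  # nothing usable in the window
    return sum(valid) / len(valid)

def positive_only_average(samples):
    """Mimic a strict `> 0` filter: drops NaN samples and exact zeros alike."""
    kept = [s for s in samples if s > 0]  # NaN > 0 is False, so NaNs are dropped too
    if not kept:
        return math.nan
    return sum(kept) / len(kept)

# Hypothetical 5-minute error ratios; one slice is NaN (for example 0/0 with no traffic).
slices = [0.0, 0.02, math.nan, 0.0, 0.04]

skip_nan = nan_skipping_average(slices)        # (0 + 0.02 + 0 + 0.04) / 4
positive_only = positive_only_average(slices)  # (0.02 + 0.04) / 2
```

With these made-up values the two filters disagree, because the slices with a zero error ratio are kept by the first function and discarded by the second.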

slok commented on July 21, 2024

Awesome @tokheim @xairos, thanks for your input and for raising your hands. It is very valuable to see that this is not an isolated case.

I'll think about this, but most likely this will be in the next release :)

Again many thanks!

tokheim commented on July 21, 2024

Oh, and using the non-optimized calculation would also partially help people experiencing NaNs in the period budget, like in #231. If there is a NaN in any 5-minute slice, then the whole optimized SLI recording expression will evaluate to NaN. I guess there are other ways this could be improved though...
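The propagation described here follows from ordinary floating-point arithmetic: any sum that includes a NaN is itself NaN. A minimal Python sketch, assuming hypothetical per-slice error ratios and a plain mean (it does not reproduce Sloth's actual expression):

```python
import math

def mean(samples):
    """Plain arithmetic mean, with no special handling of NaN."""
    return sum(samples) / len(samples)

# Hypothetical 5-minute SLI error ratios over a period.
clean = [0.0, 0.01, 0.0, 0.02]
with_one_nan = [0.0, 0.01, math.nan, 0.02]

clean_result = mean(clean)       # a regular float
nan_result = mean(with_one_nan)  # NaN: the single bad slice poisons the sum
```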

slok commented on July 21, 2024

Hi again @tokheim

That optimization was made because most people only want the 30d value as an informative metric (not used for alerting or similar things), and the approximation works in most cases (well, Prometheus itself is also an approximation of reality). Not having it would kill a lot of Prometheus installations (due to the queries), or would leave people with no metric at all because the rules time out.

You are the first person (that I'm aware of) that this doesn't work for him/her.

Let me think about adding this option. If it's added, it would be done at the application level (with a flag) and not at the SLO level; I would prefer to keep the specs as simple as possible.

Regarding the NaNs, it's a different problem that should be solved with the right queries (using a combination of >0, vector(0)...).

xairos commented on July 21, 2024

> You are the first person (that I'm aware of) that this doesn't work for him/her

I wanted to add a +1 to say that we'd benefit as well from this!

> I actually see it a bit differently, that it could be an ok approximation for alerting, while the slo-period is the actual contract the team commits to. Alerting is just there to help the team reach the contract.

I have this perspective as well, and before looking at the PromQL that Sloth generates, I actually thought that Sloth would use this method for calculating slo:period_error_budget_remaining:ratio. I had been (erroneously) expecting that the SLOs would be calculated using good events / total events (rather than good time / total time using the optimized calculation).

We have many services with varying traffic patterns, e.g. internal apps used by customer service that have almost no traffic outside of customer support hours. If those SLOs could be calculated using good events / total events, they would become more useful for reporting 👍🏻
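The gap between "good time / total time" and "good events / total events" is easiest to see with uneven traffic. The sketch below is a hedged Python illustration with hypothetical names (`Window`, `time_based_sli`, `event_based_sli`) and made-up counts; it assumes that windows with no traffic are simply skipped in the time-based average, which may not match how a given set of recording rules behaves.

```python
from dataclasses import dataclass

@dataclass
class Window:
    good: int   # events that met the objective in this window
    total: int  # all events in this window

def time_based_sli(windows):
    """Average of per-window ratios: every window counts equally, whatever its traffic."""
    ratios = [w.good / w.total for w in windows if w.total > 0]
    return sum(ratios) / len(ratios)

def event_based_sli(windows):
    """Good events / total events: every event counts equally."""
    good = sum(w.good for w in windows)
    total = sum(w.total for w in windows)
    return good / total

# Hypothetical traffic: three quiet windows and one busy window with an incident.
windows = [
    Window(good=10, total=10),
    Window(good=10, total=10),
    Window(good=10, total=10),
    Window(good=9000, total=10000),
]

time_sli = time_based_sli(windows)    # (1 + 1 + 1 + 0.9) / 4
event_sli = event_based_sli(windows)  # 9030 / 10030
```

In this example the busy window carries almost all of the events, so the event-based figure sits close to that window's 90% ratio, while the time-based figure is pulled up by the three quiet windows.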
