Currently if I have my min refresh period set to 5 minutes, and I set my repeat_rate_s

wait for refresh before sending alert after repeat_rate_seconds (alertmanager issue, closed)

camerondavison commented on May 16, 2024

wait for refresh before sending alert after repeat_rate_seconds

from alertmanager.

Comments (12)

juliusv commented on May 16, 2024

This would also be addressed by adding explicit "resolved" messages from Prometheus->Alertmanager for resolved alerts, right? That is, the refresh interval would then only be used as a fallback for timing out alerts in case of lossy communication between Prometheus and Alertmanager. As long as no "resolved" message is received, Alertmanager will then (rightfully, in 99.9% of cases) assume that the alert is still firing.

camerondavison commented on May 16, 2024

Personally, I am not sure that I would want to complicate the prometheus->alertmanager interaction by making prometheus send a 'resolved' event. Then prometheus has to know information about alertmanager. I feel like, if you are talking about #44, that issue is best solved by 'resolving' the alert after it stops getting refreshed.

I guess the way that I see the state machine for this is the following.

1. event causes alert (send alert, set last sent time) goto 2
2. has it been repeat rate seconds from last sent time (yes: goto 1, no: goto 3)
3. in state alerted (wait min refresh seconds)
   - either we see the same event again within min refresh seconds (goto 2)
   - we do not see the alert in min refresh seconds (goto 4)
4. remove alert, queue resolved event (this is #44)
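
For readers who prefer code, the state machine above could be sketched roughly as follows. This is only an illustration of the proposal, not Alertmanager's implementation: the `AlertTracker` class, its method names, and the `log` list are hypothetical, and time is passed in explicitly as seconds so the behaviour is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AlertTracker:
    """Hypothetical tracker for a single alert (illustration only)."""
    min_refresh_s: float  # how long to wait for a refresh before resolving
    repeat_rate_s: float  # minimum gap between repeated notifications
    last_seen: Optional[float] = None
    last_sent: Optional[float] = None
    log: list = field(default_factory=list)

    def on_event(self, now: float) -> None:
        """Prometheus pushed the alert (steps 1 and 2 of the list)."""
        self.last_seen = now
        # A notification goes out only when a refresh arrives AND the
        # repeat rate has elapsed since the last one that was sent.
        if self.last_sent is None or now - self.last_sent >= self.repeat_rate_s:
            self.last_sent = now
            self.log.append(("notify", now))

    def on_tick(self, now: float) -> None:
        """Periodic check for a missing refresh (steps 3 and 4)."""
        if self.last_seen is None:
            return
        if now - self.last_seen > self.min_refresh_s:
            # No refresh within min refresh seconds: remove the alert
            # and queue a resolved event.
            self.log.append(("resolved", now))
            self.last_seen = None
            self.last_sent = None


tracker = AlertTracker(min_refresh_s=300, repeat_rate_s=1800)
tracker.on_event(0)    # first event: notification is sent
tracker.on_event(15)   # refresh within repeat rate: nothing is sent
tracker.on_tick(400)   # no refresh for more than 300s: resolved
print(tracker.log)
```

The point of the sketch is that a repeat notification is only ever considered inside `on_event`, so it cannot fire for an alert that has stopped being refreshed.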

juliusv commented on May 16, 2024

"Then prometheus has to know information about alertmanager" -> It already sends firing alerts to alertmanager, so it wouldn't have to know more about alertmanager to also send "resolved" messages once an alert stops firing, right? Implementation-wise, it would probably call to /api/alerts with a DELETE method. This has the added benefit that resolved alerts would also disappear in general from Alertmanager, meaning they will no longer be able to act as sources for inhibiting other alerts, nor will they be shown in the web UI (otherwise people get confused why resolved alerts take a long time to go away).

juliusv commented on May 16, 2024

And #44 would be handled independently of that: whenever an alert disappears from AM, whether that is via timeout or via explicit "resolved" message, AM would resolve the alert in e.g. PagerDuty.

camerondavison commented on May 16, 2024

I guess what you are proposing is that an alert in prometheus be set to the exact same duration as the 'wait-for-refresh' time. If the alert is "more 5XX statuses than 200 statuses for 1 minute", and the wait-for-refresh time is set to 1 minute as well, then in effect that would be the exact same thing that you described. I am not really sure this would be what I want. I like the idea of setting the alert to be something short, and then having the alert manager aggregate multiple of them together into 1 alert. Meaning that if something is a little flaky (firing for 30 seconds, then off for 2 minutes, then on for 30 seconds, then off for 1 minute), I want to get 1 alert instead of a bunch of alerts and a bunch of resolves.

matthiasr commented on May 16, 2024

To avoid that there should be explicit flapping detection. I wouldn't want that to be implicit by hiding half the state changes from AM.

camerondavison commented on May 16, 2024

Hmm, okay. I guess the easiest way to do that would be to set up a separate alert in prometheus on the ALERTS metric.

juliusv commented on May 16, 2024

I must admit I don't fully understand your description above in #46 (comment). Perhaps there is some confusion about how Prometheus and Alertmanager interact. Currently, AM does no aggregation of multiple Prometheus alerts into a single notification (though that is definitely something we need in the future). How it currently works:

Prometheus sends any firing alerts repeatedly to alertmanager (best at a repeat interval significantly lower than the AM's refresh timeout; like 30s vs. 5 minutes) to make sure AM always knows what's firing (even in the face of lost messages from Prometheus->Alertmanager). The whole pipeline is mostly state-based, not event-based, and we want Alertmanager to reflect the correct current state of alerts. The state in AM will converge sooner to the "real" one if we also send "resolved" notifications in case Prometheus detects a previously firing alert as no longer firing.

There'll be mechanisms in the future to deal with flapping, or aggregating multiple alerts into one notification (given some grace period of time in which they all need to start firing), but so far it doesn't exist yet.
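
As a rough illustration of the state-based model described above, here is a minimal sketch of an alert store that expires alerts after a refresh timeout and also accepts an explicit "resolved" message. The `AlertStore` class and its methods are hypothetical names made up for this example; they do not reflect Alertmanager's actual code or API.

```python
from typing import Dict, List


class AlertStore:
    """Toy model of a state-based alert store (illustration only)."""

    def __init__(self, refresh_timeout_s: float) -> None:
        self.refresh_timeout_s = refresh_timeout_s
        self._last_refresh: Dict[str, float] = {}

    def refresh(self, name: str, now: float) -> None:
        """Prometheus re-sent a firing alert; remember when we saw it."""
        self._last_refresh[name] = now

    def resolve(self, name: str) -> None:
        """Explicit 'resolved' message: drop the alert immediately."""
        self._last_refresh.pop(name, None)

    def firing(self, now: float) -> List[str]:
        """Return alerts still considered firing, expiring stale ones."""
        stale = [
            name for name, seen in self._last_refresh.items()
            if now - seen > self.refresh_timeout_s
        ]
        for name in stale:
            del self._last_refresh[name]
        return sorted(self._last_refresh)


store = AlertStore(refresh_timeout_s=300)  # 5 minute refresh timeout
store.refresh("ServiceDown", now=0)
store.refresh("ServiceDown", now=30)       # resent every 30s while firing
print(store.firing(now=200))               # still within the timeout
store.resolve("ServiceDown")               # explicit "resolved" message
print(store.firing(now=201))               # gone without waiting for timeout
```

Without the `resolve` call, the alert would only disappear once more than `refresh_timeout_s` seconds have passed since the last refresh, which is the fallback behaviour for lost messages.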

camerondavison commented on May 16, 2024

Okay. This could be a misunderstanding, and I could try this test again, but I am pretty sure that I did the following and I can tell you what happened. I have prometheus set up to scrape cAdvisor. cAdvisor is giving me metrics on my docker containers.

1 ALERT in prometheus set for

IF absent(((time() - container_last_seen{name="container-name"}) < 5)) FOR 15s

This will trigger an alert if the container named 'container-name' is not seen for 5 seconds from now, and it lasts for 15 seconds.

I shut off my service. Naturally about 20 seconds later prometheus transitions the alert into firing and sends the alert to the alert manager. Everything works like expected.

Then I turn the service on, and prometheus transitions the alert back into not firing mode and stops sending the alert to alertmanager every 15 seconds.

Then alertmanager keeps the same alert up on the dashboard for 5 minutes (what the refresh rate seconds is set to) before it finally stops showing the alert.

When I killed the service again at, say, the 4 minute mark, prometheus just kept the same alert up, since it was not outside of the 5 minute refresh rate seconds. Then, if I kept it down, it would continue to send alerts every 30 minutes (the repeat rate seconds).

juliusv commented on May 16, 2024

Yes, I think that description makes sense.

Just a note (but unrelated): to meaningfully support a low FOR x resolution of 15s, you'll need to set a quite low evaluation_interval. I guess you have that set to something in the seconds range? Rules will only be evaluated as frequently as your evaluation_interval specifies, so if you use the default of 1m, Prometheus would still take 2 minutes (two rule evaluation iterations) until an alert is noticed to be firing for more than 15s.
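
To make the arithmetic concrete, here is a small sketch under a simplified model: rules are evaluated every `evaluation_interval_s` seconds, the first evaluation that sees the condition marks the alert as pending, and the alert fires at the first later evaluation where it has been pending for at least the FOR duration. The function names are made up for this example.

```python
import math


def evals_until_firing(for_s: int, evaluation_interval_s: int) -> int:
    """Rule evaluations needed, counting the one that first sees the condition."""
    return math.ceil(for_s / evaluation_interval_s) + 1


def worst_case_delay_s(for_s: int, evaluation_interval_s: int) -> int:
    """Upper bound on the delay from the condition becoming true to firing.

    Worst case: the condition becomes true just after an evaluation, so
    almost a full interval passes before the alert even becomes pending.
    """
    return evals_until_firing(for_s, evaluation_interval_s) * evaluation_interval_s


print(evals_until_firing(15, 60), worst_case_delay_s(15, 60))  # FOR 15s, 1m interval
print(evals_until_firing(15, 5), worst_case_delay_s(15, 5))    # FOR 15s, 5s interval
```

Under these assumptions, a 1m interval with FOR 15s needs two evaluations and up to 120 seconds, while a 5s interval brings the bound down to 20 seconds.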

Anyways, I don't think waiting for another alert-state push from Prometheus before sending another notification is the solution we want. Conceptually, I'd want those two things to be completely decoupled (the former is indicating and continuously reaffirming the current state of an alert, while the latter instantaneously triggers a notification). Resolving alerts in AM via a message from Prometheus sounds like the better approach.

camerondavison commented on May 16, 2024

Okay do with this ticket as you please. I have made my point to the best of my knowledge.

brian-brazil commented on May 16, 2024

I believe this is much better with the new alertmanager and latest Prometheus.
