Comments (12)

juliusv commented on May 16, 2024

This would also be addressed by adding explicit "resolved" messages from Prometheus->Alertmanager for resolved alerts, right? That is, the refresh interval would then only be used as a fallback for timing out alerts in case of lossy communication between Prometheus and Alertmanager. As long as no "resolved" message is received, Alertmanager will then (rightfully, in 99.9% of cases) assume that the alert is still firing.

from alertmanager.

camerondavison commented on May 16, 2024

Personally, I am not sure that I would want to complicate the prometheus->alertmanager interaction by making prometheus send a 'resolved' event. Then prometheus has to know information about alertmanager. I feel like if you are talking about #44, that issue is best solved by 'resolving' the alert after it stops getting refreshed.

I guess the way that I see the state machine for this is the following.

  1. event causes alert (send alert, set last sent time) goto 2
  2. has it been repeat rate seconds from last sent time (yes: goto 1, no: goto 3)
  3. in state alerted (wait min refresh seconds)
    • either we see the same event again within min refresh seconds (goto 2)
    • we do not see the alert in min refresh seconds (goto 4)
  4. remove alert, queue resolved event (this is #44)
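
The four steps above can be sketched roughly as follows. This is only an illustration of the proposed state machine, not Alertmanager's code: the class `AlertTracker`, its methods, and the parameter names `repeat_rate` and `min_refresh` are hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class AlertState:
    last_seen: float  # last time the event was pushed to us
    last_sent: float  # last time a notification went out


class AlertTracker:
    """Hypothetical tracker for the state machine described above."""

    def __init__(self, repeat_rate: float, min_refresh: float) -> None:
        self.repeat_rate = repeat_rate  # seconds between repeated notifications
        self.min_refresh = min_refresh  # seconds without refresh before resolving
        self.alerts: Dict[str, AlertState] = {}

    def on_event(self, name: str, now: float) -> List[Tuple[str, str]]:
        """Steps 1-3: an event for `name` arrives at time `now`."""
        state = self.alerts.get(name)
        if state is None:
            # Step 1: new alert, notify and remember when we sent it.
            self.alerts[name] = AlertState(last_seen=now, last_sent=now)
            return [("alert", name)]
        state.last_seen = now
        if now - state.last_sent >= self.repeat_rate:
            # Step 2: repeat rate has elapsed, notify again.
            state.last_sent = now
            return [("alert", name)]
        # Step 3: still in the alerted state, nothing to send.
        return []

    def tick(self, now: float) -> List[Tuple[str, str]]:
        """Step 4: resolve alerts that were not refreshed in time."""
        resolved = []
        for name in list(self.alerts):
            if now - self.alerts[name].last_seen > self.min_refresh:
                del self.alerts[name]
                resolved.append(("resolved", name))
        return resolved
```

With `repeat_rate=1800` and `min_refresh=300`, an event that reappears within five minutes of its last refresh produces no second notification, and a "resolved" is queued only once the alert has gone unrefreshed for longer than `min_refresh`.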

juliusv commented on May 16, 2024

"Then prometheus has to know information about alertmanager" -> It already sends firing alerts to alertmanager, so it wouldn't have to know more about alertmanager to also send "resolved" messages once an alert stops firing, right? Implementation-wise, it would probably call to /api/alerts with a DELETE method. This has the added benefit that resolved alerts would also disappear in general from Alertmanager, meaning they will no longer be able to act as sources for inhibiting other alerts, nor will they be shown in the web UI (otherwise people get confused why resolved alerts take a long time to go away).

juliusv commented on May 16, 2024

And #44 would be handled independently of that: whenever an alert disappears from AM, whether that is via timeout or via explicit "resolved" message, AM would resolve the alert in e.g. PagerDuty.

camerondavison commented on May 16, 2024

I guess what you are proposing is that an alert in prometheus be set to exactly the same duration as the 'wait-for-refresh' time. If the alert is "more 5XX statuses than 200 statuses for 1 minute", and the wait-for-refresh time is set to 1 minute as well, then in effect that would be exactly what you described. I am not really sure this is what I want. I like the idea of setting the alert to something short, and then having the alert manager aggregate multiple of them into 1 alert. That is, if something is a little flaky (on for 30 seconds, off for 2 minutes, on for 30 seconds, off for 1 minute), I want to get 1 alert instead of a bunch of alerts and a bunch of resolves.

matthiasr commented on May 16, 2024

To avoid that, there should be explicit flapping detection. I wouldn't want that to be implicit by hiding half the state changes from AM.

camerondavison commented on May 16, 2024

hmm. okay. i guess i would think that the easiest way to do that would be setting up a separate alert in prometheus on the ALERT metric

juliusv commented on May 16, 2024

I must admit I don't fully understand your description above in #46 (comment). Perhaps there is some confusion about how Prometheus and Alertmanager interact. Currently, AM does no aggregation of multiple Prometheus alerts into a single notification (though that is definitely something we need in the future). How it currently works:

Prometheus sends any firing alerts repeatedly to alertmanager (best at a repeat interval significantly lower than the AM's refresh timeout; like 30s vs. 5 minutes) to make sure AM always knows what's firing (even in the face of lost messages from Prometheus->Alertmanager). The whole pipeline is mostly state-based, not event-based, and we want Alertmanager to reflect the correct current state of alerts. The state in AM will converge sooner to the "real" one if we also send "resolved" notifications in case Prometheus detects a previously firing alert as no longer firing.
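
To make the state-based model concrete, here is a minimal sketch of a store that keeps an alert firing while it is being refreshed, drops it on an explicit "resolved" message, and otherwise falls back to the refresh timeout. The class `AlertStore` and its method names are made up for illustration and do not correspond to Alertmanager's actual API.

```python
from typing import Dict, List


class AlertStore:
    """Hypothetical store: state converges via refreshes, resolves, or timeout."""

    def __init__(self, refresh_timeout: float) -> None:
        self.refresh_timeout = refresh_timeout
        self._last_refresh: Dict[str, float] = {}

    def push_firing(self, name: str, now: float) -> None:
        # Repeated pushes simply reaffirm that the alert is still firing.
        self._last_refresh[name] = now

    def push_resolved(self, name: str) -> bool:
        # Explicit "resolved" message: drop the alert immediately.
        return self._last_refresh.pop(name, None) is not None

    def firing(self, now: float) -> List[str]:
        # Fallback: time out alerts whose refreshes stopped arriving.
        expired = [
            name
            for name, seen in self._last_refresh.items()
            if now - seen >= self.refresh_timeout
        ]
        for name in expired:
            del self._last_refresh[name]
        return sorted(self._last_refresh)
```

With a 5 minute timeout and pushes every 30s, an alert whose last push arrived at t=30 stays visible until t=330 if no "resolved" message is sent, but disappears right away if one is.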

There'll be mechanisms in the future to deal with flapping, or to aggregate multiple alerts into one notification (given some grace period of time in which they all need to start firing), but so far they don't exist.

camerondavison commented on May 16, 2024

Okay. This could be a misunderstanding, and I could try this test again, but I am pretty sure that I did the following, and I can tell you what happened. I have prometheus set up to scrape cAdvisor. cAdvisor is giving me metrics on my docker containers.

1 ALERT in prometheus set for

IF absent(((time() - container_last_seen{name="container-name"}) < 5)) FOR 15s

This will trigger an alert if the container named 'container-name' has not been seen within the last 5 seconds, and that condition lasts for 15 seconds.

I shut off my service. Naturally about 20 seconds later prometheus transitions the alert into firing and sends the alert to the alert manager. Everything works like expected.

Then I turn the service on, and prometheus transitions the alert back into not firing mode and stops sending the alert to alertmanager every 15 seconds.

Then alertmanager keeps the same alert up on the dashboard for 5 minutes (what the refresh rate seconds is set to) before it finally stops showing the alert.

When I killed the service again at, say, the 4 minute mark, prometheus just kept the same alert up since it was not outside of the 5 minute refresh rate seconds. Then if I kept it down it would continue to send alerts every 30 minutes (the repeat rate seconds).

juliusv commented on May 16, 2024

Yes, I think that description makes sense.

Just a note (but unrelated): to meaningfully support a low FOR x resolution of 15s, you'll need to set a quite low evaluation_interval. I guess you have that set to something in the seconds range? Rules will only be evaluated as frequently as your evaluation_interval specifies, so if you use the default of 1m, Prometheus would still take 2 minutes (two rule evaluation iterations) until an alert is noticed to be firing for more than 15s.
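
The arithmetic behind that can be written down as a small helper. This is a simplified model that assumes rules are evaluated on a fixed tick and that the condition starts just after a tick; it ignores scrape delays. The function name `worst_case_firing_delay` is invented for this sketch.

```python
import math


def worst_case_firing_delay(for_seconds: float, evaluation_interval: float) -> float:
    """Worst-case seconds from condition onset until the alert is firing."""
    # The condition may start right after an evaluation, so it can take
    # up to one full interval before the alert is first seen as pending.
    wait_for_first_eval = evaluation_interval
    # The alert fires at the first later evaluation where it has been
    # pending for at least `for_seconds`.
    evals_until_firing = math.ceil(for_seconds / evaluation_interval)
    return wait_for_first_eval + evals_until_firing * evaluation_interval
```

For `FOR 15s` with a 1m evaluation interval this gives 120 seconds, the two evaluation iterations mentioned above; with a 5s interval it gives 20 seconds.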

Anyways, I don't think waiting for another alert-state push from Prometheus before sending another notification is the solution we want. Conceptually, I'd want those two things to be completely decoupled (the former is indicating and continuously reaffirming the current state of an alert, while the latter instantaneously triggers a notification). Resolving alerts in AM via a message from Prometheus sounds like the better approach.

camerondavison commented on May 16, 2024

Okay, do with this ticket as you please. I have made my point to the best of my knowledge.

brian-brazil commented on May 16, 2024

I believe this is much better with the new alertmanager and latest Prometheus.
