Comments (12)
This would also be addressed by adding explicit "resolved" messages from Prometheus->Alertmanager for resolved alerts, right? That is, the refresh interval would then only be used as a fallback for timing out alerts in case of lossy communication between Prometheus and Alertmanager. As long as no "resolved" message is received, Alertmanager will then (rightfully, in 99.9% of cases) assume that the alert is still firing.
from alertmanager.
Personally, I am not sure that I would want to complicate the prometheus->alertmanager interaction by making prometheus send a 'resolved' event. Then prometheus has to know information about alertmanager. I feel like, if you are talking about #44, that issue is best solved by 'resolving' the alert after it stops getting refreshed.
I guess the way that I see the state machine for this is the following.
1. event causes alert (send alert, set last sent time) goto 2
2. has it been repeat rate seconds from last sent time (yes: goto 1, no: goto 3)
3. in state alerted (wait min refresh seconds)
   - either we see the same event again within min refresh seconds (goto 2)
   - we do not see the alert in min refresh seconds (goto 4)
4. remove alert, queue resolved event (this is #44)
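The steps above can be read as a small per-alert state machine. Here is a minimal sketch of that reading, not Alertmanager's actual code: the class name `AlertState`, its methods, and the `outbox` list are all hypothetical, and time is passed in explicitly as seconds so the behavior is easy to follow.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class AlertState:
    """Hypothetical per-alert state for the steps listed above."""
    repeat_rate: float   # seconds between repeated notifications
    min_refresh: float   # seconds without a refresh before resolving
    last_sent: Optional[float] = None
    last_seen: Optional[float] = None
    outbox: List[Tuple[str, float]] = field(default_factory=list)

    def on_event(self, now: float) -> None:
        """Steps 1-2: notify on first sight, or again once repeat_rate has elapsed."""
        self.last_seen = now
        if self.last_sent is None or now - self.last_sent >= self.repeat_rate:
            self.outbox.append(("alert", now))
            self.last_sent = now

    def on_tick(self, now: float) -> None:
        """Steps 3-4: queue a resolved event if no refresh arrived within min_refresh."""
        if self.last_seen is not None and now - self.last_seen > self.min_refresh:
            self.outbox.append(("resolved", now))
            self.last_sent = None
            self.last_seen = None


state = AlertState(repeat_rate=1800, min_refresh=300)
state.on_event(0)      # first event: notification sent
state.on_event(30)     # refresh inside repeat_rate: nothing sent
state.on_tick(100)     # still refreshed recently: nothing happens
state.on_event(1800)   # repeat_rate elapsed: notification sent again
state.on_tick(2101)    # no refresh for more than min_refresh: resolved
```

In this reading the "resolved" event is purely a consequence of the refresh timing out, which is the behavior this comment argues for.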
from alertmanager.
"Then prometheus has to know information about alertmanager" -> It already sends firing alerts to alertmanager, so it wouldn't have to know more about alertmanager to also send "resolved" messages once an alert stops firing, right? Implementation-wise, it would probably call /api/alerts with a DELETE method. This has the added benefit that resolved alerts would also disappear in general from Alertmanager, meaning they will no longer be able to act as sources for inhibiting other alerts, nor will they be shown in the web UI (otherwise people get confused why resolved alerts take a long time to go away).
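To make the proposal concrete, here is a rough sketch of what such a "resolved" call might look like from the sender's side. This is only an illustration of the idea in this comment: the payload shape, the helper name `build_resolve_request`, and the example base URL are assumptions, and the request is only built, never sent.

```python
import json
import urllib.request


def build_resolve_request(base_url, labels):
    """Build (but do not send) a hypothetical 'resolved' request for one alert."""
    # Assumed payload: a JSON list with the label set identifying the alert.
    body = json.dumps([{"labels": labels}]).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/alerts",
        data=body,
        method="DELETE",
        headers={"Content-Type": "application/json"},
    )


# Example base URL and labels are placeholders, not taken from the thread.
req = build_resolve_request(
    "http://localhost:9093/",
    {"alertname": "ContainerDown", "name": "container-name"},
)
```

The sender already knows the address it pushes firing alerts to, so nothing new about the receiver is needed to issue this call.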
from alertmanager.
And #44 would be handled independently of that: whenever an alert disappears from AM, whether that is via timeout or via explicit "resolved" message, AM would resolve the alert in e.g. PagerDuty.
from alertmanager.
I guess what you are proposing is that the alert duration in prometheus be set to exactly the same value as the 'wait-for-refresh' time. If the alert is "more 5XX statuses than 200 statuses for 1 minute", and the wait-for-refresh time is set to 1 minute as well, then in effect that would be exactly the same thing that you described. I am not really sure this would be what I want. I like the idea of setting the alert to be something short, and then having the alert manager aggregate multiple of them together into 1 alert. Meaning that if something is a little flaky, going off for 30 seconds, then off for 2 minutes, then on for 30 seconds, then off for 1 minute, I want to get the 1 alert instead of a bunch of alerts and a bunch of resolves.
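The flaky pattern in this comment can be sketched as follows. This is a toy model, not how Alertmanager works internally: the function name `merge_flapping` is hypothetical, and the third firing window is added here only to show what happens after the final 1 minute of quiet.

```python
def merge_flapping(windows, refresh_timeout):
    """Merge firing windows whose gaps are covered by refresh_timeout.

    windows: list of (start, end) in seconds during which the alert was refreshed.
    Returns (start, expires_at) episodes as a timeout-based receiver would see them.
    """
    episodes = []
    for start, end in sorted(windows):
        if episodes and start <= episodes[-1][1]:
            # The alert came back before the previous episode timed out.
            prev_start, prev_expiry = episodes[-1]
            episodes[-1] = (prev_start, max(prev_expiry, end + refresh_timeout))
        else:
            episodes.append((start, end + refresh_timeout))
    return episodes


# 30s on, 2 minutes off, 30s on, 1 minute off, then (assumed) 30s on again.
flaky = [(0, 30), (150, 180), (240, 270)]

one_alert = merge_flapping(flaky, refresh_timeout=300)  # 5 minute timeout
many_alerts = merge_flapping(flaky, refresh_timeout=0)  # immediate resolve
```

With a 5 minute timeout the three bursts collapse into a single episode, while resolving immediately yields three separate alert and resolve pairs. That is the trade-off being argued here.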
from alertmanager.
To avoid that there should be explicit flapping detection. I wouldn't want that to be implicit by hiding half the state changes from AM.
from alertmanager.
Hmm, okay. I guess I would think that the easiest way to do that would be setting up a separate alert in prometheus on the ALERTS metric.
from alertmanager.
I must admit I don't fully understand your description above in #46 (comment). Perhaps there is some confusion about how Prometheus and Alertmanager interact. Currently, AM does no aggregation of multiple Prometheus alerts into a single notification (though that is definitely something we need in the future). How it currently works:
Prometheus sends any firing alerts repeatedly to alertmanager (best at a repeat interval significantly lower than the AM's refresh timeout; like 30s vs. 5 minutes) to make sure AM always knows what's firing (even in the face of lost messages from Prometheus->Alertmanager). The whole pipeline is mostly state-based, not event-based, and we want Alertmanager to reflect the correct current state of alerts. The state in AM will converge sooner to the "real" one if we also send "resolved" notifications in case Prometheus detects a previously firing alert as no longer firing.
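A minimal sketch of the state-based behavior described above, under the assumption that the receiver simply remembers when each alert was last pushed. The class `AlertStore` and its method names are hypothetical and are not Alertmanager's implementation.

```python
class AlertStore:
    """Toy state-based alert store, keyed by alert name."""

    def __init__(self, refresh_timeout):
        self.refresh_timeout = refresh_timeout
        self._last_refresh = {}  # alert name -> time of the last firing push

    def push_firing(self, name, now):
        """Record a (repeated) firing push from the sender."""
        self._last_refresh[name] = now

    def push_resolved(self, name):
        """Drop the alert right away on an explicit 'resolved' message."""
        self._last_refresh.pop(name, None)

    def firing(self, now):
        """Alerts refreshed within the timeout are still considered firing."""
        return sorted(
            name for name, seen in self._last_refresh.items()
            if now - seen <= self.refresh_timeout
        )


store = AlertStore(refresh_timeout=300)
store.push_firing("ContainerDown", now=0)
store.push_firing("ContainerDown", now=30)  # last push before the alert clears
still_shown = store.firing(now=200)         # sender went quiet, alert lingers
timed_out = store.firing(now=331)           # gone only after the timeout

store.push_firing("ContainerDown", now=400)
store.push_resolved("ContainerDown")        # explicit resolve
after_resolve = store.firing(now=401)       # gone immediately
```

The timeout path still covers lost messages, while the explicit resolve lets the stored state converge to the real one without waiting for it.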
There'll be mechanisms in the future to deal with flapping, or to aggregate multiple alerts into one notification (given some grace period of time in which they all need to start firing), but so far they don't exist.
from alertmanager.
Okay. This could be a misunderstanding, and I could try this test again, but I am pretty sure that I did the following, and I can tell you what happened. I have prometheus set up to scrape cAdvisor. cAdvisor is giving me metrics on my docker containers.
1 ALERT in prometheus set for
IF absent(((time() - container_last_seen{name="container-name"}) < 5)) FOR 15s
This will trigger an alert if the container named 'container-name' has not been seen within the last 5 seconds, and that condition holds for 15 seconds.
I shut off my service. Naturally about 20 seconds later prometheus transitions the alert into firing and sends the alert to the alert manager. Everything works like expected.
Then I turn the service on, and prometheus transitions the alert back into not firing mode and stops sending the alert to alertmanager every 15 seconds.
Then alertmanager keeps the same alert up on the dashboard for 5 minutes (what the refresh rate seconds is set to) before it finally stops showing the alert.
When I killed the service again at, say, the 4 minute mark, alertmanager just kept the same alert up, since it was not outside of the 5 minute refresh rate seconds. Then, if I kept it down, it would continue to send alerts every 30 minutes (the repeat rate seconds).
from alertmanager.
Yes, I think that description makes sense.
Just a note (but unrelated): to meaningfully support a low FOR x resolution of 15s, you'll need to set a quite low evaluation_interval. I guess you have that set to something in the seconds range? Rules will only be evaluated as frequently as your evaluation_interval specifies, so if you use the default of 1m, Prometheus would still take 2 minutes (two rule evaluation iterations) until an alert is noticed to be firing for more than 15s.
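The arithmetic behind that note can be sketched like this. It is a simplified model that assumes the alert becomes pending at the first evaluation where the expression is true, and fires at the first later evaluation at which the FOR duration has elapsed. The function name `firing_delay_bounds` is hypothetical.

```python
import math


def firing_delay_bounds(for_seconds, evaluation_interval):
    """Return (best, worst) seconds from condition start until the alert fires.

    best:  the condition starts exactly at an evaluation.
    worst: the condition starts just after one, so a full interval is lost
           (an upper bound, approached but not reached).
    """
    evals_after_pending = math.ceil(for_seconds / evaluation_interval)
    best = evals_after_pending * evaluation_interval
    worst = best + evaluation_interval
    return best, worst


default_interval = firing_delay_bounds(for_seconds=15, evaluation_interval=60)
short_interval = firing_delay_bounds(for_seconds=15, evaluation_interval=5)
```

Under this model a 1m interval gives between 60 and 120 seconds for a FOR of 15s, matching the "2 minutes (two rule evaluation iterations)" above, while a 5s interval gives 15 to 20 seconds, in line with the "about 20 seconds" reported earlier in the thread.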
Anyways, I don't think waiting for another alert-state push from Prometheus before sending another notification is the solution we want. Conceptually, I'd want those two things to be completely decoupled (the former is indicating and continuously reaffirming the current state of an alert, while the latter instantaneously triggers a notification). Resolving alerts in AM via a message from Prometheus sounds like the better approach.
from alertmanager.
Okay do with this ticket as you please. I have made my point to the best of my knowledge.
from alertmanager.
I believe this is much better with the new alertmanager and latest Prometheus.
from alertmanager.