
sensu_handlers's People

Contributors

ammaraskar, analogue, asottile, bobtfish, davent, fessyfoo, fhats, giuliano108, hashbrowncipher, jaxxstorm, jolynch, jvperrin, liamjbennett, mesozoic, mks-m, nhandler, pauloconnor, solarkennedy, somic, stigkj, vulpine


sensu_handlers's Issues

Licensing

Thanks for the awesome work done here, especially the base handler. There are many parts of it I'd like to use in my own base handler, along with much of my own functionality.

In the spirit of sharing, I'd like to open source my work under the MIT license, but no license is specified here, so I'm not sure if I can.

Can you clarify?

Replace fog dependency with aws-sdk

Fog is a gigantic gem (around 6,000 files), and it is directly referenced in only about three or four lines.

Given that we already include aws-sdk in puppet-omnibus, and that aws-sdk is a far more natural dependency, let's migrate to it? I'm happy to volunteer.
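A minimal sketch of what the migrated call might look like. The helper name `running_instance_ids` is mine, and the `describe_instances`/`filters` shape assumes the aws-sdk v2 client interface; the client is duck-typed here so the logic stands on its own:

```ruby
# Sketch only: in the handler, ec2 would be an Aws::EC2::Client (aws-sdk v2).
# Returns the IDs of all instances currently in the 'running' state.
def running_instance_ids(ec2)
  resp = ec2.describe_instances(
    filters: [{ name: 'instance-state-name', values: ['running'] }]
  )
  resp.reservations.flat_map { |r| r.instances.map(&:instance_id) }
end
```
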

Generate an alert if the API query fails

Hi guys,

This is more of a discussion than anything, just wondering if maybe you have any thoughts.

I have a monkey-patched version of your JIRA handler, but in our org's infinite wisdom, changes to transitions, required custom fields, or the like will often mean the handler fails. It's difficult to notice this unless you're actively catching the exceptions and checking events against open jira tickets.

Can you think of a way whereby a sensu alert (with another handler) is generated without having to pull in a bunch of existing gems?

Obviously we can rescue the exception and go from there; just wondering if you have any novel ideas.

filter alerts with "Execution timed out" in pagerduty and jira handlers

I would like to open this up for discussion.

If a check is taking longer to run than expected, it often would exit 2 (critical) with output of "Execution timed out".

This comes from sensu-spawn gem - https://github.com/sensu/sensu-spawn/blob/master/lib/sensu/spawn.rb#L163

What if we filter these out of the pagerduty and jira handlers? After all, when we get this, we can't be certain it's the check that failed; in fact it's almost always not the actual check, but a bad or hung ec2 instance, etc.

A positive outcome is that we would cut down on (frequently) unactionable tickets and pages. Also, if we ever decide to do more auto-remediation on these, we would be more confident auto-remediation is attempted when it's needed, since running auto-remediation in response to "Execution timed out" (essentially an unknown exit code from a check) may not always be desirable.

A negative outcome would be we would lose implicit info about some problems that oncall often derives from seeing "Execution timed out".
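If we went this route, the filter itself could be tiny. A sketch (`execution_timed_out?` is a hypothetical name, not something in the handlers today):

```ruby
# Sketch: suppress timeout events in the pagerduty/jira handlers. The
# "Execution timed out" output string comes from sensu-spawn, as linked above.
def execution_timed_out?(event)
  event['check']['output'].to_s.start_with?('Execution timed out')
end
```
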

Discuss.

@solarkennedy @bobtfish

Develop autofixer handler

Autofixer could be a system that receives events from sensu and attempts to fix them if it's told how to do it. This handler will provide the plumbing necessary to make it happen.

A potential sequence of steps:

  • receive an event from sensu
  • determine whether the event is actionable by the autofixer at this point in time (for example, attempt an autofix only after N minutes of an ongoing event)
  • perform a specified action
  • mark the event as "autofix has been applied" in case the autofix action fails to resolve the event (to provide a means of escalation to a higher-intelligence entity, like a human)

Or make "autofix has been applied" as a counter to support situations where we want to apply autofix up to N times and after that give up.

Let's build plumbing to teach the machines how to fix problems themselves!
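A rough sketch of the actionability check described above. Every name and threshold here is hypothetical (there is no `autofix_attempts` field in sensu events today; it stands in for the "applied" counter idea):

```ruby
# Decide whether the autofixer should act on an event right now.
# min_age: only autofix after the event has been failing this long (seconds).
# max_attempts: give up after this many autofix attempts (escalate to a human).
def autofix_actionable?(event, now:, min_age: 300, max_attempts: 3)
  age = now - event['check']['executed']
  attempts = event.fetch('autofix_attempts', 0)
  age >= min_age && attempts < max_attempts
end
```
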

exponential backoff sometimes crashes

I think this might be happening in a log(0) event:

severities":["ok","warning","critical","unknown"],"name":"jira"},"output":"/etc/sensu/handlers/base.rb:173:in `log': Numerical result out of range - log (Errno::ERANGE)\n"}

I think this is the corresponding event data:

{"timestamp":"2014-09-04T00:37:24.241628-0700","level":"info","message":"handling event","event":{"client":{"name":"REDACTED","address":
"REDACTED","subscriptions":[REDACTED"],"safe_mode":true,"keepalive":{"handlers":["default"],"team":"operations","page":true,"realert
_every":"-1","runbook":"REDACTED},"timestamp":1409816239},"check":{"handlers":["default"],"command":" /usr/lib/nagios/plugins/check_http -H l
ocalhost -p 5052","subscribers":[],"standalone":true,"interval":60,"aggregate":false,"handle":true,"alert_after":300,"realert_every":"-1","runbook":"REDACTED","annotation":REDACTED","sla":"No SLA defined.","dependencies":[],"team":"REDACTED","irc_channels":"REDACTED","notification_email":"undef","ticket":false,"project":false,"page":true,"tip":false,"name":"check_marathon","i
ssued":1409816234,"executed":1409816234,"output":"CRITICAL - Socket timeout after 10 seconds\n","status":2,"duration":10.004,"history":["2","2","0","0","0"
,"0","0","0","2","2","0","0","0","0","0","0","2","2","2","2","2"]},"occurrences":5,"action":"create"},"handler":{"command":"/etc/sensu/handlers/jira.rb","t
ype":"pipe","severities":["ok","warning","critical","unknown"],"name":"jira"}}

I think I need to detect this case and.. do something?
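For what it's worth, `Errno::ERANGE` from `log` is what Ruby 1.8's `Math.log(0)` raises, so one way to "do something" is to guard the argument before taking the log. A sketch (the actual backoff formula in base.rb may differ):

```ruby
# Guarded log: never pass a value below 1 to Math.log, so the exponential
# backoff computation cannot blow up when its argument works out to zero.
def safe_log(x)
  return 0.0 if x < 1
  Math.log(x)
end
```
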

aws_prune should only consider "running" instances to be available

We currently only consider "non terminated" instances when considering which ec2 instances to respect when activating handlers.

This causes alerts in the time between a server being terminated (in the shutting-down state) and when the server is actually terminated.

But I think we can do better by simply ignoring servers that are shutting-down or terminated, and then they will be pruned asap.
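The state check could then be a one-liner. A sketch (the helper name is mine, not the script's):

```ruby
# Per the proposal: only a 'running' instance keeps its alerts; instances in
# shutting-down, terminated, or any other state get pruned as soon as seen.
def instance_available?(ec2_state)
  ec2_state == 'running'
end
```
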

Handlers are not CPU efficient

Due to the design, all handlers spawn for every event, each filters itself, each queries the api for the same stashes, etc.

@bobtfish We talked about doing this as an extension. This sounds good, but it's pretty scary.

Could we do some intermediate thing where the base handler is activated, does the filtering, and then decides which additional handlers to spawn depending on the event data? Pretty much just making one mega-handler?

Ability to set :context_path for jira handler

In the jira.rb file there is a field "context_path"
:context_path => '',

This is hard-coded to an empty string in the .rb file, but my jira setup needs it set to /jira. It would be great if this option were read from the sensu handler config as well.

Can you add this option?
Thank you for your support!
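A sketch of the requested change, assuming the handler settings hash is available the way the other jira options are (the `'context_path'` key name is my guess):

```ruby
# Read context_path from the handler settings, falling back to the
# current hard-coded default of '' when the setting is absent.
def jira_context_path(handler_settings)
  handler_settings.fetch('context_path', '')
end
```
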

When creating a jira ticket, assign it to the on-call person in pagerduty

My teams have always had the problem of ignoring sensu created JIRA tickets. I believe that this happens because of the bystander effect due to the tickets being created without an assignee.

It would be great if we could assign JIRA tickets to the on-call person in pagerduty who would be notified if the check was paging.

Remove 'nail' references

After removing habitat lines, this looks like it is just the nodebot path.
We should be able to just use the path to find it.

handler not triggering resolve action for an event transitioned from crit -> warn -> ok

TL;DR

The handler filters out (suppresses) the resolve event when a check transitions
crit -> warn -> ok (in some cases, see below),

but it correctly forwards the resolve event when the transition is
crit -> ok

While transitioning from action = create (crit) to action = create (warn), sensu resets occurrences to 1.
https://github.com/sensu/sensu/blob/bb91fea7797d2402349a3e86b7cd3f43b78621c8/lib/sensu/server/process.rb#L549

While transitioning from action = create (crit) to action = resolve (ok), sensu retains the occurrences from create.
https://github.com/sensu/sensu/blob/bb91fea7797d2402349a3e86b7cd3f43b78621c8/lib/sensu/server/process.rb#L536

given a scenario

interval: 10
alert_after: 40
initial_failing_occurrence = 4 (from https://github.com/Yelp/sensu_handlers/blob/master/files/base.rb#L231)

simulate trigger incident

occurrences = 1
action = create
status = 2
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger

occurrences = 2
action = create
status = 2
number_of_failed_attempts = 2 - 4 = -2 < 1 -> do not trigger

occurrences = 3
action = create
status = 2
number_of_failed_attempts = 3 - 4 = -1 < 1 -> do not trigger

occurrences = 4
action = create
status = 2
number_of_failed_attempts = 4 - 4 = 0 < 1 -> do not trigger

occurrences = 5
action = create
status = 2
number_of_failed_attempts = 5 - 4 = 1 (!<1) -> trigger (create PD !!)

#crit -> warn
#occurrences set to 1 after status = 1
https://github.com/sensu/sensu/blob/bb91fea7797d2402349a3e86b7cd3f43b78621c8/lib/sensu/server/process.rb#L549

occurrences = 1
action = create
status = 1
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger

simulate resolve incident
#immediately warn -> ok
#occurrences retained from warn after status = 0
occurrences = 1
action = resolve
status = 0
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger

After this, the event is deleted, and the handler never forwards the resolve to the PD handler (to resolve the incident), resulting in orphaned PD incidents.

whereas (simulate where occurrences is not yet reset to 1)

simulate trigger incident

occurrences = 1
action = create
status = 2
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger

occurrences = 2
action = create
status = 2
number_of_failed_attempts = 2 - 4 = -2 < 1 -> do not trigger

occurrences = 3
action = create
status = 2
number_of_failed_attempts = 3 - 4 = -1 < 1 -> do not trigger

occurrences = 4
action = create
status = 2
number_of_failed_attempts = 4 - 4 = 0 < 1 -> do not trigger

occurrences = 5
action = create
status = 2
number_of_failed_attempts = 5 - 4 = 1 (!<1) -> trigger (create PD !!)

occurrences = 6
action = create
status = 2
number_of_failed_attempts = 6 - 4 = 2 (!<1) -> do not re-trigger until the next alert_after …and so on.

simulate resolve incident
#crit -> ok
#occurrences retained from crit after status = 0
occurrences = 6
action = resolve
status = 0
number_of_failed_attempts = 6 - 4 = 1 (!<1) -> trigger (resolve PD !!)

I'm not sure I explained this clearly; it is mostly code references.

But I guess this is the reason we have had cases where PD incidents were not cleared from PagerDuty even though the underlying events had resolved.

Fix:

In https://github.com/Yelp/sensu_handlers/blob/master/files/base.rb#L231, short-circuit filter_repeated when action is resolve.
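A sketch of that short-circuit (simplified from base.rb; here `filter_repeated?` returns true when the event should be suppressed):

```ruby
# Resolve events bypass the occurrence math entirely, so a crit -> warn -> ok
# transition (which resets occurrences to 1) still reaches the PD handler.
def filter_repeated?(event, initial_failing_occurrence)
  return false if event['action'] == 'resolve'
  number_of_failed_attempts = event['occurrences'] - initial_failing_occurrence
  number_of_failed_attempts < 1 # true => suppress (do not trigger)
end
```
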

Document the specifics of each handler

While there is "code", it would be nice if you could understand the behavior of the handlers (jira specifically) by reading documentation.

This issue is to write more documentation on each handler and how it works.

Develop watchdog handler

The sensu watchdog requires handler support to detect the presence of a watchdog_timer key; if the key is present, the handler should update or add a stash recording the last event. The key is an int representing seconds. The deployed stash should probably be something like watchdog/$fqdn/$check_name, to be similar to silences.

This is all it needs to do. A separate process is in charge of detecting stale watchdog checks and spawning new events based on them.

Acceptance criteria: a watchdog.rb file in our default handler array that does this, and stashes showing up after an event that uses the watchdog_timer key.

Use the send-test-sensu-alert script to aid in troubleshooting. Use hiera to selectively deploy the handler in production.

There should be tests that go with it. I want to see tests that show that:

  • when an event comes in with no watchdog_timer key, nothing happens
  • when an event comes in with a watchdog_timer => nil nothing happens
  • when an event comes in with a watchdog_timer => 60 expect it to receive create_stash or something like that.
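A sketch of that contract. `create_stash` here is a placeholder standing in for the real stash API call, and the event field names mirror the event JSON shown elsewhere in these issues:

```ruby
# Placeholder for a POST to the Sensu API's /stashes endpoint.
def create_stash(path, payload)
  { path => payload }
end

# Record the last event for checks that carry a watchdog_timer (seconds).
def handle_watchdog(event)
  timer = event['check']['watchdog_timer']
  return nil if timer.nil? # no key, or an explicit nil: do nothing
  path = "watchdog/#{event['client']['name']}/#{event['check']['name']}"
  create_stash(path, 'timer' => timer, 'last_event' => event['check']['executed'])
end
```
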

pagerduty handler can sometimes emit too long of a description, and fail to page, which is bad

We should truncate like we do in the irc handler. PD requires the description to be <= 1024 chars.

errors: Description is too long (maximum is 1024 characters); message: Event object is invalid; status: invalid event

The other wtf is that the error is not printed because of the elegance of our ruby.

Please also make the error reporting easier to discover by simply printing the output of the result if it didn't work:

    result = Redphone::Pagerduty.trigger_incident(
      :service_key  => api_key,
      :incident_key => incident_key,
      :description  => description,
      :details      => full_description_hash
    )
    puts result if result['status'] != 'success'
    result['status'] == 'success'
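And for the truncation itself, a sketch mirroring the irc handler's approach (the constant and helper names are mine):

```ruby
# PagerDuty rejects descriptions over 1024 characters, so clamp the
# description before calling trigger_incident.
PD_DESCRIPTION_LIMIT = 1024

def truncate_description(description, limit = PD_DESCRIPTION_LIMIT)
  description.length > limit ? description[0, limit] : description
end
```
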

Make aws-prune run "out of band" and not as a handler

Running the aws-prune script as an event handler has two downsides:

  • It is wasteful cpu-wise
  • It means there is always a warning event while a server is being shut down

Running aws-prune as a cron job instead could make this more efficient and prune terminated instances faster, without causing spurious alerts.

Support related alerts functionality

So something that I've wanted for a long time is the ability for symptom alerting to include causal alert information. After chatting a bit with @solarkennedy we think that the right place to put this functionality is here in the sensu_handlers.

The idea is that the pagerduty/jira handler can query the sensu api for "related" events, probably by tag or host, etc. Then it would include this contextual information in the call-to-action alert.

Thoughts? Is this the right place to do it, is this a crazy idea?
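A sketch of the selection step, assuming the event list has already been fetched from the Sensu API's /events endpoint (the "related by client" heuristic is just one of the options mentioned above):

```ruby
# Pick out other active events on the same client, to attach as context
# to the paging/ticketing alert for exclude_check.
def related_events(all_events, client_name, exclude_check)
  all_events.select do |e|
    e['client']['name'] == client_name && e['check']['name'] != exclude_check
  end
end
```
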

remove habitat stuff

The habitat stuff is too specific and doesn't make sense for other consumers.

This should be generalized in some way, probably passed in as a setting that the handler can read.

Thank you. :) Team based config

I don't have another place to communicate with you guys. I wanted to say thank you for your handler model and the team-based configuration. Because of the team-based configuration, we were able to transition from the hipchat handler to the slack handler by just updating a team's channel information, rather than having to find every check defined for those teams and modify them all.

Thought you might like to know that that bit of forward-looking design paid off for us in this transition.

Thank you.
