
sensu_handlers's People

Contributors

ammaraskar, analogue, asottile, bobtfish, davent, fessyfoo, fhats, giuliano108, hashbrowncipher, jaxxstorm, jolynch, jvperrin, liamjbennett, mesozoic, mks-m, nhandler, pauloconnor, solarkennedy, somic, stigkj, vulpine


sensu_handlers's Issues

Licensing

Thanks for the awesome work done here, especially the base handler. There are many parts of it I'd like to use in my own base handler, along with much of my own functionality.

In the spirit of sharing, I'd like to open source my work under the MIT license, but no license is specified here, so I'm not sure if I can.

Can you clarify?

Replace fog dependency with aws-sdk

Fog is a gigantic gem (around 6,000 files), and it is directly referenced in only about three or four lines.

Given that we already include aws-sdk in puppet-omnibus, and that aws-sdk is a far more natural dependency, let's migrate to it? I'm happy to volunteer.
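A minimal sketch of what the migrated call might look like. The helper name `running_instance_ids` is mine, and the `describe_instances`/`filters` shape assumes the aws-sdk v2 client interface; the client is duck-typed here so the logic stands on its own:

```ruby
# Sketch only: in the handler, ec2 would be an Aws::EC2::Client (aws-sdk v2).
# Returns the IDs of all instances currently in the 'running' state.
def running_instance_ids(ec2)
  resp = ec2.describe_instances(
    filters: [{ name: 'instance-state-name', values: ['running'] }]
  )
  resp.reservations.flat_map { |r| r.instances.map(&:instance_id) }
end
```
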

Generate an alert if the API query fails

Hi guys,

This is more of a discussion than anything, just wondering if maybe you have any thoughts.

I have a monkey-patched version of your JIRA handler, but in our org's infinite wisdom, changes to transitions, required custom fields, or the like will often mean the handler fails. It's difficult to notice this unless you're actively catching the exceptions and checking events against open jira tickets.

Can you think of a way whereby a sensu alert (with another handler) is generated without having to pull in a bunch of existing gems?

Obviously we can rescue the exception and go from there; just wondering if you have any novel ideas.

filter alerts with "Execution timed out" in pagerduty and jira handlers

I would like to open this up for discussion.

If a check is taking longer to run than expected, it often would exit 2 (critical) with output of "Execution timed out".

This comes from sensu-spawn gem - https://github.com/sensu/sensu-spawn/blob/master/lib/sensu/spawn.rb#L163

What if we filter these out of the pagerduty and jira handlers? After all, when we get this, we can't be certain it's the check that failed; in fact it's almost always not the actual check, but a bad or hung ec2 instance, etc.

A positive outcome is that we would cut down on (frequently) unactionable tickets and pages. Also, if we ever decide to do more auto-remediation on these, we would be more confident auto-remediation is attempted when it's needed, since running auto-remediation in response to "Execution timed out" (essentially an unknown exit code from a check) may not always be desirable.

A negative outcome would be we would lose implicit info about some problems that oncall often derives from seeing "Execution timed out".
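If we went this route, the filter itself could be tiny. A sketch (`execution_timed_out?` is a hypothetical name, not something in the handlers today):

```ruby
# Sketch: suppress timeout events in the pagerduty/jira handlers. The
# "Execution timed out" output string comes from sensu-spawn, as linked above.
def execution_timed_out?(event)
  event['check']['output'].to_s.start_with?('Execution timed out')
end
```
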

Discuss.

@solarkennedy @bobtfish

Develop autofixer handler

Autofixer could be a system that receives events from sensu and attempts to fix them if it's told how to do it. This handler will provide the plumbing necessary to make it happen.

A potential sequence of steps:

  • receive an event from sensu
  • determine whether the event is actionable by the autofixer at this point in time (for example, attempt an autofix only after N minutes of an ongoing event)
  • perform a specified action
  • mark the event as "autofix has been applied" in case the autofix action fails to resolve the event (to provide a means of escalation to a higher-intelligence entity, like a human)

Or make "autofix has been applied" as a counter to support situations where we want to apply autofix up to N times and after that give up.

Let's build plumbing to teach the machines how to fix problems themselves!
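A rough sketch of the actionability check described above. Every name and threshold here is hypothetical (there is no `autofix_attempts` field in sensu events today; it stands in for the "applied" counter idea):

```ruby
# Decide whether the autofixer should act on an event right now.
# min_age: only autofix after the event has been failing this long (seconds).
# max_attempts: give up after this many autofix attempts (escalate to a human).
def autofix_actionable?(event, now:, min_age: 300, max_attempts: 3)
  age = now - event['check']['executed']
  attempts = event.fetch('autofix_attempts', 0)
  age >= min_age && attempts < max_attempts
end
```
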

exponential backoff sometimes crashes

I think this might be happening in a log(0) event:

severities":["ok","warning","critical","unknown"],"name":"jira"},"output":"/etc/sensu/handlers/base.rb:173:in `log': Numerical result out of range - log (Errno::ERANGE)\n"}

I think this is the corresponding event data:

{"timestamp":"2014-09-04T00:37:24.241628-0700","level":"info","message":"handling event","event":{"client":{"name":"REDACTED","address":
"REDACTED","subscriptions":[REDACTED"],"safe_mode":true,"keepalive":{"handlers":["default"],"team":"operations","page":true,"realert
_every":"-1","runbook":"REDACTED},"timestamp":1409816239},"check":{"handlers":["default"],"command":" /usr/lib/nagios/plugins/check_http -H l
ocalhost -p 5052","subscribers":[],"standalone":true,"interval":60,"aggregate":false,"handle":true,"alert_after":300,"realert_every":"-1","runbook":"REDACTED","annotation":REDACTED","sla":"No SLA defined.","dependencies":[],"team":"REDACTED","irc_channels":"REDACTED","notification_email":"undef","ticket":false,"project":false,"page":true,"tip":false,"name":"check_marathon","i
ssued":1409816234,"executed":1409816234,"output":"CRITICAL - Socket timeout after 10 seconds\n","status":2,"duration":10.004,"history":["2","2","0","0","0"
,"0","0","0","2","2","0","0","0","0","0","0","2","2","2","2","2"]},"occurrences":5,"action":"create"},"handler":{"command":"/etc/sensu/handlers/jira.rb","t
ype":"pipe","severities":["ok","warning","critical","unknown"],"name":"jira"}}

I think I need to detect this case and.. do something?
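For what it's worth, `Errno::ERANGE` from `log` is what Ruby 1.8's `Math.log(0)` raises, so one way to "do something" is to guard the argument before taking the log. A sketch (the actual backoff formula in base.rb may differ):

```ruby
# Guarded log: never pass a value below 1 to Math.log, so the exponential
# backoff computation cannot blow up when its argument works out to zero.
def safe_log(x)
  return 0.0 if x < 1
  Math.log(x)
end
```
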

aws_prune should only consider "running" instances to be available

We currently only consider "non terminated" instances when considering which ec2 instances to respect when activating handlers.

This causes alerts in the time between a server being terminated (in the shutting-down state) and when the server is actually terminated.

But I think we can do better by simply ignoring servers that are shutting-down or terminated, and then they will be pruned asap.
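The state check could then be a one-liner. A sketch (the helper name is mine, not the script's):

```ruby
# Per the proposal: only a 'running' instance keeps its alerts; instances in
# shutting-down, terminated, or any other state get pruned as soon as seen.
def instance_available?(ec2_state)
  ec2_state == 'running'
end
```
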

Handlers are not CPU efficient

Due to the design, all handlers spawn for every event, each filters itself, each queries the api for the same stashes, etc.

@bobtfish We talked about doing this as an extension. This sounds good, but it's pretty scary.

Could we do some intermediate thing where the base handler is activated, does the filtering, and then decides which additional handlers to spawn depending on the event data? Pretty much just making one mega-handler?

Ability to set :context_path for jira handler

In the jira.rb file there is a field "context_path"
:context_path => '',

This is hard-coded to an empty string in the .rb file, but my jira setup needs it set to /jira. It would be great if this option were read from the sensu handler config as well.

Can you add this option?
Thank you for your support!
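A sketch of the requested change, assuming the handler settings hash is available the way the other jira options are (the `'context_path'` key name is my guess):

```ruby
# Read context_path from the handler settings, falling back to the
# current hard-coded default of '' when the setting is absent.
def jira_context_path(handler_settings)
  handler_settings.fetch('context_path', '')
end
```
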

When creating a jira ticket, assign it to the on-call person in pagerduty

My teams have always had the problem of ignoring sensu created JIRA tickets. I believe that this happens because of the bystander effect due to the tickets being created without an assignee.

It would be great if we could assign JIRA tickets to the on-call person in pagerduty who would be notified if the check was paging.

Remove 'nail' references

After removing habitat lines, this looks like it is just the nodebot path.
We should be able to just use the path to find it.

handler not triggering resolve action for an event transitioned from crit -> warn -> ok

TL;DR

The handler filters out (suppresses) the resolve event when a check transitions
crit -> warn -> ok (in some cases, see below),

but it correctly forwards the resolve event when the transition is
crit -> ok

While transitioning from action = create (crit) to action = create (warn), sensu resets occurrences to 1.
https://github.com/sensu/sensu/blob/bb91fea7797d2402349a3e86b7cd3f43b78621c8/lib/sensu/server/process.rb#L549

While transitioning from action = create (crit) to action = resolve (ok), sensu retains the occurrences from create.
https://github.com/sensu/sensu/blob/bb91fea7797d2402349a3e86b7cd3f43b78621c8/lib/sensu/server/process.rb#L536

given a scenario

interval: 10
alert_after: 40
initial_failing_occurrence = 4 (from https://github.com/Yelp/sensu_handlers/blob/master/files/base.rb#L231)

simulate trigger incident

occurrences = 1
action = create
status = 2
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger

occurrences = 2
action = create
status = 2
number_of_failed_attempts = 2 - 4 = -2 < 1 -> do not trigger

occurrences = 3
action = create
status = 2
number_of_failed_attempts = 3 - 4 = -1 < 1 -> do not trigger

occurrences = 4
action = create
status = 2
number_of_failed_attempts = 4 - 4 = 0 < 1 -> do not trigger

occurrences = 5
action = create
status = 2
number_of_failed_attempts = 5 - 4 = 1 (!<1) -> trigger (create PD !!)

#crit -> warn
#occurrences set to 1 after status = 1
https://github.com/sensu/sensu/blob/bb91fea7797d2402349a3e86b7cd3f43b78621c8/lib/sensu/server/process.rb#L549

occurrences = 1
action = create
status = 1
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger

simulate resolve incident
#immediately warn -> ok
#occurrences retained from warn after status = 0
occurrences = 1
action = resolve
status = 0
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger

After this, the event is deleted, and the handler never forwards the resolve to the PD handler (to resolve the incident), resulting in orphaned PD incidents.

whereas (simulate where occurrences is not yet reset to 1)

simulate trigger incident

occurrences = 1
action = create
status = 2
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger

occurrences = 2
action = create
status = 2
number_of_failed_attempts = 2 - 4 = -2 < 1 -> do not trigger

occurrences = 3
action = create
status = 2
number_of_failed_attempts = 3 - 4 = -1 < 1 -> do not trigger

occurrences = 4
action = create
status = 2
number_of_failed_attempts = 4 - 4 = 0 < 1 -> do not trigger

occurrences = 5
action = create
status = 2
number_of_failed_attempts = 5 - 4 = 1 (!<1) -> trigger (create PD !!)

occurrences = 6
action = create
status = 2
number_of_failed_attempts = 6 - 4 = 2 (!<1) -> do not re-trigger until the next alert_after …and so on.

simulate resolve incident
#crit -> ok
#occurrences retained from crit after status = 0
occurrences = 6
action = resolve
status = 0
number_of_failed_attempts = 6 - 4 = 1 (!<1) -> trigger (resolve PD !!)

I'm not sure I explained this clearly; it is mostly code references.

But I guess this is the reason we have had cases where PD incidents were not cleared from PagerDuty even though the underlying events had resolved.

Fix:

In https://github.com/Yelp/sensu_handlers/blob/master/files/base.rb#L231, short-circuit filter_repeated when action is resolve.
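A sketch of that short-circuit (simplified from base.rb; here `filter_repeated?` returns true when the event should be suppressed):

```ruby
# Resolve events bypass the occurrence math entirely, so a crit -> warn -> ok
# transition (which resets occurrences to 1) still reaches the PD handler.
def filter_repeated?(event, initial_failing_occurrence)
  return false if event['action'] == 'resolve'
  number_of_failed_attempts = event['occurrences'] - initial_failing_occurrence
  number_of_failed_attempts < 1 # true => suppress (do not trigger)
end
```
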

Document the specifics of each handler

While there is "code", it would be nice if you could understand the behavior of the handlers (jira specifically) by reading documentation.

This issue is to write more documentation on each handler and how it works.

Develop watchdog handler

The sensu watchdog requires handler support to detect the presence of a watchdog_timer key; if the key is present, the handler should update or add a stash recording the last event. The key is an int representing seconds. The deployed stash should probably be something like watchdog/$fqdn/$check_name, to be similar to silences.

This is all it needs to do. A separate process is in charge of detecting stale watchdog checks and spawning new events based on them.

Acceptance criteria: a watchdog.rb file in our default handler array that does this, and stashes showing up after an event that uses the watchdog_timer key.

Use the send-test-sensu-alert script to aid in troubleshooting. Use hiera to selectively deploy the handler in production.

There should be tests that go with it. I want to see tests that show that:

  • when an event comes in with no watchdog_timer key, nothing happens
  • when an event comes in with a watchdog_timer => nil nothing happens
  • when an event comes in with a watchdog_timer => 60 expect it to receive create_stash or something like that.
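A sketch of that contract. `create_stash` here is a placeholder standing in for the real stash API call, and the event field names mirror the event JSON shown elsewhere in these issues:

```ruby
# Placeholder for a POST to the Sensu API's /stashes endpoint.
def create_stash(path, payload)
  { path => payload }
end

# Record the last event for checks that carry a watchdog_timer (seconds).
def handle_watchdog(event)
  timer = event['check']['watchdog_timer']
  return nil if timer.nil? # no key, or an explicit nil: do nothing
  path = "watchdog/#{event['client']['name']}/#{event['check']['name']}"
  create_stash(path, 'timer' => timer, 'last_event' => event['check']['executed'])
end
```
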

pagerduty handler can sometimes emit too long of a description, and fail to page, which is bad

We should truncate like we do in the irc handler. PD requires the description to be <= 1024 chars.

errors: Description is too long (maximum is 1024 characters); message: Event object is invalid; status: invalid event

The other wtf is that the error is not printed because of the elegance of our ruby.

Please also make the error reporting easier to discover by simply printing the output of the result if it didn't work:

    result = Redphone::Pagerduty.trigger_incident(
      :service_key  => api_key,
      :incident_key => incident_key,
      :description  => description,
      :details      => full_description_hash
    )
    puts result if result['status'] != 'success'
    result['status'] == 'success'
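And for the truncation itself, a sketch mirroring the irc handler's approach (the constant and helper names are mine):

```ruby
# PagerDuty rejects descriptions over 1024 characters, so clamp the
# description before calling trigger_incident.
PD_DESCRIPTION_LIMIT = 1024

def truncate_description(description, limit = PD_DESCRIPTION_LIMIT)
  description.length > limit ? description[0, limit] : description
end
```
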

Make aws-prune run "out of band" and not as a handler

Running the aws-prune script as an event handler has two downsides:

  • It is wasteful cpu-wise
  • It means there is always a warning event while a server is being shut down

Running aws-prune as a cron job instead could make this more efficient and prune terminated instances faster, without causing spurious alerts.

Support related alerts functionality

So something that I've wanted for a long time is the ability for symptom alerting to include causal alert information. After chatting a bit with @solarkennedy we think that the right place to put this functionality is here in the sensu_handlers.

The idea is that the pagerduty/jira handler can query the sensu api for "related" events, probably by tag or host, etc. Then it would include this contextual information in the call-to-action alert.

Thoughts? Is this the right place to do it, is this a crazy idea?
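A sketch of the selection step, assuming the event list has already been fetched from the Sensu API's /events endpoint (the "related by client" heuristic is just one of the options mentioned above):

```ruby
# Pick out other active events on the same client, to attach as context
# to the paging/ticketing alert for exclude_check.
def related_events(all_events, client_name, exclude_check)
  all_events.select do |e|
    e['client']['name'] == client_name && e['check']['name'] != exclude_check
  end
end
```
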

remove habitat stuff

The habitat stuff is too specific and doesn't make sense for other consumers.

This should be generalized in some way, probably passed in as a setting that the handler can read.

Thank you. :) Team based config

I don't have another place to communicate with you guys. I wanted to say thank you for your handler model and the team-based configuration. Because of the team-based configuration, we were able to transition from the hipchat handler to the slack handler by just updating a team's channel information, rather than having to find every check defined for those teams and modify them all.

Thought you might like to know that that bit of forward-looking design paid off for us in this transition.

Thank you.
