
Complainer

Complainer's job is to send notifications to different services when tasks fail on a Mesos cluster. While your system should be resilient to failures of individual tasks, it's nice to know when things fail and why.

Supported log upload services:

  • No-op - keeps URLs to Mesos slave sandbox.
  • S3 - both AWS S3 and on-premise S3-compatible API.

Supported reporting services:

  • Sentry - great crash reporting software.
  • Hipchat - a not-so-great communication platform.
  • Slack - another communication platform.
  • File - regular file stream output, including stdout/stderr.

Quick start

Start sending all failures to Sentry:

docker run -it --rm cloudflare/complainer \
  -masters=http://mesos.master:5050 \
  -uploader=noop \
  -reporters=sentry \
  -sentry.dsn=https://foo:bar@sentry.example.com/8

Run this on Mesos itself!

Sentry screenshot

Reporting configuration

Complainer itself is configured with the following command line flags:

  • name - Complainer instance name (default is default).
  • default - Whether to use default instance for each reporter implicitly.
  • masters - Mesos master URL list (ex: http://host:port,http://host:port).
  • listen - Listen address for HTTP (ex: 127.0.0.1:8888).

The same settings can also be applied via environment variables:

  • COMPLAINER_NAME - Complainer instance name (default is default).
  • COMPLAINER_DEFAULT - Whether to use default instance for each reporter implicitly.
  • COMPLAINER_MASTERS - Mesos master URL list (ex: http://host:port,http://host:port).
  • COMPLAINER_LISTEN - Listen address for HTTP (ex: 127.0.0.1:8888).
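For example, a rough equivalent of the quick start invocation, configured mostly via environment variables (the uploader and reporter variables are described in the sections below; the DSN is a placeholder):

COMPLAINER_MASTERS=http://mesos.master:5050 \
COMPLAINER_UPLOADER=noop \
COMPLAINER_REPORTERS=sentry \
complainer -sentry.dsn=https://foo:bar@sentry.example.com/8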

Filtering based on the failure's framework

If you have multiple Marathon instances running against a single Mesos cluster and want to segregate which failures go where, the following options are of interest. Each option can be specified multiple times.

  • framework-whitelist - This is a regex option; if given, the failure's framework must match at least one whitelist entry. If no whitelist is specified, it's treated as if '.*' had been passed: all failures are whitelisted as long as they don't match a blacklist.
  • framework-blacklist - This is a regex option; if given, any failure whose framework matches is ignored.

Note that the order of evaluation is such that blacklists are applied first, then whitelists.
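For example (the framework names here are hypothetical), to report only failures from frameworks whose names start with marathon, except for a staging framework:

complainer \
  -masters=http://mesos.master:5050 \
  -uploader=noop \
  -reporters=sentry \
  -sentry.dsn=https://foo:bar@sentry.example.com/8 \
  -framework-whitelist='marathon.*' \
  -framework-blacklist='marathon-staging.*'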

HTTP interface

Complainer provides an HTTP interface. You have to enable it with the -listen command line flag or the COMPLAINER_LISTEN env variable.

This interface is used for the following:

  • Health checks
  • Version endpoint
  • pprof endpoint

Health checks

The /health endpoint reports 200 OK when things are operating mostly normally and 500 Internal Server Error when Complainer cannot talk to Mesos.

We don't check for other issues (uploader and reporter failures) because they are not guaranteed to be happening continuously, so a health check would not reliably catch them or fix them by restarting Complainer.
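A quick manual check, assuming Complainer was started with -listen=127.0.0.1:8888:

curl -i http://127.0.0.1:8888/health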

Version endpoint

The /version endpoint reports 200 OK and outputs the current version of the application:

complainer (default) v1.7.0

pprof endpoint

The /debug/pprof endpoint exposes the regular net/http/pprof interface.
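For example, assuming -listen=127.0.0.1:8888, a CPU profile can be collected with the standard Go tooling:

go tool pprof http://127.0.0.1:8888/debug/pprof/profile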

Log upload services

The log upload service is specified by the uploader command line flag, or alternatively by the COMPLAINER_UPLOADER env var. Only one uploader can be specified per Complainer instance.

no-op

Uploader name: noop

No-op uploader just echoes Mesos slave sandbox URLs.

S3 AWS

Uploader name: s3aws.

This uploader uses the official AWS SDK and should be used if you run on AWS.

Stdout and stderr logs get uploaded to S3 and signed URLs are provided to reporters. Logs are uploaded into the following directory structure by default:

  • ${YYYY-MM-DD}/complainer/${task_name}/${YYYY-MM-DDTHH:mm:ssZ}-${task_id}/{stdout,stderr}

Command line flags:

  • s3aws.access_key - S3 access key.
  • s3aws.secret_key - S3 secret key.
  • s3aws.region - S3 region.
  • s3aws.bucket - S3 bucket name.
  • s3aws.prefix - S3 prefix template (Failure struct is available).
  • s3aws.timeout - Timeout for signed S3 URLs (ex: 72h).

You can set the value of any command line flag via an environment variable. Example:

  • Flag s3aws.access_key becomes env variable S3_ACCESS_KEY

Flags override env variables if both are supplied.

The minimum AWS policy for Complainer is s3:PutObject.
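A policy along these lines, taken from the project's issue tracker below (the bucket name is a placeholder), is sufficient:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::YOUR-BUCKET-NAME/*"
            ]
        }
    ]
}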

S3 Compatible APIs

Uploader name: s3goamz.

This uploader uses the goamz package and supports S3-compatible APIs that use v2-style signatures. This includes Ceph Rados Gateway.

Stdout and stderr logs get uploaded to S3 and signed URLs are provided to reporters. Logs are uploaded into the following directory structure by default:

  • ${YYYY-MM-DD}/complainer/${task_name}/${YYYY-MM-DDTHH:mm:ssZ}-${task_id}/{stdout,stderr}

Command line flags:

  • s3goamz.access_key - S3 access key.
  • s3goamz.secret_key - S3 secret key.
  • s3goamz.endpoint - S3 endpoint (ex: https://complainer.s3.example.com).
  • s3goamz.bucket - S3 bucket name.
  • s3goamz.prefix - S3 prefix template (Failure struct is available).
  • s3goamz.timeout - Timeout for signed S3 URLs (ex: 72h).

You can set the value of any command line flag via an environment variable. Example:

  • Flag s3goamz.access_key becomes env variable S3_ACCESS_KEY

Flags override env variables if both are supplied.
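A sketch of an invocation using this uploader (the endpoint, bucket and credentials are placeholders):

complainer \
  -masters=http://mesos.master:5050 \
  -reporters=sentry \
  -sentry.dsn=https://foo:bar@sentry.example.com/8 \
  -uploader=s3goamz \
  -s3goamz.access_key=ACCESS_KEY \
  -s3goamz.secret_key=SECRET_KEY \
  -s3goamz.endpoint=https://complainer.s3.example.com \
  -s3goamz.bucket=complainer-logs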

Reporting services

Reporting services are specified by the reporters command line flag, or alternatively by the COMPLAINER_REPORTERS env var. Several services can be specified, separated by commas.

Sentry

Command line flags:

  • sentry.dsn - Default Sentry DSN to use for reporting.

Labels:

  • dsn - Sentry DSN to use for reporting.

If a label is unspecified, the command line flag value is used.

Hipchat

Command line flags:

  • hipchat.base_url - Base Hipchat URL, needed for on-premise installations.
  • hipchat.room - Default Hipchat room ID to send notifications to.
  • hipchat.token - Default Hipchat token to authorize requests.
  • hipchat.format - Template to use in messages.

Labels:

  • base_url - Hipchat URL, needed for on-premise installations.
  • room - Hipchat room ID to send notifications to.
  • token - Hipchat token to authorize requests.

If a label is unspecified, the command line flag value is used.

Templates are based on text/template. The following fields are available:

  • failure - Failure struct.
  • stdoutURL - URL of the stdout stream.
  • stderrURL - URL of the stderr stream.

Slack

Command line flags:

  • slack.hook_url - Webhook URL, needed to post something (required).
  • slack.channel - Channel to post into, e.g. #mesos (optional).
  • slack.username - Username to post with, e.g. "Mesos Cluster" (optional).
  • slack.icon_emoji - Icon Emoji to post with, e.g. ":mesos:" (optional).
  • slack.icon_url - Icon URL to post with, e.g. "http://my.com/pic.png" (optional).
  • slack.format - Template to use in messages.

Labels:

  • hook_url - Webhook URL, needed to post something (required).
  • channel - Channel to post into, e.g. #mesos (optional).
  • username - Username to post with, e.g. "Mesos Cluster" (optional).
  • icon_emoji - Icon Emoji to post with, e.g. ":mesos:" (optional).
  • icon_url - Icon URL to post with, e.g. "http://my.com/avatar.png" (optional).

If a label is unspecified, the command line flag value is used.

For more details see Slack API docs.

Templates are based on text/template. The following fields are available:

  • failure - Failure struct.
  • stdoutURL - URL of the stdout stream.
  • stderrURL - URL of the stderr stream.
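For example, a slack.format template along these lines (a variant of the format used in the issue reports below) posts the task name, its status and links to both log streams:

Task {{ .failure.Name }} ({{ .failure.ID }}) died with status {{ .failure.State }} [<{{ .stdoutURL }}|stdout>, <{{ .stderrURL }}|stderr>]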

Jira

Command line flags:

  • jira.url - Default JIRA instance URL (required).
  • jira.username - JIRA user to authenticate as (required).
  • jira.password - JIRA password for the user to authenticate with (required).
  • jira.issue_closed_status - The status of a JIRA issue when it is considered closed.
  • jira.fields - JIRA fields in key:value;... format, separated by ;. This configuration MUST contain Project, Summary and Issue Type.

Example jira.fields:

Project:COMPLAINER;Issue Type:Bug;Summary:Task {{ .failure.Name }} died with status {{ .failure.State }};Description:[stdout|{{ .stdoutURL }}], [stderr|{{ .stderrURL }}], ID={{ .failure.ID }}

Templates are based on text/template. The following fields are available:

  • failure - Failure struct.
  • stdoutURL - URL of the stdout stream.
  • stderrURL - URL of the stderr stream.

File

Command line flags:

  • file.name - File name to output logs.
  • file.format - Template to use in output logs.

Templates are based on text/template. The following fields are available:

  • failure - Failure struct.
  • stdoutURL - URL of the stdout stream.
  • stderrURL - URL of the stderr stream.
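A sketch of a file reporter configuration that appends one line per failure to a log file (the path and template here are illustrative):

complainer \
  -masters=http://mesos.master:5050 \
  -uploader=noop \
  -reporters=file \
  -file.name=/var/log/complainer.log \
  -file.format='Task {{ .failure.Name }} ({{ .failure.ID }}) died: stdout {{ .stdoutURL }}, stderr {{ .stderrURL }}'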

Label configuration

Basics

To support a flexible notification system, Mesos task labels are used. Marathon task labels get copied to Mesos labels, so these are equivalent.

The minimal set of labels needed is an empty set. You can configure default values in Complainer's command line flags and get all notifications with these settings. In practice, you might want to have different reporters for different apps.

The full format of a Complainer label name looks like this:

  • complainer_${name}_${reporter}_instance_${instance}_${key}

Example (dsn set for default Sentry of default Complainer):

  • complainer_default_sentry_instance_default_dsn

This is long and complex, so default parts can be skipped:

  • complainer_sentry_dsn
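In a Marathon app definition this could look like the following (the DSN is a placeholder):

labels:
  complainer_sentry_dsn: https://foo:bar@sentry.example.com/8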

Advanced labels

The reason for having the long label name version is flexibility. Imagine you want to report app failures to the internal Sentry, two internal Hipchat rooms (default and project-specific) and an external Sentry.

The set of labels would look like this:

  • complainer_sentry_dsn: ABC - for internal Sentry.
  • complainer_hipchat_instances: default,myapp - adding instance myapp.
  • complainer_hipchat_instance_myapp_room: 123 - setting room for myapp.
  • complainer_hipchat_instance_myapp_token: XYZ - setting token for myapp.
  • complainer_external_sentry_dsn: FOO - for external Sentry.

Internal and external complainers can have different upload services.

Implicit instances are different, depending on how you run Complainer.

  • -default=true (default) - default instance is implicit.
  • -default=false - no instances are configured implicitly.

The latter is useful for opt-in monitoring, including monitoring of Complainer itself (also known as dogfooding).

Templating

Templates are based on text/template. The following fields are available:

  • failure - Failure struct.
  • stdoutURL - URL of the stdout stream.
  • stderrURL - URL of the stderr stream.

In addition, the config function gives templates access to reporter labels, and the .nl field inserts a newline. For example, the following template for the Slack reporter:

Task {{ .failure.Name }} ({{ .failure.ID }}) died | {{ config "mentions" }}{{ .nl }}

With the label complainer_slack_mentions=@devs, this will be evaluated to:

Task foo.bar (bar.foo.123) died | @devs

Dogfooding

To report errors for complainer itself you need to run two instances:

  • default to monitor all other tasks.
  • dogfood to monitor the default Complainer.

You'll need the following labels for the default instance:

labels:
  complainer_dogfood_sentry_instances: default
  complainer_dogfood_hipchat_instances: default

For the dogfood instance you'll need to:

  • Add -name=dogfood command line flag.
  • Add -default=false command line flag.
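Putting it together, a sketch of the dogfood instance mirroring the quick start invocation (the DSN is a placeholder):

docker run -it --rm cloudflare/complainer \
  -name=dogfood \
  -default=false \
  -masters=http://mesos.master:5050 \
  -uploader=noop \
  -reporters=sentry \
  -sentry.dsn=https://foo:bar@sentry.example.com/8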

Since the dogfood Complainer ignores apps with no configured instances, it will ignore every failure except failures of the default instance.

If the dogfood instance fails, the default instance reports it just like any other task.

If both instances fail at the same time, you get nothing.

Copyright

  • Copyright 2016 CloudFlare

License

MIT


complainer's Issues

Eating your own dog food: Complainer should complain about itself

One idea that comes to mind: if Complainer fails (e.g. due to a node failure), it should complain about itself.

To make this possible we need at least two instances of Complainer in one cluster. They should be scheduled on different nodes (maybe in different racks).
The challenge here is that these instances need to agree on which instance handles which failure, otherwise every failure will be reported twice.

One solution could be a kind of "leader election", or storing state in ZooKeeper (which is already in place for the Mesos cluster).

This is just a rough and fast idea. What do you think about it?

Template-Support for Reporter

Complainer is growing with the number of reporters: Sentry, Hipchat, Slack, File, etc.

Reporting systems that have a flexible message part (like Slack or Hipchat) currently use a hardcoded message. See https://github.com/cloudflare/complainer/blob/master/reporter/slack.go#L73-L74 for Slack and https://github.com/cloudflare/complainer/blob/master/reporter/hipchat.go#L113-L114 for Hipchat.

It would be a nice feature to allow handing in a Go template to modify this message. E.g. in Slack you can mention someone via @username. The use case would be to apply a label per task definition (e.g. label_author) and use this label in the Go template for the Slack message. With this, responsible persons / teams can be mentioned when a task is failing.

Another use case would be a JIRA reporter. Right now I am thinking about adding a JIRA reporter to Complainer (PS: would you accept such a reporter?). A JIRA ticket could benefit from this functionality as well.

At first I just want to get your feedback and thoughts about this (and an answer about the JIRA reporter / its acceptance as well).

Possibility to set templates per reporter via env vars

The template functionality is great. Thank you for this feature.
Sadly, this can only be applied via a command line argument. See https://github.com/cloudflare/complainer/blob/master/reporter/slack.go#L31
It would be great if we could set a template per reporter via an env var as well.

A plain os.Getenv is not enough here, because each reporter has its own default value.
For this we need a small custom function in Complainer like this one: https://github.com/cloudflare/complainer/pull/25/files#diff-405e7bd8f95efdb0e45add9c8a9f8459R63

Would you be open to this PR?

AWS S3: Add minimum policy rules for complainer

I configured Complainer to use S3 (s3aws from #17). Everything is working fine.
What I asked myself along the way:

  • What Policies / Grants does complainer need to work with S3?
  • What are the minimum grants?

I tested it and the only required permission is:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1465396834000",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::YOUR-BUCKET-NAME/*"
            ]
        }
    ]
}

It would make sense to add this to the README / docs for future users.
I didn't open a PR (yet), because #17 is not merged yet and this would create a conflict if I put this into the README as well.

"Slave 10.1.2.125 is unreachable" & "Abnormal executor termination: unknown container" is not reported by complainer

Those two errors are not reported by complainer:

On 15/05/17 07:11, "[email protected]" <[email protected]> wrote:

    '2017-05-15T05:11:50.632Z'. Retries attempted: 0.
    Task id: ct:1494819000000:0:groundhog-HoteDescriptionDispatcher:
    The scheduler provided this message:
    
    Slave 10.1.2.3 is unreachable

and

On 14/05/17 06:56, "[email protected]" <[email protected]> wrote:

    '2017-05-14T04:56:57.846Z'. Retries attempted: 0.
    Task id: ct:1494732600000:0:groundhog-HoteDescriptionDispatcher:
    The scheduler provided this message:
    
    Abnormal executor termination: unknown container

Those two issues were sent via email by Chronos / Mesos.

S3 Uploader: Authorization mechanism ... is not supported. Please use AWS4-HMAC-SHA256.

I run into the following issue:

./complainer -masters "http://192.168.99.100:5050" \
                     -reporters "slack" \
                     -slack.channel "#mesos" \
                     -slack.hook_url "https://hooks.slack.com/services/TOKEN" \
                     -slack.username "Mesos Cluster" \
                     -slack.icon_emoji ":mesos:" \
                     -uploader "s3" \
                     -s3.access_key "KEY" \
                     -s3.bucket "BUCKET" \
                     -s3.endpoint "https://s3.eu-central-1.amazonaws.com/" \
                     -s3.secret_key "SECRET"

2016/06/08 19:05:02 Reporting ChronosTask:my-failing-job (ct:1465405479849:0:my-failing-job:) from 192.168.99.100
2016/06/08 19:05:02 Error reporting failure of ct:1465405479849:0:my-failing-job:: cannot get stdout and stderr urls from uploader: The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.

This seems to be related to goamz/goamz#118

s3aws.go:68 panic: runtime error: invalid memory address or nil pointer dereference [recovered]

After an upgrade to Complainer v1.4 I see the following in stderr:

I0711 14:14:21.902189  5962 exec.cpp:143] Version: 0.28.1
I0711 14:14:21.906971  5976 exec.cpp:217] Executor registered on slave e739267b-454f-4b63-82ba-c9918a4eae07-S1
WARNING: Your kernel does not support memory limit capabilities. Limitation discarded.
2016/07/11 14:14:22 Serving http on :8888
2016/07/11 14:15:04 Reporting ChronosTask:metrics-MetricsStoreBotcaptchaMetrics (ct:1468246502158:0:metrics-MetricsStoreBotcaptchaMetrics:) from 10.1.2.123
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x10 pc=0x5ab6f6]

goroutine 1 [running]:
panic(0x9d8280, 0xc82000a0d0)
    /usr/lib/go/src/runtime/panic.go:481 +0x3e6
text/template.errRecover(0xc82016b4b0)
    /usr/lib/go/src/text/template/exec.go:133 +0xee
panic(0x9d8280, 0xc82000a0d0)
    /usr/lib/go/src/runtime/panic.go:443 +0x4e9
text/template.(*Template).Execute(0x0, 0x7fe5bc4df738, 0xc82012e460, 0x8d47e0, 0xc820329110, 0x0, 0x0)
    /usr/lib/go/src/text/template/exec.go:175 +0x226
github.com/cloudflare/complainer/uploader.(*s3AwsUploader).Upload(0xc82013a9f0, 0xc82042b280, 0x39, 0xc82042b2c0, 0x31, 0xc82018e706, 0xa, 0xc8203b15a0, 0xd, 0x0, ...)
    /go/src/github.com/cloudflare/complainer/uploader/s3aws.go:68 +0x25f
github.com/cloudflare/complainer/monitor.(*Monitor).processFailure(0xc820014d80, 0xc82042b280, 0x39, 0xc82042b2c0, 0x31, 0xc82018e706, 0xa, 0xc8203b15a0, 0xd, 0x0, ...)
    /go/src/github.com/cloudflare/complainer/monitor/monitor.go:160 +0x618
github.com/cloudflare/complainer/monitor.(*Monitor).Run(0xc820014d80, 0x0, 0x0)
    /go/src/github.com/cloudflare/complainer/monitor/monitor.go:101 +0x27b
main.main()
    /go/src/github.com/cloudflare/complainer/cmd/complainer/main.go:59 +0x93e

Marathon Config:

{
  "id": "/complainer",
  "apps": [
    {
      "id": "/complainer/default",
      "cmd": "/go/bin/complainer",
      "cpus": 1,
      "mem": 2048.0,
      "disk": 50,
      "instances": 1,
      "container": {
        "type": "DOCKER",
        "docker": {
          "image": "cloudflare/complainer:1.4",
          "network": "BRIDGE",
          "portMappings": [
            {
              "containerPort": 8888,
              "hostPort": 0,
              "servicePort": 10888,
              "protocol": "tcp"
            }
          ]
        }
      },
      "env": {
          "COMPLAINER_MASTERS": "http://master1:5050,http://master2:5050,http://master3:5050",
          "COMPLAINER_REPORTERS": "slack",
          "COMPLAINER_UPLOADER": "s3aws",
          "COMPLAINER_LISTEN": ":8888",

          "S3_ACCESS_KEY": "ACCESS-KEY",
          "S3_SECRET_KEY": "SECRET-KEY",
          "S3_REGION": "eu-central-1",
          "S3_BUCKET": "BUCKET",

          "SLACK_HOOK_URL": "SLACK-HOOK-URL",
          "SLACK_USERNAME": "Mesos Cluster",
          "SLACK_CHANNEL": "#mesos",
          "SLACK_ICON_EMOJI": ":mesos:",
          "SLACK_FORMAT": "Task {{ .failure.Name }} ({{ .failure.ID }}) died with status {{ .failure.State }} [<{{ .stdoutURL }}|stdout>, <{{ .stderrURL }}|stderr>]{{if config \"mentions\"}} | {{ config \"mentions\" }}{{ .nl }}{{end}}"
      },
      "healthChecks": [
          {
              "protocol": "HTTP",
              "path": "/health",
              "gracePeriodSeconds": 10,
              "intervalSeconds": 60,
              "portIndex": 0,
              "timeoutSeconds": 10,
              "maxConsecutiveFailures": 3
          }
      ]
    }
  ]
}

Docker Images on Dockerhub not up to date

The releases tab already shows version v1.6.1.
The Docker Hub page for complainer shows 1.4 as the latest tagged version.

The latest tag "seems" to be up to date with master, but only the timestamp ("6 months ago") suggests that.

Proposal:

  • Build, tag and upload a Docker image for every GitHub release
  • Document how to release a new version
  • Script / automate the release (maybe using goreleaser)

Dogfooding only works with Reporter, but not with Uploader

Dogfooding complainer is great.
With this feature you are able to monitor your default instance of complainer.

But if I see this correctly, dogfooding works only by dropping mandatory settings from the reporter of the dogfooding instance (e.g. the Slack hook URL or Hipchat room).
So this means that if the default instance fails, the reporter will report URLs pointing to the running Mesos slave.
At the moment it is not possible to upload the failures of the default instance via an uploader, because the uploaders don't implement such a check for mandatory settings. Or am I missing something?

JIRA Reporter

I have been thinking about this for quite some time. I want to add a JIRA reporter.
The implementation to create a JIRA ticket based on a failing Mesos task is not a real challenge and I am able to do it.

But there is one doubt that comes to mind:
Imagine you have a cron job that runs every hour (or every 30 minutes, whatever, the important part is that it runs often). And now it starts failing at 8pm on a Friday.
The whole weekend the job keeps failing and Complainer creates a ticket for every job run.
My idea is to create only one ticket per Mesos job ID / identifier (not per task run).
With this we can do something like this:

  • Job is failing
  • The JIRA reporter checks whether there is already a ticket for this job that is not resolved / closed (whatever the status is called)
  • If there is such a ticket that is still not done, skip the error / reporting
  • If there is a ticket with this task ID (not run ID) and it is resolved / closed, create a new JIRA ticket
  • If there is no ticket at all, create one as well

To know whether a ticket has already been created, we need storage (file, S3, in memory, whatever), or we use a somewhat "tricky" way to mark tickets and use JIRA itself as the storage.
One idea is to assign tags such as complainer and mesos-task-id to the ticket and query by them.
In-memory storage wouldn't be a big deal, but when Complainer gets rescheduled, the data is gone and tickets will be created twice. Maybe this is acceptable?

The reason for this ticket is not that I want you to start the implementation.
The reason is that I want to ask for your thoughts on this functionality, to avoid flooding JIRA with tickets. Maybe you have a better idea here?

Complainer uploads wrong logs if "Retry" feature from Chronos kicks in

The Chronos Framework for Mesos has a "Retry" functionality (docs at Job Configuration):

"retries": Number of retries to attempt if a command returns a non-zero status. Default: 2

Today we had a case where a Chronos Job has this definition:

{
  "schedule":"R/2015-11-12T09:00:00Z/PT30M",
  "scheduleTimeZone":"Europe/Berlin",
  "epsilon":"PT10M",
  "name":"affiliate-pullOtaProdTasData",
  "command":"MY pullOtaProdTasData COMMAND",
  "description":"MY pullOtaProdTasData DESCRIPTION",
  "owner":"MY pullOtaProdTasData OWNER",
  "ownerName":"My pullOtaProdTasData OWNER NAMES",
  "async":false,
  "executor":"",
  "disabled":false,
  "softError":false,
  "cpus":"0.5",
  "mem":"128",
  "disk":"24",
  "highPriority":false,
  "retries": 3
}

The job started and the first run was a failure (ID: ct:1466578800000:0:affiliate-pullOtaProdTasData:).
The second run (the first retry) was a success (ID: ct:1466578800000:0:affiliate-pullOtaProdTasData:).
See the screenshot from the Mesos Master UI:

screen shot 2016-06-22 at 10 54 20

As you can see, they got the same task ID. I don't know why.
Complainer uploaded the logs of ct:1466578800000:0:affiliate-pullOtaProdTasData: with status FINISHED. I had expected Complainer to upload the logs of ct:1466578800000:0:affiliate-pullOtaProdTasData: with status FAILED.

stdout that was uploaded:

Registered executor on 10.1.2.126
Starting task ct:1466578800000:0:affiliate-pullOtaProdTasData:
sh -c 'MY COMMAND'
Forked command at 4103
MY PROGRAM STDOUT
Command exited with status 0 (pid: 4103)

stderr that was uploaded:

I0622 07:00:05.146646  3959 exec.cpp:143] Version: 0.28.1
I0622 07:00:05.638434  3969 exec.cpp:217] Executor registered on slave db1508de-3629-4e4f-b462-5ed6c2a1eb91-S1

Issue on Centos 6.5

Had to amend the code as follows to avoid a "bad file descriptor" error, changing

f, err := os.OpenFile(file, os.O_APPEND|os.O_CREATE, 0666)

to

f, err := os.OpenFile(file, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0666)

(without O_WRONLY the file is opened read-only, so writes fail with "bad file descriptor").

Health endpoint

Marathon offers a nice way to define health checks. A small and easy interface for this is HTTP. As far as I can see, such functionality is not yet implemented in Complainer.

The basic idea is to run a small HTTP server (using the stdlib net/http) that offers a /health endpoint.
Such a health endpoint should check whether all dependencies (reporters, uploaders) are available.
In a possible implementation we could extend the Reporter and Uploader interfaces and add a new method Health() error.

If the error is nil, the reporter or uploader is healthy; if it is non-nil, it is not.
With this, every reporter / uploader is responsible for checking its own health.
If no health endpoint / check is possible, returning nil might be sufficient.

Questions left:

  • What about uploaders / reporters configured by labels / tasks only?

I would love to get your feedback here.
