datadog-firehose-nozzle's Introduction

Summary

The datadog-firehose-nozzle is a Cloud Foundry (CF) component that forwards metrics from the Loggregator Firehose to Datadog.

Configure CloudFoundry UAA for Firehose Nozzle

The Datadog firehose nozzle requires a UAA user who is authorized to access the Loggregator firehose. You can add such a user by editing your Cloud Foundry manifest to include its details under the properties.uaa.clients section. For example, to add a user named datadog-firehose-nozzle:

properties:
  uaa:
    clients:
      datadog-firehose-nozzle:
        access-token-validity: 1209600
        authorized-grant-types: authorization_code,client_credentials,refresh_token
        override: true
        secret: <password>
        scope: openid,oauth.approvals,logs.admin,cloud_controller.admin_read_only
        authorities: oauth.login,logs.admin,cloud_controller.admin_read_only

Running

The Datadog nozzle uses a configuration file to obtain the firehose URL, the Datadog API key, and other configuration parameters. Both endpoints require authentication: the firehose requires a valid username and password, and Datadog requires a valid API key.

Start the firehose nozzle by executing:

go run main.go -config config/datadog-firehose-nozzle.json
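
For reference, here is a minimal sketch of how such a configuration might be modeled and loaded in Go. The field names below are illustrative only, not the nozzle's actual schema; check config/datadog-firehose-nozzle.json and the nozzle's config package for the real keys.

package main

import (
	"encoding/json"
	"log"
	"os"
)

// Config is a hypothetical subset of the nozzle configuration, shown only to
// illustrate the shape of the file; the real nozzle defines its own schema.
type Config struct {
	UAAURL               string `json:"UAAURL"`
	Client               string `json:"Client"`
	ClientSecret         string `json:"ClientSecret"`
	TrafficControllerURL string `json:"TrafficControllerURL"`
	DataDogURL           string `json:"DataDogURL"`
	DataDogAPIKey        string `json:"DataDogAPIKey"`
	FlushDurationSeconds uint32 `json:"FlushDurationSeconds"`
}

func loadConfig(path string) (*Config, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var c Config
	if err := json.Unmarshal(raw, &c); err != nil {
		return nil, err
	}
	return &c, nil
}

func main() {
	cfg, err := loadConfig("config/datadog-firehose-nozzle.json")
	if err != nil {
		log.Fatalf("could not load config: %s", err)
	}
	log.Printf("flushing to Datadog every %d seconds", cfg.FlushDurationSeconds)
}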

Batching

The configuration file specifies the interval at which the nozzle flushes metrics to Datadog. By default this is set to 15 seconds.
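
A minimal sketch of that batching pattern, using a hypothetical postMetrics callback rather than the nozzle's actual flush code:

package main

import (
	"log"
	"time"
)

// flushLoop posts buffered metrics every flushInterval. postMetrics is a
// stand-in for whatever actually sends the batch to Datadog.
func flushLoop(flushInterval time.Duration, postMetrics func() error) {
	ticker := time.NewTicker(flushInterval)
	defer ticker.Stop()

	for range ticker.C {
		if err := postMetrics(); err != nil {
			// Log and keep going; a failed flush should not stop the loop.
			log.Printf("error posting metrics: %s", err)
		}
	}
}

func main() {
	// 15 seconds mirrors the documented default flush interval.
	flushLoop(15*time.Second, func() error {
		log.Println("posting buffered metrics to Datadog")
		return nil
	})
}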

slowConsumerAlert

For the most part, the datadog-firehose-nozzle forwards metrics from the Loggregator firehose to Datadog without much processing. A notable exception is the datadog.nozzle.slowConsumerAlert metric. The metric is a binary value (0 or 1) indicating whether the nozzle is forwarding metrics to Datadog at the same rate that it is receiving them from the firehose: 0 means the nozzle is keeping up with the firehose, and 1 means that the nozzle is falling behind.

The nozzle determines the value of datadog.nozzle.slowConsumerAlert with the following rules:

  1. When the nozzle receives a websocket Close frame with status 1008, it publishes the value 1. Traffic Controller pings clients to determine whether their connections are still alive. If it does not receive a Pong response before the KeepAlive deadline, it decides that the connection is too slow (or even dead) and sends the Close frame (a sketch of detecting this close code follows the list).

  2. Otherwise, the nozzle publishes 0.
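
As an illustration of rule 1, here is a minimal sketch of distinguishing the status 1008 close frame from other errors using gorilla/websocket (which the firehose consumer is built on); the helper is hypothetical, not the nozzle's actual code.

package main

import (
	"fmt"

	"github.com/gorilla/websocket"
)

// slowConsumerValue returns 1 when the error is a websocket Close frame with
// status 1008 (policy violation), which Traffic Controller sends when a client
// falls too far behind, and 0 otherwise.
func slowConsumerValue(err error) int {
	if websocket.IsCloseError(err, websocket.ClosePolicyViolation) {
		return 1
	}
	return 0
}

func main() {
	// Simulate the error a consumer would see on a slow-consumer disconnect.
	err := &websocket.CloseError{
		Code: websocket.ClosePolicyViolation,
		Text: "Client did not respond to ping before keep-alive timeout expired.",
	}
	fmt.Println("datadog.nozzle.slowConsumerAlert =", slowConsumerValue(err))
}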

Using Proxies

If you need a proxy to connect to the Internet, set the HTTPProxyURL and HTTPSProxyURL fields in your configuration file to point the nozzle at it. For example:

  • HTTPProxyURL: "http(s)://user:password@proxy_for_http:port"
  • HTTPSProxyURL: "http(s)://user:password@proxy_for_https:port"

Alternatively, you can use environment variables (HTTP_PROXY and HTTPS_PROXY).
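
As an illustration of what those settings amount to, here is a minimal sketch of building a Go HTTP client that routes requests through an explicit proxy URL; the nozzle's own client construction may differ.

package main

import (
	"log"
	"net/http"
	"net/url"
	"time"
)

// newProxiedClient returns an *http.Client that sends requests through the
// given proxy, e.g. "http://user:password@proxy_for_https:port".
func newProxiedClient(proxyURL string) (*http.Client, error) {
	parsed, err := url.Parse(proxyURL)
	if err != nil {
		return nil, err
	}
	transport := &http.Transport{Proxy: http.ProxyURL(parsed)}
	return &http.Client{Transport: transport, Timeout: 30 * time.Second}, nil
}

func main() {
	// proxy.example.com:3128 is a placeholder; substitute your proxy address.
	client, err := newProxiedClient("http://user:password@proxy.example.com:3128")
	if err != nil {
		log.Fatal(err)
	}
	_ = client // use this client for requests to the Datadog API
}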

Deploying

There is a BOSH release that configures, starts, and monitors the Datadog nozzle: https://github.com/DataDog/datadog-firehose-nozzle-release

datadog-firehose-nozzle's People

Contributors

bradylove, chentom88, dsabeti, enocom, gmmeyer, gzussa, hithwen, ihcsim, irabinovitch, isauve, jeremy-lq, jmtuley, nmuesch, nouemankhal, poy, robertjsullivan, rothgar, roxtar, sarah-witt, tylarb, wfernandes, zippolyte

datadog-firehose-nozzle's Issues

Creating v73 fails

Steps to reproduce the issue:

bosh cr releases/datadog-firehose-nozzle/datadog-firehose-nozzle-73.yml
-- Started downloading 'golang1.9.4/16375c0c5d6666c054253735dc3722e66bcbcb07' (sha1=988edb6b67bc250b14709595f1753653fd353eda)
-- Failed downloading 'golang1.9.4/16375c0c5d6666c054253735dc3722e66bcbcb07' (sha1=988edb6b67bc250b14709595f1753653fd353eda)

-- Started downloading 'datadog-firehose-nozzle/b9a16aee2c202539cf9060e1b7909ade740936e7' (sha1=5516b3f66d41bbd7cfe2b452469cdc704a72f276)
-- Failed downloading 'datadog-firehose-nozzle/b9a16aee2c202539cf9060e1b7909ade740936e7' (sha1=5516b3f66d41bbd7cfe2b452469cdc704a72f276)

-- Started downloading 'datadog-firehose-nozzle/bbb75f005164103a2c87563756069e820a2b9ec4' (sha1=de93a059a24f9b7b8671699700e137a63504d937)
-- Failed downloading 'datadog-firehose-nozzle/bbb75f005164103a2c87563756069e820a2b9ec4' (sha1=de93a059a24f9b7b8671699700e137a63504d937)

- Downloading blob '2ba68976-ad9b-4251-6fa6-b1f93989e17e' with digest string '988edb6b67bc250b14709595f1753653fd353eda':
    Getting blob from inner blobstore:
      Getting blob from inner blobstore:
        Copying file:
          Opening source path:
            open blobstore/2ba68976-ad9b-4251-6fa6-b1f93989e17e: no such file or directory
- Downloading blob '02264251-23b0-4e7f-5740-ce433ff84fa9' with digest string '5516b3f66d41bbd7cfe2b452469cdc704a72f276':
    Getting blob from inner blobstore:
      Getting blob from inner blobstore:
        Copying file:
          Opening source path:
            open blobstore/02264251-23b0-4e7f-5740-ce433ff84fa9: no such file or directory
- Downloading blob '78f27ce9-81c4-47d4-725c-7564953b0892' with digest string 'de93a059a24f9b7b8671699700e137a63504d937':
    Getting blob from inner blobstore:
      Getting blob from inner blobstore:
        Copying file:
          Opening source path:
            open blobstore/78f27ce9-81c4-47d4-725c-7564953b0892: no such file or directory

Exit code 1

Describe the results you received:

It errored when downloading blobs.

Describe the results you expected:

A final release to be successfully created.

Additional information you deem important (e.g. issue happens only occasionally):

This is blocking the latest release from showing up on bosh.io.

Update cloudfoundry/noaa dep so nozzle can get metrics only, not metrics+logs

Right now the nozzle doesn't filter anything from the Firehose, so it receives both metrics and logs. For Datadog we only need metrics, and filtering out logs should noticeably reduce network traffic; it would probably also improve nozzle performance slightly.

The nozzle does, of course, ignore log messages—only posting metrics to Datadog—but it'd be best if the nozzle requested not to receive log messages at all.

The cloudfoundry/noaa dependency is what can allow this, but the function needed isn't included in our currently pinned version of noaa (2.0.0). Let's update noaa and change the firehose subscription call to use FilteredFirehose() rather than Firehose().
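
For reference, a sketch of what the switch might look like against a newer noaa consumer. The FilteredFirehose signature and the Metrics filter constant are written from memory here and should be verified against the noaa version being pinned; the URL and token are placeholders.

package main

import (
	"crypto/tls"
	"fmt"

	"github.com/cloudfoundry/noaa/consumer"
)

func main() {
	// Placeholder values for illustration only.
	trafficControllerURL := "wss://doppler.example.com:443"
	authToken := "bearer <token>"

	c := consumer.New(trafficControllerURL, &tls.Config{}, nil)

	// Instead of c.Firehose("datadog-nozzle", authToken), subscribe to metric
	// envelopes only, leaving log messages out of the stream entirely.
	msgChan, errChan := c.FilteredFirehose("datadog-nozzle", authToken, consumer.Metrics)

	go func() {
		for err := range errChan {
			fmt.Println("firehose error:", err)
		}
	}()

	for envelope := range msgChan {
		fmt.Println(envelope.GetEventType())
	}
}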

DNS resolution is not using internal DNS

Steps to reproduce the issue:

  1. Use an internal DNS server to delegate to the foundation subdomain
  2. Install dogate on the foundation
  3. Point datadog nozzle to dogate running on the foundation

https://github.com/pivotal-cloudops/dogate

Describe the results you received:

Connections never come into the app when running dogate locally within a foundation. When deploying dogate to something public like PWS, the domain resolves and metrics start to flow as expected.

{"timestamp":1518733939.313222885,"process_id":13538,"source":"datadog-firehose-nozzle","log_level":"info","message":"Posting 4475 metrics","data":null,"file":"/var/vcap/data/compile/datadog-firehose-nozzle/datadog-firehose-nozzle/src/github.com/DataDog/datadog-firehose-nozzle/datadogclient/datadog_client.go","line":108,"method":"github.com/DataDog/datadog-firehose-nozzle/datadogclient.(*Client).PostMetrics"}
{"timestamp":1518733946.823879004,"process_id":13538,"source":"datadog-firehose-nozzle","log_level":"debug","message":"Error: POST http://dogate.example.com/api/v1/series?api_key=<redacted> request failed. Wait before retrying: 2.142857142s (2 left)","data":null,"file":"/var/vcap/data/compile/datadog-firehose-nozzle/datadog-firehose-nozzle/src/github.com/DataDog/datadog-firehose-nozzle/datadogclient/datadog_client.go","line":87,"method":"github.com/DataDog/datadog-firehose-nozzle/datadogclient.New.func1"}
{"timestamp":1518733956.110441446,"process_id":13538,"source":"datadog-firehose-nozzle","log_level":"debug","message":"Error: POST http://dogate.example.com/api/v1/series?api_key=<redacted> request failed. Wait before retrying: 4.285714284s (1 left)","data":null,"file":"/var/vcap/data/compile/datadog-firehose-nozzle/datadog-firehose-nozzle/src/github.com/DataDog/datadog-firehose-nozzle/datadogclient/datadog_client.go","line":87,"method":"github.com/DataDog/datadog-firehose-nozzle/datadogclient.New.func1"}
{"timestamp":1518733968.611387968,"process_id":13538,"source":"datadog-firehose-nozzle","log_level":"debug","message":"Error: POST http://dogate.example.com/api/v1/series?api_key=<redacted> request failed. Wait before retrying: 7.5s (0 left)","data":null,"file":"/var/vcap/data/compile/datadog-firehose-nozzle/datadog-firehose-nozzle/src/github.com/DataDog/datadog-firehose-nozzle/datadogclient/datadog_client.go","line":87,"method":"github.com/DataDog/datadog-firehose-nozzle/datadogclient.New.func1"}
{"timestamp":1518733973.611835957,"process_id":13538,"source":"datadog-firehose-nozzle","log_level":"error","message":"Error posting metrics: POST http://dogate.example.com/api/v1/series?api_key=<redacted> giving up after 4 attempts\n\n","data":null,"file":"/var/vcap/data/compile/datadog-firehose-nozzle/datadog-firehose-nozzle/src/github.com/DataDog/datadog-firehose-nozzle/datadogfirehosenozzle/datadog_firehose_nozzle.go","line":189,"method":"github.com/DataDog/datadog-firehose-nozzle/datadogfirehosenozzle.(*DatadogFirehoseNozzle).PostMetrics"}

I verified that dogate is running correctly and I can connect to the app from the same server that datadog-firehose-nozzle is running from.

Describe the results you expected:

DNS to resolve the internal domain and metrics to flow as expected.

Traffic Controller websocket connection abnormal closure

Steps to reproduce the issue:

  1. Install PCF 2.4.x onto GCP
  2. Set up GCP load balancers to deliver traffic to the cloud controller and the traffic controller.
  3. Install the Datadog tile 2.6.1

Describe the results you received:
The datadog-firehose-nozzle job on the nozzle VM restarts every 1-5 minutes with the following error message:

{"timestamp":1553078874.162184954,"process_id":17000,"source":"datadog-firehose-nozzle","log_level":"error","message":"Error while reading from the firehose: websocket: close 1006 (abnormal closure): unexpected EOF","data":null,"file":"/var/vcap/data/compile/datadog-firehose-nozzle/datadog-firehose-nozzle/src/github.com/DataDog/datadog-firehose-nozzle/datadogfirehosenozzle/datadog_firehose_nozzle.go","line":311,"method":"github.com/DataDog/datadog-firehose-nozzle/datadogfirehosenozzle.(*DatadogFirehoseNozzle).handleError"}
{"timestamp":1553078874.162430763,"process_id":17000,"source":"datadog-firehose-nozzle","log_level":"info","message":"Closing connection with traffic controller due to websocket: close 1006 (abnormal closure): unexpected EOF","data":null,"file":"/var/vcap/data/compile/datadog-firehose-nozzle/datadog-firehose-nozzle/src/github.com/DataDog/datadog-firehose-nozzle/datadogfirehosenozzle/datadog_firehose_nozzle.go","line":318,"method":"github.com/DataDog/datadog-firehose-nozzle/datadogfirehosenozzle.(*DatadogFirehoseNozzle).handleError"}

The table below illustrates the frequency of failures

Timestamp               Time
1553078829.8758211      10:47:09.875
1553078958.7899298      10:49:18.789
1553079022.647535086    10:50:22.647
1553079253.344120741    10:54:13.344
1553079340.259693384    10:55:40.259

Describe the results you expected:
Expected the datadog-firehose-nozzle job to run continuously

Additional information you deem important (e.g. issue happens only occasionally):
The tile sets the Cloud Controller URL, so that the nozzle gets the Traffic Controller endpoint URL from it.
However, when the Cloud Controller URL is reset to empty and the Traffic Controller URL is set manually, the nozzle runs without failures or interruptions for hours. The manually set URL exactly matches the URL returned by the Cloud Controller.

firehose nozzle not recovering from connection issues properly

Steps to reproduce the issue:

  1. Start datadog firehose nozzle.
  2. Either wait for a connection issue, or cause one somehow.
  3. Watch the nozzle crash

Describe the results you received:
Over the last few weeks, my team has noticed an issue with datadog firehose nozzle instances inside our PCF environment crashing and not recovering properly. Generally we were alerted to the issue by "No Data" PagerDuty messages going off at 3 in the morning (super fun). We've been able to mitigate the problem by restarting the instances, but the team is getting tired of manually restarting them.

Today, I wrote a bash wrapper to automatically restart the firehose nozzle whenever it crashed.

During this exercise, I've been monitoring the logs to see what's been causing crashes, and finally was able to capture one.

2017-08-04T13:33:50.303-07:00 [APP/PROC/WEB/0] [OUT] {"timestamp":1501878830.303400517,"process_id":67,"source":"datadog-firehose-nozzle","log_level":"fatal","message":"FATAL ERROR: Post https://app.datadoghq.com/api/v1/series?api_key=XXXXXXXXXXXXXXXXXXXXX: net/http: request canceled (Client.Timeout exceeded while awaiting headers)\n\n","data":null,"file":"/root/go/src/github.com/DataDog/datadog-firehose-nozzle/datadogfirehosenozzle/datadog_firehose_nozzle.go","line":106,"method":"github.com/DataDog/datadog-firehose-nozzle/datadogfirehosenozzle.(*DatadogFirehoseNozzle).postMetrics"}
2017-08-04T13:33:50.305-07:00 [APP/PROC/WEB/0] [ERR] panic: FATAL ERROR: Post https://app.datadoghq.com/api/v1/series?api_key=XXXXXXXXXXXXXXXXXXXXX: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2017-08-04T13:33:50.305-07:00 [APP/PROC/WEB/0] [ERR] goroutine 1 [running]:
2017-08-04T13:33:50.306-07:00 [APP/PROC/WEB/0] [ERR] panic(0x7b8b40, 0xc820313db0)
2017-08-04T13:33:50.306-07:00 [APP/PROC/WEB/0] [ERR] github.com/cloudfoundry/gosteno.(*Logger).Log(0xc8200b6840, 0x9059c8, 0x5, 0x1, 0xc8200be180, 0xb1, 0x0)
2017-08-04T13:33:50.306-07:00 [APP/PROC/WEB/0] [ERR] /root/go/src/github.com/cloudfoundry/gosteno/logger.go:41 +0x2f0
2017-08-04T13:33:50.306-07:00 [APP/PROC/WEB/0] [ERR] /root/go/src/github.com/cloudfoundry/gosteno/logger.go:66 +0x33e
2017-08-04T13:33:50.306-07:00 [APP/PROC/WEB/0] [ERR] /root/go/src/github.com/DataDog/datadog-firehose-nozzle/datadogfirehosenozzle/datadog_firehose_nozzle.go:104 +0xfb
2017-08-04T13:33:50.306-07:00 [APP/PROC/WEB/0] [ERR] /usr/lib/go-1.6/src/runtime/panic.go:481 +0x3e6
2017-08-04T13:33:50.306-07:00 [APP/PROC/WEB/0] [ERR] github.com/cloudfoundry/gosteno.(*BaseLogger).Log(0xc820012400, 0x9059c8, 0x5, 0x1, 0xc8200be180, 0xb1, 0x0)
2017-08-04T13:33:50.306-07:00 [APP/PROC/WEB/0] [ERR] github.com/cloudfoundry/gosteno.(*Logger).Fatalf(0xc8200b6840, 0x950e80, 0x11, 0xc820a4db90, 0x1, 0x1)
2017-08-04T13:33:50.306-07:00 [APP/PROC/WEB/0] [ERR] github.com/DataDog/datadog-firehose-nozzle/datadogfirehosenozzle.(*DatadogFirehoseNozzle).postToDatadog(0xc820a4def8, 0x0, 0x0)
2017-08-04T13:33:50.306-07:00 [APP/PROC/WEB/0] [ERR] /root/go/src/github.com/DataDog/datadog-firehose-nozzle/datadogfirehosenozzle/datadog_firehose_nozzle.go:49 +0xd8
2017-08-04T13:33:50.306-07:00 [APP/PROC/WEB/0] [ERR] /vagrant/main.go:51 +0x6c0
2017-08-04T13:33:50.306-07:00 [APP/PROC/WEB/0] [ERR] /root/go/src/github.com/DataDog/datadog-firehose-nozzle/datadogfirehosenozzle/datadog_firehose_nozzle.go:86 +0x11b
2017-08-04T13:33:50.306-07:00 [APP/PROC/WEB/0] [ERR] main.main()
2017-08-04T13:33:50.306-07:00 [APP/PROC/WEB/0] [ERR] /root/go/src/github.com/cloudfoundry/gosteno/logger.go:164 +0xa6
2017-08-04T13:33:50.306-07:00 [APP/PROC/WEB/0] [ERR] github.com/DataDog/datadog-firehose-nozzle/datadogfirehosenozzle.(*DatadogFirehoseNozzle).postMetrics(0xc820a4def8)
2017-08-04T13:33:50.306-07:00 [APP/PROC/WEB/0] [ERR] github.com/DataDog/datadog-firehose-nozzle/datadogfirehosenozzle.(*DatadogFirehoseNozzle).Start(0xc820a4def8, 0x0, 0x0)

Describe the results you expected:

If there is a connection issue inside the firehose nozzle, it should gracefully recover, or reattempt the connection, instead of crashing.
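
As an illustration of the behavior being asked for, here is a minimal sketch of a reconnect-with-backoff wrapper. The connect and consume functions are hypothetical stand-ins for the nozzle's real firehose plumbing, not its actual API.

package main

import (
	"errors"
	"log"
	"time"
)

// runWithReconnect keeps the consume loop alive across connection failures
// instead of letting a single error take the process down.
func runWithReconnect(connect func() error, consume func() error) {
	backoff := time.Second
	for {
		if err := connect(); err != nil {
			log.Printf("connect failed: %s; retrying in %s", err, backoff)
			time.Sleep(backoff)
			if backoff < time.Minute {
				backoff *= 2
			}
			continue
		}
		backoff = time.Second // reset after a successful connection
		if err := consume(); err != nil {
			log.Printf("connection lost: %s; reconnecting", err)
		}
	}
}

func main() {
	runWithReconnect(
		func() error { log.Println("connecting to firehose"); return nil },
		func() error {
			log.Println("consuming envelopes")
			time.Sleep(5 * time.Second)
			return errors.New("simulated disconnect")
		},
	)
}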

Additional information you deem important (e.g. issue happens only occasionally):

Here's the command that we're using to run the application from the binary buildpack inside of a PCF instance:

echo "ZZZ starting firehose" && until $HOME/data-dog-nozzle -config ./datadog-firehose-nozzle-$PCF_ENVIRONMENT.json; do echo "ZZZ restarting firehose"; sleep 1 ; done

By the way, I know this probably isn't the "recommended deployment behavior" for this application, but the crashes themselves are something all users would probably appreciate having mitigated.

Token for RLP connection Expires

Describe the bug
Currently, the auth token used to connect to the reverse log proxy (RLP) is only acquired when the nozzle process starts. Eventually this token expires, and when it does, the nozzle fails continuously until it is restarted.

To Reproduce
Steps to reproduce the behavior:

  1. Start a datadog nozzle instance
  2. Wait 14 days for the UAA token to expire
  3. You will see that the datadog nozzle fails to work properly until restarted

Expected behavior
When the nozzle detects a failure to authenticate, it should get a new token and try to connect again.
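
A minimal sketch of that expected behavior, with a hypothetical tokenFetcher interface standing in for the nozzle's UAA client; none of these names come from the nozzle's actual code.

package main

import (
	"log"
	"time"
)

// tokenFetcher is a hypothetical stand-in for the nozzle's UAA client.
type tokenFetcher interface {
	RefreshAuthToken() (string, error)
}

// connectWithFreshToken retries the RLP connection, fetching a new token from
// UAA whenever a connection attempt fails (for example, because the previous
// token expired), rather than leaving the nozzle wedged until a restart.
func connectWithFreshToken(tf tokenFetcher, connect func(token string) error) {
	for {
		token, err := tf.RefreshAuthToken()
		if err != nil {
			log.Printf("could not refresh UAA token: %s", err)
			time.Sleep(10 * time.Second)
			continue
		}
		if err := connect(token); err != nil {
			log.Printf("connection failed: %s; refreshing token and retrying", err)
			time.Sleep(10 * time.Second)
			continue
		}
		return
	}
}

// staticFetcher is a toy implementation used only to make the example runnable.
type staticFetcher struct{}

func (staticFetcher) RefreshAuthToken() (string, error) { return "bearer example-token", nil }

func main() {
	connectWithFreshToken(staticFetcher{}, func(token string) error {
		log.Println("connected to RLP with token", token)
		return nil
	})
}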

Environment and Versions:
Datadog Nozzle version: 2.0.0

Additional context
We have a fix in a PR on this fork: https://github.com/pivotal-cloudops/datadog-firehose-nozzle/pull/1. Please take a look and see whether it is usable for you. We would open a PR against your repo, but we don't seem to have the rights to push a branch.

Kills process if DataDog returns an unexpected status code

It looks like the nozzle will die if DataDog returns a status code it wasn't expecting:

if resp.StatusCode >= 300 || resp.StatusCode < 200 {
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		body = []byte("failed to read body")
	}
	return fmt.Errorf("datadog request returned HTTP response: %s\nResponse Body: %s", resp.Status, body)
}

func (d *DatadogFirehoseNozzle) postMetrics() {
	err := d.client.PostMetrics()
	if err != nil {
		d.log.Fatalf("FATAL ERROR: %s\n\n", err)
	}
}

Is this the desired behavior? It implies that if DataDog experiences an issue, it will cascade through everyone's nozzles and start causing processes to die. I suspect it would be better to log an error without killing the process.
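
As an illustration of that suggestion, here is a minimal, self-contained sketch of logging a failed post and carrying on instead of calling Fatalf; the interface and types below are hypothetical, not the nozzle's actual ones.

package main

import (
	"errors"
	"log"
)

// metricsPoster is a hypothetical stand-in for the nozzle's Datadog client.
type metricsPoster interface {
	PostMetrics() error
}

// postMetrics logs a failed flush and returns, so an unexpected status code
// from Datadog does not terminate the process.
func postMetrics(client metricsPoster, logger *log.Logger) {
	if err := client.PostMetrics(); err != nil {
		logger.Printf("error posting metrics (will retry on next flush): %s", err)
	}
}

// failingClient simulates Datadog returning an unexpected status code.
type failingClient struct{}

func (failingClient) PostMetrics() error {
	return errors.New("datadog request returned HTTP response: 503 Service Unavailable")
}

func main() {
	postMetrics(failingClient{}, log.Default())
	log.Println("nozzle is still running")
}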

/cc @bradylove
