postmates / cernan
telemetry aggregation and shipping, last up the ladder
License: Other
It is desirable to have cernan report its internal health for operational monitoring.
The 'bucket' data store present in cernan is acceptable for statsd multiplexing but breaks down as soon as you want to retain datapoints, which you do if you pump in graphite or logs.
This has come up in conversation, especially with @doubleyou, @blakebarnett and @dvdklnr. Very likely we'll just use wavefront proxy's extension to statsd to support tags as this is standard-ish and somewhat supported by existing clients.
See also #10.
The memory use of cernan will grow without limit because it does not flush old metrics as needed. This can be corrected by moving from an internal HashMap to an LRU or other expiring cache.
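A minimal sketch of the fix, assuming the third-party lru crate (cernan's actual store type is not shown here):

```rust
use std::num::NonZeroUsize;

use lru::LruCache;

fn main() {
    // Cap the store at 1024 metric names; the cap itself is an assumption.
    let mut store: LruCache<String, f64> = LruCache::new(NonZeroUsize::new(1024).unwrap());
    store.put("a.metric".to_string(), 1.0);
    // Once the cache is full, inserting a new key evicts the
    // least-recently-used entry, bounding memory where a plain
    // HashMap would retain every key forever.
    assert_eq!(store.get("a.metric"), Some(&1.0));
}
```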
Statsd is a lossy protocol. The server is intended to aggregate before passing on, which is not strictly needed with any of our existing aggregation services. Graphite, by contrast, is a lossless format, implying that the server merely stores and forwards points.
Supporting graphite will allow ingestion of local collectd points.
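For reference, the graphite plaintext protocol is a newline-delimited name, value, timestamp triple; a collectd-fed line might look like the following (hostname and values illustrative):

collectd.web01.cpu-0.cpu-idle 97.2 1476304000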
In the process of avoiding generic string handling I've introduced a lot of copying of strings. This is okay while we're just multiplexing statsd but will break down as soon as we start storing points or want higher performance.
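A sketch of the usual remedy, borrowing on the hot path with Cow and allocating only when a name must actually be rewritten (illustrative, not cernan's code):

```rust
use std::borrow::Cow;

// Return the name unchanged without copying in the common case,
// allocating only when a rewrite is actually required.
fn normalize(name: &str) -> Cow<'_, str> {
    if name.contains(' ') {
        Cow::Owned(name.replace(' ', "_"))
    } else {
        Cow::Borrowed(name)
    }
}

fn main() {
    assert!(matches!(normalize("a.metric"), Cow::Borrowed(_)));
    assert!(matches!(normalize("a metric"), Cow::Owned(_)));
}
```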
Right now we have our sinks hard-coded into the binary. This is undesirable for anyone that wants to emit to new and interesting places. We should have the ability to allow users to write their own sinks and to prove this out we ought to extract some of our own.
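What an extracted sink interface might look like; the trait and type names below are hypothetical, not cernan's actual API:

```rust
// A point-in-time observation; fields are illustrative.
pub struct Metric {
    pub name: String,
    pub value: f64,
    pub time: i64,
}

// The contract a user-written sink would implement.
pub trait Sink {
    /// Accept a single point for eventual delivery.
    fn deliver(&mut self, point: Metric);
    /// Push any buffered points to the backing store.
    fn flush(&mut self);
}
```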
It turns out that wavefront does not like sub-second points. That is, if you report

a.metric 1 00001
a.metric 1 00001
a.metric 1 00001

Wavefront will only include a single a.metric 1 00001 point in the time series.
So! We need to use the metadata associated with metrics and aggregate appropriately.
A few statsd clients (primarily those written in Go) have been observed to default to the IPv6 loopback instead of the IPv4 loopback.
The net result is confused users and silently dropped metrics.
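A minimal sketch of the defensive fix, binding listeners on both loopbacks so clients that default to IPv6 still reach us (the port is the statsd convention, not cernan's config):

```rust
use std::net::UdpSocket;

fn main() -> std::io::Result<()> {
    // 8125 is the conventional statsd port.
    let _v4 = UdpSocket::bind("127.0.0.1:8125")?;
    let _v6 = UdpSocket::bind("[::1]:8125")?;
    // In a real server each socket would feed the same ingestion path.
    Ok(())
}
```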
Investigate the possibility of introducing an optional mechanism for transmitting metrics over UDP with forward error correction. tl;dr: by sending redundant messages over a faster protocol, we can significantly increase egress throughput.
Criteria include:
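As a strawman for the simplest redundancy scheme (sequence-tagged duplicate datagrams; real FEC such as parity packets would be more bandwidth-efficient), a hypothetical sketch:

```rust
use std::net::UdpSocket;

// Tag each datagram with a sequence id and send it `copies` times;
// the receiver de-duplicates by id. All names here are hypothetical.
fn send_redundant(
    sock: &UdpSocket,
    dst: &str,
    seq: u64,
    payload: &[u8],
    copies: u8,
) -> std::io::Result<()> {
    let mut frame = seq.to_be_bytes().to_vec();
    frame.extend_from_slice(payload);
    for _ in 0..copies {
        sock.send_to(&frame, dst)?; // duplicates are cheap over UDP
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let sock = UdpSocket::bind("127.0.0.1:0")?;
    send_redundant(&sock, "127.0.0.1:9000", 1, b"a.metric:1|c", 3)
}
```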
Part and parcel with #61: since we should no longer report exact points to wavefront, we can no longer rely on the timeseries information to provide accurate counts for our aggregates. This is something we need to collect and report ourselves.
The CLI of cernan should support its former set of configuration options. Without this, master has become difficult to promote to stable as all hosts will necessarily need a correct configuration file on-disk.
The CLI will not be updated to support the variety of configuration available through the config file.
Right now we have a --wavefront-skip-aggrs flag and want to eventually remove this option entirely by removing the ability to ship aggregates to wavefront. They are not useful.
Cernan's logging should be more descriptive than it is today. Before releasing 0.4.0 we ought to audit every place where we currently log and ensure that the message provides the full context available at the time, as well as add additional logging in critical areas where it is currently lacking.
The present state of affairs in cernan is that counters are per flush-interval. That is, if cernan is configured to have a flush interval of 15 seconds a counter is implicitly per fifteen seconds.
Instead, the bucket concept should be expanded to aggregate at one second bins. Exactly how we want to cook this in the presence of potential flush failures is unknown as of this writing.
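A minimal sketch of one-second binning (not cernan's implementation): increments land in the bin for their timestamp's second, and a flush drains whole bins rather than a single per-interval sum.

```rust
use std::collections::BTreeMap;

struct Bins {
    counts: BTreeMap<u64, f64>, // unix second -> count
}

impl Bins {
    fn add(&mut self, ts_secs: u64, value: f64) {
        *self.counts.entry(ts_secs).or_insert(0.0) += value;
    }

    /// Drain every bin strictly older than `now_secs`, leaving the
    /// still-open current second in place for late arrivals.
    fn flush(&mut self, now_secs: u64) -> Vec<(u64, f64)> {
        let ready: Vec<u64> = self.counts.range(..now_secs).map(|(k, _)| *k).collect();
        ready
            .into_iter()
            .map(|k| {
                let v = self.counts.remove(&k).unwrap();
                (k, v)
            })
            .collect()
    }
}

fn main() {
    let mut bins = Bins { counts: BTreeMap::new() };
    bins.add(100, 1.0);
    bins.add(100, 1.0);
    bins.add(101, 3.0);
    assert_eq!(bins.flush(101), vec![(100, 2.0)]); // second 100 drains, 101 stays open
}
```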
Right now cernan tests do not produce coverage reports. This makes it difficult to determine how well tested cernan is in practice. Issues that have slipped into the wild suggest that it is not.
As noted by @tsantero in #85 (comment): "memory allocation is inefficient; reading in 6 log files concurrently at an avg of 10k lines/sec of 100 bytes/line cost several gigs of RAM, with utilization growing as the counts increased." That input rate is only about 6 MB/s, so multi-gigabyte residency points to unbounded buffering rather than raw throughput.
This is... not okay. Our present channel-based method of communication has met the end of its useful lifetime.
In order to preserve information about the distribution of request times or similar data, we can't rely on percentiles: there's no correct way to later aggregate two percentiles to get another percentile, and a mean of percentiles isn't usually what you want.
What we can do is calculate histograms for each reporting period, then report each bin of that histogram, for each flush interval, to Wavefront. From Wavefront's perspective it's just a bunch of separate metrics, but we can reassemble them and calculate a histogram for the distribution of some values for arbitrary intervals without loss of information.
The tricky part here is communicating from the application to Cernan what the bins should be. Statsd can be configured to calculate histograms in this way, but it involves defining the bins in statsd's config file which is awkward to realize.
So I propose an extension to the statsd protocol so applications can communicate to cernan at application start the desired histogram bins. Proposed format:
metric.name:-inf,1,10,100,inf|h
This configures four bins for the metric metric.name, bounded by consecutive pairs of those edges: -inf to 1, 1 to 10, 10 to 100, and 100 to inf.
Applications can send these each time they start, and Cernan will write them to the filesystem so they don't need to be sent again if Cernan restarts.
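A hypothetical parser for the proposed declaration (not part of cernan today):

```rust
// Parse "metric.name:-inf,1,10,100,inf|h" into a name plus bin edges.
fn parse_bins(line: &str) -> Option<(String, Vec<f64>)> {
    let (name, rest) = line.split_once(':')?;
    let edges: Option<Vec<f64>> = rest
        .strip_suffix("|h")?
        .split(',')
        .map(|e| match e {
            "-inf" => Some(f64::NEG_INFINITY),
            "inf" => Some(f64::INFINITY),
            other => other.parse().ok(),
        })
        .collect();
    let edges = edges?;
    // n edges describe n-1 bins; require at least one bin.
    if edges.len() < 2 {
        return None;
    }
    Some((name.to_string(), edges))
}

fn main() {
    let (name, edges) = parse_bins("metric.name:-inf,1,10,100,inf|h").unwrap();
    assert_eq!(name, "metric.name");
    assert_eq!(edges.len(), 5); // five edges -> four bins
}
```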
Right now cernan has a global concept of flush intervals. It would be much nicer if sinks had a per-sink flush interval, allowing us to differentiate between disk IO flushes and network flushes.
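A hypothetical shape for this in the config file; the key names are assumptions, not cernan's actual schema:

```toml
# Global default, overridable per sink.
flush-interval = 60

[wavefront]
flush-interval = 60  # network flushes can stay coarse

[firehose]
flush-interval = 1   # disk-backed sinks can flush aggressively
```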
Cernan must learn to ingest log files. This issue does not imply modification of the data streams exposed from the files. At present it would be sufficient to note that a line was written to the log file and convert this to a PIT metric.
Depends on #37
Use the console backend as an example.
Cernan presently has in-memory channels for shuttling information back and forth between system components. This was the same approach that heka adopted, and it crippled heka in high-load situations. In-memory channels are problematic in that they have no concept of backpressure. They are also not crash tolerant.
Here the primary inspiration will be hindsight, which the heka folks have blessed as the works-well-enough replacement for heka.
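For the backpressure half of the problem, std's bounded channels already illustrate the behavior we want from whatever replaces the current setup (crash tolerance, hindsight's disk-backed queues, is out of scope for this sketch):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

fn main() {
    // Bounded: send() blocks once 1024 messages are in flight,
    // applying backpressure instead of growing memory without limit.
    let (tx, rx) = sync_channel::<u64>(1024);
    let producer = thread::spawn(move || {
        for i in 0..10_000u64 {
            tx.send(i).unwrap();
        }
        // tx drops here, ending the consumer's iteration below.
    });
    let sum: u64 = rx.iter().sum();
    producer.join().unwrap();
    println!("consumed sum = {}", sum);
}
```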
Recording the metrics from every single celery worker times every single instance is going to create a bunch of noise in the graphs and increase our wavefront upload rate to a level that, even with aggregation-by-time tweaks, won't make sense. Cernan tagging will enable us to tag everything as either prod or staging.
Cernan should allow the end-user to manipulate their logs into telemetry streams for differing backends. This implies a scripting capability and the ability for scripts to deliver packets into named queues.
Possibly depends on #40
Pgbouncer metrics were reported each second in the past, but are now reported every 10 seconds. I'm guessing this has to do with recent changes to Cernan's aggregation behavior.
To preserve the semantics expected from a graphite interface, Cernan should not aggregate these values, or buffer them for the flush interval, but instead forward them to Wavefront as reported.
There are many places where cernan is aggregating important things in memory, holding until there's enough information from disk-based sources etc etc. Presently if cernan restarts all of this is lost.
Cernan must not lose things.
Related to #98, each metric should have a notion of its own metadata. Presently metadata is per-sink.
As it says on the tin. Determining the running cernan version is very hard without this.
Right now if you send cernan a metric name that is, say, 1MB, cernan will allocate 1MB for you.
So! Probably oughtn't to do that.
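A sketch of the obvious defense; the 256-byte cap is an assumption, not a cernan decision:

```rust
const MAX_NAME_LEN: usize = 256;

// Reject oversized names up front rather than allocating
// attacker-controlled amounts of memory for them.
fn accept_name(raw: &str) -> Option<&str> {
    if raw.len() > MAX_NAME_LEN {
        None
    } else {
        Some(raw)
    }
}

fn main() {
    assert!(accept_name("a.metric").is_some());
    assert!(accept_name(&"x".repeat(1_000_000)).is_none());
}
```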
It would be very useful for cernan to distinguish runtime environments. Inside of postmates we want this to be a first-class bit of metadata on all points.
It's come up on a few occasions that it'd be awfully handy to have cernan forward to a remote cernan. This is possible to do.
As a part of #61 we can no longer report exact points to wavefront. Doing so eats into our point budget and provides no benefit.
Introduced in #91, the mpmc log file will grow without bound. This needs to be corrected. Current thinking is to do periodic rotation and to abstract the state machine used in file_server to service the Receiver.
Motivated by the fact that collectd puts metadata into the metric name, we need to be able to adjust metric names as they come in, in a user-configurable fashion.
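An illustrative rename pass using the regex crate; the pattern below is an assumption about collectd-style names, not a shipped default:

```rust
use regex::Regex;

fn main() {
    // Strip a leading "collectd.<host>." prefix, keeping the rest.
    let re = Regex::new(r"^collectd\.(?P<host>[^.]+)\.(?P<rest>.+)$").unwrap();
    let renamed = re.replace("collectd.web01.cpu.idle", "$rest");
    assert_eq!(renamed, "cpu.idle");
}
```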
Our current QOS setup seemingly causes elision of metrics. Steps to reproduce:
At present all data sources--graphite, statsd, log files--are reported to all backends. This is not appropriate. Cernan should instead allow, by configuration, different sources to be reported to different sinks.
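A hypothetical shape for such routing in the config file; the schema is an assumption for illustration:

```toml
[sources.statsd]
forwards = ["wavefront", "console"]

[sources.graphite]
forwards = ["wavefront"]

[sources.files.syslog]
path = "/var/log/syslog"
forwards = ["firehose"]
```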
It would be very useful for integration testing if SIGINT were to cause backends to flush and then result in program exit.
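A sketch of the shutdown shape, assuming the third-party ctrlc crate (not cernan's actual shutdown path, and flush_all_sinks is hypothetical):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

fn main() {
    let stop = Arc::new(AtomicBool::new(false));
    let s = stop.clone();
    // On SIGINT, flip a flag so the main loop can wind down cleanly.
    ctrlc::set_handler(move || s.store(true, Ordering::SeqCst)).unwrap();
    while !stop.load(Ordering::SeqCst) {
        // ... ingest and aggregate ...
        std::thread::sleep(std::time::Duration::from_millis(100));
    }
    // flush_all_sinks(); // hypothetical: flush every backend before exit
    println!("flushed, exiting");
}
```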
Related to #82, cernan's test coverage is poor. We must improve this.
At the suggestion of @pulltab we should introduce a quality-of-service notion to our metric types. The idea here being that backends can choose what rate to flush certain types. By default, we'll maintain flush-rate compatibility with existing statsd implementations.
This is intended to allow us to reduce the burden on our backend reporters.
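One possible shape for the idea; names are illustrative, not cernan's types:

```rust
// Each metric kind carries a QoS the sink may honor at flush time.
enum Qos {
    EveryFlush,    // statsd-compatible default
    EveryNth(u32), // flush only every Nth interval
}

fn should_flush(qos: &Qos, interval_count: u32) -> bool {
    match qos {
        Qos::EveryFlush => true,
        Qos::EveryNth(n) => interval_count % n == 0,
    }
}

fn main() {
    assert!(should_flush(&Qos::EveryFlush, 7));
    assert!(!should_flush(&Qos::EveryNth(10), 7));
    assert!(should_flush(&Qos::EveryNth(10), 20));
}
```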
Cernan has grown a plethora of CLI flags. This makes it difficult to add arbitrary files to track.
Cernan should accept a configuration file. We must keep the current CLI flags around in the meantime.
Consider a system that emits self-aggregated histograms as gauges with names like foo.bar.baz.median, foo.bar.baz.999. Cernan will only allow the first.
We must have some capability of determining that cernan emits the points we expect for any given input from the outside of the project.
This requires #34
Right now cernan does not do a DNS lookup to resolve non-IP hosts into IP addresses. This significantly limits the fun things we can get up to.
For instance, we ought to do more than crash on wavefront-proxy.us-west-1.postmates.com.
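std's ToSocketAddrs already resolves hostnames, so a sketch of the fix (the port is the wavefront proxy convention, assumed here):

```rust
use std::net::ToSocketAddrs;

fn main() {
    let host = "wavefront-proxy.us-west-1.postmates.com:2878";
    // Resolve rather than crash; a real sink would retry on failure.
    match host.to_socket_addrs() {
        Ok(mut addrs) => println!("resolved to {:?}", addrs.next()),
        Err(e) => eprintln!("resolution failed, will retry: {}", e),
    }
}
```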
Ambition is to parse --tags source=foo,service=bar,metadata=blerg and emit appropriately to correct emission sites. As indicated, --metric-source will be folded into this.
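A hypothetical parse of that flag's value:

```rust
// Split "k=v,k=v,..." into pairs, skipping malformed entries.
fn parse_tags(raw: &str) -> Vec<(String, String)> {
    raw.split(',')
        .filter_map(|pair| {
            let (k, v) = pair.split_once('=')?;
            Some((k.to_string(), v.to_string()))
        })
        .collect()
}

fn main() {
    let tags = parse_tags("source=foo,service=bar,metadata=blerg");
    assert_eq!(tags.len(), 3);
    assert_eq!(tags[0], ("source".to_string(), "foo".to_string()));
}
```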
In existing postmates systems there is a need to parse metrics with names like source.machine-metric.name as being a metric named 'metric.name' with source 'source.machine'. The etsy/statsd backend has been modified to take a regex to fiddle with this, but I've hard-coded it in the expectation that future backends will be programmable by the end-user.
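The hard-coded split, expressed with the regex crate (the pattern is a sketch of the behavior described, not the exact shipped expression):

```rust
use regex::Regex;

fn main() {
    // Everything before the first '-' is the source; the rest is the name.
    let re = Regex::new(r"^(?P<source>[^-]+)-(?P<name>.+)$").unwrap();
    let caps = re.captures("source.machine-metric.name").unwrap();
    assert_eq!(&caps["source"], "source.machine");
    assert_eq!(&caps["name"], "metric.name");
}
```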
We don't have a license for TravisCI. We should still have feedback on all PRs, though. What we must have is:
Feedback and badges etc would be a bonus.
This issue is an elaboration of #35
Presently we report the following percentiles:
Kosher?
Presently cernan implements its counters as continuously increasing sums of points. What it should do instead is reset the counts on each flush.
This is related to #61
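A minimal sketch of reset-on-flush semantics:

```rust
struct Counter {
    sum: f64,
}

impl Counter {
    fn add(&mut self, v: f64) {
        self.sum += v;
    }

    /// Report the interval's sum and reset, rather than letting the
    /// counter grow monotonically across flushes.
    fn flush(&mut self) -> f64 {
        std::mem::replace(&mut self.sum, 0.0)
    }
}

fn main() {
    let mut c = Counter { sum: 0.0 };
    c.add(2.0);
    c.add(3.0);
    assert_eq!(c.flush(), 5.0);
    assert_eq!(c.flush(), 0.0); // reset after each flush
}
```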