
cernan's People

Contributors

blt, doubleyou, dparton, dvdklnr, ekimekim, gliush, goller, ibotty, josephglanville, kichristensen, pulltab, randomdross, skade, tfheen, tsantero, zdlopez

cernan's Issues

cernan doesn't drop old metrics

The memory use of cernan will grow without limit because it does not flush old metrics as needed. This can be corrected by moving from an internal HashMap to an LRU or other expiring cache.
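
A minimal sketch of the expiring-cache idea using only the standard library; the `ExpiringCache` type and its fields are hypothetical stand-ins, not cernan's actual internals:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical replacement for the internal HashMap: each entry remembers
/// when it was last touched so stale metrics can be dropped.
struct ExpiringCache {
    ttl: Duration,
    entries: HashMap<String, (Instant, f64)>,
}

impl ExpiringCache {
    fn new(ttl: Duration) -> ExpiringCache {
        ExpiringCache { ttl, entries: HashMap::new() }
    }

    /// Insert or refresh a metric value.
    fn upsert(&mut self, name: String, value: f64) {
        self.entries.insert(name, (Instant::now(), value));
    }

    /// Drop every entry not touched within the TTL. Called on each flush
    /// so memory use stays bounded.
    fn prune(&mut self) {
        let ttl = self.ttl;
        self.entries.retain(|_, &mut (seen, _)| seen.elapsed() < ttl);
    }
}

fn main() {
    let mut cache = ExpiringCache::new(Duration::from_secs(600));
    cache.upsert("a.metric".to_string(), 1.0);
    cache.prune(); // nothing is older than ten minutes, so the entry survives
    println!("live metrics: {}", cache.entries.len());
}
```

An off-the-shelf LRU crate would work equally well; the point is only that eviction has to happen somewhere.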

Support graphite protocol

Statsd is a lossy protocol: the server is intended to aggregate before passing data on, which is not strictly needed with any of our existing aggregation services. graphite is a lossless format by contrast, implying that the server merely stores and forwards points.

Supporting graphite will allow ingestion of local collectd points.
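
For reference, the graphite plaintext protocol is one point per line, `name value timestamp`. A minimal parsing sketch; the `GraphitePoint` type is illustrative, not cernan's:

```rust
/// One point from the graphite plaintext protocol: `name value timestamp\n`.
#[derive(Debug)]
struct GraphitePoint {
    name: String,
    value: f64,
    timestamp: i64,
}

/// Parse a single graphite line; returns None on any malformed field.
fn parse_graphite_line(line: &str) -> Option<GraphitePoint> {
    let mut parts = line.split_whitespace();
    let name = parts.next()?.to_string();
    let value = parts.next()?.parse::<f64>().ok()?;
    let timestamp = parts.next()?.parse::<i64>().ok()?;
    Some(GraphitePoint { name, value, timestamp })
}

fn main() {
    // Example line as collectd's write_graphite plugin might emit it.
    let point = parse_graphite_line("web01.load.shortterm 0.45 1470000000");
    println!("{:?}", point);
}
```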

Reduce significant copying

In the process of avoiding generic string handling I've introduced a lot of copying of strings. This is okay while we're just multiplexing statsd but will break down as soon as we start storing points or want higher performance.
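
One possible direction, sketched with std only: share metric names behind an `Arc<str>` so clones bump a reference count instead of copying bytes. The name used here is illustrative:

```rust
use std::sync::Arc;

fn main() {
    // Parse the name once, then share it; Arc::clone copies a pointer,
    // not the underlying string data.
    let name: Arc<str> = Arc::from("a.metric.name");

    let for_wavefront = Arc::clone(&name);
    let for_console = Arc::clone(&name);

    assert_eq!(&*for_wavefront, &*for_console);
    println!("shared name: {}", name);
}
```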

Extract sinks into plugins

Right now we have our sinks hard-coded into the binary. This is undesirable for anyone that wants to emit to new and interesting places. We should have the ability to allow users to write their own sinks and to prove this out we ought to extract some of our own.
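
A hedged sketch of what a pluggable sink interface might look like; the `Sink` trait, `Metric` struct, and `ConsoleSink` are hypothetical, not cernan's current types:

```rust
/// Illustrative telemetry point; cernan's real type carries more metadata.
struct Metric {
    name: String,
    value: f64,
    time: i64,
}

/// Hypothetical plugin interface: a sink accepts points and flushes them
/// to its backend on demand.
trait Sink {
    fn deliver(&mut self, point: Metric);
    fn flush(&mut self);
}

/// Trivial sink that writes points to stdout, standing in for a user plugin.
struct ConsoleSink {
    buffer: Vec<Metric>,
}

impl Sink for ConsoleSink {
    fn deliver(&mut self, point: Metric) {
        self.buffer.push(point);
    }

    fn flush(&mut self) {
        for m in self.buffer.drain(..) {
            println!("{} {} {}", m.name, m.value, m.time);
        }
    }
}

fn main() {
    // Sinks held behind trait objects, so new ones can be added without
    // touching the core binary.
    let mut sinks: Vec<Box<dyn Sink>> = vec![Box::new(ConsoleSink { buffer: Vec::new() })];
    for sink in sinks.iter_mut() {
        sink.deliver(Metric { name: "a.metric".to_string(), value: 1.0, time: 1 });
        sink.flush();
    }
}
```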

Aggregate Wavefront points

It turns out that wavefront does not like sub-second points. That is, if you report

a.metric 1 00001
a.metric 1 00001
a.metric 1 00001

Wavefront will only keep a single a.metric 1 00001 point in the time series.

So! We need to use the metadata associated with metrics and aggregate appropriately.
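
A minimal sketch of one possible aggregation, assuming we sum values that land on the same (name, second) pair before shipping to Wavefront; summing is an assumption here, and the types are illustrative:

```rust
use std::collections::HashMap;

fn main() {
    // Three reports of a.metric at the same second, as in the example above.
    let reports = vec![
        ("a.metric", 1.0, 1u64),
        ("a.metric", 1.0, 1u64),
        ("a.metric", 1.0, 1u64),
    ];

    // Collapse points that share a (name, timestamp) key so Wavefront
    // receives one aggregated value rather than silently keeping one of three.
    let mut binned: HashMap<(&str, u64), f64> = HashMap::new();
    for (name, value, ts) in reports {
        *binned.entry((name, ts)).or_insert(0.0) += value;
    }

    for ((name, ts), value) in &binned {
        println!("{} {} {}", name, value, ts);
    }
}
```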

IPv6 Statsd Listener Support

A few statsd clients (primarily those written in Go) have been observed to default to the IPv6 loopback instead of the IPv4 loopback.

The net result is confused users and silently dropped metrics.
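
One straightforward remedy, sketched with std only: bind a listener on both loopbacks (or on `[::]`, which on many systems also accepts IPv4-mapped traffic). Port 8125 is the conventional statsd port and is an assumption here:

```rust
use std::net::UdpSocket;

fn main() -> std::io::Result<()> {
    // Bind both address families so Go clients that default to the IPv6
    // loopback are heard alongside IPv4 clients.
    let v4 = UdpSocket::bind("127.0.0.1:8125")?;
    let v6 = UdpSocket::bind("[::1]:8125")?;

    println!("listening on {} and {}", v4.local_addr()?, v6.local_addr()?);
    Ok(())
}
```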

Investigate Forward Error Correction

Investigate the possibility of introducing an optional mechanism for transmitting metrics over UDP with forward error correction. tl;dr: by sending redundant messages, but over a faster protocol, we can significantly increase egress throughput.

Criteria include:

  • idempotent updates to metrics in wavefront
  • how wavefront treats duplicate points wrt billing
  • base implementation of QoS per metric
  • decomplect flush interval from aggregation
  • piggy-backing of binning and cursors on flush
  • durable storage

Kick out counts with percentile aggregates

Part and parcel with #61: since we will no longer report exact points to wavefront, we can no longer rely on the timeseries information to provide accurate counts for our aggregates. This is something we need to collect and report ourselves.

Re-enable full CLI configuration

The CLI of cernan should support its former set of configuration options. Without this, master has become difficult to promote to stable as all hosts will necessarily need a correct configuration file on-disk.

The CLI will not be updated to support the variety of configuration available through the config file.

Remove Wavefront backend aggregation

Right now we have a --wavefront-skip-aggrs flag and eventually want to remove this option entirely by removing the ability to ship aggregates to wavefront. They are not useful.

Make cernan logs more descriptive

Cernan's logging should be more descriptive than it is today. Before releasing 0.4.0 we ought to audit every place where we currently log and ensure that the message provides the full context available at the time, as well as add additional logging in critical areas where it is currently lacking.

'bucket' concept should store one second bins

The present state of affairs in cernan is that counters are per flush-interval. That is, if cernan is configured to have a flush interval of 15 seconds, a counter is implicitly per fifteen seconds.

Instead, the bucket concept should be expanded to aggregate at one second bins. Exactly how we want to cook this in the presence of potential flush failures is unknown as of this writing.

Tests have no notion of coverage

Right now cernan tests do not produce coverage reports. This makes it difficult to determine how well tested cernan is in practice. Issues that have slipped into the wild would suggest that it is not.

Log line allocations are pretty heinous

As noted by @tsantero in #85 (comment):

memory allocation is inefficient, reading in 6 log files concurrently at an avg of 10k lines/sec of 100 bytes/line cost several gigs of RAM, with utilization growing as the counts increased

This is... not okay. Our present channel-based method of communication has met the end of its useful lifetime.

Histograms and Configurable Bins

In order to preserve information about the distribution of request times or similar data, we can't rely on percentiles: there's no correct way to later aggregate two percentiles to get another percentile. A mean of percentiles isn't usually what you want.

What we can do is calculate histograms for each reporting period, then report each bin of that histogram, for each flush interval, to Wavefront. From Wavefront's perspective it's just a bunch of separate metrics, but we can reassemble them and calculate a histogram for the distribution of some values for arbitrary intervals without loss of information.

The tricky part here is communicating from the application to Cernan what the bins should be. Statsd can be configured to calculate histograms in this way, but it involves defining the bins in statsd's config file which is awkward to realize.

So I propose an extension to the statsd protocol so applications can communicate to cernan at application start the desired histogram bins. Proposed format:

metric.name:-inf,1,10,100,inf|h

This configures four bins for the metric metric.name:

  • less than 1
  • [1, 10)
  • [10, 100)
  • greater than or equal to 100

Applications can send these each time they start, and Cernan will write them to the filesystem so they don't need to be sent again if Cernan restarts.
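
A hedged sketch of parsing the proposed bin declaration; the function and types are illustrative only, since the extension is still a proposal:

```rust
/// Parse the proposed `metric.name:-inf,1,10,100,inf|h` declaration into a
/// metric name and its bin boundaries. Returns None if the line does not
/// match the proposed shape.
fn parse_bin_declaration(line: &str) -> Option<(String, Vec<f64>)> {
    let (name, rest) = match line.find(':') {
        Some(idx) => (&line[..idx], &line[idx + 1..]),
        None => return None,
    };
    let bounds_str = rest.strip_suffix("|h")?;

    let mut bounds = Vec::new();
    for tok in bounds_str.split(',') {
        let b = match tok {
            "-inf" => f64::NEG_INFINITY,
            "inf" => f64::INFINITY,
            other => other.parse::<f64>().ok()?,
        };
        bounds.push(b);
    }
    Some((name.to_string(), bounds))
}

fn main() {
    // The example from the proposal: bins (-inf,1), [1,10), [10,100), [100,inf).
    let parsed = parse_bin_declaration("metric.name:-inf,1,10,100,inf|h");
    println!("{:?}", parsed);
}
```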

sinks' flush intervals ought to be independent

Right now cernan has a global concept of flush intervals. It would be much nicer if sinks had a per-sink flush interval, allowing us to differentiate between disk IO flushes and network flushes.

Support reading from configured files

Cernan must learn to ingest log files. This issue does not imply modification of the data streams exposed from the files. At present it would be sufficient to note that a line was written to the log file and convert this to a PIT metric.

Depends on #37

Support disk-based queues

Cernan presently has in-memory channels for shuttling information back and forth between system components. This was the same approach that heka adopted, and it crippled heka in high-load situations. In-memory channels are problematic in that they have no concept of backpressure. They are also not crash tolerant.

Here the primary inspiration will be hindsight, which the heka folks have blessed as the works-well-enough replacement for heka.

Cernan Tagging

Recording the metrics from every single celery worker * every single instance is going to create a lot of noise in the graphs and increase our wf upload rate to a level that, even with aggregation-by-time tweaks, won't make sense. Cernan tagging will enable us to tag everything as either prod or staging.

Admit analysis of log files

Cernan should allow the end-user to manipulate their logs into telemetry streams for differing backends. This implies a scripting capability and the ability for scripts to deliver packets into named queues.

Possibly depends on #40

Probably don't aggregate graphite metrics

Pgbouncer metrics were reported each second in the past but are now reported every 10 seconds. I'm guessing this has to do with recent changes to Cernan's aggregation behavior.

To preserve the semantics expected from a graphite interface, Cernan should not aggregate these values, or buffer them for the flush interval, but instead forward them to Wavefront as reported.

Cernan must be durable between restarts

There are many places where cernan aggregates important things in memory, holding data until there is enough information from disk-based sources, and so on. Presently, if cernan restarts, all of this is lost.

Cernan must not lose things.

Support an --environment flag

It would be very useful for cernan to distinguish runtime environments. Inside of postmates we want this to be a first-class bit of metadata on all points.

Cernan forwarding to cernan

It's come up on a few occasions that it'd be awfully handy to have cernan forward to a remote cernan. This is possible to do.

Avoid data-log growing indefinitely

Introduced in #91, the mpmc log file will grow without bound. This needs to be corrected. Current thinking is to do periodic rotation and to abstract the state machine used in file_server to service the Receiver.

Allow munging of metric names

Motivated by the fact that collectd puts metadata into the metric name, we need to be able to adjust metric names as they come in, in a user-configurable fashion.
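
A hedged sketch of user-configurable rewriting, assuming the `regex` crate; the rewrite rule shown is made up for illustration, not a real collectd mapping:

```rust
use regex::Regex;

fn main() {
    // Hypothetical user-supplied rule: strip a collectd-style host segment
    // off the front of the metric name, keeping the rest intact.
    let rule = Regex::new(r"^collectd\.[^.]+\.").unwrap();
    let incoming = "collectd.web01.cpu.0.cpu.idle";

    let rewritten = rule.replace(incoming, "");
    println!("{} -> {}", incoming, rewritten);
}
```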

Current QOS elides gauges

Our current QOS setup seemingly causes elision of metrics. Steps to reproduce:

  1. emit a gauge every five seconds
  2. set gauge QOS to 10 seconds

Break the hegemony of all sources to all sinks

At present all data sources--graphite, statsd, log files--are reported to all backends. This is not appropriate. Cernan should instead allow, by configuration, different sources to be reported to different sinks.

Handle Signals

It would be very useful for integration testing if SIGINT were to cause backends to flush and then result in program exit.
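
A minimal sketch of one way to do this; the `ctrlc` crate is an assumption here, and the final flush call is hypothetical:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    // Flag flipped by the signal handler; the main loop notices it,
    // flushes its sinks, and exits cleanly.
    let shutdown = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&shutdown);

    // `ctrlc` is an assumption; cernan could equally use another signal crate.
    ctrlc::set_handler(move || flag.store(true, Ordering::SeqCst))
        .expect("failed to install SIGINT handler");

    while !shutdown.load(Ordering::SeqCst) {
        // ... normal ingest and flush work would happen here ...
        thread::sleep(Duration::from_millis(100));
    }

    println!("SIGINT received: flushing backends, then exiting");
    // flush_all_sinks();  // hypothetical final flush before exit
}
```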

Introduce emission Quality of Service notion

At the suggestion of @pulltab we should introduce a quality-of-service notion to our metric types. The idea here being that backends can choose what rate to flush certain types. By default, we'll maintain flush-rate compatibility with existing statsd implementations.

This is intended to allow us to reduce the burden on our backend reporters.

Support a configuration file

Cernan has grown a plethora of CLI flags. This makes it difficult to add arbitrary files to track.

Cernan should accept a configuration file. We must keep the current CLI flags around in the meantime.

cernan must accept a DNS name for all hosts

Right now cernan does not do a DNS lookup to resolve non-IP hosts into IP addresses. This significantly limits the fun things we can get up to.

For instance, we ought to do more than crash on wavefront-proxy.us-west-1.postmates.com.
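
The standard library will already resolve a DNS name when asked for socket addresses, so cernan can pick a usable address instead of crashing on a non-IP host. A minimal sketch; the host and port shown are illustrative:

```rust
use std::net::ToSocketAddrs;

fn main() {
    // `to_socket_addrs` performs the DNS lookup for a "host:port" string.
    let host = "wavefront-proxy.us-west-1.postmates.com:2878";

    match host.to_socket_addrs() {
        Ok(mut addrs) => match addrs.next() {
            Some(addr) => println!("resolved to {}", addr),
            None => eprintln!("no addresses returned for {}", host),
        },
        Err(e) => eprintln!("resolution failed for {}: {}", host, e),
    }
}
```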

Support arbitrary tags on CLI

The ambition is to parse --tags source=foo,service=bar,metadata=blerg and emit appropriately to the correct emission sites.

As indicated, --metric-source will be folded into this.
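
A minimal sketch of the parsing, using only std; skipping malformed pairs is a choice made for illustration, and cernan might prefer to reject them instead:

```rust
use std::collections::HashMap;

/// Parse a `--tags` value of the form `key=value,key=value,...` into a map.
fn parse_tags(raw: &str) -> HashMap<String, String> {
    raw.split(',')
        .filter_map(|pair| {
            let mut kv = pair.splitn(2, '=');
            match (kv.next(), kv.next()) {
                (Some(k), Some(v)) => Some((k.to_string(), v.to_string())),
                _ => None,
            }
        })
        .collect()
}

fn main() {
    let tags = parse_tags("source=foo,service=bar,metadata=blerg");
    println!("{:?}", tags);
}
```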

Must support hyphen prefix metric name hack

In existing postmates systems there is a need to parse metrics with names like

 source.machine-metric.name

as being a metric named 'metric.name' with source 'source.machine'. The etsy/statsd backend has been modified to take a regex to fiddle with this, but I've hard-coded it in the expectation that future backends will be programmable by the end-user.
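
A minimal sketch of the hard-coded behavior described above; splitting on the first hyphen is an assumption about the rule, and the fallback source is made up:

```rust
/// Split a `source.machine-metric.name` style metric on the first hyphen,
/// yielding (source, metric). Falls back to a placeholder source when no
/// hyphen is present.
fn split_hyphen_prefix(raw: &str) -> (String, String) {
    match raw.find('-') {
        Some(idx) => (raw[..idx].to_string(), raw[idx + 1..].to_string()),
        None => ("unknown".to_string(), raw.to_string()),
    }
}

fn main() {
    let (source, name) = split_hyphen_prefix("source.machine-metric.name");
    println!("source = {}, metric = {}", source, name);
}
```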

Run integration and unit tests under Jenkins

We don't have a license for TravisCI. We should still have feedback on all PRs, though. What we must have is:

  • unit test runs under all channels, stable/beta/nightly
  • integration test runs under all channels, stable/beta/nightly

Feedback and badges etc would be a bonus.

This issue is an elaboration of #35

Reset counters per flush interval

Presently cernan implements its counters as continuously increasing sums of points. What it should do instead is reset the counts on each flush.

This is related to #61
