
ramen's Introduction

What?

A stream processing language and compiler for human-scale infrastructure monitoring

"The right solution for 100X is often not optimal for X" — Jeff Dean

Why?

In recent years, thanks to large companies such as Google, Facebook, LinkedIn and Netflix, the culture and practice of modern infrastructure monitoring has vastly improved, and many good, free tools have been released publicly. Understandably, those tools focus on large distributed infrastructures.

For smaller use cases though, tools have been left where they were in the 90s, with the notable exception of Riemann. But Riemann only monitors hosts and uses Clojure as a configuration language, which in turn requires a resource-hungry JVM.

If you need an all-purpose stream processor to manipulate time series, turning inputs from sensors or network probes into alerts, but do not want to deploy Kubernetes in your three racks of hardware, or have only a couple of GiB of RAM left for monitoring, then you might want to consider Ramen.

How?

This is what an operation looks like:

DEFINE memory_alert AS
  FROM memory
  SELECT
    time, host,
    free + used + cached + buffered + slab AS total,
    (total - free) * 100 / total AS used_ratio,
    used_ratio > 50 AS firing
  GROUP BY host
  COMMIT AND KEEP ALL WHEN COALESCE (out.firing <> previous.firing, false)
  NOTIFY "http://192.168.1.1/notify?title=RAM%20is%20low%20on%20${host}&time=${time}&text=Memory%20on%20${host}%20is%20filled%20up%20to%20${used_ratio}%25";

Currently the stream processing programs are compiled into a language with automatic memory management (OCaml), so performance is not optimal. The plan is to compile down to C (or similar) in a later step.

Also, imports and exports are limited: Ramen currently accepts time series from CSV files and understands the collectd and netflow (v5) protocols. As output, it merely reaches out to alerting systems via HTTP requests.

Other than that, it is possible to “tail” the output of operations from the CLI. More protocols, for both input and output, obviously need to be added.

ramen's People

Contributors

axiles, darlentar, elrik75, rixed


ramen's Issues

Preserve event time information as long as possible

If all parents have the same event-time definition, all the fields used in that definition are also present in the child, and no specific event description is given, then propagate it.

We do not want to propagate the export flag though, so this needs to be separated from the event time description.

In.next and group.next tuples

in.next and group.next tuples would come in very handy for some non-trivial commit clauses.

If in.next is used, we would wait for that tuple before processing the input tuple.
If group.next is used, we would store the in tuple in the aggregate without further processing and wait for the next one before taking action; we would then process the last stored tuple as if it were the new one.

Depends on #360
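The group.next behaviour above amounts to a one-tuple lookahead buffer per group. A minimal sketch of that idea (illustrative Python, not Ramen's actual machinery — the `lookahead_by_group` name and signature are invented):

```python
# Hypothetical sketch of group.next semantics: each group buffers its latest
# tuple and only processes it once its successor in the same group is known.

def lookahead_by_group(tuples, key, process):
    """For each group (selected by `key`), call process(current, next) on a
    tuple only when its successor in the same group has arrived. The last
    tuple of each group stays pending, as it never sees a successor."""
    pending = {}   # group key -> last stored tuple
    results = []
    for t in tuples:
        k = key(t)
        if k in pending:
            results.append(process(pending[k], t))  # previous tuple, with lookahead
        pending[k] = t                              # wait for this tuple's successor
    return results
```

For instance, computing per-host deltas needs the next tuple of the same group, which is exactly what the buffer provides.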

default_team is broken

Instead, either broadcast the alert to all teams, or force an editable "default" flag on one team?
Broadcasting seems to invite all kinds of problems, as it adds corner cases to the data model.

Implement parameters

Many operations differ only by some constant parameters. If we generated code that obtained those values from the environment rather than hardcoding them, we could drastically reduce the number of distinct binaries in many cases.

Possible approaches:

  1. Mark such parameters in the operation text, for instance: WHERE foo=$p FOR PARAMETER p=42

  2. Define template operations (or functions) in a specific, distinct phase: DEFINE FOO_FILTER(p) AS WHERE foo=$p and then set the operation text as FOO_FILTER(42)

  3. More radically, make every immediate value a parameter.

This latter approach is appealing since it does not involve the user at all and might reuse binaries that would not otherwise have been noticed as reusable. On the other hand, it can hurt performance, as it is no longer possible to optimise for known constants.
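Approach 1 could generate code along these lines (a sketch in Python for readability; the `RAMEN_PARAM_` environment-variable convention and `get_param` helper are invented for illustration):

```python
import os

# Sketch of approach 1: the generated worker reads its constant parameters
# from the environment instead of hardcoding them, so one compiled binary can
# be reused with different settings.

def get_param(name, default, parse=int):
    """Fetch parameter `name` from the environment, falling back to the
    default given in the operation text (e.g. FOR PARAMETER p=42)."""
    raw = os.environ.get("RAMEN_PARAM_" + name)
    return default if raw is None else parse(raw)

# WHERE foo = $p FOR PARAMETER p=42 would then compile to roughly:
p = get_param("p", 42)
where = lambda tup: tup["foo"] == p
```

The supervisor would then set the environment when forking the worker, instead of compiling a fresh binary per parameter value.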

Export on demand

  1. separate export flag from event time info
  2. have export flag be changeable without recompilation (easy since we just have to edit the out_ref file)
  3. ramen should then turn it on/off when timeseries are requested, and timeout after a while
  4. a keyword to explicitly ask for a node result to be saved?

Lock out_ref files when altering them

We should not currently be modifying them simultaneously, but better safe than sorry.
Taking the lock is also a good way to check that assertion.

Also, we will need this later for a fork-based implementation anyway.
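Advisory locking with flock(2) is one simple way to do this; a minimal sketch (illustrative Python, assuming a POSIX system — the `locked` helper name is invented):

```python
import fcntl
from contextlib import contextmanager

# Sketch of advisory locking around out_ref edits using flock(2). Holding the
# exclusive lock while editing also checks, in practice, the assertion that
# nobody else edits the file concurrently.

@contextmanager
def locked(path):
    with open(path, "a+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # blocks until we own the file
        try:
            yield f
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Every reader and writer of the out_ref file would go through this guard, so concurrent edits serialize instead of corrupting the file.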

More reliable NOTIFY/EXEC

Repeat until an acknowledgment (or a 0 exit status) is received.
Repeated notifications must be stored somewhere other than the ringbuf and persisted on disk.

Also solves #127 :

  1. Each notification must have an identifier and a firing bool. We could add these to the generic notification and consider the others as fire-and-forget; the identifier would be the name, and we must add the firing boolean to the notify_cmd.
  2. When a firing notification is received, it is timestamped (with both a receive time and a schedule time) and saved in a heap (ordered by schedule time), which is immediately serialised on disk;
  3. When a non-firing notification is received, look in the heap for this notification (which requires every notification to have an identifier: the keyword should provide it, as a string, along with the firing bool) and cancel it. If not found, just ignore it;
  4. While waiting for new notifications, check the top of the heap for a notification to schedule (starting from the top, look for the first one not already sending);
  5. When eventually sending the heap top, leave the notification on the heap but flag it as sending. On success, remove it and save the heap; on failure, reschedule it and save the heap;
  6. At start up, read the heap from disk and reset all sending flags.
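The steps above can be sketched as follows (an illustrative Python model, not the actual notifier; persistence is reduced to a `save` callback marking where the heap would be serialised to disk, and the 30-second retry delay is an assumed value):

```python
import heapq
import time

class NotifScheduler:
    """Retry heap for notifications: firing notifs are scheduled and retried
    until acknowledged; non-firing notifs cancel their firing counterpart."""

    def __init__(self, save=lambda heap: None):
        self.heap = []     # (schedule_time, ident), ordered by schedule time
        self.firing = {}   # ident -> payload, while the notif is outstanding
        self.save = save   # called wherever the heap must hit the disk

    def receive(self, ident, firing, payload=None, now=None):
        now = time.time() if now is None else now
        if firing:
            self.firing[ident] = payload
            heapq.heappush(self.heap, (now, ident))   # schedule immediately
        else:
            self.firing.pop(ident, None)              # cancel; ignore if unknown
        self.save(self.heap)

    def run_due(self, send, now=None):
        """Send every due notification; reschedule on failure."""
        now = time.time() if now is None else now
        while self.heap and self.heap[0][0] <= now:
            _, ident = heapq.heappop(self.heap)
            if ident not in self.firing:
                continue                              # was cancelled meanwhile
            if send(ident, self.firing[ident]):
                del self.firing[ident]                # acknowledged
            else:
                heapq.heappush(self.heap, (now + 30.0, ident))  # retry later
            self.save(self.heap)
```

Restarting then amounts to reloading `heap` and `firing` from disk and clearing any sending flags, as in step 6.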

Several global states?

The recent WHERE patch has shown that we need a distinct global state for the WHERE clause.
Maybe we could generalize and have one global state for all input, one for all selected, one for all unselected, and one per group? The clause would tell which default to use.

A JOIN clause

Given a column that is supposed to be synchronized across N input streams, merge-sort them into a single output (using the select clause to construct the output tuple).
For each next value of the synchronized column, build the output using the last input tuple of each stream, then replace the input tuple of one stream to reach the next smallest synchronized value (or advance several input streams if the same sync value is present in several of them).

This should be an extension to the normal Group-By operation as opposed to a new kind of operation.
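The merge-sort step is the classic k-way merge; a sketch under the assumption that each stream is already sorted on the synchronized column (illustrative Python — a real implementation would feed the select clause instead of yielding raw tuples):

```python
import heapq

# Sketch of the JOIN merge step: N pre-sorted input streams are merged into a
# single stream ordered on the synchronized column. heapq.merge implements
# exactly the "advance the stream holding the smallest sync value" logic.

def join_streams(streams, sync_key):
    """Merge streams pre-sorted on sync_key into one sorted output."""
    return list(heapq.merge(*streams, key=sync_key))
```

Duplicated sync values across streams simply come out adjacent, matching the "advance several input streams" case above.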

Graphite sink

A single listener on port 2003 must be able to fill several tables.
We could restrict this to one table per metric (with additional fields from the tags, as factors).
The function name would be the metric path minus its last few components, and the main field name would be those last components. How many components to chop off the metric path is unclear, though, and should be part of the schema.
As tags are not necessarily present on every message, the schema will have to be declared by the workers wishing to convert some graphite metrics into a tuple stream anyway.

So:

  • a new command ramen graphite that starts a TCP forking server and a UDP listener on the given ports;
  • collected graphite metrics should be enqueued without further ado into a #graphite ringbuf;
  • then it should be possible to LISTEN FOR GRAPHITE $metric_prefix (...schema...) that would perform a pivot.

Alternatively, we do the same as what we do for collectd, using a simple tuple type of reception time, sender, metric name, tags (as a list), value and event time.
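Decoding one line of the Graphite plaintext protocol into the simple tuple shape suggested above could look like this (a sketch; tag support via `;tag=value` suffixes exists since Graphite 1.1, and the output shape here is the one proposed in the last paragraph, not an established format):

```python
# Parse one Graphite plaintext line, "<metric.path>[;tag=value...] <value>
# <timestamp>", into (metric name, tags, value, event time).

def parse_graphite_line(line):
    path, value, timestamp = line.strip().split()
    name, *raw_tags = path.split(";")          # tags are appended to the path
    tags = dict(t.split("=", 1) for t in raw_tags)
    return name, tags, float(value), float(timestamp)
```

The listener would add reception time and sender address before enqueueing the tuple into the #graphite ringbuf.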

A time type (alias for float) parsed/printed with time units

Would also help declaring start/stop/durations, as we could annotate a field with what type of time it represents (event start / event stop / event duration).

It is not clear whether we want instead some semantics attached to the type, which could also be used to parse/print data volume units, for instance, without requiring a whole new type.
That would probably depend on how lightweight these new types can be made and how convenient they are to use as types.
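Parsing such a time literal into the underlying float of seconds is straightforward; a sketch (illustrative Python, with an assumed unit table of s/m/h/d — the actual units Ramen would accept are not specified here):

```python
# Sketch of parsing a duration literal with a unit suffix into the underlying
# float of seconds, as the proposed time alias would do.

_UNITS = {"s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}

def parse_duration(text):
    """Parse e.g. "90s" or "1.5h"; a bare number is taken as seconds."""
    text = text.strip()
    if text and text[-1] in _UNITS:
        return float(text[:-1]) * _UNITS[text[-1]]
    return float(text)
```

Printing would run the table in reverse, picking the largest unit that divides the value cleanly.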

API and visualizer for tops

Timeseries selects the tuple range using time only.
Here we need, in addition to the time range, to select a single batch of TOP output.
So we need either a way to recognize such a batch, or a key that would be used to retrieve only the tuples from this time range having a single value for that key (so that we could reuse it for other things than tops).

Then a dedicated visualizer.

Timeline type of queries

and visualizer.

With this, and once #21 is in place, then we could render incident chronology over any graph.

A SORT clause

It might be more useful to sort the input rather than the output. So have one optional sorting clause per input stream?

Sharded group-by

Idea: keep only one of N shards in RAM (both the aggregates and the tuple queue) and the other N-1 on disk, along with N tuple queues.
When a tuple for another shard is received, it is batch-queued on disk.
When we rotate the shards, we load the queue and process it while newly received tuples for that shard are enqueued, so that we still process them in order.
On-disk queues could be growable ring buffers, like the ones we could also use for node history.

Requires getting rid of serialization and working on mmapped files for the groups.
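The routing and rotation logic could be sketched like this (illustrative Python; the on-disk queues are plain lists here, standing in for the growable ring buffers, and the class name is invented):

```python
# Sketch of sharded group-by: one shard of N is resident in RAM; tuples for
# the other shards are batch-queued until their shard is rotated in, so each
# shard still processes its tuples in arrival order.

class ShardedGroupBy:
    def __init__(self, n_shards, resident=0):
        self.n = n_shards
        self.resident = resident          # the single shard kept in RAM
        self.groups = {}                  # in-RAM aggregates for that shard
        self.disk_queues = {i: [] for i in range(n_shards) if i != resident}

    def shard_of(self, key):
        return hash(key) % self.n

    def input(self, key, tuple_):
        s = self.shard_of(key)
        if s == self.resident:
            self.groups.setdefault(key, []).append(tuple_)   # aggregate now
        else:
            self.disk_queues[s].append((key, tuple_))        # replay at rotation

    def rotate_to(self, shard):
        """Swap the resident shard, replaying its queued tuples in order."""
        self.disk_queues[self.resident] = []   # old resident goes back to disk
        self.groups = {}
        self.resident = shard
        for key, t in self.disk_queues.pop(shard):
            self.groups.setdefault(key, []).append(t)
```

A real implementation would flush the outgoing shard's aggregates to its mmapped file instead of discarding them, as noted above.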

Saving of the last error message in the node does not work any more

Since we fail out of the compilation, we do not save the resulting graph, and therefore the last error is not recorded in the node.

An alternative design (which would still require a rw-lock to protect the file) would be for the conf to be a persistent data structure: compile would return a new one with either the new graph or the new error messages. But then we would have to catch exceptions when calling compile and return a conf carrying the error rather than an HTTP error code.

Get rid of LWT somehow

To de-uglify the code and avoid mixing LWT with exceptions.

We need to use actual threads instead, assuming we do not need to parallelise anything that blocks.

  • compile with threads
  • update RamenRWLock
  • update RamenOutRef
  • update RingBufLib
  • everything else should follow rather easily
  • regarding cohttp server, either wrap it, or replace it with a custom http server (à la csview), or replace it with a CGI interface and a lightweight external http server
  • regarding cohttp client in notifier, either wrap it or replace it with ocamlnet or exec shell.

async threads to be replaced:

  • watchdogs -> another thread, or a small filesystems monitored by the supervisor (touching a file per worker, where suspending the watchdog is replaced by deleting that file);
  • workers reporting stats every X seconds -> easier from a dedicated thread;
  • workers merging several ringbuffers with foreground wait times -> still need a background thread;
  • wait_all_pids thread should not be needed any longer -> easier without threads;
  • ramen tail resetting export timeouts periodically -> no real need for a thread;
  • notifier scheduler -> still better with a separate thread;
  • notifier asynchronously sending notifications -> anything IO is easier with LWTs;
  • tester threads in RamenTest to be replaced by actual posix threads.

If we start with the workers, we need to focus on watchdogs, reports, and merge.
Watchdogs are problematic as they are used both by workers and by ramen CLI tools such as supervisor and notifier. We could have both a mutex-based and an LWT version of the watchdogs?

Propagate null/not-null knowledge down the AST

In a CASE or in OR/AND operations, if an operand is a nullability test then we should propagate the knowledge that something is null or not down the AST. We could then better type operations on that operand, especially if it is known to be not null.

For instance:

if (a is not null && b is not null) then remember a || remember b

currently the state of the two remember operations will be an optional bloom filter, and the code will match against Some a and Some b; but all this is useless since we know that a and b will, here, always be non-null.

What is needed is a special operation from nullable to not-nullable (akin to Option.get) that would be automatically added around all usages of a and b in the branches, before the actual typing starts.
Given this operator, this is a pure rewrite operation.
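The rewrite could be sketched on a toy AST like this (illustrative Python; the tuple-based node shapes, the "unsafe" wrapper name and the helper functions are all invented for the example, not Ramen's actual AST):

```python
# Toy sketch: inside a branch guarded by "x IS NOT NULL", wrap every use of x
# in an ("unsafe", ...) node (akin to Option.get) so the typer can treat it as
# non-nullable. Nodes are tuples: ("var", name), ("if", cond, then, else), etc.

def not_null_tests(cond):
    """Collect variables tested non-null in a conjunction of tests."""
    if cond[0] == "is_not_null" and cond[1][0] == "var":
        return {cond[1][1]}
    if cond[0] == "and":
        return not_null_tests(cond[1]) | not_null_tests(cond[2])
    return set()

def rewrite(expr, known_not_null):
    kind = expr[0]
    if kind == "var":
        return ("unsafe", expr) if expr[1] in known_not_null else expr
    if kind == "if":                     # ("if", cond, then, else)
        _, cond, then_, else_ = expr
        extra = not_null_tests(cond)     # facts proven in the then-branch only
        return ("if", rewrite(cond, known_not_null),
                rewrite(then_, known_not_null | extra),
                rewrite(else_, known_not_null))
    # any other node: recurse into tuple-shaped children, keep leaves as-is
    return (kind,) + tuple(rewrite(a, known_not_null) if isinstance(a, tuple)
                           else a for a in expr[1:])
```

On the remember example above, both uses of a and b in the then-branch get wrapped, so their bloom-filter states can be typed non-optional.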

Allow a FROM clause to mention nodes with different formats

As long as the parent nodes output the fields that this node uses.

We could cram all output tuples in the child input ringbuf, provided:

  1. There is a prefix on each serialized tuple saying where it comes from;
  2. There are as many read_tuple functions as there are parents;
  3. All those read_tuple functions output the same in_tuple made of only the fields that are
    mentioned (in the order they are mentioned).

Alternatively, we could have several input ringbufs so no prefix required.

Alternatively, the parent could strip down its output to match a child's input. This might not be as stupid as it sounds, since we already update its out_ref; we could add a format spec in that file (a bit mask of the fields), and then, provided fields are ordered by name in encoded tuples, the single write-out-tuple function could just skip unwanted fields. This would come in handy for exporting data, as we could enable it on a per-field basis.
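The field-mask idea in the last alternative amounts to the following projection (a sketch in Python; the dict-based tuple representation and `project` helper are illustrative, not the actual ringbuf encoding):

```python
# Sketch of the per-child field mask: fields are ordered by name in the
# encoded tuple, and bit i of the mask keeps the i-th field in that order.

def project(out_tuple, field_mask):
    """out_tuple: field name -> value; returns only the masked-in fields."""
    names = sorted(out_tuple)            # fields ordered by name, as proposed
    return {name: out_tuple[name]
            for i, name in enumerate(names)
            if field_mask & (1 << i)}
```

The writer would apply the child's mask from out_ref while serializing, skipping unwanted fields instead of building this intermediate dict.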
