heka's Introduction

This project is deprecated. Please see this email for more details.

Heka

Data Acquisition and Processing Made Easy

Heka is a tool for collecting and collating data from a number of different sources, performing "in-flight" processing of collected data, and delivering the results to any number of destinations for further analysis.

Heka is written in Go, but Heka plugins can be written in either Go or Lua. The easiest way to compile Heka is by sourcing (see below) the build script in the root directory of the project, which will set up a Go environment, verify the prerequisites, and install all required dependencies. The build process also provides a mechanism for easily integrating external plug-in packages into the generated hekad. For more details and additional installation options see Installing.

WARNING: YOU MUST SOURCE THE BUILD SCRIPT (i.e. source build.sh) TO BUILD HEKA. Setting up the Go build environment requires changes to the shell environment; if you simply execute the script (i.e. ./build.sh) these changes will not be made.


heka's Issues

Swap statsd metric structures during flush interval

Currently the StatAccumInput stops counting while it performs a flush. An alternate approach would be to have a separate goroutine listening on a channel: the current structure is sent over to it and flushed, while the main goroutine continues counting in a new structure.

For statsd, something will need to be done to ensure that bucket names aren't 'missed' across the swap in case no metrics come in for a given bucket name.
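
A rough sketch of the swap approach; the statAccum type, channel names, and pre-seeding of bucket names are illustrative assumptions, not Heka's actual API:

    package sketch

    import "time"

    type statAccum struct {
        counters  map[string]int      // current interval's counts
        statChan  chan string         // incoming stat bucket names
        flushChan chan map[string]int // consumed by a flusher goroutine
    }

    func (s *statAccum) run(interval time.Duration) {
        ticker := time.NewTicker(interval)
        for {
            select {
            case bucket := <-s.statChan:
                s.counters[bucket]++
            case <-ticker.C:
                old := s.counters
                // Pre-seed known bucket names so they aren't "missed"
                // if no metrics arrive for them next interval.
                fresh := make(map[string]int, len(old))
                for name := range old {
                    fresh[name] = 0
                }
                s.counters = fresh
                s.flushChan <- old // flush happens elsewhere; counting continues
            }
        }
    }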

Extend flood client for more thorough testing

The heka flood client should generate messages of multiple types and sizes, malformed messages, and larger messages to ensure hekad acts properly and doesn't have performance issues with packets more likely to actually be seen in the wild.

use qualified imports so go get tooling can work ootb

Ideally, to install/fetch heka, all one should need to do is:

$ go get -v github.com/mozilla-services/heka/...

Unfortunately heka doesn't use fully qualified internal import paths, so this can't work:

import "heka/agent": import path doesn't contain a hostname
package heka/agent: unrecognized import path "heka/agent"
import "heka/message": import path doesn't contain a hostname
package heka/message: unrecognized import path "heka/message"
import "heka/client": import path doesn't contain a hostname
package heka/client: unrecognized import path "heka/client"
import "heka/pipeline": import path doesn't contain a hostname
package heka/pipeline: unrecognized import path "heka/pipeline"
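
The fix is mechanical: every internal import gets the full repository prefix, so the go tooling can resolve it. A minimal example using one of the paths from the errors above:

    // Before: unqualified internal import, which `go get` cannot resolve.
    import "heka/pipeline"

    // After: fully qualified import path.
    import "github.com/mozilla-services/heka/pipeline"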

Use pool of decoders rather than creating new ones that must be closed

The decoderManager is currently creating decoders for inputs on demand. When an input is finished, it must close the decoder channels. The manager then holds on to the closed decoders for possible reuse. Closing the decoder is an unfortunate requirement to place on input developers, and the locking involved in tracking the running and closed decoders is less than ideal. A round-robin pool of decoder sets that are always running is a better choice.
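
A minimal sketch of the round-robin idea; the decoderSet shape and pool mechanics are assumptions, not the current decoderManager API:

    package sketch

    import "sync/atomic"

    type decoderSet struct {
        inChan chan []byte // raw bytes for this always-running decoder set
    }

    type decoderPool struct {
        sets []*decoderSet
        next uint32
    }

    // Get hands out decoder sets round-robin. No close bookkeeping or
    // locking is needed because the decoders never shut down.
    func (p *decoderPool) Get() *decoderSet {
        n := atomic.AddUint32(&p.next, 1)
        return p.sets[int(n)%len(p.sets)]
    }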

AWS images

One web server + heka agent image, and one heka aggregator image.

Queue self-monitoring and reporting

A plugin that allows heka to capture information on the channels and plugins in use and report it (likely as a new message injected via the MGI).

Dynamic reloading of sandbox filter source code

Sandboxed filter plugins need to be able to reload source code at runtime. Initially this can be a SIGHUP triggering file system source code reload. After a secure control message mechanism has been developed this can be triggered by control messages containing the new source code as part of the payload.

Message delivery guarantees

The ability to specify durability for inputs, where messages they receive are written to something such as an append-only file, so that acknowledged messages are as durable as possible.
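
A sketch of the append-only idea; the journal type, file mode, and sync-per-message policy are illustrative choices, not a settled design:

    package sketch

    import "os"

    type journal struct {
        f *os.File
    }

    func openJournal(path string) (*journal, error) {
        f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
        if err != nil {
            return nil, err
        }
        return &journal{f: f}, nil
    }

    // Append writes the encoded message and fsyncs before the input
    // acknowledges it, so an acknowledged message survives a crash.
    func (j *journal) Append(msg []byte) error {
        if _, err := j.f.Write(msg); err != nil {
            return err
        }
        return j.f.Sync()
    }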

Investigate "recycle" pattern in outputs that use a channel to talk to a shared sender

Output plugins are created separately for each pipeline pack, but many of them share a single resource that actually handles the real output of writing to a file or sending to a connection, since it may be impossible or inefficient for there to be poolsize different entities actually performing the final output. Typically this is handled by spinning up an output goroutine which listens on a data channel. Each output plugin puts the output data on this channel, the goroutine takes it and sends it along. This can be seen very clearly in the FileOutput, which passes the output byte slice to a FileWriter which actually talks to the file system.

This works well, but there are inefficiencies because once a byte slice has been written by the FileWriter it is no longer needed, and the byte slice and its underlying array must be garbage collected. We could probably improve performance quite a bit by extending the "recycleChan" pattern that we use for the pipeline packs. After using a byte slice, the output goroutine would put it in a recycle channel from which the outputs would pull; no need to create new ones or GC the old ones.

As this would be used in many different outputs, it'd be great to generalize the pattern a bit to minimize boilerplate.
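
A sketch of what the generalized pattern could look like; byteRecycler and its helpers are hypothetical names:

    package sketch

    type byteRecycler struct {
        recycleChan chan []byte
    }

    // Get reuses a recycled slice when one is available, otherwise
    // allocates a fresh one with the given capacity hint.
    func (r *byteRecycler) Get(capHint int) []byte {
        select {
        case b := <-r.recycleChan:
            return b[:0]
        default:
            return make([]byte, 0, capHint)
        }
    }

    // Put returns a slice to the pool once the output goroutine is done
    // with it, or lets it be garbage collected if the pool is full.
    func (r *byteRecycler) Put(b []byte) {
        select {
        case r.recycleChan <- b:
        default:
        }
    }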

Cassandra back-end

Support Cassandra as a back-end for messages, as well as for statsd-type metrics (time series).

MGI: Prevent infinite loops

MGI should perform additional configurable checks on how many times a message has passed through, and dump/error on messages that have looped beyond the configured limit, to prevent infinite message looping.
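
A sketch of the check, assuming a hypothetical hop-count field on the message (no such field exists yet):

    package sketch

    import "errors"

    // msg is a stand-in for a Heka message; hopCount is a hypothetical
    // field tracking how many times the MGI has re-injected it.
    type msg struct {
        hopCount int
        payload  string
    }

    var errLooping = errors.New("message exceeded max hop count, dropping")

    // inject bumps the hop count and refuses messages that have cycled
    // through the MGI more than the configured maxHops times.
    func inject(m *msg, maxHops int) error {
        m.hopCount++
        if m.hopCount > maxHops {
            return errLooping
        }
        // ... hand m to the pipeline as usual ...
        return nil
    }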

Proposal: TextParser Decoder

TextParser Decoder

A TextParser Decoder that will handle extracting information from the payload or in other ways manipulating a Message.

Functionality

  • Trim / Replace content of the payload
  • Parse the payload using a regular expression into variables
  • Set/Manipulate Message fields based on regex named variables
  • Define severity mappings (i.e., turn 'MAJOR' into Severity 1)
  • Define Timestamp string layout

Naming a regular expression field will cause it to be appropriately coerced into the field required. For example, a named group 'Severity' will be parsed to an int.

Payload Parsing

Payloads can be parsed into a set of custom names using a regular expression. Go uses the re2 syntax for regular expressions.

re2 syntax: http://code.google.com/p/re2/wiki/Syntax
re2 match tester: http://regoio.herokuapp.com/

These custom names may be used when setting Message fields by prefixing a name with @. For example, if the regular expression captured some text as reporter, then setting a field to Reported by @reporter would substitute the captured text when it is set.

If matching a payload, the regular expression should be set as the payloadMatch key.

If no named group was captured, an error will be logged, and the message will be dropped.
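
A sketch of the named-group extraction and coercion in Go; the pattern and payload here are illustrative only:

    package main

    import (
        "fmt"
        "regexp"
        "strconv"
    )

    func main() {
        re := regexp.MustCompile(`(?P<Severity>\d+) reported by (?P<reporter>\w+)`)
        payload := "3 reported by alice"

        m := re.FindStringSubmatch(payload)
        if m == nil {
            // No named group captured: log an error, drop the message.
            fmt.Println("no match, dropping message")
            return
        }
        caps := make(map[string]string)
        for i, name := range re.SubexpNames() {
            if i > 0 && name != "" {
                caps[name] = m[i]
            }
        }
        // A group named after a Message field is coerced to that field's
        // type; e.g. Severity is parsed to an int.
        sev, _ := strconv.Atoi(caps["Severity"])
        fmt.Println(sev, caps["reporter"]) // 3 alice
    }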

Field Names

A named group in the regular expression matching one of the following names will automatically be mapped to the corresponding Message field:

  • Uuid
  • Timestamp*
  • Type
  • Logger
  • Severity*
  • Payload
  • EnvVersion
  • Pid
  • Hostname

Fields marked with an asterisk may require configured coercion rules to manipulate the string into the appropriate type.

Coercion Rules

Severity is recorded in the message as an int. It is possible that you will need to define a mapping to turn the captured severity from the payload into an int. This should be supplied as a dict under the severityMap key.

Timestamp will need to be coerced into the proper internal representation of a timestamp. If a timestamp is captured from the payload, then a Layout indicating the time format must be supplied as the timestampLayout key.

For layout formatting, see: http://golang.org/pkg/time/#Parse
Example layouts: http://golang.org/pkg/time/#pkg-constants
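
For example, a minimal demonstration of coercing a captured timestamp string with an assumed layout:

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        // The layout spells out Go's reference time (Mon Jan 2 15:04:05
        // MST 2006) in the format the payload uses; both values here
        // are illustrative.
        layout := "2006-01-02 15:04:05"
        t, err := time.Parse(layout, "2013-01-15 08:30:00")
        if err != nil {
            fmt.Println("bad timestamp:", err)
            return
        }
        fmt.Println(t.UnixNano()) // nanoseconds since epoch
    }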

Setting Fields

Message fields may be set/reset by defining a dict under the messageFields key that names each Message field to set (full list above under Field Names), and a string/int indicating what to set the field to. Setting a field to None will clear the contents of the field.

Any field that is not one of the basic Message fields above will be set as a separate custom key under Message.Fields.
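
Pulling the proposed keys together, a hypothetical decoder config section might look like the following (nothing here is implemented; names follow the proposal above):

    "decoders": [
        {
            "type": "TextParserDecoder",
            "payloadMatch": "(?P<Severity>\\w+) (?P<reporter>\\w+): (?P<detail>.*)",
            "severityMap": {"MAJOR": 1, "MINOR": 5},
            "timestampLayout": "2006-01-02 15:04:05",
            "messageFields": {
                "Type": "parsed",
                "Payload": "Reported by @reporter: @detail"
            }
        }
    ],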

Fix incompatibility with latest Go

This Go commit broke the build process:
https://code.google.com/p/go/source/detail?r=030625a923ca1582471acdee67fbd9260728eda6

Our build should be updated so that we can build Heka against the Go 1.1 development builds.

Error:

# github.com/mozilla-services/heka/hekad
/usr/bin/ld: cannot find -lsandbox
collect2: ld returned 1 exit status
code.google.com/p/goprotobuf/proto.statictmp_1699: /home/bbangert/heka-build/build/go/pkg/tool/linux_amd64/6l: running gcc failed: unsuccessful exit status 0x100
make: *** [bin/hekad] Error 2

Better handling of pool and channel sizes

Currently the heka pool size defaults to 1000, but we've seen that performance seems better with much lower numbers, below 100. Also, PIPECHAN_BUFSIZE is a hard-coded constant of 500. These numbers should be changed to more sensible defaults and made tweakable as needed for different use cases.

Update / overhaul DashboardOutput web UI

We've got a rudimentary web UI made available by the DashboardOutput. This contains a lot of valuable information, but the presentation is currently very clunky, and is definitely not scaling well as we add more info to the standard Heka report. This deserves to be revisited to be made more readable and flexible.

Server side check of sending host to prevent client host spoofing

Heka depends on the client to specify the hostname value in Heka messages, which leaves us vulnerable to host spoofing from an untrusted or compromised client. We need to be able to tell Heka to check that message hostnames match those of the hosts making the connection, accounting for the possibility of messages being relayed through multiple Heka instances.
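
A sketch of a per-connection check for the simple, non-relayed case; handling messages relayed through intermediate Heka instances would need more than this:

    package sketch

    import (
        "fmt"
        "net"
    )

    // verifyHostname checks that the hostname claimed in a message
    // resolves to the peer's address. Relayed messages are not handled.
    func verifyHostname(conn net.Conn, claimed string) (bool, error) {
        peerIP, _, err := net.SplitHostPort(conn.RemoteAddr().String())
        if err != nil {
            return false, err
        }
        addrs, err := net.LookupHost(claimed)
        if err != nil {
            return false, err
        }
        for _, a := range addrs {
            if a == peerIP {
                return true, nil
            }
        }
        return false, fmt.Errorf("hostname %q does not match peer %s", claimed, peerIP)
    }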

Allow prepending and/or appending to all chain definitions

The CounterOutput is a useful output that shows how many messages the heka pipeline is processing. Usually you'll want to be counting every single message, regardless of which chain that message goes through. As of now, the only way to achieve this is to explicitly add the CounterOutput to every chain that is defined. This is fine but slightly annoying, as this option is frequently turned on and off during development; it would be much nicer to have a way to specify that an output should be added to every chain. This isn't yet enough of a use case to justify adding the feature, but if we come across more cases where a filter or output should be added to every chain then it might be worthwhile.

One implementation idea is to support a chain_prepends and/or chain_appends section in the config. These sections would only allow filters and outputs keys, and the specified filters and outputs would be prepended or appended to the set of filters and outputs specified in the chain that is selected for each message, i.e. it'd just be shorthand for adding those filters and outputs to the beginning or end of each chain definition's list.
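
For example, a hypothetical config shape; the key names follow the proposal, and "counter" and "debuglog" are assumed plugin names:

    "chain_prepends": {
        "outputs": ["counter"]
    },
    "chain_appends": {
        "outputs": ["debuglog"]
    },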

Reliable message transport

A reliable transport should be available between the client <-> heka, and between heka <-> heka instances. Possibly ZeroMQ, or our own TCP-based protocol.

Proposal: Statsd Filter

Statsd Filter

A Statsd Filter that will match varying attributes of the message, and construct a metric compliant with statsd that will then be fed into the Statsd input.

Functionality

  • Match the message on any of its fields, either directly or via a regex
  • Utilize a matched portion as a value for a statsd timer/gauge/counter
  • Feed stats metric into the StatsdInput and update StatsdInput to accept internal feeding of this nature

Portions captured by a regular expression will be coerced into the appropriate type for the metric being produced.

Message Matching

A message may be matched based on any of its fields. For example, to match a message from a specific logfile that is an HTTP GET and generate a counter for it named after the website it's for, this config section would be appropriate (note that a leading ~ indicates a regular expression):

    "filters": [
        {
            "type": "StatFilter",
            "name": "HTTP Hits",
            "match": {
                "Logger": "~/var/log/apache/(?P<Sitename>.*)/access.log",
                "Payload": "~\"(?P<Method>\w+) [^\"]* (?P<Code>\d{3}) (?P<Bytes>\d+)"
            },
            "metrics": [
                {
                    "type": "Counter",
                    "name": "@Sitename.@Method.@Code",
                    "value": "1"
                },
                {
                    "type": "Counter",
                    "name": "@Sitename.@Bytes",
                    "value": "@Bytes"
                }
            ]
        }
    ],

Runtime panic when testing hekad w/ metlog-py's "mb" tool

Currently getting the following when I launch hekad w/ a simple grater.json config and then point the metlog-py "mb" tool at it (using mb 127.0.0.1 5565):

spire ~/go/heka $ ./bin/hekad -config=grater.json
2012/11/20 16:59:21 Starting hekad...
2012/11/20 16:59:21 Input started: udp
2012/11/20 16:59:26 Got 21264 messages. 21286.12 msg/sec
2012/11/20 16:59:27 Got 42739 messages. 21474.86 msg/sec
2012/11/20 16:59:28 Got 67234 messages. 24494.55 msg/sec
panic: runtime error: slice bounds out of range

goroutine 77308 [running]:
time.Parse(0x31d6da, 0x1, 0xc2068e6333, 0x1, 0xc2068e6320, ...)
/Users/rob/src/go/src/pkg/time/format.go:859 +0x1a22
heka/pipeline.(*JsonDecoder).Decode(0xc200800e78, 0xc201399e80, 0xc200092c70, 0xb, 0xc200103540, ...)
/Users/rob/go/heka/src/heka/pipeline/decoders.go:50 +0x14e
heka/pipeline.func·017(0xc2000996c8, 0xc200099040, 0x12ac0, 0xc201399e80, 0x0, ...)
/Users/rob/go/heka/src/heka/pipeline/runner.go:150 +0x1a3
created by heka/pipeline.func·015
/Users/rob/go/heka/src/heka/pipeline/inputs.go:72 +0x12f

goroutine 1 [chan receive]:
heka/pipeline.(*PipelineConfig).Run(0xc20009c120, 0x7fff5fbffc84)
/Users/rob/go/heka/src/heka/pipeline/runner.go:200 +0x58b
main.main()
/Users/rob/go/heka/src/heka/hekad/main.go:57 +0x3c2

goroutine 2 [syscall]:
created by runtime.main
/Users/rob/src/go/src/pkg/runtime/proc.c:225

goroutine 3 [syscall]:
os/signal.loop()
/Users/rob/src/go/src/pkg/os/signal/signal_unix.go:20 +0x1c
created by os/signal.init·1
/Users/rob/src/go/src/pkg/os/signal/signal_unix.go:26 +0x2f

goroutine 5 [syscall]:
syscall.Syscall6()
/Users/rob/src/go/src/pkg/syscall/asm_darwin_amd64.s:38 +0x5
syscall.kevent(0x8, 0x0, 0x0, 0xc200088a88, 0xa, ...)
/Users/rob/src/go/src/pkg/syscall/zsyscall_darwin_amd64.go:199 +0x86
syscall.Kevent(0x8, 0x0, 0x0, 0x0, 0xc200088a88, ...)
/Users/rob/src/go/src/pkg/syscall/syscall_bsd.go:551 +0x9b
net.(*pollster).WaitFD(0xc200088a80, 0xc2001032a0, 0x0, 0xffffffffffffffff, 0x0, ...)
/Users/rob/src/go/src/pkg/net/fd_darwin.go:96 +0x179
net.(*pollServer).Run(0xc2001032a0, 0x0)
/Users/rob/src/go/src/pkg/net/fd_unix.go:211 +0xef
created by net.newPollServer
/Users/rob/src/go/src/pkg/net/newpollserver_unix.go:33 +0x367

goroutine 6 [select]:
heka/pipeline.counterLoop()
/Users/rob/go/heka/src/heka/pipeline/outputs.go:85 +0x70a
created by heka/pipeline.InitCountChan
/Users/rob/go/heka/src/heka/pipeline/outputs.go:56 +0x46

goroutine 7 [runnable]:
created by addtimer
/Users/rob/src/go/src/pkg/runtime/ztime_darwin_amd64.c:73

goroutine 8 [runnable]:
heka/pipeline.func·015(0xc2040e91c8, 0xc2040e91d8, 0xc2040e91d0, 0xc2040e91e0, 0x0, ...)
/Users/rob/go/heka/src/heka/pipeline/inputs.go:58 +0x56
created by heka/pipeline.(*InputRunner).Start
/Users/rob/go/heka/src/heka/pipeline/inputs.go:77 +0xcd
spire ~/go/heka $

Dynamic configuration of plugins

Plugins should be configurable on the fly, perhaps using a signed/encrypted message that contains the new configuration or an update of configuration to an existing plugin.

Agent / Aggregator Discovery

heka agents need a way to discover aggregators to deliver messages to in a dynamically changing environment (such as AWS). This could possibly be done with nsq_lookup or Zookeeper.

Continuous Integration

Set up a continuous integration server, ideally one that can also track performance metrics, to avoid regressions and ensure complete test coverage.

Abstraction of "batch-and-back" pattern

Outputs that need to perform a task from a single point (like writing to a logfile, sending stats, etc.) have to either toggle between accepting messages and writing, or build their own two-channel switch system. This abstraction would reduce the repetition in plugins needing this.
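
A sketch of what the abstraction could look like; batchWriter and the flush-on-tick policy are illustrative assumptions:

    package sketch

    import "time"

    // batchWriter owns the shared resource: plugins send data on
    // dataChan, and a single goroutine batches and writes on a tick.
    type batchWriter struct {
        dataChan chan []byte
        write    func([][]byte) // the single-point task, e.g. a file write
    }

    func (b *batchWriter) run(interval time.Duration) {
        var batch [][]byte
        ticker := time.NewTicker(interval)
        for {
            select {
            case d := <-b.dataChan:
                batch = append(batch, d)
            case <-ticker.C:
                if len(batch) > 0 {
                    b.write(batch)
                    batch = batch[:0]
                }
            }
        }
    }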

Plugin config documentation generator

It'd be great to use godoc to write a script that extracts a set of Heka plugin config documentation from the source code: find all of the HasConfigStruct plugins and pull docs from the specified config structs.
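
A rough sketch using the standard go/parser and go/doc packages; the "Config" name-suffix heuristic below is a stand-in for actually resolving each plugin's ConfigStruct return type:

    package main

    import (
        "fmt"
        "go/doc"
        "go/parser"
        "go/token"
        "strings"
    )

    func main() {
        fset := token.NewFileSet()
        // "./pipeline" is an assumed source directory.
        pkgs, err := parser.ParseDir(fset, "./pipeline", nil, parser.ParseComments)
        if err != nil {
            panic(err)
        }
        for _, pkg := range pkgs {
            d := doc.New(pkg, "github.com/mozilla-services/heka/pipeline", 0)
            for _, t := range d.Types {
                // Heuristic: treat *Config types as plugin config structs.
                if strings.HasSuffix(t.Name, "Config") {
                    fmt.Printf("%s\n%s\n", t.Name, t.Doc)
                }
            }
        }
    }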

MGI: Refactor from plugin

MGI should not be an input. This should hopefully not break the API, though that may be necessary if an internal entry point is made available for other possible future internal hekad APIs intended for use by plugins.

Get rid of `messageHolder` structs and instead use `PipelinePack`s everywhere

Right now we have PipelinePack objects which are a wrapper around messages that are flowing through the main Heka pipeline, and we also have messageHolder objects which are used only in the MGI infrastructure. They both serve almost identical purposes, though, and they both need reference counting and to be recycled when they're done. Getting rid of messageHolders and using PipelinePacks everywhere would both remove duplicate logic and would mean we no longer need to do message copies when injecting messages in through the MGI.

GoDoc doc string audit

A quick pass should be made through the heka/pipeline code base adding GoDoc compatible doc comments to all code that is lacking.

Support global Heka config values from config file

Currently settings such as pool size, max procs, and input channel buffer size are either specified on the command line or are hard-coded. It'd be nice to have these be settable in the config file.
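
For example, a hypothetical JSON config fragment; the key names here are assumptions mirroring the current flags and constants:

    {
        "poolSize": 100,
        "maxprocs": 4,
        "pipechanBufsize": 50
    }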
