heka's Introduction

This project is deprecated. Please see this email for more details.

Heka

Data Acquisition and Processing Made Easy

Heka is a tool for collecting and collating data from a number of different sources, performing "in-flight" processing of collected data, and delivering the results to any number of destinations for further analysis.

Heka is written in Go, but Heka plugins can be written in either Go or Lua. The easiest way to compile Heka is by sourcing (see below) the build script in the root directory of the project, which will set up a Go environment, verify the prerequisites, and install all required dependencies. The build process also provides a mechanism for easily integrating external plug-in packages into the generated hekad. For more details and additional installation options see Installing.

WARNING: YOU MUST SOURCE THE BUILD SCRIPT (i.e. source build.sh) TO BUILD HEKA. Setting up the Go build environment requires changes to the shell environment; if you simply execute the script (i.e. ./build.sh) these changes will not be made.


heka's Issues

Swap statsd metric structures during flush interval

Currently the StatAccumInput stops counting while it performs a flush. An alternate approach would be to have a separate goroutine listening on a channel: the current structure is sent over to it and flushed, while the main goroutine continues counting in a new structure.

For statsd, something will need to be done to ensure that bucket names aren't 'missed' across the swap in case no metrics come in for a given bucket name.
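
A rough sketch of the swap approach; the statAccum type, channel names, and pre-seeding of bucket names are illustrative assumptions, not Heka's actual API:

    package sketch

    import "time"

    type statAccum struct {
        counters  map[string]int      // current interval's counts
        statChan  chan string         // incoming stat bucket names
        flushChan chan map[string]int // consumed by a flusher goroutine
    }

    func (s *statAccum) run(interval time.Duration) {
        ticker := time.NewTicker(interval)
        for {
            select {
            case bucket := <-s.statChan:
                s.counters[bucket]++
            case <-ticker.C:
                old := s.counters
                // Pre-seed known bucket names so they aren't "missed"
                // if no metrics arrive for them next interval.
                fresh := make(map[string]int, len(old))
                for name := range old {
                    fresh[name] = 0
                }
                s.counters = fresh
                s.flushChan <- old // flush happens elsewhere; counting continues
            }
        }
    }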

Extend flood client for more thorough testing

The heka flood client should generate messages of multiple types and sizes, malformed messages, and larger messages to ensure hekad acts properly and doesn't have performance issues with packets more likely to actually be seen in the wild.

use qualified imports so go get tooling can work ootb

Ideally, to install/fetch heka, all one should need to do is:

$ go get -v github.com/mozilla-services/heka/...

Unfortunately heka doesn't use fully qualified internal import paths, so this can't work:

import "heka/agent": import path doesn't contain a hostname
package heka/agent: unrecognized import path "heka/agent"
import "heka/message": import path doesn't contain a hostname
package heka/message: unrecognized import path "heka/message"
import "heka/client": import path doesn't contain a hostname
package heka/client: unrecognized import path "heka/client"
import "heka/pipeline": import path doesn't contain a hostname
package heka/pipeline: unrecognized import path "heka/pipeline"
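
The fix is mechanical: every internal import gets the full repository prefix, so the go tooling can resolve it. A minimal example using one of the paths from the errors above:

    // Before: unqualified internal import, which `go get` cannot resolve.
    import "heka/pipeline"

    // After: fully qualified import path.
    import "github.com/mozilla-services/heka/pipeline"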

Use pool of decoders rather than creating new ones that must be closed

The decoderManager is currently creating decoders for inputs on demand. When an input is finished, it must close the decoder channels. The manager then holds on to the closed decoders for possible reuse. Closing the decoder is an unfortunate requirement to place on input developers, and the locking involved in tracking the running and closed decoders is less than ideal. A round-robin pool of decoder sets that are always running is a better choice.
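
A minimal sketch of the round-robin idea; the decoderSet shape and pool mechanics are assumptions, not the current decoderManager API:

    package sketch

    import "sync/atomic"

    type decoderSet struct {
        inChan chan []byte // raw bytes for this always-running decoder set
    }

    type decoderPool struct {
        sets []*decoderSet
        next uint32
    }

    // Get hands out decoder sets round-robin. No close bookkeeping or
    // locking is needed because the decoders never shut down.
    func (p *decoderPool) Get() *decoderSet {
        n := atomic.AddUint32(&p.next, 1)
        return p.sets[int(n)%len(p.sets)]
    }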

AWS images

One web server + heka agent image, and one heka aggregator image.

Queue self-monitoring and reporting

A plugin that allows heka to capture information on the channels and plugins in use and report it (likely as a new message injected via the MGI).

Dynamic reloading of sandbox filter source code

Sandboxed filter plugins need to be able to reload source code at runtime. Initially this can be a SIGHUP triggering file system source code reload. After a secure control message mechanism has been developed this can be triggered by control messages containing the new source code as part of the payload.

Message delivery guarantees

The ability to specify durability for inputs, where messages they receive are written to something such as an append-only file, so that acknowledged messages are as durable as possible.
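
A sketch of the append-only idea; the journal type, file mode, and sync-per-message policy are illustrative choices, not a settled design:

    package sketch

    import "os"

    type journal struct {
        f *os.File
    }

    func openJournal(path string) (*journal, error) {
        f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
        if err != nil {
            return nil, err
        }
        return &journal{f: f}, nil
    }

    // Append writes the encoded message and fsyncs before the input
    // acknowledges it, so an acknowledged message survives a crash.
    func (j *journal) Append(msg []byte) error {
        if _, err := j.f.Write(msg); err != nil {
            return err
        }
        return j.f.Sync()
    }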

Investigate "recycle" pattern in outputs that use a channel to talk to a shared sender

Output plugins are created separately for each pipeline pack, but many of them share a single resource that actually handles the real output of writing to a file or sending to a connection, since it may be impossible or inefficient for there to be poolsize different entities actually performing the final output. Typically this is handled by spinning up an output goroutine which listens on a data channel. Each output plugin puts the output data on this channel, the goroutine takes it and sends it along. This can be seen very clearly in the FileOutput, which passes the output byte slice to a FileWriter which actually talks to the file system.

This works well, but there are inefficiencies because once a byte slice has been written by the FileWriter it is no longer needed, and the byte slice and its underlying array must be garbage collected. We could probably improve performance quite a bit by extending the "recycleChan" pattern that we use for the pipeline packs. After using a byte slice, the output goroutine would put it in a recycle channel from which the outputs would pull; no need to create new ones or GC the old ones.

As this would be used in many different outputs, it'd be great to generalize the pattern a bit to minimize boilerplate.
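
A sketch of what the generalized pattern could look like; byteRecycler and its helpers are hypothetical names:

    package sketch

    type byteRecycler struct {
        recycleChan chan []byte
    }

    // Get reuses a recycled slice when one is available, otherwise
    // allocates a fresh one with the given capacity hint.
    func (r *byteRecycler) Get(capHint int) []byte {
        select {
        case b := <-r.recycleChan:
            return b[:0]
        default:
            return make([]byte, 0, capHint)
        }
    }

    // Put returns a slice to the pool once the output goroutine is done
    // with it, or lets it be garbage collected if the pool is full.
    func (r *byteRecycler) Put(b []byte) {
        select {
        case r.recycleChan <- b:
        default:
        }
    }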

Cassandra back-end

Support Cassandra as a back-end for messages, as well as for statsd-type metrics (time series).

MGI: Prevent infinite loops

MGI should perform additional configurable checks on how many times a message has passed through, and dump/error on messages that have looped beyond the configured limit, to prevent infinite message looping.
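
A sketch of the check, assuming a hypothetical hop-count field on the message (no such field exists yet):

    package sketch

    import "errors"

    // msg is a stand-in for a Heka message; hopCount is a hypothetical
    // field tracking how many times the MGI has re-injected it.
    type msg struct {
        hopCount int
        payload  string
    }

    var errLooping = errors.New("message exceeded max hop count, dropping")

    // inject bumps the hop count and refuses messages that have cycled
    // through the MGI more than the configured maxHops times.
    func inject(m *msg, maxHops int) error {
        m.hopCount++
        if m.hopCount > maxHops {
            return errLooping
        }
        // ... hand m to the pipeline as usual ...
        return nil
    }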

Proposal: TextParser Decoder

TextParser Decoder

A TextParser Decoder that will handle extracting information from the payload or in other ways manipulating a Message.

Functionality

  • Trim / Replace content of the payload
  • Parse the payload using a regular expression into variables
  • Set/Manipulate Message fields based on regex named variables
  • Define severity mappings (i.e., turn 'MAJOR' into Severity 1)
  • Define Timestamp string layout

Naming a regular expression field will cause it to be appropriately coerced into the field required. For example, a named group 'Severity' will be parsed to an int.

Payload Parsing

Payloads can be parsed into a set of custom names using a regular expression. Go uses the re2 syntax for regular expressions.

re2 syntax: http://code.google.com/p/re2/wiki/Syntax
re2 match tester: http://regoio.herokuapp.com/

These custom names may be used when setting Message fields by prefixing a name with @. For example, if the regular expression captured some text as reporter, then setting a field to Reported by @reporter would substitute the captured text when it is set.

If matching a payload, the regular expression should be set as the payloadMatch key.

If no named group was captured, an error will be logged, and the message will be dropped.
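
A sketch of the named-group extraction and coercion in Go; the pattern and payload here are illustrative only:

    package main

    import (
        "fmt"
        "regexp"
        "strconv"
    )

    func main() {
        re := regexp.MustCompile(`(?P<Severity>\d+) reported by (?P<reporter>\w+)`)
        payload := "3 reported by alice"

        m := re.FindStringSubmatch(payload)
        if m == nil {
            // No named group captured: log an error, drop the message.
            fmt.Println("no match, dropping message")
            return
        }
        caps := make(map[string]string)
        for i, name := range re.SubexpNames() {
            if i > 0 && name != "" {
                caps[name] = m[i]
            }
        }
        // A group named after a Message field is coerced to that field's
        // type; e.g. Severity is parsed to an int.
        sev, _ := strconv.Atoi(caps["Severity"])
        fmt.Println(sev, caps["reporter"]) // 3 alice
    }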

Field Names

A named group in the regular expression matching one of the following names will automatically be mapped to the corresponding Message field:

  • Uuid
  • Timestamp*
  • Type
  • Logger
  • Severity*
  • Payload
  • EnvVersion
  • Pid
  • Hostname

Fields marked with an asterisk may require configured coercion rules to manipulate the string into the appropriate type.

Coercion Rules

Severity is recorded in the message as an int. It is possible that you will need to define a mapping to turn the captured severity from the payload into an int. This should be supplied as a dict under the severityMap key.

Timestamp will need to be coerced into the proper internal representation of a timestamp. If a timestamp is captured from the payload, then a Layout indicating the time format must be supplied as the timestampLayout key.

For layout formatting, see: http://golang.org/pkg/time/#Parse
Example layouts: http://golang.org/pkg/time/#pkg-constants
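
For example, a minimal demonstration of coercing a captured timestamp string with an assumed layout:

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        // The layout spells out Go's reference time (Mon Jan 2 15:04:05
        // MST 2006) in the format the payload uses; both values here
        // are illustrative.
        layout := "2006-01-02 15:04:05"
        t, err := time.Parse(layout, "2013-01-15 08:30:00")
        if err != nil {
            fmt.Println("bad timestamp:", err)
            return
        }
        fmt.Println(t.UnixNano()) // nanoseconds since epoch
    }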

Setting Fields

Message fields may be set/reset by defining a dict under the messageFields key that names each Message field to set (full list above under Field Names), and a string/int indicating what to set the field to. Setting a field to None will clear the contents of the field.

Any field that is not one of the basic Message fields above will be set as a separate custom key under Message.Fields.
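
Pulling the proposed keys together, a hypothetical decoder config section might look like the following (nothing here is implemented; names follow the proposal above):

    "decoders": [
        {
            "type": "TextParserDecoder",
            "payloadMatch": "(?P<Severity>\\w+) (?P<reporter>\\w+): (?P<detail>.*)",
            "severityMap": {"MAJOR": 1, "MINOR": 5},
            "timestampLayout": "2006-01-02 15:04:05",
            "messageFields": {
                "Type": "parsed",
                "Payload": "Reported by @reporter: @detail"
            }
        }
    ],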

Fix incompatibility with latest Go

This Go commit broke the build process:
https://code.google.com/p/go/source/detail?r=030625a923ca1582471acdee67fbd9260728eda6

Our build should be updated so that we can build Heka against the Go 1.1 development builds.

Error:

# github.com/mozilla-services/heka/hekad
/usr/bin/ld: cannot find -lsandbox
collect2: ld returned 1 exit status
code.google.com/p/goprotobuf/proto.statictmp_1699: /home/bbangert/heka-build/build/go/pkg/tool/linux_amd64/6l: running gcc failed: unsuccessful exit status 0x100
make: *** [bin/hekad] Error 2

Better handling of pool and channel sizes

Currently the heka pool size defaults to 1000, but we've seen that performance seems better with much lower numbers, below 100. Also, PIPECHAN_BUFSIZE is a hard-coded constant of 500. These numbers should be changed to more sensible defaults and made tweakable as needed for different use cases.

Update / overhaul DashboardOutput web UI

We've got a rudimentary web UI made available by the DashboardOutput. This contains a lot of valuable information, but the presentation is currently very clunky, and is definitely not scaling well as we add more info to the standard Heka report. This deserves to be revisited to be made more readable and flexible.

Server side check of sending host to prevent client host spoofing

Heka depends on the client to specify the hostname value in Heka messages, which leaves us vulnerable to host spoofing from an untrusted or compromised client. We need to be able to tell Heka to check that message hostnames match those of the hosts making the connection, accounting for the possibility of messages being relayed through multiple Heka instances.
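
A sketch of a per-connection check for the simple, non-relayed case; handling messages relayed through intermediate Heka instances would need more than this:

    package sketch

    import (
        "fmt"
        "net"
    )

    // verifyHostname checks that the hostname claimed in a message
    // resolves to the peer's address. Relayed messages are not handled.
    func verifyHostname(conn net.Conn, claimed string) (bool, error) {
        peerIP, _, err := net.SplitHostPort(conn.RemoteAddr().String())
        if err != nil {
            return false, err
        }
        addrs, err := net.LookupHost(claimed)
        if err != nil {
            return false, err
        }
        for _, a := range addrs {
            if a == peerIP {
                return true, nil
            }
        }
        return false, fmt.Errorf("hostname %q does not match peer %s", claimed, peerIP)
    }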

Allow prepending and/or appending to all chain definitions

The CounterOutput is a useful output that shows how many messages the heka pipeline is processing. Usually you'll want to be counting every single message, regardless of which chain that message goes through. As of now, the only way to achieve this is to explicitly add the CounterOutput to every chain that is defined. This is fine but slightly annoying, as this option is frequently turned on and off during development; it would be much nicer to have a way to specify that an output should be added to every chain. This isn't yet enough of a use case to justify adding the feature, but if we come across more cases where a filter or output should be added to every chain then it might be worthwhile.

One implementation idea is to support a chain_prepends and/or chain_appends section in the config. These sections would only allow filters and outputs keys, and the specified filters and outputs would be prepended or appended to the set of filters and outputs specified in the chain that is selected for each message, i.e. it'd just be shorthand for adding those filters and outputs to the beginning or end of each chain definition's list.
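
For example, a hypothetical config shape; the key names follow the proposal, and "counter" and "debuglog" are assumed plugin names:

    "chain_prepends": {
        "outputs": ["counter"]
    },
    "chain_appends": {
        "outputs": ["debuglog"]
    },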

Reliable message transport

A reliable transport should be available between the client <-> heka, and between heka <-> heka instances. Possibly ZeroMQ, or our own TCP-based protocol.

Proposal: Statsd Filter

Statsd Filter

A Statsd Filter that will match varying attributes of the message, and construct a metric compliant with statsd that will then be fed into the Statsd input.

Functionality

  • Match the message on any of its fields, either directly or via a regex
  • Utilize a matched portion as a value for a statsd timer/gauge/counter
  • Feed stats metric into the StatsdInput and update StatsdInput to accept internal feeding of this nature

Portions captured by a regular expression will be coerced into the appropriate type for the metric being produced.

Message Matching

A message may be matched based on any of its fields. For example, to match a message from a specific logfile that is an HTTP GET and generate a counter for it named after the website it's for, this config section would be appropriate (note that a leading ~ indicates a regular expression):

    "filters": [
        {
            "type": "StatFilter",
            "name": "HTTP Hits",
            "match": {
                "Logger": "~/var/log/apache/(?P<Sitename>.*)/access.log",
                "Payload": "~\"(?P<Method>\w+) [^\"]* (?P<Code>\d{3}) (?P<Bytes>\d+)"
            },
            "metrics": [
                {
                    "type": "Counter",
                    "name": "@Sitename.@Method.@Code",
                    "value": "1"
                },
                {
                    "type": "Counter",
                    "name": "@Sitename.@Bytes",
                    "value": "@Bytes"
                }
            ]
        }
    ],

Runtime panic when testing hekad w/ metlog-py's "mb" tool

Currently getting the following when I launch hekad w/ a simple grater.json config and then point the metlog-py "mb" tool at it (using mb 127.0.0.1 5565):

spire ~/go/heka $ ./bin/hekad -config=grater.json
2012/11/20 16:59:21 Starting hekad...
2012/11/20 16:59:21 Input started: udp
2012/11/20 16:59:26 Got 21264 messages. 21286.12 msg/sec
2012/11/20 16:59:27 Got 42739 messages. 21474.86 msg/sec
2012/11/20 16:59:28 Got 67234 messages. 24494.55 msg/sec
panic: runtime error: slice bounds out of range

goroutine 77308 [running]:
time.Parse(0x31d6da, 0x1, 0xc2068e6333, 0x1, 0xc2068e6320, ...)
/Users/rob/src/go/src/pkg/time/format.go:859 +0x1a22
heka/pipeline.(*JsonDecoder).Decode(0xc200800e78, 0xc201399e80, 0xc200092c70, 0xb, 0xc200103540, ...)
/Users/rob/go/heka/src/heka/pipeline/decoders.go:50 +0x14e
heka/pipeline.func·017(0xc2000996c8, 0xc200099040, 0x12ac0, 0xc201399e80, 0x0, ...)
/Users/rob/go/heka/src/heka/pipeline/runner.go:150 +0x1a3
created by heka/pipeline.func·015
/Users/rob/go/heka/src/heka/pipeline/inputs.go:72 +0x12f

goroutine 1 [chan receive]:
heka/pipeline.(*PipelineConfig).Run(0xc20009c120, 0x7fff5fbffc84)
/Users/rob/go/heka/src/heka/pipeline/runner.go:200 +0x58b
main.main()
/Users/rob/go/heka/src/heka/hekad/main.go:57 +0x3c2

goroutine 2 [syscall]:
created by runtime.main
/Users/rob/src/go/src/pkg/runtime/proc.c:225

goroutine 3 [syscall]:
os/signal.loop()
/Users/rob/src/go/src/pkg/os/signal/signal_unix.go:20 +0x1c
created by os/signal.init·1
/Users/rob/src/go/src/pkg/os/signal/signal_unix.go:26 +0x2f

goroutine 5 [syscall]:
syscall.Syscall6()
/Users/rob/src/go/src/pkg/syscall/asm_darwin_amd64.s:38 +0x5
syscall.kevent(0x8, 0x0, 0x0, 0xc200088a88, 0xa, ...)
/Users/rob/src/go/src/pkg/syscall/zsyscall_darwin_amd64.go:199 +0x86
syscall.Kevent(0x8, 0x0, 0x0, 0x0, 0xc200088a88, ...)
/Users/rob/src/go/src/pkg/syscall/syscall_bsd.go:551 +0x9b
net.(*pollster).WaitFD(0xc200088a80, 0xc2001032a0, 0x0, 0xffffffffffffffff, 0x0, ...)
/Users/rob/src/go/src/pkg/net/fd_darwin.go:96 +0x179
net.(*pollServer).Run(0xc2001032a0, 0x0)
/Users/rob/src/go/src/pkg/net/fd_unix.go:211 +0xef
created by net.newPollServer
/Users/rob/src/go/src/pkg/net/newpollserver_unix.go:33 +0x367

goroutine 6 [select]:
heka/pipeline.counterLoop()
/Users/rob/go/heka/src/heka/pipeline/outputs.go:85 +0x70a
created by heka/pipeline.InitCountChan
/Users/rob/go/heka/src/heka/pipeline/outputs.go:56 +0x46

goroutine 7 [runnable]:
created by addtimer
/Users/rob/src/go/src/pkg/runtime/ztime_darwin_amd64.c:73

goroutine 8 [runnable]:
heka/pipeline.func·015(0xc2040e91c8, 0xc2040e91d8, 0xc2040e91d0, 0xc2040e91e0, 0x0, ...)
/Users/rob/go/heka/src/heka/pipeline/inputs.go:58 +0x56
created by heka/pipeline.(*InputRunner).Start
/Users/rob/go/heka/src/heka/pipeline/inputs.go:77 +0xcd
spire ~/go/heka $

Dynamic configuration of plugins

Plugins should be configurable on the fly, perhaps using a signed/encrypted message that contains the new configuration or an update of configuration to an existing plugin.

Agent / Aggregator Discovery

heka agents need a way to discover aggregators to deliver messages to in a dynamically changing environment (such as AWS). This could possibly be done with nsq_lookup or Zookeeper.

Continuous Integration

Set up a continuous integration server, ideally one that can also track performance metrics, to avoid regressions and ensure complete test coverage.

Abstraction of "batch-and-back" pattern

Outputs that need to perform a task from a single point (like writing to a logfile, sending stats, etc.) have to either toggle between accepting messages and writing, or build their own two-channel switch system. This abstraction would reduce the repetition in plugins needing this.
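
A sketch of what the abstraction could look like; batchWriter and the flush-on-tick policy are illustrative assumptions:

    package sketch

    import "time"

    // batchWriter owns the shared resource: plugins send data on
    // dataChan, and a single goroutine batches and writes on a tick.
    type batchWriter struct {
        dataChan chan []byte
        write    func([][]byte) // the single-point task, e.g. a file write
    }

    func (b *batchWriter) run(interval time.Duration) {
        var batch [][]byte
        ticker := time.NewTicker(interval)
        for {
            select {
            case d := <-b.dataChan:
                batch = append(batch, d)
            case <-ticker.C:
                if len(batch) > 0 {
                    b.write(batch)
                    batch = batch[:0]
                }
            }
        }
    }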

Plugin config documentation generator

It'd be great to use godoc to write a script that extracts a set of Heka plugin config documentation from the source code: find all of the HasConfigStruct plugins and pull docs from the specified config structs.
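
A rough sketch using the standard go/parser and go/doc packages; the "Config" name-suffix heuristic below is a stand-in for actually resolving each plugin's ConfigStruct return type:

    package main

    import (
        "fmt"
        "go/doc"
        "go/parser"
        "go/token"
        "strings"
    )

    func main() {
        fset := token.NewFileSet()
        // "./pipeline" is an assumed source directory.
        pkgs, err := parser.ParseDir(fset, "./pipeline", nil, parser.ParseComments)
        if err != nil {
            panic(err)
        }
        for _, pkg := range pkgs {
            d := doc.New(pkg, "github.com/mozilla-services/heka/pipeline", 0)
            for _, t := range d.Types {
                // Heuristic: treat *Config types as plugin config structs.
                if strings.HasSuffix(t.Name, "Config") {
                    fmt.Printf("%s\n%s\n", t.Name, t.Doc)
                }
            }
        }
    }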

MGI: Refactor from plugin

MGI should not be an input. This should hopefully not break the API, though that may be necessary if an internal entry point is made available for other possible future internal hekad APIs intended for use by plugins.

Get rid of `messageHolder` structs and instead use `PipelinePack`s everywhere

Right now we have PipelinePack objects which are a wrapper around messages that are flowing through the main Heka pipeline, and we also have messageHolder objects which are used only in the MGI infrastructure. They both serve almost identical purposes, though, and they both need reference counting and to be recycled when they're done. Getting rid of messageHolders and using PipelinePacks everywhere would both remove duplicate logic and would mean we no longer need to do message copies when injecting messages in through the MGI.

GoDoc doc string audit

A quick pass should be made through the heka/pipeline code base adding GoDoc compatible doc comments to all code that is lacking.

Support global Heka config values from config file

Currently settings such as pool size, max procs, and input channel buffer size are either specified on the command line or are hard-coded. It'd be nice to have these be settable in the config file.
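
For example, a hypothetical JSON config fragment; the key names here are assumptions mirroring the current flags and constants:

    {
        "poolSize": 100,
        "maxprocs": 4,
        "pipechanBufsize": 50
    }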
