addthis / hydra

License: Apache License 2.0

Java 84.29% HTML 1.90% JavaScript 12.70% CSS 0.88% Shell 0.24%

hydra's Introduction

hydra

Hydra is a distributed data processing and storage system originally developed at AddThis. It ingests streams of data (think log files) and builds trees that are aggregates, summaries, or transformations of the data. These trees can be used by humans to explore (tiny queries), as part of a machine learning pipeline (big queries), or to support live consoles on websites (lots of queries).

You can run hydra from the command line to slice and dice that Apache access log you have sitting around (or that gargantuan csv file). Or, if terabytes per day is your cup of tea, run a Hydra Cluster that supports your job with resource sharing, job management, distributed backups, data partitioning, and efficient bulk file transfer.

Documentation and References

The Hydra Documentation Page contains concepts, tutorials, guides, and the web api.

The Hydra User Reference is built automatically from the source code and contains reference material on hydra's configurable job components.

Getting Started With Hydra is a blog post that contains a nice self-contained introduction to hydra processing.

AddThis Java Code Style is the code style that hydra tries to adhere to.

Building

Assuming you have Apache Maven installed and configured:

mvn package

This should compile and build the jars. All hydra dependencies should be available on Maven Central, but hydra itself is not yet published.

Berkeley DB Java Edition is used for several core features. The sleepycat license has strong copyleft properties that do not match the rest of the project. It is set as a non-transitive dependency to avoid inadvertently pulling it into downstream projects. In the future hydra should have pluggable storage with multiple implementations.

The hydra-uber module builds an exec jar containing hydra and all of its dependencies. To include BDB JE when building with mvn package, use -P bdbje. The main class of the exec jar launches the various components of a hydra cluster by name.

System dependencies

JDK 8 is required. Hydra has been developed on Linux (CentOS 6) and should work on any modern Linux distro. Other unix-like systems should work with minor changes but have not been tested. Mac OS X should work for building and running local-stack (see below).

Hydra uses rabbitmq for low volume command and control message exchange. On a modern Linux system, running apt-get install rabbitmq-server and using the default settings is adequate in most cases.

To run efficiently, Hydra needs a mechanism to take copy-on-write backups of the output of jobs. This is currently accomplished by adding the fl-cow library to LD_PRELOAD. Other approaches, such as ZFS or cp --reflink, are under consideration.
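
As a minimal sketch of the LD_PRELOAD approach, the helper below builds a preload value that puts one library first without discarding anything that is already preloaded. The function name preload_with, the variable FLCOW_LIB, and the path /usr/lib/libflcow.so are assumptions made for illustration and are not taken from the hydra sources; check where fl-cow is installed on your system.

```shell
# Build an LD_PRELOAD value that puts a library first while keeping
# whatever was already preloaded.
#   $1 = path to the library to preload
#   $2 = current LD_PRELOAD value (may be empty)
preload_with() {
  if [ -z "$2" ]; then
    printf '%s\n' "$1"
  else
    printf '%s:%s\n' "$1" "$2"
  fi
}

# Hypothetical install path (an assumption); adjust for your system.
FLCOW_LIB=/usr/lib/libflcow.so

# Example with a placeholder command name:
#   LD_PRELOAD="$(preload_with "$FLCOW_LIB" "${LD_PRELOAD:-}")" some-command
```

Setting the variable for a single command, rather than exporting it in a login shell, keeps the library out of unrelated processes.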

Many components assume that there is a local user called hydra and that all minion nodes can ssh as that user to each other. This is used most prominently for rsync based replicas. The user hydra is not necessary when running a local-stack environment (see below).
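
One way to check the ssh assumption before starting a cluster is sketched below. It tries a non-interactive login as the hydra user on each host and prints the hosts that fail. The function name unreachable_hosts and the host names in the usage comment are hypothetical; BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
# Print each host that the hydra user cannot reach without a prompt.
# BatchMode=yes makes ssh fail instead of asking for a password, and
# -n stops ssh from reading the script's standard input.
unreachable_hosts() {
  for host in "$@"; do
    ssh -n -o BatchMode=yes -o ConnectTimeout=5 "hydra@$host" true \
      >/dev/null 2>&1 || printf '%s\n' "$host"
  done
}

# Example with hypothetical host names:
#   unreachable_hosts minion1 minion2 minion3
```

Because every minion must be able to reach every other minion, a check like this would need to be run from each node in turn, not only from the cluster head.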

OS X

On OS X several utilities are necessary to run the local-stack environment:

brew install coreutils
brew install wget

Components

While hydra can be used for ad-hoc analysis of csv and other local files, it's most commonly used in a distributed cluster. In that case the following components are involved:

ZooKeeper
Spawn: Job control and execution
Minion: Task runner
QueryMaster: Handler for queries
QueryWorker: Handle scatter-gather requests from QueryMaster
Meshy: File server

A typical configuration is to have a cluster head with Spawn & QueryMaster backed by a homogeneous cluster of nodes running Minion, QueryWorker, and Meshy.

Local Stack

For local development all of the above components can run together in a single stack run out of hydra-local. There is a local-stack.sh script to assist with this. To run the local stack you must:

Be able to build hydra
Have rabbitmq installed
Allow your current user to ssh to itself
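
The prerequisites above can be partly checked with a small preflight helper. This is only a sketch: the function name missing_tools is made up here, and the command names in the usage comment (in particular rabbitmq-server, which is taken from the package name) are assumptions about what is on your PATH.

```shell
# Print the name of every listed command that is not on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example (command names are assumptions):
#   missing_tools mvn ssh rabbitmq-server
# Checking that the current user can ssh to itself:
#   ssh -o BatchMode=yes localhost true
```

An empty result only shows that the commands exist; it does not prove that rabbitmq is running or that the build will succeed.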

The first time the script is run a hydra-local directory will be created.

./hydra-uber/local/bin/local-stack.sh start - start ZooKeeper
./hydra-uber/local/bin/local-stack.sh start - start spawn, querymaster etc.
./hydra-uber/local/bin/local-stack.sh seed - add some sample data

You can then navigate to http://localhost:5052/ and you should see the spawn web interface.

When done ./hydra-uber/local/bin/local-stack.sh stop will stop everything except ZooKeeper, and running stop a second time will bring that process down as well.

There are sample job configurations located in hydra-uber/local/sample/

Administrative

Discussion

Mailing list: http://groups.google.com/group/hydra-oss

Freenode channel: #hydra

Versioning

It's x.y.z where:

x: Something Big Happened
y: next release
z: strive for bug fix only
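
To make the scheme concrete, the sketch below classifies the change between two x.y.z version strings. The function name bump_kind and the labels it prints (big, release, bugfix, none) are shorthand invented for this example, and the versions in the usage comment other than 5.5.13 are hypothetical.

```shell
# Classify the change between two x.y.z version strings.
#   $1 = old version, $2 = new version
# Prints one of: big, release, bugfix, none.
bump_kind() {
  old_x=${1%%.*}; old_rest=${1#*.}
  old_y=${old_rest%%.*}; old_z=${old_rest#*.}
  new_x=${2%%.*}; new_rest=${2#*.}
  new_y=${new_rest%%.*}; new_z=${new_rest#*.}

  if [ "$old_x" -ne "$new_x" ]; then
    echo "big"        # x changed: Something Big Happened
  elif [ "$old_y" -ne "$new_y" ]; then
    echo "release"    # y changed: next release
  elif [ "$old_z" -ne "$new_z" ]; then
    echo "bugfix"     # z changed: bug fix only
  else
    echo "none"
  fi
}

# Example (hypothetical target versions):
#   bump_kind 5.5.13 5.5.14
#   bump_kind 5.5.13 5.6.0
```

The helper assumes both arguments have exactly three numeric parts and does no validation beyond that.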

License

hydra is released under the Apache License Version 2.0. See Apache or the LICENSE file in this distribution for details.

Logo

Hydra logo by Appy Vohra.


hydra's Issues

Not able to run samples as per user guide

Hi, I am trying to get started with Hydra and am running it in a local configuration for testing. Where can I get sample job configurations and sample data? Where should the data be placed if I have a csv file? I managed to configure a job, but I am not able to provide input data. Please help. Your sample job with automatic data injection runs correctly.

Syntax for the job configuration and query

Hi,
Where can I get the exact syntax/format of all the sections and parameters in the job configuration file and the queries? The reference document lists all the parameters, but only a few examples are provided, not the exhaustive syntax. The same is true for queries.

NPE on Minion shutdown

http://pastebin.com/1tq7HsU3

Job validation should fail when a key is declared more than once

Error while shutting down job

Exception in thread "SourceReader" java.lang.ExceptionInInitializerError
at com.addthis.hydra.task.run.TaskFeeder.run(TaskFeeder.java:220)
Caused by: java.lang.IllegalStateException: Shutdown in progress
at java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:66)
at java.lang.Runtime.addShutdownHook(Runtime.java:211)
at com.google.common.util.concurrent.MoreExecutors$Application.addShutdownHook(MoreExecutors.java:223)
at com.google.common.util.concurrent.MoreExecutors$Application.addDelayedShutdownHook(MoreExecutors.java:195)
at com.google.common.util.concurrent.MoreExecutors$Application.getExitingScheduledExecutorService(MoreExecutors.java:187)
at com.google.common.util.concurrent.MoreExecutors$Application.getExitingScheduledExecutorService(MoreExecutors.java:219)
at com.google.common.util.concurrent.MoreExecutors.getExitingScheduledExecutorService(MoreExecutors.java:169)
at com.addthis.muxy.MuxFileDirectoryCacheInstance.(MuxFileDirectoryCacheInstance.java:62)
at com.addthis.muxy.MuxFileDirectoryCacheInstance.(MuxFileDirectoryCacheInstance.java:84)
at com.addthis.muxy.MuxFileDirectoryCacheInstance.(MuxFileDirectoryCacheInstance.java:41)
at com.addthis.muxy.MuxFileDirectoryCacheInstance$Builder.build(MuxFileDirectoryCacheInstance.java:312)
at com.addthis.muxy.MuxFileDirectoryCache.(MuxFileDirectoryCache.java:32)
... 1 more

Why don't you guys use CLOUDFLARE as CDN instead of FASTLY?

I noticed you guys use FASTLY as your CDN, but I don’t get why you don’t use, for example, CLOUDFLARE, which would provide huge benefits:

1. Cloudflare bandwidth is FREE
   a. Cloudflare doesn’t charge for bandwidth; it’s free (they might charge you guys since you are a huge company, but it would surely be cheaper than Fastly)
   b. This should save your company thousands of dollars
2. More POP/datacenter locations to deliver the assets:
   a. Cloudflare: https://www.cloudflare.com/network-map
      i. And these are the ones they plan for 2015, which are insane: https://twitter.com/davidmytton/status/543126344377585664
   b. Fastly: http://www.fastly.com/network/
3. It has many other good features, but I think just the 2 above are more than enough to take into account

excessive zk logging on empty/idle cluster

i've started and stopped a virgin local stack a few times and then, after letting it sit idle for 15-20 minutes, randomly did a du -sh on the hydra run directory. surprisingly it came in at close to 500MB. i drilled down and found a huge chunk of zookeeper logs in hydra/etc/zookeeper/version-2/ ... it seems they are all /spawn/queue updates. not clear why that would be, given that the cluster was completely idle and lacked jobs, tasks, hosts, etc. not sure how to really debug this or if i want to (i don't).

Minion does not shut down cleanly if Rabbit or ZK is down

See http://pastebin.com/GA72n2qu

Comparison with popular streaming engines such as Flink

Hi,

I've noticed that the hydra project has been around for several years and is still under active development. Currently, similar tasks, such as unified batch and stream processing, can also be handled by popular streaming engines such as Flink. For example, Uber uses AthenaX to let developers use SQL on Flink to process massive amounts of batch and stream data. How does hydra compare with such platforms? Thanks~

Error when building hydra with Maven

When I build with mvn package -P bdbje, hydra-local.sh is generated, but the scripts I am supposed to run next:
./hydra-uber/bin/local-stack.sh start - start ZooKeeper
./hydra-uber/bin/local-stack.sh start - start spawn, querymaster etc.
./hydra-uber/bin/local-stack.sh seed - add some sample data

have not been generated. Can anyone help?

Not able to build Hydra using mvn package

While trying to build the hydra package I got the following error:

Failed to execute goal on project hydra-essentials: Could not resolve dependencies for project com.addthis.hydra:hydra-essentials:jar:5.5.13: Could not find artifact com.addthis:basis:jar:3.0.4 in central (http://repo.maven.apache.org/maven2) -> [Help 1]

Please help in resolving. Tried older version 3.4 also. Same error with -P bdbje option.

Custom user-agent?

Hydra's user-agent is Mozilla/4.0 (Hydra), which is usually blocked by WAFs.
Can we customize the user-agent?

Documentation links not loading

Hi,

User Guide and Reference links are not loading.