
hadoop-ansible's Introduction

Hadoop Ansible Playbook

Ansible playbook that installs a CDH 4.6.0 Hadoop cluster (running on Java 7, supported from CDH 4.4), with HBase, Hive, Presto for analytics, and Ganglia, Smokeping, Fluentd, Elasticsearch and Kibana for monitoring and centralized log indexing.

Follow @analytically. Browse the CI build screenshots.

Requirements

  • Ansible 1.5 or later (pip install ansible)
  • 6 + 1 Ubuntu 12.04 LTS/13.04/13.10 or Debian "wheezy" hosts - see ubuntu-netboot-tftp if you need automated server installation
  • Mandrill username and API key for sending email notifications
  • ansibler user in sudo group without sudo password prompt (see Bootstrapping section below)

Cloudera (CDH4) Hadoop Roles

If you're assembling your own Hadoop playbook, these roles are available for you to reuse:

Configuration

Set the following variables using --extra-vars or by editing group_vars/all:

Required:

  • site_name - used as Hadoop nameservices and various directory names. Alphanumeric only.

Optional:

  • Network interface: if you'd like to use a different IP address per host (e.g. an internal interface), edit site.yml and adjust set_fact: ipv4_address=... to determine the correct IP address to use per host. If this fact is not set, ansible_default_ipv4.address will be used.
  • Email notification: notify_email, postfix_domain, mandrill_username, mandrill_api_key
  • roles/common: kernel_swappiness(0), nofile limits, ntp servers and rsyslog_polling_interval_secs(10)
  • roles/2_aggregated_links: bond_mode (balance-alb) and mtu (9216)
  • roles/cdh_hadoop_config: dfs_blocksize (268435456), max_xcievers (4096), heapsize (12278)
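Pulled together from the defaults listed above, a group_vars/all could look like the following sketch (the email addresses and Mandrill credentials are placeholders, not real values):

```yaml
# group_vars/all -- values in parentheses above are the defaults shown here
site_name: analytics              # required; alphanumeric only

notify_email: ops@example.com     # optional email notification settings
postfix_domain: example.com
mandrill_username: your-mandrill-username
mandrill_api_key: your-mandrill-api-key

kernel_swappiness: 0              # roles/common
rsyslog_polling_interval_secs: 10

bond_mode: balance-alb            # roles/2_aggregated_links
mtu: 9216

dfs_blocksize: 268435456          # roles/cdh_hadoop_config (256 MB)
max_xcievers: 4096
heapsize: 12278
```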

Adding hosts

Edit the hosts file and list hosts per group (see Inventory for more examples):

[datanodes]
hslave010
hslave[090:252]
hadoop-slave-[a:f].example.com

Make sure that the zookeepers and journalnodes groups contain at least 3 hosts and have an odd number of hosts.
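For example, an inventory satisfying the odd-host-count requirement could look like this (hostnames beyond those shown above are illustrative):

```ini
; three hosts each for quorum-based services
[zookeepers]
hmaster01
hmaster02
hslave010

[journalnodes]
hmaster01
hmaster02
hslave010

[datanodes]
hslave010
hslave[090:252]
```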

Ganglia nodes

Since we're using unicast mode for Ganglia (which significantly reduces chatter), you may have to wait 60 seconds after node startup before it shows up in the web interface.

Installation

To run Ansible:

./site.sh

To e.g. just install ZooKeeper, add the zookeeper tag as argument (available tags: apache, bonding, configuration, elasticsearch, elasticsearch_curator, fluentd, ganglia, hadoop, hbase, hive, java, kibana, ntp, postfix, postgres, presto, rsyslog, tdagent, zookeeper):

./site.sh zookeeper

What else is installed?

URLs

After the installation, go here:

Performance testing

Instructions on how to test the performance of your CDH4 cluster.

  • SSH into one of the machines.
  • Change to the hdfs user: sudo su - hdfs
  • Set HADOOP_MAPRED_HOME: export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
  • cd /usr/lib/hadoop-mapreduce
TeraGen and TeraSort
  • hadoop jar hadoop-mapreduce-examples.jar teragen -Dmapred.map.tasks=1000 10000000000 /tera/in to run TeraGen
  • hadoop jar hadoop-mapreduce-examples.jar terasort /tera/in /tera/out to run TeraSort
DFSIO
  • hadoop jar hadoop-mapreduce-client-jobclient-2.0.0-cdh4.6.0-tests.jar TestDFSIO -write

Bootstrapping

Paste your public SSH RSA key in bootstrap/ansible_rsa.pub and run bootstrap.sh to bootstrap the nodes specified in bootstrap/hosts. See bootstrap/bootstrap.yml for more information.

What about Pig, Flume, etc?

You can manually install additional components after running this playbook. Follow the official CDH4 Installation Guide.

Screenshots

zookeeper

hmaster01

ganglia

kibana

smokeping

License

Licensed under the Apache License, Version 2.0.

Copyright 2013-2014 Mathias Bogaert.

hadoop-ansible's People

Contributors

amirhhz, analytically, gjhenrique


hadoop-ansible's Issues

java.lang.IllegalArgumentException: Does not contain a valid host:port authority: Ubuntu_Cluster_02:8020

Hi,
Thanks for this playbook.
I'm having some troubles getting it off the ground.
PLAY [namenodes[0]] ***********************************************************

TASK: [Make sure the /data/dfs/nn directory exists] ***************************
ok: [Ubuntu_Cluster_02]

TASK: [Make sure the namenode is formatted - WILL NOT FORMAT IF /data/dfs/nn/current/VERSION EXISTS TO AVOID DATA LOSS] ***

BAIL!!!!!

Full details at this gist https://gist.github.com/darKoram/7031777

Any tutorials for complete beginners?

I guess this project is supposed to simplify the Hadoop installation process, so it would be great to create a step-by-step tutorial on how to set everything up - from the very beginning to the complete running cluster.

I (personally) am aware of Hadoop setup and installation and find it a bit complicated, too Java'ish and long. So it would be nice to have a really smart tool that installs everything in five minutes at most - from spinning up instances to starting the Hadoop task. At the same time, I'm not familiar with Ansible, and I think a good tutorial for complete Ansible noobs would be a good thing to have.

Thanks ;)

Documentation reference to Debian

The playbook has Ubuntu-specific tasks (at least the ones regarding Landscape), which make the run fail on Debian targets.
Ubuntu-specific tasks should be made conditional on the target actually being an Ubuntu server, or the requirement for Debian Wheezy servers removed from the documentation.

command > shell

"command" is lighter than "shell": without a full shell context, the module should be faster and consume less memory. The difference is minor, though.
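For illustration, the suggestion is to prefer the second form below wherever the first doesn't actually need shell features (both tasks are hypothetical, not taken from the playbook):

```yaml
# shell spawns /bin/sh for every run -- needed for pipes and redirection:
- name: count hbase log files
  shell: ls /var/log/hbase | wc -l

# command executes the program directly, with no shell in between,
# which is slightly cheaper and safer when no shell features are used:
- name: report hdfs status
  command: hdfs dfsadmin -report
```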

cdh_hbase_master | start hbase-master - Error: Could not find or load main class HotSpot(TM)

I tried installing it on 4 Google Compute Engine instances with ubuntu-12-04

  1. I got problems with the "service rsyslog restart" tasks (restarted it manually)
  2. I got this error at the NOTIFIED: [cdh_hbase_master | start hbase-master] task
=> {"failed": true}
msg:  * Stopping HBase master daemon (hbase-master): 
no master to stop because no pid file /var/run/hbase/hbase-hbase-master.pid
 * Starting HBase master daemon (hbase-master): 
starting master, logging to /var/log/hbase/hbase-hbase-master-***.***.**.***.out
Error: Could not find or load main class HotSpot(TM)
Heap
 par new generation   total 35904K, used 1277K [0x00000006ed400000, 0x00000006efaf0000, 0x00000006f7a60000)
  eden space 31936K,   4% used [0x00000006ed400000, 0x00000006ed53f6b8, 0x00000006ef330000)
  from space 3968K,   0% used [0x00000006ef330000, 0x00000006ef330000, 0x00000006ef710000)
  to   space 3968K,   0% used [0x00000006ef710000, 0x00000006ef710000, 0x00000006efaf0000)
 concurrent mark-sweep generation total 79872K, used 0K [0x00000006f7a60000, 0x00000006fc860000, 0x00000007ed400000)
 concurrent-mark-sweep perm gen total 21248K, used 2631K [0x00000007ed400000, 0x00000007ee8c0000, 0x0000000800000000)
To enable GC log rotation, use -Xloggc:<filename> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=<num_of_files>
where num_of_file > 0

root@104:~# java -version
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
root@104:~# echo $JAVA_HOME
/usr/lib/jvm/java-7-oracle

Smokeping playbook only works on Ubuntu 13.10+ out of the box

The use of the /etc/apache2/conf-available directory layout and the a2enconf command requires Apache 2.4, which was only introduced in Ubuntu 13.10. Previous Ubuntu versions ship Apache 2.2 by default.

Just recording the issue here in case someone else fixes it first, but I might get around to sending a fix myself. For my own use I'm just disabling Smokeping for now.

Any plan to support CentOS?

This repo is excellent, but it only supports Debian/Ubuntu - is there any plan to support RedHat/CentOS? It shouldn't need much refactoring.

Spin up some DigitalOcean boxes when running Travis to deploy a full Hadoop stack

It'd be good if I could fully test the playbook using Travis CI. It'd go like:

  • validate playbook
  • before_script: setup hosts in Travis CI box (/etc/hosts)
  • run ansible script to spin up 8 DO hosts with an encrypted key using my account
  • run playbook (this might take a long time, need to check if Travis CI is ok with this)
  • validate Hadoop was correctly installed (port checks?)
  • after_script: destroy DO hosts, check they are destroyed
