
hadoop-ansible's Introduction

Hadoop Ansible Playbook

Ansible playbook that installs a CDH 4.6.0 Hadoop cluster (running on Java 7, supported from CDH 4.4), with HBase, Hive, Presto for analytics, and Ganglia, Smokeping, Fluentd, Elasticsearch and Kibana for monitoring and centralized log indexing.

Follow @analytically. Browse the CI build screenshots.

Requirements

  • Ansible 1.5 or later (pip install ansible)
  • 6 + 1 Ubuntu 12.04 LTS/13.04/13.10 or Debian "wheezy" hosts - see ubuntu-netboot-tftp if you need automated server installation
  • Mandrill username and API key for sending email notifications
  • ansibler user in sudo group without sudo password prompt (see Bootstrapping section below)

Cloudera (CDH4) Hadoop Roles

If you're assembling your own Hadoop playbook, these roles are available for you to reuse:

Configuration

Set the following variables using --extra-vars or by editing group_vars/all:

Required:

  • site_name - used as Hadoop nameservices and various directory names. Alphanumeric only.

Optional:

  • Network interface: if you'd like to use a different IP address per host (e.g. an internal interface), edit site.yml and adjust set_fact: ipv4_address=... to determine the correct IP address to use per host. If this fact is not set, ansible_default_ipv4.address will be used.
  • Email notification: notify_email, postfix_domain, mandrill_username, mandrill_api_key
  • roles/common: kernel_swappiness(0), nofile limits, ntp servers and rsyslog_polling_interval_secs(10)
  • roles/2_aggregated_links: bond_mode (balance-alb) and mtu (9216)
  • roles/cdh_hadoop_config: dfs_blocksize (268435456), max_xcievers (4096), heapsize (12278)
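Pulled together from the defaults listed above, a group_vars/all could look like the following sketch (the email addresses and Mandrill credentials are placeholders, not real values):

```yaml
# group_vars/all -- values in parentheses above are the defaults shown here
site_name: analytics              # required; alphanumeric only

notify_email: ops@example.com     # optional email notification settings
postfix_domain: example.com
mandrill_username: your-mandrill-username
mandrill_api_key: your-mandrill-api-key

kernel_swappiness: 0              # roles/common
rsyslog_polling_interval_secs: 10

bond_mode: balance-alb            # roles/2_aggregated_links
mtu: 9216

dfs_blocksize: 268435456          # roles/cdh_hadoop_config (256 MB)
max_xcievers: 4096
heapsize: 12278
```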

Adding hosts

Edit the hosts file and list hosts per group (see Inventory for more examples):

[datanodes]
hslave010
hslave[090:252]
hadoop-slave-[a:f].example.com

Make sure that the zookeepers and journalnodes groups contain at least 3 hosts and have an odd number of hosts.
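For example, an inventory satisfying the odd-host-count requirement could look like this (hostnames beyond those shown above are illustrative):

```ini
; three hosts each for quorum-based services
[zookeepers]
hmaster01
hmaster02
hslave010

[journalnodes]
hmaster01
hmaster02
hslave010

[datanodes]
hslave010
hslave[090:252]
```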

Ganglia nodes

Since we're using unicast mode for Ganglia (which significantly reduces chatter), you may have to wait 60 seconds after node startup before it shows up in the web interface.

Installation

To run Ansible:

./site.sh

To e.g. just install ZooKeeper, add the zookeeper tag as argument (available tags: apache, bonding, configuration, elasticsearch, elasticsearch_curator, fluentd, ganglia, hadoop, hbase, hive, java, kibana, ntp, postfix, postgres, presto, rsyslog, tdagent, zookeeper):

./site.sh zookeeper

What else is installed?

URLs

After the installation, go here:

Performance testing

Instructions on how to test the performance of your CDH4 cluster.

  • SSH into one of the machines.
  • Change to the hdfs user: sudo su - hdfs
  • Set HADOOP_MAPRED_HOME: export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
  • cd /usr/lib/hadoop-mapreduce
TeraGen and TeraSort
  • hadoop jar hadoop-mapreduce-examples.jar teragen -Dmapred.map.tasks=1000 10000000000 /tera/in to run TeraGen
  • hadoop jar hadoop-mapreduce-examples.jar terasort /tera/in /tera/out to run TeraSort
DFSIO
  • hadoop jar hadoop-mapreduce-client-jobclient-2.0.0-cdh4.6.0-tests.jar TestDFSIO -write

Bootstrapping

Paste your public SSH RSA key in bootstrap/ansible_rsa.pub and run bootstrap.sh to bootstrap the nodes specified in bootstrap/hosts. See bootstrap/bootstrap.yml for more information.

What about Pig, Flume, etc?

You can manually install additional components after running this playbook. Follow the official CDH4 Installation Guide.

Screenshots

zookeeper

hmaster01

ganglia

kibana

smokeping

License

Licensed under the Apache License, Version 2.0.

Copyright 2013-2014 Mathias Bogaert.

hadoop-ansible's People

Contributors

amirhhz, analytically, gjhenrique


hadoop-ansible's Issues

java.lang.IllegalArgumentException: Does not contain a valid host:port authority: Ubuntu_Cluster_02:8020

Hi,
Thanks for this playbook.
I'm having some troubles getting it off the ground.
PLAY [namenodes[0]] ***********************************************************

TASK: [Make sure the /data/dfs/nn directory exists] ***************************
ok: [Ubuntu_Cluster_02]

TASK: [Make sure the namenode is formatted - WILL NOT FORMAT IF /data/dfs/nn/current/VERSION EXISTS TO AVOID DATA LOSS] ***

BAIL!!!!!

Full details at this gist https://gist.github.com/darKoram/7031777

Any tutorials for complete beginners?

I guess this project is supposed to simplify the Hadoop installation process, so it would be great to create a step-by-step tutorial on how to set everything up - from the very beginning to the complete running cluster.

I (personally) am aware of Hadoop setup and installation and find it a bit complicated, too Java'ish and long. So it would be nice to have a really smart tool that installs everything in five minutes at most - from spinning up instances to starting the Hadoop task. At the same time, I'm not familiar with Ansible, and I think a good tutorial for complete Ansible noobs would be a good thing to have.

Thanks ;)

Documentation reference to Debian

The playbook has Ubuntu-specific tasks (at least the ones regarding Landscape), which make the run fail on Debian targets.
Ubuntu-specific tasks should be made conditional on the target actually being an Ubuntu server, or the requirement for Debian Wheezy servers removed from the documentation.

command > shell

"command" is lighter than "shell": without a full shell context, the module should be faster and consume less memory. The difference is minor, though.
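For illustration, the suggestion is to prefer the second form below wherever the first doesn't actually need shell features (both tasks are hypothetical, not taken from the playbook):

```yaml
# shell spawns /bin/sh for every run -- needed for pipes and redirection:
- name: count hbase log files
  shell: ls /var/log/hbase | wc -l

# command executes the program directly, with no shell in between,
# which is slightly cheaper and safer when no shell features are used:
- name: report hdfs status
  command: hdfs dfsadmin -report
```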

cdh_hbase_master | start hbase-master - Error: Could not find or load main class HotSpot(TM)

I tried installing it on 4 Google Compute Engine instances with ubuntu-12-04

  1. I got problems with the "service rsyslog restart" tasks (restarted it manually)
  2. I got this error at the NOTIFIED: [cdh_hbase_master | start hbase-master] task
=> {"failed": true}
msg:  * Stopping HBase master daemon (hbase-master): 
no master to stop because no pid file /var/run/hbase/hbase-hbase-master.pid
 * Starting HBase master daemon (hbase-master): 
starting master, logging to /var/log/hbase/hbase-hbase-master-***.***.**.***.out
Error: Could not find or load main class HotSpot(TM)
Heap
 par new generation   total 35904K, used 1277K [0x00000006ed400000, 0x00000006efaf0000, 0x00000006f7a60000)
  eden space 31936K,   4% used [0x00000006ed400000, 0x00000006ed53f6b8, 0x00000006ef330000)
  from space 3968K,   0% used [0x00000006ef330000, 0x00000006ef330000, 0x00000006ef710000)
  to   space 3968K,   0% used [0x00000006ef710000, 0x00000006ef710000, 0x00000006efaf0000)
 concurrent mark-sweep generation total 79872K, used 0K [0x00000006f7a60000, 0x00000006fc860000, 0x00000007ed400000)
 concurrent-mark-sweep perm gen total 21248K, used 2631K [0x00000007ed400000, 0x00000007ee8c0000, 0x0000000800000000)
To enable GC log rotation, use -Xloggc:<filename> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=<num_of_files>
where num_of_file > 0

root@104:~# java -version
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
root@104:~# echo $JAVA_HOME
/usr/lib/jvm/java-7-oracle

Smokeping playbook only works on Ubuntu 13.10+ out of the box

The use of the /etc/apache2/conf-available directory layout and the a2enconf command requires Apache 2.4, which was only introduced in Ubuntu 13.10. Previous Ubuntu versions ship Apache 2.2 by default.

Just recording the issue here in case someone else fixes it first, but I might get around to sending a fix myself. For my own use I'm just disabling Smokeping for now.

Any plan to support CentOS?

This repo is excellent, but it only supports Debian/Ubuntu - is there any plan to support RedHat/CentOS? It shouldn't need much refactoring.

Spin up some DigitalOcean boxes when running Travis to deploy a full Hadoop stack

It'd be good if I could fully test the playbook using Travis CI. It'd go like:

  • validate playbook
  • before_script: setup hosts in Travis CI box (/etc/hosts)
  • run ansible script to spin up 8 DO hosts with an encrypted key using my account
  • run playbook (this might take a long time, need to check if Travis CI is ok with this)
  • validate Hadoop was correctly installed (port checks?)
  • after_script: destroy DO hosts, check they are destroyed
