caochong's Introduction

Hadoop and Spark on Docker

This tool sets up a Hadoop and/or Spark cluster running within Docker containers on a single physical machine (e.g. your laptop). It's convenient for debugging, testing and operating a real cluster, especially when you run customized packages with changes to the Hadoop/Spark source code and configuration files.

Why Me

This tool is:

  • easy to go: just one command, run.sh (Tell me, friend, can you ask for anything more?).
  • customizable: you can specify the cluster specs easily, e.g. HA-enabled, number of datanodes, LDAP, security, etc.
  • configurable: you can either change the Hadoop and/or Spark configuration files before launching the cluster or change them online by logging in to the containers.
  • elastic: imagine your physical machine can run as many containers as you wish.

The distributed cluster in Docker containers outperforms its counterparts:

  1. pseudo-distributed Hadoop cluster on a single machine, where it is nontrivial to run HA, launch multiple datanodes, test the HDFS balancer/mover, etc.
  2. setting up a real cluster, which is complex and heavy to use, assuming you can afford a real cluster in the first place.
  3. building an Ambari cluster using vbox/vmware virtual machines: nice try, but let's see who runs faster.

Usage

The following illustrates the basic procedure for using this tool. It provides two ways to set up a Hadoop and Spark cluster: from-ambari and from-source.

From Source

The only step is to run from-source/run.sh.

$ ./run.sh --help
Usage: ./run.sh hadoop|spark [--rebuild] [--nodes=N]

hadoop       Make running mode to hadoop
spark        Make running mode to spark
--rebuild    Rebuild hadoop if in hadoop mode; else rebuild spark
--nodes      Specify the number of total nodes
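
As a minimal sketch, the invocation above could be wrapped in a small helper that checks its arguments before anything is launched. `build_run_cmd` is a hypothetical name, not part of this tool, and it uses only the options listed in the help output. It prints the command instead of running it, so nothing starts until you execute the result yourself from the from-source directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of caochong): builds a from-source/run.sh
# command line from a mode and a node count, using only documented options.
build_run_cmd() {
    mode="$1"
    nodes="$2"
    rebuild="${3:-}"

    # The help text lists exactly two modes.
    case "$mode" in
        hadoop|spark) ;;
        *) echo "mode must be hadoop or spark" >&2; return 1 ;;
    esac

    # --nodes is the total number of nodes; reject empty, non-numeric and 0.
    case "$nodes" in
        ''|*[!0-9]*|0) echo "nodes must be a positive integer" >&2; return 1 ;;
    esac

    cmd="./run.sh $mode --nodes=$nodes"
    if [ "$rebuild" = "rebuild" ]; then
        cmd="$cmd --rebuild"
    fi
    echo "$cmd"
}

# Print the command for a three-node Hadoop cluster.
build_run_cmd hadoop 3
```

Passing `rebuild` as a third argument, e.g. `build_run_cmd spark 5 rebuild`, appends `--rebuild` to the printed command.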

From Ambari

Apache Ambari is a tool for provisioning, managing, and monitoring Apache Hadoop clusters. If you use Ambari to set up Hadoop and Spark, Spark will run in YARN client/cluster mode.

  1. [Optional] Choose the Ambari version in from-ambari/Dockerfile (default: Ambari 2.2)

  2. Run from-ambari/run.sh to set up an Ambari cluster and launch it

    $ ./run.sh --help
    Usage: ./run.sh [--nodes=3] [--port=8080] --secure
    
    --nodes      Specify the number of total nodes
    --port       Specify the port of your local machine to access Ambari Web UI (8080 - 8088)
    --secure     Specify the cluster to be secure
    
  3. Open http://localhost:port in a browser on your local computer, where port is the value passed to run.sh with --port. By default, it is http://localhost:8080. NOTE: Ambari Server can take some time to come up fully and be ready to accept connections. Keep hitting the URL until you get the login page.

  4. Log in to the Ambari web page with the default username:password, admin:admin.

  5. [Optional] Customize the repository Base URLs in the Select Stack step.

  6. On the Install Options page, use the hostnames reported by run.sh as the Fully Qualified Domain Name (FQDN). For example:

    Using the following hostnames:
    ------------------------------
    85f9417e3d94
    b5077ffd9f7f
    ------------------------------
    
  7. Upload from-ambari/id_rsa as your SSH Private Key to automatically register hosts when asked.

  8. Follow the onscreen instructions to install Hadoop (YARN + MapReduce2, HDFS) and Spark.

  9. [Optional] Log in to any of the nodes and you're all set to use an Ambari cluster!

    # log in to your Ambari server node
    $ docker exec -it caochong-ambari-0 /bin/bash
    
  10. [Optional] If you want to make the cluster secure, log in to your Ambari server node and run install_Kerberos.sh (you may need to do "chmod 777 install_Kerberos.sh"). Then go back to the Ambari web page and follow the onscreen instructions to configure Kerberos.
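
Step 3 asks you to keep hitting the URL until the login page appears. That wait could be scripted roughly as follows. This is only a sketch: `wait_for_ambari` is a hypothetical helper that does not ship with this tool, the retry count and delay are arbitrary defaults, and it assumes curl is installed and that the Ambari Web UI answers with a success status once it is ready.

```shell
#!/bin/sh
# Hypothetical helper (not part of caochong): polls the Ambari Web UI until it
# answers over HTTP, or gives up after a fixed number of attempts.
wait_for_ambari() {
    port="${1:-8080}"      # local port passed to run.sh via --port
    attempts="${2:-60}"    # assumed retry budget, adjust as needed
    delay="${3:-5}"        # assumed pause between attempts, in seconds

    i=1
    while [ "$i" -le "$attempts" ]; do
        # -s hides progress output, -f makes HTTP errors (>= 400) count as
        # failures, -m 5 caps each request at five seconds, and
        # -o /dev/null discards the page body.
        if curl -sf -m 5 -o /dev/null "http://localhost:$port"; then
            echo "Ambari is answering on http://localhost:$port"
            return 0
        fi
        # Pause only when another attempt will follow.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done

    echo "Ambari did not answer on port $port after $attempts attempts" >&2
    return 1
}
```

After from-ambari/run.sh returns, calling `wait_for_ambari 8080` would block until the web UI responds or the retry budget runs out.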

License

Unlike all other Apache projects, which use the Apache license, this project uses an advanced and modern license named The Star And Thank Author License (SATA). Please see the LICENSE file for more information.

Contributors

cmbasics, liuml07, weiqingy
