
hadoop-hbase-streaming's Introduction

Initial Setup

All instructions pertain to CDH3u3 on CentOS 6.

Place hadoop-hbase-streaming.jar at /usr/local/hadoop-hbase-streaming.jar.

Add the following lines to /etc/hadoop-0.20/conf/hadoop-env.sh:

export HADOOP_CLASSPATH="/usr/local/hadoop-hbase-streaming.jar:$HADOOP_CLASSPATH"
export HADOOP_CLASSPATH="/usr/lib/hbase/lib/guava-r06.jar:$HADOOP_CLASSPATH"
export HADOOP_CLASSPATH="/usr/lib/hbase/hbase-0.90.4-cdh3u3.jar:$HADOOP_CLASSPATH"
export HADOOP_CLASSPATH="/usr/lib/zookeeper/zookeeper-3.3.4-cdh3u3.jar:$HADOOP_CLASSPATH" 

Loading Data into HBase

In the HBase shell, create the output table with the appropriate column families:

create 'outputtable', {NAME=>'cf1'}, {NAME=>'cf2'}

Write a reducer that emits records in the following format (tab-delimited):

put	<rowid>	<cf>	<qualifier>	<value>
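As an illustration, a streaming reducer that emits this format could be sketched in Python as follows. The input layout (`rowid<TAB>cf:qualifier<TAB>value`) and the function names `format_put` and `reduce_lines` are assumptions made for this example; they are not part of hadoop-hbase-streaming.

```python
def format_put(rowid, family, qualifier, value):
    """Build one tab-delimited put line in the format shown above."""
    fields = [str(rowid), str(family), str(qualifier), str(value)]
    for field in fields:
        # A tab or newline inside a field would corrupt the record layout.
        if "\t" in field or "\n" in field:
            raise ValueError("field contains a tab or newline: %r" % field)
    return "\t".join(["put"] + fields)


def reduce_lines(lines):
    """Convert hypothetical 'rowid<TAB>cf:qualifier<TAB>value' lines to put lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        rowid, column, value = line.split("\t", 2)
        family, qualifier = column.split(":", 1)
        yield format_put(rowid, family, qualifier, value)


if __name__ == "__main__":
    # Demonstration on in-memory sample data (assumed input layout).
    sample = ["r1\tcf1:test\tValue1\n", "r1\tcf2:test\tValue2\n"]
    for put_line in reduce_lines(sample):
        print(put_line)
```

In an actual streaming job the script would iterate over `sys.stdin` instead of the in-memory sample and be passed to the streaming jar with `-reducer`.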

Run your MapReduce job with the OutputFormat set to org.childtv.hadoop.hbase.mapred.ListTableOutputFormat.

As a test, create a file called source_input/test.tab containing the expected reducer output.

An example of the reducer output might be (tab-delimited):

put	r1	cf1	test	Value1
put	r1	cf2	test	Value2
put	r2	cf1	test	Value3 
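Tabs are easily turned into spaces by editors, so it can help to check a test file before submitting the job. The following is a minimal sketch of such a check. It assumes only the five-field `put` form shown above, and the helper names `check_put_line` and `check_file` are hypothetical.

```python
def check_put_line(line):
    """Return an error message for a malformed put line, or None if it looks valid."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        return "expected 5 tab-separated fields, found %d" % len(fields)
    if fields[0] != "put":
        return "first field must be 'put', found %r" % fields[0]
    if not fields[1] or not fields[2]:
        return "rowid and column family must be non-empty"
    return None


def check_file(path):
    """Collect (line_number, message) pairs for every malformed line in a file."""
    problems = []
    with open(path) as handle:
        for number, line in enumerate(handle, start=1):
            error = check_put_line(line)
            if error:
                problems.append((number, error))
    return problems
```

Calling `check_file("source_input/test.tab")` returns an empty list when every line has the expected shape.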

Then invoke Hadoop streaming with the output format set to org.childtv.hadoop.hbase.mapred.ListTableOutputFormat and the job configuration parameter reduce.output.table=outputtable:

hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u3.jar \
	-input source_input -output dummy_output \
	-mapper /bin/cat \
	-outputformat org.childtv.hadoop.hbase.mapred.ListTableOutputFormat \
	-jobconf reduce.output.table=outputtable 

This will write the provided fields to HBase.

Extracting Data from HBase

To read from HBase, first create a dummy input directory containing no files:

mkdir dummy_input

Select your desired InputFormat. Two exist:

JSON: org.childtv.hadoop.hbase.mapred.JSONTableInputFormat

Tabular values: org.childtv.hadoop.hbase.mapred.ListTableInputFormat

Select your desired input column families using the job configuration parameter map.input.columns.

The JSON format has the advantage of being stricter and more expressive. Each line holds the row ID, a tab, and a JSON object keyed by family:qualifier that carries the timestamp and value of each cell:

r1	{"cf2:test":{"timestamp":"1333428648468","value":"Value1"},"cf1:test":{"timestamp":"1333428678724","value":"Value2"}} 
r2	{"cf2:test":{"timestamp":"1333428656033","value":"Value3"},"cf1:test":{"timestamp":"1333428660721","value":"Value4"}} 
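A mapper can split such a line at the first tab and decode the remainder with a JSON parser. The sketch below assumes the line layout matches the sample above (string timestamps, `family:qualifier` keys). The function name `parse_json_row` is hypothetical.

```python
import json


def parse_json_row(line):
    """Split one JSON-format line into (rowid, cells).

    cells maps (family, qualifier) to (timestamp, value).
    """
    rowid, payload = line.rstrip("\n").split("\t", 1)
    cells = {}
    for column, cell in json.loads(payload).items():
        family, qualifier = column.split(":", 1)
        # Timestamps appear as strings in the sample, so convert them to int.
        cells[(family, qualifier)] = (int(cell["timestamp"]), cell["value"])
    return rowid, cells
```

Keying the result by `(family, qualifier)` keeps cells from different column families apart even when they share a qualifier, as `cf1:test` and `cf2:test` do in the sample.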

ListTableInputFormat includes only the row ID and the cell values. It does not include column names in any way:

r1	Value1 Value2
r2	Value3 Value4
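Parsing this format is a matter of splitting twice. The sketch below assumes the values are separated by a single space, as the sample suggests. If that assumption does not hold, or if a value itself contains the separator, the split will be wrong. The function name `parse_list_row` is hypothetical.

```python
def parse_list_row(line, separator=" "):
    """Split one list-format line into (rowid, [values]).

    The separator between values is assumed from the sample output.
    """
    rowid, _, payload = line.rstrip("\n").partition("\t")
    values = payload.split(separator) if payload else []
    return rowid, values
```

Because column names are absent, the caller has to know from the job configuration which position corresponds to which column.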

To run a test job on an HBase table called sourcetable with column families cf1 and cf2, run:

hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u3.jar \
	-input dummy_input -inputformat org.childtv.hadoop.hbase.mapred.JSONTableInputFormat \
	-mapper /bin/cat \
	-jobconf map.input.table=sourcetable -jobconf "map.input.columns=cf1 cf2" \
	-output myoutput 

This will produce a file, myoutput/part-00000, that contains the JSON output.

Contributors

wanpark, dmaust
