Multitool

Welcome

This is the Cascading.Multitool (Multitool) application.

Multitool provides a simple command-line interface for building data processing jobs. Think of it as grep, sed, and awk for Hadoop, with support for joins between multiple data sets.

For example, with $HADOOP_HOME/bin/ in your PATH, the following command,

$ hadoop jar multitool-<release-date>.jar source=input.txt select=Monday sink=outputDir

will start a Hadoop job that reads the source file input.txt, selects all lines containing the word Monday, and writes the results to the outputDir directory.
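As a rough local analogue of that job (not Multitool itself), the same filtering can be sketched with plain grep on a small file. The helper name select_lines and the sample data are hypothetical, and treating the select value as a regular expression is an assumption.

```shell
# select_lines PATTERN FILE
# Hypothetical helper: keep only the lines of FILE that match PATTERN,
# roughly what select=PATTERN does to the source (assumed regex semantics).
select_lines() {
  grep -E -- "$1" "$2"
}

# Hypothetical sample input written to a temporary file.
input=$(mktemp)
printf 'Monday standup\nTuesday review\nMonday retro\n' > "$input"

select_lines 'Monday' "$input"
# prints:
# Monday standup
# Monday retro
```

Trying a pattern locally like this on a small sample can be a cheap way to check it before starting a Hadoop job.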

Multitool inherits the underlying Hadoop configuration, so if the default FileSystem is HDFS, all paths are resolved against the cluster filesystem, not the local one. Fully qualified URLs override the default (for example, file:///some/path or s3n://bucket/file).

This application is built with Cascading.

Cascading is a feature-rich API for defining and executing complex, scale-free, and fault-tolerant data processing workflows on a Hadoop cluster. It can be found at http://www.cascading.org/

Installing

This step is not necessary if you wish to run Multitool directly from the uncompressed distribution folder, or if Multitool was pre-installed with your Hadoop distribution.

To see if Multitool has already been added to your PATH, type:

$ which multitool

To install for all users into /usr/local/bin:

$ sudo ./bin/multitool install

or for the current user only into ~/.multitool:

$ ./bin/multitool install

For detailed instructions:

$ ./bin/multitool help install

Choose the method that best suits your environment.

If you are running Multitool on AWS Elastic MapReduce, you need to follow the Elastic MapReduce instructions on the AWS site, which typically expect the multitool-<release-date>.jar to be uploaded to AWS S3.

Using

The environment variable HADOOP_HOME should always be set before using Multitool.

To run from the command line with the jar, the hadoop command should be on your PATH:

$ hadoop jar multitool-<release-date>.jar <args>

...or if Multitool has been installed based on the instructions above:

$ multitool source=data/artist.100.txt cut=0 sink=output

This will cut the first field out of each line of data/artist.100.txt and save the results to output.

If no args are given, a comprehensive list of commands will be printed. That list is also available as COMMANDS.md in this directory.

Examples

For more detailed examples of using Multitool, see also: http://cascading.org/multitool/

Copying:

$ ./bin/multitool source=input.txt sink=outputDir

Copying while removing the first header line, and overwriting output:

$ ./bin/multitool source=input.txt source.skipheader=true sink=outputDir sink.replace=true

Filter out data:

$ ./bin/multitool source=input.txt "reject=some words" sink=outputDir
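For intuition, the header-skipping and reject steps above behave much like tail and grep -v in a local pipeline. This is only a sketch under assumptions: the helper names skip_header and reject_lines and the sample data are hypothetical, and it assumes reject drops every line matching the given pattern.

```shell
# skip_header: drop the first line of stdin
# (local analogue of source.skipheader=true).
skip_header() {
  tail -n +2
}

# reject_lines PATTERN: drop stdin lines matching PATTERN
# (assumed local analogue of reject=PATTERN).
reject_lines() {
  grep -v -E -- "$1"
}

# Hypothetical tab-separated sample with a header line.
sample=$(mktemp)
printf 'id\ttext\n1\tkeep this line\n2\tsome words here\n3\tkeep this too\n' > "$sample"

skip_header < "$sample" | reject_lines 'some words'
# prints lines 1 and 3 of the data, without the header
```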

For a more complex example:

$ ./bin/multitool source=data/topic.100.txt cut=0 \
"pgen=(\b[12][09][0-9]{2}\b)" group=0 count=0 group=1 \
sink=output sink.replace=true sink.parts=1

This will find all years in the input file, count them, and sort them by counts.
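The same steps can be approximated locally with standard Unix tools, which may help when checking the regular expression on a small sample first. This is a sketch, not Multitool's implementation: the helper name count_years and the sample data are hypothetical, it assumes tab-delimited fields with field 0 being the first, assumes ascending order for the final sort, and relies on grep supporting \b (as GNU grep does). Note that the pattern [12][09][0-9]{2} matches four-digit numbers starting with 10, 19, 20, or 29.

```shell
# count_years FILE
# Hypothetical helper mirroring the pipeline stages:
#   cut=0                 -> cut -f1          (first tab-delimited field)
#   pgen=(...)            -> grep -oE         (emit each regex match on its own line)
#   group=0 count=0       -> sort | uniq -c   (count occurrences per year)
#   group=1               -> sort -n          (order by count, ascending assumed)
count_years() {
  cut -f1 "$1" \
    | grep -oE '\b[12][09][0-9]{2}\b' \
    | sort \
    | uniq -c \
    | sort -n \
    | awk '{ print $2 "\t" $1 }'
}

# Hypothetical sample: the year after the tab is in the second field and is ignored.
topics=$(mktemp)
printf 'In 1999 and 2004\tignored 1950\n2004 again, 2004 twice, 1999 once\n1987\n' > "$topics"

count_years "$topics"
# prints (year, tab, count):
# 1987	1
# 1999	2
# 2004	3
```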

Building

To build Multitool, you may download the source code from GitHub:

https://github.com/cascading/cascading.multitool

This release will pull all dependencies from the relevant Maven repositories, including http://conjars.org

To build a jar,

$ ant retrieve jar

To test,

$ ant test

License

See apl.txt in this directory.
