Giter Club home page Giter Club logo

pigpen's Introduction

PigPen is map-reduce for Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.

##What is PigPen?

  • A map-reduce language that looks and behaves like clojure.core
  • The ability to write map-reduce queries as programs, not scripts
  • Strong support for unit tests and iterative development

Really, yet another map-reduce language?

If you know Clojure, you already know PigPen

The primary goal of PigPen is to take language out of the equation. PigPen operators are designed to be as close as possible to the Clojure equivalents. There are no special user defined functions (UDFs). Define Clojure functions, anonymously or named, and use them like you would in any Clojure program.

Here's the proverbial word count:

(require '[pigpen.core :as pig])

(defn word-count [lines]
  (->> lines
    (pig/mapcat #(-> % first
                   (clojure.string/lower-case)
                   (clojure.string/replace #"[^\w\s]" "")
                   (clojure.string/split #"\s+")))
    (pig/group-by identity)
    (pig/map (fn [[word occurrences]] [word (count occurrences)]))))

This defines a function that returns a PigPen query expression. The query takes a sequence of lines and returns the frequency that each word appears. As you can see, this is just the word count logic. We don't have to conflate external concerns, like where our data is coming from or going to.

Will it compose?

Yep - PigPen queries are written as function compositions - data in, data out. Write it once and avoid the copy & paste routine.

Here we use our word-count function (defined above), along with a load and store command, to make a PigPen query:

(defn word-count-query [input output]
  (->>
    (pig/load-tsv input)
    (word-count)
    (pig/store-tsv output)))

This function returns the PigPen representation of the query. By itself, it won't do anything - we have to execute it locally or generate a script (more on that later).

You like unit tests? Yeah, we do that

With PigPen, you can mock input data and write a unit test for your query. No more crossing your fingers & wondering what will happen when you submit to the cluster. No more separate files for test input & output.

Mocking data is really easy. With pig/return and pig/constantly, you can inject arbitrary data as a starting point for your script.

A common pattern is to use pig/take to sample a few rows of the actual source data. Wrap the result with pig/return and you've got mock data.

(use 'clojure.test)

(deftest test-word-count
  (let [data (pig/return [["The fox jumped over the dog."]
                          ["The cow jumped over the moon."]])]
    (is (= (pig/dump (word-count data))
           [["moon" 1] ["jumped" 2] ["dog" 1] ["over" 2] ["cow" 1] ["fox" 1] ["the" 4]]))))

The pig/dump operator runs the query locally.

Closures (yes, the kind with an S)

Parameterizing your query is trivial. Any in-scope function parameters or let bindings are available to use in functions.

(defn reusable-fn [lower-bound data]
  (let [upper-bound (+ lower-bound 10)]
    (pig/filter (fn [x] (< lower-bound x upper-bound)) data)))

Note that lower-bound and upper-bound are present when we generate the script, and are made available when the function is executed within the cluster.

Note: To exclude a local variable, add the metadata ^:local to the declaration.

So how do I use it?

Just tell PigPen where to write the query as a Pig script:

(pig/write-script "word-count.pig" (word-count-query "input.tsv" "output.tsv"))

And then you have a Pig script which you can submit to your cluster. The script uses pigpen.jar, an uberjar with all of the dependencies, so make sure that is submitted with it. Another option is to build an uberjar for your project and submit that instead. Just rename it prior to submission. Check out the tutorial for how to build an uberjar.

As you saw before, we can also use pig/dump to run the query locally and return Clojure data:

=> (def data (pig/return [["The fox jumped over the dog."]
                          ["The cow jumped over the moon."]]))
#'pigpen-demo/data

=> (pig/dump (word-count data))
[["moon" 1] ["jumped" 2] ["dog" 1] ["over" 2] ["cow" 1] ["fox" 1] ["the" 4]]

Getting started

Getting started with Clojure and PigPen is really easy. Just follow the steps below to get up and running.

  1. Install Leiningen
  2. Create a new leiningen project with lein new pigpen-demo
  3. Add PigPen as a dependency by adding [com.netflix.pigpen/pigpen "0.1.2"] into your project's project.clj file.
  4. Run lein repl to start a REPL for your new project.
  5. Try some samples in the tutorial

Note: If you are not familiar at all with Clojure, I strongly recommend that you try a tutorial here, here, or here to understand some of the basics.

Note: Make sure you're using Clojure 1.5.1 or greater

Note: PigPen is not a Clojure wrapper for writing Pig scripts you can hand edit. While it's entirely possible, the resulting scripts are not intended for human consumption.

Tutorials & Documentation

There are three distinct audiences for PigPen, so we wrote three different tutorials:

If you know both Clojure and Pig, you'll probably find all of the tutorials interesting.

The full API documentation is located here

Questions & Complaints: [email protected]

Artifacts

pigpen is available from Maven:

With Leiningen:

[com.netflix.pigpen/pigpen "0.1.2"]

With Maven:

<dependency>
  <groupId>com.netflix.pigpen</groupId>
  <artifactId>pigpen</artifactId>
  <version>0.1.2</version>
</dependency>

Note: Make sure you're using Clojure 1.5.1 or greater

pigpen's People

Contributors

mbossenbroek avatar quidryan avatar randgalt avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.