Giter Club home page Giter Club logo

cascading-flink's Introduction

Cascading Connector for Apache Flink

Cascading is a popular framework to develop, maintain, and execute large-scale and robust data analysis applications. Originally, Cascading flows have been executed on Apache Hadoop. Cascading's 3.0 release added support for Apache Tez as a runtime backend.

Apache Flink is a platform for scalable stream and batch processing. Flink's execution engine features low-latency pipelined and scalable batched data transfers and high-performance, in-memory operators for sorting and joining that gracefully go out-of-core in case of scarce memory resources. Flink can execute programs on a local machine, in a dedicated cluster, or on Hadoop YARN setups.

The Cascading Connector for Apache Flink enables you to execute Cascading flows on Apache Flink. Using this connector, your Cascading flows will benefit from Flink's runtime features such as its pipelined data shuffles and its efficient and robust in-memory operators.

Please report any problems with the Cascading on Flink Connector by opening an issue.

Features & Limitations

The Cascading Connector for Apache Flink supports most Cascading and Flink features.

  • All Cascading operators (except for BufferJoins) are directly executed on Flink's memory-safe operators. This significantly reduces the need for cumbersome parameter tuning such as spill thresholds and the risk for OutOfMemoryErrors.
  • Flink's runtime leverages field type information of Cascading programs. Apache Flink uses specialized serializers and comparators to efficiently operate on binary data. Cascading flows that specify the type of key fields benefit from significant performance improvements.
  • Supports all execution modes of Apache Flink, i.e., local, cluster, YARN, and in-IDE debugging.
  • Supports all Cascading Hadoop Taps and Schemes. Local Taps and Schemes are not supported, but Hadoop Taps support reading from local file systems.

However, there are also a few limitations, which we are still working on, namely:

  • Only Inner- and LeftJoins are supported for HashJoin pipes. The remaining join types will be available once Flink supports hash-based full outer joins.

Install

The latest version of the Cascading Connector for Apache Flink is version 0.2 (released on 2016-04-12). It depends on Apache Flink 1.0.1 and Cascading 3.1.0 which is not released yet but available as 3.1.0-wip-56 at Conjars. All required dependencies are provided by including the Cascading Flink connector in your project.

The Cascading Connector for Apache Flink is available on Maven central can be used in your project by adding the following Maven dependency:

<dependency>
    <groupId>com.data-artisans</groupId>
    <artifactId>cascading-flink</artifactId>
    <version>0.2</version>
</dependency>

WordCount Example

Next, let's run the classic WordCount example. Before running it in our ETL pipeline on the cluster, we want to run it locally first. The Cascading Connector for Apache Flink contains a built-in local execution mode to test your applications before deploying them.

Here is the main method of our WordCount example:

public static void main(String[] args) {

    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "\\s+" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new AggregateBy( wcPipe, token, new CountBy(new Fields("count")));

    Tap inTap = new Hfs(new TextDelimited(text, "\n" ), args[0]);
    Tap outTap = new Hfs(new TextDelimited(false, "\n"), args[1], SinkMode.REPLACE);

    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
            .addSource( docPipe, inTap )
            .addTailSink( wcPipe, outTap );

    FlowConnector flowConnector = new FlinkConnector();

	Flow wcFlow = flowConnector.connect( flowDef );

    wcFlow.complete();
}

That's really all the code you need to run a WordCount.

Local execution

Now let's see how our WordCount performs.

To execute the example, let's first get some sample data:

curl http://www.gutenberg.org/cache/epub/1128/pg1128.txt > kinglear.txt

Then let's run the included WordCount locally on your machine:

mvn exec:exec -Dinput=kinglear.txt -Doutput=wordcounts.txt

Congratulations! You have run your first Cascading application on top of Apache Flink.

Cluster Execution

To execute your program on a cluster, you need to build a "fat jar" containing your program and the Flink-Cascading dependencies. Use the assembly plugin or the shading plugin to achieve that.

Then submit the jar to a Flink cluster by using the command-line utility:

cd flink-dist
cd bin
./flink run -c <entry_class> <path_to_jar> <jar_arguments>

This will use the cluster's default parallelism. You may explicitly specify the parallelism by setting the -p flag.

./flink run -p 20 -c <entry_class> <path_to_jar> <jar_arguments>

This will set the default parallelism for operators to 20 which means Flink tries to run 20 parallel instances of each operation.

YARN cluster

If you want to run your Flink cluster on YARN, you can do that as follows:

./flink run -m yarn-cluster -yn 10 -p 20 -c <entry_class> <path_to_jar> <jar_arguments>

This will create a Flink cluster for your job consisting of 10 task managers (workers) with a default parallelism of 20.

WordCount on the Cluster

Now let's run the included WordCount example on the cluster.

./flink run -c com.dataartisans.flinkCascading.example.WordCount cascading-flink.jar hdfs:///input hdfs:///output

Or on a YARN cluster:

./flink run  -m yarn-cluster -yn 10 -c com.dataartisans.flinkCascading.example.WordCount cascading-flink.jar hdfs:///input hdfs:///output

Building

To retrieve the latest development version (0.3-SNAPSHOT) of the Cascading Connector for Apache Flink, run the following command

git clone https://github.com/dataArtisans/cascading-flink.git

Then switch to the newly created directory and run Maven to build the Cascading Connector for Apache Flink:

cd cascading-flink
mvn clean install -DskipTests

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.