agilemobiledev / dataflowjavasdk-examples

This project is forked from googlecloudplatform/dataflowsdk-examples.

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines. This repository hosts a few example pipelines to get you started with Dataflow.

Home Page: https://cloud.google.com/dataflow

License: Apache License 2.0

Examples of the Cloud Dataflow SDK for Java

Google Cloud Dataflow provides a simple, powerful programming model for building both batch and streaming parallel data processing pipelines. This repository hosts example pipelines that use the Cloud Dataflow SDK for Java and demonstrate the basic functionality of Google Cloud Dataflow.

Word Count

A good starting point for new users is our set of word count examples, which compute word frequencies. This series of four successively more detailed pipelines is described in the accompanying walkthrough.

  1. MinimalWordCount is the simplest word count pipeline and introduces basic concepts like Pipelines, PCollections, ParDo, and reading and writing data from external storage.

  2. WordCount introduces Dataflow best practices like PipelineOptions and custom PTransforms.

  3. DebuggingWordCount shows how to view live aggregators in the Dataflow Monitoring Interface, get the most out of Cloud Logging integration, and start writing good tests.

  4. WindowedWordCount shows how to run the same pipeline over either unbounded PCollections in streaming mode or bounded PCollections in batch mode.
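As a rough illustration of what "computing word frequencies" means, the following plain-Java sketch counts words in a list of lines. It deliberately does not use the Cloud Dataflow SDK, so it shows only the computation, not Pipelines, PCollections, or ParDo. The class name WordCountSketch and the tokenization rule (a word is a run of letters or apostrophes) are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; it does not use the Cloud Dataflow SDK.
// WordCountSketch is a hypothetical name, not a class in this repository.
final class WordCountSketch {

    // Splits each line into words and counts how often each word occurs.
    // Assumption: a "word" is a maximal run of letters or apostrophes.
    static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("[^a-zA-Z']+")) {
                // split() can yield an empty leading token; skip it.
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts =
            countWords(Arrays.asList("to be or not to be", "be quick"));
        counts.forEach((word, count) -> System.out.println(word + ": " + count));
    }
}
```

The example pipelines express the same two steps (extract words, then count per word) as parallel transforms rather than a single in-memory loop.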

Building and Running

The examples in this repository can be built and executed from the root directory by running:

mvn compile exec:java \
-Dexec.mainClass=<MAIN CLASS> \
-Dexec.args="<EXAMPLE-SPECIFIC ARGUMENTS>"

This will use the latest release of the Cloud Dataflow SDK for Java pulled from the Maven Central Repository.

For example, you can execute the WordCount pipeline on your local machine as follows:

mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
-Dexec.args="--inputFile=<LOCAL INPUT FILE> --output=<LOCAL OUTPUT FILE>"
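The value of -Dexec.args reaches the example's main method as ordinary command-line arguments of the form --name=value. The sketch below shows one way such arguments could be split into names and values. It is a hypothetical stand-in written for this explanation: the examples themselves read their options through the SDK's PipelineOptions, not through this helper, and the name ArgsSketch is invented here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; NOT the SDK's PipelineOptions mechanism.
final class ArgsSketch {

    // Parses arguments of the form --name=value into an insertion-ordered map.
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            // Require the "--" prefix and a non-empty name before '='.
            if (!arg.startsWith("--") || eq < 3) {
                throw new IllegalArgumentException(
                    "Expected --name=value but got: " + arg);
            }
            options.put(arg.substring(2, eq), arg.substring(eq + 1));
        }
        return options;
    }

    public static void main(String[] args) {
        System.out.println(parse(args));
    }
}
```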

Once you have followed the general Cloud Dataflow Getting Started instructions, you can execute the same pipeline on fully managed resources in Google Cloud Platform:

mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
-Dexec.args="--project=<YOUR CLOUD PLATFORM PROJECT ID> \
--stagingLocation=<YOUR CLOUD STORAGE LOCATION> \
--runner=BlockingDataflowPipelineRunner"

Make sure to use your project id, not the project number or the descriptive name. The Cloud Storage location should be entered in the form of gs://bucket/path/to/staging/directory.
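The two requirements above can be checked before launching a job. The following is a minimal sketch of such a pre-flight check, assuming only the shape rules stated here: the staging location looks like gs://bucket/path, and the project value is not purely numeric (project numbers are). The class and method names are hypothetical, and the checks are heuristics; they do not verify that the bucket or project actually exists.

```java
import java.util.regex.Pattern;

// Hypothetical pre-flight checks; shape heuristics only, no calls to Cloud Platform.
final class LaunchArgsCheck {

    // gs://<bucket>/<path>, with a non-empty bucket and a non-empty path.
    private static final Pattern GCS_PATH = Pattern.compile("gs://[^/]+/.+");

    static boolean looksLikeGcsPath(String location) {
        return GCS_PATH.matcher(location).matches();
    }

    // Project numbers are purely numeric, so an all-digit value is
    // probably a project number rather than a project ID.
    static boolean looksLikeProjectNumber(String project) {
        return !project.isEmpty() && project.chars().allMatch(Character::isDigit);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeGcsPath("gs://bucket/path/to/staging/directory"));
        System.out.println(looksLikeProjectNumber("123456789012"));
    }
}
```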

Alternatively, you may choose to bundle all dependencies into a single JAR and execute it outside of the Maven environment. For example, you can execute the following commands to create the bundled JAR of the examples and execute it both locally and in Cloud Platform:

mvn package

java -cp target/google-cloud-dataflow-java-examples-all-bundled-manual_build.jar \
com.google.cloud.dataflow.examples.WordCount \
--inputFile=<INPUT FILE PATTERN> --output=<OUTPUT FILE>

java -cp target/google-cloud-dataflow-java-examples-all-bundled-manual_build.jar \
com.google.cloud.dataflow.examples.WordCount \
--project=<YOUR CLOUD PLATFORM PROJECT ID> \
--stagingLocation=<YOUR CLOUD STORAGE LOCATION> \
--runner=BlockingDataflowPipelineRunner

Other examples can be run similarly by replacing the WordCount class name with the fully qualified name of the example class, e.g. com.google.cloud.dataflow.examples.cookbook.BigQueryTornadoes, and adjusting the runtime options passed in the -Dexec.args parameter, as specified in the example itself.

Note that when running Maven on Microsoft Windows, each backslash (\) in the -Dexec.args parameter must be escaped with another backslash. For example, an input file pattern of c:\*.txt should be entered as c:\\*.txt.
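If the argument string is built programmatically, the escaping rule amounts to doubling every backslash. A minimal sketch, with a hypothetical class and method name:

```java
// Hypothetical helper illustrating the backslash-doubling rule described above.
final class ExecArgsEscaper {

    // Replaces every single backslash with two backslashes.
    static String escapeBackslashes(String value) {
        return value.replace("\\", "\\\\");
    }

    public static void main(String[] args) {
        // The Java literal "c:\\*.txt" is the string c:\*.txt
        System.out.println(escapeBackslashes("c:\\*.txt")); // prints c:\\*.txt
    }
}
```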

Beyond Word Count

After you've finished running your first few word count pipelines, take a look at the cookbook directory for some common and useful patterns like joining, filtering, and combining.

The complete directory contains a few realistic end-to-end pipelines.

Additional Resources

For more information on Google Cloud Dataflow, see the product home page at https://cloud.google.com/dataflow.

Contributors

davorbonaci, francesperry, kennknowles
