dataflow-java

If you are ready to start coding, see the information below. If you are instead looking for a task-oriented list (e.g., How do I compute principal coordinate analysis with Google Genomics?), a better place to start is the Google Genomics Cookbook.

Getting started

  1. First git clone this repository.
  2. If you have not already done so, follow the Google Genomics getting started instructions to set up your environment including installing gcloud and running gcloud init.
  3. If you have not already done so, follow the Google Cloud Dataflow getting started instructions to set up your environment for Dataflow.
  4. This project now includes code for calling the Genomics API using gRPC. To use gRPC, you'll need a version of ALPN that matches your JRE version.
     1. See the ALPN documentation for a table of which ALPN jar to use for your JRE version.
     2. Then download the correct version from here.
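
Because the ALPN jar has to match the JRE, it can help to print the exact version string of the runtime that will launch the pipeline. The following is a minimal sketch that reads only standard system properties; the class name JreVersionCheck is made up for this example and is not part of this repository.

```java
// Hypothetical helper, not part of dataflow-java.
final class JreVersionCheck {

    // Formats the version and vendor on one line so the value is easy to
    // compare against the ALPN version table.
    static String describe(String version, String vendor) {
        return "JRE version: " + version + " (" + vendor + ")";
    }

    public static void main(String[] args) {
        // java.version and java.vendor are standard JVM system properties.
        System.out.println(describe(
            System.getProperty("java.version"),
            System.getProperty("java.vendor")));
    }
}
```

Run it with the same java binary you plan to use for the pipeline, since different installed JREs may need different ALPN jars.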

Local Run

To use this code, build the client using Apache Maven:

cd dataflow-java
mvn package

Then you can run a pipeline locally from the command line, passing in the Project ID and Google Cloud Storage bucket you created during setup. This command runs the VariantSimilarity pipeline, which runs principal coordinates analysis (PCoA) on a dataset:

java -Xbootclasspath/p:/YOUR/PATH/TO/alpn-boot-YOUR-VERSION.jar \
  -cp target/google-genomics-dataflow*-runnable.jar \
  com.google.cloud.genomics.dataflow.pipelines.VariantSimilarity \
  --variantSetId=3049512673186936334 \
  --references=chr17:41196311:41277499 \
  --output=gs://your-bucket/output/localtest.txt
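
The --references value in the command above looks like a reference name followed by two coordinates, separated by colons. The sketch below shows one way such a value could be split and sanity-checked before launching a pipeline. It is an illustration only: the ReferenceRange class is hypothetical, it is not the parser this project uses, and it makes no claim about whether the coordinates are 0-based or whether the end is inclusive.

```java
// Hypothetical helper, not part of dataflow-java: splits a value such as
// "chr17:41196311:41277499" into a reference name and two coordinates.
final class ReferenceRange {
    final String name;
    final long start;
    final long end;

    private ReferenceRange(String name, long start, long end) {
        this.name = name;
        this.start = start;
        this.end = end;
    }

    // Assumes the form name:start:end, as in the example command.
    static ReferenceRange parse(String spec) {
        // The -1 limit keeps trailing empty fields so "a:1:2:" is rejected.
        String[] parts = spec.split(":", -1);
        if (parts.length != 3 || parts[0].isEmpty()) {
            throw new IllegalArgumentException(
                "Expected name:start:end but got: " + spec);
        }
        // Long.parseLong throws NumberFormatException on non-numeric input.
        long start = Long.parseLong(parts[1]);
        long end = Long.parseLong(parts[2]);
        if (start < 0 || end < start) {
            throw new IllegalArgumentException(
                "Coordinates out of order in: " + spec);
        }
        return new ReferenceRange(parts[0], start, end);
    }

    // Difference between the two coordinates.
    long span() {
        return end - start;
    }

    public static void main(String[] args) {
        ReferenceRange r = parse("chr17:41196311:41277499");
        System.out.println(r.name + ": end - start = " + r.span());
    }
}
```

Checking the value up front gives a clear error message for a mistyped region instead of a failure partway through a run.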

Run on Google Compute Engine

To deploy your pipeline (which runs on Google Compute Engine), ALPN is no longer needed but some additional command line arguments are required:

java -cp target/google-genomics-dataflow*-runnable.jar \
  com.google.cloud.genomics.dataflow.pipelines.VariantSimilarity \
  --project=your-project-id \
  --variantSetId=3049512673186936334 \
  --references=chr17:41196311:41277499 \
  --output=gs://your-bucket/output/test.txt \
  --runner=BlockingDataflowPipelineRunner \
  --stagingLocation=gs://your-bucket/staging \
  --numWorkers=1

See the Google Genomics Cookbook for more sample command lines for the various pipelines.

Command Line Options

Use --help to get more information about the command line options. Change the pipeline class name below to match the one you would like to run:

java -cp target/google-genomics-dataflow*-runnable.jar \
  com.google.cloud.genomics.dataflow.pipelines.VariantSimilarity --help

Code layout

The Main code directory contains several useful utilities:

coders: includes Coder classes that are useful for Genomics pipelines.

functions: this directory and its subdirectories contain common DoFns and other serializable function types that can be reused as part of any pipeline. OutputPCoAFile is an example of a complex PTransform that provides a useful common analysis.

model: POJOs for intermediate data types used in the sample pipelines. If you are instead looking for the Reads and Variants Java objects:

pipelines: contains example pipelines that demonstrate how Google Cloud Dataflow can work with Google Genomics.

  • VariantSimilarity runs a principal coordinates analysis over a dataset containing variants, and writes a file of graph results that can be easily displayed by Google Sheets.
  • IdentityByState runs IBS over a dataset containing variants. See the results/ibs directory for more information on how to use the pipeline's results.
  • and several others!

readers: contains functions that perform API calls to read data from the Genomics API and also from BAM files in Cloud Storage.

writers: contains functions that perform API calls to write data to BAM files in Cloud Storage.

utils: contains utilities for running Dataflow workflows against the Genomics API.

  • GenomicsOptions.java, ShardOptions and GCSOutputOptions: use these classes for your command line options to take advantage of common command line functionality.

Maven artifact

This code is also deployed as Maven artifacts through Sonatype, including both a normal jar and a runnable jar containing all dependencies (a fat jar). The utils-java readme has detailed instructions on how to deploy new versions.

To depend on this code, add the following to your pom.xml file:

<project>
  <dependencies>
    <dependency>
      <groupId>com.google.cloud.genomics</groupId>
      <artifactId>google-genomics-dataflow</artifactId>
      <version>LATEST</version>
    </dependency>
  </dependencies>
</project>

You can find the latest version in Maven's central repository.

For an example pipeline that depends on this code in another GitHub repository, see https://github.com/googlegenomics/codelabs/tree/master/Java/PlatinumGenomes-variant-transformation.

Local development

With Maven you can install a SNAPSHOT version of this code (or any of its dependencies) locally, so that other projects can use it directly without waiting for a release to reach the Maven repository. Use:

mvn install

to run the full tests and do a local install. You can also use

mvn install -DskipITs

to run only the unit tests and do a local install. This is faster.

Project status

Goals

  • Provide a Maven artifact which makes it easier to use Google Genomics within Google Cloud Dataflow.
  • Provide some example pipelines which demonstrate how Dataflow can be used to analyze Genomics data.

Current status

This code is in active development. See the GitHub issues for more detail.
