googlecloudplatform / dataflowtemplates

Cloud Dataflow Google-provided templates for solving in-Cloud data tasks

Home Page: https://cloud.google.com/dataflow/docs/guides/templates/provided-templates

License: Apache License 2.0

Java 88.93% JavaScript 0.31% Dockerfile 0.02% Python 0.23% PureBasic 0.01% Go 0.77% FreeMarker 0.01% HCL 9.73%

apache-beam dataflow-templates google-cloud-dataflow google-cloud-storage google-cloud-spanner bigquery bigtable

dataflowtemplates's Introduction

Google Cloud Dataflow Template Pipelines

These Dataflow templates are an effort to solve simple, but large, in-Cloud data tasks, including data import/export/backup/restore and bulk API operations, without a development environment. The technology under the hood which makes these operations possible is the Google Cloud Dataflow service combined with a set of Apache Beam SDK templated pipelines.

Google is providing this collection of pre-implemented Dataflow templates as a reference and to provide easy customization for developers wanting to extend their functionality.

Note on Default Branch

As of November 18, 2021, our default branch is now named "main". This does not affect forks. If you would like your fork and its local clone to reflect these changes you can follow GitHub's branch renaming guide.

Template Pipelines

For documentation on each template's usage and parameters, please see the official docs.

Contributing

To contribute to the repository, see CONTRIBUTING.md.

Release Process

Templates are released on a weekly basis (best-effort) as part of the effort to keep Google-provided Templates updated with the latest fixes and improvements.

To learn more about this process, or how you can stage your own changes, see Release Process.

More Information

Dataflow - general Dataflow documentation.
Dataflow Templates - basic template concepts.
Google-provided Templates - official documentation for templates provided by Google (the source code is in this repository).
Dataflow Cookbook: Blog, GitHub Repository - pipeline examples and practical solutions to common data processing challenges.
Dataflow Metrics Collector - CLI tool to collect Dataflow resource and execution metrics and export them to either BigQuery or Google Cloud Storage. Useful for comparing and visualizing metrics while benchmarking Dataflow pipelines with various data formats, resource configurations, etc.
Apache Beam
- Overview
- Quickstart: Java, Python, Go
- Tour of Beam - an interactive tour with learning topics covering core Beam concepts from simple ones to more advanced ones.
- Beam Playground - an interactive environment to try out Beam transforms and examples without having to install Apache Beam.
- Beam College - hands-on training and practical tips, including video recordings of Apache Beam and Dataflow Templates lessons.
- Getting Started with Apache Beam - Quest - a 5-lab series that provides a Google Cloud certified badge upon completion.

dataflowtemplates's People

Contributors

Stargazers

Watchers

Forkers

tad-kershner-dev9 nguyenvanthan hongjink smeyn kik008 heatherclemons thuyenho terrydhariwal yohayg ericyz mbrukman mchalek seangz sarvex carlosgonzalezpro playground-xyz mlaurenzo silver-labs nayyarunda tokyouncle mathem-se jungan21 the-thappy kbensado andonilarz jontradesy rangastartup yennanliu hanfeijp iht graffer-inc tracycuican viethapascal raynassar brentdorsey asouletdebrugiere delpinof elisska annasunsunny shashank901 cshaff0524 wallaceicy06 goungoun joinhandshake rbrto jessiejingxugao epishova unk1nd0n3 bmitioglov dwschewe1 o1o1o1o david-mart karthikey-surineni imbrito vitaly-am dreading magnusatikea dalavancloud jasonquekavalon omarcoteixeira lhong375 ryangordon evmin01 raunakjhawar-zz mayansalama manuelaguilar wataameto priyankr2411 pdeyhim satyabharat hendrakurniad yogeshtewari hostirosti freedomofnet david-kiesel-thd rgreg jitkasempin oliverfierro77 pawanrana jamesapple finsagit alec-ferguson-sunrun florjon infusionsoft yuemori ken-m goldfishy g9-ihabib yzhou2001 kiora1120 szintle kanglicheng aurelienwa diffblue-benchmarks davidcavazos a-satyateja chinkuocr davidepastore lbergelson barata

dataflowtemplates's Issues

PubSub Subscription to BQ unable to set number of max workers

Hi,

I created a Dataflow job from the PubSub_Subscription_to_BigQuery template and set the maximum number of workers to 10. The job was created successfully. However, in the job details in the GCP console, maxNumWorkers is always reverted to 3. Is there any workaround for this? Thanks!

PS: I have tried creating the job from both the UI and the gcloud CLI and got the same result.

Create Template to bulk import from Datastore/Firestore exported files

Cloud Firestore and Cloud Datastore share the same leveldb export format.

Feature request to support bulk processing of Datastore/Firestore export files.

reference:

Firestore: import/export
Datastore: import/export
Firestore -> BigQuery: https://cloud.google.com/bigquery/docs/loading-data-cloud-firestore

I can confirm that the following Python snippet reads and displays the export files.

#!/usr/bin/python
# virtualenv env
# source env/bin/activate

import sys
sys.path.append('/apps/google-cloud-sdk/platform/google_appengine/')
from google.appengine.api.files import records
from google.appengine.datastore import entity_pb
from google.appengine.api import datastore

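# Export files are binary record logs, so open them in binary mode.
with open('2018-11-05T17_49_44_60804_all_namespaces_all_kinds_output-0', 'rb') as raw:
    reader = records.RecordsReader(raw)
    for record in reader:
        # Decode each record as an EntityProto and wrap it in a datastore Entity.
        entity = datastore.Entity.FromPb(entity_pb.EntityProto(contents=record))
        print(entity)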

blog post

BQ to GCS Avro or Parquet

Any plans to support this? I want to be able to do an incremental unload from BQ to GCS to use with Apache Spark.

ClassCastException importing integer array from BigQuery Avro export

When importing an Avro dump of a BigQuery table with a repeated integer column (array of integers), we see a ClassCastException:

Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
    at com.google.cloud.teleport.spanner.AvroRecordConverter.lambda$readInt64Array$11(AvroRecordConverter.java:359)

This looks like a line number from the previous commit of this code: 42ce71e

I have experienced a similar problem when trying to write the output of BigQuery with a nullable INTEGER column into a Spanner INT64 column, and I had to use a function like:

private Long longVal(Object v, Long defaultValue) {
    return v == null ? defaultValue : Long.parseLong(v.toString());
}

So I think the fix is probably to replace

value.stream().map(x -> x == null ? null : (long) x).collect(Collectors.toList()));

with

value.stream().map(x -> x == null ? null : Long.parseLong(x.toString())).collect(Collectors.toList()));

I have not had a chance to try this fix or develop a test + PR but thought I would share the bug report in case other people are Googling for the answer to those pesky exceptions like I was.
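
For anyone who wants to try the idea in isolation, here is a minimal sketch in plain Java (no Avro or Beam dependency). It assumes the decoded array elements arrive as boxed `Integer` or `Long` values (or something whose `toString()` is a decimal number), which is what the exception suggests. The class name `Int64ArrayCoercion` and the method `toLongList` are hypothetical and are not part of `AvroRecordConverter`. Widening through `Number.longValue()` avoids the `Integer` to `Long` cast that fails.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of the template code. */
final class Int64ArrayCoercion {

  /** Converts each element to Long, keeping nulls as nulls. */
  static List<Long> toLongList(List<?> values) {
    List<Long> out = new ArrayList<>(values.size());
    for (Object v : values) {
      if (v == null) {
        out.add(null);
      } else if (v instanceof Number) {
        // Integer, Long, Short, ... all widen safely this way.
        out.add(((Number) v).longValue());
      } else {
        // Fallback: parse the text form, as the longVal helper above does.
        out.add(Long.parseLong(v.toString()));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Object> mixed = Arrays.asList(1, 2L, null, "4");
    System.out.println(toLongList(mixed)); // prints [1, 2, null, 4]
  }
}
```

Compared with `Long.parseLong(x.toString())` for every element, the `Number` branch skips a string round trip, but either approach should get past the cast.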

Add format transformation for json-like input for BigQueryConverters

Hi there,

I got the issue below (missing double quote at the beginning of a field name) when creating a kafka-pubsub-bigquery pipeline using the pubsub to bigquery template.

Failed to serialize json to table row: {SPEED=71.2, FREEWAY_ID=163, TIMESTAMP=2008-11-01 00:00:00, LONGITUDE=-117.155519, FREEWAY_DIR=S, LATITUDE=32.749679, LANE=3}
.....
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('S' (code 83)): was expecting double-quote to start field name
 at [Source: {SPEED=71.2, FREEWAY_ID=163, TIMESTAMP=2008-11-01 00:00:00, LONGITUDE=-117.155519, FREEWAY_DIR=S, LATITUDE=32.749679, LANE=3}; line: 1, column: 3]

I think the error comes from the code below, which cannot deal with a missing double quote or other non-standard, JSON-like formats (such as "key=value") created by an upstream pipeline. Could you add something similar to https://www.mkyong.com/java/jackson-was-expecting-double-quote-to-start-field-name/ to the source code for our case: input format validation, then transformation, then exception handling? When an upstream system can accept standard JSON but feeds non-standard JSON into Pub/Sub and then into this Dataflow template, it would be better to have this handled.

https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/common/BigQueryConverters.java

/**
  * Converts a JSON string to a {@link TableRow} object. If the data fails to convert, a {@link
  * RuntimeException} will be thrown.
  *
  * @param json The JSON string to parse.
  * @return The parsed {@link TableRow} object.
  */
 private static TableRow convertJsonToTableRow(String json) {
   TableRow row;
   // Parse the JSON into a {@link TableRow} object.
   try (InputStream inputStream =
       new ByteArrayInputStream(json.getBytes(StandardCharsets.UTF_8))) {
     row = TableRowJsonCoder.of().decode(inputStream, Context.OUTER);

   } catch (IOException e) {
     throw new RuntimeException("Failed to serialize json to table row: " + json, e);
   }

   return row;
 }
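
As a rough illustration of the kind of pre-processing requested above, the following plain-Java sketch turns the `{KEY=value, ...}` text from the error message into a JSON object string before it would be handed to a parser. Everything here is an assumption for illustration: the class `KeyValueToJson` and its methods are hypothetical, pairs are assumed to be separated by `", "`, and keys and values are assumed not to contain `", "` themselves. Control characters inside values are not escaped.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper for illustration; not part of BigQueryConverters. */
final class KeyValueToJson {

  /** Parses "{K1=V1, K2=V2}" into an ordered map, splitting on ", " and the first '='. */
  static Map<String, String> parse(String text) {
    String body = text.trim();
    if (body.startsWith("{") && body.endsWith("}")) {
      body = body.substring(1, body.length() - 1);
    }
    Map<String, String> fields = new LinkedHashMap<>();
    if (body.isEmpty()) {
      return fields;
    }
    for (String pair : body.split(", ")) {
      int eq = pair.indexOf('=');
      if (eq < 0) {
        throw new IllegalArgumentException("Not a key=value pair: " + pair);
      }
      fields.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
    }
    return fields;
  }

  /** Renders the map as JSON; plain decimal values stay bare, everything else is quoted. */
  static String toJson(Map<String, String> fields) {
    StringJoiner json = new StringJoiner(",", "{", "}");
    for (Map.Entry<String, String> e : fields.entrySet()) {
      String v = e.getValue();
      boolean numeric = v.matches("-?(0|[1-9]\\d*)(\\.\\d+)?");
      json.add(quote(e.getKey()) + ":" + (numeric ? v : quote(v)));
    }
    return json.toString();
  }

  /** Wraps a string in double quotes, escaping backslashes and quotes. */
  static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  public static void main(String[] args) {
    String raw = "{SPEED=71.2, FREEWAY_DIR=S, LANE=3}";
    System.out.println(toJson(parse(raw))); // prints {"SPEED":71.2,"FREEWAY_DIR":"S","LANE":3}
  }
}
```

Input that does not follow the assumed shape raises `IllegalArgumentException`, which a pipeline could catch and route to its error output instead of failing the bundle.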

Thank you!

ZIP compression in Bulk Compress Cloud Storage Files template

I would like to use the ZIP compression type in the Bulk Compress Cloud Storage Files Template, which the documentation states is a viable option:

However, it is not an option when running the template through GCP Dataflow.

Are there plans to make this an option? If not, the documentation should be updated.

Thank you!

PubSub To PubSub

Hi there,

I have a problem with the Dataflow template PubSub to PubSub.
I followed the documentation at https://cloud.google.com/dataflow/docs/guides/templates/provided-templates#cloudpubsubtocloudpubsub

I created two topics.
For one of them I created a subscription to use as the inputSubscription parameter.

For some reason it doesn't work: the Dataflow job appears to read no messages, but when I execute the following command in the terminal it returns messages from the topic:

gcloud pubsub subscriptions pull projects/test-project/subscriptions/testtopic --limit 100 --format="json"

PubSub to BQ deadletter table

I'm seeing the following error, which appears to be an uncaught exception that I believe bypasses the deadletter table:

java.lang.RuntimeException: java.io.IOException: Insert failed: [{"errors":[{"debugInfo":"","location":"location.latitude","message":"Cannot convert value to floating point (bad value):","reason":"invalid"}],"index":0}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":1}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":2}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":3}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":4}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":5}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":6}, {"errors":
...[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":209}]
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.flushRows(StreamingWriteFn.java:142)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.finishBundle(StreamingWriteFn.java:103)
Caused by: java.io.IOException: Insert failed: [{"errors":[{"debugInfo":"","location":"location.latitude","message":"Cannot convert value to floating point (bad value):","reason":"invalid"}],"index":0}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":1}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":2}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":3}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":4}, {"errors": ...

template creation for golang

Is template creation supported for Go (golang)?

Add support for importing ARRAY, BYTES and STRUCT types into Spanner

The class comment in https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/spanner/TextImportPipeline.java mentions the following:

NOTE: BYTES, ARRAY, STRUCT types are not supported.

It is impossible to import tables containing any of these types. The following exception is thrown:

"Unrecognized or unsupported column data type: ARRAY<STRING(20)>"

Can we please add support for these 3 types?

Support for ORC to BigTable

Is there a supported template for converting ORC on GCS to BigTable? (I know it is a bit of an eccentric conversion...) I am trying to implement a batch Dataflow job for migrating some Hive table data (originally located in AWS S3, then transferred to GCS) to BigTable.

If not, and there is a workaround or example for this, please leave a comment here.

AvroToBigtable

I am trying to import and run this project. The Avro to Bigtable conversion is missing BigTableRow.java and BigTableCell.java, so I am getting compilation errors on the example.

DatastoreToBigQuery CREATE_IF_NEEDED problem

Hi,

I'm trying to compile the DatastoreToBigQuery template. But I get this error

An exception occured while executing the Java class. CreateDisposition is CREATE_IF_NEEDED, however no schema was provided

I tried to compile DatastoreToText and TextToBigQuery and both worked, but somehow DatastoreToBigQuery isn't working.

Here is the whole Log:

    at org.apache.beam.vendor.guava.v20_0.com.google.common.base.Preconditions.checkArgument (Preconditions.java:122)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expandTyped (BigQueryIO.java:2122)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand (BigQueryIO.java:2099)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand (BigQueryIO.java:1445)
    at org.apache.beam.sdk.Pipeline.applyInternal (Pipeline.java:537)
    at org.apache.beam.sdk.Pipeline.applyTransform (Pipeline.java:488)
    at org.apache.beam.sdk.values.PCollection.apply (PCollection.java:370)
    at com.google.cloud.teleport.templates.DatastoreToBigQuery.main (DatastoreToBigQuery.java:79)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
    at java.lang.Thread.run (Thread.java:748)

Maybe someone had the same problem and can help me.

Nested Fields for JSON to TableSchema

In the "GCS Text to BigQuery" template, the JSON to TableSchema Serializable function does not handle nested Fields.

BigQuery to PubSub

Would BigQuery to PubSub be a good addition to these templates?
In the Dataflow Codelab, the NYC Taxi BigQuery dataset is used to publish to a topic. I think developers can utilize the existing, well-maintained BigQuery datasets to test out various streaming solutions through PubSub.

Unable to dump unbounded PubSub content to gs:// bucket

Hi there,

I'm completely new to Apache Beam and find its programming model quite surprising. While trying to put together a Parquet writer that reads from PubSub, I can't wrap my head around the following...

https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubsubToAvro.java taking that template as base.

It seems that AvroIO supports windowed writes to bucket paths such as gs://my-bucket/YYYY/MM/DD, where placeholders like 'YYYY' are filled in automatically at runtime by the AvroIO handler.
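
To make the placeholder idea concrete, here is a minimal sketch in plain Java (no Beam dependency) of how a path pattern like the one above could be resolved from a window's start time. The class name `DatedDirectory`, the method `resolve`, and the choice of UTC are assumptions for illustration. This is not how AvroIO or ParquetIO implement it, and the substitution is deliberately naive: a bucket name that itself contains `MM` or `DD` would also be rewritten.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

/** Hypothetical helper for illustration; not a Beam API. */
final class DatedDirectory {

  /** Replaces YYYY, MM and DD in the pattern with the UTC date of the given instant. */
  static String resolve(String pattern, Instant windowStart) {
    ZonedDateTime t = windowStart.atZone(ZoneOffset.UTC);
    return pattern
        .replace("YYYY", String.format(Locale.ROOT, "%04d", t.getYear()))
        .replace("MM", String.format(Locale.ROOT, "%02d", t.getMonthValue()))
        .replace("DD", String.format(Locale.ROOT, "%02d", t.getDayOfMonth()));
  }

  public static void main(String[] args) {
    Instant start = Instant.parse("2019-08-28T14:30:13Z");
    // prints gs://my-bucket/2019/08/28
    System.out.println(resolve("gs://my-bucket/YYYY/MM/DD", start));
  }
}
```

A function of this shape could serve as the directory-naming step of a windowed file write, with the window's start timestamp as input.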

Is there any way to achieve this using ParquetIO? The only bits of Parquet I've seen are the following ones, but none of them write by date...

I tried a first approach and, even though the code compiles and processes one event after another, I can't get it to run locally with DirectRunner against a bucket. The code I've got so far is the following.

SOLVED

Publishing Template

I pulled the project and published the template with this mvn command:
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.PubsubToAvro \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=${gcp_project_id} \
--stagingLocation=gs://${bucket_name}/staging \
--tempLocation=gs://${bucket_name}/temp \
--templateLocation=gs://${bucket_name}/templates/default-pubsub-to-avro.json \
--runner=DataflowRunner"

Then I created a job from this template in Dataflow and saw this warning:

No metadata file found for this template.

Specify project id for spanner to avro export

I'd like to run the "Cloud Spanner to Cloud Storage Avro" template in a different project than the one my Spanner instance lives in.
Currently I cannot specify the project in the template:

DataflowTemplates/src/main/java/com/google/cloud/teleport/spanner/ExportPipeline.java

Lines 80 to 84 in 9a95f05

SpannerConfig spannerConfig =
    SpannerConfig.create()
        .withHost(options.getSpannerHost())
        .withInstanceId(options.getInstanceId())
        .withDatabaseId(options.getDatabaseId());

It is however possible in the spanner to text export:

DataflowTemplates/src/main/java/com/google/cloud/teleport/templates/SpannerToText.java

Lines 91 to 95 in 9a95f05

SpannerConfig spannerConfig =
    SpannerConfig.create()
        .withProjectId(options.getSpannerProjectId())
        .withInstanceId(options.getSpannerInstanceId())
        .withDatabaseId(options.getSpannerDatabaseId());

Is this something that can be added?

BulkDecompressor -- Error writing failures CSV file

When BulkDecompressor tries to write an error CSV file, seeing this stack trace:

Caused by: java.lang.IllegalArgumentException: No quotes mode set but no escape character is set
	at org.apache.commons.csv.CSVFormat.validate(CSVFormat.java:1397)
	at org.apache.commons.csv.CSVFormat.<init>(CSVFormat.java:647)
	at org.apache.commons.csv.CSVFormat.withQuoteMode(CSVFormat.java:1832)
	at com.google.cloud.teleport.templates.BulkDecompressor.lambda$run$9962e4b6$1(BulkDecompressor.java:234)
	at org.apache.beam.sdk.transforms.Contextful.lambda$fn$36334a93$1(Contextful.java:112)
	at org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:123)

From reading the CSVFormat code, it looks like you need to call withEscape() before you call withQuoteMode(). I will have a pull request to fix this shortly.

Multiple BootstrapServers for KafkaToBigQuery parameters

Hi,

I noticed there is a parameter called bootstrapServers in https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/KafkaToBigQuery.java. I am trying to use 3 Kafka brokers to sync data from Kafka to BigQuery, but I cannot use a comma "," to separate the brokers in the parameter.

Could you tell me how to set the bootstrap servers so that the Dataflow job ingests from multiple Kafka brokers?

Thanks in advance.

GroupByKey Exception in common/DatastoreConverters.java

Trying to use the Pub/Sub to Datastore template I ran into the error described in this StackOverflow thread. I tried the recommendation posted there, but then got the following exception:

java.lang.IllegalStateException: GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger. Use a Window.into or Window.triggering transform prior to GroupByKey.
at org.apache.beam.sdk.transforms.GroupByKey.applicableTo(GroupByKey.java:173)
at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:204)
at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:120)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
at com.google.cloud.teleport.templates.common.DatastoreConverters$CheckSameKey.expand(DatastoreConverters.java:210)
at com.google.cloud.teleport.templates.common.DatastoreConverters$CheckSameKey.expand(DatastoreConverters.java:172)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at com.google.cloud.teleport.templates.common.DatastoreConverters$WriteJsonEntities.expand(DatastoreConverters.java:158)
at com.google.cloud.teleport.templates.common.DatastoreConverters$WriteJsonEntities.expand(DatastoreConverters.java:134)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
at com.google.cloud.teleport.templates.PubsubToDatastore.main(PubsubToDatastore.java:69)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
at java.lang.Thread.run(Thread.java:748)

There is a workaround for this in another StackOverflow thread, that bypasses this issue, but it wouldn't hurt if someone can take a look at it. Thanks.

Javascript UDF error when parsing JSON

I have used the Pub/Sub to BigQuery template to stream JSON data that are sent to a Pub/Sub topic. Through Dataflow I want to flatten the data to match the BigQuery schema and stream them.

Here is the Javascript UDF for the Dataflow process:

function transform(inJson) {
    var obj = JSON.parse(inJson);
    // variable declarations
    // ... 
    data['domain'] = obj['data']['domain']; // line 18
    // ...
    return JSON.stringify(data);
}

I've also tried:

data.domain = obj.data.domain;

I've just copied the example from this repo and extended it to flatten the JSON data.

Here is the error message:

TypeError: Cannot read property "domain" from undefined in <eval> at line number 18

and here is the stack trace:

javax.script.ScriptException: TypeError: Cannot read property "domain" from undefined in <eval> at line number 18
    at jdk.nashorn.api.scripting.NashornScriptEngine.throwAsScriptException(NashornScriptEngine.java:470)
    at jdk.nashorn.api.scripting.NashornScriptEngine.invokeImpl(NashornScriptEngine.java:392)
    at jdk.nashorn.api.scripting.NashornScriptEngine.invokeFunction(NashornScriptEngine.java:190)
    at com.google.cloud.teleport.templates.common.JavascriptTextTransformer$JavascriptRuntime.invoke(JavascriptTextTransformer.java:156)
    at com.google.cloud.teleport.templates.common.JavascriptTextTransformer$FailsafeJavascriptUdf$1.processElement(JavascriptTextTransformer.java:315)
    at com.google.cloud.teleport.templates.common.JavascriptTextTransformer$FailsafeJavascriptUdf$1$DoFnInvoker.invokeProcessElement(Unknown Source)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:325)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:272)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:309)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:77)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:621)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:609)
    at com.google.cloud.teleport.templates.PubSubToBigQuery$PubsubMessageToFailsafeElementFn.processElement(PubSubToBigQuery.java:412)
    at com.google.cloud.teleport.templates.PubSubToBigQuery$PubsubMessageToFailsafeElementFn$DoFnInvoker.invokeProcessElement(Unknown Source)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:325)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:272)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:309)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:77)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:621)
    at org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:71)
    at org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:122)
    at org.apache.beam.sdk.transforms.MapElements$1$DoFnInvoker.invokeProcessElement(Unknown Source)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:325)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:76)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1233)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:144)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:972)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: <eval>:18 TypeError: Cannot read property "domain" from undefined
    at jdk.nashorn.internal.runtime.ECMAErrors.error(ECMAErrors.java:57)
    at jdk.nashorn.internal.runtime.ECMAErrors.typeError(ECMAErrors.java:213)
    at jdk.nashorn.internal.runtime.ECMAErrors.typeError(ECMAErrors.java:185)
    at jdk.nashorn.internal.runtime.ECMAErrors.typeError(ECMAErrors.java:172)
    at jdk.nashorn.internal.runtime.Undefined.get(Undefined.java:157)
    at jdk.nashorn.internal.scripts.Script$Recompilation$1$7667A$\^eval\_.transform(<eval>:18)
    at jdk.nashorn.internal.runtime.ScriptFunctionData.invoke(ScriptFunctionData.java:639)
    at jdk.nashorn.internal.runtime.ScriptFunction.invoke(ScriptFunction.java:494)
    at jdk.nashorn.internal.runtime.ScriptRuntime.apply(ScriptRuntime.java:393)
    at jdk.nashorn.api.scripting.ScriptObjectMirror.callMember(ScriptObjectMirror.java:199)
    at jdk.nashorn.api.scripting.NashornScriptEngine.invokeImpl(NashornScriptEngine.java:386)
    ... 42 more

When I try the Javascript locally by passing some sample data it works as expected without any errors.

DataFlow Template : Sequential BigQuery Table Insertion issue

@sabhyankar

Hi, I am trying to create a Dataflow template that needs to write to two BigQuery tables sequentially.

E.g., first it writes to the "Patient" table. Once that's done, it writes to the "Statistics" table. But I am facing this issue: once it writes to the Patient table, it does not execute any code after that.

If you could give some suggestions, it would be really helpful.

Thanks!

GCS Avro to BigQuery

This is not an issue, but a question. In the absence of any other channel, I am asking my question in the form of an issue. Sorry for that. Here is the question:

Will the GCS Avro to BigTable code work for BQ as well? Is there anything I should take care of before applying it to a BQ case?

Thanks much
Pramod

Allow configuration of write disposition for jdbc to bigquery

Currently the jdbctobigquery template sets the bigquery write disposition to append. It would be helpful if a parameter was supported that allowed you to set it to overwrite, in addition to append.

AutoValue_DynamicJdbcIO_DynamicRead.Builder

Hi, I have configured Google Cloud Tools for Eclipse and am facing this issue after cloning Dataflow templates. I have added these dependencies: auto-service-1.0-rc1.jar, guava-16.0.1.jar, jsr-305-2.0.3.jar, and auto-value-1.0-rc1.jar, but could not resolve the issue: AutoValue_DynamicJdbcIO_DynamicRead.Builder could not be resolved to a type

Cannot rebuild template as it is

I pulled the repo and followed the instruction to build TextToBigQueryStreaming template as it is. But I got the following error:
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 45.236 s
[INFO] Finished at: 2019-08-28T10:30:13-04:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java (default-cli) on project google-cloud-teleport-java: An exception occured while executing the Java class. Cannot define class using reflection: Cannot define nest member class java.lang.reflect.AccessibleObject$Cache + within different package then class org.apache.beam.repackaged.core.net.bytebuddy.mirror.AccessibleObject -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java (default-cli) on project google-cloud-teleport-java: An exception occured while executing the Java class. Cannot define class using reflection: Cannot define nest member class java.lang.reflect.AccessibleObject$Cache + within different package then class org.apache.beam.repackaged.core.net.bytebuddy.mirror.AccessibleObject
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
at org.apache.maven.cli.MavenCli.execute (MavenCli.java:956)
at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:288)
at org.apache.maven.cli.MavenCli.main (MavenCli.java:192)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:567)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
Caused by: org.apache.maven.plugin.MojoExecutionException: An exception occured while executing the Java class. Cannot define class using reflection: Cannot define nest member class java.lang.reflect.AccessibleObject$Cache + within different package then class org.apache.beam.repackaged.core.net.bytebuddy.mirror.AccessibleObject
at org.codehaus.mojo.exec.ExecJavaMojo.execute (ExecJavaMojo.java:339)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
at org.apache.maven.cli.MavenCli.execute (MavenCli.java:956)
at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:288)
at org.apache.maven.cli.MavenCli.main (MavenCli.java:192)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:567)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
Caused by: java.lang.UnsupportedOperationException: Cannot define class using reflection: Cannot define nest member class java.lang.reflect.AccessibleObject$Cache + within different package then class org.apache.beam.repackaged.core.net.bytebuddy.mirror.AccessibleObject
at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.loading.ClassInjector$UsingReflection$Dispatcher$Initializable$Unavailable.defineClass (ClassInjector.java:410)
at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.loading.ClassInjector$UsingReflection.injectRaw (ClassInjector.java:235)
at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.loading.ClassInjector$AbstractBase.inject (ClassInjector.java:111)
at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default$InjectionDispatcher.load (ClassLoadingStrategy.java:232)
at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default.load (ClassLoadingStrategy.java:143)
at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.TypeResolutionStrategy$Passive.initialize (TypeResolutionStrategy.java:100)
at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.DynamicType$Default$Unloaded.load (DynamicType.java:5623)
at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.generateInvokerClass (ByteBuddyDoFnInvokerFactory.java:351)
at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.getByteBuddyInvokerConstructor (ByteBuddyDoFnInvokerFactory.java:247)
at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.newByteBuddyInvoker (ByteBuddyDoFnInvokerFactory.java:220)
at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.newByteBuddyInvoker (ByteBuddyDoFnInvokerFactory.java:151)
at org.apache.beam.sdk.transforms.reflect.DoFnInvokers.invokerFor (DoFnInvokers.java:35)
at org.apache.beam.runners.core.construction.SplittableParDo.expand (SplittableParDo.java:170)
at org.apache.beam.runners.core.construction.SplittableParDo.expand (SplittableParDo.java:87)
at org.apache.beam.sdk.Pipeline.applyReplacement (Pipeline.java:564)
at org.apache.beam.sdk.Pipeline.replace (Pipeline.java:290)
at org.apache.beam.sdk.Pipeline.replaceAll (Pipeline.java:208)
at org.apache.beam.runners.dataflow.DataflowRunner.replaceTransforms (DataflowRunner.java:995)
at org.apache.beam.runners.dataflow.DataflowRunner.run (DataflowRunner.java:712)
at org.apache.beam.runners.dataflow.DataflowRunner.run (DataflowRunner.java:179)
at org.apache.beam.sdk.Pipeline.run (Pipeline.java:313)
at org.apache.beam.sdk.Pipeline.run (Pipeline.java:299)
at com.google.cloud.teleport.templates.TextToBigQueryStreaming.run (TextToBigQueryStreaming.java:255)
at com.google.cloud.teleport.templates.TextToBigQueryStreaming.main (TextToBigQueryStreaming.java:136)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:567)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:835)
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

The command I ran was:
mvn -X compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.TextToBigQueryStreaming \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=[project] \
--stagingLocation=gs://[bucket]/staging \
--tempLocation=gs://[bucket]/temp \
--templateLocation=gs://[bucket]/templates/text_to_bq_streaming.json \
--runner=DataflowRunner"

The output of 'mvn --version' is:
Apache Maven 3.6.1 (d66c9c0b3152b2e69ee9bac180bb8fcc8e6af555; 2019-04-04T15:00:29-04:00)
Maven home: [HOME]/apache/maven/apache-maven-3.6.1
Java version: 12.0.2, vendor: Oracle Corporation, runtime: [HOME]/jdk/jdk-12.0.2
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.19.37-5+deb10u1rodete2-amd64", arch: "amd64", family: "unix"

Pubsub to BigQuery - Cumulated errors make jobs crash

I've been using Google Dataflow Templates to send messages from pub/sub to BigQuery based on this: https://cloud.google.com/dataflow/docs/templates/provided-templates#cloudpubsubtobigquery

Since I launched the Dataflow job in streaming mode, it has been generating errors and eventually crashes because of the way Dataflow handles exceptions:
https://cloud.google.com/dataflow/faq#how-are-java-exceptions-handled-in-cloud-dataflow

Here is the error:
java.lang.RuntimeException: java.io.IOException: Insert failed: [{"errors":[{"debugInfo":"","location":"","message":"Repeated record added outside of an array.","reason":"invalid"}],"index":0}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":1}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":2}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":3} .......]
org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.flushRows(StreamingWriteFn.java:125)
org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.finishBundle(StreamingWriteFn.java:94)

As I understand it, this behaviour makes the template unusable for streaming data.

Is there any way to configure the template so that these exceptions are not thrown but the errors are still sent to Stackdriver?

Thanks

PubSub to BigQuery partitioned table

Is there any plan to add PubSub to BigQuery partitioned table template?

Getting Exception when Creating New Template

Hi,

I am trying to create a new template (specifically, by modifying one of the DataflowTemplates). But when I run the Dataflow job, I get an exception that I cannot trace, because none of the classes in the stack trace come from this repo's files. Could you give me advice on tracing and solving this issue?

FYI, the template that I am trying to modify is KafkaToBigQuery.java.

java.lang.NullPointerException
        org.apache.beam.vendor.guava.v20_0.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:770)
        org.apache.beam.sdk.util.WindowedValue$TimestampedWindowedValue.<init>(WindowedValue.java:263)
        org.apache.beam.sdk.util.WindowedValue$TimestampedValueInGlobalWindow.<init>(WindowedValue.java:278)
        org.apache.beam.sdk.util.WindowedValue.timestampedValueInGlobalWindow(WindowedValue.java:117)
        org.apache.beam.runners.dataflow.worker.WorkerCustomSources$UnboundedReaderIterator.getCurrent(WorkerCustomSources.java:827)
        org.apache.beam.runners.dataflow.worker.WorkerCustomSources$UnboundedReaderIterator.getCurrent(WorkerCustomSources.java:759)
        org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.getCurrent(ReadOperation.java:394)
        org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
        org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
        org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
        org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1287)
        org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:149)
        org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:1024)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        java.lang.Thread.run(Thread.java:745)

Thanks in advance

V2 Flex Templates are broken for Python

When a flex template is created from either my own templates or the provided flex wordcount template, the Dataflow pipeline fails to build once the API call is made.

I believe the issue arises from the base docker images found at: gcr.io/dataflow-templates-base

When I test these with:

docker run --interactive --tty gcr.io/dataflow-templates-base/java8-template-launcher-base bash

I get the following error before it kicks me from the container and shuts it down:

2019/11/13 09:50:10 proto: duplicate proto type registered: google.api.Http
2019/11/13 09:50:10 proto: duplicate proto type registered: google.api.HttpRule
2019/11/13 09:50:10 proto: duplicate proto type registered: google.api.CustomHttpPattern
2019/11/13 09:50:10 proto: duplicate proto type registered: google.protobuf.FieldMask
Created new fluentd log writer for: /var/log/dataflow/template_launcher/runner-json.log

Rename repository

Since Dataflow supports both Java and Python, it's misleading to name it as just DataflowTemplates and only keep Java examples in there.

I suggest that it should use the preferred nomenclature of DataflowTemplates-Java the same way google-cloud SDK does.

I can help add some templates for python. Here's a sample directory which I have been keeping some samples in: https://github.com/VikramTiwari/dataflow-samples

PS: I know, even I am guilty of not using proper nomenclature, but I am not Google :)

PubSub to Bigquery #Null Pointer Exception

After the changes made to PubSubToBigQuery on April 2nd (d240b96), I am unable to build the Dataflow template and get a NullPointerException.

Even when running through mvn in debug mode, the error information does not help me locate which params/options are missing. Please fix.

mvn compile exec:java -Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToBigQuery -Dexec.cleanupDaemonThreads=false -Dexec.args="--project=${PROJECT_ID} --stagingLocation=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery/staging --tempLocation=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery/temp --templateLocation=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery/template --runner=DataflowRunner"

[WARNING]
java.lang.NullPointerException
at com.google.cloud.teleport.templates.PubSubToBigQuery.run (PubSubToBigQuery.java:226)
at com.google.cloud.teleport.templates.PubSubToBigQuery.main (PubSubToBigQuery.java:191)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:748)

Unable to create template

C:\Users\anirusharma\poc\DataflowTemplates>mvn compile exec:java -Dexec.mainClass=com.google.cloud.teleport.templates.PubsubToPubsub -Dexec.cleanupDaemonThreads=false -Dexec.args="--project=testbatch-211413 --stagingLocation=gs://templates_test_as/staging --tempLocation=gs://templates_test_as/temp --templateLocation=gs://template_data_as/templates/PubsubToPubsub.json --filesToStage=gs://templates_test_as/staging2 --runner=DataflowRunner"
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Detecting the operating system and CPU architecture
[INFO] ------------------------------------------------------------------------
[INFO] os.detected.name: windows
[INFO] os.detected.arch: x86_64
[INFO] os.detected.version: 10.0
[INFO] os.detected.version.major: 10
[INFO] os.detected.version.minor: 0
[INFO] os.detected.classifier: windows-x86_64
[INFO]
[INFO] --------< com.google.cloud.teleport:google-cloud-teleport-java >--------
[INFO] Building Google Cloud Teleport 0.1-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
Downloading from central: https://repo.maven.apache.org/maven2/io/grpc/grpc-core/maven-metadata.xml
Downloaded from central: https://repo.maven.apache.org/maven2/io/grpc/grpc-core/maven-metadata.xml (1.8 kB at 1.4 kB/s)
Downloading from central: https://repo.maven.apache.org/maven2/io/netty/netty-codec-http2/maven-metadata.xml
Downloaded from central: https://repo.maven.apache.org/maven2/io/netty/netty-codec-http2/maven-metadata.xml (2.0 kB at 20 kB/s)
Downloading from Apache Snapshots Repository: https://repository.apache.org/content/repositories/snapshots/io/grpc/grpc-core/maven-metadata.xml
Downloading from Sonatype Snapshots Repository: https://oss.sonatype.org/content/repositories/snapshots/io/grpc/grpc-core/maven-metadata.xml
Downloaded from Sonatype Snapshots Repository: https://oss.sonatype.org/content/repositories/snapshots/io/grpc/grpc-core/maven-metadata.xml (802 B at 850 B/s)
Downloading from Sonatype Snapshots Repository: https://oss.sonatype.org/content/repositories/snapshots/io/netty/netty-codec-http2/maven-metadata.xml
Downloading from Apache Snapshots Repository: https://repository.apache.org/content/repositories/snapshots/io/netty/netty-codec-http2/maven-metadata.xml
Downloaded from Sonatype Snapshots Repository: https://oss.sonatype.org/content/repositories/snapshots/io/netty/netty-codec-http2/maven-metadata.xml (1.5 kB at 6.7 kB/s)
[INFO]
[INFO] --- maven-enforcer-plugin:3.0.0-M1:enforce (enforce) @ google-cloud-teleport-java ---
[INFO] artifact io.grpc:grpc-core: checking for updates from central
[INFO] artifact io.netty:netty-codec-http2: checking for updates from central
[INFO]
[INFO] --- maven-enforcer-plugin:3.0.0-M1:enforce (enforce-banned-dependencies) @ google-cloud-teleport-java ---
[INFO]
[INFO] --- protobuf-maven-plugin:0.5.1:compile (default) @ google-cloud-teleport-java ---
[INFO] Compiling 1 proto file(s) to C:\Users\anirusharma\poc\DataflowTemplates\target\generated-sources\protobuf\java
[INFO]
[INFO] --- protobuf-maven-plugin:0.5.1:compile-custom (default) @ google-cloud-teleport-java ---
[INFO] Compiling 1 proto file(s) to C:\Users\anirusharma\poc\DataflowTemplates\target\generated-sources\protobuf\grpc-java
[INFO]
[INFO] --- avro-maven-plugin:1.8.2:schema (default) @ google-cloud-teleport-java ---
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ google-cloud-teleport-java ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 3 resources
[INFO] Copying 1 resource
[INFO] Copying 1 resource
[INFO]
[INFO] --- maven-compiler-plugin:3.6.2:compile (default-compile) @ google-cloud-teleport-java ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 93 source files to C:\Users\anirusharma\poc\DataflowTemplates\target\classes
[INFO] /C:/Users/anirusharma/poc/DataflowTemplates/src/main/java/com/google/cloud/teleport/templates/common/BigQueryConverters.java: Some input files use or override a deprecated API.
[INFO] /C:/Users/anirusharma/poc/DataflowTemplates/src/main/java/com/google/cloud/teleport/templates/common/BigQueryConverters.java: Recompile with -Xlint:deprecation for details.
[INFO] /C:/Users/anirusharma/poc/DataflowTemplates/src/main/java/com/google/cloud/teleport/kafka/connector/KafkaRecordCoder.java: Some input files use unchecked or unsafe operations.
[INFO] /C:/Users/anirusharma/poc/DataflowTemplates/src/main/java/com/google/cloud/teleport/kafka/connector/KafkaRecordCoder.java: Recompile with -Xlint:unchecked for details.
[INFO]
[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ google-cloud-teleport-java ---
[WARNING]
java.lang.RuntimeException: Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod (InstanceBuilder.java:224)
at org.apache.beam.sdk.util.InstanceBuilder.build (InstanceBuilder.java:155)
at org.apache.beam.sdk.PipelineRunner.fromOptions (PipelineRunner.java:55)
at org.apache.beam.sdk.Pipeline.create (Pipeline.java:145)
at com.google.cloud.teleport.templates.PubsubToPubsub.run (PubsubToPubsub.java:149)
at com.google.cloud.teleport.templates.PubsubToPubsub.main (PubsubToPubsub.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod (InstanceBuilder.java:214)
at org.apache.beam.sdk.util.InstanceBuilder.build (InstanceBuilder.java:155)
at org.apache.beam.sdk.PipelineRunner.fromOptions (PipelineRunner.java:55)
at org.apache.beam.sdk.Pipeline.create (Pipeline.java:145)
at com.google.cloud.teleport.templates.PubsubToPubsub.run (PubsubToPubsub.java:149)
at com.google.cloud.teleport.templates.PubsubToPubsub.main (PubsubToPubsub.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:748)
Caused by: java.lang.RuntimeException: Unable to get application default credentials. Please see https://developers.google.com/accounts/docs/application-default-credentials for details on how to specify credentials. This version of the SDK is dependent on the gcloud core component version 2015.02.05 or newer to be able to get credentials from the currently authorized user via gcloud auth.
at org.apache.beam.sdk.extensions.gcp.auth.NullCredentialInitializer.throwNullCredentialException (NullCredentialInitializer.java:60)
at org.apache.beam.runners.dataflow.util.DataflowTransport.chainHttpRequestInitializer (DataflowTransport.java:99)
at org.apache.beam.runners.dataflow.util.DataflowTransport.newDataflowClient (DataflowTransport.java:76)
at org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions$DataflowClientFactory.create (DataflowPipelineDebugOptions.java:134)
at org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions$DataflowClientFactory.create (DataflowPipelineDebugOptions.java:131)
at org.apache.beam.sdk.options.ProxyInvocationHandler.returnDefaultHelper (ProxyInvocationHandler.java:592)
at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault (ProxyInvocationHandler.java:533)
at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke (ProxyInvocationHandler.java:158)
at com.sun.proxy.$Proxy42.getDataflowClient (Unknown Source)
at org.apache.beam.runners.dataflow.DataflowClient.create (DataflowClient.java:41)
at org.apache.beam.runners.dataflow.DataflowRunner.<init> (DataflowRunner.java:338)
at org.apache.beam.runners.dataflow.DataflowRunner.fromOptions (DataflowRunner.java:332)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod (InstanceBuilder.java:214)
at org.apache.beam.sdk.util.InstanceBuilder.build (InstanceBuilder.java:155)
at org.apache.beam.sdk.PipelineRunner.fromOptions (PipelineRunner.java:55)
at org.apache.beam.sdk.Pipeline.create (Pipeline.java:145)
at com.google.cloud.teleport.templates.PubsubToPubsub.run (PubsubToPubsub.java:149)
at com.google.cloud.teleport.templates.PubsubToPubsub.main (PubsubToPubsub.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:748)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:40 min
[INFO] Finished at: 2018-12-18T00:01:47-05:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java (default-cli) on project google-cloud-teleport-java: An exception occured while executing the Java class. Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions): InvocationTargetException: Unable to get application default credentials. Please see https://developers.google.com/accounts/docs/application-default-credentials for details on how to specify credentials. This version of the SDK is dependent on the gcloud core component version 2015.02.05 or newer to be able to get credentials from the currently authorized user via gcloud auth. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

Custom shard templates with YYYY/MM/dd/HH/mm replacements

Hi all,

We are trying to set up a custom shard template like gs://bucket/YYYY/MM/dd/HH/W/P-SS-of-NN, so that the bucket can still be browsed easily by hand.

We are having issues with this custom shard template, as it seems these replacements are not supported. The short description shown when creating the job does not say anything about year/month/day/hour replacements. It just says:

The shard template defines the unique/dynamic portion of each windowed file. Recommended to use the default (W-P-SS-of-NN). At runtime, 'W' is replaced with the window date range and 'P' is replaced with the pane info.

Searching this repo, we found these references, but we are not sure whether they are available in the custom shard template.

Questions:

Are YYYY, MM, dd, HH available to the custom shard template?
What are the supported replacements?

Any pointers will be greatly appreciated.
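
As a rough illustration of the placeholders the quoted description does mention, here is a minimal sketch in JavaScript. It is not the template's code: the function name expandShardTemplate and its arguments are hypothetical, and it assumes that only W, P, runs of S (shard index) and runs of N (shard count) are substituted, with every other character, including YYYY, MM, dd and HH, left as literal text.

```javascript
// Hypothetical helper, for illustration only: expands a shard template the way
// the quoted description suggests. Assumption: only W, P, S-runs and N-runs are
// recognised; anything else is copied through unchanged.
function expandShardTemplate(template, shardIndex, shardCount, windowText, paneText) {
  // Left-pad a number with zeros to the width of the placeholder run.
  function pad(num, width) {
    var s = String(num);
    while (s.length < width) {
      s = "0" + s;
    }
    return s;
  }
  return template.replace(/S+|N+|W|P/g, function (token) {
    if (token.charAt(0) === "S") return pad(shardIndex, token.length);
    if (token.charAt(0) === "N") return pad(shardCount, token.length);
    return token === "W" ? windowText : paneText;
  });
}
```

Under these assumptions, expandShardTemplate("W-P-SS-of-NN", 3, 12, "win", "pane-0") gives "win-pane-0-03-of-12", while a template such as "YYYY/MM/dd/HH/W/P-SS-of-NN" keeps the YYYY/MM/dd/HH part verbatim. Whether the real template behaves this way is exactly what the questions above ask.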

Possible to read file name from JS UDF function?

@jasonkuster @ryanmcdowell

Is there any way to get the file name from the JavaScript function? Something similar to FileIO's file path matching, which returns the name of the file being loaded.

Thanks.

SpannerIO: Support read write transactions

Is it possible to perform a read-write transaction in the Spanner connector for Dataflow/Beam?
I have a use case that is currently implemented in a Java App Engine flex app, and I would like to see if I can do it in Dataflow.

I have gone through grouped mutations, but I am not sure whether I can read within one.

Windowing in unbounded PubSubIO if no aggregation (GroupByKey) is applied

Is it mandatory to apply windowing to an unbounded PCollection from Pub/Sub if I'm not using any aggregation steps, i.e. GroupByKey?

A note at the following link reads as follows.
https://cloud.google.com/dataflow/docs/guides/specifying-exec-params

If your pipeline uses unbounded data sources and sinks, it is necessary to pick a Windowing strategy for your unbounded PCollections before you use any aggregation such as a GroupByKey.

Pub/Sub to BQ fails to serialize json

I am trying to get going with the Dataflow template for Pub/Sub subscription to BigQuery. All of the messages end up in the error records table. The stack trace says: "java.lang.RuntimeException: Failed to serialize json to table row" and it fails for the actual message body.

The body of each message is a JSON array like:
[{"itemNo":"00050330","itemType":"Sales","itemDesc":"A Table","quantity":4.0,"extendedAmount":120.0,"orderId":null,"itemGroup":"0421","originalDocumentNo":null}]

Any help would be appreciated!
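
One possible workaround, sketched under assumptions: a table row corresponds to a single JSON object, so a top-level array is a plausible cause of the failure, and the template's optional JavaScript UDF could wrap the array in an object first. The function name wrapArray and the field name items are hypothetical, and the target table would need a matching repeated column.

```javascript
/**
 * Hypothetical UDF: wraps a top-level JSON array in an object so that the
 * payload handed to the next step is a single JSON object.
 * The field name "items" is an assumption; use whatever the table schema defines.
 * @param {string} inJson
 * @return {string} outJson
 */
function wrapArray(inJson) {
  var parsed = JSON.parse(inJson);
  // Only arrays are wrapped; objects pass through unchanged.
  if (Array.isArray(parsed)) {
    return JSON.stringify({ items: parsed });
  }
  return JSON.stringify(parsed);
}
```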

Pub/Sub to BigQuery Errors PayloadString is not valid JSON

The errors from the Cloud Pub/Sub Subscription to BigQuery template aren't saved as valid JSON in the PayloadString column and I am unable to replay them.

It looks like this:
{event={userId=1234, sessionEvent={sessionId=DSFG, ua=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36, browser={name=Chrome, version=74.0.3729.131, major=74}}}

If this is intended, how do I convert it and push it back to Pub/Sub?

java.lang.RuntimeException: Failed to serialize json to table row:

I am trying to store a JSON string in a BigQuery table. When I publish the simple JSON below in the publish message box, no records are inserted into the BigQuery table. The logs show the error above.

{
"childName":"Aijaz_Google555",
"present":"Gift_Google555",
"JsonObject":[{"Actions":"test actions","CreatedBy": "test created by", "CreatedTimestamp": "test","Extended": "test extended"}]
}

childName, present and JsonObject are three STRING columns; in the third column, JsonObject, I want to store the JSON object as a string. Kindly help.

Also, from C# code, I am trying to build a string like the one below:

string content = "{ "childName":"Aijaz_Google666","present":"Gift_Google666","JsonObject":"{"Actions": "test actions","CreatedBy": "test created by", "CreatedTimestamp": "2015 - 10 - 28T10: 15:30(ISO Date Time Format)","Extended": "test extended"}"}";

Thanks in advance.
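
A minimal sketch of one way to approach this, assuming the template's optional JavaScript UDF is available: re-serialize the nested value so the JsonObject field arrives as a plain string. The function name stringifyNested is hypothetical; the field name comes from the message above.

```javascript
/**
 * Hypothetical UDF: turns the nested JsonObject value into a JSON-encoded
 * string so it can be stored in a STRING column.
 * @param {string} inJson
 * @return {string} outJson
 */
function stringifyNested(inJson) {
  var obj = JSON.parse(inJson);
  // Leave the field alone if it is absent or already a string.
  if (obj.JsonObject !== undefined && typeof obj.JsonObject !== "string") {
    obj.JsonObject = JSON.stringify(obj.JsonObject);
  }
  return JSON.stringify(obj);
}
```

The inner quotes are escaped by JSON.stringify, which is also what the hand-built C# string would need for its embedded JSON.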

TextIOToBigQuery.java: why is .apply(BigQueryConverters.jsonToTableRow()) called as a method?

Hello guys,

My question is very simple: why does the class TextIOToBigQuery.java call
.apply(BigQueryConverters.jsonToTableRow()) as a method, rather than instantiating a class, like .apply(new BigQueryConverters.jsonToTableRow())? Isn't it really a class?

Thanks so much for confirming.

Support dumping multiple Spanner databases to Avro

To be able to use Cloud Scheduler effectively with the Spanner->Avro template, it would be ideal if https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/spanner/ExportPipeline.java allowed specifying multiple Database IDs (instead of a single one, as happens currently)

The current template already creates a subdirectory for the exported database in the GCS output directory: if multiple databases were specified multiple subdirectories would be created, one for each database.

As an extension, it would be very useful even to make the Database ID optional, in which case the dataflow would have to enumerate the databases in the specified Spanner instance, and then export all of them.

The goal is to be able to trigger an export of one, multiple or all databases on a spanner instance from a cloud scheduler job.

Dataflow to BQ

Hi,

I'm trying to use Dataflow as a bridge from Kafka to BigQuery, but for each topic created in Kafka I need to create one Dataflow job, which will result in a huge cost. Is there any way to read multiple topics and write to multiple BigQuery tables using only one Dataflow job?

I'm using this Dataflow template:
https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/KafkaToBigQuery.java

Is this updated with the latest Dataflow Templates from March 2nd?

It seems the JavascriptTextTransformer is missing from the source code (probably changed on March 2nd).

GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger

I am new to running pipeline jobs on Google Cloud and I am running into an issue with Pub/Sub to Datastore. 'mvn clean && mvn compile' worked, but the command to create the template fails.
--Command
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.PubsubToDatastore \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=PROJECT_ID \
--pubsubReadTopic=projects/PROJECT_ID/topics/topic \
--javascriptTextTransformGcsPath=gs://PROJECT_ID/*.js. \
--javascriptTextTransformFunctionName=transform \
--stagingLocation=gs://PROJECT_ID/staging \
--tempLocation=gs://PROJECT_ID/temp \
--templateLocation=gs://<PROJECT_ID>/templates/PubSub_to_Datastore.json \
--runner=DataflowRunner"

-- javascript
`/**

A transform which adds a field to the incoming data.
@param {string} inJson
@return {string} outJson
*/
function transform(line) {
  var values = line.split(',');

  var obj = new Object();
  obj._description = values[0];
  obj._east = values[1];
  obj._last_updt = values[2];
  obj._north = values[3];
  obj._region_id = values[4];
  obj._south = values[5];
  obj._west = values[6];
  obj.current_speed = values[7];
  obj.region = values[8];
  var jsonString = JSON.stringify(obj);

  return jsonString;
}
--Datastore Schema
{
"Datastore Schema": [
{
"name": "_description",
"type": "STRING"
},
{
"name": "_east",
"type": "FLOAT"
},
{
"name": "_last_updt",
"type": "TIMESTAMP"
},
{
"name": "_north",
"type": "FLOAT"
},
{
"name": "_region_id",
"type": "INTEGER"
},
{
"name": "_south",
"type": "FLOAT"
},
{
"name": "_west",
"type": "FLOAT"
},
{
"name": "current_speed",
"type": "FLOAT"
},
{
"name": "region",
"type": "STRING"
}
]
}`

Record type not supported in TextIOToBigQuery template

Hi guys,

The BigQuery schema file used by this template cannot define a RECORD type; otherwise we get the error below. Looking at the code, it does not appear to accommodate a nested/recursive build-up of the table schema in the withSchema() block to deal with a RECORD schema definition. I would be happy to code it up if you like.

Error: Field xxxx is type RECORD but has no schema.
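
A minimal sketch of the recursive build-up proposed above, under stated assumptions: `SchemaField` is an illustrative stand-in rather than the BigQuery client's field class, and the input is assumed to be an already-parsed schema entry with `name`, `type`, and an optional nested `fields` list, which is how BigQuery JSON schemas express RECORD columns. The template's actual withSchema() code is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for a BigQuery field definition (not the client class).
class SchemaField {
  final String name;
  final String type;
  final List<SchemaField> fields; // sub-fields; non-empty only for RECORD

  SchemaField(String name, String type, List<SchemaField> fields) {
    this.name = name;
    this.type = type;
    this.fields = fields;
  }

  // Recursively converts one parsed schema entry, descending into "fields".
  static SchemaField fromMap(Map<String, Object> json) {
    String name = (String) json.get("name");
    String type = (String) json.get("type");
    List<SchemaField> children = new ArrayList<>();
    Object nested = json.get("fields");
    if (nested instanceof List) {
      for (Object child : (List<?>) nested) {
        @SuppressWarnings("unchecked")
        Map<String, Object> childMap = (Map<String, Object>) child;
        children.add(fromMap(childMap));
      }
    }
    if ("RECORD".equals(type) && children.isEmpty()) {
      throw new IllegalArgumentException(
          "Field " + name + " is type RECORD but has no schema.");
    }
    return new SchemaField(name, type, children);
  }

  // Small helper to build a parsed entry for the demo below.
  static Map<String, Object> entry(String name, String type, Object... nested) {
    Map<String, Object> m = new HashMap<>();
    m.put("name", name);
    m.put("type", type);
    if (nested.length > 0) {
      m.put("fields", Arrays.asList(nested));
    }
    return m;
  }

  public static void main(String[] args) {
    SchemaField address =
        fromMap(entry("address", "RECORD", entry("city", "STRING"), entry("zip", "STRING")));
    System.out.println(address.name + " has " + address.fields.size() + " sub-fields");
  }
}
```

In the real template the same recursion would presumably populate the nested field list on the BigQuery client's field schema object instead of this stand-in.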

PubSub to BigQuery Javascript UDF destroys attributes

Hi,
We're using the PubSub Subscription to BigQuery template. We have data in both the PubsubMessage attributes and the body. Our body contains an array without a field name, i.e.

[
 {"id": "item1"},
 {"id": "item2"}
]

The template had issues parsing this, so we added a simple UDF:

function process(str){
    var arrayOfItems = JSON.parse(str);
    var outObject = {items: arrayOfItems};
    return JSON.stringify(outObject);
}

When this template runs it seems like the attributes are discarded after the UDF step.

I'm not that well versed with BEAM but it seems that when the InvokeUDF step is built it's discarding everything but the message payload

 PCollectionTuple udfOut =
          input
              // Map the incoming messages into FailsafeElements so we can recover from failures
              // across multiple transforms.
              .apply("MapToRecord", ParDo.of(new PubsubMessageToFailsafeElementFn()))
              .apply(
                  "InvokeUDF",
                  FailsafeJavascriptUdf.<PubsubMessage>newBuilder()
                      .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                      .setFunctionName(options.getJavascriptTextTransformFunctionName())
                      .setSuccessTag(UDF_OUT)
                      .setFailureTag(UDF_DEADLETTER_OUT)
                      .build());

The PubsubMessageToFailsafeElementFn looks like this

 static class PubsubMessageToFailsafeElementFn
      extends DoFn<PubsubMessage, FailsafeElement<PubsubMessage, String>> {
    @ProcessElement
    public void processElement(ProcessContext context) {
      PubsubMessage message = context.element();
      context.output(
          FailsafeElement.of(message, new String(message.getPayload(), StandardCharsets.UTF_8)));
    }
  }

It calls message.getPayload() and passes only the payload on, which would probably cause the issue.

So my question is: Am I doing something wrong, is there some way of getting both the attributes and the payload through the UDF? Or do I have to modify the java template?

Thanks in advance!

Supporting AppProfileID in BigtableToAvro DataflowTemplate

We use BigtableToAvro to back up a huge Bigtable in production.
The Bigtable is replicated. The main application connects to the Bigtable in one of the regions.
We want to force the backup process (BigtableToAvro) to use a specific Bigtable instance, which is in another region.
The reason is that we don't want the backup to affect the main application's performance.
This feature already exists in the Cloud Bigtable to Cloud Storage SequenceFile template.

Use same group.id in consumer properties

Even with group.id set via .updateConsumerProperties(), the reader still starts reading at offset 0 for every job.

The setup

        Map<String, Object> props = new HashMap<>();
        props.put("group.id", "dataflow-reader");
        props.put("auto.offset.reset", "earliest");

        PCollection<KafkaRecord<String, String>> pcol = p.apply(KafkaIO.<String, String>read()
            .withBootstrapServers(options.getBootstrapServers())
            .withTopics(topics)
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withNumSplits(1)
            .updateConsumerProperties(props));

When I start a new job, this is logged in the console:

Reader-0: reading from name-of-topic-0 starting at offset 0
ConsumerConfig values:
auto.commit.interval.ms = 5000
auto.offset.reset = earliest
bootstrap.servers = [xxxx]
check.crcs = true
client.id =
connections.max.idle.ms = 540000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = Reader-0_offset_consumer_778069295_dataflow-reader

And it looks like that happens here

DataflowTemplates/src/main/java/com/google/cloud/teleport/kafka/connector/KafkaUnboundedReader.java

Line 146 in 4788104

String offsetGroupId =

Is it possible to make the reader not start from offset 0 for each new dataflow job instance?

	SpannerConfig spannerConfig =
	SpannerConfig.create()
	.withHost(options.getSpannerHost())
	.withInstanceId(options.getInstanceId())
	.withDatabaseId(options.getDatabaseId());

	SpannerConfig spannerConfig =
	SpannerConfig.create()
	.withProjectId(options.getSpannerProjectId())
	.withInstanceId(options.getSpannerInstanceId())
	.withDatabaseId(options.getSpannerDatabaseId());
