GoogleCloudPlatform / DataflowTemplates

Cloud Dataflow Google-provided templates for solving in-Cloud data tasks

Home Page: https://cloud.google.com/dataflow/docs/guides/templates/provided-templates

License: Apache License 2.0

Java 88.93% JavaScript 0.31% Dockerfile 0.02% Python 0.23% PureBasic 0.01% Go 0.77% FreeMarker 0.01% HCL 9.73%
apache-beam dataflow-templates google-cloud-dataflow google-cloud-storage google-cloud-spanner bigquery bigtable

dataflowtemplates's Introduction

Google Cloud Dataflow Template Pipelines

These Dataflow templates are an effort to solve simple, but large, in-Cloud data tasks, including data import/export/backup/restore and bulk API operations, without a development environment. The technology under the hood which makes these operations possible is the Google Cloud Dataflow service combined with a set of Apache Beam SDK templated pipelines.

Google is providing this collection of pre-implemented Dataflow templates as a reference and to provide easy customization for developers wanting to extend their functionality.

Note on Default Branch

As of November 18, 2021, our default branch is now named "main". This does not affect forks. If you would like your fork and its local clone to reflect these changes you can follow GitHub's branch renaming guide.

Template Pipelines

For documentation on each template's usage and parameters, please see the official docs.

Contributing

To contribute to the repository, see CONTRIBUTING.md.

Release Process

Templates are released on a weekly basis (best-effort) to keep Google-provided templates up to date with the latest fixes and improvements.

To learn more about this process, or how you can stage your own changes, see Release Process.

More Information

  • Dataflow - general Dataflow documentation.
  • Dataflow Templates - basic template concepts.
  • Google-provided Templates - official documentation for templates provided by Google (the source code is in this repository).
  • Dataflow Cookbook: Blog, GitHub Repository - pipeline examples and practical solutions to common data processing challenges.
  • Dataflow Metrics Collector - CLI tool to collect Dataflow resource and execution metrics and export them to either BigQuery or Google Cloud Storage. Useful for comparing and visualizing metrics while benchmarking Dataflow pipelines across data formats, resource configurations, etc.
  • Apache Beam
    • Overview
    • Quickstart: Java, Python, Go
    • Tour of Beam - an interactive tour with learning topics covering core Beam concepts from simple ones to more advanced ones.
    • Beam Playground - an interactive environment to try out Beam transforms and examples without having to install Apache Beam.
    • Beam College - hands-on training and practical tips, including video recordings of Apache Beam and Dataflow Templates lessons.
    • Getting Started with Apache Beam - Quest - a 5-lab series that provides a Google Cloud certified badge upon completion.

dataflowtemplates's People

Contributors

abacn, adrw-google, aksharauke, alain-baxter, alexeykukuku, ali-ince, andreigurau, ash-ddog, billyjacobson, biswanag, bvolpato, cherepushko, cloud-teleport, damccorm, damondouglas, deep1998, dhercher, dippatel98, drumcircle, fbiville, fozzie15, georgecma, melbrodrigues, oleg-semenov, pabloem, polber, pranavbhandari24, shreyakhajanchi, theshanbhag, zhoufek

dataflowtemplates's Issues

PubSub Subscription to BQ unable to set number of max workers

Hi,

I created a Dataflow job from the PubSub_Subscription_to_BigQuery template and set the maximum number of workers to 10. The job was created successfully. However, in the job details in the GCP console, maxNumWorkers is always reverted to 3. Is there a workaround for this? Thanks!

PS: I tried creating the job from both the UI and the gcloud CLI and got the same result.

Create Template to bulk import from Datastore/Firestore exported files

Cloud Firestore and Cloud Datastore share the same leveldb export format.

Feature request to support bulk processing of Datastore/Firestore export files.

reference:

I can confirm that the following Python snippet reads and displays the export files.

#!/usr/bin/python
# virtualenv env
# source env/bin/activate

import sys
# Make the App Engine libraries bundled with google-cloud-sdk importable.
sys.path.append('/apps/google-cloud-sdk/platform/google_appengine/')
from google.appengine.api.files import records
from google.appengine.datastore import entity_pb
from google.appengine.api import datastore

# The export file holds binary records, so open it in binary mode.
with open('2018-11-05T17_49_44_60804_all_namespaces_all_kinds_output-0', 'rb') as raw:
    reader = records.RecordsReader(raw)
    for record in reader:
        # Each record is a serialized EntityProto.
        entity = datastore.Entity.FromPb(entity_pb.EntityProto(contents=record))
        print(entity)

BQ to GCS Avro or Parquet

Are there any plans to support this? I would like to do incremental unloads from BQ to GCS for use with Apache Spark.

ClassCastException importing integer array from BigQuery Avro export

When importing an Avro dump of a BigQuery table with a repeated integer column (an array of integers), we see a ClassCastException:

Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
    at com.google.cloud.teleport.spanner.AvroRecordConverter.lambda$readInt64Array$11(AvroRecordConverter.java:359)

This looks like a line number from the previous commit of this code: 42ce71e

I have experienced a similar problem when trying to write BigQuery output with a nullable INTEGER column into a Spanner INT64 column, and I had to use a function like:

  private Long longVal(Object v, Long defaultValue) {
    return v == null ? defaultValue : Long.parseLong(v.toString());
  }

So I think the fix is probably to replace

value.stream().map(x -> x == null ? null : (long) x).collect(Collectors.toList()));

with

value.stream().map(x -> x == null ? null : Long.parseLong(x.toString())).collect(Collectors.toList()));

I have not had a chance to try this fix or develop a test + PR but thought I would share the bug report in case other people are Googling for the answer to those pesky exceptions like I was.
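
As a minimal sketch of the idea behind the proposed fix: casting a boxed Integer held in an Object reference with `(long) x` first casts to `Long`, which is what raises the ClassCastException. Widening through `Number.longValue()` avoids both that cast and the string round-trip. The class and method names here (`Int64ArrayConversion`, `toLongList`) are hypothetical and are not part of `AvroRecordConverter`; the sketch also assumes every non-null element is some `Number`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class Int64ArrayConversion {

  // Hypothetical helper, not the repository's code: widens each element to
  // Long via Number.longValue() and keeps nulls as nulls.
  static List<Long> toLongList(List<?> values) {
    return values.stream()
        .map(x -> x == null ? null : Long.valueOf(((Number) x).longValue()))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Mix of Integer, Long and null, as an Avro reader might hand back.
    List<Object> mixed = Arrays.asList(1, 2L, null, 42);
    System.out.println(toLongList(mixed)); // prints [1, 2, null, 42]
  }
}
```

Whether this or the `Long.parseLong(x.toString())` variant suits the converter better depends on which element types the Avro reader can actually produce, which has not been verified here.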

Add format transformation for json-like input for BigQueryConverters

Hi there,

I ran into the issue below (a missing double quote at the beginning of a field name) when creating a Kafka-PubSub-BigQuery pipeline using the Pub/Sub to BigQuery template.

Failed to serialize json to table row: {SPEED=71.2, FREEWAY_ID=163, TIMESTAMP=2008-11-01 00:00:00, LONGITUDE=-117.155519, FREEWAY_DIR=S, LATITUDE=32.749679, LANE=3}
.....
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('S' (code 83)): was expecting double-quote to start field name
 at [Source: {SPEED=71.2, FREEWAY_ID=163, TIMESTAMP=2008-11-01 00:00:00, LONGITUDE=-117.155519, FREEWAY_DIR=S, LATITUDE=32.749679, LANE=3}; line: 1, column: 3]

I think the error comes from the code below, which cannot handle a missing double quote or non-standard JSON-like formats (such as "key=value") produced by an upstream pipeline. Could you add something similar to https://www.mkyong.com/java/jackson-was-expecting-double-quote-to-start-field-name/ to the source code, such as input format validation, transformation, and exception handling? When an upstream system accepts standard JSON but feeds non-standard JSON into Pub/Sub and then into this Dataflow template, this would be a useful addition.

https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/common/BigQueryConverters.java

/**
  * Converts a JSON string to a {@link TableRow} object. If the data fails to convert, a {@link
  * RuntimeException} will be thrown.
  *
  * @param json The JSON string to parse.
  * @return The parsed {@link TableRow} object.
  */
 private static TableRow convertJsonToTableRow(String json) {
   TableRow row;
   // Parse the JSON into a {@link TableRow} object.
   try (InputStream inputStream =
       new ByteArrayInputStream(json.getBytes(StandardCharsets.UTF_8))) {
     row = TableRowJsonCoder.of().decode(inputStream, Context.OUTER);

   } catch (IOException e) {
     throw new RuntimeException("Failed to serialize json to table row: " + json, e);
   }

   return row;
 }

Thank you!
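
One possible shape for the requested transformation, as a rough sketch only: the helper below turns the `{KEY=value, ...}` text from the error message into a JSON object string before it would be handed to a JSON parser. `KeyValueToJson` and its methods are hypothetical names, not part of `BigQueryConverters`. The sketch assumes that pairs are separated by ", ", that no value contains ", ", and that only backslashes and double quotes need escaping.

```java
import java.util.StringJoiner;

public class KeyValueToJson {

  // Wraps a string in double quotes, escaping backslashes and double quotes only.
  static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  // True only for plain decimal numbers, which are also valid JSON numbers.
  static boolean isNumber(String s) {
    return s.matches("-?(0|[1-9]\\d*)(\\.\\d+)?");
  }

  // Hypothetical helper: converts "{K1=V1, K2=V2}" into a JSON object string.
  static String toJson(String input) {
    String body = input.trim();
    if (body.startsWith("{") && body.endsWith("}")) {
      body = body.substring(1, body.length() - 1);
    }
    StringJoiner json = new StringJoiner(",", "{", "}");
    if (body.trim().isEmpty()) {
      return json.toString();
    }
    for (String pair : body.split(", ")) {
      int eq = pair.indexOf('=');
      if (eq < 0) {
        throw new IllegalArgumentException("Not a key=value pair: " + pair);
      }
      String key = pair.substring(0, eq).trim();
      String value = pair.substring(eq + 1).trim();
      // Numbers stay unquoted; everything else becomes a JSON string.
      json.add(quote(key) + ":" + (isNumber(value) ? value : quote(value)));
    }
    return json.toString();
  }

  public static void main(String[] args) {
    System.out.println(toJson(
        "{SPEED=71.2, FREEWAY_ID=163, TIMESTAMP=2008-11-01 00:00:00, FREEWAY_DIR=S, LANE=3}"));
  }
}
```

Input that does not fit these assumptions raises `IllegalArgumentException` or yields a wrong split, so in a pipeline such records would still need to be routed to error handling rather than silently converted.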

ZIP compression in Bulk Compress Cloud Storage Files template

I would like to use the ZIP compression type in the Bulk Compress Cloud Storage Files template, which the documentation lists as a valid option.

However, it is not available as an option when running the template through GCP Dataflow.

Are there plans to make this an option? If not, the documentation should be updated.

Thank you!

PubSub To PubSub

Hi there,

I have a problem with the PubSub to PubSub Dataflow template. I followed the documentation at https://cloud.google.com/dataflow/docs/guides/templates/provided-templates#cloudpubsubtocloudpubsub

I created two topics. For one of them I created a subscription to use as the inputSubscription parameter.

For some reason it doesn't work: the Dataflow job appears to read no messages, but when I execute the following command in the terminal, it returns messages from the topic:

gcloud pubsub subscriptions pull projects/test-project/subscriptions/testtopic --limit 100 --format="json"

PubSub to BQ deadletter table

I am seeing the following error, which appears to be an uncaught exception that I believe bypasses the deadletter table:

java.lang.RuntimeException: java.io.IOException: Insert failed: [{"errors":[{"debugInfo":"","location":"location.latitude","message":"Cannot convert value to floating point (bad value):","reason":"invalid"}],"index":0}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":1}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":2}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":3}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":4}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":5}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":6}, {"errors":
...[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":209}]
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.flushRows(StreamingWriteFn.java:142)
at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.finishBundle(StreamingWriteFn.java:103)
Caused by: java.io.IOException: Insert failed: [{"errors":[{"debugInfo":"","location":"location.latitude","message":"Cannot convert value to floating point (bad value):","reason":"invalid"}],"index":0}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":1}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":2}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":3}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":4}, {"errors": ...

Add support for importing ARRAY, BYTES and STRUCT types into Spanner

The class comment in https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/spanner/TextImportPipeline.java mentions the following:

NOTE: BYTES, ARRAY, STRUCT types are not supported.

It is impossible to import tables containing any of these types. The following exception is thrown:

"Unrecognized or unsupported column data type: ARRAY<STRING(20)>"

Can we please add support for these 3 types?

Support for ORC to BigTable

Is there a supported template for converting ORC files on GCS to BigTable? (I know it is a somewhat unusual conversion.) I am trying to implement a batch Dataflow job for migrating Hive table data (originally stored in AWS S3 and transferred to GCS) to BigTable.

If not, and there is a workaround or example for this, please leave a comment here.

AvroToBigtable

I am trying to import and run this project. The Avro to Bigtable conversion is missing BigTableRow.java and BigTableCell.java, so the example fails with compilation errors.

DatastoreToBigQuery CREATE_IF_NEEDED problem

Hi,

I'm trying to compile the DatastoreToBigQuery template, but I get this error:

An exception occured while executing the Java class. CreateDisposition is CREATE_IF_NEEDED, however no schema was provided

I tried compiling DatastoreToText and TextToBigQuery and both worked, but somehow DatastoreToBigQuery does not.

Here is the full log:

    at org.apache.beam.vendor.guava.v20_0.com.google.common.base.Preconditions.checkArgument (Preconditions.java:122)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expandTyped (BigQueryIO.java:2122)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand (BigQueryIO.java:2099)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand (BigQueryIO.java:1445)
    at org.apache.beam.sdk.Pipeline.applyInternal (Pipeline.java:537)
    at org.apache.beam.sdk.Pipeline.applyTransform (Pipeline.java:488)
    at org.apache.beam.sdk.values.PCollection.apply (PCollection.java:370)
    at com.google.cloud.teleport.templates.DatastoreToBigQuery.main (DatastoreToBigQuery.java:79)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
    at java.lang.Thread.run (Thread.java:748)

Maybe someone had the same problem and can help me. 

BigQuery to PubSub

Would BigQuery to PubSub be a good addition to these templates?
In the Dataflow Codelab, the NYC Taxi BigQuery dataset is used to publish to a topic. I think developers can utilize the existing, well-maintained BigQuery datasets to test out various streaming solutions through PubSub.

Unable to dump unbounded PubSub content to gs:// bucket

Hi there,

I'm completely new to Apache Beam, and its programming model is quite surprising. While trying to work out a Parquet writer that reads from PubSub, I can't wrap my head around the following...

I am using https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubsubToAvro.java as a base.

It seems that AvroIO supports windowed writes to bucket paths such as gs://my-bucket/YYYY/MM/DD, with variables like 'YYYY' filled in automatically at runtime by the AvroIO handler.

Is there any way to achieve this using ParquetIO? The only bits of Parquet I've seen are the following ones, but none of them write by date...

I tried a first approach and, even though the code compiles and runs one event after another, I can't get it to run locally with DirectRunner against a bucket. The code I've got so far is the following.

SOLVED

Publishing Template

I pulled the project and published the template with this mvn command:

mvn compile exec:java \
  -Dexec.mainClass=com.google.cloud.teleport.templates.PubsubToAvro \
  -Dexec.cleanupDaemonThreads=false \
  -Dexec.args=" \
    --project=${gcp_project_id} \
    --stagingLocation=gs://${bucket_name}/staging \
    --tempLocation=gs://${bucket_name}/temp \
    --templateLocation=gs://${bucket_name}/templates/default-pubsub-to-avro.json \
    --runner=DataflowRunner"

Then I created a job from the staged template in Dataflow and saw this warning:

No metadata file found for this template.

Specify project id for spanner to avro export

I'd like to run the "Cloud Spanner to Cloud Storage Avro" template in a different project than the one my Spanner instance lives in.
Currently I cannot specify the project in the template:

SpannerConfig spannerConfig =
    SpannerConfig.create()
        .withHost(options.getSpannerHost())
        .withInstanceId(options.getInstanceId())
        .withDatabaseId(options.getDatabaseId());

It is, however, possible in the Spanner to text export:

SpannerConfig spannerConfig =
    SpannerConfig.create()
        .withProjectId(options.getSpannerProjectId())
        .withInstanceId(options.getSpannerInstanceId())
        .withDatabaseId(options.getSpannerDatabaseId());

Is this something that can be added?

BulkDecompressor -- Error writing failures CSV file

When BulkDecompressor tries to write an error CSV file, seeing this stack trace:

Caused by: java.lang.IllegalArgumentException: No quotes mode set but no escape character is set
	at org.apache.commons.csv.CSVFormat.validate(CSVFormat.java:1397)
	at org.apache.commons.csv.CSVFormat.<init>(CSVFormat.java:647)
	at org.apache.commons.csv.CSVFormat.withQuoteMode(CSVFormat.java:1832)
	at com.google.cloud.teleport.templates.BulkDecompressor.lambda$run$9962e4b6$1(BulkDecompressor.java:234)
	at org.apache.beam.sdk.transforms.Contextful.lambda$fn$36334a93$1(Contextful.java:112)
	at org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:123)

From reading the CSVFormat code, it looks like you need to call withEscape() before you call withQuoteMode(). I will have a pull request to fix this shortly.

Multiple BootstrapServers for KafkaToBigQuery parameters

Hi,

I noticed there is a parameter called bootstrapServers in https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/KafkaToBigQuery.java. I am trying to use 3 Kafka brokers to sync data from Kafka to BigQuery, but I cannot use a comma (",") to separate the brokers in the parameter.

Could you tell me how to set the bootstrap servers so that the Dataflow job ingests from multiple Kafka brokers?

Thanks in advance.

GroupByKey Exception in common/DatastoreConverters.java

Trying to use the Pub/Sub to Datastore template I ran into the error described in this StackOverflow thread. I tried the recommendation posted there, but then got the following exception:

java.lang.IllegalStateException: GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger. Use a Window.into or Window.triggering transform prior to GroupByKey.
at org.apache.beam.sdk.transforms.GroupByKey.applicableTo(GroupByKey.java:173)
at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:204)
at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:120)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
at com.google.cloud.teleport.templates.common.DatastoreConverters$CheckSameKey.expand(DatastoreConverters.java:210)
at com.google.cloud.teleport.templates.common.DatastoreConverters$CheckSameKey.expand(DatastoreConverters.java:172)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at com.google.cloud.teleport.templates.common.DatastoreConverters$WriteJsonEntities.expand(DatastoreConverters.java:158)
at com.google.cloud.teleport.templates.common.DatastoreConverters$WriteJsonEntities.expand(DatastoreConverters.java:134)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
at com.google.cloud.teleport.templates.PubsubToDatastore.main(PubsubToDatastore.java:69)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
at java.lang.Thread.run(Thread.java:748)

There is a workaround for this in another StackOverflow thread that bypasses the issue, but it would help if someone could take a look at it. Thanks.

Javascript UDF error when parsing JSON

I have used the Pub/Sub to BigQuery template to stream JSON data that are sent to a Pub/Sub topic. Through Dataflow I want to flatten the data to match the BigQuery schema and stream them.

Here is the Javascript UDF for the Dataflow process:

function transform(inJson) {
    var obj = JSON.parse(inJson);
    // variable declarations
    // ... 
    data['domain'] = obj['data']['domain']; // line 18
    // ...
    return JSON.stringify(data);
}

I've also tried:

data.domain = obj.data.domain;

I've just copied the example from this repo and extended it to flatten the JSON data.

Here is the error message:

TypeError: Cannot read property "domain" from undefined in <eval> at line number 18

and here is the stack trace:

javax.script.ScriptException: TypeError: Cannot read property "domain" from undefined in <eval> at line number 18
    at jdk.nashorn.api.scripting.NashornScriptEngine.throwAsScriptException(NashornScriptEngine.java:470)
    at jdk.nashorn.api.scripting.NashornScriptEngine.invokeImpl(NashornScriptEngine.java:392)
    at jdk.nashorn.api.scripting.NashornScriptEngine.invokeFunction(NashornScriptEngine.java:190)
    at com.google.cloud.teleport.templates.common.JavascriptTextTransformer$JavascriptRuntime.invoke(JavascriptTextTransformer.java:156)
    at com.google.cloud.teleport.templates.common.JavascriptTextTransformer$FailsafeJavascriptUdf$1.processElement(JavascriptTextTransformer.java:315)
    at com.google.cloud.teleport.templates.common.JavascriptTextTransformer$FailsafeJavascriptUdf$1$DoFnInvoker.invokeProcessElement(Unknown Source)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:325)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:272)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:309)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:77)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:621)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:609)
    at com.google.cloud.teleport.templates.PubSubToBigQuery$PubsubMessageToFailsafeElementFn.processElement(PubSubToBigQuery.java:412)
    at com.google.cloud.teleport.templates.PubSubToBigQuery$PubsubMessageToFailsafeElementFn$DoFnInvoker.invokeProcessElement(Unknown Source)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:325)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:272)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:309)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:77)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:621)
    at org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:71)
    at org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:122)
    at org.apache.beam.sdk.transforms.MapElements$1$DoFnInvoker.invokeProcessElement(Unknown Source)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275)
    at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240)
    at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:325)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
    at org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:76)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1233)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:144)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:972)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: <eval>:18 TypeError: Cannot read property "domain" from undefined
    at jdk.nashorn.internal.runtime.ECMAErrors.error(ECMAErrors.java:57)
    at jdk.nashorn.internal.runtime.ECMAErrors.typeError(ECMAErrors.java:213)
    at jdk.nashorn.internal.runtime.ECMAErrors.typeError(ECMAErrors.java:185)
    at jdk.nashorn.internal.runtime.ECMAErrors.typeError(ECMAErrors.java:172)
    at jdk.nashorn.internal.runtime.Undefined.get(Undefined.java:157)
    at jdk.nashorn.internal.scripts.Script$Recompilation$1$7667A$^eval_.transform(<eval>:18)
    at jdk.nashorn.internal.runtime.ScriptFunctionData.invoke(ScriptFunctionData.java:639)
    at jdk.nashorn.internal.runtime.ScriptFunction.invoke(ScriptFunction.java:494)
    at jdk.nashorn.internal.runtime.ScriptRuntime.apply(ScriptRuntime.java:393)
    at jdk.nashorn.api.scripting.ScriptObjectMirror.callMember(ScriptObjectMirror.java:199)
    at jdk.nashorn.api.scripting.NashornScriptEngine.invokeImpl(NashornScriptEngine.java:386)
    ... 42 more

When I try the Javascript locally by passing some sample data it works as expected without any errors.

DataFlow Template : Sequential BigQuery Table Insertion issue

@sabhyankar

Hi, I am trying to create a Dataflow template that needs to write to two BigQuery tables sequentially.

For example, it first writes to the "Patient" table. Once that is done, it writes to the "Statistics" table. The issue I am facing is that once it writes to the Patient table, it does not execute any code after that.

Any suggestions would be really helpful.

Thanks!

GCS Avro to BigQuery

This is not an issue but a question. In the absence of any other channel, I am asking it in the form of an issue; sorry for that. Here is the question:

Will the GCS Avro to BigTable code also work for BQ? Is there anything I should take care of before applying it to a BQ case?

Thanks much
Pramod

AutoValue_DynamicJdbcIO_DynamicRead.Builder

Hi, I have configured Google Cloud Tools for Eclipse and am facing this issue after cloning the Dataflow templates. I have added the dependencies auto-service-1.0-rc1.jar, guava-16.0.1.jar, jsr-305-2.0.3.jar, and auto-value-1.0-rc1.jar, but could not resolve the issue: AutoValue_DynamicJdbcIO_DynamicRead.Builder could not be resolved to a type

Cannot rebuild template as it is

I pulled the repo and followed the instructions to build the TextToBigQueryStreaming template as is, but I got the following error:
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 45.236 s
[INFO] Finished at: 2019-08-28T10:30:13-04:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java (default-cli) on project google-cloud-teleport-java: An exception occured while executing the Java class. Cannot define class using reflection: Cannot define nest member class java.lang.reflect.AccessibleObject$Cache + within different package then class org.apache.beam.repackaged.core.net.bytebuddy.mirror.AccessibleObject -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java (default-cli) on project google-cloud-teleport-java: An exception occured while executing the Java class. Cannot define class using reflection: Cannot define nest member class java.lang.reflect.AccessibleObject$Cache + within different package then class org.apache.beam.repackaged.core.net.bytebuddy.mirror.AccessibleObject
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
at org.apache.maven.cli.MavenCli.execute (MavenCli.java:956)
at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:288)
at org.apache.maven.cli.MavenCli.main (MavenCli.java:192)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:567)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
Caused by: org.apache.maven.plugin.MojoExecutionException: An exception occured while executing the Java class. Cannot define class using reflection: Cannot define nest member class java.lang.reflect.AccessibleObject$Cache + within different package then class org.apache.beam.repackaged.core.net.bytebuddy.mirror.AccessibleObject
at org.codehaus.mojo.exec.ExecJavaMojo.execute (ExecJavaMojo.java:339)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:210)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
at org.apache.maven.cli.MavenCli.execute (MavenCli.java:956)
at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:288)
at org.apache.maven.cli.MavenCli.main (MavenCli.java:192)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:567)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
Caused by: java.lang.UnsupportedOperationException: Cannot define class using reflection: Cannot define nest member class java.lang.reflect.AccessibleObject$Cache + within different package then class org.apache.beam.repackaged.core.net.bytebuddy.mirror.AccessibleObject
at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.loading.ClassInjector$UsingReflection$Dispatcher$Initializable$Unavailable.defineClass (ClassInjector.java:410)
at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.loading.ClassInjector$UsingReflection.injectRaw (ClassInjector.java:235)
at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.loading.ClassInjector$AbstractBase.inject (ClassInjector.java:111)
at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default$InjectionDispatcher.load (ClassLoadingStrategy.java:232)
at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default.load (ClassLoadingStrategy.java:143)
at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.TypeResolutionStrategy$Passive.initialize (TypeResolutionStrategy.java:100)
at org.apache.beam.repackaged.core.net.bytebuddy.dynamic.DynamicType$Default$Unloaded.load (DynamicType.java:5623)
at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.generateInvokerClass (ByteBuddyDoFnInvokerFactory.java:351)
at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.getByteBuddyInvokerConstructor (ByteBuddyDoFnInvokerFactory.java:247)
at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.newByteBuddyInvoker (ByteBuddyDoFnInvokerFactory.java:220)
at org.apache.beam.sdk.transforms.reflect.ByteBuddyDoFnInvokerFactory.newByteBuddyInvoker (ByteBuddyDoFnInvokerFactory.java:151)
at org.apache.beam.sdk.transforms.reflect.DoFnInvokers.invokerFor (DoFnInvokers.java:35)
at org.apache.beam.runners.core.construction.SplittableParDo.expand (SplittableParDo.java:170)
at org.apache.beam.runners.core.construction.SplittableParDo.expand (SplittableParDo.java:87)
at org.apache.beam.sdk.Pipeline.applyReplacement (Pipeline.java:564)
at org.apache.beam.sdk.Pipeline.replace (Pipeline.java:290)
at org.apache.beam.sdk.Pipeline.replaceAll (Pipeline.java:208)
at org.apache.beam.runners.dataflow.DataflowRunner.replaceTransforms (DataflowRunner.java:995)
at org.apache.beam.runners.dataflow.DataflowRunner.run (DataflowRunner.java:712)
at org.apache.beam.runners.dataflow.DataflowRunner.run (DataflowRunner.java:179)
at org.apache.beam.sdk.Pipeline.run (Pipeline.java:313)
at org.apache.beam.sdk.Pipeline.run (Pipeline.java:299)
at com.google.cloud.teleport.templates.TextToBigQueryStreaming.run (TextToBigQueryStreaming.java:255)
at com.google.cloud.teleport.templates.TextToBigQueryStreaming.main (TextToBigQueryStreaming.java:136)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:567)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:835)
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

The command I ran was:
mvn -X compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.TextToBigQueryStreaming \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=[project] \
--stagingLocation=gs://[bucket]/staging \
--tempLocation=gs://[bucket]/temp \
--templateLocation=gs://[bucket]/templates/text_to_bq_streaming.json \
--runner=DataflowRunner"

The output of 'mvn --version' is
Apache Maven 3.6.1 (d66c9c0b3152b2e69ee9bac180bb8fcc8e6af555; 2019-04-04T15:00:29-04:00)
Maven home:[HOME]/apache/maven/apache-maven-3.6.1
Java version: 12.0.2, vendor: Oracle Corporation, runtime: [HOME]/jdk/jdk-12.0.2
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.19.37-5+deb10u1rodete2-amd64", arch: "amd64", family: "unix"
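
The mvn --version output reports Java 12.0.2, and the jdk.internal.reflect frames in the trace show the build ran on that runtime. The "nest member" wording refers to a class-file feature introduced in Java 11, so a JDK mismatch may be worth ruling out, assuming the Beam version pinned by the repository at the time only supported Java 8 (not verified here). As a minimal sketch, the snippet below parses a java.version string into its feature release so a build can log which JDK it is actually running on. The class JavaVersionCheck and the method featureVersion are hypothetical names, not part of this repository.

```java
// Hypothetical helper, not part of DataflowTemplates: reports the JDK feature release.
final class JavaVersionCheck {

  /** Parses a java.version value such as "1.8.0_212" or "12.0.2" into 8 or 12. */
  static int featureVersion(String javaVersion) {
    String[] parts = javaVersion.split("[._\\-+]");
    int first = Integer.parseInt(parts[0]);
    // Versions up to Java 8 are reported as "1.x"; later releases lead with the feature number.
    if (first == 1 && parts.length > 1) {
      return Integer.parseInt(parts[1]);
    }
    return first;
  }

  public static void main(String[] args) {
    String version = System.getProperty("java.version");
    System.out.println("Running on Java feature release " + featureVersion(version));
  }
}
```

If this prints a release newer than the one the build expects, pointing JAVA_HOME at a matching JDK before rerunning mvn would be the first thing to try.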

Pubsub to BigQuery - Cumulated errors make jobs crash

Hi

I've been using Google Dataflow Templates to send messages from pub/sub to BigQuery based on this: https://cloud.google.com/dataflow/docs/templates/provided-templates#cloudpubsubtobigquery

Since I launched the Dataflow job in streaming mode, it has been generating errors and eventually crashes, because of the way Dataflow handles exceptions:
https://cloud.google.com/dataflow/faq#how-are-java-exceptions-handled-in-cloud-dataflow

Here is the error:
java.lang.RuntimeException: java.io.IOException: Insert failed: [{"errors":[{"debugInfo":"","location":"","message":"Repeated record added outside of an array.","reason":"invalid"}],"index":0}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":1}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":2}, {"errors":[{"debugInfo":"","location":"","message":"","reason":"stopped"}],"index":3} .......]
org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.flushRows(StreamingWriteFn.java:125)
org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.finishBundle(StreamingWriteFn.java:94)

As I understand, this kind of behaviour makes this template not useful for data streaming.

Is there any way to configure the template so that these exceptions are not thrown but are still sent to Stackdriver?

Thanks

Getting Exception when Creating New Template

Hi,

I am trying to create a new template (specifically, by modifying one of the DataflowTemplates). But when I run the Dataflow job, I get an exception that I cannot trace, because none of the frames in the stack trace point to code in this repo. Could you give me advice on tracing and solving this issue?

FYI, the template that I am trying to modify is KafkaToBigQuery.java.

java.lang.NullPointerException
        org.apache.beam.vendor.guava.v20_0.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:770)
        org.apache.beam.sdk.util.WindowedValue$TimestampedWindowedValue.<init>(WindowedValue.java:263)
        org.apache.beam.sdk.util.WindowedValue$TimestampedValueInGlobalWindow.<init>(WindowedValue.java:278)
        org.apache.beam.sdk.util.WindowedValue.timestampedValueInGlobalWindow(WindowedValue.java:117)
        org.apache.beam.runners.dataflow.worker.WorkerCustomSources$UnboundedReaderIterator.getCurrent(WorkerCustomSources.java:827)
        org.apache.beam.runners.dataflow.worker.WorkerCustomSources$UnboundedReaderIterator.getCurrent(WorkerCustomSources.java:759)
        org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.getCurrent(ReadOperation.java:394)
        org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:201)
        org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
        org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
        org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1287)
        org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:149)
        org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:1024)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        java.lang.Thread.run(Thread.java:745)

Thanks in advance

V2 Flex Templates are broken for Python

When a flex template is created, using either my own templates or the provided flex wordcount template, the Dataflow pipeline fails to build correctly once the API call is made.

I believe the issue arises from the base docker images found at: gcr.io/dataflow-templates-base

When I test one of these images with:

docker run --interactive --tty gcr.io/dataflow-templates-base/java8-template-launcher-base bash

I get the following error before it kicks me from the container and shuts it down:

2019/11/13 09:50:10 proto: duplicate proto type registered: google.api.Http
2019/11/13 09:50:10 proto: duplicate proto type registered: google.api.HttpRule
2019/11/13 09:50:10 proto: duplicate proto type registered: google.api.CustomHttpPattern
2019/11/13 09:50:10 proto: duplicate proto type registered: google.protobuf.FieldMask
Created new fluentd log writer for: /var/log/dataflow/template_launcher/runner-json.log

Rename repository

Since Dataflow supports both Java and Python, it's misleading to name it as just DataflowTemplates and only keep Java examples in there.

I suggest that it should use the preferred nomenclature of DataflowTemplates-Java the same way google-cloud SDK does.

I can help add some templates for Python. Here's a repository where I have been keeping some samples: https://github.com/VikramTiwari/dataflow-samples

PS: I know, even I am guilty of not using proper nomenclature, but I am not Google :)

PubSub to Bigquery #Null Pointer Exception

After the changes made to PubSubToBigQuery on April 2nd (d240b96), I am unable to build the Dataflow template and get a NullPointerException.

The error information doesn't help me locate which params/options are missing, even when running mvn in debug mode. Please fix.

mvn compile exec:java -Dexec.mainClass=com.google.cloud.teleport.templates.PubSubToBigQuery -Dexec.cleanupDaemonThreads=false -Dexec.args="--project=${PROJECT_ID} --stagingLocation=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery/staging --tempLocation=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery/temp --templateLocation=gs://${BUCKET_NAME}/dataflow/pipelines/pubsub-to-bigquery/template --runner=DataflowRunner"

[WARNING]
java.lang.NullPointerException
at com.google.cloud.teleport.templates.PubSubToBigQuery.run (PubSubToBigQuery.java:226)
at com.google.cloud.teleport.templates.PubSubToBigQuery.main (PubSubToBigQuery.java:191)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:748)

Unable to create template

C:\Users\anirusharma\poc\DataflowTemplates>mvn compile exec:java -Dexec.mainClass=com.google.cloud.teleport.templates.PubsubToPubsub -Dexec.cleanupDaemonThreads=false -Dexec.args="--project=testbatch-211413 --stagingLocation=gs://templates_test_as/staging --tempLocation=gs://templates_test_as/temp --templateLocation=gs://template_data_as/templates/PubsubToPubsub.json --filesToStage=gs://templates_test_as/staging2 --runner=DataflowRunner"
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Detecting the operating system and CPU architecture
[INFO] ------------------------------------------------------------------------
[INFO] os.detected.name: windows
[INFO] os.detected.arch: x86_64
[INFO] os.detected.version: 10.0
[INFO] os.detected.version.major: 10
[INFO] os.detected.version.minor: 0
[INFO] os.detected.classifier: windows-x86_64
[INFO]
[INFO] --------< com.google.cloud.teleport:google-cloud-teleport-java >--------
[INFO] Building Google Cloud Teleport 0.1-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
Downloading from central: https://repo.maven.apache.org/maven2/io/grpc/grpc-core/maven-metadata.xml
Downloaded from central: https://repo.maven.apache.org/maven2/io/grpc/grpc-core/maven-metadata.xml (1.8 kB at 1.4 kB/s)
Downloading from central: https://repo.maven.apache.org/maven2/io/netty/netty-codec-http2/maven-metadata.xml
Downloaded from central: https://repo.maven.apache.org/maven2/io/netty/netty-codec-http2/maven-metadata.xml (2.0 kB at 20 kB/s)
Downloading from Apache Snapshots Repository: https://repository.apache.org/content/repositories/snapshots/io/grpc/grpc-core/maven-metadata.xml
Downloading from Sonatype Snapshots Repository: https://oss.sonatype.org/content/repositories/snapshots/io/grpc/grpc-core/maven-metadata.xml
Downloaded from Sonatype Snapshots Repository: https://oss.sonatype.org/content/repositories/snapshots/io/grpc/grpc-core/maven-metadata.xml (802 B at 850 B/s)
Downloading from Sonatype Snapshots Repository: https://oss.sonatype.org/content/repositories/snapshots/io/netty/netty-codec-http2/maven-metadata.xml
Downloading from Apache Snapshots Repository: https://repository.apache.org/content/repositories/snapshots/io/netty/netty-codec-http2/maven-metadata.xml
Downloaded from Sonatype Snapshots Repository: https://oss.sonatype.org/content/repositories/snapshots/io/netty/netty-codec-http2/maven-metadata.xml (1.5 kB at 6.7 kB/s)
[INFO]
[INFO] --- maven-enforcer-plugin:3.0.0-M1:enforce (enforce) @ google-cloud-teleport-java ---
[INFO] artifact io.grpc:grpc-core: checking for updates from central
[INFO] artifact io.netty:netty-codec-http2: checking for updates from central
[INFO]
[INFO] --- maven-enforcer-plugin:3.0.0-M1:enforce (enforce-banned-dependencies) @ google-cloud-teleport-java ---
[INFO]
[INFO] --- protobuf-maven-plugin:0.5.1:compile (default) @ google-cloud-teleport-java ---
[INFO] Compiling 1 proto file(s) to C:\Users\anirusharma\poc\DataflowTemplates\target\generated-sources\protobuf\java
[INFO]
[INFO] --- protobuf-maven-plugin:0.5.1:compile-custom (default) @ google-cloud-teleport-java ---
[INFO] Compiling 1 proto file(s) to C:\Users\anirusharma\poc\DataflowTemplates\target\generated-sources\protobuf\grpc-java
[INFO]
[INFO] --- avro-maven-plugin:1.8.2:schema (default) @ google-cloud-teleport-java ---
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ google-cloud-teleport-java ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 3 resources
[INFO] Copying 1 resource
[INFO] Copying 1 resource
[INFO]
[INFO] --- maven-compiler-plugin:3.6.2:compile (default-compile) @ google-cloud-teleport-java ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 93 source files to C:\Users\anirusharma\poc\DataflowTemplates\target\classes
[INFO] /C:/Users/anirusharma/poc/DataflowTemplates/src/main/java/com/google/cloud/teleport/templates/common/BigQueryConverters.java: Some input files use or override a deprecated API.
[INFO] /C:/Users/anirusharma/poc/DataflowTemplates/src/main/java/com/google/cloud/teleport/templates/common/BigQueryConverters.java: Recompile with -Xlint:deprecation for details.
[INFO] /C:/Users/anirusharma/poc/DataflowTemplates/src/main/java/com/google/cloud/teleport/kafka/connector/KafkaRecordCoder.java: Some input files use unchecked or unsafe operations.
[INFO] /C:/Users/anirusharma/poc/DataflowTemplates/src/main/java/com/google/cloud/teleport/kafka/connector/KafkaRecordCoder.java: Recompile with -Xlint:unchecked for details.
[INFO]
[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ google-cloud-teleport-java ---
[WARNING]
java.lang.RuntimeException: Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod (InstanceBuilder.java:224)
at org.apache.beam.sdk.util.InstanceBuilder.build (InstanceBuilder.java:155)
at org.apache.beam.sdk.PipelineRunner.fromOptions (PipelineRunner.java:55)
at org.apache.beam.sdk.Pipeline.create (Pipeline.java:145)
at com.google.cloud.teleport.templates.PubsubToPubsub.run (PubsubToPubsub.java:149)
at com.google.cloud.teleport.templates.PubsubToPubsub.main (PubsubToPubsub.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod (InstanceBuilder.java:214)
at org.apache.beam.sdk.util.InstanceBuilder.build (InstanceBuilder.java:155)
at org.apache.beam.sdk.PipelineRunner.fromOptions (PipelineRunner.java:55)
at org.apache.beam.sdk.Pipeline.create (Pipeline.java:145)
at com.google.cloud.teleport.templates.PubsubToPubsub.run (PubsubToPubsub.java:149)
at com.google.cloud.teleport.templates.PubsubToPubsub.main (PubsubToPubsub.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:748)
Caused by: java.lang.RuntimeException: Unable to get application default credentials. Please see https://developers.google.com/accounts/docs/application-default-credentials for details on how to specify credentials. This version of the SDK is dependent on the gcloud core component version 2015.02.05 or newer to be able to get credentials from the currently authorized user via gcloud auth.
at org.apache.beam.sdk.extensions.gcp.auth.NullCredentialInitializer.throwNullCredentialException (NullCredentialInitializer.java:60)
at org.apache.beam.runners.dataflow.util.DataflowTransport.chainHttpRequestInitializer (DataflowTransport.java:99)
at org.apache.beam.runners.dataflow.util.DataflowTransport.newDataflowClient (DataflowTransport.java:76)
at org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions$DataflowClientFactory.create (DataflowPipelineDebugOptions.java:134)
at org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions$DataflowClientFactory.create (DataflowPipelineDebugOptions.java:131)
at org.apache.beam.sdk.options.ProxyInvocationHandler.returnDefaultHelper (ProxyInvocationHandler.java:592)
at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault (ProxyInvocationHandler.java:533)
at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke (ProxyInvocationHandler.java:158)
at com.sun.proxy.$Proxy42.getDataflowClient (Unknown Source)
at org.apache.beam.runners.dataflow.DataflowClient.create (DataflowClient.java:41)
at org.apache.beam.runners.dataflow.DataflowRunner.<init> (DataflowRunner.java:338)
at org.apache.beam.runners.dataflow.DataflowRunner.fromOptions (DataflowRunner.java:332)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod (InstanceBuilder.java:214)
at org.apache.beam.sdk.util.InstanceBuilder.build (InstanceBuilder.java:155)
at org.apache.beam.sdk.PipelineRunner.fromOptions (PipelineRunner.java:55)
at org.apache.beam.sdk.Pipeline.create (Pipeline.java:145)
at com.google.cloud.teleport.templates.PubsubToPubsub.run (PubsubToPubsub.java:149)
at com.google.cloud.teleport.templates.PubsubToPubsub.main (PubsubToPubsub.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:748)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:40 min
[INFO] Finished at: 2018-12-18T00:01:47-05:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java (default-cli) on project google-cloud-teleport-java: An exception occured while executing the Java class. Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions): InvocationTargetException: Unable to get application default credentials. Please see https://developers.google.com/accounts/docs/application-default-credentials for details on how to specify credentials. This version of the SDK is dependent on the gcloud core component version 2015.02.05 or newer to be able to get credentials from the currently authorized user via gcloud auth. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

Custom shard templates with YYYY/MM/dd/HH/mm replacements

Hi all,

We are trying to set up a custom shard template like gs://bucket/YYYY/MM/dd/HH/W/P-SS-of-NN, so that the bucket can still be easily browsed manually.

We are having issues with that custom shard template, as it seems those replacements are not supported. The short description shown when creating the job does not say anything about year/month/date/hour replacements. It just says:

The shard template defines the unique/dynamic portion of each windowed file. Recommended to use the default (W-P-SS-of-NN). At runtime, 'W' is replaced with the window date range and 'P' is replaced with the pane info.

Searching this repo we found these references, but we are not sure whether they are available to the custom shard template.

Questions:

  • Are YYYY, MM, dd, HH available to the custom shard template?

  • What are the supported replacements?

Any pointers will be greatly appreciated.
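
Whether the shard template itself accepts these tokens is exactly the open question, so the following is only an illustration of what such a substitution would look like, not a description of the template's behavior. It is a minimal Java sketch in which DatePathResolver and resolve are hypothetical names, and the token set (YYYY, MM, dd, HH, mm) is taken from the title of this issue.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration only: this is not the template's implementation.
final class DatePathResolver {
  private static final DateTimeFormatter YEAR = DateTimeFormatter.ofPattern("yyyy");
  private static final DateTimeFormatter MONTH = DateTimeFormatter.ofPattern("MM");
  private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("dd");
  private static final DateTimeFormatter HOUR = DateTimeFormatter.ofPattern("HH");
  private static final DateTimeFormatter MINUTE = DateTimeFormatter.ofPattern("mm");

  /**
   * Replaces the date tokens in a path pattern with values from the given time.
   * Caveat: this is plain text replacement, so a literal path segment that
   * happens to contain "MM", "dd", "HH" or "mm" would be rewritten as well.
   */
  static String resolve(String pattern, LocalDateTime time) {
    return pattern
        .replace("YYYY", YEAR.format(time))
        .replace("MM", MONTH.format(time))
        .replace("dd", DAY.format(time))
        .replace("HH", HOUR.format(time))
        .replace("mm", MINUTE.format(time));
  }

  public static void main(String[] args) {
    LocalDateTime windowStart = LocalDateTime.of(2018, 9, 13, 13, 54);
    System.out.println(resolve("gs://bucket/YYYY/MM/dd/HH/W/P-SS-of-NN", windowStart));
  }
}
```

The W, P, SS and NN placeholders are left untouched here, since resolving those is the job of the file naming policy rather than of date substitution.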

SpannerIO: Support read write transactions

Is it possible to perform a read-write transaction with the Spanner connector for Dataflow/Beam?
I have a use case which is currently implemented in a Java App Engine flex app, but I would like to see if I can do it in Dataflow.

I have looked at grouped mutations, but I am not sure whether I can read inside one.

Pub/Sub to BQ fails to serialize json

I am trying to get going with the Dataflow template for Pub/Sub subscription to BigQuery. All of the messages end up in the table for errors records. The stack trace says: "java.lang.RuntimeException: Failed to serialize json to table row" and it fails for the actual message body.

The body of each message is a JSON array like:
[{"itemNo":"00050330","itemType":"Sales","itemDesc":"A Table","quantity":4.0,"extendedAmount":120.0,"orderId":null,"itemGroup":"0421","originalDocumentNo":null}]

Any help would be appreciated!
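
The sample body is a top-level JSON array rather than a JSON object. If the template expects one JSON object per message (an assumption, since the issue does not show the conversion code), one possible workaround is to wrap the array under a named field before it is converted. The sketch below shows that wrapping in plain Java; ArrayPayloadWrapper, wrapIfArray and the field name "items" are hypothetical.

```java
// Hypothetical helper: wraps a top-level JSON array in an object.
final class ArrayPayloadWrapper {

  /**
   * Returns {"<field>": [...]} when the payload is a top-level JSON array,
   * and returns any other payload unchanged. No JSON validation is done here.
   */
  static String wrapIfArray(String payload, String field) {
    String trimmed = payload.trim();
    if (trimmed.startsWith("[")) {
      return "{\"" + field + "\":" + trimmed + "}";
    }
    return payload;
  }

  public static void main(String[] args) {
    String body = "[{\"itemNo\":\"00050330\",\"quantity\":4.0}]";
    System.out.println(wrapIfArray(body, "items"));
  }
}
```

With this approach the destination table would need a matching repeated RECORD column (here "items") for the wrapped rows to load.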

Pub/Sub to BigQuery Errors PayloadString is not valid JSON

The errors from the Cloud Pub/Sub Subscription to BigQuery template aren't saved as valid JSON in the PayloadString column and I am unable to replay them.

It looks like this:
{event={userId=1234, sessionEvent={sessionId=DSFG, ua=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36, browser={name=Chrome, version=74.0.3729.131, major=74}}}

If this is intended, how do I convert it and push it back to Pub/Sub?
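
The key=value shape above matches what Java's Map.toString() produces, which suggests (though the issue does not confirm it) that the row was written with toString() instead of being serialized as JSON. The sketch below reproduces that shape with plain java.util maps; MapToStringDemo is a hypothetical class and the values are taken from the sample above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo: shows the text form that Map.toString() produces.
final class MapToStringDemo {

  static String render() {
    Map<String, Object> browser = new LinkedHashMap<>();
    browser.put("name", "Chrome");
    browser.put("major", 74);

    Map<String, Object> event = new LinkedHashMap<>();
    event.put("userId", 1234);
    event.put("browser", browser);

    Map<String, Object> row = new LinkedHashMap<>();
    row.put("event", event);

    // AbstractMap.toString() joins entries as key=value, without quoting strings.
    return row.toString();
  }

  public static void main(String[] args) {
    System.out.println(render());
    // prints {event={userId=1234, browser={name=Chrome, major=74}}}
  }
}
```

Because strings are not quoted in this form, a value that itself contains ", " or "=" (such as the user agent above) cannot be split back into fields unambiguously, so replaying from the original Pub/Sub message is likely to be more reliable than parsing PayloadString.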

java.lang.RuntimeException: Failed to serialize json to table row:

I am trying to store a JSON string in a BigQuery table. When I publish the simple JSON below from the publish message box, no records are inserted into the BigQuery table. The logs show the error above.

{
"childName":"Aijaz_Google555",
"present":"Gift_Google555",
"JsonObject":[{"Actions":"test actions","CreatedBy": "test created by", "CreatedTimestamp": "test","Extended": "test extended"}]
}

childName, present and JsonObject are three STRING columns; in the third column, JsonObject, I want to store a JSON object as a string. Kindly help.

Also, from C# code, I am trying to build the string like below:

string content = "{ "childName":"Aijaz_Google666","present":"Gift_Google666","JsonObject":"{"Actions": "test actions","CreatedBy": "test created by", "CreatedTimestamp": "2015 - 10 - 28T10: 15:30(ISO Date Time Format)","Extended": "test extended"}"}";

Thanks in advance.
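
For a STRING column to hold a JSON object, the inner object generally has to be sent as an escaped JSON string rather than as a nested object or array. The sketch below shows that escaping in Java, the language of this repository; the same idea applies to the C# literal above. JsonStringField, quote and buildMessage are hypothetical names.

```java
// Hypothetical helper: embeds a JSON document as a string value in an outer JSON object.
final class JsonStringField {

  /**
   * Escapes backslashes and double quotes and wraps the text in quotes.
   * Control characters such as newlines are not handled in this sketch.
   */
  static String quote(String raw) {
    return "\"" + raw.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  /** Builds the outer message with the inner JSON carried as a string value. */
  static String buildMessage(String childName, String present, String innerJson) {
    return "{\"childName\":" + quote(childName)
        + ",\"present\":" + quote(present)
        + ",\"JsonObject\":" + quote(innerJson) + "}";
  }

  public static void main(String[] args) {
    String inner = "{\"Actions\":\"test actions\",\"CreatedBy\":\"test created by\"}";
    System.out.println(buildMessage("Aijaz_Google666", "Gift_Google666", inner));
  }
}
```

In real code a JSON library would be a safer choice than hand-written escaping; the point here is only that the quotes inside the inner object must be escaped.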

Support dumping multiple Spanner databases to Avro

To be able to use Cloud Scheduler effectively with the Spanner->Avro template, it would be ideal if https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/spanner/ExportPipeline.java allowed specifying multiple Database IDs (instead of a single one, as happens currently)

The current template already creates a subdirectory for the exported database in the GCS output directory: if multiple databases were specified multiple subdirectories would be created, one for each database.

As an extension, it would be very useful even to make the Database ID optional, in which case the dataflow would have to enumerate the databases in the specified Spanner instance, and then export all of them.

The goal is to be able to trigger an export of one, multiple or all databases on a spanner instance from a cloud scheduler job.

GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger

I am new to running pipeline jobs on Google Cloud and I am running into an issue with the PubSub to Datastore template. 'mvn clean && mvn compile' worked, but the command to create the template fails.
--Command
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.PubsubToDatastore \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=PROJECT_ID \
--pubsubReadTopic=projects/PROJECT_ID/topics/topic \
--javascriptTextTransformGcsPath=gs://PROJECT_ID/*.js. \
--javascriptTextTransformFunctionName=transform \
--stagingLocation=gs://PROJECT_ID/staging \
--tempLocation=gs://PROJECT_ID/temp \
--templateLocation=gs://<PROJECT_ID>/templates/PubSub_to_Datastore.json \
--runner=DataflowRunner"

-- javascript
/**
 * A transform which adds a field to the incoming data.
 * @param {string} inJson
 * @return {string} outJson
 */
function transform(line) {
  var values = line.split(',');

  var obj = new Object();
  obj._description = values[0];
  obj._east = values[1];
  obj._last_updt = values[2];
  obj._north = values[3];
  obj._region_id = values[4];
  obj._south = values[5];
  obj._west = values[6];
  obj.current_speed = values[7];
  obj.region = values[8];
  var jsonString = JSON.stringify(obj);

  return jsonString;
}

-- Datastore Schema
{
  "Datastore Schema": [
    {
      "name": "_description",
      "type": "STRING"
    },
    {
      "name": "_east",
      "type": "FLOAT"
    },
    {
      "name": "_last_updt",
      "type": "TIMESTAMP"
    },
    {
      "name": "_north",
      "type": "FLOAT"
    },
    {
      "name": "_region_id",
      "type": "INTEGER"
    },
    {
      "name": "_south",
      "type": "FLOAT"
    },
    {
      "name": "_west",
      "type": "FLOAT"
    },
    {
      "name": "current_speed",
      "type": "FLOAT"
    },
    {
      "name": "region",
      "type": "STRING"
    }
  ]
}

Record type not supported in TextIOToBigQuery template

Hi guys,

The BigQuery schema file used by this template cannot define a RECORD type; doing so produces the error below. Looking at the code, the withSchema() block does not appear to build the table schema recursively, so nested RECORD definitions are not handled. I would be happy to code it up if you like.

Error: Field xxxx is type RECORD but has no schema.
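
Handling RECORD means walking the schema recursively, because every RECORD field carries its own nested list of fields. The sketch below shows that recursion with a plain, hypothetical Field class rather than the BigQuery client types, so it illustrates the shape of a possible fix and not the template's actual code. SchemaWalker, Field and describe are all assumed names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of recursive schema handling; not the template's code.
final class SchemaWalker {

  /** Minimal stand-in for a schema field: a name, a type and optional nested fields. */
  static final class Field {
    final String name;
    final String type;
    final List<Field> fields;

    Field(String name, String type, Field... fields) {
      this.name = name;
      this.type = type;
      this.fields = Arrays.asList(fields);
    }
  }

  /** Flattens a schema into "path:TYPE" entries, descending into RECORD fields. */
  static List<String> describe(List<Field> fields, String prefix) {
    List<String> out = new ArrayList<>();
    for (Field field : fields) {
      String path = prefix.isEmpty() ? field.name : prefix + "." + field.name;
      if ("RECORD".equals(field.type)) {
        if (field.fields.isEmpty()) {
          throw new IllegalArgumentException(
              "Field " + path + " is type RECORD but has no schema.");
        }
        // Recurse so that nested records of any depth are covered.
        out.addAll(describe(field.fields, path));
      } else {
        out.add(path + ":" + field.type);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Field> schema = Arrays.asList(
        new Field("id", "STRING"),
        new Field("address", "RECORD", new Field("city", "STRING")));
    System.out.println(describe(schema, ""));
  }
}
```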

PubSub to BigQuery Javascript UDF destroys attributes

Hi,
We're using the PubSub Subscription to bigquery template. We have data in both PubSubMessage Attributes and the body. Our body contains an array without a field name i.e

[
 {"id": "item1"},
 {"id": "item2"}
]

Which the template had issues parsing, so we added a simple UDF

function process(str) {
    // Wrap the top-level array in an object so it has a field name.
    var arrayOfItems = JSON.parse(str);
    var outObject = {items: arrayOfItems};
    return JSON.stringify(outObject);
}
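
As a sanity check, the UDF can be run locally against the sample body. The sketch below uses the same logic under the hypothetical name `wrapArray`, renamed only to avoid clashing with Node's global `process` when run outside Dataflow. It also makes visible that the function receives and returns nothing but the payload string, so attributes are never in scope inside the UDF.

```javascript
// Same logic as the UDF above; the name wrapArray is hypothetical.
function wrapArray(str) {
  var arrayOfItems = JSON.parse(str);
  var outObject = { items: arrayOfItems };
  return JSON.stringify(outObject);
}

// The message body from the example above, as a string.
var payload = '[{"id": "item1"}, {"id": "item2"}]';

var result = wrapArray(payload);
console.log(result);
// prints {"items":[{"id":"item1"},{"id":"item2"}]}
```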

When this template runs it seems like the attributes are discarded after the UDF step.

I'm not that well versed in Beam, but it seems that when the InvokeUDF step is built, everything but the message payload is discarded:

 PCollectionTuple udfOut =
          input
              // Map the incoming messages into FailsafeElements so we can recover from failures
              // across multiple transforms.
              .apply("MapToRecord", ParDo.of(new PubsubMessageToFailsafeElementFn()))
              .apply(
                  "InvokeUDF",
                  FailsafeJavascriptUdf.<PubsubMessage>newBuilder()
                      .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                      .setFunctionName(options.getJavascriptTextTransformFunctionName())
                      .setSuccessTag(UDF_OUT)
                      .setFailureTag(UDF_DEADLETTER_OUT)
                      .build());

The PubsubMessageToFailsafeElementFn looks like this:

 static class PubsubMessageToFailsafeElementFn
      extends DoFn<PubsubMessage, FailsafeElement<PubsubMessage, String>> {
    @ProcessElement
    public void processElement(ProcessContext context) {
      PubsubMessage message = context.element();
      context.output(
          FailsafeElement.of(message, new String(message.getPayload(), StandardCharsets.UTF_8)));
    }
  }

It only passes on message.getPayload(), which is probably what causes the issue.

So my question is: am I doing something wrong, or is there some way of getting both the attributes and the payload through the UDF? Or do I have to modify the Java template?

Thanks in advance!

Use same group.id in consumer properties

Setting group.id via .updateConsumerProperties() still makes the reader start reading at offset 0 for every job.

The setup

        Map<String, Object> props = new HashMap<>();
        props.put("group.id", "dataflow-reader");
        props.put("auto.offset.reset", "earliest");

        PCollection<KafkaRecord<String, String>> pcol = p.apply(KafkaIO.<String, String>read()
            .withBootstrapServers(options.getBootstrapServers())
            .withTopics(topics)
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withNumSplits(1)
            .updateConsumerProperties(props));

When I start a new job, this is logged in the console:

Reader-0: reading from name-of-topic-0 starting at offset 0
ConsumerConfig values:
auto.commit.interval.ms = 5000
auto.offset.reset = earliest
bootstrap.servers = [xxxx]
check.crcs = true
client.id =
connections.max.idle.ms = 540000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = Reader-0_offset_consumer_778069295_dataflow-reader

And it looks like that happens here

Is it possible to make the reader not start from offset 0 for each new dataflow job instance?
