Easily configurable via an application.ini file.
Input sources supported:
- File (supported formats: parquet, json, csv, etc.)
- Pub/Sub
- BigQuery
- REST API
Output sinks supported:
- File (supported formats: parquet, json, csv, etc.)
- BigQuery
- Pub/Sub
Application.ini sections:
a) Job = Dataflow parameters
b) Input = the reading (input) part
c) Sink = the writing part
d) Transformations = mapping and filtering via beam.Map or beam.Filter; the lambda functions are read as text from the application.ini file.
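The four sections above could look roughly like the sketch below. All section and key names here are illustrative assumptions, not the project's exact schema; see templates/application_template_gs_gs.ini for the real layout.

```ini
; Illustrative sketch only -- consult templates/application_template_gs_gs.ini
; for the actual section and key names used by this project.
[Job]
runner = DirectRunner
project = my-gcp-project

[Input]
type = file
path = gs://my-bucket/input/*.csv
format = csv

[Sink]
type = file
path = gs://my-bucket/output/
format = parquet

[Transformations]
map = lambda row: row
filter = lambda row: row is not None
```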
Notes:
- When writing to parquet files, an Avro schema must be provided.
- The job supports different runners: DirectRunner and DataflowRunner.
- For mapping and filtering (row-level PCollection operations), you can use lambda functions, or just a function name if it is already defined in the business_rules.py file. See: templates/application_template_gs_gs.ini
Installation:
pip install -r requirements.txt
Please note that you also have to install the GCP extras:
pip install apache-beam[gcp]
Run locally:
You can run from IntelliJ with either the DirectRunner or the DataflowRunner.
Set GOOGLE_APPLICATION_CREDENTIALS=path_to_json.json as an environment variable in the main.py run configuration.
The credentials file is obtained from your GCP project: Service Accounts > Keys > Create and download as JSON.
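Outside IntelliJ, the same setup can be done from a terminal. The key path below is an example, not a path from this repo:

```shell
# Point the Google client libraries at the service-account key.
# The path is an example; use the JSON key downloaded from your
# GCP project's Service Accounts page.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/my-project-sa.json"

# Then launch the pipeline locally, e.g. (entry point from this repo):
# python main.py
```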