Easily configurable via an application.ini file.
Input sources supported:
- File (supported formats: parquet, json, csv, etc.)
- Pub/Sub
- BigQuery
- REST API
Output sinks supported:
- File (supported formats: parquet, json, csv, etc.)
- BigQuery
- Pub/Sub
Application.ini sections:
a) Job = Dataflow parameters
b) Input = the reading (input) part
c) Sink = the writing part
d) Transformations = mapping and filtering via beam.Map or beam.Filter; the lambda functions are read as text from the application.ini file.
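The four sections above could look roughly like the sketch below. All section and key names here are illustrative assumptions, not the project's exact schema; see templates/application_template_gs_gs.ini for the real layout.

```ini
; Illustrative sketch only -- consult templates/application_template_gs_gs.ini
; for the actual section and key names used by this project.
[Job]
runner = DirectRunner
project = my-gcp-project

[Input]
type = file
path = gs://my-bucket/input/*.csv
format = csv

[Sink]
type = file
path = gs://my-bucket/output/
format = parquet

[Transformations]
map = lambda row: row
filter = lambda row: row is not None
```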
Notes:
- When writing to parquet files, an Avro schema must be provided.
- The job supports different runners: DirectRunner and DataflowRunner.
- For mapping and filtering (row-level PCollection operations), you can use lambda functions, or just a function name if it is already defined in the business_rules.py file. See: templates/application_template_gs_gs.ini
Installation:
pip install -r requirements.txt
Please note that you also have to install the GCP extras:
pip install apache-beam[gcp]
Run locally:
You can run from IntelliJ with either the DirectRunner or the DataflowRunner.
Set GOOGLE_APPLICATION_CREDENTIALS=path_to_json.json as an environment variable in the main.py run configuration.
The credentials file is obtained from your GCP project: Service Accounts > Keys > Create and download as JSON.
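Outside IntelliJ, the same setup can be done from a terminal. The key path below is an example, not a path from this repo:

```shell
# Point the Google client libraries at the service-account key.
# The path is an example; use the JSON key downloaded from your
# GCP project's Service Accounts page.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/my-project-sa.json"

# Then launch the pipeline locally, e.g. (entry point from this repo):
# python main.py
```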