
Invoices data project

This data project demonstrates how Tinybird can be used for real-time analytics on data retrieved from BigQuery, using the CLI and an Apache Beam connector.

This project is divided into:

  • ds-gen: Python script for generating the source dataset used to populate BigQuery
  • tb-project: The Tinybird data project
  • tableau-connector: The "connectors" to consume Tinybird APIs from Tableau

Start by cloning this repo locally.

Generate synthetic data

cd ds-gen
mkdir output
python3 -m venv .e
source .e/bin/activate
pip install -r requirements.txt
python3 gen.py

This script generates a synthetic dataset for the invoices data project, writing these files to the output directory:

  • clients-synthetic-data.json
  • recipients-synthetic-data.json
  • invoices-synthetic-data.json
  • agents-synthetic-data.json
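The generated invoice records presumably look something like the following. This is a minimal sketch: the field names mirror the columns selected from BigQuery later in this README, but the value ranges are made up for illustration — see gen.py for the real logic.

```python
import json
import random
import uuid
from datetime import datetime, timezone

def make_invoice():
    # Field names follow the columns used in the BigQuery queries below;
    # the values here are invented for illustration only.
    return {
        "id": str(uuid.uuid4()),
        "agent_id": str(uuid.uuid4()),
        "recipient_code": "R%04d" % random.randint(1, 500),
        "client_id": str(uuid.uuid4()),
        "amount": round(random.uniform(10.0, 5000.0), 2),
        "currency": random.choice(["EUR", "USD", "GBP"]),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "added_payments": [],
    }

# BigQuery's JSON importer expects newline-delimited JSON: one object per line.
print("\n".join(json.dumps(make_invoice()) for _ in range(3)))
```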

Upload data to BigQuery

You can upload the JSON files generated in the previous step manually from the BigQuery console:

Create a dataset called tinybird and upload the JSON files as tables named agents, clients, recipients, and invoices (the names the queries below expect).

Upload the data project to Tinybird

Check the CLI documentation for other installation options and troubleshooting.

cd tb-project
python3 -m venv .e
source .e/bin/activate
pip install tinybird-cli
tb auth
tb push datasources/currency.datasource --fixtures
tb push --push-deps

Import existing data from BigQuery to Tinybird

There's a BigQuery connector integrated into the CLI. To start using it, run:

export GOOGLE_APPLICATION_CREDENTIALS=absolute_path_to_your_json_service_account
tb auth --connector bigquery

You'll be prompted to enter the name of your BigQuery project and the name of a Cloud Storage bucket to store temporary files.

It supports several ingestion scenarios:

Bulk upload

This is handy to move data from BigQuery to Tinybird for the first time:

export PROJECT_ID="YOUR_PROJECT_ID"

tb datasource append agents \
                --connector bigquery \
                --sql "select agent_name, id from \`$PROJECT_ID.tinybird.agents\`"

tb datasource append clients \
                --connector bigquery \
                --sql "select company_country, company_name, id from \`$PROJECT_ID.tinybird.clients\`"

tb datasource append recipients \
                --connector bigquery \
                --sql "select recipient_country, recipient_code, id from \`$PROJECT_ID.tinybird.recipients\`"

tb datasource append invoices \
                --connector bigquery \
                --sql "select id, agent_id, recipient_code, client_id, amount, currency, created_at, to_json_string(added_payments) added_payments from \`$PROJECT_ID.tinybird.invoices\`"

Incremental updates

This is the kind of ingestion you want to schedule periodically (e.g. with a cron job), to ingest only the new data arriving in the invoices table.

tb datasource append invoices \
            --connector bigquery \
            --sql "select id, agent_id, recipient_code, client_id, amount, currency, created_at, to_json_string(added_payments) added_payments from \`$PROJECT_ID.tinybird.invoices\`" \
            --incremental created_at
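To schedule this from cron, one option is a small wrapper script that assembles and runs the command. A sketch (assuming the `tb` CLI is on the PATH; `subprocess.run` is left commented out so the snippet is safe to run anywhere):

```python
import shlex
# import subprocess  # uncomment to actually execute the command

PROJECT_ID = "YOUR_PROJECT_ID"  # placeholder, as above

sql = (
    "select id, agent_id, recipient_code, client_id, amount, currency, "
    "created_at, to_json_string(added_payments) added_payments "
    f"from `{PROJECT_ID}.tinybird.invoices`"
)

# The same flags as the command above, as an argument list.
cmd = [
    "tb", "datasource", "append", "invoices",
    "--connector", "bigquery",
    "--sql", sql,
    "--incremental", "created_at",
]

print(" ".join(shlex.quote(part) for part in cmd))
# subprocess.run(cmd, check=True)  # schedule this script with cron
```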

Replacements

The dimension tables (agents, clients, recipients) can be fully replaced when you need to ingest new data:

tb datasource replace agents \
                --connector bigquery \
                --sql "select agent_name, id from \`$PROJECT_ID.tinybird.agents\`"

Or you can replace only part of the data. For instance, to replace the invoices created since 2021, run this:

tb datasource replace invoices \
            --connector bigquery \
            --sql "select id, agent_id, recipient_code, client_id, amount, currency, created_at, to_json_string(added_payments) added_payments from \`$PROJECT_ID.tinybird.invoices\` where date(created_at) > date('2021-01-01')" \
            --sql-condition="toDate(created_at) > toDate('2021-01-01')"
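Note that `--sql` uses BigQuery SQL while `--sql-condition` uses ClickHouse syntax, so the same date filter has to be written twice. A tiny helper (hypothetical, just to make the pairing explicit) could keep the two in sync:

```python
def replace_filters(since: str):
    """Return the matching BigQuery WHERE clause and ClickHouse
    --sql-condition for a partial replace since a given date."""
    bigquery = f"date(created_at) > date('{since}')"
    clickhouse = f"toDate(created_at) > toDate('{since}')"
    return bigquery, clickhouse

bq, ch = replace_filters("2021-01-01")
print(bq)  # date(created_at) > date('2021-01-01')
print(ch)  # toDate(created_at) > toDate('2021-01-01')
```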

Streaming with Google DataFlow

Now we'll publish some invoices to Google Pub/Sub and deploy a Dataflow pipeline that ingests those events into the invoices Data Source in Tinybird. You can watch your endpoints update as data comes in.

Create the Python environment:

git clone https://github.com/tinybirdco/tinybird-beam
cd tinybird-beam/dataflow
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
# Update the variables in the `sample.env` file and source it:
# TB_TOKEN should be the same token you used to push the data project in the previous steps
source sample.env

Create the PubSub topic:

gcloud pubsub topics create demo-topic

Push the Apache Beam pipeline to DataFlow:

python dataflow.py \
  --project=$PROJECT_NAME \
  --region=$REGION \
  --runner=DataflowRunner \
  --temp_location=$TMP_LOCATION \
  --input_topic=projects/$PROJECT_NAME/topics/$TOPIC \
  --bq_table=tinybird.invoices \
  --batch_size=10000 \
  --batch_seconds=5 \
  --batch_key= \
  --tb_host=https://api.tinybird.co \
  --tb_token=$TB_TOKEN \
  --tb_datasource=invoices \
  --tb_columns="id,agent_id,recipient_code,client_id,amount,currency,created_at,added_payments" \
  --worker_machine_type=n1-highmem-32 \
  --enable_streaming_engine \
  --num_workers=1 \
  --max_num_workers=5

This pipeline sends a batch to the Tinybird invoices Data Source whenever it has accumulated 10,000 elements or the 5-second window closes, whichever comes first.
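The count-or-time flush logic can be sketched like this (a simplification of what the Beam pipeline does, not the actual transform):

```python
import time

class Batcher:
    """Flush when batch_size elements accumulate or batch_seconds elapse,
    mirroring the --batch_size / --batch_seconds flags above."""

    def __init__(self, batch_size=10000, batch_seconds=5, flush=print):
        self.batch_size = batch_size
        self.batch_seconds = batch_seconds
        self.flush = flush          # callback that receives a full batch
        self.buffer = []
        self.started = time.monotonic()

    def add(self, element):
        self.buffer.append(element)
        if (len(self.buffer) >= self.batch_size
                or time.monotonic() - self.started >= self.batch_seconds):
            self._flush()

    def _flush(self):
        if self.buffer:
            self.flush(self.buffer)  # e.g. POST the batch to Tinybird
        self.buffer = []
        self.started = time.monotonic()
```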

Once the pipeline is running, check the job status in the Dataflow console.

Start the PubSub publisher:

cd tinybird-beam/pubsub
python pubsub.py
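Each Pub/Sub message presumably carries one JSON-encoded invoice matching the --tb_columns list above. A sketch of that payload shape (hypothetical — check pubsub.py for the real encoding):

```python
import json

# Column list from --tb_columns in the Dataflow command above.
TB_COLUMNS = ["id", "agent_id", "recipient_code", "client_id",
              "amount", "currency", "created_at", "added_payments"]

def encode_event(invoice: dict) -> bytes:
    """Serialize one invoice to the UTF-8 JSON bytes a Pub/Sub
    message body would carry (assumed payload shape)."""
    missing = [c for c in TB_COLUMNS if c not in invoice]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return json.dumps({c: invoice[c] for c in TB_COLUMNS}).encode("utf-8")
```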

Once you finish running the pipeline, remember to clean up the resources you created (the Dataflow job and the Pub/Sub topic).

Contributors

  • alrocar
