Giter Club home page Giter Club logo

cost_effective_data_pipelines_sh's Introduction

Cost Effective Data Pipelines

Code for blog at: WIP

Setup:

Prerequisites:

  1. Python 3.8 or above
  2. sqlite3
  3. Sufficient disk memory (depending on if you want to run with 1 or 10 or 100GB)

Note: all of this can run on GitHub codespaces: Codespace

Lets create a virtual env and install the libraries needed:

python3 -m venv myenv
# windows
# myenv\Scripts\activate
# Linux
source myenv/bin/activate
pip install -r requirements.txt
# After you are done, deactivate with the command `deactivate`

Generate data:

For the example in this repo we use the TPC-H data set and Coincap API. Let's generate the TPCH data:

# NOTE: This is to clean up any data (if present) 
rm tpch-dbgen/*.tbl
# Generate data set of 1 GB size
cd tpch-dbgen
make
./dbgen -s 1 # Change this number to generate a data of desired size
cd ..

# NOTE: Load the generated data into a tpch sqlite3 db
sqlite3 tpch.db < ./upstream_db/tpch_DDL_DML.sql > /dev/null 2>&1

Let's open a sqlite3 shell and run a quick count check to ensure that the tables were loaded properly.

sqlite3 tpch.db
.read ./upstream_db/count_test.sql
/* 
Your output will be (if you generated a 1GB dataset)
150000
6001215
25
1500000
200000
800000
5
10000
*/
.exit  -- exit sqlite3

Data processing

Run ETL with python as shown below

time python ./src/data_processor/exchange_data.py 2024-05-29
time python ./src/data_processor/dim_parts_supplier.py 2024-05-29
time python ./src/data_processor/one_big_table.py 2024-05-29 # 2m 43s
time python ./src/data_processor/wide_month_supplier_metrics.py 2024-05-29 # 7m 20s

The script wide_month_supplier_metrics.py ran in 7m and 20s, this included reading in about 10GB of data, inefficiently processing it and writing it out.

The script was run on a 8 core, 32 GB RAM, 1TB HDD 2017 Thinkpad.

Resource utilization:

htop

cost_effective_data_pipelines_sh's People

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.