sqlbucket's Introduction

SQLBucket

SQLBucket is a lightweight framework for writing, orchestrating and validating SQL data pipelines. It lets you set variables and adds control flow to your SQL through the Jinja2 library. It also implements a deliberately simple unit and integration test framework in which you validate the results of your ETL with SQL checks. With SQLBucket, you can apply TDD principles when writing data pipelines.

It can work as a standalone service, or as part of your workflow manager environment (Airflow, Luigi, etc.).

Installing

Install and update using pip:

pip install -U sqlbucket

SQLBucket now works with Python 3.10.

A Simple Example

To start working, you need to instantiate your SQLBucket core object with the projects_folder parameter. That folder will contain all your SQL ETL. The Python file where you create your SQLBucket object is also a good place to instantiate your command line interface, as shown below.

# my_sqlbucket.py
from sqlbucket import SQLBucket


bucket = SQLBucket(projects_folder='projects')


if __name__ == '__main__':
    bucket.cli()

The following command will create your first project in your projects folder.

python my_sqlbucket.py create-project -n my_first_project

For more information on the CLI, refer to its documentation.

Your projects folder should now look like the structure below:

projects/
    |-- my_first_project/
        |-- config.yaml
        |-- queries/
            |-- query_one.sql
            |-- query_two.sql
        |-- integrity/
            |-- integrity_one.sql

SQLBucket project structure

An SQLBucket project is made of 3 core components: the configuration, the ETL queries and the integrity check queries.

Configuration

The config.yaml is the core of your project. This is where you define project-level variables and configure the order in which your SQL queries are executed. For a fuller explanation of how to configure variables, refer to the usage documentation and to the variables documentation, which also describes environment and connection variables.

ETL queries

The queries folder contains your SQL queries. You can organize them in the folder structure of your choice. As long as they are in the queries folder, SQLBucket will find them and execute them when configured to do so. See the documentation on how to write SQL with SQLBucket.
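As a rough illustration of what variables and control flow in a query can look like, the sketch below renders a templated query with Jinja2 directly. It does not go through SQLBucket, and the table name, variable names (start_date, country) and the render helper are assumptions made up for this example; refer to the SQLBucket documentation for how variables are actually supplied.

```python
from jinja2 import Template

# Hypothetical query text: the table and variable names are illustrative only.
QUERY = """
select *
from events
where event_date >= '{{ start_date }}'
{% if country %}
  and country = '{{ country }}'
{% endif %}
"""


def render(sql, **variables):
    """Render a Jinja2-templated SQL string with the given variables."""
    return Template(sql).render(**variables)


# The optional filter is emitted only when a country is provided.
with_filter = render(QUERY, start_date="2024-01-01", country="ES")
without_filter = render(QUERY, start_date="2024-01-01", country=None)

print(with_filter)
print(without_filter)
```

The same template yields two different statements depending on the variables passed in, which is the kind of control flow the Jinja2 integration makes possible.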

Integrity queries

The integrity folder contains SQL queries that help you validate your ETL. You can organize them in the folder structure of your choice. The only convention is to return the result of each integrity check (True/False) in a field named passed. The main idea is that integrity checks are written in SQL itself. See the documentation on integrity for a more detailed explanation of testing the integrity of your ETL. There is also a set of common macros that can be helpful to start with.
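To make the passed convention concrete, here is a minimal sketch that runs an integrity-style query against an in-memory SQLite database using only the standard library. It does not use SQLBucket itself: the orders table, the query and the run_check helper are hypothetical and only show the shape of a check that returns a single field named passed.

```python
import sqlite3

# Hypothetical integrity query: passes only when no order has a negative amount.
INTEGRITY_SQL = """
select count(*) = 0 as passed
from orders
where amount < 0
"""


def run_check(conn, sql):
    """Run one integrity query and return the value of its 'passed' field."""
    cursor = conn.execute(sql)
    columns = [col[0] for col in cursor.description]
    row = dict(zip(columns, cursor.fetchone()))
    return bool(row["passed"])  # SQLite returns 1/0 for boolean expressions


conn = sqlite3.connect(":memory:")
conn.execute("create table orders (id integer, amount real)")
conn.executemany("insert into orders values (?, ?)", [(1, 10.0), (2, 25.5)])
clean_result = run_check(conn, INTEGRITY_SQL)

# Introduce a bad row; the same check should now fail.
conn.execute("insert into orders values (3, -4.0)")
dirty_result = run_check(conn, INTEGRITY_SQL)

print(clean_result, dirty_result)
```

Because the check is plain SQL, the same query text could live in the integrity folder and be evaluated by the database your ETL targets.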

The full example below first runs your ETL and then runs your integrity checks for a given database configuration.

from sqlbucket import SQLBucket

connections = {
    'db_demo': 'postgresql://user:password@host:5439/database'
}

bucket = SQLBucket(connections=connections)
project = bucket.load_project(
    project_name='my_first_project',
    connection_name='db_demo',
    variables={'foo': 1}
)

# to run ETL
project.run()

# to run integrity
project.run_integrity()

We recommend setting your connection URLs as environment variables for security purposes.
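One way to follow this recommendation is to build the connections dictionary from the environment instead of hard-coding URLs. The sketch below is an assumption, not part of SQLBucket: the connections_from_env helper and the SQLBUCKET_<NAME>_URL naming scheme are invented for illustration.

```python
import os


def connections_from_env(names, env=os.environ):
    """Build a connections dict from environment variables.

    For a connection called 'db_demo', the URL is read from
    SQLBUCKET_DB_DEMO_URL (a hypothetical naming scheme).
    """
    connections = {}
    for name in names:
        var = f"SQLBUCKET_{name.upper()}_URL"
        url = env.get(var)
        if url is None:
            raise KeyError(f"missing environment variable {var}")
        connections[name] = url
    return connections


# Demonstration with a fake environment, so no real credentials are involved.
fake_env = {"SQLBUCKET_DB_DEMO_URL": "postgresql://user:password@host:5439/database"}
connections = connections_from_env(["db_demo"], env=fake_env)
print(sorted(connections))
```

The resulting dictionary has the same shape as the connections dictionary in the example above, so it can be passed as the connections argument when creating the SQLBucket object.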

Template project

To get you up to speed, you can create a fork of the SQLBucket template project and start building SQL data pipelines within minutes.

Contributing

For guidance on how to make a contribution to SQLBucket, see the contributing guidelines.

sqlbucket's People

Contributors

acjones27, philippe2803, sp-enric-fradera, sp-joan-madrid, sp-philippe-oger, sp-sergio-sanchez

sqlbucket's Issues

Connect to Oracle error

Hi,

Does this project support Oracle connection? Or how can I set the connection?

I set connections as below
connections = {'conn': 'oracle+cx_oracle://name:passwd@ip:port/dbname'}

When I run the project, I get the error "TypeError: Invalid argument(s) 'isolation_level' sent to create_engine(), using configuration OracleDialect_cx_oracle/NullPool/Engine. Please check that the keyword arguments are appropriate for this combination of components."
It seems isolation_level is fixed to "AUTOCOMMIT" in create_connection (runners.py).

Thanks

TODO list for next week

Small refactoring plus a few small features to add in the next few days:

  • Add connection variables.
  • Renaming variables sources (env => global).
  • Documentation on variables, and distinction between the different ways to set them.
    • Distinction between global and project variables.
    • Distinction between variables submitted and variables set in config.
  • Add a connection_query attribute in config, referencing an SQL query to be run before the ETL, regardless of the steps (typically for setting up the search path).

Parameter ts is not exclusive

Hey,

I tried running a job with params -fs 1 -ts 2 and I got the first AND second query. Don't think this is intended, at least not according to the cli documentation. Am I doing something wrong?
