AlphaSQL


AlphaSQL provides Automatic Parallelization for sets of SQL files and an integrated Type/Schema Checker to eliminate syntax, type, and schema errors from your data warehouse.

Features

You can quickly introduce AlphaSQL by following the CI Example below.

Docker Image

You can run the commands below with Docker:

docker run --rm -v `pwd`:/home matts966/alphasql:latest [command]

for example:

docker run --rm -v `pwd`:/home matts966/alphasql:latest alphacheck ./samples/sample/dag.dot

Commands are installed in the PATH of the image.

Fast Binaries

# To install on macOS
temp=$(mktemp -d)
wget -P $temp https://github.com/Matts966/alphasql/releases/latest/download/alphasql_darwin_x86_64.tar.gz \
    && sudo tar -zxvf $temp/alphasql_darwin_x86_64.tar.gz -C /usr/local/bin
# To install on Linux
temp=$(mktemp -d)
wget -P $temp https://github.com/Matts966/alphasql/releases/latest/download/alphasql_linux_x86_64.tar.gz \
    && sudo tar -zxvf $temp/alphasql_linux_x86_64.tar.gz -C /usr/local/bin --strip=1

Extract DAG from SQL set

alphadag finds dependencies between table references and CREATE TABLE statements, and between function calls and CREATE FUNCTION statements.

# To extract DAG from your SQL set
$ alphadag --output_path ./samples/sample/dag.dot ./samples/sample/

# Or you can check the output in stdout by
$ alphadag [paths]

# with graphviz
$ dot -Tpng samples/sample/dag.dot -o samples/sample/dag.png

Note that the output sometimes contains cycles, in which case refactoring the SQL files or manually editing the dot file is needed (see this issue).

If there are cycles, a warning is emitted, the type checker reports an error, and bq_jobrunner raises an error before execution. You can see an example in ./samples/sample-cycle .
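As a quick sanity check outside of these tools, you can load the edges of a generated dot file and run a cycle check yourself. The sketch below is a minimal illustration, not part of AlphaSQL; it uses only the Python standard library and assumes the simple `N1->N2 ;` edge syntax shown in the outputs in this README.

```python
import re

def find_cycle(dot_text):
    """Return True if the digraph described by dot_text contains a cycle."""
    edges = re.findall(r"(\d+)\s*->\s*(\d+)", dot_text)
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    visiting, done = set(), set()

    def dfs(node):
        if node in visiting:
            return True  # back edge found: the graph has a cycle
        if node in done:
            return False
        visiting.add(node)
        if any(dfs(nxt) for nxt in graph.get(node, [])):
            return True
        visiting.remove(node)
        done.add(node)
        return False

    return any(dfs(node) for node in list(graph))

print(find_cycle("digraph G { 0->1 ; 1->2 ; }"))  # False
print(find_cycle("digraph G { 0->1 ; 1->0 ; }"))  # True
```

Running this against dag.dot before execution makes the cycle warning reproducible in your own scripts.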

If you want to execute some statements serially, you can write a SQL script that contains multiple statements. See samples/sample/create_interim1.sql as an example.

Sample DAG output

The image below is extracted from the SQL set in ./samples/sample . You can write tests for created tables and run them in parallel simply by putting them in separate SQL files.

dag.dot

Side effect first

You can resolve side effects such as INSERT and UPDATE statements before simple references with the --side_effect_first option.

dag.dot

With tables and functions

You can extract dependencies that include tables and functions with the --with_tables and --with_functions options.

With tables

dag.dot

With functions

dag.dot

Parallel Execution

For BigQuery, the output DAG can be run in parallel using

bq-airflow-dag-generator

airflow DAG

The bq-airflow-dag-generator Python package can generate an Airflow DAG from a simple Python script:

dagpath = "/path/to/dag.dot"
dag = generate_airflow_dag_by_dot_path(dagpath)

See the usage section of its README for more details.

bq-jobrunner

from bq_jobrunner.bq_jobrunner import BQJobrunner

FILE_PATH = "./path/to/dag.dot"
PROJECT_ID = "your-project-id"
REGION = "asia-northeast1" # your region

runner = BQJobrunner(
    PROJECT_ID,
    location=REGION,
)
runner.compose_query_by_dot_path(FILE_PATH)
runner.execute()

Note that you should run the job runner from the same directory where you extracted the DAG.
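Independent of any particular runner, parallel execution of the DAG is just a topological traversal: at each step, run every file whose dependencies are finished, in parallel. The sketch below illustrates the idea with a thread pool; `execute_sql_file` is a hypothetical callback standing in for whatever actually runs each file, and the dependency map is assumed to come from parsing dag.dot.

```python
from concurrent.futures import ThreadPoolExecutor

def run_dag(deps, execute_sql_file, workers=4):
    """deps maps each SQL file to the set of files it depends on."""
    remaining = {node: set(parents) for node, parents in deps.items()}
    finished = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while remaining:
            # nodes whose dependencies are all finished can run now
            ready = [n for n, parents in remaining.items() if parents <= finished]
            if not ready:
                raise RuntimeError("cycle detected: no runnable nodes left")
            # all nodes in `ready` are independent, so run them in parallel
            list(pool.map(execute_sql_file, ready))
            finished.update(ready)
            for n in ready:
                del remaining[n]

order = []
run_dag(
    {"a.sql": set(), "b.sql": {"a.sql"}, "c.sql": {"a.sql"}, "d.sql": {"b.sql", "c.sql"}},
    order.append,
)
print(order[0], order[-1])  # a.sql d.sql
```

This level-by-level scheduling is what the BigQuery runners above implement with real query jobs.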

Pipeline level Type Check for SQL set

Note that you should run the type checker from the same directory where you extracted the DAG.

# To check the types and schemas of the SQL set
$ alphacheck ./samples/sample/dag.dot
Analyzing "./samples/sample/create_datawarehouse3.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_datawarehouse2.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_interim2.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/update_interim2.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_datawarehouse1.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_interim3.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_interim1.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/update_interium1.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/insert_into_interim1.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_mart.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/test_mart1.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/test_mart2.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/test_mart3.sql"
SUCCESS: analysis finished!
Successfully finished type check!

If you change column x's type in ./samples/sample/create_datawarehouse3.sql to STRING, the type checker reports an error.

$ alphacheck ./samples/sample/dag.dot
Analyzing "./samples/sample/create_datawarehouse3.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_datawarehouse2.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_interim2.sql"
ERROR: INVALID_ARGUMENT: Column 1 in UNION ALL has incompatible types: INT64, STRING [at ./samples/sample/create_interim2.sql:7:1]
catalog:
        datawarehouse3
        datawarehouse2

Schema specification by JSON

You can specify external schemata (tables not created by queries in the SQL set) by passing the path of a JSON schema file.

# with external schema
$ alphacheck --json_schema_path ./samples/sample-schema.json ./samples/sample/dag.dot

You can extract the required external tables by

$ alphadag --external_required_tables_output_path ./required_tables.txt {./path/to/sqls}
# and get schemata using bq command
$ cat ./required_tables.txt | while read line
do
    bq show \
        --schema \
        --format=prettyjson \
        $line
done

The JSON schema file should consist of a single top-level map whose string keys are table names and whose values are the schemas of those tables. The name and type elements of each schema entry are recognized as in the official API.

{
        "tablename1": [
                {"mode": "NULLABLE", "name": "column1", "type": "STRING", "description": null}
        ],
        "tablename2": [
                {"mode": "NULLABLE", "name": "column1", "type": "STRING", "description": null},
                {"mode": "NULLABLE", "name": "column2", "type": "INT64", "description": null}
        ]
}
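Since the schema file is plain JSON, it is easy to sanity-check its shape before handing it to alphacheck. The helper below is a hypothetical validator, written here for illustration, that enforces the structure described above (a top-level map from table name to a list of column objects carrying name and type).

```python
import json

def validate_schema(text):
    """Raise ValueError unless text is a table-name -> column-list map."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("top-level element must be a map keyed by table name")
    for table, columns in data.items():
        if not isinstance(columns, list):
            raise ValueError(f"{table}: schema must be a list of columns")
        for column in columns:
            # every column object needs at least name and type
            if not {"name", "type"} <= set(column):
                raise ValueError(f"{table}: each column needs 'name' and 'type'")
    return data

schema = validate_schema(
    '{"tablename1": [{"mode": "NULLABLE", "name": "column1", "type": "STRING"}]}'
)
print(schema["tablename1"][0]["name"])  # column1
```

Running a check like this in CI catches malformed schema files before the type check itself runs.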

CI Example

The pipeline-level type check above is also useful in a CI context. The sample in ./samples/sample-ci contains an example of extracting a DAG, retrieving schemata, and checking the schemas and types of a SQL set that queries a BigQuery public dataset. You can introduce this CI into your environment simply by copying cloudbuild_ci_sample.yaml and python_entrypoint.py to your project.

You can try the example CI with the gcloud command by

(cd ./samples/sample-ci && gcloud builds submit --config=cloudbuild_ci_sample.yaml .)

This example

  • Supports the _TABLE_SUFFIX feature!
  • Does not execute queries on BigQuery, so it is very cheap!

License

Apache License 2.0

Sponsors

The development of this project is sponsored by Japan Data Science Consortium and Cybozu Lab.


alphasql's Issues

How to build/test AlphaSQL?

Hopefully this is the right place to pose such a question.
I was wondering how I could build AlphaSQL myself as well as run its tests.

Currently my efforts are stuck at building docker/dev.Dockerfile.
Unfortunately, the following ended up failing for me on a fresh try:

$ docker build -t alphasql-dev -f docker/dev.Dockerfile .
$ docker run -ti --entrypoint /usr/bin/make alphasql-dev build

(I tried both on v1.3.4 and master branch)

I would be happy to follow along and set myself up to build AlphaSQL (at least on Docker), as well as update the documentation with what I learn, to contribute.

Implement parameter type reader

For a while, we can use UDFs instead of parameterized queries.

For example

CREATE FUNCTION IF NOT EXISTS `constants.get_parameter_a`() AS (100);
SELECT * FROM `x.x` WHERE a = `constants.get_parameter_a`();

You can replace the body of constants.get_parameter_a before executing the query.
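The replacement step can be as simple as a textual substitution before submitting the query. The sketch below is a hypothetical illustration (the function name and regex are assumptions, not part of AlphaSQL) of swapping the constant UDF's body for a concrete parameter value.

```python
import re

def bind_parameter(sql, function_name, value):
    """Replace the constant body of a parameter UDF with a concrete value."""
    pattern = (r"(CREATE FUNCTION IF NOT EXISTS `" + re.escape(function_name)
               + r"`\(\) AS \()[^)]*(\))")
    return re.sub(pattern, r"\g<1>" + str(value) + r"\g<2>", sql)

sql = "CREATE FUNCTION IF NOT EXISTS `constants.get_parameter_a`() AS (100);"
print(bind_parameter(sql, "constants.get_parameter_a", 42))
# CREATE FUNCTION IF NOT EXISTS `constants.get_parameter_a`() AS (42);
```

Because the DAG depends only on the function name, rewriting its body this way does not change the extracted dependencies.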

Creating multiple tables with the same name should cause an error

Creating multiple tables with the same name not only can cause a cycle but can also cause undefined behavior. We can visualize graphs and report errors if there are cycles, but it is more desirable to report an error if multiple tables have the same name.

Also, function names should be checked in the same way once #4 is implemented.

create view statements are not recognized

Hi,

I wanted to open a discussion about "CREATE VIEW" statements, as I think they should show up in the dependency graph.

What do you think? I am using alphadag with the --with_tables option. Maybe another option --with_views would be a non-breaking, backwards-compatible way to add these statements to the graph.

Arbitrary dependency expression

The current AlphaSQL (v0.6.0) can't handle arbitrary dependency expressions, for the sake of simplicity (it only tracks dependencies between table references and create table statements). #2

For example, we can't simply insert values into some tables before selecting from them if there are no create table statements.

There are 4 ways to solve this problem.

  1. We can use a table for dependency by creating the table in a SQL file we want to process early and selecting from it in a SQL file we want to process after. We can use functions after #4 is implemented.
  2. We can separate dag and run them serially.
  3. We can serialize all the queries into one SQL script, but it's not efficient.
  4. We can implement a new flag to resolve arbitrary dependencies, but it will be complicated.

We can already use approach 1, and it's simple, I think.
Example implementation is in ./samples/sample-arbitrary-dependency-graph-with-drop-statement.

some sql with `declare` does not generate any references (although it should)

Given a query `child_table.sql`:

declare foo_var int64;

create or replace table dataset.child_table as
select * from dataset.parent_table

Running alphadag against it produces unexpected results, not recognising dataset.parent_table as a precursor to dataset.child_table.

$ alphadag --with_tables child_table.sql
Reading paths passed as a command line arguments...
Only files that end with .sql or .bq are analyzed.
Reading "foo.sql"
digraph G {
0 [label="child_table.sql", shape="", type=query];
1 [label="dataset.child_table", shape=box, type=table];
0->1 ;
}
EXTERNAL REQUIRED TABLES:

Expected results:

digraph G {
0 [label="child_table.sql", shape="", type=query];
1 [label="dataset.child_table", shape=box, type=table];
2 [label="dataset.parent_table", shape=box, type=table];
0->1 ;
2->1 ;
}
EXTERNAL REQUIRED TABLES:
dataset.parent_table

I expected dataset.parent_table to be the precursor to the table created in child_table.sql.

Type checking after topological sorting is not enough

Type checking after topological sorting can't catch all possible problems. We could check all possible execution orders by full search, but that's not efficient, and almost all problems can be caught by a topological sort if the dependency resolution is correct.

Sometimes there are cycles in extracted graph

  • We can prevent cycles by creating dependencies only between table references and create table statements, provided no table is created by multiple statements and serial execution can create the required tables.
  • However, we make INSERT depend on CREATE, UPDATE on INSERT, and other DMLs on UPDATE because this is useful in preprocessing (before v0.5).
  • If needed, we should implement an option that creates dependencies only between table references and create table statements.
