AlphaSQL


AlphaSQL provides Automatic Parallelization for sets of SQL files and an integrated Type/Schema Checker to eliminate syntax, type, and schema errors from your data warehouse.

Features

You can quickly introduce AlphaSQL by following the CI Example below.

Docker Image

You can run the commands below with Docker:

docker run --rm -v `pwd`:/home matts966/alphasql:latest [command]

for example:

docker run --rm -v `pwd`:/home matts966/alphasql:latest alphacheck ./samples/sample/dag.dot

Commands are installed in the PATH of the image.

Fast Binaries

# To install on macOS
temp=$(mktemp -d)
wget -P $temp https://github.com/Matts966/alphasql/releases/latest/download/alphasql_darwin_x86_64.tar.gz \
    && sudo tar -zxvf $temp/alphasql_darwin_x86_64.tar.gz -C /usr/local/bin
# To install on Linux
temp=$(mktemp -d)
wget -P $temp https://github.com/Matts966/alphasql/releases/latest/download/alphasql_linux_x86_64.tar.gz \
    && sudo tar -zxvf $temp/alphasql_linux_x86_64.tar.gz -C /usr/local/bin --strip=1

Extract DAG from SQL set

alphadag finds dependencies between table references and CREATE TABLE statements, and between function calls and CREATE FUNCTION statements.

# To extract DAG from your SQL set
$ alphadag --output_path ./samples/sample/dag.dot ./samples/sample/

# Or you can check the output in stdout by
$ alphadag [paths]

# with graphviz
$ dot -Tpng samples/sample/dag.dot -o samples/sample/dag.png

Note that the output sometimes contains cycles, in which case refactoring the SQL files or manually editing the dot file is needed (see this issue).

If there are cycles, a warning is emitted, the type checker reports an error, and bq_jobrunner raises an error before execution. You can see an example in ./samples/sample-cycle .
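As a quick sanity check outside of these tools, you can load the edges of a generated dot file and run a cycle check yourself. The sketch below is a minimal illustration, not part of AlphaSQL; it uses only the Python standard library and assumes the simple `N1->N2 ;` edge syntax shown in the outputs in this README.

```python
import re

def find_cycle(dot_text):
    """Return True if the digraph described by dot_text contains a cycle."""
    edges = re.findall(r"(\d+)\s*->\s*(\d+)", dot_text)
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    visiting, done = set(), set()

    def dfs(node):
        if node in visiting:
            return True  # back edge found: the graph has a cycle
        if node in done:
            return False
        visiting.add(node)
        if any(dfs(nxt) for nxt in graph.get(node, [])):
            return True
        visiting.remove(node)
        done.add(node)
        return False

    return any(dfs(node) for node in list(graph))

print(find_cycle("digraph G { 0->1 ; 1->2 ; }"))  # False
print(find_cycle("digraph G { 0->1 ; 1->0 ; }"))  # True
```

Running this against dag.dot before execution makes the cycle warning reproducible in your own scripts.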

If you want to execute some statements serially, you can write a SQL script that contains multiple statements. See samples/sample/create_interim1.sql as an example.

Sample DAG output

The image below is extracted from the SQL set in ./samples/sample . You can write tests for created tables and run them in parallel simply by putting them in separate SQL files.

dag.dot

Side effect first

You can resolve side effects such as INSERT and UPDATE statements before simple references with the --side_effect_first option.

dag.dot

With tables and functions

You can extract dependencies that include tables and functions with the --with_tables and --with_functions options.

With tables

dag.dot

With functions

dag.dot

Parallel Execution

For BigQuery, the output DAG can be run in parallel using

bq-airflow-dag-generator

airflow DAG

The bq-airflow-dag-generator Python package can generate an Airflow DAG from a simple Python script:

dagpath = "/path/to/dag.dot"
dag = generate_airflow_dag_by_dot_path(dagpath)

See the usage section of its README for more details.

bq-jobrunner

from bq_jobrunner.bq_jobrunner import BQJobrunner

FILE_PATH = "./path/to/dag.dot"
PROJECT_ID = "your-project-id"
REGION = "asia-northeast1" # your region

runner = BQJobrunner(
    PROJECT_ID,
    location=REGION,
)
runner.compose_query_by_dot_path(FILE_PATH)
runner.execute()

Note that you should run the job runner from the same directory where you extracted the DAG.
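Independent of any particular runner, parallel execution of the DAG is just a topological traversal: at each step, run every file whose dependencies are finished, in parallel. The sketch below illustrates the idea with a thread pool; `execute_sql_file` is a hypothetical callback standing in for whatever actually runs each file, and the dependency map is assumed to come from parsing dag.dot.

```python
from concurrent.futures import ThreadPoolExecutor

def run_dag(deps, execute_sql_file, workers=4):
    """deps maps each SQL file to the set of files it depends on."""
    remaining = {node: set(parents) for node, parents in deps.items()}
    finished = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while remaining:
            # nodes whose dependencies are all finished can run now
            ready = [n for n, parents in remaining.items() if parents <= finished]
            if not ready:
                raise RuntimeError("cycle detected: no runnable nodes left")
            # all nodes in `ready` are independent, so run them in parallel
            list(pool.map(execute_sql_file, ready))
            finished.update(ready)
            for n in ready:
                del remaining[n]

order = []
run_dag(
    {"a.sql": set(), "b.sql": {"a.sql"}, "c.sql": {"a.sql"}, "d.sql": {"b.sql", "c.sql"}},
    order.append,
)
print(order[0], order[-1])  # a.sql d.sql
```

This level-by-level scheduling is what the BigQuery runners above implement with real query jobs.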

Pipeline level Type Check for SQL set

Note that you should run the type checker from the same directory where you extracted the DAG.

# To check the types and schemas of the SQL set
$ alphacheck ./samples/sample/dag.dot
Analyzing "./samples/sample/create_datawarehouse3.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_datawarehouse2.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_interim2.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/update_interim2.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_datawarehouse1.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_interim3.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_interim1.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/update_interium1.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/insert_into_interim1.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_mart.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/test_mart1.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/test_mart2.sql"
SUCCESS: analysis finished!
Analyzing "./samples/sample/test_mart3.sql"
SUCCESS: analysis finished!
Successfully finished type check!

If you change column x's type in ./samples/sample/create_datawarehouse3.sql to STRING, the type checker reports an error.

$ alphacheck ./samples/sample/dag.dot
Analyzing "./samples/sample/create_datawarehouse3.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_datawarehouse2.sql"
DDL analyzed, adding table to catalog...
SUCCESS: analysis finished!
Analyzing "./samples/sample/create_interim2.sql"
ERROR: INVALID_ARGUMENT: Column 1 in UNION ALL has incompatible types: INT64, STRING [at ./samples/sample/create_interim2.sql:7:1]
catalog:
        datawarehouse3
        datawarehouse2

Schema specification by JSON

You can specify external schemata (tables not created by queries in the SQL set) by passing the path of a JSON schema file.

# with external schema
$ alphacheck --json_schema_path ./samples/sample-schema.json ./samples/sample/dag.dot

You can extract the required external tables by

$ alphadag --external_required_tables_output_path ./required_tables.txt {./path/to/sqls}
# and get schemata using bq command
$ cat ./required_tables.txt | while read line
do
    bq show \
        --schema \
        --format=prettyjson \
        $line
done

The JSON schema file should consist of a single top-level map whose string keys are table names and whose values are the schemas of those tables. The name and type elements of each schema entry are recognized as in the official API.

{
        "tablename1": [
                {"mode": "NULLABLE", "name": "column1", "type": "STRING", "description": null}
        ],
        "tablename2": [
                {"mode": "NULLABLE", "name": "column1", "type": "STRING", "description": null},
                {"mode": "NULLABLE", "name": "column2", "type": "INT64", "description": null}
        ]
}
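Since the schema file is plain JSON, it is easy to sanity-check its shape before handing it to alphacheck. The helper below is a hypothetical validator, written here for illustration, that enforces the structure described above (a top-level map from table name to a list of column objects carrying name and type).

```python
import json

def validate_schema(text):
    """Raise ValueError unless text is a table-name -> column-list map."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("top-level element must be a map keyed by table name")
    for table, columns in data.items():
        if not isinstance(columns, list):
            raise ValueError(f"{table}: schema must be a list of columns")
        for column in columns:
            # every column object needs at least name and type
            if not {"name", "type"} <= set(column):
                raise ValueError(f"{table}: each column needs 'name' and 'type'")
    return data

schema = validate_schema(
    '{"tablename1": [{"mode": "NULLABLE", "name": "column1", "type": "STRING"}]}'
)
print(schema["tablename1"][0]["name"])  # column1
```

Running a check like this in CI catches malformed schema files before the type check itself runs.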

CI Example

The pipeline-level type check above is also useful in a CI context. The sample in ./samples/sample-ci contains an example of extracting a DAG, retrieving schemata, and checking the schemas and types of a SQL set that queries a BigQuery public dataset. You can introduce this CI into your environment simply by copying cloudbuild_ci_sample.yaml and python_entrypoint.py to your project.

You can try the example CI with the gcloud command by

(cd ./samples/sample-ci && gcloud builds submit --config=cloudbuild_ci_sample.yaml .)

This example

  • Supports the _TABLE_SUFFIX feature!
  • Does not execute queries on BigQuery, so it is very cheap!

License

Apache License 2.0

Sponsors

The development of this project is sponsored by Japan Data Science Consortium and Cybozu Lab.


alphasql's Issues

How to build/test AlphaSQL?

Hopefully this is the right place to pose such a question.
I was wondering how I could build AlphaSQL myself as well as run its tests.

Currently my efforts are stuck at building docker/dev.Dockerfile.
Unfortunately, the following ended up failing for me on a fresh try:

$ docker build -t alphasql-dev -f docker/dev.Dockerfile .
$ docker run -ti --entrypoint /usr/bin/make alphasql-dev build

(I tried both on v1.3.4 and master branch)

I would be happy to follow along and set myself up to build AlphaSQL (at least on Docker), as well as update the documentation with what I learn, to contribute.

Implement parameter type reader

For a while, we can use UDFs instead of parameterized queries.

For example

CREATE FUNCTION IF NOT EXISTS `constants.get_parameter_a`() AS (100);
SELECT * FROM `x.x` WHERE a = `constants.get_parameter_a`();

You can replace the body of constants.get_parameter_a before executing the query.
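The replacement step can be as simple as a textual substitution before submitting the query. The sketch below is a hypothetical illustration (the function name and regex are assumptions, not part of AlphaSQL) of swapping the constant UDF's body for a concrete parameter value.

```python
import re

def bind_parameter(sql, function_name, value):
    """Replace the constant body of a parameter UDF with a concrete value."""
    pattern = (r"(CREATE FUNCTION IF NOT EXISTS `" + re.escape(function_name)
               + r"`\(\) AS \()[^)]*(\))")
    return re.sub(pattern, r"\g<1>" + str(value) + r"\g<2>", sql)

sql = "CREATE FUNCTION IF NOT EXISTS `constants.get_parameter_a`() AS (100);"
print(bind_parameter(sql, "constants.get_parameter_a", 42))
# CREATE FUNCTION IF NOT EXISTS `constants.get_parameter_a`() AS (42);
```

Because the DAG depends only on the function name, rewriting its body this way does not change the extracted dependencies.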

Creating multiple tables with the same name should cause an error

Creating multiple tables with the same name not only can cause a cycle but can also cause undefined behavior. We can visualize graphs and report errors if there are cycles, but it is more desirable to report an error if multiple tables have the same name.

Also, function names should be checked in the same way once #4 is implemented.

create view statements are not recognized

Hi,

I wanted to open a discussion about "CREATE VIEW" statements, as I think they should show up in the dependency graph.

What do you think? I am using alphadag with the --with_tables option. Maybe another option --with_views would be a non-breaking, backwards-compatible way to add these statements to the graph.

Arbitrary dependency expression

The current AlphaSQL (v0.6.0) can't handle arbitrary dependency expressions, for the sake of simplicity (it only tracks dependencies between table references and create table statements). #2

For example, we can't simply insert values into some tables before selecting from them if there are no create table statements.

There are 4 ways to solve this problem.

  1. We can use a table for dependency by creating the table in a SQL file we want to process early and selecting from it in a SQL file we want to process after. We can use functions after #4 is implemented.
  2. We can separate dag and run them serially.
  3. We can serialize all the queries into one SQL script, but it's not efficient.
  4. We can implement a new flag to resolve arbitrary dependencies, but it will be complicated.

We can already use approach 1, and it's simple, I think.
Example implementation is in ./samples/sample-arbitrary-dependency-graph-with-drop-statement.

some sql with `declare` does not generate any references (although it should)

Given a query `child_table.sql`:

declare foo_var int64;

create or replace table dataset.child_table as
select * from dataset.parent_table

Running alphadag against it produces unexpected results, not recognising dataset.parent_table as a precursor to dataset.child_table.

$ alphadag --with_tables child_table.sql
Reading paths passed as a command line arguments...
Only files that end with .sql or .bq are analyzed.
Reading "foo.sql"
digraph G {
0 [label="child_table.sql", shape="", type=query];
1 [label="dataset.child_table", shape=box, type=table];
0->1 ;
}
EXTERNAL REQUIRED TABLES:

Expected results:

digraph G {
0 [label="child_table.sql", shape="", type=query];
1 [label="dataset.child_table", shape=box, type=table];
2 [label="dataset.parent_table", shape=box, type=table];
0->1 ;
2->1 ;
}
EXTERNAL REQUIRED TABLES:
dataset.parent_table

I expected dataset.parent_table to be the precursor to the table created in child_table.sql.

Type checking after topological sorting is not enough

Type checking after topological sorting can't catch all possible problems. We could check all possible execution orders by full search, but that's not efficient, and almost all problems can be caught by a topological sort if the dependency resolution is correct.

Sometimes there are cycles in extracted graph

  • We can prevent cycles by creating dependencies only between table references and create table statements, provided no table is created by multiple statements and serial execution can create the required tables.
  • However, we make INSERT depend on CREATE, UPDATE on INSERT, and other DMLs on UPDATE because this is useful in preprocessing (before v0.5).
  • If needed, we should implement an option that creates dependencies only between table references and create table statements.
