greatexpectationslabs / ge_tutorials
Learn how to add data validation and documentation to a data pipeline built with dbt and Airflow.
Hello,
From the Great Expectations website here: https://docs.greatexpectations.io/en/stable/guides/tutorials/getting_started/initialize_a_data_context.html
I'm trying to run these steps, but it seems that the subdirectory "ge_getting_started_tutorial" does not exist?
git clone https://github.com/superconductive/ge_tutorials
cd ge_tutorials/ge_getting_started_tutorial
docker-compose up --detach
Hi there,
Excited to see this tutorial, as it's something we've struggled with in the past.
I've tried mixing Airflow, dbt, and GE before. This approach has two issues:
Here's the approach we've taken:
Could you please validate our approach? Are we using GE the way it was designed to be used?
Hi,
Trying out the demo, and I'm running into a problem when executing the following (by the way, the airflow test command required me to add an execution date for the dag and task):
airflow test ge_tutorials_dag task_validate_source_data 2020-03-09
This command gives me the following error:
File "/home/kris/anaconda3/lib/python3.7/site-packages/great_expectations/data_context/util.py", line 168, in substitute_config_variable
raise InvalidConfigError("Unable to find match for config variable {:s}. See https://great-expectations.readthedocs.io/en/latest/reference/data_context_reference.html#managing-environment-and-secrets".format(match.group(1)))
great_expectations.exceptions.InvalidConfigError: Unable to find match for config variable warehouse. See https://great-expectations.readthedocs.io/en/latest/reference/data_context_reference.html#managing-environment-and-secrets
I'm using MariaDB from a fresh install on my local machine (user root, no password), and my .dbt profile is:
$ cat ~/.dbt/profiles.yml
# For more information on how to configure this file, please see:
# https://docs.getdbt.com/docs/profile
default:
  outputs:
    dev:
      type: mysql
      threads: 1
      host: 127.0.0.1
      port: 3306
      user: root
      pass:
      dbname: warehouse
and I do have env vars set up:
GE_TUTORIAL_DB_URL=mysql://root@localhost:3306/ge #also tried warehouse as db name instead of ge
GE_TUTORIAL_PROJECT_PATH=/home/kris/projects/ge_demo/ge_tutorials
(and Airflow is set up correctly, because I could see ge_tutorials_dag in the Airflow UI)
Sorry, I don't have time to investigate further; I just wanted to give feedback on your awesome tutorial. Maybe it also helps someone else if they hit this snag.
Thanks and keep up your awesome stuff :D
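For reference, that InvalidConfigError means Great Expectations could not substitute the `warehouse` config variable referenced in `great_expectations.yml`. One way to supply it is through the project's config variables file; a minimal sketch, assuming the variable is meant to hold the warehouse connection URL (the value below is hypothetical):

```yaml
# great_expectations/uncommitted/config_variables.yml
warehouse: mysql://root@localhost:3306/warehouse   # hypothetical; match your actual database URL
```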
Out of the box, there appear to be several issues with running the Airflow dbt examples using Docker:

- dbt now has to be installed via `pip install dbt-core` or `pip install dbt-<connector>`, e.g., `pip install dbt-postgres`. This applies both to the Dockerfile and to requirements.txt.
- `dbt-postgres<1.0.0`, `wtforms==2.3.3`, and `werkzeug<1.0.0` need to be pinned in requirements.txt, which is required by Airflow 1.10.9.
- `dbt_project.yml` is missing the `config-version: 2` setting, which prevents DAGs from executing.
- `airflow/ge_tutorials_dag_with_great_expectations.py` fails to parse (see the traceback below), though it looks like #16 addresses this:

webserver_1 | Traceback (most recent call last):
webserver_1 | File "/usr/local/lib/python3.7/site-packages/airflow/models/dagbag.py", line 243, in process_file
webserver_1 | m = imp.load_source(mod_name, filepath)
webserver_1 | File "/usr/local/lib/python3.7/imp.py", line 171, in load_source
webserver_1 | module = _load(spec)
webserver_1 | File "<frozen importlib._bootstrap>", line 696, in _load
webserver_1 | File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
webserver_1 | File "<frozen importlib._bootstrap_external>", line 724, in exec_module
webserver_1 | File "<frozen importlib._bootstrap_external>", line 860, in get_code
webserver_1 | File "<frozen importlib._bootstrap_external>", line 791, in source_to_code
webserver_1 | File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
webserver_1 | File "/usr/local/airflow/dags/ge_tutorials_dag_with_great_expectations.py", line 21
webserver_1 | "owner":` "Airflow",
webserver_1 | ^
webserver_1 | SyntaxError: invalid syntax
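The caret in that traceback points at a stray backtick after `"owner":` in the DAG file's default arguments. A minimal sketch of the corrected line, with the surrounding keys omitted:

```python
# dags/ge_tutorials_dag_with_great_expectations.py (excerpt)
default_args = {
    "owner": "Airflow",  # stray backtick removed so the dict literal parses
}
```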
I'm trying to import the `ExpectColumnMaxToBeBetweenCustom` Expectation from this tutorial, but I get `No module named 'custom_module'`.
I have copied the file from the Complete Example into the `great_expectations/plugins` directory.
For the suggested line `from custom_module import ExpectColumnMaxToBeBetweenCustom` to work, there are a few more steps required:

- the `column_custom_max_expectation.py` file should be in `great_expectations/plugins/custom_module` instead of `great_expectations/plugins`
- in `great_expectations/plugins/custom_module`, there should be an `__init__.py` file
- in the `__init__.py` file, there should be the line `from .column_custom_max_expectation import ExpectColumnMaxToBeBetweenCustom`
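Putting those steps together, the `__init__.py` contains just the re-export (paths exactly as described above):

```python
# great_expectations/plugins/custom_module/__init__.py
# Re-export the custom Expectation so that
# `from custom_module import ExpectColumnMaxToBeBetweenCustom` resolves.
from .column_custom_max_expectation import ExpectColumnMaxToBeBetweenCustom
```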
Hey folks, I'm trying to use this repo, specifically the Airflow examples, for a PoC, but I keep getting the following error:
webserver_1 | [2021-11-09 15:56:44,142] {{settings.py:253}} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=16
webserver_1 | Traceback (most recent call last):
webserver_1 | File "/usr/local/bin/airflow", line 26, in <module>
webserver_1 | from airflow.bin.cli import CLIFactory
webserver_1 | File "/usr/local/lib/python3.7/site-packages/airflow/bin/cli.py", line 70, in <module>
webserver_1 | from airflow.www.app import (cached_app, create_app)
webserver_1 | File "/usr/local/lib/python3.7/site-packages/airflow/www/app.py", line 37, in <module>
webserver_1 | from airflow.www.blueprints import routes
webserver_1 | File "/usr/local/lib/python3.7/site-packages/airflow/www/blueprints.py", line 25, in <module>
webserver_1 | from airflow.www import utils as wwwutils
webserver_1 | File "/usr/local/lib/python3.7/site-packages/airflow/www/utils.py", line 35, in <module>
webserver_1 | from wtforms.compat import text_type
webserver_1 | ModuleNotFoundError: No module named 'wtforms.compat'
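The `wtforms.compat` module was removed in WTForms 3.x, which breaks Airflow 1.10.x imports like the one in the traceback; as noted in the Docker issue list above, pinning the affected packages in requirements.txt appears to resolve this. A sketch of the pins, with versions taken from that list:

```
# requirements.txt (excerpt) — pins for Airflow 1.10.9 compatibility
wtforms==2.3.3
werkzeug<1.0.0
dbt-postgres<1.0.0
```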
When I run:
context.run_checkpoint(checkpoint_name=my_checkpoint_name)
context.open_data_docs()
I get this error:
FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\johnarmstrong\OneDrive - Corus Entertainment Inc\Documents\ge_tutorials\great_expectations\uncommitted/validations/getting_started_expectation_suite_taxi\demo\20220118-150433-my-run-name-template\20220118T150433.457469Z\444fa93fe34e9e162c5f910bca5b5916.json'
Nice to have the Docker tutorial for convenience! However, the Dockerfile has probably gotten out of date, since I got an error from the first task of the DAG with GE:
[2020-06-27 11:21:35,173] {{logging_mixin.py:112}} INFO - Running %s on host %s <TaskInstance: ge_tutorials_dag_with_ge.task_validate_source_data 2020-06-27T11:21:27.511382+00:00 [running]> 7617b740a5b3
[2020-06-27 11:21:35,247] {{taskinstance.py:1128}} ERROR - You appear to have an invalid config version (1.0).
The version number must be at least 2. Please see the migration guide at https://docs.greatexpectations.io/en/latest/how_to_guides/migrating_versions.html
I think it would be a good idea to dockerize the setup, which would make it easier to get started and run the repo. Right now it requires a bunch of different steps and configs to be set up, all of which could be handled inside a docker-compose file. If this is something that is of interest to you, I can work on it.
The idea would be to run the repo in one container and a Postgres database in another container, which Airflow, GE, and dbt can connect to at a static address; see the sketch below.
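A minimal sketch of what that compose file might look like; the service names, image tag, and credentials here are hypothetical, and the two `GE_TUTORIAL_*` variables are the ones this repo already reads:

```yaml
# docker-compose.yml (hypothetical sketch)
version: "3"
services:
  warehouse:                      # Postgres reachable by the other services at a static address
    image: postgres:12
    environment:
      POSTGRES_USER: ge_tutorials
      POSTGRES_PASSWORD: ge_tutorials
      POSTGRES_DB: warehouse
  pipeline:                       # Airflow + dbt + Great Expectations from this repo
    build: .
    environment:
      GE_TUTORIAL_DB_URL: postgresql://ge_tutorials:ge_tutorials@warehouse:5432/warehouse
      GE_TUTORIAL_PROJECT_PATH: /usr/local/airflow   # hypothetical path inside the container
    depends_on:
      - warehouse
```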
I tried to run the docker-compose file for this tutorial, but it errors on this task.
Which dbt and Great Expectations versions are you using for this tutorial?
Thank you very much
Reading local file: /usr/local/airflow/logs/ge_tutorials_dag_with_ge/task_transform_data_in_db/2021-05-17T17:02:07.041650+00:00/1.log
[2021-05-17 17:02:35,023] {{taskinstance.py:655}} INFO - Dependencies all met for <TaskInstance: ge_tutorials_dag_with_ge.task_transform_data_in_db 2021-05-17T17:02:07.041650+00:00 [queued]>
[2021-05-17 17:02:35,037] {{taskinstance.py:655}} INFO - Dependencies all met for <TaskInstance: ge_tutorials_dag_with_ge.task_transform_data_in_db 2021-05-17T17:02:07.041650+00:00 [queued]>
[2021-05-17 17:02:35,037] {{taskinstance.py:866}} INFO -
--------------------------------------------------------------------------------
[2021-05-17 17:02:35,037] {{taskinstance.py:867}} INFO - Starting attempt 1 of 1
[2021-05-17 17:02:35,037] {{taskinstance.py:868}} INFO -
--------------------------------------------------------------------------------
[2021-05-17 17:02:35,046] {{taskinstance.py:887}} INFO - Executing <Task(BashOperator): task_transform_data_in_db> on 2021-05-17T17:02:07.041650+00:00
[2021-05-17 17:02:35,049] {{standard_task_runner.py:53}} INFO - Started process 855 to run task
[2021-05-17 17:02:35,090] {{logging_mixin.py:112}} INFO - Running %s on host %s <TaskInstance: ge_tutorials_dag_with_ge.task_transform_data_in_db 2021-05-17T17:02:07.041650+00:00 [running]> e031efb82af6
[2021-05-17 17:02:35,106] {{bash_operator.py:82}} INFO - Tmp dir root location:
/tmp
[2021-05-17 17:02:35,107] {{bash_operator.py:105}} INFO - Temporary script location: /tmp/airflowtmp0lt4ylbu/task_transform_data_in_dboeetxzin
[2021-05-17 17:02:35,107] {{bash_operator.py:115}} INFO - Running command: dbt run --project-dir /usr/local/airflow/dbt
[2021-05-17 17:02:35,112] {{bash_operator.py:122}} INFO - Output:
[2021-05-17 17:02:37,650] {{bash_operator.py:126}} INFO - Running with dbt=0.19.1
[2021-05-17 17:02:37,660] {{bash_operator.py:126}} INFO - Encountered an error while reading the project:
[2021-05-17 17:02:37,661] {{bash_operator.py:126}} INFO - ERROR: Runtime Error
[2021-05-17 17:02:37,661] {{bash_operator.py:126}} INFO - Invalid config version: 1, expected 2
[2021-05-17 17:02:37,661] {{bash_operator.py:126}} INFO -
[2021-05-17 17:02:37,661] {{bash_operator.py:126}} INFO - Error encountered in /usr/local/airflow/dbt/dbt_project.yml
[2021-05-17 17:02:37,665] {{bash_operator.py:126}} INFO - Encountered an error:
[2021-05-17 17:02:37,666] {{bash_operator.py:126}} INFO - Runtime Error
[2021-05-17 17:02:37,666] {{bash_operator.py:126}} INFO - Could not run dbt
[2021-05-17 17:02:37,754] {{bash_operator.py:130}} INFO - Command exited with return code 2
[2021-05-17 17:02:37,761] {{taskinstance.py:1128}} ERROR - Bash command failed
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 966, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/bash_operator.py", line 134, in execute
raise AirflowException("Bash command failed")
airflow.exceptions.AirflowException: Bash command failed
[2021-05-17 17:02:37,763] {{taskinstance.py:1185}} INFO - Marking task as FAILED.dag_id=ge_tutorials_dag_with_ge, task_id=task_transform_data_in_db, execution_date=20210517T170207, start_date=20210517T170235, end_date=20210517T170237
[2021-05-17 17:02:45,018] {{logging_mixin.py:112}} INFO - [2021-05-17 17:02:45,017] {{local_task_job.py:103}} INFO - Task exited with return code 1
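The root cause is in the log: dbt 0.19.1 rejects the project config with "Invalid config version: 1, expected 2" for `/usr/local/airflow/dbt/dbt_project.yml`. As noted in the Docker issue list above, adding the `config-version` setting should fix it. A minimal sketch of the change, with all other keys left unchanged:

```yaml
# dbt/dbt_project.yml (excerpt)
config-version: 2   # newer dbt releases reject the implicit version-1 config
```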
The Airflow (GE) DAG is unable to execute successfully because the configs appear to be incompatible with the version of GE that is being installed within the container. See the error below:
*** Reading local file: /usr/local/airflow/logs/ge_tutorials_dag_with_ge/task_validate_source_data/2020-07-08T14:23:36.810518+00:00/1.log
[2020-07-08 14:23:40,135] {{taskinstance.py:655}} INFO - Dependencies all met for <TaskInstance: ge_tutorials_dag_with_ge.task_validate_source_data 2020-07-08T14:23:36.810518+00:00 [queued]>
[2020-07-08 14:23:40,153] {{taskinstance.py:655}} INFO - Dependencies all met for <TaskInstance: ge_tutorials_dag_with_ge.task_validate_source_data 2020-07-08T14:23:36.810518+00:00 [queued]>
[2020-07-08 14:23:40,153] {{taskinstance.py:866}} INFO -
--------------------------------------------------------------------------------
[2020-07-08 14:23:40,153] {{taskinstance.py:867}} INFO - Starting attempt 1 of 1
[2020-07-08 14:23:40,153] {{taskinstance.py:868}} INFO -
--------------------------------------------------------------------------------
[2020-07-08 14:23:40,164] {{taskinstance.py:887}} INFO - Executing <Task(PythonOperator): task_validate_source_data> on 2020-07-08T14:23:36.810518+00:00
[2020-07-08 14:23:40,168] {{standard_task_runner.py:53}} INFO - Started process 1006 to run task
[2020-07-08 14:23:40,211] {{logging_mixin.py:112}} INFO - Running %s on host %s <TaskInstance: ge_tutorials_dag_with_ge.task_validate_source_data 2020-07-08T14:23:36.810518+00:00 [running]> 26f657a26611
[2020-07-08 14:23:40,252] {{taskinstance.py:1128}} ERROR - You appear to have an invalid config version (1.0).
The version number must be at least 2. Please see the migration guide at https://docs.greatexpectations.io/en/latest/how_to_guides/migrating_versions.html
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 966, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/usr/local/airflow/dags/ge_tutorials_dag_with_great_expectations.py", line 70, in validate_source_data
context = ge.data_context.DataContext(great_expectations_context_path)
File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context.py", line 2148, in __init__
project_config = self._load_project_config()
File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context.py", line 2186, in _load_project_config
return DataContextConfig.from_commented_map(config_dict)
File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/types/base.py", line 86, in from_commented_map
config = dataContextConfigSchema.load(commented_map)
File "/usr/local/lib/python3.7/site-packages/marshmallow/schema.py", line 723, in load
data, many=many, partial=partial, unknown=unknown, postprocess=True
File "/usr/local/lib/python3.7/site-packages/marshmallow/schema.py", line 886, in _do_load
field_errors=field_errors,
File "/usr/local/lib/python3.7/site-packages/marshmallow/schema.py", line 1189, in _invoke_schema_validators
partial=partial,
File "/usr/local/lib/python3.7/site-packages/marshmallow/schema.py", line 774, in _run_validator
validator_func(output, partial=partial, many=many)
File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/types/base.py", line 298, in validate_schema
data["config_version"], MINIMUM_SUPPORTED_CONFIG_VERSION
great_expectations.exceptions.UnsupportedConfigVersionError: You appear to have an invalid config version (1.0).
The version number must be at least 2. Please see the migration guide at https://docs.greatexpectations.io/en/latest/how_to_guides/migrating_versions.html
[2020-07-08 14:23:40,254] {{taskinstance.py:1185}} INFO - Marking task as FAILED.dag_id=ge_tutorials_dag_with_ge, task_id=task_validate_source_data, execution_date=20200708T142336, start_date=20200708T142340, end_date=20200708T142340
[2020-07-08 14:23:50,113] {{logging_mixin.py:112}} INFO - [2020-07-08 14:23:50,113] {{local_task_job.py:103}} INFO - Task exited with return code 1
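Per the migration guide linked in the error, the project's `great_expectations.yml` still declares config version 1.0, while the GE release installed in the container requires at least 2. A minimal sketch of the change; note that bumping the number alone may not be enough, since the v1-to-v2 migration can touch other keys, so the linked guide is the authority here:

```yaml
# great_expectations/great_expectations.yml (excerpt)
config_version: 2   # was 1.0; the installed GE release requires at least 2
```

Alternatively, pinning an older great_expectations release in requirements.txt would keep the v1 config working.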