calogica / dbt-expectations
Port(ish) of Great Expectations to dbt test macros
Home Page: https://calogica.github.io/dbt-expectations/
License: Apache License 2.0
Currently only returns 1 if there was a difference.
Where it makes sense, we should check inputs for data type, length, etc. before processing and, if necessary, raise an exception via exceptions.raise_compiler_error.
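A minimal sketch of what such validation could look like (the macro and parameter names here are hypothetical, not the package's actual API):

{%- macro example_validated_test(model, column_name, min_value) -%}
{#- a sketch: raise a compile-time error for bad inputs instead of failing at run time -#}
{%- if min_value is not number -%}
    {{ exceptions.raise_compiler_error("min_value must be a number, got: " ~ min_value) }}
{%- endif -%}
select count(*) from {{ model }} where {{ column_name }} < {{ min_value }}
{%- endmacro -%}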
Hi there,
I've just upgraded to v0.20.0 to do some evaluation and check for any breakages. I'm getting this error when I run dbt deps:
Running with dbt=0.20.0
Encountered an error:
Found duplicate project "dbt_utils". This occurs when a dependency has the same project name as some other dependency.
This is my packages.yml:
packages:
- package: dbt-labs/dbt_utils
version: 0.7.0
- package: calogica/dbt_expectations
version: 0.3.7
If I remove your package I don't get the error.
Running with dbt=0.20.0
Installing dbt-labs/dbt_utils@0.7.0
Installed from version 0.7.0
We use your awesome package extensively. If you could have a look asap, that would be much appreciated.
Hello,
I have updated to dbt 0.19.2 and I'm now getting the following error:
Running with dbt=0.19.2
Encountered an error:
Compilation Error
In dispatch: No macro named 'from_unixtimestamp' found
Searched for: 'dbt_date.redshift__from_unixtimestamp', 'dbt_date.default__from_unixtimestamp'
I've found this commit, calogica/dbt-date@bc5a637, which adds this to dbt-date. However, the currently released version of the dbt-expectations package includes an older version of that package (though it seems to be up to date on main), so I get a version conflict like:
Running with dbt=0.19.2
Encountered an error:
Version error for package calogica/dbt_date: Could not find a satisfactory version from options: ['>=0.3.0', '<0.4.0', '>=0.2.0', '<0.3.0']
I'm not sure how it worked before (or where that function was defined) or what really broke, and I realize that dbt-expectations is probably not the right place to complain about it, but it seems you're in a position to fix it?
It would be great if we could limit the data set. Some tables can be huge, and the tests just take too long to complete within a CI pipeline.
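For example, the row_condition parameter that many of these tests already accept could be used to restrict the scanned rows; a sketch (the column name and filter expression are illustrative):

tests:
  - dbt_expectations.expect_column_values_to_not_be_null:
      row_condition: "created_at >= current_date - 7"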
Hi!
I tried to run dbt-expectations on my Azure SQL DB, and while some tests work out of the box, others raise SQL errors.
One of them is due to the macro expression_is_true.sql. The current macro performs a logic test on {{ expression }} in the select statement, but this is not valid in SQL Server. My current workaround is to change the default test_condition="= true" to test_condition="= 1" and use a case when statement in the select:
with grouped_expression as (

    select
        {% if group_by_columns %}
        {% for group_by_column in group_by_columns -%}
        {{ group_by_column }} as col_{{ loop.index }},
        {% endfor -%}
        {% endif %}
        case when {{ expression }} then 1 else 0 end as expression
    from {{ model }}
    {%- if row_condition %}
    where
        {{ row_condition }}
    {% endif %}
    {% if group_by_columns %}
    group by
        {% for group_by_column in group_by_columns -%}
        {{ group_by_column }}{% if not loop.last %},{% endif %}
        {% endfor %}
    {% endif %}

)
{#- the remainder of the macro is unchanged from the original -#}
I've not done extensive testing, but this seems to work with the few tests I've run so far.
Would it make sense to change the macro for all SQL engines, or to add conditions for SQL Server vs. the rest?
SnowflakeColumn(column='ENTITY', dtype='VARCHAR', char_size=16777216, numeric_precision=None, numeric_scale=None)
column.dtype returns "VARCHAR"
column.data_type returns "character varying(16777216)"
The implementation could be changed to use .dtype instead of .data_type, like so:
{%- macro test_expect_column_values_to_be_of_type(model, column_name, column_type) -%}
{%- if execute -%}
{%- set column_name = column_name | upper -%}
{%- set columns_in_relation = adapter.get_columns_in_relation(model) -%}
{%- set column_type = column_type | upper -%}
{%- set matching_column_types = [] -%}
{%- for column in columns_in_relation -%}
{%- if ((column.name | upper) == column_name) and ((column.dtype | upper) == column_type) -%}
{%- do matching_column_types.append(column.name) -%}
{%- endif -%}
{%- endfor -%}
select 1 - {{ matching_column_types | length }}
{%- endif -%}
{%- endmacro -%}
After the upgrade to 0.4, this expectation fails on Redshift with:
Invalid operation: failed to find conversion function from "unknown" to text;
I think this is because Redshift can't guess the type of ordered_column_names and ordered_relation_column_names. Locally I managed to get it to run by adding ::text after them:
select
count(*) as failures,
count(*) != 0 as should_warn,
count(*) != 0 as should_error
from (
with test_data as (
select
'COL_A, COL_B'::text as ordered_column_names,
'COL_A, COL_B'::text as ordered_relation_column_names
)
select *
from test_data
where ordered_column_names != ordered_relation_column_names
) dbt_internal_test
Hello, I've upgraded to dbt 0.20.0 and dbt-expectations 0.4. After that I started seeing the following error:
Invalid operation: function regexp_matches(character varying, "unknown") does not exist
My understanding of what happened is that #39 was fixed, which replaced regexp_instr with regexp_matches on Postgres. However, because Redshift falls back to the Postgres version of the macro under the new dispatch rules, regexp_matches is now also invoked on Redshift. But Redshift does not have that function (even though it does have the original regexp_instr): https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_INSTR.html
Can we consider a test that would check that all distinct source records are present in a target table? This could confirm that our warehouse isn't missing any ids. I would imagine it might be similar to the expect_table_row_count_to_equal_other_table test, but with a distinct allowed.
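A minimal sketch of what such a test might compile to (table and column names are hypothetical):

-- fail if any distinct id in the source is missing from the target
select count(*) as missing_ids
from (select distinct id from source_table) s
left join target_table t
    on s.id = t.id
where t.id is null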
If strictly: true, then the between and greater tests should use the > and < operators instead of >= and <=.
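Inside the test macros, that could be as simple as picking the operators based on the flag; a sketch (the variable names are illustrative):

{%- set min_operator = ">" if strictly else ">=" -%}
{%- set max_operator = "<" if strictly else "<=" -%}
{{ column_name }} {{ min_operator }} {{ min_value }} and {{ column_name }} {{ max_operator }} {{ max_value }}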
If you add an expect_row_values_to_have_recent_data test to a TIMESTAMP column, you will get a failure driven by two characteristics of the existing implementation.
Given that a master time zone is specified in vars under dbt_project.yml, it seems possible to rewrite these comparisons using the appropriate recasting statements to allow columns of a TIMESTAMP type to be compatible with the expect_row_values_to_have_recent_data test.
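A sketch of such a comparison, normalizing both sides to a single time zone (Snowflake/Redshift-style convert_timezone and dateadd; the var name is an assumption):

where convert_timezone('{{ var("default_timezone", "UTC") }}', timestamp_column)
    >= dateadd(hour, -12, convert_timezone('{{ var("default_timezone", "UTC") }}', current_timestamp))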
After the upgrade to 0.4, this expectation fails with:
Invalid operation: failed to find conversion function from "unknown" to text;
I think it's because of the quotes around number_columns and number_matching_columns; those were plain numbers before, as far as I can tell.
expect_table_row_count_to_equal_other_table should at least support row_condition and compare_row_condition, if not the other parameters of equal_expression:
{%- macro test_expect_table_row_count_to_equal_other_table(model, compare_model, row_condition=None, compare_row_condition=None) -%}
{{ dbt_expectations.test_equal_expression(model, "count(*)", compare_model=compare_model, row_condition=row_condition, compare_row_condition=compare_row_condition) }}
{%- endmacro -%}
I know this issue does not belong here, but it seems that the 0.4.1 version, while released, has not been updated on the dbt hub, which still shows 0.4.0. I have created an issue there, dbt-labs/hub.getdbt.com#755, but nobody responded. Would you know who to contact about fixing this?
regexp_instr doesn't exist on Postgres, so we'll need to implement something using regexp_matches.
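Since regexp_matches is set-returning on Postgres (it drops rows that don't match when used in a select list), a scalar-friendly sketch might lean on the ~ operator instead (an assumption about a possible implementation, not the shipped macro):

-- returns one row per input row, unlike regexp_matches
select case when some_column ~ 'some_pattern' then 1 else 0 end as is_match
from some_table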
With packages.yml:
packages:
Running dbt deps, I get this error:
Running with dbt=0.18.0
Encountered an error:
Got a non-zero returncode running: ['git', 'clone', '--depth', '1', 'git@github.com:calogica/dbt-date.git', '36b63a0bdea5f3e5d232c299b76e1160']
Also, having the dependencies in there directly does not solve the issue...
packages:
However... only having:
packages:
results in dbt deps:
Running with dbt=0.18.0
Installing https://github.com/calogica/[email protected]
Installed from revision 0.2.1
Installing fishtown-analytics/dbt_utils@0.6.2
Installed from version 0.6.2
We should set up GitHub Actions or something similar to allow us to kick off integration tests. Runs could be manually triggered (to avoid BigQuery charges) or automated (at least for Postgres).
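A minimal sketch of what a Postgres job could look like (the workflow layout, service setup, and directory names are assumptions):

name: integration-tests
on: [pull_request]
jobs:
  postgres:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:13
        env:
          POSTGRES_PASSWORD: postgres
        ports:
          - "5432:5432"
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.8'
      - run: pip install dbt
      - working-directory: integration_tests
        run: |
          dbt deps
          dbt seed
          dbt run
          dbt test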
As part of the work in dbt-msft/tsql-utils#44 to make this package compatible with the dbt-msft adapters, I'm stuck trying to shim the first macro.
How can I make this query TSQL- and dbt-expectations-compatible? I'm struggling to grok the intent behind dbt_expectations.test_equal_expression()'s default argument of groupby=['col'].
Here's a long write-up of the steps I went through to discover that get_select() was the problematic macro.
Very much appreciate the work you've done here; we'd like to re-use your integration tests as much as possible.
select
'col' as col_1,
sum(col_numeric_a) as expression
from
"dbt-msft-serverless-db"."integration_tests_azsql_dbt_expectations_integration_tests"."data_test"
group by
'col'
Here's a generic example; I don't understand how this works on Snowflake or BigQuery right now, because it doesn't work on Postgres or TSQL:
WITH CTE AS (
SELECT 1 as col_numeric_a
UNION
SELECT 2 as col_numeric_a
)
SELECT
'col' AS col_1,
sum(col_numeric_a) AS expression
FROM CTE
GROUP BY 'col'
Error: 100046 (22003): Number out of representable range: type FIXEDSB16
The error occurs due to the following behavior in Snowflake:
https://community.snowflake.com/s/article/Number-out-of-representable-range-error-occurs-during-the-multiplication-of-numeric-values
Complete documentation for existing tests
Data warehouse: BigQuery
The test expect_table_columns_to_match_set compares the column list provided in the column_list parameter with the columns in the model. When the list contains an additional column that does not exist in the model, the test still succeeds.
This design misleads developers and testers in one particular scenario, outlined below.
Ideally, the test should compare every column provided in the parameter and fail if even one is missing. The table column count test will reveal the count differences, but this test should also cover this scenario. Thoughts?
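A sketch of a stricter, symmetric comparison that fails on columns missing from either side (the expected_columns and actual_columns CTE names are illustrative; except distinct as on BigQuery):

-- any row returned here indicates a mismatch in either direction
(select column_name from expected_columns
 except distinct
 select column_name from actual_columns)
union all
(select column_name from actual_columns
 except distinct
 select column_name from expected_columns)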
Hi!
I have been running the integration tests, and the following test fails depending on the time of day (before or after midday):
- name: timeseries_hourly_data_extended
columns:
- name: date_hour
tests:
- dbt_expectations.expect_row_values_to_have_recent_data:
datepart: hour
interval: 12
timeseries_hourly_data_extended uses date_spine up to today while excluding it (as described in dbt-utils: "This macro returns the sql required to build a date spine. The spine will include the start_date (if it is aligned to the datepart), but it will not include the end_date").
When running the test at 2 pm, going back 12 hours makes it 2 am, which is still after the last record (since date_spine excludes the end_date), so the test fails.
Would it make sense to change this to 24?
Cheers,
Looks like spark-utils got upgraded to 0.2.0, and this is causing a breaking error in dbt deps because the packages can't find a satisfactory version.
This also applies to the dbt-date package.
The timeseries_hourly_data_extended model for the integration tests does not work as intended on Snowflake.
I think that the final where clause in the macro test_expect_column_distinct_values_to_be_in_set should be
where
s.value_field is null
not
where
v.value_field is null
Running dbt deps with the following packages.yml:
packages:
- git: "https://github.com/calogica/dbt-expectations.git"
revision: 0.2.0
- package: fishtown-analytics/dbt_utils
version: 0.5.1
I get:
Running with dbt=0.17.0
Encountered an error:
Version error for package fishtown-analytics/dbt_utils: Could not find a satisfactory version from options: ['=0.5.1', '>=0.6.0', '<0.7.0']
It's confusing, since I'm actually running with dbt_utils version 0.5.1. Is this expected?
We should support a mostly parameter in expression tests: a float that controls what percentage of rows need to satisfy the given expression test. The default is 100%.
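A sketch of the underlying logic, assuming a mostly value between 0 and 1:

select
    case
        {# fail (1) unless at least `mostly` of the rows satisfy the expression #}
        when sum(case when {{ expression }} then 1 else 0 end) * 1.0 / count(*) >= {{ mostly }} then 0
        else 1
    end as failures
from {{ model }}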
This currently sets group_by to None
The tests are currently only integration tested on BigQuery. We need to test on Postgres and Snowflake at least, and ideally also on Redshift.
If needed, this should spawn additional issues to turn certain macros into adapter macros for better cross-platform compatibility.
We should compare the macro parameters for all expectations to make sure they match (as much as applicable) their Python equivalents. E.g., expect_column_values_to_be_between does not implement the strict_min, strict_max, or mostly parameters.
Hi @clausherther !
I'm one of the maintainers of Great Expectations. I'm excited that you're working on bringing dbt and GE functionality together. Lots of teams use both tools, and we're very supportive of even closer collaboration.
Looking at your implementation, I can see several places where we can probably save you some work, but the right way to implement will depend on exactly what you're trying to accomplish. We'd be happy to have someone on the GE team collaborate to help.
Are you up for a call to discuss?
expression_difference_percent
abs(coalesce(a.expression, 0) - coalesce(b.expression, 0))/
nullif(a.expression, 0) as expression_difference_percent
should be
abs(coalesce(a.expression, 0) - coalesce(b.expression, 0))/
nullif(a.expression * 1.0, 0) as expression_difference_percent
or a more explicit cast; otherwise the result is 0 on platforms where integer division doesn't produce a decimal.
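An explicit-cast variant could use the cross-database type macro from dbt-utils; a sketch:

abs(coalesce(a.expression, 0) - coalesce(b.expression, 0)) /
nullif(cast(a.expression as {{ dbt_utils.type_float() }}), 0) as expression_difference_percent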
Hi folks,
I am using the following test: expect_column_values_to_be_in_set. When I include null inside the value_set, it keeps throwing errors. I am working with Redshift.
expect_column_values_to_be_between should use the "and" operator instead of "or"
{% set expression %}
{{ column_name ~ " >= " ~ min_value ~ " or " ~
column_name ~ " <= " ~ max_value }}
{% endset %}
Tests that involve the info schema only work on tables and views
We should add the ability to check for recent data at a grouping level via a group_by parameter.
For example, a user might expect recent rows for every product in a fact table. So, a recency check should look at the grouped rows for each product_id that does not meet the recency requirement, e.g.:
with latest_grouped_timestamps as (
select
product_id,
max(timestamp_column) as latest_timestamp_column
from
fact_table
group by
1
),
validation_errors as (
select *
from
latest_grouped_timestamps
where
latest_timestamp_column < {{ threshold_timestamp }}
)
select count(*) from validation_errors
Name: expect_grouped_row_values_to_have_recent_data?
Note: this would only cover grouped values that already exist in the model being tested. So, for example, products that have never sold would not be checked. That would require a left join to a dimension table and is currently out of scope for this test.
With my packages like:
packages:
- package: calogica/dbt_expectations
version: [">=0.3.0", "<0.4.0"]
- package: fishtown-analytics/codegen
version: 0.3.2
- package: fivetran/fivetran_log
version: 0.3.1
- package: fishtown-analytics/audit_helper
version: 0.3.0
I'm getting this error:
Running with dbt=0.19.1
Encountered an error:
Found duplicate project "dbt_utils". This occurs when a dependency has the same project name as some other dependency.
But if we pin the version to 0.3.4, everything works fine.
We should list each parameter in the example docs and indicate whether it's required or optional.
Example:
tests:
- dbt_expectations.expect_column_stdev_to_be_between:
# either min_value or max_value is required
min_value: 0 <optional>
max_value: 2 <optional>
group_by: [col1, col2] <optional>
strictly: True <optional>
Hey Claus! Planning to open this same issue in the dbt-date package momentarily :)
We've made a change to adapter.dispatch: instead of passing a _get_namespaces() macro to the packages arg, you should pass the name of this package (as a string) to the macro_namespace arg, which is now the second positional argument.
In practice, this just looks like switching each macro like (dbt-expectations/macros/math/log_natural.sql, lines 1 to 3 at 730d195):
{% macro log_natural(x) -%}
{{ adapter.dispatch('log_natural', packages=dbt_expectations._get_namespaces()) (x) }}
{% endmacro %}
To:
{% macro log_natural(x) -%}
{{ adapter.dispatch('log_natural', 'dbt_expectations') (x) }}
{% endmacro %}
I hope this could be as simple as Find + Replace for packages=dbt_expectations._get_namespaces() → 'dbt_expectations'.
If a user wishes to override/shim this package, instead of defining a var named dbt_expectations_dispatch_list (or dbt_utils_dispatch_list), they should now define a config in dbt_project.yml, for instance:
dispatch:
- macro_namespace: dbt_expectations
search_order: ['my_project', 'dbt_expectations'] # enable override
- macro_namespace: dbt_utils
search_order: ['spark_utils', 'dbt_utils'] # shim for spark
This change is in dbt v0.19.2 as well. Both v0.19.2 and v0.20.0 have backwards compatibility for the old syntax, so there's no huge rush to upgrade. As soon as you do upgrade to the new syntax, you'll need to require dbt >= 0.19.2 (or just >= 0.20.0, for simplicity, since you're already making compatibility changes in #69).
Hi,
I'm looking at the source code for expect_column_values_to_be_in_set, and it doesn't look like it's validating the fields at all.
validation_errors as (
-- values from the model that are not in the set
select
v.value_field
from
all_values v
left join
set_values s on v.value_field = s.value_field
where
v.value_field is null
)
I think that to validate for errors, either v.value_field needs to be s.value_field, or the left join needs to be a right join. I can submit a pull request this week, but thought I'd let you know first, as it might affect a lot of tests.
Integration tests currently fail on a local Postgres instance:
Database Error in model timeseries_data (models/schema_tests/timeseries_data.sql)
function rand() does not exist
LINE 170: rand()
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
compiled SQL at target/run/dbt_expectations_integration_tests/models/schema_tests/timeseries_data.sql
Database Error in model timeseries_hourly_data_extended (models/schema_tests/timeseries_hourly_data_extended.sql)
type "datetime" does not exist
LINE 197: cast(d.date_hour as datetime) as date_hour,
^
compiled SQL at target/run/dbt_expectations_integration_tests/models/schema_tests/timeseries_hourly_data_extended.sql
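One possible fix is to wrap such functions in adapter macros so Postgres gets its own implementation; a minimal sketch for rand(), using the dispatch syntax discussed above (the macro names are illustrative):

{% macro rand() -%}
    {{ adapter.dispatch('rand', 'dbt_expectations')() }}
{%- endmacro %}

{% macro default__rand() -%}
    rand()
{%- endmacro %}

{% macro postgres__rand() -%}
    random()
{%- endmacro %}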
We need to rewrite expectations to return failing rows, not counts.
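For an expression-style test, that could look something like this sketch:

-- return the offending rows themselves so they can be inspected or stored
select *
from {{ model }}
where not ({{ expression }})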
Hi,
I wanted to reference a view:
models:
resulting in:
Failure in test dbt_expectations_expect_table_column_count_to_equal_other_table_my_table_ref_my_table_view_ (models/my_first_folder/schema.yml)
Got 7 results, expected 0.
Are the tests only working with tables?
Convert test macros to use expression_is_true where possible.