calogica / dbt-expectations
Port(ish) of Great Expectations to dbt test macros
Home Page: https://calogica.github.io/dbt-expectations/
License: Apache License 2.0
Currently only returns 1 if there was a difference.
Where it makes sense, we should check inputs for data type, length, etc. before processing and, if necessary, raise an exception via exceptions.raise_compiler_error.
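A minimal sketch of what such validation could look like (the macro and parameter names here are hypothetical, not the package's actual API):

{%- macro example_validated_test(model, column_name, min_value) -%}
{#- a sketch: raise a compile-time error for bad inputs instead of failing at run time -#}
{%- if min_value is not number -%}
    {{ exceptions.raise_compiler_error("min_value must be a number, got: " ~ min_value) }}
{%- endif -%}
select count(*) from {{ model }} where {{ column_name }} < {{ min_value }}
{%- endmacro -%}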
Hi there,
I've just upgraded to v0.20.0 to do some evaluation and check for any breakages. I'm getting this error when I run dbt deps:
Running with dbt=0.20.0
Encountered an error:
Found duplicate project "dbt_utils". This occurs when a dependency has the same project name as some other dependency.
This is my packages.yml:
packages:
- package: dbt-labs/dbt_utils
version: 0.7.0
- package: calogica/dbt_expectations
version: 0.3.7
If I remove your package I don't get the error.
Running with dbt=0.20.0
Installing dbt-labs/dbt_utils@0.7.0
Installed from version 0.7.0
We use your awesome package extensively. If you could have a look asap, that would be much appreciated.
Hello,
I have updated to dbt 0.19.2 and I'm now getting the following error:
Running with dbt=0.19.2
Encountered an error:
Compilation Error
In dispatch: No macro named 'from_unixtimestamp' found
Searched for: 'dbt_date.redshift__from_unixtimestamp', 'dbt_date.default__from_unixtimestamp'
I've found this commit, calogica/dbt-date@bc5a637, which adds this to dbt-date. However, the currently released version of the dbt-expectations package includes an older version of that package (though it seems to be up to date on main), so I get a version conflict like:
Running with dbt=0.19.2
Encountered an error:
Version error for package calogica/dbt_date: Could not find a satisfactory version from options: ['>=0.3.0', '<0.4.0', '>=0.2.0', '<0.3.0']
I'm not sure how it worked before (or where that function was defined) or what really broke, and I realize that dbt-expectations is probably not the right place to complain about it, but it seems you're in a position to fix it?
It would be great if we could limit the data set. Some tables can be huge, and the tests just take too long to complete within a CI pipeline.
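For example, the row_condition parameter that many of these tests already accept could be used to restrict the scanned rows; a sketch (the column name and filter expression are illustrative):

tests:
  - dbt_expectations.expect_column_values_to_not_be_null:
      row_condition: "created_at >= current_date - 7"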
Hi!
I tried to run dbt-expectations on my Azure SQL DB, and while some tests work out of the box, others raise SQL errors.
One of them is due to the macro expression_is_true.sql. The current macro performs a logic test on {{ expression }} in the select statement, but this is not valid in SQL Server. My current workaround is to change the default test_condition="= true" to test_condition="= 1" and use a case when statement in the select:
with grouped_expression as (

    select
        {% if group_by_columns %}
        {% for group_by_column in group_by_columns -%}
        {{ group_by_column }} as col_{{ loop.index }},
        {% endfor -%}
        {% endif %}
        case when {{ expression }} then 1 else 0 end as expression
    from {{ model }}
    {%- if row_condition %}
    where
        {{ row_condition }}
    {% endif %}
    {% if group_by_columns %}
    group by
        {% for group_by_column in group_by_columns -%}
        {{ group_by_column }}{% if not loop.last %},{% endif %}
        {% endfor %}
    {% endif %}

)
{#- the remainder of the macro is unchanged from the original -#}
I've not done extensive testing, but this seems to work with the few tests I've run so far.
Would it make sense to change the macro for all SQL engines, or to add conditions for SQL Server vs. the rest?
SnowflakeColumn(column='ENTITY', dtype='VARCHAR', char_size=16777216, numeric_precision=None, numeric_scale=None)
column.dtype returns "VARCHAR"
column.data_type returns "character varying(16777216)"
The implementation could be changed to use .dtype instead of .data_type, like so:
{%- macro test_expect_column_values_to_be_of_type(model, column_name, column_type) -%}
{%- if execute -%}
{%- set column_name = column_name | upper -%}
{%- set columns_in_relation = adapter.get_columns_in_relation(model) -%}
{%- set column_type = column_type | upper -%}
{%- set matching_column_types = [] -%}
{%- for column in columns_in_relation -%}
{%- if ((column.name | upper) == column_name) and ((column.dtype | upper) == column_type) -%}
{%- do matching_column_types.append(column.name) -%}
{%- endif -%}
{%- endfor -%}
select 1 - {{ matching_column_types | length }}
{%- endif -%}
{%- endmacro -%}
After the upgrade to 0.4, this expectation fails on Redshift with:
Invalid operation: failed to find conversion function from "unknown" to text;
I think this is because Redshift can't guess the type of ordered_column_names and ordered_relation_column_names. Locally I managed to get it to run by adding ::text after them:
select
count(*) as failures,
count(*) != 0 as should_warn,
count(*) != 0 as should_error
from (
with test_data as (
select
'COL_A, COL_B'::text as ordered_column_names,
'COL_A, COL_B'::text as ordered_relation_column_names
)
select *
from test_data
where ordered_column_names != ordered_relation_column_names
) dbt_internal_test
Hello, I've upgraded to dbt 0.20.0 and dbt-expectations 0.4. After that I started seeing the following error:
Invalid operation: function regexp_matches(character varying, "unknown") does not exist
My understanding of what happened is that #39 was fixed, which replaced regexp_instr with regexp_matches on Postgres. However, because Redshift falls back to the Postgres version of the macro under the new dispatch rules, regexp_matches is now also invoked on Redshift. But Redshift does not have that function (even though it does have the original regexp_instr): https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_INSTR.html
Can we consider a test that would check that all distinct source records are present in a target table? This could confirm that our warehouse isn't missing any ids. I would imagine it might be similar to the expect_table_row_count_to_equal_other_table test, but with a distinct allowed.
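A minimal sketch of what such a test might compile to (table and column names are hypothetical):

-- fail if any distinct id in the source is missing from the target
select count(*) as missing_ids
from (select distinct id from source_table) s
left join target_table t
    on s.id = t.id
where t.id is null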
If strictly: true, then the between and greater tests should use the > and < operators instead of >= and <=.
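Inside the test macros, that could be as simple as picking the operators based on the flag; a sketch (the variable names are illustrative):

{%- set min_operator = ">" if strictly else ">=" -%}
{%- set max_operator = "<" if strictly else "<=" -%}
{{ column_name }} {{ min_operator }} {{ min_value }} and {{ column_name }} {{ max_operator }} {{ max_value }}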
If you add an expect_row_values_to_have_recent_data test to a TIMESTAMP column, you will get a failure driven by two characteristics of the existing implementation.
Given that a master time zone is specified in vars under dbt_project.yml, it seems possible to rewrite these comparisons using the appropriate recasting statements to allow columns of a TIMESTAMP type to be compatible with the expect_row_values_to_have_recent_data test.
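A sketch of such a comparison, normalizing both sides to a single time zone (Snowflake/Redshift-style convert_timezone and dateadd; the var name is an assumption):

where convert_timezone('{{ var("default_timezone", "UTC") }}', timestamp_column)
    >= dateadd(hour, -12, convert_timezone('{{ var("default_timezone", "UTC") }}', current_timestamp))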
After the upgrade to 0.4, this expectation fails with:
Invalid operation: failed to find conversion function from "unknown" to text;
I think it's because of the quotes around number_columns and number_matching_columns; those were plain numbers before, as far as I can tell.
expect_table_row_count_to_equal_other_table should at least support row_condition and compare_row_condition, if not the other parameters of equal_expression:
{%- macro test_expect_table_row_count_to_equal_other_table(model, compare_model, row_condition=None, compare_row_condition=None) -%}
{{ dbt_expectations.test_equal_expression(model, "count(*)", compare_model=compare_model, row_condition=row_condition, compare_row_condition=compare_row_condition) }}
{%- endmacro -%}
I know this issue does not belong here, but it seems that the 0.4.1 version, while released, has not been updated on the dbt hub, which still shows 0.4.0. I have created an issue there, dbt-labs/hub.getdbt.com#755, but nobody responded. Would you know who to contact about fixing this?
regexp_instr doesn't exist on Postgres, so we'll need to implement something using regexp_matches.
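Since regexp_matches is set-returning on Postgres (it drops rows that don't match when used in a select list), a scalar-friendly sketch might lean on the ~ operator instead (an assumption about a possible implementation, not the shipped macro):

-- returns one row per input row, unlike regexp_matches
select case when some_column ~ 'some_pattern' then 1 else 0 end as is_match
from some_table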
With packages.yml:
packages:
Running dbt deps, I get this error:
Running with dbt=0.18.0
Encountered an error:
Got a non-zero returncode running: ['git', 'clone', '--depth', '1', 'git@github.com:calogica/dbt-date.git', '36b63a0bdea5f3e5d232c299b76e1160']
Also, having the dependencies in there directly does not solve the issue...
packages:
However... only having:
packages:
results in dbt deps:
Running with dbt=0.18.0
Installing https://github.com/calogica/[email protected]
Installed from revision 0.2.1
Installing fishtown-analytics/dbt_utils@0.6.2
Installed from version 0.6.2
We should set up GitHub Actions or something similar to allow us to kick off integration tests. Runs could be manually triggered (to avoid BigQuery charges) or automated (at least for Postgres).
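A minimal sketch of what a Postgres job could look like (the workflow layout, service setup, and directory names are assumptions):

name: integration-tests
on: [pull_request]
jobs:
  postgres:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:13
        env:
          POSTGRES_PASSWORD: postgres
        ports:
          - "5432:5432"
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.8'
      - run: pip install dbt
      - working-directory: integration_tests
        run: |
          dbt deps
          dbt seed
          dbt run
          dbt test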
As part of the work in dbt-msft/tsql-utils#44 to make this package compatible with the dbt-msft adapters, I'm stuck trying to shim the first macro.
How can I make this query TSQL- and dbt-expectations-compatible? I'm struggling to grok the intent behind dbt_expectations.test_equal_expression()'s default argument of groupby=['col'].
Here's a long write-up of the steps I went through to discover that get_select() was the problematic macro.
Very much appreciate the work you've done here; we'd like to re-use your integration tests as much as possible.
select
'col' as col_1,
sum(col_numeric_a) as expression
from
"dbt-msft-serverless-db"."integration_tests_azsql_dbt_expectations_integration_tests"."data_test"
group by
'col'
Here's a generic example; I don't understand how this works on Snowflake or BigQuery right now, because it doesn't work on Postgres or TSQL:
WITH CTE AS (
SELECT 1 as col_numeric_a
UNION
SELECT 2 as col_numeric_a
)
SELECT
'col' AS col_1,
sum(col_numeric_a) AS expression
FROM CTE
GROUP BY 'col'
Error: 100046 (22003): Number out of representable range: type FIXEDSB16
The error occurs due to the following behavior in Snowflake:
https://community.snowflake.com/s/article/Number-out-of-representable-range-error-occurs-during-the-multiplication-of-numeric-values
Complete documentation for existing tests
Data warehouse: BigQuery
The test expect_table_columns_to_match_set compares the column list provided in the column_list parameter with the columns in the model. When the list contains an additional column that does not exist in the model, the test still succeeds.
This design misleads developers and testers in one particular scenario, outlined below.
Ideally, the test should compare every column provided in the parameter and fail if even one is missing. The table column count test will reveal the count differences, but this test should also cover this scenario. Thoughts?
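A sketch of a stricter, symmetric comparison that fails on columns missing from either side (the expected_columns and actual_columns CTE names are illustrative; except distinct as on BigQuery):

-- any row returned here indicates a mismatch in either direction
(select column_name from expected_columns
 except distinct
 select column_name from actual_columns)
union all
(select column_name from actual_columns
 except distinct
 select column_name from expected_columns)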
Hi!
I have been running the integration tests, and the following test fails depending on the time of day (before or after midday):
- name: timeseries_hourly_data_extended
columns:
- name: date_hour
tests:
- dbt_expectations.expect_row_values_to_have_recent_data:
datepart: hour
interval: 12
timeseries_hourly_data_extended uses date_spine up to today while excluding it (as described in dbt-utils: "This macro returns the sql required to build a date spine. The spine will include the start_date (if it is aligned to the datepart), but it will not include the end_date").
When running the test at 2 pm, going back 12 hours makes it 2 am, which is still after the last record (since date_spine excludes the end_date), so the test fails.
Would it make sense to change this to 24?
Cheers,
Looks like spark-utils got upgraded to 0.2.0, and this is causing a breaking error in dbt deps because the packages can't find a satisfactory version.
This also applies to the dbt-date package.
The timeseries_hourly_data_extended model for the integration tests does not work as intended on Snowflake.
I think that the final where clause in the macro test_expect_column_distinct_values_to_be_in_set should be
where
s.value_field is null
not
where
v.value_field is null
Running dbt deps with the following packages.yml:
packages:
- git: "https://github.com/calogica/dbt-expectations.git"
revision: 0.2.0
- package: fishtown-analytics/dbt_utils
version: 0.5.1
I get:
Running with dbt=0.17.0
Encountered an error:
Version error for package fishtown-analytics/dbt_utils: Could not find a satisfactory version from options: ['=0.5.1', '>=0.6.0', '<0.7.0']
It's confusing, since I'm actually running with dbt_utils version 0.5.1. Is this expected?
We should support a mostly parameter in expression tests: a float that controls what percentage of rows need to satisfy the given expression test. The default is 100%.
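A sketch of the underlying logic, assuming a mostly value between 0 and 1:

select
    case
        {# fail (1) unless at least `mostly` of the rows satisfy the expression #}
        when sum(case when {{ expression }} then 1 else 0 end) * 1.0 / count(*) >= {{ mostly }} then 0
        else 1
    end as failures
from {{ model }}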
This currently sets group_by to None
The tests are currently only integration tested on BigQuery. We need to test on Postgres and Snowflake at least, and ideally also on Redshift.
If needed, this should spawn additional issues to turn certain macros into adapter macros for better cross-platform compatibility.
We should compare the macro parameters for all expectations to make sure they match (as much as applicable) their Python equivalents. E.g., expect_column_values_to_be_between does not implement the strict_min, strict_max, or mostly parameters.
Hi @clausherther !
I'm one of the maintainers of Great Expectations. I'm excited that you're working on bringing dbt and GE functionality together. Lots of teams use both tools, and we're very supportive of even closer collaboration.
Looking at your implementation, I can see several places where we can probably save you some work, but the right way to implement will depend on exactly what you're trying to accomplish. We'd be happy to have someone on the GE team collaborate to help.
Are you up for a call to discuss?
expression_difference_percent
abs(coalesce(a.expression, 0) - coalesce(b.expression, 0))/
nullif(a.expression, 0) as expression_difference_percent
should be
abs(coalesce(a.expression, 0) - coalesce(b.expression, 0))/
nullif(a.expression * 1.0, 0) as expression_difference_percent
or a more explicit cast; otherwise the result is 0 on platforms where integer division doesn't produce a decimal.
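An explicit-cast variant could use the cross-database type macro from dbt-utils; a sketch:

abs(coalesce(a.expression, 0) - coalesce(b.expression, 0)) /
nullif(cast(a.expression as {{ dbt_utils.type_float() }}), 0) as expression_difference_percent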
Hi folks,
I am using the following test: expect_column_values_to_be_in_set. When I include null inside the value_set, it keeps throwing errors. I am working with Redshift.
expect_column_values_to_be_between should use the "and" operator instead of "or"
{% set expression %}
{{ column_name ~ " >= " ~ min_value ~ " or " ~
column_name ~ " <= " ~ max_value }}
{% endset %}
Tests that involve the info schema only work on tables and views
We should add the ability to check for recent data at a grouping level via a group_by parameter.
For example, a user might expect recent rows for every product in a fact table. So, a recency check should look at the grouped rows for each product_id that does not meet the recency requirement, e.g.:
with latest_grouped_timestamps as (
select
product_id,
max(timestamp_column) as latest_timestamp_column
from
fact_table
group by
1
),
validation_errors as (
select *
from
latest_grouped_timestamps
where
latest_timestamp_column < {{ threshold_timestamp }}
)
select count(*) from validation_errors
Name: expect_grouped_row_values_to_have_recent_data?
Note: this would only cover grouped values that already exist in the model being tested. So, for example, products that have never sold would not be checked. That would require a left join to a dimension table and is currently out of scope for this test.
With my packages like:
packages:
- package: calogica/dbt_expectations
version: [">=0.3.0", "<0.4.0"]
- package: fishtown-analytics/codegen
version: 0.3.2
- package: fivetran/fivetran_log
version: 0.3.1
- package: fishtown-analytics/audit_helper
version: 0.3.0
I'm getting this error:
Running with dbt=0.19.1
Encountered an error:
Found duplicate project "dbt_utils". This occurs when a dependency has the same project name as some other dependency.
But if we pin the version to 0.3.4, everything works fine.
We should list each parameter in the example docs and indicate whether it's required or optional.
Example:
tests:
- dbt_expectations.expect_column_stdev_to_be_between:
# either min_value or max_value is required
min_value: 0 <optional>
max_value: 2 <optional>
group_by: [col1, col2] <optional>
strictly: True <optional>
Hey Claus! Planning to open this same issue in the dbt-date package momentarily :)
We've made a change to adapter.dispatch: instead of passing a _get_namespaces() macro to the packages arg, you should pass the name of this package (as a string) to the macro_namespace arg, which is now the second positional argument.
In practice, this just looks like switching each macro like (dbt-expectations/macros/math/log_natural.sql, lines 1 to 3 at 730d195):
{% macro log_natural(x) -%}
{{ adapter.dispatch('log_natural', packages=dbt_expectations._get_namespaces()) (x) }}
{% endmacro %}
To:
{% macro log_natural(x) -%}
{{ adapter.dispatch('log_natural', 'dbt_expectations') (x) }}
{% endmacro %}
I hope this could be as simple as Find + Replace for packages=dbt_expectations._get_namespaces() → 'dbt_expectations'.
If a user wishes to override/shim this package, instead of defining a var named dbt_expectations_dispatch_list (or dbt_utils_dispatch_list), they should now define a config in dbt_project.yml, for instance:
dispatch:
- macro_namespace: dbt_expectations
search_order: ['my_project', 'dbt_expectations'] # enable override
- macro_namespace: dbt_utils
search_order: ['spark_utils', 'dbt_utils'] # shim for spark
This change is in dbt v0.19.2 as well. Both v0.19.2 and v0.20.0 have backwards compatibility for the old syntax, so there's no huge rush to upgrade. As soon as you do upgrade to the new syntax, you'll need to require dbt >= 0.19.2 (or just >= 0.20.0, for simplicity, since you're already making compatibility changes in #69).
Hi,
I'm looking at the source code for expect_column_values_to_be_in_set, and it doesn't look like it's validating the fields at all.
validation_errors as (
-- values from the model that are not in the set
select
v.value_field
from
all_values v
left join
set_values s on v.value_field = s.value_field
where
v.value_field is null
)
I think that to validate for errors, either v.value_field needs to be s.value_field, or the left join needs to be a right join. I can submit a pull request this week, but thought I'd let you know first, as it might affect a lot of tests.
Integration tests currently fail on a local Postgres instance:
Database Error in model timeseries_data (models/schema_tests/timeseries_data.sql)
function rand() does not exist
LINE 170: rand()
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
compiled SQL at target/run/dbt_expectations_integration_tests/models/schema_tests/timeseries_data.sql
Database Error in model timeseries_hourly_data_extended (models/schema_tests/timeseries_hourly_data_extended.sql)
type "datetime" does not exist
LINE 197: cast(d.date_hour as datetime) as date_hour,
^
compiled SQL at target/run/dbt_expectations_integration_tests/models/schema_tests/timeseries_hourly_data_extended.sql
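One possible fix is to wrap such functions in adapter macros so Postgres gets its own implementation; a minimal sketch for rand(), using the dispatch syntax discussed above (the macro names are illustrative):

{% macro rand() -%}
    {{ adapter.dispatch('rand', 'dbt_expectations')() }}
{%- endmacro %}

{% macro default__rand() -%}
    rand()
{%- endmacro %}

{% macro postgres__rand() -%}
    random()
{%- endmacro %}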
We need to rewrite expectations to return failing rows, not counts.
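For an expression-style test, that could look something like this sketch:

-- return the offending rows themselves so they can be inspected or stored
select *
from {{ model }}
where not ({{ expression }})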
Hi,
I wanted to reference a view:
models:
resulting in:
Failure in test dbt_expectations_expect_table_column_count_to_equal_other_table_my_table_ref_my_table_view_ (models/my_first_folder/schema.yml)
Got 7 results, expected 0.
Are the tests only working with tables?
Convert test macros to use expression_is_true where possible.