dbt-expectations's People

Contributors

agusfigueroa-htg, b-per, bartlomiejolma, chapmanjacobd, clausherther, dluftspring, dylanrigal, jeremyyeo, karanhegde, kdw2126, lewisarmistead, lookslikeitsnot, lucasthenoob, makotonium, mateuszklimek, mbkupfer, mirosval, mrjoosh, nicholasyager, noel, nysthee, owlas, rcaddell, robbertdm, s-vinay, samantha-guerriero-cko, tconbeer, thomaslaber, uselessalias, vdfaller

dbt-expectations's Issues

Compatibility issue with dbt v0.20.0

Hi there,
I've just upgraded to v0.20.0 to do some evaluation and check for any breaks. I'm getting this error when I run dbt deps:

Running with dbt=0.20.0
Encountered an error:
Found duplicate project "dbt_utils". This occurs when a dependency has the same project name as some other dependency.

This is my packages.yml:

packages:
  - package: dbt-labs/dbt_utils
    version: 0.7.0
  - package: calogica/dbt_expectations
    version: 0.3.7

If I remove your package I don't get the error.

Running with dbt=0.20.0
Installing dbt-labs/dbt_utils@0.7.0
  Installed from version 0.7.0

We use your awesome package extensively. If you could have a look asap, that would be much appreciated.
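
In the meantime, here is a workaround sketch that seems to avoid the duplicate (an assumption: dbt_expectations pulls in dbt_utils transitively, so the explicit entry can be dropped):

packages:
  - package: calogica/dbt_expectations
    version: 0.3.7

This only helps if nothing else in the project pins a different dbt_utils version.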

Missing from_unixtimestamp

Hello,

I have updated to dbt 0.19.2 and I'm now getting the following error:

Running with dbt=0.19.2
Encountered an error:
Compilation Error
  In dispatch: No macro named 'from_unixtimestamp' found
      Searched for: 'dbt_date.redshift__from_unixtimestamp', 'dbt_date.default__from_unixtimestamp'

I've found this commit, calogica/dbt-date@bc5a637, that adds this to dbt-date; however, the currently released version of the dbt-expectations package includes an older version of that package (though it seems to be up to date on main). So I get a version conflict like:

Running with dbt=0.19.2
Encountered an error:
Version error for package calogica/dbt_date: Could not find a satisfactory version from options: ['>=0.3.0', '<0.4.0', '>=0.2.0', '<0.3.0']

I'm not sure how it worked before (or where that function used to be defined) or what exactly broke, and I realize that dbt-expectations is probably not the right place to complain about it, but it seems you're in a position to fix it?

filter_by needed for all methods

It would be great if we could limit the data set. Some tables can be huge, and the tests just take too long to complete within a CI pipeline.
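
A usage sketch of what this might look like; the row_condition parameter name is borrowed from the tests that already support it, and the ask is to honor it everywhere:

tests:
  - dbt_expectations.expect_column_values_to_not_be_null:
      # hypothetical: scan only recent rows so CI stays fast
      row_condition: "created_at >= current_date - 7"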

No boolean support in SQL Server

Hi!

I tried to run dbt-expectations on my Azure SQL DB, and while some tests work out of the box, others raise SQL errors.

One of them is caused by the macro expression_is_true.sql. The current macro performs a boolean test on {{ expression }} directly in the select statement, but this is not valid in SQL Server. My current workaround is to change the default test_condition="= true" to test_condition="= 1" and use a case when statement in the select:

with grouped_expression as (

    select
        {% if group_by_columns %}
        {% for group_by_column in group_by_columns -%}
        {{ group_by_column }} as col_{{ loop.index }},
        {% endfor -%}
        {% endif %}
        case when {{ expression }} then 1 else 0 end as expression
    from {{ model }}
     {%- if row_condition %}
    where
        {{ row_condition }}
    {% endif %}
    {% if group_by_columns %}
    group by
    {% for group_by_column in group_by_columns -%}
        {{ group_by_column }}{% if not loop.last %},{% endif %}
    {% endfor %}
    {% endif %}

)

select * from grouped_expression
-- rest of the macro unchanged; with this change, the default test_condition becomes "= 1"

I haven't done extensive testing, but this seems to work with the few tests I've run so far.
Would it make sense to change the macro for all SQL engines, or to add conditions for SQL Server vs. the rest?

expect_column_values_to_be_of_type returns "character varying (16777216)" for text data types

SnowflakeColumn(column='ENTITY', dtype='VARCHAR', char_size=16777216, numeric_precision=None, numeric_scale=None)

column.dtype returns "VARCHAR"
column.data_type returns "character varying(16777216)"

The implementation could be changed to use .dtype instead of .data_type, like so:

{%- macro test_expect_column_values_to_be_of_type(model, column_name, column_type) -%}
{%- if execute -%}

    {%- set column_name = column_name | upper -%}
    {%- set columns_in_relation = adapter.get_columns_in_relation(model) -%}
    {%- set column_type = column_type | upper -%}

    {%- set matching_column_types = [] -%}

    {%- for column in columns_in_relation -%}
        {%- if ((column.name | upper) == column_name) and ((column.dtype | upper) == column_type) -%}
            {%- do matching_column_types.append(column.name) -%}
        {%- endif -%}
    {%- endfor -%}
    select 1 - {{ matching_column_types | length }}

{%- endif -%}
{%- endmacro -%}

expect_table_columns_to_match_ordered_list is failing on Redshift

After the upgrade to 0.4, the expectation is failing on Redshift with:

Invalid operation: failed to find conversion function from "unknown" to text;

I think this is because Redshift can't infer the type of ordered_column_names and ordered_relation_column_names. Locally, I managed to get it to run by adding ::text after them:

select
  count(*) as failures,
  count(*) != 0 as should_warn,
  count(*) != 0 as should_error
from (
       with test_data as (

         select
           'COL_A, COL_B'::text as ordered_column_names,
           'COL_A, COL_B'::text as ordered_relation_column_names

       )
       select *
       from test_data
       where ordered_column_names != ordered_relation_column_names

     ) dbt_internal_test

Regex expectations broken on Redshift

Hello, I've upgraded to dbt 0.20.0 and dbt-expectations 0.4. After that I started seeing the following error:

Invalid operation: function regexp_matches(character varying, "unknown") does not exist

My understanding of what happened: #39 was fixed by replacing regexp_instr with regexp_matches on Postgres. However, because Redshift falls back to the Postgres version of the macro under the new dispatch rules, regexp_matches is now also invoked on Redshift. But Redshift does not have that function (even though it does have the original regexp_instr): https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_INSTR.html
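
A possible fix might be a Redshift-specific dispatch candidate that keeps using regexp_instr. A sketch, assuming the package dispatches a regex macro along these lines:

{% macro redshift__regexp_instr(source_value, regexp) %}
    {# Redshift has regexp_instr but not regexp_matches #}
    regexp_instr({{ source_value }}, '{{ regexp }}')
{% endmacro %}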

Add test to compare distinct counts between tables

Can we consider a test that checks that all distinct source records are present in a target table? This could confirm that our warehouse isn't missing any IDs. I would imagine it might be similar to the expect_table_row_count_to_equal_other_table test, but with a distinct allowed?
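
A sketch of what the compiled comparison might boil down to (table and column names are hypothetical):

select
    abs(
        (select count(distinct customer_id) from {{ ref("stg_customers") }})
        - (select count(distinct customer_id) from {{ ref("dim_customers") }})
    ) as failures
-- the test would pass when failures = 0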

expect_row_values_to_have_recent_data fails in BigQuery for TIMESTAMP columns

If you add an expect_row_values_to_have_recent_data test to a TIMESTAMP column, you will get a failure driven by two characteristics of the existing implementation:

  • The comparison fails because BigQuery does not allow TIMESTAMP columns to be compared with DATE values using the <= operator.
  • The comparison likewise fails because BigQuery does not allow TIMESTAMP columns to be compared with DATETIME values using the <= operator.

Given that a master time zone is specified in vars under dbt_project.yml, it seems possible to rewrite these comparisons using the appropriate recasting statements to allow columns of a TIMESTAMP type to be compatible with the expect_row_values_to_have_recent_data test.
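
For example, in BigQuery the threshold can be recast to TIMESTAMP before the comparison (a sketch with a hypothetical column and interval):

select *
from {{ model }}
where
    -- timestamp() recasts the DATETIME threshold so the comparison is legal
    created_at >= timestamp(datetime_sub(current_datetime(), interval 1 day))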

Expect table columns to match set fails on Redshift

After the upgrade to 0.4, this expectation fails with:

Invalid operation: failed to find conversion function from "unknown" to text;

I think it's because of the quotes around number_columns and number_matching_columns; those were plain numbers before, as far as I can tell.
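
Casting the literals, along the lines of the ordered-list fix above, appears to work (a sketch):

with test_data as (

    select
        '2'::int as number_columns,           -- explicit cast so Redshift can type the literal
        '2'::int as number_matching_columns

)
select *
from test_data
where number_columns != number_matching_columns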

expect_table_row_count_to_equal_other_table should support row_condition

It should at least support row_condition and compare_row_condition, if not the other parameters of equal_expression:

{%- macro test_expect_table_row_count_to_equal_other_table(model, compare_model, row_condition=None, compare_row_condition=None) -%}
{{ dbt_expectations.test_equal_expression(model, "count(*)", compare_model=compare_model, row_condition=row_condition, compare_row_condition=compare_row_condition) }}
{%- endmacro -%}
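
Hypothetical usage, assuming the parameters above were added:

models:
  - name: my_model
    tests:
      - dbt_expectations.expect_table_row_count_to_equal_other_table:
          compare_model: ref("other_model")
          row_condition: "status = 'active'"
          compare_row_condition: "status = 'active'"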

Package is out of date on dbt hub

I know this issue does not belong here, but it seems that the 0.4.1 version, while released, has not been updated on the dbt hub, which still shows 0.4.0. I have created an issue there (dbt-labs/hub.getdbt.com#755), but nobody has responded. Would you know who to contact about fixing this?

dbt deps

With packages.yaml:

packages:

Running dbt deps, I get this error:
Running with dbt=0.18.0
Encountered an error:
Got a non-zero returncode running: ['git', 'clone', '--depth', '1', 'git@github.com:calogica/dbt-date.git', '36b63a0bdea5f3e5d232c299b76e1160']

Also, having the dependencies in there directly does not solve the issue...

packages:

However, only having:

packages:

results in this dbt deps output:
Running with dbt=0.18.0
Installing https://github.com/calogica/dbt-expectations.git@0.2.1
Installed from revision 0.2.1

Installing fishtown-analytics/dbt_utils@0.6.2
Installed from version 0.6.2

Set up automated integration testing

We should set up GitHub Actions or something similar to let us kick off integration tests. Runs could be manually triggered (to avoid BigQuery charges) or automated (at least for Postgres).
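
A minimal sketch of what the Postgres job could look like (the workflow layout, directory name, and dbt install step are assumptions, and a profiles.yml pointing at the service database is assumed to exist):

name: integration-tests
on:
  push:
  workflow_dispatch:  # manual trigger, e.g. for a BigQuery job, to avoid charges
jobs:
  postgres:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:13
        env:
          POSTGRES_PASSWORD: postgres
        ports:
          - 5432:5432
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
      - run: pip install dbt
      - run: dbt deps && dbt seed && dbt run && dbt test
        working-directory: integration_tests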

get_select() TSQL compatibility

As part of the work in dbt-msft/tsql-utils#44 to make this package compatible with the dbt-msft adapters, I'm stuck trying to shim the first macro.

How can I make this query TSQL- and dbt-expectations-compatible? I'm struggling to grok the intent behind dbt_expectations.test_equal_expression()'s default argument of groupby=['col'].

Here's a long write-up of the steps I went through to discover that get_select() was the problematic macro.

We very much appreciate the work you've done here, and we'd like to re-use your integration tests as much as possible.

failed compilation syntax

select
    'col' as col_1,
    sum(col_numeric_a) as expression
from
    "dbt-msft-serverless-db"."integration_tests_azsql_dbt_expectations_integration_tests"."data_test"
group by
    'col'

generic example

Here's a generic example. I don't understand how this works on Snowflake or BigQuery right now, because it doesn't work on Postgres or TSQL:

WITH CTE AS (
    SELECT 1 as col_numeric_a
    UNION 
    SELECT 2 as col_numeric_a
)
SELECT
    'col' AS col_1,
    sum(col_numeric_a) AS expression
FROM CTE
GROUP BY 'col'

clarification on the use of expect_table_columns_to_match_set

Data warehouse: BigQuery

The test "expect_table_columns_to_match_set" is comparing the columns list provided in the "column_list" parameter with the columns in the model. When we have some additional column in the list which does not exist in the model the test is still successful.

This design misleads developers and testers in one particular scenario, outlined below:

  • the developer forgot to add one or more columns, required by the spec, to a model
  • as per the spec, the test set included the column(s) the developer missed
  • when the test is executed, it does not fail, because all the other columns in the set exist in the model

Ideally, the test should check every column provided in the parameter, and fail if even one is missing. The table column count test will reveal the count difference, but this test should also cover this scenario. Thoughts?
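
For illustration, a sketch of the missing direction of the comparison (the CTE names are hypothetical):

-- columns required by the test set but absent from the model;
-- any rows here should fail the test, in addition to the existing direction
select column_name from set_columns
except distinct
select column_name from model_columns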

Integration test fails depending on the time of day

Hi!

I have been running the integration tests, and the following test fails depending on the time of day (before or after midday):

  - name: timeseries_hourly_data_extended
    columns:
      - name: date_hour
        tests:
          - dbt_expectations.expect_row_values_to_have_recent_data:
              datepart: hour
              interval: 12

timeseries_hourly_data_extended uses date_spine up to today, while excluding it (as described in dbt-utils: "This macro returns the sql required to build a date spine. The spine will include the start_date (if it is aligned to the datepart), but it will not include the end_date").

When running the test at 2 pm, going back 12 hours lands at 2 am, which is still after the last record (since date_spine excludes the end_date), so the test fails.

Would it make sense to change this to 24?

Cheers,

dbt Spark Utils causing breaking error

Looks like spark-utils got upgraded to 0.2.0, and this is causing a breaking error in dbt deps because the packages can't find a satisfactory version.

This also applies to the dbt-date package.

Dbt deps issue

Running dbt deps with the following packages.yml:

packages:
  - git: "https://github.com/calogica/dbt-expectations.git"
    revision: 0.2.0
  - package: fishtown-analytics/dbt_utils
    version: 0.5.1

I get:

Running with dbt=0.17.0
Encountered an error:
Version error for package fishtown-analytics/dbt_utils: Could not find a satisfactory version from options: ['=0.5.1', '>=0.6.0', '<0.7.0']

It's confusing, since I'm actually running with dbt_utils version 0.5.1. Is this expected?

Implement 'mostly' in expression tests

We should support the mostly parameter in expression tests, as a float that controls what percentage of rows needs to satisfy the given expression test. The default is 100%.
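
A hypothetical usage sketch (mostly is not implemented yet):

tests:
  - dbt_expectations.expect_column_values_to_match_regex:
      regex: "[A-Za-z]+"
      mostly: 0.95  # pass if at least 95% of rows satisfy the expression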

Test for cross-platform compatibility

The tests are currently only integration-tested on BigQuery. We need to test on Postgres and Snowflake at least, and ideally also on Redshift.

If needed, this should spawn additional issues to turn certain macros into adapter macros for better cross-platform compatibility.

Implement missing test parameters

We should compare the macro parameters for all expectations to make sure they match their Python equivalents (as much as applicable).
For example, expect_column_values_to_be_between does not implement the strict_min, strict_max, or mostly parameters.

Alternatives to re-implementation?

Hi @clausherther!

I'm one of the maintainers of Great Expectations. I'm excited that you're working on bringing dbt and GE functionality together. Lots of teams use both tools, and we're very supportive of even closer collaboration.

Looking at your implementation, I can see several places where we can probably save you some work, but the right way to implement will depend on exactly what you're trying to accomplish. We'd be happy to have someone on the GE team collaborate to help.

Are you up for a call to discuss?

equal_expression macro doesn't support integer division correctly

expression_difference_percent

    abs(coalesce(a.expression, 0) - coalesce(b.expression, 0))/
        nullif(a.expression, 0) as expression_difference_percent

should be

    abs(coalesce(a.expression, 0) - coalesce(b.expression, 0))/
        nullif(a.expression * 1.0, 0) as expression_difference_percent

or with a more explicit cast; otherwise the result is 0 on platforms where integer division doesn't produce a decimal.

Add a test to check for recent rows at each grouping level

We should add the ability to check for recent data at a grouping level, via a group_by parameter.
For example, a user might expect recent rows for every product in a fact table. So, a recency check should look at grouped rows and flag each product_id that does not meet the recency requirement.

e.g.

with latest_grouped_timestamps as (
    
    select
        product_id,
        max(timestamp_column) as latest_timestamp_column
    from
        fact_table
    group by
        1
),
validation_errors as (
    
    select *
    from
        latest_grouped_timestamps
    where
        latest_timestamp_column < {{ threshold_timestamp }}

)
select count(*) from validation_errors

Name: expect_grouped_row_values_to_have_recent_data?

Note: this would only cover grouped values that already exist in the model being tested. So, for example, products that have never sold would not be checked. That would require a left join to a dimension table and is currently out of scope for this test.
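
Hypothetical usage, with parameter names still to be decided:

tests:
  - dbt_expectations.expect_grouped_row_values_to_have_recent_data:
      group_by: [product_id]
      timestamp_column: updated_at
      datepart: day
      interval: 1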

Error after upgrade to 0.3.6

With my packages.yml like:

packages:
  - package: calogica/dbt_expectations
    version: [">=0.3.0", "<0.4.0"]
  - package: fishtown-analytics/codegen
    version: 0.3.2
  - package: fivetran/fivetran_log
    version: 0.3.1
  - package: fishtown-analytics/audit_helper
    version: 0.3.0

I'm getting this error:

Running with dbt=0.19.1
Encountered an error:
Found duplicate project "dbt_utils". This occurs when a dependency has the same project name as some other dependency.

But if we pin the version to 0.3.4, everything works fine.

Update docs with parameters for each test

We should list each parameter in the example docs and indicate whether it's required or optional.

Example:

tests:
  - dbt_expectations.expect_column_stdev_to_be_between:
      # either min_value or max_value is required 
      min_value: 0 <optional>
      max_value: 2 <optional>
      group_by: [col1, col2] <optional>
      strictly: True <optional>

Changes to dispatch in dbt v0.20

Hey Claus! Planning to open this same issue in the dbt-date package momentarily :)

Required code change

We've made a change to adapter.dispatch: Instead of passing a _get_namespaces() macro to the packages arg, you should pass the name of this package (as a string) to the macro_namespace arg, which is now the second positional argument.

In practice, this just looks like switching each macro like:

{% macro log_natural(x) -%}
{{ adapter.dispatch('log_natural', packages=dbt_expectations._get_namespaces()) (x) }}
{% endmacro %}

To:

{% macro log_natural(x) -%}
    {{ adapter.dispatch('log_natural', 'dbt_expectations') (x) }}
{% endmacro %}

I hope this could be as simple as Find + Replace for packages=dbt_expectations._get_namespaces() → 'dbt_expectations'.

For the README

If a user wishes to override/shim this package, instead of defining a var named dbt_expectations_dispatch_list (or dbt_utils_dispatch_list), they should now define a config in dbt_project.yml, for instance:

dispatch:
  - macro_namespace: dbt_expectations
    search_order: ['my_project', 'dbt_expectations']  # enable override
  - macro_namespace: dbt_utils
    search_order: ['spark_utils', 'dbt_utils']  # shim for spark

Notes

This change is in dbt v0.19.2 as well. Both v0.19.2 and v0.20.0 have backwards compatibility for the old syntax, so there's no huge rush to upgrade. As soon as you do upgrade to the new syntax, you'll need to require dbt >= 0.19.2 (or just >=0.20.0, for simplicity, since you're already making compatibility changes in #69).


expect_column_values_to_be_in_set doesn't seem to be working

Hi,

I'm looking at the source code for expect_column_values_to_be_in_set, and it doesn't look like it's validating the fields at all:

validation_errors as (
    -- values from the model that are not in the set
    select
        v.value_field
    from
        all_values v
        left join
        set_values s on v.value_field = s.value_field
    where
        v.value_field is null
)

I think that to catch the errors, either v.value_field in the where clause needs to be s.value_field, or the left join needs to be a right join. I can submit a pull request this week, but I thought I'd let you know first, as it might affect a lot of tests.
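
For reference, the first option would only change the where clause (a sketch):

    where
        -- a model value with no match in the set surfaces as a null on the set side
        s.value_field is null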

Fix integration tests on Postgres

Integration tests currently fail on a local Postgres instance:

Database Error in model timeseries_data (models/schema_tests/timeseries_data.sql)
  function rand() does not exist
  LINE 170:     rand()
                ^
  HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
  compiled SQL at target/run/dbt_expectations_integration_tests/models/schema_tests/timeseries_data.sql

Database Error in model timeseries_hourly_data_extended (models/schema_tests/timeseries_hourly_data_extended.sql)
  type "datetime" does not exist
  LINE 197:         cast(d.date_hour as datetime) as date_hour,
                                        ^
  compiled SQL at target/run/dbt_expectations_integration_tests/models/schema_tests/timeseries_hourly_data_extended.sql
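
For reference, the Postgres equivalents would presumably be (a sketch; the date_spine alias is hypothetical):

select
    random() as rnd,                              -- Postgres has random(), not rand()
    cast(d.date_hour as timestamp) as date_hour   -- "datetime" is not a Postgres type
from date_spine d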

reference a view and not a table

Hi,

I wanted to reference a view:

models:
  - name: my_table
    description: This table contains our toy model
    tests:
      - dbt_expectations.expect_table_column_count_to_equal_other_table:
          compare_model: ref("my_table_view")

resulting in:

Failure in test dbt_expectations_expect_table_column_count_to_equal_other_table_my_table_ref_my_table_view_ (models/my_first_folder/schema.yml)
Got 7 results, expected 0.

Are the tests only working with tables?
