
automate-dv's Introduction

AutomateDV


Changelog and past doc versions

AutomateDV by Datavault

Build your own Data Vault data warehouse! AutomateDV is a free to use dbt package that generates & executes the ETL you need to run a Data Vault 2.0 Data Warehouse on your data platform.

What does AutomateDV offer?

  • productivity gains, fewer errors
  • multi-threaded execution of the generated SQL
  • your data modeller can generate most of the ETL code directly from their mapping metadata
  • your ETL developers can focus on the 5% of the SQL code that is different
  • dbt generates documentation and data flow diagrams

powered by dbt, a registered trademark of dbt Labs

Worked example project

Learn quickly with our worked example:

Supported platforms:

Platform support matrix

Installation

Check dbt Hub for the latest installation instructions, or read the docs for more information on installing packages.

Usage

  1. Create a model for your table.
  2. Provide metadata
  3. Call the appropriate template macro
# Configure model
{{- config(...)                          -}}

# Provide metadata
{%- set src_pk = ...                     -%}
...

# Call the macro
{{ automate_dv.hub(src_pk, src_nk, src_ldts,
                   src_source, source_model) }}
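
For illustration, a complete hub model might look like the following sketch (the model, column, and source names here are placeholder assumptions, not part of the package):

{{ config(materialized='incremental') }}

{%- set source_model = "stg_customer" -%}
{%- set src_pk = "CUSTOMER_HK" -%}
{%- set src_nk = "CUSTOMER_ID" -%}
{%- set src_ldts = "LOAD_DATETIME" -%}
{%- set src_source = "RECORD_SOURCE" -%}

{{ automate_dv.hub(src_pk=src_pk, src_nk=src_nk, src_ldts=src_ldts,
                   src_source=src_source, source_model=source_model) }}

The metadata names follow the parameters shown above; see the documentation for the full set of options per structure.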

Join our Slack Channel

Talk to our developers and other members of our growing community, get support and discuss anything related to AutomateDV or Data Vault 2.0

Join our Slack

Social

Twitter Follow

LinkedIn

Youtube

Awards

innovation awards

Contributing

View our contribution guidelines

License

Apache 2.0

automate-dv's People

Contributors

datavaultuk, dependabot[bot], dvaakashd, dvalexhiggs, dvchrisf, dvflynnsheridan, dvfrankiehomewood, dvjocelynshannon, dvnorbertacatrinei, dvsiobhanstrange, dvtimwilson, r-strange


automate-dv's Issues

[FEATURE] Allow composite derived columns via a list

Allow a list to be provided to derived columns like so:

derived_columns:
  CUSTOMER_NK: 
    - CUSTOMER_NAME
    - CUSTOMER_ID

Given CUSTOMER_NAME = "Alex"
And CUSTOMER_ID = 001
Then CUSTOMER_NK = Alex||001

In future we will provide a means to configure the concatenation string, but for now the double pipe standard will be used.

This will allow concatenated values to be generated for situations where one column is not enough to define a required value.

This was suggested in #9 and is defined here as a separate feature for trackability.
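
For illustration, if this feature were implemented, a stage model might provide the metadata like the sketch below (the model and column names are assumptions, and the list form of CUSTOMER_NK is the proposed behaviour rather than current functionality):

{%- set yaml_metadata -%}
source_model: 'raw_customer'
derived_columns:
  CUSTOMER_NK:
    - CUSTOMER_NAME
    - CUSTOMER_ID
{%- endset -%}

{% set metadata_dict = fromyaml(yaml_metadata) %}

{{ dbtvault.stage(include_source_columns=true,
                  source_model=metadata_dict['source_model'],
                  derived_columns=metadata_dict['derived_columns'],
                  hashed_columns=none,
                  ranked_columns=none) }}

The stage would then emit CUSTOMER_NK as CUSTOMER_NAME||CUSTOMER_ID, per the double-pipe standard described above.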

[BUG] Adapter dispatch implementation

Currently, the extension capability of dbtvault is causing errors.

When attempting to override dbtvault macros using the adapter_packages variable, dbtvault will complain that you are not overriding other macros; in other words, dbtvault does not fall back to its own implementation.

This has already been fixed internally and the fix will be in the next release.

[FEATURE] dbt 0.21.0 compatibility

Hi!

Have you had the chance to test dbtvault with dbt 0.21.0? If so, would it be possible to update require-dbt-version in dbt_project.yml? We'd like to upgrade to use the great new on_schema_change feature for incremental models.

Best,
Stefan

[Question] Determining mappings from stage to RDV based on DV metadata

Hey team

Just a question this time around, I'm afraid. I have been reading this page amongst the others in the documentation, and what I am trying to determine is whether there is a trivial/easy way to determine from the metadata how the data maps from one layer to another (stage to RDV).

I presume this can be determined by going over each DV model (hub, sat etc.), parsing out from the YML/Jinja code the parameter values and then building a traceability matrix by linking things together using the source model and the columns that are mapped through.

Is there anything that has been put together to serve this purpose? I know dbt can render a lineage graph of the models, and it would be great to understand whether there is a proven way to render a similar graph at column level.

Thanks

[BUG] Duplicate records in effectivity satellites (vault_insert_by_period)

Describe the bug
When using the vault_insert_by_period materialisation with eff_sat, it doesn't insert the most recent closed records from the source data.

Versions

dbt: 0.20.0
dbtvault: 0.7.6.1

To Reproduce
Steps to reproduce the behavior:

  1. Set up a staging table with two records: one with an end_date that doesn't equal '9999-12-31', the other with a specific end_date; the other columns of the two records are the same.
  2. Set up an eff_sat table whose source_model is the staging table from step 1
  3. Run dbt run -m ${model_name}
  4. There's only one record in the eff_sat table; the record whose end_date does not equal '9999-12-31' doesn't exist in the eff_sat table.

Expected behavior
All the records in the staging table should exist in the eff_sat table in the first run of dbt run.

Screenshots
You can test with the code here.
sat_link_a_b_period.sql will have the following result in the first run:
Screenshot from 2021-09-17 15-37-55

[BUG] hash macro returns null for NULL string input

Describe the bug
When using the hash macro to generate a hash for a single string, NULL is returned for NULL input.

To Reproduce
Use multi_hash with a single column input that can contain NULLs.
{{ dbtvault.multi_hash([('CUSTOMER_PHONE', 'CUSTOMER_HASHDIFF')]) }}
NULL is returned for CUSTOMER_HASHDIFF

Expected behavior
A hash of the dummy value (^^) is returned.

[BUG] Link record is NULL and not loaded when some FKs are NULL

Describe the bug
Link record is NULL and not loaded when some FKs are NULL

Versions

dbt: 0.20.2
dbtvault: 0.7.7
snowflake: 5.34.1

To Reproduce
Steps to reproduce the behavior:

  1. Go to hashed view ddl
  2. See how hashkeys are generated: CAST(MD5_BINARY(NULLIF(CONCAT_WS('||', IFNULL(NULLIF(UPPER(TRIM(CAST(adrnr__address AS VARCHAR))), ''), '^^') ), '^^')) AS BINARY(16)) AS ADDRESS_HK
  3. It seems that Snowflake returns a null hash key when there is a null value in a foreign key field.
  4. Because the link-loading SQL checks that all hash keys are not null, rows that have null foreign keys do not make it into the link table at all:
WITH row_rank_1 AS (
    SELECT lnk_sales_document_partner_hk, sales_document_hk, sales_document_partner_hkd, address_hk, customer_hk, sales_document_item_hk, dv_load_date, dv_record_source,
           ROW_NUMBER() OVER(
               PARTITION BY lnk_sales_document_partner_hk
               ORDER BY dv_load_date
           ) AS row_number
    FROM xxx.xxx.hashed_vbpa__sales_document_partner
    WHERE lnk_sales_document_partner_hk IS NOT NULL
    AND sales_document_hk IS NOT NULL
    AND sales_document_partner_hkd IS NOT NULL
    AND address_hk IS NOT NULL // <--- This
    AND customer_hk IS NOT NULL
    AND sales_document_item_hk IS NOT NULL
    QUALIFY row_number = 1
),
records_to_insert AS (
    SELECT a.lnk_sales_document_partner_hk, a.sales_document_hk, a.sales_document_partner_hkd, a.address_hk, a.customer_hk, a.sales_document_item_hk, a.dv_load_date, a.dv_record_source
    FROM row_rank_1 AS a
)
SELECT count(1) FROM records_to_insert

[FEATURE] Azure Synapse support?

A new user of the Data Vault modelling paradigm here. First of all, huge thanks for implementing the dbtvault project.

I was wondering if it would be possible to add support for Azure Synapse Data Warehouse? As Azure takes more market share in cloud computing, I believe supporting Synapse will help propel the familiarity of Data Vault modelling and help end users like us to adopt it rapidly by using dbtvault.

AB#5378

[BUG] dbt 0.20.0 compatibility

Describe the bug
dbtvault 0.7.6 doesn't work with dbt 0.20.0 yet, and probably not with 0.19.2 either (untested).

Versions

dbt: 0.20.0
dbtvault: 0.7.6

To Reproduce
Steps to reproduce the behavior:

  1. dbt init dbt_test
  2. Add dbtvault 0.7.6 to packages.yml
  3. dbt run
  4. See error
Compilation Error
  As of v0.19.2, custom macros, such as 'get_dbtvault_namespaces', are no longer supported in the 'packages' argument of 'adapter.dispatch()'.
  See https://docs.getdbt.com/reference/dbt-jinja-functions/dispatch for details.

Expected behavior
dbt run should work

Additional context
My guess is that the 3 new macros for PIT, bridge and xts tables did not get the dbt 0.20.0 treatment yet:

dbtvault/macros/tables/xts.sql
2:    {{- adapter.dispatch('xts', packages = dbtvault.get_dbtvault_namespaces())(src_pk=src_pk,

dbtvault/macros/tables/pit.sql
3:    {{- adapter.dispatch('pit', packages = dbtvault.get_dbtvault_namespaces())(source_model=source_model, src_pk=src_pk,

dbtvault/macros/tables/bridge.sql
3:    {{- adapter.dispatch('bridge', packages = dbtvault.get_dbtvault_namespaces())(source_model=source_model, src_pk=src_pk,

[BUG] insert_by_period materialization is leaving orphaned tmp relations

Description

I'm left with a bunch of orphan tmp tables when running the new materialization

dbt: 0.18.1
dbtvault: 0.7.0 (bigquery variant, but the bug is from your codebase)

To Reproduce

  1. do a full refresh / fresh run with the insert_by_period mat
  2. do a regular run for subsequent loads (not full-refresh)
  3. See extra temp tables in warehouse

Solution

Fixing the following 2 spots in the codebase gives the correct results:

  1. We're not actually dropping rel columns here
    https://github.com/Datavault-UK/dbtvault/blob/a90f480832b594b256dfd736a6619766eb30198a/macros/materialisations/vault_insert_by_period_materialization.sql#L124-L128

  2. Missing a {% do to_drop.append(tmp_relation) %} here after the for loop
    https://github.com/Datavault-UK/dbtvault/blob/a90f480832b594b256dfd736a6619766eb30198a/macros/materialisations/vault_insert_by_period_materialization.sql#L93-L101

[BUG] Duplicate records in effectivity satellites (Incremental)

Describe the bug
For closed records (end_date does not equal '9999-12-31'), when you run dbt run -m ${model_name} a second time, it will insert the same records into the eff_sat table again.

Versions

dbt: 0.20.0
dbtvault: 0.7.6.1

To Reproduce
Steps to reproduce the behavior:

  1. Set up a staging table with a record whose end_date is not '9999-12-31'
  2. Set up an effectivity satellite table whose source_model is the staging table from step 1
  3. Run dbt run -m ${model_name} twice
  4. See duplicated records in the effectivity satellite table

Expected behavior
In the second dbt run, it shouldn't insert the same closed record again

Screenshots
Screenshot from 2021-09-16 14-22-39

Log files
If applicable, provide dbt log files which include the problem.

Additional context
I think this should be a bug in dbtvault.eff_sat, https://github.com/Datavault-UK/dbtvault/blob/master/macros/tables/eff_sat.sql#L56

latest_closed AS (
    SELECT {{ dbtvault.alias_all(source_cols, 'd') }}
    FROM latest_records AS d
    WHERE TO_DATE(d.{{ src_end_date }}) != TO_DATE('9999-12-31')
),

{# Identifying the completely new link relationships to be opened in eff sat -#}
new_open_records AS (
    SELECT DISTINCT
        {{ dbtvault.alias_all(source_cols, 'f') }}
    FROM source_data AS f
    LEFT JOIN latest_records AS lr
    ON f.{{ src_pk }} = lr.{{ src_pk }}
    WHERE lr.{{ src_pk }} IS NULL
),

{# Identifying the currently closed link relationships to be reopened in eff sat -#}
new_reopened_records AS (
    SELECT DISTINCT
        lc.{{ src_pk }},
        {{ dbtvault.alias_all(fk_cols, 'lc') }},
        lc.{{ src_start_date }} AS {{ src_start_date }},
        g.{{ src_end_date }} AS {{ src_end_date }},
        g.{{ src_eff }} AS {{ src_eff }},
        g.{{ src_ldts }},
        g.{{ src_source }}
    FROM source_data AS g
    INNER JOIN latest_closed AS lc
    ON g.{{ src_pk }} = lc.{{ src_pk }}
),

A record whose status is closed (end_date doesn't equal '9999-12-31') in the eff_sat will appear in latest_records (and therefore latest_closed). If the status of the record doesn't change in the staging table, it is not filtered out in new_reopened_records and will be inserted again.

Seems like adding a where clause to filter out end_date = TO_DATE('9999-12-31') in new_reopened_records can solve the problem.
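
For illustration, one possible form of that filter, mirroring the condition used in the corresponding eff_sat excerpt later on this page (a sketch, not a confirmed fix), would be to only consider still-open source records for reopening:

new_reopened_records AS (
    SELECT DISTINCT
        lc.{{ src_pk }},
        {{ dbtvault.alias_all(fk_cols, 'lc') }},
        lc.{{ src_start_date }} AS {{ src_start_date }},
        g.{{ src_end_date }} AS {{ src_end_date }},
        g.{{ src_eff }} AS {{ src_eff }},
        g.{{ src_ldts }},
        g.{{ src_source }}
    FROM source_data AS g
    INNER JOIN latest_closed AS lc
    ON g.{{ src_pk }} = lc.{{ src_pk }}
    WHERE TO_DATE(g.{{ src_end_date }}) = TO_DATE('9999-12-31')
),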

[FEATURE] Support renaming sat hashdiff column

Is your feature request related to a problem? Please describe.
I keep finding myself in a situation where I have a source table that feeds multiple DV satellites.

I'd like to be able to stage everything once in a single stg_ file, and pass it to downstream satellites. I can do that right now, but I have to name each column differently, which means my satellite tables end up having different hashdiff column names!

Describe the solution you'd like
In addition to

dbtvault.sat(
        src_pk='my_hk',
        src_hashdiff='sat1_hashdiff',
        ...
)        

Add support for:

dbtvault.sat(
        src_pk='my_hk',
        src_hashdiff={ 'sat_hashdiff':  'sat1_hashdiff'},
        ...
)

This will read the sat1_hashdiff from the staging table, but alias it to sat_hashdiff

Describe alternatives you've considered
Of course, one can work around this by simply adding an extra staging file, but we already have so many files! It would be nice if we could get rid of a bunch of them.

Do you think this is valuable, or is this bad practice?

Custom hash characters (concat_string and null_placeholder_string)

Is your feature request related to a problem? Please describe.
Different customers may use a different hashing concatenation string (default ||) and null placeholder string (^^). So if you are extending an old solution with a dbtvault solution, the hashes would not match. It would be nice to be able to change those values in dbt_project.yml like this:

vars:
  hash: "MD5"
  concat_string: "||"
  null_placeholder_string: "^^"

Describe the solution you'd like
I would like the local concat_string and null_placeholder_string variables in the hash.sql macro to be configurable as project variables.

So this code should be changed

{%- set concat_string = "||" -%}
{%- set null_placeholder_string = "^^" -%}

to this (the same way the hash variable already is):

{%- set concat_string = var('concat_string','||') -%}
{%- set null_placeholder_string = var('null_placeholder_string','^^') -%}

[FYI] Changes to dispatch in dbt v0.20

Hi Alex & team! I wanted to give you a heads up about a change to dispatch functionality coming in the next version of dbt. Let me know if you have questions, I'm more than happy to help.

Required code change

We've made a change to adapter.dispatch: Instead of passing a _get_namespaces() macro to the packages arg, you should pass the name of this package (as a string) to the macro_namespace arg, which is now the second positional argument.

In practice, this just looks like switching each macro like:
https://github.com/Datavault-UK/dbtvault/blob/d15aeab5691056348c0ccafe4e4f153cc7e84710/macros/internal/alias_all.sql#L1-L5
To:

{%- macro alias_all(columns=none, prefix=none) -%}


    {{- adapter.dispatch('alias_all', 'dbtvault')(columns=columns, prefix=prefix) -}}


{%- endmacro %}

I hope this could be as simple as Find + Replace for packages = dbtvault.get_dbtvault_namespaces() → 'dbtvault'.

If you prefer more explicit syntax, you could also make this:

{%- macro alias_all(columns=none, prefix=none) -%}


    {{- adapter.dispatch(macro_name = 'alias_all', macro_namespace = 'dbtvault')(columns=columns, prefix=prefix) -}}


{%- endmacro %}

For the README

If a user wishes to override/shim this package, instead of defining a var named adapter_packages, they should now define a config in dbt_project.yml, for instance:

dispatch:
  - macro_namespace: dbtvault
    search_order: ['my_project', 'dbtvault']  # enable override

Notes

This change is in dbt v0.19.2 as well. Both v0.19.2 and v0.20.0 have backwards compatibility for the old syntax, so there's no huge rush to upgrade.

However:

  • The old syntax will start raising a deprecation warning in v0.20.
  • You've named your override var adapter_packages, rather than the convention of dbtvault_dispatch_list, so we don't have backwards compatibility for checking that variable in dbt v0.19.2 + v0.20.0 as of now. Do you know if folks are relying on this today? If so, we can add (hard-coded) logic to check a var named adapter_packages IFF the dispatching package is named dbtvault.

As soon as you do upgrade to the new syntax, you'll need to require dbt >= 0.19.2 (or just >=0.20.0, for simplicity, since you're already making compatibility changes in #69).

See also:

[FEATURE] Make transactional link payload optional

The transactional link payload parameter should be optional to facilitate variances in modelling.

Sometimes we may want to have a very minimal transactional link which simply keeps track of all transactions, with the payload data contained in one or more satellites. This is especially useful if a transaction contains many attributes which could be grouped into different categories, and therefore moved out into separate satellites for each.

In other scenarios, we may have only a few columns which form the payload, and we would want to include these in the transactional link itself.
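
For illustration, with an optional payload a minimal transactional link model could then be as simple as the sketch below (the column and model names are assumptions, the parameter names follow the existing t_link macro, and passing no payload is the requested behaviour rather than current functionality):

{{ config(materialized='incremental') }}

{%- set source_model = "stg_transactions" -%}
{%- set src_pk = "TRANSACTION_HK" -%}
{%- set src_fk = ["CUSTOMER_HK", "ORDER_HK"] -%}
{%- set src_eff = "EFFECTIVE_FROM" -%}
{%- set src_ldts = "LOAD_DATETIME" -%}
{%- set src_source = "RECORD_SOURCE" -%}

{{ dbtvault.t_link(src_pk=src_pk, src_fk=src_fk, src_payload=none,
                   src_eff=src_eff, src_ldts=src_ldts,
                   src_source=src_source, source_model=source_model) }}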

[FEATURE] Possibility to pass the hashing algorithm as a parameter

Microsoft databases like Synapse, SQL Server and Azure SQL use a hashing function called HASHBYTES instead of MD5_BINARY.

I would like to be able to give the algorithm as a parameter, the same way as the hash type, concat_string and null_placeholder_string.

I tried to create a function called md5_binary in Synapse that would redirect to the HASHBYTES built-in function, but it did not seem to be possible. That is because built-in functions have no schema name, and it is not possible to create other kinds of functions without a schema.

The best place to make these modifications is (I suspect) the macros/supporting/hash.sql script. It probably needs a separate format clause, because the HASHBYTES function requires an additional input parameter which specifies the hash type (https://docs.microsoft.com/en-us/sql/t-sql/functions/hashbytes-transact-sql?view=sql-server-ver15).

{#- Select hashing algorithm -#}
{%- if hash == 'MD5' -%}
    {%- set hash_alg = 'MD5_BINARY' -%}
    {%- set hash_size = 16 -%}
{%- elif hash == 'SHA' -%}
    {%- set hash_alg = 'SHA2_BINARY' -%}
    {%- set hash_size = 32 -%}
{%- else -%}
    {%- set hash_alg = 'MD5_BINARY' -%}
    {%- set hash_size = 16 -%}
{%- endif -%}

{%- set standardise = "NULLIF(UPPER(TRIM(CAST([EXPRESSION] AS VARCHAR))), '')" %}

{#- Alpha sort columns before hashing if a hashdiff -#}
{%- if is_hashdiff and dbtvault.is_list(columns) -%}
    {%- set columns = columns|sort -%}
{%- endif -%}

{#- If single column to hash -#}
{%- if columns is string -%}
    {%- set column_str = dbtvault.as_constant(columns) -%}
    {{- "CAST(({}({})) AS BINARY({})) AS {}".format(hash_alg, standardise | replace('[EXPRESSION]', column_str), hash_size, alias) | indent(4) -}}
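
For reference, the generated-SQL example from the SQL Server issue later on this page shows the shape that a HASHBYTES-based implementation would need to produce, roughly:

CAST(HASHBYTES('MD5', NULLIF(UPPER(TRIM(CAST([EXPRESSION] AS VARCHAR(max)))), '')) AS BINARY(16)) AS [ALIAS]

so rather than only swapping the function name in hash_alg, the macro would also need to emit the algorithm name ('MD5' or 'SHA2_256') as the first argument to HASHBYTES, as suggested above.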

[FEATURE] Requesting usable adapter dispatch macros

With the release of the new adapter.dispatch macro, I think you can accelerate the adoption of dbtvault across other adapters (unofficially). I think you may agree that official support for N adapters is a massive undertaking, but we can still make devs' lives better with some clever hooks.

When I was working with my BigQuery clone, I had to essentially clone the entire codebase and modify the parts that I wanted, because there was no way to properly tell dbt to look in my package first, and THEN in dbtvault, for implementations of certain macros.

Now that the new adapter.dispatch macro is out, we have some options to play with!

Option 1: Call dbtvault macros from adapter packages

As long as all dbtvault macros are using the dispatch syntax, I can do

adapter.dispatch('link', packages=['dbtvault'])

whenever I just want to use your implementation and just have macros for all my overrides.

The caveat is that I think I'd still have to copy all the macro signatures and dispatch it to dbtvault through mypackage.someMacro()

Option 2: Have dbtvault call out to adapter packages

Here's where it gets interesting.

What if we could just define a variable that you can pass into dbtvault, telling it which adapter packages to look in for implementations?
Provided you add the dispatch syntax to everything you can do something like
adapter.dispatch('hash', packages=var('adapter_packages', ['dbtvault']))(...args)

This way, I don't have to copy everything in my adapter package and only have to implement the things I'm overwriting.

There is a chance I'm forgetting/misunderstanding some mechanics here so take it with a grain of salt, but I'd love to discuss this further

[BUG] dbtvault generated wrong sql

Describe the bug
A clear and concise description of what the bug is.

Versions

dbt: 1.0.4
dbtvault: 0.8.2

To Reproduce
Steps to reproduce the behavior:

  1. Go to: dbtvault on SQL Server 2019
  2. Run dbt run
  3. Scroll down to log
  4. See error

Expected behavior
dbtvault generated invalid SQL: a WITH nested inside another WITH is not supported in SQL Server.

Screenshots
image

WITH staging AS (
-- Generated by dbtvault.


WITH source_data AS (

    SELECT

    "CustomerKey",
    "GeographyKey",
    "CustomerLabel",
    "Title",
    "FirstName",
    "MiddleName",
    "LastName",
    "NameStyle",
    "BirthDate",
    "MaritalStatus",
    "Suffix",
    "Gender",
    "EmailAddress",
    "YearlyIncome",
    "TotalChildren",
    "NumberChildrenAtHome",
    "Education",
    "Occupation",
    "HouseOwnerFlag",
    "NumberCarsOwned",
    "AddressLine1",
    "AddressLine2",
    "Phone",
    "DateFirstPurchase",
    "CustomerType",
    "CompanyName",
    "ETLLoadID",
    "LoadDate",
    "UpdateDate"

    FROM "dev"."dbo"."RAW__Contoso__dbo__Customer"
),

derived_columns AS (

    SELECT

    "CustomerKey",
    "GeographyKey",
    "CustomerLabel",
    "Title",
    "FirstName",
    "MiddleName",
    "LastName",
    "NameStyle",
    "BirthDate",
    "MaritalStatus",
    "Suffix",
    "Gender",
    "EmailAddress",
    "YearlyIncome",
    "TotalChildren",
    "NumberChildrenAtHome",
    "Education",
    "Occupation",
    "HouseOwnerFlag",
    "NumberCarsOwned",
    "AddressLine1",
    "AddressLine2",
    "Phone",
    "DateFirstPurchase",
    "CustomerType",
    "CompanyName",
    "ETLLoadID",
    "LoadDate",
    "UpdateDate",
    'Contoso.dbo.Customer' AS "RECORD_SOURCE",
    "CustomerKey" AS "Customer__Id"

    FROM source_data
),

hashed_columns AS (

    SELECT

    "CustomerKey",
    "GeographyKey",
    "CustomerLabel",
    "Title",
    "FirstName",
    "MiddleName",
    "LastName",
    "NameStyle",
    "BirthDate",
    "MaritalStatus",
    "Suffix",
    "Gender",
    "EmailAddress",
    "YearlyIncome",
    "TotalChildren",
    "NumberChildrenAtHome",
    "Education",
    "Occupation",
    "HouseOwnerFlag",
    "NumberCarsOwned",
    "AddressLine1",
    "AddressLine2",
    "Phone",
    "DateFirstPurchase",
    "CustomerType",
    "CompanyName",
    "ETLLoadID",
    "LoadDate",
    "UpdateDate",
    "RECORD_SOURCE",
    "Customer__Id",

    CAST(HASHBYTES('MD5', NULLIF(UPPER(TRIM(CAST("CustomerKey" AS VARCHAR(max)))), '')) AS BINARY(16)) AS "Customer__Hash_Key",
    CAST(HASHBYTES('MD5', (CONCAT_WS('||',
        ISNULL(NULLIF(UPPER(TRIM(CAST("CustomerKey" AS VARCHAR(max)))), ''), '^^'),
        ISNULL(NULLIF(UPPER(TRIM(CAST("FirstName" AS VARCHAR(max)))), ''), '^^'),
        ISNULL(NULLIF(UPPER(TRIM(CAST("Title" AS VARCHAR(max)))), ''), '^^')
    ))) AS BINARY(16)) AS "Customer__Hash_Diff"

    FROM derived_columns
),

columns_to_select AS (

    SELECT

    "CustomerKey",
    "GeographyKey",
    "CustomerLabel",
    "Title",
    "FirstName",
    "MiddleName",
    "LastName",
    "NameStyle",
    "BirthDate",
    "MaritalStatus",
    "Suffix",
    "Gender",
    "EmailAddress",
    "YearlyIncome",
    "TotalChildren",
    "NumberChildrenAtHome",
    "Education",
    "Occupation",
    "HouseOwnerFlag",
    "NumberCarsOwned",
    "AddressLine1",
    "AddressLine2",
    "Phone",
    "DateFirstPurchase",
    "CustomerType",
    "CompanyName",
    "ETLLoadID",
    "LoadDate",
    "UpdateDate",
    "RECORD_SOURCE",
    "Customer__Id",
    "Customer__Hash_Key",
    "Customer__Hash_Diff"

    FROM hashed_columns
)

SELECT * FROM columns_to_select
                  )

 
SELECT *
FROM staging

[BUG] vault_insert_by_period materialisation crashes when stop_date is omitted

Describe the bug

The SQL generated from the vault_insert_by_period materialisation when the stop_date configuration field is omitted is invalid and causes an error on Snowflake.

Versions

dbt: 0.18.1
dbtvault: 0.7.1

To Reproduce

Given the following dbt model:

{{
    config(
        schema='kr',
        alias='testing',
        materialized='vault_insert_by_period',
        period='day',
        timestamp_field='created',
        start_date='2021-01-10'
    )
}}

select *
from {{ ref('<some_model>') }}

where
    __PERIOD_FILTER__

The base load works fine, but the next incremental load gives me the following error:

Database Error in model mart_test (models/mart/kr/mart_test.sql)
  100035 (22007): Timestamp 'None' is not recognized
  compiled SQL at target/compiled/voi_snowflake/models/mart/kr/mart_test.sql

The error originates from the following line in the generated SQL:

...
    dateadd(
        millisecond,
        86399999,
        nullif('None','')::timestamp
        )
...

Probably from https://github.com/Datavault-UK/dbtvault/blob/3a627052a4bb721d36359ce61218ee222506bdd8/macros/materialisations/helpers.sql#L82.

Expected behavior

The documentation states that omitting stop_date should work (and make the materialisation use today as a stop date).
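
Until this is fixed, one workaround may be to set stop_date explicitly in the model config (a sketch based on the model above; the stop_date value here is arbitrary):

{{
    config(
        schema='kr',
        alias='testing',
        materialized='vault_insert_by_period',
        period='day',
        timestamp_field='created',
        start_date='2021-01-10',
        stop_date='2021-02-10'
    )
}}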

[BUG] Multi-column PKs not working as intended

Describe the bug
With the following configuration:

src_pk: 
    - 'CUSTOMER_ID'
    - 'CUSTOMER_EMAIL' 

and the following Jinja:

ON {{ dbtvault.prefix([src_pk], 'stg') }} = {{ dbtvault.prefix([src_pk], 'tgt') }}

We would end up with the following SQL:

ON stg.CUSTOMER_ID, stg.CUSTOMER_EMAIL = tgt.CUSTOMER_ID, tgt.CUSTOMER_EMAIL

Obviously this is a syntax error and the intended result is:

ON stg.CUSTOMER_ID = tgt.CUSTOMER_ID AND stg.CUSTOMER_EMAIL = tgt.CUSTOMER_EMAIL

This is possible with the multi_key() macro, which should have been used in all cases where this pattern is necessary.

Next steps and alternatives

We will be fixing this as a matter of priority (ideally in the next release)

For now, you can create a derived_column instead, which concatenates the two columns explicitly, in your stage:

...
source_model: "raw_source"
derived_columns:
  CUSTOMER_KEY: 
    - "CUSTOMER_ID"
    - "CUSTOMER_EMAIL"
...

[FEATURE] Redshift Support

Hello,

I would like to evaluate dbtvault in an AWS Redshift environment. Since Redshift support is part of the roadmap, I was wondering if you already have some idea of what is necessary to get dbtvault to run on Redshift, or maybe have already started working on this.

Depending on the effort, we might be able to help achieve AWS Redshift support.

Best regards,
Ingo

AB#5377

[FEATURE] Databricks support?

Hi,

Thank you for creating dbtvault in the first place; it looks like a promising way to implement Data Vault 2.0. My company is using Databricks on the backend, and I am wondering if there is going to be support for Databricks in the future.

Ethan

[BUG] Effectivity Satellite macro not creating correct start and end dates

Describe the bug
When creating an effectivity satellite I would expect proper start dates and end dates in the data. The current solution seems to have a bug when the data is flip-flopping.
The problem is already visible in the test documentation:
https://github.com/Datavault-UK/dbtvault/blob/develop/test/features/eff_sats/eff_sat_auto_end_dating_detail_inc.feature
line 114 ff.

After the customer has changed 3 times, the final output for LOAD_DATETIME=2018-06-01 18:00:00.000 is wrong:

ORDER_CUSTOMER_PK CUSTOMER_PK ORDER_PK START_DATE END_DATE EFFECTIVE_FROM LOAD_DATETIME SOURCE
md5('1002||100') md5('1002') md5('100') 2018-06-01 09:00:00.000 2018-06-01 18:00:00.000 2018-06-01 18:00:00.000 2018-06-01 18:00:00.000 *
md5('1001||100') md5('1001') md5('100') 2018-06-01 00:00:00.000 9999-12-31 23:59:59.999 2018-06-01 18:00:00.000 2018-06-01 18:00:00.000 *

(compare to line 160/161 in the documentation)

--> e.g. both rows would be valid for the time period between 09:00 and 18:00.

Versions

dbt: 1.0.4
dbtvault: 0.8.2

To Reproduce
As the problem is already visible in your tests, I'll skip that part!

Expected behavior
The correct output for the last load_datetime should be:

ORDER_CUSTOMER_PK CUSTOMER_PK ORDER_PK START_DATE END_DATE EFFECTIVE_FROM LOAD_DATETIME SOURCE
md5('1001||100') md5('1001') md5('100') 2018-06-01 00:00:00.000 2018-06-01 09:00:00.000 2018-06-01 00:00:00.000 2018-06-01 18:00:00.000 *
md5('1002||100') md5('1002') md5('100') 2018-06-01 09:00:00.000 2018-06-01 18:00:00.000 2018-06-01 18:00:00.000 2018-06-01 18:00:00.000 *
md5('1001||100') md5('1001') md5('100') 2018-06-01 18:00:00.000 9999-12-31 23:59:59.999 2018-06-01 18:00:00.000 2018-06-01 18:00:00.000 *

In words, for order 100:
1001 is valid from 00:00 to 09:00, then
1002 is valid from 09:00 to 18:00, then
1001 is valid from 18:00 until the end of time.

Additional context

As far as I can see, one problem would be fixed by changing what the eff_sat macro creates:
FROM:

new_reopened_records AS (
    SELECT DISTINCT
        lc."L_PLANT_CUSTOMER_PK",
        lc."PLANT_HK", lc."CUSTOMER_HK",
        **lc.**"STARTDATE" AS "STARTDATE",
        g."ENDDATE" AS "ENDDATE",
        g."DV_APPTS" AS "DV_APPTS",
        g."DV_LDTS",
        g."DV_RECSOURCE"
    FROM source_data AS g
    INNER JOIN latest_closed AS lc
    ON g."L_PLANT_CUSTOMER_PK" = lc."L_PLANT_CUSTOMER_PK"
    WHERE TO_DATE(g."ENDDATE") = TO_DATE('9999-12-31 23:59:59.999999')

TO:

new_reopened_records AS (
    SELECT DISTINCT
        lc."L_PLANT_CUSTOMER_PK",
        lc."PLANT_HK", lc."CUSTOMER_HK",
        **g.**"STARTDATE" AS "STARTDATE",
        g."ENDDATE" AS "ENDDATE",
        g."DV_APPTS" AS "DV_APPTS",
        g."DV_LDTS",
        g."DV_RECSOURCE"
    FROM source_data AS g
    INNER JOIN latest_closed AS lc
    ON g."L_PLANT_CUSTOMER_PK" = lc."L_PLANT_CUSTOMER_PK"
    WHERE TO_DATE(g."ENDDATE") = TO_DATE('9999-12-31 23:59:59.999999')

Additionally, the macro has to be enriched with a CTE that returns old, already-closed data, as in the first row shown below:

ORDER_CUSTOMER_PK CUSTOMER_PK ORDER_PK START_DATE END_DATE EFFECTIVE_FROM LOAD_DATETIME SOURCE
md5('1001||100') md5('1001') md5('100') 2018-06-01 00:00:00.000 2018-06-01 09:00:00.000 2018-06-01 00:00:00.000 2018-06-01 18:00:00.000 *

[BUG] Null handling

Describe the bug
In the last few days we have identified a number of bugs around how NULLs are handled, reported by a client. We do have tests for these in our test suite which were passing, but due to a very subtle bug in our test suite, the tests were passing when they should not have been. The test suite has now been fixed and we have fixed the NULL issues. We will be releasing a public fix for this shortly (this week or early next week). You may grab the latest version of the code (unreleased) from the following branch of our development repository:

https://github.com/Datavault-UK/dbtvault-dev/tree/int/0.7.4

Alternatively, as mentioned, there will be a short wait for the public release.

The effect of these bugs is that records are loaded into structures even though their respective PK or FKs contain NULL values. This is not desired behaviour as per Data Vault 2.0 standards: NULL records mean nothing to the business.

Standard practice is to replace NULLs with tokens or codes to signify whether they are mandatory and missing, optional and missing, or just missing (typically 0000..., 1111..., 2222...). We will be developing documentation on these practices and how dbtvault handles NULLs, to allow users to make more informed decisions and understand how and why NULLs are handled as they are.
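
For illustration, one way to apply such a token is in the model that feeds the stage layer, before hashing (a minimal sketch; the model name, column names, and the '0000...' token are assumptions, not a dbtvault feature):

SELECT
    COALESCE(CUSTOMER_ID, '0000000000000000') AS CUSTOMER_ID,  -- mandatory key, missing
    CUSTOMER_NAME,
    LOAD_DATETIME,
    RECORD_SOURCE
FROM {{ ref('raw_customer') }}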

Please reach out to us on Slack or via email if you have been affected by this bug in production, and we can offer guidance on fixing this. In most cases this is not a significant cause for concern, and Data Vault can support these fixes without much headache.

Versions

dbt: v0.19.0
dbtvault: v0.7.3

Affected structures

  • Hubs
  • Links
  • T-Links
  • Satellites

[BUG] Binary value too long and would be truncated

Describe the bug
From the worked example I get errors when creating the links and the sats; the hubs are fine.

To Reproduce
Steps to reproduce the behavior:

  1. Following https://dbtvault.readthedocs.io/en/latest/loading/
  2. dbt run --models tag:link
  3. See my screenshot to see the error

Screenshots
If applicable, add screenshots to help explain your problem.
image

I'm using dbt 0.15 and dbtvault v0.4.
I chose SHA as the hash; maybe it is related? I guess not, since the hubs are OK...

[BUG] stage exclude_columns upper case

Describe the bug
The dbtvault.stage macro only takes into account column names listed in upper case and disregards lower-case ones when used with exclude_columns: true.

Versions

dbt: 1.0.1
dbtvault: 0.7.9

To Reproduce
Steps to reproduce the behavior:

  1. Set up a stage file
{%- set yaml_metadata -%}
source_model: 'dv_raw_country'
hashed_columns:
    country_hkey:
        - country_key
    country_hashdiff:
      is_hashdiff: true
      exclude_columns: true
      columns:
        - _fivetran_synced
  2. Run the model:
    dbt run -s dv_stg_country

  3. Check the generated code (target/run/.../dv_stg_country.sql); you'll see that _fivetran_synced is listed among the fields that feed into country_hashdiff.

This is bad because all records will be added to the SAT at every run.

However, if I convert the column names to upper case, they are excluded from the hashdiff as expected.
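
For reference, the workaround looks like this (same stage file as above, with the excluded column name upper-cased; presumably only the names listed under columns need this treatment):

    country_hashdiff:
      is_hashdiff: true
      exclude_columns: true
      columns:
        - _FIVETRAN_SYNCED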

Expected behavior
I'd like to be able to list the columns i want to exclude from the hashdiff in lower case.

[BUG] Staging includes duplicate or missing columns in certain scenarios

  1. When using the stage macro, if hashed columns have the same name as a source column, a DUPLICATE COLUMN SQL error will be raised.

The expected behaviour is that the hashed_column definition will take precedence over the original source column, and the source column will not be present in the final stage table.

  2. Additionally, if include_source_columns = true and a derived_column is defined which derives its value from a source column, then the source column will not be included in the final stage layer.

The expected behaviour is that both the original column and the new aliased column will be present in the final stage layer.

We plan to fix this for the next release.

[BUG] dbt_utils dependency not made explicit

Describe the bug
Correct me if I'm wrong but it seems like your latest version depends on dbt_utils for some helpers.

It seems, however, that nowhere in the codebase can I find a reference to a packages.yml to install it.

Won't this cause an issue if I choose not to bring in dbt_utils as a dependency to my project?

Versions

dbt: N/A
dbtvault: 0.7.0

Expected behavior
I expected to have dbt_utils as a hard dependency of dbtvault in a packages.yml.

Screenshots
image

General-Transaction link template

The transactional link template stores all the transactions from a source table as a transactional link. If the transactions come from multiple sources with the same unit of work and the same set of columns, I assume that as per DV standards it would just be one link, with data from all the sources added to it. If this is the case, does the dbtvault template support multiple sources? It looks like it only supports one source table. Appreciate your help.

[BUG] ma_sat macro will insert duplicated records when rerunning `dbt run`

Describe the bug
For records with the same src_pk and multiple different values of src_cdk, when you run dbt run -m ${model_name} a second time, it will still insert the same records into the ma_sat table.

Versions

dbt: 0.20.0
dbtvault: 0.7.6

To Reproduce
Steps to reproduce the behavior:

  1. Set up a staging table with the same src_pk and two different values of src_cdk
  2. Run dbt run -m ${model_name} twice

Expected behavior
In the second dbt run command, it shouldn't insert any records, as no new data has come in.

Screenshots
Screenshot from 2021-08-20 17-13-28

Additional context

I think this should be a bug in dbtvault.ma_sat, https://github.com/Datavault-UK/dbtvault/blob/master/macros/tables/ma_sat.sql#L54

    {# Select latest records from satellite together with count of distinct hashdiffs for each hashkey #}
    latest_records AS (
        SELECT *, COUNT(DISTINCT {{ dbtvault.prefix([src_hashdiff], 'latest_selection') }}, {{ dbtvault.prefix(cdk_cols, 'latest_selection') }} )
                OVER (PARTITION BY {{ dbtvault.prefix([src_pk], 'latest_selection') }}) AS target_count
        FROM (
            SELECT {{ dbtvault.prefix(cdk_cols, 'target_records', alias_target='target') }}, {{ dbtvault.prefix(rank_cols, 'target_records', alias_target='target') }}
                ,RANK() OVER (PARTITION BY {{ dbtvault.prefix([src_pk], 'target_records') }}
                        ORDER BY {{ dbtvault.prefix([src_ldts], 'target_records') }} DESC) AS rank_value
            FROM {{ this }} AS target_records
            INNER JOIN
                (SELECT DISTINCT {{ dbtvault.prefix([src_pk], 'source_pks') }}
                FROM source_data AS source_pks) AS source_records
                    ON {{ dbtvault.prefix([src_pk], 'target_records') }} = {{ dbtvault.prefix([src_pk], 'source_records') }}
            QUALIFY rank_value = 1 
            ) AS latest_selection
    ),
    

When it gets the latest records, it partitions only by src_pk - RANK() OVER (PARTITION BY {{ dbtvault.prefix([src_pk], 'target_records') }} ORDER BY {{ dbtvault.prefix([src_ldts], 'target_records') }} DESC) AS rank_value - rather than by PARTITION BY {{ dbtvault.prefix([src_pk], 'target_records') }}, {{ dbtvault.prefix(cdk_cols, 'target_records') }}, which will return just one record even for records with the same src_pk and multiple CDKs (child dependent keys).

This then makes the comparison at https://github.com/Datavault-UK/dbtvault/blob/master/macros/tables/ma_sat.sql#L89 (stage.source_count != latest_records.target_count) evaluate as not equal, and the same records are reinserted.
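
A sketch of the change suggested above (adding the child dependent keys to the rank's partition; not a confirmed fix):

,RANK() OVER (PARTITION BY {{ dbtvault.prefix([src_pk], 'target_records') }}, {{ dbtvault.prefix(cdk_cols, 'target_records') }}
        ORDER BY {{ dbtvault.prefix([src_ldts], 'target_records') }} DESC) AS rank_value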

[FEATURE] PostgreSQL Support

I see that at the moment PostgreSQL is not supported by dbtvault.

I read somewhere that there is a community effort going on?

Does someone know who is involved and what the milestones are?

[BUG] NULL Handling

Describe the bug
We modified how NULLS are handled for single column hashes in #3. We now need to revert this change because it causes unpredictability when it comes to hashing consistency. This also causes problems because NULL values become 'hidden' behind a hash of '^^', causing NULL values to sneak in where they are unwanted (in Hubs, for example).

For Hashdiffs we want this behaviour, but not in primary keys.

To Reproduce
Steps to reproduce the behaviour:

  1. Use the hash or multi-hash macro on a single string

Expected behaviour
When I hash multiple columns, I expected the following:

Values: 1001, 1002, NULL, 1003
Hashed outcome: #1001 || #1002 || #^^ || #1003

When I hash a single column (not a HASHDIFF) I expect the following:

Value: NULL
Hashed outcome: NULL

When I hash a single column (and it is a HASHDIFF) I expect the following:

Value: NULL
Hashed outcome: #^^

This will be fixed in the next release, 0.6.

[FEATURE] Allow hashed_columns to see derived columns to allow concatenated business keys

I've come across a few instances where I need to hash a derived column in my staging area. Sometimes this is because I simply want to append a manual string to my BK; other times I want to rename something in my staging for use down the line. But I don't have access to it in the hashed_columns section, so I have to write SQL or use the field name I'm renaming (or be forced to create an extra model to feed the stage macro).

For example, I have to do the following:

          hstg_sfdc__lead_core:
            vars:
              source_model:
                salesforce: Lead
              hashed_columns:
                HUB_FUNNEL_HK:
                  - "(SELECT 'sfdc.Lead' AS SCOPE_BK)" # SCOPE_BK
                  - Website # DOMAIN_BK        
              derived_columns:
                SCOPE_BK: '!sfdc.Lead'
                DOMAIN_BK: Website

What I'd like to do:

          hstg_sfdc__lead_core:
            vars:
              source_model:
                salesforce: Lead
              hashed_columns:
                HUB_FUNNEL_HK:
                  - SCOPE_BK
                  - DOMAIN_BK
              derived_columns:
                SCOPE_BK: '!sfdc.Lead'
                DOMAIN_BK: Website

Additionally, I'm finding that building an enterprise-wide unique natural key is hilariously difficult. For example, imagine I have an HR system, a contractor system, and a partner company system, and I want to represent a single concept of Employee.

image

As you can see the unique keys here are different in each case, so my Business Key has to be a concatenation of multiple keys in the source system. This is what is referred to as src_nk in your hub macro.

In the above example, if I had a 2-part key, passing both keys as src_nk to the hub macro would result in 2 columns, SCOPE_BK and DOMAIN_BK; however, I'm only after a single concatenated column.

It would be amazing if dbtvault could support this concatenation as a first-class concept (using the same transformation built into the hash function)

In this case, I might simply have, in each of the staging configs that feed into the hub, something like:

          hstg_sfdc__lead_core:
            vars:
              source_model:
                salesforce: Lead
              hashed_columns:
                HUB_FUNNEL_HK: HUB_FUNNEL_BK
              derived_columns:
                HUB_FUNNEL_BK: 
                  - '!sfdc.Lead'
                  -  Website

This would generate a table that looks like this:

HUB_FUNNEL_HK HUB_FUNNEL_BK
98a7sd8a9s7da9fgag7a98a7sda sfdc.Lead||abc.com
98a7sd8a9gsdf3ggag7a98a7sda sfdc.Lead||def.com
98a7sd8a9s7da9fgag7a98a123a sfdc.Lead||efg.com

Currently, I don't even have access to the hash concatenation logic by itself, because it's part of the hash macro.

Anyway, I know this is kinda multiple issues, but I wanted to put it out there to see if others felt similar pains.

Of course, as with everything else in dbt, you can work around this by adding a separate model before your dbtvault one, but I find I'm doing this all the time, and it's something that could easily be factored out.

[BUG] Column Escaping Doesn't Work with `NOT` Well

Describe the bug
dbtvault now wraps all columns in quotes, but NOT ends up inside the quotes as well.

Versions

dbt: 1.0.4
dbtvault: 0.8.2

To Reproduce

 {{
     dbtvault.stage(
         source_model = "source_model",
         hashed_columns = {
             "hk_a": "id",
             "a_hashdiff": {
                 "is_hashdiff": true,
                 "columns": [
                     "a",
                 ]
             }
         },
         derived_columns = {
             "derived_column_a": "not column_a",
         },
     )
 }}

The compiled SQL would be like:

derived_columns AS (
    SELECT

    "not column_a" AS "eff_flag",

   FROM source_data
),

Expected behavior
The not should be outside the quotes like:

derived_columns AS (
    SELECT

    not "column_a" AS "eff_flag",

   FROM source_data
),

Screenshots
If applicable, add screenshots to help explain your problem.

Log files
If applicable, provide dbt log files which include the problem.

Additional context
Add any other context about the problem here.

[BUG] XTS Satellite Name column

Describe the bug
In the dbtvault.xts macro, the satellite name is being materialised as a column reference instead of constant text. The source_model does not contain a column named after the satellite, yet the macro renders it as s.{{ satellite_name }} AS SATELLITE_NAME.

Versions

dbt: 1.0.0+
dbtvault: 0.7.9

To Reproduce
Steps to reproduce the behavior:

  1. Create an xts.sql using this code
{{ config(materialized='incremental')         }}

{%- set source_model = 'my_dv_stage' -%}
{%- set src_pk = "CUSTOMER_HK" -%}
{%- set src_satellite = {"SATELLITE_CUSTOMER": {"sat_name": {"SATELLITE_NAME": "sat_customer_details"}, "hashdiff": {"HASHDIFF": "CUSTOMER_HASHDIFF"}}} -%}
{%- set src_ldts = "LOAD_DATETIME" -%}
{%- set src_source = "RECORD_SOURCE" -%}

{{ dbtvault.xts(src_pk=src_pk, src_satellite=src_satellite, src_ldts=src_ldts,
                src_source=src_source, source_model=source_model) }}

  2. Click Compile, which gives this
-- Generated by dbtvault.

WITH 

        satellite_sat_customer_details_from_my_dv_stage AS (
            SELECT s.CUSTOMER_HK, s.CUSTOMER_HASHDIFF AS HASHDIFF, s.sat_customer_details AS SATELLITE_NAME, s.LOAD_DATETIME, s.RECORD_SOURCE
            FROM DEV001_ODS.dv.stg_ifs_customer_details AS s
            WHERE s.CUSTOMER_HK IS NOT NULL
        ),

union_satellites AS (
    SELECT * FROM satellite_sat_customer_details_from_my_dv_stage
),

records_to_insert AS (
    SELECT DISTINCT union_satellites.* FROM union_satellites
)

SELECT * FROM records_to_insert
limit 500
/* limit added automatically by dbt cloud */
  3. Click Preview and then get this error
Server error: Database Error in rpc request (from remote system)
000904 (42000): SQL compilation error: error line 4 at position 67
invalid identifier 'S.SAT_CUSTOMER_DETAILS'

Expected behavior
The satellite name ought to be included as a string literal in the SELECT statement.

-- Generated by dbtvault.

WITH 

        satellite_sat_customer_details_from_my_dv_stage AS (
            SELECT s.CUSTOMER_HK, s.CUSTOMER_HASHDIFF AS HASHDIFF, 'sat_customer_details' AS SATELLITE_NAME, s.LOAD_DATETIME, s.RECORD_SOURCE
            FROM DEV001_ODS.dv.stg_ifs_customer_details AS s
            WHERE s.CUSTOMER_HK IS NOT NULL
        ),

union_satellites AS (
    SELECT * FROM satellite_sat_customer_details_from_my_dv_stage
),

records_to_insert AS (
    SELECT DISTINCT union_satellites.* FROM union_satellites
)

SELECT * FROM records_to_insert
limit 500
/* limit added automatically by dbt cloud */

[BUG] Ephemeral model as source to staging.

As of now, it is not possible to use an ephemeral model as the source for the staging macro.

It would be nice to be able to create ephemeral models and then use them as sources for the staging macro.

In the ephemeral model we can put data together in a "raw" format from multiple tables, then use that to feed the raw vault.
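
For illustration, the desired setup would look something like this sketch (model and source names are assumptions; today the stage macro cannot resolve the columns of the ephemeral model):

-- models/raw/raw_customer_combined.sql
{{ config(materialized='ephemeral') }}

SELECT a.CUSTOMER_ID, a.CUSTOMER_NAME, b.CUSTOMER_PHONE
FROM {{ source('crm', 'customer') }} AS a
LEFT JOIN {{ source('crm', 'customer_contact') }} AS b
    ON a.CUSTOMER_ID = b.CUSTOMER_ID

-- models/stage/stg_customer.sql
{{ dbtvault.stage(include_source_columns=true,
                  source_model='raw_customer_combined',
                  derived_columns=none,
                  hashed_columns=none,
                  ranked_columns=none) }}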

[BUG] Doesn't support columns with space in name

Describe the bug
We are sourcing data from many different source systems.
Some of these have column names with spaces.
Since dbtvault doesn't wrap column names in "", columns with spaces will cause syntax errors.

Not sure if this has already been brought up in #23, but I don't think I got where the discussion ended.
I guess column names with spaces are bad practice, but the source systems are out of our reach, and we would prefer not having to rename all the columns at ingestion.

Versions

dbt: 0.19.1
dbtvault: 0.7.4

To Reproduce
Steps to reproduce the behavior:
  1. Have a table that has columns with spaces
image

  2. Use e.g. the stage macro on this table
{%- set yaml_metadata -%}
source_model:
    EDW: 'TEST'
derived_columns:
    SOURCE: "!test"
    LOAD_DATE: "CURRENT_TIMESTAMP()"
hashed_columns:
    TEST_PK: "COLUMN1"
    TEST_HASHDIFF:
        - '"COLUMN WITH SPACE"'

{%- endset -%}

{% set metadata_dict = fromyaml(yaml_metadata) %}

{% set source_model = metadata_dict['source_model'] %}

{% set derived_columns = metadata_dict['derived_columns'] %}

{% set hashed_columns = metadata_dict['hashed_columns'] %}

{{ dbtvault.stage(include_source_columns=true,
                  source_model=source_model,
                  derived_columns=derived_columns,
                  hashed_columns=hashed_columns,
                  ranked_columns=none) }}
  3. The generated SQL will not be valid

image

Expected behavior
Column names with spaces are handled gracefully

[BUG] stage macro does not support columns with reserved keywords as names

Describe the bug
Let's say your source contains a column called "order". If you try to run the stage macro with that as a hashed column (or even simply with include_source_columns = true), it will fail:

dbtvault_bq.stage(
    include_source_columns=true,
    source_model={ 'outreach': 'stages'},
    hashed_columns={
      'SAT_HASH_DIFF': {
        'hashdiff': true,
        'columns': [
          'name'
          ,'order'
        ]
      }
    },
...
  )

Versions

dbt: 0.18.1
dbtvault: (BQ fork, a.k.a. 0.7.0); however, looking at the code, it should be present in the latest version as well

Expected behavior
I would expect all queried columns to be qualified with the table name.

Screenshots
image

'dbtvault' undefined error

Describe the bug
We are using dbtvault with the BigQuery adapter. When we execute the hub mapping as shown below, the system gives an error saying "dbtvault" is undefined. This was working correctly in dbt 0.21.0, but when I upgraded to dbt 1.0.0 it stopped working.

{{ dbtvault.hub(var_yaml.src_pk, var_yaml.src_nk, var_yaml.src_ldts, var_yaml.src_source, var_yaml.source_model )}}

Versions

dbt: 1.0.1
dbt-bigquery: 1.0.1

Log files
Compilation Error in model HUB_STUDY (models\load\hubs\HUB_STUDY.sql)
'dbtvault' is undefined. This can happen when calling a macro that does not exist. Check for typos and/or install package dependencies with "dbt deps".

Choosing a hashing algorithm in dbtvault: hash or lossless

Hi,
In https://dbtvault.readthedocs.io/en/v0.6/best_practices/#hashing-best-practices you recommend using MD5 or SHA-256 as the surrogate key generator for primary keys. I think that a lossless generator would be a better choice.

dbt_utils.surrogate_key uses the MD5 hash function in Snowflake to generate a surrogate primary key from a natural key by default. Hash algorithms are lossy: there is a small probability that hash collisions occur, so that two different natural keys get the same hash value. The risk can be reduced by using the hash function SHA-256 or even SHA-512, but there is still a very, very small remaining probability of hash collisions. SHA-256 serves cryptographic purposes, so it takes 64 rounds to calculate the hash value and uses quite a lot of CPU time compared to a simple lossless function, e.g. Huffman coding (https://en.wikipedia.org/wiki/Huffman_coding). So an alternative is:

  1. Concatenate all columns of the natural key.
  2. Use a simple lossless compression algorithm.

Snowflake supports SNAPPY, ZLIB, ZSTD and BZ2 (https://docs.snowflake.com/en/sql-reference/functions/compress.html).
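
For comparison, the two approaches on Snowflake would look roughly like this sketch (placeholder table and column names, assumed to be VARCHAR; the lossless variant produces a variable-length BINARY key):

SELECT
    CAST(MD5_BINARY(CONCAT_WS('||', CUSTOMER_ID, CUSTOMER_NAME)) AS BINARY(16)) AS customer_key_hashed,
    COMPRESS(CONCAT_WS('||', CUSTOMER_ID, CUSTOMER_NAME), 'ZSTD') AS customer_key_lossless
FROM raw_customer;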

What do you think?

Best regards,
Jens

[FEATURE] Add optional src_additional_columns parameter to all table macros

Proposing a feature which will allow users to include additional arbitrary columns in raw vault structures to support additional metadata and variations on modelling approaches.

Here is an example of the usage for a Hub

{{ config(materialized = 'table') }}

{% set additional_columns = ["APPLIED_TS", "TENANT_ID"] %}

{{ dbtvault.hub(src_pk=src_pk..., src_additional_columns=additional_columns) }}

This would then simply cause the hub macro SQL to select these additional columns; no processing or logic would be applied.

This would also be an optional parameter.

[FEATURE] Using computed columns in the `dbtutils.sat()` macro

Is your feature request related to a problem? Please describe.

I use Stitch to move data from my OLTP database into S3, which is then used as a source to load data into Snowflake. Usually, Stitch only reloads a row of data if it has changed, but there are exceptions. These exceptions happen because I had to "reset" Stitch's internal index a couple of times for technical reasons (e.g., when my team migrated from MySQL to Aurora DB).

This means that my source data tables contain duplicate rows that only differ in Stitch's metadata attributes (e.g., the date it was loaded into Stitch, the date extracted, etc.). My challenge now is how to define an effectivity_date.

I don't have any reliable date for when the data in the row has changed (updated_at isn't always updated). This needs to be calculated from the data itself.

Describe the solution you'd like

I want to define the effectivity date in a satellite using MIN(_sdc_sequence::datetime) OVER (PARTITION BY user_pk, hashdiff). This doesn't work because dbtvault.sat() supports only column names.

Describe alternatives you've considered

I tried defining this column in the staging area, but I can't because I'd need to use a hashed column in a derived column (only the opposite works).

It's possible, but annoying, to do this in the source model used by the staging area, because I need to be able to detect whether two rows are duplicates when I exclude Stitch's metadata columns. It's much easier to do in dbtvault after I have defined the hashdiffs.
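
For illustration, that could be a thin model between the hashed stage and the satellite, along these lines (a sketch; the model and column names are assumptions, and the window expression is taken from the request above):

-- models/stage/stg_users_with_effectivity.sql
SELECT
    *,
    MIN(_sdc_sequence::datetime) OVER (PARTITION BY user_pk, hashdiff) AS effective_from
FROM {{ ref('stg_users_hashed') }}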

[BUG] 'dict object' has no attribute 'get_query_results_as_dict'

Describe the bug
Finding it hard to put a reproducible example together for this, but perhaps (a) I'm just doing something wrong or (b) you can already tell what's wrong from the below.

I have a model that looks like this:

{{ config(materialized='vault_insert_by_period', timestamp_field='DATE_KEY_UTC', period='day',
start_date=var('start_date'), end_date=var('end_date')) }}

SELECT
    rl.city_name
    , rl.zone_id
    , rl.date_key_utc

    , COUNT(DISTINCT rl.driver_id) AS cnt_riders_present
    , COUNT(*) * 12 / 60 AS mins_present

FROM production.events.rider_locations rl
WHERE __PERIOD_FILTER__
GROUP BY rl.city_name, rl.zone_id, rl.date_key_utc

On dbt run against a Snowflake DB I get

14:27:24  Completed with 1 error and 0 warnings:
14:27:24
14:27:24  Compilation Error in macro default__get_period_boundaries (macros/materialisations/period_mat_helpers/get_period_boundaries.sql)
14:27:24    'dict object' has no attribute 'get_query_results_as_dict'
14:27:24
14:27:24    > in macro get_period_boundaries (macros/materialisations/period_mat_helpers/get_period_boundaries.sql)
14:27:24    > called by macro materialization_vault_insert_by_period_default (macros/materialisations/vault_insert_by_period_materialization.sql)
14:27:24    > called by macro default__get_period_boundaries (macros/materialisations/period_mat_helpers/get_period_boundaries.sql)

(side note: what tripped me up the first time around is that the timestamp_field argument is case sensitive because it gets quoted in the resulting SQL)

Versions

dbt: 1.0.3
dbtvault: 0.8.1

[FEATURE] BigQuery support?

I'm new to the idea of Data Vault 2.0 and I'm reading the Dan Linstedt book to understand it better.

To help me get a better perspective on how dbtvault works, I would like to know how difficult you think it would be to add support for BigQuery?

Are there specific features of Snowflake which make it better for running dbt/dbtvault?

Thanks,
Jacob

[BUG] dbtvault expands struct automatically when calling dbtvault.stage

Describe the bug
dbtvault automatically expands the struct when moving from the raw stage to the staging area.

Versions

dbt: 0.20.1
dbtvault: personalised branch starting from develop branch

To Reproduce
Steps to reproduce the behavior:

  1. Create a table in raw_stage area with a struct field as
    data_type: struct<scn:int,from:int,to:int,lines:array,postCode:string>
    address ["from": 20200101, "to": null, "lines": ["1 street", "Edinburgh", 'XX77 7XX']]
  2. The resulting query will be (see Log files): SELECT address, from, to, lines,

Expected behavior
The field should be kept with its original structure.

Log files
dbt.log

[FEATURE] Provide Generic Tests with dbtvault for DataVault 2.0 specific use-cases

Testing is vital in any Data Warehouse.

Here's what we can do today with out-of-the-box dbt (and dbt-utils) tests:

  • We can perform many basic tests such as the Hash key being Unique and Not Null in all of our tables.

  • We can test that there are no orphaned records between hubs, links and satellites (i.e. test to make sure we don't have any satellite records without a corresponding hub record) This would use the relationships_where dbt_utils test.

  • Many, many more, including column type checks between layers using custom tests

  • Other tests which make use of dbt-expectations

These are great and a good starting point; however, there are a number of other tests which we have experience building for clients, specifically for certain cases and purposes.

We would like to add these tests to dbtvault as standard, so that users can make use of them in their own projects. This would be hugely beneficial for many reasons!

These would be offered as generic tests that ship with dbtvault and which users could configure and apply to the models which use dbtvault macros.

This is a good example of a custom test we could implement in dbtvault

AB#5375

Model generator

Hi

I was looking into this project and wondering how people are using it at scale. I think the macros are a massive benefit for making sure no boilerplate code is needed, but still, creating a model file for each Data Vault object across thousands of tables seems a bit painful.

How are other people solving this?
