scalefreecom / datavault4dbt

Scalefree's dbt package for a Data Vault 2.0 implementation congruent to the original Data Vault 2.0 definition by Dan Linstedt, including the Staging Area, the DV 2.0 main entities, PITs, and Snapshot Tables.

Home Page: https://www.scalefree.com/

License: Apache License 2.0

Topics: datavault, hubs, links, pits, satellites, sourcemarts, stagingarea, bigquery, snowflake, scalefree

datavault4dbt's Introduction


Included Macros

  • Staging Area (for hashing, prejoins, and ghost records)
  • Hubs, Links & Satellites (allowing multiple deltas)
  • Non-Historized Links and Satellites
  • Multi-Active Satellites
  • Virtualized End-Dating (in Satellites)
  • Reference Hubs, Satellites, and Tables
  • PIT Tables
    • Hook for Cleaning up PITs
  • Snapshot Control

Features

With datavault4dbt you will get a lot of awesome features, including:

  • A Data Vault 2.0 implementation congruent to the original Data Vault 2.0 definition by Dan Linstedt
  • Ready for both Persistent Staging Areas and Transient Staging Areas, since all macros allow multiple deltas without losing any intermediate changes
  • Enforcing standards in naming conventions by implementing global variables for technical columns
  • A fully auditable solution for a Data Warehouse
  • Creating a centralized, snapshot-based business interface by using a central snapshot table supporting logarithmic logic
  • A modern insert-only approach that avoids updating data
  • Optimizing incremental loads by implementing a high-water-mark that also works for entities that are loaded from multiple sources
  • A straightforward, standardized approach to conducting agile data warehouse development cycles

Requirements

To use the macros efficiently, there are a few prerequisites you need to provide (a minimal sketch of such a prepared source follows this list):

  • Flat & Wide source data, available within your target database
  • A Load Date column that represents the arrival time of the data in the source data storage
  • A Record Source column that indicates where the source data is coming from (e.g. the file location inside a Data Lake)
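The following is only an illustrative sketch of such a prepared source; the table, view, and column names (raw_orders, edwLoadDate, edwRecordSource, etc.) are made up and not prescribed by the package:

CREATE VIEW source_orders AS
SELECT
    o.order_id,
    o.order_status,
    o.order_total,                               -- flat & wide payload
    o._loaded_at             AS edwLoadDate,     -- arrival time in the source data storage
    'datalake/orders/*.csv'  AS edwRecordSource  -- e.g. the file location inside a Data Lake
FROM raw_orders AS o;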


Supported platforms:

Currently supported platforms are:

  • Google BigQuery
  • Exasol
  • Snowflake
  • PostgreSQL
  • Amazon Redshift
  • Microsoft Azure Synapse

We are continuously working to adapt the package to a large variety of platforms. In the future, the package will hopefully also be available for SQL Server, Oracle, and many more.


Installation instructions

  1. Include this package in your packages.yml (see the example below).
  2. Run dbt deps

For further information on how to install packages in dbt, please visit the following link: https://docs.getdbt.com/docs/building-a-dbt-project/package-management
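For example, a minimal packages.yml could look like the following; the version shown is only illustrative, so check the dbt Hub page for the current release:

packages:
  - package: ScalefreeCOM/datavault4dbt
    version: 1.0.5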

Global variables

datavault4dbt is highly customizable through many global variables. Since they are applied on multiple levels, a high rate of standardization across your Data Vault 2.0 solution is guaranteed. The default values of those variables are set inside the package's dbt_project.yml and should be copied to your own dbt_project.yml. For an explanation of all global variables, see the wiki.
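As an example, overriding a few of these variables in your own dbt_project.yml could look like this (the variable names appear in the issues further down this page; the values are illustrative, not recommendations):

vars:
  datavault4dbt.ldts_alias: 'ldts'
  datavault4dbt.rsrc_alias: 'rsrc'
  datavault4dbt.hash: 'MD5'
  datavault4dbt.hash_datatype: 'STRING'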


Usage

The datavault4dbt package provides macros for staging and for creating all Data Vault entities you need to build your own Data Vault 2.0 solution. The usage of the macros is explained in detail in the documentation: https://github.com/ScalefreeCOM/datavault4dbt/wiki
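As a sketch, a hub model on top of a stage could look like the following; the parameter names mirror the hub configurations quoted in the issues below, while the model and column names (stage_customer, HK_CUSTOMER_H, CUSTOMER_BK) are made up, so treat this as an assumption and check the wiki for the exact signature:

{{ config(materialized='incremental') }}

{%- set yaml_metadata -%}
hashkey: HK_CUSTOMER_H
business_keys: CUSTOMER_BK
source_models: stage_customer
{%- endset -%}

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{ datavault4dbt.hub(hashkey=metadata_dict['hashkey'],
                     business_keys=metadata_dict['business_keys'],
                     source_models=metadata_dict['source_models']) }}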


Contributing

View our contribution guidelines


License

Apache 2.0

Contact

For questions, feedback, etc. reach out to us via [email protected]!

datavault4dbt's People

Contributors

bschlottfeldt, jplangner, mgeerken, obause, polat-deniz, sandrakrenn, sergeyh, suhitadutta, thoffmann-sf, tkiehn, tkirschke, tta-scalefree, umattern


datavault4dbt's Issues

Effectivity handled with duplicate unique keys

Hi
I have a question on how the logic handles your proposed approach of handling effectivity that comes from the source.
You specify that these attributes should just be handled like normal descriptive attributes.
The Satellite contains a parameter for the unique key, which is typically only the HK and the load date. Is this still the case?
Does dbt create a primary key on the table for this, or is this just used for incremental load checks? Should one include the effectivity dates in here as well?

We did a small test on the sql branch (which we got while it was still available) and it actually seemed to work: because there was no primary key on the actual tables, the duplicate "unique keys" didn't seem to cause an error. However, we didn't test subsequent loads.

Add Option to Filter Source Data in Stage macro

Discussed in #105

Originally posted by MatthiasWendtB24 July 26, 2023
Hi Folks,
is there a simple solution to filter the rows processed by the stage macro?

Thanks for your response,
Kind regards,
Matthias

#105 (comment) :

Originally posted by @bschlottfeldt September 13, 2023

Might be a good idea for enhancement: allow an optional parameter in the yaml metadata of the stage called filter_source_by and allow the user to add a statement as the input.
We can add this to the list of features to be developed for enhancement, but I can't give a timeline on when it will be done as we have other features in the pipeline.

Kind regards,
Barbara

Non-Historized Satellites do not deduplicate in init-loads

The loading of non-historized satellites should deduplicate the incoming source data based on the hashkey column, and only select the earliest one per hashkey, based on the load-date. Although we do not expect changes in a non-historized satellite, there might be cases where we have duplicate rows across multiple loads.

Proposed changes:

  • modify the n_sat macro to always deduplicate (a sketch follows this list) via
    • QUALIFY ROW_NUMBER() OVER (PARTITION BY <hashkey> ORDER BY <load date>) = 1
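A sketch of the intended deduplication, using Snowflake-style QUALIFY; hashkey and ldts are placeholder column names:

SELECT src.*
FROM source_data AS src
-- keep only the earliest record per hashkey, based on the load date
QUALIFY ROW_NUMBER() OVER (PARTITION BY src.hashkey ORDER BY src.ldts) = 1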

FYI @26091999 @obause

Ghost record values empty for DATE data type for all supported platforms except exasol

Hi,

In the macro macros/supporting/ghost_record_per_datatype.sql there are ghost record definitions.

However, on all supported platforms except Exasol, there is no ghost record defined for the DATE data type. The effect is that any attribute with the DATE data type gets a NULL value, which is contrary to the ghost record idea.

Snowflake code extract example - no DATE data type:
{%- if ghost_record_type == 'unknown' -%}
    {%- if datatype in ['TIMESTAMP_NTZ','TIMESTAMP'] %}{{ datavault4dbt.string_to_timestamp(timestamp_format, beginning_of_all_times) }} AS {{ alias }}
    {%- elif datatype in ['STRING', 'VARCHAR'] %}'{{ unknown_value__STRING }}' AS {{ alias }}
    (...)
    {%- elif datatype in ['NUMBER','INT','FLOAT','DECIMAL'] %}0 AS {{ alias }}
    {%- elif datatype == 'BOOLEAN' %}CAST('FALSE' AS BOOLEAN) AS {{ alias }}
    {%- elif datatype == 'ARRAY' %} CAST('{{ unknown_value__STRING }}' as ARRAY) as "{{ alias }}"
    {%- else %}NULL AS {{ alias }}
{% endif %}

Is there a reason for it?

Thanks,
Jarek

Hash Collision due to bug in hashing algorithm

In datavault4dbt with Snowflake we have identified a hash collision for values containing a lowercase 'v'. Example: 0Q03Y0000033lviSAA and 0Q03Y0000033liSAA produce the same hash.
The issue seems to be caused by the REGEXP_REPLACE expression that should replace '\v' characters. However, it also replaces any lowercase 'v' letters, which is not correct.
The same issue might apply to the n, t and r characters (we haven't checked, though).
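A minimal illustration of the reported behaviour, simplified from the package's actual hashing expression (per the report, the pattern '\v' ends up matching the literal letter 'v' instead of a vertical tab):

SELECT
    MD5(REGEXP_REPLACE('0Q03Y0000033lviSAA', '\v', '')) AS hash_a,
    MD5(REGEXP_REPLACE('0Q03Y0000033liSAA', '\v', ''))  AS hash_b;
-- if the 'v' is stripped, both expressions hash the same string '0Q03Y0000033liSAA',
-- so hash_a equals hash_b and the two business keys collide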

Multi-active satellite: NULL rows not correctly identified

Discussed in #96

Originally posted by cmettler July 17, 2023
Hey Scalefree Team,
I am finishing a SQL Server adapter based on the previous sqlserver branch. I am trying to understand the MA sat template from the other adapters.
If I understand the template code and remarks correctly, the hashdiff should be something like the attached sample SQL queries.

All three queries would create the same hashdiff but for different MA groups, so the delta check would not work.
But if I include the multi-active key in the payload, the hashdiffs would be different and the delta check would work.
I think this should be documented in the wiki, or the MA keys should automatically be added to the hashdiff calculation if they are not explicitly added in the src_payload. Or did I miss something else?
Kind Regards,
Christoph
masat.sql.txt

To-Do: Change the Multi-Active Hashdiff calculation to properly identify NULL rows. Do that by replacing the NULLIF around the CONCAT() with a REPLACE(), to avoid NULL input to STRING_AGG() and LISTAGG().
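For context on why the NULL input matters (illustrative only, not the package's code): LISTAGG and STRING_AGG skip NULL inputs entirely, so a row whose concatenated payload collapses to NULL via NULLIF leaves no trace in the aggregated hashdiff:

SELECT LISTAGG(val, '||') WITHIN GROUP (ORDER BY val) AS agg
FROM (
    SELECT 'a' AS val
    UNION ALL
    SELECT NULL      -- this row silently disappears from the aggregate
);
-- returns 'a', as if the NULL row had never existed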

Linux software center

It would be great to have datavault4dbt and turbovault4dbt available in the Linux software centers, for example in Debian or Ubuntu. This way installation would be easy and many more users could take their first steps with datavault4dbt. I can imagine a preconfigured Postgres database with the TPC-H example. Instead of Excel, LibreOffice could be used for modelling.

One challenge is to package the files correctly.
The second challenge is to obey the licensing rules.
But in my opinion this could make datavault4dbt more popular.

Unknown and error values for calculated columns

I get duplicate business keys (NULL) for hubs if I use a calculated column as the source.

dbt version: 1.4.5
datavault4dbt version: 1.0.16
database: snowflake

Steps to reproduce the issue:
Config values are set to default.

Staging model definition:

{%- set yaml_metadata -%}
source_model: abc
derived_columns:
  SAMPLE_CALCULATED_COLUMN:
    value: "'TEST_' || ID"
    datatype: string
  OTHER_STRING_COLUMN: NAME
ldts: LOADTIMESTAMP
rsrc: '!123'
{%- endset -%}

Generated SQL (shortened):

WITH

source_data AS (
    SELECT
     ...
      "ID",
      "NAME"
    FROM abc
  ),

ldts_rsrc_data AS (

  SELECT
    ...
    "ID",
    "NAME"
    FROM source_data

),

derived_columns AS (

  SELECT

  ...
  "NAME" AS "OTHER_STRING_COLUMN",
  'TEST_' || ID AS "SAMPLE_CALCULATED_COLUMN"
  FROM ldts_rsrc_data

),
unknown_values AS (

    SELECT

    ...
    '(unknown)' AS OTHER_STRING_COLUMN,
      NULL AS SAMPLE_CALCULATED_COLUMN

),

error_values AS (

    SELECT

    ...
    '(error)' AS OTHER_STRING_COLUMN,
    NULL AS SAMPLE_CALCULATED_COLUMN

),

ghost_records AS (
    SELECT * FROM unknown_values
    UNION ALL
    SELECT * FROM error_values
),
...

Expected values for the calculated string column would be '(unknown)' and '(error)'.

BINARY zero keys cause error in PIT generation

  Hi @mjahammel , 

Great to hear that the hotfix worked for you!
The problem that you have mentioned is the logical consequence of my fix, without thinking about where the zero keys are used additionally. I completely forgot the PIT macros 😄

Will create one more Hotfix for that ASAP!

Kind regards,
Tim

Originally posted by @tkirschke in #47 (reply in thread)

Incremental loads for hubs and links don't use high-water-mark

Hi,

I stumbled upon our hubs and links not using the high-water-mark for their incremental loading.
For this issue I will only go into hubs, since links probably share the same cause.

dbt version: 1.4.6
datavault4dbt version: 1.0.17
database: snowflake

We defined our own aliases in the dbt_project.yml

#Column Aliases
  datavault4dbt.ldts_alias: 'CDP_LOADDTS'
  datavault4dbt.rsrc_alias: 'CDP_RECORD_SOURCE'

Every stage defines these two attributes:

ldts: CDP_LOADDTS
rsrc: '[static source identifier]'

Our hub definition is very slim, as we define the matching columns (HK/BK) for the hubs in the source_models.
For multiple sources:

{{ config(materialized='incremental') }}
{%- set yaml_metadata -%}
hashkey: HASHKEY_SK
business_keys: BUSINESSKEY_BK
source_models:
    V_MODEL1: {}
    V_MODEL2: {}
    V_MODEL3: {}
{%- endset -%}

For single source hubs:

{{ config(materialized='incremental') }}
{%- set yaml_metadata -%}
hashkey: HASHKEY_SK
business_keys: BUSINESSKEY_BK
source_models: V_MODEL4
{%- endset -%}

Both definitions always process the entire history of the stage data to determine the delta. Are we missing something in the definition that is necessary to calculate the high-water-mark?

We would like to get the same result for delta calculation analogous to the sat definition and decrease our processing time.

{{ config(materialized='incremental') }}
{%- set yaml_metadata -%}
source_model: V_MODEL4
parent_hashkey: HASHKEY_SK
src_hashdiff:
  source_column: S_SAT_HASHDIFF
  alias: HASHDIFF
src_payload:
  - COL1
  - COL2
{%- endset -%}

Kind regards,
Lars

Snowflake: Link (and NHL?): Link cannot be sourced from the same stage twice

Hi,
I want to create a link. The source has three fields: Key1, Email1, Email2.
The link should be created from Key1/Email1 and Key1/Email2. I created two BKs and two hub HKs from the emails,
and two link HKs containing either Email1 or Email2.
I get the error that the SQL doesn't find the "normal" hub HK (without a number). It seems that although I created two sources in the link definition, only one is created in the SQL (we had the same issue with hubs).

Regards

Hash algorithm SHA2 for Snowflake having defined hash_datatype "BINARY(32)" not working as expected

Discussed in #164

Originally posted by migma-ciklum April 15, 2024
We've defined these variables in our dbt_project file (using version 1.3.0 of ScalefreeCOM/datavault4dbt):

  • datavault4dbt.hash: 'SHA2'
  • datavault4dbt.hash_datatype: 'BINARY(32)'

But when running our models against Snowflake, our hash keys and hash diffs will turn out to be BINARY(8388608) instead of BINARY(32). Is there a way to make the hash keys and hash diffs use the defined datavault4dbt.hash_datatype, including the length?

Thanks in advance for your help and feedback!
BR

Same hashdiff inserted again in Satellite, without having in between delta states

Problem statement

I encountered this problem while testing in my project.
We have a Satellite that holds, for example, hashkey '123' and hashdiff 'abcd' with load date '2023-09-08 13:08:04.421'.
We load new records into the Stage with load date '2023-09-08 14:54:00.818' (a few hours later). Those records don't contain any delta: multiple records (from different files) hold the same hashdiff for the same hashkey.
The current implementation inserts the same hashdiff again into the satellite, which is not the expected behaviour.

Expected behaviour

None of the records are inserted into the satellite, as they hold the same hashdiff as the latest hashdiff present in the satellite and there are no inbetween deltas in the Stage.
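For illustration only (this is not the package's actual code), the expected delta check would compare each incoming record against the latest hashdiff currently stored for its hashkey; satellite, stage, hashkey, hashdiff, and ldts are placeholder names:

WITH latest AS (
    SELECT hashkey, hashdiff
    FROM satellite
    QUALIFY ROW_NUMBER() OVER (PARTITION BY hashkey ORDER BY ldts DESC) = 1
)
SELECT stg.*
FROM stage AS stg
LEFT JOIN latest AS l
    ON stg.hashkey = l.hashkey
WHERE l.hashkey IS NULL               -- brand-new hashkey
   OR stg.hashdiff <> l.hashdiff;     -- insert only real deltas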

Update PIT implementation in Redshift

I see that you are using COALESCE(LEAD(TIMESTAMP_SUB({{ ldts }}, INTERVAL 1 MICROSECOND)) in the PIT for Redshift. However, TIMESTAMP_SUB and INTERVAL 1 MICROSECOND are not supported by Redshift.
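A Redshift-compatible replacement for that expression might use DATEADD instead. The following is only a sketch from my side; the window clause, the placeholder names, and the end-of-all-times literal are assumptions to be validated against the macro:

SELECT
    hashkey,
    ldts,
    COALESCE(
        LEAD(DATEADD(microsecond, -1, ldts)) OVER (PARTITION BY hashkey ORDER BY ldts),
        TIMESTAMP '8888-12-31 23:59:59'
    ) AS ledts
FROM some_table;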

sqlserver hub.sql need alias for rsrc_static_query_source

Hi, I am working on the sqlserver pre-release, and in hub.sql, if you have a static record source, the SELECT syntax that gets generated in lines 50 - 62 is incorrect:
the subquery needs an alias. It only failed on the second load.
datavault4dbt\macros\tables\sqlserver\hub.sql

{%- set rsrc_static_query_source -%}
SELECT count(*) FROM (
{%- for rsrc_static in rsrc_statics -%}
SELECT t.{{ src_rsrc }},
'{{ rsrc_static }}' AS rsrc_static
FROM {{ this }} t
WHERE {{ src_rsrc }} like '{{ rsrc_static }}'
{%- if not loop.last %}
UNION ALL
{% endif -%}
{%- endfor -%}
) sub -- I added this alias to the subquery otherwise it fails
{% endset %}

Control snapshot v0: is_daily flag not correctly set, when daily snapshot timestamp isn't 00:00:00

In the generated code for control_snap_v0, the is_daily flag is currently not set correctly to reflect the user's input for daily_snapshot_time, if that parameter is set to any timestamp other than 00:00:00.

Current macro:
CASE
WHEN EXTRACT(MINUTE FROM sdts) = 0 AND EXTRACT(SECOND FROM sdts) = 0 AND EXTRACT(HOUR FROM sdts) = 0 THEN TRUE
ELSE FALSE
END AS is_daily

Meaning, the is_daily flag is only true if the daily snapshot time is defined as 00:00:00.

Proposal: change the macro code to parse the user's definition of the daily snapshot time:
CASE
WHEN
EXTRACT(HOUR FROM sdts) = EXTRACT(HOUR FROM TO_TIME('{{ daily_snapshot_time }}'))
AND EXTRACT(MINUTE FROM sdts) = EXTRACT(MINUTE FROM TO_TIME('{{ daily_snapshot_time }}'))
AND EXTRACT(SECOND FROM sdts) = EXTRACT(SECOND FROM TO_TIME('{{ daily_snapshot_time }}'))
THEN TRUE
ELSE FALSE
END AS is_daily

BigQuery Snapshot Control v0

An error is thrown when trying to create a snapshot control model in BigQuery.

Details:

  • The version of dbt being used:
    • dbt-bigquery 1.6.7
    • dbt-core 1.6.6
  • The version of datavault4dbt being used.
    • 1.1.4
  • Steps to reproduce the issue
    1. Set up a dbt project using BigQuery
    2. Create a model using the example configuration in the docs
    3. execute the dbt run command

The following was the configuration used to reproduce the issue:
{{ config(materialized='incremental') }}

{%- set yaml_metadata -%}
start_date: '2015-01-01'
daily_snapshot_time: '07:30:00'
{%- endset -%}

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{%- set start_date = metadata_dict['start_date'] -%}
{%- set daily_snapshot_time = metadata_dict['daily_snapshot_time'] -%}

{{ datavault4dbt.control_snap_v0(start_date=start_date,
daily_snapshot_time=daily_snapshot_time) }}

  • Any error messages or dbt log files which can give more detail of the problem

The following error message is given by dbt:
Database Error in model xxxxx A valid date part name is required at [55:33]

The following error message is given by BigQuery if you execute the dbt-compiled SQL in the target/compiled directory:
A valid date part name is required at [44:33]

Potential solution

The following update to line 58 in file datavault4dbt/macros/tables/bigquery/control_snap_v0.sql appears to resolve this issue:

existing SQL to create the is_end_of_month column:

CASE 
    WHEN LAST_DAY(sdts, 'month') = DATE(sdts) THEN TRUE
    ELSE FALSE
END as is_end_of_month,

potential update

CASE 
    WHEN LAST_DAY(DATE(sdts), MONTH) = DATE(sdts) THEN TRUE
    ELSE FALSE
END as is_end_of_month,

Multiactive satellite is not working properly with link

Hi,
We are using a link as our hashkey and we have 'Valid_To' as our ma_attribute. However, there should be only one valid record, but there are multiple. Below are samples of my ma_sat v0 and v1.

ma_sat v0

{%- set yaml_metadata -%}
source_model: 'source_model'
parent_hashkey: 'hk_key_l'
src_hashdiff: 'hd_sales_ms'
src_ma_key: 'VALID_TO'
src_payload: 
    - field
{%- endset -%}

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{
    datavault4dbt.ma_sat_v0(
        source_model=metadata_dict["source_model"],
        parent_hashkey=metadata_dict["parent_hashkey"],
        src_hashdiff=metadata_dict["src_hashdiff"],
        src_ma_key=metadata_dict["src_ma_key"],
        src_payload=metadata_dict["src_payload"],
    )
}}

ma_sat v1

{%- set yaml_metadata -%}
sat_v0: 'ma_sat_v0'
hashkey: 'hk_key_l'
hashdiff: 'hd_sales_ms' 
ma_attribute:
    - VALID_TO
ledts_alias: 'ledts'
add_is_current_flag: true
{%- endset -%}    

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{%- set sat_v0 = metadata_dict["sat_v0"] -%}
{%- set hashkey = metadata_dict["hashkey"] -%}
{%- set hashdiff = metadata_dict["hashdiff"] -%}
{%- set ledts_alias = metadata_dict["ledts_alias"] -%}
{%- set ma_attribute = metadata_dict["ma_attribute"] -%}
{%- set add_is_current_flag = metadata_dict["add_is_current_flag"] -%}

{{
    datavault4dbt.ma_sat_v1(
        sat_v0=sat_v0,
        hashkey=hashkey,
        hashdiff=hashdiff,
        ma_attribute=ma_attribute,
        ledts_alias=ledts_alias,
        add_is_current_flag=add_is_current_flag,
    )
}}

Regards,
ShennWooi

ledts_alias in v1-views not correctly set

In the v1 views from ref_sat and sat, the parameter value of ledts_alias is being overwritten by
set ledts_alias = var('datavault4dbt.ledts_alias', 'ledts')
instead of something like
set ledts_alias = datavault4dbt.replace_standard(ledts_alias, 'datavault4dbt.ledts_alias', 'ledts')

In ma_sat it is missing completely.

dbt_utils dependency upgrade to 0.9.2

Hi guys,
I would like to test your dbt package, but I am having trouble because all other packages in my current project have been upgraded and now require compatibility with dbt-utils 0.9.2.
Any chance you could check whether an upgrade of your dependency is possible?

Thanks in advance.

PIT table usage

Hi *,

  1. I wonder if I can create a PIT from a SAT that keeps a Dependent Child key?
  2. The Dimension Key is unclear to me. Is it a field coming from the stage model? Or is it creating a "surrogate key" that can be used in the fact/dimension tables afterwards?

Best regards
Klaus

dbt deps creates an error in dbt cloud

I tried to recreate the code from the demo session. I get the following error when I run "dbt deps"

Server log summary

Compilation Error in model stg_order (models/staging/datavaultdemo/stg_order.sql)
  The name argument to ref() must be a string, got <class 'dbt.clients.jinja.create_undefined.<locals>.Undefined'>
  
  > in macro snowflake__stage (macros/staging/snowflake/stage.sql)
  > called by macro stage (macros/staging/stage.sql)
  > called by model stg_order (models/staging/datavaultdemo/stg_order.sql)
Compilation Error in model stg_order (models/staging/datavaultdemo/stg_order.sql)
  The name argument to ref() must be a string, got <class 'dbt.clients.jinja.create_undefined.<locals>.Undefined'>
  
  > in macro snowflake__stage (macros/staging/snowflake/stage.sql)
  > called by macro stage (macros/staging/stage.sql)
  > called by model stg_order (models/staging/datavaultdemo/stg_order.sql)

Server log Details

16:02:29  Set downloads directory='/tmp/dbt-downloads-4ifolly8'
16:02:29  Making package index registry request: GET https://hub.getdbt.com/api/v1/index.json
16:02:29  Response from registry index: GET https://hub.getdbt.com/api/v1/index.json 200
16:02:29  Making package registry request: GET https://hub.getdbt.com/api/v1/dbt-labs/dbt_utils.json
16:02:29  Response from registry: GET https://hub.getdbt.com/api/v1/dbt-labs/dbt_utils.json 200
16:02:29  Making package registry request: GET https://hub.getdbt.com/api/v1/ScalefreeCOM/datavault4dbt.json
16:02:29  Response from registry: GET https://hub.getdbt.com/api/v1/ScalefreeCOM/datavault4dbt.json 200
16:02:29  Installing dbt-labs/dbt_utils
16:02:35    Installed from version 0.9.2
16:02:35    Up to date!
16:02:35  Sending event: {'category': 'dbt', 'action': 'package', 'label': '797fd050-690d-4e6b-a0b8-e889d588b98f', 'property_': 'install', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f865004caf0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f865004cc10>]}
16:02:35  Installing ScalefreeCOM/datavault4dbt
16:02:46    Installed from version 1.0.5
16:02:46    Up to date!
16:02:46  Sending event: {'category': 'dbt', 'action': 'package', 'label': '797fd050-690d-4e6b-a0b8-e889d588b98f', 'property_': 'install', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f865004c940>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f865004cac0>]}

My files are

packages.yml

packages:
  - package: dbt-labs/dbt_utils
    version: 0.9.2
  - package:  ScalefreeCOM/datavault4dbt 
    version: 1.0.5

models/staging/datavaultdemo/sources.yml


version: 2

sources: 
  - name: TPC-H_SF1
    database: SNOWFLAKE_SAMPLE_DATA
    schema: TPCH_SF1
    tables:
      - name: customer
  - name: deltas
    database: DATAVAULT4DBT_DEMO
    schema: CORE_SOURCE_SCHEMA
    tables: 
      - name: orders_initial
      - name: orders_1993
      - name: orders_1994
      - name: orders_1995
      - name: orders_1996
      - name: orders_1997
      - name: orders_1998

models/staging/datavaultdemo/stg_order.yml

    {{ config(materialized='view', schema='CORE_SOURCE_SCHEMA') }}

    {%- set yaml_metadata -%}
    source_model: 
          'deltas': 'orders_initial'
    hashed_columns: 
        hk_h_orders:
            - o_orderkey
        hk_h_customers:
            - customer_name
        hk_l_orders_customers:
            - o_orderkey
            - customer_name
        hd_orders_n_s:
            is_hashdiff: true
            columns:
                - o_orderstatus
                - o_totalprice
                - o_orderdate
                - o_orderpriority
                - o_clerk
                - o_shippriority
                - o_comment
                - is_highest_priority
                - description
                - legacy_orderkey
    derived_columns:
        is_highest_priority:
            value: "CASE WHEN (o_orderpriority = '1 - URGENT') THEN true ELSE false END"
            datatype: "BOOLEAN"
            src_cols_required: 'o_orderpriority'
        description:
            value: '!Orders from TPC_H, reference to Customers'
            datatype: 'STRING'
    missing_columns:
        legacy_orderkey: 'STRING'
    prejoined_columns:
        customer_name:
            src_name: 'TPC-H_SF1'
            src_table: 'customer'
            bk: 'c_name'
            this_column_name: 'o_custkey'
            ref_column_name: 'c_custkey'
    ldts: 'edwLoadDate'
    rsrc: '!TPC_H_SF1.Orders'
    {%- endset -%}

    {%- set metadata_dict = fromyaml(yaml_metadata) -%}

    {%- set source_model = metadata_dict['source_model'] -%}
    {%- set ldts = metadata_dict['ldts'] -%}
    {%- set rsrc = metadata_dict['rsrc'] -%}
    {%- set hashed_columns = metadata_dict['hashed_columns'] -%}
    {%- set derived_columns = metadata_dict['derived_columns'] -%}
    {%- set prejoined_columns = metadata_dict['prejoined_columns'] -%}
    {%- set missing_columns = metadata_dict['missing_columns'] -%}

    {{ datavault4dbt.stage(source_model=source_model,
                        ldts=ldts,
                        rsrc=rsrc,
                        hashed_columns=hashed_columns,
                        derived_columns=derived_columns,
                        prejoined_columns=prejoined_columns,
                        missing_columns=missing_columns) }}

Did I make a mistake, or is there an issue with dbt Cloud? I tried restarting the IDE multiple times.

Snapshot v0 - IS_WEEKLY true on Tuesday?

Discussed in #49

Originally posted by mjahammel January 9, 2023
According to the documentation, the column IS_WEEKLY for a v0 Snapshot should be true only for rows where the sdts column is a Monday. I've tried several different combinations of start_date and daily_snapshot_time (including the example shown in Example 1), but in all cases IS_WEEKLY is true if the sdts date is a Tuesday. Am I doing something wrong or just misunderstanding the documentation?

Thanks,
Maury

To Do

  • Change the Snap v0 code for Snowflake to set is_weekly to TRUE when DAYOFWEEK equals 1 instead of 2
  • Check all other adapters again and verify that is_weekly is set correctly

List of columns for hashdiff with include/exclude parameters

Discussed in #108

Originally posted by marcinsiewiera August 2, 2023
Hi All,

Is there a way to define the columns for the change hash in the stage macro by specifying columns to be excluded, rather than listing all columns one by one?
Instead of using

hd_lineitem_n_s:
    is_hashdiff: true
    columns:
        - col1
        - col2
        - col3
        - ....
        - col50
        - ...

I would rather exclude the business key and the other technical columns, like this:

    columns:
      - include: '*' 
        exclude:
          - businessKey
          - hashKey
          - changeHash
          - loaddate
          - recordsource
          - ...

Regards,
Marcin

Hash key calculation in Multi-active stages only done for main hashkey column

It seems there's a bug: for stage models that contain a multi-active config, only the hash key attribute defined within the multi-active config as "main_hashkey_column" is calculated. All other hash key attributes are ignored - for example in a stage that contains a link hash key and multiple hub hash keys.

Make Satellite HWM safe against outside truncate

In all kinds of Satellites, an HWM logic is applied for incremental runs. This logic checks:

WHERE ldts > (SELECT max(ldts) FROM {{ this }})

But for cases where this table is truncated outside of dbt, SELECT max(ldts) would return NULL, and on some databases the comparison ldts > NULL throws an error.

To avoid this, we need to insert the beginning-of-all-times for cases when max(ldts) returns NULL, like this:

WHERE ldts > (SELECT COALESCE(max(ldts), {{ datavault4dbt.beginning_of_all_times() }}) FROM {{ this }})

Thanks to

for pointing out this issue and proposing a solution!

Hash Algorithm - Replacing NULL inputs for hash values

Discussed in #26

Originally posted by anitsche-bt November 4, 2022
Hi guys, following question:
I would like to link a value to the unknown default value (0*), but this seems to be impossible because of the IFNULLs within the hash logic you use. I wanted to use the unknown value as the starting point (level 0) in a hierarchical link.
I don't believe this was intentional, because the outer IFNULL has the 0* as a replacing value.
Could you check, please?

BR

prejoin not working for redshift

The version of dbt being used: 1.1.1
The version of datavault4dbt being used: 1.2.0
Issue in file: datavault4dbt --> macros --> staging --> redshift --> stage.sql
issue in line: 231
issue: below code is used:
left join {{ source(vals['src_name']|string, vals['src_table']) }} as pj_{{loop.index}} on lcte.{{ vals['this_column_name'] }} = pj_{{loop.index}}.{{ vals['ref_column_name'] }}
However, Redshift doesn't allow this kind of join. datavault4dbt.multikey should be used to do it properly.

Multi Active Satellite - Wrong comma when having multiple hashkeys

Discussed in #80

Originally posted by saivigneshk June 14, 2023
Hi,
I am trying to create a multi-active satellite with one main hash key, a hash difference, and a link hash key using bigquery/stage.sql.
In the compiled code of the ma_hashdiff_prep CTE, I am getting an extra comma, and because of that the compilation is failing.
Is that a bug?

Issue

Too many commas when defining a multi-active stage that has more hashkeys than the main hashkey. See the example SQL here:

SELECT
  
  IFNULL(TO_HEX(LOWER(MD5(NULLIF(CAST(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(CONCAT(
    IFNULL((CONCAT('\"', REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(TRIM(CAST(`d_member_code` AS STRING)), r'\\', r'\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^')
    ), r'\n', '') 
    , r'\t', '') 
    , r'\v', '') 
    , r'\r', '') AS STRING), '^^')))), '00000000000000000000000000000000') AS hk_member_h, 
, 
IFNULL(TO_HEX(LOWER(MD5(STRING_AGG(NULLIF(CAST(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(CONCAT(
    IFNULL((CONCAT('\"', REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(TRIM(CAST(`zip_code` AS STRING)), r'\\', r'\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^'),'||',
    IFNULL((CONCAT('\"', REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(TRIM(CAST(`channel` AS STRING)), r'\\', r'\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^'),'||',
    IFNULL((CONCAT('\"', REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(TRIM(CAST(`zone_id` AS STRING)), r'\\', r'\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^'),'||',
    IFNULL((CONCAT('\"', REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(TRIM(CAST(`enabled` AS STRING)), r'\\', r'\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^')
    ), r'\n', '') 
    , r'\t', '') 
    , r'\v', '') 
    , r'\r', '') AS STRING), '^^||^^||^^||^^') ORDER BY zone_id)))), '00000000000000000000000000000000') AS hd_member_zone_hsat,
  ldts

FROM derived_columns
GROUP BY hk_member_h, ldts

What's causing that issue?

In the "hash_columns" macro the comma is set outside of the "if , elif, endif" block, that identifies the main hashkey column and the hashdiff columns:

    {%- for col in columns -%}

        {%- if datavault4dbt.is_something(multi_active_key) -%}
            {% if columns[col] is mapping and columns[col].is_hashdiff -%}
                {{- datavault4dbt.hash(columns=columns[col]['columns'], 
                                alias=col, 
                                is_hashdiff=columns[col]['is_hashdiff'],
                                multi_active_key=multi_active_key,
                                main_hashkey_column=main_hashkey_column) -}}

            {%- elif columns[col] is not mapping and (col|upper) == (main_hashkey_column | upper) -%}
                {{- datavault4dbt.hash(columns=columns[col],
                                alias=col,
                                is_hashdiff=false) -}}  

            {%- endif -%}

            {{- ", \n" if not loop.last -}}

        {%- else -%}

What will fix this issue?

Moving that comma into the conditional logic, like this:

    {%- for col in columns -%}

        {%- if datavault4dbt.is_something(multi_active_key) -%}
            {% if columns[col] is mapping and columns[col].is_hashdiff -%}
                {{- datavault4dbt.hash(columns=columns[col]['columns'], 
                                alias=col, 
                                is_hashdiff=columns[col]['is_hashdiff'],
                                multi_active_key=multi_active_key,
                                main_hashkey_column=main_hashkey_column) -}}

            {{- ", \n" if not loop.last -}}
            
            {%- elif columns[col] is not mapping and (col|upper) == (main_hashkey_column | upper) -%}
                {{- datavault4dbt.hash(columns=columns[col],
                                alias=col,
                                is_hashdiff=false) -}}  

            {{- ", \n" if not loop.last -}}

            {%- endif -%}


        {%- else -%} 

Thanks to @saivigneshk for pointing this out!

invalid timeframe creation in SAT-v1 (MA, REF)

Using dbt or Coalesce, we added a nanosecond to the rows in our import to achieve uniqueness.
This generates invalid timeframes in the v1 sats, because there a microsecond is subtracted.

Could this be customizable?
Version is 1.1.2

Ghost Records may not be switched off

It would be nice to have the possibility to switch off the generation of ghost records as a whole or separately for unknown values and error values.

Unexpected string literal in BigQuery staging macro because of quoted alias

When the staging macro creates ghost records, BigQuery returns:

Syntax error: Unexpected string literal

because the macro quotes the alias for the date datatype in the default__ghost_record_per_datatype macro in macros/supporting/ghost_record_per_datatype.sql

@tta-scalefree @tkirschke This leads to an error in BigQuery because BigQuery does not allow column names with quotes

Originally posted by @obause in #157 (comment)

sqlserver: Even if alias ldts is changed, on incremental load it still references this field

This could be broader than sqlserver, but I am testing on sqlserver branch.
I have changed the ldts global variable :
datavault4dbt.ldts_alias: 'loaddate'

The full initial load works, but when I run an incremental load the following code errors on a satellite (I removed the table names):
delete from database.schema."sat0 table"
where exists (
    select null
    from database.schema."sat0 table__dbt_tmp"
    where
        database.schema."sat0 table__dbt_tmp".hub_hashkey = database.schema."sat0 table".hub_hashkey
        and
        database.schema."sat0 table__dbt_tmp".ldts = database.schema."sat0 table".ldts
)
;

The field is correctly renamed to my alias loaddate in the satellite, and thus it errors because ldts does not exist.

Reference Hub not updated during incremental load in Snowflake

We are using the ref_hub macro in an incremental model in our Snowflake environment to build our reference tables. We have observed that the initial load of the reference hub works fine; however, during the incremental loads new ref hub keys are not being added. The SQL syntax looks OK, however it seems not to work with Snowflake.
This issue is only observed in the reference hub model. The ref satellite is being updated correctly.
However, as a consequence, new entries are missing in the final reference table, as the new hub entries are missing.
Steps to reproduce:

  1. Create new reference hub, example:

{{ config(materialized='incremental') }}

{%- set yaml_metadata -%}
source_models: staging_model
ref_keys:
- KEY1
- KEY2
{%- endset -%}

{% set metadata_dict = fromyaml(yaml_metadata) %}

{{ datavault4dbt.ref_hub(source_models=metadata_dict['source_models'],
ref_keys=metadata_dict['ref_keys']) }}

  2. Build the model
  3. Add new entries in the source with a new key combination
  4. Re-run the model to trigger the incremental update
  5. Observe that the new entries have not been added.

LDTS / RSRC naming leads to error (ambiguous column)

Hi,
I'm using version 1.3.1 and having trouble creating a stage. It produces an error message: ambiguous column name 'LDTS' (or RSRC).
I think it is because the attributes are already named like this in the source, so it is taking both of them... I didn't have this issue in version 1.1.0 (which I used before).

Regards
Jan Binge

Hash Datatypes other than STRING cause an error

Discussed in #47

Originally posted by mjahammel January 5, 2023
Are there allowed values for the variable hash_datatype other than STRING? I tried using BINARY (the target is Snowflake) and am getting a couple of different errors, depending on the hash type selected. If the variable hash is set to SHA1 or SHA2, the error messages are "Invalid argument types for function 'LOWER': (BINARY(20))" and "Invalid argument types for function 'LOWER': (BINARY(64))", respectively.

An example of the generated code for a hashkey is:

IFNULL(LOWER(SHA2_BINARY(NULLIF(CAST(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(UPPER(CONCAT(
        IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(TRIM(CAST("BRANCH_CODE" AS STRING)), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^')
        )), '\n', '')
        , '\t', '')
        , '\v', '')
        , '\r', '') AS STRING), '^^'))), 'TO_BINARY(0000000000000000000000000000000000000000000000000000000000000000)') AS DV_HASHKEY_HUB_BRANCH_TEST

The problem seems to be the call to LOWER immediately after the IFNULL. Also, the default value seems to be incorrect. Instead of:

'TO_BINARY(0000000000000000000000000000000000000000000000000000000000000000)'

should it not be:

TO_BINARY('0000000000000000000000000000000000000000000000000000000000000000')

Thanks,
Maury

To Do's

  • Remove the LOWER function in hash default values in case something other than STRING is defined as hash_datatype
  • Fix the default unknown and error values for the SHA hash algorithms

Columns with spaces breaks satellites

On branch sqlserver:

If you have column names with spaces in them, for example "start time", this seems to be handled correctly in hubs, links, and stage models. However, the satellites do not add quotation marks when they build the SELECT statements, even though these columns are enclosed in quotation marks ("") in the model definition.
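For illustration (SQL Server syntax; the table and column names are made up), an identifier containing a space cannot be referenced without quoting, which is why the generated satellite SELECT fails:

-- fails: the unquoted identifier with a space is invalid
-- SELECT s.start time FROM dbo.stg_events AS s;

-- works: quoted or bracketed identifier
SELECT s."start time" FROM dbo.stg_events AS s;
SELECT s.[start time] FROM dbo.stg_events AS s;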
