upstream-prod's People

Contributors

garfieldthesam, kpounder, lewisdavies

upstream-prod's Issues

Error on upstream_prod fallback for BigQuery ML model

Hi!

I am using the upstream_prod package with a BigQuery machine learning model and I am getting an error. The ML model is referenced by a datamart but does not yet exist in production, so the fallback to development is correctly triggered. However, {% set return_ref = load_relation(parent_ref) %} in ref.sql returns none for this BigQuery ML model, producing the error:

19:55:52  1 of 1 START sql view model franzoni.dtm_revenue_forecast ................. [RUN]
19:55:52  [dtm_revenue_forecast] mdl_revenue_forecast not found in prod, falling back to default target
19:55:52  1 of 1 ERROR creating sql view model franzoni.dtm_revenue_forecast ........ [ERROR in 0.65s]
19:55:52  
19:55:52  Finished running 1 view model, 3 hooks in 0 hours 0 minutes and 6.14 seconds (6.14s).
19:55:52  
19:55:52  Completed with 1 error and 0 warnings:
19:55:52  
19:55:52  Compilation Error in model dtm_revenue_forecast (models/ml_models/dtm_revenue_forecast.sql)
19:55:52    [dtm_revenue_forecast] upstream_prod couldn't find the specified model:
19:55:52    
19:55:52    DATABASE: xxx
19:55:52    SCHEMA:   ml_models
19:55:52    RELATION: mdl_revenue_forecast
19:55:52    
19:55:52    Check your variable settings in the README or create a GitHub issue for more help.
19:55:52    
19:55:52    > in macro default__ref (macros/ref.sql)
19:55:52    > called by macro ref (macros/ref.sql)
19:55:52    > called by macro ref (macros/ref.sql)
19:55:52    > called by model dtm_revenue_forecast (models/ml_models/dtm_revenue_forecast.sql)
19:55:52  
19:55:52  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
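If the cause is load_relation returning none for relation types it cannot introspect, a defensive guard in ref.sql might look like this (a hypothetical sketch, not the package's actual fix; only `load_relation` and `parent_ref` come from the code quoted above):

```jinja
{% set return_ref = load_relation(parent_ref) %}
{% if return_ref is none %}
    {# BigQuery ML models are not ordinary tables or views, so
       load_relation can return none even though the relation exists.
       Fall back to using the unloaded ref as-is. #}
    {% set return_ref = parent_ref %}
{% endif %}
```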

Support custom databases

Custom schemas are supported, but custom databases are not (only partially, via the prod_database_replace feature).

In our project, we write to and read from different databases. This setup is currently not possible with upstream_prod: a model fails when it selects from models in different databases.

We've added support for this by copying the solution for custom schemas.

  1. Adding the macro get_custom_database.sql:

{% macro generate_database_name(custom_database_name=none, node=none, is_upstream_prod=False) -%}
    {%- set default_database = target.database -%}
    {%- if (target.name == 'prod' or is_upstream_prod == true) and custom_database_name is not none -%}
        {{ custom_database_name | trim }}
    {%- else -%}
        {{ default_database }}
    {%- endif -%}
{%- endmacro %}

  2. Adding this to the upstream_prod.ref macro:

{% set custom_database_name = parent_node.config.database %}
{% set parent_database = generate_database_name(custom_database_name, parent_node, True) | trim %}

It would be cool if upstream_prod supported this out of the box 🙂

--empty run option doesn't work when upstream_prod_enabled is True

When running a dbt model with the --empty flag and upstream_prod_enabled: False, dbt generates SQL from the ref() that includes where false limit 0, which is what we would expect. However, when running with upstream_prod_enabled: True, the additional where condition is not included.

Ideally, the where condition would still be included so we can take advantage of both the upstream_prod and --empty optimizations.

See https://docs.getdbt.com/reference/commands/run#the---empty-flag
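One possible direction, sketched under the assumption that the overridden ref could detect the --empty invocation itself (invocation_args_dict is part of dbt's Jinja context; the wrapping below is illustrative, not upstream_prod code):

```jinja
{# Hypothetical: replicate dbt's --empty behaviour in the overridden ref #}
{% if invocation_args_dict.get("empty") %}
    {# Mirror the "where false limit 0" wrapper that the builtin ref produces #}
    {% do return("(select * from " ~ return_ref ~ " where false limit 0) _dbt_empty") %}
{% endif %}
```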

Env schemas don't seem to work with snapshots that use the `target_schema` config.

I use env schemas and implemented this package by updating the generate_schema_name macro, but it didn't seem to work for models that ref a snapshot.

This is how my variables were set up in dbt_project.yml:

  upstream_prod_env_schemas: true
  upstream_prod_enabled: true
  upstream_prod_disabled_targets:
    - production
  upstream_prod_prefer_recent: true

This is how my macro was set up:

{% macro generate_schema_name(custom_schema_name, node, is_upstream_prod=False) -%}

	{%- set default_schema = target.schema -%}

	{# Dev, and it's a selected node - don't split out schemas #}
	{%- if target.name != "production" and is_upstream_prod == False -%}
		{{ default_schema }}

	{# Production, or dev when it's not a selected node - split out schemas #}
	{%- else -%}
		{# Tell upstream_prod to use our production schema #}
		{%- if is_upstream_prod -%}
			{%- set default_schema = 'dbt_production' -%}
		{%- endif -%}

		{# Break out separate schemas #}
		{%- if custom_schema_name is none -%}
			{{ default_schema }}
		{%- else -%}
			{{ default_schema }}_{{ custom_schema_name | trim }}
		{%- endif -%}

	{%- endif -%}

{%- endmacro %}

All of our snapshots have this in the config block at the top:

{{
	config(
		target_schema='snapshots',
		...,
		...,
	)
}}

When I tried to dbt run a model that had a ref() pointing at a snapshot, raise_ref_not_found_error raised this:
[error screenshot]

From inspecting it, it looked like it was resolving {{ ref('snapshot__...') }} to my_project.dbt_production.snapshot__... rather than resolving it to the config value of target_schema. I needed the ref to resolve to my_project.snapshots.snapshot__....

As a hacky workaround, I added this clause to the top of my generate_schema_name macro:

	{# Override to get snapshots working #}
	{% if node.config.target_schema == 'snapshots' %}
		snapshots

	{# Dev, and it's a selected node - don't split out schemas #}
	{%- elif target.name != "production" and is_upstream_prod == False -%}

This got everything working again. I'm not exactly sure why my new generate_schema_name configuration with is_upstream_prod broke the step in dbt core that forces a snapshot's schema to its target_schema.
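A less hard-coded version of that workaround (a sketch; it assumes node.resource_type and node.config.target_schema are populated the same way as in the config blocks above) would honour whatever target_schema each snapshot declares:

```jinja
{% macro generate_schema_name(custom_schema_name, node, is_upstream_prod=False) -%}
    {# Honour any snapshot's declared target_schema first #}
    {%- if node is not none and node.resource_type == "snapshot"
           and node.config.target_schema is not none -%}
        {{ node.config.target_schema | trim }}
    {%- else -%}
        {# ...the existing dev/prod branching from the macro above... #}
        {{ target.schema }}
    {%- endif -%}
{%- endmacro %}
```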

Dict object has no attribute `file_key_name` on Singular tests

[error screenshot]

Fairly confident that this is only an issue with singular data tests configured in the tests/ folder.

I assume this is because the file is not directly referenced.

It doesn't happen in targets where upstream_prod is disabled.

upstream-prod version: 0.5.1

dbt-core:

  • installed: 1.4.9
  • latest: 1.7.6 - Update available!

Your version of dbt-core is out of date!
You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

Plugins:

  • sqlserver: 1.4.3 - Up to date!
  • synapse: 1.4.0 - Up to date!

macro 'dbt_macro__ref' takes no keyword argument 'v'

Describe the feature

With recent dbt versions it is possible to pass a version when calling ref, as referenced here. When using dbt Python models with upstream-prod, I hit a problem: dbt raises an error because the overridden ref macro does not accept the keyword argument 'v' (the model version).

Possible Solution

Add the arguments 'v' or 'version' to the ref macro.

Environment

- OS: Linux 22.04.2 LTS
- Python: 3.9.16
- dbt: 1.5.1
- upstream_prod: 0.5.0
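A minimal sketch of the suggested fix (hypothetical; the real upstream_prod ref takes more parameters): accept dbt's version keywords and forward them to the builtin ref:

```jinja
{% macro ref(parent_model, version=none, v=none) %}
    {% set _version = version if version is not none else v %}
    {% if _version is not none %}
        {# Versioned refs: delegate to dbt's builtin resolution #}
        {% do return(builtins.ref(parent_model, version=_version)) %}
    {% endif %}
    {# ...existing upstream_prod resolution logic... #}
{% endmacro %}
```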

CLI sending a large number of `True` and `False` messages to console when running `upstream-prod` on `dbt-bigquery`

Absolutely love this package, but the one issue is that the console now prints a large number of True and False messages when running dbt commands, including run, build, and compile. For example, see this (sanitized) output from the GitHub CI run for the PR where I integrated this package into my repo.

This example is from dbt-bigquery, so admittedly not a connector the package has been verified with.

Any idea what might be causing this issue? I might like to try my hand at submitting a fix if there's a likely culprit.

18:24:22  Found X models, Y tests, Z snapshot, A analyses, B macros, C operations, D seed files, E sources, F exposures, G metrics
18:24:22  
18:24:28  Concurrency: 2 threads (target='target_name')
18:24:28  
18:24:29  1 of X START sql table model schemaname.tablename  [RUN]
18:24:29  2 of X START sql table model schemaname.tablename  [RUN]
18:24:29  False
18:24:29  False
18:24:29  False
18:24:29  False
18:24:37  2 of X OK created sql table model schemaname.tablename  [CREATE TABLE (3.5m rows, 277.1 MB processed) in 8.86s]
18:24:37  3 of X START sql incremental model schemaname.tablename  [RUN]
18:24:37  False
18:24:38  False
18:24:38  False
18:24:57  3 of X OK created sql incremental model schemaname.tablename  [CREATE TABLE (3.7m rows, 7.9 GB processed) in 20.04s]
18:24:57  4 of X START sql incremental model schemaname.tablename  [RUN]
18:24:57  False
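One plausible culprit (a guess, not a confirmed diagnosis): a bare {{ ... }} expression somewhere in the macro chain. An expression statement writes its value into the rendered output, which dbt then echoes, while {% set %} and {% do %} evaluate silently. The variable name below is illustrative:

```jinja
{# This prints "True"/"False" into the rendered output #}
{{ dev_relation is not none }}

{# These evaluate silently #}
{% set dev_exists = dev_relation is not none %}
{% do log("dev_exists: " ~ dev_exists) %}
```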

Recreate all changed models in development environment

An interesting feature would be the option to always recreate modified models in the development environment.

Consider the dag:
model_1 -> model_2

Imagine you modified model_1 and model_2. If you run dbt run -s model_2, model_1 will be taken from the production environment, but in most cases you would want model_1 from the dev environment.

`check_reqd_vars` Contradictory Error Message

Version: 0.8.0

The error message contained in the code block below contradicts the actual logic within the if statement.

    {% if prod_database is none and prod_database_replace is none %}
        {% set error_msg -%}
upstream_prod has been provided with two incompatible variables. Only one of the following should be set:
- upstream_prod_database
- upstream_prod_database_replace
        {%- endset %}
        {% do exceptions.raise_compiler_error(error_msg) %}
    {% endif %}

Based on the error message, the condition should be testing that both variables are **not** none.

I was having some problems using this with the dbt-clickhouse adapter, and overriding the macro with a version containing `is not none` in the if statement allowed me to compile.
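For reference, the reconciliation the reporter describes (invert the test so it matches the message) would look like this sketch:

```jinja
{# Raise only when BOTH variables are set, as the message implies #}
{% if prod_database is not none and prod_database_replace is not none %}
    {% do exceptions.raise_compiler_error(error_msg) %}
{% endif %}
```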

`ref()` returns `None` when prod table doesn't exist even if `upstream_prod_fallack: True`

This might be a misunderstanding of how the package works so please let me know if this is just an unsupported feature.

Using dbt-bigquery I have a config like this in my profiles.yml for upstream_prod:

vars:
  # upstream-prod config
  upstream_prod_schema: prod_schema_prefix
  upstream_prod_enabled: True
  upstream_prod_fallack: True

Where prod_schema_prefix matches the prefix applied to the many prod schemas in our data warehouse (e.g. there are personal dev schemas like dbt_myname_staging and prod schemas like prod_schema_prefix_staging).

I have created a new table tbl1 and updated tbl2 to query from tbl1 like from {{ ref("tbl1") }}. When this compiles with upstream_prod_enabled: True, it compiles as from None.

The modified "ref" macro breaks when using "dbt run-operation"

In my team, we commonly use the macro generate_model_yaml from the package dbt-codegen in order to easily generate the yml documentation for a new model.

However, while running dbt run-operation generate_model_yaml --args '{"model_names": [<model_name>]}', we got the following error mentioning the overridden macro ref():

12:57:59  Running with dbt=1.3.2
12:58:00  Encountered an error while running operation: Compilation Error in macro ref (macros/ref.sql)
  'this' is undefined
  
  > in macro generate_model_yaml (macros/generate_model_yaml.sql)
  > called by macro ref (macros/ref.sql)

The problem is that in dbt run-operation the variable this does not exist.

The variable this is the default value for the current_model variable, but that only seems to be used in a Jinja log call.
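A defensive sketch (hypothetical, not the package's code): only reference this when it is defined, so run-operation contexts don't break:

```jinja
{# 'this' is undefined during `dbt run-operation`; guard before using it #}
{% set current_model = this if this is defined else "run-operation" %}
{% do log("[" ~ current_model ~ "] resolving ref") %}
```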

Error with `prefer_recent` variable in ref.sql

Context

I want to configure whether to prefer the most recent version of a model (even if it lives in dev), which can be done with the upstream_prod_prefer_recent variable.

Problem

Even with the variable set to False, and without the most up-to-date tables in my staging, the package still references the tables in my staging.

For example: since prefer_recent is False and dev_exists is True, the if branch should not be entered, yet it is.

When I tested by comparing the variable explicitly against False, the if branch was not entered.
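If the variable reaches the macro as the string "False" (for example via --vars on the CLI) rather than a boolean, Jinja truthiness would explain this behaviour, since any non-empty string is truthy. A small illustration (variable names are made up):

```jinja
{% set prefer_recent = "False" %}  {# arrives as a string, not a boolean #}

{% if prefer_recent %}
    {# This branch runs: non-empty strings are truthy in Jinja #}
{% endif %}

{% if prefer_recent == true %}
    {# This branch does not run: explicit comparison against a boolean #}
{% endif %}
```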

Error When Using ref.sql Macro

Hello,

I recently installed upstream_prod in my dbt/snowflake instance. Our team uses custom schemas, and I was able to effectively load production data into my dev environment by replacing the refs in the model with upstream_prod.ref. However, I'm now encountering an error when attempting to overwrite the default macro with ref.sql. Specifically, I'm receiving a 'dbt_macro__ref' takes no more than 6 argument(s) error. Could you potentially provide some guidance on how to navigate this issue? Thank you.

[error screenshot]

Ephemeral models are not supported

If you add this package to a project and do a full run, then when it hits a model with an upstream ephemeral model it tries to reference the ephemeral model directly, instead of inserting it into the compiled code as a CTE like the normal adapter would.
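One possible guard (a sketch; `parent_node` and `parent_model` naming is illustrative) would pass ephemeral parents straight to the builtin ref so dbt still inlines them as CTEs:

```jinja
{# Never redirect ephemeral models to prod; let dbt inline them as CTEs #}
{% if parent_node.config.materialized == "ephemeral" %}
    {% do return(builtins.ref(parent_model)) %}
{% endif %}
```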

BigQuery breaks in `get_table_update_ts` due to missing backticks

get_table_update_ts.sql queries the information schema several times, for example here:

{{ relation.database }}.information_schema.tables

The queries fail on BigQuery because it expects backticks. For example, changing

{{ relation.database }}.information_schema.tables

to

`{{ relation.database }}.information_schema.tables`

fixes it. I guess one way to handle this could be a separate condition for BigQuery?
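That could be expressed as a branch on the adapter type (a sketch, assuming the surrounding macro can see target.type; the fix mirrors the backticked path shown above):

```jinja
{% if target.type == "bigquery" %}
    {# BigQuery requires backticks around the qualified path #}
    {% set info_schema = "`" ~ relation.database ~ ".information_schema.tables`" %}
{% else %}
    {% set info_schema = relation.database ~ ".information_schema.tables" %}
{% endif %}
```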

redshift adapter.get_relation always returns null?

Hello!

Thank you for creating this project, it's exactly what my team and I were looking for! I'm trying to get this set up on our Redshift cluster but I'm hitting an error: my adapter.get_relation call keeps returning None. I've double-checked the database, schema, and identifier inputs and everything looks good; it just fails to find the relation. Is there something extra I need to set up in my profiles or manifests that I'm missing? In my dbt project I've set upstream_prod_database: stg, and we're using a custom generate_schema_name macro so I'm not specifying anything on the schema front.

When I carve up the package and do something like {% set prod_ref = parent_ref | replace(parent_ref.database, prod_database) %}, it works fine. So I'm guessing there's something about get_relation specifically that I'm not understanding.

Package not working with metrics

[Using dbt 1.5, upstream-prod 0.7.1, dbt-labs/metrics 1.5.0]

Hi there! We've been using your package for over a year now and we are really happy with it 😍... but we are facing an issue.
The override of the ref function does not seem to work when using the dbt-labs/metrics package, especially with the metrics.calculate macro.

The macro generates a CTE named like model_5136b70a98f4d9c12207d67556e6ef01__aggregate in the compiled code, and inside this CTE the referenced table is always from the dev environment, never from production.

Has anyone else raised this issue? Would you know a trick to correct it? I'm trying to reverse engineer the dbt-labs/metrics code but I'm struggling to understand what is happening.

Thanks for your help 😄

--no-partial-parse flag fails with this library enabled on branch

Hi @LewisDavies,

Firstly huge thanks on your work pulling this package together, it's a great tool and helps us with workflows!

A small bug I noticed whilst trying to get this working inside a CI workflow:

dbt --no-partial-parse compile -t dev_ci fails, whereas dbt compile -t dev_ci works fine.

The error looks like this:
[error screenshot]

This happens on basically all the models (try dbt -d --no-partial-parse compile -t dev_ci to see the full error set).

In this case dev_ci is a target with upstream prod enabled.

Is this behaviour known? If so, I could raise a small PR to the README to document it. Otherwise, any ideas?

'dict object' has no attribute 'file_key_name' in macro ref (macros/ref.sql)

Hi there,
We updated upstream-prod to version 0.6.1 a few days ago; since then, dbt compile fails with an error.
The issue does not happen when the package is disabled.

ERROR:

20:12:23  Running with dbt=1.5.1
20:12:32  Found 449 models, 645 tests, 4 snapshots, 2 analyses, 1144 macros, 3 operations, 0 seed files, 139 sources, 35 exposures, 0 metrics, 0 groups
20:12:32  
20:12:32  Concurrency: 4 threads (target='dev')
20:12:32  
20:12:35  Encountered an error:
Runtime Error
  Compilation Error in model int_active_...  (folder/int_active_...sql)
    'dict object' has no attribute 'file_key_name'
    
    > in macro ref (macros/ref.sql)
    > called by macro ref (macros/ref.sql)
    > called by model int_active_... (folder/int_active_...sql)

P.S.: I've hidden the model and folder names because the error shows different model names when different analysts run it.
