Comments (8)

drewbanin avatar drewbanin commented on July 3, 2024

hoo boy - this is rough.

dbt drops + recreates tables in order to:

  1. refresh the underlying data
  2. make the schema of the table consistent with the associated model code

While it sounds like we could do a delete + insert to refresh the data in the table, that doesn't solve the problem of schema updates. It's certainly possible to run a series of alter statements to synchronize the model code (maybe rendered into a temp table) and the destination table, but really, that sounds error prone and ineffective to me.
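
To make the "series of alter statements" concern concrete, here is a minimal sketch of what such a sync step might have to compute. It is not how dbt works; `plan_schema_sync` and the column mappings are hypothetical, and the only DDL it emits is `alter table ... add columns`. Anything else (type changes, removed columns) is merely reported as unresolved, which is roughly where the error-prone part would start.

```python
from typing import Dict, List, Tuple


def plan_schema_sync(
    table: str,
    existing: Dict[str, str],
    desired: Dict[str, str],
) -> Tuple[List[str], List[str]]:
    """Compare two {column: type} mappings (hypothetical helper).

    Returns (statements, unresolved): ALTER statements for columns that can
    simply be added, plus the differences this sketch cannot reconcile.
    """
    statements: List[str] = []
    unresolved: List[str] = []

    # Columns the model wants: add the missing ones, flag type mismatches.
    for column, col_type in desired.items():
        if column not in existing:
            statements.append(
                f"alter table {table} add columns ({column} {col_type})"
            )
        elif existing[column] != col_type:
            unresolved.append(
                f"type change on {column}: {existing[column]} -> {col_type}"
            )

    # Columns the table still has but the model no longer produces.
    for column in existing:
        if column not in desired:
            unresolved.append(f"column no longer in model: {column}")

    return statements, unresolved


statements, unresolved = plan_schema_sync(
    "dbt_alice.fct_orders",
    existing={"order_id": "bigint", "amount": "int", "legacy_flag": "boolean"},
    desired={"order_id": "bigint", "amount": "decimal(10,2)", "status": "string"},
)
```

In this made-up example only one difference maps cleanly to an `alter` statement; the other two need a decision that a drop + recreate never has to make.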

The linked SO post references eventual consistency on s3. Are you setting an s3 path for these delta tables? Or do they live in dbfs? I didn't realize this, but it sounds like dbfs persists data to s3, so maybe this could happen either way?

I'm sort of at the edge of my understanding of how all of this works, so please correct me if I'm way off here.

Idea 1:
Would it be possible to use a timestamp suffix for the s3 location? I think this should circumvent any issues with eventual consistency, since files in s3 would never be overwritten.

I think this might not apply to delta, but seems relevant: https://docs.databricks.com/user-guide/tables.html#simple-way-to-replace-table-contents
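
A rough sketch of Idea 1, assuming the location is built by the caller before the `create table` statement is rendered. The helper name `timestamped_location` and the bucket path are made up for illustration; nothing here is an existing dbt-spark setting.

```python
from datetime import datetime, timezone


def timestamped_location(root: str, table: str, now: datetime) -> str:
    """Build a storage path that is unique per run (hypothetical helper).

    Each run writes to a fresh prefix, so no existing object is overwritten.
    """
    suffix = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{root.rstrip('/')}/{table}_{suffix}"


run_started_at = datetime(2019, 6, 1, 12, 30, 0, tzinfo=timezone.utc)
location = timestamped_location(
    "s3://example-bucket/warehouse/",  # assumed bucket, not from the thread
    "fct_orders",
    run_started_at,
)
```

The obvious cost is that old prefixes pile up and would need to be cleaned out separately.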

Idea 2:

via the delta docs:

Delta Lake lets you update the schema of a table. The following types of changes are supported:

  • Adding new columns (at arbitrary positions)
  • Reordering existing columns

You can make these changes explicitly using DDL or implicitly using DML.

What does that mean?? The docs are light here, but can you just insert data with an arbitrary schema into a delta table? I'm very curious to know what implicit schema updates with DML look like in practice.

from dbt-spark.

jtcohen6 avatar jtcohen6 commented on July 3, 2024

Per Michael Armbrust from Databricks:

  • This error occurs because "the transaction log is flickering in and out of existence"
  • The answer: create or replace table is coming in Spark v3!
    • It will still be slow for managed tables, since data needs to be physically moved
    • The trade-off is between a quick non-atomic metadata operation (external tables) or a slow atomic operation (managed tables)

qsbao avatar qsbao commented on July 3, 2024

dbt drops + recreates tables in order to:

  1. refresh the underlying data
  2. make the schema of the table consistent with the associated model code

Hi @drewbanin, I have a small question: If the table is an external table, drop table -> create table -> insert will append data instead of refresh data?

jtcohen6 avatar jtcohen6 commented on July 3, 2024

If the table is an external table, drop table -> create table -> insert will append data instead of refresh data?

This is true, and I don't see a great way around it. We may need to take genuinely different steps in our materializations that differentiate between managed and external tables. What would you recommend @qsbao?

qsbao avatar qsbao commented on July 3, 2024

This is true, and I don't see a great way around it. We may need to take genuinely different steps in our materializations that differentiate between managed and external tables. What would you recommend @qsbao?

I don't have any suggestions right now. I'm new to dbt and asked this question just to understand things more clearly. Thank you for your reply.

The following is what I observed:

  1. If the model is an external table, we use drop table -> CTAS to recreate the table. I checked that this correctly refreshes the data.

     models:
       jaffle_shop:
         materialized: table
         file_format: parquet
         location_root: /user/qsbao

     drop table if exists dbt_alice.fct_orders

     create table dbt_alice.fct_orders
         using parquet
         location '/user/qsbao/fct_orders'
         as
     ...

  2. But if the seed is an external table, we use drop table -> create table -> insert into to recreate the table. This appends data instead of refreshing it; see #112.
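
The seed behaviour above can be illustrated with a toy model. This is not Spark: `ToyWarehouse` is a made-up class that only encodes the assumption from this thread that dropping an external table removes its metadata while the files at its location stay put. The seed name and location are hypothetical.

```python
from typing import Dict, List


class ToyWarehouse:
    """Toy model: a catalog of table names plus rows stored per location."""

    def __init__(self) -> None:
        self.catalog: Dict[str, str] = {}      # table name -> location
        self.files: Dict[str, List[str]] = {}  # location -> rows "on disk"

    def create_external(self, table: str, location: str) -> None:
        # Registering an external table leaves existing files untouched.
        self.catalog[table] = location
        self.files.setdefault(location, [])

    def drop_external(self, table: str) -> None:
        # Only the catalog entry goes away; the files remain at the location.
        self.catalog.pop(table, None)

    def insert_into(self, table: str, rows: List[str]) -> None:
        self.files[self.catalog[table]].extend(rows)

    def select_all(self, table: str) -> List[str]:
        return list(self.files[self.catalog[table]])


def load_seed(wh: ToyWarehouse, table: str, location: str, rows: List[str]) -> None:
    """drop table -> create table -> insert into, as described for seeds."""
    wh.drop_external(table)
    wh.create_external(table, location)
    wh.insert_into(table, rows)


wh = ToyWarehouse()
seed_rows = ["alice", "bob"]
load_seed(wh, "dbt_alice.raw_customers", "/user/qsbao/raw_customers", seed_rows)
load_seed(wh, "dbt_alice.raw_customers", "/user/qsbao/raw_customers", seed_rows)
rows_after_second_run = wh.select_all("dbt_alice.raw_customers")
```

After the second run the toy table holds the seed rows twice, which is the append-instead-of-refresh effect reported in #112.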

jtcohen6 avatar jtcohen6 commented on July 3, 2024

@qsbao Thanks for looking into that! It sounds like we're in okay shape for models, which are always CTAS. That also gives me hope for a workaround to the seed issue (#112).

Fokko avatar Fokko commented on July 3, 2024

I think we can close this issue. When using delta as the file format, we now use create or replace table:
https://github.com/fishtown-analytics/dbt-spark/blob/6ad164b315748fef7c0ae0b87ff6b8292632f35e/dbt/include/spark/macros/adapters.sql#L81

This should fix the issues with the transaction log.
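
For readers who don't want to open the macro, the branching can be sketched roughly as follows. This is an illustration in Python, not the Jinja macro from the link; `create_table_statements` is a hypothetical name and the generated SQL is simplified (no location, partitioning, or other clauses).

```python
from typing import List


def create_table_statements(
    relation: str, file_format: str, select_sql: str
) -> List[str]:
    """Return the statements to (re)build a table model (hypothetical sketch)."""
    if file_format == "delta":
        # One statement: the table is swapped without a separate drop step.
        return [
            f"create or replace table {relation} using delta as {select_sql}"
        ]
    # Other formats: fall back to drop + CTAS, as discussed earlier in the thread.
    return [
        f"drop table if exists {relation}",
        f"create table {relation} using {file_format} as {select_sql}",
    ]


delta_plan = create_table_statements(
    "dbt_alice.fct_orders", "delta", "select * from dbt_alice.stg_orders"
)
parquet_plan = create_table_statements(
    "dbt_alice.fct_orders", "parquet", "select * from dbt_alice.stg_orders"
)
```

`dbt_alice.stg_orders` is a placeholder source relation used only to make the example complete.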

jtcohen6 avatar jtcohen6 commented on July 3, 2024

Ooh thanks @Fokko! Good call here
