Comments (4)

drewbanin commented on July 3, 2024

Thanks for the suggestion @tongqqiu! I love the idea of being able to use delta-specific DML in an execution environment like Databricks.

dbt supports incremental strategies, which define how incremental models should be built. I imagine the default could be insert_overwrite, but users could configure their models to use merge instead. I like that this would support both vanilla Spark and Databricks runtimes.

So, the work to do here is really just adding the merge logic to the incremental flow. That should look something like: https://github.com/fishtown-analytics/dbt/blob/dev/louisa-may-alcott/core/dbt/include/global_project/macros/materializations/common/merge.sql#L12-L35
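
To make the merge idea concrete, here is a minimal Python sketch of the kind of statement that merge logic would need to render. It is not the linked dbt macro: the function name `render_merge_sql`, the aliases `dest` and `src`, and the example table names are all hypothetical, and it assumes a table format that supports `merge into`, such as Delta.

```python
from typing import Sequence


def render_merge_sql(
    target: str, source: str, unique_key: str, columns: Sequence[str]
) -> str:
    """Render an upsert from `source` into `target`, matching rows on `unique_key`.

    Hypothetical helper for illustration only; not part of dbt or dbt-spark.
    """
    # Matched rows are updated column by column from the source row.
    update_set = ", ".join(f"{col} = src.{col}" for col in columns)
    # Unmatched rows are inserted as new rows.
    insert_cols = ", ".join(columns)
    insert_vals = ", ".join(f"src.{col}" for col in columns)
    return (
        f"merge into {target} as dest\n"
        f"using {source} as src\n"
        f"on src.{unique_key} = dest.{unique_key}\n"
        f"when matched then update set {update_set}\n"
        f"when not matched then insert ({insert_cols}) values ({insert_vals})"
    )


if __name__ == "__main__":
    # Example names are made up for the sketch.
    print(render_merge_sql("analytics.orders", "orders__tmp", "order_id",
                           ["order_id", "status"]))
```

The point of the sketch is that a merge needs a unique key to match on, which insert_overwrite does not, so a merge strategy would presumably require that extra piece of model configuration.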

Is this something you're interested in contributing? We're super happy to help out if so!

from dbt-spark.

tongqqiu commented on July 3, 2024

@drewbanin When a model is materialized as a "table", the current behavior is to drop and recreate the table. Since Spark doesn't support transactions, dropping the table first is not safe. An alternative is to use the "insert overwrite" statement: https://docs.databricks.com/spark/latest/spark-sql/language-manual/insert.html. It is similar to what you did for the incremental type, just without the partitions. It keeps the table live, and the delta format ensures ACID guarantees at the single-table level as well. Any suggestions on how to make that change? BTW, setting the file format to delta works as well as the default parquet.
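
A rough Python sketch of the difference between the two approaches, under stated assumptions: the helper names `drop_and_create` and `insert_overwrite`, the `file_format` parameter, and the example relation are hypothetical and are not dbt-spark's actual implementation. It only shows the statements each approach would issue.

```python
from typing import List


def drop_and_create(relation: str, select_sql: str, file_format: str = "delta") -> List[str]:
    """Two statements: the table does not exist between the first and the second."""
    return [
        f"drop table if exists {relation}",
        f"create table {relation} using {file_format} as {select_sql}",
    ]


def insert_overwrite(relation: str, select_sql: str) -> List[str]:
    """One statement: the existing table is kept and its contents are replaced."""
    return [f"insert overwrite table {relation} {select_sql}"]


if __name__ == "__main__":
    query = "select id, status from raw.orders"  # made-up query for the sketch
    for stmt in drop_and_create("analytics.orders", query):
        print(stmt)
    for stmt in insert_overwrite("analytics.orders", query):
        print(stmt)
```

With drop and create, a reader querying between the two statements finds no table. With insert overwrite there is a single statement, and on a delta table that statement is applied as one commit.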

jtcohen6 commented on July 3, 2024

Hey @tongqqiu, to follow up on this issue:

  • merge as an incremental strategy was added in #65 and included in the 0.15.3 release
  • You have an open PR (#66) to use the Delta merge functionality to add support for dbt snapshot

As far as the table materialization:

  • I hear your point about wanting to use insert overwrite instead of drop + create for atomic table replacement. We discussed a bit more here.
  • The main issue with using insert_overwrite in the general case is that it cannot handle changes to column names or data types. One of the core propositions of the table materialization is that it fully wipes the slate and creates the model from scratch, regardless of whether a preexisting version exists or what it looked like.
  • I think the atomic replacement you're suggesting is possible with the dbt-spark plugin today: If you know your model will not undergo any structural change, you could materialize the model as incremental, pick any arbitrary column to partition by, and re-select all the data in every run.
  • In the long run, I believe our best answer is to use create or replace table, which we understand to be coming in Spark 3.0.
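
The trade-off in the points above can be sketched as a small decision function in Python. Everything here is a hypothetical illustration: `plan_table_replacement`, `schema_unchanged`, the column dictionaries, and the `create_or_replace_available` flag are not dbt-spark code, and the `create or replace table` branch assumes the runtime and table format support that statement.

```python
from typing import Dict, List


def schema_unchanged(existing: Dict[str, str], incoming: Dict[str, str]) -> bool:
    """True when column names, types, and column order are all identical."""
    return list(existing.items()) == list(incoming.items())


def plan_table_replacement(
    relation: str,
    select_sql: str,
    existing: Dict[str, str],
    incoming: Dict[str, str],
    create_or_replace_available: bool = False,
) -> List[str]:
    """Pick the statements used to rebuild `relation` from `select_sql`."""
    if create_or_replace_available:
        # Single statement that also copes with structural changes.
        return [f"create or replace table {relation} using delta as {select_sql}"]
    if schema_unchanged(existing, incoming):
        # Safe only because the column names and types did not change.
        return [f"insert overwrite table {relation} {select_sql}"]
    # Structural change: fall back to wiping the slate, which is not atomic.
    return [
        f"drop table if exists {relation}",
        f"create table {relation} using delta as {select_sql}",
    ]


if __name__ == "__main__":
    old = {"id": "bigint", "status": "string"}
    new = {"id": "bigint", "state": "string"}  # renamed column
    print(plan_table_replacement("analytics.orders", "select 1", old, new))
```

The sketch makes the constraint explicit: insert overwrite is only a drop-in replacement when the model's structure is stable, which is why it does not fit the table materialization in the general case.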

tongqqiu commented on July 3, 2024

@jtcohen6 Sounds all good to me.
