Support delta lake format about dbt-spark HOT 4 CLOSED

dbt-labs commented on July 3, 2024

Support delta lake format

from dbt-spark.

Comments (4)

drewbanin commented on July 3, 2024

Thanks for the suggestion @tongqqiu! I love the idea of being able to use delta-specific DML in an execution environment like Databricks.

dbt lets you define incremental strategies that control how incremental models are built. I imagine the default could be insert_overwrite, but users could configure their models to use merge instead. I like that this would support both vanilla Spark and Databricks runtimes.

So, the work to do here is really just adding the merge logic to the incremental flow. That should look something like: https://github.com/fishtown-analytics/dbt/blob/dev/louisa-may-alcott/core/dbt/include/global_project/macros/materializations/common/merge.sql#L12-L35
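
For illustration, the kind of statement a merge strategy would need to emit can be sketched roughly as follows. This is a minimal Python sketch that only renders the SQL text. The function name `build_merge_sql`, the aliases `dest` and `src`, and the example table names are assumptions made for this example, not part of dbt or dbt-spark.

```python
def build_merge_sql(target, source, unique_key, columns):
    """Render a Delta-style MERGE statement as a string (illustrative only).

    All names here are hypothetical; dbt itself does this in a Jinja macro.
    """
    # Matched rows: overwrite every column with the incoming value.
    updates = ", ".join(f"{col} = src.{col}" for col in columns)
    # Unmatched rows: insert all columns from the source.
    col_list = ", ".join(columns)
    values = ", ".join(f"src.{col}" for col in columns)
    return (
        f"merge into {target} as dest\n"
        f"using {source} as src\n"
        f"on dest.{unique_key} = src.{unique_key}\n"
        f"when matched then update set {updates}\n"
        f"when not matched then insert ({col_list}) values ({values})"
    )


if __name__ == "__main__":
    print(build_merge_sql("analytics.events", "events__tmp", "id", ["id", "value"]))
```

The update and insert branches both cover every column, which mirrors a plain upsert on a single unique key. A real strategy would also need to handle composite keys and quoting.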

Is this something you're interested in contributing? We're super happy to help out if so!

tongqqiu commented on July 3, 2024

@drewbanin When a model is materialized as a "table", the current behavior is to drop and recreate the table. Since Spark doesn't support transactions, dropping the table first is risky. An alternative is to use an "insert overwrite" statement: https://docs.databricks.com/spark/latest/spark-sql/language-manual/insert.html. It is similar to what you did for the incremental type, just without partitions. It keeps the table live, and the delta format ensures ACID guarantees at the single-table level as well. Any suggestions on how to make that change? BTW, setting the file format to delta works well, just like the default parquet.
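
As a rough sketch of the statement being proposed, the snippet below renders an unpartitioned `insert overwrite` around a model's select. The helper name `build_insert_overwrite_sql` and the table name are hypothetical, and it assumes the target table already exists with a schema matching the select.

```python
def build_insert_overwrite_sql(table, select_sql):
    """Render an unpartitioned INSERT OVERWRITE statement (illustrative only).

    Assumes `table` already exists and its columns match `select_sql`.
    """
    # The table is never dropped; its contents are replaced in one statement.
    return f"insert overwrite table {table}\n{select_sql.strip()}"


if __name__ == "__main__":
    print(build_insert_overwrite_sql("analytics.my_model", "select 1 as id"))
```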

jtcohen6 commented on July 3, 2024

Hey @tongqqiu, to follow up on this issue:

merge as an incremental strategy was added in #65 and included in the 0.15.3 release
You have an open PR (#66) to use the Delta merge functionality to add support for dbt snapshot

As far as the table materialization:

I hear your point about wanting to use insert overwrite instead of drop + create for atomic table replacement. We discussed a bit more here.
The main issue with using insert_overwrite in the general case is that it cannot handle changes to column names or data types. One of the core propositions of the table materialization is that it fully wipes the slate and creates the model from scratch, no matter whether/what the preexisting version looked like.
I think the atomic replacement you're suggesting is possible with the dbt-spark plugin today: If you know your model will not undergo any structural change, you could materialize the model as incremental, pick any arbitrary column to partition by, and re-select all the data in every run.
In the long run, I believe our best answer is to use create or replace table, which we understand to be coming in Spark 3.0.
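
To make the last point concrete, here is a minimal sketch of what a `create or replace table` statement for a model could look like, rendered from Python. The function name `build_create_or_replace_sql`, its `file_format` parameter, and the table name are assumptions for illustration; whether a given Spark runtime accepts the statement depends on its version and table format.

```python
def build_create_or_replace_sql(table, select_sql, file_format="delta"):
    """Render a CREATE OR REPLACE TABLE ... AS SELECT statement (illustrative only).

    Unlike insert overwrite, this form redefines the table from the select,
    so column names and types may change between runs.
    """
    return (
        f"create or replace table {table}\n"
        f"using {file_format}\n"
        f"as\n"
        f"{select_sql.strip()}"
    )


if __name__ == "__main__":
    print(build_create_or_replace_sql("analytics.my_model", "select 1 as id"))
```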

tongqqiu commented on July 3, 2024

@jtcohen6 Sounds all good to me.
