Comments (4)
Thanks for the suggestion @tongqqiu! I love the idea of being able to use Delta-specific DML in an execution environment like Databricks.
dbt has the ability to define incremental strategies that determine how incremental models should be built. I imagine the default could be `insert_overwrite`, but users could configure their models to use `merge` instead. I like that this would support both vanilla Spark and Databricks runtimes.
So, the work to do here is really just adding the `merge` logic to the incremental flow. That should look something like: https://github.com/fishtown-analytics/dbt/blob/dev/louisa-may-alcott/core/dbt/include/global_project/macros/materializations/common/merge.sql#L12-L35
Is this something you're interested in contributing? We're super happy to help out if so!
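For illustration, here is a minimal sketch of how a model might opt into the `merge` strategy described above. The model name `events`, the key `event_id`, and the column `updated_at` are hypothetical, and the config keys assume the names dbt-spark uses for these settings.

```sql
-- models/events_merged.sql (hypothetical model)
{{ config(
    materialized='incremental',
    incremental_strategy='merge',   -- instead of the default insert_overwrite
    file_format='delta',            -- merge relies on Delta tables
    unique_key='event_id'           -- hypothetical key used to match rows
) }}

select *
from {{ ref('events') }}            -- hypothetical upstream model

{% if is_incremental() %}
-- On incremental runs, only pick up rows newer than what is already loaded.
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

With a setup like this, rows whose `event_id` already exists in the target would be updated and new rows inserted, rather than whole partitions being overwritten.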
from dbt-spark.
@drewbanin When a model is materialized as a "table", the current behavior is to drop and re-create the table. Since Spark doesn't support transactions, dropping the table first is risky. The alternative is to use an `INSERT OVERWRITE` statement: https://docs.databricks.com/spark/latest/spark-sql/language-manual/insert.html. It is similar to what you did for the incremental type, it just doesn't need partitions. It keeps the table live, and the Delta format ensures ACID guarantees at the single-table level as well. Any suggestions on how to make that change? BTW, setting the file format to delta works well, just like the default parquet.
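To make the difference concrete, here is a rough sketch of the two approaches in Spark SQL. The table names `analytics.my_model` and `staging.my_source` and the columns are hypothetical, and the statements are only an approximation of what a materialization would run.

```sql
-- Hypothetical names: analytics.my_model (target), staging.my_source (input).

-- Drop-and-create: between the two statements the table does not exist,
-- and nothing wraps them in a transaction.
drop table if exists analytics.my_model;
create table analytics.my_model
using delta
as select id, amount from staging.my_source;

-- Overwrite in place: the table stays defined the whole time.
insert overwrite table analytics.my_model
select id, amount from staging.my_source;
```

The second form assumes the table already exists with a matching schema, so a first run would still need a `create table`.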
Hey @tongqqiu, to follow up on this issue:
- `merge` as an incremental strategy was added in #65 and included in the 0.15.3 release
- You have an open PR (#66) to use the Delta `merge` functionality to add support for `dbt snapshot`

As far as the `table` materialization:
- I hear your point about wanting to use `insert overwrite` instead of `drop` + `create` for atomic table replacement. We discussed a bit more here.
- The main issue with using `insert_overwrite` in the general case is that it cannot handle changes to column names or data types. One of the core propositions of the `table` materialization is that it fully wipes the slate and creates the model from scratch, no matter whether/what the preexisting version looked like.
- I think the atomic replacement you're suggesting is possible with the `dbt-spark` plugin today: If you know your model will not undergo any structural change, you could materialize the model as `incremental`, pick any arbitrary column to partition by, and re-select all the data in every run.
- In the long run, I believe our best answer is to use `create or replace table`, which we understand to be coming in Spark 3.0.
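The workaround in the third point could be sketched roughly as follows. The model name `orders`, the partition column `order_date`, and the choice of parquet are assumptions for illustration only.

```sql
-- models/orders_replaced.sql (hypothetical model)
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by=['order_date'],    -- any arbitrary column, hypothetical here
    file_format='parquet'
) }}

-- No is_incremental() filter: every run re-selects all the data,
-- so every partition present in the result is overwritten.
select *
from {{ ref('orders') }}            -- hypothetical upstream model
```

Depending on the partition overwrite mode in effect, partitions that no longer appear in the new result may be left in place, so this only approximates a full table replacement.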
@jtcohen6 Sounds all good to me.