Comments (8)
hoo boy - this is rough.
dbt drops + recreates tables in order to:
- refresh the underlying data
- make the schema of the table consistent with the associated model code
While it sounds like we could do a delete + insert to refresh the data in the table, that doesn't solve the problem of schema updates. It's certainly possible to run a series of alter statements to synchronize the model code (maybe rendered into a temp table) and the destination table, but really, that sounds error-prone and ineffective to me.
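Just to make that gap concrete, here's a minimal sketch of the delete + insert refresh (assuming a file format that supports DELETE, like delta; the table name is illustrative):
-- clears and reloads the rows, but the table's columns stay exactly as they were
delete from dbt_alice.fct_orders;
insert into dbt_alice.fct_orders
select ...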
The linked SO post references eventual consistency on s3. Are you setting an s3 path for these delta tables? Or do they live in dbfs? I didn't realize this, but it sounds like dbfs persists data to s3, so maybe this could happen either way?
I'm sort of at the edge of my understanding of how all of this works, so please correct me if I'm way off here.
Idea 1:
Would it be possible to use a timestamp suffix for the s3 location? I think this should circumvent any issues with eventual consistency, since files in s3 would never be overwritten.
I think this might not apply to delta, but seems relevant: https://docs.databricks.com/user-guide/tables.html#simple-way-to-replace-table-contents
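To make Idea 1 concrete, a rough sketch (the bucket, path, and timestamp format here are all made up):
-- each run writes to a fresh, timestamped prefix, so no existing s3 object is ever overwritten
drop table if exists dbt_alice.fct_orders;
create table dbt_alice.fct_orders
using parquet
location 's3://my-bucket/dbt/fct_orders/20200101_120000/'   -- suffix changes on every run
as
select ...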
Idea 2:
via the delta docs:
Delta Lake lets you update the schema of a table. The following types of changes are supported:
- Adding new columns (at arbitrary positions)
- Reordering existing columns
You can make these changes explicitly using DDL or implicitly using DML.
What does that mean?? The docs are light here, but can you just insert data with an arbitrary schema into a delta table? I'm very curious to know what implicit schema updates with DML look like in practice.
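My best guess, and it is only a guess: Delta has an automatic schema merging switch, so a DML statement whose source carries extra columns can evolve the target table instead of failing. Something along these lines (config name and table names are my assumptions, not taken from these docs):
-- the source has a column the target doesn't; with auto-merge on, the merge adds it to the target schema
set spark.databricks.delta.schema.autoMerge.enabled = true;

merge into dbt_alice.fct_orders as target
using new_orders as source
on target.order_id = source.order_id
when matched then update set *
when not matched then insert *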
from dbt-spark.
Per Michael Armbrust from Databricks:
- This error occurs because "the transaction log is flickering in and out of existence"
- The answer: create or replace table is coming in Spark v3!
- It will still be slow for managed tables, since data needs to be physically moved
- The trade-off is between a quick non-atomic metadata operation (external tables) or a slow atomic operation (managed tables)
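Spelling out that trade-off as I read it (table name illustrative):
-- today: two separate statements, so a concurrent reader can hit the window where the table doesn't exist
-- (the "transaction log flickering in and out of existence")
drop table if exists dbt_alice.fct_orders;
create table dbt_alice.fct_orders using delta as select ...;

-- spark 3: a single atomic statement; still slow for managed tables because the data is physically rewritten,
-- but there's never a moment with no table at all
create or replace table dbt_alice.fct_orders using delta as select ...;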
from dbt-spark.
dbt drops + recreates tables in order to:
- refresh the underlying data
- make the schema of the table consistent with the associated model code
Hi @drewbanin , I have a small question: if the table is an external table, will drop table -> create table -> insert append data instead of refreshing the data?
from dbt-spark.
If the table is an external table, will drop table -> create table -> insert append data instead of refreshing the data?
This is true, and I don't see a great way around it. We may need to take genuinely different steps in our materializations that differentiate between managed and external tables. What would you recommend @qsbao?
from dbt-spark.
This is true, and I don't see a great way around it. We may need to take genuinely different steps in our materializations that differentiate between managed and external tables. What would you recommend @qsbao?
I don't have any suggestions right now. I'm new to dbt and asked this question just to understand more clearly. Thank you for your reply.
The following is what I observed:
- If the model is an external table, we use drop table -> CTAS to recreate the table. I checked that this correctly refreshes the data:
models:
  jaffle_shop:
    materialized: table
    file_format: parquet
    location_root: /user/qsbao
drop table if exists dbt_alice.fct_orders
create table dbt_alice.fct_orders
using parquet
location '/user/qsbao/fct_orders'
as
...
- But if the seed is an external table, we use drop table -> create table -> insert into to recreate the table. This appends data instead of refreshing it (see the sketch below), see #112 .
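Here's the sketch of why that path misbehaves (names made up): dropping an external table removes only the metadata, the files at the location survive, so the recreated table sees the old rows again and the insert stacks on top of them.
drop table if exists dbt_alice.raw_payments;              -- metadata gone, files under /user/qsbao/raw_payments remain
create table dbt_alice.raw_payments (id int, amount int)
using parquet
location '/user/qsbao/raw_payments';                      -- same path, so the old rows come right back
insert into dbt_alice.raw_payments values (1, 100);       -- lands on top of the previous load instead of replacing it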
from dbt-spark.
@qsbao Thanks for looking into that! It sounds like we're in okay shape for models, which are always CTAS. That also gives me hope for a workaround to the seed issue (#112).
from dbt-spark.
I think we can close this issue. When using delta as the file format, we now use create or replace table:
https://github.com/fishtown-analytics/dbt-spark/blob/6ad164b315748fef7c0ae0b87ff6b8292632f35e/dbt/include/spark/macros/adapters.sql#L81
This should fix the issues with the transaction log.
from dbt-spark.
Ooh thanks @Fokko! Good call here
from dbt-spark.
Related Issues (20)
- [ADAP-920] [ADAP-919] [Bug] Delta table metadata changed/concurrent update HOT 1
- [ADAP-930] [Feature] Implement relation filtering on get_catalog macro
- [ADAP-931] [Bug] Values in seeds that should convert to `null` aren't working for `session` connection method HOT 3
- [ADAP-946] [CT-2689] [Bug] Incremental models ran from scratch when created in a pre-hook HOT 9
- [ADAP-955] [Feature] Add debug logging for driver/connector packages HOT 2
- [ADAP-970] [Feature] Incremental updates should update table description HOT 3
- [ADAP-999] [Feature] add support for Apache Paimon format HOT 1
- [ADAP-1012] Support for new agate data type in Spark
- [ADAP-1018] [Feature] Remove Databricks test profiles from integration tests HOT 2
- [ADAP-1019] [Bug] Table already exists, you need to drop it first in incremental models HOT 1
- [ADAP-1038] [Tests] Add tests for --empty flag
- [ADAP-1048] [Bug] Replacing existing table using incremental model HOT 1
- [ADAP-1071] [Bug] `latest` and `1.x.latest` tags for ghcr Docker releases are stale HOT 1
- [ADAP-1074] [Implementation] Remove `invalid_insert_overwrite_delta_msg` message
- [ADAP-1085] [Bug] When using iceberg format, dbt docs generate is unable to populate the columns information HOT 1
- [ADAP-1093] [Feature] Run integration tests against all supported python versions
- [Feature] Support HTTP transport protocol for Thrift method
- [Feature] Support OCI Dataflow as a backend for dbt-spark
- `dbt-core` Dockerfile does not work for `dbt-spark` due to `PyHive` HOT 2
- [Bug] CI is broken on `main` due to dependency resolution and timeout issues HOT 1