Comments (8)
hoo boy - this is rough.
dbt drops + recreates tables in order to:
- refresh the underlying data
- make the schema of the table consistent with the associated model code
While it sounds like we could do a delete + insert to refresh the data in the table, that doesn't solve the problem of schema updates. It's certainly possible to run a series of alter statements to synchronize the model code (maybe rendered into a temp table) and the destination table, but really, that sounds error-prone and ineffective to me.
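Just to make that gap concrete, here's a minimal sketch of the delete + insert refresh (assuming a file format that supports DELETE, like delta; the table name is illustrative):
-- clears and reloads the rows, but the table's columns stay exactly as they were
delete from dbt_alice.fct_orders;
insert into dbt_alice.fct_orders
select ...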
The linked SO post references eventual consistency on s3. Are you setting an s3 path for these delta tables? Or do they live in dbfs? I didn't realize this, but it sounds like dbfs persists data to s3, so maybe this could happen either way?
I'm sort of at the edge of my understanding of how all of this works, so please correct me if I'm way off here.
Idea 1:
Would it be possible to use a timestamp suffix for the s3 location? I think this should circumvent any issues with eventual consistency, since files in s3 would never be overwritten.
I think this might not apply to delta, but seems relevant: https://docs.databricks.com/user-guide/tables.html#simple-way-to-replace-table-contents
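To make Idea 1 concrete, a rough sketch (the bucket, path, and timestamp format here are all made up):
-- each run writes to a fresh, timestamped prefix, so no existing s3 object is ever overwritten
drop table if exists dbt_alice.fct_orders;
create table dbt_alice.fct_orders
using parquet
location 's3://my-bucket/dbt/fct_orders/20200101_120000/'   -- suffix changes on every run
as
select ...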
Idea 2:
via the delta docs:
Delta Lake lets you update the schema of a table. The following types of changes are supported:
- Adding new columns (at arbitrary positions)
- Reordering existing columns
You can make these changes explicitly using DDL or implicitly using DML.
What does that mean?? The docs are light here, but can you just insert data with an arbitrary schema into a delta table? I'm very curious to know what implicit schema updates with DML look like in practice.
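My best guess, and it is only a guess: Delta has an automatic schema merging switch, so a DML statement whose source carries extra columns can evolve the target table instead of failing. Something along these lines (config name and table names are my assumptions, not taken from these docs):
-- the source has a column the target doesn't; with auto-merge on, the merge adds it to the target schema
set spark.databricks.delta.schema.autoMerge.enabled = true;

merge into dbt_alice.fct_orders as target
using new_orders as source
on target.order_id = source.order_id
when matched then update set *
when not matched then insert *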
from dbt-spark.
Per Michael Armbrust from Databricks:
- This error occurs because "the transaction log is flickering in and out of existence"
- The answer: create or replace table is coming in Spark v3!
- It will still be slow for managed tables, since data needs to be physically moved
- The trade-off is between a quick non-atomic metadata operation (external tables) or a slow atomic operation (managed tables)
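Spelling out that trade-off as I read it (table name illustrative):
-- today: two separate statements, so a concurrent reader can hit the window where the table doesn't exist
-- (the "transaction log flickering in and out of existence")
drop table if exists dbt_alice.fct_orders;
create table dbt_alice.fct_orders using delta as select ...;

-- spark 3: a single atomic statement; still slow for managed tables because the data is physically rewritten,
-- but there's never a moment with no table at all
create or replace table dbt_alice.fct_orders using delta as select ...;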
from dbt-spark.
dbt drops + recreates tables in order to:
- refresh the underlying data
- make the schema of the table consistent with the associated model code
Hi @drewbanin , I have a small question: if the table is an external table, will drop table -> create table -> insert append data instead of refreshing the data?
from dbt-spark.
If the table is an external table, will drop table -> create table -> insert append data instead of refreshing the data?
This is true, and I don't see a great way around it. We may need to take genuinely different steps in our materializations that differentiate between managed and external tables. What would you recommend @qsbao?
from dbt-spark.
This is true, and I don't see a great way around it. We may need to take genuinely different steps in our materializations that differentiate between managed and external tables. What would you recommend @qsbao?
I don't have any suggestions right now. I'm new to dbt and asked this question just to understand more clearly. Thank you for your reply.
The following is what I observed:
- If the model is an external table, we use drop table -> CTAS to recreate the table. I checked that this correctly refreshes the data:
models:
  jaffle_shop:
    materialized: table
    file_format: parquet
    location_root: /user/qsbao
drop table if exists dbt_alice.fct_orders
create table dbt_alice.fct_orders
using parquet
location '/user/qsbao/fct_orders'
as
...
- But if the seed is an external table, we use drop table -> create table -> insert into to recreate the table. This appends data instead of refreshing it (see the sketch below), see #112 .
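Here's the sketch of why that path misbehaves (names made up): dropping an external table removes only the metadata, the files at the location survive, so the recreated table sees the old rows again and the insert stacks on top of them.
drop table if exists dbt_alice.raw_payments;              -- metadata gone, files under /user/qsbao/raw_payments remain
create table dbt_alice.raw_payments (id int, amount int)
using parquet
location '/user/qsbao/raw_payments';                      -- same path, so the old rows come right back
insert into dbt_alice.raw_payments values (1, 100);       -- lands on top of the previous load instead of replacing it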
from dbt-spark.
@qsbao Thanks for looking into that! It sounds like we're in okay shape for models, which are always CTAS. That also gives me hope for a workaround to the seed issue (#112).
from dbt-spark.
I think we can close this issue. When using delta as the file format, we now use create or replace table:
https://github.com/fishtown-analytics/dbt-spark/blob/6ad164b315748fef7c0ae0b87ff6b8292632f35e/dbt/include/spark/macros/adapters.sql#L81
This should fix the issues with the transaction log.
from dbt-spark.
Ooh thanks @Fokko! Good call here
from dbt-spark.
Related Issues (20)
- [ADAP-920] [ADAP-919] [Bug] Delta table metadata changed/concurrent update HOT 1
- [ADAP-930] [Feature] Implement relation filtering on get_catalog macro
- [ADAP-931] [Bug] Values in seeds that should convert to `null` aren't working for `session` connection method HOT 3
- [ADAP-946] [CT-2689] [Bug] Incremental models ran from scratch when created in a pre-hook HOT 9
- [ADAP-955] [Feature] Add debug logging for driver/connector packages HOT 2
- [ADAP-970] [Feature] Incremental updates should update table description HOT 3
- [ADAP-999] [Feature] add support for Apache Paimon format HOT 1
- [ADAP-1012] Support for new agate data type in Spark
- [ADAP-1018] [Feature] Remove Databricks test profiles from integration tests HOT 2
- [ADAP-1019] [Bug] Table already exists, you need to drop it first in incremental models HOT 1
- [ADAP-1038] [Tests] Add tests for --empty flag
- [ADAP-1048] [Bug] Replacing existing table using incremental model HOT 1
- [ADAP-1071] [Bug] `latest` and `1.x.latest` tags for ghcr Docker releases are stale HOT 1
- [ADAP-1074] [Implementation] Remove `invalid_insert_overwrite_delta_msg` message
- [ADAP-1085] [Bug] When using iceberg format, dbt docs generate is unable to populate the columns information HOT 1
- [ADAP-1093] [Feature] Run integration tests against all supported python versions
- [Feature] Support HTTP transport protocol for Thrift method
- [Feature] Support OCI Dataflow as a backend for dbt-spark
- `dbt-core` Dockerfile does not work for `dbt-spark` due to `PyHive` HOT 2
- [Bug] CI is broken on `main` due to dependency resolution and timeout issues HOT 1