
Comments (10)

pravin1406 avatar pravin1406 commented on June 19, 2024

@codope @nsivabalan can you guys help out here? We are stuck trying to integrate dbt with Hudi in our production use case.

from hudi.

ad1happy2go avatar ad1happy2go commented on June 19, 2024

@pravin1406
Just one question: you are using GLOBAL_SIMPLE, but you don't have any partition column defined.
Can you post your table properties?

Which configurations that you set are missing when you use MERGE INTO?


ad1happy2go avatar ad1happy2go commented on June 19, 2024

I just noticed that it is using the non-partitioned key generator in your case (visible in the debug screenshot). That may be why it uses SIMPLE instead of GLOBAL_SIMPLE, as the latter doesn't make sense for a non-partitioned table.
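To illustrate the point above, here is a minimal sketch (illustrative names, not Hudi's actual classes): a global index matches record keys across all partitions, while a plain index only looks inside the incoming record's own partition. With a non-partitioned key generator every record maps to the same empty partition path, so the two behave identically and the global variant adds nothing.

```python
def tag_location(existing, key, partition, global_index):
    """Return the partition where `key` already lives, else None.

    `existing` is a set of (partition_path, record_key) pairs.
    """
    if global_index:
        # Global index: look the key up across every partition.
        for (p, k) in existing:
            if k == key:
                return p
        return None
    # Non-global index: only look within the incoming record's partition.
    return partition if (partition, key) in existing else None

# Non-partitioned table: every record's partition path is "".
existing = {("", "id-1")}
assert tag_location(existing, "id-1", "", global_index=True) == \
       tag_location(existing, "id-1", "", global_index=False) == ""
```

For a partitioned table the two diverge: only the global index finds a key that arrives tagged with a different partition value.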


prashant462 avatar prashant462 commented on June 19, 2024

Hello @ad1happy2go, I am printing the hoodie configs in the Hudi code just before the records are inserted.
I am attaching the configs we got from the first run and the second run.

DBT model executed:

{{
  config(
    materialized = 'incremental',
    file_format = 'hudi',
    pre_hook = "SET spark.sql.legacy.allowNonEmptyLocationInCTAS = true",
    location_root = "file:///Users/B0279627/Downloads/Hudi",
    unique_key = "id",
    partition_by = "name",
    incremental_strategy = "merge",
    options = {
      'preCombineField': 'id2',
      'hoodie.index.type': "GLOBAL_SIMPLE",
      'hoodie.datasource.write.partitionpath.field': 'name',
      'hoodie.datasource.hive_sync.partition_fields': 'name',
      'hoodie.datasource.hive_sync.table': 'hudi_test_two',
      'hoodie.datasource.hive_sync.database': 'qultyzn1_prepd',
      'hoodie.simple.index.update.partition.path': 'true',
      'hoodie.keep.min.commits': '145',
      'hoodie.keep.max.commits': '288',
      'hoodie.cleaner.policy': 'KEEP_LATEST_BY_HOURS',
      'hoodie.cleaner.hours.retained': '72',
      'hoodie.cleaner.fileversions.retained': '144',
      'hoodie.cleaner.commits.retained': '144',
      'hoodie.upsert.shuffle.parallelism': '200',
      'hoodie.insert.shuffle.parallelism': '200',
      'hoodie.bulkinsert.shuffle.parallelism': '200',
      'hoodie.delete.shuffle.parallelism': '200',
      'hoodie.parquet.compression.codec': 'zstd',
      'hoodie.datasource.hive_sync.support_timestamp': 'true',
      'hoodie.datasource.write.reconcile.schema': 'true',
      'hoodie.enable.data.skipping': 'true',
      'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
    }
  )
}}

-- first run: select 1 as id, 2 as id2, "test2" as name

-- second run: select 1 as id, 2 as id2, "test" as name

Screenshot 2024-02-06 at 10 49 26 PM

second_run_conf.txt
first_run_conf.txt


pravin1406 avatar pravin1406 commented on June 19, 2024

#9342
@ad1happy2go We were facing this issue as well. Basically, we know which datasource options we want to use, but we want to use them with the Spark SQL support that Hudi provides. In the second run, one of the properties that changed was "hoodie.datasource.write.payload.class". As noted in the issue I mentioned, this was fixed for InsertInto in the 0.13.1 release.
But the MergeInto command will still override PAYLOAD_CLASS_NAME to ExpressionPayload, because that is part of the overriding options in the buildMergeIntoConfig method of the MergeIntoHoodieTableCommand.scala class.

Our original requirement is to UPSERT on a COW/MOR table while using Hudi's DefaultHoodieRecordPayload.
On the first run we do CreateTable -> InsertInto.
On the second run we do MergeInto, where the match condition looks somewhat like this:

 when matched then update set * 
 when not matched then insert *
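The override described above can be sketched roughly like this (illustrative Python, not Hudi's actual Scala code): MERGE INTO merges a set of forced options on top of the user-supplied write options, so the user's payload class is silently replaced while the other options survive.

```python
# User-supplied write options (from the dbt model above).
user_options = {
    "hoodie.datasource.write.payload.class":
        "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
    "hoodie.index.type": "GLOBAL_SIMPLE",
}

# Options MERGE INTO forces so it can evaluate the MATCHED/NOT MATCHED
# expressions (ExpressionPayload carries the compiled update/insert exprs).
overriding_options = {
    "hoodie.datasource.write.payload.class":
        "org.apache.spark.sql.hudi.command.payload.ExpressionPayload",
}

# Later entries win, mirroring the merge in buildMergeIntoConfig.
effective = {**user_options, **overriding_options}

assert effective["hoodie.datasource.write.payload.class"].endswith("ExpressionPayload")
assert effective["hoodie.index.type"] == "GLOBAL_SIMPLE"  # other options survive
```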

It would be great if you could explain this in more detail for our understanding. Should we move to 0.13 or a higher version, and will that solve the issue? Or should we use InsertInto with some additional insert-into behaviour properties?


ad1happy2go avatar ad1happy2go commented on June 19, 2024

@prashant462 I tried the exact same model and it is working as expected with version 0.14.1.

After the first run (select 2 as id, 2 as id2, "test" as name):
(screenshot attached)

After the second run (select 2 as id, 2 as id2, "test2" as name):
(screenshot attached)

Can you please try with Hudi version 0.14.X?
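For reference, pinning Hudi 0.14.1 in a Spark session usually looks something like the following spark-defaults fragment (the artifact coordinates assume Spark 3.4 / Scala 2.12; adjust them to your build):

```properties
spark.jars.packages              org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.sql.extensions             org.apache.hudi.spark.sql.HoodieSparkSessionExtension
spark.sql.catalog.spark_catalog  org.apache.spark.sql.hudi.catalog.HoodieCatalog
```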


nsivabalan avatar nsivabalan commented on June 19, 2024

Yes, with pre-0.14.0, Hudi expects all write configs to be passed in with every write.
From 0.14.0, at least for table properties, Hudi tries to reuse the properties already serialized as table props.
Note that this is not applicable to write properties; those are not serialized anywhere.
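A rough sketch of that behaviour (illustrative, not Hudi's actual resolution code): pre-0.14.0 the effective config is only what the writer passes on that write; from 0.14.0 the table-level properties serialized in hoodie.properties act as a base layer, while write-only configs still have to be re-supplied every time.

```python
def effective_config(serialized_table_props, write_opts, hudi_version):
    """Simplified model of how write-time configs resolve per Hudi version."""
    if hudi_version >= (0, 14, 0):
        merged = dict(serialized_table_props)  # reuse serialized table props
        merged.update(write_opts)              # this write's options still win
        return merged
    return dict(write_opts)                    # pre-0.14: only this write's options

# Table prop serialized at table creation time:
table_props = {"hoodie.datasource.write.precombine.field": "id2"}

# A later write that forgets to re-pass the precombine field:
old = effective_config(table_props, {}, (0, 13, 1))
new = effective_config(table_props, {}, (0, 14, 1))
assert "hoodie.datasource.write.precombine.field" not in old  # silently lost
assert new["hoodie.datasource.write.precombine.field"] == "id2"  # reused
```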


prashant462 avatar prashant462 commented on June 19, 2024

@ad1happy2go I tried with Hudi 0.14.1 and the Hudi configs seem to be working now.
But I am facing another issue, with the property 'hoodie.simple.index.update.partition.path': 'true'.

I am running a dbt model with the config below.

{{
  config(
    materialized = 'incremental',
    file_format = 'hudi',
    pre_hook = "SET spark.sql.legacy.allowNonEmptyLocationInCTAS = true",
    location_root = "file:///Users/B0279627/Downloads/Hudi",
    unique_key = "id",
    partition_by = "name",
    incremental_strategy = "merge",
    options = {
      'hoodie.datasource.write.precombine.field': 'id2',
      'hoodie.index.type': "GLOBAL_SIMPLE",
      'hoodie.datasource.write.partitionpath.field': 'name',
      'hoodie.datasource.hive_sync.partition_fields': 'name',
      'hoodie.datasource.hive_sync.table': 'hudi_test_five',
      'hoodie.datasource.hive_sync.database': 'qultyzn1_prepd',
      'hoodie.simple.index.update.partition.path': 'true',
      'hoodie.keep.min.commits': '145',
      'hoodie.keep.max.commits': '288',
      'hoodie.cleaner.policy': 'KEEP_LATEST_BY_HOURS',
      'hoodie.cleaner.hours.retained': '72',
      'hoodie.cleaner.fileversions.retained': '144',
      'hoodie.cleaner.commits.retained': '144',
      'hoodie.upsert.shuffle.parallelism': '200',
      'hoodie.insert.shuffle.parallelism': '200',
      'hoodie.bulkinsert.shuffle.parallelism': '200',
      'hoodie.delete.shuffle.parallelism': '200',
      'hoodie.parquet.compression.codec': 'zstd',
      'hoodie.datasource.hive_sync.support_timestamp': 'true',
      'hoodie.datasource.write.reconcile.schema': 'true',
      'hoodie.enable.data.skipping': 'true',
      'hoodie.spark.sql.insert.into.operation': 'upsert',
      'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
    }
  )
}}

1st run -- select 1 as id, 4 as id2, "test5" as name

2nd run -- select 1 as id, 2 as id2, "test4" as name

But the partition path is not updating for the record key.

I am attaching the table result.
Screenshot 2024-02-09 at 10 43 11 AM
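For reference, the expected semantics of hoodie.simple.index.update.partition.path=true with a global index can be sketched like this (a simplified model, not Hudi's implementation): when an upsert carries a new partition value for an existing key, the record should be deleted from the old partition and written to the new one, leaving exactly one copy under the new partition.

```python
def upsert(table, key, partition, row, update_partition_path):
    """Toy global-index upsert; `table` maps (partition, key) -> row."""
    old = next((p for (p, k) in list(table) if k == key), None)
    if old is not None and old != partition:
        if update_partition_path:
            table.pop((old, key))  # delete from the old partition, move record
        else:
            partition = old        # keep the record in its original partition
    table[(partition, key)] = row

table = {}
upsert(table, "1", "test5", {"id2": 4}, update_partition_path=True)  # 1st run
upsert(table, "1", "test4", {"id2": 2}, update_partition_path=True)  # 2nd run
assert list(table) == [("test4", "1")]  # record moved to the new partition
```

The screenshot above suggests the record stayed under the old partition instead, i.e. the behaviour with update.partition.path=false.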

