
Comments (10)

pravin1406 avatar pravin1406 commented on June 19, 2024

@codope @nsivabalan can you guys help out here? We are stuck trying to integrate dbt with Hudi in our production use case.

from hudi.

ad1happy2go avatar ad1happy2go commented on June 19, 2024

@pravin1406
Just one question: you are using GLOBAL_SIMPLE, but you don't have any partition column defined.
Can you post your table properties?

Which configurations that you set are missing when you use MERGE INTO?


ad1happy2go avatar ad1happy2go commented on June 19, 2024

I just noticed that it is using the non-partitioned key generator in your case (visible in the debug screenshot). That may be why it uses SIMPLE instead of GLOBAL_SIMPLE, as the latter doesn't make sense for a non-partitioned table.
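To illustrate the point above, here is a minimal sketch (illustrative names, not Hudi's actual classes): a global index matches record keys across all partitions, while a plain index only looks inside the incoming record's own partition. With a non-partitioned key generator every record maps to the same empty partition path, so the two behave identically and the global variant adds nothing.

```python
def tag_location(existing, key, partition, global_index):
    """Return the partition where `key` already lives, else None.

    `existing` is a set of (partition_path, record_key) pairs.
    """
    if global_index:
        # Global index: look the key up across every partition.
        for (p, k) in existing:
            if k == key:
                return p
        return None
    # Non-global index: only look within the incoming record's partition.
    return partition if (partition, key) in existing else None

# Non-partitioned table: every record's partition path is "".
existing = {("", "id-1")}
assert tag_location(existing, "id-1", "", global_index=True) == \
       tag_location(existing, "id-1", "", global_index=False) == ""
```

For a partitioned table the two diverge: only the global index finds a key that arrives tagged with a different partition value.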


prashant462 avatar prashant462 commented on June 19, 2024

Hello @ad1happy2go, I am printing the hoodie configs in the Hudi code just before the records are inserted.
I am attaching the configs we got from the first run and the second run.

DBT model executed:

{{
  config(
    materialized = 'incremental',
    file_format = 'hudi',
    pre_hook = "SET spark.sql.legacy.allowNonEmptyLocationInCTAS = true",
    location_root = "file:///Users/B0279627/Downloads/Hudi",
    unique_key = "id",
    partition_by = "name",
    incremental_strategy = "merge",
    options = {
      'preCombineField': 'id2',
      'hoodie.index.type': "GLOBAL_SIMPLE",
      'hoodie.datasource.write.partitionpath.field': 'name',
      'hoodie.datasource.hive_sync.partition_fields': 'name',
      'hoodie.datasource.hive_sync.table': 'hudi_test_two',
      'hoodie.datasource.hive_sync.database': 'qultyzn1_prepd',
      'hoodie.simple.index.update.partition.path': 'true',
      'hoodie.keep.min.commits': '145',
      'hoodie.keep.max.commits': '288',
      'hoodie.cleaner.policy': 'KEEP_LATEST_BY_HOURS',
      'hoodie.cleaner.hours.retained': '72',
      'hoodie.cleaner.fileversions.retained': '144',
      'hoodie.cleaner.commits.retained': '144',
      'hoodie.upsert.shuffle.parallelism': '200',
      'hoodie.insert.shuffle.parallelism': '200',
      'hoodie.bulkinsert.shuffle.parallelism': '200',
      'hoodie.delete.shuffle.parallelism': '200',
      'hoodie.parquet.compression.codec': 'zstd',
      'hoodie.datasource.hive_sync.support_timestamp': 'true',
      'hoodie.datasource.write.reconcile.schema': 'true',
      'hoodie.enable.data.skipping': 'true',
      'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
    }
  )
}}

-- first run: select 1 as id, 2 as id2, "test2" as name

-- second run: select 1 as id, 2 as id2, "test" as name

Screenshot 2024-02-06 at 10 49 26 PM

second_run_conf.txt
first_run_conf.txt


pravin1406 avatar pravin1406 commented on June 19, 2024

#9342
@ad1happy2go We were facing this issue as well. Basically, we know which datasource options we want to use, but we want to use them with the Spark SQL support that Hudi provides. In the second run, one of the properties that changed was "hoodie.datasource.write.payload.class". As noted in the issue I mentioned, this was fixed for InsertInto in the 0.13.1 release.
But the MergeInto command will still override PAYLOAD_CLASS_NAME to ExpressionPayload, because that is part of the overriding options in the buildMergeIntoConfig method of the MergeIntoHoodieTableCommand.scala class.

Our original requirement is to UPSERT on a COW/MOR table while using Hudi's DefaultHoodieRecordPayload.
On the first run we do CreateTable -> InsertInto.
On the second run we do MergeInto, where the match condition looks somewhat like this:

 when matched then update set * 
 when not matched then insert *
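The override described above can be sketched roughly like this (illustrative Python, not Hudi's actual Scala code): MERGE INTO merges a set of forced options on top of the user-supplied write options, so the user's payload class is silently replaced while the other options survive.

```python
# User-supplied write options (from the dbt model above).
user_options = {
    "hoodie.datasource.write.payload.class":
        "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
    "hoodie.index.type": "GLOBAL_SIMPLE",
}

# Options MERGE INTO forces so it can evaluate the MATCHED/NOT MATCHED
# expressions (ExpressionPayload carries the compiled update/insert exprs).
overriding_options = {
    "hoodie.datasource.write.payload.class":
        "org.apache.spark.sql.hudi.command.payload.ExpressionPayload",
}

# Later entries win, mirroring the merge in buildMergeIntoConfig.
effective = {**user_options, **overriding_options}

assert effective["hoodie.datasource.write.payload.class"].endswith("ExpressionPayload")
assert effective["hoodie.index.type"] == "GLOBAL_SIMPLE"  # other options survive
```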

It would be great if you could explain this in more detail for our understanding. Should we move to 0.13 or a higher version, and will that solve the issue? Or should we use InsertInto with some additional insert-into behaviour properties?


ad1happy2go avatar ad1happy2go commented on June 19, 2024

@prashant462 I tried the exact same model and it is working as expected with version 0.14.1.

After the first run (select 2 as id, 2 as id2, "test" as name):
(screenshot attached)

After the second run (select 2 as id, 2 as id2, "test2" as name):
(screenshot attached)

Can you please try with Hudi version 0.14.X?
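For reference, pinning Hudi 0.14.1 in a Spark session usually looks something like the following spark-defaults fragment (the artifact coordinates assume Spark 3.4 / Scala 2.12; adjust them to your build):

```properties
spark.jars.packages              org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.sql.extensions             org.apache.hudi.spark.sql.HoodieSparkSessionExtension
spark.sql.catalog.spark_catalog  org.apache.spark.sql.hudi.catalog.HoodieCatalog
```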


nsivabalan avatar nsivabalan commented on June 19, 2024

Yes, with pre-0.14.0, Hudi expects all write configs to be passed in with every write.
From 0.14.0, at least for table properties, Hudi tries to reuse the properties already serialized as table props.
Note that this is not applicable to write properties; those are not serialized anywhere.
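A rough sketch of that behaviour (illustrative, not Hudi's actual resolution code): pre-0.14.0 the effective config is only what the writer passes on that write; from 0.14.0 the table-level properties serialized in hoodie.properties act as a base layer, while write-only configs still have to be re-supplied every time.

```python
def effective_config(serialized_table_props, write_opts, hudi_version):
    """Simplified model of how write-time configs resolve per Hudi version."""
    if hudi_version >= (0, 14, 0):
        merged = dict(serialized_table_props)  # reuse serialized table props
        merged.update(write_opts)              # this write's options still win
        return merged
    return dict(write_opts)                    # pre-0.14: only this write's options

# Table prop serialized at table creation time:
table_props = {"hoodie.datasource.write.precombine.field": "id2"}

# A later write that forgets to re-pass the precombine field:
old = effective_config(table_props, {}, (0, 13, 1))
new = effective_config(table_props, {}, (0, 14, 1))
assert "hoodie.datasource.write.precombine.field" not in old  # silently lost
assert new["hoodie.datasource.write.precombine.field"] == "id2"  # reused
```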


prashant462 avatar prashant462 commented on June 19, 2024

@ad1happy2go I tried with Hudi 0.14.1 and the Hudi configs seem to be working now.
But I am facing another issue, with the property 'hoodie.simple.index.update.partition.path': 'true'.

I am running a dbt model with the config below.

{{
  config(
    materialized = 'incremental',
    file_format = 'hudi',
    pre_hook = "SET spark.sql.legacy.allowNonEmptyLocationInCTAS = true",
    location_root = "file:///Users/B0279627/Downloads/Hudi",
    unique_key = "id",
    partition_by = "name",
    incremental_strategy = "merge",
    options = {
      'hoodie.datasource.write.precombine.field': 'id2',
      'hoodie.index.type': "GLOBAL_SIMPLE",
      'hoodie.datasource.write.partitionpath.field': 'name',
      'hoodie.datasource.hive_sync.partition_fields': 'name',
      'hoodie.datasource.hive_sync.table': 'hudi_test_five',
      'hoodie.datasource.hive_sync.database': 'qultyzn1_prepd',
      'hoodie.simple.index.update.partition.path': 'true',
      'hoodie.keep.min.commits': '145',
      'hoodie.keep.max.commits': '288',
      'hoodie.cleaner.policy': 'KEEP_LATEST_BY_HOURS',
      'hoodie.cleaner.hours.retained': '72',
      'hoodie.cleaner.fileversions.retained': '144',
      'hoodie.cleaner.commits.retained': '144',
      'hoodie.upsert.shuffle.parallelism': '200',
      'hoodie.insert.shuffle.parallelism': '200',
      'hoodie.bulkinsert.shuffle.parallelism': '200',
      'hoodie.delete.shuffle.parallelism': '200',
      'hoodie.parquet.compression.codec': 'zstd',
      'hoodie.datasource.hive_sync.support_timestamp': 'true',
      'hoodie.datasource.write.reconcile.schema': 'true',
      'hoodie.enable.data.skipping': 'true',
      'hoodie.spark.sql.insert.into.operation': 'upsert',
      'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
    }
  )
}}

1st run -- select 1 as id, 4 as id2, "test5" as name

2nd run -- select 1 as id, 2 as id2, "test4" as name

But the partition path is not updating for the record key.

I am attaching the table result.
Screenshot 2024-02-09 at 10 43 11 AM
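For reference, the expected semantics of hoodie.simple.index.update.partition.path=true with a global index can be sketched like this (a simplified model, not Hudi's implementation): when an upsert carries a new partition value for an existing key, the record should be deleted from the old partition and written to the new one, leaving exactly one copy under the new partition.

```python
def upsert(table, key, partition, row, update_partition_path):
    """Toy global-index upsert; `table` maps (partition, key) -> row."""
    old = next((p for (p, k) in list(table) if k == key), None)
    if old is not None and old != partition:
        if update_partition_path:
            table.pop((old, key))  # delete from the old partition, move record
        else:
            partition = old        # keep the record in its original partition
    table[(partition, key)] = row

table = {}
upsert(table, "1", "test5", {"id2": 4}, update_partition_path=True)  # 1st run
upsert(table, "1", "test4", {"id2": 2}, update_partition_path=True)  # 2nd run
assert list(table) == [("test4", "1")]  # record moved to the new partition
```

The screenshot above suggests the record stayed under the old partition instead, i.e. the behaviour with update.partition.path=false.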

