Comments (10)
@codope @nsivabalan can you help out here? We are stuck integrating dbt with Hudi in our production use case.
from hudi.
@pravin1406
Just one question: you are using GLOBAL_SIMPLE, but you don't have any partition column defined.
Can you post your table properties?
Which configurations are you seeing missing when using MERGE INTO?
I just noticed that it is using the NonpartitionedKeyGenerator in your case (see the debug screenshot). That may be why it is using SIMPLE instead of GLOBAL_SIMPLE, since the latter doesn't make sense for a non-partitioned table.
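To make the SIMPLE vs. GLOBAL_SIMPLE distinction concrete, here is a toy Python sketch of the two lookup scopes (an illustration of the idea only, not Hudi's implementation or API): SIMPLE probes only the incoming record's partition for an existing key, while GLOBAL_SIMPLE scans all partitions.

```python
# Toy model of Hudi index lookup scope (illustration only, not Hudi's code).
# `existing` maps partition -> {record_key: row}.

def locate(existing, key, partition, index_type):
    """Return the partition currently holding `key`, or None if it is new."""
    if index_type == "SIMPLE":
        # Only the incoming record's partition is probed.
        return partition if key in existing.get(partition, {}) else None
    if index_type == "GLOBAL_SIMPLE":
        # Every partition is probed, so a key that moved partitions is still found.
        for part, rows in existing.items():
            if key in rows:
                return part
        return None
    raise ValueError(index_type)

table = {"test2": {1: {"id2": 2}}}               # state after the first run
print(locate(table, 1, "test", "SIMPLE"))        # None -> treated as a new insert
print(locate(table, 1, "test", "GLOBAL_SIMPLE")) # "test2" -> treated as an update
```

For a non-partitioned table every row sits in the same (empty) partition, so the two scopes behave identically, which is why falling back to SIMPLE is harmless there.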
Hello @ad1happy2go, I am printing the hoodie configs in the Hudi code before inserting records.
I am attaching the configs we got for the first and second runs.
dbt model executed:

```sql
{{
  config(
    materialized = 'incremental',
    file_format = 'hudi',
    pre_hook = "SET spark.sql.legacy.allowNonEmptyLocationInCTAS = true",
    location_root = "file:///Users/B0279627/Downloads/Hudi",
    unique_key = "id",
    partition_by = "name",
    incremental_strategy = "merge",
    options = {
      'preCombineField': 'id2',
      'hoodie.index.type': "GLOBAL_SIMPLE",
      'hoodie.datasource.write.partitionpath.field': 'name',
      'hoodie.datasource.hive_sync.partition_fields': 'name',
      'hoodie.datasource.hive_sync.table': 'hudi_test_two',
      'hoodie.datasource.hive_sync.database': 'qultyzn1_prepd',
      'hoodie.simple.index.update.partition.path': 'true',
      'hoodie.keep.min.commits': '145',
      'hoodie.keep.max.commits': '288',
      'hoodie.cleaner.policy': 'KEEP_LATEST_BY_HOURS',
      'hoodie.cleaner.hours.retained': '72',
      'hoodie.cleaner.fileversions.retained': '144',
      'hoodie.cleaner.commits.retained': '144',
      'hoodie.upsert.shuffle.parallelism': '200',
      'hoodie.insert.shuffle.parallelism': '200',
      'hoodie.bulkinsert.shuffle.parallelism': '200',
      'hoodie.delete.shuffle.parallelism': '200',
      'hoodie.parquet.compression.codec': 'zstd',
      'hoodie.datasource.hive_sync.support_timestamp': 'true',
      'hoodie.datasource.write.reconcile.schema': 'true',
      'hoodie.enable.data.skipping': 'true',
      'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
    }
  )
}}
```
-- first run select 1 as id, 2 as id2,"test2" as name
-- second run select 1 as id, 2 as id2,"test" as name
(Screenshot attached, 2024-02-06: debug view of the Hudi write configs; original attachment link has expired.)
second_run_conf.txt
first_run_conf.txt
#9342
@ad1happy2go We were facing this issue as well. We know which datasource options we want to use, but we want to use them through the Spark SQL support provided by Hudi. In the second run, one of the properties that was also changing was "hoodie.datasource.write.payload.class". As noted in the issue I mentioned, this has been fixed for InsertInto in the 0.13.1 release.
But the MergeInto command will still override PAYLOAD_CLASS_NAME to ExpressionPayload, as that is part of the overriding options in the buildMergeIntoConfig method of MergeIntoHoodieTableCommand.scala.
Our original requirement is to UPSERT on a COW/MOR table while using Hudi's DefaultHoodieRecordPayload.
On the first run we do CreateTable -> InsertInto.
On the second run we do MergeInto, where the match condition looks like this:
when matched then update set *
when not matched then insert *
It would be great if you could clarify this for our understanding. Should we move to 0.13 or a higher version, and will that solve the issue? Or should we use InsertInto with some additional insert-into behaviour properties?
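The requirement above hinges on DefaultHoodieRecordPayload's merge rule: on a record-key collision it keeps the record with the larger precombine (ordering) value. A rough Python sketch of that rule follows, under assumed semantics (this is an illustration, not Hudi's actual code); the tie-breaking toward the incoming record is an assumption of the sketch.

```python
# Rough sketch of DefaultHoodieRecordPayload-style ordering semantics
# (illustration under assumed semantics, not Hudi's implementation):
# on a key collision, the record with the LARGER precombine value wins,
# so a late-arriving update with a smaller ordering value is dropped.

def combine(current, incoming, ordering_field="id2"):
    """Pick the surviving record for one key, mimicking precombine semantics."""
    if incoming[ordering_field] >= current[ordering_field]:
        return incoming   # newer (or tied) ordering value replaces the row
    return current        # stale update is ignored

stored = {"id": 1, "id2": 2, "name": "test2"}
print(combine(stored, {"id": 1, "id2": 1, "name": "late"}))   # stored row survives
print(combine(stored, {"id": 1, "id2": 3, "name": "fresh"}))  # incoming row wins
```

This is exactly the behavior lost when MergeInto swaps the payload class for ExpressionPayload: the merge conditions, not the ordering field, then decide which record survives.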
@prashant462 I tried the exact same model and it is working as expected with version 0.14.1.
After the first run (select 2 as id, 2 as id2, "test" as name) -
After the second run (select 2 as id, 2 as id2, "test2" as name) -
Can you please try with Hudi version 0.14.x?
yes, w/ pre-0.14.0, hudi expects all write configs to be passed in w/ every write.
from 0.14.0, at least for table properties, hudi tries to reuse the properties already serialized as table props.
this is not applicable to write properties btw. those are not serialized anywhere.
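The table-props-vs-write-props split described above can be sketched as a simple dict overlay (assumed semantics for illustration, not Hudi's implementation): properties serialized into hoodie.properties at table creation are reused on later writes, while write-level properties exist only for the write that passes them.

```python
# Toy sketch of the 0.14.0 behavior described above (assumed semantics,
# not Hudi's implementation): serialized table props are reused; write
# props must accompany every write because they are never serialized.

serialized_table_props = {   # persisted once at table creation, reused afterwards
    "hoodie.index.type": "GLOBAL_SIMPLE",
    "hoodie.datasource.write.partitionpath.field": "name",
}

def effective_config(this_write_props):
    """Config a write sees: serialized table props overlaid by this write's props."""
    return {**serialized_table_props, **this_write_props}

# A later write that omits the payload class simply does not have it:
cfg = effective_config({})
print("hoodie.index.type" in cfg)                      # True  (table prop, reused)
print("hoodie.datasource.write.payload.class" in cfg)  # False (write prop, not kept)
```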
@ad1happy2go I tried with Hudi 0.14.1 and the Hudi configs seem to be working now.
But I am facing another issue, with the property 'hoodie.simple.index.update.partition.path': 'true'.
I am running a dbt model with the config below.
```sql
{{
  config(
    materialized = 'incremental',
    file_format = 'hudi',
    pre_hook = "SET spark.sql.legacy.allowNonEmptyLocationInCTAS = true",
    location_root = "file:///Users/B0279627/Downloads/Hudi",
    unique_key = "id",
    partition_by = "name",
    incremental_strategy = "merge",
    options = {
      'hoodie.datasource.write.precombine.field': 'id2',
      'hoodie.index.type': "GLOBAL_SIMPLE",
      'hoodie.datasource.write.partitionpath.field': 'name',
      'hoodie.datasource.hive_sync.partition_fields': 'name',
      'hoodie.datasource.hive_sync.table': 'hudi_test_five',
      'hoodie.datasource.hive_sync.database': 'qultyzn1_prepd',
      'hoodie.simple.index.update.partition.path': 'true',
      'hoodie.keep.min.commits': '145',
      'hoodie.keep.max.commits': '288',
      'hoodie.cleaner.policy': 'KEEP_LATEST_BY_HOURS',
      'hoodie.cleaner.hours.retained': '72',
      'hoodie.cleaner.fileversions.retained': '144',
      'hoodie.cleaner.commits.retained': '144',
      'hoodie.upsert.shuffle.parallelism': '200',
      'hoodie.insert.shuffle.parallelism': '200',
      'hoodie.bulkinsert.shuffle.parallelism': '200',
      'hoodie.delete.shuffle.parallelism': '200',
      'hoodie.parquet.compression.codec': 'zstd',
      'hoodie.datasource.hive_sync.support_timestamp': 'true',
      'hoodie.datasource.write.reconcile.schema': 'true',
      'hoodie.enable.data.skipping': 'true',
      'hoodie.spark.sql.insert.into.operation': 'upsert',
      'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
    }
  )
}}
```
1st run -- select 1 as id, 4 as id2, "test5" as name
2nd run -- select 1 as id, 2 as id2, "test4" as name
But the partition path is not updating for the record key.
I am attaching the table result.
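For reference, here is a plain-Python sketch of the behavior hoodie.simple.index.update.partition.path=true is expected to produce with a global index (an illustration under assumed semantics, not Hudi's implementation): when a known record key arrives with a new partition value, the old row is deleted and the record is inserted into the new partition.

```python
# Expected effect of hoodie.simple.index.update.partition.path=true with a
# global index, sketched in plain Python (illustration, not Hudi's code).
# `table` maps partition -> {record_key: row}.

def upsert_global(table, key, partition, row, update_partition_path=True):
    old_part = next((p for p, rows in table.items() if key in rows), None)
    if old_part is None or old_part == partition:
        table.setdefault(partition, {})[key] = row      # plain insert/update
    elif update_partition_path:
        del table[old_part][key]                        # delete from old partition
        table.setdefault(partition, {})[key] = row      # insert into new partition
    else:
        table[old_part][key] = row                      # update in place, keep old path
    return table

table = {"test5": {1: {"id2": 4, "name": "test5"}}}     # after the 1st run
upsert_global(table, 1, "test4", {"id2": 2, "name": "test4"})
print(table)  # old partition emptied, record moved under "test4"
```

One thing worth checking if 0.14.1 still leaves the record under the old partition: the precombine field. With DefaultHoodieRecordPayload, the second run's id2 (2) is lower than the first run's (4), so the incoming record may be discarded by precombine before any partition move is attempted. Retrying with a larger id2 in the second run would rule that out.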