Comments (7)
from hudi.
@chenbodeng719 Can you please let us know which Hudi/Flink/Spark versions you are using?
Are you getting duplicate rows only when reading with Spark, or do you see the same behaviour when you read the table back with Flink too?
from hudi.
I didn't try on Flink. The problem happens when I use Spark.
from hudi.
@chenbodeng719 Can you post a screenshot of the duplicate records? Do they belong to different file groups?
from hudi.
Is it possible that I bulk inserted a dataset containing some duplicate keys, and that any later upsert whose key matches one of those duplicated keys then updates the item twice? Like in the photo below.
from hudi.
Did you use only 0.14.1, or was this table upgraded from a previous version?
Can you also provide the values of the Hudi meta columns?
bulk_insert itself can ingest duplicates. Did you get duplicates right after the bulk_insert? If so, upsert is going to update both records. Did you confirm whether you had these duplicates after bulk_insert?
Running bulk_insert twice on the same data can also cause this issue.
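One way to check whether the duplicates were already present after bulk_insert is to group the rows by the `_hoodie_record_key` meta column and look at where each copy lives. The following is only a minimal sketch in plain Python: the sample rows, key values, and file names are hypothetical, and it assumes the meta columns have already been read into dict-like rows.

```python
from collections import defaultdict


def find_duplicate_keys(rows):
    """Return record keys that occur more than once.

    Each row is a dict-like object holding Hudi meta columns. The result
    maps each duplicated key to a sorted list of
    (partition_path, file_name) pairs, one pair per copy of the record.
    """
    locations = defaultdict(list)
    for row in rows:
        locations[row["_hoodie_record_key"]].append(
            (row["_hoodie_partition_path"], row["_hoodie_file_name"])
        )
    return {key: sorted(locs) for key, locs in locations.items() if len(locs) > 1}


# Hypothetical sample rows. With Spark they could come from selecting the
# meta columns of the table and collecting the result; the Spark-side
# equivalent of this check (not run here) would be roughly:
#   df.groupBy("_hoodie_record_key").count().filter("count > 1")
sample_rows = [
    {"_hoodie_record_key": "id-1", "_hoodie_partition_path": "p1", "_hoodie_file_name": "file-a.parquet"},
    {"_hoodie_record_key": "id-1", "_hoodie_partition_path": "p1", "_hoodie_file_name": "file-b.parquet"},
    {"_hoodie_record_key": "id-2", "_hoodie_partition_path": "p1", "_hoodie_file_name": "file-a.parquet"},
]

duplicates = find_duplicate_keys(sample_rows)
for key, locs in duplicates.items():
    print(key, len(locs), locs)
```

Running a check like this right after the bulk_insert, and again after the upsert, would show whether the duplicates predate the upsert and whether the copies sit in the same file or in different files.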
from hudi.
"if that's the case, upsert is going to update both records." I guess that is my case. First, the bulk insert brought some duplicate keys into the table. Then, when an upsert with one of those duplicated keys arrived, it updated the duplicate rows sharing that key. In my case, both rows for one duplicated key were changed.
I wonder: if there are five rows for one duplicated key, does it update all five rows?
from hudi.