[SUPPORT]duplicate rows in my table about hudi HOT 7 CLOSED

chenbodeng719 commented on September 28, 2024

[SUPPORT]duplicate rows in my table

from hudi.

Comments (7)

ad1happy2go commented on September 28, 2024

Yes, that's correct. You should remove the duplicates after inserting with bulk_insert, or not use bulk_insert at all in this case.

…

On Thu, Feb 29, 2024 at 4:04 PM chenbodeng719 wrote:

"if that's the case, upsert is going to update both records. " I guess it's my case. First, bulk insert brings some duplicate key into the table. Then when the upsert with duplicate key comes, it updates the duplicate rows with same key. In my case, two rows for one dup key has been changed. I wonder if there are five rows for one dup key, it updates the five rows?
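
For illustration, removing the duplicates could look roughly like the sketch below. It is plain Python over a list of dicts rather than a Spark DataFrame, and the column names `id` (record key) and `ts` (ordering field) as well as the sample rows are assumptions, not taken from the reporter's table. It keeps one row per key, preferring the row with the largest `ts`.

```python
def dedupe_latest(rows, key="id", order_by="ts"):
    """Keep a single row per key, choosing the row with the largest order_by value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # Replace the stored row only if this one is strictly newer.
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return list(latest.values())


# Hypothetical sample data: key "k1" was ingested twice.
rows = [
    {"id": "k1", "ts": 1, "v": "old"},
    {"id": "k1", "ts": 3, "v": "new"},
    {"id": "k2", "ts": 2, "v": "only"},
]
deduped = dedupe_latest(rows)
print(deduped)
```

On a real table the same idea would normally be applied to the DataFrame before it is written, so that the table never receives more than one row per key.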

ad1happy2go commented on September 28, 2024

@chenbodeng719 Can you please let us know which Hudi/Flink/Spark versions you are using?
Are you getting duplicate rows only when reading with Spark, or do you see the same behaviour when you read back with Flink too?

chenbodeng719 commented on September 28, 2024

I didn't try it on Flink. The problem happens when I use Spark.

ad1happy2go commented on September 28, 2024

@chenbodeng719 Can you post a screenshot of the duplicate records? Do they belong to different file groups?

chenbodeng719 commented on September 28, 2024

Is it possible that, if I bulk insert a dataset containing some duplicate keys, any following upsert whose key matches one of the duplicated keys would update the item twice? Like in the photo below.

ad1happy2go commented on September 28, 2024

Did you use only 0.14.1, or was this table upgraded from a previous version?
Can you also provide the values of the Hudi meta columns?

bulk_insert itself can ingest duplicates. Did you get duplicates right after the bulk_insert? If so, upsert is going to update both records. Did you confirm whether you had these duplicates after bulk_insert?

Running bulk_insert twice on the same data can also cause this issue.
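
One way to answer these questions is to group the rows by record key and list the files each key appears in. The sketch below does this in plain Python as an illustration only. `_hoodie_record_key` and `_hoodie_file_name` are Hudi meta columns, but the sample rows, file names, and the helper `find_duplicate_keys` are hypothetical.

```python
from collections import defaultdict


def find_duplicate_keys(rows, key_col="_hoodie_record_key", file_col="_hoodie_file_name"):
    """Return {record_key: sorted file names} for keys that occur on more than one row."""
    files_by_key = defaultdict(list)
    for row in rows:
        files_by_key[row[key_col]].append(row[file_col])
    return {k: sorted(v) for k, v in files_by_key.items() if len(v) > 1}


# Hypothetical rows as they might look after reading the table back.
rows = [
    {"_hoodie_record_key": "id-1", "_hoodie_file_name": "fg-a_0.parquet", "value": 10},
    {"_hoodie_record_key": "id-1", "_hoodie_file_name": "fg-b_0.parquet", "value": 10},
    {"_hoodie_record_key": "id-2", "_hoodie_file_name": "fg-a_0.parquet", "value": 7},
]
dups = find_duplicate_keys(rows)
print(dups)
```

If the same key shows up under different file names, the copies live in different files, which is the situation asked about earlier in the thread.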

chenbodeng719 commented on September 28, 2024

"if that's the case, upsert is going to update both records." I guess that's my case. First, the bulk insert brought some duplicate keys into the table. Then, when an upsert with a duplicated key came in, it updated all the rows with that key. In my case, two rows for one duplicated key were changed.
I wonder: if there are five rows for one duplicated key, does it update all five rows?
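
The behaviour described here can be pictured with a toy model. This is not Hudi's implementation, only a sketch that assumes an upsert rewrites every stored row whose key matches the incoming record. The function `toy_upsert`, the columns `id` and `v`, and the data are all made up for illustration.

```python
def toy_upsert(table, incoming, key="id"):
    """Toy model: overwrite every stored row that shares a key with an incoming
    record, append the record if the key is new. Returns the rows touched."""
    touched = 0
    for record in incoming:
        matches = [i for i, row in enumerate(table) if row[key] == record[key]]
        if matches:
            for i in matches:
                table[i] = dict(record)
            touched += len(matches)
        else:
            table.append(dict(record))
            touched += 1
    return touched


# Five pre-existing copies of key "k1" (e.g. left behind by a bulk insert), one "k2".
table = [{"id": "k1", "v": 1} for _ in range(5)] + [{"id": "k2", "v": 1}]
touched = toy_upsert(table, [{"id": "k1", "v": 2}])
print(touched, table)
```

Under that assumption, a single incoming record for `k1` changes all five stored copies and leaves `k2` alone, which matches the two-row case reported above.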
