
Comments (8)

kamalaboulhosn commented on August 10, 2024

It is not currently possible to get this information via the messages published into Cloud Pub/Sub from Kafka. The connector could be extended to store these fields as message attributes, though.


johanesalxd commented on August 10, 2024

I see. Usually I check the Kafka message timestamp, partition, and offset to do the audit, but given that we don't have that information anymore once we ingest into Pub/Sub, can you advise on how to do the data auditing?


kir-titievsky commented on August 10, 2024


johanesalxd commented on August 10, 2024

Not entirely. Some context from me: currently Kafka is the primary data pipeline in the company and is managed by another team, but for a few reasons, mostly simplification, I ingest the data from Kafka into our GCP environment via Pub/Sub.

The use case itself varies, from raw ingestion to real-time aggregation. I wouldn't need this if I did the aggregation using Dataflow, but as a proof of concept I need to run an auditing process and compare against Kafka (we collect this information when we ingest the data using confluent-kafka-python).

Can you advise on this auditing process? If I can somehow get the information below, I think that's good enough:

  1. the latency from when a message is produced to Kafka until it is delivered to Pub/Sub
  2. whether Pub/Sub generates duplicate data
  3. whether any messages are dropped along the way

I plan to use this as the primary ingestion system based on the outcome of the audit.


kir-titievsky commented on August 10, 2024

Ah, I think I understand better now. When you say audit, you mean that you want to prove that the output of the Pub/Sub-based pipeline matches the output of the Kafka-based one, the output being the full set of messages. Auditing in this way would not be a permanent part of your production pipeline. Does that seem right?

If so, we don't provide tools that do this directly. You should expect some additional latency and duplication due to Pub/Sub, though. I'm not sure whether there are conditions where the Kafka sink can lose messages on crashes and such. @kamalaboulhosn: are there any guarantees to be made about the Kafka sink adapter here?

Otherwise, you would have to write your own log-parsing tools to compare the sets of messages. You may be able to adapt the FLIC framework, which does end-to-end latency and throughput testing for Kafka and Pub/Sub; it's included in this repo.


kamalaboulhosn commented on August 10, 2024

Without changing the connector code, you would need to have some unique identifier you attach to each message and use for auditing purposes. It would also be fairly straightforward to update the connector to attach these pieces of information to a message. It really comes down to whether or not that makes sense in the general use case, but I am certainly not opposed to it.
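A minimal sketch of that approach, assuming JSON payloads and using confluent-kafka-python (already in use on the Kafka side); the broker address, topic name, and the audit_id field are placeholders rather than anything the connector provides:

```python
# Hypothetical sketch: embed a unique audit ID in each record produced to Kafka
# so the same ID can be matched later on the Pub/Sub side. The broker address,
# topic name, and "audit_id" field are placeholders.
import json
import uuid

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def produce_with_audit_id(topic, payload):
    audit_id = str(uuid.uuid4())
    record = dict(payload, audit_id=audit_id)  # embed the ID in the message body
    producer.produce(topic, value=json.dumps(record).encode("utf-8"))
    return audit_id

produced_ids = {produce_with_audit_id("my-topic", {"n": i}) for i in range(10)}
producer.flush()
```

The set of produced IDs can then be compared against what arrives on the Pub/Sub side.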

With regard to your specific questions:

  1. Latency: The additional latency would basically be the amount of time spent in the Kafka connector + time spent in Google Cloud Pub/Sub. Google Cloud Pub/Sub currently offers no latency guarantees, but under 1s at the 99th percentile (assuming the subscriber is caught up and able to receive the most recently published messages) would be a good estimate.

  2. Duplication: The entire pipeline could result in duplicated messages. It is possible that a publish into Cloud Pub/Sub by the connector could time out and a republish would need to be attempted, which would result in the same message being published twice. It is also possible that Cloud Pub/Sub itself would deliver the message multiple times as the system has at-least-once delivery guarantees.

  3. Message loss: Cloud Pub/Sub itself guarantees delivery of messages that have been published successfully (assuming they are consumed within seven days). The connector itself is designed to prevent message loss, too: it only acknowledges a message it receives from Kafka once that message has been successfully published to Cloud Pub/Sub. That means that if the connector were to crash while processing messages, you would see duplicates rather than dropped messages. (A subscriber-side sketch of checking points 2 and 3 follows this list.)
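As a rough illustration of points 2 and 3, here is a subscriber-side sketch that counts deliveries per audit ID and compares them with the IDs recorded when producing to Kafka; the project/subscription names and the audit_id field are assumptions carried over from the producer sketch above, not something the connector adds:

```python
# Hypothetical sketch: count how many times each audit ID is delivered by
# Pub/Sub, then compare against the IDs recorded on the Kafka producer side.
# Project and subscription names are placeholders.
import json
from collections import Counter
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "kafka-audit-sub")

deliveries = Counter()  # audit_id -> number of deliveries observed

def callback(message):
    audit_id = json.loads(message.data.decode("utf-8")).get("audit_id")
    deliveries[audit_id] += 1
    message.ack()

streaming_pull = subscriber.subscribe(subscription, callback=callback)
try:
    streaming_pull.result(timeout=60)  # collect messages for a while, then stop
except TimeoutError:
    streaming_pull.cancel()

produced_ids = set()  # fill with the audit IDs recorded on the Kafka side
duplicated = {k: v for k, v in deliveries.items() if v > 1}
missing = produced_ids - set(deliveries)
print(f"duplicated: {len(duplicated)}, missing: {len(missing)}")
```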


johanesalxd commented on August 10, 2024

Yes, that's correct, @kir-titievsky. What I planned previously was: if I ever need to audit across the two systems, I can write generic code to compare the Kafka metadata (e.g. partition and offset) without looking at the data itself, while at the same time auditing at the data level by comparing the message values (though that part can't be generic, since each data model varies).

I have a clearer understanding now, @kamalaboulhosn, @kir-titievsky. Currently we do the monitoring via Stackdriver and have created alerting policies for publish errors, and we have found nothing so far. We tested by transferring around 20 to 30K messages per second and it works really well; it can reach more than 150K messages per second on a single worker if we copy from the smallest offset available in our test topic.

I'll take a look at the FLIC framework and will definitely ask more if I find anything else. I am curious, though: do you plan to add this feature, attaching the Kafka metadata to message attributes, in the near future? When I first checked this connector, I found it odd that there is no metadata available to be accessed, since it's a pretty common feature in the ingestion libraries I've used along the way (this is the first Kafka Connect connector I've used, though, so maybe other connectors don't include this feature either).

Thanks!


kamalaboulhosn commented on August 10, 2024

If you set the metadata.publish property to true in your sink connector config, you should now be able to access the kafka.topic, kafka.partition, kafka.offset, and kafka.timestamp attributes in the Pub/Sub message.
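For example, a subscriber can read those attributes back roughly like this; the project/subscription names are placeholders, and treating kafka.timestamp as epoch milliseconds (the Kafka record timestamp) is an assumption worth verifying:

```python
# Hypothetical sketch: read the Kafka metadata attributes from Pub/Sub messages
# published by the connector with metadata.publish=true. Project/subscription
# names are placeholders; kafka.timestamp is assumed to be epoch milliseconds.
import time

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "kafka-mirror-sub")

def callback(message):
    attrs = message.attributes
    source = (attrs["kafka.topic"], attrs["kafka.partition"], attrs["kafka.offset"])
    latency_ms = time.time() * 1000 - int(attrs["kafka.timestamp"])
    print(f"{source}: end-to-end latency ~{latency_ms:.0f} ms")
    message.ack()

streaming_pull = subscriber.subscribe(subscription, callback=callback)
streaming_pull.result()  # blocks; interrupt to stop
```

With these attributes available, the (kafka.topic, kafka.partition, kafka.offset) tuple can also serve as the audit key instead of a custom identifier.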
