
Comments (8)

kamalaboulhosn commented on August 10, 2024

It is not currently possible to get this information via the messages published into Cloud Pub/Sub from Kafka. The connector could be extended to store these fields as message attributes, though.


johanesalxd commented on August 10, 2024

I see. Usually I check the Kafka message timestamp, partition, and offset to do the audit, but given that we don't have that information anymore once we ingest into Pub/Sub, can you advise on how to do the data auditing?


kir-titievsky commented on August 10, 2024


johanesalxd commented on August 10, 2024

Not entirely. Some context from me: currently Kafka is the primary data pipeline in the company and is managed by another team, but for a few reasons, mostly simplification, I ingest the data from Kafka into our GCP environment via Pub/Sub.

The use case itself varies, from raw ingestion to real-time aggregation. I wouldn't need this if I did the aggregation using Dataflow, but as a proof of concept I need to run an auditing process and compare against Kafka (we collect this information when we ingest the data using confluent-kafka-python).

Can you advise on this auditing process? If I can somehow get the information below, I think that's good enough:

  1. the latency from when a message is produced to Kafka until it is delivered to Pub/Sub
  2. whether Pub/Sub generates duplicate data
  3. whether any messages are dropped along the way

I plan to use this as the primary ingestion system based on the outcome of the audit.


kir-titievsky commented on August 10, 2024

Ah, I think I understand better now. When you say audit, you mean that you want to prove that the output of the Pub/Sub-based pipeline matches the output of the Kafka-based one, the output being the full set of messages. Auditing in this way would not be a permanent part of your production pipeline. Does that seem right?

If so, we don't provide tools that do this directly. You should expect some additional latency and duplication due to Pub/Sub, though. I'm not sure whether there are conditions where the Kafka sink can lose messages on crashes and such. @kamalaboulhosn: are there any guarantees to be made about the Kafka sink adapter here?

Otherwise, you would have to write your own log-parsing tools to compare the sets of messages. You may be able to adapt the FLIC framework, which does end-to-end latency and throughput testing for Kafka and Pub/Sub; it's included in this repo.


kamalaboulhosn commented on August 10, 2024

Without changing the connector code, you would need to have some unique identifier you attach to each message and use for auditing purposes. It would also be fairly straightforward to update the connector to attach these pieces of information to a message. It really comes down to whether or not that makes sense in the general use case, but I am certainly not opposed to it.
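A minimal sketch of that approach, assuming JSON payloads and using confluent-kafka-python (already in use on the Kafka side); the broker address, topic name, and the audit_id field are placeholders rather than anything the connector provides:

```python
# Hypothetical sketch: embed a unique audit ID in each record produced to Kafka
# so the same ID can be matched later on the Pub/Sub side. The broker address,
# topic name, and "audit_id" field are placeholders.
import json
import uuid

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def produce_with_audit_id(topic, payload):
    audit_id = str(uuid.uuid4())
    record = dict(payload, audit_id=audit_id)  # embed the ID in the message body
    producer.produce(topic, value=json.dumps(record).encode("utf-8"))
    return audit_id

produced_ids = {produce_with_audit_id("my-topic", {"n": i}) for i in range(10)}
producer.flush()
```

The set of produced IDs can then be compared against what arrives on the Pub/Sub side.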

With regard to your specific questions:

  1. Latency: The additional latency would basically be the amount of time spent in the Kafka connector + time spent in Google Cloud Pub/Sub. Google Cloud Pub/Sub currently offers no latency guarantees, but under 1s at the 99th percentile (assuming the subscriber is caught up and able to receive the most recently published messages) would be a good estimate.

  2. Duplication: The entire pipeline could result in duplicated messages. It is possible that a publish into Cloud Pub/Sub by the connector could time out and a republish would need to be attempted, which would result in the same message being published twice. It is also possible that Cloud Pub/Sub itself would deliver the message multiple times as the system has at-least-once delivery guarantees.

  3. Message loss: Cloud Pub/Sub itself guarantees delivery of messages that have been published successfully (assuming they are consumed within seven days). The connector itself is designed to prevent message loss, too: it only acknowledges a message it receives from Kafka once that message has been successfully published to Cloud Pub/Sub. That means that if the connector were to crash while processing messages, you would see duplicates rather than dropped messages. (A subscriber-side sketch of checking points 2 and 3 follows this list.)
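As a rough illustration of points 2 and 3, here is a subscriber-side sketch that counts deliveries per audit ID and compares them with the IDs recorded when producing to Kafka; the project/subscription names and the audit_id field are assumptions carried over from the producer sketch above, not something the connector adds:

```python
# Hypothetical sketch: count how many times each audit ID is delivered by
# Pub/Sub, then compare against the IDs recorded on the Kafka producer side.
# Project and subscription names are placeholders.
import json
from collections import Counter
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "kafka-audit-sub")

deliveries = Counter()  # audit_id -> number of deliveries observed

def callback(message):
    audit_id = json.loads(message.data.decode("utf-8")).get("audit_id")
    deliveries[audit_id] += 1
    message.ack()

streaming_pull = subscriber.subscribe(subscription, callback=callback)
try:
    streaming_pull.result(timeout=60)  # collect messages for a while, then stop
except TimeoutError:
    streaming_pull.cancel()

produced_ids = set()  # fill with the audit IDs recorded on the Kafka side
duplicated = {k: v for k, v in deliveries.items() if v > 1}
missing = produced_ids - set(deliveries)
print(f"duplicated: {len(duplicated)}, missing: {len(missing)}")
```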


johanesalxd commented on August 10, 2024

Yes, that's correct, @kir-titievsky. What I planned previously was: if I ever need to audit across the two systems, I can write generic code to compare the Kafka metadata (e.g. partition and offset) without looking at the data itself, while at the same time auditing at the data level by comparing the message values (though that part can't be generic, since each data model varies).

I have a clearer understanding now, @kamalaboulhosn, @kir-titievsky. Currently we do the monitoring via Stackdriver and have created alerting policies for publish errors, and we have found nothing so far. We tested by transferring around 20 to 30K messages per second and it works really well; it can reach more than 150K messages per second on a single worker if we copy from the smallest offset available in our test topic.

I'll take a look at the FLIC framework and will definitely ask more if I find anything else. I am curious, though: do you plan to add this feature, attaching the Kafka metadata to message attributes, in the near future? When I first checked this connector, I found it odd that there is no metadata available to be accessed, since it's a pretty common feature in the ingestion libraries I've used along the way (this is the first Kafka Connect connector I've used, though, so maybe other connectors don't include this feature either).

Thanks!


kamalaboulhosn commented on August 10, 2024

If you set the metadata.publish property to true in your sink connector config, you should now be able to access the kafka.topic, kafka.partition, kafka.offset, and kafka.timestamp attributes in the Pub/Sub message.
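For example, a subscriber can read those attributes back roughly like this; the project/subscription names are placeholders, and treating kafka.timestamp as epoch milliseconds (the Kafka record timestamp) is an assumption worth verifying:

```python
# Hypothetical sketch: read the Kafka metadata attributes from Pub/Sub messages
# published by the connector with metadata.publish=true. Project/subscription
# names are placeholders; kafka.timestamp is assumed to be epoch milliseconds.
import time

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "kafka-mirror-sub")

def callback(message):
    attrs = message.attributes
    source = (attrs["kafka.topic"], attrs["kafka.partition"], attrs["kafka.offset"])
    latency_ms = time.time() * 1000 - int(attrs["kafka.timestamp"])
    print(f"{source}: end-to-end latency ~{latency_ms:.0f} ms")
    message.ack()

streaming_pull = subscriber.subscribe(subscription, callback=callback)
streaming_pull.result()  # blocks; interrupt to stop
```

With these attributes available, the (kafka.topic, kafka.partition, kafka.offset) tuple can also serve as the audit key instead of a custom identifier.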
