Comments (8)
It is not currently possible to get this information via the messages published into Cloud Pub/Sub from Kafka. The connector could be extended to store these fields as message attributes, though.
from pubsub.
I see. Usually I check the Kafka message timestamp, partition, and offset to do the audit, but given that we no longer have that information once the data is ingested into Pub/Sub, can you advise on how to do the data auditing?
Not entirely. Some context from me: currently I have Kafka as the primary data pipeline in the company, managed by others, but, mostly for simplification, I ingest the data from Kafka into our GCP environment via Pub/Sub.
The use case varies, from raw ingestion to real-time aggregation. I won't need this if I do the aggregation using Dataflow, but as a proof of concept I need to do the auditing process and compare it with Kafka (we collect this information when we ingest the data using confluent-kafka-python).
Can you advise on this auditing process? If I can get the information below, I think it's good enough:
- the latency from a message being produced to Kafka until it is delivered to Pub/Sub
- whether Pub/Sub generates duplicate messages
- whether any messages are dropped along the way
I plan to use this as the primary ingestion system, based on the outcome of the audit.
Ah, I think I understand better now. When you say audit, you mean that you want to prove that the output of the Pub/Sub-based pipeline matches the output of the Kafka-based one, the output being the full set of messages. Auditing in this way would not be a permanent part of your production pipeline. Does that seem right?
If so, we don't provide tools that do this directly. You should expect additional latency and duplication due to Pub/Sub, though. I'm not sure if there are conditions where the Kafka sink can lose messages on crashes and such. @kamalaboulhosn: are there any guarantees the Kafka sink adapter makes here?
Otherwise, you would have to write your own log-parsing tools to compare the sets of messages. You may be able to adapt the FLIC framework, which does end-to-end latency and throughput testing for Kafka and Pub/Sub; it's included in this repo.
Without changing the connector code, you would need to have some unique identifier you attach to each message and use for auditing purposes. It would also be fairly straightforward to update the connector to attach these pieces of information to a message. It really comes down to whether or not that makes sense in the general use case, but I am certainly not opposed to it.
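As a sketch of the unique-identifier approach described above (all names here are illustrative, not part of the connector): record the IDs attached on the Kafka side, record the IDs observed by the Pub/Sub subscriber, and compare the two multisets.

```python
# Sketch: audit by comparing the multiset of unique IDs recorded on the
# Kafka side against the IDs observed by the Pub/Sub subscriber.
# The IDs are assumed to be attached by you at produce time (e.g. as a
# message attribute); nothing here comes from the connector itself.
from collections import Counter

def audit(kafka_ids, pubsub_ids):
    """Return (dropped, duplicated) given the IDs seen on each side."""
    expected = Counter(kafka_ids)
    observed = Counter(pubsub_ids)
    # IDs produced to Kafka but never seen on the Pub/Sub side.
    dropped = set(expected) - set(observed)
    # IDs delivered more than once by Pub/Sub (at-least-once delivery).
    duplicated = {m for m, n in observed.items() if n > 1}
    return dropped, duplicated

dropped, duplicated = audit(
    kafka_ids=["a", "b", "c"],
    pubsub_ids=["a", "b", "b"],  # "b" delivered twice, "c" missing
)
```

Because Pub/Sub is at-least-once, a non-empty `duplicated` set is expected and benign; a non-empty `dropped` set only indicates loss once the subscriber has fully caught up.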
With regard to your specific questions:
- Latency: the additional latency would basically be the time spent in the Kafka connector plus the time spent in Google Cloud Pub/Sub. Google Cloud Pub/Sub currently offers no latency guarantees, but under 1s at the 99th percentile (assuming the subscriber is caught up and able to receive the most recently published messages) would be a good estimate.
- Duplication: the entire pipeline could result in duplicated messages. A publish into Cloud Pub/Sub by the connector could time out and require a republish, which would result in the same message being published twice. Cloud Pub/Sub itself could also deliver a message multiple times, as the system has at-least-once delivery guarantees.
- Message loss: Cloud Pub/Sub itself guarantees delivery of messages that have been published successfully (assuming they are consumed within seven days). The connector is also designed to prevent message loss: it only acknowledges a message it receives from Kafka once that message has been successfully published to Cloud Pub/Sub. That means that if the connector were to crash while processing messages, you would see duplicates rather than dropped messages.
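For the latency point above, one simple measurement approach (the attribute name is an assumption you would define yourself, not something the connector provides) is to attach a produce-time timestamp as a message attribute on the Kafka side and subtract it from the receive time on the Pub/Sub side:

```python
# Sketch: estimate end-to-end (produce-to-receive) latency using a
# produce-time timestamp carried as a message attribute. The attribute
# name "produced_at_ms" is purely illustrative.

def latency_ms(attributes, received_at_ms):
    """Produce-to-receive latency for one message, in milliseconds."""
    produced_at_ms = int(attributes["produced_at_ms"])
    return received_at_ms - produced_at_ms

# Example: a message produced at t=1000 ms and received at t=1800 ms
# spent roughly 800 ms in the connector plus Pub/Sub transit.
print(latency_ms({"produced_at_ms": "1000"}, received_at_ms=1800))
```

Note this measures producer-clock-to-subscriber-clock time, so clock skew between the two machines is included in the estimate.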
Yes, that's correct, @kir-titievsky. My original plan was that, if I needed to audit across the two systems, I could write generic code to compare the Kafka metadata (e.g. partition and offset) without looking at the data itself; at the same time, I could audit at the data level by comparing the message values (though that can't be generic, since each data model varies).
I have a clearer understanding now, @kamalaboulhosn, @kir-titievsky. Currently we monitor via Stackdriver, with alerting policies for publish errors, and we have found nothing so far. We tested by transferring around 20-30K messages per second and it works really well; it can reach more than 150K messages per second on one worker if we copy from the smallest offset available in our test topic.
I'll take a look at the FLIC framework and will definitely ask if I find anything else. But I am curious: do you plan to add this feature, attaching Kafka metadata as message attributes, in the near future? When I first checked this Kafka Connect connector, I found it pretty odd that no metadata is accessible, since that's a common feature in the ingestion libraries I've used (this is the first Kafka Connect connector I've used, though, so perhaps others don't include it either).
Thanks!
If you set the metadata.publish property to true in your sink connector config, you should now be able to access the kafka.topic, kafka.partition, kafka.offset, and kafka.timestamp attributes in the Pub/Sub message.
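A minimal sink connector config enabling this might look as follows (the topic, project, and connector names are placeholders; only metadata.publish and the connector class reflect the actual feature):

```properties
name=cps-sink-connector
connector.class=com.google.pubsub.kafka.sink.CloudPubSubSinkConnector
tasks.max=1
topics=my-kafka-topic
cps.project=my-gcp-project
cps.topic=my-pubsub-topic
# Attach kafka.topic, kafka.partition, kafka.offset, and
# kafka.timestamp as Pub/Sub message attributes.
metadata.publish=true
```

With this set, a Pub/Sub subscriber can read those kafka.* attributes from each received message and use them directly for the partition/offset-based auditing discussed above.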