Comments (4)

yuzhichang commented on July 19, 2024

The ClickHouse Kafka engine doesn't support exactly-once either. Its flow is:
A. The Kafka engine fetches a batch of messages from Kafka.
B. It writes the batch to the storage engine.
C. It acks Kafka to commit the messages.
If ClickHouse crashes just after B and restarts, the Kafka engine fetches the same messages from Kafka again, which produces duplicates.
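
The crash window between B and C can be illustrated with a small simulation. This is only a sketch in plain Go, not the Kafka engine's code: `consumeOnce`, the in-memory `partition` slice, and the `sink` slice are hypothetical stand-ins for a Kafka partition and the storage engine.

```go
package main

import "fmt"

// consumeOnce models one fetch/write/commit cycle of a hypothetical consumer.
// partition stands in for a Kafka partition, committed is the last committed
// offset, and sink collects the rows written to storage. When crashAfterWrite
// is true, step C is skipped, as if the process died right after step B.
func consumeOnce(partition []string, committed, batch int, sink *[]string, crashAfterWrite bool) int {
	end := committed + batch
	if end > len(partition) {
		end = len(partition)
	}
	fetched := partition[committed:end] // A: fetch from the committed offset
	*sink = append(*sink, fetched...)   // B: write to storage
	if crashAfterWrite {
		return committed // crash: the offset is never committed
	}
	return end // C: commit the new offset
}

func main() {
	partition := []string{"m0", "m1", "m2", "m3"}
	var sink []string
	committed := 0
	committed = consumeOnce(partition, committed, 2, &sink, true)  // crash after B
	committed = consumeOnce(partition, committed, 2, &sink, false) // restart re-fetches m0, m1
	committed = consumeOnce(partition, committed, 2, &sink, false)
	fmt.Println(sink, committed)
}
```

After the simulated restart, `m0` and `m1` appear twice in `sink`, which is the at-least-once behaviour described above.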

In my experience, the Kafka engine is buggy and complicated, and running it inside the ClickHouse server lowers the database's stability. Vertica's official Kafka importer is likewise separate from the database server.

from clickhouse_sinker.

yuzhichang commented on July 19, 2024

clickhouse_sinker guarantees:

  • at-least once
  • duplicated messages (per topic-partition-offset) are routed to the same ClickHouse node

So if you set up ClickHouse properly (ReplacingMergeTree ORDER BY (__kafka_topic, __kafka_partition, __kafka_offset)), you can get exactly-once semantics.

It's hard for clickhouse_sinker to guarantee exactly-once semantics without ReplacingMergeTree: a Kafka consumer group rebalance causes duplicated messages if one consumer quits suddenly.
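
To make the idea concrete, here is a rough model in Go of what collapsing rows on the (topic, partition, offset) key achieves. It is not ClickHouse code; `rowKey`, `row`, and `collapse` are hypothetical names. Keep in mind that ReplacingMergeTree removes duplicates during background merges, so until a merge runs a query may still see both copies.

```go
package main

import "fmt"

// rowKey mirrors the ORDER BY key: one Kafka message is identified by
// its topic, partition and offset.
type rowKey struct {
	Topic     string
	Partition int
	Offset    int64
}

type row struct {
	Key     rowKey
	Payload string
}

// collapse keeps a single row per key (the last one seen), preserving the
// order in which keys first appeared. It models the end result of merging
// duplicates, not the merge algorithm itself.
func collapse(rows []row) []row {
	index := make(map[rowKey]int)
	var out []row
	for _, r := range rows {
		if i, ok := index[r.Key]; ok {
			out[i] = r // replace the earlier copy
			continue
		}
		index[r.Key] = len(out)
		out = append(out, r)
	}
	return out
}

func main() {
	rows := []row{
		{rowKey{"events", 0, 41}, "a"},
		{rowKey{"events", 0, 42}, "b"},
		{rowKey{"events", 0, 41}, "a"}, // re-delivered after a consumer restart
	}
	fmt.Println(len(rows), "rows inserted,", len(collapse(rows)), "rows after collapsing")
}
```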

clickhouse_sinker was recently refactored and achieved a large performance improvement. There's no design document right now; @sundy-li and I may add one later.

The flow is:

  • Fetch messages via kafka-go, which internally starts a goroutine for each partition.
  • Parse messages in a global goroutine pool (pool size is customizable), and fill the results into a ring according to each message's partition and offset.
  • Generate a batch when the messages in a ring reach a batchSize boundary or the flush timer fires. This ensures offset/batchSize is the same for all messages inside a batch.
  • Write batches to ClickHouse in a global goroutine pool (pool size is fixed according to the number of tasks and ClickHouse instances). Each batch is routed according to (offset/batchSize)%num_clickhouse_instances.
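
The routing rule in the last step can be sketched as follows. This is an illustration of the formula only, not clickhouse_sinker's actual code; `batchIndex` and `shardFor` are hypothetical helper names, and the batch size and instance count are made-up values.

```go
package main

import "fmt"

// batchIndex returns offset/batchSize, which is identical for every
// message that falls inside the same batch.
func batchIndex(offset, batchSize int64) int64 {
	return offset / batchSize
}

// shardFor applies (offset/batchSize) % numInstances to pick the
// ClickHouse instance that receives the batch containing this offset.
func shardFor(offset, batchSize int64, numInstances int) int {
	return int(batchIndex(offset, batchSize) % int64(numInstances))
}

func main() {
	const batchSize, numInstances = 1000, 3
	for _, off := range []int64{0, 999, 1000, 2500, 3000} {
		fmt.Printf("offset=%d batch=%d shard=%d\n",
			off, batchIndex(off, batchSize), shardFor(off, batchSize, numInstances))
	}
}
```

Because the shard depends only on the offset and two fixed settings, a message that is re-delivered after a restart lands on the same instance as its first copy, provided batchSize and the number of instances have not changed in between.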

sundy-li commented on July 19, 2024

In a stream processing system such as Kafka to ClickHouse, it's hard to achieve exactly-once semantics unless ClickHouse supports a 2PC (two-phase commit) protocol.


A. Flush a batch of messages to ClickHouse.
B. Ack to commit the messages.

We may crash after A, so clickhouse_sinker is at-least-once.

But there are some ways to improve it:

  • Use ReplacingMergeTree with the Kafka partition and offset as the ORDER BY keys, and query with GROUP BY.
  • After a crash that happens before the messages are acked, clickhouse_sinker recovers by consuming the same batch and sending it to the same ClickHouse node; ClickHouse then uses insert_deduplicate to ensure that the batch is accepted only once.
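
The second idea relies on a retried batch being identical to the first attempt. A rough Go model of block-level deduplication is shown below. It is only a sketch of the concept: `node`, `newNode`, and `insert` are hypothetical names, SHA-256 is an arbitrary choice for the illustration, and the real server-side mechanism (which tables it applies to, how many block hashes are remembered) is governed by ClickHouse settings not modelled here.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"strings"
)

// node is a toy stand-in for one ClickHouse instance that remembers the
// hashes of the batches it has already accepted.
type node struct {
	seen map[[32]byte]bool
	rows []string
}

func newNode() *node {
	return &node{seen: make(map[[32]byte]bool)}
}

// insert stores the batch unless a byte-identical batch was accepted before.
// It reports whether the batch was actually written.
func (n *node) insert(batch []string) bool {
	sum := sha256.Sum256([]byte(strings.Join(batch, "\n")))
	if n.seen[sum] {
		return false // duplicate batch: silently skipped
	}
	n.seen[sum] = true
	n.rows = append(n.rows, batch...)
	return true
}

func main() {
	n := newNode()
	batch := []string{"m0", "m1"}
	fmt.Println(n.insert(batch)) // first attempt, before the crash
	fmt.Println(n.insert(batch)) // retry of the same batch after recovery
	fmt.Println(len(n.rows))
}
```

This only works if the retry reproduces the batch exactly (same messages, same order) and reaches the same node, which is why the fixed batch boundaries and the deterministic routing matter.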

roohitavaf commented on July 19, 2024

By the way, do you know whether the official ClickHouse Kafka engine supports exactly-once? What is the advantage of using clickhouse_sinker over the ClickHouse Kafka engine?
