Comments (4)
The ClickHouse Kafka engine doesn't support exactly-once either. Its flow is:
A. The ClickHouse Kafka engine fetches a batch of messages from Kafka.
B. It writes the batch to the storage engine.
C. It acks Kafka to commit the messages.
If ClickHouse crashes just after B and restarts, the Kafka engine will fetch the same messages from Kafka again, producing duplicates.
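The crash window between B and C can be illustrated with a small in-memory model. This is only a sketch: `broker`, `consumeOnce`, and the `crashAfterWrite` flag are hypothetical names invented for the illustration, not part of ClickHouse or any Kafka client.

```go
package main

// broker is a hypothetical in-memory stand-in for one Kafka partition.
type broker struct {
	log       []string // messages; the slice index plays the role of the offset
	committed int      // next offset delivered to the consumer after a restart
}

// fetch returns up to n messages starting at the committed offset (step A).
func (b *broker) fetch(n int) []string {
	end := b.committed + n
	if end > len(b.log) {
		end = len(b.log)
	}
	return b.log[b.committed:end]
}

// commit acknowledges n messages (step C).
func (b *broker) commit(n int) { b.committed += n }

// consumeOnce runs A -> B -> C once. When crashAfterWrite is true it
// returns right after step B, which models a crash before the ack.
func consumeOnce(b *broker, storage *[]string, n int, crashAfterWrite bool) {
	batch := b.fetch(n)                   // A: fetch a batch
	*storage = append(*storage, batch...) // B: write to storage
	if crashAfterWrite {
		return // crash: the offset is never committed
	}
	b.commit(len(batch)) // C: ack
}
```

Calling `consumeOnce` with `crashAfterWrite` set to true and then again with false leaves the first batch in `storage` twice, which is the at-least-once behaviour described above.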
In my experience the Kafka engine is buggy and complicated, and integrating it into the ClickHouse server lowers the database's stability. Vertica's official Kafka importer is also separate from the database server.
from clickhouse_sinker.
clickhouse_sinker guarantees:
- at-least-once delivery
- duplicated messages (per topic-partition-offset) are routed to the same ClickHouse node
So if you set up ClickHouse properly (ReplacingMergeTree ORDER BY (__kafka_topic, __kafka_partition, __kafka_offset)), you can get exactly-once semantics.
It's hard for clickhouse_sinker to guarantee exactly-once semantics without ReplacingMergeTree. A Kafka consumer group rebalance causes duplicated messages if one consumer quits suddenly.
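To make the role of the (topic, partition, offset) key concrete, here is a minimal Go sketch of collapsing rows that share that key. It only models the end result of key-based replacement; `rowKey`, `row`, and `collapse` are hypothetical names, and ClickHouse itself collapses ReplacingMergeTree rows during background merges rather than at insert time.

```go
package main

// rowKey mirrors the ORDER BY (__kafka_topic, __kafka_partition, __kafka_offset) key.
type rowKey struct {
	Topic     string
	Partition int
	Offset    int64
}

// row is a hypothetical stored record: its key plus an opaque payload.
type row struct {
	Key     rowKey
	Payload string
}

// collapse keeps one row per key (the last one seen), preserving the
// order in which keys first appeared.
func collapse(rows []row) []row {
	index := make(map[rowKey]int) // key -> position in out
	out := make([]row, 0, len(rows))
	for _, r := range rows {
		if i, ok := index[r.Key]; ok {
			out[i] = r // a duplicate replaces the earlier copy
			continue
		}
		index[r.Key] = len(out)
		out = append(out, r)
	}
	return out
}
```

Because a redelivered message carries the same topic, partition, and offset as its first delivery, both copies map to the same key and only one survives.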
clickhouse_sinker was recently restructured and achieved a large performance improvement. There's no design document right now; @sundy-li and I may add one later.
The flow is:
- Fetch messages via kafka-go, which internally starts a goroutine for each partition.
- Parse messages in a global goroutine pool (pool size is customizable) and fill the results into a ring according to each message's partition and offset.
- Generate a batch when the messages in a ring reach a batchSize boundary, or when the flush timer fires. This ensures that offset/batchSize is the same for all messages inside a batch.
- Write batches to ClickHouse in a global goroutine pool (pool size is fixed according to the number of tasks and ClickHouse instances). Each batch is routed according to (offset/batchSize)%num_clickhouse_instances.
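The batching and routing rules in the last two steps can be sketched in Go as follows. The function names (`batchIndex`, `routeBatch`, `splitAtBoundaries`) are hypothetical and this is not the actual clickhouse_sinker code; it only assumes offsets arrive sorted within one partition.

```go
package main

// batchIndex returns offset/batchSize, the value shared by every
// message inside one batch.
func batchIndex(offset, batchSize int64) int64 {
	return offset / batchSize
}

// routeBatch returns the index of the ClickHouse instance for a batch,
// computed as (offset/batchSize) % numInstances.
func routeBatch(offset, batchSize int64, numInstances int) int {
	return int(batchIndex(offset, batchSize) % int64(numInstances))
}

// splitAtBoundaries groups sorted offsets so that no batch crosses a
// batchSize boundary. A flush timer could emit a batch earlier, but it
// would still lie within a single boundary interval.
func splitAtBoundaries(offsets []int64, batchSize int64) [][]int64 {
	var batches [][]int64
	for _, off := range offsets {
		n := len(batches)
		if n == 0 || batchIndex(batches[n-1][0], batchSize) != batchIndex(off, batchSize) {
			batches = append(batches, []int64{off}) // start a new batch
			continue
		}
		batches[n-1] = append(batches[n-1], off)
	}
	return batches
}
```

With a batchSize of 1000 and 3 instances, offsets 998 and 999 form one batch routed to instance 0, while offsets 1000 and 1001 form the next batch routed to instance 1. A redelivered message therefore always lands on the same instance as its first delivery.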
In a stream processing system, such as Kafka to ClickHouse, it's hard to achieve exactly-once semantics unless ClickHouse supports a 2PC protocol.
A. Flush a batch of messages to ClickHouse.
B. Ack to commit the messages.
We may crash after A, so clickhouse_sinker is at-least-once.
But there are some ways to improve it:
- Use ReplacingMergeTree with the order keys (Kafka offset, Kafka partition), and query with GROUP BY.
- In clickhouse_sinker, after a crash that happens before the messages are acked, clickhouse_sinker recovers and sends the same batch to the same ClickHouse node. ClickHouse will use insert_deduplicate to ensure that the batch is accepted only once.
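The second option relies on a retried batch being identical to the original one. The following Go sketch models that idea with a hash over the batch contents; `node`, `blockHash`, and `insert` are hypothetical names, and the hashing shown here is an assumption for illustration, not ClickHouse's actual deduplication mechanism.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
)

// node is a hypothetical stand-in for one ClickHouse instance that
// remembers the hashes of blocks it has already accepted.
type node struct {
	seen map[[32]byte]bool
	rows []string
}

func newNode() *node { return &node{seen: make(map[[32]byte]bool)} }

// blockHash hashes the rows in order. Each row is length-prefixed so
// that row boundaries are unambiguous.
func blockHash(rows []string) [32]byte {
	h := sha256.New()
	var lenBuf [8]byte
	for _, r := range rows {
		binary.BigEndian.PutUint64(lenBuf[:], uint64(len(r)))
		h.Write(lenBuf[:])
		h.Write([]byte(r))
	}
	var sum [32]byte
	copy(sum[:], h.Sum(nil))
	return sum
}

// insert stores the batch and returns true, or returns false without
// storing anything when an identical batch was accepted before.
func (n *node) insert(rows []string) bool {
	k := blockHash(rows)
	if n.seen[k] {
		return false
	}
	n.seen[k] = true
	n.rows = append(n.rows, rows...)
	return true
}
```

In this model a retry is only rejected when it contains exactly the same rows in the same order and reaches the same node, which is why the batch boundaries and routing need to be deterministic.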
By the way, do you know whether the official ClickHouse Kafka engine supports exactly-once? What is the advantage of using clickhouse_sinker over the ClickHouse Kafka engine?