Kafka Knowledge Base

This KB is intended to introduce the reader to the fundamental concepts of Kafka. Included in the Repo are a few demos on running kafka clusters as well as references pulled from various usefull articles found on the web.

Demos
References
Key Concepts / Glossary

Demos

Linux local server demo

Docker + Kafkajs (Node) demo

Docker multi network demo (Coming Soon)

References

Articles

Confluent: How to choose the number of partitions in a kafka cluster

StackOverflow : How to create a new consumer group in kafka

StackOverflow : Kafka server configuration - listeners vs. advertised.listeners

Confuent : Kafka Listeners – Explained

Conduktor: Producer Default Partitioner & Sticky Partitioner

Confluent: Top 5 Things Every Apache Kafka Developer Should Know

Books

O'Reilly: Kafka, The Definitive Guide - Neha Narkhede, Gwen Shapira & Todd Palino

Official Documentation

Apache Foundation: Kafka Documentation

Apache Foundation: Kafka Protocole Guide

Docker : Networking overview

Key Concepts / Glossary

BackPressure :
Big Data Ingestion :
Brokers : Brokers are kafka’s servers. They receive from producers and hand out data to consumers.
Cluster Controllers : A designated broker in a kafka cluster that reads/writes metadata to/from Zookeper and is responsible for assigning leadership across the brokers for each partition.
Clusters : A collection of kafka brokers.
Consumer Groups : A collection of consumers that are assigned to reading the same topic, they are assigned the same group-id. Kafka ensures that each partition of a topic is consumed by only one member of the group, a message is only read once by a single member of a consumer group.

This process allows parallel consumption of a topic, enabling higher throughput, however the maximum benefit is limited to the number of partitions on a topic. You can have more consumers than partitions on a topics (N+1) where a consumer is idle to implement a hot failover (in case a consumer fails another is ready to take its place)
Disk Based Retention :
ETL / CDC :
Events : An event is a fact that happened in the past, they are representation of something that
Fault Tolerance : happened in a system as an array of bytes.
Java Message System :
Leaders :
Offsets: sequential identifiers that are unique to each record in the partition. When a new record is written to the end of the log file (partition )it is assigned a new offset that is incremented. They are useful for consumers who need to know where to pick up from the last time they read from the partition. Messages within the partition are always ordered but there is no guarantee of ordering ACROSS partitions
Partition : Partitions of topics hold a subset of records owned by a topic. Each partition is single log file that holds records in an append only fashion
.
A rough formula for determining the number of partitions needed is max(t/p ,t/c) where t is target throughput, p is producer throughput (on a single partition) and c is consumer throughput (on a single partition). One has to decide on a trade off between speed and reliability when it come to the number of partitions on a broker as in the case of failure, having too many partitions on a broker will slow down the process of electing a new partition leader (which are the only ones that can be written to from the producers) Partition keys: The key that indicates which partition a message will be written to. When producers send a message to a topic they use information from the application context to derive a key, using a hash function, this could be a userID or a deviceID. If the input that decides the partition key is not well distributed (1 clientID generating a disproportionate amount of records), a solution would be to let kafka decide the partition assignment using round robin.

Using hashed keys may be necessary if you need messages coming from the same user/device to be consumer in the order that they were produced.
Redundancy : Each partition has replicas across a cluster’s broker which are continuously updating and “catching” to the partition leader. This allows for failover to the follower partitions in case the leader suffers a crash.
Replication :
Retention Policy :
Scalability : Kafka distributes partitions across its cluster’s brokers which leads to higher throughput, a topic can be read by multiple consumers concurrently because they don’t all have to read from the same partition.
Streams : Streams are related events that flow in sequence.
Topics : Topics group related events and store them. They are similar to tables in databases and folders in file systems. In * Kafka they are used to decouple the producers and consumers of data. Producers will "POST" while consumers will "GET" from a topic without having to be directly connected between themselves, they reach out to the kafka brokers (specifically the leaders of a topic) for such ends

karimwitani / kafka_demo Goto Github PK

kafka_demo's Introduction

Kafka Knowledge Base

Table of contents

Demos

References

Articles

Books

Official Documentation

Key Concepts / Glossary

kafka_demo's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent