
rivulet

Introduction

This application shows how to use Akka Streams to stream data from a Kafka topic to a ClickHouse cluster.

Setup

  1. To start, you will need a working installation of Docker and docker-compose.
  2. Clone this repository.
  3. Go to the docker directory and run make config up to start the containers.
  4. This starts a ClickHouse cluster, a Kafka cluster, and ZooKeeper.
  5. You can access the ClickHouse cluster at http://localhost:8123
  6. The Kafka brokers are reachable at localhost:9092
  7. ZooKeeper is reachable at localhost:2181
  8. You can access the Kafka UI at http://localhost:8080
  9. Create a topic named pizza-orders in the Kafka UI.
  10. Create the database and table in ClickHouse:
CREATE DATABASE rivulet_db ON CLUSTER 'company_cluster';

CREATE TABLE IF NOT EXISTS rivulet_db.events
(
    timestamp Int64,
    message String
) ENGINE = Memory();


  11. Run the producer to start producing messages to Kafka. I used https://github.com/aiven/python-fake-data-producer-for-apache-kafka to produce fake data.

python main.py \
  --security-protocol plaintext \
  --host localhost \
  --port 9092 \
  --topic-name pizza-orders \
  --nr-messages 8 \
  --max-waiting-time 0 \
  --subject pizza
  12. You can see the messages in the Kafka UI.
  13. Run the application using

sbt clean compile run_client

  14. You can see the messages in ClickHouse:
SELECT count(*) FROM rivulet_db.events;
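The same count can also be fetched over ClickHouse's HTTP interface on port 8123 (the same interface the application's sink uses). Below is a minimal Scala sketch; ClickHouseQuery is an illustrative name, not part of this repo:

```scala
import java.net.{URI, URLEncoder}
import java.nio.charset.StandardCharsets

// Illustrative helper: builds the URL that ClickHouse's HTTP
// interface on port 8123 expects for a query string.
object ClickHouseQuery {
  def uriFor(sql: String): URI =
    URI.create("http://localhost:8123/?query=" +
      URLEncoder.encode(sql, StandardCharsets.UTF_8))
}

// Sending it requires the running docker setup, e.g. with the JDK client:
//   java.net.http.HttpClient.newHttpClient().send(
//     java.net.http.HttpRequest.newBuilder(
//       ClickHouseQuery.uriFor("SELECT count(*) FROM rivulet_db.events")).build(),
//     java.net.http.HttpResponse.BodyHandlers.ofString())
```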

Details

  1. The application is built using Akka Streams and the Alpakka Kafka connector. It uses at-least-once delivery semantics.
  2. It commits offsets to Kafka in batches. Currently it reads batch-size = 4 messages at a time and waits for 1000 such batches before committing the offsets to Kafka; the flush interval is 5 seconds. All of this is configurable.
  3. The application uses KafkaConsumer.committableSource to read messages from Kafka.
  4. The application uses KafkaConsumer.committableOffsetBatch to commit offsets to Kafka in batches.
  5. The source and sink logic is abstracted into separate components, so sources other than Kafka and sinks to other data stores can be plugged in.
  6. The ClickHouse HTTP API is used to create a reactive sink. Under the hood it uses the Akka HTTP client.
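The commit policy in items 2–4 can be sketched independently of Akka. This is a simplified model of "commit after N batches or after the flush interval" — CommitPolicy and OffsetBatcher are made-up names; the real application delegates this to the Alpakka committer:

```scala
// Illustrative model of the batched offset-commit policy (not the app's API).
final case class CommitPolicy(batchSize: Int, maxBatches: Int, flushIntervalMs: Long)

final class OffsetBatcher(policy: CommitPolicy) {
  private var pendingBatches = 0 // batches consumed since the last commit
  private var lastFlushMs = 0L   // time of the last commit

  // Record one consumed batch; returns true when the accumulated
  // offsets should be committed (size limit or flush interval hit).
  def offer(nowMs: Long): Boolean = {
    pendingBatches += 1
    val full  = pendingBatches >= policy.maxBatches
    val stale = nowMs - lastFlushMs >= policy.flushIntervalMs
    if (full || stale) { pendingBatches = 0; lastFlushMs = nowMs; true }
    else false
  }
}

// With time frozen at 0 only the size limit triggers:
// 2500 batches of 4 messages -> commits at batch 1000 and batch 2000.
val batcher = new OffsetBatcher(CommitPolicy(batchSize = 4, maxBatches = 1000, flushIntervalMs = 5000))
val commits = (1 to 2500).count(_ => batcher.offer(nowMs = 0)) // 2
```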

Limitations

  1. The application can lose messages during a rebalance. A rebalance listener is needed to mitigate this.
  2. Only a few ClickHouse query settings are used. More settings could be added to make the sink more robust; one useful addition would be reporting query progress via HTTP headers. See the TODO section in QuerySettings.scala.
  3. For testing purposes the table uses the Memory engine; another engine is likely better suited for this use case.
  4. The table has two columns, and the whole event is stored in one column. A view could be created on top of this table to make it more readable.
  5. It could be turned into a microservice, using Akka HTTP to expose an API that creates the table and view and starts the streaming.
  6. Once it is a microservice, metrics and health checks can be added.
  7. Scalability: the app is quite simple and can be scaled horizontally. Akka Cluster could make it more robust, with topics used as the sharding key.
  8. Akka Persistence could be used to store the offsets in case of failure, with Akka Persistence Query used to read them back from the store.
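The first limitation above boils down to one idea: flush any pending offsets before partition ownership moves during a rebalance. A minimal sketch of that shape in plain Scala — RebalanceListener and CommitOnRevoke are illustrative names, not Alpakka's API (Alpakka exposes a similar hook through its partition assignment handling):

```scala
// Illustrative shape of the rebalance mitigation, not Alpakka's API.
trait RebalanceListener {
  def onPartitionsRevoked(partitions: Set[Int]): Unit
  def onPartitionsAssigned(partitions: Set[Int]): Unit
}

final class CommitOnRevoke(commitPending: () => Unit) extends RebalanceListener {
  def onPartitionsRevoked(partitions: Set[Int]): Unit =
    commitPending() // commit before another consumer takes over these partitions
  def onPartitionsAssigned(partitions: Set[Int]): Unit =
    () // nothing pending yet for newly assigned partitions
}
```

Committing on revocation means a new owner of the partition resumes from the last processed offset instead of re-delivering (or skipping) in-flight messages.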
