
kafka-demo

This repository captures a Kafka demo and scripts to:

  1. Generate an Avro schema and register it in the schema registry
  2. Generate some data and write it to a Kafka topic as a producer
  3. Consume the data from the Kafka topic as a consumer
  4. Consume the data as a batch from the Kafka topic as a Spark batch job
  5. Stream the data from the Kafka topic using Spark streaming

Libraries

This demo depends on the Confluent Kafka Python libraries. They are the fastest and best-supported of the available Python libraries at this time, and instructions for installation and dependencies can be found here

Installing the confluent kafka libs:

$ pip install confluent-kafka
$ pip install confluent-kafka[avro]
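
To confirm the install worked (a quick sanity check, not part of the repo's scripts), you can print the client and librdkafka versions:

$ python -c "import confluent_kafka; print(confluent_kafka.version(), confluent_kafka.libversion())"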

You may also want to try "kafkacat" from GitHub here, which can be installed using apt, brew, etc.

Avro

Avro is a data serialization system that provides a compact, fast, binary data format that can be sent over the wire. It enables schema evolution without many of the drawbacks of other formats (CSV, JSON, XML). While many Kafka examples send messages composed of strings, JSON, or CSV data, to make this closer to a real-world application we are going to use Avro, as it is much more robust for non-trivial use cases.

Creating an Avro Schema File

Avro schemas are defined in JSON files that, by convention, end in .avsc; beyond that I am not aware of any file naming conventions. Kafka used to reason about the world in terms of messages and offsets; this has since evolved to keys and values, so you may want to define a schema for both.

We are going to create a schema file: click_v1.avsc

To keep things simple we will only be defining a schema for the value:

{
     "type": "record",
     "namespace": "com.example",
     "name": "click",
     "version": 1,
     "fields": [
       { "name": "id", "type": "string" },
       { "name": "impression_id", "type": "string" },
       { "name": "creative_id", "type": "string" },
       { "name": "placement_id", "type": "string" },
       { "name": "timestamp", "type": 
          { "type": "long", "logicalType": "timestamp-millis" } 
       },
       { "name": "user_agent", "type": ["string", "null"] },
       { "name": "ip", "type": ["string", "null"] },
       { "name": "referrer", "type": ["string", "null"] },
       { "name": "cost", "type": "float" }
     ]
}
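
As a preview of how this schema will eventually be used, here is a minimal producer sketch built on the confluent-kafka Avro helpers. This is an illustration rather than the repo's producer script: the broker and registry addresses assume the local setup used later in this demo, and the field values are made-up samples. Note that AvroProducer registers the value schema under the subject {topic}-value automatically if it is not already present.

import time
import uuid

from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

# Parse the schema file; this fails fast if click_v1.avsc is malformed.
value_schema = avro.load('click_v1.avsc')

producer = AvroProducer({
    'bootstrap.servers': 'localhost:9092',        # assumed local broker
    'schema.registry.url': 'http://localhost:8081',
}, default_value_schema=value_schema)

# A sample click record; every value here is a made-up placeholder.
producer.produce(topic='clicks', value={
    'id': str(uuid.uuid4()),
    'impression_id': str(uuid.uuid4()),
    'creative_id': 'creative-123',
    'placement_id': 'placement-456',
    'timestamp': int(time.time() * 1000),         # long, timestamp-millis
    'user_agent': 'Mozilla/5.0',
    'ip': '10.0.0.1',
    'referrer': None,                             # nullable via the union type
    'cost': 0.25,
})
producer.flush()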

Schema Registry

To coordinate evolving schema versions, protect against malformed messages, and manage change, enter the schema registry. This mechanism helps avoid several problems that arose in the past:

  • Serializers and deserializers had to change, otherwise no new data could be processed
  • Producers and consumers had to change, otherwise no new data could be processed
  • Old messages could no longer be processed by anyone who had upgraded to the new schema
  • New messages could not be processed by producers or consumers that were not updated

The schema registry is provided as part of the Confluent Platform, and more information on it can be found here. For our purposes we are going to generate an Avro schema, upload that schema to the registry, and then begin producing and consuming data. Evolving the schema to a new version may be handled in a future entry.

Schema Naming Conventions

Topic names in Kafka should follow the convention {subject}-{format}, where subject would be something like clicks and format indicates what format the data is in, so avro, protobuf, json, etc. For our purposes we are using Avro, so our topic would be clicks-avro. Correspondingly, schemas also have a naming convention: {topic}-{key|value}. Based on our clicks-avro topic and the fact that we are only providing a schema for values, our schema name will be clicks-avro-value. In the schema registry, an entry for a schema is called a subject.

The RESTful Schema Registry API

The schema registry exposes a RESTful API that is defined here.

Adding a Schema to the Schema Registry

To add a schema, a POST request needs to be made as outlined here. Generally you post your JSON schema to an endpoint of the form /subjects/(string: subject)/versions.

Based on the schema above, the add is done like this:

curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{ "schema": "{\r\n     \"type\": \"record\",\r\n     \"namespace\": \"com.example\",\r\n     \"name\": \"click\",\r\n     \"fields\": [\r\n       { \"name\": \"id\", \"type\": \"string\" },\r\n       { \"name\": \"impression_id\", \"type\": \"string\" },\r\n       { \"name\": \"creative_id\", \"type\": \"string\" },\r\n       { \"name\": \"placement_id\", \"type\": \"string\" },\r\n       { \"name\": \"timestamp\", \"type\": \r\n          { \"type\": \"long\", \"logicalType\": \"timestamp-millis\" } \r\n       },\r\n       { \"name\": \"user_agent\", \"type\": [\"string\", \"null\"] },\r\n       { \"name\": \"ip\", \"type\": [\"string\", \"null\"] },\r\n       { \"name\": \"referrer\", \"type\": [\"string\", \"null\"] },\r\n       { \"name\": \"cost\", \"type\": \"float\" }\r\n     ]\r\n}" }' \
  http://localhost:8081/subjects/clicks-avro-value/versions

Note that the JSON has the double quote (") characters escaped and the newline characters replaced. Getting the JSON string properly escaped can be a tedious task, so you can use an online tool like the one here to do it for you.
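
Alternatively, the escaping can be avoided entirely by issuing the POST from Python and letting json.dumps do the work. A minimal sketch (not a script from the repo):

import json
import requests

# Read the schema file as-is; json.dumps below handles all the escaping
# that the curl example does by hand.
with open('click_v1.avsc') as f:
    schema_str = f.read()

resp = requests.post(
    'http://localhost:8081/subjects/clicks-avro-value/versions',
    headers={'Content-Type': 'application/vnd.schemaregistry.v1+json'},
    data=json.dumps({'schema': schema_str}),
)
print(resp.json())   # e.g. {"id": 1}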

Get a List of Subjects

The REST API also allows you to query for subjects. Listing the subjects in the schema registry can be accomplished by issuing a GET request to /subjects, as shown here:

$ curl http://localhost:8081/subjects/
["clicks-avro-value"]%

Get a List of Versions for a Subject

The REST API also allows you to query for the available versions of a specific subject with a GET request, as shown here:

$ curl http://localhost:8081/subjects/clicks-avro-value/versions
[1]

Getting a specific schema version:

$ curl http://localhost:8081/subjects/clicks-avro-value/versions/1
{"subject":"clicks-avro-value","version":1,"id":1,"schema":"{\"type\":\"record\",\"name\":\"click\",\"namespace\":\"com.example\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"impression_id\",\"type\":\"string\"},{\"name\":\"creative_id\",\"type\":\"string\"},{\"name\":\"placement_id\",\"type\":\"string\"},{\"name\":\"timestamp\",\"type\":{\"type\":\"long\",\"logicalType\":\"timestamp-millis\"}},{\"name\":\"user_agent\",\"type\":[\"string\",\"null\"]},{\"name\":\"ip\",\"type\":[\"string\",\"null\"]},{\"name\":\"referrer\",\"type\":[\"string\",\"null\"]},{\"name\":\"costs\",\"type\":\"float\"}]}"}%

Create the relevant Kafka topic:

$ vagrant ssh
vagrant@vagrant-ubuntu-trusty-64:~$ kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 2 --topic clicks
Created topic "clicks".
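
The same topic can also be created programmatically with the AdminClient that ships with confluent-kafka (a sketch, assuming the broker is reachable on localhost:9092):

from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({'bootstrap.servers': 'localhost:9092'})

# create_topics() is asynchronous and returns a dict of topic name -> future.
futures = admin.create_topics([NewTopic('clicks', num_partitions=2, replication_factor=1)])
futures['clicks'].result()   # raises on failure; returns None on success
print('Created topic "clicks"')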

Verify topic creation:

$ kafkacat -L -b localhost
Metadata for all topics (from broker -1: localhost:9092/bootstrap):
 1 brokers:
  broker 0 at vagrant-ubuntu-trusty-64:9092
 4 topics:
  topic "__confluent.support.metrics" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0
  topic "_schemas" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0
  topic "clicks" with 2 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0
    partition 1, leader 0, replicas: 0, isrs: 0
  topic "test-topic" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0
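
Finally, to check the pipeline end to end, a minimal AvroConsumer sketch (again an illustration, not the repo's consumer; the group id is a made-up name) can read back anything produced to the topic:

from confluent_kafka.avro import AvroConsumer

consumer = AvroConsumer({
    'bootstrap.servers': 'localhost:9092',
    'schema.registry.url': 'http://localhost:8081',
    'group.id': 'clicks-demo',              # hypothetical consumer group
    'auto.offset.reset': 'earliest',        # start from the beginning of the topic
})
consumer.subscribe(['clicks'])

msg = consumer.poll(10)                     # wait up to 10 seconds for a message
if msg is not None and msg.error() is None:
    print(msg.value())                      # the Avro payload, decoded to a dict
consumer.close()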
