
RATS: Realtime Ad Target System

RATS: Realtime Ad Target System. A data pipeline that ingests real-time social media comment data and processes it with real-time SQL queries.

In this project I:

  1. Implemented a data pipeline called RATS (Realtime Ad Target System) on AWS.
  2. Selected the best posts to place ads on from a stream of reddit.com comment data, for selected target categories.
  3. Ingested more than 1 TB of data through Kafka and processed it with Spark Structured Streaming, running SQL queries on real-time data with 600 ms latency.
  4. Leveraged Kafka consumer groups and multiple partitions to increase write throughput more than 3x, and reduced latency by a factor of 10 by tuning Spark.

Presentation Link

Video of the system demo here: it shows the posts with the highest page views, updated in real time.



How to install and get it up and running

Set up a cluster with Pegasus on AWS.

Create a VPC.

Create a public subnet within the VPC for 11 instances (remember to increase the Elastic IP limit).

Install virtual environments on the instances with Python 3, tmux, and kafka-python.

Cluster specs:

  1. Kafka cluster - 4 nodes m4.large

  2. Spark cluster - 1 master, 4 slaves m4.2xlarge

  3. PostgreSQL instance m4.xlarge

  4. Instance for running producer and consumer m4.4xlarge


Introduction

The real-time ad-bidding industry requires high-throughput pipelines that can process social media or user web-session data with ultra-low latency. In this project I designed a pipeline that processes real-time streams of social media comment data to find the best places to post ads.

Architecture

Data is read from AWS S3 by a Python producer and published to a Kafka topic. Spark Structured Streaming then consumes this topic and aggregates the data with streaming Spark SQL queries. The results are written to a second Kafka topic, from which Python consumers write them to a PostgreSQL database. A minimal producer sketch follows the diagram below.

(Architecture diagram)
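
A minimal sketch of the producer stage, assuming kafka-python and boto3; the broker address, bucket, key, and topic names are placeholders:

import boto3
from kafka import KafkaProducer

# Producer sketch: stream a monthly comment dump from S3 and publish each line to Kafka.
producer = KafkaProducer(bootstrap_servers='kafka-node-1:9092',  # hypothetical broker
                         acks=1)  # leader acknowledgement; see the throughput notes below

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='reddit-comments', Key='RC_2018-10.json')  # hypothetical names
for line in obj['Body'].iter_lines():
    producer.send('comments', value=line)  # one JSON comment per message
producer.flush()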

Dataset

Reddit.com comment dataset downloaded from Pushshift

Each month's dataset is ~8 GB compressed and ~150 GB uncompressed.

Each comment is stored in JSON format with multiple keys, of which the following are used (an example record is sketched after the list):

  1. post
  2. subreddit
  3. body
  4. timestamp
  5. author
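
For illustration, a single parsed comment might look roughly like this (field values are invented, and the real dump carries many more keys):

import json

# Hypothetical example record using only the keys listed above.
raw = ('{"post": "t3_abc123", "subreddit": "travel", '
       '"body": "Planning a beach vacation this summer!", '
       '"timestamp": 1538352000, "author": "example_user"}')
comment = json.loads(raw)
print(comment["subreddit"], comment["body"])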

A wildcard match is used to filter for certain words in the body of the comment. In the following example, I use a few handpicked keywords to filter for travel ads. The keywords could be improved by analyzing user click behaviour.

Here, lines corresponds to the input streaming DataFrame, with comments filtered by keywords in the comment body. Patterns such as '%urope%' drop the leading letter so that both 'Europe' and 'europe' match:

lines = spark.sql("SELECT * FROM updates WHERE body LIKE '%vacation%'\
                      OR body LIKE '%holiday%'\
                      OR body LIKE '%beach%'\
                      OR body LIKE '%urope%'\
                      OR body LIKE '%trip%'\
                      OR body LIKE '%tired%'\
                      OR body LIKE '%work%'\
                      OR body LIKE '%fatigue%'\
                      OR body LIKE '%overwork%'\
                      OR body LIKE '%party%'\
                      OR body LIKE '%fun%'\
                      OR body LIKE '%weekend%'\
                      OR body LIKE '%ecember%'\
                      OR body LIKE '%ummer%'\
                      OR body LIKE '%ingapore%'\
                      OR body LIKE '%alaysia%'\
                      OR body LIKE '%hailand%'\
                      OR body LIKE '%affari%'\
                      OR body LIKE '%kids%'\
                      OR body LIKE '%lions%'\
                      OR body LIKE '%event%'\
                      OR body LIKE '%bored%'\
                      OR body LIKE '%happy%'\
                      OR body LIKE '%excited%'\
                      OR body LIKE '%sad%'\
                      OR body LIKE '%breakup%'\
                      OR body LIKE '%wedding%'\
                      OR body LIKE '%visit%'\
                      OR body LIKE '%no time%'\
                      OR body LIKE '%car%'\
                      OR body LIKE '%road%'\
                      OR body LIKE '%bonus%'\
                      OR body LIKE '%tan%'\
                      OR body LIKE '%road-trip%'\
                      OR body LIKE '%girl friend%'\
                      OR body LIKE '%bus%'\
                      OR body LIKE '%train%'\
                      OR body LIKE '%motel%'\
                      OR body LIKE '%mother%'\
                      OR body LIKE '%father%'\
                      OR body LIKE '%parents%'\
                      OR body LIKE '%thanks giving%'\
                      OR body LIKE '%long week%'")
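
For context, here is a sketch of how the updates view could be registered from the Kafka source before running the query above (broker and topic names are placeholders; assumes PySpark with the spark-sql-kafka connector):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructType

spark = SparkSession.builder.appName("rats").getOrCreate()

# Schema restricted to the keys used in this project.
schema = (StructType()
          .add("post", StringType()).add("subreddit", StringType())
          .add("body", StringType()).add("timestamp", StringType())
          .add("author", StringType()))

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka-node-1:9092")  # hypothetical broker
          .option("subscribe", "comments")                         # hypothetical topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("c"))
          .select("c.*"))
stream.createOrReplaceTempView("updates")

The filtered lines DataFrame can then be written back to a second Kafka topic with writeStream (see the continuous-trigger sketch at the end of this README).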

Engineering challenges

Increasing producer throughput

Each producer sends data to Kafka at a rate of 700 messages/s (acks=1, as I don't want to lose any messages). To increase throughput (messages/s), I increased the number of producers writing in parallel to multiple Kafka partitions; a sketch follows the plot below.

(Plot: producer throughput with increasing number of producers)
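
One way to run producers in parallel is to fan out worker processes, each running the single-producer loop sketched earlier (run_producer and the worker count here are hypothetical); with no message key, kafka-python spreads messages across the topic's partitions:

from multiprocessing import Process

def run_producer(worker_id):
    # Wrap the single-producer loop shown earlier,
    # e.g. each worker reads a different S3 object.
    ...

if __name__ == "__main__":
    workers = [Process(target=run_producer, args=(i,)) for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()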

Increasing consumer throughput

Each consumer can write to PostgreSQL at a rate of ~1000 messages/s. To increase the write speed, I used multiple Kafka partitions and multiple consumers in a consumer group; a sketch follows the plot below.

(Plot: consumer write throughput with increasing consumer-group size)
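
A sketch of one member of the consumer group, assuming kafka-python and psycopg2; sharing the same group_id is what lets Kafka balance partitions across consumers (connection details, topic, and table are placeholders):

import json
import psycopg2
from kafka import KafkaConsumer

consumer = KafkaConsumer('ad_targets',                  # hypothetical output topic
                         bootstrap_servers='kafka-node-1:9092',
                         group_id='postgres-writers')   # same group_id on every consumer

conn = psycopg2.connect(host='postgres-host', dbname='rats', user='rats')  # placeholders
cur = conn.cursor()
for msg in consumer:
    rec = json.loads(msg.value)
    cur.execute("INSERT INTO targets (post, score) VALUES (%s, %s)",
                (rec["post"], rec["score"]))
    conn.commit()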

Working with Kafka:

  1. Set failOnDataLoss to false in the Spark Structured Streaming Kafka source, in case the streaming application shuts down due to lost data in Kafka or missing offsets (e.g., a low retention period, or topics not replicated across brokers). See the snippet after this list.

  2. Set the retention period to a low value, because of the size of the data source.
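
For example, on the Kafka source sketched earlier (failOnDataLoss is the actual Spark option name; the rest are the same placeholders):

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka-node-1:9092")
          .option("subscribe", "comments")
          .option("failOnDataLoss", "false")  # tolerate missing offsets instead of failing
          .load())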

Working with Spark Structured Streaming:

  1. To reduce latency, use a continuous trigger (1 ms); note that this mode does not support aggregations. A sketch follows below.
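
A minimal sketch of the continuous trigger on the output sink (the sink topic and checkpoint path are placeholders):

query = (lines.selectExpr("CAST(body AS STRING) AS value")
         .writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "kafka-node-1:9092")
         .option("topic", "ad_targets")
         .option("checkpointLocation", "/tmp/rats-checkpoint")  # hypothetical path
         .trigger(continuous="1 second")  # checkpoint interval; records flow continuously
         .start())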
