Giter Club home page Giter Club logo

spark-streaming-project's Introduction

Spark Streaming Project

Spark processing of streaming data using count-min sketches

Stream Generator

The stream enerator continuously generates a stream S of the following format: <sid, ip>. Each entry of this stream represents a single request of a client with IP address ip, to a server with ID sid. The order of the entries in the stream corresponds to the arriving order of the requests. The total number of servers is 100 (sid is between 0 and 99, inclusive). The number of distinct IP addresses is not known, and varies throughout the stream. A possible excerpt from the stream is seen below

4, 198.3.5.1

6, 139.3.7.15

5, 221.6.27.251

6, 139.3.7.15

5, 198.3.5.1

The stream is generated by a JAR file called “stream.jar” and can be ran by using “java -jar stream.jar 1200 2000 26” from the directory in which the JAR is located. Doing so will make a stream available at hostname “localhost” and port 9000. The values 1200, 2000 and 26 are defining the time in seconds after which the stream will time-out, the rate at which the stream provides output in number of rows per second and a number influencing some characteristics of the stream.

Problem

Given τ=3000 (representing a threshold), w=60 (representing the duration in seconds of a sliding window), 𝜖=0.001 (acceptable error on the similarity function) and 𝛿=0.01 (the acceptable probability that the error exceeds 𝜖).

Let $S$ denote the stream, and $A = {A^0, A^1, . . . A^{99}}$ the vectors of the 100 servers that correspond to the number of requests received by each server in the last 𝑤 seconds of $S$. Given two vectors $X$, $Y$ from $A$, with $X_i$, $Y_i$ denoting the value of the $i$'th component of vector $X$ (resp. $Y$), then $sim(X, Y) = \sum_i{X_i * Y_i}$ with $i \in [0, 2^{32} - 1]$.

The goal is to monitor all arrivals in $S$ and continuously print the number of pairs of servers $X$, $Y$, with $sim(X,Y) \geq τ$. Each pair of servers should be included only once (e.g., only one of (1,2) and (2,1) should be included in the answer). Additionally, no server should be paired with itself.

The computation of the exact value of $sim(X,Y)$ for all pairs of servers requires extensive space (for storing the respective vectors) and computation (for computing the inner product). To address these two limitations, the vectors need to be summarized in sub-linear space (either individually, or all together) using sketches. The approximation should offer additive error guarantees with high probability, i.e., the approximate similarity $sim(X,Y)$ should satisfy $Pr(\hat{sim}(X,Y) \leq sim(X,Y) + \sum_i{X_i}* \sum_i{Y_i}) \geq 1 - 𝛿$.

The Spark code should print at regular intervals (every 2 seconds):

  • the number of processed updates from the start of the program
  • the number of pairs contained in the estimated answer, i.e., the number of pairs with $sim(X,Y) \geq τ$
  • the time, in seconds since the start of the program

Solution

Similarity could be measured using the dot product. As the computation of the exact value of the similarity needs extensive space and computation, Count-Min Sketches could be used to summarize the vectors in sub-linear space. A CM sketch is a matrix with size $d_x w$, where $w=⌈e/ε⌉$, $d = ⌈ln(1/δ)⌉$. CM sketches use hash functions to map events to frequencies.Thus, each server can have a CM sketch, in which incoming IPs will be hashed. The similarity can be calculated using the following formulas:

  • $(sk_A * sk_B){j} = \sum{k=1}^{w} sk_{A}[k, j] * sk_{B}[k, j]$
  • $(sk_A * sk_B)=min_{j}(sk_A * sk_B)_j$

spark-streaming-project's People

Contributors

vasil-vis avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.