
sf-crime-statistics-spark-streaming

step 1

Install the required packages.

$ pip install -r requirements.txt

Start up the Kafka server with Docker Swarm.

$ docker stack deploy -c=kafka-docker.yml udacity-kafka

Execute kafka_server.py to produce records to the topic.

$ python kafka_server.py
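
The repo snapshot doesn't show the producer's contents, so here is a minimal hypothetical sketch of what kafka_server.py might look like, assuming kafka-python as the client library; the broker address, topic name (sf.crime.calls), and input file name are illustrative assumptions, not taken from the repo.

```python
# kafka_server.py -- hypothetical sketch, assuming kafka-python is installed
import json
import time

from kafka import KafkaProducer


def run_producer():
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        # Serialize each dict record to JSON bytes before sending.
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    # Input file and topic name are assumptions for illustration.
    with open("police-department-calls-for-service.json") as f:
        records = json.load(f)
    for record in records:
        producer.send("sf.crime.calls", value=record)
        time.sleep(0.01)  # throttle so the stream is easy to observe
    producer.flush()


if __name__ == "__main__":
    run_producer()
```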

Execute consumer_server.py to see the consumed records printed to the console.

$ python consumer_server.py
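
A corresponding sketch of consumer_server.py, again assuming kafka-python and the same illustrative topic name:

```python
# consumer_server.py -- hypothetical sketch, assuming kafka-python
from kafka import KafkaConsumer


def run_consumer():
    consumer = KafkaConsumer(
        "sf.crime.calls",              # topic name assumed above
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",  # read the topic from the beginning
    )
    for message in consumer:
        # Each message value is the raw JSON bytes sent by kafka_server.py.
        print(message.value.decode("utf-8"))


if __name__ == "__main__":
    run_consumer()
```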

The picture below is a screenshot of the Kafka consumer console output. kafka-server-console-output.png

step 2

Execute data_stream.py to see streaming results.

$ python data_stream.py

A screenshot of the streaming output: agg_batch_result.png

A screenshot of the progress reporter: progress_reporter.png
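
For reference, a minimal sketch of the kind of Spark Structured Streaming job data_stream.py runs; the topic name and the schema fields (original_crime_type_name, disposition) are assumptions for illustration, not confirmed from the repo.

```python
# data_stream.py -- hypothetical sketch of the streaming aggregation
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("sf-crime-statistics").getOrCreate()

# Schema fields are assumptions for illustration.
schema = StructType([
    StructField("original_crime_type_name", StringType(), True),
    StructField("disposition", StringType(), True),
])

# Read the Kafka topic as a streaming source.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "sf.crime.calls")
      .option("startingOffsets", "earliest")
      .load())

# Parse the JSON payload and count calls per crime type.
calls = (df.selectExpr("CAST(value AS STRING) AS value")
         .select(from_json(col("value"), schema).alias("data"))
         .select("data.*"))
agg = calls.groupBy("original_crime_type_name").count()

query = (agg.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```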

Because the new Spark Structured Streaming API does not support streaming monitoring (https://knowledge.udacity.com/questions/158733), the Spark web UI does not show a Streaming tab. spark-web-ui.png

step 3

  1. How did changing values on the SparkSession property parameters affect the throughput and latency of the data?

    We can adjust the maxOffsetsPerTrigger parameter to specify the maximum number of offsets processed per trigger interval. By changing the value of maxOffsetsPerTrigger, we can see that the processedRowsPerSecond attribute in the progress report changes as well (see the sketch after this list).

  2. What were the 2-3 most efficient SparkSession property key/value pairs? Through testing multiple variations on values, how can you tell these were the most optimal?

    In my maxOffsetsPerTrigger tuning experiment, I found that setting maxOffsetsPerTrigger to 300,000 gave the best result: processedRowsPerSecond rose to 2709, compared with only 4 at the initial setting of maxOffsetsPerTrigger = 200. I also found that the optimal setting depends heavily on the host machine's performance and the incoming data size. If maxOffsetsPerTrigger is too small relative to the incoming data, more batches must be triggered to catch up with the newest data; if it is set to a larger value, fewer batches are needed to catch up, but each trigger takes longer to process its batch. The sweet spot ultimately depends on the performance of the Spark cluster's machines (the sketch after this list shows how these values can be read).

    optimal-value.png
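
To make the tuning concrete, here is a hedged sketch of how maxOffsetsPerTrigger can be set on the Kafka source and how processedRowsPerSecond can be read from the query's progress report. The 300,000 value mirrors the experiment above; the broker address and topic name are the same illustrative assumptions as before.

```python
# Hypothetical tuning sketch: cap offsets per trigger and watch throughput.
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tuning-experiment").getOrCreate()

# maxOffsetsPerTrigger bounds how many Kafka offsets one micro-batch reads.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "sf.crime.calls")    # topic name assumed
      .option("maxOffsetsPerTrigger", 300000)   # best value found above
      .load())

query = (df.selectExpr("CAST(value AS STRING)")
         .writeStream
         .outputMode("append")
         .format("console")
         .start())

# Poll the progress report to compare throughput across settings.
while query.isActive:
    time.sleep(10)
    progress = query.lastProgress  # dict form of the JSON progress report
    if progress is not None:
        print("processedRowsPerSecond:", progress.get("processedRowsPerSecond"))
```

Rerunning this with different maxOffsetsPerTrigger values and comparing the printed throughput is one way to reproduce the comparison described above.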
