
Big Data Analysis with Scala and Spark

Assignment code for the Big Data Analysis with Scala and Spark course (Coursera, EPFL).

Assignments

Final Grade 100%

  • Week 1: Wikipedia

Your overall score for this assignment is 10.00 out of 10.00

  • Week 2-3: StackOverflow

Your overall score for this assignment is 10.00 out of 10.00

  • Week 4: Time usage

Your overall score for this assignment is 10.00 out of 10.00

Details

Week 2-3: StackOverflow

Using the Spark web UI, we can visualize the event timeline and the DAG of each job.

Extracting vectors

  • Stages 1 and 2: load questions and answers.

  • Stage 3: groupedPostings, scoredPostings, vectorPostings (see the sketch after this list)

  • Stage 4: sampleVectors

[screenshot: vector-extraction stages in the Spark web UI]
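For reference, here is a minimal sketch of what these stages compute, assuming the course scaffold's Posting type; the signatures, the langs list, and the langSpread constant below are assumptions, not the graded solution.

import org.apache.spark.rdd.RDD

case class Posting(postingType: Int, id: Int, parentId: Option[Int],
                   score: Int, tags: Option[String])

val langs = List("JavaScript", "Java", "PHP", "Python", "Scala") // assumed subset of the course's language list
val langSpread = 50000 // assumed spacing factor between language indices

// groupedPostings: pair every question with its answers, grouped by question id
def groupedPostings(postings: RDD[Posting]): RDD[(Int, Iterable[(Posting, Posting)])] = {
  val questions = postings.filter(_.postingType == 1).map(q => (q.id, q))
  val answers = postings.filter(_.postingType == 2)
    .flatMap(a => a.parentId.map(pid => (pid, a)))
  questions.join(answers).groupByKey()
}

// scoredPostings: keep, for each question, the highest answer score
def scoredPostings(grouped: RDD[(Int, Iterable[(Posting, Posting)])]): RDD[(Posting, Int)] =
  grouped.values.map(qas => (qas.head._1, qas.map(_._2.score).max))

// vectorPostings: map each scored question to a (languageIndex * langSpread, score) vector
def vectorPostings(scored: RDD[(Posting, Int)]): RDD[(Int, Int)] =
  scored.flatMap { case (q, score) =>
    for {
      tag <- q.tags
      idx = langs.indexOf(tag)
      if idx >= 0
    } yield (idx * langSpread, score)
  }

Stage 4 (sampleVectors) then samples a fixed number of these vectors per language and caches the result for the clustering jobs that follow.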

K-Means clustering

  • Jobs 2 to 46 apply the k-means algorithm to the sampleVectors cached in the previous step.

At each iteration, the updated centroids are collected to the driver, which evaluates convergence and stops the loop once the centroids have settled (a minimal sketch follows the screenshot below).

[screenshot: k-means jobs in the Spark web UI]
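A minimal sketch of that loop; the helper names, the convergence threshold, and the iteration cap are assumptions based on the course scaffold, not the graded solution.

import org.apache.spark.rdd.RDD
import scala.annotation.tailrec

val kmeansEta = 20.0 // assumed convergence threshold (total squared movement of the means)
val kmeansMaxIterations = 120 // assumed safety cap on iterations

def squaredDistance(a: (Int, Int), b: (Int, Int)): Double = {
  val dx = (a._1 - b._1).toDouble
  val dy = (a._2 - b._2).toDouble
  dx * dx + dy * dy
}

def findClosest(p: (Int, Int), means: Array[(Int, Int)]): Int =
  means.indices.minBy(i => squaredDistance(p, means(i)))

def averageVectors(ps: Iterable[(Int, Int)]): (Int, Int) = {
  val n = ps.size
  (ps.map(_._1).sum / n, ps.map(_._2).sum / n)
}

@tailrec
def kmeans(means: Array[(Int, Int)], vectors: RDD[(Int, Int)], iter: Int = 1): Array[(Int, Int)] = {
  // One Spark job per iteration: classify every cached vector by its
  // closest mean, average each cluster on the executors, then collect
  // the new means to the driver.
  val newMeans = means.clone()
  vectors.map(v => (findClosest(v, means), v))
    .groupByKey()
    .mapValues(averageVectors)
    .collect()
    .foreach { case (i, m) => newMeans(i) = m }

  // Convergence test on the driver: stop once the means barely move.
  val movement = means.zip(newMeans).map { case (a, b) => squaredDistance(a, b) }.sum
  if (movement < kmeansEta || iter >= kmeansMaxIterations) newMeans
  else kmeans(newMeans, vectors, iter + 1)
}

Each recursion triggers exactly one job, which matches the run of similar jobs (2 to 46) visible in the web UI.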

Week 4: Time usage

The dataset analyzed originates from the American Time Use Survey (ATUS), 2003-2015, obtained via Kaggle. It measures how people divide their time among various everyday activities.
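For context, a rough sketch of how such a summary frame can be derived with the DataFrame API; the file name, the activity-column subsets, and the output column names below are assumptions (the real assignment aggregates many more t-prefixed activity columns).

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("timeusage").master("local[*]").getOrCreate()
import spark.implicits._

// ATUS summary file: one row per respondent, activity columns in minutes.
val atus = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("atussum.csv") // assumed file name

// Sum a few activity columns into broad buckets, converted to hours.
val primaryNeeds = List("t010101", "t030101").map(c => col(c) / 60) // assumed subset (sleep, child care, ...)
val work = List("t050101", "t050102").map(c => col(c) / 60)         // assumed subset (main job, ...)
val other = List("t120101", "t130101").map(c => col(c) / 60)        // assumed subset (leisure, ...)

val summed = atus.select(
  $"telfs".as("working"), $"tesex".as("sex"), $"teage".as("age"), // ATUS demographic columns
  primaryNeeds.reduce(_ + _).as("primaryNeeds"),
  work.reduce(_ + _).as("work"),
  other.reduce(_ + _).as("other"))

// Average hours per demographic group: this is the shape queried later
// in Zeppelin (ORDER BY work DESC).
val finalDf = summed.groupBy($"working", $"sex", $"age")
  .agg(round(avg($"primaryNeeds"), 1).as("primaryNeeds"),
       round(avg($"work"), 1).as("work"),
       round(avg($"other"), 1).as("other"))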

Displaying data with Zeppelin

We load the resulting dataset into Apache Zeppelin.

Install
  • Download the archive from the Zeppelin website (wget)

  • Extract the tarball (tar xzf)

  • Run: SPARK_LOCAL_IP=127.0.0.1 zeppelin-0.7.1/bin/zeppelin-daemon.sh start

  • Stop: zeppelin-0.7.1/bin/zeppelin-daemon.sh stop

nb: SPARK_LOCAL_IP is set to work around a "port unable to bind" exception in 0.7.1

Prepare data export

Export the resulting week 4 dataset as JSON.

1) From the Spark environment, export data to disk:

finalDf.coalesce(1) // (1)
  .write.json("dataset-week4.json")
  1. Coalesce to a single partition to obtain only 1 output file (otherwise Spark writes 1 file per partition); note that write.json creates a directory of that name containing the part file

2) Upload the file to the host running Zeppelin, or fetch it from a Zeppelin note (%sh paragraph, then wget …)

Zeppelin

Connect to the Zeppelin web UI on http://localhost:8080, and create a new notebook with the following content.

// Scala paragraph: load the exported JSON and register it as a temp table
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val sqlData = sqlContext.read.json("dataset-week4.json") // jsonFile is deprecated since Spark 1.4; read.json replaces it
sqlData.registerTempTable("data")

Then, in a second paragraph, query the table:

%sql SELECT * FROM data ORDER BY work DESC

Display as bar graph:

[screenshot: results displayed as a bar graph in Zeppelin]

nb: the sort order does not seem to be respected; see open issue ZEPPELIN-87
