
Big Data Analysis with Scala and Spark

Assignment code for the Big Data Analysis with Scala and Spark course (Coursera, EPFL).

Assignments

Final Grade 100%

  • Week 1: Wikipedia

Your overall score for this assignment is 10.00 out of 10.00

  • Week 2-3: StackOverflow

Your overall score for this assignment is 10.00 out of 10.00

  • Week 4: Time usage

Your overall score for this assignment is 10.00 out of 10.00

Details

Week 2-3: StackOverflow

Using the Spark web UI, we can visualize the event timeline and the DAG of each job.

Extracting vectors

  • Stages 1 and 2: load questions and answers.

  • Stage 3: groupedPostings, scoredPostings, vectorPostings (see the sketch after this list)

  • Stage 4: sampleVectors

[screenshot: vector-extraction stages in the Spark web UI]
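For reference, here is a minimal sketch of what these stages compute, assuming the course scaffold's Posting type; the signatures, the langs list, and the langSpread constant below are assumptions, not the graded solution.

import org.apache.spark.rdd.RDD

case class Posting(postingType: Int, id: Int, parentId: Option[Int],
                   score: Int, tags: Option[String])

val langs = List("JavaScript", "Java", "PHP", "Python", "Scala") // assumed subset of the course's language list
val langSpread = 50000 // assumed spacing factor between language indices

// groupedPostings: pair every question with its answers, grouped by question id
def groupedPostings(postings: RDD[Posting]): RDD[(Int, Iterable[(Posting, Posting)])] = {
  val questions = postings.filter(_.postingType == 1).map(q => (q.id, q))
  val answers = postings.filter(_.postingType == 2)
    .flatMap(a => a.parentId.map(pid => (pid, a)))
  questions.join(answers).groupByKey()
}

// scoredPostings: keep, for each question, the highest answer score
def scoredPostings(grouped: RDD[(Int, Iterable[(Posting, Posting)])]): RDD[(Posting, Int)] =
  grouped.values.map(qas => (qas.head._1, qas.map(_._2.score).max))

// vectorPostings: map each scored question to a (languageIndex * langSpread, score) vector
def vectorPostings(scored: RDD[(Posting, Int)]): RDD[(Int, Int)] =
  scored.flatMap { case (q, score) =>
    for {
      tag <- q.tags
      idx = langs.indexOf(tag)
      if idx >= 0
    } yield (idx * langSpread, score)
  }

Stage 4 (sampleVectors) then samples a fixed number of these vectors per language and caches the result for the clustering jobs that follow.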

K-Means clustering

  • Jobs 2 to 46 apply the k-means algorithm to the sampleVectors cached in the previous step.

At each iteration, the updated centroids are collected to the driver, which evaluates convergence and stops the loop once the centroids have settled (a minimal sketch follows the screenshot below).

[screenshot: k-means jobs in the Spark web UI]
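A minimal sketch of that loop; the helper names, the convergence threshold, and the iteration cap are assumptions based on the course scaffold, not the graded solution.

import org.apache.spark.rdd.RDD
import scala.annotation.tailrec

val kmeansEta = 20.0 // assumed convergence threshold (total squared movement of the means)
val kmeansMaxIterations = 120 // assumed safety cap on iterations

def squaredDistance(a: (Int, Int), b: (Int, Int)): Double = {
  val dx = (a._1 - b._1).toDouble
  val dy = (a._2 - b._2).toDouble
  dx * dx + dy * dy
}

def findClosest(p: (Int, Int), means: Array[(Int, Int)]): Int =
  means.indices.minBy(i => squaredDistance(p, means(i)))

def averageVectors(ps: Iterable[(Int, Int)]): (Int, Int) = {
  val n = ps.size
  (ps.map(_._1).sum / n, ps.map(_._2).sum / n)
}

@tailrec
def kmeans(means: Array[(Int, Int)], vectors: RDD[(Int, Int)], iter: Int = 1): Array[(Int, Int)] = {
  // One Spark job per iteration: classify every cached vector by its
  // closest mean, average each cluster on the executors, then collect
  // the new means to the driver.
  val newMeans = means.clone()
  vectors.map(v => (findClosest(v, means), v))
    .groupByKey()
    .mapValues(averageVectors)
    .collect()
    .foreach { case (i, m) => newMeans(i) = m }

  // Convergence test on the driver: stop once the means barely move.
  val movement = means.zip(newMeans).map { case (a, b) => squaredDistance(a, b) }.sum
  if (movement < kmeansEta || iter >= kmeansMaxIterations) newMeans
  else kmeans(newMeans, vectors, iter + 1)
}

Each recursion triggers exactly one job, which matches the run of similar jobs (2 to 46) visible in the web UI.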

Week 4: Time usage

The dataset analyzed originates from the American Time Use Survey (ATUS), 2003-2015, obtained via Kaggle. It measures how people divide their time among various everyday activities.
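For context, a rough sketch of how such a summary frame can be derived with the DataFrame API; the file name, the activity-column subsets, and the output column names below are assumptions (the real assignment aggregates many more t-prefixed activity columns).

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("timeusage").master("local[*]").getOrCreate()
import spark.implicits._

// ATUS summary file: one row per respondent, activity columns in minutes.
val atus = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("atussum.csv") // assumed file name

// Sum a few activity columns into broad buckets, converted to hours.
val primaryNeeds = List("t010101", "t030101").map(c => col(c) / 60) // assumed subset (sleep, child care, ...)
val work = List("t050101", "t050102").map(c => col(c) / 60)         // assumed subset (main job, ...)
val other = List("t120101", "t130101").map(c => col(c) / 60)        // assumed subset (leisure, ...)

val summed = atus.select(
  $"telfs".as("working"), $"tesex".as("sex"), $"teage".as("age"), // ATUS demographic columns
  primaryNeeds.reduce(_ + _).as("primaryNeeds"),
  work.reduce(_ + _).as("work"),
  other.reduce(_ + _).as("other"))

// Average hours per demographic group: this is the shape queried later
// in Zeppelin (ORDER BY work DESC).
val finalDf = summed.groupBy($"working", $"sex", $"age")
  .agg(round(avg($"primaryNeeds"), 1).as("primaryNeeds"),
       round(avg($"work"), 1).as("work"),
       round(avg($"other"), 1).as("other"))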

Displaying data with Zeppelin

We load the resulting dataset into Apache Zeppelin.

Install
  • Download the archive from the Zeppelin website (wget)

  • Extract the tarball (tar xzf)

  • Run: SPARK_LOCAL_IP=127.0.0.1 zeppelin-0.7.1/bin/zeppelin-daemon.sh start

  • Stop: zeppelin-0.7.1/bin/zeppelin-daemon.sh stop

nb: SPARK_LOCAL_IP is set to work around a "port unable to bind" exception in 0.7.1

Prepare data export

Export the resulting week 4 dataset as JSON.

1) From the Spark environment, export data to disk:

finalDf.coalesce(1) // (1)
  .write.json("dataset-week4.json")
  1. Coalesce to a single partition to obtain only 1 output file (otherwise Spark writes 1 file per partition); note that write.json creates a directory of that name containing the part file

2) Upload the file to the host running Zeppelin, or fetch it from a Zeppelin note (%sh paragraph, then wget …)

Zeppelin

Connect to the Zeppelin web UI on http://localhost:8080, and create a new notebook with the following content.

// Scala paragraph: load the exported JSON and register it as a temp table
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val sqlData = sqlContext.read.json("dataset-week4.json") // jsonFile is deprecated since Spark 1.4; read.json replaces it
sqlData.registerTempTable("data")

Then, in a second paragraph, query the table:

%sql SELECT * FROM data ORDER BY work DESC

Display as bar graph:

[screenshot: results displayed as a bar graph in Zeppelin]

nb: the sort order does not seem to be respected; see open issue ZEPPELIN-87
