Big data analytics on Apache Spark

Introduction

Big data analytics is an area of growing interest for both business and academia. There is a lot to learn about the characteristics of big data and about how Apache Spark simplifies the analysis of huge amounts of data through a simple programming paradigm. In this section, we look at understanding and implementing a simple problem using MapReduce in PySpark. Real-world problems are much more complicated, but you should be able to scale up the takeaways from the simple word count example to much bigger problems. This lesson aims to give you a wider understanding of MapReduce and big data computation in the Apache Spark environment.
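
As a reminder of the word count example mentioned above, the MapReduce idea can be sketched in plain Python without a Spark installation. This is only an illustration of the map, shuffle, and reduce phases, not the lesson's own code; the function names and the sample sentences are made up for this sketch.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine each key's values into a single count."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in grouped.items()
    }


def word_count(lines):
    """Run the three phases in sequence on an iterable of text lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical sample input, invented for illustration only.
sample_lines = [
    "spark makes big data simple",
    "big data needs big tools",
]
counts = word_count(sample_lines)
print(counts)
```

In PySpark, the same pipeline would typically be written as a chain of RDD operations, roughly `sc.textFile(path).flatMap(lambda line: line.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`, where `sc` (a SparkContext) and `path` are assumed names. Spark then distributes the map and reduce work across the cluster instead of running it in a single process.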

Objectives:

You will be able to:

Understand the role of Apache Spark in big data analytics
Understand how the Apache Spark stack abstracts data and computation
Describe RDDs (Resilient Distributed Datasets) as the fundamental units of computation in the Apache Spark environment
Gain insight into Spark's machine learning library, graph analysis library, and streaming features

In this lesson, you are required to read the following review article:

Big data analytics on Apache Spark

International Journal of Data Science and Analytics
November 2016, Volume 1, Issue 3–4, pp 145–164
Salman Salloum, Ruslan Dautov, Xiaojun Chen, Patrick Xiaogang Peng, Joshua Zhexue Huang

The article is available at https://link.springer.com/article/10.1007/s41060-016-0027-9

"In this paper, we present a technical review on big data analytics using Apache Spark. This review focuses on the key components, abstractions and features of Apache Spark. More specifically, it shows what Apache Spark has for designing and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing. In addition, we highlight some research and development directions on Apache Spark for big data analytics." - from the abstract.

Expect to spend around 90 to 120 minutes reading this article. It is an excellent review that clearly summarizes all the key aspects of the Spark computational environment.
