SDM - Lab 2 @ UPC 👨🏻‍💻




Data drives the world. In this big data era, analysing large volumes of data has become increasingly challenging and complex. Several ecosystems have been developed, each addressing a particular class of problems. One of the main tools in the Big Data ecosystem is Apache Spark.

Apache Spark makes the analysis of big data considerably easier. It ships implementations of many useful algorithms for data mining, data analysis, machine learning and graph processing. Spark takes on the challenge of implementing sophisticated algorithms with tricky optimizations and of running your code on a distributed cluster. It handles problems such as fault tolerance and provides a simple API for parallel computation.

GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge.
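To make the "directed multigraph with properties" idea concrete, here is a minimal plain-Java sketch. It deliberately does not use the GraphX API: the class `PropertyGraphSketch` and its methods are hypothetical names chosen for illustration only. It shows that every vertex and every edge carries its own property, and that two edges may connect the same ordered pair of vertices.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of a directed property multigraph; not the GraphX API. */
final class PropertyGraphSketch<VD, ED> {

    /** A directed edge carrying its own property value. */
    static final class Edge<ED> {
        final long srcId;
        final long dstId;
        final ED attr;

        Edge(long srcId, long dstId, ED attr) {
            this.srcId = srcId;
            this.dstId = dstId;
            this.attr = attr;
        }
    }

    // Vertex id -> vertex property.
    private final Map<Long, VD> vertices = new LinkedHashMap<>();
    // A list rather than a map keyed by endpoints, so parallel edges can coexist.
    private final List<Edge<ED>> edges = new ArrayList<>();

    void addVertex(long id, VD attr) {
        vertices.put(id, attr);
    }

    void addEdge(long srcId, long dstId, ED attr) {
        if (!vertices.containsKey(srcId) || !vertices.containsKey(dstId)) {
            throw new IllegalArgumentException("both endpoints must exist");
        }
        edges.add(new Edge<>(srcId, dstId, attr));
    }

    VD vertexAttr(long id) {
        return vertices.get(id);
    }

    /** All edges from srcId to dstId; more than one result means parallel edges. */
    List<Edge<ED>> edgesBetween(long srcId, long dstId) {
        List<Edge<ED>> result = new ArrayList<>();
        for (Edge<ED> e : edges) {
            if (e.srcId == srcId && e.dstId == dstId) {
                result.add(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PropertyGraphSketch<String, String> g = new PropertyGraphSketch<>();
        g.addVertex(1L, "alice");
        g.addVertex(2L, "bob");
        g.addEdge(1L, 2L, "follows");
        g.addEdge(1L, 2L, "messages"); // parallel edge with a different property
        System.out.println(g.edgesBetween(1L, 2L).size());
        System.out.println(g.edgesBetween(2L, 1L).size());
    }
}
```

Because edges are directed, looking up edges from vertex 2 to vertex 1 in this example finds nothing, even though two edges run the other way.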

This repository serves as a starting point for working with the Spark GraphX API. As part of our SDM lab, we focus on getting a basic idea of how to work with Pregel and on gaining hands-on experience with distributed processing of large graphs.

Pregel, originally developed by Google, is essentially a message-passing interface that facilitates the processing of large-scale graphs. Apache Spark's GraphX module provides a Pregel API that allows us to write distributed graph programs and algorithms. For more details, see the original paper.
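The message-passing loop can be sketched in plain Java without any Spark dependency. This is an illustration under stated assumptions, not the GraphX implementation: the names `PregelSketch`, `vertexProgram`, `mergeMessages` and `run` are hypothetical, the graph lives in a single JVM, and the initial message that GraphX's Pregel delivers to every vertex is omitted. The example propagates the maximum vertex value along directed edges until no vertex has anything new to send.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of a Pregel-style superstep loop; no Spark dependency. */
final class PregelSketch {

    /** A directed edge from src to dst. */
    static final class Edge {
        final long src;
        final long dst;

        Edge(long src, long dst) {
            this.src = src;
            this.dst = dst;
        }
    }

    /** Vertex program: combine the current value with the merged incoming message. */
    static int vertexProgram(int current, int message) {
        return Math.max(current, message);
    }

    /** Merge function: reduce two messages bound for the same vertex into one. */
    static int mergeMessages(int a, int b) {
        return Math.max(a, b);
    }

    /** Runs supersteps until no messages are sent or maxIterations is reached. */
    static Map<Long, Integer> run(Map<Long, Integer> initial, List<Edge> edges, int maxIterations) {
        Map<Long, Integer> values = new HashMap<>(initial);
        for (int i = 0; i < maxIterations; i++) {
            // Send phase: a source sends its value only if it would raise the destination's.
            Map<Long, Integer> inbox = new HashMap<>();
            for (Edge e : edges) {
                int srcValue = values.get(e.src);
                if (srcValue > values.get(e.dst)) {
                    inbox.merge(e.dst, srcValue, PregelSketch::mergeMessages);
                }
            }
            if (inbox.isEmpty()) {
                break; // no messages in flight: the computation has converged
            }
            // Receive phase: each vertex with mail applies the vertex program.
            for (Map.Entry<Long, Integer> m : inbox.entrySet()) {
                values.put(m.getKey(), vertexProgram(values.get(m.getKey()), m.getValue()));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Map<Long, Integer> initial = new HashMap<>();
        initial.put(1L, 9);
        initial.put(2L, 1);
        initial.put(3L, 6);
        initial.put(4L, 8);
        List<Edge> edges = Arrays.asList(
                new Edge(1, 2), new Edge(2, 3), new Edge(2, 4), new Edge(3, 4), new Edge(3, 1));
        System.out.println(run(initial, edges, 10));
    }
}
```

In this example every vertex is reachable from vertex 1, which holds the largest value, so all vertices end with the value 9. In a distributed setting the send and receive phases run in parallel across partitions, which is why the merge function should be commutative and associative.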


Before starting, you may need to set up your machine. Please follow the guides below to set up Spark and Maven.

We have created a setup script that sets up brew, apache-spark, maven and a conda environment. If you are on a Mac, you can run the following commands:

git clone https://github.com/mohammadzainabbas/SDM-Lab-2.git
cd SDM-Lab-2 && sh scripts/setup.sh

If you are on Linux, you need to install Apache Spark yourself. You can follow this helpful guide to install Apache Spark. You can install Maven via this guide.

We also recommend installing conda on your machine. You can set up conda from here.

Once you have conda, create a new environment via:

conda create -n spark_env python=3.8

Note: We are using Python 3.8 because Spark doesn't support Python 3.9 and above (at the time of writing).

Activate your environment:

conda activate spark_env

Now, you need to install pyspark:

pip install pyspark

If you are using bash:

echo "export PYSPARK_DRIVER_PYTHON=$(which python)" >> ~/.bashrc
echo "export PYSPARK_DRIVER_PYTHON_OPTS=''" >> ~/.bashrc
. ~/.bashrc

And if you are using zsh:

echo "export PYSPARK_DRIVER_PYTHON=$(which python)" >> ~/.zshrc
echo "export PYSPARK_DRIVER_PYTHON_OPTS=''" >> ~/.zshrc
. ~/.zshrc

Since this is a typical Maven project, you can run it however you would normally run a Maven project. For convenience, we provide two ways to run it.

If you are using VS Code, change the args in the Launch Main configuration in the launch.json file located in the .vscode directory.

See the main class for the supported arguments.

Alternatively, run the helper script from a terminal with one of the supported arguments:

sh scripts/build_n_run.sh exercise1

Note: exercise1 is the argument that runs the first exercise.

Again, you can check the main class for the supported arguments.

Contributors

kaiamj, mohammadzainabbas

sdm-lab-2's Issues

Error in Exercise_3

In the sendMsg function, when I try to return a Tuple2 Iterator, I get the following errors:

  1. If I don't cast the return type to (Iterator<Tuple2<Object,Vertex>>), I get a "Type mismatch: cannot convert from" error, so I typecast it using the given form.
  2. Once I add the typecast, I only get a warning, but when I run the program I get the following error:
    22/05/12 10:41:55 ERROR Executor: Exception in task 2.0 in stage 4.0 (TID 26)
    java.lang.ClassCastException: class scala.collection.convert.Wrappers$JIteratorWrapper cannot be cast to class java.util.Iterator (scala.collection.convert.Wrappers$JIteratorWrapper is in unnamed module of loader 'app'; java.util.Iterator is in module java.base of loader 'bootstrap')

Could you please look into this issue and let me know?
