mapreduce-hadoop's Introduction

MapReduce on DBLP Dataset


Prerequisites

  • Install VMware for your OS.
  • Install the Cloudera VM
  • Java 1.8 must be installed on the system
  • Update Java to 1.8 on Cloudera (follow these steps: URL1, URL2)
  • Install Gephi

Setup using SBT

  1. Checkout this project or download the entire project directory and extract it.

  2. Open up IntelliJ. Navigate to File -> Open and select the directory of the project.

  3. Open the terminal tab in IntelliJ and type the following commands:

    Compile and run the unit tests: sbt clean compile test

    Compile and build the application jar: sbt clean compile assembly

Running

1. XML Parser for DBLP dataset

  • Open the workspace in IntelliJ

  • Open the file DBLPParser.java

  • Go to Edit configuration and do the following changes:

    a. Main class should be com.uic.mapreduce.xml.DBLPParser

    b. In VM Options set: -Xmx6G (increases the memory allocated to the JVM for this task)

    c. Provide absolute paths to the DBLP dataset and UIC_authors.txt (included in the project) as program arguments, e.g.: F:\UIC\441\mapreduce\dblp\dblp.xml .\src\main\resources\UIC_authors.txt

  • Run the file.

It will generate a file in the logs folder with one line per publication (article, inproceedings, proceedings, book, incollection, phdthesis), each line listing that publication's UIC authors separated by commas.

Sample Output:

Robert H. Sloan,Ugo A. Buy
Bhaskar DasGupta
Andrew E. Johnson
Luc Renambot,Andrew E. Johnson
Ajay D. Kshemkalyani
Luc Renambot,Andrew E. Johnson
Luc Renambot,Andrew E. Johnson
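
The filtering step that produces lines like the ones above can be sketched in plain Java. This is only an illustration under assumptions, not the code in DBLPParser.java: the class name UicAuthorFilter, the method toOutputLine, and the author "Jane Doe" are hypothetical, and the sketch assumes the author names of one publication have already been read from the XML.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; the real parser may be organized differently.
final class UicAuthorFilter {

    // Keeps only the authors of one publication that appear in the UIC author set,
    // preserving their original order, and joins them with commas.
    // Returns an empty string when the publication has no UIC author.
    static String toOutputLine(List<String> publicationAuthors, Set<String> uicAuthors) {
        List<String> kept = new ArrayList<>();
        for (String author : publicationAuthors) {
            String name = author.trim();
            if (uicAuthors.contains(name)) {
                kept.add(name);
            }
        }
        return String.join(",", kept);
    }

    public static void main(String[] args) {
        // In the project, this set would come from UIC_authors.txt.
        Set<String> uic = new HashSet<>(Arrays.asList("Robert H. Sloan", "Ugo A. Buy"));
        // "Jane Doe" is a made-up non-UIC co-author used only for this demo.
        List<String> authors = Arrays.asList("Robert H. Sloan", "Jane Doe", "Ugo A. Buy");
        System.out.println(toOutputLine(authors, uic)); // prints: Robert H. Sloan,Ugo A. Buy
    }
}
```

A publication with no UIC author yields an empty string here; whether the real parser skips such publications or writes a blank line is not stated in this README.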

2. Assemble Jar

Run sbt clean compile assembly, which creates a jar named author-map-dblp.jar under \target\scala-2.12

3. SFTP to Cloudera VM

  • Start the Cloudera VM instance in VMware
  • Once the VM is up and running, get its IP address from the VMware network settings.
  • Using WinSCP or another tool or command (based on your OS), transfer the jar and the log file to a location on the VM.
  • Use the default username and password provided by Cloudera.

4. Run Hadoop MapReduce

  • Navigate to the directory on the VM where the above files are stored.
  • Create the input directory on Hadoop: hadoop fs -mkdir input_dir
  • Transfer the log file to the input directory: hadoop fs -put <logfilename> input_dir
  • Run the jar from the directory where it is located: hadoop jar author-map-dblp.jar AuthorMapping input_dir output_dir

5. Extract Output to System

  • Once the job has completed, copy the output from Hadoop to the local VM directory: hadoop fs -get output_dir/part-r-00000 ./

Sample Output:

A. Prasad Sistla,A. Prasad Sistla,	102
A. Prasad Sistla,Bing Liu 0001,	1
A. Prasad Sistla,Isabel F. Cruz,	2
A. Prasad Sistla,Lenore D. Zuck,	6
A. Prasad Sistla,Robert H. Sloan,	1
A. Prasad Sistla,V. N. Venkatakrishnan,	8
Ajay D. Kshemkalyani,Ajay D. Kshemkalyani,	112
Ajay D. Kshemkalyani,Ugo Buy,	1
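
The sample output suggests that the job counts, for each author, how many publications they appear in (the rows where both names are the same) and, for each pair of co-authors, how many publications they share. The sketch below imitates that map and reduce logic in plain Java without any Hadoop dependency. It is an assumption about how AuthorMapping behaves, inferred from the output format only; the class CoAuthorCount and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory imitation of the assumed map and reduce steps; not the project's Hadoop code.
final class CoAuthorCount {

    // "Map" step: for one comma-separated line of authors, emit one key for every
    // author paired with themselves and one key for every unordered pair of co-authors.
    // Names are sorted so that (A,B) and (B,A) always produce the same key.
    static List<String> mapLine(String line) {
        TreeSet<String> sorted = new TreeSet<>();
        for (String name : line.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                sorted.add(trimmed);
            }
        }
        List<String> authors = new ArrayList<>(sorted);
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < authors.size(); i++) {
            for (int j = i; j < authors.size(); j++) { // j == i yields the self pair
                keys.add(authors.get(i) + "," + authors.get(j) + ",");
            }
        }
        return keys;
    }

    // "Reduce" step: sum how many times each key was emitted across all lines.
    static Map<String, Integer> countPairs(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String key : mapLine(line)) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "Robert H. Sloan,Ugo A. Buy",
            "Luc Renambot,Andrew E. Johnson",
            "Luc Renambot,Andrew E. Johnson");
        // Mirrors the "key<TAB>count" layout of the sample output above.
        for (Map.Entry<String, Integer> entry : countPairs(lines).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Sorting the names before pairing is a design choice of this sketch: it keeps each co-author pair under a single key, so the counts for a pair are never split across two rows.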

6. SFTP from Cloudera VM

  • Transfer the output file (part-r-00000) from the VM to a folder on the host system.

7. Run Gephi for Graph

  • Open Gephi and the workspace provided in the logs folder of this project.
  • Import the CSV file (rename the file moved from the Cloudera VM so that it has the .csv extension).
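
As an optional alternative to renaming the file, the reducer output can be rewritten into an edge list with Source, Target and Weight columns, which Gephi's spreadsheet importer recognizes. The sketch below assumes every line has the shape shown in the sample output (two names, a trailing comma, a tab, then the count) and that author names contain no commas. The class GephiEdgeCsv is a hypothetical helper and is not part of this project.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical converter from reducer output lines to a Gephi edge-list CSV.
final class GephiEdgeCsv {

    // Turns lines like "A,B,<TAB>102" into "A,B,102" rows under a header row.
    // Lines that do not match the assumed shape are skipped.
    static List<String> toCsv(List<String> reducerLines) {
        List<String> csv = new ArrayList<>();
        csv.add("Source,Target,Weight");
        for (String line : reducerLines) {
            String[] keyAndCount = line.split("\t");
            if (keyAndCount.length != 2) {
                continue; // blank or malformed line
            }
            String key = keyAndCount[0].trim();
            if (key.endsWith(",")) {
                key = key.substring(0, key.length() - 1); // drop the trailing comma
            }
            String[] names = key.split(",");
            if (names.length != 2) {
                continue; // assumption violated: not exactly two names
            }
            csv.add(names[0].trim() + "," + names[1].trim() + "," + keyAndCount[1].trim());
        }
        return csv;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: GephiEdgeCsv <part-r-00000> <output.csv>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        Files.write(Paths.get(args[1]), toCsv(lines), StandardCharsets.UTF_8);
    }
}
```

Rows in which Source and Target are the same author are kept as they are, so they would appear in Gephi as self-loops; filter them out first if only co-author edges are wanted.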

Output Graph:

Built With


  • Scala - Scala combines object-oriented and functional programming in one concise, high-level language
  • SBT - sbt is a build tool for Scala & Java
  • Cloudera - Cloudera QuickStart VMs (single-node cluster)
  • Hadoop - framework that allows for the distributed processing of large data sets

Authors


Contributors

0x1docd00d, amrishjhaveri
