makingmywaydowntown's Introduction

Introduction

This project experiments with Hadoop to run various data analyses on taxi trips in San Francisco, using GPS tracking data.

Build and Install

This build and installation procedure targets Linux; installing Hadoop on Windows is a non-trivial exercise. Follow these steps:

  • Ensure you have Hadoop installed, and set the required environment variables. The values below are the ones used on the cluster; adjust the paths to match your own installation:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export SPARK_INSTALL=/cw/bdap/software/spark-2.4.0-bin-hadoop2.7
export HADOOP_INSTALL=/cw/bdap/software/hadoop-3.1.2
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
export HADOOP_CONF_DIR=/localhost/NoCsBack/bdap/clustera
  • Change your working directory to the src folder.
  • In a terminal, run the command mvn clean package
  • You are done. You should now see an Exercise2.jar file in the root directory of this repository.
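
Before building, it can help to confirm that the variables and tools from the steps above are actually visible to your shell. The following is a minimal sketch and not part of the repository: the function name check_environment and the choice of which variables and tools to test are assumptions made for illustration.

```python
import os
import shutil

# Variables named in the export lines of the build steps.
REQUIRED_VARS = ["JAVA_HOME", "SPARK_INSTALL", "HADOOP_INSTALL", "HADOOP_CONF_DIR"]
# Command-line tools that the build and run steps call.
REQUIRED_TOOLS = ["hadoop", "mvn"]


def check_environment(env=None, which=shutil.which):
    """Return a list of problems found; an empty list means nothing is missing."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        # A variable that is unset or empty is reported.
        if not env.get(name):
            problems.append(f"environment variable {name} is not set")
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the tool is not on PATH.
        if which(tool) is None:
            problems.append(f"{tool} was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Running the script prints one warning per missing item and nothing when everything is in place. It only checks that the variables are set, not that the paths they contain are valid.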

Running Examples

Trip Length Distribution

To run the code associated with Exercise 1, specifically the Trip Length Distribution, please run:

python3 ./Exercise1/exercise1.py

The environment variables in that file default to the values for the virtual machines on the cluster, and the optional parameters are specified there as well. Edit the file to adjust them for any local machine you are using. This Spark part was written in Python with PySpark, since the assignment placed no restriction on the language.
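
To give a feel for what a trip length distribution is, here is a small plain-Python sketch. It is not the code in exercise1.py and does not use Spark. It assumes each trip is given as a (start_lat, start_lon, end_lat, end_lon) tuple in degrees, measures straight-line (great-circle) distance with the haversine formula, and counts trips in fixed-width bins. The tuple layout, the bin width, and the function names are all assumptions for illustration.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def length_distribution(trips, bin_km=1.0):
    """Count trips per distance bin; each key is the lower edge of a bin in km."""
    counts = Counter()
    for lat1, lon1, lat2, lon2 in trips:
        distance = haversine_km(lat1, lon1, lat2, lon2)
        counts[math.floor(distance / bin_km) * bin_km] += 1
    return dict(sorted(counts.items()))
```

In a Spark job the same two steps (compute a distance per trip, then count per bin) would typically be expressed as a map followed by a count per key, but the details depend on the input format.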

Trip Revenue Distribution

Run the command below to calculate the total revenue and to write the output files for the trip revenue distribution. An example invocation follows.

hadoop jar Exercise2.jar Assignment3.AirportRideRevenueMain <PATH_TO_INPUT_FILE> <PATH_TO_OUTPUT_FILE> <NO_OF_REDUCERS_STAGE_1> <NO_OF_REDUCERS_STAGE_2> <CONSIDER_OVERLAPPING_SEGMENTS?:true|false> <RECONSTRUCT_AIRPORT_TRIPS_ONLY?:true|false>

Example usage:

Create a data folder in the root of the repository and extract the file all.segments into it. You can then run:

hadoop jar Exercise2.jar Assignment3.AirportRideRevenueMain /data/all.segments /user/r0781168/output 9 1 true true
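
If you launch the job from a script, the six positional arguments are easy to mix up. The sketch below assembles the argument list in the order given by the usage line above. The helper name build_command and its parameter names are hypothetical; only the jar name, the main class, and the argument order come from this README.

```python
import shlex

MAIN_CLASS = "Assignment3.AirportRideRevenueMain"


def build_command(input_path, output_path, reducers_stage_1, reducers_stage_2,
                  consider_overlapping_segments, airport_trips_only,
                  jar="Exercise2.jar"):
    """Return the `hadoop jar` argument list in the order the usage line gives."""
    return [
        "hadoop", "jar", jar, MAIN_CLASS,
        input_path, output_path,
        str(int(reducers_stage_1)), str(int(reducers_stage_2)),
        # The usage line expects the literal strings true or false.
        str(bool(consider_overlapping_segments)).lower(),
        str(bool(airport_trips_only)).lower(),
    ]


if __name__ == "__main__":
    cmd = build_command("/data/all.segments", "/user/r0781168/output", 9, 1, True, True)
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) to start the job. Printing it first, as above, lets you compare it with the example command before running anything.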

makingmywaydowntown's People

Contributors

brutishguy