H1B-Visa-Analysis

Google Cloud DataProc analysis of H1B visa data for ECE 795 Advanced Big Data Analytics.

Pre-Installation

  1. Make sure you have a Google Cloud account with billing enabled
  2. Create a project
  3. Create a DataProc cluster
  4. Upload the CSV from here to Google Cloud Storage
  5. Download and install the Google Cloud SDK (version 336.0.0)

Installation

Locally:

> py -3.8 --version
Python 3.8.5

On DataProc:

$ python --version
Python 3.8.8
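
Since the local and DataProc interpreters above are both Python 3.8, a quick check like the following can confirm that a local environment matches before running anything. This is a minimal sketch, not part of the project: the function name `matches_expected` and the choice to compare only major.minor are assumptions.

```python
import sys

# Assumption: matching major.minor (3.8) is enough; patch versions differ
# between the local machine (3.8.5) and DataProc (3.8.8) in the output above.
EXPECTED = (3, 8)


def matches_expected(version_info=sys.version_info, expected=EXPECTED) -> bool:
    """Return True when the interpreter's (major, minor) equals `expected`."""
    return tuple(version_info[:2]) == tuple(expected)


if __name__ == "__main__":
    if not matches_expected():
        found = f"{sys.version_info[0]}.{sys.version_info[1]}"
        print(f"Warning: running Python {found}, expected 3.8")
```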

For Local Development (Using VSCode)

  1. Optional - Create a Python virtual environment
    1. py -3.8 -m venv venv
      OR (check first that python3 points to Python 3.8)
      python3 -m venv venv
    2. Windows: ./venv/Scripts/activate
      Linux: source ./venv/bin/activate
    3. To leave the environment (when done running code): deactivate
  2. Install dependencies
    1. pip install -r requirements.txt

Running

  1. SSH into the cluster (from Command Prompt; see the reference)
    1. set "PROJECT=<PROJECT_ID>" && set "HOSTNAME=<MASTER_CLUSTER_NAME>" && set "ZONE=<CLUSTER_ZONE>" && set "PORT=<PORT_VALUE>"
      Replace each placeholder between < > (angle brackets included) with your own value; see the reference if anything is unclear. The quotes keep the space before && out of the stored value.
    2. gcloud compute ssh %HOSTNAME% --project=%PROJECT% --zone=%ZONE% -- -D %PORT%
      > gcloud --version
      Google Cloud SDK 336.0.0
      bq 2.0.66
      core 2021.04.09
      gsutil 4.61
      
  2. Create a file for the script on the cluster
    1. nano project.py
    2. Paste a copy of main.py by right-clicking
    3. Press Ctrl+X, then Y and Enter, to save and exit
  3. Review the available flags with python project.py --help
    $ python project.py --help
    usage: project.py [-h] [-f] [-q] [--hdfs HDFS] [--dataset DATASET] [-s SOURCE] [--table TABLE] [--no-basic] [--no-additional] [--no-task] [--no-timing]
    
    H1B Visa Petition Analysis
    
    optional arguments:
    -h, --help            show this help message and exit
    -f, --force           always perform data transfers
    -q, --quiet           do not print notifications
    --hdfs HDFS           specify a HDFS directory to store data
    --dataset DATASET     specify a Google Cloud dataset name
    -s SOURCE, --source SOURCE
                            specify the path to a source data CSV file in Google Cloud Storage
    --table TABLE         specify a Google Cloud dataset table name
    --no-basic            do not execute basic queries
    --no-additional       do not execute additional queries
    --no-task             do not execute task queries
    --no-timing           do not execute timing queries
  4. Run with the defaults or with flags (e.g. python project.py or
    python project.py --force --no-basic)
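
To make the flag list above easier to reason about, here is a minimal argparse sketch that would produce a similar --help output. It is reconstructed from the help text only and is not the project's actual main.py: the function name `build_parser`, the use of `store_true` for the switches, and the absence of default values are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the CLI described by the --help text (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        prog="project.py", description="H1B Visa Petition Analysis"
    )
    # Boolean switches: assumed to default to False when omitted.
    parser.add_argument("-f", "--force", action="store_true",
                        help="always perform data transfers")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="do not print notifications")
    # Value options: the real defaults are unknown, so none are set here.
    parser.add_argument("--hdfs", help="specify a HDFS directory to store data")
    parser.add_argument("--dataset", help="specify a Google Cloud dataset name")
    parser.add_argument("-s", "--source",
                        help="specify the path to a source data CSV file "
                             "in Google Cloud Storage")
    parser.add_argument("--table",
                        help="specify a Google Cloud dataset table name")
    # One opt-out switch per query group named in the help text.
    for group in ("basic", "additional", "task", "timing"):
        parser.add_argument(f"--no-{group}", action="store_true",
                            help=f"do not execute {group} queries")
    return parser


if __name__ == "__main__":
    # Equivalent to: python project.py --force --no-basic
    args = build_parser().parse_args(["--force", "--no-basic"])
    print(args.force, args.no_basic, args.no_task)
```

With this layout, each `--no-*` switch is stored under an underscore name (for example `args.no_basic`), so the script could skip a query group with a plain `if not args.no_basic:` check.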

Contributors

n-wagner
