Google Cloud DataProc analysis of H1B visa data for ECE 795 Advanced Big Data Analytics.
- Make sure you have a Google Cloud account with billing enabled
- Create a project
- Create a DataProc cluster
- Upload the CSV from here to Google Cloud Storage
- Download Google Cloud SDK (version 336.0.0)
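The CSV upload step above can also be scripted. The following is a minimal sketch, assuming the `google-cloud-storage` client library is installed (it is not necessarily listed in requirements.txt) and that Application Default Credentials are configured. The bucket name, object name, and local file name are hypothetical placeholders, and `gcs_uri` and `upload_csv` are illustrative helpers, not part of the project.

```python
def gcs_uri(bucket_name, blob_name):
    """Build the gs:// path that identifies an object in Cloud Storage."""
    return f"gs://{bucket_name}/{blob_name}"


def upload_csv(local_path, bucket_name, blob_name):
    """Upload a local CSV file to a bucket and return its gs:// path."""
    # Imported here so the rest of the module loads without the library.
    # Assumption: installed separately with `pip install google-cloud-storage`.
    from google.cloud import storage

    client = storage.Client()  # uses Application Default Credentials
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path)
    return gcs_uri(bucket_name, blob_name)


# Hypothetical usage (names are placeholders, not from the project):
# upload_csv("h1b.csv", "my-bucket", "data/h1b.csv")
```

The returned `gs://` path is the kind of value the `--source` flag of project.py expects.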
Locally:
> py -3.8 --version
Python 3.8.5
On DataProc:
$ python --version
Python 3.8.8
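To confirm that an interpreter matches the versions listed above, a small check like the following can be run both locally and on the cluster. This is a sketch: treating Python 3.8 as the target series is an assumption drawn from the two versions shown, and `matches_tested` is a hypothetical helper name.

```python
import sys

# Assumption: the project was exercised on the 3.8 series (3.8.5 and 3.8.8 above).
TESTED_SERIES = (3, 8)


def matches_tested(version_info=sys.version_info, tested=TESTED_SERIES):
    """Return True when the major.minor version equals the tested series."""
    return tuple(version_info[:2]) == tested


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if matches_tested():
        print(f"Python {current} matches the tested 3.8 series")
    else:
        print(f"Warning: Python {current} differs from the tested 3.8 series")
```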
- Optional: create a Python virtual environment
py -3.8 -m venv venv
OR (make sure python3 resolves to the right version)
python3 -m venv venv
- Win:
./venv/Scripts/activate
Linux:
source ./venv/bin/activate
- To leave (when done running code):
deactivate
- Install dependencies
pip install -r requirements.txt
- SSH into cluster (in Command Prompt; see the reference)
set PROJECT=<PROJECT_ID>
set HOSTNAME=<MASTER_CLUSTER_NAME>
set ZONE=<CLUSTER_ZONE>
set PORT=<PORT_VALUE>
Replace each value between < > with your own; see the reference if anything is unclear. Run each set on its own line: chaining them with && in Command Prompt stores the space before && as part of the value.
gcloud compute ssh %HOSTNAME% --project=%PROJECT% --zone=%ZONE% -- -D %PORT%
> gcloud --version
Google Cloud SDK 336.0.0
bq 2.0.66
core 2021.04.09
gsutil 4.61
- Create local file
nano project.py
- Paste a copy of main.py by right-clicking, then press Ctrl+X to exit; nano asks whether to save the changes
- Consider flags
python project.py --help
$ python project.py --help
usage: project.py [-h] [-f] [-q] [--hdfs HDFS] [--dataset DATASET] [-s SOURCE]
                  [--table TABLE] [--no-basic] [--no-additional] [--no-task]

H1B Visa Petition Analysis

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           always perform data transfers
  -q, --quiet           do not print notifications
  --hdfs HDFS           specify a HDFS directory to store data
  --dataset DATASET     specify a Google Cloud dataset name
  -s SOURCE, --source SOURCE
                        specify the path to a source data CSV file in Google Cloud Storage
  --table TABLE         specify a Google Cloud dataset table name
  --no-basic            do not execute basic queries
  --no-additional       do not execute additional queries
  --no-task             do not execute task queries
  --no-timing           do not execute timing queries
- Run default or with flags (e.g. python project.py, python project.py --force --no-basic, etc.)
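For readers who want to see how the flags in the help output above fit together, here is a minimal argparse sketch that reproduces them. It is reconstructed from the help text only, so the actual main.py may define its parser differently; the `build_parser` name and the absence of default values are assumptions. It includes `--no-timing`, which appears in the option list of the help output.

```python
import argparse


def build_parser():
    """Recreate the command-line flags shown in the help output (a sketch)."""
    parser = argparse.ArgumentParser(
        prog="project.py", description="H1B Visa Petition Analysis"
    )
    parser.add_argument("-f", "--force", action="store_true",
                        help="always perform data transfers")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="do not print notifications")
    # Assumption: value options default to None; the real defaults are unknown.
    parser.add_argument("--hdfs", help="specify a HDFS directory to store data")
    parser.add_argument("--dataset", help="specify a Google Cloud dataset name")
    parser.add_argument("-s", "--source",
                        help="specify the path to a source data CSV file "
                             "in Google Cloud Storage")
    parser.add_argument("--table",
                        help="specify a Google Cloud dataset table name")
    # Each --no-* switch disables one group of queries.
    for group in ("basic", "additional", "task", "timing"):
        parser.add_argument(f"--no-{group}", action="store_true",
                            help=f"do not execute {group} queries")
    return parser


if __name__ == "__main__":
    # Equivalent to: python project.py --force --no-basic
    args = build_parser().parse_args(["--force", "--no-basic"])
    print(args.force, args.no_basic, args.no_task)  # True True False
```

argparse turns the dashes in `--no-basic` into underscores, so the parsed value is read as `args.no_basic`.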