- Implemented K-means clustering using pyspark
- Kmeans.py contains the source code
The following commands are used to run the application:
- Copy the input file and Kmeans.py to the home folder of the dsba Hadoop cluster using WinSCP or the following commands:
scp Chicago_Crimes_updated.csv [email protected]:/users/sbilgund
scp Kmeans.py [email protected]:/users/sbilgund
- Put the input file into HDFS using the following command:
hdfs dfs -put Chicago_Crimes_updated.csv
- Change the directory to the folder containing spark-submit:
cd /usr/lib/spark/bin
- Run Kmeans.py on the input file using the command below:
spark-submit /users/sbilgund/Kmeans.py Chicago_Crimes_updated.csv
- The output is written to an HDFS folder named ChicagoKmeansOutput. Copy it from HDFS to the home folder of the dsba Hadoop cluster using the following command:
hdfs dfs -copyToLocal ChicagoKmeansOutput /users/sbilgund
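Once the output folder is on the local file system, the cluster labels still have to be joined back to the crime records. The sketch below is one possible way to do that in plain Python, not the procedure the author used. It assumes the folder holds Spark's default `part-*` files with one label per line, that Chicago_Crimes_updated.csv has no header row, and that every input row received exactly one label. The function names and the `merged.csv` file name are hypothetical.

```python
import csv
from pathlib import Path


def read_cluster_labels(output_dir):
    """Concatenate the lines of every part-* file, in file-name order."""
    labels = []
    # Sorting by name keeps part-00000, part-00001, ... in input order.
    for part in sorted(Path(output_dir).glob("part-*")):
        with part.open() as handle:
            labels.extend(line.strip() for line in handle if line.strip())
    return labels


def merge_labels(input_csv, output_dir, merged_csv):
    """Append one cluster label to each row of input_csv and write merged_csv."""
    labels = read_cluster_labels(output_dir)
    with open(input_csv, newline="") as src:
        rows = list(csv.reader(src))
    # Assumption: no rows were dropped, so counts must match exactly.
    if len(rows) != len(labels):
        raise ValueError(f"{len(rows)} input rows but {len(labels)} labels")
    with open(merged_csv, "w", newline="") as dst:
        csv.writer(dst).writerows(
            row + [label] for row, label in zip(rows, labels)
        )
    return len(rows)
```

A call such as `merge_labels("Chicago_Crimes_updated.csv", "ChicagoKmeansOutput", "merged.csv")` would then write the labelled rows. The row-count check is there because a positional merge silently misaligns if any row was filtered out during clustering.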
- Implemented Naive Bayes using pyspark
- NaiveBayes.py contains the source code
- The K-means output is written to multiple part files, in the same order as the input.
- Merge the K-means output with Chicago_Crimes_updated.csv.
- The input to Naive Bayes is split into parts named Naive-part-00000.csv, Naive-part-00001.csv, Naive-part-00002.csv
- Copy the input files and NaiveBayes.py to the home folder of the dsba Hadoop cluster using WinSCP or the following commands:
scp Naive-part-00000.csv [email protected]:/users/sbilgund
scp Naive-part-00001.csv [email protected]:/users/sbilgund
scp Naive-part-00002.csv [email protected]:/users/sbilgund
scp NaiveBayes.py [email protected]:/users/sbilgund
- Put the input files in HDFS using the following commands:
hdfs dfs -put Naive-part-00000.csv
hdfs dfs -put Naive-part-00001.csv
hdfs dfs -put Naive-part-00002.csv
- Change the directory to the folder containing spark-submit:
cd /usr/lib/spark/bin
- Run NaiveBayes.py on the input files using the command below:
spark-submit /users/sbilgund/NaiveBayes.py 5 11 20
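The copy, upload, and submit steps above repeat once per part file, so they can be generated instead of typed. The helper below is a hypothetical sketch that only builds and prints the commands; it does not run them, since `scp` is issued from the local machine while the remaining commands are issued on the cluster. `build_commands` and the `user@cluster` placeholder are illustrative names, and the script arguments `5 11 20` are passed through exactly as given above.

```python
import shlex


def build_commands(script, inputs, remote, home, script_args=()):
    """Return the scp, hdfs, cd and spark-submit steps as argument lists."""
    # Copy every input file, then the script, to the cluster home folder.
    commands = [["scp", name, f"{remote}:{home}"] for name in [*inputs, script]]
    # Upload each input file to HDFS.
    commands += [["hdfs", "dfs", "-put", name] for name in inputs]
    # Move to the folder containing spark-submit, then submit the job.
    commands.append(["cd", "/usr/lib/spark/bin"])
    commands.append(["spark-submit", f"{home}/{script}", *script_args])
    return commands


if __name__ == "__main__":
    parts = [f"Naive-part-{i:05d}.csv" for i in range(3)]
    steps = build_commands(
        "NaiveBayes.py",
        parts,
        "user@cluster",  # placeholder login, substitute the real one
        "/users/sbilgund",
        ["5", "11", "20"],
    )
    for argv in steps:
        print(shlex.join(argv))
```

Returning argument lists rather than one shell string keeps file names with spaces safe if the commands are later passed to `subprocess.run`.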
- Implemented Linear Regression using pyspark for 2013 and 2014 data
- LinearRegression.py contains the source code
- Prints the beta values (the fitted regression coefficients)
- The model is tested against 2015 data in TestModelRegression.xls
The following commands are used to run the application:
- Copy the input file and LinearRegression.py to the home folder of the dsba Hadoop cluster using WinSCP or the following commands:
scp Chicago_Crimes_updated.csv [email protected]:/users/sbilgund
scp LinearRegression.py [email protected]:/users/sbilgund
- Put the input file into HDFS using the following command:
hdfs dfs -put Chicago_Crimes_updated.csv
- Change the directory to the folder containing spark-submit:
cd /usr/lib/spark/bin
- Run LinearRegression.py on the input file using the command below:
spark-submit /users/sbilgund/LinearRegression.py Chicago_Crimes_updated.csv
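To make the printed beta values concrete, here is a small plain-Python illustration of what they represent and how they can be checked against held-out data. It is not the code in LinearRegression.py: it fits a single predictor with the closed-form least-squares formulas, the function names are hypothetical, and the numbers are made up rather than taken from the crime data. The features actually used by the script are not described here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = beta0 + beta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


def predict(beta0, beta1, x):
    """Apply the fitted coefficients to a new observation."""
    return beta0 + beta1 * x


# Made-up training points that lie on y = 1 + 2x.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]
beta0, beta1 = fit_line(train_x, train_y)
print("beta0 =", beta0, "beta1 =", beta1)
```

Testing on a later year then amounts to calling `predict` with the fitted betas on that year's inputs and comparing the result with the observed values.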