Peer-graded Assignment: Prediction Assignment Writeup
Overview
Devices such as Jawbone Up, Nike FuelBand, and Fitbit make it possible to collect a large amount of data about personal activity relatively inexpensively. The aim of this project is to predict the manner in which participants perform a weight-lifting exercise. The data come from http://groupware.les.inf.puc-rio.br/har, where 6 participants were asked to perform the same set of exercises correctly and incorrectly with accelerometers placed on the belt, forearm, arm, and dumbbell.
For the purpose of this project, the following steps are followed:
• Data Preprocessing
• Exploratory Analysis
• Prediction Model Selection
• Predicting Test Set Output
To begin with, we load the training and testing sets from the online sources and then split the training set further into training and test sets. The R libraries used for the analysis are listed below.
library(knitr)
library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(randomForest)
library(corrplot)
-
Data Preprocessing
library(caret)
setwd("~/Projects/R/Coursera-Practical-Machine-Learning-Assignment-1/")
trainURL <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training <- read.csv(url(trainURL))
testing <- read.csv(url(testURL))
# Hold out 30% of the labelled data as a validation set
label <- createDataPartition(training$classe, p = 0.7, list = FALSE)
train <- training[label, ]
test <- training[-label, ]
Of the 160 variables in the dataset, some have near-zero variance and others consist almost entirely of NA values; both groups need to be excluded. A further 5 variables, used only for identification, can also be removed.
# Drop near-zero-variance predictors
NZV <- nearZeroVar(train)
train <- train[, -NZV]
test <- test[, -NZV]
# Drop columns in which more than 95% of the values are NA
label <- apply(train, 2, function(x) mean(is.na(x))) > 0.95
train <- train[, !label]
test <- test[, !label]
# Drop the first 5 columns, which hold the identification variables
train <- train[, -(1:5)]
test <- test[, -(1:5)]
As a result of the preprocessing steps, we were able to reduce 160 variables to 54.
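The NA filter above can be tried in isolation. The following is a minimal sketch on a made-up data frame (the column names and values are hypothetical, not taken from the assignment data), assuming the same 0.95 threshold. It uses a logical mask rather than negative indices because a negative index built from an empty `which()` result would select no columns at all.

```r
# Toy data frame with hypothetical values (not the assignment data)
toy <- data.frame(
  id       = 1:6,
  mostlyNA = c(NA, NA, NA, NA, NA, NA),
  someNA   = c(1.2, NA, 0.7, 1.9, 2.4, 0.3),
  complete = c(5, 3, 8, 1, 9, 4)
)

# Fraction of missing values in each column
naFraction <- colMeans(is.na(toy))

# Keep the columns whose NA fraction does not exceed the threshold
naThreshold <- 0.95
keep <- naFraction <= naThreshold
toyClean <- toy[, keep, drop = FALSE]

names(toyClean)
```

Only the column that is entirely NA is dropped here; a column with occasional missing values stays in, which matches the intent of the 95% rule.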
-
Exploratory Analysis
Now that we have removed the uninformative variables from the dataset, we look at how the remaining variables depend on each other through a correlation plot.
library(corrplot)
# Column 54 is the outcome (classe), so it is left out of the correlation matrix
corrMat <- cor(train[, -54])
corrplot(corrMat, method = "color", type = "lower", tl.cex = 0.8, tl.col = rgb(0, 0, 0))
In the plot, a darker shade corresponds to a higher correlation. A Principal Component Analysis could be run to further reduce the correlated variables, but we skip it because only a few variables are strongly correlated.
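The claim that only a few variables are strongly correlated can also be checked numerically rather than by eye. The sketch below is illustrative only: it runs on simulated data, and the 0.8 cutoff and the variable names are assumptions, not values from the analysis above. The same steps could be applied to `corrMat`.

```r
set.seed(1)  # arbitrary seed for the toy example
n <- 200
x1 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),  # built to be strongly correlated with x1
  x3 = rnorm(n),
  x4 = rnorm(n)
)

toyCorr <- cor(toy)
cutoff <- 0.8  # hypothetical threshold for "strongly correlated"

# Use only the upper triangle so that each pair is counted once
highIdx <- which(abs(toyCorr) > cutoff & upper.tri(toyCorr), arr.ind = TRUE)
highPairs <- data.frame(
  var1 = rownames(toyCorr)[highIdx[, "row"]],
  var2 = colnames(toyCorr)[highIdx[, "col"]],
  r    = toyCorr[highIdx]
)
highPairs
```

If the resulting table has only a handful of rows relative to the number of predictors, skipping PCA is a reasonable simplification.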
-
Prediction Model Selection
We will use 3 methods to model the training set and choose the one with the best accuracy to predict the outcome variable in the testing set. The methods are Decision Tree, Random Forest, and Generalized Boosted Model. A confusion matrix shown at the end of each model helps visualize the results.
Decision Tree
library(rpart)
library(rpart.plot)
library(rattle)
set.seed(13908)
modelDT <- rpart(classe ~ ., data = train, method = "class")
fancyRpartPlot(modelDT)
[Figure: decision tree plot of modelDT. Rattle 2017-Aug-16 01:03:52 Yash_Kumar_Singh]
predictDT <- predict(modelDT, test, type = "class")
confMatDT <- confusionMatrix(predictDT, test$classe)
confMatDT
## Prediction    A    B    C    D    E
##          A 1505  233   44   80   29
##          B   39  609   36   21   25
##          C   21   76  818  143   88
##          D   86  145   51  612  131
##          E   23   76   77  108  809
##                      Class: A Class: B Class: C Class: D Class: E
## Specificity            0.9083   0.9745   0.9325   0.9161   0.9409
## Pos Pred Value         0.7959   0.8342   0.7138   0.5971   0.7402
## Neg Pred Value         0.9577   0.8972   0.9561   0.9276   0.9430
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2557   0.1035   0.1390   0.1040   0.1375
## Detection Prevalence   0.3213   0.1240   0.1947   0.1742   0.1857
## Balanced Accuracy      0.9037   0.7546   0.8649   0.7755   0.8443
Random Forest
library(caret)
set.seed(13908)
# 3-fold cross-validation for tuning
control <- trainControl(method = "cv", number = 3, verboseIter = FALSE)
modelRF <- train(classe ~ ., data = train, method = "rf", trControl = control)
modelRF$finalModel
##      A    B    C    D    E  class.error
## A 3904    1    0    0    1 0.0005120328
## B    7 2645    5    1    0 0.0048908954
## C    0    4 2392    0    0 0.0016694491
## D    0    0    8 2243    1 0.0039964476
## E    0    0    0    5 2520 0.0019801980
predictRF <- predict(modelRF, test)
confMatRF <- confusionMatrix(predictRF, test$classe)
confMatRF
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9921   0.9990   0.9938   0.9972
## Specificity            0.9988   0.9998   0.9981   0.9992   1.0000
## Pos Pred Value         0.9970   0.9991   0.9913   0.9958   1.0000
## Neg Pred Value         1.0000   0.9981   0.9998   0.9988   0.9994
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1920   0.1742   0.1628   0.1833
## Detection Prevalence   0.2853   0.1922   0.1757   0.1635   0.1833
## Balanced Accuracy      0.9994   0.9959   0.9986   0.9965   0.9986
Generalized Boosted Model
library(caret)
set.seed(13908)
# 5-fold cross-validation, repeated once
control <- trainControl(method = "repeatedcv", number = 5, repeats = 1, verboseIter = FALSE)
modelGBM <- train(classe ~ ., data = train, trControl = control, method = "gbm", verbose = FALSE)
modelGBM$finalModel
## There were 53 predictors of which 44 had non-zero influence.
predictGBM <- predict(modelGBM, test)
confMatGBM <- confusionMatrix(predictGBM, test$classe)
confMatGBM
## Accuracy : 0.9845
##                      Class: A Class: B Class: C Class: D Class: E
## Specificity            0.9964   0.9930   0.9955   0.9961   0.9996
## Pos Pred Value         0.9911   0.9711   0.9787   0.9803   0.9981
## Neg Pred Value         0.9988   0.9939   0.9971   0.9961   0.9950
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2836   0.1886   0.1720   0.1606   0.1798
## Detection Prevalence   0.2862   0.1942   0.1757   0.1638   0.1801
## Balanced Accuracy      0.9967   0.9838   0.9909   0.9882   0.9887
As Random Forest offers the highest accuracy, 99.75%, we will use the Random Forest model to predict the class variable of our test data.
Predicting Test Set Output
predictRF <- predict(modelRF, testing)
predictRF
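The accuracy figures used to compare the models follow directly from each confusion matrix: overall accuracy is the sum of the diagonal divided by the total count, and the estimated out-of-sample error is one minus that. As a worked example, the sketch below re-enters the Decision Tree confusion matrix reported earlier by hand, taking rows as predictions and columns as actual classes (the orientation caret prints). The variable names are illustrative.

```r
# Decision Tree confusion matrix as reported in this writeup
# (rows = predicted class, columns = actual class)
classes <- c("A", "B", "C", "D", "E")
confDT <- matrix(
  c(1505, 233,  44,  80,  29,
      39, 609,  36,  21,  25,
      21,  76, 818, 143,  88,
      86, 145,  51, 612, 131,
      23,  76,  77, 108, 809),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Overall accuracy: correctly classified cases over all cases
accuracy <- sum(diag(confDT)) / sum(confDT)

# Estimated out-of-sample error on the held-out 30%
outOfSampleError <- 1 - accuracy

# Per-class sensitivity: correct predictions over actual class totals
sensitivity <- diag(confDT) / colSums(confDT)

round(c(accuracy = accuracy, error = outOfSampleError), 4)
```

For the Decision Tree this gives an accuracy of roughly 74%, well below the 98.45% reported for the Generalized Boosted Model and the 99.75% for Random Forest, which is consistent with the choice of Random Forest.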