
data_mining's Issues

Classification and Regression Tree in R

We compare Classification And Regression Tree (CART) and Support Vector Machine (SVM) on the kyphosis dataset (from the R package rpart). The dataset contains 81 rows and 4 columns, representing data on children who have had corrective spinal surgery. The four columns are:
Kyphosis -- a factor with levels absent and present, indicating whether a kyphosis (a type of spinal deformation) was present after the operation.
Age -- age in months.
Number -- the number of vertebrae involved.
Start -- the number of the first (topmost) vertebra operated on.

We use Age, Number and Start to predict Kyphosis, and compare CART and SVM using the misclassification rate as the measure of error. For CART, we prune the decision tree with different values of the complexity parameter cp. For SVM, we tune the cost and gamma parameters.

CART

library(rpart)   ## for rpart(), prune() and the kyphosis data

## single random 80/20 train/test split
n <- nrow(kyphosis)
index <- sample(n, 0.8 * n)

train <- kyphosis[index, ]
test <- kyphosis[-index, ]

## fit a classification tree on the training set
fit <- rpart(Kyphosis ~ Age + Number + Start,
             method = "class", data = train)

## predicted class probabilities for the test set
pred <- predict(fit, test[, -1])

## misclassification rate: column 1 of pred is the predicted probability of "absent"
error <- mean((pred[, 1] > 0.5) != (test[, 1] == "absent"))
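Before sweeping over cp, the complexity-parameter table of the fitted tree can be inspected and the tree pruned at a chosen value. A minimal sketch using rpart's printcp() and prune(); the cp value 0.05 is only illustrative:

printcp(fit)                        ## cross-validated error for each cp value in the fit
fit_pruned <- prune(fit, cp = 0.05) ## prune the tree at an illustrative cp
pred_pruned <- predict(fit_pruned, test[, -1])
mean((pred_pruned[, 1] > 0.5) != (test[, 1] == "absent"))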

## average test classification error over 1000 random splits for different values of cp
mean(replicate(1000, select_tree()))          ## 21.0%
mean(replicate(1000, select_tree(cp = 0.9)))  ## 21.1%
mean(replicate(1000, select_tree(cp = 0.8)))  ## 21.2%
mean(replicate(1000, select_tree(cp = 0.7)))  ## 20.5%
mean(replicate(1000, select_tree(cp = 0.6)))  ## 21.2%
mean(replicate(1000, select_tree(cp = 0.5)))  ## 21.2%
mean(replicate(1000, select_tree(cp = 0.4)))  ## 21.1%
mean(replicate(1000, select_tree(cp = 0.3)))  ## 23.1%
mean(replicate(1000, select_tree(cp = 0.2)))  ## 24.1%
mean(replicate(1000, select_tree(cp = 0.1)))  ## 24.0%

## SVM with fixed tuning parameters
library(e1071)   ## for svm() and tune()
mean(replicate(1000, select_svm(cost = 10, gamma = 0.2)))   ## 18.3%

## grid search over cost and gamma with tune()
svm_tune <- tune(svm, train.x = kyphosis[, -1], train.y = kyphosis[, 1],
                 kernel = "radial", ranges = list(cost = 10^(-1:2), gamma = seq(0.1, 2, 0.1)))

Conclusion: The best misclassification error achieved by CART is 20.5%, compared to 18.3% by SVM, so SVM performs slightly better than CART on this dataset.

## fit a classification tree on a random train/test split, prune it at the
## given complexity parameter cp, and return the test misclassification rate
## (tp is the proportion of the data used for training)
select_tree <- function(tp = 0.8, cp = 1) {
    n <- nrow(kyphosis)
    index <- sample(n, tp * n)
    train <- kyphosis[index, ]
    test <- kyphosis[-index, ]
    fit <- rpart(Kyphosis ~ Age + Number + Start,
                 method = "class", data = train)
    fit1 <- prune(fit, cp = cp)
    pred <- predict(fit1, test[, -1])
    ## column 1 of pred is the predicted probability of "absent"
    error <- mean((pred[, 1] > 0.5) != (test[, 1] == "absent"))
    error
}

## fit an SVM on a random train/test split and return the test
## misclassification rate; extra arguments are passed on to svm()
select_svm <- function(tp = 0.8, ...) {
    n <- nrow(kyphosis)
    index <- sample(n, tp * n)
    train <- kyphosis[index, ]
    test <- kyphosis[-index, ]
    fit <- svm(Kyphosis ~ Age + Number + Start, data = train, ...)
    pred <- predict(fit, test[, -1])
    error <- mean(pred != test[, 1])
    error
}

Comparing Naive Bayes and SVM on the iris dataset

Both naive Bayes and support vector machines (SVM) can be used for classification. We compare their performance on the classic iris data set.

Accuracy

We run 1000 simulations. In each simulation we split the data into a training set and a testing set, fit both models on the training set, make predictions for the testing set, and compute the classification error. After all 1000 simulations, the mean and standard deviation of the classification error are reported.

> nb_svm(iris, B = 1000)
     naiveBayes        svm
mean 0.04688000 0.04222000
std  0.02563831 0.02465927

In the simulation, svm produces a 4.2% classification error on average, slightly less than the 4.7% of naive Bayes.

Computation time

library(microbenchmark)
> microbenchmark(naiveBayes(iris[, 1:4], iris[, 5]), svm(iris[, 1:4], iris[, 5]), times = 1000)
Unit: milliseconds
                                expr      min       lq     mean   median       uq      max neval
 naiveBayes(iris[, 1:4], iris[, 5]) 1.302364 1.432913 1.554624 1.492442 1.575168 16.33110  1000
        svm(iris[, 1:4], iris[, 5]) 3.105816 3.317584 3.557402 3.433982 3.572570 22.91197  1000

Over 1000 benchmark runs, naiveBayes needs only about 44% of the time needed by svm (mean 1.55 ms vs 3.56 ms).

Therefore, SVM is more accurate than naive Bayes on this problem, but needs more computation time. This result holds only for the iris dataset, which contains 150 observations on 4 predictors. For larger datasets, we expect naiveBayes to be even faster relative to SVM.
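One rough way to check this expectation is to time both models on a larger simulated dataset. The sketch below uses hypothetical sizes and random data (assuming e1071 and microbenchmark are loaded), so the exact timings will differ:

set.seed(1)      ## arbitrary seed for the simulated data
n_big <- 10000   ## hypothetical larger sample size
x_big <- data.frame(matrix(rnorm(n_big * 4), ncol = 4))
y_big <- factor(sample(c("a", "b", "c"), n_big, replace = TRUE))

microbenchmark(naiveBayes(x_big, y_big), svm(x_big, y_big), times = 5)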

Appendix: R code

## 1. load the library
library(e1071)

## 2. divide the data into a training set and a testing set
n <- nrow(iris)
index <- sample(n, n * 2 / 3)
training <- iris[index, ]
testing <- iris[-index, ]

## 3. naive Bayes model
mod1 <- naiveBayes(training[, 1:4], training[, 5])
pred1 <- predict(mod1, testing[, 1:4])
err1 <- mean(pred1 != testing[, 5])

## 4. SVM
mod2 <- svm(training[, 1:4], training[, 5])
pred2 <- predict(mod2, testing[, 1:4])
err2 <- mean(pred2 != testing[, 5])
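
## the two single-split errors can be compared side by side
## (values vary with the random split)
c(naiveBayes = err1, svm = err2)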

## 5. Monte Carlo simulation: repeat the train/test split B times and
##    summarise the classification errors of naiveBayes and svm.
##    Assumes the predictors are in columns 1:4 and the class in column 5 (as in iris).
nb_svm <- function(data, B = 100, training.p = 2/3) {
    n <- nrow(data)
    err1 <- err2 <- numeric(B)
    for (i in 1:B) {
        index <- sample(n, n * training.p)
        training <- data[index, ]
        testing <- data[-index, ]
        mod1 <- naiveBayes(training[, 1:4], training[, 5])
        pred1 <- predict(mod1, testing[, 1:4])
        err1[i] <- mean(pred1 != testing[, 5])
        mod2 <- svm(training[, 1:4], training[, 5])
        pred2 <- predict(mod2, testing[, 1:4])
        err2[i] <- mean(pred2 != testing[, 5])
    }
    ## overlay the two error densities
    plot(density(err1))
    lines(density(err2), col = "red")
    legend("topright", legend = c("naiveBayes", "svm"), lty = c(1, 1), col = c("black", "red"))
    res1 <- c(mean(err1), sd(err1))
    res2 <- c(mean(err2), sd(err2))
    res <- data.frame(naiveBayes = res1, svm = res2)
    rownames(res) <- c("mean", "std")
    res
}
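
A minimal usage sketch of the simulation function; the seed is arbitrary and only makes the random splits reproducible:

set.seed(123)                  ## arbitrary seed for reproducible splits
res <- nb_svm(iris, B = 1000)  ## also plots the two error densities
res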
