r-code's Issues

MH algorithm to generate beta random variables

We compare the density of a sample generated with the MH algorithm against that of a sample drawn directly with the base R function rbeta. The code is as follows.

mh.beta <- function(a = 2.7, b = 6.3, N = 1e4) {
    X <- numeric(N)
    X[1] <- runif(1)

    for(i in 2:N) {
        y <- runif(1)
        rho <- dbeta(y, a, b) / dbeta(X[i-1], a, b)
        X[i] <- X[i-1] + (y - X[i-1]) * (runif(1) < rho)
    }

    plot(density(X), type = "l", main = "MH simulation & Direct simulation Comparison", xlab = "")
    lines(density(rbeta(N, a, b)), type = "l", col = "red")
    legend(0.65, 2.65, legend = c("MH simulation", "Direct simulation"), lty = c(1, 1), col = c("black", "red"))
    invisible(X)    # return the chain invisibly so it can be inspected
}

The plot when N = 1e5 is like the following.
[figure omitted]

We can see that the MH sampler performs quite well, except near the peak.
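
To go beyond the visual check, here is a rough numerical comparison (a sketch of my own, using the chain that mh.beta() now returns invisibly; the parameters below are just the function defaults). The MH draws are autocorrelated and contain ties from rejected proposals, so the Kolmogorov-Smirnov p-value is only indicative.

##  Numerical comparison of the MH sample and a direct rbeta() sample (sketch)
set.seed(1)
x.mh     <- mh.beta(N = 1e5)                        # chain returned invisibly
x.direct <- rbeta(1e5, 2.7, 6.3)
ks.test(x.mh, x.direct)                             # small D => similar distributions
rbind(MH = summary(x.mh), direct = summary(x.direct))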

Random Walk

A random walk is a non-convergent Markov chain. We can illustrate this non-convergence with a simulation.

random.walk <- function(t = 1e4) {
    res <- numeric(t + 1)
    res[1] <- rnorm(1)            # R vectors are 1-indexed
    for(i in 1:t) {
        res[i + 1] <- res[i] + rnorm(1)
    }
    plot(res, type = "l", lwd = 0.5)
}

The plot looks like the following graph; each time the random walk function is run, the plot is different.
[figures omitted]

If we slightly change the code (multiplying the previous value by rho with |rho| < 1, i.e. an AR(1) process), the Markov chain becomes convergent.

Markov.chain <- function(t = 1e4, rho = 0.9) {
    res <- numeric(t + 1)
    res[1] <- rnorm(1)            # R vectors are 1-indexed
    for(i in 1:t) {
        res[i + 1] <- rho * res[i] + rnorm(1)
    }
    par(mfrow = c(1, 2))
    plot(res, type = "l", lwd = 0.5)
    plot(density(res), type = "l")
}

The plot looks like the following; each time the function is run, the density is almost the same.
[figures omitted]

How to use SVM in R

An introduction to using SVM in R (two-class classification only).

##  This code is to study SVM using package 'e1071'

##  =====Create data ======================
#   in order to perform classification rather than regression,
#   we need to change y to factors
set.seed(1)
x <- matrix(rnorm(20 * 2), ncol = 2)
y <- c(rep(-1, 10), rep(1, 10))
x[y == 1, ] <- x[y == 1, ] + 1
plot(x, col = (3 - y))
data <- data.frame( x = x, y = as.factor(y))

##  =====fitting model =====================
#   the lower the cost is, the more support vectors we have
library(e1071)
svmfit1 <- svm(y ~ ., data = data, kernel = "linear", cost = 10, scale = FALSE)
svmfit2 <- svm(y ~ ., data = data, kernel = "linear", cost = 1, scale = FALSE)
svmfit3 <- svm(y ~ ., data = data, kernel = "linear", cost = .1, scale = FALSE)

#   the support vectors are denoted by 'X', the others by 'O'
plot(svmfit1, data) # when cost = 10, there are 7 support vectors
plot(svmfit2, data) # when cost = 1, there are 10 support vectors
plot(svmfit3, data) # when cost = .1, there are 16 support vectors

##  =====Tuning =============================
#   tune() performs cross-validation for svm
set.seed(1)
tune.out <- tune(svm, y ~., data = data, kernel = "linear",
                 ranges = list(cost = c(.001, .01, .1, 1, 5, 10, 100)))

#   we found the best cost is .1
summary(tune.out)
#   tune() stores the best model for us
bestmod <- tune.out$best.model

##  =====prediction ==========================
xtest <- matrix(rnorm(20 * 2), ncol = 2)
ytest <- sample(c(1, -1), 20, replace = TRUE)
xtest[ytest == 1, ] <- xtest[ytest == 1, ] + 1
test <- data.frame(x = xtest, y = as.factor(ytest))
pred <- predict(bestmod, test)
xtabs(~ pred + ytest)


##  ===== other types of kernel: radial =============
#   we need to set kernel = "radial", and set the value of gamma
#   we still need to set cost value
set.seed(1)
x <- matrix(rnorm(200 * 2), ncol = 2)
x[1:100, ] <- x[1:100, ] + 2
x[101:150, ] <- x[101:150, ] - 2
y <- c(rep(1, 150), rep(2, 50))
data <- data.frame(x = x, y = as.factor(y))
plot(x, col = y)

# setting the cost = 1, we found there are 7 observations that are misclassified
train <- sample(200, 100)
svmfit <- svm(y ~., data = data[train, ], kernel = "radial", gamma = 1, cost = 1)
plot(svmfit, data[train, ])
pred <- predict(svmfit, data[train, ])
table(pred = pred, true = y[train])

# increasing the cost fits the data more closely but tends to cause overfitting
# with cost = 1e5, only one observation is misclassified
# and the decision boundary becomes more irregular
quartz()   # opens a new plot window on macOS; use windows() or x11() on other systems
svmfit <- svm(y ~., data = data[train, ], kernel = "radial", gamma = 1, cost = 1e5)
plot(svmfit, data[train, ])
pred <- predict(svmfit, data[train, ])
table(pred = pred, true = y[train])

##  we need cross-validation (tuning)
tune.out <- tune(svm, y ~., data = data[train, ], kernel = "radial",
                 ranges = list(cost = seq(.1, 2, by = 0.1),
                               gamma = seq(.05, .5, by = .05)))

# we found 13% of the test data are misclassified
pred <- predict(tune.out$best.model, data[-train, ])
table(pred = pred, true = y[-train])

[figure omitted]

Asymptotic simulation of delta method

The purpose is to run a simulation to see how well the delta method performs.
Let X1, ..., Xn be i.i.d. Poisson(theta) and let Yi = I(Xi = 0). Find the asymptotic distribution of Zn = 2 * mean(X) + log(mean(Y)).
The solution using the CLT and the multivariate delta method is approximately N(theta, exp(theta)/n); the exact delta-method variance is (exp(theta) - 1)/n, which is indistinguishable from exp(theta)/n for theta = 5 (a short derivation is sketched after the figure below). The following is the R simulation code. It shows that when n is large, the approximation is good.

a <- expression(
n <- 5e4, # we set n to be fairly large. When n is small, the performance is not good.
theta <- 5,
x <- rpois(n, theta),
y <- x == 0,
Z <- 2 * mean(x) + log(mean(y))
)

sim <- replicate(1e4,eval(a))

xlow <- theta - 3 * sqrt(exp(theta) / n)
xhigh <- theta + 3 * sqrt(exp(theta) / n)
xseq <- seq(xlow, xhigh, length = 100)
plot(density(sim))
lines(xseq, dnorm(xseq, theta, sqrt(exp(theta)/n)), col = "red")

The result is pretty good.
[figure omitted]
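
For reference, here is a sketch of the delta-method calculation (my own addition, using the standard multivariate delta method).

% X ~ Poisson(theta), Y = 1{X = 0}:
%   E[X] = Var(X) = theta,  E[Y] = e^{-theta},  Var(Y) = e^{-theta}(1 - e^{-theta}),
%   Cov(X, Y) = E[XY] - E[X]E[Y] = 0 - theta e^{-theta} = -theta e^{-theta}.
% For g(u, v) = 2u + log(v), the gradient at (theta, e^{-theta}) is (2, e^{theta}), so
\sigma^2 \;=\; 4\theta \;+\; 2 \cdot 2\, e^{\theta}\bigl(-\theta e^{-\theta}\bigr)
         \;+\; e^{2\theta} e^{-\theta}\bigl(1 - e^{-\theta}\bigr) \;=\; e^{\theta} - 1,
\qquad
Z_n \;\approx\; N\!\bigl(\theta,\ (e^{\theta} - 1)/n\bigr).

For theta = 5, (exp(theta) - 1)/n and exp(theta)/n differ by less than 1%, so the dnorm reference curve used in the code is visually identical to the exact asymptotic density.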

Fligner-Policello test in R

I tried to find a package that performs the FP test but found nothing, so I wrote a small, crude function that takes two vectors as input and returns a p-value.

FP.test <- function(x, y) {
    n <- length(x); m <- length(y)
    Pl.x <- numeric(n); Pl.y <- numeric(m)

    for(i in 1:n) {
        Pl.x[i] <- sum(y < x[i]) + 1/2 * sum(y == x[i])
    }

    for(i in 1:m) {
        Pl.y[i] <- sum(x < y[i]) + 1/2 * sum(x == y[i])
    }

    Plxmean <- mean(Pl.x); Plymean <- mean(Pl.y)

    Vx <- (n - 1) * var(Pl.x); Vy <- (m - 1) * var(Pl.y)

    z <- (sum(Pl.y) - sum(Pl.x)) / (2 * sqrt(Vx + Vy + Plxmean * Plymean))

    p <- pnorm(abs(z), lower.tail = FALSE) * 2

    return (list(p.value = p))

}

We can try some examples.

> set.seed(100)
> x <- rnorm(20, mean = 0, sd = 1)
> set.seed(200)
> y <- rnorm(30, mean = 0, sd = 1)
> set.seed(201)
> y1 <- rnorm(30, mean = 1, sd = 1)
> set.seed(202)
> y2 <- rnorm(30, mean = 2, sd = 1)
> set.seed(203)
> y3 <- rnorm(30, mean = 3, sd = 1)
> FP.test(x,y)
$p.value
[1] 0.7070176

> FP.test(x,y1)
$p.value
[1] 5.794937e-05

> FP.test(x,y2)
$p.value
[1] 2.43404e-46

> FP.test(x,y3)
$p.value
[1] 1.124729e-250

MH algorithm for polynomial regression

We will write an R program to estimate the coefficients of a polynomial regression. The data set is cars, which ships with R; the response is dist and the predictor is speed. We will fit a polynomial regression of degree two.

> head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

With the frequentist approach we can simply use lm in R to fit the model; the following code produces the graph below.

y <- cars$dist
x1 <- cars$speed
x2 <- (cars$speed)^2
mod <- lm(y ~ x1 + x2)
pred <- predict(mod, data.frame(x1, x2))
plot(x1, y, xlab = "speed", ylab = "dist")
lines(x1, pred, type = "l", col = "red", lwd = 2)   # speed is already sorted in cars

[figure omitted]

Now we use the Bayesian method. The procedure is as follows: choose a prior for each coefficient, use the MH algorithm to generate an MCMC chain, and estimate the posterior distributions from that chain. Here we use a normal prior for each coefficient and a uniform prior for the standard deviation of the error (labelled sigma2 in the plots). The code for each function is given below.

##  MH algorithm
#   design matrix with first column all 1s (uses x1 and x2 defined above)
X <- cbind(1, x1, x2)

#   define likelihood
likelihood <- function(param){
    # param = c(b0, b1, b2, sd) -- the last element is the error sd
    params <- matrix(param[-4])
    pred = X %*% params
    pred <- as.vector(pred)
    singlelikelihoods = dnorm(y, mean = pred, sd = param[4], log = T)
    sumll = sum(singlelikelihoods)
    return(sumll)   
}

#   define prior
prior <- function(param){
    b0 = param[1]
    b1 = param[2]
    b2 = param[3]
    sd = param[4]
    aprior = dnorm(b0, mean=0, sd=10, log = T)
    bprior = dnorm(b1, mean=0, sd = 5, log = T)
    cprior = dnorm(b2, mean=0, sd = 1, log = T)
    sdprior = dunif(sd, min=0, max=1000, log = T)
    return(aprior + bprior + cprior + sdprior)
}

# define posterior
posterior <- function(param){
   return (likelihood(param) + prior(param))
}

# define proposal function
proposalfunction <- function(param, theta = c(0.5,0.5,0.3, 1)){
    return(runif(4, min = param - theta, max = param + theta))
}

##  MH
MH <- function(startvalue, N = 1e5, burn = N/10){
    chain = array(dim = c(N,4))
    chain[1,] = startvalue
    for (i in 2:N){
        proposal = proposalfunction(chain[i-1,])

        rho = exp(posterior(proposal) - posterior(chain[i-1,]))
        if (runif(1) < rho){
            chain[i,] = proposal
        }else{
            chain[i,] = chain[i-1,]
        }
    }
    chain <- chain[-(1:burn), ]

    par(mfrow = c(2, 2))
    # plot(density(chain[, 1]), type = "l", main = "beta0")
    # plot(density(chain[, 2]), type = "l", main = "beta1")
    # plot(density(chain[, 3]), type = "l", main = "beta2")
    # plot(density(chain[, 4]), type = "l", main = "sigma2")

    hist(chain[, 1], freq = FALSE, breaks = 50, main = "beta0")
    hist(chain[, 2], freq = FALSE, breaks = 50, main = "beta1")
    hist(chain[, 3], freq = FALSE, breaks = 50, main = "beta2")
    hist(chain[, 4], freq = FALSE, breaks = 50, main = "sigma2")

    invisible(chain)
}

Next we show a simulation result with 5e5 iterations. The posterior distribution of each parameter is shown in the following graph.
[figure omitted]
We can compare this with the model built with lm.

> mod <- lm(y ~ x1 + x2)
> summary(mod)

Call:
lm(formula = y ~ x1 + x2)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.720  -9.184  -3.188   4.628  45.152 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.47014   14.81716   0.167    0.868
x1           0.91329    2.03422   0.449    0.656
x2           0.09996    0.06597   1.515    0.136

Residual standard error: 15.18 on 47 degrees of freedom
Multiple R-squared:  0.6673,    Adjusted R-squared:  0.6532 
F-statistic: 47.14 on 2 and 47 DF,  p-value: 5.852e-12

The posterior distribution of beta0 is rather unstable, while the others are stable.

How does the MH algorithm perform? Check the following graph, which shows the fitted curves for 100 parameter draws randomly selected from the last 1000 iterations of the chain (after removing duplicates).

[figure omitted]

If we increase the number of iterations to 1e6, the result improves considerably.

[figures omitted]

An open problem: why is the acceptance rate so low (about 0.03), and how can it be improved? A rough way to check the rate is sketched below.
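
A minimal sketch for measuring the acceptance rate from the stored chain (the starting values and the alternative proposal widths below are illustrative guesses, not values from the original run):

##  The chain only changes value when a proposal is accepted, so the fraction
##  of iterations in which it moves estimates the acceptance rate.
chain <- MH(startvalue = c(0, 0, 0, 10), N = 1e5)   # starting values are illustrative
mean(diff(chain[, 1]) != 0)

##  One common remedy is to scale each proposal width to the rough posterior
##  scale of its parameter (the lm() standard errors are a convenient guide),
##  e.g. theta = c(10, 1.5, 0.05, 1.5) instead of c(0.5, 0.5, 0.3, 1),
##  and then re-check the acceptance rate.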

We can see that the MH algorithm is not bad.
The discussion above is just a rough draft of the idea, not a polished analysis.

Gibbs Sampler for Poisson distribution (Robert and Casella, 10.17)

The Gibbs sampler is used a lot in Bayesian inference. The following R function outputs the posterior mean and standard deviation of lambda1 to lambda10 and beta.

Gibbs.pump <- function(alpha = 1.8, r = 0.01, delta = 1, simulation = 1000, burn = 100) {
    ##  data
    y <- c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22)
    t <- c(94, 16, 63, 126, 5, 31, 1, 1, 2, 10)
    lambda <- numeric(10)

    ##  initial values
    beta <- 0.01

    ##  burn-in session
    for (i in 1:(burn + 1)) {       
        for (j in 1:10) {
            lambda[j] <- rgamma(1, y[j] + alpha, t[j] + beta)
        }

        beta <- rgamma(1, 10 * alpha + r, delta + sum(lambda))      
    }

    ##  markov chain session
    lambda2 <- matrix(NA, nrow = simulation, ncol = 10, byrow = TRUE)
    lambda2[1, ] <- lambda

    beta2 <- numeric(simulation)
    beta2[1] <- beta

    for (i in 2:simulation) {
        beta2[i] <- rgamma(1, 10 * alpha + r, delta + sum(lambda2[i-1, ]))

        for (j in 1:10) {
            lambda2[i, j] <- rgamma(1, y[j] + alpha, t[j] + beta2[i])
        }
    }

    ##  returned values
    lambda.mean <- apply(lambda2, 2, mean)
    lambda.sd <- apply(lambda2, 2, sd)
    lambda.dataframe <- data.frame(mean = lambda.mean, sd = lambda.sd)

    beta.mean <- mean(beta2)
    beta.sd <- sd(beta2)
    beta.dataframe <- data.frame(mean = beta.mean, sd = beta.sd)

    res <- list(lambda = lambda.dataframe, beta = beta.dataframe)
    return(res)
}

We can run the function, specifying the burn-in length and the number of simulations.

> Gibbs.pump(burn = 1000, simulation = 1e4)
$lambda
         mean         sd
1  0.07051077 0.02698162
2  0.15207612 0.09187032
3  0.10421768 0.03991478
4  0.12289529 0.03097761
5  0.64916456 0.30120514
6  0.62272301 0.13671542
7  0.85414118 0.54445976
8  0.85574205 0.54394850
9  1.35889842 0.60607078
10 1.92709361 0.40330237

$beta
      mean        sd
1 2.406072 0.6915125

Good Bootstrap, Bad Bootstrap

Bootstrapping is a method for estimating the distribution of a statistic by resampling. It is useful when the distribution of a complicated statistic is difficult to obtain in other ways.
There are two bootstrapping methods: parametric and non-parametric.
For an i.i.d. sample X1, X2, ..., Xn, how well bootstrapping works is influenced by

(1) The method of bootstrapping: non-parametric or parametric;
(2) The distribution of the sample (kurtosis, variance, etc.);
(3) The distribution of the hypothesised population (kurtosis, variance, etc.).

The following is an R program I wrote to compare the parametric and non-parametric bootstrap in learning the mean and variance of a sample from the standard normal distribution. An example is as follows.
> bootstrap(n=50, seed = 2015)
[figure omitted]
While some examples seem to show that the method is pretty good, other examples may not be so lucky.
> bootstrap(n=50, seed = 1)
[figure omitted]

The original R code is as follows.

bootstrap <- function(B = 1000, n = 20, mu = 0, sigma = 1, seed = round(runif(1, -1e4, 1e4))) {

    ##  initial sample
    set.seed(seed)
    X <- rnorm(n, mu, sigma)

    ##  parametric bootstrap
    #   use the MLE to estimate the parameters of the population
    meanX <- mean(X)
    varX <- var(X) * (n - 1) / n

    #   generate bootstrap samples (note rnorm takes a standard deviation, not a variance)
    meanx <- varx <- numeric(B)
    for (i in 1:B) {
        x <- rnorm(n, meanX, sqrt(varX))
        meanx[i] <- mean(x)
        varx[i] <- var(x) * (n - 1) / n
    }

    par(mfrow = c(2, 2))

    #   plot true and bootstrap distribution of the mean
    xlower <- mu - 3.5 * sigma / sqrt(n)
    xupper <- mu + 3.5 * sigma / sqrt(n)
    xi <- seq(xlower, xupper, length = 200)
    yi <- dnorm(xi, mu, sigma / sqrt(n))
    plot(xi, yi, type = "l", ylim = c(0, max(yi, density(meanx)$y)), xlab = "mean", ylab = "density", main = "Parametric bootstrap for \n mean")
    lines(density(meanx), type = "l", col = "red")

    #   plot true and bootstrap distribution of the (MLE) variance:
    #   n * varhat / sigma^2 ~ chi^2_{n-1}, i.e. varhat ~ Gamma((n-1)/2, rate = n/(2*sigma^2))
    vlower <- qgamma(0.025, shape = (n - 1) / 2, rate = n / (2 * sigma^2))
    vupper <- qgamma(0.975, shape = (n - 1) / 2, rate = n / (2 * sigma^2))
    vi <- seq(min(quantile(varx, .025), vlower), max(vupper, quantile(varx, .975)), length = 200)
    wi <- dgamma(vi, shape = (n - 1) / 2, rate = n / (2 * sigma^2))
    plot(vi, wi, type = "l", ylim = c(0, max(wi, density(varx)$y)), xlab = "variance", ylab = "density", main = "Parametric bootstrap for \n variance")
    lines(density(varx), type = "l", col = "blue")

    ##  non-parametric bootstrap
    meanx.n <- varx.n <- numeric(B)
    for (i in 1:B) {
        x <- sample(X, replace = TRUE)
        meanx.n[i] <- mean(x)
        varx.n[i] <- var(x) * (n - 1) / n
    }
    plot(xi, yi, type = "l", ylim = c(0, max(yi, density(meanx.n)$y)), xlab = "mean", ylab = "density", main = "Non-parametric bootstrap for \n mean")
    lines(density(meanx.n), type = "l", col = "red")
    plot(vi, wi, type = "l", ylim = c(0, max(wi, density(varx.n)$y)), xlab = "variance", ylab = "density", main = "Non-parametric bootstrap for \n variance")
    lines(density(varx.n), type = "l", col = "blue")
}

From the many examples I have tried, the parametric and non-parametric methods give almost the same results for the distribution of the mean and variance of normal data. However, this is only for the normal distribution, which is symmetric; skewed distributions may behave differently.
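
As a quick illustration of a skewed case (a sketch of my own, not part of the original function; the exponential example and sample sizes are arbitrary choices), the non-parametric bootstrap distribution of the mean can be compared with its exact sampling distribution, which for Exponential(1) data is Gamma(n, n):

##  Non-parametric bootstrap of the mean for skewed (exponential) data,
##  compared with the exact sampling distribution Gamma(shape = n, rate = n)
set.seed(2015)
n <- 20; B <- 1000
X <- rexp(n, rate = 1)
boot.means <- replicate(B, mean(sample(X, replace = TRUE)))
xi <- seq(0, 3, length = 200)
yi <- dgamma(xi, shape = n, rate = n)
plot(xi, yi, type = "l", ylim = c(0, max(yi, density(boot.means)$y)),
     xlab = "mean", ylab = "density", main = "Non-parametric bootstrap for \n exponential mean")
lines(density(boot.means), col = "red")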

Coding is fun, Coding is thrilling, Coding is great!

The more I code in R, the more amazing I find it. Sometimes an idea just comes to mind and I wonder whether I can realize it in R. I am not sure I will succeed; I only have a picture in my mind. Then, after an hour (or much more) of coding, the great moment comes: R gives me exactly the output I wanted, and even more. How can I not love R?
The following is a tiny piece of code I wrote this afternoon to visualize the coverage of confidence intervals (here, the t interval for a normal mean).
The following is a tiny piece of code I wrote this afternoon to visualize the coverage of confidence interval of MLE.
[figures omitted: coverage plots for n = 20, n = 50, and n = 100]

The code is attached below.

coverage <- function(repetition = 20, n = 11, mu = 10, sigma = 3) {
    lower <- upper <- include <- numeric(repetition)
    for(i in 1:repetition) {
        x <- rnorm(n, mu, sigma)
        meanx <- mean(x)
        lower[i] <- meanx - sd(x) * qt(1 - 0.05/2, n - 1) / sqrt(n)
        upper[i] <- meanx + sd(x) * qt(1 - 0.05/2, n - 1) / sqrt(n)
        include[i] <- as.numeric(lower[i] <= mu & upper[i] >= mu)
    }
    maxx <- max(upper)
    minx <- min(lower)
    plot(0, 0, xlim = c(0, repetition + 1), ylim = c(minx, maxx), type = "n", xlab = "repetition", ylab = "Coverage", main = "Confidence Interval Coverage")
    for(i in 1:repetition) {
        xi <- rep(i, 10)
        yi <- seq(from = lower[i], to = upper[i], length = 10)
        lines(xi, yi, lty = 1, lwd = 1.5, col = ifelse(include[i] == 0, "red", "black"))
    }
    return(list(total = repetition, covered = sum(include), coverage = sum(include) / repetition))
}

F distribution with different degrees of freedom

We want to compare F densities for different degrees of freedom. The plots are shown below.
[figures omitted]

The code is as follows.

x <- seq(0, 3, by = 0.01) 

##  df1 = 1, df2 = 1, 2, 3, 5, 10, 20
y1.1 <- df(x, 1, 1)
y1.2 <- df(x, 1, 2)
y1.3 <- df(x, 1, 3)
y1.5 <- df(x, 1, 5)
y1.10 <- df(x, 1, 10)
y1.20 <- df(x, 1, 20)
    ##  plot
plot(x, y1.1, type = "l", ylim = c(0, 2.5), col = "black", xlab = "X", ylab = "density of F", main = "density of F for different df")
lines(x, y1.2, type = "l", col = "red")
lines(x, y1.3, type = "l", col = "green")
lines(x, y1.5, type = "l", col = "black", lty = 2)
lines(x, y1.10, type = "l", col = "red", lty = 2)
lines(x, y1.20, type = "l", col = "green", lty = 2)
legend(2.2, 2.5, 
    legend = c("d1 = 1, d2 = 1", "d1 = 1, d2 = 2", "d1 = 1, d2 = 3", "d1 = 1, d2 = 5", "d1 = 1, d2 = 10", "d1 = 1, d2 = 20"),
    col = c("black", "red", "green", "black", "red", "green"), lty = c(1,1,1,2,2,2) 
    )

    ##  df1 = 5, df2 = 1, 2, 3, 5, 10, 20
y1.1 <- df(x, 5, 1)
y1.2 <- df(x, 5, 2)
y1.3 <- df(x, 5, 3)
y1.5 <- df(x, 5, 5)
y1.10 <- df(x, 5, 10)
y1.20 <- df(x, 5, 20)
    ##  plot
plot(x, y1.1, type = "l", ylim = c(0, 2.5), col = "black", xlab = "X", ylab = "density of F", main = "density of F for different df")
lines(x, y1.2, type = "l", col = "red")
lines(x, y1.3, type = "l", col = "green")
lines(x, y1.5, type = "l", col = "black", lty = 2)
lines(x, y1.10, type = "l", col = "red", lty = 2)
lines(x, y1.20, type = "l", col = "green", lty = 2)
legend(2.2, 2.5, 
    legend = c("d1 = 5, d2 = 1", "d1 = 5, d2 = 2", "d1 = 5, d2 = 3", "d1 = 5, d2 = 5", "d1 = 5, d2 = 10", "d1 = 5, d2 = 20"),
    col = c("black", "red", "green", "black", "red", "green"), lty = c(1,1,1,2,2,2) 
    )

    ##  df1 = 1, 2, 3, 5, 10, 20, df2 = 5
y1.1 <- df(x, 1, 5)
y1.2 <- df(x, 2, 5)
y1.3 <- df(x, 3, 5)
y1.5 <- df(x, 5, 5)
y1.10 <- df(x, 10, 5)
y1.20 <- df(x, 20, 5)
    ##  plot
plot(x, y1.1, type = "l", ylim = c(0, 2.5), col = "black", xlab = "X", ylab = "density of F", main = "density of F for different df")
lines(x, y1.2, type = "l", col = "red")
lines(x, y1.3, type = "l", col = "green")
lines(x, y1.5, type = "l", col = "black", lty = 2)
lines(x, y1.10, type = "l", col = "red", lty = 2)
lines(x, y1.20, type = "l", col = "green", lty = 2)
legend(2.2, 2.5, 
    legend = c("d1 = 1, d2 = 5", "d1 = 2, d2 = 5", "d1 = 3, d2 = 5", "d1 = 5, d2 = 5", "d1 = 10, d2 = 5", "d1 = 20, d2 = 5"),
    col = c("black", "red", "green", "black", "red", "green"), lty = c(1,1,1,2,2,2) 
    )
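
The three blocks above differ only in which degrees of freedom vary. A small helper function (a sketch of my own, not part of the original code) removes the repetition for the fixed-df1 case:

##  Plot F densities for a fixed df1 and a vector of df2 values
plotF <- function(df1 = 1, df2 = c(1, 2, 3, 5, 10, 20)) {
    x <- seq(0, 3, by = 0.01)
    cols <- rep(c("black", "red", "green"), length.out = length(df2))
    ltys <- rep(1:2, each = 3, length.out = length(df2))
    plot(x, df(x, df1, df2[1]), type = "l", ylim = c(0, 2.5), col = cols[1], lty = ltys[1],
         xlab = "X", ylab = "density of F", main = "density of F for different df")
    for (i in 2:length(df2)) lines(x, df(x, df1, df2[i]), col = cols[i], lty = ltys[i])
    legend(2.2, 2.5, legend = paste0("d1 = ", df1, ", d2 = ", df2), col = cols, lty = ltys)
}

plotF(1)   # reproduces the first plot
plotF(5)   # reproduces the second plot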

Performance of Wald confidence interval for relative risk

Relative risk is important in statistics for comparing the probabilities of binary events, and it is easier to explain to the general public than the odds ratio. For example, people will easily understand

The probability of getting lung cancer is ten times higher for people who smoke at least one cigarette every day than for those who don't smoke at all.

But it is confusing if people are told

The odds ratio of lung cancer for people who smoke at least one cigarette every day versus those who don't smoke at all is 15.

When calculating confidence intervals for the relative risk, it turns out that the log relative risk has better asymptotic properties than the relative risk itself. The following R code tests the performance of the Wald confidence interval for the log relative risk.

##  This calculates the coverage of the Wald C.I. for the relative risk in a 2 x 2 table
##  --------------------------------------------------------------
##                  Lung Cancer     No Lung Cancer
##  Smoke               n11              n12
##  Non-Smoke           n21              n22
##  --------------------------------------------------------------

#       This function returns TRUE or FALSE to indicate whether the C.I.
#       covers the real relative risk or not
RRcover_or_not <- function(p1 = 0.007, p2 = 0.003, n1=5000, n2 = 10000) {
    n11 <- rbinom(1, size = n1, prob = p1)
    n21 <- rbinom(1, size = n2, prob = p2)
    n12 <- n1 - n11
    n22 <- n2 - n21

    p1hat <- n11/n1
    p2hat <- n21/n2
    logrr <- log(p1hat / p2hat)
    sd <- sqrt((1 - p1hat) / (p1hat * n1) + (1 - p2hat) / (p2hat * n2))

    lower <- logrr - sd * qnorm(1 - 0.05/2)
    upper <- logrr + sd * qnorm(1 - 0.05/2)

    real <- log(p1/p2)
    return (real > lower && real < upper)
}

#       This function calculates the coverage percentage
#       by running the above function repeatedly
RRcoverage <- function(n = 1000, ...) {
    temp <- rep(0, n)

    for (i in 1:n) {
        temp[i] <- RRcover_or_not(...)
    }
    res <- sum(temp) / n
    return(res)
}

#       This function calculates the coverage repeatedly and plots its density.
#       It also shades the area above 95% so we can check how good the Wald
#       method is for the C.I. of the log relative risk
RRcoverageplot <- function(rep = 1000, ...) {
    temp <- rep(0, rep)
    for (i in 1:rep) {
        temp[i] <- RRcoverage(...)
    }
    den <- density(temp)
    plot(den, type = "l", col = "red", lwd = 2, main = "Coverage for Wald 95% CI for RR", xlab = "Coverage")
    abline(v = 0.95, lty = 2)

    index <- abs(den$x - 0.95) == min(abs(den$x - 0.95))
    pin <- which.max(index)

    x1 <- den$x[pin: length(den$x)]
    y1 <- rep(0, length(x1))
    x2 <- rev(x1)
    y2 <- rev(den$y[pin:length(den$y)])
    polygon(c(x1, x2), c(y1, y2), col = "skyblue")

    per_over_95 <- sum(temp > 0.95)/rep* 100
    text(0.955, 20, labels = paste(bquote(.(per_over_95)), "%", sep = ""), cex = 1.5)
}

The following is the result for the case where the event is rare (p1 = 0.0007, p2 = 0.003). The Wald CI turns out to be very good: most of the time the actual coverage is above the nominal 95%!

[figure omitted]

New R functions used are

  1. bquote( )
  2. polygon( )

Gibbs Sampler for Bivariate Normal distribution

This is an implementation of the Gibbs sampler for the bivariate normal distribution from Gelman's book Bayesian Data Analysis (third edition), page 277.

Gibbs.bivariate.normal <- function(n = 1e3, burn = 1e2, rho = .8, y1 = 0, y2 = 0) {
    N <- n + burn
    theta1 <- numeric(N+1)
    theta2 <- numeric(N)
    theta1[1] <- 2.5

    for(i in 1:N) {
        theta2[i] <- rnorm(1, mean = y2 + rho *(theta1[i] - y1), sd = sqrt(1 - rho^2))
        theta1[i+1] <- rnorm(1, mean = y1 + rho *(theta2[i] - y2), sd = sqrt(1 - rho^2))
    }

    theta1 <- theta1[(burn+1):N]
    theta2 <- theta2[(burn+1):N]
    plot(theta1, theta2, pch = 19, cex = .3)
}

When rho = 0.7, rho = -0.3, and rho = -0.9, the plots look like the following.
[figures omitted]

The Gibbs sampler is really easy to implement and handy to use!
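
As a quick sanity check (a sketch that assumes the function is modified to end with invisible(cbind(theta1, theta2)) so that the draws are returned), the sample correlation of the draws should be close to rho:

##  Numerical check of the sampler (assumes the function returns the draws
##  invisibly as a two-column matrix, as described above)
set.seed(1)
draws <- Gibbs.bivariate.normal(n = 1e4, rho = 0.8)
cor(draws[, 1], draws[, 2])   # should be close to 0.8
colMeans(draws)               # should be close to c(y1, y2) = c(0, 0)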

SVM and logistic regression in simple case

We compare the performance of SVM and logistic regression on samples generated with the mvrnorm function in R (package MASS). The following is the R code.

library(MASS)    # mvrnorm
library(e1071)   # svm, tune

##  generate data
mu1 <- c(0, 0, 0)
mu2 <- c(2.5,2.5, 2.5)
Sigma1 <- matrix(c(2, 1, 1, 1, 2, 1, 1, 1, 3), nrow = 3)
Sigma2 <- matrix(c(3, 1, 1, 1, 3, 1, 1, 1, 2), nrow = 3)
data1 <- mvrnorm(1000, mu1, Sigma1)
data2 <- mvrnorm(1000, mu2, Sigma2)
y <- rep(c(1, 0), each = 1000)
data <- cbind(rbind(data1, data2),y)
data <- as.data.frame(data)
names(data) <- c("x1", "x2", "x3", "y")

##  glm model
model <- glm(as.factor(y) ~ x1 + x2 + x3, data = data, family = "binomial")
pred <- predict(model, newdata = data, type = "response")
pred <- as.numeric(pred >= .5)
# ggplot(data = data, aes(x = x1, y = x2)) + geom_point(aes(color = as.factor(pred)))


## svm model
tuneResult <- tune(svm, as.factor(y) ~ x1 + x2 + x3, data = data,
                   ranges = list(epsilon = seq(.1, .9, .2), cost = 1:5)
                   )
pred.svm <- predict(tuneResult$best.model, newdata = data)

##  compare two models
xtabs(~y + pred)
xtabs(~y + pred.svm)
error.glm <- sum(y != pred)/nrow(data)
error.svm <- sum(y != pred.svm)/nrow(data)
paste("The error rate of logistic regression is: %", error.glm * 100, sep = "")
paste("The error rate of svm is: %", error.svm * 100, sep = "")

The performance is as follows.

> xtabs(~y + pred)
   pred
y     0   1
  0 847 153
  1 146 854
> xtabs(~y + pred.svm)
   pred.svm
y     0   1
  0 865 135
  1 129 871
> paste("The error rate of logistic regression is: %", error.glm * 100, sep = "")
[1] "The error rate of logistic regression is: %14.95"
> paste("The error rate of svm is: %", error.svm * 100, sep = "")
[1] "The error rate of svm is: %13.2"

Another simulation result.

> xtabs(~y + pred)
   pred
y     0   1
  0 830 170
  1 170 830
> xtabs(~y + pred.svm)
   pred.svm
y     0   1
  0 816 184
  1 124 876
> paste("The error rate of logistic regression is: %", error.glm * 100, sep = "")
[1] "The error rate of logistic regression is: %17"
> paste("The error rate of svm is: %", error.svm * 100, sep = "")
[1] "The error rate of svm is: %15.4"

My finding is that in very simple cases the two methods are almost equivalent. As the data complexity increases, SVM becomes slightly better than logistic regression in many cases, but it also takes much more time to run.
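
To back up the remark about running time, here is a minimal timing sketch on the simulated data above (with MASS and e1071 loaded as before; the exact numbers will vary by machine):

##  Rough timing comparison on the 2000-row data set built above
system.time(glm(as.factor(y) ~ x1 + x2 + x3, data = data, family = "binomial"))
system.time(svm(as.factor(y) ~ x1 + x2 + x3, data = data))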

Cochran Mantel Haenszel Test in R

The important thing is to enter the data in the right form: an array of dimension 2 x 2 x q, where q is the number of strata. The following code is an example of the Cochran–Mantel–Haenszel test for repeated tests of independence.

data <- array(c(708, 50, 169, 13,
        136, 24, 73, 14,
        106, 32, 17, 4,
        109, 22, 16, 26,
        801, 102, 180, 25,
        159, 27, 18, 13,
        151, 51, 28, 15,
        950, 173, 218, 33),
        dim = c(2, 2, 8),
        dimnames = 
        list(hand = c("right", "left"),
              hair = c("cw", "ccw"),
              strata = c("white_children", "British_adult", "Penn_white", "Welsh_men", 
                         "German_soldiers", "German_children", "New_York", "American_men"))
)
mantelhaen.test(data)

The result is as follows.

Mantel-Haenszel chi-squared test with continuity correction

data:  data
Mantel-Haenszel X-squared = 6.0702, df = 1, p-value = 0.01375
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
 1.061250 1.608976
sample estimates:
common odds ratio 
         1.306723 

The p-value is 0.01375, showing that the two variables (hand and hair) are not independent.
The chi-squared statistic is 6.07.

  • Remember that for the CMH test, the degrees of freedom is always ONE.
  • If the p-value is significant, we can reject the null hypothesis that the two variables are independent. But if the p-value is non-significant, we cannot conclude that the null is true; what we should do then is analyze the data stratum by stratum, looking at each 2 by 2 table.
  • There is an output named "common odds ratio", which is meaningful only when the assumption that the odds ratio is the same across strata holds. How can we tell whether the association is the same in each table? The answer is the Breslow-Day test; a sketch is given below.
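
A minimal sketch of that check (this assumes the DescTools package is available; as far as I recall it provides BreslowDayTest(), which takes the same 2 x 2 x q array used above):

##  Breslow-Day test for homogeneity of the odds ratio across strata
# install.packages("DescTools")
library(DescTools)
BreslowDayTest(data)   # a small p-value casts doubt on the common-odds-ratio assumption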

Cauchy distribution doesn't have mean

The mean of the Cauchy distribution doesn't exist, which can be shown by the following experiment.
Create a random vector x1, x2, ..., x1000 from the Cauchy distribution, compute the running means Yi = (x1 + ... + xi)/i, and plot the sequence of Yi. This gives a quick look at how Yi behaves as i increases from 1 to 1000. The following is the R code.

crazyCauchy <- function() {
    rv <- rcauchy(1e3)
    x <- numeric(1e3)
    for(i in 1:1e3) {
        x[i] <- mean(rv[1:i])
    }
    plot(x, pch = 19, cex = 0.5)
}

Running the function a few times, the plots look like the following.
[figures omitted]
There is no trend of convergence.
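
For contrast (a small sketch of my own), the same running-mean plot for a standard normal sample settles down quickly, as the law of large numbers predicts; cumsum() makes the running mean a one-liner:

##  Running mean of a normal sample, for comparison with the Cauchy case
set.seed(1)
rv <- rnorm(1e3)
x <- cumsum(rv) / seq_along(rv)
plot(x, pch = 19, cex = 0.5)
abline(h = 0, col = "red", lty = 2)   # the true mean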

Use Monte Carlo to calculate the normal CDF

If we want to calculate int_0^1 (1/sqrt(2*pi)) * exp(-x^2/2) dx, there is no closed-form antiderivative to look up in an integral table. However, we can use Monte Carlo simulation, based on the law of large numbers. The following is the R code.

MCnormal <- function(n) {
     u <- runif(n)                    # uniform draws on (0, 1)
     est <- mean(dnorm(u))            # Monte Carlo estimate of the integral
     real <- pnorm(1) - pnorm(0)
     diff <- abs(real - est) / real   # relative error
     return(list(simulated = est, real = real, diff = diff))
}

The larger the n, the more precise the estimate.

> MCnormal(100)
$simulated
[1] 0.3388052

$real
[1] 0.3413447

$diff
[1] 0.007439889

> MCnormal(1000)
$simulated
[1] 0.3418248

$real
[1] 0.3413447

$diff
[1] 0.00140629

> MCnormal(1000000)
$simulated
[1] 0.3413338

$real
[1] 0.3413447

$diff
[1] 3.214687e-05

> MCnormal(1000000)
$simulated
[1] 0.3413239

$real
[1] 0.3413447

$diff
[1] 6.112997e-05

When n reaches 1e6, the relative difference is only about 0.006%, which is negligible.
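
The error shrinks at the usual Monte Carlo rate of 1/sqrt(n). A small sketch (my own addition) attaches a standard error to the estimate so the precision can be read off directly:

##  Monte Carlo estimate with a standard error (sd of the draws / sqrt(n))
MCnormal.se <- function(n) {
    b <- dnorm(runif(n))
    c(estimate = mean(b), std.error = sd(b) / sqrt(n))
}
MCnormal.se(1e6)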

MLE for gamma distribution using Bisection method

This is a function I wrote to test the bisection method for computing the MLE of the gamma distribution. After loading the function, we can try an example in R.

> set.seed(101)
> data <- rgamma(500, shape = 4, rate = 2)
> mle.gamma(data)
$hat_alpha
[1] 3.800928

$hat_lambda
[1] 1.88312

We can see that the estimates are reasonable. The same approach can be used for any other estimation problem with no closed-form solution. The following is the code for the function mle.gamma().

mle.gamma <- function(X, tol = 1e-6) {
    # X should be a random sample from a gamma distribution
    n <- length(X)
    # profile score in the shape parameter: we want the root of f(x) = 0
    f <- function(x) n * log(x / mean(X)) + sum(log(X)) - n * digamma(x)

    # sign of a scalar (base R's sign() would also work)
    mysign <- function(x) {
        if (x > 0) return(1)
        if (x < 0) return(-1)
        return(0)
    }

    # bisection on the bracket [1, 10]; the root is assumed to lie in this interval
    x.left <- 1
    x.right <- 10
    repeat {
        temp <- (x.left + x.right) / 2
        if (abs(f(temp)) < tol) {
            x.final <- temp
            break
        } else {
            if (mysign(f(temp)) == mysign(f(x.right))) x.right <- temp
            else x.left <- temp
        }
    }
    hat_alpha <- x.final
    hat_lambda <- hat_alpha / mean(X)
    list(hat_alpha = hat_alpha, hat_lambda = hat_lambda)
}
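
As a cross-check (a sketch of my own), base R's uniroot() performs the same bracketed root-finding and should agree with the bisection result:

##  Cross-check the bisection result with uniroot() on the same bracket [1, 10]
score <- function(x, X) {
    n <- length(X)
    n * log(x / mean(X)) + sum(log(X)) - n * digamma(x)
}
set.seed(101)
data <- rgamma(500, shape = 4, rate = 2)
hat_alpha  <- uniroot(score, interval = c(1, 10), X = data)$root
hat_lambda <- hat_alpha / mean(data)
c(hat_alpha = hat_alpha, hat_lambda = hat_lambda)   # should match mle.gamma(data)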

Weighted Bootstrapping in obtaining posterior distribution

An example for a binomial(n, p) likelihood with a beta(a, b) prior.

wb.bin <- function(n, x, a = 1, b = 1, simulation = 5e3, k = 1e3) {
    random.prior <- rbeta(simulation, a, b)
    likelihood <- dbinom(x, n, random.prior)
    random.posterior <- sample(random.prior, k, replace = TRUE, prob = likelihood)
    actual <- (a + x)/(a + b + n)
    simulated <- mean(random.posterior)
    return(list(actual = actual, simulated = simulated))
}

When running this function we get

> wb.bin(20, 3)
$actual
[1] 0.1818182

$simulated
[1] 0.1844998

> wb.bin(12, 4)
$actual
[1] 0.3571429

$simulated
[1] 0.3570672

This method is pretty easy to understand and to implement. Thus it is great!
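
Beyond the posterior mean, the whole weighted-bootstrap sample can be compared with the exact Beta(a + x, b + n - x) posterior. The following standalone sketch (my own addition) repeats the resampling step outside the function, since wb.bin() only returns the means:

##  Compare the weighted-bootstrap posterior sample with the exact posterior density
set.seed(1)
n <- 20; x <- 3; a <- 1; b <- 1
random.prior     <- rbeta(5e3, a, b)
w                <- dbinom(x, n, random.prior)
random.posterior <- sample(random.prior, 1e3, replace = TRUE, prob = w)

plot(density(random.posterior), main = "Weighted bootstrap vs exact Beta posterior")
p <- seq(0, 1, length = 200)
lines(p, dbeta(p, a + x, b + n - x), col = "red")
legend("topright", legend = c("weighted bootstrap", "exact posterior"),
       col = c("black", "red"), lty = 1)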

X1, ..., Xn ~ Gumbel(0, theta): the distribution of max(X1, ..., Xn) - theta*log(n)

Purpose: use simulation to verify that max(X1, ..., Xn) - theta*log(n) follows the Gumbel(0, theta) distribution.
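
In fact the result holds exactly, not just asymptotically. A one-line sketch of the calculation (my own addition), using the Gumbel(0, theta) CDF F(x) = exp(-exp(-x/theta)):

P\bigl(\max_i X_i - \theta \log n \le z\bigr)
  = F(z + \theta \log n)^n
  = \exp\!\bigl(-n\, e^{-(z + \theta \log n)/\theta}\bigr)
  = \exp\!\bigl(-e^{-z/\theta}\bigr),

which is again the Gumbel(0, theta) CDF.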

## mu = 0, sigma = 1
## rgumbel() and dgumbel() are not in base R; load a package that provides them
library(evd)
a <- expression(
n <- 50000,
x <- rgumbel(n, 0, 1),
xmax <- max(x - log(n))
)


res <- replicate(5000, eval(a))

plot(density(res))

xseq <- seq(-2, 6, by = 0.01)
yseq <- dgumbel(xseq, 0, 1)
lines(xseq, yseq, col = "red")

[figure omitted]

Wilcoxon Signed-Rank test Normal approximation

We can use the Wilcoxon signed-rank test when the paired t-test cannot be used for paired samples. When the number of pairs N is large, we can use a normal approximation. The following R simulation shows that when N >= 15 the approximation is already quite good.

rankdist <- function(n = 3, k = 1e3) {
    x <- 1:n
    res <- numeric(k)
    for (i in 1:k) {
        sign <- sample(c(1, -1), size = n, replace = TRUE)
        y <- x * sign
        res[i] <- sum(y)
    }

    t <- max(density(res)$y)
    hist(res, freq = FALSE, ylim = c(0, t * 1.1), main = "Histogram of the Signed-Rank Statistic", xlab = "signed-rank sum")
    mu <- 0
    sd <- sqrt(n * (n + 1) * (2 * n + 1) / 6)
    xseq <- seq(-120, 120, length = 1000)
    yseq <- dnorm(xseq, mu, sd)
    lines(xseq, yseq, type = "l", lwd = 2, col = "red")
}

[figures omitted: N = 5 (top), N = 10 (middle), N = 15 (bottom)]
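
Base R also provides the exact null distribution through dsignrank(), which works with V, the sum of the positive ranks; the signed-rank sum used above is W = 2V - n(n+1)/2. A small sketch (my own addition) overlays the exact probabilities and the normal approximation:

##  Exact distribution of the signed-rank sum W versus the normal approximation
n <- 15
w <- seq(-n * (n + 1) / 2, n * (n + 1) / 2, by = 2)   # possible values of W (steps of 2)
v <- (w + n * (n + 1) / 2) / 2                        # corresponding V = sum of positive ranks
plot(w, dsignrank(v, n), type = "h", xlab = "signed-rank sum W", ylab = "probability")
##  W moves in steps of 2, so the normal approximation to P(W = w) is about 2 * dnorm(w)
lines(w, 2 * dnorm(w, 0, sqrt(n * (n + 1) * (2 * n + 1) / 6)), col = "red", lwd = 2)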

A funny function to redefine "["

This function redefines "[" so that when the index goes beyond the length of the vector, it simply wraps around and starts again from the beginning.

`[` <- function(x, k) {
    n <- length(x)
    k <- k %% n             # wrap the index around
    if(k == 0) k <- n
    subset(x, 1:n == k)     # subset() still uses base R's `[` internally
}

With x <- c(3, 2, 4, 5, 1), indexing now wraps around:

> x
[1] 3 2 4 5 1
> x[1]
[1] 3
> x[2]
[1] 2
> x[3]
[1] 4
> x[4]
[1] 5
> x[5]
[1] 1
> x[6]
[1] 3
> x[7]
[1] 2
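
Since this masks base R's "[" for everything evaluated in the global environment, it is worth removing the definition once the fun is over:

rm("[")   # restore normal subsetting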

First JAGS model in R

JAGS can perform MCMC, and the R package rjags gives access to JAGS from R. The process is pretty easy: first write the model in a separate file (the file can have a .R extension) and save it, then invoke the model with functions from the rjags package. The following is a simple example.
Step 1. Save the following code in a file named "myFirstJags.R".

model {
  for(i in 1:J) {
    y[i] ~ dbin(theta[i], 100)
    theta[i] ~ dbeta(alpha, beta)
  }
  alpha ~ dgamma(1, 1)
  beta ~ dgamma(1, 1)
}

Step 2. Run the following code.

library(rjags)   # provides jags.model() and coda.samples()

set.seed(1001)
theta <- rbeta(10, 1, 5)
set.seed(1002)
y <- rbinom(10, 100, theta)

jagresult <- jags.model("myFirstJags.R", data = list("y" = y, "J" = 10), n.adapt = 1000)
thesamps <- coda.samples(jagresult, c("alpha", "beta"), n.iter = 10000, thin = 1, progress.bar = "gui")

We can use the plot function to get the graphs of alpha and beta.

plot(thesamps)

We will get the following plot.
[figure omitted]

But sadly, as far as I know, JAGS cannot use improper priors; a common workaround is a vague but proper prior, such as dunif(0, 1e4) or dgamma(0.001, 0.001), in place of the improper one.

Use Normal to approximate Poisson when lambda is large

When lambda is large, we can approximate the Poisson(lambda) distribution by Normal(lambda, lambda) (this can be proved with the continuity theorem for characteristic functions). The following is the R code and output.

Pois2Norm <- function(lambda = 100, n=1000) {
    x <- rpois(n, lambda)
    minx <- min(x)
    maxx <- max(x)
    plot(density(x), main = "Use Normal to approximate Poisson when lambda is large", xlab = "X")
    xnorm <- seq(from = minx, to = maxx, length = n)
    lines(xnorm, dnorm(xnorm, mean = lambda, sd = sqrt(lambda)), col = "red")
    x.pos <- minx + 0.01*(maxx - minx)
    miny <- min(density(x)$y)
    maxy <- max(density(x)$y)
    y.pos <- maxy - 0.01*(maxy - miny)
    legend(x.pos, y.pos, lty = c(1, 1), col  = c("black", "red"), legend = c("Poisson", "Normal"))
}

[figures omitted: lambda = 10, lambda = 100, lambda = 1000]
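
A quick numerical check of the approximation (a sketch of my own, using the usual continuity correction):

##  Compare exact Poisson CDF values with the normal approximation at lambda = 100
lambda <- 100
k <- c(80, 90, 100, 110, 120)
exact  <- ppois(k, lambda)
approx <- pnorm(k + 0.5, mean = lambda, sd = sqrt(lambda))
round(cbind(k, exact, approx, abs.error = abs(exact - approx)), 4)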
