garybaylor / r-code
A collection of algorithms I wrote for solving statistical problems.
We compare the density of random samples drawn with the MH algorithm against samples drawn with the base R function rbeta. The code is as follows.
mh.beta <- function(a = 2.7, b = 6.3, N = 1e4) {
  X <- numeric(N)
  X[1] <- runif(1)
  for (i in 2:N) {
    y <- runif(1)                                        # independent Unif(0, 1) proposal
    rho <- dbeta(y, a, b) / dbeta(X[i-1], a, b)
    X[i] <- X[i-1] + (y - X[i-1]) * (runif(1) < rho)     # accept y with prob min(1, rho)
  }
  plot(density(X), type = "l", main = "MH simulation & Direct simulation Comparison", xlab = "")
  lines(density(rbeta(N, a, b)), type = "l", col = "red")
  legend(0.65, 2.65, legend = c("MH simulation", "Direct simulation"), lty = c(1, 1), col = c("black", "red"))
}
The plot when N = 1e5 is like the following.
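As a quick numerical check (my addition, a sketch), we can run the same sampler logic, thin the chain to reduce autocorrelation, and compare it with rbeta draws using a two-sample Kolmogorov-Smirnov test; ties from repeated (rejected) states may trigger a warning.
set.seed(42)
a <- 2.7; b <- 6.3; N <- 1e4
X <- numeric(N); X[1] <- runif(1)
for (i in 2:N) {
  y <- runif(1)
  rho <- dbeta(y, a, b) / dbeta(X[i-1], a, b)
  X[i] <- X[i-1] + (y - X[i-1]) * (runif(1) < rho)
}
idx <- seq(100, N, by = 10)              # drop early draws, keep every 10th
ks.test(X[idx], rbeta(length(idx), a, b))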
A random walk is a non-convergent Markov chain. The following simulation shows that it does not converge.
random.walk <- function(t = 1e4) {
  res <- numeric(t + 1)
  res[1] <- rnorm(1)   # R vectors are 1-indexed; res[0] would silently do nothing
  for (i in 1:t) {
    res[i + 1] <- res[i] + rnorm(1)
  }
  plot(res, type = "l", lwd = 0.5)
}
The plot looks like the following graph.
Each time the random walk function is run, the plot is different.
If we change the code slightly, the Markov chain becomes convergent (it is then an AR(1) process with |rho| < 1).
Markov.chain <- function(t = 1e4, rho = 0.9) {
  res <- numeric(t + 1)
  res[1] <- rnorm(1)   # again, indexing starts at 1
  for (i in 1:t) {
    res[i + 1] <- rho * res[i] + rnorm(1)
  }
  par(mfrow = c(1, 2))
  plot(res, type = "l", lwd = 0.5)
  plot(density(res), type = "l")
}
The plot looks like this.
Each time the function is run, the density is almost the same.
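In fact the stationary distribution of this chain is known in closed form: for the AR(1) update above it is N(0, 1/(1 - rho^2)). A sketch (my addition) overlaying it on a simulated density:
rho <- 0.9; t <- 1e4
res <- numeric(t + 1)
res[1] <- rnorm(1)
for (i in 1:t) res[i + 1] <- rho * res[i] + rnorm(1)
plot(density(res), main = "AR(1) draws vs. theoretical stationary density")
xs <- seq(-10, 10, length = 200)
lines(xs, dnorm(xs, 0, sqrt(1 / (1 - rho^2))), col = "red")   # stationary sd = 1/sqrt(1 - rho^2)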
An introduction to using SVM in R (two-class classification only).
## This code is to study SVM using package 'e1071'
## ===== Create data ======================
# in order to perform classification rather than regression,
# we need to change y to factors
set.seed(1)
x <- matrix(rnorm(20 * 2), ncol = 2)
y <- c(rep(-1, 10), rep(1, 10))
x[y == 1, ] <- x[y == 1, ] + 1
plot(x, col = (3 - y))
data <- data.frame(x = x, y = as.factor(y))
## ===== fitting model =====================
# the lower the cost, the more support vectors we have
library(e1071)
svmfit1 <- svm(y ~ ., data = data, kernel = "linear", cost = 10, scale = FALSE)
svmfit2 <- svm(y ~ ., data = data, kernel = "linear", cost = 1, scale = FALSE)
svmfit3 <- svm(y ~ ., data = data, kernel = "linear", cost = .1, scale = FALSE)
# the support vectors are denoted by 'x', the others by 'o'
plot(svmfit1, data) # when cost = 10, there are 7 support vectors
plot(svmfit2, data) # when cost = 1, there are 10 support vectors
plot(svmfit3, data) # when cost = .1, there are 16 support vectors
## ===== Tuning =============================
# tune() performs cross-validation for svm
set.seed(1)
tune.out <- tune(svm, y ~ ., data = data, kernel = "linear",
                 ranges = list(cost = c(.001, .01, .1, 1, 5, 10, 100)))
# we find the best cost is .1
summary(tune.out)
# tune() stores the best model for us
bestmod <- tune.out$best.model
## =====prediction ==========================
xtest <- matrix(rnorm(20 * 2), ncol = 2)
ytest <- sample(c(1, -1), 20, replace = TRUE)
xtest[ytest == 1, ] <- xtest[ytest == 1, ] + 1
test <- data.frame(x = xtest, y = as.factor(ytest))
pred <- predict(bestmod, test)
xtabs(~ pred + ytest)
## ===== other types of kernel: radial =============
# we need to set kernel = "radial", and set the value of gamma
# we still need to set cost value
set.seed(1)
x <- matrix(rnorm(200 * 2), ncol = 2)
x[1:100, ] <- x[1:100, ] + 2
x[101:150, ] <- x[101:150, ] - 2
y <- c(rep(1, 150), rep(2, 50))
data <- data.frame(x = x, y = as.factor(y))
plot(x, col = y)
# setting the cost = 1, we found there are 7 observations that are misclassified
train <- sample(200, 100)
svmfit <- svm(y ~., data = data[train, ], kernel = "radial", gamma = 1, cost = 1)
plot(svmfit, data[train, ])
pred <- predict(svmfit, data[train, ])
table(pred = pred, true = y[train])
# increasing the cost fits the data more closely, but tends to cause overfitting
# setting cost = 1e5, we find only 1 observation is misclassified,
# and the decision boundary becomes more irregular
quartz()   # opens a new plot window on macOS; use windows() or x11() on other systems
svmfit <- svm(y ~., data = data[train, ], kernel = "radial", gamma = 1, cost = 1e5)
plot(svmfit, data[train, ])
pred <- predict(svmfit, data[train, ])
table(pred = pred, true = y[train])
## we need cross-validation (tuning)
tune.out <- tune(svm, y ~., data = data[train, ], kernel = "radial",
ranges = list(cost = seq(.1, 2, by = 0.1),
gamma = seq(.05, .5, by = .05)))
# we find 13% of the test data are misclassified
pred <- predict(tune.out$best.model, data[-train, ])
table(pred = pred, true = y[-train])
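The quoted misclassification rate can also be computed directly from the predictions; a one-line sketch (my addition):
mean(pred != y[-train])   # proportion of misclassified test observations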
The purpose is to run a simulation to see the performance of the delta method.
Let X1, ..., Xn be i.i.d. Poisson(theta) and let Yi = I(Xi = 0). Find the asymptotic distribution of 2*mean(X) + log(mean(Y)).
The solution by the CLT and the multivariate delta method is N(theta, exp(theta)/n): with g(x, y) = 2x + log(y), the gradient at (theta, exp(-theta)) is (2, exp(theta)), and combining it with the covariance matrix of (X, Y) gives an asymptotic variance of (exp(theta) - 1)/n, which is indistinguishable from exp(theta)/n at theta = 5. The following is the R simulation code. We show that when n is large, the approximation is good.
a <- expression(
  n <- 5e4,      # we set n fairly large; when n is small, the performance is not good
  theta <- 5,
  x <- rpois(n, theta),
  y <- x == 0,
  Z <- 2 * mean(x) + log(mean(y))
)
sim <- replicate(1e4, eval(a))   # eval() runs each expression in turn and returns the last value, Z
xlow <- theta - 3 * sqrt(exp(theta) / n)
xhigh <- theta + 3 * sqrt(exp(theta) / n)
xseq <- seq(xlow, xhigh, length = 100)
plot(density(sim))
lines(xseq, dnorm(xseq, theta, sqrt(exp(theta)/n)), col = "red")
I tried to find a package for the FP (Fligner-Policello) test but found nothing, so I wrote a small, crude function that takes two vectors as input and returns a p-value.
FP.test <- function(x, y) {
  n <- length(x); m <- length(y)
  Pl.x <- numeric(n); Pl.y <- numeric(m)
  # placements: for each x[i], count how many y's fall below it (ties count 1/2)
  for (i in 1:n) {
    Pl.x[i] <- sum(y < x[i]) + 1/2 * sum(y == x[i])
  }
  for (i in 1:m) {
    Pl.y[i] <- sum(x < y[i]) + 1/2 * sum(x == y[i])
  }
  Plxmean <- mean(Pl.x); Plymean <- mean(Pl.y)
  Vx <- (n - 1) * var(Pl.x); Vy <- (m - 1) * var(Pl.y)   # sums of squared deviations
  z <- (sum(Pl.y) - sum(Pl.x)) / (2 * sqrt(Vx + Vy + Plxmean * Plymean))
  p <- pnorm(abs(z), lower.tail = FALSE) * 2             # two-sided normal p-value
  return(list(p.value = p))
}
We can try some examples.
> set.seed(100)
> x <- rnorm(20, mean = 0, sd = 1)
> set.seed(200)
> y <- rnorm(30, mean = 0, sd = 1)
> set.seed(201)
> y1 <- rnorm(30, mean = 1, sd = 1)
> set.seed(202)
> y2 <- rnorm(30, mean = 2, sd = 1)
> set.seed(203)
> y3 <- rnorm(30, mean = 3, sd = 1)
> FP.test(x,y)
$p.value
[1] 0.7070176
> FP.test(x,y1)
$p.value
[1] 5.794937e-05
> FP.test(x,y2)
$p.value
[1] 2.43404e-46
> FP.test(x,y3)
$p.value
[1] 1.124729e-250
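As a rough sanity check (my addition), the p-values are in the same ballpark as the Wilcoxon rank-sum test when the two samples have equal variances:
wilcox.test(x, y)$p.value     # compare with FP.test(x, y)$p.value above
wilcox.test(x, y1)$p.value    # compare with FP.test(x, y1)$p.value above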
We will write a program in R to estimate the coefficients of a polynomial regression. The data we use is cars in R; the response is dist and the predictor is speed. We will fit a polynomial regression of degree two.
> head(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
Using the frequentist method, we can fit the model with lm in R; the following produces the graph.
y <- cars$dist
x1 <- cars$speed
x2 <- (cars$speed)^2
mod <- lm(y ~ x1 + x2)
xseq <- seq(min(x1), max(x1), length = 100)
pred <- predict(mod, data.frame(x1 = xseq, x2 = xseq^2))   # predict on an evenly spaced grid
plot(x1, y, xlab = "speed", ylab = "dist")
lines(xseq, pred, col = "red", lwd = 2)
## MH algorithm
# define likelihood
likelihood <- function(param){
  # X is the design matrix with first column all 1s (defined before running MH; see below)
  params <- matrix(param[-4])
  pred <- as.vector(X %*% params)
  singlelikelihoods <- dnorm(y, mean = pred, sd = param[4], log = T)
  sumll <- sum(singlelikelihoods)
  return(sumll)
}
# define prior
prior <- function(param){
  b0 <- param[1]
  b1 <- param[2]
  b2 <- param[3]
  sd <- param[4]
  aprior <- dnorm(b0, mean = 0, sd = 10, log = T)
  bprior <- dnorm(b1, mean = 0, sd = 5, log = T)
  cprior <- dnorm(b2, mean = 0, sd = 1, log = T)
  sdprior <- dunif(sd, min = 0, max = 1000, log = T)
  return(aprior + bprior + cprior + sdprior)
}
# define posterior
posterior <- function(param){
  return(likelihood(param) + prior(param))
}
# define proposal function
proposalfunction <- function(param, theta = c(0.5, 0.5, 0.3, 1)){
  return(runif(4, min = param - theta, max = param + theta))   # symmetric uniform proposal
}
## MH
MH <- function(startvalue, N = 1e5, burn = N/10){
  chain <- array(dim = c(N, 4))
  chain[1, ] <- startvalue
  for (i in 2:N){
    proposal <- proposalfunction(chain[i-1, ])
    rho <- exp(posterior(proposal) - posterior(chain[i-1, ]))   # MH acceptance ratio
    if (runif(1) < rho){
      chain[i, ] <- proposal
    } else {
      chain[i, ] <- chain[i-1, ]
    }
  }
  chain <- chain[-(1:burn), ]
  par(mfrow = c(2, 2))
  # plot(density(chain[, 1]), type = "l", main = "beta0")
  # plot(density(chain[, 2]), type = "l", main = "beta1")
  # plot(density(chain[, 3]), type = "l", main = "beta2")
  # plot(density(chain[, 4]), type = "l", main = "sigma")
  hist(chain[, 1], freq = FALSE, breaks = 50, main = "beta0")
  hist(chain[, 2], freq = FALSE, breaks = 50, main = "beta1")
  hist(chain[, 3], freq = FALSE, breaks = 50, main = "beta2")
  hist(chain[, 4], freq = FALSE, breaks = 50, main = "sigma")   # param[4] is the sd, not sigma^2
  invisible(chain)
}
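For completeness, here is a sketch (my addition) of how the pieces fit together on the cars data; the start value is arbitrary, and X is the design matrix that likelihood() expects:
y <- cars$dist
x1 <- cars$speed
x2 <- x1^2
X <- cbind(1, x1, x2)                        # design matrix with intercept column
chain <- MH(startvalue = c(0, 0, 0, 10), N = 5e5)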
Next we show a simulation result with 5e5 iterations. The posterior distribution of each parameter looks like the following graph.
We can compare this with the model built with lm. (Note that x2^2 inside a formula would not square x2; I(x2^2) would be needed, and here x2 is already speed squared.)
> mod <- lm(y ~ x1 + x2)
> summary(mod)
Call:
lm(formula = y ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-28.720 -9.184 -3.188 4.628 45.152
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.47014 14.81716 0.167 0.868
x1 0.91329 2.03422 0.449 0.656
x2 0.09996 0.06597 1.515 0.136
Residual standard error: 15.18 on 47 degrees of freedom
Multiple R-squared: 0.6673, Adjusted R-squared: 0.6532
F-statistic: 47.14 on 2 and 47 DF, p-value: 5.852e-12
The posterior distribution of beta0 is pretty unstable, while the others are stable.
We see that the MH algorithm is not bad.
The discussion above is just a draft idea, without careful finalization.
The Gibbs sampler is used a lot in Bayesian inference. The following R function outputs the means and standard deviations of lambda1 through lambda10 and beta.
Gibbs.pump <- function(alpha = 1.8, r = 0.01, delta = 1, simulation = 1000, burn = 100) {
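  ## model (as I read the conditionals; my addition): y[j] ~ Poisson(lambda[j] * t[j]),
  ## lambda[j] ~ Gamma(alpha, beta), beta ~ Gamma(r, delta), so that
  ## lambda[j] | rest ~ Gamma(y[j] + alpha, t[j] + beta) and
  ## beta | rest ~ Gamma(10 * alpha + r, delta + sum(lambda))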
  ## data
  y <- c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22)
  t <- c(94, 16, 63, 126, 5, 31, 1, 1, 2, 10)
  lambda <- numeric(10)
  ## initial values
  beta <- 0.01
  ## burn-in session
  for (i in 1:(burn + 1)) {
    for (j in 1:10) {
      lambda[j] <- rgamma(1, y[j] + alpha, t[j] + beta)
    }
    beta <- rgamma(1, 10 * alpha + r, delta + sum(lambda))
  }
  ## Markov chain session
  lambda2 <- matrix(NA, nrow = simulation, ncol = 10, byrow = TRUE)
  lambda2[1, ] <- lambda
  beta2 <- numeric(simulation)
  beta2[1] <- beta
  for (i in 2:simulation) {
    beta2[i] <- rgamma(1, 10 * alpha + r, delta + sum(lambda2[i-1, ]))
    for (j in 1:10) {
      lambda2[i, j] <- rgamma(1, y[j] + alpha, t[j] + beta2[i])
    }
  }
  ## returned values
  lambda.mean <- apply(lambda2, 2, mean)
  lambda.sd <- apply(lambda2, 2, sd)
  lambda.dataframe <- data.frame(mean = lambda.mean, sd = lambda.sd)
  beta.mean <- mean(beta2)
  beta.sd <- sd(beta2)
  beta.dataframe <- data.frame(mean = beta.mean, sd = beta.sd)
  res <- list(lambda = lambda.dataframe, beta = beta.dataframe)
  return(res)
}
We can run the function and specify the burn-in number and simulation times.
> Gibbs.pump(burn = 1000, simulation = 1e4)
$lambda
mean sd
1 0.07051077 0.02698162
2 0.15207612 0.09187032
3 0.10421768 0.03991478
4 0.12289529 0.03097761
5 0.64916456 0.30120514
6 0.62272301 0.13671542
7 0.85414118 0.54445976
8 0.85574205 0.54394850
9 1.35889842 0.60607078
10 1.92709361 0.40330237
$beta
mean sd
1 2.406072 0.6915125
Bootstrapping is a method to estimate the distribution of a statistic by resampling. It is useful when the distribution of a complex statistic is difficult to obtain in other ways.
There are two bootstrapping methods: parametric and non-parametric.
For an i.i.d. sample X1, X2, ..., Xn, how well bootstrapping works is influenced by
(1) the method of bootstrapping: non-parametric or parametric;
(2) the distribution of the sample (kurtosis, variance, etc.);
(3) the distribution of the hypothesised population (kurtosis, variance, etc.).
The following is an R program I wrote to compare the parametric and non-parametric bootstrap in learning the mean and variance of a sample from the standard normal distribution. An example is as follows.
> bootstrap(n=50, seed = 2015)
While some examples seem to show that this method is pretty good, other examples may not be so lucky.
> bootstrap(n=50, seed = 1)
The original R code is as follows.
bootstrap <- function(B = 1000, n = 20, mu = 0, sigma = 1, seed = round(runif(1, -1e4, 1e4))) {
  ## initial sample
  set.seed(seed)
  X <- rnorm(n, mu, sigma)
  ## parametric bootstrap
  # use MLE to estimate the parameters of the population
  meanX <- mean(X)
  varX <- var(X) * (n-1)/n
  # generate bootstrap samples
  meanx <- varx <- numeric(B)
  for (i in 1:B) {
    x <- rnorm(n, meanX, sqrt(varX))   # rnorm takes the sd, not the variance
    meanx[i] <- mean(x)
    varx[i] <- var(x)
  }
  # plot true and simulated distribution of the mean
  par(mfrow = c(2, 2))
  xlower <- mu - 3.5 * sigma / sqrt(n)
  xupper <- mu + 3.5 * sigma / sqrt(n)
  xi <- seq(xlower, xupper, length = 200)
  yi <- dnorm(xi, mu, sigma / sqrt(n))
  plot(xi, yi, type = "l", ylim = c(0, max(yi, density(meanx)$y)), xlab = "mean", ylab = "density", main = "Parametric bootstrap for \n mean")
  lines(density(meanx), type = "l", col = "red")
  # plot true and simulated distribution of the variance
  vlower <- qgamma(0.025, shape = (n-1)/2, rate = n / (2 * sigma^2))
  vupper <- qgamma(0.975, shape = (n-1)/2, rate = n / (2 * sigma^2))
  vi <- seq(min(quantile(varx, .025), vlower), max(vupper, quantile(varx, .975)), length = 200)
  wi <- dgamma(vi, shape = (n-1)/2, rate = n / (2 * sigma^2))
  plot(vi, wi, type = "l", ylim = c(0, max(wi, density(varx)$y)), xlab = "variance", ylab = "density", main = "Parametric bootstrap for \n variance")
  lines(density(varx), type = "l", col = "blue")
  ## non-parametric bootstrap
  meanx.n <- varx.n <- numeric(B)
  for (i in 1:B) {
    x <- sample(X, replace = TRUE)
    meanx.n[i] <- mean(x)
    varx.n[i] <- var(x)
  }
  plot(xi, yi, type = "l", ylim = c(0, max(yi, density(meanx.n)$y)), xlab = "mean", ylab = "density", main = "Non-parametric bootstrap for \n mean")
  lines(density(meanx.n), type = "l", col = "red")
  plot(vi, wi, type = "l", ylim = c(0, max(wi, density(varx.n)$y)), xlab = "variance", ylab = "density", main = "Non-parametric bootstrap for \n variance")
  lines(density(varx.n), type = "l", col = "blue")
}
From the many examples I have tried, it seems that the parametric and non-parametric methods have almost the same effect on the distributions of the mean and variance of normal data. However, this is only the result for the normal distribution, which is symmetric; skewed distributions may give different results.
The more I code in R, the more amazing I find it. Sometimes an idea comes to mind and I wonder whether I can realize it in R. I am not sure whether I will succeed; I just have a picture in my mind. Then, after an hour or much more of coding, the great moment comes: R gives me exactly the output I want, and even more. So how can I not love R?
The following is a tiny piece of code I wrote this afternoon to visualize the coverage of the confidence interval of the MLE. The plot below uses n = 20.
The code is attached.
coverage <- function(repetition = 20, n = 11, mu = 10, sigma = 3) {
  lower <- upper <- include <- numeric(repetition)
  for (i in 1:repetition) {
    x <- rnorm(n, mu, sigma)
    meanx <- mean(x)
    sigma2x <- sum((x - meanx)^2) / n   # MLE of sigma^2 (not used below)
    lower[i] <- meanx - sd(x) * qt(1 - 0.05/2, n-1) / sqrt(n)
    upper[i] <- meanx + sd(x) * qt(1 - 0.05/2, n-1) / sqrt(n)
    include[i] <- as.numeric(lower[i] <= mu & upper[i] >= mu)   # does the interval cover mu?
  }
  maxx <- max(upper)
  minx <- min(lower)
  plot(0, 0, xlim = c(0, repetition + 1), ylim = c(minx, maxx), type = "n", xlab = "repetition", ylab = "Coverage", main = "Confidence Interval Coverage")
  for (i in 1:repetition) {
    xi <- rep(i, 10)
    yi <- seq(from = lower[i], to = upper[i], length = 10)
    lines(xi, yi, lty = 1, lwd = 1.5, col = ifelse(include[i] == 0, "red", "black"))
  }
  return(list(total = repetition, covered = sum(include), coverage = sum(include) / repetition))
}
We want to compare F distributions with different degrees of freedom. The following is the plot.
The code is as follows.
x <- seq(0, 3, by = 0.01)
## df1 = 1, df2 = 1, 2, 3, 5, 10, 20
y1.1 <- df(x, 1, 1)
y1.2 <- df(x, 1, 2)
y1.3 <- df(x, 1, 3)
y1.5 <- df(x, 1, 5)
y1.10 <- df(x, 1, 10)
y1.20 <- df(x, 1, 20)
## plot
plot(x, y1.1, type = "l", ylim = c(0, 2.5), col = "black", xlab = "X", ylab = "density of F", main = "density of F for different df")
lines(x, y1.2, type = "l", col = "red")
lines(x, y1.3, type = "l", col = "green")
lines(x, y1.5, type = "l", col = "black", lty = 2)
lines(x, y1.10, type = "l", col = "red", lty = 2)
lines(x, y1.20, type = "l", col = "green", lty = 2)
legend(2.2, 2.5,
legend = c("d1 = 1, d2 = 1", "d1 = 1, d2 = 2", "d1 = 1, d2 = 3", "d1 = 1, d2 = 5", "d1 = 1, d2 = 10", "d1 = 1, d2 = 20"),
col = c("black", "red", "green", "black", "red", "green"), lty = c(1,1,1,2,2,2)
)
## df1 = 5, df2 = 1, 2, 3, 5, 10, 20
y1.1 <- df(x, 5, 1)
y1.2 <- df(x, 5, 2)
y1.3 <- df(x, 5, 3)
y1.5 <- df(x, 5, 5)
y1.10 <- df(x, 5, 10)
y1.20 <- df(x, 5, 20)
## plot
plot(x, y1.1, type = "l", ylim = c(0, 2.5), col = "black", xlab = "X", ylab = "density of F", main = "density of F for different df")
lines(x, y1.2, type = "l", col = "red")
lines(x, y1.3, type = "l", col = "green")
lines(x, y1.5, type = "l", col = "black", lty = 2)
lines(x, y1.10, type = "l", col = "red", lty = 2)
lines(x, y1.20, type = "l", col = "green", lty = 2)
legend(2.2, 2.5,
legend = c("d1 = 5, d2 = 1", "d1 = 5, d2 = 2", "d1 = 5, d2 = 3", "d1 = 5, d2 = 5", "d1 = 5, d2 = 10", "d1 = 5, d2 = 20"),
col = c("black", "red", "green", "black", "red", "green"), lty = c(1,1,1,2,2,2)
)
## df1 = 1, 2, 3, 5, 10, 20, df2 = 5
y1.1 <- df(x, 1, 5)
y1.2 <- df(x, 2, 5)
y1.3 <- df(x, 3, 5)
y1.5 <- df(x, 5, 5)
y1.10 <- df(x, 10, 5)
y1.20 <- df(x, 20, 5)
## plot
plot(x, y1.1, type = "l", ylim = c(0, 2.5), col = "black", xlab = "X", ylab = "density of F", main = "density of F for different df")
lines(x, y1.2, type = "l", col = "red")
lines(x, y1.3, type = "l", col = "green")
lines(x, y1.5, type = "l", col = "black", lty = 2)
lines(x, y1.10, type = "l", col = "red", lty = 2)
lines(x, y1.20, type = "l", col = "green", lty = 2)
legend(2.2, 2.5,
legend = c("d1 = 1, d2 = 5", "d1 = 2, d2 = 5", "d1 = 3, d2 = 5", "d1 = 5, d2 = 5", "d1 = 10, d2 = 5", "d1 = 20, d2 = 5"),
col = c("black", "red", "green", "black", "red", "green"), lty = c(1,1,1,2,2,2)
)
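The first two blocks above differ only in df1, so a compact helper (my addition, a sketch) that varies df2 for a fixed df1 avoids the repetition; the third panel, which varies df1 instead, would need the roles swapped.
plot_f_family <- function(df1 = 1, df2s = c(1, 2, 3, 5, 10, 20)) {
  x <- seq(0, 3, by = 0.01)
  cols <- rep(c("black", "red", "green"), 2)
  ltys <- rep(1:2, each = 3)
  plot(x, df(x, df1, df2s[1]), type = "l", ylim = c(0, 2.5), col = cols[1],
       xlab = "X", ylab = "density of F", main = "density of F for different df")
  for (i in 2:length(df2s)) lines(x, df(x, df1, df2s[i]), col = cols[i], lty = ltys[i])
  legend("topright", legend = paste0("d1 = ", df1, ", d2 = ", df2s), col = cols, lty = ltys)
}
plot_f_family(1)   # reproduces the first panel; plot_f_family(5) gives the second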
The probability of getting lung cancer is ten times higher for people who smoke at least one cigarette every day than for those who don't smoke at all.
The odds ratio between people who smoke at least one cigarette every day and those who don't smoke at all is 15.
## This is to calculate the coverage of Wald C.I. of relative risk of 2 by 2 table
## --------------------------------------------------------------
## Lung Cancer No Lung Cancer
## Smoke n11 n12
## Non-Smoke n21 n22
## --------------------------------------------------------------
# This function just return TRUE or FALSE to indicate whether the C.I.
# will cover the real relative risk or not
RRcover_or_not <- function(p1 = 0.007, p2 = 0.003, n1 = 5000, n2 = 10000) {
  n11 <- rbinom(1, size = n1, prob = p1)
  n21 <- rbinom(1, size = n2, prob = p2)
  n12 <- n1 - n11
  n22 <- n2 - n21
  p1hat <- n11/n1
  p2hat <- n21/n2
  logrr <- log(p1hat / p2hat)
  sd <- sqrt((1 - p1hat) / (p1hat * n1) + (1 - p2hat) / (p2hat * n2))
  lower <- logrr - sd * qnorm(1 - 0.05/2)
  upper <- logrr + sd * qnorm(1 - 0.05/2)
  real <- log(p1/p2)
  return(real > lower && real < upper)
}
# This function calculates the percentage of coverage
# by running the above function repeatedly
RRcoverage <- function(n = 1000, ...) {
  temp <- rep(0, n)
  for (i in 1:n) {
    temp[i] <- RRcover_or_not(...)
  }
  res <- sum(temp) / n
  return(res)
}
# This function calculates the coverage repeatedly and plots the density.
# It also shades the area above 95%, so we can check how good this
# method (Wald) is at computing the C.I. of the log relative risk
RRcoverageplot <- function(rep = 1000, ...) {
  temp <- rep(0, rep)
  for (i in 1:rep) {
    temp[i] <- RRcoverage(...)
  }
  den <- density(temp)
  plot(den, type = "l", col = "red", lwd = 2, main = "Coverage for Wald 95% CI for RR", xlab = "Coverage")
  abline(v = 0.95, lty = 2)
  pin <- which.min(abs(den$x - 0.95))   # index of the density point closest to 0.95
  x1 <- den$x[pin:length(den$x)]
  y1 <- rep(0, length(x1))
  x2 <- rev(x1)
  y2 <- rev(den$y[pin:length(den$y)])
  polygon(c(x1, x2), c(y1, y2), col = "skyblue")
  per_over_95 <- sum(temp > 0.95) / rep * 100
  text(0.955, 20, labels = paste(per_over_95, "%", sep = ""), cex = 1.5)
}
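An example run (my addition); each of the 200 coverage estimates is based on 500 simulated intervals, so it takes a little while:
set.seed(1)
RRcoverageplot(rep = 200, n = 500)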
New R functions used here include polygon(), which.min(), and text().
This is a realization of the Gibbs sampler for the bivariate normal distribution from Gelman's book Bayesian Data Analysis (third edition), page 277.
Gibbs.bivariate.normal <- function(n = 1e3, burn = 1e2, rho = .8, y1 = 0, y2 = 0) {
  N <- n + burn
  theta1 <- numeric(N + 1)
  theta2 <- numeric(N)
  theta1[1] <- 2.5   # arbitrary starting value
  for (i in 1:N) {
    theta2[i] <- rnorm(1, mean = y2 + rho * (theta1[i] - y1), sd = sqrt(1 - rho^2))
    theta1[i+1] <- rnorm(1, mean = y1 + rho * (theta2[i] - y2), sd = sqrt(1 - rho^2))
  }
  theta1 <- theta1[(burn+1):N]
  theta2 <- theta2[(burn+1):N]
  plot(theta1, theta2, pch = 19, cex = .3)
}
When rho = 0.7, the plot is like this
When rho = -0.3, the plot is like this
This is when rho = -0.9
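The plots above can be reproduced (my addition) with calls such as:
Gibbs.bivariate.normal(rho = 0.7)
Gibbs.bivariate.normal(rho = -0.3)
Gibbs.bivariate.normal(rho = -0.9)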
We compare the performance of SVM and logistic regression on samples generated with the mvrnorm function in R. The following is the R code.
library(MASS)
library(e1071)   # for svm() and tune()
## generate data
mu1 <- c(0, 0, 0)
mu2 <- c(2.5, 2.5, 2.5)
Sigma1 <- matrix(c(2, 1, 1, 1, 2, 1, 1, 1, 3), nrow = 3)
Sigma2 <- matrix(c(3, 1, 1, 1, 3, 1, 1, 1, 2), nrow = 3)
data1 <- mvrnorm(1000, mu1, Sigma1)
data2 <- mvrnorm(1000, mu2, Sigma2)
y <- rep(c(1, 0), each = 1000)
data <- cbind(rbind(data1, data2), y)
data <- as.data.frame(data)
names(data) <- c("x1", "x2", "x3", "y")
## glm model
model <- glm(as.factor(y) ~ x1 + x2 + x3, data = data, family = "binomial")
pred <- predict(model, newdata = data, type = "response")   # predict's argument is 'newdata'
pred <- as.numeric(pred >= .5)
# ggplot(data = data, aes(x = x1, y = x2)) + geom_point(aes(color = as.factor(pred)))
## svm model
tuneResult <- tune(svm, as.factor(y) ~ x1 + x2 + x3, data = data,
                   ranges = list(epsilon = seq(.1, .9, .2), cost = 1:5)
)                  # note: epsilon matters only for regression; here only cost really varies
pred.svm <- predict(tuneResult$best.model, newdata = data)
## compare the two models
xtabs(~ y + pred)
xtabs(~ y + pred.svm)
error.glm <- sum(y != pred) / nrow(data)
error.svm <- sum(y != pred.svm) / nrow(data)
paste("The error rate of logistic regression is: %", error.glm * 100, sep = "")
paste("The error rate of svm is: %", error.svm * 100, sep = "")
The performance is as follows.
> xtabs(~y + pred)
pred
y 0 1
0 847 153
1 146 854
> xtabs(~y + pred.svm)
pred.svm
y 0 1
0 865 135
1 129 871
> paste("The error rate of logistic regression is: %", error.glm * 100, sep = "")
[1] "The error rate of logistic regression is: %14.95"
> paste("The error rate of svm is: %", error.svm * 100, sep = "")
[1] "The error rate of svm is: %13.2"
Another simulation result.
> xtabs(~y + pred)
pred
y 0 1
0 830 170
1 170 830
> xtabs(~y + pred.svm)
pred.svm
y 0 1
0 816 184
1 124 876
> paste("The error rate of logistic regression is: %", error.glm * 100, sep = "")
[1] "The error rate of logistic regression is: %17"
> paste("The error rate of svm is: %", error.svm * 100, sep = "")
[1] "The error rate of svm is: %15.4"
My finding is that SVM achieves a somewhat lower error rate than logistic regression in both runs.
The important thing is to input the data in the right form: an array of dimension 2 x 2 x q, where q is the number of strata. The following code is an example of the Cochran–Mantel–Haenszel test for repeated tests of independence.
data <- array(c(708, 50, 169, 13,
136, 24, 73, 14,
106, 32, 17, 4,
109, 22, 16, 26,
801, 102, 180, 25,
159, 27, 18, 13,
151, 51, 28, 15,
950, 173, 218, 33),
dim = c(2, 2, 8),
dimnames =
list(hand = c("right", "left"),
hair = c("cw", "ccw"),
strata = c("white_children", "British_adult", "Penn_white", "Welsh_men",
"German_soldiers", "German_children", "New_York", "American_men"))
)
mantelhaen.test(data)
The result is as follows.
Mantel-Haenszel chi-squared test with continuity correction
data: data
Mantel-Haenszel X-squared = 6.0702, df = 1, p-value = 0.01375
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
1.061250 1.608976
sample estimates:
common odds ratio
1.306723
The p-value is 0.01375, showing that the two variables (hand and hair) are not independent.
The chi-square statistic is 6.07.
The mean of the Cauchy distribution doesn't exist, which can be shown by the following example.
Create a random vector x1, x2, ..., x1000 from the Cauchy distribution, then calculate the running mean Yi = (x1 + ... + xi)/i. Plotting the sequence Yi gives a glance at how it behaves as i increases from 1 to 1000. The following is the R code.
crazyCauchy <- function() {
  rv <- rcauchy(1e3)
  x <- numeric(1e3)
  for (i in 1:1e3) {
    x[i] <- mean(rv[1:i])   # running mean of the first i draws
  }
  plot(x, pch = 19, cex = 0.5)
}
Running the function, the plot looks like this.
There is no sign of convergence.
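For contrast (my addition), the same running-mean plot for N(0, 1) data settles down quickly, as the law of large numbers guarantees; cumsum gives the running mean in one vectorized step.
crazyNormal <- function() {
  rv <- rnorm(1e3)
  plot(cumsum(rv) / seq_along(rv), pch = 19, cex = 0.5, ylab = "running mean")
}
crazyNormal()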
The sample mean and sample variance of normal data are independent. The following function checks this by simulation: it computes the correlation between the means and variances of n independent normal samples, which should be near zero (for example, mean_var_indep(1000)).
mean_var_indep <- function(n) {
  m <- v <- numeric(n)
  for (i in 1:n) {
    temp <- rnorm(n)
    m[i] <- mean(temp)
    v[i] <- var(temp)
  }
  res <- cor(m, v)
  return(res)
}
If we want to calculate int_0^1 1/sqrt(2*pi) * exp(-x^2/2) dx, there is no integral table that gives us the exact value. However, we can use Monte Carlo simulation, based on the law of large numbers. The following is the R code.
MCnormal <- function(n) {
  a <- runif(n)                  # uniform draws on (0, 1)
  b <- dnorm(a)                  # evaluate the integrand
  est <- mean(b)                 # Monte Carlo estimate of the integral
  real <- pnorm(1) - pnorm(0)
  diff <- abs(real - est) / real
  return(list(simulated = est, real = real, diff = diff))
}
The larger n is, the more precise the estimation.
> MCnormal(100)
$simulated
[1] 0.3388052
$real
[1] 0.3413447
$diff
[1] 0.007439889
> MCnormal(1000)
$simulated
[1] 0.3418248
$real
[1] 0.3413447
$diff
[1] 0.00140629
> MCnormal(1000000)
$simulated
[1] 0.3413338
$real
[1] 0.3413447
$diff
[1] 3.214687e-05
> MCnormal(1000000)
$simulated
[1] 0.3413239
$real
[1] 0.3413447
$diff
[1] 6.112997e-05
When n reaches 1e6, the difference is only about .006%, which is pretty negligible.
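The precision can be quantified (my addition): the Monte Carlo standard error is the sd of the integrand draws divided by sqrt(n), so the error shrinks at rate 1/sqrt(n).
set.seed(1)
n <- 1e4
b <- dnorm(runif(n))
mean(b)             # estimate of the integral
sd(b) / sqrt(n)     # Monte Carlo standard error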
This is a function I wrote to test the bisection method for getting the MLE of the gamma distribution. After running the function code, we can try an example in R.
> set.seed(101)
> data <- rgamma(500, shape = 4, rate = 2)
> mle.gamma(data)
$hat_alpha
[1] 3.800928
$hat_lambda
[1] 1.88312
We can see that the estimates are reasonable. The same approach can be used for any other estimation problem with no closed-form solution. The following is the code for the function mle.gamma().
mle.gamma <- function(X, tol = 1e-6) {
  # X should be a random sample from a gamma distribution
  n <- length(X)
  # f is the profile score equation for the shape parameter alpha;
  # we want to find the solution of f(x) = 0
  f <- function(x) n * log(x / mean(X)) + sum(log(X)) - n * digamma(x)
  sign <- function(x) {
    # returns the sign of x
    if (x > 0) 1
    else if (x < 0) -1
    else 0
  }
  x.left <- 1
  x.right <- 10   # the root is assumed to lie in [1, 10]
  repeat {
    temp <- (x.left + x.right) / 2
    if (abs(f(temp)) < tol) {
      x.final <- temp
      break
    }
    else {
      if (sign(f(temp)) == sign(f(x.right))) x.right <- temp
      else x.left <- temp
    }
  }
  hat_alpha <- x.final
  hat_lambda <- hat_alpha / mean(X)
  lst <- list(hat_alpha = hat_alpha, hat_lambda = hat_lambda)
  return(lst)
}
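As a cross-check (my addition), MASS::fitdistr fits the same model by general-purpose maximum likelihood and should give similar estimates:
library(MASS)
set.seed(101)
data <- rgamma(500, shape = 4, rate = 2)
fitdistr(data, "gamma")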
An example with a binomial(n, p) likelihood and a beta(a, b) prior.
wb.bin <- function(n, x, a = 1, b = 1, simulation = 5e3, k = 1e3) {
  random.prior <- rbeta(simulation, a, b)
  likelihood <- dbinom(x, n, random.prior)
  # resample prior draws with weights proportional to the likelihood
  random.posterior <- sample(random.prior, k, replace = TRUE, prob = likelihood)
  actual <- (a + x) / (a + b + n)     # exact posterior mean of beta(a + x, b + n - x)
  simulated <- mean(random.posterior)
  return(list(actual = actual, simulated = simulated))
}
When running this function we get
> wb.bin(20, 3)
$actual
[1] 0.1818182
$simulated
[1] 0.1844998
> wb.bin(12, 4)
$actual
[1] 0.3571429
$simulated
[1] 0.3570672
This method is pretty easy to understand and to implement. Thus it is great!
Purpose: run a simulation to show that if X1, ..., Xn are i.i.d. Gumbel(0, theta), then max(X1, ..., Xn) - theta*log(n) is again Gumbel(0, theta). This is max-stability: F(x + theta*log n)^n = exp(-n*e^(-x/theta - log n)) = exp(-e^(-x/theta)). The code checks the case theta = 1.
library(evd)   # assumed source of rgumbel() and dgumbel(); VGAM also provides them
## mu = 0, sigma = 1
a <- expression(
  n <- 50000,
  x <- rgumbel(n, 0, 1),
  xmax <- max(x - log(n))
)
res <- replicate(5000, eval(a))
plot(density(res))
xseq <- seq(-2, 6, by = 0.01)
yseq <- dgumbel(xseq, 0, 1)
lines(xseq, yseq, col = "red")
We can use the Wilcoxon signed-rank test when the paired t-test can't be used for paired samples. When the number of pairs N is large, we can use a normal approximation. The following R simulation shows that when N >= 15, the approximation is pretty good.
rankdist <- function(n = 3, k = 1e3) {
  x <- 1:n
  res <- numeric(k)
  for (i in 1:k) {
    sign <- sample(c(1, -1), size = n, replace = TRUE)   # random signs for the ranks
    y <- x * sign
    res[i] <- sum(y)                                     # signed-rank sum
  }
  t <- max(density(res)$y)
  hist(res, freq = FALSE, ylim = c(0, t * 1.1), main = "Histogram of the Signed-Rank Statistic", xlab = "signed-rank")
  mu <- 0
  sd <- sqrt(n * (n + 1) * (2 * n + 1) / 6)              # exact variance of the signed-rank sum
  xseq <- seq(-120, 120, length = 1000)
  yseq <- dnorm(xseq, mu, sd)
  lines(xseq, yseq, type = "l", lwd = 2, col = "red")
}
N = 5 for the top graph, N = 10 for the second, N = 15 for the bottom graph.
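Base R actually ships the exact distribution of the signed-rank statistic: dsignrank(v, n) gives the pmf of V, the sum of the positive ranks, and the statistic above is 2V - n(n+1)/2. A sketch (my addition) overlaying the exact pmf and the normal approximation; the factor 2 accounts for the spacing of the support points.
n <- 15
v <- 0:(n * (n + 1) / 2)
s <- 2 * v - n * (n + 1) / 2          # signed-rank sum corresponding to V = v
plot(s, dsignrank(v, n), type = "h", xlab = "signed-rank sum", ylab = "probability")
lines(s, 2 * dnorm(s, 0, sqrt(n * (n + 1) * (2 * n + 1) / 6)), col = "red")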
This function redefines "[". When the index goes beyond the length of the vector, it simply wraps around to the beginning.
`[` <- function(x, k) {
  n <- length(x)
  k <- k %% n
  if (k == 0) k <- n
  # subset() lives in the base namespace, so internally it still uses the
  # original `[` and we avoid infinite recursion
  subset(x, 1:n == k)
}
> x
[1] 3 2 4 5 1
> x[1]
[1] 3
> x[2]
[1] 2
> x[3]
[1] 4
> x[4]
[1] 5
> x[5]
[1] 1
> x[6]
[1] 3
> x[7]
[1] 2
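Since this shadows the base operator for everything else in the session, it is worth removing the override afterwards (my addition):
rm("[")   # removes the global binding; base subsetting is restored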
JAGS can perform MCMC, and the R package rjags can be used to access JAGS from R. The process is pretty easy: first we write the model file (it can be saved in .R format); then we invoke the model using functions from the rjags package. The following is an easy example.
Step 1. Save the following code under the name "myFirstJags.R".
model {
  for (i in 1:J) {
    y[i] ~ dbin(theta[i], 100)
    theta[i] ~ dbeta(alpha, beta)
  }
  alpha ~ dgamma(1, 1)
  beta ~ dgamma(1, 1)
}
Step 2. Run the following code.
library(rjags)
set.seed(1001)
theta <- rbeta(10, 1, 5)
set.seed(1002)
y <- rbinom(10, 100, theta)
jagresult <- jags.model("myFirstJags.R", data = list("y" = y, "J" = 10), n.adapt = 1000)
thesamps <- coda.samples(jagresult, c("alpha", "beta"), n.iter = 10000, thin = 1, progress.bar = "gui")
We can use the plot function to get the graphs of alpha and beta.
plot(thesamps)
We will get the following plot.
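Numerical posterior summaries (my addition) are available through coda, which rjags loads:
summary(thesamps)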
When lambda is large, we can use Normal(lambda, lambda) to approximate Poisson(lambda); this can be proved using the continuity theorem. The following is the R code and output.
Pois2Norm <- function(lambda = 100, n = 1000) {
  x <- rpois(n, lambda)
  minx <- min(x)
  maxx <- max(x)
  plot(density(x), main = "Use Normal to approximate Poisson when lambda is large", xlab = "X")
  xnorm <- seq(from = minx, to = maxx, length = n)
  lines(xnorm, dnorm(xnorm, mean = lambda, sd = sqrt(lambda)), col = "red")
  x.pos <- minx + 0.01 * (maxx - minx)
  miny <- min(density(x)$y)
  maxy <- max(density(x)$y)
  y.pos <- maxy - 0.01 * (maxy - miny)
  legend(x.pos, y.pos, lty = c(1, 1), col = c("black", "red"), legend = c("Poisson", "Normal"))
}
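A quick numerical check (my addition): tail probabilities with a continuity correction agree closely when lambda is large.
lambda <- 100
ppois(110, lambda)                    # exact Poisson probability P(X <= 110)
pnorm(110.5, lambda, sqrt(lambda))    # normal approximation with continuity correction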