In this course we will cover computational methods commonly used in statistics, including algorithms for fitting linear and non-linear regressions, maximum likelihood estimation, simulation of random variables, the bootstrap, cross-validation, and algorithms for implementing high-dimensional regressions.
Software: The course will be mostly based on R. If time permits we will also work with Julia.
Approach: Although the focus of the course is on computational methods, for each topic we will first describe the problem from a statistical perspective. If they exist, exact analytical solutions will be discussed and implemented. Otherwise numerical methods will be presented. Derivations will be presented in class and students are expected to take their own notes. Scripts for computations will be developed in class and a summary will be posted in this repository. Students are expected to bring their own laptops. If you do not have access to a laptop, please check with the instructor to get access to one.
Evaluation: The evaluation will be based on 4-5 homework assignments (HW) and two in-class exams.
Textbook: An Introduction to Statistical Learning. This book covers many of the topics we will discuss. For topics not covered in the book we will provide additional materials.
Instructor: Gustavo de los Campos ([email protected])
Note: this is a tentative and ambitious list of topics; if time permits we will try to cover all of them, but we may not get to every topic listed.
- Introduction
- The R-software
- Matrix Algebra in R
- Ordinary Least Squares I: Estimation
- Ordinary Least Squares II: Inference
- Non-linear regression using Splines
- Multi-core computing in R
- Maximum Likelihood
- Simulation of random variables
- Monte Carlo Markov Chain Methods
- Cross-Validation
- Penalised Regressions
- Goals
- Rules
- Approach
- Evaluation and grading
Topics you should be familiar with
- Installation (from CRAN)
- Types (boolean, integer, numeric, character, factors)
- Variables and simple operations
- Arrays
  * Vectors
  * Matrices
  * Multi-dimensional arrays
  * Lists
  * Data Frames
- Conditional statements
- For and while loops
- I/O
  * read.table and write.table
  * load and save
  * dput and dget
  * scan and write
  * readBin and writeBin
- Functions
Materials
- Section 2.3 of the book
- R website (see entry for manuals)
- R for Data Science
A few matrices and vectors for demonstration
x1=c(1,2,0)
x2=c(1,-1,3)
A=cbind(x1,x2)
x3=c(3,4,1)
x4=c(2,0,1)
B=rbind(x3,x4)
dim(A)
dim(B)
Matrix addition, subtraction and other cell-by-cell operations
# cell-by-cell operations
B+B
B-B
B^2
exp(B) # note, this is very different than the matrix exponential, see below
B*B # note, this is very different than the matrix product
Transpose
Bt=t(B)
all.equal(t(Bt),B)
Let's write our own function to transpose a matrix
## A very inefficient function that illustrates matrix transposition and the creation of functions in R
myT=function(X){
nCol.out=nrow(X)
nRow.out=ncol(X)
OUT=matrix(nrow=nRow.out,ncol=nCol.out,NA)
for(i in 1:nRow.out){
for(j in 1:nCol.out){
OUT[i,j]=X[j,i] # simply flip the indexes
}
}
return(OUT)
}
all(t(B)==myT(B))
Matrix product (%*%, crossprod and tcrossprod)
X=matrix(nrow=5,ncol=3,data=rnorm(15))
t(X)%*%X # X'X
crossprod(X) # same but a bit faster
X%*%t(X) # XX'
tcrossprod(X) # same but a bit faster
QR-decomposition
The QR-decomposition can be used to factorize a matrix into the product of an orthonormal basis (Q) and a triangular matrix (R) so that X=QR with Q'Q=I.
TMP=qr(X) # computes QR decomposition, returns a list
qr.Q(TMP) # recovers Q from a QR decomposition
qr.R(TMP) # recovers R from a QR decomposition
all(qr.X(TMP)==X) # recovers X
round(crossprod(qr.Q(TMP)),5) # Q is orthonormal, so Q'Q=I
Rank
qr(X)$rank
Determinant
X=matrix(nrow=3,ncol=3,data=rnorm(9))
det(X)
XtX=crossprod(cbind(X,X[,1])) # det of a singular (or rank-deficient) matrix is zero
det(XtX)
qr(XtX)$rank
Matrix Inversion
XInv=solve(X)
round(XInv%*%X,5)
round(X%*%XInv,5)
solve(XtX) # this call fails: a rank-deficient matrix does not have an inverse
Singular Value Decomposition Any matrix X (nxp) can be decomposed as X=UDV', where U (nxq) is an orthonormal basis for the column-space of X, V (pxq) is an orthonormal basis for the row-space of X, and D (qxq) is a diagonal matrix with the singular values in its diagonal. Here q=min(n,p).
SVD=svd(X)
str(SVD)
round(crossprod(SVD$u),5) #$u gives an orthonormal basis for the column-space of X
round(crossprod(SVD$v),5) #$v gives an orthonormal basis for the row-space of X
SVD$d
## Now for a rank-deficient matrix
SVD=svd(XtX)
sum(SVD$d>1e-10) # the rank is the sum of positive singular values
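As a quick check (a sketch, not in the original notes), the factors returned by svd() reconstruct the original matrix:
# Sketch: reconstructing XtX from its SVD
all.equal(SVD$u%*%diag(SVD$d)%*%t(SVD$v),XtX,check.attributes=FALSE)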
Materials
- Book: pages 9-12
- A Review of Matrix Algebra
#### The OLS Problem
Consider a regression problem of the form
y=Xb+e
where y (nx1) is a response vector, X (nxp) is a design matrix of effects, b (px1) is a vector of regression coefficients, and e (nx1) is a vector of model residuals.
The ordinary least-squares estimate of b is obtained by minimizing the residual sum of squares, RSS=e'e=(y-Xb)'(y-Xb), with respect to b.
Derivation The steps to find the analytical solution are:
- Differentiate RSS(y,X,b) with respect to the jth coefficient (bj).
- Set all the equations equal to zero (this gives a stationary point).
- Because e'e is a quadratic form, the matrix of second derivatives is PSD, so we know that the stationary point is a minimum and therefore a solution.
Analytical Solutions
We show in class that the solution is given by the vector bHat that solves the following system of equations
(X'X)bHat=X'y
If X has full column rank, then we have
bHat=solve(X'X)%*%X'y
Notation: We usually call C=X'X the matrix of coefficients of the system of equations, rhs=X'y the right-hand side of the system, and sol=bHat the solution. The short notation for the OLS system is
C%*%sol=rhs
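As a quick illustration (a minimal sketch with a small simulated data set, not part of the original notes), the normal equations can be solved directly; solving C%*%sol=rhs with solve(C,rhs) is numerically preferable to explicitly inverting C.
# Minimal sketch: solving the OLS normal equations directly
set.seed(1)
X0=cbind(1,matrix(rnorm(300),nrow=100,ncol=3)) # small design matrix with an intercept column
y0=X0%*%c(1,2,0,-1)+rnorm(100)
C=crossprod(X0)      # C=X'X
rhs=crossprod(X0,y0) # rhs=X'y
sol=solve(C,rhs)     # solves C%*%sol=rhs without explicitly inverting C
cbind(sol,coef(lm(y0~X0-1))) # compare with lm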
In these examples we show alternative ways of computing OLS estimates. We begin with the R functions lm and lsfit and then include alternative ways of computing estimates using matrix operations, factorizations, and iterative procedures.
A simple simulated data set
p=100
n=5000
b=rnorm(p)
X=matrix(nrow=n,ncol=p,data=runif(n*p))
signal=150+X%*%b
error=rnorm(sd=sd(signal),n=n)
y=signal+error
The lm function
fm<-lm(y~X)
bHat=coef(fm)
summary(fm)
The lsfit function
fm2<-lsfit(y=y,x=X)
bHat2=coef(fm2)
ls.print(fm2) # the summary method is not very useful with lsfit
Our own lm using matrix operations
This method will work only if X has rank equal to the number of columns of X.
myLS.solve=function(y,X,int=TRUE){
if(int){
X=cbind(1,X)
}
C=crossprod(X)
rhs=crossprod(X,y)
CInv=solve(C)
sol=crossprod(CInv,rhs)
return(sol)
}
Inversion using the Cholesky decomposition
This method will work only if X has rank equal to the number of columns of X. This guarantees that X'X can be factorized using the Cholesky decomposition. Since the Cholesky factor is triangular (R returns the upper-triangular factor U, so we have X'X=U'U), we can easily compute the inverse of X'X from its Cholesky factor.
Let's check that the inverse obtained via the Cholesky is equivalent to the one computed using solve
Z=matrix(nrow=10000,ncol=1000,rnorm(1000*10000))
C=crossprod(cbind(Z))
system.time(CInv.solve<-solve(C))
system.time(CInv.chol<-chol2inv(chol(C)))
all(round(CInv.solve,8)==round(CInv.chol,8))
OLS via Cholesky
myLS.chol=function(y,X,int=TRUE){
if(int){
X=cbind(1,X)
}
C=crossprod(X)
rhs=crossprod(X,y)
CInv=chol2inv(chol(C))
sol=crossprod(CInv,rhs)
return(sol)
}
OLS using the QR-decomposition
myLS.qr=function(y,X,int=TRUE){
if(int){
X=cbind(1,X)
}
QR=qr(X)
Q=qr.Q(QR)
R=qr.R(QR)
gHat=crossprod(Q,y) # Q'y
sol=backsolve(R,gHat) # solves R%*%sol=Q'y by back-substitution
return(sol)
}
#or
myLS.qr=function(y,X,int=TRUE){
if(int){
X=cbind(1,X)
}
QR=qr(X)
sol=qr.coef(QR,y)
return(sol)
}
OLS using the singular-value decomposition
myLS.svd=function(y,X,int=TRUE){
if(int){
X=cbind(1,X)
}
p=min(dim(X))
SVD=svd(X,nu=p,nv=p)
gHat=crossprod(SVD$u,y) # U'y
sol=tcrossprod(SVD$v,t(gHat/SVD$d)) # V%*%(U'y/d), i.e., V D^-1 U'y
return(sol)
}
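As a sanity check (a sketch, not part of the original notes, reusing y, X and the lm fit fm from the simulated data above), the different implementations should give essentially the same estimates:
# Sketch: comparing the OLS implementations against lm
round(cbind(coef(fm),
 myLS.solve(y,X),
 myLS.chol(y,X),
 myLS.qr(y,X),
 myLS.svd(y,X)),8)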
Computing OLS estimates using iterative procedures (Gauss-Seidel method)
We can also find OLS estimates using iterative procedures. The following function implements the Gauss-Seidel method. This method is equivalent to the back-fitting algorithm and can also be viewed as a coordinate descent algorithm.
GS.solver=function(X,y,tol=1e-5,int=TRUE,center=TRUE){
if(center){
X=scale(X,center=TRUE,scale=FALSE)
}
if(int){
X=cbind(1,X)
}
C=crossprod(X)
rhs=crossprod(X,y)
p=ncol(X)
b=rep(0,p)
if(center&int){ b[1]=mean(y) }
ready=FALSE
iter=0
while(!ready){
iter=iter+1
b0=b
for(j in 1:p){
tmp=sum(C[j,-j]*b[-j]) # offset: contribution of all the other coefficients
b[j]=(rhs[j]- tmp )/C[j,j] # update the jth coefficient given the current values of the others
}
tmp=abs(b-b0)
ready= all(tmp<tol)
print(paste0('Iter=',iter, ' Max abs dif=',(max(tmp))))
}
return(b)
}
Let's test it with a simple simulation.
n=5000
p=100
X=matrix(nrow=n,ncol=p,data=rnorm(n*p))
bTRUE=c(runif(p))
signal=X%*%bTRUE+100
error=rnorm(n=length(signal),mean=0,sd=sd(signal)/2)
y=signal+error
system.time(bOLS<-coef(lm(y~X)))
system.time(bOLS.GS<-GS.solver(y=y,X=X,center=TRUE,int=TRUE))
max(abs(bOLS[-1]-bOLS.GS[-1])) # note: since we are centering for GS, the est. intercepts are different
The model.matrix() function can be used to generate incidence matrices for regression models.
The following example illustrates how to produce contrasts for a means model.
The following webpage has a summary of some of the contrasts commonly used.
The following webpage has a summary of some of the contrasts commonly used.
Contrasts for Means Model
sex=sample(c('M','F'),size=10,replace=T)
Z=model.matrix(~sex-1) # by default R uses treatment contrasts, '-1' means removes the intercept
crossprod(Z)
table(sex)
R offers various types of contrasts; the following functions generate contrasts for factors with n levels:
- contr.treatment(n,...) is the default option and generates contrasts for differences with a reference category.
- contr.helmert(n,...) generates contrasts of each level with the mean of the preceding levels (group 2 - group 1, group 3 - mean of groups 1 and 2, etc.).
- contr.poly(n,...) generates orthogonal polynomial contrasts.
- contr.sum(n,...) generates contrasts constrained to sum to zero.
- contr.SAS(n,...) generates treatment contrasts with the last level as the reference (as in SAS).
contr.treatment(4)
contr.helmert(4)
contr.sum(4)
We can tell model.matrix what contrasts to use with the contrasts.arg argument.
x1=c('a','a','b','b','b','c','c','c')
x2=c('1','1','2','2','1','3','3','1')
X=model.matrix(~x1+x2,contrasts.arg=list(x1=contr.treatment,x2=contr.helmert))
We can also recover the contrasts used by model.matrix using the attr() function.
attr(X,'contrasts')
We can also pass these arguments into lm.
y=rnorm(8)
fm1=lm(y~x1+x2) # uses standard dummy coding (contr.treatment)
fm2=lm(y~x1+x2,contrasts=list(x1=contr.helmert,x2=contr.sum))
fm1$contrasts
We can control which level will be the reference for the treatment contrasts by changing the order of the levels of a factor.
x=factor(c('a','a','b','b','b','c','c','c'))
levels(x)
y=rnorm(length(x))
summary(lm(y~x))
z=factor(x,levels=c('c','a','b'))
summary(lm(y~z))
Bias Under relatively minimal conditions, OLS estimates are unbiased. What are these conditions? How do we show that OLS estimates are unbiased? What conditions can lead to systematic bias?
Consider a linear regression of the form y=Xb+e, where X is a full-rank matrix and b is a vector of effects. The OLS estimate of b is given by
`bHat=Inv(X'X)X'y`
The expected value of this estimator, for fixed X is
E[bHat|X]=E[Inv(X'X)X'y|X]=Inv(X'X)X'E[y|X]
Case 1: If the linear model holds, that is, if E[y|X]=Xb, then we have E[bHat|X]=Inv(X'X)X'Xb=b; therefore, we conclude that OLS estimates are unbiased. This result involves two assumptions: (i) that X is full rank and therefore X'X can be inverted, and (ii) that the expected value of the error terms is zero. However, other assumptions that are sometimes made for inferential purposes, e.g., that the errors are independent and/or normally distributed or that they have equal variance (homoskedasticity), are not required for the OLS estimates to be unbiased. Only (i) and (ii) are needed.
Case 2: Even if the model is non-linear, the error terms from the projection of y on X are, by construction, orthogonal to X. So, even if the model is non-linear we have E[X'e]=0, and this makes the OLS estimates unbiased estimates of the projection of y on X, provided that X is full rank.
Rank deficient case If X is rank deficient, then we need to use a generalized inverse; in this case, Ginv(X'X)X'X is not equal to an identity matrix and therefore the expected value of the OLS estimate is not equal to the true value of the regression coefficients. However, even in this case we can still get unbiased estimates of 'estimable functions', for instance, contrasts between treatments.
Variance The conditional variance of the OLS estimates is given by:
Var(bHat|X)=Var(Inv(X'X)X'y|X)=Inv(X'X)X'Var(y)XInv(X'X)=Inv(X'X)X'GXInv(X'X)
where G is the variance-covariance matrix of the error terms. If the errors are independent and homoskedastic, then we have Var(e)=I*vE, and therefore,
Var(bHat|X)=Inv(X'X)X'GXInv(X'X)=Inv(X'X)X'XInv(X'X)*vE=Inv(X'X)*vE
Therefore, under (iii) independence and (iv) homoskedasticity of the error terms, the variance-covariance matrix of the OLS estimates is the product of the inverse of the coefficient matrix, Inv(X'X), times the error variance.
How do we estimate the error variance? An unbiased estimator can be obtained as eHat'eHat/(n-p), where eHat are the OLS residuals (eHat=y-XbHat), n is the sample size, and p is the rank of X.
Standard errors and p-values. The square roots of the diagonal elements of the variance-covariance matrix of the OLS estimates are the SEs of the estimates. Using these and the estimated effects we can obtain t-statistics and the corresponding p-values (see pt()).
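The following sketch (with a small simulated data set; not part of the original notes) computes the variance-covariance matrix, SEs, t-statistics and p-values by hand and compares them with summary(lm):
# Sketch: OLS standard errors and p-values computed by hand
set.seed(10)
n=200
X=cbind(1,matrix(rnorm(n*3),nrow=n,ncol=3))
y=as.vector(X%*%c(1,2,0,-1))+rnorm(n)
bHat=solve(crossprod(X),crossprod(X,y))
eHat=y-X%*%bHat
vE=sum(eHat^2)/(n-ncol(X))  # unbiased estimate of the error variance
VCOV=solve(crossprod(X))*vE # Var(bHat|X)=Inv(X'X)*vE
SE=sqrt(diag(VCOV))
tStat=bHat/SE
pVal=2*pt(abs(tStat),df=n-ncol(X),lower.tail=FALSE)
cbind(bHat,SE,tStat,pVal)
summary(lm(y~X-1))$coef     # should match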
Omitted Variable Bias. Suppose that the true model is given by y=Xa+Zb+e, and that instead we regress y on X only. Are the OLS estimates still unbiased? The answer is yes, the OLS estimates from the regression of y on X are unbiased estimates of the projection of y on X; however, these are not unbiased estimates of a. To see this, it is useful to consider two models:
Full Model, or long regression: y=Xa+Zb+e
Short regression: y=Xc+d
The OLS estimate of c is Inv(X'X)X'y. The expected value of this estimator is
E[cHat]=Inv(X'X)X'E[y]=Inv(X'X)X'(Xa+Zb)=a+Inv(X'X)X'Zb=a+Tb
where T is a matrix containing the regressions of the columns of Z on X. Therefore, the OLS estimates of the regression coefficients in the short regression can be biased with respect to the 'true' effects of X (a); however, they are still unbiased with respect to c, the projection of y on X.
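The following small Monte Carlo simulation (a sketch with made-up parameter values, not part of the original notes) illustrates the result: when X and Z are correlated, the short regression recovers a+Tb rather than a.
# Sketch: omitted variable bias via Monte Carlo
set.seed(20)
n=1000; a=1; b=2
cHat=rep(NA,500)
for(i in 1:500){
 x=rnorm(n)
 z=0.5*x+rnorm(n) # Z is correlated with X; the regression of z on x is T=0.5
 y=a*x+b*z+rnorm(n)
 cHat[i]=coef(lm(y~x))['x'] # short regression omitting z
}
mean(cHat) # approximately a+T*b = 1+0.5*2 = 2, not a=1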
- Introduction
- Computing mean, variance and bias of estimates using MC methods
- Monte Carlo Error
Approximating a conditional expectation function using bins
stepFunction=function(y,x,nW){
if(nW<3){stop("The minimum # of windows is 3") }
thresholds=quantile(x,prob= seq(from=1/nW,to=1-1/nW,by=1/nW))
X=matrix(nrow=length(y),ncol=length(thresholds)+1)
X[,1]=x<=thresholds[1]
for(i in 2:(length(thresholds))){
X[,i]=as.integer(x>thresholds[i-1] & x<=thresholds[i] )
}
X[,ncol(X)]<-x>max(thresholds)
fm=lm(y~X-1)
return(predict(fm))
}
n=100
x=seq(from=0,to=4*pi,length=n)
signal=sin(x)+sin(x/2)
error=rnorm(n)
y=signal+error
plot(y~x,col=4)
lines(x=x,y=signal,col=2,lwd=2)
lines(x=x,stepFunction(y,x,20),col=4,lty=2,lwd=2)
As we increase the number of windows, the bias of the estimator is reduced (we can approximate the true conditional expectation function better) but the variance increases. How could we choose an optimal number of windows? One possibility is to use cross-validation (see HW5).
While the step function is very flexible, it does not render a 'smooth' approximation; for instance, the fitted function is not continuous.
# This function evaluates (x-tau)+^degree for parameters x, tau and degree
bf=function(x,tau,degree){
z=x-tau
ifelse(z>0,z^degree,0)
}
nKnots=5
degree=3
knots=quantile(x,prob=seq(from=1/(nKnots+1),by=1/(nKnots+1),length=nKnots)) # interior knots (avoids placing a knot at max(x))
Z=matrix(ncol=degree+1+length(knots),nrow=length(x),NA)
# First, let's add the polynomial terms to our incidence matrix
Z[,1]=1
for(i in 1:degree){ Z[,i+1]=x^i }
# Now the local basis functions
for(i in 1:nKnots){
Z[,i+degree+1]=bf(x,tau=knots[i],degree=degree)
}
fm=lm(y~Z-1)
plot(y~x)
lines(x=x,y=signal,col=2,lwd=2)
lines(x=x,y=predict(fm),col=4)
# Let's compare with bs() from the splines package
library(splines)
W=bs(x,knots=knots,degree=degree) # note: by default bs does not include an intercept
fm2=lm(y~W)
points(x=x,y=predict(lm(y~W)),col='red',cex=.5)
The same approach can be used with higher order polynomials.
There are many different ways of creating basis functions for a spline. The splines package offers functions for the so-called B-splines (bs) and the natural spline (ns), which is a cubic spline with interior and boundary knots.
library(splines)
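As a brief illustration (a sketch reusing x, y, signal and the knots defined above; not part of the original script), we can fit a natural cubic spline with ns() and overlay the fit:
# Sketch: natural cubic spline fit using ns()
N=ns(x,knots=knots) # natural spline basis with the interior knots used above
fm3=lm(y~N)
plot(y~x,col=8)
lines(x=x,y=signal,col=2,lwd=2) # true conditional expectation
lines(x=x,y=predict(fm3),col=4,lwd=2,lty=2) # natural spline fit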
- Chapter 5, Section 5.2
- The Bootstrap method: Efron & Gong, Am. Stat., 2012
- Introduction to multi-core computing
- The parallel package
- Apply-type operations in parallel
- Matrix computations in parallel
- Monte Carlo simulations using parallel computing
The parallel package has been included in R since version 2.14.0. It combines the work of the two packages snow and multicore, which pursued different approaches to bringing parallel computing capabilities to R (as it lacks those features by design). The easiest way to use the approach of the snow package is the parLapply function. For the approach of the multicore package it is the mclapply function.
Both functions have important caveats: among those, mclapply will only work on UNIX-like operating systems (Linux, macOS), while parLapply will work on all operating systems but takes several manual steps to use. Both functions act similarly to the lapply function that comes with R, which runs a function on each element of a list, returning the results of those function calls as a list of the same length as X:
lapply(X, FUN)
mclapply(X, FUN, mc.cores = detectCores())
parLapply(cluster, X, FUN)
The mc.cores argument of mclapply defines how many CPUs/cores should be used. parLapply takes a cluster argument that needs to be defined using cluster <- makeCluster(detectCores()).
Task: Compute a 90% confidence interval for the mean of a small sample, based on the percentiles of the bootstrap distribution of the sample mean.
Serial as a for loop:
start <- proc.time()
x <- c(8, 3, 2, 5, 6, -1, 2, 3, 5, 6, 3, 9)
nRep <- 1000000
means <- numeric(nRep)
seed <- 1000
for (i in 1:nRep) {
set.seed(seed + i)
z <- sample(x, replace = T, size = length(x))
means[i] <- mean(z)
}
quantile(means, c(0.05, 0.95))
proc.time() - start
Serial using sapply:
start <- proc.time()
x <- c(8, 3, 2, 5, 6, -1, 2, 3, 5, 6, 3, 9)
nRep <- 1000000
seed <- 1000
meanFunction <- function(i) {
set.seed(seed + i)
z <- sample(x, replace = T, size = length(x))
return(mean(z))
}
means <- sapply(1:nRep, meanFunction) # or lapply followed by unlist
quantile(means, c(0.05, 0.95))
proc.time() - start
In parallel using mclapply (this will be serial on Windows):
start <- proc.time()
library(parallel)
x <- c(8, 3, 2, 5, 6, -1, 2, 3, 5, 6, 3, 9)
nRep <- 1000000
seed <- 1000
meanFunction <- function(i) {
set.seed(seed + i)
z <- sample(x, replace = T, size = length(x))
return(mean(z))
}
means <- mclapply(1:nRep, meanFunction, mc.cores = detectCores())
means <- unlist(means)
quantile(means, c(0.05, 0.95))
proc.time() - start
In parallel using parLapply (this should work with all operating systems):
start <- proc.time()
library(parallel)
x <- c(8, 3, 2, 5, 6, -1, 2, 3, 5, 6, 3, 9)
nRep <- 1000000
seed <- 1000
meanFunction <- function(i) {
set.seed(seed + i)
z <- sample(x, replace = T, size = length(x))
return(mean(z))
}
cluster <- makeCluster(detectCores())
clusterExport(cluster, c("x", "seed"))
means <- parLapply(cluster, 1:nRep, meanFunction)
means <- unlist(means)
quantile(means, c(0.05, 0.95))
stopCluster(cluster)
proc.time() - start
Benchmarks on my machine (Core i5, 4 cores):

| method | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| for | 12.662996 | 12.747717 | 13.041865 | 12.845498 | 13.070518 | 13.882597 | 5 |
| apply | 14.074011 | 14.250181 | 14.840325 | 14.264057 | 14.683107 | 16.930270 | 5 |
| mclapply | 6.591227 | 6.643596 | 7.054739 | 6.790404 | 6.893870 | 8.354599 | 5 |
| parLapply | 8.253781 | 8.363489 | 8.408053 | 8.389866 | 8.413008 | 8.620121 | 5 |
For a paper describing the origins of the principle of Maximum Likelihood see Aldrich, 1997
**1. Closed Form Solutions** The likelihood function is the probability of the data given the parameters, viewed as a function of the parameters with the data fixed. Maximum likelihood estimates (MLEs) are the values of the parameters that maximize the likelihood function.
In simple models (e.g., Poisson, Bernoulli, multiple linear regression with Gaussian IID error terms) we can find a closed form for the ML estimates. In these cases MLEs are obtained by the steps below (a worked Poisson example follows the list):
- Taking the 1st derivative of the log-likelihood with respect to each of the parameters
- (First Order Conditions, FOC) Set all the derivatives equal to zero, this gives a stationary point and renders a system with as many equations as parameters.
- Solve for the parameter values that satisfy the FOC.
- Check, based on the sign of the 2nd derivatives, that the function is concave.
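For example (a worked sketch for the Poisson case, consistent with the grid-search code below): for IID Poisson data, the log-likelihood, up to an additive constant, is logLik(lambda)=log(lambda)*sum(y)-n*lambda. Its first derivative is sum(y)/lambda-n; setting this equal to zero gives lambdaHat=sum(y)/n=mean(y), and the second derivative, -sum(y)/lambda^2, is negative, so the stationary point is a maximum.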
However, in the vast majority of cases we cannot find a closed form for the ML estimates. In these cases we use numerical methods. There are many numerical methods for optimizing a function; we will briefly consider three approaches: grid search, Newton-Raphson, and general purpose optimization algorithms (functions optim and optimize in R).
1. Grid search
In it's simplest form a grid search requires:
- Implementing a function for evaluating the log-likelihood
- Defining a grid of values for the parameters
- Evaluating the log-likelihood for every value in the grid
- Finding the value within the grid that yields the largest value of the likelihood
The following R-code illustrates this for the case of Poisson data
NOTE: In this case the MLE has a closed form (it is simply the sample mean)
## Simulating data
trueLambda=10
y=rpois(lambda=trueLambda,n=100)
## A function to evaluate the Poisson log-likelihood
logLikPois=function(y,lambda){
logLik=log(lambda)*sum(y)-length(y)*lambda
return(logLik)
}
## A grid
myGrid=seq(from=.01,to=100,length=10000)
## Search
logLik=rep(NA,length(myGrid))
for(i in 1:length(myGrid)){
logLik[i]=logLikPois(y=y,lambda=myGrid[i])
print(i)
}
plot(logLik~myGrid,type='l',col=4)
abline(v=mean(y),lty=2,col=2)
If we have a parameter vector instead of a single parameter, our grid will be multi-dimensional (as many dimensions as parameters), but the principle of the grid search is the same.
Grid search is computationally intensive; most of the values in the grid could be ruled out by looking at the derivatives of the function.
The above approach is very crude. If we know that our function is concave, we can set a very sparse grid, then find three points where the interior point gives a higher likelihood than the flanking points. This interval must contain the maximum. We can then refine our grid by making a denser grid within that interval, and iterate in this fashion until our estimate does not change by more than a pre-established precision. A sketch of this refinement idea follows.
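The following sketch (not part of the original script; it reuses logLikPois and the simulated Poisson data y from above) implements this refine-the-grid idea:
# Sketch: iterative grid refinement for the Poisson log-likelihood
gridSearch=function(y,lower=0.01,upper=100,nPoints=10,tol=1e-6){
 repeat{
  grid=seq(from=lower,to=upper,length=nPoints)
  logLik=sapply(grid,function(lambda) logLikPois(y=y,lambda=lambda))
  best=which.max(logLik)
  if((upper-lower)<tol){ return(grid[best]) }
  # shrink the search interval to the points flanking the current best value
  lower=grid[max(best-1,1)]
  upper=grid[min(best+1,nPoints)]
 }
}
gridSearch(y=y) # should be very close to mean(y)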
2. Newton-Raphson (NR) Method
This method can be used to find a stationary point (either a minimum or a maximum) of a twice-differentiable function. Starting from an initial guess theta_0, the NR method moves from theta_0 to theta_1=theta_0+delta_0, where delta_0 is the negative of the ratio of the first and second derivatives of the objective function, both evaluated at theta_0. The method can be motivated using a 2nd-order Taylor expansion of the objective function. We will discuss this further in class.
The following example illustrates the application of the NR method for estimating the parameter of a Poisson distribution.
## Data
## Simulating data
trueLambda=10
y=rpois(lambda=trueLambda,n=100)
## Derivatives
firstD=function(lambda,y){
sum(y)/lambda -length(y)
}
secondD=function(lambda,y){
-sum(y)/(lambda^2)
}
theta=3
tol=1e-5
ready=FALSE
counter=0
while(!ready){
print(theta)
delta=- firstD(theta,y)/secondD(theta,y)
new_theta=theta+delta
ready=(abs(theta-new_theta)<tol)
counter=counter+1
print(paste0(counter,' ', new_theta))
theta=new_theta
}
3. General Purpose Optimization Functions
In this section we illustrate how to obtain MLEs using optim(), a general purpose optimization function that can be used to minimize an objective function with respect to a set of parameters. The inputs for optim are: (i) the function to be minimized (e.g., the negative log-likelihood), (ii) a vector of starting values for the parameters (all the parameters to be estimated must be provided in a single vector), and (iii) additional arguments to the function that are to be kept fixed during the optimization procedure (e.g., data, covariates, etc.). We illustrate the use of optim in the context of logistic regression and censored regression.
3.1. Logistic Regression
Toy simulation
x=rnorm(1000)
risk=2+x*2
prob=exp(risk)/(1+exp(risk))
y=rbinom(prob=prob,n=length(x),size=1)
Logistic Regression using GLM
fm0=glm(y~x,family=binomial)
fm0$coef
Logistic regression using optim
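The call below uses a function negLogLikLogistic that is not defined in this summary; a minimal sketch of such a negative log-likelihood (the implementation used in class may differ) is:
# Sketch: negative log-likelihood for logistic regression
negLogLikLogistic=function(par,y,X){
 eta=X%*%par # linear predictor
 logLik=sum(y*eta-log(1+exp(eta))) # Bernoulli log-likelihood with a logit link
 return(-logLik)
}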
fm1=optim(fn=negLogLikLogistic,y=y,X=cbind(1,x),par=c(log(mean(y)/(1-mean(y))),0))
fm1$par
Inference
Under regularity conditions, the large-sample distribution of ML estimates is multivariate normal with mean equal to the true parameter values (i.e., MLEs are asymptotically unbiased) and (co)variance matrix equal to the inverse of Fisher's Information matrix, which is the negative of the expected value of the matrix of second derivatives of the log-likelihood function with respect to the parameters. This matrix often depends on the true parameter values, which are unknown. Thus, in practice, the Expected Information is replaced with the Observed Information, the negative of the matrix of second derivatives of the log-likelihood evaluated at the ML estimates, and the covariance matrix of the estimates is approximated by its inverse. We can obtain this matrix from optim() by setting hessian=TRUE (since we minimize the negative log-likelihood, the Hessian returned by optim is the Observed Information). The following example illustrates this.
fm1=optim(fn=negLogLikLogistic,y=y,X=cbind(1,x),par=c(log(mean(y)/(1-mean(y))),0),hessian=T)
OI=solve(fm1$hessian) # inverse of the Observed Information = estimated (co)variance matrix of the MLEs
cbind(fm1$par,sqrt(diag(OI)))
summary(fm0)$coef # GLM
3.2. Censored Regression
Biomedical and engineering data are often censored. Censored means that for some subjects we do not know the actual value of the response, but we have information (bounds) about it. There are three types of censoring:
- Right censored: we know that y is greater than some constant, y>c.
- Left censored: we know that y is smaller than some constant, y<c.
- Interval censored: we know that y lies in an interval, c1<y<c2.
As always, the likelihood function is the probability of the data given the parameters, viewed as a function of the parameters with the data fixed. For un-censored data points this is just the density function. For right-censored data points the likelihood is the survival function (i.e., S(c)=p(y>c)=1-CDF(c)). For left-censored data points the likelihood is the CDF, p(y<c)=CDF(c), and for interval-censored data points the likelihood is the difference of the CDF evaluated at the right and left bounds of the interval, p(c1<y<c2)=CDF(c2)-CDF(c1). The following example illustrates how to compute ML estimates with censored data assuming that the data follow a Gamma distribution.
Suppose we have gamma data
x=rgamma(shape=8,rate=.5,n=100)
ML estimation with the complete data
negLogLikGamma=function(x,theta){
logLik=dgamma(x=x,shape=exp(theta[1]),rate=exp(theta[2]),log=TRUE)
return(-sum(logLik))
}
fm=optim(fn=negLogLikGamma,x=x,par=log(c(2,2)))
exp(fm$par)
Suppose now that a proportion of the data are right-censored
event=runif(length(x))<.7
y=x
y[!event]=x[!event]-runif(n=sum(!event),min=1,max=3)
Log-likelihood function for right-censored data
negLogLikGammaRightCen=function(time,theta){
logLik=pgamma(q=time, shape=exp(theta[1]), rate = exp(theta[2]), lower.tail = FALSE, log.p = TRUE)
return(-sum(logLik))
}
logLikGammaCen=function(time,event,theta){ # note: despite the name, this returns the negative log-likelihood
time_observed=time[event]
time_right_cen=time[!event]
negLogLik_1=negLogLikGamma(x=time_observed,theta)
negLogLik_2=negLogLikGammaRightCen(time=time_right_cen,theta=theta)
return((negLogLik_1+negLogLik_2))
}
fm2=optim(fn=logLikGammaCen,time=y,event=event,par=log(c(2,2))) # use the observed (possibly censored) times y
exp(fm2$par)
The survival R-package
negLogLikRightCenNormal=function(y,d,theta){
mu=theta[1]
V=exp(theta[2])
yObserved=y[d==1]
yCensored=y[d==0]
logLik_observed=sum(dnorm(x=yObserved,mean=mu,sd=sqrt(V),log=TRUE))
logLik_censored=sum(pnorm(q=yCensored,mean=mu,sd=sqrt(V),lower.tail=FALSE,log=TRUE))
return(-(logLik_observed+logLik_censored))
}
y=rnorm(n=1000,sd=2,mean=10)
d=ifelse(y>11,0,1)
time=y
time[d==0]=11
fm=optim(fn=negLogLikRightCenNormal,y=time,d=d,par=c(mean(time),log(var(time))))
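For comparison (a sketch, not part of the original notes; it reuses time and d from above), the same right-censored Gaussian model can be fit with survreg from the survival package using dist='gaussian':
# Sketch: fitting the right-censored normal model with survival::survreg
library(survival)
fmSurv=survreg(Surv(time,d)~1,dist='gaussian') # d=1 observed, d=0 right-censored
cbind(c(coef(fmSurv),log(fmSurv$scale^2)), # intercept (mean) and log-variance from survreg
 fm$par) # estimates from optim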
P-values play a central role in hypothesis testing; once a significance level is established (e.g., 0.05), a null hypothesis is rejected if the corresponding p-value is smaller than the significance level. A p-value quantifies the probability, under the null hypothesis, of observing a test statistic at least as extreme as the one obtained from the data. The significance level you establish also affects the power of your experiment. The following table describes the four scenarios you may face; in rows we have the state of nature and in columns your decision. The cells in the diagonal represent decisions that are in agreement with the state of nature (i.e., they are "correct"): you do not reject the null when the null is true and you reject it when the alternative is true. The upper-right cell is the so-called Type-I error, which occurs when you reject the null when the null is true, and the lower-left cell represents the Type-II error (you fail to reject when the alternative is true).

| State of nature | Do not reject H0 | Reject H0 |
|---|---|---|
| H0 is true | Correct decision | Type-I error |
| HA is true | Type-II error | Correct decision |
Estimating Type-I error rate and power
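A minimal sketch (with made-up simulation settings, not part of the original notes) of estimating the Type-I error rate and the power of a one-sample t-test by Monte Carlo:
# Sketch: Monte Carlo estimates of Type-I error rate and power of a t-test
set.seed(30)
nRep=10000; n=20; alpha=0.05
rejectNull=rep(NA,nRep); rejectAlt=rep(NA,nRep)
for(i in 1:nRep){
 y0=rnorm(n,mean=0) # data simulated under the null (true mean is 0)
 y1=rnorm(n,mean=0.5) # data simulated under an alternative (true mean is 0.5)
 rejectNull[i]=t.test(y0)$p.value<alpha
 rejectAlt[i]=t.test(y1)$p.value<alpha
}
mean(rejectNull) # empirical Type-I error rate (should be close to alpha)
mean(rejectAlt) # empirical power at mean=0.5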
Permutation Tests
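A sketch (illustrative settings only) of a two-sample permutation test for a difference in means:
# Sketch: permutation test for a difference in means between two groups
set.seed(40)
groupA=rnorm(30,mean=0); groupB=rnorm(30,mean=0.7)
obsDif=mean(groupB)-mean(groupA)
pooled=c(groupA,groupB)
nPerm=10000
permDif=rep(NA,nPerm)
for(i in 1:nPerm){
 shuffled=sample(pooled) # permute the group labels
 permDif[i]=mean(shuffled[31:60])-mean(shuffled[1:30])
}
mean(abs(permDif)>=abs(obsDif)) # two-sided permutation p-value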
Suggested references:
Multiple testing (see the sketch after the list below)
- Family-wise Type-I error rate
- With independent tests (Bonferroni)
- With 'correlated' tests.
- False Discovery Rate
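A minimal sketch (simulated p-values, illustrative thresholds only) of controlling the family-wise error rate with Bonferroni and the false discovery rate with Benjamini-Hochberg using p.adjust:
# Sketch: multiple-testing adjustments with p.adjust
set.seed(50)
pValues=c(runif(950),rbeta(50,1,50)) # mostly null tests plus a few tests with small p-values
sum(pValues<0.05) # unadjusted discoveries
sum(p.adjust(pValues,method='bonferroni')<0.05) # controls the family-wise Type-I error rate
sum(p.adjust(pValues,method='BH')<0.05) # Benjamini-Hochberg, controls the false discovery rate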