mrc-ide / drjacoby Goto Github PK

Flexible Markov chain monte carlo via reparameterization

Home Page: https://mrc-ide.github.io/drjacoby/

License: Other

R 75.63% C++ 24.37%

drjacoby's Introduction

DrJacoby

drjacoby is an R package for performing Bayesian inference via Markov chain monte carlo (MCMC). In addition to being highly flexible it implements some advanced techniques that can improve mixing in tricky situations. Full details can be found on the drjacoby website.

So far, drjacoby has been used in the following projects:

The paper Estimates of the severity of coronavirus disease 2019: a model-based analysis used drjacoby version 1.0.1
The report COVID-19 Infection Fatality Ratio: Estimates from Seroprevalence used drjacoby version 1.2.0

drjacoby's People

Contributors

Stargazers

Watchers

Forkers

laasousa reda1512 mrc-ide owain-s arthurshem86 ojwatson

drjacoby's Issues

Theta named parameters

Can we ensure that theta is passed around within R as a named vector, so that at least parameters can be accessed with theta["foo"] rather than theta[5]. This would only apply within R, no such luck within C++.

sample_chains() adjustments

Remove samples column
Pass in the DrJ object - allowing for a default sample of chains X samples

Parallel and c++ causes error

When tryniog to run drjacboy in parallel with cluster argument the following error occurs:

could not find function "create_xptr"

Sampler fails when using custom hybrid likelihood function written with a mixture of R and Rcpp

Really great package and documentation! I've finally got around to seeing how drjacoby will do with a hierarchical ODE model (my sampler is struggling).

Hopefully a quick question, but will provide a minimal example if not...

I am running a pared down version of my full model, and drjacoby seems to be working nicely. However, I have two identical custom likelihood functions: one written in R and one in Cpp with an Rcpp interface. See below: if I just swap from likelihood to likelihood_fast, the sampler tries to run off into weird parts of parameter space. Works totally fine with likelihood.

Is there something about the way the C++ code is organized that prohibits Rcpp function use like this? Maybe I'm accidentally accessing/overwriting some variables in the C++ environment that I shouldn't be?

The reason I think it's some conflict in C++ is that the breakdown occurs even if I call the likelihood_fast function but don't use its output. eg., sampling from the prior by having the likelihood function return 0 at the end.

## pars_all is a vector taking a combination of values from params and misc
## solved is some model output
## obs is a list passed via data
## Likelihood in Rcpp -- doesn't work
tmp_lik <- likelihood_fast(solved, obs[[indiv]], pars_all)
## Likelihood in R -- works fine
tmp_lik <- likelihood(solved, obs[[indiv]], pars_all)

Main Github page documentation

Consistency of output names and types

Need to decide between "stage" and "phase" for burnin vs sampling, and then be consistent. Also, is there any advantage to to chains and rungs being output as "chain1" and "rung1", or could we simplify by making these integers?

plot_rung_loglike() not working when rungs=1

plot_rung_loglike() not working when rungs=1, think because beta_raised no longer stored in output$diagnostics when rungs = 1

problem installing drjacoby

Dear Dr Verity,

I have a problem to install the drjacoby.
Message is as below.
package ‘testthat’ successfully unpacked and MD5 sums checked
Error: Failed to install 'drjacoby' from GitHub:
Failed to install 'ggisoband' from GitHub:
(converted from warning) cannot remove prior installation of package ‘testthat’

Thank you.

vignette on reparameterisation

Demonstrate what highly correlated output looks like, making use of some existing plotting functions. Talk about ideas of model specification and how sometimes problems can be improved via reparameterisation. Make sure these are dealy with correctly mathematically, i.e. doesn't end up changing the prior accidentally.

Optionally store data

Currently we store data in the MCMC output, but this could be quite large. Perhaps we want an option to not store data (default is to store it)?

Write C++ functions and read in as text string

Should be able to write a C++ function in a separate file, then read that file in as a character string, rather than defining the function as a string. Would be much more convenient.

Transfer CI to GHA

Given that travis is not really an option anymore, we should probably transfer CI (and code cov?) over to github actions. Am happy to do this at some point if that works for you @bobverity?

plot_par show argument not specific

When running plot_par() with, for example, show = "eta", it will also show all parameters containing this string, for example "beta" and "zeta". Need to make show argument completely specific.

Rcpp string compilation check function

Check that the XPtr package can successfully compile the Rcpp function string (before using in parallelised MCMC)

Problem with plot_par() when parameter name is "x"

Something going wrong with the data wrangling within plot_par(). When the parameter is called "x" it appears to get overwritten by a sequence of increasing values.

Output refactor

Tidy output:to levels in the object list 1. Output and 2. DIagnostics.
Output would therefore be a tidy dataframe over all chains, rungs etc.
Plot function and diagnostics would have to be reformatted accordingly.

Allow easy exit within MCMC

Put in catch function to allow easy exit by Esc

Automatic initial values

Should be possible to automatically generate initial values even for infinite ranges, by simply drawing from uniform distribution and then applying log/logit transformation. Would need documenting.

Improve error message

Very small change - current error message upon NA/Inf likelihood is not very useful. Should list parameters next to their current value at time of crash

Utility functions that can be accessed

As I understand it, the use has access to all drjacoby functions with export tags, even C++ functions. So does this mean if we put export tags on C++ utility functions they can be accessed by the user within their likelihood and prior functions? If so might be very useful, even if just for printing things etc.

Correlation grid plot

Thought of another plot that might be useful - a coloured grid plot of the correlation matrix between all parameters. Useful for diagnosing which parameters are highly correlated.

Parameter summary

Function to return common summaries of all parameters from the output, e.g. ML, MAP, posterior median, credible intervals etc. Think about if/how these should also be part of the standard output.

MCMC diagnostic plots + convergence stats

Error thrown with no intitial conditions and Inf

With infinitie bound specified for parameter, and no initial conditions specified error thrown during fitting.

E.G. running the intro vignette without initial conditions:

df_params <- data.frame(name = c("mu", "sigma"),
                        min = c(-10, 0),
                        max = c(10, Inf))

throws the error: Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases

with traceback:

7: stop("0 (non-NA) cases")
6: lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...)
5: lm(x[, i] ~ z)
4: spectrum0.ar(x)
3: FUN(newX[, i], ...)
2: apply(all_chains, 2, coda::effectiveSize)
1: run_mcmc(data = x, df_params = df_params, loglike = r_loglike, 
       logprior = r_logprior, chains = 1, burnin = 1000, samples = 1000)

Potential bug in likelihood calculation at iteration 0

Potential bug in particle.cpp, function Particle::init_like(). Looks like perhaps this only considers the final parameter?

Checks on non-finite likelihoods

Can we add a simple internal check that the loglikelihood is not inf/nan within the MCMC? If so, make program exit elegantly rather than just crashing.

Effective sample size

Pre-run check on init

Check log_likelihood(init_params) and log_prior(init_params)` a not returning bad values (-Inf/Inf) prior to run. Throw an error if they are.

Non-numeric data

Not sure to what extent we can/cannot handle non-numeric data. Now that we input data as list it's possible that this should be perfectly fine already - it's just a catch within R that is preventing this data type from being allowed. But even if so, this should be checked properly with a test and potentially an example.

Improve rungs vignette

Vignette on parallel tempering currently jumps in at the deep end. Change text to ramp up, starting by talking about issues with correlation and how these can often be fixed with other methods (e.g. reparameterisation). When it comes to temperature rungs, make sure it's clear to the user exactly what order to do things in - i.e. better to get the model mixing nicely over a single output chain before adding more chains.

A proposal for fitting a model with multiple random effects

#Simple GLMM. Based on mosquito death in a experimental hut trial
#Here random effects are crossed, rather than nested.
#But the overall method should apply more generally

#Can provide stan model comparison, if useful
library(ggplot2)
library(drjacoby)

#In this trial, we are evaluating two mosquito nets, one of which is an untreated (control) net
#There are four huts: each night a volunteer sleeps in each hut, under the net.
#The sleepers & the nets are moved around the different huts over the course of 32 nights

#Here's the trial structure. For 'net', 0=untreated control, 1=insecticide-treated net (ITN)
dfm <- data.frame('hut'=rep(c(1,2,3,4),32), 'sleeper' = c(rep(c(1,2,3,4),8), rep(c(4,1,2,3),8),
rep(c(3,4,1,2),8), rep(c(2,3,4,1),8)), 'net' = rep(c(0,0,1,1,1,1,0,0), 16) )
#How many mosquitoes enter each hut each night? Draw from a Poisson distribution
dfm$n <- rpois(128, lambda = 11)
dfm$dead <- NA

#To simulate how many mosquitoes die each night (binomial sampling), we include an effect for the ITN
#We also include a weaker effect, depending on the hut & sleeper
for(i in 1:128){
rho <- -1 + 0.1dfm$hut[i] - 0.2dfm$sleeper[i] + 1.5 * dfm$net[i]#prob on the logistic scale
#whack a bit o' noise on it
rho2 <- rnorm(1,mean = rho, sd = 0.1)
#convert to probability scale
prb <- exp(rho2)/(1+exp(rho2))
dfm$dead[i] <- rbinom(1, size = dfm$n[i], prob = prb)
}

#Now prepare what we need for drjacoby. There are 4 random effects for hut and 4 for sleeper.
#In total, there are 16 combinations of these
M <- 16 # no. of combinations of the two random effects? (4 values each)

df_params <- define_params(name = "hut1", min = -Inf, max = Inf, init = 0, block = c(1,2,3,4,M+1),
name = "hut2", min = -Inf, max = Inf, init = 0, block = c(5,6,7,8,M+1),
name = "hut3", min = -Inf, max = Inf, init = 0, block = c(9,10,11,12,M+1),
name = "hut4", min = -Inf, max = Inf, init = 0, block = c(13,14,15,16,M+1),
name = "sleeper1", min = -Inf, max = Inf, init = 0, block = c(1,5,9,13,M+1),
name = "sleeper2", min = -Inf, max = Inf, init = 0, block = c(2,6,10,14,M+1),
name = "sleeper3", min = -Inf, max = Inf, init = 0, block = c(3,7,11,15,M+1),
name = "sleeper4", min = -Inf, max = Inf, init = 0, block = c(4,8,12,16,M+1),
name = "sigma_hut", min = 0, max = Inf, init = 1, block = M+1,
name = "sigma_sleeper", min = 0, max = Inf, init = 1, block = M+1,
name = "a1", min = -Inf, max = Inf, init = 0, block = 1:M,
name = "a0", min = -Inf, max = Inf, init = 0, block = 1:M)

#Now we have to work out which data points need to go with each block
#empty list for the data
lst <- list()
#empty dataframe
dfa <- data.frame('blck'=rep(0,16), 'blckH'=rep(0,16), 'blckS'=rep(0,16))
count <- 1
for(j in 1:4){
for(k in 1:4){
aux <- dfm[dfm$hut == j & dfm$sleeper == k,]
#preare data for a block, store as a list
lst[[as.character(count)]] <- list(total = aux$n, dead = aux$dead, net = aux$net)
dfa$blck[count] <- count
dfa$blckH[count] <- j
dfa$blckS[count] <- k
count <- count + 1
}
}
#Look at dfa. For each block, which random effect do we need for hut (H) and sleeper (S)?
head(dfa)

#define log-likelihood function
r_loglike <- function(params, data, misc) {

#unpack parameters
#Patient-specific first
hut <- rep(0,4)
sleeper <- rep(0,4)
for (i in 1:4) {
hut[i] <- params[i]
sleeper[i] <- params[4 + i]
}
a1 <- params["a1"]
a0 <- params["a0"]
sigma_hut <- params["sigma_hut"]
sigma_sleeper <- params["sigma_sleeper"]

#get current update block
block = misc$block

#distinct method for first M blocks, vs (M+1)
ret = 0.0;
if (block == (M+1)) { # // likelihood for the global parameters
for(i in 1:4){
ret <- ret + dnorm(hut[i], mean=0, sd=sigma_hut, log=T) +
dnorm(sleeper[i], mean=0, sd=sigma_sleeper, log=T)
}
}else{
tot <- data[[block]]$total
ded <- data[[block]]$dead
nt <- data[[block]]$net
nnn <- length(tot)
for(j in 1:nnn){
#use dfa to index the random effects?
lprob <- a0 + a1*nt[j] + hut[dfa$blckH[block]] + sleeper[dfa$blckS[block]] # on logistic scale
#convert to probability scale
prb <- exp(lprob)/(1+exp(lprob))
ret <- ret + dbinom(ded[j], tot[j], prb, log = T)
}
}
return(ret)
}

#define log-prior function
r_logprior <- function(params, misc) {
sigma_hut = params["sigma_hut"]
sigma_sleeper = params["sigma_sleeper"]
a0 = params["a0"]
a1 = params["a1"]
l1 <- dnorm(x=a0, mean = 0, sd = 2.5, log = T)
l2 <- dnorm(x=a1, mean = 0, sd = 1, log = T)
l3 <- dexp(x = sigma_sleeper, rate = 1, log = T)
l4 <- dexp(x = sigma_hut, rate = 1, log = T)
ret <- l1 + l2 + l3 + l4
return(ret)
}

date()
mcmc <- run_mcmc(data = lst,
df_params = df_params,
loglike = r_loglike,
logprior = r_logprior,
burnin = 2e3,
samples = 6e3,
chains = 3,
#cluster = cl,
silent = F)
date()
plot_par(mcmc, show = c('a0','a1','sigma_hut','sigma_sleeper'), phase = "sampling")
plot_credible(mcmc, show = c('a0','a1','sigma_hut','sigma_sleeper'))
#random effect for sleeper should be (more-or-less) descending (see line 19)
plot_credible(mcmc, show = paste0('sleeper',seq(1,4)))
#random effect for hut should be (more-or-less) asscending (see line 19)
plot_credible(mcmc, show = paste0('hut',seq(1,4)))

Inbuilt likelihoods for fitting Gaussian processes

I have a couple of example on my local machine of likelihoods for fitting 1D and 2D Gaussian processes to data. This is a pretty common problem, and it's also one where the posterior can (in rare situations) end up being multi-modal if there are multiple plausible smoothing levels, which motivates the use of drj.
Wondering if we want to bring these likelihoods inside the package so they are available to user, and write vignettes showing how to use them? More generally, wondering if we want a folder full of example likelihood/prior functions available to user?

Misc object passed to likelihood and prior

Would be useful to allow a separate "misc" object (Rcpp list) to be passed through. For example, for storing lookup tables. I suppose should also have a function called on these misc objects (e.g. initialise_misc()) before MCMC is run to allow initiation of lookups etc?

Likleihoods and DIC

Check that loglikelihoods returned are in fact likelihood only and not likelihood * prior.
We should be able to calculate DIC and store e.g. in diagnostics.

Blocks vignette likelihood not displayed properly

Pass data and params by reference

Can we pass by reference, as would significantly speed up when data/params are large? May not be possible given strange things we are doing with function pointers. Also should try and apply const so that user cannot modify accidentally.

Don't store parameters for all rungs

We are currently storing all likelihoods and all parameters for every temperature rung. I've found that when running MCMCs with large numbers of parameters this makes the output object very large, when we rarely (if ever) need to see the parameter values for hot rungs. We do, however, need to see the likelihood and prior for all rungs. So I think we need to split the output so that likelihood & prior are stored in one object over all rungs, and parameters are stored in another object just for the cold rung.

Benchmarking system

May be worth formally assessing performance at some point:

Turner and Neal (2017).pdf

Correlated proposals

Because we work in transformed parameter space (-Inf to Inf for all parameters) we can easily handle proposals from multivariate normal distribution with correlations. These correlations are typically calcualted during the burn-in from the parameter draws. However, this can create a circularity problem in which poor mixing initially causes the MCMC to tune to the wrong place, and then get worse subsequently. Look into this to see if there are simple fixes that can be implemented that improve inference for correlated parameters while avoiding pitfalls.

Show Rhat use in vignette

Pass data in as Rcpp::List

Should be possible to pass data in as a list, given that it is static. Would require proper versioning as would break any likelihoods passed in using earlier version of the software.

Allow logLike and logPrior functions to call other functions internally

Currently c++ functions for log likelihood and prior only work when stand alone. It may be possible to be more flexible, allowing users to define and call other functions internally. This may be an example:

https://cran.r-project.org/web/packages/rust/vignettes/rust-c-using-rcpp-vignette.html#cpp_fun

Multi-level fitting example

Demonstrate multi-level fitting with hyper-priors (e.g. measles model) in website documentation.

Passing c++ or R function as argument

Some code testing the possibility of passing a Rcpp function as an argument to an Rcpp function and also a speed comparison compared to jumping in and out to the equivalent R function each time:




library(Rcpp)
library(microbenchmark)
library(ggplot2)

cppFunction('double m(std::vector<double> y){
  double s;
  for(unsigned int i = 0; i < y.size(); i++){
    s += y[i];
  }
  return s / y.size();
}')
cppFunction('NumericVector t2(Function f1, int N, std::vector<double> y){
  NumericVector x;
  for(int i = 0; i < N; i++){
    x = f1(y);
  }
  return x;
}')

bm <- microbenchmark(passing_R = t2(mean, 10000, 1:10),
                     passing_cpp = t2(m, 10000, 1:10),
                     times = 20)

ggplot(data = bm, aes(y = time, x = expr, colour = expr)) +
  geom_violin()

bm2 <- microbenchmark(mean_R = mean(1:100),
                     mean_cpp = m(1:100),
                     times = 20)
ggplot(data = bm2, aes(y = time, x = expr, colour = expr)) +
  geom_violin()

Initial parameters argument

Separate initial parameters argument, removing from df_params.
Optional - not needed if stating values can be drawn from min & max. Warning that it is needed if either of min or max = Inf
Input is a list, whcih can contain a vector for each chain of a parameter

Progress bar

Does the txtProgressBar lead to behaviour of jumping in and out of R from Rcpp @bobverity ?

plot_param() takes too long

I think we currently produce all plots, and then display them one at a time. But if we're only really interested in viewing the first couple and then we quit then this ends up being very slow. Can we produce and view them one at a time (while still preserving the ability to save plots to a list rather than plotting)?

Help typo in run_mcmc

Parameter beta_manual help has GTI_code when it shouldbe GTI_pow I think

microbenchmark in example vignette producing misleading results

When C++ functions are passed, these need to be compiled the first time. Therefore there is an initial time cost, after which MCMC runs much faster. microbenchmark comparison currently seems to be recompiling, therefore making it seem like C++ is only ~4 times faster, when in fact more like 30 times faster.
Maybe store MCMC run-time within output, so can be extracted? Then maybe load microbenchmark results rather than re-calculate, to speed up vignette build?

Switch function likelihood method

Sometimes there are intermediate outputs calculated within the likelihood that we want to access from our posterior draws, e.g. model expectations, splines etc. A workaround is to write the likelihood in such a way that output can switch between returning the raw loglikelihood and these intermediate values. This function can then be run directly within R on MCMC output to get model predictions without having to duplicate this calculation, which could introduce errors.

Write vignette to demonstrate this approach on simple model.