
statistical_rethinking_with_brms_ggplot2_and_the_tidyverse_2_ed's Introduction

Statistical rethinking with brms, ggplot2, and the tidyverse: Second edition


Welcome to the sister project of my Statistical Rethinking with brms, ggplot2, and the tidyverse. Our aim is to translate the code from McElreath's second edition to fit within a brms and tidyverse framework. The current 0.4.0 version contains drafts of all 17 chapters, stitched together into a bookdown ebook, which you can access at https://bookdown.org/content/4857/. See issues #16 and #17 for the biggest unresolved problems.

statistical_rethinking_with_brms_ggplot2_and_the_tidyverse_2_ed's People

Contributors

askurz, edavidaja


statistical_rethinking_with_brms_ggplot2_and_the_tidyverse_2_ed's Issues

Model 7.6

In Section 7.1.1, McElreath fit a series of higher-order polynomial models. Most work fine with brms. However, the sixth-order polynomial model 7.6 seems elusive. The catch is it's based on seven data points. McElreath got it to work with rethinking::quap() like this:

# load
library(tidyverse)
library(rethinking)

# define the data
d <- 
  tibble(species = c("afarensis", "africanus", "habilis", "boisei", "rudolfensis", "ergaster", "sapiens"), 
         brain   = c(438, 452, 612, 521, 752, 871, 1350), 
         mass    = c(37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5)) %>% 
  mutate(mass_std  = (mass - mean(mass)) / sd(mass),
         brain_std = brain / max(brain))
 
# fit the model
m7.6 <- quap( 
  alist(
    brain_std ~ dnorm(mu, 0.001), 
    mu <- a + b[1] * mass_std + b[2] * mass_std^2 + b[3] * mass_std^3 + b[4] * mass_std^4 + b[5] * mass_std^5 + b[6] * mass_std^6,
    a ~ dnorm(0.5, 1),
    b ~ dnorm(0, 10)
  ), 
  data = d, start = list(b = rep(0, 6)))

Here's what McElreath wrote about that model on page 198: "That last model, m7.6, has one trick in it. The standard deviation is replaced with a constant value 0.001. The model will not work otherwise, for a very important reason that will become clear as we plot these monsters."

If you have the brms solution, I'd love to see it.
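
For what it's worth, here's an untested sketch of how one might try it with brms: fix sigma with the constant() prior and enter the polynomial terms by hand with I(). The fit name b7.6_sketch is a placeholder, and I haven't verified that this samples well with only seven data points.

library(brms)

# fix sigma at 0.001 and spell out the six polynomial terms
b7.6_sketch <-
  brm(data = d,
      family = gaussian,
      brain_std ~ 1 + mass_std + I(mass_std^2) + I(mass_std^3) +
        I(mass_std^4) + I(mass_std^5) + I(mass_std^6),
      prior = c(prior(normal(0.5, 1), class = Intercept),
                prior(normal(0, 10), class = b),
                prior(constant(0.001), class = sigma)),
      chains = 4, cores = 4, seed = 7)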


current brms limitations to keep an eye on

Watch out for brms updates on these issues. There may come a time when these models will be feasible within brms.

  • Section 4.6, fit b4.8: The current mgcv syntax for b-splines is a mess. This may be a personal limitation. Regardless, keep an eye out for syntax changes and for work by Gavin Simpson on the topic.
  • Section 14.4, fit m14.7: Social relations models (SRM) are not currently possible because brms does not allow covariances among distinct random effects with the same levels (see issue #502).
  • Section 14.5.2, fit m14.11: The Ornstein–Uhlenbeck (OU) process phylogenetic model is not currently possible in brms (see issue #234).

Model 16.5

In Section 16.4.2, McElreath fit a bivariate population dynamics model using ordinary differential equations (ODEs). His proposed statistical model was

[screenshot of the statistical model omitted]

McElreath fit the model with rstan::stan(). His Stan model code comes in the rethinking package. Here's how to fit the model.

# load
library(rethinking)

# data
data(Lynx_Hare)

dat_list <- list(
  N = nrow(Lynx_Hare), 
  pelts = Lynx_Hare[,2:3] )

# Stan code
data(Lynx_Hare_model)

# fit the model
m16.5 <- stan( model_code=Lynx_Hare_model , 
               data=dat_list , 
               chains=3 , cores=3 , 
               control=list( adapt_delta=0.95 ) )

If you execute cat(Lynx_Hare_model), you can see the Stan model code.

functions {
  real[] dpop_dt( real t,                 // time
                real[] pop_init,          // initial state {lynx, hares}
                real[] theta,             // parameters
                real[] x_r, int[] x_i) {  // unused
    real L = pop_init[1];
    real H = pop_init[2];
    real bh = theta[1];
    real mh = theta[2];
    real ml = theta[3];
    real bl = theta[4];
    // differential equations
    real dH_dt = (bh - mh * L) * H;
    real dL_dt = (bl * H - ml) * L;
    return { dL_dt , dH_dt };
  }
}
data {
  int<lower=0> N;              // number of measurement times
  real<lower=0> pelts[N,2];    // measured populations
}
transformed data{
  real times_measured[N-1];    // N-1 because first time is initial state
  for ( i in 2:N ) times_measured[i-1] = i;
}
parameters {
  real<lower=0> theta[4];      // { bh, mh, ml, bl }
  real<lower=0> pop_init[2];   // initial population state
  real<lower=0> sigma[2];      // measurement errors
  real<lower=0,upper=1> p[2];  // trap rate
}
transformed parameters {
  real pop[N, 2];
  pop[1,1] = pop_init[1];
  pop[1,2] = pop_init[2];
  pop[2:N,1:2] = integrate_ode_rk45(
    dpop_dt, pop_init, 0, times_measured, theta,
    rep_array(0.0, 0), rep_array(0, 0),
    1e-5, 1e-3, 5e2);
}
model {
  // priors
  theta[{1,3}] ~ normal( 1 , 0.5 );    // bh,ml
  theta[{2,4}] ~ normal( 0.05, 0.05 ); // mh,bl
  sigma ~ exponential( 1 );
  pop_init ~ lognormal( log(10) , 1 );
  p ~ beta(40,200);
  // observation model
  // connect latent population state to observed pelts
  for ( t in 1:N )
    for ( k in 1:2 )
      pelts[t,k] ~ lognormal( log(pop[t,k]*p[k]) , sigma[k] );
}
generated quantities {
  real pelts_pred[N,2];
  for ( t in 1:N )
    for ( k in 1:2 )
      pelts_pred[t,k] = lognormal_rng( log(pop[t,k]*p[k]) , sigma[k] );
}

Based on @mages's blog post, PK/PD reserving models, I know ODE models are possible with brms. However, this is beyond my current skill set.

drop {urbnmapr} for {tigris}

There are now problems with the urbnmapr::get_urbn_map() function. Happily, one can replace that functionality with functions from the tigris package, which also has the benefit of active development. Add the tigris package to Zotero and update the beginning of Chapter 5.

Note, there is a small difficulty with the resulting workflow for the relevant plot in Section 5.1. Here's the updated workflow:

# load the needed packages
library(tigris)
library(sf)
library(tidyverse)

# get the map data
states(cb = TRUE, resolution = "20m") %>%
  shift_geometry() %>% 
  # add the primary data
  right_join(
    d %>% 
      mutate(NAME = Location %>% as.character()) %>% 
      select(d:a, NAME),
    by = "NAME") %>% 
  # convert to the long format for faceting
  pivot_longer(cols = c("d", "m", "a"), names_to = "variable") %>% 
  
  # plot!
  ggplot() +
  geom_sf(aes(fill = value, geometry = geometry),
          size = 0) +
  scale_fill_gradient(low = "#f8eaea", high = "firebrick4") +
  theme_void() +
  theme(legend.position = "none",
        strip.text = element_text(margin = margin(0, 0, .5, 0))) +
  facet_wrap(~ variable, labeller = label_both) 

The issue is that the data frame is of class sf, and objects of class sf don't pivot nicely with the pivot_longer() function unless you specifically refer to the columns within quotes, as in pivot_longer(cols = c("d", "m", "a")). The details are in this GitHub issue. Just make a note of it.

replacement for sim_happiness

You said:

Quite frankly, I can't make sense of it. So we'll just have to move forward and use the convenience function rather than practicing a tidyverse alternative. If you have a handle on what's going on and have a tidyverse alternative, please share your code.

Challenge accepted. There is a fundamental looping over time, but we can separate out the looping from the annual updates pretty easily. Here's one way to do it.

# load; inv_logit() comes from the rethinking package
library(tidyverse)
library(rethinking)

# Return a data frame with n 1-year-olds of varying happiness
new_borns <- function(n = 20) {
  tibble(
    A = 1,   # 1 year old
    M = 0,   # not married
    H = seq(-2, 2, length.out = n)   # range of happiness scores
  )
}

# Everyone ages;
# Some unmarried adults get married (happier people are more likely to)
# Old people die;
# New people are born.

update_population <- function(
  pop, N_births = 20, aom = 18, max_age = 65) {
  
  pop %>%
    mutate(
      A = A + 1, # everyone gets one year older
      # some people get married
      M = ifelse(M >= 1, 1, (A >= aom) * rbinom(nrow(pop), 1, inv_logit(H - 4)))
    ) %>%
    filter(A <= max_age) %>%    # old people die
    bind_rows(
      new_borns(N_births)            # new people are born
    )
}

# Run the population simulation for 1000 years.
set.seed(1997)
Pop <- new_borns(20)          # year 1
for(i in 2:1000) {            # years 2 through 1000
  Pop <- update_population(Pop, aom = 18, max_age = 65, N_births = 20)
} 

retraction whitehouseComplexSocietiesPrecede2019

The Moralizing_gods data in Section 15.2.3 are from Whitehouse et al. (2019), which is saved in my Zotero as @whitehouseComplexSocietiesPrecede2019. On July 12, 2021, Zotero alerted me that this work had recently been retracted; the details can be found at https://retractionwatch.com/2021/07/07/critique-topples-nature-paper-on-belief-in-gods/. It appears the retraction is, in part, a reaction to this preprint, which, interestingly enough, questions the authors' missing-data methods. This should be noted in the ebook.

Section 12. 4 Figure 12.8

Hi there, thanks so much for publishing this resource! It is an absolute life-saver when trying to work through the Statistical Rethinking course while using brms.

I am opening this issue because I think the labels on your version of Figure 12.8 in Section 12.4 may be incorrect. Perhaps I am misunderstanding the differences between the rethinking and brms packages, but I think the code for the GGally paired plot should be switched from:

library(GGally)

delta_labels <- c("Elem", "MidSch", "SHS", "HSG", "SCol", "Bach", "Mast", "Grad")

posterior_samples(b12.6) %>% 
  select(contains("simo_moedu_new1")) %>% 
  set_names(str_c(delta_labels[2:8], "~(delta[", 1:7, "])")) %>% 
  ggpairs(upper = list(continuous = my_upper),
          diag = list(continuous = my_diag),
          lower = list(continuous = my_lower),
          labeller = label_parsed) +
  theme(strip.text = element_text(size = 8))

to:

library(GGally)

delta_labels <- c("Elem", "MidSch", "SHS", "HSG", "SCol", "Bach", "Mast", "Grad")

posterior_samples(b12.6) %>% 
  select(contains("simo_moedu_new1")) %>% 
  set_names(str_c(delta_labels[1:7], "~(delta[", 1:7, "])")) %>% 
  ggpairs(upper = list(continuous = my_upper),
          diag = list(continuous = my_diag),
          lower = list(continuous = my_lower),
          labeller = label_parsed) +
  theme(strip.text = element_text(size = 8))

where delta_labels is subset with 1:7 instead of with 2:8.

Thanks again for this amazing resource!

residuals and history

McElreath is a little inconsistent in how he talks about residuals. The typical definition of residuals nowadays is

$$r_i = y_i - \hat y_i,$$

but in McCullagh & Nelder (1989, p. 9), we read "Gauss defined the residuals with opposite sign to that in current use, i.e. by $X \beta - y$," which we might express in the notation above as

$$r_{i \text{, old}} = \hat y_i - y_i.$$

In chapters 5 through 7 or so, McElreath seems to use both. For example, on page 135, we read:

And then we compute the residuals by subtracting the observed marriage rate in each State from the predicted rate, based upon the model above (emphasis added)

But his plot on page 136 seems to use the conventional modern definition of residuals. Consider specifically defining residuals in chapter 5 of the ebook, referencing McCullagh & Nelder (1989, p. 9), and commenting on McElreath's usage. Note this issue also comes up with the R2_is_bad() function in chapter 7.
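
As a toy numeric illustration of the two conventions (my own example, not from either text):

observed  <- c(2.1, 3.4, 5.0)
predicted <- c(2.0, 3.9, 4.5)

observed - predicted   # modern convention: r_i = y_i - y_hat_i
predicted - observed   # Gauss's convention, per McCullagh & Nelder (1989)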

An attempt to reproduce Figure 9.3

Hello Solomon, I have been enjoying your work!

Responding to the first paragraph of Section 9.2.2, I am providing a link to my attempt to reproduce Figure 9.3 here.

One thing I want to point out: in the text, McElreath says that the step size for the plot on the left is 0.01. However, the figure indicates that the step size is 0.1. I think the figure is correct.

4.3.6 tidyverse parallel for the transformed base R

In Section 4.3.6, the base R version t(apply(post[, 1:2], 2, quantile, probs = c(.5, .025, .75))) doesn't have a parallel tidyverse version.
One can use a solution like this:

post %>%
  select(b_Intercept, sigma) %>%
  reframe(qs = c(.5, .025, 0.75), across(.cols=c(b_Intercept, sigma), .fns=~quantile(.x, probs=qs)))

The existing tidyverse code uses the superseded function mutate_if(). The code can be updated as follows (only the final mutate() line changes):

post %>%
  pivot_longer(b_Intercept:sigma) %>% 
  group_by(name) %>%
  summarise(mean = mean(value),
            sd   = sd(value),
            `2.5%`  = quantile(value, probs = .025),
            `97.5%` = quantile(value, probs = .975)) %>%
  mutate(across(.cols = where(is.numeric), .fns = ~ round(.x, digits = 2)))

Thanks for an excellent book!

Ch 5 DAGs

For better formatting, consider adjusting the scale_y_continuous() settings to something like

scale_y_continuous(NULL, breaks = NULL, expand = c(0.2, 0.2))

Note that I'm including leading zeros here. The current code in the chapter doesn't, and it probably should.

Also, for the last DAG at the end of Section 5.2.0.1, consider setting labeller = label_both within facet_wrap().

DAG on 15.2.1

It appears that the variables in the four-panel DAG in Section 15.2.1 are not in the correct order. See page 500 in the second edition of Statistical Rethinking.

cross link ebook editions

I always seem to have a hard time finding the 2nd edition directly from a search engine. I suppose this is mostly user error on my part & I ought to just bookmark the page already. But the problem suggests that new users might also have difficulty finding the edition they want. Would you consider adding a link to the other edition right at the top to help folks immediately identify when they've stumbled onto the wrong version? Something like:
[screenshots of the proposed cross-link banners omitted]

Kudos for putting together these lovely resources.

Model 16.2

In Section 16.3.2, McElreath fit a mixture of unobserved strategies. On page 534, we see his statistical model was:

[screenshot of the statistical model omitted]

He fit the model with rstan::stan(), and his Stan model code is saved in the rethinking package. Here's how to fit his model:

# load
library(rethinking)

# data
data(Boxes)
data(Boxes_model)


# prep data 
dat_list <- list(
  N = nrow(Boxes),
  y = Boxes$y,
  majority_first = Boxes$majority_first )

# run the sampler
m16.2 <- stan( model_code=Boxes_model , data=dat_list , chains=3 , cores=3 )

If you execute cat(Boxes_model), you can get McElreath's Stan code:

data{
    int N;
    int y[N];
    int majority_first[N];
}
parameters{
    simplex[5] p;
}
model{
    vector[5] phi;
    
    // prior
    p ~ dirichlet( rep_vector(4,5) );
    
    // probability of data
    for ( i in 1:N ) {
        if ( y[i]==2 ) phi[1]=1; else phi[1]=0; // majority
        if ( y[i]==3 ) phi[2]=1; else phi[2]=0; // minority
        if ( y[i]==1 ) phi[3]=1; else phi[3]=0; // maverick
        phi[4]=1.0/3.0;                         // random
        if ( majority_first[i]==1 )             // follow first
            if ( y[i]==2 ) phi[5]=1; else phi[5]=0;
        else
            if ( y[i]==3 ) phi[5]=1; else phi[5]=0;
        
        // compute log( p_s * Pr(y_i|s) )
        for ( j in 1:5 ) phi[j] = log(p[j]) + log(phi[j]);
        // compute average log-probability of y_i
        target += log_sum_exp( phi );
    }
}

To my eye, it looks like one might be able to fit a model like this with brms if you make a custom likelihood (see here). However, that would be well beyond my current skill set.

poststratification

Reference RoS (Ch 17) as a resource for learning about single-level poststratification. This might go in either Section 13.5.3 or somewhere in Section 13.6.

tables

Replace tables made with knitr::kable() with

  • flextable::flextable() or possibly
  • flextable::flextable() %>% flextable::theme_alafoli()

Use of 'package' argument in data()

Thanks for the nice resource to supplement Statistical Rethinking. I had one suggestion for loading datasets from the rethinking package. Instead of going to the trouble of detaching rethinking after pulling out a dataset, you might be able to do something like:

data(UCBadmit, package="rethinking")

I believe that rethinking is not loaded in that case, but you still get the dataset. It might streamline some future code, though it is probably not worth changing the existing code. (so feel free to close this issue immediately)

4.4.3.1 posterior_samples deprecation

First, thanks for sharing this incredible resource! It's exactly what I've been looking for and I'm certain it will benefit many more people to come.

The following code is provided in Section 4.4.3.1 to display the variance/covariance matrix with sigma included (I'll try to provide a quick reprex):

library(rethinking)
library(brms)
library(tidyverse)

data("Howell1"); d <- Howell1; d2 <- d[d$age >= 18,]

d2 <-
  d2 %>% 
  mutate(weight_c = weight - mean(weight))

b4.3 <- 
  brm(data = d2, 
      family = gaussian,
      height ~ 1 + weight_c,
      prior = c(prior(normal(178, 20), class = Intercept),
                prior(lognormal(0, 1), class = b),
                prior(uniform(0, 50), class = sigma)),
      iter = 28000, warmup = 27000, chains = 4, cores = 4,
      seed = 4,
      file = "fits/b04.03")

posterior_samples(b4.3) %>%
  select(-lp__) %>%
  cov() %>%
  round(digits = 3)
#> Warning: Method 'posterior_samples' is deprecated. Please see ?as_draws for
#> recommended alternatives.
#>             b_Intercept b_weight_c sigma
#> b_Intercept       0.076      0.000 0.000
#> b_weight_c        0.000      0.002 0.000
#> sigma             0.000      0.000 0.036

Created on 2021-09-07 by the reprex package (v2.0.1)

Given this warning, would it help to update with code to use as_draws()? I was thinking something like this:

as_draws_df(b4.3) %>%
  select(-lp__) %>%
  cov() %>%
  round(digits = 3)

But it would be nice to more cleanly drop the hidden reserved variables (.chain, .iteration, .draw):

as_draws_df(b4.3) %>%
  select(b_Intercept, b_weight_c, sigma) %>%
  cov() %>%
  round(digits = 3)

Though this ends with the warning: "Dropping 'draws_df' class as required metadata was removed."
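
One possible workaround (untested here) is to drop the draws_df class up front with as.data.frame(), so the later select() no longer triggers that warning:

as_draws_df(b4.3) %>%
  as.data.frame() %>%
  select(b_Intercept, b_weight_c, sigma) %>%
  cov() %>%
  round(digits = 3)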

b11.11

Try refitting b11.11 without adjusting adapt_delta.

Model 15.8

In Section 15.3.1, McElreath fit a model using what he called the “weighted average” approach to handling missing values for categorical data (p. 517). Here's his model:

# load
library(rethinking)


# data
set.seed(9) 

N_houses <- 100L 
alpha <- 5
beta <- (-3)
k <- 0.5
r <- 0.2
cat <- rbern( N_houses , k )
notes <- rpois( N_houses , alpha + beta*cat )
R_C <- rbern( N_houses , r )
cat_obs <- cat
cat_obs[R_C==1] <- (-9L)

dat <- list(
  notes = notes,
  cat = cat_obs,
  RC = R_C,
  N = as.integer(N_houses) )


# fit the model
m15.8 <- ulam( alist(
  # singing bird model
  ## cat known present/absent: 
  notes|RC==0 ~ poisson( lambda ), 
  log(lambda) <- a + b*cat,
  ## cat NA:
  notes|RC==1 ~ custom( log_sum_exp(
    log(k) + poisson_lpmf( notes | exp(a + b) ),
    log(1-k) + poisson_lpmf( notes | exp(a) ) 
  ) ),
  
  # priors
  a ~ normal(0,1), 
  b ~ normal(0,0.5),
  
  # sneaking cat model 
  cat|RC==0 ~ bernoulli(k),
  k ~ beta(2,2)
), data=dat , chains=4 , cores=4 )

I'm not sure if this is even possible with brms. If you know the answer or, better yet, know how to fit the model, I'd love to see your code.

3.4 brms::fitted() results in "not an exported object from 'namespace:brms'" error

I am new to Bayesian statistics and generally do not have much experience with statistics. In reading McElreath Statistical Rethinking, I enjoy very much your translation to the tidyverse approach. Thank you!

In Section 3.4, you mention the brms::fitted() function twice in one paragraph.

Much like the way we used the samples() function to simulate probability values, above, we can do so with the brms::fitted() function. But we will have to specify scale = "linear" in order to return results in the probability metric. By default, brms::fitted() will return summary information. Since we want actual simulation draws, we’ll specify summary = F.

But brms::fitted() results in an error:

Error: 'fitted' is not an exported object from 'namespace:brms'

There is no error in the following code chunk, but it just uses fitted(). Could it be that the base R generic stats::fitted() is called there and that brms::fitted() does not exist?
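
One way to check this from the R console (after loading brms):

library(brms)

# brms provides fitted.brmsfit(), a method for the stats::fitted() generic,
# rather than exporting a standalone fitted() function
getS3method("fitted", "brmsfit")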

Please use `linewidth` instead.

This is largely fixed. However, two main types of cases remain:

  • The linewidth parameter isn't fully supported by tidybayes and ggdist, yet. Based on GitHub (see here), it looks like this will be fixed in the next release of ggdist.
  • There are a few cases (e.g., Section 7.3) where I used scale_size_manual(), which should be replaced by scale_linewidth_manual(). However, there was a goof and the current version of ggplot2 doesn't have an option for scale_linewidth_manual() (see here). I'm hoping this will be fixed in the next release of ggplot2.

dplyr::lag()

In Section 2.2.2, make sure you use the n argument within lag(). The current code uses k, as if you were using the lag() function from the stats package. You, however, are using dplyr::lag(), which uses n, instead.

Credit to @JFern83, who alerted me in a Twitter DM, for the catch.
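
A quick toy illustration of the argument difference (my own example):

library(dplyr)

x <- 1:5
lag(x, n = 2)  # dplyr::lag() uses `n` for the shift
#> [1] NA NA  1  2  3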

Ch 6.3.1 sim_happiness function

I would like to propose a tidyverse version of the sim_happiness() function, as below. Note that it is slow.

# load; inv_logit_scaled() comes from brms
library(tidyverse)
library(brms)

sim_happiness_tidyverse <- function(seed = 1977, N_years = 1000, max_age = 65, N_births = 20, aom = 18) {
  set.seed(seed)
  
  # initial state
  d <- tibble(
    age = rep(1:max_age, each = N_births),
    happiness = rep(seq(-2, 2, length.out = N_births), max_age),
    married = 0L
  )
  
  # repeat one year later for N_years
  for (i in seq_len(N_years)) {
    d <- d %>% 
      # one year later
      mutate(
        age = age + 1,
        married = if_else(age >= aom & married == 0,
                          rbinom(n(), size = 1,
                                 prob = inv_logit_scaled(happiness - 4)),
                          married)) %>% 
      # death and birth
      filter(age <= max_age) %>% 
      bind_rows(
        tibble(
          age = 1L,
          happiness = seq(-2, 2, length.out = N_births),
          married = 0L
        )
      )
  }
  
  d
}
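
For completeness, here's a usage sketch (my addition):

d_happy <- sim_happiness_tidyverse(seed = 1977, N_years = 1000)
glimpse(d_happy)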

Quarto port and content update for 2023 lectures + beginning of 3rd edition

With the start of the 2023 flavor of the course, I've started porting the book over to quarto in a fork. I see that the contributing guide asks for that sort of thing to be mentioned here over pull requests. I'm more than happy to PR my changes, but I completely understand if you'd prefer I didn't.

One of the biggest benefits of the Quarto port and publishing with Quarto Pubs is a major reduction in the repo size, since the _book folder no longer needs to be kept around.

Happy to hear your thoughts on how / if you'd like to collaborate on this.

R2_is_bad() uses internal function var2() and has the residual reversed

These both match what is in the text, but I think they are bad ideas.

  • Readers cannot be expected to know what var2() does, and the documentation says: "These functions are not intended to be called directly." (In a cleaner implementation of the package, they would not be exported.)
  • The residual should be observed - predicted, not the other way around.
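
Here is a minimal sketch of what a self-contained alternative might look like (my own code, not the book's; the helper name bad_r2() and the example fit b7.1 are hypothetical):

# define the population-variance helper inline rather than leaning on the
# unexported rethinking:::var2(), and take residuals as observed - predicted
bad_r2 <- function(observed, predicted) {
  var2  <- function(x) mean(x^2) - mean(x)^2  # divides by n rather than n - 1
  resid <- observed - predicted               # conventional sign
  1 - var2(resid) / var2(observed)
}

# hypothetical usage with a brms fit and the brain-size data from Chapter 7:
# bad_r2(observed = d$brain_std, predicted = fitted(b7.1)[, "Estimate"])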

Section 4.4.3.5 Using `nesting` when doing posterior calculations

In Section 4.4.3.5.1 Overthinking: Rolling your own predict(), you use nesting(b_Intercept, b_weight_c, sigma) to calculate prediction intervals.

This seems to drop duplicate posterior draws. Is this the correct behavior? Shouldn't we use all draws from the posterior to do inference? That is, if we've already checked that there are no convergence issues.

# There are 4,000 samples from the posterior.
post %>%
  nrow()
#> [1] 4000

# But 3,949 unique samples.
post %>%
  select(b_Intercept, b_weight_c, sigma) %>%
  distinct() %>%
  nrow()
#> Warning: Dropping 'draws_df' class as required metadata was removed.
#> [1] 3949

# By nesting we drop the duplicates.
post %>%
  expand(
    nesting(b_Intercept, b_weight_c, sigma)
  ) %>%
  nrow()
#> [1] 3949

# But perhaps we should keep all the draws?
# Or thin the draws... If duplicates remain (very unlikely), keep them.
post %>%
  crossing(
    weight = 50 - mean(Howell1$weight)
  ) %>%
  nrow()
#> [1] 4000

ggplot2 facet syntax

Add spaces around tildes when faceting: facet_wrap(~ factor) and facet_grid(factor1 ~ factor2).

Still working on 2nd edition?

Thanks for all your work on "Statistical Rethinking with brms, ggplot2, and the tidyverse"! I started reading the book and was happy to find your project. I am reading the second edition of the book and your project for the 1st edition is still very useful as (at least in the first chapters) there seems to be a lot of overlap. I wondered whether you were working on a version for the 2nd edition of the book and found this repository. Are you still planning to release this? Thanks so much!

Chapter 10.1.1: plot of shape and entropy

I think the issue with the slope looking different to that in the book is because you're computing entropy with a grid approximation. When I used the exact formula for the entropy of the generalised normal distribution (also on the Wikipedia page) I got something that looked much more like the book. Here's my code (my styling is slightly different to yours as I'm sticking more closely to the book's look, although I like the Studio Ghibli palette):

tibble(beta = seq(1, 4, length.out = 100)) %>% 
  mutate(alpha = solve_for_alpha(beta)) %>% 
  mutate(entropy = (1 / beta) - log((beta) / (2 * alpha * gamma(1 / beta)))) %>% 
  ggplot(aes(beta, entropy)) + 
  geom_line(colour = "steelblue") + 
  geom_vline(xintercept = 2, linetype = 2, colour = "grey50") + 
  scale_x_continuous("shape", breaks = seq(1, 5, by = 0.5))

I only noticed the difference because I was originally computing entropy on a grid as well, but a much finer one. When I got much higher values for the entropy than the book it got me digging around to figure out why.

typos

  • 15.1.3: "To get a sense of the havoc ignoring measurement error can cause, we’ll fit [two] models."
  • 15.5.7: Just below the forest plot, we read: "I’m not aware this [is] typical in random effect meta-analyses"

broom::tidy()

Newer versions of broom::tidy() no longer support brms. De-emphasize this approach in the later chapters.

Splines, section 4.6

Howdy,

One area you asked for help with was Section 4.6, on splines. The code below should reproduce m4.7 using brms. It's not tidyverse style, but it works.

The trick is to bind d2$doy and B, as each column of B is a predictor. You can see this in McElreath's code, as he uses B %*% w in quap.

Here's the code:

library(brms)
library(splines)
library(tidyverse)

d <- read.csv("https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/cherry_blossoms.csv", 
              sep = ";")
d2 <- subset(d, !is.na(doy))



num_knots <- 15
knot_list <- quantile(d2$year, probs = seq(from = 0, to = 1, length.out = num_knots))

B <- bs(d2$year,
        knots = knot_list[-c(1, num_knots)], 
        degree = 3, 
        intercept = TRUE)

d3 <- data.frame(doy = d2$doy, 
                 B)

m4.7 <- brm(data = d3,
            family = gaussian,
            formula = doy ~ 1 + .,
            prior = c(prior(normal(100, 10), class = Intercept),
                      prior(normal(0, 10), class = b),
                      prior(exponential(1), class = sigma)),
            cores = 4,
            chains = 4,
            seed = 62531,
            iter = 2000,
            warmup = 500)

summary(m4.7)

fits <- fitted(m4.7)
d2 <- cbind(d2, data.frame(fits))

ggplot(d2) + 
  geom_line(mapping = aes(x = year,
                          y = Estimate)) +
  geom_ribbon(mapping = aes(x = year,
                            ymin = Q2.5,
                            ymax = Q97.5),
              alpha = 0.5) +
  labs(x = "year",
       y = "day in year") +
  geom_hline(yintercept = mean(d2$Estimate), linetype = 2) +
  theme_minimal()

I hope that helps.

Steve

negative binomial 12.1.2

Somewhere in Section 12.1.2, the section on the negative binomial model, enter this explicit explanation on the gamma portion of the gamma-poisson model:

I don’t believe McElreath plainly explained this in the text, but when we talk about the gamma-Poisson model as having two parameters, $\lambda$ and $\phi$, the $\lambda$ parameter is doing double duty. As with conventional Poisson models, $\lambda$ is the mean for the criterion. What might not be as clear is that $\lambda$ is also the mean of the gamma mixture distribution. So when you want to get a sense of the overall shape of the gamma-Poisson model-implied distribution of $\lambda$ parameters (one for each case in the data), that distribution is gamma with a mean of $\lambda$ and a shape of $\phi$. For a more detailed walk-through of this, see Section 8.2 in Hilbe (2011).

It might be best to just add this as a footnote. Also, remember to add Hilbe (2011) to the reference list.
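
To make the relationship concrete, here's a small simulation sketch of my own (parameter values arbitrary) showing the gamma-Poisson as an explicit gamma mixture of Poissons:

set.seed(12)

mu  <- 4  # lambda, the mean
phi <- 2  # the shape parameter

# draw case-specific rates from a gamma with mean mu and shape phi,
# then draw counts from Poissons with those rates
lambda_i <- rgamma(1e5, shape = phi, rate = phi / mu)
y_mix    <- rpois(1e5, lambda = lambda_i)

# the same distribution, drawn directly as a negative binomial
y_nb <- rnbinom(1e5, mu = mu, size = phi)

# the means and variances should agree closely
c(mean(y_mix), mean(y_nb))
c(var(y_mix), var(y_nb))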

sample_n() has been superseded

From the help:

sample_n() and sample_frac() have been superseded in favour of slice_sample(). While they will not be deprecated in the near future, retirement means that we will only perform critical bug fixes, so we recommend moving to the newer alternative.

No harm (at least for now) in sticking with sample_n(), and the name is nicer, but Hadley's preference is that folks migrate to slice_sample().
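
For reference, the swap is mechanical (d here is just a placeholder data frame):

d %>% sample_n(size = 5)   # superseded
d %>% slice_sample(n = 5)  # preferred replacement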

Bonus Waffle House example -- which models are "DAG endorsed"?

From the end of chapter 6...

The default output of adjustmentSets() produces only the minimal sets of variables to condition on. But there may be other adjustment sets that are not minimal. Setting type = "all" will list them.

library(dagitty)
dag_6.2 <- 
  dagitty("dag {
  A -> D
  A -> M -> D
  A <- S -> M
  S -> W -> D
           }"
  )

paths(dag_6.2, from = "W", to = "D")
#> $paths
#> [1] "W -> D"                "W <- S -> A -> D"      "W <- S -> A -> M -> D"
#> [4] "W <- S -> M -> D"      "W <- S -> M <- A -> D"
#> 
#> $open
#> [1]  TRUE  TRUE  TRUE  TRUE FALSE
adjustmentSets(dag_6.2, exposure = "W", outcome = "D")
#> { A, M }
#> { S }
adjustmentSets(dag_6.2, exposure = "W", outcome = "D", type = "all")
#> { A, M }
#> { S }
#> { A, S }
#> { M, S }
#> { A, M, S }

Created on 2021-03-25 by the reprex package (v1.0.0)

It might be good to mention this so users are aware that the minimal sets are not the only ones endorsed by the DAG. It might even be worth adding a model that is endorsed but not minimal, as sketched below.
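
Here is a hypothetical sketch of what such a model might look like with brms, conditioning on the endorsed-but-not-minimal set {A, S}. The standardized variable names (D, W, A) and the South indicator S follow the WaffleDivorce setup used earlier in the chapter, and the priors are my own assumptions:

library(brms)

b_waffle_AS <-
  brm(data = d,
      family = gaussian,
      D ~ 1 + W + A + S,
      prior = c(prior(normal(0, 0.2), class = Intercept),
                prior(normal(0, 0.5), class = b),
                prior(exponential(1), class = sigma)),
      chains = 4, cores = 4, seed = 6)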
