
Comments (47)

bob-carpenter avatar bob-carpenter commented on May 29, 2024 1

Add missing data imputation for multivariate normal. Use the bivariate example, modified slightly from the one @betanalpha discussed on the stan-users list. It marginalizes out the missing data. In this particular case, a total of N_miss instances of y have y[n,2] missing:

data {
  int<lower=0> N_nonmiss;
  vector[2] y_obs[N_nonmiss];  // both components observed
  int<lower=0> N_miss;
  real y1_obs[N_miss];         // y[n,1] observed, y[n,2] missing
}
parameters {
  vector[2] mu;
  cov_matrix[2] Sigma;
}
model {
  // Prior on mu
  // Prior on Sigma, see the manual for examples

  y_obs ~ multi_normal(mu, Sigma);
  y1_obs ~ normal(mu[1], sqrt(Sigma[1, 1]));
}

We can also sample y[n,2] using the analytic conditional p(y[n,2] | y[n,1], mu, Sigma) given here:

https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Conditional_distributions
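
A minimal sketch of that, assuming the data and parameters blocks above; the generated quantities block and the name y2_miss are illustrative:

generated quantities {
  vector[N_miss] y2_miss;
  for (n in 1:N_miss) {
    // conditional of a bivariate normal:
    //   y2 | y1 ~ normal(mu[2] + Sigma[2,1] / Sigma[1,1] * (y1 - mu[1]),
    //                    sqrt(Sigma[2,2] - Sigma[2,1]^2 / Sigma[1,1]))
    real cond_mean = mu[2] + Sigma[2, 1] / Sigma[1, 1] * (y1_obs[n] - mu[1]);
    real cond_sd = sqrt(Sigma[2, 2] - square(Sigma[2, 1]) / Sigma[1, 1]);
    y2_miss[n] = normal_rng(cond_mean, cond_sd);
  }
}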

And be sure to emphasize, as Michael says, that the missingness is a modeling problem, not just an afterthought. We can compare this all to the brute-force approach to missing data currently in the manual and talk about how the analytic marginalizations speed things up.

I'll probably also include the brute-force approach and change the prior to a Cholesky factor.

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

Originally from Jeffrey Arnold here: stan-dev/stan#1081 (comment)

Regarding the Kalman filter examples, I have examples and fully implemented Kalman filtering (of several flavors: with / without missing values, as batch / sequentially) with backward sampling in the generated quantities block here: https://github.com/jrnold/ssmodels-in-stan. I could probably write up that section.

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
  • reparameterize Gaussian process example to use length scale with a proper prior (e.g., half-t(4)).
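
A minimal sketch of what that reparameterization might look like; the data names and unit prior scales are illustrative, and the half-t(4) on the length scale comes from the lower bound on rho:

data {
  int<lower=1> N;
  real x[N];
  vector[N] y;
}
parameters {
  real<lower=0> rho;      // length scale
  real<lower=0> alpha;    // marginal standard deviation
  real<lower=0> sigma;    // observation noise
}
model {
  matrix[N, N] K = cov_exp_quad(x, alpha, rho)
                   + diag_matrix(rep_vector(square(sigma), N));
  rho ~ student_t(4, 0, 1);    // proper half-t(4) prior
  alpha ~ normal(0, 1);
  sigma ~ normal(0, 1);
  y ~ multi_normal(rep_vector(0, N), K);
}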

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
  • add horseshoe prior example using Allen's implementation (a sketch follows after this list):

https://ariddell.org/horseshoe-prior-with-stan.html

  • explore what happens with L1 priors in Stan's MLE to see if we get truly 0 values or close; I'm not sure what L-BFGS will do
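
A minimal sketch of the horseshoe prior on regression coefficients; the data names x, y, N, D and the hyperprior scales are illustrative (see Allen's post above for the full treatment):

data {
  int<lower=0> N;
  int<lower=0> D;
  matrix[N, D] x;
  vector[N] y;
}
parameters {
  vector[D] beta;
  vector<lower=0>[D] lambda;   // local shrinkage
  real<lower=0> tau;           // global shrinkage
  real<lower=0> sigma;
}
model {
  lambda ~ cauchy(0, 1);
  tau ~ cauchy(0, 1);
  beta ~ normal(0, tau * lambda);
  sigma ~ cauchy(0, 2.5);
  y ~ normal(x * beta, sigma);
}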

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
  • add section or new chapter about arithmetic precision
  • include following example I sent to stan-users:

For example, this model is problematic arithmetically:

fit <- stan(model_code = "parameters { real y; } model { y ~ normal(1,1e-20); }")

and produces

          mean  se_mean       sd      2.5%       25% 50% 75% 97.5% n_eff         Rhat
y            1      0.0        0         1         1   1   1     1  4000          Inf
lp__ -15407440 422002.9 26689805 -61629758 -15407440   0   0     0  4000 1.500106e+14

There's really nothing we can do about such models numerically. What you have to do from a user perspective is use the parameterization z = y - 1, which can be handled stably because numbers close to zero can be represented well.

parameters { real z; } model { z ~ normal(0,1e-20); }

The problem is that there's still no way to represent y = z + 1 as anything other than 1 in double-precision floating-point arithmetic when z is that small, but you can get close to log(y) by using

  log_y <- log1p(z);

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

Provide examples of log-mix for the manual mixture section.
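
For instance, a minimal two-component normal mixture written with log_mix; the data names, the ordered locations, and the fixed unit scales are illustrative:

data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real<lower=0, upper=1> theta;
  ordered[2] mu;    // ordered to avoid label switching
}
model {
  mu ~ normal(0, 10);
  for (n in 1:N)
    target += log_mix(theta,
                      normal_lpdf(y[n] | mu[1], 1),
                      normal_lpdf(y[n] | mu[2], 1));
}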

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

From Andrew on stan-dev, an example of a model for the problematic posteriors section:

Next I decided, just for fun, to fit a nonlinear model: y = a*exp(-b*x) + c + error, with N(0,sigma) errors. I simulated some data (x taking on the values between 0 and 23, setting a=.5, b=1/10, c=1, sigma=.2), and I fit it constraining all the parameters to be positive. (In retrospect I’m thinking I should’ve just constrained b and sigma to be positive; constraining a and c made sense in the context of my hypothetical example, but in general it’s poor practice to put in hard constraints unless absolutely necessary.)

I simulated data and fit the model and all was fine. But then I repeated and the fit blew up. I was getting some estimates of b that were either really close to 0 (1e-10, etc.) or really high (1e10, etc.), I can’t remember which. The point is that there can be instability at the extremes, and I realized the model needs a bit of hand-holding. As a kluge I constrained b to be between 1/50 and 1 (which made sense in the context of the example). A hard constraint was ugly but was easy enough to understand. I didn’t want to confuse the class by introducing a normal prior here.

So this second example was a bit of a mess. Which suggests to me that in the manual/book we need to throw in an example to explicitly address instability and constraints. Since, ultimately, this is an inherent difficulty with modeling, it’s not anything to do with Stan in particular.

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
  • add discussion of K vs. (K-1) coefficient vector parameterizations of multi-logit

Here's what the model should look like, but there's a bug in Stan 2.6 preventing the compilation:

data {
  int<lower=2> K;             // num categories
  int<lower=1> D;             // num predictors (without intercept)
  int<lower=0> N;             // num observations
  vector[D] x[N];              // predictors
  int<lower=1,upper=K> y[N];  // observations
}
transformed data {
  vector[1] zero;
  zero[1] <- 0;
}
parameters {
  vector[K-1] alpha;          // intercepts
  matrix[K-1,D] beta;         // slopes
}
model {
  // priors
  alpha ~ normal(0,20);
  to_vector(beta) ~ normal(0,10);
  // likelihood
  for (n in 1:N) {
    y[n] ~ categorical_logit(append_row(zero, alpha + beta * x[n]));
  }  
}

Here's a program that will work, but isn't optimally efficient:

data {
 int<lower=2> K;             // num categories
 int<lower=1> D;             // num predictors (without intercept)
 int<lower=0> N;             // num observations
 vector[D] x[N];              // predictors
 int<lower=1,upper=K> y[N];  // observations
}
transformed data {
 vector[1] zero;
 zero[1] <- 0;
}
parameters {
 vector[K-1] alpha;          // intercepts
 matrix[K-1,D] beta;         // slopes
}
model {
 // temps
 vector[K] lambda;
 lambda[1] <- 0;

 // priors
 alpha ~ normal(0,20);
 to_vector(beta) ~ normal(0,10);
 // likelihood
 for (n in 1:N) {
   for (k in 2:K)
     lambda[k] <- alpha[k-1] + beta[k-1] * x[n];
   y[n] ~ categorical_logit(lambda);
 }
}

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
  • Add a chapter on using Stan for pure Monte Carlo integration
Computing pi
/**
 * Compute the mathematical constant pi by Monte Carlo integration.
 *
 * First, sample (x,y) uniformly from a square with vertices (-1,-1),
 * (-1,1), (1,-1), and (1,1) by sampling x and y independently from the
 * interval (-1,1).
 * 
 * Second, determine if (x,y) lies within the circle inscribed in
 * the square, by testing if (x^2 + y^2 < 1). Then multiply the
 * proportion of points inside the circle by the area of the square,
 * which is 4.
 *
 * The posterior mean of the generated quantity pi_hat will estimate pi
 * to within MC error.
 *
 * Run with fixed parameter algorithm and zero warmup iterations.
 */
model {
}
generated quantities {
  real<lower=-1,upper=1> x;
  real<lower=-1,upper=1> y;
  real<lower=0,upper=4> pi_hat;  // named to avoid clashing with the built-in pi()
  x <- uniform_rng(-1,1);
  y <- uniform_rng(-1,1);
  pi_hat <- 4 * (x^2 + y^2 < 1);
}
Generalization to volume of hypersphere
/**
 * Compute the volume of a hypersphere of a given radius in a
 * given number of dimensions by Monte Carlo integration.
 *
 * The volume will be the mean of the sampled sphere_volume values.
 */
data {
  int<lower=1> N;
  real<lower=0> r;
}
transformed data {
  real cube_volume;
  cube_volume <- (2 * r)^N;
}
model {
}
generated quantities {
  vector<lower=-1, upper=1>[N] x;
  real<lower=0,upper=cube_volume> sphere_volume;

  for (n in 1:N)
    x[n] <- uniform_rng(-1,1);

  // x is drawn in the unit cube (-1,1)^N, so test against the unit sphere;
  // scaling by cube_volume = (2r)^N converts to a radius-r sphere
  sphere_volume <- cube_volume * (dot_self(x) < 1);
}

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
  • add example of WAIC calculation
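
The key ingredient is a pointwise log likelihood in generated quantities; a minimal sketch, assuming a simple normal model with data vector y of size N and parameters mu and sigma (the WAIC itself would be computed from the log_lik draws outside the Stan program):

generated quantities {
  vector[N] log_lik;
  for (n in 1:N)
    log_lik[n] = normal_lpdf(y[n] | mu, sigma);
}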

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
  • add example of posterior predictive checks
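
A minimal sketch, again assuming a simple normal model with data vector y of size N and parameters mu and sigma; the replicated data y_rep can then be compared to y graphically or through test statistics:

generated quantities {
  vector[N] y_rep;
  for (n in 1:N)
    y_rep[n] = normal_rng(mu, sigma);
}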

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
  • add example of pure cross-validation
  • discuss leave-one-out cross-validation and relate to WAIC (see Aki and Andrew's paper)

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
  • include an example of a spline model from BDA 3 (I still don't understand the description in the book, though, so hopefully someone else can code an example)
  • add Kalman filter examples (again, this is something I'm not that familiar with, but should be doable in Stan)

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
  • review all the recommendations w.r.t. Cholesky factorization to bring them up to current best practice in terms of computational and statistical efficiency

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
  • add a discussion of priors for degrees of freedom

From Aki on stan-users:

I recommend as an easy default option
nu ~ gamma(2,0.1);

This was proposed and analysed by Juárez and Steel (2010) (Model-based clustering of non-Gaussian panel data based on skew-t distributions. Journal of Business & Economic Statistics 28, 52–66.). Juárez and Steel compare this to the Jeffreys prior and report that the difference is small. Simpson et al. (2014) (arXiv:1403.4630) propose a theoretically well justified "penalised complexity (PC) prior", which they show to have good behavior for the degrees of freedom, too. The PC prior might be the best choice, but it requires numerical computation of the prior (which could be computed on a grid and interpolated, etc.). It would be feasible to implement in Stan, but it would require some work. Unfortunately no one has compared the PC prior and this gamma prior directly, but based on discussion with Daniel Simpson, although the PC prior would be better, this gamma(2,0.1) prior is not a bad choice. Thus, I would use it until someone implements the PC prior for the degrees of freedom of the Student's t in Stan.
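
A minimal sketch of that default in a Student-t model; the data names and the location and scale priors are illustrative:

data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
  real<lower=0> nu;
}
model {
  nu ~ gamma(2, 0.1);
  mu ~ normal(0, 10);
  sigma ~ cauchy(0, 2.5);
  y ~ student_t(nu, mu, sigma);
}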

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
  • more general discussion of missing data, perhaps following Michael's comment included in the 2.4 manual issue

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
  • emphasize the role of n_divergent in the manual (I'd really like to rename this, because its value is always 0 or 1, so the name seems misleading).

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
  • add following advice from Andrew

I’d suggest an informative prior, for example if you think the elasticity should be between 0 and 1, you could use a normal prior with mean 0.5 and sd 0.5. The point is that you’d like to let the data speak but at the same time you want to control your estimate. If you do get a negative elasticity estimate, so be it (sometimes such things can happen due to unforeseen interactions in the data) but then you’ll probably also have a big standard error. And if the data really “want” to find a negative elasticity with a small s.e., you’d like to know this—as, at the very least, this will tell you that there’s something weird about the data. Better to let the data speak than to get an estimate right on the boundary which still doesn’t fit the data.
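
A minimal sketch of that advice in a log-log regression, where the coefficient is the elasticity; all of the names and the other priors are illustrative:

data {
  int<lower=0> N;
  vector[N] log_price;
  vector[N] log_quantity;
}
parameters {
  real alpha;
  real elasticity;        // deliberately not constrained to be positive
  real<lower=0> sigma;
}
model {
  elasticity ~ normal(0.5, 0.5);   // informative, but not a hard constraint
  alpha ~ normal(0, 10);
  log_quantity ~ normal(alpha + elasticity * log_price, sigma);
}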

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

Originally issue #1431 (from Titus van der Malsburg):

The term local variable is used a lot in the manual but it is difficult to extract a complete definition. For example, section 25.7 says:

At the beginning of each block, local variables may be declared that are scoped over the rest of the
statements in the block.

However, since all variables are defined at the beginning of a block, that would mean that all variables are local, which is clearly not the case. On the other hand, the section on variable scope explains that variables declared in one program block are visible in subsequent blocks, except for variables declared in the model block. This section, however, fails to mention that this is not the case for variables declared in nested blocks.

I think it would help if there was one place in the manual that gives the whole story. It may also make sense to use this opportunity to clarify why the term global variable isn't used (only variables in the data block are global).

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

General discussion of efficiency of restart so users don't feel like they're losing so much time.

Everyone keeps asking about whether Stan can just "keep going" and do more sampling.

You absolutely cannot just keep going. Nor do you want to. The problem is that if you haven't reached convergence, then you want to do more warmup, and then more sampling. So you'd have to toss out the sampling part of the run anyway.

If you run 1000 warmup iterations and 1000 sampling iterations and haven't converged, then you want to go back and run 2000 warmup iterations and 2000 sampling iterations. If you were to restart, after the original warmup iterations, you'd save around 15% of your work. Here's an example with the default number of iterations (which is probably way too high):

Rerun from beginning, doubling sizes

   1000 + 1000
   2000 + 2000
   -----------
   3000 + 3000

Restart warmup (warmup hasn't converged):

   1000        + 1000
      0 + 1000 + 2000
   ------------------
   2000        + 3000

The exception to this is when you have converged during warmup, but just haven't drawn enough iterations for the n_eff you need. In that case, you'd get the following.

Resample (sorry for header --- GitHub flavored markdown glitch in action)

   1000 + 1000
      0 + 1000
   -----------
   3000

Restart sampling (warmup converged)

   1000 + 1000
   1000 + 2000
   -----------
   5000

So you'd save about 40% of your total effort if you could just run more samples.

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
  • Add least-absolute-value regression --- just needs DoubleExponential noise distribution
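
A minimal sketch; the data names are illustrative, and the double-exponential (Laplace) error distribution makes the posterior mode the least-absolute-deviation fit:

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ double_exponential(alpha + beta * x, sigma);
}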

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

Andre Pfeuffer asked about this on stan-users:

if (theta < -5 || theta > 5)
  increment_log_prob(-1.0e10);

I wrote back

Aside from the modeling issue, that statement will, perhaps surprisingly, have absolutely no effect on sampling. The problem is that it's branching on the value of a parameter, then adding a constant to the log density. Constants added to the log density never do anything to sampling. The log density is either lp or lp - 1e10, neither of which is an expression involving theta. It got there via examining theta, but it has no way of exploiting that discontinuity at theta = 5 or theta = -5.

from docs.

maverickg avatar maverickg commented on May 29, 2024

I don't get the argument for why the if condition has no effect on the posterior. It indeed has no way of exploiting the discontinuity at theta = (-)5, but whether the log posterior is lp or lp - 1e10 does involve the value of theta.

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

What I first said was wrong.

I was right in that it won't affect the gradient, but wrong in that it will affect the posterior by changing the slice sampling in NUTS (or the Metropolis step in HMC). It'll have a similarly indirect effect on optimization, through the termination criteria.

As to why it doesn't affect the gradients: the generated C++ looks just like the Stan code, so there's nothing connecting the log probability accumulator back to theta. Theta is used in the conditional, but not tied into the result.

So if you think of the expression tree, when I do this

increment_log_prob(-sigma);
increment_log_prob(-square((y - mu) / sigma));

you get an expression that looks like:

lp == 0 - sigma - square((y - mu) / sigma);

So the derivative w.r.t. sigma and y and mu is defined.

But if you

if (theta < -5 || theta > 5)
  increment_log_prob(-1e10);

then you have

lp == 0 - 1e10

or

lp == 0

and there's no connection back to the parameters. So
no gradient.

- Bob

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

Add the advice from Ben Goodrich and Krzysztof Sakrejda on stan-users in response to a query from Jonathan Gilligan:

Jonathan:

There have been some great discussions here about how important it is to use pairs() plots to check for sampling problems (bottlenecks, correlations, etc.), but I would love advice about how to work with big multilevel models where there are a small number of top-level hyperparameters characterizing hyperpriors, but at lower levels have one or more priors (each with one or more parameters) for each of a large number of groups, so pairs() plots of all combinations of parameters would require too many panes to be useful or practical.

Ben:

I would say to always include lp__, top-level hyperparameters, and any variance / standard deviation that goes into the likelihood. If that does not point to the source of your problem, then maybe do additional pairs plots with lp__ and a batch of lower-level parameters. High correlations are not a big deal, per se, but can be problematic when combined with changing variances.

Krzysztof

When choosing which lower-level parameters to plot it helps to divide them into groups based on the amount of data available.

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

Following Ben's advice on stan-users:

For scale parameters, I do

  • Jeffreys if I know the units of measurement don't matter
  • rescaled Exponential if I know the mean
  • rescaled Gamma, Generalized Gamma, or Amoroso if I know more than the mean (which is rare)

To do:

  • add this advice for scale parameters (assuming Andrew doesn't object)
  • add chapter on reference priors (Jeffreys priors; ditto)

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
  • add chapter on arithmetic precision
  • add discussion of cdfs and ccdfs and rngs in general to the chapter that includes vectorization and rename it

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

from #1564:

  • add arguments to sampling statements

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

From Krzysztof Sakrejda on stan-users:

... slice sampling and MH don't make sense since we have HMC but ADVI and optimization have different enough properties that they're worth including. I just looked at the Stan manual and it doesn't really address the choice of algorithms in any clear section (maybe it's in the intro in story form) so maybe right before section 1.5 which launches into talking about sampling there could be a little section about choice of algorithms.
- [ ] add an overview of various algorithms

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

Comment from @bgoodri on stan-users:

There are many ways of defining a partial correlation. Some include

  • The correlation between i and j given all other variables. This is given by inverting the covariance matrix, rescaling it to a correlation matrix, and negating the off-diagonal elements
  • The correlation between i and j given all variables (if any) before i where i < j. These are given by the transformation from a correlation matrix to unconstrained parameters if you do a Fisher transform of the unconstrained parameters. You can do this in R with the unconstrained_pars function.
  • The correlation between i and j given all variables (if any) between i and j where i < j. This is not available in Stan but there is code for it in the clusterGeneration R package.
  • The correlation between i and j given all variables (if any) after j where i < j. As far as I know, these partial correlations do not have a name, and there is no code for them either.

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

Add the log logistic as an example of a user-defined distribution:

real loglogistic_log(real y, real a, real b) {
  return log(b) - log(a) + (b - 1) * (log(y) - log(a))
         - 2 * log1p_exp(b * (log(y) - log(a)));
}

from: https://en.wikipedia.org/wiki/Log-logistic_distribution
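
A minimal sketch of how the user-defined density might be used once it's placed in the functions block; the data names and the priors on the scale and shape are illustrative, and under the older _log naming convention used here the distribution can appear in a sampling statement as loglogistic:

functions {
  real loglogistic_log(real y, real a, real b) {
    return log(b) - log(a) + (b - 1) * (log(y) - log(a))
           - 2 * log1p_exp(b * (log(y) - log(a)));
  }
}
data {
  int<lower=0> N;
  real<lower=0> y[N];
}
parameters {
  real<lower=0> a;   // scale
  real<lower=0> b;   // shape
}
model {
  a ~ cauchy(0, 2.5);
  b ~ cauchy(0, 2.5);
  for (n in 1:N)
    y[n] ~ loglogistic(a, b);
}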

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

Include a naive Bayes classifier example to show how to normalize. It can include discrete or continuous outputs. Add continuous HMM example (perhaps movement HMM, because that'll include circular stats, too).

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

Have @bgoodri (or someone else who understands it) write about the QR decomposition used in RStanArm.
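
In the meantime, here's a minimal sketch of the kind of thing that section would cover; the sqrt(N - 1) scaling follows the convention rstanarm uses, and all of the names are illustrative:

data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N, K] x;
  vector[N] y;
}
transformed data {
  // thin, rescaled QR decomposition of the design matrix
  matrix[N, K] Q = block(qr_Q(x), 1, 1, N, K) * sqrt(N - 1);
  matrix[K, K] R = block(qr_R(x), 1, 1, K, K) / sqrt(N - 1);
  matrix[K, K] R_inv = inverse(R);
}
parameters {
  real alpha;
  vector[K] theta;           // coefficients on the columns of Q
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + Q * theta, sigma);
}
generated quantities {
  vector[K] beta = R_inv * theta;   // coefficients on the original predictors
}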

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
  • Write Stan functions in the manual for the built-in transforms and Jacobians [or ideally, just wait until these are exposed in the language]

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

From Camelia (last name?)

I think I have come across an untamed part of stan, and was wondering if you had any insights.

I'm trying to understand the idea of using 'scale-free' parameters. Gelman says "Aim to keep all parameters scale-free. [A way to do this is to] scale by some conventional value, for example if a parameter has a "typical" value of 4.5, you could work with log(theta/4.5) ". Full post here: http://stats.stackexchange.com/questions/189970/in-bayesian-linear-regression-why-do-we-assume-parameter-prior-has-zero-mean

My questions are

  • Does this apply to transformed parameters?
  • If a parameter has a typical value of 4.5, why can we not instead set a weak prior centered at 4.5 ? (assuming 4.5 makes sense)
  • In the context of the problem I'm working on, I need to infer the parameters of a Beta(a, b) distribution. Here is a snippet below. lambda[i] is typically around 10, which leads to a large value for b, which is what I want. According to Gelman, I should be "working with" lambda[i]/10 though. The exponent term makes me a bit nervous; the prior on this is, I think, overly sensitive, and I sometimes see bi-modalities in lambda_r.
transformed parameters {
   // signal distribution parameters    
    phi[i]    <- inv_logit(phi_r[r[i]] + phi_d[d[i]]);
    lambda[i] <- 1 + exp(lambda_r[r[i]] + lambda_d[d[i]]);

    // transformed signal distribution parameters
    a <- lambda[i] * phi[i];
    b <- lambda[i] * (1 - phi[i]);

    // implied search and hit rates
    search_rate[i] <- beta_ccdf(t_i[i], a, b);
    hit_rate[i]    <- beta_conditional_mean(t_i[i], a, b);
}

I haven't been able to find much in the way of explanation of "scale-free" parameters online. If you have any insights / any this makes sense, would be super helpful!

From Jim Savage:

I've always understood the quest for scale-free priors as some transformation we want to make to the data or parameters so that a) the geometry of the posterior (params, not transformed params) is easy for HMC to navigate and b) the scale of your data doesn't upset the quality of the model fit. If we make transformations to the data, it's important not to implicitly be informing your prior with data (which is probably the case if you convert your data to z-scores--this is irrelevant to your particular problem, but I found it interesting when I learned it).

This might be helpful:
https://groups.google.com/forum/#!searchin/stan-users/scale-free|sort:relevance/stan-users/FjWL2290A5w/X0_tjLFbBgAJ

From Andrew Gelman:

Hi, scale-free is about computation but it's also about interpretability and it's also about being able to do a better job with priors, especially weakly informative priors. The idea is that scale-free parameters should typically (tho not necessarily) be in the (0, 1) range in absolute value, suggesting that things like N(0,1) could be weakly informative priors. So I'm not a fan of a prior centered at 4.5. The answer would have to depend on context, but I think maybe you'd want some multiplicative factor here?

From Michael Betancourt:

It’s not really “scale free” but rather “scale unified”. As you noted a nice
way to think about this is in terms of units. Once you’ve identified the
appropriate units of your problem then you can redefine your system
of units so that all of the expected effects are all around 1 in those units.

This has nice computational benefits — if your effects vary on the order of
10^{6} in one direction and 10^{-6} in the other, then it will take 10^{12}
steps to explore everything, but if everything is order 1 then it will take
order 1 steps to jump to a new point.

But the real benefit is that it facilitates building good priors. We like to use “weakly” informative priors where we don’t know if an effect exists, but if it does we have some idea of its scale (a recent example that has come up lately is in physiological models — however complicated the body is, blood can’t flow faster than a certain amount, so the effects of any drug can’t change blood flow by more than that amount). Ultimately these manifest as priors of the form

x ~ normal(0, x_scale);

or

x ~ cauchy(0, x_scale);

Now if all of the scales are order 1, then all of your priors will look
the same and you have much less bookkeeping to do.

Adding a mean to the prior introduces more information, turning the
weakly informative prior into something much more informative. Now
if that is information that you actually believe then go for it, but always
verify that the fit was good and the prior mean wasn’t in tension with
the rest of the model.

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

Stephen Martin on stan-users suggests adding mean and variance to the univariate distributions.

As an aside, I also really like the plots with various parameter values that you get on Wikipedia.

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

Moved from #757

It'd be nice to add some factor models with constraints as examples in the manual.

From Rick Farouni on stan-users:

For a factor model y = Lf + e, where f ~ N(0, Sigma), I think there are three important cases to consider:

  1. The positive lower triangular (PLT) identification scheme mentioned here, http://apps.olin.wustl.edu/faculty/zhou/GZ-RFS96.pdf, where Sigma = I (the identity matrix) and the diagonal entries of L are polarity constrained.
  2. The modified PLT scheme, where the factors f are still orthogonal but the diagonal entries of Sigma are left free; see Aguilar, O. & West, M. (2000), Bayesian dynamic factor models and portfolio allocation.
  3. Sigma is a correlation matrix, but now instead of requiring D(D-1)/2 constraints on L, we need D(D-1) to ensure local uniqueness under rotation, plus the polarity constraint to ensure global rotational uniqueness. This paper lists the conditions: http://link.springer.com/article/10.1007/s11336-012-9259-3

But as this paper shows https://www.stat.washington.edu/research/reports/2011/tr589.pdf , imposing constraints is problematic.

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

John Hall suggests referring to section 21.6 of Gelman and Hill (first edition), formula 21.14 to get an estimate of how much pooling would be done by adding a hierarchical component to a model. Add this discussion in the reparameterization of hierarchical models section of the manual.

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

From @jmh530 on the 2.16 manual issue: stan-dev/stan#2265 (comment)

In Section 8.2 on QR factorization, I feel like the paragraph that recommends you use it when you do not have an informative prior on the location of beta is mistaken (also mentioned in 8.3).

Consider a probability distribution Y ~ N(XB, sigma) with a prior B ~ N(mu_B, sd_B). Y is then re-written as Y ~ N(Q * theta, sigma). The equivalent distribution on the prior of theta would be theta ~ MVN(mu_B * R, t(R) * diag(sd_B ^ 2) * R), assuming mu_B is 1XN.

My point is that you can still include a prior on the mean of beta, you just have to do the appropriate transformation in order to express it in terms of the new beta. This raises some other questions about the reasoning in that section, but I'll leave off here.

@jgabry replied
@jmh530 yeah you're right that it's possible to use the QR approach even with a non-zero prior mean on the coefficients, so perhaps the recommendation in the manual is worded too strongly. @bgoodri should comment on this though because he wrote that section and I forget if he had a reason for discouraging this (he's definitely aware that it's possible). One reason might just be that it is strange to have a prior belief that there is a dense covariance matrix for the MVN on theta, since theta represents the coefficients on the columns of the Q matrix and those columns are orthogonal.

@jmh530 replied
True. When I was playing around with it, I had extracted the standard deviations from the covariance matrix and just used the univariate normal distribution.

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

See stan-dev/stan#2273 for a discussion of numerical difficulties with simplexes used as regression coefficients in 5000 dimensions. I added a note to the simplex section of the manual on possible solutions, but going over the example itself would be a nice case study.

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

I like the idea of defining the Bayesian posterior for R2, defined by @bgoodri in a response on StackOverflow: https://stackoverflow.com/questions/44759319/overall-predictive-power-e-g-r2-for-bayesian-linear-mixed-models

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
  • Add clutter example to mixture chapter as an example of "denoising" (it's an example in Bishop's book, section 10.7.1):
data {
  real<lower = 0, upper = 1> theta;  // clutter ratio
  int<lower = 0> N;
  vector[N] y;
}
parameters {
  real mu;
}
model {
  for (n in 1:N)
    target += log_mix(theta,
                      normal_lpdf(y[n] | mu, 1),
                      normal_lpdf(y[n] | 0, 10));
}
# R code to simulate data and fit the model above (saved as clutter.stan)
theta <- 0.5
N <- 200
mu <- 4.3

y <- rep(0, N);
for (n in 1:N) {
  if (rbinom(1, 1, 0.5)) {
    y[n] <- rnorm(1, mu, 1)
  } else {
    y[n] <- rnorm(1, 0, 10)
  }
}

library(rstan)
fit <- stan("clutter.stan", data = list(theta=theta, N=N, y=y))

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
  • Include Ben's discussion of the "Lancaster" parameterization of multinomial in terms of Poissons:

http://discourse.mc-stan.org/t/large-poisson-model-with-individual-effects-is-too-slow/2112/2

I started this in the programming.tex and added the reference, but I can't quite follow what is being reparameterized to what.
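
If what's meant is the standard multinomial-Poisson transformation, a minimal sketch looks like the following; each row of counts gets its own free intercept phi[n] that absorbs the multinomial total, and all of the names (with the shared category effects alpha standing in for a full linear predictor) are illustrative:

data {
  int<lower=1> N;          // observations (multinomial rows)
  int<lower=2> K;          // categories
  int<lower=0> y[N, K];    // counts
}
parameters {
  vector[N] phi;           // per-row nuisance intercepts
  vector[K] alpha;         // category effects
}
model {
  phi ~ normal(0, 10);
  alpha ~ normal(0, 5);
  for (n in 1:N)
    y[n] ~ poisson_log(phi[n] + alpha);
}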

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024

There used to be a section in the manual on error and warning messages. It was very incomplete and was removed for 2.18. Add one back in that properly catalogues all the

  • parser warnings
  • parser errors
  • run-time errors, organized by level
    • reject()
    • error()
    • warning()

from docs.

bob-carpenter avatar bob-carpenter commented on May 29, 2024
X == X_miss + X_obs

so

X * beta == X_miss * beta + X_obs * beta

and we can use csr_matrix_times_vector for the multiplies.
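
A minimal sketch of how that might look, with the missing entries of X declared as parameters and both halves stored in compressed sparse row form; all of the names are illustrative:

data {
  int<lower=0> N;                 // rows of X
  int<lower=0> K;                 // columns of X
  int<lower=0> nz_obs;            // nonzero entries of X_obs
  vector[nz_obs] w_obs;           // their values
  int v_obs[nz_obs];              // their column indices
  int u_obs[N + 1];               // row-start indices into w_obs
  int<lower=0> nz_miss;           // missing entries of X (nonzero pattern of X_miss)
  int v_miss[nz_miss];            // their column indices
  int u_miss[N + 1];              // row-start indices
  vector[N] y;
}
parameters {
  vector[nz_miss] x_miss;         // missing predictor values treated as parameters
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  vector[N] mu = csr_matrix_times_vector(N, K, w_obs, v_obs, u_obs, beta)
                 + csr_matrix_times_vector(N, K, x_miss, v_miss, u_miss, beta);
  x_miss ~ normal(0, 1);          // stand-in for whatever model the missing predictors get
  y ~ normal(mu, sigma);
}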

from docs.
