Comments (47)
Add missing data imputation for multivariate normal. Use the bivariate example, modified slightly from the one @betanalpha discussed on thestan-users list. It marginalizes out the missing data. In this particular case, a total of N_miss
instances of y
have y[n,2]
missing:
data {
int<lower=0> N_nonmiss;
vector[2] y[N_nonmiss];
int<lower=0> N_miss;
real y1_obs[N_miss];
}
parameters {
vector[2] mu;
cov_matrix[2] Sigma;
}
model {
// Prior on mu
// Prior on Sigma, see the manual for examples
y_obs ~ multi_normal(mu, Sigma);
y1_obs ~ normal(mu[1], sqrt(Sigma[1, 1]);
}
We can also sample y[n,2]
using analytic conditional p(y[n,2] | y[n,1], mu, Sigma)
given here:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Conditional_distributions
And be sure to emphasize as Michael says that the missingness is a modeling problem, not just an afterthought. We can compare this all to the brute-force approach to missing data currently inthe manual and talk about how the analytic marginalizations speed things up.
I'll probably also include the brute-force approach and change the prior to a Cholesky factor.
from docs.
Originally from Jeffrey Arnold here: stan-dev/stan#1081 (comment)
Regarding the Kalman filter examples, I have examples and fully implemented Kalman filtering (of several flavors: with / without missing values, as batch / sequentially) with backward sampling in the generated quantities block here: https://github.com/jrnold/ssmodels-in-stan. I could probably write up that section.
from docs.
- reparameterize Gaussian process example to use length scale with a proper prior (e.g.,
half-t(4)
).
from docs.
- add horseshoe prior example using Allen's implementation:
https://ariddell.org/horseshoe-prior-with-stan.html
- explore what happens with L1 priors in Stan's MLE to see if we get truly 0 values or close; I'm not sure what L-BFGS will do
from docs.
- add section or new chapter about arithmetic precision
- include following example I sent to stan-users:
For example, this model is problematic arithmetically:
fit <- stan(model_code = "parameters { real y; } model { y ~ normal(1,1e-20); }")
and produces
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
y 1 0.0 0 1 1 1 1 1 4000 Inf
lp__ -15407440 422002.9 26689805 -61629758 -15407440 0 0 0 4000 1.500106e+14
There's really nothing we can do about such models numerically. What you have to do from a user perspective is use a parameterizations z = y - 1
, which can be handled stably because you can represent numbers close to zero well.
parameters { real z; } model { z ~ normal(0,1e-20); }
The problem is that there's still no way to represent y = z + 1
as a real value using double-precision floating-point arithmetic, but you can get close to log(y)
by using
log_y <- log1p(z);
from docs.
Provide examples of log-mix for the manual mixture section.
from docs.
From Jeffrey Arnold:
Regarding the Kalman filter examples, I have examples and fully implemented Kalman filtering (of several flavors: with / without missing values, as batch / sequentially) with backward sampling in the generated quantities block here: https://github.com/jrnold/ssmodels-in-stan. I could probably write up that section.
from docs.
From Andrew on stan-dev, an example of a model for the problematic posteriors section:
Next I decided, just for fun, to fit a nonlinear model: y = a_exp(-b_x) + c + error, with N(0,sigma) errors. I simulated some data (x taking on the values between 0 and 23, setting a=.5, b=1/10, c=1, sigma=.2, and I fit it constraining all the parameters to be positive. (In retrospect I’m thinking I should’ve just constrained b and sigma to be positive; constraining a and c made sense in the context of my hypothetical example, but in general it’s poor practice to put in hard constraints unless absolutely necessary.)
I simulated data and fit the model and all was fine. But then I repeated and the fit blew up, I was getting some estimates of b that were either really close to 0 (1e-10, etc) or really high (1e10, etc), I can’t remember which. The point is that there can be instability at the extremes, and I realized the model need a bit of hand-holding. As a kluge I constrained b to be between 1/50 and 1 (which made sense in the context of the example). A hard constraint was ugly but was easy enough to understand. I didn’t want to confuse the class by introducing a normal prior here.
So this second example was a bit of a mess. Which suggests to me that in the manual/book we need to throw in an example to explicitly address instability and constraints. Since, ultimately, this is an inherent difficulty with modeling, it’s not anything to do with Stan in particular.
from docs.
- add discussion of K vs. (K-1) coefficient vector parameterizations of multi-logit
Here's what the model should look like, but there's a bug in Stan 2.6 preventing the compilation:
data {
int<lower=2> K; // num categories
int<lower=1> D; // num predictors (without intercept)
int<lower=0> N; // num observations
vector[D] x[N]; // predictors
int<lower=1,upper=K> y[N]; // observations
}
transformed data {
vector[1] zero;
zero[1] <- 0;
}
parameters {
vector[K-1] alpha; // intercepts
matrix[K-1,D] beta; // slopes
}
model {
// priors
alpha ~ normal(0,20);
to_vector(beta) ~ normal(0,10);
// likelihood
for (n in 1:N) {
y[n] ~ categorical_logit(append_row(zero, alpha + beta * x[n]));
}
}
Here's a program that will work, but isn't optimally efficient:
data {
int<lower=2> K; // num categories
int<lower=1> D; // num predictors (without intercept)
int<lower=0> N; // num observations
vector[D] x[N]; // predictors
int<lower=1,upper=K> y[N]; // observations
}
transformed data {
vector[1] zero;
zero[1] <- 0;
}
parameters {
vector[K-1] alpha; // intercepts
matrix[K-1,D] beta; // slopes
}
model {
// temps
vector[K] lambda;
lambda[1] <- 0;
// priors
alpha ~ normal(0,20);
// likelihood
for (n in 1:N) {
for (k in 2:K)
lambda[k] <- alpha[k-1] + beta[k-1] * x[n];
y[n] ~ categorical_logit(lambda);
}
}
from docs.
- Add a chapter on using Stan for pure Monte Carlo integration
Computing pi
/**
* Compute the mathematical constant pi by Monte Carlo integration.
*
* First, sample (x,y) uniformly from a square with vertices (-1,1),
* (-1,1), (1,-1), (1,1) by sampling x and y independently from the
* interval (-1,1).
*
* Second, determine if (x,y) lies within the circle circumscribed in
* the square, by testing if (x^2 + y^2 < 1). Then multiply the
* proportion of points inside the circle by the area of the square,
* which is 4.
*
* The posterior mean of the generated quantity pi will compute pi
* to within MC error.
*
* Run with fixed parameter algorithm and zero warmup iterations.
*/
model {
}
generated quantities {
real<lower=-1,upper=1> x;
real<lower=-1,upper=1> y;
real<lower=0,upper=4> pi;
x <- uniform_rng(-1,1);
y <- uniform_rng(-1,1);
pi <- 4 * (x^2 + y^2 < 1);
}
Generalization to volume of hypersphere
/**
* Compute the volume of a hypersphere of a given radius in a
* given number of dimensions by Monte Carlo integration.
*
* The volume will be the mean of the sapled sphere_volume values.
*/
data {
int<lower=1> N;
real<lower=0> r;
}
transformed data {
real cube_volume;
cube_volume <- (2 * r)^N;
}
model {
}
generated quantities {
vector<lower=-1, upper=1>[N] x;
real<lower=0,upper=cube_volume> sphere_volume;
for (n in 1:N)
x[n] <- uniform_rng(-1,1);
sphere_volume <- cube_volume * (dot_self(x) < r);
}
from docs.
- add example of WAIC calculation
from docs.
- add example of posterior predictive checks
from docs.
- add example of pure cross-validation
- discuss leave-one-out cross-validation and relate to WAIC (see Aki and Andrew's paper)
from docs.
- include an example of a spline model from BDA 3 (I still don't understand the description in the book, though, so hopefully someone else can code an example)
- add Kalman filter examples (again, this is something I'm not that familiar with, but should be doable in Stan)
from docs.
- review all the recommendations w.r.t. Cholesky factorization to bring them up to current best practice in terms of computational and statistical efficiency
from docs.
- add a discussion of priors for degrees of freedom
From Aki on stan-users:
I recommend as an easy default option
nu ~ gamma(2,0.1);
This was proposed and anlysed by Juárez and Steel (2010) (Model-based clustering of non-Gaussian panel data based on skew-t distributions. Journal of Business & Economic Statistics 28, 52–66.). Juárez and Steel compere this to Jeffreys prior and report that the difference is small. Simpson et al (2014) (arXiv:1403.4630) propose a theoretically well justified "penalised complexity (PC) prior", which they show to have a good behavior for the degrees of freedom, too. PC prior might be the best choice, but requires numerical computation of the prior (which could computed in a grid and interpolated etc.). It would be feasible to implement it in Stan, but it would require some work. Unfortunately no-one has compared PC prior and this gamma prior directly, but based on the discussion with Daniel Simpson, although PC prior would be better this gamma(2,0.1) prior is not a bad choice. Thus, I would use it until someone implements the PC prior for degrees of freedom of the Student's t in Stan.
from docs.
- more general discussion of missing data, perhaps following Michael's comment included in the 2.4 manual issue
from docs.
- emphasize the role of
n_divergent
in the manual (I'd really like to rename this, because n=1 or n=0, so the name seems misleading).
from docs.
- add following advice from Andrew
I’d suggest an informative prior, for example if you think the elasticity should be between 0 and 1, you could use a normal prior with mean 0.5 and sd 0.5. The point is that you’d like to let the data speak but at the same time you want to control your estimate. If you do get a negative elasticity estimate, so be it (sometimes such things can happen due to unforeseen interactions in the data) but then you’ll probably also have a big standard error. And if the data really “want” to find a negative elasticity with a small s.e., you’d like to know this—as, at the very least, this will tell you that there’s something weird about the data. Better to let the data speak than to get an estimate right on the boundary which still doesn’t fit the data.
from docs.
Originally issue #1431 (from Titus van der Malsburg):
The term local variable is used a lot in the manual but it is difficult to extract a complete definition. For example, section 25.7 says:
At the beginning of each block, local variables may be declared that are scoped over the rest of the
statements in the block.
However, since all variables are defined at the beginning of a block, that would mean that all variables are local, which is clearly not the case. On the other hand, the section variable scope explains that variables declared in one program block are visible in subsequent blocks except for variables declared in the model block. This section, however, fails to mention that this is not the case for variables declared in nested blocks.
I think it would help if there was one place in the manual that gives the whole story. It may also make sense to use this opportunity to clarify why the term global variable isn't used (only variables in the data block are global).
from docs.
General discussion of efficiency of restart so users don't feel like they're losing so much time.
Everyone keeps asking about whether Stan can just "keep going" and do more sampling.
You absolutely cannot just keep going. Nor do you want to. The problem is that if you haven't reached convergence, then you want to do more warmup, and then more sampling. So you'd have to toss out the sampling part of the run anyway.
If you run 1000 warmup iterations and 1000 sampling iterations and haven't converged, then you want to go back and run 2000 warmup iterations and 2000 sampling iterations. If you were to restart, after the original warmup iterations, you'd save around 15% of your work. Here's an example with the default number of iterations (which is probably way too high):
Rerun from beginning, doubling sizes
1000 + 1000
2000 + 2000
-----------
3000 + 3000
Restart warmup (warmup hasn't converged):
1000 + 1000
0 + 1000 + 2000
-----------
2000 + 3000
The exception to this is when you have converged during warmup, but just haven't drawn enough iterations for the n_eff you need. In that case, you'd get the following.
Resample (sorry for header --- GitHub flavored markdown glitch in action)
1000 + 1000
0 + 1000
-----------
3000
Restart sampling (warmup converged)
1000 + 1000
1000 + 2000
-----------
5000
So you'd save about 40% of your total effort if you could just run more samples.
from docs.
- Add least-absolute-value regression --- just needs DoubleExponential noise distribution
from docs.
Andre Pfeuffer asked about this on stan-users:
if (theta < -5 || theta > 5)
increment_log_prob(-1.0e10);
I wrote back
Aside from the modeling issue, that statement will, perhaps surprisingly, have absolutely no effect on sampling. The problem is that it's branching on the value of a parameter, then adding a constant to the log density. Constants added to the log density never do anything to sampling. The log density is either lp or lp + 1e-10, neither of which is an expression involving theta. It got there via examining theta, but it has no way of exploiting that discontinuity at theta = 5 or theta = -5.
from docs.
I don't get the argument why the if condition does not have effect on the posterior. It indeed has no way of exploiting the discontinuity at theta = (-)5, but whether the log-posterior is lp or lp + 1e-10 does involves the value of theta.
Andre Pfeuffer asked about this on stan-users: ``` if (theta < -5 || theta > 5) increment_log_prob(-1.0e10); ```I wrote back
Aside from the modeling issue, that statement will, perhaps surprisingly, have absolutely no effect on sampling. The problem is that it's branching on the value of a parameter, then adding a constant to the log density. Constants added to the log density never do anything to sampling. The log density is either lp or lp + 1e-10, neither of which is an expression involving theta. It got there via examining theta, but it has no way of exploiting that discontinuity at theta = 5 or theta = -5.
from docs.
What I first said was wrong.
I was right in that it won't affect the gradient, but was
wrong in that it will affect the posterior by changing
the slice sampling in NUTS (or the Metropolis step in HMC).
It'll have a similarly indirect affect on optimization having
to do with termination criteria.
As to why it doesn't affect the gradients,
the C++ generated looks identical to the Stan code. So there's
nothing connecting the log probability accumulator back to theta.
It's used for the conditional, but not tied into the result.
So if you think of the expression tree, when I do this
increment_log_prob(-sigma);
increment_log_prob(-sqaure((y - mu) / sigma));
you get an expression that looks like:
lp == 0 - sigma - square((y - mu) / sigma);
So the derivative w.r.t. sigma and y and mu is defined.
But if you
if (theta < -5 || theta > 5)
increment_log_prob(-1e10);
then you have
lp == 0 - 1e10
or
lp == 0
and there's no connection back to the parameters. So
no gradient.
- Bob
On Jul 6, 2015, at 2:12 PM, maverickg [email protected] wrote:
I don't get the argument why the if condition does not have effect on the posterior. It indeed has no way of exploiting the discontinuity at theta = (-)5, but whether the log-posterior is lp or lp + 1e-10 does involves the value of theta.
Andre Pfeuffer asked about this on stan-users:
if (theta < -5 || theta > 5) increment_log_prob(-1.0e10);
I wrote back
Aside from the modeling issue, that statement will, perhaps surprisingly, have absolutely no effect on sampling. The problem is that it's branching on the value of a parameter, then adding a constant to the log density. Constants added to the log density never do anything to sampling. The log density is either lp or lp + 1e-10, neither of which is an expression involving theta. It got there via examining theta, but it has no way of exploiting that discontinuity at theta = 5 or theta = -5.—
Reply to this email directly or view it on GitHub.
from docs.
Add the advice from Ben Goodrich and Krzysztof Sakrejda on stan-users in response to a query from Jonathan Gilligan:
Jonathan:
There have been some great discussions here about how important it is to use pairs() plots to check for sampling problems (bottlenecks, correlations, etc.), but I would love advice about how to work with big multilevel models where there are a small number of top-level hyperparameters characterizing hyperpriors, but at lower levels have one or more priors (each with one or more parameters) for each of a large number of groups, so pairs() plots of all combinations of parameters would require too many panes to be useful or practical.
Ben:
I would say to always include lp__, top-level hyperparameters, and any variance / standard deviation that goes into the likelihood. If that does not point to the source of your problem, then maybe do additional pairs plots with lp__ and a batch of lower-level parameters. High correlations are not a big deal, per se, but can be problematic when combined with changing variances.
Krzysztof
When choosing which lower-level parameters to plot it helps to divide them into groups based on the amount of data available.
from docs.
Following Ben's advice on stan-users:
For scale parameters, I do
- Jeffreys if I know the units of measurement don't matter
- rescaled Exponential if I know the mean
- rescaled Gamma, Generalized Gamma, or Amoroso if I know more than the mean (which is rare)
- add this advice for scale parameters (assuming Andrew doesn't object)
- add chapter on reference priors (Jeffrey's priors; ditto)
from docs.
- add chapter on arithmetic precision
- add discussion of cdfs and ccdfs and rngs in general to the chapter that includes vectorization and rename it
from docs.
from #1564:
- add arguments to sampling statements
from docs.
From Krzysztof Sakrejda on stan-users:
... slice sampling and MH don't make sense since we have HMC but ADVI and optimization have different enough properties that they're worth including. I just looked at the Stan manual and it doesn't really address the choice of algorithms in any clear section (maybe it's in the intro in story form) so maybe right before section 1.5 which launches into talking about sampling there could be a little section about choice of algorithms.- [ ] add an overview of various algorithms
from docs.
Comment from @bgoodri on stan-users:
There are many ways of defining a partial correlation. Some include
- The correlation between i and j given all other variables. This is given by inverting the covariance matrix, rescaling it to a correlation matrix, and negating the off-diagonal elements
- The correlation between i and j given all variables (if any) before i where i < j. These are given by the transformation from a correlation matrix to unconstrained parameters if you do a Fisher transform of the unconstrained parameters. You can do this in R with the unconstrained_pars function.
- The correlation between i and j given all variables (if any) between i and j where i < j. This is not available in Stan but there is code for it in the clusterGeneration R package.
- The correlation between i and j given all variables (if any) after j where i < j. These partial correlations do not have a name as far as I know of and no code either.
from docs.
Add the log logistic as an example of a user-defined distribution:
real loglogistic_log(real y, real a, real b) {
return log(b) - log(a) + (b - 1) * (log(y) - log(a))
- 2 * log1p_exp(b * (log(y) - log(a));
}
from: https://en.wikipedia.org/wiki/Log-logistic_distribution
from docs.
Include a naive Bayes classifier example to show how to normalize. It can include discrete or continuous outputs. Add continuous HMM example (perhaps movement HMM, because that'll include circular stats, too).
from docs.
Have @bgoodri (or someone else who understands it) write about the QR decomposition used in RStanArm.
from docs.
- Write Stan functions in the manual for the built-in transforms and Jacobians [or ideally, just wait until these are exposed in the language]
from docs.
From Camelia (last name?)
I think I have come across an untamed part of stan, and was wondering if you had any insights.
I'm trying to understand the idea of using 'scale-free' parameters. Gelman says "Aim to keep all parameters scale-free. [A way to do this is to] scale by some conventional value, for example if a parameter has a "typical" value of 4.5, you could work with log(theta/4.5) ". Full post here: http://stats.stackexchange.com/questions/189970/in-bayesian-linear-regression-why-do-we-assume-parameter-prior-has-zero-mean
My questions are
- Does this apply to transformed parameters?
- If a parameter has a typical value of 4.5, why can we not instead set a weak prior centered at 4.5 ? (assuming 4.5 makes sense)
- In the context of the problem I'm working on, I need to infer the parameters of a Beta(a, b) distribution. Here is a snippet below. lambda[i] is typically around 10, which leads to a large value for b, which is what I want. According to Gelman, I should be "working with" lambda[i]/10 though. The exponent term makes me a bit nervous; the prior on this is, I think, overly sensitive, and I sometimes see bi-modalities in lambda_r.
transformed parameters {
// signal distribution parameters
phi[i] <- inv_logit(phi_r[r[i]] + phi_d[d[i]]);
lambda[i] <- 1 + exp(lambda_r[r[i]] + lambda_d[d[i]]);
// transformed signal distribution parameters
a <- lambda[i] * phi[i];
b <- lambda[i] * (1 - phi[i]);
// implied search and hit rates
search_rate[i] <- beta_ccdf(t_i[i], a, b);
hit_rate[i] <- beta_conditional_mean(t_i[i], a, b);
}
I haven't been able to find much in the way of explanation of "scale-free" parameters online. If you have any insights / any this makes sense, would be super helpful!
From Jim Savage:
I've always understood the quest for scale-free priors as some transformation we want to make to the data or parameters so that a) the geometry of the posterior (params, not transformed params) is easy for HMC to navigate b) the scale of your data don't upset the quality of the model fit. If we make transformations to the data, it's important not to implicitly be informing your prior with data (which is probably the case if you convert your data to z-scores--this is irrelevant to your particular problem, but I found it interesting when I learned it).
This might be helpful:
https://groups.google.com/forum/#!searchin/stan-users/scale-free|sort:relevance/stan-users/FjWL2290A5w/X0_tjLFbBgAJ
From Andrew Gelman:
Hi, scale-free is about computation but it's also about interpretability and it's also about being able to do a better job with priors, especially weakly informative priors. The idea is that scale free parameters should typically (tho not necessarily) in the (0, 1) range in absolute value, suggesting that things like N(0,1) could be weakly informative priors. So I'm not a fan of a prior centered at 4.5. The answer would have to depend on context, but I think maybe you'd \want some multiplicative factor here?
From Michael Betancourt:
It’s not really “scale free” but rather “scale unified”. As you noted a nice
way to think about this is in terms of units. Once you’ve identified the
appropriate units of your problem then you can redefine your system
of units so that all of the expected effects are all around 1 in those units.
This has nice computational benefits — if your effects vary on the order of
10^{6} in one direction and 10^{-6} in the other, then it will take 10^{12}
steps to explore everything, but if everything is order 1 then it will take
order 1 steps to jump to a new point.
But the real benefit is that it facilities building good priors. We like to
use “weakly” informative priors where we don’t know if an effect
exists but if it does we have some idea of it’s scale (a recent
example that has come up lately is in physiological models — however
complicated the body is, blood can’t floor faster than a certain amount,
so the effects of any drug can’t change blood flow by more than that
amount). Ultimately these manifest as priors of the form
x ~ normal(0, x_scale);
or
x ~ cauchy(0, x_scale);
Now if all of the scales are order 1, then all of your priors will look
the same and you have much less bookkeeping to do.
Adding a mean to the prior introduces more information, turning the
weakly informative prior into something much more informative. Now
if that is information that you actually believe then go for it, but always
verify that the fit was good and the prior mean wasn’t in tension with
the rest of the model.
from docs.
Stephen Martin on stan-users suggests adding mean and variance to the univariate distributions.
As an aside, I also really like the plots with various parameter values that you get in the Wikipedia.
from docs.
Moved from #757
It'd be nice to add some factor models with constraints as examples in the manual.
From Rick Farouni on stan-users:
For a factor model y=Lf +e, where f~ N(0,Sigma) , I think there are three important cases to consider:
- The positive lower triangular (PLT) identification scheme mentioned here http://apps.olin.wustl.edu/faculty/zhou/GZ-RFS96.pdf , where Sigma=I (identity matrix) and the the diagonal entries of L are polarity constrained.
2.The modified PLT where the factors,f, are still orthogonal but the elements of diagonal of Sigma Aguilar, O. & West, M.(2000) Bayesian dynamic factor models and portfolio allocation.
3.. Sigma is a correlation matrix, but now instead of requiring D(D-1)/2 constraints on L, we need D(D-1) to ensure local uniqueness under rotation and the polarity constraint to ensure global rotational uniqueness.This paper lists the conditions http://link.springer.com/article/10.1007/s11336-012-9259-3
But as this paper shows https://www.stat.washington.edu/research/reports/2011/tr589.pdf , imposing constraints is problematic.
from docs.
- add example of how to use Wisharts as conjugate priors directly; see
https://en.wikipedia.org/wiki/Inverse-Wishart_distribution#Conjugate_distribution
from docs.
John Hall suggests referring to section 21.6 of Gelman and Hill (first edition), formula 21.14 to get an estimate of how much pooling would be done by adding a hierarchical component to a model. Add this discussion in the reparameterization of hierarchical models section of the manual.
from docs.
From @jmh530 on the 2.16 manual issue: stan-dev/stan#2265 (comment)
In Section 8.2 on QR factorization, I feel like the paragraph that recommends you use it when you do not have an informative prior on the location of beta is mistaken (also mentioned in 8.3).
Consider a probability distribution Y ~ N(XB, sigma) with a prior B ~ N(mu_B, sd_B). Y is then re-written as Y ~ N(Q * theta, sigma). The equivalent distribution on the prior of theta would be theta ~ MVN(mu_B * R, t(R) * diag(sd_B ^ 2) * R), assuming mu_B is 1XN.
My point is that you can still include a prior on the mean of beta, you just have to do the appropriate transformation in order to express it in terms of the new beta. This raises some other questions about the reasoning in that section, but I'll leave off here.
@jgabry replied
@jmh530 yeah you're right that it's possible to use the QR approach even with a non-zero prior mean on the coefficients, so perhaps the recommendation in the manual is worded too strongly. @bgoodri should comment on this though because he wrote that section and I forget if he had a reason for discouraging this (he's definitely aware that's it's possible). One reason might just be that it is strange to have a prior belief that there is a dense covariance matrix for the MVN on theta since theta represents the coefficients on the columns of the Q matrix and those columns are orthogonal.
@jmh530 replied
True. When I was playing around with it, I had extracted the standard
deviations from the covariance matrix and just used the univariate normal
distribution.
from docs.
See stan-dev/stan#2273 for a discussion of numerical difficulties with simplexes used as regression coefficients in 5000 dimensions. I added a note to the simplex section of the manual on possible solutions, but going over the example itself would be a nice case study.
from docs.
I like the idea of defining the Bayesian posterior for R2, defined by @bgoodri in a response on StackOverflow: https://stackoverflow.com/questions/44759319/overall-predictive-power-e-g-r2-for-bayesian-linear-mixed-models
from docs.
- Add clutter example to mixture chapter as an example of "denoising" (it's an example in Bishop's book (section 10-7.1)
data {
real<lower = 0, upper = 1> theta; // clutter ratio
int<lower = 0> N;
vector[N] y;
}
parameters {
real mu;
}
model {
for (n in 1:N)
target += log_mix(theta,
normal_lpdf(y[n] | mu, 1),
normal_lpdf(y[n] | 0, 10));
}
theta <- 0.5
N <- 200
mu <- 4.3
y <- rep(0, N);
for (n in 1:N) {
if (rbinom(1, 1, 0.5)) {
y[n] <- rnorm(1, mu, 1)
} else {
y[n] <- rnorm(1, 0, 10)
}
}
library(rstan)
fit <- stan("clutter.stan", data = list(theta=theta, N=N, y=y))
from docs.
- Include Ben's discussion of the "Lancaster" parameterization of multinomial in terms of Poissons:
http://discourse.mc-stan.org/t/large-poisson-model-with-individual-effects-is-too-slow/2112/2
I started this in the programming.tex and added the reference, but I can't quite follow what is being reparameterized to what.
from docs.
There used to be a section in the manual on error and warning messages. It was very incomplete and was removed for 2.18. Add one back in that properly catalogues all the
- parser warnings
- parser errors
- run-time errors, organized by level
- reject()
- error()
- warning()
from docs.
- explain @bgoodri's missing data trick
X == X_miss + X_obs
so
X * beta == X_miss * beta + X_obs * beta
and we can use csr_matrix_times_vector
for the multiplies.
from docs.
Related Issues (20)
- user's guide regression chapter - change section title "Hierarchical Logistic Regression" to "Hierarchical Regression" HOT 1
- update issue templates HOT 2
- consistent naming of covariance/correlation matrices
- add Survival models chapter to Stan User's Guide TOC
- Stan User's Guide Efficiency Tuning Section - beta_binomial? HOT 1
- add links to pdfs from docs homepage and overview pages
- add dark mode for website
- Problem searching the functions reference HOT 3
- Incorrect equation for wiener pdf
- [User Guide] Stan Translation efforts HOT 3
- Add more in-depth description of Pathfinder to the reference-manual
- Functions Refernce: return type of `bernoulli_rng` - `R` or `ints`? HOT 1
- Functions Reference - Array reductions contain vector-only signatures HOT 2
- needs updating: Efficiency-tuning/standardizing-predictors-and-outputs HOT 5
- Update documentation for "Missing Data: Sliced missing" data to new array syntax HOT 1
- Incorrect equation in the truncated data section HOT 1
- replace "sampling statement" with "target augmentation statement" HOT 2
- document normal sufficient stats in User's Guide efficiency chapter HOT 1
- Document that non-data variables are not allowed in top level size declarations - consequences for `generated quantities` block HOT 6
- Document new overloads of `wiener_lpdf`
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from docs.