
bayesian-julia's Introduction

I am a software developer with a PhD and background in Computational Statistics. I am interested in performant and reliable software. In other words, I make computers go brrr faster.

In hope of a better future, I dedicate my free time to Bitcoin-related projects. I contribute to the Rust Bitcoin ecosystem, particularly Bitcoin Dev Kit (BDK) and rust-bitcoin. I also work full-time at Portal on Bitcoin-backed smart contracts.

I am a faculty member at the Department of Computer Science at Universidade Nove de Julho (UNINOVE). I am also an ardent Bayesian.

I like Rust, Julia, Nix, and Vim.

I practice Stoic Philosophy daily.

I hate bloatware and the soydev phenomenon.

Everything that I do is either open source or has a permissive Creative Commons license.

I don't have social media, since I think they are overrated and "they sell your data". If you want to contact me, please send an email.

If you want to find out what's recently on my mind, check out my blog.

bayesian-julia's People

Contributors

alexbakus, behinger, bgctw, dependabot[bot], feinmann, github-actions[bot], manuelhaussmann, marcora, markmbaum, mashu, moelf, pfcgit, pitmonticone, rikhuijzer, storopoli, tegedik


bayesian-julia's Issues

Add some extra explanation to generating data with predict function with missing input

For the PPC, the way predict works is now shown in the tutorial:
https://storopoli.io/Bayesian-Julia/pages/4_Turing/#prior_and_posterior_predictive_checks

However, as became apparent on Slack, it might not be self-explanatory enough for people who encounter it for the first time, so perhaps some extra explanation should be given, at least for why :y appears where it does in DynamicPPL.Model{(:y,)}.

Because if one passes in multiple missing variables, e.g. s1 and s2, then this should become DynamicPPL.Model{(:s1, :s2,)} for example.
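A minimal sketch of the pattern, with illustrative names rather than the tutorial's actual model, might make the mechanics clearer:

```julia
using Turing

@model function coinflip(y)
    p ~ Beta(1, 1)
    y .~ Bernoulli(p)
end

# Fit on observed data.
chain = sample(coinflip(rand(Bool, 50)), NUTS(), 1_000)

# Passing `missing` values for y makes Turing treat y as unobserved,
# so the model type records it: DynamicPPL.Model{(:y,)}.
y_missing = Vector{Union{Missing,Bool}}(missing, 50)
y_pred = predict(coinflip(y_missing), chain)
```

With two missing inputs, say s1 and s2, the type parameter would carry both names instead.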

chapter 03: description of negative binomial distribution

I think the current description of the Negative Binomial is confusing and possibly wrong. In the parameterization used in Julia, the Negative Binomial with parameters r (Julia uses r, not k) and p is the distribution of the number of failures (not total trials) until r successes.

(By the way, the xlabel is thus wrong too: #71.)
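To illustrate the parameterization concretely (a sketch using Distributions.jl; the numbers are made up):

```julia
using Distributions

# NegativeBinomial(r, p): number of *failures* before the r-th success,
# with success probability p on each trial.
d = NegativeBinomial(5, 0.4)
mean(d)  # r * (1 - p) / p = 5 * 0.6 / 0.4 = 7.5 failures on average
```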

Gibbs sampler description

Hello and thank you very much for the excellent tutorial.

Currently you describe the iteration in the Gibbs sampler like the following:
[image: Gibbs sampler iteration as described in the tutorial]

I had some trouble understanding it, and Wikipedia gives a different form, which to a Bayesian beginner like myself made more sense:
https://en.wikipedia.org/wiki/Gibbs_sampling
[image: Gibbs sampling update from Wikipedia]

So I wonder if the right-hand-side t-1 is actually meant to be t, and the right-hand-side 0 is meant to be t?
Again, I don't mean to undermine your excellent work; I am just trying to clear up any gray areas.
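For reference, the Wikipedia form of one full Gibbs iteration $t$ can be written as:

```latex
\theta_1^{(t)} \sim p\!\left(\theta_1 \mid \theta_2^{(t-1)}, \theta_3^{(t-1)}, \ldots, \theta_d^{(t-1)}\right) \\
\theta_2^{(t)} \sim p\!\left(\theta_2 \mid \theta_1^{(t)}, \theta_3^{(t-1)}, \ldots, \theta_d^{(t-1)}\right) \\
\vdots \\
\theta_d^{(t)} \sim p\!\left(\theta_d \mid \theta_1^{(t)}, \theta_2^{(t)}, \ldots, \theta_{d-1}^{(t)}\right)
```

Each component is conditioned on the most recent draws of the other components; iteration-0 values only appear during the very first sweep.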
Thanks again!

Formula definition of p-value

Given the definition of the p-value, and the fact that everywhere else in the book the letter D represents the observed data, I would propose (to avoid any confusion by the reader) changing the formula definition of the p-value to $P(D^* \mid H_0)$, where $D^*$ represents data at least as extreme as the observed data $D$. The way it is written now, $P(D \mid H_0)$, reads as if the p-value is the probability of the observed data given that the null hypothesis is true.

the difference between groups are zero

In "the difference between groups are zero", the word "are" should be changed to "is" (in the p-values section).
Also, wouldn't "there is no difference between groups" be a bit better as "there are no differences among groups"?

Expect should be except?

# One-hot vector is a vector of integers in which all indices are zero (0) expect for one single index that is one (1).

Should expect be replaced here with except ?

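For context, a tiny sketch of what the (corrected) sentence describes; onehot here is a hypothetical helper, not something from the book:

```julia
# Hypothetical helper: all indices are zero except index k, which is one.
onehot(k, n) = [i == k ? 1 : 0 for i in 1:n]

onehot(3, 5)  # [0, 0, 1, 0, 0]
```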

Compat bounds in Project.toml

Hello,

I stumbled upon this repo after reading about your issue with Pkg compats on Slack. I feel your pain; these clashes can be super annoying. One question: (1) did you try putting Franklin in your toml? And, if that failed, (2) did you try putting compat bounds explicitly in the Project.toml of your site folder instead of putting them in the deploy script?

The chain of dependencies that causes the problems described on Slack is

Franklin -> Literate -> JSON -> Parsers

and that then clashes with CSV, which requires Parsers 1, I think. But when you put everything in your Project.toml file (including Franklin), it should get the right versions for everything.
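For example, the site folder's Project.toml could pin things with a [compat] section; the bounds below are purely illustrative, not the versions this repo actually needs:

```toml
[compat]
CSV = "0.10"
Franklin = "0.10"
Parsers = "2"
```

With all direct dependencies (including Franklin) declared in the same Project.toml, the resolver can pick mutually compatible versions in one pass instead of clashing at deploy time.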

PS: pretty cool website, I wasn't aware of it :)

Move images from `Plots.jl` to `Makie.jl` + `AlgebraOfGraphics.jl`

Code to replace:

  • Densities in 2_bayes_stats.jl: straight up drop-in from Plots.jl to Makie.jl
  • PMFs/PDFs in 3_prob_dist.jl: straight up drop-in from Plots.jl to Makie.jl
  • Visualizations in 4_Turing.jl: use David's suggestions in TuringLang/MCMCChains.jl#306 (comment)
  • Animations in 5_MCMC.jl: don't know, first foray into Makie.jl animations.
  • Densities in 7_logistic_reg.jl, 8_count_reg.jl and 9_robust_reg.jl: straight up drop-in from Plots.jl to Makie.jl
  • Funnel in 11_Turing_tricks.jl: straight up drop-in from Plots.jl to Makie.jl
  • Time Series and ODE Solution in 12_epi_models.jl: straight up drop-in from Plots.jl to Makie.jl
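For the drop-in cases, the change is roughly of this shape (a sketch; the tutorial's actual plotting calls will differ):

```julia
# Plots.jl version (before):
# using Plots
# plot(Normal(0, 1); xlabel="θ", label=false)

# Makie.jl version (CairoMakie backend), plotting the same density:
using CairoMakie, Distributions

xs = range(-4, 4; length=200)
fig, ax, _ = lines(xs, pdf.(Normal(0, 1), xs);
                   axis=(xlabel="θ", ylabel="density"))
fig
```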

UndefVarError on Two Chapters

Hello,

There are UndefVarError errors in two chapters (7 & 8) on logistic regression (e.g., UndefVarError: chain not defined here). By the way, the code runs without problems on my machine.

Thanks for this tutorial.

logistic regression error

Hey,

I tried to run your log reg example using Julia 1.8.5, Turing v0.24.0, LazyArrays v0.22.16 and get the following error:

ERROR: ArgumentError: Broadcasted{Unknown} wrappers do not have a style assigned

stemming from this line:

return y ~ arraydist(LazyArray( @~ BernoulliLogit.(α .+ X * β)))

Any thoughts? In the meantime I will just use the old syntax with a for loop, but I imagine this implementation is faster?
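For reference, the for-loop variant mentioned above might look like this sketch (names are illustrative; it avoids the LazyArrays broadcast machinery at the cost of speed):

```julia
using Turing
using LinearAlgebra: dot

@model function logreg(X, y)
    α ~ Normal(0, 2)
    β ~ filldist(Normal(0, 2), size(X, 2))
    # Loop form: no LazyArray/@~ broadcasting, so no
    # Broadcasted{Unknown} style error.
    for i in eachindex(y)
        y[i] ~ BernoulliLogit(α + dot(X[i, :], β))
    end
end
```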

Thanks!

Incorrect `sample` statement

All of the sample statements for Turing models are incorrect:

sample(model, NUTS(), MCMCThreads(), 2_000, 4)

This implies that Turing will sample 2_000 iterations after discarding the nadapts default of NUTS() (which is 1_000). So this gives you 3_000 iterations, with 1_000 as warmup.

The correct approach would be:

sample(model, NUTS(1_000, 0.8), MCMCThreads(), 1_000, 4)

Sparse Regression

New content for Sparse Regression:

  • What is Sparsity?
  • The frequentist case with Ridge, Lasso and Elastic Net (optimization with Lagrangian constraints)
  • Bayesian with:
    • Original Horseshoe (Carvalho, Polson & Scott, 2009)
    • Horseshoe+ (Bhadra et al., 2015)
    • Finnish Horseshoe (Piironen & Vehtari, 2017)
    • R2-D2 (Zhang et al., 2020)
  • Turing Code: Use the code from Robust Regression - Julia Discourse

References:

  • Carvalho, C. M., Polson, N. G., & Scott, J. G. (2009). Handling sparsity via the horseshoe. In International Conference on Artificial Intelligence and Statistics (pp. 73-80).
  • Bhadra, A., Datta, J., Polson, N. G., & Willard, B. (2015). The Horseshoe+ estimator of ultra-sparse signals. arXiv:1502.00560 [math, stat].
  • Piironen, J., & Vehtari, A. (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. arXiv:1707.01694 [stat].
  • Zhang, Y. D., Naughton, B. P., Bondell, H. D., & Reich, B. J. (2020). Bayesian regression using a prior on the model fit: The R2-D2 shrinkage prior. Journal of the American Statistical Association.

Chapter 4: possible error in description of Turing's warmup defaults

Hey,

Thanks for the great tutorial!
However I think I have spotted an error in chapter 4.
Here you say :

Please note that, as default, Turing samplers will discard the first half of iterations as warmup. 
So the sampler will output 1,000 samples (floor(2_000 / 2)):
chain = sample(model, NUTS(), 2_000);

https://github.com/storopoli/Bayesian-Julia/blob/master/_literate/4_Turing.jl#L162

However in the figure below that, plot(chain), you can see that the samples start from sample 1000 and go to sample 3000.
https://storopoli.io/Bayesian-Julia/assets/pages/4_Turing/code/output/chain.svg

So to me it seems that if you specify 2000 samples in the sample() function, Turing will add an extra 1000 warmup samples that get discarded afterwards.
Do you think this is correct?

I tried to look it up in the Turing documentation, but it is not immediately clear to me where to find this information.
I am new to Julia, coming from R, and finding the right documentation in Julia land is still a challenge for me sometimes.

Best

Robust regression

In the summary table for the robust regression example there are clearly some issues with the sampler as evidenced by the summary of beta[1]:

parameters                 mean                  std             naive_se                 mcse         ess         rhat   ess_per_sec
    Symbol              Float64              Float64              Float64              Float64     Float64      Float64       Float64

         α               5.9437             177.6342               1.9860               4.7019   1682.7144       1.0096      115.0889
      β[1]   2918557377982.3223   5055405652489.1826     56521153464.0125   568742023576.1283     16.0321   23930.2360        1.0965
      β[2]              -4.9522              89.6014               1.0018               4.3793    411.7090       1.0080       28.1587
         ν               1.5936               3.6077               0.0403               0.1740     89.2368       1.1117        6.1033
        νᵦ               2.8181               5.0152               0.0561               0.2888     61.1063       1.1777        4.1793
        νₐ               4.5323               4.8496               0.0542               0.0832   3484.2626       1.0039      238.3054
         σ               5.2063               2.5503               0.0285               0.2357     23.2047       1.8193        1.5871

chapter08 esoph example has wrong text at now

As you can see from the describe() output 58% of the respondents switched wells and 42% percent of respondents somehow are engaged in community organizations. The average years of education of the household's head is approximate 5 years and ranges from 0 (no education at all) to 17 years. The distance to safe drinking water is measured in meters and averages 48m ranging from less than 1m to 339m. Regarding arsenic levels I cannot comment because the only thing I know that it is toxic and you probably would never want to have your well contaminated with it. Here, we believe that all of those variables somehow influence the probability of a respondent switch to a safe well.

chapter 03: xlabel of many figures is wrong

For several of the distributions the xlabel is wrong, and should not be \theta:

  • Binomial: should be Number of events or Number of successes
  • Poisson: should be Number of events
  • NegativeBinomial: should be Number of failures
  • Normal: should be X (or however one calls the random variable)
  • etc

The second axiom of probability is false and the third one is fuzzy

The second axiom in your tutorial is stated as follows: Additivity: For two mutually exclusive events A and B (they cannot occur at the same time[9]):
P(A) = 1 - P(B), and P(B) = 1 - P(A).

Counterexample:
Consider the experiment of rolling a fair die. Let A = {3, 6} be the event of getting a face divisible by 3, and B = {5} the event of getting the face 5. A and B are mutually exclusive.
P(A) = 2/6 = 1/3 and P(B) = 1/6, but 1/3 is not equal to 1 - 1/6.
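For reference, the standard (Kolmogorov) additivity axiom states that for mutually exclusive events $A$ and $B$:

```latex
P(A \cup B) = P(A) + P(B)
```

which in the die example gives $P(A \cup B) = 2/6 + 1/6 = 3/6$, as expected.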

Unfortunately you have taught this definition to your students, https://storopoli.github.io/Estatistica-Bayesiana/0-Estatistica-Bayesiana.html#defini%C3%A7%C3%A3o-matem%C3%A1tica, whereas there are hundreds of probability books where you could have found the right one :(
Also, I stated the probability axioms when I opened the thread on the Julia Discourse related to your Probability chapter, which you intended to include in Julia for Data Science, but it seems you did not read all the comments in that thread :(

I post here the definition of probability from the book, M. Baron - Probability and Statistics for Computer Scientists, Second edition, CRC Press. and from S. M. Ross, Probability Models for Computer Science, Academic Press:
[image: definition of probability from M. Baron]
[image: definition of probability from S. M. Ross]

So far, I've only run a few Distributions and Turing examples, and can't yet express an opinion.

Confidence intervals

Hi,

Chapter 2 states:

This means that 95 studies out of 100, which would use the same sample size and target population, applying the same statistical test, will expect to find a result of mean differences between groups between 10.5 and 23.5.

This seems to correspond to misinterpretation 22 from “Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations”.

The correct interpretation is that approximately 95 studies out of 100 would compute a confidence interval that contains the true mean difference, but it says nothing about which ones those are (whereas the data might).

In other words, 95% is not the probability that the true parameter is contained in the particular interval we obtained. It is the probability of obtaining data such that, if we compute another confidence interval in the same way, it contains the true parameter. The interval that we got in this particular instance is irrelevant and might as well be thrown away.
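A quick simulation (a sketch, with made-up population values) shows what the 95% does refer to: the long-run coverage of the procedure, not any single realized interval:

```julia
using Distributions, Random, Statistics

Random.seed!(42)
μ_true, σ_true, n, reps = 10.0, 2.0, 30, 10_000
z = quantile(Normal(), 0.975)  # ≈ 1.96

# Count how often a freshly computed 95% CI covers the true mean.
covered = count(1:reps) do _
    x = rand(Normal(μ_true, σ_true), n)
    half = z * std(x) / sqrt(n)
    mean(x) - half ≤ μ_true ≤ mean(x) + half
end
covered / reps  # close to 0.95: a property of the procedure across repetitions
```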

Here is some nice reading on the subject:

E. T. Jaynes. Confidence Intervals vs. Bayesian Intervals (1976). DOI: 10.1007/978-94-009-6581-2_9

Let us try to understand what is happening here. It is perfectly true that, if the distribution (15) is indeed identical with the limiting frequencies of various sample values, and if we could repeat all this an indefinitely large number of times, then use of the confidence interval (17) would lead us, in the long run, to a correct statement 90% of the time. But it would lead us to a wrong answer 100% of the time in the subclass of cases where $\theta^* > x_1 +0.85$; and we know from the sample whether we are in that subclass.
[…]
We suggest that the general situation, illustrated by the above example, is the following: whenever the confidence interval is not based on a sufficient statistic, it is possible to find a 'bad' subclass of samples, recognizable from the sample, in which use of the confidence interval would lead us to an incorrect statement more frequently than is indicated by the confidence level; and also a recognizable 'good' subclass in which the confidence interval is wider than it needs to be for the stated confidence level. The point is not that confidence intervals fail to do what is claimed for them; the point is that, if the confidence interval is not based on a sufficient statistic, it is possible to do better in the individual case by taking into account evidence from the sample that the confidence interval method throws away.

Morey, R.D., Hoekstra, R., Rouder, J.N. et al. The fallacy of placing confidence in confidence intervals. Psychon Bull Rev 23, 103–123 (2016). DOI: 10.3758/s13423-015-0947-8

For now, we give a simple example, which we call the “trivial interval.” Consider the problem of estimating the mean of a continuous population with two independent observations, $y_1$ and $y_2$. If $y_1 > y_2$, we construct a confidence interval that contains all real numbers $(-\infty, \infty)$; otherwise, we construct an empty confidence interval. The first interval is guaranteed to include the true value; the second is guaranteed not to. It is obvious that before observing the data, there is a 50 % probability that any sampled interval will contain the true mean. After observing the data, however, we know definitively whether the interval contains the true value.
[…]
Once one has collected data and computed a confidence interval, how does one then interpret the interval? The answer is quite straightforward: one does not – at least not within confidence interval theory. As Neyman and others pointed out repeatedly, and as we have shown, confidence limits cannot be interpreted as anything besides the result of a procedure that will contain the true value in a fixed proportion of samples. Unless an interpretation of the interval can be specifically justified by some other theory of inference, confidence intervals must remain uninterpreted, lest one make arbitrary inferences or inferences that are contradicted by the data. This applies even to “good” confidence intervals, as these are often built by inverting significance tests and may have strange properties (e.g., Steiger, 2004).

(As I. J. Good put it: “One of the intentions of using confidence intervals and regions is to protect the reputation of the statistician by being right in a certain proportion of cases in the long run. Unfortunately, it sometimes leads to such absurd statements, that if one of them were made there would not be a long run.” with Jaynes’ truncated exponential being a nice example.)
