The prml_errata from yousuketakada

prml_errata's Issues

Errata in Figure 10.2

In the 10.1.2, with minimizing KL(q||p), we can found that the unique solution
q(z) = q_1^{}(z_1) * q_2^{}(z_2) = N(z_1| \mu_1, \Lambda_11^{-1}) * N(z_2| \mu_2, \Lambda_22^{-1})
as author showed in Exercise 10.2.
We also have same solution as above by minimizing KL(p||q) in (10.17) and (2.98).
So, the solution q(z) is ,in general, not the spherical Gaussian and can not be looked similar
to Figure 10.2.
The correct contour plot of the solution looks like in the following figure in (a) with purpled colored lines.
https://drive.google.com/file/d/1cSnMA-_hheAmnCBLKrp551BcszI15vpq/view?usp=sharing

I think that Figure 10.2 is originated from Mackay, 2003(https://www.inference.org.uk/itprnn/book.pdf)
at page 436 and the related problem is exercise 33.5 (p.434) and in this problem, the solution is
restricted to spherical Gaussian.

If my understanding is correct, Figure 10.2 is not relevant to the contexts in 10.1.2.

Please check my opinion and let me know your comments.

Thank you.

Errata fixed by Bishop is correct at page 101?

Hi Yousuketakada

According to the official errata fix in https://www.microsoft.com/en-us/research/wp-content/uploads/2016/05/prml-errata-3rd-20110921.pdf, at page 101 ,a = 1 + beta/2 should be read a = (1 + beta)/2.
However I think that this fix is wrong and the original correct., a = 1 + beta/2.
In (2.153), lambda^(beta/2) should be lambda^(a-1) by definition of gamma function.
So, a - 1 = beta/2 --> a = 1 + beta/2
Is my understanding is correct?

Errata at page 35 in More PRML Errata

Hi
I found simple typo in your PRML Errata.
In the equation number (148) at page 35, \bm{\phi}_N should be \bm{\phi}_M.

Please check and I am also waiting for my previous pull request.
Thank you.

Expand on robust regression

IRLS to solve Student's t regression
Relation to the EM
M-estimators and robust functions, e.g., Cauchy and pseudo-Huber

Missing x in equation 3.5

In 3.57 on p. 156, we have p(t|t,alpha,beta). I believe there should be an x in the conditioning too, as it is in 3.58.

Give some motivation for Fisher information matrix

KL and Fisher information

Error in eq. (3.99)

Considering eq. (3.95), eq.(3.99) should be corrected to:

$\beta = \frac{N - M}{2E_{D}(\mathbf{m}_N)}$

Success probability $\mu$ for Bernoulli should be restricted on $(0, 1)$

Although the Bernoulli distribution (2.2) is well-defined for $\mu \in [0, 1]$, it is convenient to restrict $\mu$ on the open set $(0, 1)$. With this restriction, we can avoid some edge cases (see below) while we can still consider the case where $\mu = 0$ or $\mu = 1$ as a limit $\mu \to 0$ or $\mu \to 1$, respectively.

If $\mu = 0$ or $\mu = 1$, the distribution becomes degenerate, i.e., the random variable is deterministic, so that we have some difficulties including:

The log probability (2.6) is not well-defined.
The entropy (B.5) diverges.
We cannot identify the Bernoulli as a member of the exponential family because the natural parameter $\eta = \ln { \mu / (1 - \mu) }$ diverges for $\mu = 0$ or $\mu = 1$.

In fact, some authors adopt the restriction $\mu \in (0, 1)$; see, e.g., A&S and http://mathworld.wolfram.com/BernoulliDistribution.html.

Also, in the context of Bayesian inference in which we regard $\mu$ as a random variable, this restriction will not be a problem after all because we have $0 < \mu < 1$ almost surely (unless the distribution is singular).

Similar discussions also apply to other discrete distributions, i.e., the Binomial, the multinoulli, and the multinomial.

For the prior distributions, i.e., the beta and the Dirichlet, the domain of $\mu$ becomes an open set, e.g., $(0, 1)$, but this does not affect the correct normalization. Also, this can avoid some potential difficulty that the density function diverges at $\mu = 0$ or $\mu = 1$ for some parameters.

Link to official support page broken

The link to the support page should be
https://www.microsoft.com/en-us/research/people/cmbishop/prml-book/

Kronecker's delta used w/o definition

$\delta_{ij}$ should read $I_{ij}$ for consistency with other part of PRML where $I_{ij}$ is the $(i, j)$-th element of the identity matrix~$\mathbf{I}$ (see ``Mathematical Notation'' on Pages xi--xii.).

Possible errors are found at:

Page 174, Exercise 3.4, Line -4
Page 238, Equation (5.34)
Page 248, Equations (5.75) and (5.76)
Page 289, Equation (5.208)
Page 307, Equation (6.62)
Page 314, Equation (6.75)
Page 563, Equation (12.7)
We also need to add a mention to ``Mathematical Notation'' for PRML on Pages xi--xii for the first correction to these errors.

Introduce paragraphs and paragraph headers

Split lengthy paragraphs into shorter ones and introduce paragraph headers where appropriate to help the reader better understand the organization of the text.
Paragraph headers are also useful for introducing important concepts, e.g., big O notation, score function, etc.

Self-contained proof for Wishart distribution

Show the normalization (B.79) as well as the expectations (B.80) and (B.81) of the Wishart distribution (B.78) in a self-contained manner.
We have already pointed out that some appropriate citation, e.g., Anderson (2003), is needed for the Wishart distribution because it has been introduced without any proof. However, most multivariate statistics textbooks, including Anderson (2003), motivate the Wishart distribution differently from PRML; they typically introduce the Wishart distribution as the distribution over the scatter matrix.
The derivation of the Wishart distribution along this line is indirect for our purpose (we are mainly interested in its conjugacy). I would rather like to show the normalization (B.79) as well as the expectations (B.80) and (B.81) directly just as we have done for the gamma distribution (2.146).
We show the Wishart distribution based on the matrix factorization called the Cholesky decomposition as well as the associated Jacobian. We also introduce the multivariate gamma function, which simplifies the form of the normalization constant (B.79). The expectations (B.80) and (B.81) are shown by making use of the fact that the expectation of the score function vanishes.

Dismissal of "bayesian methods don't overfit" is too quick

You say that using a very broad prior distribution leads to insufficient regularization and thus overfitting. I'm guessing you have in mind using the MAP estimator. But this isn't a Bayesian thing to do. A Bayesian would produce a predictive distribution (section 3.3.2), or if required to produce a point prediction for every input, might produce something like the predictive mean or predictive median, depending on what the ultimate loss function is.

Errata for positive definite matrix in page 701

In page 701, second paragraph, "A matrix A is said to be positive definite, denoted by A > 0, if w^TAw > 0 for all non-zero values of the vector w. Equivalently, a positive definite matrix has \lambda_i > 0 for all of its eigenvalues ..." has some error as following.
The conditions in "Equivalently Equivalently, a positive definite matrix has \lambda_i > 0 for all of its eigenvalues ..." is true only when A is "symmetric" positive definite matrix. That is not all positive definite matrix is symmetric. For example,

\begin{pmatrix} 1 & 1 \-1 & 1\end{pmatrix} is not symmetric but positive definite and does not have positive eigenvalues.
(As David Mitra showed in https://math.stackexchange.com/questions/83134/does-non-symmetric-positive-definite-matrix-have-positive-eigenvalues)

So, my suggestion to correct is "Equivalently, a positive definite matrix" --> "If A is a symmetric, positive definite matrix".

Sylvster's determinant identity (C.14) is not a consequence of the push-through identity (C.6); use the Schur complement instead

Take the determinants of the two block diagonalizations we use to show the general push-through identity [a generalized version of (C.5)] and the Woodbury identity (C.7). The determinants are both equal to det(M) so that they are equated and setting A and D equal to identities (possibly of different dimensionalities) gives (C.14) with some reparameterization.
Note that we cannot take the determinants of both sides of (C.6) and then cancel the det(A) factor because A is not necessarily square nor nonsingular.

I believe M should be 4 in 5.5. Regularization in Neural Networks, p256

Figure 5.10 shows that at M=4, the sum of square errors is least, which accords with Figure 5.9. M=3 maybe better because there is no great improvement from 3 to 4.

Add errata for some hyphenation inconsistencies

"linear-Gaussian" vs. "linear Gaussian": Although a quick search on the web reveals that the term "linear Gaussian" (without hyphenation) may be more popular, it seems that the term "linear-Gaussian" is used consistently in PRML.
"expectation maximization" vs. "expectation-maximization": The term "expectation maximization" (without hyphenation) is better.
"positive definite" vs. "positive-definite": We say "A is positive definite" but "A is a positive-definite matrix" (we should use hyphenation only before a noun in this case).
"N-dimensional" vs. "N dimensional": We always use hyphenation if an adjective is formed with an ordinal.

Add some more properties of log minus digamma function

Positivity from the comparison between the log expectation and the logarithm of the mean of the gamma-distributed random variable, making use of Jensen's inequality.
Define complete monotonicity.
Asymptotic expansion of the log gamma function.
Complete monotonicity of the residual of the asymptotic expansion.
Complete monotonicity of the log minus digamma function.
Log minus digamma function is convex ($a_{\text{ML}}$ indeed maximizes the likelihood).
Log minus digamma function is bijective and thus invertible.
Numerical solution for the inverse of the log minus digamma function.
Add figure for log minus digamma function.

yousuketakada / prml_errata Goto Github PK

prml_errata's People

Contributors

Stargazers

Watchers

Forkers

prml_errata's Issues

Recommend Projects

Recommend Topics

Recommend Org