
prml_errata's People

Contributors

jcsahnwaldt, tommyod, yousuketakada


prml_errata's Issues

Errata in Figure 10.2

In Section 10.1.2, minimizing KL(q||p) leads to the unique solution
$$q(\mathbf{z}) = q_1(z_1)\, q_2(z_2) = \mathcal{N}(z_1 \mid \mu_1, \Lambda_{11}^{-1})\, \mathcal{N}(z_2 \mid \mu_2, \Lambda_{22}^{-1})$$
as the author shows in Exercise 10.2.
Minimizing KL(p||q), using (10.17) together with the Gaussian marginals (2.98), likewise yields a factorized Gaussian with the same means.
So the solution q(z) is, in general, not a spherical Gaussian and does not look like the approximation shown in Figure 10.2.
The correct contour plot of the solution is shown in panel (a) of the following figure, with purple contour lines.
https://drive.google.com/file/d/1cSnMA-_hheAmnCBLKrp551BcszI15vpq/view?usp=sharing
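
As a quick numerical illustration (my addition, not part of the original issue; the precision matrix below is made up purely for this example), one can compare the variances of the two factorized approximations:

```python
import numpy as np

# Hypothetical correlated bivariate Gaussian p(z) = N(z | mu, Lambda^{-1});
# the numbers are made up for illustration only.
mu = np.array([0.0, 0.0])
Lambda = np.array([[3.0, 1.0],
                   [1.0, 1.0]])     # precision matrix (positive definite)
Sigma = np.linalg.inv(Lambda)       # covariance matrix

# Minimizing KL(q || p) (Section 10.1.2, Exercise 10.2):
# q_i(z_i) = N(z_i | mu_i, Lambda_ii^{-1})
var_kl_qp = 1.0 / np.diag(Lambda)   # [1/3, 1]

# Minimizing KL(p || q) (10.17), i.e. the product of the marginals (2.98):
# q_i(z_i) = N(z_i | mu_i, Sigma_ii)
var_kl_pq = np.diag(Sigma)          # [0.5, 1.5]

# In both cases the per-dimension variances differ, so neither factorized
# approximation is a spherical Gaussian for this p(z).
print(var_kl_qp, var_kl_pq)
```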

I think Figure 10.2 originates from MacKay (2003, https://www.inference.org.uk/itprnn/book.pdf),
page 436; the related problem is Exercise 33.5 (p. 434), where the solution is
restricted to a spherical Gaussian.

If my understanding is correct, Figure 10.2 is not relevant to the context of Section 10.1.2.

Please check my opinion and let me know your comments.

Thank you.

Erratum on page 35 of More PRML Errata

Hi
I found a simple typo in your PRML errata.
In equation (148) on page 35, \bm{\phi}_N should be \bm{\phi}_M.

Please check; I am also still waiting on my previous pull request.
Thank you.

Expand on robust regression

  • IRLS to solve Student's t regression (see the sketch after this list)
  • Relation to the EM algorithm
  • M-estimators and robust loss functions, e.g., Cauchy and pseudo-Huber
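
A minimal sketch of the first two bullets (my addition; ν and σ² are held fixed at made-up values, and the function name is hypothetical), treating the IRLS weights as the EM responsibilities for the latent scale variables:

```python
import numpy as np

def student_t_irls(X, t, nu=4.0, sigma2=1.0, n_iter=50):
    """Fit linear regression with Student's-t noise by IRLS (an EM-style update).

    Each iteration down-weights points with large residuals via
    gamma_n = (nu + 1) / (nu + r_n^2 / sigma2), then solves weighted least squares.
    nu and sigma2 are kept fixed here for simplicity.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        r = t - X @ w                                   # residuals
        gamma = (nu + 1.0) / (nu + r**2 / sigma2)       # E-step-like weights
        WX = X * gamma[:, None]
        w = np.linalg.solve(X.T @ WX, WX.T @ t)         # weighted least squares (M-step)
    return w

# Toy usage with a gross outlier (made-up data):
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])               # design matrix with bias
t = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(50)
t[10] += 5.0                                            # outlier
print(student_t_irls(X, t))                             # close to [1, 2] despite the outlier
```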

Missing x in equation 3.57

In (3.57) on p. 156, we have p(t | t, alpha, beta). I believe there should be an x in the conditioning as well, as there is in (3.58).

Success probability $\mu$ for Bernoulli should be restricted on $(0, 1)$

Although the Bernoulli distribution (2.2) is well-defined for $\mu \in [0, 1]$, it is convenient to restrict $\mu$ to the open interval $(0, 1)$. With this restriction, we can avoid some edge cases (see below) while still being able to treat the case where $\mu = 0$ or $\mu = 1$ as the limit $\mu \to 0$ or $\mu \to 1$, respectively.

If $\mu = 0$ or $\mu = 1$, the distribution becomes degenerate, i.e., the random variable is deterministic, so that we have some difficulties including:

  • The log probability (2.6) is not well-defined.
  • The entropy (B.5) diverges.
  • We cannot identify the Bernoulli as a member of the exponential family because the natural parameter $\eta = \ln\{\mu / (1 - \mu)\}$ diverges for $\mu = 0$ or $\mu = 1$ (see the form written out below this list).
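
For reference (my addition, roughly following PRML's treatment of the Bernoulli as an exponential-family member in Section 2.4):

$$
\mathrm{Bern}(x \mid \mu)
= \mu^{x} (1 - \mu)^{1 - x}
= (1 - \mu) \exp\!\left\{ x \ln \frac{\mu}{1 - \mu} \right\}
= \sigma(-\eta) \exp(\eta x),
\qquad \eta = \ln \frac{\mu}{1 - \mu},
$$

so that $\eta$ is finite only for $\mu \in (0, 1)$ and diverges to $-\infty$ or $+\infty$ as $\mu \to 0$ or $\mu \to 1$, respectively.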

In fact, some authors adopt the restriction $\mu \in (0, 1)$; see, e.g., A&S and http://mathworld.wolfram.com/BernoulliDistribution.html.

Also, in the context of Bayesian inference in which we regard $\mu$ as a random variable, this restriction will not be a problem after all because we have $0 < \mu < 1$ almost surely (unless the distribution is singular).

Similar discussions also apply to other discrete distributions, i.e., the binomial, the multinoulli, and the multinomial.

For the prior distributions, i.e., the beta and the Dirichlet, the domain of $\mu$ becomes an open set, e.g., $(0, 1)$, but this does not affect the normalization. It also avoids the potential difficulty that the density function diverges at $\mu = 0$ or $\mu = 1$ for some parameter values.

Kronecker's delta used w/o definition

$\delta_{ij}$ should read $I_{ij}$ for consistency with other parts of PRML, where $I_{ij}$ is the $(i, j)$-th element of the identity matrix~$\mathbf{I}$ (see ``Mathematical Notation'' on Pages xi--xii).

Possible errors are found at:

  • Page 174, Exercise 3.4, Line -4

  • Page 238, Equation (5.34)

  • Page 248, Equations (5.75) and (5.76)

  • Page 289, Equation (5.208)

  • Page 307, Equation (6.62)

  • Page 314, Equation (6.75)

  • Page 563, Equation (12.7)

  • For the first of these corrections, we also need to add a pointer to ``Mathematical Notation'' on Pages xi--xii of PRML.

Introduce paragraphs and paragraph headers

Split lengthy paragraphs into shorter ones and introduce paragraph headers where appropriate to help the reader better understand the organization of the text.
Paragraph headers are also useful for introducing important concepts, e.g., big O notation, score function, etc.

Self-contained proof for Wishart distribution

Show the normalization (B.79) as well as the expectations (B.80) and (B.81) of the Wishart distribution (B.78) in a self-contained manner.
We have already pointed out that some appropriate citation, e.g., Anderson (2003), is needed for the Wishart distribution because it has been introduced without any proof. However, most multivariate statistics textbooks, including Anderson (2003), motivate the Wishart distribution differently from PRML; they typically introduce the Wishart distribution as the distribution over the scatter matrix.
The derivation of the Wishart distribution along this line is indirect for our purpose (we are mainly interested in its conjugacy). I would rather show the normalization (B.79) as well as the expectations (B.80) and (B.81) directly, just as we have done for the gamma distribution (2.146).
We derive the normalization by using the matrix factorization known as the Cholesky decomposition together with the associated Jacobian. We also introduce the multivariate gamma function, which simplifies the form of the normalization constant (B.79). The expectations (B.80) and (B.81) are then shown by making use of the fact that the expectation of the score function vanishes.
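
For reference (my addition; the compact form of (B.79) below should be double-checked against Appendix B), the multivariate gamma function is

$$
\Gamma_D(a) \equiv \pi^{D(D-1)/4} \prod_{i=1}^{D} \Gamma\!\left( a + \frac{1 - i}{2} \right),
$$

with which the Wishart normalization constant (B.79) can presumably be written compactly as

$$
B(\mathbf{W}, \nu) = |\mathbf{W}|^{-\nu/2} \left\{ 2^{\nu D / 2}\, \Gamma_D\!\left( \frac{\nu}{2} \right) \right\}^{-1}.
$$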

Dismissal of "Bayesian methods don't overfit" is too quick

You say that using a very broad prior distribution leads to insufficient regularization and thus overfitting. I'm guessing you have in mind using the MAP estimator. But this isn't a Bayesian thing to do. A Bayesian would produce a predictive distribution (section 3.3.2), or if required to produce a point prediction for every input, might produce something like the predictive mean or predictive median, depending on what the ultimate loss function is.

Errata for positive definite matrix in page 701

On page 701, second paragraph, the passage "A matrix A is said to be positive definite, denoted by A > 0, if w^TAw > 0 for all non-zero values of the vector w. Equivalently, a positive definite matrix has \lambda_i > 0 for all of its eigenvalues ..." contains an error, as follows.
The condition "Equivalently, a positive definite matrix has \lambda_i > 0 for all of its eigenvalues ..." is true only when A is a symmetric positive definite matrix; not every positive definite matrix is symmetric. For example,

$\begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}$ is not symmetric but is positive definite, and its eigenvalues ($1 \pm i$) are not positive real numbers.
(As David Mitra showed in https://math.stackexchange.com/questions/83134/does-non-symmetric-positive-definite-matrix-have-positive-eigenvalues)

So my suggested correction is: "Equivalently, a positive definite matrix" --> "If A is a symmetric, positive definite matrix".
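
A quick numerical check of this example (my addition, using NumPy):

```python
import numpy as np

A = np.array([[ 1.0, 1.0],
              [-1.0, 1.0]])        # not symmetric

# The eigenvalues of A are complex (1 ± i), so they are not positive real numbers.
print(np.linalg.eigvals(A))

# Yet w^T A w > 0 for every non-zero w, because the quadratic form only depends
# on the symmetric part (A + A^T) / 2, which here is the identity matrix:
S = (A + A.T) / 2
print(np.linalg.eigvalsh(S))       # both eigenvalues are 1 > 0, so w^T A w > 0
```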

Sylvester's determinant identity (C.14) is not a consequence of the push-through identity (C.6); use the Schur complement instead

Take the determinants of the two block diagonalizations that we use to show the general push-through identity [a generalized version of (C.5)] and the Woodbury identity (C.7). Both determinants are equal to det(M), so they can be equated; setting A and D equal to identity matrices (possibly of different dimensionalities) then gives (C.14) after some reparameterization.
Note that we cannot take the determinants of both sides of (C.6) and then cancel the det(A) factor, because A is not necessarily square, let alone nonsingular.
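
To spell out the argument (my addition; I assume the usual partitioning $\mathbf{M} = \bigl(\begin{smallmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \end{smallmatrix}\bigr)$ with $\mathbf{A}$ of size $N \times N$ and $\mathbf{D}$ of size $M \times M$), the two Schur-complement factorizations give

$$
\det(\mathbf{M}) = \det(\mathbf{A}) \det(\mathbf{D} - \mathbf{C} \mathbf{A}^{-1} \mathbf{B}) = \det(\mathbf{D}) \det(\mathbf{A} - \mathbf{B} \mathbf{D}^{-1} \mathbf{C}),
$$

and setting $\mathbf{A} = \mathbf{I}_N$ and $\mathbf{D} = \mathbf{I}_M$ yields $\det(\mathbf{I}_M - \mathbf{C}\mathbf{B}) = \det(\mathbf{I}_N - \mathbf{B}\mathbf{C})$; replacing $\mathbf{B}$ with $-\mathbf{B}$ gives the Sylvester-type identity $\det(\mathbf{I}_N + \mathbf{B}\mathbf{C}) = \det(\mathbf{I}_M + \mathbf{C}\mathbf{B})$, which should correspond to (C.14) up to reparameterization.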

Add errata for some hyphenation inconsistencies

  • "linear-Gaussian" vs. "linear Gaussian": Although a quick search on the web reveals that the term "linear Gaussian" (without hyphenation) may be more popular, it seems that the term "linear-Gaussian" is used consistently in PRML.
  • "expectation maximization" vs. "expectation-maximization": The term "expectation maximization" (without hyphenation) is better.
  • "positive definite" vs. "positive-definite": We say "A is positive definite" but "A is a positive-definite matrix" (we should use hyphenation only before a noun in this case).
  • "N-dimensional" vs. "N dimensional": We always use hyphenation if an adjective is formed with an ordinal.

Add some more properties of log minus digamma function

  • Positivity, which follows from Jensen's inequality by comparing the expectation of the log of a gamma-distributed random variable with the log of its mean.
  • Define complete monotonicity.
  • Asymptotic expansion of the log gamma function.
  • Complete monotonicity of the residual of the asymptotic expansion.
  • Complete monotonicity of the log minus digamma function.
  • Log minus digamma function is convex ($a_{\text{ML}}$ indeed maximizes the likelihood).
  • Log minus digamma function is bijective and thus invertible.
  • Numerical solution for the inverse of the log minus digamma function (a sketch follows this list).
  • Add figure for log minus digamma function.
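
A minimal numerical sketch for the inversion bullet above (my addition; the function name and the Newton-in-log-space scheme are my own choices, not taken from the errata):

```python
import numpy as np
from scipy.special import digamma, polygamma

def inv_log_minus_digamma(y, a0=1.0, tol=1e-12, max_iter=100):
    """Solve ln(a) - digamma(a) = y for a > 0 by Newton's method in log(a).

    ln(a) - digamma(a) is positive and strictly decreasing on (0, inf)
    (see the properties listed above), so the solution is unique for y > 0.
    Working in log(a) keeps the iterate positive.
    """
    log_a = np.log(a0)
    for _ in range(max_iter):
        a = np.exp(log_a)
        f = np.log(a) - digamma(a) - y
        df = 1.0 - a * polygamma(1, a)   # d/d(log a) of [ln(a) - digamma(a)]
        step = f / df
        log_a -= step
        if abs(step) < tol:
            break
    return np.exp(log_a)

# Round-trip check with a made-up value of a:
a_true = 2.5
y = np.log(a_true) - digamma(a_true)
print(inv_log_minus_digamma(y))          # approximately 2.5
```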
