ermongroup / cs228-notes
Course notes for CS228: Probabilistic Graphical Models.
License: MIT License
I'd like to update the HTML after updating the markdown.
Hi! In the Junction Tree Algorithm Lecture Notes, I noticed some inconsistencies w.r.t. factor graphs. In that order, I quote:
Reading the notes and understanding what factor trees/graphs are is doable, but the phrasing/order is confusing. I'll try submitting a pull request later in the quarter.
Thank you!
In the section An illustrative example, you state
Crucially, we assume that the cliques c some path structure
Maybe you forgot "form"? So it should be
Crucially, we assume that the cliques c form some path structure
In section 2.5, you do not explain what I is supposed to be. It might be worth explaining it, as not everyone will be familiar with this indicator function.
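(For reference, and assuming I denotes the usual indicator function, the definition I would add is

$$ \mathbb{I}[A] = \begin{cases} 1 & \text{if the condition } A \text{ holds,} \\ 0 & \text{otherwise,} \end{cases} $$

so that, for instance, $\mathbb{E}[\mathbb{I}[X \in A]] = P(X \in A)$. One sentence like this in section 2.5 would remove the ambiguity.)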
In section 2.6, you describe the Bernoulli random variable as follows
one if a coin with heads probability p comes up heads, zero otherwise.
Specifically, the expression "heads probability" is a little confusing. It would be clearer to be more explicit: "with probability p of getting heads".
It could be formulated as follows (or something like that):
the probability mass function takes the value p if X = 1 (heads) and the value 1 - p if X = 0 (tails). In other words, the probability of taking the value "heads" (1) is p and the probability of taking the value "tails" (0) is 1 - p.
At the very least, I would rephrase the explanations to make them less ambiguous and confusing.
I think the explanations of the other random variables are also quite poor. For example, the geometric random variable is closely related to the binomial random variable, but you do not explain it. See: https://stats.stackexchange.com/q/263141/82135. In general, I think you should spend 2-3 sentences better motivating the use of these particular random variables. This would make your notes even more useful!
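To make the suggestion concrete, the distributions in question are (writing them out myself, as I understand them):

$$ p(x) = p^{x}(1-p)^{1-x}, \quad x \in \{0, 1\} \qquad \text{(Bernoulli)} $$

$$ p(k) = (1-p)^{k-1}\, p, \quad k = 1, 2, \dots \qquad \text{(geometric: number of Bernoulli trials until the first success)} $$

A short parenthetical like the one after the geometric pmf is the kind of motivating sentence I had in mind.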
Should that read "if
cs228-notes/inference/sampling/index.md
Line 110 in 5d0ddf7
The following equation is quite confusing
Specifically, you use x_{\mathcal N(s)} as the input to f_s, but the summation is over x_{\mathcal N(s)\setminus i}, whose meaning is not very clear. I think you should clarify this point a little more.
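In case it helps other readers, my reading of the notation (which is what I think the notes should state explicitly) is that $x_{\mathcal N(s)}$ is a joint assignment to all variables adjacent to the factor $f_s$, and $x_{\mathcal N(s) \setminus i}$ is that assignment with the $i$-th variable removed, so that

$$ \sum_{x_{\mathcal N(s)\setminus i}} f_s\big(x_{\mathcal N(s)}\big) \;=\; \sum_{\{x_j \,:\, j \in \mathcal N(s),\, j \neq i\}} f_s\big(x_i,\, x_{\mathcal N(s)\setminus i}\big), $$

i.e. the sum ranges over all configurations of the neighbours of $s$ other than $x_i$, with $x_i$ held fixed inside the argument of $f_s$.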
If you look at the front end of the VE chapter under "An illustrative example" (as far as I can tell), you see in the first line "...simplicity that we a \n given a chain..."; but the text in the source clearly has "....we are given a chain....".
Is it just me or is there something going on here?
Currently the notes on Bayesian learning imply that the MLE estimate does not have improved confidence bounds with more data. However, I believe the error of the MLE estimate around the true parameter is asymptotically normal with covariance given by the inverse Fisher information (scaled by 1/n). The confidence bounds then shrink like n^(-1/2) as we get more data.
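Concretely, the standard asymptotic result I have in mind (not taken from the notes) is that, under regularity conditions,

$$ \sqrt{n}\,\big(\hat{\theta}_{\mathrm{MLE}} - \theta^{*}\big) \;\xrightarrow{d}\; \mathcal N\!\big(0,\; I(\theta^{*})^{-1}\big), $$

where $I(\theta^{*})$ is the per-observation Fisher information, so confidence intervals for $\theta^{*}$ have width on the order of $n^{-1/2}$.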
In the section Sum-product message passing, you say
Again, observe that this message is precisely the factor...
I don't think this equivalence is clear at all. I think you should either provide the link to the explanation or explain it (again).
Thanks so much for posting these, they are awesome!
I would like to cite some parts of this in my thesis and was wondering if you might consider assigning them a DOI for this purpose. It's pretty easy to do with GitHub repos on Zenodo.
You can just add this line to _config.yml:
destination: docs
There is no need to rm _site and rename docs to _site.
Right now, it's hard to flip back and forth at the end of each chapter.
In the section Monte Carlo estimation, you say
... as well as computing integrals of the form
But then you show a sum, not an integral. Maybe you meant "expectation"?
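For what it's worth, the estimator is usually stated for an expectation, which covers both cases:

$$ \mathbb{E}_{x \sim p}\big[f(x)\big] \;\approx\; \frac{1}{T} \sum_{t=1}^{T} f\big(x^{(t)}\big), \qquad x^{(t)} \sim p, $$

where the expectation is an integral when $x$ is continuous and a sum when $x$ is discrete. Phrasing it that way would make the sentence and the displayed formula consistent.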
In the section Auto-encoding variational Bayes, you state
Recall that in variational inference, we are interested in maximizing the ...
However, in the section The variational lower bound, you define the expectation in a different way. Specifically, \tilde{p} and q are functions of x alone (E[log \tilde{p}(x) - log q(x)]), whereas in the section Auto-encoding variational Bayes, the distributions are functions of x and z (an unobserved variable) and, further, q is a conditional probability: E[log p(x, z) - log q(z | x)]. So, I think you should explain these differences.
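My own gloss on how the two are related (not from the notes, so please correct me if I am wrong): the variational lower bound section states the bound for an unnormalized distribution $\tilde p$,

$$ \log Z(\theta) \;\ge\; \mathbb{E}_{q(x)}\big[\log \tilde p(x) - \log q(x)\big], $$

and the AEVB section applies the same bound, for a fixed observation $x$, to the unnormalized distribution $\tilde p(z) = p(x, z)$ over $z$, whose normalization constant is exactly $p(x)$. This gives

$$ \log p(x) \;\ge\; \mathbb{E}_{q(z \mid x)}\big[\log p(x, z) - \log q(z \mid x)\big]. $$

A sentence making this substitution explicit would resolve the apparent inconsistency.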
At the end of the Variational Auto-encoder section (https://ermongroup.github.io/cs228-notes/extras/vae/), the following is written:
A variational auto-encoder uses the AEVB algorithm to learn a specific model p using a particular encoder q. The model p is parametrized as
p(x∣z)=N(x;μ(z),diag(σ(x))^2)
p(z)=N(z;0,I),
where μ(z),σ(z) are parametrized by a neural network (typically, two dense hidden layers of 500 units each).
For p(x∣z), you erroneously wrote σ(x) instead of σ(z) for the covariance matrix, which is probably a typo.
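Unrelated to the typo itself, here is a minimal sketch of how I read that parametrization, in case it helps anyone else. This is my own toy NumPy version, not code from the notes; the ReLU activations, the softplus used to keep sigma positive, and all dimensions are assumptions made only for illustration.

```python
import numpy as np

def dense(x, W, b):
    return x @ W + b

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    return np.log1p(np.exp(x))

def decoder(z, params):
    """Map a latent z to the parameters (mu, sigma) of p(x | z) = N(x; mu(z), diag(sigma(z))^2)."""
    h = relu(dense(z, params["W1"], params["b1"]))   # first hidden layer (500 units)
    h = relu(dense(h, params["W2"], params["b2"]))   # second hidden layer (500 units)
    mu = dense(h, params["W_mu"], params["b_mu"])
    sigma = softplus(dense(h, params["W_sigma"], params["b_sigma"]))  # keep sigma > 0
    return mu, sigma

# Tiny usage example with randomly initialized weights (dimensions are my own choice).
rng = np.random.default_rng(0)
d_z, d_h, d_x = 2, 500, 784
params = {
    "W1": 0.01 * rng.standard_normal((d_z, d_h)),      "b1": np.zeros(d_h),
    "W2": 0.01 * rng.standard_normal((d_h, d_h)),      "b2": np.zeros(d_h),
    "W_mu": 0.01 * rng.standard_normal((d_h, d_x)),    "b_mu": np.zeros(d_x),
    "W_sigma": 0.01 * rng.standard_normal((d_h, d_x)), "b_sigma": np.zeros(d_x),
}
mu, sigma = decoder(rng.standard_normal(d_z), params)
```

Written this way it is clear that the covariance should depend on z through sigma(z), which is presumably what the notes intend.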
In the samplingApp demo
https://pgmlearning.herokuapp.com/samplingApp
mentioned in the sampling methods chapter, the scripts are not loaded when visiting the demo over HTTPS, due to mixing http and https connections.
In particular,
http://d3js.org/d3.v4.min.js
and
http://dimplejs.org/dist/dimple.v2.3.0.min.js
are blocked, which prevents the demo from computing samples.
Just changing the URLs to their https versions should fix the issue.
A bird’s eye overview of the course
Our discussion of graphical models will be divided into three major parts: representation (how to specify a model), inference (how to ask the model questions), and learning (how to fit a model to real-world data). These three themes will also be closely linked: to derive efficient inference and learning algorithms, the model will need to be adequately represented; furthermore, learning models will require inference as a subroutine. Thus, it will be best to always keep the three tasks in mind, rather than focusing on them in isolation . . .
Under the conjugate priors section, this expression is mentioned: P(θ) ∈ φ ⟹ P(θ ∣ D) ∈ φ. Rather, it should actually be P(θ) ∈ φ ⟹ P(D ∣ θ) ∈ φ. The prior and the likelihood should belong to the same family of distributions.
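As a concrete instance of conjugacy (the standard Beta-Bernoulli example, written out by me rather than quoted from the notes): with a Bernoulli likelihood and a Beta prior,

$$ P(\theta) = \mathrm{Beta}(\theta;\, a, b), \qquad P(D \mid \theta) = \theta^{N_1}(1-\theta)^{N_0}, $$

where $N_1$ and $N_0$ count the observed heads and tails, the posterior is

$$ P(\theta \mid D) = \mathrm{Beta}(\theta;\, a + N_1,\, b + N_0). $$

Spelling out one such example next to the definition would make it easier to check which implication is meant.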
See this line: https://github.com/ermongroup/cs228-notes/blame/master/inference/sampling/index.md#L35
I'm confused by what g is on this line. Is it supposed to be f? Or was this supposed to be something else entirely?
Same goes for another line: https://github.com/ermongroup/cs228-notes/blame/master/inference/sampling/index.md#L44
In Bayesian Networks (the first meaty chapter),
Recall that by the chain rule, we can write any probability p as...
Search Probability review for definition of the chain rule - nothing.
Search any earlier chapter for mention of the chain rule - nothing.
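For anyone else searching: the chain rule being invoked is the standard identity

$$ p(x_1, x_2, \dots, x_n) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_n \mid x_1, \dots, x_{n-1}), $$

which holds for any distribution. It would be worth stating it in the Probability review chapter before it is used here.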
In the section Max-product message passing,
The key observation is that the sum and max operators both distribute over products.
I think this holds true only because the factors are always positive, and it might not hold in general. This should be mentioned in the notes.
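To spell out the point with my own toy example (not from the notes): for a constant $c \ge 0$,

$$ \max_x \big( c \cdot f(x) \big) = c \cdot \max_x f(x), $$

but this fails when $c$ can be negative, e.g. $c = -1$ and $f(x) \in \{1, 2\}$ gives $\max_x\big(c \cdot f(x)\big) = -1$ while $c \cdot \max_x f(x) = -2$. (The analogous identity for sums holds for any $c$.) Since the factors in the notes are never negative the result is safe, but a one-line remark would help.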
The notes say this can be completed in constant time, but there are |N(i)| multiplications, which would actually make it linear in the degree of the vertex.
I have a comment regarding the following equation here
and the rest of the section where
In the language of physics, the potential would refer to the log-likelihood, since potentials add when different probability factors multiply (just like the energies of different non-interacting subsystems add up).
It's possible that the jargon got scrambled when moving from one community to another. If the document's usage is consistent with conventions in the field, then it might be worth adding a clarifying footnote where
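For concreteness, the convention I had in mind is the usual Boltzmann one: writing each factor as $\phi_c(x_c) = \exp(-E_c(x_c))$ gives

$$ p(x) = \frac{1}{Z} \prod_c \phi_c(x_c) = \frac{1}{Z} \exp\!\Big(-\sum_c E_c(x_c)\Big), $$

so the energies (negative log-potentials) add while the factors multiply. A footnote mapping the notes' terminology onto this convention would avoid the confusion.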
The section Introducing evidence is quite unclear. For example, you say
P(X, Y, E) is a probability distribution
However, in the equation above that statement there is no term P(X, Y, E).
You say that sometimes we are interested in computing the posterior given the evidence. So, what does X have to do with this?
The following statement
We can compute this probability by performing variable elimination once on ...
is also not clear. Can you explain it further?
The last paragraph is also quite unclear. What do you mean by scope?
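For what it's worth, my best reading of the setup (which is what I think the section should say explicitly): given a joint $P(X, Y, E)$ over query variables $Y$, evidence variables $E$, and the remaining variables $X$, the quantity of interest is

$$ P(Y \mid E = e) \;=\; \frac{P(Y, E = e)}{P(E = e)} \;=\; \frac{\sum_{x} P(x, Y, E = e)}{\sum_{x, y} P(x, y, E = e)}, $$

so $X$ enters only through the marginalization, and the two sums in the numerator and denominator are what variable elimination is run on. If that is the intended meaning, stating it this way would clear up most of the questions above.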
Is there a possibility of showing some examples and code? You may include some links. Someone like me, or others, could then try to write code based on the examples. Is reinforcement learning a good example?
I've been reading your notes on PGMs, and also some other resources, particularly [1]. In [1], in the introduction, they mention the difference between generative and discriminative models, i.e. modelling the joint distribution p(y, x) as opposed to p(y|x). You mention the same difference in the section The difficulties of probabilistic modeling.
The confusing part is in Real-World Applications/Probabilistic Models of Images where you write
one of the reasons why generative models are powerful lie in the fact that they have many fewer parameters than the amount of data that they are trained with,
which seems contradictory to the section The difficulties of probabilistic modeling. The confusing part is precisely the claim that generative models have many fewer parameters than the amount of data they are trained with; that seems inconsistent with the argument made in that earlier section.
Am I interpreting something wrong, or is the terminology ambiguous?
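(Some context on why the two statements read as contradictory to me, with my own numbers: a fully general joint over $n$ binary variables needs $2^{n} - 1$ parameters, e.g. $2^{784} - 1$ for binary 28×28 images, which is the difficulty the earlier section points at; a parametric generative model such as a VAE instead has a fixed number of weights, typically far fewer than the number of training points. My guess is the two sections are talking about these two different notions of "parameters", and making that explicit would resolve the confusion.)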
At the end of the section where you formally describe Bayesian networks, you state
It is not hard to see that a probability represented by a Bayesian network will be valid: clearly, it will be non-negative and one can show using an induction argument (and using the fact that the CPDs are valid probabilities) that the sum over all variable assignments will be one. Conversely, we can also show by counter-example that when G contains cycles, its associated probability may not sum to one.
I think you should have a side note that links to a paper or a resource that proves these statements.
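In case it is useful, here is the sketch of the induction as I understand it (my own writing, so a proper reference would still be better): pick a variable $x_n$ with no children in $G$ and sum it out; for every assignment to its parents, its CPD sums to one, so

$$ \sum_{x_1, \dots, x_n} \prod_{i=1}^{n} p(x_i \mid \mathrm{pa}(x_i)) = \sum_{x_1, \dots, x_{n-1}} \prod_{i=1}^{n-1} p(x_i \mid \mathrm{pa}(x_i)), $$

and the right-hand side is again a Bayesian network, now over $n-1$ variables, so induction finishes the argument. For the cyclic case, a quick counterexample is two binary variables with $p(a \mid b)$ and $p(b \mid a)$ both deterministic copies, whose product sums to 2 over all assignments.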