ermongroup / cs228-notes
Course notes for CS228: Probabilistic Graphical Models.
License: MIT License
I'd like to update the HTML after updating the markdown.
Hi! In the Junction Tree Algorithm Lecture Notes, I noticed some inconsistencies w.r.t. factor graphs. In that order, I quote:
Reading the notes and understanding what factor trees/graphs are is doable, but the phrasing/order is confusing. I'll try submitting a pull request later in the quarter.
Thank you!
In the section An illustrative example, you state
Crucially, we assume that the cliques c some path structure
Maybe you forgot "form"? So it should be
Crucially, we assume that the cliques c form some path structure
In section 2.5, you do not explain what I is supposed to be. It might be worth explaining it, as not everyone will be familiar with this indicator function.
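(For reference, and assuming I denotes the usual indicator function, the definition I would add is

$$ \mathbb{I}[A] = \begin{cases} 1 & \text{if the condition } A \text{ holds,} \\ 0 & \text{otherwise,} \end{cases} $$

so that, for instance, $\mathbb{E}[\mathbb{I}[X \in A]] = P(X \in A)$. One sentence like this in section 2.5 would remove the ambiguity.)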
In section 2.6, you describe the Bernoulli random variable as follows
one if a coin with heads probability p comes up heads, zero otherwise.
Specifically, the expression "heads probability" is a little confusing. It would be clearer to be more explicit: "with probability p of getting heads".
It could be formulated as follows (or something like that):
the probability mass function takes the value p if X = 1 (heads) and the value 1 - p if X = 0 (tails). In other words, the probability of taking the value "heads" (1) is p and the probability of taking the value "tails" (0) is 1 - p.
At the very least, I would rephrase the explanations to make them less ambiguous and confusing.
I think the explanations of the other random variables are also quite poor. For example, the geometric random variable is closely related to the binomial random variable, but you do not explain it. See: https://stats.stackexchange.com/q/263141/82135. In general, I think you should spend 2-3 sentences better motivating the use of these particular random variables. This would make your notes even more useful!
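To make the suggestion concrete, the distributions in question are (writing them out myself, as I understand them):

$$ p(x) = p^{x}(1-p)^{1-x}, \quad x \in \{0, 1\} \qquad \text{(Bernoulli)} $$

$$ p(k) = (1-p)^{k-1}\, p, \quad k = 1, 2, \dots \qquad \text{(geometric: number of Bernoulli trials until the first success)} $$

A short parenthetical like the one after the geometric pmf is the kind of motivating sentence I had in mind.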
Should that read "if
cs228-notes/inference/sampling/index.md
Line 110 in 5d0ddf7
The following equation is quite confusing
Specifically, you use x_{\mathcal N(s)} as the input to f_s, but the summation is over x_{\mathcal N(s)\setminus i}, whose meaning is not very clear. I think you should clarify this point a little more.
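In case it helps other readers, my reading of the notation (which is what I think the notes should state explicitly) is that $x_{\mathcal N(s)}$ is a joint assignment to all variables adjacent to the factor $f_s$, and $x_{\mathcal N(s) \setminus i}$ is that assignment with the $i$-th variable removed, so that

$$ \sum_{x_{\mathcal N(s)\setminus i}} f_s\big(x_{\mathcal N(s)}\big) \;=\; \sum_{\{x_j \,:\, j \in \mathcal N(s),\, j \neq i\}} f_s\big(x_i,\, x_{\mathcal N(s)\setminus i}\big), $$

i.e. the sum ranges over all configurations of the neighbours of $s$ other than $x_i$, with $x_i$ held fixed inside the argument of $f_s$.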
If you look at the front end of the VE chapter under "An illustrative example" (as far as I can tell), you see in the first line "...simplicity that we a \n given a chain..."; but the text in the source clearly has "....we are given a chain....".
Is it just me or is there something going on here?
Currently the notes on Bayesian learning imply that the MLE estimate does not have improved confidence bounds with more data. However, I believe the error of the MLE estimate around the true parameter is asymptotically normal with covariance given by the inverse Fisher information (scaled by 1/n). The confidence bounds then shrink like n^(-1/2) as we get more data.
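Concretely, the standard asymptotic result I have in mind (not taken from the notes) is that, under regularity conditions,

$$ \sqrt{n}\,\big(\hat{\theta}_{\mathrm{MLE}} - \theta^{*}\big) \;\xrightarrow{d}\; \mathcal N\!\big(0,\; I(\theta^{*})^{-1}\big), $$

where $I(\theta^{*})$ is the per-observation Fisher information, so confidence intervals for $\theta^{*}$ have width on the order of $n^{-1/2}$.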
In the section Sum-product message passing, you say
Again, observe that this message is precisely the factor...
I don't think this equivalence is clear at all. I think you should either provide the link to the explanation or explain it (again).
Thanks so much for posting these, they are awesome!
I would like to cite some parts of this in my thesis and was wondering if you might consider assigning them a DOI for this purpose. It's pretty easy to do with GitHub repos on Zenodo.
You can just add this line to _config.yml:
destination: docs
There is no need to rm _site and rename docs to _site.
Right now, it's hard to flip back and forth at the end of each chapter.
In the section Monte Carlo estimation, you say
... as well as computing integrals of the form
But then you show a sum, not an integral. Maybe you meant "expectation"?
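For what it's worth, the estimator is usually stated for an expectation, which covers both cases:

$$ \mathbb{E}_{x \sim p}\big[f(x)\big] \;\approx\; \frac{1}{T} \sum_{t=1}^{T} f\big(x^{(t)}\big), \qquad x^{(t)} \sim p, $$

where the expectation is an integral when $x$ is continuous and a sum when $x$ is discrete. Phrasing it that way would make the sentence and the displayed formula consistent.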
In the section Auto-encoding variational Bayes, you state
Recall that in variational inference, we are interested in maximizing the ...
However, in the section The variational lower bound, you define the expectation in a different way. Specifically, \tilde{p} and q are functions of x alone (E[log \tilde{p}(x) - log q(x)]), whereas in the section Auto-encoding variational Bayes, the distributions are functions of x and z (an unobserved variable) and, further, q is a conditional probability: E[log p(x, z) - log q(z | x)]. So, I think you should explain these differences.
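My own gloss on how the two are related (not from the notes, so please correct me if I am wrong): the variational lower bound section states the bound for an unnormalized distribution $\tilde p$,

$$ \log Z(\theta) \;\ge\; \mathbb{E}_{q(x)}\big[\log \tilde p(x) - \log q(x)\big], $$

and the AEVB section applies the same bound, for a fixed observation $x$, to the unnormalized distribution $\tilde p(z) = p(x, z)$ over $z$, whose normalization constant is exactly $p(x)$. This gives

$$ \log p(x) \;\ge\; \mathbb{E}_{q(z \mid x)}\big[\log p(x, z) - \log q(z \mid x)\big]. $$

A sentence making this substitution explicit would resolve the apparent inconsistency.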
At the end of the Variational Auto-encoder section (https://ermongroup.github.io/cs228-notes/extras/vae/), the following is written:
A variational auto-encoder uses the AEVB algorithm to learn a specific model p using a particular encoder q. The model p is parametrized as
p(x∣z)=N(x;μ(z),diag(σ(x))^2)
p(z)=N(z;0,I),
where μ(z),σ(z) are parametrized by a neural network (typically, two dense hidden layers of 500 units each).
For p(x∣z), you erroneously wrote σ(x) instead of σ(z) for the covariance matrix, which is probably a typo.
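Unrelated to the typo itself, here is a minimal sketch of how I read that parametrization, in case it helps anyone else. This is my own toy NumPy version, not code from the notes; the ReLU activations, the softplus used to keep sigma positive, and all dimensions are assumptions made only for illustration.

```python
import numpy as np

def dense(x, W, b):
    return x @ W + b

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    return np.log1p(np.exp(x))

def decoder(z, params):
    """Map a latent z to the parameters (mu, sigma) of p(x | z) = N(x; mu(z), diag(sigma(z))^2)."""
    h = relu(dense(z, params["W1"], params["b1"]))   # first hidden layer (500 units)
    h = relu(dense(h, params["W2"], params["b2"]))   # second hidden layer (500 units)
    mu = dense(h, params["W_mu"], params["b_mu"])
    sigma = softplus(dense(h, params["W_sigma"], params["b_sigma"]))  # keep sigma > 0
    return mu, sigma

# Tiny usage example with randomly initialized weights (dimensions are my own choice).
rng = np.random.default_rng(0)
d_z, d_h, d_x = 2, 500, 784
params = {
    "W1": 0.01 * rng.standard_normal((d_z, d_h)),      "b1": np.zeros(d_h),
    "W2": 0.01 * rng.standard_normal((d_h, d_h)),      "b2": np.zeros(d_h),
    "W_mu": 0.01 * rng.standard_normal((d_h, d_x)),    "b_mu": np.zeros(d_x),
    "W_sigma": 0.01 * rng.standard_normal((d_h, d_x)), "b_sigma": np.zeros(d_x),
}
mu, sigma = decoder(rng.standard_normal(d_z), params)
```

Written this way it is clear that the covariance should depend on z through sigma(z), which is presumably what the notes intend.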
In the samplingApp demo
https://pgmlearning.herokuapp.com/samplingApp
mentioned in the sampling methods chapter, the scripts are not loaded when visiting the demo over HTTPS, due to mixing http and https connections.
In particular,
http://d3js.org/d3.v4.min.js
and
http://dimplejs.org/dist/dimple.v2.3.0.min.js
are blocked, which prevents the demo from computing samples.
Just changing the URLs to their https versions should fix the issue.
A bird’s eye overview of the course
Our discussion of graphical models will be divided into three major parts: representation (how to specify a model), inference (how to ask the model questions), and learning (how to fit a model to real-world data). These three themes will also be closely linked: to derive efficient inference and learning algorithms, the model will need to be adequately represented; furthermore, learning models will require inference as a subroutine. Thus, it will be best to always keep the three tasks in mind, rather than focusing on them in isolation . . .
Under the conjugate priors section, this expression is mentioned: P(θ) ∈ φ ⟹ P(θ ∣ D) ∈ φ. Rather, it should actually be P(θ) ∈ φ ⟹ P(D ∣ θ) ∈ φ. The prior and the likelihood should belong to the same family of distributions.
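As a concrete instance of conjugacy (the standard Beta-Bernoulli example, written out by me rather than quoted from the notes): with a Bernoulli likelihood and a Beta prior,

$$ P(\theta) = \mathrm{Beta}(\theta;\, a, b), \qquad P(D \mid \theta) = \theta^{N_1}(1-\theta)^{N_0}, $$

where $N_1$ and $N_0$ count the observed heads and tails, the posterior is

$$ P(\theta \mid D) = \mathrm{Beta}(\theta;\, a + N_1,\, b + N_0). $$

Spelling out one such example next to the definition would make it easier to check which implication is meant.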
See this line: https://github.com/ermongroup/cs228-notes/blame/master/inference/sampling/index.md#L35
I'm confused by what g is on this line. Is it supposed to be f? Or was this supposed to be something else entirely?
Same goes for another line: https://github.com/ermongroup/cs228-notes/blame/master/inference/sampling/index.md#L44
In Bayesian Networks (the first meaty chapter),
Recall that by the chain rule, we can write any probability p as...
Search Probability review for definition of the chain rule - nothing.
Search any earlier chapter for mention of the chain rule - nothing.
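For anyone else searching: the chain rule being invoked is the standard identity

$$ p(x_1, x_2, \dots, x_n) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_n \mid x_1, \dots, x_{n-1}), $$

which holds for any distribution. It would be worth stating it in the Probability review chapter before it is used here.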
In the section Max-product message passing,
The key observation is that the sum and max operators both distribute over products.
I think this holds true only because the factors are always positive, and it might not hold in general. This should be mentioned in the notes.
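To spell out the point with my own toy example (not from the notes): for a constant $c \ge 0$,

$$ \max_x \big( c \cdot f(x) \big) = c \cdot \max_x f(x), $$

but this fails when $c$ can be negative, e.g. $c = -1$ and $f(x) \in \{1, 2\}$ gives $\max_x\big(c \cdot f(x)\big) = -1$ while $c \cdot \max_x f(x) = -2$. (The analogous identity for sums holds for any $c$.) Since the factors in the notes are never negative the result is safe, but a one-line remark would help.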
The notes say this can be completed in constant time, but there are |N(i)| multiplications, which would actually make it linear in the degree of the vertex.
I have a comment regarding the following equation here
and the rest of the section where
In the language of physics, the potential would refer to the log-likelihood, since potentials add when different probability factors multiply (just like the energies of different non-interacting subsystems add up).
It's possible that the jargon got scrambled when moving from one community to another. If the document's usage is consistent with conventions in the field, then it might be worth adding a clarifying footnote where
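For concreteness, the convention I had in mind is the usual Boltzmann one: writing each factor as $\phi_c(x_c) = \exp(-E_c(x_c))$ gives

$$ p(x) = \frac{1}{Z} \prod_c \phi_c(x_c) = \frac{1}{Z} \exp\!\Big(-\sum_c E_c(x_c)\Big), $$

so the energies (negative log-potentials) add while the factors multiply. A footnote mapping the notes' terminology onto this convention would avoid the confusion.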
The section Introducing evidence is quite unclear. For example, you say
P(X, Y, E) is a probability distribution
However, in the equation above that statement there is no term P(X, Y, E).
You say that sometimes we are interested in computing the posterior given the evidence. So, what does X have to do with this?
The following statement
We can compute this probability by performing variable elimination once on ...
is also not clear. Can you explain it further?
The last paragraph is also quite unclear. What do you mean by scope?
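For what it's worth, my best reading of the setup (which is what I think the section should say explicitly): given a joint $P(X, Y, E)$ over query variables $Y$, evidence variables $E$, and the remaining variables $X$, the quantity of interest is

$$ P(Y \mid E = e) \;=\; \frac{P(Y, E = e)}{P(E = e)} \;=\; \frac{\sum_{x} P(x, Y, E = e)}{\sum_{x, y} P(x, y, E = e)}, $$

so $X$ enters only through the marginalization, and the two sums in the numerator and denominator are what variable elimination is run on. If that is the intended meaning, stating it this way would clear up most of the questions above.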
Is there a possibility of showing some examples and code? You may include some links. Someone like me, or others, could then try to write code based on the examples. Is reinforcement learning a good example?
I've been reading your notes on PGMs, and also some other resources, particularly [1]. In [1], in the introduction, they mention the difference between generative and discriminative models, i.e. modelling the joint distribution p(y, x) as opposed to p(y|x). You mention the same difference in the section The difficulties of probabilistic modeling.
The confusing part is in Real-World Applications/Probabilistic Models of Images where you write
one of the reasons why generative models are powerful lie in the fact that they have many fewer parameters than the amount of data that they are trained with,
which seems contradictory to the section The difficulties of probabilistic modeling. The confusing part is precisely the claim that generative models have many fewer parameters than the amount of data they are trained with; that seems inconsistent with the argument made in that earlier section.
Am I interpreting something wrong, or is the terminology ambiguous?
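(Some context on why the two statements read as contradictory to me, with my own numbers: a fully general joint over $n$ binary variables needs $2^{n} - 1$ parameters, e.g. $2^{784} - 1$ for binary 28×28 images, which is the difficulty the earlier section points at; a parametric generative model such as a VAE instead has a fixed number of weights, typically far fewer than the number of training points. My guess is the two sections are talking about these two different notions of "parameters", and making that explicit would resolve the confusion.)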
At the end of the section where you formally describe Bayesian networks, you state
It is not hard to see that a probability represented by a Bayesian network will be valid: clearly, it will be non-negative and one can show using an induction argument (and using the fact that the CPDs are valid probabilities) that the sum over all variable assignments will be one. Conversely, we can also show by counter-example that when G contains cycles, its associated probability may not sum to one.
I think you should have a side note that links to a paper or a resource that proves these statements.
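In case it is useful, here is the sketch of the induction as I understand it (my own writing, so a proper reference would still be better): pick a variable $x_n$ with no children in $G$ and sum it out; for every assignment to its parents, its CPD sums to one, so

$$ \sum_{x_1, \dots, x_n} \prod_{i=1}^{n} p(x_i \mid \mathrm{pa}(x_i)) = \sum_{x_1, \dots, x_{n-1}} \prod_{i=1}^{n-1} p(x_i \mid \mathrm{pa}(x_i)), $$

and the right-hand side is again a Bayesian network, now over $n-1$ variables, so induction finishes the argument. For the cyclic case, a quick counterexample is two binary variables with $p(a \mid b)$ and $p(b \mid a)$ both deterministic copies, whose product sums to 2 over all assignments.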