
text-data-book-comments's Introduction

Text Data Analysis and Management: A Practical Introduction to Text Mining and Information Retrieval

ChengXiang Zhai and Sean Massung, UIUC Computer Science Department

About

We'd like to hear your comments, fix errors you spot, listen to your suggestions, and address issues you find with our book. We will keep track of your comments via GitHub issues on this page.

On the right side of this page, click the Issues link. It will take you to a screen where you can open a new issue or comment on existing issues. If creating a new issue, you can select one or more of the following labels for it:

  • Typo: a typing error in the book text such as "infromation retreival"
  • Error: a factual error such as "NDCG only operates on binary relevance values"
  • Comment: a general comment about part of the book, such as "I really like the chapter on X"
  • Suggestion: such as, "could you expand a little more on the Rocchio feedback method?"

We will add more issue labels if necessary. Thanks for your help in creating this book!

text-data-book-comments's People

Contributors

smassung

text-data-book-comments's Issues

Typo in section 5.5

The classic probabilistic model has led to the BM25 retrieval function (which we discussed in the vector space model), because its form is actually similar to a vector space model.

could be:

The classic probabilistic model has led to the BM25 retrieval function (which we discussed in the vector space model, because its form is actually similar to a vector space model).

Joint probability of (circle, green)

In the probability and statistics chapter, I think you mean to say that the joint probability of (circle, green) is 1/6, not 1.

Generally speaking, I think using a non-uniform color distribution might make the example more illustrative (you can then have an example where it is 1/6, and an example where it isn't).
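The uniform case the reporter describes can be checked by enumeration. The six-object layout below is an assumption made for illustration (the book's actual figure is not reproduced here); it simply mirrors a setup where exactly one of six equally likely objects is a green circle:

```python
from fractions import Fraction

# Hypothetical layout: six equally likely objects, exactly one of which
# is a green circle. This mirrors the reported 1/6, not the book's figure.
objects = [
    ("circle", "green"),
    ("circle", "blue"),
    ("square", "green"),
    ("square", "blue"),
    ("triangle", "green"),
    ("triangle", "blue"),
]

def joint(shape, color):
    # P(shape, color) under a uniform distribution over the objects
    matches = sum(1 for obj in objects if obj == (shape, color))
    return Fraction(matches, len(objects))

print(joint("circle", "green"))  # 1/6
```

With a non-uniform color distribution (say, four green objects and two blue ones), the same function would return joint probabilities that no longer factor uniformly, which is the more illustrative example suggested above.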

Use of exclamation mark with numbers

In several places an exclamation mark is used next to a number to draw attention to it. However, this makes it indistinguishable from factorial notation and might be confusing. This occurs in several places, for example on page 98:

f(q, d_4) = 4!

Chapter 12 equation p90 and general remarks

In section 12.3.2, at page 90, third line of the equation:
= argmax(y, x1, ..., xn)
Shouldn't the first comma be replaced by a dot? I.e., = argmax(y.x1, ..., xn).

I also have a suggestion regarding section 12.3.3, which introduces linear classifiers.
I would recommend adding "An Introduction to Statistical Learning" as a reference; it has a very nice explanation of support vector classifiers (see here, page 344). It describes the maximum margin classifier you plot in figure 12.3 and then introduces soft margins and kernel-based SVMs.

From a general viewpoint I find the chapter nicely written and very clear. However, I already knew most of the concepts, and I realize that for people not familiar with statistical learning methods there are not many references to dig deeper into. Some references to basic textbooks would help (for kNN, naive Bayes, cross-validation procedures, and SVMs). A last point: the section on Evaluation of Text Categorization (12.4) is quite short. Do you plan to introduce some elementary measures like false/true positives/negatives, the ROC curve, etc.? (Or perhaps a reference describing in more detail the measures that can be applied to contingency tables in the context of text retrieval, or more globally.)

Regarding the book

Hi, is there any free version of the book available on the Internet?

Thanks

Error: Wrong inequality symbol.

In Section 5.3 Document Selection vs Document Ranking on page 52 you wrote:
R'(q) = {d|f(q, d) ≤ θ}
but it should be:
R'(q) = {d|f(q, d) ≥ θ}
or:
R'(q) = {d|f(q, d) > θ}
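The corrected set definition is straightforward to sketch in code. The scores and threshold below are made-up illustration values, not from the book; only the shape of the rule R'(q) = {d | f(q, d) ≥ θ} follows the issue:

```python
# Sketch of document selection with a score threshold theta (Section 5.3).
# The scoring function f is assumed to be given; here scores are precomputed.
def select(scores, theta):
    """R'(q) = {d | f(q, d) >= theta}: keep documents scoring at least theta."""
    return {d for d, f_qd in scores.items() if f_qd >= theta}

scores = {"d1": 0.9, "d2": 0.4, "d3": 0.7}
print(select(scores, theta=0.5))  # d1 and d3 (in some order)
```

With the original "≤" the set would instead keep the *lowest*-scoring documents, which is clearly not the intended absolute-relevance cutoff.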

Suggestion or Error: Figure 5.11 unexpected IDF values

Figure 5.11 shows unexpected IDF values, which don't correspond to the example discussed in section 5.4. Based on that example and assuming base 2 logarithms, the IDF values should be:
[log2(6/5), log2(6/2), log2(6/3), log2(6/4), log2(6/2)]
which is about:
[0.3, 1.6, 1, 0.6, 1.6]
These values change the ranking and remove the problem of d5 being top-ranked.

There is a discussion about this on the Coursera forum under IDF weighting, but no answer from the staff. There are suggestions that the IDF values were calculated based on a larger set of documents, or perhaps on other words in {d1, ..., d5} which are not shown in the figures for brevity. If that's the case, I think it should be stated clearly; otherwise most readers will assume C = {d1, ..., d5} and that all of the query words present in documents d1, ..., d5 are shown in the figures.
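The values the reporter proposes can be reproduced directly, assuming base-2 logarithms and the document frequencies stated in the issue (the 6 in the numerator follows the reporter's own log2(6/df) formula):

```python
import math

# Document frequencies for the five query words, as given in the issue.
doc_freqs = [5, 2, 3, 4, 2]

# IDF per the reporter's formula: log2(6 / df).
idf = [math.log2(6 / df) for df in doc_freqs]

print([round(v, 1) for v in idf])  # [0.3, 1.6, 1.0, 0.6, 1.6]
```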

First issue

This is the first issue. I've labeled it as a comment by selecting the "Labels" button on the right side of this text box.

Issues are cool because you can write in markdown.

Minor typo in Page 103

In the summary paragraph of Section 6.2, Probabilistic Retrieval Models, there is a typo in the sentence "Otherwise, it would give zero probability for unseen words in the document, which not good for scoring a query with an unseen word."

8.6.2 DBLRU cache

This has been removed in Lucene 5, and apparently was "dead code".

Also, if data is read from disk, it would be inserted into the primary cache, to approximate LRU.

Minor typo in Page 103

In the summary paragraph of Section 6.2, Probabilistic Retrieval Models, the sentence "Otherwise, it would give zero probability for unseen words in the document, which not good for scoring a query with an unseen word." is missing an "is".

Indentation after equations

It looks like, in some paragraphs, you've got indentation happening in text that is interspersed with equations where you probably don't mean to. If you only have a single line break in your LaTeX source (instead of a double), then the text won't be indented like a new paragraph. Look at the Bayes' rule section of the probability chapter to see what I'm talking about.
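The behavior described above can be shown with a minimal LaTeX fragment (the Bayes' rule wording here is illustrative, not the book's exact source):

```latex
Bayes' rule states that
\begin{equation}
  P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
\end{equation}
and this sentence is not indented, because only a single line break
separates it from the equation.

A blank line before this sentence, however, starts a new paragraph,
so it is indented.
```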

Use of "Alg" instead of "Algorithm"

On page 117, the query given is "data mining algorithms", and the two examples are listed as p("data mining alg" ..., even though there is clearly enough space on the page to write out "algorithms".

Minor typo on page 105

In the second to last paragraph on the page, "Just like Jelinek-Mercer smoothing, we’ll use the collection language model, but in this case we’re going to combine it with the MLE esimate in a somewhat different way" should be "estimate".
