Welcome to part 2 of STA 380, a course on predictive modeling in the MS program in Business Analytics at UT-Austin. All course materials can be found through this GitHub page. Please see the course syllabus for links and descriptions of the readings mentioned below.
You can find the up-to-date collection of scribe notes here.
The first set of exercises is available here.
The readings listed below are not yet complete, but the topics list is accurate.
Good data-curation and data-analysis practices; R; Markdown and RMarkdown; the importance of replicable analyses; version control with Git and GitHub.
Readings:
- a few introductory slides
- Jeff Leek's guide to sharing data
- Introduction to RMarkdown
- Introduction to GitHub
Contingency tables; basic plots (scatterplot, boxplot, histogram); lattice plots; basic measures of association (relative risk, odds ratio, correlation, rank correlation).
Scripts and data:
Readings:
- NIST Handbook, Chapter 1.
- R walkthroughs on basic EDA: contingency tables, histograms, and scatterplots/lattice plots.
- Bad graphics
- Good graphics: scan through some of the New York Times' best data visualizations
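
As a quick illustration of two of the measures of association listed above, here is a minimal sketch that computes relative risk and the odds ratio from a 2x2 contingency table. It is written in plain Python rather than R, it is not one of the course scripts, and the function name `association_2x2` and the counts are made up for the example.

```python
def association_2x2(a, b, c, d):
    """Relative risk and odds ratio for a 2x2 table.

    a, b: exposed group, with / without the outcome
    c, d: unexposed group, with / without the outcome
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    relative_risk = risk_exposed / risk_unexposed
    # Odds ratio: (a/b) / (c/d), written as a cross-product ratio
    odds_ratio = (a * d) / (b * c)
    return relative_risk, odds_ratio


# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed have the outcome
rr, odds = association_2x2(30, 70, 10, 90)
print(f"relative risk = {rr:.2f}, odds ratio = {odds:.2f}")
```

The two measures are close when the outcome is rare in both groups and diverge as it becomes more common, which is why it matters which one a table is reporting.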
The bootstrap and the permutation test; joint distributions; basic moment identities for linear combinations; using the bootstrap to approximate value at risk (VaR).
Scripts:
Readings:
- ISL Section 5.2 for a basic overview.
- These notes, pages 99-111. This is an introduction to the bootstrap from the (by now familiar) perspective of linear regression modeling, but it conveys the essential idea.
- This R walkthrough on using the bootstrap to estimate the variability of a sample mean.
- Another R walkthrough on the permutation test in a simple 2x2 table.
- Any basic explanation of the concept of value at risk (VaR) for a financial portfolio, e.g. here, here, or here.
Optionally, Shalizi (Chapter 6) has a much lengthier treatment of the bootstrap, should you wish to consult it.
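
To make the basic idea concrete, here is a minimal sketch of using the bootstrap to estimate the variability of a sample mean. It uses only the Python standard library, it is not the R walkthrough referenced above, and the data values and the helper name `bootstrap_se_mean` are invented for illustration.

```python
import random
import statistics


def bootstrap_se_mean(sample, n_boot=2000, seed=0):
    """Bootstrap estimate of the standard error of the sample mean."""
    rng = random.Random(seed)
    n = len(sample)
    boot_means = []
    for _ in range(n_boot):
        # Resample n observations from the data, with replacement
        resample = rng.choices(sample, k=n)
        boot_means.append(statistics.fmean(resample))
    # The spread of the resampled means approximates the sampling variability
    return statistics.stdev(boot_means)


# Made-up data for illustration
data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9, 3.1, 4.0]
se_boot = bootstrap_se_mean(data)
se_formula = statistics.stdev(data) / len(data) ** 0.5
print(f"bootstrap SE = {se_boot:.3f}, textbook SE = {se_formula:.3f}")
```

For a sample mean the bootstrap answer should land close to the textbook formula. The payoff comes with statistics that have no simple formula, such as a portfolio's value at risk.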
Basics of clustering; K-means clustering; mixture models; hierarchical clustering.
Scripts and data:
Readings:
- ISL Sections 10.1 and 10.3
- Elements Chapter 14.3 (more advanced)
- K-means examples: a few stylized examples to build your intuition for how k-means behaves.
- Hierarchical clustering examples: ditto for hierarchical clustering.
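
For intuition about how k-means behaves, here is a minimal sketch of the standard alternating scheme (Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its points. It is plain Python, not one of the course scripts, and the toy points, the fixed iteration count, and the choice to initialize from randomly sampled data points are all assumptions made for the example.

```python
import random


def kmeans(points, k, n_iter=20, seed=0):
    """Basic k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize at k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to the nearest center
        labels = [
            min(
                range(k),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(pt, centers[j])),
            )
            for pt in points
        ]
        # Update step: each center moves to the mean of its assigned points
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


# Two well-separated toy groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
print(labels)
```

On data this cleanly separated the result does not depend on the starting centers. On real data it often does, so it is common to run the algorithm from several random starts.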
Principal component analysis (PCA); factor analysis; canonical correlation analysis; multi-dimensional scaling.
Scripts and data:
- pca_2D.R
- pca_intro.R
- congress109.R, congress109.csv, and congress109members.csv
- gasoline.R and gasoline.csv
- FXmonthly.R, FXmonthly.csv, and currency_codes.txt
- cca_intro.R, mmreg.csv, and mouse_nutrition.csv
Readings:
- ISL Section 10.2 for the basics
- Shalizi Chapters 18 and 19 (more advanced). In particular, Chapter 19 has a lot more advanced material on factor analysis, beyond what we covered in class.
- Elements Chapter 14.5 (more advanced)
Co-occurrence statistics; naive Bayes; TF-IDF; topic models; vector-space models of text (if time allows).
Scripts and data:
- textutils.R
- nyt_stories.R and selections from the New York Times.
- tm_examples.R and selections from the Reuters newswire.
- naive_bayes.R
- simple_mixture.R
- congress109_topics.R
Readings:
- Stanford NLP notes on vector-space models of text, TF-IDF weighting, and so forth.
- [Using the tm package](http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf) for text mining in R.
- Dave Blei's survey of topic models.
- A pretty long blog post on naive-Bayes classification.
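
As a small illustration of TF-IDF weighting, here is a minimal sketch in plain Python. It uses one common variant (term frequency as a within-document proportion, inverse document frequency as the natural log of the number of documents over the document frequency); the weighting used in the course scripts or in the tm package may differ. The toy documents and the function name `tf_idf` are invented for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up documents for illustration
docs = [
    "the senate passed the bill",
    "the markets fell sharply",
    "the bill fell short",
]
weights = tf_idf(docs)
for term, w in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {w:.3f}")
```

A term that appears in every document, like "the" here, gets weight zero, while terms concentrated in a single document are weighted most heavily.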
Coverage of these topics will depend on the time available. Possibilities include: anomaly detection; label propagation; learning association rules; graph partitioning; partial least squares.
Scripts and data:
Readings: