topicmodels_learning's Introduction

Topic Models Learning and R Resources

This is a collection documenting the resources I find related to topic models with an R-flavored focus. A topic model is a type of generative model used to "discover" the latent topics that compose a corpus or collection of documents. Topic modeling is typically applied to collections of text documents, but it can be used for other modes as well, such as caption generation for images.
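To make the "generative" part concrete, here is a minimal base R sketch of the LDA-style story for producing one document: draw topic proportions, then draw a topic and a word for each token. The vocabulary, the two topics, the `alpha` values, and the helper names (`rdirichlet1`, `generate_document`) are all made up for illustration; fitting a topic model amounts to running this story in reverse on real text.

```r
set.seed(1)

## Draw one sample from a Dirichlet distribution via normalized gamma draws
rdirichlet1 <- function(alpha) {
    g <- rgamma(length(alpha), shape = alpha)
    g / sum(g)
}

## Hypothetical vocabulary and topic-word distributions (rows sum to 1)
vocab <- c("tax", "budget", "jobs", "war", "iran", "troops")
beta <- rbind(
    economy = c(.4, .3, .3, 0, 0, 0),
    foreign = c(0, 0, 0, .3, .4, .3)
)

generate_document <- function(n_words, alpha, beta, vocab) {
    theta <- rdirichlet1(alpha)                      # document's topic proportions
    z <- sample(nrow(beta), n_words, replace = TRUE, prob = theta)  # topic per token
    words <- vapply(
        z,
        function(topic) sample(vocab, 1, prob = beta[topic, ]),     # word given topic
        character(1)
    )
    list(theta = theta, topics = z, words = words)
}

doc <- generate_document(8, alpha = c(0.5, 0.5), beta = beta, vocab = vocab)
doc$words
```

Small `alpha` values tend to give documents dominated by one topic; larger values give more even mixtures.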

Just the Essentials

This is my rundown of the minimal readings, websites, videos, & scripts the reader needs to become familiar with topic modeling. The list is in the order I believe will be of greatest use and contains a nice mix of introduction, theory, application, and interpretation. As you want to learn more about topic modeling, the other sections will become more useful.

  1. Boyd-Graber, J. (2013). Computational Linguistics I: Topic Modeling
  2. Underwood, T. (2012). Topic Modeling Made Just Simple Enough
  3. Weingart, S. (2012). Topic Modeling for Humanists: A Guided Tour
  4. Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84. doi:10.1145/2133806.2133826
  5. inkhorn82 (2014). A Delicious Analysis! (aka topic modelling using recipes) (CODE)
  6. Grün, B. & Hornik, K. (2011). topicmodels: An R Package for Fitting Topic Models. Journal of Statistical Software, 40(13), 1-30.
  7. Marwick, B. (2014a). The input parameters for using latent Dirichlet allocation
  8. Tang, J., Meng, Z., Nguyen, X., Mei, Q., & Zhang, M. (2014). Understanding the limiting factors of topic modeling via posterior contraction analysis. In 31st International Conference on Machine Learning, 190-198.
  9. Sievert, C. (2014). LDAvis: A method for visualizing and interpreting topic models
  10. Rhody, L. M. (2012). Some Assembly Required: Understanding and Interpreting Topics in LDA Models of Figurative Language
  11. Rinker, T.W. (2015). R Script: Example Topic Model Analysis

Key Players

Papadimitriou, Raghavan, Tamaki, & Vempala (1997) first introduced the notion of topic modeling in their "Latent Semantic Indexing: A probabilistic analysis". Thomas Hofmann (1999) developed "Probabilistic latent semantic indexing". Blei, Ng, & Jordan (2003) proposed latent Dirichlet allocation (LDA) as a means of modeling documents with multiple topics, but LDA assumes the topics are uncorrelated. Blei & Lafferty (2007) proposed the correlated topic model (CTM), extending LDA to allow for correlations between topics. Roberts, Stewart, Tingley, & Airoldi (2013) proposed the Structural Topic Model (STM), allowing the inclusion of meta-data in the modeling process.

Videos

Introductory

Theory

Visualization

Articles

Applied

Theoretical

Websites & Blogs

R Resources

Package Comparisons

| Package | Functionality | Pluses | Author | R Language Interface |
|---|---|---|---|---|
| lda* | Collapsed Gibbs for LDA | Graphing utilities | Chang | R |
| topicmodels | LDA and CTM | Follows Blei's implementation; great vignette; takes DTM | Grün & Hornik | C |
| stm | Model w/ meta-data | Great documentation; nice visualization | Roberts, Stewart, & Tingley | C |
| LDAvis | Interactive visualization | Aids in model interpretation | Sievert & Shirley | R + Shiny |
| mallet** | LDA | MALLET is well known | Mimno | Java |

*StackExchange discussion of lda vs. topicmodels
**Setting Up MALLET

R Specific References

Example Modeling

Topic Modeling R Demo

topicmodels Package

The .R script for this demonstration can be downloaded from scripts/Example_topic_model_analysis.R

Install/Load Tools & Data

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/gofastr")
pacman::p_load(tm, topicmodels, dplyr, tidyr, igraph, devtools, LDAvis, ggplot2)

## Source topicmodels2LDAvis & optimal_k functions
invisible(lapply(
    file.path(
        "https://raw.githubusercontent.com/trinker/topicmodels_learning/master/functions", 
        c("topicmodels2LDAvis.R", "optimal_k.R")
    ),
    devtools::source_url
))

## SHA-1 hash of file is 5ac52af21ce36dfe8f529b4fe77568ced9307cf0
## SHA-1 hash of file is 7f0ab64a94948c8b60ba29dddf799e3f6c423435

data(presidential_debates_2012)

Generate Stopwords

stops <- c(
        tm::stopwords("english"),
        tm::stopwords("SMART"),
        "governor", "president", "mister", "obama","romney"
    ) %>%
    gofastr::prep_stopwords() 

Create the DocumentTermMatrix

doc_term_mat <- presidential_debates_2012 %>%
    with(gofastr::q_dtm_stem(dialogue, paste(person, time, sep = "_"))) %>%           
    gofastr::remove_stopwords(stops, stem=TRUE) %>%                                                    
    gofastr::filter_tf_idf() %>%
    gofastr::filter_documents() 
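For readers new to the term, a document-term matrix simply has one row per document, one column per term, and counts in the cells. The following base R sketch builds a tiny one by hand; the two toy "documents" are invented, and it skips the stemming, stopword removal, and tf-idf filtering that the gofastr pipeline above applies.

```r
## Two made-up documents, named the way the demo names them (person_time)
docs <- c(
    OBAMA_time1  = "jobs and taxes and jobs",
    ROMNEY_time1 = "taxes and energy"
)

## Lowercase and split on anything that is not a letter
tokens <- strsplit(tolower(docs), "[^a-z]+")

## Shared vocabulary across all documents
vocab <- sort(unique(unlist(tokens)))

## Count each vocabulary term in each document: rows = documents, cols = terms
dtm <- t(vapply(
    tokens,
    function(tok) as.integer(table(factor(tok, levels = vocab))),
    integer(length(vocab))
))
colnames(dtm) <- vocab

dtm
```

Real corpora produce matrices that are mostly zeros, which is why packages such as tm store them in a sparse format rather than as a dense matrix like this one.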

Control List

control <- list(burnin = 500, iter = 1000, keep = 100, seed = 2500)

Determine Optimal Number of Topics

The plot below shows the harmonic mean of the log likelihoods against k (number of topics).

(k <- optimal_k(doc_term_mat, 40, control = control))

## 
## Grab a cup of coffee this could take a while...

## 10 of 40 iterations (Current: 08:54:32; Elapsed: .2 mins)
## 20 of 40 iterations (Current: 08:55:07; Elapsed: .8 mins; Remaining: ~2.3 mins)
## 30 of 40 iterations (Current: 08:56:03; Elapsed: 1.7 mins; Remaining: ~1.3 mins)
## 40 of 40 iterations (Current: 08:57:30; Elapsed: 3.2 mins; Remaining: ~0 mins)
## Optimal number of topics = 20

It appears the optimal number of topics is approximately k = 20.
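The sourced optimal_k function is not reproduced here. As a rough illustration of the criterion it reports, the sketch below computes the log of the harmonic mean of a set of likelihoods from their logs, using a log-sum-exp shift so that very negative log-likelihoods do not underflow. The function name `log_harmonic_mean` and the toy values are hypothetical; with topicmodels, the kept log-likelihoods of a Gibbs fit are stored in the model's `logLiks` slot when `keep` is positive, and values saved during burn-in would normally be dropped first.

```r
## Log of the harmonic mean of likelihoods, given their logs:
##   HM = n / sum(1 / L_i)   =>   log(HM) = log(n) - log(sum(exp(-log L_i)))
log_harmonic_mean <- function(log_lik) {
    neg <- -log_lik
    shift <- max(neg)   # subtract the max before exponentiating for stability
    log(length(log_lik)) - (shift + log(sum(exp(neg - shift))))
}

## Small check against the textbook formula: likelihoods 1, 2, 4
toy_log_lik <- log(c(1, 2, 4))
exp(log_harmonic_mean(toy_log_lik))   # equals 3 / (1/1 + 1/2 + 1/4)

## Values on the scale a real corpus produces stay finite
log_harmonic_mean(c(-250000, -249990, -249995))
```

Computing this quantity for a model fit at each candidate k and plotting it against k gives a curve of the kind described above; the k at the peak is the suggested number of topics.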

Run the Model

control[["seed"]] <- 100
lda_model <- topicmodels::LDA(doc_term_mat, k=as.numeric(k), method = "Gibbs", 
    control = control)

Plot the Topics Per Person & Time

topics <- topicmodels::posterior(lda_model, doc_term_mat)[["topics"]]
topic_dat <- tibble::rownames_to_column(as.data.frame(topics), "Person_Time")  # add_rownames() is deprecated
colnames(topic_dat)[-1] <- apply(terms(lda_model, 10), 2, paste, collapse = ", ")

tidyr::gather(topic_dat, Topic, Proportion, -c(Person_Time)) %>%
    tidyr::separate(Person_Time, c("Person", "Time"), sep = "_") %>%
    dplyr::mutate(Person = factor(Person, 
        levels = c("OBAMA", "ROMNEY", "LEHRER", "SCHIEFFER", "CROWLEY", "QUESTION" ))
    ) %>%
    ggplot2::ggplot(ggplot2::aes(weight=Proportion, x=Topic, fill=Topic)) +
        ggplot2::geom_bar() +
        ggplot2::coord_flip() +
        ggplot2::facet_grid(Person~Time) +
        ggplot2::guides(fill = "none") +
        ggplot2::ylab("Proportion")  # after coord_flip() the y aesthetic is drawn horizontally

Plot the Topics Matrix as a Heatmap

heatmap(topics, scale = "none")

Network of the Word Distributions Over Topics (Topic Relation)

post <- topicmodels::posterior(lda_model)

cor_mat <- cor(t(post[["terms"]]))
cor_mat[ cor_mat < .05 ] <- 0
diag(cor_mat) <- 0

graph <- graph_from_adjacency_matrix(cor_mat, weighted=TRUE, mode="lower")
graph <- delete_edges(graph, E(graph)[ weight < 0.05])

E(graph)$edge.width <- E(graph)$weight*20
V(graph)$label <- paste("Topic", V(graph))
V(graph)$size <- colSums(post[["topics"]]) * 15

par(mar=c(0, 0, 3, 0))
set.seed(110)
plot.igraph(graph, edge.width = E(graph)$edge.width, 
    edge.color = "orange", vertex.color = "orange", 
    vertex.frame.color = NA, vertex.label.color = "grey30")
title("Strength Between Topics Based On Word Probabilities", cex.main=.8)
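The adjacency matrix behind this network can be inspected on its own. The base R sketch below repeats the correlation and thresholding steps on a made-up 3-topic by 5-term probability matrix (the numbers and term names are invented), so it is easy to see which topic pairs would end up connected.

```r
## Hypothetical topic-by-term probabilities (each row sums to 1)
topic_term <- rbind(
    "1" = c(0.50, 0.30, 0.10, 0.05, 0.05),
    "2" = c(0.45, 0.35, 0.10, 0.05, 0.05),
    "3" = c(0.05, 0.05, 0.10, 0.30, 0.50)
)
colnames(topic_term) <- c("tax", "job", "plan", "iran", "war")

## Correlate topics by their word distributions (cor() works on columns)
cor_mat <- cor(t(topic_term))

## Drop weak or negative relationships and self-loops
cor_mat[cor_mat < .05] <- 0
diag(cor_mat) <- 0

round(cor_mat, 2)
```

Topics 1 and 2 put their mass on the same words and stay strongly linked, while topic 3 correlates negatively with both and is cut off, which is the pattern the edge widths in the plot are meant to convey.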

Network of the Topics Over Documents (Topic Relation)

minval <- .1
topic_mat <- topicmodels::posterior(lda_model)[["topics"]]

graph <- graph_from_incidence_matrix(topic_mat, weighted=TRUE)
graph <- delete_edges(graph, E(graph)[ weight < minval])

E(graph)$edge.width <- E(graph)$weight*17
E(graph)$color <- "blue"
V(graph)$color <- ifelse(grepl("^\\d+$", V(graph)$name), "grey75", "orange")
V(graph)$frame.color <- NA
V(graph)$label <- ifelse(grepl("^\\d+$", V(graph)$name), paste("topic", V(graph)$name), gsub("_", "\n", V(graph)$name))
V(graph)$size <- c(rep(10, nrow(topic_mat)), colSums(topic_mat) * 20)
V(graph)$label.color <- ifelse(grepl("^\\d+$", V(graph)$name), "red", "grey30")

par(mar=c(0, 0, 3, 0))
set.seed(369)
plot.igraph(graph, edge.width = E(graph)$edge.width, 
    vertex.color = adjustcolor(V(graph)$color, alpha.f = .4))
title("Topic & Document Relationships", cex.main=.8)

LDAvis of Model

The output from LDAvis is not easily embedded within an R markdown document; however, the reader may see the results here.

lda_model %>%
    topicmodels2LDAvis() %>%
    LDAvis::serVis()

Apply Model to New Data

## Create the DocumentTermMatrix for New Data
doc_term_mat2 <- partial_republican_debates_2015 %>%
    with(gofastr::q_dtm_stem(dialogue, paste(person, location, sep = "_"))) %>%           
    gofastr::remove_stopwords(stops, stem=TRUE) %>%                                                    
    gofastr::filter_tf_idf() %>%
    gofastr::filter_documents() 


## Update Control List
control2 <- control
control2[["estimate.beta"]] <- FALSE


## Run the Model for New Data
lda_model2 <- topicmodels::LDA(doc_term_mat2, k = k, model = lda_model, 
    control = list(seed = 100, estimate.beta = FALSE))


## Plot the Topics Per Person & Location for New Data
topics2 <- topicmodels::posterior(lda_model2, doc_term_mat2)[["topics"]]
topic_dat2 <- tibble::rownames_to_column(as.data.frame(topics2), "Person_Location")  # add_rownames() is deprecated
colnames(topic_dat2)[-1] <- apply(terms(lda_model2, 10), 2, paste, collapse = ", ")

tidyr::gather(topic_dat2, Topic, Proportion, -c(Person_Location)) %>%
    tidyr::separate(Person_Location, c("Person", "Location"), sep = "_") %>%
    ggplot2::ggplot(ggplot2::aes(weight=Proportion, x=Topic, fill=Topic)) +
        ggplot2::geom_bar() +
        ggplot2::coord_flip() +
        ggplot2::facet_grid(Person~Location) +
        ggplot2::guides(fill = "none") +
        ggplot2::ylab("Proportion")  # after coord_flip() the y aesthetic is drawn horizontally


## LDAvis of Model for New Data
lda_model2 %>%
    topicmodels2LDAvis() %>%
    LDAvis::serVis()

Contributing

You are welcome to:

Contributors

trinker, benmarwick
