seededlda

Semisupervised LDA for theory-driven text analysis

NOTICE: This R package has been renamed from quanteda.seededlda to seededlda for CRAN submission.

seededlda is an R package that implements seeded-LDA for semisupervised topic modeling using quanteda. The seeded-LDA model was proposed by Lu et al. (2010). Until version 0.3, this package was a simple wrapper around the topicmodels package, but the LDA estimator has since been reimplemented in C++ using the GibbsLDA++ library so that the package can be submitted to CRAN in August 202. The author believes that this package now follows the original seeded-LDA proposal more closely.

Please see Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches for an overview of semisupervised topic classification techniques and their advantages in social science research.

keyATM is the latest addition to the family of semisupervised topic models. Users of seeded-LDA are also encouraged to try that package.

Install

install.packages("devtools")
devtools::install_github("koheiw/seededlda") 

Example

The corpus and seed words in this example are from Conspiracist propaganda: How Russia promotes anti-establishment sentiment online?.

require(quanteda)
require(seededlda) # changed from quanteda.seededlda to seededlda

Users of seeded-LDA have to construct a small dictionary of keywords (seed words) to define the desired topics.

dict <- dictionary(file = "tests/data/topics.yml")
print(dict)
## Dictionary object with 5 key entries.
## - [economy]:
##   - market*, money, bank*, stock*, bond*, industry, company, shop*
## - [politics]:
##   - parliament*, congress*, white house, party leader*, party member*, voter*, lawmaker*, politician*
## - [society]:
##   - police, prison*, school*, hospital*
## - [diplomacy]:
##   - ambassador*, diplomat*, embassy, treaty
## - [military]:
##   - military, soldier*, terrorist*, air force, marine, navy, army
corp <- readRDS("tests/data/data_corpus_sputnik.RDS")
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE) %>%
        tokens_select(min_nchar = 2) %>% 
        tokens_compound(dict) # for multi-word expressions
dfmt <- dfm(toks) %>% 
    dfm_remove(stopwords('en')) %>% 
    dfm_trim(min_termfreq = 0.90, termfreq_type = "quantile", 
             max_docfreq = 0.2, docfreq_type = "prop")
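
The dictionary printed earlier is read from a YAML file that is not reproduced here. As a minimal sketch, assuming quanteda is installed, the same five topics could also be defined inline by passing a named list to dictionary(). The seed words are copied from the printed output, and the object name dict_inline is only illustrative.

```r
library(quanteda)

# Seed words copied from the printed dictionary in this README;
# `dict_inline` is an illustrative name, not part of the package.
dict_inline <- dictionary(list(
    economy   = c("market*", "money", "bank*", "stock*", "bond*",
                  "industry", "company", "shop*"),
    politics  = c("parliament*", "congress*", "white house", "party leader*",
                  "party member*", "voter*", "lawmaker*", "politician*"),
    society   = c("police", "prison*", "school*", "hospital*"),
    diplomacy = c("ambassador*", "diplomat*", "embassy", "treaty"),
    military  = c("military", "soldier*", "terrorist*", "air force",
                  "marine", "navy", "army")
))

# Each name becomes one seeded topic.
print(names(dict_inline))
```

An inline definition like this keeps the seed words next to the analysis code, while a YAML file is easier to share and edit separately. Either form should work in the rest of the pipeline.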

Many of the top terms in the seeded-LDA output are seed words, but related topic words are also identified. The result includes “other” as a junk topic because residual = TRUE.

set.seed(1234)
slda <- textmodel_seededlda(dfmt, dict, residual = TRUE)
print(terms(slda, 20))
##       economy     politics        society           diplomacy      
##  [1,] "company"   "parliament"    "police"          "diplomatic"   
##  [2,] "money"     "congress"      "school"          "embassy"      
##  [3,] "market"    "white_house"   "hospital"        "ambassador"   
##  [4,] "bank"      "politicians"   "prison"          "treaty"       
##  [5,] "industry"  "parliamentary" "schools"         "diplomat"     
##  [6,] "banks"     "lawmakers"     "pic.twitter.com" "diplomats"    
##  [7,] "markets"   "voters"        "media"           "north"        
##  [8,] "banking"   "lawmaker"      "information"     "nuclear"      
##  [9,] "stock"     "politician"    "reported"        "defense"      
## [10,] "stockholm" "european"      "local"           "korea"        
## [11,] "china"     "minister"      "video"           "south"        
## [12,] "chinese"   "eu"            "women"           "trump"        
## [13,] "percent"   "party"         "department"      "korean"       
## [14,] "year"      "uk"            "found"           "missile"      
## [15,] "india"     "sanctions"     "investigation"   "moscow"       
## [16,] "oil"       "political"     "social"          "meeting"      
## [17,] "countries" "prime"         "public"          "security"     
## [18,] "economic"  "union"         "court"           "nato"         
## [19,] "billion"   "germany"       "several"         "foreign"      
## [20,] "trade"     "election"      "took"            "international"
##       military     other     
##  [1,] "army"       "trump"   
##  [2,] "terrorist"  "just"    
##  [3,] "navy"       "like"    
##  [4,] "terrorists" "world"   
##  [5,] "soldiers"   "think"   
##  [6,] "air_force"  "now"     
##  [7,] "marine"     "even"    
##  [8,] "soldier"    "going"   
##  [9,] "syria"      "get"     
## [10,] "syrian"     "american"
## [11,] "iran"       "made"    
## [12,] "forces"     "say"     
## [13,] "israel"     "way"     
## [14,] "group"      "want"    
## [15,] "daesh"      "really"  
## [16,] "turkish"    "show"    
## [17,] "turkey"     "come"    
## [18,] "region"     "make"    
## [19,] "security"   "know"    
## [20,] "war"        "back"
topic <- table(topics(slda))
print(topic)
## 
##   economy  politics   society diplomacy  military     other 
##       140       160       243       134       121       202
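
The counts above are easier to compare as shares of the corpus. The following base R sketch re-enters the printed counts by hand (the name topic_counts is illustrative) and converts them to proportions.

```r
# Document counts per topic, copied from the printed table above.
topic_counts <- c(economy = 140, politics = 160, society = 243,
                  diplomacy = 134, military = 121, other = 202)

# Share of documents assigned to each topic.
topic_share <- prop.table(topic_counts)
print(round(topic_share, 3))
```

With the fitted model still in memory, prop.table(table(topics(slda))) should give the same shares without retyping the counts.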
