
seededlda's Introduction

seededlda: the package for semi-supervised topic modeling


seededlda is an R package that implements Seeded LDA (Latent Dirichlet Allocation) for semi-supervised topic modeling based on quanteda. Initially, the package was a simple wrapper around the topicmodels package, but it was fully rewritten in C++ using the GibbsLDA++ library and submitted to CRAN as version 0.5 in 2020. The package was further developed to add the sequential classification (Sequential LDA) and parallel computing (Distributed LDA) capabilities and released as version 1.0 in 2023.

keyATM is the latest addition to the family of semi-supervised topic models. Users of Seeded LDA are also encouraged to try that package.

Installation

From CRAN:

install.packages("seededlda")

From Github:

devtools::install_github("koheiw/seededlda")

Examples

Please visit the package website for examples.

Please read the following papers on the algorithms.

  • Watanabe, K., & Baturo, A. (2023). Seeded Sequential LDA: A Semi-Supervised Algorithm for Topic-Specific Analysis of Sentences. Social Science Computer Review. https://doi.org/10.1177/08944393231178605
  • Watanabe, K. (2023). Speed Up Topic Modeling: Distributed Computing and Convergence Detection for LDA, working paper.

Other publications

Please read the following papers for how to apply seeded-LDA in social science research:

  • Curini, L., & Vignoli, V. (2021). Committed Moderates and Uncommitted Extremists: Ideological Leaning and Parties’ Narratives on Military Interventions in Italy. Foreign Policy Analysis, 17(3), 1–20. https://doi.org/10.1093/fpa/orab016

seededlda's People

Contributors

koheiw

seededlda's Issues

Add levels argument

Add a levels argument to textmodel_seededlda(). quanteda::flatten_dictionary() can be used in tfm() for this purpose.

using predict()

I tried predict() and it seems to work, but then I can't seem to get the predictions to re-attach in the right order. I'm not super advanced and I'm probably missing something obvious.

new_corpus <- corpus(new_data, text_field = "text")

dfmt <- dfm(new_corpus, remove_numbers = TRUE) %>%
  dfm_remove(stopwords('en'), min_nchar = 2) %>%
  dfm_trim(min_termfreq = 0.70, termfreq_type = "quantile",
           max_docfreq = 0.1, docfreq_type = "prop")
dfmt <- dfm_subset(dfmt, ntoken(dfmt) > 1)

predictions <- predict(slda, newdata = dfmt, max_iter = 2000, verbose = quanteda_options("verbose"))

new_data_from_dfmt <- docvars(dfmt)

new_data_from_dfmt$likely_topic <- predictions

The predictions are pretty much random, so I think the order has been lost in the shuffle... I don't know how I'm supposed to do it properly.

(I've been using the model a lot and it's great, but I re-run it every time I want to apply it to new data.)
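For what it's worth, a minimal sketch of one way to keep the alignment, assuming predict() returns one prediction per document of the dfm, in docnames() order:

```r
library(quanteda)

# Carry the document names along explicitly, so predictions can be matched
# back to the original data even if dfm_subset() dropped some rows.
new_data_from_dfmt <- docvars(dfmt)
new_data_from_dfmt$doc_id <- docnames(dfmt)   # same order as the dfm rows
new_data_from_dfmt$likely_topic <- predictions

# then merge new_data_from_dfmt back onto new_data by doc_id
```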

store keywords in model object

Hey Kohei, Great package! I think it would make sense to automatically store seed terms in the model object for later replication and validation. I just built an oolong plugin for word intrusion tests. Right now users have to integrate their dictionaries themselves.
Best regards

tokens or tokens_ngrams for seededlda, and export to LDAvis

Hi @koheiw ,

I have two issues on which I hope to get your advice and suggestions.

I - tokens or tokens_ngrams

Just to make sure I handle my case correctly, please give me any advice or suggestions.

I have a list of guided keywords that includes both unigrams and n-grams.
So, when tokenizing my corpus, I am not sure whether tokens or tokens_ngrams is the more suitable choice.
Q1: What should I do if I expect

  1. Input: both seeded unigrams and bigrams. Output: newly explored keywords that are both unigrams and bigrams.
  2. Input: both seeded unigrams and bigrams. Output: newly explored keywords that are only unigrams?

Q2: If I have n-grams, should I use tokens_compound?

Q3: do I need to remove stop words when using n-grams?

Please see my example code below.

# load the required packages
library(quanteda)
library(seededlda)

# Create a corpus from your text data
my_corpus <- corpus(c("The field of artificial intelligence is expanding with new breakthroughs.",
                      "Data analysis and machine learning are important in extracting insights.",
                      "Cybersecurity measures are crucial to protect sensitive information.",
                      "Health and fitness play a significant role in overall well-being.",
                      "The economy is affected by global market trends and trade policies.",
                      "Education is essential for personal growth and career opportunities.",
                      "Climate change and environmental sustainability are pressing issues.",
                      "Social media platforms are shaping communication and information sharing.",
                      "The entertainment industry is evolving with new digital platforms.",
                      "Urbanization and infrastructure development are transforming cities."))


# Define the seed words for each topic
seed_list <- list(topic1 = c("artificial intelligence", "machine learning", "data analysis"),
                  topic2 = c("cybersecurity", "privacy", "data protection"),
                  topic3 = c("climate change", "sustainability", "environmental impact")) 

dict <- quanteda::dictionary(seed_list)


# Create a document-feature matrix with unigram and bigram features
toks <- tokens(my_corpus, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE) |>  
    tokens_select(min_nchar = 2) |> 
    tokens() |> ## I'M NOT SURE HERE ###
    #tokens_ngrams(n = 1:2) |> ## I'M NOT SURE HERE ###
    tokens_compound(dict) # for multi-word expressions

dfmt <- dfm(toks) |>   
    dfm_remove(stopwords('en')) |>   
    dfm_trim(min_termfreq = 0.5, termfreq_type = "quantile", 
             max_docfreq = 0.5, docfreq_type = "prop")


# Run seeded LDA
slda <- textmodel_seededlda(dfmt, dict, residual = TRUE)


# extract topic classification for each document 
topics <- topics(slda)


# get most frequent keywords of each topic

keywords <- terms(slda, 20) |> as_tibble()

II - Use LDAvis to visualize and modify keywords, topics

As I need to run several rounds of seeded LDA to make sure the topics are clearly separated, and to combine topics that are too close to each other, I would like to use LDAvis to inspect the results more intuitively. I tried it and got the error below:

> # Create the interactive visualization using LDAvis
> lda_vis <- createJSON(phi = slda$phi, 
+                       theta = slda$theta,
+                       doc_lengths = slda$doc_lengths,
+                       vocab = slda$vocab,
+                       term_frequency = slda$term_frequency)
Error in createJSON(phi = slda$phi, theta = slda$theta, doc_lengths = slda$doc_lengths,  : 
  Length of doc.length not equal 
      to the number of rows in theta; both should be equal to the number of 
      documents in the data.

Q4: Is it because LDAvis does not work with seededlda output? If it works, what should I do? Is there any alternative visualization tools?

Thank you so much for your time and consideration.

Best,

HHN

FYI: A review of this package

There is a tidbit about this package in this now widely shared blog post.

As you can see, it appears under "implicit type conversions". But after "a reader of this blog" (you can guess who he is) pointed out the factual error in that accusation, it is no longer about implicit type conversions at all. Instead, it is about whether batch_size should be a proportion, and according to the writer, 3 people didn't understand what batch_size means.

I think the documentation explains clearly what batch_size does, and in my opinion it makes a lot of sense for it to be a proportion. But I am afraid some people might think that this parameter works similarly to gensim's chunksize.

Probabilities for topics

I am using seededlda and I am trying to obtain probabilities like the gamma values we can get from traditional LDA. Is there any way to do the same with the seededlda package, or perhaps a workaround? @koheiw
I am trying to use the topic model results and fit a logistic regression model.
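In case it helps, the fitted object appears to store the document-topic distribution directly; a hedged sketch, assuming slda is the object returned by textmodel_seededlda():

```r
# theta: rows are documents, columns are topics; each row sums to ~1,
# analogous to gamma in topicmodels
probs <- slda$theta
head(probs)

# these columns can then be used as predictors in a logistic regression
```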

Make the posterior() stats available

The initially wrapped package topicmodels offered the possibility of more refined exploration of the topics in every document with topicmodels::posterior(my_lda)$topics. Could this be made available for a result of seededlda::textmodel_lda()?

Given the probabilistic nature of topic-document associations, it would be nice to sensitize students and the public to the fact that a given topic is only the most prevalent one in a given text, not the only one.

Example:

library(quanteda)   # convert()
library(topicmodels)
library(stringr)    # str_replace(), fixed()
library(tidyr)
library(ggplot2)
library(magrittr)   # %>%

lda_model2 <- topicmodels::LDA(convert(my_dfm, to = "topicmodels"), k = 6)
doc_topics <- topicmodels::posterior(lda_model2)$topics
df <- data.frame(doc_id = row.names(doc_topics) %>% str_replace(fixed(".txt"), ""), doc_topics)
df_long <- tidyr::pivot_longer(df, cols = starts_with("X"), names_to = "topic", values_to = "importance")
ggplot(df_long, aes(x = importance, y = doc_id, fill = factor(topic))) +
  geom_bar(stat = "identity") +
  labs(x = "Topic Importance", y = "Document ID", fill = "Topic") +
  theme_minimal() +
  theme(axis.text.y = element_text(angle = 0, hjust = 1))
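If I read the request correctly, the same information may already be exposed on the fitted object; a hedged sketch, assuming my_slda was returned by seededlda::textmodel_lda():

```r
# document-topic probabilities, analogous to topicmodels::posterior()$topics
doc_topics <- my_slda$theta              # rows = documents, columns = topics
df <- data.frame(doc_id = rownames(doc_topics), doc_topics)
```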

[attached plot: mytextsplot2]

How `textmodel_lda` predicts topic for unseen documents?

Hi, thank you so much for developing this wonderful package. I have a conceptual question regarding how textmodel_lda predicts topics for unseen documents. I know this can be achieved by specifying the model argument in the function, but I wish to understand how it works in the background. Specifically,

  1. I saw that the Gibbs sampling was running. Does it mean that the posterior word-topic distribution is getting updated?
  2. How does the model handle unseen words?

Sorry if the above questions are too basic, and thank you for taking the time to read this post.

Zhe

Topic-word probabilities not summing to one

Hello,

I observed a strange behavior when applying the seededlda model: the topic-word distribution does not always sum to one.

I recently updated the package, and I don't remember having this issue before (though it might be fairly old).

require(seededlda)
#> Loading required package: seededlda
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.
#> Loading required package: proxyC
#> 
#> Attaching package: 'proxyC'
#> The following object is masked from 'package:stats':
#> 
#>     dist
#> 
#> Attaching package: 'seededlda'
#> The following object is masked from 'package:stats':
#> 
#>     terms
require(quanteda)

corp <- head(data_corpus_moviereviews, 500)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks) %>%
  dfm_remove(stopwords('en'), min_nchar = 2) %>%
  dfm_trim(min_termfreq = 0.90, termfreq_type = "quantile",
           max_docfreq = 0.1, docfreq_type = "prop")

dict <- dictionary(list(people = c("family", "couple", "kids"),
                        space = c("alien", "planet", "space"),
                        moster = c("monster*", "ghost*", "zombie*"),
                        war = c("war", "soldier*", "tanks"),
                        crime = c("crime*", "murder", "killer")))
slda <- textmodel_seededlda(dfmt, dict, residual = TRUE, min_termfreq = 10)

rowSums(slda$phi)
#>    people     space    moster       war     crime     other 
#> 1.0000000 1.0000000 0.9999004 1.0000000 1.0000000 1.0000000

Created on 2023-06-02 with reprex v2.0.2
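Until the underlying cause is found, one possible workaround (my own suggestion, not an official fix) is to renormalize phi after fitting:

```r
# force each topic-word distribution to sum to exactly 1
phi <- slda$phi / rowSums(slda$phi)
rowSums(phi)
```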

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.0 (2023-04-21 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  French_Belgium.utf8
#>  ctype    French_Belgium.utf8
#>  tz       Europe/Paris
#>  date     2023-06-02
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  ! package      * version date (UTC) lib source
#>    cli            3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
#>    digest         0.6.31  2022-12-11 [1] CRAN (R 4.3.0)
#>    evaluate       0.21    2023-05-05 [1] CRAN (R 4.3.0)
#>    fastmap        1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
#>    fastmatch      1.1-3   2021-07-23 [1] CRAN (R 4.3.0)
#>    fs             1.6.2   2023-04-25 [1] CRAN (R 4.3.0)
#>    glue           1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
#>    htmltools      0.5.5   2023-03-23 [1] CRAN (R 4.3.0)
#>    knitr          1.43    2023-05-25 [1] CRAN (R 4.3.0)
#>    lattice        0.21-8  2023-04-05 [2] CRAN (R 4.3.0)
#>    lifecycle      1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
#>    magrittr       2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
#>    Matrix         1.5-4   2023-04-04 [2] CRAN (R 4.3.0)
#>    proxyC       * 0.3.3   2022-10-06 [1] CRAN (R 4.3.0)
#>    purrr          1.0.1   2023-01-10 [1] CRAN (R 4.3.0)
#>    quanteda     * 3.3.1   2023-05-18 [1] CRAN (R 4.3.0)
#>    R.cache        0.16.0  2022-07-21 [1] CRAN (R 4.3.0)
#>    R.methodsS3    1.8.2   2022-06-13 [1] CRAN (R 4.3.0)
#>    R.oo           1.25.0  2022-06-12 [1] CRAN (R 4.3.0)
#>    R.utils        2.12.2  2022-11-11 [1] CRAN (R 4.3.0)
#>    Rcpp           1.0.10  2023-01-22 [1] CRAN (R 4.3.0)
#>  D RcppParallel   5.1.7   2023-02-27 [1] CRAN (R 4.3.0)
#>    reprex         2.0.2   2022-08-17 [1] CRAN (R 4.3.0)
#>    rlang          1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
#>    rmarkdown      2.22    2023-06-01 [1] CRAN (R 4.3.0)
#>    rstudioapi     0.14    2022-08-22 [1] CRAN (R 4.3.0)
#>    seededlda    * 1.0.0   2023-05-31 [1] CRAN (R 4.3.0)
#>    sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
#>    stopwords      2.3     2021-10-28 [1] CRAN (R 4.3.0)
#>    stringi        1.7.12  2023-01-11 [1] CRAN (R 4.3.0)
#>    styler         1.10.0  2023-05-24 [1] CRAN (R 4.3.0)
#>    vctrs          0.6.2   2023-04-19 [1] CRAN (R 4.3.0)
#>    withr          2.5.0   2022-03-03 [1] CRAN (R 4.3.0)
#>    xfun           0.39    2023-04-20 [1] CRAN (R 4.3.0)
#>    yaml           2.3.7   2023-01-23 [1] CRAN (R 4.3.0)
#> 
#>  [1] C:/Users/odlmarce/AppData/Local/R/win-library/4.3
#>  [2] C:/Program Files/R/R-4.3.0/library
#> 
#>  D ── DLL MD5 mismatch, broken installation.
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Argument uniform

Dear Mr. Watanabe,

Could you please explain to me what the meaning of the argument "uniform" is? The documentation says that it should be set to false to make the total amount of seed words in all topics the same. Does this mean that in this case the weight of the seed words is determined independently of their term frequency in the corpus (i.e. unlike the formula in your paper)? Or in this case, are the seed words in each seed topic given the same weight, i.e. the same pseudo-counts are added for each topic? But then, the seed words would no longer help to distinguish the different Seed Topics, right?

I would be very grateful for an answer! Many thanks in advance!

Error when dictionary contains empty element

Nice and straightforward implementation of a semi-supervised topic model. I accidentally discovered that the function gives an error if your dictionary contains an empty element (""). It would be useful to have an appropriate error message signalling this: the function could first check the dictionary and raise an informative error if something is improper.
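A minimal sketch of the kind of pre-check I have in mind (hypothetical, not the package's actual validation):

```r
# reject dictionaries that contain empty patterns before fitting
vals <- unlist(dict)
if (any(!nzchar(vals)))
  stop("the dictionary contains empty pattern(s); please remove them")
```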

Compute likelihood of fitted models

Hi,
I want to confirm a difference between topicmodels and your seededlda.
To evaluate LDA model quality, topicmodels stores the log-likelihood (logLiks) during Gibbs sampling.
But I can't find these values in the seededlda results.
Could you kindly suggest how to obtain them?

strange error with lda & seededlda

The first time I tried lda & seededlda in RStudio, they worked. However, when I ran them again, they suddenly stopped working, generating the errors attached (concerning "terms" and "topics").

I tried with base R and got the same result: the first time it worked, but from the second attempt on, they failed.

By contrast, "predict" always works perfectly.

I ran the examples from the vignette, and all the required packages are installed and updated. I'm working on Windows, just in case.

Any guess about the source of this problem and possible solutions? Thanks!

[attached screenshot: error]

Add goodness of fit metrics

I would like to know whether there is any implementation of the standard goodness-of-fit metrics for your textmodel_lda class. For instance, here is a SO post which didn't get much traction. I am wondering whether the standard metrics still apply in the case of seeded LDA.

Could you please give me some information about any upcoming implementation, if any? Or could you suggest a direct method that applies to your object class?
Thanks!

Compatibility with LDAvis

Hi @koheiw,

I was giving the package a go and really like how you implemented things so far.

I noticed a strange issue when trying to use LDAvis though:

library(quanteda)
library(seededlda)
library(LDAvis)


data("data_corpus_moviereviews", package = "quanteda.textmodels")
corp <- head(data_corpus_moviereviews, 500)
dfmt <- dfm(corp, remove_numbers = TRUE) %>%
  dfm_remove(stopwords('en'), min_nchar = 2) %>%
  dfm_trim(min_termfreq = 0.90, termfreq_type = "quantile",
           max_docfreq = 0.1, docfreq_type = "prop")

# unsupervised LDA
lda <- textmodel_lda(dfmt, 6)

# use LDAvis to explore topics
json <- createJSON(phi = lda$phi,
                   theta = lda$theta, 
                   doc.length = quanteda::ntoken(lda$x),
                   vocab = quanteda::featnames(lda$x), 
                   term.frequency = quanteda::featfreq(lda$x))
#> Error in createJSON(phi = lda$phi, theta = lda$theta, doc.length = quanteda::ntoken(lda$x), : Rows of phi don't all sum to 1.
serVis(json)

This error seems to occur every time. And it is telling the truth:

rowSums(lda$phi)
#>   topic1   topic2   topic3   topic4   topic5   topic6 
#> 1.000000 1.000000 1.000000 1.000000 1.000000 1.000136

I wondered whether this is on purpose or a bug. My understanding is that the rows of phi should add up to one, yet the sum of the last topic's row is always slightly over 1.
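As a stopgap (my own workaround, not an endorsed fix), renormalizing the rows before calling createJSON() gets past the check:

```r
# force the rows of phi and theta to sum to exactly 1
json <- createJSON(phi = lda$phi / rowSums(lda$phi),
                   theta = lda$theta / rowSums(lda$theta),
                   doc.length = quanteda::ntoken(lda$x),
                   vocab = quanteda::featnames(lda$x),
                   term.frequency = quanteda::featfreq(lda$x))
serVis(json)
```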

Issues about seededlda library installation

I am trying to install seededlda, but the error below occurs and I do not know how to fix it. I am using R version 4.3.3 and my OS is Ubuntu 22.04. I look forward to hearing from you. Thanks.

installing source package 'seededlda' ...
** package 'seededlda' successfully unpacked and MD5 sums checked
** using staged installation
** libs
using C++ compiler: 'g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0'
g++ -std=gnu++17 -I"/opt/R/4.3.3/lib/R/include" -DNDEBUG -DARMA_DONT_PRINT_OPENMP_WARNING -I'/home/abdinardo/R/x86_64-pc-linux-gnu-library/4.3/Rcpp/include' -I'/home/abdinardo/R/x86_64-pc-linux-gnu-library/4.3/RcppParallel/include' -I'/home/abdinardo/R/x86_64-pc-linux-gnu-library/4.3/RcppArmadillo/include' -I'/home/abdinardo/R/x86_64-pc-linux-gnu-library/4.3/quanteda/include' -I'/home/abdinardo/R/x86_64-pc-linux-gnu-library/4.3/testthat/include' -I/usr/local/include -DARMA_64BIT_WORD=1 -fpic -g -O2 -c RcppExports.cpp -o RcppExports.o
g++ -std=gnu++17 -I"/opt/R/4.3.3/lib/R/include" -DNDEBUG -DARMA_DONT_PRINT_OPENMP_WARNING -I'/home/abdinardo/R/x86_64-pc-linux-gnu-library/4.3/Rcpp/include' -I'/home/abdinardo/R/x86_64-pc-linux-gnu-library/4.3/RcppParallel/include' -I'/home/abdinardo/R/x86_64-pc-linux-gnu-library/4.3/RcppArmadillo/include' -I'/home/abdinardo/R/x86_64-pc-linux-gnu-library/4.3/quanteda/include' -I'/home/abdinardo/R/x86_64-pc-linux-gnu-library/4.3/testthat/include' -I/usr/local/include -DARMA_64BIT_WORD=1 -fpic -g -O2 -c lda.cpp -o lda.o
In file included from lda.cpp:4:
lda.h: In constructor 'LDA::LDA(int, double, double, double, int, double, int, int, bool, int)':
lda.h:128:33: error: 'tbb' has not been declared
128 | if (0 < thread && thread <= tbb::this_task_arena::max_concurrency())
| ^~~
lda.h: In member function 'void LDA::set_default_values()':
lda.h:152:14: error: 'tbb' has not been declared
152 | thread = tbb::this_task_arena::max_concurrency();
| ^~~
lda.h: In member function 'void LDA::estimate()':
lda.h:255:5: error: 'tbb' has not been declared
255 | tbb::mutex mutex_sync;
| ^~~
lda.h:263:9: error: 'tbb' has not been declared
263 | tbb::task_arena arena(thread);
| ^~~
lda.h:264:9: error: 'arena' was not declared in this scope
264 | arena.execute([&]{
| ^~~~~
lda.h: In lambda function:
lda.h:265:13: error: 'tbb' has not been declared
265 | tbb::parallel_for(tbb::blocked_range<int>(0, M, batch), [&](tbb::blocked_range<int> r) {
| ^~~
lda.h:265:31: error: 'tbb' has not been declared
265 | tbb::parallel_for(tbb::blocked_range<int>(0, M, batch), [&](tbb::blocked_range<int> r) {
| ^~~
lda.h:265:50: error: expected primary-expression before 'int'
265 | tbb::parallel_for(tbb::blocked_range<int>(0, M, batch), [&](tbb::blocked_range<int> r) {
| ^~~
lda.h:265:73: error: 'tbb' has not been declared
265 | tbb::parallel_for(tbb::blocked_range<int>(0, M, batch), [&](tbb::blocked_range<int> r) {
| ^~~
lda.h:265:91: error: expected ',' or '...' before '<' token
265 | tbb::parallel_for(tbb::blocked_range<int>(0, M, batch), [&](tbb::blocked_range<int> r) {
| ^
lda.h: In lambda function:
lda.h:267:29: error: 'r' was not declared in this scope
267 | int begin = r.begin();
| ^
lda.h:303:17: error: 'mutex_sync' was not declared in this scope
303 | mutex_sync.lock();
| ^~~~~~~~~~
lda.h: In lambda function:
lda.h:308:16: error: 'tbb' has not been declared
308 | }, tbb::static_partitioner());
| ^~~
make: *** [/opt/R/4.3.3/lib/R/etc/Makeconf:200: lda.o] Error 1
ERROR: compilation failed for package 'seededlda'

  • removing '/home/abdinardo/R/x86_64-pc-linux-gnu-library/4.3/seededlda'
    Warning in install.packages :
    installation of package 'seededlda' had non-zero exit status

The downloaded source packages are in
'/tmp/RtmpN0g5Q2/downloaded_packages'

installation fails on Alpine Linux

Simply because the tbb namespace is undeclared (see the error log below).
This could be fixed by a bulk #include <tbb/tbb.h> of all of TBB in src/lda.h,
but I think it would be better to include only the required headers selectively.
Thanks!

Error Log (version 1.1.0)
In file included from lda.cpp:4:
lda.h: In constructor 'LDA::LDA(int, double, double, double, int, double, int, int, bool, int)':
lda.h:128:33: error: 'tbb' has not been declared
  128 |     if (0 < thread && thread <= tbb::this_task_arena::max_concurrency())
      |                                 ^~~
lda.h: In member function 'void LDA::set_default_values()':
lda.h:152:14: error: 'tbb' has not been declared
  152 |     thread = tbb::this_task_arena::max_concurrency();
      |              ^~~
lda.h: In member function 'void LDA::estimate()':
lda.h:255:5: error: 'tbb' has not been declared
  255 |     tbb::mutex mutex_sync;
      |     ^~~
lda.h:263:9: error: 'tbb' has not been declared
  263 |         tbb::task_arena arena(thread);
      |         ^~~
lda.h:264:9: error: 'arena' was not declared in this scope
  264 |         arena.execute([&]{
      |         ^~~~~
lda.h: In lambda function:
lda.h:265:13: error: 'tbb' has not been declared
  265 |             tbb::parallel_for(tbb::blocked_range<int>(0, M, batch), [&](tbb::blocked_range<int> r) {
      |             ^~~
lda.h:265:31: error: 'tbb' has not been declared
  265 |             tbb::parallel_for(tbb::blocked_range<int>(0, M, batch), [&](tbb::blocked_range<int> r) {
      |                               ^~~
lda.h:265:50: error: expected primary-expression before 'int'
  265 |             tbb::parallel_for(tbb::blocked_range<int>(0, M, batch), [&](tbb::blocked_range<int> r) {
      |                                                  ^~~
lda.h:265:73: error: 'tbb' has not been declared
  265 |             tbb::parallel_for(tbb::blocked_range<int>(0, M, batch), [&](tbb::blocked_range<int> r) {
      |                                                                         ^~~
lda.h:265:91: error: expected ',' or '...' before '<' token
  265 |             tbb::parallel_for(tbb::blocked_range<int>(0, M, batch), [&](tbb::blocked_range<int> r) {
      |                                                                                           ^
lda.h: In lambda function:
lda.h:267:29: error: 'r' was not declared in this scope
  267 |                 int begin = r.begin();
      |                             ^
lda.h:303:17: error: 'mutex_sync' was not declared in this scope
  303 |                 mutex_sync.lock();
      |                 ^~~~~~~~~~
lda.h: In lambda function:
lda.h:308:16: error: 'tbb' has not been declared
  308 |             }, tbb::static_partitioner());
      |                ^~~

seededlda:::tfm errors if there is x in docvars

> require(quanteda)
> dict <- dictionary(list('A' = "a"))
> dat <- data.frame(text = c("a b c", "A B C"), x = c(1, 2))
> corp <- corpus(dat)
> toks <- tokens(corp)
> dfmt <- dfm(toks)
> seededlda:::tfm(dfmt, dict)
Error: ndoc() only works on corpus, dfm, spacyr_parsed, tokens, tokens_xptr objects.

Comparing with topicmodels::LDA

Dear Kohei,
I guess the theta matrix of your seededlda::textmodel_lda is the same as the gamma matrix of topicmodels::LDA.
I set the processing parameters, such as the Gibbs sampling settings and the random seed, to be the same for both.
I found a significant difference between the per-document per-topic probability values.
I applied 110 sections of "Kokoro" written by Soseki Natsume as the test data.
kokoro_df2-m50-71z.csv

Please find a sample of the comparison result in the attached png.
[attached image: kokoro_df2-LDA-slda_2k-10tpc-gamma-T28]

Will you kindly show me the detailed calculation of the theta matrix?

Masataka
