bnosac / etm
Topic Modelling in Semantic Embedding Spaces
License: Other
New submission
Conflicting package names (submitted: ETM, existing: etm [https://CRAN.R-project.org])
Conflicting package names (submitted: ETM, existing: etm [CRAN archive])
explain KL / NELBO / Loss in docs based on the paper
Remarks on better structuring of the license / copyright can be put in this thread. Issue created following Adji's remark at adjidieng/ETM-R#3
Some notes on how copyright is referenced currently in the package
Description file indicates
The file LICENSE.note provides the license as indicated at https://github.com/adjidieng/ETM/blob/master/LICENSE, following the CRAN recommendation at https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Licensing, and refers to the clone in the inst/orig/ETM folder of the source code at https://github.com/adjidieng/ETM. The R package does not use this code when executing the model; the code is only there as a reference.
The file LICENSE uses the template indicated by the R project documentation at https://www.r-project.org/Licenses/MIT. In that file we can also add Adji B. Dieng, Francisco J. R. Ruiz, David M. Blei. @adjidieng agrees with this.
We could also add the citation as in https://github.com/adjidieng/ETM-R/blob/main/inst/CITATION
Hello! When I first tried your ETM package in R using the Belgian parliament data, it worked. However, when I was testing it on my (small) data, after running the ETM() function and optimizer, I encountered this error:
Error in seq_len(nrow(x)) : argument must be coercible to non-negative integer
Here is my code:
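library(torch)            # provides torch_manual_seed() and optim_adam()
library(dplyr)            # provides filter() as used on the data frame below
library(topicmodels.etm)
library(doc2vec)
library(word2vec)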
gcash_data <- read.csv("GCash_200_Reviews_PlayStore_RepeatScroll20_Wait5s_TimeOut60s_AJAx.csv")
names(gcash_data) <- c("UserName", "Date", "Likes", "Review", "Rating")
gcash_r5 <- filter(gcash_data, Rating == "5")
head(gcash_r5)
str(gcash_r5)
x <- data.frame(doc_id = gcash_r5$UserName,
                text = gcash_r5$Review,
                stringsAsFactors = FALSE)
x$text <- txt_clean_word2vec(x$text)
w2v <- word2vec(x = x$text, dim = 25, type = "skip-gram", iter = 10, min_count = 5, threads = 2)
embeddings <- as.matrix(w2v)
predict(w2v, newdata = c("app", "convenient"), type = "nearest", top_n = 4)
library(udpipe)
dtm <- strsplit.data.frame(x, group = "doc_id", term = "text", split = " ")
dtm <- document_term_frequencies(dtm)
dtm <- document_term_matrix(dtm)
dtm <- dtm_remove_tfidf(dtm, prob = 0.50)
vocab <- intersect(rownames(embeddings), colnames(dtm))
embeddings <- dtm_conform(embeddings, rows = vocab)
dtm <- dtm_conform(dtm, columns = vocab)
dim(dtm)
dim(embeddings)
set.seed(1234)
torch_manual_seed(4321)
model <- ETM(k = 5, dim = 100, embeddings = embeddings)
optimizer <- optim_adam(params = model$parameters, lr = 0.005, weight_decay = 0.0000012)
loss <- model$fit(data = dtm, optimizer = optimizer, epoch = 20, batch_size = 5)
As the code shows, I set k = 5 topics in ETM() and batch_size = 5 in model$fit().
After running the last line above with "model$fit", the following error occurs:
Error in seq_len(nrow(x)) : argument must be coercible to non-negative integer
Is this because I am trying to run a small dataset? How may I solve this?
Thank you in advance! :)
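The post does not show the output of dim(dtm) and dim(embeddings), so the cause cannot be confirmed from here. One thing worth checking on a small dataset is whether the matrix passed to model$fit is still a two-dimensional matrix with non-empty documents after min_count = 5, dtm_remove_tfidf and the vocabulary intersection have pruned it. A minimal sketch of such a check, using only the Matrix package: the toy matrix and the helper name check_dtm are hypothetical stand-ins, not part of the ETM package, and this is not a confirmed fix for the error above.

```r
library(Matrix)

# Toy document-term matrix standing in for `dtm` from the post (hypothetical data):
# 4 documents x 3 terms, where "doc3" has no terms left after pruning.
dtm <- sparseMatrix(
  i = c(1, 1, 2, 4),
  j = c(1, 2, 2, 3),
  x = c(2, 1, 1, 3),
  dims = c(4, 3),
  dimnames = list(paste0("doc", 1:4), c("app", "convenient", "easy"))
)

# Hypothetical helper: verify the object is still 2-d, report its shape,
# and drop documents whose term counts are all zero.
check_dtm <- function(dtm) {
  stopifnot(length(dim(dtm)) == 2)
  empty <- Matrix::rowSums(dtm) == 0
  message(nrow(dtm), " documents, ", ncol(dtm), " terms, ",
          sum(empty), " empty documents")
  dtm[!empty, , drop = FALSE]  # drop = FALSE keeps the result a matrix
}

dtm_clean <- check_dtm(dtm)
dim(dtm_clean)
```

If the same check on the real dtm reports zero terms or very few documents, the vocabulary shared between the word2vec embeddings and the dtm is probably too small for this data, and lowering min_count or the tf-idf pruning threshold would be the first thing to try.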
to_device