bnosac / etm

46.0 3.0 3.0 7.78 MB

Topic Modelling in Semantic Embedding Spaces

License: Other

R 51.73% Python 48.27%

lda word-embeddings word2vec topic-modeling embeddings

etm's Issues

replace gamma with beta to be consistent with the paper

add unit tests

on torch > 0.5 with fixed seeds,
to prove what we already saw: that the implementation gives the same results as https://github.com/bnosac-dev/ETM

change package name

using log directory 'd:/RCompile/CRANguest/R-devel/ETM.Rcheck'
using R Under development (unstable) (2021-08-13 r80752)
using platform: x86_64-w64-mingw32 (64-bit)
using session charset: ISO8859-1
checking for file 'ETM/DESCRIPTION' ... OK
checking extension type ... Package
this is package 'ETM' version '0.1.0'
package encoding: UTF-8
checking CRAN incoming feasibility ... ERROR

New submission

Conflicting package names (submitted: ETM, existing: etm [https://CRAN.R-project.org])

Conflicting package names (submitted: ETM, existing: etm [CRAN archive])

add simple way of projecting to 2D

as a generic
maybe include in R package word2vec
passing to uwot::umap / alongside uwot::umap_transform
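
A minimal sketch of what such a 2D projection helper could look like. Note the swap: the issue proposes uwot::umap, but to stay dependency-free this sketch uses PCA from base R (stats::prcomp) as a stand-in. The function name project_2d and the toy embedding matrix are hypothetical and not part of the package.

```r
# Hypothetical helper: project the rows of an embedding matrix to 2D.
# PCA (stats::prcomp) is used here as a stand-in for uwot::umap.
project_2d <- function(embeddings) {
  stopifnot(is.matrix(embeddings), nrow(embeddings) >= 2, ncol(embeddings) >= 2)
  pca <- stats::prcomp(embeddings, center = TRUE, scale. = FALSE)
  coords <- pca$x[, 1:2, drop = FALSE]   # keep the first two principal components
  colnames(coords) <- c("x", "y")
  coords
}

# Toy data: 10 made-up terms with 25-dimensional random embeddings
set.seed(42)
toy <- matrix(rnorm(10 * 25), nrow = 10,
              dimnames = list(paste0("term", 1:10), NULL))
xy <- project_2d(toy)
head(xy)
```

The result keeps one row per term, so it can be passed straight to a scatter plot with the term names as labels.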

docs

explain KL / NELBO / Loss in docs based on the paper
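
As a starting point for those docs, the quantities can be written down as follows, based on the ETM paper by Dieng, Ruiz and Blei. The notation (topic proportions \delta_d, word embeddings \rho, topic embeddings \alpha, variational parameters \nu, number of topics K) follows the paper; how exactly the loss reported by the package maps onto these terms is an assumption that should be checked against the code.

```latex
\begin{aligned}
\text{ELBO} &= \sum_{d}\sum_{n} \mathbb{E}_{q}\left[\log p(w_{dn} \mid \delta_d, \rho, \alpha)\right]
  - \sum_{d} \mathrm{KL}\left(q(\delta_d; w_d, \nu) \,\|\, p(\delta_d)\right) \\
\text{NELBO} &= -\,\text{ELBO} = \text{reconstruction loss} + \text{KL} \\
\mathrm{KL}\left(\mathcal{N}(\mu_d, \operatorname{diag}(\sigma_d^2)) \,\|\, \mathcal{N}(0, I)\right)
  &= \tfrac{1}{2} \sum_{k=1}^{K} \left(\sigma_{dk}^2 + \mu_{dk}^2 - 1 - \log \sigma_{dk}^2\right)
\end{aligned}
```

The last line is the closed form of the KL term for a Gaussian variational distribution with diagonal covariance against a standard normal prior.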

add plot function

- showing the evolution of the loss: plot(model, type = "loss", ...)
- showing the words emitted by each topic: plot(model, type = "terminology", ...)
=> this will simplify the workflow
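
A rough sketch of the loss part of such a plot function, in base R graphics. Everything here is an assumption for illustration: the name plot_loss_sketch is hypothetical, the loss values are made up, and the real method would take the fitted model rather than a plain numeric vector.

```r
# Hypothetical helper: plot one loss value per epoch.
plot_loss_sketch <- function(loss, ...) {
  stopifnot(is.numeric(loss), length(loss) > 0)
  epochs <- seq_along(loss)
  plot(epochs, loss, type = "b", xlab = "epoch", ylab = "loss", ...)
  # Return the plotted data invisibly so it can be reused or inspected
  invisible(data.frame(epoch = epochs, loss = loss))
}

toy_loss <- c(12.4, 10.1, 9.3, 8.9, 8.7)   # made-up values, not real output
grDevices::pdf(NULL)                        # off-screen device so this also runs non-interactively
curve_data <- plot_loss_sketch(toy_loss, main = "Toy loss curve")
invisible(grDevices::dev.off())
```

In an interactive session the pdf(NULL) / dev.off() lines can be dropped so the plot shows up in the active graphics device.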

license / copyright / citation

Remarks on better structuring of the license / copyright can be put in this thread. Issue created following the remark from Adji at adjidieng/ETM-R#3.
Some notes on how copyright is currently referenced in the package:

The DESCRIPTION file indicates
- The R part in the R folder is from Jan Wijffels, as indicated at https://github.com/bnosac/ETM/blob/master/DESCRIPTION#L7
- The Python code at inst/orig is from Adji B. Dieng and colleagues, as indicated at https://github.com/bnosac/ETM/blob/master/DESCRIPTION#L9:L11
The file LICENSE.note provides the license as indicated at https://github.com/adjidieng/ETM/blob/master/LICENSE according to the CRAN recommendation at https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Licensing and refers to the clone in the inst/orig/ETM folder of the source code at https://github.com/adjidieng/ETM. The R package does not use this code when executing the model; the code is only there as a reference.
The file LICENSE uses the template indicated by the documentation of the R-project at https://www.r-project.org/Licenses/MIT. At that location we can also add Adji B. Dieng, Francisco J. R. Ruiz, David M. Blei. @adjidieng agrees with this.
We could also add the citation as in https://github.com/adjidieng/ETM-R/blob/main/inst/CITATION

Error in seq_len(nrow(x)) : argument must be coercible to non-negative integer

Hello! When I first tried your ETM package in R using the Belgian parliament data, it worked. However, when I was testing it on my (small) data, after running the ETM() function and optimizer, I encountered this error:

Error in seq_len(nrow(x)) : argument must be coercible to non-negative integer

Here is my code:

library(topicmodels.etm)
library(doc2vec)
library(word2vec)
gcash_data <- read.csv("GCash_200_Reviews_PlayStore_RepeatScroll20_Wait5s_TimeOut60s_AJAx.csv")
names(gcash_data) <- c("UserName", "Date", "Likes", "Review", "Rating")
gcash_r5 <- filter(gcash_data, Rating == "5")
head(gcash_r5)
str(gcash_r5)

x      <- data.frame(doc_id           = gcash_r5$UserName, 
                     text             = gcash_r5$Review, 
                     stringsAsFactors = FALSE)
x$text <- txt_clean_word2vec(x$text)

w2v        <- word2vec(x = x$text, dim = 25, type = "skip-gram", iter = 10, min_count = 5, threads = 2)
embeddings <- as.matrix(w2v)
predict(w2v, newdata = c("app", "convenient"), type = "nearest", top_n = 4)

library(udpipe)
dtm   <- strsplit.data.frame(x, group = "doc_id", term = "text", split = " ")
dtm   <- document_term_frequencies(dtm)
dtm   <- document_term_matrix(dtm)
dtm   <- dtm_remove_tfidf(dtm, prob = 0.50)

vocab        <- intersect(rownames(embeddings), colnames(dtm))
embeddings   <- dtm_conform(embeddings, rows = vocab)
dtm          <- dtm_conform(dtm,     columns = vocab)
dim(dtm)
dim(embeddings)

set.seed(1234)
torch_manual_seed(4321)
model     <- ETM(k = 5, dim = 100, embeddings = embeddings)
optimizer <- optim_adam(params = model$parameters, lr = 0.005, weight_decay = 0.0000012)
loss      <- model$fit(data = dtm, optimizer = optimizer, epoch = 20, batch_size = 5)

As you can see in the code, I changed the ETM arguments to k = 5 topics and the model$fit arguments to batch_size = 5.

After running the last line above with "model$fit", the following error occurs:
Error in seq_len(nrow(x)) : argument must be coercible to non-negative integer

Is this because I am trying to run a small dataset? How may I solve this?

Thank you in advance! :)
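
The cause is not confirmed in this thread. With a small corpus it may be worth checking how much data is left after the filtering steps before calling model$fit. The following is only a diagnostic sketch, assuming the problem is related to the size or shape of the inputs: check_etm_inputs is a hypothetical helper, not a package function, and the toy matrices are made up. It uses dense base R matrices; for a sparse dtm, Matrix::rowSums would be needed instead of rowSums.

```r
# Hypothetical diagnostic helper: summarise what is left to train on.
check_etm_inputs <- function(dtm, embeddings, batch_size) {
  is_2d      <- length(dim(dtm)) == 2            # FALSE if dtm collapsed to a vector
  shared     <- intersect(colnames(dtm), rownames(embeddings))
  n_nonempty <- if (is_2d) sum(rowSums(dtm) > 0) else NA_integer_
  list(
    is_2d            = is_2d,
    n_docs           = if (is_2d) nrow(dtm) else NA_integer_,
    n_shared_terms   = length(shared),            # vocabulary overlap dtm/embeddings
    n_nonempty_docs  = n_nonempty,                # documents with at least one term
    covers_one_batch = isTRUE(n_nonempty >= batch_size)
  )
}

# Toy example: 3 documents, one of them empty, 2 terms shared with the embeddings
toy_dtm <- matrix(c(1, 0, 2,
                    0, 0, 0,
                    3, 1, 0), nrow = 3, byrow = TRUE,
                  dimnames = list(c("d1", "d2", "d3"), c("app", "pay", "fast")))
toy_emb <- matrix(0.1, nrow = 2, ncol = 4,
                  dimnames = list(c("app", "fast"), NULL))
res <- check_etm_inputs(toy_dtm, toy_emb, batch_size = 5)
str(res)
```

If the number of non-empty documents or shared terms turns out to be very small, relaxing min_count in word2vec or the prob threshold in dtm_remove_tfidf would be the first things to try.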

also work on non-cpu computers

to_device

implement measures
