stephenslab / fasttopics Goto Github PK

Fast algorithms for fitting topic models and non-negative matrix factorizations to count data.

Home Page: https://stephenslab.github.io/fastTopics

License: Other

R 83.32% C++ 16.14% C 0.54%

fasttopics's Introduction

fastTopics

fastTopics is an R package implementing fast, scalable optimization algorithms for fitting topic models and non-negative matrix factorizations to count data. The methods exploit the close relationship between the topic model and Poisson non-negative matrix factorization. The package also provides tools to compare, annotate and visualize model fits, including functions to create "structure plots" and functions to identify distinctive features of topics. The fastTopics package is a successor to the CountClust package.

If you find a bug, or you have a question or feedback on this software, please post an issue.

Citing this work

If you find the fastTopics package or any of the source code in this repository useful for your work, please cite:

K. K. Dey, C. J. Hsiao and M. Stephens (2017). Visualizing the structure of RNA-seq expression data using grade of membership models. PLoS Genetics 13, e1006599.

P. Carbonetto, A. Sarkar, Z. Wang and M. Stephens (2021). Non-negative matrix factorization algorithms greatly improve topic model fits. arXiv 2105.13440.

If you used the de_analysis function in fastTopics, please cite:

P. Carbonetto, K. Luo, A. Sarkar, A. Hung, K. Tayeb, S. Pott and M. Stephens (2023). GoM DE: interpreting structure in sequence count data with differential expression analysis allowing for grades of membership. Genome Biology 24, 236.

License

All source code and software in this repository are made available under the terms of the MIT license.

Quick Start

Install and load the package from CRAN:

install.packages("fastTopics")
library(fastTopics)

Alternatively, install the latest version from GitHub:

remotes::install_github("stephenslab/fastTopics")
library(fastTopics)

Note that installing the package will require a C++ compiler setup that is appropriate for the version of R installed on your computer. For details, refer to the documentation on the CRAN website.

For guidance on using fastTopics to analyze gene expression data, see the single-cell RNA-seq vignette, part 1 and part 2.

Also, try running the small example that illustrates the fast model fitting algorithms:

example("fit_poisson_nmf")

See the package documentation for more information.

Developer notes

To prepare the package for CRAN, remove the "Remotes" field in the DESCRIPTION file, and set on_cran <- TRUE in helper_functions.R, then run R CMD build fastTopics to build the source package.

This is the command used to check the package before submitting to CRAN:

library(rhub)
check_for_cran(".",show_status = TRUE,
  env_vars = c(`_R_CHECK_FORCE_SUGGESTS_` = "false",
               `_R_CHECK_CRAN_INCOMING_USE_ASPELL_` = "true"))

Credits

The fastTopics R package was developed by Peter Carbonetto, Matthew Stephens and others.

fasttopics's People

Contributors

Stargazers

Watchers

Forkers

peikalunci aksarkar rnaimehaom skyler-ruiter seth-wolfgang zhangxiaodong8315

fasttopics's Issues

Naming in a structure_plot for small number of samples

I am trying to do create a fast_topics_model for a small number of samples (12) into four groups, but when I try to create a structure_plot I can't add names through the grouping option. I can get the structure plot to work when I don't try and name/group the samples, but it won't work if I try and add the grouping option. It is entirely possible that this simply isn't possible with a small number of samples. If it is not possible to plot this way, is there an alternative method where I could create a plot of the structure with names? Also if I have your attention, how do I extract the top genes for each topic?

plotting without grouping:

attempted plotting with grouping:

Error: Failed to install 'fastTopics' from GitHub: installation of package ‘RcppArmadillo’ had non-zero exit status

Hi @pcarbo,

I received an error while installing fastTopics. Please see the end of this message for the session information. Any suggestions?

Thanks,
Joyce

devtools::install_github("stephenslab/fastTopics")
library(fastTopics)

Downloading GitHub repo stephenslab/fastTopics@master

dplyr (0.8.4 -> 0.8.5 ) [CRAN]
RcppParallel (4.4.4 -> 5.0.0 ) [CRAN]
RcppArmad... (NA -> 0.9.870.2.0) [CRAN]
purrr (0.3.3 -> 0.3.4 ) [CRAN]

Installing 4 packages: dplyr, RcppParallel, RcppArmadillo, purrr

Error: Failed to install 'fastTopics' from GitHub:
(converted from warning) installation of package ‘RcppArmadillo’ had non-zero exit status
Traceback:

devtools::install_github("stephenslab/fastTopics")
pkgbuild::with_build_tools({
. ellipsis::check_dots_used(action = getOption("devtools.ellipsis_action",
. rlang::warn))
. {
. remotes <- lapply(repo, github_remote, ref = ref, subdir = subdir,
. auth_token = auth_token, host = host)
. install_remotes(remotes, auth_token = auth_token, host = host,
. dependencies = dependencies, upgrade = upgrade, force = force,
. quiet = quiet, build = build, build_opts = build_opts,
. build_manual = build_manual, build_vignettes = build_vignettes,
. repos = repos, type = type, ...)
. }
. }, required = FALSE)
install_remotes(remotes, auth_token = auth_token, host = host,
. dependencies = dependencies, upgrade = upgrade, force = force,
. quiet = quiet, build = build, build_opts = build_opts, build_manual = build_manual,
. build_vignettes = build_vignettes, repos = repos, type = type,
. ...)
tryCatch(res[[i]] <- install_remote(remotes[[i]], ...), error = function(e) {
. stop(remote_install_error(remotes[[i]], e))
. })
tryCatchList(expr, classes, parentenv, handlers)
tryCatchOne(expr, names, parentenv, handlers[[1L]])
value[3L]

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/lib/R/lib/libRblas.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 pillar_1.4.3 compiler_3.6.1 prettyunits_1.1.1
[5] base64enc_0.1-3 remotes_2.1.1 tools_3.6.1 testthat_2.3.2
[9] digest_0.6.25 pkgbuild_1.0.7 uuid_0.1-2 pkgload_1.0.2
[13] jsonlite_1.6.1 evaluate_0.14 memoise_1.1.0 rlang_0.4.5
[17] IRdisplay_0.7.0 cli_2.0.2 curl_4.3 IRkernel_1.1
[21] repr_1.1.0 withr_2.2.0 desc_1.2.0 fs_1.3.1
[25] devtools_2.2.1 rprojroot_1.3-2 glue_1.4.0 R6_2.4.1
[29] processx_3.4.2 fansi_0.4.1 sessioninfo_1.1.1 pbdZMQ_0.3-3
[33] callr_3.4.3 magrittr_1.5 backports_1.1.6 ps_1.3.2
[37] ellipsis_0.3.0 htmltools_0.4.0 usethis_1.5.1 assertthat_0.2.1
[41] crayon_1.3.4

Convergence test and input count matrix

I really like the idea of using NMF as a way to tease out hidden patterns from bulk/sc gene expression data.

It is recommended in the vignette to use raw count matrix as an input to the fit_topic_model function. However, raw count data has been empirically known to have a negative binomial distribution and many different transformations have been suggested to take into account the overdispersion nature of the data. Is there any specific reason to use the raw count data vs using the variance-stabilized transformed data.
NMF optimization problems are ill-posed. The non-uniqueness of the solutions usually required users to run the optimization algorithms many times to get a more stable estimate. Is there any plan to implement a convergence test in fit_poisson_nmf function? It would be nice to have that feature in the function so users don't have to manually inspect the convergence curve when running multiple trials.

Segmentation fault in multithreading setup

The following example leads to a segfault

library(fastTopics)
data(pbmc_facs)
for (nc in c(1, 2)) {
  set.seed(0)
  temp <- fit_poisson_nmf(pbmc_facs$counts, k=6, init='random', numiter=1, method='em', verbose='detailed', control=list(nc=nc))
}

when run like this

$ OPENBLAS_NUM_THREADS=1 Rscript temp.R 
Fitting rank-6 Poisson NMF to 3774 x 16791 sparse matrix.
Running 1 EM updates, without extrapolation (fastTopics 0.6-117).
iter loglik(PoisNMF) loglik(multinom) res(KKT)  |F-F'|  |L-L'| nz(F) nz(L) beta
   1 -9.57230314e+06 -9.554666952e+06 1.68e+03 1.3e+02 9.5e-01 0.987 1.000 0.00
Fitting rank-6 Poisson NMF to 3774 x 16791 sparse matrix.
Running 1 EM updates, without extrapolation (fastTopics 0.6-117).
iter loglik(PoisNMF) loglik(multinom) res(KKT)  |F-F'|  |L-L'| nz(F) nz(L) beta

 *** caught segfault ***
address 0x600000005f7, cause 'memory not mapped'

Traceback:
 1: pnmfem_update_factors_sparse_parallel_rcpp(X, L, F, i - 1, numiter)
 2: pnmfem_update_loadings(X, F, L, i, numiter, nc)
 3: update_loadings_poisson_nmf(X, fit$F, fit$L, update.loadings,     method, control)
 4: update_poisson_nmf(X, fit, update.factors, update.loadings, method,     control)
 5: fit_poisson_nmf_main_loop(X, fit, numiter, update.factors, update.loadings,     method, control, verbose)
 6: fit_poisson_nmf(pbmc_facs$counts, k = 6, init = "random", numiter = 1,     method = "em", verbose = "detailed", control = list(nc = nc))
An irrecoverable exception occurred. R is aborting now ...
Segmentation fault

The same occurs for any other choice of method. Interestingly, the following example does not:

library(fastTopics)
data(pbmc_facs)
for (method in c('em', 'scd', 'ccd')) {
  for (nc in c(2, 1)) {
    set.seed(0)
    temp <- fit_poisson_nmf(pbmc_facs$counts, k=6, init='random', numiter=1, method='em', verbose='detailed', control=list(nc=nc))
  }
}

R version 4.2.3 (2023-03-15)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Amazon Linux 2

Matrix products: default
BLAS/LAPACK: /home/ec2-user/mambaforge3/envs/fasttopics-dev/lib/libopenblasp-r0.3.26.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] fastTopics_0.6-117

loaded via a namespace (and not attached):
 [1] progress_1.2.3     tidyselect_1.2.0   ashr_2.2-63        purrr_1.0.2       
 [5] pbapply_1.7-2      splines_4.2.3      lattice_0.22-6     colorspace_2.1-0  
 [9] vctrs_0.6.5        generics_0.1.3     viridisLite_0.4.2  htmltools_0.5.7   
[13] MCMCpack_1.7-0     utf8_1.2.4         survival_3.5-8     plotly_4.10.4     
[17] rlang_1.1.3        mixsqp_0.3-54      pillar_1.9.0       glue_1.7.0        
[21] uwot_0.1.16        lifecycle_1.0.4    MatrixModels_0.5-3 munsell_0.5.0     
[25] gtable_0.3.4       htmlwidgets_1.6.4  coda_0.19-4.1      fastmap_1.1.1     
[29] SparseM_1.81       invgamma_1.1       quantreg_5.97      irlba_2.3.5.1     
[33] parallel_4.2.3     fansi_1.0.6        Rcpp_1.0.12        scales_1.3.0      
[37] RcppParallel_5.1.6 jsonlite_1.8.8     truncnorm_1.0-9    mcmc_0.9-8        
[41] hms_1.1.3          ggplot2_3.5.0      digest_0.6.35      Rtsne_0.17        
[45] dplyr_1.1.4        ggrepel_0.9.5      cowplot_1.1.3      grid_4.2.3        
[49] quadprog_1.5-8     tools_4.2.3        cli_3.6.2          magrittr_2.0.3    
[53] lazyeval_0.2.2     tibble_3.2.1       crayon_1.5.2       tidyr_1.3.1       
[57] pkgconfig_2.0.3    MASS_7.3-60.0.1    Matrix_1.6-5       prettyunits_1.2.0 
[61] SQUAREM_2021.1     data.table_1.15.2  httr_1.4.7         R6_2.5.1          
[65] compiler_4.2.3

Improve documentation for fit_topic_model

There is quite a bit in the fit_topic_model documentation that is lacking.

diff_count_clusters not found

Hi!

R cannot find the diff_count_clusters functions. It gives me this error: 'could not find function "diff_count_clusters'.

Do you know why?

Thanks in advance,

Nicolo

Feature request: Batch correction variable?

Hello! Very happy with this software tool. I was wondering if it is possible to include a batch correction variable in the topic calculation in cases where a dataset consists of multiple batches and simple batch effects (of the sort that might be handled with an extra term in a linear model).

Cheers,

Curious about how fast is the CD algorithm compared to SQUAREM

Hi,

Thanks for your work. I am from another field. After reading your paper on the comparison of performance between "CD" and "EM" , I am curious about how fast is the CD algorithm compared to the accelerated EM, eg. SQUAEM package which was proposed by https://onlinelibrary.wiley.com/doi/full/10.1111/j.1467-9469.2007.00585.x

Installation issues

I used the command remotes::install_github('stephenslab/seurat-wrappers') to install the wrappers, but the following error occurred:

Is there a reason for this? I tried both updating the packages recommended by the program as well as not updating them, but it seems to make no difference.
Thank you!

fastTopics on normalized data

Hi!

Thank you very much for developing such a great package!!

I'm planning to use fastTopics to extract common gene modules across different tumor samples. I'm working with Visium data in a very heterogeneous cancer. Since I have variability between and within samples regarding UMI counts and cell density I figured library-size normalization of the the data might help overcome this. However, in your paper and the vignettes you explicitly insist on the usage of the raw counts. What are the risks of using log normalized values?

Thanks, very cool tool ;)

A question on the paper on differentiation expression

Hi there, thanks for the very nice package and paper. I think its an interesting and appealing way to think of single cells than the usual clustering method, and am quite excited for this.

I do have a question on the de part. In the paper, there is this part on calculating pjk instead of using the fjk results by the topic models because the DE analysis is a gene-by-gene analysis, whereas the topic model considers all genes at once. Is there any disadvantages in using the fjk because I assume that the fjk will be more accurate than pjk since that they take account the uncertainty in the topic proportions as well. Thanks!

What is a sensible threshold to select the most informative features per topic?

Dear developers of fastTopics,

Thank you for developing such a wonderful package. I wanted to ask what is, in your opinion, the best way to prioritize important features for a given topic after runnin GoM differential expression analysis. The volcano plots look different across the different topics, for example:

What are the thresholds that you found to be the most meaningful? Should the thersholds be the same per topic, or should they be topic-specific?

On a related question: which rank best reflects the relative importance of each feature? Should I rank features just by postmean (LFC), or would you include the z?

Thanks a lot in advance!

Extracting all the genes for a given topic AND how to change volcano plot

firstly, You briefly mentioned this was possible in another thread, but I was hoping to get a little more clarification for how to extract the genes that are important for any single topic. I have a nice plot shown below, but I was wondering how to get a systematic list of the genes in the top right as you mention in your vignette. Secondly, can you point me in the direction of how to change dot size and/or font color etc. for the volcano plots. Thank you for your help in advance!

To make this plot I call

volcano_plot(dfa_out3,k=1,label_above_quantile = 0.995)

Type I error control for homogenous case

Hello Peter,

I am wondering if the Differential analysis has some kind of type I error control.

If I understand correctly, the model estimates a multinomial model and then tests which genes are differentially expressed.
I might a missed something, but do you need to have the number of components, right? Because when I test for DE on a dummy homogeneous model (see below) I find many "genes" that are significantly differentially expressed.

Could you let me know if I am doing something wrong?

`set.seed(1)
tt <- matrix (rpois(lambda=1 , n=1000*50), ncol = 50)
library(fastTopics)
res <- fit_topic_model(tt, k=2)
plot(res)
de <- de_analysis(res ,tt,pseudocount = 0.1 )
summary(de
)

library(cowplot)
plot_progress(res,x = "iter",add.point.every = 10,colors = "black") +
theme_cowplot(font_size = 10)
volcano_plot(de,k = 1)
`

function error-diff_count_clusters() & diff_count_analysis()

Hi,

Can the structure_plot function be displayed with a weight? Too many color blocks make me unable to distinguish their characteristics.

Second, I loaded FastTopics, but I couldn't use the function. I received the following error:

library(fastTopics)
 
dfc_out <- diff_count_clusters(samples$group,counts)
Error in diff_count_clusters(samples$group, counts) :
  没有"diff_count_clusters"这个函数
dfa_out <- diff_count_analysis(fit_k,counts)
Error in diff_count_analysis(fit_k, counts) :
  没有"diff_count_analysis"这个函数

Thanks!

Best,
Jamie

structure_plot not working on "fit" from "predict()" function

Hello,
I'm following the steps in the vignette to first predict topics in other groups of cells and then plot the structure plot, but I got this error in the latest version of fastTopics: "Error in value[3L] : Invalid selection of loadings".

This is because select_loadings tries to subset Ln in the fit_test from predict, which does not have the same row numbers as fit_test$L in this case.

Thanks so much!

RcppParallel library flags

It took me a while until I found out thatRcppParallel::RcppParallelLibs() generates nothing on my Mac OS laptop. This gave me errors in installation and loading.

After I set the following in .R/Makevar, I had no issues.

TBB = /usr/local/opt/tbb/
TBB_INC = /usr/local/opt/tbb/include
TBB_LIB = /usr/local/opt/tbb/lib

Convergence issues?

Hello,

I've been trying to fit the model on my dataset, but it seems 1000+ iterations I still can't get the likelihood to plateau? There are many plateaus in the fitting, is this normal?

Best,
Chang

handling large datasets

Thank you for this update to the clustcounts package. I am working with a relatively large dataset (~1.2 mill cells), I have subsetted into relevant 'buckets' to find relevant topics beyond traditional 'clustering' using topic modelling. This still yields datasets ranging in size from 30k-300k cells. As such, i had a few questions re: working with large datasets as below:

Do you have any suggestions on how to handle this computationally - particularly in terms of parallelisation to run fit_topic_model to improve time taken to run?
Furthermore, any suggestions on finding an optimum value for k besides trial and error?

Thank you again.

Kind Regards,
Tom

bug in compile_structure_plot_data when a group is empty

A group can be empty after downsampling. In this case n = 0, so the line dat <- data.frame(sample = rep(1:n, times = k), topic = rep(topics, each = n), prop = c(L[, topics])) produces an error (1:n is the vector c(1, 0), so sample has length 2k, but topic has length 0)

Improve documentation for parallel computing options

There are several types of parallel computing going on in the package:

multi-threading within BLAS. Based on some simple benchmarking (https://gist.github.com/aksarkar/664b8ce987f7b6be8934cb2a0216d909), this appears to require being disabled like
```
RhpcBLASctl::blas_set_num_threads(1)
```
multi-threading within RcppParallel, which is used for parallelFor in EM/SCD updates
multi-processing using mclapply, which is used in de_analysis.

Currently, control$nc is used to specify both (2) and (3), which is a bit confusing and should be clarified in the relevant parts of the documentation and verbose output. Compare https://www.rdocumentation.org/packages/parallel/versions/3.4.1/topics/mclapply to https://rcppcore.github.io/RcppParallel/#threads_used

(3) leads to high memory usage which appears to be essentially unavoidable, since every subprocess gets a copy of the entire R environment at the time mclapply is called. This should be mentioned in the docs, because it also leads to unexpected behavior (hanging).

And, (1) is not currently taken care of by the package, which leads to confusing behavior (#37). It might be better to take care of that in this package rather than documenting and expecting the user to take care of it.

Take care of warning, "Quick-TRANSfer stage steps exceeded maximum"

Changing the k-means algorithm used in topic_score, or increasing the number of restarts, may help solve the issue; see here.

Flip columns of X in poisson2binom (branch:poisson2binom)

From @stephens999: Presumably to make the Poisson approximation to binomial as good as possible, you want to flip each column of X so that 1 is the less frequent of the two outcomes before fitting the Poisson model?

feedback on structure_plot

since i tried this out today I thought I'd give some feedback:

i had some issues with the grouping I think because I used strings instead of a factor. Maybe check for this, or
use as.factor to convert strings to factors?
Also had some issues with misspecifying the topics (eg more than were in the fit) and the wrong number of colors,
and it errored out, but only after spending some time doing some calculations. Probably better to fail
on these things before doing any calculations.
It isn't clear to a novice user what all those calculations are being done. Presumably it is computing an ordering?
I wonder if there are faster ways for those who just want a quick plot?

Feature request: multi-threaded DE analysis

Testing for differential expression/accessibility is embarrassingly parallel, and could use all available cores by default.

Update Vignette

In the vignette, the function diff_count_clusters and diff_count_analysis are called to calculate differences in gene contribution to the topics. these don't exist in the package anymore.

After some confusion I deducted that de_analysis() is the correct function name now, might be good to update.

Best and thanks for the great package,
Niko

EDIT: there are more changes, I will update everything I find here :)

parameters for volcano_plot changed. label_above_quantile replaced by do.quantile

Computing topics over normalized data

In the "Analysis of single-cell RNA-seq data, Part 1" vignette you explain that the topic models should be executed over the counts data.

However, can it run over non discrete counts data? For example, when running on the integrated data of two datasets using the Seurat integration procedure?

Thanks,
Gilad Green

Properly handle count matrices with all-zero rows

All-zero rows are allowed for the Poisson NMF model, but don't make sense for the multinomial topic model. I need to add some extra checks to properly handle this special case.

Contract topics feature

Hi,

Thanks for developing CountClust into fastTopics.

Some papers displayed their results using CountClust package in the following way:
2019 Immunity 10.1016/j.immuni.2019.09.004

2021 Nature 10.1038/s41586-021-03188-w

I wonder how to make similar results using fastTopics, to show the top ranked features of topics and their embedding.

Could you give any directions about it in the developing vignettes?

Best,
Xiangyang.

Add more colors to structure_plot

Matthew suggested adding more colors to allow for comparing more topics.

More gracefully handle error in structure_plot when there are not enough colours

When the default setting of colors is used, and the number of topics is more than 9, this error will be generated:

There must be at least as many colours as topics

Error in verify.fit.and.count.matrix(X, fit) : Dimensions of input matrices "X" , "fit$F" and "fit$L" do not agree

When I run de_analysis(), there was an error:

> de <- de_analysis(fit,counts,pseudocount = 0.1, control = list(ns = 1e4,nc = 4))
Error in verify.fit.and.count.matrix(X, fit) : 
  Dimensions of input matrices "X" , "fit$F" and "fit$L" do not agree

The "counts" is a dgCMatrix

> str(counts)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:35272029] 11 19 32 69 111 118 120 158 164 165 ...
  ..@ p       : int [1:15727] 0 1580 1967 3050 4257 6138 7258 8988 10550 11551 ...
  ..@ Dim     : int [1:2] 27197 15726
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:27197] "AL627309.1" "AL669831.5" "FAM87B" "LINC00115" ...
  .. ..$ : chr [1:15726] "CRLM_P2_Colon_P_AGAGCAGGTACAGTAA" "CRLM_P17_Colon_T_GAGTTACCAAGTTTGC-1" "ICC_I02T_CGAGTGCCAGAGGCTA" "ICC_I03P_CTCTGGTAGAGCATAT" ...
  ..@ x       : num [1:35272029] 1 1 1 1 4 1 1 1 1 1 ...
  ..@ factors : list()

And I found that the "fit" only has 24141 genes after run > fit <- fit_topic_model(t(counts),k = n.topics).
So how to solve this issue

Geting Help for FastTopics Error

Hi,

How to solve this error, and use FastTopics to process Seurat Object?

How to use FastTopics to process Seurat Object, demonstrating data is not a standard Seurat Object. Although I extracted some of the information in Seurat Object to build a similar list, I still failed to run, and I encountered many errors. E.g:

structure_plot(fit,colors = topic_colors,topics = 1:16,gap = 25,

```
           grouping = samples)
```

Error in structure_plot(fit, colors = topic_colors, topics = 1:16, gap = 25, :
Input argument "grouping" should be a factor with one entry for each row of fit$L
In addition: Warning message:
In xtfrm.data.frame(x) : cannot xtfrm data frames

Looking forward to your reply！

Best,
Jamie

de_analysis using LDA Topic Modeling

Hey @pcarbo!

I'm really interested in using the de_analysis to analyze our data. However, we've already processed the data and found topics using LDA outside of your package. I have a version of your n x K matrix fit$L . But I think I need more in order to start the differential expression. Do I need to first run init_poisson_nmf? If so, I think I'm confused on what the values of F should be.

Sorry if I missed this somewhere.

fit_poisson_nmf error with k=2

@pcarbo I encountered the following error while fitting fit_poisson_nmf with k=2 (using scd and extrapolate=TRUE).

Error in combn(1:k0, k): length(m) == 1L is not TRUE
Traceback:

1. fit_poisson_nmf(tmp, k = 2, init_method = "topicscore", numiter = 100, 
 .     method = "scd", verbose = FALSE, control = list(extrapolate = TRUE))
2. init_poisson_nmf(X, k = k, init_method = init_method, control = control, 
 .     verbose = verbose)
3. topic_score(X, k)
4. vertex_hunting(R, k0, m)
5. combn(1:k0, k)
6. stopifnot(length(m) == 1L, is.numeric(m))

Considerations for bulk RNAseq

Thanks for the great package and clear documentation. It's been a really insightful tool to use!

I would like to apply it to bulk RNAseq data. Are there any considerations regarding data processing or parameter values to set when running on bulk (as compared to single cell). Or maybe something like GeoMx spatial data that is very "bulk like" in terms of many cells per region of interest. Two things in particular:

Is the greater overdispersion of bulk RNAseq compared to scRNAseq an issue?
I wouldn't apply it to very small RNAseq experiments, but perhaps 50-200 samples and anywhere from 2000 to 15000 genes depending on how much filtering should be applied to help the processing time.

I appreciate any insight you may have.

Allow "fit" argument to structure_plot to also be an "L" matrix, similar to de_analysis.

This would allow more flexible use of the structure_plot function.

de_analysis save intermediate steps + return likelihood optimization requests

Hi @pcarbo

Thank you so much for developing such a great package with very cool functionalities!
I've been using fastTopics in some of my projects recently and have found out that the de_analysis step is very time consuming. I followed your advice from this issue and am currently running it on ~30k genes and 60k cells dataset like this:

de <- de_analysis(
    fit = fit,
    X =  t(mtx_c),
    s = rowSums(t(mtx_c)),
    pseudocount = 0.1,
    control = list(ns = 1e4, nc = 64, nsplit = 64*10),
    verbose = TRUE)

I am working on a slurm job manager system and initially sent it with a time limit for 48h. However, de_analysis had not finished and all progress was lost having to start over again. Would it be possible to add a functionality so that the optimal solution found every 100ns so that in case it fails it can picked up where it left off?

Moreover, would it be possible to return the loss of the MCMC to assess if it has converged?

Hope these suggestions makes sense,
Again, thank you so much for developing fastTopics!!

Feature request: early stopping

When fitting many models with a range of K, I find that the number of epochs needed to get close enough to a local optimum is not identical for different K, and tends to increase for large K.

It would be helpful to instead specify a maximum number of epochs, and an early stopping criterion such as KKT residual < threshold.

Extract top genes for each topic

Hi, I wonder is there any similar function in fastTopics to extract top genes for each topic from the result of fit_topic_model() as ExtractTopFeatures() in CountClust? I have tried to extract genes based on relative expression levels but I got many genes shared in different topics.

fit_poisson_nmf Error in solve.QP(M, d, A, b0)

@pcarbo

I encountered the following error message when fitting k >= 15. (tried 15, 16, 17)

The dimension of the data is 7,490 cell by 2,540 genes. 72% of the input count matrix has zero entries.

Thanks,
Joyce

Error in solve.QP(M, d, A, b0): matrix D in quadratic function is not positive definite!
Traceback:

1. fit_poisson_nmf(tmp, k = i, init_method = "topicscore", numiter = 100, 
 .     method = "scd", verbose = FALSE, control = list(extrapolate = TRUE))
2. init_poisson_nmf(X, k = k, init_method = init_method, control = control, 
 .     verbose = verbose)
3. topic_score(X, k)
4. vertex_hunting(R, k0, m)
5. simplex_dist(X[j, ], X[B[, i], , drop = FALSE])
6. solve.QP(M, d, A, b0)
7. stop("matrix D in quadratic function is not positive definite!")

Imported Count Matrix by Seurat, but Col Names are Cell Names

I imported raw count matrix from my seurat object, using function
counts <- GetAssayData(object,"counts")
And then found out my column names are cell names, unlike the sample dataset. How should I continue fastTopics from my Seurat object?

Implement function to estimate loadings in held-out samples

Fix F, estimate L for the held-out samples, then evaluate the log likelihood of the held-out samples.

In sklearn, the API is fit(X) to estimate L and F, and transform(X) to estimate L for some new X (fixing F). I'd probably not call it predict, since I would expect such a function to return the estimate of Lambda = LF'.

Create vignette to illustrate "projection" of samples onto existing F matrix

Procedure to "project" cells onto a previously estimated F matrix: (a) run init_poisson_nmf, in which F is provide as input: (b) run fit_poisson_nmf, setting update.factors = NULL, which will keep F the same up to a column-scaling; and (c) run poisson2multinom to get the topic model. Note that F can be from a multinomial topic model fit or a Poisson NMF fit. (It doesn't matter.)

Feature request: Gamma prior for NMF loadings

I'd like to request an option to incorporate a Gamma prior for the NMF loadings into the Poisson-NMF likelihood function.

I believe this would make the fastTopics model equivalent to the original Latent Dirichlet Allocation, due to the close relationship between Gamma and Dirichlet distributions (in particular, independent Gammas with shared scale parameter, become Dirichlet after dividing by their sum; e.g., https://en.wikipedia.org/wiki/Dirichlet_distribution#Gamma_distribution).

Practically, I think this would allow fastTopics to encourage sparsity, by setting the Gamma/Dirichlet shape parameter <1 -- this is my main motivation for wanting this feature.

I may be interested to help with a PR, but before I try this, would like to know your opinion on how difficult this would be to implement, and whether you already considered something like this.

Threading init message is printed even when verbose="none"

fastTopics/R/misc.R

Lines 132 to 141 in f40a266

 initialize.multithreading <- function (n) { 

 if (is.na(n)) { 

 setThreadOptions() 

 n <- defaultNumThreads() 

 } else 

 setThreadOptions(numThreads = n) 

 if (n > 1) 

 message(sprintf("Using %d RcppParallel threads.",n)) 

 return(n) 

 }

This produces a lot of noise when calling in a loop, especially through rpy2, e.g.

 26%|██▌       | 76/293 [03:24<09:45,  2.70s/it]R[write to console]: Using 16 RcppParallel threads.

 26%|██▋       | 77/293 [03:27<09:42,  2.70s/it]R[write to console]: Using 16 RcppParallel threads.

 27%|██▋       | 78/293 [03:30<09:38,  2.69s/it]R[write to console]: Using 16 RcppParallel threads.

Suggestions/questions about output

The suggestion is that the output be nicely tabbed so I can more easily tell which columns are which. This is what I'm seeing:

iter loglik(PoisNMF) loglik(multinom) res(KKT)  |F-F'|  |L-L'| nz(F) nz(L) beta
   1 -1.27133229e+07 +Inf 4.40e+03 5.0e+00 1.1e+01 0.443 0.552 0.00

Maybe this isn't an issue if it's intended to be output as a delimited file and read in for analysis. The question though is that in my two examples I've been using (untransformed scRNA-seq datasets) I get loglik(multinom) = +Inf for all iterations (100 EM and 100 SCD). Are there circumstances where that should be happening? The call I'm using is just

fit.fasttopics <- fit_topic_model(
      dat,
      k = K,
      verbose = "detailed"
    )

Finally the description of the output in the fit_poisson_NMF docs (version 0.5-52) does not exactly correspond to the output as it's rendered. I understand that this could be in flux and maybe not worth updating every time but eventually this could be addressed.

count not find function "diff_count_clusters"

Thanks a lot for developing this great package.

I installed the 'fastTopics' through remotes::install_github("stephenslab/fastTopics"),
I am following the tutorial, but at interpreting topics by analyzing gene expression differences error occurs saying cannot find the diff_count_cluster function.

Thanks a lot,
Guan

R4.0.2 and sessionInfo:
attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] Matrix_1.3-2 patchwork_1.1.0 cowplot_1.1.0 ggplot2_3.3.2 rstudioapi_0.12 fastTopics_0.6-98

loaded via a namespace (and not attached):
[1] ggrepel_0.9.1 Rcpp_1.0.5 invgamma_1.1 lattice_0.20-41 tidyr_1.1.2 prettyunits_1.1.1 digest_0.6.27
[8] truncnorm_1.0-8 R6_2.5.0 MatrixModels_0.4-1 coda_0.19-4 httr_1.4.2 pillar_1.4.6 rlang_0.4.11
[15] progress_1.2.2 lazyeval_0.2.2 data.table_1.14.2 irlba_2.3.3 SparseM_1.78 labeling_0.4.2 Rtsne_0.15
[22] htmlwidgets_1.5.2 munsell_0.5.0 uwot_0.1.8 mixsqp_0.3-43 tinytex_0.27 compiler_4.0.2 xfun_0.19
[29] pkgconfig_2.0.3 SQUAREM_2020.5 mcmc_0.9-7 htmltools_0.5.1.1 tidyselect_1.1.0 tibble_3.0.4 quadprog_1.5-8
[36] matrixStats_0.57.0 viridisLite_0.3.0 crayon_1.3.4 dplyr_1.0.2 conquer_1.0.2 withr_2.3.0 MASS_7.3-53
[43] grid_4.0.2 jsonlite_1.7.1 gtable_0.3.0 lifecycle_0.2.0 magrittr_1.5 scales_1.1.1 RcppParallel_5.0.2
[50] pbapply_1.4-3 farver_2.0.3 ellipsis_0.3.1 generics_0.1.0 vctrs_0.3.4 tools_4.0.2 glue_1.4.2
[57] purrr_0.3.4 hms_0.5.3 parallel_4.0.2 colorspace_1.4-1 ashr_2.2-54 plotly_4.9.2.1 quantreg_5.86
[64] MCMCpack_1.5-0

fastTopics extends

	initialize.multithreading <- function (n) {
	if (is.na(n)) {
	setThreadOptions()
	n <- defaultNumThreads()
	} else
	setThreadOptions(numThreads = n)
	if (n > 1)
	message(sprintf("Using %d RcppParallel threads.",n))
	return(n)
	}