
alra's People

Contributors

georgeseif, inoue0426, junzhao1990, linqiaozhi, rcannood


alra's Issues

Running ALRA with multiple samples

I am working on a project to compare multiple samples. Is it better to run ALRA on each sample individually or to merge them all together and then run ALRA on the whole combined dataset?

The full dataset contains samples from different chemistries and different biological conditions, and were collected by various users across years, so it is safe to assume that there are both technical batch effects and biological differences overlaid on one another.

Curious if you have any feedback or experience with this question.

S

Input data structure

For my SingleCellExperiment input, the rows are genes and the columns are cells. Is this the correct input structure? The README suggests cells as rows.
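ALRA's functions expect cells as rows and genes as columns, so a genes-by-cells matrix from a SingleCellExperiment would need transposing first. A minimal sketch, assuming alra.R from the repository has been sourced and `sce` is a hypothetical SingleCellExperiment object:

```r
library(SingleCellExperiment)

# ALRA expects cells as rows and genes as columns, so transpose
# the genes-x-cells counts matrix before normalizing.
A <- t(as.matrix(counts(sce)))   # now cells x genes
A_norm <- normalize_data(A)      # ALRA's library/log normalization
result <- alra(A_norm)           # result[[3]] holds the completed matrix
```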

GetAssayData(pbmc, slot = "counts") gives 0 x 0 matrix

Hi,

I have been doing imputation with ALRA for the past few days, and the results seem really promising for my data, as it keeps the biological variation intact. However, I need to feed this data into SCENIC to find the active transcription factors. For this I need the exprMat of the data, which I usually derive using the following code:

exprMat<-GetAssayData(seurat, slot = "counts")
exprMat<-as.matrix(exprMat)

However, this is giving me a 0 x 0 matrix. How can I get the exprMat from the ALRA data?
@JunZhao1990 @linqiaozhi

Thank you in advance
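A likely explanation: RunALRA stores the imputed values in the "data" slot of a new "alra" assay and leaves that assay's "counts" slot empty, hence the 0 x 0 matrix. A sketch of a workaround, assuming the object was processed with RunALRA:

```r
# The imputed values live in the "data" slot of the "alra" assay;
# the "counts" slot of that assay is empty, hence the 0 x 0 result.
exprMat <- as.matrix(GetAssayData(seurat, assay = "alra", slot = "data"))

# If SCENIC specifically needs raw counts, pull them from the RNA assay:
countMat <- as.matrix(GetAssayData(seurat, assay = "RNA", slot = "counts"))
```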

ALRAChooseKPlot throws error

Hello Developers and Maintainers!!
@mojaveazure @rcannood @JunZhao1990 @inoue0426 @linqiaozhi

I ran into an issue while trying to use the ALRAChooseKPlot function as follows:

> Assays(t)
[1] "RNA"        "integrated"
> DefaultAssay(t)
[1] "RNA"
> imput <- SeuratWrappers::RunALRA(t,assay = "RNA",slot = "data",k.only = T)
Chose rank k = 40, WITHOUT performing ALRA
Warning message:
In asMethod(object) :
  sparse->dense coercion: allocating vector of size 1.9 GiB
> ggouts <- ALRAChooseKPlot(imput)
Error in data.frame(x = 2:length(x = d), y = pvals) : 
  arguments imply differing number of rows: 99, 0

Besides, the k value differs from when I run ALRA directly with RunALRA, where k is chosen as 29:

> imput <- SeuratWrappers::RunALRA(t,assay = "RNA",slot = "data")
Rank k = 29
Identifying non-zero values
Computing Randomized SVD
Find the 0.001000 quantile of each gene
Thresholding by the most negative value of each gene
Scaling all except for 1433 columns
0.00% of the values became negative in the scaling process and were set to zero
The matrix went from 0.50% nonzero to 17.98% nonzero
Setting default assay as alra
Warning messages:
1: In asMethod(object) :
  sparse->dense coercion: allocating vector of size 1.9 GiB
2: In asMethod(object) :
  sparse->dense coercion: allocating vector of size 1.9 GiB

When using functions from your package the k value suggested is again different.

A_norm <- normalize_data(t(as.matrix(GetAssayData(t,slot = 'count',assay = 'RNA'))))
k_choice <- choose_k(A_norm)

library(ggplot2)
library(gridExtra)
df <- data.frame(x=1:100,y=k_choice$d)
g1 <- ggplot(df, aes(x=x, y=y)) + geom_point(size=1) + geom_line(size=0.5) + geom_vline(xintercept=k_choice$k) + theme(axis.title.x=element_blank()) + scale_x_continuous(breaks=seq(10,100,10)) + ylab('s_i') + ggtitle('Singular values')
df <- data.frame(x=2:100, y=diff(k_choice$d))[3:99,]
g2 <- ggplot(df, aes(x=x, y=y)) + geom_point(size=1) + geom_line(size=0.5) + geom_vline(xintercept=k_choice$k+1) + theme(axis.title.x=element_blank()) + scale_x_continuous(breaks=seq(10,100,10)) + ylab('s_{i} - s_{i-1}') + ggtitle('Singular value spacings')
grid.arrange(g1,g2,nrow=1)


I also checked Seurat-normalized data with your package's function, and again the results vary. This time, however, the k is similar to the one I get from SeuratWrappers::RunALRA(t, assay = "RNA", slot = "data", k.only = T):

A_norm <- t(as.matrix(GetAssayData(t,slot = 'data',assay = 'RNA')))
Warning message:
In asMethod(object) :
  sparse->dense coercion: allocating vector of size 1.9 GiB
> k_choice <- choose_k(A_norm)
> k_choice$k
[1] 40

ALRA and Seurat

I see in the README that ALRA was previously integrated into Seurat via the RunALRA function. Why is this no longer the case? I considered reopening issue 15 but chose not to, because my question isn't how to use ALRA with Seurat; that I can do.

Run ALRA in Seurat v4.1.0

Hi,

I notice there is a RunALRA() function in Seurat v3.1.4, but I am using Seurat v4.1.0 and there is no such function. Could you suggest how to run ALRA in Seurat v4.1.0?

Many thanks!
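For reference, RunALRA was moved out of Seurat into the separate SeuratWrappers package, which is the usual way to run it with Seurat v4. A sketch, assuming `seu` is a hypothetical normalized Seurat object:

```r
# RunALRA() now lives in SeuratWrappers rather than in Seurat itself.
remotes::install_github("satijalab/seurat-wrappers")
library(SeuratWrappers)

seu <- RunALRA(seu)   # adds an "alra" assay with the imputed values
```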

Input and Output

Dear author,

Thanks for developing the software! I have a few questions that I could not find a clear answer to in your paper:

  1. Is the input a count matrix?
  2. Is the output a count matrix? Has it been normalized by library size, or log2-transformed?

Thanks!
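As a minimal sketch of the intended pipeline (assuming alra.R from the repository has been sourced): alra() expects a library-size-normalized, log-transformed cells-by-genes matrix rather than raw counts, and the completed matrix it returns stays on that same log scale.

```r
# A: raw count matrix, cells x genes
A_norm <- normalize_data(A)   # library-size normalize, then log-transform
result <- alra(A_norm)
A_completed <- result[[3]]    # imputed matrix, still log-normalized
```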

Which `selection.method` to use for FindingVariableFeatures on ALRA imputed data

Hi @linqiaozhi @JunZhao1990 @rcannood @inoue0426

I was following this issue where @ChristophH mentions that

Results should not be very different from using the original "count" data. Generally, using the "data" slot should work with the "vst" method as long as the loess fit can capture the mean-variance relationship.

Also, @linqiaozhi suggests

"The vst selection method uses count data and does not use the ALRA-imputed data; please use mean.var.plot instead if you would like to find the variable genes based on the imputed data."

So I decided to see whether this mean-variance relationship could be captured better by the vst or the mean.var.plot method of Seurat. Unlike mca (Malaria Cell Atlas), which I wish to use as a reference and didn't perform imputation on, some cells in my samples (t1, n1) show some deviation from the linear relationship. Is this slight deviation anticipated?

I also observe that the standardized variance for the imputed data is based at 1, unlike MCA, which is based at zero. Will this be a problem when I integrate these samples with MCA? I am trying to resolve the problem of the JackStraw plot showing all PCs as significant, which I discuss in another issue, and I thought the nature of the imputed data or the feature-selection method might be influencing this.


Tips for larger matrices?

I am working with a matrix that has 53201 cells and 20245 genes.

Its size in memory is only 482 MB as a dgCMatrix, but 8.62 GB after as.matrix().

When I try RunALRA from Seurat, I get:

Error: vector memory exhausted (limit reached?)

The same happens if I run alra(A_norm = as.matrix(normRNA), use.mkl = FALSE) or with use.mkl = TRUE (only that with TRUE it takes a lot longer to show the error).

Do you have any suggestions for how to run on large matrices on a laptop?
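The error comes from densifying the sparse matrix: a dgCMatrix balloons to roughly nrow * ncol * 8 bytes as a dense double matrix. A quick back-of-the-envelope check in R (using the dimensions reported above):

```r
# Estimate the dense memory footprint before calling as.matrix():
# a dense double matrix needs 8 bytes per entry.
n_cells <- 53201
n_genes <- 20245
dense_gb <- n_cells * n_genes * 8 / 1e9
dense_gb   # about 8.6 GB, matching the as.matrix() size reported above
```

If that estimate exceeds available RAM, running on a machine with more memory (or subsetting first) is likely the only practical option on a laptop.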

marker genes not specific for cluster after imputation

Hi, I imputed my dataset with ALRA in Seurat. After the imputation I integrated the alra assay and calculated PCA, UMAP, and clusters. The problem is that when I look for the top markers per cluster, using alra as the assay, I don't find anything specific for that cluster. Should I use the RNA assay instead? Or what is the problem?

alraSeurat2 FUNCTION NOT FOUND!!!!

Hello Everyone,
I have installed the rsvd package, but when I try to run the alraSeurat2 function, it returns an error saying that the function is not found.
Any idea what is going wrong?
Thank you.

In nrow(A_norm) * ncol(A_norm) : NAs produced by integer overflow

I am working with a particularly large dataset consisting of 2,423,133 cells and 1,091 genes. Running ALRA, I hit repeated warnings about integer overflow. Here is the output:

r$> A_norm_completed <- alra(A_norm,k=k_choice$k)[[3]]
Read matrix with 2423133 cells and 1091 genes
Getting nonzeros
Randomized SVD
Find the 0.001000 quantile of each gene
Sweep
Scaling all except for 0 columns
NA% of the values became negative in the scaling process and were set to zero
The matrix went from NA% nonzero to NA% nonzero
Warning messages:
1: In nrow(A_norm) * ncol(A_norm) : NAs produced by integer overflow
2: In nrow(A_norm) * ncol(A_norm) : NAs produced by integer overflow
3: In nrow(A_norm) * ncol(A_norm) : NAs produced by integer overflow


The completion of zeros does not seem to work well.

Could you please provide any suggestions on how to mitigate this issue? Is there a recommended approach to handling large datasets with your software, or perhaps a parameter adjustment that I might not be aware of?
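The warning itself points at the cause: nrow() and ncol() return integers, and their product overflows past 2^31 - 1 for a matrix this large, producing NA in the percentage calculations. A sketch of the usual R workaround:

```r
# nrow(A_norm) * ncol(A_norm) overflows R's 32-bit integers past 2^31 - 1.
# Converting to double before multiplying avoids the NA:
n <- as.numeric(nrow(A_norm)) * ncol(A_norm)   # safe
# or equivalently:
n <- prod(dim(A_norm))                          # prod() returns a double
```

Note this only fixes the reported percentages; the completion itself runs on the same values either way.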

Large matrix errors (more than 2^31-1 non-zero entries) [on large datasets]

I'm working with a matrix of over 200,000 cells and 36,000 genes.

I first tried the RunALRA function in Seurat. Then I extracted the expression table, converted it into a matrix, and attempted to use ALRA (including alra.low.memory), but encountered the following error:

"Attempting to construct a sparseMatrix with at least 2^31-1 non-zero entries."

It appears that the dgCMatrix conversion fails because the matrix exceeds this limit. Modifying the ALRA code to use a dense matrix format instead of dgCMatrix is possible, but running it realistically is challenging because the dense matrix is nearly 100 GB.

If there is a function or method to address this issue with large datasets like mine, I would appreciate any suggestions. Below are the alternatives I am currently considering. I would be grateful if you could share your opinions on them as well.

Currently, I am considering the following three alternatives:

Alternative A: imputation is performed per sample, and the results are integrated into one object. However, based on the experiences of other users in this issue tracker, it seems that normalizing and imputing the integrated data yields more accurate results.

Alternative B: after normalizing the integrated data, imputation is performed on a reduced set of genes. However, the trends may differ from those seen when imputing with the full gene set.

Alternative C: (if cell-type information is known) normalize the integrated data, subset it by cell type, impute each cell type separately, and then integrate again. I think this alternative has the advantage of allowing the use of every gene, and it respects the biological point that certain genes may not be expressed at all, or may be expressed only in certain cell types. Since normalization was performed over the whole cell population, I believe it will still be possible to compare expression levels between cell types in the integrated data after per-cell-type ALRA imputation. If there are any suggestions for revising my thoughts, I would appreciate hearing them.

big data set and K

Hey ALRA team,
I would like to ask about ALRA's performance on very large datasets (~600k cells). I am using the scanpy pipeline and I have two questions:

  1. I noticed that in your article an excessively large value of k seems to have little effect on the results. Is it appropriate to use the default parameter of k = 50?
  2. I found that after subsetting the data, ALRA seemed to perform better. Is this related to the k value?

By the way, ALRA provides the best experience in certain aspects! ^_^
Looking forward to your reply.

help with alra

Dear all,

I am kind of lost how to deal with the results of ALRA.
When I process a Seurat object with ALRA, I get a new assay. How do I then use the ALRA-computed data to do a UMAP or t-SNE reduction based on it? It seems that whatever I do, the result of both looks the same as the one produced by the standard pipeline after SCTransform.

I would greatly appreciate a jump start in this matter.

Thanks and best,
Matthias
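One likely reason the reductions look unchanged is that they are still being computed from the previous default assay. A sketch of recomputing them from the imputed values, assuming `seu` is a hypothetical Seurat object that already carries an "alra" assay:

```r
# Switch the default assay so downstream steps read the imputed values,
# then redo feature selection, scaling, PCA, and UMAP from that assay.
DefaultAssay(seu) <- "alra"
seu <- FindVariableFeatures(seu, selection.method = "mean.var.plot")
seu <- ScaleData(seu)
seu <- RunPCA(seu)
seu <- RunUMAP(seu, dims = 1:20)   # dims chosen for illustration only
```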

Provided normalized function

According to the code below, the provided normalize_data function seems to remove rows whose total count equals zero.

If I want to merge the computed result back into the original data, I have to follow these steps:

  1. loop over each row of the original data
  2. check whether the current row has a total count of 0 or greater:
     if total_count > 0 then replace(original_data_row[o_index], imputed_data_row[i_index])

Although this approach may yield a satisfactory result, the performance will suffer greatly, which really matters for a big dataset.

Do you have any suggestions for improving the performance in this case?

normalize_data <- function (A) {
  # Simple convenience function to library- and log-normalize a matrix
  # (rows are cells, columns are genes)

  totalUMIPerCell <- rowSums(A)
  if (any(totalUMIPerCell == 0)) {
    toRemove <- which(totalUMIPerCell == 0)
    A <- A[-toRemove, ]
    totalUMIPerCell <- totalUMIPerCell[-toRemove]
    cat(sprintf("Removed %d cells which did not express any genes\n", length(toRemove)))
  }

  A_norm <- sweep(A, 1, totalUMIPerCell, '/')
  A_norm <- A_norm * 10E3
  A_norm <- log(A_norm + 1)
  A_norm  # return the normalized matrix explicitly
}
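The row-by-row loop sketched above can be replaced with a single vectorized assignment, since the rows normalize_data() drops are exactly those with a zero total count. A sketch, where `A` is the original cells-by-genes matrix and `A_completed` holds the imputed rows:

```r
# Compute the kept-row mask once, then write the imputed rows back
# in one vectorized assignment instead of looping over rows.
keep <- rowSums(A) > 0                 # rows normalize_data() retained
A_merged <- matrix(0, nrow(A), ncol(A), dimnames = dimnames(A))
A_merged[keep, ] <- A_completed        # imputed values for the kept rows
```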

Input format (genes as rows?)

Thank you for the interesting approach to dropout imputation! I know it's documented, but different parts of the documentation say different things: should the input matrix have genes as rows and cells as columns or the other way around? Thanks!

Best Practice for imputing big data-sets

Hey ALRA team,
I would like to ask about your recommendations for imputing a big dataset (~250k cells, 10x RNA-seq data). I am using Seurat v3, and I would like to know at which step I should start the imputation in the standard workflow and in the new Seurat v3 integration workflow.
Also, do you recommend certain parameters for such a big dataset? Finally, do you have an estimate of the computation time and resource requirements for the imputation step?

Thanks in advance for your time.

Best,
Abdelrahman

During FindVariableFeatures after ALRA output, count slot is empty?

Hi

I was trying to use ALRA through Seurat 4.0: I normalize the data and then run ALRA. I noticed that ALRA creates a new assay called "alra". I assume that under the "alra" assay, the data slot is our "updated" log-normalized data, right? However, the counts slot is empty, which leads to the following warning when I use the FindVariableFeatures function:

In FindVariableFeatures.Assay(object = assay.data, selection.method = selection.method, :
selection.method set to 'vst' but count slot is empty; will use data slot instead

Will this cause any trouble in the following analysis?
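Since the "alra" assay intentionally carries no counts and "vst" wants count data, one way to avoid the warning is to select features from the data slot explicitly. A sketch, assuming `seu` is a hypothetical Seurat object with the "alra" assay present:

```r
# "mean.var.plot" works from the (imputed, log-normalized) data slot,
# so it sidesteps the empty-counts warning that "vst" triggers.
DefaultAssay(seu) <- "alra"
seu <- FindVariableFeatures(seu, selection.method = "mean.var.plot")
```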

Imputation before other QC steps?

Hi there,
I'm assuming this should be run before the cell/gene filtering steps for multiplets and empty droplets? Just want to double check beforehand.
Thanks,
AT

Differential expression

Hello,

First of all, thank you for this tool. ALRA seems really convincing and performs really fast. I wanted to ask a few questions regarding the use of ALRA. The data I am using contains 6 samples from 6 different mice, 3 of which are KO for a certain gene. I want to compare the 3 KO to the 3 WT.

  1. Should ALRA be run on data that was normalized using Seurat's SCTransform() function?
  2. If I am using ALRA on a few samples that I want to integrate using Seurat, is it OK to run ALRA on each sample individually, before integration? I am using Seurat's SCTransform integration pipeline.
  3. After integration, can I use the data imputed by ALRA to perform differential expression analysis?
  4. After integration, should I use some tool to remove batch effects between samples (3 KO and 3 WT individually)?

Need to deal with R 4.2.0

Hi,

I found that this library doesn't work because of a change in R 4.2.0.

if (class(A_norm) != 'matrix') { no longer works; it should use if (!(any(grepl("matrix", class(A_norm), ignore.case = TRUE)))) {

Can I make a Pull Request for that?
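For context, since R 4.0 class() on a matrix returns c("matrix", "array"), and since R 4.2.0 a condition of length greater than 1 in if() is an error rather than a warning, which is why the old check breaks. A common idiomatic alternative to the grepl() approach sketched above:

```r
# inherits() (or is.matrix()) handles the two-element class vector
# that class() returns for matrices since R 4.0.
if (!inherits(A_norm, "matrix")) {
  stop("A_norm must be a base matrix")
}
```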

Error in asMethod(object)

I am trying to run RunALRA with a Seurat object containing ~120,000 cells, but I receive this error:

Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 102
Calls: RunALRA ... t -> as.matrix -> as.matrix.Matrix -> as -> asMethod
Execution halted

Is there any way around this please?

Imputing integrated data

Hi,
Thank you for the nice tool. Is there any way to use ALRA on integrated data? Currently, if we use ALRA, it seems to mask the difference in expression between treatment and control. One option would be to split the object and run ALRA separately, but when combining the objects back, the alra assay is not preserved.

Thanks
