niaid / dsb Goto Github PK

View Code? Open in Web Editor NEW

61.0 61.0 13.0 10.6 MB

Normalize CITEseq Data

License: Other

R 100.00%

cite-seq niaid-tsang-lab

dsb's People

Contributors

Stargazers

Watchers

Forkers

igordot danjong99 sandy-xu biswabio pedromilanezalmeida diegoalexespi rnaimehaom jing-xinxing jang1563 gasmib zmokhtari tsanglabcsei

dsb's Issues

Log transform DSB-normalised corrected ADT counts

Hi @MattPM,

we used DSB to normalize ADT counts of a CITEseq dataset (multiple timepoints per patient; DSB normalization was done for each point of time separatly) and it worked pretty well! :) .To integrate the different timepoints based on both modalities (ADT+RNA) by WNN we first need to integrate the DSB-normalised ADT counts and RNA counts, respectively by Scanorama. Scanorama requires normalized + log(2) transformed data. Hence, we would like to log2(x+1) transform the DSB normalised counts.

For some antibodies we see negative values. Since log2 transformation of zero/negative values is not possible we shifted the whole dataset using log2(x+(min(x)*-1+1)). Most of the negative values in all datasets to be integrated are between 0 and -5. For some datasets, however, we see a handful of antibodies containing outliers with very high negative values (<-50). Since the min of each dataset is very different the log2 transformed DSB corrected datasets are not comparable any longer.

So how do you handle log(2) transformation of DSB corrected datasets to ensure comparability of ADT counts between datasets (different timepoints per patient)?

Thanks for your suggestions! :)

as.matrix conversion: memory issues

Hi,
I am encountering memory problems when I try to convert my Negative ADT from sparse matrix to matrix:

neg_adt_matrix = GetAssayData(NegativeObject, assay = "ADT", slot = 'counts') %>% as.matrix()
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

Here, because no hashing was performed, I used as negative all droplets with total UMI counts between 0 and 40 and the resulting matrix is 294 x 20189901.

Could you suggest some workaround? I could probably select a smaller number of empty droplets, but I am not sure if it will be representative sample in this case?

Sessioninfo:
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /home/daliya/anaconda3/lib/libmkl_rt.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_BE.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=de_BE.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=de_BE.UTF-8 LC_NAME=de_BE.UTF-8 LC_ADDRESS=de_BE.UTF-8 LC_TELEPHONE=de_BE.UTF-8
[11] LC_MEASUREMENT=de_BE.UTF-8 LC_IDENTIFICATION=de_BE.UTF-8

attached base packages:
[1] grid stats graphics grDevices utils datasets methods base

other attached packages:
[1] VennDiagram_1.6.20 futile.logger_1.4.3 dsb_0.1.0 Matrix_1.2-18 plyr_1.8.6 ggrepel_0.8.2 xlsx_0.6.3
[8] cowplot_1.0.0 SoupX_1.4.5 dplyr_1.0.0 Seurat_3.2.0 ggplot2_3.3.2

loaded via a namespace (and not attached):
[1] backports_1.1.8 igraph_1.2.5 lazyeval_0.2.2 splines_4.0.2 listenv_0.8.0 usethis_1.6.1 digest_0.6.25
[8] htmltools_0.5.0 fansi_0.4.1 magrittr_1.5 memoise_1.1.0 tensor_1.5 cluster_2.1.0 ROCR_1.0-11
[15] limma_3.44.3 remotes_2.1.1 globals_0.12.5 prettyunits_1.1.1 colorspace_1.4-1 rappdirs_0.3.1 xfun_0.15
[22] callr_3.4.3 crayon_1.3.4 jsonlite_1.7.0 textTinyR_1.1.3 spatstat_1.64-1 spatstat.data_1.4-3 survival_3.1-12
[29] zoo_1.8-8 ape_5.4 glue_1.4.1 polyclip_1.10-0 gtable_0.3.0 leiden_0.3.3 pkgbuild_1.1.0
[36] future.apply_1.6.0 abind_1.4-5 scales_1.1.1 futile.options_1.0.1 miniUI_0.1.1.1 Rcpp_1.0.5 viridisLite_0.3.0
[43] xtable_1.8-4 reticulate_1.16 rsvd_1.0.3 mclust_5.4.6 htmlwidgets_1.5.1 httr_1.4.1 RColorBrewer_1.1-2
[50] ellipsis_0.3.1 ica_1.0-2 pkgconfig_2.0.3 rJava_0.9-13 farver_2.0.3 uwot_0.1.8 deldir_0.1-28
[57] utf8_1.1.4 tidyselect_1.1.0 labeling_0.3 rlang_0.4.7 reshape2_1.4.4 later_1.1.0.1 munsell_0.5.0
[64] tools_4.0.2 cli_2.0.2 generics_0.0.2 devtools_2.3.0 ggridges_0.5.2 evaluate_0.14 stringr_1.4.0
[71] fastmap_1.0.1 yaml_2.2.1 goftest_1.2-2 processx_3.4.3 knitr_1.29 fs_1.4.2 fitdistrplus_1.1-1
[78] purrr_0.3.4 RANN_2.6.1 pbapply_1.4-2 future_1.18.0 nlme_3.1-147 mime_0.9 formatR_1.7
[85] compiler_4.0.2 rstudioapi_0.11 plotly_4.9.2.1 curl_4.3 png_0.1-7 testthat_2.3.2 spatstat.utils_1.17-0
[92] tibble_3.0.3 stringi_1.4.6 ps_1.3.3 desc_1.2.0 lattice_0.20-41 vctrs_0.3.2 pillar_1.4.6
[99] lifecycle_0.2.0 lmtest_0.9-37 RcppAnnoy_0.0.16 data.table_1.12.8 irlba_2.3.3 httpuv_1.5.4 patchwork_1.0.1.9000
[106] R6_2.4.1 promises_1.1.1 KernSmooth_2.23-17 gridExtra_2.3 sessioninfo_1.1.1 codetools_0.2-16 lambda.r_1.2.4
[113] MASS_7.3-51.6 assertthat_0.2.1 pkgload_1.1.0 xlsxjars_0.6.1 rprojroot_1.3-2 withr_2.2.0 sctransform_0.2.1
[120] mgcv_1.8-31 parallel_4.0.2 rpart_4.1-15 tidyr_1.1.0 rmarkdown_2.3 Rtsne_0.15 shiny_1.5.0
[127] base64enc_0.1-3

Thanks!
Daliya

Question: Workflow using HTOdemux

Hello,

First of all, thanks a lot for this package and all the documentation that you have attached with it, it have helped me a lot.

However, I still have few questions and I would like to know your thoughts.

My data includes ADTs (24 + 1 isotype ctrl) and Hashtags (6). We decided to use only one isotype control because is the most abundant in our panel (and to make it more cost/effective(?)).
So here, since I don't have a control for all the abs do you still advise to use the data from that one isotype?

Since I have the data from the Hashtags, I've been trying to run HTOdemux in the raw data but it get stuck at the end (when all Cutoff have been set) so I have to kill it (I also tried running it overnight, but it doesn't finish).

Therefore, I started using HTOdemux in the "filtered" data and runs smoothly. But I realized that when I use HTOdemux after "NormalizeData" I get more "Singlets" (11274 out of 13190 total cells vs 9365 out of 13190 total cells).
I think that using the "Negative" cells from Normalized HTO works better, but I wanted to know why you wouldn't normalize prior HTOdemux?
Then, I add the cells detected as Negative from HTOdemux to the "background_drops" that I selected using your pipeline (background_drops = md[md$prot_size > 1.5 & md$prot_size < 2.5 & md$ngene < 100, ]$bc) and I use this matrix of Negative + Background to run with "DSBNormalizeProtein". Then, the normalized proteins data are added to my Seurat object that was demultiplexed. Does it makes sense this approach?

Thanks a lot

Stephania

Integrating RNA+DSB using Seurats v4 WNN

Hi @MattPM ,

Great package, its working quiet well!

I was wondering on what you believe is the best way to integrate RNA with DSB normalised CITE values to use Seurats v4 latest WNN method.

Thanks!

choice of graph to use for clustering unclear?

Hi,
I noticed that you made interesting different decisions regarding which graph to use for FindClusters function. In your tutorial, you use the knn graph when using direct dsb values, i know that by defualt snn graph goes into the clustering function, is there a reason why?

why not "dsb_wnn" ??

However, here you use the default snn as expected

cellranger Vs CITE-seq-count

Hi @MattPM ,

I'm just coming back to your package, and I noticed in your vignette you have:

# read in HTO data (output from citeseq count)
hto_data = Read10X('citeseq_count/HTO_umi_40k/', gene.column=1)

# read in ADT data (output from citeseq count)
adt_data = Read10X('citeseq_count/ADT_umi_40k/', gene.column=1)

# read in mRNA data (output from 10x cellranger)
rna_data = Read10X(data.dir = "cellranger_count/raw_feature_bc_matrix/")

Can't I just use all the ADT+HTO+RNA outputs from cellranger? Or is there a reason why I shouldn't align ADT with cellranger?

Error in quantile.default(x, seq(from = 0, to = 1, length = n)) :

The following error was encountered while running DSBNormalizeProtein:

Error in quantile.default(x, seq(from = 0, to = 1, length = n)) :
missing values and NaN's not allowed if 'na.rm' is FALSE

Thank you!

DSB with high throughput method

Hello! I recently completed a large CITE-seq experiment where I pulled 8 samples together for high throughput sequencing, so I have a hashtag antibody that I stained each cell with. Does the DSB workflow still work with this particular dataset or will the strong expression of the hashtags skew the protein expression?

Thank you for your time.

Do Isotype Control Values have to be Subtracted?

Since we have not specified which isotype control corresponds to which antibody, rather simply which of our antibodies are isotype controls.

Then, after normalization, do we need to subtract the isotype control values from their respective antibodies?

This is prior to interpreting AverageExpression() and FindMarkers() using Seurat.

Variation due to library size remains in normalized counts.

Good evening,

I have a 10x Genomics CITE-Seq data set, stained with a cocktail of ≈ 130 antibodies. I tried applying the tool, since I like the rational behind it. There are several isotype controls in the cocktail and I ran it in the following configuration:

 DSBNormalizeProtein(cell_protein_matrix = counts(ab.sce.x), empty_drop_matrix = counts(sce.empty.x), 
                                    isotype.control.name.vec = isotype.ctrl, 
                                    denoise.counts = TRUE, use.isotype.control = TRUE,  define.pseudocount = FALSE, 
                                    quantile.clipping = FALSE, scale.factor = "standardize", return.stats = TRUE)

Unfortunately, I find variation due to library size remains in the final output. Exemplified by this UMAP representation of one sample (colour indicates ADT library size).

Many thanks,
M

AB isotypes needed?

Hi, cool method!
Can you clarify what you use as criteria to select AB-isotypes?
In your paper these are the mouse AB within the human cells experiment, so any unrelated species AB, which is not expected to bind to human cells, would do?
I'm planning a citeseq experiment and i want to use your normalization strategy, how is it effective without isotypes to do batch correction? ( i know it's 2 separate problems, I'm just jumping to the conclusion here!)
thanks!

Additional normalization?

Hi, thank you for providing a great package!!
I am using dsb and Seurat v3 package.

Seurat recommends centered log-ratio (CLR) normalization for CITE-seq data.
(https://satijalab.org/seurat/v3.1/multimodal_vignette.html)
cbmc <- NormalizeData(cbmc, assay = "ADT", normalization.method = "CLR")
cbmc <- ScaleData(cbmc, assay = "ADT")

For dsb-normalized CITE-seq data, do we need to perform further normalization using Seurat? Or is it not recommended?

Integrated multiple samples

Hi,
Thanks for the excellent method for CITE-seq!
I find the vignettes from "https://cran.r-project.org/web/packages/dsb/vignettes/end_to_end_workflow.html" only deal with one sample.
If I have multiple samples (more than 3) CITE-seq data and hope to integrate together by anchors to correct the batch, how to deal with this circumstance, may I use the dsb normalized directly?

immune.anchors_adt <- FindIntegrationAnchors(object.list = immune.combined_adt.list, anchor.features = features)
immune.combined_adt <- IntegrateData(anchorset = immune.anchors_adt, new.assay.name = "integrated.adt")

Thanks.

Adding a tag

Will you please add a tag/version so that conda users can build a skeleton?

The idea is explained here: conda/conda#6674

Thanks.

multiplexing experiments section?

Hello - we're very interested in trying dsb to normalize multi-lane 10x ADT data. in your README you refer to a "multiplexing experiments section". Sorry if I missed this, but does that section exist somewhere? Thanks.

DropletUtils::EmptyDrops?

hello -
i notice in your example you are setting somewhat manual filter thresholds on the ADT and GEX droplets. have you tried using other algorithms, like DropUtils::emptyDrops (https://rdrr.io/github/MarioniLab/DropletUtils/man/emptyDrops.html) to differentiate true cells from background?

batch-effect correction in DSB-normalized data?

Dear DSB team,

Thanks so much for developing this amazing resource.

I have a question regarding batch effect correction. As specified in the preprint: "we implemented this transformation on each staining batch separately to accommodate potential batch specific ambient noise– this helped mitigate batch-to-batch variation". Thus, would you recommend me apply DSB on each batch separately? In your supplementary figure 1A there is still some batch effect. Would you run any single-cell integration tool?

Thanks in advance for your help!

CiteSeqCount on empty droplet whitelist

Hi,

I am trying perform dsb normalisation on my ADTs. However, I am encountering an issue when it comes to perform CiteSeqCount on the whitelists of the empty droplets (excluded filtered barcodes from CellRanger). Basically, the analysis is stuck at testing cell barcode collapsing thresholds.
This happens on samples that have 2.5 to 3 million barcodes in their empty droplets.

Any suggestion? I may be doing something wrong.

Thanks

isotype controls

Currently, it looks like you are combining all the isotype controls as background. Have you considered taking into account which specific antibody each one is matched with? Obviously, that would complicate the process, but is that something that is perhaps on the roadmap?

differential expression analysis: log2FC on negative DSB-normalized values

Hi @MattPM,

Thank you for your work on this package - it has been a large improvement over other normalization methods. I am interested in integrating the DSB-normalized values with Seurat V4 for downstream, differential expression analysis. The differential expression is done using the FindMarkers( ) function (https://satijalab.org/seurat/reference/findmarkers
), which calculates the avg_log2FC between our two conditions. The problem is that the equation for avg_log2FC cannot handle negative values, and the DSB-normalized values we got out of the pipeline include negative values.

Do you have any suggestions for feeding the DSB-normalized values through Seurat's FindMarkers ( ) workflow?

Thank you!

Finding the right cutoffs manually

Hello,

Thanks for your work on this package, I think that this is a serious contribution to helping resolve a lot of issues that people are having with CITE-Seq and related single-cell techniques these days. I thought that other people might have had a similar issue as below, or that others will encounter this in the future.

I am trying to analyze a CITE-Seq dataset from mouse lung tissue that we generated with 10X. However, I cannot follow the procedure listed in your vignette because I do not see a third population of cells to the right-side of my histogram for the ADT data (shown below). Therefore, I am unsure how to identify a proper background for the ADT raw data. Could anyone provide guidance on how to define a background for proper analysis? Or have I performed improper sequencing?

This is the same image, with a cutoff on the a-axis of 2.5:

I made these plots with the following code, adapted from the vignette:

prot <- readRDS("prot_merged.rds") #raw merged count matrices
rna <- readRDS("rna_merged.rds") #raw merged count matrices

rna_size = log10(Matrix::colSums(rna))
prot_size = log10(Matrix::colSums(prot))
ngene = Matrix::colSums(rna > 0)
mtgene = grep(pattern = "^mt-", rownames(rna), value = TRUE)
propmt = Matrix::colSums(rna[mtgene, ]) / Matrix::colSums(rna)
md = as.data.frame(cbind(propmt, rna_size, ngene, prot_size))
md$bc = rownames(md)

p1 = ggplot(md[md$rna_size > 0, ], aes(x = rna_size)) + geom_histogram(fill = "dodgerblue") + ggtitle("RNA library size \n distribution")
p2 = ggplot(md[md$prot_size> 0, ], aes(x = prot_size)) + geom_histogram(fill = "firebrick2") + ggtitle("Protein library size \n distribution")
cowplot::plot_grid(p1, p2, nrow = 1)
ggsave("raw_histogram.png", width = 10, height = 5)

p1 = ggplot(md[md$rna_size > 2.5, ], aes(x = rna_size)) + geom_histogram(fill = "dodgerblue") + ggtitle("RNA library size \n distribution")
p2 = ggplot(md[md$prot_size> 2.5, ], aes(x = prot_size)) + geom_histogram(fill = "firebrick2") + ggtitle("Protein library size \n distribution")
cowplot::plot_grid(p1, p2, nrow = 1)
ggsave("zoomed_raw_histogram.png", width = 10, height = 5)

Using emptyDrops on RNA to define my negative matrix

Hi @MattPM, it's me again.

So I was using emptyDrops on the RNA to define my negative matrix, under the assumption that if a droplet is considered empty in RNA, the level of ADT in this droplet would be noise, right?

I wanted to show you how this looks like for me. In the plot we see total counts of ADT on x, and total counts of RNA on y. Before plotting I remove total counts of ADT < 10. Then I run emptyDrops on the RNA and I get 3 categories: in green (NA FDR) are all barcodes <100 total counts, in red barcodes that are considered empty, in blue barcodes with a cell in them. My problem with this is that the filtering is fine for RNA (green and red points would be removed), but if I remove those same cells in ADT I'm losing cells with a good total count (I think). I was wondering if this is normal in the data you have seen, and if in fact this peak of cells with low RNA and high-ish ADT should in fact be considered background.

Thanks for your help,
Stephany

Truncating negative values

Hi,

Really excited about this package and have been using it regularly.
One note is that some of the DSB values wind up low negative. In an old tutorial, you suggested:
Very negative values correspond to very low expression and it can be helpful to convert all dsb normalized values < -10 to 0 for interpretation / visualization purposes.

Is there a reason not to set all values below zero to zero?
Thanks!

return.stats use-case error

Hi dsb team!

Thank you for the excellent tool and vignettes. I ran into an error when setting the following parameters in DSBNormalizeProtein:

DSBNormalizeProtein(
        cell_protein_matrix = raw_matrix, 
        empty_drop_matrix = background_matrix,
        denoise.counts = TRUE, 
        use.isotype.control = FALSE,
        return.stats = TRUE)

When return.stats = TRUE and use.isotype.control = FALSE, the object noise_matrix will not be instantiated, but the function will attempt to access it, leading to the error: Error in t(noise_matrix) : object 'noise_matrix' not found because of the following lines of code:

dsb/R/dsb.r

Lines 254 to 257 in 3531cdc

 if(isTRUE(denoise.counts)) { 

 technical_stats = cbind(t(noise_matrix), dsb_technical_component = noise_vector) 

 } else{ 

 technical_stats = NULL

I quickly fixed this by adding an if-else statement in a fork, and I can open a pull request to merge if you agree with the fix!

Quantile trimming after dsb normalisation

After looking at my normalised data in more detail, I found many markers that have really extreme range of values. Usually, it was caused by only a few data points. I tried tinkering with the thresholds to determine the ambient matrix, but I still see this problem. I thought of scaling, but then it will remove the info on the actual range of expression. In the end, I trimmed the .01 and 99.9 percentile and it looked much better!

Do you have any thoughts on this?

Advise without HTODemux

Hi @MattPM, this looks like a nice package ;)

I was wondering if you could give me some advice to use your method.

To get singlets and negatives I have to run HTODemux right? my problem is that the assumption in HTODemux is that each sample is "single positive" for an HTO, so it's quite straightforward to assign a sample to one HTO. In my case, I have a multi-positive situation (e.g., 6 total HTOs and each sample is double-positive or triple-positive, see Figure 4 in https://www.nature.com/articles/nprot.2015.020 for an example in cytof).

What I actually want to do is normalize HTO and Ab counts, because I think both suffer from the same background noise. I saw in your vignette there was another way of doing this using low expression cells. Do you think this would make sense to normalize the HTOs, so I can demultiplex (on my own) with a multi-positive setup?

Not issue quickstart example plots

{this is not an issue it was used previously for vignette file references that did not need to be in package}

Filtering cells of multiplexed experiment stained with TotalSeq-A universal panel

Dear author,

Thank you for writing the package.

I apologise in advance if my question is naive or has been asked by someone else. I am new to the field and I am paralysed by the many forking paths of analysis that we can do.

I have a multiplexed experiment stained with TotalSeq-A universal panel. Given this design, if I understand correctly, I should perform a very minimal filtering of the unfiltered cellranger matrix and then directly demultiplex instead of performing droplet filtering (eg. EmptyDrops). After selecting singlets and negative droplets, I can normalise my CITE-Seq data using dsb and then perform gene-based QC (% mito, etc.).

Is this correct?

Many thanks in advance.

(denoise.counts=TRUE, use.isotype.control=TRUE) very similar (denoise.counts=FALSE, use.isotype.control=FALSE)

Hello,
thanks for this interesting package,
I am just using the code https://cran.r-project.org/web/packages/dsb/vignettes/end_to_end_workflow.html on a dataset I have to see what difference it can make. But I realized that having the option (denoise.counts = TRUE, use.isotype.control = TRUE) or (denoise.counts = FALSE, use.isotype.control = FALSE) return a very similar result...which surprise me .

This

cells.dsb.norm1 = DSBNormalizeProtein(
  cell_protein_matrix = cell.adt.raw,
  empty_drop_matrix = background.adt.mtx,
  denoise.counts = TRUE,
  use.isotype.control = TRUE,
  isotype.control.name.vec = isotype.controls
  )

takes longer time to run, and return:
[1] "correcting ambient protein background noise"
[1] "some proteins with low background variance detected check raw and normalized distributions. protein stats can be returned with return.stats = TRUE"
[1] "Hu.CD48" "Hu.CD11c" "Hu.CD31" "Hu.CD62L" "Hu.CD36"
[1] "fitting models to each cell for dsb technical component and removing cell to cell technical noise"

	Hu.CD86	Hu.CD274	Hu.CD270	Hu.CD155	Hu.CD112	Hu.CD47	Hu.CD48
0	1.944590	1.092919	1.101433	0.8200972	0.9423016	7.043330	1.31645800
1	4.206515	1.143181	1.104835	0.9502019	0.9075726	10.947148	-0.32764956
2	2.777990	1.221543	1.247028	0.8633896	1.0350072	7.846072	-0.06536306

while this:

cells.dsb.norm2 = DSBNormalizeProtein(
  cell_protein_matrix = cell.adt.raw,
  empty_drop_matrix = background.adt.mtx,
  denoise.counts = FALSE,
  use.isotype.control = FALSE
  )

is very quick, and return:
[1] "Not running dsb step II (removal of cell to cell technical noise) Setting use.isotype.control and isotype.control.name.vec to FALSE and NULL"
[1] "potential isotype controls detected: "
[1] "Isotype_MOPC.21" "Isotype_MOPC.173" "Isotype_MPC.11" "Isotype_RTK4530"
[5] "Isotype_RTK2071" "Isotype_G0114F7" "Isotype_RTK2758" "Isotype_RTK4174"
[9] "Isotype_HTK888"
[1] "correcting ambient protein background noise"
[1] "some proteins with low background variance detected check raw and normalized distributions. protein stats can be returned with return.stats = TRUE"
[1] "Hu.CD48" "Hu.CD11c" "Hu.CD31" "Hu.CD62L" "Hu.CD36"

	Hu.CD86	Hu.CD274	Hu.CD270	Hu.CD155	Hu.CD112	Hu.CD47	Hu.CD48
0	1.582482	1.052647	1.120445	0.9469014	0.9177784	7.094805	1.4910953
1	4.205012	1.052647	1.120445	0.9469014	0.9177784	10.995484	-0.4915249
2	2.981426	1.306219	1.324305	0.9469014	0.9177784	8.029984	-0.4915249

which is different of course, but surprising similar, even the heatmaps looks identical

Also FeaturePlot/VlnPlot looks very similar

Am I doing anything wrong?
Thanks

DSB normalising with artificial (zero) background

Hi @MattPM ,

We have 5 CITEseq datasets and for these we have normalized the ADT counts with DSB. In a second step we integrate all data sets into each other. Now we would like to insert an ABseq dataset. Unfortunately there is no information about the empty droplets. Due to the integration method we would like to use a uniform preprocessing as far as possible.
We wonder if there is a way to normalize the ABseq data with DSB? (For example: Could we create an artificial empty droplet matrix and just fill it with 0 values. I.e. we want to simulate that there is no background).

Thank you in advance! :)

Error in cells$`Gene Expression`

Hi,
Thank you for this package! I am excited to run it. However, I do get an error. Perhaps a bit naive error:
So I got the step fine:

read raw data using the Seurat function "Read10X"

raw = Seurat::Read10X("data/raw_feature_bc_matrix/")
cells = Seurat::Read10X("data/filtered_feature_bc_matrix/")

When I get to this step, it says:

define a vector of cell-containing barcodes and remove them from unfiltered data

stained_cells = colnames(cells$Gene Expression)

Error in cells$Gene Expression :
$ operator not defined for this S4 class

Am I missing a step?
Thanks.
Rini

Error in quantile.default

Hi:

Thanks for developing this nice package.

I tried this command but got the

dsb_norm_prot = DSBNormalizeProtein(
cell_protein_matrix = positive_adt_matrix,
empty_drop_matrix = neg_adt_matrix,
denoise.counts = TRUE,
use.isotype.control = TRUE,
isotype.control.name.vec = rownames(citeseq_counts)[162:165])

Error in quantile.default(x, seq(from = 0, to = 1, length = n)): missing values and NaN's not allowed if 'na.rm' is FALSE
Traceback:

DSBNormalizeProtein(cell_protein_matrix = positive_adt_matrix,
. empty_drop_matrix = neg_adt_matrix, denoise.counts = TRUE,
. use.isotype.control = TRUE, isotype.control.name.vec = rownames(citeseq_counts)[162:165])
apply(norm_adt, 2, function(x) {
. g = mclust::Mclust(x, G = 2, warn = F, verbose = F)
. return(g$parameters$mean[1])
. })
FUN(newX[, i], ...)
mclust::Mclust(x, G = 2, warn = F, verbose = F)
eval(mc, parent.frame())
eval(mc, parent.frame())
mclustBIC(data = structure(c(-0.756590760281651, 0.706345357143731,
. -0.571362010259017, 0.455675113554685, 0.178004789840622, 0.11274129347221,
...
. NaN, NaN, -0.20047762928513, NaN, -0.0629144943331455, NaN, -0.0719656170716554,
. -0.104914014711862, NaN), .Dim = c(217L, 1L), .Dimnames = list(
. c("anti-human-CD140a-PDGFRalpha-TotalSeqC", "anti-human-CD140b-PDGFRbeta-TotalSeqC",
. "anti-human-CD119-IFN-gamma-R-alpha-chain-TotalSeqC", "Rat-IgG1-kapa-isotype-Ctrl-TotalSeqC",

What do you think might be the reason?

before this command, I ran
Idents(seurat_obj) = "HTO_classification.global"

subset empty drop/background and cells

neg_object = subset(seurat_obj, idents = "Negative")
singlet_object = subset(seurat_obj, idents = "Singlet")
neg_adt_matrix = GetAssayData(neg_object, assay = "ADT", slot = 'counts') %>% as.matrix()
positive_adt_matrix = GetAssayData(singlet_object, assay = "ADT", slot = 'counts') %>% as.matrix()

the out put of rownames(citeseq_counts)[162:165] is
'Rat-IgG2b-kapa-Isotype-Ctrl-TotalSeqC''Mouse-IgG2b-kapa-isotype-Ctrl-TotalSeqC''Mouse-IgG2a-kapa-isotype-Ctrl-TotalSeqC''Mouse-IgG1-kapa-isotype-Ctrl-TotalSeqC'

thanks for your time

Decision tree for deciding whether to set denoise.counts = TRUE

Hi, thanks for providing dsb!

I am analyzing an experiment which did not use isotype controls. I went through your vignette in order to figure out whether I should be setting denoise.counts = TRUE. I created the decision tree below to help me (and potentially others) decide, and I would like feedback on whether it is accurate. Thank you. (I can also share/edit the lucidchart if you like.)

To follow-up for my specific experiment:

I am at outcome (B) because µ1 and µ2 are significantly correlated for my experiment and we did not include isotype controls.
This means that I was able to use dsb to remove protein-specific background noise by using the empty droplets, but the technical cell-to-cell variation is still present in my data.

In your experience, how appropriate is this partially-denoised data for downstream applications such as 1) gating (as in the flow cytometry context) or 2) clustering or 3) differential expression analysis (between clusters or between sets of patients)? Or is that something that will vary from experiment-to-experiment and depends on the ratio of biological signal:technical noise? Basically, I am wondering how usable our data are.

Your answer will also help inform whether we include isotype controls in future experiments.
(Edit: Your answer here addresses this question actually)

As a sort of experiment, I've plotted CD4 vs CD8 plots below using ADT data collected from a subset of T cells. Note the subset of T cells in our experiment are primarily CD8+. I've plotted the raw and dsb-transformed data side-by-side.

Interestingly, the third panel (where I used denoise.counts = TRUE despite µ1 and µ2 being significantly correlated) is most similar to what we see using flow cytometry^*. The third panel also doesn't have those artificial vertical lines, which seems like a plus.
However, according to the dsb recommendations, it seems as though I should probably proceed with the data from panel 2 (denoise.counts = FALSE) so I don't lose biological signal. It's just a shame the data can't be cleaned up any further. If you have further thoughts or advice on this, I am interested. :)

^* though I understand that flow data may not necessarily be a “gold standard” here because it has its own set of problems like fluorescent spillover

Thank you so much for your feedback and advice.

Selecting Cut-Off Value For ADT Expression

In the vignette and publication, you recommend to set a threshold across all proteins for positivity. In your publication, you have selected 3.5.

I'm thinking you take your isotype controls and put positivity above that?

To justify this how would I generate something like this where I can show my threshold and expressed/unexpressed?

How would this threshold be applied to the differential expression? Do I need to subtract some value during differential expression or would we have to set the values of all proteins deemed unexpressed (under the threshold) to 0 and then perform differential expression? If so, how would we do that?

Thanks so much for your help!

range of normalized values

Hi,
i wonder what a range of values would be considered unusual for the DSB normalized protein signal.
The plots in your bioarxiv paper go from -2 to +30 for most markers, and you seem to suggest there's no transformation applied on the values, but after normalizing my data I have values ranging between -200 and +200. if i compare the DSB distribution to the CLR it looks fine, and the markers are expressed where i would expect them, but the visualization and interpretation of such a big range is difficult, and also, there's a lot of false positives.
i am normalizing using control isotypes. any suggestions?
thanks

Empty droplet definition with Mission Bio data

Hi
I am trying to use dsb to normalize my ADT data from MissionBio, however is not very clear for me how to define my empty droplet background set. My sample distributions look rather normally distributed (see below, right side), unlike in the example from the paper where it was clear that a big proportion of cells did not have any ADTs (left part of the image). Because of the type of data I do not have RNA to use mtRNA content as control. Do you have any advice, perhaps using the DNA variant data to get the background set?

I was also thinking about trying the normalization suggested here, but the "it has not been widely tested" stopped me.

Thanks,
Mariela

DSB validity for limited ADT features

i have read your paper and have seen your vignettes. I understood that this was created for 14 ADT protein levels. The data that I'm currently working on, as only 4 ADT feature data. How reliable is the DSB method for normalizing this. If no what could be a better approach?

object 'count_prot' not found

Hi! I was trying to run your vignette and the code fails at Step III (option b) because "count_prot" hasn't been defined anywhere. Could you please indicate what this variable corresponds to exactly and maybe update your vignette?

## Advance use error

Hi,
I am using the latest version 1.0.3
When using argument scsle.factor = "mean_subtract I get the following error:
[1] "correcting ambient protein background noise"
[1] "fitting models to each cell for dsb technical component and removing cell to cell technical noise"
Error in apply(norm_adt, 2, function(x) { : object 'norm_adt' not found

This error seems to suggest that there's a reference to a local variable norm_adt within the function that is not found. The error occurs only when I set the scale.factor argument to "mean_subtract". If I omit this argument, the function proceeds without any issues.

Could you please help me understand what might be causing this error and how to resolve it?

Thank you for your assistance and for developing such a useful tool.

longer object length is not a multiple of shorter object length

I'm trying to run DSBNormalizeProtein on a REAP-seq data set. I read in the RNA and protein expression matrices with the Read10X function from Seurat, then proceeded to make a vector of cells and empty droplets. cells and empty respectively.

I split out the empty droplet and cell-containing matrices below:

raw_prot_mat <- object_raw$`Antibody Capture`  %>% as.matrix()
empty_prot_mat <- raw_prot_mat[,empty] %>% as.matrix()
cell_prot_mat <- raw_prot_mat[,cells] %>% as.matrix()

I then fed these into the DSBNormalizeProtein function.

test <-DSBNormalizeProtein(cell_protein_matrix = cell_prot_mat, 
                    empty_drop_matrix = empty_prot_mat,
                    denoise.counts = F,
                    use.isotype.control = FALSE)

It outputs the following error message: "longer object length is not a multiple of shorter object length".

I haven't seen this particular error reported by anyone else, yet so I thought I would ask: any idea what is causing this error? Very possible I'm missing something simple/fundamental. Thanks!

Should we scale the data?

Hello, thank you for designing the great package.
I noticed that in the end-to-end vignettes, you also included the ScaleData in the Seurat WNN workflow using dsb normalized data. However, in this issue https://github.com/niaid/dsb/issues/4#issuecomment-618435439 you mention not to perform the ScaleData. Is that mean now you still recommend scaling the data before downstream analysis?
There is another question. If I perform subset analysis for the dataset (for example, exploring the T cell subset), I don't need to normalise the Cite-seq data again, is that correct? Only re-scaled the data is needed for the subset analysis?

Thank you very much.

ValueError (ill-defined empirical covariance)

Thank you for the wonderful package to analyse CITE-seq. I'm having an error while running dsb and below is the code and error message:

pt.pp.dsb(filtered, raw, empty_counts_range=(2, 3), random_state=1)
Having below error message:
ValueError: Fitting the mixture model failed because some components have ill-defined empirical covariance (for instance caused by singleton or collapsed samples). Try to decrease the number of components, or increase reg_covar.

Could help me to figure out what causing the error?

dsb package environment

Hi,

I would like to add this package to an environment. I don't think it is available to bioconda or conda-forge, so I wanted to build a skeleton recipe in order to add this to my environment, however, when using conda-build r-dsb, I get the error "conda_build.exceptions.DependencyNeedsBuildingError: Unsatisfiable dependencies for platform linux-64: {'r-limma'}". I have bioconductor-limma in the environment already; I believe r-limma is the same package, and while I can get bioconductor-limma, I cannot get r-limma from anaconda (it is stuck on solving environment).

Has this package already been made available on another channel? If not, is there anyway to install without r-limma? Is it the same as the bioconductor-limma package? I am new to using conda, so I apologize if what I describe isn't clear!

	if(isTRUE(denoise.counts)) {
	technical_stats = cbind(t(noise_matrix), dsb_technical_component = noise_vector)
	} else{
	technical_stats = NULL

niaid / dsb Goto Github PK

dsb's People

Contributors

Stargazers

Watchers

Forkers

dsb's Issues

read raw data using the Seurat function "Read10X"

define a vector of cell-containing barcodes and remove them from unfiltered data

subset empty drop/background and cells

Recommend Projects

Recommend Topics

Recommend Org