It is live! 🚀 check it out here!
Made with the help of this template.
A count based method for detecting doublets from single nucleus ATAC-seq (snATAC-seq) data.
Home Page: https://ucarlab.github.io/AMULET/
License: GNU General Public License v3.0
It is live! 🚀 check it out here!
Made with the help of this template.
Hi,
Based on the README and on the thesis abstract I see here, this looks like it could be a very useful tool, and of course the idea behind it is quite intuitive. If you don't mind me asking, will a pre-print or manuscript describing the methods and demonstrating accuracy be available any time soon? I'd love to give this a shot on my own data, as long as there's good evidence it works well (I can validate this on some of my data using orthogonal doublet detection methods) and it's cite-able.
Thanks!
Peter
Hi there. I recently got this error when running the bash script . If I could get some help understanding the issue that would be great. Here is the traceback.
Traceback (most recent call last):
File "/AMULET/AMULET.py", line 201, in
lengths = filtereddata[:,2]-filtereddata[:,1]+1
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
Thanks
Hi AMULET team,
I have a 10X multiome kit dataset and trying to run AMULET on the ATAC data.
The fragments input file is the same, but the singlecell.csv is no longer available as CellRanger output. Instead, the following per_barcode_metrics.csv
is produced.
barcode,gex_barcode,atac_barcode,is_cell,excluded_reason,gex_raw_reads,gex_mapped_reads,gex_conf_intergenic_reads,gex_conf_exonic_reads,gex_conf_intronic_reads,gex_conf_exonic_unique_reads,gex_conf_exonic_antisense_reads,gex_conf_exonic_dup_reads,gex_exonic_umis,gex_conf_intronic_unique_reads,gex_conf_intronic_antisense_reads,gex_conf_intronic_dup_reads,gex_intronic_umis,gex_conf_txomic_unique_reads,gex_umis_count,gex_genes_count,atac_raw_reads,atac_unmapped_reads,atac_lowmapq,atac_dup_reads,atac_chimeric_reads,atac_mitochondrial_reads,atac_fragments,atac_TSS_fragments,atac_peak_region_fragments,atac_peak_region_cutsites
AAACAGCCAAACAACA-1,AAACAGCCAAACAACA-1,ACAGCGGGTGTGTTAC-1,0,0,12,12,0,4,7,4,0,2,2,0,7,0,0,4,0,0,2,0,0,0,0,0,2,1,1,2
The is_cell column corresponds to the is__cell_barcode from singlecell.csv I believe. I've tried a quick and dirty solution of renaming the column, which failed. The atac_barcode column corresponds to the barcode column from singlecell.csv .
Is there a good way to modify the source code accordingly to take care of this?
Thanks so much,
Tina
Hello,
I am wondering whether there is a straightforward way to use AMULET with the output of scATAC-pro. The vignettes seem to require the outputs from cellranger which conveniently produce a csv file. scATAC-produces cell_barcodes.bam file, is there a way to convert this to csv file to be in a form to work with AMULET.
Thanks,
Jonathan
Hello,
Thanks for this really nice tool!! I have two questions:
I am facing some issues in the implementation: As a first pass, I am trying to compare the doublet rate identified in mouse+human mixed species barnyard experiment. The data is generated with default 10x pipeline. For mixed species, it generates the following first few columns in singlecell.csv. The default is_cell_barcode is 9, here that corresponds to is_GRCh38_cell_barcode.
barcode,total,duplicate,chimeric,unmapped,lowmapq,mitochondrial,nonprimary,passed_filters,is_GRCh38_cell_barcode,is_mm10_cell_barcode
The pipeline returns empty overlap.txt. Any clues why?
Also, if I were to use some other pipeline than 10x that does not produce singlecell.csv. I essentially need a "comma" separated csv file that has barcode info (--bcidx) and one column indicating whether it is a cell barcode or not (specified in --iscellidx) ? How do I specify --cellidx (not sure what that corresponds to)
sorry but I am a little confused that there is no such a file snATACOverlapCounter.jar
under the GitHub page? how can I generate it? thank you!!
Hi! I am applying AMULET on snATAC-seq data generated from my lab. For a dataset like this, after filtering with TSSE >= 10 and log10(UQ + 1) >= 3 (UQ means the number of uniquely mapped fragments), the cells still have a wide range of UQ.
Since the number of overlaps (genomic regions with >2 overlapping reads) for a cell positively correlates with the UQ of the cell, cells have high UQ tend to have a larger number of overlaps. Because the majority of cells fall in the range of 3 < log10(UQ + 1) < 4, I believe the majority of doublets are also there (at least with the same order of magnitude). However, with a doublet formed by two singlets each with UQ = 3000 and a singlet with UQ = 30000, AMULET would very probably locate more overlaps in the singlet than in the doublet. The cells whose UQ > 10000 tend to have more overlaps and are thus more likely to be identified as doublets.
The AMULET paper indicates that 25K median valid reads per nucleus is optimal, but is there a way to maximize the probability of finding doublets in the range of 3 < log10(UQ + 1) < 4 in this dataset given that there are cells in range 4 < log10(UQ + 1) < 5?
Hi,
I am trying to run the snATACOverlapCounter.jar through the AMULET.sh, but it fails, and the error message is truncated, so not helpful at all. Most probably, my input files are not shaped properly. Can you provide the test files in the examples and/or a better representation of the errors?
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:61)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 9
at org.jax.snatacoverlapcounter.util.Util.readCellBarcodes(Util.java:56)
at org.jax.snatacoverlapcounter.OverlapCounter.findOverlaps(OverlapCounter.java:203)
at org.jax.snatacoverlapcounter.OverlapCounter.main(OverlapCounter.java:134)
... 5 more
Hello,
I downloaded directly from the binary releases and it seems throw this error.
Reading BAM file. [434/804]java.io.FileNotFoundException: ../ATACDoublet_output/Overlaps.txt (No such file or directory) at java.io.FileOutputStream.open0(Native Method) at java.io.FileOutputStream.open(FileOutputStream.java:270) at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:101)
at java.io.FileWriter.<init>(FileWriter.java:63)
at org.jax.snatacoverlapcounter.OverlapCounter.findOverlaps(OverlapCounter.java:179) at org.jax.snatacoverlapcounter.OverlapCounter.main(OverlapCounter.java:99)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:61)
java.io.FileNotFoundException: ../ATACDoublet_output/Error_2021-02-12T203531.txt (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:101)
at java.io.FileWriter.<init>(FileWriter.java:63)
at org.jax.snatacoverlapcounter.util.Util.writeCompletionFile(Util.java:96)
at org.jax.snatacoverlapcounter.util.Util.writeErrorFile(Util.java:107)
at org.jax.snatacoverlapcounter.OverlapCounter.main(OverlapCounter.java:104)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:61)
java.io.FileNotFoundException: ../ATACDoublet_output/Overlaps.txt (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:101) at java.io.FileWriter.<init>(FileWriter.java:63)
at org.jax.snatacoverlapcounter.OverlapCounter.findOverlaps(OverlapCounter.java:179)
at org.jax.snatacoverlapcounter.OverlapCounter.main(OverlapCounter.java:99)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:61)
java.io.FileNotFoundException: ../ATACDoublet_output/RunTime.txt (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:101)
at java.io.FileWriter.<init>(FileWriter.java:63)
at org.jax.snatacoverlapcounter.OverlapCounter.writeRunTime(OverlapCounter.java:131)
at org.jax.snatacoverlapcounter.OverlapCounter.main(OverlapCounter.java:112)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:61)
I have Java 8, and all my inputs are directly from CellRanger outputs.
Any ideas?
how can I change this parameter MAPQTHRESHOLD=30 to 10?
Thank you!
I tried running the AMULET.sh script but it only produces the Overlaps.txt output, the OverlapSummary.txt file is not written.
Then, the python script throws an error.
`Traceback (most recent call last):
summarydata = pd.read_csv(args.overlapsummary, sep="\t").values
pandas.errors.EmptyDataError: No columns to parse from file`
Hi, the fragment files from cellranger-atac 2.0.0 contains some comment lines starting with "#". I think currently this is not handled in AMULET, which throws an error. I added two lines of code myself that skips the comment lines and everything runs fine.
Thanks so much!
Hi, I tried to run your doublet annotation functions but I encountered several errors.
Firstly, I tried to run your getMarkerPeaks function:
marker_peaks <- getMarkerPeaks(data, doublets = unlist(multiplets),
n_peaks = 100,
min_cells = 200)
Error in order(meta_peaks$avg_logFC, -meta_peaks$p_val_adj, decreasing = T): argument 1 is not a vector
Traceback:
1. getMarkerPeaks(data, doublets = unlist(multiplets), n_peaks = 100,
. min_cells = 200)
2. meta_peaks[order(meta_peaks$avg_logFC, -meta_peaks$p_val_adj,
. decreasing = T), ]
3. `[.data.frame`(meta_peaks, order(meta_peaks$avg_logFC, -meta_peaks$p_val_adj,
. decreasing = T), )
4. order(meta_peaks$avg_logFC, -meta_peaks$p_val_adj, decreasing = T)
After that I generated marker peaks by myself and used them in your annotateDoublets function:
multiplet_annotations <- annotateDoublets(obj = data, marker_peaks = df, doublets = multiplets)
Error in rep(t_read_counts$ids[j], t_read_counts[j, row.names(probs)[i]]): invalid 'times' argument
Traceback:
1. annotateDoublets(obj = data, marker_peaks = df, doublets = multiplets)
2. getCellValues(obj, cells = Cells(obj), marker_peaks_set = marker_peaks,
. doublets = doublets, k = k)
3. lapply(cells, function(cell) {
. neighbors <- names(obj@graphs$ATAC_nn[cell, obj@graphs$ATAC_nn[cell,
. ] > 0])
. reads <- Matrix::as.matrix(subset(obj, cells = neighbors,
. features = marker_peaks_set$peaks)@assays[["ATAC"]]@counts)
. no_clusters <- length(unique(marker_peaks_set$cluster))
. results = data.frame(matrix(nrow = 0, ncol = no_clusters +
. 1)) %>% `colnames<-`(value = c("cell_id", as.character(unique(marker_peaks_set$cluster))))
. results[cell, "cell_id"] = cell
. reads <- data.frame(apply(reads, 1, mean)) %>% `colnames<-`(value = cell)
. if (colSums(reads) == 0) {
. results[, -1] <- 0
. results[cell, "a.heterotypic"] <- NA
. results[cell, "a.homotypic"] <- NA
. return(results)
. }
. doublet_probs <- reads %>% getReadCountDistributions(marker_peaks_set,
. .) %>% data.frame()
. results[cell, colnames(doublet_probs)] <- doublet_probs
. results[cell, "a.heterotypic"] <- paste(names(doublet_probs)[order(doublet_probs,
. decreasing = T)[1:2]], collapse = ".")
. results[cell, "a.homotypic"] <- names(which.max(doublet_probs))
. return(results)
. }) %>% do.call(rbind, .)
4. do.call(rbind, .)
5. lapply(cells, function(cell) {
. neighbors <- names(obj@graphs$ATAC_nn[cell, obj@graphs$ATAC_nn[cell,
. ] > 0])
. reads <- Matrix::as.matrix(subset(obj, cells = neighbors,
. features = marker_peaks_set$peaks)@assays[["ATAC"]]@counts)
. no_clusters <- length(unique(marker_peaks_set$cluster))
. results = data.frame(matrix(nrow = 0, ncol = no_clusters +
. 1)) %>% `colnames<-`(value = c("cell_id", as.character(unique(marker_peaks_set$cluster))))
. results[cell, "cell_id"] = cell
. reads <- data.frame(apply(reads, 1, mean)) %>% `colnames<-`(value = cell)
. if (colSums(reads) == 0) {
. results[, -1] <- 0
. results[cell, "a.heterotypic"] <- NA
. results[cell, "a.homotypic"] <- NA
. return(results)
. }
. doublet_probs <- reads %>% getReadCountDistributions(marker_peaks_set,
. .) %>% data.frame()
. results[cell, colnames(doublet_probs)] <- doublet_probs
. results[cell, "a.heterotypic"] <- paste(names(doublet_probs)[order(doublet_probs,
. decreasing = T)[1:2]], collapse = ".")
. results[cell, "a.homotypic"] <- names(which.max(doublet_probs))
. return(results)
. })
6. FUN(X[[i]], ...)
7. reads %>% getReadCountDistributions(marker_peaks_set, .) %>%
. data.frame()
8. data.frame(.)
9. getReadCountDistributions(marker_peaks_set, .)
I used ATAC peaks from 10X genomics and successfully generated file with possible multiplet barcodes, but on annotation step code stops working. Have you encountered this behavior? Thanks in advance!
Hi AMULET team,
Thank you so much for developing this great method first! So sorry that I am quite new to this area and have some basic questions.
I noticed that in the multiplet detection step, the matrix we generated is already binarized. My first question is can we have the count matrix before binarization by any chance? I tried to repeat your method using the seurat count matrix, and I would like to use the original count matrix generated by AMULET to check the concordance。
The second question I have is that I am still confused about how can I generate artificial doublets to assess the detection accuracy. I have seen your answer in https://github.com/UcarLab/AMULET/issues/16, but still didn't quite get it. It would be great if you could share a sample script to do that or give a specific example of the process.
Many thanks,
Yuntian.
Hi,
Thank you for developing this great tool!
I tried AMULET on two of 10x Multiome datasets and found two issues:
Hi! Thank you for developing this great tool!
I encountered this issue when trying to run the multiplet detection vignette:
Filtering regions.
Traceback (most recent call last):
File "$PWD/AMULET-v1.1/AMULET.py", line 201, in
lengths = filtereddata[:,2]-filtereddata[:,1]+1
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
The number of valid read pairs seems good, but the number of overlaps are all zeros, and that's why I think it gives the above error.
head -n 6 OverlapSummary.txt
Cell Id Number of Valid Read Pairs Number of Overlaps Barcode Total Number of Reads
10 9 0 TCACAGAGTACAACGG-1 278
100 95 0 GAACGTTAGAGTCCGA-1 620
101 195 0 TGTACGATCCTATCAT-1 1100
104 102 0 TCAGTCCAGGCTGGAT-1 496
105 208 0 TTGAGTGAGAAGCCTG-1 722
cat Overlaps.txt
chr start end cell id Min Overlap Count Max Overlap Count Mean Mapping Quality Min Mapping Quality Max Mapping Quality Starts Ends
cat StatSummary.txt
Total Reads: 713014
Valid Read Pairs: 18305
Positive: 464525
Negative: 248489
Not in chromosome list: 15155
Insert size (<=0 or >900): 8
Low Mapping Quality Reads (<=30): 332222
Mean Read Length: 49.20841300191205
Mean Insert Size: 135.4882272603114
SAMTools Flag Checked: 188719
SAMTools Flag Passed: 187284
SAMTools Flag Duplicate: 1435
SAMTools Flag Unpaired: 0
SAMTools Flag Unmapped: 0
SAMTools Flag Mate Unmapped: 0
SAMTools Flag Secondary: 0
SAMTools Flag Reference Mismatch: 0
I followed the instructions on 10X genomics step by step and I don't know where I messed up. I checked my fastq files after running cellranger-atac mkfastq:
(base) -bash-4.2$ ls -l HJN3KBCX2/outs/fastq_path/ 总用量 424780 drwxr-xr-x 3 yuw044 ren-group 3 1月 30 12:28 p1 drwxr-xr-x 3 yuw044 ren-group 3 1月 30 12:28 Reports drwxr-xr-x 2 yuw044 ren-group 8 1月 30 12:28 Stats -rw-r--r-- 1 yuw044 ren-group 41185965 1月 30 12:28 Undetermined_S0_L001_I1_001.fastq.gz -rw-r--r-- 1 yuw044 ren-group 156691442 1月 30 12:28 Undetermined_S0_L001_R1_001.fastq.gz -rw-r--r-- 1 yuw044 ren-group 78766690 1月 30 12:28 Undetermined_S0_L001_R2_001.fastq.gz -rw-r--r-- 1 yuw044 ren-group 157781371 1月 30 12:28 Undetermined_S0_L001_R3_001.fastq.gz
(base) -bash-4.2$ tree HJN3KBCX2/outs/fastq_path/p1/s1/ HJN3KBCX2/outs/fastq_path/p1/s1/ test_sample_S1_L001_I1_001.fastq.gz test_sample_S1_L001_R1_001.fastq.gz test_sample_S1_L001_R2_001.fastq.gz test_sample_S1_L001_R3_001.fastq.gz 0 directories, 4 files
I used the following options when running cellranger-atac count:
--reference $pwd/opt/refdata-cellranger-arc-GRCh38-2020-A-2.0.0
where I downloaded this reference file with this:
wget https://cf.10xgenomics.com/supp/cell-atac/refdata-cellranger-arc-GRCh38-2020-A-2.0.0.tar.gz
--fastqs
to the directory containing these files:
(base) -bash-4.2$ tree HJN3KBCX2/outs/fastq_path/p1/s1/ HJN3KBCX2/outs/fastq_path/p1/s1/ test_sample_S1_L001_I1_001.fastq.gz test_sample_S1_L001_R1_001.fastq.gz test_sample_S1_L001_R2_001.fastq.gz test_sample_S1_L001_R3_001.fastq.gz 0 directories, 4 files
Can you help me with this problem?
Hi,
I have noticed that no. of multiplets detected by AMULET are proportional to number of cells in singlecell.csv. For clarification, does amulet considers only the cells in singlecell.csv for testing multiplets or does it consider all the barcodes in the singlecell.csv?
Since I am not using 10x cell calling method, I would like for AMULET to detect multiplets from all the barcodes in the singlecell.csv. Is that feasible?
Thanks
I have a very basic question about the outputs generated. Are both homotypic and heterotypic multiplets barcode ids in the MultipletCellIds_xx.txt ? If so, how can I find out from the output files of AMULET (without using the R vignette that you provided) which barcode ids are heterotypic vs homotypic?
Thanks again for this great tool for scATAC users! :)
Anjali
I read the original thesis PDF of AMULET. When you simulate artificial doublets to measure recall for detecting multiplets, I wonder how "each one of them is separately added to the existing set of single cells (them = artificial doublets)" and how you "tested the performance of DoubletDetector by evaluating each artificial cell individually". Did you generate a bam file for each simulated artificial doublet, run overlap counter (the first step of AMULET) on the bam file to obtain the overlaps files, and add the overlaps to the overlaps files of the bam file of the set of real cells?
Hi Ucar Lab,
Thanks for the great tool--- have you all tried the statistical approach utilizing the fragments.tsv.gz file as input instead of the .bam file? Do you have any sense off-hand on how it would perform? It'd greatly simplify my workflow as I have several fragments files but not .bam files saved from preprocessing runs.
how can i find known repetitive elements like restrictionlist_repeats_segdups_rmsk_hg38.bed ? Can you provide the file or download link ?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.