Giter Club home page Giter Club logo

snatac's Introduction

NOTE: This project is no longer under maintenance, please find our latest pipeline SnapATAC here

Introduction

snATAC is a Ren-lab in-house bioinformatics pipeline for single-nucleus ATAC-seq (snATAC-seq). Download the our full data set from here.

Dependency

Quick Install

> git clone --depth=1 https://github.com/r3fang/snATAC.git
> cd snATAC
> chmod u+x bin/*
> gcc -g -O2 src/calculate_jacard_index_matrix.c -o bin/snATAC_jacard
> export PATH=$PATH:./bin/
> ./bin/snATAC

Program:  snATAC (snATAC-seq analysis pipeline)
Version:  12.24.2017
Contact:  Rongxin Fang <[email protected]>
          Sebastian Preissl <[email protected]>
          Bing Ren <[email protected]>

Usage:    snATAC <command> [options]

Command:  pre           preprocessing
          filter        filter reads with unselected barcodes
          bmat          binary accessible matrix
          jacard        jaccard index matrix

Optional (under development):
          decomplex     decomplex the fastq file
          bstat         simple statistics for barcode

Note: To use snATAC pipeline, you need to first decomplex barcode
combination and integrate barcodes to the beginning of the
read name in both R1 and R2 fastq files.

Requirement

Before alignment, you need to first decomplex the FASTQ file by adding the barcode to the beginning of each read. Below is one example of decomplexed R1 and R2 for snATAC-seq. Each read name follows this format: "@" + "barcode" + ":" + "original read_name". Unfortunately, because barcode design may be different between each experiment, we decided not to include this part in the pipeline.

> zcat p56.R1.fastq.gz | head
@TCCGGAGATAAGGCGAAAGGAGTAATAGAGGC:SN1113732HY737BCXX2110113481882 1N00
GTGTTGTTCTAGCTGGACAGGACAACTTCCTATCCTCCCCTTTAGCCCTA
+
DDDDDIIIIIIIIIIIIIIIHIIIIIIIHHIHIIIIIIIIHIIIHIIIHI

> zcat p56.R2.fastq.gz | head
@TCCGGAGATAAGGCGAAAGGAGTAATAGAGGC:SN1113732HY737BCXX2110113481882 2N00
CGGGCTCCTCGGCCGATATGTATGAGTAGGAAGGTGTCCTGTCCAGCTAG
+
<0000/11110000/<///<<11111<111<110<DCH1G<<<11111<1

Pipeline

snATAC has following steps:

  1. Alignment by bwa or bowtie2 (output.bam);
  2. Pre-processing (output.bed/output.qc);
    • Alignment filteration
    • Speration of single cell
    • PCR duplicates removal
    • Mitochondrial reads removal
    • Adjusting position of Tn5 insertion
    • Create a master .bed file
    • Create a .qc file
  3. Identify feature candidates (output.txt);
    • Call peaks using MACS2 from ensemble data
  4. Calculate barcode statistics;
    • Reads per barcode
    • Consecutive promoter coverage
    • Reads in peak ratio
  5. Barcode selection (output.xgi);
    • Reads per barcode >= 1000
    • Consecutive promoter coverage > 5%
    • Reads in peak ratio >= 20%
    • NOTE: above cutoff can vary singificant between different samples
    • Filter potential doublets
  6. Feature selection (output.ygi);
    • Filter top 5% peaks
    • Filter promoters
  7. Binary accesibility matrix (output.mat);
  8. Jaccard index matrix (output.jacard);
  9. Dimentionality reduction (output.tsne);
  10. Density-based clustering (output.cluster);

A Complete Example

Step 0. Download sample data.

> wget http://renlab.sdsc.edu/r3fang/projects/Preissl_Nat_Neuro/data_raw/p56.rep1.R1.decomplex.fastq.gz
> wget http://renlab.sdsc.edu/r3fang/projects/Preissl_Nat_Neuro/data_raw/p56.rep1.R2.decomplex.fastq.gz
> wget http://renlab.sdsc.edu/r3fang/snATAC/mm10.tar.gz
> tar -xvzf mm10.tar.gz

Step 1. Alignment using bwa (or bowtie2) in pair-end.

> bwa mem -t 1 mm10.fa \
          p56.rep1.R1.decomplex.fastq.gz \
          p56.rep1.R2.decomplex.fastq.gz \
          | samtools view -bS - > p56.rep1.bam
> samtools sort -t 5 -n p56.rep1.bam -o p56.rep1.nsrt.bam

Step 2. Pre-processing.

> ./bin/snATAC pre -t 5 -m 30 -f 2000 -e 75 \
                   -i p56.rep1.nsrt.bam \
                   -o p56.rep1.bed.gz 2> p56.rep1.pre.log

./bin/snATAC pre will output two files p56.pre.bed.gz and p56.rep1.pre.log. p56.pre.bed.gz contains read and corresponding barcode, p56.rep1.pre.log contains basic quality control. Below is one example of p56.rep1.pre.log.

p56.pre.log
number of totol reads 217064143
number of uniquely mapped reads 191813608
number of properly paired reads 189561202
number of chrM reads 5334929
number of usable reads 184310563
number of distinct reads 52036692
estimated duplicate rate 0.71

Step 3. Identify feature candidates (output.txt)

> macs2 callpeak -t p56.rep1.bed.gz \
                 -f BED -n p56.rep1 \
                 -g mm -p 0.05 \
                 --nomodel --shift 150 \
                 --keep-dup all

Step 4. Calculate barcode statistics

##################################################################
# calculate 3 major barcode stats
# 1) number of reads
# 2) consecutive promoter coverage 
# 3) reads in peak ratio
##################################################################
# count number of reads per barcode
> zcat p56.rep1.bed.gz | awk '{print $4}' \
               | sort \
               | uniq -c \
               | awk '{print $2, $1}' \
               | sort -k1,1 > p56.rep1.reads_per_cell
# consecutive promoter coverage 
> intersectBed -wa -wb \
               -a p56.rep1.bed.gz \
               -b mm10_consecutive_promoters.bed \
               | awk '{print $4, $8}' \
               | sort \
               | uniq \ 
               | awk '{print $1}' \
               | uniq -c \
               | awk '{print $2, $1}' \
               | sort -k1,1 > p56.rep1.promoter_cov 
# reads in peak ratio
> intersectBed -a p56.rep1.bed.gz -b p56.rep1_peaks.narrowPeak -u \
               | awk '{print $4}' \
               | sort \
               | uniq -c \
               | awk '{print $2, $1}' \
               | sort -k1,1 - > p56.rep1.reads_in_peak

Step 5. Cell selection (output.xgi)

> R
##################################################################
# 1) reads per barcode >= 1000
# 2) consecutive promoter coverage > 3%
# 3) reads in peak ratio >= 20%
# NOTE: The cutoff can vary singificantly between different samples
# 4) filter potential doublets (OPTIONAL NOT SUGGUESTED, UNSTABLE)
##################################################################

consecutive_promoters <- read.table("mm10/mm10_consecutive_promoters.bed")
num_of_reads = read.table("p56.rep1.reads_per_cell")
promoter_cov = read.table("p56.rep1.promoter_cov")
read_in_peak = read.table("p56.rep1.reads_in_peak")
qc = num_of_reads; 
colnames(qc) <- c("barcode", "num_of_reads")
qc$promoter_cov = 0; 
qc$read_in_peak = 0;
qc$promoter_cov[match(promoter_cov$V1, qc$barcode)] = promoter_cov$V2/nrow(consecutive_promoters)
qc$read_in_peak[match(read_in_peak$V1, qc$barcode)] = read_in_peak$V2
qc$ratio = qc$read_in_peak/qc$num_of_reads
idx <- which(qc$promoter_cov > 0.09 & qc$ratio > 0.2 & qc$num_of_reads > 1000)
qc_sel <- qc[idx,]

# among these cells, further filter PUTATIVE DOUBLETS ((OPTIONAL NOT SUGGUESTED, UNSTABLE))
#pvalues <- sapply(qc_sel$num_of_reads, function(x) 
#           poisson.test(x, mean(qc_sel$num_of_reads), 
#           alternative="greater")$p.value)
#fdrs <- p.adjust(pvalues, "BH")
#idx <- which(fdrs < 1e-2)

write.table(qc_sel[,1], file = "p56.rep1.xgi", append = FALSE, 
			  quote = FALSE, sep = "\t", eol = "\n", 
			  na = "NA", dec = ".", row.names = FALSE,
			  col.names = FALSE, qmethod = c("escape", "double"),
			  fileEncoding = "")

Step 5. Feature selection (output.xgi);

> R
##################################################################
# 1) Filter top 5% peaks
# 2) Filter promoters
# 3) extend and merge
##################################################################
library(GenomicRanges)
peaks.df <- read.table("p56.rep1_peaks.narrowPeak")
# remove top 5% peaks
cutoff <- quantile((peaks.df$V5), probs = 0.95)
peaks.df <- peaks.df[which(peaks.df$V5 < cutoff),]
proms.df <- read.table("mm10/mm10.refSeq_promoter.bed")
peaks.gr <- GRanges(peaks.df[,1], IRanges(peaks.df[,2], peaks.df[,3]))
proms.gr <- GRanges(proms.df[,1], IRanges(proms.df[,2], proms.df[,3]))

peaks.sel.gr <- peaks.gr[-queryHits(findOverlaps(peaks.gr, proms.gr))]
peaks.sel.ex.gr <- resize(reduce(resize(peaks.sel.gr, 1000, 
                          fix="center")), 1000, fix="center")

peaks.sel.ex.df <- as.data.frame(peaks.sel.ex.gr)[,1:3]
write.table(peaks.sel.ex.df, file = "p56.rep1.ygi", 
			  append = FALSE, quote = FALSE, sep = "\t", 
			  eol = "\n", na = "NA", dec = ".", 
			  row.names = FALSE, col.names = FALSE, 
			  qmethod = c("escape", "double"),
			  fileEncoding = "")

Step 6. Generate binary accessibility matrix (NOTE: this may require large RAM)

> ./bin/snATAC bmat -i p56.rep1.bed.gz \
                    -x p56.rep1.xgi \
                    -y p56.rep1.ygi \
                    -o p56.rep1.mat

Step 7. Calculate jaccard index (NOTE: snATAC jacard may require large memory when you have more cells, "Segmentation fault" means you need larger RAM)

# mannually count number of rows (1,464)
> wc -l p56.rep1.xgi
# mannually count number of columns (184,519)
> wc -l p56.rep1.ygi
# calculate jaccard index matrix
> ./bin/snATAC jacard -i p56.rep1.mat -x 1464 -y 184519 -o p56.rep1.jacard

Step 8. Dimentionality reduction (R) (NOTE: tSNE is a heristic method, it is expected that different run can have slightly different result)

> R
##################################################################
# Dimentionality reduction using t-SNE
# Please refer to https://lvdmaaten.github.io/tsne/ for how to
# tune parameters such as perplexity. We find most of time, it 
# takes less than 500 iteration to converge.
##################################################################

library(tsne)

data <- as.matrix(read.table("p56.rep1.jacard"))
diag(data) <- 0
b = tsne(data/sum(data), initial_config = NULL, 
		  k = 2, perplexity = 30, max_iter = 500, 
		  min_cost = 0, epoch_callback = NULL, 
		  whiten = TRUE, epoch=100)

plot(b, cex=0.7, xlab="tsne1", ylab="tsne2")
write.table(b, file = "p56.rep1.tsne", append = FALSE, 
            quote = FALSE, sep = "\t", eol = "\n", 
            na = "NA", dec = ".", row.names = FALSE,
            col.names = FALSE, qmethod = c("escape", "double"),
            fileEncoding = "")

plot

Step 9. Density-based clustering

> R
##################################################################
# NOTE: Choosing cluster number is absolutely an art! In our paper, 
# we adopted a method originally published from Habib et al. that
# determines the best cluster number by optimizing Dunn Index.
# However, such approach is extremely slow with time complexity
# O(n^3), without optimization and better implementation, it is
# almost impossible to run on a sample of thousands of cells.
# Therefore, we decided to switch to a heuristic approach as shown
# below. Though, it does not guarantee the best result (in fact, 
# none of the methods do), from our experience, it gives satisfactory
# result. At the end of the day, choosing the optimal cluster number
# is very tricky task and it is hard to be fully automatic. We still
# post our code for Dunn Index approach as described in our paper. 
##################################################################

library(densityClust)
MaxStep <- function(D){
	D_hat <- D
	n = nrow(D)
	for(k in 1:n){
		for(i in 1:(n-1)){
			for(j in (i+1):n){
				D_hat[i,j] <- min(D_hat[i,j], 
				max(D_hat[k,j], D_hat[i,k]))
			}
		}
	}
	return(D_hat)
}
find_center <- function(p, clust){
	centers <- data.frame()
	cols <- c()
	for(i in as.numeric(names(table(clust)))){
		ind <- which(clust== i)
		if(length(ind) > 10){
			centers <- rbind(centers, colMeans(p[ind,1:3]))
			cols <- c(cols, i)
		}
	}
	res <- cbind(centers, cols)
	colnames(res) <- c("x", "y", "z", "col")
	return(res)
}
DB_index <- function(D, CL){
	cl_uniq <- unique(CL)
	n <- length(cl_uniq)
	d_k <- rep(0, n)
	d_ij <- matrix(0, n, n)
	for(i in 1:n){
		ii <- which(CL == i);
		d_k[i] <- max(MaxStep(D[ii,ii]))
	}
	for(i in 1:(n-1)){
		for(j in (i+1):n){
			ii <- which(CL == i | CL == j)
			d_ij <- max(MaxStep(D[ii,ii]))
		}
	}
	return(min(d_ij)/max(d_k))
}

# perform cluster
points <- read.table("p56.rep1.tsne")

set.seed(10)
dis <- dist(points)
irisClust <- densityClust(dis, gaussian=TRUE)
# plot decision graph
plot(irisClust)

rho_cutoff <- irisClust$dc
delta_cutoff <- 20
irisClust <- findClusters(irisClust, 
                          rho= rho_cutoff, 
                          delta= delta_cutoff)
clusters <- irisClust$cluster

plot(points, col=clusters, cex=0.5, pch=19)
dev.off()

write.table(data.frame(cluster), file = "p56.rep1.cluster",
			append = FALSE, quote = FALSE, sep = "\t",
            eol = "\n", na = "NA", dec = ".", row.names = FALSE,
            col.names = FALSE, qmethod = c("escape", "double"),
            fileEncoding = "")

cluster

Cite us

Preissl S.*, Fang R.*, Zhao Y., Raviram R., Zhang Y., Brandon C.S., Huang H., Gorkin D.U., Afzal V., Dickel D.E., Kuan S., Visel A., Pennacchio L.A., Zhang K., Ren B. Single-nucleus analysis of accessible chromatin in developing mouse forebrain reveals cell-type-specific transcriptional regulation. Nat Neurosci. 2018 Feb 12. doi: 10.1038/s41593-018-0079-3.(* contributed equally)

File Organization

.
# raw data
|____p56.rep1.R1.decomplex.fastq.gz
|____p56.rep1.R2.decomplex.fastq.gz
# pre-defined features for mm10
|____mm10
| |____mm10.genome.size
| |____mm10_consecutive_promoters.bed
| |____mm10.refSeq_promoter.bed
# output from alignment
|____p56.rep1.bam
# output from pre-proessing
|____p56.rep1.log
|____p56.rep1.bed.gz
# output from MACS2
|____p56.rep1_peaks.xls
|____p56.rep1_peaks.narrowPeak
|____p56.rep1_summits.bed
# output from barcode stats
|____p56.rep1.reads_in_peak
|____p56.rep1.promoter_cov
|____p56.rep1.reads_per_cell
# output from barcode selection
|____p56.rep1.xgi
# output from feature selection
|____p56.rep1.ygi
# output from binary accessibility matrix
|____p56.rep1.mat
# output from jaccard matrix
|____p56.rep1.jacard
# output from dimentionality reduction
|____p56.rep1.tsne
# output from clustering
|____p56.rep1.cluster

snatac's People

Contributors

r3fang avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

snatac's Issues

memoryview error with snATAC pre

Hello,
I am trying to process the data. I am getting this error when using snATAC pre:

File "/home//anaconda3.4.3/lib/python3.6/gzip.py", line 260, in write
data = memoryview(data)
TypeError: memoryview: a bytes-like object is required, not 'str'

Permission Denied to Download Files from http://renlab.sdsc.edu/r3fang/snATAC/

Hi, First - I wanted to say thanks and well done on establishing this protocol. This is very impressive and interesting! We are very interested in scATAC-seq of tissue samples and would like to look at your dataset. However, it seems that not all the files in the http://renlab.sdsc.edu/r3fang/snATAC/ folder are open for public download. For about half of them (e11.5_R2, e12.5_R2, e14.5_R2, e15.5_R2, e16.5_R2, p0_R2, p56_R1, p56_R2), the following error arises when trying to Download

-sh-4.1$ wget http://renlab.sdsc.edu/r3fang/snATAC/p56_R1.decomplex.fastq.gz
--2018-03-07 11:23:55--  http://renlab.sdsc.edu/r3fang/snATAC/p56_R1.decomplex.fastq.gz
Resolving renlab.sdsc.edu... 198.202.90.216
Connecting to renlab.sdsc.edu|198.202.90.216|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2018-03-07 11:23:55 ERROR 403: Forbidden.

We wanted to look at the full dataset, would you be able to provide download access for all the decomplexed fastq files?

Many thanks, Julia

mm10_consecutive_promoters.bed file

Hi there,

I'm now working on the pipeline. I'm curious how do you generate the file "mm10_consecutive_promoters.bed"? If also need the file of hg19, where would be the best place to download them? Thanks!

Sincerely,
Simon

How snATAC jacard works?

I couldn't see the source code of snATAC jacard, therefore I tried snATAC jacard with my test data.
However, I couldn't make out the results.

Input test data is as below (not resized peaks)

test.bed
chr1 1 1000 test_a
chr1 2001 3000 test_a
chr1 10001 20000 test_a
chr1 30001 40000 test_a
chr1 1 1000 test_b
chr1 10001 20000 test_b
chr1 10001 20000 test_c
chr1 30001 40000 test_c
test.xgi
test_a
test_b
test_c
test.ygi
chr1 1 1000
chr1 1001 2000
chr1 2001 3000
chr1 10001 20000
chr1 20001 30000
chr1 30001 40000

test.mat (this is the output of read.table(as.matrix("test.mat")) in R)
V1 V2 V3 V4 V5 V6
1 1 0 1 1 0 1
2 1 0 0 1 0 0
3 0 0 0 1 0 1

And the output of "snATAC jacard -i test.mat -x 3 -y 6 -o test.jaccard" was below
0.00 600.00 600.00
600.00 0.00 500.00
600.00 500.00 0.00

This result seems that the function "snATAC jaccard" calculate the Jaccard distance of the vector assigned to each cell (e.g. (1,0,1,1,0,1) for test_a cell) and we have to resize the width of each peak in .ygi.
However, if so, the Jaccard distance between test_a and test_b would be 0.5(2/4) and the value of 1000x jaccard would be 500. Why is the result is slightly different from the expected result?

(I think "1" is added both denominator and numerator. (2/4→3/5→0.6, 1/3→2/4→0.5) )

mm10.fa file

Hi there,

Really appreciate for your scATAC-seq pipeline! It really help my research a lot!
I have one question in your tutorial: I can't find mm10.fa file in your sample dataset after I unzip the mm10.tar.gz file. Could you please let me know where could I have this file? Thanks!

Simon

Segmentation fault

Hi there,

I got the number 4598 from the command wc -l p56.rep1.xgi and
184519 from wc -l p56.rep1.ygi.
After I use the command ./bin/snATAC_jacard -i p56.rep1.mat -x 4598 -y 184519 -o p56.rep1.jacard
it shows: Segmentation fault

Do you have any suggestion?

Simon

p56.rep1.txt file

Hi there,

I don't have the p56.rep1.txt file to proceed the intersectBed step after followed each of your previous codes. Do you have any suggestion? Thanks!

Sincerely,
Simon

BWA mapping

Hi there,

Really appreciate for your pipeline of snATAC-seq.
I follow your tutorial and execute the commands; however, some of the execution are quite slow (takes more than one day).
Could you please provide the output file in the step of

          bwa mem -t 1 mm10.fa \
          p56.rep1.R1.decomplex.fastq.gz \
          p56.rep1.R2.decomplex.fastq.gz \
          | samtools view -bS - > p56.rep1.bam

Also, is it possible to run this command in multiple cores?

Sincerely,
Simon

About barcode sequence

Hello !
The software ran successfully in the example dataset. but I noticed some of index sequence is different from the Supplementary Table 5 in paper 'Single_nucleus analysis of accessible chromatin in developing mouse forebrain reveals cell_type_specific transcriptional regulation'.
And with barcode sequence provided here, we can not completely correct barcode of dataset SRR6768121.
How could I generate index sequence according to the Supplementary Table 5 mentioned above ?
Do I need to reverse pcr index sequence or sth. else ?

Thank you and good luck !
Kylin

Unable to reproduce...

Hi, I am interested in the snATAC method, and I was trying to first reproduce your pipeline to make sure that I can do the same with my dataset.

I copied the tutorial exactly (except for downloading the mm10.fa, since I already have local BWA Index for mm10), and made sure that I am doing the same thing. However, when I filter out the barcodes with your pipeline (>1000 reads per cell, 3% consecutive promoter coverage, 20% reads in peak), I'm not getting 1464 barcodes--instead, 4600 of them--, and I am unable to produce any good clustering on the tSNE as it shows in the pipeline. I read in the paper that you got better clustering using the distal promoter regions, so I also tried creating a bed file containing regions +2kb upstream of the refSeq promoter.bed file as well, yet I still don't see much of a difference.

Would you be able to help me understand how you filtered out the barcode...? (+ tips on making good tSNE maps?)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.