yipeng-gao / scdapars Goto Github PK

View Code? Open in Web Editor NEW

10.0 1.0 1.0 9.27 MB

Dynamic Analysis of Alternative Polyadenylation from single-cell RNA-seq (scDaPars)

R 100.00%

single-cell-rna-seq alternative-polyadenylation

scdapars's Introduction

scDaPars: Dynamic Analysis of Alternative Polyadenylation from Single-Cell RNA-Seq

Yipeng Gao, Wei Li 2020-08-22

Latest News

2020-08-22:

Version 0.0.1 is released!

Introduction

scDaPars is a bioinformatics algorithm to accurately quantify Alternative Polyadenylation (APA) events at both single-cell and single-gene resolution using standard scRNA-seq data.

Step.1 scDaPars first takes scRNA-seq genome coverage data (bedgraph format) as input and forms a linear regression model to jointly infer the exact location of proximal poly(A) sites (Current Version of scDaPars do not support this function, raw PDUI values are calulated seperately using DaPars2). DaPars2 can be downloaded and the instructions to run DaPars2 can be found in DaPars2 Instruction.

Step.2 scDaPars constructs a nearest neighbor graph based on the sparse APA matrix generated in step.1 to identify a pool of candidate neighboring cells that have similar APA profiles.

Step.3 scDaPars uses a non-negative least square (NNLS) regression model to refine neighboring cells and impute PDUIs of dropout genes in each cell.

We welcome any suggestions on the package. For technical problems, please report to Issues. For suggestions and comments, please contact Yipeng ([email protected]) or Dr. Wei Li ([email protected]).

scDaPars Pipeline

Installation

The package is not on CRAN yet. For installation please use the following codes in R

install.packages("devtools")
library(devtools)

install_github("YiPeng-Gao/scDaPars")

Quick Start

The imputation steps of scDaPars takes the APA matrix from step.1 as input and in the simplest case, the imputation can be done with one single function scDaPars:

scDaPars(raw_PDUI_file,         # full path of the raw PDUI matrix generated by step1 of scDaPars
         out_dir,               # full path of the output directory
         filter_gene_thre,      # the percent of cells a gene's APA must be detected in step.1
         filter_cell_thre)      # the percent of gene APA a cell must be detected in step.1

Example

The dataset used in this example is a time-course scRNA-seq dataset containing 758 cells sequenced at 0, 12, 24, 36, 72 and 96 h of differentiation during human definitive endoderm (DE) emergence from Chu et al. under GEO accession code GSE75748. After quality control, there is 739 cells remained for analysis.

Genearting raw PDUI matrix

For generating raw PDUI matrix in step 1, we assume that the scRNA-seq data has been preprocessed, so that we have one wiggle file per cell. The raw PDUI files are then generated by DaPars2. The raw PDUI matrix for this example "Dapars_hESC_combined_all_chromosome.txt" is included in the example folder.

install and load scDaPars R package

if(!require(devtools)) install.packages("devtools")
library(devtools)

## Loading required package: devtools

## Loading required package: usethis

devtools::install_github("YiPeng-Gao/scDaPars")
library(scDaPars)


## Loading required package: survival

## Welcome to penalized. For extended examples, see vignette("penalized").

## Loading required package: foreach

## Loading required package: RANN

## Loading required package: igraph

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

Run scDaPars

scDaPars.res = scDaPars(raw_PDUI_file = "Dapars_hESC_combined_all_chromosome.txt", 
                        out_dir = "./scDaPars_result", 
                        filter_gene_thre = 0.2, 
                        filter_cell_thre = 0.1)

## [1] "Reading in raw PDUI matrix ..."
## [1] "number of genes in raw count matrix 13029"
## [1] "number of cells in raw count matrix 743"
## [1] "Reading finished!"
## [1] "Start Processing raw PDUI matrix"
## [1] "Pre-Processing finished"
## [1] "Start Imputation Steps ..."
## [1] "Find potential Neighboring Cells ..."
## [1] "Number of neighbors(clusters) is 4"
## [1] "Outliers is/are "
## [1] "Imputation with neighboring cells ..."
## [1] 100
## [1] 100
## [1] 100
## [1] 200
## [1] 300
## [1] "Imputation Steps Finished!"
## [1] "Writing imputed PDUI matrix ..."

head(scDaPars.res)[,1:6]

##                      SRR2978558  SRR2978559  SRR2978560  SRR2978561  SRR2978562
## ENST00000000412.3    1.00000000 0.804251535 0.857067085 1.000000000 0.706309880
## ENST00000002165.10_1 0.01711468 0.002728381 0.001751477 0.001465125 0.001576918
## ENST00000003100.13_3 0.05000000 0.607145404 0.496367748 0.733390448 0.611959391
## ENST00000005082.13_1 0.75459201 1.000000000 0.788302984 0.110000000 0.672660621
## ENST00000005257.7_2  0.15491832 0.000000000 0.289365021 0.285070704 0.224915218
## ENST00000005259.8_2  0.05000000 0.046576290 0.036624434 0.050765637 0.070246239
##                       SRR2978564
## ENST00000000412.3    0.710688649
## ENST00000002165.10_1 0.002705341
## ENST00000003100.13_3 0.523058864
## ENST00000005082.13_1 0.655225597
## ENST00000005257.7_2  1.000000000
## ENST00000005259.8_2  0.071559391

Visualize scDaPars' results

#Load cell type information for the example data

hESC_SRA = read.table("SraRunTable.txt", header = T, sep = ",", stringsAsFactors = F)
cell_type = hESC_SRA[which(hESC_SRA$Run %in% colnames(scDaPars.res)),]
cell_type = cell_type[match(colnames(scDaPars.res), cell_type$Run),]
head(cell_type)

##            Run Assay.Type AvgSpotLen  BioProject    BioSample Center.Name
## 933 SRR2978558    RNA-Seq         51 PRJNA305280 SAMN04322527         GEO
## 934 SRR2978559    RNA-Seq         51 PRJNA305280 SAMN04322528         GEO
## 935 SRR2978560    RNA-Seq         51 PRJNA305280 SAMN04322529         GEO
## 936 SRR2978561    RNA-Seq         51 PRJNA305280 SAMN04322530         GEO
## 937 SRR2978562    RNA-Seq         51 PRJNA305280 SAMN04322531         GEO
## 939 SRR2978564    RNA-Seq         51 PRJNA305280 SAMN04322533         GEO
##     Consent DATASTORE.filetype DATASTORE.provider
## 933  public                sra         gs,ncbi,s3
## 934  public                sra         gs,ncbi,s3
## 935  public                sra         gs,ncbi,s3
## 936  public                sra         gs,ncbi,s3
## 937  public                sra         gs,ncbi,s3
## 939  public                sra         gs,ncbi,s3
##                   DATASTORE.region Experiment facs_sorting GEO_Accession
## 933 gs.US,ncbi.public,s3.us-east-1 SRX1468145   not sorted    GSM1965870
## 934 gs.US,ncbi.public,s3.us-east-1 SRX1468146   not sorted    GSM1965871
## 935 gs.US,ncbi.public,s3.us-east-1 SRX1468147   not sorted    GSM1965872
## 936 gs.US,ncbi.public,s3.us-east-1 SRX1468148   not sorted    GSM1965873
## 937 gs.US,ncbi.public,s3.us-east-1 SRX1468149   not sorted    GSM1965874
## 939 gs.US,ncbi.public,s3.us-east-1 SRX1468151   not sorted    GSM1965876
##              Instrument LibraryLayout LibrarySelection  LibrarySource MBases
## 933 Illumina HiSeq 2500        SINGLE             cDNA TRANSCRIPTOMIC    105
## 934 Illumina HiSeq 2500        SINGLE             cDNA TRANSCRIPTOMIC    116
## 935 Illumina HiSeq 2500        SINGLE             cDNA TRANSCRIPTOMIC    141
## 936 Illumina HiSeq 2500        SINGLE             cDNA TRANSCRIPTOMIC     99
## 937 Illumina HiSeq 2500        SINGLE             cDNA TRANSCRIPTOMIC     90
## 939 Illumina HiSeq 2500        SINGLE             cDNA TRANSCRIPTOMIC    103
##     MBytes     Organism       passage Platform          ReleaseDate sample_acc
## 933     65 Homo sapiens passage 30-35 ILLUMINA 2016-07-22T00:00:00Z SRS1194214
## 934     72 Homo sapiens passage 30-35 ILLUMINA 2016-07-22T00:00:00Z SRS1194213
## 935     85 Homo sapiens passage 30-35 ILLUMINA 2016-07-22T00:00:00Z SRS1194212
## 936     61 Homo sapiens passage 30-35 ILLUMINA 2016-07-22T00:00:00Z SRS1194211
## 937     54 Homo sapiens passage 30-35 ILLUMINA 2016-07-22T00:00:00Z SRS1194210
## 939     62 Homo sapiens passage 30-35 ILLUMINA 2016-07-22T00:00:00Z SRS1194208
##     Sample.Name                          source_name SRA.Study
## 933  GSM1965870 H9 cells differentiated for 12 hours SRP067036
## 934  GSM1965871 H9 cells differentiated for 12 hours SRP067036
## 935  GSM1965872 H9 cells differentiated for 12 hours SRP067036
## 936  GSM1965873 H9 cells differentiated for 12 hours SRP067036
## 937  GSM1965874 H9 cells differentiated for 12 hours SRP067036
## 939  GSM1965876 H9 cells differentiated for 12 hours SRP067036

Perform UMAP analysis

scDaPars.res.umap = umap(t(scDaPars.res))
scDaPars.res.umap.data = data.frame(scDaPars.res.umap$layout)
colnames(scDaPars.res.umap.data) = c("Dim1", "Dim2")
scDaPars.res.umap.data$cellType = cell_type$source_name

Generate Scatter plot

ggplot(scDaPars.res.umap.data, aes(x=Dim1, y=Dim2, color = cellType)) +
  scale_color_manual(values = c("#7DD2D9", "#FFA500", "#e55b54", "#ad6c58", "#989797", "#166FD5")) +
  geom_point(size = 0.5) +
  theme_bw(base_size = 14) +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  theme(legend.position = "right", legend.title = element_blank())

scdapars's People

Contributors

Stargazers

Watchers

Forkers

bit-vs-it

scdapars's Issues

Different sample analysis

Hello,

Is it possible to do a significant analysis of differences between samples.

confounding factor

Hello,
For example, I have 2 scRNAseq datasets (AD and control), each consisting of 10 samples. But each sample has some different features, such as age and sex. When we analyze differential expression for such bulk RNAseq or scRNAseq, we usually regress out these possible confounding factors(age and sex). my question is that DaPars or scDaPars regresses out those confounding factors?

Single cell sequencing data type selection

Hello!I just learned about this software.The original article described the use of scRNA-Seqdata[10x Chromium or full-length (Smart-seq2)]. I would like to ask if other forms of single cell data are available, such as Singleron's single cell sequencing data.

dapars2 result from 10x data

A chromosome has only a few lines. Is your result the same?

code

Does the code need further optimization？

var_thre

Hi,
What do you base your var_thre value on?

raw_PDUI = read.table("Dapars_hESC_combined_all_chromosome.txt", header = TRUE, stringsAsFactors = F)
raw_PDUI_matrix_sc = raw_PDUI[,5:ncol(raw_PDUI)]
genenames = sapply(strsplit(raw_PDUI$Gene, "[|]"), `[`, 1)
row.names(raw_PDUI_matrix_sc) = genenames
raw_PDUI_matrix_sc = raw_PDUI_matrix_sc[order(row.names(raw_PDUI_matrix_sc)),]
raw_PDUI_matrix_sc = raw_PDUI_matrix_sc[which(rowSums(!is.na(raw_PDUI_matrix_sc)) >= ncol(raw_PDUI_matrix_sc) * 0.2),]
pass_cells_Dapars= names(which(colSums(!is.na(raw_PDUI_matrix_sc)) >= nrow(raw_PDUI_matrix_sc) * 0.1))
raw_PDUI_matrix_sc = subset(raw_PDUI_matrix_sc, select = which(colnames(raw_PDUI_matrix_sc) %in% pass_cells_Dapars))
raw_PDUI_matrix_sNA = raw_PDUI_matrix_sc
raw_PDUI_matrix_sNA[is.na(raw_PDUI_matrix_sNA)] = 0
var_thre = 0.4     # var_thre 
pca = prcomp(t(raw_PDUI_matrix_sNA))
eigs = (pca$sdev)^2
var_cum = cumsum(eigs)/sum(eigs)

What do you base your var_thre value on?

If I modify var_thre = 0.5, my data can be analyzed using scDaPars
if i use var_thre = 0.4, run scDaPars,show error :

The screenshot of my data is as follows, pc1 = 0.4511342, so I want to consult you, can the value of var_thre be adjusted? What is the range of var_thre values?

Thank you very much.

cell type information

Hello,
when I try to run scDaPars on one public scRNA-seq, I reaslized that I need to provide the cell type information.
scDaPars.res.umap = umap(t(scDaPars.res))
scDaPars.res.umap.data = data.frame(scDaPars.res.umap$layout)
colnames(scDaPars.res.umap.data) = c("Dim1", "Dim2")
scDaPars.res.umap.data$cellType = cell_type$source_name

My question is that scDaPars can identify cell clusters, not depends known cell typ information?

differential 3'UTR

Hello,
I just ran scDaPars and got the results, it seems that scDaPars does not perform differential 3'UTR between differential cells or clusters.

ask for the minimum bam file numbers

Hello,

if I only have 4 bam files , can I continue to use scDapars for analysis?
when I run scDapars with 4 bam files ,the program reported the following info :

[1] "Reading in raw PDUI matrix ..."
[1] "number of genes in raw count matrix 2353"
[1] "number of cells in raw count matrix 8"
[1] "Reading finished!"
[1] "Start Processing raw PDUI matrix"
[1] "Pre-Processing finished"
[1] "Start Imputation Steps ..."
[1] "Find potential Neighboring Cells ..."
Error in RANN::nn2(t(mat_pcs), k = 30) :
Cannot find more nearest neighbours than there are points

So,what changes should I make to let this program work?
Looking forward to your reply. Thank you.

10x data

Hello!
For 10x example data,I want to know which GSMs in GEO132044 you used in scDapars, because there are too many samples in GEO132044.

the number of bam for single cell

Hi
I have a bam about 40G.I creat the bam for single cell about 60000.The number is so large.could you tell me your number of bam for single cell.If possiable ,would you mind providing your code for me .Thank you very much
Yang Fei
2021.9.24

How to obtain the site of distal PAS?

Thank you very much for launching scDaPars, I have a question.

In the process of running scDaPars, we first used DaPars2 to generate a table, as shown below, containing the predicted proximal PAS position, 3’UTR information, and PDUI values of different transcripts in each cell, but the table did not contain the distal PAS position information. I would like to ask how to obtain the location of the distal PAS and whether such information is generated during the process of using scDaPars?

scDaPars

scDaPars.res = scDaPars(raw_PDUI_file = "/DaPars2_Test_data_result_temp.chr3.txt",out_dir = "/scDaPars_result",filter_gene_thre = 0.2,filter_cell_thre = 0.1)
show error :

Error in RANN::nn2(t(mat_pcs), k = 30) :
Cannot find more nearest neighbours than there are points

generated one BAM file for each single cell

Hello, can you provide an operation for this step?
Write a script to operate this step. The 1G file is split into files for each cell. This process takes 10h.
Each bam file will be split into thousands of files

problem

The bam file is separated into the bam of different cells. and there are about 700,000 files.
The barcode sequence recorded by barcodes.tsv.gz is 9403.
Need to refer to barcodes.tsv.gz information?
DaPars produces the result file of each chromosome, and finally I use cat to merge the result files of all chromosomes to form a merged file, right?
What are the thresholds of scDaPars?

PDUI of only less than 1000 genes were successfully estimated

Hi,
First thanks for developing such an awesome tool. I applied it on my 10X data but found only ~1000 genes were included in the final output file. I think it's strange because there are much more genes. in the figures from your Genome Research paper,

The quality of my data is quite good (median: ~ 3000 UMIs and > 1500 genes per cell). All my cells are generated from 10X 3UTR library. I use cellranger to map reads to GRCh38, and then split the bam by sinto (python package). Then I used umi_tools for deduplication. The file size of bam for each cell ranges from 5MB to 20 MB. Then I used bedtools genomecov to get the bedgraph file and fed them to dapars2. When I run dapars2, I set the Coverage_threshold to zero in include as much as genes.

I wonder that, from your experience, is my result strange? If so, which step is possibly wrong? Could you please give me some kind advice?

Best
Yang

the number of genes

Use 50G bam file to run DaPars software, the result file contains more than 10,000 genes.
Split the 50G bam file into single cell bam files according to the barcode, and run DaPars software. The DaPars result file contains hundreds of genes.
Excuse me, after the bam file is split into a single cell bam file according to the barcode, do I need to modify some parameters? Can you give advice on reducing the number of genes so much?

how to generate raw PDUI

Hello,
I want to run scDaPars to analyze scRNA-seq. scDaPars takes as input the raw PDUI generated by DaPars2. Would you like to tell me how to run DaPars2? I have checked the DaPars2 (https://github.com/3UTR/DaPars2), it does not show how to run it.
Best,

Which test method was used to do differential PDUI analysis

Hi，
I've got the PDUI matrix of each cell and each gene in one sample by scDaPars. And I want to konw which test method was used to do differential PDUI analysis between two cell type?
Looking forward to your reply.
Thanks.

Error in qr.coef(qr(a, LAPACK = TRUE), b) : error code 95 from Lapack routine 'dtrtrs'

Hi, I came across this problem when I was running the following command lines:
scDaPars.res = scDaPars(raw_PDUI_file = paste0(path,"Dapars2_hspc_combined_all_chr.txt"),
out_dir = "./scDaPars_result",
filter_gene_thre = 0.2,
filter_cell_thre = 0.1)

The output messages are as follows:
[1] "Reading in raw PDUI matrix ..."
[1] "number of genes in raw count matrix 6276"
[1] "number of cells in raw count matrix 1250"
[1] "Reading finished!"
[1] "Start Processing raw PDUI matrix"
[1] "Pre-Processing finished"
[1] "Start Imputation Steps ..."
[1] "Find potential Neighboring Cells ..."
[1] "Number of neighbors(clusters) is 5"
[1] "Outliers is/are "
[1] "Imputation with neighboring cells ..."
[1] 100
[1] 200
Error in qr.coef(qr(a, LAPACK = TRUE), b) :
error code 84 from Lapack routine 'dtrtrs'
Error in qr.coef(qr(a, LAPACK = TRUE), b) :
error code 63 from Lapack routine 'dtrtrs'
Error in qr.coef(qr(a, LAPACK = TRUE), b) :
error code 347 from Lapack routine 'dtrtrs'
Error in qr.coef(qr(a, LAPACK = TRUE), b) :
error code 49 from Lapack routine 'dtrtrs'
Error in qr.coef(qr(a, LAPACK = TRUE), b) :
error code 81 from Lapack routine 'dtrtrs'
Error in qr.coef(qr(a, LAPACK = TRUE), b) :
error code 74 from Lapack routine 'dtrtrs'
.......

Which part of the code in scDaPars calls the qr.coef() function？I can't figure out why this error occurs. Any ideas?

Dapars_hESC_combined_all_chromosome.txt

Run DaPars2, Output format, /scDaPars/Example/Dapars_hESC_combined_all_chromosome.txt comes out with a column of numbers. What should I do with the data after Dapars2 runs? Dapars_hESC_combined_all_chromosome.txt What information is the number in the first column?

What does NA represent in the original PDUI matrix？

Hi,
I noticed that there are a lot of 'NA' values in raw PDUI matrix generated from DaPars2.
Does these ’NA‘ values mean that it is uncertain whether to use a distal polyA site for a certain gene?
Could you explain the specific meaning about 'NA' values?
Note: My data is scRNAseq data generated with SMART-seq2 protocol.
Thanks
Hanwen Yu
2021/07/18

Creating wiggle files for every cell

I am unsure of how I can make one Wiggle file for every cell. Should I split the bam file by barcode and then convert them to wig? Could you please help with this?

pipeline

Is the flow chart missing 3 steps and not displayed?

selected reads with correct unique molecular identifier (UMI) using Drop-seq tools FilterBAM (Macosko et al. 2015)
remove reads with duplicated UMIs using UMI-tools dedup (Smith et al. 2017).
merged reads originated from same cells together and generated one BAM file for each single cell