Giter Club home page Giter Club logo

scdapars's Introduction

scDaPars: Dynamic Analysis of Alternative Polyadenylation from Single-Cell RNA-Seq

Yipeng Gao, Wei Li 2020-08-22

Latest News

2020-08-22:

  • Version 0.0.1 is released!

Introduction

scDaPars is a bioinformatics algorithm to accurately quantify Alternative Polyadenylation (APA) events at both single-cell and single-gene resolution using standard scRNA-seq data.

Step.1 scDaPars first takes scRNA-seq genome coverage data (bedgraph format) as input and forms a linear regression model to jointly infer the exact location of proximal poly(A) sites (Current Version of scDaPars do not support this function, raw PDUI values are calulated seperately using DaPars2). DaPars2 can be downloaded and the instructions to run DaPars2 can be found in DaPars2 Instruction.

Step.2 scDaPars constructs a nearest neighbor graph based on the sparse APA matrix generated in step.1 to identify a pool of candidate neighboring cells that have similar APA profiles.

Step.3 scDaPars uses a non-negative least square (NNLS) regression model to refine neighboring cells and impute PDUIs of dropout genes in each cell.

We welcome any suggestions on the package. For technical problems, please report to Issues. For suggestions and comments, please contact Yipeng ([email protected]) or Dr. Wei Li ([email protected]).

scDaPars Pipeline

Installation

The package is not on CRAN yet. For installation please use the following codes in R

install.packages("devtools")
library(devtools)

install_github("YiPeng-Gao/scDaPars")

Quick Start

The imputation steps of scDaPars takes the APA matrix from step.1 as input and in the simplest case, the imputation can be done with one single function scDaPars:

scDaPars(raw_PDUI_file,         # full path of the raw PDUI matrix generated by step1 of scDaPars
         out_dir,               # full path of the output directory
         filter_gene_thre,      # the percent of cells a gene's APA must be detected in step.1
         filter_cell_thre)      # the percent of gene APA a cell must be detected in step.1

Example

The dataset used in this example is a time-course scRNA-seq dataset containing 758 cells sequenced at 0, 12, 24, 36, 72 and 96 h of differentiation during human definitive endoderm (DE) emergence from Chu et al. under GEO accession code GSE75748. After quality control, there is 739 cells remained for analysis.

  1. Genearting raw PDUI matrix

For generating raw PDUI matrix in step 1, we assume that the scRNA-seq data has been preprocessed, so that we have one wiggle file per cell. The raw PDUI files are then generated by DaPars2. The raw PDUI matrix for this example "Dapars_hESC_combined_all_chromosome.txt" is included in the example folder.

  1. install and load scDaPars R package
if(!require(devtools)) install.packages("devtools")
library(devtools)
## Loading required package: devtools

## Loading required package: usethis
devtools::install_github("YiPeng-Gao/scDaPars")
library(scDaPars)

## Loading required package: survival

## Welcome to penalized. For extended examples, see vignette("penalized").

## Loading required package: foreach

## Loading required package: RANN

## Loading required package: igraph

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union
  1. Run scDaPars
scDaPars.res = scDaPars(raw_PDUI_file = "Dapars_hESC_combined_all_chromosome.txt", 
                        out_dir = "./scDaPars_result", 
                        filter_gene_thre = 0.2, 
                        filter_cell_thre = 0.1)
## [1] "Reading in raw PDUI matrix ..."
## [1] "number of genes in raw count matrix 13029"
## [1] "number of cells in raw count matrix 743"
## [1] "Reading finished!"
## [1] "Start Processing raw PDUI matrix"
## [1] "Pre-Processing finished"
## [1] "Start Imputation Steps ..."
## [1] "Find potential Neighboring Cells ..."
## [1] "Number of neighbors(clusters) is 4"
## [1] "Outliers is/are "
## [1] "Imputation with neighboring cells ..."
## [1] 100
## [1] 100
## [1] 100
## [1] 200
## [1] 300
## [1] "Imputation Steps Finished!"
## [1] "Writing imputed PDUI matrix ..."
head(scDaPars.res)[,1:6]
##                      SRR2978558  SRR2978559  SRR2978560  SRR2978561  SRR2978562
## ENST00000000412.3    1.00000000 0.804251535 0.857067085 1.000000000 0.706309880
## ENST00000002165.10_1 0.01711468 0.002728381 0.001751477 0.001465125 0.001576918
## ENST00000003100.13_3 0.05000000 0.607145404 0.496367748 0.733390448 0.611959391
## ENST00000005082.13_1 0.75459201 1.000000000 0.788302984 0.110000000 0.672660621
## ENST00000005257.7_2  0.15491832 0.000000000 0.289365021 0.285070704 0.224915218
## ENST00000005259.8_2  0.05000000 0.046576290 0.036624434 0.050765637 0.070246239
##                       SRR2978564
## ENST00000000412.3    0.710688649
## ENST00000002165.10_1 0.002705341
## ENST00000003100.13_3 0.523058864
## ENST00000005082.13_1 0.655225597
## ENST00000005257.7_2  1.000000000
## ENST00000005259.8_2  0.071559391
  1. Visualize scDaPars' results
#Load cell type information for the example data

hESC_SRA = read.table("SraRunTable.txt", header = T, sep = ",", stringsAsFactors = F)
cell_type = hESC_SRA[which(hESC_SRA$Run %in% colnames(scDaPars.res)),]
cell_type = cell_type[match(colnames(scDaPars.res), cell_type$Run),]
head(cell_type)
##            Run Assay.Type AvgSpotLen  BioProject    BioSample Center.Name
## 933 SRR2978558    RNA-Seq         51 PRJNA305280 SAMN04322527         GEO
## 934 SRR2978559    RNA-Seq         51 PRJNA305280 SAMN04322528         GEO
## 935 SRR2978560    RNA-Seq         51 PRJNA305280 SAMN04322529         GEO
## 936 SRR2978561    RNA-Seq         51 PRJNA305280 SAMN04322530         GEO
## 937 SRR2978562    RNA-Seq         51 PRJNA305280 SAMN04322531         GEO
## 939 SRR2978564    RNA-Seq         51 PRJNA305280 SAMN04322533         GEO
##     Consent DATASTORE.filetype DATASTORE.provider
## 933  public                sra         gs,ncbi,s3
## 934  public                sra         gs,ncbi,s3
## 935  public                sra         gs,ncbi,s3
## 936  public                sra         gs,ncbi,s3
## 937  public                sra         gs,ncbi,s3
## 939  public                sra         gs,ncbi,s3
##                   DATASTORE.region Experiment facs_sorting GEO_Accession
## 933 gs.US,ncbi.public,s3.us-east-1 SRX1468145   not sorted    GSM1965870
## 934 gs.US,ncbi.public,s3.us-east-1 SRX1468146   not sorted    GSM1965871
## 935 gs.US,ncbi.public,s3.us-east-1 SRX1468147   not sorted    GSM1965872
## 936 gs.US,ncbi.public,s3.us-east-1 SRX1468148   not sorted    GSM1965873
## 937 gs.US,ncbi.public,s3.us-east-1 SRX1468149   not sorted    GSM1965874
## 939 gs.US,ncbi.public,s3.us-east-1 SRX1468151   not sorted    GSM1965876
##              Instrument LibraryLayout LibrarySelection  LibrarySource MBases
## 933 Illumina HiSeq 2500        SINGLE             cDNA TRANSCRIPTOMIC    105
## 934 Illumina HiSeq 2500        SINGLE             cDNA TRANSCRIPTOMIC    116
## 935 Illumina HiSeq 2500        SINGLE             cDNA TRANSCRIPTOMIC    141
## 936 Illumina HiSeq 2500        SINGLE             cDNA TRANSCRIPTOMIC     99
## 937 Illumina HiSeq 2500        SINGLE             cDNA TRANSCRIPTOMIC     90
## 939 Illumina HiSeq 2500        SINGLE             cDNA TRANSCRIPTOMIC    103
##     MBytes     Organism       passage Platform          ReleaseDate sample_acc
## 933     65 Homo sapiens passage 30-35 ILLUMINA 2016-07-22T00:00:00Z SRS1194214
## 934     72 Homo sapiens passage 30-35 ILLUMINA 2016-07-22T00:00:00Z SRS1194213
## 935     85 Homo sapiens passage 30-35 ILLUMINA 2016-07-22T00:00:00Z SRS1194212
## 936     61 Homo sapiens passage 30-35 ILLUMINA 2016-07-22T00:00:00Z SRS1194211
## 937     54 Homo sapiens passage 30-35 ILLUMINA 2016-07-22T00:00:00Z SRS1194210
## 939     62 Homo sapiens passage 30-35 ILLUMINA 2016-07-22T00:00:00Z SRS1194208
##     Sample.Name                          source_name SRA.Study
## 933  GSM1965870 H9 cells differentiated for 12 hours SRP067036
## 934  GSM1965871 H9 cells differentiated for 12 hours SRP067036
## 935  GSM1965872 H9 cells differentiated for 12 hours SRP067036
## 936  GSM1965873 H9 cells differentiated for 12 hours SRP067036
## 937  GSM1965874 H9 cells differentiated for 12 hours SRP067036
## 939  GSM1965876 H9 cells differentiated for 12 hours SRP067036

Perform UMAP analysis

scDaPars.res.umap = umap(t(scDaPars.res))
scDaPars.res.umap.data = data.frame(scDaPars.res.umap$layout)
colnames(scDaPars.res.umap.data) = c("Dim1", "Dim2")
scDaPars.res.umap.data$cellType = cell_type$source_name

Generate Scatter plot

ggplot(scDaPars.res.umap.data, aes(x=Dim1, y=Dim2, color = cellType)) +
  scale_color_manual(values = c("#7DD2D9", "#FFA500", "#e55b54", "#ad6c58", "#989797", "#166FD5")) +
  geom_point(size = 0.5) +
  theme_bw(base_size = 14) +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  theme(legend.position = "right", legend.title = element_blank())

scdapars's People

Contributors

yipeng-gao avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

Forkers

bit-vs-it

scdapars's Issues

confounding factor

Hello,
For example, I have 2 scRNAseq datasets (AD and control), each consisting of 10 samples. But each sample has some different features, such as age and sex. When we analyze differential expression for such bulk RNAseq or scRNAseq, we usually regress out these possible confounding factors(age and sex). my question is that DaPars or scDaPars regresses out those confounding factors?

Single cell sequencing data type selection

Hello!I just learned about this software.The original article described the use of scRNA-Seqdata[10x Chromium or full-length (Smart-seq2)]. I would like to ask if other forms of single cell data are available, such as Singleron's single cell sequencing data.

code

Does the code need further optimization?
image

var_thre

Hi,
What do you base your var_thre value on?

raw_PDUI = read.table("Dapars_hESC_combined_all_chromosome.txt", header = TRUE, stringsAsFactors = F)
raw_PDUI_matrix_sc = raw_PDUI[,5:ncol(raw_PDUI)]
genenames = sapply(strsplit(raw_PDUI$Gene, "[|]"), `[`, 1)
row.names(raw_PDUI_matrix_sc) = genenames
raw_PDUI_matrix_sc = raw_PDUI_matrix_sc[order(row.names(raw_PDUI_matrix_sc)),]
raw_PDUI_matrix_sc = raw_PDUI_matrix_sc[which(rowSums(!is.na(raw_PDUI_matrix_sc)) >= ncol(raw_PDUI_matrix_sc) * 0.2),]
pass_cells_Dapars= names(which(colSums(!is.na(raw_PDUI_matrix_sc)) >= nrow(raw_PDUI_matrix_sc) * 0.1))
raw_PDUI_matrix_sc = subset(raw_PDUI_matrix_sc, select = which(colnames(raw_PDUI_matrix_sc) %in% pass_cells_Dapars))
raw_PDUI_matrix_sNA = raw_PDUI_matrix_sc
raw_PDUI_matrix_sNA[is.na(raw_PDUI_matrix_sNA)] = 0
var_thre = 0.4     # var_thre 
pca = prcomp(t(raw_PDUI_matrix_sNA))
eigs = (pca$sdev)^2
var_cum = cumsum(eigs)/sum(eigs)

What do you base your var_thre value on?

image

If I modify var_thre = 0.5, my data can be analyzed using scDaPars
if i use var_thre = 0.4, run scDaPars,show error :

image

The screenshot of my data is as follows, pc1 = 0.4511342, so I want to consult you, can the value of var_thre be adjusted? What is the range of var_thre values?
image

Thank you very much.

cell type information

Hello,
when I try to run scDaPars on one public scRNA-seq, I reaslized that I need to provide the cell type information.
scDaPars.res.umap = umap(t(scDaPars.res))
scDaPars.res.umap.data = data.frame(scDaPars.res.umap$layout)
colnames(scDaPars.res.umap.data) = c("Dim1", "Dim2")
scDaPars.res.umap.data$cellType = cell_type$source_name

My question is that scDaPars can identify cell clusters, not depends known cell typ information?

differential 3'UTR

Hello,
I just ran scDaPars and got the results, it seems that scDaPars does not perform differential 3'UTR between differential cells or clusters.

image

ask for the minimum bam file numbers

Hello,

if I only have 4 bam files , can I continue to use scDapars for analysis?
when I run scDapars with 4 bam files ,the program reported the following info :


[1] "Reading in raw PDUI matrix ..."
[1] "number of genes in raw count matrix 2353"
[1] "number of cells in raw count matrix 8"
[1] "Reading finished!"
[1] "Start Processing raw PDUI matrix"
[1] "Pre-Processing finished"
[1] "Start Imputation Steps ..."
[1] "Find potential Neighboring Cells ..."
Error in RANN::nn2(t(mat_pcs), k = 30) :
Cannot find more nearest neighbours than there are points

So,what changes should I make to let this program work?
Looking forward to your reply. Thank you.

10x data

Hello!
For 10x example data,I want to know which GSMs in GEO132044 you used in scDapars, because there are too many samples in GEO132044.

the number of bam for single cell

Hi
I have a bam about 40G.I creat the bam for single cell about 60000.The number is so large.could you tell me your number of bam for single cell.If possiable ,would you mind providing your code for me .Thank you very much
Yang Fei
2021.9.24

How to obtain the site of distal PAS?

Thank you very much for launching scDaPars, I have a question.

In the process of running scDaPars, we first used DaPars2 to generate a table, as shown below, containing the predicted proximal PAS position, 3’UTR information, and PDUI values of different transcripts in each cell, but the table did not contain the distal PAS position information. I would like to ask how to obtain the location of the distal PAS and whether such information is generated during the process of using scDaPars?
image

scDaPars

scDaPars.res = scDaPars(raw_PDUI_file = "/DaPars2_Test_data_result_temp.chr3.txt",out_dir = "/scDaPars_result",filter_gene_thre = 0.2,filter_cell_thre = 0.1)
show error :

Error in RANN::nn2(t(mat_pcs), k = 30) :
Cannot find more nearest neighbours than there are points

generated one BAM file for each single cell

Hello, can you provide an operation for this step?
Write a script to operate this step. The 1G file is split into files for each cell. This process takes 10h.
Each bam file will be split into thousands of files

image

problem

  1. The bam file is separated into the bam of different cells. and there are about 700,000 files.
    The barcode sequence recorded by barcodes.tsv.gz is 9403.
    Need to refer to barcodes.tsv.gz information?
  2. DaPars produces the result file of each chromosome, and finally I use cat to merge the result files of all chromosomes to form a merged file, right?
  3. What are the thresholds of scDaPars?

PDUI of only less than 1000 genes were successfully estimated

Hi,
First thanks for developing such an awesome tool. I applied it on my 10X data but found only ~1000 genes were included in the final output file. I think it's strange because there are much more genes. in the figures from your Genome Research paper,

The quality of my data is quite good (median: ~ 3000 UMIs and > 1500 genes per cell). All my cells are generated from 10X 3UTR library. I use cellranger to map reads to GRCh38, and then split the bam by sinto (python package). Then I used umi_tools for deduplication. The file size of bam for each cell ranges from 5MB to 20 MB. Then I used bedtools genomecov to get the bedgraph file and fed them to dapars2. When I run dapars2, I set the Coverage_threshold to zero in include as much as genes.

I wonder that, from your experience, is my result strange? If so, which step is possibly wrong? Could you please give me some kind advice?

Best
Yang

the number of genes

Use 50G bam file to run DaPars software, the result file contains more than 10,000 genes.
Split the 50G bam file into single cell bam files according to the barcode, and run DaPars software. The DaPars result file contains hundreds of genes.
Excuse me, after the bam file is split into a single cell bam file according to the barcode, do I need to modify some parameters? Can you give advice on reducing the number of genes so much?

how to generate raw PDUI

Hello,
I want to run scDaPars to analyze scRNA-seq. scDaPars takes as input the raw PDUI generated by DaPars2. Would you like to tell me how to run DaPars2? I have checked the DaPars2 (https://github.com/3UTR/DaPars2), it does not show how to run it.
Best,

Error in qr.coef(qr(a, LAPACK = TRUE), b) : error code 95 from Lapack routine 'dtrtrs'

Hi, I came across this problem when I was running the following command lines:
scDaPars.res = scDaPars(raw_PDUI_file = paste0(path,"Dapars2_hspc_combined_all_chr.txt"),
out_dir = "./scDaPars_result",
filter_gene_thre = 0.2,
filter_cell_thre = 0.1)

The output messages are as follows:
[1] "Reading in raw PDUI matrix ..."
[1] "number of genes in raw count matrix 6276"
[1] "number of cells in raw count matrix 1250"
[1] "Reading finished!"
[1] "Start Processing raw PDUI matrix"
[1] "Pre-Processing finished"
[1] "Start Imputation Steps ..."
[1] "Find potential Neighboring Cells ..."
[1] "Number of neighbors(clusters) is 5"
[1] "Outliers is/are "
[1] "Imputation with neighboring cells ..."
[1] 100
[1] 200
Error in qr.coef(qr(a, LAPACK = TRUE), b) :
error code 84 from Lapack routine 'dtrtrs'
Error in qr.coef(qr(a, LAPACK = TRUE), b) :
error code 63 from Lapack routine 'dtrtrs'
Error in qr.coef(qr(a, LAPACK = TRUE), b) :
error code 347 from Lapack routine 'dtrtrs'
Error in qr.coef(qr(a, LAPACK = TRUE), b) :
error code 49 from Lapack routine 'dtrtrs'
Error in qr.coef(qr(a, LAPACK = TRUE), b) :
error code 81 from Lapack routine 'dtrtrs'
Error in qr.coef(qr(a, LAPACK = TRUE), b) :
error code 74 from Lapack routine 'dtrtrs'
.......

Which part of the code in scDaPars calls the qr.coef() function?I can't figure out why this error occurs. Any ideas?

Dapars_hESC_combined_all_chromosome.txt

Run DaPars2, Output format, /scDaPars/Example/Dapars_hESC_combined_all_chromosome.txt comes out with a column of numbers. What should I do with the data after Dapars2 runs? Dapars_hESC_combined_all_chromosome.txt What information is the number in the first column?
image

What does NA represent in the original PDUI matrix?

Hi,
I noticed that there are a lot of 'NA' values in raw PDUI matrix generated from DaPars2.
Does these ’NA‘ values mean that it is uncertain whether to use a distal polyA site for a certain gene?
Could you explain the specific meaning about 'NA' values?
Note: My data is scRNAseq data generated with SMART-seq2 protocol.
Thanks
Hanwen Yu
2021/07/18

Creating wiggle files for every cell

I am unsure of how I can make one Wiggle file for every cell. Should I split the bam file by barcode and then convert them to wig? Could you please help with this?

pipeline

Is the flow chart missing 3 steps and not displayed?

  1. selected reads with correct unique molecular identifier (UMI) using Drop-seq tools FilterBAM (Macosko et al. 2015)
  2. remove reads with duplicated UMIs using UMI-tools dedup (Smith et al. 2017).
  3. merged reads originated from same cells together and generated one BAM file for each single cell

image

image

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.