easyScRNA: An integrated, easy-to-use pipeline for standard single cell sequencing analysis

An automatic, batch-supported scRNA analysis pipeline integrates following functions: Recursive file import and group integration; UMAP dimension reduction and visualization; Multi-plots includes feature plot, violin plot and variable feature heatmap; cluster marker (differentially expressed genes) discovery and visualization; GSEA (Gene Set Enrichment Analysis) and visualization.

Workflow overview :

Prerequisites :

R (version 4.3.0)

R studio

What will be automatically built in pipeline ?

Seurat 4.3.0 (https://satijalab.org/seurat)

Microsoft C++ Build Tools (https://visualstudio.microsoft.com/visual-cpp-build-tools/)

CellTypist (https://github.com/Teichlab/celltypist)

GSEA v4.3.2 for the command line (https://www.gsea-msigdb.org/gsea/downloads.jsp)

Java, Conda (with python) and other R packages ....

Manual :

Installation

Download code and extract files (The “easyScRNA-main” folder will be the work folder).
Create your own R script for running the pipeline. Working with the “example.R” is recommended (with example commands and parameter settings).
Download all necessary R packages plus other softwares and load them with command:
```
source('functions/Main_functions.R')
```

Data orgranization

important : for GSEA command line integration, white space is not allowed in any path in work folder

To process raw 10X matrix files, a specific file sctructure is required (Since a recursive file reading process is introduced, the program supports multiple samples and groups):

A heatmap_key.xlsx file is required before plot grouped heatmap ("Cell annotation" is manually specified, "Cluster" is the seurat cluster number which will be presented in UMAP/violin plot, use a ',' to seperate multiple cluster numbers ):

sheet 1 :

Cell annotation	Cluster
Cell A	7
Cell B	8,9,10
....	....

sheet 2 :

Genes	Order
Gene A	1
Gene B	1
Gene C	3
Gene D	2
....	....

order represents the plotting order of each gene and corresponding cluster

Data processing

Specify the raw file folders :

raw_file_folder='Raw'

and features will be presented in featureplot, violinplot and dotplot :

e.g for gene feature plots :  plot_feature = c('Syn1','Cd74','Apoe')

specify other required numeric parameters (as shown in example.R):

nFeature_RNA_threshold_max = 20000
percent.mito_threshold = 0.1
nFeature_RNA_threshold_min = 200
.....

Run the command to start standard process automatically (if save=T is specified, group-inttegratetd and dimentional reduced data will be saved in RDS/Raw.rds if read=T is specified, it will skip the standard processing and read the RDS/Raw.rds for downstream analysis):

object_data <- standard_processing(save=T)

object_data <- standard_processing(read=T)

object_data will contain two types of seurat object : 1. "All" : Object integrated from all groups . 2. Object of each individual group.
Plots will be saved in PLOTs/, QC/ and RDS/

Example results (use GSE166635 as example)

QC of GSM5076750

UMAP of GSM5076750

heatmap of top100 variable genes

Split violin plot for certain feature

Dot plot, Feature plot, PCA and other plots

cluster filtering

continue with the "standard processing", specify additional features that you may want to plot :
```
extra_features = c('Cldn5')
```

or skip it by specify :

extra_features = c()

To exclude unwanted cell populations, set :

idents_you_want_to_keep = c(4,7,9,10,15,16)

cluster numbers are corresponding cell cluster numbers shown in UMAP plot

3.Run the command to re-clustre remaining cells and generate plots

reduced_data <-processing_reduced_data(object_data,,save=T,idents_ = idents_you_want_to_keep)

results will be saved under folder "reduced_PLOTs"
Auto annotation with CellTypist

Model_download(model_path = 'CelltypistModel')           ######### download models 
Model_list(model_path = 'CelltypistModel')             ####### list avalible models

predicted_data <- Runcelltypist(dt = reduced_data,model='Immune_All_Low',majority_voting = T)    ### choose one model from avaliable model list  e.g. 'Immune_All_low'
                   
d1 <- DimPlot(predicted_datadas)
d2 <- DimPlot(predicted_datadas,group.by = "typist_prediction")
d1+d2

DEG (differentially expressed genes) discovery

Pipeline contains two modes of DEG discovery :

Find DEG between different groups but for same cell populatiton (find DEG in cluster 6 between HCC1 group and HCC2 group):

find_DEG_between_groups(reduced_data,subset_cluster = 6,subset_name = 'cluster6',control_group = 'HCC1',variable_group = 'HCC2',save_folder='DEG')

subset_cluster is the corresponding cluster number of "reduced_data" (reduced_PLOTs); subset_name is customized label for that population; Control_group and variable_group should be exact match to the origin group names (same as group folder names).

For multiple subset input :

find_DEG_bewteen_groups(reduced_data,subset_cluster = c(4,5,8),subset_name = 'subsetA',control_group = c('UNTREATED','WATER'),variable_group = C('CHEMICAL_A','CHEMICAL_B'),save_folder='DEG')

Find DEG between different population but in same group (find DEG between cluster 6 and 8 in untreated group):

find_DEG_bewteen_clusters(reduced_data,subset_name='Subset_A',control_cluster=6,subset_group='UNTREATED',variable_cluster=8)

subset_group name should be exact match to group name (group folder name), specify one cluster as variable while set control cluster as "c()" to use all other clusters as control.

find_DEG_between_clusters(reduced_data,subset_name='HCC1_only_cluster6_vs_others',control_cluster=c(),subset_group='HCC1',variable_cluster=6)

Or

find_DEG_bewteen_clusters(reduced_data,subset_name='Subset_A',control_cluster=8,subset_group='UNTREATED',variable_cluster=c())

find DEG between cluster 8 and all other clusters in water and untreated merged group

find_DEG_bewteen_clusters(reduced_data,subset_name='Subset_A',control_cluster=8,subset_group=c('WATER','UNTREATED'),variable_cluster=c())

For downstream GSEA (Gene set enrichment analysis), convert DEG .csv file to .rnk file with the command :
```
DEG2RNK(DEG_path='DEG',p_hold=0.1,log2fc_hold=0)
```

That will add three .rnk files in addition to the DEG .csv file

Generate volcano plot for all DEG .csv files (tops: number of top dots you want to annotate and highlight):

volcano(DEG_path='DEG',p_hold = vol_plot.padj_hold,log2_fc_hold = 0.1,tops=10,highlight_top='avg_log2FC',highlight_by_keys=F,height=7,width=4)

To highlight specific genes in volcanoplot: create a "highlight_key.txt" under the subfolders of DEG, e.g:

highlight_key.txt :

Cd8a
Pdcd1
Cd3e
...

volcano(DEG_path='DEG',p_hold = vol_plot.padj_hold,log2_fc_hold = 0.1,tops=10,highlight_by_keys=T,width=4,height=4)

For example (highlight the CD55,IL7R,FABP1 in previous volcano plot)

heatmap visualization

To visualize gene expression difference, generate heatmap with heatmap_key.xlsx and reduced_data :

key_heatmap(object_ = reduced_data, 
                        key_file='test.xlsx',
                        group_level=c('WT','KP3'),
                        slot='scale.data',
                        row_cluster=F,
                        col_cluster=F,
                        aggregate='Cell') # aggregate by "cell" or "group"

heatmap example 1 | Aggregate by Cell:

sheet 1 :

Cell annotation	Cluster
cell set1	6
cell set2	7,8,9
cell set3	5

sheet 2 :

Genes	Order
CD55	1
IL7R	2
FABP1	3

key_heatmap(object_ = reduced_data, 
                        key_file='DEG_heatmap_key_example.xlsx',
                        group_level=c('HCC1','HCC2'),
                        slot='scale.data',
                        row_cluster=F,
                        col_cluster=F,
                        aggregate='Cell',
            group_color = c('HCC1'='red','HCC2'='blue'))

heatmap example 2 | Aggregate by Group:

sheet 1 :

Cell annotation	Cluster
cell set1	6
cell set2	7,8,9
cell set3	5

sheet 2 :

Genes	Order
CD55	1
IL7R	2
FABP1	3

key_heatmap(object_ = reduced_data, 
            key_file='DEG_heatmap_key_example.xlsx',
            group_level=c('HCC1','HCC2'),
            slot='scale.data',
            row_cluster=F,
            col_cluster=F,
            aggregate='Group',
            group_color = c('HCC1'='red','HCC2'='blue'))

group_level / cell_level represents the plotting orders and should be exactly match to group folder name / cluster numbers provided in heatmap_key.xlsx

Gene set enrichment and visualization

To perform batch Gene set enrichment analysis with GSEA software and MsigDB, run (if host is mouse):

GSEA_batch(
  DEG_path='DEG',
species = 'mouse',
  gene_sets =c(`hallmark gene sets`='mh.all.v2023.1.Mm.symbols.gmt',
               `positional gene sets`='m1.all.v2023.1.Mm.symbols.gmt',
               `curated gene sets`='m2.all.v2023.1.Mm.symbols.gmt',
               `regulatory target gene sets`='m3.all.v2023.1.Mm.symbols.gmt',
               `ontology gene sets`='m5.all.v2023.1.Mm.symbols.gmt',
               `cell type signature gene sets`='m8.all.v2023.1.Mm.symbols.gmt'),
  symbol_chip='Mouse_Gene_Symbol_Remapping_MSigDB.v2023.1.Mm.chip',
  out_dir='GSEA',
  GSEA_plots_number=30,
collapse='Collapse'
)   #GSEA_plots_number=30 : max lines displayed in GSEA result (default : top 20)

or replace the symbol_chip and gene_sets for human data analysis

GSEA_batch(
  DEG_path='DEG',
species = 'human',
  gene_sets =c(`hallmark gene sets`='h.all.v2023.1.Hs.symbols.gmt',
                 `positional gene sets`='c1.all.v2023.1.Hs.symbols.gmt',
                 `curated gene sets`='c2.all.v2023.1.Hs.symbols.gmt',
                 `regulatory target gene sets`='c3.all.v2023.1.Hs.symbols.gmt',
                 `ontology gene sets`='c6.all.v2023.1.Hs.symbols.gmt',
                 `cell type signature gene sets`='c8.all.v2023.1.Hs.symbols.gmt')
  symbol_chip='Human_Gene_Symbol_with_Remapping_MSigDB.v2023.1.Hs.chip',
  out_dir='GSEA',
  GSEA_plots_number=30,
  collapse ='Collapse'
)   #GSEA_plots_number=30 : max lines displayed in GSEA result (default : top 20)

visualize all GSEA results with bubble plot :

GSEA_bubble(GSEA_folder='GSEA',GSEA_fdr_hold=0.5,fdr_top=20)

specify fdr_top to plot the top enriched pathways ranked by fdr

gico1941 / easyscrna Goto Github PK

easyscrna's Introduction

easyScRNA: An integrated, easy-to-use pipeline for standard single cell sequencing analysis

Workflow overview :

Prerequisites :

What will be automatically built in pipeline ?

Manual :

Installation

Data orgranization

important : for GSEA command line integration, white space is not allowed in any path in work folder

Data processing

Example results (use GSE166635 as example)

QC of GSM5076750

UMAP of GSM5076750

heatmap of top100 variable genes

Split violin plot for certain feature

Dot plot, Feature plot, PCA and other plots

cluster filtering

DEG (differentially expressed genes) discovery

Or

heatmap visualization

heatmap example 1 | Aggregate by Cell:

heatmap example 2 | Aggregate by Group:

Gene set enrichment and visualization

easyscrna's People

Contributors

Watchers

Recommend Projects

Recommend Topics

Recommend Org