Giter Club home page Giter Club logo

easyscrna's Introduction

easyScRNA: An integrated, easy-to-use pipeline for standard single cell sequencing analysis

An automatic, batch-supported scRNA analysis pipeline integrates following functions: Recursive file import and group integration; UMAP dimension reduction and visualization; Multi-plots includes feature plot, violin plot and variable feature heatmap; cluster marker (differentially expressed genes) discovery and visualization; GSEA (Gene Set Enrichment Analysis) and visualization.

Workflow overview :

Prerequisites :

R (version 4.3.0)

R studio

What will be automatically built in pipeline ?

Seurat 4.3.0 (https://satijalab.org/seurat)

Microsoft C++ Build Tools (https://visualstudio.microsoft.com/visual-cpp-build-tools/)

CellTypist (https://github.com/Teichlab/celltypist)

GSEA v4.3.2 for the command line (https://www.gsea-msigdb.org/gsea/downloads.jsp)

Java, Conda (with python) and other R packages ....

Manual :

Installation

  1. Download code and extract files (The “easyScRNA-main” folder will be the work folder).
  2. Create your own R script for running the pipeline. Working with the “example.R” is recommended (with example commands and parameter settings).
  3. Download all necessary R packages plus other softwares and load them with command:
    source('functions/Main_functions.R')
    

Data orgranization

important : for GSEA command line integration, white space is not allowed in any path in work folder

  1. To process raw 10X matrix files, a specific file sctructure is required (Since a recursive file reading process is introduced, the program supports multiple samples and groups):
  1. A heatmap_key.xlsx file is required before plot grouped heatmap ("Cell annotation" is manually specified, "Cluster" is the seurat cluster number which will be presented in UMAP/violin plot, use a ',' to seperate multiple cluster numbers ):

sheet 1 :

Cell annotation Cluster
Cell A 7
Cell B 8,9,10
.... ....

sheet 2 :

Genes Order
Gene A 1
Gene B 1
Gene C 3
Gene D 2
.... ....

order represents the plotting order of each gene and corresponding cluster

Data processing

  1. Specify the raw file folders :
raw_file_folder='Raw'

and features will be presented in featureplot, violinplot and dotplot :

e.g for gene feature plots :  plot_feature = c('Syn1','Cd74','Apoe')

specify other required numeric parameters (as shown in example.R):

nFeature_RNA_threshold_max = 20000
percent.mito_threshold = 0.1
nFeature_RNA_threshold_min = 200
.....
  1. Run the command to start standard process automatically (if save=T is specified, group-inttegratetd and dimentional reduced data will be saved in RDS/Raw.rds if read=T is specified, it will skip the standard processing and read the RDS/Raw.rds for downstream analysis):
object_data <- standard_processing(save=T)

or

object_data <- standard_processing(read=T)
  1. object_data will contain two types of seurat object : 1. "All" : Object integrated from all groups . 2. Object of each individual group.
  2. Plots will be saved in PLOTs/, QC/ and RDS/

Example results (use GSE166635 as example)

QC of GSM5076750

UMAP of GSM5076750

heatmap of top100 variable genes

Split violin plot for certain feature

Dot plot, Feature plot, PCA and other plots

cluster filtering

  1. continue with the "standard processing", specify additional features that you may want to plot :
    extra_features = c('Cldn5')
    

or skip it by specify :

extra_features = c()
  1. To exclude unwanted cell populations, set :
idents_you_want_to_keep = c(4,7,9,10,15,16)

cluster numbers are corresponding cell cluster numbers shown in UMAP plot

3.Run the command to re-clustre remaining cells and generate plots

reduced_data <-processing_reduced_data(object_data,,save=T,idents_ = idents_you_want_to_keep) 
  1. results will be saved under folder "reduced_PLOTs"

  2. Auto annotation with CellTypist

Model_download(model_path = 'CelltypistModel')           ######### download models 
Model_list(model_path = 'CelltypistModel')             ####### list avalible models

predicted_data <- Runcelltypist(dt = reduced_data,model='Immune_All_Low',majority_voting = T)    ### choose one model from avaliable model list  e.g. 'Immune_All_low'
                   
d1 <- DimPlot(predicted_datadas)
d2 <- DimPlot(predicted_datadas,group.by = "typist_prediction")
d1+d2

DEG (differentially expressed genes) discovery

Pipeline contains two modes of DEG discovery :

  1. Find DEG between different groups but for same cell populatiton (find DEG in cluster 6 between HCC1 group and HCC2 group):
    find_DEG_between_groups(reduced_data,subset_cluster = 6,subset_name = 'cluster6',control_group = 'HCC1',variable_group = 'HCC2',save_folder='DEG') 
    

1705878888644

subset_cluster is the corresponding cluster number of "reduced_data" (reduced_PLOTs); subset_name is customized label for that population; Control_group and variable_group should be exact match to the origin group names (same as group folder names).

For multiple subset input :

find_DEG_bewteen_groups(reduced_data,subset_cluster = c(4,5,8),subset_name = 'subsetA',control_group = c('UNTREATED','WATER'),variable_group = C('CHEMICAL_A','CHEMICAL_B'),save_folder='DEG')
  1. Find DEG between different population but in same group (find DEG between cluster 6 and 8 in untreated group):
    find_DEG_bewteen_clusters(reduced_data,subset_name='Subset_A',control_cluster=6,subset_group='UNTREATED',variable_cluster=8)
    

subset_group name should be exact match to group name (group folder name), specify one cluster as variable while set control cluster as "c()" to use all other clusters as control.

find_DEG_between_clusters(reduced_data,subset_name='HCC1_only_cluster6_vs_others',control_cluster=c(),subset_group='HCC1',variable_cluster=6)

Or

find_DEG_bewteen_clusters(reduced_data,subset_name='Subset_A',control_cluster=8,subset_group='UNTREATED',variable_cluster=c())

find DEG between cluster 8 and all other clusters in water and untreated merged group

find_DEG_bewteen_clusters(reduced_data,subset_name='Subset_A',control_cluster=8,subset_group=c('WATER','UNTREATED'),variable_cluster=c())
  1. For downstream GSEA (Gene set enrichment analysis), convert DEG .csv file to .rnk file with the command :
    DEG2RNK(DEG_path='DEG',p_hold=0.1,log2fc_hold=0)
    

That will add three .rnk files in addition to the DEG .csv file

1705879279416

Generate volcano plot for all DEG .csv files (tops: number of top dots you want to annotate and highlight):

volcano(DEG_path='DEG',p_hold = vol_plot.padj_hold,log2_fc_hold = 0.1,tops=10,highlight_top='avg_log2FC',highlight_by_keys=F,height=7,width=4)

To highlight specific genes in volcanoplot: create a "highlight_key.txt" under the subfolders of DEG, e.g:

highlight_key.txt :

Cd8a
Pdcd1
Cd3e
... 
volcano(DEG_path='DEG',p_hold = vol_plot.padj_hold,log2_fc_hold = 0.1,tops=10,highlight_by_keys=T,width=4,height=4)

For example (highlight the CD55,IL7R,FABP1 in previous volcano plot)

heatmap visualization

To visualize gene expression difference, generate heatmap with heatmap_key.xlsx and reduced_data :

key_heatmap(object_ = reduced_data, 
                        key_file='test.xlsx',
                        group_level=c('WT','KP3'),
                        slot='scale.data',
                        row_cluster=F,
                        col_cluster=F,
                        aggregate='Cell') # aggregate by "cell" or "group"

heatmap example 1 | Aggregate by Cell:

sheet 1 :

Cell annotation Cluster
cell set1 6
cell set2 7,8,9
cell set3 5

sheet 2 :

Genes Order
CD55 1
IL7R 2
FABP1 3
key_heatmap(object_ = reduced_data, 
                        key_file='DEG_heatmap_key_example.xlsx',
                        group_level=c('HCC1','HCC2'),
                        slot='scale.data',
                        row_cluster=F,
                        col_cluster=F,
                        aggregate='Cell',
            group_color = c('HCC1'='red','HCC2'='blue'))

heatmap example 2 | Aggregate by Group:

sheet 1 :

Cell annotation Cluster
cell set1 6
cell set2 7,8,9
cell set3 5

sheet 2 :

Genes Order
CD55 1
IL7R 2
FABP1 3
key_heatmap(object_ = reduced_data, 
            key_file='DEG_heatmap_key_example.xlsx',
            group_level=c('HCC1','HCC2'),
            slot='scale.data',
            row_cluster=F,
            col_cluster=F,
            aggregate='Group',
            group_color = c('HCC1'='red','HCC2'='blue'))

group_level / cell_level represents the plotting orders and should be exactly match to group folder name / cluster numbers provided in heatmap_key.xlsx

Gene set enrichment and visualization

To perform batch Gene set enrichment analysis with GSEA software and MsigDB, run (if host is mouse):

GSEA_batch(
  DEG_path='DEG',
species = 'mouse',
  gene_sets =c(`hallmark gene sets`='mh.all.v2023.1.Mm.symbols.gmt',
               `positional gene sets`='m1.all.v2023.1.Mm.symbols.gmt',
               `curated gene sets`='m2.all.v2023.1.Mm.symbols.gmt',
               `regulatory target gene sets`='m3.all.v2023.1.Mm.symbols.gmt',
               `ontology gene sets`='m5.all.v2023.1.Mm.symbols.gmt',
               `cell type signature gene sets`='m8.all.v2023.1.Mm.symbols.gmt'),
  symbol_chip='Mouse_Gene_Symbol_Remapping_MSigDB.v2023.1.Mm.chip',
  out_dir='GSEA',
  GSEA_plots_number=30,
collapse='Collapse'
)   #GSEA_plots_number=30 : max lines displayed in GSEA result (default : top 20)

or replace the symbol_chip and gene_sets for human data analysis

GSEA_batch(
  DEG_path='DEG',
species = 'human',
  gene_sets =c(`hallmark gene sets`='h.all.v2023.1.Hs.symbols.gmt',
                 `positional gene sets`='c1.all.v2023.1.Hs.symbols.gmt',
                 `curated gene sets`='c2.all.v2023.1.Hs.symbols.gmt',
                 `regulatory target gene sets`='c3.all.v2023.1.Hs.symbols.gmt',
                 `ontology gene sets`='c6.all.v2023.1.Hs.symbols.gmt',
                 `cell type signature gene sets`='c8.all.v2023.1.Hs.symbols.gmt')
  symbol_chip='Human_Gene_Symbol_with_Remapping_MSigDB.v2023.1.Hs.chip',
  out_dir='GSEA',
  GSEA_plots_number=30,
  collapse ='Collapse'
)   #GSEA_plots_number=30 : max lines displayed in GSEA result (default : top 20)

visualize all GSEA results with bubble plot :

GSEA_bubble(GSEA_folder='GSEA',GSEA_fdr_hold=0.5,fdr_top=20)

specify fdr_top to plot the top enriched pathways ranked by fdr

easyscrna's People

Contributors

gico1941 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.