Thomas W. Battaglia 14 Sept 2016
RNA seq is an emerging technology for deep RNA sequencing. Not only does RNA seq technology have the ability to analyze differences in gene expression between samples, but can discover new isoforms of genes and analyze SNP variation of particular genes. This tutorial will cover the basic workflow for analyzing RNA seq data on host tissue. It will not cover the workflow for analyzing RNA seq on microbial community samples. For analysis of the meta-transcriptome using RNA seq, see the workflow for Whole Genome Sequencing Metagenomics.
The easiest way to set up the RNAseq environment is by using a bio-cookie template. This template will download all the necessary tools into a new virtual environment, so that there are no disruptions with your current computer packaging versions. It is high reccomended to run on all of these commands on a high performance cluster.
The template will also ask if you would like to download a human or mouse genome for transcriptome alignment. If you want to align to a genome which is neither human or murine, then you must select "None" in the option and download the genome and annotation file separately. Below are a list of the tools which are downloaded during the installation steps. See here for a list of availabel genomes (https://genome.ucsc.edu/cgi-bin/hgGateway / http://useast.ensembl.org/info/data/ftp/index.html )
# Load python 2.7.3 on cluster (if needed)
module load python/2.7.3
# Make sure python 2.7.3 was installed into environment
python --version
>> Python 2.7.3
# add --user if you are working on a HPC environment
pip install cookiecutter --user
Be sure to choose the correct options related to your data. For more information see: https://github.com/twbattaglia/bio-cookie-RNAseq
# Generate a bio-cookie RNAseq template project
cookiecutter https://github.com/twbattaglia/bio-cookie-RNAseq
Tools that are downloaded during bio-cookie creation:
a. cutadapt - adapter and base quality filtering
b. FastQC - sequence quality analysis
c. Trim Galore! - more easily run cutadapt and fastqc
d. SortMeRNA - remove rRNA sequence contamination
e. STAR-aligner - quickly align sequences to genome
f. Subread/featureCounts - summarize alignments into gene counts
g. MultiQC - generate a readable report of all logs
# This folder name will different depending upon the name chosen when running bio-cookie-RNAseq
cd new_rnaseq_analysis/
The STAR aligner requires an index to be created before aligning any fastq
sequences. The index command requires the host genome of interest and associated annotation file as inputs. The command is best run when submitted as a job. Download the shell file and submit it as a job. Additionally you must specific the length of the sequences in the RNAseq run in the command. This is typically 50 or 101, but will depend on what was requested.
# Submit the make_index.sh as a job. This can take ~20 minutes to finish.
qsub make_index.sh
The job can be submitted to the cluster once you generated a STAR-aligner index and you have placed the .fastq.gz
files into the folder, input
.
The command is written to submit an array job, which means you must specify how many times the job should run. In this case it is dependent upon how many individual .fastq.gz
files that are located in the input
folder. If there are 6 different samples within the folder you would run the command qsub -t 1:6 run_workflow.sh
, but if you have 10 files, you would run the command qsub -t 1:10 run_workflow.sh
. The -t
flag corresponds to how many different commands (files) should be submitted.
# Submit the array workflow script
qsub -t 1:4 run_workflow.sh
Note: The command is written for SGE clusters, but if you have a different cluster set up, you can change the variable $SGE_TASK_ID
to the correct array style variable.
After the main workflow has finished running, you will have generated aligned .bam
files for each sample. These files only give the coordinates for the sequence alignments on the genome, and must be summarized to give gene counts per sample, before any statistical analysis can be performed. To do this we perform further processing the output files using the postprocessing.sh
script There is no need for a parallel array script and the command can be submitted as is as long as no folder names/locations were not changed in main workflow step.
# Submit the postprocesing.sh script as a job. This should take ~10 minutes.
qsub postprocessing.sh
Once completed, many files are generated after running the main workflow and the post-processing scripts. Below is a breakdown of each of the folders and their respective description of their outputs. Each step reports a log file for every sample which can be used to calculcated the number of sequences.
── output/
│ └── 1_initial_qc/ - FastQC quality reports for each sample
│ └── 2_trimmed_output/ - Trimmed reports for each sample
│ └── 3_rRNA
│ ├── aligned/ - Sequences aligned to rRNA databases
│ ├── filtered/ - Sequences with rRNA sequences removed
│ ├── logs/ - Logs from running SortMeRNA
│ └── 4_aligned_sequences
│ ├── aligned_bam/ - Main alignment files for each sample
│ ├── aligned_logs/ - Log from running STAR alignment step
│ ├── aligned_counts/ - STAR alignment counts output (for comparison with featureCounts)
│ └── 5_final_counts - Final gene count summarization output from running featureCounts
│ └── 6_multiQC - Overall workflow report
Once the workflow has completed, you can now use the gene count table as an input into DESeq2 for statistical analysis using the R-programming language. It is highly reccomended to use RStudio when writing R code and generating R-related analyses. You can download RStudio for your system here: https://www.rstudio.com/products/rstudio/download/
Multiple libraries are required to perform analysis, plotting and enrichment.
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
biocLite("ggplot2")
biocLite("clusterProfiler")
biocLite("biomaRt")
biocLite("ReactomePA")
biocLite("DOSE")
biocLite("KEGG.db")
biocLite("pathview")
biocLite("org.Mm.eg.db")
biocLite("org.Hs.eg.db")
biocLite("pheatmap")
biocLite("genefilter")
biocLite("RColorBrewer")
biocLite("GO.db")
biocLite("topGO")
biocLite("dplyr")
One you have an R environment appropriatley set up, you can begin to import the featureCounts output of genes counts, found within the 5_final_counts
folder. This tutorial will use DESeq2 to normalize and perform the statistical analysis between sample groups. Be sure to copy the final_counts.txt
file generate from featureCounts step to your set working directory, or specify the full location when importing the table.
If you would like to use an example dataset, download the gene counts and metadata below (Right click --> Save as) :
Download gene counts per sample
Download sample metadata
Download R script of all commands shown below
library(DESeq2) # statistical analysis
library(ggplot2) # plotting
library(knitr) # for better formatting
# Import gene counts table
# - skip first row (general command info)
# - make row names the gene identifiers
countdata <- read.table("example/final_counts.txt", header = TRUE, skip = 1, row.names = 1)
# Remove .bam + '..' from column identifiers
colnames(countdata) <- gsub(".bam", "", colnames(countdata), fixed = T)
colnames(countdata) <- gsub("..", "", colnames(countdata), fixed = T)
# Remove length/char columns
countdata <- countdata[ ,c(-1:-5)]
# Make sure ID's are correct
head(countdata)
## SRR1374924 SRR1374923 SRR1374921 SRR1374922
## 4933401J01Rik 0 0 0 0
## Gm26206 0 0 0 0
## Xkr4 214 302 459 425
## Gm18956 0 0 0 0
## Gm37180 4 2 3 1
## Gm37363 1 0 0 1
# Import metadata file
# - make row names the matching sampleID's from the countdata
metadata <- read.delim("example/metadata.txt", row.names = 1)
# Add sampleID's to the mapping file
metadata$sampleid <- row.names(metadata)
# Reorder sampleID's to match featureCounts column order.
metadata <- metadata[match(colnames(countdata), metadata$sampleid), ]
# Make sure ID's are correct
head(metadata)
## Group Replicate sampleid
## SRR1374924 HiGlu Rep2 SRR1374924
## SRR1374923 HiGlu Rep1 SRR1374923
## SRR1374921 LoGlu Rep1 SRR1374921
## SRR1374922 LoGlu Rep2 SRR1374922
# - countData : count dataframe
# - colData : sample metadata in the dataframe with row names as sampleID's
# - design : The design of the comparisons to use.
# Use (~) before the name of the column variable to compare
ddsMat <- DESeqDataSetFromMatrix(countData = countdata,
colData = metadata,
design = ~Group)
# Find differential expressed genes
# Run DESEq2
ddsMat <- DESeq(ddsMat)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
# Get results from testing with FDR adjust pvalues
res <- results(ddsMat, pAdjustMethod = "fdr", alpha = 0.05)
# Generate summary of testing.
summary(res)
##
## out of 28681 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 3093, 11%
## LFC < 0 (down) : 2424, 8.5%
## outliers [1] : 0, 0%
## low counts [2] : 12328, 43%
## (mean count < 9)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
After aligment and summarization, we only have the annotated gene symbols. To get more information about significant genes, we can use annoated databases to convert gene symbols to full gene names and entrez ID's for further analysis.
library(AnnotationDbi)
library(org.Mm.eg.db)
# Add gene full name
res$description <- mapIds(x = org.Mm.eg.db,
keys = row.names(res),
column = "GENENAME",
keytype = "SYMBOL",
multiVals = "first")
# Add gene symbol
res$symbol <- row.names(res)
# Add ENTREZ ID
res$entrez <- mapIds(x = org.Mm.eg.db,
keys = row.names(res),
column = "ENTREZID",
keytype = "SYMBOL",
multiVals = "first")
# Subset for only significant genes (q < 0.05)
res_sig <- subset(res, padj < 0.05)
head(res_sig)
## log2 fold change (MAP): Group LoGlu vs HiGlu
## Wald test p-value: Group LoGlu vs HiGlu
## DataFrame with 6 rows and 9 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## Xkr4 343.8983 0.6473508 0.16708763 3.874319 1.069232e-04
## Mrpl15 862.4650 0.4024633 0.11708712 3.437298 5.875488e-04
## Tcea1 934.4692 1.3999089 0.11330340 12.355400 4.554828e-35
## St18 4875.8787 -0.9960725 0.06734494 -14.790606 1.684298e-49
## Pcmtd1 2073.1050 0.3941010 0.08469429 4.653218 3.267944e-06
## Sntg1 477.0359 0.7368634 0.14501538 5.081278 3.749043e-07
## padj
## <numeric>
## Xkr4 5.981918e-04
## Mrpl15 2.729132e-03
## Tcea1 3.475809e-33
## St18 2.314565e-47
## Pcmtd1 2.548435e-05
## Sntg1 3.487378e-06
## description
## <character>
## Xkr4 X Kell blood group precursor related family member 4
## Mrpl15 mitochondrial ribosomal protein L15
## Tcea1 transcription elongation factor A (SII) 1
## St18 suppression of tumorigenicity 18
## Pcmtd1 protein-L-isoaspartate (D-aspartate) O-methyltransferase domain containing 1
## Sntg1 syntrophin, gamma 1
## symbol entrez
## <character> <character>
## Xkr4 Xkr4 497097
## Mrpl15 Mrpl15 27395
## Tcea1 Tcea1 21399
## St18 St18 240690
## Pcmtd1 Pcmtd1 319263
## Sntg1 Sntg1 71096
# Write normalized gene counts to a .txt file
write.table(x = as.data.frame(counts(ddsMat), normalized = T),
file = 'DESeq2_normalized_counts.txt',
sep = '\t',
quote = F,
col.names = NA)
# Write significant normalized gene counts to a .txt file
write.table(x = counts(ddsMat[row.names(res_sig)], normalized = T),
file = 'DESeq2_normalized_counts_significant.txt',
sep = '\t',
quote = F,
col.names = NA)
# Write the annotated results table to a .txt file
write.table(x = as.data.frame(res),
file = "DESEq2_results_gene_annotated.txt",
sep = '\t',
quote = F,
col.names = NA)
# Write significant annotated results table to a .txt file
write.table(x = as.data.frame(res_sig),
file = "DESEq2_results_gene_annotated_significant.txt",
sep = '\t',
quote = F,
col.names = NA)
There are multiple ways to plot gene expression data. Below we are only listing a few popular methods, but there are many more resources (found in section 9 below) that will walk through different R commands/packages for plotting.
# Convert all samples to rlog
ddsMat_rlog <- rlog(ddsMat, blind = FALSE)
# Plot PCA by column variable
plotPCA(ddsMat_rlog, intgroup = "Group", ntop = 500) +
theme_bw() # remove default ggplot2 theme
# Load libraries
library(pheatmap)
library(RColorBrewer)
# Convert all samples to rlog
ddsMat_rlog <- rlog(ddsMat, blind = FALSE)
# Gather top 30 genes and make matrix
mat <- assay(ddsMat_rlog[row.names(res_sig)])[1:30, ]
# Choose which column variables you want to annotate the columns by.
annotation_col = data.frame(
Group = factor(colData(ddsMat_rlog)$Group),
Replicate = factor(colData(ddsMat_rlog)$Replicate),
row.names = colData(ddsMat_rlog)$sampleid
)
# Specify colors you want to annotate the columns by.
ann_colors = list(
Group = c("LoGlu" = "lightblue", "HiGlu" = "darkorange"),
Replicate = c(Rep1 = "darkred", Rep2 = "forestgreen")
)
# Make Heatmap with pheatmap function.
# See more in documentation for customization
pheatmap(mat = mat,
color = colorRampPalette(brewer.pal(9, "YlOrBr"))(255),
scale = "row",
annotation_col = annotation_col,
annotation_colors = ann_colors,
fontsize = 8,
show_colnames = F)
# Load libraries
library(ggplot2)
library(RColorBrewer)
# Gather Log-fold change and FDR-corrected pvalues from deseq2 results
data <- data.frame(pval = -log10(res$padj),
lfc = res$log2FoldChange,
row.names = row.names(res))
# Remove any rows that have NA as an entry
data <- na.omit(data)
# Make a basic ggplot2 object with x-y values
vol <- ggplot(data, aes(x = lfc, y = pval))
# Add ggplot2 layers
vol +
ggtitle(label = "Volcano Plot") +
geom_point(aes(colour = pval), size = 5, alpha = 0.7, na.rm = T) + # color the dots
theme_bw(base_size = 14) + # change overall theme
theme(legend.position = "right") + # change the legend
xlab(expression(log[2]("Treatment" / "Control"))) + # Change X-Axis label
ylab(expression(-log[10]("adjusted p-value"))) + # Change Y-Axis label
geom_vline(xintercept = c(-2, 2), colour = "darkgrey") + # Add fold change cutoff lines
geom_hline(yintercept = 1.3, colour = "darkgrey") + # Add p-adj value cutoff line
scale_colour_gradient(low = "red", high = "green") # Add red-green scaling by intensity
plotMA(res, ylim = c(-5, 5))
# Convert all samples to rlog
ddsMat_rlog <- rlog(ddsMat, blind = FALSE)
# Get gene with highest expression
top_gene <- rownames(res)[which.min(res$log2FoldChange)]
# Plot single gene
plotCounts(ddsMat, gene = top_gene,
intgroup = "Group",
normalized = T,
transform = T)
Pathway enrichment analysis is a great way to generate overall conclusions based on the individual gene changes. Sometimes individiual gene changes are overwheling and are difficult to interpret. But by analyzing which genes may fall into different pathways, we can get a better idea of what mRNA changes in relation to phenotype are occuring. More information about clusterProfiler here: http://bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html
# Load required libraries
library(clusterProfiler)
library(ReactomePA)
library(KEGG.db)
library(DOSE)
library(org.Mm.eg.db)
# Remove any genes that do not have any entrez identifiers
res_sig_entrez <- subset(res_sig, is.na(entrez) == FALSE)
# Create a matrix of gene log2 fold changes
gene_matrix <- res_sig_entrez$log2FoldChange
# Add the entrezID's as names for each logFC entry
names(gene_matrix) <- res_sig_entrez$entrez
kegg_enrich <- enrichKEGG(gene = names(gene_matrix),
organism = 'mouse',
pvalueCutoff = 0.05,
readable = TRUE)
# Plot results
barplot(kegg_enrich,
drop = TRUE,
showCategory = 10,
title = "KEGG Enrichment Pathways",
font.size = 8)
go_enrich <- enrichGO(gene = names(gene_matrix),
organism = 'mouse',
readable = T,
ont = "BP",
pvalueCutoff = 0.05)
# Plot results
barplot(go_enrich,
drop = TRUE,
showCategory = 10,
title = "GO Biological Pathways",
font.size = 8)
# Load pathview
library(pathview)
# Plot specific KEGG pathways (with fold change)
# pathway.id : KEGG pathway identifier
pathview(gene.data = gene_matrix,
pathway.id = "04070",
species = "mouse",
map.symbol = T)
You can the links below for a more in depth walk through of RNAseq analysis using R: http://www.bioconductor.org/help/workflows/rnaseqGene/ http://bioconnector.org/workshops/r-rnaseq-airway.html http://www-huber.embl.de/users/klaus/Teaching/DESeq2Predoc2014.html http://www-huber.embl.de/users/klaus/Teaching/DESeq2.pdf https://web.stanford.edu/class/bios221/labs/rnaseq/lab_4_rnaseq.html http://www.rna-seqblog.com/which-method-should-you-use-for-normalization-of-rna-seq-data/ http://www.rna-seqblog.com/category/technology/methods/data-analysis/data-visualization/ http://www.rna-seqblog.com/category/technology/methods/data-analysis/pathway-analysis/ http://www.rna-seqblog.com/inferring-metabolic-pathway-activity-levels-from-rna-seq-data/
-
Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
-
Martin, Marcel. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, [S.l.], v. 17, n. 1, p. pp. 10-12, may. 2011. ISSN 2226-6089. Available at: http://journal.embnet.org/index.php/embnetjournal/article/view/200. doi:http://dx.doi.org/10.14806/ej.17.1.200.
-
Kopylova E., Noé L. and Touzet H., "SortMeRNA: Fast and accurate filtering of ribosomal RNAs in metatranscriptomic data", Bioinformatics (2012), doi: 10.1093/bioinformatics/bts611
-
Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15-21. doi:10.1093/bioinformatics/bts635.
-
Lassmann et al. (2010) "SAMStat: monitoring biases in next generation sequencing data." Bioinformatics doi:10.1093/bioinformatics/btq614 [PMID: 21088025]
-
Liao Y, Smyth GK and Shi W (2014). featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7):923-30.
-
Love MI, Huber W and Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, pp. 550.
-
Yu G, Wang L, Han Y and He Q (2012). “clusterProfiler: an R package for comparing biological themes among gene clusters.” OMICS: A Journal of Integrative Biology, 16(5), pp. 284-287.
-
Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller. "MultiQC: Summarize analysis results for multiple tools and samples in a single report" Bioinformatics (2016). doi: 10.1093/bioinformatics/btw354. PMID: 27312411