- Description
- Setup pipeline
- Quality check
- Trim data
- Align to ref
- Count features
- DGE analysis
RNA-seq pipeline for Illumina data. This pipeline is designed to run on a Sun Grid Engine cluster.
Current implementation uses STAR for aligning, featureCounts for abundance counting and DESeq2 for DGE
Future verions will include k-mer based alignment (Kallisto/Salmon etc.) with their automatic feature abundance estimates and trasnscript quantification/estimation tools (isoEM/eXpress etc.). These tools can be used to analyse differential exon usage. The output is not directly usable by DESeq2 for DGE, but there are methods available to convert estimated abundances to count data.
I might add HISAT2/TOPHAT3 to the mix - once TH3 has been released.
# set RNSPL variable to pipeline folder
RNSPL=~/RNA-seq_pipeline
# set permanetly for future (bash) shell sessions (be careful with this, if you have settings in ~/.profile they will no longer load)
echo export RNSPL=~/RNA-seq_pipeline >>~/.bash_profile
Create project folder and linked to RNA-seq pipeline
PROJECT_FOLDER=~/projects/my_project_folder
mkdir -p $PROJECT_FOLDER
ln -s $RNSPL $PROJECT_FOLDER/RNA-seq_pipeline
Create some sub folders to store files and analysis results
# folder to store cluster job output
mkdir $PROJECT_FOLDER/cluster
mkdir -p $PROJECT_FOLDER/data/fastq
mkdir $PROJECT_FOLDER/data/trimmed
mkdir $PROJECT_FOLDER/data/filtered
mkdir $PROJECT_FOLDER/data/aligned
mkdir $PROJECT_FOLDER/data/counts
mkdir -p $PROJECT_FOLDER/analysis/quality
mkdir $PROJECT_FOLDER/analysis/DGE
mkdir $PROJECT_FOLDER/analysis/DEU
Get hold of the raw data and link to fastq folder.
# link
ln -s /data/RNA-seq/this_project_data/*.gz $PROJECT_FOLDER/data/fastq/.
# Decompress sym linked files (if required)
for FILE in $PROJECT_FOLDER/data/fastq/*.gz; do
$PROJECT_FOLDER/RNA-seq_pipeline/scripts/PIPELINE.sh -c unzip $FILE 2
done
Qualtiy checking with fastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
for FILE in $PROJECT_FOLDER/data/fastq/*.gz; do
$PROJECT_FOLDER/RNA-seq_pipeline/scripts/PIPELINE.sh -c qcheck $FILE $PROJECT_FOLDER/analysis/quality
done
Trimming can currently be performed with Trimmomatic (trim.sh, submit_trim.sh and truseq.fa should all be in same directory) The default settings will only remove adapter contamination, for quality/length filtering append something like SLIDINGWINDOW:8:20 MINLEN:36. This will use a sliding window of 8 bases and cut the sequence if quality in the window drops below 20, also any trimmed sequences with a length less than 36 will be dropped.
Low quality/adapter contaminated reads are dumped (change location /dev/null in submit_trim.sh to something else if you need to keep them)
for FR in $PROJECT_FOLDER/data/fastq/*_1.fastq.gz; do
RR=$(echo $FR|sed 's/\_1\.fastq/\_2\.fastq/')
$PROJECT_FOLDER/RNA-seq_pipeline/scripts/PIPELINE.sh -c trim \
$FR \
$RR \
$PROJECT_FOLDER/data/trimmed \
$PROJECT_FOLDER/RNA-seq_pipeline/scripts/truseq.fa \
4
done
Remove phix/rrna or other contaminants
for FR in $PROJECT_FOLDER/data/trimmed/*_1.fq.gz; do
RR=$(echo $FR|sed 's/\_1\.fq/\_2\.fq/')
$PROJECT_FOLDER/RNA-seq_pipeline/scripts/PIPELINE.sh -c filter \
$PROJECT_FOLDER/RNA-seq_pipeline/phix/phix \
$PROJECT_FOLDER/data/filtered \
$FR $RR
done
DESeq (v1.11.23). https://f1000research.com/articles/4-1521/v2
Update to come with implementation
Salmon can run in two modes, psuedo mapping or alignmnet mode using a pre-aligned SAM/BAM. The BAM must be unsorted and aligned to the transcriptome.
O.K. running Salmon with the filtered read files - it's fast...
Assuming the transcriptome is available... If you have a genome plus a gff file, I have (somewhere) a scripts which can help make a transcriptome file. gff being a pretty grotty (non) standard, the results need to be checked.
First step is to build a psuedo mapping index file e.g. transcriptome located at $PROJECT_FOLDER/data/genome/transcriptome.fa
# Build index from transcript file
salmon index -t $PROJECT_FOLDER/data/genome/transcriptome.fasta -i $PROJECT_FOLDER/data/genome/SALMON_quasi
is a single process - it accepts gz/bgz compressed files
for FR in $PROJECT_FOLDER/data/filtered/*_1.filtered.fq.gz; do
RR=$(echo $FR|sed 's/_1/_2/')
OUTDIR=$(echo $FR|awk -F"/" '{print $NF}'|sed 's/_.*//')
$PROJECT_FOLDER/RNA-seq_pipeline/scripts/PIPELINE.sh -c salmon \
$PROJECT_FOLDER/data/genome/SALMON_quasi \
$PROJECT_FOLDER/data/counts/$OUTDIR \
$FR $RR \
--numBootstraps 1000 --dumpEq --seqBias --gcBias
done
--writeUnmappedNames --writeMappings=test.sam
From the above the last line includes some of the optional settings for salmon. see http://salmon.readthedocs.io/en/latest/index.html for description of all options (or; salmon quant --help-reads ).
tximport is an R library which can convert salmon transcript "pseudo" counts (or other pseudo mapped counts) to gene counts (looks like later versions of salmon can do this natively if you provide a gff file (with -g flag). I haven't tested this yet).
First needs a mapping file of transcript to gene. (Any of the output quant.sf files can be used)
awk -F"\t" '{c=$1;sub("\..*","",$1);print c,$1}' OFS="\t" quant.sf >trans2gene.txt
Tximport requires R v 3.3 to install via Bioconductor, otherwise can install from the binary Details are in DGE.R
Good for measuring gene level DE derived from transcripts (DTE) - but what about isoforms (different transcript/exon usage - DTU/DEU)? If you have a well anotated genome with most transcripts already described - or possibly try mapping to exons?
A index must first be created e.g. genome and gff (not essential but useful) located in $PROJECT_FOLDER/data/genome/
STAR \
--runMode genomeGenerate \
--genomeDir $PROJECT_FOLDER/genome/STAR_illumina \
--genomeFastaFiles $PROJECT_FOLDER/genome/my_genome.fa \
--sjdbGTFfile $PROJECT_FOLDER/genome/my_gff.gff3 \
--sjdbGTFtagExonParentTranscript Parent \
--sjdbGTFtagExonParentGene Parent
Basic star alignment parameters: STAR --genomeDir your_out_dir --outFileNamePrefix something --readFilesIn fastq_F fastq_R --outSAMtype SAM --runThreadN 16
Alignment using both 2-pass alignment (2-pass alignment and basic alignment) and basic alignment only .
For two pass, first pass finds extra splice junctions second pass uses these extra annotations for mapping.
For basic only alignment the below code was modified to comment out "--sjdbFileChrStartEnd $splice_list" (and remove the preceeding line continuation ))
for R1 in $PROJECT_FOLDER/filtered/*1.fq; do
R2=$(echo $R1|sed -e 's/\.1\./\.2\./');
prefix=$(echo $R1|awk -F"/" '{gsub(/\..*/,"",$NF);print $NF}');
$PROJECT_FOLDER/RNA-seq_pipeline/scripts/PIPELINE.sh -c star \
$PROJECT_FOLDER/genome/STAR_illumina \
$PROJECT_FOLDER/junctions \
$prefix \
$R1 \
$R2 \
--outSAMmode None
done
splice_list=$(ls $PROJECT_FOLDER/junctions/*.tab)
for R1 in $PROJECT_FOLDER/filtered/*1.fq; do
R2=$(echo $R1|sed -e 's/\.1\./\.2\./');
prefix=$(echo $R1|awk -F"/" '{gsub(/\..*/,"",$NF);print $NF}');
$PROJECT_FOLDER/RNA-seq_pipeline/scripts/PIPELINE.sh -c star \
$PROJECT_FOLDER/genome/STAR_illumina \
$PROJECT_FOLDER/aligned \
$prefix \
$R1 \
$R2 \
--chimSegmentMin 20 \
--chimOutType WithinBAM \
--outSAMtype BAM SortedByCoordinate \
--sjdbFileChrStartEnd $splice_list; # not used for basic only alignment
# --outFilterMatchNminOverLread 0.3 # unused parameter - useful for mapping short alignments
# --outFilterScoreMinOverLread 0.3 # unused parameter - useful for mapping short alignments
done
This can be done in R, but is slow with seqeuntial file import. Outside R each sample is counted seperately.
featureCounts -o output_file -a gff_file sam_files
Can be useful to produce a SAF file from an input gff (as they are not always consistent) The below will extract exon annotations and output the ninth column stipped of ID= and anything after the first ".", then the first column (chromosome) and etc.
grep exon final_genes_appended.gff3|awk -F"\t" '{gsub(/ID=/,"",$NF);gsub(/\..*/,"",$NF);print $NF,$1,$4,$5,$7}' OFS="\t" > $PROJECT_FOLDER/counts/exons.SAF
The RNA-seq pipeline can be used to run featureCounts: PIPELINE.sh -c counts annotaions output_dir output_file sam/bam(s) [options]
The below runs with 12 threads (-T 12), counts all multimapping reads (-M) and uses a SAF file for input (-F SAF)
for D in $PROJECT_FOLDER/aligned/treatment/WTCHG*; do
OUTFILE=$(echo $D|awk -F"/" '{print $(NF)}').counts
$PROJECT_FOLDER/RNA-seq_pipeline/scripts/PIPELINE.sh -c counts \
$PROJECT_FOLDER/counts/exons.SAF \
$PROJECT_FOLDER/counts \
$OUTFILE \
$D/star_aligmentAligned.sortedByCoord.out.bam -T 12 -M -F SAF
done
Follow script DGE.R
NOTE: this section requires editing
DEXSeq requires its own gtf file - they provide a python script to create it, but due to the mess of the gtf/gff "standard" it probably won't work.
Below is an exampe of a few lines from a typical GFF file (ignore the header - git requires it)
gff | gff | gff | gff | gff | gff | gff | gff | gff |
---|---|---|---|---|---|---|---|---|
contig_1 | AUGUSTUS | gene | 1 | 625 | 0.61 | - | . | ID=g1; |
contig_1 | AUGUSTUS | mRNA | 1 | 625 | 0.61 | - | . | ID=g1.t1;Parent=g1 |
contig_1 | AUGUSTUS | CDS | 1 | 625 | 0.61 | - | 0 | ID=g1.t1.CDS1;Parent=g1.t1 |
contig_1 | AUGUSTUS | exon | 1 | 625 | . | - | . | ID=g1.t1.exon1;Parent=g1.t1; |
contig_1 | AUGUSTUS | start_codon | 623 | 625 | . | - | 0 | Parent=g1.t1; |
contig_1 | AUGUSTUS | gene | 2887 | 5449 | 0.57 | + | . | ID=g2; |
contig_1 | AUGUSTUS | mRNA | 2887 | 5449 | 0.57 | + | . | ID=g2.t1;Parent=g2 |
contig_1 | AUGUSTUS | start_codon | 2887 | 2889 | . | + | 0 | Parent=g2.t1; |
contig_1 | AUGUSTUS | CDS | 2887 | 2901 | 0.57 | + | 0 | ID=g2.t1.CDS1;Parent=g2.t1 |
contig_1 | AUGUSTUS | exon | 2887 | 2901 | . | + | . | ID=g2.t1.exon1;Parent=g2.t1; |
contig_1 | AUGUSTUS | intron | 2902 | 2958 | 0.57 | + | . | Parent=g2.t1; |
contig_1 | AUGUSTUS | CDS | 2959 | 3208 | 0.57 | + | 0 | ID=g2.t1.CDS2;Parent=g2.t1 |
contig_1 | AUGUSTUS | exon | 2959 | 3208 | . | + | . | ID=g2.t1.exon2;Parent=g2.t1; |
contig_1 | AUGUSTUS | intron | 3209 | 3260 | 1 | + | . | Parent=g2.t1; |
contig_1 | AUGUSTUS | CDS | 3261 | 5449 | 1 | + | 2 | ID=g2.t1.CDS3;Parent=g2.t1 |
contig_1 | AUGUSTUS | exon | 3261 | 5449 | . | + | . | ID=g2.t1.exon3;Parent=g2.t1; |
contig_1 | AUGUSTUS | stop_codon | 5447 | 5449 | . | + | 0 | Parent=g2.t1; |
The DEXSeq script requires gene_id and transcript_id in field 9 - for exons only. In the above gff the Parent = transcript_id, but there is no gene_id. But it can be produces by stripping the .t[\d] from the end of Parent. The below script would make a DEXSeq compatible gff:
grep exon final_genes_appended_renamed.gff3|awk -F"\t" '{gsub(/;$/,"",$9);gsub(/$/,";",$9);x=$9;gsub(/.*=/,"gene_id=",x);gsub(/Parent=/,"transcript_id=",$9);sub(/\..*/,"",x);$9=$9x;print}' OFS="\t"> DEXSeq.gff
This can then be fed into the HTSeq script as per DEXSeq manual, or if using featureCounts dexseq_prepare_annotation2.py (https://github.com/vivekbhr/Subread_to_DEXSeq)
Youll also need to install the HTSeq python package - if you've got pip installed:
pip install HTSeq
You might need to upgrade numpy as well (numpy-1.9.2 doesn't work)
dexseq_prepare_annotation2.py -f featureCounts.gtf DEXSeq.gff DEXSeq_final.gff
Then count with featureCounts:
featureCounts -f -O -p -T 16 \
-F GTF -a featureCounts.gtf \
-o counts.out Test.bam
DEU.r