Giter Club home page Giter Club logo

rnaseq_pipeline's Introduction

Alignment of SE or PE Illumina sequenced RNA

Script will align the reads to the supplied reference using: STAR aligner for E(karyotyic) genomes Bowtie2 for P(prokaryotic) genomes

Each read file can first be aligned to one genome, then the unaligned sequences may be aligned to a second genome. This is useful when complex, mixed clinical samples have be sequenced.

Further unaligned read can be converted to fasta format and blasted.

cat fastq_unalignes.fastq | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > fastq_unalignes..fa

or using the nifty seqtk

Usage

Get git

git clone https://github.com/SemiQuant/RNAseq_pipeline.git

Get the singularity contianer

bash ./SQ_RNAseq_pipeline.sh --container "./"

Index genome and exit

singularity run RNAseq_pipe.sif bash ./RNApipeline.sh \
  --index_exit \
  --threads 20 \
  --genome_reference1 "Path_to_genome.fna" \
  --GTF_reference1 "Path_to_genome.gtf" \
  --SRlength "sequencing_length" \
  --Type_1 "E"

Run a single sample with two genomes (remove the last two three lines to run only one genome)

R1="Sample_R1_001.fastq.gz"
R2="Sample_R2_001.fastq.gz"
nme="SampleTest"

singularity run RNAseq_pipe.sif bash ./SQ_RNAseq_pipeline.sh \
  --threads 8 \
  --genome_reference1 "Path_to_genome.fna" \
  --GTF_reference1 "Path_to_genome.gtf" \
  --Type_1 "E" \
  --read_dir "Path_to_directory_containing_read_files" \
  --read1 "$R1" \
  --read2 "$R2" \
  --out_dir "Path_to_output_directory" \
  --name "$nme" \
  --feat_count \
  --qualimap \
  --strand "yes" \
  --trim \
  --script_directory "../RNAseq_pipeline" \
  --fastQC \
  --multipleMet \
  --star2 \
  --keep_unpaired \
  --Type_2 "B" \
  --genome_reference2 "Path_to_genome2.fna" \
  --GTF_reference2 "Path_to_genome2.gtf"

Example wrapper slurm

#!/bin/sh
#SBATCH --account=USER
#SBATCH --partition=ada
#SBATCH --nodes=1 --ntasks=8
#SBATCH --time=6:00:00
#SBATCH --job-name="RNAseqRun_1"
#SBATCH [email protected]
#SBATCH --array=0-6

declare -a samples=("sample1" "sample2" "sample3" "sample4" "sample5" "sample6" "sample7")

R1="${samples[$SLURM_ARRAY_TASK_ID]}_R1_001.fastq.gz"
R2="${samples[$SLURM_ARRAY_TASK_ID]}_R2_001.fastq.gz"
nme=${samples[$SLURM_ARRAY_TASK_ID]%%_*}
#${samples[$SLURM_ARRAY_TASK_ID]%%_*} #"${samples[0]:0:4}"

Add the above here

Example wrapper portable batch system

#PBS -q UCTlong /
#PBS -l nodes=1:ppn=8:series600 /
#PBS -N CASS_SEQ /
#PBS -M [email protected] /
#PBS -m ae /
#PBS -t 0-6 /

declare -a samples=("sample1" "sample2" "sample3" "sample4" "sample5" "sample6" "sample7")

R1="${samples[${PBS_ARRAYID}]}_R1_001.fastq.gz"
R2="${samples[${PBS_ARRAYID}]}_R2_001.fastq.gz"
nme=${samples[${PBS_ARRAYID}]%%_*}
#"${samples[${PBS_ARRAYID}]}" #"${samples[0]:0:4}"

Add the above here

Get metrics of all files (after they have all been processed)

singularity run RNAseq_pipe.sif bash ./SQ_RNAseq_pipeline.sh \
    --get_metrics "../runs"

Convert gff to gtf

singularity run RNAseq_pipe.sif bash ./SQ_RNAseq_pipeline.sh \
    --GffToGtf "Path_to_GFF.gff"

Note

If the genome has not been indexed, this will happen and requires ~30Gb of ram. Index will be built using the length of sequencing reads from the supplied file.


Available options

Flag Description
-dl --container download the singularity container to this path and exit
-t, --threads
-g1, --genome_reference1 full path to genome reference
-gtf1, --GTF_reference1
-t1, --Type_1 E for eukaryotic or B for bacterial (defult is E)
-g2, --genome_reference2 full path to genome reference
-gtf2, --GTF_reference2
-t2, --Type_2 E for eukaryotic or B for bacterial (defult is E)
-r, --ram
-rd, --read_dir
-r1, --read1 Read 1 name
-r2, --read2 Read 2 name (leave blank is not PE)
-o, --out_dir Output directory
-n, --name Name of sequence to append to output files
-c, --cufflinks Run cufflinks?
-f, --feat_count Run subread feature count?
-q, --qualimap Run qualimap?
-s, --strand stranded library (yes, no, reverse)
-tr, --trim trim reads?
-sd, --script_directory
-fq, --fastQC run fastqc?
-sr, --shotRead is read length short (like 50nt)?
-sL, --SRlength for making the index, if shotRead is on then this defult is 50
-ht, --htSeq do htseq counts, default is on for Bacterial genome
-mM, --multipleMet get multiple metrics (picard)
-md1,--md5check1 exit if md5 does not match this input for read 1
-md2,--md5check2 exit if md5 does not match this input for read 2
-m, --miRNA Is this miRNA?
-k, --keep_unpaired Keep reads that dont align as fastq output?
-c2, --only_care do you onlu care about genome 2?
-s2, --star2 basic 2 pass mode on star aligner?
-ca, --cutAdapt (not implemented) use cutadapt and pass options (such as '-u $nt_trim_from_start -m $minimum_length_keep -j $no_threads' )
-ht, --htSeq do htseq counts, alinger does this already 'on the fly' when using STAR (for E genomes)
-hq, --htseq_qual htseq-count quality cutoff (defult = 0)
-rRm, --remove_rRNAmtb remove rRNA from H37Rv annotation file, provide which GTF given it is (g1 or g2)
-rR, --remove_rRNA remove rRNA from annotation file (not very robus, just deletes lines that say rRNA or ribosomal RNA), provide which GTF given it is (g1 or g2 or both)
-v, --variant_calling (not implemented) Perform variant callineg? Currently not working
-s2, --star2 basic 2 pass mode on star aligner?
Single use options
-mt, --get_metrics supply a dir and get metrics for all analyses in that dir, all other options will be ignored if this is non-empyt
-ie, --index_exit index genome and exit
-gtg, --GffToGtf convert gff (path here) to gtf

These are not implemented int his script, contact [email protected] for fancier things

Flag Description
-t, --Type_1 E for eukaryotic or B for bacterial
-rR, --remove_rRNA remove rRNA from annotation file (not very robus, just deletes lines that say rRNA or ribosomal RNA), provide which GTF given it is (g1 or g2 or both)
-rRm, --remove_rRNAmtb remove rRNA from H37Rv annotation file

Important notes ong gtf/gff files

Some programs can only handle GTF or GFF files. Both are downloaded using the --downloadHuman and the scripts will look for the one not supplied otherwise create it if necessary. Here is a description of the formats Also, in some instances, gtf are only used to store rRNA.

Program File
STAR GTF/GFF
Cufflinks/cuffquant (dep) GTF/GFF
htseq GTF/GFF
featureCounts GTF/GFF
qualimap GTF
Picard GTF/GFF?

ToDo


  1. Differnetial expression and randomForest predictors can be made using the shiny app XXX

  2. eQLT analysis can be done using the shiny app XXX

  3. Add fastQscreen as an option

The code error checking isnt robust, I'd recommend doing getting the container, reference, then indexing the reference while doing an initial run. Then you can multi-run.

rnaseq_pipeline's People

Stargazers

 avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.