Giter Club home page Giter Club logo

mepsuite's Introduction

MEPsuite: Discovery and Prediction of MicroExons in Plant genomes

  • The scripts are for the paper entitled "Pervasive misannotation of microexons that are evolutionarily conserved and crucial for gene function in plants" by Huihui Yu et al. Nat Commun 13, 820 (2022). https://doi.org/10.1038/s41467-022-28449-8
  • The pipeline golden_map.py can be used for general RNA-seq mapping and gene expression analysis in plant genomes.
  • There is a stand-alone R package MEPmodeler for microexon prediction/modeling in plant genomes.
  • The flow chart of the whole pipeline is as follows:

MEP_workflow

1. golden_map.py

A python pipeline of short-read RNA-seq mapping in plants:

  1. First round of mappping using STAR
  2. Generate new genome index for STAR
  3. second round of mappping using STAR
  4. Assemble novel transcripts using StringTie
  5. Estimate gene expression using StringTie

Note:

  • The ouput is ready for input of Ballgown.
  • Additaional splice junction information can be added from the result of other tools, such as OLego. Here is an example:
for file in simulated_reads/*.fq.gz; do 
    sample=$(basename "$file" | sed "s/.fq.gz//")
    olego -t 16 -e 3 -I 20000 --max-multi 5 olego_idx simulated_reads/$sample.fq.gz | samtools sort -o olego_out/$sample.bam
    samtools view olego_out/$sample.bam | sam2bed.pl -r - - | bed2junc.pl - olego_out/$sample.junc
done

cat olego_out/*.junc | awk '{OFS="\t"}{$2=$2+1}{print $1,$2,$3,$6}' > olego_junc.SJ

Usage:

python3 golden_map.py -h

Usage: golden_map.py [options]

Options:
  -h, --help       show this help message and exit
  -i INPUT         input file directory or file list (.)
  -p PATTERN       input file pattern (_1P.fq.gz)
  -l LENGTH        read length (101)
  -f FASTA         reference genome fasta file
  -x INDEX         STAR genome index directory
  -n NBASES        length of the STAR SA indexing string (14)
  -j JUNCTION      file for the splice junction introns ("")
  -g GTF           reference annotation GTF file
  -@ THREADS       number of threads (30)
  -e, --expresion  estimate gene expression
  -a, --assemble   assemble novel transcripts

Example:

python3 ./golden_map.py -i simulated_reads -p .fq.gz -l 50 -f A.thaliana.fa -x star_idx -n 12 -j olego_junc.SJ -g A.thaliana.gtf -e -a

  • The file pattern (-p) is the suffix of fastq files. For paired end, the pattern must contain the number 1 and the pipeline will search and match the second read files; otherwise, will treat as single end.
  • The input (-i) can be a path to the directory or a file list containing the paths to all the fastq files, one per line (files will be automatically paired according to the pattern).
  • GZ or BZ2 compressed files can be accepted.
  • Extra intron junction file (-j) should contains 4 columns: separated by tabs Chr\tStart\tEnd\tStrand and will pass to '--sjdbFileChrStartEnd' for STAR. Please note that Start and End are first and last bases of the introns (1-based chromosome coordinates), which are different from those in OLego junction file (in BED format). For more information, see STAR and OLego.
  • The sorted bam files are in "map2" directory and expression files are in "ballgown" directory.
  • The assembled transcrpts are in merged.gtf file.
  • To collect exon/intron/transcript level expression, open R and run:
 library(ballgown)
 bg<-ballgown(samples=list.dirs("ballgown",recursive = FALSE))
 bg

2. getex.R

Get the coordinates of internal exons and the flanking introns from a ballgown object.

Usage:

getex(bg, genome, min.junction=5, min.samples=1)

Return:

A list of exon and intron GRanges. The microexons are the smallest internal exons with the size <= 15 nt.

Example:

library(ballgown)
library(GenomicRanges)
library(Biostrings)
library(BSgenome)
genome<-readDNAStringSet("A.thaliana.fa")
names(genome)<-sub("^(\\S+)\\s+.*","\\1",names(genome))
splice<-getex(bg_at, genome)
mex<-splice$ex[width(splice$ex)<=15]

3. trans2prot.R

Get the longest ORFs from all assembled transcripts and translate the coding sequences to proteins

Usage:

trans2prot(bg, genome, mex, minsize=50, mincov=1, MSTRG="At")

Return:

A list of AAStringSet of protein sequences, GRangesList of transcripts, GRangesList of ORFs containing all 
coding exons and GRangeslist of ORFs containing coding microexons. 

4. cluster.R

Cluster coding microexons:

  1. Get the position and the phase of miroexons in coding sequences and proteins
  2. Search protein motifs (https://www.genome.jp/tools/motif/)
  3. Extract motifs encoded by microexons
  4. Cluster microexons according to the size and phase, and the miroexon-tags

5. mapPWM.R

Map gapped PWM to a genome.

Usage:

mapPWM(cons,exons,genome,focus=2, min.score="80%",include.intronLoss=TRUE, span=20000, min.intron=20, max.intron=10000, prior.params=letterFrequency(genome[[1]],DNA_BASES,as.prob = T), cores=1, check.params=TRUE)

Return:

A GRanges of the center microexon and the flanking exons and introns. 

6. MEPscan.R

Find miroexons in a plant genome using mapPWM.R and MEPdata.RData (MEPdata contains the information of microexon-tags in 45 conserved clusters).

Usage:

MEPscan(genome, min.score="80%",include.intronLoss=TRUE, span=20000, min.intron=20, max.intron=10000, cores=1)

Return

A GRanges of microexons and the flanking exons and introns in all clusters in a plant genome. 

For more details, please see function MEPmod in MEPmodeler.

mepsuite's People

Contributors

yuhuihui2011 avatar

Stargazers

 avatar  avatar

Watchers

 avatar

Forkers

yshi20

mepsuite's Issues

getex.R

Hi~~~
Now, I encounter some problem in runing script getex.R
Can you give me some case that makes scritp worked?
I'm looking forward to your reply!!!!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.