SimSpliceEvol2: alternative splicing-aware simulation of biological sequence evolution and transcript phylogenies.

Overview

SimSpliceEvol is a tool designed to simulate the evolution of sets of alternative transcripts along the branches of an input gene tree. In addition to traditional sequence evolution events, the simulation also incorporates events related to the evolution of gene exon-intron structures and alternative splicing. These events modify the sets of transcripts produced from genes. Data generated using SimSpliceEvol is valuable for testing spliced RNA sequence analysis methods, including spliced alignment of cDNA and genomic sequences, multiple cDNA alignment, identification of orthologous exons, splicing orthology inference, and transcript phylogeny inference. These tests are essential for methods that require knowledge of the real evolutionary relationships between the sequences.


Table of Contents

  1. Overview
  2. Operating System
  3. Requirements
  4. Graphical User Interface (GUI) and Webserver
  5. Getting Started
  6. Main Command - Execution
  7. Descriptions of Project Files
    1. Description of Inputs
    2. Description of Outputs

-----------------------------------------------------

Operating System

The program was developed and tested on Ubuntu 18.04.6 LTS.

-----------------------------------------------------

Requirements

  • python3 (at least Python 3.6)
  • ETE toolkit (ete3)
  • networkx
  • PyQt5
  • timeout_decorator
  • pandas
  • NumPy

-----------------------------------------------------

Graphical User Interface (GUI) and Webserver

Unzip the file application.zip and launch the GUI simspliceevolv2GUI from the application folder.

Note: Launching the program may take some time (no more than 15 seconds) while the environment and the required modules are set up. If any errors occur, feel free to contact us.

The webserver and the GUI are available at https://simspliceevol.cobius.usherbrooke.ca/

-----------------------------------------------------

Getting Started

Main Command

Command

 usage: simspliceevolv2.py [-h] -i INPUT_TREE_FILE [-it ITERATIONS]
                          [-dir_name DIRECTORY_NAME] [-eic_el EIC_EL]
                          [-eic_ed EIC_ED] [-eic_eg EIC_EG] [-c_i C_I]
                          [-c_d C_D] [-k_nb_exons K_NB_EXONS] [-k_eic K_EIC]
                          [-k_indel K_INDEL] [-k_tc K_TC]
                          [-tc_a5 ALTERNATIVE_FIVE_PRIME]
                          [-tc_a3 ALTERNATIVE_THREE_PRIME]
                          [-tc_es EXON_SKIPPING] [-tc_me MUTUALLY_EXCLUSIVE]
                          [-tc_ir INTRON_RETENTION] [-tc_tl TRANSCRIPT_LOSS]

Usage example

(Note: The following example keeps the default parameters and writes the simulation output to the directory ./execution/outputs.)

 python3 simspliceevolv2.py -dir_name 'execution/outputs' -i 'execution/inputs/small_example.nw' 

Expected output

[Image: expected output of the example command]

-----------------------------------------------------

Description of Project Files/Arguments

Description of Inputs

REQUIRED FILE

The only required file is the Newick file [-i INPUT_TREE_FILE], which must contain branch lengths (NHX format is also accepted).
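Since the input tree must carry branch lengths, a quick sanity check before launching a simulation can save a failed run. The sketch below is a heuristic written for illustration, not part of SimSpliceEvol: the function name `has_branch_lengths` is hypothetical, and the check assumes plain labels (no quoted labels containing commas, colons, or parentheses). It strips NHX annotations and verifies that there are at least as many `:length` fields as non-root nodes.

```python
import re

# Matches a branch length such as ":0.1", ":3", or ":1e-3".
_LENGTH = re.compile(r":\s*[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?")


def has_branch_lengths(newick: str) -> bool:
    """Heuristically check that every non-root node has a branch length.

    Assumption: labels are unquoted, so every ',' or ')' closes exactly
    one non-root node of the tree.
    """
    # Drop bracketed annotations such as [&&NHX:S=human] before counting.
    text = re.sub(r"\[.*?\]", "", newick)
    non_root_nodes = text.count(",") + text.count(")")
    lengths = len(_LENGTH.findall(text))
    return non_root_nodes > 0 and lengths >= non_root_nodes


if __name__ == "__main__":
    # Hypothetical trees, used only to exercise the check.
    print(has_branch_lengths("((A:0.1,B:0.2):0.05,C:0.3);"))
    print(has_branch_lengths("((A,B),C);"))
```

For full parsing, the ete3 library listed in the requirements is the more robust choice; this check only counts fields.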

OPTIONAL ARGUMENTS

The other arguments are optional, and we describe them below:

[-it ITERATIONS] name of the simulation (default='1')

[-k_nb_exons K_NB_EXONS] multiplicative constant for the number of exons in a gene (default=1.5)

[-k_eic K_EIC] multiplicative constant for the exon-intron change (eic) rate (default=25)

[-k_indel K_INDEL] multiplicative constant for the codon indel rate (default=5)

[-k_tc K_TC] multiplicative constant for the transcript change (tc) rate (default=10)

[-eic_el EIC_EL] relative frequency of exon-intron structure change by exon loss (default=0.4)

[-eic_eg EIC_EG] relative frequency of exon-intron structure change by exon gain (default=0.5)

[-eic_ed EIC_ED] relative frequency of exon-intron structure change by exon duplication (default=0.1)

[-c_i C_I] relative frequency of codon insertions (default=0.7)

[-c_d C_D] relative frequency of codon deletions (default=0.3)

[-tc_a5 ALTERNATIVE_FIVE_PRIME] relative frequency of alternative five prime in tc (default=0.25)

[-tc_a3 ALTERNATIVE_THREE_PRIME] relative frequency of alternative three prime in tc (default=0.25)

[-tc_es EXON_SKIPPING] relative frequency of exon skipping in tc (default=0.35)

[-tc_me MUTUALLY_EXCLUSIVE] relative frequency of mutually exclusive exons in tc (default=0.15)

[-tc_ir INTRON_RETENTION] relative frequency of intron retention in tc (default=0.00)

[-tc_tl TRANSCRIPT_LOSS] relative frequency of transcript loss in tc (default=0.3)
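When running many simulations with different settings, it can be convenient to assemble the command line from Python. The following is a minimal sketch under stated assumptions: the helper `build_command` is hypothetical (it is not shipped with SimSpliceEvol), and it only uses the flags listed in the usage message above. It rejects unknown option names so that a typo does not silently reach the script.

```python
# Flags taken from the usage message of simspliceevolv2.py (without the dash).
KNOWN_FLAGS = {
    "it", "dir_name", "eic_el", "eic_ed", "eic_eg", "c_i", "c_d",
    "k_nb_exons", "k_eic", "k_indel", "k_tc",
    "tc_a5", "tc_a3", "tc_es", "tc_me", "tc_ir", "tc_tl",
}


def build_command(tree_file, script="simspliceevolv2.py", **options):
    """Return the argument list for one simulation run (hypothetical helper)."""
    unknown = set(options) - KNOWN_FLAGS
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    command = ["python3", script, "-i", tree_file]
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command


if __name__ == "__main__":
    cmd = build_command(
        "execution/inputs/small_example.nw",
        dir_name="execution/outputs",
        k_eic=25,
    )
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the script.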

Description of Outputs

Output files

SimSpliceEvol creates nine (9) folders.

[output_directory]/genes/[iteration#i]

  • The file genes.fasta contains all the gene sequences in FASTA format.

[output_directory]/transcripts/[iteration#i]

  • The file transcripts.fasta contains all the transcript sequences in FASTA format.
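The FASTA outputs can be loaded with a few lines of standard-library Python. This is a minimal sketch rather than the project's own reader: the function `read_fasta` and the sample identifiers are assumptions made for illustration, and the whole header line (without `>`) is used as the key, so a repeated header would overwrite the earlier record.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)  # sequence may span several lines
    return {key: "".join(parts) for key, parts in chunks.items()}


if __name__ == "__main__":
    # Hypothetical records; real identifiers come from the simulation.
    sample = ">gene_1\nATGAAA\nTTT\n>gene_2\nATGCCC\n".splitlines()
    for name, sequence in read_fasta(sample).items():
        print(name, len(sequence))
```

An open file handle can be passed directly, for example `read_fasta(open(path))`, where `path` points to a genes.fasta or transcripts.fasta file in the output directory.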

[output_directory]/transcripts_to_gene/[iteration#i]

  • The file mappings.txt contains all the transcript IDs along with their corresponding genes.

[output_directory]/pairwise_alignments/[iteration#i]

  • The file pairwise_alignments.fasta contains all the spliced alignments of transcripts with their corresponding gene sequences in FASTA format.

[output_directory]/multiple_alignments/[iteration#i]

  • The file msa_transcripts.alg contains the multiple sequence alignment of transcripts in FASTA format.

  • The file splicing_structure.csv describes the representation of exons in CSV format.

[output_directory]/exons_positions/[iteration#i]

  • The file exons_positions.txt contains the positions (start and end) of exons in transcripts and genes.

[output_directory]/clusters/[iteration#i]

  • The file ortholog_groups.clusters describes the clusters of orthologous transcripts (transcripts with the same structure). A cluster can induce recent paralogs or isoorthologs.

[output_directory]/phylogenies/[iteration#i]

  • The SVG images and Newick files contained in the directory describe the evolutionary history of transcripts. (For further exploration, refer to the section below.)

    • Nodes

      • leaves

        • gold : transcripts of existing genes.

        • gray : transcripts of ancestral genes.

      • internal nodes

        • red : Intron Retention (IR)

        • orange : Mutually Exclusive exons (ME)

        • violet : 5 prime Splice Site (5SS)

        • medium blue : 3 prime Splice Site (3SS)

        • lime green : Exon Skipping (ES)

        • white : Conservation (Speciation or Duplication event under the LCA-reconciliation), i.e., not a creation event.

Example

[Image: example transcript phylogeny]

  • two ME nodes (orange internal nodes)

  • one 5SS node (violet internal node)

  • conservation nodes (white internal nodes)

  • transcripts of existing genes (gold leaf nodes)

  • transcripts of ancestral genes (gray leaf nodes)

MORE with SimSpliceEvol

The main function simspliceevol()

simspliceevol(SRC, ITERATION_NAME, TREE_INPUT, K_NB_EXONS, K_INDEL, C_I, C_D, EIC_ED, EIC_EG, EIC_EL, K_EIC, K_TC, TC_RS, TC_A3, TC_A5, TC_ME, TC_ES, TC_IR, TC_TL)

returns a set that contains:

  • an ETE tree Python object, as defined in the ete3 library. Each node carries attributes that provide additional details about the simulation.
  • a pandas DataFrame of exon sequences, indexed by transcript name, with one column per exon.

After a simulation, each tree node has two types of attributes: one describing the evolution of genes and the other describing the evolution of transcripts.

GENE EVOLUTION

| METHOD | DESCRIPTION |
|---|---|
| TreeNode.gene_name | returns the name of the gene. |
| TreeNode.gene_stucture | returns a description of the gene's structure. This is an ordered list showing the alternation of exons and introns. |
| TreeNode.exons_dict | stores the exons of the gene and their sequences. The sequences reflect codon substitutions and indels; indels are represented by *** in the sequence. |
| TreeNode.introns_dict | stores the introns of the gene and their sequences. |

TRANSCRIPT EVOLUTION

| METHOD | DESCRIPTION |
|---|---|
| TreeNode.transcripts_dict | stores transcripts of the gene node and the description of their structure. |
| TreeNode.transcripts_sequences_dicts | stores the sequences of exons. |
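The attributes above can be read like any other Python attribute. The sketch below is illustrative only: `describe_gene_node` is a hypothetical helper, the stand-in node is built with `types.SimpleNamespace` instead of a real ete3 node, and the exact value types of `exons_dict` and `transcripts_dict` are assumed to be dict-like (the README only says what they store).

```python
from types import SimpleNamespace


def describe_gene_node(node):
    """Summarize one node using the attribute names documented above."""
    # getattr with a default keeps the helper usable on nodes that lack
    # an attribute; this fallback is an assumption, not documented behavior.
    exons = getattr(node, "exons_dict", None) or {}
    transcripts = getattr(node, "transcripts_dict", None) or {}
    return {
        "gene": getattr(node, "gene_name", None),
        "n_exons": len(exons),
        "n_transcripts": len(transcripts),
    }


# Stand-in for a simulated node; names and sequences are made up.
fake_node = SimpleNamespace(
    gene_name="gene_A",
    exons_dict={"exon_0": "ATGAAA", "exon_1": "CCC***GGG"},
    transcripts_dict={"transcript_1": ["exon_0", "exon_1"]},
)
summary = describe_gene_node(fake_node)
print(summary)
```

With a tree returned by a simulation, the same helper could be applied to every node through ete3's `tree.traverse()`.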

Copyright © 2023 CoBIUS LAB

Contributors

dondavy, esaiekuitche
