PhyloSofS

A tool to model the evolution and structural impact of alternative splicing

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

PhyloSofS (Phylogenies of Splicing isoforms Structures) is a fully automated computational tool that infers plausible evolutionary scenarios explaining a set of transcripts observed in several species and models the three-dimensional structures of the produced protein isoforms.
The phylogenetic reconstruction algorithm relies on a combinatorial approach and the maximum parsimony principle. The generation of the isoforms' 3D models is performed using comparative modeling.

Case study

PhyloSofS was applied to the c-Jun N-terminal kinase (JNK) family (60 transcripts in 7 species). It made it possible to date the appearance of an alternative splicing event (ASE) resulting in substrate affinity modulation to the common ancestor of mammals, amphibians and fishes, and to identify key residues responsible for this modulation. It also highlighted a new ASE that induces a large deletion yet is conserved across several species. The resulting isoform is stable in solution and could play a role in the cell. More details about this case study, together with the algorithm description, can be found in the PhyloSofS preprint available at bioRxiv.

Installation

1. Download

You can clone this PhyloSofS package using git:

git clone https://github.com/PhyloSofS-Team/PhyloSofS.git

2. Install

Then, you can access the cloned PhyloSofS folder and install the package using Python 3's pip:

cd PhyloSofS
python -m pip install .

3. Install dependencies

Phylogenetic inference

To run the phylogenetic module of PhyloSofS, you need to have Graphviz installed.
The easiest way to install Graphviz depends on your system:

  • Debian/Ubuntu: sudo apt-get install graphviz
  • Windows (using Chocolatey): choco install graphviz
  • macOS (using Homebrew): brew install graphviz
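Before running the phylogenetic module, it can help to confirm that Graphviz is reachable from Python. The following is a minimal sketch and not part of PhyloSofS: it assumes that the Graphviz installation puts its dot executable on the PATH, and the helper name graphviz_available is hypothetical.

```python
import shutil


def graphviz_available(executable="dot"):
    """Return True if the given executable can be found on the PATH.

    Assumption: Graphviz exposes its layout program as 'dot'.
    """
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if graphviz_available():
        print("Graphviz found at:", shutil.which("dot"))
    else:
        print("Graphviz not found; install it with your package manager.")
```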

Molecular modelling

The molecular modelling pipeline depends on Julia, HH-suite3 and MODELLER. This module can only run on Unix systems (because of the HH-suite). To work around this limitation, we provide a Docker image with all these dependencies installed (see the Docker section for more details).

Julia

You can download Julia 1.1.1 binaries from its site.

LibZ

Some BioJulia packages may need LibZ to precompile. If you run into a related error, you can install LibZ from its site. On Ubuntu 18.04 you can install it with: sudo apt-get install zlib1g-dev

HH-suite3

Clone our HH-suite fork at AntoineLabeeuw/hh-suite and follow the Compilation instructions in its README.md file.

MODELLER

PhyloSofS needs MODELLER version 9.21. Follow the instructions on the MODELLER site to install it and get the license key.

Databases

To run the molecular modelling module you need the HH-suite databases:

  • Sequence database: uniclust30_yyyy_mm_hhsuite.tar.gz (we have tested PhyloSofS using 2018_08 as yyyy_mm)
  • Structural database: pdb70_from_mmcif_latest.tar.gz

The mmCIF PDB files needed by MODELLER are downloaded on demand into the indicated folder if they are not already present there.

To set up the databases, you can use the script setup_databases (recommended). Alternatively, a manual installation can be performed following the instructions in docs/get_databases.md.

Using the setup_databases script

The setup_databases script downloads and decompresses the needed databases. It creates the following folder structure, which PhyloSofS can use directly through the --databases argument:

databases
 ├── pdb
 ├── pdb70
 └── uniclust

You can run setup_databases -h to learn more about the script and its arguments.
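As a quick sanity check after the download, the folder layout shown above can be verified from Python. This is only an illustrative sketch: the helper missing_database_folders is hypothetical, and it checks the three subfolder names from the tree above, not the database files inside them.

```python
import tempfile
from pathlib import Path

# Subfolder names taken from the folder structure shown above.
EXPECTED_SUBFOLDERS = ("pdb", "pdb70", "uniclust")


def missing_database_folders(databases_dir):
    """Return the expected subfolders that are absent from databases_dir."""
    root = Path(databases_dir)
    return [name for name in EXPECTED_SUBFOLDERS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Demonstration on a throwaway folder that only contains 'pdb'.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "pdb").mkdir()
        print(missing_database_folders(tmp))  # prints ['pdb70', 'uniclust']
```

An empty list means that the three folders are in place and the path can be passed to --databases.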

Docker (without installation)

You can directly use PhyloSofS via Docker without cloning this GitHub repository. To run PhyloSofS' Docker image you need to install Docker following these instructions.

The following example runs the PhyloSofS Docker image from Windows PowerShell. The databases for the molecular modelling module, stored in D:\databases, are mounted in /databases, and the current local directory is mounted in /project. The current folder is ${PWD} in Windows PowerShell, %cd% in Windows Command Line (cmd), and $(pwd) in Unix.

docker run -ti --rm --mount type=bind,source=d:\databases,target=/databases --mount type=bind,source=${PWD},target=/project diegozea/phylosofs

After this, you have access to the bash terminal of an Ubuntu 18.04 image with PhyloSofS and all its dependencies installed. You only need to provide your MODELLER license key to use PhyloSofS. To do that, run the following command after replacing license_key with your MODELLER license key:

sed -i 's/xxx/license_key/' /usr/lib/modeller9.21/modlib/modeller/config.py

Homology modelling example using Docker CE in Ubuntu

After installing Docker CE following these instructions, you can create a folder to work with the app, e.g.:

mkdir phylosofs

And then go into that folder and run the PhyloSofS Docker image bind-mounting the local folder into /project:

cd phylosofs
sudo docker run -ti --rm --mount type=bind,source=$(pwd),target=/project diegozea/phylosofs

This starts a bash console with PhyloSofS and all its dependencies installed. The sources are taken from diegozea/phylosofs. First, replace xxx with your MODELLER license key using sed, as indicated in the banner. Then, the first time, use the setup_databases script to install the needed databases into the project folder. Downloading and decompressing the databases takes some time, depending on your internet connection and disk speed. You need almost 129 GB of free disk space before downloading and decompressing them:

setup_databases

This creates a databases folder in /project (and therefore in the phylosofs folder of your system) with the sequence and structure databases needed for the homology modelling step.

To test the molecular modelling suite, we are going to create an example input pir file in a GeneName folder. PhyloSofS is going to look for transcripts.pir files in the indicated folder and its subfolders:

mkdir GeneName
echo ">P1;gene transcript ABCDE" >> ./GeneName/transcripts.pir
echo "AAAAAABBBBBBBBBBBBBBBBBBBCCCCCCCCCCCCCCCCCDDDDDDDDDEEEEEEEEE" >> ./GeneName/transcripts.pir
echo "ACTNEFCASCPTFLRMDNNAAAIKALELYKINAKLCDPHPSKKGASSAYLENSKGAPNNS*" >> ./GeneName/transcripts.pir

The PIR annotation line below the id indicates the exon (A, B, C...) to which each residue of the protein isoform belongs. You can then run the molecular modelling module on that folder:

phylosofs -M -i GeneName --databases databases
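The exon annotation convention used in transcripts.pir can be illustrated with a few lines of Python. This is a sketch under stated assumptions, not the parser used by PhyloSofS: the function residues_by_exon is hypothetical, the input strings are toy data, and the annotation line is assumed to have exactly one letter per residue, with the sequence ending in the PIR terminator *.

```python
def residues_by_exon(annotation, sequence):
    """Group the residues of one isoform by the exon letter annotated for them.

    annotation: one exon letter per residue, e.g. "AAABB"
    sequence:   the residues, optionally ending with the PIR terminator '*'
    """
    residues = sequence.rstrip("*")
    if len(annotation) != len(residues):
        raise ValueError("annotation and sequence must have the same length")
    exons = {}  # dicts keep insertion order, so exons stay in sequence order
    for exon, residue in zip(annotation, residues):
        exons[exon] = exons.get(exon, "") + residue
    return exons


if __name__ == "__main__":
    # Toy example: 3 residues in exon A, 2 in exon B, 2 in exon C.
    print(residues_by_exon("AAABBCC", "MKTLVPS*"))
    # prints {'A': 'MKT', 'B': 'LV', 'C': 'PS'}
```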

Note: Installing databases with Windows as a host

If you are using the PhyloSofS Docker image, be aware that errors can occur when very large files are written to bind-mounted NTFS file systems. This happens in particular when running setup_databases, because it downloads and decompresses large files. To avoid this problem, you can install PhyloSofS on Windows and run setup_databases.exe to set up the databases before using the Docker image.

Running PhyloSofS

You can run phylosofs -h to see the help and the list of arguments.

1. Phylogenetic Inference

phylosofs -P -s 100 --tree path_to_newick_tree --transcripts path_to_transcripts

2. Molecular modelling

If the databases were installed using setup_databases and the HH-Suite3 scripts and programs are in the executable paths, then you can run:

phylosofs -M -i path_to_input_dir --databases path_to_databases_folder

PhyloSofS is going to look for transcripts.pir files in the folder and sub-folders of path_to_input_dir to perform the homology modelling of each sequence in those files.
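To preview which files such a recursive lookup would pick up, a short Python sketch can list them before launching a long run. The helper find_transcript_files is hypothetical and only mirrors the behaviour described above (same file name, searched in the folder and all its sub-folders); it is not the code PhyloSofS uses.

```python
import tempfile
from pathlib import Path


def find_transcript_files(input_dir, filename="transcripts.pir"):
    """Return all files named filename in input_dir and its sub-folders, sorted."""
    return sorted(Path(input_dir).rglob(filename))


if __name__ == "__main__":
    # Demonstration on a throwaway folder tree with two gene folders.
    with tempfile.TemporaryDirectory() as tmp:
        for sub in ("GeneA", "GeneB/nested"):
            folder = Path(tmp) / sub
            folder.mkdir(parents=True)
            (folder / "transcripts.pir").write_text(">P1;gene transcript\n")
        for path in find_transcript_files(tmp):
            print(path.relative_to(tmp))
```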

If you have a more manual installation of the databases and/or the HH-Suite3 scripts and programs are not in the path:

phylosofs -M -i path_to_input_dir --hhlib path_to_hhsuite_folder --hhdb path_to_uniclust_database/uniclust_basename --structdb path_to_pdb70/pdb70 --allpdb path_mmcif_pdb_cache_folder

Please note that for the databases --hhdb and --structdb, you need to provide the path to the folder and also the basename of the files in it. For example, if the database uniclust30_2018_08 is located in /home, you need to write:

--hhdb /home/uniclust30_2018_08/uniclust30_2018_08
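The folder-plus-basename rule can be written down as a small helper. This is a sketch only: database_prefix is a hypothetical name, and it assumes, as in the example above, that the database folder is named after the basename of the files it contains.

```python
from pathlib import PurePosixPath


def database_prefix(parent_dir, basename):
    """Build a value for --hhdb or --structdb: <parent>/<basename>/<basename>.

    Assumption: the database folder has the same name as the file basename.
    """
    return str(PurePosixPath(parent_dir) / basename / basename)


if __name__ == "__main__":
    print(database_prefix("/home", "uniclust30_2018_08"))
    # prints /home/uniclust30_2018_08/uniclust30_2018_08
```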

You may also find the following arguments useful:

  • --ncpu number_of_cpu
  • --julia path_to_julia_executable

Docker example

If you installed the databases using setup_databases in a folder that you have bind-mounted to /databases, then you only need to run:

phylosofs -M --databases /databases

This works because the Docker image has HH-Suite3 installed, with its programs and scripts in the executable paths.

Licence

The PhyloSofS package has been developed under the MIT License.

Contact

For questions, comments or suggestions, feel free to contact Elodie Laine or Hugues Richard.

Contributors

antoinelabeeuw, diegozea, elolaine

phylosofs's Issues

Parallelization of the execution/code

At the moment, PhyloSofS is given a number of iterations for the reconstruction of transcripts' phylogenies, and each time a forest structure is visited (computation of the lower bound) counts as an iteration. As the space of forest structures is being explored, better phylogenies are found with lower costs, and the lower bound cut gets more efficient.

To gain computing time, we should probably try and parallelize the forest structure space search. A naive way to do this is the following:

  • (0) Launch 1 short run (a few tens of iterations) starting from the widest forest structure. Save the best forest structure, best associated phylogeny and best associated cost.
  • (1) Then launch n medium-length runs (a few hundreds of thousands of iterations) in parallel, starting from a random phylogeny and given the best cost found for the cut. One of the runs can be started from the best phylogeny instead of a random one. Each run will end up with a best phylogeny and associated best cost.
  • (2) We can launch again in the same way n medium-length runs... and loop over this procedure.
    The different runs will probably visit a lot of identical forest structures that will systematically be eliminated with the lower bound cut. It remains to be seen whether it would be interesting to try and record visited structures and avoid visiting them several times... the computation of the lower bound is clearly quick, so... it may not be worth it.

A more refined procedure would be to make the n processes (a) write down the best phylogeny/cost each time they find a new one, and (b) read the best cost each time they are about to make a random jump, so that the lower bound cut is as efficient as it can be. This would require sharing I/O between processes, maybe by using semaphores...?
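The refined procedure can be prototyped in a few lines to see how the shared bound behaves. What follows is a toy sketch, not PhyloSofS code: threads stand in for the n processes, a lock replaces the semaphores, and forest structures are replaced by random numbers (a lower bound plus a random extra cost). All names (SharedBest, toy_run, parallel_search) are hypothetical.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedBest:
    """Best cost found so far, shared by all runs and guarded by a lock."""

    def __init__(self, cost):
        self._cost = cost
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._cost

    def offer(self, cost):
        """Record cost if it improves the current best; return True if it did."""
        with self._lock:
            if cost < self._cost:
                self._cost = cost
                return True
            return False


def toy_run(seed, n_iterations, shared):
    """One run; returns how many toy structures survived the lower bound cut."""
    rng = random.Random(seed)
    evaluated = 0
    for _ in range(n_iterations):
        lower_bound = rng.randint(0, 1000)       # toy lower bound of a structure
        if lower_bound >= shared.read():         # (b) read best cost before the jump
            continue                             # lower bound cut
        cost = lower_bound + rng.randint(0, 50)  # stand-in for the full evaluation
        evaluated += 1
        shared.offer(cost)                       # (a) publish any improvement
    return evaluated


def parallel_search(n_runs=4, n_iterations=1000, initial_best=10**9):
    """Launch n_runs runs that share one best cost (initial_best plays step 0)."""
    shared = SharedBest(initial_best)
    with ThreadPoolExecutor(max_workers=n_runs) as pool:
        evaluated = list(pool.map(lambda s: toy_run(s, n_iterations, shared),
                                  range(n_runs)))
    return shared.read(), evaluated


if __name__ == "__main__":
    best, evaluated = parallel_search()
    print("best cost:", best, "evaluations per run:", evaluated)
```

With real processes, the same read/offer pattern could be kept by storing the cost in a multiprocessing.Value, which comes with its own lock. Whether the extra synchronisation pays off would still need to be measured, as noted above.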

ModuleNotFoundError in Windows 10

PS C:\Users\Diego\MASSIV\PhyloSofS> phylosofs.exe -h
Traceback (most recent call last):
  File "C:\Users\Diego\AppData\Local\Programs\Python\Python37\Scripts\phylosofs-script.py", line 11, in <module>
    load_entry_point('phylosofs', 'console_scripts', 'phylosofs')()
  File "C:\Users\Diego\AppData\Local\Programs\Python\Python37\lib\site-packages\pkg_resources\__init__.py", line 489, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "C:\Users\Diego\AppData\Local\Programs\Python\Python37\lib\site-packages\pkg_resources\__init__.py", line 2793, in load_entry_point
    return ep.load() 
  File "C:\Users\Diego\AppData\Local\Programs\Python\Python37\lib\site-packages\pkg_resources\__init__.py", line 2411, in load
    return self.resolve()  
  File "C:\Users\Diego\AppData\Local\Programs\Python\Python37\lib\site-packages\pkg_resources\__init__.py", line 2417, in resolve
    module = __import__(self.module_name, fromlist=['__name__']
  File "c:\users\diego\massiv\phylosofs\phylosofs\__init__.py", line 5, in <module>
    from phylosofs import phylosofs 
  File "c:\users\diego\massiv\phylosofs\phylosofs\phylosofs.py", line 22, in <module>
    import initData
ModuleNotFoundError: No module named 'initData'
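
The last frame of the traceback fails on a bare import initData. A possible cause, assuming initData.py sits next to phylosofs.py inside the installed package, is that Python 3 does not resolve implicit relative imports: a bare import only searches sys.path, so it can work when the script is run from its own folder but not when the package is loaded through an entry point. The sketch below reproduces this behaviour with a throwaway package; the names demo_pkg_phylo and sibling_mod_phylo are hypothetical and the code is not taken from PhyloSofS.

```python
import importlib
import sys
import tempfile
from pathlib import Path

PACKAGE = "demo_pkg_phylo"     # hypothetical throwaway package
SIBLING = "sibling_mod_phylo"  # hypothetical module inside that package


def import_with(init_line):
    """Create PACKAGE containing SIBLING, put init_line in its __init__.py,
    import the package, and return SIBLING.VALUE or the error message."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp) / PACKAGE
        pkg_dir.mkdir()
        (pkg_dir / (SIBLING + ".py")).write_text("VALUE = 42\n")
        (pkg_dir / "__init__.py").write_text(init_line + "\n")
        sys.path.insert(0, tmp)  # only the parent folder is on sys.path
        try:
            importlib.invalidate_caches()
            package = importlib.import_module(PACKAGE)
            return getattr(package, SIBLING).VALUE
        except ModuleNotFoundError as err:
            return str(err)
        finally:
            # Undo the changes so that repeated calls start from a clean state.
            sys.path.remove(tmp)
            for name in list(sys.modules):
                if name == PACKAGE or name.startswith(PACKAGE + "."):
                    del sys.modules[name]


if __name__ == "__main__":
    print(import_with("import " + SIBLING))         # bare import: error message
    print(import_with("from . import " + SIBLING))  # explicit relative import: 42
```

If the assumption holds, the analogous change in phylosofs.py would be an explicit import such as from phylosofs import initData or from . import initData.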
