DMPfold

Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints.

See our paper in Nature Communications for more. Please cite the paper if you use DMPfold.

You can also run DMPfold via the PSIPRED web server. This is a good way to get models for a few sequences, but if you want to run DMPfold on many sequences we strongly recommend you run it locally. The server version of DMPfold has restrictions on run time and uses parameters that give faster runs, so it should not be used to benchmark DMPfold.

Installation

Because DMPfold makes use of a lot of different software, installation can be a little fiddly. However, we have aimed to make it as straightforward as possible. These instructions should work for a Linux system:

  • Make sure you have Python 3 with PyTorch 0.4 or later, NumPy and SciPy installed. GPU setup is optional for PyTorch; it won't speed things up much because running the network isn't a time-consuming step. DMPfold has been tested on Python 3.6 and 3.7. The command python3 should point to the Python that you want to use.
  • Install HH-suite and the uniclust30 database, unless you are getting your alignments from elsewhere.
  • Install FreeContact.
  • Install CCMpred.
  • Install MODELLER, which requires a license key. Only the Python package is required so this can be installed with conda install modeller -c salilab.
  • Install CNS. We found we had to follow all of the steps in this comment to get CNS working: set MXFPEPS2 in machvar.inc to 8192, remove -fastm flag in the make file, set MXRTP in rtf.inc in the source directory to 4000 and in machvar.f add WRITE (6,'(I6,E10.3,E10.3)') I, ONEP, ONEM just above line 67, which looks like IF (ONE .EQ. ONEP .OR. ONE .EQ. ONEM) THEN. We also had to install the flex-devel package via our system package manager. In addition, you should change two values in cns_solve_1.3/modules/nmr/readdata to larger numbers to allow DMPfold to run on larger structures. Change the nrestraints = 20000 line to something like nrestraints = 50000 and the nassign 1600 line to something like nassign 3000.
  • Download and patch the required CNS scripts by changing into the cnsfiles directory and running sh installscripts.sh.
  • Install CD-HIT, which is usually as simple as a clone and make. CD-HIT is not required if you don't need to predict the TM-score of generated models.
  • Install the legacy BLAST software, in particular formatdb, blastpgp and makemat. We may update this to BLAST+ in the future.
  • Other software is pre-compiled and included here (PSIPRED, PSICOV, various utility scripts with the code in src). This should run okay but may need separate compilation using the makefile if issues arise. Some other standard programs, such as the csh shell, are assumed to be installed.
  • Change lines 10/13-15/18/21/24 in seq2maps.csh, lines 11/14/17/20 in aln2maps.csh, lines 4/7 in bin/runpsipredandsolvwithdb, lines 10/13 in run_dmpfold.sh and lines 7/10 in predict_tmscore.sh to point to the installed locations of the above software. You can also set the number of cores to use in seq2maps.csh and aln2maps.csh. This sets the number of cores for HHblits, PSICOV, FreeContact and CCMpred - the script will run faster with this set to a value larger than 1 (e.g. 4 or 8).

Check the continuous integration setup script and logs for additional tips and a step-by-step installation on Ubuntu.
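
Before a first run, it can help to confirm that the basic prerequisites are in place. The following is a minimal sketch, not part of DMPfold: the function names are hypothetical, and only python3 and csh are looked up on the PATH, since the other programs are located through the paths you edit into the scripts. To check those as well, add their full installed paths to the list.

```python
import importlib.util
import shutil

# Assumption: python3 and csh are expected on PATH. For programs that DMPfold
# finds via paths set in its scripts, add the full path to each executable.
REQUIRED_TOOLS = ["python3", "csh"]
# Import names of the Python packages listed in the installation steps.
REQUIRED_MODULES = ["torch", "numpy", "scipy"]


def missing_tools(names):
    """Return the entries of `names` that are not executable files or on PATH."""
    return [name for name in names if shutil.which(name) is None]


def missing_modules(names):
    """Return the top-level Python modules in `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for tool in missing_tools(REQUIRED_TOOLS):
        print(f"missing executable: {tool}")
    for module in missing_modules(REQUIRED_MODULES):
        print(f"missing Python module: {module}")
```

This only checks that the programs and packages can be found. It does not check versions, the MODELLER license key or the CNS changes described above.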

Usage

Here we give an example of running DMPfold on Pfam family PF10963. First you need to generate the .21c and .map files. This can be done in one of two ways:

  • From a single sequence: run csh seq2maps.csh example/PF10963.fasta, which runs HHblits, PSIPRED, SOLVPRED, PSICOV, FreeContact, CCMpred and alnstats.
  • From an alignment: run csh aln2maps.csh example/PF10963.aln, which runs PSIPRED, SOLVPRED, PSICOV, FreeContact, CCMpred and alnstats. The file PF10963.aln has one sequence per line with the ungapped target sequence as the first line.
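
If you are supplying your own alignment, a quick format check can catch problems before the longer steps run. The sketch below is illustrative only: check_alignment is a hypothetical helper, not part of DMPfold. It assumes that gaps are written as - or . and that every row has the same length as the target, which is not stated above but is usual for this kind of alignment file.

```python
def check_alignment(lines, gap_chars="-."):
    """Return a list of problems found in a one-sequence-per-line alignment.

    Blank lines are ignored, so row numbers count non-blank rows only.
    """
    rows = [line.strip() for line in lines if line.strip()]
    if not rows:
        return ["alignment is empty"]
    problems = []
    target = rows[0]
    # The first row is the target sequence and must not contain gaps.
    if any(ch in gap_chars for ch in target):
        problems.append("first line (target) contains gap characters")
    for number, row in enumerate(rows, start=1):
        if row.startswith(">"):
            # FASTA headers suggest the file is not in one-per-line format.
            problems.append(f"row {number} looks like a FASTA header")
        elif len(row) != len(target):
            # Assumption: all aligned rows share the target's length.
            problems.append(
                f"row {number} has length {len(row)}, expected {len(target)}"
            )
    return problems
```

To use it, open the alignment, for example example/PF10963.aln, and pass the file object to check_alignment. An empty list means no problems were found.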

Then run sh run_dmpfold.sh example/PF10963.fasta PF10963.21c PF10963.map ./PF10963 to run DMPfold, where the last parameter is an output directory that will be created. Running sh run_dmpfold.sh example/PF10963.fasta PF10963.21c PF10963.map ./PF10963 5 20 instead runs 5 iterations with 20 models per iteration (the defaults are 3 and 50). The final model is final_1.pdb. Other structures are written as final_2.pdb to final_5.pdb only if they are significantly different, so these files may not be present. Many other files are generated, totalling around 100 MB; these should be deleted to save disk space if you are running DMPfold on many sequences.
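
When running DMPfold on many sequences, the command above can be assembled programmatically. This is a minimal sketch with hypothetical helper names. It assumes that the .21c and .map files are named after the family and sit in the current directory, as in the example. It also assumes that the iteration count and models per iteration are given together, since only those two forms of the command are shown above.

```python
import subprocess


def dmpfold_command(fasta, c21, contact_map, outdir, iterations=None, models=None):
    """Build the argument list for run_dmpfold.sh in the form shown above."""
    if (iterations is None) != (models is None):
        # Assumption: the two optional arguments are only used as a pair.
        raise ValueError("give both iterations and models, or neither")
    command = ["sh", "run_dmpfold.sh", fasta, c21, contact_map, outdir]
    if iterations is not None:
        command += [str(iterations), str(models)]
    return command


def run_family(name, iterations=None, models=None):
    """Run DMPfold for one family, assuming the example's file naming."""
    command = dmpfold_command(
        f"example/{name}.fasta", f"{name}.21c", f"{name}.map", f"./{name}",
        iterations, models,
    )
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run(command, check=True)
```

Calling run_family("PF10963") corresponds to the first command above, and run_family("PF10963", 5, 20) to the second.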

To predict the TM-score of a DMPfold model using our trained predictor, run sh predict_tmscore.sh example/PF10963.fasta PF10963.aln PF10963/final_1.pdb PF10963/rawdistpred.1. If this predictor estimates that a model has a TM-score of at least 0.5 then there is an 83% chance of this being the case according to cross-validation of the Pfam validation set.
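
The predictor reads rawdistpred.1 from the output directory, so any clean-up of intermediate files is best left until after this step. The sketch below shows one way to remove everything except the final models. The helper names are hypothetical, and the choice to keep only final_*.pdb is an assumption; extend keep_patterns if you need other files. It only touches regular files directly inside the output directory and defaults to a dry run.

```python
import fnmatch
import os


def files_to_delete(filenames, keep_patterns=("final_*.pdb",)):
    """Return the names in `filenames` that match none of `keep_patterns`."""
    return [
        name for name in filenames
        if not any(fnmatch.fnmatch(name, pattern) for pattern in keep_patterns)
    ]


def clean_output_dir(outdir, keep_patterns=("final_*.pdb",), dry_run=True):
    """List, and optionally delete, files in `outdir` that are not kept."""
    names = sorted(
        name for name in os.listdir(outdir)
        if os.path.isfile(os.path.join(outdir, name))
    )
    doomed = files_to_delete(names, keep_patterns)
    if not dry_run:
        # Only delete when the caller explicitly disables the dry run.
        for name in doomed:
            os.remove(os.path.join(outdir, name))
    return doomed
```

Call clean_output_dir("PF10963") first to see which files would be removed, then repeat with dry_run=False once the list looks right.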

See Supplementary Figure 1 in the paper for estimates of run time. It takes around 3 hours on a single core to carry out a complete DMPfold run for a 200 residue protein, but this can occasionally be much longer due to PSICOV not converging. 8 GB of memory is generally sufficient to run DMPfold, but more may be required for larger proteins.

Figure 5 in the paper gives some data on how DMPfold performs with respect to sequence length. Sequences up to around 600 residues in length can be modelled accurately, with performance degrading above this.

Data

Models for the 1,475 Pfam families modelled in the paper can be downloaded here. Additional models for the remainder of the dark Pfam families can be downloaded here (some were not modelled due to small sequence alignments). Alignments for the Pfam families without available templates can be downloaded here. The format is one sequence per line with the ungapped target sequence as the first line.

The directory pfam in this repository contains text files with the lists from Figure 4A of the paper, target sequences for modelled families and data for modelled families (sequence length, effective sequence count, distogram satisfaction scores, estimated TM-score and probability TM-score >= 0.5).

The list of PDB chains used for training can be found here.

Contributors

danbuchan, davtjon, jgreener64, shaunmk
