pdb2sql's Introduction

⚠️ Archiving Note

This repository is no longer being maintained and has been archived for historical purposes.

We have now developed DeepRank2, an improved and unified version of DeepRank, DeepRank-GNN, and DeepRank-Mut.

✨ DeepRank2 allows for transformation and storage of 3D representations of both protein-protein interfaces (PPIs) and protein single-residue variants (SRVs) into either graphs or volumetric grids containing structural and physico-chemical information. These can be used for training neural networks for a variety of patterns of interest, using either our pre-implemented training pipeline for graph neural networks (GNNs) or convolutional neural networks (CNNs) or external pipelines.

We look forward to seeing you in our new space - DeepRank2!

DeepRank

Overview


DeepRank is a general, configurable deep learning framework for data mining protein-protein interactions (PPIs) using 3D convolutional neural networks (CNNs).

DeepRank contains useful APIs for pre-processing PPI data, computing features and targets, and training and testing CNN models.

Features:

  • Predefined atom-level and residue-level PPI feature types
    • e.g. atomic density, vdw energy, residue contacts, PSSM, etc.
  • Predefined target types
    • e.g. binary class, CAPRI categories, DockQ, RMSD, FNAT, etc.
  • Flexible definition of both new features and targets
  • 3D grid feature mapping
  • Efficient data storage in HDF5 format
  • Supports both classification and regression (based on PyTorch)

Installation

DeepRank requires Python 3.7 or 3.8 on Linux or macOS. Make sure that mpi4py is installed in your environment before installing DeepRank: conda install mpi4py

Stable Release

DeepRank is available in stable releases on PyPI:

  • Install the module pip install deeprank

Development Version

You can also install the development version of the source code from the development branch:

  • Clone the repository git clone --branch development https://github.com/DeepRank/deeprank.git
  • Go there cd deeprank
  • Install the package pip install -e ./

To check that the installation was successful, you can run the tests:

  • Go into the test directory cd test
  • Run the test suite pytest

Tutorial

We give here a tutorial-like introduction to the DeepRank machinery. More information can be found in the documentation http://deeprank.readthedocs.io/en/latest/. We quickly illustrate here the two main steps of DeepRank:

  • the generation of the data
  • running deep learning experiments.

A. Generate the data set (using MPI)

Generating the data only requires the PDB files of the decoys and their native structure, plus PSSM files if needed. All the features/targets and the features mapped onto grid points will be automatically calculated and stored in an HDF5 file.

from deeprank.generate import *
from mpi4py import MPI

comm = MPI.COMM_WORLD

# let's put this sample script in the test folder, so the working path will be deeprank/test/
# name of the hdf5 to generate
h5file = './hdf5/1ak4.hdf5'

# for each hdf5 file where to find the pdbs
pdb_source = ['./1AK4/decoys/']


# where to find the native conformations
# pdb_native is only used to calculate i-RMSD, dockQ and so on.
# The native pdb files will not be saved in the hdf5 file
pdb_native = ['./1AK4/native/']


# where to find the pssm
pssm_source = './1AK4/pssm_new/'


# initialize the database
database = DataGenerator(
    chain1='C', chain2='D',
    pdb_source=pdb_source,
    pdb_native=pdb_native,
    pssm_source=pssm_source,
    data_augmentation=0,
    compute_targets=[
        'deeprank.targets.dockQ',
        'deeprank.targets.binary_class'],
    compute_features=[
        'deeprank.features.AtomicFeature',
        'deeprank.features.FullPSSM',
        'deeprank.features.PSSM_IC',
        'deeprank.features.BSA',
        'deeprank.features.ResidueDensity'],
    hdf5=h5file,
    mpi_comm=comm)


# create the database
# compute features/targets for all complexes
database.create_database(prog_bar=True)


# define the 3D grid
grid_info = {
    'number_of_points': [30, 30, 30],
    'resolution': [1., 1., 1.],
    'atomic_densities': {'C': 1.7, 'N': 1.55, 'O': 1.52, 'S': 1.8},
}

# Map the features
database.map_features(grid_info, try_sparse=True, time=False, prog_bar=True)

This script can be executed using, for example, 4 MPI processes with the command:

    NP=4
    mpiexec -n $NP python generate.py

In the first part of the script we define the paths where the PDBs of the decoys and natives that we want in the dataset can be found. All the .pdb files present in pdb_source will be used in the dataset. We need to specify where to find the native conformations in order to compute the RMSD and the DockQ score. For each pdb file detected in pdb_source, the code will try to find a native conformation in pdb_native.

We then initialize the DataGenerator object. This object (defined in deeprank/generate/DataGenerator.py) needs a few input parameters:

  • pdb_source: where to find the pdb to include in the dataset
  • pdb_native: where to find the corresponding native conformations
  • compute_targets: list of modules used to compute the targets
  • compute_features: list of modules used to compute the features
  • hdf5: Name of the HDF5 file to store the data set

We then create the database with the command database.create_database(). This function automatically creates an HDF5 file where each pdb has its own group. In each group we can find the pdb of the complex and its native form, the calculated features, and the calculated targets. We can now map the features to a grid. This is done via the command database.map_features(). As you can see, this method requires a dictionary as input. The dictionary contains the instructions to map the data:

  • number_of_points: the number of grid points in each direction
  • resolution: the grid resolution in Ångström
  • atomic_densities: {'atom_name': vdw_radius}, the atomic densities required

The atomic densities are mapped following the protein-ligand paper. The other features are mapped onto the grid points using a Gaussian function (other mapping modes exist but are currently hard-coded).
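
For illustration, here is a minimal numpy sketch of this kind of Gaussian mapping. It is not DeepRank's actual implementation; the function name and the sigma parameter are assumptions.

import numpy as np

# Illustrative sketch only, not DeepRank's internal code: spread each
# atom's feature value onto the grid points with a Gaussian kernel.
def map_feature_gaussian(grid_points, atom_xyz, atom_values, sigma=1.0):
    # grid_points: (N, 3) grid coordinates
    # atom_xyz:    (M, 3) atom coordinates
    # atom_values: (M,)   feature value carried by each atom
    mapped = np.zeros(len(grid_points))
    for xyz, val in zip(atom_xyz, atom_values):
        d2 = np.sum((grid_points - xyz) ** 2, axis=1)  # squared distances
        mapped += val * np.exp(-d2 / (2 * sigma ** 2))
    return mapped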

Visualization of the mapped features

To explore the HDF5 file and visualize the features you can use the dedicated browser https://github.com/DeepRank/DeepXplorer. This tool allows you to dig through the hdf5 file and to directly generate the files required to visualize the features in VMD or PyMOL. An iPython console is also embedded to analyze the feature values, plot them, etc.
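
If you prefer to inspect the file programmatically, here is a quick sketch using h5py; the exact group layout inside the file may differ from what the comments suggest, so check the keys first.

import h5py

# Sketch: list the complexes and the datasets stored under each one.
# The exact subgroup names are not guaranteed; inspect the keys first.
with h5py.File('./hdf5/1ak4.hdf5', 'r') as f5:
    for mol in list(f5.keys())[:3]:   # first few complexes
        print(mol)
        f5[mol].visit(print)          # print the full subtree of names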

B. Deep Learning

The HDF5 files generated above can be used as input for deep learning experiments. You can take a look at the file test/test_learn.py for some examples. We give here a quick overview of the process.

from deeprank.learn import *
from deeprank.learn.model3d import cnn_reg
import torch.optim as optim
import numpy as np

# input database
database = '1ak4.hdf5'

# output directory
out = './my_DL_test/'

# declare the dataset instance
data_set = DataSet(database,
            chain1='C',
            chain2='D',
            grid_info={
                'number_of_points': (10, 10, 10),
                'resolution': (3, 3, 3)},
            select_feature={
                'AtomicDensities': {'C': 1.7, 'N': 1.55, 'O': 1.52, 'S': 1.8},
                'Features': ['coulomb', 'vdwaals', 'charge', 'PSSM_*']},
            select_target='DOCKQ',
            normalize_features = True, normalize_targets=True,
            pair_chain_feature=np.add,
            dict_filter={'DOCKQ':'<1'})


# create the network
model = NeuralNet(data_set,cnn_reg,model_type='3d',task='reg',
                  cuda=False,plot=True,outdir=out)

# change the optimizer (optional)
model.optimizer = optim.SGD(model.net.parameters(),
                            lr=0.001,
                            momentum=0.9,
                            weight_decay=0.005)

# start the training
model.train(nepoch=50, divide_trainset=0.8, train_batch_size=5, num_workers=0)

In the first part of the script we create a Torch dataset from the HDF5 file. We can specify one or several HDF5 files and even select certain conformations using the dict_filter argument. Other options of DataSet can be used to specify the features/targets, the normalization, etc.
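
For example, here is a sketch of a dataset built from several HDF5 files with a filter. The second file name is hypothetical, and the call otherwise reuses only arguments shown above.

# Sketch: combine several HDF5 files and keep only some conformations.
# '1atn.hdf5' is a hypothetical second database used for illustration.
data_set = DataSet(['1ak4.hdf5', '1atn.hdf5'],
                   chain1='C', chain2='D',
                   select_target='DOCKQ',
                   normalize_features=True, normalize_targets=True,
                   dict_filter={'DOCKQ': '<0.8'})  # keep models with DockQ < 0.8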

We then create a NeuralNet instance that takes the dataset as an input argument. Several options are available to specify the task, GPU use, etc. We then simply have to train the model.

Issues and Contributing

If you have questions or find a bug, please report the issue in the GitHub issue tracker.

If you want to change or further develop DeepRank code, please check the Developer Guideline to see how to conduct further development.

pdb2sql's People

Contributors

abelsiqueira, cunlianggeng, danielskatz, dariomarzella, dependabot-preview[bot], joaomcteixeira, jspaaks, lilysnow, manonreau, nicorenaud


pdb2sql's Issues

provide info for help(pdb2sql)

Provide documentation at the module level: import pdb2sql; help(pdb2sql). Though the documentation website states that the user should from pdb2sql import pdb2sql, the pdb2sql library level is the first place where a user at a Python prompt will look for help. I was expecting a general overview of how to use the library, along with external references for more detailed explanations.
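
A minimal sketch of what this could look like: a module-level docstring in pdb2sql/__init__.py (the contents below are illustrative only).

"""pdb2sql: manipulate PDB files through an SQL interface.

Typical usage (illustrative):

    from pdb2sql import pdb2sql
    db = pdb2sql('1ak4.pdb')
    xyz = db.get('x,y,z', chainID=['A'])

See the online documentation for a detailed overview.
"""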

compute_irmsd_fast() and compute_irmsd_pdb2sql() give very different RMSD

Describe the bug
compute_irmsd_fast() and compute_irmsd_pdb2sql() give very different RMSD: 1.13 and 5.742, respectively.

To Reproduce

from pdb2sql.StructureSimilarity import StructureSimilarity
sim = StructureSimilarity('model.pdb', 'ref.pdb')
sim.compute_irmsd_fast() # this gives an irmsd of 1.13
sim.compute_irmsd_pdb2sql() #this gives an irmsd of 5.742

Input files:
test.tar.gz

Maybe it is related to this pull request: #72

pytest failed

pytest failed. I guess it is due to the recent update.

E FileNotFoundError: [Errno 2] No such file or directory: 'pdb/target136-scoring_0506_conv.pdb'

../pdb2sql/pdb2sqlcore.py:193: FileNotFoundError

update commit method to private

what is the meaning of the .commit() method? Can we apply here the same rationale as for the .close() method issue stated before

  • update commit to private
  • check other packages

pdb2sql read Path from pathlib

Is your feature request related to a problem? Please describe.

I have tried to load a pdb file from my hard disk after fetching it, using it as a Path object, but pdb2sql cannot read a PDB from a pathlib.Path.

Describe the solution you'd like

Compatibility with pathlib would facilitate using pdb2sql as a library and dependency

How to reproduce the issue:

from pathlib import Path
from pdb2sql import pdb2sql, fetch

fetch('16PK')
pdbpath = Path(f'16PK.pdb')
assert pdbpath.exists()
print(pdbpath)
pdb2sql(pdbpath)

it breaks even though the file exists on disk:

16PK.pdb
Traceback (most recent call last):
  File "/home/joao/GitHub/pdb2sql/pdb2sql/tpath.py", line 8, in <module>
    pdb2sql(pdbpath)
  File "/home/joao/anaconda3/lib/python3.7/site-packages/pdb2sql/pdb2sqlcore.py", line 26, in __init__
    self._create_sql()
  File "/home/joao/anaconda3/lib/python3.7/site-packages/pdb2sql/pdb2sqlcore.py", line 70, in _create_sql
    pdbdata = pdb2sql.read_pdb(pdbfile)
  File "/home/joao/anaconda3/lib/python3.7/site-packages/pdb2sql/pdb2sqlcore.py", line 176, in read_pdb
    raise ValueError(f'Invalid pdb input: {pdbfile}')
ValueError: Invalid pdb input: 16PK.pdb

Link to JOSS review: openjournals/joss-reviews#2077
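
A sketch of one possible fix, assuming the type check happens in read_pdb: coerce path-like objects to str before validating the input (the helper name and its placement are hypothetical).

import os

# Hypothetical helper: accept pathlib.Path by converting it to str,
# leaving str, list and ndarray inputs untouched.
def _coerce_pdbfile(pdbfile):
    if isinstance(pdbfile, os.PathLike):
        return os.fspath(pdbfile)
    return pdbfile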

Import (rename pdb2sql.py)

The imports are a bit tricky since we have pdb2sql.py in the package (also called pdb2sql). We should change that: pdb2sql.py -> pdb2sqlcore.py

But then we'll have to change all the imports in the other files ...

add pdb matching function

pdb matching is a very useful step for pdb analysis. It would be nice if we could add this to pdb2sql.

Expected performance:

INPUT:

  • a reference pdb file with multiple chains
  • a set of pdb files for the same protein complex but with different numbering and chain IDs

OUTPUT:

  • chain ID mapping
  • pdb files renumbered based on the reference pdb. Chain IDs are also changed based on the reference pdb

Ideally, we hope to separate pdb_matching into two functions (steps):

Step 1. pdb_match_chn_batch.py: match chain IDs of pdb files to ref.pdb. Output _newChnID.pdb files.
Note: This step can be skipped if model.pdb files have already matched chain IDs. This step is also error-prone when multiple chains are highly similar to each other. Therefore, a human visual check is necessary.

Step 2. pdb_renum_batch.py: align and renumber pdb files to ref.pdb. Output _renum.pdb files.

There are two existing solutions:

  1. https://github.com/LilySnow/PDB-matching (python + cpp)
  2. DeepRank/haddock-tools@ed9beee (python, by the haddock group)

Maybe we could use these solutions as the basis?
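
To make the request concrete, here is a hypothetical sketch of the two-step interface; the module and function names are illustrative, not existing pdb2sql API.

# Hypothetical interface sketch, not existing pdb2sql code.
from pdb2sql.match import match_chains, renumber  # hypothetical module

# Step 1: map chain IDs of the model onto the reference
chain_map = match_chains('ref.pdb', 'model.pdb')   # e.g. {'B': 'A', 'A': 'B'}

# Step 2: renumber residues following the reference numbering
renumber('model.pdb', 'ref.pdb', chain_map, out='model_renum.pdb')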

Tests fail or not found

Please notice this behavior when running the tests:

Steps to reproduce:

# from within the cloned repository folder
conda create -n pdb2sqlgit python=3.7
conda activate pdb2sqlgit
python setup.py develop
python setup.py test

What would be a way to run the tests locally, instead? Am I doing something wrong?

$ python setup.py test
running test
WARNING: Testing via this command is deprecated and will be removed in a future version. Users looking for a generic test entry point independent of test runner are encouraged to use tox.
running egg_info
writing pdb2sql.egg-info/PKG-INFO
writing dependency_links to pdb2sql.egg-info/dependency_links.txt
writing requirements to pdb2sql.egg-info/requires.txt
writing top-level names to pdb2sql.egg-info/top_level.txt
reading manifest file 'pdb2sql.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'pdb2sql.egg-info/SOURCES.txt'
running build_ext
tests (unittest.loader._FailedTest) ... ERROR

======================================================================
ERROR: tests (unittest.loader._FailedTest)
----------------------------------------------------------------------
ImportError: Failed to import test module: tests
Traceback (most recent call last):
  File "/home/joao/anaconda3/envs/pdb2sqlgit/lib/python3.7/unittest/loader.py", line 154, in loadTestsFromName
    module = __import__(module_name)
ModuleNotFoundError: No module named 'tests'


----------------------------------------------------------------------
Ran 1 test in 0.000s

FAILED (errors=1)
Test failed: <unittest.runner.TextTestResult run=1 errors=1 failures=0>
error: Test failed: <unittest.runner.TextTestResult run=1 errors=1 failures=0>

I am on:

commit a03707d7c5b77b2b4f7a9c6caf526c40ae6fa7f7 (HEAD -> master, origin/master, origin/HEAD)
Author: NicoRenaud <[email protected]>
Date:   Thu Apr 30 22:09:46 2020 +0200

    fix align interface

Related to the JOSS revision: openjournals/joss-reviews#2077

Update documentation about HETATM

I found no place in the documentation that addresses the fact that HETATM are ignored. It would be valuable to have explicit clarification on the decisions to which information is parsed out from the original PDB.

  • update docstring
  • update documentation

Orient

Is your feature request related to a problem? Please describe.
It would be useful to be able to fully orient the structure along the Cartesian axes. So far, align only aligns the largest/smallest PCA eigenvector to one of the axes but leaves the other two axes unchanged.

Describe the solution you'd like
It would be great, for example, to be able to write something like orient(pdb, axis='z') and have all PCA vectors aligned along the x, y, z axes (with the largest vector aligned along the 'z' direction).

https://pymolwiki.org/index.php/Orient,
https://nmr.cit.nih.gov/xplor-nih/xplorMan/node159.html, https://salilab.org/modeller/9v4/manual/node180.html

when superposing without defining the chainID, the error 'NameError: name 'warnings' is not defined', instead of printing the warning itself

Describe the bug
When superposing a structure (e.g. TCR) onto that same structure but then part of a complex (e.g. TCRpMHC), without defining chainID, the error is not printed correctly. In your code (line 54), I see that the warning to be printed should be 'selection have different size, getting intersection'. Instead, what is printed is 'NameError: name 'warnings' is not defined'.

Environment:

  • OS system: macOS Monterey
  • Version: 12.2.1
  • Branch commit ID:
  • Inputs:

To Reproduce
Steps/commands to reproduce the behaviour:

from pdb2sql import pdb2sql
from pdb2sql import superpose
db1 = pdb2sql('2ckb_l_u.pdb')
db2 = pdb2sql('2ckb_b_t_aligned')
superposed_db1 = superpose(db1, db2)

Expected Results
The correct warning message to be printed

Actual Results or Error Info
File "/Applications/anaconda3/lib/python3.7/site-packages/pdb2sql/superpose.py", line 53, in superpose
warnings.warn(

NameError: name 'warnings' is not defined

2ckb_l_u.pdb.txt
2ckb_b_t_aligned.pdb.txt
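
The fix is presumably just adding the missing import at the top of superpose.py:

# superpose.py uses warnings.warn() without importing the module;
# adding this at the top of the file should restore the intended warning.
import warnings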

StructureSimilarity.compute_lrmsd_fast() and StructureSimilarity.compute_lrmsd_pdb2sql() provide different results

Describe the bug
In a case where the model and the reference file cannot be easily aligned, StructureSimilarity.compute_lrmsd_fast() and StructureSimilarity.compute_lrmsd_pdb2sql() provide different results (respectively, 0.837 and 3.38)

Environment:

  • OS system: Linux
  • Version: CentOS Linux 7 (Core)
  • Branch commit ID: 7a6d7cd
  • Inputs: python get_rmsd.py (in the attached folder)

To Reproduce
Steps/commands to reproduce the behaviour:

  1. open the attached folder named different_lrmsd
  2. run python get_rmsd.py

Expected Results
StructureSimilarity.compute_lrmsd_fast() and StructureSimilarity.compute_lrmsd_pdb2sql() should be providing the same output CA l-RMSD.

Actual Results or Error Info
The two methods provide very different CA l-RMSDs (respectively, 0.837 and 3.38)

Additional Context
A .pse PyMol session is included in the attached folder. It is possible to see how the _model is technically very similar to the ref file, but they are not properly aligned by PyMol.
StructureSimilarity.compute_lrmsd_pdb2sql() as well is probably failing in aligning them, but StructureSimilarity.compute_lrmsd_fast() seems to work properly.

different_lrmsd.zip

Proper error message when fetching CIF files

When fetching a PDB ID that only exists in mmCIF format, the error message raised is the same as when a non-existing code is given. In my experience, inaccurate error messages create frustrating experiences for a user, in this case one who actually knows the PDB ID exists. Considering pdb2sql is to be used mostly as a library, I suggest adding a more specific error message, for example: The PDB ID given is only represented in mmCIF format and pdb2sql does not handle mmCIF formats.

to reproduce the issue:

# https://www.rcsb.org/structure/6SL9
pdb2sql.fetch('6SL9')

Link to the JOSS review: openjournals/joss-reviews#2077
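
A sketch of how fetch could distinguish the two cases (illustrative, not the current implementation): after the .pdb download fails, probe for an mmCIF entry and tailor the message.

import urllib.request, urllib.error

# Illustrative sketch only: called after the .pdb download has failed.
def _explain_fetch_failure(pdbid):
    cif_url = f'https://files.rcsb.org/download/{pdbid}.cif'
    try:
        urllib.request.urlopen(cif_url)
    except urllib.error.HTTPError:
        raise ValueError(f'PDB ID {pdbid} does not exist')
    raise ValueError(f'PDB ID {pdbid} is only available in mmCIF format, '
                     'which pdb2sql does not handle')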

wrong assignment of CAPRI classes

In the static method compute_CapriClass() of the class StructureSimilarity, CAPRI classes are identified incorrectly.
Example: consider fnat = 0, L-rms = 11 Å and i-rms = 2.5 Å.
The code in compute_CapriClass() would classify such an example into the acceptable category, but Table II in https://doi.org/10.1002/prot.10393 clearly places it in the incorrect category. All conditions in compute_CapriClass() need to be fixed.
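
For reference, here is a sketch of a cascade that tests the incorrect class first, using the commonly cited CAPRI thresholds; double-check them against Table II before adopting.

# Sketch with the commonly cited CAPRI thresholds; verify against Table II.
def capri_class(fnat, lrms, irms):
    if fnat < 0.1 or (lrms > 10.0 and irms > 4.0):
        return 'incorrect'
    if fnat >= 0.5 and (lrms <= 1.0 or irms <= 1.0):
        return 'high'
    if fnat >= 0.3 and (lrms <= 5.0 or irms <= 2.0):
        return 'medium'
    return 'acceptable'

# The example above now lands in the right class:
assert capri_class(0.0, 11.0, 2.5) == 'incorrect'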

Unclear warning messages on PDB read.

When loading a PDB file (e.g. 1AK4_5w.pdb from the tests folder) there is a cryptic warning message about missing elements:

C:\Users\joaor\Miniconda3\envs\pdb2sql\lib\site-packages\pdb2sql\pdb2sqlcore.py:230: UserWarning: Missing element and guess it with atom type
  warnings.warn("Missing element and guess it with atom type")

I would improve the message itself. I have no idea which atoms are missing element columns or if the program guessed the elements (correctly or not).

Link to JOSS review: openjournals/joss-reviews#2077

returns class instance

I find it a bit anti-Pythonic that none of the methods of the pdb2sql core class returns an instance of the same class. The return value is always a list, and this hinders functionality like the following:
pdb_db.get('*', resSeq=['1']).pprint()

Please add backbone selection to compute_lrmsd_pdb2sql()

Is your feature request related to a problem? Please describe.
I would like to be able to calculate the l-RMSD on different atoms of the backbone: only CA, or [CA, C, O, N], or [CA, CB, C, O, N], etc.

Describe the solution you'd like
Allowing the user to give a list of atoms to be used for l-RMSD calculation as an optional parameter of compute_lrmsd_pdb2sql

Describe alternatives you've considered
This might be applicable to irmsd as well.

pdb line is longer than 80

Describe the bug
When using pdb2sql to calculate the iRMSD of ZDOCK models, pdb2sql reports an error: pdb line is longer than 80.

It is not clear to me why we need the function _format_pdb_linelength or why we have 80 as a cutoff. It seems that we could simply increase 80 to 150, for example.

Error message:

File "/Applications/anaconda3/lib/python3.7/site-packages/pdb2sql/pdb2sqlcore.py", line 136, in _create_table
line = pdb2sql._format_pdb_linelength(line)
File "/Applications/anaconda3/lib/python3.7/site-packages/pdb2sql/pdb2sqlcore.py", line 255, in _format_pdb_linelength
f'pdb line is longer than 80:\n{pdb_line}')
ValueError: pdb line is longer than 80:
ATOM 1 N GLN A 1 1.795 -7.034 21.703 19 1 1.63 -.15

To Reproduce

model = 'complex_6.pdb'
ref = 'cleaned_1oga_b_differently_numbered.pdb'
sim = StructureSimilarity(model,ref)
irmsd = sim.compute_irmsd_pdb2sql(method='svd')

complex_6.pdb.txt
cleaned_1oga_b_differently_numbered.pdb.txt

the returned values are independent of the requested atom order, which should not be the case

Hello,

Li and I found a bug. It seems that when you request specific coordinates (or other values) from a pdb, the order of the atom list is not upheld in the returned values. So if you switch the order (e.g., CA, C, O -> O, C, CA) the returned values are identical. Therefore the user can never know which values belong to which atoms. The order the program returns seems to be the order of the pdb itself. Therefore, if one compares two pdbs where the atom-name order differs between the files, the user will get wrong results and conclusions.

Some example code is below; the pdb is uploaded with this issue.

from pdb2sql import pdb2sql
db = pdb2sql('ref.pdb')
db.get('x', name=['C', 'O', 'CA'])
[-17.244, -16.714, -18.362]
db.get('x', name=['C', 'CA', 'O'])
[-17.244, -16.714, -18.362]

ref.zip
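
Until this is fixed, a possible workaround is to query one atom name at a time, so the results follow the requested order. This suits the single-residue example above; for larger selections the per-name lists would need to be handled accordingly.

# Workaround sketch: one query per atom name preserves the requested order.
from pdb2sql import pdb2sql

db = pdb2sql('ref.pdb')
x_ordered = [db.get('x', name=[n])[0] for n in ['C', 'CA', 'O']]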

compute_lrmsd_pdb2sql works only if there is no missing backbone atom

Describe the bug
If a backbone atom is missing in the Ligand part of one of the two PDBs, compute_lrmsd_pdb2sql does not report it and leads to an error

Environment:

  • OS system: Ubuntu
  • Branch commit ID: fix_lrmsd

To Reproduce

  1. Input these two PDBs:

BL00190001_decoy.txt
BL00190001_ref.txt

sim = StructureSimilarity(decoy_path, ref_path)
lrmsd = sim.compute_lrmsd_pdb2sql(exportpath=None, method='svd')

Expected Results
Calculate the LRMSD value even if one (or more) of the backbone atoms is missing, or print a proper error message reporting the mismatched backbone atom(s).

Actual Results or Error Info

624         # compute the RMSD
625         lrmsd = self.get_rmsd(xyz_decoy_short, xyz_ref_short)
626 
627         # export the pdb for verifiactions

             ..../pdb2sql/pdb2sql/StructureSimilarity.py in get_rmsd(P, Q)
1280         """
1281         n = len(P)
1282         return round(np.sqrt(1. / n * np.sum((P - Q)**2)), 3)

Additional Context
The compute_lrmsd_fast does not have this problem and prints the backbone LRMSD value

pdb2sql selects one alternative location for atoms

It would be better if pdb2sql had an option to select one alternative location for the atoms of a residue.

Description of problem
While checking the similarity of two PDBs (decoy & ref), when one PDB has alternative locations for atoms, the RMSD calculation fails because one structure has more atoms for a residue.

Suggested solution
pdb2sql could offer a feature letting the user choose which alternative location to keep (e.g. altloc="A") when creating a pdb2sql object (instead of relying only on the occupancy value in the PDB column).

Alternative solution
While checking the similarity of two pdb2sql objects, select one alternative location.

Additional context
If the choice is left to the user based on the occupancy value in the PDB column, in some PDBs the occupancy is split around 50%, which would not be very helpful.

Refactor StructureSimilarity

StructureSimilarity is a bit of a mess now. We have different ways of computing the metrics, and they don't all have the same features and internal checks:

  • Some routines check for residue and atom matching some don't
  • Some routines accept multiple chains some don't

We want a unified API for all metrics, such as:

def compute_x(self, chain_pairs=None, xzone=None, method='svd', check=True, name=['CA','C','N','O'] ):

compute_lrmsd_fast() does not report proper error message on pdb files with non-amino acids in ATOM section

When a pdb file has a non-amino-acid residue in the ATOM section instead of the HETATM section, compute_lrmsd_fast() reports the following error message:

Traceback (most recent call last):
  File "lrmsd.py", line 12, in <module>
    lrmsd_fast = sim.compute_lrmsd_fast()
  File "/home/lixue1/tools/pdb2sql/pdb2sql/StructureSimilarity.py", line 140, in compute_lrmsd_fast
    xyz_decoy_short, xyz_decoy_long, xyz_ref_long, method)
  File "/home/lixue1/tools/pdb2sql/pdb2sql/superpose.py", line 94, in superpose_selection
    rmat = get_rotation_matrix(sel_mob, sel_tar, method=method)
  File "/home/lixue1/tools/pdb2sql/pdb2sql/superpose.py", line 136, in get_rotation_matrix
    mat = get_rotation_matrix_Kabsh(p, q)
  File "/home/lixue1/tools/pdb2sql/pdb2sql/superpose.py", line 173, in get_rotation_matrix_Kabsh
    P.shape, Q.shape)
ValueError: ("Matrix don't have the same number of points", (1827, 3), (1824, 3))

The pdb file (1DFJ_refb-it1_33.pdb) contains these lines (HETATM should have been used instead of ATOM):

ATOM 5393 CA ACE 1A 15.175 -17.763 -20.935 1.00 10.00 B
ATOM 5394 C ACE 1A 15.311 -16.206 -21.022 1.00 10.00 B
ATOM 5395 O ACE 1A 15.783 -15.774 -22.071 1.00 10.00 B

Data and code are here: /projects/0/deeprank/BM5/issue44

Can we report a proper error message for such issues, for example: "xxx.pdb has unsupported amino acids XXX in ATOM records"?

compute_irmsd_fast() only works on two chains

>>> ref = "ref.pdb"
>>> model = "model.pdb"
>>> sim = StructureSimilarity(model,ref)
>>> irmsd_fast = sim.compute_irmsd_fast()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/lixue/tools/anaconda3/lib/python3.7/site-packages/pdb2sql/StructureSimilarity.py", line 277, in compute_irmsd_fast
    resData = self.compute_izone(cutoff, save_file=False)
  File "/home/lixue/tools/anaconda3/lib/python3.7/site-packages/pdb2sql/StructureSimilarity.py", line 328, in compute_izone
    'exactly two chains are needed for irmsd calculation but we found %d' % len(chains), chains)
ValueError: ('exactly two chains are needed for irmsd calculation but we found 5', ['A', 'B', 'C', 'D', 'E'])

Archive.zip

pdb2sql.transform.rot_mat does not allow to set center

Sometimes we need to set a rotation center other than the geometric center of the whole protein.

This could be fixed by changing this line of pdb2sql.transform.rot_mat():

xyz = rotate(xyz, mat)

to

xyz = rotate(xyz, mat, center)

as rotate() supports setting the center.

Allow pdb2sql to read string data

Currently, the pdb2sql documentation states that pdb2sql can be initiated from str, list or array.

|  __init__(self, pdbfile, **kwargs)
 |      Create a SQL database with PDB data.
 |      
 |      Args:
 |          pdbfile(str, list, ndarray): pdb file or data

However, pdb2sql cannot read from a str holding the PDB data itself.

_3cro = Path('3CRO.pdb') 
assert _3cro.exists() 
zz = pdb2sql(_3cro.read_text())

When I run this, the whole pdb data is printed to stdout with a very minor message at the end:

HETATM 1884  O   HOH R  76     -25.999 -27.796  41.843  1.00 56.38           O  
HETATM 1885  O   HOH R  77     -22.979 -24.940  42.923  1.00 25.75           O  
MASTER      374    0    0   10    0    0    0    6 1881    4    0   16          
END                                                                             
 not found

Implementing reading from PDB string data to instantiate pdb2sql would be a very valuable feature, because users could fetch the PDB from the database (I am not referring to the current pdb2sql.fetch function here) and direct the download output to pdb2sql without needing to save it to disk or parse it into a list or array beforehand.

Link to JOSS review: openjournals/joss-reviews#2077
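
In the meantime, a possible workaround, assuming the documented list input takes PDB lines: split the text into a list of lines before passing it in.

# Workaround sketch: hand pdb2sql a list of PDB lines instead of one string.
from pathlib import Path
from pdb2sql import pdb2sql

_3cro = Path('3CRO.pdb')
zz = pdb2sql(_3cro.read_text().splitlines())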

improve print

  • bind one of the general printing actions (**kwargs) in Python to the .pprint() method
  • also provide a general repr message instead of the memory address:

In [117]: pdb_db
Out[117]: <pdb2sql.pdb2sqlcore.pdb2sql at 0x7f3164223890>

In [118]: print(pdb_db)
<pdb2sql.pdb2sqlcore.pdb2sql object at 0x7f3164223890>
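
A sketch of what a more informative __repr__ could return; the attribute access is illustrative and should be adapted to the actual internals.

# Illustrative sketch of a friendlier __repr__ for the pdb2sql core class
# (to be added as a method; 'rowID' as a queryable column is an assumption).
def __repr__(self):
    natoms = len(self.get('rowID'))
    chains = self.get_chains()
    return f'<pdb2sql: {natoms} atoms, chains {chains}>'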

Getting different results for irmsd, if the chain identifiers are in different order

Hi,
I have different pdb files from docking tools and I want to calculate the irmsd. Some of the files have a different order for the chain identifiers, but the first chain is always the protein and the second the ligand. One file has A, B as chain identifiers and the other one B, A. So the order of the chains is the same, but not the identifiers. The values of chains_decoy and chains_ref are checked to see if the chains in the structures are different, but the function get_chains() returns the chain IDs in alphabetical order.

# get the chains
chains_decoy = sql_decoy.get_chains()
chains_ref = sql_ref.get_chains()

if chains_decoy != chains_ref:
    raise ValueError('Chains are different in decoy and reference structure')

If the chain identifiers were A, B and X, Y, it would raise an error. But because A, B is also not the same as B, A, I would normally expect it to raise an error here as well.
testdata.zip

Is there a way to integrate this into StructureSimilarity.py, or do I have to rename the chain identifiers before I calculate the irmsd?

StructureSimilarity "Matrix don't have the same number of points" between two matched files

Describe the bug
When trying to run compute_lrmsd_fast() on two files (model and target structure), the following error is raised:
ValueError: ("Matrix don't have the same number of points", (524, 3), (528, 3))

Environment:

  • OS system: Linux
  • Version: CentOS Linux 7 (Core)
  • Branch commit ID: 7a6d7cd
  • Inputs: python run.py (in the attached folder)

To Reproduce
Steps/commands to reproduce the behaviour:

  1. open the attached folder
  2. run python run.py

Expected Results
It should compute the l-RMSD

Actual Results or Error Info
it raises the following error:
ValueError: ("Matrix don't have the same number of points", (524, 3), (528, 3))

Additional Context
The model and ref files have been matched, so only the common residues are present.

Attached folder:
pdb2sql_issue_58.zip

Installation: unused dependencies in setup.py

Installing pdb2sql from pip downloads a number of third-party modules, but not all of them seem necessary. For example, cython and matplotlib are never used anywhere in the codebase and add quite a bit of space/time to the installation process.

Also, the dependencies listed on the requirements.txt file are not synced with those on setup.py. I know these serve different purposes but it would be better if they matched. If you want to have a 'development' requirements file, then I'd name it requirements-dev.txt and add extra dependencies there.

Linking to the review at JOSS: openjournals/joss-reviews#2077
