P2Rank: Protein-ligand binding site prediction tool based on machine learning. Stand-alone command line program / Java library for predicting ligand binding pockets from protein structure.

Home Page: https://rdk.github.io/p2rank/

License: MIT License

Topics: bioinformatics, machine-learning, protein-structure, pdb, binding-sites, proteins, ligand, protein-surface, structural-bioinformatics, protein-ligand-interactions

p2rank's Introduction

P2Rank

Ligand-binding site prediction based on machine learning.

P2Rank illustration

Version 2.4.1 · License: MIT

Description

P2Rank is a stand-alone command line program that predicts ligand-binding pockets from a protein structure. It achieves high prediction success rates without relying on external software for the computation of complex features or on a database of known protein-ligand templates.

Version 2.4 adds support for .cif input and contains a special profile for predictions on AlphaFold models and NMR/cryo-EM structures.

Requirements

  • Java 11 to 21
  • PyMOL 1.7 (or newer) for viewing visualizations (optional)

P2Rank is tested on Linux, macOS, and Windows. On Windows, it is recommended to use the bash console to execute the program instead of cmd or PowerShell.

Setup

P2Rank requires no installation. Binary packages are available as GitHub Releases.
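
A typical setup on Linux/macOS might look like this (the exact archive name is an assumption; pick the current one from the Releases page):

wget https://github.com/rdk/p2rank/releases/download/2.4.1/p2rank_2.4.1.tar.gz   # illustrative release URL
tar -xzf p2rank_2.4.1.tar.gz
cd p2rank_2.4.1
./prank predict -f test_data/1fbl.pdb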

Usage

prank predict -f test_data/1fbl.pdb         # predict pockets on a single pdb file 

See more usage examples below...

Algorithm

P2Rank makes predictions by scoring and clustering points on the protein's solvent-accessible surface. The ligandability score of individual points is determined by a machine-learning model trained on a dataset of known protein-ligand complexes. For more details, see the slides and publications.

Presentation slides introducing the original version of the algorithm: Slides (pdf)

Publications

If you use P2Rank, please cite relevant papers:

  • Software article about P2Rank pocket prediction tool
    Krivak R, Hoksza D. P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. Journal of Cheminformatics. 2018 Aug.
  • A new web-server article about updates in the web interface prankweb.cz
    Jakubec D, Skoda P, Krivak R, Novotny M, Hoksza D. PrankWeb 3: accelerated ligand-binding site predictions for experimental and modelled protein structures. Nucleic Acids Research, Volume 50, Issue W1, 5 July 2022, Pages W593–W597
  • Web-server article introducing the web interface at prankweb.cz
    Jendele L, Krivak R, Skoda P, Novotny M, Hoksza D. PrankWeb: a web server for ligand binding site prediction and visualization. Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W345-W349
  • Conference paper introducing P2Rank prediction algorithm
    Krivak R, Hoksza D. P2RANK: Knowledge-Based Ligand Binding Site Prediction Using Aggregated Local Features. International Conference on Algorithms for Computational Biology 2015 Aug 4 (pp. 41-52). Springer
  • Research article about PRANK rescoring algorithm
    Krivak R, Hoksza D. Improving protein-ligand binding site prediction accuracy by classification of inner pocket points using local features. Journal of Cheminformatics. 2015 Dec.

Usage Examples

The following commands can be executed in the installation directory.

Print help

prank help

Predict ligand binding sites (P2Rank algorithm)

prank predict test.ds                    # run on dataset containing a list of pdb/cif files

prank predict -f test_data/1fbl.pdb      # run on a single pdb file
prank predict -f test_data/1fbl.cif      # run on a single cif file
prank predict -f test_data/1fbl.pdb.gz   # run on a single gzipped pdb file

prank predict -threads 8     test.ds     # specify num. of working threads for parallel dataset processing
prank predict -o output_here test.ds     # explicitly specify output directory

prank predict -c alphafold   test.ds     # use alphafold config and model (config/alphafold.groovy)  
                                         # this profile is recommended for AlphaFold models, NMR and cryo-EM 
                                         # structures since it doesn't depend on b-factor as a feature         
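
To run the same profile on a whole directory of structures, one option is to generate a dataset file first. This is only a sketch: the directory name is hypothetical, and it assumes that entries in a .ds file are plain structure file names resolved relative to the dataset file itself.

( cd my_structures && ls *.pdb > alphafold_models.ds )      # a .ds file is just a list of structure files
prank predict -c alphafold -threads 8 -o out_af my_structures/alphafold_models.ds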

Prediction output

For each structure file <struct_file> in the dataset, P2Rank produces several output files (a quick-look example follows the list below):

  • <struct_file>_predictions.csv: contains an ordered list of predicted pockets, their scores, the coordinates of their centers, a list of adjacent residues, a list of adjacent protein surface atoms, and a calibrated probability of being a ligand-binding site
  • <struct_file>_residues.csv: contains a list of all residues from the input protein with their scores, their mapping to predicted pockets, and a calibrated probability of being a ligand-binding residue
  • visualizations/<struct_file>.pml: PyMOL visualization (.pml script with data files in data/)
    • generating visualizations can be turned off with the -visualizations 0 parameter
    • coordinates of the SAS points can be found in visualizations/data/<struct_file>_points.pdb.gz. There, the residue sequence number (columns 23-26) of each HETATM record corresponds to the rank of the pocket the point belongs to (points with value 0 do not belong to any pocket).
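
A quick way to inspect the results for a single structure (a sketch; the output directory name assumes the default test_output/<command>_<dataset> convention, and the exact column set may differ between versions, so check the header of your own files):

column -s, -t < test_output/predict_1fbl/1fbl.pdb_predictions.csv | head -5    # top-ranked pockets
head -3 test_output/predict_1fbl/1fbl.pdb_residues.csv                         # per-residue scores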

Configuration

You can override the default params with a custom config file:

prank predict -c config/example.groovy  test.ds
prank predict -c example                test.ds # same effect: config/ is the default location and .groovy the implicit extension

It is also possible to override the default params on the command line using their full name.

prank predict                   -visualizations 0 -threads 8  test.ds   #  turn off visualizations and set the number of threads
prank predict -c example.groovy -visualizations 0 -threads 8  test.ds   #  overrides defaults as well as values from example.groovy

P2Rank has numerous configurable parameters. For the list of standard params, look into config/default.groovy and the other example config files in that directory. For the complete, commented list of all params (including undocumented ones), see Params.groovy in the source code.
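
A quick way to discover what can be overridden (a sketch; it assumes parameters are declared as simple name = value assignments in config/default.groovy):

grep -nE '^[[:space:]]*[a-zA-Z_][a-zA-Z0-9_]* *=' config/default.groovy | head -40    # list parameter assignments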

Evaluate prediction model

...on a file or a dataset with known ligands.

prank eval-predict -f test_data/1fbl.pdb
prank eval-predict test.ds

Rescoring (PRANK algorithm)

In addition to predicting new ligand binding sites, P2Rank is also able to rescore pockets predicted by other methods (Fpocket, ConCavity, SiteHound, MetaPocket2, LISE and DeepSite are supported at the moment).

prank rescore test_data/fpocket.ds
prank rescore fpocket.ds                 # test_data/ is default 'dataset_base_dir'
prank rescore fpocket.ds -o output_dir   # test_output/ is default 'output_base_dir'       
prank eval-rescore fpocket.ds            # evaluate rescoring model

Build from sources

This project uses the Gradle build system via the included Gradle wrapper. On Windows, use bash to execute the build commands (bash is installed as part of Git for Windows).

git clone https://github.com/rdk/p2rank.git && cd p2rank
./make.sh       

./unit-tests.sh    # optionally you can run tests to check everything works fine on your machine        
./tests.sh quick   # runs further tests

Now you can run the program via:

distro/prank       # standard mode that is run in production
./prank.sh         # development/training mode 

To use ./prank.sh (development/training mode), you first need to copy misc/local-env.sh into the repo root directory and edit it (see https://github.com/rdk/p2rank/blob/develop/misc/tutorials/training-tutorial.md#preparing-the-environment).
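
That step might look roughly like this (a sketch; the target file name and the exact variables to edit are described in the linked training tutorial):

cp misc/local-env.sh .                        # copy the template into the repo root
vi local-env.sh                               # set dataset/output paths per the tutorial
./prank.sh predict -f test_data/1fbl.pdb      # verify the development-mode launcher works (adjust the test structure path if needed)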

Comparison with Fpocket

Fpocket is a widely used open source ligand binding site prediction program. It is fast, easy to use and well documented. As such, it was a great inspiration for this project. Fpocket is written in C, and it is based on a different geometric algorithm.

Some practical differences:

  • Fpocket
    • has a much smaller memory footprint
    • runs faster when executed on a single protein
    • produces a higher number of less relevant pockets (and since the default scoring function isn't very effective, the most relevant pockets often don't end up at the top)
    • contains MDpocket algorithm for pocket predictions from molecular trajectories
    • still better documented
  • P2Rank
    • achieves significantly higher identification success rates when considering top-ranked pockets
    • produces a smaller number of more relevant pockets
    • speed:
      • slower when running on a single protein (due to JVM startup cost)
      • approximately as fast on average running on a big dataset on a single core
      • due to parallel implementation potentially much faster on multi-core machines
    • higher memory footprint (~1 GB, but it doesn't grow much with additional parallel threads)

Both Fpocket and P2Rank have many configurable parameters that influence the behaviour of the algorithm and can be tweaked to achieve better results for particular requirements.

Thanks

This program builds upon software written by other people, either through library dependencies or through code included in its source tree (where no library builds were available).

Contributing

We welcome any bug reports, enhancement requests, and other contributions. To submit a bug report or enhancement request, please use the GitHub issues tracker. For more substantial contributions, please fork this repo, push your changes to your fork, and submit a pull request with a good commit message.

p2rank's People

Contributors

egonw, jendelel, lilleswing, rdk, skodapetr, theosotr

p2rank's Issues

ResidueNumberWrapper ignores chain?

I've been struggling to get the same results with an external CSV feature as with conservation.

It seems that they handle chains in a different way. For example, in the joined(mlig) dataset for 1gm8
there are two conservation files, one for each chain:

A: 469 0.34542 K ...
B 655 0.74410 W ...

Input for my CSV feature is:

"A",,"179","0.34542"
"B",,"179","0.7441"

However, the values assigned by my feature are not the same as those from the conservation feature.

The loaded conservation score holds information for only one residue with number 179, and this value is from chain B, which is loaded later.

The conservation feature uses a Map keyed by ResidueNumberWrapper (unfortunately I have no idea why this class is there). Judging from the implementation of equals and hashCode:

    @Override
    public boolean equals(Object o) {
        if (this.is(o)) return true;
        if (o.is(null) || getClass() != o.getClass()) return false;
        ResidueNumberWrapper that = (ResidueNumberWrapper) o;
        return resNum != null ? equalsPositional(resNum,that.resNum) : that.resNum == null;
    }

    @Override
    public int hashCode() {
        if (resNum == null) return 0;
        final int prime = 31;
        int result = 1;
        result = prime * result + ((resNum.getInsCode() == null) ? 0 : resNum.getInsCode().hashCode());
        result = prime * result + ((resNum.getSeqNum() == null) ? 0 : resNum.getSeqNum().hashCode());
        return result;
    }

    public static boolean equalsPositional(ResidueNumber r1, ResidueNumber r2) {
        if (r1 == r2)
            return true;
        if (r2 == null)
            return false;
        if (r1.getInsCode() == null) {
            if (r2.getInsCode() != null)
                return false;
        } else if (!r1.getInsCode().equals(r2.getInsCode()))
            return false;
        if (r1.getSeqNum() == null) {
            if (r2.getSeqNum() != null)
                return false;
        } else if (!r1.getSeqNum().equals(r2.getSeqNum()))
            return false;

        return true;
    }

It seems that the chain is not considered, meaning that this class cannot distinguish residues with the same number and insertion code that come from different chains.

If you are interested, I can post the command and a branch with the CSV feature that makes this bug visible.

Out of memory error

I was trying to train a new model on a training dataset (holo4k_train.ds, containing 3634 proteins) and a validation dataset (holo4k_test.ds, containing 409 proteins), with memory settings $JAVA_OPTS -Xmx16G, using the following command:
./prank traineval -t p2rank-datasets/holo4k_train.ds -e p2rank-datasets/holo4k_test.ds -threads 4 -rf_trees 200 -delete_models 0 -loop 1 -seed 42
I got an out-of-memory error (see run.log attached). Any idea how many GB I need?
out_of_error.zip
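
Not an authoritative answer, but two things that often help with JVM out-of-memory errors are raising the heap limit via JAVA_OPTS (which the launcher appears to honour, as in the report above) and lowering the thread count for the run, e.g.:

export JAVA_OPTS="-Xmx32G"      # needs enough physical RAM; 16G was not sufficient here
./prank traineval -t p2rank-datasets/holo4k_train.ds -e p2rank-datasets/holo4k_test.ds \
    -threads 2 -rf_trees 200 -delete_models 0 -loop 1 -seed 42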

Add timestamps to stdout

It would be useful to add timestamps to stdout: for example, when running on a remote server (using screen), it is not clear how many entries have been processed and how fast the run is progressing.

hyperparameter optimization tutorial

In the hyperparameter optimization tutorial (hyperparameter-optimization-tutorial.md) there is only an example. It would be nice to have a comprehensive list of all the parameters a user can optimize.

After reading the tutorial, I do not know what I can put after -paramA, -paramB.
It is also not clear that paramA and paramB are names of parameters and not cmd-line keywords.

In the "Real example" section it is not clear why the values in the list expression are in extra braces.
Why ((protrusion.bfactor),(protrusion.bfactor.new_feature)) instead of (protrusion.bfactor,protrusion.bfactor.new_feature)? There is no explanation given.

Also, there is a whole paragraph in braces. Is it optional? Is it important? If it is extra information not needed right now, it should not be there; if it is useful information, it should be without braces.

Also, in the last section (Run optimization experiment) you use -<param1> to denote a parameter, which is inconsistent with the -paramA notation used before.

Thread count limitation

Using -crossval_threads 1 -rf_threads 4, I would expect p2rank to use at most 4 threads, but that is not the case: on my 8-core machine the CPU utilization by p2rank sometimes reaches 100%, meaning that p2rank uses all 8 cores, i.e. 8 threads.

Is there a simple way to limit the maximum number of threads to a given number? This is useful for running p2rank in the background, on a workstation, or on a server that is not entirely dedicated to p2rank.
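
As an OS-level workaround on Linux (not a p2rank feature), the whole JVM can be pinned to a fixed set of CPUs with taskset, which caps utilization regardless of the internal thread pools; the dataset names below are placeholders:

taskset -c 0-3 ./prank traineval -t train.ds -e test.ds -crossval_threads 1 -rf_threads 4    # restrict to 4 CPUs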

Update biojava dependencies

Hi,

please update your biojava dependencies to 4.2.8. The URL paths to RCSB have changed, and parsing PDB files no longer works in the older versions (see this bug). I would send you a pull request; however, I don't have the source code for your forked biojava-structure-rdk library, which needs to be updated as well.

Thanks,

Lukas

IndexOutOfBoundsException when loading PDB file

Running traineval, I got the following exception in the log file:

[INFO] Protein - loading protein [/data/p2rank/ions/data/2020-01/mg/pdb/5m1k.pdb]
[INFO] PDBUtils - loading file [data/2020-01/mg/pdb/5m1k.pdb]
[ERROR] ConsoleWriter - Index 0 out of bounds for length 0
java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
        at jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64) ~[?:?]
        at jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70) ~[?:?]
        at jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248) ~[?:?]
        at java.util.Objects.checkIndex(Objects.java:372) ~[?:?]
        at java.util.ArrayList.get(ArrayList.java:458) ~[?:?]
        at org.biojava.nbio.structure.io.PDBFileParser.sourceValueSetter(PDBFileParser.java:1237) ~[biojava-structure-4.2.12.jar:4.2.12]
        at org.biojava.nbio.structure.io.PDBFileParser.pdb_SOURCE_Handler(PDBFileParser.java:1204) ~[biojava-structure-4.2.12.jar:4.2.12]
        at org.biojava.nbio.structure.io.PDBFileParser.makeCompounds(PDBFileParser.java:2818) ~[biojava-structure-4.2.12.jar:4.2.12]
        at org.biojava.nbio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2764) ~[biojava-structure-4.2.12.jar:4.2.12]
        at org.biojava.nbio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2635) ~[biojava-structure-4.2.12.jar:4.2.12]
        at cz.siret.prank.utils.PDBUtils.loadFromFile(PDBUtils.groovy:51) ~[p2rank.jar:?]
        at cz.siret.prank.utils.PDBUtils$loadFromFile$10.call(Unknown Source) ~[?:?]
        at cz.siret.prank.domain.Protein.loadFile(Protein.groovy:157) ~[p2rank.jar:?]
        at jdk.internal.reflect.GeneratedMethodAccessor213.invoke(Unknown Source) ~[?:?]
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
        at org.codehaus.groovy.runtime.callsite.PlainObjectMetaMethodSite.doInvoke(PlainObjectMetaMethodSite.java:43) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite$PogoCachedMethodSiteNoUnwrapNoCoerce.invoke(PogoMetaMethodSite.java:190) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite.call(PogoMetaMethodSite.java:70) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:135) ~[groovy-2.5.6.jar:2.5.6]
        at cz.siret.prank.domain.Protein.load(Protein.groovy:146) ~[p2rank.jar:?]
        at cz.siret.prank.domain.Protein$load$5.call(Unknown Source) ~[?:?]
        at cz.siret.prank.domain.loaders.PredictionLoader.loadPredictionPair(PredictionLoader.groovy:34) ~[p2rank.jar:?]
        at cz.siret.prank.domain.loaders.PredictionLoader$loadPredictionPair$26.call(Unknown Source) ~[?:?]
        at cz.siret.prank.domain.Dataset$Item.loadPredictionPair(Dataset.groovy:96) ~[p2rank.jar:?]
        at cz.siret.prank.domain.Dataset$Item.getPredictionPair(Dataset.groovy:86) ~[p2rank.jar:?]
        at jdk.internal.reflect.GeneratedMethodAccessor82.invoke(Unknown Source) ~[?:?]
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:101) ~[groovy-2.5.6.jar:2.5.6]
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.metaclass.MethodMetaProperty$GetBeanMethodMetaProperty.getProperty(MethodMetaProperty.java:76) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.GetEffectivePogoPropertySite.callGetProperty(GetEffectivePogoPropertySite.java:48) ~[groovy-2.5.6.jar:2.5.6]
        at cz.siret.prank.program.routines.CollectVectorsRoutine$_collectVectors_closure1.doCall(CollectVectorsRoutine.groovy:91) ~[p2rank.jar:?]
        at jdk.internal.reflect.GeneratedMethodAccessor452.invoke(Unknown Source) ~[?:?]
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:101) ~[groovy-2.5.6.jar:2.5.6]
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:263) ~[groovy-2.5.6.jar:2.5.6]
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1041) ~[groovy-2.5.6.jar:2.5.6]
 at groovy.lang.Closure.call(Closure.java:405) ~[groovy-2.5.6.jar:2.5.6]
        at groovy.lang.Closure.call(Closure.java:421) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.collect(DefaultGroovyMethods.java:3540) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.collect(DefaultGroovyMethods.java:3525) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.collect(DefaultGroovyMethods.java:3625) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.dgm$87.invoke(Unknown Source) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite$PojoMetaMethodSiteNoUnwrapNoCoerce.invoke(PojoMetaMethodSite.java:244) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite.call(PojoMetaMethodSite.java:53) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:115) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:127) ~[groovy-2.5.6.jar:2.5.6]
        at cz.siret.prank.program.routines.CollectVectorsRoutine.collectVectors(CollectVectorsRoutine.groovy:91) ~[p2rank.jar:?]
        at cz.siret.prank.program.routines.TrainEvalRoutine.doCollectVectors(TrainEvalRoutine.groovy:87) ~[p2rank.jar:?]
        at cz.siret.prank.program.routines.TrainEvalRoutine.collectTrainVectors(TrainEvalRoutine.groovy:68) ~[p2rank.jar:?]
        at cz.siret.prank.program.routines.TrainEvalRoutine$collectTrainVectors.call(Unknown Source) ~[?:?]
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:115) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:119) ~[groovy-2.5.6.jar:2.5.6]
        at cz.siret.prank.program.routines.Experiments.doTrainEval(Experiments.groovy:108) ~[p2rank.jar:?]
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:101) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.StaticMetaMethodSite$StaticMetaMethodSiteNoUnwrapNoCoerce.invoke(StaticMetaMethodSite.java:149) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.StaticMetaMethodSite.callStatic(StaticMetaMethodSite.java:100) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallStatic(CallSiteArray.java:55) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:196) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:224) ~[groovy-2.5.6.jar:2.5.6]
        at cz.siret.prank.program.routines.Experiments.traineval(Experiments.groovy:129) ~[p2rank.jar:?]
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:101) ~[groovy-2.5.6.jar:2.5.6]
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323) ~[groovy-2.5.6.jar:2.5.6]
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1217) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnCurrentN(ScriptBytecodeAdapter.java:94) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnCurrent0(ScriptBytecodeAdapter.java:125) ~[groovy-2.5.6.jar:2.5.6]
        at cz.siret.prank.program.routines.Experiments.execute(Experiments.groovy:92) ~[p2rank.jar:?]
        at cz.siret.prank.program.routines.Experiments$execute$0.call(Unknown Source) ~[?:?]
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:115) ~[groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:119) ~[groovy-2.5.6.jar:2.5.6]
        at cz.siret.prank.program.Main.runExperiment(Main.groovy:261) ~[p2rank.jar:?]
        at cz.siret.prank.program.Main.run(Main.groovy:324) ~[p2rank.jar:?]
        at cz.siret.prank.program.Main$run.call(Unknown Source) ~[?:?]
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47) [groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:115) [groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:119) [groovy-2.5.6.jar:2.5.6]
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:119) [groovy-2.5.6.jar:2.5.6]
        at cz.siret.prank.program.Main.main(Main.groovy:377) [p2rank.jar:?]
[INFO] ConsoleWriter - For details see log file: mg/run.log
[INFO] ConsoleWriter -
[INFO] ConsoleWriter - ----------------------------------------------------------------------------------------------
[INFO] ConsoleWriter -  finished with ERROR in 2 minutes, 57.339 seconds
[INFO] ConsoleWriter - ----------------------------------------------------------------------------------------------

The record in the dataset .ds file for the relevant protein is:

pdb/5m1k.pdb    MG

From the exception, it is not clear what is wrong and how it can be fixed.

Tested using the latest release, 2.0.1.

dataset-joined pdb_residues file doesn't match the fasta sequence

I ran these commands, where joined.ds is from https://github.com/rdk/p2rank-datasets:

./prank.sh analyze residues joined.ds
./prank analyze fasta-masked joined.ds

But several of the residue files don't match the fasta sequences.
All the files are here:
files.zip

In these files, the sequence lengths of chains I and L are OK, but the sequence of chain H should be longer according to the csv file.

1hxf.pdb_residues.csv

1hxf_H.fasta
1hxf_I.fasta
1hxf_L.fasta

In these files, the length of chain A is 66 and the length of chain B is 65, but there are 232 rows in 1pts.pbd_residues.csv, and I'm not getting any other files.

1pts.pbd_residues

1pts_A.fasta
1pts_B.fasta

I always get one fasta file for each residue csv file, and the sequence is shorter than the number of rows in the csv.

1bbs.pdb_residues.csv
1bb_A.fasta

1chg.pdb_residues.csv
1chg_A.fasta

1djb.pdb_residues.csv
1djb_A.fasta

2cba.pdb_residues.csv
2cba_A.fasta

2fbp.pdb_residues.csv
2fbp_A.fasta

2tga.pdb_residues.csv
2tga_A.fasta

3lck.pdb_residues.csv
3lck_A.fasta

3p2p.pdb_residues.csv
3p2p_A.fasta

3ptn.pdb_residues.csv
3ptn_A.fasta

4ca2.pdb_residues.csv
4ca2_A.fasta

5dfr.pdb_residues.csv
5dfr_A.fasta

failFast for conservation

I used the command:

.\p2rank-develop\prank.bat traineval -t .\datasets\chen11.ds -e .\datasets\joined(mlig).ds -threads 4 -rf_trees 128 delete_models 0 -loop 1 -seed 42 -delete_vectors 0 -feature_importances 1 -failFast 1 -l conserv_cloud -extra_features 'conserv_cloud' -load_conservation 1 -load_conservation_paths 1  -conservation_dir .\..\conservation

In the log file I see:

[ERROR] ConservationScore - Score file doesn't exist [.\datasets\.\..\conservation\a.029.003.001_1rx0aA.hssp.hom.gz]

With the -failFast 1 option I would expect p2rank to fail in case of an error; as it is now, everything seems OK, but in fact no conservation is loaded because the path is wrong.

Tested with f4c1e63.

Possibility to retrieve protein sequence directly

Hi,

I'm trying to use P2Rank inside my pipeline to retrieve the protein sequence of the putative binding site.
I'd like to know if there is a possibility to get the protein sequence in addition to the list of residue IDs. I'm trying to use BioPython, but the sequence I get from it is shorter than the one in the PDB file. Right now I'm retrieving the sequence from the PDB file using some code I wrote, but it should really be an output of the program, I guess. Thanks.

Lorenzo

Spearmint and Python 3

Given the end of Python 2 support, have you tested Spearmint with Python 3? Spearmint's repository
(last updated in 2016) mentions only Python 2.7 in its readme.md.

What are best practices for generating docking grid boxes from P2Rank points?

Hello, I'm using P2Rank to aid AutoDock Vina virtual screening, and I was wondering how exactly a grid box should be derived from the points generated by P2Rank. Specifically, is containing all the points designated as the "surface" of the pocket enough to catch all high-affinity poses? Or should a margin of space be added around that, to accommodate poses which might be only partially contained in the pocket? If so, how big should the margin be?

(Btw, thank you for making this software, it's really quite wonderful.)
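
Not an official recommendation, but one rough way to derive a Vina box: take the SAS points assigned to the chosen pocket from the visualization data (see the Prediction output section above; the pocket rank is stored in the residue sequence number field), compute their bounding box and pad it with a margin. The file name and the 4 A margin below are assumptions to adapt to your own run.

zcat visualizations/data/1fbl.pdb_points.pdb.gz | awk '
  substr($0,1,6)=="HETATM" && substr($0,23,4)+0==1 {      # SAS points belonging to pocket rank 1
      x=substr($0,31,8)+0; y=substr($0,39,8)+0; z=substr($0,47,8)+0
      if (!n++) { minx=maxx=x; miny=maxy=y; minz=maxz=z }
      if (x<minx) minx=x; if (x>maxx) maxx=x
      if (y<miny) miny=y; if (y>maxy) maxy=y
      if (z<minz) minz=z; if (z>maxz) maxz=z
  }
  END {
      m=4.0                                               # padding margin in Angstrom, tune for your ligands
      printf "center: %.2f %.2f %.2f\n", (minx+maxx)/2, (miny+maxy)/2, (minz+maxz)/2
      printf "size:   %.2f %.2f %.2f\n", maxx-minx+2*m, maxy-miny+2*m, maxz-minz+2*m
  }'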

[SUGGESTION] Release Binary Should Log To STDERR instead of to install directory

In https://github.com/rdk/p2rank/blob/develop/distro/prank the program logs to a logfile that is, by default, inside the installation directory. I believe this is the run script distributed in the release tarball.

When installing this on HPC systems, the installation directory is often not writable by the users running the software, which leads to errors. And if it is writable by users, there can be thrashing and race conditions in the logfile when multiple jobs write to it at once.

It might be better to distribute the version that logs to stdout instead of a logfile in the install directory.

Case-study to extra file

Why is there a "Case study: Implementing and evaluating a new feature" section in training-tutorial.md? The file is called a training tutorial, not a case study. It just makes the tutorial longer and harder to skim.

Why not move it to a separate tutorial?

Log output change

Is it possible to redirect the log to stdout or stderr? By default, the logs are stored in the {output}/run.log file.
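
A possible workaround, based on a command quoted further down this page that passes a log_to_console switch (verify the exact parameter name against config/default.groovy):

prank predict -f test_data/1fbl.pdb -log_to_console 1    # mirror the log to the console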

File path in *.ds file

Hi,
It seems that file names in a *.ds file do not support absolute paths like /home/user/dataset/*.pdb.
I have a .ds file:

/home/jsun/gpse_v2/test/structure_dataset/A0A1Q8LK65_AF.pdb
/home/jsun/gpse_v2/test/structure_dataset/W7ISS8_AF.pdb
/home/jsun/gpse_v2/test/structure_dataset/A5EDZ7_AF.pdb
/home/jsun/gpse_v2/test/structure_dataset/A0A3G7HEQ2_AF.pdb
/home/jsun/gpse_v2/test/structure_dataset/A0A6J4VWF3_AF.pdb
/home/jsun/gpse_v2/test/structure_dataset/A0A1R4K1N1_AF.pdb
/home/jsun/gpse_v2/test/structure_dataset/A0A7W8E001_AF.pdb

and in the error log:

[INFO] Console - ----------------------------------------------------------------------------------------------
[INFO] Console -  P2Rank 2.4
[INFO] Console - ----------------------------------------------------------------------------------------------
[INFO] Console - 
[INFO] Main - loading default config from [/home/jsun/gpse_v2/p2rank_2.4/config/default.groovy]
[INFO] Main - looking for dataset in dataset_base_dir [/home/jsun/gpse_v2/GPSE/filtered.ds]...
[INFO] Dataset - loading dataset [/home/jsun/gpse_v2/GPSE/filtered.ds]
[ERROR] Dataset - protein file doesn't exist: .//home/jsun/gpse_v2/test/structure_dataset/A0A7W8E001_AF.pdb
[ERROR] Dataset - protein file doesn't exist: .//home/jsun/gpse_v2/test/structure_dataset/W7ISS8_AF.pdb
[ERROR] Dataset - protein file doesn't exist: .//home/jsun/gpse_v2/test/structure_dataset/A0A1Q8LK65_AF.pdb
[ERROR] Dataset - protein file doesn't exist: .//home/jsun/gpse_v2/test/structure_dataset/A5EDZ7_AF.pdb
[ERROR] Dataset - protein file doesn't exist: .//home/jsun/gpse_v2/test/structure_dataset/A0A1M4DY30_AF.pdb
[ERROR] Dataset - protein file doesn't exist: .//home/jsun/gpse_v2/test/structure_dataset/A0A447WAH0_AF.pdb
[ERROR] Dataset - protein file doesn't exist: .//home/jsun/gpse_v2/test/structure_dataset/A0A1R4K1N1_AF.pdb

How can I modify the settings to avoid this? Thanks a lot! 🙏
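
A possible workaround until absolute paths are supported (a sketch; it assumes entries in a .ds file are resolved relative to the dataset file itself, which the ./ prefix in the errors above suggests):

cd /home/jsun/gpse_v2/test/structure_dataset
ls *.pdb > filtered_local.ds                              # dataset with paths relative to the .ds location
/home/jsun/gpse_v2/p2rank_2.4/prank predict filtered_local.ds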

Improve behaviour - wrong parameters

Hi,
when I run a command with a wrong parameter, for example

prank predict -f pdb1hdb.ent -wrong-parameter foo bar

it is executed anyway.
I think it would be more convenient if it weren't executed, or if the user was at least warned about the wrong parameter(s).

Error running code

Hi!

I'm getting the following error when running prank:

ERROR: No signature of method: cz.siret.prank.program.routines.Experiments.single.ds() is applicable for argument types: () values: []
groovy.lang.MissingMethodException: No signature of method: cz.siret.prank.program.routines.Experiments.single.ds() is applicable for argument types: () values: []

What are the pre-requisites for the precompiled version?

Thanks

Note should not be in the text

In hyperparameter-optimization-tutorial.md some paragraphs contain a "Note: ..." that spans half of the paragraph; that does not read like a note.

Fpocket in tutorial

Is it necessary to mention the historical reasons and Fpocket in training-tutorial.md? I thought p2rank provides better results; as a user, this information is not necessary for me and makes the tutorial harder to read.

Extend and reorganize documentation

  • Move tutorials from misc to documentation dir.
  • Consider what to distribute in distro.
  • Add links to wiki.
  • Add detailed documentation of output format.

What is a huge dataset?

In training-tutorial.md there is a note: "Turn off for huge datasets that won't fit into memory." It would really help to give an estimate of how much memory p2rank uses, as one may consider hundreds of proteins a small dataset that should fit into main memory. For example, if I have 2500 proteins and 32 GB of RAM, I would expect that to be fine, but is it?
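
One way to get an empirical number for your own data (a sketch for Linux with GNU time; the dataset names are placeholders): measure the peak resident memory of a run on a subset and extrapolate:

head -100 full_train.ds > sample_train.ds
/usr/bin/time -v ./prank.sh traineval -t sample_train.ds -e sample_train.ds -threads 4 2>&1 | grep -i "maximum resident"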

Bug in -conservation_dirs argument parsing

Hi,

I think that the "new" option -conservation_dirs parses its arguments incorrecly. Say my dataset definition file coach420.ds is in ./datasets/, specifying -conservation_dirs coach420/conservation/e5i1/scores/ results in [INFO] DatasetItemLoader - Conservation lookup dirs: [datasets/oach420/conservation/e5i1/scores]... in other words, the first character of the specified path is lost and the conservations are not loaded. One has to do a workaround, e.g., use the option with an extra character (X here), -conservation_dirs Xcoach420/conservation/e5i1/scores/ for the desired path to be used correctly.

This is especially unfortunate in combination with #36, as the user is not informed of the error in any way (i.e., that no conservation files are actually being loaded) unless they specifically check the logs to verify this.

David

Commented code

There is commented-out code in the source.
It is not clear why it is there or when it can be safely deleted; a better practice is not to keep it.
Deleted code can always be recovered from the Git history.

Prepare tests for backward compatibility

We should have tests that ensure backward compatibility of p2rank.
This is important for further development, as contributors may not know why certain things are in the code.

For example:

    static String correctResidueCode(String residueCode) {
        //MSE is only found as a molecular replacement for MET
        //'non-standard', genetically encoded
        if (residueCode=="MSE")
            residueCode = "MET"
        else if (residueCode=="MEN")  // N-METHYL ASPARAGINE
            residueCode = "ASP"

        return residueCode
    }

Is this specific to a certain dataset/protein, etc.?

In the ideal scenario, we would also have a test for every kind of non-standard/invalid data that p2rank can consume and internally sanitize.

PyMOL output file names are not quoted

When carrying out a prediction of a multichain protein via PrankWeb, the resulting PyMOL script does not work because the filename contains a comma and is not quoted.

load data/pdbid_6lu7_A,C.ent.gz, protein

needs to be changed to

load "data/pdbid_6lu7_A,C.ent.gz", protein.

Using a comma in filenames should be avoided as well.

Invalid format and content of runs_pred.csv

After running .\prank.bat eval-predict .\..\coach420-mlig.ds, a file called runs_pred.csv is produced.
The file has the following content:

dir,label,proteins,ligands,pockets,DCA_4_0,DCA_4_2,P,R,F1,MCC,ligSize,pocketVol,pocketSurf
eval_predict_coach420,FastRandomForest (...),420,511,2538,71,6,77,1,     NaN,     NaN,     NaN,   0,000,  26,898,   0,000,  25,418
eval_predict_coach420-mlig,FastRandomForest (...,300,378,1772,71,2,75,7,     NaN,     NaN,     NaN,   0,000,  28,839,   0,000,  25,492

It seems that a comma is used as a CSV separator but also as a decimal separator, making the CSV hard to use and read.
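
A guess at the cause (unverified): the values are probably formatted with the JVM default locale, which uses a comma as the decimal separator on some systems. If the launcher passes JAVA_OPTS through to the JVM, forcing an English locale might work around it:

JAVA_OPTS="-Duser.language=en -Duser.country=US" ./prank eval-predict coach420-mlig.ds    # untested workaround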

Building on WSL failing

I tried to build p2rank on WSL, but I am getting the following error (seems like it's not related to WSL at all):

hellb@DESKTOP-21PNO6M:/mnt/c/projects/git/p2rank$ sudo ./make.sh

> Task :compileGroovy
/mnt/c/projects/git/p2rank/src/main/groovy/cz/siret/prank/prediction/pockets/rescorers/InstancePredictor.java:36: error: cannot find symbol
                if (((FasterForest)classifier).isVersion2()) {
                                              ^
  symbol:   method isVersion2()
  location: class FasterForest
/mnt/c/projects/git/p2rank/src/main/groovy/cz/siret/prank/prediction/pockets/rescorers/InstancePredictor.java:62: error: cannot find symbol
                    return ff.distributionForAttributes(vect.getArray(), 2);
                             ^
  symbol:   method distributionForAttributes(double[],int)
  location: variable ff of type FasterForest2
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
2 errors
startup failed:
Compilation failed; see the compiler error output for details.

1 error


> Task :compileGroovy FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':compileGroovy'.
> Compilation failed; see the compiler error output for details.

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

Deprecated Gradle features were used in this build, making it incompatible with Gradle 7.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/6.7/userguide/command_line_interface.html#sec:command_line_warnings

BUILD FAILED in 1m 23s
1 actionable task: 1 executed

readme.txt vs readme.md

Some documentation files use the txt extension although they seem to be written in markdown, for example this file.

Why isn't the md extension used everywhere?
Also, it is not the case that only txt is used in the distro directory, as the root of that directory contains an md file.

Prank not compiling for a dataset (.ds)

I am trying to run prank predict on a dataset containing a list of pdb files with their paths, but it fails with the following error:
"""[INFO] ConsoleWriter -
[INFO] Main - loading default config from [/home/purvi/genome/p2rank-master/distro/config/default.groovy]
[INFO] Main - looking for dataset in dataset_base_dir [/home/purvi/genome/p2rank-master/test_swiss.ds]...
[INFO] Dataset - loading dataset [/home/purvi/genome/p2rank-master/test_swiss.ds]
[ERROR] Dataset - prediction file doesn't exist: null/dir_test/1a52.pdb
[ERROR] Dataset - protein file doesn't exist: null/null"""
I am not sure where the "null" comes from.
It runs on a single pdb file though.
Please let me know how to fix this.

Feature implementation not found: conserv

There are several conservation features named:

  • conserv_cloud
  • conserv
  • conservationcloud
  • conservationcloudscaled
  • conservation

However, not all of them work; for conserv I got:

java.lang.IllegalStateException: Feature implementation not found: conserv

I do not see a difference between the features; is there a description somewhere?
As far as I know, there is also no recommendation on which feature should be used.

Any idea why there is a difference?

P2Rank provides different predictions for these two PDB files:

structure-full.pdb.txt
structure-chains.pdb.txt

The chain file was created only by selecting the atoms of chains A and B (all chains in the full PDB).
Command used:

'/opt/p2rank-runtime/p2rank.sh predict -c /opt/p2rank/p2rank_2.1/config/default -threads 1 -f /data/p2rank/task/database/v2/19HC/working/structure.pdb -o /data/p2rank/task/database/v2/19HC/working/p2rank-output --log_to_console 1

Executed with https://github.com/rdk/p2rank/releases/download/2.1/p2rank_2.1.tar.gz

Any idea why this can be? Does p2rank utilize non-polymer atoms in any way?

How to train with conservation (Add parameter -conservation_dir_train)

Commands such as

./prank.sh eval-predict ../p2rank-datasets/coach420.ds \
    -c distro/config/conservation \
    -conservation_dir 'coach420/conservation/e5i1/scores' 

can be used to evaluate performance on a dataset with the help of conservation scores.

I would expect the train command to be similar, i.e.:

.\p2rank\prank.bat traineval 
    -t .\datasets\chen11.ds \
    -e .\datasets\joined(mlig).ds \
    -threads 4 -rf_trees 128 -fr_depth 6 -delete_models 0 -loop 1 -seed 42 \
    -c distro/config/conservation
    -conservation_dir ...

However, it is not clear how to specify two directories as inputs for conservation, i.e. one for the chen11 dataset and one for the joined dataset.

Or do I have to merge all files into a single directory?

Resolve TODOs or change into issues

The p2rank code contains many TODOs; for a contributor, they often do not fully explain what is needed or how severe the TODO is.

It would be better to remove them and instead create GitHub issues with a more detailed description (what, why, how...).

How does eval-predict work?

Running on 2 datasets, namely:

  • coach420.ds
  • coach420(mlig).ds

This leads to different results, with DCA_4_2 being 77.1 and 75.7 respectively.

How can a run on coach420.ds determine the ligand binding sites for evaluation, as they are not provided in the ds file?

If it can, why is there a difference in the prediction performance?

Besides, in runs_pred.csv it is not clear why the DCA value is used when the reported results use what seems to be DCC.

Also, the results using eval_predict_coach420-mlig are 71.2 and 75.7, compared to 72.0 and 78.3 reported in the article. Is it possible to reproduce the article results, and if so, how?

Is there a traineval/optimization command that was used to train the models behind the reported results?

Tutorial/log from model training

In order to utilize p2rank to its fullest potential, hyper-parameter tuning is required. Would it be possible to make available the tutorial/log that was used to train the published models?
It could also include notes on evaluation, etc., i.e. how to decide how good the parameters are.

Such a document could be used as a starting point for further training and applications.

Java options unrecognized by OpenJDK 15

When trying to run P2Rank with OpenJDK 15, I am getting:

Unrecognized VM option 'CMSClassUnloadingEnabled'
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit. 

The same happens with the second option (UseConcMarkSweepGC) when I remove CMSClassUnloadingEnabled. When I remove both, prank predict -f test_data/1fbl.pdb runs just fine.
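
A workaround consistent with the observation above (a sketch; it assumes the CMS flags are set in the distro launcher script named prank, so back it up first): strip the options that were removed in recent JDKs.

sed -i -E 's/-XX:[+-]CMSClassUnloadingEnabled//g; s/-XX:[+-]UseConcMarkSweepGC//g' prank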

installation not clear?

I just came across P2Rank and am interested in running it on a zipped folder containing protein conformations from a trajectory.
I have PyMOL installed on the Linux cluster, and I did this:
#git clone https://github.com/rdk/p2rank.git

I see a p2rank folder created.
What is the next step? Do I have to do a make install, or how do I run it on a test.pdb file?

Thanks
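
For reference, the steps from the "Build from sources" section above cover this case:

cd p2rank
./make.sh                                            # builds the distribution with the bundled Gradle wrapper
distro/prank predict -f distro/test_data/1fbl.pdb    # adjust the path to a test structure if it differs in your checkout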

Improve error message

For example, when using the following command:

./p2rank_2.0.1/prank -t data/2020-01/mg/dataset-train.ds -e data/2020-01/mg/dataset-train.ds -o mg

the output is:

----------------------------------------------------------------------------------------------
 P2Rank 2.0.1
----------------------------------------------------------------------------------------------

  uasge:

     prank <command> <dataset.ds> [options]

  commands:

     predict      ... predict pockets (P2RANK)

     eval-predict ... evaluate model on a dataset with known ligands

     rescore      ... rescore previously detected pockets (PRANK)

     eval-rescore ... evaluate rescoring model on a dataset with known ligands

  datasets:

        Dataset files for prediction should contain list of pdb files.
        Dataset files for rescoring should contain list of protein files
        that are outputs of one of the supported pocket prediction methods
        (fpocket, ConCavity). In datasets for evaluation and training they
        must be paired with liganated-proteins (correct solutions).
        See example datasets in test_data/ directory.

  options:

     -f <path>   run on single pdb file instead of a dataset

     -c <path>   use configuration file that overrides default configuration
                 in config/default.groovy, path relative to config/ directory

     -m <path>   use previously trained classifier file relative to models/ directory
                 default: models/default.model

     -o <path>   specify output directory (relative to working dir)
                 default: test_output/<comamnd>_<dataset>

  other parameters:

     -threads <int>         number of execution threads
                            dafault: num. of processors + 1

     -visualizations <0/1>  produce PyMOL visualizations
                            default: true

     -<param> <value>       for full list of parameters see config/default.groovy

----------------------------------------------------------------------------------------------
 finished with ERROR in 0.117 seconds

log output is:

[INFO] ConsoleWriter - ----------------------------------------------------------------------------------------------
[INFO] ConsoleWriter -  P2Rank 2.0.1
[INFO] ConsoleWriter - ----------------------------------------------------------------------------------------------
[INFO] ConsoleWriter -
[INFO] ConsoleWriter -
[INFO] ConsoleWriter - ----------------------------------------------------------------------------------------------
[INFO] ConsoleWriter -  finished with ERROR in 0.117 seconds
[INFO] ConsoleWriter - ----------------------------------------------------------------------------------------------

From none of this output do I immediately see that I have not provided a command.
For example, commons-cli provides a nice message when a mandatory argument is omitted; something similar could make p2rank more user friendly.

Does p2rank require git?

Running p2rank, I see the following warning in the log:

[WARN] Routine - failed to get git commit version
java.io.IOException: Cannot run program "git": error=2, No such file or directory

Is it necessary to have git installed to get rid of this warning?

exception: Not enough training instances with class labels

ERROR: hr.irb.fastRandomForest.FastRandomTree: Not enough training instances with class labels (required: 1, provided: 0)!
weka.core.WekaException: hr.irb.fastRandomForest.FastRandomTree: Not enough training instances with class labels (required: 1, provided: 0)!

It is not clear what went wrong, why, or how to fix it.
Does it concern positive or negative instances? Was there no data, or were bad parameters used? How many instances were extracted from the given data, and if none, why?

Tested using the latest release, 2.0.1.
