ci-lab-cz / easydock Goto Github PK
View Code? Open in Web Editor NEWLicense: BSD 3-Clause "New" or "Revised" License
License: BSD 3-Clause "New" or "Revised" License
Improve description of the repository in README. Mention which docking programs can be run, how to install them (can be a link to the corresponding instructions). Provide examples of scripts running.
Hi! I have read the easydock paper and i am interested if it is possible to add the stereoisomer enumeration in the source code. The paper mentioned that it is recommended to either use the RDKit function (I assume it is the rdkit.Chem.EnumerateStereoisomers) or the gen_stereo_rdkit.py script. I am new to RDKit as a whole so I am unsure of what to do. As I tried to understand the script, I don't really know exactly which file I should modify to add this. I thought of modifying these following scripts (PSEUDOCODE), but I am afraid it will interfere with the total number of files going through each node.
vina_dock.mol_dock(mol,...):
stereoisomer_tuple = EnumerateStereoisomer(mol)
for index, mol in enumerate(stereoisomer_tuple):
ligand_pdbqt = ligand_preparation(mol)
..... (the rest of the code)
The output_fname can be modified to output_fname + '_' + index
Again, I don't fully understand the whole script, and would appreciate it if there is a rough outline (or specific script/function) that can be modified to include the stereo isomer.
Due to changes in the interface of some functions we froze meeko version to particular commit. This should be fixed to be able to use the latest meeko version.
While running easydock, I figured that if there are salts inside a ligand that is being docked (the salt part is marked in the smiles via a .), it causes the following error:
Traceback (most recent call last):
File ".../lib/python3.11/site-packages/easydock/preparation_for_docking.py", line 158, in pdbqt2molblock
rdkitmol_list = RDKitMolCreate.from_pdbqt_mol(pdbqt_mol)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../lib/python3.11/site-packages/meeko/rdkit_mol_create.py", line 224, in from_pdbqt_mol
mol = cls.add_pose_to_mol(mol, coordinates, index_map)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../lib/python3.11/site-packages/meeko/rdkit_mol_create.py", line 322, in add_pose_to_mol
raise RuntimeError("Only H allowed to be in SMILES but not in PDBQT")
RuntimeError: Only H allowed to be in SMILES but not in PDBQT
Parsing PDB was failed: 9134_0
This error is generated through molecules like this
CC1=C(N2C=CC=NC2=N1)C3=CSC(=N3)NC4=CC=C(C=C4)O.Br
CC1=CC=C(C=C1)C2=[N+](SC(=N2)N3CCOCC3)C4=CC=CC=C4.[O-]Cl(=O)(=O)=O
and happens for nearly all (of my tested) molecules that contain a salt group (indicated via the .Br at the end of the smiles string)
It is impossible to retrieve poses which are not the first one.
I do have a working solution running in my system already. If I verify everything works correctly I'll refactor the code a little bit and submit a PR.
On a side note: what is the reason of being able to select single poses? I'd rather have --poses 5 return the first 5 poses.
Greetings
get_sdf_from_dock_db -i docked.db -o docked.sdf --poses 2
Traceback (most recent call last):
File "/home/davide/Lavoro/libraries/easydock/easydock/preparation_for_docking.py", line 157, in pdbqt2molblock
pdbqt_mol = PDBQTMolecule(pdbqt_block, is_dlg=False, skip_typing=True, poses_to_read=1)
File "/home/davide/.config/conda/envs/easydock/lib/python3.9/site-packages/meeko/molecule_pdbqt.py", line 360, in __init__
results = _read_ligand_pdbqt_file(pdbqt_string, poses_to_read, energy_range, is_dlg, skip_typing)
File "/home/davide/.config/conda/envs/easydock/lib/python3.9/site-packages/meeko/molecule_pdbqt.py", line 282, in _read_ligand_pdbqt_file
positions = np.array(positions).reshape((n_poses, atoms.shape[0], 3))
AttributeError: 'NoneType' object has no attribute 'shape'
Parsing PDB was failed (fixing did not help): i1_2
If docking was run with dask support, it cannot be continued after the script restart. This happens due to incorrect storage of temporary files, which should be accessible across nodes. However, files are stored on the main node in /tmp
dir. This should be changed to a custom directory provided by a user as it was previously. Uncomment arg --tempdir
and check working of run_dock
I am not sure if I should try asking this at the pkasolver repo instead, but I was hoping that maybe this issue can be solved here.
I am trying to dock 600k molecules for almost a week, but I am stuck on the protonation step with pkasolver. I am currently using GPU because it is significantly faster than using CPU by 10x in my case. However, I keep getting stuck with the warning message and a sudden termination.
/home/user/miniforge3/envs/easydock_gpu/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
I tried print debugging to see which code it specifically terminate, but unfortunately I do not get any more information beyond that.
def protonate_pkasolver(input_fname: str, output_fname: str, ncpu: int = 1, dock_ncpu: int=1):
with open(input_fname,'r') as input_file:
smi_l = input_file.readlines()
# Assuming that using 25000 compound per loop will avoid the GPU RAM Error (Semparhore leakage and out of memory error)
chunk_number = ceil(len(smi_l)/25000)
chunk_smi_l = chunk_into_n(smi_l, chunk_number)
temp_fname_list = []
for chunk in chunk_smi_l:
with tempfile.NamedTemporaryFile(suffix='.smi', mode='w', encoding='utf-8', delete=False) as input_tmp:
input_tmp.write(''.join(chunk))
output_tmp = tempfile.NamedTemporaryFile(suffix='.smi', mode='w', encoding='utf-8', delete=False)
temp_fname_list.append((input_tmp.name, output_tmp.name))
from pkasolver.query import QueryModel
model = QueryModel()
for input_f, output_f in temp_fname_list:
print(f"Output file {output_f} from {input_f}")
#using maxtasksperchild=10 and chunksize=10 means that each GPU worker will be replaced after doing 100 molecules (new PID)
#in other words, the worker are allowed to do 10 tasks, with each tasks having to protonate 10 molecules.
#this is done because I suspect the CUDA out of memory Error is due to the worker having to protonate #num_compound/gpu_threads without being replaced,
#which cause the memory leak.
#https://github.com/pytorch/pytorch/issues/44156
pool = mp.get_context('spawn').Pool(20, maxtasksperchild=10)
with open(output_f, 'wt') as f:
for smi, name in pool.imap_unordered(partial(__protonate_pkasolver, model=model), read_input(input_f), chunksize=10):
f.write(f'{smi}\t{name}\n')
pool.close()
pool.join()
print(f'writing actual content to {output_fname} from {input_fname}')
with open(output_fname, 'wt') as output_file:
for temp_fname in temp_fname_list:
input_temp_fname, output_temp_fname = temp_fname
with open(output_temp_fname,'r') as output_smi:
output_file.write(''.join(output_smi))
output_smi.seek(0,0)
print(f"{output_temp_fname} has {len(output_smi.readlines())}")
print('removing file...')
os.remove(input_temp_fname)
os.remove(output_temp_fname)
print('removing file is done')
Here is the output that I get:
Output file /tmp/tmppbb8kr9v.smi from /tmp/tmp3ypkg8g6.smi
Output file /tmp/tmpzre3vlxp.smi from /tmp/tmpriot8o4b.smi
Output file /tmp/tmplmqr6i4a.smi from /tmp/tmpk1xplaiu.smi
Output file /tmp/tmplp9lpagp.smi from /tmp/tmpf3ud4ma5.smi
Output file /tmp/tmpnru5ewr7.smi from /tmp/tmphe0l4ud5.smi
Output file /tmp/tmpv_lm44q7.smi from /tmp/tmpr77d0p1d.smi
Output file /tmp/tmpvdive5d0.smi from /tmp/tmpy29nuqvr.smi
Output file /tmp/tmpracf5f6m.smi from /tmp/tmpdgjt6y3p.smi
Output file /tmp/tmpcy_vj5j7.smi from /tmp/tmpdbxwh340.smi
Output file /tmp/tmpi95qdgza.smi from /tmp/tmp58lse777.smi
Output file /tmp/tmpvv2z3wnk.smi from /tmp/tmp_qtepbfn.smi
Output file /tmp/tmp537lkrwj.smi from /tmp/tmpiouw2wnq.smi
Output file /tmp/tmpb_8waiba.smi from /tmp/tmp53lkpoxq.smi
Output file /tmp/tmpkspme0_t.smi from /tmp/tmp5l7hlpkv.smi
Output file /tmp/tmprsuep50_.smi from /tmp/tmp2trllacy.smi
Output file /tmp/tmphn_5dpgx.smi from /tmp/tmp53zii_e9.smi
Output file /tmp/tmpi1f40gu3.smi from /tmp/tmp42dfrmet.smi
Output file /tmp/tmpqwvjz8f1.smi from /tmp/tmpk2g8yssc.smi
Output file /tmp/tmpqcpuhwx0.smi from /tmp/tmp7pgv2yvt.smi
Output file /tmp/tmpdb2q5yef.smi from /tmp/tmp0x73hz0y.smi
Output file /tmp/tmpkomy499p.smi from /tmp/tmpmdqb6snw.smi
Output file /tmp/tmp768qr0ag.smi from /tmp/tmp2srprmwb.smi
Output file /tmp/tmpx0qq1fyk.smi from /tmp/tmp0kxxtgaw.smi
Output file /tmp/tmp6hz61r22.smi from /tmp/tmp4occm7n1.smi
/home/user/miniforge3/envs/easydock_gpu/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Surprisingly, the input and output files from chunk_into_n
functions do not exist anymore even when it does not reach the os.remove(file)
I assume, which is very odd.
As can be seen in the modified protonate_pkasolver()
function, I am using 20 GPU threads from 1 GPU, which consumes almost 20 GB of memory. If it truly is a memory issue, I was wondering if there is a way to avoid that while still using torch.multiprocessing
to protonate the molecule.
When I try to docking a .smi file containing 10k molecules against a receptor, the size of output.db increases continually.
However, after some time (tens of minutes), the increase stopped, while run_dock still gives good output.
Then I stop the run_dock running, and then just re-run the run_dock task. The size of output.db will increase again.
This process will repeat and repeat.
How to solve this problem and continue docking until the last molecue ?
Hi everyone,
I install easydock with the snippet of code in conda:
conda create -n easydock python=3.9 -y
conda activate easydock
conda install -c conda-forge numpy=1.20 rdkit scipy dask distributed -y
pip install meeko
pip install easydock
conda install ipykernel -y
I run your python example, and the error appears
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
[d:\OneDrive](file:///D:/OneDrive) - UMP\test\test.ipynb Cell 1 line 1
----> [1](vscode-notebook-cell:/d%3A/OneDrive%20-%20UMP/test/test.ipynb#W0sZmlsZQ%3D%3D?line=0) from easydock.run_dock import docking
[2](vscode-notebook-cell:/d%3A/OneDrive%20-%20UMP/test/test.ipynb#W0sZmlsZQ%3D%3D?line=1) from easydock.vina_dock import mol_dock
[3](vscode-notebook-cell:/d%3A/OneDrive%20-%20UMP/test/test.ipynb#W0sZmlsZQ%3D%3D?line=2) # from easydock.gnina_dock import mol_dock # <- enable gnina docking
File [e:\Software\Anaconda3\envs\easydock\lib\site-packages\easydock\run_dock.py:10](file:///E:/Software/Anaconda3/envs/easydock/lib/site-packages/easydock/run_dock.py:10)
7 from functools import partial
8 from multiprocessing import Pool
---> 10 from rdkit import Chem
11 from rdkit.Chem.rdMolDescriptors import CalcNumRotatableBonds
12 from easydock.database import create_db, restore_setup_from_db, init_db, update_db, save_sdf, select_mols_to_dock, \
13 add_protonation
File [e:\Software\Anaconda3\envs\easydock\lib\site-packages\rdkit\Chem\__init__.py:23](file:///E:/Software/Anaconda3/envs/easydock/lib/site-packages/rdkit/Chem/__init__.py:23)
21 _HasSubstructMatchStr = rdchem._HasSubstructMatchStr
22 from rdkit.Chem.rdchem import *
---> 23 from rdkit.Chem.rdmolfiles import *
24 from rdkit.Chem.rdmolops import *
25 from rdkit.Chem.rdCIPLabeler import *
ImportError: DLL load failed while importing rdmolfiles: The specified procedure could not be found.
I am trying to use multiprocessing to increase the computational speed of running the init_db()
function, but I found that there is a mismatch of the result when I use multiprocessing.
Using multiprocessing seems to produce less isomers overall compared when not using it. Below is the function that I use for multiprocessing. While using subprocess increase the time significantly (30 CPUs: 90s for 250k molecules compared to 5h without using it), I was wondering if there is any explanation for why it produces less isomer. I also noticed that some isomers produced without multiprocessing tends to have the same vina score.
How to reproduce(2 'problematic' SMILES):
c1cn(nn1)[C@H]2C[NH+](C[C@H]2NC(=O)C3CC4(C3)CCC4)CCCO
C[NH+]1CCCC[C@H]1C(=O)NCc2cn[nH]c2[C@@H]3CC[NH+](C3)CCOC
run_dock -i smile/train_smiles_final_updated.smi -o docked/multi_train_smiles_final_updated.db --program vina --config config.yml -c 30 --sdf -s 5 --protonation dimorphite
def small_init_db(lists, max_stereoisomers, prefix):
mol, mol_name = lists
if prefix:
mol_name = f'{prefix}-{mol_name}'
if mol_is_3d(mol):
smi = Chem.MolToSmiles(mol, isomericSmiles=True)
return [['mol', (mol_name, 0, smi, Chem.MolToMolBlock(mol))]]
else:
isomer_list = []
isomers = get_isomers(mol, max_stereoisomers)
for stereo_id, stereo_mol in enumerate(isomers):
smi = Chem.MolToSmiles(stereo_mol, isomericSmiles=True)
isomer_list.append(['smi', (mol_name, stereo_id, smi)])
return isomer_list
def init_db(db_fname, input_fname, ncpu, max_stereoisomers=1, prefix=None):
start = time.time()
pool = Pool(processes=ncpu)
conn = sqlite3.connect(db_fname)
cur = conn.cursor()
data_smi = [] # non 3D structures
data_mol = [] # 3D structures
prod_x = partial(small_init_db, max_stereoisomers=max_stereoisomers, prefix=prefix)
result_list = pool.map(prod_x, read_input.read_input(input_fname), chunksize=1)
for result in result_list:
try:
for subresult in result:
if subresult[0] == 'smi':
data_smi.append(subresult[1])
else:
data_mol.append(subresult[1])
except TypeError:
print(result)
cur.executemany(f'INSERT INTO mols (id, stereo_id, smi) VALUES(?, ?, ?)', data_smi)
cur.executemany(f'INSERT INTO mols (id, stereo_id, smi, source_mol_block) VALUES(?, ?, ?, ?)', data_mol)
conn.commit()
end = time.time()
print(end - start)
It seems that stereoconfiguration in smiles is not always saved properly. Sometimes we get achiral smiles. Check this and fix if needed.
After canceling of the running task (run docking with dask by SLURM), the actual docking is continued. Check and fix.
Skip molecules and print to stderr message that a molecules with a name (if name is possible to retrieve) cannot be processed and will be skipped
For the --init
process, yes I notice that the compound is initialized very slowly a long time ago because some molecules take a long time to generate the isomers. That's why to speed up the process, I tend to multiply the ncpu needed with the cpu in the config.yml
for docking (I hardcoded it since I don't want to add up more argument to --init_db
at that time), which speeds up the process in linear fashion if I remembered (it takes around 3 hours to initialize ~600k compound including isomers with 150 CPU).
Originally posted by @Feriolet in #35 (comment)
Can I know if it is possible to use get_sdf_from_dock_db.py
to get the top n compound based on the lowest docking score? My initial plan is to use an SQLite editor to sort the docking_score and save the .db
to a new file, and then use the script to get the sdf, but I'm curious if I can use the script directly.
I assume that with SQLite command, I would have to use SELECT TOP(n) ..... ORDER BY 'docking_score'
, but that may break the compatibility of the script. Or maybe using SELECT ..... ORDER BY 'docking_score' LIMIT n
is also okay?
This issue was occurred here DrrDom/rdkit-scripts#20. The proposed solution can be implemented, but it will require to switch for the newest meeko version, that will require to fix issues with changed interface of some meeko functions.
For molecules with saturated ring systems (e.g. sugars) it can be useful to implement sampling of initial ring conformations, dock them separately and choose best output. This looks critically important for such molecules.
Solution 1.
This can be implemented in a general way.
state_id
should be added to DB to designate different states.Drawbacks:
GROUP BY
statement.Alternative solution 1.
Every type of a possible state can be designated to its own column in DB, e.g. tautomer_id
, protonatrion_id
, conf_id
. This gives more control over results, but the same drawback will persist.
Solution 2.
Starting conformers can be enumerated by ligand_preparation
function which will return a list of PBDQT blocks (instead of a single one like now) but only for molecules which require this (comprising saturated rings and maybe something else). After docking the best output is selected among several ones (in mol_dock
function) and it is returned and stored in DB. Thus for a user of mol_dock
function nothing will change and DB structure can be kept as is.
The number and diversity of generated conformers should be controlled to keep only distinct ones. This may require some experiments to derive empirical thresholds. Issues may also occur with compounds having several saturated rings.
Currently solution 2 looks preferable as it will require less changes.
Thank you for providing such an easy-to-use package.
I want to ask a stupid question about how to prepare the grid.txt
file. Could someone provide your example grid.txt
file for me to follow? Thank you so much!
maybe switch to Dimorphite-DL:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0336-9
I am in academia and I don't even have a chemaxon license anymore...
Software vendors always change their license terms one day or another...
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.