ci-lab-cz / easydock Goto Github PK

View Code? Open in Web Editor NEW

32.0 32.0 12.0 183 KB

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

easydock's People

Contributors

Stargazers

Watchers

Forkers

guzelminibaeva bio-otto unixjunkie ydzdfci jcheminform ardeat biocheming davidoskky feriolet smaiti7 fongchun

easydock's Issues

Improve description

Improve description of the repository in README. Mention which docking programs can be run, how to install them (can be a link to the corresponding instructions). Provide examples of scripts running.

adding stereoisomer enumeration

Hi! I have read the easydock paper and i am interested if it is possible to add the stereoisomer enumeration in the source code. The paper mentioned that it is recommended to either use the RDKit function (I assume it is the rdkit.Chem.EnumerateStereoisomers) or the gen_stereo_rdkit.py script. I am new to RDKit as a whole so I am unsure of what to do. As I tried to understand the script, I don't really know exactly which file I should modify to add this. I thought of modifying these following scripts (PSEUDOCODE), but I am afraid it will interfere with the total number of files going through each node.

vina_dock.mol_dock(mol,...):
    stereoisomer_tuple = EnumerateStereoisomer(mol)
    for index, mol in enumerate(stereoisomer_tuple):
        ligand_pdbqt = ligand_preparation(mol)
        ..... (the rest of the code)

The output_fname can be modified to output_fname + '_' + index

Again, I don't fully understand the whole script, and would appreciate it if there is a rough outline (or specific script/function) that can be modified to include the stereo isomer.

Implement the latest version of meeko into the pipeline

Due to changes in the interface of some functions we froze meeko version to particular commit. This should be fixed to be able to use the latest meeko version.

Salts in molecules cause problems

While running easydock, I figured that if there are salts inside a ligand that is being docked (the salt part is marked in the smiles via a .), it causes the following error:

Traceback (most recent call last):
File ".../lib/python3.11/site-packages/easydock/preparation_for_docking.py", line 158, in pdbqt2molblock
rdkitmol_list = RDKitMolCreate.from_pdbqt_mol(pdbqt_mol)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../lib/python3.11/site-packages/meeko/rdkit_mol_create.py", line 224, in from_pdbqt_mol
mol = cls.add_pose_to_mol(mol, coordinates, index_map)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../lib/python3.11/site-packages/meeko/rdkit_mol_create.py", line 322, in add_pose_to_mol
raise RuntimeError("Only H allowed to be in SMILES but not in PDBQT")
RuntimeError: Only H allowed to be in SMILES but not in PDBQT
Parsing PDB was failed: 9134_0

This error is generated through molecules like this

CC1=C(N2C=CC=NC2=N1)C3=CSC(=N3)NC4=CC=C(C=C4)O.Br

CC1=CC=C(C=C1)C2=[N+](SC(=N2)N3CCOCC3)C4=CC=CC=C4.[O-]Cl(=O)(=O)=O

and happens for nearly all (of my tested) molecules that contain a salt group (indicated via the .Br at the end of the smiles string)

Retrieving poses fails

It is impossible to retrieve poses which are not the first one.

I do have a working solution running in my system already. If I verify everything works correctly I'll refactor the code a little bit and submit a PR.

On a side note: what is the reason of being able to select single poses? I'd rather have --poses 5 return the first 5 poses.

Greetings

get_sdf_from_dock_db -i docked.db -o docked.sdf --poses 2

Traceback (most recent call last):
  File "/home/davide/Lavoro/libraries/easydock/easydock/preparation_for_docking.py", line 157, in pdbqt2molblock
    pdbqt_mol = PDBQTMolecule(pdbqt_block, is_dlg=False, skip_typing=True, poses_to_read=1)
  File "/home/davide/.config/conda/envs/easydock/lib/python3.9/site-packages/meeko/molecule_pdbqt.py", line 360, in __init__
    results = _read_ligand_pdbqt_file(pdbqt_string, poses_to_read, energy_range, is_dlg, skip_typing)
  File "/home/davide/.config/conda/envs/easydock/lib/python3.9/site-packages/meeko/molecule_pdbqt.py", line 282, in _read_ligand_pdbqt_file
    positions = np.array(positions).reshape((n_poses, atoms.shape[0], 3))
AttributeError: 'NoneType' object has no attribute 'shape'
Parsing PDB was failed (fixing did not help): i1_2

Docking with dask cannot be continued

If docking was run with dask support, it cannot be continued after the script restart. This happens due to incorrect storage of temporary files, which should be accessible across nodes. However, files are stored on the main node in /tmp dir. This should be changed to a custom directory provided by a user as it was previously. Uncomment arg --tempdir and check working of run_dock

pkasolver-GPU issue at high scale protonation

I am not sure if I should try asking this at the pkasolver repo instead, but I was hoping that maybe this issue can be solved here.

I am trying to dock 600k molecules for almost a week, but I am stuck on the protonation step with pkasolver. I am currently using GPU because it is significantly faster than using CPU by 10x in my case. However, I keep getting stuck with the warning message and a sudden termination.

/home/user/miniforge3/envs/easydock_gpu/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

I tried print debugging to see which code it specifically terminate, but unfortunately I do not get any more information beyond that.

def protonate_pkasolver(input_fname: str, output_fname: str, ncpu: int = 1, dock_ncpu: int=1):
    
    with open(input_fname,'r') as input_file:
        smi_l = input_file.readlines()

    # Assuming that using 25000 compound per loop will avoid the GPU RAM Error (Semparhore leakage and out of memory error)
    chunk_number = ceil(len(smi_l)/25000)
    chunk_smi_l = chunk_into_n(smi_l, chunk_number)

    temp_fname_list = []
    for chunk in chunk_smi_l:
        with tempfile.NamedTemporaryFile(suffix='.smi', mode='w', encoding='utf-8', delete=False) as input_tmp:
            input_tmp.write(''.join(chunk))
            output_tmp = tempfile.NamedTemporaryFile(suffix='.smi', mode='w', encoding='utf-8', delete=False)
            temp_fname_list.append((input_tmp.name, output_tmp.name))

    from pkasolver.query import QueryModel
    model = QueryModel()
    
    for input_f, output_f in temp_fname_list:
        print(f"Output file {output_f} from {input_f}")
    #using maxtasksperchild=10 and chunksize=10 means that each GPU worker will be replaced after doing 100 molecules (new PID)
    #in other words, the worker are allowed to do 10 tasks, with each tasks having to protonate 10 molecules.
    #this is done because I suspect the CUDA out of memory Error is due to the worker having to protonate #num_compound/gpu_threads without being replaced,
    #which cause the memory leak.
    #https://github.com/pytorch/pytorch/issues/44156
        pool = mp.get_context('spawn').Pool(20, maxtasksperchild=10)
        with open(output_f, 'wt') as f:
            for smi, name in pool.imap_unordered(partial(__protonate_pkasolver, model=model), read_input(input_f), chunksize=10):
                f.write(f'{smi}\t{name}\n')

        pool.close()
        pool.join()
    print(f'writing actual content to {output_fname} from {input_fname}')
    with open(output_fname, 'wt') as output_file:
        for temp_fname in temp_fname_list:
            input_temp_fname, output_temp_fname = temp_fname

            with open(output_temp_fname,'r') as output_smi:
                output_file.write(''.join(output_smi))
                output_smi.seek(0,0)
                print(f"{output_temp_fname} has {len(output_smi.readlines())}")
            print('removing file...')
            os.remove(input_temp_fname)
            os.remove(output_temp_fname)
            print('removing file is done')

Here is the output that I get:

Output file /tmp/tmppbb8kr9v.smi from /tmp/tmp3ypkg8g6.smi
Output file /tmp/tmpzre3vlxp.smi from /tmp/tmpriot8o4b.smi
Output file /tmp/tmplmqr6i4a.smi from /tmp/tmpk1xplaiu.smi
Output file /tmp/tmplp9lpagp.smi from /tmp/tmpf3ud4ma5.smi
Output file /tmp/tmpnru5ewr7.smi from /tmp/tmphe0l4ud5.smi
Output file /tmp/tmpv_lm44q7.smi from /tmp/tmpr77d0p1d.smi
Output file /tmp/tmpvdive5d0.smi from /tmp/tmpy29nuqvr.smi
Output file /tmp/tmpracf5f6m.smi from /tmp/tmpdgjt6y3p.smi
Output file /tmp/tmpcy_vj5j7.smi from /tmp/tmpdbxwh340.smi
Output file /tmp/tmpi95qdgza.smi from /tmp/tmp58lse777.smi
Output file /tmp/tmpvv2z3wnk.smi from /tmp/tmp_qtepbfn.smi
Output file /tmp/tmp537lkrwj.smi from /tmp/tmpiouw2wnq.smi
Output file /tmp/tmpb_8waiba.smi from /tmp/tmp53lkpoxq.smi
Output file /tmp/tmpkspme0_t.smi from /tmp/tmp5l7hlpkv.smi
Output file /tmp/tmprsuep50_.smi from /tmp/tmp2trllacy.smi
Output file /tmp/tmphn_5dpgx.smi from /tmp/tmp53zii_e9.smi
Output file /tmp/tmpi1f40gu3.smi from /tmp/tmp42dfrmet.smi
Output file /tmp/tmpqwvjz8f1.smi from /tmp/tmpk2g8yssc.smi
Output file /tmp/tmpqcpuhwx0.smi from /tmp/tmp7pgv2yvt.smi
Output file /tmp/tmpdb2q5yef.smi from /tmp/tmp0x73hz0y.smi
Output file /tmp/tmpkomy499p.smi from /tmp/tmpmdqb6snw.smi
Output file /tmp/tmp768qr0ag.smi from /tmp/tmp2srprmwb.smi
Output file /tmp/tmpx0qq1fyk.smi from /tmp/tmp0kxxtgaw.smi
Output file /tmp/tmp6hz61r22.smi from /tmp/tmp4occm7n1.smi
/home/user/miniforge3/envs/easydock_gpu/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Surprisingly, the input and output files from chunk_into_n functions do not exist anymore even when it does not reach the os.remove(file) I assume, which is very odd.

As can be seen in the modified protonate_pkasolver() function, I am using 20 GPU threads from 1 GPU, which consumes almost 20 GB of memory. If it truly is a memory issue, I was wondering if there is a way to avoid that while still using torch.multiprocessing to protonate the molecule.

run_dock is running, but stops output into the output.db file

When I try to docking a .smi file containing 10k molecules against a receptor, the size of output.db increases continually.
However, after some time (tens of minutes), the increase stopped, while run_dock still gives good output.

Then I stop the run_dock running, and then just re-run the run_dock task. The size of output.db will increase again.

This process will repeat and repeat.

How to solve this problem and continue docking until the last molecue ?

DLL load failed while importing rdmolfiles: The specified procedure could not be found

Hi everyone,
I install easydock with the snippet of code in conda:

conda create -n easydock python=3.9 -y
conda activate easydock
conda install -c conda-forge numpy=1.20 rdkit scipy dask distributed -y
pip install meeko
pip install easydock
conda install ipykernel -y

I run your python example, and the error appears

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
[d:\OneDrive](file:///D:/OneDrive) - UMP\test\test.ipynb Cell 1 line 1
----> [1](vscode-notebook-cell:/d%3A/OneDrive%20-%20UMP/test/test.ipynb#W0sZmlsZQ%3D%3D?line=0) from easydock.run_dock import docking
      [2](vscode-notebook-cell:/d%3A/OneDrive%20-%20UMP/test/test.ipynb#W0sZmlsZQ%3D%3D?line=1) from easydock.vina_dock import mol_dock
      [3](vscode-notebook-cell:/d%3A/OneDrive%20-%20UMP/test/test.ipynb#W0sZmlsZQ%3D%3D?line=2) # from easydock.gnina_dock import mol_dock  # <- enable gnina docking

File [e:\Software\Anaconda3\envs\easydock\lib\site-packages\easydock\run_dock.py:10](file:///E:/Software/Anaconda3/envs/easydock/lib/site-packages/easydock/run_dock.py:10)
      7 from functools import partial
      8 from multiprocessing import Pool
---> 10 from rdkit import Chem
     11 from rdkit.Chem.rdMolDescriptors import CalcNumRotatableBonds
     12 from easydock.database import create_db, restore_setup_from_db, init_db, update_db, save_sdf, select_mols_to_dock, \
     13     add_protonation

File [e:\Software\Anaconda3\envs\easydock\lib\site-packages\rdkit\Chem\__init__.py:23](file:///E:/Software/Anaconda3/envs/easydock/lib/site-packages/rdkit/Chem/__init__.py:23)
     21 _HasSubstructMatchStr = rdchem._HasSubstructMatchStr
     22 from rdkit.Chem.rdchem import *
---> 23 from rdkit.Chem.rdmolfiles import *
     24 from rdkit.Chem.rdmolops import *
     25 from rdkit.Chem.rdCIPLabeler import *

ImportError: DLL load failed while importing rdmolfiles: The specified procedure could not be found.

Different get_isomers() result when using multiprocessing

I am trying to use multiprocessing to increase the computational speed of running the init_db() function, but I found that there is a mismatch of the result when I use multiprocessing.

Using multiprocessing seems to produce less isomers overall compared when not using it. Below is the function that I use for multiprocessing. While using subprocess increase the time significantly (30 CPUs: 90s for 250k molecules compared to 5h without using it), I was wondering if there is any explanation for why it produces less isomer. I also noticed that some isomers produced without multiprocessing tends to have the same vina score.

How to reproduce(2 'problematic' SMILES):

c1cn(nn1)[C@H]2C[NH+](C[C@H]2NC(=O)C3CC4(C3)CCC4)CCCO
C[NH+]1CCCC[C@H]1C(=O)NCc2cn[nH]c2[C@@H]3CC[NH+](C3)CCOC

run_dock -i smile/train_smiles_final_updated.smi -o docked/multi_train_smiles_final_updated.db --program vina --config config.yml -c 30 --sdf -s 5 --protonation dimorphite

def small_init_db(lists, max_stereoisomers, prefix):
    mol, mol_name = lists
    if prefix:
        mol_name = f'{prefix}-{mol_name}'
    if mol_is_3d(mol):
        smi = Chem.MolToSmiles(mol, isomericSmiles=True)
        return [['mol', (mol_name, 0, smi, Chem.MolToMolBlock(mol))]]

    else:
        isomer_list = []
        isomers = get_isomers(mol, max_stereoisomers)
        for stereo_id, stereo_mol in enumerate(isomers):
            smi = Chem.MolToSmiles(stereo_mol, isomericSmiles=True)
            isomer_list.append(['smi', (mol_name, stereo_id, smi)])
        return isomer_list

def init_db(db_fname, input_fname, ncpu, max_stereoisomers=1, prefix=None):
    start = time.time()

    pool = Pool(processes=ncpu)
    conn = sqlite3.connect(db_fname)
    cur = conn.cursor()
    data_smi = []  # non 3D structures
    data_mol = []  # 3D structures
    prod_x = partial(small_init_db, max_stereoisomers=max_stereoisomers, prefix=prefix)

    result_list = pool.map(prod_x, read_input.read_input(input_fname), chunksize=1)

    for result in result_list:
        try:
            for subresult in result:
                if subresult[0] == 'smi':
                    data_smi.append(subresult[1])
                else:
                    data_mol.append(subresult[1])
        except TypeError:
            print(result)

    cur.executemany(f'INSERT INTO mols (id, stereo_id, smi) VALUES(?, ?, ?)', data_smi)
    cur.executemany(f'INSERT INTO mols (id, stereo_id, smi, source_mol_block) VALUES(?, ?, ?, ?)', data_mol)
    conn.commit()
 
    end = time.time()
    print(end - start)

Stereoconfiguration in SMILES

It seems that stereoconfiguration in smiles is not always saved properly. Sometimes we get achiral smiles. Check this and fix if needed.

Interrupting docking with dask does not interrupt calculations

After canceling of the running task (run docking with dask by SLURM), the actual docking is continued. Check and fix.

Skip molecules which cannot be read

Skip molecules and print to stderr message that a molecules with a name (if name is possible to retrieve) cannot be processed and will be skipped

Database initialization takes too much time due to stereoisomer enumeration

For the --init process, yes I notice that the compound is initialized very slowly a long time ago because some molecules take a long time to generate the isomers. That's why to speed up the process, I tend to multiply the ncpu needed with the cpu in the config.yml for docking (I hardcoded it since I don't want to add up more argument to --init_db at that time), which speeds up the process in linear fashion if I remembered (it takes around 3 hours to initialize ~600k compound including isomers with 150 CPU).

Originally posted by @Feriolet in #35 (comment)

Pick Top n compound with best docking score with `get_sdf_from_dock_db.py`

Can I know if it is possible to use get_sdf_from_dock_db.py to get the top n compound based on the lowest docking score? My initial plan is to use an SQLite editor to sort the docking_score and save the .db to a new file, and then use the script to get the sdf, but I'm curious if I can use the script directly.

I assume that with SQLite command, I would have to use SELECT TOP(n) ..... ORDER BY 'docking_score', but that may break the compatibility of the script. Or maybe using SELECT ..... ORDER BY 'docking_score' LIMIT n is also okay?

Error parsing PDBQT to Mol: Element 'G' not found

This issue was occurred here DrrDom/rdkit-scripts#20. The proposed solution can be implemented, but it will require to switch for the newest meeko version, that will require to fix issues with changed interface of some meeko functions.

Sampling of ring conformations before docking

For molecules with saturated ring systems (e.g. sugars) it can be useful to implement sampling of initial ring conformations, dock them separately and choose best output. This looks critically important for such molecules.

Solution 1.
This can be implemented in a general way.

Every new state of a molecule should be stored as a separate record. Currently it may be tautomeric state, protonation state and conformation (further states can be added).
These states are enumerated before docking will start during the DB preparation step.
A column state_id should be added to DB to designate different states.
A script which retrieve best results from DB should be adapted to return results for only best scored output state.

Drawbacks:

this will break direct extraction of results from DB by SQL queries. However, if a user is interested only in the best result for each entry molecule this will be easy to return with GROUP BY statement.
docking will take longer as more states should be docked, so additional states should not be too numerous.

Alternative solution 1.
Every type of a possible state can be designated to its own column in DB, e.g. tautomer_id, protonatrion_id, conf_id. This gives more control over results, but the same drawback will persist.

Solution 2.
Starting conformers can be enumerated by ligand_preparation function which will return a list of PBDQT blocks (instead of a single one like now) but only for molecules which require this (comprising saturated rings and maybe something else). After docking the best output is selected among several ones (in mol_dock function) and it is returned and stored in DB. Thus for a user of mol_dock function nothing will change and DB structure can be kept as is.

The number and diversity of generated conformers should be controlled to keep only distinct ones. This may require some experiments to derive empirical thresholds. Issues may also occur with compounds having several saturated rings.

Currently solution 2 looks preferable as it will require less changes.

How to prepare the `grid.txt` file

Thank you for providing such an easy-to-use package.

I want to ask a stupid question about how to prepare the grid.txt file. Could someone provide your example grid.txt file for me to follow? Thank you so much!

dependance on chemaxon is annoying

maybe switch to Dimorphite-DL:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0336-9

I am in academia and I don't even have a chemaxon license anymore...
Software vendors always change their license terms one day or another...