steinbeck-lab / chemical-dataset-comparator

ChemIcal DatasEt comparatoR (CIDER) is a Python package and ready-to-use Jupyter Notebook workflow which primarily utilizes RDKit to compare two or more chemical structure datasets (SD files) in different aspects, e.g. size, overlap, molecular descriptor distributions, chemical space clustering, etc., most of which can be visually inspected.

License: Other

Jupyter Notebook 96.13% Python 3.87%


chemical-dataset-comparator's Issues

PDF report layout errors

First of all, the PDF report has definitely improved greatly and I am very happy about it!

But there are some layout problems in the report file I generated:
cider_report.pdf

Most of the figures (molecules plots and chemical space plots) are missing their right half.

And very minor: On the first page, maybe you could fit "Generated CIDER keys" and figures into a single line in the table? Maybe simply remove the "CIDER" here. And there could be a little more space between the bar plots.

Apart from that, I'm really happy!

Logging - create a new log file for a new run of CIDER

The logging functionality just appends everything to the log file if there is one already existing in the output folder. This file can then become very big and cluttered. It would be better to create a new one, e.g. at data import where there is also the warning that old outputs will be overwritten. Alternatively, you could create a log file with a time stamp in its name and not delete the old log file. Maybe better this way, but the log files should be moved to their own folder within the output folder then.
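The time-stamped variant suggested above could look roughly like the following. This is only a minimal sketch: the helper name `create_run_logger`, the `logs` subfolder, and the file name pattern are assumptions for illustration, not what CIDER currently does.

```python
import logging
from datetime import datetime
from pathlib import Path


def create_run_logger(output_dir, name="CIDER"):
    """Return (logger, log_path); every call starts a fresh, time-stamped log file."""
    # Hypothetical layout: all log files live in <output_dir>/logs
    log_dir = Path(output_dir) / "logs"
    log_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    log_path = log_dir / f"cider_{stamp}.log"

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Detach handlers left over from an earlier run in the same session,
    # so the new run does not keep appending to the old file.
    for old_handler in list(logger.handlers):
        logger.removeHandler(old_handler)
        old_handler.close()

    handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger, log_path
```

Calling this once at data import would give each run its own file while keeping the old ones untouched.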

Molecule sanitization check at import

Dear all,

while reading the manuscript I came across the following question: Do the molecules undergo an RDKit sanitization check at import? For the SMILES import, this is definitely the case, because the method Chem.MolFromSmiles() is used, which performs the RDKit sanitization by default, if I'm not mistaken.

For SDF import, these are the central code blocks, I think (taken from import_as_data_dict() and _check_invalid_mols_in_SDF()):

for dict_name in os.listdir(data_dir):
    if dict_name[-3:] == "sdf" or dict_name[-3:] == "SDF":
        single_dict = {}
        dict_path = os.path.join(data_dir, dict_name)
        single_dict[self.import_keyname] = Chem.SDMolSupplier(dict_path)
        all_dicts[dict_name] = single_dict
...
for single_dict in all_dicts:
    if single_dict == self.figure_dict_keyname:
        continue
    mol_index = -1
    invalid_index = []
    for mol in all_dicts[single_dict][self.import_keyname]:
        mol_index += 1
        if not mol:
            logger.warning(
                "%s has invalid molecule at index %d" % (single_dict, mol_index)
            )
            invalid_index.append(mol_index)

Since I am not familiar with the high-level Python idioms used here, I cannot tell whether a sanitization check is performed. Can somebody tell me?

Apart from the current state: Maybe applying or not applying the RDKit sanitization check to the imported molecules could be an option for all import methods, or otherwise, I think, the sanitization should always be applied, as it is in the SMILES import.

Fix duplicate search

Use hashmap (dictionary) for a fast search
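A minimal sketch of the dictionary-based search, assuming each molecule has already been reduced to one hashable identifier string (for example a canonical SMILES or InChI; which identifier CIDER uses is not stated here). The function name `find_duplicates` and the sample keys are hypothetical.

```python
from collections import defaultdict


def find_duplicates(identifiers):
    """Map each identifier that occurs more than once to all of its positions."""
    positions = defaultdict(list)
    # Single pass: every lookup and append is O(1) on average,
    # instead of comparing every molecule against every other one.
    for index, identifier in enumerate(identifiers):
        positions[identifier].append(index)
    return {key: found for key, found in positions.items() if len(found) > 1}


# Hypothetical identifiers, one per molecule in a dataset
keys = ["AAA", "BBB", "AAA", "CCC", "BBB", "AAA"]
duplicates = find_duplicates(keys)
```

Because the positions are kept, the result can also be used to report duplicates by their index in the input file.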

Switch Lipinski rule violations to RDKit-native functionality?

See Kohulan's issue on the RDKit repo: rdkit/rdkit#6180 (comment)

Examples for every method in the documentation/doc strings

Pytests failing

Hi @hannbus ,

The pytests are failing according to the GitHub actions:

Please take a look at this error:

==================================== ERRORS ====================================
___________________ ERROR collecting Tests/test_functions.py ___________________
Tests/test_functions.py:7: in <module>
    testdict = cider.import_as_data_dict("unittest_data")
.tox/py310/lib/python3.10/site-packages/CIDER/cider.py:50: in import_as_data_dict
    for dict_name in os.listdir(data_dir):
E   FileNotFoundError: [Errno 2] No such file or directory: 'unittest_data'
=========================== short test summary info ============================
ERROR Tests/test_functions.py - FileNotFoundError: [Errno 2] No such file or ...
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!

Kohulan
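One possible cause is that "unittest_data" is resolved relative to the current working directory, which differs when tox runs the tests. A minimal sketch of resolving the folder relative to the test module instead; the helper name `resolve_data_dir` is hypothetical, and it assumes the data folder sits next to the test file.

```python
from pathlib import Path


def resolve_data_dir(test_file, dirname="unittest_data"):
    """Return the absolute path of a data folder located next to the given test file."""
    return Path(test_file).resolve().parent / dirname


# Intended use inside Tests/test_functions.py (not executed here):
#     testdict = cider.import_as_data_dict(str(resolve_data_dir(__file__)))
```

This way the tests do not depend on the directory from which pytest is started.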

Add date + time of the analysis in the PDF report

Would it be possible to show the molecules in the intersections?

Similar to the functionality that has been implemented for the chemical space visualisation

It would be nice to be able to hover over molecules in the intersections of the datasets.

Once issues are solved release a new version of package on PyPI

Revise short description of CIDER in readme, code documentation, and GitHub repo "About"

I would propose something like this:
"ChemIcal DatasEt comparatoR (CIDER) is a Python package and ready-to-use Jupyter Notebook script which primarily utilizes RDKit to compare two or more chemical structure datasets (SD files) in different aspects, e.g. size, overlap, molecular descriptor distributions, chemical space clustering, etc., most of which can be visually inspected in the notebook."
Opinions?

Use different symbols for PCA plots for chemical space visualisation

This may be helpful when one molecule dataset is a subset of another one. Overlapping data points in the PCA could be made visible this way.

Update PDF report export function name in workflow notebook

In the workflow notebook, the last cell tries to use the old method for the PDF report export, which has been renamed to export_figure_report(). Please update.

Test CIDER for two SDF with one molecule each

Duplicates in workflow datasets

@hannbus, CIDER detects two pairs of duplicates in the set_phenole.sdf dataset according to the workflow notebook. That got me wondering: did you put them in there, or were they there originally? The latter would be weird, because these are datasets I downloaded from COCONUT and I can't find the duplicates there.

Use identifier or position in SD file to report duplicates

Import on Windows fails

I have set up the conda environment as it is described in the readme (but using the full conda, not mini conda) on my Windows system:

$ conda create --name ChemIcal_DatasEt_compaRator python=3.8
$ conda activate ChemIcal_DatasEt_compaRator
$ conda install pip
$ python -m pip install -U pip #Upgrade pip
$ pip install notebook rdkit-pypi matplotlib==3.5.1 seaborn==0.11.2 chemplot==1.2.0 matplotlib_venn==0.11.6 FPDF==1.7.2

When I execute the import cell in the tutorial notebook, I get an import error saying that the rdBase dll has not been found:

Any ideas or suggestions?

Test CIDER Workflow with COCONUT SDF

Similar to #29, we should test the CIDER workflow notebook for the COCONUT SDF. For one, because it is a nice case study for a relevant data set that is already interesting when it is not compared to another data set (-> next step), and also, it would be good to test CIDER on a bigger data set like COCONUT to identify (and tag in the comments) the methods that scale unfavorably with the dataset size, like e.g. get_duplicate_key() where every molecule is tested against every other molecule. The COCONUT SDF can be downloaded here: https://coconut.naturalproducts.net/download

Failing: cider.draw_most_frequent_scaffolds(testdict, graph_framework=True, number_of_structures=8, structures_per_row=4)

@hannbus, as we discussed in today's meeting, this function apparently fails.

2023-03-10 11:37:30,722 [ERROR] CIDER: An Error occured while executing CIDER!
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3433, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_79681/1434904516.py", line 1, in <module>
    cider.draw_most_frequent_scaffolds(testdict, graph_framework=True, number_of_structures=8, structures_per_row=4)
  File "/home/kohulan/.local/lib/python3.10/site-packages/CIDER/cider.py", line 1489, in draw_most_frequent_scaffolds
    scaffolds = self._get_scaffold(
  File "/home/kohulan/.local/lib/python3.10/site-packages/CIDER/cider.py", line 1424, in _get_scaffold
    graph_framework_list.append(MurckoScaffold.MakeScaffoldGeneric(mol))
  File "/home/kohulan/.local/lib/python3.10/site-packages/rdkit/Chem/Scaffolds/MurckoScaffold.py", line 47, in MakeScaffoldGeneric
    return Chem.RemoveHs(res)
rdkit.Chem.rdchem.AtomValenceException: Explicit valence for atom # 40 C, 5, is greater than permitted
[11:37:30] Explicit valence for atom # 40 C, 5, is greater than permitted
---------------------------------------------------------------------------
AtomValenceException                      Traceback (most recent call last)
Cell In [32], line 1
----> 1 cider.draw_most_frequent_scaffolds(testdict, graph_framework=True, number_of_structures=8, structures_per_row=4)

File ~/.local/lib/python3.10/site-packages/CIDER/cider.py:1489, in ChemicalDatasetComparator.draw_most_frequent_scaffolds(self, all_dicts, number_of_structures, structures_per_row, image_size, framework, graph_framework, normalize, data_type, figsize, fontsize_title, fontsize_subtitle)
   1487     continue
   1488 title_list.append(single_dict)
-> 1489 scaffolds = self._get_scaffold(
   1490     all_dicts[single_dict][self.import_keyname],
   1491     number_of_structures,
   1492     structures_per_row,
   1493     image_size,
   1494     framework,
   1495     graph_framework,
   1496     normalize,
   1497 )
   1498 image_list.append(scaffolds[0])
   1499 all_dicts[single_dict][self.scaffold_list_keyname] = scaffolds[1]

File ~/.local/lib/python3.10/site-packages/CIDER/cider.py:1424, in ChemicalDatasetComparator._get_scaffold(self, moleculeset, number_of_structures, structures_per_row, image_size, framework, graph_framework, normalize)
   1422 graph_framework_list = []
   1423 for mol in framework_list:
-> 1424     graph_framework_list.append(MurckoScaffold.MakeScaffoldGeneric(mol))
   1425 for mol in graph_framework_list:
   1426     structure_list.append(Chem.MolToSmiles(mol))

File ~/.local/lib/python3.10/site-packages/rdkit/Chem/Scaffolds/MurckoScaffold.py:47, in MakeScaffoldGeneric(mol)
     45   bond.SetBondType(Chem.BondType.SINGLE)
     46   bond.SetIsAromatic(False)
---> 47 return Chem.RemoveHs(res)

AtomValenceException: Explicit valence for atom # 40 C, 5, is greater than permitted
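Until the root cause is fixed, one option would be to skip and log the structures that cannot be converted instead of aborting the whole plot. A minimal, library-agnostic sketch: the helper name `convert_skipping_failures` is hypothetical, and `convert` stands in for a call such as MurckoScaffold.MakeScaffoldGeneric.

```python
import logging

logger = logging.getLogger("CIDER")


def convert_skipping_failures(items, convert):
    """Apply convert to every item; return (results, indices of items that raised)."""
    results, failed = [], []
    for index, item in enumerate(items):
        try:
            results.append(convert(item))
        except Exception as error:  # e.g. the AtomValenceException shown above
            logger.warning("Skipping item at index %d: %s", index, error)
            failed.append(index)
    return results, failed
```

The returned indices make it possible to tell the user which molecules were left out of the scaffold statistics.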

PDF export in the Workflow notebook is not working

Branch: dev-Hannah
Commit: d339471
@hannbus

In the Workflow notebook, the last cell cider.export_all_picture_pdf() is not working.
Output:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [35], in <cell line: 1>()
----> 1 cider.export_all_picture_pdf()

AttributeError: 'ChemicalDatasetComparator' object has no attribute 'export_all_picture_pdf'

I guess the function call in the notebook has to be updated to export_all_figures_pdf() as it is now in the cider.py file.

Reduce default FP size in chemical space visualization

Have a look at this first:
https://greglandrum.github.io/rdkit-blog/posts/2023-03-26-fingerprint-size-and-similarity-searching1.html

I would suggest setting the FP size to 512 or lower as the default value.

draw_molecules with smi input does not work

When importing the dataset not from .sdf but from .smi or .txt files, draw_molecules fails with a ValueError:

ValueError                                Traceback (most recent call last)
c:\Users\hannah\Desktop\Data\Git\Not for upload\CIDER_Workflow_Copy.ipynb Cell 19 in <cell line: 1>()
----> 1 cider.draw_molecules(testdict, number_of_mols = 20, mols_per_row = 5)

File c:\Users\hannah\.conda\envs\CDC_neu\lib\site-packages\CIDER\cider.py:473, in ChemicalDatasetComparator.draw_molecules(self, all_dicts, number_of_mols, mols_per_row, image_size, data_type, figsize, fontsize_title, fontsize_subtitle)
    470     to_draw.append(all_dicts[single_dict][self.import_keyname][i])
    471 for mol in to_draw:
    472     atom0_pos = [
--> 473         mol.GetConformer().GetAtomPosition(0).x,
    474         mol.GetConformer().GetAtomPosition(0).y,
    475         mol.GetConformer().GetAtomPosition(0).z,
    476     ]
    477     atom1_pos = [
    478         mol.GetConformer().GetAtomPosition(1).x,
    479         mol.GetConformer().GetAtomPosition(1).y,
    480         mol.GetConformer().GetAtomPosition(1).z,
    481     ]
    482     if atom0_pos == atom1_pos:

ValueError: Bad Conformer Id

This should not be too hard to fix, I think. Should I do that sometime in the next few days?
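A possible guard, sketched under the assumption that the coordinate comparison from the traceback is only meaningful when the molecule has a conformer. The function name `first_two_atoms_coincide` is hypothetical; it only relies on the RDKit Mol methods GetNumConformers(), GetNumAtoms() and GetConformer().

```python
def first_two_atoms_coincide(mol):
    """Return True or False if coordinates exist, or None if there is nothing to compare."""
    # Molecules built from SMILES carry no conformer, so GetConformer() would raise
    # "Bad Conformer Id"; check first instead of calling it unconditionally.
    if mol.GetNumConformers() == 0 or mol.GetNumAtoms() < 2:
        return None
    conformer = mol.GetConformer()
    first = conformer.GetAtomPosition(0)
    second = conformer.GetAtomPosition(1)
    return (first.x, first.y, first.z) == (second.x, second.y, second.z)
```

When the result is None or True, 2D coordinates could be generated before drawing, for instance with RDKit's AllChem.Compute2DCoords().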

In the Workflow, "Get descriptor value with database ID" function does not work

Branch: dev-Hannah
Commit: d339471
@hannbus

In the Workflow notebook, the method "get_value_from_id" defined in the section "Get descriptor value with database ID" does not work.
Output:

Molecule found in set_chlorbenzene-5.sdf
Molecular Weight value for ID CNP0291002: 173.55499999999998
Molecule found in set_chlorbenzene.sdf
Molecular Weight value for ID CNP0291002: 173.55499999999998
Molecule found in set_phenole.sdf
Molecular Weight value for ID CNP0291002: 173.55499999999998
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [24], in <cell line: 1>()
----> 1 get_value_from_id(testdict, 'CNP0291002', 'Molecular Weight')

Input In [23], in get_value_from_id(all_dicts, wanted_id, descriptor_list_keyname)
      4 """
      5 This function returns a descriptor value for a specific molecule referred to by its database ID and
      6 the dataset where the molecule has been found.
   (...)
     15 
     16 """
     17 for single_dict in all_dicts:
---> 18     if wanted_id in all_dicts[single_dict][cider.database_id_keyname]:
     19         print("Molecule found in " + str(single_dict))
     20         index = all_dicts[single_dict][cider.database_id_keyname].index(
     21             wanted_id
     22         )

KeyError: 'coconut_id'

I think the problem is that the sub-dictionary "figures" has no "coconut_id" key, which makes sense because it does not contain molecules. You have to either explicitly ignore the "figures" sub-dict here or (preferably) check every sub-dict first for the "coconut_id" key or rather "cider.database_id_keyname" key.
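The second option (checking every sub-dict for the ID key first) could be sketched as follows. This is not the notebook's function: the name `lookup_descriptor_value` is hypothetical, and it assumes that the ID list and the descriptor list of a dataset are parallel lists, as the index lookup in the traceback suggests.

```python
def lookup_descriptor_value(all_dicts, wanted_id, descriptor_keyname, id_keyname="coconut_id"):
    """Return {dataset name: descriptor value} for every dataset that contains wanted_id."""
    found = {}
    for name, sub_dict in all_dicts.items():
        # Sub-dicts without an ID list (such as "figures") are skipped, not indexed.
        ids = sub_dict.get(id_keyname)
        if ids is None or wanted_id not in ids:
            continue
        found[name] = sub_dict[descriptor_keyname][ids.index(wanted_id)]
    return found


# Toy data with the same shape as described in the issue
example = {
    "set_a.sdf": {"coconut_id": ["CNP1", "CNP2"], "Molecular Weight": [100.0, 173.5]},
    "set_b.sdf": {"coconut_id": ["CNP3"], "Molecular Weight": [50.0]},
    "figures": {},
}
values = lookup_descriptor_value(example, "CNP2", "Molecular Weight")
```

Checking for the key rather than for the name "figures" keeps the function working if further sub-dicts without molecules are added later.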

README.md adjustments

In readme, Python 3.8 needs to be updated to Python 3.10
White backgrounds for logos in readme?
Why is the logo in the workflow dir? Is there a better place for it? Place it in repo root dir

Catch warnings

Catch warnings specifically where they are produced using the warnings.catch_warnings() context manager
There are a couple of FutureWarnings:
```
=============================== warnings summary ===============================
Tests/test_functions.py::test_chemical_space_plot_tsne
/home/runner/work/ChemIcal-DatasEt-comparatoR/ChemIcal-DatasEt-comparatoR/.tox/py310/lib/python3.10/site-packages/sklearn/manifold/_t_sne.py:795: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.
```

Full example about this from StackOverflow:

import warnings

# Custom warning class for the example
class MyWarning(UserWarning):
    pass

def function_that_raises_warning(num):
    print(num)
    warnings.warn("This is a warning", MyWarning)


with warnings.catch_warnings():
    function_that_raises_warning(1) # raises warning
    warnings.filterwarnings(
        action="ignore", category=MyWarning, message="This is a warning"
    )
    function_that_raises_warning(2) # Warning is filtered
function_that_raises_warning(3) # raises warning

Rename repository

Hey @hannbus ,

Could you rename the repository to match the name in the README, ChemIcal DatasEt comparatoR (CIDER)?

Thanks

Write unit tests and use pytest to automatically run all of them

Depictions without coordinates

I have imported an SD file without any atom coordinates into CIDER and it plots all atoms on top of each other:

This is the file:
COCONUT_1000_subset.txt

Is there a way we could fix that, i.e. detect whether there are or there aren't coordinates given in the input SD files?

@hannbus @OBrink @Kohulan

AttributeError in chemical_space_visualization from _t_sne.py

When plotting the chemical space visualization with tsne as dimension reduction method in a notebook, an AttributeError is raised from sklearn\manifold\_t_sne.py. It says "AttributeError: 'list' object has no attribute 'shape'". For the other dimension reduction methods the chemical space visualization works well. Any ideas what the problem is?

Update fpdf version in Python_Requirements.txt

Update notebook

Required modifications:

delete function definitions from notebook
adapt all examples in the notebook to use the ChemicalDatasetComparator class methods