steinbeck-lab / chemical-dataset-comparator

ChemIcal DatasEt comparatoR (CIDER) is a Python package and ready-to-use Jupyter Notebook workflow which primarily utilizes RDKit to compare two or more chemical structure datasets (SD files) in different aspects, e.g. size, overlap, molecular descriptor distributions, chemical space clustering, etc., most of which can be visually inspected.

License: Other

Jupyter Notebook 96.13% Python 3.87%

chemical-dataset-comparator's People

Contributors

hannbus, jonasschaub, kohulan, obrink

chemical-dataset-comparator's Issues

PDF report layout errors

First of all, the PDF report has definitely improved greatly and I am very happy about it!

But there are some layout problems in the report file I generated:
cider_report.pdf

Most of the figures (molecules plots and chemical space plots) are missing their right half.

And very minor: On the first page, maybe you could fit "Generated CIDER keys" and figures into a single line in the table? Maybe simply remove the "CIDER" here. And there could be a little more space between the bar plots.

Apart from that, I'm really happy!

Logging - create a new log file for a new run of CIDER

The logging functionality just appends everything to the log file if there is one already existing in the output folder. This file can then become very big and cluttered. It would be better to create a new one, e.g. at data import where there is also the warning that old outputs will be overwritten. Alternatively, you could create a log file with a time stamp in its name and not delete the old log file. Maybe better this way, but the log files should be moved to their own folder within the output folder then.
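The second suggestion (one time-stamped log file per run, kept in its own sub-folder) could look roughly like the sketch below. The function name `make_run_logger`, the `logs` folder name and the file name pattern are assumptions made for illustration, not CIDER's actual code; the format string simply mirrors the log lines quoted elsewhere in these issues.

```python
import logging
from datetime import datetime
from pathlib import Path


def make_run_logger(output_dir, name="CIDER"):
    """Return (logger, log_path) writing to a fresh, time-stamped file per run."""
    # Hypothetical layout: <output_dir>/logs/cider_<timestamp>.log
    log_dir = Path(output_dir) / "logs"
    log_dir.mkdir(parents=True, exist_ok=True)
    # Microseconds (%f) keep two quick successive runs from sharing a file name.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    log_path = log_dir / f"cider_{stamp}.log"

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Detach handlers left over from a previous run so nothing is appended
    # to the old file and messages are not duplicated.
    for old_handler in list(logger.handlers):
        logger.removeHandler(old_handler)
        old_handler.close()

    handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger, log_path
```

Calling this once at data import would leave older log files untouched while each run writes to its own file.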

Molecule sanitization check at import

Dear all,

While reading the manuscript, I came across the following question: do the molecules undergo an RDKit sanitization check at import? For the SMILES import, this is definitely the case, because the method Chem.MolFromSmiles() is used, which runs the RDKit sanitization check by default, if I'm not mistaken.

For SDF import, these are the central code blocks, I think (taken from import_as_data_dict() and _check_invalid_mols_in_SDF()):

for dict_name in os.listdir(data_dir):
    if dict_name[-3:] == "sdf" or dict_name[-3:] == "SDF":
        single_dict = {}
        dict_path = os.path.join(data_dir, dict_name)
        single_dict[self.import_keyname] = Chem.SDMolSupplier(dict_path)
        all_dicts[dict_name] = single_dict
...
for single_dict in all_dicts:
    if single_dict == self.figure_dict_keyname:
        continue
    mol_index = -1
    invalid_index = []
    for mol in all_dicts[single_dict][self.import_keyname]:
        mol_index += 1
        if not mol:
            logger.warning(
                "%s has invalid molecule at index %d" % (single_dict, mol_index)
            )
            invalid_index.append(mol_index)

Since I am not familiar with the high-level notation used here, I cannot tell whether a sanitization check is done. Can somebody tell me?

Apart from the current state: Maybe applying or not applying the RDKit sanitization check to the imported molecules could be an option for all import methods, or otherwise, I think, the sanitization should always be applied, as it is in the SMILES import.
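As far as the RDKit documentation goes, Chem.SDMolSupplier has a `sanitize` argument that defaults to True and yields None for records that cannot be parsed or sanitized, so the `if not mol` check in the excerpt would flag exactly those molecules. The sketch below shows the same index bookkeeping with `enumerate` and an explicit `is None` test. The function name `split_valid_invalid` is made up for this example, and it works on any iterable so it can be tried without RDKit installed.

```python
def split_valid_invalid(supplier):
    """Return (valid_mols, invalid_indices) for an iterable of molecules or None.

    `supplier` is assumed to behave like Chem.SDMolSupplier, i.e. it yields
    None for every record that could not be read or sanitized.
    """
    valid_mols = []
    invalid_indices = []
    for mol_index, mol in enumerate(supplier):
        if mol is None:
            invalid_indices.append(mol_index)
        else:
            valid_mols.append(mol)
    return valid_mols, invalid_indices


# Intended use with RDKit (not executed here):
#   valid, invalid = split_valid_invalid(Chem.SDMolSupplier(dict_path))
```

Making sanitization optional for all import methods, as proposed above, would then come down to passing the same flag to each reader.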

Pytests failing

Hi @hannbus ,

The pytests are failing according to the GitHub actions:

Please take a look at this error:

==================================== ERRORS ====================================
___________________ ERROR collecting Tests/test_functions.py ___________________
Tests/test_functions.py:7: in <module>
    testdict = cider.import_as_data_dict("unittest_data")
.tox/py310/lib/python3.10/site-packages/CIDER/cider.py:50: in import_as_data_dict
    for dict_name in os.listdir(data_dir):
E   FileNotFoundError: [Errno 2] No such file or directory: 'unittest_data'
=========================== short test summary info ============================
ERROR Tests/test_functions.py - FileNotFoundError: [Errno 2] No such file or ...
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
  • Kohulan
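The traceback suggests that "unittest_data" is resolved relative to the current working directory, which differs between a local run and the tox run in GitHub Actions. One common way around this is to build the path from the location of the test module. The sketch below assumes the data folder sits next to Tests/test_functions.py, which the error message alone does not confirm; `data_dir_next_to` is a hypothetical helper name.

```python
from pathlib import Path


def data_dir_next_to(test_file, dirname="unittest_data"):
    """Return the absolute path of `dirname` located beside `test_file`."""
    return Path(test_file).resolve().parent / dirname


# Intended use inside the test module (not executed here):
#   testdict = cider.import_as_data_dict(str(data_dir_next_to(__file__)))
```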

Revise short description of CIDER in readme, code documentation, and GitHub repo "About"

I would propose something like this:
"ChemIcal DatasEt comparatoR (CIDER) is a Python package and ready-to-use Jupyter Notebook script which primarily utilizes RDKit to compare two or more chemical structure datasets (SD files) in different aspects, e.g. size, overlap, molecular descriptor distributions, chemical space clustering, etc., most of which can be visually inspected in the notebook."
Opinions?

Duplicates in workflow datasets

@hannbus, CIDER detects two pairs of duplicates in the set_phenole.sdf dataset according to the workflow notebook. That got me wondering: did you put them in there, or were they there originally? The latter would be weird, because these are datasets I downloaded from COCONUT and I can't find the duplicates there.

Import on Windows fails

I have set up the conda environment as it is described in the readme (but using the full conda, not mini conda) on my Windows system:

$ conda create --name ChemIcal_DatasEt_compaRator python=3.8
$ conda activate ChemIcal_DatasEt_compaRator
$ conda install pip
$ python -m pip install -U pip #Upgrade pip
$ pip install notebook rdkit-pypi matplotlib==3.5.1 seaborn==0.11.2 chemplot==1.2.0 matplotlib_venn==0.11.6 FPDF==1.7.2

When I execute the import cell in the tutorial notebook, I get an import error saying that the rdBase dll has not been found:
[image: screenshot of the import error]

Any ideas or suggestions?

Test CIDER Workflow with COCONUT SDF

Similar to #29, we should test the CIDER workflow notebook with the COCONUT SDF. For one, it is a nice case study for a relevant dataset that is interesting even before it is compared to another dataset (the next step). It would also be good to test CIDER on a bigger dataset like COCONUT to identify (and tag in the comments) the methods that scale unfavorably with dataset size, e.g. get_duplicate_key(), where every molecule is tested against every other molecule. The COCONUT SDF can be downloaded here: https://coconut.naturalproducts.net/download
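For the all-against-all comparison mentioned above, a single pass that groups molecules by an identifier string usually scales much better on large datasets. The sketch below assumes every molecule already has a canonical identifier (for example an InChIKey or a canonical SMILES string); what get_duplicate_key() actually compares is not stated here, and `find_duplicate_groups` is a hypothetical name.

```python
from collections import defaultdict


def find_duplicate_groups(identifiers):
    """Map each identifier that occurs more than once to its list of indices.

    Runs in one pass over the input instead of comparing every pair.
    """
    positions = defaultdict(list)
    for index, identifier in enumerate(identifiers):
        positions[identifier].append(index)
    return {key: idxs for key, idxs in positions.items() if len(idxs) > 1}
```

The returned index lists make it easy to report which entries duplicate each other, or to keep only the first occurrence of each.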

Failing: cider.draw_most_frequent_scaffolds(testdict, graph_framework=True, number_of_structures=8, structures_per_row=4)

@hannbus, as we discussed in today's meeting, this function apparently fails.

2023-03-10 11:37:30,722 [ERROR] CIDER: An Error occured while executing CIDER!
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3433, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_79681/1434904516.py", line 1, in <module>
    cider.draw_most_frequent_scaffolds(testdict, graph_framework=True, number_of_structures=8, structures_per_row=4)
  File "/home/kohulan/.local/lib/python3.10/site-packages/CIDER/cider.py", line 1489, in draw_most_frequent_scaffolds
    scaffolds = self._get_scaffold(
  File "/home/kohulan/.local/lib/python3.10/site-packages/CIDER/cider.py", line 1424, in _get_scaffold
    graph_framework_list.append(MurckoScaffold.MakeScaffoldGeneric(mol))
  File "/home/kohulan/.local/lib/python3.10/site-packages/rdkit/Chem/Scaffolds/MurckoScaffold.py", line 47, in MakeScaffoldGeneric
    return Chem.RemoveHs(res)
rdkit.Chem.rdchem.AtomValenceException: Explicit valence for atom # 40 C, 5, is greater than permitted
[11:37:30] Explicit valence for atom # 40 C, 5, is greater than permitted
---------------------------------------------------------------------------
AtomValenceException                      Traceback (most recent call last)
Cell In [32], line 1
----> 1 cider.draw_most_frequent_scaffolds(testdict, graph_framework=True, number_of_structures=8, structures_per_row=4)

File ~/.local/lib/python3.10/site-packages/CIDER/cider.py:1489, in ChemicalDatasetComparator.draw_most_frequent_scaffolds(self, all_dicts, number_of_structures, structures_per_row, image_size, framework, graph_framework, normalize, data_type, figsize, fontsize_title, fontsize_subtitle)
   1487     continue
   1488 title_list.append(single_dict)
-> 1489 scaffolds = self._get_scaffold(
   1490     all_dicts[single_dict][self.import_keyname],
   1491     number_of_structures,
   1492     structures_per_row,
   1493     image_size,
   1494     framework,
   1495     graph_framework,
   1496     normalize,
   1497 )
   1498 image_list.append(scaffolds[0])
   1499 all_dicts[single_dict][self.scaffold_list_keyname] = scaffolds[1]

File ~/.local/lib/python3.10/site-packages/CIDER/cider.py:1424, in ChemicalDatasetComparator._get_scaffold(self, moleculeset, number_of_structures, structures_per_row, image_size, framework, graph_framework, normalize)
   1422 graph_framework_list = []
   1423 for mol in framework_list:
-> 1424     graph_framework_list.append(MurckoScaffold.MakeScaffoldGeneric(mol))
   1425 for mol in graph_framework_list:
   1426     structure_list.append(Chem.MolToSmiles(mol))

File ~/.local/lib/python3.10/site-packages/rdkit/Chem/Scaffolds/MurckoScaffold.py:47, in MakeScaffoldGeneric(mol)
     45   bond.SetBondType(Chem.BondType.SINGLE)
     46   bond.SetIsAromatic(False)
---> 47 return Chem.RemoveHs(res)

AtomValenceException: Explicit valence for atom # 40 C, 5, is greater than permitted
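One possible way to keep a single problematic structure from aborting the whole plot is to collect the failures and continue with the molecules that could be converted. The helper below is only a generic sketch, not a fix taken from CIDER: `map_skipping_failures` is a hypothetical name, and whether skipping is acceptable here (rather than repairing the structure) is a separate decision.

```python
def map_skipping_failures(func, items, exceptions=(Exception,)):
    """Apply `func` to each item; return (results, failures).

    `failures` holds (index, exception) pairs for the items where `func`
    raised one of `exceptions`, so they can be logged afterwards.
    """
    results = []
    failures = []
    for index, item in enumerate(items):
        try:
            results.append(func(item))
        except exceptions as err:
            failures.append((index, err))
    return results, failures


# Intended use with RDKit (not executed here):
#   generic, failed = map_skipping_failures(
#       MurckoScaffold.MakeScaffoldGeneric, framework_list
#   )
```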

PDF export in the Workflow notebook is not working

Branch: dev-Hannah
Commit: d339471
@hannbus

In the Workflow notebook, the last cell cider.export_all_picture_pdf() is not working.
Output:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [35], in <cell line: 1>()
----> 1 cider.export_all_picture_pdf()

AttributeError: 'ChemicalDatasetComparator' object has no attribute 'export_all_picture_pdf'

I guess the function call in the notebook has to be updated to export_all_figures_pdf() as it is now in the cider.py file.

draw_molecules with smi input does not work

When the dataset is imported from .smi or .txt files instead of .sdf files, draw_molecules fails with a ValueError:

ValueError                                Traceback (most recent call last)
c:\Users\hannah\Desktop\Data\Git\Not for upload\CIDER_Workflow_Copy.ipynb Cell 19 in <cell line: 1>()
----> 1 cider.draw_molecules(testdict, number_of_mols = 20, mols_per_row = 5)

File c:\Users\hannah\.conda\envs\CDC_neu\lib\site-packages\CIDER\cider.py:473, in ChemicalDatasetComparator.draw_molecules(self, all_dicts, number_of_mols, mols_per_row, image_size, data_type, figsize, fontsize_title, fontsize_subtitle)
    470     to_draw.append(all_dicts[single_dict][self.import_keyname][i])
    471 for mol in to_draw:
    472     atom0_pos = [
--> 473         mol.GetConformer().GetAtomPosition(0).x,
    474         mol.GetConformer().GetAtomPosition(0).y,
    475         mol.GetConformer().GetAtomPosition(0).z,
    476     ]
    477     atom1_pos = [
    478         mol.GetConformer().GetAtomPosition(1).x,
    479         mol.GetConformer().GetAtomPosition(1).y,
    480         mol.GetConformer().GetAtomPosition(1).z,
    481     ]
    482     if atom0_pos == atom1_pos:

ValueError: Bad Conformer Id

This should not be too hard to fix, I think. Should I do that in the next days sometime?
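Molecules built from SMILES carry no conformer, which is presumably why GetConformer() raises "Bad Conformer Id" in the traceback. A guard along the following lines might be enough. It only uses GetNumConformers(), GetNumAtoms(), GetConformer() and GetAtomPosition(); the function name `first_two_atoms_overlap` is an assumption, and what draw_molecules does after the comparison is not shown in the excerpt.

```python
def first_two_atoms_overlap(mol):
    """Return True if atoms 0 and 1 of `mol` have identical coordinates.

    Returns False instead of raising when the molecule has no conformer
    (e.g. it was created from SMILES) or has fewer than two atoms.
    """
    if mol.GetNumConformers() == 0 or mol.GetNumAtoms() < 2:
        return False
    conformer = mol.GetConformer()
    pos0 = conformer.GetAtomPosition(0)
    pos1 = conformer.GetAtomPosition(1)
    return (pos0.x, pos0.y, pos0.z) == (pos1.x, pos1.y, pos1.z)
```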

In the Workflow, "Get descriptor value with database ID" function does not work

Branch: dev-Hannah
Commit: d339471
@hannbus

In the Workflow notebook, the method "get_value_from_id" defined in the section "Get descriptor value with database ID" does not work.
Output:

Molecule found in set_chlorbenzene-5.sdf
Molecular Weight value for ID CNP0291002: 173.55499999999998
Molecule found in set_chlorbenzene.sdf
Molecular Weight value for ID CNP0291002: 173.55499999999998
Molecule found in set_phenole.sdf
Molecular Weight value for ID CNP0291002: 173.55499999999998
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [24], in <cell line: 1>()
----> 1 get_value_from_id(testdict, 'CNP0291002', 'Molecular Weight')

Input In [23], in get_value_from_id(all_dicts, wanted_id, descriptor_list_keyname)
      4 """
      5 This function returns a descriptor value for a specific molecule referred to by its database ID and
      6 the dataset where the molecule has been found.
   (...)
     15 
     16 """
     17 for single_dict in all_dicts:
---> 18     if wanted_id in all_dicts[single_dict][cider.database_id_keyname]:
     19         print("Molecule found in " + str(single_dict))
     20         index = all_dicts[single_dict][cider.database_id_keyname].index(
     21             wanted_id
     22         )

KeyError: 'coconut_id'

I think the problem is that the sub-dictionary "figures" has no "coconut_id" key, which makes sense because it does not contain molecules. You have to either explicitly ignore the "figures" sub-dict here or (preferably) check every sub-dict first for the "coconut_id" key or rather "cider.database_id_keyname" key.
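The preferred variant (checking each sub-dict for the ID key before using it) could be sketched as follows. The function name `find_datasets_with_id` and its return format are assumptions for illustration; in the notebook, `id_keyname` would be cider.database_id_keyname.

```python
def find_datasets_with_id(all_dicts, wanted_id, id_keyname):
    """Return (dataset_name, index) pairs for every dataset containing `wanted_id`.

    Sub-dicts without `id_keyname` (such as the "figures" entry) are skipped
    instead of raising a KeyError.
    """
    hits = []
    for dataset_name, sub_dict in all_dicts.items():
        ids = sub_dict.get(id_keyname)
        if ids is None:
            continue  # this sub-dict holds no molecules
        if wanted_id in ids:
            hits.append((dataset_name, ids.index(wanted_id)))
    return hits
```

The descriptor value can then be looked up per hit with the returned index, as the original function does.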

README.md adjustments

  • In readme, Python 3.8 needs to be updated to Python 3.10
  • White backgrounds for logos in readme?
  • Why is the logo in the workflow dir? Is there a better place for it? Place it in repo root dir

Catch warnings

  • Catch warnings specifically where they are produced using the warnings.catch_warnings() context manager

  • There are a couple of FutureWarnings:

    =============================== warnings summary ===============================
    Tests/test_functions.py::test_chemical_space_plot_tsne
    /home/runner/work/ChemIcal-DatasEt-comparatoR/ChemIcal-DatasEt-comparatoR/.tox/py310/lib/python3.10/site-packages/sklearn/manifold/_t_sne.py:795: FutureWarning: The default initialization in TSNE will change from 'random' to 'pca' in 1.2.

  • Full example about this from StackOverflow:

    import warnings
    
    # Custom warning class for the example
    class MyWarning(UserWarning):
        pass
    
    def function_that_raises_warning(num):
        print(num)
        warnings.warn("This is a warning", MyWarning)
    
    
    with warnings.catch_warnings():
        function_that_raises_warning(1) # raises warning
        warnings.filterwarnings(
            action="ignore", category=MyWarning, message="This is a warning"
        )
        function_that_raises_warning(2) # Warning is filtered
    function_that_raises_warning(3) # raises warning
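Applied to the FutureWarning case listed above, the same context manager can be wrapped around just the call that produces the warning. This is a minimal sketch using only the standard library; `call_without_future_warnings` is a hypothetical helper, and which CIDER call would be wrapped is left open.

```python
import warnings


def call_without_future_warnings(func, *args, **kwargs):
    """Call `func` with FutureWarning suppressed for this one call only."""
    with warnings.catch_warnings():
        # The filter is discarded again when the with-block exits.
        warnings.simplefilter("ignore", category=FutureWarning)
        return func(*args, **kwargs)
```

Because the filter is scoped to the with-block, FutureWarnings raised elsewhere in the workflow remain visible.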

Rename repository

Hey @hannbus,

Could you rename the repository to match the name in the README, ChemIcal DatasEt comparatoR (CIDER)?

Thanks

AttributeError in chemical_space_visualization from _t_sne.py

When plotting the chemical space visualization in a notebook with t-SNE as the dimension reduction method, an AttributeError is raised from sklearn\manifold\_t_sne.py. It says "AttributeError: 'list' object has no attribute 'shape'". For the other dimension reduction methods, the chemical space visualization works well. Any ideas what the problem is?

Update notebook

Required modifications:

  • delete function definitions from notebook
  • adapt all examples in the notebook to use the ChemicalDatasetComparator class methods
