Comments (12)

jchodera avatar jchodera commented on July 19, 2024

Oh dear.

from freesolv.

davidlmobley avatar davidlmobley commented on July 19, 2024

I asked @bannanc to post this. I've previously filtered for duplicates, so I'll have to dig in and figure out what's different about my prior duplicate filtering versus her duplicate filtering. Though, er, it does look somewhat like it could be an issue of chirality for three of these since the compound names clearly indicate chirality not reflected in the SMILES Caitlin has. It's not clear WHY this would be, though, as she does appear to be using isomeric SMILES strings. But, well, (2R)-1,1,1-trifluoropropan-2-ol vs (2S)-1,1,1-trifluoropropan-2-ol...

jchodera avatar jchodera commented on July 19, 2024

But 2-acetoxyethyl acetate vs 2-acetoxyethyl acetate?

Sounds like an erratum may be in our future.

bannanc avatar bannanc commented on July 19, 2024

Aren't there ways to indicate the chirality in the SMILES? I thought isomeric SMILES were supposed to include chirality, is there a different OE function I should use to get the SMILES?
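
For reference, SMILES does encode tetrahedral chirality with `@` and `@@` inside bracket atoms, and double-bond geometry with `/` and `\`. The following is a minimal pure-Python sketch, not part of the original thread, of why dropping those marks makes two enantiomers look like duplicates. The `strip_stereo` helper and the two example strings are illustrative assumptions; a real workflow would rely on a cheminformatics toolkit rather than string edits.

```python
import re


def strip_stereo(smiles: str) -> str:
    """Crudely remove stereo marks (@, @@, /, backslash) from a SMILES string.

    Hypothetical helper for illustration only: it edits text and does not
    parse the molecule.
    """
    return re.sub(r"[@/\\]", "", smiles)


# Two enantiomers of 1,1,1-trifluoropropan-2-ol differ only in @ vs @@.
# Which string is R and which is S is deliberately not asserted here.
enantiomer_a = "C[C@H](O)C(F)(F)F"
enantiomer_b = "C[C@@H](O)C(F)(F)F"

# With stereo marks kept, the strings are distinct.
distinct_with_stereo = enantiomer_a != enantiomer_b

# With stereo marks removed, both collapse to the same string,
# so a string-based duplicate check would flag them as one compound.
collapsed = strip_stereo(enantiomer_a) == strip_stereo(enantiomer_b)

print(distinct_with_stereo, collapsed)
```

If a duplicate filter compares strings that have lost their stereo marks, pairs like the (2R)/(2S) compounds mentioned above would be reported as duplicates even though they are distinct molecules.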

jchodera avatar jchodera commented on July 19, 2024

I thought it was just OEMolToSmiles.

How do we prevent these issues from happening in the future? What can we do to refine the process to avoid these kinds of mistakes?

bannanc avatar bannanc commented on July 19, 2024

OK, if I use OEMolToSmiles, the only duplicate I see is the 2-acetoxyethyl acetate.

davidlmobley avatar davidlmobley commented on July 19, 2024

@jchodera :

"How do we prevent these issues from happening in the future? What can we do to refine the process to avoid these kinds of mistakes?"

These are all results of the original "data archeology" process we're not able to curate. Specifically, for every single data point here, at some point in the past, some human (not necessarily us -- many of these come from Rizzo's compilations or even earlier compilations) took a structure in a table in a paper and did SOMETHING to it to get a name and a molecular structure which ended up in a mol2 file. Because this was a time-consuming, human-intensive process, it was error-prone (names and structures not matching, duplicate molecules, names and structures being consistent but not matching the intended molecule, etc.). I've been able to detect and remove many of the errors over the years through the various curation steps I did, but it seems like each new time I/we come up with a slightly different way of processing the whole thing we come up with one or two new issues. I'm quite confident that there is NO way of making sure the whole thing is perfect. (Even if you got a magical robot which could redo all of the experiments in an automated way, generate all of the experimental data from scratch, and re-create all of the structures/IUPAC names/SMILES all in one go, you'd STILL have the problem that some of the compound vendors will have sent you the wrong compounds, etc.)

One could in principle go back to the original literature and pull all of the data again and cross-check against what we have here, but that would be equally time-consuming, human-intensive, and error-prone, not to mention the fact that some of what is here actually represents CORRECTIONS to the literature (finding mistakes in literature tables, etc.).

The idea of an erratum reminds me of one mistake I made in the latest FreeSolv update paper. In the PREVIOUS paper, I had planned that I would not issue errata unless they would significantly affect our conclusions, so I indicated clearly that all further updates to the database would be made on the FreeSolv repo itself. I forgot to do that in this paper, so we may need to issue an erratum that (a) adds any corrections resulting from this issue, and (b) makes clear that all further updates will be made on the GitHub repo rather than via erratum.

(Errata are a terrible place for corrections to databases since one potentially might need to make many such corrections, such as if new experimental values become available or existing ones are better curated.)

davidlmobley avatar davidlmobley commented on July 19, 2024

@bannanc :

I'd always used code more like yours:

oechem.OECreateIsoSmiString(mol)

So I'm curious to understand the difference between these.

bannanc avatar bannanc commented on July 19, 2024

@davidlmobley
That was my impression as well. I had an issue with smirky, where I use a molecule's SMILES string as a dictionary key. Using OEMolToSmiles didn't work: when it regenerated a SMILES string for a molecule, the new string couldn't be found in the dictionary. OECreateIsoSmiString, by contrast, always created the same SMILES string. However, it looks like OECreateIsoSmiString doesn't include the characters to indicate chirality/isomers.
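
The underlying requirement is that a SMILES string used as a dictionary key must pass through the same canonicalization routine on both insert and lookup. A minimal sketch of that pattern follows, assuming a hypothetical `CanonicalKeyDict` wrapper and a toy lookup table standing in for a real toolkit call; none of this is smirky's actual code.

```python
from typing import Callable, Dict


class CanonicalKeyDict:
    """Dictionary wrapper that canonicalizes every key on write and on read."""

    def __init__(self, canonicalize: Callable[[str], str]) -> None:
        self._canonicalize = canonicalize
        self._data: Dict[str, object] = {}

    def __setitem__(self, smiles: str, value: object) -> None:
        self._data[self._canonicalize(smiles)] = value

    def __getitem__(self, smiles: str) -> object:
        return self._data[self._canonicalize(smiles)]

    def __contains__(self, smiles: str) -> bool:
        return self._canonicalize(smiles) in self._data


# Stand-in canonicalizer (assumption): a tiny lookup table, not a toolkit call.
# It maps the Kekule form of benzene onto the aromatic form.
_TOY_CANONICAL = {"C1=CC=CC=C1": "c1ccccc1"}


def toy_canonicalize(smiles: str) -> str:
    return _TOY_CANONICAL.get(smiles, smiles)


labels = CanonicalKeyDict(toy_canonicalize)
labels["C1=CC=CC=C1"] = "benzene"   # stored under the canonical key

# Lookup succeeds with either spelling because both are canonicalized first.
print("c1ccccc1" in labels, "C1=CC=CC=C1" in labels)
```

Routing every access through one function means the choice of canonicalizer can change later without touching the code that reads or writes the dictionary.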

davidlmobley avatar davidlmobley commented on July 19, 2024

"However it looks like OECreateIsoSmiString doesn't include the characters to indicate chirality/isomers."

Hmm, that seems very odd, as I've used it for this many times in the past. I'm thinking there's something specific in how you're using it here (perhaps what processing you have or have not done on the molecule first) that is making it not provide this info. I'll have to dig in.

davidlmobley avatar davidlmobley commented on July 19, 2024

OK, so to update on this:

  • It looks like OEMolToSmiles is what we want; I'm checking with Support on why OECreateIsoSmiString sometimes leads to different behavior (the docs leave it unclear and suggest BOTH for generating canonical isomeric SMILES)
  • There IS one duplicate pair here, mobley_4689084.mol2 vs mobley_352111.mol2
  • I think they sneaked in because at one point they had different names AND different SMILES (due to non-canonical SMILES) so my checks (checking the SMILES string for each molecule against the canonical SMILES for all other molecules) never caught the duplicate
  • The duplicate sneaked in because of an error by whoever was running SAMPL1 (OpenEye/Guthrie) -- they apparently included this compound in the prediction challenge under one name even though it was already in public databases (such as Rizzo's and mine) under another, and this mistake was never caught.

While having duplicates is bad, this is about as benign a duplicate as could possibly happen, in that the experimental value reported in both cases was identical, and the calculated values are within uncertainty of one another, so the overall effect is minor.
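
A check that avoids the failure described above canonicalizes every entry before comparing, then groups entries by the resulting key. The sketch below assumes a hypothetical `find_duplicates` helper, made-up entry IDs, and a toy lookup table in place of a real canonicalizer such as OEMolToSmiles; it is not the script actually used for FreeSolv. The two strings for 2-acetoxyethyl acetate are simply two valid ways of writing the same molecule, and which one is treated as "canonical" here is arbitrary.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def find_duplicates(
    smiles_by_id: Dict[str, str],
    canonicalize: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Group entry IDs by canonical SMILES and return groups with more than one ID."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for entry_id, smiles in smiles_by_id.items():
        groups[canonicalize(smiles)].append(entry_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Toy canonicalizer (assumption): maps one alternative spelling onto another.
_TOY_CANONICAL = {"C(COC(C)=O)OC(C)=O": "CC(=O)OCCOC(=O)C"}


def toy_canonicalize(smiles: str) -> str:
    return _TOY_CANONICAL.get(smiles, smiles)


# Illustrative entries only; the IDs are invented for this example.
entries = {
    "entry_1": "CC(=O)OCCOC(=O)C",      # 2-acetoxyethyl acetate
    "entry_2": "C(COC(C)=O)OC(C)=O",    # same molecule, written differently
    "entry_3": "CCO",                   # ethanol
}

duplicates = find_duplicates(entries, toy_canonicalize)
print(duplicates)
```

Comparing raw strings would miss the pair above, because the two spellings differ; comparing canonical keys on both sides catches it.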

davidlmobley avatar davidlmobley commented on July 19, 2024

For the record, this is info from James Haigh at OpenEye support:

It looks like we need to update the glossary part of the documentation to use OEMolToSmiles rather than OECreateIsoSmiString. OECreateIsoSmiString absolutely creates a canonical isomeric SMILES but only of the exact molecule that is present. OEMolToSmiles performs several perception calls on the molecule to ensure more consistency in the SMILES output.

Basically if you are reading molecules from different input sources they may be perceived in multiple different ways depending on the input file format or the method used to read. There are also multiple aromaticity models. OEMolToSmiles does perception to ensure consistency e.g. applying the OpenEye aromaticity model, perceiving stereochemistry etc.

A simple case is the Kekulé form of benzene. If I read that using OEParseSmiles and then generate a SMILES using OECreateIsoSmiString, I get C1=CC=CC=C1 out. But with either OEReadMolecule to read it, or OEMolToSmiles to generate a SMILES, I get c1ccccc1. See attached example.

Please let me know if you have any questions.
