guthriesolv's Introduction

The Guthrie Hydration Free Energy Database of Experimental Small Molecule Hydration Free Energies

This repository provides access to the late J. Peter Guthrie's small molecule hydration free energy database, which was donated posthumously to the community. If you are interested in using the data provided here, please read the relevant background information and disclaimers below and consider contributing to curation of the dataset.

Background information and disclaimers

Death of the primary author

For some years, J. Peter Guthrie (University of Western Ontario) worked passionately on curating a massive database of experimental hydration free energies that he pulled from the literature. Some of these were used for the SAMPL series of challenges over the years, and others provided some assistance in curation of FreeSolv, which Peter co-authored with me (DLM). Peter was uniquely qualified for this database curation effort, with deep understanding of the experimental techniques, the extrapolations commonly employed, and so on. But the project was massive and the literature immense, and the task outlasted him. He died September 19, 2017, at age 76, after a battle with Guillain-Barré syndrome.

Succession plans

Peter apparently expected that the task might outlast him, as he left his son, James Guthrie, instructions to contact me, Anthony Nicholls (OpenEye), and Paul Labute (CCG) in the event of his death. None of the three of us has many resources to invest in continuing the curation process at present; at the same time, we believe this data and the underlying work and references will have considerable long-term value to the community. So after discussion, we decided the best path forward was simply to make available what Peter and James provided, so that the community can use and curate it. James gave permission to post this data publicly to allow this effort to continue.

Disclaimers

We provide two different types of data

This dataset consists of two parts which are expected to become significantly different:

  1. An original Excel spreadsheet, which is provided exactly as it arrived from the Guthrie family. This is provided in an "as is" format and you should use it at your own risk; we have no information about its contents beyond what is in the spreadsheet itself and in this GitHub repository. No changes to this spreadsheet will be made.
  2. A current database, which is initially an export of the contents of the Excel database, but is expected to become an independent entity based on community curation.

Use both versions at your own risk

We make no warranty as to the contents or usefulness of either dataset; both are provided as resources to the community but must be used with caution and with your own consultation of the literature.

Curation of the dataset

Our hope is that the community will get involved with curation of the dataset provided here -- in particular, the "current database" (the Excel spreadsheet should be left in its original form). Suggested improvements should come in via pull requests, where each pull request provides proposed modifications (including potentially supporting tools/scripts, data, references, or links to the same) and a clear explanation of these changes. Thus, over time the current, curated database is expected to move away from simply reflecting the contents of the Excel spreadsheet and become more valuable.

Some specific points of curation which will be needed include:

  • Separation of different types of data; for example, the main tab in the database Excel spreadsheet (and the data in guthrie_database.csv) contains not just hydration free energies but other properties with other units, e.g. the entries for phenol include values reported in mg/L, g/m^3, etc.
  • Unit handling; values are present in both kJ/mol and kcal/mol
  • Checking of molecule names against SMILES and stereochemistry; I (DLM) previously gave Peter some tools to help with this, but I do not know whether he used them
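
As a rough illustration of the first two points, the sketch below separates energy values from other properties and converts the energies to a single unit. The record layout (`name`, `value`, `units`), the function name, and the sample values are all invented for illustration; the actual columns in guthrie_database.csv may be organized differently.

```python
KJ_PER_KCAL = 4.184  # thermochemical calorie: 1 kcal = 4.184 kJ


def to_kcal_per_mol(value, units):
    """Convert an energy to kcal/mol; return None for non-energy units."""
    normalized = units.strip().lower()
    if normalized == "kcal/mol":
        return value
    if normalized == "kj/mol":
        return value / KJ_PER_KCAL
    return None  # e.g. mg/L or g/m^3: not an energy, needs separate handling


# Hypothetical records with made-up values; not taken from the database.
records = [
    {"name": "example A", "value": -5.0, "units": "kcal/mol"},
    {"name": "example B", "value": -20.92, "units": "kJ/mol"},
    {"name": "example C", "value": 100.0, "units": "mg/L"},
]

energies = []  # records expressed in kcal/mol
other = []     # records with non-energy units, left untouched
for record in records:
    converted = to_kcal_per_mol(record["value"], record["units"])
    if converted is None:
        other.append(record)
    else:
        energies.append({**record, "value": converted, "units": "kcal/mol"})
```

Keeping the unconverted records in a separate list, rather than discarding them, leaves room to curate the non-energy properties later.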

See also usage_notes.md for some information which relates to the contents.

Manifest

  • GuthrieDatabase_April14.zip: Guthrie database (Excel spreadsheet) as it was provided
  • guthrie_database.csv: Exported csv file of main tab of Excel spreadsheet
  • guthrie_references_and_status.csv: Additional tab of Excel spreadsheet which provides definitions of the references and reports on Peter's progress in extracting data from those references; may highlight other areas where more data is still available

There is also data/curation work in an additional tab of the spreadsheet, Sheet 2, which may be useful but is not present here as a separate file yet.

Using the dataset

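The dataset can be loaded in Python using pandas, for example:

```python
import pandas

# Read the exported main tab; the encoding is set explicitly to Latin-1
db = pandas.read_csv('guthrie_database.csv', encoding='latin1')
# Select every row whose Name column is exactly 'phenol'
data = db[db.Name == 'phenol']
```

This loads the database and extracts all rows for the molecule named phenol.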

Maintenance

This repository has data quality assurance tests implemented in Python that can be run with tox using the following commands:

$ git clone git@github.com:MobleyLab/GuthrieSolv.git
$ cd GuthrieSolv
$ pip install tox
$ tox
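
To give a sense of what a data quality check can look like, here is a minimal sketch that flags rows with an empty SMILES cell. It is not the repository's actual test code: the column name `SMILES`, the function name, and the sample input are assumptions, and it uses only the standard library. Checking that a SMILES string can be parsed would additionally need a cheminformatics toolkit, which is not shown here.

```python
import csv
import io


def rows_missing_smiles(csv_file, column="SMILES"):
    """Return the 1-based data-row numbers whose SMILES cell is empty."""
    reader = csv.DictReader(csv_file)
    missing = []
    for row_number, row in enumerate(reader, start=1):
        value = row.get(column)
        # A missing column, a short row, or a blank cell all count as empty.
        if value is None or not value.strip():
            missing.append(row_number)
    return missing


# Toy input standing in for guthrie_database.csv; the rows are invented.
sample = io.StringIO("Name,SMILES\nphenol,c1ccccc1O\nmystery,\n")
missing = rows_missing_smiles(sample)
```

A test runner could then assert that the returned list is empty for the real file.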

Authors

Primary author

  • J. Peter Guthrie (University of Western Ontario)

Other contributors

  • David L. Mobley, UC Irvine, who maintains this repository with help from the community
  • Chris Hoyt, who helped with CI and data integrity tests
  • Probably students and others who worked with Dr. Guthrie over the years, but I (DLM) do not have their information

Changelog

  • 2021-12-20: Added CI testing to ensure SMILES are non-null and parseable; added code quality checks; set up GitHub Actions to ensure the tests run and keep working.

Citing this work

Please cite this GitHub repository, as well as "The Guthrie Hydration Free Energy Database of Experimental Small Molecule Hydration Free Energies," J. Peter Guthrie and David L. Mobley, eScholarship, https://escholarship.org/uc/item/53n2h10t.

We maintain archival copies of this repository on eScholarship, administered by the University of California, in order to ensure long term access. New versions will also be posted there.

Acknowledgments

  • James Guthrie, who made this data available and gave permission to post it publicly; he does not want any credit for this, but he should certainly be acknowledged.

(To be updated as people contribute)

guthriesolv's Issues

Set up travis-CI testing

Should set up tests that attempt to process the main csv file and (possibly) ensure that all the SMILES can be processed and convert to the specified names

Mentorship opportunity

@davidlmobley I was thinking of how I would go about cleaning up/exploring the data in this repo now that there's a little infrastructure, but it occurred to me that this would be an excellent opportunity to mentor an excellent undergraduate or green graduate student on how practical data science goes (beyond using the iris dataset 😆) and a bit of machine learning. Let me know if someone immediately comes to mind, otherwise I will continue taking a crack at it myself

Need description of columns and values in columns

Right now it's pretty hard to figure out what's actually in the database - it would be great to have an explanation of what each column is in the README as well as the kinds of data that might be in each (especially the categorical fields, like for the measurement types)
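
One possible starting point for such a description is a quick per-column summary listing the distinct values of low-cardinality (likely categorical) columns. The sketch below uses only the standard library; the function name, the `max_distinct` threshold, and the sample columns are hypothetical and are not the real headers of the database.

```python
import csv
import io
from collections import Counter


def summarize_columns(csv_file, max_distinct=10):
    """Map each column to its value counts, or to a note if it has many values."""
    reader = csv.DictReader(csv_file)
    counts = {name: Counter() for name in (reader.fieldnames or [])}
    for row in reader:
        for name in counts:
            counts[name][row[name]] += 1
    summary = {}
    for name, counter in counts.items():
        if len(counter) <= max_distinct:
            summary[name] = dict(counter)  # few values: likely categorical
        else:
            summary[name] = f"{len(counter)} distinct values"
    return summary


# Toy input with invented columns and rows.
sample = io.StringIO("Name,Units\nA,kJ/mol\nB,kcal/mol\nC,kJ/mol\n")
summary = summarize_columns(sample, max_distinct=2)
```

Running something like this over the real file and pasting the categorical columns into the README would be a first pass at the documentation requested here.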

Switch from CSV to TSV

There are a few major issues with CSV files:

  1. Commas may pop up inside data, which means cells need to be quoted
  2. There are many different flavors of quoting schemes. This is very difficult to convey
  3. Commas themselves create a huge amount of noise when reading a CSV document

TSV uses tabs instead of commas and has none of these problems as tabs should not be in the data itself.

I think it should be possible to directly import and export the table through pandas.
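
A minimal sketch of that round trip, assuming pandas is installed. The in-memory table and its column names are made up for illustration; for the real file, the `io.StringIO` objects would be replaced by file paths.

```python
import io

import pandas as pd

# Toy stand-in for guthrie_database.csv; one cell contains a comma,
# so it has to be quoted in the CSV form.
csv_text = 'Name,Note\nphenol,"reported in mg/L, g/m^3"\n'

df = pd.read_csv(io.StringIO(csv_text))

# Write tab-separated text; index=False avoids adding a row-number column.
tsv_text = df.to_csv(sep="\t", index=False)

# Read it back to confirm the round trip preserves the table.
df_back = pd.read_csv(io.StringIO(tsv_text), sep="\t")
```

With a tab delimiter, the cell containing a comma no longer needs quoting, which is the point made in item 1 above. Cells that themselves contain tabs or newlines would still be quoted.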

Use standard identifiers for references

Typically, people expect DOIs, PubMed identifiers (PMIDs), or arXiv identifiers for publications. The bibliography in this repository is a bit old school, which means that the provenance information associated with each row of the database isn't really machine-readable nor actionable. It would be a great contribution if someone could untangle all of the references in https://github.com/MobleyLab/GuthrieSolv/blob/master/guthrie_references_and_status.csv and replace them with DOIs or PMIDs.
