
pura

Pura helps you clean chemical and reaction data. Right now, you can use it to resolve common chemical names (e.g., aspirin) to standard cheminformatics identifiers like SMILES.

You can now access pura using our web app!

Installation

pip install pura

Resolve compound identifiers

Compounds are often recorded as common names instead of a machine-readable identifier like SMILES.

There are several services that can do name resolution (PubChem, Chemical Identity Resolver, ChemSpider), and they sometimes disagree. Pura enables you to check several services asynchronously and ensure that a certain number agree on the resolved identifier. You can then discard or manually check the names that could not be resolved.

You can find a full list of services here.
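
The agreement check described above can be pictured with a short, self-contained sketch. This is not pura's implementation: the `fake_service` coroutine, the `resolve_with_agreement` helper, and the canned answers are hypothetical stand-ins, used only to illustrate concurrent queries followed by a vote.

```python
import asyncio
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical canned answers standing in for real web services.
CANNED: Dict[str, Dict[str, Optional[str]]] = {
    "service_a": {"aspirin": "CC(=O)Oc1ccccc1C(=O)O"},
    "service_b": {"aspirin": "CC(=O)Oc1ccccc1C(=O)O"},
    "service_c": {"aspirin": None},  # this service could not resolve the name
}


async def fake_service(service: str, name: str) -> Optional[str]:
    """Pretend to query a remote service; returns a SMILES string or None."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return CANNED[service].get(name)


async def resolve_with_agreement(
    name: str, services: List[str], agreement: int
) -> Optional[str]:
    """Query all services concurrently and keep the most common answer
    only if at least `agreement` services returned it."""
    answers = await asyncio.gather(*(fake_service(s, name) for s in services))
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    best, votes = counts.most_common(1)[0]
    return best if votes >= agreement else None


result_2 = asyncio.run(resolve_with_agreement("aspirin", list(CANNED), agreement=2))
result_3 = asyncio.run(resolve_with_agreement("aspirin", list(CANNED), agreement=3))
print(result_2)
print(result_3)
```

In this sketch two of the three stand-in services agree, so a threshold of 2 yields a result while a threshold of 3 yields `None`. The actual pura call is shown next.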

# Import pura
from pura.resolvers import resolve_identifiers
from pura.compound import CompoundIdentifierType
from pura.services import PubChem, CIR, CAS

# Resolve names to SMILES
resolved = resolve_identifiers(
    [
        "Josiphos SL-J001-1",
        "Rh(NBD)2BF4",
        "Dichloro(p-cymene)ruthenium(II) dimer",
        "DuPhos",
    ],
    input_identifer_type=CompoundIdentifierType.NAME,
    output_identifier_type=CompoundIdentifierType.SMILES,
    backup_identifier_types=[
        CompoundIdentifierType.INCHI_KEY,
        CompoundIdentifierType.CAS_NUMBER,
    ],
    services=[PubChem(autocomplete=True), CIR(), CAS()],
    agreement=1,
    silent=True,
)
print("\nResults\n")
for input_compound, resolved_identifiers in resolved:
    print(input_compound, resolved_identifiers, "\n")
# Josiphos SL-J001-1 [CompoundIdentifier(identifier_type=<CompoundIdentifierType.SMILES: 2>, value='C1CCCC1.CC(C1CCCC1P(c1ccccc1)c1ccccc1)P(C1CCCCC1)C1CCCCC1.[Fe]', details=None)]

# Rh(NBD)2BF4 [CompoundIdentifier(identifier_type=<CompoundIdentifierType.SMILES: 2>, value='C1=CC2C=CC1C2.C1=CC2C=CC1C2.F[B-](F)(F)F.[Rh]', details=None)]

# Dichloro(p-cymene)ruthenium(II) dimer [CompoundIdentifier(identifier_type=<CompoundIdentifierType.SMILES: 2>, value='Cc1ccc(C(C)C)cc1.Cc1ccc(C(C)C)cc1.Cl[Ru]Cl.Cl[Ru]Cl', details=None)]

# DuPhos [CompoundIdentifier(identifier_type=<CompoundIdentifierType.SMILES: 2>, value='CC(C)C1CCC(C(C)C)P1c1ccccc1P1C(C(C)C)CCC1C(C)C', details=None)]
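
Once results come back, it can help to separate the names that resolved from those that need manual checking. The sketch below assumes the result is a list of `(name, identifiers)` pairs in which an unresolved name carries `None` or an empty list. It uses plain strings instead of `CompoundIdentifier` objects, and `split_results` and the two made-up names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for the list returned by resolve_identifiers:
# plain strings are used here instead of CompoundIdentifier objects.
sample_results: List[Tuple[str, Optional[List[str]]]] = [
    ("Rh(NBD)2BF4", ["C1=CC2C=CC1C2.C1=CC2C=CC1C2.F[B-](F)(F)F.[Rh]"]),
    ("made-up ligand name", None),  # nothing resolved
    ("another made-up name", []),   # empty result, treated the same way
]


def split_results(
    results: List[Tuple[str, Optional[List[str]]]]
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Return (resolved, unresolved): a name -> identifiers mapping and
    a list of names that should be discarded or checked by hand."""
    resolved: Dict[str, List[str]] = {}
    unresolved: List[str] = []
    for name, identifiers in results:
        if identifiers:  # a non-empty list counts as resolved
            resolved[name] = identifiers
        else:            # None or an empty list counts as unresolved
            unresolved.append(name)
    return resolved, unresolved


resolved_map, needs_review = split_results(sample_results)
print(resolved_map)
print(needs_review)
```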

Development

  1. Install poetry using the following or via the instructions here:

    curl -sSL https://install.python-poetry.org | python3 -
  2. Clone the repository:

    git clone https://github.com/sustainable-processes/pura.git
    cd pura
  3. Install the dependencies from poetry.lock:

    poetry install

  4. Once you make some changes, commit and push:

    git commit -am "<YOUR COMMIT MESSAGE>"
    git push

Subtleties of name resolution

Molecules are often referred to by an English name. However, the same molecule can have many different names, and different molecules can have very similar (and possibly even identical) names. As an example, consider these two very similar names that refer to two different molecules:

  • Phenyl acetate is the ester of phenol and acetic acid (CC(=O)Oc1ccccc1)
  • Phenylacetate is an organic compound containing a phenyl functional group and a carboxylic acid functional group (O=C(O)Cc1ccccc1)

Name resolution can be further complicated by formal charges. Phenylacetate (a.k.a. phenylacetic acid) is a carboxylic acid, so in water it is found both as O=C(O)Cc1ccccc1 and as O=C([O-])Cc1ccccc1. Indeed, when querying services, both the charged and the uncharged molecule were returned, which led to a lack of agreement even though the services had the same idea about what the molecule was.
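
To make the charge problem concrete, here is a deliberately naive sketch that compares the two SMILES strings above before and after rewriting `[O-]` as `O`. The `neutralise_oxyanion` helper is hypothetical and purely textual. It would give wrong answers for nitro groups, N-oxides, salts, and many other cases, so a cheminformatics toolkit would be the safer choice for real data.

```python
def neutralise_oxyanion(smiles: str) -> str:
    """Naively rewrite a deprotonated oxygen '[O-]' as neutral 'O'.

    Assumption: the only charged atom is a carboxylate-style oxygen.
    This is a string replacement, not real chemistry-aware neutralisation.
    """
    return smiles.replace("[O-]", "O")


acid = "O=C(O)Cc1ccccc1"       # phenylacetic acid, as given in the text
anion = "O=C([O-])Cc1ccccc1"   # its deprotonated form, as given in the text

# Compared as raw strings, the two answers disagree.
raw_match = acid == anion

# After the naive neutralisation they compare equal.
normalised_match = neutralise_oxyanion(acid) == neutralise_oxyanion(anion)

print(raw_match, normalised_match)
```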

Finally, the presence or absence of stereochemical information can also cause disagreement between services (discussed in Issue #45). An example would be:

  • Given the molecule: (E)-2-butenenitrile
  • PubChem will resolve to: ['C/C=C/C#N']
  • CIR will resolve to: ['CC=CC#N']

Using agreement=2 requires at least two data providers to agree with each other. Ambiguous cases are then flagged, since Pura would return None rather than a possibly wrong result.
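
The 2-butenenitrile case can be sketched as follows. This is an illustration under stated assumptions, not pura's code: `consensus` and `strip_double_bond_stereo` are hypothetical helpers, and dropping the `/` and `\` markers is a plain text operation that discards double-bond geometry, which is only acceptable if that geometry does not matter for your project.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Answers quoted in the text above for the same input name.
answers: Dict[str, str] = {"PubChem": "C/C=C/C#N", "CIR": "CC=CC#N"}


def strip_double_bond_stereo(smiles: str) -> str:
    """Drop the '/' and '\\' double-bond stereo markers from a SMILES string."""
    return smiles.replace("/", "").replace("\\", "")


def consensus(values: Iterable[str], agreement: int) -> Optional[str]:
    """Return the most common value if at least `agreement` sources gave it."""
    best, votes = Counter(values).most_common(1)[0]
    return best if votes >= agreement else None


# Strict comparison: the two strings differ, so no value reaches two votes.
strict = consensus(answers.values(), agreement=2)

# Ignoring double-bond stereo: both answers collapse to the same string.
relaxed = consensus(
    [strip_double_bond_stereo(s) for s in answers.values()], agreement=2
)

print(strict)
print(relaxed)
```

Whether the relaxed comparison is appropriate depends on whether the E/Z information is needed downstream.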

How these disagreements should be resolved depends on the context, so it is probably not possible to apply a standardised way of resolving conflicts. Rather, researchers should be aware of these subtleties and make informed decisions that fit the goals of their own projects.

Contributors

marcosfelt, gz82, ad1arsh, dswigh
