Giter Club home page Giter Club logo

rxn-reaction-preprocessing's Introduction

RXN reaction preprocessing

Actions tests

This repository is devoted to preprocessing chemical reactions: standardization, filtering, etc. It also includes code for stable train/test/validation splits and data augmentation.

Links:

System Requirements

This package is supported on all operating systems. It has been tested on the following systems:

  • macOS: Big Sur (11.1)
  • Linux: Ubuntu 18.04.4

A Python version of 3.7 or greater is recommended.

Installation guide

The package can be installed from Pypi:

pip install rxn-reaction-preprocessing[rdkit]

You can leave out [rdkit] if you prefer to install rdkit manually (via Conda or Pypi).

For local development, the package can be installed with:

pip install -e ".[dev]"

Usage

The following command line scripts are installed with the package.

rxn-data-pipeline

Wrapper for all other scripts. Allows constructing flexible data pipelines. Entrypoint for Hydra structured configuration.

For an overview of all available configuration parameters and default values, run: rxn-data-pipeline --cfg job.

Configuration using YAML (see the file config.py for more options and their meaning):

defaults:
  - base_config

data:
  path: /tmp/inference/input.csv
  proc_dir: /tmp/rxn-preproc/exp
common:
  sequence:
    # Define which steps and in which order to execute:
    - IMPORT
    - STANDARDIZE
    - PREPROCESS
    - SPLIT
    - TOKENIZE
  fragment_bond: TILDE
preprocess:
  min_products: 0
split:
  split_ratio: 0.05
tokenize:
  input_output_pairs:
    - inp: ${data.proc_dir}/${data.name}.processed.train.csv
      out: ${data.proc_dir}/${data.name}.processed.train
    - inp: ${data.proc_dir}/${data.name}.processed.validation.csv
      out: ${data.proc_dir}/${data.name}.processed.validation
    - inp: ${data.proc_dir}/${data.name}.processed.test.csv
      out: ${data.proc_dir}/${data.name}.processed.test
rxn-data-pipeline --config-dir . --config-name example_config

Configuration using command line arguments (example):

rxn-data-pipeline \
  data.path=/path/to/data/rxns-small.csv \
  data.proc_dir=/path/to/proc/dir \
  common.fragment_bond=TILDE \
  rxn_import.data_format=TXT \
  tokenize.input_output_pairs.0.out=train.txt \
  tokenize.input_output_pairs.1.out=validation.txt \
  tokenize.input_output_pairs.2.out=test.txt

Note about reading CSV files

Pandas appears not to always be able to write a CSV and re-read it if it contains Windows carriage returns. In order for the scripts to work despite this, all the pd.read_csv function calls should include the argument lineterminator='\n'.

Examples

A pipeline supporting augmentation

A config supporting augmentation of the training split called train-augmentation-config.yaml:

defaults:
  - base_config

data:
  name: pipeline-with-augmentation
  path: /tmp/file-with-reactions.txt
  proc_dir: /tmp/rxn-preprocessing/experiment
common:
  sequence:
    # Define which steps and in which order to execute:
    - IMPORT
    - STANDARDIZE
    - PREPROCESS
    - SPLIT
    - AUGMENT
    - TOKENIZE
  fragment_bond: TILDE
rxn_import:
  data_format: TXT
preprocess:
  min_products: 1
split:
  input_file_path: ${preprocess.output_file_path}
  split_ratio: 0.05
augment:
  input_file_path: ${data.proc_dir}/${data.name}.processed.train.csv
  output_file_path: ${data.proc_dir}/${data.name}.augmented.train.csv
  permutations: 10
  tokenize: false
  random_type: rotated
tokenize:
  input_output_pairs:
    - inp: ${data.proc_dir}/${data.name}.augmented.train.csv
      out: ${data.proc_dir}/${data.name}.augmented.train
      reaction_column_name: rxn_rotated
    - inp: ${data.proc_dir}/${data.name}.processed.validation.csv
      out: ${data.proc_dir}/${data.name}.processed.validation
    - inp: ${data.proc_dir}/${data.name}.processed.test.csv
      out: ${data.proc_dir}/${data.name}.processed.test
rxn-data-pipeline --config-dir . --config-name train-augmentation-config

rxn-reaction-preprocessing's People

Contributors

avaucher avatar helderlopes97 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

rxn-reaction-preprocessing's Issues

Sphinx build failing

Caused by the GitHub action for building the docs, link.

It relies on a default docker image that points to the latest Python version, which changed to 3.11 a few weeks ago. The actual failure comes from the xxhash package, which does not have wheels for 3.11 and fails during compilation.

Possible solution: ammaraskar/sphinx-action#43 (comment)

Improve / change products_single_atoms filter

Imported from internal issue number 24 (Dec 2020).


Looking into the cases why 2000 reactions from [a project] were marked by this filter,

        if self.products_single_atoms(reaction):
            valid = False
            reasons.append("products_single_atoms")

it appears that actually most of them (maybe 90%) are actually rather "No product". This is mainly reactions that have the same starting material and product.

I would therefore:

  • Add a "no product" filter
  • Reformulate the products_single_atoms filter so that it does not return True in this case.

Thinking a bit further, I am not sure if we really need this "single atom" filter. The remaining 10% cases are reactions in which we had, in the beginning, sth like X>>X.Cl or X>>X.[NH4+]. We need to consider:

  • There are also cases in which the ion has more than one heavy atom (triflate, ...)
  • Mainly it's also about renaming the filter. I'd say that "single atoms" are the symptom, not the actual reason why we remove the reaction.

Similarly, products_subset_of_reactants currently returns True if there are no products - if we add the "no product" filter, this one is not needed anymore.

What may be to discuss is the following:
Currently, it looks to me like the molecules present on both sides are removed before entering the function with the filters - which leads to the problem(s) above. If I see it correctly, this is not necessary: products_subset_of_reactants would mark the reaction as invalid anyway and so it would be filtered afterwards anyway.
[@others] anything I am missing there?


Decision:

update implementations of product_single_atom and products_subset_of_reactants so that they are not tagged when the products are empty.

Add a column in csv, sth like rxn_before_smiles_processing

Ignore reactions where we know that names were extracted incorrectly

Imported from internal issue number 102 (Feb 2022).


For instance: "methylacetophenone" is assigned CCC(=O)c1ccccc1, which is not correct.

In this case, the best thing is probably to remove the full reaction record.

Note: it is unclear if this filtering step should be done in this repo or in another one...

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.