RXN reaction preprocessing

This repository is devoted to preprocessing chemical reactions: standardization, filtering, etc. It also includes code for stable train/test/validation splits and data augmentation.

System Requirements

This package is supported on all operating systems. It has been tested on the following systems:

macOS: Big Sur (11.1)
Linux: Ubuntu 18.04.4

Python 3.7 or later is recommended.

Installation guide

The package can be installed from PyPI:

pip install rxn-reaction-preprocessing[rdkit]

You can leave out [rdkit] if you prefer to install RDKit manually (via Conda or PyPI).

For local development, the package can be installed with:

pip install -e ".[dev]"

Usage

The following command line scripts are installed with the package.

rxn-data-pipeline

Wrapper for all other scripts. Allows constructing flexible data pipelines. Entrypoint for Hydra structured configuration.

For an overview of all available configuration parameters and default values, run: rxn-data-pipeline --cfg job.

Configuration using YAML, saved for example as example_config.yaml in the current directory (see the file config.py for more options and their meaning):

defaults:
  - base_config

data:
  path: /tmp/inference/input.csv
  proc_dir: /tmp/rxn-preproc/exp
common:
  sequence:
    # Define which steps and in which order to execute:
    - IMPORT
    - STANDARDIZE
    - PREPROCESS
    - SPLIT
    - TOKENIZE
  fragment_bond: TILDE
preprocess:
  min_products: 0
split:
  split_ratio: 0.05
tokenize:
  input_output_pairs:
    - inp: ${data.proc_dir}/${data.name}.processed.train.csv
      out: ${data.proc_dir}/${data.name}.processed.train
    - inp: ${data.proc_dir}/${data.name}.processed.validation.csv
      out: ${data.proc_dir}/${data.name}.processed.validation
    - inp: ${data.proc_dir}/${data.name}.processed.test.csv
      out: ${data.proc_dir}/${data.name}.processed.test

rxn-data-pipeline --config-dir . --config-name example_config

Configuration using command line arguments (example):

rxn-data-pipeline \
  data.path=/path/to/data/rxns-small.csv \
  data.proc_dir=/path/to/proc/dir \
  common.fragment_bond=TILDE \
  rxn_import.data_format=TXT \
  tokenize.input_output_pairs.0.out=train.txt \
  tokenize.input_output_pairs.1.out=validation.txt \
  tokenize.input_output_pairs.2.out=test.txt
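
To make the dotted override syntax above concrete, here is a minimal sketch of how such keys address a nested configuration, including list indices such as input_output_pairs.0. It is an illustration in plain Python, not Hydra's implementation; the function apply_override and the placeholder starting values are hypothetical.

```python
from typing import Any


def apply_override(config: dict, override: str) -> None:
    """Apply one 'dotted.key=value' override to a nested dict/list config in place.

    Hypothetical helper for illustration only; Hydra does this internally.
    """
    dotted_key, value = override.split("=", 1)
    *parents, last = dotted_key.split(".")
    node: Any = config
    for part in parents:
        # When the current node is a list, the key part is a list index.
        node = node[int(part)] if isinstance(node, list) else node[part]
    if isinstance(node, list):
        node[int(last)] = value
    else:
        node[last] = value


# Placeholder starting values (assumptions, not the package defaults).
config = {
    "common": {"fragment_bond": "<default>"},
    "tokenize": {
        "input_output_pairs": [{"out": "<a>"}, {"out": "<b>"}, {"out": "<c>"}]
    },
}

for override in [
    "common.fragment_bond=TILDE",
    "tokenize.input_output_pairs.0.out=train.txt",
    "tokenize.input_output_pairs.1.out=validation.txt",
    "tokenize.input_output_pairs.2.out=test.txt",
]:
    apply_override(config, override)

outputs = [pair["out"] for pair in config["tokenize"]["input_output_pairs"]]
```

After the loop, each of the three tokenization pairs carries the output name given on the command line, in the same order as in the YAML list.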

Note about reading CSV files

Pandas cannot always re-read a CSV file that it has written when the data contains Windows carriage returns (\r). For the scripts to work despite this, every pd.read_csv call should include the argument lineterminator='\n'.
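
The following sketch shows the effect of that argument on a small in-memory CSV whose first field contains a carriage return. The helper name read_reactions_csv, the column names, and the sample data are made up for illustration; only pd.read_csv and its lineterminator argument come from pandas.

```python
import io

import pandas as pd


def read_reactions_csv(source) -> pd.DataFrame:
    """Hypothetical helper: read a CSV treating only '\\n' as the row separator."""
    return pd.read_csv(source, lineterminator="\n")


# Two data rows; the first value of the "rxn" column contains a stray "\r".
raw = "rxn,label\nCC>>CC\rO,1\nCO>>CO,2\n"

df = read_reactions_csv(io.StringIO(raw))
n_rows = len(df)
```

With lineterminator="\n", the carriage return stays inside the field instead of being treated as a row break, so the frame keeps its two rows.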

Examples

A pipeline supporting augmentation

A config, called train-augmentation-config.yaml, that supports augmentation of the training split:

defaults:
  - base_config

data:
  name: pipeline-with-augmentation
  path: /tmp/file-with-reactions.txt
  proc_dir: /tmp/rxn-preprocessing/experiment
common:
  sequence:
    # Define which steps and in which order to execute:
    - IMPORT
    - STANDARDIZE
    - PREPROCESS
    - SPLIT
    - AUGMENT
    - TOKENIZE
  fragment_bond: TILDE
rxn_import:
  data_format: TXT
preprocess:
  min_products: 1
split:
  input_file_path: ${preprocess.output_file_path}
  split_ratio: 0.05
augment:
  input_file_path: ${data.proc_dir}/${data.name}.processed.train.csv
  output_file_path: ${data.proc_dir}/${data.name}.augmented.train.csv
  permutations: 10
  tokenize: false
  random_type: rotated
tokenize:
  input_output_pairs:
    - inp: ${data.proc_dir}/${data.name}.augmented.train.csv
      out: ${data.proc_dir}/${data.name}.augmented.train
      reaction_column_name: rxn_rotated
    - inp: ${data.proc_dir}/${data.name}.processed.validation.csv
      out: ${data.proc_dir}/${data.name}.processed.validation
    - inp: ${data.proc_dir}/${data.name}.processed.test.csv
      out: ${data.proc_dir}/${data.name}.processed.test

rxn-data-pipeline --config-dir . --config-name train-augmentation-config
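
The ${...} references in the config above are interpolations that are filled in from other config values. As a rough sketch of what they expand to, the snippet below resolves them with a small regular expression over a plain dictionary. The resolve function is hypothetical and only mimics the behavior for simple dotted keys; it is not how Hydra or OmegaConf implement it.

```python
import re


def resolve(template: str, config: dict) -> str:
    """Replace each ${a.b} reference with the value found at config['a']['b']."""

    def lookup(match) -> str:
        node = config
        for part in match.group(1).split("."):
            node = node[part]
        return str(node)

    return re.sub(r"\$\{([^}]+)\}", lookup, template)


# Values taken from the "data" section of the config above.
config = {
    "data": {
        "name": "pipeline-with-augmentation",
        "proc_dir": "/tmp/rxn-preprocessing/experiment",
    }
}

augmented_train = resolve(
    "${data.proc_dir}/${data.name}.augmented.train.csv", config
)
print(augmented_train)
# /tmp/rxn-preprocessing/experiment/pipeline-with-augmentation.augmented.train.csv
```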

Improve / change products_single_atoms filter

Imported from internal issue number 24 (Dec 2020).

Looking into why 2000 reactions from [a project] were marked by this filter,

        if self.products_single_atoms(reaction):
            valid = False
            reasons.append("products_single_atoms")

it appears that most of them (maybe 90%) are really "no product" cases. These are mainly reactions that have the same starting material and product.

I would therefore:

Add a "no product" filter
Reformulate the products_single_atoms filter so that it does not return True in this case.

Thinking a bit further, I am not sure we really need this "single atom" filter. The remaining 10% of cases are reactions that originally looked something like X>>X.Cl or X>>X.[NH4+]. We need to consider:

There are also cases in which the ion has more than one heavy atom (triflate, ...)
Mainly it's also about renaming the filter. I'd say that "single atoms" are the symptom, not the actual reason why we remove the reaction.

Similarly, products_subset_of_reactants currently returns True if there are no products - if we add the "no product" filter, this one is not needed anymore.

One point that may need discussion:
Currently, it looks to me like the molecules present on both sides are removed before the reaction enters the function with the filters, which leads to the problem(s) above. If I see it correctly, this removal is not necessary: products_subset_of_reactants would mark the reaction as invalid in any case, so it would be filtered out afterwards.
[@others] anything I am missing there?

Decision:

Update the implementations of products_single_atoms and products_subset_of_reactants so that reactions are not tagged by them when the products are empty.

Add a column to the CSV, something like rxn_before_smiles_processing.
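
The behavior discussed above can be sketched as follows. This is a simplified illustration, not the package's implementation: the Reaction dataclass, the regex-based single-atom check (which stands in for a proper heavy-atom count with RDKit), the no_product filter name, and the validate function are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Simplified assumption: a "single atom" is one bracket atom or one bare element symbol.
_SINGLE_ATOM = re.compile(r"^(\[[^\[\]]+\]|Cl|Br|[BCNOPSFI])$")


@dataclass
class Reaction:
    reactants: List[str]
    products: List[str]


def remove_shared_molecules(reaction: Reaction) -> Reaction:
    """Drop molecules that appear on both sides of the reaction."""
    shared = set(reaction.reactants) & set(reaction.products)
    return Reaction(
        [m for m in reaction.reactants if m not in shared],
        [m for m in reaction.products if m not in shared],
    )


def no_product(reaction: Reaction) -> bool:
    return not reaction.products


def products_single_atoms(reaction: Reaction) -> bool:
    # Explicitly False for empty products; all() over an empty list would be True.
    return bool(reaction.products) and all(
        _SINGLE_ATOM.match(p) for p in reaction.products
    )


def products_subset_of_reactants(reaction: Reaction) -> bool:
    # Also not tagged when there are no products.
    return bool(reaction.products) and set(reaction.products) <= set(
        reaction.reactants
    )


def validate(reaction: Reaction) -> Tuple[bool, List[str]]:
    """Return (valid, reasons), where reasons lists every filter that fired."""
    checks = [
        ("no_product", no_product),
        ("products_single_atoms", products_single_atoms),
        ("products_subset_of_reactants", products_subset_of_reactants),
    ]
    reasons = [name for name, check in checks if check(reaction)]
    return not reasons, reasons


# X>>X: nothing is left on the product side once shared molecules are removed.
same_both_sides = validate(remove_shared_molecules(Reaction(["CCO"], ["CCO"])))
# X>>X.Cl: only a single atom is left as product.
leftover_ion = validate(remove_shared_molecules(Reaction(["CCO"], ["CCO", "Cl"])))
# Untouched X>>X: the subset filter alone already marks it as invalid.
untouched = validate(Reaction(["CCO"], ["CCO"]))
```

With these definitions, the X>>X case is reported as no_product only, while X>>X.Cl is still reported as products_single_atoms. A multi-atom counter-ion such as triflate would pass this simplified single-atom check, which is the limitation raised in the discussion.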
