BMS Molecular Translation

Top-5% solution to the BMS Molecular Translation Kaggle competition on chemical image-to-text translation.

Summary

Organic chemists frequently draw molecules using structural graph notation. As a result, decades of scanned publications and medical documents contain drawings that are not annotated with chemical formulas. Reliably converting such images into machine-readable formulas requires time-consuming manual work by experts. Automated optical chemical structure recognition could speed up research and development in the field.

The goal of this project is to develop a deep-learning-based algorithm for chemical image captioning. In other words, the project aims to translate unlabeled chemical images into text formula strings. To do that, I work with a large dataset of more than 4 million chemical images provided by Bristol-Myers Squibb.

My solution is an ensemble of seven CNN-LSTM encoder-decoder models implemented in PyTorch. The table below summarizes the main architecture and training parameters. The solution reaches a test score of 1.31 Levenshtein distance and places in the top 5% of the competition leaderboard. A detailed summary is provided in this writeup.

models
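For readers unfamiliar with the metric, here is a minimal sketch of how a Levenshtein-distance score can be computed, assuming the score is the edit distance between predicted and true formula strings averaged over images. The function names `levenshtein` and `mean_levenshtein` are illustrative and are not taken from the repository.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions and substitutions
    needed to turn string a into string b (dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def mean_levenshtein(predictions, targets) -> float:
    """Average edit distance over paired predictions and targets
    (assumed scoring scheme; lower is better)."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("predictions and targets must be non-empty and equal in length")
    total = sum(levenshtein(p, t) for p, t in zip(predictions, targets))
    return total / len(predictions)
```

Under this reading, a score of 1.31 means that a predicted string differs from the ground truth by a little over one character edit on average.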

Project structure

The project has the following structure:

  • codes/: .py main scripts with data, model, training and inference modules
  • notebooks/: .ipynb Colab-friendly notebooks for data augmentation and model training
  • input/: input data (not included due to size constraints, can be downloaded here)
  • output/: model configurations, weights and figures exported from the notebooks

Working with the repo

Environment

To work with the repo, I recommend creating a Conda virtual environment from the environment.yml file:

conda env create --name bms --file environment.yml
conda activate bms

Reproducing solution

The solution can then be reproduced in the following steps:

  1. Download competition data and place it in the input/ folder.
  2. Run 01_preprocessing_v1.ipynb to preprocess the data and define the chemical tokenizer.
  3. Run 02_gen_extra_data.ipynb and 03_preprocessing_v2.ipynb to construct additional synthetic images.
  4. Run training notebooks 04_model_v6.ipynb - 10_model_v33.ipynb to obtain weights of base models.
  5. Normalize each model's predictions using 11_normalization.ipynb.
  6. Run the ensembling notebook 12_ensembling.ipynb to obtain the final predictions.
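
Step 2 refers to a chemical tokenizer. As a rough illustration only, the sketch below shows what such a tokenizer might look like, assuming the target strings are InChI identifiers starting with `InChI=1S/`. The regular expression, the special tokens and the `Tokenizer` class are hypothetical and are not the implementation in 01_preprocessing_v1.ipynb.

```python
import re

INCHI_PREFIX = "InChI=1S/"  # assumed constant prefix, stripped before tokenizing

# Assumed token pattern: a few two-letter elements, digit runs,
# single letters, then any other single character (e.g. "/", "-", ",").
TOKEN_PATTERN = re.compile(r"Br|Cl|Si|\d+|[A-Za-z]|.")


def tokenize(text: str) -> list:
    """Split a formula string into chemical tokens."""
    body = text[len(INCHI_PREFIX):] if text.startswith(INCHI_PREFIX) else text
    return TOKEN_PATTERN.findall(body)


class Tokenizer:
    """Maps formula strings to integer sequences and back (illustrative)."""

    SPECIALS = ["<pad>", "<sos>", "<eos>"]

    def __init__(self, texts):
        # Build the vocabulary from every token seen in the training strings.
        vocab = sorted({token for text in texts for token in tokenize(text)})
        self.itos = self.SPECIALS + vocab
        self.stoi = {token: index for index, token in enumerate(self.itos)}

    def encode(self, text: str) -> list:
        ids = [self.stoi[token] for token in tokenize(text)]
        return [self.stoi["<sos>"]] + ids + [self.stoi["<eos>"]]

    def decode(self, ids) -> str:
        tokens = []
        for index in ids:
            token = self.itos[index]
            if token == "<eos>":
                break  # stop at the end-of-sequence marker
            if token not in self.SPECIALS:
                tokens.append(token)
        return INCHI_PREFIX + "".join(tokens)
```

A decoder trained on such sequences emits one token id per step, and `decode` turns the predicted ids back into a formula string.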

All training notebooks have the same structure and differ in model/data parameters. Different versions are included to ensure reproducibility. To understand the training process, it is sufficient to go through the codes/ folder and inspect one of the modeling notebooks. The ensembling code is also provided in this Kaggle notebook.

More details are provided in the documentation within the scripts & notebooks.
