MISATO - Machine learning dataset for structure-based drug discovery

🌎 Where we are:

  • Quantum Mechanics: 19443 ligands, curated and refined
  • Molecular Dynamics: 16972 simulated protein-ligand structures, 10 ns each
  • AI: PyTorch dataloaders, 2 baseline models for MD and QM

Vision:

We are a drug discovery community project 🤗

  • highest possible accuracy for ligand molecules
  • represent the systems' dynamics on reasonable timescales
  • innovative AI models for drug discovery predictions

Let's crack the 100+ ns MD, 30000+ protein-ligand structures and a whole new world of AI models for drug discovery together.

Check out the paper!

💜 Community

Want to get hands-on for drug discovery using AI?

Join our discord server!

📌  Introduction

In this repository, we show how to download the MISATO database and apply it to AI models. You can access the calculated properties of different protein-ligand structures and use them for training with PyTorch-based dataloaders. We provide a small sample of the dataset along with the repo.

You can freely download the FULL MISATO dataset from Zenodo using the links below:

  • MD (133 GiB)
  • QM (0.3 GiB)
  • electronic densities (6 GiB)
  • MD restart and topology files (55 GiB)

wget -O data/MD/h5_files/MD.hdf5 https://zenodo.org/record/7711953/files/MD.hdf5
wget -O data/QM/h5_files/QM.hdf5 https://zenodo.org/record/7711953/files/QM.hdf5
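
The same two downloads can be scripted from Python. The following is a minimal sketch using only the standard library; the URLs and destination paths are copied from the wget commands above, while the names `ZENODO_BASE`, `TARGETS`, `download_plan` and `fetch` are hypothetical helpers introduced for illustration and are not part of the repository.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Record and file names are copied from the wget commands above.
ZENODO_BASE = "https://zenodo.org/record/7711953/files"
TARGETS = {
    "MD.hdf5": Path("data/MD/h5_files/MD.hdf5"),
    "QM.hdf5": Path("data/QM/h5_files/QM.hdf5"),
}


def download_plan(names=("QM.hdf5",)):
    """Return (url, destination) pairs for the requested files."""
    return [(f"{ZENODO_BASE}/{name}", TARGETS[name]) for name in names]


def fetch(names=("QM.hdf5",), overwrite=False):
    """Download the requested files, skipping any that already exist."""
    for url, dest in download_plan(names):
        if dest.exists() and not overwrite:
            continue
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(dest))  # blocks until the download finishes
```

Calling `fetch()` retrieves only the small QM file by default; pass `("MD.hdf5",)` explicitly for the 133 GiB MD file.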

Start with the notebook src/getting_started.ipynb to:

  • Understand the structure of our dataset and how to access each molecule's properties.
  • Load the PyTorch Dataloaders of each dataset.
  • Load the PyTorch Lightning DataModules of each dataset.
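
Before opening the notebook, it can help to glance at how an HDF5 file is organised. The sketch below is a generic outline printer, not the repository's own loader: it makes no assumption about the group or dataset names inside the MISATO files, and the function names `outline` and `outline_file` are hypothetical. It relies only on the mapping-style interface that `h5py.File` and `h5py.Group` provide and on the `shape` attribute of datasets.

```python
from itertools import islice


def outline(node, max_depth=2, limit=3, prefix=""):
    """List up to `limit` children per group, descending `max_depth` levels.

    Works on any mapping-like tree; h5py.File and h5py.Group qualify.
    Leaves that expose a `shape` attribute (such as h5py datasets) are
    reported with that shape.
    """
    lines = []
    for name, child in islice(node.items(), limit):
        path = f"{prefix}/{name}"
        shape = getattr(child, "shape", None)
        if shape is not None:
            lines.append(f"{path}  shape={shape}")
        else:
            lines.append(path)
            if max_depth > 1 and hasattr(child, "items"):
                lines.extend(outline(child, max_depth - 1, limit, path))
    return lines


def outline_file(path, **kwargs):
    """Open an HDF5 file read-only and return its outline."""
    import h5py  # imported lazily so `outline` itself needs only the stdlib

    with h5py.File(path, "r") as handle:
        return outline(handle, **kwargs)
```

For example, `print("\n".join(outline_file("data/QM/h5_files/QM.hdf5")))` would show the first three top-level entries and their children. The `limit` argument keeps the output short for files with many thousands of entries.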

🚀  Quickstart

We recommend pulling our MISATO image from DockerHub or creating your own image (see docker/). The images use CUDA version 11.8. We recommend installing a CUDA version of at least 11.8 on your own system to ensure that the drivers work correctly.

# clone project
git clone https://github.com/sab148/MiSaTo-dataset.git
cd MiSaTo-dataset

For singularity use:

# get the container image
singularity pull misato.sif docker://sab148/misato-dataset
singularity shell misato.sif

For docker use:

sudo docker pull sab148/misato-dataset:latest
bash docker/run_bash_in_container.sh
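
The "at least 11.8" CUDA requirement mentioned above is easy to get wrong with plain string comparison, since "11.10" sorts before "11.8" as text. A minimal sketch of a numeric comparison follows; `parse_version` and `cuda_is_sufficient` are hypothetical helper names, not part of the repository.

```python
def parse_version(text):
    """Turn a dotted version string such as '11.8' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def cuda_is_sufficient(found, minimum="11.8"):
    """Return True when `found` is at least `minimum`, compared numerically."""
    return parse_version(found) >= parse_version(minimum)
```

With PyTorch installed, `cuda_is_sufficient(torch.version.cuda)` checks the CUDA version that the PyTorch build targets. That attribute is `None` on CPU-only builds, a case this sketch does not handle.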

Project Structure

├── data                   <- Project data
│   ├── MD
│   │   ├── h5_files           <- storage of dataset
│   │   └── splits             <- train, val, test splits
│   └── QM
│       ├── h5_files           <- storage of dataset
│       └── splits             <- train, val, test splits
│
├── src                    <- Source code
│   ├── data
│   │   ├── components           <- Datasets and transforms
│   │   ├── md_datamodule.py     <- MD Lightning data module
│   │   ├── qm_datamodule.py     <- QM Lightning data module
│   │   │
│   │   └── processing           <- Scripts for preprocessing, inference and conversion
│   │      ├──...
│   ├── getting_started.ipynb     <- notebook: how to load data and interact with it
│   └── inference.ipynb           <- notebook: how to run inference
│
├── docker                    <- Dockerfile and execution script
└── README.md



Installation using your own conda environment

If you want to use conda for your own installation, please create a new misato environment.

To install PyTorch Geometric, we recommend using pip (within conda) and following the official installation instructions: pytorch-geometric/install

The instructions vary depending on your CUDA version. We show an example for CUDA 11.8.

conda create --name misato python=3
conda activate misato
conda install -c anaconda pandas pip h5py
pip3 install torch --index-url https://download.pytorch.org/whl/cu118 --no-cache
pip install joblib matplotlib
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.0.0+cu118.html
pip install pytorch-lightning==1.8.3
pip install torch-geometric
pip install ipykernel==5.5.5 ipywidgets==7.6.3 nglview==2.7.7
conda install -c conda-forge nb_conda_kernels
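
After running the commands above, a quick check that the main packages are importable can save a failed notebook start. This is a minimal sketch using the standard library; the list `REQUIRED` holds the import names of packages installed above (which differ from some pip names, for example `torch_geometric` for `torch-geometric`), and `missing_modules` is a hypothetical helper name.

```python
from importlib.util import find_spec

# Import names for packages installed by the commands above.
REQUIRED = (
    "torch",
    "torch_geometric",
    "pytorch_lightning",
    "h5py",
    "pandas",
    "joblib",
    "matplotlib",
    "nglview",
)


def missing_modules(names=REQUIRED):
    """Return the subset of `names` that the import system cannot find."""
    return [name for name in names if find_spec(name) is None]
```

Run `missing_modules()` inside the activated misato environment; an empty list means every listed package was found. The check only locates the modules and does not verify their versions or CUDA support.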

To run inference for MD, you have to install ambertools. We recommend installing it in a separate conda environment.

conda create --name ambertools python=3
conda activate ambertools
conda install -c conda-forge ambertools nb_conda_kernels
pip install h5py jupyter ipykernel==5.5.5 ipywidgets==7.6.3 nglview==2.7.7

misato-dataset's Issues

Reconstruct pdb file error from MD.hdf5 file

When I tried to reconstruct PDB files from the MD hdf5 file via data/processing/h5_to_pdb.py, some cases failed for the following reasons. Could you please help me out?

Generating pdb for MD dataset for 16PK frame 1
Traceback (most recent call last):
  File "data/processing/h5_to_pdb.py", line 123, in <module>
    lines = create_pdb_lines_MD(trajectory_coordinates, atoms_type, atoms_number, atoms_residue, molecules_begin_atom_index, typeMap,residue_Map, nameMap)
  File "data/processing/h5_to_pdb.py", line 76, in create_pdb_lines_MD
    atom_name = get_atom_name(i, atoms_number, residue_atom_index, residue_name, type_string, nameMap)
  File "data/processing/h5_to_pdb.py", line 33, in get_atom_name
    atom_name = atomic_numbers_Map[atoms_number[i]]+str(residue_atom_index)
KeyError: 9

or 

Generating pdb for MD dataset for 1A3E frame 1
KeyError ('ILE', 20, 'N3')
KeyError ('ILE', 21, 'H')
KeyError ('ILE', 22, 'H')
KeyError ('ILE', 23, 'H')
KeyError ('ILE', 24, 'CX')
KeyError ('ILE', 25, 'HP')
KeyError ('ILE', 26, '3C')
KeyError ('ILE', 27, 'HC')
KeyError ('ILE', 28, 'CT')
KeyError ('ILE', 29, 'HC')
KeyError ('ILE', 30, 'HC')
KeyError ('ILE', 31, 'HC')
KeyError ('ILE', 32, '2C')
KeyError ('ILE', 33, 'HC')
KeyError ('ILE', 34, 'HC')
KeyError ('ILE', 35, 'CT')
KeyError ('ILE', 36, 'HC')
KeyError ('ILE', 37, 'HC')
KeyError ('ILE', 38, 'HC')
KeyError ('ILE', 39, 'C')
KeyError ('ILE', 40, 'O')

Get rid of examples/ folder

I find the examples folder confusing. I would suggest we move the scripts to src. And could the logs just go in a new folder called logs?
