protein-structure-dl

Deep Learning Exploration For Protein Structural Classification and Regression Tasks

Updated: 3/5/19

Generating Data

This project explores three data representations for proteomic data, each with a corresponding neural network architecture: 3D volumetric images, 2D pairwise statistical images, and protein graphs. To generate a new dataset, complete the following setup steps:

  1. Create a folder for the new dataset under /data/ in the project directory.
  2. Create a file named data.csv within the newly created folder. This file should list the dataset examples and their respective classification or regression values. Formatting should be as follows:
data.csv

pdb_id, chain_id, (class number or regression value)

Example:
3gft, A, 1

Note: If all chains within a PDB entry should be used, chain_id should be set to 0.
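As an illustration of the format above, the following is a minimal sketch of how data.csv could be read. It is not the project's own loader: the function name `read_dataset` is hypothetical, the second sample row uses a made-up PDB id, and the sketch assumes the file has no header row.

```python
import csv
import io


def read_dataset(handle):
    """Parse rows of `pdb_id, chain_id, value` into tuples.

    A chain_id of "0" is returned as None, meaning "use all chains".
    The value is read as a float so it can hold either a class number
    or a regression target (an assumption made for this sketch).
    """
    rows = []
    for record in csv.reader(handle, skipinitialspace=True):
        if len(record) != 3:
            continue  # skip blank or malformed lines
        pdb_id, chain_id, value = (field.strip() for field in record)
        chain = None if chain_id == "0" else chain_id
        rows.append((pdb_id.lower(), chain, float(value)))
    return rows


# "3gft, A, 1" comes from the example above; "1abc" is a made-up id.
sample = io.StringIO("3gft, A, 1\n1abc, 0, 0\n")
print(read_dataset(sample))
```

Returning None for the all-chains case keeps the special value 0 from being confused with a real chain identifier later on.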

  3. Gather PDB files for the dataset using the following command:

python3 src/datagen/fetch_pdbs.py path_to_data_folder

  4. Once the PDB files for a dataset have been downloaded, use one of the following commands to generate one of the three types of data representation:
  • Volumetric images - python3 src/datagen/volumes3d/generate.py path_to_data_folder

    This generates 3D voxel images of protein structures, with each residue type defined as a separate channel. The data is stored in binvox format to reduce file size.

Note: image size and resolution can be adjusted using the --size and --resolution flags respectively. Defaults: size 64 (a 64^3 grid) and resolution 1.0 angstroms.

  • Pairwise images - python3 src/datagen/pairwise2d/generate.py path_to_data_folder

    This generates 2D histogram images of the pairwise distances between residue types.

Note: histogram range and bins can be adjusted using the --range and --bins flags respectively. Defaults: 50 angstroms and 10 bins.

  • Protein graphs - python3 src/datagen/graphs/generate.py path_to_data_folder

    This generates graph representations of protein structures, with nodes encoding residues and edges encoding pairwise distances.
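The three representations above can be illustrated with a small pure-Python sketch. This is not the code in src/datagen: the function names, the toy residue list, the choice to centre the voxel grid on the origin, and the equal-width binning are all assumptions made for illustration. Only the default values (64, 1.0, 50 and 10) come from the notes above.

```python
import math
from itertools import combinations

# Hypothetical toy input: (residue_type, x, y, z), coordinates in angstroms.
residues = [
    ("ALA", 0.0, 0.0, 0.0),
    ("GLY", 3.0, 4.0, 0.0),
    ("ALA", 0.0, 0.0, 12.0),
]


def voxel_index(x, y, z, size=64, resolution=1.0):
    """Map a coordinate to an index in a size^3 grid centred on the origin.

    Returns None when the point falls outside the grid.
    """
    half = size * resolution / 2.0
    idx = tuple(int(math.floor((c + half) / resolution)) for c in (x, y, z))
    return idx if all(0 <= i < size for i in idx) else None


def pairwise_histogram(residues, max_range=50.0, bins=10):
    """Histogram residue-pair distances, one histogram per pair of types."""
    width = max_range / bins
    hist = {}
    for (t1, *p1), (t2, *p2) in combinations(residues, 2):
        d = math.dist(p1, p2)
        if d >= max_range:
            continue  # distances beyond the range are dropped
        key = tuple(sorted((t1, t2)))
        hist.setdefault(key, [0] * bins)[int(d // width)] += 1
    return hist


def distance_matrix(residues):
    """Dense edge weights for a graph whose nodes are the residues."""
    coords = [r[1:] for r in residues]
    return [[math.dist(a, b) for b in coords] for a in coords]


print(voxel_index(0.0, 0.0, 0.0))
print(pairwise_histogram(residues))
print(distance_matrix(residues)[0][1])
```

In a volumetric image each residue type would occupy its own channel, so the voxel index would be paired with a per-type channel index. That step is omitted here to keep the sketch short.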

All data generation scripts can be run in parallel using MPI. Example:

mpirun -n $NODES python3 src/datagen/.........
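One common way to divide such work between MPI processes is a round-robin split of the dataset entries by rank. The sketch below shows that idea only. Whether the project's scripts partition their work this way is an assumption, and the `shard` helper and entry names are hypothetical.

```python
def shard(items, rank, size):
    """Return the items assigned to one process in a round-robin split."""
    if not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    return items[rank::size]


# With mpi4py, rank and size would come from MPI.COMM_WORLD.Get_rank()
# and MPI.COMM_WORLD.Get_size(). They are looped over here so the
# sketch runs without an MPI installation.
entries = ["3gft_A", "1abc_A", "2xyz_B", "4def_A", "5ghi_C"]  # made-up names
for rank in range(2):
    print(rank, shard(entries, rank, 2))
```

A round-robin split keeps the per-process workloads within one item of each other without any communication between processes.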

Defining a TensorFlow Training Experiment

The .yaml files under /src/configs/ define neural network training configurations. The model config parameters vary with the type of model; for further information, please read the model documentation. Each file has the following general fields:

data_config:
    name: protien_graphs
    data_path: data/PsiBlast
    task_type: classification
    nb_classes: 2
    nb_nodes: 381
    split: [0.7,0.1,0.2]
    augment: 3
    fuzzy_radius: 0.25
    seed: 1234

experiment_config:
    name: classifier
    output_dir: out/psiblast_graph_new

model_config:
    model_type: gcnn
    input_shape: [381,29,3]
    kernel_limit: 126.0
    kernels_per_layer: [2,2]
    conv_layers: [64,64]
    conv_dropouts: [0.1,0.1]
    pooling_layers: [4,4]
    fc_layers: [128,]
    fc_dropouts: [0.5,]
    nb_classes: 2
    optimizer: 'Adam'
    learning_rate: 0.0001

train_config:
    batch_size: 20
    nb_epochs: 100
    early_stop_metric: valid_loss
    early_stop_epochs: 10
    save_best: True
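To make a few of these fields concrete, here is a minimal sketch of how split, seed and early_stop_epochs might be interpreted. It is not taken from the project's code. The sketch assumes that split lists train, validation and test fractions in that order, and that early_stop_epochs counts epochs without improvement in the early-stop metric. Both function names are hypothetical.

```python
import random


def split_indices(n, split=(0.7, 0.1, 0.2), seed=1234):
    """Shuffle n example indices and cut them into three parts."""
    if abs(sum(split) - 1.0) > 1e-9:
        raise ValueError("split fractions must sum to 1")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded, so the split is repeatable
    n_train = round(n * split[0])
    n_valid = round(n * split[1])
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:]
    return train, valid, test


def should_stop(valid_losses, patience=10):
    """True once the best validation loss is at least `patience` epochs old."""
    if not valid_losses:
        return False
    best_epoch = valid_losses.index(min(valid_losses))
    return len(valid_losses) - 1 - best_epoch >= patience


train, valid, test = split_indices(100)
print(len(train), len(valid), len(test))
print(should_stop([1.0, 0.9] + [0.95] * 10))
```

Fixing the seed means that two runs with the same config evaluate on the same held-out examples, which makes their results comparable.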

Running Training

Once a configuration file for training has been defined, use the following command to run training. The file loader is parallelized using a multithreading library, and the number of cores to use can be set with the --cores flag.

python3 src/main.py --cores $CORES config/config.yaml
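The general pattern of a thread-parallel file loader can be sketched with the standard library. The text above does not name the library the project uses, so `concurrent.futures` is a stand-in here, and `load_all` and `fake_loader` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def load_all(paths, loader, cores=4):
    """Apply `loader` to every path using a pool of `cores` worker threads."""
    with ThreadPoolExecutor(max_workers=cores) as pool:
        # map() yields results in the same order as `paths`.
        return list(pool.map(loader, paths))


def fake_loader(path):
    """Stand-in for reading and decoding one data file."""
    return path.upper()


print(load_all(["a.bin", "b.bin", "c.bin"], fake_loader, cores=2))
```

Threads suit this kind of work when loading is dominated by disk reads. Decoding that is CPU-bound in pure Python would usually benefit more from separate processes.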

GCNN Model and Layers

A TensorFlow definition of a graph convolutional neural network (GCNN) can be found in the /src/models/ folder. The layers used in the model are defined in /src/models/ops/graph_conv.py.
