wsr-predictpofpathogenicity's Introduction

WeaklySupervisedRegressor

The Data folder contains all raw data (including features and labels).

The feature sets of curated ClinVar variants are stored in folders named according to the pattern:

features_last-interpreted-date(06-29-2017)_geq(greater and equal to)2/3(stars)_(wholeFeature)

  • Folders ending with wholeFeature contain 71 features, with the names listed below:

Effect, Structure_and_dynamics, Secondary_structure, Stability_and_conformational_flexibility, Conformational_flexibility, Special_structural_signatures, Signal_peptide, Transmembrane_protein, Functional_residue, Macromolecular_binding, Protein_binding, Disordered_interface, Ordered_interface, Metal_binding, PTM_site, Intrinsic_disorder, B-factor, Relative_solvent_accessibility, Helix, Strand, Loop, N-terminal_signal, Signal_helix, C-terminal_signal, Signal_cleavage, Cytoplasmic_loop, Transmembrane_region, Non_cytoplasmic_loop, Coiled_coil, Catalytic_site, Calmodulin_binding, DNA_binding, RNA_binding, PPI_residue, PPI_hotspot, MoRF, Allosteric_site, Cadmium_binding, Calcium_binding, Cobalt_binding, Copper_binding, Iron_binding, Magnesium_binding, Manganese_binding, Nickel_binding, Potassium_binding, Sodium_binding, Zinc_binding, Acetylation, ADP-ribosylation, Amidation, C-linked_glycosylation, Carboxylation, Disulfide_linkage, Farnesylation, Geranylgeranylation, GPI_anchor_amidation, Hydroxylation, Methylation, Myristoylation, N-terminal_acetylation, N-linked_glycosylation, O-linked_glycosylation, Palmitoylation, Phosphorylation, Proteolytic_cleavage, Pyrrolidone_carboxylic_acid, Sulfation, SUMOylation, Ubiquitylation, Stability

  • The other folders contain 9 features, a subset of these 71 (see Section 2.3 of the paper), with the names listed below:

Relative_solvent_accessibility, Allosteric_site, Catalytic_site, Secondary_structure, Stability_and_conformational_flexibility, Special_structural_signatures, Macromolecular_binding, Metal_binding, PTM_site
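The relationship between the two feature sets can be sketched as a column selection by name. This is a minimal sketch, not the repository's code: `select_features` is a hypothetical helper, and it assumes the feature matrix is held as a list of rows alongside a matching list of column names.

```python
# Names of the 9-feature subset, copied from the list above.
SUBSET_FEATURES = [
    "Relative_solvent_accessibility",
    "Allosteric_site",
    "Catalytic_site",
    "Secondary_structure",
    "Stability_and_conformational_flexibility",
    "Special_structural_signatures",
    "Macromolecular_binding",
    "Metal_binding",
    "PTM_site",
]


def select_features(header, rows, wanted=SUBSET_FEATURES):
    """Return only the columns named in `wanted`, in that order.

    `header` is the list of column names of the full matrix and `rows`
    is a list of equally long value lists (both are assumed inputs).
    """
    index = {name: i for i, name in enumerate(header)}
    missing = [name for name in wanted if name not in index]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    cols = [index[name] for name in wanted]
    return [[row[c] for c in cols] for row in rows]


# Toy example with a made-up 4-column header (not real data).
toy_header = ["Effect", "Metal_binding", "Helix", "PTM_site"]
toy_rows = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
toy_subset = select_features(toy_header, toy_rows, ["Metal_binding", "PTM_site"])
```

Selecting by name rather than by position avoids depending on the column order of any particular file.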

The label sets of curated variants are stored in folders named according to the pattern:

labels_last-interpreted-date(06-29-2017)_geq(greater and equal to)2/3(stars)

  • The five classes are encoded as the integers 1 to 5:

Benign - 1
Likely Benign - 2
Uncertain - 3
Likely Pathogenic - 4
Pathogenic - 5
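For convenience, the encoding above can be written as a lookup table. The dictionary and function names below are illustrative and do not come from the repository.

```python
# Class names and integer codes as listed above.
LABEL_NAMES = {
    1: "Benign",
    2: "Likely Benign",
    3: "Uncertain",
    4: "Likely Pathogenic",
    5: "Pathogenic",
}
# Reverse lookup from class name to integer code.
LABEL_CODES = {name: code for code, name in LABEL_NAMES.items()}


def decode_labels(codes):
    """Map integer labels (1 to 5) to their class names."""
    return [LABEL_NAMES[int(c)] for c in codes]


decoded = decode_labels([1, 3, 5])
```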

Features and labels of the CAGI test set are in the CAGI_test folder.

  • Files with whole in the name contain 71 features
  • Files without whole in the name contain 9 features

Files

  • Each file contains data for one gene.
  • Files ending with clean contain only the feature matrix
  • Files without clean additionally contain variant names and header information
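The difference between the two file kinds can be sketched as follows. The README does not state the file layout, so this sketch makes loud assumptions: values are tab-delimited, and in the non-clean files the header is the first row and the variant names are the first column. `read_feature_table` is a hypothetical helper, and the example content is made up.

```python
import csv
import io


def read_feature_table(text, clean, delimiter="\t"):
    """Parse the text of one per-gene file.

    clean=True : every row is numeric (feature matrix only).
    clean=False: the first row is a header and the first column holds
                 variant names (an assumed layout, not confirmed here).
    Returns (variant_names, header, matrix); the first two are None
    for clean files.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [r for r in reader if r]  # skip empty lines
    if clean:
        return None, None, [[float(v) for v in r] for r in rows]
    header = rows[0][1:]
    names = [r[0] for r in rows[1:]]
    matrix = [[float(v) for v in r[1:]] for r in rows[1:]]
    return names, header, matrix


# Made-up example content; variant names and values are illustrative only.
raw = "variant\tMetal_binding\tPTM_site\nA10V\t0.2\t0.4\nR25C\t0.6\t0.8\n"
names, header, matrix = read_feature_table(raw, clean=False)
```

If the real files use a different delimiter or layout, only the `delimiter` argument and the slicing in the non-clean branch would need to change.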

Revision data set

The new data set collected for the revision process is in the folder /data_revision.

New genes

We extended the set of candidate genes to the genes generally available in the ClinVar dataset. The class statistics for the new gene set are shown below. For more details, see the file SI_new_genes_stat.xlsx.

  • Benign - 460(9.52%)
  • Likely Benign - 694(14.36%)
  • Uncertain - 2806(58.05%)
  • Likely Pathogenic - 282(5.83%)
  • Pathogenic - 592(12.25%)
  • total - 4834
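The percentages above follow directly from the class counts, as this short check shows. The variable names are illustrative.

```python
# Class counts for the new gene set, copied from the list above.
counts = {
    "Benign": 460,
    "Likely Benign": 694,
    "Uncertain": 2806,
    "Likely Pathogenic": 282,
    "Pathogenic": 592,
}
total = sum(counts.values())
# Share of each class in percent, rounded to two decimals.
percentages = {name: round(100 * n / total, 2) for name, n in counts.items()}

for name, pct in percentages.items():
    print(f"{name}: {counts[name]} ({pct}%)")
```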

To test how well the model generalizes to genes with no connection to BRCA1/2, we trained the model on the 9 old property features of this new gene set and compared its performance with that of the model trained on the same feature set of the old genes. The experiment shows comparable performance, indicating that our model generalizes well.

New features

Besides our old property features from MutPred2, we assembled more features from the dbNSFP database, including 6 pathogenicity scores and 8 conservation scores.

  • 6 pathogenicity scores: SIFT_score, Polyphen2_HDIV_score, Polyphen2_HVAR_score, PROVEAN_score, REVEL_score, PrimateAI_score
  • 8 conservation scores: GERP++_RS, phyloP100way_vertebrate, phyloP30way_mammalian, phyloP17way_primate, phastCons100way_vertebrate, phastCons30way_mammalian, phastCons17way_primate, bStatistic

To test whether including more features improves performance, these 14 new features were appended to the 9 old features on the old gene set and fed into the model. The results show an improvement over the model trained on only the 9 old features.
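Appending the new features amounts to a column-wise concatenation, giving 9 + 14 = 23 features per variant. The sketch below is illustrative rather than the repository's code: `append_features` is a hypothetical helper, the toy values are made up, and it assumes both matrices list the same variants in the same order.

```python
# Names of the 14 dbNSFP features, copied from the lists above.
PATHOGENICITY_SCORES = [
    "SIFT_score", "Polyphen2_HDIV_score", "Polyphen2_HVAR_score",
    "PROVEAN_score", "REVEL_score", "PrimateAI_score",
]
CONSERVATION_SCORES = [
    "GERP++_RS", "phyloP100way_vertebrate", "phyloP30way_mammalian",
    "phyloP17way_primate", "phastCons100way_vertebrate",
    "phastCons30way_mammalian", "phastCons17way_primate", "bStatistic",
]
NEW_FEATURES = PATHOGENICITY_SCORES + CONSERVATION_SCORES


def append_features(old_rows, new_rows):
    """Concatenate two feature matrices column-wise.

    Assumes row i of both matrices describes the same variant.
    """
    if len(old_rows) != len(new_rows):
        raise ValueError("both matrices must have the same number of rows")
    return [list(o) + list(n) for o, n in zip(old_rows, new_rows)]


# Toy matrices for two variants (made-up values).
old = [[0.0] * 9, [1.0] * 9]
new = [[0.5] * len(NEW_FEATURES), [0.25] * len(NEW_FEATURES)]
combined = append_features(old, new)
```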

Contributors

shen-lab, stephen2526, yuecao2017
