similarityregression's Introduction

SimilarityRegression

This is the code repository for Similarity Regression (SR), a method to predict motif similarity using weighted alignments. The directories are described below:

  • ConstructSimilarityModels/: Jupyter notebooks and R scripts used to train and select SR models. This directory contains a README that describes the notebooks in greater detail.
  • Examples/: example data and a Jupyter notebook with code to read TF gene/protein information from Cis-BP and parse it into formats that can be used to train SR models or to score sequences using existing SR models.
  • Scripts/: Python scripts and R code for aligning sequences to a Pfam HMM.
  • similarityregression/: Python module containing code to align DBDs and score alignments using SR models.
  • CisBP/: scripts to calculate E-score overlaps from data present in Cis-BP flat files.

Python dependencies: numpy, pandas, biopython, scikit-learn (sklearn)

R dependencies: caret, glmnet, PRROC, aphid, seqinr

Citation

Samuel A. Lambert, Ally Yang, Alexander Sasse, Gwendolyn Cowley, Mark X. Caddick, Quaid D. Morris, Matthew T. Weirauch, and Timothy R. Hughes (2019). Similarity Regression predicts evolution of transcription factor sequence specificity. Nature Genetics. 51:981–989.

Abstract

Transcription factor (TF) binding specificities (motifs) are essential for the analysis of gene regulation. Accurate prediction of TF motifs is critical, because it is infeasible to assay all TFs in all sequenced eukaryotic genomes. There is ongoing controversy regarding the degree of motif diversification among related species that is, in part, because of uncertainty in motif prediction methods. Here we describe Similarity Regression, a significantly improved method for predicting motifs, which we use to update and expand the Cis-BP database. Similarity regression inherently quantifies TF motif evolution, and shows that previous claims of near-complete conservation of motifs between human and Drosophila are inflated, with nearly half of the motifs in each species absent from the other, largely due to extensive divergence in C2H2 zinc finger proteins. We conclude that diversification in DNA-binding motifs is pervasive, and present a new tool and updated resource to study TF diversity and gene regulation across eukaryotes.

similarityregression's Issues

Issues with default notebooks - missing files

Dear @smlmbrt,

Thanks for publishing the code associated with SimilarityRegression, including the pre-processing examples and related material.

I am having issues using the notebooks because of missing files. In this issue, I present one case, from the notebook

ConstructSimilarityModels/Create DBD alignments and training dataframes.ipynb

motifs = pd.read_csv(loc_DBFiles + 'motifs.tab', sep = '\t', skiprows=[1], index_col=0)
motif_features = pd.read_csv(loc_DBFiles + 'motif_features.tab', sep = '\t', skiprows=[1], index_col=0)
domains = pd.read_csv(loc_DBFiles + 'domains.tab', sep = '\t', skiprows=[1], index_col=0)
tf_families = pd.read_csv(loc_DBFiles + 'tf_families.tab', sep = '\t', skiprows=[1], index_col=0)

None of these .tab files appears to be part of the Cis-BP bulk downloads, so they seem to require separate processing. I have checked http://cisbp.ccbr.utoronto.ca/bulk.php and cannot easily map these file names to anything offered there. May I please ask where to retrieve them, or how to skip this step, so that I can reliably reproduce this notebook?
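For anyone hitting the same problem, a small check like the one below can report which of the four files are absent before the notebook cell runs. It is only a sketch: the file names come from the cell quoted above, while `find_missing_tables` and `EXPECTED_TABLES` are hypothetical names that are not part of the repository.

```python
from pathlib import Path

# File names taken from the notebook cell quoted in this issue.
EXPECTED_TABLES = ["motifs.tab", "motif_features.tab", "domains.tab", "tf_families.tab"]


def find_missing_tables(db_dir, expected=EXPECTED_TABLES):
    """Return the expected file names that are not present in db_dir.

    Hypothetical helper for illustration; not part of similarityregression.
    """
    db_path = Path(db_dir)
    return [name for name in expected if not (db_path / name).is_file()]
```

Calling `find_missing_tables(loc_DBFiles)` with the notebook's own directory variable would return an empty list once all four tables are in place.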

Thanks for any input!

Training of the Logistic Regression Models

For the 2020 release of JASPAR, we are implementing your recently described similarity regression approach.

We are following the methods as described in the manuscript. The only differences are that we use Tomtom E-values as the values for Y (instead of E-score overlaps) and Python's scikit-learn (instead of R).

According to the manuscript, TF pairs with E-score overlaps >=75% and <20% are regarded as positive and negative, respectively, and are used for computing precision and recall metrics and for training the logistic regression models. Hence, when training the logistic regression models, do you remove/mask all TF pairs whose E-score overlaps fall between 20% and 75%?
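To make the question concrete, the masking we have in mind can be sketched as follows. This is only our reading of the thresholds, not a confirmed description of the published procedure: `label_pairs`, the toy overlap values, and the random feature matrix are all hypothetical, and the overlaps are expressed as fractions (0.75 and 0.20) rather than percentages.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def label_pairs(overlap, pos_min=0.75, neg_max=0.20):
    """Label TF pairs from their overlap scores.

    Returns (labels, keep): labels is 1 for overlap >= pos_min, 0 for
    overlap < neg_max, and -1 for the ambiguous pairs in between;
    keep is a boolean mask selecting only the unambiguous pairs.
    """
    overlap = np.asarray(overlap, dtype=float)
    labels = np.full(overlap.shape, -1, dtype=int)
    labels[overlap >= pos_min] = 1
    labels[overlap < neg_max] = 0
    keep = labels >= 0
    return labels, keep


# Toy data: made-up overlaps and random features, for illustration only.
rng = np.random.default_rng(0)
overlap = np.array([0.95, 0.80, 0.75, 0.50, 0.20, 0.19, 0.05, 0.00])
X = rng.normal(size=(overlap.size, 3))

labels, keep = label_pairs(overlap)

# Train only on the unambiguous pairs; the masked pairs are left out entirely.
model = LogisticRegression().fit(X[keep], labels[keep])
```

In this toy case the pairs with overlaps of 0.50 and 0.20 are masked, so the model is fitted on six of the eight pairs.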

Thank you in advance.
