similarityregression's Introduction

SimilarityRegression

This is the code repository for Similarity Regression (SR), a method to predict motif similarity using weighted alignments. Description of the directories:

ConstructSimilarityModels/: Jupyter notebooks and R scripts used to train and select SR models. This directory contains a README that describes the notebooks in greater detail.
Examples/: Example data and a Jupyter notebook with code to read TF gene/protein information from Cis-BP and parse it into formats that can be used to train SR models or to score sequences using existing SR models.
Scripts/: Python scripts and R code for aligning sequences to a Pfam HMM.
similarityregression/: Python module containing code to align DBDs and score alignments using SR models.
CisBP/: Scripts to calculate E-score overlaps from data present in Cis-BP flat files.

Python dependencies: numpy, pandas, biopython, sklearn

R dependencies: caret, glmnet, PRROC, aphid, seqinr

Citation

Samuel A. Lambert, Ally Yang, Alexander Sasse, Gwendolyn Cowley, Mark X. Caddick, Quaid D. Morris, Matthew T. Weirauch, and Timothy R. Hughes (2019). Similarity Regression predicts evolution of transcription factor sequence specificity. Nature Genetics. 51:981–989.

Abstract

Transcription factor (TF) binding specificities (motifs) are essential for the analysis of gene regulation. Accurate prediction of TF motifs is critical, because it is infeasible to assay all TFs in all sequenced eukaryotic genomes. There is ongoing controversy regarding the degree of motif diversification among related species that is, in part, because of uncertainty in motif prediction methods. Here we describe Similarity Regression, a significantly improved method for predicting motifs, which we use to update and expand the Cis-BP database. Similarity regression inherently quantifies TF motif evolution, and shows that previous claims of near-complete conservation of motifs between human and Drosophila are inflated, with nearly half of the motifs in each species absent from the other, largely due to extensive divergence in C2H2 zinc finger proteins. We conclude that diversification in DNA-binding motifs is pervasive, and present a new tool and updated resource to study TF diversity and gene regulation across eukaryotes.

similarityregression's Issues

Issues with default notebooks - missing files

Dear @smlmbrt,

Thanks for publishing the code associated with SimilarityRegression, including the pre-processing examples and related material.

I am having an issue using the notebooks due to missing files. Here I present one case, for the notebook

ConstructSimilarityModels/Create DBD alignments and training dataframes.ipynb

import pandas as pd

# loc_DBFiles is the path prefix of the directory holding the .tab files
motifs = pd.read_csv(loc_DBFiles + 'motifs.tab', sep='\t', skiprows=[1], index_col=0)
motif_features = pd.read_csv(loc_DBFiles + 'motif_features.tab', sep='\t', skiprows=[1], index_col=0)
domains = pd.read_csv(loc_DBFiles + 'domains.tab', sep='\t', skiprows=[1], index_col=0)
tf_families = pd.read_csv(loc_DBFiles + 'tf_families.tab', sep='\t', skiprows=[1], index_col=0)

None of these .tab files seem to be part of the Cis-BP download, so they appear to require independent processing. I have checked http://cisbp.ccbr.utoronto.ca/bulk.php and cannot easily map those names to anything there. May I please ask where to retrieve them, and/or how to skip this step, so that I can reliably reproduce this notebook?

Thanks for any input!
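When reproducing the notebook, it can help to check up front which of the expected tables are actually present before any read_csv call fails. The following is a minimal sketch, not part of the repository: the helper name find_missing_tables is hypothetical, and the file names are only those quoted in the notebook cell above.

```python
from pathlib import Path

# File names taken from the notebook cell quoted in this issue.
EXPECTED_TABLES = ["motifs.tab", "motif_features.tab", "domains.tab", "tf_families.tab"]


def find_missing_tables(db_dir, expected=EXPECTED_TABLES):
    """Return the expected file names that are not present in db_dir.

    Hypothetical helper for illustration; it is not part of similarityregression.
    """
    db_dir = Path(db_dir)
    return [name for name in expected if not (db_dir / name).is_file()]
```

Calling find_missing_tables(loc_DBFiles) with the same directory the notebook uses returns a list of the file names that still need to be obtained; an empty list means all four tables are in place.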

Training of the Logistic Regression Models

For the 2020 release of JASPAR, we are implementing your recently described similarity regression approach.

We are following the methods as described in the manuscript, with the only differences being the use of Tomtom E-values as the values for Y (instead of E-score overlaps) and Python's scikit-learn (instead of R).

According to the manuscript, TF pairs with E-score overlaps >=75% and <20% are regarded as positive and negative, respectively, and are used to compute precision and recall metrics and to train the logistic regression models. Hence, when training the logistic regression models, do you remove/mask all TF pairs whose E-score overlaps fall between 20% and 75%?

Thank you in advance.
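For concreteness, the masking asked about above could be sketched as follows with scikit-learn. This is an illustration of the question, not a confirmed description of the authors' procedure: the function name label_and_mask, the feature matrix X, and the random toy data are all assumptions, and only the 75% and 20% thresholds come from the text above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

POS_THRESHOLD = 0.75  # overlaps >= 75% are treated as positives (as quoted above)
NEG_THRESHOLD = 0.20  # overlaps < 20% are treated as negatives (as quoted above)


def label_and_mask(overlap):
    """Return (keep, labels): a boolean mask of unambiguous pairs and their 0/1 labels.

    Hypothetical helper; pairs with 20% <= overlap < 75% are dropped (keep == False).
    """
    overlap = np.asarray(overlap, dtype=float)
    keep = (overlap >= POS_THRESHOLD) | (overlap < NEG_THRESHOLD)
    labels = (overlap[keep] >= POS_THRESHOLD).astype(int)
    return keep, labels


# Toy data standing in for real inputs (assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.random((200, 5))    # hypothetical per-pair alignment features
overlap = rng.random(200)   # hypothetical E-score overlaps in [0, 1]

keep, y = label_and_mask(overlap)
model = LogisticRegression().fit(X[keep], y)  # train only on the unmasked pairs
```

If the intermediate pairs are meant to be kept instead, the mask would simply be skipped and a single threshold used to binarize Y, which is the alternative the question is trying to rule in or out.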
