Probabilistic Topic Modeling

Probabilistic topic modeling project for the Cmpe59h - Bioinformatics course at Bogazici University.
Participants: Gönül Aycı, Dilara Keküllüoğlu.
Instructor: Assoc. Dr. Arzucan Özgür

In this project, we use Asgari's popular word2vec (W2V) protein embeddings and the dataset they were built on. The project is implemented in Python.

We used the following paper as our reference: "La Rosa, Massimo, et al. Probabilistic topic modeling for the analysis and classification of genomic sequences. BMC bioinformatics 16.6 (2015): S2."

Our repository consists of two parts. The first part is data preparation. The second part contains the algorithms and the main code of the project.

To run this project, pass the train and test data file paths to main.py:
python main.py "train-data" "test-data"
For example:
python main.py classification_train_3.csv classification_test_3.csv

Implementations:

  • data-preparation.ipynb: Shows how to merge the data and split the sequences into 3-, 5-, and 8-mers.
  • analysis.ipynb: Shows how we select the 100 families with the most proteins in the Asgari dataset and how we split the data into train and test sets.
  • lda.py: Creates a Latent Dirichlet Allocation (LDA) model from the train data and then extracts a topic-family dictionary from the probability distributions. Classification is done by assigning each sequence the family of its most probable topic.
  • lda_svm.py: Uses the LDA model created in lda.py to build feature vectors from the topic probability distributions. Classification is done with the SVM from the sklearn library.
  • lda_svm_w2v.py: Adds Asgari's word2vec embeddings to the feature vectors of lda_svm.py. Classification is done with an SVM.
  • main.py: Called with the train and test data file paths. Executes lda, lda_svm, and lda_svm_w2v consecutively and shows the confusion matrices using matplotlib. Each confusion matrix window must be closed before the code continues executing.
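
The classification step described for lda.py can be sketched as follows. This is a minimal illustration, not the code in the repository: it assumes the topic-family dictionary is built by majority vote over the most probable topic of each training sequence, which the description above does not confirm. The function names and the toy probability values are hypothetical.

```python
from collections import Counter, defaultdict


def top_topic(probs):
    """Return the index of the most probable topic in a distribution."""
    return max(range(len(probs)), key=probs.__getitem__)


def build_topic_family_map(doc_topic_probs, families):
    """Map each topic to the family seen most often among the training
    sequences whose most probable topic it is (assumed majority vote)."""
    votes = defaultdict(Counter)
    for probs, family in zip(doc_topic_probs, families):
        votes[top_topic(probs)][family] += 1
    return {topic: counts.most_common(1)[0][0] for topic, counts in votes.items()}


def classify(probs, topic_to_family, default=None):
    """Assign the family of the most probable topic; topics that never
    won a training sequence fall back to `default`."""
    return topic_to_family.get(top_topic(probs), default)


# Toy, made-up topic distributions for three training sequences.
train_probs = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
train_families = ["famA", "famA", "famB"]

topic_to_family = build_topic_family_map(train_probs, train_families)
prediction = classify([0.2, 0.1, 0.7], topic_to_family)
```

In a real run, the per-sequence topic distributions would come from the trained LDA model rather than from hand-written lists.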

Our CSV files have the following columns: Asgari ID, Family ID, SwissProt Accession ID, Sequence.
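
A minimal sketch of reading a file in this column layout and splitting each sequence into k-mers is shown below. It rests on several assumptions: the file has no header row, the k-mers are overlapping (sliding window with step 1), and the function names and the sample row are made up for illustration. The actual notebooks may differ on all of these.

```python
import csv
import io


def kmers(sequence, k):
    """Return the overlapping k-mers of a sequence (sliding window, step 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def read_rows(handle):
    """Yield (asgari_id, family_id, accession_id, sequence) tuples from a
    CSV handle, assuming four columns in that order and no header row."""
    for row in csv.reader(handle):
        if len(row) == 4:
            yield tuple(field.strip() for field in row)


# A made-up row standing in for a real data file.
sample = io.StringIO("id1,famX,accX,MKVLAT\n")
rows = list(read_rows(sample))

# One "document" of 3-mers per sequence, keyed by accession ID.
documents = {acc: kmers(seq, 3) for _, _, acc, seq in rows}
```

To read an actual file, replace the `io.StringIO` object with `open(path, newline="")`.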

