Probabilistic Topic Modeling project for the Cmpe59h Bioinformatics course at Bogazici University.
Participants: Gönül Aycı, Dilara Keküllüoğlu.
Instructor: Assoc. Dr. Arzucan Özgür
In this project, we use Asgari's popular word2vec (W2V) protein embeddings and the dataset used in that work. The project is implemented in Python.
We used the following paper as our reference: La Rosa, Massimo, et al. "Probabilistic topic modeling for the analysis and classification of genomic sequences." BMC Bioinformatics 16.6 (2015): S2.
Our repository consists of two parts. The first part is data preparation. The second part contains the algorithms and the main script of the project.
To run this project, call main.py with the train and test data file paths (general form first, then an example):
python main.py "train-data" "test-data"
python main.py classification_train_3.csv classification_test_3.csv
Implementations:
- data-preparation.ipynb: This notebook shows how to merge the data and split the sequences into 3-, 5-, and 8-mers.
- analysis.ipynb: This notebook shows how to select the 100 families with the largest number of proteins in the Asgari dataset, and how we split the data into train and test sets.
- lda.py: Creates a Latent Dirichlet Allocation (LDA) model from the train data and then extracts a topic-family dictionary from the probability distributions. Classification is done by assigning each sequence the family of its most probable topic.
- lda_svm.py: Uses the LDA model created in lda.py to build feature vectors from the topic probability distributions. Classification is done with an SVM from the sklearn library.
- lda_svm_w2v.py: Adds Asgari's word2vec embeddings to the feature vectors of lda_svm.py. Classification is done with an SVM.
- main.py: Called with the train and test data file paths. Executes lda, lda_svm and lda_svm_w2v consecutively and shows the confusion matrices using matplotlib. Each confusion matrix window must be closed before the code continues executing.
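The topic-based classification described for lda.py can be illustrated with a minimal sketch. It is not the code from lda.py. It assumes the per-sequence topic distributions are already available as lists of probabilities, and that the topic-family dictionary is built by majority vote over the dominant topic of each training sequence. The function names, the fallback to the next most probable topic, and the toy family labels are all hypothetical.

```python
from collections import Counter, defaultdict


def build_topic_family_map(doc_topic, families):
    """Map each topic to the family that most often has it as dominant topic.

    Assumption: majority vote; the actual rule used in lda.py may differ.
    """
    votes = defaultdict(Counter)
    for dist, family in zip(doc_topic, families):
        top_topic = max(range(len(dist)), key=dist.__getitem__)
        votes[top_topic][family] += 1
    return {topic: counts.most_common(1)[0][0] for topic, counts in votes.items()}


def classify(dist, topic_family):
    """Return the family of the most probable topic that has a mapped family.

    Assumption: topics without a mapped family are skipped.
    """
    ranked = sorted(range(len(dist)), key=dist.__getitem__, reverse=True)
    for topic in ranked:
        if topic in topic_family:
            return topic_family[topic]
    return None


# Toy topic distributions for four training sequences (made-up values).
train_doc_topic = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
]
train_families = ["famA", "famA", "famB", "famC"]

topic_family = build_topic_family_map(train_doc_topic, train_families)
prediction = classify([0.05, 0.9, 0.05], topic_family)
```

In lda_svm.py the same distributions would instead be passed to the SVM as feature vectors, so no topic-family dictionary is needed there.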
Our CSV files have the following columns: Asgari ID, Family ID, SwissProt Accession ID, Sequences.
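As a rough illustration of this file layout, the sketch below reads rows in that column order and turns each sequence into a space-separated string of k-mers. It assumes there is no header row and that k-mers are overlapping (a sliding window with step 1); the notebooks may do this differently. The helper names and the sample rows are made up for the example.

```python
import csv
import io


def to_kmers(sequence, k):
    """Split a sequence into overlapping k-mers (sliding window, step 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def read_rows(handle):
    """Yield (asgari_id, family_id, accession, sequence) from a CSV handle.

    Assumption: four columns in this order, no header row.
    """
    for row in csv.reader(handle):
        asgari_id, family_id, accession, sequence = row
        yield asgari_id, family_id, accession, sequence


# Made-up sample rows standing in for a real train or test file.
sample = io.StringIO("1,famA,ACC001,MKVLAT\n2,famB,ACC002,GAVLK\n")
rows = list(read_rows(sample))

# One "document" of 3-mers per protein sequence.
docs = [" ".join(to_kmers(seq, 3)) for _, _, _, seq in rows]
```

For a real file, `open(path, newline="")` would take the place of the `io.StringIO` object.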