millanp95 / delucs

This repository contains all the source files required to run DeLUCS, a deep learning clustering algorithm for DNA sequences.

DeLUCS

This repository contains all the source files required to reproduce the results in the original DeLUCS paper (https://doi.org/10.1101/2021.05.13.444008), as well as a detailed guide for running the code.

Computational Pipeline:

1. Build the dataset:

	python build_dp.py --data_path=<PATH_sequence_folder>

Input: Folders with the sequences in FASTA format
Output: Pickle file in the form (label, sequence, accession)
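The step above can be sketched in plain Python. This is a minimal illustration, not the code in `build_dp.py`: the helper names `read_fasta`, `build_dataset`, and `save_dataset` are hypothetical, and the sketch assumes each FASTA header has the form `>accession.label` (the label follows the first dot), which is one reading of the script's docstring.

```python
import pickle
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from one FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def build_dataset(folder):
    """Collect (label, sequence, accession) tuples from the files in `folder`.

    Assumption: the sequence ID is 'accession.label', split at the first dot.
    """
    records = []
    for fasta in sorted(Path(folder).iterdir()):
        if not fasta.is_file():
            continue
        for header, seq in read_fasta(fasta):
            seq_id = header.split()[0]  # drop any free-text description
            accession, _, label = seq_id.partition(".")
            records.append((label, seq, accession))
    return records


def save_dataset(records, out_path):
    """Write the list of tuples to a pickle file."""
    with open(out_path, "wb") as handle:
        pickle.dump(records, handle)
```

For example, `save_dataset(build_dataset("../data/Influenza"), "train.p")` would write one pickle file for the whole folder; both paths here are placeholders.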

2. Compute the mimic sequences.

	python get_pairs.py --data_path=<PATH_pickle_dataset> --k=6 --modify='mutation' --output=<PATH_output_file> --n_mimics=<n mimics per sequence>

Input: Pickle file in the form (label, sequence, accession)
Output: Pickle file in the form (pairs, x_test, y_test)
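To make the idea of a "mimic" concrete, the sketch below pairs the k-mer counts of a sequence with the k-mer counts of randomly mutated copies. It is an illustration under stated assumptions, not the logic of `get_pairs.py`: the function names, the uniform substitution model, the `rate` parameter, and the choice to skip k-mers containing symbols other than A, C, G, T are all hypothetical.

```python
import itertools
import random


def kmer_counts(seq, k=6):
    """Count occurrences of every k-mer over the alphabet ACGT.

    Assumption: windows containing any other symbol (such as N) are skipped.
    """
    index = {"".join(p): i
             for i, p in enumerate(itertools.product("ACGT", repeat=k))}
    counts = [0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    return counts


def mutate(seq, rate=1e-4, rng=random):
    """Return a copy of `seq` where each base is substituted with probability `rate`."""
    out = []
    for base in seq:
        if base in "ACGT" and rng.random() < rate:
            # Hypothetical model: pick one of the three other bases uniformly.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_pairs(seq, n_mimics=3, k=6, rate=1e-4, seed=0):
    """Build (original, mimic) pairs of k-mer count vectors for one sequence."""
    rng = random.Random(seed)
    original = kmer_counts(seq, k)
    return [(original, kmer_counts(mutate(seq, rate, rng), k))
            for _ in range(n_mimics)]
```

With `k=6` each vector has 4^6 = 4096 entries, one per possible 6-mer.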

3. Train the model.

* For training DeLUCS and testing its performance
	```
	python EvaluateDeLUCS.py --data_dir=<PATH_of_computed_mimics> --out_dir=<OUTPUT_DIR>
	```

	* Input: Pickle file with the mimics in the form (pairs, x_test, y_test).
	* Output: Confusion matrix.
			<!--* File with the misclassified sequences in the form (accession, true_label, predicted_label)-->

* For testing the performance of a single neural network trained in an unsupervised way (labels must be available):
	```
	python EvaluateSingleRun.py --data_dir=<PATH_of_computed_mimics> --out_dir=<OUTPUT_DIR>
	```
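Because the model is trained without labels, the cluster indices it assigns are arbitrary, so evaluating against known labels requires matching clusters to classes first. The sketch below builds a confusion matrix and finds the best one-to-one matching by brute force. It is a hedged illustration, not the evaluation code in this repository: the function names are hypothetical, and the permutation search is only practical for a handful of clusters (an assignment solver such as the Hungarian algorithm is the usual scalable alternative).

```python
from itertools import permutations


def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[t][p] = number of samples with true label t placed in cluster p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def best_mapping(matrix):
    """Return (perm, hits), where perm[label] is the cluster matched to that label.

    Brute force over all one-to-one assignments; feasible only for small n.
    """
    n = len(matrix)
    best_perm, best_hits = None, -1
    for perm in permutations(range(n)):
        hits = sum(matrix[label][perm[label]] for label in range(n))
        if hits > best_hits:
            best_perm, best_hits = perm, hits
    return best_perm, best_hits


# Toy example with three classes and arbitrary cluster indices.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [2, 2, 0, 0, 1, 0]
matrix = confusion_matrix(y_true, y_pred, n_classes=3)
perm, hits = best_mapping(matrix)
accuracy = hits / len(y_true)
```

In the toy example, five of the six samples agree with their true label once the clusters are matched.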

Training on your own data

We recommend using the updated version of the code at https://github.com/Kari-Genomics-Lab for training on your own data.

Citation

If you find DeLUCS useful in your research, please consider citing:

@article{10.1371/journal.pone.0261531,
    doi = {10.1371/journal.pone.0261531},
    author = {Millán Arias, Pablo AND Alipour, Fatemeh AND Hill, Kathleen A. AND Kari, Lila},
    journal = {PLOS ONE},
    publisher = {Public Library of Science},
    title = {DeLUCS: Deep learning for unsupervised clustering of DNA sequences},
    year = {2022},
    month = {01},
    volume = {17},
    url = {https://doi.org/10.1371/journal.pone.0261531},
    pages = {1-25},
    number = {1},
}

delucs's Issues

Alignment of non-DNA sequences

In the paper you explicitly specify the clustering of DNA sequences. Can DeLUCS be used for clustering non-DNA sequences, for example sequences of RNA viruses, or only specific genes?

build_DP bug

Hi,

I'm trying to test your tool on some COVID sequences downloaded from GISAID. I put 500 sequences in FASTA format in a folder called 'fas' and got the error 'File name too long', so I renamed them 1-500, but the error persists.

Just to be clear: I have a folder called fas. In it are 500 FASTA files, called 1.fa, 2.fa, etc.

They are formatted as such:

head fas/1.fa

>1.B_1
ACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCAC
TCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACA

I ran them as shown below (on Ubuntu with Python 3.8.5) and got the following error:


build_dp.py --data_path = fas 
/home/binfie1/binfiebin/DeLUCS/src/build_dp.py: line 12: 
This script builds a dataset in pickle
format from a folder with FASTA files. The
desired label of the file must be in the file ID
after the accession number separated by a dot.

:param dataset: Name of the Dataset.
:param data_path: Path of the folder with the sequences.
:returns: None

Example: python build_dp.py --data_path = '../data/Influenza'
: File name too long
/home/binfie1/binfiebin/DeLUCS/src/build_dp.py: line 14: import: command not found
from: can't read /var/mail/Bio
/home/binfie1/binfiebin/DeLUCS/src/build_dp.py: line 16: import: command not found
/home/binfie1/binfiebin/DeLUCS/src/build_dp.py: line 17: import: command not found
/home/binfie1/binfiebin/DeLUCS/src/build_dp.py: line 20: syntax error near unexpected token `('
/home/binfie1/binfiebin/DeLUCS/src/build_dp.py: line 20: `def replace(seq):'

It seems as though there are multiple import errors.

Any help would be appreciated,
Liam

EvaluateDeLUCS.py

Hi,
Thanks for developing this tool!
While running EvaluateDeLUCS.py on the test files (test 1), I got the following error message:

Traceback (most recent call last):
  File "EvaluateDeLUCS.py", line 3, in <module>
    import torch
  File "/src/python3/envs/delucs/lib/python3.7/site-packages/torch/__init__.py", line 84, in <module>
    from torch._C import *
ImportError: /src/python3/envs/delucs/lib/python3.7/site-packages/torch/lib/libmkldnn.so.0: undefined symbol: cblas_sgemm_alloc

I tried to install mkl with intel (through conda install) and I am still getting the same error message.

Thanks,
Best regards,
Gautam

TrainDeLUCS.py

TrainDeLUCS.py line 116
SingleRun.py line 114
parser.add_argument('--n_custers', action='store', type=int, default=0)
typo: n_custers => n_clusters
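A minimal sketch of the corrected argument definition, assuming `n_clusters` is the intended name as this report suggests. The `build_parser` wrapper is hypothetical and only there to make the snippet self-contained; the `add_argument` options are copied from the line quoted above.

```python
import argparse


def build_parser():
    """Return a parser with the flag spelled `--n_clusters`."""
    parser = argparse.ArgumentParser()
    # Same options as the quoted line, with the flag name corrected.
    parser.add_argument('--n_clusters', action='store', type=int, default=0)
    return parser


args = build_parser().parse_args(['--n_clusters', '5'])
```

After parsing, the value is available as `args.n_clusters`.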

How to know which specific sequence contributed to the classification?

Dear authors,

Your work is very valuable. I am a doctor working in a hospital and am not very familiar with the code.
My question is: how can I tell which specific sequence contributed to the classification?

Thanks very much.

Ns

Hi.

I'm excited to try this tool out. Just wondering how it handles N's?

Liam

bug in EvaluateDeLUCS/TrainDeLUCS

Hi,

I'm getting this error:

python TrainDeLUCS.py --data_dir=/home/binfie1/liam_dev/delucs/pairs --out_dir=/home/binfie1/liam_dev/delucs/train1
Traceback (most recent call last):
  File "TrainDeLUCS.py", line 193, in <module>
    main()
  File "TrainDeLUCS.py", line 127, in main
    x_train, x_test, y_test = pickle.load(open(filename, 'rb'))
NotADirectoryError: [Errno 20] Not a directory: '/home/binfie1/liam_dev/delucs/pairs/testing_data.p'

It seems to come from trying to open the output of get_pairs.py, which I ran as:

python get_pairs.py --data_path=/home/binfie1/liam_dev/delucs/train --k=6 --modify='mutation' --output=/home/binfie1/liam_dev/delucs/pairs
............computing learning pairs................
......saving mutated pairs.....