DeLUCS

This repository contains all the source files required to reproduce the results in the original DeLUCS paper (https://doi.org/10.1101/2021.05.13.444008), as well as a detailed guide for running the code.

Computational Pipeline:

1. Build the dataset:

	python build_dp.py --data_path=<PATH_sequence_folder>
  • Input: Folders with the sequences in FASTA format
  • Output: Pickle file with entries of the form (label, sequence, accession)
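
To make the (label, sequence, accession) form concrete, here is a minimal sketch that parses FASTA text without external libraries. It assumes headers of the form `>accession.label`, splitting at the first dot, as the `build_dp.py` docstring quoted in the issues below describes. The names `parse_fasta_records` and `_to_record` are hypothetical, and the real script may handle IDs differently.

```python
def _to_record(header, chunks):
    # Assumption: the header ID is "<accession>.<label>", split at the first dot.
    accession, _, label = header.partition(".")
    return (label, "".join(chunks).upper(), accession)


def parse_fasta_records(text):
    """Parse FASTA text into a list of (label, sequence, accession) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append(_to_record(header, chunks))
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        elif header is not None:
            # Sequence lines are accumulated until the next header.
            chunks.append(line)
    if header is not None:
        records.append(_to_record(header, chunks))
    return records


if __name__ == "__main__":
    example = ">1.B_1\nACTTTCGA\nTCTCTTGT\n"
    print(parse_fasta_records(example))
```

Note that an accession that itself contains a dot would be split incorrectly under this assumption.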

2. Compute the mimic sequences.

  python get_pairs.py --data_path=<PATH_pickle_dataset> --k=6 --modify='mutation' --output=<PATH_output_file> --n_mimics=<n mimics per sequence>
  • Input: Pickle file with entries of the form (label, sequence, accession)
  • Output: file in the form (pairs, x_test, y_test)
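
The flags above suggest that each sequence is summarized by k-mer statistics (`--k=6`) and paired with artificially mutated copies (`--modify='mutation'`). The following is a rough sketch of that idea only: the function names, the per-base substitution rate, and the use of raw k-mer counts are all assumptions for illustration, not the behaviour of `get_pairs.py`.

```python
import itertools
import random


def kmer_counts(seq, k=6):
    """Count each k-mer over the ACGT alphabet; windows with other characters are skipped."""
    index = {"".join(p): i for i, p in enumerate(itertools.product("ACGT", repeat=k))}
    counts = [0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    return counts


def mutate(seq, rate=0.01, rng=None):
    """Substitute each A/C/G/T base with probability `rate` by a different base (assumed model)."""
    if rng is None:
        rng = random.Random()
    out = []
    for base in seq:
        if base in "ACGT" and rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_pairs(seq, n_mimics=3, k=6, rate=0.01, seed=0):
    """Pair the k-mer counts of a sequence with those of n_mimics mutated copies."""
    rng = random.Random(seed)
    original = kmer_counts(seq, k)
    return [(original, kmer_counts(mutate(seq, rate, rng), k)) for _ in range(n_mimics)]
```

With k=6 each vector has 4^6 = 4096 entries, one per possible 6-mer.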

3. Train the model.

* For training DeLUCS and testing its performance:
	```
	python EvaluateDeLUCS.py --data_dir=<PATH_of_computed_mimics> --out_dir=<OUTPUT_DIR>
	```

	* Input: Pickle file with the mimics in the form (pairs, x_test, y_test).
	* Output: Confusion matrix.

* For testing the performance of a single neural network trained in an unsupervised way (labels must be available):
	```
	python EvaluateSingleRun.py --data_dir=<PATH_of_computed_mimics> --out_dir=<OUTPUT_DIR>
	```
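
Because cluster indices produced by an unsupervised model are arbitrary, a confusion matrix is usually read after matching clusters to the available labels. The sketch below shows one way to do that. It uses a brute-force search over permutations as a stand-in for a proper assignment solver (for example `scipy.optimize.linear_sum_assignment`), so it is only practical for a handful of clusters. All function names are hypothetical, and this is not necessarily how the evaluation scripts compute their output.

```python
from itertools import permutations


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true labels, columns are predicted cluster indices."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def best_cluster_mapping(matrix):
    """Return the cluster -> label mapping that maximizes the number of agreements."""
    n = len(matrix)
    best_perm, best_hits = None, -1
    for perm in permutations(range(n)):  # perm[label] is the cluster matched to that label
        hits = sum(matrix[label][perm[label]] for label in range(n))
        if hits > best_hits:
            best_perm, best_hits = perm, hits
    return {cluster: label for label, cluster in enumerate(best_perm)}


def clustering_accuracy(y_true, y_pred, n_classes):
    """Accuracy after relabelling clusters with the best one-to-one mapping."""
    mapping = best_cluster_mapping(confusion_matrix(y_true, y_pred, n_classes))
    hits = sum(1 for t, p in zip(y_true, y_pred) if mapping[p] == t)
    return hits / len(y_true)


if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [1, 1, 2, 2, 0, 0]  # same grouping, different cluster numbering
    print(clustering_accuracy(y_true, y_pred, 3))
```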

Training on your own data

We recommend using the updated version of the code at https://github.com/Kari-Genomics-Lab for training on your own data.

Citation

If you find DeLUCS useful in your research please consider citing:

@article{10.1371/journal.pone.0261531,
    doi = {10.1371/journal.pone.0261531},
    author = {Millán Arias, Pablo AND Alipour, Fatemeh AND Hill, Kathleen A. AND Kari, Lila},
    journal = {PLOS ONE},
    publisher = {Public Library of Science},
    title = {DeLUCS: Deep learning for unsupervised clustering of DNA sequences},
    year = {2022},
    month = {01},
    volume = {17},
    url = {https://doi.org/10.1371/journal.pone.0261531},
    pages = {1-25},
    number = {1},
}	

delucs's Issues

Alignment of non-DNA sequences

In the paper you explicitly specify the clustering of DNA sequences. Can DeLUCS be used for clustering non-DNA sequences, for example sequences of RNA viruses, or only specific genes?

build_DP bug

Hi,

I'm trying to test your tool on some COVID seqs downloaded from GISAID. I put 500 seqs in FASTA format in a folder called 'fas' and got the error 'File name too long', so I renamed them 1-500, but the error persists.

Just to be clear: I have a folder called fas. In it are 500 FASTA files, called 1.fa, 2.fa, etc.

They are formatted as such:

head fas/1.fa

>1.B_1
ACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCAC
TCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACA

I ran them as below (on ubuntu with Python 3.8.5) and got the below error:


build_dp.py --data_path = fas 
/home/binfie1/binfiebin/DeLUCS/src/build_dp.py: line 12: 
This script builds a dataset in pickle
format from a folder with FASTA files. The
desired label of the file must be in the file ID
after the accession number separated by a dot.

:param dataset: Name of the Dataset.
:param data_path: Path of the folder with the sequences.
:returns: None

Example: python build_dp.py --data_path = '../data/Influenza'
: File name too long
/home/binfie1/binfiebin/DeLUCS/src/build_dp.py: line 14: import: command not found
from: can't read /var/mail/Bio
/home/binfie1/binfiebin/DeLUCS/src/build_dp.py: line 16: import: command not found
/home/binfie1/binfiebin/DeLUCS/src/build_dp.py: line 17: import: command not found
/home/binfie1/binfiebin/DeLUCS/src/build_dp.py: line 20: syntax error near unexpected token `('
/home/binfie1/binfiebin/DeLUCS/src/build_dp.py: line 20: `def replace(seq):'

It seems as though there are multiple import errors.

Any help would be appreciated,
Liam

EvaluateDeLUCS.py

Hi,
Thanks for developing this tool!
While running EvaluateDeLUCS.py on the test files (test 1) I got the following error message:

Traceback (most recent call last):
  File "EvaluateDeLUCS.py", line 3, in <module>
    import torch
  File "/src/python3/envs/delucs/lib/python3.7/site-packages/torch/__init__.py", line 84, in <module>
    from torch._C import *
ImportError: /src/python3/envs/delucs/lib/python3.7/site-packages/torch/lib/libmkldnn.so.0: undefined symbol: cblas_sgemm_alloc

I tried to install mkl with intel (through conda install) and I am still getting the same error message.

Thanks,
Best regards,
Gautam

TrainDeLUCS.py

TrainDeLUCS.py line 116
SingleRun.py line 114
parser.add_argument('--n_custers', action='store', type=int, default=0)
typo: n_custers => n_clusters

Ns

Hi.

I'm excited to try this tool out. Just wondering how it handles N's?

Liam

bug in EvaluateDeLUCS/TrainDeLUCS

Hi,

I'm getting this error:

python TrainDeLUCS.py --data_dir=/home/binfie1/liam_dev/delucs/pairs --out_dir=/home/binfie1/liam_dev/delucs/train1
Traceback (most recent call last):
  File "TrainDeLUCS.py", line 193, in <module>
    main()
  File "TrainDeLUCS.py", line 127, in main
    x_train, x_test, y_test = pickle.load(open(filename, 'rb'))
NotADirectoryError: [Errno 20] Not a directory: '/home/binfie1/liam_dev/delucs/pairs/testing_data.p'

It seems to come from trying to open the output of get_pairs.py, which I ran as:

python get_pairs.py --data_path=/home/binfie1/liam_dev/delucs/train --k=6 --modify='mutation' --output=/home/binfie1/liam_dev/delucs/pairs
............computing learning pairs................
......saving mutated pairs.....
