PLASMe

PLASMe is a tool for identifying plasmid contigs in short-read assemblies using Transformer models. It combines the strengths of alignment-based and learning-based methods: closely related plasmids are identified by the alignment component, while diverged plasmids are predicted by order-specific Transformer models.

Required Dependencies

Python 3.x
PyTorch
diamond
blast
biopython
numpy
pandas

Quick install (Linux only)

Download PLASMe with git clone:

git clone https://github.com/HubertTang/PLASMe.git
cd PLASMe

We recommend using conda to install all the dependencies.
```
# create the plasme environment and install all dependencies
conda env create -f plasme.yaml
# activate the environment
conda activate plasme
```
Reminder:
1. Older versions of Anaconda may fail to install PLASMe (some users have reported that Anaconda version 4.8.4 cannot install it). If you encounter a PackagesNotFoundError, upgrade Anaconda to a newer version.
2. If you encounter conda package conflicts during installation, set channel_priority to flexible:
```
conda config --set channel_priority flexible
```
Download the reference database using PLASMe_db.py
```
python PLASMe_db.py
```
Optional arguments:

--keep_zip: Keep the compressed database. Default: False

--threads: The number of threads used to build the database. Default: 8

Alternative:

Download the reference dataset (12.4 GB) manually from Zenodo or OneDrive into the same directory as PLASMe.py. There is no need to uncompress it: PLASMe extracts the files and builds the database the first time you run it, which takes several minutes.

Usage

PLASMe takes assembled contigs in FASTA format as input and writes the predicted plasmid sequences in FASTA format.

python PLASMe.py [INPUT_CONTIG] [OUTPUT_PLASMIDS] [OPTIONS]

Optional arguments:

-c, --coverage: the minimum coverage of BLASTN. Default: 0.9.

-i, --identity: the minimum identity of BLASTN. Default: 0.9.

-p, --probability: the minimum probability of Transformer. Default: 0.5.

-t, --thread: the number of threads. Default: 8.

-u, --unified: use the unified Transformer model for prediction. Default: False.

-m, --mode: use a preset parameter set. Default: None. Three presets are provided: high-sensitivity, balance, and high-precision. High-sensitivity mode (identity threshold: 0.7, probability threshold: 0.5) increases sensitivity but may introduce false positives. High-precision mode (identity threshold: 0.9, probability threshold: 0.9) increases precision but may introduce false negatives. Balance mode (identity threshold: 0.9, probability threshold: 0.5) offers a trade-off between precision and sensitivity.

--temp: the directory in which temporary files are saved. Default: temp.
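
The relationship between the presets and the individual thresholds can be summarized in a small lookup table. The following is a minimal sketch, not PLASMe's own code: the names `Thresholds`, `MODE_PRESETS`, and `resolve_thresholds` are hypothetical, the threshold values are copied from the description of -m above, and it assumes that a preset replaces any explicitly given identity and probability values. The exact strings accepted by -m may differ from the mode names used here.

```python
from typing import NamedTuple, Optional


class Thresholds(NamedTuple):
    identity: float      # minimum BLASTN identity (-i)
    probability: float   # minimum Transformer probability (-p)


# Values taken from the description of -m/--mode.
# The dictionary itself is an illustration, not part of PLASMe.
MODE_PRESETS = {
    "high-sensitivity": Thresholds(identity=0.7, probability=0.5),
    "balance": Thresholds(identity=0.9, probability=0.5),
    "high-precision": Thresholds(identity=0.9, probability=0.9),
}


def resolve_thresholds(mode: Optional[str] = None,
                       identity: float = 0.9,
                       probability: float = 0.5) -> Thresholds:
    """Return the preset for `mode`, or the explicit values if mode is None."""
    if mode is None:
        return Thresholds(identity, probability)
    if mode not in MODE_PRESETS:
        raise ValueError(f"unknown mode: {mode!r}")
    # Assumption: a preset overrides explicitly passed thresholds.
    return MODE_PRESETS[mode]


print(resolve_thresholds("high-precision"))
```

Note that the balance preset matches the documented defaults for -i and -p, so under these assumptions it behaves the same as running without -m.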

Outputs

Output files

| Files | Description |
| --- | --- |
| `<OUTPUT_PLASMIDS>` | FASTA file of all predicted plasmid contigs |
| `<OUTPUT_PLASMIDS>_report.csv` | Report describing the identified plasmid contigs |

Output report format

| Field | Description |
| --- | --- |
| contig | Sequence ID of the query contig |
| length | Length of the query contig |
| reference | The best-hit aligned reference plasmid |
| order | Assigned order |
| evidence | BLASTn or Transformer |
| score | The prediction score (applicable only to Transformer) |
| amb_region | The ambiguous regions* |

* Ambiguous regions are regions that may be shared with chromosomes. If a large proportion of a query contig consists of ambiguous regions, treat the prediction with caution, as the contig may originate from a chromosome.
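
The report can be post-processed with a few lines of Python. The sketch below assumes that the report is a comma-separated file with a header row whose column names match the fields listed above, and that the evidence column contains the literal strings "BLASTn" and "Transformer"; check these assumptions against an actual report before relying on it. The function names and the sample rows are hypothetical and not real PLASMe output.

```python
import csv
import io


def load_report(handle):
    """Read a PLASMe report and return its rows as a list of dicts."""
    return list(csv.DictReader(handle))


def transformer_hits(rows, min_score=0.9):
    """Return contig IDs predicted by the Transformer with score >= min_score."""
    kept = []
    for row in rows:
        if row["evidence"] != "Transformer":
            continue  # the score applies only to Transformer predictions
        if float(row["score"]) >= min_score:
            kept.append(row["contig"])
    return kept


# Made-up rows for illustration only.
example = io.StringIO(
    "contig,length,reference,order,evidence,score,amb_region\n"
    "c1,5000,refA,orderX,BLASTn,,\n"
    "c2,8000,refB,orderY,Transformer,0.95,\n"
    "c3,3000,refC,orderY,Transformer,0.60,\n"
)
rows = load_report(example)
print(transformer_hits(rows))
```

For a real run, replace the in-memory example with `open("test.plasme.fna_report.csv", newline="")`, adjusting the file name to your own output path.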

Example

# run PLASMe with a coverage of 0.6, an identity of 0.6, a probability of 0.5, and 8 threads to identify plasmids
python PLASMe.py test.fasta test.plasme.fna -c 0.6 -i 0.6 -p 0.5 -t 8
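
When PLASMe is called from a pipeline, the same command can be assembled programmatically. This is a minimal sketch that uses only the options documented above; `build_plasme_command` and `run_plasme` are hypothetical helper names, and the sketch assumes that PLASMe.py is in the current working directory.

```python
import subprocess
import sys


def build_plasme_command(contigs, output, coverage=0.9, identity=0.9,
                         probability=0.5, threads=8, script="PLASMe.py"):
    """Assemble the argument list for one PLASMe run."""
    return [
        sys.executable, script, contigs, output,
        "-c", str(coverage),
        "-i", str(identity),
        "-p", str(probability),
        "-t", str(threads),
    ]


def run_plasme(cmd):
    """Run PLASMe; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


cmd = build_plasme_command("test.fasta", "test.plasme.fna",
                           coverage=0.6, identity=0.6)
# Show the command without the interpreter path.
print(" ".join(cmd[1:]))
# run_plasme(cmd)  # requires PLASMe and its reference database
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.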

Train the PC-based Transformer model using a customized dataset

If you want to build protein cluster (PC)-based Transformer models from scratch, train_pc_model.py demonstrates how to train models on customized protein databases. The script covers building the protein cluster database, converting query sequences into numerical vectors, training and evaluating models, and making predictions. In addition to the required dependencies listed above, running it requires mcl, which can be installed with:

conda install -c bioconda mcl

For better results, we recommend the following:

The protein database should be as comprehensive as possible.
Setting stricter alignment thresholds when aligning query sequences to the PC database can further improve precision.
In classification tasks, PCs that lack discriminative power may introduce noise and reduce classification performance, so it is advisable to remove them.
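
One simple way to act on the last recommendation is to compare how often each PC occurs in plasmid and in chromosome sequences and to drop PCs whose occurrence rates are nearly equal. The sketch below illustrates that idea only; it is not the procedure used by train_pc_model.py. The function name, the input layout, and the `min_gap` cutoff are all assumptions made for the example.

```python
def discriminative_pcs(pc_counts, n_plasmid, n_chromosome, min_gap=0.2):
    """Return PC IDs whose occurrence rates differ between the two classes.

    pc_counts maps a PC ID to (plasmid_hits, chromosome_hits): the number of
    plasmid and chromosome sequences containing a protein from that PC.
    """
    kept = []
    for pc_id, (in_plasmid, in_chromosome) in pc_counts.items():
        plasmid_rate = in_plasmid / n_plasmid
        chromosome_rate = in_chromosome / n_chromosome
        # Keep the PC only if the two rates differ by at least min_gap.
        if abs(plasmid_rate - chromosome_rate) >= min_gap:
            kept.append(pc_id)
    return sorted(kept)


# Toy counts for illustration only.
counts = {"PC_1": (80, 5), "PC_2": (40, 42), "PC_3": (2, 60)}
print(discriminative_pcs(counts, n_plasmid=100, n_chromosome=100))
```

In this toy example PC_2 is dropped because it appears at almost the same rate in both classes, while PC_1 and PC_3 are kept because each is strongly associated with one class.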

Supplementary data

We have uploaded the supplementary data to OneDrive, including the PLSDB test set and the real data. Detailed information can be found in README.txt.
