PLASMe

PLASMe is a tool for identifying plasmid contigs in short-read assemblies using Transformer models. It combines the strengths of alignment-based and learning-based methods: closely related plasmids are identified by the alignment component, while diverged plasmids are predicted by order-specific Transformer models.

Required Dependencies

  • Python 3.x
  • PyTorch
  • diamond
  • blast
  • biopython
  • numpy
  • pandas

Quick install (Linux only)

  1. Clone the PLASMe repository with "git clone"

    git clone https://github.com/HubertTang/PLASMe.git
    cd PLASMe
  2. We recommend using conda to install all the dependencies.

    # create the plasme environment and install the dependencies
    conda env create -f plasme.yaml
    # activate the environment
    conda activate plasme

    Reminder:

    1. Older versions of Anaconda may fail to install PLASMe (some users have reported that Anaconda version 4.8.4 cannot install it). If you encounter a PackagesNotFoundError, please upgrade Anaconda to a newer version.

    2. If you encounter conda package conflicts during installation, set channel_priority to flexible as follows:

      conda config --set channel_priority flexible
  3. Download the reference database using PLASMe_db.py

    python PLASMe_db.py

    Optional arguments:

    --keep_zip: Keep the compressed database. Default: False

    --threads: The number of threads used to build the database. Default: 8

    Alternative:

    Download the reference dataset (12.4GB) manually from Zenodo (OneDrive) into the same directory as PLASMe.py. There is no need to uncompress it: PLASMe extracts the files and builds the database the first time you run it, which takes several minutes.

Usage

PLASMe takes assembled contigs in FASTA format as input and outputs the predicted plasmid sequences in FASTA format.

python PLASMe.py [INPUT_CONTIG] [OUTPUT_PLASMIDS] [OPTIONS]

Optional arguments:

-c, --coverage: the minimum coverage of BLASTN. Default: 0.9.

-i, --identity: the minimum identity of BLASTN. Default: 0.9.

-p, --probability: the minimum probability of Transformer. Default: 0.5.

-t, --thread: the number of threads. Default: 8.

-u, --unified: use the unified Transformer model for prediction. Default: False.

-m, --mode: use a preset parameter set. Default: None. Three presets are provided for convenience: high-sensitivity, balance, and high-precision. High-sensitivity mode (identity threshold: 0.7, probability threshold: 0.5) increases sensitivity but may introduce false positives. High-precision mode (identity threshold: 0.9, probability threshold: 0.9) increases precision but may introduce false negatives. Balance mode (identity threshold: 0.9, probability threshold: 0.5) trades off precision against sensitivity.

--temp: path of the directory for temporary files. Default: temp.
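
The presets described for --mode can be summarized as a small lookup table. The sketch below is illustrative only: MODE_PRESETS and resolve_thresholds are hypothetical names, not PLASMe's internal code. The idea that a preset overrides explicitly passed thresholds is also an assumption, and the exact strings accepted by --mode should be checked against the script itself.

```python
# Illustrative lookup of the preset thresholds listed for --mode.
# MODE_PRESETS and resolve_thresholds are hypothetical names, not part of PLASMe.
MODE_PRESETS = {
    "high-sensitivity": {"identity": 0.7, "probability": 0.5},
    "balance": {"identity": 0.9, "probability": 0.5},
    "high-precision": {"identity": 0.9, "probability": 0.9},
}


def resolve_thresholds(mode=None, identity=0.9, probability=0.5):
    """Return (identity, probability) thresholds.

    Assumption: when a mode is given, its preset replaces the explicit values.
    """
    if mode is None:
        return identity, probability
    preset = MODE_PRESETS[mode]
    return preset["identity"], preset["probability"]


for name in MODE_PRESETS:
    print(name, resolve_thresholds(name))
```

Note that the balance preset uses the same identity and probability thresholds as the documented defaults.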

Outputs

Output files

Files                           Description
<OUTPUT_PLASMIDS>               FASTA file of all predicted plasmid contigs
<OUTPUT_PLASMIDS>_report.csv    Report describing the identified plasmid contigs

Output report format

Field         Description
contig        Sequence ID of the query contig
length        Length of the query contig
reference     The best-hit aligned reference plasmid
order         Assigned order
evidence      BLASTn or Transformer
score         The prediction score (applicable only to Transformer)
amb_region    The ambiguous regions*

* The ambiguous regions refer to regions that may be shared with the chromosomes. If a query contig contains a large proportion of ambiguous regions, caution must be exercised as it could potentially originate from a chromosome.
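
As a rough sketch of how the report could be inspected, the snippet below reads a CSV with the columns listed above using Python's standard csv module. The sample rows are made-up toy data, the function names are hypothetical, and two details are assumptions: that the file has a header row with these column names, and that the score cell is empty for BLASTn rows. The amb_region column is left untouched because its exact format is not described here.

```python
import csv
import io

# Toy report with the documented columns; the values are invented for illustration.
SAMPLE_REPORT = """contig,length,reference,order,evidence,score,amb_region
contig_1,5230,ref_A,order_X,BLASTn,,
contig_2,8120,ref_B,order_Y,Transformer,0.97,
contig_3,3011,ref_C,order_Y,Transformer,0.62,
"""


def load_report(handle):
    """Read the report into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def count_by_evidence(rows):
    """Count contigs per value of the 'evidence' column."""
    counts = {}
    for row in rows:
        counts[row["evidence"]] = counts.get(row["evidence"], 0) + 1
    return counts


def transformer_hits_below(rows, min_score):
    """List Transformer-predicted contigs whose score is below min_score."""
    return [
        row["contig"]
        for row in rows
        if row["evidence"] == "Transformer" and float(row["score"]) < min_score
    ]


rows = load_report(io.StringIO(SAMPLE_REPORT))
print(count_by_evidence(rows))
print(transformer_hits_below(rows, 0.9))
```

For a real run, replace the io.StringIO object with an open file handle on <OUTPUT_PLASMIDS>_report.csv.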

Example

# run PLASMe with a coverage of 0.6, identity of 0.6, probability of 0.5, and 8 threads to identify plasmids.
python PLASMe.py test.fasta test.plasme.fna -c 0.6 -i 0.6 -p 0.5 -t 8
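
When PLASMe is called from a pipeline, the same command can be assembled in Python. This is a minimal sketch, not part of PLASMe: build_plasme_command and run_plasme are hypothetical helpers, and they use only the -c, -i, -p, and -t options documented above. It assumes PLASMe.py is in the current directory and that "python" resolves to the interpreter of the activated plasme environment.

```python
import subprocess


def build_plasme_command(contigs, output, coverage=0.9, identity=0.9,
                         probability=0.5, threads=8, script="PLASMe.py"):
    """Assemble the PLASMe command line as a list of arguments."""
    return [
        "python", script, contigs, output,
        "-c", str(coverage),
        "-i", str(identity),
        "-p", str(probability),
        "-t", str(threads),
    ]


def run_plasme(*args, **kwargs):
    """Run PLASMe and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_plasme_command(*args, **kwargs), check=True)


cmd = build_plasme_command("test.fasta", "test.plasme.fna",
                           coverage=0.6, identity=0.6)
print(" ".join(cmd))
# prints: python PLASMe.py test.fasta test.plasme.fna -c 0.6 -i 0.6 -p 0.5 -t 8
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.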

Train the PC-based Transformer model using a customized dataset

For users who want to build protein cluster (PC)-based Transformer models from scratch, we provide train_pc_model.py to demonstrate how to train models using customized protein databases. It covers building the protein cluster database, converting query sequences into numerical vectors, training and evaluating models, and making predictions. To run this script, in addition to the required dependencies listed above, you also need to install mcl with the following command:

conda install -c bioconda mcl

To achieve better results, we have the following recommendations:

  1. The protein database should be as comprehensive as possible.
  2. Setting stricter alignment thresholds when aligning query sequences to the PC database can further improve precision.
  3. In classification tasks, PCs that lack discriminative power may introduce noise and reduce classification performance. It is therefore advisable to remove them.
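
One possible way to act on the third recommendation is sketched below. It is not the procedure used in train_pc_model.py; it is an assumed, simple criterion in which a PC is kept only if the fraction of plasmid sequences containing it differs from the fraction of chromosome sequences containing it by at least a chosen gap. The function name, the min_gap parameter, the plasmid-versus-chromosome framing, and the toy PC identifiers are all hypothetical.

```python
def discriminative_pcs(plasmid_pcs, chromosome_pcs, min_gap=0.2):
    """Return the PCs whose occurrence rate differs between the two classes.

    plasmid_pcs and chromosome_pcs are non-empty lists of sets; each set
    holds the PC identifiers found in one training sequence.
    """
    def fraction(pc, groups):
        # Share of sequences in this class that contain the PC.
        return sum(pc in group for group in groups) / len(groups)

    all_pcs = set().union(*plasmid_pcs, *chromosome_pcs)
    return {
        pc for pc in all_pcs
        if abs(fraction(pc, plasmid_pcs) - fraction(pc, chromosome_pcs)) >= min_gap
    }


# Toy example: PC3 is equally common in both classes, so it is dropped.
plasmids = [{"PC1", "PC2"}, {"PC1", "PC3"}]
chromosomes = [{"PC2", "PC3"}, {"PC2"}]
print(sorted(discriminative_pcs(plasmids, chromosomes)))
# prints: ['PC1', 'PC2']
```

The right threshold depends on the dataset, so the filtered PC set should be checked against held-out data before it is used for training.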

Supplementary data

We have uploaded the supplementary data to OneDrive, including the PLSDB test set and real data. Detailed information can be found in README.txt.
