TransAnnot - a fast and all-in-one transcriptome annotation pipeline

TransAnnot is a modular toolkit implemented in C++ and licensed under GPL-3.0. It predicts protein functions, orthologous relationships and biological pathways for a whole newly sequenced transcriptome. It uses the high-performance MMseqs2 sequence-profile search to obtain the closest homologs from profile databases and infers protein function, structure and orthologous groups from the identified homologs. On request, it can first assemble raw sequencing reads de novo at the protein level using PLASS (Protein-Level ASSembler) before functional annotation.

Compile from source

Compiling from source helps to optimize TransAnnot for the specific system, which improves its performance. Compilation requires cmake, g++ and git. After compilation, the TransAnnot binary will be located in the build/bin directory.

git clone https://github.com/mariia-zelenskaia/transannot.git
cd transannot && mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..
make -j 4
make install
export PATH=$(pwd)/bin/:$PATH

โ—๏ธ If you compile from source under macOS we recommend to install and use gcc instead of clang as a compiler. gcc can be installed with Homebrew. Force cmake to use gcc as a compiler by running:

CC="$(brew --prefix)/bin/gcc-10"
CCX="$(brew --prefix)/bin/g++-10"
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..

Other dependencies for compiling from source are zlib and bzip2.

Workflow dependencies

  • PLASS - must be installed separately; see the corresponding repository. To perform de novo assembly, PLASS must be installed in the current working directory.

Before starting

tmp folder

The tmp folder keeps temporary files. By default, all intermediate output files from the different modules are kept in this folder. To clear tmp, pass the --remove-tmp-files parameter.

Quick start

TransAnnot can be run in a single step using the easy module:

transannot easytransannot <inputReads.fastq> Pfam-A.full eggNOG UniProtKB/Swiss-Prot <resDB> <tmp> [options]

If one or more of the target databases have already been downloaded in MMseqs2 format, provide the paths to them; otherwise use their names, and the easy module will download the databases.

Input

Possible inputs are:

  • assembled transcriptomes (obtained e.g. using Trinity) or raw transcriptome reads, which will be de novo assembled at the protein level using PLASS
  • metatranscriptomes
  • single-organism transcriptomes

Running

Modules

  • assemblereads de novo assembles raw sequencing reads into large genomic fragments (contigs).
  • annotate clusters the given input to reduce redundancy and runs sequence-profile and sequence-sequence searches to obtain the closest homologs with annotated function. It also retrieves descriptions of orthologous groups and protein families through mapping.
  • createquerydb creates a database from the sequence space (obtained from the downloaddb module) in the memory-efficient MMseqs2 format.
  • downloaddb downloads databases that serve as the search space for homology detection.
  • easytransannot is the easy module for a quick start; it performs assembly, downloads the databases and executes annotation.

PLASS assembly

Before running this step, PLASS must be installed; detailed installation instructions can be found in the PLASS repository. Please make sure PLASS is located in the current working directory.

In this step, reads are assembled with the Protein-Level ASSembler (PLASS), and an MMseqs2 database is then created. You may skip this step if the transcriptome is already assembled. Usage:

transannot assemblereads <inputReads.fastq[.gz|bz]> ... <inputReads.fastq[.gz|bz]> <o: fastaFile with assembly> <o: seqDB> <tmp> [options]

Downloading databases

In this step, sequence databases for homology searches will be downloaded.

To see detailed information about databases, please use the following command:

mmseqs databases -h

Then run the command below to download a database (use the same keyword as listed by mmseqs databases -h):

transannot downloaddb <selection> <outDB> <tmp> [options]

Since transannot runs 3 searches in the annotate module, this step should be repeated 3 times. The annotate module uses Pfam-A.full and eggNOG (profile databases) and UniProtKB/Swiss-Prot (sequence database) as its standard databases, so please download them using this module. For more information, see the MMseqs2 user guide.
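The three downloads can be scripted. The following is a minimal sketch, not an official workflow: the database keywords are the ones named above, while the output names (pfamDB, eggnogDB, swissprotDB), the tmp folder name and the RUN dry-run switch are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: fetch the three standard databases with one downloaddb call each.
# Output names (pfamDB, eggnogDB, swissprotDB) and "tmp" are hypothetical.
# RUN defaults to "echo", so the commands are only printed (dry run);
# set RUN="" in the environment to actually execute them.
RUN="${RUN-echo}"

download_all() {
    $RUN transannot downloaddb Pfam-A.full pfamDB tmp &&
    $RUN transannot downloaddb eggNOG eggnogDB tmp &&
    $RUN transannot downloaddb UniProtKB/Swiss-Prot swissprotDB tmp
}

download_all
```

Chaining the calls with && stops the script at the first failed download, so a later step never runs against a missing database.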

Annotate workflow

In the annotate module, representative sequences are extracted and used as search input to remove redundancy. 3 searches (one sequence-sequence and two sequence-profile) are then performed.

To run the annotate module of transannot, execute the following command:

transannot annotate <assembledQueryDB> <path to Pfam profileTargetDB> <path to eggNOG profileTargetDB> <path to SwissProt sequenceTargetDB> <o:resTsvFile> <tmp> [options]
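As an illustration of how the modules fit together, the sketch below chains assemblereads and annotate. It assumes that the seqDB written by assemblereads can be passed directly as the assembledQueryDB of annotate, and that the three target databases already exist; all file and database names, and the RUN dry-run switch, are hypothetical.

```shell
#!/bin/sh
# Sketch: assemble raw reads, then annotate the resulting database.
# File and database names are hypothetical; pfamDB, eggnogDB and swissprotDB
# are assumed to exist already (e.g. created with downloaddb).
# RUN defaults to "echo" (dry run); set RUN="" to execute the commands.
RUN="${RUN-echo}"

run_pipeline() {
    reads="$1"
    # Step 1: PLASS assembly, writing a FASTA file and a sequence database.
    $RUN transannot assemblereads "$reads" assembly.fasta queryDB tmp &&
    # Step 2: annotation against the Pfam, eggNOG and Swiss-Prot databases.
    $RUN transannot annotate queryDB pfamDB eggnogDB swissprotDB results.tsv tmp
}

run_pipeline reads.fastq
```

If the transcriptome is already assembled, only the second command is needed.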

Important options of the annotate module

The --simple-output parameter gives a simplified output that includes only the query and target IDs, the header of the target database and the E-value, whereas the standard output also contains the sequence identity and bit score for each target sequence. Usage:

transannot annotate $1 $2 $3 $4 $5 $6 --simple-output 

When this flag is not used, the standard output is produced.

--min-seq-id sets the minimum sequence identity for the searches. The default value is 0.3.

--no-run-clust performs annotation without clustering, so all input sequences undergo the similarity searches.

Output

The output is a tab-separated .tsv file containing the following columns:

queryID targetID description E-value sequenceIdentity bitScore typeOfSearch nameOfDatabase 
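Because the output is plain tab-separated text, it can be post-processed with standard tools. The sketch below assumes the column order listed above and no header row; the helper names (filter_by_evalue, hits_per_database) and the demo rows are made up for illustration and are not real TransAnnot results.

```shell
#!/bin/sh
# Sketch: post-process the annotate output with awk.
# Assumed column order (from the list above):
#   1 queryID  2 targetID  3 description  4 E-value
#   5 sequenceIdentity  6 bitScore  7 typeOfSearch  8 nameOfDatabase

# Print rows whose E-value (column 4) is at or below a cutoff.
filter_by_evalue() {   # usage: filter_by_evalue <cutoff> <file.tsv>
    awk -F '\t' -v max="$1" '($4 + 0) <= (max + 0)' "$2"
}

# Count rows per database (column 8), sorted by database name.
hits_per_database() {  # usage: hits_per_database <file.tsv>
    awk -F '\t' '{ n[$8]++ } END { for (db in n) print db "\t" n[db] }' "$1" | sort
}

# Made-up demo rows, for illustration only.
demo="$(mktemp)"
printf 'q1\tt1\tkinase domain\t1e-30\t0.62\t180\tseq-prof\tPfam\n'       >  "$demo"
printf 'q1\tt7\tprotein kinase\t1e-25\t0.55\t150\tseq-seq\tSwissProt\n'  >> "$demo"
printf 'q2\tt9\tunknown family\t0.5\t0.31\t25\tseq-prof\tPfam\n'         >> "$demo"

filter_by_evalue 1e-5 "$demo"
hits_per_database "$demo"
rm -f "$demo"
```

On a real run, replace the demo file with the path given as <o:resTsvFile>.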

Contributors

mariia-zelenskaia, milot-mirdita, vragh, yazhinia
