Giter Club home page Giter Club logo

metapgn's Introduction

Introduction

This pipeline is used for: (i) construction of pangenome networks (PGNs) and, (ii) visualization of the resulting pangenome networks.

I. Steps for constructing pangenome networks

Requirement

  1. Perl (version 5.0 or above, https://www.perl.org/get.html),

  2. MetaGeneMark (version 2.8 or above, http://exon.gatech.edu/license_download.cgi)

    undefinedFrom decompressedfiles of MetaGeneMark, copy the key “gm_key[_64/_32]” to your home directory, and copy the program “gmhmmp” and “MetaGeneMark_v1.mod” to the directory “Scripts/basic_src” in MetaPGN.

    cp  gm_key[_64/_32]  ~/.gm_key
    cp  gmhmmp MetaGeneMark_v1.mod path/to/MetaPGN/Scripts/basic_src
    

Usage

1. Make sure you have these two inputs:

1.1 Query assemblies (e.g., "query.assemblies.fasta" in "Demos/Inputs/assemblies/") 1.2 A reference genome (e.g., "Escherichia_coli_K_12_substr__MG1655.genes.fasta" in "Demos/Reference/Gene_features")

2. Execute the script named “MetaPGN_flow.pl” to generate a shell script (e.g., “MetaPGN.sh”) containing necessary steps in the pipeline (the help text for “MetaPGN_flow.pl” is shown below).

2.1 Usage

perl MetaPGN_flow.pl -q <input query assemblies sequence> -qt <input query assemblies type> -ref <input reference sequence> -p <output prefix> <Other options> >MetaPGN.sh

2.2 Explanation of options

-type   <str>   sequence type of input query assemblies, nt (nucleotide sequence) or aa (amino acid sequence), (default nt)
-q      <str>   input file (in FASTA format) or file list (with the suffix .list) for query sequences
-qt     <str>   source of the input sequence, genome or metagenome (default genome)
-gene   <str>   type of the input sequence, gene or assembly (default assembly)
-ref    <str>   input file (in FASTA format) for the reference sequence
-meta   <str>   input file (in metadata format) for metadata of the reference sequence
-p      <str>   output prefix
-M      <int>   minimum gene length in gene prediction (default 100)
-qs     <str>   qsub or local, when run the Step2. Gene redundancy elimination (default local)
-qsp    <str>   project name for qsub -P (default none)
-qsq    <str>   queue name for qsub -q (default none)
-div    <int>   when you choose -qs qsub, it will divide your gene prediction results into segment (default 10)
-aS     <float> alignment coverage for the shorter sequence (default, nt: 0.9, aa: 0.7)
if set to 0.9, the alignment must cover 90% of the sequence
-i      <float> sequence identity threshold (default, nt: 0.95, aa: 0.7. See Cd-hit manual for more details)
-n      <int>   word_length (default nt: 8, aa: 5. See Cd-hit manual for more details)
-d      <int>   length of description in the *.clstr file (default 0)
if set to 0, it takes the fasta defline and stops at first space (See Cd-hit manual for more details)
-g1     <int>   1 or 0 (default 1)
Based on cd-hit's default algorithm, a sequence is clustered to the first cluster that meet the threshold (fast cluster). If set to 1, the program will cluster it into the most similar cluster that meet the threshold (accurate but slow mode) but either 1 or 0 won't change the representatives of final clusters (See Cd-hit manual for more details)
-T      <int>   number of threads in cd-hit-est, with 0, all CPUs will be used (default 6. See Cd-hit manual for more details)
-G      <int>   use global sequence identity (default 0. See Cd-hit manual for more details)
-c      <int>   cut off for the gene count on an assembly (default 3)
-r      <float> cut off for the ratio of the number of shared genes on an assembly to the total number of genes on that assembly (default 0.5)
-h      <str>   display this help text and exit

2.3 Examples

Example 1, the input sequence is nucleotide sequence of assemblies from metagenomes:

perl MetaPGN_flow.pl -q Inputs/query.assemblies.list -qt metagenome -ref References/Gene_features/Escherichia_coli_K_12_substr__MG1655.genes.fasta -meta References/Annotations/Escherichia_coli_K_12_substr__MG1655.idx.metadata -p metagenome_assemblies_nt > MetaPGN.sh

Example 2, the input sequence is nucleotide sequence of genes from isolated genomes:

perl MetaPGN_flow.pl -q Inputs/genes/query.genes.nt.fasta -qt genome -ref References/Gene_features/Escherichia_coli_K_12_substr__MG1655.genes.fasta -p genome_genes_nt -gene gene > MetaPGN.sh

Example 3, the input sequence is amino acid sequence of genes from metagenomes, and the script will use qsub to submit tasks in redundancy elimination of genes onto available computing clusters:

perl MetaPGN_flow.pl -q Inputs/query.genes.aa.list -qt metagenome -type aa -ref References/Gene_features/Escherichia_coli_K_12_substr\_\_MG1655.genes.prot.fasta -meta References/Annotations/Escherichia_coli_K_12_substr\_\_MG1655.prot.idx.metadata -gene gene -qs qsub -div 3 -p genome_genes_aa > MetaPGN.sh

3. Execute the shell script to construct pangenome networks.

Usage: sh MetaPGN.sh [>MetaPGN.log 2> MetaPGN.err &]

II. Steps for installing the plug-in and visualizing pangenome networks in Cytoscape

Requirement

  1. Java (version 8 or above, http://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-2133155.html),
  2. Cytoscape (version 3.0 or above, http://www.cytoscape.org/download.php)

Usage

1. In the interface in Cytoscape,

​ 1.1 Install the plugin pgnVisApp.jar through "Apps > App Manager > Install Apps > Install from file…". These steps are demonstrated as follows:

Open “App Manager”

Install “pgnVisApp.jar”

The “pgnVisApp” appears at the “Currently Installed” panel

​ 1.2 After installing, there should be an optionnamed "Arrange node" in the "Select" menu.

An option named “PGN Arrange nodes” in the “Select”menu after installing

2. Import the network files into Cytoscape

​ 2.1 Get files of apangenome network (used below are filesfor the pangenome network of the 5 E.coli strains. You can get it from "Network_files/Text_files/5_Ecoli_strains").

​ 2.2 Import the edge fileof the pangenome network through "File > Import > Network >file".

Importthe edge file

Click “OK”

​ 2.3 Import the node fileof the pangenome network through "File > Import > Table >file".

Import the node file (the interface shows the pangenome network before arrangement)

Click “OK”

3. Arrange nodes

​ 3.1 In the "NodeTable" in "Table Panel", add two columns named"search" and "coordinate" through "Create New Column> New Single Column > String", which will be used to store thearrangement and coordinate information.

Addtwo string-type columns named “search” and “coordinate”

Column “search” and “coordinate” appear at the “Node Table” panel

​ 3.2 Click the option "PGN Arrange nodes" in the "Select" menu and the pangenome network will be arranged according to rules introduced in the paper.

Click “PGN Arrange nodes”

The pangenome network after arrangement (Most of the nodes have been arranged, while others remain in the central area)

4. [Optional] Customize the shape of nodes and edges, searchtargeted nodes, and retrieve annotations of selected elements.

​ 4.1 In the “Style” panelof the “Control Panel” (red rectangle), customize the shape of nodes and edges basedon their attributes of interest. The example below selects the “Ripple” stylefor the network (purple rectangle), customizes the fill color for nodes accordingto their status (green for “reference-specific”, orange for “query-specific”and blue for “shared” nodes) (dark blue rectangle), and specifies the size ofnodes to reflect their weights in the network (green rectangle). Moreattributes can be customized and are not shown here (orange rectangle).

Select a node of interest from the “Table Panel” (dark blue dashed rectangle) or the original node file according to its annotation, type the ID (shared name) in the search box (red dashed rectangle) and press enter to locate the node (Shown below is Node 1237, which represents the sequence of gene fliD).

Type the ID of the targeted node and press enter to locate the node

Click “Zoom selected region” to show the node

Click “Zoom Out” or scroll down the mouse wheel to show its context

The context of the selected node is shown after properly zooming

​ 4.3 Select nodes (or edges) of interest to retrieve their annotation

Select a node within an interesting loop, and click “First Neighbors of Selected Nodes” (for 4 times in this example), or press and hold “Ctrl” until done selecting nodes using the mouse

Annotation of selected nodes in the “Table Panel” (dark blue dashed rectangle)

metapgn's People

Contributors

peng-ye avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

metapgn's Issues

Visualization plugin does not appear to work with Cytoscape 3.8.0

I am trying the Cytoscape visualization plugin to rearrange the network of the example dataset (5 E coli genomes). I followed the instructions to install the plugin and it appears in the App Manager window as installed, version 0.1. After loading the example edges and node tables and adding the search and coordinates columns, nothing happens when I click on Select -> PGN Arrange nodes. I tried selecting all nodes and it doesn't do anything then either. Am I missing a crucial step, or is the plugin no longer compatible with the current version of Cytoscape? Thanks!

How to run MetaPGN with MAGs

Hi,

I think this work is interesting and I got some bins affiliated with same genus from metagenomes recently. So I am trying to constructed pangenome network based on them.
Should I treat them as isolated genomes or just assemblies from metagenome?
And could I just use their nt sequences as input or I have to predict and annotate the genes then combine them into a single file?

Thanks in advance!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.