METabolic And BiogeOchemistry anaLyses In miCrobes
Version 1.3
This software enables the prediction of metabolic and biogeochemical functional trait profiles to any given genome datasets. These genome datasets can either be metagenome-assembled genomes (MAGs), single-cell amplified genomes (SAGs) or pure culture sequenced genomes. It can also calculate the genome coverage. The information is parsed and diagrams for elemental/biogeochemical cycling pathways (currently Nitrogen, Carbon, Sulfur and "other") are produced.
Zhichao Zhou, [email protected]
Patricia Tran, [email protected]
Karthik Anantharaman, [email protected]
Anantharaman Microbiome Laboratory
Department of Bacteriology, University of Wisconsin, Madison
If you are using this program, please consider citing our preprint, available on BioRxiv
Zhou Z, Tran P, Liu Y, Kieft K, Anantharaman K. "METABOLIC: A scalable high-throughput metabolic and biogeochemical functional trait profiler based on microbial genomes" (2019). BioRxiv doi: https://doi.org/10.1101/761643
- Go to where you want the program to be and clone the github repository or click the green buttom "download ZIP" folder, and unzip. The perl and R scripts and dependent databases should be kept in the same directory.
git clone https://github.com/AnantharamanLab/METABOLIC.git
-
METABOLIC requires the following programs to be added to your system path:
2.1. Perl (>= v5.010)
Perl modules: (to install, use: "perl -MCPAN -e shell cpan" or "cpan -i Data::Dumper"; to check if present, use "perldoc -l Data::Dumper") use Data::Dumper; use POSIX qw(strftime); use Excel::Writer::XLSX; use Getopt::Long; use Statistics::Descriptive; use Array::Split qw(split_by split_into); use Bio::SeqIO; use Bio::Perl; use Bio::Tools::CodonTable; use Carp;
2.2. HMMER (>= v3.1b2)
2.3. Prodigal (>= v2.6.3)
Remarks: executable must be named prodigal and not prodigal.linux
2.4. SAMtools (>= v0.1.19)
2.5. BAMtools (>= v2.4.0)
2.6. CoverM
2.7. R (>= 3.6.0)
Installing required R packages using Rscript:
Copy and paste the following command into your terminal window (may require super user permissions to run):
Rscript -e 'install.packages("diagram", repos = "http://cran.us.r-project.org")'
You can follow the install instructions of each program, or you could also
install them via Conda and add them to your system path:
Link: https://anaconda.org
NOTE: For steps 3-7, we provide a "run_to_setup.sh" script for easily setting up dependent databases.
-
METABOLIC requires the KofamKOALA hmm and METABOLIC hmm databases
KofamKOALA website3.1. Download KofamKOALA hmm database files:
mkdir kofam_database
cd kofam_database
wget -c ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz
wget -c ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz
gzip -d ko_list.gz
tar xzf profiles.tar.gz; rm profiles.tar.gz
cd profiles
cp ../../Accessory_scripts/batch_hmmpress.pl ./
perl batch_hmmpress.pl
3.2. The METABOLIC hmm database in "METABOLIC_hmm_db.tgz" contains custom hmm files, self-parsed Pfam and TIRGfam files. It needs to be decompressed to the folder "METABOLIC_hmm_db" and stays in the same directory of KofamKOALA hmm database and scripts.
tar zxvf METABOLIC_hmm_db.tgz
- METABOLIC uses the "METABOLIC_temp_and_db" which contains the hmm result table and KEGG database information.
Decompress the METABOLIC_temp_and_db.tgz to the folder "METABOLIC_temp_and_db" and keep it in the same directory of KofamKOALA hmm database and scripts.
tar zxvf METABOLIC_temp_and_db.tgz
- This software also contains "Accessory_scripts.gz", which needs to be decompressed before use.
tar zxvf Accessory_scripts.tgz
- This software also contains "Motif.tgz", which needs to be decompressed before use.
tar zxvf Motif.tgz
- Finally, this software also contains "5_genomes_test.tgz", which needs to be decompressed before use. This is a set of 5 genomes that you can use to test run the program to see if it works correctly before running your real samples. (see end of page)
tar zxvf 5_genomes_test.tgz
-
The genome files should end with ".fasta"; The protein files should end with ".faa". These files should be in one single folder, which you will use as an argument for the option "-in-gn" in the Perl script "perl METABOLIC_v1.3.pl" (See example at end of page)
-
The "-o" or "-r" option requires inputting a text file to show the path of where the metagenomic reads are located. The metagenomics reads refer to the metagenomic read datasets that you used to generate the MAGs. A sample for this txt is as follows:
#Reads pair name with complete pathway:
/slowdata/Reads/METABOLIC_test_reads/SRR3577362_sub_1.fastq,/slowdata/Reads/METABOLIC_test_reads/SRR3577362_sub_2.fastq
/slowdata/Reads/METABOLIC_test_reads/SRR3577362_sub2_1.fastq,/slowdata/Reads/METABOLIC_test_reads/SRR3577362_sub2_2.fastq
Please notice that the paired reads are in the same line.
After running the whole program (perl script) you will obtain the following files:
- METABOLIC result table
This spreadsheet has four tabs:
- "HMMHitNum" = Number of HMM hits. if you scroll to the right with the coloured cells you'll find the presence/absence, the number of hits, and on which scaffold it was on.
- "FunctionHit" = hits to custom HMM curated database.
- "KEGGModuleHit" = KEGG module hits with modules and modules category names.
- "KEGGModuleStepHit" = similar to the previous one but broken down into smaller categories (steps).
In all cases if you scroll down you will see what "Gn00X" colnames refer to (they are based on your fasta file names for the genomes you gave.
-
Each hmm hit protein collection
-
Elemental/Biogeochemical cycling pathways for each genome and a summary scheme (Both files and figures)
In the designated R output folder named "R_ouput", you will have the following files for EACH MAG:
GenomeName.draw_sulfur_cycle_single.PDF
GenomeName.draw_nitrogen_cycle_single.PDF
GenomeName.draw_other_cycle_single.PDF
GenomeName.draw_carbon_cycle_single.PDF
A red arrow means present, and a black arrow means absent.
If you input a "metagenomic reads" txt file, the software will help to calculate the gene abundance, then you will have have the following output as the summary diagram of pathways at a community scale:
draw_sulfur_cycle_total.PDF
draw_other_cycle_total.PDF
draw_nitrogen_cycle_total.PDF
draw_carbon_cycle_total.PDF
You could also run the Rscript separately. Once you have the "R_input" folders, use it as input to visualize your results in the R script
RScript draw_biogeochemical_cycles.R R_input R_Output TRUE
The draw_biogeochemical_cycles.R command can be given as a relative paths or full path.
The first argument is the name of the folder with your inputs (in this case "R_input_files"). This is where your files that end with "...R_input.txt" and your "Total.R_input.txt" files are. Note!: There is no forward slash after the folder name. You can use full paths or relative paths to the input folder.
The second argument is the the name of the output folder where your images will be saved. The folder does not have to exist already (i.e. no need to mkdir first). Note!: Once again there is no forward slash after the folder name
The last argument takes the value "TRUE" or "FALSE". If it is "TRUE" it means that you have inputted mapped reads in the beginning, and you will have a "Total.R_input.txt" file that can be parsed to make the biogeochemical cycles summary figures. This is important because the summary figure has coverage information (the 3rd column of that file).
The test files are given in the folder "5_genomes_test.tgz", which includes the input five genome files and the running results.
One could use this to test whether you have successfully installed all the prerequisites in a proper way.
Follow similar instructions for your real files.
perl your/path/to/put/METABOLIC-folder/METABOLIC_v1.3.pl -in-gn [folder with all your genomes] -t [number of threads] -o [METABOLIC output folder] -m your/path/to/put/METABOLIC-folder
v1.1 -- Sep 4, 2019 --
fix the parallel problem, change from hmmscan to hmmsearch, and update the "METABOLIC_temp_and_db"
v1.2 -- Sep 5, 2019 --
fix the prodigal parallel run, change "working-dir" to "METABOLIC-dir"
v1.3 -- Sep 5, 2019 --
fix the output folder problem, the perl script could be called in another place instead of the original place