
pangia's Introduction

PanGIA Bioinformatics

The PanGIA bioinformatics pipeline uses BWA or Minimap2 to determine where reads map and provides taxonomic identification down to the strain level. Beyond community profiling, PanGIA uses two approaches to derive a confidence metric: one relies on the uniqueness of sequences, and the other compares test samples with control samples on a per-organism basis.

The software comes with a web-based user interface, run in Docker, for job submission and interactive result visualization; it provides pathogen information and real-time filtering of results. The pipeline was tested and validated using many synthetic datasets ranging in community composition and complexity, and was successfully applied to spiked clinical samples.

A Docker version is also available on Docker Hub. The Docker container runs PanGIA-UI, a web-based GUI that lets users analyze their datasets with PanGIA and access the results.


REQUIREMENTS

Third-party software:

  • Python >= 3.4
  • BWA >= v0.7
  • Minimap2 >= 2.1
  • samtools >= 1.8
  • GNU parallel

PanGIA requires the following Python dependencies:

  • Pandas >= 0.22
  • SciPy >= 0.14
  • Bokeh >= 0.13 (optional)
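
The requirements above can be checked before a first run. The following is a minimal sketch, not part of PanGIA: it assumes the executables are named `bwa`, `minimap2`, `samtools`, and `parallel` on `PATH`, and that the Python packages import as `pandas`, `scipy`, and `bokeh`. It only reports presence and does not verify version numbers.

```python
import importlib.util
import shutil

# Executable and module names are assumptions; adjust them to your installation.
TOOLS = ["bwa", "minimap2", "samtools", "parallel"]
MODULES = ["pandas", "scipy", "bokeh"]  # bokeh is optional


def check_requirements(tools=TOOLS, modules=MODULES):
    """Return a dict mapping each requirement name to True if it was found."""
    found = {}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        found[tool] = shutil.which(tool) is not None
    for module in modules:
        # find_spec returns None when a top-level module cannot be imported.
        found[module] = importlib.util.find_spec(module) is not None
    return found


for name, ok in check_requirements().items():
    print(f"{name:10s} {'found' if ok else 'MISSING'}")
```

Anything reported as MISSING can then be installed, for example through Conda, before continuing.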

DOWNLOAD DATABASE

PanGIA Database can be downloaded from LANL:

https://edge-dl.lanl.gov/PanGIA/database/
  1. Download taxonomy and pathogen metadata:

  2. Download BWA index(es) for reference genomes:

    • NCBI Refseq89 reference and representative genomes -- Bacteria/Archaea/Viruses (BAV) [tar]
    • NCBI Refseq89 complete genomes of CDC biothreat agents (adds) [tar]
    • (Optional) NCBI Refseq89 genomes of Plasmodium [tar]
  3. (Optional) Download BWA indexes for host genomes:

    • Human genome GRCh38.p12 [tar]
    • Human genome alternative assembly CHM1_1.1 [tar]
    • JCVI human genome assembly [tar]
    • Mosquitos genomes [tar]
  4. (Optional) Original sequences databases in FASTA format:

    • All raw sequences can be found here.

QUICK INSTALLATION

  1. Make sure the requirements and dependencies are installed properly. Conda is a quick way to install them.

  2. Retrieve PanGIA:

git clone https://github.com/poeli/pangia.git && cd pangia
  3. Download databases:
curl -O https://edge-dl.lanl.gov/PanGIA/database/PanGIA_20180915_taxonomy.tar.gz
curl -O https://edge-dl.lanl.gov/PanGIA/database/PanGIA_20180915_NCBI_genomes_refseq89_BAV.fa.tar
curl -O https://edge-dl.lanl.gov/PanGIA/database/PanGIA_20180915_NCBI_genomes_refseq89_adds.fa.tar
curl -O https://edge-dl.lanl.gov/PanGIA/database/PanGIA_20180915_NCBI_genomes_refseq89_Human_GRCh38.p12.fa.tar
  4. Decompress the databases. All files will be decompressed into the "pangia/database" directory.
tar -xf PanGIA_20180915_taxonomy.tar.gz
tar -xf PanGIA_20180915_NCBI_genomes_refseq89_BAV.fa.tar
tar -xf PanGIA_20180915_NCBI_genomes_refseq89_adds.fa.tar
tar -xf PanGIA_20180915_NCBI_genomes_refseq89_Human_GRCh38.p12.fa.tar
  5. Enjoy.

EXAMPLE USAGE

./pangia.py \
  -i test.1.fastq test.2.fastq\
  -db database/NCBI_genomes_refseq89_*.fa  \
  -t 24

Run the HMP Mock Community even sample (SRR172902) against the PanGIA NCBI RefSeq89 BAV and adds databases with 24 threads, and save the mapping information to a JSON file for later use as a background.

./pangia.py \
  -i SRR172902.fastq \
  -db database/NCBI_genomes_refseq89_BAV.fa database/NCBI_genomes_refseq89_adds.fa \
  -sb \
  -t 24

Run dataset "HPV_test.fq" against all PanGIA databases with 24 threads, load QCB_background_REP1 as the background, and report a "combined" score.

./pangia.py \
  -i HPV_test.fq \
  -db database/NCBI_genomes_*.fa \
  -lb background/QCB_background_REP1_allQC.pHostDB_NoHost.pangia.json.gz \
  -st combined \
  -t 24

QUICK PanGIA-VIS

  1. PanGIA cleans up the temp directory after the job is done. Run pangia.py with --keepTemp if you want PanGIA-VIS to display the genome coverage plot.

  2. Install Bokeh >= v1.0.

conda install -c bokeh bokeh
  3. Run pangia-vis.pl with a PanGIA result file (*.result.tsv). For example:
pangia-vis.pl pangia_vis/data/test.tsv
  4. Enjoy!

REPORT

COLUMN NAME DESCRIPTION
1 LEVEL Taxonomic rank
2 NAME Taxonomic name
3 TAXID Taxonomic ID
4 READ_COUNT Number of raw mapped reads
5 READ_COUNT_RNR Number of mapped reads normalized by shared reference
6 READ_COUNT_RSNB Number of rank-specific mapped reads normalized by identity and # of shared reference
7 LINEAR_COV Proportion of covered signatures to total signatures of mapped organism(s)
8 DEPTH_COV Depth of coverage
9 DEPTH_COV_NR Depth of coverage normalized by # of shared reference
10 RS_DEPTH_COV_NR Depth of coverage calculated by rank-specific reads normalized by # of shared reference at this rank
11 PATHOGEN Pathogen or not
12 SCORE Confidence score
13 REL_ABUNDANCE Relative abundance
14 ABUNDANCE Abundance
15 TOTAL_BP_MISMATCH Total number of mismatch base-pairs
16 NOTE Note
17 RPKM Reads Per Kilobase per Million mapped reads
18 PRI_READ_COUNT Number of reads mapped to this organism as a primary alignment
19 TOL_RS_READ_CNT Total rank specific read count
20 TOL_NS_READ_CNT Total rank non-specific read count
21 TOL_RS_RNR Total rank specific read count normalized by the number of shared references
22 TOL_NS_RNR Total rank non-specific read count normalized by the number of shared references
23 TOL_GENOME_SIZE Total size of genome(s) belong to this taxa
24 LINEAR_LENGTH Number of non-overlapping bases covering the signatures
25 TOTAL_BP_MAPPED Total bases of mapped reads
26 RS_DEPTH_COV Depth of coverage calculated by rank-specific reads
27 FLAG Superkingdom flag
28-36 STR - ROOT Number of READ_COUNT at each rank (strain to root)
37-45 STR_rnb - ROOT_rnb Number of READ_COUNT_RSNB at each rank (strain to root)
46-54 STR_rnr - ROOT_rnr Number of READ_COUNT_RNR at each rank (strain to root)
55-63 STR_ri - ROOT_ri read-mapping identity at each rank (strain to root)
64 SOURCE Pathogenic - sample sources
65 LOCATION Pathogenic - sample locations
66 HOST Pathogenic - sample hosts
67 DISEASE Pathogenic - diseases
68 SCORE_UNIQ Score based on uniqueness information among genomes (overall)
69 SCORE_BG Score based on comparing input dataset with input background
70 SCORE_UNIQ_CUR_LVL Score based on uniqueness information among genomes (rank)
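
A report with these columns can be post-filtered with stricter thresholds after a run. The sketch below is an illustration rather than PanGIA code. It assumes the report is a tab-separated file whose header row uses the column names listed above and whose cutoff columns hold plain numbers. The pairing of each cutoff with a report column is also an assumption; the default values are the ones printed in the pangia.py help text. The rows in the usage example are made-up toy values.

```python
import csv
import io

# Defaults come from the pangia.py help text; which report column each cutoff
# applies to is an assumption made for this sketch.
DEFAULT_CUTOFFS = {
    "SCORE": 0.0,               # -ms / --minScore
    "READ_COUNT": 10,           # -mr / --minReads
    "READ_COUNT_RSNB": 2.5,     # -mb / --minRsnb
    "LINEAR_LENGTH": 200,       # -ml / --minLen
    "LINEAR_COV": 0.004,        # -mc / --minCov
    "DEPTH_COV": 0.01,          # -md / --minDc
    "RS_DEPTH_COV_NR": 0.0009,  # -mrd / --minRsdcnr
}


def filter_report(handle, cutoffs=DEFAULT_CUTOFFS, level=None):
    """Yield report rows (as dicts) that meet every cutoff.

    If level is given (for example "species"), rows at other ranks are skipped.
    """
    for row in csv.DictReader(handle, delimiter="\t"):
        if level is not None and row["LEVEL"] != level:
            continue
        if all(float(row[col]) >= minimum for col, minimum in cutoffs.items()):
            yield row


# Toy input with invented values, only to show the call.
toy_report = (
    "LEVEL\tNAME\tSCORE\tREAD_COUNT\tREAD_COUNT_RSNB\tLINEAR_LENGTH\t"
    "LINEAR_COV\tDEPTH_COV\tRS_DEPTH_COV_NR\n"
    "species\tOrganism A\t0.9\t120\t80.0\t5000\t0.2\t0.5\t0.3\n"
    "species\tOrganism B\t0.9\t5\t4.0\t300\t0.01\t0.02\t0.01\n"
)
for row in filter_report(io.StringIO(toy_report), level="species"):
    print(row["NAME"], row["SCORE"])
```

In this toy input the second row is dropped because its READ_COUNT of 5 is below the default minimum of 10.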

USAGE

usage: pangia.py [-h] (-i [FASTQ] [[FASTQ] ...] | -s [SAMFILE])
                 [-d [[BWA_INDEX] [[BWA_INDEX] ...]]] [-dp [PATH]]
                 [-asl <INT>] [-ams <INT>] [-ao <STR>] [-se]
                 [-st {bg,standalone,combined}]
                 [-m {report,class,extract,lineage}]
                 [-rf {basic,r,rnb,rnr,ri,patho,score,ref,full,all} [{basic,r,rnb,rnr,ri,patho,score,ref,full,all} ...]]
                 [-da] [-par <INT>] [-xnm <INT>] [-x [TAXID]] [-r [FIELD]]
                 [-t <INT>] [-o [DIR]] [-td [DIR]] [-kt] [-p <STR>] [-ps]
                 [-sb] [-lb [<FILE> [<FILE> ...]]] [-ms <FLOAT>] [-mr <INT>]
                 [-mb <INT>] [-ml <INT>] [-mc <FLOAT>] [-md <FLOAT>]
                 [-mrd <FLOAT>] [-np] [-pd] [-if <STR>] [-nc] [-c] [--silent]
                 [--verbose] [--version]

PanGIA Bioinformatics 1.0.0

optional arguments:
  -h, --help            show this help message and exit
  -i [FASTQ] [[FASTQ] ...], --input [FASTQ] [[FASTQ] ...]
                        Input one or multiple FASTQ file(s). Use space to
                        separate multiple input files.
  -s [SAMFILE], --sam [SAMFILE]
                        Specify the input SAM file. Use '-' for standard
                        input.
  -d [[BWA_INDEX] [[BWA_INDEX] ...]], --database [[BWA_INDEX] [[BWA_INDEX] ...]]
                        Name/path of BWA-MEM index(es). [default: None]
  -dp [PATH], --dbPath [PATH]
                        Path of databases. If this option isn't specified but
                        a path is provided in "--database" option, this path
                        of database will also be used in dbPath. Otherwise,
                        the program will search "database/" in program
                        directory. [default: database/]
  -asl <INT>, --alignSeedLength <INT>
                        Minimum seed length uses in BWA-MEM [default: 40]
  -ams <INT>, --alignMinScore <INT>
                        Minimum alignment score (AS:i tag) for BWA-MEM
                        [default: 60]
  -ao <STR>, --addOptions <STR>
                        Additional options for BWA-MEM (no need to add -t)
                        [default: '-h150 -B2']
  -se, --singleEnd      Input single-end reads or treat paired-end reads as
                        single-end [default: False]
  -st {bg,standalone,combined}, --scoreMethod {bg,standalone,combined}
                        You can specify one of the following scoring method:
                        "bg"         : compare mapping results with the background;
                        "standalone" : score based on uniqueness;
                        "combined"       : bg * standalone;
                        [default: 'standalone']
  -m {report,class,extract,lineage}, --mode {report,class,extract,lineage}
                        You can specify one of the following output modes:
                        "report"  : report a summary of profiling result;
                        "class"   : output results of classified reads;
                        "extract" : extract mapped reads;
                        "lineage" : output abundance and lineage in a line;
                        Note that only results/reads belongs to descendants of TAXID will be reported/extracted if option [--taxonomy TAXID] is specified. [default: 'report']
  -rf {basic,r,rnb,rnr,ri,patho,score,ref,full,all} [{basic,r,rnb,rnr,ri,patho,score,ref,full,all} ...], --reportFields {basic,r,rnb,rnr,ri,patho,score,ref,full,all} [{basic,r,rnb,rnr,ri,patho,score,ref,full,all} ...]
                        You can specify following set of fields to display in the report:
                        "basic" : essential fields that will display in the reports;
                        "r"     : rank specific read count;
                        "rnb"   : rank specific read count normalized by 
                                  both identity and # of ref (1*identity/num_refs);
                        "rnr"   : rank specific read count normalized by 
                                  the number of references (1/num_refs);
                        "ri"    : rank specific read identity
                                  (mapped_length-nm)/read_length;
                        "patho" : metadata of pathogen;
                        "score" : detail score information;
                        "ref"   : mapped reference(s) and their locations
                        "full"  : display additional information
                        "all"   : display all of above;
                        [default: 'all']
  -da, --displayAll     Display all taxonomies including being filtered out
                        [default: None]
  -par <INT>, --procAltRefs <INT>
                        Process the number of different references in
                        alternative alignments [default: 30]
  -xnm <INT>, --extraNM <INT>
                        Process alternative alignments with extra number of
                        mismatches than primary alignment [default: 1]
  -x [TAXID], --taxonomy [TAXID]
                        Specify a NCBI taxonomy ID. The program will only
                        report/extract the taxonomy you specified.
  -r [FIELD], --relAbu [FIELD]
                        The field will be used to calculate relative
                        abundance. [default: DEPTH_COV]
  -t <INT>, --threads <INT>
                        Number of threads [default: 1]
  -o [DIR], --outdir [DIR]
                        Output directory [default: .]
  -td [DIR], --tempdir [DIR]
                        Default temporary directory [default:
                        <OUTDIR>/<PREFIX>_tmp]
  -kt, --keepTemp       Keep temporary directory after finishing the pipeline.
  -p <STR>, --prefix <STR>
                        Prefix of the output file [default:
                        <INPUT_FILE_PREFIX>]
  -ps, --pathoScoreOnly
                        Only calculate score for pathogen under '--scoreMethod
                        bg'
  -sb, --saveBg         Save current readmapping result in JSON to
                        <PREFIX>.json
  -lb [<FILE> [<FILE> ...]], --loadBg [<FILE> [<FILE> ...]]
                        Load one or more background JSON gzip file(s)
                        [default: None]
  -ms <FLOAT>, --minScore <FLOAT>
                        Minimum score to be considered valid [default: 0]
  -mr <INT>, --minReads <INT>
                        Minimum number of reads to be considered valid
                        [default: 10]
  -mb <INT>, --minRsnb <INT>
                        Minimum number of reads to be considered valid
                        [default: 2.5]
  -ml <INT>, --minLen <INT>
                        Minimum linear length to be considered valid [default:
                        200]
  -mc <FLOAT>, --minCov <FLOAT>
                        Minimum linear coverage to be considered a valid
                        strain [default: 0.004]
  -md <FLOAT>, --minDc <FLOAT>
                        Minimum depth of coverage to be considered a valid
                        strain [default: 0.01]
  -mrd <FLOAT>, --minRsdcnr <FLOAT>
                        Minimum rank specific depth of coverage normalized by
                        the number of mapped references to be considered a
                        valid strain [default: 0.0009]
  -np, --nanopore       Input reads is nanopore data. This option is
                        equivalent to use [-ao='-h 150 -x ont2d' -ms 0 -mr 1
                        -mb 3 -ml 50 -asl 24 -ams 70]. [default: FALSE]
  -pd, --pathogenDiscovery
                        Adjust options for pathogen discovery. This option is
                        equivalent to use [-ms 0 -mr 3 -mb 1 -ml 50 -asl 24
                        -ams 50 -mc 0 -md 0 -mrd 0]. [default: FALSE]
  -if <STR>, --ignoreFlag <STR>
                        Ignore reads that mapped to the references that have
                        the flag(s) [default: None]
  -nc, --noCutoff       Remove all cutoffs. This option is equivalent to use
                        [-ms 0 -mr 0 -mb 0 -ml 0 -mc 0 -md 0 -mrd 0].
  -c, --stdout          Write on standard output.
  --silent              Disable all messages.
  --verbose             Provide verbose running messages and keep all
                        temporary files.
  --version             Print version number.
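
The per-read quantities named in the `--reportFields` and `--scoreMethod` help text can be written out as small functions. This is a minimal sketch of those formulas only, taken literally from the help text: `ri` is `(mapped_length-nm)/read_length`, `rnr` is `1/num_refs`, `rnb` is `1*identity/num_refs`, and the "combined" score is `bg * standalone`. The function names are hypothetical and do not correspond to functions in pangia.py, and how PanGIA aggregates these values across reads is not shown here.

```python
def read_identity(mapped_length, nm, read_length):
    """'ri': fraction of the read that maps without mismatches."""
    return (mapped_length - nm) / read_length


def rnr_weight(num_refs):
    """'rnr': a read shared by num_refs references counts 1/num_refs for each."""
    return 1 / num_refs


def rnb_weight(identity, num_refs):
    """'rnb': the rnr weight additionally scaled by the read identity."""
    return 1 * identity / num_refs


def combined_score(bg_score, standalone_score):
    """'combined' scoring method: product of the two individual scores."""
    return bg_score * standalone_score


# Worked example: a 150 bp read mapped over its full length with 3 mismatches,
# hitting 2 references equally well.
identity = read_identity(mapped_length=150, nm=3, read_length=150)
print(identity, rnr_weight(2), rnb_weight(identity, 2))
```

In the worked example the identity is 147/150, so the read contributes half a read to each reference under `rnr` and slightly less than half under `rnb`.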

pangia's Issues

Key error and custom database error

Hello,
I am having a couple of problems when trying to classify nanopore reads with PanGIA.

  1. When trying to classify a mix of 15 isolates PanGIA exits with the following error:
    (pangia.py -i /home/Analyzed_data/ATCC_gDNAs_plasmid_mix/mix.ghac.fastq -db /home/data0/pangia_db/PanGIA/NCBI_genomes_refseq89_*.fa -t 70 --keepTemp -sb -se -np )

Parsing SAM files with 70 subprocesses...
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/mnt/data0/miniconda3/envs/pangia/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/pangia/pangia.py", line 714, in worker
lcr_lvl, lcr_name, lcr_info = lineageLCR(taxids)
File "/home/pangia/pangia.py", line 378, in lineageLCR
lng = t.taxid2lineageDICT(tid, 1, 1)
File "/home/pangia/taxonomy.py", line 265, in taxid2lineageDICT
return _taxid2lineage( tid, print_all_rank, print_strain, replace_space2underscore, output_type )
File "/home/pangia/taxonomy.py", line 305, in _taxid2lineage
rank = _getTaxRank(taxID)
File "/home/pangia/taxonomy.py", line 372, in _getTaxRank
return taxRanks[taxID]
KeyError: '134962'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/pangia/pangia.py", line 2316, in <module>
(res, mapped_r_cnt) = processSAMfile( os.path.abspath(samfile), argvs.threads, lines_per_process)
File "/home/pangia/pangia.py", line 921, in processSAMfile
results.append( job.get() )
File "/mnt/data0/miniconda3/envs/pangia/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
KeyError: '134962'

  2. When trying to classify a single isolate with a custom-built database, PanGIA exits with the following error:
    [00:05:49] Saving bitmasks in JSON format...
    [00:05:49] Done.
    Traceback (most recent call last):
    File "/home/DRDC/pangia/pangia.py", line 2357, in <module>
    outputResultsAsReport( res_rollup, out_fp, argvs.relAbu, argvs.taxonomy, argvs.reportFields, argvs.scoreMethod, argvs.minScore, argvs.minCov, argvs.minRsnb, argvs.minReads, argvs.minLen, argvs.minDc, argvs.minRsdcnr, argvs.displayAll )
    File "/home/DRDC/pangia/pangia.py", line 1677, in outputResultsAsReport
    s = "%.6f"%res_rollup[tid]["S_SA_CL"] if res_rollup[tid]["S_SA_CL"] != "none" else "none"
    TypeError: must be real number, not str

Any help would be appreciated.
Thanks,
Scott

pangia-vis fails

Hello,
I am trying to run PanGIA. pangia.py finished running but pangia-vis.pl fails to display anything in my browser. The following messages are printed in the terminal:

Opening PanGIA-VIS application on http://localhost:5006/pangia-vis
2020-08-18 17:14:19,627 Starting Bokeh server version 2.1.1 (running on Tornado 6.0.4)
2020-08-18 17:14:19,628 User authentication hooks NOT provided (default user enabled)
2020-08-18 17:14:19,630 Bokeh app running at: http://localhost:5006/pangia-vis
2020-08-18 17:14:19,630 Starting Bokeh server with process id: 5420
BokehDeprecationWarning: 'legend' keyword is deprecated, use explicit 'legend_label', 'legend_field', or 'legend_group' keywords instead
BokehDeprecationWarning: 'legend' keyword is deprecated, use explicit 'legend_label', 'legend_field', or 'legend_group' keywords instead
BokehDeprecationWarning: 'legend' keyword is deprecated, use explicit 'legend_label', 'legend_field', or 'legend_group' keywords instead
BokehDeprecationWarning: 'WidgetBox' is deprecated and will be removed in Bokeh 3.0, use 'bokeh.models.Column' instead
2020-08-18 17:14:19,965 Error running application handler <bokeh.application.handlers.directory.DirectoryHandler object at 0x7fa60f4c6550>: unexpected attribute 'callback' to Button, similar attributes are js_event_callbacks
File "has_props.py", line 282, in __setattr__:
(name, self.__class__.__name__, text, nice_join(matches))) Traceback (most recent call last):
File "/mnt/data0/miniconda3/envs/pangia2/lib/python3.6/site-packages/bokeh/application/handlers/code_runner.py", line 197, in run
exec(self._code, module.__dict__)
File "/home/DRDC/pangia/pangia-vis/main.py", line 1081, in <module>
dl_btn.callback = CustomJS(args=dict(source=dp_source), code=open(join(dirname(__file__), "download.js")).read())
File "/mnt/data0/miniconda3/envs/pangia2/lib/python3.6/site-packages/bokeh/core/has_props.py", line 282, in __setattr__
(name, self.__class__.__name__, text, nice_join(matches)))
AttributeError: unexpected attribute 'callback' to Button, similar attributes are js_event_callbacks

Any help would be appreciated.
Thanks,
Scott

PanGIA-Docker error

Hello,
I installed the docker version of PanGIA. I ran some nanopore data using all 3 processes (pre-processing, PanGIA and supplementary classification). All finished running except PanGIA which had the following error in pangia.log:

[00:00:11] Running read-mapping...
[00:00:11] Mapping to /home/edge/edge_dev/scripts/microbial_profiling/../../database/PanGIA/NCBI_genomes_refseq89_BAV.fa...
[00:00:11] [ERROR] error occurred while running read mapping (code: 1, message: + minimap2 -aL -t 12 -x map-ont /home/edge/edge_dev/scripts/microbial_profiling/../../database/PanGIA/NCBI_genomes_refseq89_BAV.fa /home/edge/edge_dev/edge_ui/EDGE_output//9da940633105a9e2be60b7f0bf03add4/ReadsBasedAnalysis/Taxonomy/allReads.fastq

This is from process-current.log:
Tool (pangia) - PID: 17739, starting...
[RUN_TOOL] [pangia] COMMAND: uge-pangia.sh -i '/home/edge/edge_dev/edge_ui/EDGE_output//9da940633105a9e2be60b7f0bf03add4/ReadsBasedAnalysis/Taxonomy/allReads.fastq' -p allReads -o /home/edge/edge_dev/edge_ui/EDGE_output//9da940633105a9e2be60b7f0bf03add4/ReadsBasedAnalysis/Taxonomy/1_allReads/pangia -t 12 -d 'NCBI_genomes_refseq89_BAV.fa NCBI_genomes_refseq89_adds.fa' -x /home/edge/edge_dev/scripts/microbial_profiling/../../database/PanGIA -s n -b '' -r DEPTH_COV -T standalone -A 60 -W 24 -S 0 -R 10 -B 3 -L 200 -D 0.01 -G 0.005 -C 0.001 -a ' -se --nanopore'
[RUN_TOOL] [pangia] Logfile: /home/edge/edge_dev/edge_ui/EDGE_output//9da940633105a9e2be60b7f0bf03add4/ReadsBasedAnalysis/Taxonomy/log/allReads-pangia.log
[RUN_TOOL] [pangia] Error occured.
[RUN_TOOL] [pangia] Running time: 00:00:13

This is from error.log:
convert: unable to open image /home/edge/edge_dev/edge_ui/EDGE_output//9da940633105a9e2be60b7f0bf03add4/QcReads/Log_LengthvsQualityScatterPlot_kde.pdf': No such file or directory @ error/blob.c/OpenBlob/2643. convert: no images defined /home/edge/edge_dev/edge_ui/EDGE_output//9da940633105a9e2be60b7f0bf03add4/HTML_Report/images/QC_length_quality.png' @ error/convert.c/ConvertImageCommand/3046.
Use of uninitialized value in division (/) at /home/edge/edge_dev/scripts/pangia_report/create_pangia_report_w_temp.pl line 24.

Do you have any suggestions on how to resolve this issue with PanGIA?
Thanks,
Scott

Documentation - minor error

in the README.md file, you have

tar -xzf PanGIA_20180915_taxonomy.tar.gz
tar -xzf PanGIA_20180915_NCBI_genomes_refseq89_BAV.fa.tar
tar -xzf PanGIA_20180915_NCBI_genomes_refseq89_adds.fa.tar
tar -xzf PanGIA_20180915_NCBI_genomes_refseq89_Human_GRCh38.p12.fa.tar

as the listed commands. However, as the tarballs are not gzipped, this does not work, and should be changed to

tar -xvf PanGIA_20180915_taxonomy.tar.gz
tar -xvf PanGIA_20180915_NCBI_genomes_refseq89_BAV.fa.tar
tar -xvf PanGIA_20180915_NCBI_genomes_refseq89_adds.fa.tar
tar -xvf PanGIA_20180915_NCBI_genomes_refseq89_Human_GRCh38.p12.fa.tar
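
If the mix of gzipped and plain tarballs is awkward to handle by hand, the extraction can also be done without choosing a flag per file. The sketch below uses Python's standard `tarfile` module, whose `r:*` mode detects the compression from the file contents instead of the file name. The function name `extract_archive` is hypothetical and the destination directory is an assumption.

```python
import tarfile


def extract_archive(archive_path, dest_dir="."):
    """Extract a tar archive and return the list of member names.

    Mode "r:*" opens plain, gzip, bzip2 and xz tarballs transparently, so the
    file extension does not have to match the actual format.
    """
    with tarfile.open(archive_path, mode="r:*") as archive:
        archive.extractall(path=dest_dir)
        return archive.getnames()
```

For example, `extract_archive("PanGIA_20180915_NCBI_genomes_refseq89_BAV.fa.tar")` would unpack that archive into the current directory whether or not it is gzipped. Only extract archives from a source you trust, since `extractall` writes whatever paths the archive contains.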

PanGIA is broken: throws an internal `KeyError` when running test command

I installed PanGIA by cloning this repository and then downloading these two files:

$ curl -O https://edge-dl.lanl.gov/PanGIA/database/PanGIA_20190830_taxonomy.tar.gz
$ curl -O https://edge-dl.lanl.gov/PanGIA/database/PanGIA_20190830_NCBI_genomes_refseq89_BAV.fa.mmi.tar.gz

$ tar xzf PanGIA_20190830_taxonomy.tar.gz
$ tar xzf PanGIA_20190830_NCBI_genomes_refseq89_BAV.fa.mmi.tar.gz

Next, I ran the following command to test if PanGIA could classify a bunch of artificially generated reads:

(pangia) xapple@server ~ $ ~/programs/pangia/bin/pangia.py --threads 4 --database ~/databases/pangia/PanGIA/NCBI_genomes_refseq89_BAV.fa --mode report --outdir ~/runs/pangia_test/ --readmapper minimap2 --prefix sample --input ~/runs/pangia_test/reads_fwd.fastq.gz ~/runs/pangia_test/reads_rev.fastq.gz

But it throws a KeyError and seems to be non-functional.

[00:00:00] Starting PanGIA 1.0.0-RC6.1
[00:00:00] Temporary directory '~/runs/pangia_test//sample_tmp' found. Deleting directory...
[00:04:53] Arguments and dependencies checked:
[00:04:53]     Input reads       : ['~/runs/pangia_test/reads_fwd.fastq.gz', '~/runs/pangia_test/reads_rev.fastq.gz']
[00:04:53]     Input SAM file    : ~/runs/pangia_test//sample.pangia.sam
[00:04:53]     Input background  : None
[00:04:53]     Save background   : None
[00:04:53]     Scoring method    : standalone
[00:04:53]     Scoring parameter : 0.5:0.99
[00:04:53]     Database          : ['~/databases/pangia/PanGIA/NCBI_genomes_refseq89_BAV.fa.mmi']
[00:04:53]     Abundance         : DEPTH_COV
[00:04:53]     Output path       : ~/runs/pangia_test/
[00:04:53]     Prefix            : sample
[00:04:53]     Mode              : report
[00:04:53]     Specific taxid    : None
[00:04:53]     Threads           : 4
[00:04:53]     First #refs in XA : 30
[00:04:53]     Extra NM in XA    : 1
[00:04:53]     Minimal score     : 0
[00:04:53]     Minimal RSNB      : 2.5
[00:04:53]     Minimal reads     : 10
[00:04:53]     Minimal linear len: 200
[00:04:53]     Minimal genome cov: 0.004
[00:04:53]     Minimal depth (DC): 0.01
[00:04:53]     Minimal RSDCnr    : 0.0009
[00:04:53]     Aligner option    : -A1 -B2 -k 40 -m 60 -x sr -p 1 -N 30
[00:04:53]     Aligner seed len  : 40
[00:04:53]     Aligner min score : 60
[00:04:53]     Aligner path      : ~/mambaforge/envs/pangia/bin/minimap2
[00:04:53]     Samtools path     : ~/mambaforge/envs/pangia/bin/samtools
[00:04:53] Loading taxonomy information...
[00:05:00] Done.
[00:05:00] Loading pathogen information...
[00:05:00] Done. 2817 pathogens loaded.
[00:05:00] Loading taxonomic uniqueness information...
[00:05:00] Done. 31177 taxonomic uniqueness loaded.
[00:05:00] Loading sizes of genomes...
[00:05:55] Done. 1061 target and 0 host genome(s) loaded.
[00:05:55] Running read-mapping...
[00:05:55] Mapping to ~/databases/pangia/PanGIA/NCBI_genomes_refseq89_BAV.fa.mmi...
[00:06:53] Done mapping reads to the database(s).
[00:06:53] Merging SAM files...
[00:06:55] Logfile saved to ~/runs/pangia_test//sample.pangia.log.
[00:06:55] Done. Mapped SAM file saved to ~/runs/pangia_test//sample.pangia.sam.
[00:06:55] Total number of input reads: 400013
[00:06:55] Total number of mapped reads: 186478
[00:06:55] Total number of host reads: 0 (0.00%)
[00:06:55] Total number of ignored reads (cross superkingdom): 349 (0.19%)
[00:06:55] Processing SAM file...
[00:06:55] Parsing SAM files with 4 subprocesses...
[00:06:59] Merging results...
[00:06:59] Done.
[00:06:59] Calculating linear length...
[00:07:02] Done processing SAM file, 184670 alignment(s).
[00:07:02] Rolling up taxonomies...
[00:07:02] 17 strain(s) mapped.
Traceback (most recent call last):
  File "~/programs/pangia/bin/pangia.py", line 2320, in <module>
    res_rollup = taxonomyRollUp(res, patho_meta, mapped_r_cnt, argvs.minRsnb, argvs.minReads, argvs.minLen, argvs.minCov, argvs.minDc)
  File "~/programs/pangia/bin/pangia.py", line 1199, in taxonomyRollUp
    genome_size[taxid]
KeyError: '1582156.1'

PanGIA basic installation is broken

I followed the instructions in the README to install PanGIA. This did not work. It seems like this software is left in a non-functional state.

The following command (from the QUICK INSTALLATION section):

$ curl -O https://edge-dl.lanl.gov/PanGIA/database/PanGIA_20180915_taxonomy.tar.gz

Results in the following file being downloaded:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /PanGIA/database/PanGIA_20180915_taxonomy.tar.gz was not found on this server.</p>
</body></html>

Could you please fix PanGIA so that the bioinformatics community can actually install and use it? Thanks.
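
Because `curl -O` saves whatever the server returns, a 404 page like the one above ends up on disk under the archive's name. The following is a small sketch for checking a download before trying to unpack it. It assumes gzip files start with the standard magic bytes `1f 8b` and uses `tarfile.is_tarfile` from the standard library; the function name `inspect_download` is hypothetical.

```python
import tarfile

GZIP_MAGIC = b"\x1f\x8b"


def inspect_download(path):
    """Classify a downloaded file as 'gzip', 'tar', 'html' or 'unknown'."""
    with open(path, "rb") as handle:
        head = handle.read(512)
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if tarfile.is_tarfile(path):
        return "tar"
    # An error page saved by curl typically starts with a doctype or <html> tag.
    if head.lstrip().lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"
```

A result of `"html"` for a file such as `PanGIA_20180915_taxonomy.tar.gz` indicates that the server sent an error page instead of the archive.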

Mapping Issue

Hi, every time I run PanGIA I get this error message while it is trying to map my reads. Could you tell me why?

[ERROR] error occurred while running read mapping (code: 1, message: + bwa mem -k40 -T60 -h100 -B2 -t24 database/NCBI_genomes_refseq89_BAV.fa SRR172902.fastq

+ gawk '-F\t' '!/^@/ { print }'
+ gawk '-F\t' '!and($2,256) && !and($2,2048) { print } END { print NR > "./SRR172902_tmp/raw_sam/NCBI_genomes_refseq89_BAV.fa.sam.count" }'
+ gawk '-F\t' '!and($2,4) { print }'
).
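
For readers unfamiliar with the gawk filters in this trace: they test bits of the SAM FLAG field (column 2). Per the SAM specification, 4 marks an unmapped read, 256 a secondary alignment, and 2048 a supplementary alignment. The sketch below mirrors the three filters in Python to make their combined effect explicit. It is an illustration, not PanGIA code, and it leaves out the read-count file that the second gawk command also writes.

```python
UNMAPPED = 0x4         # 4: read is unmapped
SECONDARY = 0x100      # 256: secondary alignment
SUPPLEMENTARY = 0x800  # 2048: supplementary alignment


def keep_alignment(sam_line):
    """Return True if a SAM line would pass all three gawk filters."""
    if sam_line.startswith("@"):
        # Filter 1: drop header lines.
        return False
    flag = int(sam_line.split("\t")[1])
    if flag & (SECONDARY | SUPPLEMENTARY):
        # Filter 2: keep only primary alignments.
        return False
    # Filter 3: drop unmapped reads.
    return not (flag & UNMAPPED)
```

Note that `and()` is a gawk extension, so these filters need gawk itself rather than another awk implementation.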

KeyError and taxid not in genome_size

[00:00:00] Starting PanGIA 1.0.0-RC6.1
[00:00:00] Arguments and dependencies checked:
[00:00:00] Input reads : ['/data2/data/mNGS/test_data/SRR172902.fastq']
[00:00:00] Input SAM file : ../result//SRR172902.pangia.sam
[00:00:00] Input background : None
[00:00:00] Save background : None
[00:00:00] Scoring method : standalone
[00:00:00] Scoring parameter : 0.5:0.99
[00:00:00] Database : ['/home/kdws/workdir/database/PanGIA/PanGIA/NCBI_genomes_refseq89_BAV.fa.mmi']
[00:00:00] Abundance : DEPTH_COV
[00:00:00] Output path : ../result/
[00:00:00] Prefix : SRR172902
[00:00:00] Mode : report
[00:00:00] Specific taxid : None
[00:00:00] Threads : 12
[00:00:00] First #refs in XA : 30
[00:00:00] Extra NM in XA : 1
[00:00:00] Minimal score : 0
[00:00:00] Minimal RSNB : 2.5
[00:00:00] Minimal reads : 10
[00:00:00] Minimal linear len: 200
[00:00:00] Minimal genome cov: 0.004
[00:00:00] Minimal depth (DC): 0.01
[00:00:00] Minimal RSDCnr : 0.0009
[00:00:00] Aligner option : -A1 -B2 -k 40 -m 60 -x sr -p 1 -N 30
[00:00:00] Aligner seed len : 40
[00:00:00] Aligner min score : 60
[00:00:00] Aligner path : /home/kdws/miniconda3/envs/mNGS/bin/minimap2
[00:00:00] Samtools path : /home/kdws/miniconda3/envs/mNGS/bin/samtools
[00:00:00] Loading taxonomy information...
[00:00:04] Done.
[00:00:04] Loading pathogen information...
[00:00:04] Done. 2817 pathogens loaded.
[00:00:04] Loading taxonomic uniqueness information...
[00:00:04] Done. 31177 taxonomic uniqueness loaded.
[00:00:04] Loading sizes of genomes...
[00:01:05] Done. 1061 target and 0 host genome(s) loaded.
[00:01:05] Running read-mapping...
[00:01:05] Mapping to /home/kdws/workdir/database/PanGIA/PanGIA/NCBI_genomes_refseq89_BAV.fa.mmi...
[00:03:08] Done mapping reads to the database(s).
[00:03:08] Merging SAM files...
[00:03:52] Logfile saved to ../result//SRR172902.pangia.log.
[00:03:52] Done. Mapped SAM file saved to ../result//SRR172902.pangia.sam.
[00:03:52] Total number of input reads: 13124130
[00:03:52] Total number of mapped reads: 3361870
[00:03:52] Total number of host reads: 0 (0.00%)
[00:03:52] Total number of ignored reads (cross superkingdom): 1857 (0.06%)
[00:03:52] Processing SAM file...
[00:03:52] Parsing SAM files with 12 subprocesses...
[00:04:22] Merging results...
[00:04:22] Done.
[00:04:22] Calculating linear length...
[00:05:02] Done processing SAM file, 3360013 alignment(s).
[00:05:02] Rolling up taxonomies...
[00:05:02] 994 strain(s) mapped.
Traceback (most recent call last):
File "pangia.py", line 2320, in <module>
res_rollup = taxonomyRollUp(res, patho_meta, mapped_r_cnt, argvs.minRsnb, argvs.minReads, argvs.minLen, argvs.minCov, argvs.minDc)
File "pangia.py", line 1199, in taxonomyRollUp
genome_size[taxid]
KeyError: '1133852'

PanGIA/pangia.py

Line 1199 in 3f7fe3d

genome_size[taxid]
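
The traceback shows a plain dictionary lookup failing, which means a taxid that received mapped reads has no entry in `genome_size`. One way to narrow this down is to list every such taxid at once instead of stopping at the first. The helper below is a hypothetical diagnostic, not part of pangia.py. It assumes `genome_size` is a dict keyed by taxid strings, as the traceback suggests, and that the mapped taxids are available as an iterable of strings; the real data structures may differ. The values in the example are toy data apart from the taxid taken from the log.

```python
def missing_taxids(mapped_taxids, genome_size):
    """Return the mapped taxids that have no entry in genome_size, sorted."""
    return sorted(set(mapped_taxids) - set(genome_size))


# Toy data: only the second taxid lacks a genome size entry.
genome_size = {"1133851": 1000}
print(missing_taxids(["1133851", "1133852"], genome_size))
```

If many taxids are missing, that would point to a mismatch between the reference index and the genome size table loaded from the database directory rather than to a single bad record.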

Database downloads

Hello,
I am trying to download the PanGIA databases from https://edge-dl.lanl.gov/PanGIA/database/. The site is currently down for maintenance. Is there another download location? I did find an FTP site, but the databases appear to be older versions from 20180227 instead of 20180915.
Thanks,
Scott

No taxid in taxRanks

I've run into this issue using PanGIA command line. Logs are attached:

[00:00:00] Starting PanGIA 1.0.0-RC6.1
[00:00:00] Arguments and dependencies checked:
[00:00:00]     Input reads       : ['/srv/test_fastq/strawman_pathogen-miseq_95gg9031_05vv10245.fastq']
[00:00:00]     Input SAM file    : /srv/strawman_pathogen-miseq_95gg9031_05vv10245.pangia.sam
[00:00:00]     Input background  : None
[00:00:00]     Save background   : None
[00:00:00]     Scoring method    : standalone
[00:00:00]     Scoring parameter : 0.5:0.99
[00:00:00]     Database          : ['database/NCBI_genomes_refseq89_BAV.fa.mmi']
[00:00:00]     Abundance         : DEPTH_COV
[00:00:00]     Output path       : /srv
[00:00:00]     Prefix            : strawman_pathogen-miseq_95gg9031_05vv10245
[00:00:00]     Mode              : report
[00:00:00]     Specific taxid    : None
[00:00:00]     Threads           : 8
[00:00:00]     First #refs in XA : 30
[00:00:00]     Extra NM in XA    : 1
[00:00:00]     Minimal score     : 0
[00:00:00]     Minimal RSNB      : 1
[00:00:00]     Minimal reads     : 3
[00:00:00]     Minimal linear len: 50
[00:00:00]     Minimal genome cov: 0.004
[00:00:00]     Minimal depth (DC): 0.01
[00:00:00]     Minimal RSDCnr    : 0.0009
[00:00:00]     Aligner option    : -x map-ont
[00:00:00]     Aligner seed len  : 40
[00:00:00]     Aligner min score : 60
[00:00:00]     Aligner path      : /opt/conda/envs/pangia/bin/minimap2
[00:00:00]     Samtools path     : /opt/conda/envs/pangia/bin/samtools
[00:00:00] Loading taxonomy information...
[00:00:08] Done.
[00:00:08] Loading pathogen information...
[00:00:08] Done. 2817 pathogens loaded.
[00:00:08] Loading taxonomic uniqueness information...
[00:00:08] Done. 31177 taxonomic uniqueness loaded.
[00:00:08] Loading sizes of genomes...
[00:00:08] Done. 9634 target and 0 host genome(s) loaded.
[00:00:08] Running read-mapping...
[00:00:08] Mapping to database/NCBI_genomes_refseq89_BAV.fa.mmi...
[WARNING] For a multi-part index, no @SQ lines will be outputted. Please use --split-prefix.
[M::main::12.096*1.00] loaded/built the index for 2010 target sequence(s)
[M::mm_mapopt_update::15.182*1.00] mid_occ = 236
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 2010
[M::mm_idx_stat::17.103*1.00] distinct minimizers: 154437548 (33.40% are singletons); average occurrences: 4.873; average spacing: 5.353
[M::worker_pipeline::31.420*2.60] mapped 799768 sequences
[M::main::42.605*2.18] loaded/built the index for 11332 target sequence(s)
[M::mm_mapopt_update::42.605*2.18] mid_occ = 236
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 11332
[M::mm_idx_stat::45.388*2.11] distinct minimizers: 139932295 (37.69% are singletons); average occurrences: 3.883; average spacing: 5.353
[M::worker_pipeline::58.889*2.95] mapped 799768 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -aL -t 8 -x map-ont database/NCBI_genomes_refseq89_BAV.fa.mmi /srv/test_fastq/strawman_pathogen-miseq_95gg9031_05vv10245.fastq
[M::main] Real time: 59.028 sec; CPU: 173.916 sec; Peak RSS: 11.823 GB
[00:01:08] Done mapping reads to the database(s).
[00:01:08] Merging SAM files...
[00:01:09] Logfile saved to /srv/strawman_pathogen-miseq_95gg9031_05vv10245.pangia.log.
[00:01:09] Done. Mapped SAM file saved to /srv/strawman_pathogen-miseq_95gg9031_05vv10245.pangia.sam.
[00:01:09] Total number of input reads: 1713173
[00:01:09] Total number of mapped reads: 41953
[00:01:09] Total number of host reads: 0 (0.00%)
[00:01:09] Total number of ignored reads (cross superkingdom): 29 (0.07%)
[00:01:09] Processing SAM file... 
[00:01:09] Parsing SAM files with 8 subprocesses...
[00:00:00] Starting PanGIA 1.0.0-RC6.1
[00:00:00] Temporary directory '/srv/strawman_pathogen-miseq_95gg9031_05vv10245_tmp' found. Deleting directory...
[00:00:00] Arguments and dependencies checked:
[00:00:00]     Input reads       : ['/srv/test_fastq/strawman_pathogen-miseq_95gg9031_05vv10245.fastq']
[00:00:00]     Input SAM file    : /srv/strawman_pathogen-miseq_95gg9031_05vv10245.pangia.sam
[00:00:00]     Input background  : None
[00:00:00]     Save background   : None
[00:00:00]     Scoring method    : standalone
[00:00:00]     Scoring parameter : 0.5:0.99
[00:00:00]     Database          : ['database/NCBI_genomes_refseq89_BAV.fa.mmi']
[00:00:00]     Abundance         : DEPTH_COV
[00:00:00]     Output path       : /srv
[00:00:00]     Prefix            : strawman_pathogen-miseq_95gg9031_05vv10245
[00:00:00]     Mode              : report
[00:00:00]     Specific taxid    : None
[00:00:00]     Threads           : 8
[00:00:00]     First #refs in XA : 30
[00:00:00]     Extra NM in XA    : 1
[00:00:00]     Minimal score     : 0
[00:00:00]     Minimal RSNB      : 1
[00:00:00]     Minimal reads     : 3
[00:00:00]     Minimal linear len: 50
[00:00:00]     Minimal genome cov: 0.004
[00:00:00]     Minimal depth (DC): 0.01
[00:00:00]     Minimal RSDCnr    : 0.0009
[00:00:00]     Aligner option    : -x map-ont
[00:00:00]     Aligner seed len  : 40
[00:00:00]     Aligner min score : 60
[00:00:00]     Aligner path      : /opt/conda/envs/pangia/bin/minimap2
[00:00:00]     Samtools path     : /opt/conda/envs/pangia/bin/samtools
[00:00:00] Loading taxonomy information...
[00:00:08] Done.
[00:00:08] Loading pathogen information...
[00:00:08] Done. 2817 pathogens loaded.
[00:00:08] Loading taxonomic uniqueness information...
[00:00:08] Done. 31177 taxonomic uniqueness loaded.
[00:00:08] Loading sizes of genomes...
[00:00:08] Done. 9634 target and 0 host genome(s) loaded.
[00:00:08] Running read-mapping...
[00:00:08] Mapping to database/NCBI_genomes_refseq89_BAV.fa.mmi...
[WARNING] For a multi-part index, no @SQ lines will be outputted. Please use --split-prefix.
[M::main::12.154*1.00] loaded/built the index for 2010 target sequence(s)
[M::mm_mapopt_update::15.333*1.00] mid_occ = 236
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 2010
[M::mm_idx_stat::17.257*1.00] distinct minimizers: 154437548 (33.40% are singletons); average occurrences: 4.873; average spacing: 5.353
[M::worker_pipeline::29.478*2.96] mapped 799768 sequences
[M::main::40.952*2.41] loaded/built the index for 11332 target sequence(s)
[M::mm_mapopt_update::40.952*2.41] mid_occ = 236
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 11332
[M::mm_idx_stat::42.893*2.35] distinct minimizers: 139932295 (37.69% are singletons); average occurrences: 3.883; average spacing: 5.353
[M::worker_pipeline::57.041*3.15] mapped 799768 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -aL -t 8 -x map-ont database/NCBI_genomes_refseq89_BAV.fa.mmi /srv/test_fastq/strawman_pathogen-miseq_95gg9031_05vv10245.fastq
[M::main] Real time: 57.196 sec; CPU: 179.640 sec; Peak RSS: 11.823 GB
[00:01:06] Done mapping reads to the database(s).
[00:01:06] Merging SAM files...
[00:01:08] Logfile saved to /srv/strawman_pathogen-miseq_95gg9031_05vv10245.pangia.log.
[00:01:08] Done. Mapped SAM file saved to /srv/strawman_pathogen-miseq_95gg9031_05vv10245.pangia.sam.
[00:01:08] Total number of input reads: 1713173
[00:01:08] Total number of mapped reads: 41953
[00:01:08] Total number of host reads: 0 (0.00%)
[00:01:08] Total number of ignored reads (cross superkingdom): 29 (0.07%)
[00:01:08] Processing SAM file... 
[00:01:08] Parsing SAM files with 8 subprocesses...

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/opt/conda/envs/pangia/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/pangia/pangia/pangia.py", line 714, in worker
    lcr_lvl, lcr_name, lcr_info = lineageLCR(taxids)
  File "/home/pangia/pangia/pangia.py", line 378, in lineageLCR
    lng = t.taxid2lineageDICT(tid, 1, 1)
  File "/home/pangia/pangia/taxonomy.py", line 265, in taxid2lineageDICT
    return _taxid2lineage( tid, print_all_rank, print_strain, replace_space2underscore, output_type )
  File "/home/pangia/pangia/taxonomy.py", line 305, in _taxid2lineage
    rank = _getTaxRank(taxID)
  File "/home/pangia/pangia/taxonomy.py", line 372, in _getTaxRank
    return taxRanks[taxID]
KeyError: '134962'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/pangia/pangia/pangia.py", line 2319, in <module>
    (res, mapped_r_cnt) = processSAMfile( os.path.abspath(samfile), argvs.threads, lines_per_process)
  File "/home/pangia/pangia/pangia.py", line 921, in processSAMfile
    results.append( job.get() )
  File "/opt/conda/envs/pangia/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
KeyError: '134962'
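
Both tracebacks on this page fail the same way: a taxid that turns up during processing is looked up in a dictionary (`taxRanks` here, `genome_size` in the earlier report) that has no entry for it. One way to narrow this down might be to list which taxids are missing from each table before running the full pipeline. The snippet below is only a sketch of that check, not PanGIA code. The helper `find_missing_taxids` is hypothetical, and the tables are toy dictionaries with made-up values; only the two taxids are taken from the tracebacks.

```python
def find_missing_taxids(observed_taxids, *lookup_tables):
    """For each lookup table, return the sorted taxids that were observed
    but have no entry in that table (hypothetical helper, not PanGIA's API)."""
    observed = {str(taxid) for taxid in observed_taxids}
    return [sorted(observed - set(table)) for table in lookup_tables]


# Toy stand-ins for the real tables; ranks and sizes are made-up values.
tax_ranks = {"562": "species", "1133852": "strain"}
genome_size = {"562": 1000, "134962": 2000}

# Taxids assumed to have been seen among the mapped references.
observed = ["562", "134962", "1133852"]

missing_ranks, missing_sizes = find_missing_taxids(observed, tax_ranks, genome_size)
print("missing from tax_ranks:", missing_ranks)
print("missing from genome_size:", missing_sizes)
```

If a check like this reports missing taxids on real data, the index and the taxonomy or genome-size files may come from different database releases, which would fit the 20180227 and 20180915 versions mentioned in the download question above. That is a guess, not a confirmed cause.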
