bactopia / bactopia
A flexible pipeline for complete analysis of bacterial genomes
Home Page: https://bactopia.github.io
License: MIT License
A process expected to run only 54 times is currently at 1,248 runs.
One of the processes is dragging along the QC'd FASTQs.
It may be coming from assembly, annotation, or reference download (they all drag along the QC'd FASTQs).
Flagging this could help users remember to delete the extra copies.
Currently, Bactopia is set not to overwrite output by default. This is problematic because Nextflow will run the full pipeline but then refuse to overwrite the output. In other words, a simple check at the beginning would avoid wasting users' time.
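The up-front check could be as simple as the sketch below. This is illustrative only (the function and flag names are hypothetical, not Bactopia's actual API): bail out before launching Nextflow if a sample's output directory already has contents and overwriting was not requested.

```python
import sys
from pathlib import Path

def check_existing_outputs(outdir, sample, overwrite=False):
    """Exit before the pipeline starts if results for this sample
    already exist and overwriting was not requested.
    (Hypothetical helper; names are illustrative.)"""
    sample_dir = Path(outdir) / sample
    if sample_dir.exists() and any(sample_dir.iterdir()) and not overwrite:
        sys.exit(f"ERROR: Output for '{sample}' already exists in {outdir}. "
                 "Use --force to overwrite, or choose a new --outdir.")
```

Running this once per sample before handing off to Nextflow costs nothing compared to a full pipeline run that silently writes no output.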
Docs are cool, but what is actually executed?
Currently, the species name format differs between bactopia datasets and bactopia.
Example:
# Build dataset
bactopia datasets datasets/ --species "Staphylococcus aureus"
# Run Bactopia
bactopia --SE my-fastq.gz --datasets datasets/ --species "staphylococcus-aureus"
For consistency, the following should also work:
bactopia --SE my-fastq.gz --datasets datasets/ --species "Staphylococcus aureus"
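One way to make both spellings work is to normalize the species name to a single canonical form before looking up the dataset directory. A minimal sketch (not Bactopia's actual implementation):

```python
def normalize_species(name):
    """Normalize a species name so "Staphylococcus aureus" and
    "staphylococcus-aureus" resolve to the same dataset directory.
    (Illustrative sketch.)"""
    return name.strip().lower().replace(" ", "-")
```

With this, both commands above resolve to the same staphylococcus-aureus dataset folder.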
Although gzipping saves storage space, it is inconvenient to constantly gunzip/zcat files just to view them. A better alternative is to let the user decide whether outputs are compressed.
Please add CRISPRfinder for CRISPR/Cas typing and Phigaro (version 0.2.1.7) for prophage identification in bacterial genomes.
Sometimes the estimated genome size is too large (>15 Mb) or too small (<100 kb).
This should be caught and further analysis stopped.
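A plausibility check along these lines would catch bad estimates early. The 100 kb and 15 Mb bounds come from the issue text; everything else is an illustrative sketch:

```python
MIN_GENOME_SIZE = 100_000      # 100 kb, lower bound from the issue
MAX_GENOME_SIZE = 15_000_000   # 15 Mb, upper bound from the issue

def validate_genome_size(size):
    """Return True if an estimated genome size is plausible for a
    bacterium; callers should halt further analysis otherwise."""
    return MIN_GENOME_SIZE <= size <= MAX_GENOME_SIZE
```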
When dealing with genus queries, it might be useful to add genome size to the accessions output. Then bactopia can use this instead of estimating via Mash
This won't actually run the pipeline; it will just give an overview of which analyses would be expected to run given the inputs (PE vs. SE reads; general, species, and user datasets; etc.).
This will require renaming the current --dry_run to something like --test_conda.
Similar to #5
Sometimes, with super low coverage, FLASH creates an empty extendedFrags.fastq.gz file, which SKESA does not like.
[shovill] Assembling reads with 'skesa'
[shovill] Running: skesa --gz --fastq flash.extendedFrags.fastq.gz --fastq flash.notCombined_1.fastq.gz,flash.notCombined_2.fastq.gz --use_paired_ends --contigs_out skesa.fasta --min_contig 1 --memory 16 --cores 3 --vector_percent 1 2>&1 | sed 's/^/[skesa] /' | tee -a shovill.log
[skesa] skesa --gz --fastq flash.extendedFrags.fastq.gz --fastq flash.notCombined_1.fastq.gz,flash.notCombined_2.fastq.gz --use_paired_ends --contigs_out skesa.fasta --min_contig 1 --memory 16 --cores 3 --vector_percent 1
[skesa]
[skesa] WARNING: option --gz is deprecated - gzipped files are now recognized automatically
[skesa]
[skesa] Invalid fastq file format in flash.extendedFrags.fastq.gz
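Before building the skesa command line, the merged-reads file could be checked and dropped if it contains no records. A sketch of such a guard (the function is hypothetical; it only does file handling):

```python
import gzip

def has_reads(fastq_gz):
    """Return True if a gzipped FASTQ has at least one record, so an
    empty flash.extendedFrags.fastq.gz can be omitted from the skesa
    --fastq arguments. (Illustrative sketch.)"""
    try:
        with gzip.open(fastq_gz, "rt") as fh:
            return bool(fh.readline().strip())
    except OSError:
        # Missing or corrupt gzip file: treat the same as empty
        return False
```

If has_reads() returns False, shovill/skesa would be invoked with only the notCombined pair, avoiding the "Invalid fastq file format" failure.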
Add an option to include accessions to be downloaded and annotated with Prokka. These genomes can then be included in the Roary analysis. Currently this is only done for cgtree.
conda create -n bactopia -c rpetit3 -c conda-forge -c bioconda bactopia
Fetching package metadata .............
Solving package specifications: .
PackageNotFoundError: Package not found: '' Dependencies missing in current linux-64 channels:
- bactopia -> ariba 2.13.5 py36hf484d3e_0 -> libgcc-ng >=7.3.0
- bactopia -> ariba 2.13.5 py36hf484d3e_0 -> libstdcxx-ng >=7.3.0
- bactopia -> mash 2.1 hf69f6b5_1 -> openblas >=0.3.3,<0.3.4.0a0 -> libgfortran-ng >=7,<8.0a0
Close matches found; did you mean one of these?
libgcc-ng: libgcc
libgfortran-ng: libgfortran
You can search for packages on anaconda.org with
anaconda search -t conda libstdcxx-ng
(and similarly for the other packages)
You may need to install the anaconda-client command line client with
conda install anaconda-client
conda --version
conda 4.2.13
During the download_references step, I'm receiving:
Command error:
ERROR: No downloads matched your filter. Please check your options.
The issue is that the accession being queried is GCF00000.1, but ncbi-genome-download finds GCF00000.2. This causes the step to fail.
A simple solution is to rebuild the species dataset so the updated accessions are included, but it might become necessary to implement a method that does not require a version number (e.g., .1, .2, etc.).
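A versionless match could be as simple as stripping the trailing ".N" from the assembly accession before querying. A sketch of the idea (not what Bactopia currently does):

```python
import re

def strip_version(accession):
    """Drop the trailing ".N" version suffix from an NCBI assembly
    accession so that e.g. a dataset built against version .1 still
    matches when NCBI now serves version .2. (Illustrative sketch.)"""
    return re.sub(r"\.\d+$", "", accession)
```

The versionless accession would then be passed to ncbi-genome-download, which matches whichever version currently exists.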
During setup-datasets, ariba will produce an error.
ariba getref --help
Traceback (most recent call last):
File "/home/rpetit/miniconda3/envs/ariba/bin/ariba", line 3, in <module>
import ariba
File "/home/rpetit/miniconda3/envs/ariba/lib/python3.6/site-packages/ariba/__init__.py", line 57, in <module>
from ariba import *
File "/home/rpetit/miniconda3/envs/ariba/lib/python3.6/site-packages/ariba/assembly.py", line 6, in <module>
from ariba import common, mapping, bam_parse, external_progs, ref_seq_chooser
File "/home/rpetit/miniconda3/envs/ariba/lib/python3.6/site-packages/ariba/mapping.py", line 4, in <module>
import pysam
File "/home/rpetit/miniconda3/envs/ariba/lib/python3.6/site-packages/pysam/__init__.py", line 5, in <module>
from pysam.libchtslib import *
ImportError: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory
This is related to bioconda/bioconda-recipes#17448, where a PR fix has been submitted.
A temporary fix is to manually upgrade pysam:
conda activate bactopia
conda install -c conda-forge -c bioconda pysam=0.15.3
Need to add -r to bactopia run.
bactopia pull
Checking bactopia/bactopia ...
downloaded from https://github.com/bactopia/bactopia.git - revision: b702be9169 [v1.2.1]
bactopia --accession SRX477044 -profile slurm --cpus 8 -resume
N E X T F L O W ~ version 19.07.0
Project `bactopia/bactopia` currently is sticked on revision: v1.2.1 -- you need to specify explicitly a revision with the option `-r` to use it
{
"qc_stats": {
"total_bp":341120202,
"coverage":inf,
"read_total":1587547,
"read_min":200,
"read_mean":214.872,
"read_std":0.769588,
"read_median":215,
"read_max":215,
"read_25th":215,
"read_75th":215,
"qual_min":11,
"qual_mean":31.2358,
"qual_std":5.24431,
"qual_max":38,
"qual_median":33,
"qual_25th":28,
"qual_75th":35
},
This causes bactopia summary to fail.
It was fixed in the latest version of fastq-scan (v0.4.1, https://github.com/rpetit3/fastq-scan/releases/tag/v0.4.1).
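The failure happens because a bare `inf` is not valid JSON, so the summary step's parser chokes. Until every run uses fastq-scan >= v0.4.1, a defensive workaround like the sketch below could sanitize old stats files before parsing (the helper name and the null substitution are my assumptions, not Bactopia code):

```python
import json
import re

def load_stats(text):
    """Parse fastq-scan JSON output, tolerating the bare `inf` that
    versions before v0.4.1 could emit for coverage. The invalid token
    is replaced with null before parsing. (Workaround sketch.)"""
    cleaned = re.sub(r':\s*inf\b', ': null', text)
    return json.loads(cleaned)
```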
Add it! https://github.com/ncbi/amr
Example: allow bactopia datasets to pull from a git repo, e.g. pulling Staphopia v1 just from a URL.
The step that verifies inputs have enough data can be time-consuming.
There should be an option to skip it when the inputs are known to be sufficient; the checks are more useful with public data, where quality isn't known beforehand.
Dockerfile and Singularity files need to be updated
Use Snippy substitution data and bedtools coverage data to create something similar to Snippy's ".consensus.subs.fa" file except with 0 coverage regions masked out.
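The masking step could look something like the sketch below: take the substituted consensus sequence and the zero-coverage intervals (e.g. from `bedtools genomecov` filtered to depth 0) and overwrite the uncovered positions with N. This is a single-contig illustration, not the actual implementation; interval format is assumed to be BED-style half-open (start, end):

```python
def mask_zero_coverage(seq, zero_cov_intervals):
    """Mask zero-coverage regions of a consensus sequence with 'N',
    producing something like Snippy's .consensus.subs.fa but with
    uncovered regions hidden. Intervals are half-open (start, end),
    0-based, as in BED. (Illustrative, single contig only.)"""
    bases = list(seq)
    for start, end in zero_cov_intervals:
        for i in range(start, min(end, len(bases))):
            bases[i] = "N"
    return "".join(bases)
```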
The default is to copy files from the work directory to the output directory; allow users to specify symbolic links instead.
See: https://www.nextflow.io/docs/latest/process.html#publishdir
Traceback (most recent call last):
File "/home/mdh/mplumb/.conda/envs/bactopia/bin/setup-datasets.py", line 928, in <module>
ARIBA, PUBMLST, CGMLST = get_available_datasets(args.clear_cache)
File "/home/mdh/mplumb/.conda/envs/bactopia/bin/setup-datasets.py", line 131, in get_available_datasets
return [data['ariba'], data['pubmlst'], data['cgmlst']]
KeyError: 'cgmlst'
This error needs to be explained (or handled gracefully).
Command exit status:
139
Command output:
(empty)
Command error:
.command.sh: line 40: 32 Segmentation fault (core dumped) tblastn -db S.190206.00513 -query proteins.fasta -evalue 0.0001 -num_threads 4 -outfmt '6 qseqid qlen qstart qend sseqid slen sstart send length evalue bitscore pident nident mismatch gaps qcovs qcovhsp' -qcov_hsp_perc 50 >> blast/proteins/proteins.txt
maybe related to this: https://www.biostars.org/p/16729/
There should be a minimum coverage cutoff for a sample to continue. For example, if an input has only 1x coverage, flag it and keep downstream analyses (which will probably fail anyway) from happening.
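The gate itself is a one-liner once total bases and genome size are known. The 10x default below is purely illustrative; the actual cutoff would need to be decided:

```python
MIN_COVERAGE = 10  # illustrative default; the real cutoff is TBD

def passes_coverage(total_bp, genome_size, min_coverage=MIN_COVERAGE):
    """Return True if estimated depth (total bases / genome size)
    meets the cutoff; samples failing this would be flagged and
    skipped by downstream steps. (Sketch.)"""
    if genome_size <= 0:
        return False
    return total_bp / genome_size >= min_coverage
```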
Had a case in which the estimated genome size of one sample was applied to another sample.
This is only an issue when Mash is used to estimate the genome size; if the user gives an explicit genome size, they are all the same.
Nothing is more annoying than making a typo in a parameter name and watching the workflow chug along like nothing ever happened!
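A simple fail-fast check would compare the supplied parameter names against the known set and abort on anything unrecognized. Sketch only; the parameter list here is a made-up subset, not Bactopia's real options:

```python
# Hypothetical subset of recognized parameters, for illustration
KNOWN_PARAMS = {"outdir", "datasets", "species", "coverage", "cpus"}

def unknown_params(given, known=KNOWN_PARAMS):
    """Return supplied parameter names that match no known option,
    so a typo (e.g. --speceis) fails fast instead of being silently
    ignored. (Sketch.)"""
    return sorted(set(given) - set(known))
```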
Currently using nextflow run bactopia/bactopia. While it's nice, it becomes a problem when multiple versions of Bactopia are installed; let's just call the main.nf and call it a day.
Currently downloads only come from ENA; make SRA available as well.
It would be useful to have a script that produced the --accessions input.
Example: I want to process all samples in BioProject PRJNA123456789:
bactopia ena-query PRJNA123456789
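Such a script could lean on the ENA Portal API's search endpoint, which returns run accessions for a study as TSV. The URL template below uses real ENA Portal API parameters (result, fields, query, format), but the overall helper is a sketch with no error handling:

```python
# Template for an ENA Portal API search returning one run accession
# per line (TSV with a `run_accession` header). The endpoint and
# parameters are real ENA Portal API features; fetching is left to
# the caller (e.g. urllib or requests).
ENA_URL = ("https://www.ebi.ac.uk/ena/portal/api/search"
           "?result=read_run&fields=run_accession&format=tsv"
           "&query=study_accession%3D{}")

def parse_run_accessions(tsv_text):
    """Extract run accessions from the ENA Portal API TSV response:
    skip the header line, take the first column of each row."""
    lines = tsv_text.strip().splitlines()
    return [line.split("\t")[0] for line in lines[1:] if line.strip()]
```

The resulting list could be written one accession per line as the --accessions file.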
Occasionally, building a conda environment can fail because of connection issues. If it is the first time building, this causes Nextflow to error out.
Sure, you can just resume the job (-resume), but that gets annoying.
Come up with something better: something that builds the environments before jobs are run.
I think it would be useful to JSONify outputs within the workflow.
This way it's done, and we don't need to duplicate the logic/re-parse in multiple workflows.
See nextflow-io/nextflow#1108 for the fix. Only happens with McCortex.
See https://www.ebi.ac.uk/ena/data/view/SRR3030395
It is an assembly, but it was somehow converted to a FASTQ. As you might imagine, it breaks Bactopia:
@SRR3030395.32 AUTD01000005.1/1
AGGGGGCGATCCCCCAACTACTATCGGCGTGCTGAAGCTTAACTTCTGTGTTCGGCATGGGAACAGGTGTATCCTTCAGGCTATCGCCACCACACTATAAGAGAACTTCTTCCCTCAAAACTAGATATTATTCAATTATTCTCGAAACAACTACGTTGTTGACTTGGTTAAGTCCTCGACCGATTAGTACTGGTCCGCTCCACGCCTCACGGCGCTGCTACTTCCAGCCTATCTACCTGATCATCTCTCAGGGGTCTTACTTCCATATAGGAATGGGAAATCTCATCTTGAGGCGAGTTTCACACTTAGATGCTTTCAGCGTTTATCTCATCCATACATAGCTACCCAGCGATGCGCCTGGCGGCACAACTGGTACACCAGCGGTATGTCCATCCCGGTCCTCTCGTACTAAGGACAGCTCCTCTCAAATTTCCTACGCCCGCGACGGATAGGGACCGAACTGTCTCACGACGTTCTGAACCCAGCTCGCGTACCGCTTTAATGGGCGAACAGCCCAACCCTTGGGACCGACTACAGCCCCAGGATGCGATGAGCCGACATCGAGGTGCCAAACCTCCCCGTCGATGTGGACTCTTGGGGGAGATAAGCCTGTTATCCCCAGGGTAGCTTTTATCCGTTGAGCGATGGCCCTTCCATACGGTACCACCGGATCACTAAGCCCGACTTTCGTCCCTGCTCGACCTGTCTGTCTCGCAGTCAAGCTCTCTTCTGCCTTTACACTCGACGAATGATTTCCAACCATTCTGAGAGAACCTTTGGGCGCCTCCGTTACTTTTTAGGAGGCGACCGCCCCAGTCAAACTGCCTACCTGACACTGTCTCCCACCACGATAAGTGGTGCGGGTTAGAGTGTTCACACAGCGAGGGTCGTATCCCACCAGCGCCTCACTCGAAACTAGCGTTCCGAGTTCTACGGCTCCGACCTATCCTGTACAAGCTGTGTCAACACCCAATATCAAGCTACAGTAAAGCTCCATGGGGTCTTTCCGTCCTGTCGCGGGTAACCTGCATCTTCACAGGTAATATAATTTCACCGAGTCTCTCGTTGAGACAGTGCCCAGATCGTTACGCCTTTCGTGCGGGTCGGAACTTACCCGACAAGGAATTTCGCTACCTTAGGACCGTTATAGTTACGGCCGCCGTTTACTGGGGCTTCATTTCTGGGCTTCGCCGAAGCTAACTCATCCACTTAACCTTCCAGCACCGGGCAGGCGTCAGCCCCTATACGTCATCTTTCGATTTTGCAGAAACCTGTGTTTTTGATAAACAGTCGCCTGGGCCTTTTCACTGCGGCTACACTTGCGTGCAGCACCCCTTCTCCCGAAGTTACGGGGTCATTTTGCCGAGTTCCTTAACGAGAGTTCACTCGCTCACCTTAGGATACTCTCCTCGACTACCTGTGTCGGTTTGCGGTACGGGTAATTAATCACTAACTAGAAGCTTTTCTCGGCAGTGTGACATCTGGCGCTTCCCTACTAAAATTCGGTCCTCGTCACGCCTTGTCCTTAGCGATAAGCATTTGACTCATCACCAGACTTGACGCTTGAACACACATTTCCAATCGTGTGCACACCATAGCCTCCTGCGTCCCTCCATCGTTCAAACATGATTAACTAGTACAGGAATATCAACCTGTTATCCATCGCCTACGCCTTGCGGCCTCGGCTTAGGTCCCGACTAACCCTGGGAGGACGAGCCTTCCCCAGGAAACCTTAGTCATTCGGTGGATCAGATTCTCACTGATCTTTCGCTACTCATACCGGCATTCTCACTTCTAAGCGCTCCACAAGTCCTTGCGATCTTGCTTCGTTGCCCTTAGAACGCTCTCCTATCACTCGACCTTACGGTCGAATCCACAATTTCGGTAACATGCTTAGCCCCGGTAAATTTTCGGCGCAGAATCACTCGGCTAGTGAGCTATTACGCACTCTTTAAATGGTGGCTGCTTCTGAGCCAACATCCTAGCTGTCTATGCAACTC
CACATCCTTTTCCACTCAGCATGTATTTAGGGACCTTAATTGGTGGTCTGGGCTGTTCCCCTTTCGACGGTGGATCTTATCACTCATCGTCTGACTCCCGGATATAAATCTGTGGCATTCGGAGTTTATCTGAATTCAGTAACCCATGACGGGCCCCTAGTCCAAACAGTGGCTCTACCTCCACGATTCTTAACTCCGAGGCTAACCCTAAAGCTATTTCAGAACCAGCTATCTCCAAGTTCGTTTGGAATTTCACCGCTACCCACACCTCATCCCAGCATTTTTCAACATACACGGGTTCGGTCCTCCAGTGCGTTTTACCGCACCTTCAACCTGGACATGGGTAGGTCACCTGGTTTCGGGTCTACATCAATTTACTGAAACGCCCGTTTCAGACTCGCTTTCGCTACGGCTCCGGTCTTTCCACCTTAACCTTGCAAATTAACGTAACTCGCCGGTTCATTCTACAAAAGGCACGCTATCACCCATTAACGGGCTCTAACTAATTGTAGGCACATGGTTTCAGGAACTATTTCACTCCGCTTCCGCGGTGCTTTTCACCTTTCCCTCACGGTACTGGTTCACTATCGGTCACTAGGGAGTATTTAGCCTTGGGAGATGGTCCTCCCGGATTCCGACCACGTTTCACGTGTGTGGCCGTACTCAGGATCCTGAACTGAGGGTTGACGATTTCACCTACGGGGGTATCACCCTCTATGCCGAGCCTTCCCAGACTCTTCGGTTATCATCAACTTTGGTAACTCAAATGTTCAGTCCTACAACCCCAGAAAGCAAGCTTCCTGGTTTGGGCTGTTCCCCGTTCGCTCGCCGCTACTTAGGGAATCGATTTTTCTTTCTCTTCCTGTGGGTACTTAGATGTTTCAGTTCCCCACGTCTGCCTCAACTTGACTATGTATTCATCAAGTTGTAATCATCGGTAAAGATGATTGGGTTTCCCCATTCGGAAATCTCCGGATCAAAGCTTACGTACAGCTCCCCGAAGCATATCGGTGTTAGTCCCGTCCTTCATCGGCTCCTAGTACCAAGGCATCCACCATGCGCCCTTCATAACTTAACCTAACGGTCACTTCGTGATCGTCAAATTAATTGAGTATTAGCGATAAACTAATTAAAAAACTCAAAAATACGCAGTTGTTTCTCGGTTTAATTATCTTAATAATTAAAGGAAAATAATTGATAATATCTAGTTTTCAAAGAACAA
+
????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
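Records like the one above could be caught with a cheap heuristic before QC: contig-sized sequences and uniform placeholder quality strings are both red flags for an assembly masquerading as reads. The thresholds below are illustrative guesses, not tested cutoffs:

```python
def suspicious_fastq_record(seq, qual, max_len=5000):
    """Flag FASTQ records that look like assembly contigs disguised
    as reads: unusually long sequences, or long quality strings made
    of a single repeated placeholder character (e.g. all '?').
    (Heuristic sketch; thresholds are illustrative.)"""
    too_long = len(seq) > max_len
    fake_quals = len(qual) > 100 and len(set(qual)) == 1
    return too_long or fake_quals
```

A handful of flagged records at the head of a file would be enough to reject the input with a clear error instead of failing deep inside the pipeline.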
Happening on a SLURM cluster, but not for all users. Space is not an issue; maybe permissions?
--disable_auto_variants is a temporary fix.
https://mash.readthedocs.io/en/latest/tutorials.html#screening-a-read-set-for-containment-of-refseq-genomes
Output columns look like: identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment
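Parsing that tab-separated `mash screen` output into something usable is straightforward; a minimal sketch for one line (the field names in the dict are my own labels):

```python
def parse_mash_screen(line):
    """Parse one tab-separated line of `mash screen` output:
    identity, shared-hashes, median-multiplicity, p-value,
    query-ID, query-comment. (Sketch.)"""
    identity, shared, mult, pvalue, query_id, comment = \
        line.rstrip("\n").split("\t", 5)
    return {
        "identity": float(identity),
        "shared_hashes": shared,          # e.g. "990/1000"
        "median_multiplicity": int(mult),
        "p_value": float(pvalue),
        "query_id": query_id,
        "query_comment": comment,
    }
```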
Example: 185 samples, each with ~700 MB of result data, but the intermediate files for all 185 samples total 662 GB!
This needs to be cleaned up by default.
Double-check each process to make sure dataset files are not referenced by absolute paths; these need to be converted to symbolic links in the 'work' directory.