bactopia's Issues

QC'd fastq duplicated?

One of the processes is dragging along the QC'd FASTQ, so it ends up duplicated in the outputs.

I think it may be coming from assembly, annotation, or reference download (they all drag along the QC'd FASTQ).

Bactopia should warn if existing outputs are found

Currently Bactopia does not overwrite existing outputs by default. This is a problem because Nextflow will still run the full pipeline and then not overwrite the output.

In other words, add a simple check at the beginning so that user time is not wasted.
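
A minimal sketch of such an up-front check, in Python. The function names, the `overwrite` flag, and the assumption that each sample writes to `outdir/<sample>` are hypothetical and are not taken from the Bactopia code base.

```python
from pathlib import Path


def find_existing_outputs(outdir, samples):
    """Return the samples that already have a non-empty output directory."""
    existing = []
    for sample in samples:
        sample_dir = Path(outdir) / sample
        # A directory with at least one entry counts as existing output.
        if sample_dir.is_dir() and any(sample_dir.iterdir()):
            existing.append(sample)
    return existing


def existing_output_warning(outdir, samples, overwrite=False):
    """Build a warning message, or return None if it is safe to proceed."""
    existing = find_existing_outputs(outdir, samples)
    if existing and not overwrite:
        return (
            f"Found existing outputs for {len(existing)} sample(s): "
            f"{', '.join(existing)}. Nothing will be overwritten."
        )
    return None
```

Running this before the pipeline starts would let the user stop early instead of waiting for a full run whose results are discarded.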

Dealing with species names

Currently the species name is written differently in bactopia datasets and bactopia.

Example:

# Build dataset
bactopia datasets datasets/ --species "Staphylococcus aureus" 

# Run Bactopia
bactopia --SE my-fastq.gz --datasets datasets/ --species "staphylococcus-aureus"

For consistency, I think the following should also work:

bactopia --SE my-fastq.gz --datasets datasets/ --species "Staphylococcus aureus"
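
One way to accept both spellings is to normalize the name before it is used. The sketch below assumes the lowercase, hyphen-separated form from the example is the internal one; `normalize_species` is a hypothetical helper, not an existing Bactopia function.

```python
def normalize_species(name):
    """Convert a species name to a lowercase, hyphen-separated form."""
    # Treat hyphens and underscores as spaces, then collapse whitespace.
    words = name.replace("-", " ").replace("_", " ").split()
    return "-".join(word.lower() for word in words)


# Both spellings from the example map to the same value.
assert normalize_species("Staphylococcus aureus") == "staphylococcus-aureus"
assert normalize_species("staphylococcus-aureus") == "staphylococcus-aureus"
```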

Add option to compress output

Although compression saves storage space, it is inconvenient to constantly gunzip/zcat to view files. A better alternative is to let the user decide.

Add genome size to bactopia search?

When dealing with genus queries, it might be useful to add genome size to the accessions output. Bactopia could then use this value instead of estimating it via Mash.

Add "dry run" feature that details what would run given inputs

This won't actually go through the pipeline, but just give an overview of which analyses would be expected to be run given the inputs (PE vs SE, datasets (general, species, user) etc...)

This will require renaming the current --dry_run to something like --test_conda.

Deal with super low coverage samples

Similar to #5

Sometimes, when coverage is super low, flash creates an empty extendedFrags.fastq.gz file, which skesa does not like.

[shovill] Assembling reads with 'skesa'
[shovill] Running: skesa --gz  --fastq flash.extendedFrags.fastq.gz --fastq flash.notCombined_1.fastq.gz,flash.notCombined_2.fastq.gz --use_paired_ends --contigs_out skesa.fasta --min_contig 1 --memory 16 --cores 3 --vector_percent 1 2>&1 | sed 's/^/[skesa] /' | tee -a shovill.log
[skesa] skesa --gz --fastq flash.extendedFrags.fastq.gz --fastq flash.notCombined_1.fastq.gz,flash.notCombined_2.fastq.gz --use_paired_ends --contigs_out skesa.fasta --min_contig 1 --memory 16 --cores 3 --vector_percent 1
[skesa]
[skesa] WARNING: option --gz is deprecated - gzipped files are now recognized automatically
[skesa]
[skesa] Invalid fastq file format in flash.extendedFrags.fastq.gz
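
A possible guard, sketched in Python, is to test each gzipped FASTQ for content before handing it to the assembler. The helper names are hypothetical, and this is not how shovill or Bactopia handle the case.

```python
import gzip


def gzip_has_data(path):
    """Return True if the gzipped file decompresses to at least one byte."""
    with gzip.open(path, "rb") as handle:
        return bool(handle.read(1))


def non_empty_inputs(paths):
    """Keep only the gzipped FASTQ files that actually contain data."""
    return [path for path in paths if gzip_has_data(path)]
```

With a filter like this, an empty flash.extendedFrags.fastq.gz could be dropped from the input list rather than passed on.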

Conda install missing dependencies

conda create -n bactopia -c rpetit3 -c conda-forge -c bioconda bactopia
Fetching package metadata .............
Solving package specifications: .


PackageNotFoundError: Package not found: '' Dependencies missing in current linux-64 channels: 
  - bactopia -> ariba 2.13.5 py36hf484d3e_0 -> libgcc-ng >=7.3.0
  - bactopia -> ariba 2.13.5 py36hf484d3e_0 -> libstdcxx-ng >=7.3.0
  - bactopia -> mash 2.1 hf69f6b5_1 -> openblas >=0.3.3,<0.3.4.0a0 -> libgfortran-ng >=7,<8.0a0

Close matches found; did you mean one of these?

    libgcc-ng: libgcc
    libgfortran-ng: libgfortran

You can search for packages on anaconda.org with

    anaconda search -t conda libstdcxx-ng

(and similarly for the other packages)

You may need to install the anaconda-client command line client with

    conda install anaconda-client
conda --version
conda 4.2.13

download reference - outdated assembly version

During the download_references step, I'm receiving

Command error:
  ERROR: No downloads matched your filter. Please check your options.

The issue is that the accession being queried is GCF00000.1, but ncbi-genome-download finds GCF00000.2. This causes the step to fail.

A simple solution is to rebuild the species dataset so that the updated accessions are included.

But it might become necessary to implement a method that does not require a version number (e.g. .1, .2, etc.).
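
A version-independent comparison could look like the sketch below. It uses the placeholder accessions from this issue; both function names are hypothetical.

```python
def strip_accession_version(accession):
    """Drop a trailing '.<digits>' version from an assembly accession."""
    base, dot, version = accession.rpartition(".")
    if dot and version.isdigit():
        return base
    return accession


def same_assembly(first, second):
    """Compare two accessions while ignoring their version suffixes."""
    return strip_accession_version(first) == strip_accession_version(second)


# The queried and the available accession now match.
assert same_assembly("GCF00000.1", "GCF00000.2")
```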

setup-datasets: ImportError: libcrypto.so.1.0.0 ...

During setup-datasets, ariba will produce an error.

ariba getref --help
Traceback (most recent call last):
  File "/home/rpetit/miniconda3/envs/ariba/bin/ariba", line 3, in <module>
    import ariba
  File "/home/rpetit/miniconda3/envs/ariba/lib/python3.6/site-packages/ariba/__init__.py", line 57, in <module>
    from ariba import *
  File "/home/rpetit/miniconda3/envs/ariba/lib/python3.6/site-packages/ariba/assembly.py", line 6, in <module>
    from ariba import common, mapping, bam_parse, external_progs, ref_seq_chooser
  File "/home/rpetit/miniconda3/envs/ariba/lib/python3.6/site-packages/ariba/mapping.py", line 4, in <module>
    import pysam
  File "/home/rpetit/miniconda3/envs/ariba/lib/python3.6/site-packages/pysam/__init__.py", line 5, in <module>
    from pysam.libchtslib import *
ImportError: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory

It is related to bioconda/bioconda-recipes#17448

And a PR fix has been submitted: bioconda/bioconda-recipes#17448

A temporary fix is to manually upgrade pysam

conda activate bactopia
conda install -c conda-forge -c bioconda pysam=0.15.3

bactopia run does not state explicit version

Need to add -r to bactopia run. After a pull, Nextflow refuses to run the project unless a revision is given explicitly:

bactopia pull
Checking bactopia/bactopia ...
downloaded from https://github.com/bactopia/bactopia.git - revision: b702be9169 [v1.2.1]

bactopia --accession SRX477044 -profile slurm --cpus 8 -resume
N E X T F L O W  ~  version 19.07.0
Project `bactopia/bactopia` currently is sticked on revision: v1.2.1 -- you need to specify explicitly a revision with the option `-r` to use it

Coverage reported as inf if genome size is 0

{
    "qc_stats": {
        "total_bp":341120202,
        "coverage":inf,
        "read_total":1587547,
        "read_min":200,
        "read_mean":214.872,
        "read_std":0.769588,
        "read_median":215,
        "read_max":215,
        "read_25th":215,
        "read_75th":215,
        "qual_min":11,
        "qual_mean":31.2358,
        "qual_std":5.24431,
        "qual_max":38,
        "qual_median":33,
        "qual_25th":28,
        "qual_75th":35
    },

This causes bactopia summary to fail.

This was fixed in the latest version of fastq-scan (v0.4.1, https://github.com/rpetit3/fastq-scan/releases/tag/v0.4.1).
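
For illustration, a guarded coverage calculation in Python. This is not the fastq-scan fix itself; reporting 0.0 for an unknown genome size is an assumption made here, and `estimate_coverage` is a hypothetical name.

```python
import json


def estimate_coverage(total_bp, genome_size):
    """Return total_bp / genome_size, or 0.0 if the genome size is not positive."""
    if genome_size <= 0:
        return 0.0
    return total_bp / genome_size


# With a genome size of 0, the result is a finite number and valid JSON.
stats = {"total_bp": 341120202, "coverage": estimate_coverage(341120202, 0)}
print(json.dumps(stats, allow_nan=False))
```

Passing `allow_nan=False` makes `json.dumps` raise a `ValueError` on `inf`, so a bad value would be caught when the file is written rather than when it is parsed.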

Add option to skip fastq checks

The step that verifies that inputs have enough data can be time consuming.

There should be an option to skip it if the inputs are known to be sufficient. The checks are more useful with public data, when this is not known beforehand.

Setup datasets - cgmlst error

Traceback (most recent call last):
  File "/home/mdh/mplumb/.conda/envs/bactopia/bin/setup-datasets.py", line 928, in <module>
    ARIBA, PUBMLST, CGMLST = get_available_datasets(args.clear_cache)
  File "/home/mdh/mplumb/.conda/envs/bactopia/bin/setup-datasets.py", line 131, in get_available_datasets
    return [data['ariba'], data['pubmlst'], data['cgmlst']]
KeyError: 'cgmlst'
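
The traceback suggests the loaded data has no 'cgmlst' entry. A tolerant lookup could be sketched as follows; `datasets_from_cache` is a simplified, hypothetical stand-in that takes the already loaded dictionary, and falling back to an empty dictionary is an assumption.

```python
def datasets_from_cache(data):
    """Return the ariba, pubmlst and cgmlst entries, defaulting to empty."""
    keys = ("ariba", "pubmlst", "cgmlst")
    # dict.get avoids a KeyError when the data lacks one of the keys.
    return [data.get(key, {}) for key in keys]


# Data without a 'cgmlst' entry no longer raises.
ariba, pubmlst, cgmlst = datasets_from_cache({"ariba": {}, "pubmlst": {}})
assert cgmlst == {}
```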

tblastn segmentation fault

Command exit status:
  139

Command output:
  (empty)

Command error:
  .command.sh: line 40:    32 Segmentation fault      (core dumped) tblastn -db S.190206.00513 -query proteins.fasta -evalue 0.0001 -num_threads 4 -outfmt '6 qseqid qlen qstart qend sseqid slen sstart send length evalue bitscore pident nident mismatch gaps qcovs qcovhsp' -qcov_hsp_perc 50 >> blast/proteins/proteins.txt

Maybe related to this: https://www.biostars.org/p/16729/

Low coverage samples break downstream analysis

There should be a minimum coverage cutoff for a sample to continue. For example, if an input only has 1x coverage, flag it and keep downstream analyses (which will probably fail anyway) from running.
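
A sketch of such a cutoff in Python. The default of 10x and the input layout (sample name mapped to total base pairs and genome size) are assumptions chosen for illustration.

```python
def split_by_coverage(samples, min_coverage=10.0):
    """Split {name: (total_bp, genome_size)} into passing and flagged names."""
    passed, flagged = [], []
    for name, (total_bp, genome_size) in samples.items():
        # Treat an unknown (zero) genome size as zero coverage.
        coverage = total_bp / genome_size if genome_size > 0 else 0.0
        (passed if coverage >= min_coverage else flagged).append(name)
    return passed, flagged


samples = {
    "good": (300_000_000, 3_000_000),  # 100x
    "low": (3_000_000, 3_000_000),     # 1x
}
assert split_by_coverage(samples) == (["good"], ["low"])
```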

Estimated genome size applied to wrong sample

Had a case in which the estimated genome size of one sample was applied to another sample.

This is only an issue when Mash is used to estimate a genome size. If the user gives an explicit genome size, it is the same for all samples.

Run bactopia from the main.nf script

Bactopia currently uses nextflow run bactopia/bactopia. While this is nice, it becomes a problem when multiple versions of Bactopia are installed. Let's just call main.nf directly and call it a day.

Script to make accessions text

It would be useful to have a script that produces the --accessions text input.

Example: I want to process all samples in BioProject PRJNA123456789

bactopia ena-query PRJNA123456789

Better handling of conda environments?

Occasionally, building a conda environment can fail because of connection issues. If it is the first time building, it will cause Nextflow to error out.

Sure, you can just resume (-resume) the job, but it could be annoying.

Come up with something better, such as building the environments before any jobs are run.

JSONify outputs during processing

I think it would be useful to JSONify outputs within the workflow.

This way it's done, and we don't need to duplicate the logic/re-parse in multiple workflows.
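
As a small illustration of the idea, a tab-delimited result could be converted to JSON once, at the point where it is produced. The helper below is a hypothetical sketch and assumes the output has a header row.

```python
import csv
import io
import json


def tsv_to_json(text):
    """Convert tab-delimited text with a header row to a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    return json.dumps(rows, indent=2)


print(tsv_to_json("sample\tcoverage\nA\t50\nB\t1\n"))
```

Values stay as strings here; converting them to numbers would need knowledge of each tool's columns.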

Is a FASTQ validator needed?

See https://www.ebi.ac.uk/ena/data/view/SRR3030395

It is an assembly that was somehow converted to a FASTQ. As you might imagine, it breaks Bactopia.

@SRR3030395.32 AUTD01000005.1/1
AGGGGGCGATCCCCCAACTACTATCGGCGTGCTGAAGCTTAACTTCTGTGTTCGGCATGGGAACAGGTGTATCCTTCAGGCTATCGCCACCACACTATAAGAGAACTTCTTCCCTCAAAACTAGATATTATTCAATTATTCTCGAAACAACTACGTTGTTGACTTGGTTAAGTCCTCGACCGATTAGTACTGGTCCGCTCCACGCCTCACGGCGCTGCTACTTCCAGCCTATCTACCTGATCATCTCTCAGGGGTCTTACTTCCATATAGGAATGGGAAATCTCATCTTGAGGCGAGTTTCACACTTAGATGCTTTCAGCGTTTATCTCATCCATACATAGCTACCCAGCGATGCGCCTGGCGGCACAACTGGTACACCAGCGGTATGTCCATCCCGGTCCTCTCGTACTAAGGACAGCTCCTCTCAAATTTCCTACGCCCGCGACGGATAGGGACCGAACTGTCTCACGACGTTCTGAACCCAGCTCGCGTACCGCTTTAATGGGCGAACAGCCCAACCCTTGGGACCGACTACAGCCCCAGGATGCGATGAGCCGACATCGAGGTGCCAAACCTCCCCGTCGATGTGGACTCTTGGGGGAGATAAGCCTGTTATCCCCAGGGTAGCTTTTATCCGTTGAGCGATGGCCCTTCCATACGGTACCACCGGATCACTAAGCCCGACTTTCGTCCCTGCTCGACCTGTCTGTCTCGCAGTCAAGCTCTCTTCTGCCTTTACACTCGACGAATGATTTCCAACCATTCTGAGAGAACCTTTGGGCGCCTCCGTTACTTTTTAGGAGGCGACCGCCCCAGTCAAACTGCCTACCTGACACTGTCTCCCACCACGATAAGTGGTGCGGGTTAGAGTGTTCACACAGCGAGGGTCGTATCCCACCAGCGCCTCACTCGAAACTAGCGTTCCGAGTTCTACGGCTCCGACCTATCCTGTACAAGCTGTGTCAACACCCAATATCAAGCTACAGTAAAGCTCCATGGGGTCTTTCCGTCCTGTCGCGGGTAACCTGCATCTTCACAGGTAATATAATTTCACCGAGTCTCTCGTTGAGACAGTGCCCAGATCGTTACGCCTTTCGTGCGGGTCGGAACTTACCCGACAAGGAATTTCGCTACCTTAGGACCGTTATAGTTACGGCCGCCGTTTACTGGGGCTTCATTTCTGGGCTTCGCCGAAGCTAACTCATCCACTTAACCTTCCAGCACCGGGCAGGCGTCAGCCCCTATACGTCATCTTTCGATTTTGCAGAAACCTGTGTTTTTGATAAACAGTCGCCTGGGCCTTTTCACTGCGGCTACACTTGCGTGCAGCACCCCTTCTCCCGAAGTTACGGGGTCATTTTGCCGAGTTCCTTAACGAGAGTTCACTCGCTCACCTTAGGATACTCTCCTCGACTACCTGTGTCGGTTTGCGGTACGGGTAATTAATCACTAACTAGAAGCTTTTCTCGGCAGTGTGACATCTGGCGCTTCCCTACTAAAATTCGGTCCTCGTCACGCCTTGTCCTTAGCGATAAGCATTTGACTCATCACCAGACTTGACGCTTGAACACACATTTCCAATCGTGTGCACACCATAGCCTCCTGCGTCCCTCCATCGTTCAAACATGATTAACTAGTACAGGAATATCAACCTGTTATCCATCGCCTACGCCTTGCGGCCTCGGCTTAGGTCCCGACTAACCCTGGGAGGACGAGCCTTCCCCAGGAAACCTTAGTCATTCGGTGGATCAGATTCTCACTGATCTTTCGCTACTCATACCGGCATTCTCACTTCTAAGCGCTCCACAAGTCCTTGCGATCTTGCTTCGTTGCCCTTAGAACGCTCTCCTATCACTCGACCTTACGGTCGAATCCACAATTTCGGTAACATGCTTAGCCCCGGTAAATTTTCGGCGCAGAATCACTCGGCTAGTGAGCTATTACGCACTCTTTAAATGGTGGCTGCTTCTGAGCCAACATCCTAGCTGTCTATGCAACTC
CACATCCTTTTCCACTCAGCATGTATTTAGGGACCTTAATTGGTGGTCTGGGCTGTTCCCCTTTCGACGGTGGATCTTATCACTCATCGTCTGACTCCCGGATATAAATCTGTGGCATTCGGAGTTTATCTGAATTCAGTAACCCATGACGGGCCCCTAGTCCAAACAGTGGCTCTACCTCCACGATTCTTAACTCCGAGGCTAACCCTAAAGCTATTTCAGAACCAGCTATCTCCAAGTTCGTTTGGAATTTCACCGCTACCCACACCTCATCCCAGCATTTTTCAACATACACGGGTTCGGTCCTCCAGTGCGTTTTACCGCACCTTCAACCTGGACATGGGTAGGTCACCTGGTTTCGGGTCTACATCAATTTACTGAAACGCCCGTTTCAGACTCGCTTTCGCTACGGCTCCGGTCTTTCCACCTTAACCTTGCAAATTAACGTAACTCGCCGGTTCATTCTACAAAAGGCACGCTATCACCCATTAACGGGCTCTAACTAATTGTAGGCACATGGTTTCAGGAACTATTTCACTCCGCTTCCGCGGTGCTTTTCACCTTTCCCTCACGGTACTGGTTCACTATCGGTCACTAGGGAGTATTTAGCCTTGGGAGATGGTCCTCCCGGATTCCGACCACGTTTCACGTGTGTGGCCGTACTCAGGATCCTGAACTGAGGGTTGACGATTTCACCTACGGGGGTATCACCCTCTATGCCGAGCCTTCCCAGACTCTTCGGTTATCATCAACTTTGGTAACTCAAATGTTCAGTCCTACAACCCCAGAAAGCAAGCTTCCTGGTTTGGGCTGTTCCCCGTTCGCTCGCCGCTACTTAGGGAATCGATTTTTCTTTCTCTTCCTGTGGGTACTTAGATGTTTCAGTTCCCCACGTCTGCCTCAACTTGACTATGTATTCATCAAGTTGTAATCATCGGTAAAGATGATTGGGTTTCCCCATTCGGAAATCTCCGGATCAAAGCTTACGTACAGCTCCCCGAAGCATATCGGTGTTAGTCCCGTCCTTCATCGGCTCCTAGTACCAAGGCATCCACCATGCGCCCTTCATAACTTAACCTAACGGTCACTTCGTGATCGTCAAATTAATTGAGTATTAGCGATAAACTAATTAAAAAACTCAAAAATACGCAGTTGTTTCTCGGTTTAATTATCTTAATAATTAAAGGAAAATAATTGATAATATCTAGTTTTCAAAGAACAA
+
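
A basic per-record check might look like the sketch below. It assumes strict four-line records, and the optional `max_read_length` limit is a hypothetical way to flag contig-sized "reads" like the one above; neither is an existing Bactopia feature.

```python
def validate_fastq_record(lines, max_read_length=None):
    """Return a list of problems found in one four-line FASTQ record."""
    if len(lines) != 4:
        return [f"expected 4 lines, found {len(lines)}"]
    header, sequence, separator, quality = lines
    problems = []
    if not header.startswith("@"):
        problems.append("header does not start with '@'")
    if not separator.startswith("+"):
        problems.append("separator does not start with '+'")
    if len(sequence) != len(quality):
        problems.append("sequence and quality lengths differ")
    if max_read_length is not None and len(sequence) > max_read_length:
        problems.append(f"read length {len(sequence)} exceeds {max_read_length}")
    return problems


# A well-formed record yields no problems.
assert validate_fastq_record(["@r1", "ACGT", "+", "IIII"]) == []
```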

