bactopia / bactopia
A flexible pipeline for complete analysis of bacterial genomes
Home Page: https://bactopia.github.io
License: MIT License
A process expected to run only 54 times is currently at 1,248 runs.
One of the processes is dragging along the QC'd FASTQs.
It may be coming from assembly, annotation, or reference download (they all drag along the QC'd FASTQs).
Flagging this could help users remember to delete the extra copies.
Currently, Bactopia is set not to overwrite output by default. This is problematic because Nextflow will run the full pipeline but then refuse to overwrite the output. In other words, a simple check at the beginning would avoid wasting users' time.
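The up-front check could be as simple as the sketch below. This is illustrative only (the function and flag names are hypothetical, not Bactopia's actual API): bail out before launching Nextflow if a sample's output directory already has contents and overwriting was not requested.

```python
import sys
from pathlib import Path

def check_existing_outputs(outdir, sample, overwrite=False):
    """Exit before the pipeline starts if results for this sample
    already exist and overwriting was not requested.
    (Hypothetical helper; names are illustrative.)"""
    sample_dir = Path(outdir) / sample
    if sample_dir.exists() and any(sample_dir.iterdir()) and not overwrite:
        sys.exit(f"ERROR: Output for '{sample}' already exists in {outdir}. "
                 "Use --force to overwrite, or choose a new --outdir.")
```

Running this once per sample before handing off to Nextflow costs nothing compared to a full pipeline run that silently writes no output.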
Docs are cool, but what is actually executed?
Currently, the species name format differs between bactopia datasets and bactopia.
Example:
# Build dataset
bactopia datasets datasets/ --species "Staphylococcus aureus"
# Run Bactopia
bactopia --SE my-fastq.gz --datasets datasets/ --species "staphylococcus-aureus"
For consistency, the following should also work:
bactopia --SE my-fastq.gz --datasets datasets/ --species "Staphylococcus aureus"
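One way to make both spellings work is to normalize the species name to a single canonical form before looking up the dataset directory. A minimal sketch (not Bactopia's actual implementation):

```python
def normalize_species(name):
    """Normalize a species name so "Staphylococcus aureus" and
    "staphylococcus-aureus" resolve to the same dataset directory.
    (Illustrative sketch.)"""
    return name.strip().lower().replace(" ", "-")
```

With this, both commands above resolve to the same staphylococcus-aureus dataset folder.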
Although gzipping saves storage space, it is inconvenient to constantly gunzip/zcat files just to view them. A better alternative is to let the user decide whether outputs are compressed.
Please add CRISPRfinder for CRISPR/Cas typing and Phigaro (version 0.2.1.7) for prophage identification in bacterial genomes.
Sometimes the estimated genome size is too large (>15 Mb) or too small (<100 kb).
This should be caught and further analysis stopped.
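A plausibility check along these lines would catch bad estimates early. The 100 kb and 15 Mb bounds come from the issue text; everything else is an illustrative sketch:

```python
MIN_GENOME_SIZE = 100_000      # 100 kb, lower bound from the issue
MAX_GENOME_SIZE = 15_000_000   # 15 Mb, upper bound from the issue

def validate_genome_size(size):
    """Return True if an estimated genome size is plausible for a
    bacterium; callers should halt further analysis otherwise."""
    return MIN_GENOME_SIZE <= size <= MAX_GENOME_SIZE
```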
When dealing with genus queries, it might be useful to add genome size to the accessions output. Then bactopia can use this instead of estimating via Mash
This won't actually run the pipeline; it will just give an overview of which analyses would be expected to run given the inputs (PE vs. SE reads; general, species, and user datasets; etc.).
This will require renaming the current --dry_run to something like --test_conda.
Similar to #5
Sometimes, with super low coverage, FLASH creates an empty extendedFrags.fastq.gz file, which SKESA does not like.
[shovill] Assembling reads with 'skesa'
[shovill] Running: skesa --gz --fastq flash.extendedFrags.fastq.gz --fastq flash.notCombined_1.fastq.gz,flash.notCombined_2.fastq.gz --use_paired_ends --contigs_out skesa.fasta --min_contig 1 --memory 16 --cores 3 --vector_percent 1 2>&1 | sed 's/^/[skesa] /' | tee -a shovill.log
[skesa] skesa --gz --fastq flash.extendedFrags.fastq.gz --fastq flash.notCombined_1.fastq.gz,flash.notCombined_2.fastq.gz --use_paired_ends --contigs_out skesa.fasta --min_contig 1 --memory 16 --cores 3 --vector_percent 1
[skesa]
[skesa] WARNING: option --gz is deprecated - gzipped files are now recognized automatically
[skesa]
[skesa] Invalid fastq file format in flash.extendedFrags.fastq.gz
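Before building the skesa command line, the merged-reads file could be checked and dropped if it contains no records. A sketch of such a guard (the function is hypothetical; it only does file handling):

```python
import gzip

def has_reads(fastq_gz):
    """Return True if a gzipped FASTQ has at least one record, so an
    empty flash.extendedFrags.fastq.gz can be omitted from the skesa
    --fastq arguments. (Illustrative sketch.)"""
    try:
        with gzip.open(fastq_gz, "rt") as fh:
            return bool(fh.readline().strip())
    except OSError:
        # Missing or corrupt gzip file: treat the same as empty
        return False
```

If has_reads() returns False, shovill/skesa would be invoked with only the notCombined pair, avoiding the "Invalid fastq file format" failure.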
Add an option to include accessions to be downloaded and annotated with Prokka. These genomes can then be included in the Roary analysis. Currently this is only done for cgtree.
conda create -n bactopia -c rpetit3 -c conda-forge -c bioconda bactopia
Fetching package metadata .............
Solving package specifications: .
PackageNotFoundError: Package not found: '' Dependencies missing in current linux-64 channels:
- bactopia -> ariba 2.13.5 py36hf484d3e_0 -> libgcc-ng >=7.3.0
- bactopia -> ariba 2.13.5 py36hf484d3e_0 -> libstdcxx-ng >=7.3.0
- bactopia -> mash 2.1 hf69f6b5_1 -> openblas >=0.3.3,<0.3.4.0a0 -> libgfortran-ng >=7,<8.0a0
Close matches found; did you mean one of these?
libgcc-ng: libgcc
libgfortran-ng: libgfortran
You can search for packages on anaconda.org with
anaconda search -t conda libstdcxx-ng
(and similarly for the other packages)
You may need to install the anaconda-client command line client with
conda install anaconda-client
conda --version
conda 4.2.13
During the download_references step, I'm receiving:
Command error:
ERROR: No downloads matched your filter. Please check your options.
The issue is that the accession being queried is GCF00000.1, but ncbi-genome-download finds GCF00000.2. This causes the step to fail.
A simple solution is to rebuild the species dataset so the updated accessions are included, but it might become necessary to implement a method that does not require a version number (e.g., .1, .2, etc.).
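A versionless match could be as simple as stripping the trailing ".N" from the assembly accession before querying. A sketch of the idea (not what Bactopia currently does):

```python
import re

def strip_version(accession):
    """Drop the trailing ".N" version suffix from an NCBI assembly
    accession so that e.g. a dataset built against version .1 still
    matches when NCBI now serves version .2. (Illustrative sketch.)"""
    return re.sub(r"\.\d+$", "", accession)
```

The versionless accession would then be passed to ncbi-genome-download, which matches whichever version currently exists.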
During setup-datasets, ariba will produce an error.
ariba getref --help
Traceback (most recent call last):
File "/home/rpetit/miniconda3/envs/ariba/bin/ariba", line 3, in <module>
import ariba
File "/home/rpetit/miniconda3/envs/ariba/lib/python3.6/site-packages/ariba/__init__.py", line 57, in <module>
from ariba import *
File "/home/rpetit/miniconda3/envs/ariba/lib/python3.6/site-packages/ariba/assembly.py", line 6, in <module>
from ariba import common, mapping, bam_parse, external_progs, ref_seq_chooser
File "/home/rpetit/miniconda3/envs/ariba/lib/python3.6/site-packages/ariba/mapping.py", line 4, in <module>
import pysam
File "/home/rpetit/miniconda3/envs/ariba/lib/python3.6/site-packages/pysam/__init__.py", line 5, in <module>
from pysam.libchtslib import *
ImportError: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory
This is related to bioconda/bioconda-recipes#17448, where a PR fix has been submitted.
A temporary fix is to manually upgrade pysam:
conda activate bactopia
conda install -c conda-forge -c bioconda pysam=0.15.3
Need to add -r to bactopia run.
bactopia pull
Checking bactopia/bactopia ...
downloaded from https://github.com/bactopia/bactopia.git - revision: b702be9169 [v1.2.1]
bactopia --accession SRX477044 -profile slurm --cpus 8 -resume
N E X T F L O W ~ version 19.07.0
Project `bactopia/bactopia` currently is sticked on revision: v1.2.1 -- you need to specify explicitly a revision with the option `-r` to use it
{
"qc_stats": {
"total_bp":341120202,
"coverage":inf,
"read_total":1587547,
"read_min":200,
"read_mean":214.872,
"read_std":0.769588,
"read_median":215,
"read_max":215,
"read_25th":215,
"read_75th":215,
"qual_min":11,
"qual_mean":31.2358,
"qual_std":5.24431,
"qual_max":38,
"qual_median":33,
"qual_25th":28,
"qual_75th":35
},
This causes bactopia summary to fail.
It was fixed in the latest version of fastq-scan (v0.4.1, https://github.com/rpetit3/fastq-scan/releases/tag/v0.4.1).
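The failure happens because a bare `inf` is not valid JSON, so the summary step's parser chokes. Until every run uses fastq-scan >= v0.4.1, a defensive workaround like the sketch below could sanitize old stats files before parsing (the helper name and the null substitution are my assumptions, not Bactopia code):

```python
import json
import re

def load_stats(text):
    """Parse fastq-scan JSON output, tolerating the bare `inf` that
    versions before v0.4.1 could emit for coverage. The invalid token
    is replaced with null before parsing. (Workaround sketch.)"""
    cleaned = re.sub(r':\s*inf\b', ': null', text)
    return json.loads(cleaned)
```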
Add it! https://github.com/ncbi/amr
Example: allow bactopia datasets to pull from a git repo, e.g. pulling Staphopia v1 just from a URL.
The step that verifies inputs have enough data can be time-consuming.
There should be an option to skip it when the inputs are known to be sufficient; the checks are more useful with public data, where quality isn't known beforehand.
Dockerfile and Singularity files need to be updated
Use Snippy substitution data and bedtools coverage data to create something similar to Snippy's ".consensus.subs.fa" file except with 0 coverage regions masked out.
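The masking step could look something like the sketch below: take the substituted consensus sequence and the zero-coverage intervals (e.g. from `bedtools genomecov` filtered to depth 0) and overwrite the uncovered positions with N. This is a single-contig illustration, not the actual implementation; interval format is assumed to be BED-style half-open (start, end):

```python
def mask_zero_coverage(seq, zero_cov_intervals):
    """Mask zero-coverage regions of a consensus sequence with 'N',
    producing something like Snippy's .consensus.subs.fa but with
    uncovered regions hidden. Intervals are half-open (start, end),
    0-based, as in BED. (Illustrative, single contig only.)"""
    bases = list(seq)
    for start, end in zero_cov_intervals:
        for i in range(start, min(end, len(bases))):
            bases[i] = "N"
    return "".join(bases)
```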
The default is to copy files from the work directory to the output directory; allow users to specify symbolic links instead.
See: https://www.nextflow.io/docs/latest/process.html#publishdir
Traceback (most recent call last):
File "/home/mdh/mplumb/.conda/envs/bactopia/bin/setup-datasets.py", line 928, in <module>
ARIBA, PUBMLST, CGMLST = get_available_datasets(args.clear_cache)
File "/home/mdh/mplumb/.conda/envs/bactopia/bin/setup-datasets.py", line 131, in get_available_datasets
return [data['ariba'], data['pubmlst'], data['cgmlst']]
KeyError: 'cgmlst'
This error needs to be explained (or handled gracefully).
Command exit status:
139
Command output:
(empty)
Command error:
.command.sh: line 40: 32 Segmentation fault (core dumped) tblastn -db S.190206.00513 -query proteins.fasta -evalue 0.0001 -num_threads 4 -outfmt '6 qseqid qlen qstart qend sseqid slen sstart send length evalue bitscore pident nident mismatch gaps qcovs qcovhsp' -qcov_hsp_perc 50 >> blast/proteins/proteins.txt
maybe related to this: https://www.biostars.org/p/16729/
There should be a minimum coverage cutoff for a sample to continue. For example, if an input has only 1x coverage, flag it and keep downstream analyses (which will probably fail anyway) from happening.
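The gate itself is a one-liner once total bases and genome size are known. The 10x default below is purely illustrative; the actual cutoff would need to be decided:

```python
MIN_COVERAGE = 10  # illustrative default; the real cutoff is TBD

def passes_coverage(total_bp, genome_size, min_coverage=MIN_COVERAGE):
    """Return True if estimated depth (total bases / genome size)
    meets the cutoff; samples failing this would be flagged and
    skipped by downstream steps. (Sketch.)"""
    if genome_size <= 0:
        return False
    return total_bp / genome_size >= min_coverage
```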
Had a case in which the estimated genome size of one sample was applied to another sample.
This is only an issue when Mash is used to estimate the genome size; if the user gives an explicit genome size, they are all the same.
Nothing is more annoying than making a typo in a parameter name and watching the workflow chug along like nothing ever happened!
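A simple fail-fast check would compare the supplied parameter names against the known set and abort on anything unrecognized. Sketch only; the parameter list here is a made-up subset, not Bactopia's real options:

```python
# Hypothetical subset of recognized parameters, for illustration
KNOWN_PARAMS = {"outdir", "datasets", "species", "coverage", "cpus"}

def unknown_params(given, known=KNOWN_PARAMS):
    """Return supplied parameter names that match no known option,
    so a typo (e.g. --speceis) fails fast instead of being silently
    ignored. (Sketch.)"""
    return sorted(set(given) - set(known))
```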
Currently using nextflow run bactopia/bactopia. While it's nice, it becomes a problem when multiple versions of Bactopia are installed; let's just call the main.nf and call it a day.
Currently downloads only come from ENA; make SRA available as well.
It would be useful to have a script that produced the --accessions input.
Example: I want to process all samples in BioProject PRJNA123456789:
bactopia ena-query PRJNA123456789
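Such a script could lean on the ENA Portal API's search endpoint, which returns run accessions for a study as TSV. The URL template below uses real ENA Portal API parameters (result, fields, query, format), but the overall helper is a sketch with no error handling:

```python
# Template for an ENA Portal API search returning one run accession
# per line (TSV with a `run_accession` header). The endpoint and
# parameters are real ENA Portal API features; fetching is left to
# the caller (e.g. urllib or requests).
ENA_URL = ("https://www.ebi.ac.uk/ena/portal/api/search"
           "?result=read_run&fields=run_accession&format=tsv"
           "&query=study_accession%3D{}")

def parse_run_accessions(tsv_text):
    """Extract run accessions from the ENA Portal API TSV response:
    skip the header line, take the first column of each row."""
    lines = tsv_text.strip().splitlines()
    return [line.split("\t")[0] for line in lines[1:] if line.strip()]
```

The resulting list could be written one accession per line as the --accessions file.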
Occasionally, building a conda environment can fail because of connection issues. If it is the first time building, this causes Nextflow to error out.
Sure, you can just resume the job (-resume), but that gets annoying.
Come up with something better: something that builds the environments before jobs are run.
I think it would be useful to JSONify outputs within the workflow.
This way it's done, and we don't need to duplicate the logic/re-parse in multiple workflows.
See nextflow-io/nextflow#1108 for the fix. Only happens with McCortex.
See https://www.ebi.ac.uk/ena/data/view/SRR3030395
It is an assembly, but it was somehow converted to a FASTQ. As you might imagine, it breaks Bactopia:
@SRR3030395.32 AUTD01000005.1/1
AGGGGGCGATCCCCCAACTACTATCGGCGTGCTGAAGCTTAACTTCTGTGTTCGGCATGGGAACAGGTGTATCCTTCAGGCTATCGCCACCACACTATAAGAGAACTTCTTCCCTCAAAACTAGATATTATTCAATTATTCTCGAAACAACTACGTTGTTGACTTGGTTAAGTCCTCGACCGATTAGTACTGGTCCGCTCCACGCCTCACGGCGCTGCTACTTCCAGCCTATCTACCTGATCATCTCTCAGGGGTCTTACTTCCATATAGGAATGGGAAATCTCATCTTGAGGCGAGTTTCACACTTAGATGCTTTCAGCGTTTATCTCATCCATACATAGCTACCCAGCGATGCGCCTGGCGGCACAACTGGTACACCAGCGGTATGTCCATCCCGGTCCTCTCGTACTAAGGACAGCTCCTCTCAAATTTCCTACGCCCGCGACGGATAGGGACCGAACTGTCTCACGACGTTCTGAACCCAGCTCGCGTACCGCTTTAATGGGCGAACAGCCCAACCCTTGGGACCGACTACAGCCCCAGGATGCGATGAGCCGACATCGAGGTGCCAAACCTCCCCGTCGATGTGGACTCTTGGGGGAGATAAGCCTGTTATCCCCAGGGTAGCTTTTATCCGTTGAGCGATGGCCCTTCCATACGGTACCACCGGATCACTAAGCCCGACTTTCGTCCCTGCTCGACCTGTCTGTCTCGCAGTCAAGCTCTCTTCTGCCTTTACACTCGACGAATGATTTCCAACCATTCTGAGAGAACCTTTGGGCGCCTCCGTTACTTTTTAGGAGGCGACCGCCCCAGTCAAACTGCCTACCTGACACTGTCTCCCACCACGATAAGTGGTGCGGGTTAGAGTGTTCACACAGCGAGGGTCGTATCCCACCAGCGCCTCACTCGAAACTAGCGTTCCGAGTTCTACGGCTCCGACCTATCCTGTACAAGCTGTGTCAACACCCAATATCAAGCTACAGTAAAGCTCCATGGGGTCTTTCCGTCCTGTCGCGGGTAACCTGCATCTTCACAGGTAATATAATTTCACCGAGTCTCTCGTTGAGACAGTGCCCAGATCGTTACGCCTTTCGTGCGGGTCGGAACTTACCCGACAAGGAATTTCGCTACCTTAGGACCGTTATAGTTACGGCCGCCGTTTACTGGGGCTTCATTTCTGGGCTTCGCCGAAGCTAACTCATCCACTTAACCTTCCAGCACCGGGCAGGCGTCAGCCCCTATACGTCATCTTTCGATTTTGCAGAAACCTGTGTTTTTGATAAACAGTCGCCTGGGCCTTTTCACTGCGGCTACACTTGCGTGCAGCACCCCTTCTCCCGAAGTTACGGGGTCATTTTGCCGAGTTCCTTAACGAGAGTTCACTCGCTCACCTTAGGATACTCTCCTCGACTACCTGTGTCGGTTTGCGGTACGGGTAATTAATCACTAACTAGAAGCTTTTCTCGGCAGTGTGACATCTGGCGCTTCCCTACTAAAATTCGGTCCTCGTCACGCCTTGTCCTTAGCGATAAGCATTTGACTCATCACCAGACTTGACGCTTGAACACACATTTCCAATCGTGTGCACACCATAGCCTCCTGCGTCCCTCCATCGTTCAAACATGATTAACTAGTACAGGAATATCAACCTGTTATCCATCGCCTACGCCTTGCGGCCTCGGCTTAGGTCCCGACTAACCCTGGGAGGACGAGCCTTCCCCAGGAAACCTTAGTCATTCGGTGGATCAGATTCTCACTGATCTTTCGCTACTCATACCGGCATTCTCACTTCTAAGCGCTCCACAAGTCCTTGCGATCTTGCTTCGTTGCCCTTAGAACGCTCTCCTATCACTCGACCTTACGGTCGAATCCACAATTTCGGTAACATGCTTAGCCCCGGTAAATTTTCGGCGCAGAATCACTCGGCTAGTGAGCTATTACGCACTCTTTAAATGGTGGCTGCTTCTGAGCCAACATCCTAGCTGTCTATGCAACTC
CACATCCTTTTCCACTCAGCATGTATTTAGGGACCTTAATTGGTGGTCTGGGCTGTTCCCCTTTCGACGGTGGATCTTATCACTCATCGTCTGACTCCCGGATATAAATCTGTGGCATTCGGAGTTTATCTGAATTCAGTAACCCATGACGGGCCCCTAGTCCAAACAGTGGCTCTACCTCCACGATTCTTAACTCCGAGGCTAACCCTAAAGCTATTTCAGAACCAGCTATCTCCAAGTTCGTTTGGAATTTCACCGCTACCCACACCTCATCCCAGCATTTTTCAACATACACGGGTTCGGTCCTCCAGTGCGTTTTACCGCACCTTCAACCTGGACATGGGTAGGTCACCTGGTTTCGGGTCTACATCAATTTACTGAAACGCCCGTTTCAGACTCGCTTTCGCTACGGCTCCGGTCTTTCCACCTTAACCTTGCAAATTAACGTAACTCGCCGGTTCATTCTACAAAAGGCACGCTATCACCCATTAACGGGCTCTAACTAATTGTAGGCACATGGTTTCAGGAACTATTTCACTCCGCTTCCGCGGTGCTTTTCACCTTTCCCTCACGGTACTGGTTCACTATCGGTCACTAGGGAGTATTTAGCCTTGGGAGATGGTCCTCCCGGATTCCGACCACGTTTCACGTGTGTGGCCGTACTCAGGATCCTGAACTGAGGGTTGACGATTTCACCTACGGGGGTATCACCCTCTATGCCGAGCCTTCCCAGACTCTTCGGTTATCATCAACTTTGGTAACTCAAATGTTCAGTCCTACAACCCCAGAAAGCAAGCTTCCTGGTTTGGGCTGTTCCCCGTTCGCTCGCCGCTACTTAGGGAATCGATTTTTCTTTCTCTTCCTGTGGGTACTTAGATGTTTCAGTTCCCCACGTCTGCCTCAACTTGACTATGTATTCATCAAGTTGTAATCATCGGTAAAGATGATTGGGTTTCCCCATTCGGAAATCTCCGGATCAAAGCTTACGTACAGCTCCCCGAAGCATATCGGTGTTAGTCCCGTCCTTCATCGGCTCCTAGTACCAAGGCATCCACCATGCGCCCTTCATAACTTAACCTAACGGTCACTTCGTGATCGTCAAATTAATTGAGTATTAGCGATAAACTAATTAAAAAACTCAAAAATACGCAGTTGTTTCTCGGTTTAATTATCTTAATAATTAAAGGAAAATAATTGATAATATCTAGTTTTCAAAGAACAA
+
????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
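Records like the one above could be caught with a cheap heuristic before QC: contig-sized sequences and uniform placeholder quality strings are both red flags for an assembly masquerading as reads. The thresholds below are illustrative guesses, not tested cutoffs:

```python
def suspicious_fastq_record(seq, qual, max_len=5000):
    """Flag FASTQ records that look like assembly contigs disguised
    as reads: unusually long sequences, or long quality strings made
    of a single repeated placeholder character (e.g. all '?').
    (Heuristic sketch; thresholds are illustrative.)"""
    too_long = len(seq) > max_len
    fake_quals = len(qual) > 100 and len(set(qual)) == 1
    return too_long or fake_quals
```

A handful of flagged records at the head of a file would be enough to reject the input with a clear error instead of failing deep inside the pipeline.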
Happening on a SLURM cluster, but not for all users. Space is not an issue; maybe permissions?
--disable_auto_variants is a temporary fix.
https://mash.readthedocs.io/en/latest/tutorials.html#screening-a-read-set-for-containment-of-refseq-genomes
Output columns look like: identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment
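Parsing that tab-separated `mash screen` output into something usable is straightforward; a minimal sketch for one line (the field names in the dict are my own labels):

```python
def parse_mash_screen(line):
    """Parse one tab-separated line of `mash screen` output:
    identity, shared-hashes, median-multiplicity, p-value,
    query-ID, query-comment. (Sketch.)"""
    identity, shared, mult, pvalue, query_id, comment = \
        line.rstrip("\n").split("\t", 5)
    return {
        "identity": float(identity),
        "shared_hashes": shared,          # e.g. "990/1000"
        "median_multiplicity": int(mult),
        "p_value": float(pvalue),
        "query_id": query_id,
        "query_comment": comment,
    }
```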
Example: 185 samples, each with ~700 MB of result data, but the intermediate files for all 185 samples total 662 GB!
This needs to be cleaned up by default.
Double-check each process to make sure dataset files are not referenced by absolute paths; these need to be converted to symbolic links in the 'work' directory.