Shovill

Assemble bacterial isolate genomes from Illumina paired-end reads

Introduction

The SPAdes genome assembler has become the de facto standard de novo genome assembler for Illumina whole genome sequencing data of bacteria and other small microbes. SPAdes was a major improvement over previous assemblers like Velvet, but some of its components can be slow and it traditionally did not handle overlapping paired-end reads well.

Shovill is a pipeline which uses SPAdes at its core, but alters the steps before and after the primary assembly step to get similar results in less time. Shovill also supports other assemblers like SKESA, Velvet and Megahit, so you can take advantage of the pre- and post-processing that Shovill provides with those too.

⚠️ Shovill is for isolate data only, primarily small haploid organisms. It will NOT work on metagenomes or larger genomes. Please use Megahit directly instead.

Main steps

Estimate genome size and read length from reads (unless --gsize provided)
Reduce FASTQ files to a sensible depth (default --depth 150)
Trim adapters from reads (with --trim only)
Conservatively correct sequencing errors in reads
Pre-overlap ("stitch") paired-end reads
Assemble with SPAdes/SKESA/Megahit with modified kmer range and PE + long SE reads
Correct minor assembly errors by mapping reads back to contigs
Remove contigs that are too short, too low coverage, or pure homopolymers
Produce final FASTA with nicer names and parseable annotations
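
The contig filtering step in the list above can be illustrated with a minimal sketch. This is not Shovill's own code: the `Contig` class, the function names and the explicit thresholds are hypothetical, and the AUTO behaviour of `--minlen 0` / `--mincov 0` is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    seq: str
    cov: float  # coverage as reported by the assembler


def is_homopolymer(seq: str) -> bool:
    """True if the sequence is made of one repeated base (e.g. AAAAAA)."""
    return len(set(seq.upper())) == 1


def keep_contig(contig: Contig, minlen: int, mincov: float) -> bool:
    """Apply the three filters: too short, too low coverage, pure homopolymer."""
    if len(contig.seq) < minlen:
        return False
    if contig.cov < mincov:
        return False
    if is_homopolymer(contig.seq):
        return False
    return True


contigs = [
    Contig("NODE_1", "ACGTACGTAC", 32.7),  # passes all filters
    Contig("NODE_2", "ACG", 40.0),         # too short
    Contig("NODE_3", "ACGTACGTAC", 0.5),   # too low coverage
    Contig("NODE_4", "AAAAAAAAAA", 25.0),  # pure homopolymer
]
kept = [c.name for c in contigs if keep_contig(c, minlen=5, mincov=2.0)]
print(kept)
```

Only the first contig survives here; each of the others trips exactly one of the three filters.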

Quick Start

% shovill --outdir out --R1 test/R1.fq.gz --R2 test/R2.fq.gz

<snip>
Final assembly in: out/contigs.fa
It contains 17 (min=150) contigs totalling 169611 bp.
Done.

% ls out

contigs.fa   contigs.gfa   shovill.corrections  
shovill.log  spades.fasta

% head -n 4 out/contigs.fa

>contig00001 len=52653 cov=32.7 corr=1 origname=NODE_3 date=20180327 sw=shovill/1.0.1
ATAACGCCCTGCTGGCCCAGGTCATTTTATCCAATCTGGACCTCTCGGCTCGCTTTGAAGAAT
GAGCGAATTCGCCGTTCAGTCCGCTGGACTTCGGACTTAAAGCCGCCTAAAACTGCACGAACC
ATTGTTCTGAGGGCCTCACTGGATTTTAACATCCTGCTAACGTCAGTTTCCAACGTCCTGTCG

Installation

Homebrew

brew install brewsci/bio/shovill
shovill --check

Using Homebrew will install all the dependencies for you on Linux or macOS.

Conda

conda install -c conda-forge -c bioconda -c defaults shovill
shovill --check

Using Bioconda will install all the dependencies for you on MacOS and Linux.

Containers

The Docker recipe is generously maintained by Curtis Kapsak and the StaPH-B workgroup.

# Docker
docker pull staphb/shovill:latest
docker run staphb/shovill:latest shovill --help

# Singularity
singularity build shovill.sif docker://staphb/shovill:latest
singularity exec shovill.sif shovill --help

Source

git clone https://github.com/tseemann/shovill.git
./shovill/bin/shovill --help
./shovill/bin/shovill --check

You will need to install all the dependencies manually:

SPAdes >= 3.11 (prefer >= 3.14)
SKESA
MEGAHIT
Velvet >= 1.2
Lighter
FLASh
SAMtools >= 1.3 (prefer >= 1.10)
BWA MEM
KMC
seqtk
pigz (should be available with your OS distribution)
Pilon (Java)
Trimmomatic (Java)
samclip

Note that you will need to make pilon and trimmomatic executables. You can make a simple wrapper for each that just passes the shell arguments.
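
One way to write such a wrapper is sketched below in Python. It assumes a hypothetical jar location (`/opt/pilon/pilon.jar`) and hypothetical function names; adjust the path to wherever the jar lives on your system. The same pattern would apply to Trimmomatic.

```python
import os

# Hypothetical jar location; change this to match your installation.
DEFAULT_JAR = "/opt/pilon/pilon.jar"


def build_java_command(args, jar_path=DEFAULT_JAR, java="java"):
    """Build the argv list that runs the jar with all arguments passed through."""
    return [java, "-jar", jar_path, *args]


def run_wrapper(args, jar_path=DEFAULT_JAR):
    """Replace the current process with the JVM running the jar."""
    cmd = build_java_command(args, jar_path)
    os.execvp(cmd[0], cmd)


# In a real wrapper script saved as `pilon` on your PATH, the last lines would be:
#     import sys
#     run_wrapper(sys.argv[1:])
example = build_java_command(["--help"])
print(example)
```

Save the script under the tool's name, make it executable with `chmod +x`, and check that `shovill --check` can find it.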

Output files

Filename	Description
`contigs.fa`	The final assembly you should use
`shovill.log`	Full log file for bug reporting
`shovill.corrections`	List of post-assembly corrections
`contigs.gfa`	Assembly graph (spades)
`contigs.fastg`	Assembly graph (megahit)
`contigs.LastGraph`	Assembly graph (velvet)
`skesa.fasta`	Raw assembly (skesa)
`spades.fasta`	Raw assembled contigs (spades)
`megahit.fasta`	Raw assembly (megahit)
`velvet.fasta`	Raw assembly (velvet)

`contigs.fa`

This is the most important output file: the final, corrected assembly. It contains entries like this:

>contig00001 len=263154 cov=8.9 corr=1 origname=NODE_1 date=20180327 sw=shovill/0.9
>contig00041 len=339 cov=8.8 corr=0 origname=NODE_41 date=20180327 sw=shovill/0.9

The sequence IDs are named as per the --namefmt option, and the comment field is a series of space-separated name=value pairs with the following meanings:

Pair	Meaning
`len`	Length of contig in basepairs
`cov`	Average k-mer coverage as reported by assembler
`corr`	Number of post-assembly corrections (unless `--nocorr` used)
`origname`	The original name of the contig (before applying `--namefmt`)
`date`	YYYYMMDD date when this contig was assembled
`sw`	`shovill-engine/version` where engine is the `--assembler` chosen
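
Because the comment field is a plain series of name=value pairs, it is easy to parse in downstream scripts. The following is a minimal sketch; the function name `parse_header` is hypothetical, and values are left as strings for the caller to convert.

```python
from typing import Dict, Tuple


def parse_header(line: str) -> Tuple[str, Dict[str, str]]:
    """Split a contigs.fa header into the sequence ID and its name=value pairs."""
    fields = line.strip().lstrip(">").split()
    seq_id = fields[0]
    # Keep only well-formed pairs; split on the first '=' only.
    info = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
    return seq_id, info


header = ">contig00001 len=263154 cov=8.9 corr=1 origname=NODE_1 date=20180327 sw=shovill/0.9"
seq_id, info = parse_header(header)
print(seq_id, int(info["len"]), float(info["cov"]), info["origname"])
```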

Advanced options

SYNOPSIS
  De novo assembly pipeline for Illumina paired reads
USAGE
  shovill [options] --outdir DIR --R1 R1.fq.gz --R2 R2.fq.gz
GENERAL
  --help          This help
  --version       Print version and exit
  --check         Check dependencies are installed
INPUT
  --R1 XXX        Read 1 FASTQ (default: '')
  --R2 XXX        Read 2 FASTQ (default: '')
  --depth N       Sub-sample --R1/--R2 to this depth. Disable with --depth 0 (default: 150)
  --gsize XXX     Estimated genome size eg. 3.2M <blank=AUTODETECT> (default: '')
OUTPUT
  --outdir XXX    Output folder (default: '')
  --force         Force overwite of existing output folder (default: OFF)
  --minlen N      Minimum contig length <0=AUTO> (default: 0)
  --mincov n.nn   Minimum contig coverage <0=AUTO> (default: 2)
  --namefmt XXX   Format of contig FASTA IDs in 'printf' style (default: 'contig%05d')
  --keepfiles     Keep intermediate files (default: OFF)
RESOURCES
  --tmpdir XXX    Fast temporary directory (default: '/tmp/tseemann')
  --cpus N        Number of CPUs to use (0=ALL) (default: 8)
  --ram n.nn      Try to keep RAM usage below this many GB (default: 16)
ASSEMBLER
  --assembler XXX Assembler: skesa velvet megahit spades (default: 'spades')
  --opts XXX      Extra assembler options in quotes eg. spades: "--untrusted-contigs locus.fna" ... (default: '')
  --kmers XXX     K-mers to use <blank=AUTO> (default: '')
MODULES
  --trim          Enable adaptor trimming (default: OFF)
  --noreadcorr    Disable read error correction (default: OFF)
  --nostitch      Disable read stitching (default: OFF)
  --nocorr        Disable post-assembly correction (default: OFF)

--depth

Giving an assembler too much data is a bad thing. There comes a point where you are no longer adding new information (as the genome is a fixed size), and only adding more noise (sequencing errors). Most assemblers seem to be happy with ~150x depth, so Shovill will downsample your FASTQ files to this depth. It estimates depth by dividing read yield by genome size.
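
The arithmetic described above can be sketched as follows. This is only an illustration of the depth calculation (read yield divided by genome size), not Shovill's actual code; the function names and the example numbers are made up.

```python
def estimate_depth(total_bases: int, genome_size: int) -> float:
    """Depth is the read yield (total bases) divided by the genome size."""
    return total_bases / genome_size


def subsample_fraction(total_bases: int, genome_size: int, target_depth: float = 150) -> float:
    """Fraction of reads to keep so that depth does not exceed target_depth.

    A target_depth of 0 disables subsampling, mirroring --depth 0.
    """
    if target_depth <= 0:
        return 1.0
    depth = estimate_depth(total_bases, genome_size)
    if depth <= target_depth:
        return 1.0
    return target_depth / depth


# Example: 900 Mbp of reads for a 3 Mbp genome is 300x, so keep half the reads.
depth = estimate_depth(900_000_000, 3_000_000)
fraction = subsample_fraction(900_000_000, 3_000_000, target_depth=150)
print(depth, fraction)
```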

--gsize

The genome size is needed to estimate depth and for the read error correction stage. If you don't provide --gsize, it will be estimated via k-mer frequencies using KMC. It doesn't need to be a perfect estimate, just in the right ballpark.
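
If you script around Shovill, you may want to turn a value such as `3.2M` into a number of base pairs. The sketch below assumes that the suffixes K, M and G mean thousands, millions and billions of bases. Only the `3.2M` form appears in the help text, so the other suffixes and the function name are assumptions.

```python
# Assumed suffix meanings; only the "M" form is shown in the help text.
_MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "G": 1_000_000_000}


def parse_gsize(text: str):
    """Convert '3.2M'-style strings to base pairs; blank means auto-detect (None)."""
    text = text.strip().upper()
    if not text:
        return None
    if text[-1] in _MULTIPLIERS:
        return int(round(float(text[:-1]) * _MULTIPLIERS[text[-1]]))
    return int(float(text))


print(parse_gsize("3.2M"), parse_gsize("4800000"), parse_gsize(""))
```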

--keepfiles

This will keep all the intermediate files in --outdir so you can explore and debug.

--cpus

The help text shows a default of 8 CPUs. Use --cpus 0 to use all available CPU cores.

--ram

Shovill will do its best to keep memory usage below this value, but it is not guaranteed. If you are on an HPC cluster, you should make sure you tell your job submission engine a value higher than this.

--assembler

By default it will use SPAdes, but you can also choose Megahit, SKESA or Velvet. Megahit and SKESA are much faster than SPAdes, but give lesser assemblies. If you use SKESA you can probably use --noreadcorr and --nocorr, because it has some of that functionality built in and is conservative.

--opts

If you want to provide some assembler-specific parameters you can use the --opts parameter. Make sure you quote the parameters so they get passed as a single string. For example, with --assembler spades you might use --opts "--sc --untrusted-contigs similar_genome.fasta" or --opts '--sc'.

--kmers

A series of k-mers is chosen based on the read length distribution. You can override the automatic choice with this option.
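
When overriding the k-mers by hand, it can help to sanity-check the list first. The sketch below is not part of Shovill. It assumes a comma-separated list, flags k-mers longer than the read length, and also flags even values, because SPAdes requires odd k; other assemblers may have different rules.

```python
def parse_kmers(text: str):
    """Turn a comma-separated string such as '21,37,53' into a list of ints."""
    return [int(k) for k in text.split(",") if k.strip()]


def check_kmers(kmers, read_length: int):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for k in kmers:
        if k % 2 == 0:
            problems.append(f"k={k} is even")
        if k > read_length:
            problems.append(f"k={k} is longer than the read length ({read_length})")
    return problems


ok = check_kmers(parse_kmers("21,37,53,69,85,101,117"), read_length=150)
bad = check_kmers(parse_kmers("31,64,151"), read_length=150)
print(ok)
print(bad)
```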

Choosing which stages to use

Stage	Enable	Disable
Genome size estimation	default	`--gsize XX`
Read subsampling	`--depth N`	`--depth 0`
Read trimming	`--trim`	default
Read error correction	default	`--noreadcorr`
Read stitching/overlap	default	`--nostitch`
Contig correction	default	`--nocorr`
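
The table above maps directly onto command-line flags. As a hedged sketch, a small helper could assemble the command from stage choices; the function and its keyword arguments are hypothetical, and only the flags themselves come from the help text.

```python
def build_shovill_command(outdir, r1, r2, gsize=None, depth=None,
                          trim=False, readcorr=True, stitch=True, corr=True):
    """Build a shovill argv list, adding flags only for non-default stage choices."""
    cmd = ["shovill", "--outdir", outdir, "--R1", r1, "--R2", r2]
    if gsize is not None:       # providing --gsize skips genome size estimation
        cmd += ["--gsize", str(gsize)]
    if depth is not None:       # --depth 0 disables subsampling
        cmd += ["--depth", str(depth)]
    if trim:
        cmd.append("--trim")
    if not readcorr:
        cmd.append("--noreadcorr")
    if not stitch:
        cmd.append("--nostitch")
    if not corr:
        cmd.append("--nocorr")
    return cmd


cmd = build_shovill_command("out", "R1.fq.gz", "R2.fq.gz",
                            gsize="3.2M", depth=0, trim=True, stitch=False)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Shovill is installed.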

Environment variables recognised

These env-vars will be used as defaults instead of the built-in defaults. You can still override them with the normal command-line options.

Variable	Option	Default
`$SHOVILL_CPUS`	`--cpus`	8
`$SHOVILL_RAM`	`--ram`	16
`$SHOVILL_ASSEMBLER`	`--assembler`	`spades`
`$TMPDIR`	`--tmpdir`	`/tmp`
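
The precedence described here (command-line option first, then environment variable, then built-in default) can be sketched as below. This is an illustration only, not Shovill's code; the function name and the `environ`/`cast` parameters are hypothetical.

```python
import os


def resolve_default(cli_value, env_name, builtin, environ=None, cast=str):
    """Return the CLI value if given, else the env-var value, else the built-in default."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    if environ.get(env_name):
        return cast(environ[env_name])
    return builtin


# A fake environment keeps the example independent of your real shell settings.
fake_env = {"SHOVILL_CPUS": "4"}
cpus_from_env = resolve_default(None, "SHOVILL_CPUS", 8, fake_env, int)
cpus_from_cli = resolve_default(16, "SHOVILL_CPUS", 8, fake_env, int)
ram_builtin = resolve_default(None, "SHOVILL_RAM", 16, fake_env, float)
print(cpus_from_env, cpus_from_cli, ram_builtin)
```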

FAQ

Does shovill accept single-end reads?

No, but it might one day.

Do you support long reads from PacBio or Nanopore?

No, this is strictly for Illumina paired-end reads only. Try Flye, Canu, or Redbean.

Why does Shovill crash?

Shovill has a lot of dependencies. If any dependency is not installed correctly, it will die. SPAdes also doesn't handle --cpus > 16 very well; try giving it more RAM.

Can I assemble metagenomes with Shovill?

No. Please use dedicated tools like Minia 3.x or Megahit. Shovill uses the estimated genome size for many dynamic settings related to read error correction, read subsampling etc.

Feedback

Please file questions, bugs or ideas to the Issue Tracker

License

GPLv3

Citation

Not published yet.

Author

Torsten Seemann
Web: https://tseemann.github.io/
Twitter: @torstenseemann
Blog: The Genome Factory

Contributors

Jason Kwong
Simon Gladman
Anders Goncalves da Silva

shovill's Issues

Estimate read length distribution for better kmers

Mincov value can only be an integer

Shovill fails if you give --mincov a float. Yet the help describes the parameter as accepting n.nn

KMC parallel contention for /tmp

Suggestion in #50 by @andersgs

Delete temp files AS YOU GO to save space

Currently deletes at the end.

bwa option -x intractg is likely incorrect

The bwa mem option -x intractg is used to map intra-species contigs to the reference. Since the pilon step maps the initial reads, -x intractg should not be used there, IMHO.

Error with KMC?

I am interested in testing your tool, however I get an error from KMC -> 20-kmc.log:
Error: Cannot open temporary file /var/folders/c9/8b9fmkr15wd0pbzkjgjy7yv00000gp/T/kmc_00253.bin

The kmc_00253.bin does not exist (only 0-00252.bin)

I am running this on OS-X installed via Homebrew
Best, Erik

Add --check function and version checking

Add --check

Add version checking for each dependency.

Bad option causes -h with errorcode 0

Need to make || usage(1)

shovill -x ; echo $?
Unknown option: x
Synopsis:
  Faster de novo assembly pipeline based around Spades
Usage:
  shovill [options] --outdir DIR --R1 R1.fq.gz --R2 R2.fq.gz
Author:
  Torsten Seemann <[email protected]>
Options:
  --help          This help
  --version       Print version and exit
  --check         Check dependencies are installed
  --debug         Debug info (default: OFF)
  --cpus N        Number of CPUs to use (default: 16)
  --outdir XXX    Output folder (default: '')
  --namefmt XXX   Format of contig FASTA IDs in 'printf' style (default: 'contig%05d')
  --force         Force overwite of existing output folder (default: OFF)
  --R1 XXX        Read 1 FASTQ (default: '')
  --R2 XXX        Read 2 FASTQ (default: '')
  --depth N       Sub-sample --R1/--R2 to this depth. Disable with --depth 0 (default: 100)
  --gsize XXX     Estimated genome size <blank=AUTODETECT> (default: '')
  --kmers XXX     K-mers to use <blank=AUTO> (default: '')
  --opts XXX      Extra SPAdes options eg. --plasmid --sc ... (default: '')
  --nocorr        Disable post-assembly correction (default: OFF)
  --trim          Use Trimmomatic to remove common adaptors first (default: OFF)
  --trimopt XXX   Trimmomatic options (default: 'ILLUMINACLIP:/home/tseemann/git/shovill/bin/../db/trimmomatic.fa:1:30:11 LEADING:3 TRAILING:3 MINLEN:30 TOPHRED33')
  --minlen N      Minimum contig length <0=AUTO> (default: 1)
  --mincov n.nn   Minimum contig coverage <0=AUTO> (default: 2)
  --asm XXX       Spades result to correct: before_rr contigs scaffolds (default: 'contigs')
  --tmpdir XXX    Fast temporary directory (default: '/tmp/tseemann')
  --ram n.nn      Try to keep RAM usage below this many GB (default: 8)
  --keepfiles     Keep intermediate files (default: OFF)
Documentation:
  https://github.com/tseemann/shovill
0

Add contig name prefix option

Default is >contig00001
maybe allow a sprintf string?
so can do >strain_%02d_something

Feature Request: Resume from failed SPAdes assembly

Is it possible to add a "--resume" style option to begin the workflow from SPAdes onward? Alternatively a similar option where you can specify which point to begin from.

Use Kmerstream to estimate genome size and kmers

Check the user --kmers aren't > read length

if they are, spades give weird coverages.

Support single-end (SE) reads

Should Shovill support SE reads?

[Typo] "Removing short conting" + [bug?] error when running with max RAM

Something to correct in a future version, Shovill says when removing short contigs:
"Removing short conting"

More seriously, Shovill (the Java virtual machine) crashed at the Pilon stage when I defined "--ram 20.00", and exited with error 256. Removing the RAM command from the instructions and it ran without a hitch. Not sure whether that is a bug?

Ubuntu 16.04.3 LTS, installed via Linuxbrew, in a Virtualbox computer.

Option to trim adaptors

Consider flexbar ?

`Error 34304 running command` when the kmc part of the pipeline is running

I get the following output at the kmc part of the pipeline.

Estimating genome size with 'kmc'
Running: kmc -ci3 -k25 -t1 \/tmp\/tmp\.q0ZexVtdJO\.fq\.gz kmc /tmp >> 20-kmc.log 2>&1
Error 34304 running command

The contents of 20-kmc.log are:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted

SPAdes memory limit error

SPAdes crashed with a malloc error due to not enough memory
It works when the memory is increased through the --memory parameter in SPAdes
Maybe a possible solution would be to add a --memory parameter to the Shovill pipeline

System information:
Linux version 3.10.0-327.13.1.el7.x86_64 (Red Hat 4.8.5-4)

Consider running "dust" across contigs

Help filter low complexity stuff?
Check these small contigs?

blast/2.6.0_2/bin/dustmasker

memory issues

I'm assembling a 100Mbp genome and having memory issues with both SPAdes and Pilon. The pipeline tried to use 32Gb for SPAdes which was not enough, I had to run it manually with more. Pilon also ran out of memory, and I had to run it by hand using the jar file rather than the binary installed by homebrew, as java -Xmx48G -jar /usr/local/Cellar/pilon/1.22/pilon-1.22.jar ...

It would be useful to have a --restart-from command to pick up after a crash

Combine all logs into one at successful finish

Aka shovill.log

"You ran: ...." is missing params

@argv is nuked by setOptions()
need to store @cmdline = ($0, @argv)

--help should not set $? == 1

Except if we print it because no valid command line.

Spades 3.11 uses assembly_graph_with_scaffolds.gfa

Seems output has changed in 3.11

Embed adaptors in DATA ?

This will solve need for FindBin and make Conda easier. For @Slugger70

Kmerstream sometimes fails when estimating the kmers to use

Here is the error message I receive:

Running: seqtk sample R1.fq.gz 10000 | paste - - - - | cut -f2 > readsample.txt
Read length looks like 150 bp
Estimated K-mers: 21 37 53 69 85 101 117 [kn=8, ks=16, kmin=21, kmax=127]
Using kmers: 21,37,53,69,85,101,117
Estimating genome size
Running: KmerStream -k 21,37,53,69,85,101,117 -o kmerstream-raw.tsv -t 8 --tsv --verbose --online R1.fq.gz R2.fq.gz
Running: KmerStreamEstimate.py kmerstream-raw.tsv > kmerstream-est.tsv
Traceback (most recent call last):
  File "/usr/local/bin/KmerStreamEstimate.py", line 43, in <module>
    x,e = EMfit2(F0,f1,F1,int(k))
  File "/usr/local/bin/KmerStreamEstimate.py", line 26, in EMfit2
    e = brentq(func, 0, 1)
  File "/usr/lib/python2.7/dist-packages/scipy/optimize/zeros.py", line 415, in brentq
    r = _zeros._brentq(f,a,b,xtol,rtol,maxiter,args,full_output,disp)
ValueError: f(a) and f(b) must have different signs
Error 256 running command

I am using the version from commit 84e56564a

Rename FASTA descriptions

We already rename the contigs to something sensible, but the desc is still Spades with the length in kmers (not bp) and coverage as it is.

>contig00001 NODE_1_length_54882_cov_28.4218_pilon

Maybe this? The corr=N could be how many pilon corrections? If it wasn't run it will be zero or n/a?

>contig00001 len=54981 cov=28.4 corr=48

Subsample large yield read sets to < 100x

Once we have a genome size estimate, we can avoid using all the reads if the coverage is too high.

All those noise k-mers aren't helping anyone.

No record of original reads in log

Add pilon to post-correct contigs

lighter stalls when installed from homebrew

FYI mourisl/Lighter#25

Make pilon 1.20 backward compatible

Use --fix bases instead of --fix snps,indels ?

brew install fails

On MacOSX, after
brew tap homebrew/science and brew tap tseemann/bioinformatics-linux complete successfully, brew install shovill fails with

Error: No available formula with the name "shovill"
==> Searching for a previously deleted formula...
Warning: homebrew/core is shallow clone. To get complete history run:
  git -C "$(brew --repo homebrew/core)" fetch --unshallow

Error: No previously deleted formula found.
==> Searching for similarly named formulae...
Error: No similarly named formulae found.
==> Searching taps...
Error: No formulae found in taps.

Use Time::Hires (core) instead of Time::Piece

http://search.cpan.org/~esaym/Time-Piece-1.3201/Piece.pm

Will need more code, but avoids a module install. or just native time (seconds) call!

Can multiple different libraries be combined?

For example if I have two different insert size libraries can I combine both these sets of reads together?

Does kmc -cs255 affect things.

  -cs<value> - maximal value of a counter (default: 255)

need to check low coverage and high coverages data
@AnnaSyme

Option to keep graph files (FASTG, GFA)

Updated spades assembly parameters per spades team suggestion.

According to spades team, the merged and unmerged reads should be treated as two separate libraries.

https://twitter.com/spadesassembler/status/907714056387252225

Have you tried shovill on yeast size or aspergillus size genomes?

What is the largest haploid genome size shovill can deal with?

FLASH overlap too low?

Ella suggests that 10 bp might be a bit risky.

--nocorr option fails at pilon step.

When running with --nocorr option, shovill 0.7.1 fails and returns the following error:

User supplied --nocorr, so not correcting contigs.
read_file 'pilon.changes' - sysopen: No such file or directory at /home/ubuntu/miniconda3/bin/shovill line 296.

Commandline used:
shovill --outdir out_nocorr --R1 mutant_R1.fastq --R2 mutant_R2.fastq --nocorr --ram 6

Look at Spades warning.log and do something

=== Error correction and assembling warnings:
 * 0:00:07.339    96M / 8G    WARN    General                 (kmer_coverage_model.cpp   : 367)   Failed to determine erroneous kmer threshold. Threshold set to: 21
 * 0:00:09.585    96M / 8G    WARN    General                 (simplification.cpp        : 569)   The determined erroneous connection coverage threshold may be determined improperly
 * 0:00:06.018    88M / 8G    WARN    General                 (kmer_coverage_model.cpp   : 219)   Too many erroneous kmers, the estimates might be unreliable
 * 0:00:06.025    88M / 8G    WARN    General                 (kmer_coverage_model.cpp   : 328)   Valley value was estimated improperly, reset to 1
 * 0:00:06.025    88M / 8G    WARN    General                 (kmer_coverage_model.cpp   : 367)   Failed to determine erroneous kmer threshold. Threshold set to: 1
 * 0:00:10.306    88M / 8G    WARN    General                 (pair_info_count.cpp       : 319)   Unable to estimate insert size for paired library #0
 * 0:00:10.306    88M / 8G    WARN    General                 (pair_info_count.cpp       : 325)   None of paired reads aligned properly. Please, check orientation of your read pairs.
 * 0:00:10.306    88M / 8G    WARN    General                 (repeat_resolving.cpp      :  62)   Insert size was not estimated for any of the paired libraries, repeat resolution module will not run.
======= Warnings saved to

Rescue orphans from trimmomatic and use as --s-1 ?

Given that the overlapping PE belong in --s-2 then perhaps it might be worth considering not ignoring the orphan R1 and R2 reads from the trimmomatic output (currently /dev/null) and using them correctly as --s-1.

Trimming is off by default, and so it might not be worth it.

v0.5.1 deletes all "low coverage" contigs

e.g.

Removing low coverage contig (< 2 x): NODE_359_length_31400_cov_0.00179034_pilon
Removing low coverage contig (< 2 x): NODE_288_length_29837_cov_0.00817741_pilon

$ fa shovill.fa 
(stdin)                   no=4 bp=747 ok=747 Ns=0 gaps=0 min=142 avg=186 max=273 N50=183
$ fa spades-fast.fa 
(stdin)                   no=432 bp=2056909 ok=2056909 Ns=0 gaps=0 min=123 avg=4761 max=31400 N50=9082

$ cat yield.clean.tab 
Files	ERR036060/R1.fq.gz ERR036060/R2.fq.gz
Reads	5346082
Yield	405984280
GeeCee	49.0
MinLen	30
AvgLen	75
MaxLen	76
ModeLen	76
Phred	33
AvgQual	31.4
Depth	181x
