Shovill

Assemble bacterial isolate genomes from Illumina paired-end reads

Introduction

The SPAdes genome assembler has become the de facto standard de novo genome assembler for Illumina whole genome sequencing data of bacteria and other small microbes. SPAdes was a major improvement over previous assemblers like Velvet, but some of its components can be slow and it traditionally did not handle overlapping paired-end reads well.

Shovill is a pipeline which uses SPAdes at its core, but alters the steps before and after the primary assembly step to get similar results in less time. Shovill also supports other assemblers like SKESA, Velvet and Megahit, so you can take advantage of the pre- and post-processing that Shovill provides with those too.

⚠️ Shovill is for isolate data only, primarily small haploid organisms. It will NOT work on metagenomes or larger genomes. Please use Megahit directly instead.

Main steps

  1. Estimate genome size and read length from reads (unless --gsize provided)
  2. Reduce FASTQ files to a sensible depth (default --depth 150)
  3. Trim adapters from reads (with --trim only)
  4. Conservatively correct sequencing errors in reads
  5. Pre-overlap ("stitch") paired-end reads
  6. Assemble with SPAdes/SKESA/Velvet/Megahit with modified kmer range and PE + long SE reads
  7. Correct minor assembly errors by mapping reads back to contigs
  8. Remove contigs that are too short, too low coverage, or pure homopolymers
  9. Produce final FASTA with nicer names and parseable annotations
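
The filtering in step 8 can be illustrated with a minimal Python sketch. This is not Shovill's own code: the function names, the example contigs and the exact homopolymer test are assumptions for illustration, and the thresholds simply mirror the --minlen and --mincov options.

```python
def is_homopolymer(seq: str) -> bool:
    """True if the sequence is made of one repeated base (e.g. AAAAAA)."""
    return len(set(seq.upper())) == 1


def keep_contig(seq: str, cov: float, minlen: int = 1, mincov: float = 2.0) -> bool:
    """Apply the three filters from step 8: length, coverage, homopolymer."""
    if len(seq) < minlen:
        return False
    if cov < mincov:
        return False
    if is_homopolymer(seq):
        return False
    return True


# Hypothetical contigs: (name, sequence, coverage)
contigs = [
    ("NODE_1", "ACGTACGTAC", 28.4),   # passes all filters
    ("NODE_2", "AAAAAAAAAA", 30.0),   # pure homopolymer
    ("NODE_3", "ACGTACGTAC", 0.002),  # coverage too low
    ("NODE_4", "ACG", 15.0),          # too short
]
kept = [name for name, seq, cov in contigs if keep_contig(seq, cov, minlen=5)]
print(kept)  # ['NODE_1']
```
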

Quick Start

% shovill --outdir out --R1 test/R1.fq.gz --R2 test/R2.fq.gz

<snip>
Final assembly in: out/contigs.fa
It contains 17 (min=150) contigs totalling 169611 bp.
Done.

% ls out

contigs.fa   contigs.gfa   shovill.corrections  
shovill.log  spades.fasta

% head -n 4 out/contigs.fa

>contig00001 len=52653 cov=32.7 corr=1 origname=NODE_3 date=20180327 sw=shovill/1.0.1
ATAACGCCCTGCTGGCCCAGGTCATTTTATCCAATCTGGACCTCTCGGCTCGCTTTGAAGAAT
GAGCGAATTCGCCGTTCAGTCCGCTGGACTTCGGACTTAAAGCCGCCTAAAACTGCACGAACC
ATTGTTCTGAGGGCCTCACTGGATTTTAACATCCTGCTAACGTCAGTTTCCAACGTCCTGTCG

Installation

Homebrew

brew install brewsci/bio/shovill
shovill --check

Using Homebrew will install all the dependencies for you on Linux or macOS.

Conda

conda install -c conda-forge -c bioconda -c defaults shovill
shovill --check

Using Bioconda will install all the dependencies for you on macOS and Linux.

Containers

The Docker recipe is generously maintained by Curtis Kapsak and the StaPH-B workgroup.

# Docker
docker pull staphb/shovill:latest
docker run staphb/shovill:latest shovill --help

# Singularity
singularity build shovill.sif docker://staphb/shovill:latest
singularity exec shovill.sif shovill --help

Source

git clone https://github.com/tseemann/shovill.git
./shovill/bin/shovill --help
./shovill/bin/shovill --check

You will need to install all the dependencies manually.

Pilon and Trimmomatic are distributed as Java .jar files, so you will need to make pilon and trimmomatic executables that are on your PATH. A simple wrapper script for each, which just passes its shell arguments on to the jar, is enough.
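
As a rough sketch of such a wrapper, the Python below builds the java command line that forwards every argument to the jar. The PILON_JAR environment variable and the fallback path are hypothetical names chosen for this example; point them at wherever the jar actually lives on your system.

```python
import os
import sys

# Assumption: PILON_JAR and the fallback path are hypothetical, not part of Shovill.
JAR = os.environ.get("PILON_JAR", "/path/to/pilon.jar")


def build_command(args):
    """Return the java command line that runs the jar with the given arguments."""
    return ["java", "-jar", JAR, *args]


def main(argv):
    """Replace this process with java so its exit status is passed straight through."""
    cmd = build_command(argv)
    os.execvp(cmd[0], cmd)


# Show what would be executed for 'pilon --help' without actually running java.
example = build_command(["--help"])
print(example)
```

To use this as a real wrapper, save it as an executable file named pilon on your PATH and end the script with main(sys.argv[1:]). The same pattern would apply to trimmomatic.
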

Output files

Filename              Description
contigs.fa            The final assembly you should use
shovill.log           Full log file for bug reporting
shovill.corrections   List of post-assembly corrections
contigs.gfa           Assembly graph (spades)
contigs.fastg         Assembly graph (megahit)
contigs.LastGraph     Assembly graph (velvet)
skesa.fasta           Raw assembly (skesa)
spades.fasta          Raw assembled contigs (spades)
megahit.fasta         Raw assembly (megahit)
velvet.fasta          Raw assembly (velvet)

contigs.fa

This is the most important output file: the final, corrected assembly. It contains entries like this:

>contig00001 len=263154 cov=8.9 corr=1 origname=NODE_1 date=20180327 sw=shovill/0.9
>contig00041 len=339 cov=8.8 corr=0 origname=NODE_41 date=20180327 sw=shovill/0.9

The sequence IDs are named as per the --namefmt option, and the comment field is a series of space-separated name=value pairs with the following meanings:

Pair       Meaning
len        Length of contig in basepairs
cov        Average k-mer coverage as reported by assembler
corr       Number of post-assembly corrections (unless --nocorr used)
origname   The original name of the contig (before applying --namefmt)
date       YYYYMMDD date when this contig was assembled
sw         shovill-engine/version where engine is the --assembler chosen
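
Because the comment field is just space-separated name=value pairs, it is easy to parse downstream. The following is a minimal Python sketch; parse_header is a hypothetical helper written for this example, not something Shovill ships. The last line shows how a printf-style --namefmt such as 'contig%05d' expands, using Python's own printf-style formatting.

```python
def parse_header(line: str) -> tuple[str, dict[str, str]]:
    """Split a FASTA header into the sequence ID and its name=value pairs."""
    seq_id, _, comment = line.lstrip(">").strip().partition(" ")
    pairs = dict(
        field.split("=", 1) for field in comment.split() if "=" in field
    )
    return seq_id, pairs


header = ">contig00001 len=263154 cov=8.9 corr=1 origname=NODE_1 date=20180327 sw=shovill/0.9"
seq_id, info = parse_header(header)

# Values arrive as strings; convert the numeric ones as needed.
print(seq_id, int(info["len"]), float(info["cov"]), info["origname"])
# contig00001 263154 8.9 NODE_1

# printf-style ID format, as used by --namefmt
print("contig%05d" % 1)  # contig00001
```
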

Advanced options

SYNOPSIS
  De novo assembly pipeline for Illumina paired reads
USAGE
  shovill [options] --outdir DIR --R1 R1.fq.gz --R2 R2.fq.gz
GENERAL
  --help          This help
  --version       Print version and exit
  --check         Check dependencies are installed
INPUT
  --R1 XXX        Read 1 FASTQ (default: '')
  --R2 XXX        Read 2 FASTQ (default: '')
  --depth N       Sub-sample --R1/--R2 to this depth. Disable with --depth 0 (default: 150)
  --gsize XXX     Estimated genome size eg. 3.2M <blank=AUTODETECT> (default: '')
OUTPUT
  --outdir XXX    Output folder (default: '')
  --force         Force overwite of existing output folder (default: OFF)
  --minlen N      Minimum contig length <0=AUTO> (default: 0)
  --mincov n.nn   Minimum contig coverage <0=AUTO> (default: 2)
  --namefmt XXX   Format of contig FASTA IDs in 'printf' style (default: 'contig%05d')
  --keepfiles     Keep intermediate files (default: OFF)
RESOURCES
  --tmpdir XXX    Fast temporary directory (default: '/tmp/tseemann')
  --cpus N        Number of CPUs to use (0=ALL) (default: 8)
  --ram n.nn      Try to keep RAM usage below this many GB (default: 16)
ASSEMBLER
  --assembler XXX Assembler: skesa velvet megahit spades (default: 'spades')
  --opts XXX      Extra assembler options in quotes eg. spades: "--untrusted-contigs locus.fna" ... (default: '')
  --kmers XXX     K-mers to use <blank=AUTO> (default: '')
MODULES
  --trim          Enable adaptor trimming (default: OFF)
  --noreadcorr    Disable read error correction (default: OFF)
  --nostitch      Disable read stitching (default: OFF)
  --nocorr        Disable post-assembly correction (default: OFF)

--depth

Giving an assembler too much data is a bad thing. There comes a point where you are no longer adding new information (as the genome is a fixed size), and only adding more noise (sequencing errors). Most assemblers seem to be happy with ~150x depth, so Shovill will downsample your FASTQ files to this depth. It estimates depth by dividing read yield by genome size.
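
The arithmetic is simple enough to sketch in a few lines of Python. The read yield and genome size below are made-up numbers, and expressing the downsampling as a fraction of reads to keep is an assumption about how one might implement it, not a description of Shovill's internals.

```python
def estimate_depth(total_bases: int, genome_size: int) -> float:
    """Depth = read yield (total sequenced bases) divided by genome size."""
    return total_bases / genome_size


def subsample_fraction(depth: float, target: float) -> float:
    """Fraction of reads to keep to reach the target depth (1.0 = keep everything)."""
    if target <= 0 or depth <= target:
        return 1.0  # --depth 0 disables subsampling; shallow data is left alone
    return target / depth


yield_bp = 600_000_000  # hypothetical read yield in bases
gsize = 3_000_000       # hypothetical genome size in bases
depth = estimate_depth(yield_bp, gsize)
frac = subsample_fraction(depth, 150)
print(f"{depth:.0f}x, keep {frac:.2f} of reads")  # 200x, keep 0.75 of reads
```
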

--gsize

The genome size is needed to estimate depth and for the read error correction stage. If you don't provide --gsize, it will be estimated via k-mer frequencies using mash. It doesn't need to be a perfect estimate, just in the right ballpark.
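
For scripts that need the same value as a plain number, a shorthand like 3.2M can be converted as in the sketch below. The help text only shows the "3.2M" form; the set of suffixes accepted here (K, M, G, case-insensitive) is an assumption made for this example.

```python
# Assumed suffix multipliers; only the "M" form appears in Shovill's help text.
SUFFIXES = {"K": 1_000, "M": 1_000_000, "G": 1_000_000_000}


def parse_gsize(text: str) -> int:
    """Convert a genome size such as '3.2M' or '4800000' to a number of bases."""
    text = text.strip().upper()
    if text and text[-1] in SUFFIXES:
        return round(float(text[:-1]) * SUFFIXES[text[-1]])
    return round(float(text))


print(parse_gsize("3.2M"))     # 3200000
print(parse_gsize("4800000"))  # 4800000
```
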

--keepfiles

This will keep all the intermediate files in --outdir so you can explore and debug.

--cpus

The built-in default is 8 CPUs (see the help output above). Use --cpus 0 to use all available CPU cores.

--ram

Shovill will do its best to keep memory usage below this value, but it is not guaranteed. If you are on a HPC cluster, you should make sure you tell your job submission engine a value higher than this.

--assembler

By default it will use SPAdes, but you can also choose Megahit or SKESA. These are much faster than SPAdes, but give lesser assemblies. If you use SKESA you can probably use --noreadcorr and --nocorr because it has some of that functionality built in and is conservative.

--opts

If you want to provide some assembler-specific parameters you can use the --opts parameter. Make sure you quote the parameters so they get passed as a single string. For example, with --assembler spades you might use --opts "--sc --untrusted-contigs similar_genome.fasta" or --opts '--sc'.
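
To see why the quoting matters, the Python sketch below uses shlex.split, which tokenises a command line with POSIX shell quoting rules. The command strings are only examples; nothing is executed.

```python
import shlex

quoted = (
    'shovill --assembler spades '
    '--opts "--sc --untrusted-contigs similar_genome.fasta" '
    '--outdir out --R1 R1.fq.gz --R2 R2.fq.gz'
)
argv = shlex.split(quoted)
# With quotes, everything after --opts arrives as ONE argument.
opts_value = argv[argv.index("--opts") + 1]
print(opts_value)  # --sc --untrusted-contigs similar_genome.fasta

unquoted = shlex.split("shovill --opts --sc --untrusted-contigs similar_genome.fasta")
# Without quotes, --opts only receives the first word; the rest become separate arguments.
bad_value = unquoted[unquoted.index("--opts") + 1]
print(bad_value)  # --sc
```
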

--kmers

A series of k-mers is chosen based on the read length distribution. You can override this choice with --kmers.

Choosing which stages to use

Stage                     Enable      Disable
Genome size estimation    default     --gsize XX
Read subsampling          --depth N   --depth 0
Read trimming             --trim      default
Read error correction     default     --noreadcorr
Read stitching/overlap    default     --nostitch
Contig correction         default     --nocorr

Environment variables recognised

These environment variables will be used as defaults instead of the built-in defaults. You can still override them with the corresponding command-line option.

Variable             Option        Default
$SHOVILL_CPUS        --cpus        8
$SHOVILL_RAM         --ram         16
$SHOVILL_ASSEMBLER   --assembler   spades
$TMPDIR              --tmpdir      /tmp
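
The precedence described here (command line first, then environment variable, then built-in default) can be sketched as follows. The resolve function and the two lookup tables are hypothetical and written only to illustrate the order; the values are taken from the table above.

```python
import os

# Values copied from the table above.
BUILTIN = {"cpus": "8", "ram": "16", "assembler": "spades", "tmpdir": "/tmp"}
ENV_VARS = {
    "cpus": "SHOVILL_CPUS",
    "ram": "SHOVILL_RAM",
    "assembler": "SHOVILL_ASSEMBLER",
    "tmpdir": "TMPDIR",
}


def resolve(option, cli_value=None, environ=os.environ):
    """Pick the command-line value, else the env-var, else the built-in default."""
    if cli_value is not None:
        return cli_value
    return environ.get(ENV_VARS[option]) or BUILTIN[option]


env = {"SHOVILL_CPUS": "4"}                # pretend environment for the demo
print(resolve("cpus", environ=env))        # 4   (env-var beats the built-in 8)
print(resolve("cpus", "16", environ=env))  # 16  (command line beats the env-var)
print(resolve("ram", environ=env))         # 16  (nothing set, built-in default)
```
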

FAQ

  • Does shovill accept single-end reads?

    No, but it might one day.

  • Do you support long reads from Pacbio or Nanopore?

    No, this is strictly for Illumina paired-end reads only. Try Flye, Canu, or Redbean.

  • Why does Shovill crash?

    Shovill has a lot of dependencies. If any of them is not installed correctly, it will die. SPAdes also doesn't handle --cpus > 16 very well; try giving it more RAM.

  • Can I assemble metagenomes with Shovill?

    No. Please use dedicated tools like Minia 3.x or Megahit. Shovill uses the estimated genome size for many dynamic settings related to read error correction, read subsampling etc.

Feedback

Please file questions, bugs or ideas to the Issue Tracker

License

GPLv3

Citation

Not published yet.

Author

Torsten Seemann

Contributors

  • Jason Kwong
  • Simon Gladman
  • Anders Goncalves da Silva

shovill's Issues

v0.5.1 deletes all "low coverage" contigs

e.g.

Removing low coverage contig (< 2 x): NODE_359_length_31400_cov_0.00179034_pilon
Removing low coverage contig (< 2 x): NODE_288_length_29837_cov_0.00817741_pilon
$ fa shovill.fa 
(stdin)                   no=4 bp=747 ok=747 Ns=0 gaps=0 min=142 avg=186 max=273 N50=183
$ fa spades-fast.fa 
(stdin)                   no=432 bp=2056909 ok=2056909 Ns=0 gaps=0 min=123 avg=4761 max=31400 N50=9082
$ cat yield.clean.tab 
Files	ERR036060/R1.fq.gz ERR036060/R2.fq.gz
Reads	5346082
Yield	405984280
GeeCee	49.0
MinLen	30
AvgLen	75
MaxLen	76
ModeLen	76
Phred	33
AvgQual	31.4
Depth	181x

Bad option causes -h with errorcode 0

Need to make || usage(1)

shovill -x ; echo $?
Unknown option: x
Synopsis:
  Faster de novo assembly pipeline based around Spades
Usage:
  shovill [options] --outdir DIR --R1 R1.fq.gz --R2 R2.fq.gz
Author:
  Torsten Seemann <[email protected]>
Options:
  --help          This help
  --version       Print version and exit
  --check         Check dependencies are installed
  --debug         Debug info (default: OFF)
  --cpus N        Number of CPUs to use (default: 16)
  --outdir XXX    Output folder (default: '')
  --namefmt XXX   Format of contig FASTA IDs in 'printf' style (default: 'contig%05d')
  --force         Force overwite of existing output folder (default: OFF)
  --R1 XXX        Read 1 FASTQ (default: '')
  --R2 XXX        Read 2 FASTQ (default: '')
  --depth N       Sub-sample --R1/--R2 to this depth. Disable with --depth 0 (default: 100)
  --gsize XXX     Estimated genome size <blank=AUTODETECT> (default: '')
  --kmers XXX     K-mers to use <blank=AUTO> (default: '')
  --opts XXX      Extra SPAdes options eg. --plasmid --sc ... (default: '')
  --nocorr        Disable post-assembly correction (default: OFF)
  --trim          Use Trimmomatic to remove common adaptors first (default: OFF)
  --trimopt XXX   Trimmomatic options (default: 'ILLUMINACLIP:/home/tseemann/git/shovill/bin/../db/trimmomatic.fa:1:30:11 LEADING:3 TRAILING:3 MINLEN:30 TOPHRED33')
  --minlen N      Minimum contig length <0=AUTO> (default: 1)
  --mincov n.nn   Minimum contig coverage <0=AUTO> (default: 2)
  --asm XXX       Spades result to correct: before_rr contigs scaffolds (default: 'contigs')
  --tmpdir XXX    Fast temporary directory (default: '/tmp/tseemann')
  --ram n.nn      Try to keep RAM usage below this many GB (default: 8)
  --keepfiles     Keep intermediate files (default: OFF)
Documentation:
  https://github.com/tseemann/shovill
0

SPAdes memory limit error

SPAdes crashed with a malloc error because there was not enough memory.
It works when the memory limit is raised through the --memory parameter in SPAdes.
A possible solution would be to add a --memory parameter to the Shovill pipeline.

System information:
Linux version 3.10.0-327.13.1.el7.x86_64 (Red Hat 4.8.5-4)

bwa option -x intractg is likely incorrect

The bwa mem option -x intractg is meant for mapping intra-species contigs to a reference. When mapping the initial reads in the Pilon step, -x intractg should not be used, IMHO.

[Typo] "Removing short conting" + [bug?] error when running with max RAM

Something to correct in a future version, Shovill says when removing short contigs:
"Removing short conting"

More seriously, Shovill (the Java virtual machine) crashed at the Pilon stage when I defined "--ram 20.00", and exited with error 256. After I removed the --ram option from the command, it ran without a hitch. Not sure whether that is a bug?

Ubuntu 16.04.3 LTS, installed via Linuxbrew, in a Virtualbox computer.

Error with KMC?

I am interested in testing your tool, however I get an error from KMC -> 20-kmc.log:
Error: Cannot open temporary file /var/folders/c9/8b9fmkr15wd0pbzkjgjy7yv00000gp/T/kmc_00253.bin

The kmc_00253.bin does not exist (only 0-00252.bin)

I am running this on OS-X installed via Homebrew
Best, Erik

--nocorr option fails at pilon step.

When running with --nocorr option, shovill 0.7.1 fails and returns the following error:

User supplied --nocorr, so not correcting contigs.
read_file 'pilon.changes' - sysopen: No such file or directory at /home/ubuntu/miniconda3/bin/shovill line 296.

Commandline used:
shovill --outdir out_nocorr --R1 mutant_R1.fastq --R2 mutant_R2.fastq --nocorr --ram 6

memory issues

I'm assembling a 100Mbp genome and having memory issues with both SPAdes and Pilon. The pipeline tried to use 32Gb for SPAdes which was not enough, I had to run it manually with more. Pilon also ran out of memory, and I had to run it by hand using the jar file rather than the binary installed by homebrew, as java -Xmx48G -jar /usr/local/Cellar/pilon/1.22/pilon-1.22.jar ...

It would be useful to have a --restart-from command to pick up after a crash

`Error 34304 running command` when the kmc part of the pipeline is running

I get the following output at the kmc part of the pipeline.

Estimating genome size with 'kmc'
Running: kmc -ci3 -k25 -t1 \/tmp\/tmp\.q0ZexVtdJO\.fq\.gz kmc /tmp >> 20-kmc.log 2>&1
Error 34304 running command

The contents of 20-kmc.log are:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted

Look at Spades warning.log and do something

=== Error correction and assembling warnings:
 * 0:00:07.339    96M / 8G    WARN    General                 (kmer_coverage_model.cpp   : 367)   Failed to determine erroneous kmer threshold. Threshold set to: 21
 * 0:00:09.585    96M / 8G    WARN    General                 (simplification.cpp        : 569)   The determined erroneous connection coverage threshold may be determined improperly
 * 0:00:06.018    88M / 8G    WARN    General                 (kmer_coverage_model.cpp   : 219)   Too many erroneous kmers, the estimates might be unreliable
 * 0:00:06.025    88M / 8G    WARN    General                 (kmer_coverage_model.cpp   : 328)   Valley value was estimated improperly, reset to 1
 * 0:00:06.025    88M / 8G    WARN    General                 (kmer_coverage_model.cpp   : 367)   Failed to determine erroneous kmer threshold. Threshold set to: 1
 * 0:00:10.306    88M / 8G    WARN    General                 (pair_info_count.cpp       : 319)   Unable to estimate insert size for paired library #0
 * 0:00:10.306    88M / 8G    WARN    General                 (pair_info_count.cpp       : 325)   None of paired reads aligned properly. Please, check orientation of your read pairs.
 * 0:00:10.306    88M / 8G    WARN    General                 (repeat_resolving.cpp      :  62)   Insert size was not estimated for any of the paired libraries, repeat resolution module will not run.
======= Warnings saved to 

Kmerstream sometimes fails when estimating the kmers to use

Here is the error message I receive:

Running: seqtk sample R1.fq.gz 10000 | paste - - - - | cut -f2 > readsample.txt
Read length looks like 150 bp
Estimated K-mers: 21 37 53 69 85 101 117 [kn=8, ks=16, kmin=21, kmax=127]
Using kmers: 21,37,53,69,85,101,117
Estimating genome size
Running: KmerStream -k 21,37,53,69,85,101,117 -o kmerstream-raw.tsv -t 8 --tsv --verbose --online R1.fq.gz R2.fq.gz
Running: KmerStreamEstimate.py kmerstream-raw.tsv > kmerstream-est.tsv
Traceback (most recent call last):
  File "/usr/local/bin/KmerStreamEstimate.py", line 43, in <module>
    x,e = EMfit2(F0,f1,F1,int(k))
  File "/usr/local/bin/KmerStreamEstimate.py", line 26, in EMfit2
    e = brentq(func, 0, 1)
  File "/usr/lib/python2.7/dist-packages/scipy/optimize/zeros.py", line 415, in brentq
    r = _zeros._brentq(f,a,b,xtol,rtol,maxiter,args,full_output,disp)
ValueError: f(a) and f(b) must have different signs
Error 256 running command

I am using the version from commit 84e56564a

Rename FASTA descriptions

We already rename the contigs to something sensible, but the desc is still Spades with the length in kmers (not bp) and coverage as it is.

>contig00001 NODE_1_length_54882_cov_28.4218_pilon

Maybe this? The corr=N could be how many Pilon corrections were made. If it wasn't run, it would be zero or n/a?

>contig00001 len=54981 cov=28.4 corr=48

brew install fails

On MacOSX, after
brew tap homebrew/science and brew tap tseemann/bioinformatics-linux complete successfully, brew install shovill fails with

Error: No available formula with the name "shovill"
==> Searching for a previously deleted formula...
Warning: homebrew/core is shallow clone. To get complete history run:
  git -C "$(brew --repo homebrew/core)" fetch --unshallow

Error: No previously deleted formula found.
==> Searching for similarly named formulae...
Error: No similarly named formulae found.
==> Searching taps...
Error: No formulae found in taps.

Rescue orphans from trimmomatic and use as --s-1 ?

Given that the overlapping PE belong in --s-2 then perhaps it might be worth considering not ignoring the orphan R1 and R2 reads from the trimmomatic output (currently /dev/null) and using them correctly as --s-1.

Trimming is off by default, and so it might not be worth it.
