sanger-pathogens / assembly_improvement Goto Github PK

Improve the quality of a denovo assembly by scaffolding and gap filling

License: Other

Perl 99.24% Java 0.67% Python 0.09%

genomics sequencing next-generation-sequencing research bioinformatics bioinformatics-pipeline global-health infectious-diseases pathogen

assembly_improvement's Introduction

Assembly Improvement

Take in an assembly in FASTA format, reads in FASTQ format, and make the assembly better by scaffolding and gap filling.

Introduction
Installation
Usage
License
Feedback/Issues
Citation

Introduction

This software takes an assembly in FASTA format along with reads in FATSQ format and improves the assembly by scaffolding and gap filling. The software contains several separate scripts, including improve_assembly, descaffold_assembly, diginorm_with_khmer, fastq_tools, order_contigs_with_abacas, read_correction_with_sga, remove_primers_with_quasr, rename_contigs and scaffold_with_sspace. For more information on what each scrips does, please see the usage below.

Installation

Required external dependancies

SSPACE and GapFiller - This is used by the improve_assembly, fill_gaps_with_gapfiller, and scaffold_with_sspace scripts. Download SSPACE-standard and GapFiller from Baseclear. The software is free for academic use.

Optional external dependancies

khmer - This is used for the diginorm_with_khmer script and can be downloaded and installed from the khmer github repository. It requires python 2.7.
MUMmer - This is used by ABACAS for reference based scaffolding as part of the improve_assembly and order_contigs_with_abacas scripts. It is widely available from package management systems such as HomeBrew/LinuxBrew and apt (Debian/Ubuntu) and the homepage is on sourceforge.
SGA - This is used by the read_correction_with_sga script and can be downloaded from the SGA github repository. It is widely available from package management systems such as HomeBrew/LinuxBrew and apt (Debian/Ubuntu).
QUASR - This is used by the remove_primers_with_quasr script. It requires JAVA and can be downloaded from the QUASR github repository.

The software and its Perl dependancies can be downloaded from CPAN. It requires Perl 5.14 or greater.

cpanm -f Bio::AssemblyImprovement

If you encounter an issue when installing assembly_improvement please contact your local system administrator. If you encounter a bug please log it here or email us at [email protected]

Running the tests

The test can be run with dzil from the top level directory:

dzil test

Usage

improve_assembly

The improve_assembly script is the main script for the repository. The usage is:

Usage: improve_assembly [options] -a <FASTA> -f <FASTQ> -r <FASTQ> -s <EXEC> -g <EXEC>
Given an assembly in FASTA format and paired ended reads in FASTQ format, output a scaffolded and gapfilled assembly.

Required: 	
    -a STR  assembly FASTA(.gz) file
    -f STR  forward reads FASTQ(.gz) file
    -r STR  reverse reads FASTQ(.gz) file
    -s STR  path to SSPACE executable
    -g STR  path to GapFiller executable
Options:    
    -c STR  reference FASTA(.gz) file
    -i INT  insert size [250]
    -b STR  path to abacas.pl executable
    -o STR  output directory [.]
    -l INT  Minimum final contig length [300]
    -p INT  Only filter if this percentage of bases left [95]
    -d      debug output []
    -h      print this message and exit

To scaffold and gap fill an assembly you run:

improve_assembly -a contigs.fa -f 123_1.fastq -r 123_2.fastq

This will output files for each stage in the script, with the various processes performed in the filename. The final filename in this case is scaffolds.scaffolded.gapfilled.length_filtered.sorted.fa.

A reference genome can be used to additionally orientate and order the contigs using ABACAS. This is useful for when you have a close by high quality reference sequence.

improve_assembly -a contigs.fa -f 123_1.fastq -r 123_2.fastq -c my_reference.fa

The input files can be optionally gzipped.

improve_assembly -a contigs.fa.gz -f 123_1.fastq.gz -r 123_2.fastq.gz

The insert size (fragment size) should ideally be set to the actual insert size. Be aware that the insert size you ask for can often differ from the actual insert size achieved in the lab. It is assumed that most of the reads will fall within 30% of the insert size given. To change it use -i:

improve_assembly -a contigs.fa -f 123_1.fastq -r 123_2.fastq -i 500

Small contigs are filtered out at the end (default 300 bases).

improve_assembly -a contigs.fa -f 123_1.fastq -r 123_2.fastq -l 500

The filtering step is only run if the size of the assembly is at least 95% of the input size. This can be changed using -p:

improve_assembly -a contigs.fa -f 123_1.fastq -r 123_2.fastq -p 95

The output directory for the results can be set using -o.

improve_assembly -a contigs.fa -f 123_1.fastq -r 123_2.fastq -o my_directory

descaffold_assembly

Given a FASTA file, break up each sequence where there is a gap.

Usage: descaffold_assembly -a <FASTA>

diginorm_with_khmer

A wrapper script around the khmer normalize-by-median.py script. This is useful where you have uneven coverage. For example in certain virus sequencing experiments, there can be 10,000X coverage however different parts of the genome may have been amplified so the actual coverage can have massive peaks and troughs.

Usage: diginorm_with_khmer [options] -i <FASTQ>

Required:
    -i  STR   input FASTQ(.gz) file of shuffled reads
Options:
    -c  INT   target median coverage [2]
    -k  INT   length of kmer to be used, higher numbers require more RAM [31]
    -n  INT   number of hashes [4]
    -m  FLOAT minimum hash size. [2.5e8]
    -o  STR   output directory [.]
    -z  STR   output filename [digitally_normalised.fastq.gz]
    -e  STR   path to normalize-by-median.py
    -py STR   python executable [python-2.7]
    -d        debug output []
    -h        print this message and exit

fastq_tools

This script has some useful utilities for analysing a shuffled paired ended FASTQ file. We know they are available in other applications but to minimise dependancies we reimplemented them and have exposed them here for completeness sake.

Usage: fastq_tools [options] -i <FASTQ>
	
Required:	
    -i STR  input FASTQ(.gz) file of shuffled reads
    -t STR  task to be run (kmer/coverage/split/histogram)
Options:
    -g INT  genome size of the coverage task []
    -h      print this message and exit

Calculate 66% and 90% of median read length as minimum and maximum kmer sizes for use when runnning VelvetOptimiser:

fastq_tools -i input.fastq -t kmer

To calculate the coverage of a genome:

fastq_tools -i input.fastq -t coverage -g 5000000

To split a shuffled paired ended FASTQ file into 2 FASTQ files, one containing the forward reads and another containing the reverse reads. At the end of each read name /1 (forward) or /2 (reverse) is appended:

fastq_tools -i input.fastq -t split

To produce a histogram (histogram.png) of the read lengths found in the FASTQ file. This is useful for after you have trimmed the reads to get an indication of whats left:

fastq_tools -i input.fastq -t histogram

fill_gaps_with_gapfiller

Given an assembly in FASTA format and paired ended reads in FASTQ format, iteratively fill in gaps with GapFiller. Initially a high level of read coverage is required to close a gap, which decreases with subsequent iterations. The means that bases with the highest level of evidence are filled in first, reducing the possiblity of errors.

Usage: fill_gaps_with_gapfiller [options] -a <FASTA> -f <FASTQ> -r <FASTQ> -g <EXEC>
Take in an assembly in FASTA format and reads in FASTQ format, then iteratively fill in the gaps with GapFiller.
Required: 	
    -a STR  assembly FASTA(.gz) file
    -f STR  forward reads FASTQ(.gz) file
    -r STR  reverse reads FASTQ(.gz) file
    -g STR  path to GapFiller executable
Options:    
    -i INT  insert size [250]
    -t INT  number of threads [1]
    -o STR  output directory [.]
    -d      debug output []
    -h      print this message and exit

order_contigs_with_abacas

A wrapper script around ABACAS for ordering contigs against a reference.

Usage: order_contigs_with_abacas -a <FASTA> -c <FASTA>
Take in an assembly and a reference in FASTA format and order the contigs.

Required: 
    -a STR  assembly FASTA(.gz) file
    -c STR  reference genome FASTA(.gz) file
Options: 
    -b STR  ABACAS executable [abacas.pl]
    -h      print this message and exit

read_correction_with_sga

A wrapper script around SGA read correction which sets some common defaults.

Usage: read_correction_with_sga [options] -f <FASTQ> -r <FASTQ> -s <EXEC>
Takes in FASTQ files and produces read corrected fastq files using SGA.

Required:
    -f STR  forward reads FASTQ(.gz) file
    -r STR  reverse reads FASTQ(.gz) file
    -s STR  path to SGA executable
Options:    
    -m INT  minimum read length [66]
    -q      trim reads using BWT trimming algorithm
    -a STR  indexing algorithm to use (ropebwt/sais) [ropebwt]
    -t INT  number of threads [1]
    -o STR  output directory [.]
    -h INT  kmer threshold [5]
    -k INT  length of kmer to be used [41]
    -z STR  output filename [_sga_error_corrected.fastq.gz]
    -d      debug output []

remove_primers_with_quasr

A wrapper script around QUASR.

rename_contigs

Rename all sequences in a FASTA file with a common base name and a sequential number.

Usage: rename_contigs [options]
Given an assembly, rename the contigs iteratively
Required:
    -a STR  assembly FASTA(.gz) file
    -b STR  basename for sequences
Options:    
    -h      print this message and exit

scaffold_with_sspace

A script to iteratively run scaffold an assembly using paired ended reads using SSPACE. Multiple iterations of scaffolding are run, begining where there is the highest level of read pair evidence linking together 2 contigs. It then outputs the scaffolded assembly in FASTA format.

Usage: scaffold_with_sspace [options] -a <FASTA> -f <FASTQ> -r <FASTQ> -s <EXEC>
Take in an assembly in FASTA format and reads in FASTQ format, then iteratively scaffold with SSPACE.
Required: 	
    -a STR  assembly FASTA(.gz) file
    -f STR  forward reads FASTQ(.gz) file
    -r STR  reverse reads FASTQ(.gz) file
    -s STR  path to SSPACE executable
Options:    
    -i INT  insert size [250]
    -t INT  number of threads [1]
    -o STR  output directory [.]
    -d      debug output []
    -h      print this message and exit

License

assembly_improvement is free software, licensed under GPLv3.

Feedback/Issues

Please report any issues to the issues page or email [email protected]

Citation

If you use this software please cite:

"Robust high throughput prokaryote de novo assembly and improvement pipeline for Illumina data", Page AJ, De Silva, N., Hunt M, Quail MA, Parkhill J, Harris SR, Otto TD, Keane JA, Microbial Genomics, 2016. doi: 10.1099/mgen.0.000083

assembly_improvement's People

Contributors

Stargazers

Watchers

assembly_improvement's Issues

Option -t tblastx can not open reference when it is in subfolder

Running abacas works fine when reference is in subfolder. But tblastx gives following error:

This may take several minutes ...
[formatdb] ERROR: Could not open reference.fasta

ERROR: Could not find 'formatdb' for blast

When reference is moved to the working directory tblastx works fine.

Intermediate file not found?

Dear @andrewjpage

I've been trying, for a couple of days already, to properly install and run your assembly improvment pipeline.

After installing all dependencies highlighted by cpanm -f Bio::AssemblyImprovement and test with dzil test runs smoothly within the github cloned folder.

However, when I run as follows (and only if I decide to include a reference with -c):

$ improve_assembly -a denovo_assembly/complete/KB10_trim_contigs.fasta -f clinic1/raw_reads/KB10_R1.fastq.gz -r clinic1/raw_reads/KB10_R2.fastq.gz -s bin/SSPACE-STANDARD-3.0_linux-x86_64/SSPACE_Standard_v3.0.pl -g bin/GapFiller_v1-10_linux-x86_64/GapFiller.pl  -o improve_assembly_KB10/   -c references/polyoma_dunlop_reference.fasta

Reports back the following error:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Could not read file '/home/jmarti/polyoma/7wWtvgPPJ7/KB10_trim_contigs.scaffolded.fasta_polyoma_dunlop_reference.fasta.fasta': Bestand of map bestaat niet
STACK: Error::throw
STACK: Bio::Root::Root::throw /home/jmarti/perl5/lib/perl5/Bio/Root/Root.pm:449
STACK: Bio::Root::IO::_initialize_io /home/jmarti/perl5/lib/perl5/Bio/Root/IO.pm:272
STACK: Bio::SeqIO::_initialize /home/jmarti/perl5/lib/perl5/Bio/SeqIO.pm:508
STACK: Bio::SeqIO::fasta::_initialize /home/jmarti/perl5/lib/perl5/Bio/SeqIO/fasta.pm:88
STACK: Bio::SeqIO::new /home/jmarti/perl5/lib/perl5/Bio/SeqIO.pm:384
STACK: Bio::SeqIO::new /home/jmarti/perl5/lib/perl5/Bio/SeqIO.pm:430
STACK: Bio::AssemblyImprovement::Abacas::DelimiterRole::_split_sequence_on_delimiter /home/jmarti/perl5/lib/perl5/Bio/AssemblyImprovement/Abacas/DelimiterRole.pm:70
STACK: Bio::AssemblyImprovement::Abacas::Main::run /home/jmarti/perl5/lib/perl5/Bio/AssemblyImprovement/Abacas/Main.pm:73
STACK: Bio::AssemblyImprovement::Abacas::Iterative::_run_abacas /home/jmarti/perl5/lib/perl5/Bio/AssemblyImprovement/Abacas/Iterative.pm:74
STACK: Bio::AssemblyImprovement::Abacas::Iterative::run /home/jmarti/perl5/lib/perl5/Bio/AssemblyImprovement/Abacas/Iterative.pm:46
STACK: /home/jmarti/perl5/bin/improve_assembly:142
-----------------------------------------------------------

I have the impression that /home/jmarti/polyoma/7wWtvgPPJ7/KB10_trim_contigs.scaffolded.fasta_polyoma_dunlop_reference.fasta.fasta merges names of both my assembly file and reference. Is it a bug? Is it supposed to have an intermediate file like this? Is there an error with the requirements?

Excited to make this pipeline to work. Thank you very much!

~ Joan ~

Question about output

I have been using AssemblyImprover following VelvetOptimizer for 36 bacterial genomes. For the majority of the genomes, I generated the expected output files:
contigs.fa.scaffolded.filtered
contigs.fa.scaffolded.gapfilled.filtered
scaffolds.scaffolded.gapfilled.length_filtered.fa
scaffolds.scaffolded.gapfilled.length_filtered.sorted.fa

But for a handful, I get the following files:
contigs.scaffolded.fa
contigs.scaffolded.gapfilled.fa
scaffolds.scaffolded.gapfilled.length_filtered.fa
scaffolds.scaffolded.gapfilled.length_filtered.sorted.fa

What is the difference? Are these unexpected files usable? I can not find any documentation about this discrepancy anywhere, so anything you can tell me would be helpful.

Call to SSPACE dies due to invalid config file (expects aligner as second argument)

It looks like maybe the expected input to SSPACE has changed since assembly_improvement was originally released.

Running against the latest SSPACE (SSPACE-STANDARD-3.0_linux-x86_64) produces errors like:

ERROR: Invalid aligner in library LIB: /data/thesisWGS/AW01/_5AzmruAqQ/AW-1_S1_L001_R1_001.fastq. Should be either 'bowtie', 'bwa' or 'bwasw' -- fatal

(There's a similar error message in the code for GapFiller, but I confirmed that at runtime it's coming from SSPACE by adding debug statements to SSPACE to print the path of the library file it's attempting to process)

The current version of AssemblyImprovement installed from cpan (and also here in git at AssemblyImprovement/Scaffold/SSpace/Config.pm) creates a _scaffolder_config_file that looks like:

$ cat /data/thesisWGS/AW01/8Y4EvoYYIc/KToMu7SX1I/_scaffolder_config_file
LIB /data/thesisWGS/AW01/_5AzmruAqQ/AW-1_S1_L001_R1_001.fastq /data/thesisWGS/AW01/_5AzmruAqQ/AW-1_S1_L001_R2_001.fastq 350 0.3 FR

The first two args are: "LIB" (hardcoded placeholder library name), followed by path to the forward read.

Compare to the current SSPACE example, where the aligner is the second arg:

$ cat ~/tools/SSPACE-STANDARD-3.0_linux-x86_64/example/libraries.txt
lib1 bowtie SRR001665_1.fastq SRR001665_2.fastq 200 0.25 FR

I don't have previous versions of SSPACE to confirm that it changed at some point, but that seems to be the most plausible explanation.

It seems like the most reasonable fix is to add a mapper arg as is used in AssemblyImprovement/FillGaps/GapFiller/Config.pm, also defaulting to "bwa".

I can add a simple pull request to implement that, but I'm hoping someone can shed light on why/when the SSPACE expected params changed so that we don't break backward compatibility.

Could not read file (reference file) error

Hi @andrewjpage
I ran assembly improvement against a spades assembly and a virus reference sequence provided on the command line and I get this error. I decided to just run the order_contigs_with_abacus script and again it raises this error. Is this a known issue or something wrong with my setup? The reference file was provided in the arguments.

This is with improve_assembly
For some reason it is unable to locate the given files and arguments.

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Could not read file '/assembly/spades/sample1/improved_no_ref_assembly/ucgdTSn1uJ/scaffolds.scaffolded.gapfilled.length_filtered.sorted.fa_KP317916_1.fasta.fasta': No such file or directory
STACK: Error::throw
STACK: Bio::Root::Root::throw /apps/perl/5.24.0/lib/site_perl/5.24.0/Bio/Root/Root.pm:447
STACK: Bio::Root::IO::_initialize_io /apps/perl/5.24.0/lib/site_perl/5.24.0/Bio/Root/IO.pm:268
STACK: Bio::SeqIO::_initialize /apps/perl/5.24.0/lib/site_perl/5.24.0/Bio/SeqIO.pm:513
STACK: Bio::SeqIO::fasta::_initialize /apps/perl/5.24.0/lib/site_perl/5.24.0/Bio/SeqIO/fasta.pm:87
STACK: Bio::SeqIO::new /apps/perl/5.24.0/lib/site_perl/5.24.0/Bio/SeqIO.pm:389
STACK: Bio::SeqIO::new /apps/perl/5.24.0/lib/site_perl/5.24.0/Bio/SeqIO.pm:435
STACK: Bio::AssemblyImprovement::Abacas::DelimiterRole::_split_sequence_on_delimiter /apps/perl/5.24.0/lib/site_perl/5.24.0/Bio/AssemblyImprovement/Abacas/DelimiterRole.pm:70
STACK: Bio::AssemblyImprovement::Abacas::Main::run /apps/perl/5.24.0/lib/site_perl/5.24.0/Bio/AssemblyImprovement/Abacas/Main.pm:73
STACK: Bio::AssemblyImprovement::Abacas::Iterative::_run_abacas /apps/perl/5.24.0/lib/site_perl/5.24.0/Bio/AssemblyImprovement/Abacas/Iterative.pm:74
STACK: Bio::AssemblyImprovement::Abacas::Iterative::run /apps/perl/5.24.0/lib/site_perl/5.24.0/Bio/AssemblyImprovement/Abacas/Iterative.pm:46
STACK: /apps/perl/5.24.0/bin/order_contigs_with_abacas:57
-----------------------------------------------------------

I had provided all the arguments with the correct paths to where these files are located.

access issue

Dear developers,
I cannot access the page.

SSPACE and GapFiller - This is used by the improve_assembly, fill_gaps_with_gapfiller, and scaffold_with_sspace scripts. Download SSPACE-standard and GapFiller from Baseclear. The software is free for academic use.

Is there an alternative way to access it?
Thanks

Mummer error

ERROR: mummer and/or mgaps returned non-zero
ERROR: Could not parse delta file, nucmer.delta
Hi, I am getting this error while running ABACAS with the following params.
$ perl abacas.1.3.1.pl -b -N -i 50 -v 50 -p nucmer -m -o -r -q

could anyone suggest where I am going wrong?
error no: 400
ERROR: Could not parse delta file, nucmer.filtered.delta
error no: 402
Use of uninitialized value in addition (+) at abacas.1.3.1.pl line 1001.
Total contigs = 0
Use of uninitialized value $id in print at abacas.1.3.1.pl line 1013.
Use of uninitialized value $id in print at abacas.1.3.1.pl line 1014.
Use of uninitialized value $id in print at abacas.1.3.1.pl line 1015.
Use of uninitialized value $id in print at abacas.1.3.1.pl line 1178.
FINISHED CONTIG ORDERING

Trying to use SSpace-Long Read

I have pre-assembled contigs from Canu.

When I run the script: It gives me an error: Please insert a file with contig sequences. You've inserted '/SSPACE/SSPACE-LongRead_v1-1/runs/data/contigs.fasta' which either does not exist or is not filled in

However this is a file that consists all the. contigs.

Eg of header:

tig00000001 len=1248915 reads=5062 covStat=1331.73 gappedBases=no class=contig suggestRepeat=no suggestCircular=no
TACTAAGCTCCTTACTAGAGGCATCATAATATTATCCCTTTACCTATTAGCATAAGTCTTATTCTTCATTCTTCTACAAAAACCTAAACCATCCAAAAAT

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.