fastq-info's Introduction

fastq-info

Compute estimated sequencing depth/coverage of genomes

This script estimates sequencing coverage for paired-end FASTQ files (Illumina WGS data). It is a pure Bash script with no dependencies, so it runs on Linux or macOS and should produce results within seconds. You will need the raw paired-end FASTQ files (R1 and R2) and a genome assembly to estimate the coverage.

The coverage calculation is based on the Lander/Waterman equation:

C = LN / G

where C is the coverage, L is the read length, N is the number of reads, and G is the genome size (see https://emea.illumina.com/science/technology/next-generation-sequencing/plan-experiments/read-length.html).
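
As a worked example of the equation, the sketch below evaluates C = LN / G with Bash integer arithmetic. The function name estimate_coverage is hypothetical and not part of fastq-info, and the input values are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of fastq-info): C = L * N / G.
# Bash arithmetic is integer-only, so the result is truncated.
estimate_coverage() {
    local read_length=$1   # L: read length in bp
    local read_count=$2    # N: total number of reads
    local genome_size=$3   # G: genome size in bp
    echo $(( read_length * read_count / genome_size ))
}

# Illustrative values: 2,000,000 reads of 150 bp on a 3 Mb genome.
estimate_coverage 150 2000000 3000000   # prints 100
```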

Usage

Options

$ fastqinfo-2.0.sh -h

This bash script calculates actual sequencing coverage(X)
Fasta assembly and raw fastq files (paired-end) are needed

Usage: /fastqinfo-2.0.sh [options] fastq_R1 fastq_R2 fasta_assembly
Option:
 -r insert size (default=125)
 -h print usage and exit
 -a print author and exit
 -v print version and exit

Version 2.0
Author: Raymond Kiu [email protected]

Run the software

$ ./fastqinfo-2.0.sh -r INSERT_SIZE R1.fastq R2.fastq ASSEMBLY.fasta

Inputs

You will need the raw R1 and R2 FASTQ files (paired-end) and a genome assembly (a draft genome will do) to compute the coverage, or sequencing depth (X). The calculation assumes that every read in the raw Illumina FASTQ files has the same length, which this script refers to as the insert size.

You will also need the insert size (read length) of the FASTQ files. To find it quickly:

$ head -n 2 R1.fastq | tail -n 1 | tr -d '\n' | wc -c
250

This prints the second line of the FASTQ file, i.e. the sequence of the first read, and counts its bases (A, C, G, T). A FASTQ record looks like this:

@NB5
GTTTGTATTGATTGAGGTGTTGTAACATTAGCATTACCTATCTCAAAGCCATTCTCTAACATATCTTTTGCATCTATGAGACAACAATTGGTTAATGGTTGAAATGGATGGTAATCTAAGTCGTGAAAATGAATATCTCCCGATTGATGTG
+
AAAAAE6EAEEEEEEEEE/EEEEEE6/EAEEE/EEEEEEEEEEEEEEEEE/AEEEEE/EEEE/EEEEEAEAEAEEEAEEAEEEEAEEA<AEE</AEEEEEAEAE/EEAE<<<////EAAEE<AA/A/A<<6<<E<A/<<<6/A<<EEEA/E

WARNING: Using the wrong insert size will bias the results. This script is designed for microbial genomes and has not been tested on eukaryotic genomes.
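
Because a wrong value skews the estimate, it may be worth checking that the reads really share one length before choosing a value for -r. The sketch below tabulates sequence lengths over the first reads of an uncompressed FASTQ file. The function name read_length_table and the default sample of 1000 reads are assumptions for illustration, and the code assumes standard four-line FASTQ records.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of fastq-info).
# Prints "<length> <number of reads>" for the first N reads of a FASTQ file.
# Assumes four-line FASTQ records, so the sequence is line 2 of each record.
read_length_table() {
    local fastq=$1
    local n_reads=${2:-1000}
    head -n $(( n_reads * 4 )) "$fastq" |
        awk 'NR % 4 == 2 { count[length($0)]++ }
             END { for (len in count) print len, count[len] }' |
        sort -n
}

# Usage: read_length_table R1.fastq 1000
```

A single output line means all sampled reads have the same length. Several lines suggest trimmed or variable-length reads, in which case any single -r value is only an approximation.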

Outputs

The script prints tab-delimited results to standard output, e.g.:

Sample   	Insert	Reads	Genome	Coverage(X)
CA-68.fna	250	1014649	2499579	202

Insert: insert size (read length) in bp
Reads: total read count in both paired-end fastq files
Genome: size (bp) of the supplied genome assembly
Coverage(X): estimated sequencing depth of the genome
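
The Reads and Genome columns can be cross-checked by hand. The sketch below is not taken from the script itself; it simply counts reads as the number of lines divided by four and the genome size as the number of non-header characters in the FASTA file. The function names count_reads and genome_size are hypothetical, and both inputs are assumed to be uncompressed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not taken from fastq-info) for cross-checking the columns.

# Number of reads in an uncompressed four-line FASTQ file.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Total bases in a FASTA file: drop header lines, then count the
# remaining non-whitespace characters.
genome_size() {
    grep -v '^>' "$1" | tr -d '[:space:]' | wc -c | tr -d ' '
}

# Usage:
#   count_reads R1.fastq
#   genome_size ASSEMBLY.fasta
```

For the example row above, 250 x 1014649 / 2499579 is about 101, whereas 250 x 2 x 1014649 / 2499579 is about 202.96, which truncates to the reported 202. This suggests that the Reads value in that row is the count for one file and that both files enter the coverage, although this is inferred from the numbers rather than confirmed from the script.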

Issues

This script has been tested on Linux and should run without any additional dependencies. Please report any problems on the issues page.

Citation

If you use fastq-info for results in your publication, please cite:

Kiu R, fastq-info: compute estimated sequencing depth (coverage) of prokaryotic genomes, GitHub https://github.com/raymondkiu/fastq-info

License

GPLv3

Author

Raymond Kiu [email protected]

fastq-info's Issues

Multiple Samples At Once

Hi @raymondkiu I'm really keen to try your coverage script and was wondering can multiple samples be run at the same time using a loop?

I must be upfront and say I'm new to the world of scripts/command so I'm not that savvy! Normally for writing a loop I do something like this:

#!/bin/bash
# Arrays are a Bash feature, so the script needs bash rather than sh.

sampleLoc=/Path/to/fq.gz/
# No space is allowed between the variable name and "=".
sampleName=("19IE01" "19IE02" "19IE03" "19IE04" "19IE05")
outputLoc=/Path/to/Output/

# Run SPAdes once per sample.
for sample in "${sampleName[@]}"
do
    spades.py -1 "$sampleLoc/${sample}R1.fq.gz" -2 "$sampleLoc/${sample}R2.fq.gz" -o "$outputLoc/${sample}"
done

Can I add something similar to the ./fastq_info_3.sh file to loop all the .fq.gz files?

Cheers
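
One way this could be done, as a sketch only, is to wrap the call in a Bash loop over the sample names. The file-naming pattern (<sample>R1.fq.gz, <sample>R2.fq.gz, <sample>.fasta), the function name run_fastqinfo_batch, and the FASTQINFO variable are assumptions for illustration. The README describes raw fastq inputs and does not say whether gzipped files are accepted, so the sketch decompresses each pair to temporary files first.

```shell
#!/usr/bin/env bash
# Hypothetical batch wrapper (not part of fastq-info).
# Assumed layout: <fastq_dir>/<sample>R1.fq.gz, <fastq_dir>/<sample>R2.fq.gz
#                 and <assembly_dir>/<sample>.fasta
FASTQINFO=${FASTQINFO:-./fastqinfo-2.0.sh}

run_fastqinfo_batch() {
    local read_length=$1 fastq_dir=$2 assembly_dir=$3
    shift 3
    local sample tmp_dir
    for sample in "$@"; do
        tmp_dir=$(mktemp -d)
        # Decompress to temporary files, assuming the script needs plain FASTQ.
        gunzip -c "$fastq_dir/${sample}R1.fq.gz" > "$tmp_dir/${sample}_R1.fastq"
        gunzip -c "$fastq_dir/${sample}R2.fq.gz" > "$tmp_dir/${sample}_R2.fastq"
        bash "$FASTQINFO" -r "$read_length" \
            "$tmp_dir/${sample}_R1.fastq" "$tmp_dir/${sample}_R2.fastq" \
            "$assembly_dir/${sample}.fasta"
        rm -rf "$tmp_dir"
    done
}

# Usage:
#   run_fastqinfo_batch 250 /Path/to/fq.gz /Path/to/assemblies 19IE01 19IE02 19IE03
```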

Oxford Nanopore Version?

Hi Raymond,

Thanks so much for providing this tool. It worked great.

I was wondering if the same could be done for long reads, more specifically Oxford Nanopore?

Thanks :)
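
The -r option assumes one length for all reads, which does not hold for Oxford Nanopore data. A generic workaround, which is not a feature of fastq-info, is to sum the actual read lengths and divide by the genome size (depth = total bases / G). The sketch below does this for an uncompressed four-line FASTQ file; the function name long_read_depth is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of fastq-info):
# depth = total sequenced bases / genome size.
# Assumes uncompressed four-line FASTQ records; read lengths may vary.
long_read_depth() {
    local fastq=$1
    local genome_size=$2   # genome size in bp
    awk -v g="$genome_size" '
        NR % 4 == 2 { total += length($0) }   # sum the sequence lines only
        END { printf "%.1f\n", total / g }
    ' "$fastq"
}

# Usage: long_read_depth reads.fastq 2499579
```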
