fastq-info's Introduction

fastq-info

Compute estimated sequencing depth/coverage of genomes

This script estimates sequencing coverage from paired-end fastq files (Illumina WGS data). It is a pure Bash script with no dependencies, so it runs on Linux or Mac and should produce results within seconds. You will need raw paired-end fastq files (R1 and R2) and a genome assembly to estimate the coverage.

The coverage calculation is based on the Lander/Waterman equation:

C = LN / G

where C is the coverage, L is the read length, N is the number of reads, and G is the genome size. Source: https://emea.illumina.com/science/technology/next-generation-sequencing/plan-experiments/read-length.html
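
As a rough illustration of the equation above, the following Bash sketch computes C from L, N and G. The function name `coverage` and the input numbers are assumptions made for this example and are not taken from fastqinfo-2.0.sh. Bash arithmetic is integer-only, so the result is truncated rather than rounded.

```shell
#!/usr/bin/env bash
# Estimate coverage with the Lander/Waterman equation C = L * N / G.
# Hypothetical helper for illustration; not the script's own code.
coverage() {
    local read_length=$1   # L: read length in bp
    local read_count=$2    # N: number of reads
    local genome_size=$3   # G: genome size in bp

    # Guard against division by zero.
    if [ "$genome_size" -le 0 ]; then
        echo "genome size must be positive" >&2
        return 1
    fi

    # Integer arithmetic: the fractional part is discarded.
    echo $(( read_length * read_count / genome_size ))
}

# Made-up inputs: 2,000,000 reads of 150 bp over a 3,000,000 bp genome.
coverage 150 2000000 3000000   # prints 100
```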

Usage

Options

$ fastqinfo-2.0.sh -h

This bash script calculates actual sequencing coverage(X)
Fasta assembly and raw fastq files (paired-end) are needed

Usage: /fastqinfo-2.0.sh [options] fastq_R1 fastq_R2 fasta_assembly
Option:
 -r insert size (default=125)
 -h print usage and exit
 -a print author and exit
 -v print version and exit

Version 2.0
Author: Raymond Kiu [email protected]

Run the software

$ ./fastqinfo-2.0.sh -r INSERT_SIZE R1.fastq R2.fastq ASSEMBLY.fasta

Inputs

You will need the raw R1 and R2 fastq files (paired-end) and a genome assembly (a draft genome will do) to compute the coverage, or sequencing depth (X). The calculation assumes that every read in the raw Illumina fastq files has the same length (the insert size referred to below).

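  • Additionally, you will need the insert size (or read length) of the fastq files. To obtain it quickly:
$ head -n 2 R1.fastq | tail -n 1 | tr -d '\n' | wc -c
250

This prints the second line of the fastq file (the sequence line of the first record) and counts its bases (A, G, T, C), excluding the trailing newline. A fastq record has the following layout, with the sequence on the second line: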

@NB5
GTTTGTATTGATTGAGGTGTTGTAACATTAGCATTACCTATCTCAAAGCCATTCTCTAACATATCTTTTGCATCTATGAGACAACAATTGGTTAATGGTTGAAATGGATGGTAATCTAAGTCGTGAAAATGAATATCTCCCGATTGATGTG
+
AAAAAE6EAEEEEEEEEE/EEEEEE6/EAEEE/EEEEEEEEEEEEEEEEE/AEEEEE/EEEE/EEEEEAEAEAEEEAEEAEEEEAEEA<AEE</AEEEEEAEAE/EEAE<<<////EAAEE<AA/A/A<<6<<E<A/<<<6/A<<EEEA/E
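
The read length check above, together with a read count, can also be sketched with awk. This is a minimal sketch that assumes an uncompressed fastq file with exactly four lines per record; the helper names `read_length` and `read_count` are hypothetical and are not part of fastqinfo-2.0.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, assuming an uncompressed fastq file
# with exactly four lines per record.

# Length of the first read: the sequence is line 2 of the file.
read_length() {
    awk 'NR == 2 { print length($0); exit }' "$1"
}

# Number of reads: total line count divided by four.
read_count() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Small demonstration on a made-up two-record file.
tmp=$(mktemp)
printf '@r1\nACGTAC\n+\nIIIIII\n@r2\nGGCCTT\n+\nIIIIII\n' > "$tmp"
read_length "$tmp"   # prints 6
read_count "$tmp"    # prints 2
rm -f "$tmp"
```

Only the first record is inspected for the length, so the result is meaningful only when all reads share the same length.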

WARNING: Using the wrong insert size will bias the results. This script is designed for microbial genomes and has not been tested on eukaryotic genomes.

Outputs

The script prints tab-delimited results to standard output, e.g.:

Sample   	Insert	Reads	Genome	Coverage(X)
CA-68.fna	250	1014649	2499579	202
  • Insert: insert size in bp
  • Reads: total read counts in both paired-end fastq files
  • Genome: Size (bp) of genome assembly supplied
  • Coverage(X): estimated sequencing depth of the genome
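
To make the Genome column concrete, here is one way the assembly size could be derived from a FASTA file: sum the lengths of all lines that are not headers. The helper name `genome_size` is hypothetical, and fastqinfo-2.0.sh may compute this value differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total bases in a FASTA file, skipping ">" header lines.
genome_size() {
    awk '!/^>/ { total += length($0) } END { print total + 0 }' "$1"
}

# Small demonstration on a made-up two-contig assembly.
tmp=$(mktemp)
printf '>contig1\nACGT\nAC\n>contig2\nGGG\n' > "$tmp"
genome_size "$tmp"   # prints 9
rm -f "$tmp"
```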

Issues

This script has been tested on Linux and should run with no dependencies. Please report any problems on the issues page.

Citation

If you use fastq-info for results in your publication, please cite:

  • Kiu R, fastq-info: compute estimated sequencing depth (coverage) of prokaryotic genomes, GitHub https://github.com/raymondkiu/fastq-info

License

GPLv3

Author

Raymond Kiu [email protected]

fastq-info's Issues

Multiple Samples At Once

Hi @raymondkiu I'm really keen to try your coverage script and was wondering can multiple samples be run at the same time using a loop?

I must be upfront and say I'm new to the world of scripts/command so I'm not that savvy! Normally for writing a loop I do something like this:

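#!/bin/bash
# Arrays are a Bash feature, so the script uses bash rather than sh.

sampleLoc=/Path/to/fq.gz/
# No space is allowed between "=" and "(" in an array assignment.
sampleName=("19IE01" "19IE02" "19IE03" "19IE04" "19IE05")
outputLoc=/Path/to/Output/

# Run SPAdes once per sample; quoting protects paths that contain spaces.
for sample in "${sampleName[@]}"
do
    spades.py -1 "$sampleLoc/${sample}R1.fq.gz" -2 "$sampleLoc/${sample}R2.fq.gz" -o "$outputLoc/${sample}"
done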

Can I add something similar to the ./fastq_info_3.sh file to loop all the .fq.gz files?

Cheers

Oxford Nanopore Version?

Hi Raymond,

Thanks so much for providing this tool. It worked great.

I was wondering if the same could be done for long reads, more specifically Oxford Nanopore?

Thanks :)
