Giter Club home page Giter Club logo

fastqdedup's Introduction

fastqdedup

Fastqdedup deduplicates FASTQ files with UMI data while taking sequencing errors into account. Fastqdedup is based on HUMID and was created to explore an alternative implementation. HUMID is currently more actively developed.

Fastqdedup can operate on one or multiple FASTQ files. So single reads with prepended UMIs, paired reads with prepended UMIs or paired reads with separate UMI file are all possible.

It executes the following steps:

  • Read all sequences
  • Filter low quality sequences (can be turned off).
  • Determine clusters based on Hamming distance (or Levenshtein when set).
  • Determine which reads are distinct in each cluster. Put these in a set. This uses the 'directional' method as described in the UMI-tools paper. Several methods are available.
  • Read all sequence records again from the FASTQ files. Check against the set. If a sequence is present the sequence record is output to the output FASTQ files and the sequence is removed from the set so that it is not included again.

Installation

To install the latest development version:

pip install git+https://github.com/rhpvorderman/fastqdedup.git

Usage

Since all the sequences are stored in memory the memory usage is between approximately 45-110% of the size of the original uncompressed FASTQ files depending on read length and similarities between reads. The memory usage can be substantially reduced by setting --check-lengths.

usage: fastqdedup [-h] [-l CHECK_LENGTHS] [-o OUTPUT] [-p PREFIX]
                  [-d MAX_DISTANCE] [-e MAX_AVERAGE_ERROR_RATE] [-E] [--edit]
                  [-c {highest_count,adjacency,directional}] [-v] [-q]
                  FASTQ [FASTQ ...]

positional arguments:
  FASTQ                 Forward FASTQ and optional reverse and UMI FASTQ
                        files.

optional arguments:
  -h, --help            show this help message and exit
  -l CHECK_LENGTHS, --check-lengths CHECK_LENGTHS
                        Comma-separated string with the maximum string check
                        length of each file. For example 'fastqdedup --check-
                        lengths 16,8 R1.fastq R2.fastq' only checks the first
                        16 bases of R1 and the first 8 bases of R2 for
                        duplication. Supports slice notation such as '4:8' or
                        '::8'.
  -o OUTPUT, --output OUTPUT
                        Output file (optional), must be specified multiple
                        times for multiple input files. For example
                        ``fastqdedup -o dedupR1.fastq -o dedupR2.fastq
                        R1.fastq R2.fastq``.
  -p PREFIX, --prefix PREFIX
                        Prefix for the output files. Default: 'fastqdedup_R'
  -d MAX_DISTANCE, --max-distance MAX_DISTANCE
                        The Hamming distance at which inputs are considered
                        different. Default: 1.
  -e MAX_AVERAGE_ERROR_RATE, --max-average-error-rate MAX_AVERAGE_ERROR_RATE
                        The maximum average per base error rate for each FASTQ
                        record. Average is evaluated over bases taken into
                        account by --check-lengths.Default: 0.001
  -E, --no-average-error-rate-filter
                        Do not filter on average per base error rate.
  --edit                Use edit (Levenshtein) distance instead of Hamming
                        distance.
  -c {highest_count,adjacency,directional}, --cluster-dissection-method {highest_count,adjacency,directional}
                        How to approach clusters with multiple reads.
                        'highest_count' selects only one read, the one with
                        the highest count. 'adjacency' starts from the read
                        with the highest count and selects all reads that are
                        within the specified distance. The process is repeated
                        for the remaining reads. 'directional' is similar to
                        adjacency but uses counts to determine if an error is
                        a PCR/sequencing artifact or derived from a difference
                        in the molecule (default).
  -v, --verbose         Increase log verbosity.
  -q, --quiet           Reduce log verbosity.

Methodology

Fastqdedup was heavily inspired by @jfjlaros's trie implementation. A trie is a way to store sequences in a tree format. A disadvantage of this format is that memory usage is quite high since every character is stored as a node in the tree, and a node uses a lot more space than a single character.

Fastqdedup utilizes the following optimizations.

  1. Radix tree suffixes. While a lot of similarities between FASTQ files are basically guaranteed across the first 8 characters (65536 possible sequences) the amount of possible sequences explodes after that. Therefore a trie of FASTQ sequences contains a lot of branches that consists of nodes that only link one parent to one child for quite a long length. Fastqdedup stores terminating branches as strings rather than nodes, saving a lot of memory.
  2. Variable size nodes. The trie is initialized with an ACGTN alphabet, where A has index 0, C has index 1 etc. Nodes are sized such that they can contain the character with the highest index that is in the node.
  3. Fast fail Hamming (or Levenshtein) distances. Since the most likely result is that the sequence is not seen before, the algorithm gives up as quickly as it finds the Hamming distance will be exceeded.

Background

Illumina reads have an error rate of approximately 1 in 1000. This means that the chance of a sequencing error in a 150bp illumina read is 1 - ((1 - (1/1000)) ^ 150) ~= 14%. This means that most (86%) illumina reads are correct and that it is hard to distinguish between reads with one SNP and reads with one sequencing error. It is also hard to distinguish between sequencing replicates and actual biological replicates.

Unique Molecular Identifiers (UMIs) solve this problem by prepending each read with a short sequence of bases. With a UMI length of 8, there are 65536 (4^8) possible UMIs. The chance of two biological replicates having the same UMI is 1 in 65536. Therefore two sequences with the same UMI and only one base pair different are probably derived from the same biological replicate since there is a 14% chance of a sequencing error.

fastqdedup's People

Contributors

rhpvorderman avatar

Stargazers

 avatar  avatar

Watchers

 avatar

Forkers

ungtb10d

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.