MashMap

MashMap implements a fast, approximate algorithm for computing local alignment boundaries between long DNA sequences. It is useful for mapping a genome assembly or long reads (PacBio/ONT) to one or more reference genomes. Given a minimum alignment length and an identity threshold for the desired local alignments, MashMap computes alignment boundaries and identity estimates using k-mers. It does not compute the alignments explicitly; instead, it estimates a k-mer based Jaccard similarity using a combination of minimizers and MinHash. This estimate is then converted to an estimate of sequence identity using the Mash distance. An appropriate k-mer sampling rate is determined automatically from the given minimum local alignment length and identity threshold. The efficiency of the algorithm improves as both of these thresholds are increased.
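
The conversion from a Jaccard estimate to an identity estimate can be illustrated with a short Python sketch. It uses the standard Mash distance formula, D = -(1/k) * ln(2j / (1 + j)), and reports identity as 1 - D. The function name `jaccard_to_identity` and the k-mer size of 16 in the usage line are assumptions made for illustration; this is not MashMap's own code, and MashMap's internal estimator may differ in its details.

```python
import math


def jaccard_to_identity(jaccard: float, k: int) -> float:
    """Estimate sequence identity from a k-mer Jaccard similarity.

    Uses the Mash distance D = -(1/k) * ln(2j / (1 + j)) and returns 1 - D.
    Hypothetical helper for illustration only.
    """
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must lie in [0, 1]")
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if jaccard == 0.0:
        # No shared k-mers: the Mash distance is unbounded, so clamp to 0.
        return 0.0
    mash_distance = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    # Clamp so that very small Jaccard values never give a negative identity.
    return max(0.0, 1.0 - mash_distance)


if __name__ == "__main__":
    # k = 16 is an assumed k-mer size chosen only for this example.
    for j in (0.1, 0.5, 0.9):
        print(f"jaccard={j:.1f} -> identity~{jaccard_to_identity(j, 16):.4f}")
```

Because the distance is divided by k, even a moderate Jaccard value corresponds to a high identity estimate, which is why k-mer sketches remain informative for sequences that differ by several percent.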

As an example, MashMap can map a human genome assembly to the human reference genome in about one minute of total execution time and less than 4 GB of memory using just 8 CPU threads, more than an order of magnitude better in both runtime and memory than alternative methods. We describe the algorithms behind MashMap and report on the speed, scalability, and accuracy of the software in the publications listed below. Unlike traditional mappers, MashMap does not compute exact sequence alignments. In the future, we plan to add optional alignment support to generate base-to-base alignments.

Installation

Follow INSTALL.txt to compile and install MashMap. For convenience, we also provide dependency-free Linux and OSX binaries with the latest release.

Usage

  • Map a set of query sequences against a reference genome:

    mashmap -r reference.fna -q query.fa

    The output is space-delimited. Each line consists of the query name, length, 0-based start, end, strand, target name, length, start, end, and mapping nucleotide identity.

  • Map a set of query sequences against a list of reference genomes:

    mashmap --rl referenceList.txt -q query.fa

    The file 'referenceList.txt' should contain the paths to the reference genomes, one per line.
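
The ten-field output format described above is straightforward to read in a script. The following Python sketch parses one output line into a named record. The names `Mapping` and `parse_mapping_line` are hypothetical, and the sample line in the usage section is made up for illustration rather than taken from a real run. The sketch assumes the ten fields appear in the order listed above and ignores any extra trailing fields.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    """One mapping record; field order follows the output description above."""
    query_name: str
    query_len: int
    query_start: int   # 0-based
    query_end: int
    strand: str
    target_name: str
    target_len: int
    target_start: int
    target_end: int
    identity: float    # mapping nucleotide identity as reported


def parse_mapping_line(line: str) -> Mapping:
    """Parse a single space-delimited mapping line (hypothetical helper)."""
    fields = line.split()
    if len(fields) < 10:
        raise ValueError(f"expected at least 10 fields, got {len(fields)}")
    return Mapping(
        query_name=fields[0],
        query_len=int(fields[1]),
        query_start=int(fields[2]),
        query_end=int(fields[3]),
        strand=fields[4],
        target_name=fields[5],
        target_len=int(fields[6]),
        target_start=int(fields[7]),
        target_end=int(fields[8]),
        identity=float(fields[9]),
    )


if __name__ == "__main__":
    # Made-up example line, not real MashMap output.
    sample = "read1 12000 0 11999 + chr1 248956422 100000 111999 93.5"
    record = parse_mapping_line(sample)
    print(record.query_name, record.target_name, record.identity)
```

A named tuple keeps the record immutable and lets downstream code refer to fields by name instead of by column index.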

Parameters

For most use cases, the default values should be appropriate. The full list of parameters and their purposes is available on the help page (mashmap -h). The most important ones are described below:

  • Identity threshold (--perc_identity, --pi) : The default is 85, meaning that mappings with 85% or more identity are reported. For example, it can be set to 80 to account for noisier long-read datasets, or to 95 for mapping a human genome assembly to the human reference.

  • Minimum segment length (-s, --segLength) : The default is 5,000 bp. Sequences below this length are ignored. MashMap provides guarantees on reporting local alignments of length twice this value.

  • Filtering options (-f, --filter_mode) : MashMap implements a plane-sweep based algorithm to filter alignments. Similar to delta-filter in nucmer, it provides different filtering options suited to long-read or assembly mapping. Option -f map is suitable for reporting the best mappings for long reads, whereas -f one-to-one is suitable for reporting orthologous mappings among all computed assembly-to-genome mappings.

Visualize

We provide a Perl script for generating dot-plots to visualize mappings. It takes MashMap's mapping output as its input and requires gnuplot. Below is an example demonstrating mapping of the canu NA12878 human genome assembly (y-axis) to the hg38 reference (x-axis).

Release

Use the latest release for a stable version.

Publications
