GenMap: Ultra-fast Computation of Genome Mappability

GenMap computes the uniqueness of k-mers for each position in the genome while allowing for up to e mismatches. More formally, the uniqueness or (k,e)-mappability can be described for every position as the reciprocal of how often the k-mer starting there occurs approximately in the genome, i.e., with up to e mismatches. Hence, a mappability value of 1 at position i indicates that the k-mer at position i occurs only once in the sequence with up to e errors. A low mappability value indicates that the k-mer belongs to a repetitive region. GenMap can be applied to single or multiple genomes and helps find regions that are unique or shared by many or all genomes.

Below you can see the (4,1)-mappability M and frequency F of the nucleotide sequence T = ATCTAGGCTAATCTA. The mappability value M[1] = 0.33 means that the 4-mer starting at position 1, T[1..4] = TCTA, occurs three times in the sequence with up to one mismatch: at positions 1 (TCTA), 6 (GCTA) and 11 (TCTA).

[Figure: example of mappability]
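The definition above can be checked with a brute-force sketch. This is not GenMap's algorithm (GenMap uses an index; this is a quadratic scan over all k-mer pairs), it only considers the forward strand, and the function names are made up for illustration. Positions are 0-based, as in the example.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def frequency(seq: str, k: int, e: int) -> list:
    """F[i] = number of k-mers in seq within Hamming distance e of seq[i:i+k]."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return [sum(hamming(p, q) <= e for q in kmers) for p in kmers]


def mappability(seq: str, k: int, e: int) -> list:
    """M[i] = 1 / F[i]; every k-mer matches itself, so F[i] >= 1."""
    return [1.0 / f for f in frequency(seq, k, e)]


T = "ATCTAGGCTAATCTA"
F = frequency(T, 4, 1)
M = mappability(T, 4, 1)
print(F[1], round(M[1], 2))  # prints: 3 0.33
```

The 4-mer TCTA at position 1 matches itself, GCTA at position 6 (one mismatch) and TCTA at position 11, which gives F[1] = 3 and M[1] = 1/3.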

The mappability can be exported in various formats that allow post-processing or display in genome browsers. A small example of how to run GenMap is listed below; further details are on the GitHub Wiki pages. For questions or feature requests, feel free to open an issue on GitHub or send an e-mail to christopher.pockrandt [ÄT] fu-berlin.de.

Christopher Pockrandt, Mai Alzamel, Costas S. Iliopoulos, Knut Reinert. GenMap: Ultra-fast Computation of Genome Mappability. Bioinformatics, 2020.

$ conda install -c bioconda genmap

Your CPU must support the POPCNT instruction. If you have a modern CPU, you can go with the optimized 64 bit version that additionally uses SSE4. This improves the running time by 10%. To verify whether your CPU supports these instruction sets, you can check the output of cat /proc/cpuinfo | grep -E "mmx|sse|popcnt" (Linux) or sysctl -a | grep -i -E "mmx|sse|popcnt" (Mac).
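As a rough, Linux-only alternative to reading the grep output by eye, the following sketch parses the text of /proc/cpuinfo. It assumes that SSE4 support shows up as a flag starting with sse4 (such as sse4_1 or sse4_2); the function name and the sample text are illustrative only.

```python
def cpu_features(cpuinfo_text: str) -> dict:
    """Report POPCNT and SSE4 support from the text of /proc/cpuinfo."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return {
        "popcnt": "popcnt" in flags,
        # Assumption: any flag starting with "sse4" counts as SSE4 support.
        "sse4": any(flag.startswith("sse4") for flag in flags),
    }


# Made-up sample; on a real machine pass open("/proc/cpuinfo").read() instead.
sample = "processor\t: 0\nflags\t\t: fpu mmx sse sse2 popcnt sse4_1 sse4_2\n"
print(cpu_features(sample))  # {'popcnt': True, 'sse4': True}
```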

| Platform | Download | Version | Additional requirements |
|----------|----------|---------|-------------------------|
| Linux | Linux 64 bit | 1.3.0 (2020-06-17) | - |
| Linux | Linux 64 bit optimized | 1.3.0 (2020-06-17) | requires SSE4 |
| Mac | Mac 64 bit | 1.3.0 (2020-06-17) | - |
| Mac | Mac 64 bit optimized | 1.3.0 (2020-06-17) | requires SSE4 |

If you want to build it from source, we recommend cloning the git repository as shown below. The tarballs on GitHub do not contain git submodules (i.e., SeqAn). Please note that building from source can easily take 10 minutes and longer depending on your machine and compiler.

$ git clone --recursive https://github.com/cpockrandt/genmap.git
$ mkdir genmap-build && cd genmap-build
$ cmake ../genmap -DCMAKE_BUILD_TYPE=Release
$ make genmap

You can install genmap as follows:

$ sudo make install
$ genmap

or run the binary directly:

$ ./genmap

If you are using a very old version of Git (< 1.6.5) the flag --recursive does not exist. In this case you need to clone the submodule separately before you can run cmake:

$ git clone https://github.com/cpockrandt/genmap.git
$ cd genmap
$ git submodule update --init --recursive

Requirements

Operating System: GNU/Linux, Mac
Architecture: Intel/AMD platforms that support POPCNT
Compiler: GCC ≥ 4.9, LLVM/Clang ≥ 3.8
Build system: CMake ≥ 3.0
Language support: C++14

First, you have to build an index of the fasta file(s) whose mappability you want to compute. This step only has to be performed once. You might want to check out the pre-built indices available for download.

$ ./genmap index -F /path/to/fasta.fasta -I /path/to/index/folder

A new folder /path/to/index/folder will be created to store the index and all associated files.

There are two algorithms that can be chosen for index construction: one uses RAM (divsufsort), the other uses secondary memory/disk space (skew). Depending on your quota and main memory limitations, you can choose the appropriate algorithm with -A divsufsort or -A skew. It is recommended to use divsufsort (the default setting). It needs about 6n space in main memory (or 10n for fasta files >2GB), where n is the number of bases in your fasta file(s). The actual amount might be more or less depending on the number and length of the individual sequences. If you are running out of memory, you can try to reduce memory consumption a bit by increasing -S, e.g., -S 20 (up to 64), although this will slow down the mappability computation.

Skew needs more space on disk, at least 25n. You can change the location of the temp directory via the environment variable TMPDIR (e.g., to choose a directory with more quota):

$ export TMPDIR=/somewhere/else/with/more/space
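To get a feel for the space figures above, here is a small back-of-the-envelope sketch. It assumes that "6n", "10n" and "25n" mean bytes per base and that "2GB" means 2 * 10^9 bytes of fasta file size; neither is spelled out here, so treat the results as rough estimates. The function names and the genome size are hypothetical.

```python
GB = 10**9  # assumption: decimal gigabytes


def divsufsort_ram_bytes(n_bases: int, fasta_bytes: int) -> int:
    """Approximate main memory for -A divsufsort: 6n, or 10n for fasta files >2GB."""
    factor = 10 if fasta_bytes > 2 * GB else 6
    return factor * n_bases


def skew_disk_bytes(n_bases: int) -> int:
    """Lower bound on temporary disk space for -A skew: at least 25n."""
    return 25 * n_bases


# Hypothetical genome with 3.1 billion bases; the fasta size is taken as
# roughly one byte per base.
n = 3_100_000_000
print(divsufsort_ram_bytes(n, fasta_bytes=n) / GB)  # 31.0
print(skew_disk_bytes(n) / GB)                      # 77.5
```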

To compute the (30,2)-mappability of the previously indexed genome, simply run:

$ ./genmap map -K 30 -E 2 -I /path/to/index/folder -O /path/to/output/folder -t -w -bg

This will create a text, wig and bedGraph file in /path/to/output/folder storing the computed mappability in different formats. You can omit formats that are not required by removing the corresponding flags -t -w or -bg.
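As one example of post-processing, the sketch below scans bedGraph lines for intervals with mappability 1. It assumes the standard four-column bedGraph layout (chrom, start, end, value, with 0-based half-open coordinates); the exact contents of GenMap's output are not described here, and the sample lines are invented for illustration.

```python
from typing import Iterable, Iterator, Tuple


def unique_regions(lines: Iterable[str],
                   threshold: float = 1.0) -> Iterator[Tuple[str, int, int]]:
    """Yield (chrom, start, end) for bedGraph intervals with value >= threshold."""
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional track/browser/comment lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        if float(value) >= threshold:
            yield chrom, int(start), int(end)


# Invented sample data, not real GenMap output.
sample = ["chr1\t0\t5\t1", "chr1\t5\t9\t0.5", "chr1\t9\t12\t1"]
print(list(unique_regions(sample)))  # [('chr1', 0, 5), ('chr1', 9, 12)]
```

For a real file, pass an open file handle instead of the sample list.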

To output the frequency instead of the mappability, add the flag -fl to the previous command.
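When running GenMap from a script, the command line can be assembled programmatically. The following sketch uses only the flags mentioned above (-K, -E, -I, -O, -t, -w, -bg, -fl); the helper name and its parameters are made up for illustration, and it only builds the command without running it.

```python
import shlex


def genmap_map_command(index_dir, output_dir, k, e,
                       formats=("t", "w", "bg"), frequency=False,
                       binary="./genmap"):
    """Build the argument list for 'genmap map' from the flags shown above."""
    unknown = set(formats) - {"t", "w", "bg"}
    if unknown:
        raise ValueError(f"unsupported output formats: {sorted(unknown)}")
    cmd = [binary, "map", "-K", str(k), "-E", str(e),
           "-I", index_dir, "-O", output_dir]
    cmd += [f"-{fmt}" for fmt in formats]  # -t (text), -w (wig), -bg (bedGraph)
    if frequency:
        cmd.append("-fl")  # output the frequency instead of the mappability
    return cmd


cmd = genmap_map_command("/path/to/index/folder", "/path/to/output/folder", 30, 2)
print(shlex.join(cmd))
# ./genmap map -K 30 -E 2 -I /path/to/index/folder -O /path/to/output/folder -t -w -bg
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the index exists.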

A detailed list of arguments and explanations can be retrieved with --help:

$ ./genmap --help
$ ./genmap index --help
$ ./genmap map --help

More detailed examples can be found in the Wiki.

Building an index on a large genome takes some time and requires a lot of space. Hence, we provide indexed genomes for download. If you need other genomes indexed and do not have the computational resources, please send an e-mail to christopher.pockrandt [ÄT] fu-berlin.de. The genomes were built with a higher sampling value (-S 20) to reduce the index size. To increase speed when computing the mappability and outputting csv files, you can build your own index with a lower sampling value. The genomes do not contain alt scaffolds (i.e., only chromosomes and unplaced/unlocalized fragments).

| Genome | Index size (compressed) | Download |
|--------|-------------------------|----------|
| Human GRCh38 [1] | 5.4 GB | GRCh38 index |
| Human hs37-1kg [2] | 5.4 GB | hs37-1kg index |
| Mouse GRCm38 | 4.9 GB | GRCm38 index |
| D. melanogaster dm6 | 0.2 GB | dm6 index |
| C. elegans ce11 | 0.1 GB | ce11 index |
| Wheat T. aestivum ta45 [3] | 21.9 GB | ta45 index |

[1] ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
[2] ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
[3] ftp://ftp.ensemblgenomes.org/pub/plants/release-45/fasta/triticum_aestivum/dna/Triticum_aestivum.IWGSC.dna.toplevel.fa.gz
