Giter Club home page Giter Club logo

rapmap's Introduction

Join the chat at https://gitter.im/COMBINE-lab/RapMap

What is RapMap?

RapMap is a testing ground for ideas in quasi-mapping and selective alignment. That means that, at this point, it is somewhat experimental. The develop branch will have the latest improvements and additions, but is not guaranteed to be stable between commits. Breaking changes to the master branch will be accompanied by a tag to the version before the breaking change. Currently, RapMap is a stand-alone quasi-mapper that can be used with other tools. It is also being used as part of Salmon and Sailfish. Eventually, the hope is to create and stabilize an API so that it can be used as a library from other tools.

Building RapMap

To build RapMap, you need a C++14 compliant compiler (g++ >= 4.9 and clang >= 3.4) and CMake (>= 3.9). RapMap is built with the following steps (assuming that path_to_rapmap is the toplevel directory where you have cloned this repository):

[path_to_rapmap] > mkdir build && cd build
[path_to_rapmap/build] > cmake ..
[path_to_rapmap/build] > make
[path_to_rapmap/build] > make install
[path_to_rapmap/build] > cd ../bin
[path_to_rapmap/bin] > ./rapmap -h

This should output the standard help message for rapmap.

Using RapMap

To use RapMap to map reads, you first have to index your reference transcriptome. Once the index is created, it can be used to map many different sets of reads. Assuming that your reference transcriptome is in the file ref.fa, you can produce the index as follows:

> rapmap quasiindex -t ref.fa -i ref_index

if you want to make use of a minimum perfect hash when indexing (which will lower the memory requirement during mapping), you can instead use the following command:

> rapmap quasiindex -t ref.fa -i ref_index -p -x 4

the -p option enables the minimum perfect hash and -x 4 tells RapMap to use up to 4 threads when building the perfect hash (you can specify as many or as few threads as you wish).

The index itself will record whether it was built with the aid of minimum perfect hashing or not, so no extra information concerning this need be provided when mapping. For the purposes of this example, we'll assume that we wish to map paired-end reads with the first mates in the file r1.fq.gz and the second mates in the file r2.fq.gz. We can perform the mapping like so:

> rapmap quasimap -i ref_index -1 r1.fq.gz -2 r2.fq.gz -s -t 8 -o mapped_reads.sam

This will tell RapMap to map the paired-end reads using 8 threads, and to write the resulting SAM records to the file mapped_reads.sam. The -s flag tells RapMap to use selective alignment to generate better mappings and to validate the alignment score of hits. The SAM format is rather verbose, and so such output files can be rather large (and slow to write) if you're mapping many reads. For that reason, we recommend that you use samtools to convert the SAM file to a BAM file on-the-fly. Assuming samtools is installed an in your path, that can be accomplished with the following command:

> rapmap quasimap -i ref_index -1 r1.fq.gz -2 r2.fq.gz -s -t 8 | samtools view -Sb -@ 4 - > mapped_reads.bam

This will stream the output from RapMap to standard out, and then convert it into a BAM file (using up to an additional 4 threads for BAM compression) and write the resulting output to the file mapped_reads.bam. To reduce the amount that needs to be typed in the common case, and to prevent the user from having to remember invocations like the above, we inclde a simple wrapper script that simplifies this process. After installing RapMap, there should be a script called RunRapMap.sh in the bin directory of whereever you have chosen to install RapMap. You can issue a command equivalent to the above using this scrpt as follows:

> RunRapMap.sh quasimap -i ref_index -1 r1.fq.gz -2 r2.fq.gz -s -t 8 --bamOut mapped_reads.sam --bamThreads 4

This will run RapMap with a command equivalent to the one mentioned above. If you leave out the --bamThreads argument, then a single thread will be used for compression. The RunRapMap.sh script can be used even if you don't wish to write the output to BAM format; in that case it is simply equivalent to running whichever command you pass with the rapmap executable itself.

Can I use RapMap for genomic alignment?

The index and mapping strategy employed by RapMap are highly geared toward mapping to transcriptomes. This means that RapMap will likely use a lot of memory when indexing and mapping to mammalian-sized genomes, though it's possible. We have succesfully applied RapMap to map reads to collections of baterial and viral genomes, however.

Caveats

RapMap is experimental, and the code, at this point, is subject to me testing out new ideas (see the description above about the master vs. develop branch). This also means that limited effort has been put into size or speed optimizaiton. There are numerous ways that the code can be sped up and the memory footprint reduced, but that hasn't been the focus yet --- it will be eventualy. All of this being said --- RapMap is open to the community because I'd like feedback / help / thoughts. A contribution policy is forthcoming. So, if you're not scared off by any of the above, please dig in!

License

Since RapMap uses Vinga's rank implementation, it must be released under the GPL. However, this is currently the only GPL dependency. If it can be replaced, I'd like to re-license RapMap under a BSD license. I'd be happy to accept pull-requests that replace this rank implementation with a library released under a more liberal license (BSD-compatible), but note that I will not accept such pull requests if they reduce the speed or increase the memory consumption over the current version.

rapmap's People

Contributors

flyingdeveloper avatar geetduggal avatar keyavi avatar lynxoid avatar rob-p avatar roryk avatar vals avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

rapmap's Issues

Check forward & RC logic

Since we recently changed to hashing only the canonical (lexicographically smaller) version of each k-mer, we should make sure the logic for assigning the k-mer the appropriate strand of the originating transcript (and determining the strand to which a read maps) is correct.

Please state clear license and copyright

I'm currently packaging RapMap for Debian and I was wondering if you could add a license statement (e.g. mentioning the GPL version used for now, etc. -- I am currently assuming it will be GPLv3 as in Jellyfish) as well as some indication of copyright attribution for the main RapMap tool. This would make it possible to include the tool in the Debian archive -- thanks. Ideally there would also be a new tagged release ;)

Output NH tag in SAM output

I just tried the 'quasialignment' (SA based) and it is positively amazing! (~15 seconds per sample to map)

Would it be possible to output an NH tag in the SAM output which tells how many locations a read maps to?

Support different output formats

Right now the output is in a format that was just simple for me (Rob) to test. We should look into outputting 2 different formats:

  1. An efficient binary format for potential use with tools that don't want to read in a full mapping file. This format should be as concise as possible to minimize communication overhead --- potentially just a vector of some standard form of hit objects.
  2. A variant of the pseudobam format that is supported by Kallisto. This is essentially SAM format, but discarding the information you don't get via lightweight / psedo-alignment. This format would be useful for other downstream analysis to e.g. look at the mappings with other tools.

map multiple pair-end reads file

Hi,

I have pair-end reads from different organs and want to map them together to transcriptomes. Does RapMap support to map them together in one run?

Bests,

TF

Figure out how to support a BSD or MIT license

Currently this is not possible as we rely on Jellyfish for the memory-efficient hash, parallel read parser, and basic k-mer operations. The read parser and (with some trouble) k-mer operations could be replaced with other alternatives or re-implemented. However, there's not a clear alternative for the hash right now, of which I know, that doesn't adversely affect either the memory consumption or speed.

mateIsFwd never gets a value set for unpaired hits

In RapMap v.0.5.0, the value of mateIsFwd is never set for unpaired hits in the function mergeLeftRightHits, in the file include/RapMapUtils.hpp. The function in question is below


        inline void mergeLeftRightHits(
                std::vector<QuasiAlignment>& leftHits,
                std::vector<QuasiAlignment>& rightHits,
                std::vector<QuasiAlignment>& jointHits,
                uint32_t readLen,
                uint32_t maxNumHits,
                bool& tooManyHits,
                HitCounters& hctr) {
            if (leftHits.size() > 0) {
                constexpr const int32_t signedZero{0};
                auto leftIt = leftHits.begin();
                auto leftEnd = leftHits.end();
                auto leftLen = std::distance(leftIt, leftEnd);
                if (rightHits.size() > 0) {
                    auto rightIt = rightHits.begin();
                    auto rightEnd = rightHits.end();
                    auto rightLen = std::distance(rightIt, rightEnd);
                    size_t numHits{0};
                    jointHits.reserve(std::min(leftLen, rightLen));
                    uint32_t leftTxp, rightTxp;
                    while (leftIt != leftEnd && rightIt != rightEnd) {
                        leftTxp = leftIt->tid;
                        rightTxp = rightIt->tid;
                        if (leftTxp < rightTxp) {
                            ++leftIt;
                        } else {
                            if (!(rightTxp < leftTxp)) {
                                int32_t startRead1 = std::max(leftIt->pos, signedZero);
                                int32_t startRead2 = std::max(rightIt->pos, signedZero);
                                bool read1First{(startRead1 < startRead2)};
                                int32_t fragStartPos = read1First ? startRead1 : startRead2;
                                int32_t fragEndPos = read1First ?
                                    (startRead2 + rightIt->readLen) : (startRead1 + leftIt->readLen);
                                uint32_t fragLen = fragEndPos - fragStartPos;
                                jointHits.emplace_back(leftTxp,
                                        startRead1,
                                        leftIt->fwd,
                                        leftIt->readLen,
                                        fragLen, true);
                                // Fill in the mate info
                                auto& qaln = jointHits.back();
                                qaln.mateLen = rightIt->readLen;
                                qaln.matePos = startRead2;
                                qaln.mateIsFwd = rightIt->fwd;
                                jointHits.back().mateStatus = MateStatus::PAIRED_END_PAIRED;

                                ++numHits;
                                if (numHits > maxNumHits) { tooManyHits = true; break; }
                                ++leftIt;
                            }
                            ++rightIt;
                        }
                    }
                }
                if (tooManyHits) { jointHits.clear(); ++hctr.tooManyHits; }
            }

            // If we had proper paired hits
            if (jointHits.size() > 0) {
                hctr.peHits += jointHits.size();
                //orphanStatus = 0;
            } else if (leftHits.size() + rightHits.size() > 0 and !tooManyHits) {
                // If there weren't proper paired hits, then either
                // there were too many hits, and we forcibly discarded the read
                // or we take the single end hits.
                auto numHits = leftHits.size() + rightHits.size();
                hctr.seHits += numHits;
                //orphanStatus = 0;
                //orphanStatus |= (leftHits.size() > 0) ? 0x1 : 0;
                //orphanStatus |= (rightHits.size() > 0) ? 0x2 : 0;
                jointHits.insert(jointHits.end(),
                        std::make_move_iterator(leftHits.begin()),
                        std::make_move_iterator(leftHits.end()));
                jointHits.insert(jointHits.end(),
                        std::make_move_iterator(rightHits.begin()),
                        std::make_move_iterator(rightHits.end()));
            }
        }

The problem occurs because in the final else if (leftHits.size() + rightHits.size() > 0 and !tooManyHits) statement, hits are appended to the end of joint hits without updating the unset mateIsFwd field. Instead mateIsFwd is 'randomly' filled with a garbage value which appears to have some dependency on the previous read which was aligned. The value which is assigned to mateIsFwd then has an impact on the SAM flag because mateIsFwd is used to determine the SAM flag in getSamFlags. Consequently, read SRR5237781.4 in problem4_1.fastq and problem4_2.fastq usually produces the following alignment on my machine:

SRR5237781.4	89	ENSMUST00000020488	36263	1	100M	=	36263	0	ATCCTTCTGAGGTTAGGGATCTAGGGGAAGTGGGGGATGTGCTCTGAATTAACTTTGTTAGGTAATTTAGAATATAGGAATCTCTAACATTGGACTGAAC	*	NH:i:1
SRR5237781.4	165	ENSMUST00000020488	36263	0	*	=	36263	0	GTAAGGCCAGCAATCCTTCTGAGGTTAGGGATCTAGGGGAAGTGGGGGATGTGCTCTGAATTAACTTTGTTAGGTAATTTAGAATATAGGAATCTCTAAC	*	NH:i:1

Whereas the same read in problem4_3read_1.fastq and problem4_3read_2.fastq usually produces this alignment (virtually identical except different SAM flags):

SRR5237781.4	121	ENSMUST00000020488	36263	1	100M	=	36263	0	ATCCTTCTGAGGTTAGGGATCTAGGGGAAGTGGGGGATGTGCTCTGAATTAACTTTGTTAGGTAATTTAGAATATAGGAATCTCTAACATTGGACTGAAC	*	NH:i:1
SRR5237781.4	181	ENSMUST00000020488	36263	0	*	=	36263	0	GTAAGGCCAGCAATCCTTCTGAGGTTAGGGATCTAGGGGAAGTGGGGGATGTGCTCTGAATTAACTTTGTTAGGTAATTTAGAATATAGGAATCTCTAAC	*	NH:i:1

This problem does not occur for proper paired hits, because mateIsFwd is correctly set in the if (!(rightTxp < leftTxp)) code block. On the develop-salmon branch, read SRR5237781.4 produces proper paired hits, so this test example does not work (or rather, it does work, because mateIsFwd is correctly set in if (!(rightTxp < leftTxp)) ). Unfortunately I have not yet been able to find as nice a test example for the develop-salmon branch yet, but from reading the code I believe that this might also be a problem on the develop-salmon branch. The fix would be to set the value of mateIsFwd in the final else if statement of mergeLeftRightHits().

problem4_1.fastq.gz
problem4_2.fastq.gz
problem4_3read_1.fastq.gz
problem4_3read_2.fastq.gz

Any experiences on using RapMap for downstream variant calling

Hi,
Do you have any experience in using the alignments produced by RapMap, perhaps with selective-alignment, for variant calling from RNA-seq? have you compared the results of variant calling with those produced with GSNAP/HISTA2 or another aligner used for RNA-seq data?

Looking forwards to your comments.

Kind Regards,

Mapping to repetitive regions

Hi Rob. I'm using the --writeMappings option in Salmon 0.7.2, but reporting this issue here since I think the code base is (will be) the same. I have a few genes that we've noticed (based on biological intuition) are called expression when they shouldn't be. The mapping data has really helped to diagnose these transcripts! Here's one example:

IGV mapped data

This transcript has some repetitive sequence near the 3' end, which is masked by RepeatMasker:

>ENSMUST00000034602.7
GAATACCCATGAGTCTGTTGGCGGAGGATAGTGCTAACCCTGGCTGATCATCCTGTGGCTTGCCTCTATTTGTTGCATGATTCCCTCTGGAAGATGGAACACAGCGGGATTCTGGCTAGTCTGATACTGATTGCTGTTCTCCCCCAAGGGAGCCCCTTCAAGATACAAGTGACCGAATATGAGGACAAAGTATTTGTGACCTGCAATACCAGCGTCATGCATCTAGATGGAACGGTGGAAGGATGGTTTGCAAAGAATAAAACACTCAACTTGGGCAAAGGCGTTCTGGACCCACGAGGGATATATCTGTGTAATGGGACAGAGCAGCTGGCAAAGGTGGTGTCTTCTGTGCAAGTCCATTACCGAATGTGCCAGAACTGTGTGGAGCTAGACTCGGGCACCATGGCTGGTGTCATCTTCATTGACCTCATCGCAACTCTGCTCCTGGCTTTGGGCGTCTACTGCTTTGCAGGACATGAGACCGGAAGGCCTTCTGGGGCTGCTGAGGTTCAAGCACTGCTGAAGAATGAGCAGCTGTATCAGCCTCTTCGAGATCGTGAAGATACCCAGTACAGCCGTCTTGGAGGGAACTGGCCCCGGAACAAGAAATCTTAAGATGGAGGTTTCTCAAAGTGCCATAGCAACGGCGCCTTCCCCTGTGATCAACCAATAAAGACGTTTCCTTCCGTCGGCAGGCGCCTGTGTTTTCCAAAGCATGGATTGAGAGTTGTCTTAGCTTGGCTGAGTTCTGCCCACCCGTGGCCCCTCACTTCTGGCTTCCTCTCCTCAGACCCAGACAGGCAGTACGAGGGACAGCTTCCCATCCTGGCCCCTTGGCATTTCTAACCCTGTGGGTTTCCTGTGGGCAGGGTCAGCCCAGCCACTCTGCTTCATTGCCTGTTGGTTCCAGTCCCTATCATAACAACTTCCGGGATTTTGCTGCCTTACCACAGCCCAAAGCAGGCTGTGGTGTTTAGAAGCTGAGCCACAAAGAAGTTTCCATGACATCATGAATGGGGGTGGCAGAGAAGAATATTGGGGCTCAGAGGgtgtgtgtgtgtgtgtgtgtgtgtgtgtgagtgagagagagagagagagagagagagagagagagagagagagacagacagagagacagagagacagaaagacagaaagacagaaagacagagacagagctgtcAAGaattttcttctcacttcgtgggttctggggattgtactaaggtcatgaggcttgtgtggcaagcacccttacccactgagccatcttgcctgcccTCATCCTCaaattaattaaaaataaagaacatgatgaattcc

The IGV screenshot has two tracks: the top track is using the transcript sequence as is, and the bottom is converting all of the RepeatMasker soft-masked bases to hard-masking ("N"). You can see that the mapping is greatly improved. I'm not sure other transcripts will benefit so much (or might even be harmed) by masking repetitive elements, but have you considered any way to deal with this issue? The reads mapping to this transcript are:

1       163 ENSMUST00000034602.7    146 1   80M =   319 253 AAGGGAGCCCCTTCAAGATACAAGTGACCGAATATGAGGACAAAGTATTTGTGACCTGCAATACCAGCGTCATGCATCTA    *   NH:i:1
2       83  ENSMUST00000034602.7    319 1   80M =   146 -253    ACAGAGCAGCTGGCAAAGGTGGTGTCTTCTGTGCAAGTCCATTACCGAATGTGCCAGAACTGTGTGGAGCTAGACTCCGG    *   NH:i:1
3       121 ENSMUST00000034602.7    976 1   125M    =   976 0   CCAGTTGGTTGTCTAGCACCAAATAGCTCAGAAAATACACATAAGTAATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACA   *   NH:i:1
4       377 ENSMUST00000034602.7    988 1   125M    =   988 0   CCCTGTATTATAATGACACTGGGATGTGAAGCCAGCTCTTCTCACTCCAGCCTGCTCTTTGTCTGCTTCGGCACACCTCTTATCACCCCTCAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG   *   NH:i:15
5       121 ENSMUST00000034602.7    1007    1   80M =   1007    0   GAACCGGGATTGGTCTATTTGTCCTTTTAGAGGCCAGACAACTGCATATGTGTGTGTGTGTGTGTGTGTGTGAGTGAGAG    *   NH:i:1
6       163 ENSMUST00000034602.7    1010    1   80M =   1018    88  ATACACATAAGTAATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGA    *   NH:i:1
7       121 ENSMUST00000034602.7    1017    1   80M =   1017    0   CAAATACTTTTAAAGCAGTCCAAGTTTATTTTCATTCGTGTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAG    *   NH:i:1
8       83  ENSMUST00000034602.7    1018    1   80M =   1010    -88 AAGTAATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGC    *   NH:i:1
9       121 ENSMUST00000034602.7    1021    1   80M =   1021    0   TAATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACA    *   NH:i:1
10      163 ENSMUST00000034602.7    1022    1   125M    =   1056    159 AATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACATGTATATGTATGTACATACATATGCACATAACAGCAATTAATGAAA   *   NH:i:1
11      105 ENSMUST00000034602.7    1024    1   80M =   1024    0   CCCGGCATAGTCTGATTTATTTATTTTTCATTGTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGA    *   NH:i:1
12      153 ENSMUST00000034602.7    1030    1   80M =   1030    0   GTTTCTTAATCTCTCTTTCTCTTTCTGTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGTCTTTTCTATACCAGTTTATA    *   NH:i:1
13      361 ENSMUST00000034602.7    1033    1   80M =   1033    0   GTCAGCGCTAAGAAGTCCTGTTAGACCGTGTGCGTGTGCGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG    *   NH:i:2
14      393 ENSMUST00000034602.7    1035    1   80M =   1035    0   TAATAATTCTCATAATTTTTTTCTCCTTACCATGTGCATGTGTGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG    *   NH:i:2
15      377 ENSMUST00000034602.7    1035    1   80M =   1035    0   GCAGAAAGTTAAATAAAAGGTAAGTACGTGTGTGTGGCCAAATCAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG    *   NH:i:15
16      361 ENSMUST00000034602.7    1036    1   80M =   1036    0   AGCGCTAAGAAGTCCTGTTAGACCGTGTGCGTGTGCGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA    *   NH:i:2
17      361 ENSMUST00000034602.7    1044    1   80M =   1044    0   CGGAAATGTGAAAGTTCTGAGACAGACATATAGACAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA    *   NH:i:7
18      153 ENSMUST00000034602.7    1050    1   80M =   1050    0   GTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACATGTATATGTATGTACATACATATGCACAT    *   NH:i:1
19      99  ENSMUST00000034602.7    1054    1   125M    =   1065    136 GTGTGAGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGACTAAGTTAAATTTAAGAGTTCCATCTGACTTGAATGAGCAGCACAAACTTAACACTAACCACAAAATACCTGGGGCTTTCT   *   NH:i:1
20      83  ENSMUST00000034602.7    1056    1   125M    =   1022    -159    AAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACATGTATATGTATGTACATACATATGCACATAACAGCAATTAATGAAAAAAGAGGACATGAATTTAAAAGAGCAAGGAGGGG   *   NH:i:1
21      105 ENSMUST00000034602.7    1056    1   80M =   1056    0   GTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGCGCGTGAGAGAGAGCGGGCGAGCACACTCTGTGCCTCTCAGGAAG    *   NH:i:1
22      121 ENSMUST00000034602.7    1056    1   80M =   1056    0   GTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGCAAGAGCGAGCGAGAGCGAAAGCATGTTCAGGT    *   NH:i:1
23      153 ENSMUST00000034602.7    1058    1   80M =   1058    0   GAGAAAATGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGCGAGCGCAGGCTGGTGCCTTTTTCAGTAACTTCCC    *   NH:i:1
24      147 ENSMUST00000034602.7    1065    1   125M    =   1054    -136    TGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGACTAAGTTAAATTTAAGAGTTCCATCTGACTTGAATGAGCAGCACAAACTTAACACTAACCACAAAATACCTGGGGCTTTCTCAGTCAGAAAC   *   NH:i:1
25      361 ENSMUST00000034602.7    1086    1   80M =   1086    0   GAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACAGTGTGCAAATGATGCAAAGCCTACTGCCTAAGTTGTATTT    *   NH:i:22
26      409 ENSMUST00000034602.7    1094    1   80M =   1094    0   GAGAGAGAGAGAGAGAGAGAGAGAGAGAGACAGAAACGTGACTTCCTCTTTTTCTATTTGTAGAACATTTATTTTCTTCC    *   NH:i:27
27      83  ENSMUST00000034602.7    1134    1   125M    =   1137    128 GACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAG   *   NH:i:1
28      163 ENSMUST00000034602.7    1137    1   125M    =   1134    -128    AGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAA   *   NH:i:1
29      99  ENSMUST00000034602.7    1137    1   80M =   1139    82  AGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGAC    *   NH:i:1
30      147 ENSMUST00000034602.7    1139    1   80M =   1137    -82 AAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAG    *   NH:i:1
31      105 ENSMUST00000034602.7    1141    1   125M    =   1141    0   AGACAGAAAGACAGAAAGACAGAAAGACAGAAACACATAGAGGTGTACTGTACAACTGGTAAGAAAAGGCATATTTTTTAAATATAGAAGTGCCTACATATAGAGTGTACTTGGGAAAGAAAAGA   *   NH:i:1
32      169 ENSMUST00000034602.7    1225    1   80M =   1225    0   TAACATCATCAGGCTTGGTGGCAAGCACCCTTACCCACTGAGCCATCTAATAGCTCTTCTCTTGAATTTTTTTAAACACC    *   NH:i:1

Are there windows binaries?

Hi, we are currently learning to use RapMap and SailFish and would like to know if someone has succeeded compiling and linking to the libraries under lib/ under Windows. Thanks for the great software,
Bruno

Illegal instruction with prebuilt binary

Hi Rob,

The prebuilt binary doesn't work on one system I am on:

Linux loge 2.6.32-504.12.2.el6.x86_64 #1 SMP Wed Mar 11 22:03:14 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

stracing it:

execve("/home/rdk4/miniconda/bin/rapmap", ["rapmap"], [/* 65 vars */]) = 0
brk(0)                                  = 0x283c000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f01681000
readlink("/proc/self/exe", "/home/rdk4/miniconda/bin/rapmap", 4096) = 31
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/home/rdk4/miniconda/bin/../lib/tls/x86_64/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/home/rdk4/miniconda/bin/../lib/tls/x86_64", 0x7fff767d89a0) = -1 ENOENT (No such file or directory)
open("/home/rdk4/miniconda/bin/../lib/tls/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/home/rdk4/miniconda/bin/../lib/tls", 0x7fff767d89a0) = -1 ENOENT (No such file or directory)
open("/home/rdk4/miniconda/bin/../lib/x86_64/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/home/rdk4/miniconda/bin/../lib/x86_64", 0x7fff767d89a0) = -1 ENOENT (No such file or directory)
open("/home/rdk4/miniconda/bin/../lib/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/home/rdk4/miniconda/bin/../lib", {st_mode=S_IFDIR|S_ISGID|0775, st_size=9075, ...}) = 0
open("/opt/gcc-5.2.0/lib64/tls/x86_64/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/opt/gcc-5.2.0/lib64/tls/x86_64", 0x7fff767d89a0) = -1 ENOENT (No such file or directory)
open("/opt/gcc-5.2.0/lib64/tls/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/opt/gcc-5.2.0/lib64/tls", 0x7fff767d89a0) = -1 ENOENT (No such file or directory)
open("/opt/gcc-5.2.0/lib64/x86_64/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/opt/gcc-5.2.0/lib64/x86_64", 0x7fff767d89a0) = -1 ENOENT (No such file or directory)
open("/opt/gcc-5.2.0/lib64/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/opt/gcc-5.2.0/lib64", {st_mode=S_IFDIR|S_ISGID|0775, st_size=3463, ...}) = 0
open("/opt/bcbio/centos/lib/tls/x86_64/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/opt/bcbio/centos/lib/tls/x86_64", 0x7fff767d89a0) = -1 ENOENT (No such file or directory)
open("/opt/bcbio/centos/lib/tls/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/opt/bcbio/centos/lib/tls", 0x7fff767d89a0) = -1 ENOENT (No such file or directory)
open("/opt/bcbio/centos/lib/x86_64/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/opt/bcbio/centos/lib/x86_64", 0x7fff767d89a0) = -1 ENOENT (No such file or directory)
open("/opt/bcbio/centos/lib/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/opt/bcbio/centos/lib", {st_mode=S_IFDIR|0775, st_size=402, ...}) = 0
open("/opt/lsf/7.0/linux2.6-glibc2.3-x86_64/lib/tls/x86_64/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/opt/lsf/7.0/linux2.6-glibc2.3-x86_64/lib/tls/x86_64", 0x7fff767d89a0) = -1 ENOENT (No such file or directory)
open("/opt/lsf/7.0/linux2.6-glibc2.3-x86_64/lib/tls/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/opt/lsf/7.0/linux2.6-glibc2.3-x86_64/lib/tls", 0x7fff767d89a0) = -1 ENOENT (No such file or directory)
open("/opt/lsf/7.0/linux2.6-glibc2.3-x86_64/lib/x86_64/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/opt/lsf/7.0/linux2.6-glibc2.3-x86_64/lib/x86_64", 0x7fff767d89a0) = -1 ENOENT (No such file or directory)
open("/opt/lsf/7.0/linux2.6-glibc2.3-x86_64/lib/libpthread.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/opt/lsf/7.0/linux2.6-glibc2.3-x86_64/lib", {st_mode=S_IFDIR|0755, st_size=3072, ...}) = 0
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=53831, ...}) = 0
mmap(NULL, 53831, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0f01673000
close(3)                                = 0
open("/lib64/libpthread.so.0", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000^\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=142688, ...}) = 0
mmap(NULL, 2212848, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0f01246000
mprotect(0x7f0f0125d000, 2097152, PROT_NONE) = 0
mmap(0x7f0f0145d000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17000) = 0x7f0f0145d000
mmap(0x7f0f0145f000, 13296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0f0145f000
close(3)                                = 0
open("/home/rdk4/miniconda/bin/../lib/libgomp.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/opt/gcc-5.2.0/lib64/libgomp.so.1", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340a\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=711939, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f01672000
mmap(NULL, 2230416, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0f01025000
mprotect(0x7f0f01045000, 2097152, PROT_NONE) = 0
mmap(0x7f0f01245000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x20000) = 0x7f0f01245000
close(3)                                = 0
open("/home/rdk4/miniconda/bin/../lib/librt.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/opt/gcc-5.2.0/lib64/librt.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/opt/bcbio/centos/lib/librt.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/opt/lsf/7.0/linux2.6-glibc2.3-x86_64/lib/librt.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/lib64/librt.so.1", O_RDONLY)     = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240!\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=43944, ...}) = 0
mmap(NULL, 2128816, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0f00e1d000
mprotect(0x7f0f00e24000, 2093056, PROT_NONE) = 0
mmap(0x7f0f01023000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6000) = 0x7f0f01023000
close(3)                                = 0
open("/home/rdk4/miniconda/bin/../lib/libm.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/opt/gcc-5.2.0/lib64/libm.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/opt/bcbio/centos/lib/libm.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/opt/lsf/7.0/linux2.6-glibc2.3-x86_64/lib/libm.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/lib64/libm.so.6", O_RDONLY)      = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0p>\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=596272, ...}) = 0
mmap(NULL, 2633912, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0f00b99000
mprotect(0x7f0f00c1c000, 2093056, PROT_NONE) = 0
mmap(0x7f0f00e1b000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x82000) = 0x7f0f00e1b000
close(3)                                = 0
open("/home/rdk4/miniconda/bin/../lib/libgcc_s.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/opt/gcc-5.2.0/lib64/libgcc_s.so.1", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260)\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0644, st_size=532820, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f01671000
mmap(NULL, 2185248, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0f00983000
mprotect(0x7f0f00999000, 2093056, PROT_NONE) = 0
mmap(0x7f0f00b98000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x15000) = 0x7f0f00b98000
close(3)                                = 0
open("/home/rdk4/miniconda/bin/../lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/opt/gcc-5.2.0/lib64/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/opt/bcbio/centos/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/opt/lsf/7.0/linux2.6-glibc2.3-x86_64/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/lib64/libc.so.6", O_RDONLY)      = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0p\356\1\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1920936, ...}) = 0
mmap(NULL, 3750152, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0f005ef000
mprotect(0x7f0f00779000, 2097152, PROT_NONE) = 0
mmap(0x7f0f00979000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x18a000) = 0x7f0f00979000
mmap(0x7f0f0097e000, 18696, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0f0097e000
close(3)                                = 0
open("/home/rdk4/miniconda/bin/../lib/libdl.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/opt/gcc-5.2.0/lib64/libdl.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/opt/bcbio/centos/lib/libdl.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/opt/lsf/7.0/linux2.6-glibc2.3-x86_64/lib/libdl.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/lib64/libdl.so.2", O_RDONLY)     = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\r\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=19536, ...}) = 0
mmap(NULL, 2109696, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0f003eb000
mprotect(0x7f0f003ed000, 2097152, PROT_NONE) = 0
mmap(0x7f0f005ed000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f0f005ed000
close(3)                                = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f01670000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f0166f000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f0166d000
arch_prctl(ARCH_SET_FS, 0x7f0f0166d7e0) = 0
mprotect(0x7f0f005ed000, 4096, PROT_READ) = 0
mprotect(0x7f0f00979000, 16384, PROT_READ) = 0
mprotect(0x7f0f00e1b000, 4096, PROT_READ) = 0
mprotect(0x7f0f01023000, 4096, PROT_READ) = 0
mprotect(0x7f0f0145d000, 4096, PROT_READ) = 0
mprotect(0x7f0f01682000, 4096, PROT_READ) = 0
munmap(0x7f0f01673000, 53831)           = 0
set_tid_address(0x7f0f0166dab0)         = 16752
set_robust_list(0x7f0f0166dac0, 0x18)   = 0
futex(0x7fff767d929c, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fff767d929c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, 7f0f0166d7e0) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigaction(SIGRTMIN, {0x7f0f0124bcb0, [], SA_RESTORER|SA_SIGINFO, 0x7f0f012557e0}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0x7f0f0124bd40, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x7f0f012557e0}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=10240*1024, rlim_max=RLIM_INFINITY}) = 0
open("/sys/devices/system/cpu", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
fcntl(3, F_GETFD)                       = 0x1 (flags FD_CLOEXEC)
readlink("/etc/malloc.conf", 0x7fff767d8060, 4096) = -1 ENOENT (No such file or directory)
mmap(NULL, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0efffeb000
munmap(0x7f0efffeb000, 4194304)         = 0
mmap(NULL, 8384512, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0effbec000
munmap(0x7f0effbec000, 81920)           = 0
munmap(0x7f0f00000000, 4108288)         = 0
open("/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 4
read(4, "0-23\n", 8192)                 = 5
close(4)                                = 0
mmap(NULL, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0eff800000
getdents(3, /* 35 entries */, 32768)    = 1056
getdents(3, /* 0 entries */, 32768)     = 0
madvise(0x7f0eff806000, 36864, MADV_DONTNEED) = 0
close(3)                                = 0
sched_getaffinity(16752, 8,  { ffffff }) = 8
futex(0x86200c, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0x862018, FUTEX_WAKE_PRIVATE, 2147483647) = 0
open("/dev/urandom", O_RDONLY)          = 3
fstat(3, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 9), ...}) = 0
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff767d8ff0) = -1 EINVAL (Invalid argument)
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f01680000
read(3, "\200\241\342VA\374F%\244\304:T\251\317\24\273O(\205\301\364\2053\221P\20m\227 i\221K"..., 4096) = 4096
--- SIGILL (Illegal instruction) @ 0 (0) ---
+++ killed by SIGILL +++
Illegal instruction

If I build it myself on the same system it works fine. The binary built as part of bioconda also has the same issue. Do you know what might be causing the difference?

Quasi-mapping and gene fusion detection

Hello! Thanks a lot for the great project, quasi-mapping is wicked fast!
I am looking for the information regarding usage of quasi-mapping in the context of gene fusion detection. Does RapMap allow for detection of split and spanning reads?

I'm failing to compile release 0.5

I've got a several-year old installation of Arch, and I'm trying to
compile RapMap to benchmark against some other aligners with in
silico tests. I'm assessing accuracy and recall as
"alignment/mapping overlaps origin of read" and "bedtools subtract".

Anyways, so to try and compile the 0.5 release of RapMap you linked
here, I've got some errors.

gcc -v is

Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/6.3.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc-multilib/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --enable-multilib --disable-werror --enable-checking=release
Thread model: posix
gcc version 6.3.1 20170306 (GCC)

clang -v is

clang version 3.9.1 (tags/RELEASE_391/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-pc-linux-gnu/6.3.1
Found candidate GCC installation: /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/6.3.1
Found candidate GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/6.3.1
Found candidate GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/6.3.1
Selected GCC installation: /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/6.3.1
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
clang-3.9: warning: argument unused during compilation: '-fPIE'
clang-3.9: warning: argument unused during compilation: '-pie'
clang-3.9: warning: argument unused during compilation: '-fstack-check'
clang-3.9: warning: argument unused during compilation: '-fstack-protector-strong'

cmake -V is 3.7.2


mkdir build && cd build of course works fine

cmake .. works fine, reports that it couldn't find Cereal or
Jellyfish and will fetch and build both.

make reports errors and warnings. It seems to start at step

[ 65%] Building CXX object src/CMakeFiles/rapmap.dir/RapMapIndexer.cpp.o
cd /home/zed/labDarach/hstAlingers/rapmapWork/compileStuff/RapMap-0.5.0/build/src && /usr/lib/hardening-wrapper/bin/c++    -I/home/zed/labDarach/hstAlingers/rapmapWork/compileStuff/RapMap-0.5.0/include -I/home/zed/labDarach/hstAlingers/rapmapWork/compileStuff/RapMap-0.5.0/external -I/home/zed/labDarach/hstAlingers/rapmapWork/compileStuff/RapMap-0.5.0/external/cereal/include -I/home/zed/labDarach/hstAlingers/rapmapWork/compileStuff/RapMap-0.5.0/external/install/include -I/home/zed/labDarach/hstAlingers/rapmapWork/compileStuff/RapMap-0.5.0/external/install/include/jellyfish-2.2.6  -march=native -pthread -funroll-loops -fPIC -fomit-frame-pointer -O4 -DHAVE_ANSI_TERM -Wall -std=c++11 -static-libstdc++ -Wno-unknown-pragmas -Wreturn-type -Werror=return-type -Wno-unused-local-typedefs   -o CMakeFiles/rapmap.dir/RapMapIndexer.cpp.o -c /home/zed/labDarach/hstAlingers/rapmapWork/compileStuff/RapMap-0.5.0/src/RapMapIndexer.cpp
In file included from /usr/lib/gcc/x86_64-pc-linux-gnu/6.3.1/include-fixed/limits.h:168:0,
                 from /usr/lib/gcc/x86_64-pc-linux-gnu/6.3.1/include-fixed/syslimits.h: ,
                 from /usr/lib/gcc/x86_64-pc-linux-gnu/6.3.1/include-fixed/limits.h:34,
                 from /usr/include/c++/6.3.1/climits:42,
                 from /home/zed/labDarach/hstAlingers/rapmapWork/compileStuff/RapMap-0.5.0/include/concurrentqueue.h:69,
                 from /home/zed/labDarach/hstAlingers/rapmapWork/compileStuff/RapMap-0.5.0/include/FastxParser.hpp:17,
                 from /home/zed/labDarach/hstAlingers/rapmapWork/compileStuff/RapMap-0.5.0/src/RapMapIndexer.cpp:41:
/home/zed/labDarach/hstAlingers/rapmapWork/compileStuff/RapMap-0.5.0/include/spdlog/fmt/bundled/format.h: In member function ‘void fmt::internal::ArgFormatterBase<Impl, Char>::visit_char(int)’:
/home/zed/labDarach/hstAlingers/rapmapWork/compileStuff/RapMap-0.5.0/include/spdlog/fmt/bundled/format.h:2198:24: error: expected unqualified-id before numeric constant
         const unsigned CHAR_WIDTH = 1;
                        ^

There's quite a few warnings, and then it ends with:

make[2]: *** [src/CMakeFiles/rapmap.dir/build.make:90: src/CMakeFiles/rapmap.dir/RapMapIndexer.cpp.o] Error 1
make[2]: Leaving directory '/home/zed/labDarach/hstAlingers/rapmapWork/compileStuff/RapMap-0.5.0/build'
make[1]: *** [CMakeFiles/Makefile2:232: src/CMakeFiles/rapmap.dir/all] Error 2
make[1]: Leaving directory '/home/zed/labDarach/hstAlingers/rapmapWork/compileStuff/RapMap-0.5.0/build'
make: *** [Makefile:164: all] Error 2

Any ideas on how to debug this, and fix it on my end or in the release?

Question: Unassigned Reads

I'm new to using rapmap, and I can't seem to find unassigned reads, Is there an option to include them in the sam output from rapmap?

FindJellyfish.cmake finds cereal, not jellyfish

The file cmake/Modules/FindJellyfish.cmake is satisfied when it finds the cereal includes. I don't have Jellyfish installed and was surprised that CMake reported that it found Jellyfish. It simply printed the path to the cereal headers.

Explore Kallisto-esq skipping & Equivalence Class pre-computation

Right now, we're using Geet's "dirty-mapping" strategy to try and determine the origin of reads (dirty mapping treats the left and right-most hashable k-mer of each read as "mini" paired-ends, and uses the information about the k-mers to which they map to figure out where the read goes). It would be useful to explore 2 particular ideas used in the psudeo-alignment procedure of Kallisto.

  • Break the transcriptome into simple paths of the de Bruijn graph. This will give us a data-driven rule to compute exactly how far one might need to "skip" along a read to find another informative k-mer. Note that while this idea is conceptually simple, there are definitely some tricky issues (e.g. we should properly handle the case where a transcript begins / ends in the middle of a simple path, as would happen with e.g. alternative start sites etc.). This means that a simple path in the dBG may contain multiple transcript cover-classes depending on where transcripts begin / end.
  • Pre-compute unique labels for every combination of transcripts we see in the dBG cover. Note, these are not exactly the same as the equivalence classes that e.g. Salmon uses during quantification since, there, intersecting the hits for the ends of a paired-end read could lead to unique combinations of transcripts. However, here, the idea would be to essentially label all of the transcript cover-classes with unique identifiers. This would make the intersection of the transcript sets for the "simple" case (both reads come from the same set of transcripts) trivial --- just comparing the equality of the labels.

Please support spdlog 1.2

Hi,
in the Debian packaging I formerly used the Debian packaged spdlog to avoid code copies as per Debian policy. Since spdlog in Debian switched to version 1.2 this does not work any more since I get build errors due to a change in the API of spdlog. I wonder whether you might like to update rapmap to work with spdlog 1.2.
Kind regards, Andreas.
PS: Is there any reason for tagging the RapMap releases with "salmon-" prefix?

Transcriptome clustering

I want to repeat the analysis from section "5 Application of quasi-mapping for clustering de novo assemblies", from the paper "RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq reads to transcriptomes. Bioinformatics (2016) 32 (12): i192-i200".

The paper states RapMap was used, but I can't find documentation on how to do the clustering with RapMap. I also found RapCluster github repo. Using SailFish + RapClust gives the same results as the paper?

files generated by RapMap produce inaccurate CIGAR strings

I am trying to test RapMap on higher values of k (>= 13) on wgsim-simulated Illumina paired reads, but when I try to convert the resulting SAM file to a BAM file with SAMtools I get a truncated file with this error:

[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] parse error at line 16
[main_samview] truncated file.

The line number varies. However, I looked at the file and it appears the file seems to truncate at lines like this: (2nd line in particular)

Pt_90296_90860_0:0:0_1:2:0_1 121 4 1159670 1 76M = 1159670 0 ACAACAGGCTATGGGGTGGTGATCCCGCTTATGGGGTCAAATCAATACGTTCTAAGAAGAAAGATTTGACAATAAA 22222222222222222222222222222222222222222222222222222222222222222222222222 NH:i:6

Pt_90296_90860_0:0:0_1:2:0_1 2485 4 1159670 0 76M = 1159670 0 CTAAGTAGCAGTCAAAAATTTGTATCCATTTTTCATGATATTATGCATGGATTAGATATATCATGGCGAA 2222222222222222222222222222222222222222222222222222222222222222222222 NH:i:6

Pt_90296_90860_0:0:0_1:2:0_1 2409 2 1026497 1 76M = 1026497 0 TTATTGTCAAATCTTTCTTCTTAGAACGTATTGATTTGACCCCATAAGCGGGATCACCACCCCATAGCCTGTTGTC 2222222222222222222222222222222222222222222222222222222222222222222222222222 NH:i:6

For whatever reason the CIGAR string length isn't matched to a read of the same length. This is not an issue on smaller k values, but there are so many primary + secondary reads in those files that it's hard to parse through them (e.g. 1M reads could have as many as 13M hits).

Reduce memory usage

Currently, the memory usage for the index is higher than it feels it should be. Building the index (again on the 213k transcript human transcriptome) seems to take ~12-13G of RAM. This is a large transcriptome, but the RAM usage could probably be reduced by a factor of 2-2.5x.

SAM format issue

Minor minor issue. When using samtools view on the RapMap output, I came across an error that SEQ was not of equal length to QUAL. Reading more, it seems that the SAM specification states for field 11:

This field can be a ‘*’ when quality is not stored. If not a ‘*’, SEQ 
must not be a ‘*’ and the length of the quality string ought to 
equal the length of SEQ.

So I perl'd a '*' into the 11-th position of the SAM file, and samtools (v1.2 using htslib 1.2.1 on arch linux) now converts it nicely to bam.

Am I right to think that your tool ought to stick a '*' in that position?

By the way, your program is obscenely fast.

Samtools fails parsing SAM due to CIGAR and query sequence are of different length

Here is a segment of an output sam from RapMap:

$ sed -n '104528,104540'p rapmap.pseudo.sam
@SQ SN:ERCC-00160   LN:719
@SQ SN:ERCC-00162   LN:499
@SQ SN:ERCC-00163   LN:519
@SQ SN:ERCC-00164   LN:999
@SQ SN:ERCC-00165   LN:848
@SQ SN:ERCC-00168   LN:999
@SQ SN:ERCC-00170   LN:999
@SQ SN:ERCC-00171   LN:481
SRR1043412.2:CELL_AGCCACT:UMI_GTAGA 66  ENSMUST00000068798  1369    255 46M *   0   0   GGGAGAAAGGGGCATTCTGGCACTTGGCATAACCTATGCCACAATG  eeeggegghiiihhihihhhhhihigfhfdghhihhiiaefghihi  NH:i:3
SRR1043412.2:CELL_AGCCACT:UMI_GTAGA 2370    ENSMUST00000161918  1251    255 46M *   0   0   GGGAGAAAGGGGCATTCTGGCACTTGGCATAACCTATGCCACAATG  eeeggegghiiihhihihhhhhihigfhfdghhihhiiaefghihi  NH:i:3
SRR1043412.2:CELL_AGCCACT:UMI_GTAGA 2370    ENSMUST00000160640  1480    255 46M *   0   0   GGGAGAAAGGGGCATTCTGGCACTTGGCATAACCTATGCCACAATG  eeeggegghiiihhihihhhhhihigfhfdghhihhiiaefghihi  NH:i:3
SRR1043412.4:CELL_AGCCACT:UMI_GGGGT 66  ENSMUST00000170335  931 255 35M941S *   0   0   TTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  ceeggfgehihiiihihgecccccccccccacccccccc_aWW[ac  NH:i:2
SRR1043412.4:CELL_AGCCACT:UMI_GGGGT 2370    ENSMUST00000185804  2093    255 46M *   0   0   TTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  ceeggfgehihiiihihgecccccccccccacccccccc_aWW[ac  NH:i:2

If I try to use htslib (e.g. samtools) to parse this, I get errors due to 35M941S not corresponding to 46 nts in the read.

$ samtools view rapmap.pseudo.sam | head
[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] parse error at line 104539
[main_samview] truncated file.
SRR1043412.2:CELL_AGCCACT:UMI_GTAGA 66  ENSMUST00000068798  1369    255 46M *   0   0   GGGAGAAAGGGGCATTCTGGCACTTGGCATAACCTATGCCACAATG  eeeggegghiiihhihihhhhhihigfhfdghhihhiiaefghihi  NH:i:3
SRR1043412.2:CELL_AGCCACT:UMI_GTAGA 2370    ENSMUST00000161918  1251    255 46M *   0   0   GGGAGAAAGGGGCATTCTGGCACTTGGCATAACCTATGCCACAATG  eeeggegghiiihhihihhhhhihigfhfdghhihhiiaefghihi  NH:i:3
SRR1043412.2:CELL_AGCCACT:UMI_GTAGA 2370    ENSMUST00000160640  1480    255 46M *   0   0   GGGAGAAAGGGGCATTCTGGCACTTGGCATAACCTATGCCACAATG  eeeggegghiiihhihihhhhhihigfhfdghhihhiiaefghihi  NH:i:3

If I change 35M941S to 35M11S the read parses fine. Is there a sensible reason for the S to be so long in the RapMap output?

-u always active, unaligned reads cause parse error in samtools

At the moment, RapMap always prints unaligned reads, even if the -u flag is not given.
The writeUnmapped variable seems to be set according to the flag, but is later ignored.
Whenever such an unaligned read entry is read by samtools for sam->bam conversion, samtools terminates with a "parse error on line xxxxx".

Happens on current master branch and with current samtools version.

How to determine the length of k-mer to index ?

rapmap quasiindex --help:
-k (positive integer less than 32), --klen (positive integer less than32), The length of k-mer to index.

But how to select the k for different genome, such as human, mouse and c.elegans ?

Thank you.

remove scaffold ?

Hi,
I downloaded reference genomes from Ensembl (fasta format).
But there are lots of sequences with name "dna:scaffold": https://github.com/CTLife/TEMP/tree/master/RefGenomes

Such as Mouse_GRCm38 (mm10), except chromosome 1-19, Mt, X and Y; others should be removed before mapping ? https://github.com/CTLife/TEMP/blob/master/RefGenomes/Mouse_GRCm38.p4.txt

Such as Human_GRCh38.p5 (hg38), https://github.com/CTLife/TEMP/blob/master/RefGenomes/Human_GRCh38.p5.txt, there are 516 sequences. In addition to chromosome 1-22, Mt, X and Y; others (such as CHR_HG2241_PATCH and KI270728.1) should be removed before mapping ?

textLen_ is uninitialised in the function lce()

textLen_ is uninitialised in the function lce()

In RapMap v.0.5.0, textLen_ is uninitialised in the function lce() in the file include/SASearcher.hpp. The function in question is in the code snippet below.

        OffsetT lce(OffsetT p1, OffsetT p2,
                    OffsetT startAt=0,
                    OffsetT stopAt=std::numeric_limits<OffsetT>::max(),
                    bool verbose=false) {
            std::string& seq = *seq_;
            std::vector<OffsetT>& SA = *sa_;
            OffsetT len = static_cast<OffsetT>(startAt);
            auto o1 = SA[p1] + startAt;
            auto o2 = SA[p2] + startAt;
            auto maxIndex = std::max(o1, o2);
            while (maxIndex + len < textLen_ and seq[o1+len] == seq[o2+len]) {
                if (seq[o1+len] == '$') { break; }
                if (len >= stopAt) { break; }
                ++len;
            }
            return len;
        }

The value of textLen_ is never set. In practice, if I modify the function to print the value of textLen_ as below, I find that textLen_ is always zero, meaning the central while loop is never executed. This is also the case on the develop-salmon branch (although under default command line options, lce is never actually executed on this branch). I am guessing that the aim of the first condition of the while loop (maxIndex + len < textLen_) is to stop the while loop from executing if we have reached the end of the transcriptome. If I am correct, this bug could be fixed by setting the value of textLen_ to the length of the transcriptome.


        OffsetT lce(OffsetT p1, OffsetT p2,
                    OffsetT startAt=0,
                    OffsetT stopAt=std::numeric_limits<OffsetT>::max(),
                    bool verbose=false) {

	    std::cerr << "In lce" << '\n';
	    std::cerr << "textLen_ = " << textLen_ << '\n';
            std::string& seq = *seq_;
            std::vector<OffsetT>& SA = *sa_;
            OffsetT len = static_cast<OffsetT>(startAt);
            auto o1 = SA[p1] + startAt;
            auto o2 = SA[p2] + startAt;
            auto maxIndex = std::max(o1, o2);
            while (maxIndex + len < textLen_ and seq[o1+len] == seq[o2+len]) {
                if (seq[o1+len] == '$') { break; }
                if (len >= stopAt) { break; }
                ++len;
            }
            return len;
        }

quasimapping, second read always gets not primary flag

hi,

I ran the following, using RapMap 0.2.0:

rapmap quasiindex -i quasiindex -t transcripts.fa
rapmap quasimap -o maps.sam -1 <(zcat ERR188021_1.fastq.gz | head -16000000) -2 <(zcat ERR188021_2.fastq.gz | head -16000000) -i quasiindex

And get the following output, where it seems that the second read in the pair always has the "not primary alignment" flag:

ERR188021.24    99      ENSG00000254244 790     1       75M     =       845     130     TTGATGGGCTCTACTTCTGATCTTGGTCACTGTGAAAAAATCAAGAAGGCCTGTGGAAATTTTGGCATTCCATGT     @CCFDFFFHHHHDIJJIJJJIGJIJJHIIJIJHGGIJJJJCGIIJGHGGIHIJDGIJIEIIIIJFHIJJHHHHHE     NH:i:1
ERR188021.24    2451    ENSG00000254244 845     1       75M     =       790     130     AAATTTTGGCATTCCATGTGAACTTCGAGTAACATCTGCGCATAAAGGACCAGATGAAACTCTGAGGATTAAAGA     HHGIGGJIGGJIJIGGIJIIHHFCGJGIGFJIIJIIHHGIIJIJJJJIHJHGGJIJJIJIJHHHHFFFFFFCCC~     NH:i:1
ERR188021.25    99      ENSG00000198211 2205    1       75M     =       2354    224     CGCCCCGCGGCCTGAAGATGTCGGCCACCTTCATCGGCAACAGCACGGCCATCCAGGAGCTGTTCAAGCGCATCT     @@CFFDDDHFDHHEGAEEGGFCDEIGIJCFHHD@DH8BHFBFC@@@?>@B:>AACDCB@8AC::CDDD:8;555>     NH:i:1
ERR188021.25    2451    ENSG00000198211 2354    1       75M     =       2205    224     AGTTCACCGAGGCCGAGAGCAACATGAACGACCTGGTGTCCGAGTACCAGCAGTACCAGGACGCCACGGCCGACA     C@CD@?DBADDBDCCC@CB@DDCCDEB:;ECHGGFF@ABBFBBBB?FAHEHEFD:DDEEDFF<DC8D?DDD?@@~     NH:i:1
ERR188021.33    83      ENSG00000198727 952     1       75M     =       916     111     GCCCACTAAGCCAATCACTTTATTGACTCCTAGCCGCAGACCTCCTCATTCTAACCTGAGTCGGAGGACAACCAA     BA9@@>@ACFDBEHEC=;EECHAE@7>GEIIIHFIGIIIFBFCJIIGHGHGHHEHGGIHGGGHFFHDDDDD<@@~     NH:i:1
ERR188021.33    2467    ENSG00000198727 916     1       75M     =       952     111     ATCCTCCATATATCCAAACAACAAAGCATAATATTTCGCCCACTAAGCCAATCACTTTATTGACTCCTAGCCGCA     B@@FFDFFGDHAFHIGBGIGGIGGHEGIIJDHJJJJHEIIJIFIGIGJIJGHGGGIGGGGGGHGHGGGIJIHHFF     NH:i:1
...

Maybe I'm missing something obvious!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.