marbl / mashmap Goto Github PK

View Code? Open in Web Editor NEW

265.0 265.0 39.0 26.79 MB

A fast approximate aligner for long DNA sequences

License: Other

C++ 93.26% C 2.95% Perl 3.14% Python 0.22% CMake 0.31% Shell 0.12%

mashmap's People

Contributors

Stargazers

Watchers

mashmap's Issues

Decontamination of bacterial sequences in an assembly

Hello,

Thanks very much for MashMap!

I have an assembly that is contaminated with lots of bacterial sequences, and I'd like to try MashMap to tease them out. I downloaded all complete RefSeq bacterial genomes, and put them in a single FASTA file. My assembly is highly fragmented, including 300k+ contigs. When I ran MashMap as below,

mashmap -r ${refseq} -q ${assembly} -s 500 -t ${NT} -o ${out}

the program consumes large amount of RAM (>= 200GB). My question is:

whether MashMap can be used for this kind of analysis?
if yes, should I split my bacterial database into chunks and do it with multiple batches?

Any advice would be appreciated!

cheers,
Simo

format of output file

Hi,

I am using mashmap with a small testing example as a start. My reference containing 100 sequences and query containing 5 sequences, and the query is exactly the first 5 sequences of the reference. I expect a perfect matching for each of the 5 query sequences. However, the first line of the output shows:
read1 2033 0 2032 + read1 2033 8 2040 98.9381 54 74 99.3103 0.943433

My questions are:
1, the length of read1 is 2033, so why the coordinates of target show 8 to 2040?
2, what does the last 5 numbers (98.9381 54 74 99.3103 0.943433) mean?

Thanks a lot!
Mingfu

/home/software/MashMap/mashmap: error while loading shared libraries: libgsl.so.23: cannot open shared object file: No such file or directory

Please excuse my lack of informatics skills, but I'm having trouble with mashmap installation.
I'm using CentOS 7.3.1611 with gcc 4.8.5 20150623.
I installed gsl-2.4, then configured mashmap with:
$./bootstrap.sh
$./configure --with-gsl=/home/software/gsl-2.4/
$make
$/home/software/mashmap -s S15-132.contigs.fasta -q S15_132.1.subreads.fastq -o test1

Returns:
/home/software/MashMap/mashmap: error while loading shared libraries: libgsl.so.23: cannot open shared object file: No such file or directory

Is MKL dependency missing on conda?

There was a user who hit a verkko error with mashmap installed through conda (see: marbl/verkko#162). Couldn't confirm mashmap version since it won't run to report version but I would guess this is not 2.0. I didn't see any reference to this library in the code so not sure why it's there in conda.

Filtering for finding duplications

Hi,

Suppose I would like to map a genome to itself in order to find duplications. What is the correct filtering option (-f, --filter_mode) for this use case?

Current bioconda recipe doesn't seem to work

I'm having trouble getting the current bioconda version of mashmap to work. Any advice?

See related bioconda-recipes issue bioconda/bioconda-recipes#32329

Much appreciated!
Mike

bottom K sketch time complexity

Dear MashMap team,

The bottom-k sketch used in both MashMap2 and MashMap3, it's time complexity is still O(dk), where d is the number of elements in set while k is the number of MinHashes or hash functions. I read that in the newest paper, each kmer is only need to be hashed once, since there will be many hash functions, this will still be O(dk) right. I am comparing to the original Minhash.

Thanks,

Jianshu

About mashmap out

Is there a manual to understand the output of the program?

Weird relationship between aligned segment length and read length

Hi,

I used mashmap to align a set of PacBio reads to a reference genome and noticed a weird pattern in the relationship between read length and the aligned segment length.
I then repeated the experiment using PacBio and ONT and comparing with minimap2 and I observe the same pattern for both sequencing data when using MashMap but not Minimap2.
- Here is read length plotted against the aligned segment length.

And here is read length plotted against the aligned segment length as a fraction of read size.

Maybe this is some known issue or maybe I'm doing something wrong but I thought I'd post this here in case you guys have any info.

Methods

I originally noticed the pattern in our PacBio RS II and Sequel I unpublished data, and then I used the E.coli samples described in the Canu documentation for reproducing it and reporting it here. These include Pacbio and ONT reads.
https://canu.readthedocs.io/en/latest/quick-start.html#quickstart
I used a conda install of mashmap 2.0 and ran it as follows:
mashmap -t 10 -s 5000 -r ecoli_ref.fasta -q oxford.fasta
I also used a conda install of minimap2 2.17 and ran it with either PB or ONT presets.
minimap2 -t 10 -cx map-pb ecoli_ref.fasta pacbio.fastq
I plotted alignments >= 5000 bp using ggplot's geom_point function.
I did not retain a single alignment per read for my analysis. However, the vast majority of reads seem to produce a single alignment.

Additional output

Hi,

Not a bug report, or an issue, I was just wondering if there is any way to get out information about the actual alignment other than the percent identity? For my purposes I need to be able to examine the differences between the two regions.

Is there a way to do this/do you have any plans for supporting it in the future?

Thanks!

Increased threshold for identity --pi generate more match dots

Hi there,
I ran mashmap to compare the the contigs of one species to the genome of another species. But I found the dotplot showed more match dots when I set --pi as 90% than that when --pi was set as 85% (any other parameters are identical). What I understand is that higher --pi sets more strict constraints on matching, and the plot is expected to show less matching dots. Could you please give any interpretation about it?

--pi 85

--pi 90

MashMap 2.0 Compiling error: undefined reference to '__wrap_memcpy'

MashMap version: 2.0
Using:
gcc/g++ 5.4.0
GSL 2.4
Boost/1.61.0

[a1695820@l01 MashMap-2.0]$ ./bootstrap.sh
[a1695820@l01 MashMap-2.0]$ ./configure --prefix=/home/a1695820/fastdir/SR531867/MashMap
checking for g++... g++
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking how to run the C++ preprocessor... g++ -E
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking zlib.h usability... yes
checking zlib.h presence... yes
checking for zlib.h... yes
checking gsl/gsl_cdf.h usability... yes
checking gsl/gsl_cdf.h presence... yes
checking for gsl/gsl_cdf.h... yes
configure: creating ./config.status
config.status: creating Makefile
[a1695820@l01 MashMap-2.0]$ make
g++ -O3 -DNDEBUG -std=c++11 -Wno-deprecated-declarations -Isrc -I /usr/local//include -include src/common/memcpyLink.h -Wl,--wrap=memcpy src/map/mash_map.cpp -o mashmap -L/usr/local//lib -lgsl -lgslcblas -lstdc++ -lz -lm -lpthread
/tmp/ccU9syl1.o:mash_map.cpp:function void std::__cxx11::basic_string<char, std::char_traits, std::allocator >::_M_construct<char*>(char*, char*, std::forward_iterator_tag) [clone .isra.299]: error: undefined reference to '__wrap_memcpy'
/tmp/ccU9syl1.o:mash_map.cpp:function kseq_read(kseq_t*): error: undefined reference to '__wrap_memcpy'
/tmp/ccU9syl1.o:mash_map.cpp:function kseq_read(kseq_t*): error: undefined reference to '__wrap_memcpy'
/tmp/ccU9syl1.o:mash_map.cpp:function __gnu_cxx::__normal_iterator<skch::MappingResult*, std::vector<skch::MappingResult, std::allocatorskch::MappingResult > > std::vector<skch::MappingResult, std::allocatorskch::MappingResult >::insert<__gnu_cxx::__normal_iterator<skch::MappingResult*, std::vector<skch::MappingResult, std::allocatorskch::MappingResult > >, void>(__gnu_cxx::__normal_iterator<skch::MappingResult const*, std::vector<skch::MappingResult, std::allocatorskch::MappingResult > >, __gnu_cxx::__normal_iterator<skch::MappingResult*, std::vector<skch::MappingResult, std::allocatorskch::MappingResult > >, __gnu_cxx::__normal_iterator<skch::MappingResult*, std::vector<skch::MappingResult, std::allocatorskch::MappingResult > >): error: undefined reference to '__wrap_memcpy'
collect2: error: ld returned 1 exit status
make: *** [mashmap] Error 1

Request: usage statement

The github page here is helpful but mashmap -h lacks a usage statement.

Scaffolding

Hi,

Can I use MashMap to scaffold contigs from a draft genome according to the reference?

Thanks,
Francesco

Segmentation fault (core dumped)

For context: I am attempting to create an augmented FASTA file to add decoy sequence to a Salmon index as noted in the release notes in the most recent version of Salmon (0.14.0): https://github.com/COMBINE-lab/salmon/releases/tag/v0.14.0

The authors provide a script that makes use of MashMap to do so here: https://github.com/COMBINE-lab/SalmonTools/blob/master/scripts/generateDecoyTranscriptome.sh

I get Segmentation fault (core dumped) when the script reaches the MashMap step at this line https://github.com/COMBINE-lab/SalmonTools/blob/23eac847decf601c345abd8527eed5dc1b382573/scripts/generateDecoyTranscriptome.sh#L105

This can be reproduced from the command line:

mashmap -r reference.masked.genome.fa -q Homo_sapiens.GRCh38.cdna.all.fa -t 8 --pi 80 -s 500
>>>>>>>>>>>>>>>>>>
Reference = [reference.masked.genome.fa]
Query = [Homo_sapiens.GRCh38.cdna.all.fa]
Kmer size = 16
Window size = 5
Segment length = 500 (read split allowed)
Alphabet = DNA
Percentage identity threshold = 80%
Mapping output file = mashmap.out
Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
Execution threads  = 8
>>>>>>>>>>>>>>>>>>
INFO, skch::Sketch::build, minimizers picked from reference = 985533927
Segmentation fault (core dumped)

Where the relevant input to generateDecoyTranscriptome.sh to generate reference.masked.genome.fa and the transcript fasta are:

Input	File Download
GTF	ftp://ftp.ensembl.org/pub/release-96/gtf/homo_sapiens/Homo_sapiens.GRCh38.96.gtf.gz
Genome FASTA	ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
Transcript FASTA	ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

I'm using a Docker image with the v2.0 release of MashMap. (It can be pulled from jtaroni/2019-chi-training and MashMap is installed like so: https://github.com/AlexsLemonade/RNA-Seq-Exercises/blob/d6e5f8627c75e55e572e9061f0498388ebb7d212/Dockerfile#L91).

This also occurs running on my Ubuntu 18.04 machine w/ 64GB RAM outside the container.

Any ideas about what may be happening would be appreciated. Thank you!

Save sketches for reuse

When the reference is large, a lot of time is spent during the sketch stage. Is it possible (currently or in the future) to save and reuse sketches across multiple runs?

Outputs lower than specific percent identity threshold

Mashmapped the 45S gene to an assembly, set --pi 95, but see outputs where the estimated ANI is below 95 included in mashmap.out.
Here is my command:

    mashmap \
        -t 32 \
        --noSplit \
        -q 45S.fa \
        -r $asm \
        -s 13332 \
        --pi 95 \
        -f none \
        -o 45S_to_$name.mashmap.out

Here is mashmap.out:

...
45S	13332	0	13332	+	chrY_pat	46954669	12702898	12716230	2	13332	11	id:f:0.914184	kc:f:0.718472
45S	13332	0	13332	-	chrY_pat	46954669	14412868	14426200	2	13332	11	id:f:0.914184	kc:f:0.718472
45S	13332	0	13332	+	chrY_pat	46954669	17742985	17756317	2	13332	11	id:f:0.914184	kc:f:0.718472
...

Is there a difference in the way ANI is estimated for the initial filtering of alignments vs for the final output that could explain this?

Also worth noting that mashmap -h says --noSplit is unnecessary because it is set by default, but the logs seem to show that without setting this flag, read splitting is allowed.

mashmap does not give results if input sequences are too short

Dear Developers,

I am using mashmap to mine homologous regions for my reference genes from genomes, and I have encountered a bug in the program. If one or both of the input sequences is shorter than a specific length then the program appears to run and exits with exit code 0, but does not produce any plots.

When using the following command:

mashmap -s 500 -r ref.fas -q target.fas -o output.mash

The ref.fas had to be at least 16100 bp long and the target.fas had to be at least 510 bp otherwise there was nothing in the output.

For me it would be already great if mashmap returned a different exit code than 0 in this case, because than I know that it failed because of input requirements and doesn't mean there are no homologous sequences.

generateDotPlot gives incomplete plot in postscript output

Hi,
Thank you very much for creating MashMap.

Running perl generateDotPlot 'postscript' 'medium' <mashmap ouptut file> gave me a .ps file where the plot is cropped.
The axis showing the reference is complete, but most of the axis showing the query is missing (only the largest few contigs are present in the plot).
This was also the case when using the 'large' option.

I imported the .ps file into GIMP 2.10.18. Changing the GIMP import rendering options (resolution, dimensions) did not make a difference. (*EDIT: GIMP cannot be the problem, I tried an online .ps to .jpg converter, the plot was incomplete)

The .png file created from the same MashMap output file is complete, no problem there.

What could be the issue? Is there anything I can do?

Best regards

Installation

Hello, noob on a steep learning curve here!
Not sure where to go next to compile the code for this?
I've installed:
autoconf and gcc8 (which has also installed gcc7 by default by the looks of it) using sudo apt
I've installed gsl-2.5 (in /usr/local/include/gsl/..) and zlib-1.2.11 (in /usr/local/include/zlib.h) using ./configure && make && sudo make install,
moved the source code into /usr/tmp and run ./bootstrap.sh, but this happens
:/tmp$ MashMap-2.0/bootstrap.sh
autoconf: error: no input file
Could I try the same with Boost? Any suggestion much appreciated, thanks.

Can't find precompiled binaries

Hi,
I can't find the precompiled binaries as mentioned in the Readme and I'm having problems compiling it myself. The release page only contains the sourcecode.

Cheers

Compilation issues under Archlinux using cmake

i am trying to package mashmap for Archlinux distribution. using the compiling instruction with recommended cmake command as below,

$ cmake -B build \
        -DCMAKE_BUILD_TYPE='Release' \
        -DCMAKE_INSTALL_PREFIX='/usr' \
        -Wno-dev
$ cmake --build build

while running aforementioned command the compilation is failing with following error

-- The C compiler identification is GNU 13.1.1
-- The CXX compiler identification is GNU 13.1.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- CMAKE_BUILD_TYPE: Release
/build/mashmap/src/MashMap-3.0.4/scripts/generate_git_version.sh: line 3: git: command not found
-- Configuring done (0.5s)
-- Generating done (0.0s)
-- Build files have been written to: /build/mashmap/src/MashMap-3.0.4/build
[ 16%] Building CXX object CMakeFiles/mashmap.dir/src/common/utils.cpp.o
/build/mashmap/src/MashMap-3.0.4/src/common/utils.cpp: In function ‘int64_t mashmap::handy_parameter(const std::string&)’:
/build/mashmap/src/MashMap-3.0.4/src/common/utils.cpp:14:9: error: ‘uint64_t’ was not declared in this scope
   14 |         uint64_t str_len = value.length();
      |         ^~~~~~~~
compilation terminated due to -Wfatal-errors.
make[2]: *** [CMakeFiles/mashmap.dir/build.make:76: CMakeFiles/mashmap.dir/src/common/utils.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:85: CMakeFiles/mashmap.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

could someone take a look and tell what could be the issue?

repo request: adding small example files to quickly check installation

Hi @cjain7

Thanks for the excellent tool!

A small request: could you include small example files such that users could quickly check that the installation was successful? This should also make the use case (e.g. input formats, etc.) more clear.

Naturally, trying mashmap --help gives you some idea, but it would be better to run on an example.

Also, please provide links to the publications you provide---that would make life easier.

Where is the “dependency-free linux and OSX binaries”

I tried all the files under "www.41.com" and couldn't find it

--pi changes reported percent identities

I ran the program using the commands:
mashmap -s 1000 -f none -r ref.fna -q qry.fna -o mash_map.tsv -t 1
and
mashmap -s 1000 -f none -r ref.fna -q qry.fna -o mash_map.tsv -t 1 --pi 94

Using the 1st command, the maximum reported % identity is ~91. Using the second command, there are numerous alignments to the same reference sequence with percent identities ranging from 91 to 95.

Here is an example of one alignment from the second command:

191	2173	1000	1999	-	2775506989\|Ga0263387_11	8363872	1652253	1653252	95.6678

A few questions: why are the alignments different using the --pi flag? Which alignments are correct? Why are alignments with % id < 94 reported when using the --pi 94 flag?

k-mer size

Dear concerned,

What is the largest value of k-mer works for MashMap? According to the help doc, we can use k-mer size up to 16. But I was running mashmap for some other experiments and found that k-mer size 20 also works.
Thanks in advance,
Tazin

mashmap-align very slow

Dear Mashmap3 team,

The mashmap command is very fast. However, mashmap.out to mashmap.sam file is extremely slow and use only one thread:

mashmap-align -s min17_fyle/DAS_bin_HQ_rename.fasta -q min17.noHost.sup1K.fastq --mappingFile mashmap.out --threads 24 --perc_identity 85

where mashmap.out is from the mashmap command. I am using the most recent version 3.0.5 compiled just now. Did I miss something? It is extremely slow to a level that it cannot be used in practice to have sam file for downstream analysis.

Thanks,

Jianshu

cmake doesn't look for htslib, so the build fails

In file included from /usr/ports/biology/mashmap/work/MashMap-3.1.0/src/map/mash_map.cpp:16:
In file included from /usr/ports/biology/mashmap/work/MashMap-3.1.0/src/map/include/winSketch.hpp:39:
/usr/ports/biology/mashmap/work/MashMap-3.1.0/src/common/seqiter.hpp:8:10: fatal error: 'htslib/faidx.h' file not found
#include <htslib/faidx.h>
         ^~~~~~~~~~~~~~~~
1 warning and 1 error generated.

output conversion to SAM/BAM file

Hi there,
I have a polyploid genome that I want to map to a haploid reference and want to calculate the depth/number of contigs aligning at each region. Is there a way to convert your output tsv file into a SAM file in order so simplify this?

Thanks!

perl script generateDotPlot fails

Hi
Using perl 5 version 24 results in the following error:

Can't use 'defined(%hash)' (Maybe you should just omit the defined()?) at bin/generateDotPlot line 576.

Which is not the only position. Removing defined() get rid of the error and works well.

Window and K-mer size

Hi MashMap,

Hope this email finds you well.

Just wondering is it possible to indicate a specific window size? Or, is it automatically calculated based on segment length?

From the "mashmap -h", I can see the -k (kmer), -s (seg length) and --pi (percent identity) can be modified but not the "Window Size".

I am trying to calculate the percentage of each 2Mb region (I assume this part would be "Window Size") covered by a query sequence of at least 1,000bp (I assume this part would be "seg length) and 95% identity (I assume this part would be "percent identity").

Looking forward to your reply!

Cheers,

Taek

Installation issue with cmake --build build

I am unable to execute the code 'cmake --build build' - I receive the following error:
Please could you suggest how I can install this program?

16%] Linking CXX executable bin/mashmap
CMakeFiles/mashmap.dir/src/map/mash_map.cpp.o: In function std::filesystem::__cxx$ /usr/local/include/c++/8.2.0/bits/fs_path.h:183: undefined reference to std::file$
CMakeFiles/mashmap.dir/src/map/mash_map.cpp.o: In function std::filesystem::__cxx$ /usr/local/include/c++/8.2.0/bits/fs_path.h:308: undefined reference to std::file$
CMakeFiles/mashmap.dir/src/map/mash_map.cpp.o: In function std::filesystem::__cxx$ /usr/local/include/c++/8.2.0/bits/fs_path.h:183: undefined reference to std::file$
CMakeFiles/mashmap.dir/src/map/mash_map.cpp.o: In function std::filesystem::__cxx$ /usr/local/include/c++/8.2.0/bits/fs_path.h:308: undefined reference to std::file$
CMakeFiles/mashmap.dir/src/map/mash_map.cpp.o: In function std::filesystem::__cxx$ /usr/local/include/c++/8.2.0/bits/fs_path.h:176: undefined reference to std::file$
CMakeFiles/mashmap.dir/src/map/mash_map.cpp.o:/usr/local/include/c++/8.2.0/bits/fs$
CMakeFiles/mashmap.dir/src/map/mash_map.cpp.o: In function std::filesystem::__cxx$ /usr/local/include/c++/8.2.0/bits/fs_path.h:1054: undefined reference to std::fil$
CMakeFiles/mashmap.dir/src/map/mash_map.cpp.o: In function std::filesystem::__cxx$ /usr/local/include/c++/8.2.0/bits/fs_path.h:176: undefined reference to std::file$
CMakeFiles/mashmap.dir/src/map/mash_map.cpp.o: In function std::filesystem::__cxx$ /usr/local/include/c++/8.2.0/bits/fs_path.h:544: undefined reference to std::file$
CMakeFiles/mashmap.dir/src/map/mash_map.cpp.o: In function std::filesystem::__cxx$ /usr/local/include/c++/8.2.0/bits/fs_path.h:824: undefined reference to std::file$
CMakeFiles/mashmap.dir/src/map/mash_map.cpp.o: In function std::filesystem::__cxx$ /usr/local/include/c++/8.2.0/bits/fs_path.h:183: undefined reference to std::file$
CMakeFiles/mashmap.dir/src/map/mash_map.cpp.o: In function std::filesystem::__cxx$ /usr/local/include/c++/8.2.0/bits/fs_path.h:1054: undefined reference to std::fil$
CMakeFiles/mashmap.dir/src/map/mash_map.cpp.o: In function std::filesystem::__cxx$ /usr/local/include/c++/8.2.0/bits/fs_path.h:176: undefined reference to std::file$
CMakeFiles/mashmap.dir/src/map/mash_map.cpp.o: In function std::filesystem::__cxx$ /usr/local/include/c++/8.2.0/bits/fs_path.h:544: undefined reference to std::file$
CMakeFiles/mashmap.dir/src/map/mash_map.cpp.o: In function std::filesystem::__cxx$ /usr/local/include/c++/8.2.0/bits/fs_path.h:824: undefined reference to std::file$
/usr/local/include/c++/8.2.0/bits/fs_path.h:824: undefined reference to std::file$ /usr/local/include/c++/8.2.0/bits/fs_path.h:824: undefined reference to std::file$
CMakeFiles/mashmap.dir/src/map/mash_map.cpp.o: In function std::filesystem::__cxx$ /usr/local/include/c++/8.2.0/bits/fs_path.h:183: undefined reference to std::file$
CMakeFiles/mashmap.dir/src/map/mash_map.cpp.o: In function std::filesystem::exist$ /usr/local/include/c++/8.2.0/bits/fs_ops.h:121: undefined reference to std::files$
collect2: error: ld returned 1 exit status
CMakeFiles/mashmap.dir/build.make:112: recipe for target 'bin/mashmap' failed
make[2]: *** [bin/mashmap] Error 1
CMakeFiles/Makefile2:84: recipe for target 'CMakeFiles/mashmap.dir/all' failed
make[1]: *** [CMakeFiles/mashmap.dir/all] Error 2
Makefile:135: recipe for target 'all' failed
make: *** [all] Error 2

All-against-all mappings

Hi,

Suppose that I have N long genomes (e.g. primates) and want to compute mappings between all pairs of the genomes. What is the best way to do so using mashmap? For example, is it a good idea to cat everything into a single FASTA file and then align it against itself, or it is the best to run the comparisons separately? And how would one adjust the parameters in such a case?

Compilation error

Hi,

I'm now trying to compile MashMap, and faced a compilation issue after following steps described in INSTALL.txt.
Could you please advise?

Used gcc v4.7.4/gsl v1.16, and static linking compilation was tried.
The error from compiler was shown below.
Thank you in advance.

In file included from src/map/include/computeMap.hpp:23:0,
from src/map/mash_map.cpp:17:
src/map/include/slidingMap.hpp: In instantiation of ‘void skch::SlideMapper<Q_Info>::init() [with Q_Info = skch::QueryMetaData<std::vectorskch::MinimizerInfo >]’:
src/map/include/slidingMap.hpp:106:11: required from ‘skch::SlideMapper<Q_Info>::SlideMapper(Q_Info&) [with Q_Info = skch::QueryMetaData<std::vectorskch::MinimizerInfo >]’
src/map/include/computeMap.hpp:642:41: required from ‘double skch::Map::computeJaccardSinglePos(Q_Info&, skch::seqno_t, skch::offset_t) [with Q_Info = skch::QueryMetaData<std::vectorskch::MinimizerInfo >; skch::seqno_t = int; skch::offset_t = int]’
src/map/include/computeMap.hpp:965:85: required from here
src/map/include/slidingMap.hpp:123:13: error: ‘skch::SlideMapper<skch::QueryMetaData<std::vectorskch::MinimizerInfo > >::MapType’ has no member named ‘emplace_hint’
make: *** [mashmap] Error 1

Mash distance?

Hello Team,

It turns out that the mash equation in map_stats.hpp is a different one comapre with the fastANI/Mashmap2 code. The correct one should be:

float mash_dist = (-1.0 / k) * log(2.0 * j/(1+j) );

while it is this in mashmap3:

float mash_dist = 1 - std::pow(2*j / (1+j), 1.0/k);

They are not the same to me. Is this on purpose or something because the original mash equation should be the first one.

Thanks,
Jianshu

Problemas with illumina files

Hello,

I want to use your code with illumina sequences encoded in Sanger / Illumina 1.9. In the case I have two files R1 and R2 with read length of 77 and I use as reference the human dna v37 (human_g1k_v37_decoy.fasta). I try to run the code with the comand
./mashmap -s ../ref/human_g1k_v37_decoy.fasta --ql input.txt -m 75 -o illu.txt
The input.txt has my two illumina files (sub_270654-003_4_S48_L004_R1_001.fastq.gz and sub_270654-003_4_S48_L004_R2_001.fastq.gz).

But the program consumes lots of memory (45 GB) and it terminates with a Killed 9.
The outputs of the console is this:
./mashmap -s ../ref/human_g1k_v37_decoy.fasta --ql input.txt -m 75 -o illu.txt

Reference = [../ref/human_g1k_v37_decoy.fasta]
Query = [../data/illumina/sub_270654-003_4_S48_L004_R1_001.fastq.gz, ../data/illumina/sub_270654-003_4_S48_L004_R2_001.fastq.gz]
Kmer size = 16
Window size = 2
Read length >= 75
Alphabet = DNA
P-value = 0.001
Percentage identity threshold = 85
Mapping output file = illu.txt

INFO, skch::Sketch::build, minimizers picked from reference = 1933153479
Killed: 9

Am I making something wrong?
Thanks.

segmentation fault with mashmap compare when empty file is passed to `-r` parameter

Hello! We are using mashmap in a snakemake workflow tool. Part of the workflow this tool orchestrates downloads a genome from genbank and maps it against a user-specified genome. Our tool accidentally created some empty genbank genome files but carried on running, leading us to a situation where mashmap compare was run using an empty genome for the -r parameter. When we did this, we got an ~instantaneous segmentation fault.

$ mashmap -q ref_genomes/GCA_900112995.1_genomic.fna.gz -r genbank_genomes/GCA_008680295.1_genomic.fna.gz -o output.genbank_genomes/stage2/GCA_900112995.1_genomic.fna.gz.x.GCA_008680295.1.mashmap.align --pi 95 > output.genbank_genomes/stage2/GCA_900112995.1_genomic.fna.gz.x.GCA_008680295.1.mashmap.out
Segmentation fault

$ ls -lh /home/tereiter/github/2020-ibd/genbank_genomes/GCA_900112995.1_genomic.fna.gz
-rw-r--r-- 1 tereiter tereiter 806K May 15 14:20 /home/tereiter/github/2020-ibd/genbank_genomes/GCA_900112995.1_genomic.fna.gz

$ ls -lh genbank_genomes/GCA_008680295.1_genomic.fna.gz
-rw-r--r-- 1 tereiter tereiter 0 May 27 09:32 genbank_genomes/GCA_008680295.1_genomic.fna.gz

We documented this behavior in an issue here. We just wanted to make the mashmap community aware of this as well!

Problem with linux binary distribution for mashmap 3

Was trying to use mashmap binary provided with mashmap3 release, and it seems that it is not dependence-free.
Still require libgsl.so.25, libopenblas and, after fixing that, specific versions of stdlib

./mashmap -r chm13v2.0.fa -q chm13v2.0.fa
./mashmap: /usr/local/GCC/9.2.0/lib64/libstdc++.so.6: version GLIBCXX_3.4.30' not found (required by ./mashmap) ./mashmap: /usr/local/GCC/9.2.0/lib64/libstdc++.so.6: version GLIBCXX_3.4.29' not found (required by ./mashmap)
./mashmap: /usr/local/GCC/9.2.0/lib64/libstdc++.so.6: version `CXXABI_1.3.13' not found (required by ./mashmap)

Typo in help for command "index"

There is a typo for metamaps index -h

It says:

--maxmemory <value>, --mm <value>
    maximum memory, in GB [default e-03]

The [default e-03] is from the p-value parameter.

Ordering output with reference sequences

Hey, thank you for this wonderful package! It has saved many headaches.

I am wondering, how does MashMap decide on the ordering of the reference sequences (and, in turn the order of the query)? I am mapping a plant genome assembly (~1956 contigs) to a reference assembly of a closely related species with 12 chromosomes. The output I get is not in the same order that the 12 chromosomes appear in the reference .fasta file - they are all jumbled up (i.e. X-axis reads: Chromosome 1 -> 3 -> 2 -> 4 -> 6 -> 5 -> 7 -> 11...etc.).

I was wondering if you know of a way to force MashMap to output a line that is arranged such that the reference chromosomes appear in order? I have attached the output I currently have to this post.
oryza_hypo_contigs_vs_nipp_200717_8 copy.pdf

I don’t think there should be any order to the contigs (I.e. order to the Y axis) between chromosomes, so I don’t know why MashMap has decided on the ordering. Does that make sense?

Thank you in advance,
Aaron :)

Impact of filtering options on genome alignment

Hi,

I'm interested in aligning microbial assemblies/closed genomes using MashMap. I am unsure of the difference between the various filtering options after trying to compare a few genomes. Do you recommend using one-to-one mapping or none if I am interested in finding all blocks in both query/reference genomes that probably align?

libgsl.so.25: cannot open shared object file

I have installed mashmap through conda, with version mashmap-2.0-pl5321h8e5b204_8, but it shows error " libgsl.so.25: cannot open shared object file". I checked on anaconda that the requirement of this version mashmap, it said 2.8>gsl>=2.7, and the file in /env/mashmap/bin is libgsl.so.27.

Questions about the alignment of MashMap

Hello,

Following #6, I checked the outputs of MashMap using blastn and got some quesions with the alignment. I think I shoud open a new issue to understand the ouput of MashMap.

I wanted to remove posssible contaminants within the PacBio data, which was from several insects (whole organisms were used to extract DNA, so there maybe some bacteria DNA). I downloaed all the archaea, bacteria, fungi, protozoa, and viral sequences and merged them together as a contamination library. I also included mitochondrion sequences of the insect into the library.

Then I used MashMap with different parameters and got several outputs:

# run1, with default parameters
$path2mashmap -r $contaminants -q third_all.fasta -o mashmap.out

# run2, with -s 2500 --pi 80
$path2mashmap -t 8 -r $contaminants -q third_all.fasta -s 2500 --pi 80 -o mashmap2.out

# run3, with -s 500 --pi 85
$path2mashmap -t 8 -r $contaminants -q third_all.fasta -s 500 --pi 85 -o mashmap3.out

And the outputs from three runs varied:

# there are 6633142 sequences of input
$ grep -c '>' third_all.fasta 
6633142

# run1
cut -f 1 -d ' ' mashmap.out |sort|uniq|wc -l
463569

# run2
cut -f 1 -d ' ' mashmap2.out |sort|uniq|wc -l
2821004

# run3
cut -f 1 -d ' ' mashmap3.out |sort|uniq|wc -l
6189307

As can be seen, nearly all the sequences were aligned to contaminant library. That really shocked me.

Then I checked the top 10 sequences with highest identity and top 10 ones with loweset identity from the first run using blastn.

$ sort -r -t ' ' -k10 mashmap.out|head
m161111_191517_42256_c101055312550000001823247601061715_s1_p0/135507/26123_34791 8668 0 8667 + AV_I75 15629 300 8661 94.3979
m161111_191517_42256_c101055312550000001823247601061715_s1_p0/118488/0_10466 10466 0 4999 + AV_I338 16706 10618 15617 93.7729
m161111_062618_42256_c101055312550000001823247601061713_s1_p0/84256/10406_16318 5912 0 5911 + AV_I336 15276 533 6284 93.5688
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/108916/11216_16430 5214 0 5213 - kraken:taxid|163164|NC_002978.6 1267782 837214 842416 93.4375
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/29550/13832_24025 10193 0 10192 - AV_I75 15629 5633 15425 93.3745
m161123_002016_42256_c101049952550000001823247601061782_s1_p0/107348/0_13397 13397 0 4999 + kraken:taxid|1633785|NZ_CP011148.1 1267840 408697 413696 93.326
m161118_115530_42256_c101055932550000001823247601061713_s1_p0/96135/27165_36074 8909 0 8908 - kraken:taxid|615|NZ_CP021984.1 5241555 4731262 4740479 93.2709
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/37699/0_7490 7490 0 7489 + AV_I337 13969 4110 11232 93.2274
m161118_115530_42256_c101055932550000001823247601061713_s1_p0/148198/23006_28097 5091 0 5090 - AV_I336 15276 2279 7323 93.1338
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/86966/0_10197 10197 5197 10196 - kraken:taxid|1633785|NZ_CP011148.1 1267840 642883 647882 93.0843

$ sort -t ' ' -k10 mashmap.out|head > mashmap.lowest
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/100026/37064_49118 12054 7054 12053 + kraken:taxid|336810|NZ_CP021172.1 218034 75901 80900 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/100308/48537_58156 9619 4619 9618 - kraken:taxid|880070|NC_015914.1 6221273 4110216 4115215 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/103161/48125_53357 5232 232 5231 + kraken:taxid|336810|NZ_CP021172.1 218034 75901 80900 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/104812/38481_50150 11669 0 4999 - kraken:taxid|877455|NC_015216.1 2583753 1814585 1819584 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/104851/0_12783 12783 7783 12782 - kraken:taxid|76857|NZ_CP022123.1 2521394 1534616 1539615 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/104909/21773_32004 10231 0 4999 - kraken:taxid|134821|NZ_CP021988.1 722452 74550 79549 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/105012/0_20923 20923 0 4999 + kraken:taxid|552509|NC_033778.1 111453 67843 72842 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/105790/29860_45799 15939 0 4999 - kraken:taxid|1360|NZ_CP025500.1 2346663 699615 704614 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/105963/29235_38484 9249 4249 9248 - kraken:taxid|29430|NZ_CP018260.1 3267348 1015045 1020044 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/106112/0_20765 20765 0 4999 + kraken:taxid|2017483|NZ_CP022315.1 4071214 1765868 1770867 80.8247

The highest ones were fine. There were some differences between hits reported by blastn and MashMap, but maybe it's because they used different databases. But the loweset ones were problematic. Most of them were 'No significant similarity found' when default parameters of blastn were used. And when I unselected 'Low complexity regions', the alignments were unreliable. There maybe something with 'low complexity regions' or 'repeat' somthething.

So my question are:

If blastn cannot find any significant alignment, why MashMap could?
In my case, what percentage identity threshold do you think is proper for removing contaminants?

Thank you! Sorry if I missed something.

Bests,
Yiwei Niu

Malformed line in output?

Hi there
I was computing an alignment between two chromosomes (human chr17 and gorilla chr5)

I ran this command

mashmap -r hg38_chr17.fa.gz -q gorgor6_chr5_2.fa.gz

The outputted file has a line where not all fields were filled in though

screenshot

can attach full file if interested

Interpretation of gnuplot

Dear MashMap,

Thank you for the great program. It was really easy to use.
I have a quick questions about the outcome of gun plot (plot dot graph).

If you look at the png file, it shows a different colours (light blue and purple) and and direction (vertically and/or inverted).

Could you please provide me a bit more explanation about this?

Many thanks in advance!

Cheers,

Taek

about mashmap.out result

the results looks like strange，Why can sequences with length 1914830 be fully aligned to sequences with length 2645 ？

generateDotPlot for publication: png sizes and axis labels

Thank you for making this excellent software and visualizer. The only issue I've run into is that it's difficult to prepare publication-quality figures because the png image sizes are hard-coded, and the font face/size parameters aren't included in out.gp for the png format.

I submitted a pull request with a few modifications to widen the margins and pass the font face and size variables to gnuplot, but I also wanted to point out that the image dimensions are hard-coded:
line 747: "$PNG tiny size $SIZE,$SIZE" : "$PNG small";

Additionally, in case it's helpful to other users, I wanted to note that you can put the labels at an angle by modifying:
line 822: print GFILE "set xtics rotate ( \\n";
to: print GFILE "set xtics rotate by -25 ( \\n"
This will give a -25 degree rotation, for example. For the y-axis, the corresponding line is 845.

Cheers and thank you again for developing this elegant and highly useful software!

marbl / mashmap Goto Github PK

mashmap's People

Contributors

Stargazers

Watchers

Forkers

mashmap's Issues

Methods

Recommend Projects

Recommend Topics

Recommend Org