parbliss / fastani

Fast Whole-Genome Similarity (ANI) Estimation

License: Apache License 2.0

Makefile 0.17% Shell 0.39% M4 0.40% C++ 92.87% C 4.36% R 0.25% CMake 1.04% Python 0.53%

microbial-genomics

fastani's Issues

no output file

I tried to run fastANI with the following command:

fastANI -q fasta1.fas -r fasta2.fas -o fastani.out

I get the following output in STDOUT:

>>>>>>>>>>>>>>>>>>
Reference = [fasta1.fas]
Query = [fasta2.fas]
Kmer size = 16
Fragment length = 3000
ANI output file = fastani.out
>>>>>>>>>>>>>>>>>>
INFO, skch::Sketch::build, minimizers picked from reference = 3493
INFO, skch::Sketch::index, unique minimizers = 3363
INFO, skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 3263) ... (3, 30)
INFO, skch::Sketch::computeFreqHist, With threshold 0.001%, consider all minimizers during lookup.
INFO, skch::main, Time spent sketching the reference : 0.0149605 sec
INFO, skch::main, Time spent mapping fragments in query #1 : 0.0218417 sec
INFO, skch::main, Time spent post mapping : 0.000101473 sec

However, fastani.out stays empty: du -h reports a size of 0.

I have tried release versions 1.0, 1.1 and even git clone. I can execute fastANI but the output file never gets written.
The two sequences I am testing are roughly 70 kb each and should have a Mash distance of <0.1, so they should be highly similar. Any ideas on what is going on?

Thanks

Generate distance matrix rather than pairwise

Would it be possible to add an option to output a distance matrix in TSV or CSV instead of the pairwise list?

    A    B    C  
A   100  83   71 
B       100   92
C            100

It could be upper triangle, lower triangle, or both.
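
The pairwise list can also be reshaped into a matrix after the run. A minimal sketch, assuming whitespace-separated output columns in the order query, reference, ANI, mapped fragments, total fragments (as in the result lines quoted elsewhere on this page) and paths without spaces; `pairwise_to_matrix` and `format_matrix` are hypothetical helpers, not part of FastANI:

```python
def pairwise_to_matrix(lines):
    """Build {query: {reference: ani}} from FastANI-style pairwise lines.

    Assumes whitespace-separated columns: query, reference, ANI,
    mapped fragments, total fragments (paths must not contain spaces).
    """
    matrix = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        query, ref, ani = fields[0], fields[1], float(fields[2])
        matrix.setdefault(query, {})[ref] = ani
    return matrix


def format_matrix(matrix, missing="NA"):
    """Render the nested dict as a TSV table; unreported pairs become `missing`."""
    names = sorted(set(matrix) | {r for row in matrix.values() for r in row})
    rows = ["\t".join([""] + names)]
    for q in names:
        cells = [str(matrix.get(q, {}).get(r, missing)) for r in names]
        rows.append("\t".join([q] + cells))
    return "\n".join(rows)


# Illustrative input only; these values are made up for the example.
example = [
    "A.fna\tB.fna\t83.0\t500\t1000",
    "B.fna\tC.fna\t92.0\t800\t1000",
]
table = format_matrix(pairwise_to_matrix(example))
print(table)
```

The sketch writes both triangles and leaves the diagonal as NA unless a genome was compared against itself; filling it with 100 or keeping only one triangle is a small change to `format_matrix`.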

ERROR, skch::validateInputFiles

I'm getting the following error when trying to run FastANI:

>>>>>>>>>>>>>>>>>>
Reference = [/Users/viniWS/Bio/gtdb/fasta/NC_006576.1.fasta, /Users/viniWS/Bio/gtdb/fasta/NC_007335.2.fasta, /Users/viniWS/Bio/gtdb/fasta/NC_007513.fasta]
Query = [Bio/gtdb/fasta/NC_005042.fna]
Kmer size = 16
Fragment length = 3000
Threads = 1
ANI output file = Bio/gtdb/data/ani_test_out.txt
>>>>>>>>>>>>>>>>>>
ERROR, skch::validateInputFiles, Could not open /Users/viniWS/Bio/gtdb/fasta/NC_007513.fasta

This is the content for head /Users/viniWS/Bio/gtdb/fasta/NC_007513.fasta:

>NC_005042.1 Prochlorococcus marinus subsp. marinus str. CCMP1375 complete genome
AAAGCTAGATGGCAGAAAGGTTTTTGAATAATTTCCACAGATTCCACAAGACCTACTACT
ACTGTATTAATTTCATATAATTAAATTAGAATTACTAGAAGAAGAGAAAACTTTTATTAA
AGCTATGAAAACTTTTTGTTCCTTTTTTGGTATTTCGTAGTTATGTTGAACCGATGAAAC
TTGTTTGTTCTCAAATTGAGCTCAATACAGCTCTTCAACTAGTTAGTAGAGCTGTAGCCA
CTAGGCCTTCGCATCCAGTATTGGCAAATGTATTGCTTACTGCTGATGCGGGAACTGGAA
AACTAAGCTTAACTGGATTTGATCTTAATTTAGGAATTCAAACATCGCTTAGTGCTTCTA
TCGAAAGTAGTGGAGCAATTACAGTTCCCTCAAAACTTTTCGGAGAAATAATATCAAAAT
TATCTAGTGAATCTTCTATAACTTTATCAACAGATGATTCTAGTGAACAAGTTAATTTAA
AGAGTAAAAGTGGAAATTATCAGGTAAGAGCTATGAGTGCTGATGATTTTCCTGATTTGC

This is the content of Bio/gtdb/data/ani_test.txt:

/Users/viniWS/Bio/gtdb/fasta/NC_006576.1.fasta
/Users/viniWS/Bio/gtdb/fasta/NC_007335.2.fasta
/Users/viniWS/Bio/gtdb/fasta/NC_007513.fasta

Any ideas why this might be happening? I did some test runs on small-ish fragments (1500 bp) and it worked fine, but when using 'real world' genomes, I always get this error.

FastANI now packaged in BrewSci

FYI
See: https://github.com/brewsci/homebrew-bio/blob/master/Formula/fastani.rb

brew tap brewsci/bio
brew install fastani

Thank you to @sjackman for helping me through it.

Output is still empty after two days running

command:
fastANI --ql xx --rl xx -t 23 --matrix -o fastANI.out

But I only get the following log, which never updates:

Reference = [total 1125 fasta seq]
Query = [total 5000 fasta seq]
Kmer size = 16
Fragment length = 3000
Threads = 23
ANI output file = fastANI.out

INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 44870461
INFO [thread 0], skch::Sketch::index, unique minimizers = 32929725
INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 25158025) ... (434, 1)
INFO [thread 0], skch::Sketch::computeFreqHist, With threshold 0.001%, ignore minimizers occurring >= 43 times during lookup.
INFO [thread 0], skch::main, Time spent sketching the reference : 147.643 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 0.802434 sec
INFO [thread 0], skch::main, Time spent post mapping : 0.000257816 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #2 : 2.47368 sec
INFO [thread 0], skch::main, Time spent post mapping : 0.000986272 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #3 : 1.4448 sec
INFO [thread 0], skch::main, Time spent post mapping : 0.000543327 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #4 : 1.10718 sec
INFO [thread 0], skch::main, Time spent post mapping : 0.000373789 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #5 : 1.60302 sec
INFO [thread 0], skch::main, Time spent post mapping : 0.000503294 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #6 : 1.01921 sec
INFO [thread 0], skch::main, Time spent post mapping : 0.000374961 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #7 : 0.698442 sec
INFO [thread 0], skch::main, Time spent post mapping : 0.000139954 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #8 : 1.16008 sec
INFO [thread 0], skch::main, Time spent post mapping : 0.00037629 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #9 : 1.949 sec
INFO [thread 0], skch::main, Time spent post mapping : 0.000728267 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #10 : 0.912171 sec
......

Add --version flag

% fastANI --version
fastani 1.1

It should return 0.

make problem in Linux: ‘MapType’ has no member named ‘emplace_hint’

Output of $ ./configure:

configure: loading site script /usr/share/site/x86_64-unknown-linux-gnu
checking for g++... g++
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking for g++ option to support OpenMP... -fopenmp
checking how to run the C++ preprocessor... g++ -E
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking zlib.h usability... yes
checking zlib.h presence... yes
checking for zlib.h... yes
checking gsl/gsl_cdf.h usability... yes
checking gsl/gsl_cdf.h presence... yes
checking for gsl/gsl_cdf.h... yes
configure: creating ./config.status
config.status: creating Makefile

Output of $ make or $ sudo make:

g++ -O3 -DNDEBUG -std=c++11 -Isrc -I /usr/local//include -fopenmp -include src/common/memcpyLink.h -Wl,--wrap=memcpy  src/cgi/core_genome_identity.cpp -o fastANI -L/usr/local//lib -lgsl -lgslcblas -lstdc++ -lz -lm
In file included from src/map/include/computeMap.hpp:24:0,
                 from src/cgi/core_genome_identity.cpp:18:
src/map/include/slidingMap.hpp: In instantiation of ‘void skch::SlideMapper<Q_Info>::init() [with Q_Info = skch::QueryMetaData<kseq_t*, std::vector<skch::MinimizerInfo> >]’:
src/map/include/slidingMap.hpp:104:11:   required from ‘skch::SlideMapper<Q_Info>::SlideMapper(Q_Info&) [with Q_Info = skch::QueryMetaData<kseq_t*, std::vector<skch::MinimizerInfo> >]’
src/map/include/computeMap.hpp:437:41:   required from ‘void skch::Map::computeL2MappedRegions(Q_Info&, skch::Map::L1_candidateLocus_t&, skch::Map::L2_mapLocus_t&) [with Q_Info = skch::QueryMetaData<kseq_t*, std::vector<skch::MinimizerInfo> >]’
src/map/include/computeMap.hpp:369:13:   required from ‘bool skch::Map::doL2Mapping(Q_Info&, VecIn&, VecOut&) [with Q_Info = skch::QueryMetaData<kseq_t*, std::vector<skch::MinimizerInfo> >; VecIn = std::vector<skch::Map::L1_candidateLocus_t>; VecOut = std::vector<skch::MappingResult>]’
src/map/include/computeMap.hpp:221:11:   required from ‘void skch::Map::mapSingleQuerySeq(Q_Info&, skch::MappingResultsVector_t&, std::ofstream&) [with Q_Info = skch::QueryMetaData<kseq_t*, std::vector<skch::MinimizerInfo> >; skch::MappingResultsVector_t = std::vector<skch::MappingResult>; std::ofstream = std::basic_ofstream<char>]’
src/map/include/computeMap.hpp:178:57:   required from here
src/map/include/slidingMap.hpp:121:13: error: ‘skch::SlideMapper<skch::QueryMetaData<kseq_t*, std::vector<skch::MinimizerInfo> > >::MapType’ has no member named ‘emplace_hint’
make: *** [fastANI] Error 1

Any idea of where the problem might be? I think all my dependencies are OK.

Thank you for any assistance you can provide.

AAI support

Hi,

Can I use FastANI to calculate AAI (Average Amino Acid Identity)? Is this supported?

Thank you,

conda install always hanging

Hi developer,
I tried to install FastANI with conda, using conda install -c bioconda fastani
But it always hangs at Collecting package metadata: done Solving environment:
Is there another conda install recipe I could use?
Thanks!

FastANI gives different results depending on genomes in the reference list

Hello,

Thank you for FastANI. We are using it regularly in our work. I have run into some unexpected behaviour where FastANI does not appear to give consistent results. I have a query genome Q and the reported ANI to a given reference genome R changes depending on what genomes I have in the reference list.

That is,
fastANI -q Q.fna -r R.fna -o single.tsv

Gives a different result to:
fastANI -q Q.fna --rl references.lst -o multiple.tsv

single.tsv gives: Q.fna R.fna 97.0547 1150 1325

The relevant line in multiple.tsv gives: Q.fna R.fna 97.0342 1152 1325

Why are the reported ANI and the number of aligned fragments different? The results change slightly as I modify the genomes in references.lst. Is this the expected behaviour? If so, it would be helpful to note this heuristic quality of FastANI in the README, since these small differences do change assignments in a small number of cases when processing large genome databases, which leads to confusion.

How to make query list and reference list?

How do I make a query list and a reference list of genomes? Could you elaborate?
I can get ANI output for a single query and a single reference genome, but when I try multiple reference and query genomes, my command is $ ./fastANI -q 119.fasta --rl Xac29-1.fasta Xanthomonas_albilinneans.fasta Xanthomonas_ arbicola.fasta -o punicae.text

It shows the following errors:
1, Misplaced option '-r' detected. All option have to be BEFORE the first argument
2, Misplaced option '-o' detected. All option have to be BEFORE the first argument
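
For reference, the list file quoted in another report on this page is plain text with one genome path per line, and it appears to be that file, rather than the genome files themselves, that is passed to `--rl` or `--ql`. A minimal sketch of writing one; the helper name and the file name `reference_list.txt` are chosen for illustration only:

```python
from pathlib import Path


def write_genome_list(genome_paths, list_path):
    """Write one genome path per line, the layout of the list file
    quoted elsewhere on this page."""
    lines = [str(p) for p in genome_paths]
    Path(list_path).write_text("\n".join(lines) + "\n")
    return list_path


refs = ["Xac29-1.fasta", "Xanthomonas_albilinneans.fasta"]
write_genome_list(refs, "reference_list.txt")
# Then, on the command line (options as used elsewhere on this page):
#   ./fastANI -q 119.fasta --rl reference_list.txt -o punicae.text
```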

the '-h' thing, maybe better if it went to std::cout?

Not urgent, since the help is not too long.

When checking the help with the '-h' option, if I try

fastANI -h | more

I don't see the page-to-page help, I have to use

fastANI -h |& more

(that's with the tcsh, with bash nothing works to get page-by-page output)

which means that the output of the help goes to std::cerr, rather than to std::cout. Something that might be useful to fix, for the sake of UNIXers.

Best, and thanks for making this program. Runs very well!

getting no output for genome files

Hi, many thanks for this new ANI implementation! I'm looking forward to using it, but for these two genomes I'm getting no output at all. The output file is empty. (fasta.zip)

fastANI -q ./MYb12.fasta -r ./BT247.fasta -o test
>>>>>>>>>>>>>>>>>>
Reference = [./BT247.fasta]
Query = [./MYb12.fasta]
Kmer size = 16
Fragment length = 3000
ANI output file = test
>>>>>>>>>>>>>>>>>>
INFO, skch::Sketch::build, minimizers picked from reference = 448286
INFO, skch::Sketch::index, unique minimizers = 440286
INFO, skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 434525) ... (74, 1)
INFO, skch::Sketch::computeFreqHist, With threshold 0.001%, ignore minimizers occurring >= 74 times during lookup.
INFO, skch::main, Time spent sketching the reference : 0.5524 sec
INFO, skch::main, Time spent mapping fragments in query #1 : 0.507377 sec
INFO, skch::main, Time spent post mapping : 2.7297e-05 sec

From the output I could not guess what is going wrong.
The input files are draft genomes but from a rough quality check they should be fine, e.g. N50>10000.

--help should return 0 not 1

% fastANI -h
% echo $?
1

Should be 0 because I asked for help and it gave it to me, it was a success.

Filenames do not support blank

Filenames containing spaces are not supported.

fastANI -q "/abcdef/aa.fa" -r "/a b/bb.fa" -o "123"

>>>>>>>>>>>>>>>>>>
Reference = [/a]
Query = [b/bb.fa/abcdef/aa.fa]
Kmer size = 16
Fragment length = 3000
Threads = 1
ANI output file = 111
>>>>>>>>>>>>>>>>>>
ERROR, skch::validateInputFiles, Could not open b/bb.fa/abcdef/aa.fa

String Extraction from Excel Sheet

I need to extract a particular string from an xls or csv file, together with the entire row that contains it, and store the result in another Excel file so that I can mine my information in a time-effective manner. Please let me know whether an R script is available for this.

Program runs, but output file is empty.

Hi,

I'm running FastANI, and the console logs indicate it is working fine, but the output file is empty.
Command:
FastANI -q GCF_000063525.1_ASM6352v1_genomic.fna -r GCF_002356215.1_ASM235621v1_genomic.fna -o ani_test.txt

Console:

Reference = [GCF_002356215.1_ASM235621v1_genomic.fna]
Query = [GCF_000063525.1_ASM6352v1_genomic.fna]
Kmer size = 16
Fragment length = 3000
Threads = 1
ANI output file = ani_test.txt

INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 249756
INFO [thread 0], skch::Sketch::index, unique minimizers = 245256
INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 241440) ... (12, 2)
INFO [thread 0], skch::Sketch::computeFreqHist, With threshold 0.001%, ignore minimizers occurring >= 12 times during lookup.
INFO [thread 0], skch::main, Time spent sketching the reference : 0.361363 sec

INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 0.211354 sec
INFO [thread 0], skch::main, Time spent post mapping : 2.0606e-05 sec

After this,
ani_test.txt is empty.

I also had some problems trying to run with the --ql and --rl params, but I will prep an issue later.

Thanks,

compiling under MacOSX Mojave

When trying to compile under Mojave, if I simply run
autoconf
./configure --prefix=/usr/local
make

I get:
src/cgi/core_genome_identity.cpp:11:10: fatal error: 'omp.h' file not found

No compilation happens.
So, I tried:
./configure --prefix=/usr/local --disable-openmp
make

I still got:
src/cgi/core_genome_identity.cpp:11:10: fatal error: 'omp.h' file not found

So, because I have gcc-8 from homebrew I tried:
export CXX=/usr/local/bin/g++-8
./configure --prefix=/usr/local
make

I got:
g++-8: error: unrecognized command line option '-stdlib=libc++'

So, I edited the Makefile and eliminated the "-stdlib=libc++"

That worked all right, and the program seems to do what it's supposed to do (it's fast too!)

However, I thought you'd like to fix those issues to help Mac users get a clean first-try installation.

The first one is insisting on using omp.h, which seems to be used only by gcc (I might be wrong about this, am I? Is there a way to point to gcc's omp.h but still compile with clang?), the second wants to use a clang library, even though I was trying to use gcc's g++.

Installing FastANI on local computer (Mac OSX)

Hi,

I am trying to install the latest FastANI release. I downloaded the tar.gz file without any problem using this command: wget https://github.com/ParBLiSS/FastANI/releases/tag/v1.1/fastani-OSX64-v1.1.tar.gz. But when I try to run tar -zxvf fastani-OSX64-v1.1.tar.gz I get this error:
tar: Unrecognized archive format
tar: Error exit delayed from previous errors.

I am wondering if there is a missing script that I need to add?
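
One thing worth checking (an assumption about the cause, not something confirmed in this report): the URL above contains `/releases/tag/`, which on GitHub is the path of a release's web page, while downloadable assets are normally served from `/releases/download/`. If that is what happened, the saved file would be HTML rather than a gzip archive. A small sketch that tests a file for the gzip magic bytes; the temporary files only stand in for a real and a mistaken download:

```python
import gzip
import os
import tempfile


def looks_like_gzip(path):
    """Return True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"


with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "real.tar.gz")
    fake = os.path.join(tmp, "fake.tar.gz")
    with gzip.open(real, "wb") as handle:
        handle.write(b"payload")            # genuine gzip stream
    with open(fake, "wb") as handle:
        handle.write(b"<!DOCTYPE html>")    # a web page saved under a .tar.gz name
    real_ok = looks_like_gzip(real)
    fake_ok = looks_like_gzip(fake)

print(real_ok, fake_ok)
```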

I am still new to bioinformatics and would appreciate any help :)

Best,
C

Applicable for Eukaryotes?

Hello,

I was wondering if this program can be used for small eukaryotic genomes (<35 Mb) with less than 10% repeats?

Thanks,
Morgan

FastANI still not fast enough?

Dear Sir

I used fastANI to compute ANI for about 2000 genomes, and it is quite slow. I ran the program on a large node with 64 cores and 500 GB of memory, but the software only runs in a single thread.

I know that you have already supplied a script to split genomes into smaller parts. But on one node, the speed is limited by IO if I run the parts in parallel on a single hard disk.

How did you generate the ANI among 80000 genomes? Can you give me some hint?

I tried to run it on our HPCF; however, for every single run, the memory requirement exceeds 96 GB, which is the configuration of most of our nodes.

I can only submit a limited number of jobs (10) at one time, so I can split the work into fewer than 100 jobs but not into thousands.
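
One way to stay within a job limit like this is to split the genome list into a fixed number of chunks and submit one job per chunk. The sketch below covers only the splitting step; `split_into_chunks` is a hypothetical helper, the genome names are placeholders, and whether chunking reduces the memory footprint described above is not established here:

```python
def split_into_chunks(items, n_chunks):
    """Split `items` into `n_chunks` contiguous parts whose sizes differ by at most one."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)  # the first `extra` chunks get one more item
        chunks.append(items[start:start + size])
        start += size
    return chunks


# Placeholder names standing in for about 2000 genome paths.
genomes = [f"genome_{i}.fna" for i in range(2000)]
chunks = split_into_chunks(genomes, 10)
```

Each chunk can then be written to its own list file and used as the query list of one job.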

Thank you very much!

Xiaotao

Recommended N50

Hello,

The README suggests that both the query and reference genomes should have an N50 >10 Kb. I'm a little unclear why a lower N50, say 5 Kb, is problematic when the query fragment size is only 3 Kb. I can imagine that if both the query and reference genome have a low N50, correctly matching homologous regions can be compromised. However, if just one of the reference or query has a low N50, it would seem the method would still work well enough. Just wondering if you have some insights that would help me understand how N50 impacts results and whether N50 >10 Kb should be considered a hard requirement.
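
For readers less familiar with the metric: N50 is the contig length such that contigs of that length or longer together cover at least half of the assembly. A minimal sketch, with contig lengths given in base pairs and the example values made up for illustration:

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half of the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


example_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```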

Thanks.

Blank output

In certain (fairly rare) cases, I'm finding blank output from certain fasta data.

Example query file: https://github.com/jayrbolton/random-test-data/blob/master/fasta/shewanella.fasta
Example reference file: https://github.com/jayrbolton/random-test-data/blob/master/fasta/rhodobacter.fasta

Commands to reproduce

$ fastANI -q shewanella.fasta -r rhodobacter.fasta -o result.out
$ cat result.out

The file is empty. The stdout is:

>>>>>>>>>>>>>>>>>>
Reference = [rhodobacter.fasta]
Query = [shewanella.fasta]
Kmer size = 16
Fragment length = 3000
Threads = 1
ANI output file = result.out
>>>>>>>>>>>>>>>>>>
INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 368426
INFO [thread 0], skch::Sketch::index, unique minimizers = 352326
INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 339338) ... (26, 1)
INFO [thread 0], skch::Sketch::computeFreqHist, With threshold 0.001%, ignore minimizers occurring >= 20 times during lookup.
INFO [thread 0], skch::main, Time spent sketching the reference : 0.306699 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 0.269743 sec
INFO [thread 0], skch::main, Time spent post mapping : 2.426e-05 sec

At first I thought this had to do with lowercase/uppercase nucleotides in the fasta, but that does not seem to make a difference.

Redirect ANI value to stdout

Would it be possible to output to stdout if no output file is given?

fastANI and completeness

Hi,
Thanks for the FAST tool!
I've calculated ANI between my reconstructed metagenomes and several databases (such as NCBI). I found the genomes with (maximal) ANI between 76-83 had higher completeness relative to genomes with higher ANI (83-100) or no ANI. I couldn't find any reasonable explanation for this observation, do you have any idea?

Additionally, can you please explain why there are almost no pairs with ANI lower than 76?

Thanks in advance!

Error: skch::validateInputFiles, Count of query and ref genomes should be non-zero

I am doing a "Many to Many" ANI calculation using fastANI and getting the following error:

$ fastANI --ql /home/ncim/ncbi_genomes --rl /home/ncim/ncbi_genomes -o home/ncim/ncbi_genomes/all.out

Reference = []
Query = []
Kmer size = 16
Fragment length = 3000
Threads = 1
ANI output file = home/ncim/ncbi_genomes/all.out

ERROR, skch::validateInputFiles, Count of query and ref genomes should be non-zero

What could go wrong? Please help.
When I did "One to One" ANI calculation everything worked fine, but when I am doing many to many ANI calculation I am getting the error.
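
In the command above, `--ql` and `--rl` are given a directory, whereas the list file quoted elsewhere on this page is a text file with one genome path per line. Whether that explains the empty `Reference = []` is an assumption, but it is cheap to rule out. A sketch of a pre-flight check; the function name is hypothetical and the temporary files exist only to exercise it:

```python
import os
import tempfile


def check_genome_list(list_path):
    """Return a list of problems found with a FastANI-style list file."""
    if os.path.isdir(list_path):
        return [f"{list_path} is a directory, not a list file"]
    if not os.path.isfile(list_path):
        return [f"{list_path} does not exist"]
    problems = []
    with open(list_path) as handle:
        entries = [line.strip() for line in handle if line.strip()]
    if not entries:
        problems.append(f"{list_path} lists no genomes")
    for entry in entries:
        if not os.path.isfile(entry):
            problems.append(f"cannot open {entry}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    genome = os.path.join(tmp, "a.fna")
    open(genome, "w").close()                 # empty stand-in for a genome file
    good_list = os.path.join(tmp, "genomes.txt")
    with open(good_list, "w") as handle:
        handle.write(genome + "\n")
    dir_problems = check_genome_list(tmp)         # a directory passed by mistake
    list_problems = check_genome_list(good_list)  # a proper list file
```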

-i option?

Hello,

I think it would be very helpful to make an option similar to -i option in mash:

-i Sketch individual sequences, rather than whole files, e.g. for
multi-fastas of single-chromosome genomes or pair-wise gene
comparisons.

I know that pretty much the same thing could be achieved with lists, but generating 10's of thousands of files is cumbersome (I'm doing plasmids and contigs right now).

Thank you for the great tool! Looking forward to seeing how different my results would be, compared to mash.

All output data feedback

Hi,

Do you have a list of all of your output data from FastANI?

I'm interested in the all of the ANI values for all the genomes you computed.

Thanks

Is only one of boost or gsl needed?

I have a feeling boost gets used in priority over gsl?
If so, which would make fastANI fastest?
Or are both used?

Installation issue: /usr/bin/perl symbol lookup error

I get the following error on installation:

$ ./bootstrap.sh
/usr/bin/perl: symbol lookup error: /usr/common/usg/languages/perl/5.24.0/extra/lib/perl5/x86_64-linux-thread-multi/auto/List/Util/Util.so: undefined symbol: Perl_xs_handshake

Any ideas?

Thanks,
Stephen

results different between FastANI and ANI result from Kostas lab

Hi,

I tried using FastANI to compare multiple genomes, and one genome shows no similarity at all to another genome, which is quite odd because they are closely related organisms. So I tried another ANI calculator, from the Kostas lab, and it shows 95% similarity. May I know the reason, since the output differs so much?

Thanks!

Install instructions for brew

brew install brewsci/bio/fastani

Now at v1.1

FastANI stuck on last comparison

I have tried running FastANI with a combination of MAGs and genomes from GenBank (1537 genomes in total). Everything appears to run fine but FastANI is stuck in the last comparison for a couple of hours now and no output is written:

INFO [thread 0], skch::main, Time spent mapping fragments in query #1535 : 3.45975 sec
INFO [thread 0], skch::main, Time spent post mapping : 0.00237654 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #1536 : 2.6996 sec
INFO [thread 0], skch::main, Time spent post mapping : 0.00182084 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #1537 : 3.23705 sec
INFO [thread 0], skch::main, Time spent post mapping : 0.00220625 sec

I am running the compiled version for Linux (v.1.1) on a HPC server and the command used is shown below:

fastANI --ql phylogenomics_ANI_paths.txt \
        --rl phylogenomics_ANI_paths.txt \
        -o fastani.out \
        -t 4 \
        --matrix

Many thanks in advance for your thoughts on this.

Cheers,
Igor

Segmentation fault (core dumped)

@cjain7
I have been running the fastANI tool successfully for the past 5 months; it works fine and is very useful for my comparative genome analysis. But recently I have been getting a "Segmentation fault (core dumped)" error. The exact output from the Linux terminal is shown below. Please help me fix this issue.
Thank you in advance

arvind@arvind:~/Documents/dinesh/fastANI/fastani-Linux64-v1.0$ ./fastANI -q query.txt -r 1.txt -o test.txt

Reference = [1.txt]
Query = [query.txt]
Kmer size = 16
Fragment length = 3000
ANI output file = test.txt

INFO, skch::Sketch::build, minimizers picked from reference = 0
INFO, skch::Sketch::index, unique minimizers = 0
Segmentation fault (core dumped)

Missing results in output file

Hello,

It seems that when a reference genome is divergent enough from the query genome, the resulting comparison may not be reported in the output file. I appreciate that a minimum number of fragments are required for a reliable ANI estimation. However, it is often very confusing when a comparison is simply missing from the output file. Perhaps it would be better to report these as "N/A" instead of just leaving out the comparison completely.

There also appears to be an actual bug since the following may result in no comparison being reported:
fastANI --minFrag -1 -q input.fna -r reference.fna -o test.tsv

Both the input and reference genomes here are "normal" genomes with plenty of >10 kb contigs. This is particularly problematic when using a list of reference genomes and then having to manually establish which comparisons weren't performed.
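
Unreported comparisons can be listed by checking the expected query/reference pairs against the output file. A sketch, assuming each output line starts with the query and reference paths as whitespace-separated fields, written exactly as they were given to FastANI; the helper name and example values are hypothetical:

```python
from itertools import product


def missing_pairs(queries, references, output_lines):
    """Return (query, reference) pairs that have no line in the output."""
    reported = set()
    for line in output_lines:
        fields = line.split()
        if len(fields) >= 2:
            reported.add((fields[0], fields[1]))
    return [pair for pair in product(queries, references) if pair not in reported]


missing = missing_pairs(
    ["input.fna"],
    ["ref1.fna", "ref2.fna"],
    ["input.fna\tref1.fna\t97.05\t1150\t1325"],
)
```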

Thanks for any assistance you can provide.

Is it possible to sketch genomes and save them for later?

System requirements and usage with accession numbers

Does this package have any system requirements?
I have a MacBook Pro with 8 GB RAM and a 2.6 GHz i5. Is that enough for small whole genomes (5-10 MB fasta files)?

Can it be used with NCBI accession numbers only? If not, do you plan on working on this feature? I could potentially contribute.

thanks
V

configure not finding gsl_cdf.h when gsl path is given?

I'm trying to build the tool, but configure can't find my GSL. I created a new conda environment for fastani with gsl.

conda create --name fastani_env --yes
source activate fastani_env
conda install -c conda-forge gsl --yes

Now I'm trying to install the tool:

(fastani_env) -bash-4.1$ ./bootstap.sh
(fastani_env) -bash-4.1$ ls -lhtr /usr/local/devel/ANNOTATION/jespinoz/anaconda/envs/fastani_env/include/gsl/ | grep "gsl_cdf"
-rw-rw-r-- 2 jespinoz tigr 7.3K Apr 23 11:01 gsl_cdf.h
(fastani_env) -bash-4.1$ ./configure --prefix /usr/local/devel/ANNOTATION/jespinoz/anaconda/envs/fastani_env --with-gsl /usr/local/devel/ANNOTATION/jespinoz/anaconda/envs/fastani_env/include/gsl/
checking for g++... g++
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking for g++ option to support OpenMP... -fopenmp
checking how to run the C++ preprocessor... g++ -E
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking zlib.h usability... yes
checking zlib.h presence... yes
checking for zlib.h... yes
checking gsl/gsl_cdf.h usability... no
checking gsl/gsl_cdf.h presence... no
checking for gsl/gsl_cdf.h... no
configure: error: GNU Scientific Library headers not found.

Do you know why it's not finding my gsl_cdf.h?

Output is empty

Hi Chirag
I just tried to run fastani both v1.1 and 1.2 with two genomes in data folder (e coli and shigella) and output file is empty and this is the log I got.

Reference = [Shigella_flexneri_2a_01.fna]
Query = [Escherichia_coli_str_K12_MG1655.fna]
Kmer size = 16
Fragment length = 3000
Threads = 1
ANI output file = test.txt

INFO [thread 0], skch::main, Count of threads executing parallel_for : 1
INFO [thread 0], skch::Sketch::build, window size for minimizer sampling = 24
INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 4918
INFO [thread 0], skch::Sketch::index, unique minimizers = 2589
INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 1792) ... (36, 1)
INFO [thread 0], skch::Sketch::computeFreqHist, consider all minimizers during lookup.
INFO [thread 0], skch::main, Time spent sketching the reference : 0.00427922 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 0.0392048 sec
INFO [thread 0], skch::main, Time spent post mapping : 1.1669e-05 sec
INFO [thread 0], skch::main, ready to exit the loop
INFO, skch::main, parallel_for execution finished

kseq.h is out of date

Please update to the latest version of kseq.h.

How to interpret n mapped fragments?

Hello

I have a question about the fastANI output
E.g.
Genome1 genome2 0.9 60 100

0.9 is the estimated ANI over the whole genome or only over the aligned fragments?

How can we interpret the ratio of mapped /all fragments?
Does 60/ 100 mean the genomes overlap to 60 %?

If I have e.g. 5 mapped out of 100, can I trust the ANI calculation?

I work with MAGs, so maybe I need to be more cautious. Thanks for the clarifications.
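
Whatever the right interpretation turns out to be, the last two columns can be turned into a mapped fraction for filtering. A sketch, assuming the columns are query, reference, ANI, mapped fragments, and total fragments, which follows the wording of this question rather than any confirmed specification; the class and function names are hypothetical:

```python
from typing import NamedTuple


class AniRecord(NamedTuple):
    query: str
    reference: str
    ani: float
    mapped: int
    total: int

    @property
    def mapped_fraction(self):
        """Share of fragments that were mapped (mapped / total)."""
        return self.mapped / self.total


def parse_line(line):
    """Parse one whitespace-separated output line into an AniRecord."""
    query, reference, ani, mapped, total = line.split()[:5]
    return AniRecord(query, reference, float(ani), int(mapped), int(total))


record = parse_line("Genome1 genome2 0.9 60 100")
print(record.mapped_fraction)
```

A cut-off on `mapped_fraction` is a judgment call left to the user; the sketch does not suggest a value.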

Issue opening resulting .pdf of Visualize.R

Hello,
Sorry in advance for such a basic question.

I've generated the ". visual.pdf" file as demonstrated with the accompanying "visualize.R" script without any errors. However, I haven't found a program that is able to open the file on Mac (e.g. Preview, Adobe Reader). Trying to open the file returns a "file is damaged and could not be repaired/opened" error. Thinking it had something to do with the extension I changed it to just .pdf with no success.

Any suggestions on how to open the plot file?

Thanks!

Values chosen in the --matrix output ?

Hi there,

While doing an all-to-all ANI comparison on a set of genomes, I noticed that the regular output displays different values when the genomes are switched:

1509405_PRJNA252589.fasta.gz  246200_PRJNA281.fasta.gz      76.3461  98    1679
246200_PRJNA281.fasta.gz      1509405_PRJNA252589.fasta.gz  76.9103  84    1369

When using the --matrix option there is only a single value for this pair, which is 76.628181 (looks like the mean).

I thus have two questions:

Why do the ANI values change when the genomes are switched?
Is there a particular reason to use the mean of the two values in the --matrix output?
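
The arithmetic behind the "looks like the mean" observation can be checked directly. This only compares the numbers quoted above; it does not confirm how `--matrix` is implemented:

```python
forward = 76.3461         # 1509405_PRJNA252589 as query, 246200_PRJNA281 as reference
reverse = 76.9103         # the same pair with the roles switched
matrix_value = 76.628181  # single value reported with --matrix

mean = (forward + reverse) / 2
difference = abs(mean - matrix_value)
print(f"mean = {mean:.4f}, difference from matrix value = {difference:.6f}")
```

The two values agree to within the four decimal places shown in the pairwise output.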

Cheers,
Nils

fastANI process is aborted or killed.

When I am comparing ~10,000 genomes, on my server having 24 cores with 128 GB RAM, the fastANI process is getting aborted.
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
What will be memory requirements when I want to compare ~10,000 genomes?

I am using fastANI v1.1
My command:
fastANI --refList reference.list --ql query.list -t 10 -o out --matrix

Results missing from output

I'm performing a large all vs all ANI calculation, and it appears that some results that should be there are missing from the output. For example, when comparing bacteria A vs. bacteria B, the resulting ANI is 87%, but when the converse comparison is made (B vs. A), the result is entirely missing from the output file. Here's a piece of a heatmap illustrating this:

Notice the blue dots (values equal to 0) dispersed asymmetrically in areas of otherwise high ANI
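
Such one-directional gaps can be listed from the pairwise output before plotting. A sketch, assuming each output line starts with the query and reference names as its first two whitespace-separated fields; the helper name and the example lines are made up for illustration:

```python
def one_way_pairs(output_lines):
    """Return (a, b) pairs reported as a -> b but not as b -> a."""
    reported = set()
    for line in output_lines:
        fields = line.split()
        if len(fields) >= 2:
            reported.add((fields[0], fields[1]))
    return sorted(
        (a, b) for a, b in reported if a != b and (b, a) not in reported
    )


lines = [
    "A.fna\tB.fna\t87.0\t700\t1000",
    "A.fna\tC.fna\t90.0\t800\t1000",
    "C.fna\tA.fna\t90.2\t810\t1000",
]
gaps = one_way_pairs(lines)
```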

Error: 'std::bad_alloc'

fastANI ran properly with 100 genomes. However, increasing to 1000 genomes resulted in the following error

Error details:

$ fastANI --ql 1000_genomes.list --rl 1000_genomes.list -o output.txt

Reference = [1.fasta, 2.fasta, ......... 1000.fasta]
Query = [1.fasta, 2.fasta, ......... 1000.fasta]
Kmer size = 16
Fragment length = 3000
ANI output file = /media/network/project_Lm_all/results/43_fastANI/output.txt

terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped) fastANI --ql 1000_genomes.list --rl 1000_genomes.list -o output.txt

Hardware details

Processor | 8x Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Memory | 65858MB
Operating System | Ubuntu 16.04.3 LTS

Compiling fastANI on macOS but using gcc (solved)

==> make install
g++-8 -O3 -DNDEBUG -std=c++11 -Isrc -I /usr/local/opt/boost/include -fopenmp -mmacosx-version-min=10.7 -stdlib=libc++ -DUSE_BOOST src/cgi/core_genome_identity.cpp -o fastANI /usr/local/opt/boost/lib/libboost_math_c99.a -lstdc++ -lz -lm  
g++-8: error: unrecognized command line option '-stdlib=libc++'
make: *** [fastANI] Error 1

It still seems to be passing clang++-specific parameters to g++,
i.e. -mmacosx-version-min=10.7 -stdlib=libc++

Ok I now see in Makefile.in where it occurs:

ifeq ($(UNAME_S),Darwin)
	CXXFLAGS += -mmacosx-version-min=10.7 -stdlib=libc++
else
	CXXFLAGS += -include src/common/memcpyLink.h -Wl,--wrap=memcpy
	CFLAGS += -include src/common/memcpyLink.h
endif

Can you change the if Darwin to if Darwin and not GCC ?

Minimizers or index saved for each genome?

Hi sir,

While I was querying one genome against ~2000 genomes, it was very slow. I checked the paper on 90k prokaryotic genomes and found that indexing takes the majority of the runtime, so I wonder whether the minimizers of each genome can be saved (like a MinHash sketch or signature) so they don't have to be recreated every time?

OS X: Symbols not found

OS 10.12.6 throws a trap for me, using fastani-OSX64-v1.1.tar.gz

fastani-OSX64-v1.1 bede$ ./fastANI -q kp/HS11286.fasta -r ec/E24377A.fasta -o kp
>>>>>>>>>>>>>>>>>>
Reference = [ec/E24377A.fasta]
Query = [kp/HS11286.fasta]
Kmer size = 16
Fragment length = 3000
Threads = 1
ANI output file = kp
>>>>>>>>>>>>>>>>>>
dyld: lazy symbol binding failed: Symbol not found: ___emutls_get_address
  Referenced from: /Users/bede/Downloads/fastani-OSX64-v1.1/.//./libgomp.1.dylib
  Expected in: /usr/lib/libSystem.B.dylib

dyld: Symbol not found: ___emutls_get_address
  Referenced from: /Users/bede/Downloads/fastani-OSX64-v1.1/.//./libgomp.1.dylib
  Expected in: /usr/lib/libSystem.B.dylib

Abort trap: 6
fastani-OSX64-v1.1 bede$

kmer size selection

I have been using fastANI for some time now and wanted to see how my results differ with the kmer size. I noticed the stated limit (kmer <= 16, default = 16).
From the help description:
'''
-k , --kmer
kmer size <= 16 [default : 16]
'''

I tried different kmers (13, 16, 18, 21) on a set of four bacteria genomes as seen below;
k-mer size = 13
4
x.fna | | |
y.fna | NA | |
z.fna | NA | NA |
m.fna | NA | 75.762955 | NA

k-mer size = 16
4
x.fna | | |
y.fna | 75.345757 | |
z.fna | NA | NA |
m.fna | 74.794037 | 75.225105 | NA

K-mer size = 18
4
x.fna | | |
y.fna | 74.216934 | |
z.fna | 74.254578 | 74.217453 |
m.fna | 74.190994 | 74.626328 | 74.193939

k-mer size = 21
4
x.fna | | |
y.fna | 75.161926 | |
z.fna | 75.197411 | 75.177795 |
m.fna | 75.154465 | 75.465622 | 75.155106

I want to know why it was possible for me to get all pairwise ANI results using kmer sizes greater than the maximum allowed kmer size (16) in fastANI and poor results (most were NA) for kmer size < 16?
Using fastANI version v1.1
My command example:
fastANI -k 13 --ql query.list --rl reference.list -o out --matrix
