bioinformaticstoolsmith / identity

License: Other

Python 1.86% C++ 97.44% CMake 0.70%

identity's Issues

Turn off warning?

Dear,

I was wondering whether there was a way to turn off the warnings like:
Statistician warning at harmonicMeanRSimilarity. A sequence is too short. Similarity is assigned zero.
and
Skip row
?

Thank you!

I used meshclust3 to cluster repeat sequences (including MITE and TIR elements) and got many warnings like "Statistician warning at harmonicMeanRSimilarity. A sequence is too short. Similarity is assigned zero.". The sequence lengths in the input file range from 100 bp to ~1 Mb (sometimes from 20 bp to 3 Mb).

Will these warnings affect the results, and how can I avoid them?

Best,
Kun

Identity on very large data

Hi,

Thanks for the tool!

Can I use it on very large data, like 300K-400K genomes (reads & assemblies)?

Best,

Any info on the default threshold and minimum read sequence length for meshclust?

I can't see it in the README and -h of meshclust.

And I got this error:

Database file: all_region_prefixed.fasta
Output file: output1.txt
Cores: 16
Provided threshold: 0.97
Block size for all vs. all: 25000
Block size for reading sequences: 100000
Number of data passes: 10
Can assign all: No


Average: 224
K: 3
Histogram size: 64
A histogram entry is 32 bits.
Generating data.
Statistician warning at harmonicMeanRSimilarity. A sequence is too short. Similarity is assigned zero.
[... the same warning repeated 56 more times ...]
KmerHistogram: At least one valid segment is required.
Sequence: G
terminate called after throwing an instance of 'std::exception'
  what():  std::exception

I think this was due to the short length of some sequences. Is that right?

My questions are:

What's the default threshold value?
If short sequences were the cause, what's the minimum length for clustering?

Thank you.
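One way to test the short-sequence hypothesis is to drop very short records before clustering. The following is a minimal sketch in Python, not part of MeShClust. The function names are made up for illustration, and the cutoff passed to `filter_short` is a placeholder, because the page does not state a documented minimum length.

```python
from typing import Iterable, Iterator, TextIO, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(handle: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_short(records: Iterable[Record], min_length: int) -> Iterator[Record]:
    """Keep only records with at least min_length bases.

    min_length is an assumption chosen by the user, not a MeShClust default.
    """
    for header, seq in records:
        if len(seq) >= min_length:
            yield header, seq


def write_fasta(records: Iterable[Record], handle: TextIO, width: int = 80) -> None:
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")
```

Reading `all_region_prefixed.fasta`, passing the records through `filter_short`, and writing the result to a new file would remove the one-base record `G` reported just before the exception. Whether that alone avoids the crash is not confirmed on this page.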

Runtime for Large Dataset (1.3M seqs)

Hi there,

Thanks for making this tool available and for the clean repo!

I am trying to run MeshClust on a set of 1.3 million sequences with lengths ranging from 75 bp to 6000 bp. From the paper I saw that you were able to run meshclust on a microbiome dataset which comprised ~1 million sequences in ~2 hrs with the hardware specified in the paper.

I've run meshclust on my dataset with a calculated identity threshold. It has been running for 12 hrs, has processed only 160k sequences, and is still on the first data pass. I see that there are still ~50k seqs in the reservoir. I'm guessing it is taking so long because the reservoir is continually being refilled and the initialization step for mean shift is then rerun.

I wanted to check whether you had any ideas why it's taking this long, or ways I could split the data for better runtime?

Kind regards,
Evan

test mesh killed

not sure where to troubleshoot.

the Identity test worked fine.
the mesh test crashed.

_
Database file: 97_shuffled.fa
Output file: output2.txt
Cores: 2
Provided threshold: 0.97
Block size for all vs. all: 25000
Block size for reading sequences: 100000
Number of data passes: 10
Can assign all: No

Average: 289
K: 4
Histogram size: 256
A histogram entry is 16 bits.
Generating data.
Preparing data ...
Positive examples: 10000
Training size: 5000
Validation size: 5000
Better performance of: 0.00178729
jeffrey_divergence
Better performance of: 0.00169123
jeffrey_divergence
simMM
Better performance of: 0.00165223
jeffrey_divergence
sim_ratio
simMM
Better performance of: 0.00157806
chi_squared
jeffrey_divergence
sim_ratio
simMM
squared_chord^2 x hellinger^2
Better performance of: 0.00154754
chi_squared
jeffrey_divergence
sim_ratio
simMM
bray_curtis x hellinger^2
squared_chord^2 x hellinger^2
Better performance of: 0.00151275
chi_squared
bray_curtis
jeffrey_divergence
sim_ratio
simMM
bray_curtis x hellinger^2
covariance_r x simMM^2
squared_chord^2 x hellinger^2
Better performance of: 0.00147395
chi_squared
bray_curtis
jeffrey_divergence
sim_ratio
simMM
bray_curtis x jeffrey_divergence
bray_curtis x hellinger^2
covariance_r x simMM^2
chi_squared^2 x squared_chord^2
squared_chord^2 x hellinger^2
Selected statistics:
chi_squared
bray_curtis
jeffrey_divergence
sim_ratio
simMM
bray_curtis x jeffrey_divergence
bray_curtis x hellinger^2
covariance_r x simMM^2
chi_squared^2 x squared_chord^2
squared_chord^2 x hellinger^2
Finished training.
MAE: 0.0217444
MSE: 0.00147395
Optimizing ...
Validating ...
MAE: 0.0292413
MSE: 0.00208381

Clustering ...

Data run 1 ...
Killed_

Why would the test fail? Could I be lacking some dependencies?

Clustering long-read 18S amplicons

Hello there
Thanks a lot already for the work on this package!

I am trying to cluster 34,937,058 sequences of about 1000 bp (18S amplicons) contained in a single FASTA file. I'm using the following command on an HPC:

meshclust \
  -d /export/lv6/projects/NIOZ320/Analysis/3.1_Ecological_Analysis/18S_NIOZ320_NIOZ326.fa \
  -o /export/lv6/projects/NIOZ320/Analysis/3.1_Ecological_Analysis/consensus_95/18S_NIOZ320_NIOZ326_cl_0.95.txt \
  -t 0.95 \
  -b 45000 \
  -v 180000

The code has been running for 125 days and was about to finish its 4th run, which I thought would be the last, but a 5th clustering run of the data has started (see screenshot). This last run has indicated from the beginning that there are "0 unprocessed sequences", and the number of found centers has been stagnating around 47,900 for quite some time.

I understand that this is a lot of data and that the error rate of Oxford Nanopore reads probably adds complexity to the clustering algorithm. The amplicons have nevertheless been quality filtered and represent consensuses of several amplicons (pre-clustered based on Unique Molecular Identifiers). A previous Meshclust run with a similar approach but 16S data took ~80 days to cluster 33,306,880 amplicons and found 55,715 centers.

My questions are:

Am I doing something wrong here? Can Meshclust support such a computation? ("swarm -d 3" ran faster but clustered only 500K reads.)
Is there a way to stop the run at this stage and get the current output (centers and their composition)? Is there a way to predict how many runs it will take Meshclust to give an output?

Any help would be highly appreciated!
Best
Pierre

Compilation error

error: ‘sleep_for’ is not a member of ‘std::this_thread’
  368 |                                 std::this_thread::sleep_for(std::chrono::seconds(1));
      |

This can be fixed by including <thread> and/or <chrono>.

Is it possible to cluster the reads?

Hi @benjamin-james ,

I was wondering if it is possible to cluster reads/contigs based on the similarity in sequence?
For example, let's say I have 1000 contigs and want to find out how they cluster together based on the sequence similarity?

Thanks,
Vahid

How to get MeShClust v3.0.0

Sorry, there must be some resource I'm missing, but I could not find the latest v3.0.0.

Clustering MAGs to nrMAGs

Hi,
I was wondering if MeshClust could be used to cluster MAGs in order to dereplicate them? There is no mention of metagenome assembly in your paper and I'm wondering if the distance estimation would be sensitive to varying bin completeness and contamination levels.

Cheers

Identity shows the highest value for contigs of very different lengths

Dear all,
I found that when I have two contigs of different lengths, for example 30,000 bp and 3,000 bp, they are reported as almost 90% identical. That is fine from the perspective of the 3,000 bp contig but not from that of the 30,000 bp contig. I couldn't find a parameter to correct this (or to select another value). So, is there any parameter or flag to adjust the identity values? And would it be okay if I corrected the values by the length ratio? I am looking for an average between the two perspectives rather than the highest value.

Thank you in advance!
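The length-ratio correction proposed in this question can be written down as a small post-processing step. This is only a sketch of the asker's idea in Python, with a hypothetical function name. It is not an option of the identity tool, and whether the adjusted score is biologically appropriate is left open here.

```python
def length_adjusted_identity(identity: float, len_a: int, len_b: int) -> float:
    """Scale an identity score by the ratio of the shorter to the longer length.

    Hypothetical helper: identity is expected in [0, 1], lengths in bases.
    """
    if len_a <= 0 or len_b <= 0:
        raise ValueError("sequence lengths must be positive")
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must be between 0 and 1")
    shorter, longer = min(len_a, len_b), max(len_a, len_b)
    return identity * shorter / longer
```

For the example in the question, an identity of 0.90 between a 3,000 bp contig and a 30,000 bp contig is scaled by 3,000 / 30,000 = 0.1, giving roughly 0.09. Two contigs of equal length keep their original score.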

Floating point exception (core dumped)

I am running the following FASTA file
fasta.zip

Using ../meshclust -d file.fasta -o file.txt

MeShClust 2.0 is developed by Hani Z. Girgis, PhD.

This program clusters DNA sequences using identity scores obtained without alignment.

Copyright (C) 2021-2022 Hani Z. Girgis, PhD

Academic use: Affero General Public License version 1.

Any restrictions to use for profit or non-academics: Alternative commercial license is required.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Please contact Dr. Hani Z. Girgis ([email protected]) if you need more information.

Please cite the following papers:
        MeShClust v3.0: High-quality clustering of DNA sequences using the mean shift algorithm
        and alignment-free identity scores (2022). Hani Z. Girgis, BMC Genomics, 23(1):423.

        Identity: Rapid alignment-free prediction of sequence alignment identity scores using
        self-supervised general linear models (2021). Hani Z. Girgis, Benjamin T. James, and
        Brian B. Luczak. NAR Genom Bioinform, 13(1), lqab001.

        A survey and evaluations of histogram-based statistics in alignment-free sequence
        comparison (2019). Brian B. Luczak, Benjamin T. James, and Hani Z. Girgis. Briefings
        in Bioinformatics, 20(4):1222–1237.

        MeShClust: An intelligent tool for clustering DNA sequences (2018). Benjamin T. James,
        Brian B. Luczak, and Hani Z. Girgis. Nucleic Acids Res, 46(14):e83.

Database file: low_diversity_day.fasta
Output file: low_diversity_day.txt
Cores: 16
Estimating the threshold ...
Average: 29846
K: 7
Histogram size: 16384
A histogram entry is 32 bits.
Generating data.
Number of standard deviations: 2
Preparing data ...
        Positive examples: 9990
        Training size: 4995
        Validation size: 4995
Better performance of: 0.000105995
        jeffrey_divergence
Better performance of: 4.05578e-05
        jeffrey_divergence
        correlation x correlation^2
Better performance of: 3.28931e-05
        jeffrey_divergence
        simMM
        euclidean x cosine
        euclidean x correlation
        correlation x correlation^2
Better performance of: 2.8555e-05
        jeffrey_divergence
        simMM
        chi_squared^2
        minkowski^2
        euclidean x cosine
        euclidean x correlation
        chi_squared x cosine^2
        correlation x correlation^2
Selected statistics:
        jeffrey_divergence
        simMM
        chi_squared^2
        minkowski^2
        euclidean x cosine
        euclidean x correlation
        chi_squared x cosine^2
        correlation x correlation^2
Finished training.
        MAE: 0.00398844
        MSE: 2.8555e-05
Optimizing ...
Validating ...
        MAE: 0.00411611
        MSE: 3.08339e-05
Mean = 0.992131
STD = 3.33067e-16
Min = 0.992131
============================================
0.992131
Final threshold: 0.992131
Calculated threshold: 0.992131
Block size for all vs. all: 25000
Block size for reading sequences: 100000
Number of data passes: 10
Can assign all: No


Average: 29846
K: 7
Histogram size: 16384
A histogram entry is 32 bits.
Generating data.
Floating point exception (core dumped)

Floating point exception (core dumped)

Hi Dr. Girgis,

Thanks for this interesting package; I'm really excited about it. I installed Meshclust and could easily run some of the example data to completion (keratin_query.fasta). Unfortunately, when I try to run it on my FASTA file with sequences of interest (the majority of which have >99% similarity), I get the error "Floating point exception (core dumped)". My DNA sequences range from 1381 bp to 1840 bp, with a majority at 1584 bp. Below is the output from my terminal when I run Meshclust; any help would be greatly appreciated. Thanks!

MeShClust 2.0 is developed by Hani Z. Girgis, PhD.

Database file: ./consensus_reads.fa
Output file: ./cluster_test.txt
Cores: 96
Estimating the threshold ...
Average: 1585
K: 5
Histogram size: 1024
A histogram entry is 16 bits.
Generating data.
Number of standard deviations: 1
Preparing data ...
	Positive examples: 10000
	Training size: 5000
	Validation size: 5000
Better performance of: 0.00110822
	jeffrey_divergence
Better performance of: 0.000747565
	jeffrey_divergence
	cosine x jeffrey_divergence
Better performance of: 0.000719267
	jeffrey_divergence
	simMM
	cosine x jeffrey_divergence
Better performance of: 0.000681764
	jeffrey_divergence
	simMM
	manhattan x correlation
	cosine x jeffrey_divergence
	correlation x simMM^2
Better performance of: 0.000652853
	jeffrey_divergence
	sim_ratio
	simMM
	manhattan x correlation
	cosine x jeffrey_divergence
	correlation x simMM^2
Selected statistics:
	jeffrey_divergence
	sim_ratio
	simMM
	manhattan x correlation
	cosine x jeffrey_divergence
	correlation x simMM^2
Finished training.
	MAE: 0.0175349
	MSE: 0.000652853
Optimizing ...
Validating ...
	MAE: 0.0219525
	MSE: 0.000956407
Initialization: 0.373589 1
Stopping because there is no change for three iterations: 3
0.374592 0.995013 0.0001 0.9999
Initialization: 0.374592 0.995013 0.005 0.005
Stopping because there is no change for three iterations: 3
0.374592 0.995013 0.00129545 0.00302476 0.0001 0.9999
Initialization: 0.915731 1
Stopping because there is no change for three iterations: 10
0.956302 0.995658 0.0011 0.9989
Initialization: 0.956302 0.995658 0.0170152 0.005
Converged. 
0.995614 0.996173 0.00287971 9.07327e-07 1 0
Initialization: 0.874258 1
Stopping because there is no change for three iterations: 9
0.954592 0.995994 0.00215 0.99785
Initialization: 0.954592 0.995994 0.0202069 0.005
Converged. 
0.995905 0.996071 0.00327823 1.77699e-06 1 0
============================================
0.992735
0.992628
0.991988
Final threshold: 0.992628
Calculated threshold: 0.992628
Block size for all vs. all: 25000
Block size for reading sequences: 100000
Number of data passes: 10
Can assign all: No


Average: 1585
K: 5
Histogram size: 1024
A histogram entry is 16 bits.
Generating data.
Floating point exception (core dumped)

bioconda recipe for meshclust3

Hi,

Thanks for the great tool. I would like to include meshclust3 in my workflow. I wonder if it's possible to build a bioconda recipe for it.

I tried to make one, but it couldn't identify the CXX compiler. Do you have any suggestions?
bash.sh
meta.yaml

-- The C compiler identification is GNU 12.3.0
-- The CXX compiler identification is unknown
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: $BUILD_PREFIX/bin/x86_64-conda-linux-gnu-cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
CMake Error at CMakeLists.txt:5 (project):
  The CMAKE_CXX_COMPILER:

    g++-7

  is not a full path and was not found in the PATH.

  Tell CMake where to find the compiler by setting either the environment
  variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
  to the compiler, or to the compiler name if it is in the PATH.

Interpretation of output

MeshClust looks like just what I want for clustering sets of haplotype alleles. Thanks for writing it.
Please can you add to the documentation a description of the columns in the output file? I assume that column 1 is the cluster number and column 2 is obviously the sequence ID. Column 3 looks like some measure of similarity, but to what? What is column 4?
Thanks, Harry
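Under the assumption stated in this question (column 1 is the cluster number, column 2 is the sequence ID), the output can be grouped by cluster with a few lines of Python. The sketch below also assumes whitespace-separated columns, which this page does not confirm. It ignores columns 3 and 4 because their meaning is exactly what is being asked. The function name is illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_cluster(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each cluster label (column 1) to its sequence IDs (column 2).

    Assumes whitespace-separated columns; lines with fewer than two
    fields (for example blank lines) are skipped.
    """
    clusters: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        clusters[fields[0]].append(fields[1])
    return dict(clusters)
```

Passing an open file handle for the MeShClust output to `group_by_cluster` gives a dictionary whose values can be counted to get cluster sizes.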

Segmentation fault

Hi there, I'm trying to run an all vs. all comparison of around 10k sequences which are on average 100 kb in length.

I've tried running with various combinations of -c 8 or not specifying any cores (server has 24 cores) but keep getting the error "Segmentation fault (core dumped)" each time.

~/programs/Identity/bin/identity -d pooled_genomes.fasta -o output.txt -t 0.9 -c 8

I'm not sure what I'm doing wrong. Any help would be greatly appreciated! :)

Nanopore 16S Reads & Pre-Compiled Binaries

Hi Hani,

I've just come across your algorithm and paper - it looks great!

I am planning on using it to cluster ONT MinION-sequenced full-length 16S microbiome samples (after which I plan on creating some "consensus" representative read samples to use for taxonomic classification, which hopefully will be an improvement on Kraken2 and other clustering methods, e.g. UMAP-based methods from https://github.com/genomicsITER/NanoCLUST ).

Obviously these reads differ from the microbiome reads described in your paper - for my test samples, the mean Q Score of the reads is around 11-12 (so low 90s % accurate). And they are around 1500bp long.

Do you have any advice on tweaking the parameters to optimise cluster quality given the low accuracy of the reads? As an aside, I've found that -t 0.8 is on average giving me the highest cluster score for my test set, but I haven't done rigorous tests yet.

And furthermore, would you be able to provide some pre-compiled binaries (please excuse my ignorance if it is not possible, I am a bioinformatician/computational biologist not a computer scientist and I don't code in C++)? I am unable to compile the program on my M1 chip Mac (it runs fine on my Intel chip MacBook once I installed gcc using brew) because g++ doesn't seem to be compatible (yet).

Kind regards,

George Bouras

kmer-db

Hi,

Did you compare the results with the kmer-db tool (https://doi.org/10.1093/bioinformatics/bty610)?

Thank you

core dump

Hi there,

Thanks for the tool!

When I tried meshclust 3.0, I got a core dump error. Do you have any suggestions for this? Thank you!

The compute node of the cluster has 56 cores (112 threads) and 1.5T RAM, and we did not limit how much RAM meshclust could use.

Best
Guanliang

-rw-rw-r-- 1 gmeng 1.5G Jun 13 17:15 combined.fa
-rw-rw-r-- 1 gmeng  112 Jun 14 10:00 meshclust3.sh
-rw-r--r-- 1 gmeng 5.9K Jun 15 20:16 meshclust3.sh.o539214
-rw------- 1 gmeng  18G Jun 15 22:18 core.229599
-rw-r--r-- 1 gmeng  416 Jun 15 22:18 meshclust3.sh.e539214

$ grep -c '>' combined.fa
5652580

meshclust3.sh:

/home/gmeng/soft/MeShClust_v3/Identity/bin/meshclust -d combined.fa -t 0.6  -o out.clstr -c 80 -e y -a n -p 10

meshclust3.sh.o539214:

MeShClust v3.0 is developed by Hani Z. Girgis, PhD.

This program clusters DNA sequences using identity scores obtained without alignment.

Copyright (C) 2021-2022 Hani Z. Girgis, PhD

Academic use: Affero General Public License version 1.

Any restrictions to use for profit or non-academics: Alternative commercial license is required.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Please contact Dr. Hani Z. Girgis ([email protected]) if you need more information.

Please cite the following papers:
	1. Identity: Rapid alignment-free prediction of sequence alignment identity scores using
	self-supervised general linear models. Hani Z. Girgis, Benjamin T. James, and Brian B.
	Luczak. NAR GAB, 3(1):lqab001, 2021.
	2. MeShClust: an intelligent tool for clustering DNA sequences. Benjamin T. James,
	Brian B. Luczak, and Hani Z. Girgis. Nucleic Acids Res, 46(14):e83, 2018.
	3. MeShClust v3.0: High-quality clustering of DNA sequences using the mean shift algorithm
	and alignment-free identity scores. Hani Z. Girgis. A great journal. 2022.

Database file: combined.fa
Output file: out.clstr
Cores: 80
Provided threshold: 0.6
Block size for all vs. all: 25000
Block size for reading sequences: 100000
Number of data passes: 10
Can assign all: No


Average: 756
K: 4
Histogram size: 256
A histogram entry is 16 bits.
Generating data.
Preparing data ...
	Positive examples: 10000
	Training size: 5000
	Validation size: 5000
Better performance of: 0.00324074
	chi_squared x jeffrey_divergence
Better performance of: 0.00278104
	chi_squared x jeffrey_divergence
	chi_squared^2 x d2_s_r^2
Better performance of: 0.00275123
	chi_squared x jeffrey_divergence
	chi_squared^2 x d2_s_r^2
	squared_chord^2 x hellinger^2
Better performance of: 0.00271437
	chi_squared x jeffrey_divergence
	chi_squared^2 x d2_s_r^2
	bray_curtis^2 x d2_s_r^2
	squared_chord^2 x hellinger^2
Better performance of: 0.00266334
	chi_squared x squared_chord
	chi_squared x jeffrey_divergence
	chi_squared^2 x d2_s_r^2
	bray_curtis^2 x d2_s_r^2
	squared_chord^2 x hellinger^2
	kulczynski_2^2 x d2_s_r^2
Better performance of: 0.00263148
	squared_chord
	chi_squared x squared_chord
	chi_squared x jeffrey_divergence
	chi_squared^2 x d2_s_r^2
	bray_curtis^2 x d2_s_r^2
	squared_chord^2 x hellinger^2
	kulczynski_2^2 x d2_s_r^2
Better performance of: 0.00257594
	squared_chord
	chi_squared x squared_chord
	chi_squared x jeffrey_divergence
	hellinger x hellinger^2
	chi_squared^2 x d2_s_r^2
	bray_curtis^2 x d2_s_r^2
	squared_chord^2 x hellinger^2
	kulczynski_2^2 x d2_s_r^2
Better performance of: 0.00249854
	squared_chord
	manhattan x simMM
	chi_squared x squared_chord
	chi_squared x jeffrey_divergence
	hellinger x hellinger^2
	chi_squared^2 x d2_s_r^2
	bray_curtis^2 x d2_s_r^2
	squared_chord^2 x hellinger^2
	kulczynski_2^2 x d2_s_r^2
Selected statistics:
	squared_chord
	manhattan x simMM
	chi_squared x squared_chord
	chi_squared x jeffrey_divergence
	hellinger x hellinger^2
	chi_squared^2 x d2_s_r^2
	bray_curtis^2 x d2_s_r^2
	squared_chord^2 x hellinger^2
	kulczynski_2^2 x d2_s_r^2
Finished training.
	MAE: 0.036734
	MSE: 0.00249854
Optimizing ...
Validating ...
	MAE: 0.0426102
	MSE: 0.00325363

Clustering ...

Data run 1 ...
	Processed sequences: 25000
	Unprocessed sequences: 0
	Found centers: 772
	Processed sequences: 50000
	Unprocessed sequences: 24657
	Found centers: 770
	Processed sequences: 100478
	Unprocessed sequences: 41448
	Found centers: 1278
	Processed sequences: 166024
	Unprocessed sequences: 32518
	Found centers: 2628
	Processed sequences: 206655
	Unprocessed sequences: 27580
	Found centers: 3034
	Processed sequences: 338846
	Unprocessed sequences: 65658
	Found centers: 3620
	Processed sequences: 348903
	Unprocessed sequences: 50307
	Found centers: 4308
	Processed sequences: 414183
	Unprocessed sequences: 67888
	Found centers: 4653
	Processed sequences: 428889
	Unprocessed sequences: 56801
	Found centers: 5147
	Processed sequences: 473924
	Unprocessed sequences: 66571
	Found centers: 5560
	Processed sequences: 591912
	Unprocessed sequences: 101368
	Found centers: 6457
	Processed sequences: 599863
	Unprocessed sequences: 83946
	Found centers: 6943
	Processed sequences: 682732
	Unprocessed sequences: 112078
	Found centers: 7277
	Processed sequences: 694499
	Unprocessed sequences: 97930
	Found centers: 7757
	Processed sequences: 752209
	Unprocessed sequences: 114752
	Found centers: 8067
	Processed sequences: 767163
	Unprocessed sequences: 94407
	Found centers: 8447
	Processed sequences: 867163
	Unprocessed sequences: 141679
	Found centers: 8792
	Processed sequences: 875812
	Unprocessed sequences: 125026
	Found centers: 9248
	Processed sequences: 950986
	Unprocessed sequences: 155363
	Found centers: 9586
	Processed sequences: 962281
	Unprocessed sequences: 137454
	Found centers: 10001
	Processed sequences: 1050620
	Unprocessed sequences: 173768
	Found centers: 10430
	Processed sequences: 1060816
	Unprocessed sequences: 156809
	Found centers: 10884
	Processed sequences: 1138833
	Unprocessed sequences: 189905
	Found centers: 11240
	Processed sequences: 1219898
	Unprocessed sequences: 191996
	Found centers: 12162
	Processed sequences: 1234377
	Unprocessed sequences: 173682
	Found centers: 12615
	Processed sequences: 1328038
	Unprocessed sequences: 210768
	Found centers: 13095
	Processed sequences: 1338108
	Unprocessed sequences: 194114
	Found centers: 13563
	Processed sequences: 1413309
	Unprocessed sequences: 217638
	Found centers: 13916
	Processed sequences: 1426200
	Unprocessed sequences: 203726
	Found centers: 14366
	Processed sequences: 1482720
	Unprocessed sequences: 217439
	Found centers: 14648
	Processed sequences: 1549592
	Unprocessed sequences: 216905
	Found centers: 15453
	Processed sequences: 1566431
	Unprocessed sequences: 205939
	Found centers: 15909
	Processed sequences: 1610994
	Unprocessed sequences: 211989
	Found centers: 16228

meshclust3.sh.e539214:

Mean 1 (mean1) and Mean 2 (mean2) cannot be zeros. Mean 1 is: 0, mean 2 is: 0.226562

terminate called after throwing an instance of 'std::exception'
  what():  std::exception
/opt/gridengine/default/spool/compute-0-0/job_scripts/539214: line 1: 229599 Aborted                 (core dumped) /home/gmeng/soft/MeShClust_v3/Identity/bin/meshclust -d combined.fa -t 0.6 -o out.clstr -c 80 -e y -a n -p 10
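The standard-output log above (meshclust3.sh.o539214) reports progress as repeating groups of "Processed sequences", "Unprocessed sequences" and "Found centers" lines. For long runs it can help to pull those numbers out of the log and look at the last snapshot before a crash. The following is a minimal Python sketch. It assumes the log keeps exactly the line format shown on this page, and the function name is made up for illustration.

```python
import re
from typing import Dict, Iterable, List

# Matches the three progress lines exactly as they appear in the log above.
PROGRESS = re.compile(
    r"^\s*(Processed sequences|Unprocessed sequences|Found centers):\s*(\d+)\s*$"
)


def parse_progress(lines: Iterable[str]) -> List[Dict[str, int]]:
    """Collect one dict per progress group; a group ends at 'Found centers'."""
    snapshots: List[Dict[str, int]] = []
    current: Dict[str, int] = {}
    for line in lines:
        match = PROGRESS.match(line)
        if match is None:
            continue  # banner, training output, blank lines, etc.
        key, value = match.group(1), int(match.group(2))
        current[key] = value
        if key == "Found centers":
            snapshots.append(current)
            current = {}
    return snapshots
```

Applied to the log above, the last snapshot would show 1610994 processed sequences, 211989 unprocessed sequences and 16228 centers at the point where the exception was thrown.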

missing meshclust

I downloaded the latest release. It compiled successfully; however, I see only the identity binary. There is no MeShClust binary, and I don't see any meshclust folder in src.

thanks