bioinformaticstoolsmith / identity
License: Other
Dear,
I was wondering whether there was a way to turn off the warnings like:
Statistician warning at harmonicMeanRSimilarity. A sequence is too short. Similarity is assigned zero.
and
Skip row
?
Thank you!
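Nothing in this thread names a switch that silences these messages, so one workaround is to filter them out of the captured output. The sketch below is a hypothetical helper, not part of the tool: it drops any line containing one of the two messages quoted above and passes everything else through. Which stream the tool writes the warnings to is not stated here, so redirecting both streams into the filter is an assumption.

```python
import sys

# Substrings of the two messages quoted above; extend the tuple as needed.
NOISY = (
    "Statistician warning at harmonicMeanRSimilarity",
    "Skip row",
)


def drop_warnings(lines, patterns=NOISY):
    """Yield only the lines that contain none of the given substrings."""
    for line in lines:
        if not any(p in line for p in patterns):
            yield line


def main():
    """Filter standard input to standard output."""
    sys.stdout.writelines(drop_warnings(sys.stdin))
```

To use it as a pipe filter, save the block as a script (the name drop_warnings.py is made up), add a final line that calls main(), and run something like `meshclust -d in.fasta -o out.txt 2>&1 | python drop_warnings.py`. The filter only hides the messages; the underlying comparisons are still assigned zero similarity.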
Hi,
I used meshclust3 to cluster repeat sequences (including MITE and TIR) and got many warnings like "Statistician warning at harmonicMeanRSimilarity. A sequence is too short. Similarity is assigned zero.". The sequence lengths in the input file range from 100 bp to ~1 Mb (sometimes from 20 bp to 3 Mb).
Will these warnings affect the results, and how can I avoid them?
Best,
Kun
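Since the warning says a sequence is too short, one thing to try is removing very short records before clustering. The following is a minimal sketch under stated assumptions: the function names are invented for illustration, and the cutoff `min_len` is a value you would have to choose yourself, because the length below which the tool emits the warning is not given in this thread.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(in_path, out_path, min_len):
    """Copy records with at least min_len bases; return (kept, dropped)."""
    kept = dropped = 0
    with open(out_path, "w") as out:
        for header, seq in read_fasta(in_path):
            if len(seq) >= min_len:
                out.write(f">{header}\n{seq}\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

For example, `filter_by_length("repeats.fa", "repeats.min100.fa", 100)` would keep only records of 100 bases or more (both file names are placeholders). Whether the warnings disappear afterwards depends on the tool's internal cutoff, so this is something to test rather than a guaranteed fix.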
Hi,
Thanks for the tool!
Can I use it on very large data, like 300K-400K genomes (reads and assemblies)?
Best,
I can't see it in the README or in the -h output of meshclust.
And I got this error:
Database file: all_region_prefixed.fasta
Output file: output1.txt
Cores: 16
Provided threshold: 0.97
Block size for all vs. all: 25000
Block size for reading sequences: 100000
Number of data passes: 10
Can assign all: No
Average: 224
K: 3
Histogram size: 64
A histogram entry is 32 bits.
Generating data.
Statistician warning at harmonicMeanRSimilarity. A sequence is too short. Similarity is assigned zero.
KmerHistogram: At least one valid segment is required.
Sequence: G
terminate called after throwing an instance of 'std::exception'
what(): std::exception
I think this was due to the short lengths of some sequences. Is that right?
My questions are,
Thank you.
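The log above ends with "KmerHistogram: At least one valid segment is required." for the one-base sequence "G", with K reported as 3. A sketch for locating such records before running the tool is shown below. It assumes that a "valid segment" means an uninterrupted run of A, C, G or T at least K bases long; that reading is a guess about the tool's internals, and the function names are invented for illustration.

```python
import re


def longest_valid_run(seq):
    """Length of the longest uninterrupted run of A, C, G or T in seq."""
    runs = re.findall(r"[ACGT]+", seq.upper())
    return max((len(r) for r in runs), default=0)


def records_without_kmer(records, k):
    """Return headers of records that cannot supply a single k-mer."""
    return [header for header, seq in records if longest_valid_run(seq) < k]


# Toy records; only the first sequence is taken from the log above.
example = [("a", "G"), ("b", "ACGTACGT"), ("c", "ANNCNNG")]
print(records_without_kmer(example, 3))  # prints ['a', 'c']
```

Records flagged this way could then be removed or inspected by hand before clustering.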
Hi there,
Thanks for making this tool available and for the clean repo!
I am trying to run MeshClust on a set of 1.3 million sequences with lengths ranging from 75bp - 6000bp. From the paper I saw that you were able to run meshclust on a microbiome dataset which comprised ~1 million sequences in ~2hrs with the hardware specified in the paper.
I've run meshclust on my dataset with a calculated identity threshold; it has been running for 12 hrs, has only processed 160k sequences, and is still on the first data pass. I see that there are still many (~50k) seqs in the reservoir. I'm guessing it is taking so long because the reservoir is continually being refilled and the initialization step for mean shift is then rerun.
I wanted to check whether you have any ideas why it's taking this long, or ways I could split the data for better runtime.
Kind regards,
Evan
Not sure where to troubleshoot.
The Identity test worked fine.
The mesh test crashed.
Database file: 97_shuffled.fa
Output file: output2.txt
Cores: 2
Provided threshold: 0.97
Block size for all vs. all: 25000
Block size for reading sequences: 100000
Number of data passes: 10
Can assign all: No
Average: 289
K: 4
Histogram size: 256
A histogram entry is 16 bits.
Generating data.
Preparing data ...
Positive examples: 10000
Training size: 5000
Validation size: 5000
Better performance of: 0.00178729
jeffrey_divergence
Better performance of: 0.00169123
jeffrey_divergence
simMM
Better performance of: 0.00165223
jeffrey_divergence
sim_ratio
simMM
Better performance of: 0.00157806
chi_squared
jeffrey_divergence
sim_ratio
simMM
squared_chord^2 x hellinger^2
Better performance of: 0.00154754
chi_squared
jeffrey_divergence
sim_ratio
simMM
bray_curtis x hellinger^2
squared_chord^2 x hellinger^2
Better performance of: 0.00151275
chi_squared
bray_curtis
jeffrey_divergence
sim_ratio
simMM
bray_curtis x hellinger^2
covariance_r x simMM^2
squared_chord^2 x hellinger^2
Better performance of: 0.00147395
chi_squared
bray_curtis
jeffrey_divergence
sim_ratio
simMM
bray_curtis x jeffrey_divergence
bray_curtis x hellinger^2
covariance_r x simMM^2
chi_squared^2 x squared_chord^2
squared_chord^2 x hellinger^2
Selected statistics:
chi_squared
bray_curtis
jeffrey_divergence
sim_ratio
simMM
bray_curtis x jeffrey_divergence
bray_curtis x hellinger^2
covariance_r x simMM^2
chi_squared^2 x squared_chord^2
squared_chord^2 x hellinger^2
Finished training.
MAE: 0.0217444
MSE: 0.00147395
Optimizing ...
Validating ...
MAE: 0.0292413
MSE: 0.00208381
Clustering ...
Data run 1 ...
Killed
Why would the test fail? Could I be lacking some dependencies?
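A bare "Killed" printed by the shell normally means the process received SIGKILL rather than failing on a missing dependency. On Linux this is often the kernel's out-of-memory killer, although the log above cannot confirm that. The sketch below (function names are illustrative) runs a command and reports whether it exited normally or was terminated by a signal, using the POSIX convention that Python's subprocess module reports a signal death as a negative return code.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Translate a subprocess return code into a short description."""
    if returncode < 0:
        # On POSIX, a negative code means the child died from that signal.
        return f"terminated by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_and_report(cmd):
    """Run cmd (a list of arguments) and describe how it ended."""
    completed = subprocess.run(cmd)
    return describe_exit(completed.returncode)
```

For instance, `run_and_report(["meshclust", "-d", "97_shuffled.fa", "-o", "output2.txt"])` would return "terminated by SIGKILL" if the run is killed again. In that case, checking available memory or the system log for out-of-memory messages is a reasonable next step.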
Hello there
Thanks a lot already for the work on this package!
I am trying to cluster 34,937,058 sequences of about 1000bp (18S amplicons) contained in a single fasta file, I'm using the following code on HPC:
meshclust \
-d /export/lv6/projects/NIOZ320/Analysis/3.1_Ecological_Analysis/18S_NIOZ320_NIOZ326.fa \
-o /export/lv6/projects/NIOZ320/Analysis/3.1_Ecological_Analysis/consensus_95/18S_NIOZ320_NIOZ326_cl_0.95.txt \
-t 0.95 \
-b 45000 \
-v 180000
The code has been running for 125 days and was about to finish its 4th run, which I thought would be the last, but a 5th clustering run of the data has started (see screenshot). This last run indicated from the beginning that there are "0 unprocessed sequences", and the number of found centers has been stagnating around 47,900 for quite some time.
I understand that this is a lot of data and that the error rate of Oxford Nanopore reads probably adds complexity to the clustering algorithm. The amplicons have nevertheless been quality filtered and represent consensuses of several amplicons (pre-clustered based on Unique Molecular Identifiers). A previous Meshclust run with a similar approach but 16S data took ~80 days to cluster 33,306,880 amplicons and found 55,715 centers.
My questions are:
Am I doing something wrong here? Can Meshclust support such a computation? ("swarm -d 3" ran faster but clustered only 500K reads).
Is there a way to stop the run at this stage and get the current output (centers and their composition)? Is there a way to predict how many runs will it take Meshclust to give an output?
Any help would be highly appreciated!
Best
Pierre
error: ‘sleep_for’ is not a member of ‘std::this_thread’
368 | std::this_thread::sleep_for(std::chrono::seconds(1));
|
This can be fixed by including <thread> and/or <chrono>.
Hi @benjamin-james ,
I was wondering if it is possible to cluster reads/contigs based on the similarity in sequence?
For example, let's say I have 1000 contigs and want to find out how they cluster together based on the sequence similarity?
Thanks,
Vahid
Sorry, there must be some resource I'm missing, but I could not find the latest v3.0.0.
Hi,
I was wondering if MeshClust could be used to cluster MAGs in order to dereplicate them? There is no mention of metagenome assembly in your paper and I'm wondering if the distance estimation would be sensitive to varying bin completeness and contamination levels.
Cheers
Dears,
Greetings. I found that when I have two contigs of different lengths, for example 30,000 bp and 3,000 bp, they are reported as almost 90% identical. That is fine from the 3,000 bp perspective but not from the 30,000 bp one. I couldn't find a parameter to correct this value (or to select a different one). Is there any parameter or flag to adjust the identity values? And would it be okay if I corrected the values by the length ratio? I am looking for an average of the two perspectives rather than the highest value.
Thank you in advance!
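The length-ratio correction described above can be written out as a small calculation. This is only a sketch of the poster's proposal, not a feature of the tool: it assumes the reported identity describes the shorter sequence in full, scales it by the shorter-to-longer length ratio to get the longer sequence's perspective, and averages the two. The function name is hypothetical.

```python
def two_sided_identity(identity, len_a, len_b):
    """Return (from_shorter, from_longer, average) identity values."""
    short, long_ = sorted((len_a, len_b))
    if short <= 0:
        raise ValueError("sequence lengths must be positive")
    from_shorter = identity
    # Scale by the fraction of the longer sequence that the shorter one spans.
    from_longer = identity * short / long_
    return from_shorter, from_longer, (from_shorter + from_longer) / 2
```

With the numbers from the question, `two_sided_identity(0.9, 30000, 3000)` gives roughly 0.9 from the shorter contig's perspective, roughly 0.09 from the longer one's, and an average near 0.495.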
I am running the following FASTA file
fasta.zip
Using ../meshclust -d file.fasta -o file.txt
MeShClust 2.0 is developed by Hani Z. Girgis, PhD.
This program clusters DNA sequences using identity scores obtained without alignment.
Copyright (C) 2021-2022 Hani Z. Girgis, PhD
Academic use: Affero General Public License version 1.
Any restrictions to use for profit or non-academics: Alternative commercial license is required.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Please contact Dr. Hani Z. Girgis if you need more information.
Please cite the following papers:
MeShClust v3.0: High-quality clustering of DNA sequences using the mean shift algorithm
and alignment-free identity scores (2022). Hani Z. Girgis, BMC Genomics, 23(1):423.
Identity: Rapid alignment-free prediction of sequence alignment identity scores using
self-supervised general linear models (2021). Hani Z. Girgis, Benjamin T. James, and
Brian B. Luczak. NAR Genom Bioinform, 13(1), lqab001.
A survey and evaluations of histogram-based statistics in alignment-free sequence
comparison (2019). Brian B. Luczak, Benjamin T. James, and Hani Z. Girgis. Briefings
in Bioinformatics, 20(4):1222–1237.
MeShClust: An intelligent tool for clustering DNA sequences (2018). Benjamin T. James,
Brian B. Luczak, and Hani Z. Girgis. Nucleic Acids Res, 46(14):e83.
Database file: low_diversity_day.fasta
Output file: low_diversity_day.txt
Cores: 16
Estimating the threshold ...
Average: 29846
K: 7
Histogram size: 16384
A histogram entry is 32 bits.
Generating data.
Number of standard deviations: 2
Preparing data ...
Positive examples: 9990
Training size: 4995
Validation size: 4995
Better performance of: 0.000105995
jeffrey_divergence
Better performance of: 4.05578e-05
jeffrey_divergence
correlation x correlation^2
Better performance of: 3.28931e-05
jeffrey_divergence
simMM
euclidean x cosine
euclidean x correlation
correlation x correlation^2
Better performance of: 2.8555e-05
jeffrey_divergence
simMM
chi_squared^2
minkowski^2
euclidean x cosine
euclidean x correlation
chi_squared x cosine^2
correlation x correlation^2
Selected statistics:
jeffrey_divergence
simMM
chi_squared^2
minkowski^2
euclidean x cosine
euclidean x correlation
chi_squared x cosine^2
correlation x correlation^2
Finished training.
MAE: 0.00398844
MSE: 2.8555e-05
Optimizing ...
Validating ...
MAE: 0.00411611
MSE: 3.08339e-05
Mean = 0.992131
STD = 3.33067e-16
Min = 0.992131
============================================
0.992131
Final threshold: 0.992131
Calculated threshold: 0.992131
Block size for all vs. all: 25000
Block size for reading sequences: 100000
Number of data passes: 10
Can assign all: No
Average: 29846
K: 7
Histogram size: 16384
A histogram entry is 32 bits.
Generating data.
Floating point exception (core dumped)
Hi Dr. Girgis,
Thanks for this interesting package; I'm really excited about it. I installed Meshclust and could easily run some of the example data to completion (keratin_query.fasta). Unfortunately, when I try to run it on my fasta file with sequences of interest (the majority of which have >99% similarity), I get the error "Floating point exception (core dumped)". My DNA sequences range from 1381 bp to 1840 bp, with the majority at 1584 bp. Below is the output from my terminal when I run Meshclust; any help would be greatly appreciated. Thanks!
Database file: ./consensus_reads.fa
Output file: ./cluster_test.txt
Cores: 96
Estimating the threshold ...
Average: 1585
K: 5
Histogram size: 1024
A histogram entry is 16 bits.
Generating data.
Number of standard deviations: 1
Preparing data ...
Positive examples: 10000
Training size: 5000
Validation size: 5000
Better performance of: 0.00110822
jeffrey_divergence
Better performance of: 0.000747565
jeffrey_divergence
cosine x jeffrey_divergence
Better performance of: 0.000719267
jeffrey_divergence
simMM
cosine x jeffrey_divergence
Better performance of: 0.000681764
jeffrey_divergence
simMM
manhattan x correlation
cosine x jeffrey_divergence
correlation x simMM^2
Better performance of: 0.000652853
jeffrey_divergence
sim_ratio
simMM
manhattan x correlation
cosine x jeffrey_divergence
correlation x simMM^2
Selected statistics:
jeffrey_divergence
sim_ratio
simMM
manhattan x correlation
cosine x jeffrey_divergence
correlation x simMM^2
Finished training.
MAE: 0.0175349
MSE: 0.000652853
Optimizing ...
Validating ...
MAE: 0.0219525
MSE: 0.000956407
Initialization: 0.373589 1
Stopping because there is no change for three iterations: 3
0.374592 0.995013 0.0001 0.9999
Initialization: 0.374592 0.995013 0.005 0.005
Stopping because there is no change for three iterations: 3
0.374592 0.995013 0.00129545 0.00302476 0.0001 0.9999
Initialization: 0.915731 1
Stopping because there is no change for three iterations: 10
0.956302 0.995658 0.0011 0.9989
Initialization: 0.956302 0.995658 0.0170152 0.005
Converged.
0.995614 0.996173 0.00287971 9.07327e-07 1 0
Initialization: 0.874258 1
Stopping because there is no change for three iterations: 9
0.954592 0.995994 0.00215 0.99785
Initialization: 0.954592 0.995994 0.0202069 0.005
Converged.
0.995905 0.996071 0.00327823 1.77699e-06 1 0
============================================
0.992735
0.992628
0.991988
Final threshold: 0.992628
Calculated threshold: 0.992628
Block size for all vs. all: 25000
Block size for reading sequences: 100000
Number of data passes: 10
Can assign all: No
Average: 1585
K: 5
Histogram size: 1024
A histogram entry is 16 bits.
Generating data.
Floating point exception (core dumped)
Hi,
Thanks for the great tool. I would like to include meshclust3 in my workflow. I wonder if it's possible to build a bioconda recipe for it.
I tried to make one, but it couldn't identify the CXX compiler. Do you have any suggestions?
bash.sh
meta.yaml
-- The C compiler identification is GNU 12.3.0
-- The CXX compiler identification is unknown
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: $BUILD_PREFIX/bin/x86_64-conda-linux-gnu-cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
CMake Error at CMakeLists.txt:5 (project):
The CMAKE_CXX_COMPILER:
g++-7
is not a full path and was not found in the PATH.
Tell CMake where to find the compiler by setting either the environment
variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
to the compiler, or to the compiler name if it is in the PATH.
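The CMake message above asks for either the CXX environment variable or the CMAKE_CXX_COMPILER cache entry to point at a compiler that actually exists. The sketch below is a hypothetical helper for a build script: it looks for a usable C++ compiler and assembles the corresponding configure command. It assumes the build environment exports CXX (conda-build usually does when a C++ compiler is listed in the recipe) and falls back to a few common compiler names otherwise.

```python
import os
import shutil


def find_cxx(candidates=("g++", "c++", "clang++")):
    """Return a full path to a C++ compiler, or None if none is found."""
    # Try $CXX first, then the fallback names, in order.
    names = [os.environ.get("CXX", "")] + list(candidates)
    for name in names:
        if not name:
            continue
        path = shutil.which(name)
        if path:
            return path
    return None


def cmake_configure_args(source_dir, cxx_path):
    """Build the cmake command line that pins the C++ compiler."""
    return ["cmake", f"-DCMAKE_CXX_COMPILER={cxx_path}", source_dir]
```

If CMakeLists.txt itself sets the compiler name to g++-7, as the error at its line 5 suggests it may, the command-line setting could be overridden, and that line in CMakeLists.txt may need to be changed as well.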
MeshClust looks like just what I want for clustering sets of haplotype alleles. Thanks for writing it.
Please can you add to the documentation a description of the columns in the output file? I assume that column 1 is the cluster number and column 2 is obviously the sequence id. Column 3 looks like some measure of similarity but to what? What is column 4?
Thanks Harry
Hi there, I'm trying to run an all vs. all comparison of around 10k sequences which are on average 100 kb in length.
I've tried running with various combinations of -c 8 or not specifying any cores (server has 24 cores) but keep getting the error "Segmentation fault (core dumped)" each time.
~/programs/Identity/bin/identity -d pooled_genomes.fasta -o output.txt -t 0.9 -c 8
I'm not sure what I'm doing wrong. Any help would be greatly appreciated! :)
Hi Hani,
I've just come across your algorithm and paper - it looks great!
I am planning on using it to cluster ONT MinION sequenced full-length 16S microbiome samples (after which I plan on creating some "consensus" representative read samples to use for taxonomic classification, which hopefully will be an improvement on Kraken2 and other clustering methods, e.g. UMAP-based methods from https://github.com/genomicsITER/NanoCLUST ).
Obviously these reads differ from the microbiome reads described in your paper: for my test samples, the mean Q score of the reads is around 11-12 (so low-90s % accurate), and they are around 1500 bp long.
Do you have any advice on tweaking the parameters to optimise cluster quality given the low accuracy of the reads? As an aside, I've found that -t 0.8 gives me the highest cluster score on average for my test set, but I haven't done rigorous tests yet.
And furthermore, would you be able to provide some pre-compiled binaries (please excuse my ignorance if it is not possible, I am a bioinformatician/computational biologist not a computer scientist and I don't code in C++)? I am unable to compile the program on my M1 chip Mac (it runs fine on my Intel chip MacBook once I installed gcc using brew) because g++ doesn't seem to be compatible (yet).
Kind regards,
George Bouras
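The "low 90s % accurate" figure quoted for Q 11-12 follows from the standard Phred definition, where the error probability is 10^(-Q/10). A small sketch of that conversion is below. It treats the mean Q score as if it applied uniformly to every base, which is a simplification.

```python
def phred_to_accuracy(q):
    """Per-base accuracy implied by a Phred quality score q."""
    return 1.0 - 10.0 ** (-q / 10.0)


for q in (11, 12):
    print(f"Q{q}: {phred_to_accuracy(q):.1%}")
# prints:
# Q11: 92.1%
# Q12: 93.7%
```

Two independent reads at about 92-94% accuracy can differ from each other by more than either differs from the true sequence, which is consistent with a threshold well below the read accuracy, such as the -t 0.8 mentioned above, working better in practice.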
Hi,
Did you compare the results with the kmer-db tool (https://doi.org/10.1093/bioinformatics/bty610)?
Thank you
Hi there,
Thanks for the tool!
When I tried meshclust 3.0, I got a core dump error. Do you have any suggestions for this? Thank you!
The compute node of the cluster has 56 cores (112 threads) and 1.5T RAM, and we did not limit how much RAM meshclust could use.
Best
Guanliang
-rw-rw-r-- 1 gmeng 1.5G Jun 13 17:15 combined.fa
-rw-rw-r-- 1 gmeng 112 Jun 14 10:00 meshclust3.sh
-rw-r--r-- 1 gmeng 5.9K Jun 15 20:16 meshclust3.sh.o539214
-rw------- 1 gmeng 18G Jun 15 22:18 core.229599
-rw-r--r-- 1 gmeng 416 Jun 15 22:18 meshclust3.sh.e539214
$ grep -c '>' combined.fa
5652580
meshclust3.sh:
/home/gmeng/soft/MeShClust_v3/Identity/bin/meshclust -d combined.fa -t 0.6 -o out.clstr -c 80 -e y -a n -p 10
meshclust3.sh.o539214:
MeShClust v3.0 is developed by Hani Z. Girgis, PhD.
This program clusters DNA sequences using identity scores obtained without alignment.
Copyright (C) 2021-2022 Hani Z. Girgis, PhD
Academic use: Affero General Public License version 1.
Any restrictions to use for profit or non-academics: Alternative commercial license is required.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Please contact Dr. Hani Z. Girgis if you need more information.
Please cite the following papers:
1. Identity: Rapid alignment-free prediction of sequence alignment identity scores using
self-supervised general linear models. Hani Z. Girgis, Benjamin T. James, and Brian B.
Luczak. NAR GAB, 3(1):lqab001, 2021.
2. MeShClust: an intelligent tool for clustering DNA sequences. Benjamin T. James,
Brian B. Luczak, and Hani Z. Girgis. Nucleic Acids Res, 46(14):e83, 2018.
3. MeShClust v3.0: High-quality clustering of DNA sequences using the mean shift algorithm
and alignment-free identity scores. Hani Z. Girgis. A great journal. 2022.
Database file: combined.fa
Output file: out.clstr
Cores: 80
Provided threshold: 0.6
Block size for all vs. all: 25000
Block size for reading sequences: 100000
Number of data passes: 10
Can assign all: No
Average: 756
K: 4
Histogram size: 256
A histogram entry is 16 bits.
Generating data.
Preparing data ...
Positive examples: 10000
Training size: 5000
Validation size: 5000
Better performance of: 0.00324074
chi_squared x jeffrey_divergence
Better performance of: 0.00278104
chi_squared x jeffrey_divergence
chi_squared^2 x d2_s_r^2
Better performance of: 0.00275123
chi_squared x jeffrey_divergence
chi_squared^2 x d2_s_r^2
squared_chord^2 x hellinger^2
Better performance of: 0.00271437
chi_squared x jeffrey_divergence
chi_squared^2 x d2_s_r^2
bray_curtis^2 x d2_s_r^2
squared_chord^2 x hellinger^2
Better performance of: 0.00266334
chi_squared x squared_chord
chi_squared x jeffrey_divergence
chi_squared^2 x d2_s_r^2
bray_curtis^2 x d2_s_r^2
squared_chord^2 x hellinger^2
kulczynski_2^2 x d2_s_r^2
Better performance of: 0.00263148
squared_chord
chi_squared x squared_chord
chi_squared x jeffrey_divergence
chi_squared^2 x d2_s_r^2
bray_curtis^2 x d2_s_r^2
squared_chord^2 x hellinger^2
kulczynski_2^2 x d2_s_r^2
Better performance of: 0.00257594
squared_chord
chi_squared x squared_chord
chi_squared x jeffrey_divergence
hellinger x hellinger^2
chi_squared^2 x d2_s_r^2
bray_curtis^2 x d2_s_r^2
squared_chord^2 x hellinger^2
kulczynski_2^2 x d2_s_r^2
Better performance of: 0.00249854
squared_chord
manhattan x simMM
chi_squared x squared_chord
chi_squared x jeffrey_divergence
hellinger x hellinger^2
chi_squared^2 x d2_s_r^2
bray_curtis^2 x d2_s_r^2
squared_chord^2 x hellinger^2
kulczynski_2^2 x d2_s_r^2
Selected statistics:
squared_chord
manhattan x simMM
chi_squared x squared_chord
chi_squared x jeffrey_divergence
hellinger x hellinger^2
chi_squared^2 x d2_s_r^2
bray_curtis^2 x d2_s_r^2
squared_chord^2 x hellinger^2
kulczynski_2^2 x d2_s_r^2
Finished training.
MAE: 0.036734
MSE: 0.00249854
Optimizing ...
Validating ...
MAE: 0.0426102
MSE: 0.00325363
Clustering ...
Data run 1 ...
Processed sequences: 25000
Unprocessed sequences: 0
Found centers: 772
Processed sequences: 50000
Unprocessed sequences: 24657
Found centers: 770
Processed sequences: 100478
Unprocessed sequences: 41448
Found centers: 1278
Processed sequences: 166024
Unprocessed sequences: 32518
Found centers: 2628
Processed sequences: 206655
Unprocessed sequences: 27580
Found centers: 3034
Processed sequences: 338846
Unprocessed sequences: 65658
Found centers: 3620
Processed sequences: 348903
Unprocessed sequences: 50307
Found centers: 4308
Processed sequences: 414183
Unprocessed sequences: 67888
Found centers: 4653
Processed sequences: 428889
Unprocessed sequences: 56801
Found centers: 5147
Processed sequences: 473924
Unprocessed sequences: 66571
Found centers: 5560
Processed sequences: 591912
Unprocessed sequences: 101368
Found centers: 6457
Processed sequences: 599863
Unprocessed sequences: 83946
Found centers: 6943
Processed sequences: 682732
Unprocessed sequences: 112078
Found centers: 7277
Processed sequences: 694499
Unprocessed sequences: 97930
Found centers: 7757
Processed sequences: 752209
Unprocessed sequences: 114752
Found centers: 8067
Processed sequences: 767163
Unprocessed sequences: 94407
Found centers: 8447
Processed sequences: 867163
Unprocessed sequences: 141679
Found centers: 8792
Processed sequences: 875812
Unprocessed sequences: 125026
Found centers: 9248
Processed sequences: 950986
Unprocessed sequences: 155363
Found centers: 9586
Processed sequences: 962281
Unprocessed sequences: 137454
Found centers: 10001
Processed sequences: 1050620
Unprocessed sequences: 173768
Found centers: 10430
Processed sequences: 1060816
Unprocessed sequences: 156809
Found centers: 10884
Processed sequences: 1138833
Unprocessed sequences: 189905
Found centers: 11240
Processed sequences: 1219898
Unprocessed sequences: 191996
Found centers: 12162
Processed sequences: 1234377
Unprocessed sequences: 173682
Found centers: 12615
Processed sequences: 1328038
Unprocessed sequences: 210768
Found centers: 13095
Processed sequences: 1338108
Unprocessed sequences: 194114
Found centers: 13563
Processed sequences: 1413309
Unprocessed sequences: 217638
Found centers: 13916
Processed sequences: 1426200
Unprocessed sequences: 203726
Found centers: 14366
Processed sequences: 1482720
Unprocessed sequences: 217439
Found centers: 14648
Processed sequences: 1549592
Unprocessed sequences: 216905
Found centers: 15453
Processed sequences: 1566431
Unprocessed sequences: 205939
Found centers: 15909
Processed sequences: 1610994
Unprocessed sequences: 211989
Found centers: 16228
meshclust3.sh.e539214:
Mean 1 (mean1) and Mean 2 (mean2) cannot be zeros. Mean 1 is: 0, mean 2 is: 0.226562
terminate called after throwing an instance of 'std::exception'
what(): std::exception
/opt/gridengine/default/spool/compute-0-0/job_scripts/539214: line 1: 229599 Aborted (core dumped) /home/gmeng/soft/MeShClust_v3/Identity/bin/meshclust -d combined.fa -t 0.6 -o out.clstr -c 80 -e y -a n -p 10
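The "Processed sequences / Unprocessed sequences / Found centers" triplets in the log above can be parsed to see how quickly new centers are still being found, which is one rough way to judge whether a long run is stagnating. The sketch below assumes the three-line report format shown in this log. The function names are invented, and nothing here predicts how many data passes the tool will need.

```python
import re

PROGRESS = re.compile(
    r"(Processed sequences|Unprocessed sequences|Found centers):\s*(\d+)"
)


def parse_progress(lines):
    """Group the three progress counters into one dict per report."""
    reports, current = [], {}
    for line in lines:
        match = PROGRESS.search(line)
        if not match:
            continue
        key, value = match.group(1), int(match.group(2))
        # A new "Processed sequences" line starts the next report.
        if key == "Processed sequences" and current:
            reports.append(current)
            current = {}
        current[key] = value
    if current:
        reports.append(current)
    return reports


def centers_gained(reports):
    """New centers found between consecutive reports."""
    counts = [r["Found centers"] for r in reports if "Found centers" in r]
    return [b - a for a, b in zip(counts, counts[1:])]
```

Feeding it the standard output file, for example `centers_gained(parse_progress(open("meshclust3.sh.o539214")))`, gives the change in center count from one report to the next.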
Hi
I downloaded the latest release and compiled it successfully; however, I see only the identity binary. There is no MeShClust binary, and I don't see any meshclust folder in src.
thanks