ms609 / treedist

Calculate distances between phylogenetic trees in R

Home Page: https://ms609.github.io/TreeDist/

R 73.50% TeX 4.07% C++ 21.79% CSS 0.22% C 0.42%

r r-package rstats phylogenetic-trees tree-distances trees

treedist's Introduction

TreeDist

'TreeDist' is an R package that implements a suite of metrics that quantify the topological distance between pairs of unweighted phylogenetic trees. It also includes a simple 'Shiny' application to allow the visualization of distance-based tree spaces, and functions to calculate the information content of trees and splits.

'TreeDist' primarily employs metrics in the category of 'generalized Robinson–Foulds distances': they are based on comparing splits (bipartitions) between trees, and thus reflect the relationship data within trees, with no reference to branch lengths.

Generalized RF distances

The Robinson-Foulds distance simply tallies the number of non-trivial splits (sometimes inaccurately termed clades, nodes or edges) that occur in one tree but not in the other: any pair of splits that is not perfectly identical contributes to the distance score, however similar or different the splits are. By overlooking potential similarities between almost-identical splits, this conservative approach has undesirable properties.

'Generalized' RF metrics generate matchings that pair splits in one tree with similar splits in the other. Each pair of splits is assigned a similarity score; the sum of these scores in the optimal matching then quantifies the similarity between two trees.
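
As a minimal illustration of the tally that the plain Robinson-Foulds distance performs, the base R sketch below hand-codes the splits of two hypothetical six-leaf trees, ((a,b),((c,d),(e,f))) and ((a,c),((b,d),(e,f))), and counts the splits found in only one of them. The split lists and the helper SplitKey() are assumptions made for illustration; they are not part of 'TreeDist'.

```r
# Each split is written as the set of leaves on the side that excludes leaf "a".
splits1 <- list(c("c", "d", "e", "f"), c("c", "d"), c("e", "f"))
splits2 <- list(c("b", "d", "e", "f"), c("b", "d"), c("e", "f"))

# Hypothetical helper: a canonical text key, so identical splits compare equal
SplitKey <- function(split) paste(sort(split), collapse = " ")

keys1 <- vapply(splits1, SplitKey, character(1))
keys2 <- vapply(splits2, SplitKey, character(1))

# Every split present in exactly one tree adds one point to the distance
rfDistance <- length(setdiff(keys1, keys2)) + length(setdiff(keys2, keys1))
rfDistance
```

Only the split e f | a b c d is shared here; the two remaining splits in each tree all count towards the distance, regardless of how closely they resemble one another.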

Different ways of calculating the similarity between a pair of splits lead to different tree distance metrics, implemented in the functions below:

MutualClusteringInfo(), SharedPhylogeneticInfo()

Smith (2020) scores matchings based on the amount of information that one partition contains about the other. The Shared Phylogenetic Information assigns zero similarity to split pairs that cannot both exist on a single tree; the Mutual Clustering Information metric is more forgiving and exhibits more desirable behaviour, and is the recommended metric for tree comparison. (Its complement, ClusteringInfoDistance(), returns a tree distance.)
NyeSimilarity()

Nye et al. (2006) score matchings according to the size of the largest split that is consistent with both of them, normalized against the Jaccard index. This approach is extended by Böcker et al. (2013) with the Jaccard-Robinson-Foulds metric (function JaccardRobinsonFoulds()).
MatchingSplitDistance()

Bogdanowicz and Giaro (2012) and Lin et al. (2012) independently proposed counting the number of 'mismatched' leaves in a pair of splits. MatchingSplitInfoDistance() provides an information-based equivalent (Smith 2020).

The package also implements the variation of the path distance proposed by Kendall and Colijn (2016) (function KendallColijn()), approximations of the Nearest-Neighbour Interchange (NNI) distance (function NNIDist(); following Li et al. (1996)), and calculates the size (function MASTSize()) and information content (function MASTInfo()) of the Maximum Agreement Subtree.

For an implementation of the Tree Bisection and Reconnection (TBR) distance, see the package 'TBRDist'.
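
The functions named in this section can be called on a pair of trees that share the same leaves. The sketch below is a rough usage example rather than a recommended workflow; the two six-leaf trees are made up for illustration, and it assumes that 'TreeDist' and 'ape' are installed.

```r
library("TreeDist")

# Two hypothetical trees on the same six leaves
tree1 <- ape::read.tree(text = "((a,b),((c,d),(e,f)));")
tree2 <- ape::read.tree(text = "((a,c),((b,d),(e,f)));")

# Information-theoretic generalized RF measures (Smith 2020)
MutualClusteringInfo(tree1, tree2)    # a similarity
ClusteringInfoDistance(tree1, tree2)  # the corresponding distance
SharedPhylogeneticInfo(tree1, tree2)  # a similarity

# Other generalized RF measures
NyeSimilarity(tree1, tree2)
JaccardRobinsonFoulds(tree1, tree2)
MatchingSplitDistance(tree1, tree2)

# Comparing a tree with itself
selfDistance <- ClusteringInfoDistance(tree1, tree1)
```

Note that some functions return similarities and others distances, so their values are not directly comparable with one another.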

Installation

Install and load the library from CRAN as follows:

install.packages('TreeDist')
library('TreeDist')

You can install the development version of the package with:

if(!require("curl")) install.packages("curl")
if(!require("remotes")) install.packages("remotes")
remotes::install_github("ms609/TreeDist")

Tree space analysis

Construct tree spaces and readily visualize projected landscapes, avoiding common analytical pitfalls (Smith, 2022), using the inbuilt graphical user interface (Shiny GUI):

TreeDist::MapTrees()

Serious analysts should consult the vignette for a command-line interface.
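
For orientation only, a command-line mapping might start along the following lines. This is a simplified sketch, not the vignette's workflow: the random trees from ape::rtree() stand in for a real tree sample, and classical scaling with cmdscale() is just one possible mapping method. The vignette describes the checks needed to avoid the pitfalls mentioned above.

```r
library("TreeDist")

set.seed(1)
tips <- letters[1:8]
# 20 random trees on the same leaf set, standing in for a real tree sample
trees <- structure(lapply(1:20, function(i) ape::rtree(8, tip.label = tips)),
                   class = "multiPhylo")

# Distances between every pair of trees, then a two-dimensional mapping
distances <- ClusteringInfoDistance(trees)
mapping <- cmdscale(distances, k = 2)

plot(mapping, asp = 1, xlab = "", ylab = "", pch = 16)
```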

Documentation

References

Böcker, S. et al. (2013) The Generalized Robinson-Foulds metric. Algorithms in Bioinformatics. WABI 2013. Lecture Notes in Computer Science, 8126, 156–69.
Bogdanowicz, D. and Giaro, K. (2012) Matching split distance for unrooted binary phylogenetic trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9, 150–160.
Kendall, M. and Colijn, C. (2016) Mapping phylogenetic trees to reveal distinct patterns of evolution. Mol Biol Evol, 33, 2735–2743.
Li, M., Tromp, J. and Zhang, L.-X. (1996) Some notes on the nearest neighbour interchange distance. Computing and Combinatorics, Goos, G., Hartmanis, J., Leeuwen, J., Cai, J.-Y., and Wong, C. K., eds. Springer, Berlin. 343–351.
Nye, T.M.W. et al. (2006) A novel algorithm and web-based tool for comparing two alternative phylogenetic trees. Bioinformatics, 22, 117–119.
Smith, M.R. (2020) Information theoretic Generalized Robinson-Foulds metrics for comparing phylogenetic trees. Bioinformatics, 36, 5007–5013.
Smith, M.R. (2022) Robust analysis of phylogenetic tree space. Systematic Biology, 71, 1255–1270.

Please note that the 'TreeDist' project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

treedist's Issues

LAPJV

Compare with other implementations:

https://github.com/gaborcsardi/lpSolve
https://www.thp.uni-koeln.de/~berg/GraphAlignment/R-docs/LinearAssignment.html [https://www.bioconductor.org/packages/release/bioc/html/GraphAlignment.html]
Hungarian is used in https://search.r-project.org/CRAN/refmans/clue/html/solve_LSAP.html

See [https://stackoverflow.com/questions/72806265/linear-sum-assignment-hungarian-method-performance-in-r]

Comparing trees with non-identical tips

Thanks so much for the amazing package, and particularly the incredible documentation (could be a book??).

The docs suggest that we drop an issue if we have a use-case for comparing trees with non-identical tips, so here I am.

Use case

In phylogenomics we often sample 1000's of genes from our taxa of interest, and typically we are missing 1 or more taxa from most genes. For reference, here's a real-world example of the number of taxa in each gene tree from a published dataset of 8295 genes:

Since taxa are missing ~randomly, most of the cases with <100% of taxa will have non-overlapping taxon sets. This dataset is fairly representative. 16% of genes are sampled in all taxa, the rest are not. A good first approximation is that there are likely to be ~80%, or roughly 6500 different taxon sets.

I'd say that this is now very common (near universal) in modern phylogenomic studies. And most empiricists would love to be able to explore these tree sets in detail.

Useful things

The most general would be to get a matrix of normalised pairwise distances. E.g. using any suitably normalised distance metric, this should produce meaningful comparisons across all trees. This would also (I assume, maybe wrong?) allow for the visualisation of such tree sets. This seems to fit well within the remit of the package, while the next two perhaps don't.

Another useful thing would be the number of unique trees, perhaps with options for what is meant by unique, e.g.: (i) strictly unique such that different taxon sets means unique; (ii) unique in the sense of non-conflicting (e.g. RF == 0 after reducing both trees to the common taxon set). Combined with this, grouping the trees into their unique sets would be useful.
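
Option (ii) could be sketched roughly as follows, assuming that 'TreeDist' and 'ape' are installed. The two trees are invented for illustration, and pruning with ape::keep.tip() before calling RobinsonFoulds() is one possible approach rather than a built-in feature described in this thread.

```r
library("TreeDist")

# Two hypothetical gene trees with partly different taxon sets
tree1 <- ape::read.tree(text = "((a,b),((c,d),(e,f)));")
tree2 <- ape::read.tree(text = "((a,b),(c,(d,g)));")

# Leaves present in both trees
common <- intersect(tree1$tip.label, tree2$tip.label)

# Prune each tree to the shared leaves, then compare
reduced1 <- ape::keep.tip(tree1, common)
reduced2 <- ape::keep.tip(tree2, common)
rfCommon <- RobinsonFoulds(reduced1, reduced2)

# TRUE if the trees do not conflict on their shared leaves
nonConflicting <- rfCommon == 0
```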

Another thing (again I think beyond the purview of TreeDist, but I mention it in case this is something that may exist as an internal data structure of e.g. an RF calculation) is information on the observed splits in the data. I don't really know how one handles ambiguous splits in this case (e.g. a split on a tree with 42 taxa may be congruent with a large number of possible splits on the full tree of 52 taxa). One option would be to simply distribute the weight of these splits (i.e. a total weight of 1) over all possible splits with which they are congruent. Though perhaps this is too silly. The general point here is that users likely want to know which splits are common in their gene trees, and whether the common splits are all represented on their tree of interest (e.g. a species tree). Related work is on gene concordance factors, which are a summary statistic for this, but can still miss a lot of useful information about gene trees that are discordant with the species tree.

Suggestion: how to compare tree with subsampled trees, thanks

Check for duplicate edges in trees when comparing

The supplementary figure in
https://www.mdpi.com/2073-4409/10/2/362#supplementary
includes some duplicate nodes. Try to reconstruct how the resulting matching arose, and throw a warning when this situation arises (as it seems undesirable?)

Pathtrees

Document how to colour trees by likelihood and contour the resultant space, per
https://www.biorxiv.org/content/10.1101/2022.05.11.491507v1.full.pdf
Incorporate example from Wright & Lloyd 2020, to continue my polemic against the RF distance

64-bit splits

Investigate artefact with 64-tip and 128-tip trees; see https://ms609.github.io/TreeDist/articles/Tree-distance-metric-evaluation.html

Use SIMD to improve performance (?)

Could an implementation of SIMD bring about c. 2x performance gains?

See also 1
2

protoclust required for vignette

treespace.Rmd

Hypervolume comparison in app

Either compare cluster hypervolumes using "hypervolume", or (better still?) discover/invent a measure of overlap based on distances alone.

`Plot3` documentation

Check that this function is up to scratch before 2.1.0 release.

Include test coverage

Arboreal matchings not tested

Update documentation: Arboreal matchings are permitted, for reasons of computational efficiency, but non-coherent matchings may be prohibited.

LAPJV with non-square matrices

Code is ready in cpp's lapjv, but call is prevented in R's LAPJV.

Can we send non-square matrices without triggering a seg fault?
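
Until non-square input is supported, one possible workaround on the R side is to pad the cost matrix to a square with a constant before passing it to a solver. The helper PadToSquare() below is hypothetical and written in base R only. Because every padded cell costs the same, the padding adds a fixed amount to the total score without changing which real rows and columns are best paired.

```r
# Hypothetical helper: pad a rectangular cost matrix to a square one
PadToSquare <- function(cost, padValue = max(cost)) {
  n <- max(dim(cost))
  padded <- matrix(padValue, n, n)
  padded[seq_len(nrow(cost)), seq_len(ncol(cost))] <- cost
  padded
}

# A 2 x 3 cost matrix gains one dummy row
cost <- matrix(c(4, 1, 3,
                 2, 0, 5), nrow = 2, byrow = TRUE)
padded <- PadToSquare(cost)
padded
```

Rows or columns matched to the padding would then need to be dropped from the reported matching.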

Calculating adjusted R2 values from ClusteringInfoDistance() after randomization?

Hey @ms609

Following up on the issue44, I am interested in calculating the goodness of fit and adjusted R2 value for ClusteringInfoDistance() value from a given pair of host and symbiont trees after 100,000 randomizations.

Any suggestion if it's possible?

MapTrees() with multiple batches

sld/mk'/mk3 trees don't plot when MST is visible; batches can't be added. CID seems to be a particular problem - plotting happens ok with PID.

Trees with different leaves:

Hi,

Is it possible to analyze trees with different leaf labels? I am interested in the general architecture of the tree rather than the identity of the individuals within...

Thanks,
Christina

Warn when tips don't match?

It's potentially confusing when distances of zero are computed with no message, e.g. where tree 1 contains underscores and tree 2 spaces.
Perhaps throw warning when comparing trees with different leaves.
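
A check of this kind might look like the base R sketch below. WarnIfTipsDiffer() is a hypothetical helper, not an existing 'TreeDist' function, and the two lists stand in for "phylo" objects, of which only the tip.label element is used.

```r
# Hypothetical helper: warn when the two trees' leaf labels are not identical
WarnIfTipsDiffer <- function(tree1, tree2) {
  only1 <- setdiff(tree1$tip.label, tree2$tip.label)
  only2 <- setdiff(tree2$tip.label, tree1$tip.label)
  if (length(only1) || length(only2)) {
    warning("Leaf labels differ between trees. Only in tree 1: ",
            paste(only1, collapse = ", "), "; only in tree 2: ",
            paste(only2, collapse = ", "))
    return(invisible(FALSE))
  }
  invisible(TRUE)
}

# Underscores in one tree, spaces in the other
tree1 <- list(tip.label = c("Homo_sapiens", "Pan_troglodytes"))
tree2 <- list(tip.label = c("Homo sapiens", "Pan troglodytes"))

# Capture the warning text so that it can be inspected
result <- tryCatch(WarnIfTipsDiffer(tree1, tree2),
                   warning = function(w) conditionMessage(w))
result
```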

Internalize multi-tree comparisons in C++

When comparing all pairs of trees, we could attain faster results by:

Loading all trees into C++ and converting to split lists once (rather than for each pair)
Storing a sorted list of splits alongside a list of their properties
- Use a k-way merge to produce a single index of all unique splits
- Each tree will then be represented as a series of links to splits
- Each unique split can have its properties (in_split) calculated and stored once
- Also possible to compare all pairs of splits once -- if this doesn't consume too much memory.
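
The indexing idea in the list above can be prototyped in base R before any move to C++. In this sketch the splits are hypothetical text keys, sort(unique()) stands in for the k-way merge, and counting the leaves before the "|" stands in for the in_split property; none of this reflects the package's internal data structures.

```r
# Splits of three hypothetical trees, each split stored as a text key
treeSplits <- list(
  tree1 = c("a b | c d e", "a b c | d e"),
  tree2 = c("a b | c d e", "a b d | c e"),
  tree3 = c("a c | b d e", "a b c | d e")
)

# A single sorted index of every unique split across all trees
splitIndex <- sort(unique(unlist(treeSplits, use.names = FALSE)))

# Each tree becomes a vector of links (integer positions) into the index
treeLinks <- lapply(treeSplits, match, table = splitIndex)

# A per-split property is calculated once and stored alongside the index
inSplit <- vapply(strsplit(splitIndex, " | ", fixed = TRUE),
                  function(sides) length(strsplit(sides[1], " ")[[1]]),
                  integer(1))

# Splits shared by a pair of trees reduce to an integer intersection
sharedSplits12 <- length(intersect(treeLinks$tree1, treeLinks$tree2))
```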

Use lighter Rcpp?

RcppCore/Rcpp#1191

input tree one by one?

polytomies allowed in the genetree?

Dear @ms609,

Are polytomies allowed in the gene trees in TreeDist?

Faster RF with multiple trees

Use Day (1985)'s algorithm to compute RF distances in linear time

Day, W. 1985. Optimal algorithms for comparing trees with labeled leaves. J. Classif. 2, 7–28.

VisualizeMatching

Hello,

I'm trying to compare two trees with the following command

VisualizeMatching(JaccardRobinsonFoulds, S16, Core_2) results in:
Error in edge.width[se] <- 1 + (10 * ns) :
NAs are not allowed in subscripted assignments

Any help in fixing this error would be helpful.

Below are the phylo trees.

S16: ((((GCF_002862005_1_ASM286200v1_genomic:0.0,GCF_002861945_1_ASM286194v1_genomic:0.0,GCF_000213955_1_ASM21395v1_genomic:0.0,GCF_013315085_1_ASM1331508v1_genomic:0.0,GCF_002861975_1_ASM286197v1_genomic:0.0,GCF_013315025_1_ASM1331502v1_genomic:0.0,GCF_002861965_1_ASM286196v1_genomic:0.0,GCF_002862015_1_ASM286201v1_genomic:0.0,GCF_013315045_1_ASM1331504v1_genomic:0.0):0.000000006,((GCF_000414525_1_ASM41452v1_genomic:0.002230831,(((GCF_001546445_1_ASM154644v1_genomic:0.0,GCF_013315115_1_ASM1331511v1_genomic:0.0):0.000000005,((GCF_002861905_1_ASM286190v1_genomic:0.0,GCF_000414605_1_ASM41460v1_genomic:0.0,GCF_000414665_1_ASM41466v1_genomic:0.0,GCF_000414585_1_ASM41458v1_genomic:0.0):0.000000005,GCF_002861885_1_ASM286188v1_genomic:0.002230870)0.966:0.006820446)0.969:0.009438344,((GCF_003426565_1_ASM342656v1_genomic:0.0,piotii_GCF_003397585_1_ASM339758v1_genomic:0.0):0.000000005,(((GCF_000414545_1_ASM41454v1_genomic:0.003992352,(GCF_003408835_1_ASM340883v1_genomic:0.0,GCF_000414505_1_ASM41450v1_genomic:0.0):0.000000005)0.909:0.008040404,(GCF_003397615_1_ASM339761v1_genomic:0.000000005,(GCF_000414565_1_ASM41456v1_genomic:0.003708023,(GCF_000414625_1_ASM41462v1_genomic:0.016167672,(GCF_000414485_1_ASM41448v1_genomic:0.0,GCF_000414425_1_ASM41442v1_genomic:0.0):0.000000005)0.000:0.000000005)0.000:0.000000006)0.928:0.000000005)0.948:0.018701881,(GCF_001546455_1_ASM154645v1_genomic:0.048525006,(GCF_001563665_1_ASM156366v1_genomic:0.116224039,((GCF_003408775_1_ASM340877v1_genomic:0.000000005,GCF_002884775_1_ASM288477v1_genomic:0.015930854)0.913:0.012725504,(((GCF_001953155_1_ASM195315v1_genomic:0.001988946,(GCF_013315145_1_ASM1331514v1_genomic:0.000000005,((GCF_003397745_1_ASM339774v1_genomic:0.0,GCF_003408815_1_ASM340881v1_genomic:0.0,swidsinskii_GCF_003397705_1_ASM339770v1_genomic:0.0):0.000000005,(GCF_000025205_1_ASM2520v1_genomic:0.000000005,(GCF_002884855_1_ASM288485v1_genomic:0.0,GCF_002884875_1_ASM288487v1_genomic:0.0):0.001978715)0.931:0.003992623)0.000:0.000000005)0.4
69:0.000000006)0.885:0.005473514,(GCF_013315195_1_ASM1331519v1_genomic:0.002012914,((GCF_002861125_1_ASM286112v1_genomic:0.0,GCF_013315125_1_ASM1331512v1_genomic:0.0,GCF_013315255_1_ASM1331525v1_genomic:0.0,GCF_003397635_1_ASM339763v1_genomic:0.0,GCF_002861145_1_ASM286114v1_genomic:0.0):0.006168310,leopoldii_GCF_003293675_1_ASM329367v1_genomic:0.001998855)0.781:0.002009527)0.871:0.004589922)0.934:0.014500618,(GCF_003408845_1_ASM340884v1_genomic:0.033674519,((GCF_000414465_1_ASM41446v1_genomic:0.0,GCF_000414445_1_ASM41444v1_genomic:0.0):0.001943013,GCF_001546485_1_ASM154648v1_genomic:0.000000005)1.000:0.047170589)0.714:0.015802120)0.924:0.021101316)0.654:0.021430364)0.995:0.072887040)0.435:0.003580847)0.278:0.005651482)0.892:0.005483145)0.884:0.005371297)0.793:0.000000005,GCF_003408785_1_ASM340878v1_genomic:0.004504993)0.849:0.002683606)0.000:0.000000005,(GCF_001660735_1_ASM166073v1_genomic:0.0,GCF_013315075_1_ASM1331507v1_genomic:0.0):0.000000005)0.932:0.005409354,(GCF_002861165_1_ASM286116v1_genomic:0.0,GCF_001660755_1_ASM166075v1_genomic:0.0):0.001866731,((GCF_000414645_1_ASM41464v1_genomic:0.000000005,(GCF_900637625_1_52295_C01_genomic:0.0,GCF_000414685_1_ASM41468v1_genomic:0.0,GCF_001042655_1_ASM104265v1_genomic:0.0,GCF_003397665_1_ASM339766v1_genomic:0.0,GCF_003408745_1_ASM340874v1_genomic:0.0,GCF_000159155_2_ASM15915v2_genomic:0.0,GCF_000178355_1_ASM17835v1_genomic:0.0,GCF_013315005_1_ASM1331500v1_genomic:0.0):0.000000005)0.489:0.000000005,((GCF_003585655_1_ASM358565v1_genomic:0.0,GCF_000414705_1_ASM41470v1_genomic:0.0,GCF_003812765_1_ASM381276v1_genomic:0.0,GCF_003585755_1_ASM358575v1_genomic:0.0):0.000000005,GCF_003397605_1_ASM339760v1_genomic:0.004518743)0.000:0.000000005)0.748:0.000000005);

Core_2:
(GCF_013315005_1_ASM1331500v1_genomic:0.029801777,((GCF_002861945_1_ASM286194v1_genomic:0.000000005,GCF_013315085_1_ASM1331508v1_genomic:0.000031313)1.000:0.021319502,(GCF_000159155_2_ASM15915v2_genomic:0.000031236,(GCF_000178355_1_ASM17835v1_genomic:0.000438654,(GCF_001042655_1_ASM104265v1_genomic:0.000062623,GCF_900637625_1_52295_C01_genomic:0.000031311)0.387:0.000000005)0.928:0.000094032)1.000:0.021525169)1.000:0.008877453,((((GCF_003585655_1_ASM358565v1_genomic:0.037400199,(((GCF_001563665_1_ASM156366v1_genomic:0.652791246,(((GCF_002884775_1_ASM288477v1_genomic:0.079035837,GCF_003408775_1_ASM340877v1_genomic:0.090527862)1.000:0.099122682,((leopoldii_GCF_003293675_1_ASM329367v1_genomic:0.010984029,(GCF_003397635_1_ASM339763v1_genomic:0.010125736,((GCF_002861125_1_ASM286112v1_genomic:0.000000005,GCF_002861145_1_ASM286114v1_genomic:0.000031301)1.000:0.009693544,((GCF_013315125_1_ASM1331512v1_genomic:0.0,GCF_013315255_1_ASM1331525v1_genomic:0.0):0.012966743,GCF_013315195_1_ASM1331519v1_genomic:0.013648178)1.000:0.005988651)1.000:0.004698554)0.990:0.005500145)1.000:0.069052794,((GCF_002884855_1_ASM288485v1_genomic:0.000062668,GCF_002884875_1_ASM288487v1_genomic:0.000000005)1.000:0.025579950,(((GCF_013315145_1_ASM1331514v1_genomic:0.019646990,swidsinskii_GCF_003397705_1_ASM339770v1_genomic:0.029512978)1.000:0.013756732,GCF_003408815_1_ASM340881v1_genomic:0.023520587)0.995:0.005875475,(GCF_003397745_1_ASM339774v1_genomic:0.030659080,(GCF_000025205_1_ASM2520v1_genomic:0.025213714,GCF_001953155_1_ASM195315v1_genomic:0.028312469)1.000:0.011831021)0.833:0.005198931)1.000:0.031300458)1.000:0.042156286)1.000:0.115485782)1.000:0.022843945,((GCF_001546485_1_ASM154648v1_genomic:0.017705376,(GCF_000414445_1_ASM41444v1_genomic:0.000119495,GCF_000414465_1_ASM41446v1_genomic:0.000288031)1.000:0.018254326)1.000:0.187323158,GCF_003408845_1_ASM340884v1_genomic:0.273480559)1.000:0.031516103)1.000:0.105287618)1.000:0.184239398,(GCF_001546455_1_ASM154645v1_genomic:0.158834848,((((GCF_000
414665_1_ASM41466v1_genomic:0.024400129,((GCF_002861905_1_ASM286190v1_genomic:0.019593979,GCF_013315115_1_ASM1331511v1_genomic:0.028631331)1.000:0.012662116,(GCF_000414585_1_ASM41458v1_genomic:0.000000005,GCF_000414605_1_ASM41460v1_genomic:0.000062682)1.000:0.024365016)0.997:0.009709274)1.000:0.017158240,(GCF_001546445_1_ASM154644v1_genomic:0.045083787,GCF_002861885_1_ASM286188v1_genomic:0.041601544)0.989:0.009811219)1.000:0.008444557,(GCF_000414625_1_ASM41462v1_genomic:0.042525597,GCF_003408835_1_ASM340883v1_genomic:0.049892270)0.999:0.010953734)1.000:0.036963531,((GCF_003426565_1_ASM342656v1_genomic:0.000997096,piotii_GCF_003397585_1_ASM339758v1_genomic:0.000227405)1.000:0.051433551,(GCF_000414545_1_ASM41454v1_genomic:0.049885899,((GCF_000414425_1_ASM41442v1_genomic:0.040267066,(GCF_000414485_1_ASM41448v1_genomic:0.041665521,GCF_000414505_1_ASM41450v1_genomic:0.029538413)1.000:0.011386125)0.275:0.004900178,(GCF_000414565_1_ASM41456v1_genomic:0.040629573,GCF_003397615_1_ASM339761v1_genomic:0.041114766)1.000:0.009480673)1.000:0.017839738)0.891:0.011092271)1.000:0.017665095)1.000:0.041690935)1.000:0.103183064)1.000:0.070463199,(GCF_000414705_1_ASM41470v1_genomic:0.049521408,(GCF_000414525_1_ASM41452v1_genomic:0.080724487,GCF_003408785_1_ASM340878v1_genomic:0.061182628)1.000:0.024820373)1.000:0.021054869)1.000:0.033807240)1.000:0.018980712,(GCF_000414645_1_ASM41464v1_genomic:0.027907653,GCF_013315075_1_ASM1331507v1_genomic:0.026610409)0.984:0.006296718)1.000:0.007287308,((((GCF_002862005_1_ASM286200v1_genomic:0.0,GCF_002862015_1_ASM286201v1_genomic:0.0):0.031702636,(GCF_003397605_1_ASM339760v1_genomic:0.018691990,GCF_013315045_1_ASM1331504v1_genomic:0.028742078)1.000:0.008607032)1.000:0.004053679,(GCF_003397665_1_ASM339766v1_genomic:0.023462907,(GCF_001660735_1_ASM166073v1_genomic:0.013671697,(GCF_003585755_1_ASM358575v1_genomic:0.022805192,(GCF_001660755_1_ASM166075v1_genomic:0.000031311,GCF_002861165_1_ASM286116v1_genomic:0.000000005)1.000:0.012753268)1.000:0.012810
342)1.000:0.007322703)1.000:0.004506398)1.000:0.007041812,((GCF_002861965_1_ASM286196v1_genomic:0.0,GCF_002861975_1_ASM286197v1_genomic:0.0):0.027555977,GCF_003812765_1_ASM381276v1_genomic:0.015476481)1.000:0.005316359)0.983:0.002876082)0.313:0.002898759,((GCF_000414685_1_ASM41468v1_genomic:0.019432538,GCF_003408745_1_ASM340874v1_genomic:0.025614830)1.000:0.007383498,(GCF_000213955_1_ASM21395v1_genomic:0.021663753,GCF_013315025_1_ASM1331502v1_genomic:0.021564210)0.690:0.005300605)0.993:0.003197446)0.996:0.003120071);

SPR project

Reading list:

The Treedist distance matrix output of Generalized RF and Nye et al. methods are zero

Dear @ms609,

Thank you again for the detailed manual and explanation of the methods! I do have a question on ClusteringInfoDistance() function and NyeSimilarity() functions.

I am running the following functions-

for distance matrix-

tree1<-read.tree(file="hosttree-d__Bacteria_p__Desulfobacterota_COG0215_tips_1.nwk")
tree2<-read.tree(file="symbionttree-d__Bacteria_p__Desulfobacterota_COG0215_tips_1.nwk")
tree1<-unroot(tree1)

#GRF
dist_rf <- ClusteringInfoDistance(tree1, tree2, normalize = TRUE)

#Nye
dist_ny <- NyeSimilarity(tree1, tree2, normalize = TRUE ,similarity = FALSE)

for p-values-

#GRF
nRep <- 100000 # Use more replicates for more accurate estimate of expected value
randomTrees <- lapply(logical(nRep), function (x) RandomTree(tree1$tip.label))
randomDists <- ClusteringInfoDistance(tree1, randomTrees, normalize = TRUE)
expectedCID <- mean(randomDists)

dist12 <- ClusteringInfoDistance(tree1, tree2, normalize = TRUE)
# Now count the number of random trees that are this similar to tree1
nThisSimilar <- sum(randomDists < dist12)
pValue <- nThisSimilar / nRep

#Nye-
nRep <- 100000 # Use more replicates for more accurate estimate of expected value
randomTrees <- lapply(logical(nRep), function (x) RandomTree(tree1$tip.label))
randomDists <- NyeSimilarity(tree1, randomTrees, normalize = TRUE,similarity = FALSE)
expectedCID <- mean(randomDists)


dist12 <- NyeSimilarity(tree1, tree2, normalize = TRUE,similarity = FALSE)
# Now count the number of random trees that are this similar to tree1
nThisSimilar <- sum(randomDists < dist12)
pValue2 <- nThisSimilar / nRep

I am getting a zero distance matrix and p-value outputs for the trees attached.
Tree1-https://github.com/Jigyasa3/errors/blob/master/hosttree-d__Bacteria_p__Desulfobacterota_COG0215_tips_1.nwk and Tree2- https://github.com/Jigyasa3/errors/blob/master/symbionttree-d__Bacteria_p__Desulfobacterota_COG0215_tips_1.nwk.
The two trees are completely identical to each other, yet the value of the distance matrix is 0. Why do you think that's happening?

Looking forward to your reply!

Replace \insertRef with \insertCite inline citations

Island hunting (Silva & Wilkinson 2021)

On Defining and Finding Islands of Trees and Mitigating Large Island Bias

Acknowledge another Generalized RF distance

Add to discussion of distances on "distances" branch:

Llabrés et al 2021: The Generalized Robinson-Foulds Distance for Phylogenetic Trees
https://dx.doi.org/10.1089/cmb.2021.0342

Replace `int` with `int_fastXX_t`

May improve speed on certain machines?

Can some iterators use [u?]int_fast8_t?

where is your shiny app for treedist?

When no. of tips in species tree and gene tree dont match. Adding new tips to the species tree

Hi @ms609 ,

Thanks again for a great package! I am trying to run TreeDist on species tree-gene tree pair where there are multiple no. of tips in the gene tree per species.

I found the add_host_tips.R script to help add new tips of zero branch length to the species tree.
But it only seems to work for a specific version of the species tree (generated from BEAST2 output) but not from IQTREE or FASTTREE. For the IQTREE and FASTTREE versions of the species tree, the script generates a tree with all branch lengths equal to zero.

I was wondering if there is a workaround for this problem in TreeDist? Will it work if the no. of tips in two trees are unequal?

Regards,
Jigyasa

Replace projection warning?

Candidates:

Rooted trees or unrooted trees?

Hey @ms609 !

I wanted to ask if any of the trees used for comparison purposes can be unrooted? Or do I need to place an (arbitrary) root for TreeDist?

How does CID affect DREAM outcomes?

DREAM is a benchmark for tree reconstructions, e.g. https://www.biorxiv.org/content/10.1101/2022.02.14.480422v1.full.pdf

Do outcomes of benchmarking exercises change when CID is deployed?

Adapt for weighted trees

Bogdanowicz and Giaro (2022) give a straightforward method; perhaps also consider comparing with their Jaccard-MC metrics.

Error with `VisualizeMatching`

Hey Martin, thanks for building this amazing tool. I have been getting the following error when running VisualizeMatching. I tried running VisualizeMatching with the test data you use in the manual and it runs well, so I know that the code is working. However, I'm unable to identify why my data is causing this error. Other functions (e.g. TreeDistance) seem to be working. Any idea what I could be missing?

Error in if (any(DF[A, ] != DF[B, ])) { :
missing value where TRUE/FALSE needed

Thanks

TreeDistData

All references to TreeDistData have been removed for initial CRAN submission, to avoid mutual dependency. Once package available on CRAN, restore missing vignettes.

MASTSize failure

Crashing silently in Lin tests in TreeDistData/data-raw.

NNI dist performance

cpp_robinson_foulds_matching seems to be written with a view to insertion into cpp_nni_distance, where we could also replace cpp_edge_to_splits with a more streamlined bespoke function giving us the minimum required to calculate the matching.

MapTrees() updates

SpectralClustering() → SpectralEigens()

Mappings:

Add t-SNE mapping
Add CCA mapping?

Clustering

Option for clustering cutoff
- Default: Only show 'reasonable' clusters?
- ? Include 'reasonable' clustering in default tree set?

Internalize SPR.dist

Currently calls phangorn's SPR.dist. We can improve performance and stability, and drop import of phangorn, by moving more of this from R to C.

thinnedTrees() with multiple tree batches

in app.R, thinnedTrees() is naively defined as as.integer(seq(keptRange()[1], keptRange()[2], by = 2 ^ input$thinTrees))

This assumes that trees have been loaded from a single file. Otherwise the numbers are garbage.

It'd be nice if this didn't have to be the case. Failing that, we should disable the option if it's not relevant.
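
One way to make thinning batch-aware is to thin within each batch and then offset the indices. The sketch below is base R only; ThinBatches() and its arguments are hypothetical and not part of app.R, and it assumes that every batch contains at least one tree.

```r
# Hypothetical helper: keep every (2 ^ thin)-th tree within each batch and
# return positions in the concatenated list of all trees
ThinBatches <- function(batchSizes, thin) {
  # Number of trees that precede each batch
  starts <- cumsum(c(0, head(batchSizes, -1)))
  unlist(lapply(seq_along(batchSizes), function(i) {
    starts[i] + seq(1, batchSizes[i], by = 2 ^ thin)
  }))
}

# Two batches of 5 and 4 trees, keeping every second tree in each
kept <- ThinBatches(batchSizes = c(5, 4), thin = 1)
kept
```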

Add SSV metric to `MapTrees()`

To accompany KC.

Check for clustering with Hopkins statistic

https://journal.r-project.org/articles/RJ-2022-055/

Dimension goodness plotter fails with batches

Replicate in MapTrees() GUI by:

Selecting a file, and subsampling two batches of trees (I used best.tr from ms609/lobo)
Switching to display tab

I see

ncol(distReference) == ncol(distLowDin) is not TRUE

This seems to have disappeared on repeat – perhaps because other packages (TreeSearch?) were subsequently installed?

Match identical splits first

Reduce scale of LAPJV problem by matching identical splits (i.e. run RF first), and only using LAPJV to match non-identicals.
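
A base R sketch of the pre-matching step follows. The text keys are hypothetical stand-ins for splits. The approach assumes that, for the metric in question, pairing identical splits with each other is always part of an optimal matching; that assumption would need checking for each metric.

```r
# Hypothetical split keys for two trees
splits1 <- c("a b | c d e f", "a b c | d e f", "e f | a b c d")
splits2 <- c("a b | c d e f", "a b d | c e f", "e f | a b c d")

identical1 <- splits1 %in% splits2
identical2 <- splits2 %in% splits1

# Identical splits are paired directly, as in a plain RF comparison
exactPairs <- cbind(tree1 = which(identical1),
                    tree2 = match(splits1[identical1], splits2))

# Only the leftover splits need a cost matrix and a LAPJV call
remaining1 <- splits1[!identical1]
remaining2 <- splits2[!identical2]
reducedDim <- c(length(remaining1), length(remaining2))
```

In this toy case the assignment problem shrinks from 3 x 3 to 1 x 1.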

Visualize MST stress

It would be great if it were possible to colour each edge of the plotted MST according to its stress,
i.e. log(mapped length / original length). Would have to use a diverging palette with the zero point set to the average.
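
The colouring could be prototyped as below. The edge lengths are made-up numbers, the variable names are hypothetical, and "Blue-Red" is simply one of the diverging palettes available through hcl.colors() in base R (3.6.0 or later).

```r
# Hypothetical lengths of four MST edges, before and after mapping
originalLength <- c(0.40, 0.25, 0.60, 0.10)
mappedLength   <- c(0.35, 0.30, 0.90, 0.08)

# Stress of each edge, as defined above
stress <- log(mappedLength / originalLength)

# Centre on the average stress, so the palette's midpoint marks the mean
centred <- stress - mean(stress)
limit <- max(abs(centred))

# Bin the centred values symmetrically and look up a diverging colour
nCol <- 11
stressPalette <- hcl.colors(nCol, "Blue-Red")
bin <- cut(centred, breaks = seq(-limit, limit, length.out = nCol + 1),
           include.lowest = TRUE, labels = FALSE)
edgeCol <- stressPalette[bin]
```

The vector edgeCol could then be supplied as the colour of each plotted MST edge.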