ms609 / treetools

Create, modify and analyse phylogenetic trees in R

Home Page: https://ms609.github.io/TreeTools/

phylogenetics phylogenetic-trees evolutionary-biology r-package cran

treetools's Introduction

TreeTools

'TreeTools' is an R package that provides efficient implementations of functions for the creation, modification and analysis of phylogenetic trees.

Applications include: generation of trees with specified shapes; analysis of tree shape; rooting of trees and extraction of subtrees; calculation and depiction of node support; calculation of ancestor-descendant relationships; import and export of trees from Newick, Nexus and TNT formats; and analysis of partitions and cladistic information.

It complements packages such as 'ape', 'phangorn' and 'phytools', aiming for efficient and robust implementations of functions, typically applied to unweighted trees (i.e. those without edge lengths).

Installation

Install and load the library from CRAN as follows:

install.packages("TreeTools")
library("TreeTools")

Install the very latest version, which may be under development, with:

if (!require("devtools")) install.packages("devtools")
devtools::install_github("ms609/TreeTools")

Please note that the 'TreeTools' project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

treetools's Issues

Error: This many leaves cannot be supported

Hi,
I am trying to measure distances between two trees, and getting this error message:
> TreeDistance(t1, t2)
Error: This many leaves cannot be supported. Please contact the TreeTools maintainer if you need to use more!
I tried decreasing the number of tips to 4096 (as mentioned in one part of the TreeDist manual), but I still get this error. Is there a workaround for this, and how many tips are allowed by default? I cannot find this in the documentation.
Thank you!

Implement sort.multiPhylo

Sorting trees into a consistent (and logical?) order will make it easier to view differences in lists of trees

Support edge lengths

Functions that do not yet support edge lengths:

Convert between TreeNumber and MixedBase

i.e. write functions as.TreeNumber.MixedBase() and vice versa.

Using WriteTntCharacters() with a continuous matrix

I tried to export a TNT version of a continuous phylogenetic matrix using this function but the resultant characters aren't separated by anything so TNT doesn't interpret them correctly.

Here is an example of what the output looks like (68 continuous characters for this taxon):

taxa_a 1.7380.9960.1270.1570.5240.3030.880.1860.0890.8420.1030.0510.8820.1230.0570.7780.0020.0460.7380.0110.1230.8350.2220.0320.3380.5290.5730.5750.030.4470.3160.6020.5110.231.8180.9950.0970.130.4340.1590.94100.0970.8270.1760.2390.8270.2110.330.7890.2960.3270.7510.4250.2780.8810.1880.2670.3660.4450.5980.36800.4850.3050.6340.5080.271

I initially tried reading the dataset in using ReadCharacters(), and also using as.matrix(), before calling WriteTntCharacters(), to no avail. Both of these methods read in the characters correctly. I hope I didn't miss a simple fix.
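
One possible workaround is to format the matrix rows by hand so that each value is separated by a space. The following is a minimal base-R sketch: `FormatContinuousRows()` is a hypothetical helper, not part of TreeTools, and the surrounding TNT header and dimensions are left to the caller.

```r
# Hypothetical helper (not part of TreeTools): format each taxon's
# continuous values as one whitespace-separated line.
FormatContinuousRows <- function(mat) {
  stopifnot(is.matrix(mat), !is.null(rownames(mat)))
  values <- apply(mat, 1, function(row) {
    tokens <- as.character(row)
    tokens[is.na(tokens)] <- "?"   # write missing entries as "?"
    paste(tokens, collapse = " ")
  })
  paste(rownames(mat), values)
}

mat <- matrix(c(1.738, 0.996, 0.127,
                0.157, 0.524, 0.303),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("taxa_a", "taxa_b"), NULL))
rows <- FormatContinuousRows(mat)
writeLines(rows)
```

The resulting lines could then be pasted into the matrix block of a TNT file in place of the unseparated output.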

Simulation of Birth-Death trees

Exact and efficient phylodynamic simulation from arbitrarily large populations

Merge `AllDescendantEdges()` with `DescendantEdges()`

Move edge parameter after parent & child in DescendantEdges call [breaking change?]
If edge = NULL, in DescendantEdges(), call AllDescendantEdges()
Make AllDescendantEdges() internal
Stop exporting AllDescendantEdges() (and move .AllDescendantEdges() into DescendantEdges())

Unsupported TNT file

Dear Martin,

I'm trying to read the TNT matrix from Mirande 2008 (Appendix S5 file characidae.tnt) using ReadTntCharacters() and I'm getting the following error:

Error in toupper (lines): invalid multibyte string 3842
In addition: Warning messages:
1: In grep ("'", lines, fixed = TRUE):
   input string 3857 is invalid at that locale
2: In grep (";", lines, fixed = TRUE):
   input string 3842 is invalid at that locale

Do you have any idea of what is happening? Can it be a problem of encoding? Is there a way to control it?

Best,

Sara
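
An "invalid multibyte string" message usually points to bytes that are not valid in the session's encoding, for example a Latin-1 character in a file that is read as UTF-8. Below is a hedged base-R sketch for finding and converting such lines before parsing. `FixEncoding()` is a hypothetical helper, and Latin-1 as the source encoding is only a guess.

```r
# Hypothetical helper: convert lines that are not valid UTF-8,
# assuming (as a guess) that they were saved as Latin-1.
FixEncoding <- function(lines, from = "latin1") {
  bad <- !validUTF8(lines)
  lines[bad] <- iconv(lines[bad], from = from, to = "UTF-8")
  lines
}

# Demonstration: the second string holds the Latin-1 bytes of "Müller"
latin1Name <- rawToChar(as.raw(c(0x4d, 0xfc, 0x6c, 0x6c, 0x65, 0x72)))
lines <- c("xread", latin1Name)
which(!validUTF8(lines))   # index of the offending line
fixed <- FixEncoding(lines)
```

In practice the lines would come from `readLines()` on the problem file, and the repaired lines could be written to a temporary file with `writeLines()` before being read again.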

Sort trees by TreeNumber

Or some more principled method than their Newick representation after preordering.

`ClusterTable` memory requirements

Running consensus_info() with 36000 trees requires a vector of 36000 ClusterTables, which requires more memory than is available.

Can we reduce the memory requirement of a ClusterTable?
(Perhaps we need to operate on the heap rather than the stack?)

Replace `.C` with `.Call`

Dirk writes "[.C() is] discouraged given that using .Call() from R is so much more efficient"

TreeTools/R/tree_numbering.R

Line 37 in 8139af1

.C(`ape_neworder_pruningwise`, as.integer(nTip), as.integer(nNode),

DropTip will not remove tip on tree

I cannot get DropTip to remove a tip. What am I doing wrong? Tried with several trees. Here is an example:

library(phytools)
library(ggtree)
library(TreeTools)

tree2<-pbtree(n=5)
plotTree(tree2)
DropTip(tree2,'t4',preorder = TRUE,check=TRUE)
plotTree(tree2)

The tree does not change.

Thanks.
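
A likely explanation is that the result of DropTip() is never assigned: R functions return a modified copy rather than changing their argument in place. A minimal sketch, using BalancedTree() in place of the random tree from the report:

```r
library("TreeTools")

tree <- BalancedTree(5)         # tips are labelled t1 ... t5
pruned <- DropTip(tree, "t4")   # returns a new tree; `tree` itself is unchanged

TipLabels(tree)     # still includes "t4"
TipLabels(pruned)   # "t4" has been removed
```

Plotting `pruned` rather than `tree` should then show the reduced tree.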

MSTEdges() explodes if sent invalid input

A matrix of distances, n rows, 2 cols, created by cmdscale(distances), causes R to crash.

Validate input before running.
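
One way the validation could look is sketched below in base R. `CheckDistances()` is a hypothetical helper, not an existing TreeTools function; it accepts a `dist` object or a square numeric matrix and raises an ordinary R error otherwise.

```r
# Hypothetical pre-check (not part of TreeTools).
CheckDistances <- function(distances) {
  if (inherits(distances, "dist")) {
    return(invisible(TRUE))
  }
  if (!is.matrix(distances) || !is.numeric(distances)) {
    stop("`distances` must be a `dist` object or a numeric matrix")
  }
  if (nrow(distances) != ncol(distances)) {
    stop("`distances` must be square; got ",
         nrow(distances), " x ", ncol(distances))
  }
  invisible(TRUE)
}

points <- matrix(c(0, 0,
                   1, 0,
                   0, 1,
                   2, 2), ncol = 2, byrow = TRUE)
distances <- dist(points)
CheckDistances(distances)                # a `dist` object passes

coords <- cmdscale(distances, k = 2)     # 4 x 2 coordinates, not distances
result <- tryCatch(CheckDistances(coords), error = conditionMessage)
```

Here `result` holds the error message rather than the session crashing.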

RoguePlot() width scales with `p` (i.e. consensus proportion)

Quality of a dataset

Haag et al. measure the ruggedness of a tree landscape using a regression model (trained on molecular datasets, implemented in C) based on:

Unique topologies after 100 parsimony searches: 42.9 %
RF-Distance between parsimony trees: 33.2 %
Entropy (average Shannon entropy per column): 17.0 %
Patterns (unique columns)-over-taxa: 13.6 %
% Gaps: 2.5 %
Bollback: 2.3 %
Sites (n columns)-over-taxa: 1.5 %
% Invariant columns: 0.6 %
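
The simpler features in this list can be computed directly from a character matrix. The sketch below is base R and is not Haag et al.'s implementation: `DatasetSummary()` is a hypothetical helper, entropy is taken in bits, gaps are treated as an ordinary state, and values are returned as proportions rather than percentages. Each of these choices is an assumption that may differ from the original definitions.

```r
# Rows are taxa; columns are characters (sites).
ColumnEntropy <- function(column) {
  p <- table(column) / length(column)
  -sum(p * log2(p))
}

# Hypothetical summary of the matrix-based features.
DatasetSummary <- function(mat) {
  patterns <- unique(mat, MARGIN = 2)   # distinct columns
  nStates <- apply(mat, 2, function(column) length(unique(column)))
  c(
    entropy = mean(apply(mat, 2, ColumnEntropy)),
    patternsOverTaxa = ncol(patterns) / nrow(mat),
    sitesOverTaxa = ncol(mat) / nrow(mat),
    propInvariant = mean(nStates == 1),
    propGaps = mean(mat == "-")
  )
}

mat <- matrix(c("A", "A", "C", "-",
                "A", "C", "C", "A",
                "A", "A", "G", "A",
                "A", "C", "G", "-"), nrow = 4, byrow = TRUE)
s <- DatasetSummary(mat)
```

The tree-based features (unique topologies and RF distances between parsimony trees) need a tree search and are not covered here.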

LeafLabelInterchange(): Guarantee change to tree topology

It would be nice if we could avoid swapping sister leaves, thus ensuring that our LLI operation changes the tree.
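
One way to do this, sketched in base R on ape's `phylo` layout, is to pick the second leaf only from those whose parent differs from the first leaf's parent. `SwapNonSisterLeaves()` is a hypothetical name, not the existing function, and the sketch treats the tree as rooted.

```r
# Hypothetical sketch: swap the labels of two leaves with different parents.
SwapNonSisterLeaves <- function(tree) {
  nTip <- length(tree$tip.label)
  # Parent node of each leaf, looked up in the edge matrix
  parent <- tree$edge[match(seq_len(nTip), tree$edge[, 2]), 1]
  first <- sample.int(nTip, 1)
  candidates <- which(parent != parent[first])
  if (length(candidates) == 0) {
    stop("All leaves share a parent; no non-sister swap is possible")
  }
  second <- candidates[sample.int(length(candidates), 1)]
  tree$tip.label[c(first, second)] <- tree$tip.label[c(second, first)]
  tree
}

# A small hand-built rooted tree: ((a, b), c)
tree <- structure(list(
  edge = matrix(c(4, 5,
                  5, 1,
                  5, 2,
                  4, 3), ncol = 2, byrow = TRUE),
  tip.label = c("a", "b", "c"),
  Nnode = 2L
), class = "phylo")
swapped <- SwapNonSisterLeaves(tree)
```

In this example every permitted swap involves leaf c, so the cherry (a, b) is always broken up.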

Custom directory for caching

I want to use TreeTools inside a docker (singularity, in fact) container and it seems TreeTools uses /home for caching results which is a bottleneck in my application. It would be better if 1. I could tell TreeTools not to use a cache directory at all or 2. I could set the cache directory manually to be an arbitrary directory. Is this possible with TreeTools at the moment? If not, would this be relatively simple to implement? Many thanks.
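
If the cache in question is the one managed by the 'R.cache' package (an assumption, not confirmed in this thread), its root directory can be redirected for the session. A minimal sketch, in which the directory under `tempdir()` is only a placeholder for whatever fast storage the container provides:

```r
# Assumption: the cache is handled by the 'R.cache' package.
cacheDir <- file.path(tempdir(), "treetools-cache")  # placeholder location
dir.create(cacheDir, showWarnings = FALSE, recursive = TRUE)

R.cache::setCacheRootPath(cacheDir)
R.cache::getCacheRootPath()  # reports the directory now in use
```

Setting the path early in the session, before any TreeTools functions are called, seems the safest order.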

Error parsing Nexus file

Hi there Martin,

I am trying to use ReadCharacters() to read in a continuous dataset that I am analysing in various ways.

Unfortunately, the continuous decimals are parsed as individual characters, e.g. 0.15 is read in as '0' '.' '1' '5'.

I had a look at the GitHub files and couldn't see any mention of continuous characters, though I may have missed that part. Is there a way to read them in at all?

I have attached the dataset in case that is of any use, but had to convert it to .txt instead of .nex
koch_raw_MASTER.txt

Thanks for your time
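
As a stopgap, the matrix rows can be split on whitespace rather than into single characters. The base-R sketch below uses a hypothetical helper, `ParseContinuousRow()`, and assumes that taxon names contain no spaces and that values are separated by whitespace.

```r
# Hypothetical helper: split a matrix row into a taxon name and
# whitespace-separated numeric tokens.
ParseContinuousRow <- function(line) {
  tokens <- strsplit(trimws(line), "\\s+")[[1]]
  # Non-numeric tokens such as "?" or "-" become NA
  values <- suppressWarnings(as.numeric(tokens[-1]))
  list(taxon = tokens[1], values = values)
}

row <- ParseContinuousRow("taxon_a 0.15 1.738 ? 0.996")
row$taxon
row$values
```

Applying this to each line of the matrix block and binding the results with `rbind()` would give a numeric matrix with one row per taxon.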

Will not install in RStudio

*** arch - i386
/mingw32/bin/g++ -std=gnu++17 -I"C:/Users/THEODO~1.ALL/R-41~1.1/include" -DNDEBUG -I'C:/Users/Theodore.Allnutt/Rlibs/Rcpp/include' -O2 -Wall -mfpmath=sse -msse2 -mstackrealign -c ClusterTable.cpp -o ClusterTable.o
In file included from ClusterTable.cpp:1:
../inst/include/TreeTools/ClusterTable.h:6:10: fatal error: Rcpp/Lightest: No such file or directory
#include <Rcpp/Lightest>
^~~~~~~~~~~~~~~
compilation terminated.
make: *** [C:/Users/THEODO~1.ALL/R-41~1.1/etc/i386/Makeconf:245: ClusterTable.o] Error 1
ERROR: compilation failed for package 'TreeTools'

removing 'C:/Users/Theodore.Allnutt/Rlibs/TreeTools'
Warning in install.packages :
installation of package ‘TreeTools’ had non-zero exit status

Reordering methods: support edge matrices

In tree_numbering.R, all functions should have a .numericMatrix method.

Perhaps await answer to https://stackoverflow.com/questions/62001733 to implement in optimal fashion.

NexusTokens() shiny interaction

NexusTokens() calls shiny::updateNumericInput("character_num"). This does not seem the most appropriate approach: shouldn't the caller be able to specify this? Attempt to remove the call, which would allow shiny to be dropped from the Suggests: field of DESCRIPTION.

"Preorder" classification

With #92, DropTip() now returns edges numbered in preorder, but not conforming to the additional requirements of Preorder(). Is it accurate to describe this as "cladewise"?

We probably need a function that tests whether a tree is in strict TreeTools preorder, or merely in preorder; some functions may only require the latter, which would save time on unnecessary renumbering. We should audit the code so that we only request what we require. Might this necessitate a flag, as with postorder, to indicate whether TreeTools conventions are followed?

Switch to inline citations

https://cran.r-project.org/web/packages/Rdpack/vignettes/Inserting_bibtex_references.pdf

Plotting rogue taxa

Example: https://github.com/seraklop/RoguePlots

The idea is to plot a backbone tree without the rogue taxa, with branches coloured according to the likelihood that the rogue should attach to that branch.

(Thanks Ludo Le Renard for the suggestion)

Reference formatting, once Rdpack > 2.1.3

Replace textual with nobrackets to allow italicization of sensu in e.g. Information.R

Hard-Deprecate `PostorderEdges()`

Replaced by Postorder(). Deprecate fully once 'TreeSearch' updated on CRAN.

@DSRovinsky wishlist

Support stats like:
[ ] Bremer/branch supports
[ ] CI/RI
[ ] bootstrapping

[ ] Ability to 'force' a tip into a clade & run a Templeton test for alternative hypotheses

Deprecate `in.Splits()`

Included at present as alias for %in%.Splits().

demo()

Consider which functions / function suites could be documented using the demo() functionality.

(Other packages may benefit too.)

Random Trees don't match balance

Both of these tests fail, in opposite directions:

expect_equal(
  mean(replicate(100, TotalCopheneticIndex(RandomTree(10, root = TRUE)))), # ~90
  TCIContext(10)$uniform.expected, # 76
  tolerance = 0.1
)
expect_equal(
  mean(replicate(100, TotalCopheneticIndex(ape::rtree(10, root = TRUE, equiprob = TRUE, br = NULL)))), # ~50
  TCIContext(10)$uniform.expected, # 76
  tolerance = 0.1
)

Unsupported NEXUS file

Please attach the problematic NEXUS file and describe the issue

I am new to R and am not able to read the Nexus file.

`as.MixedBase()` hangs (in `sort.multiPhylo()`)

Reproduce with:

tree <- read.tree("../iotuba-ms/phylo/treesearch/flab8_ew.tre")[[1]]
as.MixedBase(tree)

Preorder() fails with large trees

Crash occurs with Preorder(RandomTree(100000)). Check that input tree size is supported first.

multiPhylo object with single `tip.label` not supported

Read trees with read.tree("C:\Research\R\Cricocosmia\results\j...run1.t") then:

Subset with Consensus(trees[1:5]).
Drop tips with DropTip(trees, 'Acosmia')

Tree balance index

Add the Robust, Universal Tree Balance Index of doi:10.1093/sysbio/syac027, implemented for data.frames in R in https://github.com/robjohnnoble/RUtreebalance/blob/v1.0/RUtreebalance.R (with opportunities to improve)
Add other metrics from https://arxiv.org/pdf/2109.12281.pdf

`CollapseNode()` performance

Rewrite in C++
Retain order of original tree (at least if in preorder)

TNT multiline support

TNT can parse files with arbitrary line break positions; see dromaeodat.tnt in SI of
https://doi.org/10.1016/j.cub.2020.06.105

Work started on branch parse-tnt.

Unsolved problem: How to identify taxon names given that

Taxon names contain numbers in any position, and may contain exclusively digits; and
Character data may be interleaved

`%in%.Splits()` drops names

tree1<-read.tree(text = x<-"(a,b,(c,(d,e)));")
tree2<-read.tree(text = "(a,b,c,(d,e));")
splits1 <- as.Splits(tree1)
splits2 <- as.Splits(tree2)
splits1in2 <- splits1 %in% splits2
names(splits1in2) # should equal names(splits1); instead NULL

str method for relevant classes

Also in TreeDist, etc

EnforceOutgroup() is a near-duplicate of RootNode()

Deprecate EnforceOutgroup() and add its handling of character-class trees to RootNode().

Support edge lengths in `UnrootTree()`

I was wondering what exactly happens to the tree in the background when it is unrooted.
I have some phylogenetic alpha-diversity analyses to run on phyloseq objects with rooted trees. I need to unroot the trees before running the code below, as some posts advise this to avoid random rooting of the tree.
Does unrooting change the tree in any other way?

### ROOTING the tree more appropriately ####
pick_new_outgroup <- function(tree.unrooted){
  require("magrittr")
  require("data.table")
  require("ape") # ape::Ntip
  # tablify parts of tree that we need.
  treeDT <- 
    cbind(
      data.table(tree.unrooted$edge),
      data.table(length = tree.unrooted$edge.length)
    )[1:Ntip(tree.unrooted)] %>% 
    cbind(data.table(id = tree.unrooted$tip.label))
  # Take the longest terminal branch as outgroup
  new.outgroup <- treeDT[which.max(length)]$id
  return(new.outgroup)
}

new.outgroup <- pick_new_outgroup(tree.unrooted)
# > new.outgroup
# [1] "ASV679"

rootedTree <- ape::root(phy_tree(ps.3), outgroup = new.outgroup, resolve.root = TRUE)

Why am I asking? Because I have been working with the above code without any errors or warnings,
but when I unrooted the tree using UnrootTree() I got this message:

> new.outgroup <- pick_new_outgroup(unrooted.tree)
Warning message:
In as.data.table.list(x, keep.rownames = keep.rownames, check.names = check.names,  :
  Item 1 has 3839 rows but longest item has 3840; recycled with remainder.

Thanks
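
The warning suggests that the edge matrix and the edge.length vector no longer have the same number of rows after unrooting, although that is an inference from the message rather than something confirmed here. Separately, the helper above assumes that the first Ntip rows of the edge matrix are the terminal edges, which is not guaranteed. A base-R sketch of a more defensive variant, with `PickLongestTip()` as a hypothetical name:

```r
# Hypothetical variant of pick_new_outgroup(): identify terminal edges by
# their child index and check that every edge has a length.
PickLongestTip <- function(tree) {
  nTip <- length(tree$tip.label)
  if (is.null(tree$edge.length) ||
      length(tree$edge.length) != nrow(tree$edge)) {
    stop("Tree must have exactly one edge length per edge")
  }
  terminal <- which(tree$edge[, 2] <= nTip)
  longest <- terminal[which.max(tree$edge.length[terminal])]
  tree$tip.label[tree$edge[longest, 2]]
}

# A small hand-built tree in ape's `phylo` layout: ((a, b), c)
tree <- structure(list(
  edge = matrix(c(4, 5,
                  5, 1,
                  5, 2,
                  4, 3), ncol = 2, byrow = TRUE),
  edge.length = c(0.1, 1.0, 2.5, 0.7),
  tip.label = c("a", "b", "c"),
  Nnode = 2L
), class = "phylo")
outgroup <- PickLongestTip(tree)
```

With this check in place, a mismatch between edges and edge lengths stops with a clear error instead of being recycled silently.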

If a tree has defined node labels, AddTip() will not change them

Hi,

I'm not sure whether I'm using AddTip() incorrectly or whether there is an issue with the function. Reprex below:

# create a random tree
set.seed(0)
tree <- ape::rtree(10)
# define node labels and plot
tree$node.label <- paste0("Node_", 1:tree$Nnode)
plot(tree, show.node.label = TRUE)

# add a tip
tree2 <- TreeTools::AddTip(tree, where = "t8", label = "new", edgeLength = 0)
# plot, notice that: 1. internal node labels change, 2. "Node_1" is now recycled!
plot(tree2, show.node.label = TRUE)

# Also node.label does not change
tree2$node.label
#> [1] "Node_1" "Node_2" "Node_3" "Node_4" "Node_5" "Node_6" "Node_7" "Node_8"
#> [9] "Node_9"

Created on 2024-03-22 with reprex v2.1.0

This is a problem because tools that make use of the node.label element will produce all sorts of issues downstream.

I'm not sure this is a proper fix, but adding something like this may work:

if ("node.label" %in% names(tree)) {
  tree$node.label <- paste0("Node_", seq_len(tree$Nnode))
}

The names of the internal nodes will still change but the redundancy will be eliminated.

`SortTree()` only supports binary trees

It should also support non-binary, and unrooted, trees.

Document methods by class

e.g. only as.Splits.phylo currently @describeIn'd.

Format of the tree?

Hello

Do you provide a function to know what is the format of the tree I created?
I used the method here: https://f1000research.com/articles/5-1492
Does saving the tree in .rds change the tree format?
.txt would make it NEWICK
Cheers

Deprecate `NonDuplicateRoot()`

Behaviour is poorly defined, and necessity of function is questionable: if edge 1 must be a root edge, why not just ignore that edge, rather than its duplicate?

Only required by TreeSearch::SPR(), which will soon be ported into C. Once this is done, the function can be deprecated and removed.

Replace root_binary with root_on_node

In root_tree.h, check whether inline IntegerMatrix root_binary(const IntegerMatrix edge, const int outgroup) really outperforms root_on_node. If not, delete it.

ape memory allocation issues

Warning: Error in : cannot allocate vector of size 8.0 Gb in consensus() / dist.topo from call to ConsensusWithout() in app.R with lobo/best.tr trees

GH Actions templates

Create templates for bespoke actions at ms609/actions.
List on GitHub marketplace

as.TreeNumber not identifying unique topologies

After going through the documentation, I was under the impression that as.TreeNumber would generate numbers for all the different unique topologies in a data set. Currently I am working with 299 trees consisting of seven tips (all trees have all tips). I can generate the output and save the scores for all 299 trees. When looking more closely at the trees with the same score, the topologies are in fact not the same. Am I misinterpreting the usage of this function or is there an error somewhere that is causing different topologies to be scored the same?

Here is the code that is being used:
trees <- read.tree(file="299_trees.tre")
trees

tips <- TipLabels(trees)
tips

possible <- as.TreeNumber(trees, nTip=7,tipLabels = tips)

possible
class(possible)

sink("output.txt",type=c("output"))

print(possible)
sink()