Getting started: Exploring tree space

Martin R. Smith

2024-05-23

Mapping the distribution of trees in “tree space” can reveal patterns in the set of optimal trees, and allow additional details to be obtained from results that may appear to lack resolution (Smith, 2022a).

The ‘TreeSearch’ GUI allows rapid visualisation of tree spaces and the construction of cluster consensus trees, to depict the underlying structure inherent in a set of optimal trees.

More detailed tree space analysis can also be conducted using the ‘TreeDist’ package.

Loading trees into the GUI

This example uses the Sun et al. (2018) dataset included with the TreeSearch package, which contains trees generated under four optimality criteria. The approach will work on search results generated by tree search within the GUI, or any set of trees loaded using the “Load trees” function.

For rapid processing, large tree sets are down-sampled when loaded: here, 48 of the 125 trees are used for analysis. This is useful for obtaining an initial overview of results; including more trees gives a more complete picture but results will take quadratically longer to generate.

If loading the output of a Bayesian analysis, you may wish to modify the range of trees sampled to exclude trees generated during the burn-in phase.

Loading trees for analysis
Loading trees for analysis

Equivalent R code can be downloaded using the “Save plot: R script” button, and is provided here for reference:

# Load required libraries 
library("TreeTools", quietly = TRUE) 
library("TreeDist") 
library("TreeSearch") 
  

# Load Sun et al. 2018 trees from TreeSearch package
treeFile <- system.file("datasets/Sun2018.nex", package = "TreeSearch")

# Load trees from your own file with
# treeFile <- "Path/to/my.file"
# 
# To check and set working directory, use:
# getwd() # Should match location of data / tree files 
# setwd(".") # Replace . with desired/directory to change 

# Read all trees from file
trees <- read.nexus(treeFile) 

# Down-sample to a maximum of 48 trees
trees <- trees[unique(as.integer(seq.int(1, length(trees), length.out = 48)))] 

Exploring the consensus

The GUI will initially display the strict consensus of the input trees, after gaining as much information as possible by removing any “rogue” taxa whose position is highly variable (Smith, 2022b).

If trees were generated by a probabilistic sampling process (e.g. Markov Chain) then a majority rule consensus (figure: Majority = 0.5) may reveal additional resolution, but note that the frequency of splits in trees found by parsimony search does not correspond to split support – only the strict consensus tree has any direct interpretation in this setting.

A majority rule consensus has no interpretation with this dataset, but its improved resolution does hint that the strict consensus hides some underlying structure
A majority rule consensus has no interpretation with this dataset, but its improved resolution does hint that the strict consensus hides some underlying structure

Equivalent R code can be downloaded using the “Save plot: R script” button within the GUI, and is provided here for reference:

# Select an appropriate majority value
# When analysing parsimony results, this value should be 1.
majority <- 0.5

# Identify rogue taxa
exclude <- Rogue::QuickRogue(trees, p = majority)$taxon[-1]
exclude
## [1] "Paterimitra"         "Yuganotheca_elegans"
# Select a rogue whose positions should be depicted
plottedRogue <- exclude[1]

# Remove other excluded taxa from tree
consTrees <- lapply(trees, DropTip, setdiff(exclude, plottedRogue))

# Colour tip labels according to their original 'instability' (Smith 2022a)
tipCols <- Rogue::ColByStability(trees)

# Our plotted rogue will not appear on the tree
tipCols <- tipCols[setdiff(consTrees[[1]]$tip.label, plottedRogue)] 


# Set up plotting area 
par(
 mar = c(0, 0, 0, 0), # Zero margins 
 cex = 0.8            # Smaller font size 
)

# Plot the reduced consensus tree, showing position of our plotted rogue 
plotted <- RoguePlot( 
 trees = consTrees, 
 tip = plottedRogue,
 p = majority, 
 edgeLength = 1, 
 tip.color = tipCols 
) 

# Calculate split concordance
concordance <- SplitFrequency(plotted$cons, trees) / length(trees)

# Annotate splits by concordance 
LabelSplits(
 tree = plotted$cons,
 labels = signif(concordance, 3),
 col = SupportColor(concordance),
 frame = "none",
 pos = 3
) 

The structure of tree space

The “Tree space” display option plots trees as points, and attempts to map the distribution of points such that the distance between mapped points corresponds to the tree distance between each pair of trees.

The default method (Clustering Information distance) has many advantages over the widely-used Robinson–Foulds distance (Smith, 2020, 2022a); the quartet distance, whilst slower to calculate, can also produce good low-dimensional mappings of tree space.

The number of dimensions of space can be reduced to make mappings easier to interpret, but do check that the mapping quality gauge remains at least “good”, or perceived structure may be an artefact of distortion.

Tree space mapping reveals three tree clusters
Tree space mapping reveals three tree clusters

In this dataset, trees are assigned to three distinct clusters, and display “good” structure: clusters are marked only when their silhouette coefficient is greater than the threshold, whose interpretation is marked above the slider (here, 0.68 is “good”). As it happens, the clusters largely correspond to the tree reconstruction method used: Fitch (1971) parsimony (TNT_EW) trees form one distinct cluster; trees under “BGS” inapplicable-corrected parsimony (Brazeau, Guillerme, & Smith, 2019) another; and Bayesian trees a final, more diffuse cluster.

Tree space analysis in R

More sophisticated tree space analysis can be conducted using the ‘TreeDist’ package.

Sample R code can be downloaded using the “Save plot: R script” button within the GUI, and is provided here for reference:

Generate distances

Different tree distances reflect different aspects of tree similarity, which may or may not be relevant to phylogenetic questions (Smith, 2020, 2022a).

# Compute distances between pairs of trees
dists <- TreeDist::ClusteringInfoDistance(trees)
# The quartet distance is a slower alternative:
library("Quartet")
dists <- as.dist(Quartet::QuartetDivergence( 
   Quartet::ManyToManyQuartetAgreement(trees), similarity = FALSE)
)

Identify clustering structure

Here we compute clusters of trees using a selection of clustering methods (Bien & Tibshirani, 2011; Hartigan & Wong, 1979; Maechler, Rousseeuw, Struyf, Hubert, & Hornik, 2019), allowing the optimal clustering to be highlighted on the plot.

# Try K-means clustering (Hartigan & Wong 1979):
kClusters <- lapply(
  2:15, # Minimum 2 clusters; maximum 15
  function (k) kmeans(dists, k)
)
kSils <- vapply(kClusters, function (kCluster) { 
 mean(cluster::silhouette(kCluster$cluster, dists)[, 3]) 
}, double(1)) 
bestK <- which.max(kSils) 
kSil <- kSils[bestK] # Best silhouette coefficient 
kCluster <- kClusters[[bestK]]$cluster # Best solution 

# Try partitioning around medoids (Maechler et al. 2019):
pamClusters <- lapply(2:15, function (k) cluster::pam(dists, k = k)) 
pamSils <- vapply(pamClusters, function (pamCluster) { 
 mean(cluster::silhouette(pamCluster)[, 3]) 
}, double(1)) 
bestPam <- which.max(pamSils) 
pamSil <- pamSils[bestPam] # Best silhouette coefficient 
pamCluster <- pamClusters[[bestPam]]$cluster # Best solution 

# Try hierarchical clustering with minimax linkage (Bien & Tibshirani 2011):
hTree <- protoclust::protoclust(dists) 
hClusters <- lapply(2:15, function (k) cutree(hTree, k = k)) 
hSils <- vapply(hClusters, function (hCluster) { 
 mean(cluster::silhouette(hCluster, dists)[, 3]) 
}, double(1)) 
bestH <- which.max(hSils) 
hSil <- hSils[bestH] # Best silhouette coefficient 
hCluster <- hClusters[[bestH]] # Best solution

# Set threshold for recognizing meaningful clustering 
# no support < 0.25 < weak < 0.5 < good < 0.7 < strong 
threshold <- 0.5

# Compare silhouette coefficients of each method 
bestMethodId <- which.max(c(threshold, pamSil, hSil, kSil)) 
bestCluster <- c("none", "pam", "hmm", "kmn")[bestMethodId] 

# Best clustering:
c("Structure below threshold",
  "Partitioning around medoids",
  "Hierarchical-minimax",
  "K-means")[bestMethodId]
## [1] "K-means"
# Best silhouette coefficient:
max(c(threshold, pamSil, hSil, kSil))
## [1] 0.601017
# Store the cluster to which each tree is optimally assigned: 
clustering <- switch(bestCluster, pam = pamCluster, hmm = hCluster, kmn = kCluster, 1) 
nClusters <- length(unique(clustering))

Mapping tree space

Now we can map tree space into a few dimensions and visualise the distribution of trees:

nDim <- 3 # Number of dimensions to plot

# Generate first dimensions of tree space using PCoA 
map <- cmdscale(dists, k = nDim) 

# Prepare plot layout 
nPanels <- nDim * (nDim - 1L) / 2L # Lower-left triangle 
plotSeq <- matrix(0, nDim, nDim)
plotSeq[upper.tri(plotSeq)] <- seq_len(nPanels)
plotSeq[nDim - 1, 2] <- max(plotSeq) + 1L
layout(t(plotSeq[-nDim, -1]))

# Set plot margins 
par(mar = rep(0.2, 4))

# Set up tree plotting symbols
treePch <- 0 # Square
clusterCol <- palette.colors(nClusters, "Alphabet", alpha = 0.8)
treeCols <- clusterCol[clustering]
for (i in 2:nDim) for (j in seq_len(i - 1)) { 
  
  # Set up blank plot 
  plot( 
    x = map[, j], 
    y = map[, i], 
    ann = FALSE,        # No annotations 
    axes = FALSE,       # No axes 
    frame.plot = TRUE,  # Border around plot 
    type = "n",         # Don't plot any points yet 
    asp = 1,            # Fix aspect ratio to avoid distortion 
    xlim = range(map),  # Constant X range for all dimensions 
    ylim = range(map)   # Constant Y range for all dimensions 
  ) 
   
  # Plot minimum spanning tree (Gower 1969) to depict distortion (Smith 2022)
  mst <- MSTEdges(as.matrix(dists)) 
  segments( 
    x0 = map[mst[, 1], j], 
    y0 = map[mst[, 1], i], 
    x1 = map[mst[, 2], j], 
    y1 = map[mst[, 2], i], 
    col = "#bbbbbb", # Light grey 
    lty = 1          # Solid lines 
  ) 
   
  # Add points
  points( 
    x = map[, j], 
    y = map[, i], 
    pch = treePch, 
    col = treeCols, 
    cex = 1.7, # Point size 
    lwd = 2 # Line width 
  ) 
   
  # Mark clusters 
  for (clI in seq_len(nClusters)) { 
    inCluster <- clustering == clI 
    clusterX <- map[inCluster, j] 
    clusterY <- map[inCluster, i] 
    hull <- chull(clusterX, clusterY) 
    polygon( 
      x = clusterX[hull], 
      y = clusterY[hull], 
      lty = 1, # Solid line style 
      lwd = 2, # Wider line width 
      border = clusterCol[clI] 
    ) 
  } 
}

Using tree clustering to understand tree space structure

Plotting the consensus of trees within a specific cluster can give more insight into the underlying relationships within a tree set than viewing a single strict consensus, particularly where the data do not decisively distinguish between a small number of well-defined but conflicting phylogenetic hypotheses (Stockham, Wang, & Warnow, 2002).

Visualising the consensus of each cluster can reveal this “hidden” agreement between trees and allow more detailed and thoughtful analysis of the meaning of the phylogenetic output.

As the limited screen space available can render large trees difficult to read, it may be helpful to output the analysis using the “Save plot” options, either as an R script to allow further analysis within R, or as a PDF for more comfortable on-screen viewing.

Consensus of tree clusters
Consensus of tree clusters

The R code below is modified from that output by the GUI, in order to conduct rogue taxon analysis for each cluster separately. Note that it draws on the results of the clustering analysis above.

# Select proportion for majority rule consensus trees
# 1 is the appropriate value when parsimony trees are included
majority <- 1

# Plot each consensus tree in turn: 
for (i in seq_len(nClusters)) {
  clusterTrees <- trees[clustering == i]
  # Identify rogue taxa for this cluster
  clusterRogues <- Rogue::QuickRogue(clusterTrees, p = majority)$taxon[-1]
  
  # Colour tree labels based on stability across cluster
  tipCols <- Rogue::ColByStability(clusterTrees)
  
  cons <- ConsensusWithout( 
    trees = clusterTrees, 
    tip = clusterRogues,
    p = majority
  ) 
  
  # Root tree 
  cons <- RootTree(cons, trees[[1]]$tip.label[1])
  
  # Set unit edge lengths for viewing
  cons$edge.length <- rep.int(1, nrow(cons$edge)) 
  
  # Rotate nodes, to display clades in order of size 
  cons <- SortTree(cons, order = names(trees[[1]]$tip.label))
  
  # Plot tree
  par(mar = c(0, 0, 0, 0), cex = 0.8) # Set up plotting area
  plot( 
    cons, 
    edge.width = 2,             # Widen lines 
    font = 3,                   # Italicize labels 
    cex = 0.83,                 # Shrink tip font size 
    edge.color = clusterCol[i], # Colour tree according to previous plot
    tip.color = tipCols[cons$tip.label] 
  ) 
  legend(
    "bottomright", 
    paste("Cluster", i), 
    pch = 15,            # Filled circle icon 
    pt.cex = 1.5,        # Increase icon size 
    col = clusterCol[i], 
    bty = "n"            # Don't plot legend in box 
  ) 
  if (length(clusterRogues)) {
    # Mark omitted taxa, if any rogues present
    legend(
      "topright",
      clusterRogues,
      col = tipCols[clusterRogues],
      text.col = tipCols[clusterRogues],
      pch = "-",      # Mark as excluded
      cex = 0.8,      # Smaller font
      text.font = 3,  # Italicize
      bty = "n"       # Don't plot legend in box 
    )
  }
}

Where next?

References

Bien, J., & Tibshirani, R. (2011). Hierarchical clustering with prototypes via minimax linkage. Journal of the American Statistical Association, 106(495), 1075–1084. doi:10.1198/jasa.2011.tm10183
Brazeau, M. D., Guillerme, T., & Smith, M. R. (2019). An algorithm for morphological phylogenetic analysis with inapplicable data. Systematic Biology, 68, 619–631. doi:10.1093/sysbio/syy083
Fitch, W. M. (1971). Toward defining the course of evolution: Minimum change for a specific tree topology. Systematic Biology, 20(4), 406–416. doi:10.1093/sysbio/20.4.406
Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A K-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1), 100–108. doi:10.2307/2346830
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., & Hornik, K. (2019). Cluster: Cluster Analysis Basics and Extensions. Comprehensive R Archive Network, 2.1.0.
Smith, M. R. (2020). Information theoretic Generalized Robinson-Foulds metrics for comparing phylogenetic trees. Bioinformatics, 36(20), 5007–5013. doi:10.1093/bioinformatics/btaa614
Smith, M. R. (2022a). Robust analysis of phylogenetic tree space. Systematic Biology, 71(5), 1255–1270. doi:10.1093/sysbio/syab100
Smith, M. R. (2022b). Using information theory to detect rogue taxa and improve consensus trees. Systematic Biology, 71(5), 1088–1094. doi:10.1093/sysbio/syab099
Stockham, C., Wang, L.-S., & Warnow, T. (2002). Statistically based postprocessing of phylogenetic analysis by clustering. Bioinformatics, 18(Suppl 1), S285–S293. doi:10.1093/bioinformatics/18.suppl_1.S285
Sun, H.-J., Smith, M. R., Zeng, H., Zhao, F.-C., Li, G.-X., & Zhu, M.-Y. (2018). Hyoliths with pedicles illuminate the origin of the brachiopod body plan. Proceedings of the Royal Society B, 285(1887), 20181780. doi:10.1098/rspb.2018.1780