Generating and Interpreting Alternative Ancestral State Reconstructions

Overview

Ancestral state reconstruction (ASR) is a critical tool for comparative biology, providing inferences about the unobserved evolutionary history of lineages. While the evolutionary models of character evolution have advanced, the field has primarily relied on marginal reconstructions, which independently estimate the probability of each node state. In contrast, joint reconstructions estimate the most likely sequence of states across nodes. Joint reconstructions may capture the evolutionary process more accurately, but methods for estimating uncertainty in joint reconstructions are lacking.

This vignette demonstrates how to generate and analyze alternative joint ancestral state reconstructions using some clustering approaches (see Boyko et al. 2025).

Installation and Setup

First, ensure you have the required packages installed:

# Install required packages if not already installed
if (!requireNamespace("Rtsne", quietly = TRUE)) install.packages("Rtsne")
if (!requireNamespace("dbscan", quietly = TRUE)) install.packages("dbscan")
if (!requireNamespace("proxy", quietly = TRUE)) install.packages("proxy")
if (!requireNamespace("RColorBrewer", quietly = TRUE)) install.packages("RColorBrewer")
# For the developmental version of corHMM:
# devtools::install_github("thej022214/corHMM")

Load the necessary libraries and source custom functions:

# Load required libraries
library(Rtsne)
library(dbscan)
## 
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
## 
##     as.dendrogram
library(proxy)
## 
## Attaching package: 'proxy'
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## The following object is masked from 'package:base':
## 
##     as.matrix
library(ape)
library(RColorBrewer)
# library(corHMM)
devtools::load_all("~/corHMM/")
## ℹ Loading corHMM
## Loading required package: nloptr
## Loading required package: GenSA
## Warning in readLines(con, warn = FALSE, n = n, ok = ok, skipNul = skipNul):
## invalid input found on input connection '/Users/jboyko/corHMM/R/makeSimmap.R'

Data Preparation

This example uses the primates dataset included in the corHMM package. You can substitute your own phylogenetic tree and trait data.

# Load example data
data(primates)
phy <- multi2di(primates[[1]])  # Ensure tree is fully bifurcating
cor_dat <- primates[[2]]        # Trait data
# For your own data, use:
# phy <- read.tree("path/to/your/tree.tre")
# cor_dat <- read.csv("path/to/your/trait_data.csv")

Model Fitting with corHMM

First, fit a corHMM model to estimate transition rates between character states:

# Fit corHMM model
focal <- corHMM(phy, cor_dat, 
                rate.cat = 1,            # Number of rate categories
                model = "ER",            # Equal rates model
                node.states = "joint",   # Joint reconstruction
                root.p = "maddfitz")     # Root state frequencies
## Warning in corHMM(phy, cor_dat, rate.cat = 1, model = "ER", node.states =
## "joint", : Branch lengths of 0 detected. Adding 1.49011611938477e-08
## State distribution in data:
## States:  0|0 0|1 1|1 
## Counts:  29  10  21  
## Beginning thorough optimization search -- performing 0 random restarts 
## Finished. Inferring ancestral states using joint reconstruction.
state_cols <- RColorBrewer::brewer.pal(3, "Set1")
plot(phy, cex = 0.25, label.offset = 0.75, no.margin = TRUE)
tiplabels(pch = 21, bg = state_cols[as.numeric(focal$data.legend[,2])], cex = 0.75)
nodelabels(pch = 21, bg = state_cols[focal$states], cex = 0.75)

Generating Alternative Reconstructions

Generate alternative ancestral state reconstructions by sampling from the confidence interval:

set.seed(0104)  # Set seed for reproducibility

# Generate alternative reconstructions
joint_ci <- corHMM:::compute_joint_ci(
  focal,
  batch_size = 100,    # Number of samples per batch
  max_samples = 1000,  # Maximum number of samples
  ncores = 10)         # Number of cores for parallel processing

# Extract results
best_joint <- joint_ci$best_joint   # Best reconstruction
state_df <- joint_ci$state_df       # Data frame of alternative reconstructions

The compute_joint_ci function samples alternative reconstructions based on their conditional likelihoods, providing a more comprehensive view of possible evolutionary histories than a single point estimate.

Calculating Phylogenetic Distances

Calculate distances between alternative reconstructions, accounting for phylogenetic structure:

# Prepare phylogeny for distance calculation
phy <- ladderize(focal$phy)
tmp_phy <- phy
tmp_phy$edge.length <- rep(1, length(tmp_phy$edge.length))
phy_dist <- ape::dist.nodes(tmp_phy)[-(1:Ntip(tmp_phy)),-(1:Ntip(tmp_phy))]
state_df_matrix <- as.matrix(state_df)

# Register custom distance function
try(pr_DB$set_entry(FUN = corHMM:::phylo_aware_dist_local, names = "PhyloSpreadDistLocal"))

# Calculate phylogenetically-informed distances
phylo_dist_mat_local <- proxy::dist(state_df_matrix, 
                                   method = "PhyloSpreadDistLocal",
                                   phylo_dist = phy_dist)
heatmap(as.matrix(phylo_dist_mat_local))

The phylo_aware_dist_local function weights differences between reconstructions based on their phylogenetic proximity, so that differences at closely related nodes contribute more to the overall distance. What this heatmap is showing is not the primate data, but the distance between different joint reconstructions. The further apart a pair of reconstructions are, the darker the color.

Dimensionality Reduction and Clustering

Use t-SNE for dimensionality reduction and DBSCAN for clustering:

# t-SNE on phylogenetically-informed distances
set.seed(0104)
tsne_res <- Rtsne(phylo_dist_mat_local)

# Determine parameters for DBSCAN
n_recons <- length(joint_ci$lnliks)
min_pts <- round(.05 * n_recons)  # ~5% of reconstructions as minimum points

# Find appropriate epsilon value
par(mfrow = c(1,1))
kNNdistplot(tsne_res$Y, k = min_pts - 1, minPts = min_pts)
abline(h = 3.25, lty = 3, col = "red")  # Add reference line at eps = 3

# Run DBSCAN clustering
set.seed(0104)
dbres <- dbscan(tsne_res$Y, eps = 3.25, minPts = min_pts)

The kNNdistplot helps identify an appropriate eps value for DBSCAN clustering. The “knee” in the plot indicates a good cutoff for separating core points from noise.

Visualizing Clusters of Reconstructions

Create visualizations to explore the clustering results:

# Set up cluster assignments and colors
cluster_assignments <- dbres$cluster
clusters <- factor(cluster_assignments)
optimal_k_gap <- sum(as.numeric(unique(dbres$cluster)) > 0)
clust_cols <- RColorBrewer::brewer.pal(sum(levels(clusters) != 0), "Set3")
## Warning in RColorBrewer::brewer.pal(sum(levels(clusters) != 0), "Set3"): minimal value for n is 3, returning requested palette with 3 different levels
# Plot t-SNE results colored by cluster
par(mfrow = c(1,2))
plot(tsne_res$Y, pch = 21, 
     bg = c("white", clust_cols)[cluster_assignments+1], 
     main = "Phylo Weighted Distance", 
     xlab = "tSNE Dim 1", ylab = "tSNE Dim 2")

# Plot likelihood distributions by cluster
corHMM:::plot_stacked_densities(joint_ci$lnliks, clusters, 
                      cols = clust_cols, 
                      main = "lnlik Distribution by cluster")

These visualizations help identify distinct clusters of reconstructions and their likelihood distributions.

Analyzing Cluster Characteristics

Examine the characteristics of each cluster:

# Visualize state distributions within each cluster
par(mar=c(1, 2, 2, 2))
layout(matrix(1:(optimal_k_gap*2), nrow = optimal_k_gap, byrow = TRUE))

for(i in 1:optimal_k_gap) {
  # Plot pie charts showing state frequencies at each node
  clust_id = i
  pies <- t(apply(state_df[clusters == clust_id,], 2, 
                 function(x) table(factor(x, levels = c(1,2,3)))/length(x)))
  plot.phylo(phy, show.tip.label = FALSE, edge.width = 0.5, 
            direction = "downwards", main = paste0("Proportions within cluster ", i))
  nodelabels(pie = pies, cex = 0.5, piecol = state_cols)
  
  # Plot representative reconstruction from the cluster
  clust_id = i
  plot.phylo(phy, show.tip.label = FALSE, edge.width = 0.5, 
            direction = "downwards", main = paste0("Average of cluster ", i))
  index_lowest_dist <- which.min(colMeans(
    as.matrix(phylo_dist_mat_local)[clusters == clust_id, clusters == clust_id]))
  max_node_labels <- state_df[clusters == clust_id,][index_lowest_dist,]
  nodelabels(pch = 21, 
            bg = state_cols[as.numeric(max_node_labels)])
}

Identifying Important Nodes for Each Cluster

Calculate importance scores to identify nodes that differentiate clusters:

# Calculate node importance scores
node_importance <- corHMM:::node_importance_by_cluster(state_df, clusters)

# Identify important nodes for each cluster
important_nodes <- lapply(unique(clusters), function(clust) {
  cluster_nodes <- node_importance[node_importance$cluster == clust,]
  cluster_nodes[cluster_nodes$importance > 0.5,]
})

Nodes with high importance scores are those that most strongly differentiate one cluster from others.

Comprehensive Visualization

Create a visualization showing the clustering results:

par(mfrow = c(1,1))
plot_info <- corHMM:::plot_clustered_phylo(
  phy = phy,
  state_df = state_df,
  cluster_colors = clust_cols,
  cluster_assignments = clusters,
  node_states = focal$states,
  important_nodes = important_nodes,
  state_colors = state_cols,
  state_labels = c("0|0", "0|1", "1|1"),
  legend_title = "Trait",
  phylogram_params = list(edge.color = "grey50"),
  point_params = list(cex = 1),
  segment_params = list(lty = 3, lwd = 1.5),
  outer_cex = 1.5, 
  inner_rad = .4, 
  node_offset = 2
)

This visualization shows the state at each node in the best reconstruction and important nodes that differentiate clusters. This lets us quickly see the alternative state reconstructions inferred across different clusters.

Interpreting Results

When interpreting the clustering results, consider:

  1. Number of clusters: More clusters suggest greater uncertainty or multiple plausible evolutionary scenarios.

  2. Likelihood distributions: Clusters with similar likelihood values represent equally plausible alternative reconstructions.

  3. Important nodes: Nodes with high importance scores represent evolutionary transitions where uncertainty is concentrated.

  4. State distributions: Pie charts show the frequency of different states at each node within a cluster.

Adapting to Your Data

To use this approach with your own data:

  1. Replace the example data with your own phylogeny and trait data
  2. Adjust the parameters of compute_joint_ci() based on your computational resources
  3. Experiment with different eps values for DBSCAN based on your kNN distance plot
  4. Modify visualization parameters to suit your trait coding system

Function Reference

  • compute_joint_ci(): Generates alternative ancestral state reconstructions
  • phylo_aware_dist_local(): Calculates phylogenetically-informed distances
  • node_importance_by_cluster(): Identifies nodes that differentiate clusters
  • plot_clustered_phylo(): Creates visualization of clustering results
  • plot_stacked_densities(): Visualizes likelihood distributions by cluster