Generating and Interpreting Alternative Ancestral State Reconstructions

Overview

Ancestral state reconstruction (ASR) is a critical tool for comparative biology, providing inferences about the unobserved evolutionary history of lineages. While the evolutionary models of character evolution have advanced, the field has primarily relied on marginal reconstructions, which independently estimate the probability of each node state. In contrast, joint reconstructions estimate the most likely sequence of states across nodes. Joint reconstructions may capture the evolutionary process more accurately, but methods for estimating uncertainty in joint reconstructions are lacking.

This vignette demonstrates how to generate and analyze alternative joint ancestral state reconstructions using some clustering approaches (see Boyko et al. 2025).

Installation and Setup

First, ensure you have the required packages installed:

# Install required packages if not already installed
if (!requireNamespace("Rtsne", quietly = TRUE)) install.packages("Rtsne")
if (!requireNamespace("dbscan", quietly = TRUE)) install.packages("dbscan")
if (!requireNamespace("proxy", quietly = TRUE)) install.packages("proxy")
if (!requireNamespace("RColorBrewer", quietly = TRUE)) install.packages("RColorBrewer")
# For the developmental version of corHMM:
# devtools::install_github("thej022214/corHMM")

Load the necessary libraries and source custom functions:

# Load required libraries
library(Rtsne)
library(dbscan)
## 
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
## 
##     as.dendrogram
library(proxy)
## 
## Attaching package: 'proxy'
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## The following object is masked from 'package:base':
## 
##     as.matrix
library(ape)
library(RColorBrewer)
# library(corHMM)
devtools::load_all("~/corHMM/") # this vignette was made from a local corhmm version, you'll load it via library
## ℹ Loading corHMM
## Loading required package: nloptr
## Loading required package: GenSA

Data Preparation

This example uses the primates dataset included in the corHMM package. You can substitute your own phylogenetic tree and trait data.

# Load example data
data(primates)
phy <- multi2di(primates[[1]])  # Ensure tree is fully bifurcating
cor_dat <- primates[[2]]        # Trait data
# For your own data, use:
# phy <- read.tree("path/to/your/tree.tre")
# cor_dat <- read.csv("path/to/your/trait_data.csv")

Model Fitting with corHMM

First, fit a corHMM model to estimate transition rates between character states:

# Fit corHMM model
focal <- corHMM(phy, cor_dat, 
                rate.cat = 1,            # Number of rate categories
                model = "ER",            # Equal rates model
                node.states = "joint",   # Joint reconstruction
                root.p = "maddfitz")     # Root state frequencies
## Warning in corHMM(phy, cor_dat, rate.cat = 1, model = "ER", node.states =
## "joint", : Branch lengths of 0 detected. Adding 1.49011611938477e-08
## State distribution in data:
## States:  0|0 0|1 1|1 
## Counts:  29  10  21  
## Beginning thorough optimization search -- performing 0 random restarts 
## Finished. Inferring ancestral states using joint reconstruction.
state_cols <- RColorBrewer::brewer.pal(3, "Set1")
plot(phy, cex = 0.25, label.offset = 0.75, no.margin = TRUE)
tiplabels(pch = 21, bg = state_cols[as.numeric(focal$data.legend[,2])], cex = 0.75)
nodelabels(pch = 21, bg = state_cols[focal$states], cex = 0.75)

Generating Alternative Reconstructions

Generate alternative ancestral state reconstructions by sampling from the confidence interval:

set.seed(0104)  # Set seed for reproducibility

# Generate alternative reconstructions
joint_ci <- corHMM:::compute_joint_ci(
  focal,
  batch_size = 100,    # Number of samples per batch
  max_samples = 1000)  # Maximum number of samples)     

# Extract results
best_joint <- joint_ci$best_joint   # Best reconstruction
state_df <- joint_ci$state_df       # Data frame of alternative reconstructions

The compute_joint_ci function samples alternative reconstructions based on their conditional likelihoods, providing a more comprehensive view of possible evolutionary histories than a single point estimate.

We can also just take a look at the state_df data.frame which contains of the alternative internal node states.

corHMM:::plot_raw_recon(as.matrix(state_df), rep(1, nrow(state_df)), "grey")

Each row is a different character history and each column is a different node. There is some structure we can see here, but we can do better and organize similar reconstructions together based on their overall similarty. To do that we need to calculate some measure of distance between the reconstructions.

Calculating Phylogenetic Distances

Calculate distances between alternative reconstructions, accounting for phylogenetic structure:

# Prepare phylogeny for distance calculation
phy <- ladderize(focal$phy)
tmp_phy <- phy
tmp_phy$edge.length <- rep(1, length(tmp_phy$edge.length))
phy_dist <- ape::dist.nodes(tmp_phy)[-(1:Ntip(tmp_phy)),-(1:Ntip(tmp_phy))]
state_df_matrix <- as.matrix(state_df)

# Register custom distance function
try(pr_DB$set_entry(FUN = phylo_aware_dist_local, names = "PhyloSpreadDistLocal"))

# Calculate phylogenetically-informed distances
phylo_dist_mat_local <- proxy::dist(state_df_matrix, 
                                   method = "PhyloSpreadDistLocal",
                                   phylo_dist = phy_dist)
heatmap(as.matrix(phylo_dist_mat_local))

The phylo_aware_dist_local function weights differences between reconstructions based on their phylogenetic proximity, so that differences at closely related nodes contribute more to the overall distance. What this heatmap is showing is not the primate data, but the distance between different joint reconstructions. The further apart a pair of reconstructions are, the darker the color.

Dimensionality Reduction and Clustering

Use t-SNE for dimensionality reduction and DBSCAN for clustering:

# t-SNE on phylogenetically-informed distances
set.seed(0104)
tsne_res <- Rtsne(phylo_dist_mat_local)

# Determine parameters for DBSCAN
n_recons <- length(joint_ci$lnliks)
min_pts <- round(.05 * n_recons)  # ~5% of reconstructions as minimum points

# Find appropriate epsilon value
par(mfrow = c(1,1))
kNNdistplot(tsne_res$Y, k = min_pts - 1, minPts = min_pts)
abline(h = 4.25, lty = 3, col = "red")  # Add reference line at eps = 3

# Run DBSCAN clustering
set.seed(0104)
dbres <- dbscan(tsne_res$Y, eps = 3.25, minPts = min_pts)

The kNNdistplot helps identify an appropriate eps value for DBSCAN clustering. The “knee” in the plot indicates a good cutoff for separating core points from noise.

Visualizing Clusters of Reconstructions

Create visualizations to explore the clustering results:

# Set up cluster assignments and colors
cluster_assignments <- dbres$cluster
clusters <- factor(cluster_assignments)
optimal_k_gap <- sum(as.numeric(unique(dbres$cluster)) > 0)
clust_cols <- RColorBrewer::brewer.pal(sum(levels(clusters) != 0), "Set3")

# Plot t-SNE results colored by cluster
par(mfrow = c(1,2))
plot(tsne_res$Y, pch = 21, 
     bg = c("white", clust_cols)[cluster_assignments+1], 
     main = "Phylo Weighted Distance", 
     xlab = "tSNE Dim 1", ylab = "tSNE Dim 2")

# Plot likelihood distributions by cluster
corHMM:::plot_stacked_densities(joint_ci$lnliks, clusters, 
                      cols = clust_cols, 
                      main = "lnlik Distribution by cluster")

These visualizations help identify distinct clusters of reconstructions and their likelihood distributions. Now we can return to our raw plot recon to see how these groupings cluster together in the actual character history space.

corHMM:::plot_raw_recon(as.matrix(state_df), cluster_assignments, clust_cols)

This is the same clustering as above, but now organized into the distinct clutsers we’ve found. In the next section we’ll take a deeper dive into each chunk and see how we can characterize each cluster as a single reconstruction.

Analyzing Cluster Characteristics

First we’ll plot the average character state at each node as a pie chart and then take a single representitve. What you’ll notice is that there are differences in when a particular state emerges or is lost.

Examine the characteristics of each cluster:

# Visualize state distributions within each cluster
par(mar=c(1, 2, 2, 2))
layout(matrix(1:(optimal_k_gap*2), nrow = optimal_k_gap, byrow = TRUE))

for(i in 1:optimal_k_gap) {
  # Plot pie charts showing state frequencies at each node
  clust_id = i
  pies <- t(apply(state_df[clusters == clust_id,], 2, 
                 function(x) table(factor(x, levels = c(1,2,3)))/length(x)))
  plot.phylo(phy, show.tip.label = FALSE, edge.width = 0.5, 
            direction = "downwards", main = paste0("Proportions within cluster ", i))
  nodelabels(pie = pies, cex = 0.5, piecol = state_cols)
  
  # Plot representative reconstruction from the cluster
  clust_id = i
  plot.phylo(phy, show.tip.label = FALSE, edge.width = 0.5, 
            direction = "downwards", main = paste0("Average of cluster ", i))
  index_lowest_dist <- which.min(colMeans(
    as.matrix(phylo_dist_mat_local)[clusters == clust_id, clusters == clust_id]))
  max_node_labels <- state_df[clusters == clust_id,][index_lowest_dist,]
  nodelabels(pch = 21, 
            bg = state_cols[as.numeric(max_node_labels)])
}

Identifying Important Nodes for Each Cluster

We can also visualize this in a more focused way. I.e., we only want to visualize nodes where there are differences in the different reconstruction clusters.

Calculate importance scores to identify nodes that differentiate clusters: # note you can change the threshold for what is considered important if you choose. If the next plot throws an error it may be too strict of an importance threshold.

# Calculate node importance scores
node_importance <- corHMM:::node_importance_by_cluster(state_df, clusters)

# Identify important nodes for each cluster
important_nodes <- lapply(unique(clusters), function(clust) {
  cluster_nodes <- node_importance[node_importance$cluster == clust,]
  cluster_nodes[cluster_nodes$importance > 0.25,]
})

Nodes with high importance scores are those that most strongly differentiate one cluster from others.

Comprehensive Visualization

Create a visualization showing the clustering results:

par(mfrow = c(1,1))
plot_info <- corHMM:::plot_clustered_phylo(
  phy = phy,
  state_df = state_df,
  cluster_colors = clust_cols,
  cluster_assignments = clusters,
  node_states = focal$states,
  important_nodes = important_nodes,
  state_colors = state_cols,
  state_labels = c("0|0", "0|1", "1|1"),
  legend_title = "Trait",
  phylogram_params = list(edge.color = "grey50"),
  point_params = list(cex = 1),
  segment_params = list(lty = 3, lwd = 1.5),
  outer_cex = 1.5, 
  inner_rad = .4, 
  node_offset = 2
)

This visualization shows the state at each node in the best reconstruction and important nodes that differentiate clusters. This lets us quickly see the alternative state reconstructions inferred across different clusters.

Interpreting Results

When interpreting the clustering results, consider:

  1. Number of clusters: More clusters suggest greater uncertainty or multiple plausible evolutionary scenarios.

  2. Likelihood distributions: Clusters with similar likelihood values represent equally plausible alternative reconstructions.

  3. Important nodes: Nodes with high importance scores represent evolutionary transitions where uncertainty is concentrated.

  4. State distributions: Pie charts show the frequency of different states at each node within a cluster.

Adapting to Your Data

To use this approach with your own data:

  1. Replace the example data with your own phylogeny and trait data
  2. Adjust the parameters of compute_joint_ci() based on your computational resources
  3. Experiment with different eps values for DBSCAN based on your kNN distance plot
  4. Modify visualization parameters to suit your trait coding system

Function Reference

  • compute_joint_ci(): Generates alternative ancestral state reconstructions
  • phylo_aware_dist_local(): Calculates phylogenetically-informed distances
  • node_importance_by_cluster(): Identifies nodes that differentiate clusters
  • plot_clustered_phylo(): Creates visualization of clustering results
  • plot_stacked_densities(): Visualizes likelihood distributions by cluster