Ancestral state reconstruction (ASR) is a critical tool for comparative biology, providing inferences about the unobserved evolutionary history of lineages. While the evolutionary models of character evolution have advanced, the field has primarily relied on marginal reconstructions, which independently estimate the probability of each node state. In contrast, joint reconstructions estimate the most likely sequence of states across nodes. Joint reconstructions may capture the evolutionary process more accurately, but methods for estimating uncertainty in joint reconstructions are lacking.
This vignette demonstrates how to generate and analyze alternative joint ancestral state reconstructions using some clustering approaches (see Boyko et al. 2025).
First, ensure you have the required packages installed:
# Install required packages if not already installed
if (!requireNamespace("Rtsne", quietly = TRUE)) install.packages("Rtsne")
if (!requireNamespace("dbscan", quietly = TRUE)) install.packages("dbscan")
if (!requireNamespace("proxy", quietly = TRUE)) install.packages("proxy")
if (!requireNamespace("RColorBrewer", quietly = TRUE)) install.packages("RColorBrewer")
# For the developmental version of corHMM:
# devtools::install_github("thej022214/corHMM")
Load the necessary libraries and source custom functions:
# Load required libraries
library(Rtsne)
library(dbscan)
##
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
##
## as.dendrogram
library(proxy)
##
## Attaching package: 'proxy'
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
library(ape)
library(RColorBrewer)
# library(corHMM)
devtools::load_all("~/corHMM/")
## ℹ Loading corHMM
## Loading required package: nloptr
## Loading required package: GenSA
## Warning in readLines(con, warn = FALSE, n = n, ok = ok, skipNul = skipNul):
## invalid input found on input connection '/Users/jboyko/corHMM/R/makeSimmap.R'
This example uses the primates dataset included in the corHMM package. You can substitute your own phylogenetic tree and trait data.
# Load example data
data(primates)
phy <- multi2di(primates[[1]]) # Ensure tree is fully bifurcating
cor_dat <- primates[[2]] # Trait data
# For your own data, use:
# phy <- read.tree("path/to/your/tree.tre")
# cor_dat <- read.csv("path/to/your/trait_data.csv")
First, fit a corHMM model to estimate transition rates between character states:
# Fit corHMM model
focal <- corHMM(phy, cor_dat,
rate.cat = 1, # Number of rate categories
model = "ER", # Equal rates model
node.states = "joint", # Joint reconstruction
root.p = "maddfitz") # Root state frequencies
## Warning in corHMM(phy, cor_dat, rate.cat = 1, model = "ER", node.states =
## "joint", : Branch lengths of 0 detected. Adding 1.49011611938477e-08
## State distribution in data:
## States: 0|0 0|1 1|1
## Counts: 29 10 21
## Beginning thorough optimization search -- performing 0 random restarts
## Finished. Inferring ancestral states using joint reconstruction.
state_cols <- RColorBrewer::brewer.pal(3, "Set1")
plot(phy, cex = 0.25, label.offset = 0.75, no.margin = TRUE)
tiplabels(pch = 21, bg = state_cols[as.numeric(focal$data.legend[,2])], cex = 0.75)
nodelabels(pch = 21, bg = state_cols[focal$states], cex = 0.75)
Generate alternative ancestral state reconstructions by sampling from the confidence interval:
set.seed(0104) # Set seed for reproducibility
# Generate alternative reconstructions
joint_ci <- corHMM:::compute_joint_ci(
focal,
batch_size = 100, # Number of samples per batch
max_samples = 1000, # Maximum number of samples
ncores = 10) # Number of cores for parallel processing
# Extract results
best_joint <- joint_ci$best_joint # Best reconstruction
state_df <- joint_ci$state_df # Data frame of alternative reconstructions
The compute_joint_ci
function samples alternative
reconstructions based on their conditional likelihoods, providing a more
comprehensive view of possible evolutionary histories than a single
point estimate.
Calculate distances between alternative reconstructions, accounting for phylogenetic structure:
# Prepare phylogeny for distance calculation
phy <- ladderize(focal$phy)
tmp_phy <- phy
tmp_phy$edge.length <- rep(1, length(tmp_phy$edge.length))
phy_dist <- ape::dist.nodes(tmp_phy)[-(1:Ntip(tmp_phy)),-(1:Ntip(tmp_phy))]
state_df_matrix <- as.matrix(state_df)
# Register custom distance function
try(pr_DB$set_entry(FUN = corHMM:::phylo_aware_dist_local, names = "PhyloSpreadDistLocal"))
# Calculate phylogenetically-informed distances
phylo_dist_mat_local <- proxy::dist(state_df_matrix,
method = "PhyloSpreadDistLocal",
phylo_dist = phy_dist)
heatmap(as.matrix(phylo_dist_mat_local))
The phylo_aware_dist_local
function weights differences
between reconstructions based on their phylogenetic proximity, so that
differences at closely related nodes contribute more to the overall
distance. What this heatmap is showing is not the primate data, but the
distance between different joint reconstructions. The further apart a
pair of reconstructions are, the darker the color.
Use t-SNE for dimensionality reduction and DBSCAN for clustering:
# t-SNE on phylogenetically-informed distances
set.seed(0104)
tsne_res <- Rtsne(phylo_dist_mat_local)
# Determine parameters for DBSCAN
n_recons <- length(joint_ci$lnliks)
min_pts <- round(.05 * n_recons) # ~5% of reconstructions as minimum points
# Find appropriate epsilon value
par(mfrow = c(1,1))
kNNdistplot(tsne_res$Y, k = min_pts - 1, minPts = min_pts)
abline(h = 3.25, lty = 3, col = "red") # Add reference line at eps = 3
# Run DBSCAN clustering
set.seed(0104)
dbres <- dbscan(tsne_res$Y, eps = 3.25, minPts = min_pts)
The kNNdistplot
helps identify an appropriate
eps
value for DBSCAN clustering. The “knee” in the plot
indicates a good cutoff for separating core points from noise.
Create visualizations to explore the clustering results:
# Set up cluster assignments and colors
cluster_assignments <- dbres$cluster
clusters <- factor(cluster_assignments)
optimal_k_gap <- sum(as.numeric(unique(dbres$cluster)) > 0)
clust_cols <- RColorBrewer::brewer.pal(sum(levels(clusters) != 0), "Set3")
## Warning in RColorBrewer::brewer.pal(sum(levels(clusters) != 0), "Set3"): minimal value for n is 3, returning requested palette with 3 different levels
# Plot t-SNE results colored by cluster
par(mfrow = c(1,2))
plot(tsne_res$Y, pch = 21,
bg = c("white", clust_cols)[cluster_assignments+1],
main = "Phylo Weighted Distance",
xlab = "tSNE Dim 1", ylab = "tSNE Dim 2")
# Plot likelihood distributions by cluster
corHMM:::plot_stacked_densities(joint_ci$lnliks, clusters,
cols = clust_cols,
main = "lnlik Distribution by cluster")
These visualizations help identify distinct clusters of reconstructions and their likelihood distributions.
Examine the characteristics of each cluster:
# Visualize state distributions within each cluster
par(mar=c(1, 2, 2, 2))
layout(matrix(1:(optimal_k_gap*2), nrow = optimal_k_gap, byrow = TRUE))
for(i in 1:optimal_k_gap) {
# Plot pie charts showing state frequencies at each node
clust_id = i
pies <- t(apply(state_df[clusters == clust_id,], 2,
function(x) table(factor(x, levels = c(1,2,3)))/length(x)))
plot.phylo(phy, show.tip.label = FALSE, edge.width = 0.5,
direction = "downwards", main = paste0("Proportions within cluster ", i))
nodelabels(pie = pies, cex = 0.5, piecol = state_cols)
# Plot representative reconstruction from the cluster
clust_id = i
plot.phylo(phy, show.tip.label = FALSE, edge.width = 0.5,
direction = "downwards", main = paste0("Average of cluster ", i))
index_lowest_dist <- which.min(colMeans(
as.matrix(phylo_dist_mat_local)[clusters == clust_id, clusters == clust_id]))
max_node_labels <- state_df[clusters == clust_id,][index_lowest_dist,]
nodelabels(pch = 21,
bg = state_cols[as.numeric(max_node_labels)])
}
Calculate importance scores to identify nodes that differentiate clusters:
# Calculate node importance scores
node_importance <- corHMM:::node_importance_by_cluster(state_df, clusters)
# Identify important nodes for each cluster
important_nodes <- lapply(unique(clusters), function(clust) {
cluster_nodes <- node_importance[node_importance$cluster == clust,]
cluster_nodes[cluster_nodes$importance > 0.5,]
})
Nodes with high importance scores are those that most strongly differentiate one cluster from others.
Create a visualization showing the clustering results:
par(mfrow = c(1,1))
plot_info <- corHMM:::plot_clustered_phylo(
phy = phy,
state_df = state_df,
cluster_colors = clust_cols,
cluster_assignments = clusters,
node_states = focal$states,
important_nodes = important_nodes,
state_colors = state_cols,
state_labels = c("0|0", "0|1", "1|1"),
legend_title = "Trait",
phylogram_params = list(edge.color = "grey50"),
point_params = list(cex = 1),
segment_params = list(lty = 3, lwd = 1.5),
outer_cex = 1.5,
inner_rad = .4,
node_offset = 2
)
This visualization shows the state at each node in the best reconstruction and important nodes that differentiate clusters. This lets us quickly see the alternative state reconstructions inferred across different clusters.
When interpreting the clustering results, consider:
Number of clusters: More clusters suggest greater uncertainty or multiple plausible evolutionary scenarios.
Likelihood distributions: Clusters with similar likelihood values represent equally plausible alternative reconstructions.
Important nodes: Nodes with high importance scores represent evolutionary transitions where uncertainty is concentrated.
State distributions: Pie charts show the frequency of different states at each node within a cluster.
To use this approach with your own data:
compute_joint_ci()
based on
your computational resourceseps
values for DBSCAN based
on your kNN distance plot