Skip to content

Add fast RF cross-pairs via Day 1985 ClusterTable batch#192

Merged
ms609 merged 1 commit intomainfrom
feature/rf-cross-pairs
Mar 28, 2026
Merged

Add fast RF cross-pairs via Day 1985 ClusterTable batch#192
ms609 merged 1 commit intomainfrom
feature/rf-cross-pairs

Conversation

@ms609
Copy link
Copy Markdown
Owner

@ms609 ms609 commented Mar 28, 2026

Summary

RobinsonFoulds(trees1, trees2) previously fell through to per-pair R dispatch via CalculateTreeDistance().SplitDistanceManyMany().mapply(), which was ~27× slower per pair than the all-pairs path.

This PR adds robinson_foulds_cross_pairs() in day_1985.cpp using the same Day 1985 algorithm and ClusterTable encoding as the existing robinson_foulds_all_pairs(). The R dispatch in RobinsonFoulds() now uses this fast path when both inputs are tree lists with matching tip labels.

Root cause

Path Entry point Per-pair cost
RobinsonFoulds(trees) (all-pairs) robinson_foulds_all_pairs() — batch C++
RobinsonFoulds(trees1, trees2) (old) CalculateTreeDistance() → per-pair .mapply() ~27×
RobinsonFoulds(trees1, trees2) (new) robinson_foulds_cross_pairs() — batch C++ ~1.8×

The remaining 1.8× gap vs all-pairs is from calling as.ClusterTable() twice (once per collection).

Benchmark

50 anchor trees × 250 target trees (50 tips):

  • Old path: 542 ms
  • New path: 26 ms
  • Speedup: 21×

Motivation

Tree-topology ESS computation in MkPrime uses RobinsonFoulds() for pairwise distance matrices. The cross-distance path enables computing rectangular submatrices (e.g. 200 anchor rows × n columns) instead of full n×n, reducing the number of comparisons by ~60% at n = 1000. Previously the per-pair overhead negated this algorithmic saving; now it's viable.

Changes

  • src/day_1985.cpp: New robinson_foulds_cross_pairs() — identical inner loop to all-pairs, M×N iteration bounds
  • R/tree_distance_rf.R: Fast path in RobinsonFoulds() when both inputs are multiPhylo with matching tips; falls back to CalculateTreeDistance() for single-tree inputs or reportMatching = TRUE
  • tests/testthat/test-batch_coverage.R: 5 new tests (distance, similarity, normalization, single-tree fallback, all-pairs consistency)

@ms609 ms609 changed the base branch from transfer-consensus to main March 28, 2026 10:31
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.74%. Comparing base (cd2cf8e) to head (8851f7f).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #192      +/-   ##
==========================================
+ Coverage   95.66%   95.74%   +0.07%     
==========================================
  Files          49       49              
  Lines        4448     4531      +83     
==========================================
+ Hits         4255     4338      +83     
  Misses        193      193              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ms609 ms609 force-pushed the feature/rf-cross-pairs branch from fb0396b to 8692ae6 Compare March 28, 2026 10:39
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 28, 2026

⚠️ This benchmark result is outdated. See the latest comment below.

Performance benchmark results

Call Status Change Time (ms)
ClusteringInfoDistance(tr200) ⚪ NSD 0.2% 15.5 →
15.4, 15.6
ClusteringInfoDistance(tr50) ⚪ NSD -1.71% 11.8 →
11.8, 12.2
LAPJV(test2000) 🟠 Slower 🙁 -5.24% 94.2 →
99.7, 98.4
LAPJV(test40) ⚪ NSD -0.58% 0.0171 →
0.0173, 0.0171
LAPJV(test400) ⚪ NSD -0.47% 3.15 →
3.17, 3.16
MutualClusteringInfo(tr200) ⚪ NSD 1.67% 23.2 →
22.6, 23.4
MutualClusteringInfo(tr50) ⚪ NSD 0.17% 24.8 →
24.3, 25.3
PathDist(postTrees) 🟠 Slower 🙁 -55.45% 3.43 →
5.32, 5.34
PhylogeneticInfoDistance(tr200) 🟣 ~same 1.45% 231 →
228, 227
PhylogeneticInfoDistance(tr50) 🟣 ~same -0.72% 77.7 →
78.3, 78.3
RobinsonFoulds(tr200) ⚪ NSD -3.21% 2.88 →
2.98, 2.98
RobinsonFoulds(tr200) ⚪ NSD -3.41% 2.66 →
2.77, 2.75
RobinsonFoulds(tr50) ⚪ NSD -1.95% 4.42 →
4.51, 4.5

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 28, 2026

⚠️ This benchmark result is outdated. See the latest comment below.

Performance benchmark results

Call Status Change Time (ms)
ClusteringInfoDistance(tr200) ⚪ NSD 0.4% 15.7 →
15.7, 15.6
ClusteringInfoDistance(tr50) ⚪ NSD -1.33% 11.8 →
12, 11.9
LAPJV(test2000) ⚪ NSD 1.85% 89.1 →
88.1, 86.4
LAPJV(test40) ⚪ NSD 0.87% 0.0148 →
0.0147, 0.0147
LAPJV(test400) ⚪ NSD 0.32% 2.93 →
2.91, 2.92
MutualClusteringInfo(tr200) ⚪ NSD -1.93% 22.5 →
23.1, 22.6
MutualClusteringInfo(tr50) ⚪ NSD -1.44% 23.7 →
23.8, 24.3
PathDist(postTrees) ⚪ NSD -1.25% 3.47 →
3.5, 3.54
PhylogeneticInfoDistance(tr200) 🟣 ~same 0.52% 233 →
233, 232
PhylogeneticInfoDistance(tr50) ⚪ NSD -0.06% 80.2 →
80.3, 80.2
RobinsonFoulds(tr200) ⚪ NSD -1.12% 2.91 →
2.94, 2.94
RobinsonFoulds(tr200) ⚪ NSD -2.08% 2.68 →
2.75, 2.71
RobinsonFoulds(tr50) ⚪ NSD -1.42% 4.47 →
4.52, 4.55

@ms609 ms609 force-pushed the feature/rf-cross-pairs branch from 8692ae6 to 2f660d5 Compare March 28, 2026 13:53
RobinsonFoulds(trees1, trees2) previously fell through to per-pair
R dispatch via CalculateTreeDistance() -> .mapply(), which was ~27x
slower per pair than the all-pairs path.

Add robinson_foulds_cross_pairs() C++ function in day_1985.cpp using
the same Day 1985 algorithm and ClusterTable encoding as the existing
all-pairs function. Wire it into RobinsonFoulds() as a fast path when
both inputs are tree lists with matching tip labels.

Benchmark: 21x speedup for RobinsonFoulds(trees1, trees2).
Per-pair cost is now 1.84x all-pairs (down from 27x).

Tests: 5 new tests in test-batch_coverage.R covering distance,
similarity, normalization, single-tree fallback, and all-pairs
consistency.
@ms609 ms609 force-pushed the feature/rf-cross-pairs branch from 2f660d5 to 8851f7f Compare March 28, 2026 14:06
@ms609 ms609 merged commit 0b8299c into main Mar 28, 2026
11 of 13 checks passed
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 28, 2026

⚠️ This benchmark result is outdated. See the latest comment below.

Performance benchmark results

Call Status Change Time (ms)
ClusteringInfoDistance(tr200) ⚪ NSD -0.67% 15.3 →
15.4, 15.4
ClusteringInfoDistance(tr50) ⚪ NSD -1.46% 11.4 →
11.6, 11.5
LAPJV(test2000) ⚪ NSD -3.28% 83.2 →
94.3, 84.1
LAPJV(test40) ⚪ NSD -1.52% 0.0144 →
0.0148, 0.0145
LAPJV(test400) 🟣 ~same -1.27% 2.88 →
2.93, 2.91
MutualClusteringInfo(tr200) 🟠 Slower 🙁 -9.23% 21.4 →
24.5, 22.7
MutualClusteringInfo(tr50) 🟠 Slower 🙁 -20.98% 21.9 →
28.8, 25.4
PathDist(postTrees) ⚪ NSD 1.37% 3.46 →
3.43, 3.4
PhylogeneticInfoDistance(tr200) ⚪ NSD -0.23% 229 →
229, 229
PhylogeneticInfoDistance(tr50) ⚪ NSD -1.12% 79.2 →
79.1, 81.2
RobinsonFoulds(tr200) ⚪ NSD 1.6% 2.92 →
2.89, 2.85
RobinsonFoulds(tr200) ⚪ NSD -1.28% 2.63 →
2.69, 2.64
RobinsonFoulds(tr50) ⚪ NSD -0.86% 4.35 →
4.4, 4.37

@ms609 ms609 deleted the feature/rf-cross-pairs branch March 28, 2026 14:06
@github-actions
Copy link
Copy Markdown

Performance benchmark results

Call Status Change Time (ms)
ClusteringInfoDistance(tr200) ⚪ NSD -0.36% 15.3 →
15.4, 15.2
ClusteringInfoDistance(tr50) ⚪ NSD -0.89% 11.5 →
11.6, 11.5
LAPJV(test2000) ⚪ NSD -1.37% 82.3 →
82.5, 83.9
LAPJV(test40) ⚪ NSD 0.95% 0.0148 →
0.0147, 0.0146
LAPJV(test400) ⚪ NSD -0.41% 2.88 →
2.88, 2.9
MutualClusteringInfo(tr200) ⚪ NSD -2.01% 21.7 →
22.7, 21.7
MutualClusteringInfo(tr50) ⚪ NSD 0.87% 22.6 →
22.3, 22.7
PathDist(postTrees) ⚪ NSD -0.75% 3.41 →
3.44, 3.41
PhylogeneticInfoDistance(tr200) ⚪ NSD -0.1% 229 →
229, 229
PhylogeneticInfoDistance(tr50) ⚪ NSD -0.5% 79.3 →
79.4, 79.8
RobinsonFoulds(tr200) ⚪ NSD 0.65% 2.88 →
2.84, 2.87
RobinsonFoulds(tr200) ⚪ NSD -0.27% 2.65 →
2.65, 2.66
RobinsonFoulds(tr50) ⚪ NSD -0.18% 4.36 →
4.34, 4.39

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant