Add fast RF cross-pairs via Day 1985 ClusterTable batch#192
Merged
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #192 +/- ##
==========================================
+ Coverage 95.66% 95.74% +0.07%
==========================================
Files 49 49
Lines 4448 4531 +83
==========================================
+ Hits 4255 4338 +83
Misses 193 193 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
fb0396b to
8692ae6
Compare
Performance benchmark results
|
Performance benchmark results
|
8692ae6 to
2f660d5
Compare
RobinsonFoulds(trees1, trees2) previously fell through to per-pair R dispatch via CalculateTreeDistance() -> .mapply(), which was ~27x slower per pair than the all-pairs path. Add robinson_foulds_cross_pairs() C++ function in day_1985.cpp using the same Day 1985 algorithm and ClusterTable encoding as the existing all-pairs function. Wire it into RobinsonFoulds() as a fast path when both inputs are tree lists with matching tip labels. Benchmark: 21x speedup for RobinsonFoulds(trees1, trees2). Per-pair cost is now 1.84x all-pairs (down from 27x). Tests: 5 new tests in test-batch_coverage.R covering distance, similarity, normalization, single-tree fallback, and all-pairs consistency.
2f660d5 to
8851f7f
Compare
Performance benchmark results
|
Performance benchmark results
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
RobinsonFoulds(trees1, trees2)previously fell through to per-pair R dispatch viaCalculateTreeDistance()→.SplitDistanceManyMany()→.mapply(), which was ~27× slower per pair than the all-pairs path.This PR adds
robinson_foulds_cross_pairs()inday_1985.cppusing the same Day 1985 algorithm andClusterTableencoding as the existingrobinson_foulds_all_pairs(). The R dispatch inRobinsonFoulds()now uses this fast path when both inputs are tree lists with matching tip labels.Root cause
RobinsonFoulds(trees)(all-pairs)robinson_foulds_all_pairs()— batch C++RobinsonFoulds(trees1, trees2)(old)CalculateTreeDistance()→ per-pair.mapply()RobinsonFoulds(trees1, trees2)(new)robinson_foulds_cross_pairs()— batch C++The remaining 1.8× gap vs all-pairs is from calling
as.ClusterTable()twice (once per collection).Benchmark
50 anchor trees × 250 target trees (50 tips):
Motivation
Tree-topology ESS computation in MkPrime uses
RobinsonFoulds()for pairwise distance matrices. The cross-distance path enables computing rectangular submatrices (e.g. 200 anchor rows × n columns) instead of full n×n, reducing the number of comparisons by ~60% at n = 1000. Previously the per-pair overhead negated this algorithmic saving; now it's viable.Changes
src/day_1985.cpp: Newrobinson_foulds_cross_pairs()— identical inner loop to all-pairs, M×N iteration boundsR/tree_distance_rf.R: Fast path inRobinsonFoulds()when both inputs are multiPhylo with matching tips; falls back toCalculateTreeDistance()for single-tree inputs orreportMatching = TRUEtests/testthat/test-batch_coverage.R: 5 new tests (distance, similarity, normalization, single-tree fallback, all-pairs consistency)