Support for inline-beta filtered search with expressions#782
Support for inline-beta filtered search with expressions#782
Conversation
- Refactored recall utilities in diskann-benchmark - Updated tokio utilities - Added attribute and format parser improvements in label-filter - Updated ground_truth utilities in diskann-tools
There was a problem hiding this comment.
Pull request overview
This PR integrates label-filtered (“document”) insertion and inline beta filtered search into the DiskANN benchmark/tooling flow, enabling benchmarks that operate on { vector, attributes } documents and evaluate filtered queries.
Changes:
- Added
DocumentInsertStrategyand supporting public types to insert/queryDocumentobjects (vector + attributes) throughDocumentProvider. - Extended inline beta filter search to handle predicate encoding failures and added a constructor for
InlineBetaStrategy. - Added a new benchmark input/backend (
document-index-build) plus example config for running document + filter benchmarks.
Reviewed changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| test_data/disk_index_search/data.256.label.jsonl | Updates LFS pointer for label test data used in filter benchmarks. |
| diskann-tools/src/utils/ground_truth.rs | Adds array-aware label matching/expansion and extensive tracing diagnostics for filter ground-truth generation. |
| diskann-tools/Cargo.toml | Adds serde_json dependency (and adjusts manifest metadata). |
| diskann-providers/src/model/graph/provider/async_/inmem/full_precision.rs | Adds Vec<T> query support for full-precision in-mem provider (for inline beta usage). |
| diskann-label-filter/src/lib.rs | Exposes the new document_insert_strategy module under encoded_attribute_provider. |
| diskann-label-filter/src/inline_beta_search/inline_beta_filter.rs | Adds InlineBetaStrategy::new and introduces is_valid_filter fast-path logic. |
| diskann-label-filter/src/inline_beta_search/encoded_document_accessor.rs | Adjusts filter encoding to be optional and threads is_valid_filter into the query computer. |
| diskann-label-filter/src/encoded_attribute_provider/roaring_attribute_store.rs | Makes RoaringAttributeStore public for cross-crate use. |
| diskann-label-filter/src/encoded_attribute_provider/encoded_filter_expr.rs | Changes encoded filter representation to Option, allowing “invalid filter” fallback behavior. |
| diskann-label-filter/src/encoded_attribute_provider/document_provider.rs | Allows vector types used in documents to be ?Sized. |
| diskann-label-filter/src/encoded_attribute_provider/document_insert_strategy.rs | New strategy wrapper enabling insertion/search over Document values. |
| diskann-label-filter/src/encoded_attribute_provider/ast_label_id_mapper.rs | Simplifies lookup error messaging and signature for attribute→id mapping. |
| diskann-label-filter/src/document.rs | Makes Document generic over ?Sized vectors. |
| diskann-benchmark/src/utils/tokio.rs | Adds a reusable multi-thread Tokio runtime builder. |
| diskann-benchmark/src/utils/recall.rs | Re-exports knn recall helper for benchmark use. |
| diskann-benchmark/src/inputs/mod.rs | Registers a new document_index input module. |
| diskann-benchmark/src/inputs/document_index.rs | New benchmark input schema for document-index build + filtered search runs. |
| diskann-benchmark/src/backend/mod.rs | Registers new document_index backend benchmarks. |
| diskann-benchmark/src/backend/index/result.rs | Extends search result reporting with query count and wall-clock summary columns. |
| diskann-benchmark/src/backend/document_index/mod.rs | New backend module entrypoint for document index benchmarks. |
| diskann-benchmark/src/backend/document_index/benchmark.rs | New end-to-end benchmark: build via DocumentInsertStrategy + filtered search via InlineBetaStrategy. |
| diskann-benchmark/example/document-filter.json | Adds example job configuration for document filter benchmark runs. |
| Cargo.lock | Adds serde_json to the lockfile dependencies. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
diskann-label-filter/src/inline_beta_search/inline_beta_filter.rs
Outdated
Show resolved
Hide resolved
diskann-label-filter/src/inline_beta_search/inline_beta_filter.rs
Outdated
Show resolved
Hide resolved
diskann-label-filter/src/inline_beta_search/inline_beta_filter.rs
Outdated
Show resolved
Hide resolved
diskann-label-filter/src/encoded_attribute_provider/encoded_filter_expr.rs
Outdated
Show resolved
Hide resolved
diskann-label-filter/src/inline_beta_search/encoded_document_accessor.rs
Outdated
Show resolved
Hide resolved
diskann-label-filter/src/inline_beta_search/inline_beta_filter.rs
Outdated
Show resolved
Hide resolved
diskann-providers/src/model/graph/provider/async_/inmem/full_precision.rs
Outdated
Show resolved
Hide resolved
diskann-providers/src/model/graph/provider/async_/inmem/full_precision.rs
Outdated
Show resolved
Hide resolved
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…DiskANN into sync-from-cdb-diskann
Codecov Report❌ Patch coverage is ❌ Your patch status has failed because the patch coverage (58.63%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #782 +/- ##
==========================================
+ Coverage 89.31% 89.36% +0.04%
==========================================
Files 445 448 +3
Lines 84095 84406 +311
==========================================
+ Hits 75113 75431 +318
+ Misses 8982 8975 -7
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
hildebrandmw
left a comment
There was a problem hiding this comment.
Thank you @sampathrg - this is a big step in the right direction. I think some upcoming changes I have will make the query nesting a little easier.
Outside of the comments I left in relevant places, one thing that really concerns me is the lack of test coverage associated with this PR.
Code within the diskann-label-provider crate should definitely have intentional tests. And while I understand that diskann-benchmark is relatively lightly tested, we at least have integration tests in main.rs that provide a smoke screen against end-to-end tests completely failing. Please include such a test. Also, if the filtering code is going behind a feature in diskann-benchmark like I requested, we'll have to add the crate features to the coverage pipeline. Since that change should have been present anyways, I support adding that in this PR.
diskann-label-filter/src/query.rs
Outdated
| pub struct FilteredQuery<V> { | ||
| query: V, | ||
| pub struct FilteredQuery<'a, V: ?Sized> { | ||
| query: &'a V, |
There was a problem hiding this comment.
One potential gotcha of converting query to a reference is that we'll be unable to use this paged search. At least without rearchitecting paged search.
I have an in-progress change that should make it considerably easier to have the query be a proxy for the inner type (originally, this PR ran afoul of an incompatibility between Vec<T> and [T] that I believe this is now working around).
My larger point is: expect this to change again in the very near future.
There was a problem hiding this comment.
Okay, now that we've made the API change, there's one thing we can do to clean up how this works a little. Instead of
pub struct FilteredQuery<V> {
query: V,
filter_expr: ASTExpr,
}
impl<V> FilteredQuery<V> {
fn query<'a>(&'a self) -> V::Target
where
V: Reborrow<'a>,
{
self.query.reborrow()
}
}And instead of requiring &V for the inner trait bounds, we use <V as Reborrow<'a>>::Target (or just V::Target when the associated lifetime is unambiguous.
This does a couple things. First, it lets FilteredQuery have an owned query if needed and gets rid of the repeated lifetime bound.
Second, it will compose slightly better with providers that use non-slice types (e.g. multi-vectors).
There was a problem hiding this comment.
I've made this change. This causes some changes to other type constraints. Earlier, the type would be constrained just by the inner type for e.g,
IA: BuildQueryComputer<&'q Q>,
Q: ?Sized,
with this change it would be:
IA: BuildQueryComputer<Q::Target>,
Q: Reborrow<'q>,
The 2nd might be harder to understand at first glance (Unless you already know what FilteredQuery does). This might be okay, just calling it out. I was wondering if making a type alias within FilteredQuery like:
type Target = V::Target and, using that in the constraint above would be better. The Reborrow constraint still needs to be used so I guess it's fine the way it is now.
| .filter_expr | ||
| .encoded_filter_expr() | ||
| .accept(&pred_eval) | ||
| .expect("Expected predicate evaluation to not error out!") |
There was a problem hiding this comment.
How robust is the assumption that predicate evaluation won't error out in practice? What preconditions are needed?
If it relies on an internal invariant holding independently of any user code, then maybe this is fine. But if user code has a way of triggering this panic, this will need to be reevaluated.
There was a problem hiding this comment.
The only reason this would return an error is if the trait function implementation Set::contains returns an error and that gets bubbled up when evaluating an AND or an OR expression. All current implementations of this function via the macro impl_set_for_roaring return an Ok(_) value. There maybe future implementations of the Set trait that might decide to return an error and causing a panic here but that seems unlikely to me.
There was a problem hiding this comment.
On one hand, I see your point. On the other, this is a point of contention. Because our current implementations always return Ok - does it make sense to instead change the trait to never return an error? That sidesteps the problem entirely.
There was a problem hiding this comment.
That makes sense. I can probably do that as a separate PR since it's not related to this change.
diskann-label-filter/src/encoded_attribute_provider/encoded_filter_expr.rs
Outdated
Show resolved
Hide resolved
diskann-label-filter/src/encoded_attribute_provider/document_insert_strategy.rs
Outdated
Show resolved
Hide resolved
diskann-label-filter/src/encoded_attribute_provider/document_insert_strategy.rs
Outdated
Show resolved
Hide resolved
| + diskann::graph::SampleableForStart | ||
| + diskann_utils::sampling::WithApproximateNorm | ||
| + 'static, | ||
| for<'b> diskann_vector::distance::SquaredL2: PureDistanceFunction<&'b [T], &'b [T]>, |
There was a problem hiding this comment.
how about other distances, (consine, consine_normalize, IP), do we plan to support them?
There was a problem hiding this comment.
Those should work, this is just an example benchmark. I haven't tried running this with different distance functions.
There was a problem hiding this comment.
Those are available via VectorRepr - the L2 bound I believe is just for finding the index of the medoid.
|
|
||
| if num_vectors != label_count { | ||
| return Err(anyhow::anyhow!( | ||
| "Mismatch: {} vectors but {} label documents", |
There was a problem hiding this comment.
I am confused with vector and document, from my understanding one document can be chunked into multiple vectors, at the same time, one vector can be mapped with multiple documents.
So here "document" should be same definition of vector from my understanding, is it correct?
There was a problem hiding this comment.
Yes, here document is vector + attributes associated with the document.
| .collect(); | ||
|
|
||
| let build_time: MicroSeconds = timer.elapsed().into(); | ||
| writeln!(output, " Index built in {} s", build_time.as_seconds())?; |
There was a problem hiding this comment.
I am not sure whether we can collect memory usage in this step, but it will be better to track peak memory in benchmark.
…the documentItem on which insertStrategy and setElement are implemented to references. This is due to the change in index.insert requiring copy trait on the element being set. Other changes include fixes to api breakage.
| /// that returns `true`. | ||
| /// * An empty array is treated as an absent field (preserving the previous behaviour). | ||
| /// * When all fields have been consumed, `eval_query_expr` is called on the accumulated object. | ||
| fn eval_map_recursive( |
There was a problem hiding this comment.
I'm flagging this because I'm seeing a systemic issue that I think ties back to the use of serde_json::Value as the heavy-lift vessel for diskann-label-filter. There are several inter-related problems:
- This function takes a
&[_]when taking an iterator would avoid allocation (this is related to the second point). - The implementation of this function is recursive on the runtime length of the slice.
- The
currentmap is cloned for every element in arrays, which increases in cost the deeper in the stack we get.
Ideally, we'd use something like a &mut HashMap<&str, &Value> for current and use iteration for this recursive map. However, eval_query_expr requires a &Value, so even if we used a HashMap<&str, &Value> or equivalent, we're still forced to materialize a full Map and all the corresponding allocations.
To me, this is indicating that the fundamental abstraction of diskann-label-filter is getting in the way of writing efficient code. This isn't necessarily a blocker for this PR, but should be prioritized for the long term viability of diskann-label-filter.
On a side note: if this functionality is needed here in ground_truth.rs, does that indicate it is more broadly useful/working around some issue with the semantics of eval_query_expr and if so, should that not be fixed instead?
| i, | ||
| label.doc_id, | ||
| label.label | ||
| ); |
There was a problem hiding this comment.
There are lots of little debug prints here that make arbitrary cutoff decisions for when to stop. Are these development artifacts that could be removed?
There was a problem hiding this comment.
yes, some of these can be removed. I'll leave the tracing::Info alone.
diskann-label-filter/src/query.rs
Outdated
| pub struct FilteredQuery<V> { | ||
| query: V, | ||
| pub struct FilteredQuery<'a, V: ?Sized> { | ||
| query: &'a V, |
There was a problem hiding this comment.
Okay, now that we've made the API change, there's one thing we can do to clean up how this works a little. Instead of
pub struct FilteredQuery<V> {
query: V,
filter_expr: ASTExpr,
}
impl<V> FilteredQuery<V> {
fn query<'a>(&'a self) -> V::Target
where
V: Reborrow<'a>,
{
self.query.reborrow()
}
}And instead of requiring &V for the inner trait bounds, we use <V as Reborrow<'a>>::Target (or just V::Target when the associated lifetime is unambiguous.
This does a couple things. First, it lets FilteredQuery have an owned query if needed and gets rid of the repeated lifetime bound.
Second, it will compose slightly better with providers that use non-slice types (e.g. multi-vectors).
| + diskann::graph::SampleableForStart | ||
| + diskann_utils::sampling::WithApproximateNorm | ||
| + 'static, | ||
| for<'b> diskann_vector::distance::SquaredL2: PureDistanceFunction<&'b [T], &'b [T]>, |
There was a problem hiding this comment.
Those are available via VectorRepr - the L2 bound I believe is just for finding the index of the medoid.
| let filtered_query = FilteredQuery::new(query_vec, ast_expr.clone()); | ||
|
|
||
| // Use a concrete IdDistance scratch buffer so that both the IDs and distances | ||
| // are captured. Afterwards, the valid IDs are forwarded into the framework buffer. |
There was a problem hiding this comment.
Perhaps diskann-benchmark-core should be updated to capture distances as well. I think this can be done in a non-breaking way (not a blocker for this PR).
| let query_vec = self.queries.row(index); | ||
| let (_, ref ast_expr) = self.predicates[index]; | ||
| let strategy = InlineBetaStrategy::new(self.beta, common::FullPrecision); | ||
| let filtered_query = FilteredQuery::new(query_vec, ast_expr.clone()); |
There was a problem hiding this comment.
One theme I've been observing throughout diskann-label-filter is the design kind of inherently forces patterns like cloning the ast_expr for the query.
I'm not reviewing the benchmark code in too much detail, but I strongly encourage looking for patterns like forced clones in loops as opportunities for making the underlying implementation better.
There was a problem hiding this comment.
I looked through the benchmark code. This clone right here seems like the only one that would cause performance issues. This would mean holding a reference to the ast_expr instead of owning it.
This PR has the following changes:
Add support for inline-beta search with filter expressions that support AND, OR expressions and equality comparisons.
Benchmark to evaluate perf and recall on small dataset and which also serves as an example on how to set things up to use filtered search with expressions.