
Support for inline-beta filtered search with expressions #782

Open

gopalrs wants to merge 50 commits into main from sync-from-cdb-diskann

Conversation

@gopalrs (Contributor) commented Feb 16, 2026

This PR has the following changes:

- Add support for inline-beta search with filter expressions, supporting AND and OR expressions and equality comparisons.
- Add a benchmark that evaluates performance and recall on a small dataset, and also serves as an example of how to set up filtered search with expressions.
- Refactored recall utilities in diskann-benchmark.
- Updated tokio utilities.
- Added attribute and format parser improvements in label-filter.
- Updated ground_truth utilities in diskann-tools.
Copilot AI left a comment

Pull request overview

This PR integrates label-filtered (“document”) insertion and inline beta filtered search into the DiskANN benchmark/tooling flow, enabling benchmarks that operate on { vector, attributes } documents and evaluate filtered queries.

Changes:

  • Added DocumentInsertStrategy and supporting public types to insert/query Document objects (vector + attributes) through DocumentProvider.
  • Extended inline beta filter search to handle predicate encoding failures and added a constructor for InlineBetaStrategy.
  • Added a new benchmark input/backend (document-index-build) plus example config for running document + filter benchmarks.

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 9 comments.

File Description
test_data/disk_index_search/data.256.label.jsonl Updates LFS pointer for label test data used in filter benchmarks.
diskann-tools/src/utils/ground_truth.rs Adds array-aware label matching/expansion and extensive tracing diagnostics for filter ground-truth generation.
diskann-tools/Cargo.toml Adds serde_json dependency (and adjusts manifest metadata).
diskann-providers/src/model/graph/provider/async_/inmem/full_precision.rs Adds Vec<T> query support for full-precision in-mem provider (for inline beta usage).
diskann-label-filter/src/lib.rs Exposes the new document_insert_strategy module under encoded_attribute_provider.
diskann-label-filter/src/inline_beta_search/inline_beta_filter.rs Adds InlineBetaStrategy::new and introduces is_valid_filter fast-path logic.
diskann-label-filter/src/inline_beta_search/encoded_document_accessor.rs Adjusts filter encoding to be optional and threads is_valid_filter into the query computer.
diskann-label-filter/src/encoded_attribute_provider/roaring_attribute_store.rs Makes RoaringAttributeStore public for cross-crate use.
diskann-label-filter/src/encoded_attribute_provider/encoded_filter_expr.rs Changes encoded filter representation to Option, allowing “invalid filter” fallback behavior.
diskann-label-filter/src/encoded_attribute_provider/document_provider.rs Allows vector types used in documents to be ?Sized.
diskann-label-filter/src/encoded_attribute_provider/document_insert_strategy.rs New strategy wrapper enabling insertion/search over Document values.
diskann-label-filter/src/encoded_attribute_provider/ast_label_id_mapper.rs Simplifies lookup error messaging and signature for attribute→id mapping.
diskann-label-filter/src/document.rs Makes Document generic over ?Sized vectors.
diskann-benchmark/src/utils/tokio.rs Adds a reusable multi-thread Tokio runtime builder.
diskann-benchmark/src/utils/recall.rs Re-exports knn recall helper for benchmark use.
diskann-benchmark/src/inputs/mod.rs Registers a new document_index input module.
diskann-benchmark/src/inputs/document_index.rs New benchmark input schema for document-index build + filtered search runs.
diskann-benchmark/src/backend/mod.rs Registers new document_index backend benchmarks.
diskann-benchmark/src/backend/index/result.rs Extends search result reporting with query count and wall-clock summary columns.
diskann-benchmark/src/backend/document_index/mod.rs New backend module entrypoint for document index benchmarks.
diskann-benchmark/src/backend/document_index/benchmark.rs New end-to-end benchmark: build via DocumentInsertStrategy + filtered search via InlineBetaStrategy.
diskann-benchmark/example/document-filter.json Adds example job configuration for document filter benchmark runs.
Cargo.lock Adds serde_json to the lockfile dependencies.


@sampathrg sampathrg requested a review from hildebrandmw March 16, 2026 10:33
codecov-commenter commented Mar 16, 2026

Codecov Report

❌ Patch coverage is 58.63095% with 139 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.36%. Comparing base (0ced23d) to head (aaf0488).

Files with missing lines Patch % Lines
diskann-benchmark/src/inputs/document_index.rs 44.79% 53 Missing ⚠️
...ded_attribute_provider/document_insert_strategy.rs 54.11% 39 Missing ⚠️
diskann-tools/src/utils/ground_truth.rs 56.17% 39 Missing ⚠️
...oded_attribute_provider/roaring_attribute_store.rs 0.00% 3 Missing ⚠️
...ilter/src/inline_beta_search/inline_beta_filter.rs 81.25% 3 Missing ⚠️
...rc/encoded_attribute_provider/document_provider.rs 0.00% 1 Missing ⚠️
...rc/inline_beta_search/encoded_document_accessor.rs 0.00% 1 Missing ⚠️

❌ Your patch status has failed because the patch coverage (58.63%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files


@@            Coverage Diff             @@
##             main     #782      +/-   ##
==========================================
+ Coverage   89.31%   89.36%   +0.04%     
==========================================
  Files         445      448       +3     
  Lines       84095    84406     +311     
==========================================
+ Hits        75113    75431     +318     
+ Misses       8982     8975       -7     
Flag Coverage Δ
miri 89.36% <58.63%> (+0.04%) ⬆️
unittests 89.20% <58.63%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...iskann-benchmark/src/backend/document_index/mod.rs 100.00% <100.00%> (ø)
diskann-benchmark/src/backend/index/build.rs 85.92% <100.00%> (ø)
diskann-benchmark/src/backend/index/mod.rs 100.00% <ø> (ø)
diskann-benchmark/src/backend/mod.rs 100.00% <100.00%> (ø)
diskann-benchmark/src/inputs/mod.rs 79.16% <100.00%> (+0.90%) ⬆️
diskann-benchmark/src/main.rs 91.79% <100.00%> (+0.66%) ⬆️
diskann-label-filter/src/document.rs 71.42% <ø> (+71.42%) ⬆️
.../encoded_attribute_provider/ast_label_id_mapper.rs 97.44% <100.00%> (-0.03%) ⬇️
diskann-label-filter/src/query.rs 72.72% <100.00%> (+72.72%) ⬆️
...rc/encoded_attribute_provider/document_provider.rs 8.73% <0.00%> (+8.73%) ⬆️
... and 6 more

... and 4 files with indirect coverage changes


@sampathrg sampathrg changed the title from "Integrating in-mem, inline, beta search into GH DiskANN" to "Support for inline-beta filtered search with expressions" Mar 19, 2026
@hildebrandmw (Contributor) left a comment

Thank you @sampathrg - this is a big step in the right direction. I think some upcoming changes I have will make the query nesting a little easier.

Outside of the comments I left in relevant places, one thing that really concerns me is the lack of test coverage associated with this PR.

Code within the diskann-label-provider crate should definitely have intentional tests. And while I understand that diskann-benchmark is relatively lightly tested, we at least have integration tests in main.rs that provide a smoke test against end-to-end functionality completely failing. Please include such a test. Also, if the filtering code is going behind a feature in diskann-benchmark like I requested, we'll have to add the crate features to the coverage pipeline. Since that change should have been present anyway, I support adding it in this PR.

pub struct FilteredQuery<V> {
query: V,
pub struct FilteredQuery<'a, V: ?Sized> {
query: &'a V,
Contributor

One potential gotcha of converting query to a reference is that we'll be unable to use this paged search. At least without rearchitecting paged search.

I have an in-progress change that should make it considerably easier to have the query be a proxy for the inner type (originally, this PR ran afoul of an incompatibility between Vec<T> and [T], which I believe it is now working around).

My larger point is: expect this to change again in the very near future.


Okay.

Contributor

Okay, now that we've made the API change, there's one thing we can do to clean up how this works a little. Instead of

pub struct FilteredQuery<V> {
    query: V,
    filter_expr: ASTExpr,
}

impl<V> FilteredQuery<V> {
    fn query<'a>(&'a self) -> V::Target
    where
        V: Reborrow<'a>,
    {
        self.query.reborrow()
    }
}

And instead of requiring &V for the inner trait bounds, we use <V as Reborrow<'a>>::Target (or just V::Target when the associated lifetime is unambiguous).

This does a couple things. First, it lets FilteredQuery have an owned query if needed and gets rid of the repeated lifetime bound.

Second, it will compose slightly better with providers that use non-slice types (e.g. multi-vectors).


I've made this change. This causes some changes to other type constraints. Earlier, the type would be constrained just by the inner type, e.g.:

IA: BuildQueryComputer<&'q Q>,
Q: ?Sized,

with this change it would be:

IA: BuildQueryComputer<Q::Target>,
Q: Reborrow<'q>,

The second might be harder to understand at first glance (unless you already know what FilteredQuery does). This might be okay, just calling it out. I was wondering whether adding a type alias within FilteredQuery, like type Target = V::Target, and using that in the constraint above would be better. The Reborrow constraint still needs to be used, so I guess it's fine the way it is now.
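As a rough, self-contained reconstruction of the Reborrow pattern discussed in this thread (the trait and impl shown here are illustrative assumptions based on the snippets above, not the crate's actual definitions):

```rust
// A "reborrow" trait: an owned value hands out a borrowed view of itself.
trait Reborrow<'a> {
    type Target;
    fn reborrow(&'a self) -> Self::Target;
}

// An owned Vec<T> reborrows as a slice, sidestepping the Vec<T> vs [T]
// incompatibility mentioned earlier in the thread.
impl<'a, T: 'a> Reborrow<'a> for Vec<T> {
    type Target = &'a [T];
    fn reborrow(&'a self) -> &'a [T] {
        self.as_slice()
    }
}

// FilteredQuery can now own its query (no repeated lifetime bound on the
// struct itself) while still handing out a borrowed Target on demand.
struct FilteredQuery<V> {
    query: V,
}

impl<V> FilteredQuery<V> {
    fn query<'a>(&'a self) -> V::Target
    where
        V: Reborrow<'a>,
    {
        self.query.reborrow()
    }
}

fn main() {
    let fq = FilteredQuery { query: vec![1.0f32, 2.0, 3.0] };
    let slice: &[f32] = fq.query(); // borrowed view of the owned vector
    assert_eq!(slice.len(), 3);
}
```

Because the lifetime lives on the trait rather than on FilteredQuery, callers that need an owned query and callers that only need a borrowed view can share one struct definition.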

.filter_expr
.encoded_filter_expr()
.accept(&pred_eval)
.expect("Expected predicate evaluation to not error out!")
Contributor

How robust is the assumption that predicate evaluation won't error out in practice? What preconditions are needed?

If it relies on an internal invariant holding independently of any user code, then maybe this is fine. But if user code has a way of triggering this panic, this will need to be reevaluated.


The only reason this would return an error is if the trait implementation of Set::contains returns an error, which gets bubbled up when evaluating an AND or an OR expression. All current implementations of this function via the macro impl_set_for_roaring return an Ok(_) value. There may be future implementations of the Set trait that decide to return an error, causing a panic here, but that seems unlikely to me.

Contributor

On one hand, I see your point. On the other, this is a point of contention. Because our current implementations always return Ok - does it make sense to instead change the trait to never return an error? That sidesteps the problem entirely.


That makes sense. I can probably do that as a separate PR since it's not related to this change.
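A minimal sketch of what the suggested trait change could look like, assuming a simplified Set trait (the real trait and the impl_set_for_roaring macro are more involved; names here are illustrative):

```rust
// Suggested shape: since every current implementation returns Ok(_),
// make `contains` infallible so call sites need no `.expect(...)`.
trait Set {
    // Before (hypothetical): fn contains(&self, id: u32) -> Result<bool, SomeError>;
    fn contains(&self, id: u32) -> bool;
}

// Toy implementation over a sorted id list, standing in for the
// roaring-bitmap-backed store.
struct SortedIdSet(Vec<u32>);

impl Set for SortedIdSet {
    fn contains(&self, id: u32) -> bool {
        self.0.binary_search(&id).is_ok()
    }
}

fn main() {
    let set = SortedIdSet(vec![1, 4, 9]);
    // No Result, no expect: predicate evaluation over AND/OR trees
    // composed of such lookups can no longer panic on this path.
    assert!(set.contains(4));
    assert!(!set.contains(5));
}
```

The design trade-off: an infallible trait removes the panic path entirely, at the cost of ruling out future fallible backends (which would then need a separate trait or an internal error-handling strategy).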

+ diskann::graph::SampleableForStart
+ diskann_utils::sampling::WithApproximateNorm
+ 'static,
for<'b> diskann_vector::distance::SquaredL2: PureDistanceFunction<&'b [T], &'b [T]>,
Contributor

How about other distances (cosine, normalized cosine, IP)? Do we plan to support them?


Those should work; this is just an example benchmark. I haven't tried running it with different distance functions.

Contributor

Those are available via VectorRepr - the L2 bound I believe is just for finding the index of the medoid.


if num_vectors != label_count {
return Err(anyhow::anyhow!(
"Mismatch: {} vectors but {} label documents",
Contributor

I am confused about vectors vs. documents. From my understanding, one document can be chunked into multiple vectors, and at the same time one vector can be mapped to multiple documents. So here "document" has the same definition as vector, is that correct?


Yes, here a document is a vector plus the attributes associated with that document.

.collect();

let build_time: MicroSeconds = timer.elapsed().into();
writeln!(output, " Index built in {} s", build_time.as_seconds())?;
Contributor

I am not sure whether we can collect memory usage in this step, but it would be better to track peak memory in the benchmark.


Okay. I can add that.

@sampathrg sampathrg requested a review from hildebrandmw March 23, 2026 13:44
/// that returns `true`.
/// * An empty array is treated as an absent field (preserving the previous behaviour).
/// * When all fields have been consumed, `eval_query_expr` is called on the accumulated object.
fn eval_map_recursive(
Contributor

I'm flagging this because I'm seeing a systemic issue that I think ties back to the use of serde_json::Value as the heavy-lift vessel for diskann-label-filter. There are several inter-related problems:

  1. This function takes a &[_] when taking an iterator would avoid allocation (this is related to the second point).
  2. The implementation of this function is recursive on the runtime length of the slice.
  3. The current map is cloned for every element in arrays, which increases in cost the deeper in the stack we get.

Ideally, we'd use something like a &mut HashMap<&str, &Value> for current and use iteration for this recursive map. However, eval_query_expr requires a &Value, so even if we used a HashMap<&str, &Value> or equivalent, we're still forced to materialize a full Map and all the corresponding allocations.

To me, this is indicating that the fundamental abstraction of diskann-label-filter is getting in the way of writing efficient code. This isn't necessarily a blocker for this PR, but should be prioritized for the long term viability of diskann-label-filter.

On a side note: if this functionality is needed here in ground_truth.rs, does that indicate it is more broadly useful/working around some issue with the semantics of eval_query_expr and if so, should that not be fixed instead?
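To illustrate the allocation concern, here is a simplified backtracking variant that mutates a single &mut HashMap and restores it after each recursive call, instead of cloning the accumulated map once per array element. This is a sketch under simplifying assumptions (string values stand in for serde_json::Value, and the empty-array-as-absent-field rule is omitted):

```rust
use std::collections::HashMap;

// Expand array-valued fields into every combination of scalar values,
// invoking `sink` once per fully-materialized attribute map.
// `current` is mutated in place and restored on the way out (backtracking),
// so no per-element clone of the map is needed.
fn eval_map_recursive<'v>(
    fields: &[(&'v str, Vec<&'v str>)],
    current: &mut HashMap<&'v str, &'v str>,
    sink: &mut dyn FnMut(&HashMap<&'v str, &'v str>),
) {
    match fields.split_first() {
        None => sink(current), // all fields consumed: evaluate the combination
        Some(((key, values), rest)) => {
            for value in values {
                current.insert(*key, *value); // choose one array element
                eval_map_recursive(rest, current, sink);
                current.remove(key); // backtrack instead of cloning
            }
        }
    }
}

fn main() {
    let fields = vec![
        ("color", vec!["red", "blue"]),
        ("size", vec!["small"]),
    ];
    let mut count = 0;
    eval_map_recursive(&fields, &mut HashMap::new(), &mut |_m| count += 1);
    assert_eq!(count, 2); // 2 colors x 1 size
}
```

As the comment notes, this only helps fully if the downstream evaluator can accept a borrowed map; if eval_query_expr insists on an owned serde_json::Value, a materialization step (and its allocations) is still forced at the leaves.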

i,
label.doc_id,
label.label
);
Contributor

There are lots of little debug prints here that make arbitrary cutoff decisions for when to stop. Are these development artifacts that could be removed?


Yes, some of these can be removed. I'll leave the tracing::info calls alone.



let filtered_query = FilteredQuery::new(query_vec, ast_expr.clone());

// Use a concrete IdDistance scratch buffer so that both the IDs and distances
// are captured. Afterwards, the valid IDs are forwarded into the framework buffer.
Contributor

Perhaps diskann-benchmark-core should be updated to capture distances as well. I think this can be done in a non-breaking way (not a blocker for this PR).


Okay, I can do it in a separate PR.

let query_vec = self.queries.row(index);
let (_, ref ast_expr) = self.predicates[index];
let strategy = InlineBetaStrategy::new(self.beta, common::FullPrecision);
let filtered_query = FilteredQuery::new(query_vec, ast_expr.clone());
Contributor

One theme I've been observing throughout diskann-label-filter is the design kind of inherently forces patterns like cloning the ast_expr for the query.

I'm not reviewing the benchmark code in too much detail, but I strongly encourage looking for patterns like forced clones in loops as opportunities for making the underlying implementation better.


I looked through the benchmark code. This clone right here seems like the only one that could cause performance issues. Removing it would mean holding a reference to the ast_expr instead of owning it.
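A minimal sketch of that suggestion, with a stand-in AstExpr type (the real FilteredQuery and ASTExpr differ): the query loop borrows one shared expression rather than cloning it per query.

```rust
// Stand-in for the real ASTExpr: cloning it per query is the cost we avoid.
#[derive(Debug)]
struct AstExpr(String);

// Borrowing variant: the filter expression is shared across queries,
// at the cost of tying the query's lifetime to the expression's.
struct FilteredQuery<'a, V> {
    query: V,
    filter_expr: &'a AstExpr,
}

fn main() {
    let ast = AstExpr("color == \"red\"".to_string());
    let queries = vec![vec![1.0f32], vec![2.0f32]];
    for q in &queries {
        // One cheap borrow per iteration instead of one deep clone.
        let fq = FilteredQuery { query: q.as_slice(), filter_expr: &ast };
        assert_eq!(fq.query.len(), 1);
        assert!(fq.filter_expr.0.contains("color"));
    }
}
```

Whether the borrow or the clone is preferable depends on how long the search holds the query relative to the expression's owner; the thread above notes the API is expected to change again soon.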
