
Support for inline-beta filtered search with expressions #782

Open

gopalrs wants to merge 50 commits into main from sync-from-cdb-diskann

Conversation

@gopalrs (Contributor) commented Feb 16, 2026

This PR has the following changes:

- Add support for inline-beta search with filter expressions, supporting AND and OR expressions and equality comparisons.
- Add a benchmark that evaluates performance and recall on a small dataset, and also serves as an example of how to set up filtered search with expressions.
- Refactored recall utilities in diskann-benchmark.
- Updated tokio utilities.
- Added attribute and format parser improvements in label-filter.
- Updated ground_truth utilities in diskann-tools.
Copilot AI left a comment

Pull request overview

This PR integrates label-filtered (“document”) insertion and inline beta filtered search into the DiskANN benchmark/tooling flow, enabling benchmarks that operate on { vector, attributes } documents and evaluate filtered queries.

Changes:

  • Added DocumentInsertStrategy and supporting public types to insert/query Document objects (vector + attributes) through DocumentProvider.
  • Extended inline beta filter search to handle predicate encoding failures and added a constructor for InlineBetaStrategy.
  • Added a new benchmark input/backend (document-index-build) plus example config for running document + filter benchmarks.

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 9 comments.

File Description
test_data/disk_index_search/data.256.label.jsonl Updates LFS pointer for label test data used in filter benchmarks.
diskann-tools/src/utils/ground_truth.rs Adds array-aware label matching/expansion and extensive tracing diagnostics for filter ground-truth generation.
diskann-tools/Cargo.toml Adds serde_json dependency (and adjusts manifest metadata).
diskann-providers/src/model/graph/provider/async_/inmem/full_precision.rs Adds Vec<T> query support for full-precision in-mem provider (for inline beta usage).
diskann-label-filter/src/lib.rs Exposes the new document_insert_strategy module under encoded_attribute_provider.
diskann-label-filter/src/inline_beta_search/inline_beta_filter.rs Adds InlineBetaStrategy::new and introduces is_valid_filter fast-path logic.
diskann-label-filter/src/inline_beta_search/encoded_document_accessor.rs Adjusts filter encoding to be optional and threads is_valid_filter into the query computer.
diskann-label-filter/src/encoded_attribute_provider/roaring_attribute_store.rs Makes RoaringAttributeStore public for cross-crate use.
diskann-label-filter/src/encoded_attribute_provider/encoded_filter_expr.rs Changes encoded filter representation to Option, allowing “invalid filter” fallback behavior.
diskann-label-filter/src/encoded_attribute_provider/document_provider.rs Allows vector types used in documents to be ?Sized.
diskann-label-filter/src/encoded_attribute_provider/document_insert_strategy.rs New strategy wrapper enabling insertion/search over Document values.
diskann-label-filter/src/encoded_attribute_provider/ast_label_id_mapper.rs Simplifies lookup error messaging and signature for attribute→id mapping.
diskann-label-filter/src/document.rs Makes Document generic over ?Sized vectors.
diskann-benchmark/src/utils/tokio.rs Adds a reusable multi-thread Tokio runtime builder.
diskann-benchmark/src/utils/recall.rs Re-exports knn recall helper for benchmark use.
diskann-benchmark/src/inputs/mod.rs Registers a new document_index input module.
diskann-benchmark/src/inputs/document_index.rs New benchmark input schema for document-index build + filtered search runs.
diskann-benchmark/src/backend/mod.rs Registers new document_index backend benchmarks.
diskann-benchmark/src/backend/index/result.rs Extends search result reporting with query count and wall-clock summary columns.
diskann-benchmark/src/backend/document_index/mod.rs New backend module entrypoint for document index benchmarks.
diskann-benchmark/src/backend/document_index/benchmark.rs New end-to-end benchmark: build via DocumentInsertStrategy + filtered search via InlineBetaStrategy.
diskann-benchmark/example/document-filter.json Adds example job configuration for document filter benchmark runs.
Cargo.lock Adds serde_json to the lockfile dependencies.


@sampathrg sampathrg requested a review from hildebrandmw March 16, 2026 10:33
codecov-commenter commented Mar 16, 2026

Codecov Report

❌ Patch coverage is 58.63095% with 139 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.36%. Comparing base (0ced23d) to head (aaf0488).

Files with missing lines Patch % Lines
diskann-benchmark/src/inputs/document_index.rs 44.79% 53 Missing ⚠️
...ded_attribute_provider/document_insert_strategy.rs 54.11% 39 Missing ⚠️
diskann-tools/src/utils/ground_truth.rs 56.17% 39 Missing ⚠️
...oded_attribute_provider/roaring_attribute_store.rs 0.00% 3 Missing ⚠️
...ilter/src/inline_beta_search/inline_beta_filter.rs 81.25% 3 Missing ⚠️
...rc/encoded_attribute_provider/document_provider.rs 0.00% 1 Missing ⚠️
...rc/inline_beta_search/encoded_document_accessor.rs 0.00% 1 Missing ⚠️

❌ Your patch status has failed because the patch coverage (58.63%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files


@@            Coverage Diff             @@
##             main     #782      +/-   ##
==========================================
+ Coverage   89.31%   89.36%   +0.04%     
==========================================
  Files         445      448       +3     
  Lines       84095    84406     +311     
==========================================
+ Hits        75113    75431     +318     
+ Misses       8982     8975       -7     
Flag Coverage Δ
miri 89.36% <58.63%> (+0.04%) ⬆️
unittests 89.20% <58.63%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...iskann-benchmark/src/backend/document_index/mod.rs 100.00% <100.00%> (ø)
diskann-benchmark/src/backend/index/build.rs 85.92% <100.00%> (ø)
diskann-benchmark/src/backend/index/mod.rs 100.00% <ø> (ø)
diskann-benchmark/src/backend/mod.rs 100.00% <100.00%> (ø)
diskann-benchmark/src/inputs/mod.rs 79.16% <100.00%> (+0.90%) ⬆️
diskann-benchmark/src/main.rs 91.79% <100.00%> (+0.66%) ⬆️
diskann-label-filter/src/document.rs 71.42% <ø> (+71.42%) ⬆️
.../encoded_attribute_provider/ast_label_id_mapper.rs 97.44% <100.00%> (-0.03%) ⬇️
diskann-label-filter/src/query.rs 72.72% <100.00%> (+72.72%) ⬆️
...rc/encoded_attribute_provider/document_provider.rs 8.73% <0.00%> (+8.73%) ⬆️
... and 6 more

... and 4 files with indirect coverage changes


@sampathrg sampathrg changed the title from "Integrating in-mem, inline, beta search into GH DiskANN" to "Support for inline-beta filtered search with expressions" Mar 19, 2026
@hildebrandmw (Contributor) left a comment

Thank you @sampathrg - this is a big step in the right direction. I think some upcoming changes I have will make the query nesting a little easier.

Outside of the comments I left in relevant places, one thing that really concerns me is the lack of test coverage associated with this PR.

Code within the diskann-label-provider crate should definitely have intentional tests. And while I understand that diskann-benchmark is relatively lightly tested, we at least have integration tests in main.rs that provide a smoke test against end-to-end functionality completely failing. Please include such a test. Also, if the filtering code is going behind a feature in diskann-benchmark like I requested, we'll have to add the crate features to the coverage pipeline. Since that change should have been present anyway, I support adding it in this PR.

pub struct FilteredQuery<V> {
query: V,
pub struct FilteredQuery<'a, V: ?Sized> {
query: &'a V,
Contributor

One potential gotcha of converting query to a reference is that we'll be unable to use this paged search. At least without rearchitecting paged search.

I have an in-progress change that should make it considerably easier to have the query be a proxy for the inner type (originally, this PR ran afoul of an incompatibility between Vec<T> and [T], which I believe it is now working around).

My larger point is: expect this to change again in the very near future.


Okay.

Contributor

Okay, now that we've made the API change, there's one thing we can do to clean up how this works a little. Instead of

pub struct FilteredQuery<V> {
    query: V,
    filter_expr: ASTExpr,
}

impl<V> FilteredQuery<V> {
    fn query<'a>(&'a self) -> V::Target
    where
        V: Reborrow<'a>,
    {
        self.query.reborrow()
    }
}

And instead of requiring &V for the inner trait bounds, we use <V as Reborrow<'a>>::Target (or just V::Target when the associated lifetime is unambiguous).

This does a couple things. First, it lets FilteredQuery have an owned query if needed and gets rid of the repeated lifetime bound.

Second, it will compose slightly better with providers that use non-slice types (e.g. multi-vectors).


I've made this change. This causes some changes to other type constraints. Earlier, the type would be constrained just by the inner type, e.g.:

IA: BuildQueryComputer<&'q Q>,
Q: ?Sized,

with this change it would be:

IA: BuildQueryComputer<Q::Target>,
Q: Reborrow<'q>,

The second might be harder to understand at first glance (unless you already know what FilteredQuery does). This might be okay, just calling it out. I was wondering whether adding a type alias within FilteredQuery, like type Target = V::Target, and using that in the constraint above would be better. The Reborrow constraint still needs to be used, so I guess it's fine the way it is now.
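As a rough, self-contained reconstruction of the Reborrow pattern discussed in this thread (the trait and impl shown here are illustrative assumptions based on the snippets above, not the crate's actual definitions):

```rust
// A "reborrow" trait: an owned value hands out a borrowed view of itself.
trait Reborrow<'a> {
    type Target;
    fn reborrow(&'a self) -> Self::Target;
}

// An owned Vec<T> reborrows as a slice, sidestepping the Vec<T> vs [T]
// incompatibility mentioned earlier in the thread.
impl<'a, T: 'a> Reborrow<'a> for Vec<T> {
    type Target = &'a [T];
    fn reborrow(&'a self) -> &'a [T] {
        self.as_slice()
    }
}

// FilteredQuery can now own its query (no repeated lifetime bound on the
// struct itself) while still handing out a borrowed Target on demand.
struct FilteredQuery<V> {
    query: V,
}

impl<V> FilteredQuery<V> {
    fn query<'a>(&'a self) -> V::Target
    where
        V: Reborrow<'a>,
    {
        self.query.reborrow()
    }
}

fn main() {
    let fq = FilteredQuery { query: vec![1.0f32, 2.0, 3.0] };
    let slice: &[f32] = fq.query(); // borrowed view of the owned vector
    assert_eq!(slice.len(), 3);
}
```

Because the lifetime lives on the trait rather than on FilteredQuery, callers that need an owned query and callers that only need a borrowed view can share one struct definition.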

.filter_expr
.encoded_filter_expr()
.accept(&pred_eval)
.expect("Expected predicate evaluation to not error out!")
Contributor

How robust is the assumption that predicate evaluation won't error out in practice? What preconditions are needed?

If it relies on an internal invariant holding independently of any user code, then maybe this is fine. But if user code has a way of triggering this panic, this will need to be reevaluated.


The only reason this would return an error is if the trait implementation of Set::contains returns an error, which gets bubbled up when evaluating an AND or an OR expression. All current implementations of this function via the macro impl_set_for_roaring return an Ok(_) value. There may be future implementations of the Set trait that decide to return an error, causing a panic here, but that seems unlikely to me.

Contributor

On one hand, I see your point. On the other, this is a point of contention. Because our current implementations always return Ok - does it make sense to instead change the trait to never return an error? That sidesteps the problem entirely.


That makes sense. I can probably do that as a separate PR since it's not related to this change.
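A minimal sketch of what the suggested trait change could look like, assuming a simplified Set trait (the real trait and the impl_set_for_roaring macro are more involved; names here are illustrative):

```rust
// Suggested shape: since every current implementation returns Ok(_),
// make `contains` infallible so call sites need no `.expect(...)`.
trait Set {
    // Before (hypothetical): fn contains(&self, id: u32) -> Result<bool, SomeError>;
    fn contains(&self, id: u32) -> bool;
}

// Toy implementation over a sorted id list, standing in for the
// roaring-bitmap-backed store.
struct SortedIdSet(Vec<u32>);

impl Set for SortedIdSet {
    fn contains(&self, id: u32) -> bool {
        self.0.binary_search(&id).is_ok()
    }
}

fn main() {
    let set = SortedIdSet(vec![1, 4, 9]);
    // No Result, no expect: predicate evaluation over AND/OR trees
    // composed of such lookups can no longer panic on this path.
    assert!(set.contains(4));
    assert!(!set.contains(5));
}
```

The design trade-off: an infallible trait removes the panic path entirely, at the cost of ruling out future fallible backends (which would then need a separate trait or an internal error-handling strategy).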

+ diskann::graph::SampleableForStart
+ diskann_utils::sampling::WithApproximateNorm
+ 'static,
for<'b> diskann_vector::distance::SquaredL2: PureDistanceFunction<&'b [T], &'b [T]>,
Contributor

How about other distances (cosine, normalized cosine, IP)? Do we plan to support them?


Those should work; this is just an example benchmark. I haven't tried running it with different distance functions.

Contributor

Those are available via VectorRepr - the L2 bound I believe is just for finding the index of the medoid.


if num_vectors != label_count {
return Err(anyhow::anyhow!(
"Mismatch: {} vectors but {} label documents",
Contributor

I am confused about vectors vs. documents. From my understanding, one document can be chunked into multiple vectors, and at the same time one vector can be mapped to multiple documents. So here "document" has the same definition as vector, is that correct?


Yes, here a document is a vector plus the attributes associated with that document.

.collect();

let build_time: MicroSeconds = timer.elapsed().into();
writeln!(output, " Index built in {} s", build_time.as_seconds())?;
Contributor

I am not sure whether we can collect memory usage in this step, but it would be better to track peak memory in the benchmark.


Okay. I can add that.

@sampathrg sampathrg requested a review from hildebrandmw March 23, 2026 13:44
/// that returns `true`.
/// * An empty array is treated as an absent field (preserving the previous behaviour).
/// * When all fields have been consumed, `eval_query_expr` is called on the accumulated object.
fn eval_map_recursive(
Contributor

I'm flagging this because I'm seeing a systemic issue that I think ties back to the use of serde_json::Value as the heavy-lift vessel for diskann-label-filter. There are several inter-related problems:

  1. This function takes a &[_] when taking an iterator would avoid allocation (this is related to the second point).
  2. The implementation of this function is recursive on the runtime length of the slice.
  3. The current map is cloned for every element in arrays, which increases in cost the deeper in the stack we get.

Ideally, we'd use something like a &mut HashMap<&str, &Value> for current and use iteration for this recursive map. However, eval_query_expr requires a &Value, so even if we used a HashMap<&str, &Value> or equivalent, we're still forced to materialize a full Map and all the corresponding allocations.

To me, this is indicating that the fundamental abstraction of diskann-label-filter is getting in the way of writing efficient code. This isn't necessarily a blocker for this PR, but should be prioritized for the long term viability of diskann-label-filter.

On a side note: if this functionality is needed here in ground_truth.rs, does that indicate it is more broadly useful/working around some issue with the semantics of eval_query_expr and if so, should that not be fixed instead?
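To illustrate the allocation concern, here is a simplified backtracking variant that mutates a single &mut HashMap and restores it after each recursive call, instead of cloning the accumulated map once per array element. This is a sketch under simplifying assumptions (string values stand in for serde_json::Value, and the empty-array-as-absent-field rule is omitted):

```rust
use std::collections::HashMap;

// Expand array-valued fields into every combination of scalar values,
// invoking `sink` once per fully-materialized attribute map.
// `current` is mutated in place and restored on the way out (backtracking),
// so no per-element clone of the map is needed.
fn eval_map_recursive<'v>(
    fields: &[(&'v str, Vec<&'v str>)],
    current: &mut HashMap<&'v str, &'v str>,
    sink: &mut dyn FnMut(&HashMap<&'v str, &'v str>),
) {
    match fields.split_first() {
        None => sink(current), // all fields consumed: evaluate the combination
        Some(((key, values), rest)) => {
            for value in values {
                current.insert(*key, *value); // choose one array element
                eval_map_recursive(rest, current, sink);
                current.remove(key); // backtrack instead of cloning
            }
        }
    }
}

fn main() {
    let fields = vec![
        ("color", vec!["red", "blue"]),
        ("size", vec!["small"]),
    ];
    let mut count = 0;
    eval_map_recursive(&fields, &mut HashMap::new(), &mut |_m| count += 1);
    assert_eq!(count, 2); // 2 colors x 1 size
}
```

As the comment notes, this only helps fully if the downstream evaluator can accept a borrowed map; if eval_query_expr insists on an owned serde_json::Value, a materialization step (and its allocations) is still forced at the leaves.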

i,
label.doc_id,
label.label
);
Contributor

There are lots of little debug prints here that make arbitrary cutoff decisions for when to stop. Are these development artifacts that could be removed?


Yes, some of these can be removed. I'll leave the tracing::info calls alone.



let filtered_query = FilteredQuery::new(query_vec, ast_expr.clone());

// Use a concrete IdDistance scratch buffer so that both the IDs and distances
// are captured. Afterwards, the valid IDs are forwarded into the framework buffer.
Contributor

Perhaps diskann-benchmark-core should be updated to capture distances as well. I think this can be done in a non-breaking way (not a blocker for this PR).


Okay, I can do it in a separate PR.

let query_vec = self.queries.row(index);
let (_, ref ast_expr) = self.predicates[index];
let strategy = InlineBetaStrategy::new(self.beta, common::FullPrecision);
let filtered_query = FilteredQuery::new(query_vec, ast_expr.clone());
Contributor

One theme I've been observing throughout diskann-label-filter is the design kind of inherently forces patterns like cloning the ast_expr for the query.

I'm not reviewing the benchmark code in too much detail, but I strongly encourage looking for patterns like forced clones in loops as opportunities for making the underlying implementation better.


I looked through the benchmark code. This clone right here seems like the only one that could cause performance issues. Removing it would mean holding a reference to the ast_expr instead of owning it.
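A minimal sketch of that suggestion, with a stand-in AstExpr type (the real FilteredQuery and ASTExpr differ): the query loop borrows one shared expression rather than cloning it per query.

```rust
// Stand-in for the real ASTExpr: cloning it per query is the cost we avoid.
#[derive(Debug)]
struct AstExpr(String);

// Borrowing variant: the filter expression is shared across queries,
// at the cost of tying the query's lifetime to the expression's.
struct FilteredQuery<'a, V> {
    query: V,
    filter_expr: &'a AstExpr,
}

fn main() {
    let ast = AstExpr("color == \"red\"".to_string());
    let queries = vec![vec![1.0f32], vec![2.0f32]];
    for q in &queries {
        // One cheap borrow per iteration instead of one deep clone.
        let fq = FilteredQuery { query: q.as_slice(), filter_expr: &ast };
        assert_eq!(fq.query.len(), 1);
        assert!(fq.filter_expr.0.contains("color"));
    }
}
```

Whether the borrow or the clone is preferable depends on how long the search holds the query relative to the expression's owner; the thread above notes the API is expected to change again soon.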
