Inference Interception Follow-ups
Status: Draft
Date: 2026-02-23
Context: Follow-ups from MR !38 (feat(inference): inference interception and routing).
Confirmed Constraints
- Inference routing is currently supported only in cluster mode.
- Local sandbox mode (`--policy-rules`/`--policy-data`) is not expected to support inference routing at this time.
Gaps Observed
1) Missing cluster-only guardrail in local mode
Current behavior can accept CONNECT for `inspect_for_inference` and then fail later if inference runtime prerequisites are missing.
Desired behavior:
- If the sandbox is not running with cluster prerequisites (`sandbox_id` + gateway endpoint + TLS state), fail fast and clearly.
- Do not emit `200 Connection Established` for requests that cannot be serviced.
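The fail-fast check could look like the following sketch (type and field names such as `ClusterContext` are illustrative assumptions, not the actual runtime API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClusterContext:
    """Hypothetical container for cluster prerequisites (names are illustrative)."""
    sandbox_id: Optional[str]
    gateway_endpoint: Optional[str]
    tls_ready: bool

def check_inference_prerequisites(ctx: ClusterContext) -> list[str]:
    """Return the list of missing prerequisites; empty means inference can be served."""
    missing = []
    if not ctx.sandbox_id:
        missing.append("sandbox_id")
    if not ctx.gateway_endpoint:
        missing.append("gateway endpoint")
    if not ctx.tls_ready:
        missing.append("TLS state")
    return missing

def handle_connect(ctx: ClusterContext) -> str:
    """Decide the CONNECT response before optimistically accepting the tunnel."""
    missing = check_inference_prerequisites(ctx)
    if missing:
        # Fail fast: never emit 200 Connection Established for a request
        # that cannot be serviced.
        return "HTTP/1.1 503 Service Unavailable (missing: %s)" % ", ".join(missing)
    return "HTTP/1.1 200 Connection Established"
```

The key property is that the prerequisite check runs before any response bytes are written, so the client sees a deterministic error instead of a tunnel that fails mid-request.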
2) Incomplete Anthropic authentication handling
Current routing path is OpenAI-Bearer-centric and does not provide full Anthropic-compatible API key behavior.
Desired behavior:
- Preserve credential isolation for Anthropic flows just like OpenAI flows.
- Ensure route credentials (not sandbox/client credentials) are used for Anthropic upstream calls.
- Support Anthropic header semantics end-to-end (`x-api-key`, `anthropic-version`, and related required headers).
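A minimal sketch of the credential-isolation rewrite for Anthropic flows (the function name, the stripped-header set, and the default version value are assumptions):

```python
# Inbound client credentials must never reach the upstream provider.
SENSITIVE_HEADERS = {"authorization", "x-api-key"}

def rewrite_anthropic_headers(inbound: dict[str, str], route_api_key: str) -> dict[str, str]:
    """Strip client credentials and inject the route-managed Anthropic credential.

    Header names are compared case-insensitively; `route_api_key` is the
    credential owned by the route, never the sandbox/client credential.
    """
    out = {k: v for k, v in inbound.items() if k.lower() not in SENSITIVE_HEADERS}
    out["x-api-key"] = route_api_key
    # Anthropic requires a version header; keep the client's value when
    # present, otherwise pin a default (the value here is illustrative).
    out.setdefault("anthropic-version", "2023-06-01")
    return out
```

This mirrors the OpenAI `Authorization: Bearer` rewrite but uses Anthropic's `x-api-key` convention.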
3) api_patterns policy field is not wired
`InferencePolicy.api_patterns` exists in proto, but interception currently uses built-in defaults only.
Desired behavior:
- If the policy defines `api_patterns`, use them.
- If absent/empty, fall back to built-in defaults.
4) Missing GET /v1/models support
Current interception focuses on completion/messages-style write endpoints and does not provide a route-aware models listing behavior.
Desired behavior:
- Support `GET /v1/models` for intercepted OpenAI-compatible flows.
- Return a deterministic response strategy:
  - proxy to a compatible backend route, or
  - synthesize a route-aware response when backend behavior is unsuitable.
- Ensure policy and route compatibility filtering applies to model listing just like inference generation endpoints.
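For the synthesize branch, a route-aware response could be assembled from the filtered route set, as in this sketch (the `routes` shape and `owned_by` value are assumptions; the envelope follows the OpenAI models-list schema):

```python
import time

def synthesize_models_response(routes: list[dict]) -> dict:
    """Build an OpenAI-compatible GET /v1/models body from allowed routes.

    `routes` is a hypothetical list of {"model": ..., "allowed": ...} entries
    produced by policy/route compatibility filtering upstream of this call.
    """
    now = int(time.time())
    return {
        "object": "list",
        "data": [
            {"id": r["model"], "object": "model", "created": now, "owned_by": "router"}
            for r in routes
            if r.get("allowed", True)
        ],
    }
```

Synthesizing keeps the listing consistent with what the sandbox is actually allowed to call, even when the backend would report a broader model set.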
5) Missing streaming support
Current proxying model is request/response body buffering, which is insufficient for streaming APIs.
Desired behavior:
- Support streaming for OpenAI and Anthropic-compatible APIs.
- Preserve chunk boundaries and event framing (`text/event-stream` / SSE semantics where applicable).
- Propagate cancellation/connection-close behavior correctly across sandbox proxy, gRPC transport, and backend.
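The framing-preservation requirement can be sketched as an SSE re-framer (names are illustrative; a real implementation would sit on the gRPC transport and handle cancellation):

```python
from typing import Iterable, Iterator

def relay_sse(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Re-frame an arbitrary byte stream into complete SSE events.

    SSE events are delimited by a blank line (b"\\n\\n"). Partial events are
    buffered across chunk boundaries so each yielded value is one complete
    event, regardless of how the transport split the bytes.
    """
    buf = b""
    for chunk in chunks:
        buf += chunk
        while b"\n\n" in buf:
            event, buf = buf.split(b"\n\n", 1)
            yield event + b"\n\n"
    if buf:
        # Trailing partial data on connection close; forward as-is so the
        # client observes the same truncation the backend produced.
        yield buf
```

Because the generator yields as soon as an event completes, the proxy never buffers the whole response body, which is the property the current request/response model lacks.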
Next Steps
P0: Enforce cluster-only inference explicitly
- Add startup validation in sandbox runtime:
- If inference routing is configured but cluster prerequisites are absent, fail sandbox startup with a clear error.
- Add proxy-side defensive handling:
- For `inspect_for_inference`, verify prerequisites before returning `200 Connection Established`.
- Return a deterministic error response when inference is unavailable.
- Add tests:
- Unit/integration test for local mode + inference policy => explicit failure.
- Regression test ensuring no optimistic `200` is sent before prerequisite validation.
P0: Full Anthropic API key support
- Define protocol-aware auth rewrite behavior in the router backend:
  - `openai_*`: set `Authorization: Bearer <route.api_key>`.
  - `anthropic_messages`: set `x-api-key: <route.api_key>` and preserve/validate `anthropic-version` behavior.
- Strip inbound client credentials for Anthropic at interception boundary:
- Remove `authorization` and `x-api-key` from forwarded headers.
- Keep non-sensitive headers that are required for request compatibility.
- Add tests:
- Router integration test that verifies the `x-api-key` rewrite for Anthropic routes.
- Regression test proving client-supplied Anthropic credentials are not forwarded upstream.
P1: Add GET /v1/models support
- Extend inference API pattern matching to classify models-list requests.
- Implement route-aware handling in gateway/router for model-list requests.
- Add tests:
- e2e test for intercepted `GET /v1/models`.
- Compatibility test across multiple allowed routes.
P1: Add streaming support
- Define streaming transport contract across proxy <-> gateway (gRPC streaming or framed chunk transport).
- Implement protocol-aware streaming passthrough in router backend.
- Ensure cancellation propagation and timeout behavior are explicit.
- Add tests:
- e2e streaming chat completion test.
- disconnection/cancellation regression test.
P1: Wire policy-driven API pattern configuration
- Map `sandbox.policy.inference.api_patterns` into the sandbox inference interception context.
- Add validation for malformed patterns.
- Add tests for:
- custom pattern match,
- default fallback behavior,
- invalid pattern rejection.
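Invalid-pattern rejection could be a simple compile-time check at policy load, as in this sketch (function name is an assumption; it presumes patterns are regular expressions):

```python
import re

def validate_api_patterns(patterns: list[str]) -> list[str]:
    """Return human-readable errors for malformed patterns; empty list means valid.

    Running this at policy load time surfaces configuration mistakes
    immediately rather than at first interception.
    """
    errors = []
    for p in patterns:
        try:
            re.compile(p)
        except re.error as exc:
            errors.append(f"invalid pattern {p!r}: {exc}")
    return errors
```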
Definition of Done
- Local mode with inference configuration fails fast with a clear, actionable error.
- Anthropic requests route successfully using route-managed credentials only.
- `GET /v1/models` works through interception with policy/route-aware behavior.
- Streaming inference requests are supported end-to-end (including cancellation).
- `api_patterns` works when configured and defaults remain backward compatible.
- Unit/integration coverage added for each gap above.
Originally by @pimlock on 2026-02-22T22:10:54.126-08:00