
follow-up(inference): cluster-only guardrails, anthropic auth, v1/models, and streaming #37

@pimlock

Description


Inference Interception Follow-ups

Status: Draft
Date: 2026-02-23
Context: Follow-ups from MR !38 (feat(inference): inference interception and routing).

Confirmed Constraints

  1. Inference routing is currently supported only in cluster mode.
  2. Local sandbox mode (--policy-rules/--policy-data) is not expected to support inference routing at this time.

Gaps Observed

1) Missing cluster-only guardrail in local mode

Current behavior can accept CONNECT for inspect_for_inference and then fail later if inference runtime prerequisites are missing.

Desired behavior:

  • If sandbox is not running with cluster prerequisites (sandbox_id + gateway endpoint + TLS state), fail fast and clearly.
  • Do not emit 200 Connection Established for requests that cannot be serviced.
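A minimal sketch of the proxy-side check, assuming a hypothetical `ClusterContext` holding the prerequisites named above (`sandbox_id`, gateway endpoint, TLS state); the actual runtime types will differ:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterContext:
    # Hypothetical container for the cluster prerequisites listed above.
    sandbox_id: Optional[str] = None
    gateway_endpoint: Optional[str] = None
    tls_ready: bool = False


def connect_response(ctx: ClusterContext, wants_inference: bool) -> bytes:
    """Decide the CONNECT response before tunnelling any bytes."""
    if wants_inference and not (ctx.sandbox_id and ctx.gateway_endpoint and ctx.tls_ready):
        # Fail fast: never emit an optimistic 200 for a request we cannot service.
        return (b"HTTP/1.1 502 Bad Gateway\r\n"
                b"X-Inference-Error: cluster prerequisites missing\r\n\r\n")
    return b"HTTP/1.1 200 Connection Established\r\n\r\n"
```

The key property is that the 200 is only written after all prerequisites are verified, so clients see a deterministic error instead of a tunnel that dies mid-request.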

2) Incomplete Anthropic authentication handling

The current routing path is OpenAI-Bearer-centric and does not implement full Anthropic-compatible API-key handling.

Desired behavior:

  • Preserve credential isolation for Anthropic flows just like OpenAI flows.
  • Ensure route credentials (not sandbox/client credentials) are used for Anthropic upstream calls.
  • Support Anthropic header semantics end-to-end (x-api-key, anthropic-version, and related required headers).

3) api_patterns policy field is not wired

InferencePolicy.api_patterns exists in proto but interception currently uses built-in defaults only.

Desired behavior:

  • If policy defines api_patterns, use them.
  • If absent/empty, fall back to built-in defaults.

4) Missing GET /v1/models support

Current interception focuses on completion/messages-style write endpoints and does not provide route-aware model-listing behavior.

Desired behavior:

  • Support GET /v1/models for intercepted OpenAI-compatible flows.
  • Return a deterministic response strategy:
    • proxy to compatible backend route, or
    • synthesize a route-aware response when backend behavior is unsuitable.
  • Ensure policy and route compatibility filtering applies to model listing just like inference generation endpoints.

5) Missing streaming support

The current proxy model buffers request/response bodies in full, which is insufficient for streaming APIs.

Desired behavior:

  • Support streaming for OpenAI and Anthropic-compatible APIs.
  • Preserve chunk boundaries and event framing (text/event-stream / SSE semantics where applicable).
  • Propagate cancellation/connection-close behavior correctly across sandbox proxy, gRPC transport, and backend.

Next Steps

P0: Enforce cluster-only inference explicitly

  1. Add startup validation in sandbox runtime:
  • If inference routing is configured but cluster prerequisites are absent, fail sandbox startup with a clear error.
  2. Add proxy-side defensive handling:
  • For inspect_for_inference, verify prerequisites before returning 200 Connection Established.
  • Return a deterministic error response when inference is unavailable.
  3. Add tests:
  • Unit/integration test for local mode + inference policy => explicit failure.
  • Regression test ensuring no optimistic 200 is sent before prerequisite validation.
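The startup-validation step could look like the following sketch. The function name and parameters are illustrative, not the runtime's real API; the point is only that the check runs at startup, not on the first request:

```python
def validate_inference_config(cluster_mode: bool, inference_routes: list) -> None:
    """Fail sandbox startup when inference is configured outside cluster mode.

    Raising here surfaces the misconfiguration immediately, instead of
    letting the sandbox boot and then fail on the first intercepted request.
    """
    if inference_routes and not cluster_mode:
        raise ValueError(
            "inference routing requires cluster mode; "
            "local sandbox mode (--policy-rules/--policy-data) does not support it"
        )
```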

P0: Full Anthropic API key support

  1. Define protocol-aware auth rewrite behavior in router backend:
  • openai_*: Authorization: Bearer <route.api_key>.
  • anthropic_messages: set x-api-key: <route.api_key> and preserve/validate anthropic-version behavior.
  2. Strip inbound client credentials for Anthropic at interception boundary:
  • Remove authorization and x-api-key from forwarded headers.
  • Keep non-sensitive headers that are required for request compatibility.
  3. Add tests:
  • Router integration test that verifies x-api-key rewrite for Anthropic routes.
  • Regression test proving client-supplied Anthropic credentials are not forwarded upstream.
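The rewrite-and-strip steps above can be sketched as one function. The protocol names mirror the ones used in this issue (`openai_*`, `anthropic_messages`); the default `anthropic-version` value is a placeholder assumption, not a project decision:

```python
def rewrite_auth_headers(headers: dict, protocol: str, route_api_key: str) -> dict:
    """Strip client credentials, then inject the route credential per protocol."""
    # Drop inbound credentials case-insensitively so they never reach upstream.
    out = {k: v for k, v in headers.items()
           if k.lower() not in ("authorization", "x-api-key")}
    if protocol.startswith("openai_"):
        out["Authorization"] = f"Bearer {route_api_key}"
    elif protocol == "anthropic_messages":
        out["x-api-key"] = route_api_key
        # Anthropic requires anthropic-version; keep the client's value if
        # present, otherwise pin a default (placeholder value here).
        out.setdefault("anthropic-version", "2023-06-01")
    return out
```

Doing the strip unconditionally, before the protocol branch, is what guarantees the regression property above: client-supplied credentials can never leak upstream even for an unrecognized protocol.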

P1: Add GET /v1/models support

  1. Extend inference API pattern matching to classify models-list requests.
  2. Implement route-aware handling in gateway/router for model-list requests.
  3. Add tests:
  • e2e test for intercepted GET /v1/models.
  • compatibility test across multiple allowed routes.
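A sketch of the classification plus the synthesized-response fallback. The `owned_by` value and response shape follow the OpenAI-compatible list format; the helper names are hypothetical:

```python
import re

MODELS_LIST = re.compile(r"^/v1/models/?$")


def classify(method: str, path: str) -> str:
    """Classify a request as a models-list call or something else."""
    if method == "GET" and MODELS_LIST.match(path):
        return "models_list"
    return "other"


def synthesize_models_response(allowed_models):
    """Build a route-aware OpenAI-compatible /v1/models body.

    allowed_models is assumed to already be filtered by policy and
    route compatibility, matching the requirement above.
    """
    return {
        "object": "list",
        "data": [{"id": m, "object": "model", "owned_by": "route"}
                 for m in allowed_models],
    }
```

Synthesizing from the route configuration keeps the listing deterministic even when the backend's own /v1/models would expose models the policy does not allow.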

P1: Add streaming support

  1. Define streaming transport contract across proxy <-> gateway (gRPC streaming or framed chunk transport).
  2. Implement protocol-aware streaming passthrough in router backend.
  3. Ensure cancellation propagation and timeout behavior are explicit.
  4. Add tests:
  • e2e streaming chat completion test.
  • disconnection/cancellation regression test.
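The chunk-boundary requirement from gap 5 can be illustrated with a small re-framing generator: transport chunks may split an SSE event anywhere, so the passthrough must buffer partial data and emit only complete events delimited by the blank line:

```python
def iter_sse_events(chunks):
    """Re-frame an arbitrary byte-chunk stream into complete SSE events.

    Buffers partial data and yields only on the blank-line event
    delimiter, so transport chunk boundaries never split an event.
    """
    buf = b""
    for chunk in chunks:
        buf += chunk
        while b"\n\n" in buf:
            event, buf = buf.split(b"\n\n", 1)
            yield event + b"\n\n"
```

Because it is a generator, closing it (e.g. when the client disconnects) stops pulling from the upstream iterator, which is one simple way to model the cancellation propagation called for above.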

P1: Wire policy-driven API pattern configuration

  1. Map sandbox.policy.inference.api_patterns into sandbox inference interception context.
  2. Add validation for malformed patterns.
  3. Add tests for:
  • custom pattern match,
  • default fallback behavior,
  • invalid pattern rejection.
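The three behaviors above (custom match, default fallback, invalid-pattern rejection) fit in one loader. The default patterns below are illustrative stand-ins for the built-in defaults, not the actual values:

```python
import re

# Illustrative stand-ins for the built-in default patterns.
DEFAULT_PATTERNS = [r"^/v1/chat/completions$", r"^/v1/messages$"]


def compile_patterns(policy_patterns):
    """Use policy-supplied api_patterns when present, else built-in defaults.

    Malformed regexes are rejected at load time with a clear error,
    rather than surfacing as per-request match failures.
    """
    raw = policy_patterns or DEFAULT_PATTERNS
    compiled = []
    for p in raw:
        try:
            compiled.append(re.compile(p))
        except re.error as exc:
            raise ValueError(f"invalid api_pattern {p!r}: {exc}") from exc
    return compiled
```

Note that `policy_patterns or DEFAULT_PATTERNS` treats both `None` and an empty list as "absent", matching the absent/empty fallback rule in gap 3.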

Definition of Done

  • Local mode with inference configuration fails fast with a clear, actionable error.
  • Anthropic requests route successfully using route-managed credentials only.
  • GET /v1/models works through interception with policy/route-aware behavior.
  • Streaming inference requests are supported end-to-end (including cancellation).
  • api_patterns works when configured and defaults remain backward compatible.
  • Unit/integration coverage added for each gap above.

Originally by @pimlock on 2026-02-22T22:10:54.126-08:00
