Replace REST and RabbitMQ with gRPC #176
benthecarman wants to merge 17 commits into lightningdevkit:main from
Conversation
👋 Thanks for assigning @joostjager as a reviewer!
Force-pushed from 0fc1127 to f8a3565
@@ -0,0 +1,265 @@
// This file is Copyright its original authors, visible in version control
What's missing vs. the gRPC spec

Critical for external client compatibility

1. Trailers-Only encoding may be wrong - For error responses with no body (`GrpcBody::Error`), the gRPC spec says to use "Trailers-Only" mode: a single HEADERS frame containing both the HTTP headers (`:status`, `content-type`) and the gRPC trailers (`grpc-status`, `grpc-message`). The current implementation sends the HTTP response headers first, then `grpc-status` as a separate trailers frame. Some client libraries (especially grpc-go and grpc-java) may not find `grpc-status` where they expect it, or may interpret the empty-body-with-trailers differently. This affects every error path.
2. Content-type check is too loose - `starts_with("application/grpc")` (line 220) accepts `application/grpcfoo` or `application/grpc+json`. The spec requires exactly `application/grpc`, `application/grpc+proto`, or `application/grpc+<codec>` for a codec you support. A client sending `application/grpc+json` would be accepted but get protobuf back. Should reject anything other than `application/grpc` and `application/grpc+proto`.
3. No `grpc-timeout` / deadline support - External clients routinely set `grpc-timeout` (e.g., `grpc-timeout: 5S`). This server ignores it entirely, so a client expecting DEADLINE_EXCEEDED after 5 seconds will instead hang until the handler finishes or the connection drops.
4. No `grpc-accept-encoding` advertisement - The server doesn't send `grpc-accept-encoding: identity` in responses. External clients that default to gzip compression will try it, get UNIMPLEMENTED, then retry without compression on every single call (doubling latency on the first request per stream). Advertising identity upfront avoids this.
5. No server reflection - Tools like grpcurl, Postman, and Kreya that external developers use for testing/debugging rely on the `grpc.reflection.v1` service to discover the API. Without it, external developers need the `.proto` files out-of-band, which is a significant friction point.
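The content-type and deadline gaps called out above are mechanical to fix. A minimal sketch under assumptions (function names here are illustrative, not the PR's actual code) of a strict content-type check and a `grpc-timeout` parser following the gRPC-over-HTTP/2 spec's format of 1-8 ASCII digits plus a unit suffix:

```rust
use std::time::Duration;

// Illustrative sketch only; these names do not exist in the PR.

// Strict content-type check: accept only the exact types this server can decode,
// rejecting e.g. "application/grpc+json" or "application/grpcfoo".
fn is_supported_grpc_content_type(ct: &str) -> bool {
    matches!(ct, "application/grpc" | "application/grpc+proto")
}

// Parse a grpc-timeout header value: 1-8 ASCII digits followed by a unit
// (H = hours, M = minutes, S = seconds, m = millis, u = micros, n = nanos).
fn parse_grpc_timeout(value: &str) -> Option<Duration> {
    if value.len() < 2 || value.len() > 9 {
        return None; // spec allows at most 8 digits plus the 1-char unit
    }
    let (digits, unit) = value.split_at(value.len() - 1);
    if !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None; // reject signs, whitespace, etc.
    }
    let n: u64 = digits.parse().ok()?;
    match unit {
        "H" => Some(Duration::from_secs(n.checked_mul(3600)?)),
        "M" => Some(Duration::from_secs(n.checked_mul(60)?)),
        "S" => Some(Duration::from_secs(n)),
        "m" => Some(Duration::from_millis(n)),
        "u" => Some(Duration::from_micros(n)),
        "n" => Some(Duration::from_nanos(n)),
        _ => None,
    }
}
```

A handler could then wrap its work in `tokio::time::timeout` with the parsed duration and map the elapsed case to DEADLINE_EXCEEDED (grpc-status 4).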
Functional issues (affect all clients)

6. Stream errors always appear as OK - `GrpcBody::Stream` sends OK trailers when the mpsc channel closes (lines 191-193), regardless of why it closed. If the server encounters an error mid-stream (or the broadcast sender is dropped), the client sees `grpc-status: 0`. There's no mechanism to send error trailers on a stream.
7. Broadcast lag silently terminates streams - The broadcast->mpsc bridge (lines 307-313) doesn't handle `RecvError::Lagged`. If a slow client falls behind the 1024-message broadcast buffer, `rx.recv()` returns `Err(Lagged)`, which breaks the loop and closes the stream with OK status. The client has no idea events were lost.
8. No graceful shutdown for streams - On SIGTERM, active `SubscribeEvents` streams aren't drained or notified. The TCP connection just drops, so external clients see a transport error rather than a clean gRPC status.
9. Trailing data after a gRPC frame is silently ignored - `decode_grpc_body` reads only one frame. If a client sends multiple gRPC frames in a single request body, the extra data is discarded silently. (Only matters if a client reuses a connection oddly, since these are all unary RPCs.)
10. No request cancellation detection - When a client cancels (RST_STREAM), the handler keeps running to completion. Handlers that do expensive work (e.g., `ExportPathfindingScores`) waste resources on cancelled requests.
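The lag handling called out above amounts to treating `Lagged` as a distinct terminal outcome instead of letting it fall through as a clean close. A self-contained sketch of just the decision logic, under assumptions: the enums here merely mirror the shape of tokio's `broadcast::error::RecvError`, and all names are illustrative rather than the PR's actual types.

```rust
// Illustrative mirror of tokio's broadcast receive outcomes.
#[derive(Debug, PartialEq)]
enum RecvOutcome {
    Message,     // a message was received; keep streaming
    Lagged(u64), // receiver fell behind; n events were dropped
    Closed,      // all senders dropped; clean end of stream
}

// What the stream should do next, including the trailers to send.
#[derive(Debug, PartialEq)]
enum StreamEnd {
    Continue,
    CloseOk,                 // grpc-status: 0
    CloseDataLoss(String),   // grpc-status: 15 (DATA_LOSS) with a message
}

fn on_recv(outcome: RecvOutcome) -> StreamEnd {
    match outcome {
        RecvOutcome::Message => StreamEnd::Continue,
        // Clean shutdown: OK trailers are appropriate here.
        RecvOutcome::Closed => StreamEnd::CloseOk,
        // A slow client fell behind the broadcast buffer: surface the loss
        // instead of silently closing the stream with OK.
        RecvOutcome::Lagged(n) => {
            StreamEnd::CloseDataLoss(format!("{} events dropped", n))
        }
    }
}
```

In the real bridge, the `CloseDataLoss` arm would push error trailers through the mpsc channel before breaking the loop, which also requires the error-trailer mechanism that item 6 notes is missing.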
Nice-to-have for production with external clients

11. No health checking protocol - No `grpc.health.v1.Health` service. External infrastructure (load balancers, Kubernetes, service meshes) uses this standard protocol for liveness/readiness probes.
12. No binary metadata support - The gRPC spec says headers ending in `-bin` are base64-encoded binary. External clients/interceptors may send binary metadata (e.g., tracing context in `grpc-trace-bin`), which this server would misinterpret as text.
13. No compression support - Correctly returns UNIMPLEMENTED per spec, which is fine. But combined with item 4 (no `grpc-accept-encoding`), external clients pay a round-trip penalty discovering this.
Verdict
For the controlled ldk-server-client, most of this works because both sides agree on the wire format quirks. For external clients, items #1-#5 are the real blockers: #1 can cause error responses to be misinterpreted, #2 can silently accept incompatible codecs, #3 breaks client-side timeout contracts, #4 causes unnecessary round-trips, and #5 makes the API hard to discover. Items #6-#10 are correctness issues that affect everyone. Items #11-#13 are quality-of-life gaps for production deployments.
Did most of these besides:
- reflection: tonic doesn't even support this out of the box
- health check: can just use get-node-info
- compression: added the flag so people won't try, and not really needed
Force-pushed from a9e3448 to e554203
🔔 1st Reminder: Hey @tnull! This PR has been waiting for your review.
tnull left a comment
Started a first pass, so far reviewed the first 4 commits.
In general it will be hard to catch all the details, so the strategy likely is to drastically improve test coverage where we see the opportunity and go from there.
use std::task::{Context, Poll};

use base64::Engine;
use bytes::{BufMut, Bytes, BytesMut};
I think dropping `bytes` and replacing `base64` with our own (e.g., @tankyleo's) might be worth exploring in a follow-up.
Ok(tls_stream) => {
    let io_stream = TokioIo::new(tls_stream);
-   if let Err(err) = http1::Builder::new().serve_connection(io_stream, node_service).await {
+   if let Err(err) = http2::Builder::new(TokioExecutor::new()).serve_connection(io_stream, node_service).await {
Seems this is not creating another runtime, but maybe we should be more intentional about this and reuse a specific new runtime for gRPC requests, to avoid any weird tokio behavior with LDK Node's runtime? I.e., do we want to implement Executor for a new GrpcRuntime object and then (re-)use it always? Or should we punt on this and rethink runtime handling in a follow-up (in that case, open an issue)?
Yeah, probably better as a follow-up.
paginated_store: Arc<dyn PaginatedKVStore>,
) {
    if let Some(payment_details) = event_node.payment(payment_id) {
        let payment = payment_to_proto(payment_details);
Hmm, pre-existing, but should we really query the payment store each time? Can't we use the data from the LDK Node events for this?
We only have this because ldk-server is currently duplicating the payment store for the pagination. Once we move over in ldk-node we can remove this.
Define the LightningNode service with all 38 unary RPCs and a server-streaming SubscribeEvents RPC. Add SubscribeEventsRequest message and events.proto import. Update response comments from REST/HTTP style to gRPC style. prost-build ignores service blocks, so this is a documentation- only change that prepares for the gRPC migration without altering any generated code or runtime behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Take &Context instead of Context by value in all 37 handler functions. The handlers only read from context.node and context.paginated_kv_store (both Arc), so borrowing avoids unnecessary Arc clones per request and prepares for the gRPC service rewrite where Context lives in the service struct. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implement gRPC wire protocol primitives (encode/decode frames, status codes, GrpcBody, trailers) directly on hyper's HTTP/2 types without tonic. No callers yet — this is the building block for the server gRPC conversion in the next commit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
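For reference, the wire format this commit implements is small: each gRPC message on the wire is a 1-byte compressed-flag, a 4-byte big-endian length, and then the payload. A standalone sketch of that framing, under assumptions (these function names are illustrative, not the PR's actual API):

```rust
// Encode one message into the gRPC 5-byte length-prefixed frame format.
fn encode_grpc_frame(msg: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(5 + msg.len());
    out.push(0); // compressed-flag: 0 = uncompressed
    out.extend_from_slice(&(msg.len() as u32).to_be_bytes());
    out.extend_from_slice(msg);
    out
}

// Decode one frame, returning (payload, remaining bytes). Returns None if the
// input is too short, the length prefix overruns the buffer, or the message is
// compressed (compression is unsupported per the review discussion).
fn decode_grpc_frame(buf: &[u8]) -> Option<(&[u8], &[u8])> {
    if buf.len() < 5 || buf[0] != 0 {
        return None;
    }
    let len = u32::from_be_bytes([buf[1], buf[2], buf[3], buf[4]]) as usize;
    let rest = &buf[5..];
    if rest.len() < len {
        return None;
    }
    Some((&rest[..len], &rest[len..]))
}
```

Returning the remaining bytes from the decoder is what makes it possible to detect (rather than silently discard) trailing data after a frame, as the review's item on `decode_grpc_body` suggests.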
Switch from HTTP/1.1 protobuf-over-REST to gRPC over HTTP/2, implemented directly on hyper without tonic. The gRPC framing module handles encode/decode of the 5-byte length-prefixed wire format, status codes, and HTTP/2 trailers. Key changes:
- hyper http1 → http2, add TokioExecutor for HTTP/2 connections
- TLS ALPN set to h2 for HTTP/2 negotiation
- HMAC auth simplified to timestamp-only (no body) since TLS guarantees integrity
- Events delivered via tokio broadcast channel, replacing the RabbitMQ EventPublisher infrastructure
- Config renamed rest_service_addr → grpc_service_addr
- Removed lapin, async-trait, events-rabbitmq feature
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from e554203 to 309b589
rebased and responded to @tnull's comments and added a bunch more tests (and prop tests) around some of the grpc stuff
Switch from protobuf-over-REST to gRPC wire format. Requests use
gRPC framing (5-byte length-prefix) with application/grpc+proto
content-type, routed to /api.LightningNode/{method}. HMAC auth
simplified to timestamp-only signing to match the server change.
Error handling uses Trailers-Only mode: when the server returns a
gRPC error without a body, grpc-status appears as a regular HTTP/2
header that reqwest can read.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implement the SubscribeEvents RPC end-to-end: a Stream variant in GrpcBody that delivers multiple gRPC-framed messages from an mpsc channel, a handler in service.rs that bridges the broadcast channel to the streaming response, a client EventStream type that reads gRPC frames incrementally from the HTTP/2 body, and e2e test assertions that verify payment events arrive via the stream. Also regenerates proto code to include SubscribeEventsRequest. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
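The incremental client-side frame reading this commit describes boils down to buffering body chunks until a full 5-byte-prefixed frame is available. A minimal standalone sketch, under assumptions: this is not the PR's actual `EventStream` type, and the compressed-flag byte is ignored here for brevity.

```rust
// Illustrative incremental reader for gRPC frames arriving in arbitrary chunks.
struct FrameReader {
    buf: Vec<u8>,
}

impl FrameReader {
    fn new() -> Self {
        Self { buf: Vec::new() }
    }

    // Append a chunk of HTTP/2 body data as it arrives.
    fn push(&mut self, chunk: &[u8]) {
        self.buf.extend_from_slice(chunk);
    }

    // Pop the next complete frame's payload, if one is fully buffered.
    // Partial frames stay in the buffer until more data arrives.
    fn next_frame(&mut self) -> Option<Vec<u8>> {
        if self.buf.len() < 5 {
            return None; // length prefix not yet complete
        }
        let len = u32::from_be_bytes([self.buf[1], self.buf[2], self.buf[3], self.buf[4]]) as usize;
        if self.buf.len() < 5 + len {
            return None; // payload not yet complete
        }
        let payload = self.buf[5..5 + len].to_vec();
        self.buf.drain(..5 + len);
        Some(payload)
    }
}
```

The key property for a long-lived event stream is that `next_frame` is called in a loop after every chunk, since one HTTP/2 DATA frame may carry zero, one, or several complete gRPC messages.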
Rename rest_service_address to grpc_service_address in CLI config and e2e test harness. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change all handler functions from sync to async, taking Arc<Context> by value instead of &Context by reference. This enables future cancellation support and allows handlers to perform async operations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a reference to the gRPC HTTP/2 protocol spec for easier comparison and future reference when working on the wire protocol implementation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces scattered string literals like "grpc-status", "application/grpc+proto", etc. with named constants to avoid magic strings and make the wire protocol easier to audit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use and_then/map_err to combine the gRPC frame decoding and protobuf deserialization into a single Result chain instead of two separate match blocks with early returns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a test with a pre-computed HMAC value (cross-checked against Python's hmac module) so the auth logic is verified against a known fixture, not just round-tripped. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds exhaustive and varied-input tests for encode/decode roundtrip (13 payload sizes including boundary values), compression flag rejection (all 255 non-zero values), short input rejection, percent-encoding (full ASCII range + multi-byte UTF-8), and timeout parsing (zero values, large values, invalid formats). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test validate_grpc_request (method and content-type checks), grpc_error_response/grpc_response structure, GrpcBody poll sequences (Unary and Empty), all ldk_error_to_grpc_status variants, and auth edge cases (timestamp boundaries, future timestamps, malformed HMACs, empty API key). Also make validate_grpc_request generic over body type since it only inspects headers — this enabled testing without needing to construct hyper::body::Incoming. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from 309b589 to ef97b98
The reason we chose protobuf-over-rest earlier was due to web support; I don't see that being covered or discussed here. What are your thoughts on that? And multiple reasons for protobuf over json:
My view on this is that some kind of http interface is probably needed anyway, for web support or other connections where grpc doesn't work. In an AI-first world, that is sufficient: the AIs will sort it out. Adding (DIY) grpc on top of that seems mostly a theoretical advantage. Similarly, I don't think protobuf-over-rest is necessary anymore. The LLM will do json just fine too.
Implement gRPC directly on hyper's HTTP/2 support, without tonic, to keep the dependency tree small. This consolidates two separate protocols (protobuf-over-REST for RPCs, RabbitMQ for events) into a single gRPC interface.
#175 proved the migration works with tonic. We opted for a from-scratch approach instead, informed by the plan at: https://gist.github.com/tnull/1f6ffc5ae71c418f28844e27eee8c62b
Events are now delivered via a server-streaming SubscribeEvents RPC backed by a tokio broadcast channel, eliminating the RabbitMQ operational dependency. HMAC auth is simplified to timestamp-only signing since TLS guarantees body integrity.
Had claude break it into several commits that should make it easy to review.
Also tested with the CLI made in #175 and it works, so it seems our implementation is compliant.