Conversation
Pull request overview
This PR hardens Orleans SignalR connection/group cleanup behavior, especially for batch group operations that complete after a client disconnect and for grace-period recovery where Orleans-owned state must remain scheduler-confined.
Changes:
- Keep grace-period state transitions on the Orleans grain turn while offloading buffered observer replay I/O off-scheduler.
- Add compensating cleanup when batch group adds finish after a disconnect; skip heartbeats for disconnected connections.
- Make `Subscription` grain tracking concurrency-safe; update feature docs and agent guidance.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| ManagedCode.Orleans.SignalR.Server/SignalRObserverGrainBase.cs | Ensures grace-period state changes stay on the grain scheduler while replay is offloaded. |
| ManagedCode.Orleans.SignalR.Core/SignalR/OrleansHubLifetimeManager.cs | Adds disconnect-aware batch group cleanup and avoids heartbeat work for disconnected connections. |
| ManagedCode.Orleans.SignalR.Core/SignalR/Observers/Subscription.cs | Switches to concurrent collections to support overlapping batch mutations and disconnect cleanup. |
| docs/Features/Observer-Health-and-Circuit-Breaker.md | Documents scheduler vs. off-scheduler responsibilities for grace-period recovery. |
| docs/Features/Hub-Lifetime-Manager-Integration.md | Documents compensating cleanup when late batch joins complete after disconnect. |
| docs/Features/Group-Partitioning.md | Documents late-join cleanup behavior for partitioned batching. |
| AGENTS.md | Adds guidance emphasizing simple, ordering/idempotency-based fixes and avoiding locks in Orleans paths. |
```csharp
if (tasks.Length > 0)
{
    await Task.WhenAll(tasks);
}
```
In the non-partitioned path, subscription.AddGrain(groupGrain) now happens after AddConnection completes. If Task.WhenAll(tasks) throws (e.g., one group add fails while others succeed), the subscription will never track the successfully-added grains, so OnDisconnectedAsync won't remove those memberships and the connection can leak in group state. Consider adding grains to the subscription before issuing the adds, or wrapping WhenAll in try/catch and running a compensating RemoveConnection for any/all groupGrains before returning/throwing.
Suggested change:

```diff
-await Task.WhenAll(tasks);
+try
+{
+    await Task.WhenAll(tasks);
+}
+catch
+{
+    var cleanupTasks = groupGrains
+        .Select(groupGrain => Task.Run(() => groupGrain.RemoveConnection(connectionId, subscriptionReference), cancellationToken))
+        .ToArray();
+    if (cleanupTasks.Length > 0)
+    {
+        await Task.WhenAll(cleanupTasks);
+    }
+    throw;
+}
```
```csharp
if (IsConnectionDisconnected(connectionId))
{
    var cleanupTasks = groupGrains
        .Select(groupGrain => Task.Run(() => groupGrain.RemoveConnection(connectionId, subscriptionReference), cancellationToken))
        .ToArray();

    if (cleanupTasks.Length > 0)
    {
        await Task.WhenAll(cleanupTasks);
    }

    return;
}

foreach (var groupGrain in groupGrains)
{
    subscription.AddGrain(groupGrain);
}

if (IsConnectionDisconnected(connectionId))
{
    var cleanupTasks = groupGrains
        .Select(groupGrain => Task.Run(() => groupGrain.RemoveConnection(connectionId, subscriptionReference), cancellationToken))
        .ToArray();

    if (cleanupTasks.Length > 0)
    {
        await Task.WhenAll(cleanupTasks);
    }

    return;
}
```
The disconnect-cleanup logic for the non-partitioned path is duplicated (two IsConnectionDisconnected checks that each build and WhenAll the same RemoveConnection task set). This makes the flow harder to reason about and increases the chance of divergence in future edits. Consider extracting a small helper (e.g., CleanupDisconnectedGroupMembershipAsync) and calling it from both sites, or restructuring to a single post-add check with a single cleanup path.
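A minimal sketch of such a helper, assuming the member names visible in the diff (`IsConnectionDisconnected`, `IObserverConnectionManager.RemoveConnection`, `subscriptionReference`); the exact signature, including the `ISignalRObserver` parameter type, is hypothetical:

```csharp
// Hypothetical helper consolidating the duplicated disconnect-cleanup path.
// Returns true when the connection was already gone and cleanup ran, so the
// caller can bail out of the join flow.
private async Task<bool> TryCleanupDisconnectedGroupMembershipAsync(
    string connectionId,
    IReadOnlyCollection<IObserverConnectionManager> groupGrains,
    ISignalRObserver subscriptionReference,
    CancellationToken cancellationToken)
{
    if (!IsConnectionDisconnected(connectionId))
    {
        return false;
    }

    var cleanupTasks = groupGrains
        .Select(groupGrain => Task.Run(
            () => groupGrain.RemoveConnection(connectionId, subscriptionReference),
            cancellationToken))
        .ToArray();

    if (cleanupTasks.Length > 0)
    {
        await Task.WhenAll(cleanupTasks);
    }

    return true;
}
```

Both call sites then collapse to `if (await TryCleanupDisconnectedGroupMembershipAsync(...)) return;`, so there is a single cleanup path to keep correct.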
```diff
 public int PartitionId { get; private set; }

-public IReadOnlyCollection<IObserverConnectionManager> Grains => _grains;
+public IReadOnlyCollection<IObserverConnectionManager> Grains => _grains.IsEmpty ? [] : [.. _grains.Keys];
```
Grains currently materializes a new array on every access (`[.. _grains.Keys]`). Note that `ConcurrentDictionary<TKey,TValue>.Keys` is itself a point-in-time snapshot (it takes the dictionary's locks and copies the keys), so the collection expression copies that snapshot a second time. Since `Keys` already implements `IReadOnlyCollection<TKey>`, you can return `_grains.Keys` directly to drop the extra copy and the `IsEmpty` branch.
Suggested change:

```diff
-public IReadOnlyCollection<IObserverConnectionManager> Grains => _grains.IsEmpty ? [] : [.. _grains.Keys];
+public IReadOnlyCollection<IObserverConnectionManager> Grains => _grains.Keys;
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f2e440b8ab
```csharp
var partitionIds = await Task.Run(
    () => coordinatorGrain.AddConnectionToGroups(uniqueGroupNames, connectionId, subscriptionReference),
    cancellationToken);
```
Register partition grains before awaiting coordinator writes
This now awaits coordinatorGrain.AddConnectionToGroups(...) before recording touched partition grains in subscription. If the coordinator call partially succeeds and then faults (it fans out partition writes in parallel and Task.WhenAll can fail after some partitions succeeded), the method exits before subscription.AddGrain(...) runs, so disconnect cleanup no longer knows which partition grains to remove from. That leaves stale group membership/observer entries for connections that later disconnect.
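One way to preserve disconnect cleanup here, sketched under the assumption that the touched partition grains can be resolved (or conservatively over-approximated) before the coordinator call; `candidatePartitionGrains` is a hypothetical local, and the pattern relies on `RemoveConnection` being a no-op for partitions that were never actually written:

```csharp
// Track candidate partition grains *before* the coordinator fan-out, so a
// partial failure still leaves OnDisconnectedAsync enough state to compensate.
// Removing a connection from an untouched partition must be idempotent.
foreach (var partitionGrain in candidatePartitionGrains)
{
    subscription.AddGrain(partitionGrain);
}

var partitionIds = await Task.Run(
    () => coordinatorGrain.AddConnectionToGroups(uniqueGroupNames, connectionId, subscriptionReference),
    cancellationToken);
```

The trade-off is over-tracking: disconnect cleanup may issue a few redundant removals, which is cheap compared to leaking group membership.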
```csharp
if (tasks.Length > 0)
{
    await Task.WhenAll(tasks);
}
```
Track group grains before awaiting AddConnection fan-out
In the non-partitioned path, Task.WhenAll(tasks) is awaited before any subscription.AddGrain(groupGrain) calls. If one AddConnection task fails while others succeed, this throws and skips the later tracking step, so successful group memberships are never added to subscription.Grains and therefore are not removed during OnDisconnectedAsync. The result is stale memberships that can outlive the connection.
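The ordering fix this comment describes could look like the following sketch (built from the names in the diff, not the final implementation):

```csharp
// Track every group grain first; AddGrain only records tracking state, so a
// later AddConnection failure leaves at worst an extra tracked entry that
// disconnect cleanup removes idempotently.
foreach (var groupGrain in groupGrains)
{
    subscription.AddGrain(groupGrain);
}

var tasks = groupGrains
    .Select(groupGrain => Task.Run(
        () => groupGrain.AddConnection(connectionId, subscriptionReference),
        cancellationToken))
    .ToArray();

if (tasks.Length > 0)
{
    await Task.WhenAll(tasks);
}
```

This "track first, then fan out" ordering makes the compensating try/catch suggested above optional rather than load-bearing.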
Summary
Verification

Tests covering the change:
- ManagedCode.Orleans.SignalR.Tests.HighAvailabilityTests.ClientsSurviveThirdAndFourthSiloShutdown
- ManagedCode.Orleans.SignalR.Tests.PartitioningTests.BatchGroupMembershipShouldCleanupTouchedPartitionsWhenDisconnectHappensMidJoinAsync
- ManagedCode.Orleans.SignalR.Tests.HubSmokeTests.BatchGroupHelpersShouldWorkWithoutOrleansRegistrationAsync
- ManagedCode.Orleans.SignalR.Tests.GrainPersistenceTests.GroupCoordinatorBatchMethodsShouldTrackMembershipAcrossTouchedPartitionsAsync
- ManagedCode.Orleans.SignalR.Tests.GrainPersistenceTests.GroupCoordinatorBatchRemoveShouldCleanupPartitionWhenAssignmentMetadataIsMissingAsync