fix: harden batch group cleanup#8

Merged
KSemenenko merged 2 commits into main from codex/fix-batch-group-cleanup
Apr 2, 2026
Conversation

KSemenenko (Member) commented Apr 2, 2026

Summary

Verification

  • dotnet build ManagedCode.Orleans.SignalR.slnx --configuration Release
  • dotnet build ManagedCode.Orleans.SignalR.slnx --configuration Debug
  • dotnet test ManagedCode.Orleans.SignalR.Tests/ManagedCode.Orleans.SignalR.Tests.csproj --configuration Release --no-build --verbosity normal
  • dotnet test ManagedCode.Orleans.SignalR.Tests/ManagedCode.Orleans.SignalR.Tests.csproj --configuration Debug --no-build
  • dotnet test ManagedCode.Orleans.SignalR.Tests/ManagedCode.Orleans.SignalR.Tests.csproj --configuration Debug --no-build --collect:"XPlat Code Coverage" --filter "FullyQualifiedName~ManagedCode.Orleans.SignalR.Tests.HighAvailabilityTests.ClientsSurviveThirdAndFourthSiloShutdown|FullyQualifiedName~ManagedCode.Orleans.SignalR.Tests.PartitioningTests.BatchGroupMembershipShouldCleanupTouchedPartitionsWhenDisconnectHappensMidJoinAsync|FullyQualifiedName~ManagedCode.Orleans.SignalR.Tests.HubSmokeTests.BatchGroupHelpersShouldWorkWithoutOrleansRegistrationAsync|FullyQualifiedName~ManagedCode.Orleans.SignalR.Tests.GrainPersistenceTests.GroupCoordinatorBatchMethodsShouldTrackMembershipAcrossTouchedPartitionsAsync|FullyQualifiedName~ManagedCode.Orleans.SignalR.Tests.GrainPersistenceTests.GroupCoordinatorBatchRemoveShouldCleanupPartitionWhenAssignmentMetadataIsMissingAsync"
  • dotnet format

@KSemenenko KSemenenko marked this pull request as ready for review April 2, 2026 06:36
Copilot AI review requested due to automatic review settings April 2, 2026 06:36
@KSemenenko KSemenenko merged commit 2a1fdf1 into main Apr 2, 2026
5 checks passed
@KSemenenko KSemenenko deleted the codex/fix-batch-group-cleanup branch April 2, 2026 06:36

Copilot AI left a comment


Pull request overview

This PR hardens Orleans SignalR connection/group cleanup behavior, especially for batch group operations that complete after a client disconnect and for grace-period recovery where Orleans-owned state must remain scheduler-confined.

Changes:

  • Keep grace-period state transitions on the Orleans grain turn while offloading buffered observer replay I/O off-scheduler.
  • Add compensating cleanup when batch group adds finish after a disconnect; skip heartbeats for disconnected connections.
  • Make Subscription grain tracking concurrency-safe; update feature docs and agent guidance.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file

  • ManagedCode.Orleans.SignalR.Server/SignalRObserverGrainBase.cs: Ensures grace-period state changes stay on the grain scheduler while replay is offloaded.
  • ManagedCode.Orleans.SignalR.Core/SignalR/OrleansHubLifetimeManager.cs: Adds disconnect-aware batch group cleanup and avoids heartbeat work for disconnected connections.
  • ManagedCode.Orleans.SignalR.Core/SignalR/Observers/Subscription.cs: Switches to concurrent collections to support overlapping batch mutations and disconnect cleanup.
  • docs/Features/Observer-Health-and-Circuit-Breaker.md: Documents scheduler vs. off-scheduler responsibilities for grace-period recovery.
  • docs/Features/Hub-Lifetime-Manager-Integration.md: Documents compensating cleanup when late batch joins complete after disconnect.
  • docs/Features/Group-Partitioning.md: Documents late-join cleanup behavior for partitioned batching.
  • AGENTS.md: Adds guidance emphasizing simple, ordering/idempotency-based fixes and avoiding locks in Orleans paths.



if (tasks.Length > 0)
{
    await Task.WhenAll(tasks);

Copilot AI Apr 2, 2026


In the non-partitioned path, subscription.AddGrain(groupGrain) now happens after AddConnection completes. If Task.WhenAll(tasks) throws (e.g., one group add fails while others succeed), the subscription will never track the successfully-added grains, so OnDisconnectedAsync won't remove those memberships and the connection can leak in group state. Consider adding grains to the subscription before issuing the adds, or wrapping WhenAll in try/catch and running a compensating RemoveConnection for any/all groupGrains before returning/throwing.

Suggested change
-await Task.WhenAll(tasks);
+try
+{
+    await Task.WhenAll(tasks);
+}
+catch
+{
+    var cleanupTasks = groupGrains
+        .Select(groupGrain => Task.Run(() => groupGrain.RemoveConnection(connectionId, subscriptionReference), cancellationToken))
+        .ToArray();
+    if (cleanupTasks.Length > 0)
+    {
+        await Task.WhenAll(cleanupTasks);
+    }
+    throw;
+}

Comment on lines +424 to +455
if (IsConnectionDisconnected(connectionId))
{
    var cleanupTasks = groupGrains
        .Select(groupGrain => Task.Run(() => groupGrain.RemoveConnection(connectionId, subscriptionReference), cancellationToken))
        .ToArray();

    if (cleanupTasks.Length > 0)
    {
        await Task.WhenAll(cleanupTasks);
    }

    return;
}

foreach (var groupGrain in groupGrains)
{
    subscription.AddGrain(groupGrain);
}

if (IsConnectionDisconnected(connectionId))
{
    var cleanupTasks = groupGrains
        .Select(groupGrain => Task.Run(() => groupGrain.RemoveConnection(connectionId, subscriptionReference), cancellationToken))
        .ToArray();

    if (cleanupTasks.Length > 0)
    {
        await Task.WhenAll(cleanupTasks);
    }

    return;
}

Copilot AI Apr 2, 2026


The disconnect-cleanup logic for the non-partitioned path is duplicated (two IsConnectionDisconnected checks that each build and WhenAll the same RemoveConnection task set). This makes the flow harder to reason about and increases the chance of divergence in future edits. Consider extracting a small helper (e.g., CleanupDisconnectedGroupMembershipAsync) and calling it from both sites, or restructuring to a single post-add check with a single cleanup path.

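The helper the comment proposes could look roughly like this. A hedged sketch, not the project's code: the name CleanupDisconnectedGroupMembershipAsync comes from the comment itself, and the real Orleans group grain references are modeled as removal delegates so the snippet stays self-contained and runnable.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Sketch: one helper owns the RemoveConnection fan-out, so both disconnect
// checks in the non-partitioned path call the same code and cannot diverge.
static async Task CleanupDisconnectedGroupMembershipAsync(Func<Task>[] removeCalls)
{
    var cleanupTasks = removeCalls
        .Select(call => Task.Run(call))
        .ToArray();

    if (cleanupTasks.Length > 0)
    {
        await Task.WhenAll(cleanupTasks);
    }
}

// Simulate three tracked group memberships being cleaned up.
var removed = 0;
var removeCalls = Enumerable.Range(0, 3)
    .Select(_ => (Func<Task>)(() =>
    {
        Interlocked.Increment(ref removed);
        return Task.CompletedTask;
    }))
    .ToArray();

await CleanupDisconnectedGroupMembershipAsync(removeCalls);
Console.WriteLine(removed); // prints 3
```

Both IsConnectionDisconnected sites would then collapse to a single `await CleanupDisconnectedGroupMembershipAsync(...); return;` pair.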
public int PartitionId { get; private set; }

-public IReadOnlyCollection<IObserverConnectionManager> Grains => _grains;
+public IReadOnlyCollection<IObserverConnectionManager> Grains => _grains.IsEmpty ? [] : [.. _grains.Keys];

Copilot AI Apr 2, 2026


Grains currently materializes a new array on every access ([.. _grains.Keys]). ConcurrentDictionary<TKey,TValue>.Keys already takes a consistent, thread-safe snapshot under the dictionary's locks and returns a ReadOnlyCollection<TKey> (which implements IReadOnlyCollection<TKey>), so the IsEmpty branch and the collection expression add a second copy on top of the snapshot that Keys itself allocates. You can likely return that snapshot directly; a cast is needed because the property's static type is ICollection<TKey>.

Suggested change
-public IReadOnlyCollection<IObserverConnectionManager> Grains => _grains.IsEmpty ? [] : [.. _grains.Keys];
+public IReadOnlyCollection<IObserverConnectionManager> Grains => (IReadOnlyCollection<IObserverConnectionManager>)_grains.Keys;

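One nuance worth checking before applying this: ConcurrentDictionary<TKey,TValue>.Keys is safe to hand out, but each access builds a locked snapshot rather than a live view, so it still allocates per call; the win is dropping the second copy. A minimal runnable check, with string keys standing in for the grain references:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

var grains = new ConcurrentDictionary<string, bool>();
grains.TryAdd("grain-1", true);
grains.TryAdd("grain-2", true);

// Each access to Keys builds a locked snapshot (a ReadOnlyCollection<string>),
// so it is safe to expose, but it is not an allocation-free live view.
var a = (IReadOnlyCollection<string>)grains.Keys;
var b = (IReadOnlyCollection<string>)grains.Keys;

Console.WriteLine(a.Count);               // prints 2
Console.WriteLine(ReferenceEquals(a, b)); // prints False: a fresh snapshot per call
```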

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f2e440b8ab


Comment on lines +376 to +378
var partitionIds = await Task.Run(
    () => coordinatorGrain.AddConnectionToGroups(uniqueGroupNames, connectionId, subscriptionReference),
    cancellationToken);


P1: Register partition grains before awaiting coordinator writes

This now awaits coordinatorGrain.AddConnectionToGroups(...) before recording touched partition grains in subscription. If the coordinator call partially succeeds and then faults (it fans out partition writes in parallel and Task.WhenAll can fail after some partitions succeeded), the method exits before subscription.AddGrain(...) runs, so disconnect cleanup no longer knows which partition grains to remove from. That leaves stale group membership/observer entries for connections that later disconnect.


Comment on lines 419 to 422
if (tasks.Length > 0)
{
    await Task.WhenAll(tasks);
}


P1: Track group grains before awaiting AddConnection fan-out

In the non-partitioned path, Task.WhenAll(tasks) is awaited before any subscription.AddGrain(groupGrain) calls. If one AddConnection task fails while others succeed, this throws and skips the later tracking step, so successful group memberships are never added to subscription.Grains and therefore are not removed during OnDisconnectedAsync. The result is stale memberships that can outlive the connection.

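The fix shape this comment points at (track first, then await) can be sketched in isolation. A hedged illustration, not the project's code: strings stand in for group grain references and a thread-safe bag stands in for subscription.AddGrain. The point is that tracking before the fan-out keeps every possibly-added grain visible to disconnect cleanup even when Task.WhenAll faults partway.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

// Illustrative stand-ins, not the project's types.
var tracked = new ConcurrentBag<string>();
string[] groupGrains = { "group-a", "group-b", "group-c" };

// 1. Track every grain before the fan-out is awaited.
foreach (var grain in groupGrains)
{
    tracked.Add(grain);
}

// 2. Fan out the adds; simulate one of them failing.
var tasks = groupGrains
    .Select(grain => grain == "group-b"
        ? Task.FromException(new InvalidOperationException("add failed"))
        : Task.CompletedTask)
    .ToArray();

try
{
    await Task.WhenAll(tasks);
}
catch (InvalidOperationException)
{
    // The fault no longer hides the memberships that did succeed:
    // disconnect cleanup still sees all three tracked grains.
}

Console.WriteLine(tracked.Count); // prints 3
```

The alternative mentioned earlier in the thread, a compensating RemoveConnection sweep in a catch block, trades this eager tracking for cleanup-on-failure; either ordering closes the leak.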
