Try to start the queue maintainer multiple times with backoff#1184

Open
brandur wants to merge 1 commit into master from brandur-queue-maintainer-start
Conversation

@brandur brandur commented Mar 25, 2026

This one's aimed at addressing #1161. `HookPeriodicJobsStart.Start` may
return an error that causes the queue maintainer not to start, and there
are a few other intermittent errors that may cause it not to start (say
in the case of a transient DB problem). If this were to occur, the
client's current course of action is to just spit an error to the logs
and not attempt any additional remediation, which could have the effect
of leaving the queue maintainer offline for extended periods.

Here, try to address this broadly by allowing the queue maintainer a few
attempts at starting, and with our standard exponential backoff (1s, 2s,
4s, 8s, etc.). In case a queue maintainer fails to start completely, the
client requests resignation and hands leadership off to another client
to see if it can start successfully.

I think this is an okay compromise because in the case of a non-transient
fundamental error (say `HookPeriodicJobsStart.Start` always returns an
error), we don't go into a hot loop that starts hammering things.
Instead, we'll get a reasonably responsible slow backoff that gives
things a chance to recover, and which should be very visible in logs.

Fixes #1161.

@brandur brandur force-pushed the brandur-queue-maintainer-start branch from 7f173d6 to fe16c1d on March 25, 2026 at 03:42
@brandur brandur force-pushed the brandur-queue-maintainer-start branch from fe16c1d to a4e03c6 on March 25, 2026 at 03:57
```go
	initialPeriodicJobs []*riverpilot.PeriodicJob
	subServices         []startstop.Service
)
if err := func() error {
```
Contributor Author
Ended up indenting all of this so that there's a single `if err != nil` check at the end in which we can call `stopped()`. Previously, `stopped` wasn't being invoked in some error branches, which could lead to this service erroring on start and never really stopping :/

```go
for _, service := range m.servicesByName {
	if err := service.Start(ctx); err != nil {
		startstop.StopAllParallel(maputil.Values(m.servicesByName)...)
		stopped()
```
Contributor Author
Similarly, make sure `stopped()` is invoked in the error condition. I really need to go look at this `startstop` API again to see if we can make this safer.

@brandur brandur requested a review from bgentry March 25, 2026 04:00

Development

Successfully merging this pull request may close these issues.

Maintenance process startup errors can leave leaders partially functional until leadership change