Description
Problem
opencode run (non-interactive mode) doesn't write session files to opencode's native storage directory. This means:
- Sessions launched via `agentctl launch --matrix` are invisible to `agentctl list`
- The `events()` generator never sees them start or stop
- No `session.stopped` lifecycle events are emitted to OrgLoop
- The supervisor never gets notified when coding agents finish
Discovered during ENG-8709 and ENG-8669 autonomous implementation runs: agents did real work (Kimi committed code, modified 5+ files) but died before pushing/opening PRs, and the supervisor was never woken.
Session meta files ARE written to ~/.agentctl/opencode-sessions/*.json with PIDs, but the discovery logic in list() and events() only enumerates from opencode's native storage (~/.local/share/opencode/storage/session/).
Design: Three-Prong Fuse
When agentctl launches an opencode session, register three signals. First to fire cancels the others (AbortController per session).
Signal 1: Wrapper Exit Hook (primary, immediate)
- Launch a thin wrapper script instead of opencode directly
- Wrapper runs opencode, captures exit code, writes it to a `.exit` file alongside the session meta
- On next poll cycle (≤15s), agentctl detects the exit file → emits `session.stopped` with exit code
- Gives us exit code for free, which PID poll alone can't provide
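A minimal sketch of the wrapper side and the matching read on the daemon side, assuming a Node-based wrapper. The helper names (`runAndRecordExit`, `readExitRecord`) and the JSON shape of the `.exit` file are hypothetical; the issue only says the exit code is written to a `.exit` file.

```typescript
import { spawnSync } from "node:child_process";
import { existsSync, readFileSync, renameSync, writeFileSync } from "node:fs";

// Assumed on-disk shape of the .exit file (not specified in the issue).
interface ExitRecord {
  exitCode: number | null; // null when the child was killed by a signal or never started
  signal: string | null;
}

// Wrapper side: run the command, then record how it ended.
export function runAndRecordExit(command: string, args: string[], exitFilePath: string): number | null {
  const result = spawnSync(command, args, { stdio: "inherit" });
  const record: ExitRecord = { exitCode: result.status, signal: result.signal ?? null };
  // Write to a temp file and rename so a poller never reads a half-written record.
  const tmpPath = `${exitFilePath}.tmp`;
  writeFileSync(tmpPath, JSON.stringify(record));
  renameSync(tmpPath, exitFilePath);
  return result.status;
}

// Daemon side: called on each poll cycle; returns null until the wrapper has finished.
export function readExitRecord(exitFilePath: string): ExitRecord | null {
  if (!existsSync(exitFilePath)) return null;
  return JSON.parse(readFileSync(exitFilePath, "utf8")) as ExitRecord;
}
```

If the wrapper itself is killed (for example by SIGKILL), no exit file is written, which is the case Signal 2 is meant to cover.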
Signal 2: PID Death Poll (fallback, ≤15s latency)
- The `events()` generator polls `kill(pid, 0)` every ~15s for all tracked sessions
- Catches crashes, OOM kills, SIGKILL: anything the wrapper wouldn't survive
- Uses stored `startTime` from session meta to detect PID recycling
- If PID dies without an exit file: emit `session.stopped` with unknown exit code
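The liveness check and recycling guard might look roughly like this. It assumes `startTime` in the session meta is an epoch timestamp in milliseconds, and it takes the lookup of a process's real start time as an injected function (`getStartTimeMs`, hypothetical), since that lookup is platform-specific and not described in the issue.

```typescript
// True if a process with this PID currently exists.
export function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0 delivers nothing; it only checks existence and permission
    return true;
  } catch (err) {
    // EPERM: the process exists but we may not signal it. ESRCH: no such process.
    return (err as { code?: string }).code === "EPERM";
  }
}

export type Liveness = "alive" | "dead" | "recycled";

// Compare the stored start time with the current owner of the PID.
export function checkSession(
  meta: { pid: number; startTime: number },
  getStartTimeMs: (pid: number) => number | null, // hypothetical, platform-specific lookup
  toleranceMs = 2000,
): Liveness {
  if (!isPidAlive(meta.pid)) return "dead";
  const actual = getStartTimeMs(meta.pid);
  if (actual === null) return "alive"; // cannot verify, so trust the PID
  return Math.abs(actual - meta.startTime) <= toleranceMs ? "alive" : "recycled";
}
```

Both `"dead"` and `"recycled"` would lead to `session.stopped` with an unknown exit code when no exit file is present.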
Signal 3: Master Timeout (safety net, configurable)
- Configurable per-session, default 3 hours
- Emits `session.timeout` (distinct event type from `session.stopped`)
- Does NOT kill the process; the supervisor decides what to do
- Catches: zombie PIDs, PID recycled to another matching process, poll bugs
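One way the configurable timeout could be resolved, as a sketch only. The option names (`launchOptMs`, `configMs`) and the precedence of launch options over agentctl config are assumptions; the issue states only that the timeout is configurable and defaults to 3 hours.

```typescript
const DEFAULT_TIMEOUT_MS = 3 * 60 * 60 * 1000; // 3 hours, the default stated above

// Hypothetical inputs: a per-launch override and a value from agentctl config.
interface TimeoutSources {
  launchOptMs?: number;
  configMs?: number;
}

// Assumed precedence: launch option, then config, then the default.
export function resolveTimeoutMs({ launchOptMs, configMs }: TimeoutSources): number {
  for (const candidate of [launchOptMs, configMs]) {
    if (candidate !== undefined && Number.isFinite(candidate) && candidate > 0) {
      return candidate;
    }
  }
  return DEFAULT_TIMEOUT_MS;
}
```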
Fuse Pattern
launch → register { pid, exitFileWatcher, pollInterval, timeout, abortController }
any signal fires → emit event → controller.abort() → cancels the other two
Each session gets its own AbortController. Daemon tracks Map<sessionId, AbortController>.
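The fuse could be sketched as follows. The names `armFuse`, `timeoutSignal`, `FuseEvent`, and `FuseSignal` are illustrative and not taken from the agentctl codebase. The sketch aborts before emitting so that a late or re-entrant signal is ignored, which differs slightly from the ordering in the diagram above.

```typescript
type FuseEvent =
  | { type: "session.stopped"; sessionId: string; exitCode: number | null }
  | { type: "session.timeout"; sessionId: string };

// A signal source receives `fire` to report its event and an AbortSignal telling it to stop.
type FuseSignal = (fire: (event: FuseEvent) => void, abort: AbortSignal) => void;

// One controller per session, as described above.
export const controllers = new Map<string, AbortController>();

export function armFuse(
  sessionId: string,
  signals: FuseSignal[],
  emit: (event: FuseEvent) => void,
): AbortController {
  const controller = new AbortController();
  controllers.set(sessionId, controller);
  const fire = (event: FuseEvent): void => {
    if (controller.signal.aborted) return; // another signal already won
    controller.abort(); // cancels the remaining signals
    controllers.delete(sessionId);
    emit(event);
  };
  for (const start of signals) {
    if (controller.signal.aborted) break; // an earlier signal fired while starting
    start(fire, controller.signal);
  }
  return controller;
}

// Signal 3 as a fuse signal: emits session.timeout and never touches the process.
export function timeoutSignal(sessionId: string, ms: number): FuseSignal {
  return (fire, abort) => {
    const timer = setTimeout(() => fire({ type: "session.timeout", sessionId }), ms);
    abort.addEventListener("abort", () => clearTimeout(timer), { once: true });
  };
}
```

The exit-file watcher and PID poll would be written as two more `FuseSignal` functions that clear their interval when the abort signal fires.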
Implementation Notes
- Wrapper collapses signals 1+2: exit file is the fast path, PID poll is the fallback
- Poll interval: 15s (cheap, `kill(pid, 0)` is ~0 cost)
- The `list()` method should also enumerate from the meta dir as a primary source, not just supplementary
- `session.timeout` should be a new event type in the lifecycle types
- Timeout is configurable via launch opts or agentctl config, not hardcoded
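Treating the meta dir as a primary source for `list()` might look like the sketch below. The `SessionInfo` shape and both function names are hypothetical, and the merge assumes a session carries the same id in the meta dir and in opencode's native storage, which the issue does not confirm.

```typescript
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical minimal shape; the real session type will carry more fields.
interface SessionInfo {
  id: string;
  pid?: number;
  source: "meta" | "native";
}

// Enumerate <uuid>.json files from the meta dir, skipping anything unreadable.
export function readMetaSessions(metaDir: string): SessionInfo[] {
  let names: string[];
  try {
    names = readdirSync(metaDir);
  } catch {
    return []; // meta dir does not exist yet
  }
  const sessions: SessionInfo[] = [];
  for (const name of names) {
    if (!name.endsWith(".json")) continue;
    try {
      const raw = JSON.parse(readFileSync(join(metaDir, name), "utf8")) as { pid?: unknown };
      const pid = typeof raw.pid === "number" ? raw.pid : undefined;
      sessions.push({ id: name.slice(0, -".json".length), pid, source: "meta" });
    } catch {
      // Skip half-written or corrupt meta files rather than failing the whole listing.
    }
  }
  return sessions;
}

// Meta entries take precedence; native-only sessions are still included.
export function mergeSessions(meta: SessionInfo[], native: SessionInfo[]): SessionInfo[] {
  const byId = new Map<string, SessionInfo>();
  for (const s of native) byId.set(s.id, s);
  for (const s of meta) byId.set(s.id, { ...byId.get(s.id), ...s });
  return Array.from(byId.values());
}
```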
Current State
- Session meta files: `~/.agentctl/opencode-sessions/<uuid>.json` (already written on launch ✅)
- PID stored with startTime for recycling detection ✅
- `events()` async generator exists but only watches opencode's native storage ❌
- No wrapper, no PID poll, no timeout ❌
Acceptance Criteria
- `agentctl list -a` shows sessions launched via `--matrix` / `launch opencode`
- `session.stopped` fires within 15s of opencode process exit
- `session.timeout` fires after configurable duration (default 3h)
- First signal cancels the other two (fuse pattern)
- Exit code captured when available (via wrapper)
- PID recycling detected and handled
- Existing claude-code/codex lifecycle behavior unchanged