fix(agents): invalidate stale session rows on non-success terminal status #1111
No reviewers
Labels
No labels
area:agents
area:dashboard
area:database
area:design
area:design-review
area:flows
area:infra
area:meta
area:security
area:sessions
area:webhook
area:workdir
security
type:bug
type:chore
type:meta
type:user-story
No milestone
No project
No assignees
2 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
charles/claude-hooks!1111
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "code-lead/1106"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #1106
Summary
claude_sdk_sessionsrows were written eagerly at first SDK event andnever cleared on SIGKILL / cancel / non-success terminal states, so the
next dispatch resumed a half-executed conversation against a stale plan
(the #1104 cursor task that burned ~45 min running shell loops).
Three layers of defence — eager invalidation, runtime gate, boot-time sweep
— so a single missed hook can't poison the next dispatch.
Design
1. Eager invalidation on every terminal-status path
New helper
invalidateSessionForTask({repo, issue_number, agent_type, reason, taskId})in
domain/agent/session-invalidation.ts. Wired into:dispatch/cancel.ts::persistAndBroadcastCancellation— operator cancel, issue-unassign, board reroute.main.ts /cancelinline handlers — both queued-drop and running-abort branches.main.ts::onTaskShutdown— SIGTERM drain casualty path (the exact route the #1104 repro hit).dispatch/registry.ts::onFinish— any non-successterminal state (interrupted / cancelled / failure / aborted_no_skill).Best-effort with try/catch so a transient SQLite hiccup can't fail the
operator cancel response.
2. Runtime safety gate inside
runWithSessionResumeisSessionResumeSafe(repo, issue, type)queries the latesttask_historyrow for the tuple. Only
status === "success"clears the gate; missing rowsand every non-success status drop the stored session and proceed with
resume=null. Threaded throughrunAgentTaskso every dispatch consultstask_history before resuming. Fail-open on DB errors.
3. Boot-time sweep
sweepInvalidSessions()walks everyclaude_sdk_sessionsrow, parses thekey via the new
parseSessionKeyhelper, looks up the latest task, anddrops anything orphaned or non-success. Runs at startup after the existing
TTL
sweepStaleSessionsso the two compose: TTL evictsAnthropic-side-expired sessions; this evicts SIGKILL-orphaned ones.
Provider scope
Uniform across
anthropic/cursor/ future provider tags — per the AC,cursor is the most fragile resume target but no provider should resume
across a task that never reached
success.Files
domain/agent/session-invalidation.ts(new)infrastructure/database/task-store.tsgetLatestTaskByIssueAndType(repo, issue, type)infrastructure/database/sessions.tsparseSessionKey(key)— inverse ofsessionKey()domain/agent/agent-runner.tsisResumeSafecallback onrunWithSessionResume; wired fromrunAgentTaskdomain/dispatch/cancel.tspersistAndBroadcastCancellationdomain/dispatch/registry.tsonFinishwhen status ≠ successmain.ts/cancelinline handlers +onTaskShutdown; boot-timesweepInvalidSessions()wired into startupTests
33 new test cases covering all three AC scenarios:
session-invalidation.test.ts(new, 16 cases) — lifecycle drops, idempotency, scoping by issue/type/forge, cursor parity, runtime check matrix across every terminal status, boot-sweep mixed-fleet matrix.session-resume.test.ts(+5 cases) —isResumeSafe=falsedrops session (no SDK retry),=truepreserves, async predicate, fail-open on throw,useSession=falseskips the gate.sessions.test.ts(+5 cases) —parseSessionKeyround-trip + edge cases.task-store.test.ts(+4 cases) —getLatestTaskByIssueAndTypescoping.agent-runner.test.ts— seeded a prior-success row in the existing resume-retry test so the new gate clears the stored session for the Anthropic-TTL-expiry retry path the test exercises.Test plan
just qaclean (typecheck + Biome format + Biome lint + 3,308 server tests)resume=<none>(covered bysweepInvalidSessions > drops session rows whose latest task is missing)invalidateSessionForTasklifecycle suite +isSessionResumeSafe > returns false when the latest task was cancelled)isSessionResumeSafe > returns true when the latest task ended in success+sweepInvalidSessions > preserves session rows whose latest task succeeded)just restartmid-flight, re-assign, confirm new task spawns withresume=<none>.🤖 Generated with Claude Code
Review — PR #1111
fix(agents): invalidate stale session rows on non-success terminal statusCI: ✅ green. Mergeable: ✅.
Overview
Root-cause fix for the #1104 repro:
claude_sdk_sessionsrows are written eagerly at the first SDK event and never cleared when a task doesn't reachsuccess. A SIGKILL'd worker left a row pointing at a half-executed conversation, the next dispatch resumed that poisoned session, and the cursor task burned ~45 min running shell loops.The three-layer defence is the right shape for this: eager invalidation covers the normal paths, the runtime gate catches anything that slipped through, and the boot sweep catches restart-casualties from older deployments.
Layer-by-layer assessment
session-invalidation.ts(new module) — Clean, well-documented. The three public exports map one-to-one to the three defence layers with no cross-contamination. Fail-open semantics throughout are appropriate: a transient SQLite hiccup shouldn't fail an operator cancel or needlessly discard a valid session.parseSessionKeyinsessions.ts— Correct inverse ofsessionKey.slice(2, -1).join(":")handles the<forge>:<type>:<owner>/<name>:<issue>format correctly and is future-proof if the repo segment ever contains:. Minor: the doc comment says "allows:in the owner/name" but it's really the repo segment as a whole that's preserved — harmless wording quibble.getLatestTaskByIssueAndTypeintask-store.ts— Narrow projection, correctstarted_at DESCordering. Thestatus as TerminalStatuscast is safe given howpersistTaskworks (only called at terminal time).finished_atis included inLatestTaskStatusRowbut only.statusis consumed today — no issue, could be useful for future logging.runWithSessionResumeinagent-runner.ts— TheisResumeSafe?: () => boolean | Promise<boolean>option is minimal and backward-compatible (existing tests need no changes). The fail-open catch path (log +safe = true) is correct: a thrown safety check is less likely to indicate a poisoned session than a transient error, and the boot sweep is the catch-all.registry.tsonFinish— Reason mapping (cancelled→"cancelled",interrupted→"interrupted", else →"non-success-finish") correctly coversfailureandaborted_no_skill.void-fire afterpersistTaskis appropriate.cancel.ts/main.ts— All four terminal-cancellation paths (queue-drop, running-abort,persistAndBroadcastCancellation,onTaskShutdown) now carry the eager drop.onTaskShutdownis the exact path that triggered the #1104 repro — well-identified.Boot-time sweep in
main.ts— Correctly sequenced aftersweepStaleSessionswith a.catchguard. The log line (swept N orphaned session row(s)) gives operators a clear signal without being noisy on a clean restart.Test coverage
33 new cases across 5 files — matrix coverage is comprehensive:
session-invalidation.test.ts(16)session-resume.test.ts(+5)isResumeSafe=falsedrops + no SDK retry,=truepreserves, async predicate, fail-open on throw,useSession=falseskips gate entirelysessions.test.ts(+5)parseSessionKeyround-trip, forge variants, non-numeric suffix, too-few segments, slash-heavy repotask-store.test.ts(+4)started_at,agent_typescoping,issue_numberscopingagent-runner.test.tsThe
agent-runner.test.tsseed is the subtlest part: the gate now consultstask_historybefore allowing resume, so the existing TTL-expiry retry test needed a success row to pass the gate. The comment correctly distinguishes the two failure modes (SIGKILL poisoning vs. Anthropic-side TTL expiry).Minor observations (non-blocking)
dropSessionis awaited sequentially inside thesweepInvalidSessionsloop. Fine for a boot-time operation with a small number of rows; if the session table ever grows large aPromise.allSettledfan-out would be worth revisiting.LatestTaskStatusRow.finished_atis selected but unused today — harmless, leave it for future log enhancements.APPROVED — merging.