fix(agents): invalidate stale session rows on non-success terminal status #1111

Merged
reviewer merged 1 commit from code-lead/1106 into main 2026-05-11 19:26:29 +00:00
Collaborator

Closes #1106

Summary

claude_sdk_sessions rows were written eagerly at first SDK event and
never cleared on SIGKILL / cancel / non-success terminal states, so the
next dispatch resumed a half-executed conversation against a stale plan
(the #1104 cursor task that burned ~45 min running shell loops).

Three layers of defence — eager invalidation, runtime gate, boot-time sweep
— so a single missed hook can't poison the next dispatch.

Design

1. Eager invalidation on every terminal-status path

New helper invalidateSessionForTask({repo, issue_number, agent_type, reason, taskId})
in domain/agent/session-invalidation.ts. Wired into:

  • dispatch/cancel.ts::persistAndBroadcastCancellation — operator cancel, issue-unassign, board reroute.
  • main.ts /cancel inline handlers — both queued-drop and running-abort branches.
  • main.ts::onTaskShutdown — SIGTERM drain casualty path (the exact route the #1104 repro hit).
  • dispatch/registry.ts::onFinish — any non-success terminal state (interrupted / cancelled / failure / aborted_no_skill).

Best-effort with try/catch so a transient SQLite hiccup can't fail the
operator cancel response.

2. Runtime safety gate inside runWithSessionResume

isSessionResumeSafe(repo, issue, type) queries the latest task_history
row for the tuple. Only status === "success" clears the gate; missing rows
and every non-success status drop the stored session and proceed with
resume=null. Threaded through runAgentTask so every dispatch consults
task_history before resuming. Fail-open on DB errors.

3. Boot-time sweep

sweepInvalidSessions() walks every claude_sdk_sessions row, parses the
key via the new parseSessionKey helper, looks up the latest task, and
drops anything orphaned or non-success. Runs at startup after the existing
TTL sweepStaleSessions so the two compose: TTL evicts
Anthropic-side-expired sessions; this evicts SIGKILL-orphaned ones.

Provider scope

Uniform across anthropic / cursor / future provider tags — per the AC,
cursor is the most fragile resume target but no provider should resume
across a task that never reached success.

Files

Layer File Purpose
Module domain/agent/session-invalidation.ts (new) Centralised gate — eager / runtime / sweep helpers
Schema query infrastructure/database/task-store.ts getLatestTaskByIssueAndType(repo, issue, type)
Schema query infrastructure/database/sessions.ts parseSessionKey(key) — inverse of sessionKey()
Runner domain/agent/agent-runner.ts New optional isResumeSafe callback on runWithSessionResume; wired from runAgentTask
Lifecycle domain/dispatch/cancel.ts Eager drop after persistAndBroadcastCancellation
Lifecycle domain/dispatch/registry.ts Eager drop in onFinish when status ≠ success
Lifecycle main.ts Eager drop in /cancel inline handlers + onTaskShutdown; boot-time sweepInvalidSessions() wired into startup

Tests

33 new test cases covering all three AC scenarios:

  • session-invalidation.test.ts (new, 16 cases) — lifecycle drops, idempotency, scoping by issue/type/forge, cursor parity, runtime check matrix across every terminal status, boot-sweep mixed-fleet matrix.
  • session-resume.test.ts (+5 cases) — isResumeSafe=false drops session (no SDK retry), =true preserves, async predicate, fail-open on throw, useSession=false skips the gate.
  • sessions.test.ts (+5 cases) — parseSessionKey round-trip + edge cases.
  • task-store.test.ts (+4 cases) — getLatestTaskByIssueAndType scoping.
  • agent-runner.test.ts — seeded a prior-success row in the existing resume-retry test so the new gate clears the stored session for the Anthropic-TTL-expiry retry path the test exercises.

Test plan

  • just qa clean (typecheck + Biome format + Biome lint + 3,308 server tests)
  • Pre-push hook clean (3,312 server tests)
  • AC test: simulate task that writes a session row, gets SIGKILL'd before finishing → next dispatch resolves to resume=<none> (covered by sweepInvalidSessions > drops session rows whose latest task is missing)
  • AC test: cancelled task → session row cleared on cancel path (covered by invalidateSessionForTask lifecycle suite + isSessionResumeSafe > returns false when the latest task was cancelled)
  • AC test: successful task → session row preserved, resume on the next dispatch for the same issue (covered by isSessionResumeSafe > returns true when the latest task ended in success + sweepInvalidSessions > preserves session rows whose latest task succeeded)
  • Manual QA: reproduce the original scenario — dispatch a cursor task, just restart mid-flight, re-assign, confirm new task spawns with resume=<none>.

🤖 Generated with Claude Code

Closes #1106 ## Summary `claude_sdk_sessions` rows were written eagerly at first SDK event and never cleared on SIGKILL / cancel / non-success terminal states, so the next dispatch resumed a half-executed conversation against a stale plan (the #1104 cursor task that burned ~45 min running shell loops). Three layers of defence — eager invalidation, runtime gate, boot-time sweep — so a single missed hook can't poison the next dispatch. ## Design ### 1. Eager invalidation on every terminal-status path New helper `invalidateSessionForTask({repo, issue_number, agent_type, reason, taskId})` in `domain/agent/session-invalidation.ts`. Wired into: - `dispatch/cancel.ts::persistAndBroadcastCancellation` — operator cancel, issue-unassign, board reroute. - `main.ts /cancel` inline handlers — both queued-drop and running-abort branches. - `main.ts::onTaskShutdown` — SIGTERM drain casualty path (the exact route the #1104 repro hit). - `dispatch/registry.ts::onFinish` — any non-`success` terminal state (interrupted / cancelled / failure / aborted_no_skill). Best-effort with try/catch so a transient SQLite hiccup can't fail the operator cancel response. ### 2. Runtime safety gate inside `runWithSessionResume` `isSessionResumeSafe(repo, issue, type)` queries the latest `task_history` row for the tuple. Only `status === "success"` clears the gate; missing rows and every non-success status drop the stored session and proceed with `resume=null`. Threaded through `runAgentTask` so every dispatch consults task_history before resuming. Fail-open on DB errors. ### 3. Boot-time sweep `sweepInvalidSessions()` walks every `claude_sdk_sessions` row, parses the key via the new `parseSessionKey` helper, looks up the latest task, and drops anything orphaned or non-success. Runs at startup after the existing TTL `sweepStaleSessions` so the two compose: TTL evicts Anthropic-side-expired sessions; this evicts SIGKILL-orphaned ones. ### Provider scope Uniform across `anthropic` / `cursor` / future provider tags — per the AC, cursor is the most fragile resume target but no provider should resume across a task that never reached `success`. ## Files | Layer | File | Purpose | | --- | --- | --- | | Module | `domain/agent/session-invalidation.ts` (new) | Centralised gate — eager / runtime / sweep helpers | | Schema query | `infrastructure/database/task-store.ts` | `getLatestTaskByIssueAndType(repo, issue, type)` | | Schema query | `infrastructure/database/sessions.ts` | `parseSessionKey(key)` — inverse of `sessionKey()` | | Runner | `domain/agent/agent-runner.ts` | New optional `isResumeSafe` callback on `runWithSessionResume`; wired from `runAgentTask` | | Lifecycle | `domain/dispatch/cancel.ts` | Eager drop after `persistAndBroadcastCancellation` | | Lifecycle | `domain/dispatch/registry.ts` | Eager drop in `onFinish` when status ≠ success | | Lifecycle | `main.ts` | Eager drop in `/cancel` inline handlers + `onTaskShutdown`; boot-time `sweepInvalidSessions()` wired into startup | ## Tests 33 new test cases covering all three AC scenarios: - `session-invalidation.test.ts` (new, 16 cases) — lifecycle drops, idempotency, scoping by issue/type/forge, cursor parity, runtime check matrix across every terminal status, boot-sweep mixed-fleet matrix. - `session-resume.test.ts` (+5 cases) — `isResumeSafe=false` drops session (no SDK retry), `=true` preserves, async predicate, fail-open on throw, `useSession=false` skips the gate. - `sessions.test.ts` (+5 cases) — `parseSessionKey` round-trip + edge cases. - `task-store.test.ts` (+4 cases) — `getLatestTaskByIssueAndType` scoping. - `agent-runner.test.ts` — seeded a prior-success row in the existing resume-retry test so the new gate clears the stored session for the Anthropic-TTL-expiry retry path the test exercises. ## Test plan - [x] `just qa` clean (typecheck + Biome format + Biome lint + 3,308 server tests) - [x] Pre-push hook clean (3,312 server tests) - [x] AC test: simulate task that writes a session row, gets SIGKILL'd before finishing → next dispatch resolves to `resume=<none>` (covered by `sweepInvalidSessions > drops session rows whose latest task is missing`) - [x] AC test: cancelled task → session row cleared on cancel path (covered by `invalidateSessionForTask` lifecycle suite + `isSessionResumeSafe > returns false when the latest task was cancelled`) - [x] AC test: successful task → session row preserved, resume on the next dispatch for the same issue (covered by `isSessionResumeSafe > returns true when the latest task ended in success` + `sweepInvalidSessions > preserves session rows whose latest task succeeded`) - [ ] Manual QA: reproduce the original scenario — dispatch a cursor task, `just restart` mid-flight, re-assign, confirm new task spawns with `resume=<none>`. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
fix(agents): invalidate stale session rows on non-success terminal status
All checks were successful
qa / i18n-string-check (pull_request) Successful in 54s
qa / sql-layer-check (pull_request) Successful in 1m4s
qa / db-schema (pull_request) Successful in 1m9s
qa / dockerfile (pull_request) Successful in 1m12s
qa / qa-1 (pull_request) Successful in 5m57s
qa / qa (pull_request) Successful in 0s
1426c6d5a7
Closes #1106.

`claude_sdk_sessions` rows were written eagerly at first SDK event and
never cleared on SIGKILL / cancel / failure, so the next dispatch resumed
a half-executed conversation against a stale plan (the #1104 cursor task
that burned ~45 min running shell loops).

Three layers of defence:

1. Eager invalidation — every terminal-status path (operator cancel via
   cancel.ts + main.ts /cancel inline branches, shutdown drain in
   onTaskShutdown, non-success onFinish in registry.ts) now drops the
   stored session id.
2. Runtime safety check — runWithSessionResume consults
   isSessionResumeSafe(repo, issue, type) before handing the stored id to
   the SDK; only a prior task_history row with status=success clears it.
3. Boot-time sweep — sweepInvalidSessions() runs at startup and drops
   orphaned rows whose latest task is missing or non-success, catching
   anything stranded by older service versions.

Provider-scope is uniform (anthropic + cursor + future tags). Fail-open
on DB errors so a transient SQLite hiccup never blocks cancel / dispatch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
reviewer approved these changes 2026-05-11 19:26:17 +00:00
reviewer left a comment

Review — PR #1111 fix(agents): invalidate stale session rows on non-success terminal status

CI: green. Mergeable: .


Overview

Root-cause fix for the #1104 repro: claude_sdk_sessions rows are written eagerly at the first SDK event and never cleared when a task doesn't reach success. A SIGKILL'd worker left a row pointing at a half-executed conversation, the next dispatch resumed that poisoned session, and the cursor task burned ~45 min running shell loops.

The three-layer defence is the right shape for this: eager invalidation covers the normal paths, the runtime gate catches anything that slipped through, and the boot sweep catches restart-casualties from older deployments.


Layer-by-layer assessment

session-invalidation.ts (new module) — Clean, well-documented. The three public exports map one-to-one to the three defence layers with no cross-contamination. Fail-open semantics throughout are appropriate: a transient SQLite hiccup shouldn't fail an operator cancel or needlessly discard a valid session.

parseSessionKey in sessions.ts — Correct inverse of sessionKey. slice(2, -1).join(":") handles the <forge>:<type>:<owner>/<name>:<issue> format correctly and is future-proof if the repo segment ever contains :. Minor: the doc comment says "allows : in the owner/name" but it's really the repo segment as a whole that's preserved — harmless wording quibble.

getLatestTaskByIssueAndType in task-store.ts — Narrow projection, correct started_at DESC ordering. The status as TerminalStatus cast is safe given how persistTask works (only called at terminal time). finished_at is included in LatestTaskStatusRow but only .status is consumed today — no issue, could be useful for future logging.

runWithSessionResume in agent-runner.ts — The isResumeSafe?: () => boolean | Promise<boolean> option is minimal and backward-compatible (existing tests need no changes). The fail-open catch path (log + safe = true) is correct: a thrown safety check is less likely to indicate a poisoned session than a transient error, and the boot sweep is the catch-all.

registry.ts onFinish — Reason mapping (cancelled"cancelled", interrupted"interrupted", else → "non-success-finish") correctly covers failure and aborted_no_skill. void-fire after persistTask is appropriate.

cancel.ts / main.ts — All four terminal-cancellation paths (queue-drop, running-abort, persistAndBroadcastCancellation, onTaskShutdown) now carry the eager drop. onTaskShutdown is the exact path that triggered the #1104 repro — well-identified.

Boot-time sweep in main.ts — Correctly sequenced after sweepStaleSessions with a .catch guard. The log line (swept N orphaned session row(s)) gives operators a clear signal without being noisy on a clean restart.


Test coverage

33 new cases across 5 files — matrix coverage is comprehensive:

Suite What it covers
session-invalidation.test.ts (16) Lifecycle drop, idempotency, scoping, cursor parity; runtime check across every terminal status + agent_type isolation; boot-sweep orphan / non-success / success / mixed-fleet
session-resume.test.ts (+5) isResumeSafe=false drops + no SDK retry, =true preserves, async predicate, fail-open on throw, useSession=false skips gate entirely
sessions.test.ts (+5) parseSessionKey round-trip, forge variants, non-numeric suffix, too-few segments, slash-heavy repo
task-store.test.ts (+4) Null on no-row, ordering by started_at, agent_type scoping, issue_number scoping
agent-runner.test.ts Seeds a prior-success row for the Anthropic TTL-expiry retry path — comment explains exactly why

The agent-runner.test.ts seed is the subtlest part: the gate now consults task_history before allowing resume, so the existing TTL-expiry retry test needed a success row to pass the gate. The comment correctly distinguishes the two failure modes (SIGKILL poisoning vs. Anthropic-side TTL expiry).


Minor observations (non-blocking)

  1. dropSession is awaited sequentially inside the sweepInvalidSessions loop. Fine for a boot-time operation with a small number of rows; if the session table ever grows large a Promise.allSettled fan-out would be worth revisiting.
  2. LatestTaskStatusRow.finished_at is selected but unused today — harmless, leave it for future log enhancements.

APPROVED — merging.

## Review — PR #1111 `fix(agents): invalidate stale session rows on non-success terminal status` CI: ✅ green. Mergeable: ✅. --- ### Overview Root-cause fix for the #1104 repro: `claude_sdk_sessions` rows are written eagerly at the first SDK event and never cleared when a task doesn't reach `success`. A SIGKILL'd worker left a row pointing at a half-executed conversation, the next dispatch resumed that poisoned session, and the cursor task burned ~45 min running shell loops. The three-layer defence is the right shape for this: eager invalidation covers the normal paths, the runtime gate catches anything that slipped through, and the boot sweep catches restart-casualties from older deployments. --- ### Layer-by-layer assessment **`session-invalidation.ts` (new module)** — Clean, well-documented. The three public exports map one-to-one to the three defence layers with no cross-contamination. Fail-open semantics throughout are appropriate: a transient SQLite hiccup shouldn't fail an operator cancel or needlessly discard a valid session. **`parseSessionKey` in `sessions.ts`** — Correct inverse of `sessionKey`. `slice(2, -1).join(":")` handles the `<forge>:<type>:<owner>/<name>:<issue>` format correctly and is future-proof if the repo segment ever contains `:`. Minor: the doc comment says "allows `:` in the owner/name" but it's really the repo segment as a whole that's preserved — harmless wording quibble. **`getLatestTaskByIssueAndType` in `task-store.ts`** — Narrow projection, correct `started_at DESC` ordering. The `status as TerminalStatus` cast is safe given how `persistTask` works (only called at terminal time). `finished_at` is included in `LatestTaskStatusRow` but only `.status` is consumed today — no issue, could be useful for future logging. **`runWithSessionResume` in `agent-runner.ts`** — The `isResumeSafe?: () => boolean | Promise<boolean>` option is minimal and backward-compatible (existing tests need no changes). The fail-open catch path (log + `safe = true`) is correct: a thrown safety check is less likely to indicate a poisoned session than a transient error, and the boot sweep is the catch-all. **`registry.ts` `onFinish`** — Reason mapping (`cancelled` → `"cancelled"`, `interrupted` → `"interrupted"`, else → `"non-success-finish"`) correctly covers `failure` and `aborted_no_skill`. `void`-fire after `persistTask` is appropriate. **`cancel.ts` / `main.ts`** — All four terminal-cancellation paths (queue-drop, running-abort, `persistAndBroadcastCancellation`, `onTaskShutdown`) now carry the eager drop. `onTaskShutdown` is the exact path that triggered the #1104 repro — well-identified. **Boot-time sweep in `main.ts`** — Correctly sequenced after `sweepStaleSessions` with a `.catch` guard. The log line (`swept N orphaned session row(s)`) gives operators a clear signal without being noisy on a clean restart. --- ### Test coverage 33 new cases across 5 files — matrix coverage is comprehensive: | Suite | What it covers | |---|---| | `session-invalidation.test.ts` (16) | Lifecycle drop, idempotency, scoping, cursor parity; runtime check across every terminal status + agent_type isolation; boot-sweep orphan / non-success / success / mixed-fleet | | `session-resume.test.ts` (+5) | `isResumeSafe=false` drops + no SDK retry, `=true` preserves, async predicate, fail-open on throw, `useSession=false` skips gate entirely | | `sessions.test.ts` (+5) | `parseSessionKey` round-trip, forge variants, non-numeric suffix, too-few segments, slash-heavy repo | | `task-store.test.ts` (+4) | Null on no-row, ordering by `started_at`, `agent_type` scoping, `issue_number` scoping | | `agent-runner.test.ts` | Seeds a prior-success row for the Anthropic TTL-expiry retry path — comment explains exactly why | The `agent-runner.test.ts` seed is the subtlest part: the gate now consults `task_history` before allowing resume, so the existing TTL-expiry retry test needed a success row to pass the gate. The comment correctly distinguishes the two failure modes (SIGKILL poisoning vs. Anthropic-side TTL expiry). --- ### Minor observations (non-blocking) 1. `dropSession` is awaited sequentially inside the `sweepInvalidSessions` loop. Fine for a boot-time operation with a small number of rows; if the session table ever grows large a `Promise.allSettled` fan-out would be worth revisiting. 2. `LatestTaskStatusRow.finished_at` is selected but unused today — harmless, leave it for future log enhancements. --- **APPROVED — merging.**
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
charles/claude-hooks!1111
No description provided.