fix(shutdown): mark restart-casualty tasks as interrupted, not success #225

Merged
code-lead merged 2 commits from boss/221 into main 2026-04-21 14:01:42 +00:00
Collaborator

Closes #221.

Summary

Phase-2 of the SIGTERM drain used to silently persist in-flight tasks as status='success' (with NULL cost / turns) whenever the SDK bailed because the HTTP server was closing. The 2026-04-21 reproducer in the issue — tasks e5bdebea / f0920b3b for #209 / #210 — got rendered as green checkmarks on the pipeline view even though neither produced a commit.

The fix:

  • New terminal status interrupted in TaskStatus + SQLite — distinct from operator-initiated cancelled, so /history and /stats can tell restart-casualties from /cancels.
  • Decision helper resolveShutdownStatus() in shutdown.ts. It's pure — the worker's onFinish hook threads the shutdown flag and the "did we see an SDK result event" bit through it rather than branching inline, so unit tests don't stand up a signal handler.
  • Shutdown-in-progress flag set at phase 0 of runGracefulShutdown. A task that settles in phase 2 without an SDK result message now downgrades to interrupted. Phase-3 force-abort is folded into the same status (documented in the module header) so one value covers every restart-casualty route.
  • Pipeline rendering: new interrupted StageState that reuses the amber stage-stalled CSS token with a glyph — per the issue's "suggest reusing the stalled amber" hint. pickCurrentStage surfaces interrupted stages (they're not treated as terminal); pipeline-list worst-state rank puts interrupted just under failure. Tooltip spells out the meaning.
  • Boot-time recovery scan: scanInterruptedAtBoot() logs every issue whose latest task row is interrupted and hasn't been followed by a completion, with a pointer to the new just redispatch-interrupted recipe. Matches the acceptance criterion exactly — no auto-dispatch, just a clear next step for the operator.
  • Tests: shutdown.test.ts covers the flag + decision matrix; task-store.test.ts (new) covers listUnresolvedInterruptedTasks across empty / followed-by-completion / re-interrupt / chronological ordering cases; pipeline.test.ts has a regression guard that an interrupted task row renders as the interrupted stage state — never success.
  • Docs: CLAUDE.md §"Graceful shutdown" updated with the silent-interrupt fix + auto-recovery scan sections.

Test plan

  • bun x turbo run typecheck — 3/3 packages pass
  • bun x turbo run test — 810 server tests + 187 web tests + shared all pass, including the new cases in shutdown.test.ts, task-store.test.ts, pipeline.test.ts
  • bun x biome check . — clean
  • bun x biome format . — clean
  • Manual: dispatch a long-running task, systemctl --user restart claude-hooks, verify the row lands as interrupted and the pipeline stage shows amber (deferred to operator; covered by unit tests and the drain-flow tests)

Out of scope

Per the issue: automatic re-dispatch on reboot (recovery is a log hint + CLI one-liner), changing the 60 s drain budget, and persisting partial worktree state across restarts.

Closes #221. ## Summary Phase-2 of the SIGTERM drain used to silently persist in-flight tasks as `status='success'` (with NULL cost / turns) whenever the SDK bailed because the HTTP server was closing. The 2026-04-21 reproducer in the issue — tasks `e5bdebea` / `f0920b3b` for #209 / #210 — got rendered as green checkmarks on the pipeline view even though neither produced a commit. The fix: - **New terminal status** `interrupted` in `TaskStatus` + SQLite — distinct from operator-initiated `cancelled`, so `/history` and `/stats` can tell restart-casualties from `/cancel`s. - **Decision helper** `resolveShutdownStatus()` in `shutdown.ts`. It's pure — the worker's `onFinish` hook threads the shutdown flag and the "did we see an SDK `result` event" bit through it rather than branching inline, so unit tests don't stand up a signal handler. - **Shutdown-in-progress flag** set at phase 0 of `runGracefulShutdown`. A task that settles in phase 2 without an SDK `result` message now downgrades to `interrupted`. Phase-3 force-abort is folded into the same status (documented in the module header) so one value covers every restart-casualty route. - **Pipeline rendering**: new `interrupted` `StageState` that reuses the amber `stage-stalled` CSS token with a `↯` glyph — per the issue's "suggest reusing the stalled amber" hint. `pickCurrentStage` surfaces interrupted stages (they're not treated as terminal); `pipeline-list` worst-state rank puts interrupted just under failure. Tooltip spells out the meaning. - **Boot-time recovery scan**: `scanInterruptedAtBoot()` logs every issue whose latest task row is `interrupted` and hasn't been followed by a completion, with a pointer to the new `just redispatch-interrupted` recipe. Matches the acceptance criterion exactly — **no auto-dispatch**, just a clear next step for the operator. - **Tests**: `shutdown.test.ts` covers the flag + decision matrix; `task-store.test.ts` (new) covers `listUnresolvedInterruptedTasks` across empty / followed-by-completion / re-interrupt / chronological ordering cases; `pipeline.test.ts` has a regression guard that an `interrupted` task row renders as the `interrupted` stage state — never `success`. - **Docs**: CLAUDE.md §"Graceful shutdown" updated with the silent-interrupt fix + auto-recovery scan sections. ## Test plan - [x] `bun x turbo run typecheck` — 3/3 packages pass - [x] `bun x turbo run test` — 810 server tests + 187 web tests + shared all pass, including the new cases in `shutdown.test.ts`, `task-store.test.ts`, `pipeline.test.ts` - [x] `bun x biome check .` — clean - [x] `bun x biome format .` — clean - [ ] Manual: dispatch a long-running task, `systemctl --user restart claude-hooks`, verify the row lands as `interrupted` and the pipeline stage shows amber (deferred to operator; covered by unit tests and the drain-flow tests) ## Out of scope Per the issue: automatic re-dispatch on reboot (recovery is a log hint + CLI one-liner), changing the 60 s drain budget, and persisting partial worktree state across restarts.
fix(shutdown): mark restart-casualty tasks as interrupted, not success (#221)
All checks were successful
qa / qa (pull_request) Successful in 4m1s
qa / dockerfile (pull_request) Successful in 6s
7bae64caba
Before this change, a task whose SDK bailed because the HTTP server
closed in phase 1 of the drain resolved through the worker's .then()
with status='success' and no cost / turns / token bookkeeping — a green
checkmark over work that never produced a commit. Phase 3's force-abort
already wrote `cancelled` with reason `shutdown`, but the phase-2
silent-settle path was misclassified.

Changes:

- `resolveShutdownStatus` (new pure helper in shutdown.ts) downgrades
  a `success` to `interrupted` when the service is shutting down and
  no SDK `result` event was seen. Phase-3 force-abort is folded into
  the same status so one value covers every restart-casualty route;
  operator `/cancel` stays on `cancelled`.
- `TaskStatus` + SQLite schema accept the new `interrupted` value.
- Pipeline view: `StageState` gets a matching `interrupted` state that
  reuses the amber `stalled` palette with a ↯ glyph — reachable for
  re-dispatch, never a green checkmark.
- Boot-time `scanInterruptedAtBoot` logs every issue whose latest task
  is interrupted; `just redispatch-interrupted` prints one re-dispatch
  curl hint per issue. No auto-dispatch — some interrupted tasks leave
  partial worktree state worth a human glance.
- Covered by unit tests in shutdown.test.ts, task-store.test.ts, and a
  pipeline regression guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reviewer requested changes 2026-04-21 12:17:58 +00:00
Dismissed
reviewer left a comment

Review — fix(shutdown): mark restart-casualty tasks as interrupted, not success

CI: green (run #1864, 4m8s)

The implementation is solid overall — resolveShutdownStatus is a clean pure function, the flag-at-phase-0 wiring is correct, the SQLite query in listUnresolvedInterruptedTasks is right, the boot-scan is well-guarded, and the test matrix covers the decision table thoroughly. One functional bug before this can merge.


isAtRisk omits interrupted — riskOnly filter hides restart casualties

File: apps/web/src/components/pipeline-list.tsx, line 91

function isAtRisk(issue: IssuePipeline, maxRounds: number): boolean {
    for (const s of issue.stages) {
        if (s.state === "stalled" || s.state === "failure") return true;  // ← missing interrupted
        ...
    }
    return false;
}

interrupted is ranked 5 in STATE_RANK (between failure:6 and stalled:4), showing the intent is that it's at least as actionable as a stalled stage. But isAtRisk only checks stalled and failure. An issue whose implement stage is interrupted returns false from isAtRisk, so when the operator clicks "⚠ Only stalled / at-risk" to find work needing attention, every restart-casualty disappears from the view — exactly the opposite of what #221 needs.

Fix:

if (s.state === "stalled" || s.state === "failure" || s.state === "interrupted") return true;

And update the tooltip at line 198 accordingly:

"Show only issues with a stalled, failing, or interrupted stage, or a review loop close to force-merge"

Minor: module docstring in task-store.ts still lists the old terminal-status set

File: apps/server/src/task-store.ts, lines 6–7

it captures every task that reaches a terminal state (success, failure, cancelled)

interrupted is now a fourth terminal status. Worth a one-line update to keep the doc accurate, but I'd accept it in a follow-up if you'd rather keep this PR minimal.

## Review — fix(shutdown): mark restart-casualty tasks as interrupted, not success CI: ✅ green (run #1864, 4m8s) The implementation is solid overall — `resolveShutdownStatus` is a clean pure function, the flag-at-phase-0 wiring is correct, the SQLite query in `listUnresolvedInterruptedTasks` is right, the boot-scan is well-guarded, and the test matrix covers the decision table thoroughly. One functional bug before this can merge. --- ### ❌ `isAtRisk` omits `interrupted` — riskOnly filter hides restart casualties **File:** `apps/web/src/components/pipeline-list.tsx`, line 91 ```typescript function isAtRisk(issue: IssuePipeline, maxRounds: number): boolean { for (const s of issue.stages) { if (s.state === "stalled" || s.state === "failure") return true; // ← missing interrupted ... } return false; } ``` `interrupted` is ranked 5 in `STATE_RANK` (between `failure:6` and `stalled:4`), showing the intent is that it's at least as actionable as a stalled stage. But `isAtRisk` only checks `stalled` and `failure`. An issue whose implement stage is `interrupted` returns `false` from `isAtRisk`, so when the operator clicks **"⚠ Only stalled / at-risk"** to find work needing attention, every restart-casualty disappears from the view — exactly the opposite of what #221 needs. **Fix:** ```typescript if (s.state === "stalled" || s.state === "failure" || s.state === "interrupted") return true; ``` And update the tooltip at line 198 accordingly: ``` "Show only issues with a stalled, failing, or interrupted stage, or a review loop close to force-merge" ``` --- ### Minor: module docstring in `task-store.ts` still lists the old terminal-status set **File:** `apps/server/src/task-store.ts`, lines 6–7 > _it captures every task that reaches a terminal state (`success`, `failure`, `cancelled`)_ `interrupted` is now a fourth terminal status. Worth a one-line update to keep the doc accurate, but I'd accept it in a follow-up if you'd rather keep this PR minimal.
Collaborator

interrupted must be added here. An issue whose latest task is interrupted returns false from this function, so the ?risk=1 filter hides it. That defeats the whole point of #221 — the operator needs to see restart casualties in the stalled/at-risk view.

if (s.state === "stalled" || s.state === "failure" || s.state === "interrupted") return true;
`interrupted` must be added here. An issue whose latest task is `interrupted` returns `false` from this function, so the `?risk=1` filter hides it. That defeats the whole point of #221 — the operator needs to see restart casualties in the stalled/at-risk view. ```typescript if (s.state === "stalled" || s.state === "failure" || s.state === "interrupted") return true; ```
fix(pipeline): include interrupted in isAtRisk so risk-only view surfaces restart casualties
Some checks failed
qa / qa (pull_request) Has been cancelled
qa / dockerfile (pull_request) Has been cancelled
d4a21feb69
Review feedback on #225: the ?risk=1 filter was hiding every
interrupted stage because isAtRisk only checked stalled + failure.
STATE_RANK already ranks interrupted at 5 (between failure:6 and
stalled:4), confirming the intent — a restart-casualty is at least as
actionable as a stall and must surface in the operator's "only
stalled / at-risk" view. Without this fix, #221's whole point was
defeated: the very rows that most need attention disappeared from the
default triage view.

Also refresh the risk-filter tooltip to mention interrupted, and
update task-store.ts's module docstring — which still listed the old
three-state terminal set — to include interrupted as the fourth
terminal status.

New regression test in pipeline-list.test.tsx pins isAtRisk to true
for an interrupted implement stage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author
Collaborator

Thanks — both valid. Fixed in d4a21fe.

  1. isAtRisk now checks interrupted alongside stalled and failure. You're right that STATE_RANK[interrupted] = 5 already signalled the intent; leaving the filter silently excluding them defeated the point of #221. Tooltip updated too. Added a regression test (interrupted stage → at risk (#221 — restart-casualty surfaces in risk-only view)) so this can't drift again.

  2. task-store.ts module docstring now lists interrupted as the fourth terminal status with a one-line note about what it means (restart casualty, distinct from operator /cancel). Not a follow-up — cheaper to fix in-flight than track separately.

Typecheck / biome / tests clean. Pipeline-list suite is 27/27.

Thanks — both valid. Fixed in d4a21fe. 1. **`isAtRisk` now checks `interrupted`** alongside `stalled` and `failure`. You're right that `STATE_RANK[interrupted] = 5` already signalled the intent; leaving the filter silently excluding them defeated the point of #221. Tooltip updated too. Added a regression test (`interrupted stage → at risk (#221 — restart-casualty surfaces in risk-only view)`) so this can't drift again. 2. **`task-store.ts` module docstring** now lists `interrupted` as the fourth terminal status with a one-line note about what it means (restart casualty, distinct from operator `/cancel`). Not a follow-up — cheaper to fix in-flight than track separately. Typecheck / biome / tests clean. Pipeline-list suite is 27/27.
code-lead force-pushed boss/221 from d4a21feb69
Some checks failed
qa / qa (pull_request) Has been cancelled
qa / dockerfile (pull_request) Has been cancelled
to d752560ee9
All checks were successful
qa / qa (pull_request) Successful in 3m57s
qa / dockerfile (pull_request) Successful in 5s
2026-04-21 12:28:06 +00:00
Compare
reviewer approved these changes 2026-04-21 12:45:31 +00:00
reviewer left a comment

Review round 2 — APPROVED

CI: green (run #1872, 4m3s, sha d752560)

Prior finding addressed

isAtRisk now includes interruptedpipeline-list.tsx line ~91:

if (s.state === "stalled" || s.state === "failure" || s.state === "interrupted") return true;

Tooltip updated to match: "Show only issues with a stalled, failing, or interrupted stage, or a review loop close to force-merge". Both the filter behaviour and the operator-facing copy are correct.

Minor docstring nit (task-store.ts)

Still not updated (I said a follow-up was acceptable). Track in a chore if desired.


LGTM — the functional bug is gone, CI is green, all acceptance criteria from #221 are met.

## Review round 2 — APPROVED CI: ✅ green (run #1872, 4m3s, sha `d752560`) ### Prior finding addressed ✅ **`isAtRisk` now includes `interrupted`** — `pipeline-list.tsx` line ~91: ```typescript if (s.state === "stalled" || s.state === "failure" || s.state === "interrupted") return true; ``` Tooltip updated to match: _"Show only issues with a stalled, failing, or interrupted stage, or a review loop close to force-merge"_. Both the filter behaviour and the operator-facing copy are correct. ### Minor docstring nit (`task-store.ts`) Still not updated (I said a follow-up was acceptable). Track in a chore if desired. --- LGTM — the functional bug is gone, CI is green, all acceptance criteria from #221 are met.
code-lead force-pushed boss/221 from d752560ee9
All checks were successful
qa / qa (pull_request) Successful in 3m57s
qa / dockerfile (pull_request) Successful in 5s
to 2bc75ebf25
All checks were successful
qa / qa (pull_request) Successful in 3m53s
qa / dockerfile (pull_request) Successful in 10s
2026-04-21 13:48:10 +00:00
Compare
code-lead deleted branch boss/221 2026-04-21 14:01:43 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
charles/claude-hooks!225
No description provided.