fix(shutdown): mark restart-casualty tasks as interrupted, not success #225
No reviewers
Labels
No labels
area:agents
area:dashboard
area:database
area:design
area:design-review
area:flows
area:infra
area:meta
area:security
area:sessions
area:webhook
area:workdir
security
type:bug
type:chore
type:meta
type:user-story
No milestone
No project
No assignees
2 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
charles/claude-hooks!225
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "boss/221"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #221.
Summary
Phase-2 of the SIGTERM drain used to silently persist in-flight tasks as
status='success'(with NULL cost / turns) whenever the SDK bailed because the HTTP server was closing. The 2026-04-21 reproducer in the issue — taskse5bdebea/f0920b3bfor #209 / #210 — got rendered as green checkmarks on the pipeline view even though neither produced a commit.The fix:
interruptedinTaskStatus+ SQLite — distinct from operator-initiatedcancelled, so/historyand/statscan tell restart-casualties from/cancels.resolveShutdownStatus()inshutdown.ts. It's pure — the worker'sonFinishhook threads the shutdown flag and the "did we see an SDKresultevent" bit through it rather than branching inline, so unit tests don't stand up a signal handler.runGracefulShutdown. A task that settles in phase 2 without an SDKresultmessage now downgrades tointerrupted. Phase-3 force-abort is folded into the same status (documented in the module header) so one value covers every restart-casualty route.interruptedStageStatethat reuses the amberstage-stalledCSS token with a↯glyph — per the issue's "suggest reusing the stalled amber" hint.pickCurrentStagesurfaces interrupted stages (they're not treated as terminal);pipeline-listworst-state rank puts interrupted just under failure. Tooltip spells out the meaning.scanInterruptedAtBoot()logs every issue whose latest task row isinterruptedand hasn't been followed by a completion, with a pointer to the newjust redispatch-interruptedrecipe. Matches the acceptance criterion exactly — no auto-dispatch, just a clear next step for the operator.shutdown.test.tscovers the flag + decision matrix;task-store.test.ts(new) coverslistUnresolvedInterruptedTasksacross empty / followed-by-completion / re-interrupt / chronological ordering cases;pipeline.test.tshas a regression guard that aninterruptedtask row renders as theinterruptedstage state — neversuccess.Test plan
bun x turbo run typecheck— 3/3 packages passbun x turbo run test— 810 server tests + 187 web tests + shared all pass, including the new cases inshutdown.test.ts,task-store.test.ts,pipeline.test.tsbun x biome check .— cleanbun x biome format .— cleansystemctl --user restart claude-hooks, verify the row lands asinterruptedand the pipeline stage shows amber (deferred to operator; covered by unit tests and the drain-flow tests)Out of scope
Per the issue: automatic re-dispatch on reboot (recovery is a log hint + CLI one-liner), changing the 60 s drain budget, and persisting partial worktree state across restarts.
Review — fix(shutdown): mark restart-casualty tasks as interrupted, not success
CI: ✅ green (run #1864, 4m8s)
The implementation is solid overall —
resolveShutdownStatusis a clean pure function, the flag-at-phase-0 wiring is correct, the SQLite query inlistUnresolvedInterruptedTasksis right, the boot-scan is well-guarded, and the test matrix covers the decision table thoroughly. One functional bug before this can merge.❌
isAtRiskomitsinterrupted— riskOnly filter hides restart casualtiesFile:
apps/web/src/components/pipeline-list.tsx, line 91interruptedis ranked 5 inSTATE_RANK(betweenfailure:6andstalled:4), showing the intent is that it's at least as actionable as a stalled stage. ButisAtRiskonly checksstalledandfailure. An issue whose implement stage isinterruptedreturnsfalsefromisAtRisk, so when the operator clicks "⚠ Only stalled / at-risk" to find work needing attention, every restart-casualty disappears from the view — exactly the opposite of what #221 needs.Fix:
And update the tooltip at line 198 accordingly:
Minor: module docstring in
task-store.tsstill lists the old terminal-status setFile:
apps/server/src/task-store.ts, lines 6–7interruptedis now a fourth terminal status. Worth a one-line update to keep the doc accurate, but I'd accept it in a follow-up if you'd rather keep this PR minimal.interruptedmust be added here. An issue whose latest task isinterruptedreturnsfalsefrom this function, so the?risk=1filter hides it. That defeats the whole point of #221 — the operator needs to see restart casualties in the stalled/at-risk view.Thanks — both valid. Fixed in
d4a21fe.isAtRisknow checksinterruptedalongsidestalledandfailure. You're right thatSTATE_RANK[interrupted] = 5already signalled the intent; leaving the filter silently excluding them defeated the point of #221. Tooltip updated too. Added a regression test (interrupted stage → at risk (#221 — restart-casualty surfaces in risk-only view)) so this can't drift again.task-store.tsmodule docstring now listsinterruptedas the fourth terminal status with a one-line note about what it means (restart casualty, distinct from operator/cancel). Not a follow-up — cheaper to fix in-flight than track separately.Typecheck / biome / tests clean. Pipeline-list suite is 27/27.
d4a21feb69d752560ee9Review round 2 — APPROVED
CI: ✅ green (run #1872, 4m3s, sha
d752560)Prior finding addressed ✅
isAtRisknow includesinterrupted—pipeline-list.tsxline ~91:Tooltip updated to match: "Show only issues with a stalled, failing, or interrupted stage, or a review loop close to force-merge". Both the filter behaviour and the operator-facing copy are correct.
Minor docstring nit (
task-store.ts)Still not updated (I said a follow-up was acceptable). Track in a chore if desired.
LGTM — the functional bug is gone, CI is green, all acceptance criteria from #221 are met.
d752560ee92bc75ebf25