fix(shutdown): tasks aborted during drain are marked status=success with no output #221

Closed
opened 2026-04-21 11:54:42 +00:00 by claude-desktop · 0 comments
Collaborator

Bug

When the service receives SIGTERM/SIGINT and runs the four-phase drain, any task that "naturally settles" inside phase 2 (drain budget) has its task_history.status set to success even when the SDK bailed because the HTTP listener closing triggered an indirect abort. The worker's promise resolves via the normal .then() path, so the task-store writer treats it as a success.

The symptom is a row with:

  • status = 'success'
  • turns = NULL
  • cost_usd = NULL
  • finished_at ≈ SIGTERM timestamp exactly

…and a matching pipeline row rendered with a green checkmark for work that never produced a commit or a PR. The operator can't tell the difference between "task succeeded in 2 minutes" and "task died silently during a restart."

Repro (2026-04-21 incident — #209, #210)

  1. #208 PR merged → propagator auto-assigned #209 / #210 to dev at 13:02:10/11.
  2. Dev agents started. Tasks e5bdebea (#209) and f0920b3b (#210) ran for ~120 s each.
  3. At 13:04:10 operator ran systemctl --user restart claude-hooks for an unrelated fix.
  4. Shutdown logs: [shutdown] task e5bdebea settled after 252ms, [shutdown] task f0920b3b settled after 252ms, [shutdown] all 2 task(s) drained cleanly + drained=2 force_aborted=0.
  5. task_history rows: both status=success, turns=NULL, cost_usd=NULL, finished_at=13:04:10 on the dot.
  6. Pipeline view rendered both issues with a green "implement" checkmark.
  7. Worktrees retained uncommitted changes (dev-2: warning: worktree has uncommitted changes from a previous dispatch on the next dispatch) — confirming the agent had started editing but never reached commit/push.

Acceptance criteria

Status accuracy

  • Any task that finalizes during the drain window and has turns = 0 / cost_usd = null / no recorded result SDK message is marked with a distinct status — proposed: interrupted (new value) rather than overloading cancelled (which today means /cancel from the operator).
  • Existing cancelled semantics are unchanged — operator-initiated cancels + phase-3 force-aborts stay on cancelled.
  • TaskStatus in packages/shared/src/task.ts (or wherever it lives) gets the new value. task-store.ts persistence switches on the shutdown flag.

Shutdown wiring

  • shutdown.ts sets a shutdown-in-progress flag observable to worker.ts. When a task settles while that flag is set and the SDK didn't emit a result message (i.e. the agent didn't finish producing output), the worker writes interrupted.
  • Phase-3 force-abort already marks cancelled with reason shutdown — keep that, or fold both paths into interrupted and drop the shutdown-reason column. Up to implementer — pick one, document in a code comment.

Pipeline rendering

  • pipeline.ts treats interrupted the same way it treats cancelled for stage rendering — not a green checkmark. The implement stage reverts to pending (or a new interrupted pill state if that reads clearer).
  • packages/shared/src/pipeline.ts exports the new stage state if one is introduced; <StagePill> in apps/web/src/components/stage-pill.tsx gets a matching visual (suggest reusing the stalled amber instead of inventing a new color).

Auto-recovery

  • On service boot, after container reconcile, scan task_history for rows whose status='interrupted' has not been followed by a subsequent completed task on the same issue. Log each as [recovery] issue #N has interrupted task <id> — consider re-dispatch. Do not auto-dispatch — the operator should decide (some tasks may have produced partial PRs worth reviewing first). Print a just redispatch-interrupted one-liner in the log so there's a clear next step.

Verification

  • Unit test in apps/server/src/shutdown.test.ts (or task-store.test.ts) — fires SIGTERM mid-task with a mocked worker, asserts task_history.status === 'interrupted'.
  • Manual: dispatch a long-running task, systemctl --user restart claude-hooks, inspect task_history — row should be interrupted, pipeline stage should NOT show green.

Out of scope

  • Automatic re-dispatch on reboot — the recovery step is a log hint + CLI one-liner, not a cron.
  • Changing the 60 s drain budget (shutdown.drain_ms) — orthogonal knob.
  • Persisting the partial worktree state across restarts — tasks already find their uncommitted changes on the next dispatch (the worker warned on this on 2026-04-21), which is enough.

References

  • apps/server/src/shutdown.ts — drain machinery.
  • apps/server/src/worker.ts — where task_history rows are written on task finalize.
  • apps/server/src/task-store.tsTaskRecord shape + persistence.
  • apps/server/src/pipeline.ts — stage-state derivation from task_history.status.
  • CLAUDE.md §"Graceful shutdown (issue #182)" — four-phase drain description (status-writing is not explicit today; this ticket makes it so).
  • 2026-04-21 incident: tasks e5bdebea-14d6-4358-b266-132e69037fe2 (#209) and f0920b3b-9cda-4811-920f-76602ad0a947 (#210) — evidence of the false-success classification.
## Bug When the service receives SIGTERM/SIGINT and runs the four-phase drain, any task that "naturally settles" inside phase 2 (drain budget) has its `task_history.status` set to **`success`** even when the SDK bailed because the HTTP listener closing triggered an indirect abort. The worker's promise resolves via the normal `.then()` path, so the task-store writer treats it as a success. The symptom is a row with: - `status = 'success'` - `turns = NULL` - `cost_usd = NULL` - `finished_at` ≈ SIGTERM timestamp exactly …and a matching pipeline row rendered with a green checkmark for work that never produced a commit or a PR. The operator can't tell the difference between "task succeeded in 2 minutes" and "task died silently during a restart." ## Repro (2026-04-21 incident — #209, #210) 1. `#208` PR merged → propagator auto-assigned `#209` / `#210` to dev at `13:02:10/11`. 2. Dev agents started. Tasks `e5bdebea` (#209) and `f0920b3b` (#210) ran for ~120 s each. 3. At `13:04:10` operator ran `systemctl --user restart claude-hooks` for an unrelated fix. 4. Shutdown logs: `[shutdown] task e5bdebea settled after 252ms`, `[shutdown] task f0920b3b settled after 252ms`, `[shutdown] all 2 task(s) drained cleanly` + `drained=2 force_aborted=0`. 5. `task_history` rows: both `status=success`, `turns=NULL`, `cost_usd=NULL`, `finished_at=13:04:10` on the dot. 6. Pipeline view rendered both issues with a green "implement" checkmark. 7. Worktrees retained uncommitted changes (`dev-2: warning: worktree has uncommitted changes from a previous dispatch` on the next dispatch) — confirming the agent had started editing but never reached commit/push. ## Acceptance criteria ### Status accuracy - [ ] Any task that finalizes during the drain window **and** has `turns = 0` / `cost_usd = null` / no recorded `result` SDK message is marked with a distinct status — proposed: `interrupted` (new value) rather than overloading `cancelled` (which today means `/cancel` from the operator). - [ ] Existing `cancelled` semantics are unchanged — operator-initiated cancels + phase-3 force-aborts stay on `cancelled`. - [ ] `TaskStatus` in `packages/shared/src/task.ts` (or wherever it lives) gets the new value. `task-store.ts` persistence switches on the shutdown flag. ### Shutdown wiring - [ ] `shutdown.ts` sets a shutdown-in-progress flag observable to `worker.ts`. When a task settles while that flag is set **and** the SDK didn't emit a `result` message (i.e. the agent didn't finish producing output), the worker writes `interrupted`. - [ ] Phase-3 force-abort already marks `cancelled` with reason `shutdown` — keep that, or fold both paths into `interrupted` and drop the shutdown-reason column. Up to implementer — pick one, document in a code comment. ### Pipeline rendering - [ ] `pipeline.ts` treats `interrupted` the same way it treats `cancelled` for stage rendering — **not** a green checkmark. The implement stage reverts to `pending` (or a new `interrupted` pill state if that reads clearer). - [ ] `packages/shared/src/pipeline.ts` exports the new stage state if one is introduced; `<StagePill>` in `apps/web/src/components/stage-pill.tsx` gets a matching visual (suggest reusing the `stalled` amber instead of inventing a new color). ### Auto-recovery - [ ] On service boot, after container reconcile, scan `task_history` for rows whose `status='interrupted'` has not been followed by a subsequent completed task on the same issue. Log each as `[recovery] issue #N has interrupted task <id> — consider re-dispatch`. Do **not** auto-dispatch — the operator should decide (some tasks may have produced partial PRs worth reviewing first). Print a `just redispatch-interrupted` one-liner in the log so there's a clear next step. ### Verification - [ ] Unit test in `apps/server/src/shutdown.test.ts` (or `task-store.test.ts`) — fires SIGTERM mid-task with a mocked worker, asserts `task_history.status === 'interrupted'`. - [ ] Manual: dispatch a long-running task, `systemctl --user restart claude-hooks`, inspect `task_history` — row should be `interrupted`, pipeline stage should NOT show green. ## Out of scope - Automatic re-dispatch on reboot — the recovery step is a log hint + CLI one-liner, not a cron. - Changing the 60 s drain budget (`shutdown.drain_ms`) — orthogonal knob. - Persisting the partial worktree state across restarts — tasks already find their uncommitted changes on the next dispatch (the worker warned on this on 2026-04-21), which is enough. ## References - `apps/server/src/shutdown.ts` — drain machinery. - `apps/server/src/worker.ts` — where `task_history` rows are written on task finalize. - `apps/server/src/task-store.ts` — `TaskRecord` shape + persistence. - `apps/server/src/pipeline.ts` — stage-state derivation from `task_history.status`. - `CLAUDE.md` §"Graceful shutdown (issue #182)" — four-phase drain description (status-writing is not explicit today; this ticket makes it so). - 2026-04-21 incident: tasks `e5bdebea-14d6-4358-b266-132e69037fe2` (#209) and `f0920b3b-9cda-4811-920f-76602ad0a947` (#210) — evidence of the false-`success` classification.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
charles/claude-hooks#221
No description provided.