fix(tasks): persist task_history row at task START, not only at finish — restart-kills leave no audit trail #1107

Closed
opened 2026-05-11 16:34:00 +00:00 by claude-desktop · 0 comments
Collaborator

User story

As an operator investigating a stuck/lost task after a service restart, I want every dispatched task to have a task_history row from the moment it starts, so that the recovery hint can flag it as interrupted and I can see what was in flight.

Background

On 2026-05-11, task ca432f8f-d29e-4cab-8d42-22db730c9419 ran for ~36 minutes on dev (issue #1104), was SIGKILL'd by a `just restart` at ~15:13 UTC, and left no row in task_history. Boot-time recovery printed hints for other interrupted tasks (#994, #957, #1018, #1033) but nothing for #1104 because the row didn't exist.

This means:

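  • Recovery sweeps miss any task killed before its row is written. Today that write happens only on finish, so even a task that had been running for ~36 minutes was missed.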
  • Operators investigating "what was the agent doing?" have no DB record to inspect.
  • The companion bug (#stale-session-resume — file alongside this one) is harder to detect because there's no interrupted row to trigger a session-invalidate sweep.

Acceptance criteria

Start-time persistence

  • task_history row inserted with status='running' at the moment the worker calls starting <id> (registry / worker boundary).
  • Row includes: id, repo, issue_number, user, agent, agent_type, model, provider, started_at. (Tokens / cost / turns / artifacts populated on finish as today.)
  • On task finish (success, failure, cancelled, interrupted, aborted_no_skill), the row is UPDATEd, not re-INSERTed.
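
The insert-at-start, update-on-finish lifecycle in the criteria above could look roughly like this. It is a minimal in-memory sketch, not the actual `task-store.ts` code: `TaskHistoryStore`, `insertRunning`, and `finish` are hypothetical names, and a `Map` stands in for the `task_history` table. Column names are taken from the list above; their types are assumptions.

```typescript
// Hypothetical sketch: a Map stands in for the task_history table.
type FinalStatus = "success" | "failure" | "cancelled" | "interrupted" | "aborted_no_skill";
type TaskStatus = "running" | FinalStatus;

interface TaskHistoryRow {
  id: string;
  repo: string;
  issue_number: number;
  user: string;
  agent: string;
  agent_type: string;
  model: string;
  provider: string;
  started_at: string;         // ISO timestamp (assumed representation)
  status: TaskStatus;
  finished_at: string | null; // NULL while the task is in flight
  cost_usd: number | null;
  turns: number | null;
}

// Fields known at start time; everything else is filled in on finish.
type StartFields = Omit<TaskHistoryRow, "status" | "finished_at" | "cost_usd" | "turns">;

class TaskHistoryStore {
  private rows = new Map<string, TaskHistoryRow>();

  // Called at task start: the only place a row is inserted.
  insertRunning(start: StartFields): void {
    if (this.rows.has(start.id)) {
      throw new Error(`task_history row already exists for ${start.id}`);
    }
    this.rows.set(start.id, {
      ...start,
      status: "running",
      finished_at: null,
      cost_usd: null,
      turns: null,
    });
  }

  // Called at task finish: updates the existing row and never inserts.
  finish(
    id: string,
    status: FinalStatus,
    finishedAt: string,
    totals: { cost_usd: number | null; turns: number | null },
  ): void {
    const row = this.rows.get(id);
    if (!row) throw new Error(`no task_history row to update for ${id}`);
    this.rows.set(id, { ...row, status, finished_at: finishedAt, ...totals });
  }

  get(id: string): TaskHistoryRow | undefined {
    return this.rows.get(id);
  }

  count(): number {
    return this.rows.size;
  }
}
```

Making `finish` throw when the row is missing surfaces any code path that still skips the start-time insert. A real store might prefer an upsert during rollout.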

Schema / migration

  • No new columns needed — current schema already allows NULLable finished_at, cost_usd, turns, artifacts, etc.
  • Add a CHECK or default so status='running' is permitted for in-flight rows.
  • Migration backfills finished_at to NULL for any pre-existing rows already marked interrupted (no-op for the current set; defensive).
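
One way to keep the permitted statuses in a single place is to derive both the SQL `CHECK` clause and the application-side type from one list. The sketch below assumes the six statuses named in this issue are the complete set. The real schema may allow others, and the database engine and migration tooling are not stated here. `TASK_STATUSES`, `statusCheckClause`, and `isTaskStatus` are hypothetical names.

```typescript
// Statuses named in this issue; "running" is the new in-flight value.
// Assumption: this list is complete for the real schema.
const TASK_STATUSES = [
  "running",
  "success",
  "failure",
  "cancelled",
  "interrupted",
  "aborted_no_skill",
] as const;

type TaskStatus = (typeof TASK_STATUSES)[number];

// Builds a standard-SQL CHECK clause for use in a migration.
// The values are compile-time constants, never user input.
function statusCheckClause(column: string = "status"): string {
  const allowed = TASK_STATUSES.map((s) => `'${s}'`).join(", ");
  return `CHECK (${column} IN (${allowed}))`;
}

// Runtime guard for values read back from the database.
function isTaskStatus(value: string): value is TaskStatus {
  return (TASK_STATUSES as readonly string[]).indexOf(value) !== -1;
}
```

Under these assumptions, `statusCheckClause()` returns `CHECK (status IN ('running', 'success', 'failure', 'cancelled', 'interrupted', 'aborted_no_skill'))`.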

Boot-time recovery

  • On worker startup, every status='running' row whose worker is no longer alive is transitioned to status='interrupted' and emitted to the recovery hint log.
  • Tie-in with the companion session-resume bug: invalidate the matching claude_sdk_sessions row at the same time.
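
The boot-time sweep described above might be structured as below. This is a sketch under several assumptions. `RecoveryDeps` and `recoverInterruptedTasks` are hypothetical names. How liveness is determined is left to an injected `isWorkerAlive` predicate, since the issue adds no worker column. Sessions are assumed to be addressable by task id. Only the `[recovery]` prefix comes from the issue; the rest of the log wording is made up.

```typescript
interface RunningRow {
  id: string;
  status: string;
}

// Injected dependencies so the sweep can be unit-tested without a database.
interface RecoveryDeps {
  listRunning(): RunningRow[];            // rows with status='running'
  isWorkerAlive(taskId: string): boolean; // liveness check (mechanism assumed)
  markInterrupted(taskId: string): void;  // sets status='interrupted'; finished_at is not touched
  invalidateSession(taskId: string): void; // companion claude_sdk_sessions row
  log(line: string): void;
}

// Transitions orphaned running rows to interrupted and returns their ids.
function recoverInterruptedTasks(deps: RecoveryDeps): string[] {
  const recovered: string[] = [];
  for (const row of deps.listRunning()) {
    if (deps.isWorkerAlive(row.id)) continue; // still in flight: leave it alone
    deps.markInterrupted(row.id);
    deps.invalidateSession(row.id);
    deps.log(`[recovery] task ${row.id} was running at shutdown; marked interrupted`);
    recovered.push(row.id);
  }
  return recovered;
}
```

On a single-worker deployment, `isWorkerAlive` could simply return `false` at boot, because any row still marked `running` must belong to the previous process.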

Tests

  • Unit test: dispatch enqueues + starts → task_history row present with status='running'.
  • Unit test: simulate restart with a running row → on next boot, row becomes interrupted + recovery hint printed.
  • Unit test: normal completion → row updates to success (no duplicate INSERT).

Out of scope

  • Per-tool / per-turn event persistence — agent_run_event table covers that separately.
  • UI changes — dashboard already renders task_history; will pick up new rows automatically.
  • Cost / token recording for interrupted tasks — those stay NULL.

References

  • Incident: issue #1104 thread.
  • Companion bug: #stale-session-resume (file together).
  • Current write site: apps/server/src/infrastructure/database/task-store.ts::persistTask (currently called on finish only).
  • Recovery hint: apps/server/src/background/worker.ts boot path; grep [recovery] log prefix.
Reference
charles/claude-hooks#1107