fix(agents): unsafe cursor session resume after task SIGKILL — poisons next dispatch with stale state #1106

New issue

Closed

opened 2026-05-11 16:33:46 +00:00 by claude-desktop · 1 comment

claude-desktop commented

2026-05-11 16:33:46 +00:00

Collaborator

User story

As an operator who restarts the service while a task is running, I want the next dispatch for that same issue to start a fresh cursor session, so that the agent doesn't resume a stale conversation against a half-executed plan and burn 45min of confused shells.

Repro

Observed 2026-05-11 around 16:37 UTC on charles/claude-hooks#1104:

Operator assigned dev to issue #1104 at 14:37:21 UTC.
issue-assigned flow ran task ca432f8f-d29e-4cab-8d42-22db730c9419 with resume=<none> (fresh).
Cursor adapter created session cursor:6ce4f4b4-483a-49d1-a10b-18b9c19f21b4 and persisted it to claude_sdk_sessions at 14:37:25 UTC (key forgejo:dev:charles/claude-hooks:1104).
Service restarted at ~15:13 UTC (just restart). SIGKILL'd worker process. Task ca432f8f never wrote a task_history row.
Operator re-triggered dispatch by unassign+reassign. New task 1b8e7ea2-4576-441f-8e64-28d1c95722e0 started at 15:34:49 UTC with resuming session cursor:6ce4f4b4-....
Resumed cursor session was in post-SIGKILL state — agent ran shell loops for ~45min, never edited a file, never created a branch, never committed.
Adapter eventually hit 249s no-output heartbeat; operator-initiated cancel killed cursor-agent at 15:26:31 UTC.

Manual recovery: DELETE FROM claude_sdk_sessions WHERE key='forgejo:dev:charles/claude-hooks:1104'; then unassign+reassign → fresh session cursor:f521f4c8-..., task c45ef893-... completed normally.

Root cause hypothesis

claude_sdk_sessions row is written eagerly (first turn) but never invalidated when the corresponding task dies abnormally. Resume logic blindly trusts the session id.

Acceptance criteria

Session lifecycle

When a task transitions to status interrupted (restart-killed) or cancelled, the corresponding claude_sdk_sessions row is deleted (or marked invalid) so the next dispatch starts fresh.
If a session row exists but the corresponding task never reached success, treat it as invalid: log a warning and proceed with resume=<none>.
Boot-time recovery (when the worker process restarts) sweeps claude_sdk_sessions for rows whose linked task is missing or non-terminal and invalidates them.

Provider scope

Applies to both provider='anthropic' (claude-sdk) and provider='cursor' rows in claude_sdk_sessions.
Anthropic resume tolerates a SIGKILL'd state better (per past experience), but the same rule applies — don't resume a session whose dispatch never finished.

Tests

Unit test: simulate task that writes a session row, gets SIGKILL'd before finishing. Next dispatch resolves to resume=<none>.
Unit test: cancelled task → session row cleared on cancel path.
Unit test: successful task → session row preserved (resume on the next dispatch for the same issue).

Manual QA

Repro the original scenario: dispatch a task, just restart mid-flight, re-assign, confirm new task spawns with resume=<none>.

Out of scope

Persisting partial agent state (todos, plan, etc.) across restarts — that's the M14 / persistent workdirs scope.
Generic worker graceful shutdown — separate concern.
The latent [dev] warning: worktree has uncommitted changes from a previous dispatch message — track separately if it turns out to leak across tasks.

References

Incident timeline: see issue #1104 thread (decision notes 2026-05-11).
Session table: apps/server/src/infrastructure/database/schema/claude-sdk-sessions.ts (key + session_id + last_used_at + provider).
Resume logic: search resuming session in apps/server/src — currently in cursor-cli-adapter + claude-sdk runner paths.

## User story As an operator who restarts the service while a task is running, I want the next dispatch for that same issue to start a **fresh** cursor session, so that the agent doesn't resume a stale conversation against a half-executed plan and burn 45min of confused shells. ## Repro Observed 2026-05-11 around 16:37 UTC on `charles/claude-hooks#1104`: 1. Operator assigned `dev` to issue #1104 at 14:37:21 UTC. 2. `issue-assigned` flow ran task `ca432f8f-d29e-4cab-8d42-22db730c9419` with `resume=<none>` (fresh). 3. Cursor adapter created session `cursor:6ce4f4b4-483a-49d1-a10b-18b9c19f21b4` and persisted it to `claude_sdk_sessions` at 14:37:25 UTC (key `forgejo:dev:charles/claude-hooks:1104`). 4. Service restarted at ~15:13 UTC (`just restart`). SIGKILL'd worker process. Task `ca432f8f` never wrote a `task_history` row. 5. Operator re-triggered dispatch by unassign+reassign. New task `1b8e7ea2-4576-441f-8e64-28d1c95722e0` started at 15:34:49 UTC with `resuming session cursor:6ce4f4b4-...`. 6. Resumed cursor session was in post-SIGKILL state — agent ran shell loops for ~45min, never edited a file, never created a branch, never committed. 7. Adapter eventually hit 249s no-output heartbeat; operator-initiated cancel killed cursor-agent at 15:26:31 UTC. Manual recovery: `DELETE FROM claude_sdk_sessions WHERE key='forgejo:dev:charles/claude-hooks:1104';` then unassign+reassign → fresh session `cursor:f521f4c8-...`, task `c45ef893-...` completed normally. ## Root cause hypothesis `claude_sdk_sessions` row is written eagerly (first turn) but never invalidated when the corresponding task dies abnormally. Resume logic blindly trusts the session id. ## Acceptance criteria ### Session lifecycle - [ ] When a task transitions to status `interrupted` (restart-killed) or `cancelled`, the corresponding `claude_sdk_sessions` row is deleted (or marked `invalid`) so the next dispatch starts fresh. - [ ] If a session row exists but the corresponding task never reached `success`, treat it as invalid: log a warning and proceed with `resume=<none>`. - [ ] Boot-time recovery (when the worker process restarts) sweeps `claude_sdk_sessions` for rows whose linked task is missing or non-terminal and invalidates them. ### Provider scope - [ ] Applies to both `provider='anthropic'` (claude-sdk) and `provider='cursor'` rows in `claude_sdk_sessions`. - [ ] Anthropic resume tolerates a SIGKILL'd state better (per past experience), but the same rule applies — don't resume a session whose dispatch never finished. ### Tests - [ ] Unit test: simulate task that writes a session row, gets SIGKILL'd before finishing. Next dispatch resolves to `resume=<none>`. - [ ] Unit test: cancelled task → session row cleared on cancel path. - [ ] Unit test: successful task → session row preserved (resume on the next dispatch for the same issue). ### Manual QA - [ ] Repro the original scenario: dispatch a task, `just restart` mid-flight, re-assign, confirm new task spawns with `resume=<none>`. ## Out of scope - Persisting partial agent state (todos, plan, etc.) across restarts — that's the M14 / persistent workdirs scope. - Generic worker graceful shutdown — separate concern. - The latent `[dev] warning: worktree has uncommitted changes from a previous dispatch` message — track separately if it turns out to leak across tasks. ## References - Incident timeline: see issue #1104 thread (decision notes 2026-05-11). - Session table: `apps/server/src/infrastructure/database/schema/claude-sdk-sessions.ts` (key + session_id + last_used_at + provider). - Resume logic: search `resuming session` in `apps/server/src` — currently in cursor-cli-adapter + claude-sdk runner paths.

claude-desktop added the

area:agents

type:bug

labels

2026-05-11 16:34:18 +00:00