fix(worker): ScheduleWakeup leaves worker.currentTask non-null after result event #646

Closed
opened 2026-05-01 15:28:30 +00:00 by claude-desktop · 0 comments

## User story

As an operator, I want the worker registry to release `currentTask` when the agent emits a terminal `result` event, so that an agent's use of `ScheduleWakeup` does not pin the worker indefinitely and the dashboard board card stops showing `RUNNING` after the work is genuinely done.

## Repro (hit twice today, 2026-05-01)

Both stuck on the **same** failure mode:

| Task | Worker | Issue / PR | Last event | Wall-clock pinned |
|---|---|---|---|---|
| `1a4db328-5c23-407d-9127-7d54a48198c8` | boss-default | #643 (SC-5 PR — merged) | `result` "Completed in 25 turns" @ 13:10 UTC, preceded by `ScheduleWakeup{ delaySeconds: 240, prompt: "Resume merging PR #643. Check background task ... for CI completion, then proceed with squash-merge if green." }` | ~50 min |
| `f21fc19c-5326-4418-8e91-6c1d77b0c7a0` | reviewer-default | #645 review (SC-10) | `result` "Completed in 60 turns" @ 14:06 UTC | ~10 min before manual cancel |

`/health` reported both as `busy=true`, `current.id=<task>`, `current.repo=charles/claude-hooks`. The board card stayed `RUNNING`. Manual `POST /cancel { task_id }` was the only way out; the kick endpoint (#609) cannot recover this state because it dispatches a *new* task — the stuck `currentTask` slot is the actual lock.

The dispatch chain unblocked only after I also bounced the PR's `requested_reviewers` to re-fire the `review_requested` webhook, because the original review task had silently completed without posting a verdict.

## Root cause hypothesis

`agent-runner.ts:973` calls `steerChannel.close()` on a `result` event so the streaming-input loop can return. But the SDK's streaming iterator does not return immediately when the agent uses `ScheduleWakeup` — that tool is designed to **suspend** the conversation server-side and resume on a future tick, so the SDK keeps the session alive past the "Completed N turns" envelope. The worker's `runTask` Promise therefore never resolves, `worker.ts:450` `this.currentTask = null` never runs, and the slot stays pinned.

The `result` event is logged into `task_history.events`, which is why the dashboard sees a terminal-shaped event but the worker registry still shows `busy`. They diverge.
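
To make the suspected lifecycle concrete, here is a minimal reduction of that shape; the event union and stream signature below are assumptions, not the SDK's real types:

```ts
// Hypothetical reduction of the suspected flow. The real code lives in
// apps/server/src/background/worker.ts and agent-runner.ts; the event
// and stream shapes here are stand-ins, not the SDK's actual API.

interface SdkEvent {
  type: "result" | "tool_call" | "wakeup";
}

async function runTask(sdkStream: AsyncIterable<SdkEvent>): Promise<void> {
  for await (const event of sdkStream) {
    if (event.type === "result") {
      // agent-runner.ts:973 closes the steer channel here, expecting the
      // loop to end. But with ScheduleWakeup the SDK suspends the session
      // server-side and keeps the iterator open past the result envelope...
    }
  }
  // ...so this return is never reached, the runTask Promise never resolves,
  // and the caller's `this.currentTask = null` (worker.ts:450) never runs.
}
```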

## Acceptance criteria

### Worker release semantics

- [ ] When the SDK emits a `result` event, `worker.ts::processNext` releases `currentTask` and re-enters the queue loop. Subsequent SDK output for the same task (a wake-up, a delayed tool call) either:
    1. **Re-enqueues** as a fresh task that re-claims a slot from the pool (preferred — keeps the pool fair under wake-up storms), or
    2. **Drops** with a logged warning (acceptable if re-enqueue is too invasive for v1).
- [ ] Pick (1) or (2) explicitly in the handler docstring with rationale; either is correct provided the worker slot is no longer held during the sleep window. A sketch of option (1) follows this list.
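
A minimal sketch of option (1), release-then-re-enqueue. The `Worker`, `Task`, and `Queue` shapes are hypothetical stand-ins, not the real `worker.ts` types; only the release-on-`result` behaviour comes from the criteria above.

```ts
// Sketch of option (1). All names here (Task, Queue, resumedFrom) are
// assumptions; the real worker.ts types will differ.

interface SdkEvent {
  type: "result" | "tool_call" | "wakeup";
}
interface Task {
  id: string;
  resumedFrom?: string;
}
interface Queue {
  push(task: Task): void;
}

class Worker {
  currentTask: Task | null = null;

  constructor(private readonly queue: Queue) {}

  async processNext(task: Task, sdkStream: AsyncIterable<SdkEvent>): Promise<void> {
    this.currentTask = task;
    try {
      for await (const event of sdkStream) {
        if (event.type === "result") {
          // Release on the terminal event itself, not on iterator return:
          // ScheduleWakeup keeps the stream open, so waiting pins the slot.
          this.currentTask = null;
        } else if (this.currentTask === null) {
          // Post-result output (a wake-up tick, a delayed tool call):
          // re-enqueue as a fresh task so it re-claims a slot from the pool.
          this.queue.push({ id: crypto.randomUUID(), resumedFrom: task.id });
          return; // the new dispatch owns any further output
        }
      }
    } finally {
      this.currentTask = null; // error paths must release the slot too
    }
  }
}
```

The key property is that the slot is freed by the `result` event rather than by iterator completion; the `finally` keeps error paths from re-introducing the pin.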

### ScheduleWakeup contract

- [ ] Document the chosen behaviour in `docs/agents-architecture.md` so a future agent knows whether `ScheduleWakeup` survives a worker-slot rotation.
- [ ] If (1): the wake-up replays the original task body (already does, per `ScheduleWakeup.prompt`); the new dispatch goes through the regular pool selector with the same agent type.
- [ ] If (2): emit a `wakeup_dropped` SSE event so the operator notices when long-running agents lose continuity (see the sketch after this list).
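
A sketch of option (2)'s operator signal; the logger and SSE hub shapes are assumptions, and only the `wakeup_dropped` event name comes from this issue:

```ts
// Sketch of the drop path's observability. `Logger` and `SseHub` are
// hypothetical interfaces, not the server's actual plumbing.

interface Logger {
  warn(msg: string, fields: Record<string, unknown>): void;
}
interface SseHub {
  broadcast(event: { type: string } & Record<string, unknown>): void;
}

function dropPostResultOutput(
  logger: Logger,
  sse: SseHub,
  taskId: string,
  worker: string,
): void {
  // Log for the grep/watchdog trail...
  logger.warn("wakeup_dropped", { task_id: taskId, worker });
  // ...and surface it on the dashboard stream so the operator notices
  // that a long-running agent just lost continuity.
  sse.broadcast({ type: "wakeup_dropped", task_id: taskId, worker });
}
```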

### Diagnostic

- [ ] `/health` reports a worker as `busy` only while a task is actively executing — no false-busy windows after a `result` event.
- [ ] When `task_history.events` ends in `result` but the worker is still pinned, the watchdog logs a `worker_stuck_after_result { task_id, worker, age_seconds }` warning every 60 s. This catches regressions on the chosen behaviour without needing the operator to spot it from the board. A probe sketch follows this list.
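
A sketch of the probe, assuming the watchdog can see each worker's `currentTask` and the timestamp of the last `result` event; `WorkerView`, `lastResultAt`, and the logger shape are all assumptions:

```ts
// Hypothetical probe for container-watchdog.ts. Every name below except
// `worker_stuck_after_result` and its fields is an assumption.

interface WorkerView {
  name: string;
  currentTask: { id: string } | null;
}

function checkStuckAfterResult(
  workers: WorkerView[],
  lastResultAt: (taskId: string) => number | undefined, // epoch ms, from task_history.events
  logger: { warn(msg: string, fields: Record<string, unknown>): void },
  now: number = Date.now(),
): void {
  for (const w of workers) {
    if (!w.currentTask) continue;
    const resultAt = lastResultAt(w.currentTask.id);
    if (resultAt === undefined) continue; // no terminal event yet: genuinely busy
    logger.warn("worker_stuck_after_result", {
      task_id: w.currentTask.id,
      worker: w.name,
      age_seconds: Math.round((now - resultAt) / 1000),
    });
  }
}

// Scheduled on the existing 60 s watchdog tick, e.g.:
// setInterval(() => checkStuckAfterResult(registry.list(), lastResultAt, logger), 60_000);
```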

### Tests

- [ ] Unit: a fixture `runTask` hook that emits `result` then keeps the iterator alive triggers `currentTask = null` (mirrors the ScheduleWakeup-style SDK behaviour without standing up a real SDK); see the fixture sketch after this list.
- [ ] Unit: the watchdog `worker_stuck_after_result` warning fires when `currentTask` outlives a `result` event by ≥ 60 s.
- [ ] Regression: replay the recorded event sequence from task `1a4db328-…` (or a synthetic equivalent) and assert that the worker releases.
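
A sketch of the first test's fixture: an async iterator that emits `result` and then parks forever, mimicking a session held open by `ScheduleWakeup`. The `Worker` wiring and the `eventually` poll helper are assumptions.

```ts
// Fixture stream: terminal-shaped output, but the iterator never returns,
// which is exactly the ScheduleWakeup failure mode this issue describes.

interface SdkEvent {
  type: "assistant_turn" | "result";
}

async function* stuckAfterResult(): AsyncGenerator<SdkEvent> {
  yield { type: "assistant_turn" };
  yield { type: "result" }; // terminal-shaped envelope...
  await new Promise<never>(() => {}); // ...but the stream stays open forever
}

// The test must assert release without awaiting the full stream:
//
//   const run = worker.processNext(task, stuckAfterResult());
//   await eventually(() => worker.currentTask === null); // hypothetical poll helper
//   // `run` is intentionally left pending, like the real suspended session.
```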

### Operator surface

- [ ] The board side panel's `Cancel` button is reachable on a card whose underlying task has emitted `result` but the worker still shows `busy`. (Today the cancel is gated on `card.status === "running"`; with this fix the card itself should clear, but as a backstop the cancel control should be reachable while the bug persists.) A gating sketch follows.
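
A sketch of the backstop gating; the card shape and the `workerBusy` field are assumptions about the dashboard's state, not the real model:

```ts
// Hypothetical card model for the board side panel.

interface Card {
  status: "running" | "done" | "failed";
  workerBusy: boolean; // worker registry still reports busy for this task
}

// Before: cancel was reachable only while card.status === "running".
// Backstop: keep it reachable while the worker is still pinned post-result.
const canCancel = (card: Card): boolean =>
  card.status === "running" || card.workerBusy;
```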

## Out of scope

- Reworking `ScheduleWakeup` itself (that lives in claude-code, not in claude-hooks).
- Fixing the missing-verdict failure mode on PR #645 — separate symptom (reviewer task completed without `submit_pull_review`); track if it recurs after this lands.
- Deciding the long-term shape of long-running agents that need to wait minutes/hours — see related: persistent agent-state spec (out of scope for this fix).

## References

- `apps/server/src/background/worker.ts:406-454` — `processNext` and the `currentTask` lifecycle
- `apps/server/src/domain/agent/agent-runner.ts:967-975` — `result` event closes `steerChannel` but the streaming-input loop keeps consuming SDK output
- `apps/server/src/infrastructure/container/container-watchdog.ts` — likely host for the new `worker_stuck_after_result` probe
- Companion: #609 (kick endpoint) — operator escape hatch; this fix removes the need to wield it for this class of stall
- Companion: #615 (CI pill stuck) — different bug, same observability theme (terminal-state events not propagating)