fix(worker): release slot on result so ScheduleWakeup cannot pin currentTask #649

Merged
code-lead merged 1 commit from boss/646 into main 2026-05-01 16:09:25 +00:00
Collaborator

Closes #646

Fix the Claude Agent SDK iterator deadlock that pinned worker.currentTask after result events when the agent invoked ScheduleWakeup (50+ minute stall on PRs #643 + #645, 2026-05-01).

Test plan

  • bun x turbo run typecheck clean
  • bun x turbo run test — 2841/2841 server tests, 605/605 web tests, 0 fail
  • New unit: fixture runTask emits result then keeps the iterator alive → currentTask = null
  • New unit: queued task starts before the previous runTask Promise resolves
  • New unit: watchdog worker_stuck_after_result fires when currentTask outlives result by ≥ 60 s (with regression-replay covering the recorded 1a4db328 task)
  • New unit: board side-panel Cancel reachable on non-finished tasks (backstop), hidden on terminal status
fix(worker): release slot on result event so ScheduleWakeup can't pin currentTask
All checks were successful
qa / dockerfile (pull_request) Successful in 7s
qa / qa (pull_request) Successful in 2m51s
d222edab66
The Claude Agent SDK's streaming iterator does not always end at the
terminal `result` envelope: when the agent invokes `ScheduleWakeup`, the
SDK keeps the session alive past `result`, waiting to resume on a
future tick. The previous "close steer channel and let the iterator
drain naturally" path therefore deadlocked the worker: `runTask` never
returned, `Worker.processNext` never cleared `currentTask`, and the
dispatch chain stayed pinned for the entire wake-up window (50+ minutes,
2026-05-01 repro on PRs #643 + #645).

Closes #646

Wires a new `onResultEvent` callback through the runTask hook chain:
`Worker` → `registry.runTask` → `AgentDispatchPort` → `agent-runner` /
`foreman`. When the runner observes `result` it closes the steer
channel, calls `onResultEvent` (worker frees the slot + re-enters the
queue loop), and breaks the iterator. Any post-`result` SDK output is
discarded; if a `ScheduleWakeup` tool use was observed earlier in the
run we emit a `wakeup_dropped` task event + SSE envelope so the
operator sees the lost continuity.
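
The runner-side control flow described above can be sketched roughly as follows. This is an illustration, not the code in `agent-runner.ts`: the event shapes, the `RunnerHooks` interface, and the `consumeEvents` name are assumptions, and the stream is modeled as a synchronous iterable to keep the sketch short (the real SDK stream is asynchronous, where the same logic would sit in a `for await` loop). The ordering of the `wakeup_dropped` emit relative to `onResultEvent` is also assumed.

```typescript
// Hypothetical, simplified event shapes; the real SDK envelopes are richer.
type AgentEvent =
  | { type: "assistant"; toolUses: { name: string }[] }
  | { type: "result"; ok: boolean };

// Hypothetical hook bundle standing in for the runTask hook chain.
interface RunnerHooks {
  closeSteerChannel(): void; // stop accepting steering input
  onResultEvent(): void; // worker frees its slot here
  emitTaskEvent(kind: string): void; // e.g. "wakeup_dropped"
}

function consumeEvents(
  events: Iterable<AgentEvent>,
  hooks: RunnerHooks,
): AgentEvent | null {
  let sawScheduleWakeup = false;
  let pendingResult: AgentEvent | null = null;

  for (const ev of events) {
    if (ev.type === "assistant") {
      // Remember whether the agent asked to be woken up later.
      if (ev.toolUses.some((t) => t.name === "ScheduleWakeup")) {
        sawScheduleWakeup = true;
      }
      continue;
    }
    // ev.type === "result": capture it before leaving the loop so any
    // post-loop handling still sees the terminal envelope.
    pendingResult = ev;
    hooks.closeSteerChannel();
    hooks.onResultEvent();
    if (sawScheduleWakeup) hooks.emitTaskEvent("wakeup_dropped");
    break; // do not wait for the stream to end on its own
  }
  return pendingResult;
}
```

Because the loop breaks instead of draining, a stream that never terminates after `result` no longer holds the caller.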

Backstop: the container watchdog gains a `worker_stuck_after_result`
probe that fires when a registered worker's `currentTask` outlives a
`result` event by ≥ 60 s. In healthy operation it stays silent
(`currentTask` is null at the result boundary); a regression of the
slot-release fix surfaces within one watchdog cycle.
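
A minimal sketch of what such a probe might check, assuming the watchdog can see a per-worker snapshot. `WorkerSnapshot`, `resultAtMs`, and `probeStuckAfterResult` are hypothetical names for illustration; only the 60 s threshold and the "`currentTask` outlives `result`" condition come from the text above.

```typescript
// Hypothetical snapshot of a registered worker as seen by the watchdog.
interface WorkerSnapshot {
  workerId: string;
  currentTaskId: string | null;
  // Epoch milliseconds at which the current task emitted `result`, if it has.
  resultAtMs: number | null;
}

const STUCK_AFTER_RESULT_MS = 60_000;

// Returns the ids of workers whose currentTask has outlived its `result`
// event by at least the threshold. Healthy workers have already cleared
// currentTask at the result boundary, so they are never reported.
function probeStuckAfterResult(
  workers: WorkerSnapshot[],
  nowMs: number,
): string[] {
  return workers
    .filter(
      (w) =>
        w.currentTaskId !== null &&
        w.resultAtMs !== null &&
        nowMs - w.resultAtMs >= STUCK_AFTER_RESULT_MS,
    )
    .map((w) => w.workerId);
}
```

The probe is a pure function of the snapshot and the clock, which makes it straightforward to replay against a recorded incident.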

Operator surface: the board side-panel's Cancel button is now reachable
on any card whose underlying task hasn't reached a terminal status,
not just when `card.status === "running"`. With the slot-release fix the
card itself clears at the result boundary; the broader gate is a
backstop that keeps Cancel available if the fix ever regresses.
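
The widened gate amounts to a predicate over task status rather than card status. In this sketch the status names and the `canCancel` helper are hypothetical; the board's real status set may differ.

```typescript
// Hypothetical status names, for illustration only.
type TaskStatus = "queued" | "running" | "succeeded" | "failed" | "cancelled";

const TERMINAL_STATUSES: ReadonlySet<TaskStatus> = new Set<TaskStatus>([
  "succeeded",
  "failed",
  "cancelled",
]);

// Cancel is offered for every task that has not reached a terminal status,
// instead of only for cards that currently read "running".
function canCancel(taskStatus: TaskStatus): boolean {
  return !TERMINAL_STATUSES.has(taskStatus);
}
```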

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
reviewer approved these changes 2026-05-01 16:08:09 +00:00
Dismissed
reviewer left a comment

Slot release is correct and well-guarded.

  • releaseSlot() is idempotent (slotReleased flag), only clears the slot it owns (currentTask?.id === id guard), then calls processNext() — clean.
  • finally falls back to releaseSlot() when onResultEvent never fired (error/crash path) — no latch leak.
  • agent-runner.ts / foreman.ts: break after result instead of continue is correct; result is terminal so no events are dropped. pendingResult is captured before the break so the post-loop classifier path still fires.
  • sawScheduleWakeup scan over ev.toolUses is safe because the guard ev.type === "assistant" matches the SDK shape.
  • Watchdog probe runs before container reconciliation so a docker ps failure cannot silence a stuck-worker alert — good ordering.
  • CI green (run #1225 / sha d222edab).
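
For readers without the diff at hand, the slot-release pattern praised in this review can be sketched as below. This is an assumption-laden illustration, not the merged code: `SlotWorker` is a hypothetical stand-in for `Worker`, `runTask` is synchronous here for brevity (in the PR it returns a Promise), and whether `processNext()` sits inside or outside the ownership guard is a guess.

```typescript
interface Task {
  id: string;
}

type RunTask = (task: Task, onResultEvent: () => void) => void;

class SlotWorker {
  currentTask: Task | null = null;
  private queue: Task[] = [];

  constructor(private readonly runTask: RunTask) {}

  enqueue(task: Task): void {
    this.queue.push(task);
    this.processNext();
  }

  private processNext(): void {
    if (this.currentTask !== null) return; // slot is busy
    const task = this.queue.shift();
    if (task === undefined) return;
    this.currentTask = task;

    let slotReleased = false;
    const releaseSlot = (): void => {
      if (slotReleased) return; // idempotent: later calls are no-ops
      slotReleased = true;
      // Only clear the slot this task owns, then pick up the next task.
      if (this.currentTask?.id === task.id) {
        this.currentTask = null;
        this.processNext();
      }
    };

    try {
      // The runner calls releaseSlot as onResultEvent when it sees `result`.
      this.runTask(task, releaseSlot);
    } finally {
      // Error/crash path: release even if onResultEvent never fired.
      releaseSlot();
    }
  }
}
```

With this shape, a runner that calls `onResultEvent` twice or throws before calling it at all still leaves `currentTask` as `null`.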
reviewer approved these changes 2026-05-01 16:08:17 +00:00
reviewer left a comment

Slot release correct and well-guarded.

  • releaseSlot() idempotent (slotReleased flag), clears only its own slot (currentTask?.id === id guard), calls processNext() — clean.
  • finally falls back to releaseSlot() on error/crash — no latch leak.
  • agent-runner.ts / foreman.ts: break after result instead of continue is correct; result is terminal, pendingResult captured before the break so the post-loop classifier still runs.
  • sawScheduleWakeup scan over ev.toolUses is safe because the ev.type === "assistant" guard matches the SDK shape.
  • Watchdog probe runs before container reconciliation — docker ps failure cannot silence a stuck-worker alert.
  • CI green (run #1225, sha d222edab).
code-lead deleted branch boss/646 2026-05-01 16:09:26 +00:00
Reference
charles/claude-hooks!649