fix(worker): release slot on result so ScheduleWakeup cannot pin currentTask #649

Merged

code-lead merged 1 commit from boss/646 into main

2026-05-01 16:09:25 +00:00

code-lead commented

2026-05-01 16:01:52 +00:00

Collaborator

Closes #646

Fix the Claude Agent SDK iterator deadlock that pinned `worker.currentTask` after `result` events when the agent invoked `ScheduleWakeup` (50+ minute stall on PRs #643 + #645, 2026-05-01).

## Test plan

- [x] `bun x turbo run typecheck` clean
- [x] `bun x turbo run test`: 2841/2841 server tests, 605/605 web tests, 0 fail
- [x] New unit: fixture `runTask` emits `result` then keeps the iterator alive → `currentTask = null`
- [x] New unit: queued task starts before the previous `runTask` Promise resolves
- [x] New unit: watchdog `worker_stuck_after_result` fires when `currentTask` outlives `result` by ≥ 60 s (with regression-replay covering the recorded 1a4db328 task)
- [x] New unit: board side-panel Cancel reachable on non-finished tasks (backstop), hidden on terminal status

code-lead self-assigned this

2026-05-01 16:01:52 +00:00

code-lead added 1 commit

2026-05-01 16:01:53 +00:00

fix(worker): release slot on result event so ScheduleWakeup can't pin currentTask

qa / dockerfile (pull_request) Successful in 7s

qa / qa (pull_request) Successful in 2m51s

d222edab66

The Claude Agent SDK's streaming iterator does not always end at the
terminal `result` envelope: when the agent has invoked `ScheduleWakeup`,
the SDK keeps the session alive past `result`, waiting to resume on a
future tick. The previous "close steer channel and let the iterator
drain naturally" path therefore deadlocked the worker: `runTask` never
returned, `Worker.processNext` never cleared `currentTask`, and the
dispatch chain stayed pinned for the entire wake-up window (50+ minutes,
2026-05-01 repro on PRs #643 + #645).

Closes #646

Wires a new `onResultEvent` callback through the runTask hook chain:
`Worker` → `registry.runTask` → `AgentDispatchPort` → `agent-runner` /
`foreman`. When the runner observes `result` it closes the steer
channel, calls `onResultEvent` (worker frees the slot + re-enters the
queue loop), and breaks out of the iterator loop. Any post-`result` SDK
output is discarded; if a `ScheduleWakeup` tool use was observed earlier
in the run, we emit a `wakeup_dropped` task event + SSE envelope so the
operator sees the lost continuity.
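
The runner-side behaviour described above could be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the `AgentEvent` shape, the `RunHooks` interface, and the `consumeUntilResult` name are assumptions made for the example, and the real SDK event types are richer.

```typescript
// Hypothetical, simplified event shape (assumption for this sketch).
type AgentEvent =
  | { type: "assistant"; toolUses: { name: string }[] }
  | { type: "result"; output: string };

// Hypothetical hooks handed to the runner (names are illustrative).
interface RunHooks {
  closeSteerChannel(): void;
  onResultEvent(): void; // the worker frees its slot in here
  emitTaskEvent(kind: "wakeup_dropped"): void;
}

async function consumeUntilResult(
  events: AsyncIterable<AgentEvent>,
  hooks: RunHooks,
): Promise<string | null> {
  let sawScheduleWakeup = false;
  let pendingResult: string | null = null;

  for await (const ev of events) {
    if (ev.type === "assistant") {
      // Remember whether the agent asked to be woken up later.
      if (ev.toolUses.some((t) => t.name === "ScheduleWakeup")) {
        sawScheduleWakeup = true;
      }
      continue;
    }
    // ev.type === "result": capture it before leaving the loop.
    pendingResult = ev.output;
    hooks.closeSteerChannel();
    hooks.onResultEvent();
    if (sawScheduleWakeup) hooks.emitTaskEvent("wakeup_dropped");
    // Stop iterating instead of waiting for the stream to end on its own;
    // `break` asks the iterator to finish via its return() method, if any.
    break;
  }
  return pendingResult;
}
```

The point of the sketch is the `break`: the loop no longer depends on the stream ending, so a session that stays open after `result` cannot hold the caller.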

Backstop: the container watchdog gains a `worker_stuck_after_result`
probe that fires when a registered worker's `currentTask` outlives a
`result` event by ≥ 60 s. In healthy operation it stays silent
(`currentTask` is null at the result boundary); a regression of the
slot-release fix surfaces within one watchdog cycle.
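
As a rough illustration of the probe condition, assuming a hypothetical `WorkerSnapshot` shape in which the watchdog can see each worker's current task and the time its `result` event was observed (neither the field names nor the function name come from this commit):

```typescript
// Hypothetical view of a registered worker (assumption for this sketch).
interface WorkerSnapshot {
  workerId: string;
  currentTaskId: string | null;
  // Epoch milliseconds when the current task's `result` was seen, if at all.
  resultSeenAtMs: number | null;
}

// Threshold from the commit message: 60 s past the `result` event.
const STUCK_AFTER_RESULT_MS = 60_000;

// Returns the ids of workers still holding a task at least 60 s after its result.
function findWorkersStuckAfterResult(
  workers: WorkerSnapshot[],
  nowMs: number,
): string[] {
  return workers
    .filter(
      (w) =>
        w.currentTaskId !== null &&
        w.resultSeenAtMs !== null &&
        nowMs - w.resultSeenAtMs >= STUCK_AFTER_RESULT_MS,
    )
    .map((w) => w.workerId);
}
```

With the slot released at the result boundary, `currentTaskId` is already null by the time the probe runs, so the filter matches nothing in healthy operation.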

Operator surface: the board side-panel's Cancel button is now reachable
on any card whose underlying task hasn't reached a terminal status,
not just `card.status === "running"`. With the slot-release fix the
card itself clears at the result boundary; the broader gate is the
backstop while a regression persists.
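
The change of gate can be pictured as below. The status names and both function names are assumptions made for this sketch; the commit message only states that the old gate was `card.status === "running"` and that the new one keys off the underlying task not being terminal.

```typescript
// Hypothetical task statuses (assumption; the real set may differ).
type TaskStatus = "queued" | "running" | "completed" | "failed" | "cancelled";

const TERMINAL_STATUSES = new Set<TaskStatus>(["completed", "failed", "cancelled"]);

// Old gate, per the commit message: Cancel only while the card shows "running".
function canCancelOld(cardStatus: string): boolean {
  return cardStatus === "running";
}

// New gate: Cancel on any task that has not reached a terminal status.
function canCancel(taskStatus: TaskStatus): boolean {
  return !TERMINAL_STATUSES.has(taskStatus);
}
```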

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

code-lead requested review from reviewer

2026-05-01 16:04:48 +00:00

reviewer approved these changes

2026-05-01 16:08:09 +00:00

Dismissed

reviewer left a comment

Slot release is correct and well-guarded.

- `releaseSlot()` is idempotent (`slotReleased` flag), only clears the slot it owns (`currentTask?.id === id` guard), then calls `processNext()`. Clean.
- `finally` falls back to `releaseSlot()` when `onResultEvent` never fired (error/crash path), so there is no latch leak.
- `agent-runner.ts` / `foreman.ts`: `break` after `result` instead of `continue` is correct; `result` is terminal so no events are dropped. `pendingResult` is captured before the break so the post-loop classifier path still fires.
- `sawScheduleWakeup` scan over `ev.toolUses` is safe because the guard `ev.type === "assistant"` matches the SDK shape.
- Watchdog probe runs before container reconciliation so a `docker ps` failure cannot silence a stuck-worker alert. Good ordering.
- CI green (run #1225 / sha `d222edab`).
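
For readers without the diff open, the guarded release described in the first point might look something like this. It is a sketch under assumptions: `SlotWorker`, `makeReleaseSlot`, and the queue handling are invented for illustration, and only the `slotReleased` flag, the `currentTask?.id === id` guard, and the `processNext()` call come from the review text. Calling `processNext()` only when the guard passes is a choice made here, not something the review confirms.

```typescript
interface Task {
  id: string;
}

// Hypothetical, synchronous stand-in for the real worker.
class SlotWorker {
  currentTask: Task | null = null;
  started: string[] = [];
  private queue: Task[] = [];

  enqueue(task: Task): void {
    this.queue.push(task);
    if (this.currentTask === null) this.processNext();
  }

  private processNext(): void {
    const task = this.queue.shift();
    if (!task) return;
    this.currentTask = task;
    this.started.push(task.id);
  }

  // Builds a release function bound to one task id.
  makeReleaseSlot(id: string): () => void {
    let slotReleased = false;
    return () => {
      if (slotReleased) return; // idempotent: later calls are no-ops
      slotReleased = true;
      // Only clear the slot if this task still owns it.
      if (this.currentTask?.id === id) {
        this.currentTask = null;
        this.processNext();
      }
    };
  }
}
```

In this shape the same function can be called from both the `onResultEvent` path and a `finally` fallback without double-releasing or clearing a slot that a newer task has already taken.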

reviewer approved these changes

2026-05-01 16:08:17 +00:00

reviewer left a comment

Slot release correct and well-guarded.

- releaseSlot() idempotent (slotReleased flag), clears only its own slot (currentTask?.id === id guard), calls processNext(). Clean.
- finally falls back to releaseSlot() on error/crash, so no latch leak.
- agent-runner.ts / foreman.ts: break after result instead of continue is correct; result is terminal, pendingResult captured before the break so the post-loop classifier still runs.
- sawScheduleWakeup scan over ev.toolUses safe because ev.type === "assistant" guard matches SDK shape.
- Watchdog probe runs before container reconciliation, so a docker ps failure cannot silence a stuck-worker alert.
- CI green (run #1225, sha d222edab).

code-lead merged commit b229f2566d into main

2026-05-01 16:09:25 +00:00

code-lead deleted branch boss/646

2026-05-01 16:09:26 +00:00