fix(worker): ScheduleWakeup leaves worker.currentTask non-null after result event #646
Labels
No labels
area:agents
area:dashboard
area:database
area:design
area:design-review
area:flows
area:infra
area:meta
area:security
area:sessions
area:webhook
area:workdir
security
type:bug
type:chore
type:meta
type:user-story
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
charles/claude-hooks#646
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
User story
As an operator, I want the worker registry to release
currentTaskwhen the agent emits a terminalresultevent, so that an agent's use ofScheduleWakeupdoes not pin the worker indefinitely and the dashboard board card stops showingRUNNINGafter the work is genuinely done.Repro (hit twice today, 2026-05-01)
Both stuck on the same failure mode:
1a4db328-5c23-407d-9127-7d54a48198c8result"Completed in 25 turns" @ 13:10 UTC, preceded byScheduleWakeup{ delaySeconds: 240, prompt: "Resume merging PR #643. Check background task ... for CI completion, then proceed with squash-merge if green." }f21fc19c-5326-4418-8e91-6c1d77b0c7a0result"Completed in 60 turns" @ 14:06 UTC/healthreported both asbusy=true,current.id=<task>,current.repo=charles/claude-hooks. The board card stayedRUNNING. ManualPOST /cancel { task_id }was the only way out; the kick endpoint (#609) cannot recover this state because it dispatches a new task — the stuckcurrentTaskslot is the actual lock.The dispatch chain unblocked only after I also bounced the PR's
requested_reviewersto re-fire thereview_requestedwebhook, because the original review task had silently completed without posting a verdict.Root cause hypothesis
agent-runner.ts:973callssteerChannel.close()on aresultevent so the streaming-input loop can return. But the SDK's streaming iterator does not return immediately when the agent usesScheduleWakeup— that tool is designed to suspend the conversation server-side and resume on a future tick, so the SDK keeps the session alive past the "Completed N turns" envelope. The worker'srunTaskPromise therefore never resolves,worker.ts:450this.currentTask = nullnever runs, and the slot stays pinned.The
resultevent is logged intotask_history.events, which is why the dashboard sees a terminal-shaped event but the worker registry still showsbusy. They diverge.Acceptance criteria
Worker release semantics
resultevent,worker.ts::processNextreleasescurrentTaskand re-enters the queue loop. Subsequent SDK output for the same task (a wake-up, a delayed tool call) either:ScheduleWakeup contract
docs/agents-architecture.mdso a future agent knows whetherScheduleWakeupsurvives a worker-slot rotation.ScheduleWakeup.prompt); the new dispatch goes through the regular pool selector with the same agent type.wakeup_droppedSSE event so the operator notices when long-running agents lose continuity.Diagnostic
/healthreports a worker asbusyonly while a task is actively executing — no false-busy windows after aresultevent.task_history.eventsends inresultbut the worker is still pinned, the watchdog logs aworker_stuck_after_result { task_id, worker, age_seconds }warning every 60 s. This catches regressions on the chosen behaviour without needing the operator to spot it from the board.Tests
runTaskhook that emitsresultthen keeps the iterator alive triggerscurrentTask = null(mirrors the ScheduleWakeup-style SDK behaviour without standing up a real SDK).worker_stuck_after_resultwarning fires whencurrentTaskoutlives aresultevent by ≥ 60 s.1a4db328-…(or a synthetic equivalent) and assert that the worker releases.Operator surface
Cancelbutton is reachable on a card whose underlying task has emittedresultbut the worker still showsbusy. (Today the cancel is gated oncard.status === "running"; with this fix the card itself should clear, but as a backstop the cancel control should be reachable while the bug persists.)Out of scope
ScheduleWakeupitself (that lives in claude-code, not in claude-hooks).submit_pull_review); track if it recurs after this lands.References
apps/server/src/background/worker.ts:406-454—processNextand thecurrentTasklifecycleapps/server/src/domain/agent/agent-runner.ts:967-975—resultevent closessteerChannelbut the streaming-input loop keeps consuming SDK outputapps/server/src/infrastructure/container/container-watchdog.ts— likely host for the newworker_stuck_after_resultprobe