feat(server): graceful SIGTERM/SIGINT drain for in-flight tasks #184
Closes #182.
Summary
`systemctl restart claude-hooks` (or any signal-based stop) used to kill Bun immediately, leaving in-flight tasks stuck as `running` in `task_history` forever, the agent work lost, and no webhook re-delivery on the Forgejo side. `TimeoutStopSec=300` in the systemd unit was decorative: nothing listened for the drain window.

This PR adds a four-phase graceful-shutdown handler in `apps/server/src/shutdown.ts`, registered from `main.ts` on SIGTERM/SIGINT:

1. `Bun.serve.stop(true)` closes the HTTP listener so `/task`, `/webhook/forgejo`, `/breakdown`, and `/architect/chat` refuse connections at the TCP layer. Webhook events delivered mid-drain are dropped; Forgejo re-delivers after restart.
2. Waits up to `shutdown.drain_ms` (default 60 s, clamped to `[1_000, 270_000]` to fit safely under `TimeoutStopSec=300`) for in-flight tasks to finish naturally. Polls each worker's `currentTask` slot every 250 ms and logs each settle time.
3. Calls `currentAbort.abort()` on every still-busy worker and persists its `task_history` row as `cancelled` with reason `shutdown` (so `/history` stops showing it as `running` forever).
4. Runs `docker exec <container> sh -c 'kill -TERM $(cat /tmp/claude.pid)'` so the in-container Claude CLI stops chewing Pro-Max quota with no host-side listener.
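The drain phase (2) can be sketched roughly as follows; `Worker`, `listWorkers`, and the return contract here are illustrative stand-ins for the real `shutdown.ts` types, not the actual implementation:

```typescript
// Illustrative sketch of the drain phase: poll workers until they settle
// or the deadline passes, then return whoever is still busy so the
// force-abort phase can deal with them. Names are assumptions.
interface Worker {
  id: string;
  currentTask: { id: string } | null;
}

export async function drain(
  listWorkers: () => Worker[],
  drainMs: number,
): Promise<Worker[]> {
  const start = Date.now();
  while (Date.now() - start < drainMs) {
    // Re-list on every tick so tasks that settle between polls are seen.
    const busy = listWorkers().filter((w) => w.currentTask !== null);
    if (busy.length === 0) break;
    await new Promise((resolve) => setTimeout(resolve, 250)); // 250 ms poll
  }
  // Whatever is still busy gets force-aborted by the next phase.
  return listWorkers().filter((w) => w.currentTask !== null);
}
```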
Orphan guard

The container exec shim (`ensureExecShim` in `container.ts`) now wraps the `claude` spawn with `sh -c 'echo $$ > /tmp/claude.pid; exec "$@"'`. The inner `exec` preserves the PID, so the recorded id is the actual `claude` process. The pidfile path lives in `CONTAINER_PIDFILE_PATH` so `shutdown.ts` and the shim agree.
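A minimal sketch of that wrapping, with `withPidfile` as a hypothetical name (the real code is `ensureExecShim` in `container.ts`):

```typescript
// Hypothetical sketch of the pidfile wrapper described above.
const CONTAINER_PIDFILE_PATH = "/tmp/claude.pid";

export function withPidfile(argv: string[]): string[] {
  // sh writes its own PID, then `exec "$@"` replaces the shell in place
  // (no fork), so /tmp/claude.pid ends up holding the claude PID itself.
  // Per `sh -c` semantics, the "sh" after the script becomes $0 and the
  // remaining argv becomes "$@".
  return [
    "sh",
    "-c",
    `echo $$ > ${CONTAINER_PIDFILE_PATH}; exec "$@"`,
    "sh",
    ...argv,
  ];
}
```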
Observability

Structured log lines per phase (`[shutdown] draining N tasks`, `[shutdown] task <id> settled after <ms>`, `[shutdown] force-abort after <ms>`, `[shutdown] bye`). SSE envelopes:

- `service_shutdown`: carries `drain_ms` + `busy_workers` for the dashboard banner.
- `shutdown_received` / `shutdown_drained` / `shutdown_force_abort` / `shutdown_complete`: phase markers with `elapsed_ms` plus structured detail.
Config

An optional top-level `shutdown` block in `config/agents.json` sets `drain_ms`. Values are clamped to `[1_000, 270_000]` ms; non-number or malformed values degrade to the 60 s default with a warning.
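The loading and clamping behaviour can be sketched as follows; the function and constant names are assumptions, only the default, the bounds, and the warning come from this PR:

```typescript
// Sketch of the shutdown.drain_ms loader described above. The assumed
// config shape is { "shutdown": { "drain_ms": <number> } } in agents.json.
const DEFAULT_DRAIN_MS = 60_000; // 60 s default
const MIN_DRAIN_MS = 1_000;
const MAX_DRAIN_MS = 270_000; // leaves >=30 s margin under TimeoutStopSec=300

export function resolveDrainMs(raw: unknown): number {
  if (typeof raw !== "number" || !Number.isFinite(raw) || raw <= 0) {
    console.warn(`[config] invalid shutdown.drain_ms, using ${DEFAULT_DRAIN_MS}`);
    return DEFAULT_DRAIN_MS;
  }
  return Math.min(MAX_DRAIN_MS, Math.max(MIN_DRAIN_MS, raw));
}
```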
Test plan

- `bun x turbo run test`: 683 server tests pass (up from 673), including 10 new `shutdown.test.ts` cases and 5 new loader cases for `shutdown.drain_ms`, covering:
  - `onTaskShutdown` fires + SSE envelope
  - the `docker exec … sh -c 'kill -TERM …'` invocation
  - `service_shutdown` payload carries `busy_workers` + `drain_ms`
  - the `installSignalHandlers` disposer unregisters process listeners
- `bun x turbo run typecheck`: clean across all three workspaces.
- `bun x biome check .`: no lint or format issues.
- Manual: `systemctl --user restart claude-hooks` with an in-flight task, observing the drain banner + `[shutdown]` logs.

Review: APPROVED ✅
CI green (run #1779, `success`). All four acceptance-criteria phases are correctly implemented and tested.

What I verified

Phase correctness

- `server.stop(true)` closes the HTTP listener at the TCP layer; new requests are refused before any handler runs.
- The drain loop calls `listWorkers()` freshly on every tick, so tasks that finish between iterations are picked up without a dedicated "drained" signal. The deadline math (`start + drainMs`) is correct.
- `currentAbort.abort()` fires per busy worker; `onTaskShutdown` writes the `cancelled`/`shutdown` row to SQLite and broadcasts the SSE event so `/history` stops showing the task as `running` forever.
- The orphan kill runs against `remaining` (the force-aborted set), not against workers that drained naturally. Correct.
Orphan guard (PID file)

The shim runs `sh -c 'echo $$ > /tmp/claude.pid; exec "$@"'`. `exec "$@"` replaces the inner shell in place (no fork), so the PID recorded in the file is exactly the PID the `claude` process will have. The shutdown handler's `if [ -f ... ]; then kill -TERM $(cat ...); fi` script is safe (constant path, no user input).

`containerMode` gate

`containerMode: w.config.container?.enabled === true` evaluates to `false` for the architect (`container.enabled === false`), so no spurious `docker exec` is attempted for host-mode workers. Confirmed by the "host-mode worker → force-aborted without docker exec" test.
Config loading

`shutdown.drain_ms` is clamped to `[1_000, 270_000]` ms, leaving ≥30 s of margin under `TimeoutStopSec=300`. Non-number and `drain_ms ≤ 0` values degrade gracefully to the 60 s default with a warning.
Re-entry guard

The `draining` flag in `installSignalHandlers` prevents a second SIGTERM mid-drain from starting a second drain pass. Correct: a second signal is logged as ignored, and systemd's SIGKILL will eventually take over if the drain hangs.
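That guard can be reduced to a sketch like this; the real `installSignalHandlers` also registers the process listeners and returns a disposer, and `makeSignalGate` is a hypothetical name:

```typescript
// Sketch of the re-entry guard: the first signal starts the drain,
// later signals are logged and ignored so a second drain pass never runs.
export function makeSignalGate(startDrain: (signal: string) => void) {
  let draining = false;
  return (signal: string): void => {
    if (draining) {
      console.log(`[shutdown] ${signal} ignored: drain already in progress`);
      return;
    }
    draining = true;
    startDrain(signal);
  };
}
```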
Test coverage

All 10 `shutdown.test.ts` cases match the behaviour described in the issue:

- … (no `abortSeen`)
- `onTaskShutdown` called, `abortSeen = true`
- `docker exec claude-hooks-dev-default sh -c '…/tmp/claude.pid…'`
- `service_shutdown` envelope carries `busy_workers` + `drain_ms`
- `[shutdown]` log lines emitted per phase
- `onExit` …

The 5 loader tests for `shutdown.drain_ms` (default / explicit / clamp-floor / clamp-ceiling / invalid) also pass.
Minor observation (not blocking)

In the `catch` block of `installSignalHandlers`, the synthetic report passes `totalElapsedMs: 0` rather than the actual elapsed time. Since `onExit` is `process.exit(0)` in production, nothing reads this value in the error path. Worth fixing at some point for log accuracy, but not a correctness concern here.
Acceptance criteria: all met ✅

- Graceful-shutdown handler in `shutdown.ts` ✅
- `service_shutdown` SSE with `drain_ms` + `busy_workers` ✅
- `shutdown.test.ts` covering drain / force-abort / container / host-mode ✅
- `shutdown.drain_ms` config with clamping ✅