bug(server): SIGTERM kills in-flight tasks — no graceful drain #182
Reference: charles/claude-hooks#182
## Summary

The service has no `SIGTERM` / `SIGINT` handler. Every `systemctl restart claude-hooks` (or any signal-based stop) kills in-flight agent tasks immediately. The `TimeoutStopSec=300` configured in `claude-hooks.service` is decorative — nothing listens for the drain window, so Bun just exits.

## Reproducer
`systemctl --user restart claude-hooks` while a task is running. The `task_history` row is left `running` forever; the agent work is lost; the webhook won't re-fire (no state change on the Forgejo side).

Hit twice today after a systemd unit edit (once clearing `src/main.ts` → `apps/server/src/main.ts`, once adding `ExecStartPre/Post` hooks). Each restart dropped boss-2#174 and designer-default#181; I had to re-toggle the assignee / label on each to re-dispatch.

## Root cause
`apps/server/src/main.ts` ends with the `Bun.serve` call. There is no `process.on("SIGTERM", …)` / `SIGINT` handler anywhere in the tree. The `currentAbort` AbortController on each worker is wired up only for the `/cancel` HTTP route (see `handleCancel` in `main.ts`), never invoked from a signal.
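The missing wiring could look roughly like this. It is a sketch, not the implementation: `stopAcceptingRequests`, `allBusyWorkersSettled`, and `forceAbortAll` are hypothetical placeholders for the real server/worker plumbing, and the drain budget default mirrors the 60 s proposed below.

```typescript
type DrainOutcome = "settled" | "forced";

// Waits up to drainMs for `settled` to resolve; if the budget runs out
// first, calls forceAbort() and reports that the drain was forced.
async function drain(
  settled: Promise<void>,
  forceAbort: () => void,
  drainMs: number,
): Promise<DrainOutcome> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const outcome = await Promise.race([
    settled.then((): DrainOutcome => "settled"),
    new Promise<DrainOutcome>((resolve) => {
      timer = setTimeout(() => {
        forceAbort();
        resolve("forced");
      }, drainMs);
    }),
  ]);
  clearTimeout(timer);
  return outcome;
}

// Signal wiring sketch: stop taking new work, drain, then exit.
function installShutdownHandlers(
  stopAcceptingRequests: () => void,     // hypothetical: gate /task etc. to 503
  allBusyWorkersSettled: () => Promise<void>, // hypothetical: resolves when idle
  forceAbortAll: () => void,             // hypothetical: currentAbort.abort() on each busy worker
  drainMs = 60_000,
): void {
  for (const sig of ["SIGTERM", "SIGINT"] as const) {
    process.once(sig, async () => {
      stopAcceptingRequests();
      const outcome = await drain(allBusyWorkersSettled(), forceAbortAll, drainMs);
      console.log(`[shutdown] bye (${outcome})`);
      process.exit(0);
    });
  }
}
```

The `Promise.race` keeps the happy path (all tasks settle inside the budget) and the force-abort path in one place, which is also what makes it easy to unit-test without real signals.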
## Acceptance criteria

### Signal handler

- `shutdown.ts` (or inline in `main.ts`) registers `SIGTERM` + `SIGINT` handlers
- Close `Bun.serve` so `/task`, `/webhook/forgejo`, `/breakdown`, `/architect/chat` reject with 503 (or TCP-close). Webhook events that fire during shutdown are dropped; Forgejo will re-deliver on the next restart / the next label event
- Drain window `SHUTDOWN_DRAIN_MS` (default 60_000, configurable via `config/agents.json::shutdown.drain_ms`, capped below `TimeoutStopSec`)
- On drain expiry, `worker.currentAbort.abort()` on each busy worker; mark `task_history` rows `cancelled` with reason `shutdown`
- TERM the in-container `claude` PID (written to a known file at dispatch start — see "Orphan guard" below). This prevents the container-side SDK from chewing Pro Max quota with no listener
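The config knob could look like this; the surrounding shape of `config/agents.json` is an assumption, only the `shutdown.drain_ms` key is named in the criteria above:

```json
{
  "shutdown": {
    "drain_ms": 60000
  }
}
```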
### Orphan guard (containerised agents)

- `agent-runner.runAgentTask` writes the spawned CLI's in-container PID to a well-known path (e.g. `/tmp/claude.pid` inside the container) at dispatch start
- On force-abort, issue `docker exec <container> kill -TERM $(cat /tmp/claude.pid)` before declaring drain complete
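The container-side TERM can be assembled as an argv rather than a shell string on the host; a sketch, where the container name is illustrative and `/tmp/claude.pid` is the well-known path from this issue:

```typescript
// Sketch: build the argv that TERMs the in-container CLI PID.
// `sh -c` runs inside the container so $(cat …) expands there, not on the host.
function killClaudeArgv(container: string, pidFile = "/tmp/claude.pid"): string[] {
  return ["docker", "exec", container, "sh", "-c", `kill -TERM "$(cat ${pidFile})"`];
}
```

The argv would then be handed to `Bun.spawn` (or equivalent) during force-abort, before declaring drain complete.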
### Observability

- `broadcastSSE({ type: "service_shutdown", drain_ms, busy_workers: [...] })` on signal receipt so the dashboard can show a banner instead of silently flipping to disconnected
- Log lines: `[shutdown] draining N tasks`, `[shutdown] task <id> settled after <ms>`, `[shutdown] force-abort after <ms>`, `[shutdown] bye`
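Serialised as a standard Server-Sent Events frame, the broadcast could look like this; the payload field names come from this issue, the helper itself is a sketch:

```typescript
// Sketch: format the service_shutdown notice as an SSE frame.
// An SSE frame is a `data: <json>` line terminated by a blank line.
function shutdownFrame(drainMs: number, busyWorkers: string[]): string {
  const payload = {
    type: "service_shutdown",
    drain_ms: drainMs,
    busy_workers: busyWorkers,
  };
  return `data: ${JSON.stringify(payload)}\n\n`;
}
```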
### Tests

- `shutdown.test.ts` — fake worker with an abortable task; SIGTERM triggers graceful drain within budget
- `shutdown.test.ts` — drain exceeded → force-abort fires, task marked `cancelled` with reason `shutdown`
- `shutdown.test.ts` — new `/task` POSTs during drain return 503
- `kill -TERM` is issued against the in-container PID on force-abort
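The "fake worker with an abortable task" fixture can be tiny; a sketch with illustrative names, modelling a worker's `currentAbort` with a plain `AbortSignal`:

```typescript
// Sketch: a fake worker task that settles after `workMs`, or rejects
// as soon as the provided AbortSignal fires (modelling currentAbort.abort()).
function fakeWorkerTask(signal: AbortSignal, workMs: number): Promise<string> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve("settled"), workMs);
    signal.addEventListener("abort", () => {
      clearTimeout(timer);
      reject(new Error("cancelled: shutdown"));
    });
  });
}
```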
### Docs

- Document that `systemctl restart claude-hooks` drains up to 60 s before force-killing

## Out of scope
- `src/dashboard.html` (the HTTP-server close covers it). The M18-3 SPA is static-served and requires no drain.

## Dependencies
## References

- `apps/server/src/main.ts:1578` (no signal handler after `Bun.serve`)
- `PATCH /issues/174` assignee reset.
- `TimeoutStopSec=300` in `claude-hooks.service` is waiting for a drain handler that doesn't exist.