feat(server): graceful SIGTERM/SIGINT drain for in-flight tasks #184

Merged
code-lead merged 1 commit from boss/182 into main 2026-04-20 19:38:39 +00:00
Collaborator

Closes #182.

Summary

systemctl restart claude-hooks (or any signal-based stop) used to kill Bun immediately, leaving in-flight tasks stuck as running in task_history forever, the agent work lost, and no webhook re-delivery on the Forgejo side. TimeoutStopSec=300 in the systemd unit was decorative — nothing listened for the drain window.

This PR adds a four-phase graceful-shutdown handler in apps/server/src/shutdown.ts, registered from main.ts on SIGTERM / SIGINT:

  1. Stop accepting new work — Bun.serve.stop(true) closes the HTTP listener so /task, /webhook/forgejo, /breakdown, /architect/chat refuse at the TCP layer. Webhook events delivered mid-drain are dropped; Forgejo re-delivers after restart.
  2. Wait up to shutdown.drain_ms (default 60s, clamped to [1_000, 270_000] to fit safely under TimeoutStopSec=300) for in-flight tasks to finish naturally. Polls each worker's currentTask slot every 250ms and logs each settle time.
  3. Force-abort on timeout — calls currentAbort.abort() on every still-busy worker and persists its task_history row as cancelled with reason shutdown (so /history stops showing it running forever).
  4. Container-side SIGTERM — for each container-mode worker, docker exec <container> sh -c 'kill -TERM $(cat /tmp/claude.pid)' so the in-container Claude CLI stops chewing Pro-Max quota with no host-side listener.

Orphan guard

The container exec shim (ensureExecShim in container.ts) now wraps the claude spawn with sh -c 'echo $$ > /tmp/claude.pid; exec "$@"'. The inner exec preserves PID, so the recorded id is the actual claude process. Pidfile path lives on CONTAINER_PIDFILE_PATH so shutdown.ts and the shim agree.

Observability

Structured log lines per phase ([shutdown] draining N tasks, [shutdown] task <id> settled after <ms>, [shutdown] force-abort after <ms>, [shutdown] bye). SSE envelopes:

  • service_shutdown — carries drain_ms + busy_workers for the dashboard banner.
  • shutdown_received / shutdown_drained / shutdown_force_abort / shutdown_complete — phase markers with elapsed_ms + structured detail.

Config

Optional top-level shutdown block in config/agents.json:

{
  "shutdown": { "drain_ms": 120000 }
}

Clamped to [1_000, 270_000] ms. Non-number / malformed values degrade to the 60 s default with a warning.

Test plan

  • bun x turbo run test — 683 server tests pass (up from 673) including 10 new shutdown.test.ts cases and 5 new loader cases for shutdown.drain_ms.
    • no busy workers → immediate drain, correct SSE sequence
    • busy task settling inside budget → drained, no force-abort
    • drain exceeded → force-abort + onTaskShutdown fires + SSE envelope
    • container-mode worker → docker exec … sh -c 'kill -TERM …' invocation
    • docker exec failure → logged but drain continues, returns report
    • host-mode worker → force-aborted without docker exec
    • service_shutdown payload carries busy_workers + drain_ms
    • structured log lines per phase
    • installSignalHandlers disposer unregisters process listeners
    • registered handler invokes full drain on call
    • loader: default 60s / explicit / clamp floor / clamp ceiling / invalid → warning
  • bun x turbo run typecheck — clean across all three workspaces.
  • bun x biome check . — no lint or format issues.
  • Manual smoke: systemctl --user restart claude-hooks with an in-flight task and observe the drain banner + [shutdown] logs.
feat(server): graceful SIGTERM/SIGINT drain for in-flight tasks
All checks were successful
qa / qa (pull_request) Successful in 3m46s
qa / dockerfile (pull_request) Successful in 11s
cfa25ae23c
Closes #182.

`systemctl restart claude-hooks` (or any signal-based stop) used to kill
Bun immediately, leaving in-flight tasks with `running` forever in
task_history and no webhook re-delivery. Adds a four-phase handler in
`apps/server/src/shutdown.ts` registered from `main.ts` on SIGTERM/SIGINT:

1. `Bun.serve.stop(true)` — /task, /webhook/forgejo, /breakdown,
   /architect/chat refuse at the TCP layer.
2. Poll each worker's `currentTask` slot up to `shutdown.drain_ms`
   (default 60s, clamped to [1_000, 270_000] to fit under
   TimeoutStopSec=300).
3. Force-abort stragglers via `currentAbort.abort()` and mark their
   task_history rows `cancelled` with reason `shutdown`.
4. For container-mode workers, `docker exec <container> kill -TERM
   $(cat /tmp/claude.pid)` so the in-container CLI doesn't keep chewing
   Pro-Max quota after the host handler is gone.

Orphan guard: the container exec shim (`ensureExecShim`) now wraps the
`claude` spawn with `sh -c 'echo $$ > /tmp/claude.pid; exec "$@"'` so
the PID recorded matches the exec-replaced `claude` process. Path lives
on `CONTAINER_PIDFILE_PATH`.

Observability: SSE envelopes (`shutdown_received`, `service_shutdown`,
`shutdown_drained`, `shutdown_force_abort`, `shutdown_complete`) drive
the dashboard banner; structured log lines land in journalctl per phase.

Tests: `shutdown.test.ts` exercises all four phases (drain happy-path,
force-abort, docker-exec kill, docker failure logged-not-fatal,
host-mode worker skip, re-entry disposer, SSE event sequence); loader
tests cover default / override / clamp / invalid shutdown.drain_ms.

Docs: new "Graceful shutdown" sections in CLAUDE.md and README with
tuning guidance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reviewer approved these changes 2026-04-20 19:37:57 +00:00
reviewer left a comment

Review: APPROVED

CI green (run #1779, success). All four acceptance-criteria phases are correctly implemented and tested.

What I verified

Phase correctness

  • Phase 1 — server.stop(true) closes the HTTP listener at the TCP layer; new requests are refused before any handler runs.
  • Phase 2 — Polls listWorkers() freshly on every tick so tasks that finish between iterations are picked up without a dedicated "drained" signal. The deadline math (start + drainMs) is correct.
  • Phase 3 — currentAbort.abort() fires per busy worker; onTaskShutdown writes the cancelled/shutdown row to SQLite and broadcasts the SSE event so /history stops showing the task as running forever.
  • Phase 4 — Container kills are issued only against remaining (the force-aborted set), not against workers that drained naturally. Correct.

Orphan guard (PID file)
The shim runs:

sh -c 'echo $$ > "/tmp/claude.pid" 2>/dev/null || true; exec "$@"' sh "${CONTAINER_CLAUDE_EXECUTABLE}" "$@"

exec "$@" replaces the inner shell in-place — no fork — so the PID recorded in the file is exactly the PID the claude process will have. The shutdown handler's if [ -f ... ]; then kill -TERM $(cat ...); fi script is safe (constant path, no user input).

containerMode gate
containerMode: w.config.container?.enabled === true evaluates to false for the architect (container.enabled === false), so no spurious docker exec is attempted for host-mode workers. Confirmed by the host-mode worker → force-aborted without docker exec test.

Config loading
shutdown.drain_ms is clamped to [1_000, 270_000] ms, leaving ≥30 s margin under TimeoutStopSec=300. Non-number and drain_ms ≤ 0 degrade gracefully to the 60 s default with a warning.

Re-entry guard
The draining flag in installSignalHandlers prevents a second SIGTERM mid-drain from starting a second drain pass. Correct — a second signal is logged as ignored and systemd's SIGKILL will eventually take over if the drain hangs.

Test coverage

All 10 shutdown.test.ts cases match the behaviour described in the issue:

  • No-busy → immediate drain, correct SSE sequence
  • Task finishes inside budget → drained (no force-abort, no abortSeen)
  • Drain exceeded → force-abort fires, onTaskShutdown called, abortSeen = true
  • Container-mode worker → docker exec claude-hooks-dev-default sh -c '…/tmp/claude.pid…'
  • Docker exec returns non-zero → warn-logged, drain report still returns, no crash
  • Host-mode worker → force-aborted without touching docker runner
  • service_shutdown envelope carries busy_workers + drain_ms
  • Structured [shutdown] log lines emitted per phase
  • Disposer unregisters SIGTERM/SIGINT listeners
  • Registered handler runs full drain and calls onExit

5 loader tests for shutdown.drain_ms (default/explicit/clamp-floor/clamp-ceiling/invalid) also pass.

Minor observation (not blocking)

In the catch block of installSignalHandlers, the synthetic report passes totalElapsedMs: 0 rather than the actual elapsed time. Since onExit is process.exit(0) in production, nothing reads this value in the error path. Worth fixing at some point for log accuracy, but not a correctness concern here.

Acceptance criteria: all met

  • Signal handler in shutdown.ts
  • All four drain phases
  • Orphan guard (pidfile + container SIGTERM)
  • service_shutdown SSE with drain_ms + busy_workers
  • Structured log lines per phase
  • shutdown.test.ts covering drain/force-abort/container/host-mode
  • shutdown.drain_ms config with clamping
  • CLAUDE.md + README updated
code-lead deleted branch boss/182 2026-04-20 19:38:40 +00:00