bug(server): SIGTERM kills in-flight tasks — no graceful drain #182

Closed
opened 2026-04-20 19:11:58 +00:00 by claude-desktop · 0 comments
Collaborator

Summary

The service has no SIGTERM / SIGINT handler. Every systemctl restart claude-hooks (or any signal-based stop) kills in-flight agent tasks immediately. The TimeoutStopSec=300 configured in claude-hooks.service is decorative — nothing listens for the drain window, so Bun just exits.

Reproducer

  1. Dispatch any long-running task (assign an issue to an agent).
  2. systemctl --user restart claude-hooks while the task is running.
  3. Observe: the task's task_history row is left running forever; the agent work is lost; the webhook won't re-fire (no state change on Forgejo side).

Hit twice today after a systemd unit edit (once changing src/main.ts → apps/server/src/main.ts, once adding ExecStartPre/Post hooks). Each restart dropped boss-2#174 and designer-default#181; I had to re-toggle the assignee / label on each to re-dispatch.

Root cause

apps/server/src/main.ts ends with:

Bun.serve({ hostname: HOST, port: PORT, fetch: handleRequest });

No process.on("SIGTERM", …) / SIGINT anywhere in the tree. The currentAbort AbortController on each worker is wired up only for the /cancel HTTP route (see main.ts handleCancel), never invoked from a signal.
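The missing wiring is small. A minimal sketch of the handler registration — `shutdown` here is a hypothetical routine standing in for the drain phases spelled out in the acceptance criteria:

```typescript
// Minimal sketch of the missing signal wiring (hypothetical; the real
// shutdown routine would run the drain phases from the acceptance criteria).
let shuttingDown = false;

async function shutdown(signal: string): Promise<void> {
  if (shuttingDown) return; // second signal: just wait for systemd's SIGKILL
  shuttingDown = true;
  console.log(`[shutdown] received ${signal}`);
  // phase 1: stop accepting new work (close the HTTP server)
  // phase 2: drain in-flight tasks up to SHUTDOWN_DRAIN_MS
  // phase 3: force-abort stragglers, mark task_history rows cancelled
  // phase 4: container cleanup, then exit
}

for (const sig of ["SIGTERM", "SIGINT"] as const) {
  process.once(sig, () => {
    void shutdown(sig).finally(() => process.exit(0));
  });
}
```

Registering via `process.once` keeps a repeated signal from re-entering the drain logic.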

Acceptance criteria

Signal handler

  • New shutdown.ts (or inline in main.ts) registers SIGTERM + SIGINT handlers
  • Handler phases:
    1. Stop accepting new work — close Bun.serve so /task, /webhook/forgejo, /breakdown, /architect/chat reject with 503 (or TCP-close). Webhook events that fire during shutdown are dropped; Forgejo will re-deliver on next restart / the next label event
    2. Wait for in-flight tasks up to SHUTDOWN_DRAIN_MS (default 60_000, configurable via config/agents.json::shutdown.drain_ms, capped below TimeoutStopSec)
    3. Force-abort on timeout — call worker.currentAbort.abort() on each busy worker; mark task_history rows cancelled with reason shutdown
    4. Container cleanup — for container-mode workers with an in-flight task, send SIGTERM to the in-container CLI PID (the agent container writes its claude PID to a known file at dispatch start — see "Orphan guard" below). This prevents the container-side SDK from chewing Pro Max quota with no listener
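Phase 2's wait-with-budget amounts to racing the busy workers' settlement against a timer. A self-contained sketch — the helper name and shape are illustrative, not the actual main.ts API:

```typescript
// Illustrative drain helper: resolves true if every in-flight task settles
// within the budget, false if the budget expires first (→ force-abort path).
async function drainWithin(tasks: Promise<unknown>[], budgetMs: number): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const expired = new Promise<false>((resolve) => {
    timer = setTimeout(() => resolve(false), budgetMs);
  });
  const settled = Promise.allSettled(tasks).then(() => true as const);
  const drained = await Promise.race([settled, expired]);
  clearTimeout(timer); // don't let a stale timer keep the process alive
  return drained;
}
```

`Promise.allSettled` (not `Promise.all`) is deliberate: a task that rejects during drain still counts as settled and shouldn't short-circuit the wait for its siblings.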

Orphan guard (containerised agents)

  • agent-runner.runAgentTask writes the spawned CLI's in-container PID to a well-known path (e.g. /tmp/claude.pid inside the container) at dispatch start
  • Shutdown handler: for each busy container-mode worker, docker exec <container> kill -TERM $(cat /tmp/claude.pid) before declaring drain complete
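The exec invocation above can be assembled like this — `killAgentArgv` is a made-up helper name, and the pid-file path is the one proposed above:

```typescript
// Hypothetical helper building the docker-exec argv that SIGTERMs the
// in-container CLI via the pid file written at dispatch start.
function killAgentArgv(container: string, pidFile = "/tmp/claude.pid"): string[] {
  // sh -c so $(cat …) is resolved inside the container, not on the host
  return ["docker", "exec", container, "sh", "-c", `kill -TERM "$(cat ${pidFile})"`];
}
```

Passing the argv array to a spawn call (rather than a shell string on the host) keeps the command substitution inside the container's shell.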

Observability

  • broadcastSSE({ type: "service_shutdown", drain_ms, busy_workers: [...] }) on signal receipt so the dashboard can show a banner instead of silently flipping to disconnected
  • Structured log lines per phase: [shutdown] draining N tasks, [shutdown] task <id> settled after <ms>, [shutdown] force-abort after <ms>, [shutdown] bye
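The event payload might look like the following — field names are taken from the bullet above; the interface itself is a sketch, not an existing type:

```typescript
// Sketch of the shutdown SSE payload; only the fields named in this issue.
interface ServiceShutdownEvent {
  type: "service_shutdown";
  drain_ms: number;
  busy_workers: string[];
}

function makeShutdownEvent(drainMs: number, busyWorkers: string[]): ServiceShutdownEvent {
  return { type: "service_shutdown", drain_ms: drainMs, busy_workers: busyWorkers };
}
```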

Tests

  • shutdown.test.ts — fake worker with an abortable task; SIGTERM triggers graceful drain within budget
  • shutdown.test.ts — drain exceeded → force-abort fires, task marked cancelled with reason shutdown
  • shutdown.test.ts — new /task POSTs during drain return 503
  • Integration (optional): docker-exec stub confirms kill -TERM is issued against the in-container PID on force-abort
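The second test case (drain exceeded → force-abort) can be mocked without touching real workers. A self-contained sketch, assuming nothing about the eventual shutdown.ts API beyond the `currentAbort` field the issue already names:

```typescript
// Self-contained mock of the force-abort path: a worker whose task never
// settles, so the budget expires and currentAbort fires.
interface FakeWorker {
  currentAbort: AbortController;
  task: Promise<void>;
}

async function settleOrForce(worker: FakeWorker, budgetMs: number): Promise<"drained" | "forced"> {
  const expired = new Promise<"forced">((r) => setTimeout(() => r("forced"), budgetMs));
  const outcome = await Promise.race([worker.task.then(() => "drained" as const), expired]);
  if (outcome === "forced") worker.currentAbort.abort(); // the force-abort phase
  return outcome;
}
```

Asserting on `worker.currentAbort.signal.aborted` after the budget expires covers the "force-abort fires" half of the criterion; the `task_history` update would be asserted separately against the DB layer.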

Docs

  • CLAUDE.md: new "Graceful shutdown" subsection describing the drain budget and how to tune it
  • README: add to "Commands" — note that systemctl restart claude-hooks drains up to 60s before force-killing

Out of scope

  • Graceful shutdown of the legacy src/dashboard.html (the HTTP-server close covers it). The M18-3 SPA is static-served and requires no drain.
  • Checkpoint-and-resume of in-flight SDK conversations across restarts — Claude Agent SDK doesn't expose mid-turn serialisation.
  • Re-queueing force-aborted tasks on next boot. Operator-triggered re-dispatch (label toggle) is fine for the single-operator use case.

Dependencies

  • None. Backend-only story. Can land anytime.

References

  • Root cause trace: apps/server/src/main.ts:1578 (no signal handler after Bun.serve)
  • Reproducer witnessed 2026-04-20: two service restarts dropped #174 and #181 each time; operator had to manually re-dispatch via label toggle + PATCH /issues/174 assignee reset.
  • Systemd unit TimeoutStopSec=300 in claude-hooks.service is waiting for a drain handler that doesn't exist.