feat(watchdog): periodic docker ps + reconcile for missing containers #134
Closes #132.
Summary

Twice on 2026-04-20 the `claude-hooks-dev-default` container vanished from `docker ps -a` without a logged destroy event, breaking in-flight tasks with exit 137 and cryptic `exec failed: OCI runtime` errors. Per the issue's "one of: fix / follow-up / monitoring PR" menu, this is the monitoring PR — root cause is still unconfirmed (the most likely culprit is an external `docker rm` by name; see the investigation comment on #132 for the full elimination).

What it does
New `src/container-watchdog.ts` runs every 60 s. On each tick:

- `docker ps -a --filter name=^claude-hooks-` → set of containers the daemon knows about, with running/stopped state.
- `listResolvedAgents()` filtered by `container.enabled`.
- running — in `docker ps`: nothing to do.
- stopped — in `docker ps -a` but not running: emit a `container_stopped` event, trust `--restart unless-stopped` to bounce it.
- missing — absent from `docker ps -a` entirely (the #132 failure mode): emit `container_missing`, call `reconcileOne()` to recreate in place. Emits `container_recreated` or `container_recreate_failed`.

Events fan out through the existing SSE broadcast, so the dashboard surfaces the issue in real time instead of the operator learning about it from the next failed task.
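The three-way partition above can be sketched as pure functions over the `docker ps` output. This is a sketch, not the PR's actual code — `parseDockerPs`, `classify`, and the tab-separated format are illustrative assumptions:

```typescript
// Sketch of the per-tick classification. Assumes output shaped like
// `docker ps -a --filter name=^claude-hooks- --format '{{.Names}}\t{{.State}}'`.
type ContainerState = "running" | "stopped" | "missing";

// One docker round-trip answers both "does it exist?" and "is it running?".
function parseDockerPs(stdout: string): Map<string, string> {
  const seen = new Map<string, string>();
  for (const line of stdout.split("\n")) {
    const [name, state] = line.trim().split("\t");
    if (name) seen.set(name, state ?? "unknown");
  }
  return seen;
}

// Classify one expected container against what the daemon reported.
function classify(name: string, seen: Map<string, string>): ContainerState {
  const state = seen.get(name);
  if (state === undefined) return "missing"; // absent from `docker ps -a` — the #132 failure mode
  if (state === "running") return "running"; // nothing to do
  return "stopped"; // present but not running: restart policy's job
}
```

A container reported as `exited` classifies as stopped; one absent from the snapshot entirely classifies as missing and triggers the recreate path.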
Safety properties

- `reconcileOne()` is the same code path startup + CRUD already use; the state volume survives recreation.
- `docker ps` failure: daemon unreachable → recreating every container would compound the outage. The watchdog treats `exitCode != 0` from `docker ps` as "don't know" and waits for the next tick. The next `reconcileAll` (via systemd `ExecStartPre` on service restart) converges once the daemon is back.
- A failed recreate emits a `container_recreate_failed` event; the tick itself always completes, so one bad actor doesn't kill the loop.
- Only agent types with `container.enabled === true` are watched, so host-mode types don't get phantom `container_missing` events.

Tests
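The two guards above — skip the whole tick when `docker ps` itself fails, and never let one instance's failure abort the rest — could look roughly like this (a sketch under assumed names and signatures, not the PR's real implementation):

```typescript
// Sketch: a tick that bails on an unreachable daemon and isolates
// per-instance reconcile failures. All names here are illustrative.
interface PsResult { exitCode: number; stdout: string; }

async function runWatchdogTick(
  ps: () => Promise<PsResult>,
  instances: string[],
  reconcileOne: (name: string) => Promise<void>,
  emit: (event: string, name: string) => void,
): Promise<void> {
  const res = await ps();
  if (res.exitCode !== 0) return; // daemon unreachable: "don't know" — wait for the next tick

  const seen = new Set(
    res.stdout.split("\n").map((l) => l.trim().split("\t")[0]).filter(Boolean),
  );
  for (const name of instances) {
    if (seen.has(name)) continue;
    emit("container_missing", name);
    try {
      await reconcileOne(name);
      emit("container_recreated", name);
    } catch {
      emit("container_recreate_failed", name); // surfaced, but the tick goes on
    }
  }
}
```

Note the early `return` carries no events at all: a failed `docker ps` produces neither `container_missing` nor a reconcile attempt, which is exactly the "don't compound the outage" property.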
`src/container-watchdog.test.ts` covers all four event types:

- `container_stopped` only (no recreate)
- `container_missing` + `container_recreated` on success
- `docker run` fails → `container_recreate_failed` with stderr captured
- `docker ps` failure → no events, no reconcile calls (the "don't compound the outage" guard)

All 508 existing tests still pass.
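The tests above rely on an injectable docker runner, so each scenario can stub the CLI with canned results. A minimal sketch of that style — `stubRunner` and its shape are assumptions, not the real suite's helpers:

```typescript
// Sketch of a canned docker runner keyed on the subcommand ("ps", "run", ...).
// Unknown subcommands fail loudly, mimicking an unexpected CLI invocation.
type DockerResult = { exitCode: number; stdout: string; stderr: string };
type Runner = (...args: string[]) => Promise<DockerResult>;

function stubRunner(byCmd: Record<string, DockerResult>): Runner {
  return async (cmd) =>
    byCmd[cmd] ?? { exitCode: 1, stdout: "", stderr: `no stub for ${cmd}` };
}
```

A "missing + recreate fails" scenario would then stub `ps` with output omitting the container and `run` with a nonzero exit plus stderr, and assert a `container_recreate_failed` event carrying that stderr.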
Not in this PR
Test plan

- Unit tests (`bun test`)
- Stopped `claude-hooks-dev-default` out of band → within 60 s the watchdog emits `container_stopped`, Docker restart-policy brings it back.
- `docker rm -f claude-hooks-dev-default` → within 60 s the watchdog emits `container_missing` + `container_recreated`, next dispatch works.

🤖 Generated with Claude Code
Review: APPROVED ✅

CI: green (run #1700, success).

Summary
Solid monitoring PR that directly addresses the #132 acceptance criteria — the "monitoring PR" option from the issue's deliverable menu. Clean, well-tested, and safe.
Acceptance criteria check

- Periodic `docker ps` + reconcile for missing containers
- `running`/`stopped`/`missing` distinction — the explicit AC from #132: "distinguish between 'container stopped then restart-policy respected' vs. 'container was removed entirely'"

Code correctness
`container-watchdog.ts`

- Single `docker ps -a --filter name=^claude-hooks- --format '{{.Names}} {{.State}}'` call answers both "is it in `docker ps -a`?" and "is it running?" in one round-trip. The three-way partition (running/stopped/missing) is correct.
- `reconcileOne` returning `"unchanged"` or `"removed"` for a supposedly-missing container → `container_recreate_failed` with a descriptive detail. Good defensive coding — the operator sees it rather than a silent no-op.
- Never-throw invariant is solid: per-instance errors are caught and logged; the tick always completes.
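That defensive branch might be shaped like the following. The `"unchanged"`/`"removed"` return values come from the review text; everything else here is an illustrative assumption:

```typescript
// Sketch: a container we just classified as missing should reconcile to
// "created". Any other outcome means a race (someone recreated or removed it
// between the `docker ps` snapshot and our reconcile) — report, don't hide it.
type ReconcileOutcome = "created" | "unchanged" | "removed";

function eventForMissing(outcome: ReconcileOutcome): { event: string; detail?: string } {
  if (outcome === "created") return { event: "container_recreated" };
  return {
    event: "container_recreate_failed",
    detail: `expected to create missing container, reconcile returned "${outcome}"`,
  };
}
```

Mapping the surprising outcomes onto the existing `container_recreate_failed` event reuses the operator-facing channel instead of inventing a new event type for a rare race.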
`main.ts` wiring

`startWatchdog({ defaultImage: webhookConfig.containerImage, onEvent: → broadcastSSE })` is minimal and correct. `defaultImage` comes from the same config object that `reconcileAll` uses at startup — consistent image across all reconcile paths. No `intervalMs` override → the 60 s default matches the stated ~1-minute detection goal.
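The wiring could be approximated as below. The option names mirror the review text; the defaulting behavior and handle shape are assumptions for illustration:

```typescript
// Sketch of the startWatchdog entry point and its intervalMs defaulting.
interface WatchdogOptions {
  defaultImage: string;
  onEvent: (event: { type: string; instance: string }) => void;
  intervalMs?: number; // omitted in main.ts → 60 s default below
}

function startWatchdog(opts: WatchdogOptions): { intervalMs: number; stop: () => void } {
  const intervalMs = opts.intervalMs ?? 60_000;
  const timer = setInterval(() => {
    // the real tick (docker ps → classify → reconcile) would run here
  }, intervalMs);
  return { intervalMs, stop: () => clearInterval(timer) };
}
```

Returning a `stop` handle keeps the interval ownable by the caller, so tests (and a clean shutdown path) can tear the loop down deterministically.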
All six scenarios covered: healthy / stopped / missing+success / missing+fail (docker daemon rejects run) / multiple instances (one healthy, one missing) / docker-ps-failure (the "don't compound the outage" guard). Separate runner doubles for the watchdog path and the reconcile path correctly verify independent injectability.
Minor observation (not a blocker)

The `else` branch in `runWatchdogTick` where `reconcileOne` returns `"unchanged"` or `"removed"` for a missing container has no dedicated test. This is defensive code for a rare race condition; the important paths (created + throw) are both covered. Fine as-is.
Related: `claude-hooks-dev-default` container disappears silently #132 · `docker stop`/`docker rm` calls on claude-hooks-* containers #149