feat(watchdog): periodic docker ps + reconcile for missing containers #134

Merged
code-lead merged 1 commit from boss/132 into main 2026-04-20 11:55:26 +00:00
Collaborator

Closes #132.

Summary

Twice on 2026-04-20 the claude-hooks-dev-default container vanished from docker ps -a without a logged destroy event, breaking in-flight tasks with exit 137 and cryptic exec failed: OCI runtime errors. Per the issue's "one of: fix / follow-up / monitoring PR" menu, this is the monitoring PR — root cause is still unconfirmed (most likely culprit is an external docker rm by name; see the investigation comment on #132 for the full elimination).

What it does

New src/container-watchdog.ts runs every 60 s. On each tick:

  1. docker ps -a --filter name=^claude-hooks- → set of containers the daemon knows about, with running/stopped state.
  2. Compare against listResolvedAgents() filtered by container.enabled.
  3. Partition expected instances into:
    • running — in docker ps: nothing to do.
    • stopped — in docker ps -a but not running: emit container_stopped event, trust --restart unless-stopped to bounce.
    • missing — absent from docker ps -a entirely (the #132 failure mode): emit container_missing, call reconcileOne() to recreate in-place.
  4. Recreate outcomes reported as container_recreated or container_recreate_failed.
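
A minimal sketch of the three-way partition in steps 1–3, assuming a simplified shape for the docker ps rows (the names `DockerRow` and `partitionInstances` are illustrative, not the module's real API):

```typescript
// Simplified row shape: one entry per container found by docker ps -a.
type DockerRow = { name: string; running: boolean };

type Partition = {
  running: string[]; // present and running: nothing to do
  stopped: string[]; // present but not running: emit container_stopped
  missing: string[]; // absent from docker ps -a: emit container_missing, reconcile
};

// Hypothetical helper: compare expected instance names against observed rows.
function partitionInstances(expected: string[], rows: DockerRow[]): Partition {
  const byName = new Map(rows.map((r) => [r.name, r]));
  const out: Partition = { running: [], stopped: [], missing: [] };
  for (const name of expected) {
    const row = byName.get(name);
    if (!row) out.missing.push(name);
    else if (row.running) out.running.push(name);
    else out.stopped.push(name);
  }
  return out;
}
```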

Events fan out through the existing SSE broadcast so the dashboard surfaces the issue in real time instead of the operator learning from the next failed task.

Safety properties

  • Idempotent. reconcileOne() is the same code path startup + CRUD already use; the state volume survives recreation.
  • Skips the tick on docker ps failure. Daemon unreachable → recreating every container would compound the outage. The watchdog treats exitCode != 0 from docker ps as "don't know" and waits for the next tick. Next reconcileAll (via systemd ExecStartPre on service restart) converges once the daemon is back.
  • Never throws. Each per-instance failure becomes a container_recreate_failed event; the tick itself always completes so a bad actor doesn't kill the loop.
  • No false positives for host-mode agents. The expected-set filter is container.enabled === true, so host-mode types don't get phantom container_missing events.
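
The skip-on-failure and never-throw guards above can be sketched as follows; `runDockerPs`, `expectedMissing`, `reconcile`, and `emit` stand in for injected dependencies and are hypothetical names, not the module's actual signatures:

```typescript
type WatchdogEvent = { type: string; instance?: string; detail?: string };

// One tick: skip entirely when docker ps fails, and turn per-instance
// reconcile errors into events so the tick itself always completes.
async function tick(
  runDockerPs: () => Promise<{ exitCode: number; stdout: string }>,
  expectedMissing: (stdout: string) => string[],
  reconcile: (name: string) => Promise<void>,
  emit: (e: WatchdogEvent) => void,
): Promise<void> {
  const ps = await runDockerPs();
  if (ps.exitCode !== 0) return; // "don't know": wait for the next tick
  for (const name of expectedMissing(ps.stdout)) {
    emit({ type: "container_missing", instance: name });
    try {
      await reconcile(name);
      emit({ type: "container_recreated", instance: name });
    } catch (err) {
      // never throws: the failure becomes an event, the loop keeps going
      emit({ type: "container_recreate_failed", instance: name, detail: String(err) });
    }
  }
}
```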

Tests

src/container-watchdog.test.ts covers all four event types across six scenarios:

  • healthy → no event
  • stopped → container_stopped only (no recreate)
  • missing → container_missing + container_recreated on success
  • missing + docker run fails → container_recreate_failed with stderr captured
  • multiple instances of the same type: one healthy + one missing → only the missing one is touched
  • docker ps failure → no events, no reconcile calls (the "don't compound the outage" guard)

All 508 existing tests still pass.

Not in this PR

  • Live root-cause fix. See the issue comment for the current hypothesis ranking and suggested follow-ups (auditd rule, retry-once-on-137 mitigation).
  • Dashboard UI surface for the new event types. The events flow through SSE today; rendering them as a dedicated panel is a separate ticket if wanted.

Test plan

  • CI green (typecheck + biome + bun test)
  • Manual: on the desktop host, stop claude-hooks-dev-default out of band → within 60 s the watchdog emits container_stopped, Docker restart-policy brings it back.
  • Manual: docker rm -f claude-hooks-dev-default → within 60 s the watchdog emits container_missing + container_recreated, next dispatch works.
  • Manual: check dashboard SSE stream shows the new event types.

🤖 Generated with Claude Code

feat(watchdog): periodic docker ps + reconcile for silently missing containers (#132)
All checks were successful
qa / qa (pull_request) Successful in 3m0s
qa / dockerfile (pull_request) Successful in 9s
6540bfbe59
Partial fix for #132: twice-observed `claude-hooks-dev-default`
disappearance on 2026-04-20. Adds `src/container-watchdog.ts` that
every 60s compares `docker ps -a --filter name=^claude-hooks-` against
the SQLite agents table, partitions into running/stopped/missing, and
calls reconcileOne() to recreate anything missing. Emits
container_missing / container_stopped / container_recreated /
container_recreate_failed events through the existing SSE broadcast
so the dashboard surfaces the issue in real time instead of waiting
for the next task failure. Skips the tick when docker ps itself fails
so a daemon outage doesn't compound.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reviewer approved these changes 2026-04-20 11:53:55 +00:00
reviewer left a comment

Review: APPROVED

CI: green (run #1700, success).

Summary

Solid monitoring PR that directly addresses the #132 acceptance criteria — the "monitoring PR" option from the issue's deliverable menu. Clean, well-tested, and safe.

Acceptance criteria check

  • Periodic docker ps + reconcile for missing containers
  • Distinguishes running / stopped / missing — the explicit AC from #132: "distinguish between 'container stopped then restart-policy respected' vs. 'container was removed entirely'"
  • Events surface via the existing SSE broadcast for real-time dashboard visibility
  • Long-form investigation comment on #132 (referenced in PR)

Code correctness

container-watchdog.ts

Single docker ps -a --filter name=^claude-hooks- --format {{.Names}} {{.State}} call answers both "is it in docker ps -a?" and "is it running?" in one round-trip. The three-way partition (running / stopped / missing) is correct.
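
For illustration, parsing that one-call output into a name-to-running map could look like this (`parsePsOutput` is a hypothetical name; the real module's parsing may differ):

```typescript
// With --format "{{.Names}} {{.State}}", each output line looks like:
//   claude-hooks-dev-default running
//   claude-hooks-prod-a exited
// Presence in the map answers "is it in docker ps -a?"; the value answers
// "is it running?" in the same round-trip.
function parsePsOutput(stdout: string): Map<string, boolean> {
  const state = new Map<string, boolean>();
  for (const line of stdout.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    const [name, st] = trimmed.split(/\s+/);
    state.set(name, st === "running");
  }
  return state;
}
```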

reconcileOne returning "unchanged" or "removed" for a supposedly-missing container → container_recreate_failed with a descriptive detail. Good defensive coding — the operator sees it rather than a silent no-op.

Never-throw invariant is solid: per-instance errors are caught and logged; the tick always completes.

main.ts wiring

startWatchdog({ defaultImage: webhookConfig.containerImage, onEvent → broadcastSSE }) is minimal and correct. defaultImage comes from the same config object that reconcileAll uses at startup — consistent image across all reconcile paths. No intervalMs override → 60 s default matches the stated ~1-minute detection goal.
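
A sketch of what that wiring implies for the option shape; startWatchdog, webhookConfig, and broadcastSSE are the PR's names, but the WatchdogOptions type and function body here are assumptions:

```typescript
type WatchdogOptions = {
  defaultImage: string;
  intervalMs?: number; // per the PR, defaults to 60 s when omitted
  onEvent: (event: { type: string; instance?: string }) => void;
};

// Hypothetical shape: start the periodic tick and return a stop handle
// so shutdown can clear the interval.
function startWatchdog(opts: WatchdogOptions): () => void {
  const interval = setInterval(() => {
    // each tick would run the docker ps + partition + reconcile pass
  }, opts.intervalMs ?? 60_000);
  return () => clearInterval(interval);
}
```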

Tests

All six scenarios covered: healthy / stopped / missing+success / missing+fail (docker daemon rejects run) / multiple instances (one healthy, one missing) / docker-ps-failure (the "don't compound the outage" guard). Separate runner doubles for the watchdog path and the reconcile path correctly verify independent injectability.

Minor observation (not a blocker)

The else branch in runWatchdogTick where reconcileOne returns "unchanged" or "removed" for a missing container has no dedicated test. This is defensive code for a rare race condition; the important paths (created + throw) are both covered. Fine as-is.

reviewer approved these changes 2026-04-20 11:54:05 +00:00
code-lead deleted branch boss/132 2026-04-20 11:55:26 +00:00