bug(reconcile): agents-sync leaves stopped containers stopped — every restart drops half the fleet #188

Closed
opened 2026-04-20 20:45:12 +00:00 by claude-desktop · 0 comments
Collaborator

## Summary

After every `systemctl restart claude-hooks`, about half of the agent containers stay in `exited` state — the ones targeted by `containers-down` (all `-default` instances). Webhook dispatches to those agents succeed at enqueue but then silently fail when trying to `docker exec` into a stopped container. The fleet half-breaks on every restart.

## Reproducer (witnessed 2026-04-20)

1. Start with all 8 agent containers `Up` (3 `-2` pool members + 5 `-default` instances).
2. `systemctl --user restart claude-hooks`.
3. Observe: the 3 `-2` containers stay up; all 5 `-default` containers show `Exited (137)`:

   ```
   claude-hooks-dev-default               Exited (137) 41 minutes ago
   claude-hooks-reviewer-2                Up 9 hours
   claude-hooks-boss-2                    Up 9 hours
   claude-hooks-dev-2                     Up 9 hours
   claude-hooks-reviewer-default          Exited (137) 40 minutes ago
   claude-hooks-designer-default          Exited (137) 41 minutes ago
   claude-hooks-design-reviewer-default   Exited (137) 41 minutes ago
   claude-hooks-boss-default              Exited (137) 41 minutes ago
   ```

4. Dispatch an issue to any `-default` agent. Service logs show:

   ```
   [designer-default] enqueued … charles/claude-hooks#187 (depth: 1)
   [designer-default] starting …
   [container-watchdog] claude-hooks-designer-default: container present but not running; Docker restart-policy should recover
   ```

   but the agent never actually runs. Task sits in limbo; webhook won't re-fire.

Manually running `docker start claude-hooks-<name>` on each one unblocks them. That workaround was applied at 2026-04-20T20:42Z to keep the M18-5.1 / M19-0 dispatches flowing.

## Root cause

Two asymmetric recipes are glued into the systemd unit:

**`ExecStopPost=just containers-down`** (`justfile`) iterates `config/agents.json::types` and stops `claude-hooks-<type>-default` for each. It stops the 5 `-default` instances but leaves the `-2` pool members alone (they're SQLite rows, not types).

**`ExecStartPre=just agents-sync`** (`justfile`) runs `apps/server/src/container-reconcile.ts::reconcileAll`, which, per `reconcileOne` (line 272):

```ts
if (!present) {
  await dockerRun(agent, image);
  return "created";
}
if (snap && matchesDesired(snap, agent, image)) {
  return "unchanged";     // ← BUG: doesn't check snap.State.Running
}
```

A stopped-but-present container matches the desired image + config, so `reconcileOne` returns `"unchanged"` and never issues `docker start`. The reconcile decision table (comment at lines 254-267) doesn't have a "running" column at all — it's only present/absent + DB/config match.

Result: stopped `-default` containers from `containers-down` stay stopped through `agents-sync` and survive the restart in the wrong state.

## Acceptance criteria

### Fix the reconcile gap

- [ ] `container-reconcile.ts::reconcileOne` gains a "present but not running" branch: when the container exists and matches the desired config but `State.Running === false`, issue `docker start <name>` and return a new `"started"` action variant.
- [ ] The per-instance reconcile outcome table in the doc comment (lines 254-267) gains a `running` column distinguishing the three present states: running-matching / stopped-matching / stopped-or-running-drift.
- [ ] `inspectContainer` already returns `State` — verify it isn't discarded before `matchesDesired` can check it. If it is, expose the running state in the snapshot.
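The checklist above amounts to a three-way decision. A minimal sketch of that decision as a pure helper — everything here beyond the names quoted from `reconcileOne` (`present`, `snap`, `matchesDesired`, the `"created"`/`"unchanged"` variants) is a hypothetical illustration, not the real signatures:

```typescript
// Hypothetical sketch of the reconcile decision with the missing "running"
// dimension added. The Snapshot/Action shapes and the "recreated" variant
// are assumptions for illustration.
type Snapshot = { running: boolean };
type Action = "created" | "started" | "unchanged" | "recreated";

function decide(snap: Snapshot | null, matchesDesired: boolean): Action {
  if (snap === null) return "created";      // absent -> docker run
  if (!matchesDesired) return "recreated";  // present but drifted -> rebuild
  if (!snap.running) return "started";      // NEW: present + matching + stopped -> docker start
  return "unchanged";                       // present + matching + running -> nothing to do
}
```

With this shape, the stopped-but-matching containers from the reproducer map to `"started"` instead of falling through to `"unchanged"`.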

### Fix the asymmetric systemd recipes

- [ ] `containers-down` should iterate the same source of truth as `agents-sync` (the SQLite `agents` table) instead of `config/agents.json::types`. Today it leaves the pool-member `-2` containers running; that's inconsistent with its name and with how a full service stop should behave.
- [ ] Alternative: **remove `ExecStopPost=just containers-down` entirely**. The agent containers are long-lived; they should survive service restarts so in-flight work continues. A service restart that kills half the fleet is a worse outcome than one that keeps everything running (the SIGTERM-drain bug #182 is the right place to solve the in-flight task problem).

### Tests

- [ ] `container-reconcile.test.ts`: add a case where the container is present + matches + **stopped**; reconcile returns `"started"` and calls `docker start` (mocked).
- [ ] `container-reconcile.test.ts`: "present + matches + running" still returns `"unchanged"` with no docker mutation.
- [ ] Integration-ish: `just agents-sync` run from the command line against a stopped container actually starts it (manual smoke test after merge).
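A minimal shape for the first two cases, assuming the docker client can be injected — the real `reconcileOne` signature and the project's mocking setup aren't shown in this ticket, so the names below are hypothetical stand-ins:

```typescript
// Hand-rolled fake standing in for the mocked docker client.
type Snap = { running: boolean };
type Docker = { start: (name: string) => void; started: string[] };

function fakeDocker(): Docker {
  const started: string[] = [];
  return { started, start: (name) => { started.push(name); } };
}

// Stand-in for the fixed reconcileOne, restricted to the present+matching path.
function reconcilePresentMatching(
  name: string,
  snap: Snap,
  docker: Docker,
): "started" | "unchanged" {
  if (!snap.running) {
    docker.start(name); // stopped + matching -> issue docker start
    return "started";
  }
  return "unchanged";   // running + matching -> no docker mutation
}
```

The second case should assert not only the return value but also that no docker call was recorded — that's the regression guard the existing `"unchanged"` path must keep.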

### Docs

- [ ] The CLAUDE.md "Container reconciliation" section notes the new `"started"` action variant.
- [ ] The justfile comment on `containers-down` is either updated to match the new SQLite-sourced iteration or removed if we delete the recipe.

## Out of scope

- Fixing the SIGTERM-drain-on-restart issue (#182) — that's the right fix for "don't interrupt in-flight work during a restart"; this ticket is narrower: "if a restart happens, the fleet must come back up".
- Container image upgrades / rebuilds (`containers-rebuild`) — already a separate code path.
- The `Exited (137)` exit code is a red herring — likely from `docker stop` sending SIGTERM then SIGKILL after the timeout, not an OOM kill. No memory issue here.

## Dependencies

- None. Backend-only fix in `apps/server/src/container-reconcile.ts`, plus optional justfile cleanup.

## References

- Live reproducer evidence: service log lines from 2026-04-20T20:40-20:42Z (`container-watchdog` detected the stopped `-default` containers but didn't recover them).
- Source of the bug: `apps/server/src/container-reconcile.ts:272` (`reconcileOne`).
- Related ticket: #182 (SIGTERM drain) — complementary, not a duplicate.
- Manual workaround used today: `for c in claude-hooks-{boss,dev,reviewer,designer,design-reviewer}-default; do docker start "$c"; done`.