bug(reconcile): agents-sync leaves stopped containers stopped — every restart drops half the fleet #188
## Summary

After every `systemctl restart claude-hooks`, about half of the agent containers stay in `exited` state: the ones targeted by `containers-down` (all `-default` instances). Webhook dispatches to those agents succeed at enqueue but then silently fail when trying to `docker exec` into a stopped container. The fleet half-breaks on every restart.

## Reproducer (witnessed 2026-04-20)
1. Bring the fleet up (3 `-2` pool members + 5 `-default` instances).
2. `systemctl --user restart claude-hooks`.
3. The `-2` containers stay up; all 5 `-default` containers show `Exited (137)`.
4. Dispatch a webhook to a `-default` agent. The service logs a successful enqueue, but the agent never actually runs. The task sits in limbo; the webhook won't re-fire.

Manual `docker start claude-hooks-<name>` on each one unblocks them. That workaround was applied at 2026-04-20T20:42Z to keep M18-5.1 / M19-0 dispatches flowing.

## Root cause
Two asymmetric recipes are glued into the systemd unit:
- `ExecStopPost=just containers-down` (justfile) iterates `config/agents.json::types` and stops `claude-hooks-<type>-default` for each. It stops the 5 `-default` instances but leaves the `-2` pool members alone (they're SQLite rows, not types).
- `ExecStartPre=just agents-sync` (justfile) runs `apps/server/src/container-reconcile.ts::reconcileAll`, which defers to `reconcileOne` (line 272). A stopped-but-present container matches the desired image + config, so `reconcileOne` returns `"unchanged"` and never issues `docker start`. The reconcile decision table (comment at lines 254-267) has no "running" column at all; it only considers present/absent and DB/config match.

Result: stopped `-default` containers from `containers-down` stay stopped through `agents-sync` and survive the restart in the wrong state.
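The gap can be modeled in a few lines. The names below (`ContainerSnapshot`, the `Action` variants, this `reconcileOne` signature) are illustrative stand-ins, not the real module's API:

```typescript
// Minimal model of the current decision: present/absent and config match
// are consulted, running state is not. All names are illustrative.
type Action = "unchanged" | "create" | "recreate";

interface ContainerSnapshot {
  present: boolean;        // a container with this name exists
  running: boolean;        // docker inspect State.Running (never consulted!)
  matchesDesired: boolean; // image + config match the desired state
}

function reconcileOne(c: ContainerSnapshot): Action {
  if (!c.present) return "create";
  if (!c.matchesDesired) return "recreate";
  return "unchanged"; // bug: c.running is ignored, so stopped stays stopped
}

// A container stopped by containers-down still matches its config:
const verdict = reconcileOne({ present: true, running: false, matchesDesired: true });
console.log(verdict); // "unchanged"; no `docker start` is ever issued
```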
## Acceptance criteria

### Fix the reconcile gap

- `container-reconcile.ts::reconcileOne` gains a "present but not running" branch: when the container exists and matches the desired config but `State.Running === false`, issue `docker start <name>` and return a new `"started"` action variant.
- Extend the decision-table comment with a `running` column distinguishing the three present-states: running-matching / stopped-matching / stopped-or-running-drift.
- `inspectContainer` already returns `State`; verify it's not being discarded before `matchesDesired` can check it. If it is, expose running state in the snapshot.
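A sketch of that branch, under assumed names (the real `reconcileOne` signature differs). The docker call is injected so the decision stays testable without a daemon:

```typescript
// Proposed shape: a stopped container that matches the desired config is
// started rather than left as "unchanged". All names are illustrative.
type Action = "unchanged" | "create" | "recreate" | "started";

interface ContainerSnapshot {
  name: string;
  present: boolean;
  running: boolean;        // from docker inspect State.Running
  matchesDesired: boolean; // image + config match the desired state
}

function reconcileOne(
  c: ContainerSnapshot,
  dockerStart: (name: string) => void, // wraps `docker start <name>`
): Action {
  if (!c.present) return "create";
  if (!c.matchesDesired) return "recreate";
  if (!c.running) {
    dockerStart(c.name); // the branch that is missing today
    return "started";
  }
  return "unchanged";
}
```

With the docker effect injected, the two unit cases listed under Tests fall out directly: stopped+matching asserts one `dockerStart` call and `"started"`; running+matching asserts no call and `"unchanged"`.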
### Fix the asymmetric systemd recipes

Pick one:

- `containers-down` iterates the same source of truth as `agents-sync` (the SQLite `agents` table) instead of `config/agents.json::types`. Today it leaves pool-member `-2` containers running, which is inconsistent with its name and with how a full service stop should behave.
- Or drop `ExecStopPost=just containers-down` entirely. The agent containers are long-lived; they should survive service restarts so in-flight work continues. A service restart that kills half the fleet is a worse outcome than one that keeps everything running (the SIGTERM-drain bug #182 is the right place to solve the in-flight task problem).
### Tests

- `container-reconcile.test.ts`: add a case where the container is present + matches + stopped; reconcile returns `"started"` and calls `docker start` (mocked).
- `container-reconcile.test.ts`: "present + matches + running" still returns `"unchanged"` with no docker mutation.
- `just agents-sync` from the command line against a stopped container actually starts it (manual smoke test after merge).
"started"action variant.containers-downeither updated to match the new SQLite-sourced iteration OR removed if we delete the recipe.Out of scope
## Out of scope

- Image rebuilds (`containers-rebuild`): already a separate code path.
- The `Exited (137)` exit code is a red herring; 137 is 128 + SIGKILL, likely from `docker stop` sending SIGTERM then SIGKILL after the timeout, not an OOM. No memory issue here.
## Dependencies

- Touches `apps/server/src/container-reconcile.ts`, plus optional justfile cleanup.
## References

- `container-watchdog` detected the stopped `-default` containers but didn't recover them.
- `apps/server/src/container-reconcile.ts:272` (`reconcileOne`).
- Manual workaround: `for c in claude-hooks-{boss,dev,reviewer,designer,design-reviewer}-default; do docker start "$c"; done`.