fix(reconcile): docker start stopped-but-matching containers (#188) #189
No reviewers
Labels
No labels
area:agents
area:dashboard
area:database
area:design
area:design-review
area:flows
area:infra
area:meta
area:security
area:sessions
area:webhook
area:workdir
security
type:bug
type:chore
type:meta
type:user-story
No milestone
No project
No assignees
2 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
charles/claude-hooks!189
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "boss/188"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Two changes that together keep the fleet up across service restarts — closes the restart-drops-half-the-fleet regression from #188.
1.
reconcileOnegains a "present + matches + stopped → started" branchContainerSnapshotnow carriesrunning;inspectContainerreadsState.Runningfromdocker inspect.running === false, firedocker startand return the new"started"action.container-reconcile.tsupdated with arunningcolumn.reconcileAllresult gains astarted: string[]bucket; CLI (just agents-sync) + startup log print the new line.Before this fix, a stopped-but-matching container returned
"unchanged"anddocker execdispatches hung on anExited (137)target.2.
ExecStopPost=just containers-downremoved from the systemd unitAgent containers are long-lived and should survive service restarts so in-flight work continues. #182's SIGTERM drain handles in-flight tasks separately — that's the right place for "don't interrupt work during a restart", not
containers-down.containers-down/containers-uprecipes now iteratedocker ps(the runtime source of truth) so pool-member instances (-2,-3, etc.) are covered too, not just<type>-defaultfromconfig/agents.json.Tests
container-reconcile.test.ts: new"starts a stopped-but-matching container and returns 'started'"case — assertsdocker startfires with no recreate / remove."returns 'unchanged' when the running container matches desired state"tightened to assert nostartcall fires either.reconcileAlltest added for thestartedbucket surfacing through to the outcome.dockerInspectJsonhelper defaultsrunning=trueso existing tests keep matching the "running" branch; the#188tests passfalse.All 20 reconcile tests + 721 total server tests pass; biome + typecheck clean.
Docs
CLAUDE.md"Container reconciliation" section documents the four action variants includingstarted, plus a new "Restart behaviour (#188)" subsection.claude-hooks.servicecomments explain whyExecStopPostis gone.containers-down/containers-upupdated for thedocker psiteration.Test plan
bun x turbo run typecheck— cleanbun x biome check .— cleanbun x turbo run test— 721 pass-defaultcontainer viadocker stop, then runjust agents-syncand verify the container wakes up with the new[agents-sync] started: …line.Closes #188
Review — Round 1
CI: ✅ green (run #1790, 3m20s, all 721 tests pass)
All acceptance criteria from #188 are met. The implementation is correct and well-scoped.
Core fix (
container-reconcile.ts)The
present + matches + stopped → docker startbranch is implemented cleanly:ContainerSnapshot.runningcorrectly readsobj.State?.Running === true(strict equality, so any non-boolean from a malformed inspect safely defaultsfalse).dockerStartthrows on non-zero exit — this is the right choice. A silently-no-op'd start would leave dispatches hanging on adocker execinto an Exited container, which is the exact failure mode being fixed.runningcolumn. No ambiguity.reconcileAllinitialisesstarted: []and routes the new action correctly. The orphan loop correctly stays bounded toremoved | unchanged(orphans always hit the!agentbranch, can neverstart).Systemd / justfile
ExecStopPostis gone and the comment explaining why is thorough — references both the asymmetry (#188) and where the in-flight task concern actually belongs (#182). Future operators won't be tempted to re-add it.containers-up/containers-downnow iteratedocker ps(runtime truth) instead ofagents.json::types, so pool-member-2containers are covered. The{{{{.Names}}just-escape for the Go template format string is correct ({{{{→ literal{{, standalone}}→ literal}}, shell sees{{.Names}}).Tests
All three new cases cover what matters:
reconcileOne"starts a stopped-but-matching container and returns 'started'" — asserts exactly onedocker startcall, zerorun/stop/rm.reconcileOne"returns 'unchanged' when running container matches" — now additionally asserts zerostartcalls (guards the no-regression path).reconcileAll"reports 'started' bucket when a present-but-stopped container wakes" — confirms the bucket surfaces end-to-end.The
dockerInspectJsonhelper defaultingrunning=trueis the right call: all pre-existing tests keep matching the running-branch without touching them.Docs
CLAUDE.md "Container reconciliation" section correctly documents all four action variants and the new "Restart behaviour (#188)" subsection explains the asymmetry and the deliberate decision to not run
containers-downon stop.