fix(reconcile): docker start stopped-but-matching containers (#188) #189

Merged
code-lead merged 2 commits from boss/188 into main 2026-04-20 21:02:40 +00:00
Collaborator

Summary

Two changes that together keep the fleet up across service restarts — closes the restart-drops-half-the-fleet regression from #188.

1. reconcileOne gains a "present + matches + stopped → started" branch

  • ContainerSnapshot now carries running; inspectContainer reads State.Running from docker inspect.
  • When the container is present and config matches but running === false, fire docker start and return the new "started" action.
  • Decision matrix comment in container-reconcile.ts updated with a running column.
  • reconcileAll result gains a started: string[] bucket; CLI (just agents-sync) + startup log print the new line.

Before this fix, a stopped-but-matching container returned "unchanged" and docker exec dispatches hung on an Exited (137) target.

2. ExecStopPost=just containers-down removed from the systemd unit

Agent containers are long-lived and should survive service restarts so in-flight work continues. #182's SIGTERM drain handles in-flight tasks separately — that's the right place for "don't interrupt work during a restart", not containers-down.

containers-down / containers-up recipes now iterate docker ps (the runtime source of truth) so pool-member instances (-2, -3, etc.) are covered too, not just <type>-default from config/agents.json.

Tests

  • container-reconcile.test.ts: new "starts a stopped-but-matching container and returns 'started'" case — asserts docker start fires with no recreate / remove.
  • Existing "returns 'unchanged' when the running container matches desired state" tightened to assert no start call fires either.
  • reconcileAll test added for the started bucket surfacing through to the outcome.
  • dockerInspectJson helper defaults running=true so existing tests keep matching the "running" branch; the #188 tests pass false.

All 20 reconcile tests + 721 total server tests pass; biome + typecheck clean.

Docs

  • CLAUDE.md "Container reconciliation" section documents the four action variants including started, plus a new "Restart behaviour (#188)" subsection.
  • claude-hooks.service comments explain why ExecStopPost is gone.
  • Justfile comments on containers-down / containers-up updated for the docker ps iteration.

Test plan

  • bun x turbo run typecheck — clean
  • bun x biome check . — clean
  • bun x turbo run test — 721 pass
  • Manual smoke after merge: stop a -default container via docker stop, then run just agents-sync and verify the container wakes up with the new [agents-sync] started: … line.

Closes #188

## Summary Two changes that together keep the fleet up across service restarts — closes the restart-drops-half-the-fleet regression from #188. ### 1. `reconcileOne` gains a "present + matches + stopped → started" branch - `ContainerSnapshot` now carries `running`; `inspectContainer` reads `State.Running` from `docker inspect`. - When the container is present and config matches but `running === false`, fire `docker start` and return the new `"started"` action. - Decision matrix comment in `container-reconcile.ts` updated with a `running` column. - `reconcileAll` result gains a `started: string[]` bucket; CLI (`just agents-sync`) + startup log print the new line. Before this fix, a stopped-but-matching container returned `"unchanged"` and `docker exec` dispatches hung on an `Exited (137)` target. ### 2. `ExecStopPost=just containers-down` removed from the systemd unit Agent containers are long-lived and should survive service restarts so in-flight work continues. #182's SIGTERM drain handles in-flight tasks separately — that's the right place for "don't interrupt work during a restart", not `containers-down`. `containers-down` / `containers-up` recipes now iterate `docker ps` (the runtime source of truth) so pool-member instances (`-2`, `-3`, etc.) are covered too, not just `<type>-default` from `config/agents.json`. ### Tests - `container-reconcile.test.ts`: new `"starts a stopped-but-matching container and returns 'started'"` case — asserts `docker start` fires with no recreate / remove. - Existing `"returns 'unchanged' when the running container matches desired state"` tightened to assert no `start` call fires either. - `reconcileAll` test added for the `started` bucket surfacing through to the outcome. - `dockerInspectJson` helper defaults `running=true` so existing tests keep matching the "running" branch; the `#188` tests pass `false`. All 20 reconcile tests + 721 total server tests pass; biome + typecheck clean. ### Docs - `CLAUDE.md` "Container reconciliation" section documents the four action variants including `started`, plus a new "Restart behaviour (#188)" subsection. - `claude-hooks.service` comments explain why `ExecStopPost` is gone. - Justfile comments on `containers-down` / `containers-up` updated for the `docker ps` iteration. ## Test plan - [x] `bun x turbo run typecheck` — clean - [x] `bun x biome check .` — clean - [x] `bun x turbo run test` — 721 pass - [ ] Manual smoke after merge: stop a `-default` container via `docker stop`, then run `just agents-sync` and verify the container wakes up with the new `[agents-sync] started: …` line. Closes #188
fix(reconcile): docker start stopped-but-matching containers (#188)
Some checks failed
qa / qa (pull_request) Failing after 6s
qa / dockerfile (pull_request) Successful in 7s
b1228fbeab
Two changes that together keep the fleet up across service restarts:

1. `reconcileOne` gains a "present + matches + stopped → started" branch.
   `inspectContainer` now returns `State.Running`; when config matches
   but the container is stopped, issue `docker start` and report the
   new "started" action. Before this fix, a stopped-but-matching
   container returned "unchanged" and `docker exec` dispatches hung on
   an `Exited (137)` target.

2. `ExecStopPost=just containers-down` removed from the systemd unit.
   Agent containers are long-lived and should survive service restarts
   so in-flight work continues; #182's SIGTERM drain handles the
   in-flight-task case separately. The `containers-down`/`containers-up`
   recipes now iterate `docker ps` (the runtime source of truth) so
   pool-member instances are covered too, not just `<type>-default`.

Closes #188

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(ci): escape {{ in containers-{up,down} docker ps format string
All checks were successful
qa / qa (pull_request) Successful in 3m10s
qa / dockerfile (pull_request) Successful in 10s
37c7d8baa6
The #188 recipes used `docker ps --format '{{.Names}}'` verbatim, but
`just` treats `{{...}}` as template interpolation. The lexer hit `.` after
`{{` and blew up with "Unknown start of token" before any recipe could be
dispatched — so the qa.yml "Validate justfile" step's `just --summary`
call reported `Missing required recipe: ci-setup` on every PR.

Escape `{{` as `{{{{` per the just book (`}}` on its own doesn't need
escaping — only the opening brace pair triggers interpolation parsing).
Verified locally: `just --summary` now lists every recipe and
`just --dry-run containers-up` expands to the literal `'{{.Names}}'`
docker expects.
reviewer approved these changes 2026-04-20 21:02:00 +00:00
reviewer left a comment

Review — Round 1

CI: green (run #1790, 3m20s, all 721 tests pass)

All acceptance criteria from #188 are met. The implementation is correct and well-scoped.

Core fix (container-reconcile.ts)

The present + matches + stopped → docker start branch is implemented cleanly:

  • ContainerSnapshot.running correctly reads obj.State?.Running === true (strict equality, so any non-boolean from a malformed inspect safely defaults false).
  • dockerStart throws on non-zero exit — this is the right choice. A silently-no-op'd start would leave dispatches hanging on a docker exec into an Exited container, which is the exact failure mode being fixed.
  • The updated doc-comment decision matrix covers all six cases including the new running column. No ambiguity.
  • reconcileAll initialises started: [] and routes the new action correctly. The orphan loop correctly stays bounded to removed | unchanged (orphans always hit the !agent branch, can never start).

Systemd / justfile

  • ExecStopPost is gone and the comment explaining why is thorough — references both the asymmetry (#188) and where the in-flight task concern actually belongs (#182). Future operators won't be tempted to re-add it.
  • containers-up / containers-down now iterate docker ps (runtime truth) instead of agents.json::types, so pool-member -2 containers are covered. The {{{{.Names}} just-escape for the Go template format string is correct ({{{{ → literal {{, standalone }} → literal }}, shell sees {{.Names}}).

Tests

All three new cases cover what matters:

  1. reconcileOne "starts a stopped-but-matching container and returns 'started'" — asserts exactly one docker start call, zero run/stop/rm.
  2. reconcileOne "returns 'unchanged' when running container matches" — now additionally asserts zero start calls (guards the no-regression path).
  3. reconcileAll "reports 'started' bucket when a present-but-stopped container wakes" — confirms the bucket surfaces end-to-end.

The dockerInspectJson helper defaulting running=true is the right call: all pre-existing tests keep matching the running-branch without touching them.

Docs

CLAUDE.md "Container reconciliation" section correctly documents all four action variants and the new "Restart behaviour (#188)" subsection explains the asymmetry and the deliberate decision to not run containers-down on stop.

## Review — Round 1 **CI**: ✅ green (run #1790, 3m20s, all 721 tests pass) All acceptance criteria from #188 are met. The implementation is correct and well-scoped. ### Core fix (`container-reconcile.ts`) The `present + matches + stopped → docker start` branch is implemented cleanly: - `ContainerSnapshot.running` correctly reads `obj.State?.Running === true` (strict equality, so any non-boolean from a malformed inspect safely defaults `false`). - `dockerStart` **throws** on non-zero exit — this is the right choice. A silently-no-op'd start would leave dispatches hanging on a `docker exec` into an Exited container, which is the exact failure mode being fixed. - The updated doc-comment decision matrix covers all six cases including the new `running` column. No ambiguity. - `reconcileAll` initialises `started: []` and routes the new action correctly. The orphan loop correctly stays bounded to `removed | unchanged` (orphans always hit the `!agent` branch, can never `start`). ### Systemd / justfile - `ExecStopPost` is gone and the comment explaining why is thorough — references both the asymmetry (#188) and where the in-flight task concern actually belongs (#182). Future operators won't be tempted to re-add it. - `containers-up` / `containers-down` now iterate `docker ps` (runtime truth) instead of `agents.json::types`, so pool-member `-2` containers are covered. The `{{{{.Names}}` just-escape for the Go template format string is correct (`{{{{` → literal `{{`, standalone `}}` → literal `}}`, shell sees `{{.Names}}`). ### Tests All three new cases cover what matters: 1. `reconcileOne` "starts a stopped-but-matching container and returns 'started'" — asserts exactly one `docker start` call, zero `run`/`stop`/`rm`. 2. `reconcileOne` "returns 'unchanged' when running container matches" — now additionally asserts zero `start` calls (guards the no-regression path). 3. `reconcileAll` "reports 'started' bucket when a present-but-stopped container wakes" — confirms the bucket surfaces end-to-end. The `dockerInspectJson` helper defaulting `running=true` is the right call: all pre-existing tests keep matching the running-branch without touching them. ### Docs CLAUDE.md "Container reconciliation" section correctly documents all four action variants and the new "Restart behaviour (#188)" subsection explains the asymmetry and the deliberate decision to not run `containers-down` on stop.
code-lead deleted branch boss/188 2026-04-20 21:02:41 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
charles/claude-hooks!189
No description provided.