Investigate: claude-hooks-dev-default container disappears silently #132
Reference: charles/claude-hooks#132
## User story

As the operator, I want a documented root cause for the twice-observed silent disappearance of `claude-hooks-dev-default`, so that we can either fix it or at least detect + auto-heal it before it breaks the next dispatch.

## Context — observations
2026-04-20, two occurrences within a ~2 h window, both on the same host (desktop, 192.168.1.164):
- `docker ps -a` returned empty for `claude-hooks-dev-default`.
- `journalctl _COMM=dockerd` showed container creation at startup (sbJoin) and then nothing until reconcile recreated it at service restart. No destroy/stop event.
- Mitigation: `just containers-rebuild dev` + re-dispatch (became PR #121).
- `dev/123` was pushed at 12:34 — so the container ran long enough to complete the task, push the branch, and THEN died before it could `create_pull_request`.
- Not observed on other types — boss, reviewer, designer, and design-reviewer have all stayed `Up 12 h+` across the same window. Only dev vanishes.

## Acceptance criteria
### Investigation

- Rule OOM in or out (`journalctl -k`, cgroup memory limits for the container's systemd scope).
- Check whether anything external removed `claude-hooks-dev-default` by name (scripted cleanup, debugging session — check shell history, `auditctl` if available).
- Check for an interaction between the `--restart unless-stopped` policy and a specific exit code that Docker interprets as "don't restart" (e.g. exit 0 after SIGKILL forwarded by claude-code graceful shutdown).

### Deliverable
One of: fix PR / follow-up ticket / monitoring PR. For the monitoring option: a watchdog that periodically runs `docker ps --filter name=claude-hooks-*` and reconciles missing containers; reports an event if one vanishes between ticks.

### Optional
### Out of scope

Anything beyond "`docker ps` + reconcile missing".

### References
- `src/container-reconcile.ts`.
- `just containers-rebuild` recipe in `justfile`.

### Dependencies

`main`.

---

## Investigation findings (from the host side, no live access)
I can't poke the host from this sandbox, so this is a code-path and observation analysis — no live log forensics. Taking each hypothesis from the AC list:
### OOM / cgroup kill
Low confidence this is it. Exit code 137 = 128 + 9 (SIGKILL). The OOM-killer certainly produces 137, but so does every other external kill. The "dockerd logs show creation and then nothing until reconcile recreated it" line rules out the container being OOM-killed by docker itself — if the kernel OOM-killed a process inside, the container would still be present (in `exited` state) in `docker ps -a`. A kernel OOM on the parent dockerd is consistent with "no destroy event logged," but the rest of the fleet (`boss`, `reviewer`, `designer`, `design-reviewer`) stayed up across the same window, so the daemon itself wasn't dead. Ruled out barring new evidence.

### Docker restart-policy quirk
Low. `--restart unless-stopped` keeps bouncing an exited container forever unless `docker stop` was explicitly called. Neither observed occurrence saw a stop event. Also, this wouldn't explain `docker ps -a` returning empty — restart-policy never removes the container row. Ruled out.

### Claude CLI in-process crash tearing down the exec + container
Ruled out by structure. The claude CLI runs inside the container, spawned by `docker exec -i`. A CLI segfault kills the exec, not the container — the container process is pid 1 inside the namespace (a long-running `sleep infinity` per the Dockerfile, I'd expect). Crashing the exec would return a non-137 exit to the host and leave the container running. Doesn't match the symptom.

### Bun runtime abort on the host
Low. Bun aborting on the host would kill the `docker exec` child, but the container is not a child of that process — it's parented to dockerd. The container would stay up.

### External `docker rm` by name

Most likely residual hypothesis. Evidence:
- Only `dev-default` vanishes — if it were an environmental / kernel issue, every container in the same bind-mount dir / same image would flap.
- "No destroy event in `_COMM=dockerd`" is consistent with `docker rm` against a stopped container: dockerd doesn't log remove-of-stopped-container at the same verbosity as a kill-of-running one (you'd need `--log-level=debug` to see `containerRemove`).
- Occurrence #2's symptom fits `docker rm -f claude-hooks-dev-default` landing mid-exec. The exec's in-flight `chdir` to a worktree on the state volume races with the volume unmount; you get the observed `no such file or directory` error.
- The `dev-default` container is the one operators most often target manually (most frequent task traffic, most familiar name, type-word matches shell completion of `dev-*`). Muscle memory: `just containers-rebuild dev` was run after occurrence #1, so `claude-hooks-dev-default` was freshly on the operator's mind.

Confidence: moderate. No smoking-gun shell history from this side, but nothing else fits the "container removed entirely, no destroy event, only this one container, only this one host" pattern.
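A next occurrence could be caught in flight without raising dockerd to `--log-level=debug` by tailing `docker events`. A minimal sketch, assuming Docker's documented JSON event format (`Type`, `Action`, `Actor.Attributes.name`); the watcher function and its names are hypothetical, not part of this service:

```typescript
// Sketch: flag container destroy/die events for claude-hooks-* containers
// from `docker events --format '{{json .}}'` output (one JSON object per line).
import { spawn } from "node:child_process";

interface DockerEvent {
  Type: string;
  Action: string;
  Actor?: { Attributes?: { name?: string } };
}

function isSuspectRemoval(line: string, prefix = "claude-hooks-"): boolean {
  const ev: DockerEvent = JSON.parse(line);
  return (
    ev.Type === "container" &&
    (ev.Action === "destroy" || ev.Action === "die") &&
    (ev.Actor?.Attributes?.name ?? "").startsWith(prefix)
  );
}

// Hypothetical live watcher: stream `docker events` and report suspect lines.
function watchForRemovals(onHit: (line: string) => void): void {
  const proc = spawn("docker", [
    "events",
    "--filter", "type=container",
    "--format", "{{json .}}",
  ]);
  let buf = "";
  proc.stdout.on("data", (chunk: Buffer) => {
    buf += chunk.toString();
    let nl: number;
    while ((nl = buf.indexOf("\n")) >= 0) {
      const line = buf.slice(0, nl).trim();
      buf = buf.slice(nl + 1);
      if (line && isSuspectRemoval(line)) onHit(line);
    }
  });
}
```

Logging the matched event line plus a timestamp would give exactly the forensic record the dockerd journal lacks at default verbosity.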
### Distinguishing "stopped" vs. "removed"
Critical observation from the AC: both occurrences were removal, not stop. `docker ps -a` returning empty proves the container row was gone, not merely exited. A `docker stop` leaves the row behind in `exited` state — we'd have seen `claude-hooks-dev-default Exited (137) 15 minutes ago`, and Docker's restart-policy would have bounced it. That path is well-understood and not what we observed.

### Reproducibility
Cannot reproduce from this side. The observed condition: the container completed its task, pushed the branch, and died in the narrow window before `create_pull_request`.

### Hypothesis ranking
1. External `docker rm` by name (moderate confidence) — an operator script or session ran `docker rm claude-hooks-dev-default` during debug of a prior failure. Matches all four observations: single-container scope, no destroy event logged, removal not stop, after-task-success timing in #2.
2. A variant of the same: `just containers-rebuild dev` run in a shell with a stale alias, or a typo that only tore down a single container.

### Deliverable: monitoring PR
Per the AC's "one of: fix PR / follow-up ticket / monitoring PR" menu: the root cause isn't established confidently enough for a fix PR and doesn't need a refactor, so I'm landing the monitoring PR option.
New `src/container-watchdog.ts` runs every 60 s and:

- runs `docker ps -a --filter name=^claude-hooks-` → checks which expected containers are present and whether they're running.
- classifies each container as `running` / `stopped` / `missing` (the AC's "stopped vs. removed entirely" distinction).
- emits `container_missing`, `container_stopped`, `container_recreated`, `container_recreate_failed` events through the existing SSE broadcast so the dashboard sees them in real time.
- calls `reconcileOne(name)` to recreate any missing container in-place. Idempotent; the state volume survives.
- skips the tick if `docker ps` itself fails (daemon unreachable) — doesn't compound an outage by trying to recreate everything.

Wired into `main.ts` startup alongside `startSweeper`. Tests cover all four event types + the daemon-unreachable no-op.

PR incoming.
### Suggested follow-ups (out of scope here)
- Next time `container_missing` events fire in prod: we'll have the timestamp + instance name in the dashboard's event log, which narrows the window to ~60 s. Cross-reference against shell history / audit log for a `docker rm claude-hooks-dev-default` in that window. That's the shortest path to confirming or refuting the "external `docker rm`" hypothesis.
- Harden `agent-runner.ts` to gate on a pre-spawn `docker inspect`.
- An auditd rule that logs `docker rm` targeting `claude-hooks-*` — would nail down the remaining ambiguity if the operator has auditd available. Not something this service can do for itself.

---

## Investigation — findings and hypothesis
### What we know (observations restated cleanly)
From the two 2026-04-20 incidents, the reproducible evidence is:
- The container row was gone from `docker ps -a`, not merely in `exited` state. This is the key fact — `--restart unless-stopped` can't act on a container that no longer exists.
- No `destroy` or `stop` entry in `journalctl _COMM=dockerd` for the affected container between startup and reconcile recreation.
- The error `docker exec` reports when the container goes away under it is a symptom of the container dying, not evidence of who killed the process inside.
- Only `dev` affected. `boss`, `reviewer`, `designer`, `design-reviewer` all stayed `Up 12 h+` across the same window on the same host. Five agents share the same image, same `--restart unless-stopped`, same state-volume layout, same systemd unit — only `dev` vanishes.
- In occurrence #2 the container completed the task, ran `git push`, and died in the ~1 s window between `git push` succeeding and `create_pull_request` firing.

### Hypothesis elimination
1. OOM / cgroup kill — a kernel OOM would leave the container in `exited` state, not remove it. The kernel's `oom_kill` logs would appear in `journalctl -k`, which the incident log doesn't report. Worth a confirmatory `dmesg | grep -i oom` / `journalctl -k --since` next time.
2. Docker restart-policy quirk — `--restart unless-stopped` never removes containers. It only restarts them. A container in `exited` state from this policy would still be in `docker ps -a`.
3. Claude CLI in-process crash — an exec-spawned process cannot remove the container, only kill that exec's PID. The container's PID 1 (`sleep infinity` / the image's keepalive) would still be running.
4. `--restart unless-stopped` + exit-code interaction — `unless-stopped` is state-based, not exit-code-based. It does not remove containers. Docker has no `--rm`-on-exit policy here.
5. External `docker rm` by name — the only hypothesis consistent with both "gone from `docker ps -a`" AND "no `destroy` event logged by dockerd". An operator-side `docker rm -f claude-hooks-dev-default` fits every observation. The `dev` exclusivity fits too — someone debugging the dev pool specifically (shell history from the affected window would confirm).
Confidence level: Medium. Hypothesis 5 is the only one consistent with all five observations, but we lack direct forensic evidence (no `auditctl` rule on `docker`; shell history not yet inspected). Could still be an OOM we haven't proven — a dmesg grep from the incident window would tell us.

### Why the monitoring PR is the right shape
Root cause is unconfirmed and the fix (if it's hypothesis 5) is an operator-side discipline issue, not a code change. Even if we later find a different root cause, a monitoring layer that detects and auto-heals the same failure mode is pure upside.
PR #134 adds `src/container-watchdog.ts`: a 60 s tick that runs `docker ps -a --filter name=^claude-hooks-`, diffs against `listResolvedAgents()`, and distinguishes the three states called out in the acceptance criteria:

- `running` — healthy, no action.
- `stopped` — present in `docker ps -a` but not running. `--restart unless-stopped` should bounce it; the watchdog emits a `container_stopped` event so a flapping agent is visible.
- `missing` — absent from `docker ps -a` entirely (the #132 failure mode). The watchdog emits `container_missing` + calls `reconcileOne()` + emits `container_recreated` or `container_recreate_failed`.

All four event types fan out through the existing SSE broadcast, so the dashboard surfaces the incident in real time.
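The three-state diff at the heart of that tick can be sketched as follows. This is illustrative, not the actual PR #134 code; it assumes `docker ps -a` output in the `--format '{{.Names}}\t{{.State}}'` shape:

```typescript
// Sketch of the watchdog's per-tick classification: diff expected container
// names against `docker ps -a` output, bucketing each as running / stopped /
// missing. Names and types are illustrative, not the real container-watchdog API.
type ContainerState = "running" | "stopped" | "missing";

function classify(
  expected: string[],
  psOutput: string, // lines of "<name>\t<state>" from `docker ps -a --format '{{.Names}}\t{{.State}}'`
): Map<string, ContainerState> {
  const seen = new Map<string, string>();
  for (const line of psOutput.split("\n")) {
    const [name, state] = line.split("\t");
    if (name) seen.set(name, state ?? "");
  }
  const result = new Map<string, ContainerState>();
  for (const name of expected) {
    const state = seen.get(name);
    if (state === undefined) {
      result.set(name, "missing"); // no row at all: the #132 failure mode
    } else if (state === "running") {
      result.set(name, "running"); // healthy, no action
    } else {
      result.set(name, "stopped"); // row exists (exited/created/paused) but not running
    }
  }
  return result;
}
```

A `missing` verdict is what would trigger the `container_missing` event and the recreate path; `stopped` is left to the restart policy and merely reported.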
The watchdog deliberately skips the tick on `docker ps` failure rather than recreating every container — a momentary daemon outage shouldn't compound.

### Recommended follow-ups (not in this PR)
- `auditctl -w /usr/bin/docker -p x -k docker-exec`. The next occurrence will log the exact PID + parent + command line of whatever called `docker rm`. Cheap, one-line install.
- Inspect shell history for `docker rm` / `docker stop claude-hooks-dev-default` commands. If found, hypothesis 5 is confirmed and the follow-up is an operator note, not code.
- The first `container_missing` event in the dashboard gives us a precise timestamp — correlate with `dmesg`, `journalctl`, and (if installed) auditd to narrow further.

### Acceptance criteria coverage
Reproduction scope is confirmed as reported (`dev` type only; only on the desktop host on 2026-04-20).

---

## Investigation findings (2026-04-20 late afternoon)
Watched dev-default's vanishing cycle across ~5 hours of live service traffic. The root cause is still not identified, but the investigation narrowed the blast radius and ruled out the obvious candidates.
### What dev-default actually looks like when it "vanishes"
- It is a clean `docker stop` / `docker rm`, not a crash or OOM. Dockerd logs `stopping restart-manager` + a task-delete from containerd + systemd `scope: Deactivated successfully`. No exit code 137, no kernel OOM, no `signal=killed` in logs.
- Every occurrence hit `claude-hooks-dev-default` today, exclusively this instance. `dev-2`, `boss-2`, `boss-default`, `reviewer-*`, `designer-*`, `design-reviewer-*` all stable for 3h+. The selectivity is the strongest clue we have.
- The `chdir to cwd` error is the docker exec failing because the worktree path inside the container doesn't exist — not because the container itself is missing. The container was actually removed 7 minutes later (11:10:40) — by me, running `just containers-rebuild dev` after that misdiagnosis. That single false signal has been muddying every subsequent occurrence; treating the two failure modes as the same thing was wrong.

### Ruled out
- OOM: `journalctl -k` has zero OOM entries in the relevant windows.
- `--restart unless-stopped` retry exhaustion — the policy has no retry cap.
- Our own `stopAndRemove` — only 2 callers (`reconcileOne` at startup + CRUD), both absent from service logs during the vanishing windows. The watchdog (#134) is a victim — it reconciles after the stop, not before.
- Operator action: `fc -l` shows zero `docker stop`, zero `just containers-rebuild` invocations between 14:45 and 16:03.
- Cron/timers: `proxmox-backup.timer` at 03:02 daily; nothing touching docker.

### Not yet ruled in or out
- An external `docker stop` we haven't found. Would need `auditd` rules on the `docker` binary to catch it live.
- Something about `dev-default` specifically — it's the alphabetical-first container in the reconciled set. Could matter if there's a bug that affects index 0 only. Worth testing by creating an instance that sorts before `dev-default` (e.g. `aaa-dev`) and seeing if it takes over the bad-luck slot.

### Collateral damage confirmed
The recurring recreation itself breaks session resume: every time dev-default is rebuilt, in-flight session files in `agent-env/dev/projects/` stay on disk, but the session-id → cwd linkage in `sessions.json` points at session UUIDs that were captured on the previous container's filesystem. Resume-of-X-failed-retrying-fresh happens at ~7 points in today's log, all on dev-default, all coinciding with post-recreation dispatches. Not a bug in the session-persist fix — a consequence of the disappearance.
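One way to make that failure mode visible before a dispatch would be to cross-check the session linkage against what actually exists on disk. A sketch under assumptions: the record shape and helper name are hypothetical, and it assumes each session leaves a `<sessionId>.jsonl` transcript in its projects dir:

```typescript
// Sketch: flag session entries whose recorded projects dir no longer contains
// the session's .jsonl transcript — i.e. sessions captured on a previous
// container's filesystem. SessionRecord is an assumed shape, not the real one.
import * as fs from "node:fs";
import * as path from "node:path";

interface SessionRecord {
  sessionId: string;
  projectsDir: string; // host-side bind source, e.g. under agent-env/dev/projects/
}

function findOrphanedSessions(records: SessionRecord[]): SessionRecord[] {
  return records.filter((r) => {
    const transcript = path.join(r.projectsDir, `${r.sessionId}.jsonl`);
    return !fs.existsSync(transcript); // transcript gone → resume will fail
  });
}
```

Running such a check at dispatch time would turn the "resume failed, retrying fresh" surprise into an up-front, attributable event.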
Also noticed: `~/.config/claude-hooks/agent-env/dev/projects/` is owned `root:root`, mode 755, and is empty. Every other agent's `projects/` dir is owned `charles:charles` and has 10-25 jsonl files. So dev's session persistence is broken from a second angle — the bind source was auto-created by Docker as root-owned back when dev-default first started without the `mkdir` precreate (pre-#125), and never got fixed. The uid-1000 `claude` user inside the container can't write there.
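The precreate that closes this gap can be sketched host-side: if the bind source already exists before `docker run`, Docker never auto-creates it as `root:root`. A minimal sketch, assuming uid/gid 1000 for the in-container `claude` user per the issue text (the function name is hypothetical):

```typescript
// Sketch: create the bind-mount source before `docker run` so Docker cannot
// auto-create it root-owned. chown is only attempted when running as root;
// on a non-root host the directory is created owned by the invoking user.
import * as fs from "node:fs";

function precreateBindSource(dir: string, uid = 1000, gid = 1000): void {
  fs.mkdirSync(dir, { recursive: true, mode: 0o755 }); // no-op if it exists
  if (typeof process.getuid === "function" && process.getuid() === 0) {
    fs.chownSync(dir, uid, gid); // only effective (and needed) as root
  }
}
```

Calling this before every container create makes the fix idempotent, so the one-off `sudo chown` below wouldn't need repeating after future rebuilds.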
### Proposed next steps

- Fix `dev/projects/` immediately: `sudo chown -R charles:charles ~/.config/claude-hooks/agent-env/dev/projects/`. One-line host-side fix; stops the resume-failure cascade.
- auditd-style observability on `docker stop` / `docker rm` calls. Add a rule that logs the calling PID + command line any time `docker rm claude-hooks-*` is invoked. Without this we're guessing.
- Spin up an `aaa-probe` agent (type=dev, noop). If it starts vanishing instead of `dev-default`, we know it's position-dependent (likely a reconcile / iterate bug). If dev-default still vanishes, it's instance-specific (something else about this one instance).

The investigation bounced off the wall on "who's calling docker stop?" without auditd. That's the one new capability we need before the next ticket can close this.
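The probe experiment rests on an ordering assumption: that the reconciler iterates container names in default lexicographic order, so an `aaa-`-prefixed instance would claim index 0 from `dev-default`. That assumption (mine, not confirmed from the reconciler code) is easy to sanity-check:

```typescript
// Sketch: confirm an `aaa-probe` instance sorts ahead of dev-default under
// default string comparison, so it would occupy index 0 if the reconciler
// iterates names in sorted order (an assumption about the reconciler).
function reconcileOrder(names: string[]): string[] {
  return [...names].sort(); // default lexicographic compare
}
```

If the reconciler turns out to use insertion order instead, the probe still isolates instance-specificity; it just stops testing the index-0 theory.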
Follow-up: auditd observability for `docker stop` / `docker rm` calls on claude-hooks-* containers #149