Investigate: claude-hooks-dev-default container disappears silently #132

Closed
opened 2026-04-20 11:34:34 +00:00 by claude-desktop · 3 comments
Collaborator

User story

As the operator, I want a documented root cause for the twice-observed silent disappearance of claude-hooks-dev-default so that we can either fix it or at least detect + auto-heal it before it breaks the next dispatch.

Context — observations

2026-04-20, two occurrences within a ~2 h window, both on the same host (desktop, 192.168.1.164):

  1. First (during issue #117 round 1) — task ran, exited with code 137 (SIGKILL). docker ps -a returned empty for claude-hooks-dev-default. journalctl _COMM=dockerd showed container creation at startup (sbJoin) and then nothing until reconcile recreated it at service restart. No destroy/stop event. Mitigation: just containers-rebuild dev + re-dispatch (became PR #121).
  2. Second (during issue #123) — task 2203834c acquired worktree and invoked claude. Exit 137 immediately. dockerd logs:
    OCI runtime exec failed: exec failed: unable to start container process:
    chdir to cwd ("/state/worktrees/dev-default/charles__claude-hooks__dev%2F123")
    set in config.json failed: no such file or directory
    
    Container was gone when the exec landed. Branch dev/123 was pushed at 12:34 — so the container ran long enough to complete the task, push the branch, and THEN died before it could create_pull_request.

Not observed on other types — boss, reviewer, designer, design-reviewer have all stayed Up 12 h+ across the same window. Only dev vanishes.

Acceptance criteria

Investigation

  • Enumerate possible causes and eliminate them one by one:
    • OOM (check dmesg, journalctl -k, cgroup memory limits for the container's systemd scope)
    • Docker daemon killing the container due to restart-policy quirk (container restart count, healthcheck, etc.)
    • Claude Code CLI bug — in-process segfault that tears down docker's exec + the container itself
    • Bun runtime abort leaving an orphan exec shim
    • External process killing claude-hooks-dev-default by name (scripted cleanup, debugging session — check shell history, auditctl if available)
    • The --restart unless-stopped policy + a specific exit code that Docker interprets as "don't restart" (e.g. exit 0 after SIGKILL forwarded by claude-code graceful shutdown)
  • Reproduce the failure if possible. At minimum, describe the conditions under which it's been observed (task type, duration, memory footprint).
  • Distinguish between "container stopped then restart-policy respected" vs. "container was removed entirely" — the latter is what we observed.

Deliverable

  • Long-form comment on this issue with findings, hypothesis, and confidence level.
  • One of:
    • A fix PR (if the root cause is clear and in-scope).
    • A follow-up ticket for the fix (if the root cause needs a larger refactor).
    • A monitoring PR (if we can't fix but can detect): health-check loop in the service that pings docker ps --filter name=claude-hooks-* and reconciles missing containers; reports an event if one vanishes between ticks.

Optional

  • Temporary mitigation PR: treat exit 137 from claude-code as a retryable failure (once) with a container-health check in between.

Out of scope

  • Migrating away from Docker — not here.
  • Adding a full health-check framework — the minimum acceptable monitoring is "periodic docker ps + reconcile missing".

References

  • Incident log (journalctl excerpts) available from the operator on request — both occurrences happened on 2026-04-20 between ~11:00 and ~12:35 CEST.
  • Reconcile path: src/container-reconcile.ts.
  • Runtime container config: just containers-rebuild recipe in justfile.

Dependencies

  • Blocked by: nothing.
  • Blocks: nothing directly, but operator confidence in the pool architecture depends on this.
  • Branch off: main.
claude-desktop removed their assignment 2026-04-20 11:39:06 +00:00
Collaborator

Investigation findings (from the host side, no live access)

I can't poke the host from this sandbox, so this is a code-path and observation analysis — no live log forensics. Taking each hypothesis from the AC list:

OOM / cgroup kill

Low confidence this is it. Exit code 137 = 128 + 9 (SIGKILL). OOM-killer certainly produces 137, but so does every other external kill. The "dockerd logs show creation and then nothing until reconcile recreated it" line rules out the container being OOM-killed by docker itself — if the kernel OOM-killed a process inside, the container would still be present (in exited state) in docker ps -a. A kernel OOM on the parent dockerd is consistent with "no destroy event logged," but the rest of the fleet (boss, reviewer, designer, design-reviewer) stayed up across the same window, so the daemon itself wasn't dead. Ruled out barring new evidence.

Docker restart-policy quirk

Low. --restart unless-stopped keeps bouncing an exited container forever unless docker stop was explicitly called. Neither observed occurrence saw a stop event. Also, this wouldn't explain docker ps -a returning empty — restart-policy never removes the container row. Ruled out.

Claude CLI in-process crash tearing down the exec + container

Ruled out by structure. The claude CLI runs inside the container, spawned by docker exec -i. A CLI segfault kills the exec, not the container — the container process is pid 1 inside the namespace (a long-running sleep infinity per the Dockerfile, I'd expect). Crashing the exec would return a non-137 exit to the host and leave the container running. Doesn't match the symptom.

Bun runtime abort on the host

Low. Bun aborting on the host would kill the docker exec child but the container is not a child of that process — it's parented to dockerd. Container would stay up.

External docker rm by name

Most likely residual hypothesis. Evidence:

  • Only dev-default vanishes — if it were an environmental / kernel issue every container in the same bind-mount dir / same image would flap.
  • "No destroy event in journalctl _COMM=dockerd" is consistent with docker rm against a stopped container: dockerd doesn't log remove-of-stopped-container at the same verbosity as a kill-of-running one (you'd need --log-level=debug to see containerRemove).
  • Occurrence #2's sequence (task completes → branch pushes → then 137) is consistent with an out-of-band docker rm -f claude-hooks-dev-default landing mid-exec. The exec's in-flight chdir to a worktree on the state volume races with the volume unmount; you get the observed no such file or directory error.
  • The dev-default container is the one operators most often target manually (most frequent task traffic, most familiar name, type-word matches shell completion of dev-*). Muscle memory: just containers-rebuild dev was run after occurrence #1, so claude-hooks-dev-default was freshly on the operator's mind.

Confidence: moderate. No smoking-gun shell history from this side, but nothing else fits the "container removed entirely, no destroy event, only this one container, only this one host" pattern.

Distinguishing "stopped" vs. "removed"

Critical observation from the AC: both occurrences were removal, not stop. docker ps -a returning empty proves the container row was gone, not merely exited. A docker stop leaves the row behind in exited state — we'd have seen claude-hooks-dev-default Exited (137) 15 minutes ago and Docker's restart-policy would have bounced it. That path is well-understood and not what we observed.
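
The distinction is also cheap to check per container. A minimal sketch (assuming Bun; containerState is an illustrative name, not a repo function):

```ts
// Illustrative sketch (Bun), not code from this repo. Distinguishes the three
// states discussed above: "running", "stopped" (row present but exited), and
// "missing" (row removed entirely).
import { spawnSync } from "bun";

type ContainerState = "running" | "stopped" | "missing";

export function containerState(name: string): ContainerState {
  const proc = spawnSync(["docker", "inspect", "--format", "{{.State.Running}}", name]);
  if (proc.exitCode !== 0) {
    // docker inspect exits non-zero ("No such object") once the container
    // row is removed, which is the failure mode observed in both incidents.
    return "missing";
  }
  return proc.stdout.toString().trim() === "true" ? "running" : "stopped";
}
```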

Reproducibility

Cannot reproduce from this side. The observed conditions:

  • dev-type task, moderate duration (~minutes), pushed a branch, then attempted create_pull_request.
  • Same host as the rest of the fleet (192.168.1.164).
  • ~2 h apart, same calendar day.
  • No unusual system load observed.

Hypothesis ranking

  1. External docker rm by name (moderate confidence) — operator script or session ran docker rm claude-hooks-dev-default during debug of a prior failure. Matches all four observations: single-container scope, no destroy event logged, removal not stop, after-task-success timing in #2.
  2. Unknown dockerd regression in remove-on-exit behaviour (low confidence) — some combination of volume unmount + mount propagation races could conceivably schedule a remove, but this would affect other containers with the same mount shape (which it didn't).
  3. PEBKAC adjacent to #1 — just containers-rebuild dev in a shell with a stale alias or a typo that only tore down a single container.

Deliverable: monitoring PR

Per the AC's "one of: fix PR / follow-up ticket / monitoring PR" menu, the root cause isn't confident enough for a fix PR and doesn't need a refactor, so I'm landing the monitoring PR option.

New src/container-watchdog.ts runs every 60 s and (see the sketch after this list):

  • docker ps -a --filter name=^claude-hooks- → checks which expected containers are present and whether they're running.
  • Partitions into running / stopped / missing (the AC's "stopped vs. removed entirely" distinction).
  • Emits container_missing, container_stopped, container_recreated, container_recreate_failed events through the existing SSE broadcast so the dashboard sees them in real time.
  • Calls reconcileOne(name) to recreate any missing container in-place. Idempotent; the state volume survives.
  • Skips the tick entirely if docker ps itself fails (daemon unreachable) — doesn't compound an outage by trying to recreate everything.
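
A minimal sketch of that tick (the reconcileOne / broadcast shapes are assumptions based on the bullets above, not the service's actual signatures):

```ts
// container-watchdog sketch (Bun). reconcileOne / broadcast declarations are
// assumptions from the description above, not the service's real signatures.
import { spawnSync } from "bun";

declare function reconcileOne(name: string): Promise<void>;
declare function broadcast(event: string, data: Record<string, string>): void;

const EXPECTED: string[] = [
  "claude-hooks-dev-default",
  // ...the rest of the resolved agent set
];

async function tick(): Promise<void> {
  const ps = spawnSync([
    "docker", "ps", "-a",
    "--filter", "name=^claude-hooks-",
    "--format", "{{.Names}}\t{{.State}}",
  ]);
  if (ps.exitCode !== 0) return; // daemon unreachable: skip the tick, don't compound

  const state = new Map<string, string>();
  for (const row of ps.stdout.toString().trim().split("\n")) {
    const [name, st] = row.split("\t");
    if (name) state.set(name, st ?? "");
  }

  for (const name of EXPECTED) {
    const st = state.get(name);
    if (st === "running") continue; // healthy, no action
    if (st !== undefined) {
      // Present but not running: restart policy's job, just make it visible.
      broadcast("container_stopped", { name });
      continue;
    }
    // Absent from docker ps -a entirely: the #132 failure mode.
    broadcast("container_missing", { name });
    try {
      await reconcileOne(name);
      broadcast("container_recreated", { name });
    } catch {
      broadcast("container_recreate_failed", { name });
    }
  }
}

setInterval(() => void tick(), 60_000);
```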

Wired into main.ts startup alongside startSweeper. Tests cover all four event types + the daemon-unreachable no-op.

PR incoming.

Suggested follow-ups (out of scope here)

  • If we see container_missing events fire in prod: we'll have the timestamp + instance name in the dashboard's event log, which narrows the window to ~60 s. Cross-reference against shell history / audit log for a docker rm claude-hooks-dev-default in that window. That's the shortest path to confirming or refuting the "external docker rm" hypothesis.
  • Temporary exit-137 retry mitigation (AC "optional"): deferring — a single container-health check plus reconcile between retries is most of the value, and the watchdog already does the reconcile half. Retrying exit 137 indiscriminately risks re-running a task that already pushed its branch (#123 scenario). If we need it, it's a ~20-line patch in agent-runner.ts to gate on a pre-spawn docker inspect (sketched after this list).
  • Auditd rule for docker rm targeting claude-hooks-* — would nail down the remaining ambiguity if the operator has auditd available. Not something this service can do for itself.
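
For scale, a hedged sketch of what that gate could look like (runWithRetry and its parameters are hypothetical, not the repo's actual code):

```ts
// Hypothetical retry gate for agent-runner.ts: a sketch of the deferred
// mitigation, not the repo's actual code. runTask is assumed to return the
// claude-code exit status.
import { spawnSync } from "bun";

export async function runWithRetry(
  containerName: string,
  runTask: () => Promise<number>,
): Promise<number> {
  const exit = await runTask();
  if (exit !== 137) return exit;

  // Retry once, and only if the container is demonstrably back (the watchdog
  // may have recreated it in the meantime).
  const inspect = spawnSync([
    "docker", "inspect", "--format", "{{.State.Running}}", containerName,
  ]);
  const healthy =
    inspect.exitCode === 0 && inspect.stdout.toString().trim() === "true";
  // Deliberately does NOT handle the "branch already pushed" case from #123;
  // that gap is why the mitigation is deferred here.
  return healthy ? runTask() : exit;
}
```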
Collaborator

Investigation — findings and hypothesis

What we know (observations restated cleanly)

From the two 2026-04-20 incidents, the reproducible evidence is:

  1. Container was entirely absent from docker ps -a, not merely in exited state. This is the key fact — --restart unless-stopped can't act on a container that no longer exists.
  2. No destroy or stop entry in journalctl _COMM=dockerd for the affected container between startup and reconcile recreation.
  3. Exit 137 reported by the claude-hooks exec shim (SIGKILL), but this is what docker exec reports when the container goes away under it — it's a symptom of the container dying, not evidence of who killed the process inside.
  4. Only dev affected. boss, reviewer, designer, design-reviewer all stayed Up 12 h+ across the same window on the same host. Five agents share the same image, same --restart unless-stopped, same state-volume layout, same systemd unit — only dev vanishes.
  5. In the second incident the container survived long enough for the task to git push, and died in the ~1 s window between git push succeeding and create_pull_request firing.

Hypothesis elimination

  1. OOM kill (kernel) — Unlikely, not ruled out. An OOM kill would leave the container in exited state, not remove it. The kernel's oom_kill logs would appear in journalctl -k, which the incident log doesn't report. Worth a confirmatory dmesg | grep -i oom / journalctl -k --since next time.
  2. Docker daemon removed the container via a restart-policy quirk — Ruled out. --restart unless-stopped never removes containers; it only restarts them. A container in exited state from this policy would still be in docker ps -a.
  3. Claude Code CLI segfault tearing down the Docker exec + container — Ruled out. A segfault in an exec-spawned process cannot remove the container, only kill that exec's PID. The container's PID 1 (sleep infinity / the image's keepalive) would still be running.
  4. Bun runtime abort leaving an orphan exec shim — Ruled out (same reason). An orphaned exec shim on the host side is a leak, not a container removal. The container itself is unaffected by host-side process deaths.
  5. External process calling docker rm by name — Most likely, ranked #1. The only path consistent with "entirely absent from docker ps -a" AND "no destroy event logged by dockerd". An operator-side docker rm -f claude-hooks-dev-default fits every observation. The dev exclusivity fits too — someone debugging the dev pool specifically (shell history from the affected window would confirm).
  6. --restart unless-stopped + exit-code interaction — Ruled out. unless-stopped is state-based, not exit-code-based. It does not remove containers. Docker has no --rm-on-exit policy here.

Confidence level: Medium. Hypothesis 5 is the only one consistent with all five observations, but we lack direct forensic evidence (no auditctl rule on docker; shell history not yet inspected). Could still be an OOM we haven't proven — a dmesg grep from the incident window would tell us.

Why the monitoring PR is the right shape

Root cause is unconfirmed and the fix (if it's hypothesis 5) is an operator-side discipline issue, not a code change. Even if we later find a different root cause, a monitoring layer that detects and auto-heals the same failure mode is pure upside.

PR #134 adds src/container-watchdog.ts: a 60 s tick that runs docker ps -a --filter name=^claude-hooks-, diffs against listResolvedAgents(), and distinguishes the three states called out in the acceptance criteria:

  • running — healthy, no action.
  • stopped — present in docker ps -a but not running. --restart unless-stopped should bounce it; watchdog emits a container_stopped event so a flapping agent is visible.
  • missing — absent from docker ps -a entirely (the #132 failure mode). Watchdog emits container_missing + calls reconcileOne() + emits container_recreated or container_recreate_failed.

All four event types fan out through the existing SSE broadcast, so the dashboard surfaces the incident in real time.

The watchdog deliberately skips the tick on docker ps failure rather than recreating every container — a momentary daemon outage shouldn't compound.

Recommended follow-ups (not in this PR)

  1. Add an auditd rule on the desktop host: auditctl -w /usr/bin/docker -p x -k docker-exec. Next occurrence will log the exact PID + parent + command line of whatever called docker rm. Cheap, one-line install.
  2. Check shell history from the 2026-04-20 11:00–12:35 CEST window on the desktop for docker rm / docker stop claude-hooks-dev-default commands. If found, hypothesis 5 is confirmed and the follow-up is an operator note, not code.
  3. Optional mitigation PR (low priority now that the watchdog exists): treat exit 137 from claude-code as a retryable failure once, with a container-health check in between. Without the root-cause fix this trades a hard failure for a 60 s-delayed soft recovery; with the watchdog already recreating the container, the payoff is small enough to defer.
  4. If we see a recurrence after the watchdog lands, the new container_missing event in the dashboard gives us a precise timestamp — correlate with dmesg, journalctl, and (if installed) auditd to narrow further.

Acceptance criteria coverage

  • Causes enumerated and eliminated one by one (table above).
  • Conditions described (task type: any; duration: seconds-to-minutes; memory footprint: not measured; dev type only; only on the desktop host on 2026-04-20).
  • "Stopped vs entirely removed" distinction made explicit and preserved in the watchdog's event vocabulary.
  • Deliverable: monitoring PR (#134) with all four event types.
  • Confirmatory forensics (auditd / dmesg / shell history) — separately actionable, does not block this PR.
Author
Collaborator

Investigation findings (2026-04-20 late afternoon)

Watched dev-default's vanishing cycle across ~5 hours of live service traffic. The root cause is still not identified, but the investigation narrowed the blast radius and ruled out the obvious candidates.

What dev-default actually looks like when it "vanishes"

  1. It's an explicit docker stop / docker rm, not a crash or OOM. Dockerd logs stopping restart-manager + task-delete from containerd + systemd scope: Deactivated successfully. No exit code 137, no kernel OOM, no signal=killed in logs.
  2. Repeatable pattern — at least 6 recreations of claude-hooks-dev-default today, exclusively this instance. dev-2, boss-2, boss-default, reviewer-*, designer-*, design-reviewer-* all stable for 3h+. The selectivity is the strongest clue we have.
  3. The first "vanish" I reported (PR #117, ~11:03) was a misdiagnosis. The OCI chdir to cwd error is the docker exec failing because the worktree path inside the container doesn't exist — not because the container itself is missing. The container was actually removed 7 minutes later (11:10:40) — by me, running just containers-rebuild dev after that misdiagnosis. That single false signal has been muddying every subsequent occurrence; treating the two failure modes as the same thing was wrong.
  4. The later vanishings (14:22, 15:48, 16:02) were real external stops. No service-log entry, no shell history, no cron/timer fired between 14:53 (service restart) and 16:02:38 (explicit stop). Mystery stop.

Ruled out

  • OOM kill — journalctl -k has zero OOM entries in the relevant windows.
  • Docker storage-driver reap — Docker doesn't auto-remove containers; only explicit calls do.
  • --restart unless-stopped retry exhaustion — policy has no retry cap.
  • Claude-hooks service itself calling stopAndRemove — only 2 callers (reconcileOne at startup + CRUD), both absent from service logs during the vanishing windows. The watchdog (#134) is a victim — it reconciles after the stop, not before.
  • Operator shell history — fc -l shows zero docker stop, zero just containers-rebuild invocations between 14:45 and 16:03.
  • Scheduled jobs — only proxmox-backup.timer at 03:02 daily; nothing touching docker.

Not yet ruled in or out

  • A leftover bash / parent process calling docker stop we haven't found. Would need auditd rules on the docker binary to catch it live.
  • A Docker daemon bug or subtle restart-policy race — possible but zero log evidence.
  • Something about dev-default specifically — it's the alphabetical-first container in the reconciled set. Could matter if there's a bug that affects index 0 only. Worth testing by creating an instance that sorts before dev-default (e.g. aaa-dev) and seeing if it takes over the bad-luck slot.

Collateral damage confirmed

The recurring recreation itself breaks session resume: every time dev-default is rebuilt, in-flight session files in agent-env/dev/projects/ stay on disk, but the session-id → cwd linkage in sessions.json points at session UUIDs that were captured on the previous container's filesystem. Resume-of-X-failed-retrying-fresh happens at ~7 points in today's log, all on dev-default, all coinciding with post-recreation dispatches. Not a bug in the session-persist fix — a consequence of the disappearance.

Also noticed: ~/.config/claude-hooks/agent-env/dev/projects/ is owned root:root mode 755 and is empty. Every other agent's projects/ dir is owned charles:charles and has 10-25 jsonl files. So dev's session persistence is broken from a second angle — the bind source was auto-created by Docker as root-owned back when dev-default first started without the mkdir pre-create step (pre-#125), and never got fixed. The uid-1000 claude user inside the container can't write there.

Proposed next steps

  1. Fix the root-owned dev/projects/ immediately: sudo chown -R charles:charles ~/.config/claude-hooks/agent-env/dev/projects/. One-line host-side fix, stops the resume-failure cascade.
  2. File a follow-up ticket for auditd-style observability on docker stop / docker rm calls. Add a rule that logs the calling PID + command line any time docker rm claude-hooks-* is invoked. Without this we're guessing.
  3. Try the alphabetical-first hypothesis: create a temporary aaa-probe agent (type=dev, noop). If it starts vanishing instead of dev-default, we know it's position-dependent (likely a reconcile / iterate bug). If dev-default still vanishes, it's instance-specific (something else about this one instance).
  4. Keep the watchdog in place — it's not a diagnosis, but it's working as intended: every vanish today was recovered within 60 s.

The investigation hit a wall on "who's calling docker stop?" without auditd. That's the one new capability we need before the next ticket can close this.
