fix(image+plugins): un-pin claude-code; install plugins from inside image #98

Merged
charles merged 3 commits from fix/unpin-claude-code into main 2026-04-19 19:06:31 +00:00
Collaborator

Summary

All five agents were dispatching for hours with zero plugins loaded, even though agents.json declared them and settings.json listed them as enabled. claude plugin list inside every container reported Status: ✘ failed to load — Plugin X not found in marketplace claude-plugins-official.

Root cause. Dockerfile pinned claude-code (2.1.110, per #83) while just agent-plugins-install ran on the host (auto-updated to 2.1.113). Installer wrote marketplace state in a format the older container loader couldn't resolve. The smoke probe only grepped the ❯ <name> header line, so it reported green while every plugin silently failed.

Fix. Un-pin, install plugins from inside the image so installer CLI == loader CLI always.

Changes

  • Dockerfile — drop ARG CLAUDE_CODE_VERSION. Resolve the npm latest dist-tag at image build. Switch from @anthropic-ai/claude-code (bundled cli.js — layout gone in 2.1.114) to the platform-native @anthropic-ai/claude-code-linux-{x64,arm64} tarball. Move DISABLE_AUTOUPDATER=1 to an ENV.
  • justfile agent-plugins-install — run the install inside a throwaway container off container_image with the agent-env dir bind-mounted rw. Wipe stale plugins/ dir and enabledPlugins / extraKnownMarketplaces keys from settings.json before install so migration from the old host-install is clean.
  • scripts/smoke-creds.sh — plugin probe now parses per-plugin Status: and fails on ✘ failed to load, not just the name header. Remove the pinned-version probe (nothing to pin against).
  • CLAUDE.md — remove the "claude-code version bumps" section.

Rationale for un-pinning

Pinning was motivated by the v2.1.111 context-bloat regression (#83), but the underlying context-ceiling problem has since been addressed via other means (1M-context designer, agent-aware prompt-too-long hint). Keeping the pin guaranteed drift with the host-installed claude CLI, which is what bit us here.

Test plan

  • just qa — 259 pass, 0 fail
  • Rebuild image — claude --version reports 2.1.114 inside every container
  • just containers-rebuild — all 5 containers recreated cleanly
  • just agent-plugins-install — 12 plugins install successfully (boss/dev/reviewer/designer/design-reviewer)
  • scripts/smoke-creds.sh33 passed, 0 failed; hardened plugin probe now reports (loaded) per plugin instead of accepting the name header alone
  • Re-dispatch a trivial task on each of dev/reviewer/boss post-merge to exercise code-flow paths with real plugins active (#76 validation)

🤖 Generated with Claude Code

## Summary All five agents were dispatching for hours with **zero plugins loaded**, even though `agents.json` declared them and `settings.json` listed them as enabled. `claude plugin list` inside every container reported `Status: ✘ failed to load — Plugin X not found in marketplace claude-plugins-official`. **Root cause.** `Dockerfile` pinned `claude-code` (2.1.110, per #83) while `just agent-plugins-install` ran on the **host** (auto-updated to 2.1.113). Installer wrote marketplace state in a format the older container loader couldn't resolve. The smoke probe only grepped the `❯ <name>` header line, so it reported green while every plugin silently failed. **Fix.** Un-pin, install plugins from inside the image so installer CLI == loader CLI always. ## Changes - **Dockerfile** — drop `ARG CLAUDE_CODE_VERSION`. Resolve the npm `latest` dist-tag at image build. Switch from `@anthropic-ai/claude-code` (bundled `cli.js` — layout gone in 2.1.114) to the platform-native `@anthropic-ai/claude-code-linux-{x64,arm64}` tarball. Move `DISABLE_AUTOUPDATER=1` to an `ENV`. - **justfile `agent-plugins-install`** — run the install inside a throwaway container off `container_image` with the agent-env dir bind-mounted rw. Wipe stale `plugins/` dir and `enabledPlugins` / `extraKnownMarketplaces` keys from `settings.json` before install so migration from the old host-install is clean. - **scripts/smoke-creds.sh** — plugin probe now parses per-plugin `Status:` and fails on `✘ failed to load`, not just the name header. Remove the pinned-version probe (nothing to pin against). - **CLAUDE.md** — remove the "claude-code version bumps" section. ## Rationale for un-pinning Pinning was motivated by the v2.1.111 context-bloat regression (#83), but the underlying context-ceiling problem has since been addressed via other means (1M-context designer, agent-aware prompt-too-long hint). Keeping the pin guaranteed drift with the host-installed `claude` CLI, which is what bit us here. ## Test plan - [x] `just qa` — 259 pass, 0 fail - [x] Rebuild image — `claude --version` reports `2.1.114` inside every container - [x] `just containers-rebuild` — all 5 containers recreated cleanly - [x] `just agent-plugins-install` — 12 plugins install successfully (boss/dev/reviewer/designer/design-reviewer) - [x] `scripts/smoke-creds.sh` — **33 passed, 0 failed**; hardened plugin probe now reports `(loaded)` per plugin instead of accepting the name header alone - [ ] Re-dispatch a trivial task on each of dev/reviewer/boss post-merge to exercise code-flow paths with real plugins active (#76 validation) 🤖 Generated with [Claude Code](https://claude.com/claude-code)
fix(image+plugins): un-pin claude-code; install plugins from inside image
All checks were successful
qa / qa (pull_request) Successful in 2m32s
qa / dockerfile (pull_request) Successful in 8s
36dca620db
The Dockerfile pinned `claude-code` via ARG CLAUDE_CODE_VERSION (#83) while
`just agent-plugins-install` ran on the host. Host auto-updates to a newer
CLI, container stays on the pinned one, installer writes marketplace state
in a format the loader can't resolve → all plugins silently `failed to load`
on every dispatch. Smoke probe didn't catch it because it only checked the
plugin name header, not load status.

- Dockerfile: drop ARG CLAUDE_CODE_VERSION, pull the npm `latest` dist-tag of
  the platform-native `claude-code-linux-{x64,arm64}` package (upstream moved
  off the bundled cli.js in 2.1.114). Set DISABLE_AUTOUPDATER=1 as ENV.
- justfile: agent-plugins-install now runs inside a throwaway container off
  `container_image`, so installer CLI == loader CLI. Wipes stale plugins/ +
  settings.json keys first so migration from the old host-install is clean.
- smoke-creds.sh: plugin probe parses per-plugin Status and fails on
  `✘ failed to load` instead of accepting the name header alone. Drop the
  version-pin probe (nothing to pin against).
- CLAUDE.md: drop the "claude-code version bumps" section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reviewer requested changes 2026-04-19 18:52:36 +00:00
Dismissed
reviewer left a comment

Review: fix(image+plugins): un-pin claude-code; install plugins from inside image

CI: Green (run #1626, 2m41s, success)

Root cause diagnosis and fix are correct: the host-vs-container CLI version skew was the genuine source of plugin load failures, and installing plugins from inside a throwaway container built off the same image is the right architectural fix. The hardened smoke probe (per-plugin Status: parsing) is a clear improvement. Two bugs and one stale comment need fixing before merge.


🔴 Bug 1 — containers-smoke still references the removed cli.js

File: justfile, containers-smoke NAME recipe

docker exec "claude-hooks-{{NAME}}" bun /opt/claude-code/cli.js --version

cli.js no longer exists in the 2.1.114+ layout — the image now installs a native binary at /usr/local/bin/claude. This command will fail with "file not found" the first time anyone runs just containers-smoke <agent>.

Fix: Replace that line with:

docker exec "claude-hooks-{{NAME}}" claude --version

🟡 Bug 2 — Plugin install failures silently swallowed in the container script

File: justfile, agent-plugins-install recipe, inner bash -c script

set -e
for plugin in $PLUGINS; do
    echo "  installing $plugin"
    claude plugin install "$plugin" 2>&1 | tail -1
done

set -e without set -o pipefail means the pipeline claude plugin install … 2>&1 | tail -1 always exits 0 (from tail). A failed install is not surfaced — just agent-plugins-install exits cleanly and the operator has no signal. Failures are only caught later by smoke-creds.sh.

Fix: Change set -e to set -eo pipefail in the inner script.


Cosmetic — Dockerfile header comment references removed ARG

File: Dockerfile, header comment block

#   - claude v${CLAUDE_CODE_VERSION}       — Claude Code CLI

CLAUDE_CODE_VERSION is no longer an ARG. Should read:

#   - claude (latest at build time)        — Claude Code CLI
## Review: fix(image+plugins): un-pin claude-code; install plugins from inside image **CI**: ✅ Green (run #1626, 2m41s, success) Root cause diagnosis and fix are correct: the host-vs-container CLI version skew was the genuine source of plugin load failures, and installing plugins from inside a throwaway container built off the same image is the right architectural fix. The hardened smoke probe (per-plugin `Status:` parsing) is a clear improvement. Two bugs and one stale comment need fixing before merge. --- ### 🔴 Bug 1 — `containers-smoke` still references the removed `cli.js` **File:** `justfile`, `containers-smoke NAME` recipe ```just docker exec "claude-hooks-{{NAME}}" bun /opt/claude-code/cli.js --version ``` `cli.js` no longer exists in the 2.1.114+ layout — the image now installs a native binary at `/usr/local/bin/claude`. This command will fail with "file not found" the first time anyone runs `just containers-smoke <agent>`. **Fix:** Replace that line with: ```just docker exec "claude-hooks-{{NAME}}" claude --version ``` --- ### 🟡 Bug 2 — Plugin install failures silently swallowed in the container script **File:** `justfile`, `agent-plugins-install` recipe, inner bash `-c` script ```bash set -e for plugin in $PLUGINS; do echo " installing $plugin" claude plugin install "$plugin" 2>&1 | tail -1 done ``` `set -e` without `set -o pipefail` means the pipeline `claude plugin install … 2>&1 | tail -1` always exits 0 (from `tail`). A failed install is not surfaced — `just agent-plugins-install` exits cleanly and the operator has no signal. Failures are only caught later by `smoke-creds.sh`. **Fix:** Change `set -e` to `set -eo pipefail` in the inner script. --- ### ⚪ Cosmetic — Dockerfile header comment references removed ARG **File:** `Dockerfile`, header comment block ``` # - claude v${CLAUDE_CODE_VERSION} — Claude Code CLI ``` `CLAUDE_CODE_VERSION` is no longer an `ARG`. Should read: ``` # - claude (latest at build time) — Claude Code CLI ```
fix(review): address PR #98 review — containers-smoke path, pipefail, stale header
All checks were successful
qa / qa (pull_request) Successful in 2m32s
qa / dockerfile (pull_request) Successful in 8s
f68ab3cf5f
- justfile containers-smoke: drop `bun /opt/claude-code/cli.js` (that path is
  gone in the 2.1.114 native-binary layout); use the `claude` wrapper instead.
- justfile agent-plugins-install: inner bash now `set -eo pipefail` so a
  failed `claude plugin install | tail -1` surfaces a non-zero exit instead
  of being swallowed by `tail`.
- Dockerfile header comment: drop the ${CLAUDE_CODE_VERSION} reference now
  that the ARG is gone; point at the per-platform npm package.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author
Collaborator

All three points addressed in f68ab3c:

  • 🔴 containers-smoke → uses claude --version now; verified against claude-hooks-dev (outputs 2.1.114 (Claude Code)).
  • 🟡 agent-plugins-install inner script → set -eo pipefail so a failed claude plugin install | tail -1 exits non-zero; confirmed the pipeline now propagates failures.
  • Dockerfile header → points at @anthropic-ai/claude-code-linux-x64 (the per-platform package).

scripts/smoke-creds.sh still 33 passed, 0 failed. Ready for re-review.

All three points addressed in `f68ab3c`: - 🔴 `containers-smoke` → uses `claude --version` now; verified against `claude-hooks-dev` (outputs `2.1.114 (Claude Code)`). - 🟡 `agent-plugins-install` inner script → `set -eo pipefail` so a failed `claude plugin install | tail -1` exits non-zero; confirmed the pipeline now propagates failures. - ⚪ Dockerfile header → points at `@anthropic-ai/claude-code-linux-x64` (the per-platform package). `scripts/smoke-creds.sh` still 33 passed, 0 failed. Ready for re-review.
reviewer requested changes 2026-04-19 19:01:13 +00:00
Dismissed
reviewer left a comment

Review — REQUEST_CHANGES

CI was still running at review time (run #183, commit f68ab3c, job "fix(review): address PR #98 review") — push any trivial change (or wait) and I will re-review when it completes.


Code review — one issue found

Dockerfile/opt/claude-code not cleaned up after install

After install -m 0755 /opt/claude-code/claude /usr/local/bin/claude copies the binary, /opt/claude-code is never removed. The tarball extracts extra files (LICENSE, package.json, …) that persist in the image layer and waste space. Fix:

install -m 0755 /opt/claude-code/claude /usr/local/bin/claude; \
rm -rf /opt/claude-code

Everything else looks good

  • Root cause fix is correct: installing plugins from inside the throwaway container (same claude binary as the long-running agent) eliminates the installer-vs-loader version skew that caused all plugins to silently fail.
  • Dockerfile: platform-native tarball fetch with [ -n "$tarball_url" ] guard and set -eux is solid. ENV DISABLE_AUTOUPDATER=1 correctly moved to a persistent env.
  • agent-plugins-install: pre-wipe of plugins/ and del(.enabledPlugins, .extraKnownMarketplaces) before in-container install correctly clears stale host-side paths. set -eo pipefail inside the bash heredoc ensures a failing claude plugin install | tail -1 propagates non-zero.
  • smoke-creds.sh plugin probe: the new awk parser distinguishing loaded from ✘ failed to load is a clear improvement over the header-only grep that gave false positives.
  • CLAUDE.md cleanup: removing the version-bump protocol section is correct — no pin to manage anymore.
## Review — REQUEST_CHANGES CI was still running at review time (run #183, commit `f68ab3c`, job "fix(review): address PR #98 review") — push any trivial change (or wait) and I will re-review when it completes. --- ### Code review — one issue found **`Dockerfile` — `/opt/claude-code` not cleaned up after install** After `install -m 0755 /opt/claude-code/claude /usr/local/bin/claude` copies the binary, `/opt/claude-code` is never removed. The tarball extracts extra files (LICENSE, package.json, …) that persist in the image layer and waste space. Fix: ```dockerfile install -m 0755 /opt/claude-code/claude /usr/local/bin/claude; \ rm -rf /opt/claude-code ``` --- ### Everything else looks good - **Root cause fix is correct**: installing plugins from inside the throwaway container (same `claude` binary as the long-running agent) eliminates the installer-vs-loader version skew that caused all plugins to silently fail. - **Dockerfile**: platform-native tarball fetch with `[ -n "$tarball_url" ]` guard and `set -eux` is solid. `ENV DISABLE_AUTOUPDATER=1` correctly moved to a persistent env. - **`agent-plugins-install`**: pre-wipe of `plugins/` and `del(.enabledPlugins, .extraKnownMarketplaces)` before in-container install correctly clears stale host-side paths. `set -eo pipefail` inside the bash heredoc ensures a failing `claude plugin install | tail -1` propagates non-zero. - **`smoke-creds.sh` plugin probe**: the new awk parser distinguishing `loaded` from `✘ failed to load` is a clear improvement over the header-only grep that gave false positives. - **CLAUDE.md cleanup**: removing the version-bump protocol section is correct — no pin to manage anymore.
reviewer left a comment

Review — REQUEST_CHANGES

CI was still running at review time (run #183, commit f68ab3c, job "fix(review): address PR #98 review") — push any trivial change (or wait) and I will re-review when it completes.


Code review — one issue found

Dockerfile/opt/claude-code not cleaned up after install

After install -m 0755 /opt/claude-code/claude /usr/local/bin/claude copies the binary, /opt/claude-code is never removed. The tarball extracts extra files (LICENSE, package.json, …) that persist in the image layer and waste space. Fix:

install -m 0755 /opt/claude-code/claude /usr/local/bin/claude; \
rm -rf /opt/claude-code

Everything else looks good

  • Root cause fix is correct: installing plugins from inside the throwaway container (same claude binary as the long-running agent) eliminates the installer-vs-loader version skew that caused all plugins to silently fail.
  • Dockerfile: platform-native tarball fetch with [ -n "$tarball_url" ] guard and set -eux is solid. ENV DISABLE_AUTOUPDATER=1 correctly moved to a persistent env.
  • agent-plugins-install: pre-wipe of plugins/ and del(.enabledPlugins, .extraKnownMarketplaces) before in-container install correctly clears stale host-side paths. set -eo pipefail inside the bash heredoc ensures a failing claude plugin install | tail -1 propagates non-zero.
  • smoke-creds.sh plugin probe: the new awk parser distinguishing loaded from "failed to load" is a clear improvement over the header-only grep that gave false positives.
  • CLAUDE.md cleanup: removing the version-bump protocol section is correct — no pin to manage anymore.
## Review — REQUEST_CHANGES CI was still running at review time (run #183, commit `f68ab3c`, job "fix(review): address PR #98 review") — push any trivial change (or wait) and I will re-review when it completes. --- ### Code review — one issue found **`Dockerfile` — `/opt/claude-code` not cleaned up after install** After `install -m 0755 /opt/claude-code/claude /usr/local/bin/claude` copies the binary, `/opt/claude-code` is never removed. The tarball extracts extra files (LICENSE, package.json, …) that persist in the image layer and waste space. Fix: ```dockerfile install -m 0755 /opt/claude-code/claude /usr/local/bin/claude; \ rm -rf /opt/claude-code ``` --- ### Everything else looks good - **Root cause fix is correct**: installing plugins from inside the throwaway container (same `claude` binary as the long-running agent) eliminates the installer-vs-loader version skew that caused all plugins to silently fail. - **Dockerfile**: platform-native tarball fetch with `[ -n "$tarball_url" ]` guard and `set -eux` is solid. `ENV DISABLE_AUTOUPDATER=1` correctly moved to a persistent env. - **`agent-plugins-install`**: pre-wipe of `plugins/` and `del(.enabledPlugins, .extraKnownMarketplaces)` before in-container install correctly clears stale host-side paths. `set -eo pipefail` inside the bash heredoc ensures a failing `claude plugin install | tail -1` propagates non-zero. - **`smoke-creds.sh` plugin probe**: the new awk parser distinguishing `loaded` from "failed to load" is a clear improvement over the header-only grep that gave false positives. - **CLAUDE.md cleanup**: removing the version-bump protocol section is correct — no pin to manage anymore.
fix(review): clean up /opt/claude-code after install
All checks were successful
qa / qa (pull_request) Successful in 2m37s
qa / dockerfile (pull_request) Successful in 10s
6552eaf5bb
The tarball extracts LICENSE, package.json, README.md alongside the native
`claude` binary. We only need the binary (copied to /usr/local/bin/claude),
so `rm -rf /opt/claude-code` at the end of the same RUN step keeps the
extras out of the final image layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author
Collaborator

Addressed in 6552eaf: rm -rf /opt/claude-code at the end of the same RUN step. Verified — image built, /opt/claude-code is gone inside the runtime, claude --version still reports 2.1.114.

Addressed in `6552eaf`: `rm -rf /opt/claude-code` at the end of the same `RUN` step. Verified — image built, `/opt/claude-code` is gone inside the runtime, `claude --version` still reports `2.1.114`.
charles deleted branch fix/unpin-claude-code 2026-04-19 19:06:31 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
3 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
charles/claude-hooks!98
No description provided.