fix(container): writable credentials mount so Claude Code can self-refresh access tokens #202

Closed
opened 2026-04-21 07:05:00 +00:00 by claude-desktop · 0 comments
Collaborator

Summary

Agent containers bind-mount their credentials directory read-only. When Claude Code's access token expires, it correctly calls Anthropic's OAuth refresh endpoint and gets a new access token — but can't write it back to .credentials.json because the mount is :ro. Every subsequent SDK call uses the expired token → 401 → task failure. Only an operator-side claude login rotates the file on disk; the container, which can only read, stays stuck on the expired token.

Standalone Claude Code CLI never hits this — it writes the refreshed token back to the same file it read from, all transparent. Containers are the only environment where the write-back step is blocked.

Reproducer

  1. Dispatch any long-running agent task.
  2. Wait for the access token to expire mid-turn (naturally, at the ~8-12h TTL boundary, or artificially by corrupting the token in the credentials file).
  3. Observe: task fails with "Failed to authenticate. API Error: 401 Invalid authentication credentials".
  4. No retry. The container stays stuck until the operator runs claude login and just agent-env-sync rewrites the file.

Witnessed 2026-04-21T02:12:55Z on task 186bfb5e (dev-default addressing review on PR #199). Review sat unanswered 7 hours until the operator noticed.

Root cause

apps/server/src/container.ts::dockerRun mounts credentials_host_dir with :ro. Read-only blocks Claude Code from writing the refreshed access token back. The next SDK call reads the stale token still on disk and fails.

The fix

Mount writable (:rw). Claude Code inside the container refreshes its own access token, writes it back, carries on. The operator's claude login becomes an occasional recovery step (for refresh-token rotation), not a routine maintenance task.
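As a rough illustration of the mount change, the sketch below builds the `-v` argument for the credentials bind. The helper names (`credentialsBind`, `credentialsMountArgs`) and the host path in the usage note are hypothetical; the real `dockerRun` in `apps/server/src/container.ts` may be structured quite differently. The container path is the one used by the smoke-test probe in this ticket.

```typescript
// Hypothetical helpers for illustration only; not the actual dockerRun code.
type MountMode = "ro" | "rw";

/** Build the value passed to `docker run -v` for the credentials bind mount. */
function credentialsBind(
  hostDir: string,
  containerDir: string,
  mode: MountMode = "rw", // Docker also treats a missing mode as read-write
): string {
  // Bind syntax is host:container[:options].
  return `${hostDir}:${containerDir}:${mode}`;
}

/** Assemble the mount-related slice of a `docker run` argv. */
function credentialsMountArgs(hostDir: string, containerDir: string): string[] {
  return ["-v", credentialsBind(hostDir, containerDir)];
}
```

For example, `credentialsMountArgs("/srv/creds/dev-default", "/home/claude/.config/claude-code")` yields `["-v", "/srv/creds/dev-default:/home/claude/.config/claude-code:rw"]`, where the host directory is a made-up placeholder. Keeping the mode explicit rather than dropping the flag makes an accidental revert to `:ro` easier to spot in review.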

Acceptance criteria

Mount change

  • apps/server/src/container.ts::dockerRun mounts credentials_host_dir with :rw instead of :ro (or drops the flag; Docker default is rw).
  • CLAUDE.md "Container credentials" section updated: mount is RW, agents self-refresh, bind directory ownership stays UID-matched for the in-container claude user.
  • config/agents.json schema unchanged — the mount flag is a service-level constant, not per-agent.

Interaction with just agent-env-sync

  • agent-env-sync must not clobber in-flight refreshes. Add an mtime check: only overwrite an agent's .credentials.json if the host's copy has a newer mtime. Otherwise skip (the agent has a fresher token from self-refresh; no point downgrading).
  • Operator-forced overwrite (just agent-env-sync --force) still exists for the refresh-token-rotation recovery case.
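The two sync rules above can be sketched as a small pure decision function. This is only an illustration under assumptions: `decideSync`, `SyncDecisionInput` and the "overwrite when the agent copy is missing" branch are not taken from the existing `agent-env-sync` recipe, whose implementation is not shown in this ticket.

```typescript
// Hypothetical decision helper; the real agent-env-sync may be a shell recipe.
interface SyncDecisionInput {
  hostMtimeMs: number;         // mtime of the host's .credentials.json
  agentMtimeMs: number | null; // mtime of the agent's copy, null if it does not exist
  force: boolean;              // true when the operator passes --force
}

type SyncDecision = "overwrite" | "skip";

function decideSync(input: SyncDecisionInput): SyncDecision {
  const { hostMtimeMs, agentMtimeMs, force } = input;
  // --force always wins: this is the refresh-token-rotation recovery path.
  if (force) return "overwrite";
  // Assumption: a missing agent copy is seeded from the host.
  if (agentMtimeMs === null) return "overwrite";
  // Only a strictly newer host file replaces the agent's copy, so a token
  // the agent refreshed itself is never downgraded.
  return hostMtimeMs > agentMtimeMs ? "overwrite" : "skip";
}
```

In a Node script the two timestamps could come from `fs.statSync(path).mtimeMs`. Note that an mtime comparison only avoids downgrading a fresher file; it does not by itself serialize a sync that races with a refresh happening at the same instant.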

Refresh-token rotation fallback

  • If Anthropic rotates refresh tokens on each refresh call (unverified behaviour), agents will naturally drift from the host. Operator behaviour: on next claude login, the host's refresh token is authoritative; operator runs just agent-env-sync --force to propagate.
  • Document this flow in CLAUDE.md so the operator isn't surprised.

Smoke test

  • scripts/smoke-creds.sh adds a write-probe: docker exec <container> touch /home/claude/.config/claude-code/.probe && docker exec <container> rm /home/claude/.config/claude-code/.probe. Fails loud if the mount is accidentally reverted to read-only.

Tests

  • Unit: container.test.ts asserts the docker run command emitted by dockerRun includes :rw (or no :ro) on the credentials bind.
  • Integration (optional, behind RUN_DOCKER_TESTS=1): start a container, write-probe the mount, confirm the probe succeeds and an actual credentials refresh round-trip works end-to-end.
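One possible shape for the unit assertion is a helper that inspects the emitted `docker run` argv and reports whether the credentials bind is writable. This is a sketch, not the contents of `container.test.ts`: `credentialsBindIsWritable` is a hypothetical name, and it assumes `dockerRun` emits POSIX paths through `-v`/`--volume` as separate argv entries (it does not handle `--volume=...` or the `--mount` syntax).

```typescript
/**
 * Returns true when argv bind-mounts containerDir without the `ro` option.
 * Throws if no matching bind is present, so a dropped mount also fails the test.
 */
function credentialsBindIsWritable(argv: string[], containerDir: string): boolean {
  for (let i = 0; i < argv.length - 1; i++) {
    if (argv[i] !== "-v" && argv[i] !== "--volume") continue;
    // Bind syntax is host:container[:opt1,opt2]; no options means read-write.
    const [, target, options = ""] = argv[i + 1].split(":");
    if (target !== containerDir) continue;
    return !options.split(",").includes("ro");
  }
  throw new Error(`no bind mount found for ${containerDir}`);
}
```

A test could then call the helper on whatever argv `dockerRun` produces and expect `true`. Both an explicit `:rw` and a bare `host:container` bind pass, which matches the "`:rw` (or no `:ro`)" wording of the criterion.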

Security posture change (documented)

  • CLAUDE.md gains a "Security trade-off" note on the credentials mount: the directory is writable from inside the container. A compromised agent could overwrite its own .credentials.json, but since the same agent can already read + exfiltrate the tokens over the network, write-access does not meaningfully expand the attack surface. Per-agent isolation still holds (each agent writes only its own bind dir).

Out of scope

  • Retry-on-401 as a defensive layer. Moot under the new design for credential-rotation causes. Keep in back-pocket if other 401 causes surface (e.g. an actual revoked token); file separately if so.
  • Broader transient-error retry (429 rate limit, 503 service, network timeouts). Different shape; no evidence we're hitting them.
  • Periodic credential-health probe on the host. Not needed — the container handles its own refresh now.
  • Cross-agent credentials sharing via a common dir. Each agent's bind dir stays isolated — simpler reasoning, no new race conditions.

Dependencies

  • None. Backend + docs + one smoke-test tweak.

References

  • Reproducer: task 186bfb5e, 2026-04-21T02:12:55Z, dev-default on PR #199.
  • Error shape: Claude Code returned an error result: Failed to authenticate. API Error: 401 {"type":"error","error":{"type":"authentication_error","message":"Invalid authentication credentials"}}
  • CLAUDE.md → "Container credentials": governs the mount semantics. Currently read-only; this ticket flips it.
  • Claude Code OAuth behaviour: refresh endpoint returns a new access token; the CLI writes it back to .credentials.json in place. Needs write access.
  • Diagnosis conversation: the original #202 proposed a retry-on-401 at the agent-runner layer. That fix was a symptom-level patch — it would have no-op'd because the container still couldn't refresh. This ticket addresses the root cause instead.
claude-desktop changed title from bug(agent-runner): auto-retry once on Anthropic 401 (credential-rotation silent failure) to fix(container): writable credentials mount so Claude Code can self-refresh access tokens 2026-04-21 07:14:30 +00:00
Reference
charles/claude-hooks#202