charles/claude-hooks

fix(container): writable credentials mount so Claude Code can self-refresh access tokens #202

Closed

opened 2026-04-21 07:05:00 +00:00 by claude-desktop · 0 comments

Summary

Agent containers bind-mount their credentials directory read-only. When Claude Code's access token expires, it correctly calls Anthropic's OAuth refresh endpoint and receives a new access token, but it cannot write that token back to `.credentials.json` because the mount is `:ro`. Every subsequent SDK call then uses the expired token, gets a 401, and the task fails. Only an operator-side `claude login` rotates the file on disk; the container, which can only read, stays stuck on the expired token.

The standalone Claude Code CLI never hits this: it transparently writes the refreshed token back to the same file it read from. Containers are the only environment where the write-back step is blocked.

Reproducer

1. Dispatch any long-running agent task.
2. Wait for the access token to expire mid-turn (naturally, at the ~8-12h TTL boundary, or artificially by corrupting the token in the credentials file).
3. Observe: the task fails with "Failed to authenticate. API Error: 401 Invalid authentication credentials".
4. There is no retry. The container stays stuck until the operator runs `claude login` and `just agent-env-sync` rewrites the file.

Witnessed 2026-04-21T02:12:55Z on task 186bfb5e (dev-default addressing review on PR #199). Review sat unanswered 7 hours until the operator noticed.

Root cause

apps/server/src/container.ts::dockerRun mounts credentials_host_dir with :ro. Read-only blocks Claude Code from writing the refreshed access token back. The next SDK call reads the stale token still on disk and fails.

The fix

Mount writable (:rw). Claude Code inside the container refreshes its own access token, writes it back, carries on. The operator's claude login becomes an occasional recovery step (for refresh-token rotation), not a routine maintenance task.

Acceptance criteria

Mount change

- [ ] `apps/server/src/container.ts::dockerRun` mounts `credentials_host_dir` with `:rw` instead of `:ro` (or drops the flag; Docker default is rw).
- [ ] CLAUDE.md "Container credentials" section updated: mount is RW, agents self-refresh, bind directory ownership stays UID-matched for the in-container `claude` user.
- [ ] `config/agents.json` schema unchanged; the mount flag is a service-level constant, not per-agent.
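A minimal sketch of the mount change, assuming `dockerRun` assembles its `docker run` arguments as a string array (the issue does not show the actual code). `credentialsBindArgs` and `CredentialsMount` are hypothetical names, the host path is illustrative, and the container path is the one probed in the smoke-test criterion.

```typescript
// Hypothetical helper: not the actual dockerRun code, which the issue does not show.
interface CredentialsMount {
  hostDir: string; // value of credentials_host_dir for this agent
  containerDir: string; // where the directory appears inside the container
}

// Build the `-v` arguments for the credentials bind mount.
// Docker's default for bind mounts is read-write; `:rw` is spelled out here
// so the intent is visible and easy to assert on in a unit test.
function credentialsBindArgs(mount: CredentialsMount): string[] {
  return ["-v", `${mount.hostDir}:${mount.containerDir}:rw`];
}

const exampleBindArgs = credentialsBindArgs({
  hostDir: "/srv/agents/dev-default/credentials", // illustrative path, an assumption
  containerDir: "/home/claude/.config/claude-code",
});
console.log(exampleBindArgs.join(" "));
```

Keeping the flag in one helper matches the criterion that the mount mode is a service-level constant rather than a per-agent setting.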

Interaction with `just agent-env-sync`

- [ ] `agent-env-sync` must not clobber in-flight refreshes. Add an mtime check: only overwrite an agent's `.credentials.json` if the host's is *newer*. Otherwise skip (the agent has a fresher token from self-refresh, so there is no point downgrading).
- [ ] Operator-forced overwrite (`just agent-env-sync --force`) still exists for the refresh-token-rotation recovery case.
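The mtime rule above can be sketched as a pure decision function. This is an illustration under assumptions, not the actual `agent-env-sync` implementation: `decideCredentialsSync` and `SyncDecision` are hypothetical names, and treating a missing agent file as "overwrite" is an assumption the issue does not state.

```typescript
// Hypothetical decision helper for agent-env-sync; all names are illustrative.
// Mtimes are milliseconds since the epoch, e.g. fs.statSync(path).mtimeMs in Node.
type SyncDecision = "overwrite" | "skip";

function decideCredentialsSync(
  hostMtimeMs: number,
  agentMtimeMs: number | null, // null when the agent has no credentials file yet
  force: boolean,
): SyncDecision {
  if (force) return "overwrite"; // operator recovery path (--force)
  if (agentMtimeMs === null) return "overwrite"; // assumption: nothing to clobber
  // Overwrite only when the host copy is strictly newer; on a tie, keep the
  // agent's file so a self-refreshed token is never downgraded.
  return hostMtimeMs > agentMtimeMs ? "overwrite" : "skip";
}

console.log(decideCredentialsSync(2000, 1000, false)); // host newer
console.log(decideCredentialsSync(1000, 2000, false)); // agent self-refreshed later
console.log(decideCredentialsSync(1000, 2000, true)); // forced by the operator
```

Separating the decision from the file I/O keeps the rule unit-testable without touching real credentials files.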

Refresh-token rotation fallback

- [ ] If Anthropic rotates refresh tokens on each refresh call (unverified behaviour), agents will naturally drift from the host. Operator behaviour: on the next `claude login`, the host's refresh token is authoritative; the operator runs `just agent-env-sync --force` to propagate it.
- [ ] Document this flow in CLAUDE.md so the operator isn't surprised.

Smoke test

- [ ] `scripts/smoke-creds.sh` adds a write-probe: `docker exec <container> touch /home/claude/.config/claude-code/.probe && docker exec <container> rm /home/claude/.config/claude-code/.probe`. Fails loudly if the mount is accidentally reverted to read-only.

Tests

- [ ] Unit: `container.test.ts` asserts the `docker run` command emitted by `dockerRun` includes `:rw` (or no `:ro`) on the credentials bind.
- [ ] Integration (optional, behind `RUN_DOCKER_TESTS=1`): start a container, write-probe the mount, confirm the probe succeeds and an actual credentials refresh round-trip works end-to-end.
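The unit assertion could look roughly like the check below. It is a sketch that assumes the emitted command is available as an argv array and that the bind uses the `-v host:container[:opts]` form; `credentialsBindIsWritable` is a hypothetical helper, and the `--mount` and `--volume=...` syntaxes are deliberately not handled.

```typescript
// Hypothetical test helper: true when the `-v`/`--volume` bind for `hostDir`
// is writable, meaning an explicit `rw` or simply no `ro` among its options.
function credentialsBindIsWritable(argv: string[], hostDir: string): boolean {
  for (let i = 0; i < argv.length - 1; i++) {
    if (argv[i] !== "-v" && argv[i] !== "--volume") continue;
    // Bind spec is host:container[:opt1,opt2,...]
    const parts = argv[i + 1].split(":");
    if (parts[0] !== hostDir) continue;
    const opts = parts.length > 2 ? parts[2].split(",") : [];
    return opts.indexOf("ro") === -1;
  }
  return false; // no bind for hostDir at all also counts as a failure
}

const sampleArgv = [
  "run", "--rm",
  "-v", "/srv/creds:/home/claude/.config/claude-code:rw", // illustrative host path
  "agent-image", // illustrative image name
];
console.log(credentialsBindIsWritable(sampleArgv, "/srv/creds"));
```

Checking for the absence of `ro` rather than the presence of `rw` covers both outcomes the criterion allows: an explicit `:rw` or a dropped flag.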

Security posture change (documented)

- [ ] CLAUDE.md gains a "Security trade-off" note on the credentials mount: the directory is writable from inside the container. A compromised agent could overwrite its own `.credentials.json`, but since the same agent can already read and exfiltrate the tokens over the network, write access does not meaningfully expand the attack surface. Per-agent isolation still holds (each agent writes only its own bind dir).

Out of scope

- **Retry-on-401 as a defensive layer.** Moot under the new design for credential-rotation causes. Keep it in reserve in case other 401 causes surface (e.g. an actually revoked token); file separately if so.
- **Broader transient-error retry** (429 rate limit, 503 service unavailable, network timeouts). Different shape; no evidence we're hitting them.
- **Periodic credential-health probe on the host.** Not needed: the container handles its own refresh now.
- **Cross-agent credentials sharing** via a common dir. Each agent's bind dir stays isolated, which keeps reasoning simpler and adds no new race conditions.

Dependencies

None. Backend + docs + one smoke-test tweak.

References

- Reproducer: task `186bfb5e`, 2026-04-21T02:12:55Z, dev-default on PR #199.
- Error shape: `Claude Code returned an error result: Failed to authenticate. API Error: 401 {"type":"error","error":{"type":"authentication_error","message":"Invalid authentication credentials"}}`
- CLAUDE.md → "Container credentials": governs the mount semantics. Currently read-only; this ticket flips it.
- Claude Code OAuth behaviour: the refresh endpoint returns a new access token; the CLI writes it back to `.credentials.json` in place, which needs write access.
- Diagnosis conversation: the original #202 proposed a retry-on-401 at the agent-runner layer. That fix was a symptom-level patch; it would have been a no-op because the container still couldn't refresh. This ticket addresses the root cause instead.

claude-desktop added the area:agents and type:bug labels (2026-04-21 07:05:05 +00:00)

claude-desktop changed title from ~~bug(agent-runner): auto-retry once on Anthropic 401 (credential-rotation silent failure)~~ to fix(container): writable credentials mount so Claude Code can self-refresh access tokens (2026-04-21 07:14:30 +00:00)

dev was assigned by claude-desktop (2026-04-21 07:25:36 +00:00)

dev referenced this issue from a commit (2026-04-21 07:30:29 +00:00): fix(container): mount credentials dir :rw so Claude Code can self-refresh tokens

dev referenced this issue from a pull request that will close it (2026-04-21 07:30:41 +00:00): fix(container): mount credentials dir :rw so Claude Code can self-refresh tokens #203

code-lead closed this issue (2026-04-21 07:45:56 +00:00)

code-lead referenced this issue from a commit (2026-04-21 07:45:57 +00:00): fix(container): mount credentials dir :rw so Claude Code can self-refresh tokens
