sessions: diagnose persistent "No conversation found with session ID" resume failures #124

Closed

opened 2026-04-20 10:22:43 +00:00 by claude-desktop · 0 comments

claude-desktop commented

2026-04-20 10:22:43 +00:00

Collaborator

User story

As the operator, I want Claude Agent SDK session resume to actually succeed when we have a stored session id, so that multi-turn workflows (boss on a long PR, designer across dispatches) don't start fresh every time and waste turns + tokens + cost re-building context.

What happens today

Every agent dispatch with a stored session id hits this:

[boss-default] <task-id>: resuming session <uuid>
[boss-default] <task-id>: resume of <uuid> failed — retrying fresh (Claude Code returned an error result: No conversation found with session ID: <uuid>)
[boss-default] <task-id>: resume failed — starting fresh session (Claude Code returned an error result: No conversation found with session ID: <uuid>)

Concrete examples from today:

Boss on PR #115 address-review cycles
Dev on #117 re-dispatch
Designer on #70 multi-round designs

The session id exists in SQLite (sessions.ts persisted it on the prior dispatch), but the Claude Agent SDK's query({ resume: <id> }) rejects it as "not found". Something between "the SDK issued id X at turn N" and "we hand id X back at turn N+1 in a new dispatch" loses the handle.

Effect: every dispatch pays for fresh context (system prompt, tool schemas, initial file reads) instead of resuming from the prior turn. For a 20-turn PR series, that is 20 cold starts where one would do.
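To put a number on how often this happens, the log lines quoted above can be tallied per agent. The following is a minimal sketch, not code from the repository: `parseResumeFailure` and `countResumeFailuresByAgent` are hypothetical names, and the only thing taken from this issue is the shape of the "resume of <uuid> failed" line. The pattern is deliberately unanchored so that it still matches when the journal adds its own prefix.

```typescript
// Hypothetical helper: tally resume failures from captured log lines.
// The line shape comes from the log excerpt in this issue; the names
// below are illustrative and do not exist in the codebase.

interface ResumeFailure {
  agent: string;
  taskId: string;
  sessionId: string;
}

// Matches "[agent] task: resume of <id> failed ..." anywhere in the line.
// The follow-up "resume failed ... starting fresh session" line does not
// contain "resume of", so each failed dispatch is counted once.
const RESUME_FAILED = /\[([^\]]+)\] (\S+): resume of (\S+) failed\b/;

function parseResumeFailure(line: string): ResumeFailure | null {
  const m = RESUME_FAILED.exec(line);
  if (!m) return null;
  return { agent: m[1], taskId: m[2], sessionId: m[3] };
}

function countResumeFailuresByAgent(lines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    const hit = parseResumeFailure(line);
    if (hit) counts.set(hit.agent, (counts.get(hit.agent) ?? 0) + 1);
  }
  return counts;
}
```

Feeding it a day of journal output gives a per-agent failure count to compare before and after a fix.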

Investigation scope

Before proposing a fix, the ticket owner must reproduce the failure and locate where the handle gets lost. Likely suspects:

1. SDK-side retention: Claude Code's session store is file-based in ~/.claude (or CLAUDE_CONFIG_DIR). If our per-agent config dirs don't persist that state across container restarts or session-env bind-mount invalidation, the SDK won't find the conversation even though we know the id.
   - Probe: after a successful dispatch, ls -la $CLAUDE_CONFIG_DIR/projects/*/conversations/ (or equivalent). Is the conversation file there? Does its id match the one sessions.ts persisted?
2. Session id recording bug: agent-runner.ts captures the session id from the SDK's system/init message. If it captures a transient id (e.g. the pre-init placeholder) instead of the final conversation id, the stored id is never valid for resume.
   - Probe: log the full SDK message stream around session id capture and correlate it with what's in SQLite.
3. Cache dir / volume mismatch: agents run in containers with their state volume mounted at /state. The SDK probably stores conversations somewhere under the agent's CLAUDE_CONFIG_DIR (our per-agent env dir on the host, bind-mounted read-only). If read-only, the SDK can't persist the conversation file, which means the handle was never written in the first place, even though we "stored" an id.
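The filesystem probes for suspects 1 and 3 can be combined into one small check. This is a sketch under stated assumptions: `probeSessionStore` and its helpers are hypothetical names, and it assumes the conversation file's name contains the session id. The exact on-disk layout is not confirmed here, which is why the probe walks the whole config dir rather than hard-coding a path.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

interface ProbeResult {
  configDir: string;
  writable: boolean; // false on a read-only bind mount or a missing dir
  matches: string[]; // files whose name contains the session id
}

// Falls back to ~/.claude when CLAUDE_CONFIG_DIR is unset, as described above.
function defaultConfigDir(): string {
  return process.env.CLAUDE_CONFIG_DIR ?? path.join(os.homedir(), ".claude");
}

// Recursively collect files under `dir` whose name contains `needle`.
// ASSUMPTION: the session id appears in the conversation file's name.
function findFilesContaining(dir: string, needle: string, out: string[] = []): string[] {
  try {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) findFilesContaining(full, needle, out);
      else if (entry.name.includes(needle)) out.push(full);
    }
  } catch {
    // Missing or unreadable directory: report whatever was found so far.
  }
  return out;
}

function isWritable(dir: string): boolean {
  try {
    fs.accessSync(dir, fs.constants.W_OK);
    return true;
  } catch {
    return false;
  }
}

function probeSessionStore(configDir: string, sessionId: string): ProbeResult {
  return {
    configDir,
    writable: isWritable(configDir),
    matches: findFilesContaining(configDir, sessionId),
  };
}
```

Run inside the agent container right after a dispatch, `probeSessionStore(defaultConfigDir(), storedId)` separates the cases: `writable: false` points at the mount, while `writable: true` with no matches points at retention or id capture.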

Acceptance criteria

Root cause identified and documented in the PR body — which of the three suspects (or something else) is at fault, with concrete evidence.
Fix lands that makes session resume actually succeed for at least one round-trip: dispatch → persist → re-dispatch → resume succeeds (no "No conversation found" log line).
Regression test: src/sessions.test.ts or src/agent-runner.test.ts covers the resume-after-persist flow. If the bug was in container-volume layout, the test can mock the filesystem; if it was in id capture, the test mocks the SDK message stream.
scripts/smoke-creds.sh grows a probe for session-resume: after a fake dispatch stores a session id, the next check asserts the on-disk conversation file exists in the expected location. Fails loud if the volume layout regresses.
CLAUDE.md gets a paragraph documenting the session lifecycle (capture → persist → re-dispatch → resume) so future agents don't have to rediscover the flow.
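For the id-capture variant of the regression test mentioned above, a mocked message stream can be reduced to the ids it exposes. The sketch below is illustrative only: `SdkMessageLike` is an assumed shape covering just the fields read here (a `session_id` on messages, and `type: "system"` with `subtype: "init"` for the init message) and should be checked against the pinned SDK's types. Whether the init id can differ from later ids is what the investigation needs to establish; the helper only makes such a difference visible.

```typescript
// ASSUMED message shape: only the fields this sketch reads.
interface SdkMessageLike {
  type: string;
  subtype?: string;
  session_id?: string;
}

interface CaptureState {
  initId: string | null;   // id seen on the first system/init message
  latestId: string | null; // most recent id seen on any message
}

const emptyCapture: CaptureState = { initId: null, latestId: null };

// Pure reducer, so it can be called from a `for await` loop over the real
// stream as well as from a test over a plain array.
function observe(state: CaptureState, msg: SdkMessageLike): CaptureState {
  if (!msg.session_id) return state;
  const isInit = msg.type === "system" && msg.subtype === "init";
  return {
    initId: isInit && state.initId === null ? msg.session_id : state.initId,
    latestId: msg.session_id,
  };
}

function captureFromStream(messages: Iterable<SdkMessageLike>): CaptureState {
  let state = emptyCapture;
  for (const msg of messages) state = observe(state, msg);
  return state;
}

// True when the id taken from system/init differs from the id the stream ended with.
function idDrifted(state: CaptureState): boolean {
  return state.initId !== null && state.latestId !== null && state.initId !== state.latestId;
}
```

A test can then assert that the id handed to the persistence layer equals `latestId`, and fail if `idDrifted` reports true for a recorded stream.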

Out of scope

SDK version bump to a newer Claude Agent SDK — unless the investigation finds the bug is an upstream regression, stay on the current pinned version.
Session-id migration for historical SQLite entries — they can stay; they'll just keep failing to resume until they age out of the sessions store (sweeper TTL).
Multi-agent session sharing — not in scope.

References

src/sessions.ts — session-id persistence.
src/agent-runner.ts — runWithSessionResume and the session-id capture logic.
Claude Agent SDK docs on session resume: https://docs.anthropic.com/en/docs/agents-and-tools/claude-agent-sdk (look for resume / resumeSession).
Container config dir bind: src/container.ts — how CLAUDE_CONFIG_DIR is mounted.
Observed log pattern (from today): journalctl [agent-name] <task-id>: resume of <uuid> failed — retrying fresh.

Dependencies

Blocked by: nothing.
Blocks: nothing structural, but unblocks real-world value — every multi-turn dispatch currently wastes cold-start overhead.
Branch off: main.


claude-desktop added the area:sessions and type:bug labels 2026-04-20 10:22:57 +00:00