sessions: diagnose persistent "No conversation found with session ID" resume failures #124
Reference
charles/claude-hooks#124
User story
As the operator, I want Claude Agent SDK session resume to actually succeed when we have a stored session id, so that multi-turn workflows (boss on a long PR, designer across dispatches) don't start fresh every time and waste turns + tokens + cost re-building context.
What happens today
Every agent dispatch with a stored session id hits this:
Concrete examples from today:
The session id exists in SQLite (`sessions.ts` persisted it on the prior dispatch), but the Claude Agent SDK's `query({ resume: <id> })` rejects it as "not found". Something between "the SDK issued id X at turn N" and "we hand id X back at turn N+1 in a new dispatch" loses the handle.
Effect: every dispatch pays for fresh context (system prompt, tool schemas, initial file reads) instead of resuming from the prior turn. For a 20-turn PR series that's 20× wasted cold-start overhead.
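The round trip described above (capture the id at turn N, hand it back at turn N+1, fall back to a fresh session when resume is rejected) can be sketched as follows. This is a minimal illustration, not the code in `agent-runner.ts`: `SdkMessage`, `RunQuery`, `SessionStore`, `runOnce` and `dispatch` are hypothetical names, the SDK's `query()` is replaced by an injected function so the flow can be exercised without the SDK, and it assumes that a rejected resume surfaces as a thrown error and that the `system/init` message carries a `session_id` field.

```typescript
// Hypothetical shapes for illustration; the real SDK message types are richer.
interface SdkMessage {
  type: string;
  subtype?: string;
  session_id?: string;
}

// Stand-in for the SDK's query(), injected so the flow can run without the SDK.
type RunQuery = (args: { prompt: string; resume?: string }) => AsyncIterable<SdkMessage>;

// Stand-in for the SQLite-backed store in sessions.ts.
interface SessionStore {
  get(key: string): string | undefined;
  set(key: string, sessionId: string): void;
}

interface DispatchOutcome {
  sessionId: string | undefined; // id persisted for the next dispatch
  resumed: boolean; // true only if the stored id was accepted
}

// Drain one query and return the id announced by its system/init message.
async function runOnce(
  runQuery: RunQuery,
  prompt: string,
  resume?: string,
): Promise<string | undefined> {
  let sessionId: string | undefined;
  for await (const msg of runQuery({ prompt, resume })) {
    if (msg.type === "system" && msg.subtype === "init" && msg.session_id) {
      sessionId = msg.session_id;
    }
  }
  return sessionId;
}

async function dispatch(
  key: string,
  prompt: string,
  runQuery: RunQuery,
  store: SessionStore,
): Promise<DispatchOutcome> {
  const stored = store.get(key);
  let sessionId: string | undefined;
  let resumed = false;
  if (stored !== undefined) {
    try {
      sessionId = await runOnce(runQuery, prompt, stored);
      resumed = true;
    } catch {
      // Assumed failure mode: a rejected resume throws. A real runner would
      // match the "No conversation found" error instead of catching everything.
      sessionId = await runOnce(runQuery, prompt);
    }
  } else {
    sessionId = await runOnce(runQuery, prompt);
  }
  // Persist whatever id the successful run announced, replacing a stale one.
  if (sessionId !== undefined) store.set(key, sessionId);
  return { sessionId, resumed };
}
```

Returning `resumed` separately from `sessionId` makes the symptom in this ticket measurable: a run of dispatches where `resumed` is always `false` despite a stored id is exactly the failure being reported.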
Investigation scope
Before proposing a fix, the ticket owner must reproduce the failure and locate where the handle gets lost. Likely suspects:
- SDK-side retention: Claude Code's session store is file-based in `~/.claude` (or `CLAUDE_CONFIG_DIR`). If our per-agent config dirs don't persist that state across container restarts or session-env bind-mount invalidation, the SDK won't find the conversation even though we know the id.
  - `ls -la $CLAUDE_CONFIG_DIR/projects/*/conversations/` (or equivalent): is the conversation file there? Is it the one id `sessions.ts` persisted?
- Session id recording bug: `agent-runner.ts` captures the session id from the SDK's `system/init` message. If it captures a transient id (e.g. the pre-init placeholder) instead of the final conversation id, the stored id is never valid for resume.
- Cache dir / volume mismatch: agents run in containers with their state volume mounted at `/state`. The SDK probably stores conversations somewhere under the agent's `CLAUDE_CONFIG_DIR` (our per-agent env dir on the host, bind-mounted read-only). If read-only, the SDK can't persist the conversation file, which means the handle was never written in the first place, even though we "stored" an id.
Acceptance criteria
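Both the filesystem suspects under Investigation scope and the smoke-probe criterion in this section reduce to one on-disk question: does a conversation file for the stored id exist under the config dir, and could the SDK have written it? A rough sketch of such a probe follows. The names `findSessionFiles`, `probeSession` and `ProbeResult` are hypothetical, and because the exact directory layout is not confirmed in this ticket, the probe assumes only that the file name contains the session id and walks the tree instead of hard-coding a path.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

interface ProbeResult {
  configDirExists: boolean;
  configDirWritable: boolean; // false on a read-only bind mount
  matches: string[]; // files whose name contains the session id
}

// Walk `root` up to `maxDepth` levels and collect files named after the id.
function findSessionFiles(root: string, sessionId: string, maxDepth = 5): string[] {
  const found: string[] = [];
  const walk = (dir: string, depth: number): void => {
    let entries: fs.Dirent[];
    try {
      entries = fs.readdirSync(dir, { withFileTypes: true });
    } catch {
      return; // missing or unreadable directory: nothing to report here
    }
    for (const entry of entries) {
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        if (depth < maxDepth) walk(full, depth + 1);
      } else if (entry.name.includes(sessionId)) {
        found.push(full);
      }
    }
  };
  walk(root, 0);
  return found;
}

function probeSession(configDir: string, sessionId: string): ProbeResult {
  const configDirExists = fs.existsSync(configDir);
  let configDirWritable = false;
  if (configDirExists) {
    try {
      // Throws if the current user cannot write, including on read-only mounts.
      fs.accessSync(configDir, fs.constants.W_OK);
      configDirWritable = true;
    } catch {
      configDirWritable = false;
    }
  }
  const matches = configDirExists ? findSessionFiles(configDir, sessionId) : [];
  return { configDirExists, configDirWritable, matches };
}
```

Run from inside the container against the agent's `CLAUDE_CONFIG_DIR` with an id taken from SQLite: an empty `matches` with `configDirWritable` false would point at the read-only mount, an empty `matches` with a writable dir would point at retention or at id capture, and a non-empty `matches` would mean the file exists and the SDK is looking somewhere else.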
- `src/sessions.test.ts` or `src/agent-runner.test.ts` covers the resume-after-persist flow. If the bug was in container-volume layout, the test can mock the filesystem; if it was in id capture, the test mocks the SDK message stream.
- `scripts/smoke-creds.sh` grows a probe for session-resume: after a fake dispatch stores a session id, the next check asserts the on-disk conversation file exists in the expected location. Fails loud if the volume layout regresses.
Out of scope
References
- `src/sessions.ts`: session-id persistence.
- `src/agent-runner.ts`: `runWithSessionResume` and the session-id capture logic.
- resume/resumeSession).
- `src/container.ts`: how `CLAUDE_CONFIG_DIR` is mounted.
- `[agent-name] <task-id>: resume of <uuid> failed — retrying fresh.`
Dependencies
main.