B14 — TTL stale session IDs (close F4 — wasted resume attempts) #430

Closed
opened 2026-04-27 07:26:03 +00:00 by claude-desktop · 0 comments
Collaborator

As an orchestrator,
I want to drop a stored session ID after 1 resume failure (or after MAX_SESSION_AGE_MS since last use),
so that the next dispatch goes straight to start fresh without a wasted resume attempt.

Last night 40 resume failures occurred. Each adds ~3–10 s and one failed Anthropic API call. Pure waste — sessions are TTL'd Anthropic-side and won't come back.

Acceptance criteria

TTL on read

  • On every read of a session ID: check last_used_at; if older than MAX_SESSION_AGE_MS (default 24 h, configurable in config/agents.json), drop the row and return null (caller starts fresh, no resume attempt).
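
A minimal sketch of the read-path check, assuming an in-memory `Map` as a stand-in for the SQLite `agent_sessions` table. The row shape, the `readSessionId` name, and the injectable `now` parameter are hypothetical and only there to make the expiry rule concrete and testable.

```typescript
// Hypothetical row shape; the real agent_sessions schema may differ.
interface SessionRow {
  sessionId: string;
  lastUsedAt: number; // epoch milliseconds
}

// Default from this issue: 24 h. The real value would be read from config/agents.json.
const MAX_SESSION_AGE_MS = 24 * 60 * 60 * 1000;

// In-memory stand-in for the SQLite agent_sessions table, keyed by agent name.
const agentSessions = new Map<string, SessionRow>();

// Returns the stored session ID, or null when there is no row or the row has expired.
// An expired row is deleted as a side effect, so the caller starts fresh.
function readSessionId(
  agent: string,
  now: number = Date.now(),
  maxAgeMs: number = MAX_SESSION_AGE_MS,
): string | null {
  const row = agentSessions.get(agent);
  if (row === undefined) return null;
  if (now - row.lastUsedAt > maxAgeMs) {
    agentSessions.delete(agent);
    return null;
  }
  return row.sessionId;
}
```

Passing `now` explicitly keeps the 25 h and 1 h unit tests deterministic without mocking the system clock.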

TTL on failure

  • On resume failed — No conversation found: delete the session ID from the SQLite agent_sessions table immediately. Do not retry.
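
The failure path could look roughly like the sketch below. It assumes the failure reaches the orchestrator as an `Error` whose message contains `No conversation found`; the `SessionStore` interface and the `handleResumeFailure` helper are hypothetical names, not existing code.

```typescript
// Hypothetical store interface; only the delete operation matters here.
interface SessionStore {
  delete(agent: string): void;
}

// Assumption: the stale-session failure is recognisable by this message fragment.
const NO_CONVERSATION = "No conversation found";

function isStaleSessionError(err: unknown): boolean {
  return err instanceof Error && err.message.includes(NO_CONVERSATION);
}

// Called once from the catch block around the resume attempt.
// Returns true when the stored ID was dropped; the caller should then start
// fresh and must not retry the resume. Returns false for unrelated errors,
// which the caller handles as before.
function handleResumeFailure(
  agent: string,
  err: unknown,
  store: SessionStore,
): boolean {
  if (!isStaleSessionError(err)) return false;
  store.delete(agent);
  return true;
}
```

Keeping the check separate from the delete means unrelated failures, such as a network timeout, do not throw away a session ID that may still be valid.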

Startup sweep

  • On orchestrator boot: drop all rows with last_used_at older than MAX_SESSION_AGE_MS.
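
As a sketch, the boot-time sweep is a single pass over the table. The version below again uses a `Map` in place of SQLite, and `sweepExpiredSessions` and `StoredSession` are hypothetical names. Against the real table this would more likely be one `DELETE` with a cutoff on `last_used_at`, assuming that column stores a comparable timestamp.

```typescript
// Hypothetical row shape; the real agent_sessions schema may differ.
interface StoredSession {
  sessionId: string;
  lastUsedAt: number; // epoch milliseconds
}

// Drops every row older than maxAgeMs and returns how many were removed,
// so the count can be logged at startup.
function sweepExpiredSessions(
  table: Map<string, StoredSession>,
  maxAgeMs: number,
  now: number = Date.now(),
): number {
  let dropped = 0;
  // Deleting entries while iterating a Map with forEach is safe in JavaScript.
  table.forEach((row, agent) => {
    if (now - row.lastUsedAt > maxAgeMs) {
      table.delete(agent);
      dropped += 1;
    }
  });
  return dropped;
}
```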

Metric

  • Expose claude_session_resume_failures_total counter for ops visibility (also surfaced in B15 watchdog tile).
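
For illustration only, a hand-rolled counter is sketched below. The issue does not say which metrics system is in use; the `_total` suffix suggests Prometheus naming, so the `render` method assumes the Prometheus text exposition format. The `Counter` class and the help string are hypothetical, and a real implementation would more likely reuse whatever metrics library the orchestrator already has.

```typescript
// Minimal monotonic counter, standing in for a real metrics library.
class Counter {
  readonly name: string;
  readonly help: string;
  private value = 0;

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  inc(): void {
    this.value += 1;
  }

  get(): number {
    return this.value;
  }

  // Renders the counter in the Prometheus text exposition format (assumed target).
  render(): string {
    return (
      `# HELP ${this.name} ${this.help}\n` +
      `# TYPE ${this.name} counter\n` +
      `${this.name} ${this.value}\n`
    );
  }
}

// Incremented once per failed resume attempt, before the stored ID is dropped.
const resumeFailures = new Counter(
  "claude_session_resume_failures_total",
  "Session resume attempts that failed.",
);
```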

Tests

  • Unit test: row with last_used_at 25 h ago → read returns null + row deleted.
  • Unit test: row with last_used_at 1 h ago → read returns ID.
  • Unit test: resume failure → row deleted.
  • Unit test: startup sweep drops N rows older than TTL.

Out of scope

  • Resuming sessions across orchestrator restarts via a different mechanism (the Anthropic API doesn't support it).

References

  • Spec: docs/specs/automation-hardening.md §4 B14.
  • Session store: apps/server/src/infrastructure/database/sessions.ts (or wherever the agent_sessions table lives).
  • Night-1 incident: 40 resume failures across the run.
Reference
charles/claude-hooks#430