B14 — TTL stale session IDs (close F4 — wasted resume attempts) #430

Closed

opened 2026-04-27 07:26:03 +00:00 by claude-desktop · 0 comments

claude-desktop commented

2026-04-27 07:26:03 +00:00

Collaborator

As an orchestrator,
I want to drop a stored session ID after 1 resume failure (or after MAX_SESSION_AGE_MS since last use),
so that the next dispatch goes straight to start fresh without a wasted resume attempt.

Last night's run had 40 resume failures. Each one adds ~3–10 s and one failed Anthropic API call. This is pure waste: the sessions have already expired (TTL'd) on Anthropic's side and won't come back.

Acceptance criteria

TTL on read

On every read of a session ID: check last_used_at; if older than MAX_SESSION_AGE_MS (default 24 h, configurable in config/agents.json), drop the row and return null (caller starts fresh, no resume attempt).
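
A minimal sketch of the read-path check, for discussion only. It uses an in-memory `Map` as a stand-in for the SQLite `agent_sessions` table so it runs on its own. The row shape, the `readSessionId` name, and keying by agent ID are assumptions, not the actual code in the session store.

```typescript
// Hypothetical row shape; the real agent_sessions schema may differ.
interface SessionRow {
  sessionId: string;
  lastUsedAt: number; // epoch milliseconds
}

// Default from the acceptance criteria (24 h). Assumed to be overridden
// by a value loaded from config/agents.json in the real orchestrator.
const MAX_SESSION_AGE_MS = 24 * 60 * 60 * 1000;

// In-memory stand-in for the agent_sessions table, keyed by agent ID (assumption).
const agentSessions = new Map<string, SessionRow>();

function readSessionId(
  agentId: string,
  now: number = Date.now(),
  maxAgeMs: number = MAX_SESSION_AGE_MS,
): string | null {
  const row = agentSessions.get(agentId);
  if (row === undefined) {
    return null;
  }
  if (now - row.lastUsedAt > maxAgeMs) {
    agentSessions.delete(agentId); // drop the stale row
    return null; // caller starts fresh, no resume attempt
  }
  return row.sessionId;
}
```

Passing `now` and `maxAgeMs` as parameters keeps the 25 h and 1 h unit tests deterministic without mocking the clock.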

TTL on failure

On resume failed — No conversation found: delete the session ID from the SQLite agent_sessions table immediately. Do not retry.
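
One possible shape for the failure path, sketched under assumptions: the store is again an in-memory `Map` standing in for `agent_sessions`, and `handleResumeFailure` is a hypothetical helper name. The only detail taken from this issue is the `No conversation found` message text; how the real error surfaces (exception type, log line, exit code) is not specified here.

```typescript
// In-memory stand-in for the agent_sessions table: agent ID -> session ID (assumption).
const agentSessions = new Map<string, string>();

// Returns true when the stored session ID was dropped, meaning the caller
// should start fresh and must not retry the resume.
// Returns false for unrelated errors, leaving the row untouched.
function handleResumeFailure(agentId: string, err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  if (!message.includes("No conversation found")) {
    return false;
  }
  agentSessions.delete(agentId); // delete immediately, no retry
  return true;
}
```

Matching on the message substring is fragile; if the client exposes a typed error or an error code, checking that instead would probably be safer.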

Startup sweep

On orchestrator boot: drop all rows with last_used_at older than MAX_SESSION_AGE_MS.
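
The boot-time sweep could look roughly like this. As before, a `Map` stands in for the SQLite table and the names (`SessionRow`, `sweepStaleSessions`) are illustrative assumptions. Returning the number of dropped rows makes the "drops N rows" unit test straightforward.

```typescript
// Hypothetical row shape; the real agent_sessions schema may differ.
interface SessionRow {
  sessionId: string;
  lastUsedAt: number; // epoch milliseconds
}

// Deletes every row whose lastUsedAt is older than maxAgeMs and
// returns how many rows were dropped.
function sweepStaleSessions(
  sessions: Map<string, SessionRow>,
  maxAgeMs: number,
  now: number = Date.now(),
): number {
  let dropped = 0;
  // Deleting the current entry while iterating a Map is safe in JavaScript.
  sessions.forEach((row, agentId) => {
    if (now - row.lastUsedAt > maxAgeMs) {
      sessions.delete(agentId);
      dropped += 1;
    }
  });
  return dropped;
}
```

Against SQLite, the same sweep would likely be a single `DELETE FROM agent_sessions WHERE last_used_at < ?` with the cutoff computed as `now - MAX_SESSION_AGE_MS`, assuming `last_used_at` is stored as epoch milliseconds.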

Metric

Expose claude_session_resume_failures_total counter for ops visibility (also surfaced in B15 watchdog tile).
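
A dependency-free sketch of the counter. The `_total` suffix suggests a Prometheus-style metric, but this issue does not say which metrics library the orchestrator uses, so the `Counter` class below is a hypothetical stand-in that renders the Prometheus text exposition format. In practice an existing metrics client would probably replace it.

```typescript
// Minimal monotonic counter; a stand-in for whatever metrics client is in use.
class Counter {
  private value = 0;
  private readonly name: string;
  private readonly help: string;

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  inc(by: number = 1): void {
    if (by < 0) {
      throw new Error("counters can only increase");
    }
    this.value += by;
  }

  get(): number {
    return this.value;
  }

  // Renders the counter in Prometheus text exposition format.
  render(): string {
    return [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
      `${this.name} ${this.value}`,
    ].join("\n") + "\n";
  }
}

// Metric name from the acceptance criteria; the help text is illustrative.
const resumeFailures = new Counter(
  "claude_session_resume_failures_total",
  "Session resume attempts that failed.",
);
```

The failure handler would call `resumeFailures.inc()` once per failed resume, right where the row is deleted, so the counter and the table stay in step.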

Tests

Unit test: row with last_used_at 25 h ago → read returns null + row deleted.
Unit test: row with last_used_at 1 h ago → read returns ID.
Unit test: resume failure → row deleted.
Unit test: startup sweep drops N rows older than TTL.

Out of scope

Resuming sessions across orchestrator restarts via a different mechanism (the Anthropic API doesn't support it).

References

Spec: docs/specs/automation-hardening.md §4 B14.
Session store: apps/server/src/infrastructure/database/sessions.ts (or wherever the agent_sessions table lives).
Night-1 incident: 40 resume failures across the run.

claude-desktop added the area:agents and type:user-story labels

2026-04-27 07:26:33 +00:00

claude-desktop added this to the v1-automation-hardening milestone

2026-04-27 07:26:39 +00:00