feat(sessions): TTL stale session IDs to eliminate wasted resume attempts (B14) #441

Merged

code-lead merged 1 commit from dev/430 into main

2026-04-27 09:52:32 +00:00

dev commented

2026-04-27 09:23:51 +00:00

Collaborator

Drops stored session IDs that haven't been used within `MAX_SESSION_AGE_MS` (default 24 h) so the next dispatch goes straight to a fresh start instead of wasting an Anthropic API call on a server-side-expired session.

Test plan

- [ ] `getSession` with `last_used_at` 25 h ago returns null and deletes the row
- [ ] `getSession` with `last_used_at` 1 h ago returns the session ID
- [ ] `sweepStaleSessions()` drops N stale rows, leaves fresh rows intact
- [ ] Legacy string-value entries (pre-B14 `sessions.json`) treated as `last_used_at=0`, immediately expired
- [ ] `incrementSessionResumeFailures()` / `getSessionResumeFailuresTotal()` counter advances
- [ ] `bun test apps/server/src/infrastructure/database/sessions.test.ts`: 16 tests pass
- [ ] `bun x tsc --noEmit`: no errors

Closes #430
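
For readers skimming the description, the TTL-on-read behaviour can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in this PR: the in-memory `Map` stands in for `sessions.json`, and `getSessionSketch`, `normalize`, `StoredValue`, and the explicit `now` parameter are hypothetical names introduced only for the example.

```typescript
// Hypothetical in-memory stand-in for sessions.json (the real store is a file).
type SessionRecord = { id: string; last_used_at: number };
type StoredValue = string | SessionRecord; // legacy entries are bare strings

const DEFAULT_MAX_SESSION_AGE_MS = 24 * 60 * 60 * 1000; // 24 h, per the description

// Legacy string values carry no timestamp, so treat them as last_used_at = 0.
function normalize(value: StoredValue): SessionRecord {
  return typeof value === "string" ? { id: value, last_used_at: 0 } : value;
}

// Returns the session ID if it is still within the TTL; otherwise deletes
// the row and returns null so the caller starts a fresh session.
function getSessionSketch(
  store: Map<string, StoredValue>,
  key: string,
  now: number = Date.now(),
  maxAgeMs: number = DEFAULT_MAX_SESSION_AGE_MS,
): string | null {
  const raw = store.get(key);
  if (raw === undefined) return null;
  const record = normalize(raw);
  if (now - record.last_used_at > maxAgeMs) {
    store.delete(key); // drop the stale row so later reads skip it too
    return null;
  }
  return record.id;
}
```

Passing `now` explicitly is only a convenience for testing the 25 h and 1 h cases without mocking the clock; the real `getSession` may read the time differently.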


dev added 1 commit

2026-04-27 09:23:51 +00:00

feat(sessions): TTL stale session IDs to eliminate wasted resume attempts (B14)

qa / qa (pull_request) Successful in 7m52s


qa / dockerfile (pull_request) Successful in 14s


0ac65d2f97

- sessions.json format upgraded to record objects `{ id, last_used_at }`
  with backward-compat read of legacy string values (treated as last_used_at=0,
  immediately expired)
- getSession now performs TTL check: drops row + returns null when
  last_used_at is older than MAX_SESSION_AGE_MS (default 24 h)
- setSession always refreshes last_used_at on every write (including
  same-id updates) so successful resumes reset the clock
- sweepStaleSessions() purges all expired rows at startup
- setMaxSessionAgeMs()/sessionMaxAgeMs in WebhookConfig allow operators to
  configure the TTL via session_max_age_ms in config/agents.json
- incrementSessionResumeFailures() / getSessionResumeFailuresTotal() expose
  the claude_session_resume_failures_total counter (B15 hook); counter
  incremented on "No conversation found" resume failures
- sweeper.ts readLiveSessionIds updated to extract IDs from both legacy and
  new record formats
- 16 tests: existing round-trip/concurrency suite + 5 new B14 tests covering
  TTL-on-read (25h stale / 1h fresh), startup sweep, legacy compat,
  and failure counter

Closes #430

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
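
The startup sweep and the dual-format ID extraction described in the commit message might look something like the following. It is a rough sketch under assumptions, not the committed code: `SessionsFile`, `sweepStaleSketch`, and `liveSessionIdsSketch` are hypothetical names, and the sessions map is passed in directly rather than read from `sessions.json`.

```typescript
// Hypothetical shape of the parsed sessions.json: legacy entries are bare
// strings, B14 entries are { id, last_used_at } records.
type SessionEntry = { id: string; last_used_at: number };
type SessionsFile = Record<string, string | SessionEntry>;

// Deletes every entry older than maxAgeMs and returns how many were dropped.
// Legacy strings have no timestamp, so they count as last_used_at = 0.
function sweepStaleSketch(
  sessions: SessionsFile,
  now: number,
  maxAgeMs: number,
): number {
  let dropped = 0;
  for (const [key, value] of Object.entries(sessions)) {
    const lastUsedAt = typeof value === "string" ? 0 : value.last_used_at;
    if (now - lastUsedAt > maxAgeMs) {
      delete sessions[key];
      dropped++;
    }
  }
  return dropped;
}

// Collects session IDs from both legacy and record-format entries.
function liveSessionIdsSketch(sessions: SessionsFile): Set<string> {
  const ids = new Set<string>();
  for (const value of Object.values(sessions)) {
    ids.add(typeof value === "string" ? value : value.id);
  }
  return ids;
}
```

Running the sweep once at startup keeps rows that are never read again from accumulating, while the TTL check on read handles sessions that expire while the server is up.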

dev requested review from reviewer

2026-04-27 09:48:24 +00:00

reviewer approved these changes

2026-04-27 09:51:57 +00:00

reviewer left a comment

All B14 acceptance criteria met, CI green.

TTL-on-read (`getSession`), startup sweep (`sweepStaleSessions` + `setMaxSessionAgeMs` wired in `main.ts`), counter (`incrementSessionResumeFailures` / `getSessionResumeFailuresTotal`), config (`session_max_age_ms` → `sessionMaxAgeMs` default 24 h), legacy backward-compat, and 16-test suite all correct.

Nit (not blocking): `agent-runner.test.ts` test `(c) resume fails` uses `"session expired"` as the error text, so the `if (msg.includes("No conversation found"))` branch in `runWithSessionResume` isn't exercised by a test. Counter unit tests in `sessions.test.ts` cover the function itself, but the integration path from `runWithSessionResume` → counter is untested. Worth adding a `(d)` case with `"No conversation found"` in a follow-up.
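
To make the untested path concrete, here is a simplified sketch of the branch in question and what a `(d)` case would need to observe. Everything here is an assumption for illustration: the real `runWithSessionResume` signature is not shown in this PR page, the sketch is synchronous for brevity, it assumes any resume failure falls back to a fresh start, and `runWithResumeSketch` plus the local counter helpers are hypothetical stand-ins.

```typescript
// Hypothetical local counter standing in for
// incrementSessionResumeFailures() / getSessionResumeFailuresTotal().
let resumeFailuresTotal = 0;
function incrementFailuresSketch(): void {
  resumeFailuresTotal++;
}
function getFailuresSketch(): number {
  return resumeFailuresTotal;
}

// Tries to resume with the stored session ID; on failure, bumps the counter
// only for "No conversation found" errors, then retries with no session ID.
function runWithResumeSketch(
  sessionId: string | null,
  run: (sessionId: string | null) => string,
): string {
  if (sessionId !== null) {
    try {
      return run(sessionId);
    } catch (err) {
      const msg = err instanceof Error ? err.message : String(err);
      if (msg.includes("No conversation found")) {
        incrementFailuresSketch();
      }
      // Assumed behaviour: fall through to a fresh start below.
    }
  }
  return run(null);
}
```

A `(d)` test would stub the runner to throw an error containing `"No conversation found"` on resume and assert that the counter advanced by one, whereas the existing `"session expired"` text in `(c)` would leave it unchanged.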
