feat(sessions): disk-backed session store and resume flow #25

Merged
code-lead merged 1 commit from boss/6 into main 2026-04-17 14:20:41 +00:00
Collaborator

Summary

Replaces the sessionKey / dropSession stubs in main.ts with a real disk-backed store (src/sessions.ts) and wires session resumption into runAgent.

  • Sessions store (src/sessions.ts): sessionKey, getSession, setSession, dropSession, dropAllForIssue, backed by ~/.local/state/claude-hooks/sessions.json (override with CLAUDE_HOOKS_STATE_DIR). Atomic writes via tmp + rename. Concurrent writes serialised by an in-process promise-chain mutex — documented as last-writer-wins for the (unexpected) cross-process case.
  • runAgent: looks up the key at task start; passes resume: id to query() when a session exists; captures the first SDK message's session_id and persists it; on a thrown error during a resumed run, drops the bad session and retries once fresh with a warning.
  • Skill prompts: new delta files implement-delta.md, review-delta.md, address-review-delta.md. The webhook picks <base>-delta.md when a session already exists for the (agent, repo, issue/pr) key, else falls back to the full template.
  • Cleanup webhooks: cleanupIssue calls dropAllForIssue(repo, issue); cleanupBranch accepts an optional prNumber and does the same. The PR-close handler passes it.
  • /reset: unchanged shape — now awaits the real async dropSession.

Test plan

  • bun test — 69 tests pass (new src/sessions.test.ts covers round-trip, atomic write + corrupt-tmp survival, concurrent sets, dropAllForIssue scoping)
  • bunx tsc --noEmit — clean
  • bunx biome check src/ — clean
  • Main test suite updated: new /reset case asserts session=true after a real setSession
  • Live smoke: re-dispatch the same issue twice, confirm second run logs resuming session <id> and uses the delta skill

Closes #6

## Summary Replaces the `sessionKey` / `dropSession` stubs in `main.ts` with a real disk-backed store (`src/sessions.ts`) and wires session resumption into `runAgent`. - **Sessions store** (`src/sessions.ts`): `sessionKey`, `getSession`, `setSession`, `dropSession`, `dropAllForIssue`, backed by `~/.local/state/claude-hooks/sessions.json` (override with `CLAUDE_HOOKS_STATE_DIR`). Atomic writes via `tmp + rename`. Concurrent writes serialised by an in-process promise-chain mutex — documented as last-writer-wins for the (unexpected) cross-process case. - **runAgent**: looks up the key at task start; passes `resume: id` to `query()` when a session exists; captures the first SDK message's `session_id` and persists it; on a thrown error during a resumed run, drops the bad session and retries once fresh with a warning. - **Skill prompts**: new delta files `implement-delta.md`, `review-delta.md`, `address-review-delta.md`. The webhook picks `<base>-delta.md` when a session already exists for the (agent, repo, issue/pr) key, else falls back to the full template. - **Cleanup webhooks**: `cleanupIssue` calls `dropAllForIssue(repo, issue)`; `cleanupBranch` accepts an optional `prNumber` and does the same. The PR-close handler passes it. - **`/reset`**: unchanged shape — now awaits the real async `dropSession`. ## Test plan - [x] `bun test` — 69 tests pass (new `src/sessions.test.ts` covers round-trip, atomic write + corrupt-tmp survival, concurrent sets, `dropAllForIssue` scoping) - [x] `bunx tsc --noEmit` — clean - [x] `bunx biome check src/` — clean - [x] Main test suite updated: new `/reset` case asserts `session=true` after a real `setSession` - [ ] Live smoke: re-dispatch the same issue twice, confirm second run logs `resuming session <id>` and uses the delta skill Closes #6
feat(sessions): disk-backed session store and resume flow
All checks were successful
qa / qa (pull_request) Successful in 47s
34dff638a6
Move `sessionKey` and the `dropSession` stub out of `main.ts` into a new
`src/sessions.ts` backed by `~/.local/state/claude-hooks/sessions.json`
(override via `CLAUDE_HOOKS_STATE_DIR`). Writes go through a tmp+rename
and a promise-chain mutex so concurrent workers can't lose an update.

Wire `runAgent` to resume a stored session via the SDK's `resume` option,
capture the session id from the first SDK message, and automatically
retry once with a fresh session if the resumed id is rejected.

On resume, the webhook dispatches a short delta skill
(`<base>-delta.md`) instead of the full template — added for
`implement`, `review`, and `address-review`. `merge` falls back to the
full template since it's a one-shot event.

Cleanup webhooks (`cleanupIssue`, `cleanupBranch` with `prNumber`) now
call `dropAllForIssue` so a closed issue/PR drops every agent's session
for that ticket. `/reset` keeps its shape; the stub return value flips
from always-false to the real dropSession result.

Closes #6
reviewer approved these changes 2026-04-17 13:45:31 +00:00
reviewer left a comment

Review — PR #25: disk-backed session store and resume flow

All acceptance criteria from issue #6 are met. Code is correct, tests are thorough, and the edge cases are handled well. Two observations below — neither blocks merging.


AC checklist

Sessions store (src/sessions.ts)

  • sessionKey, getSession, setSession, dropSession, dropAllForIssue — all implemented and exported
  • Backing store at ~/.local/state/claude-hooks/sessions.json (env override works)
  • Atomic writes via tmp + rename with a sufficiently unique tmp name (pid.ms.rand)
  • In-process mutex via promise-chain; cross-process semantics documented as last-writer-wins — appropriate for a singleton service

runAgent integration

  • sessionKey computed at task start; resume: id passed to query() when a session is stored
  • captured flag correctly gates the first session_id message; setSession called with .catch() so a disk error does not kill the task
  • On throw during resumed run: session dropped, single fresh retry, warning logged — matches AC
  • Cleanup webhooks call dropAllForIssue; cleanupBranch now accepts prNumber
  • /reset endpoint now awaits the real async dropSession

Skill prompt delta vs. full

  • skillForEvent helper is clean; try/catch fallback to full template when delta file is missing
  • implement-delta.md, review-delta.md, address-review-delta.md added

Tests

  • Round-trip (set/get/drop), double-drop returns false
  • Atomic-write: stray .tmp.broken sibling does not corrupt reads
  • Invalid JSON on disk → treated as empty, recovers on next write
  • 50 concurrent setSession calls all land without data loss
  • dropAllForIssue scoping: only matching (repo, issue) keys removed
  • /reset test in main.test.ts now exercises the real dropSession and asserts session=true

Observation 1 — merge missing its delta file (non-blocking)

handleApproved now calls skillForEvent(MERGE_AGENT, repo, pr.number, "merge") but skills/merge-delta.md is absent. The try/catch fallback in skillForEvent silently returns the full merge.md, so the merge agent never benefits from session resumption. Either add merge-delta.md, or revert this call back to loadSkill("merge") to make the intent explicit. If merge is intentionally a one-shot action the latter is cleaner and avoids the unnecessary getSession lookup.

Observation 2 — retry scope is broader than "session not found" (non-blocking)

The catch block retries fresh whenever runOnce(storedSession) throws and storedSession is non-null — regardless of error type. This covers the documented case (expired/unknown session ID) but also catches mid-run errors (Forgejo API blip, Bash failure, etc.), dropping a valid session. The behaviour is still acceptable (fresh retry beats permanent failure), but if the agent had already pushed partial work, the fresh run could open a duplicate PR. Not a new risk (it predates sessions), but the new retry path makes it slightly more likely to surface.

## Review — PR #25: disk-backed session store and resume flow All acceptance criteria from issue #6 are met. Code is correct, tests are thorough, and the edge cases are handled well. Two observations below — neither blocks merging. --- ### AC checklist **Sessions store (`src/sessions.ts`)** - ✅ `sessionKey`, `getSession`, `setSession`, `dropSession`, `dropAllForIssue` — all implemented and exported - ✅ Backing store at `~/.local/state/claude-hooks/sessions.json` (env override works) - ✅ Atomic writes via `tmp + rename` with a sufficiently unique tmp name (`pid.ms.rand`) - ✅ In-process mutex via promise-chain; cross-process semantics documented as last-writer-wins — appropriate for a singleton service **`runAgent` integration** - ✅ `sessionKey` computed at task start; `resume: id` passed to `query()` when a session is stored - ✅ `captured` flag correctly gates the first `session_id` message; `setSession` called with `.catch()` so a disk error does not kill the task - ✅ On throw during resumed run: session dropped, single fresh retry, warning logged — matches AC - ✅ Cleanup webhooks call `dropAllForIssue`; `cleanupBranch` now accepts `prNumber` - ✅ `/reset` endpoint now awaits the real async `dropSession` **Skill prompt delta vs. full** - ✅ `skillForEvent` helper is clean; try/catch fallback to full template when delta file is missing - ✅ `implement-delta.md`, `review-delta.md`, `address-review-delta.md` added **Tests** - ✅ Round-trip (set/get/drop), double-drop returns false - ✅ Atomic-write: stray `.tmp.broken` sibling does not corrupt reads - ✅ Invalid JSON on disk → treated as empty, recovers on next write - ✅ 50 concurrent `setSession` calls all land without data loss - ✅ `dropAllForIssue` scoping: only matching (repo, issue) keys removed - ✅ `/reset` test in `main.test.ts` now exercises the real `dropSession` and asserts `session=true` --- ### Observation 1 — `merge` missing its delta file (non-blocking) `handleApproved` now calls `skillForEvent(MERGE_AGENT, repo, pr.number, "merge")` but `skills/merge-delta.md` is absent. The try/catch fallback in `skillForEvent` silently returns the full `merge.md`, so the merge agent never benefits from session resumption. Either add `merge-delta.md`, or revert this call back to `loadSkill("merge")` to make the intent explicit. If merge is intentionally a one-shot action the latter is cleaner and avoids the unnecessary `getSession` lookup. ### Observation 2 — retry scope is broader than "session not found" (non-blocking) The catch block retries fresh whenever `runOnce(storedSession)` throws and `storedSession` is non-null — regardless of error type. This covers the documented case (expired/unknown session ID) but also catches mid-run errors (Forgejo API blip, Bash failure, etc.), dropping a valid session. The behaviour is still acceptable (fresh retry beats permanent failure), but if the agent had already pushed partial work, the fresh run could open a duplicate PR. Not a new risk (it predates sessions), but the new retry path makes it slightly more likely to surface.
code-lead deleted branch boss/6 2026-04-17 14:20:42 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
charles/claude-hooks!25
No description provided.