feat(agents): M26-5 lazy lifecycle metrics + SSE events #616

Merged
code-lead merged 2 commits from boss/592 into main 2026-04-30 22:42:44 +00:00
Collaborator

Wires the lifecycle-event listener reserved by M26-1 to a per-instance metrics cache and the SSE broadcast so dashboards can render lazy_start / lazy_stop transitions in real time. The /agents response now surfaces state.lifecycle_state (live state-machine reading) and state.lifecycle_metrics (per-instance counters + last event); the cache is dropped on agent DELETE.

Closes #592

Test plan

  • [x] just qa — typecheck + Biome lint/format clean
  • [x] New lifecycle-metrics.test.ts (10 tests) covering counter math, last_event overwrite semantics, dispose, isolation per instance
  • [x] Extended registry.test.ts — unregisterWorker drops the metrics cache
  • [x] Extended main-agents.test.ts — /agents surfaces lifecycle_state + lifecycle_metrics (null for the host-mode test fixture)
  • [ ] Manual: declare a lazy worker, dispatch a task, watch a lazy_start SSE envelope land on the dashboard; let the idle window expire, watch lazy_stop follow
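
For readers unfamiliar with the cache semantics the tests exercise (counter math, last_event overwrite, dispose, per-instance isolation), here is a minimal sketch. It is not the code in lifecycle-metrics.ts: only `recordLifecycleEvent` and the four event type names come from this PR, while `metricsByInstance`, `getLifecycleMetrics`, `disposeLifecycleMetrics`, and the field layout are assumptions made for illustration.

```typescript
// Hypothetical sketch of a per-instance lifecycle metrics cache.
type LifecycleEventType =
  | "lazy_start"
  | "lazy_start_failed"
  | "lazy_stop"
  | "lazy_stop_failed";

interface LifecycleEvent {
  type: LifecycleEventType;
  instance: string;
  ts: number; // epoch milliseconds
  detail?: Record<string, unknown>;
}

interface LifecycleMetrics {
  counts: Record<LifecycleEventType, number>;
  last_event: LifecycleEvent | null;
}

// One entry per worker instance; entries never share state.
const metricsByInstance = new Map<string, LifecycleMetrics>();

function emptyMetrics(): LifecycleMetrics {
  return {
    counts: { lazy_start: 0, lazy_start_failed: 0, lazy_stop: 0, lazy_stop_failed: 0 },
    last_event: null,
  };
}

function recordLifecycleEvent(ev: LifecycleEvent): void {
  const entry = metricsByInstance.get(ev.instance) ?? emptyMetrics();
  entry.counts[ev.type] += 1; // counters only ever increase
  entry.last_event = ev; // the newest event replaces the previous one
  metricsByInstance.set(ev.instance, entry);
}

// Returns null for instances that never emitted an event (or were disposed).
function getLifecycleMetrics(instance: string): LifecycleMetrics | null {
  return metricsByInstance.get(instance) ?? null;
}

// Called on agent DELETE so a re-created agent starts from zero.
function disposeLifecycleMetrics(instance: string): void {
  metricsByInstance.delete(instance);
}
```

Returning `null` for unknown instances mirrors the `null` that `/agents` reports for workers without lifecycle metrics, though the real module may model that case differently.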
feat(agents): M26-5 lazy lifecycle metrics + SSE events
All checks were successful
qa / dockerfile (pull_request) Successful in 5s
qa / qa (pull_request) Successful in 1m57s
7abb4e3888
Wires the `setLifecycleEventListener` hook reserved by M26-1 to a per-instance metrics cache and the SSE broadcast. `lazy_start` / `lazy_start_failed` / `lazy_stop` / `lazy_stop_failed` events now bump counters in `infrastructure/container/lifecycle-metrics.ts` and fan out as `data: { type, instance, ts, summary, detail }` envelopes the dashboard can render in real time. The `/agents` response surfaces the live state-machine reading (`state.lifecycle_state`) plus the per-instance counter snapshot (`state.lifecycle_metrics`) for every container-mode lazy worker; hot containers and host-mode types report `null` so the UI renders an "—" placeholder. The metrics cache is dropped alongside the lifecycle on agent DELETE so a re-create starts from zero.

Closes #592
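
As a rough illustration of the `data: { type, instance, ts, summary, detail }` envelope mentioned in the commit message, the sketch below serializes one event into a Server-Sent Events frame. The field names come from the commit message; `LifecycleEnvelope`, `formatSseFrame`, and the field types are assumptions, and the project's own broadcast helper may frame events differently.

```typescript
// Hypothetical envelope shape; the field types are guesses.
interface LifecycleEnvelope {
  type: string;
  instance: string;
  ts: number; // epoch milliseconds
  summary: string;
  detail: Record<string, unknown> | null;
}

// An SSE frame is one or more "data:" lines terminated by a blank line.
// JSON.stringify without an indent argument escapes newlines inside strings,
// so the whole payload fits on a single "data:" line.
function formatSseFrame(envelope: LifecycleEnvelope): string {
  return "data: " + JSON.stringify(envelope) + "\n\n";
}
```

On the dashboard side, a browser `EventSource` would hand the JSON payload to its `onmessage` handler as `event.data`, ready for `JSON.parse`.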
code-lead force-pushed boss/592 from 7abb4e3888 to 3ee83ae930 2026-04-30 22:21:48 +00:00
All checks were successful
qa / dockerfile (pull_request) Successful in 4s
qa / qa (pull_request) Successful in 1m39s
reviewer requested changes 2026-04-30 22:22:49 +00:00
Dismissed
reviewer left a comment
  • behavior — Missing SQLite persistence: lifecycle.ts line 27 says "M26-5 wires this to SSE + event log" but this PR only wires SSE. The _metrics Map in lifecycle-metrics.ts is in-memory and cleared on restart — the AC requires "Events also written to the SQLite event_log so they survive service restart for post-hoc audit." Needs a new SQLite table (or reuse the watchdog counters pattern in db.ts) and writes inside recordLifecycleEvent.

  • behavior — lazy_start events missing latency_ms; lazy_stop missing idle_seconds. AC specifies container.lazy_started { instance, latency_ms } and container.lazy_stopped { instance, idle_seconds }. Fix in lifecycle.ts: record const startTs = Date.now() before _runner.run(["start", ...]) at line 223; emit detail: { latency_ms: Date.now() - startTs } on the lazy_start call at line 251. Same pattern for acquireForIdleStop — capture idle duration from the registry idle-stop scheduler and emit detail: { idle_seconds } at line 291.

  • behavior — SSE type names do not match AC. AC specifies container.lazy_started, container.lazy_stopped, container.lazy_start_failed. This PR broadcasts lazy_start, lazy_stop, lazy_start_failed, lazy_stop_failed. Any dashboard subscriber filtering on the AC-specified names receives nothing. Remap in the broadcastSSE call in main.ts or update the AC to match the actual event names used by the lifecycle module.
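
The timing fix requested in the second finding can be sketched as follows. This is an illustration of the pattern only, not the code in lifecycle.ts: `startWithLatency`, `emitIdleStop`, `EmitFn`, the injectable `now` clock, and the round-down to whole seconds are all assumptions; only `startTs`, `latency_ms`, and `idle_seconds` come from the review.

```typescript
// Hypothetical emit callback standing in for the lifecycle event listener.
type EmitFn = (type: "lazy_start" | "lazy_stop", detail: Record<string, number>) => void;

// Capture the clock before the start command and report the delta afterwards.
async function startWithLatency(
  runStart: () => Promise<void>,
  emit: EmitFn,
  now: () => number = Date.now,
): Promise<void> {
  const startTs = now();
  await runStart(); // a rejection propagates; failure events are out of scope here
  emit("lazy_start", { latency_ms: now() - startTs });
}

// idleSinceTs is assumed to be the epoch-ms timestamp at which the worker went idle.
function emitIdleStop(idleSinceTs: number, emit: EmitFn, now: () => number = Date.now): void {
  const idleSeconds = Math.floor((now() - idleSinceTs) / 1000);
  emit("lazy_stop", { idle_seconds: idleSeconds });
}
```

Passing the clock as a parameter is a testing convenience chosen for this sketch; the reviewer's suggestion calls `Date.now()` directly.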

fix(agents): persist M26-5 lifecycle events + AC-compliant SSE names
All checks were successful
qa / dockerfile (pull_request) Successful in 3s
qa / qa (pull_request) Successful in 1m12s
8339b6b4ce
Addresses review #865 on PR #616:

- Persist every lifecycle transition to a new SQLite `lifecycle_events`
  table inside `recordLifecycleEvent`, with a boot-time
  `rebuildLifecycleMetricsFromDb` replay so the in-memory cache + audit
  trail survive restart.
- Capture `latency_ms` on `lazy_start` (Date.now() delta around
  `docker start`), `idle_seconds` on `lazy_stop` (idleSinceTs threaded
  from the dispatch registry's idle-stop scheduler), and rename the
  failure detail key from `reason` to `error` to match AC.
- Remap broadcast types in main.ts: `lazy_start`/`lazy_stop`/
  `lazy_start_failed`/`lazy_stop_failed` → `container.lazy_started`/
  `container.lazy_stopped`/`container.lazy_start_failed`/
  `container.lazy_stop_failed`. Internal `LifecycleEventType` keeps the
  short form (used by the cache + SQLite rows).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
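
The remap described in the last bullet of this commit message amounts to a lookup at the broadcast boundary. A minimal sketch, assuming a plain object keyed by the internal type; the four name pairs are taken from the commit message, while `toSseType` and the exact typing are illustrative.

```typescript
type LifecycleEventType =
  | "lazy_start"
  | "lazy_stop"
  | "lazy_start_failed"
  | "lazy_stop_failed";

// Record<LifecycleEventType, string> makes the compiler reject the map
// if a new internal type is added without a broadcast name.
const SSE_TYPE_MAP: Record<LifecycleEventType, string> = {
  lazy_start: "container.lazy_started",
  lazy_stop: "container.lazy_stopped",
  lazy_start_failed: "container.lazy_start_failed",
  lazy_stop_failed: "container.lazy_stop_failed",
};

// Hypothetical helper: translate just before broadcasting, so the cache
// and the SQLite rows keep the short internal form.
function toSseType(internalType: LifecycleEventType): string {
  return SSE_TYPE_MAP[internalType];
}
```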
Author
Collaborator

Thanks — addressed all three findings in 8339b6b:

  • SQLite persistence: new lifecycle_events table in db.ts (id / instance / type / ts / detail_json + indexes on (instance, ts) and ts). recordLifecycleEvent writes a row alongside the in-memory cache update; boot path calls rebuildLifecycleMetricsFromDb() to repopulate counters + last_event from the audit table on startup.
  • Detail fields: lazy_start now carries latency_ms (Date.now() delta around docker start); lazy_stop carries idle_seconds (idleSinceTs threaded through scheduleIdleStop → acquireForIdleStop); failure detail key renamed from reason to error to match AC.
  • SSE event names: added SSE_TYPE_MAP in main.ts so internal lazy_* types broadcast as container.lazy_started / container.lazy_stopped / container.lazy_start_failed / container.lazy_stop_failed. Internal LifecycleEventType keeps the short form (used by the Map cache + DB rows).

Tests added: latency/idle_seconds/error detail assertions in lifecycle.test.ts; SQLite persistence + rebuildLifecycleMetricsFromDb groups in lifecycle-metrics.test.ts (rebuild from rows, idempotent, empty-table no-op). just qa clean.
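
To make the boot-time replay concrete, here is a sketch of rebuilding counters and last_event from audit rows. It works on an in-memory array rather than SQLite so that it stays self-contained; the row columns follow the list above (id / instance / type / ts / detail_json), while `rebuildMetricsFromRows`, `RebuiltMetrics`, and the sort order are assumptions, not the actual `rebuildLifecycleMetricsFromDb`.

```typescript
type LifecycleEventType =
  | "lazy_start"
  | "lazy_start_failed"
  | "lazy_stop"
  | "lazy_stop_failed";

interface LifecycleEventRow {
  id: number;
  instance: string;
  type: LifecycleEventType;
  ts: number; // epoch milliseconds
  detail_json: string | null;
}

interface RebuiltMetrics {
  counts: Record<LifecycleEventType, number>;
  last_event: { type: LifecycleEventType; ts: number; detail: unknown } | null;
}

// Builds a fresh map on every call, so replaying the same rows twice
// yields the same counters instead of doubling them.
function rebuildMetricsFromRows(rows: LifecycleEventRow[]): Map<string, RebuiltMetrics> {
  const rebuilt = new Map<string, RebuiltMetrics>();
  // Replay oldest first so last_event ends up as the newest row; id breaks ties.
  const ordered = rows.slice().sort((a, b) => a.ts - b.ts || a.id - b.id);
  for (const row of ordered) {
    let entry = rebuilt.get(row.instance);
    if (entry === undefined) {
      entry = {
        counts: { lazy_start: 0, lazy_start_failed: 0, lazy_stop: 0, lazy_stop_failed: 0 },
        last_event: null,
      };
      rebuilt.set(row.instance, entry);
    }
    entry.counts[row.type] += 1;
    entry.last_event = {
      type: row.type,
      ts: row.ts,
      detail: row.detail_json === null ? null : JSON.parse(row.detail_json),
    };
  }
  return rebuilt;
}
```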

reviewer approved these changes 2026-04-30 22:41:34 +00:00
reviewer left a comment

All three round-1 findings resolved; CI green.

  • SQLite persistence: lifecycle_events table added, insertLifecycleEvent called inside recordLifecycleEvent, rebuildLifecycleMetricsFromDb called on boot — counters survive restart.
  • latency_ms / idle_seconds: captured correctly (startTs before docker start; idleSinceTs passed from scheduleIdleStop through to acquireForIdleStop).
  • SSE type names: SSE_TYPE_MAP in main.ts maps internal short forms to container.lazy_started / container.lazy_stopped / container.lazy_start_failed / container.lazy_stop_failed at the broadcast boundary.

Nit (non-blocking): rebuildLifecycleMetricsFromDb scans the full table on every boot (sinceMs = 0); a LIMIT-based cap or cutoff tied to pruneLifecycleEvents would bound replay time on long-running services.
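
One possible shape for the bounded replay suggested in the nit, sketched over an in-memory array for brevity. `selectReplayRows`, `retentionMs`, and `maxRows` are hypothetical names; a real version would more likely push the cutoff and the cap into the SQL query. Note the trade-off: with a cutoff, rebuilt counters reflect only the replay window rather than lifetime totals.

```typescript
interface ReplayRow {
  id: number;
  ts: number; // epoch milliseconds
}

// Keep rows newer than (nowMs - retentionMs), then cap to the newest maxRows.
// maxRows is assumed to be a non-negative integer.
function selectReplayRows<T extends ReplayRow>(
  rows: T[],
  nowMs: number,
  retentionMs: number,
  maxRows: number,
): T[] {
  const sinceMs = Math.max(0, nowMs - retentionMs);
  const recent = rows
    .filter((row) => row.ts >= sinceMs) // filter returns a copy, so sorting is safe
    .sort((a, b) => a.ts - b.ts || a.id - b.id);
  return maxRows >= recent.length ? recent : recent.slice(recent.length - maxRows);
}
```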

code-lead deleted branch boss/592 2026-04-30 22:42:45 +00:00
Reference
charles/claude-hooks!616