feat(agents): M26-5 lazy lifecycle metrics + SSE events #616
Reference
charles/claude-hooks!616
Wires the lifecycle-event listener reserved by M26-1 to a per-instance metrics cache and the SSE broadcast so dashboards can render `lazy_start` / `lazy_stop` transitions in real time. The `/agents` response now surfaces `state.lifecycle_state` (live state-machine reading) and `state.lifecycle_metrics` (per-instance counters + last event); the cache is dropped on agent DELETE.

Closes #592
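
For readers skimming the thread, the cache-plus-broadcast shape described above might look roughly like the sketch below. It is an illustration under assumptions, not the code in this PR: the helper names `recordEvent`, `getMetrics`, `dropMetrics`, and `toSseFrame` are hypothetical, and only the four event type names and the envelope fields (`type`, `instance`, `ts`, `summary`, `detail`) come from the PR text.

```typescript
// Hypothetical sketch of a per-instance lifecycle metrics cache.
// Only the event type names and envelope fields are taken from the PR;
// every function and interface name here is an assumption.
type LifecycleEventType =
  | "lazy_start"
  | "lazy_start_failed"
  | "lazy_stop"
  | "lazy_stop_failed";

interface LifecycleEvent {
  type: LifecycleEventType;
  instance: string;
  ts: number;
  summary: string;
  detail: Record<string, unknown>;
}

interface LifecycleMetrics {
  counts: Record<LifecycleEventType, number>;
  last_event: LifecycleEvent | null;
}

// One entry per agent instance; nothing is shared between instances.
const metrics = new Map<string, LifecycleMetrics>();

function emptyMetrics(): LifecycleMetrics {
  return {
    counts: { lazy_start: 0, lazy_start_failed: 0, lazy_stop: 0, lazy_stop_failed: 0 },
    last_event: null,
  };
}

// Bump the counter for this event type and overwrite last_event.
function recordEvent(event: LifecycleEvent): LifecycleMetrics {
  const entry = metrics.get(event.instance) ?? emptyMetrics();
  entry.counts[event.type] += 1;
  entry.last_event = event;
  metrics.set(event.instance, entry);
  return entry;
}

// Returns null when no metrics exist, mirroring the null that /agents
// reports for instances without lifecycle metrics.
function getMetrics(instance: string): LifecycleMetrics | null {
  return metrics.get(instance) ?? null;
}

// Called on agent DELETE so a re-created agent starts from zero.
function dropMetrics(instance: string): void {
  metrics.delete(instance);
}

// Serialise an event as a single Server-Sent Events frame.
function toSseFrame(event: LifecycleEvent): string {
  return `data: ${JSON.stringify(event)}\n\n`;
}
```

Keeping the cache keyed by instance makes the DELETE path a single `Map.delete`, which is what lets a re-created agent start from zero counters.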
Test plan
- `just qa`: typecheck + Biome lint/format clean
- `lifecycle-metrics.test.ts` (10 tests) covering counter math, `last_event` overwrite semantics, dispose, isolation per instance
- `registry.test.ts`: `unregisterWorker` drops the metrics cache
- `main-agents.test.ts`: `/agents` surfaces `lifecycle_state` + `lifecycle_metrics` (null for the host-mode test fixture)
- `lazy_start` SSE envelope land on the dashboard; let the idle window expire, watch `lazy_stop` follow

Wires the `setLifecycleEventListener` hook reserved by M26-1 to a per-instance metrics cache and the SSE broadcast. `lazy_start` / `lazy_start_failed` / `lazy_stop` / `lazy_stop_failed` events now bump counters in `infrastructure/container/lifecycle-metrics.ts` and fan out as `data: { type, instance, ts, summary, detail }` envelopes the dashboard can render in real time. The `/agents` response surfaces the live state-machine reading (`state.lifecycle_state`) plus the per-instance counter snapshot (`state.lifecycle_metrics`) for every container-mode lazy worker; hot containers and host-mode types report `null` so the UI renders an "—" placeholder. The metrics cache is dropped alongside the lifecycle on agent DELETE so a re-create starts from zero.

Closes #592

7abb4e38883ee83ae930

behavior - Missing SQLite persistence:
`lifecycle.ts` line 27 says "M26-5 wires this to SSE + event log" but this PR only wires SSE. The `_metrics` Map in `lifecycle-metrics.ts` is in-memory and cleared on restart, whereas the AC requires "Events also written to the SQLite `event_log` so they survive service restart for post-hoc audit." Needs a new SQLite table (or reuse the watchdog counters pattern in `db.ts`) and writes inside `recordLifecycleEvent`.

behavior -
`lazy_start` events missing `latency_ms`; `lazy_stop` missing `idle_seconds`. AC specifies `container.lazy_started { instance, latency_ms }` and `container.lazy_stopped { instance, idle_seconds }`. Fix in `lifecycle.ts`: record `const startTs = Date.now()` before `_runner.run(["start", ...])` at line 223; emit `detail: { latency_ms: Date.now() - startTs }` on the `lazy_start` call at line 251. Same pattern for `acquireForIdleStop`: capture idle duration from the registry idle-stop scheduler and emit `detail: { idle_seconds }` at line 291.

behavior - SSE type names do not match AC. AC specifies
`container.lazy_started`, `container.lazy_stopped`, `container.lazy_start_failed`. This PR broadcasts `lazy_start`, `lazy_stop`, `lazy_start_failed`, `lazy_stop_failed`. Any dashboard subscriber filtering on the AC-specified names receives nothing. Remap in the `broadcastSSE` call in `main.ts` or update the AC to match the actual event names used by the lifecycle module.

Thanks, addressed all three findings in
8339b6b:

- `lifecycle_events` table in `db.ts` (id / instance / type / ts / detail_json + indexes on `(instance, ts)` and `ts`). `recordLifecycleEvent` writes a row alongside the in-memory cache update; boot path calls `rebuildLifecycleMetricsFromDb()` to repopulate counters + `last_event` from the audit table on startup.
- `lazy_start` now carries `latency_ms` (`Date.now()` delta around `docker start`); `lazy_stop` carries `idle_seconds` (`idleSinceTs` threaded through `scheduleIdleStop → acquireForIdleStop`); failure detail key renamed from `reason` → `error` to match AC.
- `SSE_TYPE_MAP` in `main.ts` so internal `lazy_*` types broadcast as `container.lazy_started` / `container.lazy_stopped` / `container.lazy_start_failed` / `container.lazy_stop_failed`. Internal `LifecycleEventType` keeps the short form (used by the Map cache + DB rows).

Tests added: latency/idle_seconds/error detail assertions in
`lifecycle.test.ts`; `SQLite persistence` + `rebuildLifecycleMetricsFromDb` groups in `lifecycle-metrics.test.ts` (rebuild from rows, idempotent, empty-table no-op). `just qa` clean.

All three round-1 findings resolved; CI green.
- `lifecycle_events` table added, `insertLifecycleEvent` called inside `recordLifecycleEvent`, `rebuildLifecycleMetricsFromDb` called on boot, so counters survive restart.
- `latency_ms` / `idle_seconds`: captured correctly (`startTs` before `docker start`; `idleSinceTs` passed from `scheduleIdleStop` through to `acquireForIdleStop`).
- `SSE_TYPE_MAP` in `main.ts` maps internal short forms to `container.lazy_started` / `container.lazy_stopped` / `container.lazy_start_failed` / `container.lazy_stop_failed` at the broadcast boundary.

Nit (non-blocking):
`rebuildLifecycleMetricsFromDb` scans the full table on every boot (`sinceMs = 0`); a `LIMIT`-based cap or cutoff tied to `pruneLifecycleEvents` would bound replay time on long-running services.
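
One way the cutoff variant of this nit could look, sketched over an in-memory array so it runs without a database. The row shape follows the columns named earlier in the thread (id / instance / type / ts / detail_json); `rebuildMetrics`, `RebuiltMetrics`, and `RETENTION_MS` are hypothetical names, and in the real code the `ts >= sinceMs` filter would presumably live in the SQL query rather than in a JavaScript `filter`.

```typescript
// Hypothetical sketch of a bounded replay; not the implementation in db.ts.
type LifecycleEventType =
  | "lazy_start"
  | "lazy_start_failed"
  | "lazy_stop"
  | "lazy_stop_failed";

interface LifecycleEventRow {
  id: number;
  instance: string;
  type: LifecycleEventType;
  ts: number; // epoch milliseconds
  detail_json: string;
}

interface RebuiltMetrics {
  counts: Record<LifecycleEventType, number>;
  last_event: LifecycleEventRow | null;
}

// Assumed retention window; ideally the same value the pruning job uses.
const RETENTION_MS = 7 * 24 * 60 * 60 * 1000;

// Fold rows newer than sinceMs into per-instance counters.
// Rows are replayed in (ts, id) order so last_event is the newest row.
function rebuildMetrics(
  rows: LifecycleEventRow[],
  sinceMs: number,
): Map<string, RebuiltMetrics> {
  const out = new Map<string, RebuiltMetrics>();
  const ordered = rows
    .filter((row) => row.ts >= sinceMs)
    .sort((a, b) => a.ts - b.ts || a.id - b.id);
  for (const row of ordered) {
    const entry = out.get(row.instance) ?? {
      counts: { lazy_start: 0, lazy_start_failed: 0, lazy_stop: 0, lazy_stop_failed: 0 },
      last_event: null,
    };
    entry.counts[row.type] += 1;
    entry.last_event = row;
    out.set(row.instance, entry);
  }
  return out;
}
```

A boot path could then call something like `rebuildMetrics(rows, Date.now() - RETENTION_MS)`. The trade-off is that counters would reflect only events inside the window, which is consistent with pruning only if both use the same cutoff.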