feat(agents): fleet-health panel — saturation, queue, cost burn, last activity #250
Reference
charles/claude-hooks!250
Summary
- `GET /agents/health` (5 s cache) rolls up fleet-wide saturation (busy/capacity), queue depth, 1-hour USD burn, newest/oldest activity timestamps, and watchdog-flagged degraded instances. The payload also ships a per-agent slice (`last_active_ts` + 6-bucket cost sparkline) so the Agents page hydrates the strip and the per-card chips in one round-trip.
- `container-watchdog` signals: a new `recordWatchdogEvent` / `getWatchdogEventCache` pair snapshots the latest watchdog event per instance, and `buildFleetHealth` picks the instances whose last event was `container_missing` / `container_stopped` / `container_recreate_failed`. Recovered ones drop off as soon as a `container_recreated` event lands.
- `/app/agents`: four-tile `FleetHealthStrip` (saturation bar + tones, queue, cost burn with warn/danger thresholds, last activity) plus a conditional Degraded row, a new "Last active" column, and an inline `CostSparkline` per row.
- `/events` SSE subscription: task-lifecycle + `container_*` + cost envelopes invalidate the `agents-health` TanStack query. No poll storm: the backstop refetch is 5 s to match the server TTL.

Server-side touches
- `apps/server/src/agents-health.ts`: pure builder (`buildFleetHealth`) + `workerView` adapter.
- `apps/server/src/task-store.ts`: `computeCostBurn`, `computeCostSparklineByAgent`, `lastFinishedAtByAgent` SQL helpers.
- `apps/server/src/container-watchdog.ts`: latest-event cache exposed to `main.ts`.
- `apps/server/src/main.ts`: `handleAgentsHealth` route + cache-reset escape hatch; watchdog wiring records events alongside the SSE fan-out.

Web touches
- `apps/web/src/components/fleet-health-strip.tsx`: strip + tile tones, degraded row.
- `apps/web/src/components/cost-sparkline.tsx`: 6-bar mini sparkline, all-zero baseline for idle instances.
- `apps/web/src/lib/format.ts`: `fmtAgo` helper.
- `/agents/health` with `refetchInterval: 5_000` + SSE invalidation.

Test plan
- `bun run qa` → 900 server tests + 201 web tests pass, biome clean, workspace builds.
- `buildFleetHealth` unit coverage: saturation, queue, cost window, last-activity, degraded filter, sparkline bucketing, idle-no-history fallback, `generated_at` ISO.
- `GET /agents/health` response shape + 5 s cache invariant.
- `<FleetHealthStrip />` component tests: tiles render, saturation danger at 100%, cost-burn warn/danger thresholds, degraded row, error + loading states.

Closes #239
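For readers skimming the description, the degraded-instance selection in the Summary can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in this PR: the four event names come from the Summary, while `WatchdogEvent`, `recordEvent`, `isDegraded`, and `degradedInstances` are hypothetical names, and the real `recordWatchdogEvent` / `getWatchdogEventCache` signatures may differ.

```typescript
// Event names are taken from the PR summary; every other name is hypothetical.
type WatchdogEventType =
  | "container_missing"
  | "container_stopped"
  | "container_recreate_failed"
  | "container_recreated";

interface WatchdogEvent {
  instance: string;
  type: WatchdogEventType;
  ts: number; // epoch milliseconds
}

// Latest event per instance; a newer event overwrites the older one,
// so a recovery event replaces an earlier degraded one.
const latestByInstance = new Map<string, WatchdogEvent>();

function recordEvent(event: WatchdogEvent): void {
  latestByInstance.set(event.instance, event);
}

// Exhaustive switch: adding a member to WatchdogEventType without a matching
// case makes the `never` assignment below fail to compile.
function isDegraded(type: WatchdogEventType): boolean {
  switch (type) {
    case "container_missing":
    case "container_stopped":
    case "container_recreate_failed":
      return true;
    case "container_recreated":
      return false;
    default: {
      const unhandled: never = type;
      return unhandled;
    }
  }
}

// Instances whose most recent event is a degraded one, sorted by name.
function degradedInstances(): string[] {
  return Array.from(latestByInstance.values())
    .filter((event) => isDegraded(event.type))
    .map((event) => event.instance)
    .sort();
}
```

Recording `container_stopped` and then `container_recreated` for the same instance leaves it out of `degradedInstances()`, which mirrors the "recovered ones drop off" behaviour described in the Summary.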
🤖 Generated with Claude Code
0954ef8c47273c207791
Review: APPROVED
CI green (run #1917, 4m11s). All acceptance criteria from issue #239 met. Code is clean, well-tested, and follows existing patterns.
Acceptance criteria check
- `GET /agents/health` returns `{saturation, queue, cost_burn_usd_per_hour, last_activity, degraded_agents}`
- `container-watchdog` signals
- `/app/agents`
- `/agents/health` with synthetic fleet state

What's good
- The pure builder (`buildFleetHealth` + injectable `now`) makes the test suite fast and deterministic; the sparkline bucket tests are exact.
- `resetAgentsHealthCache()` is exported for test hygiene.
- The `degradedStateFromEvent` switch is exhaustive over `WatchdogEventType`'s 4 members, so TypeScript will catch any future type addition that goes unhandled.
- The `computeCostSparklineByAgent` bucketing math is correct: `Math.floor((finished_at - fromMs) / bucketMs)` clamped to `[0, bucketCount-1]`, oldest-first, consistent with the test assertions.
- `main.ts` correctly mirrors events into `recordWatchdogEvent` before the SSE fan-out; recovered containers overwrite degraded state naturally.
- The `health?.cost_sparkline ?? []` fallback renders cleanly via the all-zero `CostSparkline` path.

Two minor notes (non-blocking)
1. `AgentHealthEntry.status` has a dead `"stopped"` variant (`packages/shared/src/agents-health.ts`). `buildFleetHealth` populates `per_agent` exclusively from the live worker registry, so the `status` field is always `"busy" | "idle"`; `"stopped"` is never emitted. The `AgentRow` in `agents.tsx` correctly reads status from `agent.state.status` (not `health.status`), so there is no UI bug, but the type misleads callers into thinking `per_agent` might contain stopped agents. Options: strip `"stopped"` from the union, or explicitly populate it for agents that are in the DB but absent from the live registry (which would make the type accurate and give the UI a richer signal). Either way is fine; just flagging the dead variant.
2. CLAUDE.md API table not updated (`CLAUDE.md`). `GET /agents/health` is a new public endpoint, and the CLAUDE.md API table is the canonical reference for public endpoints. The Vite proxy is fine (the `/agents` prefix covers it), but the table entry is missing. Low priority since this is an internal tool, but the file itself says to keep the table in sync.
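As a reference for the bucketing formula quoted under "What's good", here is a small standalone sketch of the clamped, oldest-first bucketing. It is an assumption-laden illustration rather than the PR's `computeCostSparklineByAgent` (which is described as a SQL helper): the function name `bucketCosts`, the `CostRow` shape, and the `cost_usd` field are hypothetical, and only the `Math.floor((finished_at - fromMs) / bucketMs)` expression and the `[0, bucketCount-1]` clamp come from the review.

```typescript
// Hypothetical row shape; only `finished_at` appears in the review text.
interface CostRow {
  finished_at: number; // epoch milliseconds
  cost_usd: number;
}

// Sum costs into `bucketCount` buckets of width `bucketMs`, starting at `fromMs`.
// Index 0 is the oldest bucket; out-of-range rows are clamped into the edge buckets.
function bucketCosts(
  rows: CostRow[],
  fromMs: number,
  bucketMs: number,
  bucketCount: number,
): number[] {
  const buckets = new Array<number>(bucketCount).fill(0);
  for (const row of rows) {
    const raw = Math.floor((row.finished_at - fromMs) / bucketMs);
    const idx = Math.min(bucketCount - 1, Math.max(0, raw));
    buckets[idx] = (buckets[idx] ?? 0) + row.cost_usd;
  }
  return buckets;
}
```

With no rows the result is an all-zero array of length `bucketCount`, which lines up with the all-zero sparkline baseline for idle instances mentioned in the PR description.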