feat(agents): fleet-health panel — saturation, queue, cost burn, last activity #250

Merged
code-lead merged 1 commit from boss/239 into main 2026-04-21 15:46:40 +00:00

Summary

  • Adds GET /agents/health (5s cache) rolling up fleet-wide saturation (busy/capacity), queue depth, 1-hour USD burn, newest/oldest activity timestamps, and watchdog-flagged degraded instances. Payload also ships a per-agent slice (last_active_ts + 6-bucket cost sparkline) so the Agents page hydrates strip + per-card chips in one round-trip.
  • Degraded detection reuses the container-watchdog signals: a new recordWatchdogEvent / getWatchdogEventCache pair snapshots the latest watchdog event per instance, and buildFleetHealth picks the instances whose last event was container_missing / container_stopped / container_recreate_failed. Recovered ones drop off as soon as a container_recreated event lands.
  • On /app/agents: four-tile FleetHealthStrip (saturation bar + tones, queue, cost burn with warn/danger thresholds, last activity) plus a conditional Degraded row, a new "Last active" column, and an inline CostSparkline per row.
  • Live updates flow through the existing /events SSE subscription — task-lifecycle + container_* + cost envelopes invalidate the agents-health TanStack query. No poll storm: the backstop refetch is 5 s to match the server TTL.
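
The degraded-detection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the four event names and the `recordWatchdogEvent` / `getWatchdogEventCache` names come from the description, while the `WatchdogEvent` field names, the `isDegraded` and `degradedInstances` helpers, and the Map-based storage are assumptions.

```typescript
// Event names are taken from the PR description; the event shape is hypothetical.
type WatchdogEventType =
  | "container_missing"
  | "container_stopped"
  | "container_recreate_failed"
  | "container_recreated";

interface WatchdogEvent {
  instance: string; // hypothetical field name
  type: WatchdogEventType;
  ts: number; // epoch milliseconds, hypothetical field name
}

// Latest event per instance; a newer event simply overwrites the older one.
const latestByInstance = new Map<string, WatchdogEvent>();

function recordWatchdogEvent(event: WatchdogEvent): void {
  latestByInstance.set(event.instance, event);
}

function getWatchdogEventCache(): ReadonlyMap<string, WatchdogEvent> {
  return latestByInstance;
}

// Hypothetical helper: true when the last event means the container is unhealthy.
function isDegraded(type: WatchdogEventType): boolean {
  switch (type) {
    case "container_missing":
    case "container_stopped":
    case "container_recreate_failed":
      return true;
    case "container_recreated":
      return false;
    default: {
      // Compile-time exhaustiveness check: a new event type fails to type-check here.
      const unhandled: never = type;
      throw new Error(`unhandled watchdog event: ${String(unhandled)}`);
    }
  }
}

// Hypothetical helper: instance names whose latest event is a degraded one, sorted.
function degradedInstances(cache: ReadonlyMap<string, WatchdogEvent>): string[] {
  return Array.from(cache.values())
    .filter((event) => isDegraded(event.type))
    .map((event) => event.instance)
    .sort();
}
```

Because only the latest event per instance is kept, a `container_recreated` event replaces the earlier failure and the instance drops out of the degraded list without any explicit cleanup step.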

Server-side touches

  • apps/server/src/agents-health.ts — pure builder (buildFleetHealth) + workerView adapter.
  • apps/server/src/task-store.ts — computeCostBurn, computeCostSparklineByAgent, lastFinishedAtByAgent SQL helpers.
  • apps/server/src/container-watchdog.ts — latest-event cache exposed to main.ts.
  • apps/server/src/main.ts — handleAgentsHealth route + cache-reset escape hatch; watchdog wiring records events alongside the SSE fan-out.
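
A 5 s response cache with a reset escape hatch could look roughly like this. It is a sketch under assumptions: `createTtlCache` and its shape are hypothetical, and the route in `main.ts` may be structured differently. The clock is injectable so tests do not need to sleep.

```typescript
// Hypothetical helper: memoise compute() for ttlMs milliseconds.
function createTtlCache<T>(
  ttlMs: number,
  compute: () => T,
  now: () => number = Date.now, // injectable clock for deterministic tests
) {
  let entry: { value: T; expiresAt: number } | null = null;
  return {
    get(): T {
      const t = now();
      // Serve the cached value while it is still inside its TTL window.
      if (entry !== null && t < entry.expiresAt) return entry.value;
      const value = compute();
      entry = { value, expiresAt: t + ttlMs };
      return value;
    },
    // Escape hatch, for example so each test starts from a cold cache.
    reset(): void {
      entry = null;
    },
  };
}

// Usage with a fake clock: compute() runs once per 5 s window.
let fakeNow = 0;
let computeCalls = 0;
const healthCache = createTtlCache(5_000, () => ++computeCalls, () => fakeNow);
healthCache.get(); // computes
fakeNow = 4_999;
healthCache.get(); // served from cache
fakeNow = 5_000;
healthCache.get(); // TTL elapsed, recomputes
```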

Web touches

  • apps/web/src/components/fleet-health-strip.tsx — strip + tile tones, degraded row.
  • apps/web/src/components/cost-sparkline.tsx — 6-bar mini sparkline, all-zero baseline for idle instances.
  • apps/web/src/lib/format.ts — fmtAgo helper.
  • Agents route subscribes to /agents/health with refetchInterval: 5_000 + SSE invalidation.
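
The SSE side of this can be thought of as a small predicate deciding which event types should invalidate the agents-health query. The sketch below is illustrative only: the `container_` prefix comes from the description, but the task-lifecycle and cost event names are hypothetical placeholders, as is the `shouldInvalidateAgentsHealth` helper.

```typescript
// Hypothetical task-lifecycle event names; the real envelope types may differ.
const TASK_LIFECYCLE_EVENTS = new Set<string>([
  "task_queued",
  "task_started",
  "task_finished",
]);

// Hypothetical name for the cost envelope type.
const COST_EVENT = "cost";

// True when an SSE event should mark the agents-health query as stale.
function shouldInvalidateAgentsHealth(eventType: string): boolean {
  return (
    TASK_LIFECYCLE_EVENTS.has(eventType) ||
    eventType.startsWith("container_") ||
    eventType === COST_EVENT
  );
}
```

In a TanStack Query client the SSE handler would call the query client's `invalidateQueries` for the agents-health key when the predicate returns true, with the 5 s `refetchInterval` acting only as a backstop for missed events.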

Test plan

  • [x] bun run qa → 900 server tests + 201 web tests pass, biome clean, workspace builds.
  • [x] buildFleetHealth unit coverage: saturation, queue, cost window, last-activity, degraded filter, sparkline bucketing, idle-no-history fallback, generated_at ISO.
  • [x] HTTP-surface test: GET /agents/health response shape + 5s cache invariant.
  • [x] <FleetHealthStrip /> component tests: tiles render, saturation danger at 100 %, cost-burn warn/danger thresholds, degraded row, error + loading states.
  • [ ] Visual check once deployed: saturate the fleet (12 concurrent dispatches) and confirm saturation tile flips red.

Closes #239

🤖 Generated with Claude Code

feat(agents): fleet-health panel — saturation, queue, cost burn, last activity
Some checks failed
qa / qa (pull_request) Has been cancelled
qa / dockerfile (pull_request) Has been cancelled
0954ef8c47
Adds GET /agents/health (5s cache) that rolls up fleet-wide saturation,
queue depth, 1-hour USD/hr cost burn, last-activity ages, and the
watchdog-reported degraded instance list. Payload also carries a
per-agent slice (last_active_ts + 6-bucket cost sparkline) so the
Agents page can render the strip + per-card chips in one round-trip.

Renders on /app/agents as a four-tile strip plus a conditional Degraded
banner; each row gains a Last-active column and a tiny cost-sparkline.
Live-updates via SSE task-lifecycle + container_* events — no poll storm
beyond the existing 5s refetchInterval floor.

Closes #239

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
code-lead force-pushed boss/239 from 0954ef8c47 to 273c207791 2026-04-21 15:27:19 +00:00
273c207791: All checks were successful
qa / qa (pull_request) Successful in 4m3s
qa / dockerfile (pull_request) Successful in 7s
reviewer approved these changes 2026-04-21 15:45:50 +00:00
reviewer left a comment

Review — APPROVED

CI green (run #1917, 4m11s). All acceptance criteria from issue #239 met. Code is clean, well-tested, and follows existing patterns.


Acceptance criteria check

| Criterion | Status |
|---|---|
| GET /agents/health returns {saturation, queue, cost_burn_usd_per_hour, last_activity, degraded_agents} | ✅ |
| 5 s server cache | ✅ |
| Degraded detection reuses container-watchdog signals | ✅ |
| Fleet-health strip on /app/agents | ✅ |
| Per-agent card gets last-activity + cost sparkline | ✅ |
| Live updates via SSE — no poll storm | ✅ 5 s backstop + SSE invalidation |
| Unit test for /agents/health with synthetic fleet state | ✅ 8 pure builder cases + 2 HTTP surface cases |
| Visual: saturate fleet manually | ⬜ manual, acceptable |

What's good

  • Pure builder pattern (buildFleetHealth + injectable now) makes the test suite fast and deterministic — the sparkline bucket tests are exact.
  • Cache strategy matches the board/pipeline 5 s TTL pattern exactly; resetAgentsHealthCache() exported for test hygiene.
  • degradedStateFromEvent switch is exhaustive over WatchdogEventType's 4 members — TypeScript will catch any future type addition that goes unhandled.
  • computeCostSparklineByAgent bucketing math is correct: Math.floor((finished_at - fromMs) / bucketMs) clamped to [0, bucketCount-1], oldest-first, consistent with the test assertions.
  • Watchdog wiring in main.ts correctly mirrors events into recordWatchdogEvent before the SSE fan-out; recovered containers overwrite degraded state naturally.
  • No raw hex in components — all tones via Tailwind utility classes.
  • Empty sparkline fallback (health?.cost_sparkline ?? []) renders cleanly via the all-zero CostSparkline path.
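
The bucketing rule quoted above can be sketched as a pure function. The index formula and the clamp to [0, bucketCount-1] are from the review; the `CostRow` shape, the `bucketCostsByAgent` name, and doing this in memory rather than in SQL are assumptions made for illustration.

```typescript
// Hypothetical row shape; the real helper reads from the task store.
interface CostRow {
  agent: string;
  finishedAt: number; // epoch milliseconds
  costUsd: number;
}

// Sum costs into bucketCount buckets per agent, oldest bucket first.
function bucketCostsByAgent(
  rows: CostRow[],
  fromMs: number,
  bucketMs: number,
  bucketCount: number,
): Map<string, number[]> {
  const byAgent = new Map<string, number[]>();
  for (const row of rows) {
    const raw = Math.floor((row.finishedAt - fromMs) / bucketMs);
    // Clamp so rows on or outside the window edges land in the first or last bucket.
    const index = Math.min(bucketCount - 1, Math.max(0, raw));
    let buckets = byAgent.get(row.agent);
    if (buckets === undefined) {
      buckets = Array.from({ length: bucketCount }, () => 0);
      byAgent.set(row.agent, buckets);
    }
    buckets[index] = (buckets[index] ?? 0) + row.costUsd;
  }
  return byAgent;
}

// Example: a 60 ms window split into six 10 ms buckets.
const sparkline = bucketCostsByAgent(
  [
    { agent: "a", finishedAt: 5, costUsd: 1 },  // bucket 0
    { agent: "a", finishedAt: 59, costUsd: 2 }, // bucket 5
    { agent: "a", finishedAt: 60, costUsd: 3 }, // raw index 6, clamped to 5
    { agent: "b", finishedAt: 25, costUsd: 4 }, // bucket 2
  ],
  0,
  10,
  6,
);
```

An agent with no rows never appears in the map, which is where a caller-side fallback to an empty or all-zero array comes in.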

Two minor notes (non-blocking)

1. AgentHealthEntry.status has a dead "stopped" variant — packages/shared/src/agents-health.ts

buildFleetHealth populates per_agent exclusively from the live worker registry, so the status field is always "busy" | "idle"; "stopped" is never emitted. The AgentRow in agents.tsx correctly reads status from agent.state.status (not health.status), so there's no UI bug, but the type misleads callers into thinking per_agent might contain stopped agents. Options: strip "stopped" from the union, or explicitly populate it for agents that are in the DB but absent from the live registry (which would make the type accurate and give the UI a richer signal). Either way is fine; just flagging the dead variant.
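
The second option could be sketched like this. Everything here is hypothetical (the `mergeAgentStatuses` helper, its inputs, and the entry shape); it only illustrates how agents known to the DB but missing from the live registry would surface as "stopped".

```typescript
type LiveStatus = "busy" | "idle";
type AgentStatus = LiveStatus | "stopped";

// Hypothetical merge: knownAgents comes from the DB, live from the worker registry.
function mergeAgentStatuses(
  knownAgents: string[],
  live: ReadonlyMap<string, LiveStatus>,
): Array<{ agent: string; status: AgentStatus }> {
  return knownAgents.map((agent) => ({
    agent,
    // Absent from the live registry means the agent is reported as stopped.
    status: live.get(agent) ?? "stopped",
  }));
}

const mergedStatuses = mergeAgentStatuses(
  ["a", "b", "c"],
  new Map<string, LiveStatus>([
    ["a", "busy"],
    ["b", "idle"],
  ]),
);
```

With the first option, the entry type would simply use `LiveStatus` and the "stopped" member would disappear from the union.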

2. CLAUDE.md API table not updated — CLAUDE.md

GET /agents/health is a new public endpoint, and the CLAUDE.md API table is the canonical reference for public endpoints. The Vite proxy is fine (/agents prefix covers it), but the table entry is missing. Low priority since this is an internal tool, but the file itself says to keep the table in sync.

code-lead deleted branch boss/239 2026-04-21 15:46:40 +00:00
Reference
charles/claude-hooks!250