M26-3: tier badge + failover event log + provider history sparkline #550

Closed
opened 2026-04-29 12:24:37 +00:00 by claude-desktop · 1 comment
Collaborator

As an operator, I want the agents page to show which tier each instance is currently on, when it last failed over, and what's happening across the fleet, so that I can spot a degraded primary provider at a glance and trace why a specific agent jumped to its fallback.

Background

M26-1 lands the data (per-instance current_tier, last_failover_at, last_failure_kind, paused). M26-2 lands the wizard. This story makes the runtime state visible.

Acceptance criteria

Agents list — tier badge column

  • New "Active tier" column on the /agents page list rows.
  • Cell renders ① claude-opus-4-7 for tier 1, ② deepseek-v4-pro for tier 2, ③ qwen3-coder for tier 3, ✕ all tiers exhausted when paused.
  • Tier 1 styled neutral (white text); tier ≥ 2 styled with a subtle warning tone (amber); paused styled red.
  • Hover tooltip shows last_failure_kind + relative time (e.g. "rate_limit · 47m ago").
  • Sortable by tier number; defaults to current sort.

Cooldown countdown

  • When current_tier > 1, badge shows (retry in Nm) countdown until last_failover_at + cooldown_min. Auto-decrements every 30s. Hidden when expired.

Per-instance drawer — failover event log

  • Click an instance row → existing drawer gains a new "Provider chain" tab.
  • Top: stacked-bar showing 24h tier-time distribution (one row per tier, fill = % time on that tier in the last 24h). Pulled from a new agent_provider_events ledger table populated by M26-1's classifier (one row per tier flip + every cooldown-elapsed reset).
  • Bottom: chronological event list, last 7 days: 2026-04-29 09:14 ① → ② "401 auth_error" [view raw]. The [view raw] link opens the linked task transcript drawer.

Manual controls

  • [ ↺ Reset to tier 1 ] button on paused or tier-degraded instances. POSTs /agents/<name>/reset-tier. Server clears state row, schedules immediate reconcile (env-file rewrite).
  • [ ⏸ Pause ] / [ ▶ Resume ] toggle independent of tier. POSTs /agents/<name>/pause and /unpause.

Server endpoints (new)

  • GET /agents/<name>/provider-events?since=<ms>&limit=N → ledger rows.
  • POST /agents/<name>/reset-tier → clears current_tier=1, paused=0, returns updated state.
  • POST /agents/<name>/pause → sets paused=1.
  • POST /agents/<name>/unpause → sets paused=0.
  • All four endpoints behind operator auth gate.

Tests

  • Tier badge renders correctly for tier 1/2/3/paused fixture rows.
  • Cooldown countdown advances on fake timer.
  • Reset-tier button posts to the right endpoint and refetches state.
  • Stacked-bar component renders correct percentages from a 24h fixture.

Out of scope

  • Cross-instance fleet health roll-up on a global page — this story is scoped to /agents only.
  • Cost / spend column (separate concern; lives in /usage).

References

  • M26-1 (#TBD) — produces agent_provider_state + agent_provider_events rows this consumes.
  • Existing instance drawer: apps/web/src/features/agents/InstanceDrawer.tsx (or current location post-config-split).
  • Existing pause/unpause pattern: apps/web/src/components/watchdog-panel.tsx (similar pattern to mirror).
As an **operator**, I want the agents page to show which tier each instance is currently on, when it last failed over, and what's happening across the fleet, so that I can spot a degraded primary provider at a glance and trace why a specific agent jumped to its fallback. ### Background M26-1 lands the data (per-instance `current_tier`, `last_failover_at`, `last_failure_kind`, `paused`). M26-2 lands the wizard. This story makes the runtime state visible. ### Acceptance criteria #### Agents list — tier badge column - [ ] New "Active tier" column on the `/agents` page list rows. - [ ] Cell renders `① claude-opus-4-7` for tier 1, `② deepseek-v4-pro` for tier 2, `③ qwen3-coder` for tier 3, `✕ all tiers exhausted` when paused. - [ ] Tier 1 styled neutral (white text); tier ≥ 2 styled with a subtle warning tone (amber); paused styled red. - [ ] Hover tooltip shows `last_failure_kind` + relative time (e.g. "rate_limit · 47m ago"). - [ ] Sortable by tier number; defaults to current sort. #### Cooldown countdown - [ ] When `current_tier > 1`, badge shows `(retry in Nm)` countdown until `last_failover_at + cooldown_min`. Auto-decrements every 30s. Hidden when expired. #### Per-instance drawer — failover event log - [ ] Click an instance row → existing drawer gains a new "Provider chain" tab. - [ ] Top: stacked-bar showing 24h tier-time distribution (one row per tier, fill = % time on that tier in the last 24h). Pulled from a new `agent_provider_events` ledger table populated by M26-1's classifier (one row per tier flip + every cooldown-elapsed reset). - [ ] Bottom: chronological event list, last 7 days: `2026-04-29 09:14 ① → ② "401 auth_error" [view raw]`. The `[view raw]` link opens the linked task transcript drawer. #### Manual controls - [ ] `[ ↺ Reset to tier 1 ]` button on paused or tier-degraded instances. POSTs `/agents/<name>/reset-tier`. Server clears state row, schedules immediate reconcile (env-file rewrite). - [ ] `[ ⏸ Pause ]` / `[ ▶ Resume ]` toggle independent of tier. POSTs `/agents/<name>/pause` and `/unpause`. #### Server endpoints (new) - [ ] `GET /agents/<name>/provider-events?since=<ms>&limit=N` → ledger rows. - [ ] `POST /agents/<name>/reset-tier` → clears `current_tier=1, paused=0`, returns updated state. - [ ] `POST /agents/<name>/pause` → sets `paused=1`. - [ ] `POST /agents/<name>/unpause` → sets `paused=0`. - [ ] All four endpoints behind operator auth gate. #### Tests - [ ] Tier badge renders correctly for tier 1/2/3/paused fixture rows. - [ ] Cooldown countdown advances on fake timer. - [ ] Reset-tier button posts to the right endpoint and refetches state. - [ ] Stacked-bar component renders correct percentages from a 24h fixture. ### Out of scope - Cross-instance fleet health roll-up on a global page — this story is scoped to `/agents` only. - Cost / spend column (separate concern; lives in `/usage`). ### References - M26-1 (#TBD) — produces `agent_provider_state` + `agent_provider_events` rows this consumes. - Existing instance drawer: `apps/web/src/features/agents/InstanceDrawer.tsx` (or current location post-config-split). - Existing pause/unpause pattern: `apps/web/src/components/watchdog-panel.tsx` (similar pattern to mirror).
Author
Collaborator

Shipped via PR #552 + #554 (post-redesign):

Tier badge:

  • Tier column on the InstancesTable. Renders ① / ② / ③ glyph + active model id, ✕ when paused.
  • Tier 1 neutral, tier ≥ 2 amber, paused red.
  • Hover tooltip: last_failure_kind · Nm until retry.
  • M26-7 icon next to glyph when last_failure_kind === "token_budget".

Cooldown countdown:

  • 30 s setInterval ticker drives a fresh Math.ceil on each tick. Hidden when expired or tier == 1.

Manual controls:

  • Reset-to-tier-1 (visible only when degraded or paused) → POST /agents/<name>/reset-tier.
  • ⏸ / ▶ toggle → POST /agents/<name>/{pause,unpause}.
  • All buttons live on the badge, not in a drawer.

Server endpoints:

  • GET /agents/<name>/provider-events?since=&limit= → ledger rows.
  • POST /agents/<name>/{reset-tier,pause,unpause} — all three behind guardMutating.
  • paused coerced to boolean uniformly across list + control endpoints (serialiseProviderState).

Deviation:

  • Per-instance drawer "Provider chain" tab with stacked-bar tier-time distribution + 7-day event log NOT shipped. The drawer architecture changed in PR #554 (drawer is now per-type, not per-instance). The ledger endpoint exists but no consumer; reopen as a dedicated story if/when an instance drawer surfaces.

Tests: 6 endpoint cases (list shape, ledger empty, 404 unknown agent, pause/unpause toggle, reset-tier, auth rejection).

Shipped via PR #552 + #554 (post-redesign): **Tier badge:** - `Tier` column on the InstancesTable. Renders ① / ② / ③ glyph + active model id, ✕ when paused. - Tier 1 neutral, tier ≥ 2 amber, paused red. - Hover tooltip: `last_failure_kind · Nm until retry`. - M26-7 ⛽ icon next to glyph when `last_failure_kind === "token_budget"`. **Cooldown countdown:** - 30 s `setInterval` ticker drives a fresh `Math.ceil` on each tick. Hidden when expired or tier == 1. **Manual controls:** - `↺` Reset-to-tier-1 (visible only when degraded or paused) → `POST /agents/<name>/reset-tier`. - `⏸ / ▶` toggle → `POST /agents/<name>/{pause,unpause}`. - All buttons live on the badge, not in a drawer. **Server endpoints:** - `GET /agents/<name>/provider-events?since=&limit=` → ledger rows. - `POST /agents/<name>/{reset-tier,pause,unpause}` — all three behind `guardMutating`. - `paused` coerced to boolean uniformly across list + control endpoints (`serialiseProviderState`). **Deviation:** - Per-instance drawer "Provider chain" tab with stacked-bar tier-time distribution + 7-day event log NOT shipped. The drawer architecture changed in PR #554 (drawer is now per-type, not per-instance). The ledger endpoint exists but no consumer; reopen as a dedicated story if/when an instance drawer surfaces. Tests: 6 endpoint cases (list shape, ledger empty, 404 unknown agent, pause/unpause toggle, reset-tier, auth rejection).
Sign in to join this conversation.
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
charles/claude-hooks#550
No description provided.