M26 — multi-provider failover (chain + UI + audit) #552

Merged
charles merged 10 commits from feat/m26-multi-provider-failover into main 2026-04-29 13:53:41 +00:00
Collaborator

Summary

Replaces the single provider field per agent type (PR #547) with a 1–3 entry ordered chain plus per-instance failover state. On auth/quota/rate-limit/persistent-5xx failure the dispatcher steps down a tier; after cooldown a successful task resets to tier 1.

4 commits, one per ticket:

# Commit Closes
1 a7128e2 docs(specs): audit usage_threshold_tokens + spec out enforcement #551
2 85d5a51 feat(agents): provider chain + per-instance failover state machine #548
3 2feadfe feat(dashboard): provider chain editor in /config wizard #549
4 8dbd864 feat(dashboard): tier badge + reset/pause controls + ledger endpoint #550

What's in each

M26-4 (docs/specs/token-budget.md) — audit confirms usage_threshold_tokens is a UI-only stub today (3 read sites, 0 enforcement). Spec proposes per-type budget driving the M26-1 chain via token_budget trigger. Implementation deferred to M26-5/6/7.

M26-1 — schema (provider_chain + failover blocks), legacy single-provider migration, two SQLite tables (agent_provider_state + agent_provider_events ledger), classifier (provider-failover.ts), state machine (applyOutcome), tier-aware container env-file rewrite, agent-runner hook. 23 classifier/state-machine tests + 12 loader tests, all green.

M26-2 — new "Provider" section tab in /config wizard between Thresholds and Container. Provider dropdown + free-text model input per row, ↑/↓ reorder, +/× add/remove (max 3 / min 1), failover policy block (cooldown + pause-or-wrap + trigger checkboxes — token_budget greyed out per M26-4). Validation banner for host-mode + non-anthropic tier 1, duplicate providers, empty models. 7 component tests.

M26-3provider_state block on /agents response, 4 new endpoints: GET /agents/:name/provider-events, POST /agents/:name/{reset-tier,pause,unpause}. New "Tier" column on the /agents page renders ①/②/③ glyph (or ✕ when paused) + active model id + cooldown countdown when degraded. Per-row reset (↺) + pause/resume (⏸/▶) buttons. 6 endpoint tests.

Test plan

  • just typecheck — all 4 packages clean.
  • Pre-commit Biome — clean across all 4 commits.
  • Server tests — 2580 pass, 4 fail (all pre-existing on main: session JSONL pruning + foreman session CRUD).
  • Web tests — 588 pass, 24 fail (all pre-existing localStorage shim issues in theme.test.ts / selected-repos.test.ts).
  • Manual: declare a 3-tier chain (anthropicdeepseekollama) on dev, force a 401 by setting a bad DEEPSEEK_API_KEY, dispatch a task, confirm container recreates with tier-2 env.
  • Manual: flip a paused agent back via button on the /agents page; confirm agent_provider_state.paused = 0 and the worker resumes dispatching.
  • Manual: cooldown countdown auto-decrements (note: M26-3 review flagged that the countdown only refreshes on re-render — may need a setInterval driver).

Known gaps (intentional deferrals)

  • Live /agents/models?provider= combobox in M26-2 wizard — endpoint exists from PR #547 but kept as free-text input for v1.
  • Per-instance drawer with stacked-bar tier-time distribution + chronological event log (originally part of M26-3 ticket) — defer to a follow-up; would need a new drawer that doesn't exist today on the agents page.
  • token_budget real enforcement — tracked by M26-4 audit; implementation in M26-5/6/7 (tickets not yet opened).

🤖 Generated with Claude Code

## Summary Replaces the single `provider` field per agent type (PR #547) with a 1–3 entry ordered chain plus per-instance failover state. On auth/quota/rate-limit/persistent-5xx failure the dispatcher steps down a tier; after cooldown a successful task resets to tier 1. 4 commits, one per ticket: | # | Commit | Closes | |---|---|---| | 1 | `a7128e2` docs(specs): audit `usage_threshold_tokens` + spec out enforcement | #551 | | 2 | `85d5a51` feat(agents): provider chain + per-instance failover state machine | #548 | | 3 | `2feadfe` feat(dashboard): provider chain editor in /config wizard | #549 | | 4 | `8dbd864` feat(dashboard): tier badge + reset/pause controls + ledger endpoint | #550 | ## What's in each **M26-4** (`docs/specs/token-budget.md`) — audit confirms `usage_threshold_tokens` is a UI-only stub today (3 read sites, 0 enforcement). Spec proposes per-type budget driving the M26-1 chain via `token_budget` trigger. Implementation deferred to M26-5/6/7. **M26-1** — schema (`provider_chain` + `failover` blocks), legacy single-`provider` migration, two SQLite tables (`agent_provider_state` + `agent_provider_events` ledger), classifier (`provider-failover.ts`), state machine (`applyOutcome`), tier-aware container env-file rewrite, agent-runner hook. 23 classifier/state-machine tests + 12 loader tests, all green. **M26-2** — new "Provider" section tab in `/config` wizard between Thresholds and Container. Provider dropdown + free-text model input per row, ↑/↓ reorder, +/× add/remove (max 3 / min 1), failover policy block (cooldown + pause-or-wrap + trigger checkboxes — `token_budget` greyed out per M26-4). Validation banner for host-mode + non-anthropic tier 1, duplicate providers, empty models. 7 component tests. **M26-3** — `provider_state` block on `/agents` response, 4 new endpoints: `GET /agents/:name/provider-events`, `POST /agents/:name/{reset-tier,pause,unpause}`. New "Tier" column on the `/agents` page renders ①/②/③ glyph (or ✕ when paused) + active model id + cooldown countdown when degraded. Per-row reset (↺) + pause/resume (⏸/▶) buttons. 6 endpoint tests. ## Test plan - [x] `just typecheck` — all 4 packages clean. - [x] Pre-commit Biome — clean across all 4 commits. - [x] Server tests — 2580 pass, 4 fail (all pre-existing on `main`: session JSONL pruning + foreman session CRUD). - [x] Web tests — 588 pass, 24 fail (all pre-existing localStorage shim issues in `theme.test.ts` / `selected-repos.test.ts`). - [ ] Manual: declare a 3-tier chain (`anthropic` → `deepseek` → `ollama`) on `dev`, force a 401 by setting a bad `DEEPSEEK_API_KEY`, dispatch a task, confirm container recreates with tier-2 env. - [ ] Manual: flip a paused agent back via `↺` button on the `/agents` page; confirm `agent_provider_state.paused = 0` and the worker resumes dispatching. - [ ] Manual: cooldown countdown auto-decrements (note: M26-3 review flagged that the countdown only refreshes on re-render — may need a setInterval driver). ## Known gaps (intentional deferrals) - Live `/agents/models?provider=` combobox in M26-2 wizard — endpoint exists from PR #547 but kept as free-text input for v1. - Per-instance drawer with stacked-bar tier-time distribution + chronological event log (originally part of M26-3 ticket) — defer to a follow-up; would need a new drawer that doesn't exist today on the agents page. - `token_budget` real enforcement — tracked by M26-4 audit; implementation in M26-5/6/7 (tickets not yet opened). 🤖 Generated with [Claude Code](https://claude.com/claude-code)
Closes #551.

Audit findings:
- 3 read sites (loader, /usage echo, schema validation), 0 enforcement.
- `usageThresholdTokens` is purely cosmetic — drives the dashboard ring's
  50/80/95% colour tiers; crossing the threshold has no runtime effect.

Decisions captured in `docs/specs/token-budget.md`:
- Lift threshold from global service.json to per-type agents.json (keep
  global as operator-wide kill switch).
- Reuse existing `usage_reset` window — no new reset clock.
- Threshold cross fires M26-1's `token_budget` trigger → tier flip via
  the chain. Single-tier types fall back to hard pause iff
  `pause_if_all_fail`.
- Gate lives in the post-task hook (`agent-runner.ts:489`), not pre-flight
  or sweeper. Trades one over-budget task per window for a clean code path.

Re-adds `token_budget` to M26-1's `failover.triggers` enum (loader rejects
it on single-tier types — noop).

Implementation gated on follow-up tickets:
- M26-5: loader + post-task hook + `recordFailover` integration.
- M26-6: wizard field + per-type ring in Stats tab.
- M26-7: `token_budget` icon in M26-3's tier badge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #548.

Replaces the single `provider` field per agent type with a 1–3 entry
ordered chain plus a failover policy. Per-instance state lives in two
new SQLite tables; container reconcile reads the active tier and rewrites
the env-file accordingly.

Schema:
- `agentTypeSchema` adds `provider_chain: ProviderChainEntry[]` (1–3
  entries) and `failover: { cooldown_min, pause_if_all_fail, triggers }`.
- Loader: 4 input shapes accepted — explicit chain, legacy single-`provider`
  (auto-wrapped to 1-tier), absent (synthesises anthropic 1-tier), hybrid
  rejected. Tiers renumbered to contiguous 1..N. Host-mode types must have
  anthropic at tier 1. `token_budget` trigger stripped from chains < 2 with
  console.warn (M26-5 will land enforcement).

State:
- `agent_provider_state` — one row per instance: `current_tier`,
  `last_failover_at`, `last_failure_kind`, `paused`. Paired insert in
  `createAgent`, cascade delete in `deleteAgent`.
- `agent_provider_events` — append-only ledger, one row per tier flip
  + cooldown reset. Indexed on `(agent_name, at)` for the M26-3 drawer.

Classifier (`provider-failover.ts`):
- Builds a `TaskOutcome` from the streaming `TaskEvent` sequence —
  collects `api_retry` system events' `error_status` codes and the
  terminal `ResultEvent.{ok, subtype, errors}`.
- `classifyOutcome` maps to a `FailoverTrigger`: 401/403 → `auth_error`,
  402 → `quota_error`, 429 + retry budget exhausted → `rate_limit`, ≥3
  5xx in a 5-min sliding window per instance → `persistent_5xx`. Falls
  back to error-string regex for SDK errors that bypass `api_retry`.
- `applyOutcome` runs the state machine: bump on match, pause-or-wrap
  when chain exhausted, reset to tier 1 on success after cooldown.

Container reconcile:
- `buildProviderEnvLines` now picks the active chain entry via
  `pickActiveChainEntry(agent)` (reads tier from state, falls back to
  legacy `agent.provider` for hand-crafted test fixtures).

Runner hookup:
- `agent-runner.ts` builds an outcome incrementally inside the
  for-await loop, then calls `applyOutcome` after stream close.
  Best-effort — DB error never fails the task. Logs the action when
  not a noop.

Tests:
- `provider-failover.test.ts` — 23 cases covering classifier (status
  codes, 5xx window, error-string fallback) and state machine (bump,
  pause, wraparound, cooldown reset, trigger filtering, legacy fixture
  fallback). 47 expect calls, 0 fail.
- `webhook-config.test.ts` — 12 new cases covering chain shapes,
  validation errors, host-mode gate, custom failover policy, legacy
  migration. 22 expect calls, 0 fail.
- Existing single-provider host-mode test updated for the new error
  message wording.

Out of scope (follow-ups):
- M26-2: wizard step 3 to edit chains.
- M26-3: tier badge in agents list + ledger drawer.
- M26-4: token_budget enforcement (already specced in
  `docs/specs/token-budget.md`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #549.

New "Provider" section tab between Thresholds and Container in the
type-edit form. Operators can now configure M26-1's `provider_chain` +
`failover` block through the UI instead of hand-editing
`config/agents.json`.

Wizard step:
- Renders 1–3 tier rows with per-row provider dropdown
  (anthropic / deepseek / ollama) + free-text model input.
- `+ Add tier` button (disabled at 3). `×` remove (disabled at 1).
- `↑` / `↓` reorder buttons; tiers renumber to contiguous 1..N on save.
- Failover policy block (only renders for chain length ≥ 2):
  cooldown_min numeric input, pause_if_all_fail checkbox, triggers
  checkbox group (4 active + token_budget reserved/disabled).
- Inline validation banner: host-mode + non-anthropic tier 1, duplicate
  providers, missing model ids.

Wire-up:
- `apps/web/src/lib/api.ts` declares `ProviderChainEntry`,
  `FailoverPolicy`, `AgentProvider`, `FailoverTrigger` plus the
  `FAILOVER_TRIGGERS` UI metadata array.
- `AgentTypeConfig` gains `provider_chain` and `failover` fields. The
  legacy `provider` + `default_model` fields stay populated in sync with
  tier 1 so existing UI surfaces (Thresholds model picker) keep working.
- Save path round-trips through the existing `PUT /config/agents`
  endpoint — no new wire format.

Tests:
- `config.test.tsx` — 7 new cases: legacy 1-tier render, add-tier
  appends, remove disabled at 1, reorder renumbers, failover block
  visibility gated on chain length, token_budget disabled,
  host-mode warning surfaces.
- Existing "renders all six section tabs" test renamed to drop the
  hardcoded count.

Out of scope:
- Live model combobox per row (free-text input today; M26-3 will wire
  `GET /agents/models?provider=<id>` once the endpoint is available).
- Per-row "ready/⚠ env missing" status indicator (same blocker).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(dashboard): tier badge + reset/pause controls + ledger endpoint (M26-3)
All checks were successful
qa / qa (pull_request) Successful in 13m39s
qa / dockerfile (pull_request) Successful in 15s
8dbd864d0c
Closes #550.

Surfaces M26-1's per-instance failover state on the /agents page and
adds operator controls. Three new server endpoints + a tier-badge
component on the agents-list rows.

Server:
- `handleAgentsList` joins `agent_provider_state` per row; response gains
  a `provider_state` block (current_tier, active_provider, active_model,
  chain_length, last_failover_at, last_failure_kind, paused, cooldown_min).
- `GET /agents/:name/provider-events?since=&limit=` — failover ledger
  rows from `agent_provider_events` table. Empty on a fresh agent.
- `POST /agents/:name/reset-tier` — clears `current_tier=1, paused=0,
  last_failover_at=NULL`. Also clears the in-memory 5xx sliding window.
- `POST /agents/:name/pause` / `/unpause` — toggles `paused` flag.
- All POST endpoints behind `guardMutating` (operator-session auth).
- 404 when agent name isn't in DB.

Web:
- `AgentsListEntry` gains optional `provider_state` block; absence handled
  defensively (legacy fixture compatibility).
- `TierBadge` component renders ①/②/③ glyph (or ✕ when paused) plus the
  active model id, an inline cooldown countdown when degraded, and a
  hover tooltip with the last failure kind. Reset (↺) button shown only
  on degraded/paused rows; pause/resume (⏸/▶) toggle on every row.
- `agents.tsx` mounts react-query mutations for the three control
  endpoints; success refetches `/agents`, error fires a toast.
- New "Tier" column added between Model and Match labels in the agents
  table.

Tests:
- `main-agents.test.ts` — 6 cases covering response shape (every row has
  provider_state with current_tier=1), empty ledger on fresh agent,
  404 on unknown agent, pause/unpause toggle, reset-tier, auth rejection.

Out of scope (deferred to follow-ups):
- Drawer tab with stacked-bar tier-time distribution + chronological
  event list — requires a new instance-detail drawer that doesn't
  exist today on the agents page. Standalone follow-up.
- Live model combobox in M26-2 wizard (depends on `/agents/models?provider`
  endpoint — already shipped in PR #547 but UI uses free-text for now).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #552 review feedback on commit a7128e2:

- Fix every file:line reference — all stale post-config-split.
  webhook-config.ts loader is at 2027-2029 (not 1857-1859); type decl
  482-484 + 2466 (not 469-471, 2296); main.ts handleUsage is 1420-1441
  (not 1357-1375).
- Correct the `UsageResult` shape snippet — actual fields are
  `window: { reset_at, since, kind }` (not bare string), `totals:
  UsageTotals` (not bare `total_tokens`). Cache columns are
  `cache_read_tokens` / `cache_creation_tokens` (not `*_input_tokens`).
- Add explicit caveat for `usage_reset: "all"` — under "all",
  `usageBounds` returns `resetMs: null`, so a per-type budget would
  trip exactly once forever. Recommendation: M26-5 rejects "all" for
  budgeted types at load time.
- Move the post-task-hook pointer from `agent-runner.ts:489` (helper
  fn) to `:981` (M26-1's classifier hook), since that's the actual
  integration site after M26-1 landed.
- Spell out the dual-cap UX caveat — global ring + per-type rings can
  disagree. Tooltip needs to say "everything fine globally ≠ no type
  has failed over."
- Mark M26-5/6/7 as "to be filed" rather than implying they exist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #552 review feedback on commit 85d5a51:

Bug fixes:
- Cycle-free `registerOnAgentDelete` hook in db.ts. provider-failover
  registers `clear5xxWindow` so deleting an instance scrubs its sliding-
  window samples. Recreating an agent with the same name no longer
  inherits stale 5xx hits.
- `applyOutcome` returns `noop` for legacy `ResolvedAgent` shapes
  (no `provider_chain`). Previously a single auth_error / quota_error
  on a legacy fixture immediately paused the instance via the
  synthesised 1-tier chain + `pause_if_all_fail: true` default. Agents
  that never opted into the chain semantic now stay where they are.
- Hybrid `provider` + `provider_chain` rejection message + JSDoc no
  longer says "chain wins" while the code rejects. Two stories made
  one: the loader requires the legacy `provider` field to match
  `provider_chain[tier=1].provider`, otherwise rejects.

New tests:
- Legacy fixture path now asserts noop instead of paused.
- Delete-cascade test: 2 5xx hits, delete + recreate, confirm a third
  5xx doesn't trip persistent_5xx (window cleared).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #552 review feedback on commit 8dbd864:

- Server: `serialiseProviderState()` coerces SQLite's `paused: 0|1` to
  `paused: boolean` on every endpoint (`reset-tier`, `pause`, `unpause`).
  Web wrapper type updated to match. List endpoint already returned
  boolean — the three POSTs now agree.
- Server: 404 messages no longer echo the raw `name` path segment
  (defence-in-depth; XSS unlikely with Hono's plain-text response but
  no reason to leak the input).
- Web: TierBadge ticks a 30 s `setInterval` while the row is degraded
  so the cooldown countdown advances without an external invalidation.
  Timer torn down once the agent recovers or unmounts.
- Web: pause/unpause/reset-tier mutations now also invalidate
  `["agents-health"]` so FleetHealthStrip re-colours immediately
  rather than waiting for its 5 s poll.
- Tests updated to assert `paused: boolean` on all three POST endpoints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(dashboard): close M26-2 review gaps — dedup types, mirror tier-1 model, error styling
Some checks failed
qa / qa (pull_request) Has been cancelled
qa / dockerfile (pull_request) Has been cancelled
33cd84b422
PR #552 review feedback on commit 2feadfe:

- `AgentProvider`, `ProviderChainEntry`, `FailoverTrigger`,
  `DEFAULT_FAILOVER` re-exported from `@claude-hooks/shared` instead of
  re-declared locally in `apps/web/src/lib/api.ts`. One source of truth
  per global memory rule (`feedback_no_compat_bridge`). `FailoverPolicy`
  remains a web-local mid-edit wrapper that allows undefined fields.
- ThresholdsSection's Model dropdown now mirrors edits into
  `provider_chain[tier=1].model` so the two surfaces don't disagree
  after an operator edits from Thresholds. Both addresses now write to
  the same source.
- Provider chain validation banner gains `role="alert"` + `aria-live`
  for screen-reader announcement, and styles host-mode-mismatch +
  missing-model warnings as hard errors (red,  glyph) since the
  server WILL reject those on save. Duplicate-providers stays as a
  warning (server accepts).

Tests stayed green (7/7 ProviderSection tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(agents): token budget enforcement + wizard input + tier badge icon (M26-5/6/7)
Some checks failed
qa / qa (pull_request) Failing after 9m14s
qa / dockerfile (pull_request) Successful in 15s
0600d01980
Closes the audit follow-ups from `docs/specs/token-budget.md` (M26-4).
Single PR rather than three tickets — operator opted for fast track.

M26-5 — server enforcement:
- `agentTypeSchema` adds optional `usage_threshold_tokens: positive int`.
  Loader rejects non-integer / negative values with a path-qualified
  error.
- `AgentTypeConfig` + `ResolvedAgent` carry the field through. Loader
  preserves the budget on the type even when the chain has length 1
  (where `token_budget` trigger is stripped) — adding a tier later
  re-activates without losing the budget.
- `recordExternalTrigger(agent, kind)` in `provider-failover.ts` —
  fires a tier flip via the same code path the SDK-error classifier
  uses. Respects the trigger allowlist, the legacy-fixture noop, and
  the paused state.
- Post-task hook in `agent-runner.ts` — after `applyOutcome` runs
  (and only if it didn't already move the tier), if the type has a
  budget set + `token_budget` enabled + active window isn't `"all"`,
  sums input + output tokens for the type's agents over the active
  window via `computeUsage()` and fires `recordExternalTrigger` on
  exceedance. Logs `usage=X/Y` for operator visibility.

M26-6 — wizard input:
- `ProviderSection` failover block gains a `token_budget` row when the
  trigger is checked: number input bound to `usage_threshold_tokens`,
  helper text noting the window comes from service config.
- `FAILOVER_TRIGGERS` metadata for `token_budget` updated — no longer
  marked reserved/disabled. Helper text matches the live behaviour.
- (Per-type ring in the Stats tab deferred — would need a richer Stats
  page redesign than this PR covers. The wizard input is the v1
  surface; ring follows.)

M26-7 — tier badge icon:
- TierBadge renders  next to the glyph when `last_failure_kind ===
  "token_budget"`. Tooltip shows the exact kind. New
  `agent-tier-icon-${name}` data-testid for E2E coverage.

Tests:
- `provider-failover.test.ts` — 3 new cases for `recordExternalTrigger`
  (allowlist respect, bump-when-enabled, legacy-fixture noop). 27/27.
- `webhook-config.test.ts` — 3 new cases for budget parsing
  (multi-tier accepts, single-tier strips trigger but preserves
  budget, rejects negative + non-integer). 15/15 M26 tests.

Loader behaviour change vs M26-1:
- `token_budget` strip warning rephrased to drop the "M26-5 will land
  enforcement" note (it has). Behaviour identical: single-tier chains
  still drop the trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test(dashboard): update token_budget tests for M26-5 enforcement
All checks were successful
qa / qa (pull_request) Successful in 14m16s
qa / dockerfile (pull_request) Successful in 14s
61e883740a
The "reserved for M26-5" assertion broke once M26-5/6/7 landed. Flip
the existing test to assert the checkbox is now enabled, and add two
new cases covering the budget input row visibility (rendered when the
trigger is on, hidden when off).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
charles deleted branch feat/m26-multi-provider-failover 2026-04-29 13:53:42 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
charles/claude-hooks!552
No description provided.