M26 — multi-provider failover (chain + UI + audit) #552
Summary
Replaces the single `provider` field per agent type (PR #547) with a 1–3 entry ordered chain plus per-instance failover state. On auth/quota/rate-limit/persistent-5xx failure the dispatcher steps down a tier; after cooldown a successful task resets to tier 1.

4 commits, one per ticket:

- `a7128e2` docs(specs): audit `usage_threshold_tokens` + spec out enforcement
- `85d5a51` feat(agents): provider chain + per-instance failover state machine
- `2feadfe` feat(dashboard): provider chain editor in `/config` wizard
- `8dbd864` feat(dashboard): tier badge + reset/pause controls + ledger endpoint

What's in each
- M26-4 (`docs/specs/token-budget.md`): audit confirms `usage_threshold_tokens` is a UI-only stub today (3 read sites, 0 enforcement). Spec proposes a per-type budget driving the M26-1 chain via the `token_budget` trigger. Implementation deferred to M26-5/6/7.
- M26-1: schema (`provider_chain` + `failover` blocks), legacy single-`provider` migration, two SQLite tables (`agent_provider_state` + `agent_provider_events` ledger), classifier (`provider-failover.ts`), state machine (`applyOutcome`), tier-aware container env-file rewrite, agent-runner hook. 23 classifier/state-machine tests + 12 loader tests, all green.
- M26-2: new "Provider" section tab in the `/config` wizard between Thresholds and Container. Provider dropdown + free-text model input per row, ↑/↓ reorder, +/× add/remove (max 3 / min 1), failover policy block (cooldown + pause-or-wrap + trigger checkboxes; `token_budget` greyed out per M26-4). Validation banner for host-mode + non-anthropic tier 1, duplicate providers, empty models. 7 component tests.
- M26-3: `provider_state` block on the `/agents` response, 4 new endpoints: `GET /agents/:name/provider-events`, `POST /agents/:name/{reset-tier,pause,unpause}`. New "Tier" column on the `/agents` page renders a ①/②/③ glyph (or ✕ when paused) + active model id + cooldown countdown when degraded. Per-row reset (↺) + pause/resume (⏸/▶) buttons. 6 endpoint tests.

Test plan
- `just typecheck`: all 4 packages clean.
- […] `main`: session JSONL pruning + foreman session CRUD).
- […] `theme.test.ts` / `selected-repos.test.ts`).
- […] `anthropic` → `deepseek` → `ollama`) on `dev`, force a 401 by setting a bad `DEEPSEEK_API_KEY`, dispatch a task, confirm the container recreates with tier-2 env.
- […] `↺` button on the `/agents` page; confirm `agent_provider_state.paused = 0` and the worker resumes dispatching.

Known gaps (intentional deferrals)
- `/agents/models?provider=` combobox in the M26-2 wizard: endpoint exists from PR #547 but kept as free-text input for v1.
- `token_budget` real enforcement: tracked by the M26-4 audit; implementation in M26-5/6/7 (tickets not yet opened).

🤖 Generated with Claude Code
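For reviewers, the tier behaviour described in the Summary can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in `provider-failover.ts`: the type names, the `applyOutcomeSketch` signature, and the reading of "pause-or-wrap" as "what to do when the last tier also fails" are all hypothetical. Only the trigger kinds, the 1–3 tier chain, the cooldown, and the reset-to-tier-1 rule come from the description above.

```typescript
// Hypothetical types: the PR names applyOutcome and the failure kinds,
// but not these shapes or signatures.
type FailureKind = "auth" | "quota" | "rate_limit" | "persistent_5xx";

type Outcome = { ok: true } | { ok: false; kind: FailureKind };

interface FailoverPolicy {
  chainLength: number;                 // 1 to 3 tiers
  cooldownMs: number;                  // minimum time degraded before a reset
  triggers: ReadonlySet<FailureKind>;  // failure kinds allowed to move the tier
  onExhausted: "pause" | "wrap";       // assumed meaning of "pause-or-wrap"
}

interface ProviderState {
  tier: number;                  // 1-based index into the chain
  paused: boolean;
  degradedSince: number | null;  // epoch ms of the last step-down
}

function applyOutcomeSketch(
  state: ProviderState,
  outcome: Outcome,
  policy: FailoverPolicy,
  now: number,
): ProviderState {
  // A paused instance is left alone until an operator unpauses it.
  if (state.paused) return state;

  if (outcome.ok) {
    // A success only restores tier 1 once the cooldown has elapsed.
    const cooledDown =
      state.degradedSince !== null &&
      now - state.degradedSince >= policy.cooldownMs;
    return cooledDown
      ? { tier: 1, paused: false, degradedSince: null }
      : state;
  }

  // Failures outside the trigger allowlist never move the tier.
  if (!policy.triggers.has(outcome.kind)) return state;

  // Step down one tier while there is a lower tier to use.
  if (state.tier < policy.chainLength) {
    return { tier: state.tier + 1, paused: false, degradedSince: now };
  }

  // Last tier failed: either pause the instance or wrap back to tier 1.
  return policy.onExhausted === "pause"
    ? { ...state, paused: true }
    : { tier: 1, paused: false, degradedSince: null };
}
```

Keeping the transition a pure function of `(state, outcome, policy, now)` is a choice made for this sketch; it makes the cooldown and allowlist cases easy to cover in table-driven tests without touching SQLite.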
docs(specs): audit `usage_threshold_tokens` + spec out enforcement (M26-4) a7128e2b48

Closes the audit follow-ups from `docs/specs/token-budget.md` (M26-4). Single PR rather than three tickets; operator opted for fast track.

M26-5 (server enforcement):
- `agentTypeSchema` adds optional `usage_threshold_tokens: positive int`. Loader rejects non-integer / negative values with a path-qualified error.
- `AgentTypeConfig` + `ResolvedAgent` carry the field through. Loader preserves the budget on the type even when the chain has length 1 (where the `token_budget` trigger is stripped); adding a tier later re-activates it without losing the budget.
- `recordExternalTrigger(agent, kind)` in `provider-failover.ts`: fires a tier flip via the same code path the SDK-error classifier uses. Respects the trigger allowlist, the legacy-fixture noop, and the paused state.
- Post-task hook in `agent-runner.ts`: after `applyOutcome` runs (and only if it didn't already move the tier), if the type has a budget set + `token_budget` enabled + active window isn't `"all"`, sums input + output tokens for the type's agents over the active window via `computeUsage()` and fires `recordExternalTrigger` on exceedance. Logs `usage=X/Y` for operator visibility.

M26-6 (wizard input):
- `ProviderSection` failover block gains a `token_budget` row when the trigger is checked: number input bound to `usage_threshold_tokens`, helper text noting the window comes from service config.
- `FAILOVER_TRIGGERS` metadata for `token_budget` updated: no longer marked reserved/disabled. Helper text matches the live behaviour.
- (Per-type ring in the Stats tab deferred: it would need a richer Stats page redesign than this PR covers. The wizard input is the v1 surface; ring follows.)

M26-7 (tier badge icon):
- TierBadge renders ⛽ next to the glyph when `last_failure_kind === "token_budget"`. Tooltip shows the exact kind. New `agent-tier-icon-${name}` data-testid for E2E coverage.

Tests:
- `provider-failover.test.ts`: 3 new cases for `recordExternalTrigger` (allowlist respect, bump-when-enabled, legacy-fixture noop). 27/27.
- `webhook-config.test.ts`: 3 new cases for budget parsing (multi-tier accepts, single-tier strips trigger but preserves budget, rejects negative + non-integer). 15/15 M26 tests.

Loader behaviour change vs M26-1:
- `token_budget` strip warning rephrased to drop the "M26-5 will land enforcement" note (it has). Behaviour identical: single-tier chains still drop the trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

[…] `usage_threshold_tokens` enforcement #551
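The guard conditions on the M26-5 post-task hook are easy to misread in prose, so here is a rough sketch of the decision it describes. Everything named here (`UsageRow`, `BudgetCheckInput`, `checkTokenBudget`) is hypothetical and only mirrors the commit message; the real hook calls `computeUsage()` and `recordExternalTrigger`, whose signatures are not shown in this PR. Treating "exceedance" as strictly greater than the budget is also an assumption.

```typescript
// Hypothetical shapes for illustration; not the project's actual types.
interface UsageRow {
  agent: string;
  inputTokens: number;
  outputTokens: number;
}

interface BudgetCheckInput {
  tierMoved: boolean;            // did applyOutcome already move the tier?
  budget: number | undefined;    // usage_threshold_tokens for the agent type
  tokenBudgetEnabled: boolean;   // token_budget present in the trigger allowlist
  window: string;                // active usage window; "all" disables the check
  usage: readonly UsageRow[];    // rows for the type's agents in that window
}

type BudgetDecision =
  | { fire: false; reason: string }
  | { fire: true; used: number; budget: number };

function checkTokenBudget(input: BudgetCheckInput): BudgetDecision {
  if (input.tierMoved) return { fire: false, reason: "tier already moved" };
  const budget = input.budget;
  if (budget === undefined) return { fire: false, reason: "no budget set" };
  if (!input.tokenBudgetEnabled) {
    return { fire: false, reason: "token_budget trigger disabled" };
  }
  if (input.window === "all") {
    return { fire: false, reason: "unbounded window" };
  }

  // Sum input + output tokens across every agent of the type.
  const used = input.usage.reduce(
    (sum, row) => sum + row.inputTokens + row.outputTokens,
    0,
  );

  // Assumption: the trigger fires only when usage strictly exceeds the budget.
  return used > budget
    ? { fire: true, used, budget }
    : { fire: false, reason: "under budget" };
}
```

A caller would fire the external trigger only when `fire` is true, and could build the `usage=X/Y` log line from the returned `used` and `budget` values.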