M26-1: provider_chain schema + per-instance failover state + tier-aware container reconcile #548
As an operator, I want each agent type to declare an ordered chain of up to 3 LLM providers and the dispatcher to track per-instance failover state, so that an agent automatically falls through to a cheaper / local provider when its primary provider hits an auth, quota, rate-limit, or persistent 5xx error, and retries the primary after a cooldown without me intervening.
## Background
PR #547 shipped server-side wiring for a single `provider` field per type (anthropic / deepseek / ollama). Operator feedback: a single provider isn't enough. Anthropic Pro Max occasionally rate-limits, DeepSeek occasionally 502s, and Ollama is the local lifeboat. We want first-available with auto-failover, per instance, with a cooldown.

Decisions locked from the PR #547 review thread:

- Cooldown before retrying the primary is configurable (`cooldown_min` knob).
- Failover state is per instance (each `<type>-<n>` has its own current tier).
- `token_budget` trigger deferred: the `usage_threshold_tokens` audit (M26-4) shows it's a UI-only stub today, and bringing it into the chain requires real budget enforcement first.

## Schema
The single-`provider` legacy shape continues to load: the loader wraps `{ provider, default_model }` into a 1-tier `provider_chain`, so existing configs keep booting unchanged.
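For concreteness, a minimal sketch of a 3-tier chain as a TypeScript literal. Only the field names (`provider_chain`, `tier`, `provider`, `default_model`, the `failover` keys) come from the acceptance criteria below; the model strings and surrounding shape are placeholders:

```ts
// Hypothetical 3-tier chain for one agent type; model strings are made up.
const reviewerType = {
  provider_chain: [
    { tier: 1, provider: "anthropic", default_model: "claude-sonnet" },  // primary
    { tier: 2, provider: "deepseek",  default_model: "deepseek-chat" },  // cheaper fallback
    { tier: 3, provider: "ollama",    default_model: "qwen2.5-coder" },  // local lifeboat
  ],
  failover: {
    cooldown_min: 30,        // retry tier 1 after 30 minutes
    triggers: ["auth_error", "quota_error", "rate_limit", "persistent_5xx"],
    pause_if_all_fail: true, // park the instance instead of looping
  },
};
```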
## Acceptance criteria

### Schema + loader
- `agentTypeSchema` adds `provider_chain` (1–3 entries) + a `failover` block; both optional.
- Loader rejects: `provider_chain` length > 3, duplicate tier numbers, gaps in the tier sequence, and `tier 1.provider != "anthropic"` on host-mode types (mirrors the existing single-provider host-mode gate).
- Legacy `provider` + `default_model` is auto-wrapped into `provider_chain[0]`, with a single `console.warn` at load time pointing to the new shape; behaviour is otherwise identical.
- `failover.cooldown_min` ≥ 1; `failover.triggers` ⊆ `{auth_error, quota_error, rate_limit, persistent_5xx}` (`token_budget` excluded by validation); see the validation sketch below.
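The issue doesn't name the validation library behind `agentTypeSchema`; assuming zod, the tier-sequence rules could look roughly like this (a sketch, not the shipped code; the host-mode tier-1 gate would live on the parent object and is omitted):

```ts
import { z } from "zod";

const providerTier = z.object({
  tier: z.number().int().min(1).max(3),
  provider: z.enum(["anthropic", "deepseek", "ollama"]),
  default_model: z.string(),
});

// Rejects length > 3, duplicate tiers, and gaps: sorted tiers must be 1..n.
const providerChain = z
  .array(providerTier)
  .min(1)
  .max(3)
  .superRefine((chain, ctx) => {
    const tiers = chain.map((t) => t.tier).sort((a, b) => a - b);
    if (tiers.some((tier, i) => tier !== i + 1)) {
      ctx.addIssue({
        code: z.ZodIssueCode.custom,
        message: "tiers must be a gapless 1..n sequence with no duplicates",
      });
    }
  });

const failoverSchema = z.object({
  cooldown_min: z.number().int().min(1),
  // token_budget is deliberately absent (excluded by validation).
  triggers: z
    .array(z.enum(["auth_error", "quota_error", "rate_limit", "persistent_5xx"]))
    .optional(),
  pause_if_all_fail: z.boolean().optional(),
});

// These would then hang off agentTypeSchema as optional fields:
//   provider_chain: providerChain.optional(),
//   failover: failoverSchema.optional(),
```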
### State table

- New `agent_provider_state` table (`agent_name` PK, `current_tier`, `last_failover_at`, `last_failure_kind`, `paused`); sketched below.
- Every `agents` row starts at `current_tier = 1, paused = 0`.
- On `agents` row insert: a trigger inserts a paired `agent_provider_state` row (`current_tier` defaults to 1).
- On `agents` row delete: cascade deletes the paired state row.
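A sketch of the DDL this might translate to, written as a TS string in the `ensureSchema` style. Only the table and column names come from the AC; the SQLite dialect, the `agents(name)` parent column, and the trigger name are assumptions:

```ts
// Assumed SQLite DDL for the paired-state table and insert trigger.
export const AGENT_PROVIDER_STATE_DDL = `
  CREATE TABLE IF NOT EXISTS agent_provider_state (
    agent_name        TEXT PRIMARY KEY
                      REFERENCES agents(name) ON DELETE CASCADE,
    current_tier      INTEGER NOT NULL DEFAULT 1,
    last_failover_at  INTEGER,  -- epoch ms; NULL until the first failover
    last_failure_kind TEXT,     -- auth_error | quota_error | rate_limit | persistent_5xx
    paused            INTEGER NOT NULL DEFAULT 0
  );

  CREATE TRIGGER IF NOT EXISTS agents_insert_provider_state
  AFTER INSERT ON agents
  BEGIN
    INSERT OR IGNORE INTO agent_provider_state (agent_name) VALUES (NEW.name);
  END;
`;
```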
### Dispatcher classification

- Upstream failures are classified onto `failover.triggers` (sketch after this list):
  - `auth_error` ← upstream 401/403
  - `quota_error` ← 402, or response body matching `/insufficient credits|quota.*exceeded/i`
  - `rate_limit` ← 429 after the retry budget is exhausted (not a single 429 with retry-after)
  - `persistent_5xx` ← more than 3 5xx responses in a 5-minute sliding window, per instance
- On a matched trigger: bump `current_tier`, set `last_failover_at = now()`, set `last_failure_kind`. If the new `current_tier > chain.length`, mark `paused = 1` iff `failover.pause_if_all_fail`.
- On success while `current_tier > 1` and `now() - last_failover_at >= cooldown_min`: reset `current_tier = 1` and clear `last_failure_kind`. If still inside the cooldown, leave the tier unchanged.
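A sketch of both pieces, using the state columns named above. The function names mirror the shipped `applyOutcome`, but the signatures, the caller-maintained 5xx window, and the `retryExhausted` flag are illustrative:

```ts
type FailureKind = "auth_error" | "quota_error" | "rate_limit" | "persistent_5xx";

interface ProviderState {
  current_tier: number;
  last_failover_at: number | null; // epoch ms
  last_failure_kind: FailureKind | null;
  paused: boolean;
}

// Classification per the table above. `recent5xx` is the per-instance
// sliding-window count; `retryExhausted` means the 429 retry budget is
// spent (a single 429 with retry-after does NOT classify).
function classifyFailure(
  status: number,
  body: string,
  retryExhausted: boolean,
  recent5xx: number,
): FailureKind | null {
  if (status === 401 || status === 403) return "auth_error";
  if (status === 402 || /insufficient credits|quota.*exceeded/i.test(body)) return "quota_error";
  if (status === 429 && retryExhausted) return "rate_limit";
  if (status >= 500 && recent5xx > 3) return "persistent_5xx";
  return null;
}

// Tier state machine: bump on a matched trigger, pause or wrap when the
// chain is exhausted, reset to tier 1 on success once the cooldown elapses.
function applyOutcome(
  state: ProviderState,
  kind: FailureKind | null, // null = successful call
  chainLength: number,
  cfg: { cooldown_min: number; triggers: FailureKind[]; pause_if_all_fail: boolean },
  now = Date.now(),
): ProviderState {
  if (kind === null) {
    const cooldownOver =
      state.last_failover_at !== null &&
      now - state.last_failover_at >= cfg.cooldown_min * 60_000;
    return state.current_tier > 1 && cooldownOver
      ? { ...state, current_tier: 1, last_failure_kind: null }
      : state; // inside the cooldown the tier stays put
  }
  if (!cfg.triggers.includes(kind)) return state; // not a configured trigger
  const bumped = state.current_tier + 1;
  if (bumped > chainLength) {
    // Chain exhausted: pause iff configured, otherwise loop back to tier 1.
    return cfg.pause_if_all_fail
      ? { ...state, paused: true, last_failover_at: now, last_failure_kind: kind }
      : { ...state, current_tier: 1, last_failover_at: now, last_failure_kind: kind };
  }
  return { ...state, current_tier: bumped, last_failover_at: now, last_failure_kind: kind };
}
```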
### Container reconcile (env-file rewrite)

- `buildProviderEnvLines` reads `agent_provider_state.current_tier` and selects `provider_chain[tier - 1]`. Falls back to tier 1 if the row is missing (defensive; sketch after this list).
- `dockerRun` recreates the container if the env-file would differ from the running container's env (image inspect compares the `ANTHROPIC_BASE_URL` env var).
- `reconcileOne` is only called via the post-task hook, after the task event stream closes.
- `paused = 1` instances skip dispatch entirely; the `/agents` health endpoint surfaces them as `state: "paused", reason: "all_tiers_exhausted"`.
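A sketch of the tier selection. The issue only confirms that `ANTHROPIC_BASE_URL` is the compared env var; the per-provider base-URL values and the model env-var name are assumptions:

```ts
interface ProviderTierEntry {
  provider: "anthropic" | "deepseek" | "ollama";
  default_model: string;
}

// Picks the chain entry for the instance's current tier. Missing row →
// defensive tier-1 fallback; paused → throw so the caller skips dispatch.
function buildProviderEnvLines(
  chain: ProviderTierEntry[],
  state: { current_tier: number; paused: boolean } | undefined,
): string[] {
  if (state?.paused) throw new Error("paused: all_tiers_exhausted");
  const entry = chain[(state?.current_tier ?? 1) - 1] ?? chain[0];
  const lines = [`PROVIDER_MODEL=${entry.default_model}`]; // env name assumed
  switch (entry.provider) {
    case "anthropic":
      return lines; // no base-URL override: first-party API
    case "deepseek":
      return [...lines, "ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic"];
    case "ollama":
      return [...lines, "ANTHROPIC_BASE_URL=http://host.docker.internal:11434"];
  }
}
```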
## Tests

- Chain exhaustion with `pause_if_all_fail: true` → paused; with `false` → loops back to tier 1 (operator's choice).
- `buildProviderEnvLines`: tier-2 active → returns deepseek lines; paused instance → throws (caller must skip dispatch). A sketch of the pause cases follows.
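The test runner isn't named in this issue; assuming vitest and reusing the `applyOutcome` sketch above, the two pause-behaviour cases might read like this:

```ts
import { describe, expect, it } from "vitest";

// Illustrative fixtures: an instance already on tier 3 of a 3-tier chain
// takes one more rate_limit hit.
const state = { current_tier: 3, last_failover_at: null, last_failure_kind: null, paused: false };
const cfg = { cooldown_min: 30, triggers: ["rate_limit"] as FailureKind[], pause_if_all_fail: true };

describe("chain exhausted", () => {
  it("pauses when pause_if_all_fail is true", () => {
    expect(applyOutcome(state, "rate_limit", 3, cfg).paused).toBe(true);
  });

  it("loops back to tier 1 when pause_if_all_fail is false", () => {
    const next = applyOutcome(state, "rate_limit", 3, { ...cfg, pause_if_all_fail: false });
    expect(next).toMatchObject({ current_tier: 1, paused: false });
  });
});
```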
## Out of scope

- `usage_threshold_tokens` real enforcement (covered by M26-4).

## References
- Single-`provider` wiring: #547
- `docs/providers.md` (single-provider operator setup); needs a sibling `docs/provider-chain.md` doc as part of this story.
- `agentTypeSchema`: `apps/server/src/shared/config/webhook-config.ts`, line ~1222.
- `provider` env injection: `apps/server/src/infrastructure/container/container-reconcile.ts::buildProviderEnvLines`.
Shipped in PR #552 (`feat(agents): provider chain + per-instance failover state machine (M26-1)`). All ACs met:

- `provider_chain` (1–3 tiers) + `failover` block on `agentTypeSchema`. Legacy single-`provider` auto-wraps. Host-mode + non-anthropic tier 1 rejected. `token_budget` stripped on single-tier chains.
- `agent_provider_state` + `agent_provider_events` ledger added to `ensureSchema`. `createAgent` upserts the paired state row; `deleteAgent` cascades and scrubs the in-memory 5xx window.
- Classifier (`provider-failover.ts`): maps `api_retry` status codes (401/403 → auth_error, 402 → quota_error, 429 + retryExhausted → rate_limit, ≥3 5xx in 5 min → persistent_5xx), with an error-string fallback.
- State machine (`applyOutcome`): bump on match, pause/wraparound when the chain is exhausted, reset on success after cooldown.
- `buildProviderEnvLines` reads the tier from state and picks the chain entry. Falls back to the legacy fixture path defensively.
- Post-task hook at `agent-runner.ts:981`, best-effort (a DB error doesn't fail the task).
- Tests: 23 classifier/state-machine + 12 loader + delete-cascade. M26-5/6/7 follow-up shipped in the same PR (token_budget enforcement + wizard input + ⛽ icon).