M26-1: provider_chain schema + per-instance failover state + tier-aware container reconcile #548

Closed
opened 2026-04-29 12:23:46 +00:00 by claude-desktop · 1 comment
Collaborator

As an operator, I want each agent type to declare an ordered chain of up to 3 LLM providers and the dispatcher to track per-instance failover state, so that an agent automatically falls through to a cheaper / local provider when its primary provider hits an auth, quota, rate-limit, or persistent 5xx error, and retries the primary after a cooldown without me intervening.

Background

PR #547 shipped server-side wiring for a single provider field per type (anthropic / deepseek / ollama). Operator feedback: a single provider isn't enough — Anthropic Pro Max occasionally rate-limits, DeepSeek occasionally 502s, Ollama is the local lifeboat. Operators want first-available semantics with auto-failover, per-instance, with a cooldown.

Decisions locked from PR #547 review thread:

  1. Cooldown is chain-wide, not per-tier (single cooldown_min knob).
  2. Failover state is per-instance (each <type>-<n> has its own current tier).
  3. In-flight task on a failing tier is allowed to finish (best-effort); the tier flip applies to the next dispatch.
  4. token_budget trigger deferred — usage_threshold_tokens audit (M26-4) shows it's a UI-only stub today; bringing it into the chain requires real budget enforcement first.

Schema

```jsonc
"types.<name>": {
  // ... existing fields ...
  "provider_chain": [
    { "tier": 1, "provider": "anthropic", "model": "claude-opus-4-7" },
    { "tier": 2, "provider": "deepseek",  "model": "deepseek-v4-pro" },
    { "tier": 3, "provider": "ollama",    "model": "qwen3-coder:latest" }
  ],
  "failover": {
    "cooldown_min": 60,
    "pause_if_all_fail": true,
    "triggers": ["auth_error", "quota_error", "rate_limit", "persistent_5xx"]
  }
}
```

Single-provider legacy shape continues to load — loader wraps { provider, default_model } into a 1-tier provider_chain so existing configs keep booting unchanged.
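A minimal sketch of that normalization step, assuming illustrative type names (`ChainEntry`, `AgentTypeConfig` are hypothetical — the real shapes live in `agentTypeSchema`):

```typescript
// Hypothetical shapes for illustration; the real schema lives in agentTypeSchema.
interface ChainEntry { tier: number; provider: string; model: string }
interface AgentTypeConfig {
  provider?: string;
  default_model?: string;
  provider_chain?: ChainEntry[];
}

// Wrap the legacy { provider, default_model } shape into a 1-tier chain
// so existing configs keep booting unchanged.
function normalizeProviderChain(cfg: AgentTypeConfig): ChainEntry[] {
  if (cfg.provider_chain?.length) return cfg.provider_chain;
  if (cfg.provider) {
    console.warn("provider/default_model is deprecated; migrate to provider_chain");
    return [{ tier: 1, provider: cfg.provider, model: cfg.default_model ?? "" }];
  }
  throw new Error("agent type declares neither provider nor provider_chain");
}
```

Behaviour is identical either way: downstream code only ever sees a `provider_chain`.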

Acceptance criteria

Schema + loader

  • agentTypeSchema adds provider_chain (1–3 entries) + failover block; both optional.
  • Loader rejects: provider_chain length > 3, duplicate tier numbers, gaps in tier sequence, tier 1.provider != "anthropic" on host-mode types (mirrors existing single-provider host-mode gate).
  • Legacy provider + default_model auto-wrapped into provider_chain[0]. console.warn once at load time pointing to the new shape; behaviour identical.
  • failover.cooldown_min ≥ 1, failover.triggers ⊆ {auth_error, quota_error, rate_limit, persistent_5xx} (token_budget excluded by validation).
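The loader rejection rules above could look roughly like this (a sketch under assumed names — `validateProviderChain` and `hostMode` are illustrative, not the shipped API):

```typescript
interface ChainEntry { tier: number; provider: string; model: string }

// Sketch of the loader-side validation rules: 1-3 entries, contiguous
// tiers starting at 1, and the host-mode anthropic gate on tier 1.
function validateProviderChain(chain: ChainEntry[], hostMode: boolean): void {
  if (chain.length < 1 || chain.length > 3) {
    throw new Error("provider_chain must have 1-3 entries");
  }
  const tiers = chain.map((e) => e.tier).sort((a, b) => a - b);
  tiers.forEach((t, i) => {
    // Contiguity check covers both duplicates and gaps in one pass.
    if (t !== i + 1) throw new Error("tiers must be 1..n with no gaps or duplicates");
  });
  const tier1 = chain.find((e) => e.tier === 1)!;
  if (hostMode && tier1.provider !== "anthropic") {
    throw new Error("host-mode types must keep anthropic at tier 1");
  }
}
```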

State table

  • New SQLite table agent_provider_state(agent_name PK, current_tier, last_failover_at, last_failure_kind, paused).
  • Migration seeds one row per existing agents row at current_tier = 1, paused = 0.
  • On agents row insert: trigger inserts a paired agent_provider_state row (current_tier defaults to 1).
  • On agents row delete: cascade deletes the paired state row.
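The DDL might look like this (a sketch only — column types, trigger names, and the `agents(name)` key are assumptions; the cascade relies on `PRAGMA foreign_keys = ON`):

```typescript
// DDL sketch for the per-instance state table; exact types and names may
// differ from what ensureSchema actually emits.
const createStateTable = `
CREATE TABLE IF NOT EXISTS agent_provider_state (
  agent_name        TEXT PRIMARY KEY REFERENCES agents(name) ON DELETE CASCADE,
  current_tier      INTEGER NOT NULL DEFAULT 1,
  last_failover_at  INTEGER,
  last_failure_kind TEXT,
  paused            INTEGER NOT NULL DEFAULT 0
);`;

// Migration seed: one state row per existing agent, at tier 1 / unpaused.
const seedExisting = `
INSERT OR IGNORE INTO agent_provider_state (agent_name)
SELECT name FROM agents;`;

// Keep the pairing in sync for future inserts.
const pairOnInsert = `
CREATE TRIGGER IF NOT EXISTS agents_pair_state AFTER INSERT ON agents
BEGIN
  INSERT OR IGNORE INTO agent_provider_state (agent_name) VALUES (NEW.name);
END;`;
```

The delete side needs no trigger here: `ON DELETE CASCADE` on the foreign key removes the paired state row.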

Dispatcher classification

  • After every task result, dispatcher classifies the outcome against the type's failover.triggers:
    • auth_error ← upstream 401/403
    • quota_error ← 402 / response body match /insufficient credits|quota.*exceeded/i
    • rate_limit ← 429 after retry budget is exhausted (not a single 429 with retry-after)
    • persistent_5xx ← > 3 5xx in 5 min sliding window per instance
  • On a matching trigger: bump current_tier, set last_failover_at = now(), set last_failure_kind. If new current_tier > chain.length, mark paused = 1 iff failover.pause_if_all_fail.
  • On any successful task: if current_tier > 1 and now() - last_failover_at >= cooldown_min, reset current_tier = 1 and clear last_failure_kind. If still inside cooldown, leave the tier unchanged.
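The classification and tier-bump rules above can be sketched as a pure function over the previous state. All names here are illustrative (the shipped code splits this differently across the dispatcher), and the `Outcome` shape is an assumption:

```typescript
type Trigger = "auth_error" | "quota_error" | "rate_limit" | "persistent_5xx";

interface Outcome {
  success: boolean;
  status?: number;           // upstream HTTP status, if any
  body?: string;             // response body, for quota-message matching
  retryExhausted?: boolean;  // 429 retry budget spent
  recent5xx?: number;        // 5xx count in the 5-min sliding window
}

interface ProviderState {
  currentTier: number;
  lastFailoverAt: number | null; // epoch ms
  lastFailureKind: Trigger | null;
  paused: boolean;
}

// Map a failed outcome to a trigger kind per the rules above.
function classify(o: Outcome): Trigger | null {
  if (o.status === 401 || o.status === 403) return "auth_error";
  if (o.status === 402 || /insufficient credits|quota.*exceeded/i.test(o.body ?? "")) {
    return "quota_error";
  }
  if (o.status === 429 && o.retryExhausted) return "rate_limit";
  if ((o.recent5xx ?? 0) > 3) return "persistent_5xx";
  return null;
}

// Apply one task outcome: bump on a matching trigger, pause or wrap when the
// chain is exhausted, reset to tier 1 on success after the cooldown elapses.
function applyOutcome(
  s: ProviderState, o: Outcome, now: number,
  chainLength: number, cooldownMin: number, pauseIfAllFail: boolean,
  enabled: Trigger[],
): ProviderState {
  if (o.success) {
    const cooledDown =
      s.lastFailoverAt !== null && now - s.lastFailoverAt >= cooldownMin * 60_000;
    return s.currentTier > 1 && cooledDown
      ? { ...s, currentTier: 1, lastFailureKind: null }
      : s; // inside cooldown: leave the tier unchanged
  }
  const kind = classify(o);
  if (kind === null || !enabled.includes(kind)) return s;
  const next = s.currentTier + 1;
  if (next > chainLength) {
    return pauseIfAllFail
      ? { ...s, paused: true, lastFailoverAt: now, lastFailureKind: kind }
      : { ...s, currentTier: 1, lastFailoverAt: now, lastFailureKind: kind }; // wrap
  }
  return { ...s, currentTier: next, lastFailoverAt: now, lastFailureKind: kind };
}
```

Keeping this pure (state in, state out) makes the test matrix in the Tests section table-driven.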

Container reconcile (env-file rewrite)

  • buildProviderEnvLines reads agent_provider_state.current_tier and selects provider_chain[tier-1]. Falls back to tier 1 if the row is missing (defensive).
  • dockerRun recreates the container if the env-file would differ from the running container's env (image inspect compares ANTHROPIC_BASE_URL env var).
  • In-flight task is NOT killed by reconcile — reconcileOne is only called via the post-task hook after the task event stream closes.
  • paused = 1 instances skip dispatch entirely; /agents health endpoint surfaces them as state: "paused", reason: "all_tiers_exhausted".
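The tier-selection half of `buildProviderEnvLines` could be sketched like this. The env var names and provider base URLs below are assumptions for illustration, not what the shipped `container-reconcile.ts` necessarily emits:

```typescript
interface ChainEntry { tier: number; provider: string; model: string }
interface ProviderState { currentTier: number; paused: boolean }

// Pick the chain entry for the instance's current tier; fall back to tier 1
// when the state row is missing (defensive), refuse a paused instance.
function selectChainEntry(
  chain: ChainEntry[], state: ProviderState | undefined,
): ChainEntry {
  if (state?.paused) throw new Error("instance paused: all_tiers_exhausted");
  const tier = state?.currentTier ?? 1;
  return chain.find((e) => e.tier === tier) ?? chain[0];
}

// Illustrative env lines -- variable names and URLs per provider are assumed.
function providerEnvLines(entry: ChainEntry): string[] {
  switch (entry.provider) {
    case "anthropic":
      return [`ANTHROPIC_MODEL=${entry.model}`];
    case "deepseek":
      return [`ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic`,
              `ANTHROPIC_MODEL=${entry.model}`];
    case "ollama":
      return [`ANTHROPIC_BASE_URL=http://host.docker.internal:11434`,
              `ANTHROPIC_MODEL=${entry.model}`];
    default:
      throw new Error(`unknown provider ${entry.provider}`);
  }
}
```

Throwing on `paused` matches the test expectation that the caller must skip dispatch rather than silently run tier 1.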

Tests

  • Loader: 3-tier chain loads; >3 rejected; duplicate tier rejected; legacy single-provider migration wraps cleanly; host-mode + tier 1 != anthropic rejected.
  • State machine: simulate 401 → tier bump; success after cooldown → reset to 1; success inside cooldown → tier unchanged; all-tiers-exhausted with pause_if_all_fail: true → paused; with false → loops back to tier 1 (operator's choice).
  • buildProviderEnvLines: tier-2 active → returns deepseek lines; paused instance → throws (caller must skip dispatch).
  • Reconcile: tier flip + post-task hook → recreate fires once; subsequent task with same tier → unchanged.

Out of scope

  • Wizard UI — covered by M26-2.
  • Tier badge in agents list + provider history — covered by M26-3.
  • usage_threshold_tokens real enforcement — covered by M26-4.
  • Per-tier cooldown granularity (decided against in design review).
  • Killing in-flight task on tier flip (decided against — let-finish).

References

  • PR #547 (server-side single-provider wiring): https://forge.jacquin.app/charles/claude-hooks/pulls/547
  • docs/providers.md (single-provider operator setup) — needs sibling docs/provider-chain.md doc as part of this story.
  • Existing host-mode gate: apps/server/src/shared/config/webhook-config.ts line ~1222.
  • Existing provider env injection: apps/server/src/infrastructure/container/container-reconcile.ts::buildProviderEnvLines.
Author
Collaborator

Shipped in PR #552 (feat(agents): provider chain + per-instance failover state machine (M26-1)). All ACs met:

  • Schema + loader: provider_chain (1–3 tiers) + failover block on agentTypeSchema. Legacy single-provider auto-wraps. Host-mode + non-anthropic tier 1 rejected. token_budget stripped on single-tier chains.
  • State tables: agent_provider_state + agent_provider_events ledger added to ensureSchema. createAgent upserts paired state row, deleteAgent cascades + scrubs in-memory 5xx window.
  • Classifier (provider-failover.ts): api_retry status codes (401/403 → auth_error, 402 → quota_error, 429 + retryExhausted → rate_limit, ≥3 5xx in 5min → persistent_5xx) + error-string fallback.
  • State machine (applyOutcome): bump on match, pause/wraparound when chain exhausted, reset on success after cooldown.
  • Reconcile: buildProviderEnvLines reads tier from state, picks chain entry. Falls back to legacy fixture path defensively.
  • Runner hookup: post-task hook in agent-runner.ts:981, best-effort (DB error doesn't fail task).

Tests: 23 classifier/state-machine + 12 loader + delete-cascade. M26-5/6/7 follow-up shipped in same PR (token_budget enforcement + wizard input + ⛽ icon).
