feat(agents): token economy — caveman mode, prompt caching, model tiering, cost cap as last resort #231

Closed
opened 2026-04-21 12:48:55 +00:00 by claude-desktop · 0 comments
Collaborator

User story

As an operator paying for the agent fleet, I want the system to spend less tokens by default via proven community techniques, with a hard cost cap only as a final safety net, so that routine work (chore tickets, simple reviews) doesn't burn opus-class budget and a runaway loop gets killed before it hurts.

Context: #224 (mid-flight steering impl) cost $15.86 on boss-2 alone. That's normal for an architecture ticket on opus-4.7[1m], but no infrastructure exists today to keep less-ambitious work out of the opus lane, cache expensive system prompts, or abort a genuinely runaway task.

Acceptance criteria

Phase 1 — Investigation + spec

  • Survey community token-economy techniques. At minimum address:
    • Caveman mode — terse system-prompt appendix (Respond in minimal shorthand. No commentary, no emojis, no recap.) for type:chore tickets and routine reviews. Opt-in per instance via prompt_appendix or a type-level caveman: true flag.
    • Prompt caching (Anthropic cache_control: ephemeral) on the long static chunks of the system prompt + on large read files. The SDK supports it; we probably don't use it yet.
    • Model tiering — route type:chore tickets to haiku-4.5 by default (one of our pool already runs it for reviewer-fast). Escalate to sonnet/opus only when the skill template or operator override requests it.
    • Tool result trimming — cap Bash stdout at N lines (configurable), send a <truncated> marker to the SDK instead of the full 100k-line output. Agents re-run with | head / | tail when they need more.
    • Context compaction — lean on the SDK's auto-compact more aggressively; tune the threshold down for type:chore sessions.
    • Read-file dedup — the SDK already caches re-reads in the same turn, but chained-resume sessions re-fetch. Cache by (repo, path, commit_sha) across turns.
  • Write the findings into docs/token-economy.md with a table: technique → expected savings → implementation cost → rollout risk.
  • Decide what to ship in phase 2 — don't try to land everything at once.

Phase 2 — Implementation

  • Ship the top 2–3 techniques from phase 1's rollout plan.
  • Each technique is opt-in per agent type via config/agents.json (new fields under types.<type>.token_economy).
  • Caveman mode ships as a prompt-appendix string in skills/ that any type can reference. Bundled default for type:chore dispatches.
  • Prompt caching is silent — wire it, measure it, expose the hit rate in the /usage endpoint.

Phase 3 — Cost cap (last resort)

  • New field types.<type>.max_cost_usd_per_task in agents.json. Defaults: boss: 20, dev: 5, reviewer: 3, designer: 10, design-reviewer: 3, foreman: 15.
  • agent-runner.ts checks accumulated cost at each turn boundary. At 50% of cap → emit SSE cost_warning envelope (UI toast). At 100% → currentAbort.abort() + task_history row marked cost_capped with reason.
  • The interrupted status from #221 is a sibling state; cost_capped is its own status.
  • Operator override: a /raise-cap <issue_num> <new_cap> slash command on the issue bumps the cap for the current task.

Verification

  • Measurable: after phase 2 ships, the /usage endpoint shows a ≥20% drop in average cost per type:chore task over the following week.
  • Unit tests for the cost-cap path (mock SDK, assert abort fires at threshold).
  • Manual smoke: dispatch a type:chore with caveman mode on, confirm the assistant text is terse, the turn count drops.

Out of scope

  • Switching to a different provider / model family — we stay on Anthropic.
  • Rewriting skill prompts for verbosity — that's implicit in caveman mode; major skill rewrites are separate.
  • Per-operator daily budget alerts — file a follow-up if the basic cap isn't enough.

References

  • apps/server/src/agent-runner.ts — where the SDK query is built + where cost accumulates.
  • apps/server/src/task-store.ts — task-level cost persistence.
  • packages/shared/src/task.tsTaskStatus (add cost_capped).
  • skills/*.md — existing skill templates, where caveman appendix would plug in.
  • Anthropic prompt caching docs — cache_control: ephemeral on system/tools/messages.
## User story As an operator paying for the agent fleet, I want the system to **spend less tokens by default** via proven community techniques, with a hard cost cap only as a final safety net, so that routine work (chore tickets, simple reviews) doesn't burn opus-class budget and a runaway loop gets killed before it hurts. Context: #224 (mid-flight steering impl) cost **$15.86 on boss-2 alone**. That's normal for an architecture ticket on opus-4.7[1m], but no infrastructure exists today to keep less-ambitious work out of the opus lane, cache expensive system prompts, or abort a genuinely runaway task. ## Acceptance criteria ### Phase 1 — Investigation + spec - [ ] Survey community token-economy techniques. At minimum address: - **Caveman mode** — terse system-prompt appendix (`Respond in minimal shorthand. No commentary, no emojis, no recap.`) for `type:chore` tickets and routine reviews. Opt-in per instance via `prompt_appendix` or a type-level `caveman: true` flag. - **Prompt caching** (Anthropic `cache_control: ephemeral`) on the long static chunks of the system prompt + on large read files. The SDK supports it; we probably don't use it yet. - **Model tiering** — route `type:chore` tickets to haiku-4.5 by default (one of our pool already runs it for `reviewer-fast`). Escalate to sonnet/opus only when the skill template or operator override requests it. - **Tool result trimming** — cap Bash stdout at N lines (configurable), send a `<truncated>` marker to the SDK instead of the full 100k-line output. Agents re-run with `| head` / `| tail` when they need more. - **Context compaction** — lean on the SDK's auto-compact more aggressively; tune the threshold down for `type:chore` sessions. - **Read-file dedup** — the SDK already caches re-reads in the same turn, but chained-resume sessions re-fetch. Cache by `(repo, path, commit_sha)` across turns. - [ ] Write the findings into `docs/token-economy.md` with a table: technique → expected savings → implementation cost → rollout risk. - [ ] Decide what to ship in phase 2 — don't try to land everything at once. ### Phase 2 — Implementation - [ ] Ship the top 2–3 techniques from phase 1's rollout plan. - [ ] Each technique is opt-in per agent type via `config/agents.json` (new fields under `types.<type>.token_economy`). - [ ] Caveman mode ships as a prompt-appendix string in `skills/` that any type can reference. Bundled default for `type:chore` dispatches. - [ ] Prompt caching is silent — wire it, measure it, expose the hit rate in the `/usage` endpoint. ### Phase 3 — Cost cap (last resort) - [ ] New field `types.<type>.max_cost_usd_per_task` in `agents.json`. Defaults: `boss: 20`, `dev: 5`, `reviewer: 3`, `designer: 10`, `design-reviewer: 3`, `foreman: 15`. - [ ] `agent-runner.ts` checks accumulated cost at each turn boundary. At 50% of cap → emit SSE `cost_warning` envelope (UI toast). At 100% → `currentAbort.abort()` + task_history row marked `cost_capped` with reason. - [ ] The `interrupted` status from #221 is a sibling state; `cost_capped` is its own status. - [ ] Operator override: a `/raise-cap <issue_num> <new_cap>` slash command on the issue bumps the cap for the current task. ### Verification - [ ] Measurable: after phase 2 ships, the `/usage` endpoint shows a ≥20% drop in average cost per `type:chore` task over the following week. - [ ] Unit tests for the cost-cap path (mock SDK, assert abort fires at threshold). - [ ] Manual smoke: dispatch a `type:chore` with caveman mode on, confirm the assistant text is terse, the turn count drops. ## Out of scope - Switching to a different provider / model family — we stay on Anthropic. - Rewriting skill prompts for verbosity — that's implicit in caveman mode; major skill rewrites are separate. - Per-operator daily budget alerts — file a follow-up if the basic cap isn't enough. ## References - `apps/server/src/agent-runner.ts` — where the SDK query is built + where cost accumulates. - `apps/server/src/task-store.ts` — task-level cost persistence. - `packages/shared/src/task.ts` — `TaskStatus` (add `cost_capped`). - `skills/*.md` — existing skill templates, where caveman appendix would plug in. - Anthropic prompt caching docs — `cache_control: ephemeral` on system/tools/messages.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
charles/claude-hooks#231
No description provided.