fix(janitor): pass failing CI logs to fix-ci dispatch + cap retries at 3 #1045

Merged
reviewer merged 1 commit from fix/janitor-fix-ci-logs-and-cap into main 2026-05-10 13:56:51 +00:00
Collaborator

Incident

PRs #1042 and #1044 sat in an endless loop today: CI red → janitor re-dispatched fix-ci → agent guessed wrong → CI red again → repeat. Two structural causes; both fixed here.

Root cause 1 — fix-ci dispatched without failing logs

stale_fix_ci_redispatch rendered the fix-ci skill template with only vars: { repo, pr_number, title }. The template uses {{failing_context}} and {{failing_logs}} placeholders that the agent reads as ground truth. With both undefined, interpolate() left them as literal {{...}} strings — the agent saw a "fix CI on PR #N" prompt with no actual error, guessed at the cause, and pushed unrelated changes.
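
For context, a minimal sketch of the substitution behaviour described above (assuming interpolate() is a simple regex replace, which is what lets unknown placeholders survive verbatim — not the actual implementation):

```ts
// Hedged sketch: a naive interpolate() substitutes only the vars it was
// given; any placeholder it can't resolve survives as literal text.
function interpolate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match, name) =>
    name in vars ? vars[name] : match // undefined var → literal "{{name}}"
  );
}

interpolate("Fix CI on PR #{{pr_number}}.\n{{failing_logs}}", {
  pr_number: "1042",
});
// => "Fix CI on PR #1042.\n{{failing_logs}}" — the agent reads the raw
//    placeholder where the error output should have been.
```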

The webhook-driven path (post-ci.ts:dispatchFixCi) already does this correctly: it lists workflow runs for the head SHA, finds the failing one, fetches a 4 KB log tail via fetchFailingRunLogs, and substitutes both vars. The janitor just didn't.

Fix: mirror the same flow in ruleStaleFixCiRedispatch. Two new seams (listWorkflowRuns, fetchFailingRunLogs) reuse existing code — no new abstractions, no new templates.
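
Roughly, the mirrored flow looks like this (a sketch only — the seam signatures and run shape are assumptions; the names come from this PR):

```ts
// Inside ruleStaleFixCiRedispatch, after resolving the PR's head SHA.
const runs = await listWorkflowRuns(repo, headSha);            // new seam
const failing = runs.find((r) => r.conclusion === "failure");

const vars = {
  repo,
  pr_number: String(prNumber),
  title,
  // Fallback string matches the "(could not fetch logs …)" case
  // exercised in the tests below.
  failing_context: failing?.name ?? "(could not fetch logs …)",
  failing_logs: failing
    ? await fetchFailingRunLogs(repo, failing.id)              // 4 KB tail, new seam
    : "(could not fetch logs …)",
};
const prompt = interpolate(fixCiTemplate, vars);
```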

Root cause 2 — no hard cap on retries

canAct/markActed enforce a 6 h pacing window per (rule, repo, target). Once the window elapses, the rule fires again. In a degenerate case (genuinely hard-to-fix CI failure, agent prompt-trapped on the wrong file), the rule re-dispatches every 6 h forever, burning quota.

Fix: extend the in-memory map to track count alongside lastActedAt. Add attemptCount(rule, repo, target) that returns 0 once the pacing window elapses (matches canAct semantics — count and pacing reset together; a sketch follows the list below). After STALE_FIX_CI_MAX_ATTEMPTS = 3 dispatches in one window without CI recovering, the rule:

  1. Posts a single bounce comment on the PR (rate-limited via a sibling _capped idempotency key).
  2. Skips dispatch.
  3. Operator unsticks the PR manually instead of the agent burning quota.
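
A minimal sketch of the extended map and the new helper (the exact shapes are illustrative; only the names come from this PR):

```ts
type ActEntry = { lastActedAt: number; count: number };
const acted = new Map<string, ActEntry>(); // key: `${rule}:${repo}:${target}`

const PACING_WINDOW_MS = 6 * 60 * 60 * 1000;
const STALE_FIX_CI_MAX_ATTEMPTS = 3;

function attemptCount(rule: string, repo: string, target: string): number {
  const entry = acted.get(`${rule}:${repo}:${target}`);
  if (!entry) return 0;
  // Count and pacing reset together: once the 6 h window has elapsed,
  // canAct treats the key as fresh, so the attempt count reads 0 too.
  return Date.now() - entry.lastActedAt >= PACING_WINDOW_MS ? 0 : entry.count;
}
```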

The cap check runs before the standard pacing guard, so the bounce comment fires in the same window in which the cap was hit; otherwise the pacing guard would skip the rule and the operator would never see the notice.
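
In sketch form, the guard ordering inside the rule (postBounceComment is a hypothetical stand-in for the actual comment call; the other names come from this PR):

```ts
if (attemptCount(rule, repo, target) >= STALE_FIX_CI_MAX_ATTEMPTS) {
  // The sibling `_capped` key rate-limits the bounce comment itself,
  // so later janitor passes don't re-comment.
  if (canAct(`${rule}_capped`, repo, target)) {
    await postBounceComment(repo, prNumber); // hypothetical helper
    markActed(`${rule}_capped`, repo, target);
  }
  return; // cap hit: skip dispatch entirely
}
if (!canAct(rule, repo, target)) return; // standard 6 h pacing guard
// ... fetch logs, render the template, dispatch fix-ci, markActed(rule, ...)
```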

Why these don't bloat the janitor

  • No new rule.
  • No new abstraction. Both helpers (fetchFailingRunLogs, the markActed/canAct shape) already exist; they just got threaded into the existing ruleStaleFixCiRedispatch.
  • One rule grew by ~30 lines. Net janitor footprint is +1 helper (attemptCount) + 2 seams.

Tests

  • 4 new tests on stale_fix_ci_redispatch:
    • failing_logs + failing_context substituted when failing run exists
    • (could not fetch logs …) placeholder substituted when no failing run found
    • 3-attempt cap → posts single bounce comment, skips dispatch
    • cap is idempotent — subsequent passes don't re-comment
  • New test seam _seedAttemptCountForTest lets the cap boundary be tested without simulating multiple 6 h pacing windows (a sketch follows the list).
  • Full suite: 3449 / 3449 pass.
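
For illustration, the cap-boundary test might look roughly like this (the runner and driver helpers are assumptions; only _seedAttemptCountForTest is the real seam):

```ts
// Seed the counter at the cap instead of replaying three 6 h windows.
_seedAttemptCountForTest("stale_fix_ci_redispatch", repo, prKey, 3);

await runJanitorOnce();              // hypothetical single-pass driver
expect(dispatches).toHaveLength(0);  // dispatch skipped at the cap
expect(prComments).toHaveLength(1);  // exactly one bounce comment

await runJanitorOnce();
expect(prComments).toHaveLength(1);  // `_capped` key: no re-comment
```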

Out of scope

These are the next two follow-ups from the same incident audit:

  • SHA-scope approvedAnywhere in decidePostCiAction + dismiss-on-push (correctness for stale APPROVED reviews merging new code).
  • Server-side gate on pull_request_review.submitted to dismiss APPROVED reviews submitted on red CI (defense against the reviewer skill rule being violated by the agent).

Both touch post-ci.ts / event handlers, not the janitor — separate PRs to keep scope tight.

Refs: PR #1042 (cursor docs PR — APPROVED on red CI), PR #1044 (sidebar collapse — fix-ci spin).

fix(janitor): pass failing CI logs to fix-ci dispatch + cap retries at 3
a844a5214f
PR #1042 + #1044 incident — the janitor's `stale_fix_ci_redispatch`
rule kept both PRs spinning endlessly:

  1. CI red → janitor renders the `fix-ci` skill template with
     `vars: { repo, pr_number, title }` only. `failing_logs` and
     `failing_context` stay as literal `{{...}}` placeholders in the
     rendered prompt — the agent gets a "fix CI on PR #N" message with
     no actual error to read. It guesses, pushes a "fix" that doesn't
     address the real failing job, CI fails again.
  2. 6 h pacing window expires, janitor re-dispatches with the same
     blank prompt. Loop indefinitely.

Two changes:

**Logs in the prompt.** Mirror the webhook-driven path
(`post-ci.ts:fetchFailingRunLogs`): list workflow runs for the head SHA,
find the failing one, fetch its log tail, and substitute both
`failing_context` (workflow name) and `failing_logs` (≤ 4 KB tail) into
the skill template. Same helper, same template, no new code path.

**Hard cap at 3 attempts.** Extend the in-memory idempotency map from
`Map<key, timestamp>` to `Map<key, { lastActedAt, count }>`; expose
`attemptCount(rule, repo, target)` that respects the existing 6 h
pacing reset (count returns 0 once the window elapses, matching
`canAct` semantics). After 3 dispatches in one window without CI
recovering, the rule posts a single "bouncing to operator" comment
(rate-limited via a sibling `_capped` idempotency key so subsequent
janitor passes don't re-comment) and stops dispatching. Operator
unsticks the PR manually instead of the agent burning quota on
guesses.

The cap check runs *before* the standard pacing guard so the bounce
comment fires in the same window in which the cap was hit; otherwise
the pacing guard would skip the rule entirely and the operator would
never see the notice.

`_seedAttemptCountForTest` exposes a counter-seeding seam so the cap
boundary is testable without simulating multiple 6 h windows.

Refs: PR #1042, PR #1044.
reviewer approved these changes 2026-05-10 13:56:41 +00:00
reviewer left a comment

Logic correct, CI green. Nit: moving `canAct` after `getPullRequest` adds extra HTTP calls for paced PRs — acceptable given the cap-before-pacing ordering requirement.