fix(janitor): pass failing CI logs to fix-ci dispatch + cap retries at 3 #1045
Reference: charles/claude-hooks!1045
## Incident
PRs #1042 and #1044 sat in an endless loop today: CI red → janitor re-dispatched `fix-ci` → agent guessed wrong → CI red again → repeat. Two structural causes; both fixed here.

## Root cause 1 — `fix-ci` dispatched without failing logs

`stale_fix_ci_redispatch` rendered the `fix-ci` skill template with only `vars: { repo, pr_number, title }`. The template uses `{{failing_context}}` and `{{failing_logs}}` placeholders that the agent reads as ground truth. With both undefined, `interpolate()` left them as literal `{{failing_logs}}` strings — the agent saw a "fix CI on PR #N" prompt with no actual error, guessed at the cause, and pushed unrelated changes.

The webhook-driven path (`post-ci.ts:dispatchFixCi`) already does this correctly: it lists workflow runs for the head SHA, finds the failing one, fetches a 4 KB log tail via `fetchFailingRunLogs`, and substitutes both vars. The janitor just didn't.

Fix: mirror the same flow in
`ruleStaleFixCiRedispatch`. Two new seams (`listWorkflowRuns`, `fetchFailingRunLogs`) reuse existing code — no new abstractions, no new templates.

## Root cause 2 — no hard cap on retries
`canAct`/`markActed` enforce a 6 h pacing window per `(rule, repo, target)`. Once the window elapses, the rule fires again. In a degenerate case (a genuinely hard-to-fix CI failure, an agent prompt-trapped on the wrong file), the rule re-dispatches every 6 h forever, burning quota.

Fix: extend the in-memory map to track `count` alongside `lastActedAt`. Add `attemptCount(rule, repo, target)`, which returns 0 once the pacing window elapses (matching `canAct` semantics — count and pacing reset together). After `STALE_FIX_CI_MAX_ATTEMPTS = 3` dispatches in one window without CI recovering, the rule stops re-dispatching and posts a one-time bounce comment (deduplicated via a `_capped` idempotency key). The cap check runs before the standard pacing guard, so the bounce comment fires in the same window the cap was hit; otherwise the pacing guard would skip the rule and the operator would never see the notice.
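The capped pacing above can be sketched as follows. This is a minimal illustration, not the actual janitor code: the map shape, the `now` parameters, and `shouldDispatch` are hypothetical; only the names `canAct`/`markActed`/`attemptCount`, the 6 h window, and the 3-attempt cap come from this PR.

```typescript
const PACING_WINDOW_MS = 6 * 60 * 60 * 1000; // 6 h pacing window
const STALE_FIX_CI_MAX_ATTEMPTS = 3;

type ActionState = { lastActedAt: number; count: number };
const acted = new Map<string, ActionState>(); // keyed by (rule, repo, target)

const key = (rule: string, repo: string, target: string) =>
  `${rule}:${repo}:${target}`;

function canAct(rule: string, repo: string, target: string, now = Date.now()): boolean {
  const s = acted.get(key(rule, repo, target));
  return !s || now - s.lastActedAt >= PACING_WINDOW_MS;
}

function markActed(rule: string, repo: string, target: string, now = Date.now()): void {
  const k = key(rule, repo, target);
  const prev = acted.get(k);
  // Count and pacing reset together: an elapsed window starts a fresh streak.
  const withinWindow = prev !== undefined && now - prev.lastActedAt < PACING_WINDOW_MS;
  acted.set(k, { lastActedAt: now, count: withinWindow && prev ? prev.count + 1 : 1 });
}

function attemptCount(rule: string, repo: string, target: string, now = Date.now()): number {
  const s = acted.get(key(rule, repo, target));
  if (!s || now - s.lastActedAt >= PACING_WINDOW_MS) return 0; // matches canAct semantics
  return s.count;
}

function shouldDispatch(repo: string, target: string, now = Date.now()): "dispatch" | "capped" | "paced" {
  const rule = "stale_fix_ci_redispatch";
  // Cap check first, so the bounce comment can fire in the same window the cap was hit.
  if (attemptCount(rule, repo, target, now) >= STALE_FIX_CI_MAX_ATTEMPTS) return "capped";
  if (!canAct(rule, repo, target, now)) return "paced";
  return "dispatch";
}
```

If the pacing guard ran first, a capped-but-paced rule would be skipped silently, which is why the ordering matters.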
## Why these don't bloat the janitor
The underlying mechanisms (`fetchFailingRunLogs`, the `markActed`/`canAct` shape) already exist; they just got threaded into the existing `ruleStaleFixCiRedispatch`. Net new surface: one helper (`attemptCount`) + 2 seams.

## Tests
For `stale_fix_ci_redispatch`:

- `failing_logs` + `failing_context` substituted when a failing run exists
- `(could not fetch logs …)` placeholder substituted when no failing run is found
- `_seedAttemptCountForTest` lets the cap boundary be tested without simulating multiple 6 h pacing windows

## Out of scope
These are the next two follow-ups from the same incident audit:
- `approvedAnywhere` in `decidePostCiAction` + dismiss-on-push (correctness for stale APPROVED reviews merging new code)
- `pull_request_review.submitted` handling to dismiss APPROVED reviews submitted on red CI (defense against the reviewer skill rule being violated by the agent)

Both touch `post-ci.ts` / event handlers, not the janitor — separate PRs to keep scope tight.

Refs: PR #1042 (cursor docs PR — APPROVED on red CI), PR #1044 (sidebar collapse — fix-ci spin).
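The root-cause-1 flow (list workflow runs for the head SHA, fetch the failing run's log tail, substitute both template vars before dispatch) can be sketched as below. The seam signatures, the `WorkflowRun` shape, and this `interpolate` body are assumptions for illustration; only the seam names, the var names, and the `(could not fetch logs …)` fallback come from the PR text.

```typescript
// Hypothetical sketch of mirroring post-ci.ts:dispatchFixCi inside
// ruleStaleFixCiRedispatch. Shapes and signatures are assumed.
interface WorkflowRun {
  id: number;
  conclusion: string | null; // e.g. "success" | "failure"
}

interface Seams {
  listWorkflowRuns(repo: string, headSha: string): Promise<WorkflowRun[]>;
  fetchFailingRunLogs(repo: string, runId: number): Promise<string>; // ~4 KB tail
}

// Minimal {{var}} substitution: undefined vars stay as literal
// placeholders, which is exactly the failure mode fixed here.
function interpolate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (m: string, name: string) => vars[name] ?? m);
}

async function buildFixCiVars(
  seams: Seams,
  repo: string,
  prNumber: number,
  title: string,
  headSha: string,
): Promise<Record<string, string>> {
  const runs = await seams.listWorkflowRuns(repo, headSha);
  const failing = runs.find((r) => r.conclusion === "failure");
  const failing_logs = failing
    ? await seams.fetchFailingRunLogs(repo, failing.id)
    : "(could not fetch logs …)";
  return {
    repo,
    pr_number: String(prNumber),
    title,
    failing_context: failing
      ? `workflow run ${failing.id} concluded: failure`
      : "no failing run found",
    failing_logs,
  };
}
```

Because both placeholders are always populated (real logs or an explicit fallback), the agent never sees a literal `{{failing_logs}}` string again.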
Logic correct, CI green. Nit: moving `canAct` after `getPullRequest` adds extra HTTP calls for paced PRs — acceptable given the cap-before-pacing ordering requirement.