Janitor rule: redispatch stale fix-ci on PRs whose CI is still red #784
charles/claude-hooks#784
User story
As an operator, I want a PR with red CI to keep getting `fix-ci` dispatches until either the agent succeeds or I intervene, so a single failed `fix-ci` task (OOM, crash, transient SDK error) doesn't strand the PR for an hour or longer.

Background
When Forgejo emits `action_run_failure`, `handleCheckSuiteCompleted` (`apps/server/src/domain/workflow/event-handlers.ts:963`) routes to `dispatchFixCi`, which:
- Checks the `_fixCiDispatched` BoundedMap (keyed `repo#sha`, 1 h TTL).
- Dispatches `fix-ci` to the PR author and marks the SHA dispatched.

The mechanism is correct on the happy path. Two failure modes are uncovered:
- Task `5d25a13c-…`: devcontainer OOM-killed (exit 137) while running `fix-ci` for SHA `a7cbf84`. PR sat red until the operator intervened. The dedup map blocked re-dispatch for the full 1 h window even though the previous task ended in `failure`.
- […] `action_run_failure` event for that SHA is gone, and CI stays red with no agent in flight.

Proposed fix
Add janitor rule `stale_fix_ci_redispatch` modelled on `unmergeable_pr_rebase` (#781):
- […] `failure`.
- Call `listTasksForIssue(repo, pr.number)` for the most recent task. If a task is currently running for that issue (worker registry check) → skip. If the most recent finished task for that PR ended in `success` → skip (post-CI didn't fire yet, leave it to the event path). If the most recent finished task ended in `failure`/`interrupted`/`cancelled` AND CI is still red → redispatch.
- […] `_fixCiDispatched` semantics: the janitor's own `canAct` window (6 h) is the rate limiter, not the post-ci dedup map. The dedup map exists to suppress event-driven double-dispatch within a single CI run, but it shouldn't block recovery after a task crash.
- […] `dispatchFixCi`. Need a way to bypass `alreadyDispatchedFixCi`: extend the function with a `forceRedispatch` opt (or call the inner persistence path directly).

Acceptance criteria
Janitor rule
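The skip/redispatch decision for this rule can be sketched as a pure function. This is only an illustration: `PrSnapshot`, `decideStaleFixCi`, and the field names are hypothetical, the "running task" flag stands in for the worker registry check, and treating a PR with no finished task as "skip" is an assumption the issue does not settle.

```typescript
// Hypothetical shapes; the real CI-state and task types in the repo may differ.
type CiState = "success" | "pending" | "failure" | null;
type FinishedStatus = "success" | "failure" | "interrupted" | "cancelled";

interface PrSnapshot {
  ci: CiState;                               // null means no workflows
  hasRunningTask: boolean;                   // stand-in for the worker registry check
  lastFinishedStatus: FinishedStatus | null; // null means no finished task yet
}

type Decision = "skip" | "redispatch";

function decideStaleFixCi(pr: PrSnapshot): Decision {
  // Only PRs whose CI is still red are candidates.
  if (pr.ci !== "failure") return "skip";
  // An agent is already working on this PR: let it finish.
  if (pr.hasRunningTask) return "skip";
  // Assumption: with no finished task on record, leave the PR to the event path.
  if (pr.lastFinishedStatus === null) return "skip";
  // Last task succeeded: the event path will pick up the next CI run.
  if (pr.lastFinishedStatus === "success") return "skip";
  // Last task ended in failure / interrupted / cancelled and CI is red.
  return "redispatch";
}
```

Keeping the decision free of I/O means each acceptance criterion maps to a single call with a literal `PrSnapshot`, which keeps the rule's tests small.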
- `stale_fix_ci_redispatch` registered in `_ALL_RULES` and `reconcileOnce`
- […] `success`, `pending`, or `null` (no workflows)
- […] `success` (event path will pick up next CI run)
- […] `failure` AND last finished task is `failure`/`interrupted`/`cancelled`

Dispatch path
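A minimal sketch of a TTL-keyed dedup check with a force bypass, to illustrate the behaviour the dispatch path needs. The `FixCiDedup` class and `shouldDispatchFixCi` helper are hypothetical stand-ins; the real `BoundedMap` API and the `dispatchFixCi` signature are not shown in this issue. Only the `repo#sha` key and the 1 h TTL come from the text above.

```typescript
// 1 h TTL, per the issue text; the constant name mirrors FIX_CI_DEDUP_MS.
const FIX_CI_DEDUP_MS = 60 * 60 * 1000;

// Hypothetical stand-in for the _fixCiDispatched map.
class FixCiDedup {
  private readonly expiries = new Map<string, number>(); // key -> expiry (epoch ms)

  alreadyDispatched(key: string, now: number): boolean {
    const expiry = this.expiries.get(key);
    return expiry !== undefined && expiry > now;
  }

  mark(key: string, now: number): void {
    this.expiries.set(key, now + FIX_CI_DEDUP_MS);
  }
}

interface DispatchOpts {
  force?: boolean; // janitor sets this to bypass the dedup window
}

// Returns true when a fix-ci dispatch should go ahead, and records it.
function shouldDispatchFixCi(
  dedup: FixCiDedup,
  repo: string,
  sha: string,
  now: number,
  opts: DispatchOpts = {},
): boolean {
  const key = `${repo}#${sha}`;
  if (!opts.force && dedup.alreadyDispatched(key, now)) return false;
  dedup.mark(key, now);
  return true;
}
```

In this sketch the event path calls the helper without options and stays deduplicated, while the janitor passes `{ force: true }` and relies on its own rate limiter instead.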
- `dispatchFixCi` accepts a `force?: boolean` opt (or a sibling helper is exposed) so the janitor can bypass the 1 h dedup map
- […] `canAct` 6 h window is the rate limiter for the rule

Tests
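For tests that exercise the rate limit across janitor passes, it can help to model the window with an injected clock. The `ActionWindow` class below is a hypothetical model, not the janitor's actual `canAct` implementation, whose signature is not shown in this issue; only the 6 h figure comes from the text.

```typescript
// 6 h window, per the issue text. The constant name is an assumption.
const STALE_FIX_CI_WINDOW_MS = 6 * 60 * 60 * 1000;

// Hypothetical per-key rate limiter: at most one action per window.
class ActionWindow {
  private readonly windowMs: number;
  private readonly lastActed = new Map<string, number>(); // key -> epoch ms

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  canAct(key: string, now: number): boolean {
    const last = this.lastActed.get(key);
    return last === undefined || now - last >= this.windowMs;
  }

  recordAct(key: string, now: number): void {
    this.lastActed.set(key, now);
  }
}
```

Passing `now` explicitly lets a test simulate several reconcile passes by advancing a number, with no timers or sleeps.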
- […] `success`
- […] `canAct` window across passes

Out of scope
- […] `force_merge` round-cap covers reviewer rounds, not CI rounds).
- […] `unmergeable_pr_rebase`) is the parallel fix for the conflict no-event gap.

References
- `event-handlers.ts:963` (`handleCheckSuiteCompleted`).
- `post-ci.ts:261` (`dispatchFixCi`), `post-ci.ts:45` (`FIX_CI_DEDUP_MS`).
- `janitor.ts` rule `unmergeable_pr_rebase` (PR #783, issue #781).
- Task `5d25a13c-4f78-4531-a23e-028f055202eb` exited 137 on SHA `a7cbf84`.