Janitor rule: redispatch stale fix-ci on PRs whose CI is still red #784

Closed
opened 2026-05-02 22:49:39 +00:00 by claude-desktop · 0 comments
Collaborator

User story

As an operator, I want a PR with red CI to keep getting fix-ci dispatches until either the agent succeeds or I intervene, so a single failed fix-ci task (OOM, crash, transient SDK error) doesn't strand the PR for an hour or longer.

Background

When Forgejo emits action_run_failure, handleCheckSuiteCompleted (apps/server/src/domain/workflow/event-handlers.ts:963) routes to dispatchFixCi, which:

  1. Checks the _fixCiDispatched BoundedMap (keyed repo#sha, 1 h TTL).
  2. If clean, dispatches fix-ci to the PR author and marks the SHA dispatched.

The mechanism is correct on the happy path. Two failure modes are uncovered:

  • Agent crashes mid-task. PR 780 / task 5d25a13c-…: dev container OOM-killed (exit 137) while running fix-ci for SHA a7cbf84. PR sat red until the operator intervened. The dedup map blocked re-dispatch for the full 1 h window even though the previous task ended in failure.
  • Webhook missed / out-of-order. Forgejo doesn't redeliver after a service restart or transient HTTP error — the only action_run_failure event for that SHA is gone, and CI stays red with no agent in flight.

Proposed fix

Add janitor rule stale_fix_ci_redispatch modelled on unmergeable_pr_rebase (#781):

  1. Per repo, list open PRs.
  2. Get aggregate CI status for each PR's head SHA. Skip unless failure.
  3. Walk listTasksForIssue(repo, pr.number) for the most recent task. If a task is currently running for that issue (worker registry check) → skip. If the most recent finished task for that PR ended in success → skip (post-CI didn't fire yet, leave it to the event path). If the most recent finished task ended in failure / interrupted / cancelled AND CI is still red → redispatch.
  4. Override _fixCiDispatched semantics: the janitor's own canAct window (6 h) is the rate limiter, not the post-ci dedup map — the dedup map exists to suppress event-driven double-dispatch within a single CI run, but it shouldn't block recovery after a task crash.
  5. Reuse dispatchFixCi. Need a way to bypass alreadyDispatchedFixCi — extend the function with a forceRedispatch opt (or call the inner persistence path directly).

Acceptance criteria

Janitor rule

  • New rule stale_fix_ci_redispatch registered in _ALL_RULES and reconcileOnce
  • Lists open PRs, fetches aggregate CI status per head SHA
  • Skips when CI is success, pending, or null (no workflows)
  • Skips when worker registry has a running task for that issue
  • Skips when last finished task for the issue is success (event path will pick up next CI run)
  • Redispatches when CI is failure AND last finished task is failure/interrupted/cancelled

Dispatch path

  • dispatchFixCi accepts a force?: boolean opt (or expose a sibling helper) so the janitor can bypass the 1 h dedup map
  • Janitor's own canAct 6 h window is the rate limiter for the rule

Tests

  • Unit: rule dispatches when CI=failure + last task=failure + no in-flight task
  • Unit: rule skips when CI=success
  • Unit: rule skips when there's an in-flight task for the issue
  • Unit: rule skips when last finished task is success
  • Unit: rule respects its own canAct window across passes

Out of scope

  • Retry-with-extended-memory for OOM specifically (operator-level container config, separate ticket if it becomes a pattern).
  • Cap on total fix-ci attempts per PR (could be a follow-up — current force_merge round-cap covers reviewer rounds, not CI rounds).
  • Ticket #781 (unmergeable_pr_rebase) is the parallel fix for the conflict no-event gap.

References

  • Hook: event-handlers.ts:963 (handleCheckSuiteCompleted).
  • Dispatch: post-ci.ts:261 (dispatchFixCi), post-ci.ts:45 (FIX_CI_DEDUP_MS).
  • Janitor pattern: janitor.ts rule unmergeable_pr_rebase (PR #783, issue #781).
  • Reproduction: PR #780, task 5d25a13c-4f78-4531-a23e-028f055202eb exited 137 on SHA a7cbf84.
## User story As an operator, I want a PR with red CI to keep getting `fix-ci` dispatches until either the agent succeeds or I intervene, so a single failed `fix-ci` task (OOM, crash, transient SDK error) doesn't strand the PR for an hour or longer. ## Background When Forgejo emits `action_run_failure`, `handleCheckSuiteCompleted` (`apps/server/src/domain/workflow/event-handlers.ts:963`) routes to `dispatchFixCi`, which: 1. Checks the `_fixCiDispatched` BoundedMap (keyed `repo#sha`, 1 h TTL). 2. If clean, dispatches `fix-ci` to the PR author and marks the SHA dispatched. The mechanism is correct on the happy path. Two failure modes are uncovered: - **Agent crashes mid-task.** PR 780 / task `5d25a13c-…`: `dev` container OOM-killed (exit 137) while running `fix-ci` for SHA `a7cbf84`. PR sat red until the operator intervened. The dedup map blocked re-dispatch for the full 1 h window even though the previous task ended in `failure`. - **Webhook missed / out-of-order.** Forgejo doesn't redeliver after a service restart or transient HTTP error — the only `action_run_failure` event for that SHA is gone, and CI stays red with no agent in flight. ## Proposed fix Add janitor rule `stale_fix_ci_redispatch` modelled on `unmergeable_pr_rebase` (#781): 1. Per repo, list open PRs. 2. Get aggregate CI status for each PR's head SHA. Skip unless `failure`. 3. Walk `listTasksForIssue(repo, pr.number)` for the most recent task. If a task is currently running for that issue (worker registry check) → skip. If the most recent finished task for that PR ended in `success` → skip (post-CI didn't fire yet, leave it to the event path). If the most recent finished task ended in `failure` / `interrupted` / `cancelled` AND CI is still red → redispatch. 4. Override `_fixCiDispatched` semantics: the janitor's own `canAct` window (6 h) is the rate limiter, not the post-ci dedup map — the dedup map exists to suppress event-driven double-dispatch within a single CI run, but it shouldn't block recovery after a task crash. 5. Reuse `dispatchFixCi`. Need a way to bypass `alreadyDispatchedFixCi` — extend the function with a `forceRedispatch` opt (or call the inner persistence path directly). ## Acceptance criteria ### Janitor rule - [ ] New rule `stale_fix_ci_redispatch` registered in `_ALL_RULES` and `reconcileOnce` - [ ] Lists open PRs, fetches aggregate CI status per head SHA - [ ] Skips when CI is `success`, `pending`, or `null` (no workflows) - [ ] Skips when worker registry has a running task for that issue - [ ] Skips when last finished task for the issue is `success` (event path will pick up next CI run) - [ ] Redispatches when CI is `failure` AND last finished task is `failure`/`interrupted`/`cancelled` ### Dispatch path - [ ] `dispatchFixCi` accepts a `force?: boolean` opt (or expose a sibling helper) so the janitor can bypass the 1 h dedup map - [ ] Janitor's own `canAct` 6 h window is the rate limiter for the rule ### Tests - [ ] Unit: rule dispatches when CI=failure + last task=failure + no in-flight task - [ ] Unit: rule skips when CI=success - [ ] Unit: rule skips when there's an in-flight task for the issue - [ ] Unit: rule skips when last finished task is `success` - [ ] Unit: rule respects its own `canAct` window across passes ## Out of scope - Retry-with-extended-memory for OOM specifically (operator-level container config, separate ticket if it becomes a pattern). - Cap on total fix-ci attempts per PR (could be a follow-up — current `force_merge` round-cap covers reviewer rounds, not CI rounds). - Ticket #781 (`unmergeable_pr_rebase`) is the parallel fix for the conflict no-event gap. ## References - Hook: `event-handlers.ts:963` (`handleCheckSuiteCompleted`). - Dispatch: `post-ci.ts:261` (`dispatchFixCi`), `post-ci.ts:45` (`FIX_CI_DEDUP_MS`). - Janitor pattern: `janitor.ts` rule `unmergeable_pr_rebase` (PR #783, issue #781). - Reproduction: PR #780, task `5d25a13c-4f78-4531-a23e-028f055202eb` exited 137 on SHA `a7cbf84`.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
charles/claude-hooks#784
No description provided.