feat(janitor): stale_fix_ci_redispatch rule — recover from crashed fix-ci tasks #785

Merged
charles merged 3 commits from code-lead/784 into main 2026-05-03 09:20:36 +00:00

Closes #784. Stacks on #783.

Summary

When dispatchFixCi runs and the dispatched fix-ci task crashes (OOM exit 137, SDK timeout, container kill), the 1 h post-ci dedup map blocks re-dispatch for the rest of the window. The PR sits red until the operator intervenes. Reproduced on PR 780 / task 5d25a13c-…: dev container OOM-killed mid fix-ci, dedup blocked recovery.

Design

  • New janitor rule stale_fix_ci_redispatch registered in _ALL_RULES and reconcileOnce.
  • Walks open PRs per repo. For each: fetches aggregate CI status; skips unless failure. Skips if a task is currently running for the issue (worker registry check via cfg.isTaskRunning). Walks listTasksForIssue for the latest finished task. Skips if the latest task is success (event path will fire on the next CI run). Skips if there's no task history yet (first dispatch is the event-driven path's job). Otherwise re-dispatches fix-ci.
  • Force escape hatch. dispatchFixCi now accepts opts.force?: boolean. Default false preserves event-driven dedup. Janitor passes true to bypass the 1 h map — that map exists to suppress duplicate dispatches within a single CI run's status fan-out, not to block recovery after a task crash.
  • Rate limiter. The janitor's own canAct 6 h window still applies, so even with force=true the rule can't dispatch more than once per 6 h per PR.
  • Test seam. New impl.listTasksForIssue / impl.dispatchFixCi / impl.skillForEvent / impl.interpolate so tests can drive the rule without spinning up the full skill loader / DB.
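
The skip/re-dispatch ordering described in the Design list can be sketched as a pure decision function. This is a minimal illustration, not the code in janitor.ts: `PrSnapshot`, `Decision`, `decide`, and the reason strings are hypothetical names, and the position of the canAct check relative to the other checks is an assumption.

```typescript
// Hypothetical types for illustration; the real shapes in janitor.ts are not shown in this PR.
type CiStatus = "success" | "failure" | "pending";
type TaskStatus = "success" | "failure" | "interrupted" | "cancelled" | "running";

interface TaskRecord {
  status: TaskStatus;
}

interface PrSnapshot {
  ciStatus: CiStatus;   // aggregate CI status for the PR head
  taskRunning: boolean; // stands in for the cfg.isTaskRunning registry check
  tasks: TaskRecord[];  // stands in for listTasksForIssue, oldest first
  canAct: boolean;      // true when the janitor's 6 h window has elapsed for this PR
}

type Decision =
  | { action: "skip"; reason: string }
  | { action: "redispatch"; force: true };

function decide(pr: PrSnapshot): Decision {
  // Only PRs whose aggregate CI is red are candidates.
  if (pr.ciStatus !== "failure") return { action: "skip", reason: "ci_not_failing" };
  // A task is already working on the issue; do not pile on.
  if (pr.taskRunning) return { action: "skip", reason: "task_in_flight" };

  // The list is ordered oldest first, so the last finished entry is the latest.
  const finished = pr.tasks.filter((t) => t.status !== "running");
  const latest = finished[finished.length - 1];

  // No history: the first dispatch belongs to the event-driven path.
  if (!latest) return { action: "skip", reason: "no_task_history" };
  // Latest task succeeded: the event path fires on the next CI run.
  if (latest.status === "success") return { action: "skip", reason: "latest_task_succeeded" };
  // Assumed placement of the rate limiter check.
  if (!pr.canAct) return { action: "skip", reason: "rate_limited" };

  return { action: "redispatch", force: true };
}
```

Keeping the decision separate from the side effect would let each skip branch be unit-tested with a plain object, which is the same goal the impl seam serves in the actual rule.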

Test plan

  • Unit: redispatches with force=true when CI=failure + last task=failure + no in-flight
  • Unit: skips when CI=success
  • Unit: skips when CI=pending
  • Unit: skips when worker registry reports a running task
  • Unit: skips when latest finished task is success
  • Unit: skips when there's no task history (first dispatch is the event path's)
  • Unit: 6 h canAct window is the rate limiter — second pass within window is a no-op
  • Unit: dry-run flags without calling dispatchFixCi
  • 56/56 tests pass in janitor.test.ts
  • just qa clean (3 pre-existing session JSONL pruning failures on main are unrelated)
feat(janitor): unmergeable_pr_rebase rule (closes #781)
Some checks failed
qa / dockerfile (pull_request) Successful in 7s
qa / qa (pull_request) Failing after 22m39s
9cda43f575
Forgejo fires no event when a PR's mergeable flag flips after main moves
on under it (push outside a PR-merge, force-push, dependency landing
elsewhere). handlePostMergeRebase only covers the explicit merge case;
pr-changes-requested-graph only covers the review case. Stale unmergeable
PRs sat indefinitely.

Adds the eighth janitor rule. Each pass walks open PRs, picks
mergeable=false ones whose author is a configured agent and whose
declared parents are all merged, and dispatches a rebase via the shared
post-merge path — same dedup map keyed on repo#pr@sha so a janitor sweep
and a webhook-driven cascade can't double-dispatch within the 10-minute
window.
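
A shared dedup map keyed on repo#pr@sha could look roughly like the sketch below. Only the key format and the 10-minute window come from the commit message; `DedupMap`, `tryClaim`, and `rebaseKey` are hypothetical names, and the real map may be structured differently.

```typescript
// 10-minute window taken from the commit message.
const REBASE_DEDUP_WINDOW_MS = 10 * 60 * 1000;

// Hypothetical helper: one instance shared by the janitor sweep and the webhook path.
class DedupMap {
  private readonly seen = new Map<string, number>();
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Returns true and records the claim if the key was not claimed within the window.
  tryClaim(key: string, now: number): boolean {
    const last = this.seen.get(key);
    if (last !== undefined && now - last < this.windowMs) return false;
    this.seen.set(key, now);
    return true;
  }
}

// Key format from the commit message: repo#pr@sha.
function rebaseKey(repo: string, pr: number, sha: string): string {
  return `${repo}#${pr}@${sha}`;
}
```

Because the head SHA is part of the key, a new push to the PR gets a fresh key and is not suppressed by an earlier claim.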

The dispatch helper is extracted from handlePostMergeRebase into
dispatchPrRebase so both call sites use the same skill loader, watchdog
tagging, label resolution, and dedup. Exposed via the impl seam so tests
can swap it without process-global mock.module.
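
The impl seam pattern can be illustrated as follows. This is a sketch under assumptions: the `DispatchPrRebase` signature and `rebaseIfUnmergeable` are hypothetical, and the dispatcher is synchronous here only for brevity.

```typescript
// Hypothetical signature; the real dispatchPrRebase arguments are not shown in this PR.
type DispatchPrRebase = (repo: string, pr: number) => boolean;

// Mutable seam object: production code calls through it, tests overwrite the field.
const impl: { dispatchPrRebase: DispatchPrRebase } = {
  dispatchPrRebase: () => {
    throw new Error("real dispatcher is not wired up in this sketch");
  },
};

// Hypothetical rule body: dispatches a rebase only for unmergeable PRs.
function rebaseIfUnmergeable(repo: string, pr: number, mergeable: boolean): boolean {
  if (mergeable) return false;
  return impl.dispatchPrRebase(repo, pr);
}
```

A test would assign a stub to `impl.dispatchPrRebase`, call the rule, and restore the original afterwards, so no module-level mocking is involved.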

Closes #781
feat(janitor): stale_fix_ci_redispatch rule (closes #784)
All checks were successful
qa / dockerfile (pull_request) Successful in 8s
qa / qa (pull_request) Successful in 2m47s
d09a6dc186
Stacks on #781 / PR #783.

When dispatchFixCi sends a fix-ci task and the task crashes (OOM exit
137, SDK timeout, container kill), the post-ci dedup map blocks
re-dispatch for 1 h on the same head SHA. PR sits red until the operator
intervenes. Reproduced on PR 780 / task 5d25a13c — dev container OOM
killed mid-task.

Adds the ninth janitor rule. Each pass walks open PRs whose aggregate CI
is `failure`, checks the worker registry to skip in-flight tasks, walks
listTasksForIssue to find the latest finished task, and re-dispatches
fix-ci with `force=true` when that task ended in failure / interrupted /
cancelled. The 6 h `canAct` window is the rate limiter.

Extends dispatchFixCi with an opts.force escape hatch so the janitor can
bypass the 1 h dedup map without rewriting the call site. Event-driven
callers keep the dedup behaviour (default opts.force=false).
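
The force escape hatch can be sketched like this. It is an illustration only: `dispatchFixCiSketch`, its parameters, and the key string are hypothetical, and whether a forced dispatch refreshes the dedup timestamp is an assumption.

```typescript
// 1 h window taken from the commit message.
const FIX_CI_DEDUP_WINDOW_MS = 60 * 60 * 1000;

// Hypothetical dedup store, keyed per PR head SHA.
const lastDispatch = new Map<string, number>();

interface DispatchOpts {
  force?: boolean; // default false preserves event-driven dedup
  now?: number;    // injectable clock, for tests only
}

// Returns true when the task was sent, false when dedup suppressed it.
function dispatchFixCiSketch(key: string, send: () => void, opts: DispatchOpts = {}): boolean {
  const now = opts.now ?? Date.now();
  const last = lastDispatch.get(key);
  const withinWindow = last !== undefined && now - last < FIX_CI_DEDUP_WINDOW_MS;

  // Event-driven callers stop here; the janitor passes force to continue.
  if (withinWindow && !opts.force) return false;

  send();
  lastDispatch.set(key, now); // assumption: a forced dispatch also refreshes the window
  return true;
}
```

With this shape, event-driven callers need no changes, and only the janitor call site opts in with `{ force: true }`.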

Closes #784
Merge branch 'main' into code-lead/784
All checks were successful
qa / dockerfile (pull_request) Successful in 5s
qa / qa (pull_request) Successful in 2m56s
74a442c5da
# Conflicts:
#	apps/server/src/background/janitor.test.ts
#	apps/server/src/background/janitor.ts
#	packages/shared/src/janitor.ts
reviewer approved these changes 2026-05-02 23:10:38 +00:00
reviewer left a comment

CI green. Logic correct: tasks[tasks.length-1] matches ORDER BY started_at ASC (newest-last). force=true bypass is minimal and scoped — dedup still guards event-driven callers. All 7 AC items satisfied, all 8 test cases present.

reviewer approved these changes 2026-05-02 23:11:36 +00:00
reviewer left a comment

CI green. tasks[tasks.length-1] matches ORDER BY started_at ASC. force=true bypass is minimal. All AC satisfied, 8 tests present.

charles deleted branch code-lead/784 2026-05-03 09:20:37 +00:00
Reference
charles/claude-hooks!785