charles/claude-hooks

Janitor rule: redispatch stale fix-ci on PRs whose CI is still red #784

Closed

opened 2026-05-02 22:49:39 +00:00 by claude-desktop · 0 comments

claude-desktop commented

2026-05-02 22:49:39 +00:00

Collaborator

User story

As an operator, I want a PR with red CI to keep getting fix-ci dispatches until either the agent succeeds or I intervene, so a single failed fix-ci task (OOM, crash, transient SDK error) doesn't strand the PR for an hour or longer.

Background

When Forgejo emits action_run_failure, handleCheckSuiteCompleted (apps/server/src/domain/workflow/event-handlers.ts:963) routes to dispatchFixCi, which:

1. Checks the _fixCiDispatched BoundedMap (keyed repo#sha, 1 h TTL).
2. If clean, dispatches fix-ci to the PR author and marks the SHA dispatched.
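
As a rough illustration of the two steps above, here is a minimal sketch of a TTL-based dedup check. `TtlDedupSketch` and `shouldDispatchFixCi` are hypothetical names; the real `BoundedMap` class and the body of `dispatchFixCi` are not shown in this issue, so treat this as an assumption about their behaviour rather than the actual implementation.

```typescript
// Hypothetical stand-in for the `_fixCiDispatched` BoundedMap; the real class
// and its API are not shown in this issue.
const FIX_CI_DEDUP_MS = 60 * 60 * 1000; // 1 h TTL, as stated above

class TtlDedupSketch {
  // key -> timestamp (ms) at which the entry expires
  private readonly expiries = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // True if `key` was marked and its TTL has not elapsed yet.
  has(key: string, now: number): boolean {
    const expiry = this.expiries.get(key);
    if (expiry === undefined) return false;
    if (now >= expiry) {
      this.expiries.delete(key); // lazily evict the expired entry
      return false;
    }
    return true;
  }

  mark(key: string, now: number): void {
    this.expiries.set(key, now + this.ttlMs);
  }
}

// Returns true when a fix-ci dispatch should go out, and marks the SHA.
function shouldDispatchFixCi(
  dedup: TtlDedupSketch,
  repo: string,
  sha: string,
  now: number,
): boolean {
  const key = `${repo}#${sha}`; // keyed repo#sha, per the issue text
  if (dedup.has(key, now)) return false;
  dedup.mark(key, now);
  return true;
}
```

Note that the check never looks at how the previous task ended: once a SHA is marked, a second event inside the TTL is suppressed whether the first task succeeded or crashed.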

The mechanism is correct on the happy path. Two failure modes are uncovered:

- Agent crashes mid-task. PR #780 / task 5d25a13c-…: dev container OOM-killed (exit 137) while running fix-ci for SHA a7cbf84. The PR sat red until the operator intervened. The dedup map blocked re-dispatch for the full 1 h window even though the previous task ended in failure.
- Webhook missed or delivered out of order. Forgejo doesn't redeliver after a service restart or transient HTTP error, so the only action_run_failure event for that SHA is gone and CI stays red with no agent in flight.

Proposed fix

Add janitor rule stale_fix_ci_redispatch modelled on unmergeable_pr_rebase (#781):

1. Per repo, list open PRs.
2. Get the aggregate CI status for each PR's head SHA. Skip unless it is failure.
3. Walk listTasksForIssue(repo, pr.number) for the most recent task:
   - If a task is currently running for that issue (worker registry check), skip.
   - If the most recent finished task for that PR ended in success, skip (post-CI hasn't fired yet; leave it to the event path).
   - If the most recent finished task ended in failure, interrupted, or cancelled and CI is still red, redispatch.
4. Override _fixCiDispatched semantics: the janitor's own canAct window (6 h) is the rate limiter, not the post-ci dedup map. The dedup map exists to suppress event-driven double-dispatch within a single CI run; it shouldn't block recovery after a task crash.
5. Reuse dispatchFixCi. This needs a way to bypass alreadyDispatchedFixCi: extend the function with a forceRedispatch opt (or call the inner persistence path directly).
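
The skip/redispatch logic in steps 2 and 3 can be sketched as a pure function over a per-PR snapshot. Everything here (`PrSnapshot`, `Decision`, `decideStaleFixCi`, the reason strings) is hypothetical and only illustrates the decision table. The issue does not say what should happen when CI is red but no finished task exists for the PR (the missed-webhook case); this sketch assumes a conservative skip there, and the real rule may well choose to redispatch instead.

```typescript
type CiStatus = "success" | "failure" | "pending" | null; // null = no workflows
type TaskOutcome = "success" | "failure" | "interrupted" | "cancelled";

// Hypothetical snapshot of what the rule would gather for one open PR.
interface PrSnapshot {
  ciStatus: CiStatus; // aggregate CI status of the head SHA
  hasRunningTask: boolean; // worker registry check for this issue
  lastFinishedOutcome: TaskOutcome | null; // most recent finished task, if any
}

type Decision =
  | { action: "redispatch" }
  | { action: "skip"; reason: string };

function decideStaleFixCi(pr: PrSnapshot): Decision {
  // Step 2: only act on red CI.
  if (pr.ciStatus !== "failure") {
    return { action: "skip", reason: "ci-not-failing" };
  }
  // Step 3a: never stack a second task on top of a running one.
  if (pr.hasRunningTask) {
    return { action: "skip", reason: "task-in-flight" };
  }
  // ASSUMPTION: the issue does not specify the no-task case; skip conservatively.
  if (pr.lastFinishedOutcome === null) {
    return { action: "skip", reason: "no-finished-task" };
  }
  // Step 3b: a successful task means the event path handles the next CI run.
  if (pr.lastFinishedOutcome === "success") {
    return { action: "skip", reason: "last-task-succeeded" };
  }
  // Step 3c: failure, interrupted, or cancelled with CI still red.
  return { action: "redispatch" };
}
```

Keeping the decision separate from the Forgejo and worker-registry lookups would also make the unit tests listed under the acceptance criteria straightforward to write against plain objects.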

Acceptance criteria

Janitor rule

- [ ] New rule stale_fix_ci_redispatch registered in _ALL_RULES and reconcileOnce
- [ ] Lists open PRs, fetches aggregate CI status per head SHA
- [ ] Skips when CI is success, pending, or null (no workflows)
- [ ] Skips when worker registry has a running task for that issue
- [ ] Skips when last finished task for the issue is success (event path will pick up next CI run)
- [ ] Redispatches when CI is failure AND last finished task is failure/interrupted/cancelled

Dispatch path

- [ ] dispatchFixCi accepts a force?: boolean opt (or a sibling helper is exposed) so the janitor can bypass the 1 h dedup map
- [ ] Janitor's own canAct 6 h window is the rate limiter for the rule
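
One way the two dispatch-path criteria could fit together is sketched below, under stated assumptions. `makeDispatchSketch`, `makeCanActSketch`, and the key formats are hypothetical; the real signatures of `dispatchFixCi` and the janitor's `canAct` are not shown in this issue. The point is only that `force` skips the 1 h dedup check while a separate 6 h window bounds how often the janitor itself may act.

```typescript
const DEDUP_MS = 60 * 60 * 1000; // 1 h event-path dedup window
const JANITOR_WINDOW_MS = 6 * 60 * 60 * 1000; // 6 h janitor rate limit

interface DispatchOpts {
  force?: boolean; // when true, ignore the dedup window
}

// Hypothetical dispatcher: `send` stands in for the real task dispatch.
function makeDispatchSketch(send: (key: string) => void) {
  const dispatchedAt = new Map<string, number>();
  return function dispatch(key: string, now: number, opts: DispatchOpts = {}): boolean {
    const last = dispatchedAt.get(key);
    const deduped = last !== undefined && now - last < DEDUP_MS;
    if (deduped && !opts.force) return false; // event path stays deduplicated
    dispatchedAt.set(key, now);
    send(key);
    return true;
  };
}

// Hypothetical per-key rate limiter standing in for the janitor's canAct.
function makeCanActSketch(windowMs: number) {
  const lastActed = new Map<string, number>();
  return function canAct(ruleKey: string, now: number): boolean {
    const last = lastActed.get(ruleKey);
    if (last !== undefined && now - last < windowMs) return false;
    lastActed.set(ruleKey, now);
    return true;
  };
}

// Janitor pass: only the 6 h window gates the forced redispatch.
function janitorRedispatch(
  canAct: (ruleKey: string, now: number) => boolean,
  dispatch: (key: string, now: number, opts?: DispatchOpts) => boolean,
  ruleKey: string,
  dedupKey: string,
  now: number,
): boolean {
  if (!canAct(ruleKey, now)) return false;
  return dispatch(dedupKey, now, { force: true });
}
```

With this shape, an event-driven duplicate ten minutes after the first dispatch is still dropped, while a janitor pass at the same moment goes through once and is then held back for six hours.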

Tests

- [ ] Unit: rule dispatches when CI=failure + last task=failure + no in-flight task
- [ ] Unit: rule skips when CI=success
- [ ] Unit: rule skips when there's an in-flight task for the issue
- [ ] Unit: rule skips when last finished task is success
- [ ] Unit: rule respects its own canAct window across passes

Out of scope

- Retry-with-extended-memory for OOM specifically (operator-level container config, separate ticket if it becomes a pattern).
- Cap on total fix-ci attempts per PR (could be a follow-up; the current force_merge round-cap covers reviewer rounds, not CI rounds).
- Ticket #781 (unmergeable_pr_rebase) is the parallel fix for the conflict no-event gap.

References

- Hook: event-handlers.ts:963 (handleCheckSuiteCompleted).
- Dispatch: post-ci.ts:261 (dispatchFixCi), post-ci.ts:45 (FIX_CI_DEDUP_MS).
- Janitor pattern: janitor.ts rule unmergeable_pr_rebase (PR #783, issue #781).
- Reproduction: PR #780, task 5d25a13c-4f78-4531-a23e-028f055202eb exited 137 on SHA a7cbf84.


claude-desktop added the area:webhook and type:user-story labels

2026-05-02 22:49:45 +00:00

charles referenced this issue from a commit

2026-05-02 22:55:46 +00:00

feat(janitor): stale_fix_ci_redispatch rule (closes #784)

claude-desktop referenced this issue from a pull request that will close it

2026-05-02 22:56:04 +00:00

feat(janitor): stale_fix_ci_redispatch rule — recover from crashed fix-ci tasks #785

charles closed this issue

2026-05-03 09:20:37 +00:00

charles referenced this issue from a commit

2026-05-03 09:20:37 +00:00

feat(janitor): stale_fix_ci_redispatch rule — recover from crashed fix-ci tasks (#785)

charles referenced this issue from a commit

2026-05-03 11:11:42 +00:00

chore(infra): migrate Docker base + CI workflows to forge-base v0.1.0
