feat(janitor): periodic reconciler — detect stuck issues/PRs/tasks and self-heal #269

Closed
opened 2026-04-21 21:52:38 +00:00 by claude-desktop · 0 comments
Collaborator

User story

As an operator, I want a periodic janitor process that scans for known stuck states and either heals them automatically or flags them loudly on the dashboard, so I stop spending 20 % of my operator time doing curl surgery to unstick the fleet.

Every failure pattern on this ticket has been observed on 2026-04-21 across multiple tickets. Each is cheap to detect; most are cheap to fix.

Observed stuck patterns — all from today

# Pattern Today's examples Auto-heal
1 PR merged with Closes #N. (trailing period / odd phrasing) but Forgejo didn't auto-close #224, #263 close the issue
2 Design-reviewer comment carries **Verdict**: APPROVED but the skill didn't execute the close step #253 close the issue
3 Issue has an area:* routing label but no dispatch ran (service was down when the label was added) #236 bounce the label
4 Dependent's blocker closed but the propagator didn't fire (e.g., service restart mid-webhook) #262 after #263 re-fire handleIssueAssigned
5 Task status=success but no PR opened + no worktree residue (short-turn bail) designer #253 pass 1 ($0.54, 10 turns) flag for operator
6 Idle-assigned card with no PR, no blockers, >30 min of staleness many over the course of today comment + flag
7 PR mergeable=True + CI green + APPROVED but not merged (usually dev never hit the merge button) intermittent surface as a "ready to merge" row

Acceptance criteria

New module

  • apps/server/src/janitor.ts — scheduled reconcile, runs every janitor.interval_ms (default 600 000 ms = 10 min). Cancelable timer; shutdown.ts cancels it on drain.
  • One pure reconcileOnce({ mode: "dry-run" | "auto-heal" }): JanitorReport helper so tests exercise the rule tree without a timer.

Config (config/agents.json::janitor)

  • enabled: boolean (default true once merged, false during staged rollout — flip per-deploy).
  • interval_ms: number (clamp 60 000–3 600 000).
  • mode: "dry-run" | "auto-heal" — dry-run logs what it would do; auto-heal acts. Default "dry-run" until one week of logs look sane.
  • rules: string[] — allowlist of rule names to run (empty = all). Lets the operator disable a specific rule if it misbehaves.

Rules (each as a named function in janitor.ts)

  • closes_keyword_drift — for every PR merged in the last 24 h with closes #N / fixes #N / resolves #N in its body, if issue N is still open and has no open blockers, close it with an audit comment.
  • design_approved_not_closed — scan open issues with type:user-story for any comment by design-reviewer containing **Verdict**: APPROVED. If present and the issue is still open, close it (same handling as #248's skill path, just the operator-side safety net).
  • label_dispatch_miss — for any open issue with area:design / area:design-review / area:dashboard, check task_history + the live worker registry for a dispatch in the last 2 h. If none, bounce the routing label.
  • dependent_unblocked — for any open issue whose native /dependencies are all closed, with no currently-running task and an assignee, re-fire the assignment webhook (PATCH + re-PATCH assignees).
  • zero_output_success — for any task_history.status=success in the last 24 h with turns=0 OR cost_usd=null, flag as "bailed out — operator decides". Never auto-re-dispatch (could be a real success; leave it to the human).
  • stale_idle_assigned — for any issue assigned, no PR, no blockers, unchanged for >30 min, post a one-line comment 🧹 janitor: this ticket has been idle-assigned since <time>. Re-dispatching. and bounce the assignee.
  • ready_to_merge — for any open PR with mergeable=true + CI success + at least one APPROVED review + no pending REQUEST_CHANGES, flag as "ready". Do not auto-merge — operator decides.

Observability

  • Every action emits a structured line: [janitor] rule=<name> action=<close|bounce|redispatch|flag|noop> target=#N details="<one-line reason>".
  • Broadcast janitor_action SSE envelope so the dashboard can render a recent-activity panel (on the Agents page or a new Janitor page — pick one in the mockup ticket if we decide to add UI).
  • /janitor/history GET endpoint returns the last 100 actions (in-memory ring buffer; no persistence needed).

Safety rails

  • Per-rule per-target idempotency: the same (rule, target) pair is only acted on once per 6 h, so a misbehaving rule doesn't spam the same issue with corrective label-bounces.
  • Global kill-switch via config/agents.json::janitor.enabled=false — takes effect on next reload without requiring a code change.
  • Dry-run mode for the first rollout week — log-only; metrics on how many times each rule would have fired lets us tune before flipping to auto-heal.

Verification

  • Unit tests in apps/server/src/janitor.test.ts — one table test per rule, fake Forgejo + fake task_history, assert the right action is proposed.
  • Integration test: spin up a dry-run janitor with a synthetic stuck state; assert the report matches expectations and no actual Forgejo writes fire.
  • Manual: after one week in dry-run, review the log, flip to auto-heal, confirm a week of clean self-heal without operator intervention.

Out of scope

  • Auto-merge on ready_to_merge — operator decides; the janitor only flags.
  • Auto-re-dispatch of zero_output_success — too risky without knowing whether the agent actually bailed or just had nothing to do.
  • A dedicated Janitor UI page — file a follow-up if the SSE log stream turns out to need structured browsing. For now, log lines + SSE envelope.
  • Cross-repo cycles — single-repo only. Today we have one repo in cfg.repos; re-evaluate if that changes.

References

  • apps/server/src/deps.ts::propagateDependencyClosure — the dependent_unblocked rule shares its code path.
  • apps/server/src/webhook-handlers.ts::dispatchIssueForAgentlabel_dispatch_miss re-fires through this.
  • apps/server/src/task-store.ts::computeStatszero_output_success query shape.
  • Prior tickets that each cover a single failure mode (this janitor is the monitoring layer, not a replacement):
    • #221interrupted status for SIGTERM casualties.
    • #248 — skill-level close-on-approval for design reviews.
    • #257 — credential seeding gap.
  • 2026-04-21 operator log: at least 7 distinct stuck states unstuck by hand in a single day. The janitor is the "fix this whole class of incident" pass.
## User story As an operator, I want a periodic **janitor** process that scans for known stuck states and either heals them automatically or flags them loudly on the dashboard, so I stop spending 20 % of my operator time doing `curl` surgery to unstick the fleet. Every failure pattern on this ticket has been observed on 2026-04-21 across multiple tickets. Each is cheap to detect; most are cheap to fix. ## Observed stuck patterns — all from today | # | Pattern | Today's examples | Auto-heal | |---|---|---|---| | 1 | PR merged with `Closes #N.` (trailing period / odd phrasing) but Forgejo didn't auto-close | #224, #263 | close the issue | | 2 | Design-reviewer comment carries `**Verdict**: APPROVED` but the skill didn't execute the close step | #253 | close the issue | | 3 | Issue has an `area:*` routing label but no dispatch ran (service was down when the label was added) | #236 | bounce the label | | 4 | Dependent's blocker closed but the propagator didn't fire (e.g., service restart mid-webhook) | #262 after #263 | re-fire `handleIssueAssigned` | | 5 | Task `status=success` but no PR opened + no worktree residue (short-turn bail) | designer #253 pass 1 ($0.54, 10 turns) | flag for operator | | 6 | Idle-assigned card with no PR, no blockers, >30 min of staleness | many over the course of today | comment + flag | | 7 | PR `mergeable=True` + CI green + APPROVED but not merged (usually dev never hit the merge button) | intermittent | surface as a "ready to merge" row | ## Acceptance criteria ### New module - [ ] `apps/server/src/janitor.ts` — scheduled reconcile, runs every `janitor.interval_ms` (default 600 000 ms = 10 min). Cancelable timer; shutdown.ts cancels it on drain. - [ ] One pure `reconcileOnce({ mode: "dry-run" | "auto-heal" }): JanitorReport` helper so tests exercise the rule tree without a timer. ### Config (`config/agents.json::janitor`) - [ ] `enabled: boolean` (default `true` once merged, `false` during staged rollout — flip per-deploy). - [ ] `interval_ms: number` (clamp 60 000–3 600 000). - [ ] `mode: "dry-run" | "auto-heal"` — dry-run logs what it would do; auto-heal acts. Default `"dry-run"` until one week of logs look sane. - [ ] `rules: string[]` — allowlist of rule names to run (empty = all). Lets the operator disable a specific rule if it misbehaves. ### Rules (each as a named function in janitor.ts) - [ ] `closes_keyword_drift` — for every PR merged in the last 24 h with `closes #N` / `fixes #N` / `resolves #N` in its body, if issue N is still open and has no open blockers, close it with an audit comment. - [ ] `design_approved_not_closed` — scan open issues with `type:user-story` for any comment by `design-reviewer` containing `**Verdict**: APPROVED`. If present and the issue is still open, close it (same handling as #248's skill path, just the operator-side safety net). - [ ] `label_dispatch_miss` — for any open issue with `area:design` / `area:design-review` / `area:dashboard`, check `task_history` + the live worker registry for a dispatch in the last 2 h. If none, bounce the routing label. - [ ] `dependent_unblocked` — for any open issue whose native `/dependencies` are all closed, with no currently-running task and an assignee, re-fire the assignment webhook (PATCH + re-PATCH assignees). - [ ] `zero_output_success` — for any `task_history.status=success` in the last 24 h with `turns=0` OR `cost_usd=null`, flag as "bailed out — operator decides". Never auto-re-dispatch (could be a real success; leave it to the human). - [ ] `stale_idle_assigned` — for any issue assigned, no PR, no blockers, unchanged for >30 min, post a one-line comment `🧹 janitor: this ticket has been idle-assigned since <time>. Re-dispatching.` and bounce the assignee. - [ ] `ready_to_merge` — for any open PR with `mergeable=true` + CI success + at least one `APPROVED` review + no pending `REQUEST_CHANGES`, flag as "ready". Do **not** auto-merge — operator decides. ### Observability - [ ] Every action emits a structured line: `[janitor] rule=<name> action=<close|bounce|redispatch|flag|noop> target=#N details="<one-line reason>"`. - [ ] Broadcast `janitor_action` SSE envelope so the dashboard can render a recent-activity panel (on the Agents page or a new Janitor page — pick one in the mockup ticket if we decide to add UI). - [ ] `/janitor/history` GET endpoint returns the last 100 actions (in-memory ring buffer; no persistence needed). ### Safety rails - [ ] Per-rule per-target idempotency: the same `(rule, target)` pair is only acted on once per 6 h, so a misbehaving rule doesn't spam the same issue with corrective label-bounces. - [ ] Global kill-switch via `config/agents.json::janitor.enabled=false` — takes effect on next reload without requiring a code change. - [ ] Dry-run mode for the first rollout week — log-only; metrics on how many times each rule would have fired lets us tune before flipping to auto-heal. ### Verification - [ ] Unit tests in `apps/server/src/janitor.test.ts` — one table test per rule, fake Forgejo + fake task_history, assert the right action is proposed. - [ ] Integration test: spin up a `dry-run` janitor with a synthetic stuck state; assert the report matches expectations and no actual Forgejo writes fire. - [ ] Manual: after one week in dry-run, review the log, flip to `auto-heal`, confirm a week of clean self-heal without operator intervention. ## Out of scope - Auto-merge on `ready_to_merge` — operator decides; the janitor only flags. - Auto-re-dispatch of `zero_output_success` — too risky without knowing whether the agent actually bailed or just had nothing to do. - A dedicated Janitor UI page — file a follow-up if the SSE log stream turns out to need structured browsing. For now, log lines + SSE envelope. - Cross-repo cycles — single-repo only. Today we have one repo in `cfg.repos`; re-evaluate if that changes. ## References - `apps/server/src/deps.ts::propagateDependencyClosure` — the `dependent_unblocked` rule shares its code path. - `apps/server/src/webhook-handlers.ts::dispatchIssueForAgent` — `label_dispatch_miss` re-fires through this. - `apps/server/src/task-store.ts::computeStats` — `zero_output_success` query shape. - Prior tickets that each cover a single failure mode (this janitor is the *monitoring* layer, not a replacement): - #221 — `interrupted` status for SIGTERM casualties. - #248 — skill-level close-on-approval for design reviews. - #257 — credential seeding gap. - 2026-04-21 operator log: at least 7 distinct stuck states unstuck by hand in a single day. The janitor is the "fix this whole class of incident" pass.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
charles/claude-hooks#269
No description provided.