feat(janitor): periodic reconciler — detect stuck issues/PRs/tasks and self-heal #269
User story
As an operator, I want a periodic janitor process that scans for known stuck states and either heals them automatically or flags them loudly on the dashboard, so I stop spending 20% of my operator time doing `curl` surgery to unstick the fleet.

Every failure pattern on this ticket was observed on 2026-04-21 across multiple tickets. Each is cheap to detect; most are cheap to fix.
Observed stuck patterns — all from today
- `Closes #N.` (trailing period / odd phrasing) but Forgejo didn't auto-close
- `**Verdict**: APPROVED` but the skill didn't execute the close step
- `area:*` routing label but no dispatch ran (service was down when the label was added)
- `handleIssueAssigned` `status=success` but no PR opened + no worktree residue (short-turn bail)
- `mergeable=True` + CI green + APPROVED but not merged (usually dev never hit the merge button)

Acceptance criteria
New module
- `apps/server/src/janitor.ts`: scheduled reconcile, runs every `janitor.interval_ms` (default 600 000 ms = 10 min). Cancelable timer; shutdown.ts cancels it on drain.
- `reconcileOnce({ mode: "dry-run" | "auto-heal" }): JanitorReport` helper so tests exercise the rule tree without a timer.

Config (`config/agents.json::janitor`)
- `enabled: boolean` (default `true` once merged, `false` during staged rollout; flip per-deploy).
- `interval_ms: number` (clamp 60 000–3 600 000).
- `mode: "dry-run" | "auto-heal"`: dry-run logs what it would do; auto-heal acts. Default `"dry-run"` until one week of logs look sane.
- `rules: string[]`: allowlist of rule names to run (empty = all). Lets the operator disable a specific rule if it misbehaves.

Rules (each as a named function in janitor.ts)
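A minimal sketch of how the `janitor` config block might be normalized and how its `rules` allowlist could pick among named rule functions. The names `normalizeJanitorConfig`, `selectRules`, and the `Rule` type are hypothetical; the ticket specifies only the fields, defaults, and clamp range. The `enabled` default below assumes the staged-rollout value.

```typescript
type JanitorMode = "dry-run" | "auto-heal";

interface JanitorConfig {
  enabled: boolean;
  interval_ms: number;
  mode: JanitorMode;
  rules: string[]; // empty = run every rule
}

const MIN_INTERVAL_MS = 60_000;
const MAX_INTERVAL_MS = 3_600_000;
const DEFAULT_INTERVAL_MS = 600_000;

// Fill in defaults and clamp interval_ms into the documented range.
function normalizeJanitorConfig(raw: Partial<JanitorConfig>): JanitorConfig {
  const interval = raw.interval_ms ?? DEFAULT_INTERVAL_MS;
  return {
    enabled: raw.enabled ?? false, // assumption: staged-rollout default
    interval_ms: Math.min(MAX_INTERVAL_MS, Math.max(MIN_INTERVAL_MS, interval)),
    mode: raw.mode ?? "dry-run",
    rules: raw.rules ?? [],
  };
}

// Assumption: a rule is a named function; its real signature is not given in the ticket.
type Rule = () => string;

// Return the names of the rules to run: all of them when the allowlist is empty.
function selectRules(all: Record<string, Rule>, allowlist: string[]): string[] {
  const names = Object.keys(all);
  if (allowlist.length === 0) return names;
  return names.filter((name) => allowlist.indexOf(name) !== -1);
}
```

Unknown names in the allowlist are silently ignored in this sketch; a real implementation might prefer to log them so a typo does not quietly disable a rule.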
- `closes_keyword_drift`: for every PR merged in the last 24 h with `closes #N` / `fixes #N` / `resolves #N` in its body, if issue N is still open and has no open blockers, close it with an audit comment.
- `design_approved_not_closed`: scan open issues with `type:user-story` for any comment by `design-reviewer` containing `**Verdict**: APPROVED`. If present and the issue is still open, close it (same handling as #248's skill path, just the operator-side safety net).
- `label_dispatch_miss`: for any open issue with `area:design` / `area:design-review` / `area:dashboard`, check `task_history` + the live worker registry for a dispatch in the last 2 h. If none, bounce the routing label.
- `dependent_unblocked`: for any open issue whose native `/dependencies` are all closed, with no currently-running task and an assignee, re-fire the assignment webhook (PATCH + re-PATCH assignees).
- `zero_output_success`: for any `task_history.status=success` in the last 24 h with `turns=0` OR `cost_usd=null`, flag as "bailed out — operator decides". Never auto-re-dispatch (could be a real success; leave it to the human).
- `stale_idle_assigned`: for any issue assigned, no PR, no blockers, unchanged for >30 min, post a one-line comment `🧹 janitor: this ticket has been idle-assigned since <time>. Re-dispatching.` and bounce the assignee.
- `ready_to_merge`: for any open PR with `mergeable=true` + CI success + at least one `APPROVED` review + no pending `REQUEST_CHANGES`, flag as "ready". Do not auto-merge; operator decides.

Observability
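The `[janitor] ...` line format and the 100-entry in-memory history that this Observability section calls for might be sketched as follows. `JanitorAction`, `formatJanitorLog`, and `ActionRing` are hypothetical names, and sanitizing `details` to keep the record on one line is an assumption, not something the ticket requires.

```typescript
type JanitorActionKind = "close" | "bounce" | "redispatch" | "flag" | "noop";

interface JanitorAction {
  rule: string;
  action: JanitorActionKind;
  target: number; // issue or PR number
  details: string; // one-line reason
}

// Render one action in the "[janitor] rule=... action=... target=#N details=..." shape.
function formatJanitorLog(a: JanitorAction): string {
  // Collapse whitespace runs and swap double quotes so the record stays on one line.
  const details = a.details.replace(/\s+/g, " ").replace(/"/g, "'").trim();
  return `[janitor] rule=${a.rule} action=${a.action} target=#${a.target} details="${details}"`;
}

// Fixed-capacity in-memory history; the oldest entry is dropped once full.
class ActionRing {
  private items: JanitorAction[] = [];
  private capacity: number;

  constructor(capacity: number = 100) {
    this.capacity = capacity;
  }

  push(a: JanitorAction): void {
    this.items.push(a);
    if (this.items.length > this.capacity) this.items.shift();
  }

  // Oldest first; a history endpoint could serialize this copy as JSON.
  snapshot(): JanitorAction[] {
    return this.items.slice();
  }
}
```

An array with `shift()` is O(n) per eviction, which is harmless at 100 entries; a true circular index would only matter at much larger capacities.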
- `[janitor] rule=<name> action=<close|bounce|redispatch|flag|noop> target=#N details="<one-line reason>"`.
- `janitor_action` SSE envelope so the dashboard can render a recent-activity panel (on the Agents page or a new Janitor page; pick one in the mockup ticket if we decide to add UI).
- `/janitor/history` GET endpoint returns the last 100 actions (in-memory ring buffer; no persistence needed).

Safety rails
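The once-per-6 h guard on a `(rule, target)` pair in this Safety rails section could be sketched like this. `ActionCooldown` and `tryAcquire` are hypothetical names, and keeping the state in memory (so it resets on restart) is an assumption; the ticket does not say where the cooldown is tracked.

```typescript
const COOLDOWN_MS = 6 * 60 * 60 * 1000; // 6 h, per the ticket

class ActionCooldown {
  // Key is "rule:target"; value is the last time the pair was acted on (epoch ms).
  private lastActed: Record<string, number> = {};

  // Returns true and records the time if the pair may be acted on now.
  tryAcquire(rule: string, target: number, nowMs: number): boolean {
    const key = `${rule}:${target}`;
    const last = this.lastActed[key];
    if (last !== undefined && nowMs - last < COOLDOWN_MS) return false;
    this.lastActed[key] = nowMs;
    return true;
  }
}
```

Passing `nowMs` in explicitly, rather than reading the clock inside the method, keeps the guard testable without fake timers.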
- A `(rule, target)` pair is only acted on once per 6 h, so a misbehaving rule doesn't spam the same issue with corrective label-bounces.
- `config/agents.json::janitor.enabled=false`: takes effect on next reload without requiring a code change.

Verification
- `apps/server/src/janitor.test.ts`: one table test per rule, fake Forgejo + fake task_history, assert the right action is proposed.
- `dry-run` janitor with a synthetic stuck state; assert the report matches expectations and no actual Forgejo writes fire.
- `auto-heal`, confirm a week of clean self-heal without operator intervention.

Out of scope
- `ready_to_merge`: operator decides; the janitor only flags.
- `zero_output_success`: too risky without knowing whether the agent actually bailed or just had nothing to do.
- `cfg.repos`; re-evaluate if that changes.

References
- `apps/server/src/deps.ts::propagateDependencyClosure`: the `dependent_unblocked` rule shares its code path.
- `apps/server/src/webhook-handlers.ts::dispatchIssueForAgent`: `label_dispatch_miss` re-fires through this.
- `apps/server/src/task-store.ts::computeStats`: `zero_output_success` query shape.
- `interrupted` status for SIGTERM casualties.
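As an illustration of the keyword scan that the `closes_keyword_drift` rule depends on, here is a rough sketch of pulling issue numbers out of a PR body. `extractClosedIssues` is a hypothetical helper, not Forgejo's own parser: it assumes only the three keywords named in the ticket and the same-repo `#N` form, and it tolerates the trailing period seen in the `Closes #N.` pattern.

```typescript
// Extract the issue numbers referenced by closing keywords in a PR body.
function extractClosedIssues(body: string): number[] {
  // Case-insensitive; the word boundary avoids matching inside words like "encloses".
  const pattern = /\b(?:closes|fixes|resolves)\s+#(\d+)/gi;
  const found: number[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(body)) !== null) {
    const n = Number(m[1]);
    // Keep first-seen order and skip duplicates.
    if (found.indexOf(n) === -1) found.push(n);
  }
  return found;
}
```

The rule would still need to check that each issue is open and has no open blockers before closing it; this helper covers only the text match.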