feat(janitor): stale_fix_ci_redispatch rule — recover from crashed fix-ci tasks #785
Reference
charles/claude-hooks!785
Closes #784. Stacks on #783.
Summary
When `dispatchFixCi` runs and the dispatched fix-ci task crashes (OOM exit 137, SDK timeout, container kill), the 1 h post-ci dedup map blocks re-dispatch for the rest of the window. The PR sits red until the operator intervenes. Reproduced on PR 780 / task `5d25a13c-…`: devcontainer OOM-killed mid fix-ci, dedup blocked recovery.

Design
- `stale_fix_ci_redispatch` registered in `_ALL_RULES` and `reconcileOnce`.
- Targets CI state `failure`. Skips if a task is currently running for the issue (worker registry check via `cfg.isTaskRunning`). Walks `listTasksForIssue` for the latest finished task. Skips if the latest task is `success` (the event path will fire on the next CI run). Skips if there's no task history yet (first dispatch is the event-driven path's job). Otherwise re-dispatches `fix-ci`.
- `dispatchFixCi` now accepts `opts.force?: boolean`. Default `false` preserves event-driven dedup. Janitor passes `true` to bypass the 1 h map. That map exists to suppress duplicate dispatches within a single CI run's status fan-out, not to block recovery after a task crash.
- Rate limit is the `canAct` 6 h window. So even with `force=true`, the rule can't dispatch more than once per 6 h per PR.
- The rule calls through `impl.listTasksForIssue` / `impl.dispatchFixCi` / `impl.skillForEvent` / `impl.interpolate` so tests can drive the rule without spinning up the full skill loader / DB.

Test plan
- `force=true` when CI=failure + last task=failure + no in-flight
- … `success`
- `canAct` window is the rate limiter: second pass within window is a no-op
- … `dispatchFixCi`
- … `janitor.test.ts`
- `just qa` clean (3 pre-existing `session JSONL pruning` failures on `main` are unrelated)

CI green. Logic correct: `tasks[tasks.length-1]` matches `ORDER BY started_at ASC` (newest-last). `force=true` bypass is minimal and scoped; dedup still guards event-driven callers. All 7 AC items satisfied, all 8 test cases present.

CI green. `tasks[tasks.length-1]` matches `ORDER BY started_at ASC`. `force=true` bypass is minimal. All AC satisfied, 8 tests present.
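
For readers skimming the thread, here is a minimal sketch of the two pieces described in the Design section: the rule's skip/dispatch decision and the `force` bypass of the 1 h dedup map. It is not the code from this PR. The `Task` shape, the `Decision` strings, and the names `decideRedispatch` and `makeDispatchFixCi` are hypothetical, and the real `dispatchFixCi` signature is not shown in the description.

```typescript
// Hypothetical task shape; the real schema is not shown in the PR text.
type TaskStatus = "success" | "failure" | "running";
interface Task {
  id: string;
  status: TaskStatus;
  startedAt: number;
}

type Decision =
  | "dispatch"
  | "skip:ci-not-failing"
  | "skip:in-flight"
  | "skip:no-history"
  | "skip:last-success";

// Pure decision step for the janitor rule. `tasks` is assumed to be sorted
// by started_at ascending, so the newest entry is the last element.
function decideRedispatch(
  ciState: string,
  isTaskRunning: boolean,
  tasks: Task[],
): Decision {
  if (ciState !== "failure") return "skip:ci-not-failing";
  if (isTaskRunning) return "skip:in-flight";
  const finished = tasks.filter((t) => t.status !== "running");
  if (finished.length === 0) return "skip:no-history";
  const latest = finished[finished.length - 1];
  return latest.status === "success" ? "skip:last-success" : "dispatch";
}

const DEDUP_WINDOW_MS = 60 * 60 * 1000; // 1 h, as in the description

// Illustrative dispatcher: event-driven callers are deduped per PR key,
// while a caller passing { force: true } bypasses the dedup map.
function makeDispatchFixCi(now: () => number = Date.now) {
  const lastDispatch = new Map<string, number>();
  return function dispatchFixCi(
    prKey: string,
    opts: { force?: boolean } = {},
  ): boolean {
    const prev = lastDispatch.get(prKey);
    const deduped = prev !== undefined && now() - prev < DEDUP_WINDOW_MS;
    if (deduped && !opts.force) return false; // suppressed duplicate
    lastDispatch.set(prKey, now());
    return true; // dispatched
  };
}
```

In this sketch the `canAct` 6 h window is deliberately left out. Per the description it sits in front of the rule and is what keeps `force=true` from dispatching more than once per 6 h per PR.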