B17 — Per-agent-type completion proof (extend B10 silent-failure detection beyond git) #446

Closed
opened 2026-04-27 12:17:14 +00:00 by claude-desktop · 0 comments
Collaborator

As an orchestrator,
I want B10's silent-completion watchdog to require a per-agent-type delivery proof before trusting "done — task completed",
so that non-git agents (designer, design-reviewer, reviewer) can't fail silently the way dev did before B10.

B10 (#426) caught silent failures on dev / boss rebase tasks by checking PR head sha + duration. That works for git-producing agents. But on the night of 2026-04-26 the designer silently failed on issue #236: the agent ran 12 min on Opus + Penpot MCP and logged success, yet it never created a frame and never posted a Penpot link comment on the issue. The card stayed IDLE-ASSIGNED for 17 hours until it was manually retriggered.

Probable cause: cross-task session contamination (designer-2 resumed prior session from issue #291) AND/OR Penpot MCP call failed silently. Either way, the orchestrator trusted the success signal because B10's git heuristic doesn't apply to a Penpot-producing agent.

This story generalises B10 with a per-agent-type completion-proof table. Each agent type declares what counts as "actually delivered":

| Agent type | Completion proof |
|---|---|
| dev / boss | PR head sha changed (B10's existing check) |
| designer | new comment on the issue containing a penpot.app/#/workspace/ URL, posted by the agent's forgejo_user, after task start |
| design-reviewer | new review comment on the design PR (or on the issue) by the agent's forgejo_user, after task start |
| reviewer / reviewer-security | submitted PR review (APPROVED or REQUEST_CHANGES) by the agent's forgejo_user, after task start |
| foreman | dispatched task / breakdown comment posted on the issue, after task start |

If the agent reports "done — task completed" and the proof is missing, treat it as a silent failure and follow the same B10 path: increment the counter, dead-letter at the threshold, escalate via B11.
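One way to hold the table above in code is a record keyed by agent type, so that adding an agent type without declaring its proof becomes a compile-time error. This is only a sketch: `ProofRule`, `PROOF_TABLE`, the `source` labels and `isSilentFailure` are hypothetical names for illustration, not part of B10.

```typescript
// Agent types taken from the table above; the union name is an assumption.
type AgentType =
  | "dev" | "boss" | "designer" | "design-reviewer"
  | "reviewer" | "reviewer-security" | "foreman";

// Result shape from the acceptance criteria: `reason` explains a failed proof.
interface ProofResult { passed: boolean; reason?: string }

// Hypothetical descriptor: where the proof is looked up and what is expected.
interface ProofRule {
  source: "pr-head" | "issue-comment" | "issue-or-pr-comment" | "pr-review";
  expects: string;
}

// Record<AgentType, ...> forces one entry per agent type.
const PROOF_TABLE: Record<AgentType, ProofRule> = {
  dev: { source: "pr-head", expects: "PR head sha changed" },
  boss: { source: "pr-head", expects: "PR head sha changed" },
  designer: { source: "issue-comment", expects: "comment with a penpot.app/#/workspace/ URL" },
  "design-reviewer": { source: "issue-or-pr-comment", expects: "review comment on the design PR or issue" },
  reviewer: { source: "pr-review", expects: "submitted PR review" },
  "reviewer-security": { source: "pr-review", expects: "submitted PR review" },
  foreman: { source: "issue-comment", expects: "dispatch or breakdown comment" },
};

// A reported completion is a silent failure when its proof did not pass.
function isSilentFailure(reportedDone: boolean, proof: ProofResult): boolean {
  return reportedDone && !proof.passed;
}
```

Every rule additionally requires that the artefact was posted by the agent's forgejo_user after task start; that part is left out of the descriptor here because it is the same for all non-git rows.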

Acceptance criteria

Backend — completion-proof table

  • Define COMPLETION_PROOF_RULES: Record<AgentType, ProofCheck> in apps/server/src/domain/dispatch/completion-proof.ts (new module).
  • ProofCheck is a function (task: TaskRecord, started_at: ms) => Promise<{ passed: boolean; reason?: string }> with the table above as concrete impls.
  • dev / boss proof reuses B10's existing sha-changed check.
  • designer proof: fetch issue comments via Forgejo API, filter created_at >= started_at AND user.login === task.assignee AND body contains penpot.app/#/workspace/. Pass if any.
  • design-reviewer proof: fetch issue + linked-PR comments, filter same, body contains frame: or figure: or penpot.app/. Pass if any.
  • reviewer / reviewer-security proof: fetch PR reviews via /repos/.../pulls/N/reviews, filter submitted_at >= started_at AND user.login === task.assignee AND state in ["APPROVED", "REQUEST_CHANGES", "COMMENTED"]. Pass if any.
  • foreman proof: fetch issue comments, filter same, body contains Dispatched or Broken down or Skill:. Pass if any.
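The designer and reviewer filters above can be sketched as pure functions over data that has already been fetched, which keeps them easy to unit-test. The field names (`created_at`, `submitted_at`, `user.login`, `body`, `state`) come from the criteria; the interface and function names are assumptions, and the accepted review states are copied from the criterion as written, so they should be checked against the state strings the Forgejo instance actually returns.

```typescript
// Minimal shapes for the fields the criteria filter on (names are assumptions).
interface IssueComment { created_at: string; user: { login: string }; body: string }
interface PullReview { submitted_at: string; user: { login: string }; state: string }
interface ProofResult { passed: boolean; reason?: string }

const PENPOT_WORKSPACE_MARKER = "penpot.app/#/workspace/";
// Copied from the reviewer criterion; verify against the real API values.
const ACCEPTED_REVIEW_STATES = ["APPROVED", "REQUEST_CHANGES", "COMMENTED"];

// Shared part of every filter: posted by the assignee, at or after task start.
// An unparseable timestamp yields NaN, which fails the comparison.
function postedByAssigneeAfterStart(
  timestamp: string, login: string, assignee: string, startedAtMs: number,
): boolean {
  return Date.parse(timestamp) >= startedAtMs && login === assignee;
}

// Designer proof: any qualifying comment that contains a Penpot workspace link.
function designerProof(
  comments: IssueComment[], assignee: string, startedAtMs: number,
): ProofResult {
  const hit = comments.some((c) =>
    postedByAssigneeAfterStart(c.created_at, c.user.login, assignee, startedAtMs) &&
    c.body.includes(PENPOT_WORKSPACE_MARKER));
  return hit
    ? { passed: true }
    : { passed: false, reason: "no post-start Penpot workspace link from assignee" };
}

// Reviewer proof: any qualifying review in one of the accepted states.
function reviewerProof(
  reviews: PullReview[], assignee: string, startedAtMs: number,
): ProofResult {
  const hit = reviews.some((r) =>
    postedByAssigneeAfterStart(r.submitted_at, r.user.login, assignee, startedAtMs) &&
    ACCEPTED_REVIEW_STATES.includes(r.state));
  return hit
    ? { passed: true }
    : { passed: false, reason: "no post-start review from assignee" };
}
```

If the API returns second-resolution timestamps while `started_at` is in milliseconds, a comment posted in the same second as the task start could be rejected; flooring `started_at` to the second before comparing is one way to avoid that.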

Backend — wire into B10 path

  • At task end (after "done — task completed"), if the task had a branch_override OR matched a non-dev/boss agent type, call COMPLETION_PROOF_RULES[type](task, started_at).
  • If proof fails: log "[suspect-completion] task <id> on <agent> completed without delivery proof — flagging", then route through B10's existing increment + re-dispatch path.
  • Behaviour beyond detection (counter persistence, dead-letter at 3, escalation via B11) is unchanged — this story only widens the detection surface.
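The wiring above might look roughly like the following. This is a sketch under assumptions, not B10's actual code: the `TaskRecord` fields other than `branch_override`, the `Decision` shape, and the idea that the caller passes in the prior failure count are all hypothetical. Only the gate condition, the log line, and the dead-letter threshold of 3 come from the criteria.

```typescript
type AgentType =
  | "dev" | "boss" | "designer" | "design-reviewer"
  | "reviewer" | "reviewer-security" | "foreman";

// Hypothetical task shape; only branch_override is named in the criteria.
interface TaskRecord { id: string; agent: string; agentType: AgentType; branch_override?: string }
interface ProofResult { passed: boolean; reason?: string }
type ProofCheck = (task: TaskRecord, started_at: number) => Promise<ProofResult>;

type Outcome = "trusted" | "re-dispatch" | "dead-letter";
interface Decision { outcome: Outcome; failureCount: number }

const DEAD_LETTER_THRESHOLD = 3; // "dead-letter at 3" from the criteria

// Gate from the first bullet: branch_override set, or a non-dev/boss agent type.
function needsProofCheck(task: TaskRecord): boolean {
  const gitAgent = task.agentType === "dev" || task.agentType === "boss";
  return task.branch_override !== undefined || !gitAgent;
}

// Log line format taken verbatim from the second bullet.
function suspectLogLine(task: TaskRecord): string {
  return `[suspect-completion] task ${task.id} on ${task.agent} completed without delivery proof — flagging`;
}

// Detection only: a failed proof increments the counter and picks the next step.
function nextOutcome(proof: ProofResult, priorFailures: number): Decision {
  if (proof.passed) return { outcome: "trusted", failureCount: priorFailures };
  const failureCount = priorFailures + 1;
  const outcome: Outcome =
    failureCount >= DEAD_LETTER_THRESHOLD ? "dead-letter" : "re-dispatch";
  return { outcome, failureCount };
}

// Called after the agent reports completion; `rules` is the proof table.
async function onTaskDone(
  task: TaskRecord,
  startedAt: number,
  rules: Record<AgentType, ProofCheck>,
  priorFailures: number,
  log: (line: string) => void,
): Promise<Decision> {
  if (!needsProofCheck(task)) return { outcome: "trusted", failureCount: priorFailures };
  const proof = await rules[task.agentType](task, startedAt);
  if (!proof.passed) log(suspectLogLine(task));
  return nextOutcome(proof, priorFailures);
}
```

Keeping `nextOutcome` free of I/O means the counter persistence and B11 escalation can stay wherever B10 already put them; this code only decides which of those existing paths to take.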

Tests

  • Unit test (completion-proof.test.ts): designer task, no Penpot URL in any post-start comment → fails proof.
  • Unit test: designer task, comment with penpot.app/#/workspace/abc posted 5 s after start by @designer → passes.
  • Unit test: reviewer task, review submitted by @reviewer post-start with state APPROVED → passes.
  • Unit test: reviewer task, no review submitted → fails proof.
  • Unit test: dev task with sha changed → reuses existing B10 path, no double-flag.
  • Integration test (server): seeded designer task with no Penpot comment → orchestrator flags + increments counter.

Out of scope

  • Per-instance proof rules (rules are per type, not per instance — keep it simple).
  • Proof rules for tester / future agent types — extend the table when those land.
  • Auto-recovery of partially-delivered work (e.g. half-created Penpot frame) — same as B10, just re-dispatch.
  • Alternative proof channels (Slack message, email) — Forgejo-only for v1.

References

  • Spec: docs/specs/automation-hardening.md — extend §4 B10 with this addendum.
  • B10 (parent): #426, apps/server/src/domain/dispatch/silent-completion.ts (or wherever B10's logic landed).
  • B11 (escalation, unchanged): #427.
  • Designer flow + Penpot MCP: docs/penpot.md, docs/design-review.md.
  • Concrete incident: designer-2 task 06299765-1760-4f25-850d-eee8157e36a3 on issue #236 at 2026-04-26 20:56 — ran 12 min, no Penpot comment, silent stall.

Dependencies

  • Land after B10 (#426) — extends its watchdog rather than replacing it.

Suggested first commit

feat(watchdog): per-agent-type completion proof (B17 — extends B10 beyond git agents)

Reference
charles/claude-hooks#446