B17 — Per-agent-type completion proof (extend B10 silent-failure detection beyond git) #446

Closed

opened 2026-04-27 12:17:14 +00:00 by claude-desktop · 0 comments

claude-desktop commented

2026-04-27 12:17:14 +00:00

Collaborator

As an orchestrator,
I want B10's silent-completion watchdog to require a per-agent-type delivery proof before trusting `done — task completed`,
so that non-git agents (designer, design-reviewer, reviewer) can't fail silently the way dev tasks could before B10.

B10 (#426) caught silent failures on `dev` / `boss` rebase tasks by checking PR head sha + duration. That works for git-producing agents. But on 2026-04-26 the designer silently failed on issue #236: the agent ran 12 min on Opus + Penpot MCP, logged success, and neither posted a Penpot link comment on the issue nor created a frame. The card stayed `IDLE-ASSIGNED` for 17 hours until it was manually retriggered.

Probable cause: cross-task session contamination (designer-2 resumed prior session from issue #291) AND/OR Penpot MCP call failed silently. Either way, the orchestrator trusted the success signal because B10's git heuristic doesn't apply to a Penpot-producing agent.

This story generalises B10 with a per-agent-type completion-proof table. Each agent type declares what counts as "actually delivered":

| Agent type | Completion proof |
|---|---|
| dev / boss | PR head sha changed (B10's existing check) |
| designer | new comment on the issue containing a `penpot.app/#/workspace/` URL, posted by the agent's forgejo_user, after task start |
| design-reviewer | new review comment on the design PR (or on the issue) by the agent's forgejo_user, after task start |
| reviewer / reviewer-security | submitted PR review (APPROVED or REQUEST_CHANGES) by the agent's forgejo_user, after task start |
| foreman | dispatched task / breakdown comment posted on the issue, after task start |

If the agent reports `done — task completed` and the proof is missing, treat it as a silent failure (same B10 path: increment counter, dead-letter at threshold, escalate via B11).
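
The silent-failure path described above can be sketched as a small pure function. This is only an illustration: `decideOnCompletion`, `Decision` and the outcome names are hypothetical, the threshold of 3 is the value given in the acceptance criteria, and leaving the counter untouched when the proof passes is an assumption rather than documented B10 behaviour.

```typescript
// Hypothetical sketch: only the flow (increment, dead-letter at threshold)
// and the threshold value of 3 come from the issue text.
type CompletionOutcome = "accept" | "redispatch" | "dead-letter";

interface Decision {
  outcome: CompletionOutcome;
  silentFailures: number; // counter value after this completion
}

const DEAD_LETTER_THRESHOLD = 3;

function decideOnCompletion(
  proofPassed: boolean,
  silentFailuresSoFar: number,
  threshold: number = DEAD_LETTER_THRESHOLD,
): Decision {
  if (proofPassed) {
    // Delivery was proven: trust the success signal.
    // Assumption: a passing proof leaves the counter unchanged.
    return { outcome: "accept", silentFailures: silentFailuresSoFar };
  }
  // Missing proof counts as one more silent failure.
  const silentFailures = silentFailuresSoFar + 1;
  return {
    outcome: silentFailures >= threshold ? "dead-letter" : "redispatch",
    silentFailures,
  };
}
```

A caller would hand the `dead-letter` outcome to the B11 escalation and the `redispatch` outcome to B10's existing re-dispatch path.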

Acceptance criteria

Backend — completion-proof table

Define COMPLETION_PROOF_RULES: Record<AgentType, ProofCheck> in apps/server/src/domain/dispatch/completion-proof.ts (new module).
ProofCheck is a function (task: TaskRecord, started_at: ms) => Promise<{ passed: boolean; reason?: string }> with the table above as concrete impls.
dev / boss proof reuses B10's existing sha-changed check.
designer proof: fetch issue comments via the Forgejo API and keep those where `created_at >= started_at` (with the `created_at` timestamp parsed to ms) AND `user.login === task.assignee` AND the body contains `penpot.app/#/workspace/`. Pass if any remain.
design-reviewer proof: fetch issue + linked-PR comments, apply the same time and author filters, and require the body to contain `frame:` or `figure:` or `penpot.app/`. Pass if any remain.
reviewer / reviewer-security proof: fetch PR reviews via `/repos/.../pulls/N/reviews`, filter `submitted_at >= started_at` AND `user.login === task.assignee` AND state in `["APPROVED", "REQUEST_CHANGES", "COMMENTED"]`. Pass if any.
foreman proof: fetch issue comments, apply the same time and author filters, and require the body to contain `Dispatched` or `Broken down` or `Skill:`. Pass if any remain.
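
The comment-based rules above share one filter, so a minimal sketch might factor it out. Everything here is an assumption for illustration: the `TaskRecord` and `IssueComment` shapes are reduced to the fields the criteria mention, `issueNumber` is a hypothetical field, and `FetchIssueComments` is a hypothetical injected fetcher standing in for the real Forgejo client.

```typescript
// Assumed minimal shapes; the repo's real types may differ.
interface TaskRecord {
  id: string;
  assignee: string; // the agent's forgejo_user
  issueNumber: number; // hypothetical field
}

interface IssueComment {
  created_at: string; // ISO 8601 timestamp string
  user: { login: string };
  body: string;
}

type ProofResult = { passed: boolean; reason?: string };
type ProofCheck = (task: TaskRecord, startedAtMs: number) => Promise<ProofResult>;

// Hypothetical fetcher, injected so rules can be tested without a network.
type FetchIssueComments = (task: TaskRecord) => Promise<IssueComment[]>;

// Pure filter: posted after task start, by the assignee, containing a marker.
function hasProofComment(
  comments: IssueComment[],
  assignee: string,
  startedAtMs: number,
  markers: string[],
): boolean {
  return comments.some(
    (c) =>
      Date.parse(c.created_at) >= startedAtMs &&
      c.user.login === assignee &&
      markers.some((m) => c.body.includes(m)),
  );
}

// Wraps the pure filter into a ProofCheck for one agent type.
function commentProof(
  fetchComments: FetchIssueComments,
  markers: string[],
  label: string,
): ProofCheck {
  return async (task, startedAtMs) => {
    const comments = await fetchComments(task);
    return hasProofComment(comments, task.assignee, startedAtMs, markers)
      ? { passed: true }
      : { passed: false, reason: `no ${label} comment by ${task.assignee} after task start` };
  };
}

// Marker strings are the ones listed in the criteria.
const DESIGNER_MARKERS = ["penpot.app/#/workspace/"];
const FOREMAN_MARKERS = ["Dispatched", "Broken down", "Skill:"];

function buildCommentRules(fetchComments: FetchIssueComments) {
  return {
    designer: commentProof(fetchComments, DESIGNER_MARKERS, "Penpot link"),
    foreman: commentProof(fetchComments, FOREMAN_MARKERS, "dispatch/breakdown"),
  };
}
```

Keeping `hasProofComment` pure means the unit tests can run on plain arrays, with the fetcher mocked only at the `ProofCheck` level.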

Backend — wire into B10 path

At task end (after `done — task completed`), if the task had a `branch_override` OR matched a non-dev/boss agent type, call `COMPLETION_PROOF_RULES[type](task, started_at)`.
If proof fails: log `[suspect-completion] task <id> on <agent> completed without delivery proof — flagging` and route through B10's existing increment + re-dispatch path.
Behaviour beyond detection (counter persistence, dead-letter at 3, escalation via B11) is unchanged — this story only widens the detection surface.
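
One way the wiring above could look, as a sketch under stated assumptions: the `TaskRecord` fields `agent` and `agentType`, the helper names, and the `flagSilentFailure` callback (a stand-in for B10's existing increment + re-dispatch path) are all hypothetical. Only `branch_override`, the rules-table lookup and the log line come from the criteria.

```typescript
type AgentType =
  | "dev" | "boss" | "designer" | "design-reviewer"
  | "reviewer" | "reviewer-security" | "foreman";

// Assumed minimal shape for illustration.
interface TaskRecord {
  id: string;
  agent: string; // agent instance name, e.g. "designer-2"
  agentType: AgentType;
  branch_override?: string;
}

type ProofResult = { passed: boolean; reason?: string };
type ProofCheck = (task: TaskRecord, startedAtMs: number) => Promise<ProofResult>;

// Proof runs for tasks with a branch_override or any non-dev/boss agent type.
function needsProofCheck(task: TaskRecord): boolean {
  const gitAgent = task.agentType === "dev" || task.agentType === "boss";
  return Boolean(task.branch_override) || !gitAgent;
}

// Log line format as given in the acceptance criteria.
function suspectCompletionLog(task: TaskRecord): string {
  return `[suspect-completion] task ${task.id} on ${task.agent} completed without delivery proof — flagging`;
}

// Returns true when the completion can be trusted.
async function verifyCompletion(
  task: TaskRecord,
  startedAtMs: number,
  rules: Record<AgentType, ProofCheck>,
  flagSilentFailure: (task: TaskRecord, reason?: string) => void,
  log: (line: string) => void = console.log,
): Promise<boolean> {
  if (!needsProofCheck(task)) return true; // left to B10's existing check
  const result = await rules[task.agentType](task, startedAtMs);
  if (!result.passed) {
    log(suspectCompletionLog(task));
    flagSilentFailure(task, result.reason);
  }
  return result.passed;
}
```

Because the counter, dead-letter and escalation logic sit behind `flagSilentFailure`, this sketch touches only detection, in line with the last criterion.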

Tests

Unit test (completion-proof.test.ts): designer task, no Penpot URL in any post-start comment → fails proof.
Unit test: designer task, comment with penpot.app/#/workspace/abc posted 5 s after start by @designer → passes.
Unit test: reviewer task, review submitted by @reviewer post-start with state APPROVED → passes.
Unit test: reviewer task, no review submitted → fails proof.
Unit test: dev task with sha changed → reuses existing B10 path, no double-flag.
Integration test (server): seeded designer task with no Penpot comment → orchestrator flags + increments counter.
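
The two reviewer cases listed above could be expressed as table-driven fixtures. This is a framework-agnostic sketch, since the issue does not name a test runner. `hasReviewProof`, `PullReview` and `failingCases` are hypothetical names, and the accepted state strings are passed in because their exact spelling should be checked against the Forgejo API.

```typescript
// Assumed minimal review shape, reduced to the fields the criteria filter on.
interface PullReview {
  submitted_at: string; // ISO 8601 timestamp string
  user: { login: string };
  state: string;
}

// Pure predicate: submitted after task start, by the assignee, in an accepted state.
function hasReviewProof(
  reviews: PullReview[],
  assignee: string,
  startedAtMs: number,
  acceptedStates: string[],
): boolean {
  return reviews.some(
    (r) =>
      Date.parse(r.submitted_at) >= startedAtMs &&
      r.user.login === assignee &&
      acceptedStates.some((s) => s === r.state),
  );
}

interface ProofCase {
  name: string;
  reviews: PullReview[];
  expected: boolean;
}

const START_MS = Date.parse("2026-04-27T09:00:00Z");

const CASES: ProofCase[] = [
  {
    name: "review by @reviewer post-start with state APPROVED passes",
    reviews: [{ submitted_at: "2026-04-27T09:05:00Z", user: { login: "reviewer" }, state: "APPROVED" }],
    expected: true,
  },
  { name: "no review submitted fails", reviews: [], expected: false },
  {
    // Extra case, not in the issue's list: a review from before task start.
    name: "review submitted before task start fails",
    reviews: [{ submitted_at: "2026-04-27T08:59:00Z", user: { login: "reviewer" }, state: "APPROVED" }],
    expected: false,
  },
];

// Returns the names of cases whose result differs from the expectation.
function failingCases(): string[] {
  return CASES
    .filter((c) => hasReviewProof(c.reviews, "reviewer", START_MS, ["APPROVED", "REQUEST_CHANGES"]) !== c.expected)
    .map((c) => c.name);
}
```

Each entry in `CASES` maps onto one `it(...)` or `test(...)` call in whichever runner the repo uses.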

Out of scope

Per-instance proof rules (rules are per type, not per instance — keep it simple).
Proof rules for tester / future agent types — extend the table when those land.
Auto-recovery of partially-delivered work (e.g. half-created Penpot frame) — same as B10, just re-dispatch.
Alternative proof channels (Slack message, email) — Forgejo-only for v1.

References

Spec: docs/specs/automation-hardening.md — extend §4 B10 with this addendum.
B10 (parent): #426, apps/server/src/domain/dispatch/silent-completion.ts (or wherever B10's logic landed).
B11 (escalation, unchanged): #427.
Designer flow + Penpot MCP: docs/penpot.md, docs/design-review.md.
Concrete incident: designer-2 task 06299765-1760-4f25-850d-eee8157e36a3 on issue #236 at 2026-04-26 20:56 — ran 12 min, no Penpot comment, silent stall.

Dependencies

Land after B10 (#426) — extends its watchdog rather than replacing it.

Suggested first commit

feat(watchdog): per-agent-type completion proof (B17 — extends B10 beyond git agents)

claude-desktop added the area:agents and type:user-story labels 2026-04-27 12:17:21 +00:00