Concurrent same-type dispatch on one issue: 30 s dedup TTL too short #597

Closed
opened 2026-04-30 20:11:31 +00:00 by claude-desktop · 0 comments

Summary

Same-type pool members can run concurrently on a single issue when the second dispatch arrives more than 30 s after the first. Observed on issue #596 (M26-4 lifecycle config): four boss task rows for the same issue, including two windows of ~1m30s where boss-default and boss-2 were both running.

Filesystem dedup masked the visible damage (worktree path is keyed by issue → last-writer wins → only one branch boss/591 + PR #596 survived), so all four task rows are marked success and no error trail exists. The token cost was paid ~4× for one logical task: 38 + 40 + 23 + 23 = 124 boss turns.

Reproduction (observed, 2026-04-30)

tasks.db::task_history for issue #596:

boss-default  21:59:27 → 22:03:16  38 turns  ← initial
reviewer-2    21:59:29 → 22:05:16  24 turns
boss-2        22:01:47 → 22:06:27  40 turns  ← overlap with both boss-default runs
boss-default  22:05:12 → 22:06:59  23 turns  ← second boss-default

A board screenshot at ~22:06 showed both boss-default (53 s) and boss-2 (4 m) cards on issue #596 in the running column.

Root cause

apps/server/src/domain/dispatch/assign-dedup.ts:

  • _assignDedup keyed by ${repo}#${issue}@${type} with a fixed 30 s TTL (line 23).
  • isDupAssign(...) checks the timestamp; once 30 s elapse the key is gone and a fresh dispatch slips through, regardless of whether the prior task is still running (see the sketch after this list).
  • cancelStaleIssueTask(...) only cancels queued tasks, and only when prev.type !== newType (line 68: if (prev.type === newType) return;). Same-type concurrent runs are unhandled.
  • Pool selector (domain/dispatch/pool.ts) then picks the idle peer (boss-default when boss-2 is busy) → second concurrent task lands cleanly.
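
A minimal sketch of the failing path, assuming the key format and 30 s TTL from assign-dedup.ts (the map shape and control flow are inferred, not verbatim source):

```ts
// Current TTL-based dedup: a clock decides, not the registry.
const DEDUP_TTL_MS = 30_000;
const _assignDedup = new Map<string, number>(); // key → last dispatch time (ms)

function isDupAssign(repo: string, issue: number, type: string): boolean {
  const key = `${repo}#${issue}@${type}`;
  const last = _assignDedup.get(key);
  const now = Date.now();
  if (last !== undefined && now - last < DEDUP_TTL_MS) {
    return true; // suppressed, but only inside the 30 s window
  }
  _assignDedup.set(key, now);
  return false; // key aged out: dispatch proceeds even if the prior task still runs
}
```

Nothing here consults the registry, so a prior task that outlives the window is invisible to the check.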

Acceptance criteria

Source of truth = registry, not a clock

  • isDupAssign(repo, issueNumber, type) returns true whenever any worker of type has a currentTask or queued task tagged with ${repo}#${issueNumber} — independent of how long ago the prior dispatch landed (see the sketch after this list).
  • Dispatch tasks carry the issue identifier (repo, issueNumber) on the worker's task record so the check is O(workers) without a separate index.
  • Drop the 30 s BoundedMap TTL bookkeeping (or keep it purely as a fast path; the registry walk is authoritative).
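
A sketch of the registry-walk check, assuming workers expose currentTask and a queue (per the registry.ts fields referenced below); the Worker shape and issueKey tag are illustrative names, not the existing API:

```ts
// Registry-backed dedup: live worker state is the source of truth.
interface DispatchTask {
  issueKey: string; // `${repo}#${issueNumber}`, stamped at dispatch time
}

interface Worker {
  type: string;
  currentTask?: DispatchTask;
  queue: DispatchTask[];
}

function isDupAssign(
  workers: Worker[],
  repo: string,
  issueNumber: number,
  type: string,
): boolean {
  const key = `${repo}#${issueNumber}`;
  // O(workers): every same-type worker is checked for a running or queued
  // task on this issue; no timestamps are involved.
  return workers.some(
    (w) =>
      w.type === type &&
      (w.currentTask?.issueKey === key ||
        w.queue.some((t) => t.issueKey === key)),
  );
}
```

Because the check reads live state rather than a timestamp, it holds for the entire lifetime of the prior task and clears the moment the registry drops it.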

Cross-type stale cancellation

  • cancelStaleIssueTask keeps its current behaviour (different-type re-dispatches cancel a queued prior task), but the log line on the "is already running" path should also surface a dispatch.duplicate_running event so the operator sees the warning on the dashboard, not just in stdout (sketched below).
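
A possible shape for the event wiring; EventBus, the payload fields, and onSameTypeRedispatch are assumptions sketched around the existing early return, not the current API:

```ts
// Hypothetical: surface the same-type duplicate as a dashboard event in
// addition to the existing log line.
interface PrevTask {
  type: string;
  workerId: string;
}

interface EventBus {
  emit(name: string, payload: unknown): void;
}

function onSameTypeRedispatch(
  events: EventBus,
  repo: string,
  issueNumber: number,
  newType: string,
  prev: PrevTask,
): void {
  if (prev.type !== newType) return; // cross-type path keeps cancelling as today
  console.warn(`${repo}#${issueNumber}: ${newType} is already running`);
  events.emit("dispatch.duplicate_running", {
    repo,
    issueNumber,
    agentType: newType,
    priorWorkerId: prev.workerId, // hypothetical field
  });
}
```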

Defense-in-depth: board warning

  • When the board view detects two cards for the same (repo, issue_number) with status === "running", render a single merged card with a running × N indicator + warning chip (a grouping sketch follows). Implementation lands behind the dedup fix, so it should never trigger in normal operation.
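
One way the merge could look in the board view; Card and runningCount are illustrative, not the existing component model:

```ts
// Group duplicate running cards for one (repo, issueNumber) into a single
// card carrying a running × N count for the warning chip.
interface Card {
  repo: string;
  issueNumber: number;
  status: string;
  worker: string;
}

type MergedCard = Card & { runningCount: number };

function mergeRunningDuplicates(cards: Card[]): MergedCard[] {
  const running = new Map<string, MergedCard>();
  const out: MergedCard[] = [];
  for (const card of cards) {
    if (card.status !== "running") {
      out.push({ ...card, runningCount: 1 });
      continue;
    }
    const key = `${card.repo}#${card.issueNumber}`;
    const seen = running.get(key);
    if (seen) {
      seen.runningCount += 1; // render "running × N" + warning chip when > 1
    } else {
      const merged: MergedCard = { ...card, runningCount: 1 };
      running.set(key, merged);
      out.push(merged);
    }
  }
  return out;
}
```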

Tests

  • Unit: isDupAssign returns true while a prior task is running, even after 60 s of simulated wall-clock time (see the test sketch after this list).
  • Unit: isDupAssign returns false once the prior task completes (registry no longer holds it).
  • Integration: two assigned webhooks for same issue 5 s apart → first dispatches, second is suppressed (covered today). Two webhooks 60 s apart → second is also suppressed (regression; fails today).
  • Backfill: query task_history for any historical concurrent runs (overlapping started_at / finished_at for the same (repo, issue_number, agent_type), i.e. a.started_at < b.finished_at AND b.started_at < a.finished_at) and emit a one-shot count to verify scope.
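
A regression sketch for the unit cases, assuming vitest and the registry-backed isDupAssign signature sketched above (the import path and worker fixtures are illustrative):

```ts
import { describe, expect, it, vi } from "vitest";
// Hypothetical path to the registry-backed variant sketched above.
import { isDupAssign } from "../domain/dispatch/assign-dedup";

describe("isDupAssign", () => {
  it("stays true while the prior task runs, even after 60 s", () => {
    vi.useFakeTimers();
    const workers = [
      { type: "boss", currentTask: { issueKey: "charles/claude-hooks#596" }, queue: [] },
      { type: "boss", currentTask: undefined, queue: [] },
    ];
    expect(isDupAssign(workers, "charles/claude-hooks", 596, "boss")).toBe(true);
    vi.advanceTimersByTime(60_000); // the old 30 s TTL would have expired here
    expect(isDupAssign(workers, "charles/claude-hooks", 596, "boss")).toBe(true);
    vi.useRealTimers();
  });

  it("returns false once the registry no longer holds the task", () => {
    const workers = [{ type: "boss", currentTask: undefined, queue: [] }];
    expect(isDupAssign(workers, "charles/claude-hooks", 596, "boss")).toBe(false);
  });
});
```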

Out of scope

  • Per-task ephemeral worktrees (would change the failure mode but not the dedup root cause).
  • Reviewer pool same-issue parallel runs are usually intentional (different reviewers asked for different perspectives) — the dedup applies to assign-dispatch, not review-requested.
  • Lazy lifecycle (M26) interaction — no change here, but worth noting that the second-task race window may grow on lazy roles where docker start adds latency.

References

  • apps/server/src/domain/dispatch/assign-dedup.ts — current TTL implementation.
  • apps/server/src/domain/dispatch/pool.ts — selector that grants the duplicate.
  • apps/server/src/domain/dispatch/registry.ts — worker currentTask + queue fields.
  • Issue #92 (original assign-dedup), #49 (sharpening) — context for why the TTL exists.
  • M26 — would expose this race more often with lazy cold-start; bug ideally fixed first.