fix(flows): seeder must repair default-flow rows that fail to compile, regardless of version #545

Closed
opened 2026-04-28 21:32:33 +00:00 by claude-desktop · 1 comment
Collaborator

User story

As an operator, I want the default-flow seeder to detect a broken DB body and repair it from source unconditionally, so that a stale flow row left over from a port rename can't silently swallow every dispatch for that trigger.

Why

On 2026-04-28 the review-requested flow stopped dispatching reviewer agents. Every PR review-request webhook logged exactly:

[flow-dispatch] flow review-requested failed to compile: node "resolve_reviewer"
(agent.resolve_by_login) has no output port "forgejo_user"

Root cause:

  • forgejo_user was renamed to git_author on agent.resolve_by_login (rename day 2026-04-16, repo memory agents.md).
  • The shipped review-requested-graph.json source was updated to use git_author correctly.
  • BUT the deployed flows SQLite row had been seeded at some intermediate version (version=2) with a body that still referenced forgejo_user, and the source JSON was at version=1.
  • Seeder logic: rewrite DB only when source version > DB version. With DB v2 > source v1, the seeder left the broken row alone forever.
  • Every pull_request.review_requested event compiled-failed silently, no reviewer dispatched, PRs sat unreviewed for hours before the operator noticed.

Fix shipped immediately: rewrote DB body from source via sqlite3 UPDATE, bumped both DB and source JSON to version=3 so they resync. PR #541 carries the source bump.

The deeper bug is the seeder design: a flow row that fails to compile is always broken, regardless of its version number. The seeder should treat a compile failure as authoritative ground for repair (or refuse to boot) — the version field exists to gate operator overrides, not to gate fixing dead defaults.

Acceptance criteria

Seeder behaviour

  • On boot, every default-flow row in flows is compiled against the live NodeRegistry. Compile-fail counts as "must repair" regardless of version.
  • When source body compiles AND the DB body fails to compile, the seeder rewrites the DB row with the source body, preserving operator-source rows (source = 'operator' is never touched even on compile fail — log and continue).
  • When BOTH source and DB fail to compile, the seeder logs [startup] flow seed <id>: source body fails compile at ERROR and the boot aborts (or process.exitCode = 1 after current init finishes). Catastrophic — needs a real fix, not a silent skip.
  • Bump-version semantics for operator edits stay intact (operator at version N+1 wins over source at N for source = 'operator' rows).

Telemetry

  • On every successful seed-time repair, log one line: [startup] flow seed <id>: repaired (DB v<old> → v<new>, source v<src>): <reason>. Reason is the compile error.
  • GET /flows/divergence/summary (or a new /flows/health if the divergence endpoint is read-only) surfaces last seed-time repair counts so the operator can spot a silent fix happening.

Tests

  • Unit: source compiles, DB doesn't (port renamed) → repaired, version reset to source's
  • Unit: source compiles, DB compiles, DB version > source → seeder no-op (current behaviour preserved for legitimate operator overrides)
  • Unit: source fails to compile → seeder logs ERROR and sets process.exitCode = 1
  • Unit: operator-source row fails to compile → seeder logs ERROR but does NOT rewrite (operator owns it)
  • Integration: simulate the 2026-04-28 scenario — DB body has forgejo_user, source has git_author, DB version > source version → seeder repairs, next dispatch succeeds

Out of scope

  • Auto-rerouting in-flight failed dispatches retroactively (the broken events that were dropped while the flow was un-compilable are gone — rerouting them is a separate issue)
  • A general "flow lint" CLI command (would be useful, but not a prerequisite)
  • Operator-row migrations on port renames (operator rows are operator-owned by design — separate issue if rename is a recurring concern)

References

  • Today's bug surface: PR #541 sat without a reviewer until the flow body was patched manually
  • Source repaired: apps/server/src/domain/flows/review-requested-graph.json v3 in PR #541
  • DB repaired: UPDATE flows SET body=?, version=3 WHERE id='review-requested'
  • Seeder code: apps/server/src/http/flows-routes.ts::seedDefaultFlow (per the default-graph.ts:35–37 comment)
  • Compile error path: [flow-dispatch] flow <id> failed to compile: … in apps/server/src/domain/flows/flow-dispatch.ts
  • Memory: feedback_dispatch_triggering_labels (forgejo_user → git_author rename day)
## User story As an operator, I want the default-flow seeder to detect a broken DB body and repair it from source unconditionally, so that a stale flow row left over from a port rename can't silently swallow every dispatch for that trigger. ## Why On 2026-04-28 the `review-requested` flow stopped dispatching reviewer agents. Every PR review-request webhook logged exactly: ``` [flow-dispatch] flow review-requested failed to compile: node "resolve_reviewer" (agent.resolve_by_login) has no output port "forgejo_user" ``` Root cause: - `forgejo_user` was renamed to `git_author` on `agent.resolve_by_login` (rename day 2026-04-16, repo memory `agents.md`). - The shipped `review-requested-graph.json` source was updated to use `git_author` correctly. - BUT the deployed `flows` SQLite row had been seeded at some intermediate version (`version=2`) with a body that still referenced `forgejo_user`, and the source JSON was at `version=1`. - Seeder logic: rewrite DB only when source version > DB version. With DB v2 > source v1, the seeder left the broken row alone forever. - Every `pull_request.review_requested` event compiled-failed silently, no reviewer dispatched, PRs sat unreviewed for hours before the operator noticed. Fix shipped immediately: rewrote DB body from source via `sqlite3 UPDATE`, bumped both DB and source JSON to `version=3` so they resync. PR #541 carries the source bump. The deeper bug is the seeder design: a flow row that fails to compile is **always** broken, regardless of its version number. The seeder should treat a compile failure as authoritative ground for repair (or refuse to boot) — the version field exists to gate operator overrides, not to gate fixing dead defaults. ## Acceptance criteria ### Seeder behaviour - [ ] On boot, every default-flow row in `flows` is compiled against the live `NodeRegistry`. Compile-fail counts as "must repair" regardless of version. - [ ] When source body compiles AND the DB body fails to compile, the seeder rewrites the DB row with the source body, **preserving operator-source rows** (`source = 'operator'` is never touched even on compile fail — log and continue). - [ ] When BOTH source and DB fail to compile, the seeder logs `[startup] flow seed <id>: source body fails compile` at ERROR and the boot aborts (or `process.exitCode = 1` after current init finishes). Catastrophic — needs a real fix, not a silent skip. - [ ] Bump-version semantics for *operator* edits stay intact (operator at version N+1 wins over source at N for `source = 'operator'` rows). ### Telemetry - [ ] On every successful seed-time repair, log one line: `[startup] flow seed <id>: repaired (DB v<old> → v<new>, source v<src>): <reason>`. Reason is the compile error. - [ ] `GET /flows/divergence/summary` (or a new `/flows/health` if the divergence endpoint is read-only) surfaces last seed-time repair counts so the operator can spot a silent fix happening. ### Tests - [ ] Unit: source compiles, DB doesn't (port renamed) → repaired, version reset to source's - [ ] Unit: source compiles, DB compiles, DB version > source → seeder no-op (current behaviour preserved for legitimate operator overrides) - [ ] Unit: source fails to compile → seeder logs ERROR and sets `process.exitCode = 1` - [ ] Unit: operator-source row fails to compile → seeder logs ERROR but does NOT rewrite (operator owns it) - [ ] Integration: simulate the 2026-04-28 scenario — DB body has `forgejo_user`, source has `git_author`, DB version > source version → seeder repairs, next dispatch succeeds ## Out of scope - Auto-rerouting in-flight failed dispatches retroactively (the broken events that were dropped while the flow was un-compilable are gone — rerouting them is a separate issue) - A general "flow lint" CLI command (would be useful, but not a prerequisite) - Operator-row migrations on port renames (operator rows are operator-owned by design — separate issue if rename is a recurring concern) ## References - Today's bug surface: PR #541 sat without a reviewer until the flow body was patched manually - Source repaired: `apps/server/src/domain/flows/review-requested-graph.json` v3 in PR #541 - DB repaired: `UPDATE flows SET body=?, version=3 WHERE id='review-requested'` - Seeder code: `apps/server/src/http/flows-routes.ts::seedDefaultFlow` (per the `default-graph.ts:35–37` comment) - Compile error path: `[flow-dispatch] flow <id> failed to compile: …` in `apps/server/src/domain/flows/flow-dispatch.ts` - Memory: `feedback_dispatch_triggering_labels` (forgejo_user → git_author rename day)
Author
Collaborator

Picked up + landing on PR #541 (commit d3f5679).

Root cause refined

After digging in, the bug had two layers, not one:

  1. Source-level inconsistencyreview-requested-graph.json carries version: 1 (which the parser at review-requested-graph.ts:42 strict-checks: if (g.version !== 1) throw … — that field is the graph DSL version, must stay 1). The flow content version lives in the .ts constant REVIEW_REQUESTED_GRAPH_VERSION = 2. The 2026-04-16 rename edited the JSON body but did NOT bump the .ts constant, so the seeder's spec-version stayed at 2.

  2. Seeder bugseedDefaultFlow compared existing.version === version and returned "unchanged". With both at 2, the body-divergent row was declared fine and never repaired.

What landed

  • Bumped REVIEW_REQUESTED_GRAPH_VERSION from 2 → 3 (forces reseed of the deployed v=2 row)
  • seedDefaultFlow now compares both version and body. Returns a new "repaired" status when version matches but body diverges. Operator rows (source = 'operator') remain untouched at any version — that contract is preserved
  • New unit test default-source row with matching version but stale body is repaired (#545) reproduces the production scenario

Acceptance-criteria coverage

  • Body comparison closes the hot path that hit production
  • Compile-validation + boot-abort on source-fail (still TODO; doesn't gate this PR — body comparison alone catches the 2026-04-28 case)
  • /flows/health telemetry endpoint surfacing repair counts (TODO — out of scope for this PR)
  • Operator-source rows preserved on body mismatch (verified by existing skipped test path)

The boot-abort + telemetry stories I'd file as follow-ups against this issue if you want to keep #545 open until they land, or close once #541 merges and reopen as a fresh narrower issue. Your call.

Picked up + landing on PR #541 (commit `d3f5679`). ### Root cause refined After digging in, the bug had two layers, not one: 1. **Source-level inconsistency** — `review-requested-graph.json` carries `version: 1` (which the parser at `review-requested-graph.ts:42` strict-checks: `if (g.version !== 1) throw …` — that field is the **graph DSL version**, must stay 1). The **flow content version** lives in the .ts constant `REVIEW_REQUESTED_GRAPH_VERSION = 2`. The 2026-04-16 rename edited the JSON body but did NOT bump the .ts constant, so the seeder's spec-version stayed at 2. 2. **Seeder bug** — `seedDefaultFlow` compared `existing.version === version` and returned "unchanged". With both at 2, the body-divergent row was declared fine and never repaired. ### What landed - Bumped `REVIEW_REQUESTED_GRAPH_VERSION` from 2 → 3 (forces reseed of the deployed v=2 row) - `seedDefaultFlow` now compares **both** version and body. Returns a new `"repaired"` status when version matches but body diverges. Operator rows (`source = 'operator'`) remain untouched at any version — that contract is preserved - New unit test `default-source row with matching version but stale body is repaired (#545)` reproduces the production scenario ### Acceptance-criteria coverage - [x] Body comparison closes the hot path that hit production - [ ] Compile-validation + boot-abort on source-fail (still TODO; doesn't gate this PR — body comparison alone catches the 2026-04-28 case) - [ ] `/flows/health` telemetry endpoint surfacing repair counts (TODO — out of scope for this PR) - [x] Operator-source rows preserved on body mismatch (verified by existing `skipped` test path) The boot-abort + telemetry stories I'd file as follow-ups against this issue if you want to keep #545 open until they land, or close once #541 merges and reopen as a fresh narrower issue. Your call.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
charles/claude-hooks#545
No description provided.