flows-yaml: runtime global-mode flip leaves dispatcher unconstructed until restart #1087

Closed
opened 2026-05-10 22:36:31 +00:00 by claude-desktop · 0 comments
Collaborator

Summary

When the service boots with global mode: "off" (and no per-flow override), main.ts:3860 short-circuits dispatcher construction:

const hasAnyLivingFlow = settings.mode !== "off" || Object.values(settings.per_flow ?? {}).some((m) => m !== "off");
if (hasAnyLivingFlow) {
    const triggerBus = createTriggerBus();
    setEventHandlersTriggerBus(triggerBus);
    // … createDispatcher() etc.
}

If the operator later flips global mode via POST /api/flows/global-mode {"mode":"live"} (the route documented at flows-routes.ts:970 as the deliberate operator override for exactly this kind of recovery), the settings row is updated and the audit event is broadcast — but no dispatcher is constructed post-hoc, no trigger bus is subscribed.

Result: a silent half-state. Webhooks arrive and [sse-broadcast] kind=issue.assigned … fires, but board cards stay idle. mode-status reports live on every flow, and no error is surfaced.

Only a service restart (via just restart or any other means) recovers, because the boot-time hasAnyLivingFlow check then sees mode=live and wires the dispatcher.
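
The gate can be read as a pure predicate over the settings row. The following is a minimal sketch, assuming a hypothetical FlowsYamlSettings type (only mode and per_flow come from the snippet above; the type names are illustrative). It shows that the predicate changes value after the flip, while nothing re-evaluates it at runtime:

```typescript
// Hypothetical types: only `mode` and `per_flow` are taken from the boot-gate snippet.
type FlowMode = "off" | "shadow" | "live";

interface FlowsYamlSettings {
  mode: FlowMode;
  per_flow?: Record<string, FlowMode>;
}

// Same expression as the boot gate, extracted so it could be re-checked after a flip.
function hasAnyLivingFlow(settings: FlowsYamlSettings): boolean {
  return (
    settings.mode !== "off" ||
    Object.values(settings.per_flow ?? {}).some((m) => m !== "off")
  );
}

const atBoot: FlowsYamlSettings = { mode: "off" };
const afterFlip: FlowsYamlSettings = { ...atBoot, mode: "live" };

console.log(hasAnyLivingFlow(atBoot)); // false: dispatcher construction is skipped
console.log(hasAnyLivingFlow(afterFlip)); // true: but only a restart re-runs the gate
```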

Reproduction

  1. Start the service with flows_yaml_settings.mode = "off" and all per_flow entries off (or omitted). Boot log shows [flows-yaml] mode=off loaded=N errors=0.
  2. curl -X POST http://127.0.0.1:4500/api/flows/global-mode -H 'Content-Type: application/json' -d '{"mode":"live"}' → {"from":"off","to":"live","changed":true}.
  3. just flows-yaml-mode-status shows global=live + every flow live.
  4. Trigger any subscribed event (e.g. assign an issue to fire issues.assigned).
  5. Observed: webhook + SSE only, no [<agent>] enqueued …, no [webhook] dispatch …, no flow_run row. Board stays idle.
  6. Run just restart. Boot log now shows [flows-yaml] mode=live loaded=N errors=0. Repeat step 4. Observed: dispatch fires immediately.

This was hit live on 2026-05-11: after #1083 (legacy engine deletion) merged with global mode left at off, the per-flow cutover step was skipped, so every event went undispatched until the symptom was diagnosed and the service restarted.

Acceptance criteria

Pick one of the two approaches below. Both are acceptable; the spec prefers (A), since the route's docstring already advertises it as a recovery mechanism that should "just work".

(A) Lazy dispatcher construction (preferred)

  • Refactor the if (hasAnyLivingFlow) block in main.ts:3860 so the trigger bus, live/shadow capability bags, per-flow router, and dispatcher can be constructed (or torn down) at any time, not only at boot.
  • POST /api/flows/global-mode and POST /api/flows/:name/mode ensure the dispatcher is alive after any flip from off → shadow|live and idle (or torn down) after a full flip back to off for every flow.
  • setEventHandlersTriggerBus() is called with the live bus whenever the dispatcher is constructed; it must be safe to call repeatedly.
  • assertCapabilitiesSatisfyOps() still runs and refuses to construct the dispatcher when a live-capability dep is missing — but the failure surfaces a 409 capabilities_missing from the mode-flip endpoint instead of being deferred to restart.
  • No leaks on repeated flip cycles (start/stop the file watcher and any background subscribers cleanly).
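
One possible shape for (A) is a small lifecycle object that both boot and the mode-flip routes call after every settings write. This is a sketch under stated assumptions, not the actual refactor: createDispatcherLifecycle, DispatcherHandle, and reconcile are hypothetical names, and build stands in for the real trigger bus, capability bag, router, and dispatcher wiring.

```typescript
type FlowMode = "off" | "shadow" | "live";

interface FlowsYamlSettings {
  mode: FlowMode;
  per_flow?: Record<string, FlowMode>;
}

// Hypothetical handle: the real createDispatcher() return type is not shown in this issue.
interface DispatcherHandle {
  stop(): void;
}

type ReconcileResult = "constructed" | "torn_down" | "unchanged";

// `build` is a placeholder for the wiring currently inside the boot-time `if` block.
function createDispatcherLifecycle(build: () => DispatcherHandle) {
  let handle: DispatcherHandle | null = null;

  return {
    isAlive: (): boolean => handle !== null,

    // Call after every settings write: at boot and from both mode-flip routes.
    reconcile(settings: FlowsYamlSettings): ReconcileResult {
      const wanted =
        settings.mode !== "off" ||
        Object.values(settings.per_flow ?? {}).some((m) => m !== "off");

      if (wanted && handle === null) {
        handle = build(); // may throw, e.g. when a capability check fails
        return "constructed";
      }
      if (!wanted && handle !== null) {
        handle.stop(); // release watchers and subscribers so flip cycles do not leak
        handle = null;
        return "torn_down";
      }
      return "unchanged"; // repeated flips to the same state do nothing
    },
  };
}

// Usage with a counting fake in place of the real wiring.
let builds = 0;
let stops = 0;
const lifecycle = createDispatcherLifecycle(() => {
  builds += 1;
  return { stop: () => { stops += 1; } };
});

const results: ReconcileResult[] = [
  lifecycle.reconcile({ mode: "off" }),
  lifecycle.reconcile({ mode: "live" }),
  lifecycle.reconcile({ mode: "live" }),
  lifecycle.reconcile({ mode: "off" }),
];
console.log(results.join(","), builds, stops);
// unchanged,constructed,unchanged,torn_down 1 1
```

Because build may throw, a route handler wrapping reconcile could catch the capability failure and map it to the 409 capabilities_missing response described above.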

(B) Refuse the flip + explicit restart

If (A) is rejected as too invasive:

  • POST /api/flows/global-mode returns 409 restart_required (with a clear detail field) when the dispatcher was not constructed at boot AND the requested mode is non-off.
  • CLI just flows-yaml-cutover NAME MODE surfaces the same 409 with a one-line operator instruction ("flip refused — dispatcher was not constructed at boot; run just restart and retry").
  • Same for POST /api/flows/:name/mode when global is off and the dispatcher was not constructed.
  • Diff-reporter / mode-status report a clear dispatcher_state: "boot_inert" field so operators can spot the half-state.
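
The refusal rule in (B) reduces to a two-input decision. A minimal sketch, assuming a hypothetical decideModeFlip helper and response shape (only the 409 status, the restart_required code, and the detail field come from the criteria above; the detail wording is adapted from the CLI message):

```typescript
type FlowMode = "off" | "shadow" | "live";

// Hypothetical response shape for illustration.
type FlipDecision =
  | { status: 200 }
  | { status: 409; error: "restart_required"; detail: string };

function decideModeFlip(
  dispatcherConstructedAtBoot: boolean,
  requested: FlowMode,
): FlipDecision {
  // Refuse only when the flip would need a dispatcher that was never built.
  if (!dispatcherConstructedAtBoot && requested !== "off") {
    return {
      status: 409,
      error: "restart_required",
      detail:
        "flip refused: dispatcher was not constructed at boot; run `just restart` and retry",
    };
  }
  return { status: 200 };
}

console.log(decideModeFlip(false, "live").status); // 409
console.log(decideModeFlip(false, "off").status); // 200
console.log(decideModeFlip(true, "live").status); // 200
```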

Tests

  • Unit test asserting the lazy-construction path (A) or the 409-refusal path (B).
  • Integration test: boot with mode=off, flip via API, fire a synthetic issues.assigned event, assert a flow_runs row appears (A) or 409 was returned (B).
  • Regression test on the boot path: existing mode=live boot still wires the dispatcher exactly as it does today.
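
The integration test for (A) could follow the shape below. This sketch uses an in-memory fake, so createFakeTriggerBus, setGlobalMode, and the flowRuns array are hypothetical stand-ins for the real trigger bus, the flip handler, and the flow_runs table:

```typescript
type FlowMode = "off" | "shadow" | "live";
type Handler = (eventKind: string) => void;

// Hypothetical in-memory stand-in for the real trigger bus.
function createFakeTriggerBus() {
  const handlers = new Set<Handler>();
  return {
    subscribe(h: Handler): () => void {
      handlers.add(h);
      return () => { handlers.delete(h); };
    },
    emit(eventKind: string): void {
      handlers.forEach((h) => h(eventKind));
    },
  };
}

const bus = createFakeTriggerBus();
const flowRuns: string[] = []; // stands in for rows in flow_runs
let unsubscribe: (() => void) | null = null;

// Approach (A): the flip handler subscribes or unsubscribes the dispatcher on demand.
function setGlobalMode(mode: FlowMode): void {
  if (mode !== "off" && unsubscribe === null) {
    unsubscribe = bus.subscribe((eventKind) => { flowRuns.push(eventKind); });
  } else if (mode === "off" && unsubscribe !== null) {
    unsubscribe();
    unsubscribe = null;
  }
}

// Boot with mode=off: the event is dropped, which is the half-state described above.
bus.emit("issues.assigned");
const runsBeforeFlip = flowRuns.length;

// Flip via the handler, then fire the same synthetic event.
setGlobalMode("live");
bus.emit("issues.assigned");
const runsAfterFlip = flowRuns.length;

console.log(runsBeforeFlip, runsAfterFlip); // 0 1
```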

Telemetry / observability

  • If approach (A): log [flows-yaml] dispatcher constructed at runtime via mode flip when post-boot construction happens.
  • If approach (B): log [flows-yaml] mode flip refused — dispatcher inert since boot, restart required when the 409 is returned.

Out of scope

  • The shadow→live 24h gate logic (evaluateCutover) — orthogonal, untouched.
  • POST /api/flows/:name/mode per-flow gate behaviour beyond ensuring the dispatcher is alive when needed.
  • Hot-reload of flow YAML on disk — already handled by createFlowWatcher.

References

  • apps/server/src/main.ts:3853-3923: hasAnyLivingFlow boot gate + dispatcher construction
  • apps/server/src/http/flows-routes.ts:970-1000: handleSetGlobalMode (current handler; only writes settings + broadcasts)
  • apps/server/src/domain/flows-yaml/dispatcher.ts: createDispatcher
  • apps/server/src/domain/flows-yaml/trigger-bus.ts: createTriggerBus
  • PR #1083 (28d1f2f6) — legacy engine deletion; the cutover checklist that operators are expected to follow post-merge
  • Incident 2026-05-11: webhook + SSE fired, dispatch silent, board card #1086 stayed idle until just restart
dev closed this issue 2026-05-11 08:30:47 +00:00
Reference
charles/claude-hooks#1087