BUG agent-type rename leaves worker registry stale — board column disappears #711

Closed
opened 2026-05-02 01:02:52 +00:00 by claude-desktop · 0 comments
Collaborator

Summary

POST /agents/types/{old}/rename (WIZ-prereq-B / #671) updates the DB + rewrites config/agents.json + reloads the in-memory webhook config — but does not re-register the in-memory worker FIFO queues under the new type names. Result: until the operator manually restarts the service, every column for a renamed type disappears from the board.

Reproduction

  1. Boot the service with boss-default, boss-2 registered (workers booted from DB at startup).
  2. Call POST /agents/types/boss/rename with { "new_name": "code-lead" }.
  3. Endpoint returns 200 + affected_rows. DB now carries code-lead-default / code-lead-2. config/agents.json is rewritten with the new key. getWebhookConfig() reflects the rename.
  4. /health still reports the old worker names (boss-default, boss-2) because the worker registry was populated at boot and the rename never re-registers.
  5. GET /board joins listResolvedAgents() (DB → code-lead-*) with probeWorker(a.name) (registry → only boss-*). Every probe returns null. instancesByType.has("code-lead") is false. typeOrder filters the type out at apps/server/src/domain/views/board.ts:593. Column gone.

Running just restart repopulates the registry from the renamed DB rows, and the column reappears.
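
The join failure in step 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in board.ts: the Agent and WorkerProbe shapes, the registry map, and visibleTypes are hypothetical stand-ins. Only the names probeWorker and instancesByType come from this report.

```typescript
// Hypothetical shapes; the real ones live in board.ts and the worker module.
interface Agent { name: string; type: string }
interface WorkerProbe { queueDepth: number; busy: boolean }

// Registry as populated at boot, keyed by <type>-<instance>.
const bootRegistry = new Map<string, WorkerProbe>([
  ["boss-default", { queueDepth: 0, busy: false }],
  ["boss-2", { queueDepth: 1, busy: true }],
]);

// Stand-in for probeWorker: null when nothing is registered under the name.
function probeWorker(name: string): WorkerProbe | null {
  return bootRegistry.get(name) ?? null;
}

// Stand-in for the board join: a type with no live probes gets no column.
function visibleTypes(agents: Agent[]): string[] {
  const instancesByType = new Map<string, WorkerProbe[]>();
  for (const a of agents) {
    const probe = probeWorker(a.name);
    if (probe === null) continue; // every renamed agent falls through here
    const list = instancesByType.get(a.type) ?? [];
    list.push(probe);
    instancesByType.set(a.type, list);
  }
  return [...instancesByType.keys()];
}

// DB rows after the rename: the names moved, the registry did not.
const afterRename: Agent[] = [
  { name: "code-lead-default", type: "code-lead" },
  { name: "code-lead-2", type: "code-lead" },
];
console.log(visibleTypes(afterRename)); // empty: no column for code-lead
```

Because each lookup misses silently rather than throwing, the endpoint still succeeds and the only visible symptom is the missing column.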

Root cause

apps/server/src/http/handlers/agent-type-rename.ts runs (in order) post-commit:

  • rewriteAgentsJson(configPath, oldName, newName) — disk file rewrite.
  • loadWebhookConfig(configPath) — refresh the config cache.
  • enqueueRender(row.name) for every renamed agent — refresh the per-agent env directory (agent-env-sync.renderForInstance).
  • broadcastSSE({ type: "agent_type_renamed", … }) — dashboard cache invalidation.

Missing: nothing touches the worker FIFO registry. The registry is keyed by <type>-<instance> and was populated at boot. Renaming the type changes the DB row's type column but the in-memory registry entries never move.

Acceptance criteria

Worker registry refresh

  • After the DB transaction commits, the rename handler unregisters each worker that was renamed (old ${old}-${suffix} keys) and re-registers it under the new ${new}-${suffix} key, preserving currentTask / queue so an in-flight task is not orphaned.
  • If a worker is busy mid-rename, the new registration inherits its current slot — the task continues to drain through the freshly-keyed worker; /cancel + /history continue to work via the new name (and the old name 404s).
  • Queue depth on the renamed worker is preserved exactly (no drops).
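
One way to meet the three criteria above is to move the existing worker object to its new key instead of constructing a fresh worker. The sketch below assumes the registry is a Map keyed by worker name and that the handler can pass the old and new names of each renamed DB row. The Worker, Task, and Rename shapes and the rekeyWorkers helper are hypothetical, not the project's actual API.

```typescript
// Hypothetical shapes: the real worker type lives wherever the FIFO registry does.
interface Task { id: string }
interface Worker { name: string; currentTask: Task | null; queue: Task[] }
interface Rename { oldName: string; newName: string }

// Moves each renamed worker to its new key. The Worker object is reused,
// so currentTask and queue carry over untouched and in the same order.
function rekeyWorkers(registry: Map<string, Worker>, renames: Rename[]): string[] {
  // Validate before mutating so a collision leaves the registry unchanged.
  for (const { oldName, newName } of renames) {
    if (oldName !== newName && registry.has(newName)) {
      throw new Error(`worker ${newName} is already registered`);
    }
  }
  const moved: string[] = [];
  for (const { oldName, newName } of renames) {
    const worker = registry.get(oldName);
    if (worker === undefined) continue; // DB row without a live worker
    registry.delete(oldName);
    worker.name = newName;
    registry.set(newName, worker);
    moved.push(newName);
  }
  return moved;
}

// Usage: one busy worker with two queued tasks, one idle worker.
const registry = new Map<string, Worker>([
  ["boss-default", { name: "boss-default", currentTask: { id: "t1" }, queue: [{ id: "t2" }, { id: "t3" }] }],
  ["boss-2", { name: "boss-2", currentTask: null, queue: [] }],
]);
const moved = rekeyWorkers(registry, [
  { oldName: "boss-default", newName: "code-lead-default" },
  { oldName: "boss-2", newName: "code-lead-2" },
]);
console.log(moved.length, registry.has("boss-default")); // 2 false
```

Reusing the object means any drain loop that already holds a reference to the worker keeps running against the same queue, which is what keeps an in-flight task from being orphaned. Driving the move from the renamed rows, rather than matching on a `${old}-` prefix, avoids touching a different type whose name happens to start with the old one.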

Container reconcile

  • After re-registration, fire reconcileAgentOne(name, …) for each new name so the container is reconciled under the new name (per-agent CLAUDE_CONFIG_DIR mount key changes; the container itself is named claude-hooks-<type>-<instance>).
  • Reconcile failures log + don't unwind — DB is the source of truth; the watchdog picks up missed reconciliations.
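
The log-and-continue behaviour described above might look like the sketch below. The reconcile callback stands in for reconcileAgentOne, whose remaining arguments are elided in this report. The reconcileRenamed helper, its signature, and the log parameter are assumptions made for illustration.

```typescript
// Stand-in for reconcileAgentOne; the real function takes further arguments.
type Reconcile = (name: string) => Promise<void>;

// Reconciles every renamed agent. A failure is logged and reported back but
// never rethrown, so the already-committed rename is not unwound.
async function reconcileRenamed(
  names: string[],
  reconcile: Reconcile,
  log: (msg: string) => void = console.error,
): Promise<string[]> {
  const outcomes = await Promise.all(
    names.map(async (name) => {
      try {
        await reconcile(name);
        return null;
      } catch (err) {
        log(`reconcile failed for ${name}: ${String(err)}`);
        return name; // left for the watchdog to pick up
      }
    }),
  );
  return outcomes.filter((n): n is string => n !== null);
}

// Usage with a fake reconcile that fails for one agent.
const fakeReconcile: Reconcile = async (name) => {
  if (name === "code-lead-2") throw new Error("container not found");
};
void reconcileRenamed(["code-lead-default", "code-lead-2"], fakeReconcile).then((failed) => {
  console.log(failed.length); // 1
});
```

Catching inside each mapped call keeps one failing container from hiding the results of the others, and the returned list gives the handler something concrete to log or surface in its response.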

Tests

  • Unit: rename boss → code-lead. Assert that the worker registry has code-lead-default / code-lead-2 (and not boss-*) after the call returns.
  • Unit: rename while a worker is busy. Assert currentTask carries over to the new key, the old key is unregistered, and /health reflects the new name.
  • Unit: rename while there are queued tasks. Assert queue.length matches before / after; the queued items are still in FIFO order under the new name.
  • Integration: GET /board immediately after rename renders a column for the new type with the right capacity / in_flight numbers (no missing column).

Backfill (operator path)

  • Document — in the rename handler's docstring + the WIZ-prereq-B section of specs/first-login-wizard.md — that operators on a service older than this fix can run just restart to recover. Already standard, but worth flagging where the symptom shows up.

Out of scope

  • Multi-rename atomicity (renaming two types in one transaction).
  • Renaming host / architect / other reserved-name guards — those are already enforced upstream.
  • Worker pool resizing (separate concern; covered by pool_sizes).
  • Container image / CLAUDE_CONFIG_DIR migrations — the per-agent dir is keyed by <type> so a rename necessarily generates a fresh dir. The reconcile call above ensures it's rendered.

References

  • apps/server/src/http/handlers/agent-type-rename.ts — handler that needs the new step.
  • apps/server/src/domain/views/board.ts:546-553, 593 — where the join silently drops the renamed type.
  • apps/server/src/infrastructure/container/container-reconcile.ts — reconcileOne to call post-rename.
  • apps/server/src/background/worker.ts (or wherever the FIFO registry lives) — for the unregister + register surface.
  • WIZ-prereq-B (#671) — original rename ticket.
  • Discovered 2026-05-02: live boss-* / code-lead-* mismatch on forge.jacquin.app.
Reference
charles/claude-hooks#711