M26-1 Container lifecycle module + state machine #588

Closed
opened 2026-04-30 19:31:00 +00:00 by claude-desktop · 0 comments
Collaborator

User story

As a service operator, I want a per-instance lifecycle module that tracks Stopped → Starting → Running → Stopping for every container, so that lazy roles can transition safely between states without races between dispatch enqueue and idle-stop.

Acceptance criteria

Module

  • New apps/server/src/infrastructure/container/lifecycle.ts exports a getLifecycle(name) factory returning a per-instance state holder with state, acquireForEnqueue(), acquireForIdleStop(), and markStopped() / markRunning() transitions.
  • State enum: Stopped | Starting | Running | Stopping. Hot instances live exclusively in Running (or Missing, owned by watchdog).
  • Per-instance in-process Mutex guards every transition. Held during docker start, docker stop, and the readiness probe — not during the actual agent task body.

Enqueue path

  • enqueue for a lazy instance acquires the lock, observes state, and: if Stopped → fires docker start, runs readiness probe, transitions Running; if Starting → awaits the in-flight transition; if Running → no-op.
  • Lock released before the dispatch returns to caller — the agent task itself runs lock-free.

Idle-stop path

  • On task completion (lazy roles only) schedule a one-shot timer at idle_stop_seconds. On fire: acquire lock, re-check queue.length === 0 && currentTask === null && lifecycle === lazy, transition Stopping, docker stop, transition Stopped. Re-arm cancelled if a new task lands.

Readiness probe

  • After docker start, run docker exec <name> sh -c 'true' with 5 s timeout. Retry up to 3× with 200 ms backoff. Failure surfaces a container.lazy_start_failed event (event wiring lands in M26-5) and the dispatch fails with a clear error.

Tests

  • Unit cover: enqueue while Stopped (starts container), enqueue while Starting (awaits), idle-tick while queue non-empty (no-op), idle-tick while in-flight task (no-op), concurrent enqueue + idle-tick (no double start, no exec into Exited).
  • Tests use a fake DockerRunner (same harness as container.test.ts / container-reconcile.test.ts).

Out of scope

  • Reconcile boot behaviour (M26-2).
  • Watchdog event suppression (M26-2).
  • Pool selection rules (M26-3).
  • Config schema / dashboard surface (M26-4).
  • SSE / metrics events (M26-5).

References

  • specs/container-lazy-lifecycle.md — full spec.
  • apps/server/src/infrastructure/container/container.ts — existing docker runner.
  • apps/server/src/domain/dispatch/registry.ts:61 — current onEnqueue hook to wrap.
  • M25 (#agentic dispatch port) for surface conventions.
## User story As a service operator, I want a per-instance lifecycle module that tracks `Stopped → Starting → Running → Stopping` for every container, so that lazy roles can transition safely between states without races between dispatch enqueue and idle-stop. ## Acceptance criteria ### Module - [ ] New `apps/server/src/infrastructure/container/lifecycle.ts` exports a `getLifecycle(name)` factory returning a per-instance state holder with `state`, `acquireForEnqueue()`, `acquireForIdleStop()`, and `markStopped()` / `markRunning()` transitions. - [ ] State enum: `Stopped | Starting | Running | Stopping`. Hot instances live exclusively in `Running` (or `Missing`, owned by watchdog). - [ ] Per-instance in-process `Mutex` guards every transition. Held during `docker start`, `docker stop`, and the readiness probe — **not** during the actual agent task body. ### Enqueue path - [ ] `enqueue` for a lazy instance acquires the lock, observes state, and: if `Stopped` → fires `docker start`, runs readiness probe, transitions `Running`; if `Starting` → awaits the in-flight transition; if `Running` → no-op. - [ ] Lock released before the dispatch returns to caller — the agent task itself runs lock-free. ### Idle-stop path - [ ] On task completion (lazy roles only) schedule a one-shot timer at `idle_stop_seconds`. On fire: acquire lock, re-check `queue.length === 0 && currentTask === null && lifecycle === lazy`, transition `Stopping`, `docker stop`, transition `Stopped`. Re-arm cancelled if a new task lands. ### Readiness probe - [ ] After `docker start`, run `docker exec <name> sh -c 'true'` with 5 s timeout. Retry up to 3× with 200 ms backoff. Failure surfaces a `container.lazy_start_failed` event (event wiring lands in M26-5) and the dispatch fails with a clear error. ### Tests - [ ] Unit cover: enqueue while `Stopped` (starts container), enqueue while `Starting` (awaits), idle-tick while queue non-empty (no-op), idle-tick while in-flight task (no-op), concurrent enqueue + idle-tick (no double start, no `exec into Exited`). - [ ] Tests use a fake `DockerRunner` (same harness as `container.test.ts` / `container-reconcile.test.ts`). ## Out of scope - Reconcile boot behaviour (M26-2). - Watchdog event suppression (M26-2). - Pool selection rules (M26-3). - Config schema / dashboard surface (M26-4). - SSE / metrics events (M26-5). ## References - `specs/container-lazy-lifecycle.md` — full spec. - `apps/server/src/infrastructure/container/container.ts` — existing docker runner. - `apps/server/src/domain/dispatch/registry.ts:61` — current `onEnqueue` hook to wrap. - M25 (#agentic dispatch port) for surface conventions.
Sign in to join this conversation.
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
charles/claude-hooks#588
No description provided.