fix(cursor): wrap cwd in symlink dir so SDK doesn't blow HTTP/2 frame #1027

Merged
charles merged 3 commits from fix/cursor-sdk-cwd-frame-size into main 2026-05-09 22:00:40 +00:00
Collaborator

Summary

Cursor SDK 1.0.12 indexes the entire `local.cwd` tree on `Agent.send()` and stuffs the file list into the initial HTTP/2 frame. A Bun monorepo with `node_modules/` (~30k paths) overflows Node's default 16 KB `SETTINGS_MAX_FRAME_SIZE` → `NGHTTP2_FRAME_SIZE_ERROR` on every dispatch that picked the cursor provider. `.cursorignore` is **not** honoured for this code path in 1.0.12.

**Fix:** build a per-task wrapper directory of symlinks to the worktree's top-level entries minus `{node_modules, .git, dist, build, .turbo, .next, .cache, coverage}`. The SDK does not traverse symlinks, so the indexed tree stays small. Tools (Read/Bash/Grep) still resolve real source paths through the symlinks. The wrapper is torn down in the runTask `finally`.
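
A minimal sketch of the wrapper mechanism. `buildCursorCwd` is the helper named in the test plan; `teardownCursorCwd` and the exact temp-dir layout are illustrative assumptions:

```ts
import { mkdtempSync, readdirSync, rmSync, symlinkSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

const EXCLUDED = new Set([
  "node_modules", ".git", "dist", "build", ".turbo", ".next", ".cache", "coverage",
]);

// Build a throwaway directory of symlinks to the worktree's top-level
// entries, skipping the heavy ones the SDK would otherwise index.
function buildCursorCwd(worktree: string): string {
  const wrapper = mkdtempSync(join(tmpdir(), "cursor-cwd-"));
  for (const entry of readdirSync(worktree)) {
    if (EXCLUDED.has(entry)) continue;
    // The SDK indexes the wrapper but does not traverse symlinks, while
    // Read/Bash/Grep still resolve real source paths through them.
    symlinkSync(join(worktree, entry), join(wrapper, entry));
  }
  return wrapper;
}

// For the runTask finally block: removes only the symlinks, never the
// real worktree entries behind them.
function teardownCursorCwd(wrapper: string): void {
  rmSync(wrapper, { recursive: true, force: true });
}
```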

What this also fixes

Adjacent failure modes uncovered while diagnosing:

  • **`replayConversationForAgent` skips for non-`bc-` ids.** SDK 1.0.12 strict-rejects forced `runtime: "cloud"` `listRuns` on local-runtime ids with `Agent ID must be in the format 'bc-<uuid>'`. Replay only fires for cloud agents now.
  • **`agent.send()` recovers from `UnknownAgentError: already has active run`.** Stale `active_run_id` in the local SDK store after a prior crashed run made every subsequent kick fail. Recovery now disposes the resumed agent, mints a fresh one via `Agent.create`, retries once. `cursor_init` is yielded **after** recovery so the agent-runner persists the FRESH session id — breaks the otherwise-permanent recovery loop.
  • **`agent.send()` is now raced against a 90s timeout + abort signal.** A wedged pre-stream phase no longer pins the worker. Emits `cursor_send_failed` with `elapsed_ms` + reason on timeout/abort.
  • **Process-level `unhandledRejection` / `uncaughtException` handlers** in `main.ts`. Cursor SDK throws unhandled HTTP/2 stream errors from connectrpc that previously took down the whole service (every worker), not just the offending one. A minimal sketch of these handlers follows this list.
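
A minimal sketch of the process-level guards, assuming the real handlers route through the service logger rather than `console`:

```ts
// main.ts — keep the service alive when the Cursor SDK throws unhandled
// HTTP/2 stream errors out of connectrpc.
process.on("unhandledRejection", (reason) => {
  // Without a handler, one bad stream rejection kills every worker,
  // not just the offending one.
  console.error("unhandledRejection:", reason);
});

process.on("uncaughtException", (err) => {
  console.error("uncaughtException:", err);
});
```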

Diagnostic flow

  1. First kick → `validation_error` on replay.
  2. Fixed replay path → second kick → `UnknownAgentError: already has active run` from a stale `active_run_id`.
  3. Fixed recovery path → third kick → `NGHTTP2_FRAME_SIZE_ERROR` from connectrpc.
  4. Standalone reproducer (`apps/server/src/repro/cursor-repro.ts`) confirmed: bare cwd works, real worktree crashes, symlinked worktree (no `node_modules`) works.

Repro

`apps/server/src/repro/cursor-repro.ts` — standalone script that isolates `Agent.create` + `agent.send` against the real cursor cloud (decrypts the API key from the `secret` table). Pulled out of the runner stack so future SDK regressions can be triaged in isolation.

CLAUDE_HOOKS_SECRET_KEY=… bun run apps/server/src/repro/cursor-repro.ts
# REPRO_CWD=/path/to/repo to test against a real worktree
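
A hedged skeleton of what the reproducer does. Only `Agent.create` and `agent.send` are confirmed SDK entry points in this PR; the option shape below is an assumption, and the secret-table decryption step is elided:

```ts
import { Agent } from "@cursor/sdk";

// REPRO_CWD points the SDK at a real worktree; default to the current dir.
const cwd = process.env.REPRO_CWD ?? process.cwd();

// Assumed option shape — the PR only tells us the SDK indexes `local.cwd`.
const agent = await Agent.create({ local: { cwd } });

// The three cases this isolates: bare cwd works, a worktree with
// node_modules blows the HTTP/2 frame, a symlinked wrapper works.
const run = await agent.send("List the top-level entries of this repo.");
console.log("run started:", run);
```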

Test plan

  • `bun test apps/server/src/infrastructure/agent/cursor-sdk-adapter.test.ts` (43/43 pass)
  • `just typecheck` clean
  • Reproducer with `REPRO_CWD=/home/charles/Workspace/claude-hooks` succeeds (was crashing pre-fix; equivalent flow now goes through `buildCursorCwd`)
  • Live kick #1017 streamed events end-to-end after restart

🤖 Generated with Claude Code

fix(cursor): wrap cwd in symlink dir so SDK doesn't blow HTTP/2 frame
All checks were successful
qa / dockerfile (pull_request) Successful in 16s
qa / i18n-string-check (pull_request) Successful in 16s
qa / sql-layer-check (pull_request) Successful in 16s
qa / db-schema (pull_request) Successful in 21s
qa / qa-1 (pull_request) Successful in 3m39s
qa / qa (pull_request) Successful in 0s
3e16e911fa
Cursor SDK 1.0.12 indexes the entire `local.cwd` tree on `Agent.send()`
and stuffs the file list into the initial HTTP/2 frame. A Bun monorepo
with `node_modules/` (~30k paths) overflows Node's default 16 KB
SETTINGS_MAX_FRAME_SIZE → `NGHTTP2_FRAME_SIZE_ERROR` on every dispatch
that picked the cursor provider. `.cursorignore` is not honoured for
this code path.

Fix: build a per-task wrapper directory of symlinks to the worktree's
top-level entries minus `{node_modules, .git, dist, build, .turbo,
.next, .cache, coverage}`. The SDK does not traverse symlinks, so the
indexed tree stays small. Tools (Read/Bash/Grep) still resolve real
source paths through the symlinks. Wrapper torn down in the runTask
finally.

Also fixes adjacent failure modes uncovered while diagnosing this:

  - `replayConversationForAgent` skips for non-`bc-` (local-runtime)
    agent ids — SDK 1.0.12 strict-rejects forced `runtime: "cloud"`
    listRuns on local ids with `Agent ID must be in the format
    'bc-<uuid>'`.
  - `agent.send()` recovers from `UnknownAgentError: already has
    active run` (stale `active_run_id` in the local SDK store after a
    prior crashed run): dispose the resumed agent, mint a fresh one
    via `Agent.create`, retry once. `cursor_init` is now yielded AFTER
    the recovery so the agent-runner persists the FRESH session id —
    breaks the otherwise-permanent recovery loop.
  - `agent.send()` is now raced against a 90s timeout + abort signal
    so a wedged pre-stream phase no longer pins the worker. Emits
    `cursor_send_failed` with `elapsed_ms` + reason on timeout/abort.
  - Process-level `unhandledRejection` / `uncaughtException` handlers
    in `main.ts` log the failure but keep the service alive — cursor
    SDK throws unhandled HTTP/2 stream errors from connectrpc that
    would otherwise take down every worker, not just the offending one.

`apps/server/src/repro/cursor-repro.ts` is a standalone script that
isolates Agent.create + agent.send against the real cursor cloud
(decrypts the API key from the secret table). Pulled out of the
runner stack so future SDK regressions can be triaged in isolation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reviewer left a comment
  • **behavior** `cursor-sdk-adapter.ts` stale-recovery retry has no timeout. In the `isStaleActiveRun` catch branch (~line 1641), `run = await agent.send(first.value, sendOpts)` is a bare `await` with no `Promise.race`. The original `sendTimer` fires after 90 s and produces an unhandled rejection (logged, but harmless), while the `await` remains pending indefinitely — worker pinned. Fix: race the retry with a fresh timeout promise, same pattern as the first send:
    const retryTimeout = new Promise<never>((_, r) =>
      setTimeout(() => r(new Error(`agent.send() retry timed out after ${SEND_TIMEOUT_MS}ms`)), SEND_TIMEOUT_MS)
    );
    run = (await Promise.race([agent.send(first.value, sendOpts), retryTimeout])) as import("@cursor/sdk").Run;
    
fix(cursor): race stale-recovery retry against timeout + abort
All checks were successful
qa / dockerfile (pull_request) Successful in 9s
qa / db-schema (pull_request) Successful in 14s
qa / i18n-string-check (pull_request) Successful in 12s
qa / sql-layer-check (pull_request) Successful in 8s
qa / qa-1 (pull_request) Successful in 2m31s
qa / qa (pull_request) Successful in 0s
2c400d5f5b
Reviewer flagged: the `isStaleActiveRun` retry path called
`run = await agent.send(...)` as a bare await with no `Promise.race`.
The original `sendTimer` had already been cleared in the outer try's
finally block, so a wedged retry would pin the worker indefinitely.

Mirror the same `Promise.race(send, timeout, abort)` pattern as the
first send: 90s `SEND_TIMEOUT_MS` ceiling + abort-signal short-circuit,
clearTimeout in finally so the timer does not fire after a successful
retry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author
Collaborator

Fixed in 2c400d5f. Stale-recovery retry now races `agent.send()` against `SEND_TIMEOUT_MS` + abort signal, mirroring the first send. `clearTimeout` in finally so the retry timer can't fire post-success. Tests + typecheck clean.
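
A minimal sketch of that race, with `raceSend` as a hypothetical wrapper name (the adapter inlines the pattern in the retry branch) and `SEND_TIMEOUT_MS` taken from the discussion above:

```ts
const SEND_TIMEOUT_MS = 90_000;

async function raceSend<T>(send: Promise<T>, signal: AbortSignal): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`agent.send() retry timed out after ${SEND_TIMEOUT_MS}ms`)),
      SEND_TIMEOUT_MS,
    );
  });
  const aborted = new Promise<never>((_, reject) => {
    if (signal.aborted) return reject(new Error("aborted before retry"));
    signal.addEventListener("abort", () => reject(new Error("aborted")), { once: true });
  });
  try {
    // First settle wins: a wedged retry can no longer pin the worker.
    return await Promise.race([send, timeout, aborted]);
  } finally {
    clearTimeout(timer); // the retry timer must never fire after success
  }
}
```
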
fix(cursor): stall watchdog wakes the stream race when both racers hang
All checks were successful
qa / sql-layer-check (pull_request) Successful in 9s
qa / i18n-string-check (pull_request) Successful in 10s
qa / dockerfile (pull_request) Successful in 14s
qa / db-schema (pull_request) Successful in 16s
qa / qa-1 (pull_request) Successful in 2m27s
qa / qa (pull_request) Successful in 0s
08c8c3e339
The stall timer set `stalled = true` but never broke
`Promise.race(streamPromise, deltaPromise)`. When cursor cloud goes
silent and the delta queue is empty, both racers hang forever, the
race never resolves, and the loop never reaches `yieldStalledIfArmed`.
Net effect: `cursor_stalled` is never yielded, the worker stays
busy past the 5-min threshold with no signal to the operator.

Fix: make the stall a deferred promise that resolves when the timer
fires, include it as a third racer in `Promise.race`. On stall-wake,
yield `cursor_stalled`, re-arm the timer + replace the promise so a
sustained silence keeps producing periodic events.

Regression test pulls the generator with a 500ms ceiling and asserts
the event fires from the timer alone (no external `__pushEvent` call
to wake the loop).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author
Collaborator

Pushed 08c8c3e3 — same class of bug spotted in `streamRunWithStallAndAbort`. The stall timer set `stalled = true` but never broke `Promise.race(streamPromise, deltaPromise)`. Cloud silent + delta queue empty → both racers hang forever, race never resolves, `cursor_stalled` never yields. Repro'd live: two `dev` workers wedged 26+ min past `agent.send()` with zero `cursor_stalled` signal.

Fix: stall is now a deferred promise resolved by the timer, included as a third racer. On stall-wake yields `cursor_stalled`, re-arms timer, replaces promise so sustained silence keeps producing periodic events. New regression test asserts the event fires from the timer alone (no external `__pushEvent` wake) within 500ms of a 30ms threshold.
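
A minimal sketch of the deferred-racer pattern, assuming illustrative names (`deferred`, `withStallWatchdog`, `STALL_THRESHOLD_MS`) rather than the adapter's actual identifiers:

```ts
const STALL_THRESHOLD_MS = 5 * 60_000; // "5-min threshold" above; the test uses 30ms

type Deferred<T> = { promise: Promise<T>; resolve: (v: T) => void };
function deferred<T>(): Deferred<T> {
  let resolve!: (v: T) => void;
  const promise = new Promise<T>((r) => (resolve = r));
  return { promise, resolve };
}

// Races the real event source against a stall deferred that the timer
// resolves, so total silence still wakes the loop and yields periodically.
async function* withStallWatchdog<T>(
  source: () => Promise<T>, // stands in for Promise.race(streamPromise, deltaPromise)
): AsyncGenerator<T | { type: "cursor_stalled" }> {
  let pending = source();
  while (true) {
    const stall = deferred<"stalled">();
    const timer = setTimeout(() => stall.resolve("stalled"), STALL_THRESHOLD_MS);
    const winner = await Promise.race([pending, stall.promise]);
    clearTimeout(timer);
    if (winner === "stalled") {
      // Both real racers are still hanging: yield the watchdog event, then
      // loop back to arm a fresh deferred + timer, so sustained silence
      // keeps producing periodic cursor_stalled events.
      yield { type: "cursor_stalled" };
      continue;
    }
    yield winner as T;
    pending = source(); // real progress: fetch the next event
  }
}
```
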
charles deleted branch fix/cursor-sdk-cwd-frame-size 2026-05-09 22:00:40 +00:00