fix(cursor): wrap cwd in symlink dir so SDK doesn't blow HTTP/2 frame #1027
charles/claude-hooks!1027
Summary

Cursor SDK 1.0.12 indexes the entire `local.cwd` tree on `Agent.send()` and stuffs the file list into the initial HTTP/2 frame. A Bun monorepo with `node_modules/` (~30k paths) overflows Node's default 16 KB `SETTINGS_MAX_FRAME_SIZE` → `NGHTTP2_FRAME_SIZE_ERROR` on every dispatch that picked the cursor provider. `.cursorignore` is not honoured for this code path in 1.0.12.

Fix: build a per-task wrapper directory of symlinks to the worktree's top-level entries minus `{node_modules, .git, dist, build, .turbo, .next, .cache, coverage}`. The SDK does not traverse symlinks, so the indexed tree stays small. Tools (Read/Bash/Grep) still resolve real source paths through the symlinks. Wrapper torn down in the `runTask` `finally`.

What this also fixes

Adjacent failure modes uncovered while diagnosing:

- `replayConversationForAgent` skips for non-`bc-` ids. SDK 1.0.12 strict-rejects forced `runtime: "cloud"` `listRuns` on local-runtime ids with `Agent ID must be in the format 'bc-<uuid>'`. Replay only fires for cloud agents now.
- `agent.send()` recovers from `UnknownAgentError: already has active run`. Stale `active_run_id` in the local SDK store after a prior crashed run made every subsequent kick fail. Recovery now disposes the resumed agent, mints a fresh one via `Agent.create`, retries once. `cursor_init` is yielded after recovery so the agent-runner persists the FRESH session id, which breaks the otherwise-permanent recovery loop.
- `agent.send()` is now raced against a 90s timeout + abort signal. A wedged pre-stream phase no longer pins the worker. Emits `cursor_send_failed` with `elapsed_ms` + reason on timeout/abort.
- `unhandledRejection` / `uncaughtException` handlers in `main.ts`. Cursor SDK throws unhandled HTTP/2 stream errors from connectrpc that previously took down the whole service (every worker), not just the offending one.

Diagnostic flow

- `validation_error` on replay.
- `UnknownAgentError: already has active run` from a stale `active_run_id`.
- `NGHTTP2_FRAME_SIZE_ERROR` from connectrpc.
- `apps/server/src/repro/cursor-repro.ts` confirmed: bare cwd works, real worktree crashes, symlinked worktree (no `node_modules`) works.

Repro

`apps/server/src/repro/cursor-repro.ts`: standalone script that isolates `Agent.create` + `agent.send` against the real cursor cloud (decrypts the API key from the `secret` table). Pulled out of the runner stack so future SDK regressions can be triaged in isolation.

Test plan

- `bun test apps/server/src/infrastructure/agent/cursor-sdk-adapter.test.ts` (43/43 pass)
- `just typecheck` clean
- `REPRO_CWD=/home/charles/Workspace/claude-hooks` succeeds (was crashing pre-fix, equivalent flow now goes through `buildCursorCwd`)

🤖 Generated with Claude Code
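The wrapper-directory approach from the summary can be sketched roughly as follows. This is a minimal illustration, not the PR's `buildCursorCwd`: the helper name `buildWrapperDir`, its return shape, and the temp-directory prefix are assumptions made for the example. Only the exclusion list comes from the summary above.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Top-level entries left out of the wrapper (list taken from the PR summary).
const EXCLUDED = new Set([
  "node_modules", ".git", "dist", "build", ".turbo", ".next", ".cache", "coverage",
]);

// Hypothetical helper: creates a temp directory holding one symlink per
// top-level worktree entry, skipping the excluded names.
function buildWrapperDir(worktree: string): { dir: string; cleanup: () => void } {
  // Resolve to an absolute path so the link targets do not depend on the cwd.
  const root = path.resolve(worktree);
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "cursor-cwd-"));
  for (const name of fs.readdirSync(root)) {
    if (EXCLUDED.has(name)) continue;
    fs.symlinkSync(path.join(root, name), path.join(dir, name));
  }
  // Removing the wrapper unlinks the symlinks only; the real files stay put.
  const cleanup = () => fs.rmSync(dir, { recursive: true, force: true });
  return { dir, cleanup };
}
```

A caller would pass `dir` as the agent's cwd inside a `try` block and call `cleanup()` in the matching `finally`, which mirrors the teardown described for `runTask`.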
Cursor SDK 1.0.12 indexes the entire `local.cwd` tree on `Agent.send()` and stuffs the file list into the initial HTTP/2 frame. A Bun monorepo with `node_modules/` (~30k paths) overflows Node's default 16 KB `SETTINGS_MAX_FRAME_SIZE` → `NGHTTP2_FRAME_SIZE_ERROR` on every dispatch that picked the cursor provider. `.cursorignore` is not honoured for this code path.

Fix: build a per-task wrapper directory of symlinks to the worktree's top-level entries minus `{node_modules, .git, dist, build, .turbo, .next, .cache, coverage}`. The SDK does not traverse symlinks, so the indexed tree stays small. Tools (Read/Bash/Grep) still resolve real source paths through the symlinks. Wrapper torn down in the `runTask` `finally`.

Also fixes adjacent failure modes uncovered while diagnosing this:

- `replayConversationForAgent` skips for non-`bc-` (local-runtime) agent ids. SDK 1.0.12 strict-rejects forced `runtime: "cloud"` `listRuns` on local ids with `Agent ID must be in the format 'bc-<uuid>'`.
- `agent.send()` recovers from `UnknownAgentError: already has active run` (stale `active_run_id` in the local SDK store after a prior crashed run): dispose the resumed agent, mint a fresh one via `Agent.create`, retry once. `cursor_init` is now yielded AFTER the recovery so the agent-runner persists the FRESH session id, which breaks the otherwise-permanent recovery loop.
- `agent.send()` is now raced against a 90s timeout + abort signal so a wedged pre-stream phase no longer pins the worker. Emits `cursor_send_failed` with `elapsed_ms` + reason on timeout/abort.
- Process-level `unhandledRejection` / `uncaughtException` handlers in `main.ts` log the failure but keep the service alive. Cursor SDK throws unhandled HTTP/2 stream errors from connectrpc that would otherwise take down every worker, not just the offending one.

`apps/server/src/repro/cursor-repro.ts` is a standalone script that isolates `Agent.create` + `agent.send` against the real cursor cloud (decrypts the API key from the `secret` table). Pulled out of the runner stack so future SDK regressions can be triaged in isolation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`cursor-sdk-adapter.ts` stale-recovery retry has no timeout. In the `isStaleActiveRun` catch branch (~line 1641), `run = await agent.send(first.value, sendOpts)` is a bare `await` with no `Promise.race`. The original `sendTimer` fires after 90 s and produces an unhandled rejection (logged, but harmless), while the `await` remains pending indefinitely, so the worker stays pinned. Fix: race the retry with a fresh timeout promise, same pattern as the first send.

Fixed in 2c400d5f. Stale-recovery retry now races `agent.send()` against `SEND_TIMEOUT_MS` + abort signal, mirroring the first send. `clearTimeout` in `finally` so the retry timer can't fire post-success. Tests + typecheck clean.

Pushed 08c8c3e3: same class of bug spotted in `streamRunWithStallAndAbort`. The stall timer set `stalled = true` but never broke `Promise.race([streamPromise, deltaPromise])`. Cloud silent + delta queue empty → both racers hang forever, race never resolves, `cursor_stalled` never yields. Repro'd live: two `dev` workers wedged 26+ min past `agent.send()` with zero `cursor_stalled` signal.

Fix: stall is now a deferred promise resolved by the timer, included as a third racer. On stall-wake it yields `cursor_stalled`, re-arms the timer, and replaces the promise so sustained silence keeps producing periodic events. New regression test asserts the event fires from the timer alone (no external `__pushEvent` wake) within 500ms of a 30ms threshold.
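The stall fix can be illustrated with a simplified loop. This is a sketch under assumptions, not the adapter's `streamRunWithStallAndAbort`: the generator `streamWithStall`, the `next` callback, and the event shapes are hypothetical. The stream and delta racers are collapsed into a single `next()` source, so the timer promise is the second racer here rather than the third.

```typescript
type StreamEvent = { kind: "delta"; text: string } | { kind: "cursor_stalled" };

// Sentinel the timer resolves with, so it cannot collide with a real delta.
const STALL = Symbol("stall");

// Hypothetical stand-in for the stream loop: `next` resolves with the next
// delta, or null when the stream is done; `stallMs` is the silence threshold.
async function* streamWithStall(
  next: () => Promise<string | null>,
  stallMs: number,
): AsyncGenerator<StreamEvent> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  // Arms a timer and returns a promise that the timer itself resolves.
  const armStall = () =>
    new Promise<typeof STALL>((resolve) => {
      timer = setTimeout(() => resolve(STALL), stallMs);
    });

  let stall = armStall();
  let pending = next();
  try {
    while (true) {
      // The stall promise is a racer, so silence alone can wake the loop.
      const winner = await Promise.race([pending, stall]);
      if (winner === STALL) {
        yield { kind: "cursor_stalled" };
        stall = armStall(); // re-arm so sustained silence keeps reporting
        continue;
      }
      clearTimeout(timer);
      if (winner === null) return;
      yield { kind: "delta", text: winner };
      stall = armStall();
      pending = next();
    }
  } finally {
    clearTimeout(timer); // never leave a timer behind on exit or early break
  }
}
```

Setting a flag from the timer, as the earlier code did, changes state but wakes nobody. Making the timer resolve a promise that takes part in the race is what lets the loop observe the stall.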
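The retry fix from earlier in the thread (race `agent.send()` against a timeout and the abort signal, clear the timer in `finally`) follows a similar shape. A minimal sketch, assuming a hypothetical helper `raceWithTimeout` and illustrative error messages; the real adapter's names and error types may differ:

```typescript
// Hypothetical helper: rejects if `work` has not settled within `ms`, or as
// soon as `signal` aborts. The timer and listener are always cleaned up.
async function raceWithTimeout<T>(
  work: Promise<T>,
  ms: number,
  signal?: AbortSignal,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  let onAbort: (() => void) | undefined;
  const guard = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`send timed out after ${ms}ms`)), ms);
    if (!signal) return;
    if (signal.aborted) {
      reject(new Error("send aborted"));
      return;
    }
    onAbort = () => reject(new Error("send aborted"));
    signal.addEventListener("abort", onAbort, { once: true });
  });
  try {
    return await Promise.race([work, guard]);
  } finally {
    // Clearing here means the timer cannot fire after a successful send.
    clearTimeout(timer);
    if (signal && onAbort) signal.removeEventListener("abort", onAbort);
  }
}
```

Both the first send and the stale-recovery retry would go through the same wrapper, each with its own fresh timer, for example `await raceWithTimeout(agent.send(input, opts), 90_000, abortSignal)`.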