VOICE-1: /architect/transcribe server proxy + speaches integration #773

Closed
opened 2026-05-02 21:41:50 +00:00 by code-lead · 0 comments
Collaborator

As an operator, I want the claude-hooks server to broker every transcription request so my speaches instance stays bound to localhost and the browser only ever talks to the claude-hooks origin it already has a session cookie for.

Acceptance criteria

Config plumbing

  • Factory-image defaults added to config/service.json under a new "speech": { … } block matching the spec's storage table (speech.enabled, speech.transcribe_url, speech.model, speech.default_language, speech.max_audio_seconds, speech.max_audio_bytes, speech.allowed_languages_json). Not read at runtime — only by SVC-1's syncServiceConfigBuiltin.
  • getSpeechConfig() accessor lands next to the other typed sub-accessors that SVC-2 introduces. Reads exclusively via getServiceConfig(). No readFileSync('service.json') anywhere.
  • When speech.enabled === false or speech.transcribe_url is empty, POST /architect/transcribe returns 503 service-disabled with a JSON body explaining how to flip the flag from the dashboard. Don't 404 — the UI needs the distinction so it can disable the mic button gracefully.
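
A minimal sketch of how the accessor and the disabled check could fit together. Everything here is an assumption for illustration: the flat dotted-key shape returned by `getServiceConfig()`, the `SpeechConfig` field names, the `checkSpeechEnabled` helper, and the wording of the 503 body are hypothetical, and the numeric limits are placeholders rather than the spec's defaults.

```typescript
// Hypothetical typed view of the "speech" block; the real SVC-2 types may differ.
interface SpeechConfig {
  enabled: boolean;
  transcribeUrl: string;
  model: string;
  defaultLanguage: string;
  maxAudioSeconds: number;
  maxAudioBytes: number;
  allowedLanguages: string[];
}

type ServiceConfig = Record<string, unknown>;

// Stand-in for the DB-backed store behind getServiceConfig().
// Placeholder values only; they are NOT the factory defaults from the spec.
let serviceConfig: ServiceConfig = {
  "speech.enabled": false,
  "speech.transcribe_url": "",
  "speech.model": "deepdml/faster-whisper-large-v3-turbo-ct2",
  "speech.default_language": "auto",
  "speech.max_audio_seconds": 60,
  "speech.max_audio_bytes": 1000000,
  "speech.allowed_languages_json": '["en","fr"]',
};

// Assumed signature: returns the merged config as flat dotted keys.
function getServiceConfig(): ServiceConfig {
  return serviceConfig;
}

// Reads only through getServiceConfig(); never touches service.json directly.
function getSpeechConfig(): SpeechConfig {
  const cfg = getServiceConfig();
  return {
    enabled: cfg["speech.enabled"] === true,
    transcribeUrl: String(cfg["speech.transcribe_url"] ?? ""),
    model: String(cfg["speech.model"] ?? ""),
    defaultLanguage: String(cfg["speech.default_language"] ?? "auto"),
    maxAudioSeconds: Number(cfg["speech.max_audio_seconds"] ?? 0),
    maxAudioBytes: Number(cfg["speech.max_audio_bytes"] ?? 0),
    allowedLanguages: JSON.parse(
      String(cfg["speech.allowed_languages_json"] ?? "[]"),
    ) as string[],
  };
}

interface DisabledResponse {
  status: 503;
  body: { code: "service-disabled"; message: string };
}

// Returns null when transcription may proceed, otherwise the 503 payload.
function checkSpeechEnabled(cfg: SpeechConfig): DisabledResponse | null {
  if (cfg.enabled && cfg.transcribeUrl.trim() !== "") return null;
  return {
    status: 503,
    body: {
      code: "service-disabled",
      message:
        "Speech input is disabled. Enable speech.enabled and set speech.transcribe_url from the dashboard.",
    },
  };
}
```

Because the accessor is evaluated per request in this sketch, a config change in the database would take effect on the next call without a restart.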

Endpoint shape

  • POST /architect/transcribe accepts multipart/form-data with:
    • audio: the recorded blob (typically audio/webm;codecs=opus).
    • language (optional): ISO 639-1 code or auto. Falls back to speech.default_language.
    • prompt (optional): forwarded to speaches as the Whisper prompt field for context-priming.
  • Auth: same operator-session middleware as the rest of /architect/* (reuse what architect.ts already wires).
  • Reject requests larger than speech.max_audio_bytes with 413. Reject requests where the audio duration (deduced from blob header where possible, otherwise enforced after speaches replies) exceeds speech.max_audio_seconds with 413.
  • Response is SSE (text/event-stream) with envelope events:
    • event: partial with data: { "text": "…", "is_final": false }, emitted as speaches yields incremental hypotheses.
    • event: final with data: { "text": "…", "language": "fr", "duration_ms": 4321 }, emitted once speaches finishes.
    • event: error with data: { "code": "...", "message": "..." }, emitted on any upstream failure; the stream then closes.
  • Heartbeat comment frames every 10 s while waiting on speaches so idle proxies don't reset the connection (mirror the pattern already used by /events — see apps/server/src/main.ts::SSE_HEARTBEAT_MS).
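
The envelope and heartbeat described above could be serialized roughly as follows. This is a sketch under assumptions: `encodeSseFrame`, `HEARTBEAT_FRAME`, and `startHeartbeat` are hypothetical names, and the existing `/events` handler may structure its heartbeat differently.

```typescript
// The three envelope events from the acceptance criteria.
type TranscribeEvent =
  | { event: "partial"; data: { text: string; is_final: false } }
  | { event: "final"; data: { text: string; language: string; duration_ms: number } }
  | { event: "error"; data: { code: string; message: string } };

// One SSE frame: an event line, a single-line JSON data line, then a blank line.
function encodeSseFrame(e: TranscribeEvent): string {
  return `event: ${e.event}\ndata: ${JSON.stringify(e.data)}\n\n`;
}

// Lines starting with ":" are SSE comments; clients ignore them, but they
// keep bytes flowing so idle proxies do not reset the connection.
const HEARTBEAT_FRAME = ": heartbeat\n\n";

// Writes a heartbeat every intervalMs until the returned stop function is called.
// `write` is whatever pushes a chunk onto the response (hypothetical here).
function startHeartbeat(
  write: (chunk: string) => void,
  intervalMs: number = 10_000,
): () => void {
  const timer = setInterval(() => write(HEARTBEAT_FRAME), intervalMs);
  return () => clearInterval(timer);
}
```

The stop function would typically be called once the `final` or `error` frame has been written, or when the client disconnects.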

Speaches integration

  • Forwards the audio to ${transcribe_url}/v1/audio/transcriptions with model and language (omitted, not sent as null, when the operator chose auto, since speaches/Whisper auto-detects when the field is absent).
  • Requests stream=true and parses speaches' SSE response, fanning events through to the browser. If the upstream returns plain JSON instead of SSE (older speaches builds), fall back to a single final event.
  • Aborts the upstream request when the browser disconnects (use AbortController keyed to the response's close signal).
  • Logs a one-line entry per request: agent, language, duration_ms, audio_bytes, upstream HTTP status. No transcript text in logs.
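
As an illustration of the forwarding and logging rules above, the non-audio form fields and the log line might be built like this. The helper names (`buildUpstreamFields`, `formatLogLine`), the log format, and the choice to send `stream` as a form field are assumptions, not confirmed details of speaches or of the existing server code.

```typescript
interface TranscribeParams {
  language?: string; // ISO 639-1 code or "auto"; undefined means "use the default"
  prompt?: string;
}

interface UpstreamDefaults {
  model: string;
  defaultLanguage: string;
}

// Builds the text fields sent alongside the audio part.
// "auto" is expressed by leaving `language` out entirely.
function buildUpstreamFields(
  params: TranscribeParams,
  cfg: UpstreamDefaults,
): Record<string, string> {
  const fields: Record<string, string> = { model: cfg.model, stream: "true" };
  const language = params.language ?? cfg.defaultLanguage;
  if (language !== "auto") fields.language = language;
  if (params.prompt) fields.prompt = params.prompt;
  return fields;
}

interface RequestLog {
  agent: string;
  language: string;
  duration_ms: number;
  audio_bytes: number;
  upstream_status: number;
}

// One line per request. The type has no transcript field, so transcript
// text cannot end up in the log by accident.
function formatLogLine(entry: RequestLog): string {
  return (
    `transcribe agent=${entry.agent} language=${entry.language} ` +
    `duration_ms=${entry.duration_ms} audio_bytes=${entry.audio_bytes} ` +
    `upstream_status=${entry.upstream_status}`
  );
}
```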

Tests

  • Unit: proxy correctly maps language: "auto" → omits the field upstream, language: "fr" → forwards verbatim.
  • Unit: oversize blob → 413; speech.enabled=false → 503; missing operator session → 401.
  • Integration: a fixture upstream that emits two partial SSE frames and a final produces three SSE frames downstream in the same order.
  • Integration: flipping speech.enabled from false to true at scope='global' (via direct service_config insert in the test) makes the next request succeed without restart.
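
The ordering test and the plain-JSON fallback could be exercised against a small pure parser like the one below. This is only a sketch: it works on a fully buffered body rather than a live stream, `parseSse` and `toDownstreamFrames` are hypothetical names, and the fixture's event names are taken from the test description rather than from speaches' actual output.

```typescript
interface SseFrame {
  event: string;
  data: string;
}

// Parses a buffered SSE body into frames, skipping comment lines (heartbeats).
function parseSse(body: string): SseFrame[] {
  const frames: SseFrame[] = [];
  for (const block of body.split(/\r?\n\r?\n/)) {
    let event = "message"; // SSE default when no event line is present
    const dataLines: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      if (line === "" || line.startsWith(":")) continue;
      const idx = line.indexOf(":");
      const field = idx === -1 ? line : line.slice(0, idx);
      let value = idx === -1 ? "" : line.slice(idx + 1);
      if (value.startsWith(" ")) value = value.slice(1);
      if (field === "event") event = value;
      else if (field === "data") dataLines.push(value);
    }
    if (dataLines.length > 0) frames.push({ event, data: dataLines.join("\n") });
  }
  return frames;
}

// SSE upstream: pass frames through in order.
// Anything else (assumed to be plain JSON): wrap the body as a single final frame.
function toDownstreamFrames(contentType: string, body: string): SseFrame[] {
  if (contentType.startsWith("text/event-stream")) return parseSse(body);
  return [{ event: "final", data: body }];
}

// Fixture: two partials, a heartbeat comment, then a final.
const fixture =
  'event: partial\ndata: {"text":"bon"}\n\n' +
  'event: partial\ndata: {"text":"bonjour"}\n\n' +
  ": heartbeat\n\n" +
  'event: final\ndata: {"text":"bonjour","language":"fr","duration_ms":4321}\n\n';

const order = toDownstreamFrames("text/event-stream", fixture).map((f) => f.event);
```

With this fixture, `order` holds the three event names in upstream order, and the heartbeat comment produces no frame.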

Out of scope

  • TTS / read-back of architect responses.
  • Speaker diarisation, custom vocabularies, or fine-tuned models.
  • Multi-provider abstraction (OpenAI Whisper, Groq, Replicate, etc.) — config shape stays generic enough to swap later, but no provider-abstraction code lands here.
  • Hosted-STT API keys in the SC-6 secret table — defer until we actually need a hosted provider. speaches needs no auth.

References

  • specs/workspace-chat-voice-input.md — full spec (P1 section).
  • specs/config-to-db.md — SVC-1/SVC-2/SVC-3 contract this story lands on from day one.
  • ~/.config/systemd/user/speaches.service — existing speaches unit on the desktop, port 8078, model deepdml/faster-whisper-large-v3-turbo-ct2, STT_MODEL_TTL=-1 (warm).
  • speaches OpenAI-compatible STT: POST /v1/audio/transcriptions with stream=true for SSE partials.

Dependencies

  • SVC-1 (#750) — already merged. No native dep edge needed.
  • This story has no open blockers at creation time. It will land unassigned on /issues/ready; the architect (or operator) assigns it manually to kick off dispatch.
Reference
charles/claude-hooks#773