VOICE-3: Composer mic toggle + live partials #775

Closed
opened 2026-05-02 21:41:53 +00:00 by code-lead · 0 comments
Collaborator

As an operator typing into the workspace chat, I want to click a mic icon to start dictating, see the words appear live, and click again to stop and have the final transcript inserted at my caret — without losing whatever I had already typed.

Acceptance criteria

Mic button

  • apps/web/src/components/planner/composer.tsx gains a <Button> with a lucide-react Mic icon, sitting between the attachments strip and Send/Queue/Stop. aria-label swaps "Start dictation" → "Stop dictation" on toggle. aria-pressed mirrors the recording state.
  • The button is hidden when the /architect/transcribe/health probe (see VOICE-2 (#774)) reports the feature disabled or unreachable. It renders disabled (with a tooltip explaining why) when the browser lacks navigator.mediaDevices.getUserMedia.
  • First click prompts mic permission. Permission denial surfaces a one-shot toast ("Microphone access denied — enable in browser settings") and reverts the button to idle.
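The render gating above can be sketched as a pure helper. The probe payload shape (reachable/enabled flags plus default_language) is an assumption pending the VOICE-2 (#774) health endpoint; the tooltip copy is illustrative.

```typescript
// Sketch of the mic-button render gating; payload shape is assumed.
type TranscribeHealth =
  | { reachable: true; enabled: boolean; default_language: string }
  | { reachable: false };

type MicButtonState =
  | { render: false }
  | { render: true; disabled: boolean; tooltip?: string };

function micButtonState(
  health: TranscribeHealth,
  hasGetUserMedia: boolean,
): MicButtonState {
  // Hidden entirely when the feature is disabled or the probe is unreachable.
  if (!health.reachable || !health.enabled) return { render: false };
  // Rendered but disabled (with an explanatory tooltip) without getUserMedia.
  if (!hasGetUserMedia) {
    return {
      render: true,
      disabled: true,
      tooltip: "Your browser does not support microphone capture",
    };
  }
  return { render: true, disabled: false };
}
```

Keeping this decision out of the component makes the "hidden vs. disabled" distinction trivially unit-testable.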

Recording state machine

  • States: idle → requesting-permission → recording → uploading → idle. Errors at any step return to idle with a toast.
  • While recording: a small pulsing dot appears next to the button (CSS animation, gated by @media (prefers-reduced-motion: reduce)), and a live elapsed timer (mm:ss) renders to the right. Hard cap at speech.max_audio_seconds — auto-stops at the limit and proceeds to uploading.
  • Esc while recording cancels (no upload, no insertion). Esc already aborts a streaming architect turn (onAbort in the Composer) — keep the existing handler, just add a higher-priority cancel when recording is active.
  • MediaRecorder is configured for audio/webm;codecs=opus when supported, falling back to the browser's default MIME type. Chunk timeslice of 250 ms so we have something to upload promptly.
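The state machine above can be sketched as a pure reducer; event names here are illustrative, not final (Esc maps to CANCEL, the speech.max_audio_seconds cap to TIME_LIMIT).

```typescript
// Minimal sketch of the recording state machine; event names are assumptions.
type RecState =
  | { kind: "idle" }
  | { kind: "requesting-permission" }
  | { kind: "recording"; startedAt: number }
  | { kind: "uploading" };

type RecEvent =
  | { type: "CLICK_MIC" }
  | { type: "PERMISSION_GRANTED"; now: number }
  | { type: "PERMISSION_DENIED" }
  | { type: "STOP" }        // second click
  | { type: "TIME_LIMIT" }  // speech.max_audio_seconds reached: auto-stop
  | { type: "CANCEL" }      // Esc: no upload, no insertion
  | { type: "UPLOAD_DONE" }
  | { type: "ERROR" };      // any failure returns to idle (toast raised elsewhere)

function transition(state: RecState, ev: RecEvent): RecState {
  switch (state.kind) {
    case "idle":
      return ev.type === "CLICK_MIC" ? { kind: "requesting-permission" } : state;
    case "requesting-permission":
      if (ev.type === "PERMISSION_GRANTED")
        return { kind: "recording", startedAt: ev.now };
      if (ev.type === "PERMISSION_DENIED" || ev.type === "ERROR")
        return { kind: "idle" };
      return state;
    case "recording":
      if (ev.type === "STOP" || ev.type === "TIME_LIMIT")
        return { kind: "uploading" };
      if (ev.type === "CANCEL" || ev.type === "ERROR") return { kind: "idle" };
      return state;
    case "uploading":
      if (ev.type === "UPLOAD_DONE" || ev.type === "ERROR")
        return { kind: "idle" };
      return state;
  }
}

// mm:ss elapsed timer rendered next to the pulsing dot.
function formatElapsed(ms: number): string {
  const s = Math.floor(ms / 1000);
  const mm = String(Math.floor(s / 60)).padStart(2, "0");
  const ss = String(s % 60).padStart(2, "0");
  return `${mm}:${ss}`;
}
```

A discriminated-union reducer like this keeps illegal transitions (e.g. STOP while idle) unrepresentable side effects, and is exactly what the Vitest state-machine assertions below would drive.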

Streaming partials & final insert

  • On stop, the assembled blob POSTs to /architect/transcribe with the operator's resolved default language (no per-browser pref — comes from speech.default_language via the health probe payload). The response is consumed via EventSource-style SSE — use fetch + a ReadableStream reader since EventSource can't POST (match the helper used in useArchitectStream).
  • partial events render under the textarea in a role="status" aria-live="polite" band styled with text-fg-muted, debounced ~1 s so screen-reader announcements don't flood.
  • On final, the final text is inserted at the current caret position of the textarea via a controlled-input update — preserving any text the operator typed before/after the recording started. If the textarea has lost focus, append at the end with a leading space when the existing text is non-empty.
  • On error, surface a toast (tone="error") with the upstream message, drop any partial preview, leave the textarea unchanged.
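The caret-insert rule above can be captured in a small pure function; a plain interface stands in for the real textarea so the logic is testable outside the DOM, and the wiring into the controlled input is left to the component.

```typescript
// Sketch of the final-transcript insertion rule; the interface is a
// stand-in for the real <textarea> element.
interface CaretTarget {
  value: string;
  selectionStart: number; // caret position
  focused: boolean;
}

function insertFinalTranscript(
  t: CaretTarget,
  transcript: string,
): { value: string; caret: number } {
  if (!t.focused) {
    // Focus lost: append at the end, with a leading space only when the
    // existing text is non-empty.
    const value = t.value.length > 0 ? `${t.value} ${transcript}` : transcript;
    return { value, caret: value.length };
  }
  // Focused: splice at the caret, preserving text on both sides.
  const before = t.value.slice(0, t.selectionStart);
  const after = t.value.slice(t.selectionStart);
  return {
    value: before + transcript + after,
    caret: before.length + transcript.length,
  };
}
```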

Tests

  • Vitest: mock MediaRecorder + a stub SSE reader; assert the state-machine transitions on each event, that Esc cancels cleanly, and that the final insert respects caret position.
  • Vitest: assert the mic button is hidden when the health probe returns enabled: false.
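The stub SSE reader in those tests could feed a line-oriented parser like the sketch below. The partial/final/error event names come from the criteria above, but the exact wire framing (event:/data: pairs separated by a blank line, per standard SSE) is an assumption pending the VOICE-1 (#773) proxy.

```typescript
// Sketch of an SSE block parser for the stub reader; wire format assumed.
interface SseEvent {
  event: string; // "partial" | "final" | "error"
  data: string;
}

function parseSse(buffer: string): SseEvent[] {
  const events: SseEvent[] = [];
  // Events are separated by a blank line per standard SSE framing.
  for (const block of buffer.split("\n\n")) {
    let event = "message";
    const data: string[] = [];
    for (const line of block.split("\n")) {
      if (line.startsWith("event:")) event = line.slice(6).trim();
      else if (line.startsWith("data:")) data.push(line.slice(5).trim());
    }
    if (data.length > 0) events.push({ event, data: data.join("\n") });
  }
  return events;
}
```

In the real hook the same parsing would run incrementally over chunks from the fetch ReadableStream (mirroring useArchitectStream); a whole-buffer parse keeps the test stub simple.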

Out of scope

  • Mobile-specific recording UX (push-to-talk gesture, background recording). Desktop browsers only this iteration.
  • Replacing the existing keyboard composer flow — voice is purely additive; ⌘/ctrl+Enter still sends.
  • Capturing audio outside the workspace / planner composer (no global hotkey, no recording from other routes).
  • TTS read-back of architect responses.
  • Settings-group UI — that lands in VOICE-4 (settings group, posted next).

References

  • specs/workspace-chat-voice-input.md — full spec (P2 section).
  • apps/web/src/components/planner/composer.tsx — the shared composer used by both the workspace and planner chat surfaces.
  • apps/web/CLAUDE.md — primitives, a11y baseline, radius/shadow conventions.
  • useArchitectStream — existing helper that does fetch + ReadableStream SSE consumption (pattern to mirror for the transcribe POST).

Dependencies

  • Blocked by VOICE-1 (#773) — the /architect/transcribe server proxy must exist before the composer can call it. Native dependency edge recorded against #773.
  • Blocked by VOICE-2 (#774) — the composer reads the health probe to decide whether to render the mic and which default language to forward. Native dependency edge recorded against #774.
Reference
charles/claude-hooks#775