Voice Assistant

The voice assistant provides hands-free AI interaction through speech. It's available as a floating widget on web and a long-press FAB action on mobile.

Architecture

User speaks → Mic recording → STT (transcribe) → AI Agent (chat) → TTS (synthesize) → Speaker playback

The voice assistant chains three services:

  1. STT — converts speech to text (configurable provider)
  2. AI Agent — processes the text and generates a response
  3. TTS — converts the response back to speech
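The three-stage chain can be sketched as an orchestration function with injected services. The `VoiceServices` interface and function names below are illustrative, not the module's actual API:

```typescript
// Sketch of the three-stage voice pipeline with injected services.
interface VoiceServices {
  stt: (audio: Uint8Array) => Promise<string>;                 // speech → text
  agent: (text: string, context?: string) => Promise<string>; // text → reply
  tts: (text: string) => Promise<Uint8Array>;                 // reply → audio
}

async function runVoicePipeline(
  audio: Uint8Array,
  services: VoiceServices,
  context?: string,
) {
  const transcription = await services.stt(audio);
  const aiResponse = await services.agent(transcription, context);
  const speech = await services.tts(aiResponse);
  return { transcription, aiResponse, speech };
}
```

Injecting the services keeps each stage swappable, which matches the configurable-provider design described above.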

Permission Gating

The voice assistant is only visible to users with ai.view or ai.configure permission:

  • Web: The AIAssistantWidget is conditionally rendered in the authenticated layout
  • Mobile: The VoiceAssistant component and FAB long-press action are gated by permission
  • API: The /ai/voice endpoint requires tenant.permission:ai.view middleware

Users without AI permissions see no widget, no voice button, and get 403 on API calls.
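A minimal sketch of the gating check, assuming a user object that carries a flat list of permission strings (the `User` shape and helper name are hypothetical; only the permission names come from the doc):

```typescript
// The two permissions that unlock the voice assistant.
type Permission = "ai.view" | "ai.configure";

const AI_PERMISSIONS: Permission[] = ["ai.view", "ai.configure"];

interface User {
  permissions: string[];
}

// True if the user holds at least one AI permission.
function canUseVoiceAssistant(user: User): boolean {
  return AI_PERMISSIONS.some((p) => user.permissions.includes(p));
}

// The web layout would then render conditionally, e.g.:
// {canUseVoiceAssistant(user) && <AIAssistantWidget />}
```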

API Endpoint

POST /api/v1/ai/voice
Content-Type: multipart/form-data

audio: <file>          # Audio file (m4a, mp3, wav, webm)
context: "orders page" # Optional page context

Response:

{
  "success": true,
  "data": {
    "transcription": "What is the status of order 9002?",
    "ai_response": "Order #9002 is currently in transit. It was shipped yesterday via Express delivery and is expected to arrive tomorrow.",
    "usage": {
      "stt_ms": 450,
      "llm_ms": 1200,
      "total_ms": 1650
    }
  }
}

The endpoint handles the full pipeline: it transcribes the audio, sends the text to the AI agent, and returns both the transcription and the AI response.
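A browser-side call to this endpoint might look like the following sketch. The path and field names come from the doc above; `buildVoiceForm`, `sendVoiceQuery`, and the error handling are illustrative:

```typescript
// Build the multipart body for POST /api/v1/ai/voice.
function buildVoiceForm(audio: Blob, context?: string): FormData {
  const form = new FormData();
  form.append("audio", audio, "query.m4a"); // m4a, mp3, wav, or webm
  if (context) form.append("context", context); // optional page context
  return form;
}

async function sendVoiceQuery(audio: Blob, context?: string) {
  const res = await fetch("/api/v1/ai/voice", {
    method: "POST",
    body: buildVoiceForm(audio, context), // browser sets the multipart boundary
  });
  if (!res.ok) throw new Error(`Voice request failed: ${res.status}`);
  const { data } = await res.json();
  return data as { transcription: string; ai_response: string };
}
```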

Voice Activity Detection (VAD)

VAD enables hands-free conversation by automatically detecting when the user stops speaking and triggering the send action.

How It Works

  1. User opens voice assistant — mic starts recording
  2. VAD monitors audio energy levels every 60 ms
  3. When speech is detected (raw dB > -35), VAD marks speech start
  4. When silence follows speech for 1.5 seconds, VAD auto-triggers processing
  5. After AI responds via TTS, the mic re-opens for the next turn

VAD Parameters

Parameter                Value    Purpose
VAD_SPEECH_DB            -35 dB   Raw metering threshold — above this is speech
VAD_SILENCE_DURATION     1500 ms  Silence after speech before auto-send
VAD_MIN_SPEECH_DURATION  400 ms   Minimum speech duration before considering auto-send
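The detection steps above can be sketched as a pure tick function driven by these parameters. The `VadState` shape and `vadTick` helper are illustrative; the real implementation runs this check roughly every 60 ms:

```typescript
const VAD_SPEECH_DB = -35;            // dB: above this is speech
const VAD_SILENCE_DURATION = 1500;    // ms of silence before auto-send
const VAD_MIN_SPEECH_DURATION = 400;  // ms of speech required first

interface VadState {
  speechStart: number | null;  // timestamp when speech was first detected
  silenceStart: number | null; // timestamp when the current silence began
}

// Called once per metering tick. Returns true when the recording
// should be auto-sent. Timestamps are in milliseconds.
function vadTick(state: VadState, db: number, now: number): boolean {
  if (db > VAD_SPEECH_DB) {
    if (state.speechStart === null) state.speechStart = now;
    state.silenceStart = null; // still speaking, reset silence timer
    return false;
  }
  if (state.speechStart === null) return false; // silence before any speech
  if (state.silenceStart === null) state.silenceStart = now;
  const spokeLongEnough =
    state.silenceStart - state.speechStart >= VAD_MIN_SPEECH_DURATION;
  const silentLongEnough = now - state.silenceStart >= VAD_SILENCE_DURATION;
  return spokeLongEnough && silentLongEnough;
}
```

The minimum-speech check also filters short bursts such as residual echo, which is relied on in the echo-prevention section below.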

Platform Differences

Web (browser): Uses the Web Audio API AnalyserNode to sample the microphone stream. The normalized RMS level of the samples is compared against the threshold.

Mobile (React Native / Expo): Uses expo-av recording metering (getStatusAsync().metering) which returns raw dB values. On Android, metering is very spiky (jumps from -14 dB to -160 dB between ticks), so VAD uses raw dB directly instead of smoothed/normalized levels.
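The web-side level computation can be sketched as follows. `getByteTimeDomainData` fills a byte buffer centered on 128, which is normalized to the -1..1 range before taking the RMS; `computeRmsLevel` and `RMS_THRESHOLD` are assumed names, not the widget's actual code:

```typescript
// Compute the normalized RMS level of one AnalyserNode sample buffer.
function computeRmsLevel(samples: Uint8Array): number {
  let sumSquares = 0;
  for (const s of samples) {
    const v = (s - 128) / 128; // map 0..255 → -1..1 (128 is silence)
    sumSquares += v * v;
  }
  return Math.sqrt(sumSquares / samples.length);
}

// Browser usage (illustrative):
// const analyser = audioContext.createAnalyser();
// micSource.connect(analyser);
// const buf = new Uint8Array(analyser.fftSize);
// analyser.getByteTimeDomainData(buf);
// const speaking = computeRmsLevel(buf) > RMS_THRESHOLD; // assumed constant
```

On mobile this normalization step is skipped entirely: because Android metering is so spiky, the raw dB value is compared against VAD_SPEECH_DB directly, as described above.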

Echo Prevention

When TTS plays the AI response through the speaker, the mic could pick it up and create a feedback loop. The system prevents this with:

  1. ttsActiveRef — tracks the full TTS lifecycle (API call start → playback finish). The mic never re-opens while TTS is active, even during the 2-second API call before audio starts playing.
  2. Echo cooldown — after TTS finishes, the mic stays blocked for 1.5 seconds to let speaker echo dissipate. The UI shows "Preparing to listen..." during this period.
  3. Min speech duration — VAD ignores speech bursts shorter than 400ms, filtering out residual echo.
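The first two mechanisms combine into a single mic re-open gate, sketched here with illustrative names (`TtsState` mirrors what the doc calls ttsActiveRef plus a finish timestamp):

```typescript
const ECHO_COOLDOWN_MS = 1500; // block the mic after TTS playback ends

interface TtsState {
  active: boolean;           // true from TTS API call start to playback finish
  finishedAt: number | null; // timestamp when playback last finished (ms)
}

// True only when it is safe to re-open the microphone.
function micMayReopen(tts: TtsState, now: number): boolean {
  if (tts.active) return false; // never re-open mid-TTS, even during the API call
  if (tts.finishedAt !== null && now - tts.finishedAt < ECHO_COOLDOWN_MS) {
    return false; // "Preparing to listen..." cooldown
  }
  return true;
}
```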

Web Widget

The AI assistant widget appears as a floating button in the bottom-right corner of the web app. It supports:

  • Text chat — type messages to the AI agent
  • Voice mode — click the mic to speak, click again to send
  • Agent role switching — select between Support, Operations, Analytics roles
  • Docked mode — pin the widget to the side panel (shifts main content)
  • Page context — automatically sends the current page URL for context

The widget lives in modules/AI/frontend/components/ai-assistant-widget.tsx and is dynamically imported by the authenticated layout — no hard dependency on the AI module.

Mobile Voice Assistant

Triggered by long-pressing the center FAB in the tab bar. Features:

  • Full-screen overlay with animated waveform visualization
  • Live metering — waveform bars respond to mic input in real time
  • Pulse rings — concentric rings pulse with voice intensity
  • TTS playback — AI responses are spoken back via expo-av
  • Auto-resume — after TTS finishes, mic re-opens for continuous conversation
  • Haptic feedback — on speech detection, auto-send, and response

Recording Configuration

{
  sampleRate: 16000,       // Voice-optimized
  numberOfChannels: 1,     // Mono — reduces ambient noise
  bitRate: 64000,          // Sufficient for speech
  extension: ".m4a",       // AAC codec
  isMeteringEnabled: true  // Required for VAD
}

Conversation Flow

Manual Mode

  1. User taps mic → recording starts
  2. User taps mic again → recording stops, audio sent
  3. AI processes and responds (with TTS if configured)
  4. User taps mic to ask again

Hands-Free Mode (VAD)

  1. User opens voice assistant → mic starts, VAD arms
  2. User speaks → VAD detects speech
  3. User pauses for 1.5s → VAD auto-sends
  4. AI responds via TTS → speaker plays response
  5. TTS finishes → 1.5s echo cooldown → mic re-opens
  6. Cycle repeats until user closes the assistant
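The hands-free loop above can be sketched as a small state machine. Phase and event names are illustrative; the real component tracks this with refs and component state:

```typescript
type VoicePhase = "listening" | "processing" | "speaking" | "cooldown";
type VoiceEvent = "auto_send" | "response_ready" | "tts_done" | "cooldown_done";

const transitions: Record<VoicePhase, Partial<Record<VoiceEvent, VoicePhase>>> = {
  listening: { auto_send: "processing" },     // VAD fired after 1.5 s of silence
  processing: { response_ready: "speaking" }, // AI reply arrived, TTS starts
  speaking: { tts_done: "cooldown" },         // playback finished
  cooldown: { cooldown_done: "listening" },   // 1.5 s echo cooldown elapsed
};

function nextPhase(phase: VoicePhase, event: VoiceEvent): VoicePhase {
  return transitions[phase][event] ?? phase; // ignore events that don't apply
}
```

Ignoring out-of-order events (e.g. a stray tts_done while listening) keeps the loop robust when async callbacks fire late.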

Troubleshooting

Issue                             Cause                        Fix
"Microphone access denied"        Browser/app permissions      Grant mic permission in settings
No TTS playback                   TTS provider not configured  Add a TTS provider in AI Configuration
Echo feedback loop                Old app version              Update — echo prevention was added in the latest version
VAD never triggers (mobile)       Smoothed levels too low      Update — VAD now uses raw dB values
VAD triggers on background noise  Threshold too sensitive      Increase VAD_SPEECH_DB (e.g., to -30)