Voice Assistant

The voice assistant provides hands-free AI interaction through speech. It's available as a floating widget on web and a long-press FAB action on mobile.

Architecture

User speaks → Mic recording → STT (transcribe) → AI Agent (chat) → TTS (synthesize) → Speaker playback

The voice assistant chains three services:

  1. STT — converts speech to text (configurable provider)
  2. AI Agent — processes the text and generates a response
  3. TTS — converts the response back to speech
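The three-stage chain can be sketched as an orchestration function with injected services. The `VoiceServices` interface and function names below are illustrative, not the module's actual API:

```typescript
// Sketch of the three-stage voice pipeline with injected services.
interface VoiceServices {
  stt: (audio: Uint8Array) => Promise<string>;                 // speech → text
  agent: (text: string, context?: string) => Promise<string>; // text → reply
  tts: (text: string) => Promise<Uint8Array>;                 // reply → audio
}

async function runVoicePipeline(
  audio: Uint8Array,
  services: VoiceServices,
  context?: string,
) {
  const transcription = await services.stt(audio);
  const aiResponse = await services.agent(transcription, context);
  const speech = await services.tts(aiResponse);
  return { transcription, aiResponse, speech };
}
```

Injecting the services keeps each stage swappable, which matches the configurable-provider design described above.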

Permission Gating

The voice assistant is only visible to users with ai.view or ai.configure permission:

  • Web: The AIAssistantWidget is conditionally rendered in the authenticated layout
  • Mobile: The VoiceAssistant component and FAB long-press action are gated by permission
  • API: The /ai/voice endpoint requires tenant.permission:ai.view middleware

Users without AI permissions see no widget, no voice button, and get 403 on API calls.
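A minimal sketch of the gating check, assuming a user object that carries a flat list of permission strings (the `User` shape and helper name are hypothetical; only the permission names come from the doc):

```typescript
// The two permissions that unlock the voice assistant.
type Permission = "ai.view" | "ai.configure";

const AI_PERMISSIONS: Permission[] = ["ai.view", "ai.configure"];

interface User {
  permissions: string[];
}

// True if the user holds at least one AI permission.
function canUseVoiceAssistant(user: User): boolean {
  return AI_PERMISSIONS.some((p) => user.permissions.includes(p));
}

// The web layout would then render conditionally, e.g.:
// {canUseVoiceAssistant(user) && <AIAssistantWidget />}
```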

API Endpoint

POST /api/v1/ai/voice
Content-Type: multipart/form-data

audio: <file>          # Audio file (m4a, mp3, wav, webm)
context: "orders page" # Optional page context

Response:

{
  "success": true,
  "data": {
    "transcription": "What is the status of order 9002?",
    "ai_response": "Order #9002 is currently in transit. It was shipped yesterday via Express delivery and is expected to arrive tomorrow.",
    "usage": {
      "stt_ms": 450,
      "llm_ms": 1200,
      "total_ms": 1650
    }
  }
}

The endpoint handles the full pipeline: it transcribes the audio, sends the text to the AI agent, and returns both the transcription and the AI response.
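A browser-side call to this endpoint might look like the following sketch. The path and field names come from the doc above; `buildVoiceForm`, `sendVoiceQuery`, and the error handling are illustrative:

```typescript
// Build the multipart body for POST /api/v1/ai/voice.
function buildVoiceForm(audio: Blob, context?: string): FormData {
  const form = new FormData();
  form.append("audio", audio, "query.m4a"); // m4a, mp3, wav, or webm
  if (context) form.append("context", context); // optional page context
  return form;
}

async function sendVoiceQuery(audio: Blob, context?: string) {
  const res = await fetch("/api/v1/ai/voice", {
    method: "POST",
    body: buildVoiceForm(audio, context), // browser sets the multipart boundary
  });
  if (!res.ok) throw new Error(`Voice request failed: ${res.status}`);
  const { data } = await res.json();
  return data as { transcription: string; ai_response: string };
}
```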

Voice Activity Detection (VAD)

VAD enables hands-free conversation by automatically detecting when the user stops speaking and triggering the send action.

How It Works

  1. User opens voice assistant — mic starts recording
  2. VAD monitors audio energy levels every 60 ms
  3. When speech is detected (raw dB > -35), VAD marks speech start
  4. When silence follows speech for 1.5 seconds, VAD auto-triggers processing
  5. After AI responds via TTS, the mic re-opens for the next turn

VAD Parameters

Parameter                Value    Purpose
VAD_SPEECH_DB            -35 dB   Raw metering threshold — above this is speech
VAD_SILENCE_DURATION     1500 ms  Silence after speech before auto-send
VAD_MIN_SPEECH_DURATION  400 ms   Minimum speech duration before considering auto-send
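The detection steps above can be sketched as a pure tick function driven by these parameters. The `VadState` shape and `vadTick` helper are illustrative; the real implementation runs this check roughly every 60 ms:

```typescript
const VAD_SPEECH_DB = -35;            // dB: above this is speech
const VAD_SILENCE_DURATION = 1500;    // ms of silence before auto-send
const VAD_MIN_SPEECH_DURATION = 400;  // ms of speech required first

interface VadState {
  speechStart: number | null;  // timestamp when speech was first detected
  silenceStart: number | null; // timestamp when the current silence began
}

// Called once per metering tick. Returns true when the recording
// should be auto-sent. Timestamps are in milliseconds.
function vadTick(state: VadState, db: number, now: number): boolean {
  if (db > VAD_SPEECH_DB) {
    if (state.speechStart === null) state.speechStart = now;
    state.silenceStart = null; // still speaking, reset silence timer
    return false;
  }
  if (state.speechStart === null) return false; // silence before any speech
  if (state.silenceStart === null) state.silenceStart = now;
  const spokeLongEnough =
    state.silenceStart - state.speechStart >= VAD_MIN_SPEECH_DURATION;
  const silentLongEnough = now - state.silenceStart >= VAD_SILENCE_DURATION;
  return spokeLongEnough && silentLongEnough;
}
```

The minimum-speech check also filters short bursts such as residual echo, which is relied on in the echo-prevention section below.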

Platform Differences

Web (browser): Uses the Web Audio API AnalyserNode to sample the microphone stream. The normalized RMS level of the samples is compared against the threshold.

Mobile (React Native / Expo): Uses expo-av recording metering (getStatusAsync().metering) which returns raw dB values. On Android, metering is very spiky (jumps from -14 dB to -160 dB between ticks), so VAD uses raw dB directly instead of smoothed/normalized levels.
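The web-side level computation can be sketched as follows. `getByteTimeDomainData` fills a byte buffer centered on 128, which is normalized to the -1..1 range before taking the RMS; `computeRmsLevel` and `RMS_THRESHOLD` are assumed names, not the widget's actual code:

```typescript
// Compute the normalized RMS level of one AnalyserNode sample buffer.
function computeRmsLevel(samples: Uint8Array): number {
  let sumSquares = 0;
  for (const s of samples) {
    const v = (s - 128) / 128; // map 0..255 → -1..1 (128 is silence)
    sumSquares += v * v;
  }
  return Math.sqrt(sumSquares / samples.length);
}

// Browser usage (illustrative):
// const analyser = audioContext.createAnalyser();
// micSource.connect(analyser);
// const buf = new Uint8Array(analyser.fftSize);
// analyser.getByteTimeDomainData(buf);
// const speaking = computeRmsLevel(buf) > RMS_THRESHOLD; // assumed constant
```

On mobile this normalization step is skipped entirely: because Android metering is so spiky, the raw dB value is compared against VAD_SPEECH_DB directly, as described above.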

Echo Prevention

When TTS plays the AI response through the speaker, the mic could pick it up and create a feedback loop. The system prevents this with:

  1. ttsActiveRef — tracks the full TTS lifecycle (API call start → playback finish). The mic never re-opens while TTS is active, even during the 2-second API call before audio starts playing.
  2. Echo cooldown — after TTS finishes, the mic stays blocked for 1.5 seconds to let speaker echo dissipate. The UI shows "Preparing to listen..." during this period.
  3. Min speech duration — VAD ignores speech bursts shorter than 400ms, filtering out residual echo.
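The first two mechanisms combine into a single mic re-open gate, sketched here with illustrative names (`TtsState` mirrors what the doc calls ttsActiveRef plus a finish timestamp):

```typescript
const ECHO_COOLDOWN_MS = 1500; // block the mic after TTS playback ends

interface TtsState {
  active: boolean;           // true from TTS API call start to playback finish
  finishedAt: number | null; // timestamp when playback last finished (ms)
}

// True only when it is safe to re-open the microphone.
function micMayReopen(tts: TtsState, now: number): boolean {
  if (tts.active) return false; // never re-open mid-TTS, even during the API call
  if (tts.finishedAt !== null && now - tts.finishedAt < ECHO_COOLDOWN_MS) {
    return false; // "Preparing to listen..." cooldown
  }
  return true;
}
```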

Web Widget

The AI assistant widget appears as a floating button in the bottom-right corner of the web app. It supports:

  • Text chat — type messages to the AI agent
  • Voice mode — click the mic to speak, click again to send
  • Agent role switching — select between Support, Operations, Analytics roles
  • Docked mode — pin the widget to the side panel (shifts main content)
  • Page context — automatically sends the current page URL for context

The widget lives in modules/AI/frontend/components/ai-assistant-widget.tsx and is dynamically imported by the authenticated layout — no hard dependency on the AI module.

Mobile Voice Assistant

Triggered by long-pressing the center FAB in the tab bar. Features:

  • Full-screen overlay with animated waveform visualization
  • Live metering — waveform bars respond to mic input in real time
  • Pulse rings — concentric rings pulse with voice intensity
  • TTS playback — AI responses are spoken back via expo-av
  • Auto-resume — after TTS finishes, mic re-opens for continuous conversation
  • Haptic feedback — on speech detection, auto-send, and response

Recording Configuration

{
  sampleRate: 16000,       // Voice-optimized
  numberOfChannels: 1,     // Mono — reduces ambient noise
  bitRate: 64000,          // Sufficient for speech
  extension: ".m4a",       // AAC codec
  isMeteringEnabled: true  // Required for VAD
}

Conversation Flow

Manual Mode

  1. User taps mic → recording starts
  2. User taps mic again → recording stops, audio sent
  3. AI processes and responds (with TTS if configured)
  4. User taps mic to ask again

Hands-Free Mode (VAD)

  1. User opens voice assistant → mic starts, VAD arms
  2. User speaks → VAD detects speech
  3. User pauses for 1.5s → VAD auto-sends
  4. AI responds via TTS → speaker plays response
  5. TTS finishes → 1.5s echo cooldown → mic re-opens
  6. Cycle repeats until user closes the assistant
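The hands-free loop above can be sketched as a small state machine. Phase and event names are illustrative; the real component tracks this with refs and component state:

```typescript
type VoicePhase = "listening" | "processing" | "speaking" | "cooldown";
type VoiceEvent = "auto_send" | "response_ready" | "tts_done" | "cooldown_done";

const transitions: Record<VoicePhase, Partial<Record<VoiceEvent, VoicePhase>>> = {
  listening: { auto_send: "processing" },     // VAD fired after 1.5 s of silence
  processing: { response_ready: "speaking" }, // AI reply arrived, TTS starts
  speaking: { tts_done: "cooldown" },         // playback finished
  cooldown: { cooldown_done: "listening" },   // 1.5 s echo cooldown elapsed
};

function nextPhase(phase: VoicePhase, event: VoiceEvent): VoicePhase {
  return transitions[phase][event] ?? phase; // ignore events that don't apply
}
```

Ignoring out-of-order events (e.g. a stray tts_done while listening) keeps the loop robust when async callbacks fire late.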

Troubleshooting

Issue                             Cause                        Fix
"Microphone access denied"        Browser/app permissions      Grant mic permission in settings
No TTS playback                   TTS provider not configured  Add a TTS provider in AI Configuration
Echo feedback loop                Old app version              Update — echo prevention was added in the latest version
VAD never triggers (mobile)       Smoothed levels too low      Update — VAD now uses raw dB values
VAD triggers on background noise  Threshold too sensitive      Increase VAD_SPEECH_DB (e.g., to -30)