Voice Assistant
The voice assistant provides hands-free AI interaction through speech. It's available as a floating widget on web and a long-press FAB action on mobile.
Architecture
User speaks → Mic recording → STT (transcribe) → AI Agent (chat) → TTS (synthesize) → Speaker playback
The voice assistant chains three services:
- STT — converts speech to text (configurable provider)
- AI Agent — processes the text and generates a response
- TTS — converts the response back to speech
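The chain above is essentially three awaited calls. Here is a minimal sketch, assuming hypothetical service interfaces; none of these type or function names are the actual module API:

```typescript
// Hypothetical service shapes; the real STT/TTS providers are configurable.
type Services = {
  transcribe: (audio: Uint8Array) => Promise<string>;        // STT
  chat: (text: string, context?: string) => Promise<string>; // AI Agent
  synthesize: (text: string) => Promise<Uint8Array>;         // TTS
};

async function voicePipeline(svc: Services, audio: Uint8Array, context?: string) {
  const transcription = await svc.transcribe(audio);
  const aiResponse = await svc.chat(transcription, context);
  const speech = await svc.synthesize(aiResponse);
  return { transcription, aiResponse, speech };
}
```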
Permission Gating
The voice assistant is only visible to users with ai.view or ai.configure permission:
- Web: The AIAssistantWidget is conditionally rendered in the authenticated layout
- Mobile: The VoiceAssistant component and FAB long-press action are gated by permission
- API: The /ai/voice endpoint requires the tenant.permission:ai.view middleware
Users without AI permissions see no widget, no voice button, and get 403 on API calls.
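As a sketch, the gating rule on all three surfaces reduces to a single check (the helper name is hypothetical):

```typescript
// Illustrative helper: widget, FAB action, and endpoint all apply the same
// rule, namely that the user holds ai.view or ai.configure.
function canUseVoiceAssistant(permissions: string[]): boolean {
  return permissions.includes("ai.view") || permissions.includes("ai.configure");
}
```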
API Endpoint
POST /api/v1/ai/voice
Content-Type: multipart/form-data
audio: <file> # Audio file (m4a, mp3, wav, webm)
context: "orders page" # Optional page context
Response:
{
  "success": true,
  "data": {
    "transcription": "What is the status of order 9002?",
    "ai_response": "Order #9002 is currently in transit. It was shipped yesterday via Express delivery and is expected to arrive tomorrow.",
    "usage": {
      "stt_ms": 450,
      "llm_ms": 1200,
      "total_ms": 1650
    }
  }
}
The endpoint handles the full pipeline: transcribe the audio, send the text to the AI agent, and return both the transcription and AI response.
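A client call might look like the following sketch. The endpoint path, field names, and response shape come from this doc; the helper name, the Bearer auth header, and the injected fetch are assumptions:

```typescript
// fetchFn is injected so the helper can be exercised without a live server.
async function askVoice(
  fetchFn: typeof fetch,
  baseUrl: string,
  token: string,
  audio: Blob,
  context?: string
) {
  const form = new FormData();
  form.append("audio", audio, "clip.m4a"); // m4a, mp3, wav, webm accepted
  if (context) form.append("context", context);

  const res = await fetchFn(`${baseUrl}/api/v1/ai/voice`, {
    method: "POST",
    headers: { Authorization: `Bearer ${token}` }, // auth scheme assumed
    body: form,
  });
  if (res.status === 403) throw new Error("missing ai.view permission");

  const body = await res.json();
  return body.data as { transcription: string; ai_response: string };
}
```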
Voice Activity Detection (VAD)
VAD enables hands-free conversation by automatically detecting when the user stops speaking and triggering the send action.
How It Works
- User opens voice assistant — mic starts recording
- VAD monitors audio energy levels every 60ms
- When speech is detected (raw dB > -35), VAD marks speech start
- When silence follows speech for 1.5 seconds, VAD auto-triggers processing
- After AI responds via TTS, the mic re-opens for the next turn
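The steps above can be sketched as a small pure function. The threshold values match the VAD parameters documented in this section; the state shape and function names are illustrative, not the shipped code:

```typescript
const VAD_SPEECH_DB = -35;           // dB: readings above this count as speech
const VAD_SILENCE_DURATION = 1500;   // ms of silence after speech before auto-send
const VAD_MIN_SPEECH_DURATION = 400; // ms: ignore bursts shorter than this

interface VadState {
  speechStart: number | null; // timestamp of first speech in this turn
  lastSpeech: number | null;  // timestamp of the most recent speech tick
}

// Called on every metering tick (~60ms) with the raw dB level and a clock
// reading; returns true when the recording should be auto-sent.
function vadTick(state: VadState, db: number, now: number): boolean {
  if (db > VAD_SPEECH_DB) {
    if (state.speechStart === null) state.speechStart = now;
    state.lastSpeech = now;
    return false;
  }
  if (state.speechStart === null || state.lastSpeech === null) return false;
  const spokeLongEnough = state.lastSpeech - state.speechStart >= VAD_MIN_SPEECH_DURATION;
  const silentLongEnough = now - state.lastSpeech >= VAD_SILENCE_DURATION;
  return spokeLongEnough && silentLongEnough;
}
```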
VAD Parameters
| Parameter | Value | Purpose |
|---|---|---|
| VAD_SPEECH_DB | -35 dB | Raw metering threshold — above this is speech |
| VAD_SILENCE_DURATION | 1500 ms | Silence after speech before auto-send |
| VAD_MIN_SPEECH_DURATION | 400 ms | Minimum speech duration before considering auto-send |
Platform Differences
Web (browser):
Uses the Web Audio API AnalyserNode to read frequency data from the microphone stream. The normalized RMS level is compared against the threshold.
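A sketch of that computation, with the AnalyserNode wiring shown as comments since it is browser-only; the 0.04 threshold below is an illustrative value, not the shipped one:

```typescript
// Pure helper: normalized RMS of a block of time-domain samples
// (range 0..1 for full-scale input). This part runs the same everywhere.
function rmsLevel(samples: Float32Array): number {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

// Browser-only wiring (not runnable under Node):
// const analyser = audioContext.createAnalyser();
// sourceNode.connect(analyser);
// const buf = new Float32Array(analyser.fftSize);
// analyser.getFloatTimeDomainData(buf);
// const speaking = rmsLevel(buf) > 0.04; // illustrative threshold
```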
Mobile (React Native / Expo):
Uses expo-av recording metering (getStatusAsync().metering) which returns raw dB values. On Android, metering is very spiky (jumps from -14 dB to -160 dB between ticks), so VAD uses raw dB directly instead of smoothed/normalized levels.
Echo Prevention
When TTS plays the AI response through the speaker, the mic could pick it up and create a feedback loop. The system prevents this with:
- ttsActiveRef — tracks the full TTS lifecycle (API call start → playback finish). The mic never re-opens while TTS is active, even during the 2-second API call before audio starts playing.
- Echo cooldown — after TTS finishes, the mic stays blocked for 1.5 seconds to let speaker echo dissipate. The UI shows "Preparing to listen..." during this period.
- Min speech duration — VAD ignores speech bursts shorter than 400ms, filtering out residual echo.
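The first two guards combine into a single re-open decision (the third lives inside VAD). A minimal sketch with hypothetical names; the doc's ttsActiveRef maps to the ttsActive flag here:

```typescript
const ECHO_COOLDOWN_MS = 1500;

function micMayReopen(ttsActive: boolean, ttsFinishedAt: number | null, now: number): boolean {
  if (ttsActive) return false;                    // blocks the whole TTS lifecycle
  if (ttsFinishedAt === null) return true;        // nothing has played yet
  return now - ttsFinishedAt >= ECHO_COOLDOWN_MS; // wait out speaker echo
}
```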
Web Widget
The AI assistant widget appears as a floating button in the bottom-right corner of the web app. It supports:
- Text chat — type messages to the AI agent
- Voice mode — click the mic to speak, click again to send
- Agent role switching — select between Support, Operations, Analytics roles
- Docked mode — pin the widget to the side panel (shifts main content)
- Page context — automatically sends the current page URL for context
The widget lives in modules/AI/frontend/components/ai-assistant-widget.tsx and is dynamically imported by the authenticated layout — no hard dependency on the AI module.
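That soft dependency can be sketched with a guarded dynamic import; the loader shape is illustrative, and the real layout likely uses its framework's own lazy-loading mechanism:

```typescript
// Resolve a component module if it is installed; otherwise render nothing.
// Using a variable path keeps the layout free of a hard compile-time dependency.
async function loadOptionalComponent(path: string): Promise<unknown | null> {
  try {
    const mod = await import(path);
    return mod.default ?? null;
  } catch {
    return null; // module absent, so the widget simply doesn't render
  }
}
```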
Mobile Voice Assistant
Triggered by long-pressing the center FAB in the tab bar. Features:
- Full-screen overlay with animated waveform visualization
- Live metering — waveform bars respond to mic input in real-time
- Pulse rings — concentric rings pulse with voice intensity
- TTS playback — AI responses are spoken back via expo-av
- Auto-resume — after TTS finishes, mic re-opens for continuous conversation
- Haptic feedback — on speech detection, auto-send, and response
Recording Configuration
{
  sampleRate: 16000,       // Voice-optimized
  numberOfChannels: 1,     // Mono — reduces ambient noise
  bitRate: 64000,          // Sufficient for speech
  extension: ".m4a",       // AAC codec
  isMeteringEnabled: true  // Required for VAD
}
Conversation Flow
Manual Mode
- User taps mic → recording starts
- User taps mic again → recording stops, audio sent
- AI processes and responds (with TTS if configured)
- User taps mic to ask again
Hands-Free Mode (VAD)
- User opens voice assistant → mic starts, VAD arms
- User speaks → VAD detects speech
- User pauses for 1.5s → VAD auto-sends
- AI responds via TTS → speaker plays response
- TTS finishes → 1.5s echo cooldown → mic re-opens
- Cycle repeats until user closes the assistant
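The hands-free cycle above can be summarized as a four-state loop; the state and event labels below are illustrative, not the app's actual types:

```typescript
type State = "listening" | "processing" | "speaking" | "cooldown";
type Event = "vadAutoSend" | "ttsStart" | "ttsEnd" | "cooldownDone";

const transitions: Record<State, Partial<Record<Event, State>>> = {
  listening: { vadAutoSend: "processing" }, // 1.5s pause triggers send
  processing: { ttsStart: "speaking" },     // AI reply starts playing
  speaking: { ttsEnd: "cooldown" },         // playback finished
  cooldown: { cooldownDone: "listening" },  // 1.5s echo cooldown elapsed
};

function next(state: State, event: Event): State {
  return transitions[state][event] ?? state; // events that don't apply are ignored
}
```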
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| "Microphone access denied" | Browser/app permissions | Grant mic permission in settings |
| No TTS playback | TTS provider not configured | Add a TTS provider in AI Configuration |
| Echo feedback loop | Old app version | Update — echo prevention added in latest |
| VAD never triggers (mobile) | Smoothed levels too low | Uses raw dB now — update app |
| VAD triggers on background noise | Threshold too sensitive | Increase VAD_SPEECH_DB (e.g., -30) |