A Technical Introduction to Ava

Ava is an AI voice assistant that runs entirely in the browser. It is an experimental project exploring how to run LLMs completely on the client side and pair them with speech. All processing happens locally on the user's device via WebAssembly, which means no data is ever sent to a server, ensuring complete privacy.
GitHub Repository: https://github.com/muthuspark/ava/
Live Demo: https://ava.muthu.co
Ava’s capabilities include:
- Voice Activity Detection (VAD): Detects when the user is speaking and when they have finished.
- Speech-to-Text: Transcribes the user’s speech into text.
- Language Model: Generates a response based on the user’s query.
- Text-to-Speech: Converts the generated response into speech.
This document provides a technical overview of Ava’s architecture, technical stack, and configuration.
Architecture
Ava uses a three-stage pipeline architecture, with each stage powered by WebAssembly-based components.
```mermaid
graph TB
%% Input Stage
MIC[Microphone Input]
%% Stage 1: Speech Recognition
subgraph SR["Stage 1: Speech Recognition"]
direction TB
VAD[Voice Activity Detection <br> Detects speech segments]
ASR[Speech-to-Text Transcription <br> Converts audio to text]
VAD --> ASR
end
%% Stage 2: Language Model
subgraph LM["Stage 2: Language Model"]
direction TB
INF[LLM Inference<br>Generates contextual response]
end
%% Stage 3: Speech Synthesis
subgraph TTS["Stage 3: Speech Synthesis"]
direction TB
SYNTH[Text-to-Speech<br>Low-latency audio output]
end
%% Output Stage
SPEAKER[Speaker Output]
%% Flow connections
MIC --> SR
SR -->|Transcribed Text| LM
LM -->|Generated Response| TTS
TTS --> SPEAKER
%% Styling
classDef inputOutput fill:#e1f5ff,stroke:#0288d1,stroke-width:2px
classDef stage fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef component fill:#fff3e0,stroke:#f57c00,stroke-width:1px
class MIC,SPEAKER inputOutput
class SR,LM,TTS stage
class VAD,ASR,INF,SYNTH component
```
Speech Recognition: This initial stage captures and processes audio from the user’s microphone.
- Voice Activity Detection (VAD): Using the Silero VAD model, Ava listens for speech activity. It determines when the user starts and stops speaking, which is a more natural approach than fixed-interval processing.
- Transcription: Once the VAD detects the end of speech, the captured audio segment is transcribed into text using the Whisper (tiny-en) model, which runs via Transformers.js. A sketch of how these two pieces fit together follows.
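As a concrete illustration, here is a minimal sketch of a VAD-gated Whisper pipeline wired up with these two libraries. The model ID, callback wiring, and logging are illustrative assumptions, not Ava's exact code.

```ts
import { MicVAD } from "@ricky0123/vad-web";
import { pipeline } from "@huggingface/transformers";

// Load the tiny English Whisper model once, up front.
const transcriber = await pipeline(
  "automatic-speech-recognition",
  "Xenova/whisper-tiny.en" // assumed model ID; any Whisper tiny-en build works
);

const vad = await MicVAD.new({
  // `audio` is a Float32Array of 16 kHz PCM covering one detected speech segment.
  onSpeechEnd: async (audio: Float32Array) => {
    const result = (await transcriber(audio)) as { text: string };
    console.log("Transcript:", result.text);
  },
});

vad.start(); // begin listening to the microphone
```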
Language Model: The transcribed text from the previous stage is then fed into the language model.
- Inference: Ava uses the Gemma 3 270M model, running on Wllama (a llama.cpp WASM port), to generate a response. The `useConversation.ts` composable manages this process, triggering inference and streaming the generated tokens. A sketch of this call is shown below.
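A hedged sketch of what a streaming call looks like with Wllama's public API; the WASM paths and model URL are placeholders, and the sampling values mirror the configuration shown later in this document.

```ts
import { Wllama } from "@wllama/wllama";

// Map Wllama's WASM binaries to wherever they are served from (placeholder paths).
const wllama = new Wllama({
  "single-thread/wllama.wasm": "/wasm/single-thread/wllama.wasm",
  "multi-thread/wllama.wasm": "/wasm/multi-thread/wllama.wasm",
});

// Placeholder URL for a GGUF build of Gemma 3 270M Instruct.
await wllama.loadModelFromUrl("https://example.com/gemma-3-270m-it-Q8_0.gguf");

const reply = await wllama.createCompletion("What is WebAssembly?", {
  nPredict: 64,
  sampling: { temp: 0.7, top_k: 40, top_p: 0.9 },
  // Called per generated token; this is what enables sentence-by-sentence TTS.
  onNewToken: (token, piece, currentText) => {
    console.log(currentText);
  },
});
console.log("Final reply:", reply);
```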
Speech Synthesis: The final stage converts the generated text response back into speech.
- Low-Latency Output: The response text is split into sentences at punctuation boundaries (`.`, `!`, `?`, `,`). Each sentence is then queued for synthesis using the browser's native `SpeechSynthesis` API. This allows Ava to start speaking before the entire response has been generated, providing a more interactive experience. A minimal sketch follows.
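The browser half of this is small; a minimal sketch, assuming nothing beyond the standard SpeechSynthesis API (the helper name is hypothetical):

```ts
// speak() enqueues utterances; the browser plays them in order, so calling
// this once per sentence produces low-latency, incremental speech output.
function speakSentence(sentence: string): void {
  const utterance = new SpeechSynthesisUtterance(sentence);
  utterance.rate = 1.0; // default speaking rate
  speechSynthesis.speak(utterance);
}

speakSentence("WebAssembly lets code run at near-native speed.");
speakSentence("It works in every modern browser.");
```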
Technical Stack
The following table details the technologies used in Ava:
| Component | Technology | Size |
|---|---|---|
| Voice Activity Detection | Silero VAD v5 (@ricky0123/vad-web) | ~2MB |
| Speech-to-Text | Whisper tiny-en (@huggingface/transformers) | ~40MB |
| LLM | Gemma 3 270M Instruct (Wllama) | ~180MB |
| Text-to-Speech | Web Speech Synthesis API | Native |
| Audio Visualization | Web Audio API | Native |
| Frontend | Vue 3 + TypeScript | — |
| Build | Vite | — |
Project Structure
The project is structured as follows:
```
src/
├── App.vue                    # Main application shell
├── components/
│   ├── AboutPopup.vue         # Info modal
│   └── WaveformVisualizer.vue # Real-time audio visualization
├── composables/
│   ├── useConversation.ts     # Orchestrates conversation flow
│   ├── useWhisper.ts          # VAD + Whisper speech recognition
│   ├── useWllama.ts           # Gemma LLM inference
│   ├── useSpeechSynthesis.ts  # Browser TTS wrapper
│   └── useAudioVisualizer.ts  # Web Audio frequency analysis
├── styles/
│   └── main.css               # Global styles
└── types/
    └── index.ts               # TypeScript definitions
```
Workflow
The following diagram illustrates the event-driven data flow between the core composables:
```mermaid
graph TD;
subgraph "User Interaction"
A[Microphone Audio]
end
subgraph "Composables"
B(useAudioVisualizer)
C(useWhisper)
D(useConversation)
E(useWllama)
F(useSpeechSynthesis)
end
subgraph "Browser APIs"
G[Web Audio API]
H[SpeechSynthesis API]
end
A -- Raw Audio Stream --> G;
G -- Analyzed Frequency Data --> B;
A -- Raw Audio Stream --> C;
C -- Transcribed Text --> D;
D -- Prompt --> E;
E -- Generated Tokens --> D;
D -- Full Sentence --> F;
F -- Synthesized Speech --> H;
style A fill:#f9f,stroke:#333,stroke-width:2px
style B fill:#f9f,stroke:#333,stroke-width:2px
style C fill:#f9f,stroke:#333,stroke-width:2px
style D fill:#ccf,stroke:#333,stroke-width:2px
style E fill:#ccf,stroke:#333,stroke-width:2px
style F fill:#cfc,stroke:#333,stroke-width:2px
style G fill:#fcf,stroke:#333,stroke-width:2px
style H fill:#fcf,stroke:#333,stroke-width:2px
```
Explanation of the Workflow
- Audio Input: The `useAudioVisualizer` and `useWhisper` composables both receive the raw audio stream from the microphone via the Web Audio API.
- Visualization: `useAudioVisualizer` analyzes the frequency data of the audio stream to create the waveform visualization on the UI (see the first sketch below).
- Speech Recognition: `useWhisper` processes the audio stream. It uses the Silero VAD model to detect speech and, once the user stops talking, sends the audio to the Whisper model for transcription.
- Conversation Orchestration: The transcribed text is passed to the `useConversation` composable, which manages the overall conversation flow.
- LLM Inference: `useConversation` sends the transcribed text as a prompt to the `useWllama` composable. `useWllama` then uses the Gemma 3 LLM to generate a response, streaming the tokens back to `useConversation`.
- Speech Synthesis: As `useConversation` receives tokens from `useWllama`, it assembles them into sentences. Once a complete sentence is formed (determined by the `SENTENCE_BOUNDARY` regex), it is passed to the `useSpeechSynthesis` composable, which uses the browser's SpeechSynthesis API to speak the sentence aloud. This repeats until the entire response has been synthesized (see the second sketch below).
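For the visualization step, a minimal sketch of Web Audio frequency analysis, assuming an `AnalyserNode` feeding a canvas renderer; the FFT size and the rendering hook are illustrative, not Ava's exact code.

```ts
const audioContext = new AudioContext();
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const source = audioContext.createMediaStreamSource(stream);

const analyser = audioContext.createAnalyser();
analyser.fftSize = 256; // a small FFT keeps the waveform responsive
source.connect(analyser);

const bins = new Uint8Array(analyser.frequencyBinCount);

function draw(): void {
  analyser.getByteFrequencyData(bins); // fill `bins` with the current spectrum
  // ...render `bins` as bars on a <canvas> here...
  requestAnimationFrame(draw);
}
draw();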
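For the speech-synthesis step, a hypothetical sketch of the token-to-sentence assembly: buffer streamed tokens and flush a sentence to TTS whenever the boundary regex matches. The function names are illustrative, not Ava's exact implementation.

```ts
const SENTENCE_BOUNDARY = /[.!?,](?:\s|$)/;

let buffer = "";

// Called for every token piece streamed back from the LLM.
function onToken(piece: string, speak: (sentence: string) => void): void {
  buffer += piece;
  const match = buffer.match(SENTENCE_BOUNDARY);
  if (match && match.index !== undefined) {
    // Flush everything up to and including the boundary punctuation.
    const end = match.index + 1;
    speak(buffer.slice(0, end).trim());
    buffer = buffer.slice(end);
  }
}
```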
Configuration
Ava’s behavior can be customized by adjusting the following parameters:
LLM Settings (`useWllama.ts`)
```ts
nPredict: 64, // Max tokens (lower = faster response)
temp: 0.7,    // Sampling temperature
top_k: 40,    // Top-k sampling
top_p: 0.9,   // Nucleus sampling
```
VAD Settings (`useWhisper.ts`)
```ts
positiveSpeechThreshold: 0.5,  // Confidence threshold for speech detection
negativeSpeechThreshold: 0.35, // Threshold for non-speech
redemptionMs: 800,             // Wait time after speech ends before triggering
minSpeechMs: 200,              // Minimum speech duration to consider
preSpeechPadMs: 300,           // Audio to include before speech is detected
```
Sentence Boundary (`useWllama.ts`)
```ts
const SENTENCE_BOUNDARY = /[.!?,](?:\s|$)/ // TTS triggers on punctuation
```
Requirements
- Browser: Chrome 90+ or Edge 90+ (requires `SharedArrayBuffer`)
- Headers: Cross-Origin Isolation must be enabled on the hosting server (a Vite sketch follows):

```
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
```
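Since the project builds with Vite, the dev server can set these headers via `server.headers`; a minimal sketch (production servers need the same headers in their own configuration):

```ts
// vite.config.ts — enable cross-origin isolation for local development.
import { defineConfig } from "vite";

export default defineConfig({
  server: {
    headers: {
      "Cross-Origin-Opener-Policy": "same-origin",
      "Cross-Origin-Embedder-Policy": "require-corp",
    },
  },
});
```

At runtime, `self.crossOriginIsolated` reports whether the headers took effect before the multi-threaded WASM build is loaded.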
Permissions
Ava requires the following browser permissions:
| Permission | Purpose | Required |
|---|---|---|
| Microphone | Voice input for Whisper | Yes |
| Audio Playback | Text-to-speech output | Yes |
Performance
- First load: Downloads ~220MB of models, which are then cached by the browser.
- Inference:
  - VAD runs in real time.
  - Whisper transcription takes approximately 0.3–0.5 seconds.
  - The LLM takes about 1–2 seconds to generate a response.
- Memory: Consumes between 500MB and 1GB of RAM during operation.
- WebGPU: Not yet supported; all processing runs on the CPU via WASM SIMD.
Future Work
- WebGPU Support: Integrating WebGPU would offload processing to the GPU, significantly speeding up inference times.
- Improved Speech Synthesis: While the native browser API is effective, exploring more advanced, natural-sounding TTS options could enhance the user experience.
- Conversation Context: Implementing a mechanism to carry context over multiple turns would allow for more coherent and engaging conversations.