ElevenLabs charges $0.40 per hour for speech-to-text. Their TTS runs $0.30 per thousand characters at the Creator tier. Run a voice interface for a few hours a day and you're looking at $50-100 a month, for one user. Scale to a product with multiple concurrent sessions and the API math stops working fast.
The alternative is running the models yourself. Whisper on an M-series Mac transcribes at 18x real-time. Kokoro-82M, an 82-million parameter TTS model, synthesizes speech at 24kHz on CPU. The marginal cost is electricity. I've been running both locally for months, and the voice pipeline around them went through two complete architectures before the current one stopped needing rewrites.
The Starting Point
The first version was a two-process system. Node.js handled audio capture via sox, piping 48kHz stereo PCM from a Loopback virtual audio device. A channel splitter extracted mono from the stereo pair, RNNoise suppressed background noise at 480-sample frame boundaries, libsamplerate-js resampled from 48kHz to 16kHz, and an RMS-based VAD decided when someone had stopped talking.
Each completed phrase got wrapped in a WAV container and HTTP-POSTed to a Python FastAPI server running OpenAI's Whisper medium model on localhost.
It worked. Built it in a couple of days. Transcription was accurate, the RNNoise denoising was effective, and a promise-chain queue serialized requests so Whisper never got concurrent calls. But the architecture had hard limits: two runtimes to coordinate, macOS-only hardware capture, single-user by design, one-way only with no synthesis. Good enough to validate the concept, not good enough to build a product on.
VAD Engineering
Voice activity detection sounds simple until you try to make it reliable. The prototype used raw RMS energy against an Int16 amplitude threshold of 275, with a sliding window of 50 chunks at 10ms each. If 60% of the last 50 chunks fell below the threshold, the phrase was over. I tuned it until it worked, and it did. But the production rewrite was a chance to do it properly.
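The prototype's end-of-phrase check can be sketched like this (names and structure are illustrative reconstructions from the description above, not the original code):

```typescript
// Prototype-style RMS VAD: 50 chunks of 10ms each form a 500ms sliding
// window; if 60% of those chunks fall below the threshold, the phrase ends.
const THRESHOLD = 275;      // Int16 amplitude
const WINDOW = 50;          // chunks
const PERCENT_SILENT = 0.6;

function chunkRms(samples: Int16Array): number {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

function phraseEnded(recentRms: number[]): boolean {
  const window = recentRms.slice(-WINDOW);
  if (window.length < WINDOW) return false; // not enough history yet
  const silent = window.filter(rms => rms < THRESHOLD).length;
  return silent / WINDOW >= PERCENT_SILENT;
}
```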
The production VAD is built around seven parameters that handle the edge cases the prototype's simpler approach left to threshold tuning:
```typescript
const DEFAULT_THRESHOLD = 0.008;        // Float32 normalized [-1, 1], ~-42dB
const DEFAULT_PRE_ROLL_MS = 500;        // idle audio kept before speech start
const DEFAULT_HANGOVER_MS = 200;        // wait after speech before checking silence
const DEFAULT_MIN_PHRASE_MS = 60;       // discard clicks and pops
const DEFAULT_MAX_PHRASE_MS = 30_000;   // force-end to prevent unbounded buffering
const DEFAULT_SILENCE_WINDOW_MS = 500;  // trailing silence detection width
const DEFAULT_PERCENT_SILENT = 0.6;     // 60% of window must be below threshold
```

The threshold moved from Int16 amplitude (275 out of 32768) to normalized Float32 (0.008, roughly -42dB). At this level, quiet room noise sits around 0.001-0.005, soft speech onset hits 0.005-0.015, and vowels land between 0.05 and 0.3.
Pre-roll compensates for detection latency. The VAD continuously buffers 500ms of audio in a rolling window while idle. When the RMS first crosses the speech threshold, the entire buffer gets prepended to the phrase, silence and all. A breathy consonant like the "h" in "hey" might take a window or two before it crosses 0.008, but the pre-roll already has those samples. The leading silence is harmless. STT models handle it fine, and trimming it would reintroduce the clipping the pre-roll exists to prevent.
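The pre-roll mechanism amounts to a small rolling buffer. A minimal sketch (the class and its interface are illustrative assumptions, not the project's actual code):

```typescript
// Keep a rolling ~500ms of idle audio; on speech onset, the whole buffer
// becomes the start of the phrase, so breathy onsets aren't clipped.
const SAMPLE_RATE = 16_000;
const PRE_ROLL_SAMPLES = (500 / 1000) * SAMPLE_RATE; // 8000 samples

class PreRollBuffer {
  private chunks: Float32Array[] = [];
  private total = 0;

  push(chunk: Float32Array): void {
    this.chunks.push(chunk);
    this.total += chunk.length;
    // Evict oldest chunks once dropping one still leaves >= 500ms.
    while (this.total - this.chunks[0].length >= PRE_ROLL_SAMPLES) {
      this.total -= this.chunks[0].length;
      this.chunks.shift();
    }
  }

  // Called when RMS first crosses the speech threshold.
  drain(): Float32Array {
    const out = new Float32Array(this.total);
    let offset = 0;
    for (const c of this.chunks) { out.set(c, offset); offset += c.length; }
    this.chunks = [];
    this.total = 0;
    return out;
  }
}
```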
Hangover guards against premature end-of-speech detection. Brief dips below threshold mid-word (sibilants, unvoiced consonants) can look like silence. The 200ms hangover ensures that once speech starts, the VAD waits before evaluating whether the speaker has actually stopped. Short single-word responses like "yes" and "yeah" survive instead of getting swallowed by an eager silence check.
Minimum phrase duration (60ms) discards mechanical noise. A door closing, a keyboard tap, or a cough that's just loud enough to cross the threshold but too short to be speech. With 30ms RMS windows, 60ms quantizes to 2 voiced windows (960 samples). The word "hi" takes about 60-90ms of voiced audio, so 2 windows is the floor that lets short words through while still rejecting clicks and pops.
Post-speech padding is implicit. The trailing silence detection window requires 60% of 17 windows (11+ windows, 330ms minimum) to be below threshold before firing. By the time the phrase emits, those silence windows are already in the phrase buffer. The phrase audio ships with 330ms+ of trailing silence baked in, which gives Whisper clean context on both ends without an explicit padding step.
Maximum phrase duration (30 seconds) is a safety valve. If the VAD misses a silence boundary, this forces a transcription rather than letting the buffer grow unbounded.
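Put together, the parameters drive a small state machine. A minimal sketch of the per-window decision logic (the class shape, window size handling, and emit callback are assumptions; pre-roll is omitted for brevity):

```typescript
// Minimal RMS VAD state machine using the defaults described above.
const WINDOW_MS = 30; // per-window RMS granularity

type VadState = 'idle' | 'speaking';

class RmsVad {
  private state: VadState = 'idle';
  private phrase: Float32Array[] = [];
  private phraseMs = 0;
  private speechMs = 0;           // elapsed speech time, gates the hangover
  private recent: boolean[] = []; // silence flags for the trailing window

  constructor(private onPhrase: (windows: Float32Array[]) => void) {}

  processWindow(window: Float32Array): void {
    const rms = Math.sqrt(window.reduce((s, x) => s + x * x, 0) / window.length);
    const voiced = rms >= 0.008;                     // DEFAULT_THRESHOLD

    if (this.state === 'idle') {
      if (!voiced) return;                           // pre-roll handling omitted
      this.state = 'speaking';
      this.speechMs = 0;
    }

    this.phrase.push(window);
    this.phraseMs += WINDOW_MS;
    this.speechMs += WINDOW_MS;
    this.recent.push(!voiced);
    const silenceWindows = Math.ceil(500 / WINDOW_MS); // DEFAULT_SILENCE_WINDOW_MS
    if (this.recent.length > silenceWindows) this.recent.shift();

    const hangoverOver = this.speechMs >= 200;       // DEFAULT_HANGOVER_MS
    const windowFull = this.recent.length === silenceWindows;
    const silentShare = this.recent.filter(Boolean).length / this.recent.length;
    const silenceHit = windowFull && silentShare >= 0.6; // DEFAULT_PERCENT_SILENT
    const tooLong = this.phraseMs >= 30_000;         // DEFAULT_MAX_PHRASE_MS

    if ((hangoverOver && silenceHit) || tooLong) this.emit();
  }

  private emit(): void {
    if (this.phraseMs >= 60) this.onPhrase(this.phrase); // DEFAULT_MIN_PHRASE_MS
    this.phrase = [];
    this.phraseMs = 0;
    this.recent = [];
    this.state = 'idle';
  }
}
```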
This is roughly as far as you can push an RMS-based VAD. The fundamental limitation is that RMS measures energy, not speech: a cough and the word "hi" can produce the same RMS value.
Model-based detectors like Silero VAD solve this properly. Silero uses spectral features and learned attention to classify speech vs non-speech regardless of volume, runs under 1ms per 32ms chunk on CPU, and the model is under 1MB. It's a viable option for real-time pipelines and would eliminate most of the noise that ends up as Whisper hallucinations.
For this project, RMS with hallucination stripping was the right call. It's an open-source tool designed for single-user voice input where transcripts accumulate into a prompt and trigger words control when it sends. The VAD just needs to find phrase boundaries cleanly enough that Whisper produces usable text. Adding an ONNX runtime dependency for a marginal improvement in phrase boundary detection wasn't worth the build complexity for that use case. If the interaction model were real-time conversational turn-taking or multi-speaker, Silero would be the obvious choice.
The Whisper Padding Trick
Even with a tuned VAD, short utterances cause problems. A one-word response like "yes" or "no" produces 300-400ms of audio. Whisper's attention mechanism treats input as a 30-second spectrogram and runs cross-attention across the full image. With sub-second audio, the model sees mostly silence and either returns nothing or hallucinates filler text like "Thank you" or an ellipsis.
The fix is four lines:
```typescript
const MIN_AUDIO_SAMPLES = 16_000 * 1; // 1 second at 16kHz

function padToMinDuration(audio: Float32Array): Float32Array {
  if (audio.length >= MIN_AUDIO_SAMPLES) return audio;
  const padded = new Float32Array(MIN_AUDIO_SAMPLES);
  padded.set(audio);
  // Float32Array zero-initializes, remainder is silence
  return padded;
}
```

Every phrase gets padded to at least one second before passing to Whisper. The trailing silence gives the attention mechanism enough context to anchor on the actual speech content. Longer phrases pass through unchanged. This eliminated hallucinations on short utterances entirely.
Padding doesn't catch everything though. Non-speech audio that passes the VAD (coughs, a door closing, ambient noise that crosses the RMS threshold long enough to survive the minimum phrase filter) still reaches Whisper. The model doesn't know what to do with it and outputs bracketed or asterisk-wrapped annotations: [cough], [silence], [BLANK_AUDIO], *inaudible*. These look like real transcriptions to downstream consumers.
The fix is a regex strip after transcription:
```typescript
const HALLUCINATION_RE = /\[.*?\]|\*+[^*]+\*+/g;

function stripHallucinations(text: string): string {
  return text.replace(HALLUCINATION_RE, '').replace(/\s{2,}/g, ' ').trim();
}
```

This runs on every transcription result before it gets delivered. [cough] yes do that becomes yes do that. A transcription that's entirely hallucination artifacts strips down to an empty string and gets dropped. The pattern is intentionally broad. These annotations aren't special tokens defined in Whisper's tokenizer. They're regular vocabulary inherited from YouTube subtitles in the training data, where human transcribers wrote [Music] and [Applause] as conventions. Whisper's tokenizer does have a non_speech_tokens suppression list that can block bracket and asterisk patterns during decoding, but whisper.cpp disables it by default, and the vocabulary of YouTube subtitle conventions isn't stable across model sizes or languages.
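A quick check of the pattern against the examples above (the function restated so the snippet runs standalone):

```typescript
const HALLUCINATION_RE = /\[.*?\]|\*+[^*]+\*+/g;

function stripHallucinations(text: string): string {
  return text.replace(HALLUCINATION_RE, '').replace(/\s{2,}/g, ' ').trim();
}

stripHallucinations('[cough] yes do that'); // → 'yes do that'
stripHallucinations('*inaudible* okay');    // → 'okay'
stripHallucinations('[BLANK_AUDIO]');       // → '' (empty result, dropped downstream)
```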
Provider Architecture
The production system abstracts STT and TTS behind a provider interface with three tiers: local, cloud, and browser.
```typescript
const manager = new VoiceManager({
  stt: { tier: 'local' },
  tts: { tier: 'cloud' }
});

manager.registerSttProvider('local', new LocalSTTProvider({ model: 'base.en' }));
manager.registerSttProvider('cloud', new CloudSTTProvider({ apiKey }));
manager.registerSttProvider('browser', new BrowserSTTProvider());
manager.registerTtsProvider('cloud', new CloudTTSProvider({ apiKey, voiceId }));
manager.registerTtsProvider('browser', new BrowserTTSProvider());
```

STT and TTS tiers are independent. You can run local STT with cloud TTS, or cloud STT with browser-native synthesis, or any other combination. The configured tier is tried first; if initialization fails, the system falls back to the browser tier, which delegates to the browser's native SpeechRecognition and SpeechSynthesis APIs on the client. Not great quality, but it means voice never completely breaks.
This matters because native addon dependencies like smart-whisper fail on some platforms. ONNX models fail to load under memory pressure. API keys expire. The tiered architecture means a failed dependency downgrades the experience instead of killing it.
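A sketch of that fallback policy (the interface shape, initialize() method, and error handling here are assumptions, not the project's actual API):

```typescript
// Try the configured tier first, then degrade to the browser tier,
// which wraps the Web Speech API and always exists on the client.
type Tier = 'local' | 'cloud' | 'browser';

interface SttProvider {
  initialize(): Promise<void>;
  transcribe(audio: Float32Array): Promise<string>;
}

async function resolveSttProvider(
  configured: Tier,
  providers: Map<Tier, SttProvider>
): Promise<SttProvider> {
  for (const tier of [configured, 'browser'] as Tier[]) {
    const provider = providers.get(tier);
    if (!provider) continue;
    try {
      await provider.initialize();
      return provider;
    } catch {
      // Native addon failed to load, model missing, key expired:
      // fall through to the next tier instead of crashing.
    }
  }
  throw new Error('no voice provider could initialize');
}
```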
Binary WebSocket Protocol
Audio moves between browser and server as binary WebSocket frames with a minimal framing protocol. Each frame starts with a 4-byte ASCII tag followed by the raw payload:
```
[4 bytes tag][N bytes Float32Array payload]

VAUD → voice audio (browser → server)
VTTS → synthesized speech (server → browser)
ECHO → echo/debug audio
```

Sending is a Buffer.concat:
```typescript
function sendBinary(ws: WebSocket, tag: string, payload: Buffer): void {
  ws.send(Buffer.concat([Buffer.from(tag, 'ascii'), payload]));
}
```

Receiving strips the tag and dispatches:
```typescript
function handleBinaryMessage(data: Buffer): void {
  const tag = data.subarray(0, 4).toString('ascii');
  const payload = data.subarray(4);
  const handler = binaryHandlers.get(tag);
  if (handler) handler(payload);
}
```

One gotcha with Node.js WebSocket binary data: Buffer objects share a pool slab internally, so a Float32Array view into payload.buffer may alias memory from unrelated messages. The handler copies the data out before processing:
```typescript
const view = new Float32Array(payload.buffer, payload.byteOffset, payload.byteLength / 4);
const audio = new Float32Array(view.length);
audio.set(view); // copy to avoid aliasing Node's buffer pool
```

Skip the copy and you get intermittent audio corruption that's nearly impossible to reproduce consistently.
What Got Dropped
RNNoise was the most complex piece of the prototype: a custom-compiled static library, a C++ NAPI addon with node-gyp build tooling, a FrameBuffer transform to enforce exactly 480-sample frames at 48kHz, and a runtime stage that processed every audio frame before anything else in the chain.
The production system doesn't use it. Browser-captured audio arrives cleaner than hardware capture because getUserMedia can be configured to apply noise suppression, echo cancellation, and auto gain control before the audio ever reaches the application. Requesting all three explicitly in the media constraints gives a clean enough signal that RNNoise's benefit doesn't justify its build complexity.
The same logic killed the resampling stage. The prototype captured at 48kHz (sox's default for macOS CoreAudio) and resampled to 16kHz for Whisper via a WASM port of libsamplerate. In the production system, the browser captures at 16kHz directly, configured through the AudioContext sample rate. No server-side resampling needed. The WASM dependency disappeared with it.
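The browser-side configuration for both removals can be sketched as follows (these are standard getUserMedia constraints and AudioContext options; the surrounding wiring is an assumption):

```typescript
// The three built-in DSP stages that replaced RNNoise, plus a 16kHz
// AudioContext so no server-side resampling is needed.
const captureConstraints = {
  audio: {
    noiseSuppression: true,
    echoCancellation: true,
    autoGainControl: true,
  },
};

// In the browser:
//   const context = new AudioContext({ sampleRate: 16_000 });
//   const stream = await navigator.mediaDevices.getUserMedia(captureConstraints);
//   const source = context.createMediaStreamSource(stream);
```

Browsers may silently ignore constraints they can't satisfy, so a production implementation should check the resulting track settings rather than assume all three stages engaged.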
Both removals follow the same principle: when the audio source changes, the processing chain should change with it. Infrastructure decisions from the prototype encoded assumptions about the input signal that stopped being true when the input moved from hardware capture to browser WebSocket.
Local Inference
The prototype ran Whisper as a separate Python process, loading the medium model at startup and accepting HTTP multipart POSTs with WAV files. Every transcription involved constructing a WAV header in memory, HTTP round-trip to localhost, temp file I/O on the Python side, and JSON serialization back. It worked, but the process coordination overhead added 200-400ms per request on top of inference time.
smart-whisper replaces this with an in-process NAPI addon. whisper.cpp compiles to native code and links directly into the Node.js runtime. Model loading happens once, and transcription is a function call across the JS/C++ boundary:
```typescript
async init(): Promise<void> {
  const { Whisper, manager } = await import('smart-whisper');
  const modelName = await manager.download('base.en');
  this.whisper = new Whisper(manager.resolve(modelName), { gpu: true });
}

async transcribe(audio: Float32Array): Promise<string> {
  const { result } = await this.whisper.transcribe(audio, {
    language: 'en',
    format: 'detail'
  });
  const segments = await result;
  return segments.map(s => s.text).join(' ').trim();
}
```

The two-phase await is intentional: the first await on transcribe() kicks off inference, the second resolves the segments. On an M-series Mac with Metal GPU acceleration, the base.en model (74M parameters) transcribes at roughly 18x real-time. A 5-second phrase completes in under 300ms.
For TTS, Kokoro-82M runs through kokoro-js, which wraps the ONNX model for Node.js. 82 million parameters, 24kHz output, Apache 2.0 licensed. The streaming API yields Float32Array chunks per sentence:
```typescript
async *synthesize(text: string): AsyncGenerator<Float32Array> {
  const splitter = new TextSplitterStream();
  splitter.push(text);
  splitter.close();
  for await (const chunk of this.tts.stream(splitter, {
    voice: this.voice,
    speed: this.speed
  })) {
    const samples = chunk.audio?.audio;
    if (samples?.length > 0) yield samples;
  }
}
```

The TextSplitterStream construction deserves a note. Passing a raw string to tts.stream() silently buffers the last sentence forever because the internal splitter never receives a close signal, and the async generator hangs. Constructing the splitter manually and calling .close() before passing it to stream() fixes this. The kind of bug that costs an afternoon the first time you hit it.
The Economics
Running both models locally on owned hardware, the marginal cost per hour of voice processing is effectively the electricity to keep the machine running. Compare that to cloud APIs:
| Service | STT (per hour of audio) | TTS (per 1k characters) |
|---|---|---|
| ElevenLabs | $0.40/hr | $0.30 |
| OpenAI Whisper API | $0.36/hr | — |
| Deepgram Nova-3 | $0.46/hr (streaming) | $0.03 |
| Local (owned hardware) | ~$0.00 | ~$0.00 |
A thousand characters of TTS is roughly 60-90 seconds of natural speech. For a voice interface handling a few hours of conversation per day, cloud costs run $50-100+ monthly per user. Local inference on an M-series Mac that's already on the desk costs nothing incremental.
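As a rough worked example, using a hypothetical usage profile (the daily figures below are assumptions for illustration, not measured data):

```typescript
// Hypothetical daily profile: the user talks a lot, the assistant
// replies briefly. Priced at ElevenLabs list rates from the table above.
const sttHoursPerDay = 2;   // user speech transcribed, $0.40/hr
const ttsKCharsPerDay = 6;  // ~6-9 minutes of synthesis, $0.30/1k chars

const dailyCost = sttHoursPerDay * 0.40 + ttsKCharsPerDay * 0.30;
const monthlyCost = dailyCost * 30;

console.log(`~$${monthlyCost.toFixed(0)}/month per user`); // prints "~$78/month per user"
```

Heavier synthesis use pushes the number up fast: TTS characters, not STT hours, dominate the bill at these rates.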
The tradeoff is real: local inference requires the hardware, the build toolchain for native addons, and tolerance for the occasional model-loading hiccup on cold start. Cloud APIs are a POST request away. The provider tier architecture means you don't have to choose permanently. Start with cloud, add local when the volume justifies it, fall back to browser when nothing else works.
Build It, Then Replace the Pieces
The end-to-end voice platforms (OpenAI's Realtime API, ElevenLabs Conversational AI, Deepgram's Voice Agent) are getting good. They handle the full pipeline in one API call and the latency is dropping fast. If all you need is a voice interface, those are the right answer.
But if you need to control the VAD tuning, choose your own models, mix local and cloud providers, or run the whole thing on your own hardware, there's no shortcut. You have to build the pipeline yourself and understand each stage well enough to replace it when something better comes along. Every component in this system has been swapped at least once. The prototype ran RNNoise, Python Whisper, and sox. The production version runs none of those. A year from now, smart-whisper and Kokoro will probably be replaced too. The architecture that survives is the one designed for its own parts to be thrown away.