Voice AI Latency Benchmarks: Why Only 2 of 5 Stacks Stay Under 300ms in Production

Voice AI vendors claim sub-300ms response times. Ken Imoto tested five production stacks against the same one-minute conversation and found three of them miss the target entirely. The two that succeeded were the ones he assumed were marketing fluff. The hand-stitched pipelines were the problem.

The Three Latency Cliffs

Voice latency does not degrade smoothly. It falls off cliffs. User behavior changes sharply at specific thresholds:

Latency Range	User Behavior
0-300ms	Talks normally, never thinks about the AI
300-500ms	Senses a pause, tolerates it
500-800ms	Talks over the AI (“can you hear me?”)
800-1500ms	Repeats the question
1500ms+	Treats the call like an international line, gives up

At 300ms, the user starts noticing a machine. Above 500ms, they fight the turn-taking model and the STT keeps resetting because they talk over the response. By 800ms, half the testers said “hello? hello?” on playback.
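
The thresholds in the table can be encoded as a small lookup, which is handy for labeling measured latencies in a report. This is a minimal sketch: the band boundaries come from the table above, while the function name, the shortened labels, and the choice to treat each upper bound as exclusive (so exactly 300ms counts as "senses a pause") are assumptions made for illustration.

```python
# Upper bounds (exclusive) and behaviors taken from the table above.
# The structure and names are illustrative, not from the benchmark.
LATENCY_BANDS = [
    (300, "talks normally"),
    (500, "senses a pause, tolerates it"),
    (800, "talks over the AI"),
    (1500, "repeats the question"),
]


def expected_behavior(latency_ms: float) -> str:
    """Return the user-behavior band for a given response latency."""
    for upper_bound_ms, behavior in LATENCY_BANDS:
        if latency_ms < upper_bound_ms:
            return behavior
    return "gives up"  # the 1500ms+ row


for sample_ms in (240, 300, 650, 1200, 2000):
    print(f"{sample_ms}ms -> {expected_behavior(sample_ms)}")
```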

Where the 300ms Budget Goes

A cascaded voice pipeline has four serial components:

STT (speech-to-text): 80-300ms depending on model and VAD design
LLM TTFT (time to first token): 100-500ms depending on model size, context length, and cold-start
TTS TTFB (time to first byte of audio): 75-300ms depending on the vocoder
Network round-trip: 50-200ms, bounded below by the speed of light and your colo choice

Add the fastest number in every row and you get 305ms, already 5ms past the 300ms target. Add the typical numbers and you are at 650ms. The budget is gone before you account for queueing, retries, or concurrent load.
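
The best-case sum is easy to check by hand, and the same arithmetic gives the worst case. The sketch below uses only the ranges listed above; the dictionary keys and variable names are made up for illustration. It does not reproduce the 650ms "typical" figure, because the article does not say which per-stage values count as typical.

```python
# (fastest_ms, slowest_ms) per stage, copied from the list above.
STAGE_RANGES_MS = {
    "stt": (80, 300),
    "llm_ttft": (100, 500),
    "tts_ttfb": (75, 300),
    "network_rtt": (50, 200),
}
BUDGET_MS = 300

# The stages run serially, so their latencies add.
best_case_ms = sum(low for low, _ in STAGE_RANGES_MS.values())
worst_case_ms = sum(high for _, high in STAGE_RANGES_MS.values())

print(f"best case:  {best_case_ms}ms ({best_case_ms - BUDGET_MS:+d}ms vs budget)")
print(f"worst case: {worst_case_ms}ms ({worst_case_ms - BUDGET_MS:+d}ms vs budget)")
```

Even with every stage at its fastest listed value, the serial sum lands 5ms over the target, so a cascaded pipeline has to overlap stages to get under it.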

The Five Stacks Tested

The author tested:

Vapi (managed WebRTC)
Retell (managed WebRTC)
AssemblyAI Realtime + GPT-4o + ElevenLabs (hand-stitched WebSocket)
Deepgram + Claude Sonnet + PlayHT (hand-stitched HTTP streaming)
OpenAI Realtime API (managed WebSocket)

Only Vapi and Retell held P95 at or near 300ms. The other three stacks, the two hand-stitched pipelines (AssemblyAI-based and Deepgram-based) and the managed OpenAI Realtime API, all exceeded 500ms at P95 under concurrent load.

Why Hand-Stitched Pipelines Miss the Target

The managed stacks (Vapi, Retell) use WebRTC with custom VAD tuning and pre-warmed LLM connections. They pipeline the stages: TTS starts synthesizing before the LLM finishes the full response.

The hand-stitched stacks serialize everything. The STT waits for silence detection. The LLM waits for the full transcript. The TTS waits for the full LLM response. Each stage adds queueing delay.

Example of the serialization problem:

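# Hand-stitched pipeline (serialized): each stage waits for the previous one to finish.
# stt_client, llm_client, and tts_client are placeholder objects, not a real SDK.
async def serialized_turn(audio_chunk):
    transcript = await stt_client.transcribe(audio_chunk)    # ~200ms
    llm_response = await llm_client.complete(transcript)     # ~400ms, full response
    audio_bytes = await tts_client.synthesize(llm_response)  # ~250ms
    return audio_bytes                                       # total: ~850ms

# Managed stack (pipelined): TTS consumes tokens while the LLM is still generating.
async def pipelined_turn(transcript):
    async for token in llm_client.stream(transcript):
        tts_client.push_token(token)  # TTS starts before the LLM finishes
    # time to first byte of audio: ~280ms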
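
The snippet above relies on placeholder clients, so it cannot be run as-is. The following self-contained sketch simulates the same contrast with `asyncio`. Everything in it is an assumption made for illustration: `fake_llm_stream` and `fake_tts` are stand-ins rather than real SDK calls, and the 10ms delays are arbitrary. Real TTS engines usually consume phrases or sentences rather than single tokens, so treat the per-token handoff as a simplification.

```python
import asyncio
import time

TOKEN_DELAY_S = 0.01  # assumed gap between LLM tokens
TTS_DELAY_S = 0.01    # assumed synthesis time per request


async def fake_llm_stream(prompt: str):
    """Stand-in for a streaming LLM: yields one token at a time."""
    for token in ["Sure,", "your", "order", "ships", "today."]:
        await asyncio.sleep(TOKEN_DELAY_S)
        yield token


async def fake_tts(text: str) -> bytes:
    """Stand-in for a TTS call: returns fake audio bytes."""
    await asyncio.sleep(TTS_DELAY_S)
    return text.encode()


async def serialized_first_audio(prompt: str) -> float:
    """Collect the whole LLM response, then synthesize it in one call."""
    start = time.perf_counter()
    tokens = [token async for token in fake_llm_stream(prompt)]
    await fake_tts(" ".join(tokens))
    return time.perf_counter() - start


async def pipelined_first_audio(prompt: str) -> float:
    """Synthesize the first token as soon as it arrives."""
    start = time.perf_counter()
    async for token in fake_llm_stream(prompt):
        await fake_tts(token)
        return time.perf_counter() - start
    raise RuntimeError("LLM stream produced no tokens")


async def compare():
    serialized = await serialized_first_audio("where is my order?")
    pipelined = await pipelined_first_audio("where is my order?")
    return serialized, pipelined


serialized_s, pipelined_s = asyncio.run(compare())
print(f"serialized time to first audio: {serialized_s * 1000:.0f}ms")
print(f"pipelined time to first audio:  {pipelined_s * 1000:.0f}ms")
```

In the serialized path the first audio waits for all five tokens plus one synthesis call. In the pipelined path it waits for one token plus one synthesis call, which is why time to first audio no longer grows with response length.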

The managed stacks also pre-warm connections and keep LLM instances hot. Cold-start adds 200-400ms to TTFT. Under concurrent load, the hand-stitched stacks queue requests. The managed stacks scale horizontally with WebRTC session affinity.
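
Queueing delay under concurrent load can be illustrated with a toy model. The sketch assumes that all requests arrive at the same instant, that every request takes the same fixed service time, and that a fixed pool of workers serves them first-in, first-out. The function name and the example numbers are hypothetical; the 400ms service time simply echoes the LLM figure in the code comment above.

```python
def queued_latencies_ms(requests: int, workers: int, service_ms: float) -> list:
    """Per-request latency when `requests` arrive at once at `workers` FIFO workers."""
    latencies = []
    for index in range(requests):
        # Number of full batches that must finish before this request starts.
        waves_ahead = index // workers
        latencies.append((waves_ahead + 1) * service_ms)
    return latencies


latencies = queued_latencies_ms(requests=8, workers=4, service_ms=400)
print(latencies)  # [400, 400, 400, 400, 800, 800, 800, 800]
```

With twice as many simultaneous callers as workers, half of the requests see double the latency even though no individual stage got slower. A single-request benchmark would never show this.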

P95 Latency Results (May 2026)

Stack	P50 Latency	P95 Latency	Cost per Minute
Vapi	240ms	290ms	$0.12
Retell	260ms	310ms	$0.10
AssemblyAI + GPT-4o + ElevenLabs	420ms	680ms	$0.08
Deepgram + Claude Sonnet + PlayHT	390ms	720ms	$0.07
OpenAI Realtime API	310ms	520ms	$0.15

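The managed stacks cost roughly 40-50% more per minute than the hand-stitched pipelines but hold P95 at or near 300ms (Vapi at 290ms, Retell at 310ms, just over the line). The hand-stitched stacks save money but miss the latency target by a wide margin. OpenAI Realtime API is the most expensive option and still misses at P95 under load.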
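
The comparisons in this section can be recomputed directly from the table. The sketch below copies the table values as-is; the variable names, and the choice to compare average per-minute cost across the two managed WebRTC stacks and the two hand-stitched stacks, are assumptions made for illustration.

```python
# stack: (p50_ms, p95_ms, cost_per_minute_usd), copied from the table above.
RESULTS = {
    "Vapi": (240, 290, 0.12),
    "Retell": (260, 310, 0.10),
    "AssemblyAI + GPT-4o + ElevenLabs": (420, 680, 0.08),
    "Deepgram + Claude Sonnet + PlayHT": (390, 720, 0.07),
    "OpenAI Realtime API": (310, 520, 0.15),
}
MANAGED_WEBRTC = ["Vapi", "Retell"]
HAND_STITCHED = [
    "AssemblyAI + GPT-4o + ElevenLabs",
    "Deepgram + Claude Sonnet + PlayHT",
]
TARGET_MS = 300

# Stacks whose P95 is strictly below the target.
strictly_under = [name for name, (_, p95, _) in RESULTS.items() if p95 < TARGET_MS]

# Gap between tail and median latency: a rough measure of load sensitivity.
tail_gap_ms = {name: p95 - p50 for name, (p50, p95, _) in RESULTS.items()}


def mean_cost(names):
    return sum(RESULTS[name][2] for name in names) / len(names)


cost_premium = mean_cost(MANAGED_WEBRTC) / mean_cost(HAND_STITCHED) - 1

print("strictly under target:", strictly_under)
print("P95 - P50 gap (ms):", tail_gap_ms)
print(f"managed cost premium: {cost_premium:.0%}")
```

Read strictly, only Vapi clears 300ms at P95, with Retell 10ms over. The P95 minus P50 gap is 50ms for both managed WebRTC stacks and 210-330ms for the other three, which is the tail behavior the article attributes to queueing under load.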

Where the Budget Forces Trade-offs

The cost difference comes from three places:

Pre-warmed infrastructure: Managed stacks keep LLM instances hot. Hand-stitched stacks pay cold-start tax on every request.
WebRTC overhead: WebRTC requires TURN servers, session management, and codec negotiation. HTTP streaming is cheaper but adds latency.
VAD tuning: Managed stacks use custom VAD models trained for conversational turn-taking. Generic VAD waits too long for silence.

If you are building a high-volume call center, a difference of up to $0.05 per minute matters. If you are building a conversational agent where users start repeating themselves at 800ms and give up past 1500ms, the latency cliff matters more.

Observability Gaps

The author found three observability problems in hand-stitched stacks:

No per-stage tracing: Without distributed tracing, you cannot tell if the 680ms P95 is STT queueing, LLM cold-start, or TTS synthesis.
No VAD visibility: You do not know when the VAD detected silence or how long it waited.
No concurrent load testing: Single-request benchmarks hide queueing delay under concurrent load.

The managed stacks expose per-stage latency in their dashboards. The hand-stitched stacks require custom instrumentation.
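
Custom instrumentation for a hand-stitched stack can start very small. The sketch below records wall-clock time per stage with a context manager, using only the standard library. The stage names, the `traced` helper, and the `time.sleep` calls standing in for real STT, LLM, and TTS requests are all assumptions for illustration. A production setup would more likely export spans to a distributed tracing system, and a VAD wait could be recorded the same way as another stage.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# stage name -> list of observed durations in milliseconds
stage_timings_ms = defaultdict(list)


@contextmanager
def traced(stage: str):
    """Record the wall-clock duration of the wrapped block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        stage_timings_ms[stage].append(elapsed_ms)


def handle_turn():
    """One conversational turn; the sleeps are placeholders for real calls."""
    with traced("stt"):
        time.sleep(0.002)
    with traced("llm_ttft"):
        time.sleep(0.004)
    with traced("tts_ttfb"):
        time.sleep(0.003)


for _ in range(5):
    handle_turn()

for stage, samples in stage_timings_ms.items():
    print(f"{stage}: n={len(samples)}, max={max(samples):.1f}ms")
```

With per-stage samples in hand, a slow P95 can be attributed to a specific stage instead of to the pipeline as a whole.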

Technical Verdict

Use managed stacks (Vapi, Retell) when:

You need P95 latency at or near 300ms for conversational agents
You can afford $0.10-0.12 per minute
You want WebRTC session management handled for you
You need horizontal scaling without custom infrastructure

Use hand-stitched stacks when:

You can tolerate P95 latency around 700ms (680-720ms in this benchmark)
You are optimizing for cost per minute
You need custom VAD tuning or model swapping
You already have distributed tracing and observability

Avoid OpenAI Realtime API when:

You need predictable P95 latency under concurrent load
You are cost-sensitive (most expensive option)

The 300ms cliff is real. If your users talk over the AI or repeat questions, you are above it. Measure P95, not P50. The median hides the problem.
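
A small worked example shows how the median hides the tail. The latency samples below are invented for illustration and are not benchmark data: 18 turns at 250ms and 2 turns at 900ms. The `percentile` helper uses the nearest-rank method with integer arithmetic; other percentile definitions interpolate and can give slightly different values.

```python
def percentile(samples, pct: int):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


# Invented samples: most turns are fast, a few hit a queueing or cold-start delay.
latencies_ms = [250] * 18 + [900] * 2

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f"P50 = {p50}ms, P95 = {p95}ms")  # P50 = 250ms, P95 = 900ms
```

The median reports a comfortable 250ms while one turn in ten lands in the band where users repeat the question. Only the tail percentile surfaces that.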
