Voice AI vendors claim sub-300ms response times. Ken Imoto tested five production stacks against the same one-minute conversation and found three of them miss the target entirely. The two that succeeded were the ones he assumed were marketing fluff. The hand-stitched pipelines were the problem.
The Three Latency Cliffs
Voice latency does not degrade smoothly. It falls off cliffs. User behavior changes sharply at specific thresholds:
| Latency Range | User Behavior |
|---|---|
| 0-300ms | Talks normally, never thinks about the AI |
| 300-500ms | Senses a pause, tolerates it |
| 500-800ms | Talks over the AI (“can you hear me?”) |
| 800-1500ms | Repeats the question |
| 1500ms+ | Treats the call like an international line, gives up |
At 300ms, the user starts noticing a machine. Above 500ms, they fight the turn-taking model and the STT keeps resetting because they talk over the response. By 800ms, half the testers said “hello? hello?” on playback.
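As a rough illustration, the bands in the table can be turned into a lookup that labels a measured latency with the behavior it is likely to trigger. The function name and the boundary handling (a value exactly on a threshold falls into the slower band) are assumptions made for this sketch, not something the tests specify.

```python
from bisect import bisect_right

# Band boundaries in milliseconds, taken from the table above.
# Assumption: a latency exactly on a boundary (e.g. 300ms) counts as the slower band.
THRESHOLDS_MS = [300, 500, 800, 1500]
BEHAVIORS = [
    "talks normally",
    "senses a pause, tolerates it",
    "talks over the AI",
    "repeats the question",
    "gives up",
]


def expected_behavior(latency_ms: float) -> str:
    """Return the user-behavior band for a given response latency."""
    if latency_ms < 0:
        raise ValueError("latency must be non-negative")
    return BEHAVIORS[bisect_right(THRESHOLDS_MS, latency_ms)]


for sample in (240, 420, 650, 1000, 2000):
    print(f"{sample}ms -> {expected_behavior(sample)}")
```

Tagging each turn in your logs this way makes it easier to see how many conversations cross a cliff, rather than watching only an average.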
Where the 300ms Budget Goes
A cascaded voice pipeline has four serial components:
- STT (speech-to-text): 80-300ms depending on model and VAD design
- LLM TTFT (time to first token): 100-500ms depending on model size, context length, and cold-start
- TTS TTFB (time to first byte of audio): 75-300ms depending on the vocoder
- Network round-trip: 50-200ms, bounded below by the speed of light and your colo choice
Add the fastest number in every row and you get 305ms. Add the typical numbers and you are at 650ms. The budget is already gone before you account for queueing, retries, or concurrent load.
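The best-case arithmetic is easy to check. The sketch below sums the low and high ends of the four ranges listed above; the dictionary keys are labels chosen for this example.

```python
# (best_case_ms, worst_case_ms) per stage, copied from the list above.
STAGE_RANGES_MS = {
    "stt": (80, 300),
    "llm_ttft": (100, 500),
    "tts_ttfb": (75, 300),
    "network_rtt": (50, 200),
}
BUDGET_MS = 300

best_case = sum(low for low, _ in STAGE_RANGES_MS.values())
worst_case = sum(high for _, high in STAGE_RANGES_MS.values())

print(f"best case:  {best_case}ms ({best_case - BUDGET_MS}ms over budget)")  # 305ms, 5ms over
print(f"worst case: {worst_case}ms")                                         # 1300ms
```

Even with every stage at its fastest, a fully serial pipeline is 5ms over budget, which is why overlapping the stages matters more than shaving any single one.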
The Five Stacks Tested
The author tested:
- Vapi (managed WebRTC)
- Retell (managed WebRTC)
- AssemblyAI Realtime + GPT-4o + ElevenLabs (hand-stitched WebSocket)
- Deepgram + Claude Sonnet + PlayHT (hand-stitched HTTP streaming)
- OpenAI Realtime API (managed WebSocket)
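Only the two managed WebRTC stacks came close to 300ms at P95: Vapi stayed under it and Retell landed just over (290ms and 310ms in the results table). The other three (the AssemblyAI and Deepgram hand-stitched pipelines and the OpenAI Realtime API) all exceeded 500ms at P95 under concurrent load.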
Why Hand-Stitched Pipelines Miss the Target
The managed stacks (Vapi, Retell) use WebRTC with custom VAD tuning and pre-warmed LLM connections. They pipeline the stages: TTS starts synthesizing before the LLM finishes the full response.
The hand-stitched stacks serialize everything. The STT waits for silence detection. The LLM waits for the full transcript. The TTS waits for the full LLM response. Each stage adds queueing delay.
Example of the serialization problem:
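```python
# Hand-stitched pipeline (serialized): each stage waits for the previous one to finish.
# stt_client, llm_client, and tts_client are placeholder objects, not a specific SDK.
async def serialized_turn(audio_chunk):
    transcript = await stt_client.transcribe(audio_chunk)    # 200ms
    llm_response = await llm_client.complete(transcript)     # 400ms
    audio_bytes = await tts_client.synthesize(llm_response)  # 250ms
    return audio_bytes                                       # Total: 850ms


# Managed stack (pipelined): TTS consumes tokens while the LLM is still generating.
async def pipelined_turn(transcript):
    async for token in llm_client.stream(transcript):
        tts_client.push_token(token)  # TTS starts before LLM finishes
    # TTFB: 280ms
```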
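To see the effect in isolation, the following sketch simulates both shapes with `asyncio`. Every delay, the token list, and the function names are made-up placeholders rather than measurements from the tests. It also simplifies real systems, where TTS engines usually buffer up to a phrase boundary instead of starting on a single token.

```python
import asyncio
import time

# Hypothetical timings in seconds, chosen only to make the difference visible.
LLM_TTFT_S = 0.05       # delay before the first token
LLM_PER_TOKEN_S = 0.02  # gap between later tokens
TTS_TTFB_S = 0.03       # delay before TTS emits its first audio
TOKENS = ["Sure,", "I", "can", "help", "with", "that."]


async def llm_stream():
    """Yield tokens one by one, the way a streaming LLM API would."""
    await asyncio.sleep(LLM_TTFT_S)
    for index, token in enumerate(TOKENS):
        if index:
            await asyncio.sleep(LLM_PER_TOKEN_S)
        yield token


async def tts_first_audio(text: str) -> None:
    """Stand-in for synthesis; returns once the first audio would be ready."""
    await asyncio.sleep(TTS_TTFB_S)


async def serialized() -> float:
    """Time to first audio when TTS waits for the complete LLM reply."""
    start = time.perf_counter()
    full_text = " ".join([token async for token in llm_stream()])
    await tts_first_audio(full_text)
    return time.perf_counter() - start


async def pipelined() -> float:
    """Time to first audio when TTS starts on the first token."""
    start = time.perf_counter()
    stream = llm_stream()
    first_token = await stream.__anext__()
    await tts_first_audio(first_token)
    await stream.aclose()  # a real pipeline would keep feeding tokens to TTS
    return time.perf_counter() - start


async def main():
    serialized_s = await serialized()
    pipelined_s = await pipelined()
    print(f"serialized: {serialized_s * 1000:.0f}ms to first audio")
    print(f"pipelined:  {pipelined_s * 1000:.0f}ms to first audio")
    return serialized_s, pipelined_s


asyncio.run(main())
```

The serialized path pays for every token before any audio starts, so its time to first audio grows with reply length. The pipelined path pays only for the first token plus the TTS startup.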
The managed stacks also pre-warm connections and keep LLM instances hot. Cold-start adds 200-400ms to TTFT. Under concurrent load, the hand-stitched stacks queue requests. The managed stacks scale horizontally with WebRTC session affinity.
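The queueing effect can be approximated with a small deterministic model: a burst of requests arrives at once, only a fixed number can be served in parallel, and the rest wait their turn. The worker count, burst size, and 400ms service time below are illustrative assumptions, not figures from the tests.

```python
def burst_latencies_ms(requests: int, workers: int, service_ms: float) -> list[float]:
    """Latency of each request when `requests` arrive simultaneously and only
    `workers` can be served in parallel; the rest wait in a FIFO queue."""
    if requests < 1 or workers < 1:
        raise ValueError("requests and workers must be positive")
    return [(index // workers + 1) * service_ms for index in range(requests)]


single = burst_latencies_ms(1, workers=4, service_ms=400)
burst = burst_latencies_ms(20, workers=4, service_ms=400)

print(max(single))  # 400
print(max(burst))   # 2000
```

A single-request benchmark reports 400ms here, while the last caller in a burst of 20 waits five service times. That gap is the kind of tail a single-request test never shows.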
P95 Latency Results (May 2026)
| Stack | P50 Latency | P95 Latency | Cost per Minute |
|---|---|---|---|
| Vapi | 240ms | 290ms | $0.12 |
| Retell | 260ms | 310ms | $0.10 |
| AssemblyAI + GPT-4o + ElevenLabs | 420ms | 680ms | $0.08 |
| Deepgram + Claude Sonnet + PlayHT | 390ms | 720ms | $0.07 |
| OpenAI Realtime API | 310ms | 520ms | $0.15 |
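The managed WebRTC stacks (Vapi, Retell) cost roughly 25-70% more per minute than the hand-stitched stacks, depending on which pair you compare, and they are the only ones near the 300ms target at P95: Vapi is under it at 290ms and Retell is 10ms over at 310ms. The hand-stitched stacks save money but fail the latency target. OpenAI Realtime API is the most expensive and still misses P95 under load.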
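For readers who want to re-check the table, a short script can flag each stack against the 300ms target and show how far P95 sits above P50. The numbers are copied from the table; the variable names are chosen for this sketch.

```python
# stack: (p50_ms, p95_ms, usd_per_minute), copied from the results table.
RESULTS = {
    "Vapi": (240, 290, 0.12),
    "Retell": (260, 310, 0.10),
    "AssemblyAI + GPT-4o + ElevenLabs": (420, 680, 0.08),
    "Deepgram + Claude Sonnet + PlayHT": (390, 720, 0.07),
    "OpenAI Realtime API": (310, 520, 0.15),
}
TARGET_MS = 300

stacks_within_target = [
    name for name, (_, p95, _) in RESULTS.items() if p95 <= TARGET_MS
]

for name, (p50, p95, cost) in RESULTS.items():
    verdict = "within target" if p95 <= TARGET_MS else f"over by {p95 - TARGET_MS}ms"
    print(f"{name:36s} P95 {p95}ms ({verdict}), P95-P50 gap {p95 - p50}ms, ${cost:.2f}/min")

print("within target:", stacks_within_target)
```

The P95-P50 gap is worth watching on its own: the two managed WebRTC stacks sit 50ms above their median, while the others spread by 210-330ms.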
Where the Budget Forces Trade-offs
The cost difference comes from three places:
- Pre-warmed infrastructure: Managed stacks keep LLM instances hot. Hand-stitched stacks pay cold-start tax on every request.
- WebRTC overhead: WebRTC requires TURN servers, session management, and codec negotiation. HTTP streaming is cheaper but adds latency.
- VAD tuning: Managed stacks use custom VAD models trained for conversational turn-taking. Generic VAD waits too long for silence.
If you are building a high-volume call center, a difference of up to $0.05 per minute matters. If you are building a conversational agent where users start repeating themselves at 800ms, the latency cliff matters more.
Observability Gaps
The author found three observability problems in hand-stitched stacks:
- No per-stage tracing: Without distributed tracing, you cannot tell if the 680ms P95 is STT queueing, LLM cold-start, or TTS synthesis.
- No VAD visibility: You do not know when the VAD detected silence or how long it waited.
- No concurrent load testing: Single-request benchmarks hide queueing delay under concurrent load.
The managed stacks expose per-stage latency in their dashboards. The hand-stitched stacks require custom instrumentation.
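Custom instrumentation does not have to start with a full tracing system. The sketch below is a minimal per-stage timer using only the standard library; the `TurnTrace` class, the stage names, and the `time.sleep` calls standing in for real work are all hypothetical.

```python
import time
from contextlib import contextmanager


class TurnTrace:
    """Collects wall-clock duration per pipeline stage for one conversational turn."""

    def __init__(self) -> None:
        self.spans_ms: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the span even if the stage raised, so failures stay visible.
            self.spans_ms[name] = (time.perf_counter() - start) * 1000

    def slowest(self) -> str:
        return max(self.spans_ms, key=self.spans_ms.get)


trace = TurnTrace()
with trace.stage("vad_endpointing"):
    time.sleep(0.005)  # placeholder for waiting on end-of-speech detection
with trace.stage("stt"):
    time.sleep(0.005)  # placeholder for transcription
with trace.stage("llm_ttft"):
    time.sleep(0.05)   # placeholder for time to first token
with trace.stage("tts_ttfb"):
    time.sleep(0.005)  # placeholder for time to first audio byte

for name, duration in trace.spans_ms.items():
    print(f"{name:16s} {duration:6.1f}ms")
print("slowest stage:", trace.slowest())
```

Timing VAD endpointing as its own stage addresses the second gap directly. Emitting these spans per turn lets you aggregate them into per-stage percentiles later.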
Technical Verdict
Use managed stacks (Vapi, Retell) when:
- You need P95 latency at or near 300ms for conversational agents
- You can afford $0.10-0.12 per minute
- You want WebRTC session management handled for you
- You need horizontal scaling without custom infrastructure
Use hand-stitched stacks when:
- You can tolerate roughly 700ms P95 latency (680-720ms in these tests)
- You are optimizing for cost per minute
- You need custom VAD tuning or model swapping
- You already have distributed tracing and observability
Avoid OpenAI Realtime API when:
- You need predictable P95 latency under concurrent load
- You are cost-sensitive (most expensive option)
The 300ms cliff is real. If your users talk over the AI or repeat questions, you are above it. Measure P95, not P50. The median hides the problem.
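A small example shows how the median can look healthy while the tail is not. The latency samples are invented for illustration, and `percentile` is a plain nearest-rank implementation written for this sketch rather than a library function.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical turn latencies: 18 fast turns and 2 that hit a cold start or a queue.
latencies_ms = [250] * 18 + [900] * 2

print(percentile(latencies_ms, 50))  # 250
print(percentile(latencies_ms, 95))  # 900
```

With these samples the P50 says every call is comfortably under 300ms, while one turn in ten lands in the range where users repeat the question.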