Voice agents fail in production not because any single component is slow, but because the seams between ASR, LLM, TTS, and client playback create compounding latency that no single-stage benchmark catches. End-to-end latency as one number hides which layer is the bottleneck on any given turn.
The solution is to instrument each stage as its own OpenTelemetry span, tied together by a session ID. This exposes where time is actually spent and which layer is causing perceived lag.
The Four Layers and What They Hide
A voice agent turn looks simple: user speaks, agent responds. Under the hood:
- ASR (Automatic Speech Recognition): Audio frames stream in, ASR emits partial text results before finalizing
- LLM: Text plus conversation history goes to the model, which streams tokens back (possibly with tool calls in between)
- TTS (Text-to-Speech): Tokens convert to audio, also streaming
- Client: Audio frames arrive over WebSocket, get buffered, and play back
Each layer streams. Each layer can stall. Each layer has its own failure mode.
The metric that matters most is barge-in latency: when a user starts talking over the agent, how many milliseconds until the agent stops sending audio. If p95 barge-in exceeds 200ms, the agent feels like it is talking at you instead of with you.
Span Boundaries for Streaming Pipelines
Traditional APM assumes request-response. Voice agents are four concurrent streams with partial results. You need span boundaries that capture both processing time and the gaps between stages.
ASR Span
Start when the first audio frame arrives. End when ASR emits the final transcription. Attributes:
- asr.partial_count: How many partial results before final
- asr.final_text: The completed transcription
- asr.silence_detected_ms: Time from last speech to final result
- asr.provider: Which ASR service (Deepgram, AssemblyAI, etc.)
The gap between last partial and final result often reveals network jitter or model hesitation. If silence_detected_ms is high, the ASR is waiting too long to finalize.
LLM Span
Start when the final ASR text hits the model. End when the last token streams out. Attributes:
- llm.prompt_tokens: Input size including history
- llm.completion_tokens: Output size
- llm.tool_calls: Number of function calls before response
- llm.first_token_ms: Time to first token (TTFT)
- llm.model: Which model (gpt-4, claude-3, etc.)
If the LLM makes tool calls, create child spans for each call. The parent LLM span should cover the entire reasoning loop, not just the final text generation. This exposes when the agent is stuck waiting on a slow tool.
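To see why the parent span has to cover the whole loop, here is a minimal sketch that splits a parent LLM span into tool-wait time and model time. The `SpanTiming` type and `split_llm_time` function are hypothetical names for illustration, and the sketch assumes tool calls run one at a time rather than in parallel.

```rust
/// Start and end of a span, in milliseconds since the turn began.
#[derive(Debug, Clone, Copy)]
pub struct SpanTiming {
    pub start_ms: u64,
    pub end_ms: u64,
}

impl SpanTiming {
    pub fn duration_ms(&self) -> u64 {
        self.end_ms.saturating_sub(self.start_ms)
    }
}

/// Splits a parent LLM span into (time waiting on tools, time in the model).
/// Assumption: tool calls are sequential, so their durations do not overlap.
pub fn split_llm_time(parent: SpanTiming, tool_calls: &[SpanTiming]) -> (u64, u64) {
    let tool_ms: u64 = tool_calls.iter().map(|t| t.duration_ms()).sum();
    let model_ms = parent.duration_ms().saturating_sub(tool_ms);
    (tool_ms, model_ms)
}
```

If the parent span only covered the final text generation, the tool-wait portion would not appear anywhere in the trace.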
TTS Span
Start when the first LLM token arrives at TTS. End when the last audio frame is generated. Attributes:
- tts.input_chars: Text length
- tts.audio_duration_ms: Length of generated audio
- tts.first_chunk_ms: Time to first audio frame
- tts.provider: Which TTS service (ElevenLabs, Play.ht, etc.)
Streaming TTS should emit audio before the LLM finishes. If first_chunk_ms is high, the TTS is waiting for too much text before starting synthesis. Some TTS providers buffer several sentences before streaming, which kills perceived responsiveness.
Client Span
Start when the first audio frame arrives at the client. End when playback finishes. Attributes:
- client.buffer_underruns: How many times playback stalled waiting for frames
- client.jitter_ms: Variance in frame arrival times
- client.playback_duration_ms: Actual playback time
- client.network_rtt_ms: Round-trip time to server
The client span often reveals problems invisible on the server. High jitter means frames are arriving in bursts. Buffer underruns mean the client is playing audio faster than it arrives, causing stuttering.
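One possible way to derive the jitter and underrun attributes from raw frame arrival times is sketched below. It rests on assumptions not made in the text above: jitter is taken as the mean absolute deviation of inter-arrival gaps from the nominal frame interval (one of several reasonable definitions), and underruns come from a simplified playback simulation with a fixed prebuffer. `ClientStats` and `client_stats` are hypothetical names.

```rust
/// Summary of client-side playback health for one turn.
#[derive(Debug)]
pub struct ClientStats {
    pub jitter_ms: f64,
    pub buffer_underruns: u32,
}

/// `arrivals_ms`: arrival time of each frame in ms, ascending.
/// `frame_ms`: audio duration carried by each frame (for example 20 ms).
/// `prebuffer_frames`: frames that must arrive before playback starts.
pub fn client_stats(arrivals_ms: &[f64], frame_ms: f64, prebuffer_frames: usize) -> ClientStats {
    // Jitter: how far, on average, each inter-arrival gap is from the nominal interval.
    let gaps: Vec<f64> = arrivals_ms.windows(2).map(|w| w[1] - w[0]).collect();
    let jitter_ms = if gaps.is_empty() {
        0.0
    } else {
        gaps.iter().map(|g| (g - frame_ms).abs()).sum::<f64>() / gaps.len() as f64
    };

    // Underruns: walk the frames and count those that arrive after they were needed.
    let mut buffer_underruns: u32 = 0;
    if !arrivals_ms.is_empty() {
        let start_idx = prebuffer_frames.saturating_sub(1).min(arrivals_ms.len() - 1);
        let mut due = arrivals_ms[start_idx]; // time at which the next frame is needed
        for &arrived in arrivals_ms {
            if arrived > due {
                buffer_underruns += 1;
                due = arrived; // playback resumes once the late frame shows up
            }
            due += frame_ms;
        }
    }

    ClientStats { jitter_ms, buffer_underruns }
}
```

A burst pattern such as three on-time frames followed by a 60 ms gap shows up as both nonzero jitter and a counted underrun, which matches the stutter a listener would hear.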
Correlating Spans Across WebSocket Connections
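Voice agents run over a long-lived WebSocket, not per-request HTTP. Trace context headers can ride on the initial upgrade request at most, and individual WebSocket messages carry no headers, so you cannot rely on header-based propagation to tie a turn together. Instead: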
- Generate a session_id when the WebSocket opens
- Include session_id in every span across all four layers
- Use turn_id to group spans within a single user utterance
- Emit spans from both client and server, using the same IDs
The client must send its span data back to the server (or directly to your collector) because client-side buffering delays are invisible to server-side tracing.
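The ID scheme above could be carried around in a small helper like the following sketch. `SessionContext`, `TurnContext`, and the `<session>-t<n>` turn ID format are illustrative choices, not a convention from any library, and the session ID is assumed to be generated elsewhere (for example a UUID created when the socket opens).

```rust
/// Correlation state for one WebSocket session.
#[derive(Debug, Clone)]
pub struct SessionContext {
    session_id: String,
    turns_started: u64,
}

/// IDs shared by every span (ASR, LLM, TTS, client) in one turn.
#[derive(Debug, Clone, PartialEq)]
pub struct TurnContext {
    pub session_id: String,
    pub turn_id: String,
}

impl SessionContext {
    /// Create once, when the WebSocket opens.
    pub fn new(session_id: impl Into<String>) -> Self {
        SessionContext { session_id: session_id.into(), turns_started: 0 }
    }

    /// Call when a new user utterance begins.
    pub fn begin_turn(&mut self) -> TurnContext {
        self.turns_started += 1;
        TurnContext {
            session_id: self.session_id.clone(),
            turn_id: format!("{}-t{}", self.session_id, self.turns_started),
        }
    }
}

impl TurnContext {
    /// Key/value pairs to attach to every span in this turn.
    pub fn span_attributes(&self) -> [(&'static str, String); 2] {
        [
            ("session_id", self.session_id.clone()),
            ("turn_id", self.turn_id.clone()),
        ]
    }
}
```

The server would send the `TurnContext` values to the client in a control message at the start of each turn, so that client-emitted spans carry the same pair.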
Metrics That Distinguish User-Perceived Latency
Processing time is not perceived latency. A user does not care if the LLM took 800ms if audio started playing after 300ms. The metrics that matter:
| Metric | Definition | Target p95 |
|---|---|---|
| Barge-in latency | User starts talking → agent stops sending audio | < 200ms |
| First audio latency | User stops talking → first audio frame plays | < 500ms |
| Turn completion | User stops talking → agent finishes speaking | < 5s |
| ASR finalization | Last speech → final transcript | < 300ms |
| TTFT (time to first token) | Prompt sent → first LLM token | < 400ms |
| First audio chunk | First LLM token → first TTS audio | < 200ms |
Barge-in latency is the hardest to optimize because it requires the entire pipeline to detect the interruption and stop gracefully. If voice activity detection alone takes 300ms to register that the user is speaking, you have already blown the 200ms budget.
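As a rough sketch, the non-barge-in metrics in the table can be derived from a handful of per-turn timestamps. The field names below are hypothetical, and the sketch assumes all timestamps are in milliseconds on a shared clock. In practice client and server clocks drift, so client-side values would need an offset correction first.

```rust
/// Timestamps for one turn, in ms on a shared clock (illustrative field names).
#[derive(Debug, Clone, Copy)]
pub struct TurnTimestamps {
    pub user_speech_end: u64,
    pub asr_final: u64,
    pub llm_prompt_sent: u64,
    pub llm_first_token: u64,
    pub tts_first_audio: u64,
    pub client_first_frame_played: u64,
    pub client_playback_end: u64,
}

/// Metrics from the table, all in ms.
#[derive(Debug, PartialEq)]
pub struct TurnMetrics {
    pub asr_finalization_ms: u64,
    pub ttft_ms: u64,
    pub first_audio_chunk_ms: u64,
    pub first_audio_latency_ms: u64,
    pub turn_completion_ms: u64,
}

pub fn turn_metrics(t: &TurnTimestamps) -> TurnMetrics {
    TurnMetrics {
        // Last speech -> final transcript
        asr_finalization_ms: t.asr_final.saturating_sub(t.user_speech_end),
        // Prompt sent -> first LLM token
        ttft_ms: t.llm_first_token.saturating_sub(t.llm_prompt_sent),
        // First LLM token -> first TTS audio
        first_audio_chunk_ms: t.tts_first_audio.saturating_sub(t.llm_first_token),
        // User stops talking -> first audio frame plays
        first_audio_latency_ms: t.client_first_frame_played.saturating_sub(t.user_speech_end),
        // User stops talking -> agent finishes speaking
        turn_completion_ms: t.client_playback_end.saturating_sub(t.user_speech_end),
    }
}
```

Note that first audio latency is measured end to end rather than summed from the stage metrics, so it also captures the gaps between stages.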
Instrumenting Barge-in Detection
Barge-in is not a single span. It is a cross-cutting event that touches all four layers:
- Client detects user speech (voice activity detection)
- Client sends interrupt signal to server
- Server stops LLM generation mid-stream
- Server stops TTS synthesis
- Server flushes remaining audio frames
- Client stops playback
Create a barge_in event with attributes:
- barge_in.detected_at_ms: When VAD triggered
- barge_in.signal_sent_at_ms: When client sent interrupt
- barge_in.llm_stopped_at_ms: When LLM generation halted
- barge_in.tts_stopped_at_ms: When TTS synthesis halted
- barge_in.playback_stopped_at_ms: When client stopped audio
The delta between detected_at_ms and playback_stopped_at_ms is your barge-in latency. If it is high, check each intermediate timestamp to find the bottleneck.
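A minimal sketch of that bottleneck check follows. The struct mirrors the attribute names above, while the hop labels and the `bottleneck` helper are hypothetical. It assumes the five timestamps share one clock and occur in the listed order. If the LLM and TTS are stopped in parallel, the middle hops would need different handling.

```rust
/// The five barge-in timestamps, in ms on a shared clock.
#[derive(Debug, Clone, Copy)]
pub struct BargeInEvent {
    pub detected_at_ms: u64,
    pub signal_sent_at_ms: u64,
    pub llm_stopped_at_ms: u64,
    pub tts_stopped_at_ms: u64,
    pub playback_stopped_at_ms: u64,
}

impl BargeInEvent {
    /// Total barge-in latency: VAD trigger to playback stop.
    pub fn latency_ms(&self) -> u64 {
        self.playback_stopped_at_ms.saturating_sub(self.detected_at_ms)
    }

    /// Time spent in each hop between consecutive timestamps.
    pub fn hops(&self) -> [(&'static str, u64); 4] {
        [
            ("detect_to_signal", self.signal_sent_at_ms.saturating_sub(self.detected_at_ms)),
            ("signal_to_llm_stop", self.llm_stopped_at_ms.saturating_sub(self.signal_sent_at_ms)),
            ("llm_stop_to_tts_stop", self.tts_stopped_at_ms.saturating_sub(self.llm_stopped_at_ms)),
            ("tts_stop_to_playback_stop", self.playback_stopped_at_ms.saturating_sub(self.tts_stopped_at_ms)),
        ]
    }

    /// The slowest hop, which is where to look first.
    pub fn bottleneck(&self) -> (&'static str, u64) {
        self.hops()
            .iter()
            .copied()
            .max_by_key(|hop| hop.1)
            .expect("hops() always returns four entries")
    }
}
```

In an event where the server stops within 80ms but playback continues for another 150ms, `bottleneck()` points at the last hop, which suggests audio already buffered on the client is the problem rather than the server pipeline.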
Code Example: Emitting ASR Spans
```rust
use futures::{pin_mut, Stream, StreamExt};
use opentelemetry::global;
use opentelemetry::trace::{Span, SpanKind, Tracer};
use opentelemetry::KeyValue;

// `AudioFrame`, `AsrClient`, and `Error` are placeholders for your own
// ASR integration types; they are not part of OpenTelemetry.
async fn process_audio_stream(
    session_id: &str,
    turn_id: &str,
    asr_client: &mut AsrClient,
    audio_frames: impl Stream<Item = AudioFrame>,
) -> Result<String, Error> {
    let tracer = global::tracer("voice-agent");
    pin_mut!(audio_frames);

    // Wait for the first frame before starting the span, so the span begins
    // at frame arrival rather than at function entry.
    let mut next_frame = audio_frames.next().await;

    let mut span = tracer
        .span_builder("asr.transcribe")
        .with_kind(SpanKind::Internal)
        .with_attributes(vec![
            KeyValue::new("session_id", session_id.to_string()),
            KeyValue::new("turn_id", turn_id.to_string()),
        ])
        .start(&tracer);

    let mut partial_count: i64 = 0;
    while let Some(frame) = next_frame {
        // Each frame may or may not yield a partial transcript.
        if asr_client.send_frame(frame).await?.is_some() {
            partial_count += 1;
        }
        next_frame = audio_frames.next().await;
    }

    let result = asr_client.finalize().await?;
    let final_text = result.text;
    let silence_detected = result.silence_duration_ms;

    span.set_attribute(KeyValue::new("asr.partial_count", partial_count));
    span.set_attribute(KeyValue::new("asr.final_text", final_text.clone()));
    span.set_attribute(KeyValue::new("asr.silence_detected_ms", silence_detected as i64));
    span.end();

    Ok(final_text)
}
```
The key is to start the span when the first frame arrives, not when you call the ASR API. This captures queueing and buffering delays.
Failure Modes by Layer
Each layer has distinct failure signatures in the traces:
ASR: High silence_detected_ms means the model is waiting too long to finalize. High partial_count on short audio means the model keeps revising its hypothesis and is unstable.
LLM: High first_token_ms means the prompt is too long or the model is cold. High tool_calls on a long span means tools are slow or the agent is stuck in a reasoning loop.
TTS: High first_chunk_ms means the TTS is buffering too much text. High variance in chunk timing means the TTS provider is overloaded.
Client: High buffer_underruns means the network is too slow for real-time playback. High jitter_ms means frames are arriving in bursts, not smoothly.
When to Alert
Set alerts on p95, not p50. Voice agents are interactive, so tail latency is what users notice. Alert thresholds:
- Barge-in p95 > 250ms: Users will complain about interruptions
- First audio p95 > 600ms: Feels like the agent is not listening
- ASR finalization p95 > 400ms: Agent is slow to notice the user has finished, which delays every downstream stage
- TTFT p95 > 500ms: Long pauses feel like the agent is thinking too hard
Do not alert on turn completion unless it exceeds 10s. Users tolerate long responses if audio starts quickly.
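The alert rules above could be evaluated with something like the following sketch. It uses a nearest-rank p95 over raw samples, which is an assumption: a production metrics backend would more likely compute percentiles from histograms. The metric keys and function names are hypothetical, and the thresholds are the ones listed above.

```rust
/// p95 alert thresholds in ms, taken from the list above.
pub const ALERT_THRESHOLDS_MS: [(&str, u64); 5] = [
    ("barge_in", 250),
    ("first_audio", 600),
    ("asr_finalization", 400),
    ("ttft", 500),
    ("turn_completion", 10_000), // only worth alerting on past 10s
];

/// Nearest-rank percentile: the smallest sample with at least `pct` percent
/// of all samples at or below it. Returns None for an empty slice.
pub fn percentile(samples: &[u64], pct: u64) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let n = sorted.len() as u64;
    // Integer ceiling of pct * n / 100, clamped to a valid 1-based rank.
    let rank = ((pct * n + 99) / 100).max(1).min(n);
    Some(sorted[(rank - 1) as usize])
}

/// True when the p95 of `samples` exceeds the threshold for `metric`.
/// Unknown metrics and empty sample sets never alert.
pub fn should_alert(metric: &str, samples: &[u64]) -> bool {
    let threshold = ALERT_THRESHOLDS_MS
        .iter()
        .find(|entry| entry.0 == metric)
        .map(|entry| entry.1);
    match (threshold, percentile(samples, 95)) {
        (Some(limit), Some(p95)) => p95 > limit,
        _ => false,
    }
}
```

Calling `should_alert("barge_in", &window)` once per evaluation window keeps the check on the tail rather than the median, which is the point of alerting on p95.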
Technical Verdict
Use this instrumentation approach when you are running voice agents in production and need to debug perceived latency across streaming components. It is overkill for prototypes or single-user demos.
Avoid this if your voice agent is synchronous (wait for full transcription, then full LLM response, then full TTS) because the span boundaries will not capture streaming behavior. Also avoid if you are not already running OpenTelemetry, because adding it just for voice tracing is a large dependency for a narrow use case.
The hardest part is instrumenting the client. If you cannot emit spans from the client, you will miss half the latency story. Browser-based clients can use the OpenTelemetry JS SDK, but mobile clients require more plumbing.
Barge-in latency is the metric that predicts user satisfaction better than any other. If you can only instrument one thing, instrument that.