The 4-Layer Voice Agent Latency Stack: Tracing ASR, LLM, TTS, and Client with OpenTelemetry

Voice agents fail in production not because any single component is slow, but because the seams between ASR, LLM, TTS, and client playback create compounding latency that no single-stage benchmark catches. End-to-end latency as one number hides which layer is the bottleneck on any given turn.

The solution is to instrument each stage as its own OpenTelemetry span, tied together by a session ID. This exposes where time is actually spent and which layer is causing perceived lag.

The Four Layers and What They Hide

A voice agent turn looks simple: user speaks, agent responds. Under the hood:

ASR (Automatic Speech Recognition): Audio frames stream in, ASR emits partial text results before finalizing
LLM: Text plus conversation history goes to the model, which streams tokens back (possibly with tool calls in between)
TTS (Text-to-Speech): Tokens convert to audio, also streaming
Client: Audio frames arrive over WebSocket, get buffered, and play back

Each layer streams. Each layer can stall. Each layer has its own failure mode.

The metric that matters most is barge-in latency: when a user starts talking over the agent, how many milliseconds until the agent stops sending audio. If p95 barge-in exceeds 200ms, the agent feels like it is talking at you instead of with you.

Span Boundaries for Streaming Pipelines

Traditional APM assumes request-response. Voice agents are four concurrent streams with partial results. You need span boundaries that capture both processing time and the gaps between stages.

ASR Span

Start when the first audio frame arrives. End when ASR emits the final transcription. Attributes:

asr.partial_count: How many partial results before final
asr.final_text: The completed transcription
asr.silence_detected_ms: Time from last speech to final result
asr.provider: Which ASR service (Deepgram, AssemblyAI, etc.)

The gap between last partial and final result often reveals network jitter or model hesitation. If silence_detected_ms is high, the ASR is waiting too long to finalize.
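The gap between the last partial and the final result is not one of the attributes listed above, so it has to be derived. A minimal sketch of one way to track it, assuming timestamps are recorded in milliseconds since the ASR span started; the AsrTimings type and its method names are hypothetical and not part of any ASR SDK:

```rust
/// Tracks ASR timing for one utterance.
/// All times are milliseconds since the ASR span started (first audio frame).
/// Hypothetical helper type, not part of any provider SDK.
#[derive(Debug, Default)]
pub struct AsrTimings {
    partial_count: u32,
    last_partial_at_ms: Option<u64>,
    final_at_ms: Option<u64>,
}

impl AsrTimings {
    /// Record a partial transcript arriving at `at_ms`.
    pub fn on_partial(&mut self, at_ms: u64) {
        self.partial_count += 1;
        self.last_partial_at_ms = Some(at_ms);
    }

    /// Record the final transcript arriving at `at_ms`.
    pub fn on_final(&mut self, at_ms: u64) {
        self.final_at_ms = Some(at_ms);
    }

    /// Value for the asr.partial_count attribute.
    pub fn partial_count(&self) -> u32 {
        self.partial_count
    }

    /// Gap between the last partial and the final result.
    /// Returns None until both events have been observed.
    pub fn partial_to_final_gap_ms(&self) -> Option<u64> {
        match (self.last_partial_at_ms, self.final_at_ms) {
            (Some(partial), Some(fin)) => Some(fin.saturating_sub(partial)),
            _ => None,
        }
    }
}
```

The gap could be recorded as an extra span attribute (for example asr.partial_to_final_ms, a name chosen here for illustration) alongside asr.silence_detected_ms.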

LLM Span

Start when the final ASR text hits the model. End when the last token streams out. Attributes:

llm.prompt_tokens: Input size including history
llm.completion_tokens: Output size
llm.tool_calls: Number of function calls before response
llm.first_token_ms: Time to first token (TTFT)
llm.model: Which model (gpt-4, claude-3, etc.)

If the LLM makes tool calls, create child spans for each call. The parent LLM span should cover the entire reasoning loop, not just the final text generation. This exposes when the agent is stuck waiting on a slow tool.
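To see why the parent span should cover the whole loop, it helps to split its duration into time spent waiting on tools and time spent in the model. The sketch below models the parent and child spans as plain data rather than using a tracing SDK; the LlmTurn and ToolCall types are hypothetical, and it assumes tool calls run one after another without overlapping:

```rust
/// One tool invocation inside the LLM reasoning loop (a child span).
#[derive(Debug, Clone)]
pub struct ToolCall {
    pub name: String,
    pub start_ms: u64,
    pub end_ms: u64,
}

impl ToolCall {
    pub fn duration_ms(&self) -> u64 {
        self.end_ms.saturating_sub(self.start_ms)
    }
}

/// The parent LLM span: from final ASR text to last streamed token.
#[derive(Debug)]
pub struct LlmTurn {
    pub start_ms: u64,
    pub end_ms: u64,
    pub tool_calls: Vec<ToolCall>,
}

impl LlmTurn {
    pub fn total_ms(&self) -> u64 {
        self.end_ms.saturating_sub(self.start_ms)
    }

    /// Time spent waiting on tools. Assumes calls do not overlap.
    pub fn tool_ms(&self) -> u64 {
        self.tool_calls.iter().map(ToolCall::duration_ms).sum()
    }

    /// Remaining time attributed to the model itself.
    pub fn model_ms(&self) -> u64 {
        self.total_ms().saturating_sub(self.tool_ms())
    }

    /// The tool call that took longest, if any were made.
    pub fn slowest_tool(&self) -> Option<&ToolCall> {
        self.tool_calls.iter().max_by_key(|call| call.duration_ms())
    }
}
```

If tool_ms dominates total_ms, the slow turn is a tool problem rather than a model problem, which a span covering only the final text generation would not show.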

TTS Span

Start when the first LLM token arrives at TTS. End when the last audio frame is generated. Attributes:

tts.input_chars: Text length
tts.audio_duration_ms: Length of generated audio
tts.first_chunk_ms: Time to first audio frame
tts.provider: Which TTS service (ElevenLabs, Play.ht, etc.)

Streaming TTS should emit audio before the LLM finishes. If first_chunk_ms is high, the TTS is waiting for too much text before starting synthesis. Some TTS providers buffer several sentences before streaming, which kills perceived responsiveness.
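If you control the text handed to the TTS provider, one way to keep first_chunk_ms low is to forward text at clause boundaries instead of waiting for full sentences or paragraphs. The following is a rough sketch of that idea; TtsChunker is a hypothetical helper, and the punctuation rule is deliberately naive (it would split "3.14" after the period, for instance):

```rust
/// Buffers streamed LLM tokens and releases text for synthesis whenever the
/// buffer ends at a clause boundary. Hypothetical helper, naive on purpose.
#[derive(Debug, Default)]
pub struct TtsChunker {
    buffer: String,
}

impl TtsChunker {
    /// Append one token. Returns a chunk ready for TTS if the buffered text
    /// now ends with clause-ending punctuation, otherwise None.
    pub fn push(&mut self, token: &str) -> Option<String> {
        self.buffer.push_str(token);
        let ends_clause = self
            .buffer
            .trim_end()
            .chars()
            .last()
            .map_or(false, |c| matches!(c, '.' | '!' | '?' | ',' | ';' | ':'));
        if ends_clause {
            // Take the buffer, leaving it empty for the next clause.
            Some(std::mem::take(&mut self.buffer).trim().to_string())
        } else {
            None
        }
    }

    /// Call when the LLM stream ends to release any trailing text.
    pub fn flush(&mut self) -> Option<String> {
        let rest = std::mem::take(&mut self.buffer);
        let rest = rest.trim();
        if rest.is_empty() {
            None
        } else {
            Some(rest.to_string())
        }
    }
}
```

Smaller chunks start audio sooner but can hurt prosody, so the boundary set is a tuning choice rather than a fixed rule.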

Client Span

Start when the first audio frame arrives at the client. End when playback finishes. Attributes:

client.buffer_underruns: How many times playback stalled waiting for frames
client.jitter_ms: Variation in the gaps between frame arrivals, in milliseconds
client.playback_duration_ms: Actual playback time
client.network_rtt_ms: Round-trip time to server

The client span often reveals problems invisible on the server. High jitter means frames are arriving in bursts. Buffer underruns mean the client is playing audio faster than it arrives, causing stuttering.
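Both client attributes can be derived from frame arrival timestamps alone. The sketch below is one possible definition, not a standard one: jitter is taken as the mean absolute deviation of inter-arrival gaps from the nominal frame interval, and an underrun is counted whenever a frame arrives after its scheduled playback time. The function names, the prebuffer parameter, and the scheduling model are all assumptions made for illustration:

```rust
/// Mean absolute deviation of inter-arrival gaps from the nominal frame
/// interval, in milliseconds. Returns None with fewer than two frames.
pub fn jitter_ms(arrivals_ms: &[u64], frame_interval_ms: u64) -> Option<f64> {
    if arrivals_ms.len() < 2 {
        return None;
    }
    let total_deviation: u64 = arrivals_ms
        .windows(2)
        .map(|w| w[1].saturating_sub(w[0]).abs_diff(frame_interval_ms))
        .sum();
    Some(total_deviation as f64 / (arrivals_ms.len() - 1) as f64)
}

/// Counts playback stalls. Playback starts `prebuffer_ms` after the first
/// frame arrives; each frame is then due one frame interval after the
/// previous one. A frame that arrives after its due time is an underrun,
/// and the schedule shifts to resume from that late arrival.
pub fn count_underruns(arrivals_ms: &[u64], frame_interval_ms: u64, prebuffer_ms: u64) -> u32 {
    let first = match arrivals_ms.first() {
        Some(&first) => first,
        None => return 0,
    };
    let mut due_ms = first + prebuffer_ms;
    let mut underruns = 0;
    for &arrival in arrivals_ms {
        if arrival > due_ms {
            underruns += 1;
            due_ms = arrival; // playback resumes when the late frame lands
        }
        due_ms += frame_interval_ms;
    }
    underruns
}
```

With 20ms frames and a 40ms prebuffer, a single 100ms gap in arrivals produces one underrun, while the same stream with a larger prebuffer would absorb it at the cost of higher first audio latency.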

Correlating Spans Across WebSocket Connections

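Voice agents run over WebSocket, where HTTP headers exist only on the initial upgrade request and not on individual messages. You cannot rely on trace context headers to link each turn. Instead: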

Generate a session_id when the WebSocket opens
Include session_id in every span across all four layers
Use turn_id to group spans within a single user utterance
Emit spans from both client and server, using the same IDs

The client must send its span data back to the server (or directly to your collector) because client-side buffering delays are invisible to server-side tracing.
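A small sketch of the ID scheme described above, assuming one side (here, whichever side owns the Session value) allocates turn IDs and sends them to the other side inside the WebSocket messages. The Session and TurnContext types and the turn ID format are hypothetical; the session ID itself is supplied by the caller, for example a UUID generated when the socket opens:

```rust
/// Per-connection state created when the WebSocket opens.
#[derive(Debug)]
pub struct Session {
    session_id: String,
    turn_counter: u64,
}

/// IDs shared by every span (ASR, LLM, TTS, client) within one turn.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TurnContext {
    pub session_id: String,
    pub turn_id: String,
}

impl Session {
    pub fn new(session_id: impl Into<String>) -> Self {
        Session {
            session_id: session_id.into(),
            turn_counter: 0,
        }
    }

    /// Allocate IDs for the next user utterance.
    pub fn next_turn(&mut self) -> TurnContext {
        self.turn_counter += 1;
        TurnContext {
            session_id: self.session_id.clone(),
            turn_id: format!("{}-t{}", self.session_id, self.turn_counter),
        }
    }
}

impl TurnContext {
    /// Key/value pairs to attach to every span for this turn, on both the
    /// client and the server, so a trace backend can group them.
    pub fn span_attributes(&self) -> Vec<(&'static str, String)> {
        vec![
            ("session_id", self.session_id.clone()),
            ("turn_id", self.turn_id.clone()),
        ]
    }
}
```

Because the IDs travel in the message payload, the client can attach the same pairs to its own spans without any header-based propagation.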

Metrics That Distinguish User-Perceived Latency

Processing time is not perceived latency. A user does not care if the LLM took 800ms if audio started playing after 300ms. The metrics that matter:

Metric	Definition	Target p95
Barge-in latency	User starts talking → agent stops sending audio	< 200ms
First audio latency	User stops talking → first audio frame plays	< 500ms
Turn completion	User stops talking → agent finishes speaking	< 5s
ASR finalization	Last speech → final transcript	< 300ms
TTFT (time to first token)	Prompt sent → first LLM token	< 400ms
First audio chunk	First LLM token → first TTS audio	< 200ms

Barge-in latency is the hardest to optimize because it requires the entire pipeline to detect the interruption and stop gracefully. If voice activity detection alone takes 300ms to register that the user is speaking, the 200ms budget is already blown before the interrupt signal leaves the client.
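The per-turn metrics in the table can be derived from a handful of timestamps. The sketch below assumes all timestamps are in milliseconds on one shared clock (client and server clocks would need offset correction first); the struct and field names are hypothetical:

```rust
/// Event timestamps for one turn, in milliseconds on a single shared clock.
/// Field names are illustrative; fill them from wherever each event is seen.
#[derive(Debug, Clone, Copy)]
pub struct TurnTimestamps {
    pub user_speech_end_at_ms: u64,
    pub asr_final_at_ms: u64,
    pub llm_prompt_sent_at_ms: u64,
    pub llm_first_token_at_ms: u64,
    pub tts_first_chunk_at_ms: u64,
    pub first_frame_played_at_ms: u64,
    pub playback_end_at_ms: u64,
}

/// Durations matching the rows of the metrics table (barge-in excluded,
/// since it is an event rather than part of a normal turn).
#[derive(Debug, PartialEq, Eq)]
pub struct TurnMetrics {
    pub asr_finalization_ms: u64,
    pub ttft_ms: u64,
    pub first_audio_chunk_ms: u64,
    pub first_audio_latency_ms: u64,
    pub turn_completion_ms: u64,
}

impl TurnTimestamps {
    pub fn metrics(&self) -> TurnMetrics {
        let since_speech_end = |t: u64| t.saturating_sub(self.user_speech_end_at_ms);
        TurnMetrics {
            // Last speech -> final transcript
            asr_finalization_ms: since_speech_end(self.asr_final_at_ms),
            // Prompt sent -> first LLM token
            ttft_ms: self
                .llm_first_token_at_ms
                .saturating_sub(self.llm_prompt_sent_at_ms),
            // First LLM token -> first TTS audio
            first_audio_chunk_ms: self
                .tts_first_chunk_at_ms
                .saturating_sub(self.llm_first_token_at_ms),
            // User stops talking -> first audio frame plays
            first_audio_latency_ms: since_speech_end(self.first_frame_played_at_ms),
            // User stops talking -> agent finishes speaking
            turn_completion_ms: since_speech_end(self.playback_end_at_ms),
        }
    }
}
```

Note that first audio latency is not the sum of the stage metrics: it also includes hand-off gaps and network transit, which is exactly the time a per-stage benchmark misses.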

Instrumenting Barge-in Detection

Barge-in is not a single span. It is a cross-cutting event that touches all four layers:

Client detects user speech (voice activity detection)
Client sends interrupt signal to server
Server stops LLM generation mid-stream
Server stops TTS synthesis
Server discards any audio frames still queued for sending
Client stops playback

Create a barge_in event with attributes:

barge_in.detected_at_ms: When VAD triggered
barge_in.signal_sent_at_ms: When client sent interrupt
barge_in.llm_stopped_at_ms: When LLM generation halted
barge_in.tts_stopped_at_ms: When TTS synthesis halted
barge_in.playback_stopped_at_ms: When client stopped audio

The delta between detected_at_ms and playback_stopped_at_ms is your barge-in latency. If it is high, check each intermediate timestamp to find the bottleneck.
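A sketch of the event and the bottleneck check, using the attribute names above as struct fields. The BargeInEvent type and hop labels are hypothetical. The total latency uses only client-side timestamps, so it is safe even when client and server clocks differ; the intermediate hops mix client and server timestamps and assume the clocks have been aligned:

```rust
/// Timestamps for one barge-in, in milliseconds.
#[derive(Debug, Clone, Copy)]
pub struct BargeInEvent {
    pub detected_at_ms: u64,         // client: VAD triggered
    pub signal_sent_at_ms: u64,      // client: interrupt sent
    pub llm_stopped_at_ms: u64,      // server: LLM generation halted
    pub tts_stopped_at_ms: u64,      // server: TTS synthesis halted
    pub playback_stopped_at_ms: u64, // client: audio stopped
}

impl BargeInEvent {
    /// End-to-end barge-in latency: detection to playback stop.
    pub fn latency_ms(&self) -> u64 {
        self.playback_stopped_at_ms
            .saturating_sub(self.detected_at_ms)
    }

    /// Time spent in each hop, in pipeline order.
    /// Assumes the LLM is stopped before TTS; swap if your server differs.
    pub fn hops(&self) -> [(&'static str, u64); 4] {
        [
            (
                "vad_to_signal",
                self.signal_sent_at_ms.saturating_sub(self.detected_at_ms),
            ),
            (
                "signal_to_llm_stop",
                self.llm_stopped_at_ms.saturating_sub(self.signal_sent_at_ms),
            ),
            (
                "llm_stop_to_tts_stop",
                self.tts_stopped_at_ms.saturating_sub(self.llm_stopped_at_ms),
            ),
            (
                "tts_stop_to_playback_stop",
                self.playback_stopped_at_ms
                    .saturating_sub(self.tts_stopped_at_ms),
            ),
        ]
    }

    /// The hop that consumed the most time.
    pub fn slowest_hop(&self) -> (&'static str, u64) {
        self.hops()
            .iter()
            .copied()
            .max_by_key(|&(_, ms)| ms)
            .expect("hops is a non-empty array")
    }
}
```

In the test values used to check this sketch, a 190ms barge-in spends 120ms between TTS stopping and playback stopping, which points at audio already buffered on the client rather than at the server.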

Code Example: Emitting ASR Spans

use futures::{pin_mut, Stream, StreamExt};
use opentelemetry::global;
use opentelemetry::trace::{Span, SpanKind, Tracer};
use opentelemetry::KeyValue;

// AsrClient, AudioFrame, and Error are application-defined placeholders:
// your ASR provider's streaming client, its frame type, and your error type.
async fn process_audio_stream(
    asr_client: &mut AsrClient,
    session_id: &str,
    turn_id: &str,
    audio_frames: impl Stream<Item = AudioFrame>,
) -> Result<String, Error> {
    let tracer = global::tracer("voice-agent");
    // Open the span before any frame is handed to the ASR API so that
    // queueing and buffering time is included in the span duration.
    let mut span = tracer
        .span_builder("asr.transcribe")
        .with_kind(SpanKind::Internal)
        .with_attributes(vec![
            KeyValue::new("session_id", session_id.to_string()),
            KeyValue::new("turn_id", turn_id.to_string()),
        ])
        .start(&tracer);

    let mut partial_count: i64 = 0;

    pin_mut!(audio_frames);
    while let Some(frame) = audio_frames.next().await {
        // send_frame yields Some(..) each time the provider emits a partial transcript.
        if asr_client.send_frame(frame).await?.is_some() {
            partial_count += 1;
        }
    }

    let result = asr_client.finalize().await?;

    span.set_attribute(KeyValue::new("asr.partial_count", partial_count));
    span.set_attribute(KeyValue::new("asr.final_text", result.text.clone()));
    span.set_attribute(KeyValue::new(
        "asr.silence_detected_ms",
        result.silence_duration_ms as i64,
    ));
    span.end();

    Ok(result.text)
}

The key is to start the span when the first frame arrives, not when you call the ASR API. This captures queueing and buffering delays.

Failure Modes by Layer

Each layer has distinct failure signatures in the traces:

ASR: High silence_detected_ms means the model is waiting too long to finalize. High partial_count with short audio means the model is unstable.

LLM: High first_token_ms means the prompt is too long or the model is cold. High tool_calls with a long span means the tools are slow or the agent is stuck in a reasoning loop.

TTS: High first_chunk_ms means the TTS is buffering too much text. High variance in chunk timing means the TTS provider is overloaded.

Client: High buffer_underruns means the network is too slow for real-time playback. High jitter_ms means frames are arriving in bursts, not smoothly.

When to Alert

Set alerts on p95, not p50. Voice agents are interactive, so tail latency is what users notice. Alert thresholds:

Barge-in p95 > 250ms: Users will complain about interruptions
First audio p95 > 600ms: Feels like the agent is not listening
ASR finalization p95 > 400ms: Agent is slow to start responding after the user finishes
TTFT p95 > 500ms: Long pauses feel like the agent is thinking too hard

Do not alert on turn completion unless it exceeds 10s. Users tolerate long responses if audio starts quickly.
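A sketch of how these thresholds might be checked over a window of samples. The thresholds come from the list above; the metric keys, function names, and the nearest-rank percentile method are choices made for this example, and a real deployment would more likely compute percentiles in the metrics backend:

```rust
/// Nearest-rank percentile over a set of latency samples (milliseconds).
/// `pct` must be in 1..=100. Returns None for empty input or invalid pct.
pub fn percentile(samples: &[u64], pct: usize) -> Option<u64> {
    if samples.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // 1-based rank = ceil(pct / 100 * n), computed in integer arithmetic.
    let rank = (pct * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}

/// (metric key, p95 alert threshold in ms). Keys are illustrative;
/// the threshold values are the ones listed in this section.
pub const P95_ALERT_THRESHOLDS_MS: [(&str, u64); 4] = [
    ("barge_in", 250),
    ("first_audio", 600),
    ("asr_finalization", 400),
    ("ttft", 500),
];

/// True if the p95 of `samples` exceeds the threshold for `metric`.
/// Unknown metrics and empty sample sets never alert.
pub fn should_alert(metric: &str, samples: &[u64]) -> bool {
    let threshold = P95_ALERT_THRESHOLDS_MS
        .iter()
        .find(|(name, _)| *name == metric)
        .map(|&(_, ms)| ms);
    match (threshold, percentile(samples, 95)) {
        (Some(limit), Some(p95)) => p95 > limit,
        _ => false,
    }
}
```

With 20 samples, a single 900ms outlier does not move the nearest-rank p95, but a second one does, which is the behavior you want from a tail-latency alert: it ignores one-off spikes and fires on a recurring slow tail.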

Technical Verdict

Use this instrumentation approach when you are running voice agents in production and need to debug perceived latency across streaming components. It is overkill for prototypes or single-user demos.

Avoid this if your voice agent is synchronous (wait for full transcription, then full LLM response, then full TTS): the stages never overlap, so three sequential spans already tell the whole story and the streaming-specific boundaries and attributes add nothing. Also avoid it if you are not already running OpenTelemetry, because adding it just for voice tracing is a large dependency for a narrow use case.

The hardest part is instrumenting the client. If you cannot emit spans from the client, you will miss half the latency story. Browser-based clients can use the OpenTelemetry JS SDK, but mobile clients require more plumbing.

Barge-in latency is the metric that predicts user satisfaction better than any other. If you can only instrument one thing, instrument that.

Source Links

Primary article: The 4-layer voice-agent latency stack, traced with OTel spans