When your agent system calls an LLM, invokes a tool, queries a vector database, and writes to a queue, you need a single trace that shows the full execution path. Langfuse (28K+ stars, YC W23) is an open-source LLM observability platform that integrates with OpenTelemetry to bridge the gap between agent-specific telemetry and traditional APM stacks.
The problem is not logging. The problem is correlation. You need to see token usage, prompt versions, and eval scores in the same trace as database query latency, API errors, and queue depth. Langfuse solves this by exposing LLM-specific spans that propagate OpenTelemetry context across framework boundaries.
How Trace Context Propagates Across Agent Boundaries
Langfuse instruments LLM calls, tool invocations, and retrieval steps as OpenTelemetry spans. Each span carries a trace ID and parent span ID, so distributed tracing works across services.
Key propagation points:
- LLM calls: Langfuse wraps the OpenAI SDK, LangChain, LiteLLM, and other clients. Each completion becomes a span with token counts, latency, and model metadata.
- Tool invocations: When an agent calls a function or API, Langfuse creates a child span. If that tool makes its own HTTP requests, the trace context flows through standard OpenTelemetry headers.
- Retrieval steps: Vector database queries, reranking, and document fetching each get their own spans. If your retrieval service uses OpenTelemetry instrumentation, Langfuse traces connect to it automatically.
- Prompt templates: Langfuse versions prompts and links them to traces. When you update a template, you can query which traces used which version.
The instrumentation happens at the SDK level. You initialize Langfuse with your project's public and secret keys, and it injects trace context into outbound requests. If your downstream services already emit OpenTelemetry spans, they appear in the same trace.
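To make "trace context in outbound requests" concrete, here is a minimal sketch of the W3C Trace Context `traceparent` header that OpenTelemetry propagators read and write. The helper names (`formatTraceparent`, `parseTraceparent`, `childContext`) are hypothetical and written for illustration; in a real service the OpenTelemetry SDK does this for you.

```typescript
// W3C "traceparent" header layout: version-traceId-parentSpanId-flags
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Illustration only: Math.random is not a cryptographically secure ID source.
function randomHex(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// Serialize a context into the header value sent with an outbound request.
function formatTraceparent(ctx: TraceContext): string {
  return "00-" + ctx.traceId + "-" + ctx.spanId + "-" + (ctx.sampled ? "01" : "00");
}

// Parse an incoming header; returns null for malformed or all-zero IDs.
function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  const traceId = m[1];
  const spanId = m[2];
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  const sampled = (parseInt(m[3], 16) & 1) === 1;
  return { traceId: traceId, spanId: spanId, sampled: sampled };
}

// A child span keeps the trace ID and gets a fresh span ID.
function childContext(parent: TraceContext): TraceContext {
  return { traceId: parent.traceId, spanId: randomHex(16), sampled: parent.sampled };
}
```

A downstream service that parses this header and starts its spans with `childContext` ends up in the same trace, which is the whole mechanism behind "they appear in the same trace".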
Correlating LLM Spans with Infrastructure Metrics
Langfuse exports spans to OpenTelemetry collectors, which forward them to your APM backend (Datadog, Honeycomb, Grafana Tempo, etc.). This means you can:
- See LLM latency alongside database query time in a single flame graph.
- Correlate token usage spikes with memory pressure or queue backlog.
- Filter traces by prompt version, eval score, or user feedback, then drill into infrastructure bottlenecks.
Example trace structure:
| Span Type | Parent | Attributes | Exported To |
|---|---|---|---|
| Agent execution | Root | user_id, session_id | OpenTelemetry collector |
| LLM completion | Agent | model, tokens, prompt_version | Langfuse + collector |
| Tool call (API) | Agent | endpoint, status_code | Collector (via HTTP instrumentation) |
| Vector search | Tool call | query, top_k, latency | Collector (via DB instrumentation) |
| Eval check | Agent | score, eval_name | Langfuse |
Langfuse stores LLM-specific metadata (prompt text, completions, scores) in its own database, but the span structure and trace IDs flow through OpenTelemetry. This lets you query Langfuse for prompt analysis and your APM tool for infrastructure health, then join them by trace ID.
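The join itself is simple once both sides expose the trace ID. The sketch below assumes you have already exported records from each system; the `LlmRecord` and `InfraSpan` shapes are hypothetical and chosen for illustration, not taken from the Langfuse or APM export formats.

```typescript
// Hypothetical record shapes for illustration; real Langfuse and APM exports differ.
interface LlmRecord {
  traceId: string;
  promptVersion: string;
  totalTokens: number;
  evalScore: number | null;
}

interface InfraSpan {
  traceId: string;
  name: string;
  durationMs: number;
}

interface JoinedTrace extends LlmRecord {
  slowestSpan: string | null;
  slowestSpanMs: number;
}

function joinByTraceId(llm: LlmRecord[], infra: InfraSpan[]): JoinedTrace[] {
  // Index infrastructure spans by trace ID so each lookup is constant time.
  const spansByTrace: Record<string, InfraSpan[]> = {};
  for (let i = 0; i < infra.length; i++) {
    const s = infra[i];
    if (!Object.prototype.hasOwnProperty.call(spansByTrace, s.traceId)) {
      spansByTrace[s.traceId] = [];
    }
    spansByTrace[s.traceId].push(s);
  }

  return llm.map(function (record: LlmRecord): JoinedTrace {
    const spans = Object.prototype.hasOwnProperty.call(spansByTrace, record.traceId)
      ? spansByTrace[record.traceId]
      : [];
    // Find the slowest infrastructure span recorded for this trace, if any.
    let slowest: InfraSpan | null = null;
    for (let i = 0; i < spans.length; i++) {
      if (slowest === null || spans[i].durationMs > slowest.durationMs) {
        slowest = spans[i];
      }
    }
    return {
      traceId: record.traceId,
      promptVersion: record.promptVersion,
      totalTokens: record.totalTokens,
      evalScore: record.evalScore,
      slowestSpan: slowest === null ? null : slowest.name,
      slowestSpanMs: slowest === null ? 0 : slowest.durationMs,
    };
  });
}
```

With the joined rows you can ask questions neither system answers alone, such as whether low eval scores cluster on traces where a particular infrastructure span was slow.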
Instrumenting Prompt Templates and Dataset Versioning
Langfuse treats prompts as versioned artifacts. You define a prompt template in the Langfuse UI or API, then reference it by name and version in your code. Each trace records which prompt version it used.
Prompt versioning flow:
- Create a prompt template in Langfuse with variables (e.g., `{user_query}`, `{context}`).
- Tag it with a version (e.g., `v1.2`).
- In your agent code, fetch the prompt by name: `langfuse.get_prompt("summarize", version="v1.2")`.
- Langfuse logs the prompt ID and version in the trace.
- When you update the prompt, you can query which traces used the old version and compare eval scores.
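The flow above can be sketched without the SDK to show what gets pinned to each trace. `PromptRegistry`, `compilePrompt`, and `renderForTrace` are hypothetical in-memory stand-ins, not Langfuse APIs; the single-brace `{variable}` syntax follows the example in the list.

```typescript
// Hypothetical in-memory stand-in for a prompt store; not the Langfuse SDK.
interface PromptVersion {
  name: string;
  version: string;
  template: string;
}

interface TraceRecord {
  traceId: string;
  promptName: string;
  promptVersion: string;
}

class PromptRegistry {
  private prompts: PromptVersion[] = [];

  register(prompt: PromptVersion): void {
    this.prompts.push(prompt);
  }

  get(name: string, version: string): PromptVersion {
    for (let i = 0; i < this.prompts.length; i++) {
      const p = this.prompts[i];
      if (p.name === name && p.version === version) return p;
    }
    throw new Error("Unknown prompt " + name + "@" + version);
  }
}

// Fill {variable} placeholders; fail loudly rather than send a half-filled prompt.
function compilePrompt(template: string, vars: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, function (_match: string, key: string): string {
    if (!Object.prototype.hasOwnProperty.call(vars, key)) {
      throw new Error("Missing prompt variable: " + key);
    }
    return vars[key];
  });
}

// Produce the prompt text plus the record that ties the trace to a prompt version.
function renderForTrace(
  registry: PromptRegistry,
  traceId: string,
  name: string,
  version: string,
  vars: Record<string, string>
): { text: string; record: TraceRecord } {
  const prompt = registry.get(name, version);
  return {
    text: compilePrompt(prompt.template, vars),
    record: { traceId: traceId, promptName: prompt.name, promptVersion: prompt.version },
  };
}
```

Because every `TraceRecord` carries the version, "which traces used the old version" becomes a plain filter on `promptVersion`.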
Dataset management:
Langfuse stores evaluation datasets (input/output pairs, expected results) and links them to traces. When you run an eval, Langfuse creates a span that references the dataset version. This makes it possible to:
- Track which dataset version produced which eval scores.
- Rerun evals on new prompt versions using the same dataset.
- Query traces by dataset ID to see production behavior on known test cases.
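As a rough sketch of rerunning an eval on a new prompt version against the same dataset, the code below scores two generators on one fixed dataset version. The types, the `exactMatch` scorer, and the `generate` callback are all assumptions made for illustration; Langfuse's own dataset and eval APIs are not shown here.

```typescript
// Hypothetical shapes; a real dataset item would carry more metadata.
interface DatasetItem {
  input: string;
  expected: string;
}

interface Dataset {
  id: string;
  version: number;
  items: DatasetItem[];
}

interface EvalRun {
  datasetId: string;
  datasetVersion: number;
  promptVersion: string;
  meanScore: number;
}

// Simplest possible scorer: 1 for a case-insensitive exact match, else 0.
function exactMatch(output: string, expected: string): number {
  return output.trim().toLowerCase() === expected.trim().toLowerCase() ? 1 : 0;
}

// Run one prompt version over every item and record which dataset version was used.
function runEval(
  dataset: Dataset,
  promptVersion: string,
  generate: (input: string) => string
): EvalRun {
  let total = 0;
  for (let i = 0; i < dataset.items.length; i++) {
    const item = dataset.items[i];
    total += exactMatch(generate(item.input), item.expected);
  }
  const meanScore = dataset.items.length === 0 ? 0 : total / dataset.items.length;
  return {
    datasetId: dataset.id,
    datasetVersion: dataset.version,
    promptVersion: promptVersion,
    meanScore: meanScore,
  };
}

// Scores are only comparable when both runs used the same dataset version.
function scoreDelta(baseline: EvalRun, candidate: EvalRun): number {
  if (
    baseline.datasetId !== candidate.datasetId ||
    baseline.datasetVersion !== candidate.datasetVersion
  ) {
    throw new Error("Runs used different datasets; scores are not comparable");
  }
  return candidate.meanScore - baseline.meanScore;
}
```

Recording `datasetVersion` on every run is the design choice that matters: without it, a score change could come from the prompt or from the test cases, and you cannot tell which.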
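The storage model depends on the Langfuse version. Langfuse v2 writes traces, spans, prompts, and datasets to Postgres and indexes them by trace ID, prompt version, and user ID. Langfuse v3 keeps Postgres for transactional data such as prompts and datasets, but moves traces, observations, and scores to ClickHouse and adds Redis and S3-compatible blob storage. You can self-host or use Langfuse Cloud. The OpenTelemetry exporter runs as a sidecar or in-process, depending on your deployment.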
Architecture: How Langfuse Fits Into Your Stack
Langfuse sits between your agent framework and your APM backend. It does not replace OpenTelemetry. It extends it with LLM-specific semantics.
Typical deployment:
```typescript
import Langfuse from "langfuse";
import { trace, context } from "@opentelemetry/api";

// Keys are read from the environment, never hard-coded
const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
});

// Start a Langfuse trace
const langfuseTrace = langfuse.trace({
  name: "agent-execution",
  userId: "user-123",
  metadata: { sessionId: "session-456" },
});

// Start an OpenTelemetry span and make it the active context
const span = trace.getTracer("agent").startSpan("llm-call");
// Record the Langfuse trace ID on the span so the two systems can be joined later
// (the attribute name is illustrative, not a Langfuse convention)
span.setAttribute("langfuse.trace_id", langfuseTrace.id);

context.with(trace.setSpan(context.active(), span), () => {
  // LLM call with Langfuse instrumentation; messages go in `input`
  const generation = langfuseTrace.generation({
    name: "summarize",
    model: "gpt-4",
    input: [{ role: "user", content: "Summarize this document" }],
  });

  // Simulate LLM response
  generation.end({
    output: "Summary text",
    usage: { promptTokens: 50, completionTokens: 30 },
  });
  span.end();
});

langfuseTrace.update({ output: "Final result" });

// Flush buffered events before a short-lived process exits
await langfuse.flushAsync();
```
Data flow:
- Your agent code calls Langfuse SDK methods (`trace`, `generation`, `span`).
- Langfuse writes LLM metadata to its own database.
- Langfuse emits OpenTelemetry spans to a collector (OTLP endpoint).
- The collector forwards spans to your APM backend.
- You query Langfuse for prompt analysis, eval scores, and user feedback.
- You query your APM tool for infrastructure metrics, error rates, and trace timelines.
- You join the two by trace ID when debugging production issues.
Failure Modes and Observability Gaps
What breaks:
- Context propagation across async boundaries: If your agent uses message queues or background workers, you must manually propagate trace context. Langfuse does not automatically inject context into queue messages.
- High-cardinality metadata: Storing full prompt text and completions in every span can overwhelm your APM backend. Langfuse keeps this data in its own datastore, but if you also export it as OpenTelemetry span attributes, you may hit attribute-size or cardinality limits.
- Eval latency: Running evals inline adds latency to your agent execution. Langfuse supports async evals, but you need to set up a worker queue.
- Schema drift: If you change prompt variables or dataset structure, old traces may reference missing fields. Langfuse does not enforce schema validation.
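For the async-boundary case in the list above, manual propagation usually means carrying the trace context inside the message itself. This is a minimal sketch assuming a generic queue with string headers; `QueueMessage`, `inject`, `extract`, and `startWorkerSpan` are hypothetical names, and the header value follows the W3C `traceparent` layout.

```typescript
// Hypothetical message envelope; most queue clients offer headers or attributes like this.
interface QueueMessage<T> {
  headers: Record<string, string>;
  body: T;
}

interface ActiveSpan {
  traceId: string; // 32 hex characters
  spanId: string; // 16 hex characters
}

interface WorkerSpan {
  traceId: string;
  spanId: string;
  parentSpanId: string | null;
}

// Producer side: copy the active span's identifiers into the message headers.
function inject<T>(span: ActiveSpan, body: T): QueueMessage<T> {
  return {
    headers: { traceparent: "00-" + span.traceId + "-" + span.spanId + "-01" },
    body: body,
  };
}

// Worker side: recover the producer's context, or null if none was sent.
function extract(message: QueueMessage<unknown>): ActiveSpan | null {
  const header = message.headers["traceparent"];
  if (header === undefined) return null;
  const parts = header.split("-");
  if (parts.length !== 4 || parts[1].length !== 32 || parts[2].length !== 16) {
    return null;
  }
  return { traceId: parts[1], spanId: parts[2] };
}

// Continue the producer's trace when context is present; otherwise start a new one.
// The caller supplies freshly generated IDs so this function stays deterministic.
function startWorkerSpan(
  message: QueueMessage<unknown>,
  freshTraceId: string,
  freshSpanId: string
): WorkerSpan {
  const parent = extract(message);
  if (parent === null) {
    // No context in the message: the worker's spans form a disconnected trace.
    return { traceId: freshTraceId, spanId: freshSpanId, parentSpanId: null };
  }
  return { traceId: parent.traceId, spanId: freshSpanId, parentSpanId: parent.spanId };
}
```

The fallback branch is the failure mode described in the list: the worker still emits spans, but they land in a separate trace and the end-to-end view is lost.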
Security boundaries:
- Langfuse stores prompt text, completions, and user IDs. If you handle PII, you need to redact it before sending to Langfuse or self-host with encryption at rest.
- API keys are required for both Langfuse and OpenTelemetry exporters. Rotate them regularly.
- If you use Langfuse Cloud, traces leave your infrastructure. Self-hosting keeps data in your VPC.
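Redaction before sending can be as simple as a filter applied to message content on its way into the tracing call. The sketch below covers only email addresses and US-style Social Security numbers, using patterns assumed for illustration; real PII handling needs a broader, reviewed rule set. `redactText` and `redactMessages` are hypothetical helpers.

```typescript
interface ChatMessage {
  role: string;
  content: string;
}

// Illustrative patterns only; they will miss many real-world PII formats.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const SSN_RE = /\b\d{3}-\d{2}-\d{4}\b/g;

// Replace matches with fixed placeholders so traces stay readable.
function redactText(text: string): string {
  return text.replace(EMAIL_RE, "[REDACTED_EMAIL]").replace(SSN_RE, "[REDACTED_SSN]");
}

// Return redacted copies; the original messages still go to the model unchanged.
function redactMessages(messages: ChatMessage[]): ChatMessage[] {
  return messages.map(function (m: ChatMessage): ChatMessage {
    return { role: m.role, content: redactText(m.content) };
  });
}
```

You would pass the result of `redactMessages` as the traced input while sending the unredacted messages to the model, so the observability store never sees the raw values.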
When to Use Langfuse vs. Pure OpenTelemetry
| Scenario | Use Langfuse | Use OpenTelemetry Only |
|---|---|---|
| You need prompt versioning and eval tracking | Yes | No (requires custom instrumentation) |
| You want to query traces by model, token count, or eval score | Yes | No (APM tools lack LLM-specific indexes) |
| You already have a mature APM stack and just need LLM spans | Maybe (use Langfuse exporter) | Yes (write custom spans) |
| You need sub-100ms trace ingestion latency | No (Langfuse adds a network hop) | Yes |
| You want a UI for non-engineers to inspect prompts and evals | Yes | No |
Langfuse is not a replacement for OpenTelemetry. It is a layer on top that adds LLM-specific semantics. If you only need basic tracing, you can instrument LLM calls with OpenTelemetry directly. If you need prompt management, eval tracking, and a UI for non-engineers, Langfuse saves you from building it yourself.
Technical Verdict
Use Langfuse when:
- You run agents in production and need to correlate LLM behavior with infrastructure metrics.
- You version prompts and want to query which traces used which version.
- You run evals and need to track scores over time.
- You want a UI for non-engineers to inspect traces, prompts, and feedback.
Avoid Langfuse when:
- You need sub-100ms trace ingestion latency (the extra network hop adds overhead).
- You already have custom LLM instrumentation and do not need prompt versioning.
- You cannot tolerate another service in your stack (self-hosting requires at least Postgres and a web server, and Langfuse v3 adds ClickHouse, Redis, and blob storage).
- You handle highly sensitive data and cannot send traces outside your VPC (unless you self-host).
Langfuse addresses the observability gap between agent execution and infrastructure monitoring. It does not replace your APM tool. It extends it with LLM-specific spans that propagate trace context across tool boundaries. If you run agents in production, you need something like this, and Langfuse is one of the more mature open-source options.