AI agents do not become expensive all at once. They become expensive one small decision at a time. One extra routing call. One confidence check. One retry. One tool failure that triggers another LLM request. One formatting agent that exists because it felt cleaner during the first design.
Individually, these calls look harmless. Together, they can turn a simple support request into a chain of model calls, tool executions, retries, and post-processing steps that nobody can easily explain from logs alone.
The Visibility Gap
Most teams start with simple logs:
[10:14:02] RouterAgent: classified request as ORDER_CHANGE
[10:14:03] OrderAgent: fetched order details
[10:14:05] OrderAgent: generated response
[10:14:06] ResponseAgent: formatted final message
This tells you something happened. It does not tell you:
- How many LLM calls happened inside each agent
- Which calls were tool invocations versus model completions
- Whether retries occurred and why
- Whether an agent made multiple calls for work that could have been handled by a cheaper model, a cache, or a simple rule
When a customer message like “I need to change my shipping address for order #12345” triggers a cascade of agent steps, you need instrumentation that surfaces the three cost drivers: token consumption, tool overhead, and retry logic.
Architecture: Cost-Aware Agent Wrapper
The pattern is to wrap every agent interaction with a tracer that captures:
- Token counts (input and output, per model call)
- Tool invocations (name, latency, success/failure)
- Retry attempts (trigger reason, backoff strategy, final outcome)
Here’s the TypeScript shape:
interface AgentTrace {
  agentName: string;
  sessionId: string;
  timestamp: number;

  // Token accounting
  inputTokens: number;
  outputTokens: number;
  modelName: string;

  // Tool tracking
  toolCalls: ToolCall[];

  // Retry metadata
  attemptNumber: number;
  retryReason?: string;

  // Cost projection
  estimatedCostUSD: number;
}

interface ToolCall {
  name: string;
  durationMs: number;
  success: boolean;
  errorType?: string;
}
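class TracedAgent {
  // callLLM and calculateCost are not shown here; ToolDefinition and
  // AgentResponse are assumed to be defined elsewhere.
  constructor(
    private name: string,
    private modelName: string,
    private tracer: AgentTracer,
    private tools: Record<string, (args: unknown) => Promise<unknown>>
  ) {}

  async execute(input: string, sessionId: string): Promise<AgentResponse> {
    const trace: AgentTrace = {
      agentName: this.name,
      sessionId,
      timestamp: Date.now(),
      inputTokens: 0,
      outputTokens: 0,
      modelName: this.modelName,
      toolCalls: [],
      attemptNumber: 1,
      estimatedCostUSD: 0
    };

    try {
      // Wrap LLM call with token counting (callLLM fills in the token fields on trace)
      const response = await this.callLLM(input, trace);

      // Track tool invocations
      if (response.toolCalls) {
        for (const tool of response.toolCalls) {
          const toolTrace = await this.executeToolWithTracing(tool);
          trace.toolCalls.push(toolTrace);
        }
      }

      return response;
    } finally {
      // Price and record the trace even when a call throws:
      // failed attempts still consume tokens
      trace.estimatedCostUSD = this.calculateCost(trace);
      await this.tracer.record(trace);
    }
  }

  private async executeToolWithTracing(tool: ToolDefinition): Promise<ToolCall> {
    const start = Date.now();
    try {
      // The tool result is discarded here; a real agent would feed it back to the model
      await this.tools[tool.name](tool.args);
      return {
        name: tool.name,
        durationMs: Date.now() - start,
        success: true
      };
    } catch (error) {
      // Record the failure instead of rethrowing so the trace stays complete
      return {
        name: tool.name,
        durationMs: Date.now() - start,
        success: false,
        // catch variables are `unknown` in strict TypeScript, so narrow first
        errorType: error instanceof Error ? error.constructor.name : 'UnknownError'
      };
    }
  }
}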
Token Counting Without Breaking Streaming
The challenge with token instrumentation is that streaming responses do not return token counts until the stream completes. You have three options:
| Approach | Accuracy | Latency Impact | Implementation Complexity |
|---|---|---|---|
| Client-side estimation (tiktoken) | ~95% | None | Low |
| Provider-reported usage (final stream chunk) | 100% | None (if the provider exposes it) | Low |
| Wait for stream completion | 100% | Blocks until done | Medium |
For production agents, use provider-reported counts when available (OpenAI returns usage in the final stream chunk when stream_options.include_usage is set) and fall back to client-side estimation for providers that do not report usage on streamed responses.
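import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
type Message = OpenAI.Chat.ChatCompletionMessageParam;

async function streamWithTokenCounting(
  messages: Message[],
  trace: AgentTrace
): Promise<string> {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages,
    stream: true,
    stream_options: { include_usage: true }
  });

  let fullResponse = '';
  for await (const chunk of stream) {
    // The usage-only final chunk has an empty choices array, hence the optional chaining
    if (chunk.choices[0]?.delta?.content) {
      fullResponse += chunk.choices[0].delta.content;
    }

    // OpenAI sends usage in the final chunk; accumulate so that
    // several model calls can share one trace
    if (chunk.usage) {
      trace.inputTokens += chunk.usage.prompt_tokens;
      trace.outputTokens += chunk.usage.completion_tokens;
    }
  }
  return fullResponse;
}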
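For providers that report no usage on streamed responses, the fallback can be sketched as below. This is a minimal illustration, not a definitive implementation: resolveUsage, estimateTokens, and the estimated flag are hypothetical names, and the four-characters-per-token rule is only a rough heuristic for English text that stands in for a real tokenizer such as tiktoken.

```typescript
// Shape of the usage object a provider may or may not send (assumption).
interface ReportedUsage {
  prompt_tokens: number;
  completion_tokens: number;
}

interface ResolvedUsage {
  inputTokens: number;
  outputTokens: number;
  estimated: boolean; // true when no provider count was available
}

// Rough heuristic: about 4 characters per token for English text.
// Swap in a real tokenizer where accuracy matters.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Prefer provider-reported counts; otherwise estimate and mark the result.
function resolveUsage(
  reported: ReportedUsage | undefined,
  promptText: string,
  completionText: string
): ResolvedUsage {
  if (reported) {
    return {
      inputTokens: reported.prompt_tokens,
      outputTokens: reported.completion_tokens,
      estimated: false
    };
  }
  return {
    inputTokens: estimateTokens(promptText),
    outputTokens: estimateTokens(completionText),
    estimated: true
  };
}
```

Carrying the estimated flag into the trace makes it possible to separate exact counts from approximations when the numbers are reconciled later.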
Retry Logic Placement
Retries belong at the boundary where failure is detected, not in a global wrapper. This means:
- LLM client wrapper: Retry on rate limits, transient network errors
- Tool boundary: Retry on tool-specific failures (database deadlock, external API timeout)
- Orchestration layer: Do not retry here unless you want to re-run the entire agent chain
Example tool-level retry with exponential backoff:
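class ToolExecutor {
  constructor(
    private tools: Record<string, (args: unknown) => Promise<unknown>>
  ) {}

  async executeWithRetry(
    toolName: string,
    args: unknown,
    trace: AgentTrace,
    maxAttempts = 3
  ): Promise<unknown> {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      trace.attemptNumber = attempt;
      try {
        return await this.tools[toolName](args);
      } catch (error) {
        // catch variables are `unknown` in strict TypeScript, so narrow first
        trace.retryReason = error instanceof Error ? error.message : String(error);

        // Only retry on specific error types, and give up after the last attempt
        if (!this.isRetryable(error) || attempt === maxAttempts) {
          throw error;
        }

        // Exponential backoff: 100ms after attempt 1, 200ms after attempt 2.
        // With maxAttempts = 3 there is no third sleep.
        await this.sleep(100 * Math.pow(2, attempt - 1));
      }
    }
    // Only reachable when maxAttempts < 1
    throw new Error(`maxAttempts must be at least 1, got ${maxAttempts}`);
  }

  // DatabaseDeadlockError, RateLimitError, and NetworkTimeoutError are
  // application-defined error classes
  private isRetryable(error: unknown): boolean {
    return (
      error instanceof DatabaseDeadlockError ||
      error instanceof RateLimitError ||
      error instanceof NetworkTimeoutError
    );
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}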
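The same loop can sit at the LLM client boundary for rate limits and transient network errors. The sketch below is one possible generic form, assuming hypothetical names (withRetry, Sleep, realSleep) and an injectable sleep function so the backoff schedule can be checked without waiting.

```typescript
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry fn at the boundary where the failure is detected.
// Delay doubles after each failed attempt: baseDelayMs, 2x, 4x, ...
async function withRetry<T>(
  fn: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  maxAttempts = 3,
  baseDelayMs = 100,
  sleep: Sleep = realSleep
): Promise<{ value: T; attempts: number }> {
  for (let attempt = 1; ; attempt++) {
    try {
      return { value: await fn(), attempts: attempt };
    } catch (error) {
      // Non-retryable errors and the final attempt propagate to the caller
      if (!isRetryable(error) || attempt >= maxAttempts) throw error;
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

Returning the attempt count alongside the value lets the caller copy it into the trace, so retries at this boundary show up in the same place as tool-level retries.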
Granularity: Per-Invocation vs. Per-Session
You need both. Per-invocation traces show individual cost spikes. Per-session aggregates show cumulative spend across a conversation.
interface SessionAggregate {
  sessionId: string;
  totalTokens: number;
  totalCostUSD: number;
  agentCalls: { agentName: string; timestamp: number; tokens: number }[];
  toolCalls: ToolCall[];
  retries: number;
}

class SessionTracker {
  private sessions = new Map<string, SessionAggregate>();

  recordTrace(trace: AgentTrace): void {
    // Start a fresh aggregate the first time a session is seen
    const session: SessionAggregate = this.sessions.get(trace.sessionId) ?? {
      sessionId: trace.sessionId,
      totalTokens: 0,
      totalCostUSD: 0,
      agentCalls: [],
      toolCalls: [],
      retries: 0
    };

    session.totalTokens += trace.inputTokens + trace.outputTokens;
    session.totalCostUSD += trace.estimatedCostUSD;
    session.agentCalls.push({
      agentName: trace.agentName,
      timestamp: trace.timestamp,
      tokens: trace.inputTokens + trace.outputTokens
    });
    session.toolCalls.push(...trace.toolCalls);
    // attemptNumber is 1 for a call that succeeded first time, so subtract it
    session.retries += trace.attemptNumber - 1;

    this.sessions.set(trace.sessionId, session);
  }

  getSessionCost(sessionId: string): number {
    return this.sessions.get(sessionId)?.totalCostUSD ?? 0;
  }
}
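Per-session totals also make a budget check straightforward. The following is a small sketch under stated assumptions: SessionBudget and BudgetDecision are hypothetical names, and the budget is a single fixed USD amount applied to every session.

```typescript
interface BudgetDecision {
  sessionId: string;
  spentUSD: number;
  budgetUSD: number;
  overBudget: boolean;
}

class SessionBudget {
  private spent = new Map<string, number>();

  constructor(private budgetUSD: number) {}

  // Add one invocation's cost and report whether the session is now over budget.
  record(sessionId: string, costUSD: number): BudgetDecision {
    const spentUSD = (this.spent.get(sessionId) ?? 0) + costUSD;
    this.spent.set(sessionId, spentUSD);
    return {
      sessionId,
      spentUSD,
      budgetUSD: this.budgetUSD,
      overBudget: spentUSD > this.budgetUSD
    };
  }
}

// Usage: two invocations in one session cross a $0.10 budget.
const budget = new SessionBudget(0.1);
const first = budget.record('session-a', 0.06);  // overBudget: false
const second = budget.record('session-a', 0.06); // overBudget: true
```

What happens on overBudget (an alert, a cheaper model, a hard stop) is a product decision; the tracker only needs to surface the signal.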
Cost Calculation
Token counts mean nothing without pricing. Maintain a pricing table and update it when providers change rates:
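// USD per 1,000 tokens. These are snapshot values: verify them against each
// provider's current pricing page before relying on the output.
const MODEL_PRICING: Record<string, { inputPer1k: number; outputPer1k: number }> = {
  'gpt-4': {
    inputPer1k: 0.03,
    outputPer1k: 0.06
  },
  'gpt-3.5-turbo': {
    inputPer1k: 0.0015,
    outputPer1k: 0.002
  },
  'claude-3-opus': {
    inputPer1k: 0.015,
    outputPer1k: 0.075
  }
};

function calculateCost(trace: AgentTrace): number {
  const pricing = MODEL_PRICING[trace.modelName];
  if (!pricing) {
    // An unknown model is costed at 0, which hides spend, so make it visible
    console.warn(`No pricing entry for model "${trace.modelName}"`);
    return 0;
  }

  const inputCost = (trace.inputTokens / 1000) * pricing.inputPer1k;
  const outputCost = (trace.outputTokens / 1000) * pricing.outputPer1k;
  return inputCost + outputCost;
}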
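A worked example makes the formula concrete. The sketch below prices the same hypothetical call (1,200 input tokens, 300 output tokens) with the gpt-4 and gpt-3.5-turbo rates from the table above; costUSD and Pricing are illustrative names, and the token counts are made up for the arithmetic.

```typescript
interface Pricing {
  inputPer1k: number;
  outputPer1k: number;
}

// Same formula as calculateCost, with the pricing passed in explicitly.
function costUSD(inputTokens: number, outputTokens: number, pricing: Pricing): number {
  return (
    (inputTokens / 1000) * pricing.inputPer1k +
    (outputTokens / 1000) * pricing.outputPer1k
  );
}

const gpt4: Pricing = { inputPer1k: 0.03, outputPer1k: 0.06 };
const gpt35: Pricing = { inputPer1k: 0.0015, outputPer1k: 0.002 };

// gpt-4:         1.2 * 0.03   + 0.3 * 0.06  = 0.036  + 0.018
// gpt-3.5-turbo: 1.2 * 0.0015 + 0.3 * 0.002 = 0.0018 + 0.0006
const big = costUSD(1200, 300, gpt4);
const small = costUSD(1200, 300, gpt35);

console.log(big.toFixed(4), small.toFixed(4)); // prints "0.0540 0.0024"
```

At these rates the identical call costs 22.5 times more on the larger model, which is why a routing or formatting step that could run on a cheaper model is worth finding in the traces.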
Observability Integration
Traces are useless if they live in memory. Push them to a time-series database or observability platform:
class AgentTracer {
  // MetricsClient and LogSink stand for whatever metrics and logging
  // clients the platform provides
  constructor(
    private metricsClient: MetricsClient,
    private logSink: LogSink
  ) {}

  async record(trace: AgentTrace): Promise<void> {
    // Structured logs for debugging; the session ID lives here
    await this.logSink.write({
      level: 'info',
      message: 'Agent execution completed',
      ...trace
    });

    // Metrics for dashboards and alerts. Tokens and cost are additive, so they
    // are emitted as counters; a gauge would keep only the last value per interval.
    this.metricsClient.increment('agent.tokens.input', trace.inputTokens, {
      agent: trace.agentName,
      model: trace.modelName
    });
    this.metricsClient.increment('agent.tokens.output', trace.outputTokens, {
      agent: trace.agentName,
      model: trace.modelName
    });
    // Tagged by model rather than session: one tag value per session
    // creates unbounded metric cardinality
    this.metricsClient.increment('agent.cost.usd', trace.estimatedCostUSD, {
      agent: trace.agentName,
      model: trace.modelName
    });
    this.metricsClient.increment('agent.tool_calls', trace.toolCalls.length, {
      agent: trace.agentName
    });

    if (trace.attemptNumber > 1) {
      // retryReason should be a short, bounded code; raw error messages
      // make poor tag values
      this.metricsClient.increment('agent.retries', trace.attemptNumber - 1, {
        agent: trace.agentName,
        reason: trace.retryReason
      });
    }
  }
}
Failure Modes
Token estimation drift: Client-side token counting (tiktoken) can diverge from actual usage by 2-5% due to tokenizer version mismatches. Always reconcile estimates against provider-reported counts in post-processing.
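Reconciliation can be as simple as comparing each estimate with the provider's count and flagging outliers. This is a sketch only: UsageSample, driftPercent, flagDrift, and the 5% default threshold are hypothetical choices, not part of any provider API.

```typescript
interface UsageSample {
  estimatedTokens: number;
  reportedTokens: number;
}

// Signed drift of the estimate relative to the provider's count, in percent.
function driftPercent(sample: UsageSample): number {
  if (sample.reportedTokens === 0) return 0;
  return (
    ((sample.estimatedTokens - sample.reportedTokens) / sample.reportedTokens) * 100
  );
}

// Return the samples whose absolute drift exceeds the threshold.
function flagDrift(samples: UsageSample[], thresholdPercent = 5): UsageSample[] {
  return samples.filter((s) => Math.abs(driftPercent(s)) > thresholdPercent);
}

const samples: UsageSample[] = [
  { estimatedTokens: 1040, reportedTokens: 1000 }, // +4%, within threshold
  { estimatedTokens: 900, reportedTokens: 1000 }   // -10%, flagged
];
const flagged = flagDrift(samples);
```

A steady rise in flagged samples after a dependency or model upgrade is a hint that the client-side tokenizer no longer matches the model.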
Tool call attribution: If a tool internally calls another LLM (e.g., a summarization step inside a database query tool), those tokens will not appear in the agent trace unless the tool itself is instrumented. Nested LLM calls are the most common source of cost leakage.
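One way to close this gap is to hand each tool a usage sink that writes into the calling agent's trace. The sketch below assumes hypothetical names throughout (UsageSink, TraceUsage, queryAndSummarize), and fakeSummarize stands in for a real model request with made-up token counts.

```typescript
interface UsageSink {
  addUsage(source: string, inputTokens: number, outputTokens: number): void;
}

// Accumulates usage from the agent and from any tool that reports into it.
class TraceUsage implements UsageSink {
  inputTokens = 0;
  outputTokens = 0;
  sources: string[] = [];

  addUsage(source: string, inputTokens: number, outputTokens: number): void {
    this.inputTokens += inputTokens;
    this.outputTokens += outputTokens;
    this.sources.push(source);
  }
}

// Stand-in for an LLM call made inside a tool; the usage numbers are invented.
function fakeSummarize(rows: string[]): {
  text: string;
  inputTokens: number;
  outputTokens: number;
} {
  return { text: `${rows.length} rows`, inputTokens: rows.length * 50, outputTokens: 20 };
}

// A tool that makes its own model call and reports the tokens to the caller's sink.
function queryAndSummarize(rows: string[], usage: UsageSink): string {
  const summary = fakeSummarize(rows);
  usage.addUsage('queryAndSummarize/summarize', summary.inputTokens, summary.outputTokens);
  return summary.text;
}

const usage = new TraceUsage();
usage.addUsage('agent', 300, 80);            // the agent's own model call
queryAndSummarize(['row1', 'row2'], usage);  // adds 100 input and 20 output tokens
```

Recording the source alongside the counts keeps the nested call attributable, so a dashboard can show how much of an agent's spend originates inside its tools.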
Retry amplification: A single user request that triggers three retries at the orchestration layer can multiply costs by 4x (the original run plus three re-runs) if each retry re-runs the entire agent chain. Always retry at the narrowest possible scope.
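The arithmetic can be sketched for a three-step chain. The step costs below are invented (in cents, to keep the numbers exact), the function names are hypothetical, and the model assumes every attempt is billed in full and that earlier steps' outputs can be reused when only one step is retried.

```typescript
// Total cost when any failure re-runs the whole chain.
function chainRetryCost(stepCosts: number[], retries: number): number {
  const chainCost = stepCosts.reduce((sum, c) => sum + c, 0);
  return chainCost * (1 + retries);
}

// Total cost when only the failing step is retried.
function stepRetryCost(stepCosts: number[], failingStep: number, retries: number): number {
  const chainCost = stepCosts.reduce((sum, c) => sum + c, 0);
  return chainCost + stepCosts[failingStep] * retries;
}

// Router, order agent, formatter: 2, 5, and 1 cents per run.
const steps = [2, 5, 1];

const wholeChain = chainRetryCost(steps, 3);   // 8 * 4 = 32 cents
const narrowScope = stepRetryCost(steps, 1, 3); // 8 + 5 * 3 = 23 cents
```

The gap widens as the chain grows: the chain-level figure scales with the cost of every step, while the narrow-scope figure scales only with the step that actually failed.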
Session ID collision: If you reuse session IDs across different conversations, cost aggregates become meaningless. Generate a new UUID per conversation and include it in every trace.
Technical Verdict
Use this pattern when:
- You have multi-agent systems where cost attribution is unclear
- Your LLM bill is growing faster than user growth
- You need to identify which agents or tools are driving spend
- You want to set per-session cost budgets or alerts
Avoid this pattern when:
- You have a single-agent system with no tool calls (basic logging is enough)
- Your provider does not expose token counts (you will be stuck with estimates)
- You are still in prototype phase and cost is not yet a concern
Building a trace adds little overhead compared with a model call, as long as the write to the log sink stays off the request's critical path, and the operational value is high. You cannot optimize what you cannot measure, and agent costs are invisible until you make them visible.