The Hidden Cost of AI Agents: Tracing Tokens, Tool Calls, and Retries in TypeScript

AI agents do not become expensive all at once. They become expensive one small decision at a time. One extra routing call. One confidence check. One retry. One tool failure that triggers another LLM request. One formatting agent that exists because it felt cleaner during the first design.

Individually, these calls look harmless. Together, they can turn a simple support request into a chain of model calls, tool executions, retries, and post-processing steps that nobody can easily explain from logs alone.

The Visibility Gap

Most teams start with simple logs:

[10:14:02] RouterAgent: classified request as ORDER_CHANGE
[10:14:03] OrderAgent: fetched order details
[10:14:05] OrderAgent: generated response
[10:14:06] ResponseAgent: formatted final message

This tells you something happened. It does not tell you:

How many LLM calls happened inside each agent
Which calls were tool invocations versus model completions
Whether retries occurred and why
Whether an agent made multiple calls for work that could have been handled by a cheaper model, a cache, or a simple rule

When a customer message like “I need to change my shipping address for order #12345” triggers a cascade of agent steps, you need instrumentation that surfaces the three cost drivers: token consumption, tool overhead, and retry logic.

Architecture: Cost-Aware Agent Wrapper

The pattern is to wrap every agent interaction with a tracer that captures:

Token counts (input and output, per model call)
Tool invocations (name, latency, success/failure)
Retry attempts (trigger reason, backoff strategy, final outcome)

Here’s the TypeScript shape:

interface AgentTrace {
  agentName: string;
  sessionId: string;
  timestamp: number;
  
  // Token accounting
  inputTokens: number;
  outputTokens: number;
  modelName: string;
  
  // Tool tracking
  toolCalls: ToolCall[];
  
  // Retry metadata
  attemptNumber: number;
  retryReason?: string;
  
  // Cost projection
  estimatedCostUSD: number;
}

interface ToolCall {
  name: string;
  durationMs: number;
  success: boolean;
  errorType?: string;
}

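abstract class TracedAgent {
  // Supplied by the concrete agent; declared here so the class type-checks.
  protected abstract readonly name: string;
  protected abstract readonly modelName: string;
  protected abstract tools: Record<string, (args: unknown) => Promise<unknown>>;
  protected abstract callLLM(input: string, trace: AgentTrace): Promise<AgentResponse>;
  protected abstract calculateCost(trace: AgentTrace): number;

  constructor(private tracer: AgentTracer) {}
  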
  async execute(input: string, sessionId: string): Promise<AgentResponse> {
    const trace: AgentTrace = {
      agentName: this.name,
      sessionId,
      timestamp: Date.now(),
      inputTokens: 0,
      outputTokens: 0,
      modelName: this.modelName,
      toolCalls: [],
      attemptNumber: 1,
      estimatedCostUSD: 0
    };
    
    try {
      // Wrap LLM call with token counting
      const response = await this.callLLM(input, trace);
      
      // Track tool invocations
      if (response.toolCalls) {
        for (const tool of response.toolCalls) {
          const toolTrace = await this.executeToolWithTracing(tool);
          trace.toolCalls.push(toolTrace);
        }
      }
      
      // Calculate cost
      trace.estimatedCostUSD = this.calculateCost(trace);
      
      return response;
    } finally {
      await this.tracer.record(trace);
    }
  }
  
  private async executeToolWithTracing(tool: ToolDefinition): Promise<ToolCall> {
    const start = Date.now();
    try {
      await this.tools[tool.name](tool.args);
      return {
        name: tool.name,
        durationMs: Date.now() - start,
        success: true
      };
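    } catch (error) {
      // `error` is `unknown` under strict settings, so narrow before reading it
      return {
        name: tool.name,
        durationMs: Date.now() - start,
        success: false,
        errorType: error instanceof Error ? error.constructor.name : 'UnknownError'
      };
    }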
  }
}

Token Counting Without Breaking Streaming

The challenge with token instrumentation is that streaming responses do not return token counts until the stream completes. You have three options:

Approach	Accuracy	Latency Impact	Implementation Complexity
Client-side estimation (tiktoken)	~95%	None	Low
Provider-reported usage (response body or final stream chunk)	100%	None (if provider exposes it)	Low
Wait for stream completion	100%	Blocks until done	Medium

For production agents, use provider-reported counts when available (OpenAI includes usage in a final stream chunk when the request sets stream_options: { include_usage: true }) and fall back to client-side estimation for providers that do not report usage on streamed responses.

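import OpenAI from 'openai';

// Reads OPENAI_API_KEY from the environment by default
const openai = new OpenAI();

async function streamWithTokenCounting(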
  messages: Message[],
  trace: AgentTrace
): Promise<string> {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages,
    stream: true,
    stream_options: { include_usage: true }
  });
  
  let fullResponse = '';
  
  for await (const chunk of stream) {
    if (chunk.choices[0]?.delta?.content) {
      fullResponse += chunk.choices[0].delta.content;
    }
    
    // With include_usage set, OpenAI sends usage in a final chunk whose choices array is empty
    if (chunk.usage) {
      trace.inputTokens = chunk.usage.prompt_tokens;
      trace.outputTokens = chunk.usage.completion_tokens;
    }
  }
  
  return fullResponse;
}
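
The fallback path described above can be sketched as follows. This is a minimal illustration, not a provider API: `resolveUsage`, `estimateTokens`, and the `estimated` flag are hypothetical names, and the four-characters-per-token heuristic is a rough stand-in for a real tokenizer such as tiktoken.

```typescript
// Usage as reported by the provider (field names mirror OpenAI's `usage` object).
interface ProviderUsage {
  prompt_tokens: number;
  completion_tokens: number;
}

interface ResolvedUsage {
  inputTokens: number;
  outputTokens: number;
  estimated: boolean; // true when the counts came from the local heuristic
}

// ASSUMPTION: ~4 characters per token is a coarse rule of thumb for English
// text. Swap in a real tokenizer for better accuracy.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Prefer provider-reported counts; fall back to estimation when they are absent.
function resolveUsage(
  promptText: string,
  responseText: string,
  usage?: ProviderUsage | null
): ResolvedUsage {
  if (usage) {
    return {
      inputTokens: usage.prompt_tokens,
      outputTokens: usage.completion_tokens,
      estimated: false
    };
  }
  return {
    inputTokens: estimateTokens(promptText),
    outputTokens: estimateTokens(responseText),
    estimated: true
  };
}
```

Keeping the `estimated` flag on the stored record makes it possible to separate exact from approximate spend later.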

Retry Logic Placement

Retries belong at the boundary where failure is detected, not in a global wrapper. This means:

LLM client wrapper: Retry on rate limits, transient network errors
Tool boundary: Retry on tool-specific failures (database deadlock, external API timeout)
Orchestration layer: Do not retry here unless you want to re-run the entire agent chain

Example tool-level retry with exponential backoff:

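type Tool = (args: any) => Promise<any>;

class ToolExecutor {
  constructor(private tools: Record<string, Tool>) {}

  async executeWithRetry(
    toolName: string,
    args: any,
    trace: AgentTrace,
    maxAttempts = 3
  ): Promise<any> {
    const tool = this.tools[toolName];
    if (!tool) {
      throw new Error(`Unknown tool: ${toolName}`);
    }

    for (let attempt = 1; ; attempt++) {
      trace.attemptNumber = attempt;

      try {
        return await tool(args);
      } catch (error) {
        // `error` is `unknown` in strict mode; narrow before reading `message`
        trace.retryReason = error instanceof Error ? error.message : String(error);

        // Only retry on specific error types, and stop once attempts run out
        if (!this.isRetryable(error) || attempt >= maxAttempts) {
          throw error;
        }

        // Exponential backoff: 100ms, then 200ms, doubling after each failed attempt.
        // With maxAttempts = 3 there are at most two waits.
        await this.sleep(100 * Math.pow(2, attempt - 1));
      }
    }
  }

  // The error classes below are assumed to be defined by your tool layer
  private isRetryable(error: unknown): boolean {
    return (
      error instanceof DatabaseDeadlockError ||
      error instanceof RateLimitError ||
      error instanceof NetworkTimeoutError
    );
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}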
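
The same idea applies at the LLM client boundary, where rate limits and transient network errors are the usual triggers. The sketch below is a generic helper under stated assumptions: `withRetry`, `RetryOptions`, and the `isRetryable` predicate are hypothetical names, not part of any SDK, and full-jitter backoff is one common choice rather than a requirement of the pattern.

```typescript
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  // Decides whether an error is worth another attempt (e.g. HTTP 429, timeouts).
  isRetryable: (error: unknown) => boolean;
  // Injectable so tests can skip real waiting.
  sleep?: (ms: number) => Promise<void>;
  // Hook for recording the retry on a trace.
  onRetry?: (attempt: number, error: unknown) => void;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Give up on non-retryable errors or when attempts are exhausted.
      if (!opts.isRetryable(error) || attempt >= opts.maxAttempts) {
        throw error;
      }
      opts.onRetry?.(attempt, error);
      // Full jitter: a random delay in [0, baseDelayMs * 2^(attempt - 1)).
      const cap = opts.baseDelayMs * 2 ** (attempt - 1);
      await sleep(Math.random() * cap);
    }
  }
}
```

Jitter spreads out retries from concurrent sessions so they do not hit a rate-limited endpoint in lockstep. The `onRetry` hook is where an agent would bump `attemptNumber` and set `retryReason` on its trace.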

Granularity: Per-Invocation vs. Per-Session

You need both. Per-invocation traces show individual cost spikes. Per-session aggregates show cumulative spend across a conversation.

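interface SessionAggregate {
  sessionId: string;
  totalTokens: number;
  totalCostUSD: number;
  agentCalls: { agentName: string; timestamp: number; tokens: number }[];
  toolCalls: ToolCall[];
  retries: number;
}

class SessionTracker {
  private sessions = new Map<string, SessionAggregate>();
  
  recordTrace(trace: AgentTrace): void {
    // Annotate the type so the empty arrays are not inferred as never[]
    const session: SessionAggregate = this.sessions.get(trace.sessionId) ?? {
      sessionId: trace.sessionId,
      totalTokens: 0,
      totalCostUSD: 0,
      agentCalls: [],
      toolCalls: [],
      retries: 0
    };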
    
    session.totalTokens += trace.inputTokens + trace.outputTokens;
    session.totalCostUSD += trace.estimatedCostUSD;
    session.agentCalls.push({
      agentName: trace.agentName,
      timestamp: trace.timestamp,
      tokens: trace.inputTokens + trace.outputTokens
    });
    session.toolCalls.push(...trace.toolCalls);
    session.retries += trace.attemptNumber - 1;
    
    this.sessions.set(trace.sessionId, session);
  }
  
  getSessionCost(sessionId: string): number {
    return this.sessions.get(sessionId)?.totalCostUSD || 0;
  }
}
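
Per-session aggregates also make budget checks straightforward. The following is a sketch only: `SessionBudget`, its method names, and the limit value are hypothetical and not part of the tracker above.

```typescript
class SessionBudget {
  private readonly spent = new Map<string, number>();
  private readonly limitUSD: number;

  constructor(limitUSD: number) {
    this.limitUSD = limitUSD;
  }

  // Add the cost of one traced call and return the session's running total.
  add(sessionId: string, costUSD: number): number {
    const total = (this.spent.get(sessionId) ?? 0) + costUSD;
    this.spent.set(sessionId, total);
    return total;
  }

  // Remaining budget, never negative.
  remaining(sessionId: string): number {
    return Math.max(0, this.limitUSD - (this.spent.get(sessionId) ?? 0));
  }

  // Check this before each model request to stop runaway sessions.
  isExhausted(sessionId: string): boolean {
    return (this.spent.get(sessionId) ?? 0) >= this.limitUSD;
  }
}

// Usage: a hypothetical $0.10 cap per conversation.
const budget = new SessionBudget(0.1);
budget.add('session-a', 0.04);
if (budget.isExhausted('session-a')) {
  // Degrade gracefully here: cheaper model, cached answer, or human handoff.
}
```

Whether an exhausted budget should block the request or only raise an alert is a product decision; the check itself stays the same.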

Cost Calculation

Token counts mean nothing without pricing. Maintain a pricing table and update it when providers change rates:

// USD per 1,000 tokens. Typed as a Record so lookups by arbitrary model name type-check.
const MODEL_PRICING: Record<string, { inputPer1k: number; outputPer1k: number }> = {
  'gpt-4': {
    inputPer1k: 0.03,
    outputPer1k: 0.06
  },
  'gpt-3.5-turbo': {
    inputPer1k: 0.0015,
    outputPer1k: 0.002
  },
  'claude-3-opus': {
    inputPer1k: 0.015,
    outputPer1k: 0.075
  }
};

function calculateCost(trace: AgentTrace): number {
  const pricing = MODEL_PRICING[trace.modelName];
  // Unknown models cost 0 here, which hides spend; log or alert on this path in production
  if (!pricing) return 0;
  
  const inputCost = (trace.inputTokens / 1000) * pricing.inputPer1k;
  const outputCost = (trace.outputTokens / 1000) * pricing.outputPer1k;
  
  return inputCost + outputCost;
}
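
As a worked example using the gpt-4 rates in the table above: a call with 1,200 input tokens and 300 output tokens costs (1200 / 1000) × 0.03 + (300 / 1000) × 0.06 = 0.036 + 0.018 = $0.054. The snippet below reproduces that arithmetic in isolation; `costFor` and `Rate` are illustrative names, and the rates are copied from the table, not fetched from any provider.

```typescript
interface Rate {
  inputPer1k: number;
  outputPer1k: number;
}

// Copied from the pricing table; verify against current provider pricing.
const GPT4_RATE: Rate = { inputPer1k: 0.03, outputPer1k: 0.06 };

function costFor(rate: Rate, inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1000) * rate.inputPer1k +
    (outputTokens / 1000) * rate.outputPer1k
  );
}

const singleCall = costFor(GPT4_RATE, 1200, 300);

// Round for display only; keep full precision when aggregating.
console.log(`$${singleCall.toFixed(3)}`); // prints "$0.054"
```

Note that output tokens cost twice as much as input tokens at these rates, so verbose responses are the more expensive side of the call.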

Observability Integration

Traces are useless if they live in memory. Push them to a time-series database or observability platform:

class AgentTracer {
  constructor(
    private metricsClient: MetricsClient,
    private logSink: LogSink
  ) {}
  
  async record(trace: AgentTrace): Promise<void> {
    // Structured logs for debugging
    await this.logSink.write({
      level: 'info',
      message: 'Agent execution completed',
      ...trace
    });
    
    // Metrics for dashboards and alerts
    this.metricsClient.gauge('agent.tokens.input', trace.inputTokens, {
      agent: trace.agentName,
      model: trace.modelName
    });
    
    this.metricsClient.gauge('agent.tokens.output', trace.outputTokens, {
      agent: trace.agentName,
      model: trace.modelName
    });
    
    this.metricsClient.gauge('agent.cost.usd', trace.estimatedCostUSD, {
      agent: trace.agentName,
      session: trace.sessionId
    });
    
    this.metricsClient.increment('agent.tool_calls', trace.toolCalls.length, {
      agent: trace.agentName
    });
    
    if (trace.attemptNumber > 1) {
      this.metricsClient.increment('agent.retries', trace.attemptNumber - 1, {
        agent: trace.agentName,
        reason: trace.retryReason
      });
    }
  }
}
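
For unit tests, the tracer's dependencies can be replaced with in-memory stand-ins. The `MetricsClient` and `LogSink` interfaces below are inferred from how `AgentTracer` calls them, not taken from a specific vendor SDK, so treat their shapes as assumptions. `InMemoryMetrics` and `InMemoryLogSink` are hypothetical helper names.

```typescript
type Tags = Record<string, string | undefined>;

// ASSUMPTION: shapes inferred from the calls made in AgentTracer.record().
interface MetricsClient {
  gauge(name: string, value: number, tags?: Tags): void;
  increment(name: string, value: number, tags?: Tags): void;
}

interface LogSink {
  write(entry: Record<string, unknown>): Promise<void>;
}

interface MetricPoint {
  kind: 'gauge' | 'increment';
  name: string;
  value: number;
  tags: Tags;
}

class InMemoryMetrics implements MetricsClient {
  readonly points: MetricPoint[] = [];

  gauge(name: string, value: number, tags: Tags = {}): void {
    this.points.push({ kind: 'gauge', name, value, tags });
  }

  increment(name: string, value: number, tags: Tags = {}): void {
    this.points.push({ kind: 'increment', name, value, tags });
  }

  // Sum of all increments recorded under a metric name.
  total(name: string): number {
    return this.points
      .filter((p) => p.kind === 'increment' && p.name === name)
      .reduce((sum, p) => sum + p.value, 0);
  }
}

class InMemoryLogSink implements LogSink {
  readonly entries: Record<string, unknown>[] = [];

  async write(entry: Record<string, unknown>): Promise<void> {
    this.entries.push(entry);
  }
}
```

A test can then construct the tracer with these two objects, record a trace, and assert on `points` and `entries` without any network access.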

Failure Modes

Token estimation drift: Client-side token counting (tiktoken) can diverge from actual usage by 2-5% due to tokenizer version mismatches. Always reconcile estimates against provider-reported counts in post-processing.
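
Reconciliation can be as simple as comparing the two counts per call and flagging outliers. A minimal sketch, assuming both the estimate and the provider-reported count are stored; `tokenDrift`, `DriftReport`, and the 5% default threshold are illustrative choices.

```typescript
interface DriftReport {
  driftRatio: number; // signed: positive means the estimate was too high
  exceeds: boolean;   // true when |driftRatio| is above the threshold
}

function tokenDrift(estimated: number, actual: number, threshold = 0.05): DriftReport {
  if (actual <= 0) {
    // Avoid dividing by zero; a non-zero estimate against zero actual is flagged.
    return {
      driftRatio: estimated === 0 ? 0 : Infinity,
      exceeds: estimated !== 0
    };
  }
  const driftRatio = (estimated - actual) / actual;
  return { driftRatio, exceeds: Math.abs(driftRatio) > threshold };
}

// Example: an estimate of 1,030 tokens against 1,000 reported is 3% high,
// which is inside the default threshold.
const report = tokenDrift(1030, 1000);
```

A sustained rise in flagged calls usually points to a tokenizer version mismatch rather than random noise.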

Tool call attribution: If a tool internally calls another LLM (e.g., a summarization step inside a database query tool), those tokens will not appear in the agent trace unless the tool itself is instrumented. Nested LLM calls are the most common source of cost leakage.
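
One way to close this gap is to hand each tool a usage recorder tied to the calling agent's totals, so nested model calls are added to the same trace. This is a sketch under assumptions: `ToolContext`, `recordUsage`, and the `summarizeOrders` tool are hypothetical, and the nested model call is stubbed with fixed numbers.

```typescript
interface UsageTotals {
  inputTokens: number;
  outputTokens: number;
  nestedCalls: number;
}

// Passed to every tool so it can report model usage it incurs internally.
interface ToolContext {
  recordUsage(inputTokens: number, outputTokens: number): void;
}

function createToolContext(totals: UsageTotals): ToolContext {
  return {
    recordUsage(inputTokens: number, outputTokens: number): void {
      totals.inputTokens += inputTokens;
      totals.outputTokens += outputTokens;
      totals.nestedCalls += 1;
    }
  };
}

// Hypothetical tool that calls a model internally. The model call is stubbed;
// a real tool would report the provider's usage numbers instead.
async function summarizeOrders(orders: string[], ctx: ToolContext): Promise<string> {
  const stubUsage = { prompt_tokens: 40 * orders.length, completion_tokens: 25 };
  ctx.recordUsage(stubUsage.prompt_tokens, stubUsage.completion_tokens);
  return `Summary of ${orders.length} orders`;
}
```

Explicit context passing is verbose but easy to audit: a tool that never touches `ctx` is visibly uninstrumented.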

Retry amplification: A single user request that triggers three retries at the orchestration layer can multiply costs by 4x if each retry re-runs the entire agent chain. Always retry at the narrowest possible scope.
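
The arithmetic is worth making explicit. Suppose a chain makes five model calls at $0.01 each. Three orchestration-level retries run the chain four times: 4 × $0.05 = $0.20. Retrying only the failing step three times costs $0.05 + 3 × $0.01 = $0.08. The helper names and per-call figures below are illustrative, not measurements, and the model assumes every failed attempt is billed in full.

```typescript
function sum(values: number[]): number {
  return values.reduce((total, v) => total + v, 0);
}

// Cost when each retry re-runs the whole chain.
function chainRetryCost(stepCosts: number[], retries: number): number {
  return sum(stepCosts) * (1 + retries);
}

// Cost when only the failing step is retried.
function stepRetryCost(stepCosts: number[], failingStep: number, retries: number): number {
  return sum(stepCosts) + (stepCosts[failingStep] ?? 0) * retries;
}

const steps = [0.01, 0.01, 0.01, 0.01, 0.01];
const wide = chainRetryCost(steps, 3);     // about $0.20
const narrow = stepRetryCost(steps, 2, 3); // about $0.08
```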

Session ID collision: If you reuse session IDs across different conversations, cost aggregates become meaningless. Generate a new UUID per conversation and include it in every trace.
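
A minimal sketch of minting one ID per conversation, assuming a Node.js runtime where `randomUUID` from `node:crypto` is available. `Conversation` and `startConversation` are hypothetical names.

```typescript
import { randomUUID } from 'node:crypto';

interface Conversation {
  sessionId: string;
  startedAt: number;
}

// Create the ID once, at conversation start, and pass it to every agent call.
function startConversation(): Conversation {
  return { sessionId: randomUUID(), startedAt: Date.now() };
}

const first = startConversation();
const second = startConversation();
// first.sessionId and second.sessionId are distinct random (version 4) UUIDs.
```

Avoid deriving the session ID from a user ID or an order number: both recur across conversations and would merge unrelated cost aggregates.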

Technical Verdict

Use this pattern when:

You have multi-agent systems where cost attribution is unclear
Your LLM bill is growing faster than user growth
You need to identify which agents or tools are driving spend
You want to set per-session cost budgets or alerts

Avoid this pattern when:

You have a single-agent system with no tool calls (basic logging is enough)
Your provider does not expose token counts (you will be stuck with estimates)
You are still in prototype phase and cost is not yet a concern

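The instrumentation overhead is typically negligible next to model latency (building a trace object takes well under a millisecond), provided log and metric writes are kept off the response path, and the operational value is high. You cannot optimize what you cannot measure, and agent costs are invisible until you make them visible.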
