LLM-as-Judge for Agent Evals: How to Catch Silent Failures Before Production

Your agent returns the right answer but burns three unnecessary API calls and hallucinates a price check along the way. Traditional pass/fail metrics score this as perfect. This is the silent failure problem.

Binary output validation cannot see wasted tokens, unsafe reasoning paths, or tool misuse. You need two evaluation layers: LLM-as-Judge for output quality and trajectory evaluation for process quality. Together they catch the failures that slip past unit tests and otherwise surface in production only after customers complain.

The Silent Failure Gap

Agents fail in ways that output-only checks miss:

Token waste: Correct answer, but the agent made redundant tool calls or got stuck in verbose reasoning loops
Hallucinated intermediates: Final output is accurate, but the agent fabricated facts during reasoning
Unsafe paths: Agent reached the goal but violated safety constraints or business rules
Lucky guesses: Agent skipped necessary verification steps and happened to be right

Traditional evaluation compares final output to expected output. If they match, the test passes. This works for deterministic functions. It breaks for agents that orchestrate multiple tools, maintain state, and reason through multi-step plans.
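
A minimal sketch of this gap, assuming a made-up trace format (the `Step` tuple and the tool names are illustrative, not any framework's API). The output-only check passes, while a simple process check on the same run flags three identical searches:

```python
from collections import Counter
from typing import NamedTuple


class Step(NamedTuple):
    """One tool call in a hypothetical trace format."""
    tool: str
    args: tuple  # sorted (name, value) pairs, so the step is hashable


def output_check(final_output: str, expected: str) -> bool:
    """Traditional evaluation: compare final output to expected output."""
    return final_output.strip() == expected.strip()


def repeated_calls(trace: list[Step]) -> dict[Step, int]:
    """Process check: tool calls that occur more than once with identical args."""
    counts = Counter(trace)
    return {step: n for step, n in counts.items() if n > 1}


search = Step("search_flights", (("date", "tonight"), ("dest", "BOS")))
trace = [search, search, search, Step("get_price", (("flight", "BA117"),))]

passed = output_check("BA117 at 7PM ($450)", "BA117 at 7PM ($450)")
duplicates = repeated_calls(trace)
print(passed, duplicates)
```

The two checks look at different data, which is why neither can stand in for the other.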

LLM-as-Judge: Structured Output Evaluation

LLM-as-Judge uses a second model to evaluate the first model’s output against explicit criteria. The judge model receives the agent’s output, the expected behavior, and a scoring rubric.

Implementation Pattern

from strands_agents_evals import LLMAsJudge

# Define evaluation criteria
rubric = """
Score the agent's response on a 1-5 scale:
1 - Incorrect or irrelevant
2 - Partially correct but missing key details
3 - Correct but verbose or inefficient
4 - Correct and concise
5 - Correct, concise, and well-formatted

Provide:
- Score (1-5)
- Reasoning (2-3 sentences)
- Specific issues (if score < 5)
"""

judge = LLMAsJudge(
    model="anthropic.claude-3-5-sonnet-20241022-v2:0",
    rubric=rubric,
    temperature=0.0  # Reduces score variance; does not guarantee identical scores
)

# Evaluate agent output
result = judge.evaluate(
    agent_output="BA117 at 7PM ($450)",
    expected_behavior="Return flight number, time, and price",
    context={"user_query": "Find me a flight to Boston tonight"}
)

# Example output; actual values vary with the judge model and rubric
print(result.score)        # e.g. 4
print(result.reasoning)    # e.g. "Correct information, slightly informal format"
print(result.issues)       # e.g. ["Missing airline name", "Currency symbol inconsistent"]

Judge Prompt Design

The judge prompt is your evaluation contract. Weak prompts produce inconsistent scores. Strong prompts include:

Explicit scale: Define what each score means with examples
Failure modes: List common errors to check (hallucinations, format violations, missing data)
Context requirements: Specify what information the judge needs to make a decision
Output format: Structured JSON or specific fields for programmatic parsing

Example structured judge prompt:

judge_prompt = """
Evaluate the agent's flight booking response.

SCORING CRITERIA:
5 - All required fields present, accurate, properly formatted
4 - All required fields present, minor formatting issues
3 - Missing one optional field OR one formatting error
2 - Missing required field OR factual error
1 - Multiple errors or completely wrong

REQUIRED FIELDS:
- Flight number (format: two-letter airline code plus 1-4 digits, e.g. BA117)
- Departure time (format: HH:MM AM/PM with timezone)
- Price (format: $XXX)

OPTIONAL FIELDS:
- Airline name
- Arrival time
- Seat class

COMMON FAILURES TO CHECK:
- Hallucinated flight numbers
- Incorrect price format
- Missing timezone
- Ambiguous times (is 7PM departure or arrival?)

Return JSON:
{
  "score": <1-5>,
  "reasoning": "<explanation>",
  "missing_required": [<list>],
  "missing_optional": [<list>],
  "errors": [<list>]
}
"""
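
Because the prompt asks for JSON, the judge's reply can be validated before its score is trusted. The following is a sketch using only the standard library; the function name `parse_judge_reply` is hypothetical, and the field names simply mirror the prompt above.

```python
import json

REQUIRED_KEYS = {"score", "reasoning", "missing_required", "missing_optional", "errors"}
LIST_KEYS = ("missing_required", "missing_optional", "errors")


def parse_judge_reply(raw: str) -> dict:
    """Parse and validate a judge reply; raise ValueError on any contract violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")

    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"judge reply is missing keys: {sorted(missing)}")

    score = data["score"]
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")

    for key in LIST_KEYS:
        if not isinstance(data[key], list):
            raise ValueError(f"{key} must be a list")

    return data
```

Treating a malformed reply as an evaluation error, rather than as a low score, keeps judge failures from being mistaken for agent failures.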

Judge Model Selection

Model Tier	Use Case	Latency	Cost	Consistency
GPT-4 / Claude 3.5 Sonnet	Production evals, complex rubrics	2-5s	High	Excellent
GPT-3.5 / Claude 3 Haiku	CI/CD pipeline, simple rubrics	0.5-1s	Low	Good
Fine-tuned small model	High-volume, domain-specific	<0.5s	Very low	Variable

For CI/CD pipelines, use a fast cheap model for initial screening and escalate to a stronger judge for failures or edge cases.
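
One way to sketch that escalation, assuming each judge can be wrapped as a plain callable that returns a 1-5 score (the `Judge` alias, `TieredResult`, and the `pass_score` cutoff are illustrative choices, not part of any library):

```python
from dataclasses import dataclass
from typing import Callable

# A judge is modeled as any callable that maps an agent output to a 1-5 score.
Judge = Callable[[str], int]


@dataclass
class TieredResult:
    score: int
    escalated: bool


def tiered_judge(output: str, fast: Judge, strong: Judge, pass_score: int = 4) -> TieredResult:
    """Screen with the fast judge; re-score with the strong judge only on a low score."""
    fast_score = fast(output)
    if fast_score >= pass_score:
        return TieredResult(score=fast_score, escalated=False)
    # The fast judge flagged a possible failure, so the strong judge's score is final.
    return TieredResult(score=strong(output), escalated=True)
```

With this shape, the expensive model runs only on the outputs the cheap model scored below `pass_score`, so cost scales with the failure rate rather than with the size of the test set.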

Trajectory Evaluation: Process Quality

Trajectory evaluation examines the sequence of actions an agent took, not just the final output. This catches inefficiency, unsafe tool use, and reasoning errors that happen to produce correct answers.

What a Trajectory Contains

A trajectory is the full execution trace:

Tool calls: Which tools were invoked, in what order, with what arguments
Reasoning steps: Internal monologue or chain-of-thought
State transitions: How the agent’s internal state changed
Observations: What the agent learned from each tool call
Decisions: Why the agent chose each action

Strands Agents captures this automatically via hooks. Some other frameworks require manual instrumentation.
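
For frameworks that need manual instrumentation, a trace container can be quite small. The sketch below is a hypothetical structure, not the Strands Agents trace format; its field names are chosen to match the attributes used by the metric functions in this article (`step.type`, `step.tokens`, `step.content`, `call.tool_name`, `call.args`, `trajectory.task`).

```python
from dataclasses import dataclass, field
from typing import Any, Literal, Optional

StepType = Literal["tool_call", "reasoning", "observation"]


@dataclass
class Step:
    type: StepType
    content: str = ""                 # reasoning text or tool observation
    tokens: int = 0                   # tokens consumed by this step
    tool_name: Optional[str] = None   # set only for tool_call steps
    args: dict[str, Any] = field(default_factory=dict)


@dataclass
class Trajectory:
    task: str
    steps: list[Step] = field(default_factory=list)

    def __iter__(self):
        # Lets metric code write `for step in trajectory`
        return iter(self.steps)

    def tool_calls(self) -> list[Step]:
        return [s for s in self.steps if s.type == "tool_call"]
```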

Trajectory Metrics

Token efficiency:

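def calculate_token_waste(trajectory):
    """Return tokens spent beyond the estimated minimum, as a fraction of that minimum."""
    total_tokens = sum(step.tokens for step in trajectory)
    # estimate_minimum_tokens is a task-specific baseline you supply,
    # for example the token count of a known-good reference run
    necessary_tokens = estimate_minimum_tokens(trajectory.task)
    if necessary_tokens <= 0:
        raise ValueError("minimum token estimate must be positive")
    return (total_tokens - necessary_tokens) / necessary_tokens

# Flag trajectories with >50% waste
if calculate_token_waste(trajectory) > 0.5:
    log_inefficiency(trajectory)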

Tool call redundancy:

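import json

def detect_redundant_calls(trajectory):
    """Return (repeat, first_call) pairs for tool calls repeated with identical arguments."""
    tool_calls = [step for step in trajectory if step.type == "tool_call"]
    seen = {}
    redundant = []

    for call in tool_calls:
        # Serialize args so nested lists and dicts, which are unhashable, can be compared
        key = (call.tool_name, json.dumps(call.args, sort_keys=True, default=str))
        if key in seen:
            redundant.append((call, seen[key]))
        else:
            seen[key] = call

    return redundant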

Reasoning loops:

def detect_loops(trajectory, window=3):
    reasoning_steps = [s.content for s in trajectory if s.type == "reasoning"]
    
    for i in range(len(reasoning_steps) - window + 1):  # +1 so the final window is checked
        window_slice = reasoning_steps[i:i+window]
        if len(set(window_slice)) < window:  # Repeated reasoning
            return True, i
    
    return False, None

LLM-as-Judge for Trajectories

You can also use an LLM to evaluate the trajectory itself:

trajectory_judge_prompt = """
Evaluate the agent's execution path for this flight booking task.

TRAJECTORY:
{trajectory}

EVALUATION CRITERIA:
1. Tool efficiency (did it make unnecessary calls?)
2. Reasoning quality (did it consider alternatives?)
3. Error handling (did it recover from failures gracefully?)
4. Safety (did it validate inputs and outputs?)

Score each dimension 1-5 and provide specific examples.
Flag any hallucinations or unsafe actions.
"""

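# Substitute the formatted trace into the {trajectory} placeholder
formatted = format_trajectory(trajectory)
trajectory_result = judge.evaluate(
    agent_output=formatted,
    rubric=trajectory_judge_prompt.format(trajectory=formatted)
)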

This catches patterns like:

Agent called search_flights three times with identical parameters
Agent hallucinated a price check tool that doesn’t exist
Agent skipped input validation and passed unsanitized user input to a tool
Agent made an API call, ignored the result, and made the same call again
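
Some of these patterns can also be confirmed deterministically before or after the judge runs. As one example, a call to a tool that does not exist can be caught by comparing the trace against the set of registered tools. This is a sketch; `find_unknown_tools` and its inputs are hypothetical names.

```python
def find_unknown_tools(called_tools: list[str], registered_tools: set[str]) -> list[str]:
    """Return called tool names that are not registered, in first-seen order."""
    unknown: list[str] = []
    for name in called_tools:
        if name not in registered_tools and name not in unknown:
            unknown.append(name)
    return unknown


registered = {"search_flights", "book_flight"}
called = ["search_flights", "check_price", "search_flights", "check_price"]
print(find_unknown_tools(called, registered))
```

A cheap check like this gives the judge's hallucination flag something concrete to be compared against.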

Versioning Judge Prompts

Judge prompts are code. They need version control, regression tests, and change management.

Judge Prompt Regression Suite

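# tests/test_judge_regression.py
import pytest
from strands_agents_evals import LLMAsJudge

# load_rubric() and load_golden_set() are project-specific file loaders (not shown)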

@pytest.fixture
def judge_v1():
    return LLMAsJudge(rubric=load_rubric("judge_v1.txt"))

@pytest.fixture
def judge_v2():
    return LLMAsJudge(rubric=load_rubric("judge_v2.txt"))

def test_judge_consistency(judge_v1, judge_v2):
    """Ensure new judge version doesn't regress on known cases"""
    test_cases = load_golden_set("test_cases.json")
    
    for case in test_cases:
        score_v1 = judge_v1.evaluate(case.output, case.expected).score
        score_v2 = judge_v2.evaluate(case.output, case.expected).score
        
        # Allow ±1 point variance
        assert abs(score_v1 - score_v2) <= 1, \
            f"Judge v2 regressed on case {case.id}: {score_v1} -> {score_v2}"

Judge Prompt Versioning Strategy

Store judge prompts in version control with semantic versioning:

evals/
  judges/
    flight_booking_v1.0.0.txt
    flight_booking_v1.1.0.txt  # Added timezone check
    flight_booking_v2.0.0.txt  # Changed scoring scale
  golden_sets/
    flight_booking_v1.json
  tests/
    test_judge_regression.py

Tag each evaluation run with the judge version:

eval_result = {
    "agent_version": "v2.3.1",
    "judge_version": "v1.1.0",
    "score": 4,
    "timestamp": "2026-05-25T08:18:45Z"
}

This lets you compare agent improvements across judge versions and detect when judge changes affect scores.
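
A sketch of that comparison, assuming results are stored as dictionaries shaped like the record above. The helper names are hypothetical. Fingerprinting the prompt text is an optional extra that catches edits made without a version bump.

```python
import hashlib
from collections import defaultdict
from statistics import mean


def prompt_fingerprint(prompt_text: str) -> str:
    """Short SHA-256 digest of the judge prompt, stored alongside the version tag."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


def mean_score_by_version(results: list[dict]) -> dict[tuple[str, str], float]:
    """Average score for each (agent_version, judge_version) pair."""
    grouped: dict[tuple[str, str], list[float]] = defaultdict(list)
    for r in results:
        grouped[(r["agent_version"], r["judge_version"])].append(r["score"])
    return {key: mean(scores) for key, scores in grouped.items()}
```

If the mean moves when only `judge_version` changes, the shift comes from the judge, not the agent.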

Production Integration Patterns

CI/CD Pipeline

# .github/workflows/agent-eval.yml
name: Agent Evaluation

on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Run agent test suite
        run: python -m pytest tests/agent/
      
      - name: Run LLM-as-Judge evals
        run: |
          python evals/run_judge.py \
            --test-set golden_set.json \
            --judge-version v1.1.0 \
            --threshold 4.0
      
      - name: Run trajectory analysis
        run: |
          python evals/analyze_trajectories.py \
            --max-token-waste 0.5 \
            --max-redundant-calls 2
      
      - name: Post results to PR
        uses: actions/github-script@v6
        with:
          script: |
            const results = require('./eval_results.json');
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: `## Eval Results\n\n${JSON.stringify(results, null, 2)}`
            });
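
The `--threshold` flag in the workflow implies a gate that fails the job when scores fall short. A minimal sketch of that gate follows, assuming the results file is a JSON list of objects with a `score` field. That file layout and both function names are assumptions, since the contents of `run_judge.py` are not shown here.

```python
import json
from statistics import mean


def gate(scores: list[float], threshold: float) -> tuple[bool, float]:
    """Return (passed, average). An empty score list fails, so a broken run cannot pass."""
    if not scores:
        return False, 0.0
    average = mean(scores)
    return average >= threshold, average


def main(argv: list[str]) -> int:
    """Usage: gate.py <results.json> <threshold>. Returns a process exit code."""
    path, threshold = argv[1], float(argv[2])
    with open(path, encoding="utf-8") as fh:
        scores = [item["score"] for item in json.load(fh)]
    passed, average = gate(scores, threshold)
    status = "PASS" if passed else "FAIL"
    print(f"mean score {average:.2f} (threshold {threshold:.2f}): {status}")
    return 0 if passed else 1
```

Calling `sys.exit(main(sys.argv))` from the script's entry point turns the return value into the exit code that the CI step checks.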

Observability Hooks

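import boto3
from strands_agents import Agent, hooks

cloudwatch = boto3.client("cloudwatch")
# `slack` below stands for whatever chat client your team already uses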
@hooks.on_trajectory_complete
def log_trajectory_metrics(trajectory):
    metrics = {
        "total_tokens": sum(s.tokens for s in trajectory),
        "tool_calls": len([s for s in trajectory if s.type == "tool_call"]),
        "reasoning_steps": len([s for s in trajectory if s.type == "reasoning"]),
        "duration_ms": trajectory.duration,
        "token_waste_ratio": calculate_token_waste(trajectory)
    }
    
    # Send to observability backend
    cloudwatch.put_metric_data(
        Namespace="AgentEvals",
        MetricData=[
            {"MetricName": k, "Value": v} for k, v in metrics.items()
        ]
    )
    
    # Flag inefficient trajectories
    if metrics["token_waste_ratio"] > 0.5:
        slack.post_message(
            channel="#agent-alerts",
            text=f"High token waste detected: {trajectory.id}"
        )

Amazon Bedrock AgentCore Evaluators

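AWS also offers managed evaluators for common patterns through Amazon Bedrock AgentCore. The call below illustrates the general shape of such a request; confirm the client, operation name, and configuration fields against the current AWS SDK documentation before using it: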

import boto3

bedrock = boto3.client('bedrock-agent-runtime')

response = bedrock.evaluate_agent(
    agentId='agent-123',
    evaluationConfig={
        'evaluators': [
            {
                'type': 'HALLUCINATION_DETECTION',
                'config': {
                    'threshold': 0.8,
                    'groundTruthSource': 's3://bucket/knowledge-base/'
                }
            },
            {
                'type': 'TOOL_USE_EFFICIENCY',
                'config': {
                    'maxRedundantCalls': 2,
                    'requiredTools': ['search_flights', 'book_flight']
                }
            },
            {
                'type': 'SAFETY_ALIGNMENT',
                'config': {
                    'policyDocument': 's3://bucket/safety-policy.json'
                }
            }
        ]
    },
    testSet='s3://bucket/test-cases.json'
)

print(response['evaluationResults'])

These evaluators run server-side and integrate with CloudWatch for alerting.

Failure Mode Taxonomy

Failure Type	Detection Method	Example
Wrong output	LLM-as-Judge	Agent returns flight to wrong city
Hallucinated fact	LLM-as-Judge + knowledge base check	Agent invents a flight number
Token waste	Trajectory analysis	Agent makes 5 calls when 2 would work
Unsafe tool use	Trajectory analysis + safety rules	Agent passes unsanitized SQL query
Reasoning loop	Trajectory analysis	Agent repeats same thought 3 times
Missing validation	Trajectory analysis	Agent skips input sanitization
Lucky guess	Trajectory analysis	Agent skips verification but gets right answer
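
The last row is the hardest to see from outputs alone, but simple to check from a trace. The sketch below reports required steps that were skipped or run out of order. The function name and the `verify_price` tool are hypothetical.

```python
def skipped_steps(called_tools: list[str], required_in_order: list[str]) -> list[str]:
    """Return required tools that were not called in the required order."""
    skipped: list[str] = []
    position = 0  # index in called_tools from which the next required step may appear
    for step in required_in_order:
        try:
            position = called_tools.index(step, position) + 1
        except ValueError:
            skipped.append(step)
    return skipped


required = ["search_flights", "verify_price", "book_flight"]
print(skipped_steps(["search_flights", "book_flight"], required))
```

An agent that booked without verifying may still return the right price, so this check has to fail the run regardless of the output score.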

Research Context

Recent work validates these patterns:

WindowsWorld: Evaluation framework for GUI agents shows trajectory analysis catches 40% more failures than output-only metrics
D3-Gym: Benchmark for decision-making agents demonstrates LLM-as-Judge correlates 0.89 with human ratings when rubrics are explicit
CARE framework: Combines correctness, alignment, reasoning quality, and efficiency metrics, all requiring trajectory access

The trend is clear: output-only evaluation is insufficient for production agent systems.

Technical Verdict

Use LLM-as-Judge when:

You need to evaluate subjective qualities (tone, helpfulness, formatting)
Your output space is too large for exhaustive golden sets
You can afford 1-5 second eval latency per test case
You have budget for judge model API calls ($0.01-0.10 per evaluation)

Avoid LLM-as-Judge when:

You need sub-100ms evaluation latency (use rule-based checks instead)
Your evaluation criteria are purely deterministic
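
For the deterministic case, the required-field checks from the flight booking rubric can be written as plain pattern matches. This is a sketch: the patterns are simplified assumptions (two-letter airline codes, 12-hour times, dollar prices) and would need adjusting for real data.

```python
import re

FLIGHT_NUMBER = re.compile(r"\b[A-Z]{2}\d{1,4}\b")
DEPARTURE_TIME = re.compile(r"\b(1[0-2]|0?[1-9])(:[0-5]\d)?\s?(AM|PM)\b", re.IGNORECASE)
PRICE = re.compile(r"\$\d+(\.\d{2})?")

CHECKS = {
    "flight_number": FLIGHT_NUMBER,
    "departure_time": DEPARTURE_TIME,
    "price": PRICE,
}


def missing_fields(output: str) -> list[str]:
    """Return the names of required fields that no pattern finds in the output."""
    return [name for name, pattern in CHECKS.items() if not pattern.search(output)]


print(missing_fields("BA117 at 7PM ($450)"))
```

Checks like these run in microseconds and cost nothing per call, so they can gate every response while the judge handles only what patterns cannot express.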