Your agent returns the right answer but burns three unnecessary API calls and hallucinates a price check along the way. Traditional pass/fail metrics score this as perfect. This is the silent failure problem.
Binary output validation cannot see wasted tokens, unsafe reasoning paths, or tool misuse. You need two evaluation layers: LLM-as-Judge for output quality and trajectory evaluation for process quality. Together they catch the failures that slip past unit tests and otherwise surface only in production logs, after customers complain.
The Silent Failure Gap
Agents fail in ways that output-only checks miss:
- Token waste: Correct answer, but the agent made redundant tool calls or verbose reasoning loops
- Hallucinated intermediates: Final output is accurate, but the agent fabricated facts during reasoning
- Unsafe paths: Agent reached the goal but violated safety constraints or business rules
- Lucky guesses: Agent skipped necessary verification steps and happened to be right
Traditional evaluation compares final output to expected output. If they match, the test passes. This works for deterministic functions. It breaks for agents that orchestrate multiple tools, maintain state, and reason through multi-step plans.
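A minimal sketch of the gap, assuming a hypothetical `Run` record that stores the final answer alongside the tool calls made. The names and the call budget are illustrative, not part of any framework:

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """Hypothetical record of one agent run: final answer plus tool calls made."""
    output: str
    tool_calls: list = field(default_factory=list)  # (tool_name, args) tuples


def output_only_check(run: Run, expected: str) -> bool:
    # Traditional evaluation: compare the final answer and nothing else.
    return run.output == expected


def process_check(run: Run, max_calls: int) -> bool:
    # Process-aware evaluation: also bound how much work the agent did.
    return len(run.tool_calls) <= max_calls


run = Run(
    output="BA117 at 7PM ($450)",
    tool_calls=[("search_flights", "BOS")] * 3 + [("get_price", "BA117")],
)

print(output_only_check(run, "BA117 at 7PM ($450)"))  # True
print(process_check(run, max_calls=2))                # False
```

The same run passes the first check and fails the second, which is the case output-only evaluation cannot distinguish.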
LLM-as-Judge: Structured Output Evaluation
LLM-as-Judge uses a second model to evaluate the first model’s output against explicit criteria. The judge model receives the agent’s output, the expected behavior, and a scoring rubric.
Implementation Pattern
from strands_agents_evals import LLMAsJudge
# Define evaluation criteria
rubric = """
Score the agent's response on a 1-5 scale:
1 - Incorrect or irrelevant
2 - Partially correct but missing key details
3 - Correct but verbose or inefficient
4 - Correct and concise
5 - Correct, concise, and well-formatted
Provide:
- Score (1-5)
- Reasoning (2-3 sentences)
- Specific issues (if score < 5)
"""
judge = LLMAsJudge(
    model="anthropic.claude-3-5-sonnet-20241022-v2:0",
    rubric=rubric,
    temperature=0.0,  # Minimizes score variance between runs
)
# Evaluate agent output
result = judge.evaluate(
    agent_output="BA117 at 7PM ($450)",
    expected_behavior="Return flight number, time, and price",
    context={"user_query": "Find me a flight to Boston tonight"},
)
print(result.score)      # 4
print(result.reasoning)  # "Correct information, slightly informal format"
print(result.issues)     # ["Missing airline name", "Currency symbol inconsistent"]
Judge Prompt Design
The judge prompt is your evaluation contract. Weak prompts produce inconsistent scores. Strong prompts include:
- Explicit scale: Define what each score means with examples
- Failure modes: List common errors to check (hallucinations, format violations, missing data)
- Context requirements: Specify what information the judge needs to make a decision
- Output format: Structured JSON or specific fields for programmatic parsing
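The output-format point matters most when scores feed automation. Below is a minimal sketch of parsing and validating a judge reply, assuming the judge is asked for a JSON object with the keys `score`, `reasoning`, `missing_required`, `missing_optional`, and `errors`. The key names and the `parse_judge_response` helper are one possible schema for illustration, not a fixed standard:

```python
import json

REQUIRED_KEYS = {"score", "reasoning", "missing_required", "missing_optional", "errors"}


def parse_judge_response(raw: str) -> dict:
    """Parse and validate a judge reply that should contain one JSON object."""
    # Judges sometimes wrap JSON in extra prose, so keep only the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge response")
    data = json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError subclass

    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"judge response missing keys: {sorted(missing)}")
    score = data["score"]
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    return data


reply = (
    'Here is my verdict: {"score": 4, "reasoning": "ok", "missing_required": [], '
    '"missing_optional": ["Airline name"], "errors": []}'
)
print(parse_judge_response(reply)["score"])  # 4
```

Rejecting malformed replies up front keeps a formatting slip by the judge from being recorded as a score.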
Example structured judge prompt:
judge_prompt = """
Evaluate the agent's flight booking response.
SCORING CRITERIA:
5 - All required fields present, accurate, properly formatted
4 - All required fields present, minor formatting issues
3 - Missing one optional field OR one formatting error
2 - Missing required field OR factual error
1 - Multiple errors or completely wrong
REQUIRED FIELDS:
- Flight number (format: AA1234)
- Departure time (format: HH:MM AM/PM)
- Price (format: $XXX)
OPTIONAL FIELDS:
- Airline name
- Arrival time
- Seat class
COMMON FAILURES TO CHECK:
- Hallucinated flight numbers
- Incorrect price format
- Missing timezone
- Ambiguous times (is 7PM departure or arrival?)
Return JSON:
{
  "score": <1-5>,
  "reasoning": "<explanation>",
  "missing_required": [<list>],
  "missing_optional": [<list>],
  "errors": [<list>]
}
"""
Judge Model Selection
| Model Tier | Use Case | Latency | Cost | Consistency |
|---|---|---|---|---|
| GPT-4 / Claude 3.5 Sonnet | Production evals, complex rubrics | 2-5s | High | Excellent |
| GPT-3.5 / Claude 3 Haiku | CI/CD pipeline, simple rubrics | 0.5-1s | Low | Good |
| Fine-tuned small model | High-volume, domain-specific | <0.5s | Very low | Variable |
For CI/CD pipelines, use a fast cheap model for initial screening and escalate to a stronger judge for failures or edge cases.
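The screen-then-escalate idea can be sketched as follows. The judges here are plain callables standing in for real model calls, and the pass threshold of 4 is an assumption to tune per rubric:

```python
from typing import Callable

Judge = Callable[[str], int]  # hypothetical: takes agent output, returns a 1-5 score


def tiered_score(output: str, fast_judge: Judge, strong_judge: Judge,
                 pass_threshold: int = 4) -> tuple[int, str]:
    """Screen with the cheap judge; escalate only when it does not clearly pass."""
    fast = fast_judge(output)
    if fast >= pass_threshold:
        return fast, "fast"
    # Failures and borderline cases get a second opinion from the stronger model.
    return strong_judge(output), "strong"


# Stub judges standing in for real model calls.
cheap = lambda out: 5 if "$" in out else 2
strong = lambda out: 3

print(tiered_score("BA117 at 7PM ($450)", cheap, strong))  # (5, 'fast')
print(tiered_score("BA117 at 7PM", cheap, strong))         # (3, 'strong')
```

Returning which tier produced the score makes it possible to track how often escalation happens and what it costs.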
Trajectory Evaluation: Process Quality
Trajectory evaluation examines the sequence of actions an agent took, not just the final output. This catches inefficiency, unsafe tool use, and reasoning errors that happen to produce correct answers.
What a Trajectory Contains
A trajectory is the full execution trace:
- Tool calls: Which tools were invoked, in what order, with what arguments
- Reasoning steps: Internal monologue or chain-of-thought
- State transitions: How the agent’s internal state changed
- Observations: What the agent learned from each tool call
- Decisions: Why the agent chose each action
Strands Agents can capture this through hooks. Other frameworks may need manual instrumentation or a tracing integration.
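One way to represent such a trace is sketched below, with field names chosen to match the attributes used by the metric snippets in this article (`type`, `tokens`, `content`, `tool_name`, `args`, `task`, `duration`, `id`). The classes themselves are hypothetical and are not a framework API:

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Step:
    type: str                        # "tool_call", "reasoning", or "observation"
    tokens: int = 0
    content: str = ""                # reasoning text or observation payload
    tool_name: Optional[str] = None  # set only for tool calls
    args: dict[str, Any] = field(default_factory=dict)


@dataclass
class Trajectory:
    id: str
    task: str
    steps: list[Step] = field(default_factory=list)
    duration: float = 0.0            # milliseconds

    def __iter__(self):
        # Lets callers write `for step in trajectory`.
        return iter(self.steps)


traj = Trajectory(id="t1", task="Find a flight to Boston", steps=[
    Step(type="reasoning", tokens=40, content="Need to search flights"),
    Step(type="tool_call", tokens=25, tool_name="search_flights", args={"dest": "BOS"}),
])
print(sum(step.tokens for step in traj))  # 65
```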
Trajectory Metrics
Token efficiency:
def calculate_token_waste(trajectory):
    total_tokens = sum(step.tokens for step in trajectory)
    # estimate_minimum_tokens is a project-specific baseline helper you supply
    necessary_tokens = estimate_minimum_tokens(trajectory.task)
    if necessary_tokens <= 0:
        return 0.0  # No usable baseline, so no measurable waste
    waste_ratio = (total_tokens - necessary_tokens) / necessary_tokens
    return waste_ratio
# Flag trajectories with >50% waste
if calculate_token_waste(trajectory) > 0.5:
    log_inefficiency(trajectory)
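The waste ratio is only as good as its baseline. One possible way to estimate the minimum, assuming you keep token totals from known-good runs per task, is to take the cheapest of them. The `REFERENCE_RUNS` store, the task key, and the default value are all hypothetical:

```python
# Hypothetical baseline store: token totals from known-good runs, keyed by task.
REFERENCE_RUNS = {
    "find_flight": [420, 380, 510],
}


def estimate_minimum_tokens(task: str, default: int = 500) -> int:
    """Use the cheapest known-good run as the baseline; fall back to a default."""
    runs = REFERENCE_RUNS.get(task)
    return min(runs) if runs else default


def waste_ratio(total_tokens: int, task: str) -> float:
    baseline = estimate_minimum_tokens(task)
    return (total_tokens - baseline) / baseline


print(waste_ratio(570, "find_flight"))  # 0.5
```

A low percentile may be a steadier baseline than the minimum if a single unusually cheap run would otherwise dominate.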
Tool call redundancy:
import json
def detect_redundant_calls(trajectory):
    tool_calls = [step for step in trajectory if step.type == "tool_call"]
    seen = {}
    redundant = []
    for call in tool_calls:
        # Serialize args so the key stays hashable even with nested lists or dicts
        key = (call.tool_name, json.dumps(call.args, sort_keys=True, default=str))
        if key in seen:
            redundant.append((call, seen[key]))
        seen[key] = call
    return redundant
Reasoning loops:
def detect_loops(trajectory, window=3):
    reasoning_steps = [s.content for s in trajectory if s.type == "reasoning"]
    # +1 so the final window is checked as well
    for i in range(len(reasoning_steps) - window + 1):
        window_slice = reasoning_steps[i:i + window]
        if len(set(window_slice)) < window:  # Repeated reasoning
            return True, i
    return False, None
LLM-as-Judge for Trajectories
You can also use an LLM to evaluate the trajectory itself:
trajectory_judge_prompt = """
Evaluate the agent's execution path for this flight booking task.
TRAJECTORY:
{trajectory}
EVALUATION CRITERIA:
1. Tool efficiency (did it make unnecessary calls?)
2. Reasoning quality (did it consider alternatives?)
3. Error handling (did it recover from failures gracefully?)
4. Safety (did it validate inputs and outputs?)
Score each dimension 1-5 and provide specific examples.
Flag any hallucinations or unsafe actions.
"""
trajectory_result = judge.evaluate(
    agent_output=format_trajectory(trajectory),
    rubric=trajectory_judge_prompt
)
This catches patterns like:
- Agent called search_flights three times with identical parameters
- Agent hallucinated a price check tool that doesn’t exist
- Agent skipped input validation and passed unsanitized user input to a tool
- Agent made an API call, ignored the result, and made the same call again
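Some of these patterns do not need a model at all. As a cheap pre-check before the trajectory judge runs, a call to a tool that was never registered can be caught deterministically. This sketch assumes you can list the tool names the agent attempted and the names actually registered. The function name is illustrative:

```python
def find_unknown_tools(called: list[str], registered: set[str]) -> list[str]:
    """Return tool names the agent tried to call that are not in the registry."""
    unknown = []
    for name in called:
        if name not in registered and name not in unknown:
            unknown.append(name)
    return unknown


registered = {"search_flights", "book_flight"}
called = ["search_flights", "check_price", "search_flights", "check_price"]
print(find_unknown_tools(called, registered))  # ['check_price']
```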
Versioning Judge Prompts
Judge prompts are code. They need version control, regression tests, and change management.
Judge Prompt Regression Suite
# tests/test_judge_regression.py
import pytest
from evals import LLMAsJudge
# load_rubric and load_golden_set are project helpers, not shown here
@pytest.fixture
def judge_v1():
    return LLMAsJudge(rubric=load_rubric("judge_v1.txt"))
@pytest.fixture
def judge_v2():
    return LLMAsJudge(rubric=load_rubric("judge_v2.txt"))
def test_judge_consistency(judge_v1, judge_v2):
    """Ensure new judge version doesn't regress on known cases"""
    test_cases = load_golden_set("test_cases.json")
    for case in test_cases:
        score_v1 = judge_v1.evaluate(case.output, case.expected).score
        score_v2 = judge_v2.evaluate(case.output, case.expected).score
        # Allow ±1 point variance
        assert abs(score_v1 - score_v2) <= 1, \
            f"Judge v2 regressed on case {case.id}: {score_v1} -> {score_v2}"
Judge Prompt Versioning Strategy
Store judge prompts in version control with semantic versioning:
evals/
  judges/
    flight_booking_v1.0.0.txt
    flight_booking_v1.1.0.txt  # Added timezone check
    flight_booking_v2.0.0.txt  # Changed scoring scale
  golden_sets/
    flight_booking_v1.json
  tests/
    test_judge_regression.py
Tag each evaluation run with the judge version:
eval_result = {
    "agent_version": "v2.3.1",
    "judge_version": "v1.1.0",
    "score": 4,
    "timestamp": "2026-05-25T08:18:45Z"
}
This lets you compare agent improvements across judge versions and detect when judge changes affect scores.
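A small sketch of why the tag matters: averaging scores per (judge version, agent version) pair keeps agent comparisons within a single judge version, so a rubric change is not mistaken for an agent regression. The sample records are made up for illustration:

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_versions(results: list[dict]) -> dict[tuple[str, str], float]:
    """Average scores per (judge_version, agent_version) pair."""
    buckets = defaultdict(list)
    for r in results:
        buckets[(r["judge_version"], r["agent_version"])].append(r["score"])
    return {key: mean(scores) for key, scores in buckets.items()}


# Illustrative records in the same shape as eval_result.
results = [
    {"agent_version": "v2.3.0", "judge_version": "v1.1.0", "score": 3},
    {"agent_version": "v2.3.1", "judge_version": "v1.1.0", "score": 4},
    {"agent_version": "v2.3.1", "judge_version": "v1.1.0", "score": 5},
    {"agent_version": "v2.3.1", "judge_version": "v2.0.0", "score": 2},
]
for key, avg in sorted(mean_score_by_versions(results).items()):
    print(key, avg)
```

In this made-up data, the drop for agent v2.3.1 under judge v2.0.0 reflects the changed scoring scale, not a change in the agent.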
Production Integration Patterns
CI/CD Pipeline
# .github/workflows/agent-eval.yml
name: Agent Evaluation
on: [pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run agent test suite
        run: python -m pytest tests/agent/
      - name: Run LLM-as-Judge evals
        run: |
          python evals/run_judge.py \
            --test-set golden_set.json \
            --judge-version v1.1.0 \
            --threshold 4.0
      - name: Run trajectory analysis
        run: |
          python evals/analyze_trajectories.py \
            --max-token-waste 0.5 \
            --max-redundant-calls 2
      - name: Post results to PR
        uses: actions/github-script@v6
        with:
          script: |
            const results = require('./eval_results.json');
            // createComment needs owner and repo as well as the issue number
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: "## Eval Results\n\n" + JSON.stringify(results, null, 2)
            });
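The workflow relies on the eval script failing the job when scores fall short. The contents of `evals/run_judge.py` are not shown in this article, so the following is only a sketch of how a `--threshold` gate might work, under the assumption that the script compares the mean judge score to the threshold and exits non-zero on failure:

```python
from statistics import mean


def gate(scores: list[float], threshold: float) -> int:
    """Return a process exit code: 0 if the mean score meets the threshold, else 1."""
    if not scores:
        return 1  # an empty eval run should not count as a pass
    return 0 if mean(scores) >= threshold else 1


print(gate([5, 4, 4], threshold=4.0))  # 0
print(gate([5, 3, 3], threshold=4.0))  # 1
```

A script would pass the return value to `sys.exit`, since a non-zero exit code is what makes the CI step fail.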
Observability Hooks
import boto3
from strands_agents import Agent, hooks
cloudwatch = boto3.client("cloudwatch")
@hooks.on_trajectory_complete
def log_trajectory_metrics(trajectory):
    metrics = {
        "total_tokens": sum(s.tokens for s in trajectory),
        "tool_calls": len([s for s in trajectory if s.type == "tool_call"]),
        "reasoning_steps": len([s for s in trajectory if s.type == "reasoning"]),
        "duration_ms": trajectory.duration,
        "token_waste_ratio": calculate_token_waste(trajectory)
    }
    # Send to observability backend
    cloudwatch.put_metric_data(
        Namespace="AgentEvals",
        MetricData=[
            {"MetricName": k, "Value": v} for k, v in metrics.items()
        ]
    )
    # Flag inefficient trajectories
    if metrics["token_waste_ratio"] > 0.5:
        # slack is a placeholder for your own Slack client wrapper
        slack.post_message(
            channel="#agent-alerts",
            text=f"High token waste detected: {trajectory.id}"
        )
Amazon Bedrock AgentCore Evaluators
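AWS provides built-in evaluators for common patterns. Treat the request shape below as illustrative, and confirm the operation name, evaluator types, and config keys against the current boto3 and AgentCore documentation: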
import boto3
bedrock = boto3.client('bedrock-agent-runtime')
response = bedrock.evaluate_agent(
    agentId='agent-123',
    evaluationConfig={
        'evaluators': [
            {
                'type': 'HALLUCINATION_DETECTION',
                'config': {
                    'threshold': 0.8,
                    'groundTruthSource': 's3://bucket/knowledge-base/'
                }
            },
            {
                'type': 'TOOL_USE_EFFICIENCY',
                'config': {
                    'maxRedundantCalls': 2,
                    'requiredTools': ['search_flights', 'book_flight']
                }
            },
            {
                'type': 'SAFETY_ALIGNMENT',
                'config': {
                    'policyDocument': 's3://bucket/safety-policy.json'
                }
            }
        ]
    },
    testSet='s3://bucket/test-cases.json'
)
print(response['evaluationResults'])
These evaluators run server-side and integrate with CloudWatch for alerting.
Failure Mode Taxonomy
| Failure Type | Detection Method | Example |
|---|---|---|
| Wrong output | LLM-as-Judge | Agent returns flight to wrong city |
| Hallucinated fact | LLM-as-Judge + knowledge base check | Agent invents a flight number |
| Token waste | Trajectory analysis | Agent makes 5 calls when 2 would work |
| Unsafe tool use | Trajectory analysis + safety rules | Agent passes unsanitized SQL query |
| Reasoning loop | Trajectory analysis | Agent repeats same thought 3 times |
| Missing validation | Trajectory analysis | Agent skips input sanitization |
| Lucky guess | Trajectory analysis | Agent skips verification but gets right answer |
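The "lucky guess" row is the easiest to miss because the output is correct. A minimal sketch of one way to detect it, assuming each task declares the tools that must be called before an answer counts as verified. The `get_price` tool name and the helper are hypothetical:

```python
def missing_verification(tool_calls: list[str], required: list[str]) -> list[str]:
    """Return required tools that never appear in the trajectory's tool calls."""
    made = set(tool_calls)
    return [tool for tool in required if tool not in made]


calls = ["search_flights"]  # agent answered without checking the price
print(missing_verification(calls, ["search_flights", "get_price"]))  # ['get_price']
```

A non-empty result marks the run as unverified even when the final answer matches the expected output.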
Research Context
Recent work validates these patterns:
- WindowsWorld: Evaluation framework for GUI agents shows trajectory analysis catches 40% more failures than output-only metrics
- D3-Gym: Benchmark for decision-making agents demonstrates LLM-as-Judge correlates 0.89 with human ratings when rubrics are explicit
- CARE framework: Combines correctness, alignment, reasoning quality, and efficiency metrics, all requiring trajectory access
The trend is clear: output-only evaluation is insufficient for production agent systems.
Technical Verdict
Use LLM-as-Judge when:
- You need to evaluate subjective qualities (tone, helpfulness, formatting)
- Your output space is too large for exhaustive golden sets
- You can afford 1-5 second eval latency per test case
- You have budget for judge model API calls ($0.01-0.10 per evaluation)
Avoid LLM-as-Judge when:
- You need sub-100ms evaluation latency (use rule-based checks instead)
- Your evaluation criteria are purely deterministic
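For deterministic criteria, a rule-based check can be as small as a few regular expressions. This sketch covers the three required field formats from the flight booking rubric earlier in the article (AA1234, HH:MM AM/PM, $XXX). The exact patterns are assumptions about what those formats allow:

```python
import re

# Patterns approximate the formats named in the flight booking rubric.
FLIGHT_RE = re.compile(r"\b[A-Z]{2}\d{1,4}\b")
TIME_RE = re.compile(r"\b(0?[1-9]|1[0-2]):[0-5]\d ?(AM|PM)\b")
PRICE_RE = re.compile(r"\$\d+(\.\d{2})?")


def rule_based_check(text: str) -> dict[str, bool]:
    """Report which required fields appear in the expected format."""
    return {
        "flight_number": bool(FLIGHT_RE.search(text)),
        "departure_time": bool(TIME_RE.search(text)),
        "price": bool(PRICE_RE.search(text)),
    }


print(rule_based_check("BA117 at 7PM ($450)"))
# {'flight_number': True, 'departure_time': False, 'price': True}
```

The check runs in microseconds and flags the "7PM" time format without a model call, leaving the judge for criteria that need interpretation.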