AWS has released Agent-EvalKit as open source (Apache 2.0), built around a specific claim: most teams evaluate agents the way they evaluate functions, by checking outputs. That works until your agent fabricates facts from empty tool results or skips verification steps while still producing a plausible answer. Agent-EvalKit addresses this by splitting evaluation into six phases that trace execution paths, not just final responses.
The toolkit integrates with Claude Code, Kiro CLI, and Kilo Code. It reads your agent source, generates test cases, runs evaluations, and produces reports that reference specific code locations. The reference implementation uses Strands Agents SDK and Amazon Bedrock, but the architecture is tool-agnostic.
The Six-Phase Architecture
Agent-EvalKit separates test definition from execution through a phase model that isolates concerns:
Phase 1: Task Definition
You describe evaluation goals in natural language. The toolkit parses your agent’s source code and generates test cases with ground truth outcomes. This is not prompt engineering. The system needs to understand your agent’s tool inventory, state transitions, and decision boundaries to create meaningful tests.
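To make "test cases with ground truth outcomes" concrete, here is a minimal sketch of what one generated case might carry. The `EvalTestCase` schema and the `uncovered_tools` helper are hypothetical names chosen for illustration, not Agent-EvalKit's actual format.

```python
from dataclasses import dataclass, field


@dataclass
class EvalTestCase:
    """One generated test case (hypothetical schema, for illustration only)."""
    case_id: str
    user_input: str
    expected_output: str  # ground-truth outcome
    expected_tools: list[str] = field(default_factory=list)  # tools the agent should call


def uncovered_tools(cases: list[EvalTestCase], tool_inventory: list[str]) -> list[str]:
    """Return tools from the agent's inventory that no test case exercises."""
    covered = {tool for case in cases for tool in case.expected_tools}
    return sorted(set(tool_inventory) - covered)


cases = [
    EvalTestCase(
        case_id="trip-001",
        user_input="Find a flight and hotel for Lisbon, May 3-7",
        expected_output="Itinerary with one flight and one hotel for May 3-7",
        expected_tools=["search_flights", "search_hotels"],
    ),
]
gaps = uncovered_tools(cases, ["search_flights", "search_hotels", "verify_dates"])
print(gaps)
```

Checking coverage against the tool inventory is one way to see whether generated tests reflect the agent's real decision surface or only its happy path.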
Phase 2: Setup
Provision test environments, seed databases, configure API mocks. This phase handles the infrastructure that most teams skip until production failures force them to retrofit it. Setup artifacts are versioned alongside test definitions.
Phase 3: Execution
Run the agent against test cases while capturing tool calls, intermediate state, and decision points. This is where observability instrumentation matters. You need structured logs that correlate tool invocations with state changes, not just timestamped text blobs.
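As a rough illustration of structured capture, the sketch below wraps tool functions so that every invocation produces a record tied to a run ID and a sequence number. `ToolCallRecorder` and its field names are assumptions made for this example; the toolkit's own instrumentation may look quite different.

```python
import functools
import json
import time


class ToolCallRecorder:
    """Collects one structured record per tool invocation (illustrative only)."""

    def __init__(self, run_id):
        self.run_id = run_id
        self.records = []

    def wrap(self, tool_name, fn):
        """Return a version of fn that logs its keyword arguments, result, and errors."""
        @functools.wraps(fn)
        def wrapper(**kwargs):
            record = {"run_id": self.run_id, "tool": tool_name,
                      "params": kwargs, "result": None, "error": None}
            started = time.perf_counter()
            try:
                record["result"] = fn(**kwargs)
                return record["result"]
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so failed calls are recorded too
                record["duration_s"] = time.perf_counter() - started
                record["seq"] = len(self.records)
                self.records.append(record)
        return wrapper

    def to_jsonl(self):
        """Serialize records as JSON Lines, one tool call per line."""
        return "\n".join(json.dumps(r, default=str) for r in self.records)


recorder = ToolCallRecorder(run_id="run-1")
search_flights = recorder.wrap(
    "search_flights", lambda origin, dest: [{"flight": "X1"}]
)
search_flights(origin="SEA", dest="JFK")
```

Because each record carries `run_id` and `seq`, later phases can join tool calls to state changes instead of grepping free-form log text.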
Phase 4: Observation
Collect execution traces: which tools fired, what data they returned, how the agent transformed that data into the next action. Observation is passive. It does not interfere with execution but captures enough detail to reconstruct the decision graph later.
Phase 5: Verification
Compare observed behavior against expected outcomes. This is where non-binary success criteria surface. Did the agent call the right tools? Did it use the data those tools returned, or did it hallucinate? Did it follow the verification steps your process requires, even if the final answer looks correct?
Phase 6: Teardown
Clean up test environments, archive logs, reset state. Teardown failures are evaluation failures. If your test leaves residue that affects the next run, your results are not reproducible.
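One way to keep setup and teardown paired is a context manager whose cleanup is allowed to raise, so a teardown failure fails the evaluation instead of being swallowed. This is a generic sketch using a temporary directory; `isolated_environment` is a hypothetical helper, not part of the toolkit.

```python
import contextlib
import shutil
import tempfile
from pathlib import Path


@contextlib.contextmanager
def isolated_environment(seed_files):
    """Create a scratch directory seeded with fixture files, then remove it."""
    workdir = Path(tempfile.mkdtemp(prefix="agent-eval-"))
    try:
        for name, content in seed_files.items():
            (workdir / name).write_text(content)
        yield workdir
    finally:
        # No ignore_errors: if cleanup fails, the exception propagates
        # and the run is reported as failed rather than silently dirty.
        shutil.rmtree(workdir)


with isolated_environment({"flights.json": "[]"}) as workdir:
    seeded = (workdir / "flights.json").read_text()
    kept_path = workdir
```

After the `with` block exits, the directory is gone, so the next test case starts from a clean state.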
Why Phase Isolation Matters
Most agent testing collapses these phases into a single script: spin up environment, run agent, check output, clean up. That works for demos. It breaks when you need to debug why an agent chose tool A over tool B, or why it ignored the data tool C returned.
Phase isolation gives you three things:
- Reproducibility: You can re-run phase 3 (execution) without re-running phase 2 (setup) if your environment state is stable. You can re-run phase 5 (verification) with different success criteria without re-executing the agent.
- Parallelization: Setup and teardown can run concurrently for independent test cases. Observation and verification can happen asynchronously after execution completes.
- Debugging surface: When a test fails, you know which phase failed. Setup failure means infrastructure problems. Execution failure means agent crashes. Verification failure means behavior drift.
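The reproducibility point can be sketched as follows: persist the trace once, then apply different success criteria to the stored copy without running the agent again. The trace layout (`tool_calls` entries with a `tool` key) and the helper names are assumptions made for this example.

```python
import json
from pathlib import Path


def save_trace(trace: dict, path: Path) -> None:
    """Persist an execution trace produced by the execution phase."""
    path.write_text(json.dumps(trace, indent=2))


def load_trace(path: Path) -> dict:
    """Reload a stored trace for verification."""
    return json.loads(path.read_text())


def called(tool_name):
    """Criterion: the trace contains at least one call to tool_name."""
    return lambda trace: any(c["tool"] == tool_name for c in trace["tool_calls"])


def at_most_calls(limit):
    """Criterion: the agent made no more than limit tool calls."""
    return lambda trace: len(trace["tool_calls"]) <= limit


def verify(trace, criteria):
    """Apply named criteria to a stored trace; the agent is not re-executed."""
    return {name: check(trace) for name, check in criteria.items()}
```

Calling `verify(load_trace(path), {...})` a second time with stricter criteria costs nothing in model inference, since only the stored trace is read.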
Integration Points and State Management
Agent-EvalKit integrates with Strands Agents SDK and Amazon Bedrock through a plugin architecture. The Strands integration is instructive because it exposes how multi-turn evaluations handle state.
Strands agents maintain conversation history, tool call logs, and intermediate reasoning traces. Agent-EvalKit hooks into these state stores during phase 4 (observation) to capture:
- Tool invocation sequence and timing
- Input parameters and return values for each tool
- State transitions between turns
- Reasoning traces (if the agent exposes them)
This is not black-box testing. You need white-box access to the agent’s internal state to verify that it used tool outputs correctly. If your agent framework does not expose this state, you cannot run meaningful evaluations.
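A trace that holds the four items listed above might be modeled like this. The class and field names are hypothetical and do not reflect the Strands SDK's or Agent-EvalKit's actual data model; the point is only to show what "white-box access" needs to capture.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    tool_name: str
    params: dict[str, Any]
    result: Any
    duration_s: float


@dataclass
class Turn:
    index: int
    tool_calls: list[ToolCall] = field(default_factory=list)
    state_after: dict[str, Any] = field(default_factory=dict)  # state snapshot after the turn
    reasoning: Optional[str] = None  # only if the agent exposes it


@dataclass
class ExecutionTrace:
    turns: list[Turn] = field(default_factory=list)

    def tool_sequence(self) -> list[str]:
        """Tool names in invocation order across all turns."""
        return [call.tool_name for turn in self.turns for call in turn.tool_calls]

    def state_changes(self) -> list[dict[str, Any]]:
        """For each turn, the state keys that were added or changed."""
        changes, previous = [], {}
        for turn in self.turns:
            changed = {k: v for k, v in turn.state_after.items()
                       if k not in previous or previous[k] != v}
            changes.append(changed)
            previous = turn.state_after
        return changes
```

With this shape, a verifier can ask both "which tools ran, in what order" and "what did each turn change", which is what output-only checks cannot see.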
The Bedrock integration handles model invocations. Agent-EvalKit captures prompt construction, model responses, and token usage. This matters for cost analysis and performance profiling, but also for debugging hallucinations. If the model response contradicts tool outputs, you need both artifacts to diagnose the failure.
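For the token-usage side, a small aggregation sketch follows. It assumes the captured model responses keep the `usage` block returned by Bedrock's Converse API (`inputTokens`, `outputTokens`); how Agent-EvalKit actually stores these responses is not specified here, and `summarize_usage` is a hypothetical helper.

```python
def summarize_usage(model_responses):
    """Sum token usage over captured responses shaped like Converse API output."""
    totals = {"calls": len(model_responses), "inputTokens": 0, "outputTokens": 0}
    for response in model_responses:
        usage = response.get("usage", {})  # tolerate responses without usage data
        totals["inputTokens"] += usage.get("inputTokens", 0)
        totals["outputTokens"] += usage.get("outputTokens", 0)
    return totals


captured = [
    {"usage": {"inputTokens": 1200, "outputTokens": 150}},
    {"usage": {"inputTokens": 1800, "outputTokens": 90}},
]
totals = summarize_usage(captured)
```

Keeping the raw responses next to the tool outputs for the same turn is what makes it possible to tell whether a wrong answer came from the model or from the data it was given.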
Verification When Success Is Not Binary
Phase 5 (verification) is where most homegrown eval systems fall apart. Checking that an agent returned the right answer is easy. Checking that it followed the right process is hard.
Agent-EvalKit supports three verification modes:
Exact match: Output equals expected value. This works for deterministic tasks like database queries or arithmetic.
Semantic equivalence: Output conveys the same information as the expected value, even if phrasing differs. This requires an LLM judge, which introduces latency and cost.
Process compliance: Agent followed required steps, regardless of output quality. This checks tool call sequences, not final responses.
Most real-world evaluations need all three. You want exact values where the task is deterministic (exact match), the same meaning where phrasing can vary (semantic equivalence), and an answer derived through the correct process (process compliance).
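The first two modes can be sketched as below. The judge is injected as a plain callable so an LLM-backed judge can be swapped in; `token_overlap_judge` is a crude stand-in used only to keep the example runnable offline, and none of these names come from the toolkit.

```python
from typing import Callable

Judge = Callable[[str, str], bool]


def exact_match(actual: str, expected: str) -> bool:
    """Deterministic check: strings are equal after trimming whitespace."""
    return actual.strip() == expected.strip()


def semantic_equivalence(actual: str, expected: str, judge: Judge) -> bool:
    """Ask the judge only when the cheap exact check fails."""
    if exact_match(actual, expected):
        return True
    return judge(actual, expected)


def token_overlap_judge(actual: str, expected: str, threshold: float = 0.8) -> bool:
    """Placeholder judge: Jaccard overlap of lowercased words, NOT a real LLM judge."""
    a, b = set(actual.lower().split()), set(expected.lower().split())
    if not a and not b:
        return True
    return len(a & b) / len(a | b) >= threshold


same_meaning = semantic_equivalence(
    "The flight departs at 9AM", "the flight departs at 9am", token_overlap_judge
)
```

Short-circuiting on exact match keeps judge calls, and their latency and cost, limited to the cases that actually need them.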
Here is what process compliance looks like in practice:
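```python
# Process compliance verification
from dataclasses import dataclass
from typing import Optional


# Minimal result types so the example runs; illustrative, not a published API.
@dataclass
class VerificationFailure:
    phase: str
    recommendation: str
    missing_step: Optional[str] = None
    reason: Optional[str] = None
    issue: Optional[str] = None


@dataclass
class VerificationSuccess:
    phase: str = "process_compliance"


def verify_travel_research_process(execution_trace):
    # Tools the agent must call, and why each one is required
    required_steps = [
        ("search_flights", "must check flight availability"),
        ("search_hotels", "must check accommodation options"),
        ("verify_dates", "must confirm date consistency"),
    ]
    called_tools = [call.tool_name for call in execution_trace.tool_calls]

    for tool, reason in required_steps:
        if tool not in called_tools:
            return VerificationFailure(
                phase="process_compliance",
                missing_step=tool,
                reason=reason,
                recommendation=f"Add {tool} call before final response",
            )

    # Check that verification happened after data collection.
    # Compare the last occurrence of each tool so repeated calls are handled.
    def last_index(tool):
        return len(called_tools) - 1 - called_tools[::-1].index(tool)

    verify_index = last_index("verify_dates")
    search_indices = [last_index("search_flights"), last_index("search_hotels")]
    if verify_index < max(search_indices):
        return VerificationFailure(
            phase="process_compliance",
            issue="premature_verification",
            recommendation="Move verify_dates call after all searches complete",
        )
    return VerificationSuccess()
```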
This code checks that the agent called required tools and called them in the right order. It does not care whether the final response was good. It cares whether the process was sound.
Failure Modes and Observability
Agent-EvalKit exposes three failure categories that output-only testing misses:
Tool misuse: Agent called the wrong tool, or the right tool with wrong parameters. This shows up in phase 4 (observation) as unexpected tool calls or malformed inputs.
Data fabrication: Agent ignored tool outputs and generated plausible-sounding responses from nothing. This requires comparing the final response against captured tool outputs in phase 5 (verification).
Process skipping: Agent reached the correct conclusion but skipped verification steps. This is the hardest failure to catch because the output looks fine. You need process compliance checks to surface it.
The toolkit generates reports that map failures back to source code locations. If verification detects data fabrication, the report shows which tool call returned empty results and which part of the agent code should have handled that case.
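A data fabrication check can be approximated with a simple grounding heuristic: flag tool calls that returned nothing, and flag numbers in the final response that appear in no tool output. This is a deliberately naive sketch with hypothetical helper names and trace layout; it is not how Agent-EvalKit implements the check.

```python
import json
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def empty_tool_results(tool_calls):
    """Names of tools whose result was empty (None, '', [], or {})."""
    return [c["tool"] for c in tool_calls if c["result"] in (None, "", [], {})]


def ungrounded_numbers(final_response, tool_calls):
    """Numbers in the response that do not appear in any captured tool output."""
    evidence = " ".join(json.dumps(c["result"], default=str) for c in tool_calls)
    evidence_numbers = set(NUMBER.findall(evidence))
    return [n for n in NUMBER.findall(final_response) if n not in evidence_numbers]


# The search returned nothing, yet the response quotes a flight number and a price.
calls = [{"tool": "search_flights", "result": []}]
response = "Flight UA212 is available for $340."
suspect_tools = empty_tool_results(calls)
suspect_values = ungrounded_numbers(response, calls)
```

A non-empty `suspect_tools` combined with non-empty `suspect_values` is a strong hint that the agent invented details instead of reporting that the search came back empty.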
Deployment Shape and Cost Model
Agent-EvalKit runs in your development environment, not as a hosted service. You install it via pip, point it at your agent source, and run evaluations locally or in CI/CD pipelines.
The cost model has three components:
- LLM calls for test generation: Agent-EvalKit uses an LLM to parse your source code and generate test cases. This is a one-time cost per agent version.
- Agent execution costs: Running your agent against test cases incurs the same costs as production usage (model inference, tool API calls, etc.).
- LLM judge calls for semantic verification: If you use semantic equivalence checks, each verification requires an LLM call. This adds 10-30% to total evaluation cost depending on test suite size.
For a typical agent with 50 test cases, expect $5-20 per full evaluation run, depending on model choice and tool complexity.
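The three components combine into a back-of-the-envelope estimate. The per-case execution cost and generation cost below are made-up inputs for illustration; only the 10-30% judge overhead range and the 50-case suite size come from the text above.

```python
def estimate_run_cost(num_cases, execution_cost_per_case,
                      generation_cost=0.0, judge_overhead=0.0):
    """Estimate the dollar cost of one full evaluation run.

    generation_cost: one-time test generation cost (0.0 when reusing existing tests)
    judge_overhead:  fraction added by LLM-judge calls, e.g. 0.10 to 0.30
    """
    if not 0.0 <= judge_overhead <= 1.0:
        raise ValueError("judge_overhead must be a fraction between 0 and 1")
    base = generation_cost + num_cases * execution_cost_per_case
    return base * (1 + judge_overhead)


# Hypothetical inputs: 50 cases at $0.15 each, $1.00 generation, 20% judge overhead
total = estimate_run_cost(50, 0.15, generation_cost=1.00, judge_overhead=0.20)
print(f"${total:.2f}")  # prints $10.20
```

Setting `generation_cost=0.0` models a re-run against an unchanged agent version, where test generation has already been paid for.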
Trade-Offs and Alternatives
| Approach | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Agent-EvalKit | Phase isolation, process compliance checks, source code integration | Requires white-box agent access, LLM judge costs | Teams building custom agents with complex tool chains |
| LangSmith | Hosted service, built-in tracing, multi-agent support | Hosted dependency, process verification requires custom evaluators | Teams already building on LangChain/LangGraph |
| Manual testing | Zero infrastructure cost, full control | Not reproducible, does not scale, misses edge cases | Prototypes and demos |
| Unit tests only | Fast, deterministic, cheap | Cannot verify multi-step behavior or tool interactions | Stateless functions and single-tool agents |
Agent-EvalKit sits between unit tests (too narrow) and manual testing (too slow). It gives you reproducible, process-aware evaluation without requiring a hosted service or framework lock-in.
Technical Verdict
Use Agent-EvalKit when:
- Your agent makes multi-step decisions across multiple tools
- You need to verify process compliance, not just output correctness
- You have white-box access to agent state and tool calls
- You run evaluations in CI/CD and need reproducible results
Avoid it when:
- Your agent is a single LLM call with no tools (use unit tests)
- You cannot instrument your agent to expose internal state
- You need black-box testing of third-party agents
- Your team lacks the bandwidth to maintain eval infrastructure
The six-phase architecture is the real contribution here. Phase isolation makes debugging tractable and results reproducible. Process compliance checks catch the failures that output-only testing misses. If you are deploying agents that make consequential decisions, you need something like this. Agent-EvalKit gives you a working implementation instead of forcing you to build it from scratch.