mech.app

Agent-EvalKit: AWS's Six-Phase Testing Harness for Multi-Step AI Workflows

How Agent-EvalKit structures evaluation across six distinct phases and what this reveals about the gap between unit tests and end-to-end agent validation.

Source: aws.amazon.com
AWS just released Agent-EvalKit as open source (Apache 2.0) with a specific claim: most teams evaluate agents the way they evaluate functions, by checking outputs. That works until your agent fabricates facts from empty tool results or skips verification steps while still producing a plausible answer. Agent-EvalKit addresses this by splitting evaluation into six phases that trace execution paths, not just final responses.

The toolkit integrates with Claude Code, Kiro CLI, and Kilo Code. It reads your agent source, generates test cases, runs evaluations, and produces reports that reference specific code locations. The reference implementation uses Strands Agents SDK and Amazon Bedrock, but the architecture is tool-agnostic.

The Six-Phase Architecture

Agent-EvalKit separates test definition from execution through a phase model that isolates concerns:

Phase 1: Task Definition
You describe evaluation goals in natural language. The toolkit parses your agent’s source code and generates test cases with ground truth outcomes. This is not prompt engineering. The system needs to understand your agent’s tool inventory, state transitions, and decision boundaries to create meaningful tests.

Phase 2: Setup
Provision test environments, seed databases, configure API mocks. This phase handles the infrastructure that most teams skip until production failures force them to retrofit it. Setup artifacts are versioned alongside test definitions.

Phase 3: Execution
Run the agent against test cases while capturing tool calls, intermediate state, and decision points. This is where observability instrumentation matters. You need structured logs that correlate tool invocations with state changes, not just timestamped text blobs.

Phase 4: Observation
Collect execution traces: which tools fired, what data they returned, how the agent transformed that data into the next action. Observation is passive. It does not interfere with execution but captures enough detail to reconstruct the decision graph later.

Phase 5: Verification
Compare observed behavior against expected outcomes. This is where non-binary success criteria surface. Did the agent call the right tools? Did it use the data those tools returned, or did it hallucinate? Did it follow the verification steps your process requires, even if the final answer looks correct?

Phase 6: Teardown
Clean up test environments, archive logs, reset state. Teardown failures are evaluation failures. If your test leaves residue that affects the next run, your results are not reproducible.

Why Phase Isolation Matters

Most agent testing collapses these phases into a single script: spin up environment, run agent, check output, clean up. That works for demos. It breaks when you need to debug why an agent chose tool A over tool B, or why it ignored the data tool C returned.

Phase isolation gives you three things:

  1. Reproducibility: You can re-run phase 3 (execution) without re-running phase 2 (setup) if your environment state is stable. You can re-run phase 5 (verification) with different success criteria without re-executing the agent.

  2. Parallelization: Setup and teardown can run concurrently for independent test cases. Observation and verification can happen asynchronously after execution completes.

  3. Debugging surface: When a test fails, you know which phase failed. Setup failure means infrastructure problems. Execution failure means agent crashes. Verification failure means behavior drift.
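
Re-running verification without re-executing the agent only works if execution persists its trace. A minimal sketch, assuming traces are stored as JSON files; the file layout and the `save_trace` and `reverify` helpers are hypothetical and do not reflect the toolkit's storage format.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

Check = Callable[[List[str]], bool]


def save_trace(path: Path, tool_calls: List[Dict]) -> None:
    """Persist the tool calls captured during execution."""
    path.write_text(json.dumps({"tool_calls": tool_calls}))


def reverify(path: Path, checks: Dict[str, Check]) -> Dict[str, bool]:
    """Apply a set of named checks to a stored trace; the agent is not re-run."""
    trace = json.loads(path.read_text())
    tools = [call["tool_name"] for call in trace["tool_calls"]]
    return {name: check(tools) for name, check in checks.items()}


with tempfile.TemporaryDirectory() as tmp:
    trace_path = Path(tmp) / "trace.json"
    save_trace(
        trace_path,
        [{"tool_name": "search_flights"}, {"tool_name": "search_hotels"}],
    )

    # First pass: lenient criteria.
    lenient = reverify(
        trace_path,
        {"searched_flights": lambda tools: "search_flights" in tools},
    )

    # Second pass: stricter criteria against the same stored trace.
    strict = reverify(
        trace_path,
        {
            "searched_flights": lambda tools: "search_flights" in tools,
            "verified_dates": lambda tools: "verify_dates" in tools,
        },
    )

print(lenient)
print(strict)
```

Tightening the criteria costs only a second pass over the stored file, with no model or tool API calls.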

Integration Points and State Management

Agent-EvalKit integrates with Strands Agents SDK and Amazon Bedrock through a plugin architecture. The Strands integration is instructive because it exposes how multi-turn evaluations handle state.

Strands agents maintain conversation history, tool call logs, and intermediate reasoning traces. Agent-EvalKit hooks into these state stores during phase 4 (observation) to capture:

  • Tool invocation sequence and timing
  • Input parameters and return values for each tool
  • State transitions between turns
  • Reasoning traces (if the agent exposes them)

This is not black-box testing. You need white-box access to the agent’s internal state to verify that it used tool outputs correctly. If your agent framework does not expose this state, you cannot run meaningful evaluations.
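
One way to picture the captured state is as a list of structured tool-call records. The sketch below is an assumption about what such a record could contain, based on the list above; `ToolCall`, `ExecutionTrace`, and `observe` are illustrative names, not Strands or Agent-EvalKit types.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ToolCall:
    tool_name: str
    params: Dict[str, Any]
    result: Any
    started_at: float   # monotonic clock reading at call start
    duration_s: float
    turn: int           # conversation turn in which the call happened


@dataclass
class ExecutionTrace:
    tool_calls: List[ToolCall] = field(default_factory=list)
    reasoning: List[str] = field(default_factory=list)  # empty if not exposed

    def sequence(self) -> List[str]:
        """Tool names in invocation order."""
        return [call.tool_name for call in self.tool_calls]


def observe(trace: ExecutionTrace, turn: int, tool_name: str,
            fn: Callable[..., Any], **params: Any) -> Any:
    """Call a tool, record the call, and return its result unchanged."""
    start = time.monotonic()
    result = fn(**params)
    trace.tool_calls.append(
        ToolCall(tool_name, params, result, start, time.monotonic() - start, turn)
    )
    return result


# Hypothetical tools standing in for real integrations.
def search_flights(origin: str, dest: str) -> list:
    return []  # empty result: the case where fabrication tends to appear


def search_hotels(city: str) -> list:
    return [{"name": "Harbor Inn", "price": 180}]


trace = ExecutionTrace()
observe(trace, 1, "search_flights", search_flights, origin="SEA", dest="JFK")
observe(trace, 1, "search_hotels", search_hotels, city="New York")
print(trace.sequence())
```

The wrapper passes results through untouched, which matches the passive role of observation: it records what happened without changing it. This sketch does not record calls that raise; a fuller version would capture the exception as well.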

The Bedrock integration handles model invocations. Agent-EvalKit captures prompt construction, model responses, and token usage. This matters for cost analysis and performance profiling, but also for debugging hallucinations. If the model response contradicts tool outputs, you need both artifacts to diagnose the failure.

Verification When Success Is Not Binary

Phase 5 (verification) is where most homegrown eval systems fall apart. Checking that an agent returned the right answer is easy. Checking that it followed the right process is hard.

Agent-EvalKit supports three verification modes:

Exact match: Output equals expected value. This works for deterministic tasks like database queries or arithmetic.

Semantic equivalence: Output conveys the same information as the expected value, even if phrasing differs. This requires an LLM judge, which introduces latency and cost.

Process compliance: Agent followed required steps, regardless of output quality. This checks tool call sequences, not final responses.

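Most real-world evaluations need all three, applied to different parts of the result. Use exact match for deterministic fields such as IDs, dates, and totals; semantic equivalence for free-text answers where phrasing can vary; and process compliance to confirm the answer was derived through the required steps.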
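
The first two modes can be sketched with a pluggable judge, so the expensive comparison only runs when the cheap one fails. Everything here is a hedged illustration: the function names are hypothetical, and `token_overlap_judge` is a crude stand-in used only to make the example runnable. Token overlap is not semantic equivalence; a real setup would call a model in its place.

```python
from typing import Callable

# A judge takes (actual, expected) and returns True if they are equivalent.
Judge = Callable[[str, str], bool]


def exact_match(actual: str, expected: str) -> bool:
    """Strict equality, ignoring surrounding whitespace."""
    return actual.strip() == expected.strip()


def semantic_match(actual: str, expected: str, judge: Judge) -> bool:
    """Try the cheap check first; only fall back to the judge if it fails."""
    if exact_match(actual, expected):
        return True
    return judge(actual, expected)


def token_overlap_judge(actual: str, expected: str, threshold: float = 0.6) -> bool:
    """Stand-in for an LLM judge: Jaccard overlap of lowercase tokens."""
    a, e = set(actual.lower().split()), set(expected.lower().split())
    if not a or not e:
        return False
    return len(a & e) / len(a | e) >= threshold


expected = "Flight departs Seattle at 9:00 AM"
paraphrase = "The flight departs at 9:00 AM from Seattle"
unrelated = "No flights found"

print(exact_match(paraphrase, expected))
print(semantic_match(paraphrase, expected, token_overlap_judge))
print(semantic_match(unrelated, expected, token_overlap_judge))
```

Passing the judge as a parameter keeps the verification logic testable offline and makes the latency and cost of the judge call an explicit, swappable dependency.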

Here is what process compliance looks like in practice:

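# Process compliance verification
from dataclasses import dataclass
from typing import Optional


# Illustrative result types; the names are stand-ins, not toolkit classes.
@dataclass
class VerificationFailure:
    phase: str
    missing_step: Optional[str] = None
    issue: Optional[str] = None
    reason: Optional[str] = None
    recommendation: Optional[str] = None


@dataclass
class VerificationSuccess:
    phase: str = "process_compliance"


def verify_travel_research_process(execution_trace):
    """Check that required tools were called and that verification came last."""
    required_steps = [
        ("search_flights", "must check flight availability"),
        ("search_hotels", "must check accommodation options"),
        ("verify_dates", "must confirm date consistency"),
    ]

    called_tools = [call.tool_name for call in execution_trace.tool_calls]

    # Every required tool must appear at least once in the trace.
    for tool, reason in required_steps:
        if tool not in called_tools:
            return VerificationFailure(
                phase="process_compliance",
                missing_step=tool,
                reason=reason,
                recommendation=f"Add {tool} call before final response",
            )

    # Compare LAST occurrences so repeated calls are handled correctly:
    # the final verify_dates must come after the final search of either kind.
    def last_index(tool):
        return len(called_tools) - 1 - called_tools[::-1].index(tool)

    verify_index = last_index("verify_dates")
    last_search = max(last_index("search_flights"), last_index("search_hotels"))

    if verify_index < last_search:
        return VerificationFailure(
            phase="process_compliance",
            issue="premature_verification",
            recommendation="Move verify_dates call after all searches complete",
        )

    return VerificationSuccess()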

This code checks that the agent called required tools and called them in the right order. It does not care whether the final response was good. It cares whether the process was sound.

Failure Modes and Observability

Agent-EvalKit exposes three failure categories that output-only testing misses:

Tool misuse: Agent called the wrong tool, or the right tool with wrong parameters. This shows up in phase 4 (observation) as unexpected tool calls or malformed inputs.

Data fabrication: Agent ignored tool outputs and generated plausible-sounding responses from nothing. This requires comparing the final response against captured tool outputs in phase 5 (verification).

Process skipping: Agent reached the correct conclusion but skipped verification steps. This is the hardest failure to catch because the output looks fine. You need process compliance checks to surface it.

The toolkit generates reports that map failures back to source code locations. If verification detects data fabrication, the report shows which tool call returned empty results and which part of the agent code should have handled that case.
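
A crude way to illustrate the data fabrication check is to flag numbers in the final response that appear in no captured tool output. This is a heuristic sketch under stated assumptions, not how Agent-EvalKit detects fabrication; `unsupported_numbers` is a hypothetical helper, and it will miss reformatted values (for example `329` versus `329.00`).

```python
import json
import re
from typing import Any, List, Set

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numbers_in(text: str) -> Set[str]:
    """All numeric substrings in the text, as strings."""
    return set(NUMBER.findall(text))


def unsupported_numbers(response: str, tool_results: List[Any]) -> Set[str]:
    """Numbers in the response that no tool result contains."""
    supported: Set[str] = set()
    for result in tool_results:
        supported |= numbers_in(json.dumps(result))
    return numbers_in(response) - supported


response = "I found a flight for $329 departing at 14:05."

# Case 1: the flight search returned nothing, yet the response cites specifics.
fabricated = unsupported_numbers(response, [[]])

# Case 2: the same response, but backed by an actual tool result.
grounded = unsupported_numbers(response, [[{"price": 329, "departs": "14:05"}]])

print(sorted(fabricated))
print(sorted(grounded))
```

A non-empty result in the first case is the signal worth reporting: the response contains specifics that no tool supplied.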

Deployment Shape and Cost Model

Agent-EvalKit runs in your development environment, not as a hosted service. You install it via pip, point it at your agent source, and run evaluations locally or in CI/CD pipelines.

The cost model has three components:

  1. LLM calls for test generation: Agent-EvalKit uses an LLM to parse your source code and generate test cases. This is a one-time cost per agent version.

  2. Agent execution costs: Running your agent against test cases incurs the same costs as production usage (model inference, tool API calls, etc.).

  3. LLM judge calls for semantic verification: If you use semantic equivalence checks, each verification requires an LLM call. This adds 10-30% to total evaluation cost depending on test suite size.

For a typical agent with 50 test cases, expect $5-20 per full evaluation run, depending on model choice and tool complexity.
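
The three components combine into a simple back-of-envelope estimate. The sketch below treats the judge cost as a fractional overhead on the other two components, which is one reading of the 10-30% figure above. The per-case and generation prices are made-up placeholders, not measured or published numbers.

```python
def estimate_run_cost(
    num_cases: int,
    exec_cost_per_case: float,
    generation_cost: float,
    judge_overhead: float,
    include_generation: bool = True,
) -> float:
    """Estimate the dollar cost of one evaluation run.

    judge_overhead is a fraction (0.10 means the LLM judge adds 10%).
    Set include_generation=False when test cases are reused across runs.
    """
    if not 0.0 <= judge_overhead <= 1.0:
        raise ValueError("judge_overhead must be between 0 and 1")
    base = num_cases * exec_cost_per_case
    if include_generation:
        base += generation_cost
    return round(base * (1.0 + judge_overhead), 2)


# Placeholder inputs: 50 cases at $0.15 each, $1.00 for test generation.
low = estimate_run_cost(50, 0.15, 1.00, judge_overhead=0.10)
high = estimate_run_cost(50, 0.15, 1.00, judge_overhead=0.30)
print(f"${low:.2f} to ${high:.2f}")
```

With these placeholder inputs the estimate lands inside the $5-20 range quoted above. Plugging in your own per-case execution cost is what matters, since that term dominates as the suite grows.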

Trade-Offs and Alternatives

| Approach | Strengths | Weaknesses | Best For |
| --- | --- | --- | --- |
| Agent-EvalKit | Phase isolation, process compliance checks, source code integration | Requires white-box agent access, LLM judge costs | Teams building custom agents with complex tool chains |
| LangSmith | Hosted service, built-in tracing, multi-agent support | Black-box only, limited process verification | Teams using LangChain/LangGraph exclusively |
| Manual testing | Zero infrastructure cost, full control | Not reproducible, does not scale, misses edge cases | Prototypes and demos |
| Unit tests only | Fast, deterministic, cheap | Cannot verify multi-step behavior or tool interactions | Stateless functions and single-tool agents |

Agent-EvalKit sits between unit tests (too narrow) and manual testing (too slow). It gives you reproducible, process-aware evaluation without requiring a hosted service or framework lock-in.

Technical Verdict

Use Agent-EvalKit when:

  • Your agent makes multi-step decisions across multiple tools
  • You need to verify process compliance, not just output correctness
  • You have white-box access to agent state and tool calls
  • You run evaluations in CI/CD and need reproducible results

Avoid it when:

  • Your agent is a single LLM call with no tools (use unit tests)
  • You cannot instrument your agent to expose internal state
  • You need black-box testing of third-party agents
  • Your team lacks the bandwidth to maintain eval infrastructure

The six-phase architecture is the real contribution here. Phase isolation makes debugging tractable and results reproducible. Process compliance checks catch the failures that output-only testing misses. If you are deploying agents that make consequential decisions, you need something like this. Agent-EvalKit gives you a working implementation instead of forcing you to build it from scratch.