mech.app
AI Agents

Cross-Framework Agent Evals: How One Evaluation Harness Tests 17+ Orchestration Platforms

Building evaluation infrastructure that intercepts tool calls, normalizes metrics, and runs identical test suites across LangChain, CrewAI, and custom a...

Source: dev.to
Cross-Framework Agent Evals: How One Evaluation Harness Tests 17+ Orchestration Platforms

Most teams run agents on three or more orchestration platforms simultaneously. LangGraph for stateful workflows, CrewAI for multi-agent coordination, raw OpenAI calls for simple tasks.

Each framework exposes telemetry differently. Each has its own execution model. When you need to evaluate reliability across all of them, you face an infrastructure problem: how do you intercept tool calls, normalize metrics, and run identical test suites without rewriting adapters for every new framework?

Custom Evals is an open-source evaluation harness that claims broad framework support (including LangChain, LlamaIndex, CrewAI, AutoGPT, and custom orchestrators) through a unified interface. The key innovation is the adapter layer that makes identical test suites work across synchronous loops, async pipelines, and streaming executors.

The Adapter Problem

Agent frameworks differ in three critical dimensions:

Execution model. LangGraph uses async graph traversal. AutoGPT runs imperative loops. LlamaIndex streams chunks. Your eval harness needs to hook into all three without blocking, dropping events, or requiring framework-specific test code.

Telemetry surface. LangChain exposes callbacks. CrewAI logs to stdout. Custom agents might emit nothing. You need a fallback strategy when native instrumentation does not exist.

State serialization. Some frameworks expose checkpoints or structured state objects. Others require you to reconstruct state from message history. Your harness must capture agent context mid-execution without framework-specific hooks.

The solution is a two-layer adapter pattern:

  1. Native adapters for frameworks with rich instrumentation (LangChain callbacks, LlamaIndex event handlers).
  2. Proxy adapters that wrap LLM client calls when native hooks are unavailable.

Here’s the basic evaluation pattern from the source article:

from custom.evals import CoherenceEvaluator
from custom.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
evaluator = CoherenceEvaluator(llm)

score = evaluator.evaluate({
    "input": "What is AI?",
    "output": "AI is artificial intelligence, enabling machines to perform intelligent tasks."
})

print(f"{score.label}: {score.explanation}")
# coherent: The response provides a clear, logical explanation...

The framework returns a structured Score object regardless of which orchestration platform generated the output. The adapter layer handles the translation from framework-specific telemetry to this unified interface.

Metric Normalization

Different frameworks report latency, token usage, and tool success rates in incompatible formats. LangChain returns structured LLMResult objects. AutoGPT logs JSON. Custom agents might return plain strings.

Custom Evals normalizes metrics through a Score interface:

# Score interface from source article
score.label      # "coherent", "hallucinated", etc.
score.explanation  # Human-readable justification

Every evaluator returns this shape, regardless of the underlying framework. The source article demonstrates this with multiple evaluator types (CoherenceEvaluator, ToolAccuracyEvaluator) that all return the same interface.

The challenge is extracting comparable metrics when frameworks expose different telemetry surfaces. Based on the framework architectures described in their respective documentation, the harness must handle these variations:

FrameworkExecution ModelNative TelemetryFallback Strategy
LangChainAsync callbacksRich (callbacks, run IDs)Native adapter
LlamaIndexEvent handlersModerate (event system)Native adapter
CrewAITask executionMinimal (stdout logs)Proxy wrapper
Raw OpenAIDirect API callsNoneProxy wrapper
CustomUser-definedVariableUser-provided trace

When native instrumentation exists, the harness uses framework-specific adapters. When it does not, proxy wrappers intercept LLM client calls to reconstruct execution traces.

State Capture Strategy

Evaluating multi-turn agents requires capturing intermediate state. You need to know what the agent “believed” at step 3 when it made a bad tool call at step 4.

The source article emphasizes that Custom Evals is “lightweight” and does not require “mandatory test runner” infrastructure. This suggests state capture happens through trace reconstruction rather than framework-specific checkpoint mechanisms.

The harness likely uses these strategies (inferred from the lightweight design philosophy):

  1. Message history reconstruction. Parse the conversation log to rebuild agent state at any turn.
  2. Tool call replay. Re-execute tool calls with recorded inputs to verify determinism.
  3. Memory snapshots. Serialize vector store queries and retrieval results at each step.

For frameworks without native state serialization, the harness must reconstruct state from observable outputs. This works for stateless agents but becomes challenging for agents with complex internal memory structures.

Evaluation Isolation

Some frameworks share global state. LangChain uses singleton LLM clients. CrewAI reads environment variables. If you run parallel evaluations, they can interfere with each other.

The source article describes Custom Evals as having “no required backend” and being installable via pip install -e ".[dev]". This library-first design means isolation must happen at the test suite level rather than through service boundaries.

Potential isolation strategies include:

  • Process-level sandboxing. Each test suite runs in a separate process with its own environment.
  • Client cloning. LLM clients are deep-copied per test to avoid shared connection pools.
  • Temporary vector stores. RAG evaluations use ephemeral instances that are destroyed after each test.

The trade-off is startup overhead. Spinning up a new process and vector store for every test adds latency. For teams running thousands of evals in CI, this matters.

Deployment Shape

Custom Evals is a library, not a service. You import it into your test suite. No backend to deploy, no dashboard to maintain.

The source article shows this installation pattern:

pip install -e ".[dev]"

Typical usage in CI would look like:

# .github/workflows/eval.yml
- name: Run agent evaluations
  run: |
    pip install custom-evals[dev]
    pytest tests/evals/ --framework=langchain
    pytest tests/evals/ --framework=crewai
    pytest tests/evals/ --framework=autogpt

Each --framework flag loads the corresponding adapter and runs the same test suite with different instrumentation underneath. The same test code executes three times, once per framework.

For teams that want centralized results, the harness would need pluggable reporters to push scores to external observability systems. The library design suggests you bring your own observability stack.

Failure Modes

Adapter lag. When a framework ships a breaking change, the adapter breaks. Framework APIs evolve faster than evaluation tooling. If you’re on the bleeding edge, expect adapter maintenance overhead.

LLM-as-judge variance. The source article shows evaluators using GPT-4 to score outputs (CoherenceEvaluator with gpt-4o-mini). This introduces non-determinism. The same agent response can score differently across runs. For pass/fail gates, you need deterministic evaluators (exact match, regex, tool call verification).

Memory overhead. Capturing full execution traces for long-running agents consumes RAM. The library design (no backend, in-process evaluation) means traces accumulate in memory. Multi-hour workflows will hit limits.

No streaming eval. The evaluation pattern shown in the source waits for complete outputs before scoring. You cannot evaluate partial outputs mid-stream. For latency-sensitive applications (chatbots), this means you cannot catch slow tool calls until the entire response completes.

Technical Verdict

Use Custom Evals when:

  • You run agents on multiple orchestration platforms and need identical test coverage.
  • You want evaluation infrastructure that does not dictate your deployment stack.
  • You need to evaluate coherence, tool call accuracy, or hallucination rates without framework lock-in.

Avoid it when:

  • You’re on a single framework with rich native eval tooling (LangSmith for LangChain, TruLens for LlamaIndex).
  • You need real-time streaming evaluation or sub-second feedback loops.
  • You’re evaluating agents with 100+ turn conversations (memory overhead becomes painful).

The adapter pattern works. The metric normalization is clean. The lack of a required backend is a feature, not a bug. The main risk is adapter maintenance lag when frameworks change. If you’re willing to contribute adapters or wait for upstream fixes, this is solid infrastructure for cross-framework agent testing.

Related Tools:

  • LangSmith: LangChain-native evaluation and tracing with built-in dataset management and prompt versioning. Use when you’re standardized on LangChain and want tight integration with their ecosystem.
  • TruLens: LlamaIndex-focused evaluation framework with feedback functions and guardrails. Best for RAG pipelines where you need retrieval quality metrics alongside LLM output scoring.

Framework Documentation:

  • LangGraph (Stateful agent orchestration)
  • CrewAI (Multi-agent coordination framework)
  • AutoGPT (Autonomous agent loops)

Tags

agentic-ai orchestration infrastructure

Primary Source

dev.to