Databricks + GPT-5.5: How Enterprise Agent Workflows Route Between Local and Frontier Models

Databricks now uses GPT-5.5 for enterprise agent workflows after the model set a new state of the art on the OfficeQA Pro benchmark. It is one of the more visible public examples of a data platform integrating a frontier model into existing agent orchestration. The interesting part is not the API call itself but the routing decision: when does an agent task warrant a frontier model call rather than local inference, and how do you manage state across that boundary?

The OfficeQA Pro Benchmark

OfficeQA Pro measures multi-step reasoning over structured enterprise data: spreadsheets, databases, and document stores. Tasks include conditional aggregations, cross-table joins, and natural language queries that require schema inference. The benchmark exposes whether a model can decompose a user request into executable steps without hallucinating column names or inventing data.

GPT-5.5’s performance matters because it validates the model’s ability to handle the kind of ambiguous, multi-hop queries that break simpler agents. If your workflow needs to answer “What was our highest revenue product in Q3 for customers who also bought service contracts?” you need a model that can plan the join, filter, and aggregation without losing context.
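
To make the decomposition concrete, here is a minimal sketch of the steps that question implies (join, filter, aggregate), run against tiny in-memory tables. The table layouts, column names, and figures are invented for illustration and are not taken from OfficeQA Pro.

```python
from collections import defaultdict

# Hypothetical rows; names and values are illustrative only.
orders = [
    {"customer_id": 1, "product": "Widget", "quarter": "Q3", "revenue": 1200.0},
    {"customer_id": 2, "product": "Gadget", "quarter": "Q3", "revenue": 5000.0},
    {"customer_id": 1, "product": "Gadget", "quarter": "Q3", "revenue": 300.0},
    {"customer_id": 3, "product": "Widget", "quarter": "Q2", "revenue": 9000.0},
    {"customer_id": 3, "product": "Widget", "quarter": "Q3", "revenue": 700.0},
]
service_contracts = [{"customer_id": 1}, {"customer_id": 3}]


def top_product(orders, contracts, quarter):
    # Step 1 (join key): customers who hold a service contract.
    contract_customers = {row["customer_id"] for row in contracts}

    # Step 2 (filter + semi-join): orders in the quarter from those customers.
    matching = [
        o for o in orders
        if o["quarter"] == quarter and o["customer_id"] in contract_customers
    ]

    # Step 3 (aggregate): sum revenue per product and take the largest.
    revenue_by_product = defaultdict(float)
    for o in matching:
        revenue_by_product[o["product"]] += o["revenue"]
    if not revenue_by_product:
        return None
    return max(revenue_by_product.items(), key=lambda kv: kv[1])


result = top_product(orders, service_contracts, "Q3")
```

In this made-up data, Gadget has the highest Q3 revenue overall, but most of it comes from a customer without a service contract. A plan that skips the join step returns the wrong product, which is the kind of error the benchmark is meant to surface.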

Routing Architecture

The core problem is cost and latency. Frontier models are expensive and slow relative to local models, while local models are fast and cheap but tend to fail on complex reasoning. A routing layer addresses this by deciding which model handles each step.

Decision Boundary

Databricks likely uses a classifier or rule-based router that evaluates:

Query complexity: Token count, nested clauses, number of tables referenced.
Schema ambiguity: Does the query reference columns that exist, or does it require inference?
Failure history: Has this query type failed with the local model before?
Cost threshold: Is the user on a plan that allows frontier model calls?
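
The first two signals can be computed before any model is called. The sketch below is one possible way to do that, assuming the query marks identifiers with backticks and that the schema is available as a mapping from table name to column names. Both conventions, and the names `RoutingFeatures` and `extract_features`, are assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class RoutingFeatures:
    token_count: int
    tables_referenced: int
    unknown_identifiers: list[str]


def extract_features(query: str, schema: dict[str, set[str]]) -> RoutingFeatures:
    """Compute router inputs from a query and a known schema.

    `schema` maps table name -> set of column names (hypothetical shape).
    Backtick-quoted identifiers that match neither a table nor a column
    are treated as a sign that the query needs schema inference.
    """
    tokens = query.split()  # whitespace tokens, a crude proxy for model tokens
    quoted = re.findall(r"`([^`]+)`", query)
    known_columns = set().union(*schema.values())
    tables = {name for name in quoted if name in schema}
    unknown = [n for n in quoted if n not in schema and n not in known_columns]
    return RoutingFeatures(len(tokens), len(tables), unknown)


schema = {"orders": {"customer_id", "revenue"}, "contracts": {"customer_id"}}
features = extract_features(
    "Sum `revenue` from `orders` joined to `contracts` by `client`", schema
)
```

Here `client` matches nothing in the schema, so `unknown_identifiers` is non-empty and a router could treat the query as ambiguous. Real queries rarely quote identifiers this neatly, so a production router would need a more robust extraction step.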

If the router sends the task to GPT-5.5, the agent must serialize its current state, make the API call, parse the response, and resume. If it sends the task to a local model (likely a fine-tuned Llama or Mistral variant), the agent executes in-process.

State Management

The agent workflow needs to survive a round trip to an external API. This requires:

Checkpoint before the call: Serialize the conversation history, intermediate results, and execution plan.
Idempotent API wrapper: If the call fails, retry with the same checkpoint. Do not re-execute prior steps.
Response validation: Parse the frontier model’s output and confirm it matches the expected schema before resuming.
Fallback path: If GPT-5.5 returns garbage or times out, route to a human or a simpler fallback model.

This is not trivial. Many agent frameworks assume synchronous, in-process execution. Adding an external call boundary means you need durable state storage (likely a database or object store) and a way to resume from arbitrary checkpoints.
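
As a rough illustration of the checkpoint and idempotent-wrapper requirements, the sketch below stores checkpoints in SQLite from the Python standard library and retries a call from the same saved state. SQLite is only a stand-in for whatever durable store you use, and `CheckpointStore`, `call_with_retry`, and the `call_model` callable are hypothetical names, not part of any Databricks or OpenAI API.

```python
import json
import sqlite3
import uuid


class CheckpointStore:
    """Checkpoint storage backed by SQLite; pass a file path for durability."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints "
            "(id TEXT PRIMARY KEY, state TEXT NOT NULL, response TEXT)"
        )

    def save(self, state: dict) -> str:
        checkpoint_id = str(uuid.uuid4())
        self.db.execute(
            "INSERT INTO checkpoints (id, state) VALUES (?, ?)",
            (checkpoint_id, json.dumps(state)),
        )
        self.db.commit()
        return checkpoint_id

    def load(self, checkpoint_id: str) -> dict:
        row = self.db.execute(
            "SELECT state FROM checkpoints WHERE id = ?", (checkpoint_id,)
        ).fetchone()
        if row is None:
            raise KeyError(checkpoint_id)
        return json.loads(row[0])

    def cached_response(self, checkpoint_id: str):
        row = self.db.execute(
            "SELECT response FROM checkpoints WHERE id = ?", (checkpoint_id,)
        ).fetchone()
        return json.loads(row[0]) if row and row[0] is not None else None

    def record_response(self, checkpoint_id: str, response: dict) -> None:
        self.db.execute(
            "UPDATE checkpoints SET response = ? WHERE id = ?",
            (json.dumps(response), checkpoint_id),
        )
        self.db.commit()


def call_with_retry(store, checkpoint_id, call_model, max_attempts=3):
    """Retry the external call from the same checkpoint.

    A response already recorded for this checkpoint is returned as-is, so a
    resumed worker never repeats a call that has already succeeded.
    """
    cached = store.cached_response(checkpoint_id)
    if cached is not None:
        return cached
    state = store.load(checkpoint_id)
    last_error = None
    for _ in range(max_attempts):
        try:
            response = call_model(state)
        except Exception as exc:  # broad on purpose for the sketch
            last_error = exc
            continue
        store.record_response(checkpoint_id, response)
        return response
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

The sketch retries immediately and catches every exception. A real wrapper would likely add backoff and retry only on errors known to be transient.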

Failure Modes

Failure Mode	Symptom	Mitigation
Routing thrash	Agent bounces between local and frontier models on the same query	Add hysteresis: once a query routes to GPT-5.5, keep it there for the session
State loss	API call succeeds but agent forgets prior context	Persist checkpoints to durable storage (a database or object store) before external calls; a durable queue (Kafka, SQS) can carry the resume message
Cost explosion	Every query hits the frontier model	Implement per-user or per-query budget caps with circuit breakers
Latency spike	Frontier model takes 10+ seconds, user abandons workflow	Set aggressive timeouts (3-5s) and fall back to local model with a warning
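
Two of the mitigations in the table, hysteresis and budget caps, can share one small piece of per-session state. The following is a sketch under simplifying assumptions: a flat estimated cost per frontier call, a budget tracked in whole cents, and a hypothetical `SessionRouter` class that wraps whatever base routing decision you already have.

```python
from dataclasses import dataclass


@dataclass
class SessionRouter:
    """Wraps a base routing decision with stickiness and a spend cap."""

    budget_cents: int          # hypothetical per-session cap
    cost_per_call_cents: int   # hypothetical flat estimate per frontier call
    spent_cents: int = 0
    sticky_frontier: bool = False

    def decide(self, wants_frontier: bool) -> str:
        # Hysteresis: once escalated, prefer the frontier tier for the session.
        prefer_frontier = wants_frontier or self.sticky_frontier

        # Circuit breaker: refuse frontier calls that would exceed the budget.
        over_budget = (
            self.spent_cents + self.cost_per_call_cents > self.budget_cents
        )
        if prefer_frontier and not over_budget:
            self.sticky_frontier = True
            self.spent_cents += self.cost_per_call_cents
            return "frontier"
        return "local"


router = SessionRouter(budget_cents=10, cost_per_call_cents=4)
decisions = [router.decide(w) for w in (False, True, False, True)]
```

In this run the third query stays on the frontier tier even though the base router would have sent it local, and the fourth is forced local because a third frontier call would exceed the cap. Whether the budget cap should override stickiness is a policy choice; this sketch lets the cap win.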

Implementation Sketch

Here’s a simplified routing layer in Python. This is not production code, but it shows the decision logic:

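import re
import time
import uuid
from dataclasses import dataclass
from enum import Enum

class ModelTier(Enum):
    LOCAL = "local"
    FRONTIER = "frontier"

@dataclass
class QueryContext:
    text: str
    table_count: int
    token_count: int
    user_tier: str
    failure_history: list[str]  # substrings of queries the local model failed on

# Whole-word match, so "like" does not fire on words such as "likely".
AMBIGUITY_HINT = re.compile(r"\b(similar to|like)\b", re.IGNORECASE)

def route_query(ctx: QueryContext) -> ModelTier:
    # Rule 1: Cheap queries stay local
    if ctx.token_count < 100 and ctx.table_count == 1:
        return ModelTier.LOCAL

    # Rule 2: Known failure patterns go to frontier
    if any(pattern in ctx.text for pattern in ctx.failure_history):
        return ModelTier.FRONTIER

    # Rule 3: Enterprise users get frontier by default for complex queries
    if ctx.user_tier == "enterprise" and ctx.table_count > 2:
        return ModelTier.FRONTIER

    # Rule 4: Schema ambiguity requires frontier reasoning (crude keyword heuristic)
    if AMBIGUITY_HINT.search(ctx.text):
        return ModelTier.FRONTIER

    return ModelTier.LOCAL

# In-memory stand-in for durable storage; swap in a database or object store.
_CHECKPOINTS: dict[str, dict] = {}

def store_checkpoint(checkpoint: dict) -> None:
    _CHECKPOINTS[checkpoint["id"]] = checkpoint

def load_checkpoint(checkpoint_id: str) -> dict:
    return _CHECKPOINTS[checkpoint_id]

# Placeholders for the model clients and the schema check; supply real ones.
def call_gpt_5_5(query: str, state: dict) -> dict:
    raise NotImplementedError("frontier model client goes here")

def validate_response(response: dict) -> None:
    if not isinstance(response, dict):
        raise ValueError("unexpected response shape")

def fallback_local_model(query: str, state: dict) -> dict:
    raise NotImplementedError("local model client goes here")

def execute_with_checkpoint(query: str, state: dict) -> dict:
    # Serialize state before external call; a unique id avoids timestamp collisions
    checkpoint = {"id": uuid.uuid4().hex, "state": state, "query": query, "timestamp": time.time()}
    store_checkpoint(checkpoint)

    try:
        response = call_gpt_5_5(query, state)
        validate_response(response)
        return response
    except Exception:
        # Restore the saved state and fall back to the local model
        state = load_checkpoint(checkpoint["id"])["state"]
        return fallback_local_model(query, state)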

The key is the checkpoint. Without it, you cannot resume after a failure. The routing logic is simple, but the state management is where most teams fail.

Observability

You need metrics for:

Routing distribution: What percentage of queries hit each tier?
Latency by tier: How much slower is the frontier model?
Cost per query: Track API spend by user, query type, and tier.
Failure rate by tier: Does the local model fail more often on certain query patterns?

Databricks likely uses its own Lakehouse Monitoring to track these metrics. If you are building this yourself, instrument the router and log every decision with context. You will need this data to tune the routing rules.
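
A minimal sketch of that instrumentation, using only the standard library: one structured record per routing decision, plus a helper that turns a batch of records into a routing distribution. The field names and the `log_decision` and `routing_distribution` helpers are assumptions for illustration, not a schema any vendor prescribes.

```python
import json
import logging
from collections import Counter

logger = logging.getLogger("router")


def log_decision(query_id, tier, reason, latency_ms, cost_usd):
    """Emit one structured record per routing decision and return it."""
    record = {
        "query_id": query_id,
        "tier": tier,
        "reason": reason,          # which rule fired, for later tuning
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
    }
    logger.info(json.dumps(record))
    return record


def routing_distribution(records):
    """Fraction of decisions that went to each tier."""
    counts = Counter(r["tier"] for r in records)
    total = sum(counts.values())
    return {tier: n / total for tier, n in counts.items()} if total else {}


records = [
    log_decision("q1", "local", "rule_1_cheap", 120, 0.0),
    log_decision("q2", "local", "default", 140, 0.0),
    log_decision("q3", "frontier", "rule_4_ambiguity", 4200, 0.03),
    log_decision("q4", "local", "rule_1_cheap", 110, 0.0),
]
distribution = routing_distribution(records)
```

Recording the reason alongside the tier is what makes the log useful for tuning: it shows which rule is responsible when the frontier share drifts upward.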

Security Boundaries

Sending enterprise data to an external API requires:

Data sanitization: Strip PII or sensitive columns before the call.
Audit logging: Record every query sent to the frontier model, who requested it, and what data was included.
Access control: Not every user should be able to route to the expensive model. Enforce this at the router, not in the UI.
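
The first two requirements can be sketched as a pair of small functions: one drops deny-listed columns before rows leave the network, the other builds an audit record describing what was sent. The column deny-list, the record fields, and the function names are all hypothetical. A column filter does not catch PII embedded in free-text values, so treat this as a starting point only.

```python
import hashlib
import json
from datetime import datetime, timezone

SENSITIVE_COLUMNS = {"email", "ssn", "phone"}  # hypothetical deny-list


def sanitize_rows(rows, sensitive=SENSITIVE_COLUMNS):
    """Drop deny-listed columns from each row before an external call."""
    return [{k: v for k, v in row.items() if k not in sensitive} for row in rows]


def audit_record(user_id, query, rows):
    """Describe an outbound call: who sent what, without storing the payload."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return {
        "user_id": user_id,
        "query": query,
        "columns_sent": sorted({k for row in rows for k in row}),
        "row_count": len(rows),
        "payload_sha256": hashlib.sha256(payload).hexdigest(),
        "sent_at": datetime.now(timezone.utc).isoformat(),
    }


raw = [
    {"customer_id": 1, "email": "a@example.com", "revenue": 1200.0},
    {"customer_id": 3, "email": "b@example.com", "revenue": 700.0},
]
clean = sanitize_rows(raw)
audit = audit_record("analyst-7", "top product in Q3", clean)
```

Hashing the payload lets the audit log prove what was sent without becoming a second copy of the data.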

If your compliance team requires on-premise inference, you cannot route to an external API. The frontier tier then has to be a self-hosted open-weight model (Llama 3.1 405B or similar) or a private deployment of GPT-5.5, if OpenAI offers one.

Technical Verdict

Use this architecture when:

You have a mix of simple and complex queries, and you want to optimize cost without sacrificing quality.
Your agent workflows already use checkpoints or durable queues for other reasons (retries, human-in-the-loop).
You can tolerate 3-10 second latency spikes for complex queries.
Your compliance posture allows sending anonymized data to external APIs.

Avoid this architecture when:

All your queries are simple enough for a local model (no multi-hop reasoning, no schema inference).
You cannot tolerate variable latency (real-time systems, user-facing chat).
Your data cannot leave your network (healthcare, finance, government).
You do not have the engineering capacity to build and maintain a routing layer with durable state.

The Databricks case study suggests that routing to a frontier model can work in enterprise agent workflows, but it is not a drop-in solution. You need infrastructure for checkpointing, observability for cost tracking, and security controls for data governance. If you have those pieces, hybrid routing is a cost-effective way to build enterprise agent workflows.

Source Links

Databricks brings GPT-5.5 to enterprise agent workflows