Databricks now uses GPT-5.5 for enterprise agent workflows after the model set a new state of the art on the OfficeQA Pro benchmark. This is the first public case study showing how a data platform integrates frontier models into existing agent orchestration. The interesting part is not the API call itself but the routing decision: when does an agent task warrant a frontier model call rather than local inference, and how do you manage state across that boundary?
The OfficeQA Pro Benchmark
OfficeQA Pro measures multi-step reasoning over structured enterprise data: spreadsheets, databases, and document stores. Tasks include conditional aggregations, cross-table joins, and natural language queries that require schema inference. The benchmark exposes whether a model can decompose a user request into executable steps without hallucinating column names or inventing data.
GPT-5.5’s performance matters because it validates the model’s ability to handle the kind of ambiguous, multi-hop queries that break simpler agents. If your workflow needs to answer “What was our highest revenue product in Q3 for customers who also bought service contracts?” you need a model that can plan the join, filter, and aggregation without losing context.
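To make the decomposition concrete, here is a minimal sketch of the plan that example question implies, run against a toy SQLite database. The table names, column names, rows, and the choice of calendar-year 2024 for Q3 are all invented for illustration; none of it comes from OfficeQA Pro or Databricks.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, product TEXT, revenue REAL, order_date TEXT);
    CREATE TABLE service_contracts (customer_id INTEGER);
    INSERT INTO orders VALUES
        (1, 'widget', 500.0, '2024-07-15'),
        (1, 'gadget', 300.0, '2024-08-01'),
        (2, 'gadget', 900.0, '2024-09-10'),
        (3, 'widget', 200.0, '2024-08-20'),
        (1, 'widget', 400.0, '2024-10-05');
    INSERT INTO service_contracts VALUES (1), (3);
""")


def top_q3_product(conn: sqlite3.Connection) -> tuple[str, float]:
    """Semi-join, filter, aggregate: the three steps the model has to plan."""
    row = conn.execute("""
        SELECT o.product, SUM(o.revenue) AS total
        FROM orders AS o
        -- semi-join: only customers who also hold a service contract
        WHERE o.customer_id IN (SELECT customer_id FROM service_contracts)
          -- filter: Q3, assumed here to mean July through September 2024
          AND o.order_date >= '2024-07-01' AND o.order_date < '2024-10-01'
        GROUP BY o.product
        ORDER BY total DESC
        LIMIT 1
    """).fetchone()
    return row[0], row[1]


print(top_q3_product(conn))  # ('widget', 700.0)
```

Note that 'gadget' has the higher revenue if the contract filter is dropped, which is exactly the kind of silent error a model makes when it loses one clause of the request.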
Routing Architecture
The core problem is cost and latency. Frontier models are expensive and slow compared with local models, which are fast and cheap but more likely to fail on complex reasoning. The solution is a routing layer that decides which model handles each step.
Decision Boundary
Databricks likely uses a classifier or rule-based router that evaluates:
- Query complexity: Token count, nested clauses, number of tables referenced.
- Schema ambiguity: Does the query reference columns that exist, or does it require inference?
- Failure history: Has this query type failed with the local model before?
- Cost threshold: Is the user on a plan that allows frontier model calls?
If the router sends the task to GPT-5.5, the agent must serialize its current state, make the API call, parse the response, and resume. If it sends the task to a local model (likely a fine-tuned Llama or Mistral variant), the agent executes in-process.
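The schema-ambiguity signal in the list above could be approximated with a cheap lexical check before any model is called. The sketch below is an assumption about how such a score might look, not a description of the Databricks router: it treats tokens containing an underscore, or matching a known column, as column references and reports the fraction that do not resolve.

```python
import re


def schema_ambiguity(query: str, known_columns: set[str]) -> float:
    """Return the share of column-like tokens that match no known column.

    Crude heuristic (an assumption, not a real parser): a token counts as a
    column reference if it contains an underscore or equals a known column.
    A query with no such tokens scores 1.0, since everything must be inferred.
    """
    tokens = re.findall(r"[a-z][a-z0-9_]*", query.lower())
    candidates = {t for t in tokens if "_" in t or t in known_columns}
    if not candidates:
        return 1.0
    unknown = candidates - known_columns
    return len(unknown) / len(candidates)


columns = {"order_date", "revenue", "product"}  # hypothetical schema
print(schema_ambiguity("sum revenue by order_dt", columns))      # 0.5
print(schema_ambiguity("what sold best last quarter?", columns))  # 1.0
```

A router could send anything above a tuned threshold to the frontier tier; the threshold itself would have to come from observed failure data.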
State Management
The agent workflow needs to survive a round trip to an external API. This requires:
- Checkpoint before the call: Serialize the conversation history, intermediate results, and execution plan.
- Idempotent API wrapper: If the call fails, retry with the same checkpoint. Do not re-execute prior steps.
- Response validation: Parse the frontier model’s output and confirm it matches the expected schema before resuming.
- Fallback path: If GPT-5.5 returns garbage or times out, route to a human or a simpler fallback model.
This is not trivial. Most agent frameworks assume synchronous, in-process execution. Adding an async boundary means you need durable state storage (likely a database or object store) and a way to resume from arbitrary checkpoints.
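As a rough illustration of the checkpoint and idempotent-wrapper requirements above, the sketch below stores checkpoints in SQLite as a stand-in for whatever database or object store you use. `CheckpointStore`, `call_with_retries`, and the `call_model` callable are hypothetical names introduced here; a real wrapper would also add backoff between attempts.

```python
import json
import sqlite3
import uuid
from typing import Callable, Optional


class CheckpointStore:
    """Checkpoints in SQLite; pass a file path instead of ':memory:' for durability."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "id TEXT PRIMARY KEY, state TEXT NOT NULL, response TEXT)"
        )

    def save(self, state: dict) -> str:
        checkpoint_id = str(uuid.uuid4())
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO checkpoints (id, state) VALUES (?, ?)",
                (checkpoint_id, json.dumps(state)),
            )
        return checkpoint_id

    def load(self, checkpoint_id: str) -> dict:
        row = self.conn.execute(
            "SELECT state FROM checkpoints WHERE id = ?", (checkpoint_id,)
        ).fetchone()
        if row is None:
            raise KeyError(checkpoint_id)
        return json.loads(row[0])

    def cached_response(self, checkpoint_id: str) -> Optional[dict]:
        row = self.conn.execute(
            "SELECT response FROM checkpoints WHERE id = ?", (checkpoint_id,)
        ).fetchone()
        return json.loads(row[0]) if row and row[0] is not None else None

    def record_response(self, checkpoint_id: str, response: dict) -> None:
        with self.conn:
            self.conn.execute(
                "UPDATE checkpoints SET response = ? WHERE id = ?",
                (json.dumps(response), checkpoint_id),
            )


def call_with_retries(
    store: CheckpointStore,
    checkpoint_id: str,
    call_model: Callable[[dict], dict],
    max_attempts: int = 3,
) -> dict:
    """Retry from the same checkpoint; never repeat a call that already succeeded."""
    cached = store.cached_response(checkpoint_id)
    if cached is not None:
        return cached
    state = store.load(checkpoint_id)  # every attempt sees the identical state
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            response = call_model(state)
        except Exception as exc:  # a real wrapper would catch narrower errors
            last_error = exc
            continue
        store.record_response(checkpoint_id, response)
        return response
    raise RuntimeError(f"model call failed after {max_attempts} attempts") from last_error
```

Keying the stored response by checkpoint ID is what makes the wrapper idempotent: a caller that crashes after the call succeeds can resume and read the recorded response instead of paying for the call again.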
Failure Modes
| Failure Mode | Symptom | Mitigation |
|---|---|---|
| Routing thrash | Agent bounces between local and frontier models on the same query | Add hysteresis: once a query routes to GPT-5.5, keep it there for the session |
| State loss | API call succeeds but agent forgets prior context | Persist checkpoints to durable storage (a database, an object store, or a durable queue such as Kafka or SQS) before external calls |
| Cost explosion | Every query hits the frontier model | Implement per-user or per-query budget caps with circuit breakers |
| Latency spike | Frontier model takes 10+ seconds, user abandons workflow | Set aggressive timeouts (3-5s) and fall back to local model with a warning |
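Two of the mitigations in the table, hysteresis and budget caps, fit in a small wrapper around whatever base routing decision you already have. The following is a sketch under assumptions: `SessionRouter` is a hypothetical name, one instance is meant to exist per user or per budget scope, and the cost estimate is supplied by the caller.

```python
from dataclasses import dataclass, field


@dataclass
class SessionRouter:
    """Adds session stickiness and a spend cap on top of a base routing decision."""

    budget_usd: float
    spent_usd: float = 0.0
    sticky_sessions: set[str] = field(default_factory=set)

    def route(self, session_id: str, wants_frontier: bool, est_cost_usd: float) -> str:
        # Hysteresis: once a session has escalated, keep it on the frontier tier
        if session_id in self.sticky_sessions:
            wants_frontier = True
        if not wants_frontier:
            return "local"
        # Circuit breaker: the budget cap overrides stickiness
        if self.spent_usd + est_cost_usd > self.budget_usd:
            return "local"
        self.sticky_sessions.add(session_id)
        self.spent_usd += est_cost_usd
        return "frontier"


router = SessionRouter(budget_usd=1.0)
print(router.route("s1", True, 0.5))   # frontier
print(router.route("s1", False, 0.5))  # frontier (sticky)
print(router.route("s1", False, 0.5))  # local (cap reached)
```

Letting the cap win over stickiness is a design choice; the alternative is to finish in-flight sessions on the frontier tier and block only new escalations.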
Implementation Sketch
Here’s a simplified routing layer in Python. This is not production code, but it shows the decision logic:
```python
import re
import time
from dataclasses import dataclass
from enum import Enum


class ModelTier(Enum):
    LOCAL = "local"
    FRONTIER = "frontier"


@dataclass
class QueryContext:
    text: str
    table_count: int
    token_count: int
    user_tier: str
    failure_history: list[str]  # substrings seen in queries the local model failed on


def route_query(ctx: QueryContext) -> ModelTier:
    # Rule 1: Cheap queries stay local
    if ctx.token_count < 100 and ctx.table_count == 1:
        return ModelTier.LOCAL
    # Rule 2: Known failure patterns go to frontier
    if any(pattern in ctx.text for pattern in ctx.failure_history):
        return ModelTier.FRONTIER
    # Rule 3: Enterprise users get frontier by default for complex queries
    if ctx.user_tier == "enterprise" and ctx.table_count > 2:
        return ModelTier.FRONTIER
    # Rule 4: Schema ambiguity requires frontier reasoning. Match whole words
    # so that "likely" or "unlike" do not trigger the rule.
    if re.search(r"\b(similar to|like)\b", ctx.text, re.IGNORECASE):
        return ModelTier.FRONTIER
    return ModelTier.LOCAL


# store_checkpoint, load_checkpoint, call_gpt_5_5, validate_response, and
# fallback_local_model are placeholders for your own storage and model clients.
def execute_with_checkpoint(query: str, state: dict) -> dict:
    # Serialize state before external call
    checkpoint = {"state": state, "query": query, "timestamp": time.time()}
    store_checkpoint(checkpoint)
    try:
        response = call_gpt_5_5(query, state)
        validate_response(response)
        return response
    except Exception:
        # Restore from checkpoint and fall back
        state = load_checkpoint(checkpoint["timestamp"])
        return fallback_local_model(query, state)
```
The key is the checkpoint. Without it, you cannot resume after a failure. The routing logic is simple, but the state management is where most teams fail.
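The sketch above leaves `validate_response` undefined. One possible shape is shown below, assuming the frontier model is asked to return a JSON plan with `steps` and `columns` fields. That contract is made up for illustration, and the signature differs from the one-argument call in the sketch because the check needs the known schema.

```python
import json


class InvalidResponse(ValueError):
    """Raised when the model output cannot be trusted enough to resume."""


def validate_response(raw: str, known_columns: set[str]) -> dict:
    """Parse the model output and reject malformed plans or invented columns."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidResponse(f"not valid JSON: {exc}") from exc
    if not isinstance(plan, dict):
        raise InvalidResponse("expected a JSON object")

    steps = plan.get("steps")
    columns = plan.get("columns")
    if not isinstance(steps, list) or not steps:
        raise InvalidResponse("'steps' must be a non-empty list")
    if not isinstance(columns, list):
        raise InvalidResponse("'columns' must be a list")

    # Catch hallucinated column names before any step is executed
    invented = [c for c in columns if not isinstance(c, str) or c not in known_columns]
    if invented:
        raise InvalidResponse(f"unknown columns: {invented}")
    return plan
```

Raising a dedicated exception type lets the caller distinguish a bad model response, which should trigger the fallback path, from an infrastructure error, which may be worth retrying.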
Observability
You need metrics for:
- Routing distribution: What percentage of queries hit each tier?
- Latency by tier: How much slower is the frontier model?
- Cost per query: Track API spend by user, query type, and tier.
- Failure rate by tier: Does the local model fail more often on certain query patterns?
Databricks likely uses their own Lakehouse monitoring to track these metrics. If you are building this yourself, instrument the router and log every decision with context. You will need this data to tune the routing rules.
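For teams instrumenting the router themselves, a minimal sketch of decision logging might look like the following. `RoutingMetrics` and its field names are hypothetical; in practice the counters would feed whatever metrics backend you already run, and the JSON log line is what you would later query to tune the rules.

```python
import json
import logging
from collections import Counter, defaultdict

logger = logging.getLogger("router")


class RoutingMetrics:
    """In-memory aggregates for the four metrics listed above."""

    def __init__(self) -> None:
        self.count: Counter = Counter()
        self.failures: Counter = Counter()
        self.latency_s: defaultdict = defaultdict(float)
        self.cost_usd: defaultdict = defaultdict(float)

    def record(self, *, tier: str, rule: str, user_id: str,
               latency_s: float, cost_usd: float, ok: bool) -> None:
        self.count[tier] += 1
        self.latency_s[tier] += latency_s
        self.cost_usd[tier] += cost_usd
        if not ok:
            self.failures[tier] += 1
        # One structured line per decision, including the rule that fired
        logger.info(json.dumps({
            "tier": tier, "rule": rule, "user_id": user_id,
            "latency_s": latency_s, "cost_usd": cost_usd, "ok": ok,
        }))

    def summary(self) -> dict:
        total = sum(self.count.values())
        return {
            tier: {
                "share": n / total,
                "avg_latency_s": self.latency_s[tier] / n,
                "cost_usd": self.cost_usd[tier],
                "failure_rate": self.failures[tier] / n,
            }
            for tier, n in self.count.items()
        }
```

Logging the rule name alongside the tier matters more than it looks: without it you can see that too many queries hit the frontier model, but not which rule is sending them there.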
Security Boundaries
Sending enterprise data to an external API requires:
- Data sanitization: Strip PII or sensitive columns before the call.
- Audit logging: Record every query sent to the frontier model, who requested it, and what data was included.
- Access control: Not every user should be able to route to the expensive model. Enforce this at the router, not in the UI.
If your compliance team requires on-premises inference, you cannot use this architecture as described. The frontier tier would have to be a self-hosted model (Llama 3.1 405B or similar) or a private deployment of GPT-5.5 (if OpenAI offers one).
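The three controls listed in this section can sit in a single gate that every frontier-bound request passes through. This is a sketch under stated assumptions: the deny-list, the role names, and `prepare_frontier_call` are hypothetical, and a regex for email addresses is nowhere near complete PII detection.

```python
import hashlib
import json
import re
import time

SENSITIVE_COLUMNS = {"ssn", "email", "date_of_birth"}  # hypothetical deny-list
FRONTIER_ROLES = {"analyst", "admin"}                  # hypothetical roles
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def sanitize_rows(rows: list[dict]) -> list[dict]:
    """Drop deny-listed columns and redact email addresses in free text."""
    cleaned = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k.lower() not in SENSITIVE_COLUMNS}
        cleaned.append({
            k: EMAIL_RE.sub("[REDACTED_EMAIL]", v) if isinstance(v, str) else v
            for k, v in kept.items()
        })
    return cleaned


def prepare_frontier_call(user_id: str, role: str, query: str,
                          rows: list[dict], audit_log: list[dict]) -> list[dict]:
    """Enforce access at the router, sanitize the payload, then audit what is sent."""
    if role not in FRONTIER_ROLES:
        raise PermissionError(f"role {role!r} may not route to the frontier model")
    payload = sanitize_rows(rows)
    serialized = json.dumps(payload, sort_keys=True, default=str)
    audit_log.append({
        "ts": time.time(),
        "user_id": user_id,
        "query": query,
        "columns_sent": sorted({k for row in payload for k in row}),
        "payload_sha256": hashlib.sha256(serialized.encode()).hexdigest(),
    })
    return payload
```

Storing a hash of the payload rather than the payload itself keeps the audit log from becoming a second copy of the sensitive data while still letting you prove later what was sent.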
Technical Verdict
Use this architecture when:
- You have a mix of simple and complex queries, and you want to optimize cost without sacrificing quality.
- Your agent workflows already use checkpoints or durable queues for other reasons (retries, human-in-the-loop).
- You can tolerate 3-10 second latency spikes for complex queries.
- Your compliance posture allows sending anonymized data to external APIs.
Avoid this architecture when:
- All your queries are simple enough for a local model (no multi-hop reasoning, no schema inference).
- You cannot tolerate variable latency (real-time systems, user-facing chat).
- Your data cannot leave your network (healthcare, finance, government).
- You do not have the engineering capacity to build and maintain a routing layer with durable state.
The Databricks case study suggests this pattern can work at scale, but it is not a drop-in solution. You need infrastructure for checkpointing, observability for cost tracking, and security controls for data governance. If you have those pieces, hybrid routing is a cost-effective way to build enterprise agent workflows.