Multi-Agent RL in Three-Sided Marketplaces: How Delayed Feedback Trains Dispatch Systems Without Immediate Rewards

Most reinforcement learning tutorials assume you get a reward signal milliseconds after each action. Production dispatch systems operate under different physics. A courier assignment decision made at 6:15 PM gets evaluated when the delivery completes at 6:47 PM, the merchant reports congestion at 7:02 PM, and the courier logs idle time at 7:18 PM. DoorDash’s deployed RL system shows how to train agents when feedback is delayed, noisy, and comes from three conflicting stakeholders.

The Three-Sided Coordination Problem

Traditional dispatch optimizers solve a combinatorial assignment problem: match available couriers to pending orders. The objective function weights delivery speed, batching efficiency, and merchant load. In production, those weights are static or manually tuned.

The RL layer sits above the optimizer. Instead of replacing the assignment solver, it adjusts the objective weights per store based on delayed marketplace feedback. A store-level policy selects a discrete multiplier that shifts the tradeoff between customer delivery quality and courier batching efficiency.

This design preserves two critical properties:

Feasibility constraints remain in the combinatorial solver (capacity limits, geographic zones, time windows)
Operational safeguards stay intact (minimum service levels, fairness rules, compliance checks)

The RL agent only controls the weight vector, not individual assignments.

Delayed Feedback Architecture

The system handles three types of delayed signals:

Signal Type	Delay Window	Source	Use Case
Delivery completion	15-60 minutes	Customer app	Actual delivery time vs. promise
Courier utilization	30-90 minutes	Driver logs	Idle time, batch acceptance rate
Merchant congestion	45-120 minutes	Store systems	Queue depth, prep time variance

Credit assignment becomes non-trivial. A decision at time t influences outcomes observed at t+30, t+60, and t+90. The agent must learn which weight adjustment caused which downstream effect when hundreds of other decisions happened in between.

The solution uses centralized offline training with decentralized execution:

Centralized value function: A shared Q-network trained on logged data from all stores
Store-level policies: Each store runs its own policy that queries the shared value function
Double Q-learning targets: Reduces overestimation when feedback is sparse and noisy
Conservative regularizer: Penalizes out-of-distribution actions during offline training

State Representation and Action Space

The state vector at each decision point includes:

Current order queue depth and age distribution
Available courier count and location density
Recent delivery performance (rolling 30-minute window)
Merchant-reported prep times
Historical demand patterns for this store and time slot

The action space is discrete: a small set of multipliers (likely 3-5 options) that scale the batching-efficiency term in the dispatch objective. This keeps the action space tractable for offline learning and makes the policy interpretable to operations teams.

Offline Training with Noisy Logs

The system trains entirely on logged marketplace data. No simulator, no online exploration during training. This creates two problems:

Problem 1: Distributional shift. The logged data reflects decisions made by the old policy. The new policy will encounter states the old policy never visited.

Problem 2: Coupled feedback. When you change dispatch weights at one store, it affects courier availability at nearby stores. The reward signal is not isolated to a single agent’s action.

The conservative regularizer addresses problem 1 by adding a penalty term to the Q-learning loss:

# Simplified conservative Q-learning update
q_target = reward + gamma * min(q1_next, q2_next)  # Double Q-learning
q_pred = q_network(state, action)

# Standard TD error
td_loss = (q_pred - q_target) ** 2

# Conservative penalty: penalize Q-values for unseen actions
unseen_actions = get_out_of_distribution_actions(state, logged_data)
conservative_penalty = alpha * max(q_network(state, unseen_actions))

total_loss = td_loss + conservative_penalty

Problem 2 is handled by treating each store as a decentralized agent with access to a centralized value function. The value function learns to predict long-term outcomes given the current state and the likely behavior of neighboring stores.

Production Deployment Shape

The deployment uses a switchback experiment design:

Control periods: Static objective weights (baseline)
Treatment periods: RL-selected weights
Switch frequency: Every 15-30 minutes per store

This design isolates the effect of the policy from daily demand patterns and seasonal trends. The switchback also provides a safety mechanism: if the policy degrades any monitored metric below a threshold, the system reverts to static weights.

The policy runs as a sidecar service to the dispatch optimizer:

┌─────────────────┐
│  Order Stream   │
└────────┬────────┘
         │
         v
┌─────────────────┐      ┌──────────────┐
│  RL Policy      │─────>│  Objective   │
│  (per store)    │      │  Weights     │
└─────────────────┘      └──────┬───────┘
                                │
                                v
                         ┌──────────────┐
                         │  Dispatch    │
                         │  Optimizer   │
                         └──────┬───────┘
                                │
                                v
                         ┌──────────────┐
                         │ Assignments  │
                         └──────────────┘

The policy service is stateless. It queries the shared value function with current store state and returns a weight multiplier. The dispatch optimizer applies the weights and solves the assignment problem.

Observability and Failure Modes

The system monitors three layers:

Policy layer:

Action distribution per store (detect stuck policies)
Q-value variance (detect value function collapse)
Out-of-distribution state frequency

Optimizer layer:

Solver runtime (weight changes can make the problem harder)
Infeasibility rate (bad weights can create unsolvable instances)
Constraint violation counts

Outcome layer:

Delivery time degradation
Courier idle time increase
Merchant queue overflow

Likely failure modes (and mitigation status):

Value overestimation (partially mitigated): The Q-network predicts high returns for aggressive batching, but logged data doesn’t capture the long-tail delivery failures. The conservative regularizer reduces this risk, but extreme out-of-distribution states remain a known gap.
Non-stationarity (actively mitigated): Demand patterns shift (new restaurant opens, road construction, weather event). The policy trained on historical data may not generalize. The switchback design catches degradation within 15-30 minutes and reverts to safe weights.
Coordination collapse (open risk): If too many stores adopt aggressive batching simultaneously, courier availability drops system-wide. The centralized value function should learn this coupling, but it remains a known vulnerability in multi-agent settings.
Delayed feedback loops (design constraint): If the policy changes weights too frequently, it never observes the full outcome of any decision. The 15-30 minute switch frequency is tuned to balance exploration and feedback latency, but this limits adaptation speed.

Production Results

The switchback experiment showed:

Increased batching efficiency (more orders per courier trip)
Reduced courier-side time costs (less idle time between deliveries)
No degradation in customer-facing delivery quality (promise time still met)

The key insight: the RL policy learned to exploit temporal slack in delivery promises. When a customer expects delivery in 45 minutes but the current queue could deliver in 30, the policy increases batching weights to improve courier utilization without risking the promise.

Technical Verdict

Use this approach when:

You have a working combinatorial optimizer that you can’t replace
Feedback is delayed but eventually observable (15-120 minute windows)
You need to balance multiple stakeholder objectives (customers, couriers, merchants)
You have enough logged data to train offline (months of production traffic at store-level granularity)
You can run controlled experiments (switchback or A/B tests with 15-30 minute periods)
Your action space is discrete and small (3-5 weight multipliers, not continuous control)

Avoid this approach when:

Feedback never arrives or is too noisy to attribute
The action space is continuous or very large (offline RL struggles with high-dimensional actions)
You need immediate adaptation to rare events (offline training requires retraining cycles)
The system is too critical to tolerate any exploration risk (even conservative policies can fail on unseen states)
You lack the infrastructure for safe rollback (switchback experiments require real-time monitoring and automatic reversion)

The architecture is a pragmatic middle ground. It doesn’t replace the dispatch optimizer, it doesn’t require online exploration, and it doesn’t assume instant feedback. For production systems with delayed operational outcomes and existing combinatorial solvers, this is a working pattern.