Deep RL for Job Shop Scheduling: How Agents Handle Random Arrivals Without Recomputing the Entire Plan

The Flexible Job Shop Scheduling Problem (FJSP) is a canonical operations research challenge: you have jobs, each made up of multiple operations, and machines that can perform those operations at different speeds. The goal is to minimize total completion time. The catch in the dynamic setting is that jobs arrive unpredictably, and the combinatorial explosion makes mixed-integer linear programming (MILP) solvers impractical beyond small instances.

A new arXiv preprint (2605.22773v1, May 2026) shows how to train a deep RL agent that dispatches jobs in real time without replanning from scratch every time a new job lands. The pattern applies directly to trading execution (route orders across venues as liquidity shifts), warehouse robotics (assign tasks as packages arrive), and CI/CD pipelines (allocate build slots as PRs merge).

Why Static Schedules Break

Traditional FJSP solvers assume you know the full job list upfront. You feed the problem to a MILP solver, wait minutes or hours for an optimal plan, then execute. Two failure modes kill this approach in production:

Arrival uncertainty: Jobs show up at random times. A static plan computed at 9 AM is obsolete by 9:05 AM when three urgent jobs land.
Recomputation cost: Re-solving the MILP from scratch every time a job arrives burns CPU and introduces latency. In a trading context, you cannot pause order routing for 30 seconds while the solver reruns.

The RL approach sidesteps both problems by learning a policy that maps the current state (which machines are busy, which jobs are waiting) to a dispatching decision. No global recomputation required.

State Representation and Action Space

The agent observes a flattened vector encoding:

Machine status: For each machine, current utilization, time until free, and capability flags (which operations it can perform).
Job queue: For each waiting job, the next operation to schedule, processing time on each eligible machine, and total remaining work.
Partial schedule: Operations already assigned but not yet complete, so the agent knows which machines will be free soon.

The action space is constrained to a set of well-known dispatching rules:

Shortest Processing Time (SPT): Pick the job with the smallest next operation time.
Most Work Remaining (MWKR): Prioritize jobs with the longest total remaining work.
First Come First Served (FCFS): Respect arrival order.
Random: Baseline for comparison.

Instead of outputting raw machine-job assignments (which would explode the action space), the agent selects which rule to apply at each decision point. This design keeps the policy lightweight (a small multi-layer perceptron) and training tractable.
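The rules can be written as interchangeable functions, so the agent's action is just an index into a list. The following is a minimal sketch under stated assumptions: the `Job` fields, the use of plain machine IDs, and the choice of the fastest free eligible machine for each job are illustrative and not taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job record; field names are assumptions for illustration.
    job_id: int
    arrival_time: float
    remaining_work: float
    next_op_times: dict  # machine_id -> processing time of the next operation


def _best_machine(job, free_machines):
    # Fastest free machine that can run the job's next operation, or None.
    eligible = [m for m in free_machines if m in job.next_op_times]
    return min(eligible, key=lambda m: job.next_op_times[m]) if eligible else None


def _candidates(queue, free_machines):
    # (job, machine) pairs for jobs that could start right now.
    pairs = ((j, _best_machine(j, free_machines)) for j in queue)
    return [(j, m) for j, m in pairs if m is not None]


def spt(queue, free_machines):
    # Shortest Processing Time: smallest next-operation time wins.
    c = _candidates(queue, free_machines)
    return min(c, key=lambda p: p[0].next_op_times[p[1]]) if c else None


def mwkr(queue, free_machines):
    # Most Work Remaining: largest total remaining work wins.
    c = _candidates(queue, free_machines)
    return max(c, key=lambda p: p[0].remaining_work) if c else None


def fcfs(queue, free_machines):
    # First Come First Served: earliest arrival wins.
    c = _candidates(queue, free_machines)
    return min(c, key=lambda p: p[0].arrival_time) if c else None


def random_rule(queue, free_machines, rng=random):
    # Random baseline: any job that can start now.
    c = _candidates(queue, free_machines)
    return rng.choice(c) if c else None


# The policy's output is an index into this list.
RULES = [spt, mwkr, fcfs, random_rule]
```

Each rule returns a `(job, machine_id)` pair, or `None` when no waiting job can start on a free machine, so the caller can swap rules without changing the dispatch code.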

Training with Proximal Policy Optimization

The paper uses PPO, a policy gradient method that clips the ratio between the new and old policy probabilities so that a single update cannot move the policy too far and destabilize training. The reward signal is negative total completion time: the agent gets a penalty proportional to how long all jobs take to finish.
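To make the clipping concrete, here is a minimal sketch of the standard PPO clipped surrogate objective in plain Python. The function name and the `clip_eps=0.2` default are assumptions chosen for illustration; the article does not state the paper's hyperparameters, and a real training loop would compute this on tensors with automatic differentiation.

```python
import math


def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized) over a batch.

    new_logps / old_logps: log-probabilities of the taken actions under the
    current and the data-collecting policy. clip_eps is an assumed default.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum removes any incentive to push the ratio
        # outside [1 - eps, 1 + eps] in the direction the advantage favors.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage, the objective stops growing once the ratio exceeds `1 + clip_eps`, which is what bounds the size of each policy update.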

Key training details:

Event-driven simulation: The environment advances time to the next event (job arrival, operation completion) rather than ticking at fixed intervals. This matches production reality and reduces wasted computation.
Heterogeneous datasets: Training mixes problem instances with varying job counts, operation counts per job, and machine capabilities. The goal is a policy that generalizes across arrival distributions, not one that overfits to a single factory layout.
Arrival rate variation: Some episodes have bursty arrivals (ten jobs land in five minutes, then silence), others have steady trickles. The agent learns to balance greedy short-term packing against leaving slack for future high-priority jobs.
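The arrival regimes described above can be sketched as two generators that produce job arrival timestamps for one episode. All function names, rates, and burst parameters here are hypothetical values chosen for illustration, not settings from the paper.

```python
import random


def poisson_arrivals(rate, horizon, rng):
    """Steady trickle: exponential inter-arrival gaps with mean 1/rate."""
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)
        if t >= horizon:
            return times
        times.append(t)


def bursty_arrivals(burst_size, burst_width, quiet_gap, horizon, rng):
    """Bursts of burst_size jobs inside burst_width, then quiet_gap of silence."""
    times, start = [], 0.0
    while start < horizon:
        burst = sorted(
            start + rng.uniform(0.0, burst_width) for _ in range(burst_size)
        )
        times.extend(t for t in burst if t < horizon)
        start += burst_width + quiet_gap
    return times


def sample_episode_arrivals(horizon, rng):
    """Pick one regime per episode so the policy sees both during training."""
    if rng.random() < 0.5:
        return "steady", poisson_arrivals(rate=0.2, horizon=horizon, rng=rng)
    return "bursty", bursty_arrivals(10, 5.0, 55.0, horizon, rng)
```

In an event-driven simulator, these timestamps would be loaded into the event queue as arrival events, and the clock would jump from one event to the next instead of ticking at fixed intervals.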

Reward Shaping to Prevent Greedy Packing

A naive reward (minimize completion time of the current batch) encourages the agent to pack jobs onto the first available machine, starving downstream operations. The paper addresses this by penalizing the total completion time across all jobs, not just the ones currently in the queue. This forces the agent to consider how today’s decisions affect tomorrow’s bottlenecks.
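A small worked example shows why the scope of the reward matters. This sketch assumes "total completion time" means the sum of per-job completion times (the paper may define it differently), and the two schedules use made-up numbers purely for illustration.

```python
def episode_reward(completion_times):
    """Negative total completion time over every job seen in the episode."""
    return -sum(completion_times.values())


def batch_only_reward(completion_times, batch_ids):
    """Naive variant: only jobs in the current batch are counted."""
    return -sum(completion_times[j] for j in batch_ids)


# Hypothetical completion times (job_id -> finish time) for two schedules.
# Jobs 1 and 2 are the current batch; job 3 arrives later.
greedy = {1: 4.0, 2: 5.0, 3: 20.0}    # packs the batch tightly, blocks job 3
balanced = {1: 5.0, 2: 6.0, 3: 12.0}  # slightly slower batch, leaves slack

batch = [1, 2]
naive_prefers_greedy = (
    batch_only_reward(greedy, batch) > batch_only_reward(balanced, batch)
)
full_prefers_balanced = episode_reward(balanced) > episode_reward(greedy)
```

Under the batch-only reward the greedy schedule scores higher (-9 against -11), while the all-jobs reward favors the balanced schedule (-23 against -29).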

In a trading analogy: routing all orders to the fastest venue right now might exhaust liquidity and spike costs for the next wave of orders. The agent learns to spread load across venues to keep execution quality stable over time.

Deployment Shape and Failure Modes

The trained policy is a small MLP (a few hundred parameters) that runs in microseconds. Deployment looks like:

Event loop: Listen for job arrivals and operation completions.
State extraction: Query the current machine status and job queue from your scheduling database.
Policy inference: Pass the state vector to the MLP, get back a dispatching rule index.
Dispatch: Apply the selected rule to assign the next job to a machine.

Failure modes to watch:

Out-of-distribution arrivals: Suppose the agent was trained only on Poisson arrivals with a fixed rate λ. If production suddenly sees a burst pattern outside that range (Black Friday, market open), the policy may select rules that perform poorly for the new regime. Mitigation: train on a wider range of arrival distributions or add a fallback heuristic when queue depth exceeds training bounds.
Machine downtime: The simulator assumes machines never break, but real factories have maintenance windows and unexpected failures. The state representation tracks machine status, yet because downtime never appears in training, the agent may assign jobs to machines that will be offline in ten minutes. Mitigation: inject synthetic downtime events during training or add a rule-based override that blocks assignments to machines with scheduled maintenance.
Reward hacking: If the reward is purely completion time, the agent might learn to delay low-priority jobs indefinitely to optimize the metric. The paper does not explicitly address priority classes, but production systems need a secondary penalty for job starvation or SLA violations.
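Two of the mitigations above are simple guards that wrap the policy without changing it. The sketch below is one possible shape; the function names, the `(start, end)` maintenance window format, and the `lookahead` parameter are assumptions made for illustration.

```python
def guarded_rule_index(policy_rule_idx, queue_depth, max_trained_depth, fallback_idx):
    """Use the policy's rule unless the queue is deeper than anything seen in training."""
    if queue_depth > max_trained_depth:
        return fallback_idx
    return policy_rule_idx


def block_maintenance(machine_ids, maintenance_windows, now, lookahead):
    """Drop machines that are in maintenance or enter it within `lookahead`.

    maintenance_windows maps machine_id -> list of (start, end) times.
    """
    allowed = []
    for m in machine_ids:
        windows = maintenance_windows.get(m, [])
        blocked = any(start - lookahead <= now < end for start, end in windows)
        if not blocked:
            allowed.append(m)
    return allowed
```

The filtered machine list would be passed to the dispatching rule in place of the full fleet, so the override applies regardless of which rule the policy selects.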

Benchmark Results

The paper compares the RL policy against:

Individual dispatching rules (SPT, MWKR, FCFS, random)
Arrival-triggered MILP (recompute the full schedule every time a job arrives)

On heterogeneous datasets with high arrival rates, the RL agent beats all individual rules by 8-15% on total completion time. Against MILP, the RL agent achieves 95% of optimal performance while running 1000x faster. On homogeneous datasets (all jobs have similar structure), MILP still wins, but the gap narrows as arrival rate increases and MILP recomputation overhead dominates.

Architecture Comparison

Component	MILP Baseline	RL Policy	Production Hybrid
Decision latency	Seconds to minutes	Microseconds	Microseconds (RL) + fallback MILP for overnight batch
Generalization	Recompute per instance	Trained across distributions	RL for real-time, MILP for planning
Observability	Solver logs, dual variables	Policy logits, state embeddings	Both, plus rule selection histogram
Failure mode	Timeout, infeasibility	Out-of-distribution collapse	Graceful degradation to heuristic
Update cycle	N/A (deterministic)	Retrain weekly on production traces	Continuous learning with human review

Code Sketch: Event-Driven Dispatch Loop

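import heapq
import itertools

import torch


class FJSPAgent:
    def __init__(self, policy_net, rules, max_jobs=32):
        self.policy = policy_net  # Trained MLP
        self.rules = rules        # [spt, mwkr, fcfs, random]
        self.max_jobs = max_jobs  # Queue slots encoded in the state (assumed size)

    def get_state(self, machines, job_queue):
        # Flatten machine status and job queue into a fixed-size vector.
        # Machine and job rows have different widths, so each row is
        # flattened before concatenation instead of building a ragged tensor.
        features = []
        for m in machines:
            features += [m.utilization, m.time_until_free, *m.capabilities]
        shown = job_queue[:self.max_jobs]
        for j in shown:
            features += [j.next_op_time, j.remaining_work, j.arrival_time]
        # Zero-pad unused job slots so the input size always matches the MLP.
        features += [0.0] * (3 * (self.max_jobs - len(shown)))
        return torch.tensor(features, dtype=torch.float32)

    def dispatch(self, state, job_queue, machines):
        with torch.no_grad():
            logits = self.policy(state)
            rule_idx = logits.argmax().item()

        # Apply selected dispatching rule. Each rule is assumed to return
        # a (job, machine) pair where the machine is currently free.
        selected_rule = self.rules[rule_idx]
        job, machine = selected_rule(job_queue, machines)
        return job, machine, rule_idx


# Event loop. policy_net, rules, machines, and the initial job_arrival
# events are assumed to be defined and pushed elsewhere.
event_queue = []         # Min-heap of (timestamp, seq, event_type, payload)
seq = itertools.count()  # Tie-breaker so payloads are never compared
job_queue = []           # Jobs waiting to be dispatched
agent = FJSPAgent(policy_net, rules)

while event_queue:
    # Pop the earliest event; a plain FIFO would process events out of order.
    timestamp, _, event_type, payload = heapq.heappop(event_queue)

    if event_type == "job_arrival":
        job_queue.append(payload)
    elif event_type == "operation_complete":
        machine = payload
        machine.mark_free()
        # Simplification: a multi-operation job would rejoin job_queue here.

    # Dispatch while a machine is free and jobs are waiting
    while job_queue and any(m.is_free() for m in machines):
        state = agent.get_state(machines, job_queue)
        job, machine, rule = agent.dispatch(state, job_queue, machines)
        job_queue.remove(job)  # Otherwise the same job is dispatched again
        machine.assign(job)
        heapq.heappush(
            event_queue,
            (timestamp + job.duration, next(seq), "operation_complete", machine),
        )

        # Log for observability
        print(f"t={timestamp}: assigned job {job.id} to machine {machine.id} via rule {rule}")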

Observability and Debugging

Production deployment needs instrumentation beyond the policy output:

Rule selection histogram: Track which dispatching rules the agent picks over time. If it always chooses SPT, the policy may have collapsed to a single heuristic.
State distribution drift: Log the mean and variance of state features (machine utilization, queue depth). If production state drifts outside training bounds, retrain or switch to a fallback.
Counterfactual logging: Record what each rule would have chosen, not just the selected rule. This lets you replay decisions offline and measure regret.
Queue depth alerts: If the job queue grows beyond a threshold, the agent is falling behind. Trigger a human review or temporarily switch to a conservative rule (FCFS) to prevent starvation.
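The first two checks above can be sketched with the standard library alone. The class and function names, the sliding window size, the 95% collapse threshold, and the z-score limit of 3 are all assumptions chosen for illustration; sensible values depend on the workload.

```python
from collections import Counter, deque


class RuleHistogram:
    """Sliding-window share of each dispatching rule the agent selected."""

    def __init__(self, window=1000, collapse_share=0.95):
        self.recent = deque(maxlen=window)
        self.collapse_share = collapse_share

    def record(self, rule_name):
        self.recent.append(rule_name)

    def shares(self):
        total = len(self.recent)
        if total == 0:
            return {}
        return {rule: n / total for rule, n in Counter(self.recent).items()}

    def collapsed(self):
        # True when one rule accounts for nearly every recent decision.
        shares = self.shares()
        return bool(shares) and max(shares.values()) >= self.collapse_share


def drifted_features(live_means, train_means, train_stds, z_limit=3.0):
    """Features whose live mean is more than z_limit training std devs away."""
    flagged = []
    for name, live in live_means.items():
        std = train_stds[name]
        if std > 0 and abs(live - train_means[name]) / std > z_limit:
            flagged.append(name)
    return flagged
```

Calling `record` after every dispatch and checking `collapsed()` and `drifted_features(...)` on a schedule gives an early signal before completion times visibly degrade.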

When to Retrain

The policy is not static. Retrain when:

Arrival distribution shifts: A new product line changes the mix of job types.
Machine fleet changes: You add faster machines or retire old ones.
Performance degrades: Total completion time creeps up by more than 10% over a week.

Retraining workflow:

Collect production traces (state, action, reward) for the past N days.
Augment with synthetic scenarios (machine failures, burst arrivals).
Run PPO for a few thousand episodes.
Shadow the new policy in production (log decisions but do not execute) for 24 hours.
If shadow performance beats the live policy by >5%, promote to production.
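The retraining trigger and the promotion gate reduce to two threshold checks. The sketch below uses the 10% and 5% figures from the text; the function names are hypothetical, and how the shadow policy's completion time is estimated (for example, by replaying logged events in a simulator) is left open here.

```python
def relative_improvement(live_completion, shadow_completion):
    """Fractional reduction in total completion time (positive is better)."""
    return (live_completion - shadow_completion) / live_completion


def should_promote(live_completion, shadow_completion, min_gain=0.05):
    """Promote only when the shadow policy beats the live one by more than min_gain."""
    return relative_improvement(live_completion, shadow_completion) > min_gain


def needs_retrain(baseline_completion, recent_completion, max_regression=0.10):
    """Flag retraining when completion time has crept up past max_regression."""
    regression = (recent_completion - baseline_completion) / baseline_completion
    return regression > max_regression
```

For example, `should_promote(1000.0, 900.0)` is true because the shadow policy is 10% better, while `should_promote(1000.0, 960.0)` is false because a 4% gain does not clear the bar.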

Technical Verdict

Use this pattern when:

Jobs arrive unpredictably and you cannot afford to recompute a global schedule every time.
The problem is too large for MILP solvers to handle in real time (hundreds of jobs, dozens of machines).
You can tolerate 5-10% suboptimality compared to an offline optimal solution in exchange for microsecond dispatch latency.
You have historical data or a simulator to generate training episodes.

Avoid when:

The problem is small enough for MILP to solve in seconds (fewer than 50 jobs, 10 machines).
Arrivals are predictable and you can precompute schedules overnight.
The cost of a bad decision is catastrophic (safety-critical systems, financial trades with regulatory constraints). In these cases, use RL to generate candidate plans but require human or rule-based approval before execution.
You lack the infrastructure to retrain regularly. A stale policy will degrade as the environment drifts.

The core insight is event-driven dispatching: instead of replanning the entire schedule, the agent makes a local decision (which rule to apply right now) based on the current state. This trades global optimality for speed and adaptability. The same pattern applies to order routing in trading (which venue gets the next slice?), task assignment in warehouses (which robot picks the next item?), and resource allocation in CI/CD (which build gets the next runner?). The plumbing is identical: state extraction, policy inference, action execution, and continuous retraining on production traces.