Radiology Worklist Agents: How AWS Built an AI System That Routes 2.2M Studies Without Cherry-Picking

Traditional radiology worklist systems let radiologists cherry-pick easy cases while complex studies sit in the queue. Research across 62 hospitals analyzing 2.2 million studies found that inefficient case assignment causes 17.7-minute delays for expedited cases and costs of $2.1M to $4.2M across hospital networks.

AWS published a production case study showing how agent-based worklist optimization solves this problem in a regulated, life-critical domain. The system uses Amazon Bedrock AgentCore and Strands Agents SDK to route studies based on radiologist specialization, current workload, fatigue levels, and case complexity.

This is not a chatbot wrapper. It is a multi-agent task assignment architecture that prevents human cherry-picking while satisfying SLA constraints in a healthcare environment.

The Cherry-Picking Problem

Rule-based worklist engines route studies according to predefined logic. Static specialty matching ignores context: whether the available radiologist has been interpreting complex cases for several consecutive hours, or whether a straightforward follow-up scan truly warrants subspecialist attention.

Workload balancing responds to current queue depth rather than anticipating demands based on case complexity, estimated interpretation time, or physician fatigue patterns. Most critically, no learning occurs. When deterministic rules produce suboptimal assignments, the same inefficient patterns repeat until someone manually updates the underlying logic.

The result is predictable: radiologists pick easier, higher-value cases. Complex studies accumulate. Diagnostic delays increase. Costs rise.

Architecture: Orchestration Layer and Queue Design

The system uses a multi-agent architecture where each agent has a specific responsibility:

Complexity Scoring Agent: Analyzes DICOM metadata, prior study history, and clinical context to assign a complexity score to each incoming study.
Radiologist Profile Agent: Maintains a dynamic model of each radiologist’s specialization, current workload, recent case history, and estimated fatigue level.
Assignment Orchestrator: Receives complexity scores and radiologist profiles, then routes studies to maximize throughput while respecting SLA constraints.
Feedback Loop Agent: Captures actual interpretation time, quality metrics, and manual overrides to refine future assignments.

The queue architecture prevents starvation when every radiologist rejects high-complexity cases. Each study has a priority score that increases the longer it waits. If no radiologist accepts a study within a threshold window, the system escalates by broadening the eligibility pool or triggering a manual review alert.
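
The aging and escalation behavior can be sketched in a few lines. This is a minimal illustration, not the AWS implementation: the aging rate, the 30-minute window, the class name, and the order of the two escalation steps are all assumptions, since the case study does not publish them.

```python
from dataclasses import dataclass

# Hypothetical tuning values; the article does not state the real ones.
AGING_POINTS_PER_MINUTE = 0.5
ESCALATION_WINDOW_MINUTES = 30.0


@dataclass
class QueuedStudy:
    study_id: str
    base_priority: float    # priority when the study entered the queue
    minutes_waiting: float  # time spent unassigned so far


def effective_priority(study: QueuedStudy) -> float:
    """Priority grows linearly with waiting time, so old studies rise to the top."""
    return study.base_priority + AGING_POINTS_PER_MINUTE * study.minutes_waiting


def next_action(study: QueuedStudy, pool_already_broadened: bool) -> str:
    """Decide what to do with a study that nobody has accepted yet."""
    if study.minutes_waiting < ESCALATION_WINDOW_MINUTES:
        return "keep_waiting"
    # Assumed order: widen the pool first, alert a human only if that fails.
    if not pool_already_broadened:
        return "broaden_pool"
    return "manual_review_alert"
```

Because the priority never stops growing, a study that is repeatedly skipped eventually outranks everything newer, which is what keeps complex cases from sitting in the queue indefinitely.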

State Management

The orchestrator maintains three state tables:

Active Study Queue: Studies awaiting assignment, sorted by priority score and SLA deadline.
Radiologist Availability: Current workload, fatigue score (based on consecutive complex cases and time since last break), and specialization tags.
Assignment History: Past assignments, interpretation times, and quality metrics for feedback loop training.

State updates happen in real time. When a radiologist completes a study, the system recalculates their fatigue score and workload capacity before considering them for the next assignment.
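
One way to picture the three tables is as plain in-memory structures. The sketch below is an assumption-heavy illustration: the class and field names are hypothetical, a production system would back these with a durable store, and the tie-break (higher priority first, then earlier SLA deadline) is one reading of "sorted by priority score and SLA deadline".

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class QueueEntry:
    # Only sort_key takes part in comparisons: higher priority first
    # (negated for the min-heap), then the earlier SLA deadline.
    sort_key: tuple = field(init=False, repr=False)
    study_id: str = field(compare=False)
    priority: float = field(compare=False)
    sla_deadline_epoch: float = field(compare=False)

    def __post_init__(self):
        self.sort_key = (-self.priority, self.sla_deadline_epoch)


@dataclass
class RadiologistState:
    radiologist_id: str
    current_workload: int = 0
    fatigue_score: float = 0.0
    specializations: set = field(default_factory=set)


@dataclass
class AssignmentRecord:
    study_id: str
    radiologist_id: str
    interpretation_minutes: float
    overridden: bool


class OrchestratorState:
    def __init__(self):
        self.queue = []         # Active Study Queue (heap of QueueEntry)
        self.radiologists = {}  # Radiologist Availability, keyed by ID
        self.history = []       # Assignment History

    def enqueue(self, entry):
        heapq.heappush(self.queue, entry)

    def pop_next(self):
        return heapq.heappop(self.queue)

    def record_completion(self, record):
        """Free one workload slot and keep the record for the feedback loop."""
        rad = self.radiologists[record.radiologist_id]
        rad.current_workload = max(0, rad.current_workload - 1)
        self.history.append(record)
```

Fatigue recalculation is left out here; it would run inside `record_completion` before the radiologist is considered for the next study.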

Complexity Scoring Without Manual Rules

The Complexity Scoring Agent uses a fine-tuned model trained on historical study data. Inputs include:

DICOM metadata (modality, body part, protocol)
Prior study history (number of priors, time since last scan)
Clinical context from the order (indication, urgency flags)
Radiologist interpretation time from past similar studies

The model outputs a complexity score from 0 to 100. Scores above 80 trigger subspecialist routing. Scores below 30 can be assigned to general radiologists even if a subspecialist is available.
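
The two thresholds translate into a small routing function. The sketch below uses the cutoffs stated above; the function name and tier labels are hypothetical, and the "standard" label for scores from 30 to 80 is an assumption, since the article does not describe how the middle band is handled.

```python
def routing_tier(complexity_score: float) -> str:
    """Map a 0-100 complexity score to a routing tier."""
    if not 0 <= complexity_score <= 100:
        raise ValueError("complexity score must be between 0 and 100")
    if complexity_score > 80:
        return "subspecialist"   # scores above 80 trigger subspecialist routing
    if complexity_score < 30:
        return "generalist_ok"   # may go to a generalist even if a subspecialist is free
    return "standard"            # assumed label; the middle band is not specified
```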

This approach avoids the manual rule authoring trap. When new protocols or modalities appear, the model adapts through the feedback loop rather than requiring a rule update.

Fatigue Signals and Real-Time Constraints

Fatigue scoring combines three signals:

Consecutive complex cases: Interpreting three high-complexity studies in a row increases fatigue score by 20 points.
Time since last break: Fatigue score increases linearly after 90 minutes of continuous work.
Case velocity: Interpreting studies faster than the historical average for that complexity level suggests either high skill or corner-cutting. The system flags velocity outliers for quality review.

The Assignment Orchestrator uses fatigue scores as a constraint. If a radiologist’s fatigue score exceeds 70, they are only eligible for low-complexity cases until they take a break or complete a simple study.
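
A rough sketch of how the first two signals and the eligibility rule might combine is shown below. Only the 20-point increment, the 90-minute mark, and the fatigue limit of 70 come from the text. The slope of the linear term, the cap at 100, and the reuse of the 80 and 30 complexity cutoffs for "high" and "low" are assumptions. Case velocity is omitted because it is described as a quality flag rather than a score input, and recovery after a break or a simple study is not modeled.

```python
HIGH_COMPLEXITY = 80           # assumed: same cutoff as subspecialist routing
LOW_COMPLEXITY = 30            # assumed: same cutoff as generalist routing
POINTS_PER_MINUTE_OVER = 0.5   # hypothetical slope after 90 minutes
FATIGUE_LIMIT = 70


def fatigue_score(recent_complexities, minutes_since_break):
    """Estimate fatigue from recent case complexity and time since the last break."""
    score = 0.0
    run = 0
    for c in recent_complexities:
        if c > HIGH_COMPLEXITY:
            run += 1
            if run == 3:       # three high-complexity studies in a row
                score += 20
                run = 0
        else:
            run = 0            # an easier case breaks the streak
    # Linear growth once continuous work passes 90 minutes.
    score += max(0.0, minutes_since_break - 90) * POINTS_PER_MINUTE_OVER
    return min(score, 100.0)   # assumed cap


def is_eligible(fatigue, study_complexity):
    """Fatigued radiologists may only take low-complexity studies."""
    if fatigue > FATIGUE_LIMIT:
        return study_complexity < LOW_COMPLEXITY
    return True
```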

SLA constraints are hard limits. If a study has been in the queue for 80% of its SLA window, the orchestrator overrides fatigue and specialization preferences to assign it to the next available radiologist.

Observability and Audit Trail for Regulated Environments

Healthcare AI systems need audit trails that satisfy regulatory requirements. The system logs every assignment decision with:

Study ID and complexity score
Radiologist ID and current state (workload, fatigue, specialization)
Assignment rationale (why this radiologist was chosen)
SLA deadline and time remaining
Manual override flag (if a human changed the assignment)

Logs are immutable and stored in AWS HealthLake for HIPAA compliance. The Feedback Loop Agent uses these logs to train the complexity scoring model and refine assignment heuristics.
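
The logged fields map naturally onto a frozen record type. The sketch below is illustrative only: the class and field names are hypothetical, it uses no AWS API, and a frozen dataclass only prevents mutation inside the process. Immutability of the stored log has to come from the storage layer.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AssignmentLogEntry:
    study_id: str
    complexity_score: float
    radiologist_id: str
    workload: int
    fatigue_score: float
    specializations: tuple   # tuple rather than list, so the record stays hashable
    rationale: str           # why this radiologist was chosen
    sla_deadline_iso: str    # ISO 8601 timestamp
    seconds_remaining: float
    manual_override: bool = False


def to_json_line(entry: AssignmentLogEntry) -> str:
    """Serialize one decision as a single JSON line for an append-only log."""
    return json.dumps(asdict(entry), sort_keys=True)
```

One JSON object per line keeps the log easy to append to and easy for a feedback job to replay in order.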

Observability includes real-time dashboards showing:

Queue depth by complexity tier
Average time to assignment
Radiologist utilization and fatigue distribution
SLA breach rate
Manual override frequency

Manual override frequency is a key metric. If radiologists frequently reject assignments, the system is either scoring complexity incorrectly or ignoring important context. High override rates trigger a model retraining cycle.
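
Override frequency can be tracked with a rolling window. The following is a minimal sketch: the window size of 200 assignments and the 15% threshold are hypothetical defaults, and the class name is illustrative. The article does not state what rate counts as "high".

```python
from collections import deque


class OverrideMonitor:
    """Rolling override rate over the most recent assignments."""

    def __init__(self, window=200, threshold=0.15):
        self._outcomes = deque(maxlen=window)  # True means the assignment was overridden
        self.threshold = threshold

    def record(self, overridden: bool) -> None:
        self._outcomes.append(overridden)

    @property
    def rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def needs_review(self) -> bool:
        # Wait for a full window so a handful of early overrides cannot trip the alarm.
        window_full = len(self._outcomes) == self._outcomes.maxlen
        return window_full and self.rate > self.threshold
```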

Rollback and Manual Override

When an agent makes a bad assignment, the system supports two recovery paths:

Immediate override: A radiologist or workflow coordinator can reject an assignment and manually select a different radiologist. The system logs the override and uses it as a negative training example.
Batch rollback: If a model update causes a spike in override rates, the system can roll back to the previous model version and pause automatic assignments until the issue is resolved.

Rollback is critical in a life-critical domain. The system maintains three model versions at all times: current production, previous stable, and experimental. The orchestrator can switch between them without downtime.
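
The three-slot scheme might look like the sketch below. The slot names follow the text, but the class, the promotion flow, and the decision to park the rolled-back version in the experimental slot are assumptions made for illustration.

```python
class ModelRegistry:
    """Tracks the three model versions and whether auto-assignment is active."""

    def __init__(self, production, previous_stable, experimental):
        self.slots = {
            "production": production,
            "previous_stable": previous_stable,
            "experimental": experimental,
        }
        self.auto_assign_enabled = True

    def promote_experimental(self, new_experimental):
        """Shift experimental into production and keep the old production as fallback."""
        self.slots["previous_stable"] = self.slots["production"]
        self.slots["production"] = self.slots["experimental"]
        self.slots["experimental"] = new_experimental

    def rollback(self):
        """Restore the previous stable version and pause automatic assignment."""
        bad_version = self.slots["production"]
        self.slots["production"] = self.slots["previous_stable"]
        self.slots["experimental"] = bad_version  # assumed: kept for investigation
        self.auto_assign_enabled = False

    def resume(self):
        """Re-enable automatic assignment once a human has cleared the issue."""
        self.auto_assign_enabled = True
```

Switching versions is a dictionary update here; avoiding downtime in practice depends on all three models already being loaded and ready to serve.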

Trade-Offs and Failure Modes

Component	Trade-Off	Failure Mode
Complexity Scoring Agent	Accuracy vs. latency. Fine-tuned models are slower than rule-based heuristics.	Misclassification leads to subspecialist overload or undertriage of complex cases.
Fatigue Scoring	Granularity vs. privacy. Tracking break times and case velocity can feel invasive.	Radiologists game the system by taking unnecessary breaks or slowing down on easy cases.
Assignment Orchestrator	Fairness vs. throughput. Balancing workload evenly reduces peak efficiency.	SLA breaches when all radiologists are fatigued or unavailable.
Feedback Loop Agent	Training frequency vs. stability. Frequent retraining adapts quickly but risks overfitting to recent noise.	Model drift when feedback loop captures biased manual overrides.

The most common failure mode is SLA breach during peak load. When all radiologists are at high fatigue levels and the queue fills with complex cases, the system has no good options. The orchestrator escalates by triggering manual review alerts, but this puts case selection back in human hands and so reintroduces the cherry-picking problem.

A secondary failure mode is model drift. If radiologists consistently override assignments for reasons the system cannot observe (personal preferences, informal specialization), the feedback loop trains on biased data. The system mitigates this by flagging high override rates and pausing retraining until a human reviews the pattern.

Code Snippet: Assignment Decision Logic

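from datetime import datetime, timezone

# score_complexity() and log_assignment() are provided by the scoring and
# audit components. study.subspecialty is an assumed attribute holding the
# specialization tag the study calls for.

def assign_study(study, radiologists, sla_threshold=0.8):
    """
    Assign a study to a radiologist based on complexity, workload, and fatigue.
    Returns None when no radiologist is available, so the caller can escalate.
    """
    complexity = score_complexity(study)
    eligible = [r for r in radiologists if r.is_available()]
    if not eligible:
        return None

    # Hard constraint: SLA deadline. time_remaining is the fraction of the
    # SLA window still left (sla_deadline must be timezone-aware).
    now = datetime.now(timezone.utc)
    time_remaining = (study.sla_deadline - now) / study.sla_window
    if time_remaining <= 1 - sla_threshold:
        # 80% of the window is used: override fatigue and specialization preferences
        assigned = min(eligible, key=lambda r: r.current_workload)
        log_assignment(study, assigned, complexity, time_remaining)
        return assigned

    # Soft constraint: specialization match
    specialists = [r for r in eligible if study.subspecialty in r.specializations]
    if specialists:
        eligible = specialists

    # Soft constraint: fatigue level (fatigued readers may still take low-complexity cases)
    rested = [r for r in eligible if r.fatigue_score <= 70 or complexity < 30]
    if rested:
        eligible = rested

    # Assign to radiologist with lowest workload
    assigned = min(eligible, key=lambda r: r.current_workload)
    log_assignment(study, assigned, complexity, time_remaining)
    return assigned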

This snippet shows the core decision logic. The system applies hard constraints first (SLA deadline), then soft constraints (specialization, fatigue), then optimizes for workload balance. The log captures every decision for audit and feedback.

Technical Verdict

Use this architecture when you have a task assignment problem where humans cherry-pick easy work and complex tasks accumulate. The system works best when you can measure task complexity objectively, track worker state in real time, and enforce SLA constraints.

Avoid this approach if your task assignment logic is simple enough for deterministic rules, or if you cannot tolerate the latency of model inference at assignment time. Also avoid it if your workers are likely to game fatigue signals, or if manual overrides are so frequent that the feedback loop cannot learn anything useful from them.

The key insight is that agent-based assignment is not about replacing human judgment. It is about preventing the tragedy of the commons where individual incentives (pick easy cases) conflict with system goals (minimize delays, balance workload). The agents enforce fairness while learning from human overrides.

In a regulated domain like healthcare, the audit trail and rollback mechanisms are as important as the assignment logic. If you cannot explain why an agent made a decision, or if you cannot roll back a bad model update without downtime, the system is not production-ready.

Source Links

AWS Machine Learning Blog: Intelligent radiology workflow optimization with AI agents