AWS just published a reference architecture that shows how to build multi-agent systems without writing your own state synchronization, memory layer, or observability plumbing. The stack combines Strands Agents for orchestration, NVIDIA NIM for GPU inference, and Amazon Bedrock AgentCore for managed runtime primitives.
The example implements a marketing campaign review system where parallel agents evaluate content, share context, and produce traceable execution paths. The interesting part is not the use case. It is how AgentCore handles the infrastructure problems that usually force teams to build custom solutions: shared memory across agents, trace propagation through inference calls, and state recovery when tool calls fail.
Architecture Components
The stack has three layers:
- NVIDIA NIM: GPU-accelerated inference endpoints for model execution
- Amazon Bedrock AgentCore: Managed runtime that provides shared memory, observability hooks, and execution context
- Strands Agents: Orchestration framework that decides when to run agents in parallel versus sequentially
AgentCore sits between the orchestration layer and the inference layer. It maintains a shared memory space that agents can read and write without custom locking logic. It also injects trace IDs into every inference call so you can follow execution paths through the Bedrock observability dashboard.
Strands handles the workflow graph. It spawns agents in parallel when the campaign review needs multiple perspectives (brand compliance, tone analysis, factual accuracy). It runs agents sequentially when one agent’s output feeds into another’s input.
Shared Memory and State Synchronization
The campaign review workflow has multiple agents updating a shared review state object. Each agent appends findings, scores, and recommendations. Without coordination, you get race conditions where agents overwrite each other’s updates or read stale data.
AgentCore provides a managed memory layer that handles this synchronization. Agents write to named memory slots using a versioned key-value API. The runtime serializes concurrent writes to the same slot, and a write based on a stale version is rejected rather than silently applied. Agents reading from memory always get the latest committed version.
Here is the memory access pattern:
```python
# memory_client: a configured AgentCore memory client (call shape is illustrative)

# Agent writes review findings to shared memory
memory_client.put(
    memory_id="campaign_review_123",
    key="brand_compliance_findings",
    value={
        "score": 8.5,
        "issues": ["logo placement", "color variance"],
        "timestamp": "2026-05-27T04:13:25Z",
    },
    version_check="latest",
)

# Another agent reads accumulated findings
findings = memory_client.get(
    memory_id="campaign_review_123",
    keys=["brand_compliance_findings", "tone_analysis_findings"],
)
```
The version_check parameter prevents lost updates. If another agent modified the same key between your read and write, the put operation fails and you retry with fresh data.
This removes the need to run your own Redis cluster, implement optimistic locking, or write conflict resolution logic. The trade-off is that you are locked into AgentCore’s memory model. You cannot use custom data structures or implement application-specific merge strategies.
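To make the `version_check` behavior concrete, here is a minimal in-process sketch of optimistic concurrency on a versioned key-value store. It is not the AgentCore API: `VersionedMemory`, `ConflictError`, and the `expected_version` argument are hypothetical names, and a `threading.Lock` stands in for the runtime's write serialization.

```python
import threading
from dataclasses import dataclass
from typing import Any, Optional


class ConflictError(Exception):
    """Raised when a conditional write finds a newer version than the caller read."""


@dataclass(frozen=True)
class Versioned:
    value: Any
    version: int


class VersionedMemory:
    """Hypothetical in-process stand-in for a versioned key-value memory store."""

    def __init__(self) -> None:
        self._data: dict[tuple[str, str], Versioned] = {}
        self._lock = threading.Lock()  # serializes all reads and writes

    def get(self, memory_id: str, key: str) -> Optional[Versioned]:
        with self._lock:
            return self._data.get((memory_id, key))

    def put(self, memory_id: str, key: str, value: Any, expected_version: int) -> int:
        """Commit only if the slot is still at expected_version (0 means not written yet)."""
        with self._lock:
            current = self._data.get((memory_id, key))
            current_version = current.version if current else 0
            if current_version != expected_version:
                raise ConflictError(
                    f"{key}: expected v{expected_version}, found v{current_version}"
                )
            self._data[(memory_id, key)] = Versioned(value, current_version + 1)
            return current_version + 1


memory = VersionedMemory()
# Both agents read the empty slot (version 0) before either one writes.
seen_by_brand_agent = seen_by_tone_agent = 0
memory.put("campaign_review_123", "brand_compliance_findings", {"score": 8.5}, seen_by_brand_agent)
try:
    memory.put("campaign_review_123", "brand_compliance_findings", {"score": 7.0}, seen_by_tone_agent)
except ConflictError as err:
    print("rejected, re-read and retry:", err)
```

The second write is rejected because it was based on version 0 while the slot is now at version 1, so that agent has to re-read before retrying.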
Trace Propagation Through Inference Calls
When an agent makes a tool call that triggers an NVIDIA NIM inference request, you need to connect that inference event back to the originating agent and the broader workflow execution. Without trace propagation, you see isolated logs from the inference endpoint but cannot reconstruct the decision path that led to that call.
AgentCore injects trace context into every inference request. The context includes:
- Workflow execution ID
- Agent instance ID
- Parent span ID from the orchestration layer
- Memory snapshot version
NVIDIA NIM endpoints are configured to accept and propagate these trace headers. When the inference completes, the response includes the same trace context. AgentCore writes this to the observability backend so you can query execution graphs that span orchestration, memory access, and model inference.
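A sketch of the propagation round trip, assuming the context travels as request headers and the endpoint echoes them back in its response. The `x-trace-*` header names, `TraceContext`, and `fake_inference_endpoint` are hypothetical; the actual wire format AgentCore and NIM use is not described here.

```python
import uuid
from dataclasses import asdict, dataclass, fields

HEADER_PREFIX = "x-trace-"  # hypothetical header namespace, not a documented wire format


@dataclass(frozen=True)
class TraceContext:
    """Trace context attached to every inference request and echoed in the response."""
    workflow_execution_id: str
    agent_instance_id: str
    parent_span_id: str
    memory_snapshot_version: int

    def to_headers(self) -> dict[str, str]:
        # memory_snapshot_version -> "x-trace-memory-snapshot-version"
        return {
            HEADER_PREFIX + name.replace("_", "-"): str(value)
            for name, value in asdict(self).items()
        }

    @classmethod
    def from_headers(cls, headers: dict[str, str]) -> "TraceContext":
        values = {}
        for f in fields(cls):
            raw = headers[HEADER_PREFIX + f.name.replace("_", "-")]
            values[f.name] = int(raw) if f.type in (int, "int") else raw
        return cls(**values)


def fake_inference_endpoint(prompt: str, headers: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Stand-in for a NIM endpoint that accepts trace headers and echoes them back."""
    echoed = {k: v for k, v in headers.items() if k.startswith(HEADER_PREFIX)}
    return f"completion for: {prompt}", echoed


ctx = TraceContext(
    workflow_execution_id="wf-" + uuid.uuid4().hex[:8],
    agent_instance_id="brand_compliance-1",
    parent_span_id=uuid.uuid4().hex[:16],
    memory_snapshot_version=3,
)
_, response_headers = fake_inference_endpoint("Check logo placement", ctx.to_headers())
# The response carries the same context, so it can be joined back to the agent's span.
assert TraceContext.from_headers(response_headers) == ctx
```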
The observability dashboard shows:
- Which agent triggered each inference call
- How long the agent waited for the response
- What memory state the agent read before making the call
- Whether the inference succeeded or returned a malformed tool call
This is useful when debugging why an agent made a bad decision. You can see the exact context window, memory snapshot, and tool call sequence that led to the error.
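Reconstructing the decision path is essentially walking parent links from the failing span back to the workflow root, then looking at what the same agent did just before the call. A sketch over hypothetical span records (the `Span` fields are assumptions, not the dashboard's export format, and list order stands in for start time):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    status: str = "ok"


def decision_path(spans: list[Span], leaf_id: str) -> list[str]:
    """Span names from the workflow root down to leaf_id."""
    by_id = {s.span_id: s for s in spans}
    path = []
    current = by_id.get(leaf_id)
    while current is not None:
        path.append(current.name)
        current = by_id.get(current.parent_id) if current.parent_id else None
    return list(reversed(path))


def preceding_siblings(spans: list[Span], leaf_id: str) -> list[str]:
    """Spans under the same parent recorded before leaf_id, e.g. memory reads."""
    leaf = next(s for s in spans if s.span_id == leaf_id)
    earlier = []
    for s in spans:
        if s.span_id == leaf_id:
            break
        if s.parent_id == leaf.parent_id:
            earlier.append(s.name)
    return earlier


spans = [
    Span("s1", None, "workflow:campaign_review_123"),
    Span("s2", "s1", "agent:tone_analysis"),
    Span("s3", "s2", "memory.get:brand_compliance_findings"),
    Span("s4", "s2", "nim.inference", status="malformed_tool_call"),
]
failed = next(s for s in spans if s.status != "ok")
print(" -> ".join(decision_path(spans, failed.span_id)))
print("before the call:", preceding_siblings(spans, failed.span_id))
```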
Parallel Agent Execution and Context Window Management
The campaign review workflow spawns three agents in parallel: brand compliance, tone analysis, and factual accuracy. Each agent needs access to the original campaign content plus any findings from previous review rounds.
Strands decides whether to run agents in parallel based on the workflow graph. If agents do not depend on each other’s outputs, they run concurrently. If one agent’s output is an input to another, they run sequentially.
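The parallel-versus-sequential rule amounts to dependency-driven scheduling. This `asyncio` sketch runs, in waves, every agent whose inputs are ready, so independent agents overlap and consumers wait for their producers. It is not the Strands API; `run_graph`, the dependency map, and the `final_summary` agent are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable

AgentFn = Callable[[dict[str, str]], Awaitable[str]]


async def run_graph(agents: dict[str, AgentFn], deps: dict[str, list[str]]) -> dict[str, str]:
    """Run agents in dependency order; agents whose inputs are ready run concurrently."""
    results: dict[str, str] = {}
    pending = set(agents)
    while pending:
        ready = [n for n in pending if all(d in results for d in deps.get(n, []))]
        if not ready:
            raise ValueError(f"cycle or missing dependency among {sorted(pending)}")
        # Each agent receives only the outputs of the agents it depends on.
        outputs = await asyncio.gather(
            *(agents[n]({d: results[d] for d in deps.get(n, [])}) for n in ready)
        )
        results.update(zip(ready, outputs))
        pending.difference_update(ready)
    return results


def make_agent(label: str) -> AgentFn:
    async def agent(inputs: dict[str, str]) -> str:
        await asyncio.sleep(0.01)  # stands in for an inference call
        return f"{label}({', '.join(sorted(inputs))})"
    return agent


names = ["brand_compliance", "tone_analysis", "factual_accuracy", "final_summary"]
agents = {name: make_agent(name) for name in names}
deps = {"final_summary": ["brand_compliance", "tone_analysis", "factual_accuracy"]}
results = asyncio.run(run_graph(agents, deps))
print(results["final_summary"])
```

Running in waves is simpler than starting each agent the moment its last dependency finishes, at the cost of idle time when one agent in a wave is much slower than the others.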
The problem is context window limits. If each agent appends findings to shared memory, the accumulated review history can exceed the model’s context window. AgentCore handles this by maintaining a rolling context buffer per agent.
Each agent gets:
- The original campaign content (always included)
- A summary of previous review rounds (compressed)
- The agent’s own findings from the current round (full detail)
The runtime compresses older findings using a separate summarization model. This keeps the context window under the limit while preserving enough history for the agent to avoid repeating work.
When an agent writes findings to memory, AgentCore checks the total context size. If it exceeds a threshold, the runtime triggers background summarization and updates the memory slot with the compressed version. Agents reading from memory get the compressed summary for old rounds and full detail for recent rounds.
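A minimal sketch of the per-agent buffer described above. Assumptions: tokens are approximated by whitespace-separated words, `summarize` is a placeholder for the separate summarization model, the threshold is arbitrary, and compression runs inline rather than in the background.

```python
from dataclasses import dataclass, field


def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words."""
    return len(text.split())


def summarize(findings: list[str]) -> str:
    """Placeholder for the summarization model: keep the first sentence of each finding."""
    return " ".join(f.split(".")[0].strip() + "." for f in findings)


@dataclass
class RollingContext:
    campaign_content: str          # always included in full
    token_threshold: int = 2000    # hypothetical limit
    summary: str = ""              # compressed earlier rounds
    older_rounds: list[str] = field(default_factory=list)
    current_round: list[str] = field(default_factory=list)

    def add_finding(self, finding: str) -> None:
        self.current_round.append(finding)
        self._compress_if_needed()

    def start_new_round(self) -> None:
        self.older_rounds.extend(self.current_round)
        self.current_round = []
        self._compress_if_needed()

    def _compress_if_needed(self) -> None:
        # Only earlier rounds are compressed; the current round keeps full detail.
        if self.older_rounds and count_tokens(self.build()) > self.token_threshold:
            self.summary = (self.summary + " " + summarize(self.older_rounds)).strip()
            self.older_rounds = []

    def build(self) -> str:
        parts = [self.campaign_content]
        if self.summary:
            parts.append("Earlier rounds (summary): " + self.summary)
        parts += self.older_rounds + self.current_round
        return "\n".join(parts)


ctx = RollingContext("Spring launch copy: Fresh colors for a fresh season.", token_threshold=20)
ctx.add_finding("Logo placement is off-center. It sits 12px too far left on mobile.")
ctx.add_finding("Color variance in the hero image. The teal is outside the brand palette.")
ctx.start_new_round()
ctx.add_finding("Tone is upbeat and on-brand.")
print(ctx.build())
```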
Failure Modes and State Recovery
Three failure scenarios matter in production:
- Inference timeout: NVIDIA NIM endpoint does not respond within the deadline
- Malformed tool call: Model returns JSON that does not match the tool schema
- Memory write conflict: Two agents try to update the same memory slot simultaneously
For inference timeouts, AgentCore retries the request up to three times with exponential backoff. If all retries fail, the agent execution is marked as failed and the orchestration layer decides whether to retry the entire agent or fail the workflow.
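The timeout policy can be sketched with `asyncio.wait_for` and exponential backoff. This reads "up to three times" as three retries after the first attempt; the timeout, delays, and function names are hypothetical, and the code mimics the described behavior rather than AgentCore's implementation.

```python
import asyncio
from typing import Awaitable, Callable


class InferenceFailed(Exception):
    """Raised after the first attempt and every retry have timed out."""


async def call_with_retries(
    call: Callable[[], Awaitable[str]],
    timeout_s: float,
    max_retries: int = 3,
    base_delay_s: float = 0.5,
) -> str:
    """One attempt plus up to max_retries retries, sleeping base_delay_s * 2**n between them."""
    for attempt in range(max_retries + 1):
        try:
            return await asyncio.wait_for(call(), timeout=timeout_s)
        except asyncio.TimeoutError:
            if attempt == max_retries:
                raise InferenceFailed(f"timed out after {attempt + 1} attempts") from None
            await asyncio.sleep(base_delay_s * 2 ** attempt)
    raise AssertionError("unreachable")  # keeps type checkers happy


calls = 0


async def flaky_nim_endpoint() -> str:
    """Hangs on the first two calls, then answers."""
    global calls
    calls += 1
    if calls <= 2:
        await asyncio.sleep(1)  # simulates a request that misses the deadline
    return '{"tool": "score_tone", "arguments": {"score": 7}}'


result = asyncio.run(call_with_retries(flaky_nim_endpoint, timeout_s=0.05, base_delay_s=0.01))
```

If `InferenceFailed` escapes, the caller (the orchestration layer, in this article's terms) decides whether to rerun the whole agent or fail the workflow.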
For malformed tool calls, AgentCore validates the response against the tool schema before passing it to the agent. If validation fails, the runtime logs the raw response and returns a structured error to the agent. The agent can choose to retry with a different prompt or escalate the error to the orchestration layer.
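A stdlib-only sketch of validating a tool call before it reaches the agent. The schema format (tool name plus required argument types), `ToolCallError`, and the `score_tone` tool are hypothetical; a real deployment would more likely validate against a JSON Schema.

```python
import json
import logging
from dataclasses import dataclass
from typing import Any, Union

logger = logging.getLogger("tool_call_validation")


@dataclass
class ToolCallError:
    """Structured error handed back to the agent in place of a malformed call."""
    reason: str
    raw_response: str


# Hypothetical tool schema: tool name plus required arguments and accepted types.
SCORE_TONE_SCHEMA: dict[str, Any] = {
    "name": "score_tone",
    "arguments": {"score": (int, float), "rationale": str},
}


def _reject(reason: str, raw: str) -> ToolCallError:
    logger.warning("malformed tool call (%s): %r", reason, raw)  # keep the raw response
    return ToolCallError(reason, raw)


def validate_tool_call(raw: str, schema: dict[str, Any]) -> Union[dict[str, Any], ToolCallError]:
    """Return the parsed call if it matches the schema, otherwise a ToolCallError."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return _reject(f"invalid JSON: {exc.msg}", raw)
    if not isinstance(call, dict) or call.get("tool") != schema["name"]:
        return _reject(f"expected tool {schema['name']!r}", raw)
    args = call.get("arguments")
    if not isinstance(args, dict):
        return _reject("missing 'arguments' object", raw)
    for arg, expected_type in schema["arguments"].items():
        if arg not in args:
            return _reject(f"missing argument {arg!r}", raw)
        if not isinstance(args[arg], expected_type):
            return _reject(f"argument {arg!r} has the wrong type", raw)
    return call


good = validate_tool_call('{"tool": "score_tone", "arguments": {"score": 7, "rationale": "upbeat"}}', SCORE_TONE_SCHEMA)
bad = validate_tool_call('{"tool": "score_tone", "arguments": {"score": "high"}}', SCORE_TONE_SCHEMA)
truncated = validate_tool_call('{"tool": "score_tone"', SCORE_TONE_SCHEMA)
```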
For memory write conflicts, the runtime rejects the write and returns a conflict error. The agent must re-read the latest memory state and retry the write. This prevents silent data loss but requires agents to implement retry logic.
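Agent-side retry logic for write conflicts is a read-modify-write loop. Here is a self-contained sketch with a hypothetical `SlotStore` whose `put` fails when the slot changed since it was read, and eight threads standing in for agents that append to one slot concurrently:

```python
import threading
from typing import Any, Callable


class ConflictError(Exception):
    pass


class SlotStore:
    """Hypothetical versioned slot store: put() fails if the slot changed since it was read."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._slots: dict[str, tuple[Any, int]] = {}

    def get(self, key: str) -> tuple[Any, int]:
        with self._lock:
            return self._slots.get(key, (None, 0))

    def put(self, key: str, value: Any, expected_version: int) -> None:
        with self._lock:
            _, version = self._slots.get(key, (None, 0))
            if version != expected_version:
                raise ConflictError(key)
            self._slots[key] = (value, version + 1)


def update_with_retry(store: SlotStore, key: str, update: Callable[[Any], Any], max_attempts: int = 10) -> Any:
    """Re-read the latest value and re-apply `update` each time a write conflicts."""
    for _ in range(max_attempts):
        current, version = store.get(key)
        new_value = update(current)
        try:
            store.put(key, new_value, expected_version=version)
            return new_value
        except ConflictError:
            continue  # another agent committed first; loop re-reads fresh data
    raise ConflictError(f"{key}: gave up after {max_attempts} attempts")


store = SlotStore()
workers = [
    threading.Thread(
        target=update_with_retry,
        args=(store, "campaign_review_123/findings", lambda cur, i=i: (cur or []) + [f"agent-{i}"]),
    )
    for i in range(8)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
findings, version = store.get("campaign_review_123/findings")
```

The update function must be safe to re-run, because it may be applied to fresh data several times before a write succeeds.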
In-flight agent state is not automatically recovered. If an agent crashes mid-execution, the workflow restarts from the last checkpoint. Strands defines checkpoints at agent boundaries, so you lose at most one agent’s work. AgentCore does not provide automatic state snapshots within an agent’s execution.
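Checkpointing at agent boundaries can be sketched as persisting workflow state after each agent completes and skipping completed agents on restart. The JSON file, `run_with_checkpoints`, and the simulated crash are hypothetical; the source does not describe how Strands stores checkpoints.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

Agent = Callable[[dict], dict]


def run_with_checkpoints(agents: list[tuple[str, Agent]], checkpoint: Path) -> dict:
    """Run agents in order, saving state after each one; nothing inside an agent is saved."""
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    else:
        state = {"completed": [], "findings": {}}
    for name, agent in agents:
        if name in state["completed"]:
            continue  # finished before the crash; do not redo it
        state["findings"][name] = agent(state["findings"])
        state["completed"].append(name)
        checkpoint.write_text(json.dumps(state))  # checkpoint at the agent boundary
    return state


runs = {"brand_compliance": 0, "tone_analysis": 0}
crash_next_tone_run = True


def brand_agent(findings: dict) -> dict:
    runs["brand_compliance"] += 1
    return {"score": 8.5}


def tone_agent(findings: dict) -> dict:
    global crash_next_tone_run
    runs["tone_analysis"] += 1
    if crash_next_tone_run:
        crash_next_tone_run = False
        raise RuntimeError("agent crashed mid-execution")
    return {"score": 7}


checkpoint = Path(tempfile.mkdtemp()) / "campaign_review_123.json"
pipeline = [("brand_compliance", brand_agent), ("tone_analysis", tone_agent)]
try:
    run_with_checkpoints(pipeline, checkpoint)
except RuntimeError:
    pass  # first run dies inside tone_analysis
final_state = run_with_checkpoints(pipeline, checkpoint)
```

On the second run `brand_compliance` is skipped and only `tone_analysis` starts over, which is the "at most one agent's work" loss described above.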
Observability and Cost Control
The observability dashboard shows:
- Execution graph with timing for each agent and inference call
- Memory access patterns (reads, writes, conflicts)
- Token usage per agent and per inference call
- Error rates and retry counts
You can filter by workflow execution ID, agent type, or time range. The dashboard also shows cost breakdowns by agent and by model. This helps identify which agents are expensive and whether parallel execution is worth the cost.
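The per-agent and per-model breakdown is an aggregation over token-usage records. In this sketch the record fields, model names, and prices are placeholders, not real Bedrock or NIM pricing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    agent: str
    model: str
    input_tokens: int
    output_tokens: int


# Placeholder (input, output) prices in USD per 1,000 tokens; substitute your actual rates.
PRICES = {"model-a": (0.001, 0.002), "model-b": (0.0005, 0.0015)}


def cost(record: UsageRecord) -> float:
    in_rate, out_rate = PRICES[record.model]
    return record.input_tokens / 1000 * in_rate + record.output_tokens / 1000 * out_rate


def breakdown(records: list[UsageRecord]) -> tuple[dict[str, float], dict[str, float]]:
    """Total cost grouped by agent and by model."""
    by_agent: dict[str, float] = defaultdict(float)
    by_model: dict[str, float] = defaultdict(float)
    for r in records:
        c = cost(r)
        by_agent[r.agent] += c
        by_model[r.model] += c
    return dict(by_agent), dict(by_model)


records = [
    UsageRecord("brand_compliance", "model-a", 2000, 500),
    UsageRecord("tone_analysis", "model-b", 4000, 1000),
    UsageRecord("tone_analysis", "model-a", 1000, 1000),
]
by_agent, by_model = breakdown(records)
for agent, total in sorted(by_agent.items(), key=lambda kv: -kv[1]):
    print(f"{agent}: ${total:.4f}")
```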
Cost control happens at two levels:
- Workflow level: Set a maximum token budget for the entire execution. If agents exceed the budget, the workflow terminates early.
- Agent level: Set a maximum inference time per agent. If an agent does not complete within the deadline, it is killed and the workflow continues with partial results.
These limits are configured in the Strands workflow definition. AgentCore enforces them at runtime and logs violations to the observability backend.
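A sketch of both limits using `asyncio`: a shared `TokenBudget` that raises once the workflow cap is crossed, and a per-agent deadline after which the agent is dropped and the remaining results are kept. The class and function names are hypothetical and do not reflect the Strands workflow-definition format.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


class BudgetExceeded(Exception):
    pass


@dataclass
class TokenBudget:
    """Workflow-level cap on total tokens across all agents."""
    max_tokens: int
    used: int = 0

    def charge(self, tokens: int) -> None:
        self.used += tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(f"used {self.used} of {self.max_tokens} tokens")


AgentFn = Callable[[TokenBudget], Awaitable[str]]


async def run_agents(agents: dict[str, AgentFn], budget: TokenBudget, agent_deadline_s: float) -> dict[str, str]:
    """Run agents concurrently; drop any that miss the deadline and keep the rest."""
    async def guarded(name: str, fn: AgentFn) -> tuple[str, Optional[str]]:
        try:
            return name, await asyncio.wait_for(fn(budget), timeout=agent_deadline_s)
        except asyncio.TimeoutError:
            return name, None  # killed at the deadline; workflow continues without it

    pairs = await asyncio.gather(*(guarded(n, f) for n, f in agents.items()))
    return {name: out for name, out in pairs if out is not None}


async def brand_agent(budget: TokenBudget) -> str:
    budget.charge(120)  # tokens reported by the inference call
    return "score 8.5"


async def slow_tone_agent(budget: TokenBudget) -> str:
    await asyncio.sleep(1)  # misses the deadline below
    return "score 7"


budget = TokenBudget(max_tokens=1000)
partial = asyncio.run(
    run_agents({"brand_compliance": brand_agent, "tone_analysis": slow_tone_agent}, budget, agent_deadline_s=0.05)
)
```

`BudgetExceeded` is deliberately not caught, so it propagates out of `asyncio.gather` and ends the run, which corresponds to terminating the workflow early.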
Trade-Offs and Constraints
| Aspect | Managed (AgentCore) | Custom Infrastructure |
|---|---|---|
| State synchronization | Built-in versioned key-value store | You build locking, conflict resolution |
| Trace propagation | Automatic across orchestration, memory, inference | You instrument every layer |
| Context window management | Automatic summarization and rolling buffer | You implement compression logic |
| Failure recovery | Retry at agent boundaries, no mid-execution snapshots | You decide granularity and recovery strategy |
| Cost visibility | Per-agent token usage in dashboard | You aggregate logs and correlate costs |
| Vendor lock-in | Locked to AWS Bedrock and AgentCore APIs | Portable across cloud providers |
The managed approach removes infrastructure work but limits flexibility. You cannot implement custom memory models, change the trace format, or control summarization behavior. You also pay for the managed service on top of inference costs.
The custom approach gives you full control but requires building and operating the state layer, observability pipeline, and failure recovery logic. This makes sense if you have specific requirements that AgentCore does not support or if you need to run across multiple cloud providers.
Technical Verdict
Use this stack when:
- You are building multi-agent systems that need shared memory and parallel execution
- You want observability without custom instrumentation
- You are already on AWS and can tolerate vendor lock-in
- Your agents fit within AgentCore’s memory model (versioned key-value store)
Avoid this stack when:
- You need custom state synchronization logic (e.g., CRDTs, application-specific merge strategies)
- You require mid-execution state snapshots for fine-grained recovery
- You need to run agents across multiple cloud providers
- You have strict cost constraints and cannot afford the managed service overhead
The architecture is production-ready for workflows where the managed primitives match your requirements. The observability and state management are solid. The limitations are in flexibility, not reliability.