12-Factor Agents: Production Principles for LLM Applications That Don't Hallucinate in Production

Engineering principles for production-grade agent systems: context budgets, tool isolation, observability, and deployment patterns beyond frameworks.

Source: github.com
Most production agent systems don’t use frameworks. They roll their own stack because the gap between “agent demo” and “customer-facing reliability” is wider than the frameworks admit.

The 12-Factor Agents methodology adapts the original 12-factor app principles for LLM-specific constraints: token budgets, non-deterministic failures, prompt versioning, and tool call observability. It’s not a framework. It’s a checklist for the plumbing decisions you make when you can’t rely on LangChain to handle production edge cases.

Why Frameworks Don’t Ship

The methodology's author tested every major agent framework (CrewAI, LangChain, LangGraph, Griptape, smolagents) and found a pattern: strong founders building customer-facing agents mostly write their own orchestration layer. Frameworks optimize for demos. Production systems need:

  • Token budget enforcement at every context assembly step
  • Tool call isolation so one bad function doesn’t poison the session
  • Prompt versioning that survives model updates
  • Observability hooks that trace non-deterministic failures
  • Rollback semantics when the agent hallucinates in production

Frameworks bundle these concerns into opaque abstractions. Production teams need explicit control.

The 12 Factors for Agent Systems

The methodology covers deployment, state management, and runtime isolation. Here’s the breakdown:

| Factor | Principle | LLM-Specific Constraint |
|---|---|---|
| 1. Codebase | One codebase, many deploys | Prompt templates live in version control, not config files |
| 2. Dependencies | Explicitly declare tool dependencies | Model API versions, tool schemas, and retrieval indexes are dependencies |
| 3. Config | Store config in environment | API keys, model names, temperature settings stay out of code |
| 4. Backing Services | Treat vector DBs and APIs as attached resources | Swap Pinecone for Weaviate without changing agent logic |
| 5. Build/Release/Run | Strict separation of stages | Prompt compilation happens at build time, not runtime |
| 6. Processes | Execute as stateless processes | Agent state lives in external stores, not in-memory |
| 7. Port Binding | Export services via port binding | Agents expose HTTP/gRPC interfaces for tool calls |
| 8. Concurrency | Scale out via process model | Horizontal scaling for parallel tool execution |
| 9. Disposability | Fast startup and graceful shutdown | Agents checkpoint state before termination |
| 10. Dev/Prod Parity | Keep environments similar | Same model versions and tool schemas across stages |
| 11. Logs | Treat logs as event streams | Tool calls, token usage, and state transitions stream to observability layer |
| 12. Admin Processes | Run admin tasks as one-off processes | Prompt testing and context window profiling run separately |

Factor 3: Own Your Context Window

This is the factor that breaks most agent deployments. “Own your context window” means you architect retrieval, caching, and prompt assembly to stay under token budgets without sacrificing task completion.

Context Budget Architecture

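class ContextBudget:
    def __init__(self, vector_db, session_store, count_tokens, max_tokens=8000):
        # vector_db, session_store, and count_tokens are supplied by the
        # application; their interfaces are the ones used in the methods below.
        self.vector_db = vector_db
        self.session_store = session_store
        self.count_tokens = count_tokens
        self.max_tokens = max_tokens
        # Fixed reservations come off the top before any dynamic content
        self.reserved = {
            "system_prompt": 500,
            "tool_definitions": 1000,
            "output_buffer": 1500
        }
        self.available = max_tokens - sum(self.reserved.values())

    def allocate_retrieval(self, query_embedding):
        # Fetch top-k docs that fit budget
        budget = int(self.available * 0.6)  # 60% for context
        docs = self.vector_db.search(
            query_embedding,
            max_tokens=budget
        )
        return self.truncate_to_budget(docs, budget)

    def allocate_history(self, session_id):
        # Keep recent turns, drop old ones
        budget = int(self.available * 0.4)  # 40% for history
        turns = self.session_store.get_turns(session_id)
        return self.sliding_window(turns, budget)

    def truncate_to_budget(self, items, budget):
        # Take items in order until the next one would exceed the budget
        total = 0
        result = []
        for item in items:
            tokens = self.count_tokens(item)
            if total + tokens > budget:
                break
            result.append(item)
            total += tokens
        return result

    def sliding_window(self, turns, budget):
        # Walk backwards from the newest turn, then restore chronological order
        recent = self.truncate_to_budget(reversed(turns), budget)
        return recent[::-1]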

The pattern: reserve token space for the system prompt, tool definitions, and the model's output first, then split the remaining budget between retrieval context and conversation history. If you don’t enforce this at the orchestration layer, the agent will hit token limits mid-task and fail silently.
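As a worked example of that arithmetic, here are the default numbers from the class above run through the 60/40 split. The variable names are illustrative only, and integer math is used so the budgets stay whole tokens.

```python
# Worked budget arithmetic using the defaults from ContextBudget above.
max_tokens = 8000
reserved = {"system_prompt": 500, "tool_definitions": 1000, "output_buffer": 1500}

# Fixed reservations come off the top: 8000 - 3000 = 5000
available = max_tokens - sum(reserved.values())

# 60% of what is left goes to retrieved documents
retrieval_budget = available * 60 // 100

# The remainder (40%) goes to conversation history
history_budget = available - retrieval_budget

print(available, retrieval_budget, history_budget)  # 5000 3000 2000
```

With an 8,000-token window, only 3,000 tokens are actually available for retrieved documents, which is why top-k has to be chosen by token count and not by document count.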

Retrieval vs. History Trade-offs

| Allocation Strategy | Use Case | Failure Mode |
|---|---|---|
| 80% retrieval, 20% history | Document Q&A, knowledge lookup | Loses conversation context, repeats questions |
| 50% retrieval, 50% history | Multi-turn debugging, support chat | Misses relevant docs, hallucinates facts |
| 20% retrieval, 80% history | Code generation, iterative refinement | Forgets domain knowledge, reinvents solutions |

You pick the split based on task type. A common refinement is dynamic allocation: measure retrieval relevance scores and conversation turn count, then adjust the split per request.
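One way dynamic allocation could look is sketched below. The function name `choose_split`, the 10-turn saturation point, the 0.3 weight, and the 20%/80% clamps are all assumptions made for illustration, not values from the methodology. It also assumes relevance scores are normalized to the range 0 to 1.

```python
def choose_split(relevance_scores, turn_count, min_share=0.2, max_share=0.8):
    """Return (retrieval_share, history_share) for one request.

    Hypothetical heuristic: strong retrieval hits pull budget toward
    documents, long conversations pull it toward history.
    """
    # Average relevance of the retrieved candidates, 0.0 when nothing matched.
    if relevance_scores:
        relevance = sum(relevance_scores) / len(relevance_scores)
    else:
        relevance = 0.0
    # Saturating measure of conversation depth: 0 turns -> 0.0, 10+ turns -> 1.0.
    depth = min(turn_count, 10) / 10
    # Start from an even split and shift by the difference of the two signals.
    retrieval_share = 0.5 + 0.3 * (relevance - depth)
    # Clamp so neither side is ever starved completely.
    retrieval_share = max(min_share, min(max_share, retrieval_share))
    return retrieval_share, 1.0 - retrieval_share


# First message of a session with strong retrieval hits: favors documents.
print(choose_split([0.9, 0.9], turn_count=0))
# Long conversation with no useful hits: favors history.
print(choose_split([], turn_count=12))
```

The clamps matter more than the exact weights: they keep the split inside the range covered by the table above, so a single odd request cannot push either side to zero.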

Observability: Logging Tool Calls and State Transitions

Factor 11 (logs as event streams) is where you catch non-deterministic failures before customers do. Agent systems need structured logging for:

  • Tool call traces: function name, arguments, return value, latency
  • Token usage: input tokens, output tokens, total cost per request
  • State transitions: what triggered the transition, what state changed
  • Failure modes: timeout, rate limit, schema mismatch, hallucinated tool call

Structured Event Schema

interface AgentEvent {
  timestamp: string;
  session_id: string;
  event_type: 
    | "tool_call_start"
    | "tool_call_success"
    | "tool_call_error"
    | "state_transition"
    | "token_budget_exceeded";
  payload: {
    tool_name?: string;
    arguments?: Record<string, unknown>;
    result?: unknown;
    error?: string;
    tokens_used?: number;
    state_before?: string;
    state_after?: string;
  };
}

Stream these events to your observability backend (Datadog, Honeycomb, custom warehouse). The goal: when an agent fails in production, you reconstruct the exact sequence of tool calls and context window state that led to the failure.
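A minimal Python sketch of producing these events is shown below, assuming they are written as one JSON object per line and shipped by a log forwarder. The helper names `emit_event` and `traced_tool_call` are hypothetical, and `latency_ms` is an extra payload field that the interface above does not declare.

```python
import json
import sys
import time
from datetime import datetime, timezone


def emit_event(session_id, event_type, payload, stream=sys.stdout):
    """Write one AgentEvent as a single JSON line and return the dict."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,
        "event_type": event_type,
        "payload": payload,
    }
    # default=str keeps logging from raising on values json cannot encode.
    stream.write(json.dumps(event, default=str) + "\n")
    return event


def traced_tool_call(session_id, tool_name, fn, arguments, stream=sys.stdout):
    """Call fn(**arguments), emitting start and success/error events around it."""
    emit_event(session_id, "tool_call_start",
               {"tool_name": tool_name, "arguments": arguments}, stream)
    started = time.monotonic()
    try:
        result = fn(**arguments)
    except Exception as exc:
        emit_event(session_id, "tool_call_error",
                   {"tool_name": tool_name, "error": str(exc)}, stream)
        raise  # the caller still decides how to handle the failure
    latency_ms = round((time.monotonic() - started) * 1000, 2)
    emit_event(session_id, "tool_call_success",
               {"tool_name": tool_name, "result": result,
                "latency_ms": latency_ms}, stream)
    return result
```

Wrapping every tool invocation in one function like this means the start and end events cannot drift apart, which is what makes the sequence reconstructable later.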

Deployment Isolation: Versioning Prompts and Tool Registries

Factor 5 (build/release/run separation) means you compile prompts at build time and version them alongside code. This removes one common cause of the “works in dev, hallucinates in prod” problem: prompts that differ between environments.

Prompt Compilation Pipeline

  1. Build stage: Jinja templates + schema validation
  2. Release stage: Tag prompts with git SHA, store in artifact registry
  3. Run stage: Load prompt by version hash, never modify at runtime

# prompts/system.yaml
version: "1.2.3"
template: |
  You are a customer support agent. You have access to:
  {% for tool in tools %}
  - {{ tool.name }}: {{ tool.description }}
  {% endfor %}
  
  Current user context:
  - Account tier: {{ user.tier }}
  - Recent tickets: {{ user.recent_tickets | length }}

tools:
  - name: search_kb
    schema: ./schemas/search_kb.json
  - name: create_ticket
    schema: ./schemas/create_ticket.json

At build time, validate that all tool schemas exist and match the LLM’s function calling format. At deploy time, pin the prompt version so rollbacks are deterministic.
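The build-time check could be sketched as follows, assuming the `tools` list from the YAML above has already been parsed into Python dicts. Both function names are hypothetical, and the `"parameters"` key check is an assumed convention for schema files, not a rule from any specific LLM provider. The version hash here is a content hash, swapped in for the git SHA so the example does not depend on a repository.

```python
import hashlib
import json
from pathlib import Path


def validate_tool_schemas(tools, base_dir):
    """Return a list of problems; an empty list means the build may proceed."""
    problems = []
    for tool in tools:
        path = Path(base_dir) / tool["schema"]
        if not path.is_file():
            problems.append(f"{tool['name']}: missing schema file {tool['schema']}")
            continue
        try:
            schema = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{tool['name']}: invalid JSON ({exc.msg})")
            continue
        # Assumed convention: each schema is a JSON object with a "parameters" key.
        if not isinstance(schema, dict) or "parameters" not in schema:
            problems.append(f"{tool['name']}: schema has no 'parameters' object")
    return problems


def prompt_version_hash(template, tools):
    """Content hash that pins a release to an exact template and tool list."""
    canonical = json.dumps({"template": template, "tools": tools}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

A build script would fail the pipeline when `validate_tool_schemas` returns anything, and store the hash next to the release so the run stage can refuse to load a prompt that does not match it.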

Tool Registry Versioning

Tools (functions the agent can call) need the same versioning discipline:

from typing import Callable


class ToolNotFoundError(LookupError):
    """Raised when a name@version pair was never registered."""


class ToolRegistry:
    def __init__(self):
        self.tools = {}      # "name@version" -> callable
        self.versions = {}   # name -> list of registered versions

    def register(self, name: str, version: str, fn: Callable):
        key = f"{name}@{version}"
        self.tools[key] = fn
        self.versions.setdefault(name, []).append(version)

    def call(self, name: str, version: str, args: dict):
        key = f"{name}@{version}"
        if key not in self.tools:
            raise ToolNotFoundError(f"{key} not registered")
        return self.tools[key](**args)

When you update a tool’s signature, register it as a new version. Old agent deployments keep calling the old version. New deployments get the new version. No runtime surprises.
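To make the old-versus-new behavior concrete, here is a small sketch with two releases pinned to different versions of the same tool. It uses a stripped-down stand-in for the registry so it runs on its own. The pin dictionaries, `call_pinned`, and both `search_kb` implementations are hypothetical.

```python
# Minimal stand-in for the registry above so the example runs on its own.
tools = {}


def register(name, version, fn):
    tools[f"{name}@{version}"] = fn


def search_kb_v1(query):
    # Original signature: fixed result limit.
    return {"query": query, "limit": 10}


def search_kb_v2(query, limit=5):
    # New signature: caller may choose the limit.
    return {"query": query, "limit": limit}


register("search_kb", "1.0.0", search_kb_v1)
register("search_kb", "2.0.0", search_kb_v2)

# Each release ships its own pins; these two dicts are hypothetical.
OLD_RELEASE_PINS = {"search_kb": "1.0.0"}
NEW_RELEASE_PINS = {"search_kb": "2.0.0"}


def call_pinned(pins, name, args):
    """Resolve the version from the release's pins, then dispatch."""
    return tools[f"{name}@{pins[name]}"](**args)


old = call_pinned(OLD_RELEASE_PINS, "search_kb", {"query": "refund"})
new = call_pinned(NEW_RELEASE_PINS, "search_kb", {"query": "refund", "limit": 3})
```

Both versions stay registered side by side, so rolling back a release only means going back to its pins. No tool code has to be redeployed.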

Stateless Processes: Checkpointing Agent State

Factor 6 (stateless processes) means agent state lives in external stores, not in process memory. This enables horizontal scaling and graceful shutdown.

State Checkpoint Pattern

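from concurrent.futures import ThreadPoolExecutor


class AgentSession:
    def __init__(self, session_id: str, store: "StateStore", tool_timeout: float = 30):
        self.session_id = session_id
        self.store = store
        self.tool_timeout = tool_timeout
        # Resume from the last checkpoint, or start fresh
        self.state = store.load(session_id) or self.initial_state()

    def initial_state(self):
        return {"tool_results": []}

    def execute_turn(self, user_input: str):
        # Assemble the prompt from checkpointed state plus the new input
        # (build_context is application-specific and not shown here)
        context = self.build_context(user_input)

        # Call LLM (llm is the application's model client)
        response = llm.complete(context)

        # Execute tool calls
        for tool_call in response.tool_calls:
            result = self.execute_tool(tool_call)
            self.state["tool_results"].append(result)

        # Checkpoint state before returning
        self.store.save(self.session_id, self.state)

        return response.content

    def execute_tool(self, tool_call):
        # Run the tool in a worker thread so a hung call cannot block the turn.
        # The thread itself is not killed on timeout; only the wait is abandoned.
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(
            tool_registry.call,
            tool_call.name,
            tool_call.version,
            tool_call.arguments
        )
        try:
            result = future.result(timeout=self.tool_timeout)
            return {"success": True, "result": result}
        except Exception as e:  # includes the timeout case
            return {"success": False, "error": str(e) or type(e).__name__}
        finally:
            pool.shutdown(wait=False)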

Checkpoint state after every turn. If the process crashes mid-task, the next process picks up from the last checkpoint and loses at most the turn that was in flight. This is what lets a session survive pod evictions in Kubernetes or spot instance terminations.
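The store behind `load` and `save` is left open above. As a stand-in, here is a minimal file-backed sketch with that same two-method interface. `FileStateStore` is a hypothetical name, session IDs are assumed to be trusted and safe to use as file names, and a real deployment would more likely point at Redis, Postgres, or a shared volume so the checkpoint outlives the node.

```python
import json
import os
import tempfile
from pathlib import Path


class FileStateStore:
    """JSON-file checkpoint store exposing load(session_id) and save(session_id, state)."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, session_id):
        return self.root / f"{session_id}.json"

    def load(self, session_id):
        """Return the last checkpoint, or None when the session is new."""
        try:
            return json.loads(self._path(session_id).read_text(encoding="utf-8"))
        except FileNotFoundError:
            return None

    def save(self, session_id, state):
        """Write to a temp file, then rename, so readers never see a partial file."""
        fd, tmp = tempfile.mkstemp(dir=self.root, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as handle:
                json.dump(state, handle)
                handle.flush()
                os.fsync(handle.fileno())
            # os.replace swaps the file in a single rename step.
            os.replace(tmp, self._path(session_id))
        except BaseException:
            os.unlink(tmp)
            raise
```

The write-then-rename step is the part worth keeping whatever the backend: a crash during `save` leaves the previous checkpoint intact instead of a half-written one.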

Failure Modes and Mitigations

Production agent systems fail in predictable ways. Here’s what breaks and how to catch it:

| Failure Mode | Symptom | Mitigation |
|---|---|---|
| Token budget exceeded | Truncated responses, incomplete tool calls | Enforce budget at context assembly, reject requests that can’t fit |
| Tool call hallucination | Agent invents function names or arguments | Validate tool calls against schema before execution |
| State corruption | Agent forgets context mid-conversation | Checkpoint state after every turn, validate state schema on load |
| Rate limit cascade | All requests fail after one spike | Implement exponential backoff, circuit breaker per tool |
| Prompt drift | Model update changes behavior | Pin model version in config, test prompts against new versions before deploy |

The 12-factor approach doesn’t prevent these failures. It makes them observable and recoverable.
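As one example of the mitigations in the table, the sketch below validates a model-proposed tool call before anything executes. It assumes each tool has a schema with JSON Schema style `required` and `properties` keys and checks only a small subset of that format. The function name and the error strings are hypothetical.

```python
# Maps JSON Schema type names to Python types (subset, for illustration).
JSON_TYPES = {"string": str, "integer": int, "number": (int, float),
              "boolean": bool, "object": dict, "array": list}


def validate_tool_call(name, arguments, schemas):
    """Return a list of errors for a proposed tool call; empty means valid."""
    if name not in schemas:
        return [f"unknown tool: {name}"]
    schema = schemas[name]
    properties = schema.get("properties", {})
    errors = [f"missing argument: {key}"
              for key in schema.get("required", []) if key not in arguments]
    for key, value in arguments.items():
        if key not in properties:
            errors.append(f"unexpected argument: {key}")
            continue
        expected = JSON_TYPES.get(properties[key].get("type"))
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        wrong_bool = isinstance(value, bool) and expected is not bool
        if expected and (wrong_bool or not isinstance(value, expected)):
            errors.append(f"wrong type for {key}: expected {properties[key]['type']}")
    return errors


schemas = {
    "create_ticket": {
        "required": ["title"],
        "properties": {"title": {"type": "string"}, "priority": {"type": "integer"}},
    }
}
print(validate_tool_call("delete_everything", {}, schemas))
print(validate_tool_call("create_ticket", {"priority": "high"}, schemas))
```

Returning a list of errors, and not raising on the first one, lets the orchestration layer log the rejection as an event and feed every problem back to the model in a single retry.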

Technical Verdict

Use 12-Factor Agents when you’re building customer-facing agent systems that need to survive model updates, scale horizontally, and debug non-deterministic failures. The methodology is overkill for internal tools or proof-of-concept demos.

When to adopt:

  • You’re rolling your own orchestration layer because frameworks don’t fit
  • You need to version prompts and tools independently of code deploys
  • You’re debugging production failures that don’t reproduce in dev
  • You’re scaling beyond a single process and need stateless execution

When to skip:

  • You’re prototyping and iteration speed matters more than reliability
  • Your agent runs in a controlled environment with fixed inputs
  • You’re using a framework that already enforces these principles (rare)

The real value is the checklist. Most teams discover these principles the hard way, after the first production incident. Starting with explicit context budgets, tool versioning, and state checkpointing saves you from rebuilding the orchestration layer six months in.