Most explanations of agentic AI skip the plumbing. They describe what agents do (multi-step tasks, tool use, reasoning) but not how the orchestration loop actually works. A simple Python assistant built with Google Gemini Pro exposes the mechanics: how the LLM decides when to invoke a tool versus when to synthesize a final answer, how tool outputs feed back into the context window, and what happens when a tool call fails mid-task.
The difference between a chatbot and an agent is not the model. It’s the control flow. A chatbot answers a question and stops. An agent receives a goal, plans steps, invokes tools, evaluates results, and loops until the goal is satisfied or it hits a failure boundary.
The Minimal Agent Architecture
An agent needs three components:
- Reasoning engine: An LLM that generates plans and decides which tool to invoke next.
- Tool registry: A set of functions the agent can call (web search, calculator, database query).
- Orchestration loop: The control flow that feeds tool outputs back into the LLM until the task completes.
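A tool registry can be as simple as a dictionary from tool names to Python callables. The sketch below is one possible shape, not the Gemini Pro implementation: `TOOLS`, `register_tool`, and the stubbed `web_search` are hypothetical names introduced for illustration.

```python
import operator
from typing import Callable, Dict

# Hypothetical registry: maps a tool name to a plain Python callable.
TOOLS: Dict[str, Callable[..., object]] = {}

def register_tool(name: str):
    """Decorator that adds a function to the registry under `name`."""
    def decorator(func):
        TOOLS[name] = func
        return func
    return decorator

# Supported calculator operations (an assumed, deliberately small set).
OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": operator.truediv,
}

@register_tool("calculator")
def calculator(a: float, b: float, op: str) -> float:
    """Apply one arithmetic operation to two numbers."""
    if op not in OPS:
        raise ValueError(f"unsupported operation: {op}")
    return OPS[op](a, b)

@register_tool("web_search")
def web_search(query: str) -> str:
    """Stub: a real tool would call a search API here."""
    return f"(stub) no live results for: {query}"

def execute_tool(name: str, args: dict) -> object:
    """Look up a tool by name and call it with keyword arguments."""
    return TOOLS[name](**args)
```

With this in place, `execute_tool("calculator", {"a": 350, "b": 4, "op": "mul"})` dispatches to the calculator function through the registry.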
The example task: “Figure out how much it would cost for a team of 4 to fly to Madrid next month.” This requires the agent to:
- Recognize it needs flight price data (invoke web search tool).
- Parse the search result to extract a price.
- Multiply the price by four (invoke calculator tool).
- Synthesize the final answer.
The orchestration loop is where the work happens. The LLM does not execute tools directly. It emits a structured tool invocation request (usually JSON). The orchestrator catches that request, executes the tool, serializes the result, and appends it to the conversation history before calling the LLM again.
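To make "catches that request" concrete, here is a minimal sketch of turning raw model output into either a tool call or a final answer. The wire format (`{"tool": ..., "args": ...}`) and the `ParsedResponse` type are assumptions for illustration; real function-calling APIs usually return structured objects rather than raw JSON text.

```python
import json
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ParsedResponse:
    is_tool_call: bool
    tool_name: Optional[str] = None
    tool_args: dict = field(default_factory=dict)
    content: Optional[str] = None

def parse_llm_output(raw: str) -> ParsedResponse:
    """Classify raw model output as a tool call or a final answer."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        # Not JSON at all: treat it as a plain-text final answer.
        return ParsedResponse(is_tool_call=False, content=raw)
    if isinstance(payload, dict) and "tool" in payload:
        args = payload.get("args", {})
        # Guard against non-dict arguments so later **args calls are safe.
        if not isinstance(args, dict):
            args = {}
        return ParsedResponse(True, tool_name=payload["tool"], tool_args=args)
    # Valid JSON without a "tool" key is also treated as a final answer.
    return ParsedResponse(is_tool_call=False, content=raw)
```

Anything that does not parse as a tool request falls through to the final-answer branch, so the orchestrator always has a defined next step.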
How Tool Invocation Works
The LLM receives the user goal and a description of available tools. It generates a response that either:
- Invokes a tool (structured JSON with tool name and parameters).
- Returns a final answer (plain text synthesis).
Here’s the decision boundary:
```python
# Illustrative pseudocode showing the orchestration pattern.
# See the GitHub repository for the complete Gemini Pro implementation.
def agent_loop(user_goal, tools, max_iterations=10):
    messages = [{"role": "user", "content": user_goal}]
    for iteration in range(max_iterations):
        response = llm.generate(messages, tools=tools)
        if response.is_tool_call:
            tool_name = response.tool_name
            tool_args = response.tool_args
            # Execute the tool
            tool_result = execute_tool(tool_name, tool_args)
            # Append tool result to conversation history
            messages.append({
                "role": "assistant",
                "content": None,
                "tool_calls": [{"name": tool_name, "args": tool_args}]
            })
            messages.append({
                "role": "tool",
                "name": tool_name,
                "content": str(tool_result)
            })
        else:
            # LLM returned a final answer
            return response.content
    raise TimeoutError("Agent exceeded max iterations")
```
The orchestrator does not interpret the tool output. It serializes it as a string and hands it back to the LLM. The LLM decides whether the result satisfies the goal or whether it needs to invoke another tool.
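To watch the loop run end to end without an API key, the model can be replaced with a scripted stand-in. Everything below is an assumption made for illustration: `ScriptedLLM` replays canned responses, the tools are stubs, and the 350 EUR fare is made-up sample data, not a real price.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Response:
    is_tool_call: bool
    tool_name: Optional[str] = None
    tool_args: dict = field(default_factory=dict)
    content: Optional[str] = None

class ScriptedLLM:
    """Stand-in for a real model: replays a fixed list of responses."""
    def __init__(self, script):
        self._script = list(script)

    def generate(self, messages, tools=None):
        return self._script.pop(0)

def execute_tool(name, args):
    # Stub tools with made-up data; a real registry would call live services.
    if name == "web_search":
        return "Sample fare to Madrid: 350 EUR per person"
    if name == "calculator":
        return args["a"] * args["b"]
    raise KeyError(name)

def agent_loop(llm, user_goal, tools, max_iterations=10):
    """Same control flow as the pseudocode, with the LLM passed in."""
    messages = [{"role": "user", "content": user_goal}]
    for _ in range(max_iterations):
        response = llm.generate(messages, tools=tools)
        if not response.is_tool_call:
            return response.content, messages
        result = execute_tool(response.tool_name, response.tool_args)
        messages.append({
            "role": "assistant",
            "content": None,
            "tool_calls": [{"name": response.tool_name, "args": response.tool_args}],
        })
        messages.append({"role": "tool", "name": response.tool_name, "content": str(result)})
    raise TimeoutError("Agent exceeded max iterations")

llm = ScriptedLLM([
    Response(True, "web_search", {"query": "flights to Madrid next month"}),
    Response(True, "calculator", {"a": 350, "b": 4}),
    Response(False, content="About 1400 EUR for four travellers."),
])
answer, history = agent_loop(
    llm, "Cost for a team of 4 to fly to Madrid?", ["web_search", "calculator"]
)
```

After the run, `history` holds the user goal followed by two call/result pairs, which is exactly the growth pattern that the next section has to manage.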
State Management and Context Window Limits
Every tool invocation adds two messages to the conversation history: the tool call and the tool result. On a multi-step task, the context window fills quickly.
The naive approach is to append everything. This works for short tasks but breaks when:
- The agent needs to invoke five or six tools in sequence.
- Tool outputs are verbose (API responses, web scrape results).
- The LLM’s context window is small (for example, 8k tokens).
Two strategies for managing state:
- Summarization: After each tool invocation, ask the LLM to summarize the result before appending it to history.
- Sliding window: Keep only the last N tool call/result pairs in context and discard older ones.
The Gemini Pro implementation uses a hybrid approach. It keeps the original user goal and the most recent tool call/result pair in full, but summarizes intermediate steps.
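A rough sketch of that hybrid strategy follows. It is not the repository's code: `compact_history` and `truncate` are hypothetical helpers, the history is assumed to be the goal followed by call/result pairs, and plain truncation stands in for the LLM summarization call.

```python
def truncate(text, limit=60):
    """Cheap stand-in for an LLM summary: cut the text to `limit` characters."""
    return text if len(text) <= limit else text[:limit] + "..."

def compact_history(messages, summarize):
    """Keep the goal and the latest call/result pair; summarize the rest.

    Assumes `messages` is [goal, call, result, call, result, ...].
    """
    if len(messages) <= 3:
        # Only the goal and at most one pair: nothing to compact.
        return list(messages)
    goal, middle, latest = messages[0], messages[1:-2], messages[-2:]
    tool_results = [m for m in middle if m["role"] == "tool"]
    summary_text = " | ".join(
        f'{m["name"]}: {summarize(m["content"])}' for m in tool_results
    )
    # Using the "user" role for the summary message is an assumption.
    summary = {"role": "user", "content": f"Summary of earlier steps: {summary_text}"}
    return [goal, summary, *latest]

history = [
    {"role": "user", "content": "Cost for a team of 4 to fly to Madrid?"},
    {"role": "assistant", "content": None,
     "tool_calls": [{"name": "web_search", "args": {"query": "Madrid flights"}}]},
    {"role": "tool", "name": "web_search", "content": "x" * 200},
    {"role": "assistant", "content": None,
     "tool_calls": [{"name": "calculator", "args": {"a": 350, "b": 4}}]},
    {"role": "tool", "name": "calculator", "content": "1400"},
]
compacted = compact_history(history, truncate)
```

The verbose search result is reduced to one short summary line, while the goal and the latest pair survive unchanged.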
Error Boundaries and Failure Modes
The agent can fail in three ways:
- Tool execution failure: The web search API times out, or the calculator receives invalid input.
- LLM hallucination: The LLM invokes a tool that doesn’t exist or passes malformed arguments.
- Infinite loop: The LLM keeps invoking tools without converging on a final answer.
The orchestrator needs explicit error handling for each case:
| Failure Mode | Detection | Recovery Strategy |
|---|---|---|
| Tool execution error | Exception during execute_tool() | Append error message to context, let LLM retry or pivot |
| Invalid tool invocation (LLM hallucination) | Tool name not in registry, or arguments fail validation | Return error to LLM, ask it to choose a valid tool or correct the arguments |
| Infinite loop | Iteration count exceeds threshold | Abort task, return partial results or error |
The key insight: the orchestrator does not try to fix errors itself. It surfaces the error to the LLM and lets the reasoning engine decide whether to retry, use a different tool, or give up.
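One way to surface errors instead of fixing them is a wrapper that always returns a string the LLM can read. This is a sketch under assumptions: `safe_execute` and the `ERROR:` prefix are hypothetical conventions, and the registry is a plain dictionary of callables.

```python
def safe_execute(tool_name, tool_args, registry):
    """Run a tool and always return a string the LLM can read."""
    if tool_name not in registry:
        # Hallucinated tool name: tell the model what is actually available.
        valid = ", ".join(sorted(registry))
        return f"ERROR: unknown tool '{tool_name}'. Valid tools: {valid}"
    try:
        return str(registry[tool_name](**tool_args))
    except TypeError as exc:
        # Typically raised when the model passes wrong or missing arguments.
        return f"ERROR: bad arguments for '{tool_name}': {exc}"
    except Exception as exc:
        # Any other tool failure is reported rather than raised.
        return f"ERROR: '{tool_name}' failed: {exc}"

registry = {"calculator": lambda a, b: a * b}
ok = safe_execute("calculator", {"a": 350, "b": 4}, registry)
unknown = safe_execute("weather", {}, registry)
malformed = safe_execute("calculator", {"a": 350}, registry)
```

Each return value is appended to the conversation as the tool result, so the model sees the failure on its next turn and can retry, pivot, or give up.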
When the Agent Can’t Complete the Task
The agent loop terminates when:
- The LLM returns a final answer (no tool invocation).
- The iteration limit is reached.
- A tool returns a fatal error (API key invalid, network unreachable).
The orchestrator should distinguish between soft failures (retryable) and hard failures (abort immediately). Soft failures get appended to the conversation history. Hard failures raise an exception and exit the loop.
Example: if the web search tool returns zero results, that’s a soft failure. The LLM might decide to rephrase the query or use a different tool. If the web search API returns a 401 Unauthorized, that’s a hard failure. The agent cannot proceed.
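The soft/hard split maps naturally onto two exception types. In this sketch, `RetryableToolError`, `FatalToolError`, `check_search_response`, and `run_tool_step` are all hypothetical names, and treating only HTTP 401 as fatal is an assumption taken from the example above.

```python
class RetryableToolError(Exception):
    """Soft failure: the LLM may rephrase, retry, or switch tools."""

class FatalToolError(Exception):
    """Hard failure: retrying cannot help (e.g. invalid credentials)."""

HARD_STATUS_CODES = {401}  # assumed set; extend per tool as needed

def check_search_response(status_code, results):
    """Classify a (simulated) search API response."""
    if status_code in HARD_STATUS_CODES:
        raise FatalToolError(f"search API returned HTTP {status_code}")
    if not results:
        raise RetryableToolError("search returned zero results")
    return results

def run_tool_step(messages, tool_name, call):
    """Append the tool result, or a soft error, to the history.

    FatalToolError is deliberately not caught: it propagates out of
    the orchestration loop and aborts the task.
    """
    try:
        content = str(call())
    except RetryableToolError as exc:
        content = f"ERROR (retryable): {exc}"
    messages.append({"role": "tool", "name": tool_name, "content": content})
    return messages

messages = run_tool_step([], "web_search", lambda: check_search_response(200, []))
```

A zero-result search ends up in `messages` for the model to react to, while a 401 raises `FatalToolError` straight through `run_tool_step`.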
Observability: What You Need to Log
Debugging an agent requires visibility into the reasoning loop. Log every:
- User goal.
- LLM response (tool call or final answer).
- Tool invocation (name, arguments, execution time).
- Tool result (truncated if verbose).
- Iteration count.
The log should be structured (JSON) so you can trace the decision path. When the agent fails, you need to see which tool call triggered the failure and what the LLM returned at that step.
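A minimal sketch of such logging, using only the standard library, might look like this. The field names (`ts`, `event`, `elapsed_ms`) and the helpers `log_event` and `timed_tool_call` are assumptions for illustration; printing to stdout stands in for a real log sink.

```python
import json
import time

def truncate(text, limit=200):
    """Keep log lines bounded when tool output is verbose."""
    return text if len(text) <= limit else text[:limit] + "...[truncated]"

def log_event(event, **fields):
    """Emit one JSON object per line and return the serialized line."""
    record = {"ts": time.time(), "event": event, **fields}
    line = json.dumps(record, default=str)
    print(line)
    return line

def timed_tool_call(iteration, tool_name, tool_args, func):
    """Execute a tool and log its name, arguments, duration, and result."""
    start = time.perf_counter()
    result = func(**tool_args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    log_event(
        "tool_call",
        iteration=iteration,
        tool=tool_name,
        args=tool_args,
        elapsed_ms=round(elapsed_ms, 2),
        result=truncate(str(result)),
    )
    return result

total = timed_tool_call(1, "calculator", {"a": 350, "b": 4}, lambda a, b: a * b)
```

One JSON object per line keeps the trace easy to filter by `iteration` or `tool` with standard tooling.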
Deployment Shape
The minimal agent runs as a single Python process. The orchestration loop is synchronous: invoke LLM, wait for response, execute tool, repeat.
For production, you need:
- Async tool execution: If the agent invokes multiple tools in parallel, use async/await to avoid blocking.
- Timeout enforcement: Set a wall-clock timeout for the entire task, not just per-iteration.
- Rate limiting: The LLM API has rate limits. Queue requests and retry with exponential backoff.
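For the rate-limiting point, retry with exponential backoff can be sketched as follows. `RateLimitError` is a hypothetical exception standing in for whatever the LLM client raises on a rate-limit response, and the `sleep` parameter is injectable only so the function can be exercised without real delays.

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for a client's rate-limit exception."""

def call_with_backoff(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `func`, retrying on RateLimitError with exponential backoff.

    The delay doubles on each attempt, plus random jitter so that many
    clients do not retry in lockstep.
    """
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

Wrapping the LLM call in the orchestration loop with `call_with_backoff` keeps the retry policy separate from the agent logic.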
The Gemini Pro implementation is synchronous and single-threaded. It works for demo tasks but will not scale to high-concurrency workloads.
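For contrast, an async variant could run independent tool calls concurrently under a single deadline. This sketch uses `asyncio.gather` and `asyncio.wait_for` from the standard library; the two tool coroutines are stubs, and the deadline here covers one batch of tool calls rather than the whole task.

```python
import asyncio

async def web_search(query):
    await asyncio.sleep(0.01)  # stands in for network latency
    return f"results for {query}"

async def db_query(sql):
    await asyncio.sleep(0.01)  # stands in for a database round trip
    return "3 rows"

async def run_tools_in_parallel(calls, timeout_s):
    """Run independent tool coroutines concurrently under one deadline.

    Raises asyncio.TimeoutError if the batch does not finish in time.
    """
    return await asyncio.wait_for(asyncio.gather(*calls), timeout=timeout_s)

results = asyncio.run(
    run_tools_in_parallel(
        [web_search("Madrid flights"), db_query("SELECT 1")],
        timeout_s=1.0,
    )
)
```

`gather` preserves the order of its inputs, so `results[0]` is the search output and `results[1]` is the query output regardless of which finishes first.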
Technical Verdict
Use this architecture when:
- You are prototyping an agent for the first time and need to understand the orchestration mechanics before adding complexity.
- Your tasks are sequential (one tool call at a time) and you can tolerate 2-5 second latency per reasoning step.
- You control the tool set and can define clear error boundaries for each tool.
- You need an explainable agent for low-stakes tasks like research assistance, data lookup, or internal automation where failures are visible and acceptable.
Avoid this architecture when:
- You need sub-second end-to-end latency (the synchronous reasoning loop is inherently slow).
- Tool outputs are unpredictable, untrusted, or require sandboxing (this pattern assumes tools are safe to execute inline).
- You need to scale to hundreds of concurrent agents or handle bursty traffic (you need a distributed orchestrator with queue-based task management).
- Your tasks require parallel tool execution (web search + database query simultaneously) or you need to optimize for token efficiency across long multi-step workflows.
The minimal agent exposes the core mechanics: tool invocation, context management, error handling. It’s a foundation, not a production system. The next step is to add async execution, observability hooks, and failure recovery logic.