Most agent frameworks hide the control loop behind abstractions. NVIDIA NIM lets you build the loop yourself in under 60 lines of Python, exposing exactly how the model decides which tool to call and when to stop.
This is the final piece of a five-part series that started with a basic NIM chat call and added retrieval, guardrails, and self-hosting. Now the model gets two tools (a clock and a retriever) and decides which one to use based on the user’s question.
What the Loop Actually Does
The agent loop is a state machine with three outcomes per iteration:
- Model returns a final answer (stop)
- Model requests a tool call (execute, append result, loop)
- Loop limit reached (stop with error)
The conversation history grows with each turn. When the model requests a tool, you append its assistant message (the one carrying the tool_calls field), then one message with role: tool per call that carries the function output.

    messages = [
        {"role": "system", "content": "You are a USC campus assistant."},
        {"role": "user", "content": "What time is it?"}
    ]
    response = client.chat.completions.create(
        model="meta/llama-3.1-70b-instruct",
        messages=messages,
        tools=tools_schema,
        max_tokens=512
    )
    # Model returns tool_calls instead of content
    message = response.choices[0].message
    if message.tool_calls:
        # Keep the assistant's request in the history; the tool results must follow it
        messages.append(message.model_dump())
        for call in message.tool_calls:
            result = execute_tool(call.function.name, call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result
            })
        # Loop again with updated messages

The model sees the tool schema in the tools parameter, decides whether to call one, and returns structured JSON in tool_calls if it does. Your code executes the function, appends the result, and calls NIM again.
Tool Schema Contract
NIM expects tools in OpenAI function-calling format. Each tool is a JSON object with a name, description, and parameters schema.

    tools_schema = [
        {
            "type": "function",
            "function": {
                "name": "get_current_time",
                "description": "Returns the current time in Los Angeles",
                "parameters": {
                    "type": "object",
                    "properties": {},
                    "required": []
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "search_usc_knowledge",
                "description": "Search USC campus information using semantic retrieval",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "The search query"
                        }
                    },
                    "required": ["query"]
                }
            }
        }
    ]

The model uses the description to decide when to call the tool. If you write “Returns the current time,” the model will call it for time-related questions. If you write “Use this when the user asks about time,” you bias the model toward calling it more often.
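Because the model only ever sees the schema, it can be worth checking its arguments against that schema before dispatching. The following is a minimal sketch, not part of NIM or the OpenAI SDK: `validate_arguments` is a hypothetical helper, and it checks only required and unexpected keys rather than full JSON Schema.

```python
import json

# Trimmed copy of the schema above, so the example runs on its own.
tools_schema = [
    {"type": "function", "function": {
        "name": "get_current_time",
        "description": "Returns the current time in Los Angeles",
        "parameters": {"type": "object", "properties": {}, "required": []},
    }},
    {"type": "function", "function": {
        "name": "search_usc_knowledge",
        "description": "Search USC campus information using semantic retrieval",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string", "description": "The search query"}},
            "required": ["query"],
        },
    }},
]


def validate_arguments(schema, name, arguments_json):
    """Return a list of problems; an empty list means the call looks valid."""
    by_name = {t["function"]["name"]: t["function"]["parameters"] for t in schema}
    if name not in by_name:
        return [f"unknown tool: {name}"]
    try:
        args = json.loads(arguments_json or "{}")
    except json.JSONDecodeError as exc:
        return [f"arguments are not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    params = by_name[name]
    # Hypothetical, shallow checks: key presence only, no type validation.
    problems = [f"missing required argument: {key}"
                for key in params.get("required", []) if key not in args]
    problems += [f"unexpected argument: {key}"
                 for key in args if key not in params.get("properties", {})]
    return problems
```

Returning the problem list to the model as the tool result gives it a chance to correct the call on the next iteration.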
Decision Logic Inside the Model
The model sees the tools array and decides, within a single response, whether to:
- Answer directly (no tool needed)
- Call one tool
- Call multiple tools in parallel
NIM models trained for tool calling (like Llama 3.1 70B Instruct) output a special token sequence that the API parses into the tool_calls field. The model does not execute anything. It returns a structured request.
Your orchestration code maps function.name to a Python function and calls it. The model never sees your function implementation, only the schema and the result you append.
Execution and Error Handling
The simplest execution dispatcher is a dictionary:
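
    import json
    from datetime import datetime
    from zoneinfo import ZoneInfo  # standard library from Python 3.9

    # retriever comes from the retrieval part of this series (assumed to be in scope)
    TOOLS = {
        "get_current_time": lambda: datetime.now(ZoneInfo("America/Los_Angeles")).strftime("%I:%M %p"),
        "search_usc_knowledge": lambda query: retriever.search(query),
    }

    def execute_tool(name, arguments_json):
        if name not in TOOLS:
            return f"Error: Unknown tool {name}"
        try:
            # Parse inside the try block so malformed JSON becomes an error string too
            args = json.loads(arguments_json or "{}")
            # Tool message content must be a string
            return str(TOOLS[name](**args))
        except Exception as e:
            return f"Error executing {name}: {e}"
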
When a tool fails, you have three options:
- Return the error as a string (model sees it, may retry or apologize)
- Retry the tool call (add retry logic in your dispatcher)
- Stop the loop (raise an exception, surface to user)
The first option is simplest. The model sees “Error: Database timeout” and can tell the user “I couldn’t retrieve that information right now.”
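The second option, retrying inside the dispatcher, might look like the sketch below. `call_with_retry` and `flaky_search` are hypothetical names for illustration, and the attempt count and backoff delays are arbitrary assumptions. When every attempt fails, it falls back to the first option and returns an error string.

```python
import time


def call_with_retry(tool_fn, args, attempts=3, base_delay=0.2, sleep=time.sleep):
    """Call tool_fn(**args); retry on any exception with exponential backoff.

    Returns the result as a string, or an error string the model can read
    once all attempts are used up.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return str(tool_fn(**args))
        except Exception as exc:  # broad on purpose: the tool is a black box here
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return f"Error: tool failed after {attempts} attempts: {last_error}"


# Usage with a hypothetical flaky tool that fails twice, then succeeds.
calls = {"count": 0}


def flaky_search(query):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("Database timeout")
    return f"results for {query}"


# sleep is injected so the example does not actually wait
print(call_with_retry(flaky_search, {"query": "parking"}, sleep=lambda s: None))
```

Retrying only makes sense for failures that are likely transient, such as timeouts. A call with bad arguments will fail the same way every time.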
Loop Termination and Guardrails
You need two guardrails:
- Maximum iterations (prevent infinite loops)
- Maximum tool calls per turn (prevent runaway parallel calls)

    MAX_ITERATIONS = 5
    MAX_TOOL_CALLS_PER_TURN = 3

    # Wrapped in a function so the return statements are valid
    def run_loop(messages):
        for iteration in range(MAX_ITERATIONS):
            response = client.chat.completions.create(...)  # same arguments as the first call
            message = response.choices[0].message
            if not message.tool_calls:
                return message.content
            if len(message.tool_calls) > MAX_TOOL_CALLS_PER_TURN:
                return "Error: Too many tool calls requested"
            # Execute tools, append results, continue
        return "Error: Loop limit reached"

The model can get stuck in a loop if it keeps calling the same tool with slightly different arguments. The iteration limit prevents this. The per-turn limit prevents the model from requesting 50 parallel searches.
State Management
The conversation history is your state. Every tool call and result gets appended:

    messages.append(response.choices[0].message.model_dump())  # Assistant message with tool_calls
    for call in response.choices[0].message.tool_calls:
        result = execute_tool(call.function.name, call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "name": call.function.name,
            "content": result
        })

The tool_call_id links the result back to the request. The model uses this to match results to calls when it requested multiple tools in parallel.
If you want to persist state across sessions, serialize the messages array to a database. When the user returns, load it and continue the loop.
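As one possible shape for that persistence, the sketch below stores each session's history as a JSON string in SQLite. The table layout, the function names, and the `user-42` session id are all assumptions made for illustration. It relies on every message being a plain dict, which is why the assistant message is converted with model_dump() before it is appended.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a SQLite store with one row per session."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions "
        "(session_id TEXT PRIMARY KEY, messages TEXT NOT NULL)"
    )
    return conn


def save_messages(conn, session_id, messages):
    """Overwrite the stored history for this session."""
    conn.execute(
        "INSERT OR REPLACE INTO sessions (session_id, messages) VALUES (?, ?)",
        (session_id, json.dumps(messages)),
    )
    conn.commit()


def load_messages(conn, session_id):
    """Return the stored history, or an empty list for a new session."""
    row = conn.execute(
        "SELECT messages FROM sessions WHERE session_id = ?", (session_id,)
    ).fetchone()
    return json.loads(row[0]) if row else []


# Usage: an in-memory database here; pass a file path to keep data across runs.
conn = open_store()
save_messages(conn, "user-42", [{"role": "user", "content": "What time is it?"}])
restored = load_messages(conn, "user-42")
```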
Observability Hooks
You want to log:
- Which tool the model chose
- The arguments it passed
- The result returned
- How many iterations the loop took

    import logging
    import time

    logger = logging.getLogger(__name__)

    def execute_tool_with_logging(name, arguments_json, call_id):
        logger.info(f"Tool call: {name}", extra={
            "call_id": call_id,
            "arguments": arguments_json
        })
        start = time.perf_counter()  # monotonic clock, suited to measuring durations
        result = execute_tool(name, arguments_json)
        duration = time.perf_counter() - start
        logger.info(f"Tool result: {name}", extra={
            "call_id": call_id,
            "duration_ms": duration * 1000,
            "result_length": len(result)
        })
        return result

This gives you a trace of every decision the model made. If you also store the messages array, you can replay a failed loop by re-running it. If your tools handle sensitive data (PII, credentials), mask the logged arguments, and truncate or mask any result content you add to the log, before writing it to disk.
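A masking step could be as small as the sketch below. `mask_for_log` is a hypothetical helper, and the two patterns (email addresses and US Social Security numbers) are illustrative assumptions only; a real deployment needs patterns matched to the data its tools actually handle.

```python
import re

# Illustrative patterns only, not a complete PII detector.
_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]


def mask_for_log(text, max_chars=200):
    """Mask obvious PII, then cut the text down to a short log preview."""
    for pattern, label in _PATTERNS:
        text = pattern.sub(label, text)
    if len(text) > max_chars:
        text = text[:max_chars] + "...[truncated]"
    return text


print(mask_for_log("Contact tommy@usc.edu, SSN 123-45-6789"))
```

Masking runs before truncation so that a value split by the cut cannot slip through unmasked.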
When to Use This vs. a Framework
| Scenario | Raw Loop | Framework |
|---|---|---|
| Two to five tools, simple logic | Good fit | Overkill |
| Need human approval per tool | Good fit (insert an approval step) | May abstract the step away |
| Multi-agent orchestration | Good fit (you control routing) | Helps if it matches your pattern |
| Complex state machines | Fallback when no framework fits | Preferred, if one fits |
| Debugging tool selection | Good fit (full visibility) | Adds indirection |
| Production at scale | Choose if you need control | Choose if you trust it |
Frameworks like LangGraph and AutoGen add state persistence, retry logic, and visual debugging. They also add abstraction layers that hide what the model actually returned.
Start with the raw loop. Add a framework when you need features you don’t want to build yourself.
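The approval step from the table can sit between the model's request and the dispatcher. The sketch below is one way to do it. The names are hypothetical, and the choice of which tools need sign-off is an assumption for illustration.

```python
def execute_with_approval(call_name, arguments_json, execute_tool, approve,
                          needs_approval=frozenset({"search_usc_knowledge"})):
    """Run the tool only if it is exempt or the approver says yes."""
    if call_name in needs_approval and not approve(call_name, arguments_json):
        # The model sees this string and can explain the refusal to the user.
        return f"Error: call to {call_name} was declined by a human reviewer"
    return execute_tool(call_name, arguments_json)


def console_approver(name, arguments_json):
    """Ask on the terminal; any answer other than 'y' declines."""
    reply = input(f"Allow {name} with {arguments_json}? [y/N] ")
    return reply.strip().lower() == "y"
```

In the loop, `execute_with_approval(call.function.name, call.function.arguments, execute_tool, console_approver)` would replace the direct `execute_tool` call. Swapping `console_approver` for a web or chat prompt changes nothing else.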
Failure Modes
Model refuses to call a tool: The description was too vague or the user question didn’t match. Rewrite the description to be more explicit.
Model calls the wrong tool: The descriptions overlap. Make them mutually exclusive or add a routing tool that picks the right one.
Infinite loop: The model keeps calling the same tool. The MAX_ITERATIONS guardrail stops runaway loops globally. For finer control, add a check that stops if the same tool is called twice in a row with identical arguments.
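The identical-call check might be sketched as follows. `RepeatGuard` and `call_signature` are hypothetical names. Arguments are compared after JSON normalization, an added assumption so that whitespace or key order does not hide a duplicate.

```python
import json


def call_signature(name, arguments_json):
    """Normalize a call so whitespace and key order don't hide duplicates."""
    try:
        args = json.loads(arguments_json or "{}")
    except json.JSONDecodeError:
        args = arguments_json  # fall back to comparing the raw string
    return (name, json.dumps(args, sort_keys=True))


class RepeatGuard:
    """Remembers the previous call and flags an identical one right after it."""

    def __init__(self):
        self._last = None

    def is_repeat(self, name, arguments_json):
        signature = call_signature(name, arguments_json)
        repeated = signature == self._last
        self._last = signature
        return repeated
```

Create one guard per user request and call `is_repeat` before dispatching. On a repeat, either stop the loop or return an error string telling the model the call was already made.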
Tool returns too much data: The result exceeds the context window. Truncate the result or summarize it before appending.
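A truncation step can be a few lines. In this sketch, `truncate_result` is a hypothetical helper and the 4,000-character default is an arbitrary assumption. Character count is only a rough proxy for tokens, so leave headroom.

```python
def truncate_result(text, max_chars=4000,
                    marker="\n...[truncated {dropped} characters]"):
    """Keep the head of a long tool result and say how much was dropped."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + marker.format(dropped=len(text) - max_chars)


print(truncate_result("a" * 10, max_chars=4))
```

Including the dropped count in the marker tells the model the result is incomplete, so it can narrow its query instead of treating the excerpt as everything.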
Parallel tool calls fail: One tool succeeds, one fails. The model sees partial results. Either retry the failed call or let the model handle the partial data.
Technical Verdict
Use this pattern if:
- You control all tool implementations. The 60-line loop assumes synchronous execution, which is safe when you know how your tools behave. Third-party APIs with unpredictable latency (>500ms) call for async handling or a framework that manages it.
- Tool execution is fast (<500ms per call). The loop blocks on each tool, so slow tools can push user requests past their timeout. Frameworks like LangGraph handle async tool execution and timeouts.
- You need to inspect every model decision. The raw loop gives you full visibility into tool selection, arguments, and results. Frameworks add indirection that makes debugging harder.
- Your team understands state machines. The loop is simple but requires reasoning about message history, termination conditions, and error propagation. If your team is unfamiliar with these patterns, a framework provides guardrails.
Avoid this pattern if:
- Tools require human approval between calls and you don’t want to build that flow. You can insert an approval step yourself, but frameworks like AutoGen and LangChain have built-in human-in-the-loop primitives.
- You need automatic retries across multiple servers. The raw loop runs in a single process. Distributed retry logic requires state persistence and orchestration that frameworks provide.
- Tool latency is unpredictable. The synchronous loop will block. Frameworks handle async execution, timeouts, and fallback strategies.
- You need visual debugging. LangGraph and similar tools provide DAG visualizations of agent execution. The raw loop requires manual log inspection.
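If latency is the only concern, a per-call timeout is a middle ground before reaching for a framework. The sketch below uses the standard library's thread pool. `execute_with_timeout` is a hypothetical helper and the 0.5-second default simply mirrors the 500ms figure above. Note the limitation: a timed-out tool keeps running in its worker thread, and only the loop stops waiting for it.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def execute_with_timeout(tool_fn, args, timeout_s=0.5):
    """Run tool_fn(**args) in a worker thread; stop waiting after timeout_s."""
    future = _pool.submit(tool_fn, **args)
    try:
        return str(future.result(timeout=timeout_s))
    except FutureTimeout:
        future.cancel()  # only has an effect if the call has not started yet
        return f"Error: tool timed out after {timeout_s}s"
    except Exception as exc:
        return f"Error: {exc}"


# Usage with a hypothetical slow tool.
print(execute_with_timeout(lambda: time.sleep(0.3) or "late", {}, timeout_s=0.05))
```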
The loop is simple. The model returns a structured request. You execute it. You append the result. You call the model again. That’s the entire pattern. Build it once so you know what frameworks are doing under the hood.