mech.app
Dev Tools

Hermes Agent's Self-Improvement Loop: How Frameworks Decide When to Rewrite Their Own Tools

How Hermes, OpenClaw, and GoClaw handle tool versioning, capability expansion, and the decision boundary for when an agent should modify its own primitives.

Source: dev.to
Most agent frameworks ship with a fixed tool catalog. You define functions, the LLM calls them, and the set of capabilities stays static until you redeploy. Hermes Agent, OpenClaw, and GoClaw all take a different approach: they let the agent modify its own tool definitions at runtime. This raises immediate engineering questions about versioning, rollback, dependency tracking, and the decision boundary between “call an existing tool” and “rewrite the tool.”

This is a working engineer’s tour of how these frameworks handle runtime tool modification, what breaks when agents rewrite their own primitives, and when you should tolerate the complexity.

The Self-Improvement Loop

Hermes runs a learning loop that creates, edits, and improves skills during normal use. According to the source article, the framework has seen significant adoption since its February 2026 release. When the agent encounters a task it cannot solve with existing tools, it can:

  1. Generate a new skill (a Python function or shell script).
  2. Test the skill in a sandboxed environment.
  3. Store the skill in a local artifact directory.
  4. Add the skill to its tool catalog for future invocations.

This involves runtime code generation and execution. The agent writes code, saves it to disk, and imports it as a callable function. The next time a similar task appears, the agent has a new primitive.
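
The "imports it as a callable function" step can be sketched with the standard library's importlib. This is a minimal illustration, not Hermes's actual loader: the load_skill helper and the convention that every skill file defines a top-level run function are assumptions made for the example.

```python
import importlib.util
from pathlib import Path
from typing import Callable


def load_skill(path: str, entry_point: str = "run") -> Callable:
    """Import a skill file from disk and return its entry-point function.

    `entry_point` is a hypothetical convention: each skill file is assumed
    to define one top-level function with this name.
    """
    skill_path = Path(path)
    spec = importlib.util.spec_from_file_location(skill_path.stem, skill_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {skill_path}")
    module = importlib.util.module_from_spec(spec)
    # Runs the generated code inside the current process, with its privileges
    spec.loader.exec_module(module)
    func = getattr(module, entry_point, None)
    if not callable(func):
        raise AttributeError(f"{skill_path.name} has no callable '{entry_point}'")
    return func
```

Because exec_module runs the file in the agent's own process, this path offers no isolation. It suits skills that have already passed a sandboxed test run.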

OpenClaw and GoClaw implement variations of this pattern. OpenClaw focuses on web automation and uses a hybrid approach: it generates Playwright scripts but also maintains a library of reusable components. GoClaw is a Go-based framework that emphasizes statically typed tool definitions, so its self-improvement loop is more constrained.

Decision Boundary: Modify, Compose, or Create

The core engineering problem is the decision tree. When an agent receives a task, it must choose:

  • Call an existing tool if the task maps cleanly to a known function.
  • Compose existing tools if the task requires chaining multiple primitives.
  • Modify an existing tool if a function is close but needs parameter changes or logic tweaks.
  • Create a new tool if no combination of existing primitives works.

Hermes uses a prompt-based decision layer. The agent’s system prompt includes a section that describes the current tool catalog and instructs the model to evaluate whether existing tools are sufficient. If the model decides to create or modify a tool, it generates code and writes it to ~/.hermes/skills/.

OpenClaw uses a similar prompt-based approach but adds a heuristic layer: if a web automation task fails three times, the agent automatically attempts to generate a new Playwright script. This reduces the number of LLM calls but increases the risk of generating broken code.
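
The three-failure heuristic amounts to a per-task failure counter. The sketch below shows the idea in Python for consistency with the other examples in this article. It is not OpenClaw's code, and the FailureTracker class and its record method are hypothetical names.

```python
from collections import defaultdict

MAX_FAILURES = 3  # threshold described in the text above


class FailureTracker:
    """Counts consecutive failures per task and signals when to regenerate."""

    def __init__(self, max_failures: int = MAX_FAILURES):
        self.max_failures = max_failures
        self._failures = defaultdict(int)

    def record(self, task_key: str, succeeded: bool) -> bool:
        """Record one attempt; return True when a new script should be generated."""
        if succeeded:
            # A success clears the streak for this task
            self._failures.pop(task_key, None)
            return False
        self._failures[task_key] += 1
        if self._failures[task_key] >= self.max_failures:
            # Reset so the regenerated script gets a fresh set of attempts
            self._failures[task_key] = 0
            return True
        return False
```

Counting consecutive failures rather than total failures keeps a task that fails occasionally from triggering regeneration.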

GoClaw requires explicit tool registration. The agent cannot modify Go code at runtime, so the self-improvement loop is limited to generating configuration files or shell scripts. This makes GoClaw safer but less flexible.

Versioning and Rollback

When an agent rewrites a tool, it can break downstream dependencies. Hermes handles this with a simple versioning scheme:

  • Each skill is stored as a timestamped file: fetch_data_20260518_143022.py.
  • The agent maintains a skills.json manifest that maps tool names to file paths.
  • If a new version of a tool fails, the agent can roll back by updating the manifest to point to the previous file.

This versioning approach does not track dependencies between tools. If Tool A calls Tool B, and Tool B gets rewritten, Tool A does not automatically detect the change. The agent must re-test Tool A after modifying Tool B, which requires either manual intervention or a test harness. In practice, this means a tool modification can silently break a multi-step workflow that was working an hour earlier.
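
A rollback under this scheme can be sketched as follows. The example assumes the file naming and manifest layout described above. The rollback_skill function is a hypothetical helper, not part of Hermes.

```python
import json
import re
from pathlib import Path


def rollback_skill(name: str, skills_dir: str, manifest_path: str) -> str:
    """Point the manifest entry for `name` at the previous timestamped file.

    Returns the file name that the manifest now references.
    """
    manifest_file = Path(manifest_path)
    manifest = json.loads(manifest_file.read_text())
    current = Path(manifest[name]["path"]).name

    # Match only <name>_YYYYMMDD_HHMMSS.py so that a skill called "fetch"
    # does not pick up versions of "fetch_data".
    pattern = re.compile(rf"^{re.escape(name)}_\d{{8}}_\d{{6}}\.py$")
    # Fixed-width timestamps sort chronologically as plain strings
    versions = sorted(
        p.name for p in Path(skills_dir).iterdir() if pattern.match(p.name)
    )
    idx = versions.index(current)
    if idx == 0:
        raise LookupError(f"no earlier version of '{name}' to roll back to")

    previous = versions[idx - 1]
    manifest[name]["path"] = str(Path(skills_dir) / previous)
    manifest_file.write_text(json.dumps(manifest, indent=2))
    return previous
```

The sketch leaves the newer file on disk, so a rollback can itself be undone by repointing the manifest. It does nothing for tools that depend on the rolled-back skill. Those still need to be re-tested.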

OpenClaw does not version tools. It overwrites the existing script file, so rollback requires manual file restoration. This is acceptable for web scraping tasks where tools are loosely coupled, but it becomes a problem for multi-step workflows.

GoClaw avoids the problem for its compiled tools by not allowing them to be modified at runtime. If you want to add a new tool, you write Go code, recompile, and restart the agent. This is slower but eliminates versioning issues.

Tool Invocation Boundaries

The three frameworks differ in how they invoke tools. The comparison below is based on the source article's descriptions and partly inferred from architecture patterns:

Framework | Runtime | Invocation Method            | Security Model                        | Observability
Hermes    | Python  | Dynamic import or subprocess | Sandboxed Docker container (optional) | Logs to ~/.hermes/logs/
OpenClaw  | Node.js | Playwright script execution  | Browser sandbox                       | Playwright trace files
GoClaw    | Go      | Compiled function calls      | OS-level process isolation            | Structured logging to stdout

Hermes can run generated code in the same process or spawn subprocesses. The recommended deployment is to run Hermes inside a Docker container with limited filesystem access and no network egress except through a proxy.

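OpenClaw runs generated Playwright scripts against a headless browser. The browser sandbox confines page content, which cannot read the host filesystem directly. The script itself, however, runs in the Node.js process with that process's privileges, and the browser can still make network requests. Treat the browser sandbox as isolation for page content, not for the generated script, and add network and filesystem restrictions if the generated code is untrusted.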

GoClaw compiles tools into the agent binary, so no generated Go code executes at runtime. The agent can only call pre-registered Go functions. This is the safest approach but requires a build step.

State Management

Self-improving agents accumulate state over time. Hermes stores:

  • Skills: Python files in ~/.hermes/skills/.
  • Memory: A SQLite database in ~/.hermes/memory.db that tracks conversation history and task outcomes.
  • Artifacts: Files generated by tools (CSVs, JSON, images) in ~/.hermes/artifacts/.

Hermes does not delete old memory entries, so the database grows indefinitely. You must prune old records manually or implement a retention policy.
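
A retention policy can be as small as a scheduled delete. The sketch below uses Python's sqlite3 module. The table name memory and the created_at column (an ISO-8601 UTC timestamp stored as text) are assumptions, since the actual Hermes schema is not described here.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def prune_memory(db_path: str, max_age_days: int = 90) -> int:
    """Delete rows older than `max_age_days`; return the number of rows removed.

    Assumes a hypothetical table `memory` with a `created_at` column holding
    ISO-8601 UTC timestamps as text, which compare correctly as strings.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    cutoff_text = cutoff.isoformat(timespec="seconds")
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "DELETE FROM memory WHERE created_at < ?", (cutoff_text,)
            )
            deleted = cur.rowcount
        # VACUUM reclaims disk space and must run outside a transaction
        conn.execute("VACUUM")
        return deleted
    finally:
        conn.close()
```

Deleting rows alone does not shrink the file, which is why the sketch runs VACUUM afterwards. On a large database you may prefer to vacuum less often than you prune.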

OpenClaw stores state in a structured format that makes it easier to query and analyze tool usage patterns. The exact storage backend varies by deployment.

GoClaw uses a file-based state store. Each tool writes its output to a JSON file in a shared directory. For workflows generating more than a few hundred tasks per hour, this file-per-task pattern creates filesystem bottlenecks and makes atomic operations difficult.

Code Example: Hermes Skill Generation and Sandboxed Execution

Here is a simplified version of how Hermes generates and registers a new skill, then executes it in a subprocess for isolation. This example omits production-grade input validation and error handling for clarity:

import os
import json
import subprocess
from datetime import datetime

def create_skill(name: str, code: str, description: str):
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{name}_{timestamp}.py"
    skills_dir = os.path.expanduser("~/.hermes/skills")
    
    # Ensure skills directory exists
    os.makedirs(skills_dir, exist_ok=True)
    filepath = os.path.join(skills_dir, filename)
    
    # Write the skill code to disk
    with open(filepath, "w") as f:
        f.write(code)
    
    # Update the skills manifest (start from an empty one on first run)
    manifest_path = os.path.expanduser("~/.hermes/skills.json")
    if os.path.exists(manifest_path):
        with open(manifest_path, "r") as f:
            manifest = json.load(f)
    else:
        manifest = {}
    
    manifest[name] = {
        "path": filepath,
        "description": description,
        "created_at": timestamp
    }
    
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
    
    return filepath

def execute_skill_sandboxed(skill_path: str, args: dict):
    """Execute a skill in a subprocess with a minimal environment.

    Note: This is process separation, not a security sandbox. Production
    code must validate args and add real isolation (for example, a container).
    """
    # Simplified example: keep only top-level values of JSON-compatible types;
    # nested values are not checked
    sanitized_args = {k: v for k, v in args.items() 
                      if isinstance(v, (str, int, float, bool, list, dict))}
    
    result = subprocess.run(
        ["python3", skill_path],
        input=json.dumps(sanitized_args),
        capture_output=True,
        text=True,
        timeout=30,
        env={"HOME": "/tmp/hermes_sandbox"}
    )
    return result.stdout, result.returncode

The agent calls create_skill() after generating code. For execution, it uses execute_skill_sandboxed() to run the skill in a subprocess with a minimal environment (only HOME is set) and a 30-second timeout. On its own this is not a security boundary. In production deployments, the subprocess runs inside a Docker container with additional resource limits.

Potential Risks and Failure Modes

Self-improving agents introduce several risk categories:

Code Execution Failures

The agent may write syntactically invalid Python or use undefined variables. Hermes mitigates this by running generated code in a try-except block and logging errors. If a skill fails, the agent can attempt to fix it by regenerating the code with error context.
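
Syntax errors can be caught before a skill ever reaches the catalog by parsing the generated source first. The following is a minimal sketch using the standard ast module. The check_generated_code helper and the run entry-point convention are hypothetical.

```python
import ast


def check_generated_code(code: str, entry_point: str = "run") -> list[str]:
    """Return a list of problems found; an empty list means the checks passed.

    Only parses the code, never executes it. `entry_point` is an assumed
    convention for the function each skill must define.
    """
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    # Collect the names of top-level function definitions
    defined = {
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    if entry_point not in defined:
        return [f"missing top-level function '{entry_point}'"]
    return []
```

Parsing does not detect undefined variables, which only surface as a NameError at runtime, so a sandboxed test run is still needed. The returned messages can be fed back to the model as error context for regeneration.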

Dependency Breakage

The agent may modify a tool that other tools depend on. For example, if a data fetching tool changes its return format from a list to a dictionary, any tool that expects a list will fail. Hermes does not track dependencies, so this requires manual debugging. You discover the breakage when a workflow fails, not when the tool is modified.
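
One way to find out which tools need re-testing after a change is to scan skill sources for references to other skills. The sketch below assumes skills refer to each other by bare function name, which may not match how a given deployment wires tools together. The find_dependents helper is hypothetical.

```python
import ast


def find_dependents(sources: dict[str, str], changed: str) -> set[str]:
    """Return skills whose source references `changed`, directly or transitively.

    `sources` maps skill name -> source code. A reference is any bare
    identifier in the source that matches another skill's name.
    """
    skill_names = set(sources)
    refs: dict[str, set[str]] = {}
    for name, code in sources.items():
        used = {
            node.id for node in ast.walk(ast.parse(code))
            if isinstance(node, ast.Name)
        }
        refs[name] = (used & skill_names) - {name}

    # Walk the reverse edges outward from the changed skill
    dependents: set[str] = set()
    frontier = [changed]
    while frontier:
        target = frontier.pop()
        for name, used in refs.items():
            if target in used and name != changed and name not in dependents:
                dependents.add(name)
                frontier.append(name)
    return dependents
```

Running each returned skill through its test after a modification turns a silent breakage into a failure at modification time. Static scanning misses indirect calls, so treat the result as a lower bound.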

State Corruption

The agent may write malformed JSON to the skills manifest or memory database. Hermes does not validate state files, so corruption can cause the agent to crash. A single malformed JSON write can make the entire skill catalog unreadable.
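
A common defense against half-written manifests is to serialize first, write to a temporary file, and rename it into place. This is a general technique rather than something the source attributes to Hermes. The write_manifest_atomic helper is a hypothetical name.

```python
import json
import os
import tempfile


def write_manifest_atomic(manifest: dict, manifest_path: str) -> None:
    """Replace the manifest file in one step so readers never see a partial write."""
    directory = os.path.dirname(os.path.abspath(manifest_path))
    # Serialize before touching disk: a non-serializable value fails here
    payload = json.dumps(manifest, indent=2)

    # The temp file lives in the same directory so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, manifest_path)
    except BaseException:
        # Clean up the temp file and leave the existing manifest untouched
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If serialization or the write fails, the previous manifest remains readable, which keeps one bad write from taking down the whole skill catalog.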

Infinite Loops

The agent may generate a tool that calls itself recursively or creates a cycle of tool invocations. Hermes does not have a built-in recursion limit, so this can exhaust memory. For example, a tool that “improves itself” could enter a loop where each iteration generates a new version that also tries to improve itself.
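
A depth limit on tool invocations is one way to bound such loops. The sketch below assumes every tool call is routed through a single dispatcher. The InvocationGuard class is hypothetical, and the default depth of 8 is an arbitrary illustration.

```python
from typing import Any, Callable


class InvocationGuard:
    """Tracks the chain of active tool calls and rejects chains that get too deep."""

    def __init__(self, max_depth: int = 8):
        self.max_depth = max_depth
        self._stack: list[str] = []

    def call(self, name: str, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if len(self._stack) >= self.max_depth:
            chain = " -> ".join(self._stack + [name])
            raise RecursionError(f"tool call depth {self.max_depth} exceeded: {chain}")
        self._stack.append(name)
        try:
            return func(*args, **kwargs)
        finally:
            # Always unwind, even when the tool raises
            self._stack.pop()
```

The error message includes the call chain, which makes a cycle such as A calling B calling A visible in the logs. This guard only covers in-process calls. Tools that spawn subprocesses need a depth counter passed along explicitly.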

Security Bypass

The agent may generate code that attempts to escape the sandbox or exfiltrate data. Hermes relies on Docker for isolation, but Docker is not a complete security boundary. Plan for the possibility of sandbox escape in threat models. An LLM that generates code can also generate code that probes for sandbox weaknesses.

When to Use Each Framework

Use Hermes if:

  • You need an agent that adapts to new tasks without redeployment.
  • You are comfortable running untrusted code in a sandboxed environment.
  • You want a CLI-first workflow with messaging integrations (Telegram, Discord, Slack).

Use OpenClaw if:

  • Your primary use case is web automation (scraping, form filling, testing).
  • You need browser-based sandboxing for generated code.
  • You want to store tool usage data in a queryable format.

Use GoClaw if:

  • You require compile-time safety and static typing.
  • You cannot tolerate runtime code execution.
  • You are building a high-throughput system where agent restarts are acceptable.

Technical Verdict

Self-improving agents are useful for exploratory workflows where the task space is unknown. If you are building a customer support bot that needs to handle arbitrary API integrations, Hermes is a reasonable choice. If you are building a production system with well-defined tasks, a static tool catalog is safer and easier to test.

The versioning and rollback story is immature across all three frameworks. None of them handle dependency tracking or automated testing. You will need to build your own test harness and monitoring layer.

Security is the biggest concern. Running untrusted code generated by an LLM is inherently risky. Docker provides some isolation, but containers share the host kernel, so a container alone is not a hard security boundary. If you deploy a self-improving agent in a production environment, plan for the possibility of malicious code generation and implement defense-in-depth strategies: network egress filtering, filesystem access controls, resource limits, and audit logging.

The decision boundary between modifying existing tools and creating new ones is still prompt-driven and probabilistic. You cannot guarantee that the agent will make the right choice. This makes self-improving agents unsuitable for workflows where correctness is critical and manual review is impractical.

Tags

agentic-ai orchestration infrastructure