Codebuff is an open-source CLI tool that edits codebases through natural language instructions by coordinating four specialized agents: File Picker, Planner, Editor, and Reviewer. Instead of sending your entire prompt to a single LLM, Codebuff’s orchestration layer routes subtasks to agents with narrow tool boundaries, then passes state forward through the chain.
The project recently released public evals showing a 61% success rate versus Claude Code’s 53% across 175+ real-world coding tasks. The gap comes from multi-agent coordination, not model quality. When you ask for “add authentication to my API,” a single model has to juggle file discovery, change sequencing, precise edits, and validation in one context window. Codebuff splits those responsibilities and hands off intermediate state between agents.
Agent Chain and State Flow
The orchestration follows a linear pipeline with optional loops:
- File Picker Agent scans the codebase, builds a file tree, and returns a list of relevant paths.
- Planner Agent receives the file list, reads file contents, and outputs a sequence of changes with dependencies.
- Editor Agent takes the plan, applies edits to specific files, and writes diffs.
- Reviewer Agent validates the edits, runs tests, and either approves or sends feedback back to the Editor.
If the Reviewer returns status: "retry", the orchestration layer re-invokes the Editor with feedback appended to the context (up to 3 times before halting). If the Reviewer returns status: "escalate", the orchestration layer sends the error back to the Planner to regenerate the change sequence.
State passed between agents includes:
- File paths and content snapshots
- Change plan with ordered steps
- Diffs and line-level edits
- Test results and validation errors
Each agent sees only the state it needs. The File Picker does not have write access. The Editor does not run tests. The Reviewer cannot modify code directly.
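The retry and escalation rules above can be sketched as a small control loop. This is a minimal illustration, not Codebuff's actual orchestrator: the Agents interface, runPipeline, and MAX_EDITOR_RETRIES are hypothetical names, agents are modeled as plain synchronous functions, and the maxReplans cap on escalations is an assumption added so the sketch always terminates.

```typescript
// Hypothetical sketch of the pipeline loop; names are illustrative, not Codebuff's API.
type ReviewStatus = "success" | "retry" | "escalate";

interface Review {
  status: ReviewStatus;
  feedback?: string;
}

interface Agents {
  pickFiles(instruction: string): string[];
  plan(files: string[], error?: string): string[];
  edit(plan: string[], feedback: string[]): string; // returns a diff
  review(diff: string): Review;
}

const MAX_EDITOR_RETRIES = 3; // the retry limit stated in the article

type Outcome =
  | { ok: true; diff: string; editorRuns: number }
  | { ok: false; reason: string; editorRuns: number };

function runPipeline(agents: Agents, instruction: string, maxReplans = 1): Outcome {
  const files = agents.pickFiles(instruction);
  let plan = agents.plan(files);
  let replans = 0;
  let editorRuns = 0;

  for (;;) {
    const feedback: string[] = [];
    let escalated: string | undefined;

    // One initial attempt plus up to MAX_EDITOR_RETRIES retries.
    for (let attempt = 0; attempt <= MAX_EDITOR_RETRIES; attempt++) {
      const diff = agents.edit(plan, feedback);
      editorRuns++;
      const result = agents.review(diff);
      if (result.status === "success") return { ok: true, diff, editorRuns };
      if (result.status === "escalate") {
        escalated = result.feedback ?? "escalated";
        break;
      }
      // "retry": append the Reviewer's feedback to the Editor's context.
      feedback.push(result.feedback ?? "");
    }

    if (escalated === undefined) {
      return { ok: false, reason: "retry limit reached", editorRuns };
    }
    if (replans >= maxReplans) {
      return { ok: false, reason: "replan limit reached", editorRuns };
    }
    replans++;
    plan = agents.plan(files, escalated); // Planner regenerates the change sequence
  }
}
```

With a Reviewer that always returns "retry", this sketch runs the Editor four times (one attempt plus three retries) and then halts. Whether the retry budget resets after an escalation is a design choice made here for simplicity, not something the article specifies.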
Agent Definition Format
Codebuff uses TypeScript-based agent definition files stored in .agents/. Each definition specifies:
- Tools: Functions the agent can call (file read, file write, shell exec, spawn child agent)
- Spawnable agents: Which other agents this agent can invoke
- Prompts: System and user prompt templates with variable interpolation
- Generators: TypeScript functions that programmatically construct prompts or tool lists based on runtime context
Example structure:
// .agents/editor-agent.ts
// Simplified example; actual implementation may vary
export const editorAgent = {
  tools: [
    "read_file",
    "write_file",
    "apply_diff"
  ],
  // The Editor hands off to the Reviewer once its edits are applied
  spawnableAgents: ["reviewer"],
  systemPrompt: `You are an editor agent. Apply precise changes to files based on the plan.`,
  generator: (context: { fileType: string }) => {
    // Programmatically adjust behavior based on file type
    if (context.fileType === "typescript") {
      return { additionalTools: ["run_tsc"] };
    }
    return {};
  }
};
The generator function lets you inject tools or modify prompts at runtime. If the Planner identifies a TypeScript file, the Editor’s generator adds run_tsc to the tool list before the agent starts.
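One way to picture the merge step is shown below. This is a sketch under assumptions: resolveTools, RuntimeContext, and GeneratorOutput are hypothetical names, and how Codebuff actually combines generator output with the static tool list is not described in this article.

```typescript
// Hypothetical sketch: merging generator output into an agent's static tool list.
interface RuntimeContext {
  fileType: string;
}

interface GeneratorOutput {
  additionalTools?: string[];
}

interface AgentDefinition {
  tools: string[];
  generator?: (context: RuntimeContext) => GeneratorOutput;
}

function resolveTools(agent: AgentDefinition, context: RuntimeContext): string[] {
  // Agents without a generator, or generators that add nothing, contribute no extras.
  const extra = agent.generator?.(context).additionalTools ?? [];
  // A Set drops duplicates while preserving first-seen order.
  return Array.from(new Set([...agent.tools, ...extra]));
}

const editorAgent: AgentDefinition = {
  tools: ["read_file", "write_file", "apply_diff"],
  generator: (context) =>
    context.fileType === "typescript" ? { additionalTools: ["run_tsc"] } : {},
};

// Tool list for a TypeScript file: the three static tools plus run_tsc.
const tsTools = resolveTools(editorAgent, { fileType: "typescript" });
// Tool list for any other file type: the static tools only.
const pyTools = resolveTools(editorAgent, { fileType: "python" });
```

Resolving the list once, before the agent starts, keeps the tool set fixed for the duration of a run, which makes the resulting trace easier to audit.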
Orchestration Decision Logic
Codebuff’s orchestration layer does not use a central state machine. Instead, each agent definition declares which agents it can spawn. The File Picker spawns the Planner. The Planner spawns the Editor. The Editor spawns the Reviewer.
When an agent completes, it returns a result object:
// Simplified example of agent return structure
{
  status: "success" | "retry" | "escalate",
  data: { ... },
  nextAgent?: "reviewer" | "planner"
}
The orchestration layer tracks:
- Agent invocation count (to prevent infinite loops)
- Token usage per agent
- Latency per agent call
- Tool call success/failure rates
After three retries, the orchestration layer halts and surfaces the error to the user.
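The routing and bookkeeping described in this section might look roughly like the following. Everything here is illustrative: decide, RunTracker, and the agent name strings are hypothetical, and the sketch tracks only invocation count, tokens, and latency, not tool call success rates.

```typescript
// Hypothetical sketch of routing on an agent result; not Codebuff's actual types.
type AgentName = "file_picker" | "planner" | "editor" | "reviewer";

interface AgentResult {
  status: "success" | "retry" | "escalate";
  data: Record<string, unknown>;
  nextAgent?: AgentName;
}

type Decision =
  | { action: "done" }
  | { action: "run"; agent: AgentName }
  | { action: "halt"; reason: string };

const MAX_RETRIES = 3;

function decide(result: AgentResult, retriesSoFar: number): Decision {
  switch (result.status) {
    case "success":
      // A successful agent either names its successor or ends the chain.
      return result.nextAgent ? { action: "run", agent: result.nextAgent } : { action: "done" };
    case "retry":
      return retriesSoFar >= MAX_RETRIES
        ? { action: "halt", reason: `retry limit (${MAX_RETRIES}) reached` }
        : { action: "run", agent: result.nextAgent ?? "editor" };
    case "escalate":
      return { action: "run", agent: "planner" };
  }
}

interface AgentStats {
  invocations: number;
  tokens: number;
  latencyMs: number;
}

class RunTracker {
  private readonly stats = new Map<AgentName, AgentStats>();

  // Accumulate one agent call into the running totals.
  record(agent: AgentName, tokens: number, latencyMs: number): void {
    const s = this.get(agent);
    this.stats.set(agent, {
      invocations: s.invocations + 1,
      tokens: s.tokens + tokens,
      latencyMs: s.latencyMs + latencyMs,
    });
  }

  get(agent: AgentName): AgentStats {
    return this.stats.get(agent) ?? { invocations: 0, tokens: 0, latencyMs: 0 };
  }
}
```

Keeping the retry counter outside the agents themselves means no agent can extend its own budget; the limit is enforced in one place.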
Eval Methodology and Multi-Agent Metrics
Codebuff’s eval harness isolates orchestration failures from LLM capability gaps by tagging each failure with the agent that caused it. The suite runs 175+ tasks across multiple open-source repos. Each task includes:
- A natural language instruction
- A target codebase snapshot
- A set of expected file changes (diffs)
- A test suite that must pass after edits
The eval harness measures:
- Task success rate: Percentage of tasks where all tests pass and diffs match expectations
- Agent handoff errors: Tasks that failed due to incorrect state passed between agents (e.g., File Picker returned wrong paths, Planner generated invalid sequence)
- LLM capability errors: Tasks that failed because the Editor could not generate correct code, even with perfect planning
- Retry count: Number of Editor-Reviewer loops before success or failure
| Metric | Codebuff | Claude Code |
|---|---|---|
| Task success rate | 61% | 53% |
| Avg retries per task | 1.2 | N/A |
| Agent handoff errors | 8% | N/A |
| LLM capability errors | 31% | 47% |
| Avg tokens per task | 12,400 | 18,900 |
Codebuff’s lower token usage comes from scoped agent contexts. The File Picker sees only file paths, not full content. The Planner sees only relevant files, not the entire codebase.
Agent handoff errors (8%) represent cases where the File Picker returned incomplete file lists or the Planner generated plans with missing dependencies. These are orchestration failures, not model failures.
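The metrics above can be derived from per-task records once each failure carries a tag. The sketch below assumes a hypothetical TaskOutcome shape and expresses every rate as a fraction of all tasks, which matches the table (61% + 8% + 31% = 100%). The sample records are toy data for illustration, not eval results.

```typescript
// Hypothetical sketch: computing eval metrics from tagged task outcomes.
type FailureKind = "handoff" | "llm_capability";

interface TaskOutcome {
  taskId: string;
  passed: boolean;
  failure?: FailureKind; // set only when passed is false
  retries: number;       // Editor-Reviewer loops before success or failure
}

interface EvalSummary {
  successRate: number;
  handoffErrorRate: number;
  llmCapabilityErrorRate: number;
  avgRetries: number;
}

function summarize(outcomes: TaskOutcome[]): EvalSummary {
  const total = outcomes.length;
  if (total === 0) throw new Error("no outcomes to summarize");
  // Fraction of all tasks matching a predicate.
  const share = (pred: (o: TaskOutcome) => boolean): number =>
    outcomes.filter(pred).length / total;
  return {
    successRate: share((o) => o.passed),
    handoffErrorRate: share((o) => !o.passed && o.failure === "handoff"),
    llmCapabilityErrorRate: share((o) => !o.passed && o.failure === "llm_capability"),
    avgRetries: outcomes.reduce((sum, o) => sum + o.retries, 0) / total,
  };
}

// Toy data only; these are not Codebuff's eval records.
const toyOutcomes: TaskOutcome[] = [
  { taskId: "t1", passed: true, retries: 0 },
  { taskId: "t2", passed: true, retries: 1 },
  { taskId: "t3", passed: false, failure: "handoff", retries: 3 },
  { taskId: "t4", passed: false, failure: "llm_capability", retries: 0 },
];

const toySummary = summarize(toyOutcomes);
```

Tagging the failure at the point where it is detected, rather than inferring it afterwards, is what lets the handoff and capability rates be reported separately.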
Deployment Shape and Observability
Codebuff runs as a local CLI. The orchestration layer executes in Node.js and calls OpenAI-compatible APIs for each agent. You can point it at OpenAI, Anthropic, or a local model server.
The CLI outputs a live agent trace in the terminal:
[File Picker] Scanning codebase...
[File Picker] Found 23 relevant files
[Planner] Reading src/auth/middleware.ts
[Planner] Reading src/routes/api.ts
[Planner] Generated 4-step plan
[Editor] Applying step 1/4: Add auth import
[Editor] Applying step 2/4: Wrap routes with middleware
[Reviewer] Running tests...
[Reviewer] 12/12 tests passed
Each agent logs:
- Tool calls with arguments
- Token usage
- Latency
- Success/failure status
The orchestration layer writes a JSON log file with the full trace, including intermediate state passed between agents. You can replay failed tasks by feeding the log back into the orchestration layer with modified agent definitions.
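As a rough picture of working with such a log, the sketch below parses a trace and summarizes it. The log schema is not given in this article, so the TraceEntry shape, the field names, and the sample values are all assumptions made for illustration; check the real log file before relying on any of them.

```typescript
// Hypothetical trace schema; the real log format may differ.
interface TraceEntry {
  agent: string;
  tool: string;
  args: Record<string, unknown>;
  tokens: number;
  latencyMs: number;
  ok: boolean;
}

function parseTrace(json: string): TraceEntry[] {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed)) throw new Error("trace must be a JSON array");
  // No field-level validation here; a production reader should check each entry.
  return parsed as TraceEntry[];
}

// The first failed tool call is usually the place to start debugging.
function firstFailure(trace: TraceEntry[]): TraceEntry | undefined {
  return trace.find((entry) => !entry.ok);
}

function tokensByAgent(trace: TraceEntry[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const entry of trace) {
    totals[entry.agent] = (totals[entry.agent] ?? 0) + entry.tokens;
  }
  return totals;
}

// Made-up sample entries, serialized the way a log file might be.
const sampleLog = JSON.stringify([
  { agent: "Planner", tool: "read_file", args: { path: "src/routes/api.ts" }, tokens: 300, latencyMs: 120, ok: true },
  { agent: "Editor", tool: "apply_diff", args: { step: 1 }, tokens: 500, latencyMs: 340, ok: true },
  { agent: "Editor", tool: "apply_diff", args: { step: 2 }, tokens: 450, latencyMs: 310, ok: false },
]);

const sampleTrace = parseTrace(sampleLog);
const sampleTotals = tokensByAgent(sampleTrace);
const sampleFailure = firstFailure(sampleTrace);
```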
Security Boundaries and Tool Access
Each agent has a restricted tool set enforced by the orchestration layer:
- File Picker: list_files, search_codebase (read-only)
- Planner: read_file, spawn_editor (read-only, cannot write)
- Editor: read_file, write_file, apply_diff, spawn_reviewer (write access, cannot execute shell commands)
- Reviewer: run_tests, read_file, spawn_editor (can execute tests, cannot modify files)
If the Planner tries to call write_file, the tool call fails and the orchestration layer logs a security violation.
Shell execution is isolated to the Reviewer agent and limited to test commands defined in the agent definition. You can whitelist specific commands:
// Simplified example of tool restriction
export const reviewerAgent = {
  tools: [
    { name: "run_tests", allowedCommands: ["npm test", "pytest"] }
  ]
};
If the Reviewer tries to run rm -rf /, the tool call fails.
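An enforcement check along these lines could sit between the agent and its tools. This is a sketch, not Codebuff's implementation: authorize, ToolGrant, SecurityViolation, and the grants table are hypothetical, and exact-match comparison against allowedCommands is an assumption about how the whitelist is applied.

```typescript
// Hypothetical sketch of per-agent tool enforcement; not Codebuff's actual implementation.
interface ToolGrant {
  name: string;
  allowedCommands?: string[]; // only meaningful for command-running tools
}

const grants: Record<string, ToolGrant[]> = {
  planner: [{ name: "read_file" }, { name: "spawn_editor" }],
  reviewer: [
    { name: "run_tests", allowedCommands: ["npm test", "pytest"] },
    { name: "read_file" },
    { name: "spawn_editor" },
  ],
};

class SecurityViolation extends Error {}

// Throws if the agent may not call the tool, or may not run the given command.
function authorize(agent: string, tool: string, command?: string): void {
  const grant = (grants[agent] ?? []).find((g) => g.name === tool);
  if (!grant) {
    throw new SecurityViolation(`${agent} may not call ${tool}`);
  }
  if (grant.allowedCommands) {
    // Exact match only: prefix matching would let "npm test; rm -rf /" through.
    if (command === undefined || !grant.allowedCommands.includes(command)) {
      throw new SecurityViolation(`${agent} may not run: ${command ?? "(no command)"}`);
    }
  }
}
```

Under these assumptions, authorize("planner", "write_file") and authorize("reviewer", "run_tests", "rm -rf /") both throw, while authorize("reviewer", "run_tests", "npm test") returns normally.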
Failure Modes
Based on Codebuff’s 175+ task evals, File Picker misses account for approximately 5% of failures, while Editor retry loops (the LLM capability errors in the table above) account for approximately 31% of tasks. Common failure scenarios:
- File Picker misses relevant files: The Planner generates an incomplete plan because it did not see all necessary files. The Reviewer catches this when tests fail, but the Editor has to re-run with an expanded file list.
- Planner generates circular dependencies: The Editor tries to apply step 3 before step 1 completes. The orchestration layer detects the cycle after two retries and escalates to the user.
- Editor produces syntactically invalid code: The Reviewer runs tests, they fail, and the Reviewer sends feedback to the Editor. After three retries, the orchestration layer halts.
- Reviewer approves broken code: The Reviewer’s test suite does not cover the edge case that broke. The user discovers the bug later. Codebuff does not prevent this; it only validates against the provided tests.
- Infinite retry loop: The Editor and Reviewer disagree on whether the code is correct. The orchestration layer halts after three Editor-Reviewer cycles and surfaces both perspectives to the user.
The orchestration layer does not automatically recover from these failures. It logs the error, shows the agent trace, and lets the user decide whether to retry with modified instructions or agent definitions.
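The circular-dependency case is the one failure mode here that can be checked mechanically before any edit runs. The sketch below uses Kahn's algorithm to order plan steps and reject cycles; PlanStep and orderSteps are hypothetical names, and the article does not say how Codebuff itself detects the cycle.

```typescript
// Hypothetical plan-step shape; a sketch of ordering steps and detecting cycles.
interface PlanStep {
  id: string;
  dependsOn: string[];
}

// Returns step ids in an order that respects dependencies (Kahn's algorithm).
// Throws if the plan contains a cycle or references an unknown step.
function orderSteps(steps: PlanStep[]): string[] {
  const indegree = new Map<string, number>();
  const dependents = new Map<string, string[]>();
  for (const step of steps) {
    indegree.set(step.id, 0);
    dependents.set(step.id, []);
  }
  for (const step of steps) {
    for (const dep of step.dependsOn) {
      const list = dependents.get(dep);
      if (!list) throw new Error(`step ${step.id} depends on unknown step ${dep}`);
      list.push(step.id);
      indegree.set(step.id, (indegree.get(step.id) ?? 0) + 1);
    }
  }

  // Start with every step that has no unmet dependencies.
  const ready = steps.filter((s) => s.dependsOn.length === 0).map((s) => s.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift() as string;
    order.push(id);
    for (const next of dependents.get(id) ?? []) {
      const remaining = (indegree.get(next) ?? 0) - 1;
      indegree.set(next, remaining);
      if (remaining === 0) ready.push(next);
    }
  }

  // Any step never reached is part of, or blocked by, a cycle.
  if (order.length !== steps.length) {
    throw new Error("plan contains a dependency cycle");
  }
  return order;
}
```

Running a check like this on the Planner's output before invoking the Editor would surface an invalid sequence without spending retries on it.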
Custom Agent Development
You can add new agents by creating definition files in .agents/. Run /init in the CLI to scaffold the directory structure:
.agents/
├── types/
│   ├── agent-definition.ts
│   ├── tools.ts
│   └── util-types.ts
└── my-custom-agent.ts
A custom agent definition might look like:
// Simplified example of custom agent
export const gitCommitterAgent = {
  tools: ["read_file", "run_git_command"],
  spawnableAgents: [],
  systemPrompt: `You create git commits based on file changes.`,
  generator: (context) => {
    const changedFiles = context.editorOutput.files;
    return {
      userPrompt: `Create a commit message for changes to: ${changedFiles.join(", ")}`
    };
  }
};
The generator receives the output from the previous agent (Editor) and constructs a prompt dynamically. You can also add new tools by implementing functions in tools.ts:
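// Simplified example of custom tool
import { execFileSync } from "node:child_process";

export const runGitCommand = async (args: string[]) => {
  // Validate command is safe: only the "commit" subcommand is allowed
  if (args[0] !== "commit") {
    throw new Error("Only git commit allowed");
  }
  // Execute without a shell so arguments cannot inject extra commands
  return execFileSync("git", args).toString();
};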
The orchestration layer loads custom agents at startup and makes them available for spawning.
Technical Verdict
Use Codebuff if:
- You need to debug why agents fail at specific handoff points. The 8% agent handoff error rate in evals means you can trace whether the File Picker missed files or the Planner generated invalid sequences, not just “the LLM failed.”
- You want to inject TypeScript generators into agent prompts to modify tool availability at runtime based on file type or project structure.
- You are building evals for multi-agent coding systems and need a reference implementation with public benchmarks that separate orchestration failures from LLM capability limits.
- You prefer local CLI tools with full control over model endpoints (OpenAI, Anthropic, or local servers) instead of hosted services.
Avoid Codebuff if:
- Your codebase exceeds 100,000 files. The File Picker does not yet support incremental indexing, so initial scans will be slow and may miss files in large monorepos.
- You need automatic recovery from agent failures. The orchestration layer halts after 3 retries and requires manual intervention to adjust instructions or agent definitions.
- Your tasks fit in a single LLM context window and do not require coordination between discovery, planning, editing, and validation steps. The multi-agent architecture adds latency (1.2 retries per task on average) without benefit for simple edits.
- You want real-time collaboration features or web-based UIs. Codebuff is a terminal-only tool with no shared state across users.
The 61% vs 53% success rate gap over Claude Code comes from splitting responsibilities across agents with restricted tool access. If you already manually separate file discovery from editing in your workflow, Codebuff automates the handoffs. If you are satisfied with single-shot LLM edits, the orchestration overhead is not justified.