2,500 Commits with AI Agents: Reverse-Engineering a 12-Phase Workflow That Actually Shipped Code

How orchestration boundaries, state handoffs, and human checkpoints enable high-volume agent-driven development without runaway failures.

Source: dev.to
A developer shipped 2,500+ commits across two production projects using a 12-phase workflow where AI agents write code and a human acts as gatekeeper. The workflow is documented in three public files: a phase template, a project-specific spec, and an optional multi-agent extension. The projects include a self-hosted memory layer (free-context-hub) and a cloud-hosted multilingual novel platform (lore-weave) with 19 microservices.

This is not a framework. It’s a practitioner’s orchestration pattern that addresses the core problem of AI code generation: agents produce plausible code but fail at scope control, detecting stale assumptions, and knowing when to stop.

The 12-Phase Structure

Each phase is a discrete checkpoint with explicit inputs, outputs, and approval gates. The workflow enforces sequential execution to prevent agents from skipping steps or merging prematurely.

Phase breakdown:

  1. Requirement Capture – Human writes a feature request or bug report
  2. Spec Generation – Agent produces a technical spec with acceptance criteria
  3. Spec Review – Human approves, rejects, or requests changes
  4. Implementation Plan – Agent breaks spec into file-level tasks
  5. Plan Review – Human validates task decomposition
  6. Code Generation – Agent writes code for approved tasks
  7. Self-Review – Agent checks its own code against spec
  8. Human Review – Human reviews diffs, runs tests, checks contracts
  9. Revision – Agent fixes issues found in review
  10. Integration Test – Automated tests run against changes
  11. Documentation Update – Agent updates docs to match new behavior
  12. Commit & Log – Changes merge with audit trail linking back to original spec

State passes between phases via file artifacts. Each phase writes its output to a designated file (spec.md, plan.md, diff.txt, test-results.json), and the next phase reads that file as input. There is no shared memory or message queue.
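The artifact convention can be sketched as a small lookup plus a readiness check. This is only an illustration: the function names `phase_artifact` and `artifact_ready` are hypothetical, and pairing test-results.json with Phase 10 is an assumption (the source lists the file names but pairs only some of them with a phase explicitly).

```shell
#!/usr/bin/env bash
# Sketch only. Maps a phase number to the artifact it writes.
# The pairing for phase 10 is an assumption, not confirmed by the source.
phase_artifact() {
  case "$1" in
    2)  echo "spec.md" ;;            # Spec Generation
    4)  echo "plan.md" ;;            # Implementation Plan
    6)  echo "diff.txt" ;;           # Code Generation
    10) echo "test-results.json" ;;  # Integration Test (assumed pairing)
    *)  return 1 ;;                  # no artifact named for this phase
  esac
}

# Succeed only if the artifact of phase $1 exists and is non-empty
# inside directory $2 (defaults to the current directory).
artifact_ready() {
  local phase="$1" dir="${2:-.}" file
  file="$(phase_artifact "$phase")" || return 1
  [ -s "$dir/$file" ]
}
```

An orchestrator could call `artifact_ready 2 || exit 1` before starting Phase 3, which mirrors the existence check the workflow performs between phases.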

Human Approval Gates

The workflow has three hard stops where execution pauses until a human approves:

  • Spec Review (Phase 3), after Spec Generation – Prevents building the wrong thing
  • Plan Review (Phase 5), after the Implementation Plan – Catches scope creep before code is written
  • Human Review (Phase 8), after Code Generation and Self-Review – Final check before merge

At each gate, the human can:

  • Approve and continue
  • Reject and restart from Phase 1
  • Request revisions and loop back to the previous phase

The workflow does not auto-resume. The human must explicitly trigger the next phase. This prevents runaway generation where an agent writes 50 files while you’re in a meeting.
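The three outcomes at a gate can be written down as a small routing function. This is a sketch under stated assumptions: the decision names (`approve`, `reject`, `revise`) and the function name are hypothetical, and "loop back" is interpreted as returning to the agent phase that produced the artifact under review (2, 4, or 6), which matches the Phase 8 to Phase 6 loop described under failure modes.

```shell
#!/usr/bin/env bash
# Sketch only. Prints the phase to run next, given an approval gate
# (3, 5, or 8) and the reviewer's decision. Decision names are hypothetical.
next_phase_after_gate() {
  local gate="$1" decision="$2" producer
  case "$gate" in
    3) producer=2 ;;   # Spec Review checks the output of Spec Generation
    5) producer=4 ;;   # Plan Review checks the Implementation Plan
    8) producer=6 ;;   # Human Review checks Code Generation (assumed target)
    *) echo "not an approval gate: $gate" >&2; return 2 ;;
  esac
  case "$decision" in
    approve) echo $((gate + 1)) ;;   # continue to the following phase
    reject)  echo 1 ;;               # restart from Requirement Capture
    revise)  echo "$producer" ;;     # loop back to the producing phase
    *) echo "unknown decision: $decision" >&2; return 2 ;;
  esac
}
```

The function only computes a destination. Consistent with the no-auto-resume rule, something outside it (the human) still has to start the phase it names.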

State Handoff Mechanics

Each phase writes a structured artifact to a known location. The orchestrator (a shell script or task runner) checks for the artifact’s existence before starting the next phase.

Example handoff between Phase 2 and Phase 3:

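```shell
#!/usr/bin/env bash
set -euo pipefail  # abort on any failed command or unset variable

# Phase 2: Spec Generation
agent generate-spec --input requirements.md --output spec.md

# Orchestrator checks that spec.md exists and is non-empty
if [ ! -s spec.md ]; then
  echo "Phase 2 failed: spec.md missing or empty" >&2
  exit 1
fi

# Phase 3: Human reviews spec.md and writes approval.txt
# Remove any approval left over from an earlier run so it cannot approve this spec
rm -f approval.txt
# Orchestrator waits for approval.txt
while [ ! -f approval.txt ]; do
  sleep 5
done

# Phase 4: Implementation Plan reads spec.md
agent generate-plan --input spec.md --output plan.md
```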

This file-based handoff is simple but has trade-offs. It works well for single-developer workflows where phases run sequentially. It breaks down if multiple agents need to collaborate in parallel or if you need to replay a phase without re-running earlier steps.
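One weakness of a bare approval.txt is that it does not say which version of the spec was approved. A minimal sketch of one way to tie the two together, assuming the approval file stores a checksum of the spec (this is not described in the source, and the function names are hypothetical). `cksum` is used because it is specified by POSIX; it detects edits but is not a cryptographic hash.

```shell
#!/usr/bin/env bash
# Sketch only. Records an approval for the current contents of a spec file
# by storing the spec's checksum and byte count in the approval file.
approve_spec() {
  local spec="$1" approval="$2"
  cksum < "$spec" > "$approval"
}

# Succeeds only if the approval was recorded for the spec as it is now.
# Fails if the approval file is missing or the spec changed afterwards.
approval_is_current() {
  local spec="$1" approval="$2"
  [ -f "$approval" ] && [ "$(cat "$approval")" = "$(cksum < "$spec")" ]
}
```

With this in place, the polling loop could wait on `approval_is_current spec.md approval.txt` instead of on the file's existence, so an edit to spec.md after review invalidates the approval.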

Failure Modes and Recovery

The workflow exposes three common failure modes:

1. Agent hallucinates during Spec Generation (Phase 2)

The agent invents APIs or features that don’t exist. The human catches this at Phase 3 (Spec Review), then rejects. The workflow restarts from Phase 1 with clarified requirements.

2. Code Generation produces breaking changes (Phase 6)

The agent rewrites a public interface without updating callers. The human catches this at Phase 8 (Human Review), then loops back to Phase 6 with explicit constraints (“do not change the signature of X”).

3. Integration tests fail (Phase 10)

The agent’s code passes unit tests but breaks downstream services. The workflow loops back to Phase 9 (Revision), using test output as input. If the agent can’t fix it after two loops, the human intervenes and writes the fix manually.

The workflow does not auto-retry. Each failure requires human diagnosis to decide whether to loop back, restart, or escalate.
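The three recovery paths above amount to a lookup table with one counter. The sketch below encodes them as a suggestion for the human to confirm, not as an automatic retry; the function name and the output strings are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. Prints the recovery path for a failure detected at phase $1.
# $2 is the number of revision loops already attempted (default 0).
recovery_target() {
  local detected_at="$1" attempts="${2:-0}"
  case "$detected_at" in
    3)  echo "phase 1" ;;   # hallucinated spec: restart with clarified requirements
    8)  echo "phase 6" ;;   # breaking change: regenerate with explicit constraints
    10) if [ "$attempts" -ge 2 ]; then
          echo "manual fix" # two loops used up: the human writes the fix
        else
          echo "phase 9"    # feed test output into Revision
        fi ;;
    *)  echo "human diagnosis" ;;  # anything else needs a person to decide
  esac
}
```

For example, `recovery_target 10 2` prints `manual fix`, reflecting the two-loop limit on integration-test revisions.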

Observability and Audit Trail

Every phase writes a timestamped log entry to a central audit file. The log includes:

  • Phase name
  • Input artifact path
  • Output artifact path
  • Agent model and version
  • Human approval status
  • Timestamp

This creates a queryable history. If a bug appears three months later, you can trace it back to the spec and plan that introduced it.

Example audit entry:

```json
{
  "phase": "code_generation",
  "timestamp": "2025-05-20T14:32:11Z",
  "input": "plan.md",
  "output": "diff.txt",
  "agent": "claude-3.5-sonnet-20241022",
  "approved_by": "human",
  "commit_sha": "a3f9c2d"
}
```

The audit file is append-only. It lives in the project root and gets committed with every merge.
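An append-only log of this kind can be sketched with `printf` and `grep` alone. The layout here is an assumption: one JSON object per line in a file named audit.jsonl, whereas the source shows a pretty-printed entry and does not specify the on-disk format. The function names are hypothetical, and the sketch omits the commit SHA field.

```shell
#!/usr/bin/env bash
# Sketch only. Hypothetical file name; override by setting AUDIT_FILE.
AUDIT_FILE="${AUDIT_FILE:-audit.jsonl}"

# Appends one entry as a single JSON line. Values are assumed to contain no
# double quotes, backslashes, or newlines; a real script would escape them.
audit_append() {
  local phase="$1" input="$2" output="$3" agent="$4" approved_by="$5"
  printf '{"phase":"%s","timestamp":"%s","input":"%s","output":"%s","agent":"%s","approved_by":"%s"}\n' \
    "$phase" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$input" "$output" "$agent" "$approved_by" \
    >> "$AUDIT_FILE"   # >> appends, so earlier entries are never rewritten
}

# Prints every entry whose output artifact matches $1 exactly.
audit_find_output() {
  grep -F "\"output\":\"$1\"" "$AUDIT_FILE"
}
```

A call such as `audit_append code_generation plan.md diff.txt some-model human` adds one line, and `audit_find_output diff.txt` later retrieves every entry that produced that artifact.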

Multi-Agent Extension (AMAW)

The optional AMAW extension adds a debate phase between Phases 7 and 8. Instead of one agent reviewing its own code, three agents review in parallel:

  • Advocate – Argues why the code is correct
  • Critic – Identifies flaws and edge cases
  • Mediator – Synthesizes feedback and proposes fixes

The debate output goes to the human at Phase 8. This catches more issues than single-agent self-review but adds cost (3x the API calls, roughly $0.15 per debate cycle at current Claude pricing), latency, and complexity (you need to reconcile three conflicting opinions).

The author uses AMAW only for high-risk changes (database schema migrations, authentication logic, public API changes). For routine features, single-agent review is sufficient.
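A debate round can be sketched with shell background jobs. Everything here is illustrative: `review_as` is a hypothetical stand-in for a real agent call, the output file names are invented, and running the Mediator after the Advocate and Critic (because a synthesis needs their output) is an assumption rather than something the source specifies.

```shell
#!/usr/bin/env bash
# Sketch only. Hypothetical stand-in for an agent call: writes one role's
# review of a file to <out_dir>/<role>.md. Replace the body with a real call.
review_as() {
  local role="$1" target="$2" out_dir="$3"
  printf '[%s] reviewed %s\n' "$role" "$target" > "$out_dir/$role.md"
}

# Runs Advocate and Critic concurrently, then the Mediator on their combined
# output, and collects all three reviews into debate.md for Phase 8.
run_debate() {
  local diff_file="$1" out_dir="$2"
  mkdir -p "$out_dir"
  review_as advocate "$diff_file" "$out_dir" &
  review_as critic "$diff_file" "$out_dir" &
  wait   # block until both background reviews have finished
  cat "$out_dir/advocate.md" "$out_dir/critic.md" > "$out_dir/mediator-input.md"
  review_as mediator "$out_dir/mediator-input.md" "$out_dir"
  cat "$out_dir/advocate.md" "$out_dir/critic.md" "$out_dir/mediator.md" \
    > "$out_dir/debate.md"
}
```

Calling `run_debate diff.txt debate-out` leaves a single debate.md for the reviewer, so the human reads one file instead of reconciling three.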

Deployment Shape

The workflow assumes:

  • A single developer or small team
  • Sequential phase execution (no parallel branches)
  • File-based state handoff
  • Human in the loop for all approvals
  • Local or cloud-hosted agents (no on-prem requirement)

It does not handle:

  • Concurrent feature branches (file-based state cannot coordinate concurrent branches because there is no locking mechanism or merge strategy for conflicting artifacts)
  • Multi-developer approval workflows (approval.txt is a single file with no routing or delegation logic)
  • Real-time collaboration between agents (polling every 5 seconds adds up to 5 seconds of latency per handoff)
  • Automatic rollback on test failure (Phase 10 failure requires manual diagnosis and loop-back decision)

For teams larger than 3-5 people, you’ll need a proper orchestration platform (Temporal, Prefect, or a custom state machine) instead of shell scripts and file artifacts.
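To make the locking limitation concrete, here is a minimal sketch of the kind of mutual exclusion the file-based design lacks. It is not part of the described workflow; the lock directory name and function names are hypothetical. It relies on `mkdir` being atomic, so only one caller can create the lock directory.

```shell
#!/usr/bin/env bash
# Sketch only. Tries to take an exclusive lock on an artifact directory.
# mkdir either creates the lock directory or fails, never both, so exactly
# one concurrent caller succeeds.
acquire_lock() {
  mkdir "$1/.phase.lock" 2>/dev/null
}

# Releases a lock taken with acquire_lock.
release_lock() {
  rmdir "$1/.phase.lock"
}
```

A runner would wrap each phase in `acquire_lock . || exit 1` and `release_lock .`. This prevents two runs from overwriting the same artifacts, but it does nothing to merge conflicting specs or plans, which is why the source points larger teams to a workflow engine.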

Trade-Off Table

| Dimension | Benefit | Cost |
| --- | --- | --- |
| Explicit phases | Clear failure boundaries, easy to debug | Sequential execution adds 2-4 hours per feature end-to-end |
| File-based state | Simple, auditable, no external dependencies | No parallelism, hard to replay phases, 5-second polling overhead |
| Human approval gates | Prevents runaway generation, maintains quality | Blocks progress, requires human availability within review SLA |
| Audit trail | Full traceability from bug to original spec | Large log files (50KB per feature), manual querying with grep/jq |
| Multi-agent debate | Catches more edge cases in high-risk code | 3x API cost ($0.15/cycle), 10-15 minute debate latency |

Technical Verdict

Use this workflow if:

  • You can afford 2-4 hour review cycles per feature (the three approval checkpoints add latency but prevent hallucinated specs, scope creep, and breaking changes)
  • You need full audit trails for compliance (SOC 2, internal security reviews, or post-incident analysis where you must trace bugs back to original specs)
  • You’re working solo or with 2-3 developers (file-based state works for sequential workflows but has no locking for concurrent branches)
  • You have time to review every spec and diff (the workflow blocks until you approve)
  • You’re building production systems where correctness matters more than iteration speed

Avoid this workflow if:

  • You need iteration cycles under an hour (the 12 phases add at least 2-4 hours per feature, even for small changes)
  • You have 5+ developers needing parallel workflows (file-based state breaks down with concurrent branches because there is no locking mechanism, and multi-developer approval requires routing logic that a single approval.txt file cannot provide)
  • You want agents to operate autonomously without human approval (the three hard stops require human availability within your team’s review SLA)
  • You’re building low-stakes prototypes where bugs are acceptable (the overhead isn’t worth it for throwaway code)
  • You need real-time agent collaboration (polling every 5 seconds adds up to 5 seconds of latency per handoff)

The 2,500-commit track record shows it works for production systems built by one person with high correctness requirements. If you need faster iteration, remove approval gates and accept more bugs. If you need team collaboration, replace file artifacts with a proper state machine (Temporal workflows, Prefect tasks, or AWS Step Functions) and approval queue (GitHub PR reviews, PagerDuty escalations, or Slack approval bots).
