Your CI pipeline is green. Your agent-generated code compiles, passes tests, and ships. Three weeks later, your team discovers the README documents an API that doesn’t exist, the tests validate the patch instead of the behavior, and a config file has become the repair surface for a runtime bug the agent never understood.
This is drift. Not the kind you catch with linters or code review. The kind where every artifact is individually correct but the repository no longer tells a coherent story.
The Problem: Probabilistic Code Generation Meets Deterministic Quality Gates
Traditional CI/CD assumes code is written by humans who understand system invariants. Pre-commit hooks check formatting. Unit tests verify behavior. Integration tests validate contracts. Code review catches architectural mistakes.
Agent-generated code breaks every assumption:
- Context fragility: The agent had 8,000 tokens of conversation history when it wrote the function. The repo has none of that.
- Hallucinated dependencies: The agent confidently imports a library method that doesn’t exist in that version.
- Semantically correct, architecturally wrong: The code does what the prompt asked, but violates an unstated system constraint.
- Test inversion: The agent rewrites tests to pass the new code instead of preserving the original behavioral contract.
Your pipeline sees passing tests and clean diffs. It has no mechanism to detect that the agent optimized for prompt completion instead of system coherence.
What Fails in Standard CI/CD
| Pipeline Stage | Human Code Assumption | Agent Code Reality |
|---|---|---|
| Pre-commit hooks | Developer knows formatting rules | Agent applies rules inconsistently across runs |
| Unit tests | Tests define behavior | Agent rewrites tests to match new code |
| Code review | Reviewer has context | No human saw the 47-turn conversation that led here |
| Integration tests | Contracts are stable | Agent changed both sides of the contract |
| Deployment | Artifacts match source | Generated code may reference runtime state not in repo |
The failure mode is not syntax errors or test failures. It’s semantic drift: the gap between what the repository claims and what the system actually does.
The Missing Primitives
To make agent-generated code auditable, you need new quality gates:
1. Provenance Tracking
Every generated artifact needs metadata:
# .agent-provenance/commit-abc123.yaml
agent_run_id: "run_2026-06-12_14-32-01"
model: "gpt-4-turbo-2024-04-09"
prompt_hash: "sha256:7f3a9c..."
context_window_tokens: 8192
tools_available:
- file_editor
- shell_executor
- web_search
input_files:
- src/api/routes.py
- tests/test_routes.py
output_files:
- src/api/routes.py (modified)
- src/api/middleware.py (created)
- tests/test_routes.py (modified)
conversation_turns: 12
This lets you trace any changed file back to the agent run, model version, and input context that produced it. When drift appears three weeks later, you can reconstruct what the agent was given and which tools it could use, even though the conversation itself is gone.
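A record like the one above can be produced with very little code. The following is a minimal sketch, not a reference implementation: the `ProvenanceRecord` class and `prompt_hash` helper are hypothetical names, the prompt text is invented for illustration, and JSON is used instead of YAML only to stay within the standard library.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


def prompt_hash(prompt: str) -> str:
    """Return a content hash in the 'sha256:<hex>' form used in the record."""
    return "sha256:" + hashlib.sha256(prompt.encode("utf-8")).hexdigest()


@dataclass
class ProvenanceRecord:
    # Field names mirror a subset of the YAML example; the class is illustrative.
    agent_run_id: str
    model: str
    prompt_hash: str
    input_files: list = field(default_factory=list)
    output_files: list = field(default_factory=list)
    conversation_turns: int = 0

    def to_json(self) -> str:
        # sort_keys keeps the serialized form stable so records diff cleanly.
        return json.dumps(asdict(self), sort_keys=True, indent=2)


record = ProvenanceRecord(
    agent_run_id="run_2026-06-12_14-32-01",
    model="gpt-4-turbo-2024-04-09",
    prompt_hash=prompt_hash("Add rate limiting to the users endpoint."),
    input_files=["src/api/routes.py", "tests/test_routes.py"],
    output_files=["src/api/routes.py", "src/api/middleware.py"],
    conversation_turns=12,
)
print(record.to_json())
```

Hashing the prompt rather than storing it keeps the record small and avoids committing conversation content, at the cost of needing the original prompt to verify a match.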
2. Contract Verification
Standard integration tests check “does this endpoint return 200?” Agent-generated code needs deeper checks:
- Schema stability: Did the agent change the response shape without updating the OpenAPI spec?
- Behavioral invariants: Does the new code preserve the original function’s side effects?
- Dependency coherence: Are all imported modules actually available in the declared environment?
Example check:
# tests/agent_quality/test_contract_stability.py
def test_api_schema_matches_openapi_spec():
    """Verify agent didn't silently change API contracts."""
    # load_openapi_spec and introspect_live_api are placeholder helpers.
    spec = load_openapi_spec("api/openapi.yaml")
    runtime_schema = introspect_live_api("/api/users")
    assert runtime_schema == spec["paths"]["/api/users"]["get"]["responses"]["200"]
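The dependency coherence check from the list above can be approximated in the same spirit. This is a rough sketch under stated assumptions: `missing_imports` is a hypothetical helper, it only inspects absolute imports, and it checks against whatever environment runs the check, which is assumed to match the declared one.

```python
import ast
import importlib.util


def missing_imports(source: str) -> list:
    """Return top-level modules imported by `source` that cannot be found."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            modules.add(node.module.split(".")[0])
    return sorted(m for m in modules if importlib.util.find_spec(m) is None)


# The last module name is deliberately fake to show a failing case.
generated = "import json\nfrom os import path\nimport surely_not_a_real_pkg_xyz\n"
print(missing_imports(generated))
```

This catches missing modules only. A hallucinated method on a real module needs a further step, such as importing the module in a sandbox and checking the attribute with `hasattr`.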
3. Context Versioning
The agent’s prompt templates, tool schemas, and system instructions are infrastructure. They should live in version control alongside the code they generate:
repo/
├── src/
├── tests/
├── .agent/
│ ├── prompts/
│ │ ├── code_generation.md
│ │ └── test_writing.md
│ ├── tools/
│ │ ├── file_editor.json
│ │ └── shell_executor.json
│ └── system_instructions.md
When the agent changes behavior, you can diff the prompt templates to understand why.
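To tie a provenance record to an exact version of that context, one option is a single fingerprint over the whole `.agent/` directory. The sketch below assumes a hypothetical `context_fingerprint` helper; the `sha256:` prefix simply follows the convention of the `prompt_hash` field shown earlier.

```python
import hashlib
from pathlib import Path


def context_fingerprint(context_dir: str) -> str:
    """Hash every file under context_dir, in sorted order, into one digest."""
    root = Path(context_dir)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so renames change the fingerprint too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return "sha256:" + digest.hexdigest()
```

Calling `context_fingerprint(".agent")` at the start of a run and storing the result alongside the run ID makes it possible to tell whether two runs saw the same prompts and tool schemas without diffing every file.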
4. Observability Hooks
Your deployment pipeline needs to know:
- Which commits contain agent-generated code
- Which model version produced each artifact
- How many conversation turns preceded each change
- Whether the agent had access to the full codebase or a subset
This metadata feeds into post-deployment monitoring. If a service degrades, you can correlate it with agent runs.
Architecture: Agent-Aware CI Pipeline
Here’s what a quality gate looks like when it understands probabilistic code generation:
graph TD
A[Agent generates code] --> B[Provenance metadata written]
B --> C[Standard linting/tests]
C --> D{Contract verification}
D -->|Pass| E{Behavioral invariant checks}
D -->|Fail| F[Block merge]
E -->|Pass| G{Context coherence scan}
E -->|Fail| F
G -->|Pass| H[Human review with agent context]
G -->|Fail| F
H --> I[Merge with provenance tag]
The new stages:
- Contract verification: Compare runtime behavior against declared interfaces
- Behavioral invariant checks: Ensure the agent didn’t break unstated system rules
- Context coherence scan: Verify README, comments, and docs match actual code
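A context coherence scan does not have to be sophisticated to be useful. As a minimal sketch, assuming routes are documented as literal `/api/...` paths and that the set of implemented routes can be obtained from the framework, the hypothetical `route_drift` helper below compares the two:

```python
import re

# Matches literal paths such as /api/users or /api/users/{id}/export.
ROUTE_PATTERN = re.compile(r"/api/[A-Za-z0-9_/{}-]+")


def route_drift(readme_text: str, implemented_routes: set) -> dict:
    """Compare routes mentioned in docs against routes the code defines."""
    documented = set(ROUTE_PATTERN.findall(readme_text))
    return {
        "documented_but_missing": documented - implemented_routes,
        "implemented_but_undocumented": implemented_routes - documented,
    }


# Illustrative inputs; a real scan would read README.md and query the app.
readme = "Call `/api/users` to list users and `/api/users/{id}/export` to export."
routes = {"/api/users", "/api/health"}
report = route_drift(readme, routes)
print(report)
```

A non-empty `documented_but_missing` set is the "README documents an API that doesn't exist" case and is a reasonable candidate for blocking the merge; the other direction may only warrant a warning.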
Failure Modes You Can Now Catch
With provenance tracking, contract verification, and the coherence scan in place, you can detect:
- Test inversion: Agent rewrote tests to pass new code. The diff shows test assertions changed in the same commit as the implementation.
- Hallucinated dependencies: Agent imported requests.Session.close_all(), which doesn’t exist. Contract check fails because the method isn’t in the library.
- Silent contract changes: Agent modified API response shape. OpenAPI spec still describes the old shape. Schema stability test fails.
- Context drift: Agent added a config option that only makes sense given a conversation turn you can’t see in the repo. Coherence scan flags undocumented behavior.
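The test inversion signal in the list above is cheap to compute from a commit's file list. The sketch below is a heuristic under stated assumptions: test files are assumed to live under a `tests/` directory or be named `test_*`, and both function names are hypothetical.

```python
from pathlib import PurePosixPath


def is_test_file(path: str) -> bool:
    """Assumed convention: tests live under tests/ or are named test_*."""
    p = PurePosixPath(path)
    return "tests" in p.parts or p.name.startswith("test_")


def flags_test_inversion(changed_files: list) -> bool:
    """True when one commit touches both tests and non-test Python source."""
    touches_tests = any(is_test_file(f) for f in changed_files)
    touches_source = any(
        not is_test_file(f) and f.endswith(".py") for f in changed_files
    )
    return touches_tests and touches_source


commit_files = ["src/api/routes.py", "src/api/middleware.py", "tests/test_routes.py"]
print(flags_test_inversion(commit_files))
```

Legitimate changes also touch tests and implementation together, so this works better as a trigger for closer review of the changed assertions than as a hard block.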
Implementation: Minimal Provenance Layer
You don’t need a new platform. Start with a Git hook that writes metadata:
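#!/bin/bash
# .git/hooks/post-commit
# Writes a provenance file for commits made while an agent run is active.

# The amend below re-triggers this hook; the guard variable stops the loop.
[ -n "$AGENT_PROVENANCE_AMEND" ] && exit 0

if [ -f ".agent-run-active" ]; then
  RUN_ID=$(cat .agent-run-active)
  mkdir -p .agent-provenance
  # Named after the pre-amend hash; the amended commit gets a new hash.
  OUT=".agent-provenance/$(git rev-parse HEAD).yaml"
  {
    echo "agent_run_id: \"$RUN_ID\""
    echo "model: \"$(cat .agent-model)\""
    echo "timestamp: \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\""
    echo "files_modified:"
    # One YAML list item per changed path.
    git diff-tree --no-commit-id --name-only -r HEAD | sed 's/^/  - /'
  } > "$OUT"
  git add .agent-provenance/
  AGENT_PROVENANCE_AMEND=1 git commit --amend --no-edit
fi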
The agent writes .agent-run-active (containing the run ID) when it starts and removes it when it finishes; the hook also reads the model name from .agent-model. Every commit made during the run gets a provenance file.
The Observability Gap
Most teams treat agent-generated code as “just code.” They lose the ability to answer:
- Which production incidents trace back to agent runs?
- Which model versions produce the most drift?
- Which prompt templates lead to contract violations?
You need telemetry that connects runtime behavior to agent provenance:
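# In your error tracking setup
import sentry_sdk

# get_provenance_for_file, get_model_version, and get_turn_count are
# placeholder helpers that look up this file's provenance record.
sentry_sdk.set_context("agent_provenance", {
    "run_id": get_provenance_for_file(__file__),
    "model": get_model_version(__file__),
    "conversation_turns": get_turn_count(__file__),
})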
When an exception fires, you see which agent run produced the failing code.
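Once incidents carry provenance, the questions above become simple aggregations. The following sketch assumes incidents are dictionaries with a `file` key and that provenance is available as a mapping from file path to record; both shapes, and the `incidents_by_model` name, are assumptions for illustration.

```python
from collections import Counter


def incidents_by_model(incidents: list, provenance: dict) -> Counter:
    """Count incidents per model version; untracked files get their own bucket."""
    counts = Counter()
    for incident in incidents:
        record = provenance.get(incident["file"])
        counts[record["model"] if record else "no-provenance"] += 1
    return counts


# Made-up sample data, only to show the input shapes.
sample_provenance = {"src/api/middleware.py": {"model": "model-a"}}
sample_incidents = [
    {"file": "src/api/middleware.py"},
    {"file": "src/api/middleware.py"},
    {"file": "src/legacy/billing.py"},
]
print(incidents_by_model(sample_incidents, sample_provenance))
```

The same grouping works for run ID or prompt template, provided those fields are present in the provenance record.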
Security Boundaries
Agent-generated code introduces new attack surfaces:
- Prompt injection via code comments: Agent reads existing code, including comments. Malicious comments can steer future generations.
- Dependency confusion: Agent hallucinates package names. Attacker registers the package. Agent imports it.
- Credential leakage: Agent writes API keys into config files because it saw them in conversation history.
Your CI pipeline needs to scan for:
- Hardcoded secrets in agent-generated files
- Dependencies that don’t exist in your package registry
- Code that references environment variables not declared in your deployment config
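The third scan in that list can be sketched with a regular expression. This is a simplified illustration, assuming Python sources that read variables through `os.environ[...]`, `os.environ.get(...)`, or `os.getenv(...)` with a literal name; `undeclared_env_vars` is a hypothetical helper and the variable names are invented.

```python
import re

# Captures the literal variable name from the three common access forms.
ENV_PATTERN = re.compile(
    r"""os\.(?:environ\[\s*|environ\.get\(\s*|getenv\(\s*)["']([A-Za-z_][A-Za-z0-9_]*)["']"""
)


def undeclared_env_vars(source: str, declared: set) -> set:
    """Return env var names read by `source` that the deployment config lacks."""
    return set(ENV_PATTERN.findall(source)) - declared


generated = (
    'key = os.environ["PAYMENTS_API_KEY"]\n'
    'url = os.getenv("DATABASE_URL")\n'
)
print(undeclared_env_vars(generated, declared={"DATABASE_URL"}))
```

Names built dynamically at runtime will slip past a pattern like this, so it complements a secret scanner and a registry check rather than replacing them.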
When More AI Doesn’t Help
The instinct is to add another agent: one that reviews the first agent’s code, writes better tests, or explains its reasoning. This creates a new problem.
You now have two probabilistic systems. The reviewer agent can drift. It can hallucinate. It can optimize for approval instead of correctness. You’ve added complexity without adding determinism.
The solution is not more agents. It’s deterministic checks that validate agent output against known-good state: schemas, contracts, behavioral invariants, and dependency manifests.
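One way to keep those checks deterministic is to express each gate as a plain function that returns pass or fail, and run them in a fixed order. The sketch below uses hypothetical names (`GateResult`, `run_gates`) and stand-in lambdas where real checks would go.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class GateResult(NamedTuple):
    passed: bool
    failed_gate: Optional[str]


def run_gates(gates: List[Tuple[str, Callable[[], bool]]]) -> GateResult:
    """Run gates in order; the first failure blocks the merge."""
    for name, check in gates:
        if not check():
            return GateResult(False, name)
    return GateResult(True, None)


# Stand-in checks; real gates would call schema, invariant, and coherence tests.
result = run_gates([
    ("contract_verification", lambda: True),
    ("behavioral_invariants", lambda: False),
    ("context_coherence", lambda: True),
])
print(result)
```

Because each gate is an ordinary function over repository state, the same input always yields the same verdict, which is the property a reviewer agent cannot offer.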
Technical Verdict
Use agent-generated code when:
- You version-control agent context (prompts, tools, system instructions) alongside code
- Your CI pipeline includes contract verification and behavioral invariant checks
- You have provenance tracking that connects every artifact to the agent run that produced it
- You treat agent output as untrusted input that must pass deterministic quality gates
Avoid agent-generated code when:
- Your pipeline only checks syntax and unit tests
- You have no way to trace code back to the conversation that produced it
- Your team treats agent output as equivalent to human-written code
- You rely on another AI to review the first AI’s work instead of deterministic checks
The problem is not that agents write bad code. It’s that they write code optimized for prompt completion instead of system coherence. Your infrastructure needs to detect the difference.
Source Links
- AI Code Quality Is Not Repo Truth (Dev.to)