AI agents write code faster than humans can read it. The gap between generation speed and review capacity creates a new class of technical debt that traditional linters and code review processes were not built to catch.
A developer building a Flutter media player entirely with AI agents cut nearly a third of the codebase, and about 37% of the Dart code in lib/, without losing functionality. The reduction was not about bugs or broken features. It was about removing layers of abstraction, phantom error handling, and defensive code that agents introduced because they optimize for local correctness, not global coherence.
The AI Code Smell
AI-generated code has a signature. You can spot it in GitHub repos before you read a single line:
- Verbose READMEs that explain everything and clarify nothing
- Over-abstraction where simple functions are wrapped in interfaces, factories, and strategy patterns
- Defensive error handling that catches exceptions that will never be thrown
- Phantom dependencies where imports exist but the code never calls them
- Half-fixes where agents patch symptoms instead of removing root causes
The problem is not that agents write bad code. The problem is that agents write plausible code that accumulates faster than humans can process it. Each commit looks fine in isolation. The bloat emerges over time as agents layer new code on top of old patterns without refactoring the foundation.
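To make the smell concrete, here is a small illustrative sketch in Python. It is not taken from the Flutter project; the duration formatter and every name in it are hypothetical. The first half shows the shape agents tend to produce (an interface, a factory, and a handler for an exception that cannot occur), and the second half shows the same behavior as one function.

```python
from abc import ABC, abstractmethod


# --- Bloated version: hypothetical example of agent-style layering ---------
class DurationFormatter(ABC):
    """Interface that has exactly one implementation."""

    @abstractmethod
    def format(self, seconds: int) -> str: ...


class DefaultDurationFormatter(DurationFormatter):
    def format(self, seconds: int) -> str:
        try:
            minutes, secs = divmod(seconds, 60)
            return f"{minutes}:{secs:02d}"
        except Exception:
            # Defensive handler: divmod of an int by 60 cannot raise.
            return "0:00"


class DurationFormatterFactory:
    @staticmethod
    def create() -> DurationFormatter:
        return DefaultDurationFormatter()


def format_duration_bloated(seconds: int) -> str:
    return DurationFormatterFactory.create().format(seconds)


# --- Debloated version: same behavior, one function ------------------------
def format_duration(seconds: int) -> str:
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}:{secs:02d}"
```

Each piece of the bloated version would pass review on its own. The cost only shows up when you notice that three classes and a try block exist to do what two lines do.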
The Review Workflow Gap
Traditional code review assumes humans write code at human speed. You can read a pull request, understand the context, and decide whether the abstraction makes sense. That workflow breaks when agents commit code far faster than you can review it.
The cognitive load is different. When you open an AI-generated file, you are not just reviewing logic. You are reverse-engineering intent from plausible-looking patterns. You have to ask:
- Why does this function exist?
- Is this abstraction solving a real problem or a hypothetical one?
- Did the agent introduce this layer because it was asked to, or because it pattern-matched on something in the training data?
You cannot answer those questions by skimming. You have to trace dependencies, check call sites, and mentally diff the current state against what the agent was originally asked to build. That takes time. More time than the agent took to write it.
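Part of that tracing can be mechanized. The sketch below assumes a Python code base (the project in this article is Dart, so treat it as an illustration of the idea only) and uses the standard `ast` module to count how often each function defined in a file is called by name. The function name `count_call_sites` and the sample source are hypothetical. It matches calls by name only, so it will miscount when two functions share a name or when a function is passed around as a value.

```python
import ast
from collections import Counter


def count_call_sites(source: str) -> dict[str, int]:
    """Map each function defined in `source` to the number of calls by name."""
    tree = ast.parse(source)
    defined = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    calls = Counter()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):          # plain call: load(path)
                calls[func.id] += 1
            elif isinstance(func, ast.Attribute):   # method call: obj.load(path)
                calls[func.attr] += 1
    return {name: calls[name] for name in sorted(defined)}


SAMPLE = '''
def load(path):
    return open(path).read()

def load_safely(path):
    return load(path)

def unused_helper():
    return 42

print(load_safely("a.txt"))
'''

call_counts = count_call_sites(SAMPLE)
```

A function with zero or one call site is not automatically bloat, but it is a good place to start asking why the function exists.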
What Debloating Looks Like
The developer’s experiment reduced a Flutter app from 19,772 lines to 13,509 lines (a 31.7% reduction) while keeping all 335 tests green. The Dart code in lib/ dropped from 15,859 lines to 9,924 lines (a 37.4% reduction). No features were removed. No functionality was lost.
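The percentages follow directly from the line counts quoted above. A quick check, with `reduction_pct` as a hypothetical helper name:

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage of lines removed, rounded to one decimal place."""
    return round((before - after) / before * 100, 1)


whole_app = reduction_pct(19_772, 13_509)   # every counted line in the app
lib_only = reduction_pct(15_859, 9_924)     # Dart code under lib/ only

lines_removed_total = 19_772 - 13_509
lines_removed_lib = 15_859 - 9_924
```

The two removal counts differ by only a few hundred lines, which means almost all of the deleted code lived in lib/.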
The reduction came from:
- Collapsing abstraction layers where interfaces had one implementation
- Removing defensive error handling that was never triggered
- Deleting phantom dependencies that were imported but unused
- Simplifying control flow where agents had introduced nested conditionals for edge cases that did not exist
This is not refactoring in the traditional sense. Refactoring assumes the code was designed by a human who had a mental model. Debloating assumes the code was generated by an agent that optimized for local correctness without a global plan.
Static Analysis Gaps
Existing linters catch unused imports and dead code. They do not catch:
- Over-abstraction: An interface with one implementation is valid code. It is also unnecessary.
- Defensive bloat: Try-catch blocks that never trigger are syntactically correct. They are also noise.
- Phantom layers: A service class that wraps a single function call is legal. It is also pointless.
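The phantom-layer case is the easiest of the three to approximate with a syntax-tree check. The following is a minimal sketch for Python sources using the standard `ast` module, not the tooling used in the experiment. It treats a function as a wrapper when its body, ignoring a docstring, is a single call that forwards only the function's own parameters. The names `is_single_call_wrapper` and `find_wrappers` and the sample source are assumptions for illustration.

```python
import ast


def is_single_call_wrapper(func: ast.FunctionDef) -> bool:
    """True if the body is one call that only forwards the function's parameters."""
    body = func.body
    # Ignore a leading docstring.
    if (
        body
        and isinstance(body[0], ast.Expr)
        and isinstance(body[0].value, ast.Constant)
        and isinstance(body[0].value.value, str)
    ):
        body = body[1:]
    if len(body) != 1 or not isinstance(body[0], (ast.Return, ast.Expr)):
        return False
    call = body[0].value
    if not isinstance(call, ast.Call):
        return False
    args = func.args
    params = {a.arg for a in args.posonlyargs + args.args + args.kwonlyargs}
    passed = [*call.args, *(kw.value for kw in call.keywords)]
    # Any literal, nested call, or *args means the function adds something.
    return all(isinstance(a, ast.Name) and a.id in params for a in passed)


def find_wrappers(source: str) -> list[str]:
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef) and is_single_call_wrapper(node)
    ]


SAMPLE = '''
def fetch(url):
    """Fetch a URL."""
    return http_get(url)

def fetch_json(url):
    return parse(http_get(url))

def fetch_with_timeout(url):
    return http_get(url, timeout=5)
'''

wrappers = find_wrappers(SAMPLE)
```

Only `fetch` is flagged here. The other two functions add a parse step or a default timeout, so they do real work even though they are short.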
You need different heuristics to detect AI-generated bloat:
| Pattern | Traditional Linter | AI Bloat Detector |
|---|---|---|
| Unused import | ✓ Catches | ✓ Catches |
| Interface with one implementation | ✗ Ignores | ✓ Flags |
| Try-catch that never throws | ✗ Ignores | ✓ Flags |
| Function that wraps a single call | ✗ Ignores | ✓ Flags |
| Abstraction introduced for hypothetical future use | ✗ Ignores | ✓ Flags |
The challenge is distinguishing legitimate architectural choices from agent-generated bloat. A human might introduce an interface with one implementation because they know a second implementation is coming. An agent introduces it because the prompt mentioned “extensibility” or because the training data had similar patterns.
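One way to encode that distinction is to let humans declare the extension points they actually intend to grow. The sketch below is an assumption about how such a check could look for Python sources, not a description of an existing tool: it treats any class that inherits directly from `ABC` as an interface, counts its direct subclasses in the same file, and skips names listed in a hypothetical `planned_extension_points` allowlist.

```python
import ast
from collections import defaultdict


def _base_names(cls: ast.ClassDef) -> set[str]:
    """Names of the direct base classes, e.g. {'ABC'} for class X(abc.ABC)."""
    names = set()
    for base in cls.bases:
        if isinstance(base, ast.Name):
            names.add(base.id)
        elif isinstance(base, ast.Attribute):
            names.add(base.attr)
    return names


def single_implementation_interfaces(
    source: str, planned_extension_points: frozenset[str] = frozenset()
) -> list[str]:
    classes = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.ClassDef)]
    interfaces = {c.name for c in classes if "ABC" in _base_names(c)}
    implementations = defaultdict(list)
    for c in classes:
        for base in _base_names(c) & interfaces:
            implementations[base].append(c.name)
    return sorted(
        name
        for name in interfaces
        if len(implementations[name]) == 1 and name not in planned_extension_points
    )


SAMPLE = '''
from abc import ABC

class Player(ABC): ...
class VlcPlayer(Player): ...

class Cache(ABC): ...
class MemoryCache(Cache): ...
class DiskCache(Cache): ...

class Exporter(ABC): ...
class CsvExporter(Exporter): ...
'''

flagged_all = single_implementation_interfaces(SAMPLE)
flagged_with_plan = single_implementation_interfaces(SAMPLE, frozenset({"Exporter"}))
```

The allowlist moves the "a second implementation is coming" knowledge out of someone's head and into the repository, where both reviewers and agents can see it.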
Building a Debloat Agent
The developer turned the debloating process into an agent skill. The agent scans the codebase for patterns that suggest bloat:
```python
def detect_bloat_patterns(codebase):
    """Collect bloat candidates for human review; nothing is removed here.

    find_interfaces, find_try_blocks, find_functions, can_throw, and
    is_single_call_wrapper are analysis helpers that are not shown.
    """
    patterns = []

    # Interface with single implementation
    for interface in find_interfaces(codebase):
        if len(interface.implementations) == 1:
            patterns.append({
                'type': 'single_implementation_interface',
                'file': interface.file,
                'line': interface.line,
                'suggestion': 'Remove interface, use concrete class'
            })

    # Try-catch that never throws
    for try_block in find_try_blocks(codebase):
        if not can_throw(try_block.body):
            patterns.append({
                'type': 'phantom_error_handling',
                'file': try_block.file,
                'line': try_block.line,
                'suggestion': 'Remove try-catch, simplify control flow'
            })

    # Function that wraps a single call
    for function in find_functions(codebase):
        if is_single_call_wrapper(function):
            patterns.append({
                'type': 'wrapper_function',
                'file': function.file,
                'line': function.line,
                'suggestion': 'Inline function, remove indirection'
            })

    return patterns
```
The agent does not automatically remove the bloat. It flags candidates and lets a human decide. The human still has to answer: is this abstraction serving a purpose I do not see, or is it just plausible-looking noise?
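How the flagged candidates reach the human is not described, so the following is only one plausible shape for that hand-off. It takes findings in the same dictionary layout the detector produces (`type`, `file`, `line`, `suggestion`) and renders a per-file checklist. The function name `format_review_report` and the file paths in the sample are hypothetical.

```python
from collections import defaultdict


def format_review_report(patterns: list[dict]) -> str:
    """Group findings by file and render them as an unchecked review list."""
    by_file = defaultdict(list)
    for pattern in patterns:
        by_file[pattern["file"]].append(pattern)

    lines = []
    for file in sorted(by_file):
        findings = sorted(by_file[file], key=lambda p: p["line"])
        lines.append(f"{file} ({len(findings)} candidate(s))")
        for p in findings:
            lines.append(f"  [ ] line {p['line']}: {p['type']} -> {p['suggestion']}")
    return "\n".join(lines)


# Hypothetical findings, in the shape returned by detect_bloat_patterns.
findings = [
    {"type": "wrapper_function", "file": "lib/player.dart", "line": 88,
     "suggestion": "Inline function, remove indirection"},
    {"type": "phantom_error_handling", "file": "lib/player.dart", "line": 12,
     "suggestion": "Remove try-catch, simplify control flow"},
    {"type": "single_implementation_interface", "file": "lib/cache.dart", "line": 3,
     "suggestion": "Remove interface, use concrete class"},
]

report = format_review_report(findings)
print(report)
```

Grouping by file keeps the review local: the reviewer opens one file, walks the list top to bottom, and accepts or rejects each suggestion with the surrounding code in view.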
The Feedback Loop Problem
The root cause is not that agents write bloated code. The root cause is that agents do not get feedback on global coherence. They get feedback on whether the code compiles, whether tests pass, and whether the feature works. They do not get feedback on whether the abstraction makes sense in the context of the entire codebase.
You can fix this with better prompts, but prompts do not scale. You can add linting rules, but linting rules are brittle. The real fix is changing the feedback loop so agents see the cost of bloat before they commit it.
That means:
- Complexity budgets where agents get penalized for adding abstraction layers
- Diff-based review where agents compare their output to the simplest possible implementation
- Incremental debloating where agents periodically scan their own output and remove unnecessary layers
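A complexity budget can start out very crude. The sketch below is an assumption about what such a gate might look like for Python sources, not an established metric. It counts classes, functions, and try blocks before and after a change, and rejects the change when the increase exceeds a budget. The names `abstraction_count` and `within_budget`, the choice of node types, and the budget value are all illustrative.

```python
import ast

# Node types counted as "structure"; this choice is an assumption.
ABSTRACTION_NODES = (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef, ast.Try)


def abstraction_count(source: str) -> int:
    """Number of classes, functions, and try blocks in the source."""
    return sum(isinstance(n, ABSTRACTION_NODES) for n in ast.walk(ast.parse(source)))


def within_budget(before: str, after: str, budget: int) -> tuple[bool, int]:
    """Return (accepted, added) where `added` is the growth in counted nodes."""
    added = abstraction_count(after) - abstraction_count(before)
    return added <= budget, added


BEFORE = '''
def play(path):
    return decode(path)
'''

AFTER = '''
class Decoder:
    def decode(self, path):
        try:
            return decode(path)
        except Exception:
            return None

def play(path):
    return Decoder().decode(path)
'''

accepted, added = within_budget(BEFORE, AFTER, budget=1)
```

Here the change adds a class, a method, and a try block for no new behavior, so it exceeds a budget of one. The useful part is less the number than the signal: the agent gets the rejection before the commit lands, while it can still try the simpler version.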
The challenge is defining “unnecessary” in a way that is not just “different from what a human would write.” Agents optimize for different constraints. Sometimes their abstractions are bloated. Sometimes they are just unfamiliar.
Technical Verdict
Use AI-generated code when:
- You have automated tests that cover the functionality
- You can afford to periodically debloat the codebase
- You are willing to treat agent output as a first draft, not a final implementation
- You have tooling to detect over-abstraction and phantom dependencies
Avoid AI-generated code when:
- You do not have time to review and refactor agent output
- The codebase will be maintained by humans who did not see it being built
- You need long-term architectural coherence more than short-term velocity
- You cannot afford the cognitive load of reverse-engineering agent intent
The gap between generation speed and review capacity is real. You can close it with better tooling, better feedback loops, and periodic debloating. But you cannot close it by pretending that agent-generated code is the same as human-written code. It is not. It accumulates differently, bloats differently, and requires different review processes.