Slop Issues: How AI-Generated Bug Reports Break Open-Source Triage

Armin Ronacher (Flask, Rye, Pi) has called out a new operational failure mode in open-source maintenance: bug reports rewritten by LLMs that arrive full of confident speculation, fake minimal repros, and root-cause guesses, all of which waste maintainer time. He calls them “slop issues” (“clanker” is his term for the LLM doing the rewriting). Simon Willison tagged the post with ai-ethics and slop, which suggests the community is starting to treat this as a distinct problem class.

This is not about AI-generated pull requests or automated vulnerability hunters. It is about the reporter-to-maintainer communication layer breaking down when users paste their observed problem into an LLM, accept the rewritten output, and submit it as a GitHub issue. The result looks helpful but obscures the actual facts.

The Operational Damage

When an issue arrives rewritten by an LLM, maintainers face:

Confident hallucinations: The LLM invents root causes, suggests implementation strategies, and references adjacent code that may not be relevant.
Fake minimal repros: The model generates a reproduction case that looks plausible but doesn’t actually trigger the bug.
Verbose speculation: Long lists of error classes, analogies to similar problems, and guesswork that buries the actual observed behavior.
Loss of ground truth: The human’s original command, expected outcome, and exact error message are rewritten into prose that sounds authoritative but lacks precision.

Ronacher’s preferred format is stark:

I ran this command. I expected this to happen. This happened instead. Here is the exact error or log.

That structure gives maintainers the raw facts. LLM rewrites replace facts with inference.

Why Triage Systems Can’t Distinguish Slop

Issue trackers store markdown text. They have no built-in mechanism to distinguish:

A human typing their observations directly
A human using an LLM as a grammar assistant
A human pasting their problem into ChatGPT and submitting the output verbatim

All three arrive as markdown. The only signal is tone and structure. Slop issues tend to:

Use passive voice and hedge words (“it appears that,” “this might indicate”)
Include implementation suggestions before the bug is confirmed
Reference code paths the reporter never examined
Lack exact command invocations or raw log output

But these are heuristics, not metadata. A well-intentioned user with an LLM-assisted grammar check looks identical to a user who let the model rewrite their entire report.
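The hedge-word heuristic can be sketched in a few lines of Python. This is a minimal illustration, not a detector anyone ships: the phrase list and the `find_hedges` name are assumptions made up for this example, and any real list would need tuning against a project's own issue history.

```python
import re

# Hypothetical phrase list for illustration; tune it against your own tracker.
HEDGE_PATTERNS = [
    r"it appears that",
    r"appears to",
    r"might indicate",
    r"might be related to",
    r"could indicate",
    r"could be caused by",
    r"possibly due to",
]

HEDGE_RE = re.compile("|".join(HEDGE_PATTERNS), re.IGNORECASE)


def find_hedges(text: str) -> list[str]:
    """Return every hedge phrase found in the text, lowercased, in order."""
    return [match.group(0).lower() for match in HEDGE_RE.finditer(text)]
```

A count from `find_hedges` is a tone signal only. A careful human reporter who writes “it appears that” once would be flagged too, which is exactly the false-positive risk described above.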

Enforcement Options and Trade-Offs

Approach	What It Does	Maintainer Benefit	User Friction	When to Use
Structured issue templates	Enforce separate fields for command, expected, actual, logs	Forces factual input, resists essay-style expansion	Low (already common)	20+ issues/month, small team, clear repro requirements
Manual triage labels	Maintainer marks issues as “needs clarification”	Signals quality problems	None for users	Under 10 issues/month, strong community norms
Pre-submit linting	Bot checks for speculation keywords, missing logs	Catches common patterns	Medium (may block valid reports)	50+ issues/month, existing CI/CD, tolerance for 10-15% false positives
Community norms	Document “no LLM rewrites” in CONTRIBUTING.md	Cultural signal	Low	Small projects (under 1,000 stars), low issue volume

None of these is airtight. Structured templates help but don’t prevent users from pasting LLM output into the “Actual Behavior” field. Detection tools require ecosystem buy-in. Manual triage doesn’t scale.

What a Slop-Resistant Issue Template Looks Like

Here’s a template structure that resists LLM expansion:

## Command or Code You Ran
<!-- Paste the exact command or code snippet. No explanation. -->

## Expected Behavior
<!-- One sentence. What should have happened? -->

## Actual Behavior
<!-- One sentence. What happened instead? -->

## Logs or Error Messages
<!-- Paste raw output. Do not summarize or rewrite. -->

## Environment
- OS:
- Version:
- Install method:

---

**Do not include:**
- Root cause analysis
- Suggested fixes
- Comparisons to other issues
- Implementation strategies

We need your observations, not speculation.

This template:

Separates facts into discrete fields
Explicitly forbids speculation
Requests raw output, not prose summaries
States the maintainer’s needs upfront

It won’t stop all slop, but it makes the desired format clear.
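A bot or CI job could also check that a submitted issue actually filled in the template’s fields. The following is a rough sketch under stated assumptions: the section names are copied from the template above, while `parse_sections`, `missing_sections`, and the empty-placeholder rule are hypothetical helpers invented for this example.

```python
import re

# Section headings copied from the template above.
REQUIRED_SECTIONS = [
    "Command or Code You Ran",
    "Expected Behavior",
    "Actual Behavior",
    "Logs or Error Messages",
    "Environment",
]

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)  # template hint comments
EMPTY_FIELD_RE = re.compile(r"^- [^:]+:\s*$")      # e.g. "- OS:" left blank


def parse_sections(body: str) -> dict[str, str]:
    """Split an issue body into {heading: content} on '## ' headings."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in COMMENT_RE.sub("", body).splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            current = stripped[3:].strip()
            sections[current] = []
        elif stripped == "---":
            # The rule before the "Do not include" footer ends the last field.
            current = None
        elif current is not None and stripped and not EMPTY_FIELD_RE.match(stripped):
            sections[current].append(stripped)
    return {name: "\n".join(lines) for name, lines in sections.items()}


def missing_sections(body: str) -> list[str]:
    """List required sections that are absent or left empty."""
    sections = parse_sections(body)
    return [name for name in REQUIRED_SECTIONS if not sections.get(name)]
```

Run against the untouched template, `missing_sections` reports all five fields. The check only confirms that each field has content. It cannot tell whether that content is an observation or pasted speculation.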

The Assistive Tool Dilemma

Some users rely on LLMs for accessibility: grammar assistance, translation, or structuring their thoughts. A blanket ban on AI-generated text harms those users.

The distinction is:

Assistive use: LLM helps the user express what they observed.
Generative use: LLM invents content the user didn’t observe.

Here’s a concrete example. A user observes this:

$ rye sync
Error: connection timeout

They paste it into ChatGPT asking for help filing a bug report. The LLM rewrites it as:

When running rye sync, the initialization sequence appears to block on DNS resolution, which could indicate a network stack issue or firewall configuration problem. This might be related to the async runtime’s connection pooling strategy. The timeout suggests the underlying socket is not receiving a response within the expected window, possibly due to IPv6 fallback behavior.

The maintainer now has to ask: did you actually see a DNS error? Is there a firewall? Did you check IPv6? Or is this speculation? A structured template prompts the user to paste the exact error message first. If they write “Error: connection timeout” in the “Actual Behavior” field and leave it at that, the maintainer has ground truth. If they paste the LLM’s speculation, the template’s explicit instructions (“Do not include root cause analysis”) make it clear they’ve violated the format.

Issue trackers have no way to enforce this boundary automatically. The best signal is whether the report contains:

Exact commands
Raw logs
Specific version numbers
Reproducible steps the reporter actually executed

If those are present, the LLM likely assisted rather than generated. If they’re absent and replaced with speculation, the LLM probably rewrote the entire report.
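The first three signals in that list can be approximated with pattern matching. The sketch below is an assumption-laden illustration: the regular expressions and the `ground_truth_signals` name are hypothetical, and nothing here can verify the fourth signal, whether the reporter actually executed the steps.

```python
import re

# Hypothetical patterns for facts a reporter only gets by running something.
COMMAND_RE = re.compile(r"^\s*\$ \S+", re.MULTILINE)  # "$ rye sync" style lines
ERROR_RE = re.compile(
    r"^\s*(error|traceback|exception|fatal)\b",
    re.MULTILINE | re.IGNORECASE,
)  # lines that begin like raw tool output
VERSION_RE = re.compile(r"\b\d+\.\d+(?:\.\d+)?\b")  # e.g. 1.2 or 1.2.3


def ground_truth_signals(text: str) -> dict[str, bool]:
    """Report which ground-truth markers appear anywhere in the text."""
    return {
        "command": bool(COMMAND_RE.search(text)),
        "raw_error": bool(ERROR_RE.search(text)),
        "version": bool(VERSION_RE.search(text)),
    }
```

Applied to the two versions of the `rye sync` report above, the raw terminal output trips the command and error checks, while the LLM’s paragraph trips none of them. A report that passes all three can still be padded with speculation, so this is a prompt for a human look, not a verdict.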

Observability Gap

GitHub Issues has no telemetry for:

Whether the issue text was pasted from an LLM
How many edits happened before submission
Whether the reporter ran the repro steps themselves

Maintainers see the final markdown. They can’t trace provenance. This is the same problem email spam filters faced in the early 2000s: content-based heuristics without sender authentication.

One possible mitigation: issue bots that ask clarifying questions when they detect speculation patterns. Example:

This issue contains phrases like "might indicate" and "could be caused by."
Can you confirm you ran the exact command listed and saw the error shown?
Please paste the raw log output if available.

This adds friction but surfaces the ground-truth question early.
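Composing that comment is the easy part. Here is a minimal sketch, assuming the bot has already collected a list of flagged phrases by some other means; the `clarification_comment` function is a hypothetical helper, and posting the result to the tracker is left out.

```python
from typing import Optional


def clarification_comment(phrases: list[str]) -> Optional[str]:
    """Build the bot's follow-up comment, or None if nothing was flagged."""
    if not phrases:
        return None
    # De-duplicate while preserving first-seen order, then quote up to two.
    unique = list(dict.fromkeys(phrases))
    quoted = " and ".join(f'"{phrase}"' for phrase in unique[:2])
    return (
        f"This issue contains phrases like {quoted}.\n"
        "Can you confirm you ran the exact command listed and saw the error shown?\n"
        "Please paste the raw log output if available."
    )
```

Returning `None` for an empty list keeps the bot silent on reports that show no speculation patterns, so the added friction lands only where a question is warranted.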

Technical Verdict

Use structured issue templates with explicit anti-speculation language if your project receives more than 20 issues per month and several reports in the last 30 days offered root-cause speculation, implementation suggestions, or analogies to adjacent code without the exact command run, the expected behavior, and the raw error output. Ronacher’s critique shows what maintainers need: “I ran this command. I expected this to happen. This happened instead. Here is the exact error or log.” Templates with separate fields for command, expected, actual, and logs prompt users to provide factual input before expanding into speculation. The setup cost (about five minutes to add a GitHub issue template) is justified when the triage time lost to slop issues outweighs the convenience of accepting unstructured reports.

Avoid structured templates if your project receives fewer than 10 issues per month, you have strong community norms, or your issue tracker is primarily used for feature requests and design discussions rather than bug reports. Templates optimized for factual bug reports frustrate users trying to propose ideas. At low volume, direct communication and cultural expectations work better than automation.

Why this works: Ronacher’s problem statement is that LLM-rewritten issues contain “complete guesswork on root causes, fake-minimal repros, suggested implementation strategies” that bury the actual observed behavior. Structured templates resist this by making it easier to paste raw facts into discrete fields than to paste LLM-generated prose into a single text box. The template doesn’t detect slop; it makes slop harder to produce.

Source Links

Simon Willison’s post quoting Armin Ronacher