AI Code Review That Actually Predicts Bugs Using Production Error Data

Most AI Code Review Is Noise Dressed Up as Signal
The first AI reviewer I wired into our PR pipeline was impressive for about a week. It caught a missing null check on day one, and I thought we were onto something. By day five, it was flagging variable naming conventions on every single PR, commenting on files that had not changed semantically in months, and suggesting error handling on code paths that were already wrapped three layers up.
The team stopped reading it. That is the failure mode nobody talks about: not an AI that misses bugs, but an AI that cries wolf until engineers train themselves to ignore it.
The problem was not the model. It was the context. The reviewer was guessing. It had no idea what actually breaks in our codebase, which modules have a history of production incidents, or what patterns the team has already collectively decided to dismiss. It was pattern-matching against its training data, not reasoning from evidence about our system.
The fix is not a better prompt. It is better context. Specifically: production error data.
The Core Idea
Your observability stack already knows what breaks. Every KeyError, ConnectionTimeout, and unhandled exception that hit production is in Sentry with a stack trace, variable values at crash time, and a timestamp. That data is a map of exactly the patterns that cause real incidents in your codebase.
If an AI reviewer can query that data before it comments on a PR, it stops guessing and starts reasoning from evidence. Instead of "this function might raise a KeyError", it can say "this function has raised a KeyError in production four times in the last 30 days, here is the stack trace".
The first comment is a guess an engineer can dismiss. The second is evidence they have to answer.
The Architecture
I built this as a multi-step pipeline triggered on PR open. Here is how each stage works.
Stage 1: Filtering
Large PRs are a trap. A 50-file diff fed wholesale into an LLM burns tokens, degrades output quality, and generates noise. Not every changed file deserves the same scrutiny.
The first stage narrows the diff to files worth investigating. Test files, documentation updates, trivial renames, and lock file changes are deprioritised. Files with a recent incident history, files touching shared utilities, and files changed alongside prior bug-fix PRs move to the top of the list.
In practice this means the agent focuses on three or four files out of a thirty-file PR. The ones that matter.
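A minimal sketch of that ranking step, assuming the pipeline already has three hypothetical lookup sets: paths with recent incidents, shared-utility paths, and paths touched by earlier bug-fix PRs. The weights, the suffix lists, and the function names are illustrative, not the ones from my pipeline. Detecting trivial renames is left out.

```python
from pathlib import PurePosixPath

# Hypothetical markers for files that rarely need deep review.
LOW_PRIORITY_SUFFIXES = {".md", ".rst", ".lock"}
LOW_PRIORITY_NAMES = {"package-lock.json"}


def is_low_priority(path: str) -> bool:
    """Docs, lock files, and tests go to the back of the queue."""
    p = PurePosixPath(path)
    if p.suffix in LOW_PRIORITY_SUFFIXES or p.name in LOW_PRIORITY_NAMES:
        return True
    return "tests" in p.parts or p.name.startswith("test_")


def risk_score(path, incident_paths, shared_paths, bugfix_paths) -> int:
    """Illustrative weights: incident history counts most."""
    return (
        3 * (path in incident_paths)
        + 2 * (path in shared_paths)
        + 1 * (path in bugfix_paths)
    )


def select_files(changed, incident_paths, shared_paths, bugfix_paths, limit=4):
    """Rank changed paths and keep the top `limit` for the agents."""
    ranked = sorted(
        changed,
        key=lambda p: (
            is_low_priority(p),  # False sorts before True
            -risk_score(p, incident_paths, shared_paths, bugfix_paths),
        ),
    )
    return ranked[:limit]
```

Low-priority files are pushed to the end rather than removed, so a PR that only touches tests still gets reviewed.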
Stage 2: Hypothesis and Verification
This is the part that separates the architecture from a simple "send the diff to GPT" approach.
A drafting agent reads the filtered diff and produces candidate bug hypotheses. Not comments. Not suggestions. Hypotheses: "This refactor changes how analytics events are serialised. Prior errors in this module suggest serialisation failures are common."
Each hypothesis is then handed to a dedicated verification agent. The verify agent has one job: search production error history to confirm or discard the hypothesis. The pipeline runs these verifications in parallel, one agent per hypothesis.
Diff
│
▼
[Filter] → relevant files only
│
▼
[Draft Agent] → produces N hypotheses
│
├──→ [Verify Agent 1] → search Sentry, confirm or discard
├──→ [Verify Agent 2] → search Sentry, confirm or discard
└──→ [Verify Agent N] → search Sentry, confirm or discard
│
▼
[Scoring + Output Filter] → discard low confidence, post the rest
The parallel structure is deliberate. Each verify agent only carries context for one hypothesis at a time. Context stays focused, precision improves, and you are not burning a single massive context window on every hypothesis at once.
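The fan-out can be sketched with asyncio. This is an assumption-heavy outline: `Hypothesis`, `Verdict`, and the `search_errors` callable are hypothetical names, and the LLM reasoning inside each verify agent is replaced by a plain "were any matching errors found" check.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    text: str
    file_path: str


@dataclass(frozen=True)
class Verdict:
    hypothesis: Hypothesis
    confirmed: bool
    evidence_ids: tuple


async def verify(hypothesis, search_errors):
    """One verify agent: it only ever sees its own hypothesis."""
    events = await search_errors(hypothesis.file_path)
    return Verdict(
        hypothesis=hypothesis,
        confirmed=bool(events),  # stand-in for the agent's judgement
        evidence_ids=tuple(e["id"] for e in events),
    )


async def verify_all(hypotheses, search_errors):
    """Run one verification per hypothesis concurrently."""
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(
        *(verify(h, search_errors) for h in hypotheses)
    )
```

Passing `search_errors` in as a parameter keeps the error-history backend swappable, which also makes it easy to substitute recorded data during evaluation.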
The verify agent has three search tools:
- Keyword search against error messages and event descriptions
- Error type search to find all past KeyError, ValueError, TimeoutError etc. events in relevant files
- Stack trace search to find errors whose call stack passed through a specific function or file
It fetches issue summaries first. Only if an issue looks relevant does it drill into the full event detail: the stack trace, variable values at crash time, the environment it occurred in. This "click-into" pattern keeps context lean and avoids flooding the agent with irrelevant data.
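The click-into pattern is roughly the loop below. It is a sketch under assumptions: `is_relevant` and `fetch_detail` are hypothetical callables (in the real pipeline the relevance decision is made by the agent), and the `max_details` cap is my own illustrative addition.

```python
def collect_evidence(summaries, is_relevant, fetch_detail, max_details=3):
    """Scan cheap issue summaries and fetch full detail only for relevant ones."""
    evidence = []
    for summary in summaries:
        if len(evidence) >= max_details:
            break  # keep the agent's context lean
        if is_relevant(summary):
            # The expensive call: stack trace, variables, environment.
            evidence.append(fetch_detail(summary["id"]))
    return evidence
```

The full event payload is only requested for issues that pass the relevance check, so irrelevant issues cost one summary line each rather than a whole stack trace.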
Stage 3: Memory Across PRs
Generic AI reviewers know Python. They do not know that analytics.record() in your codebase must always be wrapped in a try/except because the upstream SDK silently changed its serialisation contract six months ago. They do not know that root_task has a hard 15-minute execution limit that bites anything touching async job scheduling.
The memory layer accumulates repository-specific knowledge across every PR run. When a hypothesis is confirmed by production data, the pattern gets written to memory. When the team downvotes a suggestion, that gets written to memory too. Every subsequent agent run receives this context.
After a few weeks, the system is reviewing PRs with the institutional knowledge of an engineer who has been on the project for a year. A junior contributor who introduces a call pattern that has caused incidents before gets flagged with context, not a vague warning.
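A minimal in-memory sketch of what the memory layer stores and how it reaches the agents. `RepoMemory` and its method names are hypothetical, and a real version would persist to a database or file rather than live in a Python object.

```python
from dataclasses import dataclass, field


@dataclass
class RepoMemory:
    confirmed: list = field(default_factory=list)
    downvoted: list = field(default_factory=list)

    def record_confirmed(self, pattern: str, evidence_id: str) -> None:
        """Store a pattern that production data confirmed."""
        self.confirmed.append({"pattern": pattern, "evidence": evidence_id})

    def record_downvote(self, suggestion: str) -> None:
        """Store a suggestion the team rejected."""
        self.downvoted.append(suggestion)

    def as_prompt_context(self) -> str:
        """Render memory as text to prepend to every agent run."""
        lines = ["Known incident patterns in this repository:"]
        lines += [
            f"- {c['pattern']} (evidence: {c['evidence']})"
            for c in self.confirmed
        ]
        lines.append("Suggestions the team has rejected:")
        lines += [f"- {d}" for d in self.downvoted]
        return "\n".join(lines)
```

Keeping the evidence identifier next to each pattern lets a later comment cite the original incident instead of restating the rule without support.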
Stage 4: Quality Filtering on Output
Every confirmed hypothesis comes out with a confidence score and a severity score. Low-confidence, low-severity suggestions are discarded before they ever reach a developer.
A similarity search runs against past downvoted suggestions. If the proposed comment is semantically close to something the team has already rejected, it gets filtered out. The system learns the team's preferences without anyone having to configure rules.
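Both filters fit in a few lines. Two assumptions are baked into this sketch: a suggestion is dropped only when confidence and severity are both below their thresholds, and word-overlap (Jaccard) similarity stands in for the semantic similarity search, which would normally use embeddings. The threshold values are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    text: str
    confidence: float  # 0.0 to 1.0
    severity: float    # 0.0 to 1.0


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a crude stand-in for embedding distance."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def should_post(s, downvoted, min_confidence=0.7, min_severity=0.5,
                max_similarity=0.6):
    """Apply the score filter, then the downvote similarity filter."""
    if s.confidence < min_confidence and s.severity < min_severity:
        return False
    # Reject anything too close to a suggestion the team already dismissed.
    return all(jaccard(s.text, old) < max_similarity for old in downvoted)
```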
How I Applied This in Production
I connected our Sentry project to the pipeline via the Sentry API. When a PR opens, the verify agents query for errors in the files touched by the diff. The integration took about a day to wire up. The hard part was not the API calls. It was building the evaluation loop.
The Filtering Step Saved Real Money
Without filtering, a large PR sent 40,000+ tokens of context to the LLM. With filtering active, the average dropped to under 8,000. Token cost went down by roughly 80% on large PRs. More importantly, output quality improved because the agent was not trying to hold a massive context in its working memory.
Memory Was the Biggest Unlock
The first two weeks were good but generic. By week four, the memory layer had accumulated enough project-specific patterns that the suggestions started feeling like they came from someone who knew the codebase.
It flagged a PR that added a new consumer to our message queue without registering it in the error recovery config. There was no way a generic reviewer would have caught that. It caught it because two previous incidents had passed through that same config path and the pattern was in memory.
The Evaluation Pipeline Is Not Optional
I set up an eval dataset of 15 PRs: some known to be buggy (bugs confirmed post-merge with linked Sentry incidents), some confirmed clean. I ran the pipeline against this set after every significant change to a prompt or model.
Starting precision was around 60%. After two rounds of prompt tuning guided by the eval results, it was over 80%. Without the eval pipeline, I would have had no idea whether my prompt changes were helping or making things worse. Every change to the system would have been a guess.
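For reference, this is the kind of calculation the eval run boils down to. It is a sketch that assumes scoring at PR level (did the pipeline flag the PR, and was the PR actually buggy); scoring per comment works the same way with a different unit.

```python
def precision_recall(results):
    """results: list of (flagged, actually_buggy) pairs, one per eval PR."""
    tp = sum(1 for flagged, buggy in results if flagged and buggy)
    fp = sum(1 for flagged, buggy in results if flagged and not buggy)
    fn = sum(1 for flagged, buggy in results if not flagged and buggy)
    # Guard against empty denominators on tiny eval sets.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Tracking recall alongside precision matters here: raising precision by withholding suggestions is only a win if the real catches survive.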
The other key piece was context mocking: snapshotting the Sentry data at the timestamp of each eval PR so results are deterministic. Without this, re-running the eval on a PR from three months ago would produce different results because the error history has grown. You cannot measure improvement if the baseline keeps shifting.
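The core of context mocking is a cutoff filter like the one below. This sketch assumes each stored error event carries a timezone-aware `timestamp` field, and `snapshot_at` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def snapshot_at(events, cutoff: datetime):
    """Return only the error events that existed before `cutoff`."""
    return [e for e in events if e["timestamp"] < cutoff]


# Usage: freeze the error history as of the moment an eval PR was opened.
events = [
    {"id": "a", "timestamp": datetime(2024, 1, 10, tzinfo=timezone.utc)},
    {"id": "b", "timestamp": datetime(2024, 3, 1, tzinfo=timezone.utc)},
]
pr_opened_at = datetime(2024, 2, 1, tzinfo=timezone.utc)
frozen = snapshot_at(events, pr_opened_at)
```

The dates above are made-up fixtures. Saving `frozen` alongside the eval PR means every later run sees the same error history, regardless of what production has logged since.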
Confidence Threshold Took Tuning
I started the confidence threshold too low because I was worried about missing things. The result was a reviewer that commented on everything, which is just the old noise problem restated. Raising the threshold reduced total comments by about 40%. Zero real catches were lost in that cut. The only things that dropped off were low-confidence suggestions the team would have dismissed anyway.
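One way to tune this against the eval set is a simple sweep. The sketch assumes each suggestion has been labelled as a real catch or not, and the numbers in the usage example are illustrative, not my actual data.

```python
def sweep(scored, thresholds):
    """scored: list of (confidence, is_real_catch) pairs from an eval run."""
    rows = []
    for t in thresholds:
        kept = [(c, real) for c, real in scored if c >= t]
        rows.append({
            "threshold": t,
            "comments": len(kept),
            "real_catches": sum(real for _, real in kept),
        })
    return rows


# Illustrative data: raising the threshold drops noise, keeps both catches.
scored = [(0.9, True), (0.8, True), (0.6, False), (0.4, False), (0.3, False)]
table = sweep(scored, [0.2, 0.5])
```

Reading the rows side by side shows how many comments each threshold removes and whether any of them were real catches.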
The Similarity Filter Was an Underrated Feature
After the first month, I stopped seeing a whole category of suggestions the team had consistently downvoted: comments about function length in our data transformation layer, where long functions are deliberate and the team is aware of it. The system just stopped bringing it up. No configuration required. It learned from the feedback signal.
What It Actually Catches
The memorable catch: a webhook handler that called a third-party payment API without any error handling around the network call. The PR author had not thought about it. The human reviewer had not flagged it. The AI caught it because Sentry had four ConnectionTimeout events from that exact API in the past 60 days.
That single catch justified the build time. A missing try/except in a webhook handler is a silent incident waiting to happen. It would have shipped, caused a production issue under load, and required a hotfix.
Other categories it catches reliably:
- Missing error handling on calls to APIs with a known timeout history
- Serialisation changes to objects that have caused KeyError or ValueError before
- Async patterns that interact badly with known resource limits
- New code paths through functions that have historically been involved in incidents
It does not reliably catch logic errors that have never manifested as a production error. It is not a replacement for human review of business logic. It is a first pass that removes the class of bugs your production history already knows about.
What It Took to Make It Trustworthy
Honest answer: about six weeks of iteration.
The first version produced too many comments and had no memory. Useful, but not trusted. The eval loop revealed that precision was lower than it felt subjectively, because the team was remembering the good catches and forgetting the noise.
The eval data forced objectivity. Precision went from 60% to 80% not because the model got smarter but because the prompts got more specific about when to withhold a suggestion. "Do not comment unless you can cite a specific past error as evidence" was the single most impactful prompt change.
The other thing that built trust was track record. Once the team saw the AI catch two or three real bugs that would have shipped, they started reading its suggestions properly instead of skimming them. Trust compounds.
Summary
| Component | What it does | Why it matters |
| --------------------- | ------------------------------------------------ | ----------------------------------------- |
| Filtering | Narrows diff to high-risk files | Cuts token cost, improves focus |
| Draft + Verify agents | Hypothesise then confirm against production data | Replaces guessing with evidence |
| Memory | Accumulates project-specific patterns | Makes generic AI reviewers project-aware |
| Confidence scoring | Discards low-quality suggestions | Keeps developer trust intact |
| Similarity filter | Learns team preferences from downvotes | Stops repeating dismissed suggestions |
| Eval pipeline | Measures precision/recall over time | Makes improvement measurable, not assumed |
The architecture is not complicated. The investment is in the eval loop and the patience to tune the thresholds before you ship it to the team. Skip those two things and you get a noisy reviewer nobody reads. Do them properly and you get a reviewer that catches the bugs your production history already knows are coming.
Pro tip: Start with a small eval dataset of 10-15 PRs, at least half with confirmed bugs linked to production incidents. Run it before and after every prompt change. If you cannot measure the improvement, you cannot trust the improvement.