Architecture Drift Reduction with LLMs: A Python Tool That Combines Structural Rules with AI-Based Evaluation

The Slow Rot Nobody Noticed
Six months ago we had a clean hexagonal architecture. Domain layer in the center, ports and adapters on the outside, clear dependency rules. I could draw it on a whiteboard and it matched reality. Then we onboarded three coding agents to handle the backlog.
The agents were productive. Unreasonably productive. They shipped features fast, tests passed, reviewers approved. But nobody was checking whether the code respected the architecture. Agents don't read architecture decision records. They don't know that infrastructure.persistence should never import from application.services. They pattern-match from context windows and existing code — and if one shortcut exists, they'll replicate it fifty times.
I noticed when I tried to swap out our PostgreSQL adapter for DynamoDB. What should've been a clean adapter replacement turned into a three-week excavation. The domain layer had direct imports from the persistence layer. Application services were calling infrastructure utilities. The hexagonal architecture existed only in our Confluence docs.
Like Arceus creating the Pokemon universe with specific rules and order — we'd defined the rules, but nothing was enforcing them while the agents worked.
Quantifying the Damage
Before building any tooling, I needed to understand how bad things actually were. I wrote a quick script to parse imports across our Python monorepo and build a dependency graph. The results were grim.

import ast
from collections import defaultdict
from pathlib import Path

LAYERS = ("domain", "application", "infrastructure", "presentation")


def extract_imports(file_path: str) -> list[str]:
    """Extract all import targets from a Python file."""
    with open(file_path, "r", encoding="utf-8") as f:
        tree = ast.parse(f.read(), filename=file_path)
    imports = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imports.append(alias.name)
        elif isinstance(node, ast.ImportFrom):
            # Bare relative imports ("from . import x") have no module and are skipped.
            if node.module:
                imports.append(node.module)
    return imports


def build_dependency_graph(root: Path) -> dict[str, set[str]]:
    """Build a layer-level dependency graph from the source tree."""
    graph = defaultdict(set)
    for py_file in root.rglob("*.py"):
        # The top-level directory names the layer the file belongs to.
        layer = py_file.relative_to(root).with_suffix("").parts[0]
        for imp in extract_imports(str(py_file)):
            target_layer = imp.split(".")[0]
            # Compare the whole first segment so "domainx" doesn't match "domain".
            if target_layer in LAYERS:
                graph[layer].add(target_layer)
    return dict(graph)

Our hexagonal rules said: domain imports nothing. application imports domain. infrastructure imports application and domain. presentation imports application. That's it. What I found was 47 violations: edges in the graph that shouldn't exist. Most were introduced in the last four months.
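As a rough sketch of how violations fall out of that graph (the `ALLOWED` mapping and `find_violations` are illustrative names, not the script I actually ran), you can diff the observed layer edges against the allowed ones:

```python
# Illustrative sketch: ALLOWED and find_violations are assumed names.
ALLOWED: dict[str, set[str]] = {
    "domain": set(),
    "application": {"domain"},
    "infrastructure": {"domain", "application"},
    "presentation": {"application"},
}


def find_violations(graph: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return (source_layer, target_layer) edges that the rules don't allow."""
    violations = []
    for source, targets in graph.items():
        if source not in ALLOWED:
            continue  # e.g. a top-level "tests" directory is not a layer
        for target in targets:
            if source == target:
                continue  # imports within a layer are fine
            if target not in ALLOWED[source]:
                violations.append((source, target))
    return sorted(violations)


# Example graph shaped like the output of build_dependency_graph()
observed = {
    "domain": {"domain", "infrastructure"},
    "application": {"domain", "infrastructure"},
    "infrastructure": {"domain", "application"},
}
print(find_violations(observed))
```

Note that this counts distinct layer-to-layer edges. Counting individual offending imports, as in the figure above, needs the graph to keep file-level edges rather than collapsing them by layer.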
Deterministic Rules: The ArchUnit Approach
The first layer of defense had to be deterministic. No ambiguity, no LLM hallucination risk. If module A imports from module B and that edge isn't in the allowed list, it fails. Period.
I built this as a standalone Python tool using networkx for graph analysis:

import networkx as nx
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    ERROR = "error"
    WARNING = "warning"


@dataclass
class ArchViolation:
    source_module: str
    target_module: str
    source_layer: str
    target_layer: str
    file_path: str
    line_number: int
    severity: Severity
    message: str


class ArchitectureRuleset:
    """Defines allowed dependency directions between layers."""

    def __init__(self, allowed_deps: dict[str, set[str]]):
        self.allowed_deps = allowed_deps
        self._graph = nx.DiGraph()
        for source, targets in allowed_deps.items():
            for target in targets:
                self._graph.add_edge(source, target)

    def is_allowed(self, source_layer: str, target_layer: str) -> bool:
        if source_layer == target_layer:
            return True
        return target_layer in self.allowed_deps.get(source_layer, set())

    def check_for_cycles(self) -> list[list[str]]:
        """Detect circular dependencies between layers."""
        return list(nx.simple_cycles(self._graph))


# Our hexagonal architecture rules
HEXAGONAL_RULES = ArchitectureRuleset({
    "domain": set(),  # domain depends on nothing
    "application": {"domain"},
    "infrastructure": {"domain", "application"},
    "presentation": {"application"},
})

This catches the obvious stuff. Domain importing infrastructure? Blocked. Presentation reaching into domain directly? Blocked. The tool walks the AST, resolves imports to layers, and checks every edge against the ruleset.
But here's where it gets interesting. Not all violations are structural.
The Subtle Violations Deterministic Rules Miss
Consider this: an application service that technically only imports from the domain layer, but it's constructing SQL query fragments and passing them as strings to a domain method. Structurally clean. Architecturally rotten. The application layer is encoding persistence concerns without a single illegal import.
Or this: a domain entity that defines a method called to_dynamodb_item(). No import violations. The dependency graph looks perfect. But the domain is now coupled to a specific infrastructure technology through intent, not imports.
This is where I realized we needed a second evaluation layer. One that understands architectural intent, not just structural edges.
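To make the second case concrete, here is a hypothetical pair of snippets (the `LeakyOrder` and `Order` entities and the `order_to_dynamodb_item` mapper are invented for illustration, not taken from our codebase). Both versions pass the import check; only the second keeps DynamoDB out of the domain.

```python
from dataclasses import dataclass


# Hypothetical domain entity: no illegal imports, yet it knows DynamoDB's
# attribute-value format, so the domain is coupled to a storage technology.
@dataclass(frozen=True)
class LeakyOrder:
    order_id: str
    total_cents: int

    def to_dynamodb_item(self) -> dict:
        return {
            "pk": {"S": f"ORDER#{self.order_id}"},
            "total": {"N": str(self.total_cents)},
        }


# The same entity with the persistence detail removed.
@dataclass(frozen=True)
class Order:
    order_id: str
    total_cents: int


# This function would live under infrastructure/persistence: the adapter,
# not the entity, owns the storage format.
def order_to_dynamodb_item(order: Order) -> dict:
    return {
        "pk": {"S": f"ORDER#{order.order_id}"},
        "total": {"N": str(order.total_cents)},
    }
```

An import-based rule sees no difference between the two, which is exactly the gap the second evaluation layer is meant to cover.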
LLM-Based Evaluation: Catching Intent Violations
The idea is straightforward. Take the diff from a PR, combine it with our architectural constraints expressed in natural language, and ask a model: "Does this change violate our architecture, and if so, how?"

import json
from dataclasses import dataclass

from openai import OpenAI

ARCHITECTURE_CONTEXT = """
Our system follows hexagonal architecture with these invariants:

1. DOMAIN LAYER: Pure business logic. No framework imports, no I/O, no technology-specific code.
   Domain entities must not reference persistence mechanisms, serialization formats, or transport protocols.
2. APPLICATION LAYER: Orchestrates domain objects. Defines ports (interfaces) for infrastructure.
   Must not contain SQL, HTTP client calls, or file system access directly.
3. INFRASTRUCTURE LAYER: Implements ports defined in application layer. Contains all I/O.
   Adapters here are replaceable without touching domain or application code.
4. PRESENTATION LAYER: HTTP handlers, CLI commands. Translates external requests to application calls.
   Must not contain business logic or direct infrastructure access.

Additional constraints:
- No layer may skip a layer (presentation cannot call infrastructure directly)
- Domain events are the only coupling mechanism between bounded contexts
- Value objects are immutable and contain no IDs
"""


@dataclass
class LLMEvaluation:
    has_violation: bool
    confidence: float
    violations: list[dict]
    reasoning: str


def evaluate_diff_with_llm(
    diff: str,
    file_paths: list[str],
    client: OpenAI,
    model: str = "gpt-4o",
) -> LLMEvaluation:
    """Evaluate a code diff against architectural constraints using an LLM."""
    prompt = f"""You are an architecture reviewer. Analyze this code diff against our architectural rules.

{ARCHITECTURE_CONTEXT}

## Code Diff
{diff}

## Files Changed
{json.dumps(file_paths, indent=2)}

Respond in JSON with this structure:
{{
  "has_violation": bool,
  "confidence": float (0-1),
  "violations": [
    {{
      "file": "path",
      "description": "what's wrong",
      "rule_violated": "which architectural invariant",
      "severity": "error|warning"
    }}
  ],
  "reasoning": "brief explanation of your analysis"
}}

Only flag clear architectural violations. Style preferences or minor naming issues are NOT violations.
Be conservative — false positives erode trust in the tool.
"""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    result = json.loads(response.choices[0].message.content)
    # Read fields explicitly so an extra or missing key in the reply
    # doesn't raise a TypeError in the dataclass constructor.
    return LLMEvaluation(
        has_violation=bool(result.get("has_violation", False)),
        confidence=float(result.get("confidence", 0.0)),
        violations=list(result.get("violations", [])),
        reasoning=str(result.get("reasoning", "")),
    )

I set temperature to 0.1. We want consistency here, not creativity. The confidence score matters too: we only block PRs when the model reports confidence of 0.85 or higher. Below that, it adds a comment for human review.
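The block-versus-comment decision can be sketched as a small pure function. This is a minimal illustration, assuming a stand-in `Evaluation` class and a hypothetical `decide_action` helper rather than the tool's real code:

```python
from dataclasses import dataclass, field


@dataclass
class Evaluation:
    """Stand-in for LLMEvaluation so the sketch runs on its own."""
    has_violation: bool
    confidence: float
    violations: list[dict] = field(default_factory=list)


def decide_action(evaluation: Evaluation, threshold: float = 0.85) -> str:
    """Map an LLM evaluation to 'pass', 'comment', or 'block'."""
    if not evaluation.has_violation:
        return "pass"
    has_error = any(v.get("severity") == "error" for v in evaluation.violations)
    if has_error and evaluation.confidence >= threshold:
        return "block"
    # Low confidence, or warnings only: leave it to a human reviewer.
    return "comment"
```

Keeping the decision separate from the API call makes the threshold easy to unit test without spending tokens. Bear in mind that the confidence value is self-reported by the model, so the threshold is a tuning knob rather than a calibrated probability.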
Combining Both: The Architecture Fitness Function
The real power comes from running both checks together. Deterministic rules are the hard boundary: as long as the ruleset itself is right they don't produce false positives, they're fast, and they don't cost API calls. The LLM evaluation is the soft boundary: it catches things static analysis can't, but it's slower and occasionally wrong.

import ast
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ArchReviewResult:
    deterministic_violations: list[ArchViolation] = field(default_factory=list)
    llm_evaluation: LLMEvaluation | None = None
    should_block: bool = False
    summary: str = ""


class ArchitectureFitnessFunction:
    """Combined architectural fitness check for PR evaluation."""

    def __init__(
        self,
        ruleset: ArchitectureRuleset,
        llm_client: OpenAI,
        confidence_threshold: float = 0.85,
        skip_llm_on_hard_fail: bool = True,
    ):
        self.ruleset = ruleset
        self.llm_client = llm_client
        self.confidence_threshold = confidence_threshold
        self.skip_llm_on_hard_fail = skip_llm_on_hard_fail

    def evaluate(
        self,
        changed_files: list[Path],
        diff: str,
        repo_root: Path,
    ) -> ArchReviewResult:
        result = ArchReviewResult()

        # Phase 1: Deterministic structural checks
        for file_path in changed_files:
            # Skip non-Python files and files the PR deleted.
            if file_path.suffix != ".py" or not file_path.exists():
                continue
            violations = self._check_file_dependencies(file_path, repo_root)
            result.deterministic_violations.extend(violations)

        hard_failures = [
            v for v in result.deterministic_violations
            if v.severity == Severity.ERROR
        ]
        if hard_failures:
            result.should_block = True
            result.summary = (
                f"Found {len(hard_failures)} hard architecture violation(s). "
                "Fix dependency direction issues before merge."
            )
            if self.skip_llm_on_hard_fail:
                return result

        # Phase 2: LLM-based intent evaluation
        file_paths = [str(f) for f in changed_files]
        result.llm_evaluation = evaluate_diff_with_llm(
            diff=diff,
            file_paths=file_paths,
            client=self.llm_client,
        )
        if (
            result.llm_evaluation.has_violation
            and result.llm_evaluation.confidence >= self.confidence_threshold
        ):
            llm_errors = [
                v for v in result.llm_evaluation.violations
                if v.get("severity") == "error"
            ]
            if llm_errors:
                result.should_block = True
                # strip() avoids a leading space when there is no earlier summary.
                result.summary = (
                    f"{result.summary} LLM detected {len(llm_errors)} intent violation(s) "
                    f"with {result.llm_evaluation.confidence:.0%} confidence."
                ).strip()

        if not result.should_block and not result.summary:
            result.summary = "Architecture checks passed."
        return result

    def _check_file_dependencies(
        self, file_path: Path, repo_root: Path
    ) -> list[ArchViolation]:
        """Check a single file's imports against the ruleset."""
        violations = []
        relative = file_path.relative_to(repo_root)
        parts = relative.parts
        if not parts:
            return violations
        source_layer = parts[0]
        if source_layer not in self.ruleset.allowed_deps:
            return violations

        with open(file_path, encoding="utf-8") as f:
            tree = ast.parse(f.read(), filename=str(file_path))

        for node in ast.walk(tree):
            # Collect every target: "import a, b" names several modules at once,
            # and checking only the last alias would miss the others.
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                targets = [node.module]
            else:
                continue
            for target_module in targets:
                target_layer = target_module.split(".")[0]
                if target_layer not in self.ruleset.allowed_deps:
                    continue
                if not self.ruleset.is_allowed(source_layer, target_layer):
                    violations.append(ArchViolation(
                        source_module=".".join(relative.with_suffix("").parts),
                        target_module=target_module,
                        source_layer=source_layer,
                        target_layer=target_layer,
                        file_path=str(file_path),
                        line_number=getattr(node, "lineno", 0),
                        severity=Severity.ERROR,
                        message=(
                            f"Layer '{source_layer}' must not depend on '{target_layer}'. "
                            f"Import of '{target_module}' violates dependency rules."
                        ),
                    ))
        return violations

We run this as what Neal Ford calls an "architectural fitness function": an automated check that evaluates how well the system adheres to its intended architecture. It runs on every PR, just like tests.
The GitHub Action
The integration is a GitHub Actions workflow that pulls the diff, runs both evaluation phases, and posts results as a PR comment. If should_block is true, the review script exits with a non-zero code and the check fails.

name: Architecture Review
on:
  pull_request:
    paths:
      - "**.py"
permissions:
  contents: read
  pull-requests: write  # lets the last step post a PR comment
jobs:
  arch-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: pip install networkx openai
      - name: Get changed files
        id: changes
        run: |
          echo "files=$(git diff --name-only origin/${{ github.base_ref }}...HEAD | grep '\.py$' | tr '\n' ',')" >> "$GITHUB_OUTPUT"
      - name: Get diff
        id: diff
        run: |
          git diff origin/${{ github.base_ref }}...HEAD -- '*.py' > /tmp/pr_diff.patch
      - name: Run architecture checks
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          CHANGED_FILES: ${{ steps.changes.outputs.files }}
        run: python scripts/arch_review.py --diff /tmp/pr_diff.patch --files "$CHANGED_FILES"
      - name: Post results
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const result = JSON.parse(fs.readFileSync('/tmp/arch_review_result.json', 'utf8'));
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: result.comment_body,
            });

We added a cost-saving optimization: the LLM evaluation only runs if the deterministic checks pass. No point spending tokens on a diff that's already blocked. We also cache the architecture context as a system message in a persistent assistant thread, which cuts token usage by about 40% on repeated calls.
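The workflow assumes that scripts/arch_review.py writes /tmp/arch_review_result.json with a comment_body field and exits non-zero when the PR should be blocked. That script isn't shown in this post, so the following is only a sketch of how the glue might look; `build_comment_body`, `run_review`, and the `evaluate` callback are hypothetical names, and the evaluation itself is passed in so the example runs without an API key.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical glue code: the real scripts/arch_review.py is not shown here.
# evaluate(changed_files, diff) returns (should_block, summary, messages).
EvaluateFn = Callable[[list[Path], str], tuple[bool, str, list[str]]]


def build_comment_body(should_block: bool, summary: str, messages: list[str]) -> str:
    """Render the PR comment text."""
    status = "Blocked" if should_block else "Passed"
    lines = [f"Architecture review: {status}", "", summary]
    lines += [f"- {message}" for message in messages]
    return "\n".join(lines)


def run_review(diff_path: Path, files_csv: str, out_path: Path, evaluate: EvaluateFn) -> int:
    """Run the checks, write the JSON the workflow reads, return the exit code."""
    # The workflow joins file names with commas and leaves a trailing one.
    changed = [Path(p) for p in files_csv.split(",") if p]
    diff = diff_path.read_text(encoding="utf-8")
    should_block, summary, messages = evaluate(changed, diff)
    body = build_comment_body(should_block, summary, messages)
    out_path.write_text(json.dumps({"comment_body": body}), encoding="utf-8")
    return 1 if should_block else 0
```

In a real script, `evaluate` would wrap `ArchitectureFitnessFunction.evaluate`, the arguments would come from `argparse`, and the return value would be handed to `sys.exit`.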
What We Learned After Two Months
The deterministic layer catches about 70% of violations. Fast, free, deterministic. It caught 23 violations in the first week alone — all from agent-generated PRs that reviewers had approved.
The LLM layer catches things I wouldn't have thought to write rules for. It flagged a domain service that was constructing Redis key patterns. No Redis import anywhere — just string concatenation that happened to produce user:{id}:sessions. It flagged a presentation handler that was implementing retry logic that belonged in the infrastructure layer. Subtle stuff.
False positive rate on the LLM layer sits around 8%. We keep a false_positives.jsonl log and I review it weekly. Most false positives come from the model not understanding our specific conventions around shared kernel modules. I've been iterating on the architecture context prompt to address these.
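For reference, a log like that takes very little code. This is a minimal sketch with an assumed record shape (`pr`, `rule`, `note`); our actual false_positives.jsonl fields may differ.

```python
import json
from pathlib import Path


def log_false_positive(log_path: Path, pr_number: int, rule: str, note: str) -> None:
    """Append one human-confirmed false positive as a JSON line."""
    record = {"pr": pr_number, "rule": rule, "note": note}
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def false_positive_rate(log_path: Path, total_flagged: int) -> float:
    """Share of LLM flags that a reviewer later marked as wrong."""
    if total_flagged == 0:
        return 0.0
    with log_path.open(encoding="utf-8") as f:
        false_positives = sum(1 for line in f if line.strip())
    return false_positives / total_flagged
```

The `rule` field is the useful one for prompt iteration: grouping by it shows which invariant the model keeps misreading.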
The agents adapted too. Once the action started blocking their PRs, the agent-generated code started conforming better. The feedback loop works — agents read the failure messages and adjust. We went from 23 violations in week one to 3 in week eight.
Configuration as Code
One thing that made adoption easier: the architecture rules live in the repo as a YAML file that developers can read and update through normal PR processes.

# .architecture/rules.yaml
layers:
  domain:
    allowed_dependencies: []
    invariants:
      - "No framework imports"
      - "No I/O operations"
      - "No technology-specific serialization"
  application:
    allowed_dependencies: ["domain"]
    invariants:
      - "Defines ports as abstract classes"
      - "No direct infrastructure access"
      - "Orchestration only, no business logic"
  infrastructure:
    allowed_dependencies: ["domain", "application"]
    invariants:
      - "Implements ports from application layer"
      - "All I/O lives here"
  presentation:
    allowed_dependencies: ["application"]
    invariants:
      - "No business logic"
      - "No direct infrastructure access"
llm_evaluation:
  enabled: true
  confidence_threshold: 0.85
  model: "gpt-4o"
  skip_on_hard_fail: true

The invariants field gets fed directly into the LLM prompt. When someone adds a new architectural constraint, it immediately becomes part of the evaluation. No code changes needed.
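A sketch of how the parsed file could feed both layers follows. It assumes `config` is the dictionary you get from parsing rules.yaml (for example with PyYAML's `yaml.safe_load`; the post doesn't depend on a particular parser), and the two function names are illustrative.

```python
def allowed_deps_from_config(config: dict) -> dict[str, set[str]]:
    """Build the mapping that ArchitectureRuleset expects."""
    return {
        name: set(spec.get("allowed_dependencies", []))
        for name, spec in config["layers"].items()
    }


def invariants_prompt(config: dict) -> str:
    """Render each layer's invariants as a prompt section."""
    sections = []
    for name, spec in config["layers"].items():
        bullets = "\n".join(f"- {inv}" for inv in spec.get("invariants", []))
        sections.append(f"{name.upper()} LAYER:\n{bullets}")
    return "\n\n".join(sections)


# A trimmed-down version of the parsed YAML
config = {
    "layers": {
        "domain": {"allowed_dependencies": [], "invariants": ["No I/O operations"]},
        "application": {
            "allowed_dependencies": ["domain"],
            "invariants": ["No direct infrastructure access"],
        },
    }
}
print(allowed_deps_from_config(config))
print(invariants_prompt(config))
```

Deriving both the deterministic ruleset and the prompt text from one file is what keeps the two layers from drifting apart.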
Where This Breaks Down
It's not perfect. The LLM struggles with large diffs — anything over 800 lines and accuracy drops noticeably. We chunk large PRs by layer and evaluate each chunk separately, which helps but isn't ideal.
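The chunking itself is simple. Here is a simplified sketch that assumes a `git diff` style patch and treats the first path segment as the layer; `chunk_diff_by_layer` is an illustrative name, and paths containing spaces or the sequence " b/" would need sturdier parsing.

```python
from collections import defaultdict


def chunk_diff_by_layer(diff: str) -> dict[str, str]:
    """Group the per-file sections of a git diff by top-level directory."""
    chunks: dict[str, list[str]] = defaultdict(list)
    layer = None
    for line in diff.splitlines(keepends=True):
        if line.startswith("diff --git "):
            # Header looks like: diff --git a/domain/user.py b/domain/user.py
            new_path = line.rstrip("\n").split(" b/", 1)[-1]
            layer = new_path.split("/", 1)[0]
        if layer is not None:
            chunks[layer].append(line)
    return {name: "".join(lines) for name, lines in chunks.items()}
```

Each value can then go through the LLM evaluation on its own. The trade-off is that a violation spanning two layers, such as a handler and the adapter it reaches into, is split across chunks, which is part of why this isn't ideal.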
The deterministic layer only works for import-based coupling. If someone passes infrastructure concerns through function parameters or config dictionaries, it won't catch that. The LLM sometimes catches it, sometimes doesn't.
And there's the cost question. Running GPT-4o on every PR adds up. We're spending about $180/month on this. For us, that's nothing compared to the cost of the DynamoDB migration disaster. But I could see smaller teams balking at it.
I've been experimenting with running a smaller model (GPT-4o-mini) as a first pass and only escalating to the full model when the smaller one flags something with low confidence. Early results look promising — cuts costs by 60% with minimal accuracy loss on clear violations.
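The escalation logic can be sketched independently of any particular API. In this illustration the two models are plain callables returning `(has_violation, confidence)`, a simplification of `LLMEvaluation`; `tiered_evaluate` and the `Verdict` alias are assumed names, and reusing 0.85 as the escalation cutoff is also an assumption.

```python
from typing import Callable

# (has_violation, confidence) as reported by a model.
Verdict = tuple[bool, float]


def tiered_evaluate(
    diff: str,
    cheap: Callable[[str], Verdict],
    full: Callable[[str], Verdict],
    threshold: float = 0.85,
) -> tuple[Verdict, str]:
    """Run the cheap model first; escalate only on low-confidence flags."""
    verdict = cheap(diff)
    has_violation, confidence = verdict
    if has_violation and confidence < threshold:
        # The small model suspects something but isn't sure: ask the big one.
        return full(diff), "full"
    return verdict, "cheap"
```

Clean diffs and confident flags never reach the larger model, which is where the savings come from. The cost is that anything the small model misses outright is never escalated.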