Architecture Drift Reduction with LLMs: A Python Tool That Combines Structural Rules with AI-Based Evaluation

The Slow Rot Nobody Noticed

Six months ago we had a clean hexagonal architecture. Domain layer in the center, ports and adapters on the outside, clear dependency rules. I could draw it on a whiteboard and it matched reality. Then we onboarded three coding agents to handle the backlog.

The agents were productive. Unreasonably productive. They shipped features fast, tests passed, reviewers approved. But nobody was checking whether the code respected the architecture. Agents don't read architecture decision records. They don't know that domain should never import from infrastructure.persistence. They pattern-match from context windows and existing code, and if one shortcut exists, they'll replicate it fifty times.

I noticed when I tried to swap out our PostgreSQL adapter for DynamoDB. What should've been a clean adapter replacement turned into a three-week excavation. The domain layer had direct imports from the persistence layer. Application services were calling infrastructure utilities. The hexagonal architecture existed only in our Confluence docs.

Like Arceus creating the Pokemon universe with specific rules and order — we'd defined the rules, but nothing was enforcing them while the agents worked.

Quantifying the Damage

Before building any tooling, I needed to understand how bad things actually were. I wrote a quick script to parse imports across our Python monorepo and build a dependency graph. The results were grim.

import ast
from pathlib import Path
from collections import defaultdict

def extract_imports(file_path: str) -> list[str]:
    """Extract all import targets from a Python file."""
    # Read as UTF-8 explicitly so the result does not depend on the platform default.
    with open(file_path, "r", encoding="utf-8") as f:
        tree = ast.parse(f.read(), filename=file_path)

    imports = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imports.append(alias.name)
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                imports.append(node.module)
    return imports


LAYERS = {"domain", "application", "infrastructure", "presentation"}


def build_dependency_graph(root: Path) -> dict[str, set[str]]:
    """Build a layer-level dependency graph from a source tree."""
    graph = defaultdict(set)

    for py_file in root.rglob("*.py"):
        # The first path component under root is treated as the layer name.
        layer = py_file.relative_to(root).with_suffix("").parts[0]

        for imp in extract_imports(str(py_file)):
            # Compare the first dotted component exactly, so a package such as
            # "domain_utils" is not mistaken for the "domain" layer.
            target_layer = imp.split(".")[0]
            if target_layer in LAYERS:
                graph[layer].add(target_layer)

    return dict(graph)

Our hexagonal rules were simple: domain imports nothing, application imports domain, infrastructure imports application and domain, and presentation imports application. That's it. Checking every import against those rules turned up 47 violations, each one a dependency that shouldn't exist. Most had been introduced in the last four months.
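A set of layer-to-layer edges collapses duplicates, so getting an actual violation count means checking each import individually. Here is a minimal sketch of that count, assuming the four layer names above. `ALLOWED` and `count_violations` are illustrative names, and the sample input is made up.

```python
from collections import Counter

# Allowed layer-to-layer imports, transcribed from the rules stated above.
ALLOWED = {
    "domain": set(),
    "application": {"domain"},
    "infrastructure": {"domain", "application"},
    "presentation": {"application"},
}


def count_violations(imports_by_module: dict[str, list[str]]) -> Counter:
    """Count disallowed (source_layer, target_layer) imports.

    `imports_by_module` maps a dotted module name to the modules it imports,
    for example the output of extract_imports() collected per file.
    """
    violations = Counter()
    for module, imports in imports_by_module.items():
        source = module.split(".")[0]
        if source not in ALLOWED:
            continue  # file lives outside the four layers
        for imp in imports:
            target = imp.split(".")[0]
            if target not in ALLOWED or target == source:
                continue  # stdlib/third-party import, or same-layer import
            if target not in ALLOWED[source]:
                violations[(source, target)] += 1
    return violations


# Hypothetical input, not data from our repository.
sample = {
    "domain.orders": ["dataclasses", "infrastructure.persistence.models"],
    "application.checkout": ["domain.orders", "infrastructure.utils"],
    "infrastructure.persistence.repo": ["domain.orders", "application.ports"],
}
result = count_violations(sample)
print(result)
```

Keying the counter by layer pair makes it easy to see which boundary leaks most often.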

Deterministic Rules: The ArchUnit Approach

The first layer of defense had to be deterministic. No ambiguity, no LLM hallucination risk. If module A imports from module B and that edge isn't in the allowed list, it fails. Period.

I built this as a standalone Python tool using networkx for graph analysis:

import networkx as nx
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    ERROR = "error"
    WARNING = "warning"


@dataclass
class ArchViolation:
    source_module: str
    target_module: str
    source_layer: str
    target_layer: str
    file_path: str
    line_number: int
    severity: Severity
    message: str


class ArchitectureRuleset:
    """Defines allowed dependency directions between layers."""

    def __init__(self, allowed_deps: dict[str, set[str]]):
        self.allowed_deps = allowed_deps
        self._graph = nx.DiGraph()

        for source, targets in allowed_deps.items():
            for target in targets:
                self._graph.add_edge(source, target)

    def is_allowed(self, source_layer: str, target_layer: str) -> bool:
        if source_layer == target_layer:
            return True
        return target_layer in self.allowed_deps.get(source_layer, set())

    def check_for_cycles(self) -> list[list[str]]:
        """Detect circular dependencies between layers."""
        return list(nx.simple_cycles(self._graph))


# Our hexagonal architecture rules
HEXAGONAL_RULES = ArchitectureRuleset({
    "domain": set(),  # domain depends on nothing
    "application": {"domain"},
    "infrastructure": {"domain", "application"},
    "presentation": {"application"},
})

This catches the obvious stuff. Domain importing infrastructure? Blocked. Presentation reaching into domain directly? Blocked. The tool walks the AST, resolves imports to layers, and checks every edge against the ruleset.

But here's where it gets interesting. Not all violations are structural.

The Subtle Violations Deterministic Rules Miss

Consider this: an application service that technically only imports from the domain layer, but it's constructing SQL query fragments and passing them as strings to a domain method. Structurally clean. Architecturally rotten. The application layer is encoding persistence concerns without a single illegal import.

Or this: a domain entity that defines a method called to_dynamodb_item(). No import violations. The dependency graph looks perfect. But the domain is now coupled to a specific infrastructure technology through intent, not imports.
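To make the second case concrete, here is a hypothetical entity (not taken from our codebase) next to a version that moves the mapping out of the domain. The class names, fields, and item layout are all assumptions for illustration.

```python
from dataclasses import dataclass


# Structurally clean but coupled by intent: the entity knows DynamoDB's
# attribute-value format even though it imports nothing from infrastructure.
@dataclass(frozen=True)
class LeakyOrder:
    order_id: str
    total_cents: int

    def to_dynamodb_item(self) -> dict:
        return {
            "pk": {"S": f"ORDER#{self.order_id}"},
            "total": {"N": str(self.total_cents)},
        }


# Same data, with the persistence knowledge removed from the domain.
@dataclass(frozen=True)
class Order:
    order_id: str
    total_cents: int


# This mapper would live in the infrastructure layer (hypothetical location:
# infrastructure/persistence/dynamodb_mapper.py).
def order_to_dynamodb_item(order: Order) -> dict:
    return {
        "pk": {"S": f"ORDER#{order.order_id}"},
        "total": {"N": str(order.total_cents)},
    }
```

Both versions produce the same item. The difference is which layer has to change when the storage technology does.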

This is where I realized we needed a second evaluation layer. One that understands architectural intent, not just structural edges.

LLM-Based Evaluation: Catching Intent Violations

The idea is straightforward. Take the diff from a PR, combine it with our architectural constraints expressed in natural language, and ask a model: "Does this change violate our architecture, and if so, how?"

import json
from openai import OpenAI

ARCHITECTURE_CONTEXT = """
Our system follows hexagonal architecture with these invariants:

1. DOMAIN LAYER: Pure business logic. No framework imports, no I/O, no technology-specific code.
   Domain entities must not reference persistence mechanisms, serialization formats, or transport protocols.

2. APPLICATION LAYER: Orchestrates domain objects. Defines ports (interfaces) for infrastructure.
   Must not contain SQL, HTTP client calls, or file system access directly.

3. INFRASTRUCTURE LAYER: Implements ports defined in application layer. Contains all I/O.
   Adapters here are replaceable without touching domain or application code.

4. PRESENTATION LAYER: HTTP handlers, CLI commands. Translates external requests to application calls.
   Must not contain business logic or direct infrastructure access.

Additional constraints:
- No layer may skip a layer (presentation cannot call infrastructure directly)
- Domain events are the only coupling mechanism between bounded contexts
- Value objects are immutable and contain no IDs
"""


@dataclass
class LLMEvaluation:
    has_violation: bool
    confidence: float
    violations: list[dict]
    reasoning: str


def evaluate_diff_with_llm(
    diff: str,
    file_paths: list[str],
    client: OpenAI,
    model: str = "gpt-4o",
) -> LLMEvaluation:
    """Evaluate a code diff against architectural constraints using an LLM."""

    prompt = f"""You are an architecture reviewer. Analyze this code diff against our architectural rules.

{ARCHITECTURE_CONTEXT}

## Code Diff

{diff}


## Files Changed
{json.dumps(file_paths, indent=2)}

Respond in JSON with this structure:
{{
    "has_violation": bool,
    "confidence": float (0-1),
    "violations": [
        {{
            "file": "path",
            "description": "what's wrong",
            "rule_violated": "which architectural invariant",
            "severity": "error|warning"
        }}
    ],
    "reasoning": "brief explanation of your analysis"
}}

Only flag clear architectural violations. Style preferences or minor naming issues are NOT violations.
Be conservative — false positives erode trust in the tool.
"""

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.1,
    )

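    # JSON mode should return valid JSON, but the model may omit or add keys,
    # so read each field defensively instead of unpacking with **result.
    result = json.loads(response.choices[0].message.content or "{}")
    return LLMEvaluation(
        has_violation=bool(result.get("has_violation", False)),
        confidence=float(result.get("confidence", 0.0)),
        violations=list(result.get("violations", [])),
        reasoning=str(result.get("reasoning", "")),
    )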

I set temperature to 0.1. We want consistency here, not creativity. The confidence score matters too: we only block PRs when the model reports a violation with confidence of 0.85 or higher. Below that, it adds a comment for human review.
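The gating described here can be written as a small pure function. This is a sketch only: `Action` and `decide_action` are illustrative names, and the default threshold mirrors the 0.85 mentioned above.

```python
from enum import Enum


class Action(Enum):
    PASS = "pass"
    COMMENT = "comment"  # leave a note for a human reviewer
    BLOCK = "block"      # fail the check


def decide_action(has_violation: bool, confidence: float, threshold: float = 0.85) -> Action:
    """Map the model's verdict to what the check should do."""
    if not has_violation:
        return Action.PASS
    return Action.BLOCK if confidence >= threshold else Action.COMMENT
```

Keeping this decision separate from the API call makes the threshold easy to unit-test and tune without spending tokens.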

Combining Both: The Architecture Fitness Function

The real power comes from running both checks together. Deterministic rules are the hard boundary: every edge they flag is a real import that the ruleset forbids, they're fast, and they don't cost API calls. The LLM evaluation is the soft boundary: it catches things static analysis can't, but it's slower and occasionally wrong.

from dataclasses import field


@dataclass
class ArchReviewResult:
    deterministic_violations: list[ArchViolation] = field(default_factory=list)
    llm_evaluation: LLMEvaluation | None = None
    should_block: bool = False
    summary: str = ""


class ArchitectureFitnessFunction:
    """Combined architectural fitness check for PR evaluation."""

    def __init__(
        self,
        ruleset: ArchitectureRuleset,
        llm_client: OpenAI,
        confidence_threshold: float = 0.85,
        skip_llm_on_hard_fail: bool = True,
    ):
        self.ruleset = ruleset
        self.llm_client = llm_client
        self.confidence_threshold = confidence_threshold
        self.skip_llm_on_hard_fail = skip_llm_on_hard_fail

    def evaluate(
        self,
        changed_files: list[Path],
        diff: str,
        repo_root: Path,
    ) -> ArchReviewResult:
        result = ArchReviewResult()

        # Phase 1: Deterministic structural checks
        for file_path in changed_files:
            if file_path.suffix != ".py":
                continue

            violations = self._check_file_dependencies(file_path, repo_root)
            result.deterministic_violations.extend(violations)

        hard_failures = [
            v for v in result.deterministic_violations
            if v.severity == Severity.ERROR
        ]

        if hard_failures:
            result.should_block = True
            result.summary = (
                f"Found {len(hard_failures)} hard architecture violation(s). "
                "Fix dependency direction issues before merge."
            )

            if self.skip_llm_on_hard_fail:
                return result

        # Phase 2: LLM-based intent evaluation
        file_paths = [str(f) for f in changed_files]
        result.llm_evaluation = evaluate_diff_with_llm(
            diff=diff,
            file_paths=file_paths,
            client=self.llm_client,
        )

        if (
            result.llm_evaluation.has_violation
            and result.llm_evaluation.confidence >= self.confidence_threshold
        ):
            llm_errors = [
                v for v in result.llm_evaluation.violations
                if v.get("severity") == "error"
            ]
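            if llm_errors:
                result.should_block = True
                llm_summary = (
                    f"LLM detected {len(llm_errors)} intent violation(s) "
                    f"with {result.llm_evaluation.confidence:.0%} confidence."
                )
                # Avoid a leading space when there is no deterministic summary.
                result.summary = f"{result.summary} {llm_summary}".strip()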

        if not result.should_block and not result.summary:
            result.summary = "Architecture checks passed."

        return result

    def _check_file_dependencies(
        self, file_path: Path, repo_root: Path
    ) -> list[ArchViolation]:
        """Check a single file's imports against the ruleset."""
        violations = []
        relative = file_path.relative_to(repo_root)
        parts = relative.parts

        if not parts:
            return violations

        source_layer = parts[0]
        if source_layer not in self.ruleset.allowed_deps:
            return violations

        with open(file_path, encoding="utf-8") as f:
            tree = ast.parse(f.read(), filename=str(file_path))

        for node in ast.walk(tree):
            # Collect every module named by this statement. "import a, b"
            # carries several aliases, and each one must be checked.
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                # level > 0 is a relative import, which stays inside the
                # current package and cannot be resolved to a layer by name.
                targets = [node.module]
            else:
                continue

            for target_module in targets:
                target_layer = target_module.split(".")[0]
                if target_layer not in self.ruleset.allowed_deps:
                    continue

                if not self.ruleset.is_allowed(source_layer, target_layer):
                    violations.append(ArchViolation(
                        source_module=".".join(relative.with_suffix("").parts),
                        target_module=target_module,
                        source_layer=source_layer,
                        target_layer=target_layer,
                        file_path=str(file_path),
                        line_number=getattr(node, "lineno", 0),
                        severity=Severity.ERROR,
                        message=(
                            f"Layer '{source_layer}' must not depend on '{target_layer}'. "
                            f"Import of '{target_module}' violates dependency rules."
                        ),
                    ))

        return violations

We run this as what Neal Ford calls an "architectural fitness function" — an automated check that evaluates how well the system adheres to its intended architecture. It runs on every PR, just like tests.

The GitHub Action

The integration is a GitHub Actions workflow that pulls the diff, runs both evaluation phases, and posts results as a PR comment. If should_block is true, the script exits with a non-zero code and the check fails.

name: Architecture Review
on:
  pull_request:
    paths:
      - "**.py"

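jobs:
  arch-review:
    runs-on: ubuntu-latest
    # Posting the PR comment needs write access; the default token may be read-only.
    permissions:
      contents: read
      pull-requests: write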
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: pip install networkx openai

      - name: Get changed files
        id: changes
        run: |
          # --diff-filter=d skips deleted files, which the checker could not open.
          echo "files=$(git diff --name-only --diff-filter=d origin/${{ github.base_ref }}...HEAD | grep '\.py$' | tr '\n' ',')" >> "$GITHUB_OUTPUT"

      - name: Get diff
        id: diff
        run: |
          git diff origin/${{ github.base_ref }}...HEAD -- '*.py' > /tmp/pr_diff.patch

      - name: Run architecture checks
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          CHANGED_FILES: ${{ steps.changes.outputs.files }}
        run: python scripts/arch_review.py --diff /tmp/pr_diff.patch --files "$CHANGED_FILES"

      - name: Post results
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const result = JSON.parse(fs.readFileSync('/tmp/arch_review_result.json', 'utf8'));

            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: result.comment_body,
            });

We added a cost-saving optimization: the LLM evaluation only runs if the deterministic checks pass. No point spending tokens on a diff that's already blocked. We also cache the architecture context as a system message in a persistent assistant thread, which cuts token usage by about 40% on repeated calls.
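The workflow calls scripts/arch_review.py, which isn't shown in this post. Below is a minimal sketch of what its command-line shell might look like, assuming only the two flags, the result file path, and the `comment_body` field that the workflow uses. The helper names and the `should_block` JSON field are illustrative, and the evaluation itself is stubbed out.

```python
import argparse
import json
from pathlib import Path

# Path read by the "Post results" step in the workflow.
RESULT_PATH = Path("/tmp/arch_review_result.json")


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run architecture checks on a PR")
    parser.add_argument("--diff", type=Path, required=True, help="path to the unified diff")
    parser.add_argument("--files", default="", help="comma-separated changed files")
    return parser.parse_args(argv)


def split_files(raw: str) -> list[Path]:
    # The workflow joins file names with commas and leaves a trailing comma.
    return [Path(p) for p in raw.split(",") if p.strip()]


def build_result(should_block: bool, summary: str) -> tuple[dict, int]:
    """Return the JSON payload for the comment step and the process exit code."""
    payload = {
        "should_block": should_block,
        "comment_body": f"### Architecture Review\n\n{summary}",
    }
    return payload, (1 if should_block else 0)


def main(argv: list[str]) -> int:
    args = parse_args(argv)
    changed = split_files(args.files)
    diff = args.diff.read_text(encoding="utf-8")
    # Placeholder for the real evaluation: the actual script would hand
    # `changed` and `diff` to ArchitectureFitnessFunction.evaluate(...).
    should_block = False
    summary = f"Checked {len(changed)} file(s) across {len(diff.splitlines())} diff line(s)."
    payload, exit_code = build_result(should_block, summary)
    RESULT_PATH.write_text(json.dumps(payload), encoding="utf-8")
    return exit_code
```

Run as a script, it would end with `sys.exit(main(sys.argv[1:]))`, so the non-zero exit code fails the check while the result file is still there for the comment step.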

What We Learned After Two Months

The deterministic layer catches about 70% of violations. Fast, free, deterministic. It caught 23 violations in the first week alone — all from agent-generated PRs that reviewers had approved.

The LLM layer catches things I wouldn't have thought to write rules for. It flagged a domain service that was constructing Redis key patterns. No Redis import anywhere — just string concatenation that happened to produce user:{id}:sessions. It flagged a presentation handler that was implementing retry logic that belonged in the infrastructure layer. Subtle stuff.

False positive rate on the LLM layer sits around 8%. We keep a false_positives.jsonl log and I review it weekly. Most false positives come from the model not understanding our specific conventions around shared kernel modules. I've been iterating on the architecture context prompt to address these.
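For reference, a log like this needs very little code. The record fields below are a hypothetical schema, not the exact one we use, and `false_positive_rate` just spells out the arithmetic behind the percentage.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def format_record(pr_number: int, violation: dict, note: str) -> str:
    """Serialize one false positive as a single JSON line."""
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "pr_number": pr_number,
        "violation": violation,   # the violation dict as returned by the model
        "reviewer_note": note,    # why a human judged it to be wrong
    }
    return json.dumps(record)


def log_false_positive(log_path: Path, pr_number: int, violation: dict, note: str) -> None:
    # Append mode keeps earlier entries; one JSON object per line.
    with log_path.open("a", encoding="utf-8") as f:
        f.write(format_record(pr_number, violation, note) + "\n")


def false_positive_rate(flagged: int, false_positives: int) -> float:
    return false_positives / flagged if flagged else 0.0
```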

The agents adapted too. Once the action started blocking their PRs, the agent-generated code started conforming better. The feedback loop works — agents read the failure messages and adjust. We went from 23 violations in week one to 3 in week eight.

Configuration as Code

One thing that made adoption easier: the architecture rules live in the repo as a YAML file that developers can read and update through normal PR processes.

# .architecture/rules.yaml
layers:
  domain:
    allowed_dependencies: []
    invariants:
      - "No framework imports"
      - "No I/O operations"
      - "No technology-specific serialization"

  application:
    allowed_dependencies: ["domain"]
    invariants:
      - "Defines ports as abstract classes"
      - "No direct infrastructure access"
      - "Orchestration only, no business logic"

  infrastructure:
    allowed_dependencies: ["domain", "application"]
    invariants:
      - "Implements ports from application layer"
      - "All I/O lives here"

  presentation:
    allowed_dependencies: ["application"]
    invariants:
      - "No business logic"
      - "No direct infrastructure access"

llm_evaluation:
  enabled: true
  confidence_threshold: 0.85
  model: "gpt-4o"
  skip_on_hard_fail: true

The invariants field gets fed directly into the LLM prompt. When someone adds a new architectural constraint, it immediately becomes part of the evaluation. No code changes needed.
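A rough sketch of that wiring, assuming the YAML has already been parsed into a dict (for example with `yaml.safe_load` from PyYAML). The function names are illustrative, and the rendered prompt layout is an assumption, not the exact text we send.

```python
def allowed_deps_from_config(config: dict) -> dict[str, set[str]]:
    """Turn the `layers` section into the mapping ArchitectureRuleset expects."""
    return {
        layer: set(spec.get("allowed_dependencies", []))
        for layer, spec in config["layers"].items()
    }


def invariants_prompt(config: dict) -> str:
    """Render each layer's invariants as a plain-text block for the LLM prompt."""
    sections = []
    for layer, spec in config["layers"].items():
        bullets = "\n".join(f"- {inv}" for inv in spec.get("invariants", []))
        sections.append(f"{layer.upper()} LAYER:\n{bullets}")
    return "\n\n".join(sections)


# Abbreviated stand-in for the parsed contents of .architecture/rules.yaml.
config = {
    "layers": {
        "domain": {
            "allowed_dependencies": [],
            "invariants": ["No framework imports", "No I/O operations"],
        },
        "application": {
            "allowed_dependencies": ["domain"],
            "invariants": ["No direct infrastructure access"],
        },
    }
}

print(allowed_deps_from_config(config))
print(invariants_prompt(config))
```

Both the deterministic ruleset and the prompt text come from the same file, so the two layers cannot drift apart.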

Where This Breaks Down

It's not perfect. The LLM struggles with large diffs: once a diff passes roughly 800 lines, accuracy drops noticeably. We chunk large PRs by layer and evaluate each chunk separately, which helps but isn't ideal.
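Chunking by layer can be done directly on the unified diff. The sketch below assumes standard `git diff` output, where each file section starts with a `diff --git a/<path> b/<path>` header, and treats the first path component as the layer. `split_diff_by_layer` is an illustrative name, and paths containing the literal text " b/" would need more careful parsing.

```python
from collections import defaultdict


def split_diff_by_layer(diff: str) -> dict[str, str]:
    """Group the per-file sections of a unified diff by top-level directory."""
    chunks: dict[str, list[str]] = defaultdict(list)
    current_layer = None
    for line in diff.splitlines(keepends=True):
        if line.startswith("diff --git "):
            # Take the post-change path and use its first component as the layer.
            new_path = line.rstrip("\n").split(" b/", 1)[-1]
            current_layer = new_path.split("/", 1)[0]
        if current_layer is not None:
            chunks[current_layer].append(line)
    return {layer: "".join(lines) for layer, lines in chunks.items()}


# Made-up two-file diff for illustration.
sample_diff = (
    "diff --git a/domain/order.py b/domain/order.py\n"
    "--- a/domain/order.py\n"
    "+++ b/domain/order.py\n"
    "@@ -1 +1,2 @@\n"
    " class Order: ...\n"
    "+# note\n"
    "diff --git a/infrastructure/repo.py b/infrastructure/repo.py\n"
    "--- a/infrastructure/repo.py\n"
    "+++ b/infrastructure/repo.py\n"
    "@@ -1 +1,2 @@\n"
    " class Repo: ...\n"
    "+# note\n"
)
layer_chunks = split_diff_by_layer(sample_diff)
print(sorted(layer_chunks))
```

Each chunk can then go through the LLM evaluation on its own. The trade-off is that the model no longer sees cross-layer changes side by side.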

The deterministic layer only works for import-based coupling. If someone passes infrastructure concerns through function parameters or config dictionaries, it won't catch that. The LLM sometimes catches it, sometimes doesn't.

And there's the cost question. Running GPT-4o on every PR adds up. We're spending about $180/month on this. For us, that's nothing compared to the cost of the DynamoDB migration disaster. But I could see smaller teams balking at it.

I've been experimenting with running a smaller model (GPT-4o-mini) as a first pass and only escalating to the full model when the smaller one flags something with low confidence. Early results look promising — cuts costs by 60% with minimal accuracy loss on clear violations.
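The escalation logic itself is small. Here is a sketch with the two models passed in as plain callables so it runs without API access. `Verdict`, `cascade`, and the `escalate_below` cutoff are assumptions for illustration, and the stub evaluators stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    has_violation: bool
    confidence: float
    model: str


Evaluator = Callable[[str], Verdict]


def cascade(diff: str, small: Evaluator, large: Evaluator, escalate_below: float = 0.85) -> Verdict:
    """Run the small model first; escalate only on a low-confidence flag."""
    first = small(diff)
    if first.has_violation and first.confidence < escalate_below:
        return large(diff)
    return first


# Stub evaluators standing in for real API calls.
calls = []


def small_stub(diff: str) -> Verdict:
    calls.append("small")
    return Verdict(has_violation=True, confidence=0.6, model="small")


def large_stub(diff: str) -> Verdict:
    calls.append("large")
    return Verdict(has_violation=True, confidence=0.95, model="large")


final = cascade("diff text", small_stub, large_stub)
print(final.model, calls)
```

Clean diffs and confident flags never reach the larger model, which is where the savings come from.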