Eliminating Codebase Cognitive Debt: A TypeScript Tool That Maps AI-Generated Code Understanding Gaps

The Incident That Started Everything

Three months ago, our payment reconciliation service threw a RangeError in production at 2am. The on-call engineer pulled up the file, stared at it for twenty minutes, then escalated. The next engineer did the same thing. By the time I got the page, three senior engineers had looked at this code and none of us could explain what it was doing.

Git blame told the story. The file had been generated by an AI coding agent during a sprint back in February. The PR had two approvals. Both reviewers had left comments like "looks good, tests pass" and "nice, clean implementation." Nobody had actually read it. Nobody understood the reconciliation algorithm it implemented. Nobody could explain why it was using a sliding window approach instead of the batch processing pattern we use everywhere else.

The fix took four hours. Understanding the code well enough to fix it took three of those hours.

That's when I started thinking about this differently. We didn't have technical debt in that file. The code was well-structured, typed correctly, tests passing. What we had was something worse: cognitive debt. Working code that exists outside the team's collective understanding.

Naming the Problem

Thoughtworks Technology Radar Vol 34 coined this exact term - "codebase cognitive debt" - and when I read their description I felt seen. They defined it as code that functions correctly but that no current team member can confidently explain, modify, or debug without significant ramp-up time. The traditional response to this is "well, just read the code." That worked when humans wrote all the code. It doesn't scale when agents can produce hundreds of lines per hour across multiple PRs.

Here's what makes cognitive debt different from regular tech debt. Tech debt is code you know is bad. You can point at it. You have a ticket somewhere. Cognitive debt is invisible. The code looks fine. It probably is fine. But there's a gap between what exists in the codebase and what exists in anyone's head, and that gap is a ticking time bomb for your incident response time.

I started asking around. Every team I talked to had the same pattern. Agent-generated code was shipping faster than human understanding could keep up. Not because the code was bad - because the review process hadn't adapted.

Measuring What Nobody Talks About

Before building anything, I needed to figure out what "understanding" even means in a measurable way. I landed on three proxy signals:

Git blame + PR author analysis. If the author of a code block is a bot account or an AI-assisted commit (we tag these), and the PR reviewers spent less than a certain threshold of time on it, that's signal one.

Code review depth metrics. A review where someone left inline comments on specific lines, asked questions, or requested changes indicates engagement. A review that's just an approval with no comments? That's a rubber stamp. We can measure this from the GitHub API.

Convention divergence. Every team develops patterns. If a file uses patterns that appear nowhere else in the codebase, that's signal three. Not because unconventional code is bad, but because unconventional code is harder for the team to reason about quickly.

None of these alone means much. Together, they paint a picture.
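As a rough sketch of how the first two signals could be turned into values, assuming a simplified review shape and a commit-trailer tagging scheme (the `ReviewSummary` fields, the `AI-Assisted: true` trailer, and both function names are hypothetical, not taken from the scanner itself):

```typescript
// Hypothetical, simplified view of one PR review. Real GitHub API payloads
// carry more fields; these three are assumptions made for the sketch.
interface ReviewSummary {
  state: "APPROVED" | "CHANGES_REQUESTED" | "COMMENTED";
  body: string; // top-level review comment, may be empty
  inlineCommentCount: number; // comments attached to specific lines
}

type ReviewDepth = "none" | "shallow" | "deep";

function classifyReviewDepth(reviews: ReviewSummary[]): ReviewDepth {
  // Inline comments or requested changes indicate real engagement.
  const engaged = reviews.some(
    (r) => r.inlineCommentCount > 0 || r.state === "CHANGES_REQUESTED",
  );
  if (engaged) return "deep";

  // A summary remark with no line-level comments is treated as shallow.
  const remarked = reviews.some((r) => r.body.trim().length > 0);
  if (remarked) return "shallow";

  // No reviews at all, or bare approvals with no comments: a rubber stamp.
  return "none";
}

// Assumed tagging scheme: bot accounts end in "[bot]" and AI-assisted
// commits carry a hypothetical "AI-Assisted: true" trailer.
function isAgentAuthor(author: string, commitTrailers: string[]): boolean {
  if (author.endsWith("[bot]")) return true;
  return commitTrailers.some(
    (t) => t.trim().toLowerCase() === "ai-assisted: true",
  );
}
```

Under these assumptions, the two approvals from the incident ("looks good, tests pass") would classify as shallow: a remark was left, but nothing was attached to a specific line.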

The Scanner: Architecture

I built this as a TypeScript CLI tool that runs against a repo and produces a cognitive debt report. It uses the TypeScript compiler API for AST analysis, the GitHub API for review metadata, and git for blame information.

The core scanning pipeline looks like this:

import ts from "typescript";
import { execSync } from "child_process";

interface CognitiveDebtSignal {
  file: string;
  function: string;
  line: number;
  signals: {
    authorIsAgent: boolean;
    reviewDepth: "none" | "shallow" | "deep";
    cyclomaticComplexity: number;
    conventionDivergence: number;
    unknownDependencies: string[];
  };
  score: number;
}

interface BlameEntry {
  author: string;
  commit: string;
  line: number;
}

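function parseGitBlame(filePath: string): BlameEntry[] {
  // --line-porcelain repeats the full commit header for every line, so each
  // entry gets an author even when one commit covers many lines.
  // filePath is interpolated into a shell command: pass trusted paths only.
  const raw = execSync(`git blame --line-porcelain -- "${filePath}"`, {
    encoding: "utf-8",
    maxBuffer: 64 * 1024 * 1024, // per-line headers make the output large
  });
  const entries: BlameEntry[] = [];
  let current: Partial<BlameEntry> = {};

  for (const line of raw.split("\n")) {
    // Header: <sha> <original line> <final line> [<lines in group>]
    const header = line.match(/^([a-f0-9]{40}) \d+ (\d+)/);
    if (header) {
      current = { commit: header[1], line: Number(header[2]) };
    } else if (line.startsWith("author ")) {
      current.author = line.slice(7);
    } else if (line.startsWith("\t")) {
      // The tab-prefixed source line closes the record for this line.
      if (current.author && current.commit && current.line) {
        entries.push(current as BlameEntry);
      }
      current = {};
    }
  }

  return entries;
}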

Nothing fancy here. The interesting part is the cyclomatic complexity calculator that works at the function level:

function calculateCyclomaticComplexity(node: ts.FunctionLikeDeclaration): number {
  let complexity = 1;

  function visit(child: ts.Node) {
    switch (child.kind) {
      case ts.SyntaxKind.IfStatement:
      case ts.SyntaxKind.ConditionalExpression:
      case ts.SyntaxKind.ForStatement:
      case ts.SyntaxKind.ForInStatement:
      case ts.SyntaxKind.ForOfStatement:
      case ts.SyntaxKind.WhileStatement:
      case ts.SyntaxKind.DoStatement:
      case ts.SyntaxKind.CatchClause:
      case ts.SyntaxKind.CaseClause: // each non-default switch branch is a decision point
        complexity++;
        break;
      case ts.SyntaxKind.BinaryExpression: {
        const binary = child as ts.BinaryExpression;
        if (
          binary.operatorToken.kind === ts.SyntaxKind.AmpersandAmpersandToken ||
          binary.operatorToken.kind === ts.SyntaxKind.BarBarToken ||
          binary.operatorToken.kind === ts.SyntaxKind.QuestionQuestionToken
        ) {
          complexity++;
        }
        break;
      }
    }
    ts.forEachChild(child, visit);
  }

  ts.forEachChild(node, visit);
  return complexity;
}

Convention Divergence Detection

This was the hardest part to get right. How do you quantify "this doesn't look like our code"? I settled on pattern frequency analysis. The scanner first builds a profile of the codebase's conventions, then scores individual files against that profile.

interface CodebaseConventions {
  errorHandlingPatterns: Map<string, number>;
  importSources: Map<string, number>;
  asyncPatterns: Map<string, number>;
  namingConventions: {
    camelCase: number;
    pascalCase: number;
    snakeCase: number;
  };
  averageFunctionLength: number;
  preferredIterationStyle: "for-of" | "forEach" | "map" | "mixed";
}

function detectConventionDivergence(
  file: ts.SourceFile,
  conventions: CodebaseConventions,
): number {
  let divergenceScore = 0;

  function visit(node: ts.Node) {
    if (ts.isImportDeclaration(node)) {
      const moduleSpecifier = node.moduleSpecifier;
      if (ts.isStringLiteral(moduleSpecifier)) {
        const source = moduleSpecifier.text;

        // Flag imports that appear nowhere else in the codebase
        if (!conventions.importSources.has(source)) {
          divergenceScore += 3;
        } else if ((conventions.importSources.get(source) ?? 0) < 3) {
          divergenceScore += 1;
        }
      }
    }

    if (ts.isFunctionDeclaration(node) || ts.isArrowFunction(node)) {
      // Length in characters; averageFunctionLength must use the same unit.
      const length = node.getEnd() - node.getStart(file);
      if (length > conventions.averageFunctionLength * 2.5) {
        divergenceScore += 2;
      }
    }

    // Arrow functions are never direct children of the source file,
    // so the walk has to recurse to reach them.
    ts.forEachChild(node, visit);
  }

  visit(file);
  return divergenceScore;
}

The import analysis turned out to be the most useful signal. When an agent pulls in a library nobody on the team has used before, that's a massive cognitive debt multiplier. You now have code that depends on something nobody can reason about without reading that library's documentation too.
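The import part of that profile can be sketched without the compiler API. This is a minimal illustration, assuming the imports per file have already been extracted; `buildImportProfile`, `scoreImports`, and the `excludeFile` parameter are hypothetical names. One detail worth noting: if the profile counts the file being scored, every one of its imports has been seen at least once, so the "appears nowhere else" branch can never fire. The sketch leaves the scored file out of its own profile for that reason.

```typescript
// importsByFile maps a file path to the module specifiers it imports.
// Returns, for each specifier, how many *other* files import it.
function buildImportProfile(
  importsByFile: Map<string, string[]>,
  excludeFile?: string,
): Map<string, number> {
  const counts = new Map<string, number>();
  importsByFile.forEach((imports, file) => {
    if (file === excludeFile) return; // do not let a file vouch for itself
    new Set(imports).forEach((source) => {
      counts.set(source, (counts.get(source) ?? 0) + 1);
    });
  });
  return counts;
}

// Mirrors the weights used above: +3 for an import seen nowhere else,
// +1 for an import seen in fewer than three other files.
function scoreImports(
  fileImports: string[],
  profile: Map<string, number>,
): number {
  let score = 0;
  for (const source of fileImports) {
    const seen = profile.get(source) ?? 0;
    if (seen === 0) score += 3;
    else if (seen < 3) score += 1;
  }
  return score;
}

// Made-up example data: three files share conventions, one pulls in rxjs.
const exampleImports = new Map<string, string[]>([
  ["a.ts", ["zod", "./util"]],
  ["b.ts", ["zod", "./util"]],
  ["c.ts", ["zod", "./util"]],
  ["d.ts", ["rxjs", "zod"]],
]);
const profileForD = buildImportProfile(exampleImports, "d.ts");
const importScoreForD = scoreImports(["rxjs", "zod"], profileForD); // 3
```

Here `rxjs` is unseen elsewhere (+3) and `zod` is used by three other files (+0), so `d.ts` gets an import divergence of 3.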

Unknown Dependency Detection

This one's straightforward but impactful. We maintain a team-known-deps.json (auto-generated from historical usage) and flag anything new:

interface DependencyProfile {
  knownPackages: Set<string>;
  packageUsageCount: Map<string, number>;
  lastHumanIntroduction: Map<string, string>; // package -> commit hash
}

function findUnknownDependencies(
  file: ts.SourceFile,
  profile: DependencyProfile,
): string[] {
  const unknown: string[] = [];

  ts.forEachChild(file, (node) => {
    if (!ts.isImportDeclaration(node)) return;
    const specifier = node.moduleSpecifier;
    if (!ts.isStringLiteral(specifier)) return;

    const pkg = specifier.text;
    // Skip relative imports
    if (pkg.startsWith(".") || pkg.startsWith("@/")) return;

    // Extract package name (handle scoped packages)
    const packageName = pkg.startsWith("@")
      ? pkg.split("/").slice(0, 2).join("/")
      : pkg.split("/")[0];

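    // Report each package once, even if several subpaths of it are imported;
    // duplicates would otherwise inflate the dependency weight in the score.
    if (
      !profile.knownPackages.has(packageName) &&
      !unknown.includes(packageName)
    ) {
      unknown.push(packageName);
    }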
  });

  return unknown;
}
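To make the package-name handling concrete, here is a small standalone sketch of the same normalization plus one way the known-package set might be derived from historical imports. `toPackageName`, `buildKnownPackages`, and the `minUses` cutoff are hypothetical; the article does not say how team-known-deps.json is actually generated.

```typescript
// Reduce an import specifier to its package name, or null for local imports.
function toPackageName(specifier: string): string | null {
  // Relative paths and the "@/" alias point inside the repo, not at packages.
  if (specifier.startsWith(".") || specifier.startsWith("@/")) return null;
  const parts = specifier.split("/");
  // Scoped packages keep two segments: "@scope/name".
  return specifier.startsWith("@")
    ? parts.slice(0, 2).join("/")
    : parts[0] ?? specifier;
}

// Assumed rule: a package counts as "known" once it has been imported
// at least minUses times in the historical data.
function buildKnownPackages(
  historicalSpecifiers: string[],
  minUses = 1,
): Set<string> {
  const counts = new Map<string, number>();
  for (const specifier of historicalSpecifiers) {
    const name = toPackageName(specifier);
    if (name === null) continue;
    counts.set(name, (counts.get(name) ?? 0) + 1);
  }
  const known = new Set<string>();
  counts.forEach((count, name) => {
    if (count >= minUses) known.add(name);
  });
  return known;
}

toPackageName("@aws-sdk/client-s3/dist-types"); // "@aws-sdk/client-s3"
toPackageName("lodash/fp"); // "lodash"
toPackageName("./helpers"); // null
```

Raising `minUses` would make the check stricter: a library used once, possibly by an agent, would still count as unknown.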

The Scoring System

Every file gets a cognitive debt score from 0 to 100. The formula weighs the signals differently based on what we found correlates most strongly with "time to understand during an incident":

function calculateCognitiveDebtScore(
  signals: CognitiveDebtSignal["signals"],
): number {
  let score = 0;

  // Agent authorship is the base multiplier
  const authorMultiplier = signals.authorIsAgent ? 1.5 : 1.0;

  // Review depth is the biggest single factor
  const reviewWeight =
    signals.reviewDepth === "none"
      ? 30
      : signals.reviewDepth === "shallow"
        ? 15
        : 0;

  // Complexity above threshold
  const complexityWeight = Math.min(
    25,
    Math.max(0, (signals.cyclomaticComplexity - 5) * 3),
  );

  // Convention divergence
  const conventionWeight = Math.min(25, signals.conventionDivergence * 2);

  // Unknown dependencies
  const depWeight = Math.min(20, signals.unknownDependencies.length * 7);

  score = (reviewWeight + complexityWeight + conventionWeight + depWeight) *
    authorMultiplier;

  return Math.min(100, Math.round(score));
}
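A worked example helps show how the weights combine. The sketch below recomputes the same formula but returns each component; `scoreBreakdown` is a hypothetical helper, and the sample signal values are made up for illustration.

```typescript
interface Signals {
  authorIsAgent: boolean;
  reviewDepth: "none" | "shallow" | "deep";
  cyclomaticComplexity: number;
  conventionDivergence: number;
  unknownDependencies: string[];
}

// Same arithmetic as the scoring function above, with the parts exposed.
function scoreBreakdown(s: Signals) {
  const review =
    s.reviewDepth === "none" ? 30 : s.reviewDepth === "shallow" ? 15 : 0;
  const complexity = Math.min(
    25,
    Math.max(0, (s.cyclomaticComplexity - 5) * 3),
  );
  const convention = Math.min(25, s.conventionDivergence * 2);
  const deps = Math.min(20, s.unknownDependencies.length * 7);
  const multiplier = s.authorIsAgent ? 1.5 : 1.0;
  const total = Math.min(
    100,
    Math.round((review + complexity + convention + deps) * multiplier),
  );
  return { review, complexity, convention, deps, multiplier, total };
}

const sample: Signals = {
  authorIsAgent: true,
  reviewDepth: "shallow",
  cyclomaticComplexity: 9,
  conventionDivergence: 4,
  unknownDependencies: ["rxjs"],
};

// review 15 + complexity 12 + convention 8 + deps 7 = 42; 42 * 1.5 = 63
const agentResult = scoreBreakdown(sample);
// Identical signals with a human author stay at 42.
const humanResult = scoreBreakdown({ ...sample, authorIsAgent: false });
```

The uncapped components already sum to 100 (30 + 25 + 25 + 20), so with the 1.5 multiplier an agent-authored file reaches the cap of 100 once its raw sum hits 67.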

We aggregate at three levels. File scores are the raw output. Module scores are the weighted average of files in a directory (weighted by file size and import centrality - a high-debt utility file that everything imports is worse than a high-debt leaf file). Team scores aggregate across the modules a team owns, giving leadership a dashboard view.
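The module-level roll-up could look something like the following. The article does not give the exact weighting, so the weight used here (`lineCount * (1 + importedByCount)`) is purely an assumption, as are the `FileScore` shape and the `moduleScore` name.

```typescript
// Hypothetical per-file input for the roll-up.
interface FileScore {
  file: string;
  score: number; // cognitive debt score, 0 to 100
  lineCount: number; // proxy for file size
  importedByCount: number; // proxy for import centrality
}

// Weighted average where larger and more widely imported files count more.
// The weight formula is an assumption made for this sketch.
function moduleScore(files: FileScore[]): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const f of files) {
    const weight = f.lineCount * (1 + f.importedByCount);
    weightedSum += f.score * weight;
    totalWeight += weight;
  }
  return totalWeight === 0 ? 0 : weightedSum / totalWeight;
}

// Made-up data: a small but central utility versus a large leaf file.
const paymentsModule: FileScore[] = [
  { file: "util.ts", score: 80, lineCount: 100, importedByCount: 9 },
  { file: "leaf.ts", score: 20, lineCount: 500, importedByCount: 0 },
];

// Weights are 1000 and 500: (80 * 1000 + 20 * 500) / 1500 = 60
const paymentsScore = moduleScore(paymentsModule);
```

A plain average of the two files would be 50; the centrality weight pulls the module score up to 60 because the high-debt file is the one everything imports.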

Integration: Flagging in Code Review

The scanner runs in CI and posts comments on PRs. When a file's cognitive debt score exceeds our threshold (we started at 60, tuned down to 45 after a month), the PR gets a label and a comment explaining which signals fired:

interface ReviewFlag {
  file: string;
  score: number;
  recommendation: string;
  requiredActions: string[];
}

function generateReviewFlags(
  results: CognitiveDebtSignal[],
  threshold: number,
): ReviewFlag[] {
  return results
    .filter((r) => r.score >= threshold)
    .map((r) => ({
      file: r.file,
      score: r.score,
      recommendation: buildRecommendation(r),
      requiredActions: [
        r.signals.reviewDepth === "none"
          ? "Requires line-by-line review from domain expert"
          : null,
        r.signals.unknownDependencies.length > 0
          ? `Team has no experience with: ${r.signals.unknownDependencies.join(", ")}`
          : null,
        r.signals.cyclomaticComplexity > 10
          ? "Consider requesting the author explain the algorithm in PR description"
          : null,
      ].filter(Boolean) as string[],
    }));
}

function buildRecommendation(signal: CognitiveDebtSignal): string {
  if (signal.score > 75) {
    return "HIGH COGNITIVE DEBT: This code needs a dedicated review session. " +
      "Schedule 30 min with the team to walk through it before merging.";
  }
  if (signal.score > 45) {
    return "MODERATE COGNITIVE DEBT: Ensure at least one reviewer can explain " +
      "this code without reading it. Add inline comments for non-obvious logic.";
  }
  return "LOW: Standard review process sufficient.";
}

The key insight here: we're not blocking merges. We're making the invisible visible. The PR author (human or agent) still ships the code. But now there's a record that says "nobody on this team fully reviewed this" and a nudge to fix that before it becomes a 2am problem.
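For the comment itself, a flag list like the one produced above could be rendered to markdown along these lines. This is only a sketch: `renderFlagComment` and the heading text are hypothetical, and posting the result to the PR is left out.

```typescript
interface ReviewFlag {
  file: string;
  score: number;
  recommendation: string;
  requiredActions: string[];
}

// Turn review flags into a markdown comment body; empty string if no flags.
function renderFlagComment(flags: ReviewFlag[]): string {
  if (flags.length === 0) return "";
  const lines: string[] = ["### Cognitive debt report", ""];
  for (const flag of flags) {
    lines.push(`**${flag.file}** (score ${flag.score})`);
    lines.push(flag.recommendation);
    for (const action of flag.requiredActions) {
      lines.push(`- ${action}`);
    }
    lines.push(""); // blank line between files
  }
  return lines.join("\n").trimEnd();
}

// Made-up flag for illustration.
const exampleComment = renderFlagComment([
  {
    file: "src/reconcile.ts",
    score: 63,
    recommendation: "MODERATE COGNITIVE DEBT: ...",
    requiredActions: ["Team has no experience with: rxjs"],
  },
]);
```

Returning an empty string for an empty list lets the CI step skip posting entirely when nothing crossed the threshold.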

Real Numbers

I ran the scanner against our main monorepo. 847 files, about 180k lines of TypeScript, six months of history with heavy agent usage starting around month three.

Here's what fell out:

| Metric | Value |
|--------|-------|
| Total files scanned | 847 |
| Files with agent-authored code | 312 (37%) |
| Agent-authored files with deep review | 89 (29%) |
| Agent-authored files with shallow/no review | 223 (71%) |
| Files scoring above 45 (moderate debt) | 156 (18%) |
| Files scoring above 75 (high debt) | 41 (5%) |
| Average cognitive debt score (agent code) | 38 |
| Average cognitive debt score (human code) | 12 |
| Unknown dependencies introduced by agents | 23 packages |
| Unknown deps reviewed before merge | 4 (17%) |

The 71% shallow-or-no-review number for agent code was the one that got leadership's attention. We were shipping code faster than ever, but building a codebase that was increasingly opaque to the people responsible for keeping it running.

The scariest cluster was in our event processing pipeline. Fourteen files, all agent-generated in the same sprint, average cognitive debt score of 67. Complex state machines with custom retry logic. Tests all green. Not a single review comment on any of the PRs. That's not a codebase - that's a black box we happen to deploy.

What Changed After

We didn't slow down agent usage. That would be stupid. Instead we changed three things.

First, the scanner runs on every PR now. High-score files get routed to specific reviewers based on module ownership. Not "anyone on the team" - the person who knows that area best.

Second, we added a weekly "cognitive debt reduction" session. Thirty minutes, one high-scoring module, the team walks through it together. It's like a book club but for code nobody read the first time. After the session, someone adds explanatory comments and the debt score drops.

Third - and this was the agents team's idea - we started requiring agents to produce an "explanation document" alongside any PR with complexity above a threshold. Not comments in the code. A separate markdown file explaining the why behind architectural decisions. Think of it as forced documentation of intent.
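The ownership-based routing from the first change can be approximated with a longest-prefix match over an ownership map. The map format, the names in it, and `pickReviewer` are all hypothetical; a real setup might read this from a CODEOWNERS-style file instead.

```typescript
// Pick the owner whose path prefix is the most specific match for the file.
function pickReviewer(
  file: string,
  owners: Record<string, string>,
): string | undefined {
  let best: string | undefined;
  for (const prefix of Object.keys(owners)) {
    const moreSpecific = best === undefined || prefix.length > best.length;
    if (file.startsWith(prefix) && moreSpecific) {
      best = prefix;
    }
  }
  return best === undefined ? undefined : owners[best];
}

// Made-up ownership map: the payments owner overrides the general one.
const moduleOwners: Record<string, string> = {
  "services/": "alice",
  "services/payments/": "bob",
};

pickReviewer("services/payments/reconcile.ts", moduleOwners); // "bob"
pickReviewer("services/auth/login.ts", moduleOwners); // "alice"
pickReviewer("docs/readme.md", moduleOwners); // undefined
```

Returning `undefined` for unowned paths makes the gap explicit, so the caller can fall back to the normal reviewer pool rather than silently picking someone.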

Our average cognitive debt score across the repo dropped from 24 to 16 in eight weeks. More importantly, our mean time to resolve for incidents in agent-generated code dropped from 3.2 hours to 1.4 hours. That's the number that matters.

Beheeyem's Lesson

There's a Pokémon called Beheeyem. Psychic type, looks like a grey alien. Its signature ability is manipulating memories - it can rewrite what you think you know, make you forget things, implant false understanding. I think about it every time I look at a clean, well-typed, fully-tested file that nobody on my team can explain.

The code looks understood. It has good variable names. It has types. It passes every automated check we have. But understanding isn't in the code. Understanding is in people's heads, and if it was never there to begin with, no amount of clean architecture will save you at 2am.

The gap between "code that works" and "code the team understands" is the defining challenge of the agent-assisted era. We've spent decades building tools to measure code quality. We need to start measuring code comprehension with the same rigor.

That's what the scanner does. It doesn't judge code quality. It measures the gap between what exists and what's understood. And it turns out that gap is a lot wider than any of us thought.