Feedback Sensors for Coding Agents: A TypeScript Harness That Catches What Code Review Misses

Last month, one of our coding agents opened a PR that refactored a pricing calculation module. It passed TypeScript strict mode. It passed ESLint. It passed 247 unit tests. Two engineers approved it after a quick review. It shipped on a Tuesday afternoon.

By Wednesday morning, we'd undercharged 1,200 enterprise customers by roughly 15% each.

The bug was elegant in its subtlety. The agent had reordered two discount application steps, applying the loyalty discount before the volume discount instead of after. Both discounts still applied. The math still produced a positive number. The types were all correct. But the business semantics were wrong, and the difference compounded in a way that only showed up at scale.

That's the moment I started building what I now call feedback sensors.

Porygon-Z and the Dubious Upgrade Problem

There's a Pokemon called Porygon-Z. It was created by upgrading Porygon2 with a "Dubious Disc" — a sketchy, unofficial modification that made it more powerful but also unstable and erratic. It's the perfect mascot for agent-modified code. The code technically works. It might even be faster or cleaner. But something's off, and you can't tell just by looking at it.

Our agents are producing Porygon-Z code. Powerful, type-safe, well-structured — and occasionally semantically wrong in ways that are invisible to static analysis.

What I Tried First (And Why It Failed)

My first instinct was more tests. I asked the team to increase coverage thresholds from 80% to 95% on any file touched by an agent. This was a mistake for two reasons.

First, the agents are excellent at writing tests that pass. They'll generate a test suite that achieves 98% line coverage while testing almost nothing meaningful. The tests assert what the code does, not what it should do. Tautological testing.

Second, coverage doesn't measure semantic correctness. You can cover every branch of a pricing function without ever asserting that volume discounts compound differently than loyalty discounts.
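To make the difference concrete, here's a minimal sketch. The `applyDiscounts` function and both checks are hypothetical stand-ins, not our real pricing code. The first check recomputes its expected value with the same expression as the implementation, so it passes whatever the business rule is supposed to be. The second pins a number worked out by hand from the spec. Both give the function full line coverage.

```typescript
// Hypothetical implementation with a semantic bug: the discounts are summed
// instead of compounded (assumed spec: volume first, then loyalty on the result).
function applyDiscounts(base: number, volume: number, loyalty: number): number {
  return base * (1 - volume - loyalty);
}

// Tautological check: the "expected" value is the implementation's own formula.
function tautologicalCheck(): boolean {
  const expected = 1000 * (1 - 0.1 - 0.2);
  return applyDiscounts(1000, 0.1, 0.2) === expected;
}

// Specification check: 1000 * 0.9 = 900, then 900 * 0.8 = 720, worked by hand.
function specificationCheck(): boolean {
  return Math.abs(applyDiscounts(1000, 0.1, 0.2) - 720) < 0.01;
}

console.log(tautologicalCheck()); // true: the bug is invisible
console.log(specificationCheck()); // false: the bug is caught
```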

I also tried adding more specific assertions to our existing test suites. This helped marginally but didn't scale — we couldn't anticipate every semantic invariant an agent might violate.

The Sensor Architecture

What worked was building a layer that sits between the agent's output and the merge. I think of each check as a "sensor" because it's measuring a specific dimension of code quality that traditional CI ignores.

Here's the core orchestrator:

// src/sensors/orchestrator.ts
import type { Sensor, SensorResult, SensorConfig, AgentPR } from "./types";

interface SensorReport {
  passed: boolean;
  sensors: SensorResult[];
  confidence: number;
  feedbackPrompt: string | null;
}

export class SensorOrchestrator {
  private sensors: Map<string, Sensor>;
  readonly confidenceThreshold: number; // public: the feedback loop reads it for its prompt

  constructor(config: SensorConfig) {
    this.sensors = new Map();
    this.confidenceThreshold = config.confidenceThreshold ?? 0.85;
    this.registerDefaults(config);
  }

  async evaluate(pr: AgentPR): Promise<SensorReport> {
    const changedFiles = await pr.getChangedFiles();
    const results: SensorResult[] = [];

    for (const [name, sensor] of this.sensors) {
      if (!sensor.appliesTo(changedFiles)) continue;

      const result = await sensor.run(changedFiles, pr.baseBranch);
      results.push({ sensor: name, ...result });
    }

    const confidence = this.computeConfidence(results);
    const passed = confidence >= this.confidenceThreshold;

    return {
      passed,
      sensors: results,
      confidence,
      feedbackPrompt: passed ? null : this.buildFeedbackPrompt(results),
    };
  }

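  private computeConfidence(results: SensorResult[]): number {
    const weights = results.map((r) => r.weight ?? 1);
    const totalWeight = weights.reduce((sum, w) => sum + w, 0);
    // If no sensor applied there is nothing to object to. Without this guard
    // an empty result set scores 0 and blocks the PR with an empty prompt.
    if (totalWeight === 0) return 1;
    return results.reduce(
      (acc, r, i) => acc + r.score * (weights[i] / totalWeight),
      0,
    );
  }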

  private buildFeedbackPrompt(results: SensorResult[]): string {
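    // Include every sensor that produced diagnostics. Filtering on score < 0.7
    // alone can yield an empty prompt when several mild dips pull the weighted
    // confidence below the threshold.
    const failures = results.filter(
      (r) => r.score < 0.7 || r.diagnostics.length > 0,
    );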
    return failures
      .map(
        (f) =>
          `[${f.sensor}] Score: ${f.score.toFixed(2)}. Issues:\n${f.diagnostics.join("\n")}`,
      )
      .join("\n\n");
  }
}

The key insight is the feedbackPrompt — when sensors flag issues, we don't just block the PR. We send structured feedback back to the agent and let it try again. More on that loop later.
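The orchestrator imports its types from `./types`, which isn't shown here. Below is a minimal sketch of what that file might contain, inferred from how the fields are used above (the exact shapes are assumptions), plus a standalone copy of the weighted average so the arithmetic is easy to check by hand.

```typescript
// Hypothetical contents of src/sensors/types.ts, inferred from usage.
interface SensorRunResult {
  score: number; // 0 (bad) to 1 (clean)
  weight?: number; // treated as 1 when omitted
  diagnostics: string[];
}

interface SensorResult extends SensorRunResult {
  sensor: string; // name the sensor was registered under
}

interface Sensor {
  appliesTo(files: string[]): boolean;
  run(changedFiles: string[], baseBranch: string): Promise<SensorRunResult>;
}

// Standalone copy of the weighted average used by computeConfidence.
function weightedConfidence(results: SensorResult[]): number {
  const total = results.reduce((sum, r) => sum + (r.weight ?? 1), 0);
  return results.reduce(
    (acc, r) => acc + (r.score * (r.weight ?? 1)) / total,
    0,
  );
}

// Weights match the four sensors in this post; the scores are made up.
const example: SensorResult[] = [
  { sensor: "mutation", score: 0.6, weight: 2.5, diagnostics: [] },
  { sensor: "property", score: 1, weight: 3.0, diagnostics: [] },
  { sensor: "complexity", score: 1, weight: 1.0, diagnostics: [] },
  { sensor: "impact", score: 1, weight: 1.5, diagnostics: [] },
];

// (0.6 * 2.5 + 1 * 3.0 + 1 * 1.0 + 1 * 1.5) / 8 = 7 / 8 = 0.875
console.log(weightedConfidence(example).toFixed(3));
```

One thing the arithmetic makes visible: with these weights, a 40% mutation survival rate on its own still clears a 0.85 threshold when every other sensor is clean, so the weights are worth tuning against your own data.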

Sensor 1: Mutation Testing with Stryker

Mutation testing is the closest thing we have to measuring whether tests actually verify behavior. Stryker (v8.2) mutates your source code — flipping operators, removing conditions, changing return values — and checks whether your tests catch the mutations.

If an agent writes code where 40% of mutations survive, that's a strong signal that the tests are cosmetic.
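Here's a tiny, self-contained illustration of a surviving mutant. The `qualifiesForVolumeDiscount` rule, its threshold of 10, and the hand-written mutant are all hypothetical; Stryker generates mutants like this automatically. A suite "kills" a mutant when it passes on the original code and fails on the mutated version.

```typescript
type Rule = (quantity: number) => boolean;

// Original: orders of 10 or more qualify.
const qualifiesForVolumeDiscount: Rule = (quantity) => quantity >= 10;

// A typical operator mutation: >= becomes >.
const mutant: Rule = (quantity) => quantity > 10;

// A suite is a list of checks, each returning true when it passes.
type Suite = Array<(rule: Rule) => boolean>;

const cosmeticSuite: Suite = [
  (rule) => rule(50) === true,
  (rule) => rule(1) === false,
];

const boundarySuite: Suite = [
  ...cosmeticSuite,
  (rule) => rule(10) === true, // pins the boundary itself
];

const passes = (suite: Suite, rule: Rule): boolean =>
  suite.every((check) => check(rule));

// Killed means: the suite passes on the original but not on the mutant.
function kills(suite: Suite): boolean {
  return passes(suite, qualifiesForVolumeDiscount) && !passes(suite, mutant);
}

console.log(kills(cosmeticSuite)); // false: the mutant survives
console.log(kills(boundarySuite)); // true: the boundary check kills it
```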

// src/sensors/mutation-sensor.ts
import { execFile } from "node:child_process";
import { readFile } from "node:fs/promises";
import { promisify } from "node:util";
import type { Sensor, SensorRunResult } from "./types";

const exec = promisify(execFile);

// Default output path of Stryker's json reporter (jsonReporter.fileName).
const REPORT_PATH = "reports/mutation/mutation.json";

export class MutationSensor implements Sensor {
  private readonly survivalThreshold = 0.25;

  appliesTo(files: string[]): boolean {
    return files.some((f) => f.endsWith(".ts") && !f.includes(".test."));
  }

  async run(changedFiles: string[]): Promise<SensorRunResult> {
    const sourceFiles = changedFiles.filter(
      (f) => f.endsWith(".ts") && !f.includes(".test.") && !f.includes(".d."),
    );

    // The json reporter writes its report to disk, not to stdout.
    await exec("npx", [
      "stryker",
      "run",
      "--mutate",
      sourceFiles.join(","),
      "--reporters",
      "json",
      "--concurrency",
      "4",
    ]);

    const report = JSON.parse(await readFile(REPORT_PATH, "utf8"));

    // In the mutation-testing report schema, `files` is an object keyed by
    // file name, and each mutant carries its own `status`.
    const mutants = Object.entries<any>(report.files).flatMap(
      ([filename, file]) =>
        file.mutants.map((m: any) => ({ filename, ...m })),
    );
    const survivors = mutants.filter((m) => m.status === "Survived");
    const survivalRate =
      mutants.length > 0 ? survivors.length / mutants.length : 0;

    const diagnostics: string[] = [];
    if (survivalRate > this.survivalThreshold) {
      const survived = survivors.map(
        (m) =>
          `  ${m.filename}:${m.location.start.line} - ${m.mutatorName}: "${m.replacement}"`,
      );
      diagnostics.push(
        `Mutation survival rate: ${(survivalRate * 100).toFixed(1)}% (threshold: ${this.survivalThreshold * 100}%)`,
        `Surviving mutations suggest these behaviors are untested:`,
        ...survived.slice(0, 15),
      );
    }

    return {
      score: 1 - survivalRate,
      weight: 2.5,
      diagnostics,
    };
  }
}

We set the survival threshold at 25%. In practice, well-tested code from human engineers lands around 15-20% survival. Agent code without this sensor? Regularly hits 45-60%.

Sensor 2: Property-Based Testing with fast-check

This is where things get interesting. Instead of just running existing tests, this sensor generates new property-based tests for the changed code and runs them on the spot.

// src/sensors/property-sensor.ts
import * as fc from "fast-check";
import { Project, SyntaxKind } from "ts-morph";
import type { Sensor, SensorRunResult } from "./types";

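// Exported because the co-located .properties.ts files import it. Each spec
// generates one input value (typically an fc.record) for its predicate.
export interface PropertySpec<T = any> {
  name: string;
  arbitrary: fc.Arbitrary<T>;
  predicate: (input: T) => boolean | Promise<boolean>;
}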

export class PropertySensor implements Sensor {
  private readonly numRuns = 1000;
  private readonly specDir: string;

  constructor(specDir: string) {
    this.specDir = specDir;
  }

  appliesTo(files: string[]): boolean {
    return files.some((f) => f.endsWith(".ts"));
  }

  async run(changedFiles: string[]): Promise<SensorRunResult> {
    const specs = await this.loadSpecs(changedFiles);
    const diagnostics: string[] = [];
    let failures = 0;

    for (const spec of specs) {
      // fc.check returns the run details instead of throwing, so the
      // counterexample is available as data rather than buried in an error message.
      const details = await fc.check(
        fc.asyncProperty(spec.arbitrary, async (input) => spec.predicate(input)),
        { numRuns: this.numRuns },
      );
      if (details.failed) {
        failures++;
        diagnostics.push(
          `Property "${spec.name}" violated with input: ${JSON.stringify(details.counterexample)}`,
        );
      }
    }

    const score = specs.length > 0 ? 1 - failures / specs.length : 1;
    return { score, weight: 3.0, diagnostics };
  }

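  private async loadSpecs(changedFiles: string[]): Promise<PropertySpec[]> {
    // Loads property specs from co-located .properties.ts files.
    // Assumes changedFiles holds paths that import() can resolve as-is (for
    // example absolute paths). The central spec directory lookup is omitted here.
    const specs: PropertySpec[] = [];
    for (const file of changedFiles) {
      // Skip non-TS files and the spec files themselves.
      if (!file.endsWith(".ts") || file.endsWith(".properties.ts")) continue;
      // Anchor on the extension so a ".ts" earlier in the path is left alone.
      const propFile = file.replace(/\.ts$/, ".properties.ts");
      try {
        const mod = await import(propFile);
        specs.push(...(mod.properties as PropertySpec[]));
      } catch {
        // No property spec for this file; that's fine. Note that this also
        // hides spec files that exist but fail to load.
      }
    }
    return specs;
  }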
}

The property specs live alongside the source code. Here's what one looks like for that pricing module that burned us:

// src/pricing/calculate-total.properties.ts
import * as fc from "fast-check";
import { calculateTotal } from "./calculate-total";
import type { PropertySpec } from "@/sensors/property-sensor";

export const properties: PropertySpec[] = [
  {
    name: "discounts never increase total beyond base price",
    arbitrary: fc.record({
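      // fc.double rather than fc.float: fast-check 3.x requires fc.float
      // bounds to be exact 32-bit floats, which 0.01 and 0.3 are not.
      basePrice: fc.double({ min: 0.01, max: 100000, noNaN: true }),
      volumeDiscount: fc.double({ min: 0, max: 0.5, noNaN: true }),
      loyaltyDiscount: fc.double({ min: 0, max: 0.3, noNaN: true }),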
    }),
    predicate: ({ basePrice, volumeDiscount, loyaltyDiscount }) => {
      const total = calculateTotal(basePrice, volumeDiscount, loyaltyDiscount);
      return total <= basePrice && total >= 0;
    },
  },
  {
    name: "volume discount applied to base, loyalty applied to result",
    arbitrary: fc.record({
      basePrice: fc.constant(1000),
      volumeDiscount: fc.constant(0.1),
      loyaltyDiscount: fc.constant(0.2),
    }),
    predicate: ({ basePrice, volumeDiscount, loyaltyDiscount }) => {
      const total = calculateTotal(basePrice, volumeDiscount, loyaltyDiscount);
      // volume first: 1000 * 0.9 = 900, then loyalty: 900 * 0.8 = 720
      return Math.abs(total - 720) < 0.01;
    },
  },
];

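That second property would have caught our pricing bug immediately: the agent reordered the operations, and this invariant encodes the correct order explicitly. One caveat if you borrow the pattern: two plain percentage multipliers commute (1000 × 0.9 × 0.8 and 1000 × 0.8 × 0.9 are both 720), so a check like this only distinguishes the two orders when some step is not a pure multiplication, such as per-step rounding, a cap, or a flat credit.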
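A small sketch of when step order actually changes a total. The flat loyalty credit is a hypothetical rule used for illustration, not something taken from our pricing module, and amounts are in integer cents to keep floating-point noise out of the comparison.

```typescript
type Step = (cents: number) => number;

const percentOff = (rate: number): Step => (cents) =>
  Math.round(cents * (1 - rate));
const flatCredit = (credit: number): Step => (cents) =>
  Math.max(0, cents - credit);

// Applies the steps left to right.
const applyInOrder = (base: number, steps: Step[]): number =>
  steps.reduce((total, step) => step(total), base);

const base = 100000; // $1,000.00 in cents
const volume = percentOff(0.1);
const loyaltyPercent = percentOff(0.2);
const loyaltyCredit = flatCredit(5000); // hypothetical $50.00 credit

// Two percentage steps: either order gives the same total.
console.log(applyInOrder(base, [volume, loyaltyPercent])); // 72000
console.log(applyInOrder(base, [loyaltyPercent, volume])); // 72000

// Percentage plus flat credit: the order changes the total.
console.log(applyInOrder(base, [volume, loyaltyCredit])); // 85000
console.log(applyInOrder(base, [loyaltyCredit, volume])); // 85500
```

For an ordering invariant to have teeth, the generated inputs need to exercise whichever rule fails to commute.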

Sensor 3: AST Complexity Analysis

This one's more of a heuristic, but it's caught real issues. When an agent refactors a function and the cyclomatic complexity jumps by more than 40%, or the cognitive complexity (the SonarSource metric that SonarQube reports) doubles, something's usually wrong. The simplified version below applies a single 40% ratio to a rough cognitive-complexity count.

// src/sensors/complexity-sensor.ts
import { Project, SourceFile, FunctionDeclaration, SyntaxKind } from "ts-morph";
import type { Sensor, SensorRunResult } from "./types";

interface ComplexityDelta {
  function: string;
  file: string;
  before: number;
  after: number;
  ratio: number;
}

export class ComplexitySensor implements Sensor {
  private readonly maxComplexityRatio = 1.4;

  appliesTo(files: string[]): boolean {
    return files.some((f) => f.endsWith(".ts") && !f.includes(".test."));
  }

  async run(
    changedFiles: string[],
    baseBranch: string,
  ): Promise<SensorRunResult> {
    const project = new Project({ tsConfigFilePath: "tsconfig.json" });
    const diagnostics: string[] = [];
    const deltas: ComplexityDelta[] = [];

    for (const filePath of changedFiles) {
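      // Skip files the project doesn't know about (deleted, non-TS, or excluded)
      // instead of throwing and failing the whole sensor.
      const currentSource = project.getSourceFile(filePath);
      if (!currentSource) continue;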
      const baseSource = await this.getBaseVersion(filePath, baseBranch);
      if (!baseSource) continue;

      const currentFns = this.extractComplexities(currentSource);
      const baseFns = this.extractComplexities(baseSource);

      for (const [name, complexity] of currentFns) {
        const baseComplexity = baseFns.get(name);
        if (!baseComplexity) continue;

        const ratio = complexity / baseComplexity;
        if (ratio > this.maxComplexityRatio) {
          deltas.push({
            function: name,
            file: filePath,
            before: baseComplexity,
            after: complexity,
            ratio,
          });
        }
      }
    }

    if (deltas.length > 0) {
      diagnostics.push(
        `Complexity increased significantly in ${deltas.length} function(s):`,
        ...deltas.map(
          (d) =>
            `  ${d.file}#${d.function}: ${d.before} → ${d.after} (+${((d.ratio - 1) * 100).toFixed(0)}%)`,
        ),
      );
    }

    const score = deltas.length === 0 ? 1 : Math.max(0, 1 - deltas.length * 0.2);
    return { score, weight: 1.0, diagnostics };
  }

  private extractComplexities(source: SourceFile): Map<string, number> {
    const map = new Map<string, number>();
    source.getFunctions().forEach((fn) => {
      const name = fn.getName() ?? "anonymous";
      map.set(name, this.calculateCognitiveComplexity(fn));
    });
    return map;
  }

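  private calculateCognitiveComplexity(fn: FunctionDeclaration): number {
    const incrementors = [
      SyntaxKind.IfStatement,
      SyntaxKind.ForStatement,
      SyntaxKind.ForInStatement,
      SyntaxKind.ForOfStatement,
      SyntaxKind.WhileStatement,
      SyntaxKind.CatchClause,
    ];
    let complexity = 0;

    fn.forEachDescendant((node) => {
      if (!incrementors.includes(node.getKind())) return;
      // Nesting depth is the number of enclosing control structures between
      // this node and the function. A running counter would keep growing
      // across sibling statements and overstate flat code.
      let nesting = 0;
      for (let p = node.getParent(); p && p !== fn; p = p.getParent()) {
        if (incrementors.includes(p.getKind())) nesting++;
      }
      complexity += 1 + nesting;
    });

    return complexity;
  }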

  private async getBaseVersion(
    filePath: string,
    baseBranch: string,
  ): Promise<SourceFile | null> {
    // Shells out to git show to get the base branch version
    const { execFile } = await import("node:child_process");
    const { promisify } = await import("node:util");
    const exec = promisify(execFile);

    try {
      const { stdout } = await exec("git", [
        "show",
        `${baseBranch}:${filePath}`,
      ]);
      const project = new Project({ useInMemoryFileSystem: true });
      return project.createSourceFile(filePath, stdout);
    } catch {
      return null;
    }
  }
}

I'll admit the cognitive complexity calculation here is simplified — the real implementation handles arrow functions, ternaries, and logical operator sequences. But the pattern holds.
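The depth-weighted idea is easiest to see on a toy tree. This sketch uses a hypothetical `ToyNode` shape rather than ts-morph nodes, and counts ternaries alongside the statement kinds above. It illustrates the scoring rule only and is not the real implementation.

```typescript
type Kind = "if" | "for" | "while" | "catch" | "ternary" | "other";

interface ToyNode {
  kind: Kind;
  children?: ToyNode[];
}

const CONTROL: ReadonlySet<Kind> = new Set<Kind>([
  "if",
  "for",
  "while",
  "catch",
  "ternary",
]);

// Each control structure costs 1 plus its nesting depth. The depth only
// increases for that structure's own children, not for its siblings.
function cognitiveComplexity(node: ToyNode, depth = 0): number {
  const isControl = CONTROL.has(node.kind);
  const own = isControl ? 1 + depth : 0;
  const childDepth = isControl ? depth + 1 : depth;
  return (node.children ?? []).reduce(
    (sum, child) => sum + cognitiveComplexity(child, childDepth),
    own,
  );
}

const leaf = (kind: Kind): ToyNode => ({ kind });

// Three sibling ifs: 1 + 1 + 1 = 3
const flat: ToyNode = {
  kind: "other",
  children: [leaf("if"), leaf("if"), leaf("if")],
};

// if > for > if: 1 + 2 + 3 = 6
const nested: ToyNode = {
  kind: "other",
  children: [
    { kind: "if", children: [{ kind: "for", children: [leaf("if")] }] },
  ],
};

console.log(cognitiveComplexity(flat)); // 3
console.log(cognitiveComplexity(nested)); // 6
```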

Sensor 4: Dependency Graph Impact Analysis

This is the sensor I'm least confident about, honestly. It traces the import graph from changed files outward and flags when a change touches a module with high fan-out, meaning many other modules import it (coupling metrics usually call this fan-in). The idea is: if you change a utility that's imported by 30 other modules, the blast radius demands higher scrutiny.

// src/sensors/impact-sensor.ts
import { Project, SourceFile } from "ts-morph";
import type { Sensor, SensorRunResult } from "./types";

export class ImpactSensor implements Sensor {
  private readonly highImpactThreshold = 15;

  appliesTo(files: string[]): boolean {
    return files.length > 0;
  }

  async run(changedFiles: string[]): Promise<SensorRunResult> {
    const project = new Project({ tsConfigFilePath: "tsconfig.json" });
    const diagnostics: string[] = [];
    let maxFanOut = 0;

    for (const filePath of changedFiles) {
      const source = project.getSourceFile(filePath);
      if (!source) continue;

      const dependents = this.findDependents(source, project);
      if (dependents.length > this.highImpactThreshold) {
        maxFanOut = Math.max(maxFanOut, dependents.length);
        diagnostics.push(
          `${filePath} is imported by ${dependents.length} modules. Changes here have high blast radius.`,
          `  Top dependents: ${dependents.slice(0, 5).join(", ")}`,
        );
      }
    }

    const score =
      maxFanOut <= this.highImpactThreshold
        ? 1
        : Math.max(0.3, 1 - (maxFanOut - this.highImpactThreshold) * 0.03);

    return { score, weight: 1.5, diagnostics };
  }

  private findDependents(source: SourceFile, project: Project): string[] {
    const filePath = source.getFilePath();
    const allFiles = project.getSourceFiles();
    return allFiles
      .filter((f) =>
        f.getImportDeclarations().some((imp) => {
          const resolved = imp.getModuleSpecifierSourceFile();
          return resolved?.getFilePath() === filePath;
        }),
      )
      .map((f) => f.getFilePath());
  }
}

I'd love to hear how others handle this. The threshold of 15 is arbitrary — I picked it based on our codebase's distribution of fan-out. In a monorepo with shared utilities, you'd probably want a different number, or maybe a percentile-based threshold instead.
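A percentile-based threshold could look something like this sketch. The nearest-rank method and the choice of the 95th percentile are assumptions on my part, and `dependentCounts` is made-up data standing in for the number of importers per module.

```typescript
// Nearest-rank percentile: the smallest value with at least p% of the data
// at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Hypothetical dependent counts for 20 modules.
const dependentCounts = [
  0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 7, 9, 12, 18, 31, 64,
];

const threshold = percentile(dependentCounts, 95);
const highImpact = dependentCounts.filter((n) => n > threshold);

console.log(threshold); // 31
console.log(highImpact); // [ 64 ]
```

The appeal is that the cutoff moves with the codebase: a monorepo full of shared utilities gets a higher bar without anyone hand-tuning a constant.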

The Feedback Loop

The sensors aren't just a gate. When the confidence score drops below threshold, the orchestrator generates a structured feedback prompt and sends it back to the agent. This is where the real value kicks in.

// src/feedback/loop.ts
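import { SensorOrchestrator } from "../sensors/orchestrator";
import type { Agent, AgentPR } from "../types";

// Derived from evaluate() so this file works whether or not the orchestrator
// module exports its SensorReport interface.
type SensorReport = Awaited<ReturnType<SensorOrchestrator["evaluate"]>>;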

export async function runFeedbackLoop(
  agent: Agent,
  pr: AgentPR,
  orchestrator: SensorOrchestrator,
  maxIterations = 3,
): Promise<{ finalReport: SensorReport; iterations: number }> {
  let iteration = 0;
  let report = await orchestrator.evaluate(pr);

  while (!report.passed && iteration < maxIterations) {
    iteration++;

    const correctionPrompt = [
      "Your previous code change has been flagged by automated quality sensors.",
      "Please review the following diagnostics and submit a corrected version.",
      "",
      report.feedbackPrompt,
      "",
      `Current confidence: ${(report.confidence * 100).toFixed(1)}% (need ${(orchestrator.confidenceThreshold * 100).toFixed(1)}%)`,
      "",
      "Focus on the specific issues identified. Do not refactor unrelated code.",
    ].join("\n");

    await agent.requestCorrection(pr, correctionPrompt);
    report = await orchestrator.evaluate(pr);
  }

  return { finalReport: report, iterations: iteration };
}

In practice, agents self-correct successfully about 70% of the time on the first retry. By the third iteration, we're at 89%. The remaining 11% gets escalated to a human reviewer with the full sensor diagnostics attached — which makes the review dramatically faster because you already know where to look.
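The control flow is easier to reason about with stubs in place of the real agent and orchestrator. This is a sketch under assumptions: `StubReport`, `Evaluator`, `Corrector`, and the `escalate` flag are hypothetical stand-ins, and the stub evaluators just replay a scripted list of confidence scores.

```typescript
interface StubReport {
  passed: boolean;
  confidence: number;
}

type Evaluator = () => Promise<StubReport>;
type Corrector = (attempt: number) => Promise<void>;

// Same shape as runFeedbackLoop: evaluate, then correct and re-evaluate
// until the report passes or the retry budget runs out.
async function loopWithStubs(
  evaluate: Evaluator,
  requestCorrection: Corrector,
  maxIterations = 3,
): Promise<{ finalReport: StubReport; iterations: number; escalate: boolean }> {
  let iterations = 0;
  let report = await evaluate();
  while (!report.passed && iterations < maxIterations) {
    iterations++;
    await requestCorrection(iterations);
    report = await evaluate();
  }
  // Anything still failing after the budget goes to a human reviewer.
  return { finalReport: report, iterations, escalate: !report.passed };
}

// Stub evaluator that replays a fixed sequence, repeating the last score.
function scripted(scores: number[], threshold = 0.85): Evaluator {
  let call = 0;
  return async () => {
    const confidence = scores[Math.min(call++, scores.length - 1)];
    return { passed: confidence >= threshold, confidence };
  };
}

const noop: Corrector = async () => {};

async function demo() {
  const fixed = await loopWithStubs(scripted([0.7, 0.9]), noop);
  const stuck = await loopWithStubs(scripted([0.6]), noop);
  return { fixed, stuck };
}

demo().then(({ fixed, stuck }) => {
  console.log(fixed.iterations, fixed.escalate); // 1 false
  console.log(stuck.iterations, stuck.escalate); // 3 true
});
```

Note that a run that exhausts its budget evaluates four times in total (the initial pass plus three retries), which matters when each evaluation takes minutes.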

The Numbers

We've been running this system in production for about ten weeks now across three teams. Here's what the data looks like:

| Metric | Before sensors | After sensors |
|--------|----------------|---------------|
| Agent PRs with semantic bugs merged | 12.3% | 1.8% |
| Mean time to detect semantic bug | 3.2 days | 14 minutes |
| Agent PRs requiring human intervention | 100% | 34% |
| Average sensor evaluation time | n/a | 4.2 minutes |
| False positive rate (sensor blocks clean PR) | n/a | 8.1% |

The false positive rate bugs me. 8% of the time, we're making the agent do unnecessary work or pinging a human for no reason. Most false positives come from the complexity sensor — sometimes a refactor legitimately increases complexity in one function while reducing it system-wide. I'm working on making that sensor context-aware across the full changeset rather than per-function.
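One possible shape for the changeset-level check, as a sketch: flag only when total complexity across all touched functions grows past the ratio, so a refactor that moves complexity around doesn't trip it. The `FnComplexity` record and the sample numbers are hypothetical.

```typescript
interface FnComplexity {
  name: string;
  before: number; // 0 for functions that are new in this PR
  after: number; // 0 for functions removed by this PR
}

// Net ratio across the whole changeset rather than per function.
function changesetRatio(fns: FnComplexity[]): number {
  const before = fns.reduce((sum, f) => sum + f.before, 0);
  const after = fns.reduce((sum, f) => sum + f.after, 0);
  // Treat an empty baseline as 1 to avoid dividing by zero.
  return after / Math.max(before, 1);
}

// A refactor that concentrates logic in one function and simplifies two others.
const refactor: FnComplexity[] = [
  { name: "calculateTotal", before: 5, after: 8 }, // +60%, flagged per function
  { name: "applyVolumeDiscount", before: 6, after: 3 },
  { name: "applyLoyaltyDiscount", before: 4, after: 2 },
];

const perFunctionFlags = refactor.filter(
  (f) => f.before > 0 && f.after / f.before > 1.4,
);

console.log(perFunctionFlags.map((f) => f.name)); // [ 'calculateTotal' ]
console.log(changesetRatio(refactor).toFixed(2)); // 0.87
```

The trade-off is that a net ratio can hide one function that got genuinely worse while others shrank, so it probably makes sense to report both signals and only block on the combined one.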

GitHub Actions Integration

The whole thing runs as a GitHub Actions workflow that triggers on PRs labeled agent-generated:

# .github/workflows/sensor-check.yml
name: Agent PR Sensors

on:
  pull_request:
    # "labeled" covers the case where the label is attached after the PR opens
    types: [opened, synchronize, labeled]

jobs:
  sensor-evaluation:
    if: contains(github.event.pull_request.labels.*.name, 'agent-generated')
    runs-on: ubuntu-latest
    timeout-minutes: 15

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - uses: oven-sh/setup-bun@v2
        with:
          bun-version: latest

      - name: Install dependencies
        run: bun install --frozen-lockfile

      - name: Run sensor evaluation
        run: bun run sensors:evaluate
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          STRYKER_DASHBOARD_API_KEY: ${{ secrets.STRYKER_KEY }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
          BASE_BRANCH: ${{ github.event.pull_request.base.ref }}

      - name: Post sensor report
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const report = JSON.parse(fs.readFileSync('.sensor-report.json', 'utf8'));
            const body = [
              `## Sensor Evaluation Report`,
              `**Confidence:** ${(report.confidence * 100).toFixed(1)}%`,
              `**Status:** ${report.passed ? '✓ Passed' : '✗ Below threshold'}`,
              '',
              '| Sensor | Score | Weight |',
              '|--------|-------|--------|',
              ...report.sensors.map(s =>
                `| ${s.sensor} | ${(s.score * 100).toFixed(0)}% | ${s.weight} |`
              ),
            ].join('\n');

            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body,
            });

      - name: Trigger feedback loop
        if: ${{ failure() }}
        run: bun run sensors:feedback --pr=${{ github.event.pull_request.number }}

The sensors:feedback script is what actually talks back to the agent. In our setup, it posts a comment that our agent framework picks up, but you could wire it to any agent API.

What's Still Missing

I don't want to oversell this. There are categories of bugs these sensors won't catch.

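Race conditions in async code. The property-based tests run sequentially, so they miss concurrency issues entirely. I've been looking at whether something like Loom (the Rust tool for exploring thread interleavings) could be adapted, but nothing's materialized for our TypeScript setup yet. fast-check's scheduler arbitrary (fc.scheduler()), which reorders promise resolution to surface races, may be a starting point.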

Cross-service contract violations. If the agent changes how a service produces events, and a downstream consumer breaks, our sensors don't see it. We'd need integration-level property tests for that, which is a whole other infrastructure problem.

Slow degradation. A change that makes P99 latency 5% worse won't trigger any sensor. We're experimenting with a benchmark sensor that runs micro-benchmarks on changed hot paths, but the noise floor in CI environments makes it unreliable.
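To show why the noise floor is the sticking point, here's a sketch of the comparison such a sensor might make. The median-based comparison, the 10% noise floor, and the timing samples are all assumptions for illustration, not measurements from our CI.

```typescript
function median(samples: number[]): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Relative slowdown of the candidate's median over the baseline's median.
function slowdown(baseMs: number[], candidateMs: number[]): number {
  return median(candidateMs) / median(baseMs) - 1;
}

// Flags a regression only when the slowdown exceeds the noise floor.
function isRegression(
  baseMs: number[],
  candidateMs: number[],
  noiseFloor = 0.1,
): boolean {
  return slowdown(baseMs, candidateMs) > noiseFloor;
}

// Made-up samples in milliseconds; the baseline includes one noisy outlier.
const baseline = [10.1, 9.8, 10.0, 10.4, 30.2];
const slightlySlower = [10.5, 10.6, 10.4, 10.7, 10.5];
const clearlySlower = [12.0, 12.3, 11.9, 12.1, 12.2];

console.log(isRegression(baseline, slightlySlower)); // false: about 4% is under the floor
console.log(isRegression(baseline, clearlySlower)); // true: about 20% is over it
```

Medians shrug off the outlier, but any floor wide enough to absorb CI jitter also swallows a 5% regression, which is exactly the problem.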

The system also assumes you've written property specs for your critical paths. If you haven't, the property sensor has nothing to check against. We've been backfilling these gradually — about 60% coverage of our core domain logic after ten weeks — but it's real work that someone has to do.

The Bet I'm Making

I think the next year is going to be defined by tooling that treats agent output as untrusted by default. Not because agents are bad — they're remarkably good at producing syntactically and structurally correct code. But correctness has layers, and the semantic layer is where humans still have a massive edge.

Feedback sensors are one way to encode that human semantic knowledge into an automated check. The agents get better with each iteration of the loop. The property specs accumulate as a living specification. And the whole thing runs in four minutes, which is faster than waiting for a code review.

The full implementation is about 2,400 lines of TypeScript. It runs on Node 22 with TypeScript 5.5. The main dependencies are ts-morph for AST work, fast-check 3.x for properties, and Stryker 8.2 for mutations. Nothing exotic.

If you're running coding agents in production and haven't built something like this yet — start with the mutation sensor. It's the highest signal-to-noise ratio of the four, and it requires zero upfront specification work. You'll be surprised how many tests your agents write that don't actually test anything.