Measuring Collaboration Quality with Coding Agents: A Metrics Framework That Actually Tells You If Your Agents Are Helping

Four months ago, our platform team adopted coding agents across the board. Copilot, Cursor, some custom agents wired into our CLI tooling. Leadership was enthusiastic. Engineering was cautiously optimistic. Everyone agreed we'd "measure the impact."
Then last month, my director pinged me: "Can you put together a deck showing the ROI on our agent tooling spend?"
I stared at the message for a while. Because here's the thing — I'd been tracking what I could, and the numbers I had were garbage. Lines of code per day had gone up 40%. PRs merged per week were up. Velocity points were slightly higher. And none of it told me whether we were actually building better software faster, or just generating more stuff that would need to be maintained later.
Why Traditional Metrics Lie to You
Let me be blunt: LOC/day as a productivity metric was already a joke before agents. Now it's actively dangerous. An agent can generate 500 lines of code in 30 seconds. That doesn't mean those 500 lines are good, correct, or even necessary.
PRs per day? Same problem. If an agent helps me split what used to be one thoughtful PR into four PRs because it's easier to generate code in smaller chunks, that's not productivity — that's fragmentation.
Velocity points are the worst offender. We estimate stories based on perceived complexity. If agents make things feel easier, estimates drop, so a point no longer represents the same amount of work it did six months ago. Velocity can stay flat or tick up slightly while actual throughput has moved in either direction, or not at all. You're just measuring your own calibration drift.
I went back to my director and said: "I can give you those numbers, but they won't tell you anything real. Give me two weeks and I'll build something that does."
The Framework: Five Metrics That Actually Mean Something
After digging through research, talking to other staff engineers dealing with the same question, and staring at our own git history, I landed on five metrics. None of them are perfect individually. Together, they paint a picture.
First-Pass Acceptance Rate
What percentage of agent-generated outputs does an engineer accept without meaningful modification? This isn't about accepting a completion verbatim — small formatting tweaks don't count. It's about whether the agent's output was directionally correct and structurally sound on the first try.
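A high acceptance rate suggests your agents are well-configured, your prompts are good, and the agent has enough context. A low rate means you're spending time babysitting. Read it alongside rework rate, though: a high acceptance rate paired with high rework can also mean people are rubber-stamping output.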
Iteration Depth
When the first pass isn't accepted, how many back-and-forth cycles does it take to get to something usable? One round of "actually, I need this to handle the error case too" is fine. Six rounds of increasingly frustrated rephrasing means the agent isn't suited for that task, or the engineer doesn't have the right mental model for collaborating with it.
Rework Rate
Of the code that was agent-generated and merged, how much gets modified within 7 days? Changes driven by new requirements don't count; that's normal churn. I mean "someone went back and rewrote this because it was subtly wrong or poorly structured." This is the metric that catches the deferred cost of fast generation.
Cognitive Load Score
This one's hybrid. Part self-reported (a quick weekly survey: "how much mental overhead did agent collaboration add this week?" on a 1-5 scale), part measured. The measured component tracks context switches during agent-assisted sessions, time spent in code review for agent-generated PRs vs. human-written PRs, and time-to-first-meaningful-edit after an agent generates a block.
Defect Origin Rate
What percentage of production incidents trace back to agent-generated code? This requires tagging — either through git metadata or a lightweight annotation system. It's a lagging indicator, but it's the one leadership actually cares about most.
Building the Telemetry Collector
I built a service that pulls from three sources: git history (with agent metadata tags), CI pipeline data, and an IDE plugin that reports session-level interactions. Here's the core of it:
```typescript
// One event reported by the IDE plugin during an agent-assisted session.
interface AgentInteraction {
  sessionId: string;
  engineerId: string;
  agentId: string;
  timestamp: number; // epoch milliseconds
  type: "generation" | "iteration" | "acceptance" | "rejection";
  metadata: {
    linesGenerated: number;
    linesModified: number;
    iterationIndex: number; // 1 for the first follow-up round, 2 for the second, ...
    taskContext: string;
    durationMs: number;
  };
}

// A later commit that touched files from an earlier, possibly agent-generated commit.
interface ReworkEvent {
  commitSha: string;
  originalCommitSha: string;
  filePaths: string[];
  daysSinceOriginal: number;
  wasAgentGenerated: boolean;
  changeClassification: "bugfix" | "refactor" | "style" | "feature-change";
}

// Team-level rollup of the five metrics for one reporting period.
interface MetricsSnapshot {
  period: { start: number; end: number }; // epoch milliseconds
  teamId: string;
  firstPassAcceptanceRate: number;
  averageIterationDepth: number;
  reworkRate: number;
  cognitiveLoadScore: number;
  defectOriginRate: number;
}
```

The collector itself runs on a cron, aggregating raw events into weekly snapshots:
```typescript
import { db } from "@/lib/db";
import { gitClient } from "@/lib/git";
import { ciClient } from "@/lib/ci";

const MS_PER_DAY = 86_400_000;

class AgentMetricsCollector {
  private readonly REWORK_WINDOW_DAYS = 7;

  async collectWeeklySnapshot(teamId: string): Promise<MetricsSnapshot> {
    const period = this.getCurrentWeekPeriod();
    const interactions = await db.agentInteractions.findMany({
      where: { teamId, timestamp: { gte: period.start, lte: period.end } },
    });

    const acceptanceRate = this.calculateAcceptanceRate(interactions);
    const iterationDepth = this.calculateIterationDepth(interactions);
    const reworkRate = await this.calculateReworkRate(teamId, period);
    const cognitiveLoad = await this.calculateCognitiveLoad(teamId, period);
    const defectOrigin = await this.calculateDefectOriginRate(teamId, period);

    return {
      period,
      teamId,
      firstPassAcceptanceRate: acceptanceRate,
      averageIterationDepth: iterationDepth,
      reworkRate,
      cognitiveLoadScore: cognitiveLoad,
      defectOriginRate: defectOrigin,
    };
  }

  // Share of generations accepted with no follow-up iteration and <10% of lines modified.
  private calculateAcceptanceRate(interactions: AgentInteraction[]): number {
    const sessions = this.groupBySessions(interactions);
    let accepted = 0;
    let total = 0;

    for (const session of sessions.values()) {
      const ordered = [...session].sort((a, b) => a.timestamp - b.timestamp);
      const generations = ordered.filter((i) => i.type === "generation");

      generations.forEach((gen, idx) => {
        total++;
        // Only events before the next generation belong to this one; otherwise a
        // later generation's iterations or acceptance would be credited to it.
        const windowEnd = generations[idx + 1]?.timestamp ?? Infinity;
        const followUps = ordered.filter(
          (i) => i.timestamp > gen.timestamp && i.timestamp < windowEnd,
        );
        const wasIterated = followUps.some((i) => i.type === "iteration");
        const wasAccepted = followUps.some(
          (i) =>
            i.type === "acceptance" &&
            gen.metadata.linesGenerated > 0 &&
            i.metadata.linesModified / gen.metadata.linesGenerated < 0.1,
        );
        if (wasAccepted && !wasIterated) {
          accepted++;
        }
      });
    }
    return total === 0 ? 0 : accepted / total;
  }

  // Mean of the deepest iteration reached, over sessions that needed at least one.
  private calculateIterationDepth(interactions: AgentInteraction[]): number {
    const sessions = this.groupBySessions(interactions);
    const depths: number[] = [];

    for (const session of sessions.values()) {
      const maxIteration = Math.max(
        ...session
          .filter((i) => i.type === "iteration")
          .map((i) => i.metadata.iterationIndex),
        0,
      );
      if (maxIteration > 0) {
        depths.push(maxIteration);
      }
    }
    return depths.length === 0
      ? 0
      : depths.reduce((a, b) => a + b, 0) / depths.length;
  }

  private async calculateReworkRate(
    teamId: string,
    period: { start: number; end: number },
  ): Promise<number> {
    const windowMs = this.REWORK_WINDOW_DAYS * MS_PER_DAY;
    // Only look at commits whose 7-day rework window closed during this period.
    // Without a lower bound, every snapshot would rescan the team's whole history.
    // (Assumes getCommitsWithTag accepts an `after` bound alongside `before`.)
    const agentCommits = await gitClient.getCommitsWithTag("agent-generated", {
      teamId,
      after: period.start - windowMs,
      before: period.end - windowMs,
    });

    let reworked = 0;
    for (const commit of agentCommits) {
      const subsequentChanges = await gitClient.getModificationsToFiles(
        commit.files,
        {
          after: commit.timestamp,
          within: windowMs,
        },
      );
      // Style fixes and feature changes are normal churn, not rework.
      const meaningfulRework = subsequentChanges.filter(
        (c) =>
          c.classification === "refactor" || c.classification === "bugfix",
      );
      if (meaningfulRework.length > 0) {
        reworked++;
      }
    }
    return agentCommits.length === 0 ? 0 : reworked / agentCommits.length;
  }

  // Blends the weekly survey with one measured signal (review-time ratio).
  // The other measured inputs described above are not part of this excerpt.
  private async calculateCognitiveLoad(
    teamId: string,
    period: { start: number; end: number },
  ): Promise<number> {
    const surveyScores = await db.weeklySurveys.findMany({
      where: { teamId, weekOf: { gte: period.start, lte: period.end } },
      select: { cognitiveLoadScore: true },
    });
    const reviewTimes = await db.pullRequestReviews.aggregate({
      where: {
        teamId,
        createdAt: { gte: period.start, lte: period.end },
        isAgentGenerated: true,
      },
      avg: { reviewDurationMinutes: true },
    });
    const baselineReviewTime = await db.pullRequestReviews.aggregate({
      where: {
        teamId,
        createdAt: { gte: period.start, lte: period.end },
        isAgentGenerated: false,
      },
      avg: { reviewDurationMinutes: true },
    });

    // No survey responses: fall back to the scale midpoint.
    const selfReported =
      surveyScores.length > 0
        ? surveyScores.reduce((a, b) => a + b.cognitiveLoadScore, 0) /
          surveyScores.length
        : 3;

    // Normalize review time ratio to 1-5 scale (a ratio of 1.0 maps to 2.5).
    // If either side has no reviews (or the baseline is 0), treat the ratio as 1
    // rather than dividing a real average by a placeholder.
    const agentAvg = reviewTimes.avg.reviewDurationMinutes;
    const baselineAvg = baselineReviewTime.avg.reviewDurationMinutes;
    const reviewRatio =
      agentAvg != null && baselineAvg ? agentAvg / baselineAvg : 1;
    const measuredScore = Math.min(5, Math.max(1, reviewRatio * 2.5));

    return (selfReported + measuredScore) / 2;
  }

  private async calculateDefectOriginRate(
    teamId: string,
    period: { start: number; end: number },
  ): Promise<number> {
    const incidents = await db.incidents.findMany({
      where: { teamId, resolvedAt: { gte: period.start, lte: period.end } },
    });
    const agentOriginated = incidents.filter((i) =>
      i.rootCauseCommits.some((c) => c.tags.includes("agent-generated")),
    );
    return incidents.length === 0
      ? 0
      : agentOriginated.length / incidents.length;
  }

  private groupBySessions(
    interactions: AgentInteraction[],
  ): Map<string, AgentInteraction[]> {
    const map = new Map<string, AgentInteraction[]>();
    for (const interaction of interactions) {
      const existing = map.get(interaction.sessionId) ?? [];
      existing.push(interaction);
      map.set(interaction.sessionId, existing);
    }
    return map;
  }

  // Rolling 7-day window ending now, not a calendar week.
  private getCurrentWeekPeriod() {
    const now = Date.now();
    const start = now - 7 * MS_PER_DAY;
    return { start, end: now };
  }
}
```

The git tagging is the trickiest part. We added a Git hook that checks whether the staged changes came from an agent session (the IDE plugin writes a local state file) and, if so, appends an agent-generated: true trailer to the commit message. It has to be a prepare-commit-msg or commit-msg hook rather than pre-commit, because pre-commit runs before the commit message exists. Not perfect, but good enough for the 80% case.
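The trailer step is easy to get subtly wrong, so here is a minimal sketch of it as a pure function that can be tested apart from the hook wiring. This is not our actual hook: `appendAgentTrailer` and the `fromAgentSession` flag are hypothetical names, and the sketch assumes the hook script has already read the commit message file and the plugin's state file.

```typescript
// Hypothetical helper; names and behavior are illustrative assumptions.
const AGENT_TRAILER = "agent-generated: true";

function appendAgentTrailer(message: string, fromAgentSession: boolean): string {
  if (!fromAgentSession) return message;

  const lines = message.replace(/\s+$/, "").split("\n");

  // Leave the message alone if the trailer is already there (e.g. on amend).
  if (lines.some((l) => l.trim().toLowerCase() === AGENT_TRAILER)) {
    return message;
  }

  // Git trailers live in the last paragraph of the message. If that paragraph
  // already consists of "Key: value" lines, join it; otherwise start a new one.
  const lastBlank = lines.lastIndexOf("");
  const lastParagraph = lines.slice(lastBlank + 1);
  const trailerPattern = /^[A-Za-z0-9-]+: .+$/;
  const endsWithTrailers =
    lastBlank !== -1 && lastParagraph.every((l) => trailerPattern.test(l));

  const separator = endsWithTrailers ? "\n" : "\n\n";
  return lines.join("\n") + separator + AGENT_TRAILER + "\n";
}
```

In a real hook, `git interpret-trailers --in-place --trailer "agent-generated: true" <message-file>` can do the same job without hand-rolled parsing; the function above just makes the logic visible.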
Two Dashboards, Two Audiences
Leadership doesn't need the same view as individual engineers. I learned this the hard way after showing my director the raw iteration depth histogram and watching his eyes glaze over.
The leadership dashboard shows four things:
- First-pass acceptance rate trend (weekly, team-level) — "are we getting better at using these tools?"
- Rework rate vs. baseline — "is agent code holding up in production?"
- Defect origin rate — "are agents introducing bugs?"
- Cost per accepted output — agent tooling spend divided by number of accepted generations
That's it. Four numbers, four trend lines. They can ask questions, and I can drill down, but the top-level view is simple.
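Cost per accepted output is the only number on that dashboard the collector doesn't produce, so here is a rough sketch of how it could be computed per week. The `WeeklyUsage` shape and field names are assumptions for illustration, not our billing schema.

```typescript
// Hypothetical input shape; field names are assumptions.
interface WeeklyUsage {
  weekOf: string; // e.g. an ISO date for the start of the week
  toolingSpendUsd: number;
  acceptedGenerations: number;
}

// Spend divided by accepted generations. Returns null when nothing was
// accepted, so the dashboard can show a gap instead of Infinity.
function costPerAcceptedOutput(week: WeeklyUsage): number | null {
  if (week.acceptedGenerations <= 0) return null;
  return week.toolingSpendUsd / week.acceptedGenerations;
}

// One point per week for the trend line.
function costTrend(
  weeks: WeeklyUsage[],
): Array<{ weekOf: string; costUsd: number | null }> {
  return weeks.map((w) => ({
    weekOf: w.weekOf,
    costUsd: costPerAcceptedOutput(w),
  }));
}
```

Returning null rather than 0 for a week with no accepted output matters: a zero would read as "free", which is the opposite of what happened.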
The engineer dashboard is more granular:
- Personal acceptance rate by task type (is the agent better at tests than business logic for you?)
- Iteration depth distribution — where are you getting stuck?
- Prompt effectiveness score — which of your prompt patterns lead to first-pass acceptance?
- Time saved estimate — comparing time-to-completion for agent-assisted vs. unassisted tasks of similar complexity
The engineer view is opt-in and private by default. Nobody's getting performance-reviewed on their agent collaboration stats. That was a non-negotiable when I pitched this.
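The per-task-type breakdown is a plain group-and-divide. A minimal sketch, assuming each generation has already been reduced to a task type and a first-pass verdict (the `GenerationOutcome` shape is hypothetical, not the collector's event format):

```typescript
// Hypothetical, pre-reduced record: one entry per agent generation.
interface GenerationOutcome {
  taskType: string; // e.g. "unit-tests", "business-logic"
  acceptedFirstPass: boolean;
}

// Returns first-pass acceptance rate (0..1) keyed by task type.
function acceptanceRateByTaskType(
  outcomes: GenerationOutcome[],
): Map<string, number> {
  const counts = new Map<string, { accepted: number; total: number }>();
  for (const o of outcomes) {
    const c = counts.get(o.taskType) ?? { accepted: 0, total: 0 };
    c.total++;
    if (o.acceptedFirstPass) c.accepted++;
    counts.set(o.taskType, c);
  }

  const rates = new Map<string, number>();
  for (const [taskType, c] of counts) {
    rates.set(taskType, c.accepted / c.total);
  }
  return rates;
}
```

Each task type only appears in the result if it has at least one generation, so there's no division by zero to guard against.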
Four Months of Data: What Surprised Us
We've been running this framework for 16 weeks now. Some things I expected. Some I genuinely didn't.
Expected: First-pass acceptance rates vary wildly by task type. For unit tests and boilerplate CRUD endpoints, we're at 73% acceptance. For complex business logic with domain-specific rules, it drops to 22%. No surprise there.
Unexpected: Iteration depth has a bimodal distribution. Most interactions are either 1 round (it worked) or 5+ rounds (the engineer is fighting the agent). There's almost nothing in the 2-3 range. This suggests that when the first attempt isn't close, course-correcting through conversation is harder than it seems. Several engineers told me they now just restart with a better prompt rather than iterating — which the data confirms.
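An average hides this shape completely, which is why the engineer dashboard shows a distribution. A small sketch of the bucketing, assuming one depth value per interaction and a "5+" overflow bucket (the function name and the cap are illustrative choices):

```typescript
// Buckets iteration depths into "1", "2", ..., "<cap>+".
// Depths below 1 are ignored; every bucket is present even when empty,
// so the gap in the middle of a bimodal distribution stays visible.
function iterationDepthHistogram(
  depths: number[],
  cap = 5,
): Map<string, number> {
  const histogram = new Map<string, number>();
  for (let d = 1; d < cap; d++) {
    histogram.set(String(d), 0);
  }
  histogram.set(`${cap}+`, 0);

  for (const depth of depths) {
    if (depth < 1) continue;
    const key = depth >= cap ? `${cap}+` : String(Math.floor(depth));
    histogram.set(key, (histogram.get(key) ?? 0) + 1);
  }
  return histogram;
}
```

Pre-filling the empty buckets is the design choice that matters here: a chart built only from observed values would silently close the hole between 1 and 5+.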
Expected: Rework rate started high (31% in month one) and came down over time (now at 14%). Teams learned what to trust and what to verify.
Genuinely shocking: The defect origin rate for agent-generated code is lower than for human-written code. 8% vs. 12%. My hypothesis: engineers review agent code more carefully than they review their own. There's a healthy skepticism that acts as a quality gate. The agent-generated code that survives review has been scrutinized more thoroughly than the code someone wrote at 4pm on a Friday.
Unexpected: Cognitive load scores went up in month two before coming back down. The learning curve of effective agent collaboration is real. People were spending more mental energy figuring out how to prompt effectively than they would have spent just writing the code. By month three, it normalized. By month four, it was below baseline for senior engineers (though still slightly elevated for mid-levels).
Interesting correlation: Engineers with higher first-pass acceptance rates also write better technical specs and design docs. The skill of clearly articulating what you want to a machine turns out to be the same skill as clearly articulating what you want to a teammate. Who knew.
How Metrics Changed Behavior
This is the part I didn't fully anticipate. Once people could see their own patterns, they changed them.
Three engineers independently started maintaining personal prompt libraries — collections of prompt patterns that consistently led to high acceptance rates for specific task types. One person built a small CLI tool that pre-fills context from the current repo structure before sending a prompt to the agent.
The team started writing more detailed ticket descriptions. Not because I asked them to, but because they realized that the context they'd give an agent was basically the same context a ticket should contain. Better tickets led to better agent outputs led to higher acceptance rates. A virtuous cycle nobody planned.
Two people stopped using agents for certain task categories entirely. Not out of frustration — out of data. They could see that for state machine logic and complex data transformations, their iteration depth was consistently 5+ and their rework rate was high. For those tasks, they're faster without the agent. That's a valid, data-informed decision, and it's exactly the kind of nuanced answer that "are agents making us faster?" demands.
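That decision can be written down as a simple rule. The sketch below is one way to express it, not something we enforce: the 5-round depth cutoff comes from the pattern described above, while the rework threshold and minimum sample size are made-up defaults you would tune against your own data.

```typescript
// Hypothetical per-engineer, per-task-type rollup.
interface TaskTypeStats {
  taskType: string;
  medianIterationDepth: number;
  reworkRate: number; // 0..1
  sampleSize: number; // number of agent-assisted tasks observed
}

interface SkipThresholds {
  minIterationDepth: number;
  minReworkRate: number;
  minSamples: number;
}

// Assumed defaults for illustration only.
const DEFAULT_THRESHOLDS: SkipThresholds = {
  minIterationDepth: 5,
  minReworkRate: 0.25,
  minSamples: 10,
};

// Task types where the data suggests writing the code by hand:
// deep iteration AND high rework, with enough samples to trust either.
function taskTypesToHandWrite(
  stats: TaskTypeStats[],
  thresholds: SkipThresholds = DEFAULT_THRESHOLDS,
): string[] {
  return stats
    .filter(
      (s) =>
        s.sampleSize >= thresholds.minSamples &&
        s.medianIterationDepth >= thresholds.minIterationDepth &&
        s.reworkRate >= thresholds.minReworkRate,
    )
    .map((s) => s.taskType);
}
```

Requiring both signals plus a sample floor keeps one bad afternoon from writing off a whole task category.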
We also noticed that the team's code review practices tightened up across the board. The habit of carefully reviewing agent output bled into reviewing human-written code too. Our overall defect escape rate dropped 18% over the four months — and I can't attribute that entirely to agents, but the cultural shift toward "read it carefully before you approve it" certainly contributed.
What I'd Tell My Director Now
The ROI question isn't answerable with a single number. But I can say this: for the task categories where agents excel (and we now know exactly which ones those are), we're seeing 30-40% time savings with no increase in defect rates. For the categories where they don't, we've stopped wasting time trying to force it. The net effect, accounting for tooling costs and the learning curve, is positive — but only because we measured it properly and let the data guide usage patterns rather than mandating blanket adoption.
Like Lucario reading auras, you've got to sense what's actually happening beneath the surface. The surface-level metrics — more code, more PRs, higher velocity — will tell you a story. It's just not the right one.