Spec-Driven Development: A TypeScript Framework That Guides AI Agents from Design to Implementation

The Agent Built the Wrong Thing (Again)
Last November I asked an agent to add request caching to our API gateway. Simple enough. I said something like "add a caching layer for GET requests with a 5-minute TTL." The agent delivered working code in under two minutes. TypeScript compiled. Tests passed. It used an in-memory Map with timestamp-based expiry.
The problem? We run twelve instances behind a load balancer. In-memory caching is worthless in that topology. What I needed was a Redis-backed distributed cache with invalidation hooks. The agent had no way to know that from my prompt. It made a perfectly reasonable assumption and built something technically correct but architecturally wrong.
I'd been here before. Many times. And I kept blaming the agent, which is like blaming a contractor for building what you asked for instead of what you meant.
This is the story of how I stopped writing prompts and started writing specs.
Why Natural Language Fails at Scale
Here's the thing about working with AI agents on real codebases: the gap between what you say and what you mean grows with system complexity. When your app is a single-service CRUD API, "add caching" is unambiguous enough. When you're dealing with distributed services, event sourcing, multi-tenant isolation, and deployment constraints, natural language just doesn't carry enough information.
I noticed a pattern in my failures:
- Missing constraints — the agent doesn't know about infrastructure topology
- Implicit architecture decisions — conventions that live in your head, not in code
- Ambiguous scope — "add auth" could mean anything from a middleware check to a full RBAC system
- Anti-patterns it can't detect — things that work in isolation but break under production load
The agent isn't dumb. It's under-informed. And the fix isn't better prompting. It's structured specification.
From Prompts to Markdown to Types
My first attempt at fixing this was markdown specs. I'd write a document before asking the agent to build anything. Requirements, constraints, examples. It helped. Outcomes improved maybe 40%.
But markdown specs have their own problems. They're inconsistent in structure. They drift over time. You can't validate them programmatically. And there's no way to enforce that the agent actually addresses every constraint you listed.
So I did what any engineer would do: I made it a type system problem.
The Spec Schema
Here's the core of the framework. A Zod schema that defines what a feature spec must contain:
import { z } from "zod";

const ArchitectureDecisionSchema = z.object({
  decision: z.string().describe("What was decided"),
  rationale: z.string().describe("Why this approach over alternatives"),
  alternatives: z.array(z.string()).describe("What was considered and rejected"),
  constraints: z.array(z.string()).describe("What makes this decision necessary"),
});

const AcceptanceCriterionSchema = z.object({
  id: z.string().regex(/^AC-\d+$/),
  description: z.string(),
  validationStrategy: z.enum([
    "unit-test",
    "integration-test",
    "type-check",
    "manual",
    "runtime-assertion",
  ]),
  automatable: z.boolean(),
});

const AntiPatternSchema = z.object({
  pattern: z.string().describe("What the wrong implementation looks like"),
  reason: z.string().describe("Why this breaks in production"),
  detection: z.string().describe("How to check if the agent did this"),
});

export const FeatureSpecSchema = z.object({
  id: z.string().regex(/^SPEC-\d{4}$/),
  title: z.string().max(100),
  owner: z.string(),
  status: z.enum(["draft", "approved", "in-progress", "validated"]),
  context: z.object({
    system: z.string().describe("Which service/package this lives in"),
    topology: z.string().describe("How the system is deployed"),
    existingPatterns: z.array(z.string()).describe("Conventions already in the codebase"),
    dependencies: z.array(z.string()).describe("What this feature depends on"),
  }),
  requirements: z.object({
    functional: z.array(z.string()).min(1),
    nonFunctional: z.array(z.string()),
    outOfScope: z.array(z.string()).describe("Explicitly excluded to prevent scope creep"),
  }),
  architectureDecisions: z.array(ArchitectureDecisionSchema).min(1),
  acceptanceCriteria: z.array(AcceptanceCriterionSchema).min(1),
  antiPatterns: z.array(AntiPatternSchema),
  implementation: z.object({
    entryPoints: z.array(z.string()).describe("Files the agent should modify"),
    testStrategy: z.string(),
    phasing: z.array(z.object({
      phase: z.number(),
      description: z.string(),
      deliverables: z.array(z.string()),
    })).optional(),
  }),
});

export type FeatureSpec = z.infer<typeof FeatureSpecSchema>;

This isn't theoretical. I use this daily. Every feature I ask an agent to build starts with a spec that passes this schema.
Pro-Tip: The outOfScope field is secretly the most important one. Agents love to over-deliver. Telling them what NOT to build is often more valuable than telling them what to build.
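The schema checks the shape of a spec, but a few rules span several fields and are easy to miss. A minimal sketch of a lint pass for those, assuming simplified local types in place of the inferred FeatureSpec; `lintSpec` and the `*Like` interfaces are hypothetical names for illustration, not part of the framework above:

```typescript
// Simplified local shapes (assumption); the real types would come from
// z.infer<typeof FeatureSpecSchema>.
interface CriterionLike {
  id: string;
  validationStrategy:
    | "unit-test"
    | "integration-test"
    | "type-check"
    | "manual"
    | "runtime-assertion";
  automatable: boolean;
}

interface AntiPatternLike {
  pattern: string;
  detection: string;
}

interface SpecLike {
  acceptanceCriteria: CriterionLike[];
  antiPatterns: AntiPatternLike[];
}

// Hypothetical helper: returns human-readable problems, or an empty
// array when the cross-field checks all pass.
function lintSpec(spec: SpecLike): string[] {
  const problems: string[] = [];
  const seenIds = new Set<string>();

  for (const criterion of spec.acceptanceCriteria) {
    // Ids should be unique so validation results can be traced back.
    if (seenIds.has(criterion.id)) {
      problems.push(`Duplicate acceptance criterion id: ${criterion.id}`);
    }
    seenIds.add(criterion.id);

    // A "manual" strategy flagged as automatable contradicts itself.
    if (criterion.validationStrategy === "manual" && criterion.automatable) {
      problems.push(`${criterion.id} is marked manual but automatable`);
    }
  }

  // Detection strings get compiled as regular expressions during
  // validation, so a malformed pattern is worth catching up front.
  for (const antiPattern of spec.antiPatterns) {
    try {
      new RegExp(antiPattern.detection);
    } catch {
      problems.push(`Invalid detection regex for: ${antiPattern.pattern}`);
    }
  }

  return problems;
}
```

Running something like this right after the schema parse keeps both kinds of failure in one place, before any agent sees the spec.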
Just-in-Time Context Loading
One mistake I made early on was dumping the entire spec into the agent's context at the start of a conversation. That's wasteful and counterproductive. Agents have finite context windows, and front-loading everything means the details that matter most during implementation get pushed furthest from the active generation window.
Instead, I load spec sections contextually based on what phase the agent is in:
import type { FeatureSpec } from "./spec-schema";

type AgentPhase = "planning" | "scaffolding" | "implementing" | "testing" | "reviewing";

interface ContextSlice {
  instructions: string;
  specData: Record<string, unknown>;
}

export function getContextForPhase(spec: FeatureSpec, phase: AgentPhase): ContextSlice {
  switch (phase) {
    case "planning":
      return {
        instructions: "You are planning implementation. Do NOT write code yet.",
        specData: {
          requirements: spec.requirements,
          architectureDecisions: spec.architectureDecisions,
          antiPatterns: spec.antiPatterns,
        },
      };
    case "scaffolding":
      return {
        instructions: "Create file structure and type definitions only. No business logic.",
        specData: {
          entryPoints: spec.implementation.entryPoints,
          existingPatterns: spec.context.existingPatterns,
          dependencies: spec.context.dependencies,
        },
      };
    case "implementing":
      return {
        instructions: "Implement business logic. Follow architecture decisions exactly.",
        specData: {
          functional: spec.requirements.functional,
          architectureDecisions: spec.architectureDecisions,
          antiPatterns: spec.antiPatterns,
          outOfScope: spec.requirements.outOfScope,
        },
      };
    case "testing":
      return {
        instructions: "Write tests that validate acceptance criteria.",
        specData: {
          acceptanceCriteria: spec.acceptanceCriteria,
          testStrategy: spec.implementation.testStrategy,
        },
      };
    case "reviewing":
      return {
        instructions: "Review implementation against spec. Flag any anti-pattern violations.",
        specData: {
          acceptanceCriteria: spec.acceptanceCriteria,
          antiPatterns: spec.antiPatterns,
          nonFunctional: spec.requirements.nonFunctional,
        },
      };
  }
}

The agent gets what it needs, when it needs it. Like Alakazam channeling psychic energy through its spoons: structured, focused, deliberate. You wouldn't dump an entire codebase into working memory when you only need one module. Same principle.
Pro-Tip: The reviewing phase is where the anti-patterns field pays off massively. The agent essentially audits its own work against known failure modes before you ever see the PR.
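A context slice still has to become text before an agent can read it. One possible rendering, as a rough sketch: the instructions come first, followed by each spec section as labeled JSON. `renderSlice` is a hypothetical helper, and the `##` section labels are an arbitrary choice rather than anything the framework prescribes:

```typescript
// Mirrors the ContextSlice interface used by getContextForPhase.
interface ContextSlice {
  instructions: string;
  specData: Record<string, unknown>;
}

// Hypothetical helper: flattens a slice into a single prompt string.
function renderSlice(slice: ContextSlice): string {
  const sections = Object.entries(slice.specData).map(
    ([name, value]) => `## ${name}\n${JSON.stringify(value, null, 2)}`,
  );
  return [slice.instructions, ...sections].join("\n\n");
}

// Example: a trimmed-down "testing" slice.
const prompt = renderSlice({
  instructions: "Write tests that validate acceptance criteria.",
  specData: {
    testStrategy: "Unit tests for algorithm logic",
  },
});

console.log(prompt);
```

Keeping the rendering separate from the phase selection means the prompt format can change without touching which fields each phase receives.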
The Validation Layer
Specs are only useful if you verify compliance. I built a lightweight validation layer that runs after the agent produces code. It checks structural requirements, not logical correctness — that's what tests are for.
import type { FeatureSpec } from "./spec-schema";
import { existsSync } from "fs";
import { readdir, readFile } from "fs/promises";
import { join } from "path";

interface ValidationResult {
  criterion: string;
  status: "pass" | "fail" | "skipped";
  detail?: string;
}

export async function validateAgentOutput(
  spec: FeatureSpec,
  outputDir: string,
): Promise<ValidationResult[]> {
  const results: ValidationResult[] = [];

  // Check that every entry point file exists.
  // (Existence only; this does not prove the agent modified it.)
  for (const entryPoint of spec.implementation.entryPoints) {
    const filePath = join(outputDir, entryPoint);
    results.push({
      criterion: `Entry point exists: ${entryPoint}`,
      status: existsSync(filePath) ? "pass" : "fail",
    });
  }

  // Check for anti-pattern indicators in generated code
  for (const antiPattern of spec.antiPatterns) {
    const detectionResult = await runDetection(antiPattern.detection, outputDir);
    results.push({
      criterion: `Anti-pattern avoided: ${antiPattern.pattern}`,
      status: detectionResult.found ? "fail" : "pass",
      detail: detectionResult.found ? antiPattern.reason : undefined,
    });
  }

  // Record a result for every acceptance criterion
  for (const criterion of spec.acceptanceCriteria) {
    const label = `${criterion.id}: ${criterion.description}`;
    if (criterion.validationStrategy === "type-check") {
      // Assumes the build already compiled; this function does not run tsc itself
      results.push({
        criterion: label,
        status: "pass",
        detail: "Validated by successful compilation",
      });
    } else if (criterion.automatable) {
      results.push({ criterion: label, status: "skipped", detail: "Requires test execution" });
    } else {
      // Previously these were dropped silently; report them so nothing goes missing
      results.push({ criterion: label, status: "skipped", detail: "Requires manual review" });
    }
  }

  return results;
}

async function runDetection(
  detection: string,
  outputDir: string,
): Promise<{ found: boolean }> {
  // Detection strings are regex patterns to search generated files
  const files = await getGeneratedFiles(outputDir);
  const pattern = new RegExp(detection);
  for (const file of files) {
    const content = await readFile(file, "utf-8");
    if (pattern.test(content)) {
      return { found: true };
    }
  }
  return { found: false };
}

// Not shown in the original listing; a minimal assumed implementation that
// recursively collects .ts files under a directory, skipping node_modules.
async function getGeneratedFiles(dir: string): Promise<string[]> {
  const entries = await readdir(dir, { withFileTypes: true });
  const nested = await Promise.all(
    entries.map(async (entry) => {
      const fullPath = join(dir, entry.name);
      if (entry.isDirectory()) {
        return entry.name === "node_modules" ? [] : getGeneratedFiles(fullPath);
      }
      return fullPath.endsWith(".ts") ? [fullPath] : [];
    }),
  );
  return nested.flat();
}

This catches the obvious stuff automatically. Did the agent use an in-memory cache when the spec said Redis? The same approach extends to other structural questions, such as whether it added a dependency that's explicitly out of scope or modified files outside the listed entry points, though the code above doesn't implement those two checks.
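The validator checks that each entry point exists, but it doesn't look at whether files outside that list changed. A minimal sketch of that extra check, assuming you already have the changed paths as an array (for example from `git diff --name-only`); `findFilesOutsideEntryPoints` is a hypothetical helper name:

```typescript
// Hypothetical helper: returns changed files that the spec did not
// list as entry points. Paths are compared after light normalization.
function findFilesOutsideEntryPoints(
  changedFiles: string[],
  entryPoints: string[],
): string[] {
  // Treat "./src/a.ts", "src\\a.ts", and "src/a.ts" as the same path.
  const normalize = (path: string): string =>
    path.replace(/\\/g, "/").replace(/^\.\//, "");

  const allowed = new Set(entryPoints.map(normalize));
  return changedFiles.map(normalize).filter((file) => !allowed.has(file));
}

// Example with two of the rate limiter's entry points.
const strayFiles = findFilesOutsideEntryPoints(
  ["src/middleware/rate-limiter.ts", "./src/config/rate-limit.config.ts", "package.json"],
  ["src/middleware/rate-limiter.ts", "src/config/rate-limit.config.ts"],
);

console.log(strayFiles);
```

A non-empty result doesn't have to be a failure. A touched `package.json` might be a legitimate dependency bump, so flagging it for review is probably the safer default.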
Real Example: Speccing a Rate Limiter
Let me show this end-to-end with the rate limiter I built last month. Here's the actual spec (abbreviated for readability):
const rateLimiterSpec: FeatureSpec = {
  id: "SPEC-0042",
  title: "Sliding Window Rate Limiter",
  owner: "marisi",
  status: "approved",
  context: {
    system: "api-gateway",
    topology: "12 instances behind AWS ALB, shared Redis cluster (ElastiCache)",
    existingPatterns: [
      "All middleware follows express-style (req, res, next) signature",
      "Configuration loaded from environment via @config/env package",
      "Errors use shared ApiError class with status codes",
    ],
    dependencies: ["ioredis", "@config/env", "@shared/errors"],
  },
  requirements: {
    functional: [
      "Rate limit by API key with configurable window and max requests",
      "Use sliding window algorithm (not fixed window) for fairness",
      "Return X-RateLimit-Remaining and X-RateLimit-Reset headers",
      "Support per-endpoint override configuration",
    ],
    nonFunctional: [
      "Less than 5ms p99 latency overhead per request",
      "Graceful degradation if Redis is unreachable (fail open, log warning)",
      "Must not leak memory under sustained load",
    ],
    outOfScope: [
      "Rate limiting by IP address (future spec)",
      "Admin dashboard for viewing rate limit status",
      "Dynamic rate limit adjustment without restart",
    ],
  },
  architectureDecisions: [
    {
      decision: "Use Redis sorted sets for sliding window tracking",
      rationale: "O(log N) operations, atomic with MULTI, natural TTL support",
      alternatives: ["Token bucket in Redis strings", "Fixed window with counters"],
      constraints: ["Must be accurate across all 12 instances simultaneously"],
    },
    {
      decision: "Fail open when Redis is unreachable",
      rationale: "Availability over protection: a brief Redis outage shouldn't 503 all traffic",
      alternatives: ["Fail closed", "Fall back to in-memory per-instance limiting"],
      constraints: ["SLA requires 99.95% uptime"],
    },
  ],
  acceptanceCriteria: [
    {
      id: "AC-01",
      description: "Requests beyond limit receive 429 with correct headers",
      validationStrategy: "integration-test",
      automatable: true,
    },
    {
      id: "AC-02",
      description: "Sliding window correctly expires old entries",
      validationStrategy: "unit-test",
      automatable: true,
    },
    {
      id: "AC-03",
      description: "Redis failure results in request passthrough, not error",
      validationStrategy: "integration-test",
      automatable: true,
    },
  ],
  antiPatterns: [
    {
      pattern: "In-memory rate limiting with local Map or object",
      reason: "Does not work across multiple instances behind load balancer",
      detection: "new Map|Object\\.create\\(null\\)|\\{\\}.*count",
    },
    {
      pattern: "Fixed window algorithm",
      reason: "Allows burst at window boundaries (2x limit in 1 second)",
      detection: "Math\\.floor.*Date\\.now\\(\\).*\\/.*window",
    },
    {
      pattern: "Throwing errors on Redis connection failure",
      reason: "Violates fail-open requirement, causes cascading 503s",
      detection: "throw.*redis|throw.*connection",
    },
  ],
  implementation: {
    entryPoints: [
      "src/middleware/rate-limiter.ts",
      "src/middleware/rate-limiter.test.ts",
      "src/config/rate-limit.config.ts",
    ],
    testStrategy: "Unit tests for algorithm logic, integration tests with Redis test container",
  },
};

I loaded this spec, phased the context, and the agent nailed it on the first pass. Sorted sets in Redis. Fail-open with a try/catch around the Redis call. Proper headers. No anti-patterns triggered in validation.
That had never happened before with a natural language prompt alone.
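The spec rejects fixed windows because they allow a burst at window boundaries. A small worked example makes that concrete. This is a pure counting sketch over timestamps, not the Redis implementation; the function names and the 100-requests-per-minute limit are assumptions for illustration:

```typescript
// Fixed window: count requests in the same aligned bucket as `now`.
function fixedWindowCount(timestamps: number[], now: number, windowMs: number): number {
  const bucket = Math.floor(now / windowMs);
  return timestamps.filter((t) => Math.floor(t / windowMs) === bucket).length;
}

// Sliding window: count requests in the trailing interval (now - windowMs, now].
function slidingWindowCount(timestamps: number[], now: number, windowMs: number): number {
  return timestamps.filter((t) => t > now - windowMs && t <= now).length;
}

const windowMs = 60000; // one minute
const limit = 100; // assumed limit per window

// 100 requests at 59.5s and 100 more at 60.5s: one second apart,
// but on opposite sides of a fixed-window boundary.
const requests: number[] = [
  ...Array.from({ length: 100 }, () => 59500),
  ...Array.from({ length: 100 }, () => 60500),
];
const now = 60500;

const fixed = fixedWindowCount(requests, now, windowMs); // sees only the new bucket
const sliding = slidingWindowCount(requests, now, windowMs); // sees both bursts

console.log({ limit, fixed, sliding });
```

The fixed counter reports 100, so all 200 requests fit under the limit even though they arrived within a second of each other. The sliding count reports 200, which is why a sliding-window limiter would have started rejecting partway through the second burst.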
Pro-Tip: Write your anti-pattern detection regexes to be slightly overzealous. False positives during validation are cheap — they just trigger a manual review. False negatives mean bugs in production.
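To see what "overzealous" looks like in practice, it can help to run the detection strings against a few sample lines before trusting them. A sketch using two patterns from the spec above; the sample code lines are made up for illustration:

```typescript
// Detection strings copied from the rate limiter spec's antiPatterns.
const inMemoryPattern = new RegExp("new Map|Object\\.create\\(null\\)|\\{\\}.*count");
const throwOnRedisPattern = new RegExp("throw.*redis|throw.*connection");

// Hypothetical lines an agent might produce.
const samples = {
  localCounter: "const hits = new Map<string, number>();",
  routeTable: "const routes = new Map<string, Handler>();",
  lowercaseThrow: 'throw new Error("redis unavailable");',
  capitalizedThrow: "throw new RedisUnavailableError();",
};

// The intended catch: a per-instance counter.
const catchesCounter = inMemoryPattern.test(samples.localCounter);

// A false positive: an unrelated Map still matches and goes to manual review.
const flagsRouteTable = inMemoryPattern.test(samples.routeTable);

// A false negative: the pattern is case-sensitive, so "Redis" slips through.
const catchesLowercase = throwOnRedisPattern.test(samples.lowercaseThrow);
const catchesCapitalized = throwOnRedisPattern.test(samples.capitalizedThrow);

// One way to widen the net: compile with the "i" flag.
const loosePattern = new RegExp("throw.*redis|throw.*connection", "i");
const looseCatchesCapitalized = loosePattern.test(samples.capitalizedThrow);

console.log({
  catchesCounter,
  flagsRouteTable,
  catchesLowercase,
  catchesCapitalized,
  looseCatchesCapitalized,
});
```

The route-table match is the cheap kind of error. The capitalized throw is the expensive kind, and since `runDetection` compiles patterns without flags, case sensitivity is worth keeping in mind when writing detection strings.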
The Numbers: No Spec vs. Markdown vs. Typed Spec
I tracked outcomes across 30 features over three months. Same complexity tier, same agent, same codebase.
| Approach | First-pass success rate | Avg. revision rounds | Architecture violations | Time to merge |
|----------|-------------------------|----------------------|-------------------------|---------------|
| No spec (prompt only) | 23% | 3.4 | 67% | 4.2 hours |
| Markdown spec | 52% | 2.1 | 31% | 2.8 hours |
| Typed spec framework | 81% | 1.2 | 8% | 1.4 hours |
The jump from markdown to typed specs surprised me. I expected marginal improvement. Instead, the structured format forced me to think through decisions I'd been leaving implicit. The spec framework doesn't just help the agent — it helps me clarify my own thinking.
Integrating with Planning Tools
I don't write specs in a vacuum. Most features start as Linear tickets or GitHub issues. The bridge is straightforward:
import type { FeatureSpec } from "./spec-schema";

interface LinearTicket {
  id: string;
  title: string;
  description: string;
  labels: string[];
  project: { name: string };
}

export function scaffoldSpecFromTicket(ticket: LinearTicket): Partial<FeatureSpec> {
  // Keep only the digits and zero-pad so the id matches /^SPEC-\d{4}$/.
  // Assumes ticket.id carries a numeric suffix such as "ENG-123";
  // a plain slice(-4) would yield "-123" and fail the schema.
  const ticketNumber = ticket.id.replace(/\D/g, "").slice(-4).padStart(4, "0");

  return {
    id: `SPEC-${ticketNumber}`,
    title: ticket.title,
    status: "draft",
    context: {
      system: ticket.project.name,
      topology: "", // Must be filled manually
      existingPatterns: [],
      dependencies: [],
    },
    requirements: {
      functional: extractBulletPoints(ticket.description),
      nonFunctional: [],
      outOfScope: [],
    },
    architectureDecisions: [],
    acceptanceCriteria: [],
    antiPatterns: [],
    implementation: {
      entryPoints: [],
      testStrategy: "",
    },
  };
}

function extractBulletPoints(markdown: string): string[] {
  return markdown
    .split("\n")
    // Trim first so indented bullets are both detected and stripped
    .map((line) => line.trim())
    .filter((line) => /^[-*]\s/.test(line))
    .map((line) => line.replace(/^[-*]\s+/, ""));
}

The scaffold gives me a starting point. I fill in the gaps (topology, anti-patterns, architecture decisions), which takes maybe ten minutes. That ten minutes saves hours of back-and-forth with the agent later.
The workflow looks like: Linear ticket exists, I run the scaffold, fill in the typed spec, then hand it to the agent phase by phase. When it's done, the validation layer confirms compliance, and I review the diff knowing the structural concerns are already handled.
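Since the scaffold returns a partial spec, it can be handy to list what is still missing before attempting a full schema parse. A rough sketch, assuming a simplified draft shape; `listGaps` and `DraftSpec` are hypothetical names. The owner and the three arrays are required by the schema (`.min(1)` in the array cases), while topology is flagged only because the scaffold leaves it blank, even though the schema itself accepts an empty string:

```typescript
// Simplified view of the draft returned by scaffoldSpecFromTicket (assumption).
interface DraftSpec {
  owner?: string;
  context?: { topology: string };
  requirements?: { functional: string[] };
  architectureDecisions?: unknown[];
  acceptanceCriteria?: unknown[];
}

// Hypothetical helper: names the fields a human still has to fill in.
function listGaps(draft: DraftSpec): string[] {
  const gaps: string[] = [];
  if (!draft.owner) gaps.push("owner");
  if (!draft.context?.topology.trim()) gaps.push("context.topology");
  if (!draft.requirements?.functional.length) gaps.push("requirements.functional");
  if (!draft.architectureDecisions?.length) gaps.push("architectureDecisions");
  if (!draft.acceptanceCriteria?.length) gaps.push("acceptanceCriteria");
  return gaps;
}

// Example: a freshly scaffolded draft whose ticket had one bullet point.
const gaps = listGaps({
  context: { topology: "" },
  requirements: { functional: ["Rate limit by API key"] },
  architectureDecisions: [],
  acceptanceCriteria: [],
});

console.log(gaps);
```

An empty list is only a hint that the draft is ready for the real schema parse, which remains the authority on whether the spec is valid.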
What This Actually Changes About Your Workflow
The biggest shift isn't technical. It's that you stop thinking of the agent as a conversational partner and start thinking of it as an executor with a contract. You wouldn't hire a contractor without blueprints. You wouldn't deploy a service without a schema. Why would you ask an agent to build production features without a typed specification?
The framework has rough edges. Detection regexes are brittle. The phasing logic assumes linear progression when real implementation is messier. I'm working on a graph-based phase system that handles backtracking.
But even in its current state, it's the single biggest improvement I've made to my agent-assisted workflow. Like giving Alakazam its spoons — the psychic power was always there, it just needed a structured channel to flow through.