Stop Your AI Agent from Cutting Corners with Agent Skills

The Problem: Agents Optimize for Speed, Not Quality

You ask your AI coding agent to add a feature. It writes the code in seconds, the tests pass, and it opens a PR. Impressive. Then you read the diff.

There is no input validation. The API contract was changed without a migration plan. The component renders on every keystroke because no one told the agent about memoization. The agent did exactly what you asked, and nothing more.

The instinctive fix is to write longer prompts. "Make sure you add tests." "Follow best practices." This works once, for one task. It does not scale across a team, across sessions, or across the dozen concerns a senior engineer holds in their head simultaneously.

Agent Skills by Addy Osmani is an open-source collection of 20 production-grade engineering skills for AI coding agents. Each skill is a SKILL.md file that encodes a step-by-step workflow with checkpoints, anti-rationalization tables, and non-negotiable verification requirements. This post walks through how they work and what they look like in practice.


Installing Agent Skills

Claude Code via the marketplace:

/plugin marketplace add addyosmani/agent-skills
/plugin install agent-skills@addy-agent-skills

Local development:

git clone https://github.com/addyosmani/agent-skills.git
claude --plugin-dir /path/to/agent-skills

Cursor - copy any SKILL.md directly into .cursor/rules/:

cp agent-skills/skills/test-driven-development/SKILL.md .cursor/rules/

OpenCode - paste the skill contents into your AGENTS.md. The skill tool loads them into the active session on demand.

Gemini CLI:

gemini skills install https://github.com/addyosmani/agent-skills.git --path skills

Skills are plain Markdown. They work with any agent that accepts system prompts or instruction files.


What a Skill Actually Is

Every SKILL.md follows a consistent anatomy:

SKILL.md
├── Frontmatter   → name, description, trigger conditions
├── Overview      → what this skill does in one paragraph
├── When to Use   → exact triggering conditions (and when NOT to)
├── Process       → numbered steps with explicit exit criteria
├── Rationalizations → common excuses + documented counter-arguments
├── Red Flags     → signs something is wrong
└── Verification  → required evidence - tests, build output, runtime data

The key difference from a prompt is that skills are workflows, not advice. Each step has an exit condition, and the agent is instructed not to advance until it satisfies that gate. The verification section requires actual proof, not "I believe this works."
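
To make the anatomy concrete, here is a minimal sketch of a linter that reports which of those sections a draft skill lacks. It is not part of Agent Skills. The helper name `findMissingSections`, the sample draft, and the assumption that each section appears as a markdown heading are all illustrative.

```typescript
// Section names are taken from the anatomy above. Matching them against
// markdown headings is an assumption about how a SKILL.md is laid out.
const REQUIRED_SECTIONS = [
  "Overview",
  "When to Use",
  "Process",
  "Rationalizations",
  "Red Flags",
  "Verification",
] as const;

// Hypothetical helper: returns the required sections with no matching heading.
function findMissingSections(skillMarkdown: string): string[] {
  // Collect the text of every markdown heading (any level), lowercased.
  const headings = skillMarkdown
    .split("\n")
    .filter((line) => /^#{1,6}\s/.test(line))
    .map((line) => line.replace(/^#{1,6}\s+/, "").trim().toLowerCase());

  return REQUIRED_SECTIONS.filter(
    (section) => !headings.some((heading) => heading.includes(section.toLowerCase())),
  );
}

// Made-up draft that skips two sections.
const draftSkill = [
  "---",
  "name: example-skill",
  "description: Illustrative only",
  "---",
  "## Overview",
  "## When to Use",
  "## Process",
  "## Verification",
].join("\n");

console.log(findMissingSections(draftSkill)); // Rationalizations and Red Flags are reported
```

A check like this only proves the sections exist. Whether the steps have real exit criteria still takes a human read.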


Skill 1: Test-Driven Development

The test-driven-development skill enforces the Red-Green-Refactor cycle. Here is what the agent is instructed to follow:

    RED                GREEN              REFACTOR
 Write a test    Write minimal code    Clean up the
 that fails  ──→  to make it pass  ──→  implementation  ──→  (repeat)
      │                  │                    │
      ▼                  ▼                    ▼
   Test FAILS        Test PASSES         Tests still PASS

Step 1 - RED: Write a failing test first

// This test must fail. A test that passes immediately proves nothing.
describe("TaskService", () => {
  it("creates a task with title and default status", async () => {
    const task = await taskService.createTask({ title: "Buy groceries" });

    expect(task.id).toBeDefined();
    expect(task.title).toBe("Buy groceries");
    expect(task.status).toBe("pending");
    expect(task.createdAt).toBeInstanceOf(Date);
  });
});

Step 2 - GREEN: Write the minimum code to pass

// Task, generateId, and db are assumed to be defined elsewhere in the project.
export async function createTask(input: { title: string }): Promise<Task> {
  const task = {
    id: generateId(),
    title: input.title,
    status: "pending" as const,
    createdAt: new Date(),
  };
  await db.tasks.insert(task);
  return task;
}

No over-engineering. No speculative abstractions. Only what the failing test requires.

Step 3 - REFACTOR: Clean up with tests green

Extract shared logic, improve naming, remove duplication. Run the tests after every step to confirm nothing broke.
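
One possible refactor, sketched under assumptions: the article's `db` and `generateId` are not shown, so an in-memory `Map` and a counter stand in for them here. The extracted `buildTask` function and the `rerunCreateTaskTest` helper are hypothetical names, not something the skill prescribes.

```typescript
// Hypothetical in-memory stand-ins for the article's db and generateId.
type Task = {
  id: string;
  title: string;
  status: "pending" | "completed";
  createdAt: Date;
};

const taskStore = new Map<string, Task>();
let nextTaskNumber = 1;

// Extracted during REFACTOR: a pure function that builds the record, so the
// caller supplies the ID and the clock and tests can be deterministic.
function buildTask(title: string, id: string, now: Date): Task {
  return { id, title, status: "pending", createdAt: now };
}

async function createTask(input: { title: string }): Promise<Task> {
  const task = buildTask(input.title, `task-${nextTaskNumber++}`, new Date());
  taskStore.set(task.id, task);
  return task;
}

// The same assertions as the RED test, rerun after the refactor.
async function rerunCreateTaskTest(): Promise<boolean> {
  const task = await createTask({ title: "Buy groceries" });
  return (
    task.id !== undefined &&
    task.title === "Buy groceries" &&
    task.status === "pending" &&
    task.createdAt instanceof Date
  );
}

rerunCreateTaskTest().then((passed) =>
  console.log(passed ? "still green" : "refactor broke the test"),
);
```

The behavior the test observes is unchanged. Only the internal structure moved, which is the point of the step.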

The Prove-It Pattern for bug fixes

The skill also enforces a specific bug-fix workflow. The agent is not allowed to start with a fix. It must first write a reproduction test:

// Bug: "Completing a task doesn't update the completedAt timestamp"

// Step 1: Write the reproduction test - it must FAIL
it("sets completedAt when task is completed", async () => {
  const task = await taskService.createTask({ title: "Test" });
  const completed = await taskService.completeTask(task.id);

  expect(completed.status).toBe("completed");
  expect(completed.completedAt).toBeInstanceOf(Date); // fails → bug confirmed
});

// Step 2: Fix the bug
export async function completeTask(id: string): Promise<Task> {
  return db.tasks.update(id, {
    status: "completed",
    completedAt: new Date(), // this was missing
  });
}

// Step 3: Test passes → fix proven, regression guarded

The anti-rationalization table

Every skill includes a table of excuses the agent might use to skip a step, with pre-loaded counter-arguments. This is the most important design detail in the whole library.

| Rationalization | Reality |
| --------------------------------------- | ----------------------------------------------------------------------------------- |
| "I'll write tests after the code works" | You won't. Tests written after the fact test implementation, not behavior. |
| "This is too simple to test" | Simple code gets complicated. The test documents expected behavior. |
| "I tested it manually" | Manual testing doesn't persist. Tomorrow's change may break it with no way to know. |
| "It's just a prototype" | Prototypes become production code. Test debt accumulates from day one. |

Without this table, an agent under time pressure will rationalize its way out of any process. With it, the counter-argument is pre-loaded into its context.

Verification - the non-negotiable exit criteria

The skill ends with a checklist the agent must satisfy before considering work done:

- [ ] Every new behavior has a corresponding test
- [ ] All tests pass: npm test
- [ ] Bug fixes include a reproduction test that failed before the fix
- [ ] Test names describe the behavior being verified
- [ ] No tests were skipped or disabled
- [ ] Coverage hasn't decreased (if tracked)

"Seems right" is explicitly disqualified as evidence.
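
If you wanted to check such a checklist mechanically, for example in a CI step that reads the agent's final report, a minimal sketch might look like this. The helper names and the sample report are hypothetical, and the skill itself does not ship this code.

```typescript
type ChecklistSummary = { total: number; unchecked: string[] };

// Hypothetical helper: parses "- [ ]" and "- [x]" items from markdown text.
function summarizeChecklist(checklistMarkdown: string): ChecklistSummary {
  const summary: ChecklistSummary = { total: 0, unchecked: [] };
  for (const line of checklistMarkdown.split("\n")) {
    const match = /^\s*- \[([ xX])\]\s+(.+)$/.exec(line);
    if (match === null) continue; // not a checklist item
    summary.total += 1;
    if (match[1] === " ") summary.unchecked.push((match[2] ?? "").trim());
  }
  return summary;
}

function isWorkDone(checklistMarkdown: string): boolean {
  const { total, unchecked } = summarizeChecklist(checklistMarkdown);
  // An empty checklist counts as "not done": no evidence was offered.
  return total > 0 && unchecked.length === 0;
}

// Made-up report in which one item is still open.
const agentReport = [
  "- [x] Every new behavior has a corresponding test",
  "- [x] All tests pass: npm test",
  "- [ ] Bug fixes include a reproduction test that failed before the fix",
  "- [x] Test names describe the behavior being verified",
].join("\n");

console.log(isWorkDone(agentReport)); // false
console.log(summarizeChecklist(agentReport).unchecked);
```

A ticked box is still only a claim. The check is useful for catching skipped items, not for replacing the test output behind them.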


Skill 2: Spec-Driven Development

The spec-driven-development skill enforces a gated four-phase workflow. The agent cannot advance to the next phase without human approval:

SPECIFY ──→ PLAN ──→ TASKS ──→ IMPLEMENT
   │          │        │          │
   ▼          ▼        ▼          ▼
 Human      Human    Human      Human
 reviews    reviews  reviews    reviews
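
The gating idea can be modeled as a small state machine. This is a sketch, not how the skill is implemented: the skill is a Markdown instruction file, and the `GatedWorkflow` class below is a hypothetical illustration of the rule "no approval, no next phase."

```typescript
const PHASES = ["SPECIFY", "PLAN", "TASKS", "IMPLEMENT"] as const;
type Phase = (typeof PHASES)[number];

// Hypothetical model of the four-phase gate.
class GatedWorkflow {
  private currentPhase: Phase = "SPECIFY";
  private readonly approvedPhases = new Set<Phase>();

  get phase(): Phase {
    return this.currentPhase;
  }

  // Called when a human has reviewed the output of the current phase.
  approveCurrentPhase(): void {
    this.approvedPhases.add(this.currentPhase);
  }

  // Moves to the next phase, or throws if the current one is unapproved.
  advance(): Phase {
    if (!this.approvedPhases.has(this.currentPhase)) {
      throw new Error(`Cannot leave ${this.currentPhase}: no human approval recorded`);
    }
    const next = PHASES[PHASES.indexOf(this.currentPhase) + 1];
    if (next !== undefined) this.currentPhase = next; // stay put after IMPLEMENT
    return this.currentPhase;
  }
}

const workflow = new GatedWorkflow();
try {
  workflow.advance(); // the spec has not been reviewed yet
} catch (error) {
  console.log((error as Error).message);
}
workflow.approveCurrentPhase();
console.log(workflow.advance()); // "PLAN"
```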

Before writing any spec content, the agent is required to surface its assumptions explicitly:

ASSUMPTIONS I'M MAKING:
1. This is a web application (not native mobile)
2. Authentication uses session-based cookies (not JWT)
3. The database is PostgreSQL (based on existing Prisma schema)
4. We're targeting modern browsers only (no IE11)
→ Correct me now or I'll proceed with these.

Vague requirements get reframed as concrete, testable success criteria:

REQUIREMENT: "Make the dashboard faster"

REFRAMED SUCCESS CRITERIA:
- Dashboard LCP < 2.5s on 4G connection
- Initial data load completes in < 500ms
- No layout shift during load (CLS < 0.1)
→ Are these the right targets?

This single habit eliminates an entire category of rework: building the wrong thing correctly.
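
Once the criteria are numeric, checking them is trivial. A minimal sketch, using the three targets from the dashboard example: the measured values are made up for illustration, and `evaluateCriteria` is a hypothetical helper.

```typescript
type Criterion = { label: string; measured: number; limit: number };

// Hypothetical helper: a criterion passes when the measurement is under its limit.
function evaluateCriteria(criteria: Criterion[]): { label: string; passed: boolean }[] {
  return criteria.map(({ label, measured, limit }) => ({
    label,
    passed: measured < limit,
  }));
}

// Limits come from the reframed criteria above. The measurements are
// invented; real values would come from profiling or field data.
const dashboardRun: Criterion[] = [
  { label: "LCP on 4G (ms)", measured: 2100, limit: 2500 },
  { label: "Initial data load (ms)", measured: 640, limit: 500 },
  { label: "CLS", measured: 0.04, limit: 0.1 },
];

for (const result of evaluateCriteria(dashboardRun)) {
  console.log(`${result.passed ? "PASS" : "FAIL"}  ${result.label}`);
}
```

"Make the dashboard faster" cannot fail a check like this. "Initial data load completes in < 500ms" can, and that is what makes it a success criterion.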


Skill 3: Code Review and Quality

The code-review-and-quality skill gives the agent a five-axis review model it must apply to every change before merging:

| Axis | What to check |
| ---------------- | ------------------------------------------------------------------------ |
| Correctness | Edge cases, error paths, off-by-one errors, race conditions |
| Readability | Clear names, straightforward control flow, no "clever" tricks |
| Architecture | Follows existing patterns, clean module boundaries, no circular deps |
| Security | Input validation, no secrets in code, parameterized queries, auth checks |
| Performance | N+1 queries, unbounded loops, missing pagination, unnecessary re-renders |

Every review comment gets a severity label so the author knows what is required versus optional:

| Prefix | Meaning | Action |
| ------------------------- | --------------- | ---------------------------------------------- |
| (no prefix) | Required | Must address before merge |
| Critical: | Blocks merge | Security hole, data loss, broken functionality |
| Nit: | Minor, optional | Author may ignore |
| Optional: / Consider: | Suggestion | Worth considering, not required |
| FYI | Informational | No action needed |
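
Because the labels are plain prefixes, tooling can read them too. Here is a minimal sketch that decides whether a comment blocks a merge. The function names are hypothetical, and matching prefixes case-insensitively is an assumption rather than something the skill specifies.

```typescript
type Severity = "critical" | "required" | "nit" | "optional" | "fyi";

// Hypothetical helper: maps a review comment to a severity from the table above.
function classifyReviewComment(text: string): Severity {
  const normalized = text.trim().toLowerCase();
  if (normalized.startsWith("critical:")) return "critical";
  if (normalized.startsWith("nit:")) return "nit";
  if (normalized.startsWith("optional:") || normalized.startsWith("consider:")) {
    return "optional";
  }
  if (normalized.startsWith("fyi")) return "fyi";
  return "required"; // no prefix means the author must address it
}

function blocksMerge(text: string): boolean {
  const severity = classifyReviewComment(text);
  return severity === "critical" || severity === "required";
}

// Made-up review comments.
const reviewComments = [
  "Critical: user input is concatenated into the SQL string",
  "Nit: prefer const here",
  "This handler swallows the error",
  "FYI this module is slated for removal",
];

console.log(reviewComments.filter((comment) => blocksMerge(comment)).length); // 2
```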

The skill also includes a target for change size:

~100 lines changed   → Good. Reviewable in one sitting.
~300 lines changed   → Acceptable for a single logical change.
~1000 lines changed  → Too large. Split it.
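
These targets translate into a simple pre-review check. In the sketch below the exact cutoffs are assumptions: the skill gives approximate sizes, and it says nothing about the range between 300 and 1000 lines, which is labeled "consider-splitting" here as a guess.

```typescript
type SizeVerdict = "good" | "acceptable" | "consider-splitting" | "too-large";

// Hypothetical helper. Cutoffs are one reading of the approximate targets above.
function classifyChangeSize(linesChanged: number): SizeVerdict {
  if (linesChanged <= 100) return "good"; // reviewable in one sitting
  if (linesChanged <= 300) return "acceptable"; // fine for a single logical change
  if (linesChanged < 1000) return "consider-splitting"; // assumption: not covered by the skill
  return "too-large"; // split it
}

for (const size of [80, 250, 640, 1200]) {
  console.log(`${size} lines: ${classifyChangeSize(size)}`);
}
```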

And the same anti-rationalization table pattern applies here:

| Rationalization | Reality |
| ------------------------------------ | -------------------------------------------------------------------------------------- |
| "It works, that's good enough" | Working code that is unreadable or insecure creates debt that compounds. |
| "AI-generated code is probably fine" | AI code needs more scrutiny, not less. It is confident and plausible, even when wrong. |
| "The tests pass, so it's good" | Tests don't catch architecture problems, security issues, or readability concerns. |


How the Lifecycle Fits Together

The 20 skills map directly onto the software development lifecycle, activated by seven slash commands:

  DEFINE    PLAN     BUILD    VERIFY   REVIEW    SHIP
 ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
 │ Spec │ │ Plan │ │ Code │ │ Test │ │  QA  │ │  Go  │
 └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘
  /spec    /plan    /build   /test    /review   /ship

Skills also activate automatically based on context. Designing an API triggers api-and-interface-design. Building UI triggers frontend-ui-engineering. The agent does not need to be told to apply them.

The collection encodes specific patterns from Software Engineering at Google: Hyrum's Law in api-and-interface-design, the Beyoncé Rule ("if you liked it, you should have put a test on it") in test-driven-development, Chesterton's Fence in code-simplification, trunk-based development in git-workflow-and-versioning, and Shift Left with feature flags in ci-cd-and-automation. These are not abstract principles. They are embedded directly into the step-by-step workflows the agent follows.

The result is an agent that behaves less like an autocomplete engine and more like a disciplined engineer who knows when to stop, check the spec, and ask whether the verification criteria have actually been met.