How I Achieve Continuous Delivery With Confidence

I used to feel it on Thursdays. A branch had been open for four days, it had grown into an 800-line diff, and the only way to land it was to merge it and hope. The test suite was green, but the tests were all unit tests - isolated functions, mocked dependencies, no concept of the application as a whole. Nobody had run the actual flows end-to-end. Nobody had verified that the API endpoints still returned the right shapes after the refactor. Nobody had checked whether the shared component we changed had quietly broken a screen on the other side of the app.

That anxiety is not inevitable. It is a symptom of specific practices: large branches, deferred integration, tests that cover the parts but not the whole. Here is the CI loop I run instead, from the first commit to the merge button.


The Branch Discipline That Makes Everything Else Work

Before any tooling: branch size.

I keep branches open for one to two days at most. Each branch maps to exactly one shippable behaviour - one endpoint added, one component updated, one migration applied, one feature flag enabled. Not a week of work shaped into a single PR on a Friday morning.

This is a forcing function, not a restriction. When a branch has to close in 48 hours, there is no room to scope-creep it. The PR becomes reviewable in under fifteen minutes because the diff is genuinely small and the intent is obvious. Merge conflicts become rare because the branch does not live long enough to drift from main.

Every tool in this stack performs better on small branches. Cypress runs in under three minutes on a focused change. Chromatic surfaces two or three meaningful visual diffs rather than forty. A visual regression on a small diff is straightforward to reason about - you know exactly what you changed and you can see whether the visual delta matches your intent. On a week-old 800-line branch, the same diff is noise.
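One way to keep that discipline honest is a diff budget checked before a PR opens. The sketch below is hypothetical and not part of the stack described here: the function names and the 400-line budget are assumptions, and it only parses the summary line that `git diff --shortstat` prints.

```javascript
// Parse the summary line printed by `git diff --shortstat`, for example:
// " 3 files changed, 42 insertions(+), 7 deletions(-)"
// Parts that are absent (no deletions, empty diff) count as zero.
function parseShortstat(text) {
  const pick = (re) => Number((text.match(re) || [0, 0])[1]);
  return {
    files: pick(/(\d+) files? changed/),
    insertions: pick(/(\d+) insertions?\(\+\)/),
    deletions: pick(/(\d+) deletions?\(-\)/),
  };
}

// True when the total number of changed lines stays within the budget.
function withinBudget(stat, maxLines) {
  return stat.insertions + stat.deletions <= maxLines;
}

// Example with an assumed budget of 400 changed lines.
const stat = parseShortstat(" 3 files changed, 42 insertions(+), 7 deletions(-)");
console.log(withinBudget(stat, 400) ? "diff is within budget" : "diff is over budget");
```

In practice the input would come from something like `git diff --shortstat origin/main...HEAD`, and the script would exit non-zero when the budget is exceeded.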

The branch discipline is not the cherry on top. It is the foundation the rest of this sits on.


Testing the Back: Cypress for Server Behaviour

Most people reach for Cypress as a browser testing tool. That is a fair use. But I get just as much value from it at the API layer - exercising the server directly before the browser is ever involved.

cy.request() sends HTTP calls straight to your running server from the Cypress process, bypassing the page under test. No page load. No DOM. Just the request lifecycle: middleware, auth guards, database writes, response serialisation.

// cypress/e2e/api/orders.cy.js
describe("Orders API", () => {
  it("returns 401 for unauthenticated requests", () => {
    cy.request({
      method: "GET",
      url: "/api/orders",
      failOnStatusCode: false,
    }).then((res) => {
      expect(res.status).to.eq(401);
    });
  });

  it("creates an order and returns the expected shape", () => {
    cy.loginByApi(); // custom command - POST /api/auth/session with test credentials

    cy.request("POST", "/api/orders", {
      items: [{ productId: "prod_123", quantity: 2 }],
      currency: "GBP",
    }).then((res) => {
      expect(res.status).to.eq(201);
      expect(res.body).to.have.property("orderId");
      expect(res.body.status).to.eq("pending");
    });
  });
});
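The cy.loginByApi() command used above is custom, so its body is not shown. A minimal sketch of what it might look like, assuming the session endpoint accepts an email and password as JSON and responds with a session cookie. The body field names and the TEST_USER_EMAIL / TEST_USER_PASSWORD env keys are placeholders, not the real contract.

```javascript
// cypress/support/commands.js (sketch)

// Build the options object for the login request.
// Field names are assumed; adjust them to the real session endpoint.
function sessionRequest(env) {
  return {
    method: "POST",
    url: "/api/auth/session",
    body: {
      email: env.TEST_USER_EMAIL,
      password: env.TEST_USER_PASSWORD,
    },
  };
}

// Register the command only when running inside Cypress,
// so the helper above can also be loaded in plain Node.
if (typeof Cypress !== "undefined") {
  Cypress.Commands.add("loginByApi", () => {
    // cy.request() fails on non-2xx responses by default and keeps any
    // cookies the server sets, so later cy.request() calls are authenticated.
    cy.request(sessionRequest(Cypress.env()));
  });
}
```

Keeping the request builder separate from the command registration makes the credentials wiring easy to check without starting Cypress.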

These tests catch a different class of failure: a changed response shape, a missing auth guard, a 200 where the contract says 201. They run fast - no page loads, no paint events - and when one fails, the location of the problem is unambiguous. The issue is server-side and I know it before I run anything else.

The other reason I write API tests first is that they define the contract the browser tests work against. When I wire cy.intercept() into the E2E suite, the fixture I use is not a guess - it mirrors a shape I have already confirmed the real server returns.
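To keep a fixture from drifting away from the real response, the two can be compared structurally. The helper below is a hypothetical sketch: shapeOf and sameShape are illustrative names, and it assumes arrays are homogeneous, so only the first element is inspected.

```javascript
// Reduce a JSON value to its structure: keys and primitive type names only.
function shapeOf(value) {
  if (Array.isArray(value)) {
    // Represent an array by the shape of its first element (assumed homogeneous).
    return value.length ? [shapeOf(value[0])] : [];
  }
  if (value !== null && typeof value === "object") {
    const out = {};
    // Sort keys so that property order does not affect the comparison.
    for (const key of Object.keys(value).sort()) {
      out[key] = shapeOf(value[key]);
    }
    return out;
  }
  return value === null ? "null" : typeof value;
}

// True when both values have the same keys and the same primitive types.
function sameShape(a, b) {
  return JSON.stringify(shapeOf(a)) === JSON.stringify(shapeOf(b));
}

// Possible use inside an API spec (not executed here):
// cy.fixture("order-success.json").then((fixture) => {
//   cy.request("POST", "/api/orders", payload).then((res) => {
//     expect(sameShape(res.body, fixture)).to.eq(true);
//   });
// });
```

The comparison ignores values on purpose: the fixture can carry any order ID it likes, as long as the keys and types match what the server sends.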

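// cypress/e2e/checkout.cy.js
// cy.findByRole / cy.findByLabelText / cy.findByText come from @testing-library/cypress
// (import "@testing-library/cypress/add-commands" in the support file).
// cy.login() is a custom command, like cy.loginByApi() above.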
describe("Checkout flow", () => {
  beforeEach(() => {
    cy.login();
    cy.intercept("POST", "/api/orders", { fixture: "order-success.json" }).as(
      "createOrder",
    );
  });

  it("submits the order and shows confirmation", () => {
    cy.visit("/cart");
    cy.findByRole("button", { name: /proceed to checkout/i }).click();
    cy.findByLabelText("Card number").type("4242424242424242");
    cy.findByLabelText("Expiry").type("12/28");
    cy.findByLabelText("CVC").type("123");
    cy.findByRole("button", { name: /place order/i }).click();

    cy.wait("@createOrder").its("request.body").should("deep.include", {
      currency: "GBP",
    });

    cy.findByText("Order confirmed").should("be.visible");
    cy.url().should("include", "/orders/");
  });
});

The split is deliberate: API tests confirm the contract, browser tests confirm the experience against a known-good contract. In CI I run the API suite first. If an endpoint is broken, there is no point running the full browser flow - the failure is faster to find and the fix location is obvious.

# .github/workflows/ci.yml
- name: Cypress API tests
  run: npx cypress run --spec "cypress/e2e/api/**"
  env:
    CYPRESS_BASE_URL: http://localhost:3000

- name: Cypress E2E tests
  run: ELECTRON_EXTRA_LAUNCH_ARGS=--remote-debugging-port=9222 npx cypress run --spec "cypress/e2e/**"
  env:
    CYPRESS_BASE_URL: http://localhost:3000

Shipping Delight: Chromatic for Visual Confidence

Cypress tells me the application behaves correctly. It does not tell me the application looks correct. That gap is where the expensive regressions live - the shared <Button> with a different line-height after a CSS variable change, the card layout that starts overflowing once the product title is three words longer than the test fixture, the dark mode variant that shipped without anyone looking at it.

I fill that gap with Chromatic. It integrates directly into the Cypress run: while Cypress executes, Chromatic communicates with the browser over Chrome DevTools Protocol and captures a full archive of every page the tests visit - DOM, CSS, fonts, assets. Those archives are uploaded, rendered into snapshots, and pixel-diffed against the baseline from the last accepted build.
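To make "pixel-diffed" concrete, here is a deliberately naive sketch of comparing two snapshots. It is not Chromatic's algorithm, which the service does not expose here; it assumes both images are already decoded into same-sized RGBA byte arrays, and countChangedPixels is an illustrative name.

```javascript
// Count pixels that differ between two same-sized RGBA buffers.
// `threshold` is the per-channel difference tolerated before a pixel counts as changed.
function countChangedPixels(baseline, candidate, threshold = 0) {
  if (baseline.length !== candidate.length || baseline.length % 4 !== 0) {
    throw new Error("Buffers must be RGBA and the same size");
  }
  let changed = 0;
  for (let i = 0; i < baseline.length; i += 4) {
    for (let c = 0; c < 4; c++) {
      if (Math.abs(baseline[i + c] - candidate[i + c]) > threshold) {
        changed++;
        break; // one differing channel is enough to mark the pixel
      }
    }
  }
  return changed;
}

// Two 2-pixel images: the second pixel's green channel differs by 5.
const before = new Uint8Array([0, 0, 0, 255, 10, 10, 10, 255]);
const after = new Uint8Array([0, 0, 0, 255, 10, 15, 10, 255]);
console.log(countChangedPixels(before, after)); // prints 1
```

A real visual testing service layers anti-aliasing tolerance and baseline tracking on top of a comparison like this, which is why the baseline choice matters so much.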

bun add --dev chromatic @chromatic-com/cypress

// cypress/support/e2e.js
import "@chromatic-com/cypress/support";

// cypress.config.js
const { defineConfig } = require("cypress");
const { installPlugin } = require("@chromatic-com/cypress");

module.exports = defineConfig({
  e2e: {
    setupNodeEvents(on, config) {
      installPlugin(on, config);
    },
  },
});

In CI, Chromatic processes the archives after the Cypress run completes:

- name: Chromatic visual review
  run: npx chromatic --cypress -t=${{ secrets.CHROMATIC_PROJECT_TOKEN }} --exit-zero-on-changes=false

What this produces is a review queue that is specific and actionable. Not "something changed visually" - but "the checkout confirmation screen changed visually in this test, here is the before, here is the after, here are the exact pixels that shifted." I review it, decide whether the change is intentional, and accept or reject in one click.

The delight I want to highlight is the accept flow. When I ship a genuine UI improvement - tighter spacing, a better hover state, a more readable error message - Chromatic shows me the delta in the diff. I accept it. The baseline updates. That new state is locked in and any future regression from it will be caught automatically. It is a very different relationship with UI changes than landing a PR and hoping QA saw every screen.

Setting fetch-depth: 0 in the checkout step is mandatory. By default actions/checkout fetches only the latest commit, and Chromatic uses git history to find the correct baseline for the current branch. A shallow clone makes it compare against the wrong commit and produce meaningless diffs.

- uses: actions/checkout@v4
  with:
    fetch-depth: 0
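A shallow clone is easy to miss, so it can be worth failing the job early when one slips through. The script below is a hypothetical guard, not something Chromatic ships: it relies on `git rev-parse --is-shallow-repository`, which is available in Git 2.15 and later, and takes an injectable command runner so the logic can be checked without a repository.

```javascript
const { execSync } = require("node:child_process");

// `git rev-parse --is-shallow-repository` prints "true" or "false".
// `run` executes a shell command and returns its stdout as a string.
function isShallow(run = (cmd) => execSync(cmd, { encoding: "utf8" })) {
  return run("git rev-parse --is-shallow-repository").trim() === "true";
}

// Throw early so a shallow clone fails loudly instead of producing bad baselines.
function assertFullHistory(run) {
  if (isShallow(run)) {
    throw new Error("Shallow clone detected: set fetch-depth: 0 in actions/checkout");
  }
}

// In CI this would be called with no argument before the Chromatic step:
// assertFullHistory();
```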

One More Reviewer That Never Sleeps

Cypress and Chromatic between them tell me the behaviour is correct and the pixels are right. There is still a gap: the code itself. Logic that works correctly in the tests but is brittle under edge cases. Error handling that is missing because the test fixture never triggers a 500. Patterns that technically function but diverge from the conventions the rest of the codebase follows.

I close that gap with an AI reviewer that fires automatically when a PR opens. It reads the diff, the surrounding context, and the test files, and posts a structured review before any human has touched it.

The prompt is deliberately scoped to what the rest of the CI stack cannot see:

# .github/workflows/ai-review.yml
name: AI PR Review

on:
  pull_request:
    types: [opened, synchronize, reopened, ready_for_review]

jobs:
  ai-review:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
      pull-requests: read
      issues: read
    steps:
      - uses: actions/checkout@v4
        with:
          persist-credentials: false

      - uses: anomalyco/opencode/github@latest
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          model: anthropic/claude-sonnet-4-20250514
          use_github_token: true
          prompt: |
            Review this pull request.
            - Flag missing error handling and unguarded edge cases.
            - Check that any API changes stay consistent with the
              Cypress test suite in cypress/e2e/api/.
            - Highlight anything that could visually regress a
              component Chromatic is already tracking.
            - Note patterns that diverge from codebase conventions.
            Keep comments specific and actionable.

The prompt anchors the AI to the same contract the automated tests enforce. It is not a generic "find bugs" instruction - it is looking for the things that Cypress and Chromatic have blind spots on: logic paths the tests never hit, error states the fixtures never trigger, drift from the patterns established elsewhere in the codebase.

When a human reviewer opens the PR, the landscape has already been mapped. Cypress checks are green or flagged. Chromatic diffs are queued. The AI review is posted. What remains for the human is the one thing that genuinely requires human judgment: does this change do what it is meant to do, and does it belong in this codebase? On a small branch, that is a fifteen-minute conversation - not because people are cutting corners but because there is genuinely little uncertainty left to resolve.


The Loop, End to End

These practices compound because they are layered in the right order:

[Figure: CI pipeline, end to end. Stages in order: Branch opens (culture), Cypress API tests (backend), Cypress E2E tests (E2E), Chromatic archives + diff (visual), AI PR review (AI), Human review (review), Merge → deploy (ship).]

  1. Branch opens - scoped to one behaviour, one to two days of work
  2. Cypress API tests - confirm the server contract: auth guards, response shapes, status codes
  3. Cypress E2E tests - confirm user-facing flows against a known-good contract
  4. Chromatic - captures full DOM + CSS archives, diffs every screen against the accepted baseline
  5. AI reviews the PR - flags code quality, missing error handling, patterns, contract regressions
  6. Human reviews - narrow diff, visual diffs ready, AI annotation already posted
  7. Merge - all checks green, baselines accepted, nothing waiting to surprise production

The small branch is load-bearing in every step. A large branch makes step 4 noisy - forty visual diffs, most of them intentional but none of them obvious. It makes step 5 produce worse output - an 800-line diff gives the AI reviewer too much surface area and the annotations lose precision. It makes step 6 slow - a week-old diff is not reviewable in good faith in fifteen minutes. It makes step 7 risky - you are merging something you cannot fully hold in your head.

Small branches make the tooling useful, the diffs readable, and the reviews honest.


What It Feels Like on the Other Side

The anxiety I described at the start does not come from deploying - it comes from not knowing. Not knowing whether the API contract changed under a refactor. Not knowing whether the UI shifted somewhere you did not look. Not knowing whether there is a missing error handler waiting to surface in production. Not knowing whether the branch has grown large enough that something important got lost in the diff.

The loop above removes each of those unknowns in sequence: the API suite tells me the server is honest, the E2E suite tells me the flows work, Chromatic tells me the pixels match intent, the AI review surfaces anything the tests cannot see, and the branch size means I actually understand the surface area I am merging.

The result is that I can merge at the end of a working day without thinking about it again. That is the bar I work to - not just "CI is green" but "I understand exactly what is going in and I would not be surprised by anything that comes out."

Boring deploys are the best deploys.