Skip to content
Loading

Stop Burning Your Token Budget in Claude Code

Stop Burning Your Token Budget in Claude Code hero image

You open Claude Code, review a medium-sized PR, go back and forth a few times, and you are already at 60% of your usage limit. You have touched nothing complex yet. If you are on the Max plan and this still happens, the problem is not your plan. It is how the context window actually works.


Why Token Usage Explodes: The Re-Send Model

Claude Code does not stream a delta on each turn. Every time you send a message, the entire conversation history is re-sent to the model from scratch: your previous prompts, every tool result, every file read, every response Claude has ever generated in this session. Transformers have no persistent state between calls. Re-processing the full history on every turn is not a bug; it is the fundamental architecture.

The practical consequence is that token cost is not linear with message count, it is quadratic. A session with 10 back-and-forth turns where each turn reads a 500-line file does not cost 10x a single turn. It costs closer to 55x, because turn 10 re-sends all 9 prior tool results plus their responses.

A PR review hits the ceiling fast because each file Claude reads injects the entire file as a tool result, and every subsequent message re-sends all of those results again.


The Common Mistakes

Most developers hit the limit in one of three ways. They paste entire files into chat instead of referencing them. They load every MCP server they have ever installed even on sessions that do not need them. And they run long sessions end-to-end without ever compacting, so the context balloons with stale tool outputs and superseded reasoning that Claude still re-processes on every turn.

The fix is not a bigger plan. The fix is managing what lives in context at any given moment.


The Solution: Treat Context Like Memory, Not a Log

Every item in your context window has a cost that compounds across every subsequent message. The goal is to keep only what is currently decision-relevant. This means being deliberate about what you load, when you compact, which model you invoke, and how you write prompts. None of these require changing how you work in a fundamental way; they are mostly defaults you change once.


1. Write a Lean CLAUDE.md

Running /init in your project root generates a CLAUDE.md file that Claude reads automatically at session start. It stays in context for the entire session. This is the right place for stack constraints, naming conventions, and architectural invariants: the things you would otherwise type from scratch in every session.

The key discipline is keeping it short. A CLAUDE.md is paid on every single session. A 4,000-token file that covers things you only need once a week is a bad trade.

A real project file should target 300 to 600 tokens:

# CLAUDE.md

## Stack

- Node 20, TypeScript strict, Prisma ORM
- Tests: Vitest (not Jest)

## Constraints

- Never use `any`. Use `unknown` + type guards.
- Controllers call services. Services call DB. No bypassing layers.
- Use `Result<T, E>` for fallible operations, not thrown exceptions.

## Naming

- Files: kebab-case. Classes: PascalCase. Hooks: `use*` prefix.

## Commands

- `bun run lint` = eslint + typecheck combined. Always run before commit.

For monorepos, split CLAUDE.md across subdirectories. Subdirectory files load only when Claude navigates into that folder, so the full monorepo context does not land in every session.


2. Audit Context with /context

Before deciding whether to compact or start fresh, run /context. Claude returns a structured breakdown of every item currently in the context window: open files, tool definitions, conversation turns, attached documents, and the system prompt, with token counts per element and cumulative usage.

Use it to answer three questions:

  • Are there files in context that were referenced earlier but are no longer relevant to the current task?
  • Has the conversation thread grown long enough that compacting would give a net benefit?
  • How far are you from the ceiling, and is it worth continuing this session or starting fresh?

Think of /context as a memory profiler. Running it takes one message and gives you enough information to make a deliberate choice rather than guessing.


3. Use /compact Before You Need It

/compact summarizes the entire conversation into a structured representation capturing decisions made, files modified, current task state, errors resolved, and outstanding work. It then continues the session from that summary as the new baseline, discarding intermediate reasoning chains, raw tool outputs, and superseded approaches.

The common mistake is running /compact reactively when Claude starts giving degraded responses. By that point, the conversation is already full of stale context, and a degraded session produces a worse summary. Run /compact proactively when you finish a distinct phase of work.

A healthy workflow looks like this:

Phase 1: exploration and planning  →  /compact
Phase 2: implementation             →  /compact
Phase 3: testing and review         →  done

If you finish a task entirely and are moving to something unrelated, use /clear instead. It wipes the context completely without closing the terminal session.


4. Define Repetitive Workflows as /commands

/commands defines named aliases for multi-step instruction sequences. Where natural language prompts are probabilistic (the same wording produces slightly different behavior each run), commands are deterministic. Claude executes the defined sequence without re-parsing intent.

Create a command file at .claude/commands/test-fix.md:

---
name: test-fix
description: Run tests, fix all failures, then lint
---

1. Run `bun run test` and capture all failing test names and error messages.
2. Fix each failure. Do not change test assertions unless the test is provably wrong.
3. Run `bun run lint` and fix any ESLint or TypeScript errors.
4. Report what was changed and confirm all checks pass.

Invoke with /test-fix. Use commands for any workflow you run more than twice a week: generating components with your folder conventions, writing commit messages in your team's format, pre-deploy validation sequences.


5. Match the Model to the Task

The default model on the Max plan is Opus. Opus is the right tool for hard planning tasks in plan mode. It is the wrong tool for renaming a variable, generating a test scaffold, or asking what a function does.

| Model | Use case | | ---------- | ----------------------------------------------------------------- | | Opus | Hard architectural decisions, planning complex multi-file changes | | Sonnet | Day-to-day feature work, refactoring, writing tests, code reviews | | Haiku | Simple questions, quick lookups, summarising output |

Switch with /model. The token rate difference between Opus and Haiku is roughly 30x. Most of what you do in a session does not need Opus.


6. Reference Files, Do Not Paste Them

When you paste a file into chat, that content is injected into the conversation history as a user message and re-sent on every subsequent turn for the rest of the session. It becomes dead weight immediately after the turn it was used.

Claude Code's @file reference system avoids this. The file is pulled in when Claude needs it and handled as a tool result rather than conversation history. Use it consistently:

# Avoid this
[pastes entire auth.ts here] Can you add refresh token logic?

# Do this instead
Fix the expired token bug in @src/middleware/auth.ts that returns
401 instead of triggering a refresh when the access token expires.

For reusable context (API conventions, data schemas, ADR summaries), store them as .md files and load them on demand with @filename.md. They travel with the project and are only injected when explicitly referenced.


7. Write Precise Prompts

Vague prompts have two costs. They produce verbose responses as Claude attempts to cover all possible interpretations, and they often result in Claude reading additional files to infer context you could have provided directly. Both inflate token usage.

The pattern that works is: what + where + expected vs. actual behavior:

# Vague (expensive)
Fix the login bug

# Precise (cheap)
Fix the bug in @src/features/auth/login.service.ts where calling
refreshToken() on a 401 response throws "Cannot read properties of
undefined (reading 'accessToken')" instead of returning a new session.

The precise version costs slightly more input tokens but produces a shorter, more accurate response and eliminates the file-reading round trips Claude would otherwise need.


8. Audit Your MCP Servers

Every connected MCP server loads its full tool manifest into the context window at session start, regardless of whether you use any of those tools in that session. Tool definitions include names, descriptions, and JSON Schema parameter definitions. A large MCP server can consume several hundred tokens. Four of them adds up.

Check what is currently wired:

claude mcp list

Remove servers you do not use on most sessions:

claude mcp remove <server-name>

For servers you need occasionally, remove them from the global config and add them only to the specific projects that need them via a project-level .mcp.json. This limits the token cost to sessions where the server is actually relevant.


9. Turn Off Extended Thinking for Simple Tasks

Claude Code runs extended internal reasoning before responding by default, regardless of task complexity. For a simple rename, a documentation lookup, or a git log summary, this reasoning is unnecessary and consumes tokens you will never see.

Prefix low-complexity prompts to signal that reasoning is not needed:

Quick: what does the `normalizeHeaders` function in @src/lib/http.ts return?

For analytical tasks where reasoning genuinely helps (debugging a non-obvious race condition, designing a schema migration) leave it enabled. The cost is worth it. For mechanical tasks, it is not.


10. Use the Terminal, Not the Desktop App

The Claude Code desktop app does not surface per-turn token usage in a visible way. The terminal version prints context usage after every tool call, which makes the cost of each operation concrete and visible.

Launch the terminal client directly:

claude

Running in the terminal also lets you pipe output, script session setup, and integrate with your existing workflow tooling in ways the desktop app does not support.


The Mental Model

Every token in your context window is a cost you pay on every subsequent message. The session is not a chat log; it is a sliding window that grows until you compact or clear it. Managing that window deliberately, through a lean CLAUDE.md, proactive compacting, precise prompts, and disciplined MCP hygiene, is not optimization for its own sake. It is the difference between a session that finishes a feature and one that hits the ceiling halfway through.