How I Cut My Claude API Bill by 90%: Prompt Caching in Practice
Three weeks after launching a Claude-powered feature in Buckist, I opened my Anthropic console and stared at a number that made no sense.
The feature was working great. Users loved it. But the API cost was running about $180/month for a level of usage I'd estimated at $20. I went back through the math. Correct number of requests, correct token counts, correct model. The arithmetic was right. The bill was wrong.
The problem wasn't my math. It was that I was sending a 12,000-token system prompt on every single request and paying full input price for it every time. The system prompt hadn't changed once. I was paying for the same 12,000 tokens, over and over, as if Claude had never seen them before.
Prompt caching fixes this. And it took me two lines of code to fix the entire problem.
What Prompt Caching Actually Is
Here's the thing: when you send a request to the Claude API, the model processes every token in your prompt from scratch. Every time. Your 8,000-token system prompt that describes your app's personality, your few-shot examples, your domain knowledge: all of it is processed fresh on each call and billed as new input tokens every time.
Prompt caching changes this. You mark parts of your prompt as cacheable. On the first request, Claude processes those tokens and stores them as a cached "prefix." On subsequent requests with the same prefix, Claude reads from cache instead of reprocessing, and cache reads cost 10% of the standard input price.
That's not a 10% discount. It's a 90% reduction on the cached portion.
The cache is keyed on the exact content of the prefix. If your system prompt hasn't changed (and why would it?), every request after the first is a cache hit, as long as the cache entry hasn't expired in the meantime.
The Math: Why This Is a Bigger Deal Than It Sounds
Let me run the numbers for a realistic Claude-powered app.
Scenario: A coding assistant with a 10,000-token system prompt (detailed instructions, few-shot examples, style guide). You're running 500 requests/day on Claude Sonnet 4.6.
Without caching:
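- 10,000 tokens × 500 requests = 5,000,000 input tokens/day
- At $3/MTok: $15/day → $450/month just for the system prompt
With 5-minute caching (assuming requests arrive steadily enough that the cache never expires, since every hit restarts the 5-minute window):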
- Day 1 first request: 10,000 cache write tokens ($3.75/MTok) = $0.0375
- Every subsequent request: 10,000 cache read tokens ($0.30/MTok) = $0.003
- 499 cache hits/day = $1.497/day from the system prompt
- $1.50/day → ~$45/month
That's a drop from $450 to $45 for the system prompt alone, with zero change in what Claude receives or outputs.
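If you want to rerun this arithmetic with your own numbers, here's a minimal sketch. The function name and parameters are mine, the 1.25x write and 0.10x read multipliers are the ones quoted in this post, and it assumes the cache is written `writes_per_day` times while every other request is a hit.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # 5-minute cache write, relative to base input price
CACHE_READ_MULTIPLIER = 0.10   # cache read, relative to base input price


def daily_prompt_cost(prompt_tokens, requests_per_day, price_per_mtok,
                      cached=False, writes_per_day=1):
    """Daily cost in dollars of the fixed prompt prefix alone."""
    per_request = prompt_tokens * price_per_mtok / 1_000_000
    if not cached:
        return per_request * requests_per_day
    reads = requests_per_day - writes_per_day
    return per_request * (writes_per_day * CACHE_WRITE_MULTIPLIER
                          + reads * CACHE_READ_MULTIPLIER)


# The scenario above: 10,000-token prompt, 500 requests/day, $3/MTok
uncached = daily_prompt_cost(10_000, 500, 3.00)
cached = daily_prompt_cost(10_000, 500, 3.00, cached=True)
print(f"uncached: ${uncached:.2f}/day, cached: ${cached:.2f}/day")
```

Raising `writes_per_day` shows how quickly the savings erode when traffic is bursty and the cache keeps expiring between requests.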
For a larger system prompt on Opus 4.7 (say, a 50,000-token context) the savings are even more extreme. Without caching: $125/day for the system prompt at 500 requests. With caching: ~$12.50/day, since cache reads cost 10% of the input price.
The Code: Two Minutes to Set Up
There are two ways to enable caching. Automatic caching is the easiest:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},  # This one line
    system="Your detailed system prompt here...",
    messages=[
        {"role": "user", "content": "User's question"}
    ],
)
That single cache_control field tells the API to automatically cache everything up to the last stable content block. For multi-turn conversations, subsequent messages in the growing history also get cached automatically. You don't manage any of it: the API figures out what's stable.
Check if it worked:
print(f"Cache write: {response.usage.cache_creation_input_tokens}")
print(f"Cache read: {response.usage.cache_read_input_tokens}")
print(f"Regular input: {response.usage.input_tokens}")
On the first request: cache_creation_input_tokens will show the tokens written. On every subsequent request with the same system prompt: cache_read_input_tokens will show your savings.
Explicit Breakpoints for Fine-Grained Control
If you want precise control over what's cached (useful when different parts of your prompt change at different rates), use explicit cache_control on individual content blocks:
# pr_diff holds the diff text of the PR under review (defined elsewhere)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a code review assistant for Python projects.",
            "cache_control": {"type": "ephemeral"},  # Static persona: always cache
        },
        {
            "type": "text",
            "text": open("codebase_context.txt").read(),  # 8k tokens of project context
            "cache_control": {"type": "ephemeral"},  # Changes daily, not per-request
        },
    ],
    messages=[
        {
            "role": "user",
            "content": f"Review this PR: {pr_diff}",  # Changes every request: not cached
        }
    ],
)
The rule: cache content that's identical across requests. Don't put cache_control on anything that changes per request: a cache miss every time means you're paying the 1.25x write cost with zero benefit.
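To put a number on that rule, here's a rough sketch of the average cost of a marked block as a function of how often it actually hits. The names are mine, and the only inputs are the 1.25x write and 0.10x read multipliers from this post.

```python
WRITE = 1.25  # cost multiplier when the marked prefix misses and is re-written
READ = 0.10   # cost multiplier when the marked prefix hits


def expected_multiplier(hit_rate):
    """Average cost of a marked block relative to sending it uncached (1.0)."""
    return hit_rate * READ + (1 - hit_rate) * WRITE


# Hit rate at which marking the block costs the same as not caching it:
# WRITE - h * (WRITE - READ) = 1  ->  h = (WRITE - 1) / (WRITE - READ)
break_even = (WRITE - 1) / (WRITE - READ)

for h in (0.0, 0.1, 0.5, 0.9):
    print(f"hit rate {h:.0%}: {expected_multiplier(h):.3f}x")
print(f"break-even hit rate: {break_even:.1%}")
```

Under these multipliers the break-even sits a little above a 20% hit rate. A block that hits less often than that is cheaper left unmarked.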
Three Patterns That Pay Off Most
1. The Large System Prompt Agent
This is the scenario from the intro. If you've built anything with Claude that has a detailed system prompt (personality, instructions, examples, constraints), you're almost certainly leaving money on the table.
The fix is exactly the snippet above. Mark your system prompt as ephemeral. The first request writes the cache, and every subsequent request that arrives within 5 minutes of the previous one is a hit.
For an agent that runs many requests in sequence (like the agent loop from the previous tutorial), the savings compound: every turn of the agent loop has the same system prompt, and every turn after the first is a cache hit.
SYSTEM_PROMPT = """You are an expert travel planning assistant with deep knowledge
of flight booking, hotel recommendations, local attractions, and visa requirements.
[... 5,000 tokens of detailed instructions and examples ...]
"""

# TOOLS and execute_tools() are assumed to be defined as in the previous tutorial's agent loop
def run_agent(user_message: str):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(10):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            cache_control={"type": "ephemeral"},  # Cache the whole growing context
            system=SYSTEM_PROMPT,
            tools=TOOLS,
            messages=messages,
        )
        if response.stop_reason == "end_turn":
            return response.content[0].text
        # Append assistant response + tool results; next turn reads from cache
        messages.append({"role": "assistant", "content": response.content})
        tool_results = execute_tools(response.content)
        messages.append({"role": "user", "content": tool_results})
    # Fail loudly instead of silently returning None when the loop runs out
    raise RuntimeError("Agent did not finish within 10 turns")
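In a 5-turn agent loop, turns 2-5 all hit the cache, so the system prompt costs 90% less on each of those turns. Counting the 1.25x write on turn 1, the system prompt costs about a third of its uncached price over the whole session ((1.25 + 4 × 0.10) / 5 ≈ 0.33), and the ratio keeps improving the longer the loop runs.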
2. Document Analysis at Scale
You have a large document (contract, codebase, research paper) and you want to ask multiple questions about it. Without caching, you're re-sending the document on every question and paying full input price each time.
With caching, you pay once to write the document to cache and then read it cheaply for every question:
document_text = open("contract.pdf.txt").read()  # 30,000 tokens

questions = [
    "What are the payment terms?",
    "Are there any auto-renewal clauses?",
    "What's the liability cap?",
    "What are the termination conditions?",
]

for question in questions:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a legal document analyst. Answer questions accurately and cite specific clauses.",
                "cache_control": {"type": "ephemeral"},
            },
            {
                "type": "text",
                "text": document_text,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[{"role": "user", "content": question}],
    )
    print(f"Q: {question}\nA: {response.content[0].text}\n")
First question: 30,000+ tokens written to cache. Questions 2-4: 30,000 tokens read from cache at $0.30/MTok instead of $3/MTok. For 4 questions on Sonnet 4.6, the document cost goes from $0.36 to about $0.14 ($0.1125 for the one write plus 3 × $0.009 for the reads). Across thousands of documents and hundreds of questions, this is significant.
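Here's a small sketch of how the document cost scales with the number of questions. It's my own helper, using the 1.25x write and 0.10x read multipliers from this post, and it assumes every question after the first arrives while the cache is still warm.

```python
def document_qa_cost(doc_tokens, num_questions, price_per_mtok, cached):
    """Input cost in dollars of sending the same document with every question."""
    per_send = doc_tokens * price_per_mtok / 1_000_000
    if not cached:
        return per_send * num_questions
    # One write at 1.25x, then a 0.10x read for each remaining question.
    return per_send * (1.25 + 0.10 * (num_questions - 1))


# 30,000-token document on a $3/MTok model, for a few question counts
for n in (1, 2, 4, 100):
    plain = document_qa_cost(30_000, n, 3.00, cached=False)
    warm = document_qa_cost(30_000, n, 3.00, cached=True)
    print(f"{n:>3} questions: ${plain:.4f} uncached vs ${warm:.4f} cached")
```

The single-question row is the one to notice: with only one question, caching costs more than not caching, because you pay for the write and never read it back.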
3. Multi-Turn Conversations with Growing History
Chatbots and AI assistants accumulate conversation history. By turn 10, you might have 20,000 tokens of prior context, and you're re-sending it all every turn at full price.
Automatic caching handles this perfectly. The cache_control at the request level automatically caches the growing conversation, so each new turn pays full price only for the new messages and the 10% read rate for the rest of the history:
conversation = []

while True:
    user_input = input("You: ")
    conversation.append({"role": "user", "content": user_input})
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        cache_control={"type": "ephemeral"},  # Caches up to the latest stable content
        system=SYSTEM_PROMPT,
        messages=conversation,
    )
    assistant_reply = response.content[0].text
    conversation.append({"role": "assistant", "content": assistant_reply})
    print(f"Claude: {assistant_reply}")
    # By turn 10: previous 9 turns are cached, you pay full price only for the new turn
For a support chat session averaging 15 turns, caching reduces the cost of re-sending the earlier turns of the conversation by 70-85% per session.
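To see where a figure in that range can come from, here's a simplified model of a growing conversation. The 5,000-token system prompt and 500 new tokens per turn are illustrative assumptions, not measurements from my app. The model also assumes every turn arrives while the cache is warm, and it ignores output tokens.

```python
def conversation_input_cost(system_tokens, tokens_per_turn, turns,
                            price_per_mtok, cached):
    """Total input cost in dollars over a conversation of `turns` requests."""
    total = 0.0
    for turn in range(1, turns + 1):
        history = system_tokens + (turn - 1) * tokens_per_turn  # sent before
        new = tokens_per_turn                                   # added this turn
        if not cached:
            billed = history + new
        elif turn == 1:
            billed = 1.25 * (history + new)       # nothing cached yet: write it all
        else:
            billed = 0.10 * history + 1.25 * new  # read old prefix, write new part
        total += billed * price_per_mtok / 1_000_000
    return total


plain = conversation_input_cost(5_000, 500, 15, 3.00, cached=False)
warm = conversation_input_cost(5_000, 500, 15, 3.00, cached=True)
savings = 1 - warm / plain
print(f"uncached ${plain:.4f}, cached ${warm:.4f}, saved {savings:.0%}")
```

With these assumed sizes the input cost drops by roughly 79%. A longer history or a larger system prompt pushes the figure higher, and a short session pulls it down.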
What Not to Cache: The Common Mistakes
Don't cache dynamic content. Anything with per-request variation (timestamps, user IDs, session state, the actual user query) should never get a cache_control marker. Every cache miss on a marked block costs 1.25x instead of 1x, meaning you pay a penalty for content that never actually caches.
Don't cache tiny prefixes. Claude Sonnet 4.6 requires a minimum of 1,024 tokens before caching activates. If your system prompt is 800 tokens, caching is a no-op: the request is processed at the normal input price and no cache is stored. Check the response's cache_creation_input_tokens field; if it's 0 on a first request (and cache_read_input_tokens is 0 too), you didn't meet the minimum. The fix: expand your system prompt with more detail, examples, or context until you exceed the threshold.
Don't forget about TTL. The default cache lasts 5 minutes, and the window restarts each time the cache is read. If the gaps between a user's requests are often longer than 5 minutes (in most sessions they are), use the 1-hour TTL for content that needs to persist across a full session.
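For the minimum-size check, you don't have to wait for a real request. Here's a minimal sketch that uses the SDK's token counting endpoint. The helper name is mine, the 1,024 default is the threshold quoted above, and `client` is assumed to be an `anthropic.Anthropic()` instance.

```python
def meets_cache_minimum(client, model, system_prompt, min_tokens=1024):
    """Return (token_count, ok) for a system prompt you plan to cache."""
    count = client.messages.count_tokens(
        model=model,
        system=system_prompt,
        # count_tokens needs at least one message; this short placeholder
        # adds a few tokens to the total, so treat borderline results as "no".
        messages=[{"role": "user", "content": "x"}],
    )
    return count.input_tokens, count.input_tokens >= min_tokens
```

Calling `meets_cache_minimum(client, "claude-sonnet-4-6", SYSTEM_PROMPT)` once at deploy time is usually enough, since the system prompt only changes when you ship a new version.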
The 1-Hour TTL: When to Use It
By default, a cache entry is valid for 5 minutes after it was last used. If the next request comes in more than 5 minutes later, it misses, re-writes the cache (at 1.25x cost), and the 5-minute window starts over.
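Here's a toy simulation of that behavior, useful for replaying your own request logs against different TTLs. It's a simplified model of my own (one cache entry, window restarted on every write or hit), and the timestamps are made up for illustration.

```python
def simulate_cache(request_times, ttl_seconds=300):
    """Count cache writes and reads for requests at the given times (seconds).

    Models a single cache entry whose TTL restarts on every write or hit.
    """
    writes = reads = 0
    expires_at = None
    for t in sorted(request_times):
        if expires_at is not None and t < expires_at:
            reads += 1
        else:
            writes += 1
        expires_at = t + ttl_seconds  # both a write and a hit restart the window
    return writes, reads


times = [0, 60, 200, 600, 700]  # illustrative request times in seconds
print(simulate_cache(times, ttl_seconds=300))   # (2, 3): the 400-second gap forces a re-write
print(simulate_cache(times, ttl_seconds=3600))  # (1, 4): one write covers everything
```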
For long-running sessions or infrequent requests, 5 minutes isn't enough. The 1-hour TTL costs 2x the base input price to write but pays off, compared with no caching at all, if you get 2 or more cache reads within the hour:
system=[
    {
        "type": "text",
        "text": LARGE_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral", "ttl": "1h"},  # Persists for an hour
    }
]
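The "2 or more reads" threshold falls out of a one-line inequality. Here's a sketch of it, with a helper name of my own and the write and read multipliers quoted in this post. The comparison is against sending the prompt uncached every time.

```python
import math


def min_reads_to_break_even(write_multiplier, read_multiplier=0.10):
    """Smallest number of cache reads after one write that beats no caching.

    Uncached cost of (1 write + n reads) requests: 1 + n
    Cached cost: write_multiplier + n * read_multiplier
    """
    threshold = (write_multiplier - 1) / (1 - read_multiplier)
    return math.floor(threshold) + 1  # n must be strictly above the threshold


print(min_reads_to_break_even(1.25))  # 5-minute TTL
print(min_reads_to_break_even(2.0))   # 1-hour TTL
```

So a 5-minute write pays for itself after a single hit, while a 1-hour write needs two.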
You can mix TTLs in the same request: 1-hour for the stable system prompt, 5-minute for the semi-stable document context that might rotate. The longer-TTL block has to come before the shorter-TTL one in the prompt:
system=[
    {
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral", "ttl": "1h"},  # Static, cache for an hour
    },
    {
        "type": "text",
        "text": session_context,
        "cache_control": {"type": "ephemeral"},  # Session-specific, 5 minutes is fine
    },
]
The rule: use 1-hour TTL for anything that's stable across user sessions. Use 5-minute for anything that's stable within a session but might rotate between them.
Pre-Warming: Eliminating Cold Start Latency
One subtlety: the first user request in any session still pays the cache write cost and the full processing time. If latency matters (and it usually does), you can pre-warm the cache before users arrive:
# Run this on server startup, before the first user request
prewarm_response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=0, # No output needed â just write the cache
system=[
{
"type": "text",
"text": SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral", "ttl": "1h"}
}
],
messages=[{"role": "user", "content": "warmup"}]
)
# Now every actual user request is a cache hit from the start
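max_tokens=0 tells the API to process the prompt and write the cache without generating a response: you pay only the cache write cost, and every subsequent request within the hour gets the fast, cheap cache read path. If the API rejects max_tokens=0 for your model or API version, max_tokens=1 does the same job at the cost of a single output token.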
Stack It with Batch API for Maximum Savings
If you have workloads that aren't time-sensitive (nightly reports, bulk processing, background enrichment), the Batch API gives you a 50% discount on top of prompt caching.
The combination is powerful: cache read tokens already cost 10% of standard. The Batch API applies a 50% discount on that. You're looking at 5% of standard input price for cached tokens in batch mode.
For a 50,000-token system prompt on Sonnet 4.6 at $3/MTok standard:
- Uncached, real-time: $0.15 per request
- Cached, real-time: $0.015 per request (cache read at $0.30/MTok)
- Cached, batch: $0.0075 per request
Running 1,000 such requests: $150 standard → $15 cached → $7.50 cached + batch.
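The stacking is just two multipliers applied in sequence, as this sketch shows. The helper is mine, and it assumes every request is a cache hit. That's the optimistic case, because cache hits inside a batch are best-effort rather than guaranteed.

```python
def cost_per_request(prompt_tokens, price_per_mtok, cached=False, batch=False):
    """Input cost in dollars of one request's prompt prefix."""
    cost = prompt_tokens * price_per_mtok / 1_000_000
    if cached:
        cost *= 0.10  # cache read rate
    if batch:
        cost *= 0.50  # Batch API discount
    return cost


# 50,000-token prompt at $3/MTok, scaled to 1,000 requests
for label, kwargs in [
    ("uncached, real-time", {}),
    ("cached, real-time", {"cached": True}),
    ("cached, batch", {"cached": True, "batch": True}),
]:
    per_request = cost_per_request(50_000, 3.00, **kwargs)
    print(f"{label}: ${per_request:.4f}/request, ${per_request * 1000:.2f} per 1,000")
```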
For non-interactive processing that can wait minutes or hours, this is the most cost-efficient path to Claude currently available.
Checking Your Work
Add this to any Claude call to see what's actually happening:
def log_cache_stats(response, label=""):
    usage = response.usage
    # These fields can be missing or None when caching is not in play
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    regular = usage.input_tokens
    total_tokens = cache_read + cache_write + regular

    # Cost calculation (Sonnet 4.6 rates, 5-minute cache writes)
    standard_cost = total_tokens * 3 / 1_000_000
    actual_cost = (
        cache_write * 3.75 / 1_000_000
        + cache_read * 0.30 / 1_000_000
        + regular * 3 / 1_000_000
    )
    savings = standard_cost - actual_cost
    print(f"[{label}] Cache write: {cache_write} | Cache read: {cache_read} | Regular: {regular}")
    print(f"  Standard cost: ${standard_cost:.4f} | Actual: ${actual_cost:.4f} | Saved: ${savings:.4f}")
If cache_read_input_tokens is consistently 0, something is wrong: the prefix isn't matching (content is changing), you're under the minimum token threshold, or the cache expired between requests.
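For the first of those causes, a prefix that quietly changes, one debugging aid is to log a fingerprint of everything that is supposed to stay constant and compare it across requests. This is a client-side trick of my own, not how the API computes its cache keys, and the function name is hypothetical.

```python
import hashlib
import json


def prefix_fingerprint(system, tools=None):
    """Short hash of the parts of a request that should be identical every call."""
    # Keys are deliberately not sorted, so a change in ordering also shows up.
    payload = json.dumps({"tools": tools, "system": system}, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


stable = "You are a code review assistant."
same = prefix_fingerprint(stable) == prefix_fingerprint(stable)
drifted = prefix_fingerprint(stable) == prefix_fingerprint(stable + " Today is Monday.")
print(same, drifted)  # True False
```

If the fingerprint differs between two requests you expected to share a cache, diff the two prompts. The usual culprit is a timestamp or user-specific value interpolated into the system prompt.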
The One Thing Most Developers Miss
After fixing Buckist and helping a few other indie hackers audit their Claude API bills, I keep seeing the same pattern: developers who are using Claude heavily, building real things, and completely unaware that they could be paying a fraction of what they're paying.
The fix is almost always the same: add cache_control to the system prompt, verify the cache hits in the usage fields, and watch the bill drop.
For the agent loop I built in the last post, adding caching reduced the per-session cost from $0.48 to $0.08 for a typical 5-turn session, a number that now actually fits into a freemium tier without bleeding money on every user.
If you're building anything Claude-powered that has more than a minimal system prompt, check your cache hit rate before you do anything else. It's the highest-ROI five minutes you'll spend on your API costs.
Running a Claude-powered app and want to compare notes on API costs? Drop a note in the comments. I'm curious whether the numbers others are seeing match what I've found. And if you haven't built the agent loop yet, start here before diving into caching.