CloudCodeTree LogoCloudCodeTree
AI NewsTutorialsAbout
CloudCodeTree Logo
CloudCodeTree
  • AI News
  • Tutorials
  • About
← Back to AI News
Add One Field to Your System Prompt and Cut Your Claude API Bill by 90%

Add One Field to Your System Prompt and Cut Your Claude API Bill by 90%

Chris Harper

3 min read

Jun 27, 2026 · 20:02 UTC

AI
Workflow
Best Practices
LLM

TL;DR: Add "cache_control": {"type": "ephemeral"} to your system prompt block and every subsequent Claude API call within 5 minutes pays 10% of the normal input-token price — an 85% latency cut is included at no extra charge.

If your app sends the same system prompt with every request — and almost every app does — you're paying full input-token price every single time. Prompt caching eliminates that by storing the encoded state of a content block server-side and reusing it for 5 minutes (or 1 hour with extended TTL).

The one-line change

import anthropic
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "<your full system prompt here>",
            "cache_control": {"type": "ephemeral"}  # ← this is the entire change
        }
    ],
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)

That's it. The first call is slightly more expensive (writing to cache costs 25% extra). Every call within the TTL window that sends the same bytes pays 10% of the base input-token price.

What to cache

Not everything is cache-worthy. The pattern that maximizes savings:

  1. System prompt — almost always worth caching; it's static, often long, sent on every call
  2. Tool definitions — if you inject 20+ tool schemas, cache the whole block
  3. Retrieved documents / few-shot examples — cache when the same set recurs across calls in a session
  4. Conversation history — mark the last stable turn; earlier turns stay cached as long as the prefix is unchanged

Only one cache_control breakpoint can be active per content role type, so place it at the longest stable prefix.

5-minute vs 1-hour TTL

The standard "ephemeral" entry lives for a minimum of 5 minutes. For sustained traffic or long-lived chat sessions, use the 1-hour variant:

"cache_control": {"type": "ephemeral", "ttl": 3600}

The 1-hour write costs more but pays off quickly on any server process that runs for more than a few minutes.

What it unlocks beyond cost

  • 85% latency reduction on the cached prefix — the model skips re-encoding the system prompt and jumps straight to generation
  • Richer system prompts — a 30,000-token context document that cost ~$0.09 per call on Sonnet 4.6 costs ~$0.009 on a cache hit; you can finally include the whole reference doc
  • Multi-turn conversations at scale — cache the growing conversation history prefix to keep per-turn cost flat even as the transcript grows

Verify cache activity with the response fields cache_read_input_tokens (savings realized) and cache_creation_input_tokens (write cost paid).

Sources: Prompt caching — Anthropic Platform Docs | Prompt caching announcement — Anthropic