Context Caching

Cache repeated context to reduce costs and latency. Pay 75% less for cached tokens.

75% token cost savings
~80% latency reduction
Automatic: no code changes

How It Works

1. First Request: send a prompt with your context (system prompt, documents, etc.).

2. Automatic Caching: we cache the processed context for future requests.

3. Subsequent Requests: identical prefixes hit the cache, saving tokens and time.

Automatic Prompt Caching

Prompt caching is enabled by default. When you send requests with identical prefixes, the cached portion of the prompt is billed at a 75% discount.

Python
# First request - full price for all tokens
response1 = client.chat.completions.create(
    model="mythic-4",
    messages=[
        {"role": "system", "content": """You are an expert legal assistant.
        Here is the full text of the contract: [10,000 words...]"""},
        {"role": "user", "content": "What are the termination clauses?"}
    ]
)

# Second request - same prefix is cached at 75% discount!
response2 = client.chat.completions.create(
    model="mythic-4",
    messages=[
        {"role": "system", "content": """You are an expert legal assistant.
        Here is the full text of the contract: [10,000 words...]"""},
        {"role": "user", "content": "Are there any liability limitations?"}
    ]
)

# Check usage
print(response2.usage.prompt_tokens_details.cached_tokens)
# 12000  (these were charged at 75% discount)

💡 Cache Hits

Check usage.prompt_tokens_details.cached_tokens in the response to see how many tokens were served from cache.
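A small helper can turn that field into a hit-rate metric. This is a sketch assuming an OpenAI-compatible response object; the getattr guards are there because prompt_tokens_details may be absent on responses with no cache hit:

```python
def cached_fraction(response):
    """Return the fraction of prompt tokens served from cache (0.0 to 1.0)."""
    usage = response.usage
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    return cached / usage.prompt_tokens if usage.prompt_tokens else 0.0
```

For the legal-assistant example above (12K of 15K prompt tokens cached), this returns 0.8.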

Before & After

Without Caching

10 requests × 15K tokens each = 150K full-price tokens

With Caching

10 requests × (3K new + 12K cached at 25% price) ≈ 60K effective tokens

Best Use Cases

📚 Document Q&A

Include large documents in system prompt, ask multiple questions.

Save 70%+ on follow-up questions

💻 Code Assistants

Cache repository context for multiple code generation requests.

Save 60%+ on token costs

🤖 Multi-turn Chat

Conversation history is automatically cached between turns.

Save 50%+ per conversation

📋 Few-shot Prompts

Cache example-heavy prompts used across many requests.

Save 80%+ on examples
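The chat and few-shot cases share one mechanic: keep the shared prefix byte-identical and only ever append to it. A minimal sketch of that pattern (the classifier prompt and examples here are placeholders, not part of any real API):

```python
# Static prefix reused verbatim on every request -- this is the cacheable part.
FEW_SHOT_PREFIX = [
    {"role": "system", "content": "You classify support tickets as bug, feature, or question."},
    {"role": "user", "content": "App crashes on login."},
    {"role": "assistant", "content": "bug"},
    {"role": "user", "content": "Please add dark mode."},
    {"role": "assistant", "content": "feature"},
]

def build_messages(history, new_query):
    """Append-only: prefix + prior turns + new query, never reordered or edited."""
    return FEW_SHOT_PREFIX + history + [{"role": "user", "content": new_query}]
```

Because earlier turns are appended rather than rewritten, each new turn extends the cached prefix instead of invalidating it.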

Pricing

| Token Type           | Price per 1M tokens | Discount |
|----------------------|---------------------|----------|
| Regular Input Tokens | $2.50               | -        |
| Cached Input Tokens  | $0.625              | 75% off  |
| Output Tokens        | $10.00              | -        |
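Plugging the table into a per-request cost function makes the savings concrete. A sketch using the prices above:

```python
# Prices from the table above, in dollars per 1M tokens.
PRICE_INPUT = 2.50
PRICE_CACHED = 0.625   # 75% off regular input
PRICE_OUTPUT = 10.00

def request_cost(new_input, cached_input, output):
    """Dollar cost of one request, given token counts by type."""
    return (new_input * PRICE_INPUT
            + cached_input * PRICE_CACHED
            + output * PRICE_OUTPUT) / 1_000_000
```

For the document-Q&A scenario (15K input, of which 12K is cached on follow-ups), input cost drops from $0.0375 to $0.015 per request, a 60% saving on input tokens.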


Optimizing for Cache Hits

🎯 Structure Your Prompts

Put static content (system prompts, documents, examples) at the beginning. Put dynamic content (user query) at the end. This maximizes the cacheable prefix.

Optimal Prompt Structure
# ✅ Good: Static content first (cacheable)
messages = [
    {"role": "system", "content": "[Long static instructions...]"},
    {"role": "user", "content": "[Examples...]"},
    {"role": "assistant", "content": "[Example responses...]"},
    {"role": "user", "content": user_query}  # Dynamic at end
]

# ❌ Bad: Dynamic content breaks cache prefix
messages = [
    {"role": "system", "content": f"Today is {date}. You are..."},  # Dynamic!
    ...
]
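One way to fix the bad example above is to move the per-request value out of the system prompt entirely. A sketch (the instruction text is a placeholder) that keeps the system message byte-identical across requests by carrying today's date in the user message instead:

```python
from datetime import date

# Byte-identical on every request, so the prefix stays cacheable.
STATIC_SYSTEM = "You are a helpful assistant. [Long static instructions...]"

def build_prompt(user_query):
    """Keep dynamic values (like the date) in the final user message."""
    today = date.today().isoformat()
    return [
        {"role": "system", "content": STATIC_SYSTEM},
        {"role": "user", "content": f"(Today is {today}.) {user_query}"},
    ]
```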