How Caching Works
Caching operates on prefix matching: the system stores processed tokens and reuses them when subsequent requests start with the same content. Consider a chatbot with a 2,000-token system prompt (a request sketch follows the table):

| Request | Content | Processed | From cache |
|---|---|---|---|
| 1 | System prompt (2,000 tokens) + user message (50 tokens) | 2,050 tokens | 0 tokens (prefix written to cache) |
| 2 | System prompt (2,000 tokens) + user message (80 tokens) | 80 tokens | 2,000 tokens |
| 3 | System prompt (2,000 tokens) + user message (120 tokens) | 120 tokens | 2,000 tokens |
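A minimal sketch of this flow against an OpenAI-compatible chat completions endpoint. The base URL, model id, API key variable, and prompt file are illustrative assumptions, not confirmed Venice values:

```python
import os
import requests

API_URL = "https://api.venice.ai/api/v1/chat/completions"  # assumed endpoint
HEADERS = {"Authorization": f"Bearer {os.environ['VENICE_API_KEY']}"}

# ~2,000-token system prompt shared by every request (the static prefix)
SYSTEM_PROMPT = open("support_bot_instructions.txt").read()

def ask(question: str) -> dict:
    resp = requests.post(API_URL, headers=HEADERS, json={
        "model": "gpt-5.2",  # illustrative model id (automatic caching; Claude needs markers, see below)
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},  # identical prefix every time
            {"role": "user", "content": question},         # only this part changes
        ],
    })
    resp.raise_for_status()
    return resp.json()

# Request 1 writes the prefix to cache; request 2 should read it back.
for i, question in enumerate(["How do I reset my password?", "What plans do you offer?"], start=1):
    usage = ask(question)["usage"]
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    print(f"request {i}: prompt_tokens={usage['prompt_tokens']} cached_tokens={cached}")
```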
Supported Models and Pricing
Claude Opus 4.5 charges a premium rate for cache writes ($7.50/1M tokens vs $6.00 for regular input). The first request populating the cache costs more, but subsequent cache hits save 90%. Other models don’t charge extra for cache writes.
Provider-Specific Behavior
Venice normalizes caching across providers. For most models, caching is automatic. Just send your requests and check the response for cache statistics. The exception is Claude, which requires explicit cache markers for optimal performance. Caching behavior is ultimately controlled by each provider and may change, so check provider docs for the latest details.

| Model | Provider | Min Tokens | Cache Lifetime | Write Cost | Read Discount | Explicit Markers |
|---|---|---|---|---|---|---|
| Claude Opus 4.5 | Anthropic | ~4,000 | 5 min | +25% | 90% | Required |
| GPT-5.2 | OpenAI | 1,024 | 5-10 min | None | 90% | Not needed |
| Gemini | Google | ~1,024 | 1 hour | None | 75-90% | Not needed |
| Grok | xAI | ~1,024 | 5 min | None | 75-88% | Not needed |
| DeepSeek | DeepSeek | ~1,024 | 5 min | None | 50% | Not needed |
| MiniMax | MiniMax | ~1,024 | 5 min | None | 90% | Not needed |
| Kimi | Moonshot | ~1,024 | 5 min | None | 50% | Not needed |
Claude Opus 4.5 (Anthropic)
Claude requires explicit cache breakpoints. For fine-grained control beyond the system prompt, you can add additional breakpoints manually. What you need to know:

- Explicit markers required: Add `cache_control: { "type": "ephemeral" }` to content blocks you want cached (see the sketch after this list)
- Up to 4 breakpoints per request: The system uses the longest matching prefix
- Cache key is byte-exact: Whitespace changes, different image encodings, or reordered tools break cache hits
- Cache-aware rate limits: Cached tokens don’t count against your ITPM limit, enabling higher effective throughput
- 25% write premium: First request costs more, but 90% savings on subsequent reads
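A sketch of a Claude request with one breakpoint on the system prompt. It assumes Anthropic-style `cache_control` blocks pass through the OpenAI-compatible payload unchanged; the endpoint, model id, and prompt file are illustrative:

```python
import os
import requests

API_URL = "https://api.venice.ai/api/v1/chat/completions"  # assumed endpoint

# Static instructions; must exceed Claude's ~4,000-token minimum to be cacheable
LONG_SYSTEM_PROMPT = open("policy_assistant_instructions.txt").read()

payload = {
    "model": "claude-opus-4.5",  # illustrative model id
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": LONG_SYSTEM_PROMPT,
                    # Cache breakpoint: everything up to and including this block is cached
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Summarize the refund policy in three bullets."},
    ],
}

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {os.environ['VENICE_API_KEY']}"},
    json=payload,
)
print(resp.json()["usage"].get("prompt_tokens_details", {}))
```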
All Other Models
Caching is automatic. No special parameters needed. Just ensure your prompts exceed ~1,024 tokens and use `prompt_cache_key` for consistent routing.
Request Parameters
| Parameter | Type | Models | Description |
|---|---|---|---|
| `prompt_cache_key` | string | All | Routing hint for cache affinity. Requests with the same key are more likely to hit the same server with warm cache. |
| `cache_control` | object | Claude | Marks content blocks for caching. See Claude Opus 4.5 section. |
prompt_cache_key
For multi-turn conversations or agentic workflows, use a consistent `prompt_cache_key` to improve cache hit rates:
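A sketch of a single request carrying the key, assuming an OpenAI-compatible endpoint that accepts `prompt_cache_key` as a top-level field (URL, model id, key name, and prompt file are illustrative):

```python
import os
import requests

API_URL = "https://api.venice.ai/api/v1/chat/completions"  # assumed endpoint
SYSTEM_PROMPT = open("agent_instructions.txt").read()      # static prefix, >1,024 tokens

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {os.environ['VENICE_API_KEY']}"},
    json={
        "model": "gpt-5.2",                      # illustrative model id
        "prompt_cache_key": "support-agent-v1",  # reuse the same key for this workflow
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "Where is my order?"},
        ],
    },
)
print(resp.json()["usage"])
```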
Response Fields
The response `usage` object includes cache statistics:
| Field | Description |
|---|---|
| `prompt_tokens` | Total input tokens in the request |
| `prompt_tokens_details.cached_tokens` | Tokens served from cache (billed at discounted rate) |
| `prompt_tokens_details.cache_creation_input_tokens` | Tokens written to cache (may incur premium on Claude) |
For example, a Claude Opus 4.5 request with 5,000 cached tokens and 500 uncached tokens:

- 5,000 cached tokens × $0.60/1M = $0.003
- 500 uncached tokens × $6.00/1M = $0.003
- Total: $0.006 (vs $0.033 without caching, an 82% savings)
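A small helper that reads these fields from a response body and reproduces the calculation above. The rates are Claude Opus 4.5's regular and cached-read prices from the pricing table; the write premium on the first request is ignored, as in the example:

```python
def cache_cost_report(usage: dict) -> None:
    """Split prompt cost into cached and uncached portions (Claude Opus 4.5 rates)."""
    prompt_tokens = usage["prompt_tokens"]
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    uncached = prompt_tokens - cached

    cost = cached / 1e6 * 0.60 + uncached / 1e6 * 6.00   # cached reads at 90% off
    no_cache_cost = prompt_tokens / 1e6 * 6.00           # everything at the regular rate

    print(f"cached: {cached}, uncached: {uncached}")
    print(f"cost: ${cost:.4f} vs ${no_cache_cost:.4f} uncached "
          f"({1 - cost / no_cache_cost:.0%} saved)")

# The worked example above: 5,000 cached + 500 uncached tokens
cache_cost_report({
    "prompt_tokens": 5500,
    "prompt_tokens_details": {"cached_tokens": 5000},
})
```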
Best Practices
Structure prompts for caching
Place static content at the beginning, dynamic content at the end (a sketch follows the tables below).

Good structure:

| Position | Content | Cached? |
|---|---|---|
| 1 | System instructions | Yes |
| 2 | Reference documents | Yes |
| 3 | Few-shot examples | Yes |
| 4 | User query | No |
Bad structure:

| Position | Content | Cached? |
|---|---|---|
| 1 | Current timestamp | No (invalidates everything after) |
| 2 | System instructions | No |
| 3 | User query | No |
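A sketch of the "good" layout: everything static is concatenated into one shared prefix, and anything that changes per request (here a timestamp and the query) goes in the final user message. File names are illustrative:

```python
from datetime import datetime, timezone

# Static prefix: identical bytes on every request, so it can be cached
STATIC_PREFIX = (
    open("system_instructions.md").read()   # 1. system instructions
    + open("reference_docs.md").read()      # 2. reference documents
    + open("few_shot_examples.md").read()   # 3. few-shot examples
)

def build_messages(user_query: str) -> list[dict]:
    return [
        {"role": "system", "content": STATIC_PREFIX},
        {
            "role": "user",
            # Dynamic content last, so it never invalidates the cached prefix
            "content": f"Current time: {datetime.now(timezone.utc).isoformat()}\n\n{user_query}",
        },
    ]
```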
Keep prefixes byte-identical
Cache keys are computed from exact byte sequences. Even trivial differences break cache hits:

- Different whitespace or newlines
- Timestamps or request IDs in prompts
- Randomized few-shot example ordering
- Different formatting of the same content
Meet minimum token thresholds
If your prompts are below the minimum (typically 1,024 tokens), caching won’t activate. For small prompts, consider:

- Adding more context or examples to reach the threshold
- Bundling multiple small requests into batched prompts
- Accepting that caching won’t apply for simple queries
Use prompt_cache_key for conversations
For multi-turn conversations, set a consistent `prompt_cache_key`:
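A sketch of a conversation loop that keeps one key per session so every turn is routed to the same warm cache. The endpoint, model id, key scheme, and prompt file are illustrative:

```python
import os
import uuid
import requests

API_URL = "https://api.venice.ai/api/v1/chat/completions"   # assumed endpoint
session_key = f"chat-{uuid.uuid4()}"                        # one key for the whole conversation

messages = [{"role": "system", "content": open("bot_instructions.txt").read()}]

def send_turn(user_text: str) -> str:
    messages.append({"role": "user", "content": user_text})
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['VENICE_API_KEY']}"},
        json={
            "model": "gpt-5.2",               # illustrative model id
            "prompt_cache_key": session_key,  # constant across turns
            "messages": messages,
        },
    )
    reply = resp.json()["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": reply})  # earlier turns become cached prefix
    return reply
```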
Monitor cache performance
Track these metrics:

- Cache hit rate: `cached_tokens / prompt_tokens`
- Cost savings: Compare actual cost vs. uncached cost
- Latency reduction: Time-to-first-token with vs. without cache hits
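A small tracker that aggregates these numbers across responses; field names follow the response table above, and the warning condition mirrors the checklist below:

```python
class CacheStats:
    """Aggregate cache statistics from chat completion usage objects."""

    def __init__(self) -> None:
        self.requests = 0
        self.prompt_tokens = 0
        self.cached_tokens = 0

    def record(self, usage: dict) -> None:
        self.requests += 1
        self.prompt_tokens += usage["prompt_tokens"]
        self.cached_tokens += usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)

    @property
    def hit_rate(self) -> float:
        return self.cached_tokens / self.prompt_tokens if self.prompt_tokens else 0.0

    def report(self) -> None:
        print(f"cache hit rate: {self.hit_rate:.1%} over {self.requests} requests")
        if self.requests > 1 and self.cached_tokens == 0:
            print("warning: no cache hits - check prompt length, prefix stability, "
                  "and prompt_cache_key")
```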
If `cached_tokens` is consistently 0:

- Prompts may be below the minimum token threshold
- Prompts may be changing between requests
- Requests may be hitting different servers (use `prompt_cache_key`)
- Cache may have expired (requests too infrequent)
Consider cache economics
Claude Opus 4.5 cache write premium: the first request costs 25% more, but subsequent reads save 90% (a worked example follows the table).

| Scenario | Cache write premium worth it? |
|---|---|
| 1 request with this prompt | No (pay 25% more, no benefit) |
| 2+ requests with same prefix | Yes (break even at 2nd request) |
| Rapidly changing prompts | No (constant write costs) |
| Stable system prompt, many queries | Yes (amortized over many reads) |
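The break-even claim can be checked with the Claude Opus 4.5 rates quoted earlier; a 2,000-token cached prefix is assumed for illustration:

```python
PREFIX_TOKENS = 2_000  # illustrative cached prefix size

uncached = PREFIX_TOKENS / 1e6 * 6.00   # $0.0120 per request at the regular input rate
write    = PREFIX_TOKENS / 1e6 * 7.50   # $0.0150 for the first request (25% write premium)
read     = PREFIX_TOKENS / 1e6 * 0.60   # $0.0012 per subsequent cached read

# Two requests: $0.0162 with caching vs $0.0240 without, so the write premium
# is already recovered on the second request; every later read saves ~90%.
print(f"with cache: ${write + read:.4f}  without: ${2 * uncached:.4f}")
```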
Cache Lifetime
Caches expire after a period of inactivity (typically 5-10 minutes). This means:

| Traffic pattern | Caching benefit |
|---|---|
| Continuous requests (< 5 min gaps) | High: cache stays warm |
| Bursty traffic (gaps > 10 min) | Limited: cache expires between bursts |
| Sporadic requests (hours apart) | None: cache always cold |
Caching with Tools and Functions
Function definitions can be cached along with system prompts:
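A sketch of a request whose tool definitions sit in the static prefix. For models with automatic caching nothing extra is needed as long as the tool JSON stays byte-identical between requests; for Claude, a `cache_control` breakpoint covering the prefix would still be required (payload shape and names are illustrative):

```python
SYSTEM_PROMPT = open("agent_instructions.txt").read()  # static, reused verbatim

# Tool definitions are part of the prompt prefix; keep them byte-identical
# (same order, same JSON) so they can be cached with the system prompt.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the status of a customer order by id.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    },
]

payload = {
    "model": "gpt-5.2",  # illustrative model id
    "tools": TOOLS,
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Where is order 1234?"},  # dynamic part, last
    ],
}
```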
Caching with Images and Documents
For vision models, images can be included in cached content:
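A sketch for a vision request where a reference image is part of the cacheable prefix. The `image_url` content block follows the common OpenAI-style format; what matters for caching is re-sending exactly the same base64 bytes every time (file name and prompt text are illustrative):

```python
import base64

# Encode the reference image once and reuse the same bytes on every request;
# a different encoding of the same image breaks the prefix match.
with open("floor_plan.png", "rb") as f:
    IMAGE_B64 = base64.b64encode(f.read()).decode()

REFERENCE_IMAGE = {
    "type": "image_url",
    "image_url": {"url": f"data:image/png;base64,{IMAGE_B64}"},
}

def vision_messages(question: str) -> list[dict]:
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Answer questions about this floor plan."},
                REFERENCE_IMAGE,                      # static, cacheable part
                {"type": "text", "text": question},   # dynamic part, last
            ],
        },
    ]
```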
Troubleshooting
cached_tokens is always 0
| Cause | Solution |
|---|---|
| Prompt too short | Ensure prompt exceeds ~1,024 tokens (4,000 for Claude) |
| Prefix changed | Check for dynamic content at the start of your prompt |
| First request | Expected: first request writes to cache, subsequent requests read |
| Cache expired | Reduce time between requests to under 5 minutes |
| Different servers | Add prompt_cache_key to route requests consistently |
cache_creation_input_tokens is nonzero on every request
| Cause | Solution |
|---|---|
| Prompt changing | Remove timestamps, request IDs, or other dynamic content from the prefix |
| Missing cache_control | For Claude, ensure cache_control marker is present on content blocks |
| Below threshold | Prompts under minimum token count don’t trigger caching |
Higher costs than expected
| Cause | Solution |
|---|---|
| Cache write premium | Claude charges 25% more for writes. Only worth it if you reuse the prompt. |
| Low reuse | If each prompt is unique, you pay write costs without read benefits |
| Bad prompt structure | Move dynamic content to the end so the prefix stays stable |