A three-layer caching architecture, provider-level prompt caching, exact-match response caching, and semantic similarity caching, is the standard pattern for cutting LLM API spend on systems with repetitive query traffic. Prompt caching cuts input token costs 30-50% by eliminating repeated system prompt and RAG context charges. Exact-match caching typically handles 20-30% of queries from cache with zero quality risk. Semantic similarity caching at a 0.92-0.94 threshold catches another 15-25% of paraphrased-duplicate queries. Combined with model routing, total LLM spend on chatbot- and FAQ-heavy workloads routinely drops 70%+ with no measurable decline in response quality. Engineering investment is typically four to six weeks for the full three-layer build.
Across the production AI systems we audit, the same pattern shows up: the platform works, users love it, the monthly API bill keeps climbing, and most of that spend is the system answering the same questions over and over. A typical profile is three LLM-dependent services, customer support automation, document processing, and an internal analytics assistant, collectively running tens of thousands of API calls daily, with redundant work hidden inside every layer.
The fix usually isn't switching models, renegotiating contracts, or cutting features. It's recognizing that the system is paying full price to answer the same questions repeatedly. A well-designed smart caching architecture can drive monthly spend down by 70% or more without touching response quality.
At Particula Tech, we've built caching layers for AI systems across industries. This article walks through the exact three-layer strategy we use, the decisions that matter, and the numbers behind each layer's contribution. If your AI API costs are growing faster than your revenue, this is the playbook.
Why Their AI API Costs Were Spiraling
Before fixing the problem, you need to understand where the money goes. Instrumenting the LLM-dependent services with request-level cost tracking for two weeks, the same metadata-tagging plus gateway pattern we walk through in our per-tenant LLM cost attribution guide. At their spend level, the gateway tier itself is a decision worth making deliberately, our AI gateway decision framework covers when LiteLLM is enough versus when Portkey or Kong starts to pay for itself. The data revealed a pattern we see in almost every AI application that wasn't designed with cost efficiency in mind.
Their customer support chatbot processed 28,000 queries daily. Each query included a 1,200-token system prompt, 400-800 tokens of retrieved knowledge base context, and the user's question. Roughly 38% of user queries were semantically identical to questions already answered that day, "How do I reset my password?" asked forty different ways.
The document processing pipeline extracted structured data from invoices, contracts, and reports. Around 15% of documents were resubmissions or duplicates that had already been processed. Every resubmission triggered the same expensive extraction pipeline from scratch.
Their analytics assistant fielded repeated questions from different team members. "What were last month's sales?" and "Show me revenue for January" hit the LLM as separate requests despite requiring identical analysis. About 25% of analytics queries were functional duplicates.
The common thread: redundant computation. The system was doing expensive work it had already done, paying full price every time. The solution wasn't to reduce what the AI could do, it was to stop repeating work unnecessarily.
The Three-Layer Caching Architecture
We designed a caching strategy with three distinct layers, each targeting a different type of redundancy. Stacking these layers compounds their individual savings because each one catches queries the others miss.
Layer 1: Provider-Level Prompt Caching
The simplest layer required almost no custom code. Both Anthropic and OpenAI offer prompt caching that stores repeated input tokens, system prompts, static instructions, and frequently included context, and charges reduced rates when those tokens appear in subsequent requests. Anthropic's implementation discounts cached input tokens by 90%. Note that since the March 2026 default TTL change, production deployments need to explicitly set ttl: 3600 on cache_control breakpoints to keep the 1-hour cache window, a single field that has cost teams 15-20x in monthly spend when missed. For the customer support service, the 1,200-token system prompt was identical across every request. Before prompt caching, those tokens cost full input price 28,000 times daily. After enabling prompt caching, the first request of each session paid full price, and every subsequent request that session paid 10% for those same tokens. The math was straightforward. At $3 per million input tokens, 1,200 tokens across 28,000 daily requests cost roughly $100/day just for system prompts. With 90% cached token discounts, that dropped to approximately $13/day. Across all three services, prompt caching alone reduced input token costs by about 40%.
Layer 2: Exact-Match Response Caching
The second layer cached complete LLM responses keyed on a hash of the full input, query text, relevant context, and model parameters. When an identical request appeared, the cached response returned in under 5 milliseconds instead of making a fresh API call. We implemented this with Redis, using composite cache keys that included the user query, the retrieved context hash, and the model version. This prevented serving stale responses when the knowledge base updated or models changed. TTLs varied by service: 48 hours for customer support, indefinite with hash-based invalidation for document extraction, and 4 hours for analytics queries tied to refreshing data. Exact-match caching is conservative by design, it only fires on literally identical inputs. But for document processing, where the same PDF might get uploaded multiple times, and for analytics queries where dashboards trigger identical questions on refresh, the hit rates were meaningful: 22% of all requests across the three services were exact duplicates that hit this cache layer. The beauty of exact-match caching is zero quality risk. Same input, same output, guaranteed. There's no threshold to tune, no similarity to debate. It either matches perfectly or it doesn't.
Semantic Similarity Caching for the Long Tail
The third and most impactful layer addressed the real cost driver: queries that were different in wording but identical in intent. This is where semantic similarity caching transformed the economics.
How We Built the Semantic Layer
When a query passed through the exact-match layer without a hit, we converted it to an embedding using text-embedding-3-small and searched against cached query embeddings using FAISS. If the highest cosine similarity score exceeded our threshold, we served the cached response. If not, the query went to the LLM, and both the response and embedding were added to the cache. The embedding lookup added 8-12 milliseconds of latency on cache misses, negligible against a 1-3 second LLM response time. On cache hits, the response returned in under 15 milliseconds total.
Threshold Calibration Made the Difference
We started with a 0.90 threshold based on general recommendations. Initial results showed a 26% hit rate, but quality audits revealed that roughly 3% of cached responses were incorrect, similar-sounding queries with different intent were matching. "Update my billing address" and "Update my subscription plan" scored 0.91 similarity but require completely different handling. We pulled 1,200 query pairs from production logs, labeled them as semantically equivalent or not, and plotted the similarity distribution. The data showed a clean separation at 0.93, above that threshold, 99.6% of matches were genuinely equivalent. We adjusted accordingly. At the 0.93 threshold, the semantic cache caught 19% of queries that passed through the exact-match layer. Combined with exact-match hits, 41% of all requests were served from cache without an API call. For the customer support service specifically, the combined cache hit rate reached 48%.
Per-Category Thresholds for Higher Precision
We went further by classifying incoming queries into categories before the cache lookup and applying different thresholds per category. Password and account access queries used a 0.91 threshold because they're highly repetitive with minimal variation. Billing questions used 0.94 because subtle differences in wording often implied different issues. Product feature questions used 0.92 as a middle ground. This category-aware approach lifted the overall semantic cache hit rate from 19% to 23% while keeping the false positive rate below 0.4%.
Cache Invalidation That Didn't Break Everything
The hardest part of any caching implementation is invalidation. We needed cached responses to stay fresh without manual intervention, and stale answers in a customer-facing system create support tickets that cost more than the API calls we saved.
Version-Based Keys as the Foundation
Every cache key included the model version, the system prompt hash, and the knowledge base version. When any of these changed, old entries automatically became misses. This handled the most dangerous staleness scenario, model updates or knowledge base changes that make previously correct responses wrong, without requiring operational intervention.
Event-Driven Invalidation for Content Changes
The knowledge base powered by their RAG system updated 2-3 times weekly. We connected cache invalidation to the document ingestion pipeline: when a source document was updated, the system identified cached responses that had used that document as context and purged them selectively. This targeted approach avoided flushing the entire cache for a single document change.
TTL as a Safety Net, Not a Strategy
We set conservative TTLs, 48 hours for support queries, 4 hours for analytics, as backstops. But the version-based and event-driven layers handled 95% of invalidation proactively. The TTLs existed to catch edge cases, not as the primary freshness mechanism. Teams that rely solely on TTL-based invalidation either set them too short (killing hit rates) or too long (serving stale data). Both cost money.
The Numbers: From $47K to $11.8K Monthly
After six weeks of phased deployment, prompt caching first, then exact-match, then semantic, we measured the results across a full billing cycle.
The cost breakdown by caching layer tells the story of compounding savings:
The quality score, measured through user feedback and automated evaluation against ground truth, didn't move. Users didn't notice the caching layer at all, which is exactly the point. A cache that degrades user experience isn't an optimization; it's a liability.
| Metric | Before | After | Change |
|---|---|---|---|
| Monthly API spend | $47,200 | $11,800 | -75% |
| Daily LLM API calls | 52,000 | 28,600 | -45% |
| Avg response latency (cached) | N/A | 12ms | N/A |
| Avg response latency (uncached) | 1,800ms | 1,750ms | -3% |
| Cache hit rate (combined) | 0% | 41% | +41% |
| Response quality score | 4.3/5.0 | 4.3/5.0 | No change |
What We Would Do Differently Next Time
Six weeks of implementation taught us lessons that would compress a future project to three or four weeks.
Start with Prompt Caching on Day One
Prompt caching requires almost no engineering effort and delivers immediate savings. We spent the first week on instrumentation and analysis before touching caching. Next time, we'd enable provider-level prompt caching during the instrumentation phase, it's risk-free and starts saving money while you're still gathering data to design the other layers.
Warm the Semantic Cache Before Launch
Our semantic cache started cold and took 10 days to reach steady-state hit rates. We could have analyzed historical query logs, identified the top 500 query patterns per service, pre-generated responses, and seeded the cache before flipping the switch. Cache warming would have captured an additional $2,000-3,000 in savings during that ramp period.
Invest in Monitoring Earlier
We built quality monitoring dashboards in week five. We should have built them in week one. Caching introduces a new failure mode, serving stale or incorrect responses, that's invisible without explicit monitoring. Tracking cached response quality from day one gives you confidence to tune thresholds aggressively and expand cache coverage faster.
Don't Forget the Infrastructure You Already Have
Before building the semantic layer, we nearly overlooked that their existing Redis instance could handle exact-match caching with zero additional infrastructure. Always audit what's already deployed. The fastest, cheapest caching layer is one built on infrastructure that's already running and paid for.
Building a Caching Strategy for Your AI Stack
A 75% cost reduction sounds dramatic, but the mechanics are straightforward. Most AI applications reprocess the same queries repeatedly, include the same prompts in every request, and pay full price for work they've already completed. A layered caching architecture systematically eliminates each type of redundancy.
Start with prompt caching because it's free to implement and risk-free. Add exact-match response caching for your highest-volume endpoints. Then introduce semantic similarity caching with conservative thresholds, calibrate against production data, and expand coverage as confidence builds.
The goal isn't caching everything, it's caching the right things. Classify your endpoints, understand which responses are safe to cache, and invest your engineering effort where the ROI is highest. The organizations paying the least per AI interaction aren't the ones using the cheapest models. They're the ones that stopped paying to answer the same question twice.
Frequently Asked Questions
Quick answers to common questions about this topic
The infrastructure cost is minimal, typically $50-200/month for Redis or an in-memory cache. The real investment is engineering time: 2-4 weeks for a basic exact-match + prompt caching setup, 4-6 weeks for a full three-layer architecture including semantic similarity caching. For organizations spending $5K+ monthly on LLM APIs, the payback period is usually under one month.


