We implemented a three-layer caching architecture—provider-level prompt caching, exact-match response caching, and semantic similarity caching—for a client spending $47K/month on LLM API calls across customer support, document processing, and internal analytics. Prompt caching cut input token costs 40% by eliminating repeated system prompt and RAG context charges. Exact-match caching handled 22% of queries from cache with zero quality risk. Semantic similarity caching at a 0.93 threshold caught another 19% of queries that were paraphrased duplicates. Combined with model routing that sent classification tasks to smaller models, total API spend dropped to $11,800/month—a 75% reduction—with no measurable decline in response quality. The entire implementation took six weeks.
Last quarter, a client came to us with a problem that's become painfully common: their AI-powered platform was working well, users loved it, but the monthly API bill had ballooned to $47,000 and was climbing with every new feature. Three LLM-dependent services—customer support automation, document processing, and an internal analytics assistant—were each making tens of thousands of API calls daily, and most of those calls were processing queries the system had already answered.
The fix wasn't switching models, renegotiating contracts, or cutting features. It was recognizing that their system was paying full price to answer the same questions over and over. Within six weeks, a smart caching architecture reduced their monthly spend to $11,800—a 75% reduction—without touching response quality.
At Particula Tech, we've built caching layers for AI systems across industries. This article walks through the exact three-layer strategy we used, the decisions that mattered, and the numbers behind each layer's contribution. If your AI API costs are growing faster than your revenue, this is the playbook.
Why Their AI API Costs Were Spiraling
Before we could fix the problem, we needed to understand where the money was going. We instrumented their three services with request-level cost tracking for two weeks. The data revealed a pattern we see in almost every AI application that wasn't designed with cost efficiency in mind.
Their customer support chatbot processed 28,000 queries daily. Each query included a 1,200-token system prompt, 400-800 tokens of retrieved knowledge base context, and the user's question. Roughly 38% of user queries were semantically identical to questions already answered that day—"How do I reset my password?" asked forty different ways.
The document processing pipeline extracted structured data from invoices, contracts, and reports. Around 15% of documents were resubmissions or duplicates that had already been processed. Every resubmission triggered the same expensive extraction pipeline from scratch.
Their analytics assistant fielded repeated questions from different team members. "What were last month's sales?" and "Show me revenue for January" hit the LLM as separate requests despite requiring identical analysis. About 25% of analytics queries were functional duplicates.
The common thread: redundant computation. The system was doing expensive work it had already done, paying full price every time. The solution wasn't to reduce what the AI could do—it was to stop repeating work unnecessarily.
The Three-Layer Caching Architecture
We designed a caching strategy with three distinct layers, each targeting a different type of redundancy. Stacking these layers compounds their individual savings because each one catches queries the others miss.
Layer 1: Provider-Level Prompt Caching
The simplest layer required almost no custom code. Both Anthropic and OpenAI offer prompt caching that stores repeated input tokens—system prompts, static instructions, and frequently included context—and charges reduced rates when those tokens appear in subsequent requests. Anthropic's implementation discounts cached input tokens by 90%.

For the customer support service, the 1,200-token system prompt was identical across every request. Before prompt caching, those tokens cost full input price 28,000 times daily. After enabling prompt caching, the first request of each session paid full price, and every subsequent request that session paid 10% of the normal rate for those same tokens. The math was straightforward: at $3 per million input tokens, 1,200 tokens across 28,000 daily requests cost roughly $100/day just for system prompts. With 90% cached token discounts, that dropped to approximately $13/day.

Across all three services, prompt caching alone reduced input token costs by about 40%.
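Enabling this with Anthropic's API is mostly a matter of marking the static system prompt as cacheable. A minimal sketch—the model name, prompt text, and helper function are illustrative placeholders, not the client's actual configuration:

```python
# Sketch: marking a static system prompt as cacheable in an
# Anthropic Messages API request. The "cache_control" field on the
# system block is what opts those tokens into provider-level caching.

SYSTEM_PROMPT = "You are a customer support assistant for..."  # ~1,200 tokens in production

def build_request(user_query: str) -> dict:
    """Build request kwargs with the system prompt marked cacheable."""
    return {
        "model": "claude-sonnet-4-5",  # placeholder model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Subsequent requests that reuse this exact prefix are
                # billed at the discounted cached-token rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_query}],
    }
```

The kwargs can then be passed straight to the client, e.g. `anthropic.Anthropic().messages.create(**build_request(query))`. Because the cache matches on exact token prefixes, the static prompt must come before any per-request content.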
Layer 2: Exact-Match Response Caching
The second layer cached complete LLM responses keyed on a hash of the full input—query text, relevant context, and model parameters. When an identical request appeared, the cached response returned in under 5 milliseconds instead of making a fresh API call. We implemented this with Redis, using composite cache keys that included the user query, the retrieved context hash, and the model version. This prevented serving stale responses when the knowledge base updated or models changed. TTLs varied by service: 48 hours for customer support, indefinite with hash-based invalidation for document extraction, and 4 hours for analytics queries tied to refreshing data.

Exact-match caching is conservative by design—it only fires on literally identical inputs. But for document processing, where the same PDF might get uploaded multiple times, and for analytics queries where dashboards trigger identical questions on refresh, the hit rates were meaningful: 22% of all requests across the three services were exact duplicates that hit this cache layer.

The beauty of exact-match caching is zero quality risk. Same input, same output, guaranteed. There's no threshold to tune, no similarity to debate. It either matches perfectly or it doesn't.
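The composite-key scheme is the part worth getting right. A minimal sketch, with a dict standing in for the Redis store (names and the key prefix are illustrative; in production you'd swap the dict for a `redis.Redis` client and use `setex` for the per-service TTLs):

```python
import hashlib
import json

def cache_key(query: str, context_hash: str, model_version: str) -> str:
    """Composite key: identical inputs always map to the same key, and a
    model or knowledge-base change automatically turns old entries into misses."""
    payload = json.dumps(
        {"q": query, "ctx": context_hash, "model": model_version},
        sort_keys=True,
    )
    return "llmcache:" + hashlib.sha256(payload.encode()).hexdigest()

class ExactMatchCache:
    """Dict-backed stand-in for the Redis layer."""

    def __init__(self):
        self.store = {}

    def get(self, query, context_hash, model_version):
        # Returns None on a miss, so the caller falls through to the next layer.
        return self.store.get(cache_key(query, context_hash, model_version))

    def put(self, query, context_hash, model_version, response):
        self.store[cache_key(query, context_hash, model_version)] = response
```

Note that because the model version is baked into the key, rolling out a new model never requires flushing the cache—old entries simply stop matching.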
Layer 3: Semantic Similarity Caching for the Long Tail
The third and most impactful layer addressed the real cost driver: queries that were different in wording but identical in intent. This is where semantic similarity caching transformed the economics.
How We Built the Semantic Layer
When a query passed through the exact-match layer without a hit, we converted it to an embedding using text-embedding-3-small and searched against cached query embeddings using FAISS. If the highest cosine similarity score exceeded our threshold, we served the cached response. If not, the query went to the LLM, and both the response and embedding were added to the cache. The embedding lookup added 8-12 milliseconds of latency on cache misses—negligible against a 1-3 second LLM response time. On cache hits, the response returned in under 15 milliseconds total.
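The lookup logic can be sketched in a few lines. This is a simplified stand-in: `embed` represents the text-embedding-3-small call, and the brute-force cosine scan replaces the FAISS index we used in production (the decision logic is the same either way):

```python
import numpy as np

class SemanticCache:
    """Minimal semantic cache sketch: embed the query, find the nearest
    cached query by cosine similarity, serve its response if above threshold."""

    def __init__(self, embed, threshold: float = 0.93):
        self.embed = embed          # callable: str -> embedding vector
        self.threshold = threshold
        self.vectors = []           # unit-normalized query embeddings
        self.responses = []

    def lookup(self, query: str):
        """Return a cached response, or None to fall through to the LLM."""
        if not self.vectors:
            return None
        v = self._unit(self.embed(query))
        sims = np.array(self.vectors) @ v  # cosine similarity on unit vectors
        best = int(np.argmax(sims))
        return self.responses[best] if sims[best] >= self.threshold else None

    def add(self, query: str, response: str):
        """Called after a cache miss, once the LLM has answered."""
        self.vectors.append(self._unit(self.embed(query)))
        self.responses.append(response)

    @staticmethod
    def _unit(v):
        v = np.asarray(v, dtype=float)
        return v / np.linalg.norm(v)
```

In production the `self.vectors` list becomes a FAISS index so lookups stay fast as the cache grows, but the threshold comparison—the part that determines quality—is identical.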
Threshold Calibration Made the Difference
We started with a 0.90 threshold based on general recommendations. Initial results showed a 26% hit rate, but quality audits revealed that roughly 3% of cached responses were incorrect—similar-sounding queries with different intent were matching. "Update my billing address" and "Update my subscription plan" scored 0.91 similarity but require completely different handling. We pulled 1,200 query pairs from production logs, labeled them as semantically equivalent or not, and plotted the similarity distribution. The data showed a clean separation at 0.93—above that threshold, 99.6% of matches were genuinely equivalent. We adjusted accordingly. At the 0.93 threshold, the semantic cache caught 19% of queries that passed through the exact-match layer. Combined with exact-match hits, 41% of all requests were served from cache without an API call. For the customer support service specifically, the combined cache hit rate reached 48%.
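The calibration itself is simple once the pairs are labeled: sweep candidate thresholds and measure what fraction of above-threshold matches are genuinely equivalent. A sketch of that sweep (the sample pairs below are toy data, not our labeled set):

```python
def precision_at_threshold(pairs, threshold):
    """pairs: (similarity, is_equivalent) tuples from a labeled sample.
    Returns the fraction of matches at or above the threshold that are
    genuinely equivalent, or None if nothing matches."""
    matches = [equivalent for sim, equivalent in pairs if sim >= threshold]
    if not matches:
        return None
    return sum(matches) / len(matches)  # True counts as 1

# Sweep candidate thresholds over the labeled pairs to find the point
# where false positives drop to an acceptable rate.
def sweep(pairs, candidates):
    return {t: precision_at_threshold(pairs, t) for t in candidates}
```

Picking the threshold from a labeled production sample, rather than a generic recommendation, is what moved us from a 3% error rate at 0.90 to 0.4% at 0.93.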
Per-Category Thresholds for Higher Precision
We went further by classifying incoming queries into categories before the cache lookup and applying different thresholds per category. Password and account access queries used a 0.91 threshold because they're highly repetitive with minimal variation. Billing questions used 0.94 because subtle differences in wording often implied different issues. Product feature questions used 0.92 as a middle ground. This category-aware approach lifted the overall semantic cache hit rate from 19% to 23% while keeping the false positive rate below 0.4%.
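Mechanically this is just a threshold lookup keyed on the classifier's output, with a safe default for unrecognized categories. A sketch (category names are illustrative, not the client's actual taxonomy):

```python
# Per-category similarity thresholds, tuned from labeled production pairs.
CATEGORY_THRESHOLDS = {
    "account_access": 0.91,  # highly repetitive, minimal variation
    "billing": 0.94,         # subtle wording differences imply different issues
    "product_features": 0.92,
}
DEFAULT_THRESHOLD = 0.93     # global calibrated value as the fallback

def threshold_for(category: str) -> float:
    """Threshold used by the semantic cache for this query's category."""
    return CATEGORY_THRESHOLDS.get(category, DEFAULT_THRESHOLD)
```

The classifier that assigns categories runs before the cache lookup, so its cost and latency have to stay well below an LLM call—a small fine-tuned model or even keyword rules can be enough.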
Cache Invalidation That Didn't Break Everything
The hardest part of any caching implementation is invalidation. We needed cached responses to stay fresh without manual intervention, and stale answers in a customer-facing system create support tickets that cost more than the API calls we saved.
Version-Based Keys as the Foundation
Every cache key included the model version, the system prompt hash, and the knowledge base version. When any of these changed, old entries automatically became misses. This handled the most dangerous staleness scenario—model updates or knowledge base changes that make previously correct responses wrong—without requiring operational intervention.
Event-Driven Invalidation for Content Changes
The knowledge base powered by their RAG system updated 2-3 times weekly. We connected cache invalidation to the document ingestion pipeline: when a source document was updated, the system identified cached responses that had used that document as context and purged them selectively. This targeted approach avoided flushing the entire cache for a single document change.
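Selective purging requires a reverse index from source documents to the cache entries that used them as context. A minimal sketch of that bookkeeping (class and method names are illustrative):

```python
from collections import defaultdict

class InvalidationIndex:
    """Tracks which cached responses used which source documents, so a
    document update purges only the affected entries instead of the
    whole cache."""

    def __init__(self):
        self.doc_to_keys = defaultdict(set)

    def record(self, cache_key: str, doc_ids):
        """Called when a response is cached, with the IDs of the
        documents retrieved as context for that response."""
        for doc_id in doc_ids:
            self.doc_to_keys[doc_id].add(cache_key)

    def keys_to_purge(self, updated_doc_id: str):
        """Called from the ingestion pipeline when a document changes.
        Returns the affected cache keys and forgets the mapping."""
        return self.doc_to_keys.pop(updated_doc_id, set())
```

The returned keys are then deleted from Redis (e.g. a single `DELETE` over the batch), so a two-or-three-times-weekly knowledge base update touches only the entries it actually invalidates.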
TTL as a Safety Net, Not a Strategy
We set conservative TTLs—48 hours for support queries, 4 hours for analytics—as backstops. But the version-based and event-driven layers handled 95% of invalidation proactively. The TTLs existed to catch edge cases, not as the primary freshness mechanism. Teams that rely solely on TTL-based invalidation either set them too short (killing hit rates) or too long (serving stale data). Both cost money.
The Numbers: From $47K to $11.8K Monthly
After six weeks of phased deployment—prompt caching first, then exact-match, then semantic—we measured the results across a full billing cycle.
The cost breakdown by caching layer tells the story of compounding savings:
The quality score—measured through user feedback and automated evaluation against ground truth—didn't move. Users didn't notice the caching layer at all, which is exactly the point. A cache that degrades user experience isn't an optimization; it's a liability.
| Metric | Before | After | Change |
|---|---|---|---|
| Monthly API spend | $47,200 | $11,800 | -75% |
| Daily LLM API calls | 52,000 | 28,600 | -45% |
| Avg response latency (cached) | N/A | 12ms | N/A |
| Avg response latency (uncached) | 1,800ms | 1,750ms | -3% |
| Cache hit rate (combined) | 0% | 41% | +41% |
| Response quality score | 4.3/5.0 | 4.3/5.0 | No change |
What We Would Do Differently Next Time
Six weeks of implementation taught us lessons that would compress a future project to three or four weeks.
Start with Prompt Caching on Day One
Prompt caching requires almost no engineering effort and delivers immediate savings. We spent the first week on instrumentation and analysis before touching caching. Next time, we'd enable provider-level prompt caching during the instrumentation phase—it's risk-free and starts saving money while you're still gathering data to design the other layers.
Warm the Semantic Cache Before Launch
Our semantic cache started cold and took 10 days to reach steady-state hit rates. We could have analyzed historical query logs, identified the top 500 query patterns per service, pre-generated responses, and seeded the cache before flipping the switch. Cache warming would have captured an additional $2,000-3,000 in savings during that ramp period.
Invest in Monitoring Earlier
We built quality monitoring dashboards in week five. We should have built them in week one. Caching introduces a new failure mode—serving stale or incorrect responses—that's invisible without explicit monitoring. Tracking cached response quality from day one gives you confidence to tune thresholds aggressively and expand cache coverage faster.
Don't Forget the Infrastructure You Already Have
Before building the semantic layer, we nearly overlooked that their existing Redis instance could handle exact-match caching with zero additional infrastructure. Always audit what's already deployed. The fastest, cheapest caching layer is one built on infrastructure that's already running and paid for.
Building a Caching Strategy for Your AI Stack
A 75% cost reduction sounds dramatic, but the mechanics are straightforward. Most AI applications reprocess the same queries repeatedly, include the same prompts in every request, and pay full price for work they've already completed. A layered caching architecture systematically eliminates each type of redundancy.
Start with prompt caching because it's free to implement and risk-free. Add exact-match response caching for your highest-volume endpoints. Then introduce semantic similarity caching with conservative thresholds, calibrate against production data, and expand coverage as confidence builds.
The goal isn't caching everything—it's caching the right things. Classify your endpoints, understand which responses are safe to cache, and invest your engineering effort where the ROI is highest. The organizations paying the least per AI interaction aren't the ones using the cheapest models. They're the ones that stopped paying to answer the same question twice.
Frequently Asked Questions
Quick answers to common questions about this topic
How much does it cost to implement a caching architecture like this?
The infrastructure cost is minimal—typically $50-200/month for Redis or an in-memory cache. The real investment is engineering time: 2-4 weeks for a basic exact-match + prompt caching setup, 4-6 weeks for a full three-layer architecture including semantic similarity caching. For organizations spending $5K+ monthly on LLM APIs, the payback period is usually under one month.