Why Your LLM Bill Is Exploding — And How Semantic Caching Can Cut It (Really) Hard

Somewhere, right now, a CFO is looking at an LLM invoice and whispering the sacred enterprise mantra: “Why is this number… doing that?”

If you’ve shipped anything from a customer support bot to an internal “Ask the Docs” assistant, you’ve seen it: usage rises, costs rise faster, and your dashboards look like they were inspired by a SpaceX launch. The awkward part is that the growth curve often has less to do with “more users” and more to do with something far more mundane: humans are creative re-phrasers.

This article is based on (and links to) the excellent guest post “Why your LLM bill is exploding — and how semantic caching can cut it by 73%” by Sreenivasa Reddy Hulebeedu Reddy, published on VentureBeat on January 12, 2026.

Reddy describes a very relatable scenario: their LLM API bill grew around 30% month over month, even though traffic wasn’t rising at the same rate. The culprit wasn’t some exotic prompt injection or a rogue intern with a love for haiku. It was redundancy: users asking essentially the same thing in many different ways. Exact-match caching caught only a small fraction; semantic caching (embedding-based similarity) drove cache hit rates much higher and cut LLM API costs by 73% in their production results.

Let’s unpack what’s actually happening, why semantic caching works, where it bites you, and how to implement it without accidentally caching the wrong answer to “How do I cancel my order?” for someone who asked “How do I cancel my subscription?” (Ask me how I know.)

LLM bills don’t just grow — they metastasize

Traditional SaaS costs usually scale with users, seats, or at least something your finance team recognizes. LLM costs scale with tokens, which are influenced by:

  • How often you call the model (requests per second, background jobs, agent loops).
  • How long your prompts are (system prompts, retrieved context, conversation history).
  • How long your outputs are (verbose answers, chain-of-thought style reasoning, multi-step plans).

Now add the most expensive ingredient: repetition.

In the VentureBeat piece, Reddy analyzed query logs and found three buckets across 100,000 production queries:

  • 18% exact duplicates
  • 47% semantically similar (same intent, different wording)
  • 35% genuinely novel

Exact-match caching, keyed on the raw query text, captures that first 18% and shrugs helplessly at the rest. Semantic caching goes after the 47%.

Why the “same question” problem is worse than it sounds

When humans talk to LLM apps, they do three things that destroy naive caching:

  • They vary phrasing: “refund policy” vs “can I return this?”
  • They add fluff: “Hey, quick question…” (token tax!)
  • They include context inconsistently: product name, order ID, region, date.

And that last bullet is the trap: sometimes queries are semantically similar but should not share answers. The difference between “cancel order” and “cancel subscription” can mean refunds, retention flows, legal terms, and your support team’s sanity. Reddy calls this out explicitly when discussing the mistakes made at a 0.85 similarity threshold.

Caching options: exact-match, prompt caching, and semantic caching

Before we go all-in on semantic caching, it helps to separate three distinct “caching-ish” strategies that often get conflated in architecture diagrams:

1) Exact-match caching (string key)

This is the classic. Hash the prompt, store the response, return it if you see the same prompt again. It’s simple, cheap, and safe when the prompt contains all relevant context and the output is deterministic enough for your app. But in real user chat, exact duplicates are rarer than you’d think, which is why Reddy only saw an 18% hit rate.
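
For reference, the whole thing fits in a few lines. A minimal sketch, assuming an in-process dict as the store (you’d use Redis or similar in production) and a hypothetical call_llm function standing in for your model client:

```python
import hashlib

# Minimal exact-match cache sketch. `call_llm` is a hypothetical stand-in for
# whatever client call your app makes; swap the dict for Redis in production.
_cache: dict[str, str] = {}

def cached_completion(prompt: str, call_llm) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in _cache:                   # hit only on byte-identical prompts
        return _cache[key]
    response = call_llm(prompt)         # miss: pay for the full LLM call
    _cache[key] = response
    return response
```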

2) Provider-side prompt caching (prefix reuse)

OpenAI and Azure OpenAI both provide a form of prompt caching that discounts repeated input-token computation when the beginning (prefix) of the prompt is identical. OpenAI introduced prompt caching in its API on October 1, 2024, offering discounted cached input tokens when the model has recently seen the same prefix (with cached prefixes typically clearing after 5–10 minutes of inactivity, and always within an hour).

Azure OpenAI’s documentation similarly describes prompt caching as applying when the beginning of the prompt is identical (with requirements like minimum prompt length and identical initial tokens) and notes cache lifetimes on the order of minutes.

This is not the same as returning a cached model answer. Prompt caching reduces compute and cost for identical prefixes; semantic caching can skip the LLM call entirely by reusing the previous output if the new question is “close enough” in meaning.
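
Because the prefix has to be byte-identical, prompt assembly order matters more than people expect. A sketch of the idea, with illustrative message shapes (providers differ in the details): keep the large static parts first and append per-request content last.

```python
# Illustrative prompt assembly for provider-side prefix caching: keep the large,
# static parts first so repeated calls share an identical prefix, and put
# per-request content last. Message shapes vary by provider; this is a sketch.
STATIC_SYSTEM_PROMPT = "You are the support assistant for ExampleCo..."   # long, unchanging
SHARED_POLICY_CONTEXT = "...several thousand tokens of policy text..."    # unchanging per deploy

def build_messages(user_question: str) -> list[dict]:
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        {"role": "system", "content": SHARED_POLICY_CONTEXT},
        # Anything request-specific (user question, retrieved docs, timestamps)
        # goes after the stable prefix, otherwise the prefix match breaks.
        {"role": "user", "content": user_question},
    ]
```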

3) Semantic caching (embedding similarity)

Semantic caching replaces “string equality” with “vector similarity.” Instead of hashing the prompt text, you embed it, search for a near neighbor in a vector index, and if the similarity is above a threshold, you return the stored response.

Reddy provides a clear architecture sketch: an embedding model, a vector store (e.g., FAISS, Pinecone), and a response store (e.g., Redis, DynamoDB).

Tools vary, but the pattern is consistent (a minimal sketch in code follows the list):

  • Compute embedding for the incoming query.
  • Run a nearest-neighbor search against cached query embeddings.
  • If similarity ≥ threshold, return the stored response instead of calling the LLM.
  • If miss, call the LLM, store the new query+response, and index the query embedding.
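
Here is that loop as a minimal sketch, assuming cosine similarity over L2-normalized embeddings and an in-memory list standing in for the vector index; embed and call_llm are hypothetical stand-ins for your embedding model and LLM client:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.92   # illustrative; tune per query type (see below)

# In-memory stand-ins for the vector index and the response store.
_vectors: list[np.ndarray] = []
_responses: list[str] = []

def _normalize(v: np.ndarray) -> np.ndarray:
    return v / (np.linalg.norm(v) + 1e-12)

def semantic_cached_completion(query: str, embed, call_llm) -> str:
    q = _normalize(np.asarray(embed(query), dtype=np.float32))
    if _vectors:
        sims = np.stack(_vectors) @ q            # cosine similarity via dot product
        best = int(np.argmax(sims))
        if sims[best] >= SIMILARITY_THRESHOLD:   # close enough in meaning: reuse the answer
            return _responses[best]
    answer = call_llm(query)                     # miss: call the model
    _vectors.append(q)                           # index the query embedding
    _responses.append(answer)                    # store the response
    return answer
```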

The key metric: hit rate is good; wrong hits are catastrophic

Semantic caching works because a lot of “different” questions are effectively the same intent. But it’s also dangerous because some “similar” questions aren’t the same. That’s why Reddy’s post spends a lot of time on the similarity threshold, and frankly, it deserves the attention.

The threshold problem (a.k.a. “0.85 is a lie”)

Reddy describes starting with a similarity threshold of 0.85 and quickly discovering false positives — like treating “cancel subscription” as equivalent to “cancel order.” Their conclusion: a single global threshold is a trap, and the best threshold varies by query type. They list example thresholds such as 0.94 for FAQ-style, 0.88 for product search, 0.92 for support, and 0.97 for transactional queries.
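
In code, that per-type policy can be as boring as a lookup table. A sketch using the example values from the post (the type names and the fallback are illustrative):

```python
# Per-query-type similarity thresholds, using the example values from the post.
# The classifier that assigns query_type is sketched later in the article.
THRESHOLDS = {
    "faq": 0.94,
    "product_search": 0.88,
    "support": 0.92,
    "transactional": 0.97,
}
DEFAULT_THRESHOLD = 0.95  # illustrative: when in doubt, be conservative

def threshold_for(query_type: str) -> float:
    return THRESHOLDS.get(query_type, DEFAULT_THRESHOLD)
```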

This is the heart of semantic caching as an engineering discipline: you’re not just optimizing dollars; you’re optimizing the trade-off between precision (don’t return wrong cached answers) and recall (do return cached answers when appropriate).

A practical labeling approach (yes, humans still exist)

Reddy didn’t tune thresholds blindly. They sampled query pairs at different similarity levels, had human annotators label whether the intent matched, then produced precision/recall curves and picked thresholds based on error cost.

That’s the correct approach. You can automate parts of it, but you still need a ground-truth dataset that reflects your domain. Especially if your app lives anywhere near regulated content, payments, legal policy, or anything that can be screenshot and tweeted.
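
A sketch of that evaluation step, assuming you already have human-labeled pairs of (similarity score, same intent) for one query type:

```python
# Given human-labeled pairs (similarity score, same_intent), compute
# precision/recall at candidate thresholds and pick based on error cost.
def precision_recall(labeled: list[tuple[float, bool]], threshold: float) -> tuple[float, float]:
    tp = sum(1 for s, same in labeled if s >= threshold and same)
    fp = sum(1 for s, same in labeled if s >= threshold and not same)
    fn = sum(1 for s, same in labeled if s < threshold and same)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data: sweep thresholds and inspect the trade-off for one query type.
labeled_pairs = [(0.97, True), (0.91, True), (0.89, False), (0.86, False)]
for t in (0.85, 0.90, 0.95):
    p, r = precision_recall(labeled_pairs, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```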

Architecture: what “semantic cache” actually means in production

At a minimum, you need two storage systems:

  • A vector index for similarity lookup (FAISS, Redis vector search, Pinecone, etc.).
  • A key-value/document store holding the cached response payload and metadata (Redis, DynamoDB, Postgres, etc.).

You can combine them in some platforms, but conceptually these are different concerns: “find similar item” vs “store and retrieve response blob and metadata.”

Vector store choices: FAISS, Redis, Pinecone, and friends

FAISS is a widely used open-source library for similarity search and clustering of dense vectors, built in C++ with Python bindings and optional GPU acceleration. It’s designed for efficient search across large vector sets.

Redis Stack supports vector fields and vector similarity queries via RediSearch commands and indexing methods like FLAT and HNSW (approximate nearest neighbor).

Pinecone is a managed vector database. One operational detail that’s relevant when you’re implementing semantic caching: the system doesn’t impose a default “similarity threshold” for you; you typically retrieve top_k results and decide how to filter them based on score, your own threshold, and your business logic.
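
To make the FAISS option concrete, here’s a minimal sketch of exact inner-product search over L2-normalized vectors, which is equivalent to cosine similarity; the dimension, data, and threshold are placeholders:

```python
import faiss
import numpy as np

dim = 384                                              # illustrative embedding dimension
index = faiss.IndexFlatIP(dim)                         # exact inner-product search

cached = np.random.rand(1000, dim).astype("float32")   # stand-in for real query embeddings
faiss.normalize_L2(cached)                             # normalize so inner product == cosine
index.add(cached)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 1)                   # top-1 nearest cached query
if scores[0][0] >= 0.92:                               # illustrative threshold
    print(f"cache hit: entry {ids[0][0]} (similarity {scores[0][0]:.3f})")
else:
    print("cache miss: call the LLM and index this query")
```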

If you want a rule of thumb:

  • FAISS is great when you control infra and want performance and flexibility.
  • Redis is attractive when you already run Redis and want low-latency, operational simplicity, and caching semantics in one place.
  • Managed vector DBs shine when you’d rather not own the scaling story.

Don’t forget the response store: you’re caching more than text

A cached LLM response in a modern app is rarely just a string. You’ll likely want to store:

  • The response text (and maybe citations or tool outputs).
  • Which model produced it (responses can differ across models/versions).
  • Prompt template version and system prompt hash.
  • Retrieval context identifiers (document IDs, knowledge base version).
  • Timestamps and TTL policy.
  • Safety metadata (moderation results, policy decisions).

This matters because “same user question” can legitimately yield a different best response after you update documentation, change products, or rotate policy language.
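
A sketch of what such a cache entry might record, with illustrative field names:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative cache entry schema: the response plus the metadata you need
# to decide later whether the entry is still valid and safe to serve.
@dataclass
class CacheEntry:
    response_text: str
    model: str                      # which model/version produced it
    prompt_template_version: str
    system_prompt_hash: str
    source_doc_ids: list[str] = field(default_factory=list)  # for event-based invalidation
    knowledge_base_version: str = ""
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    ttl_seconds: int = 86_400
    moderation_passed: bool = True
```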

Latency: semantic caching adds overhead, but the system gets faster overall

Semantic caching adds two steps before you know whether you’ll hit the LLM:

  • Embedding computation
  • Vector search

Reddy provides representative measurements: roughly 12ms p50 for embedding, 8ms p50 for vector search, 20ms total lookup p50; compared with ~850ms p50 for an LLM API call (and much higher p99). On misses, you pay the extra ~20ms. On hits, you save the full LLM roundtrip.

The key is that with a high enough hit rate, your overall latency improves even though cache misses get slightly slower. Reddy’s math shows average latency dropping from ~850ms to ~300ms with a 67% hit rate.
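
That arithmetic is worth making explicit. A back-of-the-envelope check using the p50 figures above:

```python
# Back-of-the-envelope average latency using the p50 figures from the post.
hit_rate = 0.67
lookup_ms = 20          # embedding + vector search
llm_ms = 850            # p50 LLM API call

hit_latency = lookup_ms                  # cache hit: no LLM call
miss_latency = lookup_ms + llm_ms        # cache miss: lookup overhead plus the call
expected = hit_rate * hit_latency + (1 - hit_rate) * miss_latency
print(f"expected average latency ≈ {expected:.0f} ms")   # ≈ 300 ms
```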

If you’re building an interactive chat UI, this is not a “nice to have.” It’s the difference between “feels instant” and “feels like a fax machine.”

Invalidation: the part everyone postpones until it hurts

Caching is easy. Correct caching is hard. Semantic caching introduces a particularly spicy failure mode: confidently returning a plausible answer that is outdated or wrong for the user’s case.

Reddy outlines three invalidation strategies:

  • Time-based TTL by content type (pricing vs policy vs FAQ)
  • Event-based invalidation when underlying content changes
  • Staleness detection by periodically re-generating and comparing responses

That’s the right trio.

TTL isn’t enough (but it’s still necessary)

TTL alone assumes you can guess staleness by time. That’s better than nothing, but pricing pages can change three times in a day, while some policy text is stable for months. You’ll want TTLs tuned to content volatility, and you’ll want the ability to override them when an urgent update lands.
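
In practice that tends to look like a small table of TTLs keyed by content type; the categories and durations below are illustrative, not from the post:

```python
# Illustrative TTLs by content type, tuned to how fast each kind of content changes.
TTL_BY_CONTENT_TYPE = {
    "pricing": 60 * 60,            # 1 hour: pricing can change several times a day
    "product_faq": 24 * 60 * 60,   # 1 day
    "policy": 7 * 24 * 60 * 60,    # 1 week: stable, but still bounded
}

def ttl_for(content_type: str) -> int:
    return TTL_BY_CONTENT_TYPE.get(content_type, 6 * 60 * 60)  # conservative default: 6 hours
```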

Event-based invalidation is the dream (if your content is structured)

If your assistant answers based on a knowledge base, ticketing system, catalog, or policy repository, you should treat those systems as sources of truth and emit invalidation events on update. That requires metadata linking: cached responses must record which content IDs or data entities they depended on.

Without dependency tracking, “invalidate everything” becomes your only safe option, and your cache hit rate will respond by falling down the stairs.
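
A sketch of that dependency tracking: keep a reverse index from content IDs to cache entries, and drop the affected entries when an update event arrives (the function names and the delete_entry hook are illustrative):

```python
from collections import defaultdict

# Reverse index: content/document ID -> cache entry IDs that depended on it.
_entries_by_doc: dict[str, set[str]] = defaultdict(set)

def record_dependencies(entry_id: str, source_doc_ids: list[str]) -> None:
    for doc_id in source_doc_ids:
        _entries_by_doc[doc_id].add(entry_id)

def on_content_updated(doc_id: str, delete_entry) -> None:
    # Called from a webhook / change event on the knowledge base or catalog.
    for entry_id in _entries_by_doc.pop(doc_id, set()):
        delete_entry(entry_id)   # remove from both the vector index and the response store
```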

Staleness detection is expensive — so sample it

Reddy’s approach of sampling cached entries daily for freshness checks is pragmatic. You don’t need to re-validate everything constantly; you need a mechanism that catches drift and warns you when content is changing faster than your TTL policy assumes.
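
A sketch of the sampling idea, where regenerate and still_equivalent are hypothetical hooks for a fresh LLM call and whatever comparison you trust (embedding distance, an LLM judge, exact match on key facts):

```python
import random

# Daily staleness check over a random sample of cached (query, response) pairs.
def staleness_sample(entries, regenerate, still_equivalent, rate: float = 0.01):
    drifted = []
    for query, cached_response in random.sample(entries, max(1, int(len(entries) * rate))):
        fresh = regenerate(query)
        if not still_equivalent(cached_response, fresh):
            drifted.append(query)   # candidate for invalidation and a shorter TTL
    return drifted
```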

What you should not cache (unless you like incident retrospectives)

Semantic caching is not a blanket “cache all the things” strategy. In the VentureBeat post, Reddy explicitly warns against caching personalized responses, time-sensitive information, and transactional confirmations.

I’ll broaden that with a newsroom-style list of “here be dragons”:

  • PII or user-specific data: If the answer depends on account data, don’t share it through a global cache.
  • Authorization-scoped content: Anything behind RBAC/ABAC must be keyed and isolated by permission scope.
  • Real-time facts: “Is service X down?” “What’s the status of my ticket?”
  • Payments and legal commitments: Refunds, cancellations, contract language. Cache carefully with very high thresholds and short TTLs.
  • Tool-using agent results: If the response depends on a live tool call (web search, database query), you need to cache the tool result with clear staleness semantics—if at all.

Semantic caching is best when the answer is stable and the intent space is repetitive: FAQs, onboarding, product documentation, internal policy explanations, and support triage steps.

Real-world cost control: semantic caching vs provider prompt caching

It’s tempting to view provider prompt caching as “problem solved,” but it addresses a different slice of the cost stack.

Provider prompt caching: great for long, repetitive prefixes

Prompt caching shines when you have a large, identical prefix across many calls — such as a long system prompt, a big code context, or repeated conversation history. OpenAI’s prompt caching discounts cached input tokens for prompts over a certain size and tracks cache usage via a cached_tokens field.

Azure OpenAI describes similar behavior, including short cache lifetimes and identical-prefix requirements.

This can materially cut costs and latency, but it doesn’t help when users ask the same thing with different wording, and it doesn’t remove output-token costs.

Semantic caching: best for repeated intent, different phrasing

Semantic caching can avoid the LLM call entirely on hits. That saves input tokens, output tokens, and time. The catch is correctness: you’re returning a previous answer, not merely reusing compute.

In practice, many teams use both: provider prompt caching to reduce the cost of heavy prompts, plus semantic caching to avoid redundant calls for repeated intent. They solve different problems and often stack well.

Tooling and frameworks: you don’t have to build it all from scratch

Semantic caching has gone from “clever research idea” to “product checkbox.” A few options:

LangChain: standard and semantic cache integrations

LangChain documents both standard Redis caching and semantic caching using Redis, where hits are evaluated based on semantic similarity.

This is useful if you’re already in the LangChain ecosystem and want a faster path to a prototype or a production-ready baseline.

Redis LangCache: managed semantic caching

Redis offers LangCache, positioned as a managed semantic caching service to reduce LLM calls and costs by reusing responses for repeated queries.

If you’re a small team, “don’t build your own caching platform” is often the best optimization you can make. (Second-best is usually “stop running your agent loop 17 times per user message.”)

GPTCache: open-source semantic caching for LLM queries

For an open-source approach, GPTCache (by Zilliz) describes itself as a semantic cache library for LLM queries, integrated with LangChain and LlamaIndex.

Whether you use GPTCache or roll your own, the key is to treat cache correctness and invalidation as first-class concerns.

Production results: what “73% cheaper” actually implies

Reddy’s reported production results after three months are attention-grabbing for a reason:

  • Cache hit rate improved from 18% to 67%
  • LLM API costs reduced from $47K/month to $12.7K/month (a 73% drop)
  • Average latency dropped from 850ms to 300ms
  • False-positive rate reported around 0.8%
  • Customer complaints increased only slightly (+0.3%)

Those numbers, importantly, are not magic. They’re the result of:

  • High natural redundancy in user intent
  • Threshold tuning by query type
  • Not caching everything
  • Having an invalidation strategy

All of which are the boring, repeatable kind of engineering you can actually operationalize.

A reference implementation outline (without the “AI theater”)

If you’re implementing semantic caching in a mature production system, the questions you should be able to answer are:

1) What is the cache key, really?

In semantic caching, the “key” is the embedding vector plus the similarity threshold policy. But you still need to incorporate several other dimensions (sketched in code after this list):

  • Model/version (don’t mix outputs across models casually)
  • Prompt template version
  • Knowledge base version
  • User locale (language and region can change the correct answer)
  • Permission scope (for internal tools)
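
One way to keep those dimensions straight is to fold them into a namespace that partitions the cache, so similarity search only ever runs within a compatible scope. A sketch with illustrative field names:

```python
from dataclasses import dataclass

# The embedding handles "what was asked"; this namespace handles "under which
# conditions the cached answer is valid". Only search within a matching namespace.
@dataclass(frozen=True)
class CacheScope:
    model: str                    # e.g. the exact model/version string (illustrative)
    prompt_template_version: str
    knowledge_base_version: str
    locale: str                   # "en-US", "de-DE", ...
    permission_scope: str         # tenant / role for internal tools

    def namespace(self) -> str:
        return "|".join([self.model, self.prompt_template_version,
                         self.knowledge_base_version, self.locale, self.permission_scope])
```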

2) How do you classify query types?

Reddy uses query-type-specific thresholds via a query classifier.

That classifier can be:

  • A rules engine (regex + metadata)
  • A lightweight model
  • An LLM classification step (ironic, but workable if it’s cheap and cached)

Start simple. Most teams can separate “FAQ-ish” vs “transactional/account” vs “search/browse” with high accuracy using a small set of patterns and metadata.
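
A sketch of that rules-engine starting point; the patterns are illustrative and will need tuning against your own traffic:

```python
import re

# Illustrative first-pass classifier: regex patterns plus a safe default.
# Anything that smells transactional or account-specific gets the strictest handling.
RULES = [
    ("transactional",  re.compile(r"\b(cancel|refund|charge|invoice|payment|order\s*#?\d+)\b", re.I)),
    ("support",        re.compile(r"\b(error|broken|not working|bug|crash|help)\b", re.I)),
    ("product_search", re.compile(r"\b(compare|cheapest|best|vs\.?|alternatives?)\b", re.I)),
]

def classify(query: str) -> str:
    for query_type, pattern in RULES:
        if pattern.search(query):
            return query_type
    return "faq"   # default bucket: most stable, but still uses its own threshold
```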

3) What’s your policy on “near misses”?

When the top match is below threshold, do you:

  • Call the LLM normally (safe default)?
  • Use the near match as context (“previously answered similar question…”)?
  • Use it to seed a shorter prompt (hybrid with provider prompt caching)?

These hybrids can help, but they add complexity. The nice thing about a strict semantic cache is you can reason about it: hit returns cached output; miss calls the model.

4) How do you evaluate quality degradation?

Reddy tracked false positives and customer complaint deltas.

In your environment, also track:

  • Escalation rate to human support
  • Thumbs up/down or satisfaction signals
  • Task success (if you can instrument it)
  • “Correction turns” (user says “no, I meant…”) as a proxy for wrong cache hits

Security and privacy considerations (because caches remember everything)

Caching is a form of memory, and memory is a form of liability. A few practical safeguards:

  • Scope caches: at minimum by tenant, often by user group/role, sometimes by user.
  • Redact or avoid storing PII: do not cache outputs that include personal data; implement detection (as Reddy suggests) and exclusion rules.
  • Encrypt at rest (especially if you cache tool outputs or internal docs).
  • Auditability: store provenance (where did this answer come from, and when?).

Also: if your system uses retrieval (RAG), you must cache in a way that preserves access controls. Otherwise you’ve built a very fast data exfiltration machine.

So, should you implement semantic caching?

Use semantic caching when:

  • Your app has a high volume of repeated intents (support, FAQs, internal policy Q&A).
  • Answers are relatively stable (or you have a robust invalidation mechanism).
  • You can tolerate a small false-positive rate (or tune thresholds high for sensitive categories).
  • Your LLM cost is dominated by redundant calls rather than novel reasoning.

Be cautious when:

  • Requests are heavily personalized.
  • Information is real-time or frequently changing.
  • Wrong answers are very expensive (payments, compliance). Use very high thresholds and narrow cache rules.

The punchline from the VentureBeat post is straightforward: semantic caching is one of the highest ROI optimizations for production LLM systems, but it’s only safe when you treat threshold tuning and invalidation as part of the core product—not as “later.”

Sources

Sreenivasa Reddy Hulebeedu Reddy, “Why your LLM bill is exploding — and how semantic caching can cut it by 73%,” VentureBeat, January 12, 2026.

Bas Dorland, Technology Journalist & Founder of dorland.org