How to Reduce Cost and Latency of Your RAG Application Using Semantic LLM Caching
Semantic caching in LLM (Large Language Model) applications improves efficiency by storing and reusing responses based on semantic similarity rather than exact text matches. When a new query arrives, it is converted into an embedding and compared with cached ones using similarity search. If a close match is found (above a similarity threshold), the cached response is returned immediately, skipping the expensive retrieval and generation process. Otherwise, the full RAG pipeline runs, and the new query-response pair is added to the cache for future use.
In a RAG setup, semantic caching typically stores responses only for questions that have actually been asked, not every possible query. This helps reduce latency and API costs for repeated or slightly reworded questions. In this article, we walk through a short example demonstrating how caching can significantly lower both cost and response time in LLM-based applications.
How Semantic Caching in LLMs Works
Semantic caching works by storing and retrieving responses based on the meaning of user queries rather than their exact wording. Each incoming query is converted into a vector embedding that represents its semantic content. The system then performs a similarity search, often using Approximate Nearest Neighbor (ANN) methods, to compare this embedding with those already stored in the cache.
If a sufficiently similar query-response pair exists (i.e., its similarity score exceeds a defined threshold), the cached response is returned instantly, bypassing expensive retrieval and generation steps. Otherwise, the full RAG pipeline executes, retrieving documents and generating a new answer, which is then stored in the cache for future use. A minimal lookup sketch is shown below.
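As a rough illustration, the lookup step could be backed by a vector index. The sketch below assumes the faiss library (installed as faiss-cpu) and L2-normalized embeddings so that inner product equals cosine similarity; the helper names add_to_cache and lookup are hypothetical, and the tutorial later in this article uses a simpler brute-force cosine similarity instead.

import numpy as np
import faiss

DIM = 1536                      # e.g., dimension of text-embedding-3-small vectors
THRESHOLD = 0.85                # similarity cutoff for treating a query as a cache hit

index = faiss.IndexFlatIP(DIM)  # exact inner-product index; an HNSW index could serve as a true ANN variant
cached_responses = []           # responses aligned with the rows added to the index

def _normalize(vec):
    # Reshape to (1, d) float32 and L2-normalize so inner product = cosine similarity
    vec = np.asarray(vec, dtype="float32").reshape(1, -1)
    return (vec / np.linalg.norm(vec)).astype("float32")

def add_to_cache(embedding, response):
    # Store the embedding in the vector index and keep the response alongside it
    index.add(_normalize(embedding))
    cached_responses.append(response)

def lookup(embedding):
    # Return a cached response if the nearest neighbor is similar enough, else None
    if index.ntotal == 0:
        return None
    scores, ids = index.search(_normalize(embedding), 1)
    if scores[0][0] >= THRESHOLD:
        return cached_responses[ids[0][0]]
    return None  # cache miss: fall back to the full RAG pipeline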

What Gets Cached in Memory
In a RAG application, semantic caching only stores responses for queries that have actually been processed by the system; there is no pre-caching of every possible question. Each query that reaches the LLM and produces an answer can create a cache entry containing the query's embedding and the corresponding response.
Depending on the system's design, the cache may store just the final LLM outputs, the retrieved documents, or both. To maintain efficiency, cache entries are managed by policies such as time-to-live (TTL) expiration or Least Recently Used (LRU) eviction, ensuring that only recent or frequently accessed queries remain in memory over time. A sketch of this bookkeeping follows.
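As a minimal sketch, the bookkeeping for such policies could look like the class below. It uses only the Python standard library; the class name, parameters, and exact-key lookup are illustrative assumptions rather than part of the tutorial code that follows.

import time
from collections import OrderedDict

class SemanticCacheStore:
    def __init__(self, max_entries=1000, ttl_seconds=3600):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._entries = OrderedDict()  # maintains recency order for LRU eviction

    def put(self, query, embedding, response):
        # Insert or refresh an entry with a creation timestamp for TTL checks
        self._entries[query] = (embedding, response, time.time())
        self._entries.move_to_end(query)            # mark as most recently used
        if len(self._entries) > self.max_entries:   # LRU eviction when over capacity
            self._entries.popitem(last=False)

    def get(self, query):
        entry = self._entries.get(query)
        if entry is None:
            return None
        embedding, response, created_at = entry
        if time.time() - created_at > self.ttl_seconds:  # TTL expiration
            del self._entries[query]
            return None
        self._entries.move_to_end(query)            # refresh recency on a hit
        return embedding, response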
How Semantic Caching Works: Explained with an Example
Installing dependencies
pip install openai numpy
Setting up the dependencies
import os
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')
For this tutorial, we will be using OpenAI, but you can use any LLM provider.
from openai import OpenAI
client = OpenAI()
Running Repeated Queries Without Caching
In this section, we run the same query 10 times directly through the GPT-4.1 model to observe how long it takes when no caching mechanism is applied. Each call triggers a full LLM computation and response generation, leading to repetitive processing for identical inputs.
This establishes a baseline for total time and cost before we implement semantic caching in the next part.
import time

def ask_gpt(query):
    # Time a single, uncached call to the model
    start = time.time()
    response = client.responses.create(
        model="gpt-4.1",
        input=query
    )
    end = time.time()
    return response.output[0].content[0].text, end - start

query = "Explain the concept of semantic caching in just 2 lines."

total_time = 0
for i in range(10):
    _, duration = ask_gpt(query)
    total_time += duration
    print(f"Run {i+1} took {duration:.2f} seconds")

print(f"\nTotal time for 10 runs: {total_time:.2f} seconds")

Even though the query stays the same, every call still takes between 1 and 3 seconds, resulting in a total of roughly 22 seconds for 10 runs. This inefficiency highlights why semantic caching can be so valuable: it lets us reuse earlier responses for semantically identical queries and save both time and API cost.
Implementing Semantic Caching for Faster Responses
In this section, we enhance the previous setup by introducing semantic caching, which allows our application to reuse responses for semantically similar queries instead of repeatedly calling the GPT-4.1 API.
Here is how it works: each incoming query is converted into a vector embedding using the text-embedding-3-small model. This embedding captures the semantic meaning of the text. When a new query arrives, we calculate its cosine similarity with embeddings already stored in our cache. If a match is found with a similarity score above the defined threshold (e.g., 0.85), the system instantly returns the cached response, avoiding another API call.
If no sufficiently similar query exists in the cache, the model generates a fresh response, which is then stored together with its embedding for future use. Over time, this approach dramatically reduces both response time and API costs, especially for frequently asked or rephrased queries.
import numpy as np
from numpy.linalg import norm

semantic_cache = []

def get_embedding(text):
    # Embed the query text with OpenAI's embedding model
    emb = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(emb.data[0].embedding)

def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

def ask_gpt_with_cache(query, threshold=0.85):
    query_embedding = get_embedding(query)

    # Check similarity with the existing cache
    for cached_query, cached_emb, cached_resp in semantic_cache:
        sim = cosine_similarity(query_embedding, cached_emb)
        if sim > threshold:
            print(f"\nUsing cached response (similarity: {sim:.2f})")
            return cached_resp, 0.0  # no API time

    # Otherwise, call GPT
    start = time.time()
    response = client.responses.create(
        model="gpt-4.1",
        input=query
    )
    end = time.time()
    text = response.output[0].content[0].text

    # Store in cache
    semantic_cache.append((query, query_embedding, text))
    return text, end - start
queries = [
    "Explain semantic caching in simple terms.",
    "What is semantic caching and how does it work?",
    "How does caching work in LLMs?",
    "Tell me about semantic caching for LLMs.",
    "Explain semantic caching simply.",
]
total_time = 0
for q in queries:
    resp, t = ask_gpt_with_cache(q)
    total_time += t
    print(f"\nQuery took {t:.2f} seconds\n")

print(f"\nTotal time with caching: {total_time:.2f} seconds")

In the output, the first query took around 8 seconds since there was no cache yet and the model had to generate a fresh response. When a similar question was asked next, the system identified a high semantic similarity (0.86) and instantly reused the cached answer, saving time. Some queries, like “How does caching work in LLMs?” and “Tell me about semantic caching for LLMs,” were sufficiently different, so the model generated new responses, each taking around 10 seconds. The final query was nearly identical to the first one (similarity 0.97) and was served from the cache instantly.
