How to Reduce Cost and Latency of Your RAG Application Using Semantic LLM Caching
Semantic caching in LLM (Large Language Model) applications improves efficiency by storing and reusing responses based on semantic similarity rather than exact text matches. When a new query arrives, it is converted into an embedding and compared with cached ones using similarity search. If a close match is found (above a similarity threshold), the cached response is returned immediately, skipping the expensive retrieval and generation process. Otherwise, the full RAG pipeline runs, and the new query-response pair is added to the cache for future use.
In a RAG setup, semantic caching typically stores responses only for questions that have actually been asked, not every possible query. This helps reduce latency and API costs for repeated or slightly reworded questions. In this article, we walk through a short example demonstrating how caching can significantly lower both cost and response time in LLM-based applications.
How Semantic Caching in LLMs Works
Semantic caching works by storing and retrieving responses based on the meaning of user queries rather than their exact wording. Each incoming query is converted into a vector embedding that represents its semantic content. The system then performs a similarity search, often using Approximate Nearest Neighbor (ANN) methods, to compare this embedding with those already stored in the cache.
If a sufficiently similar query-response pair exists (i.e., its similarity score exceeds a defined threshold), the cached response is returned instantly, bypassing expensive retrieval and generation steps. Otherwise, the full RAG pipeline executes, retrieving documents and generating a new answer, which is then stored in the cache for future use. A minimal lookup sketch is shown below.
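As a rough illustration, the lookup step could be backed by a vector index. The sketch below assumes the faiss library (installed as faiss-cpu) and L2-normalized embeddings so that inner product equals cosine similarity; the helper names add_to_cache and lookup are hypothetical, and the tutorial later in this article uses a simpler brute-force cosine similarity instead.

import numpy as np
import faiss

DIM = 1536                      # e.g., dimension of text-embedding-3-small vectors
THRESHOLD = 0.85                # similarity cutoff for treating a query as a cache hit

index = faiss.IndexFlatIP(DIM)  # exact inner-product index; an HNSW index could serve as a true ANN variant
cached_responses = []           # responses aligned with the rows added to the index

def _normalize(vec):
    # Reshape to (1, d) float32 and L2-normalize so inner product = cosine similarity
    vec = np.asarray(vec, dtype="float32").reshape(1, -1)
    return (vec / np.linalg.norm(vec)).astype("float32")

def add_to_cache(embedding, response):
    # Store the embedding in the vector index and keep the response alongside it
    index.add(_normalize(embedding))
    cached_responses.append(response)

def lookup(embedding):
    # Return a cached response if the nearest neighbor is similar enough, else None
    if index.ntotal == 0:
        return None
    scores, ids = index.search(_normalize(embedding), 1)
    if scores[0][0] >= THRESHOLD:
        return cached_responses[ids[0][0]]
    return None  # cache miss: fall back to the full RAG pipeline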

What Gets Cached in Memory
In a RAG application, semantic caching only stores responses for queries that have actually been processed by the system; there is no pre-caching of every possible question. Each query that reaches the LLM and produces an answer can create a cache entry containing the query's embedding and the corresponding response.
Depending on the system's design, the cache may store just the final LLM outputs, the retrieved documents, or both. To maintain efficiency, cache entries are managed by policies such as time-to-live (TTL) expiration or Least Recently Used (LRU) eviction, ensuring that only recent or frequently accessed queries remain in memory over time. A sketch of this bookkeeping follows.
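As a minimal sketch, the bookkeeping for such policies could look like the class below. It uses only the Python standard library; the class name, parameters, and exact-key lookup are illustrative assumptions rather than part of the tutorial code that follows.

import time
from collections import OrderedDict

class SemanticCacheStore:
    def __init__(self, max_entries=1000, ttl_seconds=3600):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._entries = OrderedDict()  # maintains recency order for LRU eviction

    def put(self, query, embedding, response):
        # Insert or refresh an entry with a creation timestamp for TTL checks
        self._entries[query] = (embedding, response, time.time())
        self._entries.move_to_end(query)            # mark as most recently used
        if len(self._entries) > self.max_entries:   # LRU eviction when over capacity
            self._entries.popitem(last=False)

    def get(self, query):
        entry = self._entries.get(query)
        if entry is None:
            return None
        embedding, response, created_at = entry
        if time.time() - created_at > self.ttl_seconds:  # TTL expiration
            del self._entries[query]
            return None
        self._entries.move_to_end(query)            # refresh recency on a hit
        return embedding, response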
How Semantic Caching Works: Explained with an Example
Installing dependencies
pip install openai numpy
Setting up the dependencies
import os
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')
For this tutorial, we will be using OpenAI, but you can use any LLM provider.
from openai import OpenAI
client = OpenAI()
Running Repeated Queries Without Caching
In this section, we run the same query 10 times directly through the GPT-4.1 model to observe how long it takes when no caching mechanism is applied. Each call triggers a full LLM computation and response generation, leading to repetitive processing for identical inputs.
This establishes a baseline for total time and cost before we implement semantic caching in the next part.
import time

def ask_gpt(query):
    # Time a single, uncached call to the model
    start = time.time()
    response = client.responses.create(
        model="gpt-4.1",
        input=query
    )
    end = time.time()
    return response.output[0].content[0].text, end - start

query = "Explain the concept of semantic caching in just 2 lines."

total_time = 0
for i in range(10):
    _, duration = ask_gpt(query)
    total_time += duration
    print(f"Run {i+1} took {duration:.2f} seconds")

print(f"\nTotal time for 10 runs: {total_time:.2f} seconds")

Even though the query stays the same, every call still takes between 1 and 3 seconds, resulting in a total of roughly 22 seconds for 10 runs. This inefficiency highlights why semantic caching can be so valuable: it lets us reuse earlier responses for semantically identical queries and save both time and API cost.
Implementing Semantic Caching for Faster Responses
In this section, we enhance the previous setup by introducing semantic caching, which allows our application to reuse responses for semantically similar queries instead of repeatedly calling the GPT-4.1 API.
Here is how it works: each incoming query is converted into a vector embedding using the text-embedding-3-small model. This embedding captures the semantic meaning of the text. When a new query arrives, we calculate its cosine similarity with embeddings already stored in our cache. If a match is found with a similarity score above the defined threshold (e.g., 0.85), the system instantly returns the cached response, avoiding another API call.
If no sufficiently similar query exists in the cache, the model generates a fresh response, which is then stored together with its embedding for future use. Over time, this approach dramatically reduces both response time and API costs, especially for frequently asked or rephrased queries.
import numpy as np
from numpy.linalg import norm

semantic_cache = []

def get_embedding(text):
    # Embed the query text with OpenAI's embedding model
    emb = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(emb.data[0].embedding)

def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

def ask_gpt_with_cache(query, threshold=0.85):
    query_embedding = get_embedding(query)

    # Check similarity with the existing cache
    for cached_query, cached_emb, cached_resp in semantic_cache:
        sim = cosine_similarity(query_embedding, cached_emb)
        if sim > threshold:
            print(f"\nUsing cached response (similarity: {sim:.2f})")
            return cached_resp, 0.0  # no API time

    # Otherwise, call GPT
    start = time.time()
    response = client.responses.create(
        model="gpt-4.1",
        input=query
    )
    end = time.time()
    text = response.output[0].content[0].text

    # Store in cache
    semantic_cache.append((query, query_embedding, text))
    return text, end - start
queries = [
    "Explain semantic caching in simple terms.",
    "What is semantic caching and how does it work?",
    "How does caching work in LLMs?",
    "Tell me about semantic caching for LLMs.",
    "Explain semantic caching simply.",
]
total_time = 0
for q in queries:
    resp, t = ask_gpt_with_cache(q)
    total_time += t
    print(f"\nQuery took {t:.2f} seconds\n")

print(f"\nTotal time with caching: {total_time:.2f} seconds")

In the output, the first query took around 8 seconds since there was no cache yet and the model had to generate a fresh response. When a similar question was asked next, the system identified a high semantic similarity (0.86) and instantly reused the cached answer, saving time. Some queries, like “How does caching work in LLMs?” and “Tell me about semantic caching for LLMs,” were sufficiently different, so the model generated new responses, each taking around 10 seconds. The final query was nearly identical to the first one (similarity 0.97) and was served from the cache instantly.
