
Meet Pyversity: How to Improve Retrieval Systems by Diversifying the Results

Pyversity is a fast, lightweight Python library designed to improve the diversity of results returned by retrieval systems. Retrieval often returns items that are very similar to one another, leading to redundancy. Pyversity efficiently re-ranks these results to surface items that are still relevant but less redundant.

It offers a clear, unified API for several popular diversification strategies, including Maximal Marginal Relevance (MMR), Max-Sum Diversification (MSD), Determinantal Point Processes (DPP), and Cover. Its only dependency is NumPy, making it very lightweight.

In this tutorial, we'll focus on the MMR and MSD strategies using a practical example. Check out the FULL CODES here.

Why is diversification required?

Diversification in retrieval is necessary because traditional ranking methods, which prioritize only relevance to the user query, frequently produce a set of top results that are highly redundant or near-duplicates.

This high similarity creates a poor user experience by limiting exploration and wasting screen space on nearly identical items. Diversification strategies address this by balancing relevance with variety, ensuring that each newly selected item introduces novel information not already present in the items ranked so far.

This approach is essential across numerous domains: in e-commerce, it shows different product styles; in news search, it surfaces different viewpoints or sources; and in RAG/LLM contexts, it prevents feeding the model repetitive, near-duplicate text passages, improving the quality of the final response.

Different diversification strategies

Installing the dependencies

pip install openai numpy pyversity scikit-learn

Loading OpenAI API Key

import os
from openai import OpenAI
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')

client = OpenAI()

Creating a Redundant Search Result Set for Diversification Testing

In this step, we simulate the kind of search results you might retrieve from a vector database (like Pinecone, Weaviate, or FAISS) after performing a semantic search for a query such as "Smart and loyal dogs for family."

These results deliberately contain redundant entries: multiple mentions of similar breeds like Golden Retrievers, Labradors, and German Shepherds, each described with overlapping traits such as loyalty, intelligence, and family-friendliness.

This redundancy mirrors what often happens in real-world retrieval systems, where highly similar items receive high similarity scores. We'll use this dataset to demonstrate how diversification strategies can reduce repetition and produce a more balanced, diverse set of search results.

import numpy as np

search_results = [
    "The Golden Retriever is the perfect family companion, known for its loyalty and gentle nature.",
    "A Labrador Retriever is highly intelligent, eager to please, and makes an excellent companion for active families.",
    "Golden Retrievers are highly intelligent and trainable, making them ideal for first-time owners.",
    "The highly loyal Labrador is consistently ranked number one for US family pets due to its stable temperament.",
    "Loyalty and patience define the Golden Retriever, one of the top family dogs globally and easily trainable.",
    "For a smart, stable, and affectionate family dog, the Labrador is an excellent choice, known for its eagerness to please.",
    "German Shepherds are famous for their unwavering loyalty and are highly intelligent working dogs, excelling in obedience.",
    "A highly trainable and loyal companion, the German Shepherd excels in family protection roles and service work.",
    "The Standard Poodle is an exceptionally smart, athletic, and surprisingly loyal dog that is also hypoallergenic.",
    "Poodles are known for their high intelligence, often exceeding other breeds in advanced obedience training.",
    "For herding and smarts, the Border Collie is the top choice, recognized as the world's most intelligent dog breed.",
    "The Dachshund is a small, playful dog with a distinctive long body, originally bred in Germany for badger hunting.",
    "French Bulldogs are small, low-energy city dogs, known for their easy-going temperament and comical bat ears.",
    "Siberian Huskies are energetic, friendly, and need significant cold weather exercise due to their running history.",
    "The Beagle is a gentle, curious hound known for its excellent sense of smell and a distinctive baying bark.",
    "The Great Dane is a very large, gentle giant breed; despite its size, it's known to be a low-energy house dog.",
    "The Australian Shepherd (Aussie) is a medium-sized herding dog, prized for its beautiful coat and sharp intellect."
]

Creating the Embeddings

def get_embeddings(texts):
    """Fetches embeddings from the OpenAI API."""
    print("Fetching embeddings from OpenAI...")
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return np.array([data.embedding for data in response.data])

embeddings = get_embeddings(search_results)
print(f"Embeddings form: {embeddings.form}")

Ranking Search Results by Relevance

In this step, we calculate how closely each search result matches the user's query using cosine similarity between their vector embeddings. This produces a list of results ranked purely by semantic relevance, showing which texts are closest in meaning to the query. Essentially, it simulates what a search engine or retrieval system would return before applying any diversification strategy, typically yielding several highly similar or redundant entries at the top.

from sklearn.metrics.pairwise import cosine_similarity

query_text = "Smart and loyal dogs for family"
query_embedding = get_embeddings([query_text])[0]


scores = cosine_similarity(query_embedding.reshape(1, -1), embeddings)[0]

print("n--- Initial Relevance-Only Ranking (Top 5) ---")
initial_ranking_indices = np.argsort(scores)[::-1] # Sort descending
for i in initial_ranking_indices[:5]:
    print(f"Score: {scores[i]:.4f} | Result: {search_results[i]}")

As seen in the output above, the top results are dominated by repeated mentions of Labradors and Golden Retrievers, each described with similar traits like loyalty, intelligence, and family-friendliness. This is typical of a relevance-only retrieval system: the top results are semantically similar but often redundant, offering little variety in content. While these results are all relevant to the query, they lack diversity, making them less useful for users who want a broader overview of different breeds or perspectives.
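As an optional sanity check (not part of the original Pyversity workflow), you can quantify this redundancy by measuring the average pairwise similarity among the top results, reusing the embeddings we already computed:

top5 = initial_ranking_indices[:5]
pairwise = cosine_similarity(embeddings[top5])  # 5x5 similarity matrix of the top results

# Average over the distinct pairs (exclude the diagonal of 1.0s)
off_diag = pairwise[~np.eye(len(top5), dtype=bool)]
print(f"Average pairwise similarity of top 5: {off_diag.mean():.4f}")

A value close to 1.0 confirms that the top-ranked items say nearly the same thing, which is exactly what diversification is meant to fix.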

Maximal Marginal Relevance

MMR works by striking a balance between relevance and diversity. Instead of simply picking the results most similar to the query, it iteratively selects items that are still relevant but not too similar to what has already been chosen.

In simpler terms, imagine you're building a list of dog breeds for "smart and loyal family dogs." The first result might be a Labrador, which is highly relevant. For the next pick, MMR avoids choosing another Labrador description and instead selects something like a Golden Retriever or German Shepherd.

This way, MMR ensures your final results are both useful and varied, reducing repetition while keeping everything closely related to what the user actually searched for.
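To make the mechanics concrete, here is a minimal NumPy sketch of the greedy MMR loop. This is an illustration only, not Pyversity's internal implementation, and its lam weight (higher means more relevance) is not the same parameterization as the library's diversity argument used below:

def mmr_sketch(embeddings, scores, k=5, lam=0.5):
    """Greedy MMR: trade off query relevance against similarity to items already picked."""
    # Normalize rows so that dot products equal cosine similarities
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    item_sim = normed @ normed.T
    selected, candidates = [], list(range(len(scores)))
    while candidates and len(selected) < k:
        if not selected:
            best = max(candidates, key=lambda i: scores[i])
        else:
            # lam weights relevance; (1 - lam) penalizes similarity to the selected set
            best = max(candidates,
                       key=lambda i: lam * scores[i] - (1 - lam) * item_sim[i, selected].max())
        selected.append(best)
        candidates.remove(best)
    return selected

print(mmr_sketch(embeddings, scores))  # indices of a diversified top 5

With the idea clear, we can now call the library itself.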

from pyversity import diversify, Strategy

# MMR: Focuses on novelty relative to already-picked items.
mmr_result = diversify(
    embeddings=embeddings,
    scores=scores,
    k=5,
    strategy=Strategy.MMR,
    diversity=0.5  # 0.0 is pure relevance, 1.0 is pure diversity
)

print("nn--- Diversified Ranking utilizing MMR (Top 5) ---")
for rank, idx in enumerate(mmr_result.indices):
    print(f"Rank {rank+1} (Original Index {idx}): {search_results[idx]}")

After applying the MMR (Maximal Marginal Relevance) strategy, the results are noticeably more diverse. While top-ranked items like the Labrador and German Shepherd remain highly relevant to the query, the subsequent entries include different breeds such as Siberian Huskies and French Bulldogs. This shows how MMR reduces redundancy by avoiding multiple similar results; instead, it balances relevance and variety, giving users a broader and more informative set of results that still stays on topic.

Max Sum of Distances

The MSD (Max Sum of Distances) strategy focuses on selecting results that are not only relevant to the query but also as different from one another as possible. Instead of checking similarity against previously picked items one at a time (like MMR does), MSD considers the overall spread of the selected set.

In simpler terms, it tries to pick results that cover a wider range of ideas or topics, ensuring strong diversity across the entire set. For the same dog example, MSD might include breeds like Labrador, German Shepherd, Beagle, and Husky, each distinct in type and temperament, to give a broader, well-rounded view of "smart and loyal family dogs."
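Again as a rough conceptual sketch (not Pyversity's actual implementation): where MMR penalizes a candidate by its maximum similarity to the selected set, a greedy MSD-style pick rewards a candidate by its total distance to everything selected so far:

def msd_sketch(embeddings, scores, k=5, lam=0.5):
    """Greedy MSD-style pick: reward total distance to all items already selected."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - normed @ normed.T  # pairwise cosine distance between items
    selected = [int(np.argmax(scores))]  # seed with the most relevant item
    candidates = [i for i in range(len(scores)) if i != selected[0]]
    while candidates and len(selected) < k:
        # Unlike MMR's max over the selected set, MSD sums the distances to all of it
        best = max(candidates,
                   key=lambda i: lam * scores[i] + (1 - lam) * dist[i, selected].sum())
        selected.append(best)
        candidates.remove(best)
    return selected

print(msd_sketch(embeddings, scores))

Because the sum keeps growing with every selected item, MSD tends to push the set apart more aggressively than MMR. The library call looks almost identical to the MMR one: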

# MSD: Focuses on strong spread/distance across all candidates.
msd_result = diversify(
    embeddings=embeddings,
    scores=scores,
    k=5,
    strategy=Strategy.MSD,
    diversity=0.5
)

print("nn--- Diversified Ranking utilizing MSD (Top 5) ---")
for rank, idx in enumerate(msd_result.indices):
    print(f"Rank {rank+1} (Original Index {idx}): {search_results[idx]}")

The results produced by the MSD (Max Sum of Distances) strategy show a strong focus on variety and coverage. While the Labrador and German Shepherd remain relevant to the query, the inclusion of breeds like the French Bulldog, Siberian Husky, and Dachshund highlights MSD's tendency to pick results that are distinct from one another.

This approach ensures that users see a broader mix of options rather than closely related or repetitive entries. In essence, MSD emphasizes maximum diversity across the entire result set, offering a wider perspective while still maintaining overall relevance to the search intent.

