Building a Semantic Search Engine and Open-Status Classifier over the ResearchMath-14k Dataset

In this tutorial, we work with the amphora/ResearchMath-14k dataset, a collection of research-level mathematics problems mined from arXiv. We load the dataset, inspect its structure, and explore how the problems are distributed across mathematical fields and open-status categories. We then move beyond basic analysis by extracting field-specific keywords, generating semantic embeddings, visualizing the problem landscape, clustering related problems, and building a simple search engine over the dataset. Finally, we train a classifier to predict problem status from embeddings and detect closely related or near-duplicate problems.

!pip -q install -U datasets sentence-transformers scikit-learn umap-learn \
   pandas matplotlib seaborn wordcloud 2>/dev/null
import warnings, numpy as np, pandas as pd
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid", palette="deep")
SAMPLE_SIZE = 4000
RANDOM_STATE = 42
EMB_MODEL   = "sentence-transformers/all-MiniLM-L6-v2"

We begin by installing the required libraries and importing the tools needed for analysis, visualization, embeddings, and data handling. We also set the main configuration values: sample size, random seed, and embedding model. This gives us a clean setup before we start working with the ResearchMath dataset.

from datasets import load_dataset
ds = load_dataset("amphora/ResearchMath-14k", split="test")
df = ds.to_pandas()
print("Rows:", len(df))
print("Columns:", list(df.columns))
print(df.head(3))  # printed explicitly: a bare expression mid-cell shows nothing
TEXT_COL = "self_contained_problem"
df = df[df[TEXT_COL].astype(str).str.len() > 20].reset_index(drop=True)

We load the amphora/ResearchMath-14k dataset from Hugging Face and convert it into a pandas DataFrame. We inspect the number of rows, the available columns, and a few sample records to understand the dataset structure. We then keep only problem statements longer than 20 characters so that subsequent analysis works on useful text.
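
To see how the length filter treats missing or trivially short entries, here is a minimal sketch on a toy DataFrame. The rows and the names `toy`, `mask`, and `kept` are made up for illustration; only the filter expression comes from the code above.

```python
import pandas as pd

TEXT_COL = "self_contained_problem"

# Hypothetical rows: one real statement, one too-short string, one missing value.
toy = pd.DataFrame({TEXT_COL: [
    "Determine whether every finitely generated group of this type is residually finite.",
    "See Lemma 2.",
    None,
]})

# astype(str) makes missing values harmless: they become short strings such as
# "None" or "nan" (or stay missing, depending on the pandas version), so they
# fail the length test either way instead of raising an error.
mask = toy[TEXT_COL].astype(str).str.len() > 20
kept = toy[mask].reset_index(drop=True)

print(len(toy), "->", len(kept), "rows kept")
```

Note that the threshold counts characters, not words, so it removes only empty or fragmentary entries.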

print("\n--- open_status distribution ---")
print(df["open_status"].value_counts(dropna=False))
print("\n--- taxonomy_level_1 (math fields) ---")
print(df["taxonomy_level_1"].value_counts())
fig, axes = plt.subplots(1, 3, figsize=(20, 6))
df["open_status"].value_counts().plot(
   kind="bar", ax=axes[0], color="steelblue")
axes[0].set_title("Problem status"); axes[0].tick_params(axis="x", rotation=30)
df["taxonomy_level_1"].value_counts().plot(
   kind="barh", ax=axes[1], color="seagreen")
axes[1].set_title("Top-level math field"); axes[1].invert_yaxis()
df["doc_len"] = df[TEXT_COL].str.split().apply(len)
axes[2].hist(df["doc_len"].clip(upper=400), bins=40, color="indianred")
axes[2].set_title("Problem length (words, clipped @400)")
plt.tight_layout(); plt.show()
ct = pd.crosstab(df["taxonomy_level_1"], df["open_status"], normalize="index")
plt.figure(figsize=(10, 6))
sns.heatmap(ct, annot=True, fmt=".2f", cmap="rocket_r")
plt.title("Fraction of each status within each field")
plt.tight_layout(); plt.show()

We explore the dataset by checking how problems are distributed across open-status labels and mathematical fields. We visualize the status counts, field counts, and problem lengths to get a quick overview of the corpus. We also create a heatmap to see how open-status categories vary across different math fields.
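
The heatmap relies on `pd.crosstab(..., normalize="index")`, which divides each row by its own total so that every field's row sums to 1. A small sketch with made-up labels (the field names and status values below are hypothetical, not taken from the dataset) makes the arithmetic easy to check:

```python
import pandas as pd

# Hypothetical labels, chosen only so the fractions are easy to verify by hand.
toy = pd.DataFrame({
    "taxonomy_level_1": ["Algebra", "Algebra", "Algebra", "Algebra",
                         "Analysis", "Analysis"],
    "open_status": ["open", "open", "open", "solved", "open", "solved"],
})

# normalize="index" converts counts to per-row fractions.
ct = pd.crosstab(toy["taxonomy_level_1"], toy["open_status"], normalize="index")
print(ct)
```

Because each row is normalized separately, a field with few problems gets the same visual weight as a large one, so the heatmap is best read alongside the raw field counts.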

from sklearn.feature_extraction.text import TfidfVectorizer
def top_terms_per_group(frame, group_col, text_col, k=8):
   out = {}
   for g, sub in frame.groupby(group_col):
       if len(sub) < 20:  # skip fields with too few problems
           continue
       vec = TfidfVectorizer(max_features=3000, stop_words="english",
                             ngram_range=(1, 2), min_df=3)
       X = vec.fit_transform(sub[text_col])
       scores = np.asarray(X.mean(axis=0)).ravel()  # mean TF-IDF weight per term
       terms = np.array(vec.get_feature_names_out())
       out[g] = terms[scores.argsort()[::-1][:k]].tolist()
   return out
for field, terms in top_terms_per_group(df, "taxonomy_level_1", TEXT_COL).items():
   print(f"\n{field:35s} -> {', '.join(terms)}")

We use TF-IDF to find the most important terms within each top-level mathematical field. We group the dataset by field, fit a separate vectorizer on each group, and rank terms by their mean TF-IDF weight across the group's problems. This helps us understand which topics and terminology dominate different areas of research mathematics.

We sample the dataset and convert each mathematical problem into a semantic embedding using a SentenceTransformer model. We reduce the embeddings to two dimensions using UMAP, or PCA if UMAP is unavailable, and visualize the problem landscape by field. We then apply K-Means clustering and compare the resulting clusters with the human-labeled taxonomy using the adjusted Rand index (ARI) and normalized mutual information (NMI).
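
The search and classifier code later in the tutorial refers to `work`, `model`, and `emb`, which this step presumably produces: the sampled DataFrame, the SentenceTransformer, and the embedding matrix. The following is a minimal sketch of the step under that assumption. The function names `embed_sample`, `reduce_2d`, and `cluster_agreement`, the UMAP settings, and the synthetic demo data are illustrative choices, not the tutorial's own code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score


def embed_sample(df, text_col, model_name, sample_size=None, random_state=42):
    """Sample rows and encode them; returns (work, model, emb). Downloads the model."""
    from sentence_transformers import SentenceTransformer  # imported lazily
    work = df if sample_size is None else df.sample(
        min(sample_size, len(df)), random_state=random_state)
    work = work.reset_index(drop=True)
    model = SentenceTransformer(model_name)
    emb = model.encode(work[text_col].tolist(), normalize_embeddings=True,
                       show_progress_bar=True)
    return work, model, emb


def reduce_2d(emb, random_state=42):
    """Project to 2-D with UMAP, falling back to PCA if umap-learn is missing."""
    try:
        import umap
        reducer = umap.UMAP(n_components=2, metric="cosine",
                            random_state=random_state)
    except ImportError:
        reducer = PCA(n_components=2, random_state=random_state)
    return reducer.fit_transform(emb)


def cluster_agreement(emb, labels, random_state=42):
    """Run K-Means with one cluster per label and score it against the labels."""
    n_clusters = len(np.unique(labels))
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    pred = km.fit_predict(emb)
    return (adjusted_rand_score(labels, pred),
            normalized_mutual_info_score(labels, pred))


# Synthetic stand-in for embeddings: three well-separated blobs in 16 dimensions.
rng = np.random.default_rng(0)
centers = np.eye(3, 16) * 5.0
labels_demo = np.repeat(np.arange(3), 50)
emb_demo = centers[labels_demo] + rng.normal(scale=0.3, size=(150, 16))

ari, nmi = cluster_agreement(emb_demo, labels_demo)
print(f"ARI={ari:.3f}  NMI={nmi:.3f}")
```

On the real data, one would call `work, model, emb = embed_sample(df, TEXT_COL, EMB_MODEL, SAMPLE_SIZE, RANDOM_STATE)` and pass `work["taxonomy_level_1"]` as the labels. Both scores are near 1 when clusters match the labels, and ARI is near 0 for a random assignment, so the synthetic blobs should score close to 1 while real embeddings will typically score lower.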

from sentence_transformers import util
# --- Semantic search over the embedded sample ---
def search(query, k=5):
   q = model.encode([query], normalize_embeddings=True)
   sims = util.cos_sim(q, emb)[0].cpu().numpy()
   idx = sims.argsort()[::-1][:k]
   print(f'\n=== Query: "{query}" ===')
   for rank, i in enumerate(idx, 1):
       row = work.iloc[i]
       print(f"\n[{rank}] sim={sims[i]:.3f} | {row['taxonomy_level_1']} "
             f"| status={row['open_status']}")
       print("   ", row[TEXT_COL][:260].replace("\n", " "), "...")
search("rational points on hyperelliptic curves")
search("multiplicativity of maximal output p-norm of a quantum channel")
# --- open_status classifier on the embeddings ---
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
y = work["open_status"].values
Xtr, Xte, ytr, yte = train_test_split(
   emb, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y)
clf = LogisticRegression(max_iter=2000, class_weight="balanced", C=2.0)
clf.fit(Xtr, ytr)
pred = clf.predict(Xte)
print("\n=== open_status classifier (embeddings + logistic regression) ===")
print(classification_report(yte, pred))
fig, ax = plt.subplots(figsize=(7, 6))
ConfusionMatrixDisplay.from_predictions(
   yte, pred, ax=ax, cmap="Blues", xticks_rotation=45,
   normalize="true", values_format=".2f")
ax.set_title("open_status confusion matrix (row-normalized)")
plt.tight_layout(); plt.show()
# --- Most similar pair of problems in the sample ---
sims = util.cos_sim(emb, emb).cpu().numpy()
np.fill_diagonal(sims, 0)  # ignore self-similarity
i, j = np.unravel_index(sims.argmax(), sims.shape)
print(f"\nMost similar pair (cos={sims[i, j]:.3f}):")
for n in (i, j):
   print(f"\n  paper_id={work.iloc[n]['paper_id']} | "
         f"{work.iloc[n]['taxonomy_level_1']}")
   print("   ", work.iloc[n][TEXT_COL][:240].replace("\n", " "), "...")
print("\nDone. Set SAMPLE_SIZE=None at the top to run on the full 14.1k rows.")

We build a semantic search function that retrieves the most relevant research problems for a given query. We then train a logistic regression classifier on the embeddings to predict each problem's open-status label. Finally, we compute pairwise cosine similarity across all embedded problems to find the closest pair and identify near-duplicate or strongly related problem statements.
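
The retrieval and closest-pair logic reduces to dot products once the vectors are unit-normalized. The sketch below reproduces both steps with NumPy alone on four hand-written vectors, so the ranking can be checked by eye. The helper names and toy vectors are hypothetical, and the diagonal is masked with `-inf` rather than 0, a small variation that also works when every similarity is negative.

```python
import numpy as np


def normalize_rows(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def top_k(query_vec, emb, k=2):
    """Return indices and scores of the k rows most similar to query_vec."""
    sims = emb @ query_vec
    idx = np.argsort(sims)[::-1][:k]
    return idx, sims[idx]


def closest_pair(emb):
    """Return (i, j, similarity) for the most similar pair of distinct rows."""
    sims = emb @ emb.T
    np.fill_diagonal(sims, -np.inf)  # exclude each row's match with itself
    i, j = np.unravel_index(np.argmax(sims), sims.shape)
    return int(i), int(j), float(sims[i, j])


# Toy "embeddings": rows 0 and 1 point in nearly the same direction.
toy = normalize_rows([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
q = normalize_rows([[1.0, 0.05, 0.0]])[0]

idx, scores = top_k(q, toy, k=2)
i, j, s = closest_pair(toy)
print("top-2:", idx.tolist(), "closest pair:", (i, j), f"cos={s:.3f}")
```

The full similarity matrix has one entry per pair of problems, so its memory use grows quadratically with the sample size; for much larger corpora, computing it in row blocks may be preferable.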

In conclusion, we now have a complete workflow for analyzing research-level mathematical problems using modern NLP and machine learning tools. We started with dataset exploration, then used TF-IDF, sentence embeddings, dimensionality reduction, clustering, semantic search, and classification to understand the corpus's structure from several angles. This gives us a practical way to study how mathematical problems are grouped, how similar problems can be retrieved, and how embeddings can support both exploratory analysis and supervised prediction tasks.

The post Building a Semantic Search Engine and Open-Status Classifier over the ResearchMath-14k Dataset appeared first on MarkTechPost.