Design a Complete Multimodal RLVR Pipeline with Open-MM-RL, Vision-Language Prompting, Reward Scoring, and GRPO Export
In this tutorial, we explore the TuringEnterprises/Open-MM-RL dataset as a practical foundation for multimodal reasoning and reinforcement learning with verifiable rewards (RLVR). We load the dataset, inspect its schema, analyze domains, formats, question lengths, answer types, and image distributions, and visualize representative examples from each domain. We also build a lightweight reward function that checks exact, numeric, fractional, LaTeX, and symbolic answers, giving us a useful way to evaluate model outputs. Finally, we format prompts for vision-language models, optionally test SmolVLM on sample examples, and export the dataset into a GRPO-style structure for future multimodal RL training.
import subprocess, sys
# Install dependencies quietly; check=True raises if pip fails.
subprocess.run([sys.executable, "-m", "pip", "-q", "install",
                "datasets>=3.0", "huggingface_hub>=0.24", "transformers>=4.45",
                "Pillow", "matplotlib", "pandas", "numpy", "sympy",
                "accelerate", "tqdm"], check=True)
import os, re, io, json, math, random, textwrap, hashlib, warnings
from collections import Counter
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
import sympy as sp
from datasets import load_dataset
warnings.filterwarnings("ignore")
# Fix seeds so sampling and plots are reproducible.
random.seed(0); np.random.seed(0)
pd.set_option("display.max_colwidth", 120)
DS_ID = "TuringEnterprises/Open-MM-RL"
ds = load_dataset(DS_ID, split="train")
print(f"Loaded {DS_ID}: {len(ds)} rows")
print("Features:", ds.features)
print("Row 0 keys:", list(ds[0].keys()))
We install all required libraries and import the core tools needed for dataset loading, analysis, visualization, symbolic math, and file handling. We set random seeds for reproducibility and configure pandas so that longer text fields display clearly. We then load the TuringEnterprises/Open-MM-RL dataset from Hugging Face and inspect its size, features, and first-row structure.
# Drop the image column before converting so pandas only holds text metadata.
df = ds.remove_columns(["images"]).to_pandas()
df["n_images"] = [len(ex["images"]) for ex in ds]
df["q_len_chars"] = df["question"].str.len()
df["a_len_chars"] = df["answer"].str.len()
print("\n=== Domain ==="); print(df["domain"].value_counts())
print("\n=== Format ==="); print(df["format"].value_counts())
print("\n=== Sub-domain (top by domain) ===")
print(df.groupby("domain")["subDomain"].value_counts().head(15))
print(f"\nMean images/example: {df['n_images'].mean():.2f}  max: {df['n_images'].max()}")
print(f"Median Q length: {df['q_len_chars'].median():.0f}  "
      f"Median A length: {df['a_len_chars'].median():.0f}")
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
df["domain"].value_counts().plot.bar(ax=axes[0], color="#4C72B0")
axes[0].set_title("Examples per domain"); axes[0].set_ylabel("count")
df["format"].value_counts().plot.bar(ax=axes[1], color="#55A868")
axes[1].set_title("Image-format type"); axes[1].tick_params(axis='x', rotation=25)
df["n_images"].plot.hist(ax=axes[2], bins=range(1, df["n_images"].max() + 2),
                         color="#C44E52", edgecolor="white")
axes[2].set_title("Images per example"); axes[2].set_xlabel("n_images")
plt.tight_layout(); plt.show()
def img_stats(ex):
    sizes = [im.size for im in ex["images"]]
    modes = [im.mode for im in ex["images"]]
    # Per-example width/height extremes, color modes, and total pixel count.
    return {
        "min_w": min(w for w, _ in sizes),
        "max_w": max(w for w, _ in sizes),
        "min_h": min(h for _, h in sizes),
        "max_h": max(h for _, h in sizes),
        "modes": "|".join(sorted(set(modes))),
        "total_pixels": sum(w * h for w, h in sizes),
    }
img_df = pd.DataFrame([img_stats(ex) for ex in ds])
print("\n=== Image resolution stats ===")
print(img_df[["min_w", "max_w", "min_h", "max_h", "total_pixels"]].describe().round(0))
print("\nMode mix:", Counter("|".join(img_df["modes"]).split("|")))
We convert the dataset into a DataFrame after removing the image column, then calculate useful fields such as the number of images, question length, and answer length. We analyze domain counts, format distribution, sub-domain breakdowns, and basic text/image statistics. We also create charts to visualize the number of examples per domain, the image formats, and the distribution of images per example.
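As a dependency-free illustration of the per-domain and images-per-example summaries described above, the following sketch runs on a few hypothetical records. The field names mirror the tutorial's columns, but the values and the `summarize` helper are invented for illustration and are not taken from Open-MM-RL.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical toy records; values are made up for illustration only.
toy = [
    {"domain": "Physics", "question": "What is the net force?", "n_images": 1},
    {"domain": "Physics", "question": "Find the period.", "n_images": 2},
    {"domain": "Biology", "question": "Name the labeled organelle.", "n_images": 1},
]

def summarize(records):
    """Return domain counts plus simple image and question-length statistics."""
    return {
        "domain_counts": Counter(r["domain"] for r in records),
        "mean_images": mean(r["n_images"] for r in records),
        "max_images": max(r["n_images"] for r in records),
        "median_q_len": median(len(r["question"]) for r in records),
    }

summary = summarize(toy)
print(summary)
```

The same quantities come from `value_counts()`, `mean()`, and `median()` in the pandas version; the plain-Python form is only meant to make the aggregation explicit.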
def show_example(ex, max_chars=600):
    print("=" * 80)
    print(f"id={ex['conversation_id']}  {ex['domain']} / {ex['subDomain']}")
    print(f"format={ex['format']}  n_images={len(ex['images'])}")
    print("-" * 80)
    # Truncate long questions so the console output stays readable.
    q = ex["question"][:max_chars] + ("..." if len(ex["question"]) > max_chars else "")
    print("Q:", textwrap.fill(q, 100))
    print("-" * 80)
    print("A (gold):", ex["answer"])
    n = len(ex["images"])
    fig, axes = plt.subplots(1, n, figsize=(5 * n, 5)) if n > 1 \
        else plt.subplots(1, 1, figsize=(6, 6))
    axes = np.atleast_1d(axes)
    for ax, im in zip(axes, ex["images"]):
        ax.imshow(im); ax.set_xticks([]); ax.set_yticks([])
        ax.set_title(f"{im.size[0]}×{im.size[1]} ({im.mode})")
    plt.tight_layout(); plt.show()
# Show the first example of each domain.
for dom in df["domain"].unique():
    idx = int(df[df["domain"] == dom].index[0])
    show_example(ds[idx])
# Match \[ ... \], \( ... \), or $ ... $ spans.
LATEX_PAT = re.compile(r"\\\[[\s\S]+?\\\]|\\\([\s\S]+?\\\)|\$[^$]+\$")
df["latex_blocks_q"] = df["question"].apply(lambda s: len(LATEX_PAT.findall(s or "")))
df["latex_blocks_a"] = df["answer"].apply(lambda s: len(LATEX_PAT.findall(s or "")))
print("\n=== LaTeX blocks per field ===")
print(df[["latex_blocks_q", "latex_blocks_a"]].describe().round(2))
def classify_answer(a):
    s = (a or "").strip().strip("$ []").strip()
    s_no_dollar = s.replace("$", "")
    # Plain integer or decimal.
    if re.fullmatch(r"-?\s*\d+(\.\d+)?\s*", s_no_dollar): return "integer/float"
    # LaTeX commands or exponent/subscript markers suggest a symbolic answer.
    if any(t in s for t in ["\\sqrt", "\\frac", "\\pi", "^", "_", "\\kappa", "\\lceil"]):
        return "symbolic"
    if re.fullmatch(r"[-+0-9./()\sa-zA-Z\\{}]+", s) and any(c.isdigit() for c in s):
        return "numeric_expr"
    return "text"
df["answer_type"] = df["answer"].apply(classify_answer)
print("\n=== Answer-type breakdown ==="); print(df["answer_type"].value_counts())
print("\n=== Answer-type × domain ===")
print(pd.crosstab(df["domain"], df["answer_type"]))
We define a helper function to display one representative example from each domain, including its question, gold answer, and associated images. We use this visual inspection step to better understand how multimodal reasoning problems are structured across different domains. We then analyze LaTeX usage in questions and answers, classify answer types, and compare answer-type distributions across domains.
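To make the LaTeX-usage count described above concrete, here is a minimal sketch of counting delimited math spans with a regular expression. The names `LATEX_BLOCK` and `count_latex_blocks` and the sample strings are illustrative assumptions, and the pattern only recognizes `\[...\]`, `\(...\)`, and single-dollar `$...$` delimiters.

```python
import re

# Matches \[ ... \], \( ... \), or $ ... $ spans (non-greedy for the first two).
LATEX_BLOCK = re.compile(r"\\\[[\s\S]+?\\\]|\\\([\s\S]+?\\\)|\$[^$]+\$")

def count_latex_blocks(text):
    """Count delimited LaTeX spans in a string; None is treated as empty."""
    return len(LATEX_BLOCK.findall(text or ""))

# Hypothetical sample strings for illustration.
samples = [
    r"Compute \( x^2 \) when $x = 3$.",
    r"\[ \frac{a}{b} \]",
    "No math here.",
]
counts = [count_latex_blocks(s) for s in samples]
for c, s in zip(counts, samples):
    print(c, repr(s))
```

A count like this is only a rough proxy for how math-heavy a field is: undelimited expressions such as a bare `\frac{1}{2}` are not counted.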
# Tried in order: \boxed{...}, then "final answer:", then "answer:".
EXTRACT_PATS = [
    r"\\boxed\{([^{}]+)\}",
    r"final\s+answer\s*[:=]\s*([^\n]+)",
    r"answer\s*[:=]\s*([^\n]+)",
]
def extract_final(text):
    if not text: return ""
    for p in EXTRACT_PATS:
        m = re.search(p, text, flags=re.IGNORECASE)
        if m: return m.group(1).strip().strip(".,;")
    # Fall back to the last non-empty line.
    lines = [l.strip() for l in str(text).strip().splitlines() if l.strip()]
    return lines[-1] if lines else ""
def latex_to_sympy(s):
    s = (s or "").strip().strip("$").strip()
    # Strip a leading \[ or \( and a trailing \] or \).
    s = re.sub(r"^\\[\[\(]", "", s); s = re.sub(r"\\[\]\)]$", "", s)
    s = (s.replace("\\pi", "pi").replace("\\cdot", "*").replace("\\times", "*")
          .replace("\\,", "").replace("\\;", "").replace("\\!", ""))
    s = re.sub(r"\\frac\s*\{([^{}]+)\}\s*\{([^{}]+)\}", r"((\1)/(\2))", s)
    s = re.sub(r"\\sqrt\s*\{([^{}]+)\}", r"sqrt(\1)", s)
    s = s.replace("^", "**")
    # Drop any remaining LaTeX commands, then turn braces into parentheses.
    s = re.sub(r"\\[a-zA-Z]+", "", s)
    s = s.replace("{", "(").replace("}", ")")
    return s
def grade(pred, gold, tol=1e-4):
    """Verifiable reward in [0,1]: exact > numeric > sympy-symbolic > partial."""
    if pred is None or gold is None: return 0.0
    p = extract_final(str(pred)).strip()
    g = str(gold).strip()
    # 1) Exact match after removing whitespace, case, and outer punctuation.
    norm = lambda x: re.sub(r"\s+", "", x.lower()).strip("$.,;[]()")
    if norm(p) == norm(g): return 1.0
    def to_float(x):
        try: return float(latex_to_sympy(x))
        except Exception:
            try: return float(sp.sympify(latex_to_sympy(x)).evalf())
            except Exception: return None
    # 2) Numeric match within a relative tolerance.
    fp, fg = to_float(p), to_float(g)
    if fp is not None and fg is not None:
        if abs(fp - fg) / max(1.0, abs(fg)) < tol: return 1.0
    # 3) Symbolic equivalence via SymPy.
    try:
        ep = sp.sympify(latex_to_sympy(p)); eg = sp.sympify(latex_to_sympy(g))
        if sp.simplify(ep - eg) == 0: return 1.0
    except Exception:
        pass
    # 4) Partial credit when the gold answer appears inside the prediction.
    if norm(g) and norm(g) in norm(p): return 0.5
    return 0.0
print("\n=== Grader sanity checks ===")
for pred, gold, want in [
    ("The answer is \\boxed{120}", "[120]", 1.0),
    ("After computing: 7396 \\pi", "7396\\pi", 1.0),
    ("Final answer: -71/4", "-\\frac{71}{4}", 1.0),
    ("Therefore the result is 0.0074", "0.0074", 1.0),
    ("Final answer: nucleus accumbens", "Nucleus accumbens", 1.0),
    ("I do not know", "12", 0.0),
]:
    print(f"  pred={pred[:38]!r:42s} gold={gold!r:22s} -> r={grade(pred, gold)} (want {want})")
SYSTEM = ("You are a STEM expert solving multimodal reasoning problems. "
          "You will see a question and one or more figures. "
          "Reason step-by-step, then end with exactly one line:\n"
          "Final answer: <your answer>")
def build_prompt(ex):
    # One placeholder tag per image, in order.
    img_tags = "\n".join(f"[Image {i+1}]" for i in range(len(ex["images"])))
    return f"{SYSTEM}\n\n{img_tags}\n\nQuestion:\n{ex['question']}\n\nLet's think step-by-step."
print("\n=== Example prompt (truncated) ===")
print(build_prompt(ds[0])[:600], "...\n")
We build a verifiable reward function that extracts final answers and compares predictions against gold answers using exact, numeric, and symbolic matching, with partial credit of 0.5 when the normalized gold answer appears inside the prediction. We also add a LaTeX-to-SymPy conversion helper, allowing mathematical expressions to be evaluated more reliably. We test the grader with sanity checks and then create a structured prompt format for vision-language model reasoning.
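The numeric tier of the grading logic described above can be sketched in isolation with only the standard library. This is a simplified stand-in, not the tutorial's grader: `extract_final_line`, `to_number`, and `numeric_reward` are hypothetical names, and parsing is limited to decimals and simple `a/b` fractions via `fractions.Fraction` rather than SymPy.

```python
import re
from fractions import Fraction

FINAL_RE = re.compile(r"final\s+answer\s*[:=]\s*([^\n]+)", re.IGNORECASE)

def extract_final_line(text):
    """Return the text after 'Final answer:' or fall back to the last non-empty line."""
    m = FINAL_RE.search(text or "")
    if m:
        return m.group(1).strip().strip(".,;")
    lines = [ln.strip() for ln in (text or "").splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def to_number(s):
    """Parse a decimal or simple a/b fraction; return None if parsing fails."""
    try:
        return float(Fraction(s.replace(" ", "")))
    except (ValueError, ZeroDivisionError):
        return None

def numeric_reward(pred, gold, tol=1e-4):
    """1.0 if |pred - gold| / max(1, |gold|) is under tol, else 0.0."""
    p, g = to_number(extract_final_line(pred)), to_number(gold)
    if p is None or g is None:
        return 0.0
    return 1.0 if abs(p - g) / max(1.0, abs(g)) < tol else 0.0

r_fraction = numeric_reward("Some steps...\nFinal answer: -71/4", "-17.75")
r_close = numeric_reward("Final answer: 0.00745", "0.0074")
r_far = numeric_reward("Final answer: 0.0084", "0.0074")
print(r_fraction, r_close, r_far)
```

Because the denominator is floored at 1, the check behaves like an absolute tolerance for gold values smaller than 1 in magnitude: `0.00745` is accepted against `0.0074` even though the relative error is under one percent but well above `tol`. Whether that is acceptable depends on the answer scales in the data.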
import torch
USE_VLM = torch.cuda.is_available()
print(f"CUDA available: {USE_VLM}")
if USE_VLM:
    try:
        from transformers import AutoProcessor, AutoModelForVision2Seq
        MODEL_ID = "HuggingFaceTB/SmolVLM-Instruct"
        print(f"Loading {MODEL_ID} (this takes ~1 min) ...")
        processor = AutoProcessor.from_pretrained(MODEL_ID)
        model = AutoModelForVision2Seq.from_pretrained(
            MODEL_ID, torch_dtype=torch.float16, device_map="auto"
        )
        def vlm_solve(ex, max_new_tokens=512):
            imgs = [im.convert("RGB") for im in ex["images"]]
            # One image slot per figure, followed by the text prompt.
            content = [{"type": "image"} for _ in imgs]
            content.append({"type": "text", "text": build_prompt(ex)})
            text = processor.apply_chat_template(
                [{"role": "user", "content": content}], add_generation_prompt=True)
            inputs = processor(text=text, images=imgs, return_tensors="pt").to(model.device)
            with torch.no_grad():
                out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
            # Decode only the newly generated tokens.
            return processor.batch_decode(
                out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
        rows, sample_idx = [], random.sample(range(len(ds)), 6)
        for i in sample_idx:
            ex = ds[i]
            try:
                pred = vlm_solve(ex)
                r = grade(pred, ex["answer"])
            except Exception as e:
                pred, r = f"<error: {e}>", 0.0
            rows.append({"id": ex["conversation_id"], "domain": ex["domain"],
                         "reward": r, "pred_tail": pred[-200:]})
            print(f"  id={ex['conversation_id']}  {ex['domain']:9s}  r={r:.2f}")
        res = pd.DataFrame(rows)
        print(f"\nMean reward over {len(res)} samples: {res['reward'].mean():.3f}")
        print(res.groupby("domain")["reward"].mean().rename("avg_reward"))
    except Exception as e:
        print(f"VLM run failed ({e}); reward & data pipeline remain usable.")
else:
    print("No GPU detected; skipping live VLM inference (Runtime → Change runtime type → GPU).")
out_dir = Path("/content/open_mm_rl_processed"); out_dir.mkdir(exist_ok=True, parents=True)
img_dir = out_dir / "images"; img_dir.mkdir(exist_ok=True)
records = []
for ex in ds:
    paths = []
    # Save each image as PNG and keep its path alongside the prompt.
    for j, im in enumerate(ex["images"]):
        p = img_dir / f"{ex['conversation_id']}_{j}.png"
        im.convert("RGB").save(p)
        paths.append(str(p))
    records.append({
        "id": ex["conversation_id"],
        "domain": ex["domain"],
        "subDomain": ex["subDomain"],
        "format": ex["format"],
        "prompt": build_prompt(ex),
        "gold": ex["answer"],
        "image_paths": paths,
    })
jsonl_path = out_dir / "records.jsonl"
with open(jsonl_path, "w") as f:
    for r in records: f.write(json.dumps(r) + "\n")
print(f"\nWrote {len(records)} records → {jsonl_path}")
print(f"Saved {sum(len(r['image_paths']) for r in records)} images under {img_dir}")
def mock_policy_samples(gold, K=4):
    """Stand-in for K policy rollouts. Replace with model.generate(do_sample=True)."""
    return [gold,
            "Final answer: 0",
            f"Final answer: {gold} (≈)",
            "I think the answer is unclear."][:K]
def grpo_advantages(rewards):
    # Standardize rewards within the group; epsilon avoids division by zero.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)
print("\n=== Mock GRPO rollouts for example 0 ===")
gold0 = ds[0]["answer"]
cands = mock_policy_samples(gold0, K=4)
rewards = [grade(c, gold0) for c in cands]
adv = grpo_advantages(rewards)
for c, r, a in zip(cands, rewards, adv):
    print(f"  r={r:.2f}  adv={a:+.2f}  cand={c!r}")
print("\nDone. To turn this into real training:")
print("  1. Replace mock_policy_samples with vlm_solve(..., do_sample=True, num_return_sequences=K).")
print("  2. Feed (prompt, K rollouts, K rewards) into TRL's GRPOTrainer or verl.")
print("  3. Curriculum: start with examples where rewards have non-zero variance.")
We check whether CUDA is available and, optionally, run SmolVLM on a few examples to generate predictions, then score them using our reward function. We then export the dataset to a GRPO-style JSONL format, saving all images to disk for future multimodal RL experiments. Finally, we demonstrate mock GRPO rollouts, calculate group-relative advantages, and outline how this can be replaced with real model-generated samples.
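The group-relative advantage step mentioned above can be illustrated without NumPy. This is a minimal sketch, assuming the same standardization as the tutorial's helper (subtract the group mean, divide by the population standard deviation plus a small epsilon); `group_relative_advantages` and `eps` are illustrative names, the reward values are made up, and production GRPO implementations may differ in detail.

```python
from statistics import fmean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group: (r - mean) / (std + eps)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std, matching NumPy's default ddof=0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical reward groups for one prompt with four rollouts each.
mixed = group_relative_advantages([1.0, 0.0, 0.5, 0.0])
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
print([round(a, 2) for a in mixed])
print(flat)
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is the motivation for the curriculum hint about starting with examples whose rewards have non-zero variance.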
In conclusion, we built a full workflow for understanding, evaluating, and preparing the Open-MM-RL dataset for multimodal reasoning experiments. We moved from dataset loading and exploratory analysis to image inspection, LaTeX-aware answer classification, reward scoring, prompt construction, optional VLM inference, and GRPO-style rollout preparation. It provides a strong starting point for training and evaluating vision-language models with verifiable rewards, while also helping us understand how to transform multimodal datasets into practical reinforcement learning pipelines.
The post Design a Complete Multimodal RLVR Pipeline with Open-MM-RL, Vision-Language Prompting, Reward Scoring, and GRPO Export appeared first on MarkTechPost.
