Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mixture

What if, instead of re-sampling one agent, you could push Gemini-2.5 Pro to 34.1% on HLE by mixing 12–15 tool-using agents that share notes and stop early? Google Cloud AI Research, with collaborators from MIT, Harvard, and Google DeepMind, introduced TUMIX (Tool-Use Mixture), a test-time framework that ensembles heterogeneous agent types (text-only, code, search, guided variants) and lets them share intermediate answers over a few refinement rounds, then stop early via an LLM-based judge. The result: higher accuracy at lower cost on hard reasoning benchmarks such as HLE, GPQA-Diamond, and AIME (2024/2025).

So, what exactly is new?
- Mixture over modality, not just more samples: TUMIX runs ~15 agent types spanning Chain-of-Thought (CoT), code execution, web search, dual-tool agents, and guided variants. Each round, every agent sees (a) the original question and (b) the other agents' previous answers, then proposes a refined answer. This message-passing raises average accuracy early while diversity gradually collapses, so stopping matters.
- Adaptive early termination: An LLM-as-Judge halts refinement once answers show strong consensus, subject to a minimum-round threshold (a minimal sketch follows this list). This preserves accuracy at ~49% of the inference cost of fixed-round refinement; token cost drops to ~46% because later rounds are token-heavier.
- Auto-designed agents: Beyond human-crafted agents, TUMIX prompts the base LLM to generate new agent types; mixing these with the manual set yields a further ~+1.2% average lift at no extra cost. The empirical "sweet spot" is ~12–15 agent types.
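
To make the early-termination rule concrete, here is a minimal sketch in Python. TUMIX uses an LLM-as-Judge; the agreement-fraction test below is a deliberately simplified stand-in, and the `min_rounds` and `consensus` defaults are hypothetical values for illustration, not numbers from the paper.

```python
from collections import Counter

def should_stop(answers: list[str], round_num: int,
                min_rounds: int = 2, consensus: float = 0.8) -> bool:
    """Simplified early-termination check: never stop before `min_rounds`
    completed rounds; afterwards, stop once a large enough fraction of
    agents converge on the same (normalized) answer."""
    if round_num < min_rounds:
        return False
    top_count = Counter(a.strip().lower() for a in answers).most_common(1)[0][1]
    return top_count / len(answers) >= consensus
```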

How does it work?
TUMIX runs a pool of heterogeneous agents (text-only Chain-of-Thought, code-executing, web-searching, and guided variants) in parallel, then iterates a small number of refinement rounds in which each agent conditions on the original question plus the other agents' prior rationales and answers (structured note-sharing). After each round, an LLM-based judge evaluates consensus/consistency to decide on early termination; if confidence is insufficient, another round is triggered, otherwise the system finalizes via simple aggregation (e.g., majority vote or a selector). This mixture-of-tool-use design trades brute-force re-sampling for diverse reasoning paths, improving coverage of correct candidates while controlling token/tool budgets; empirically, the benefits saturate around 12–15 agent types, and stopping early preserves diversity and lowers cost without sacrificing accuracy.
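
Putting the pieces together, the sketch below shows one way that round structure could look in Python. It is illustrative only: `Agent` is a hypothetical callable standing in for any of the heterogeneous agents (CoT, code, search), the judge can be the simplified `should_stop` from the sketch above, and majority vote handles final aggregation; none of these names come from the TUMIX code.

```python
from collections import Counter
from typing import Callable

# Hypothetical agent signature: (question, shared_notes) -> answer string.
Agent = Callable[[str, list[str]], str]

def tumix_style_loop(question: str, agents: list[Agent],
                     judge: Callable[[list[str], int], bool],
                     max_rounds: int = 4) -> str:
    """Mixture-of-tool-use refinement: every round, each agent sees the
    question plus all agents' previous answers; a judge decides whether
    to keep refining; the last round is aggregated by majority vote."""
    notes: list[str] = []    # previous round's answers, shared with everyone
    answers: list[str] = []
    for round_num in range(1, max_rounds + 1):
        # In practice these calls run in parallel against different
        # tool-using prompts of the same base LLM.
        answers = [agent(question, notes) for agent in agents]
        if judge(answers, round_num):   # LLM-as-Judge in TUMIX proper
            break
        notes = answers                 # structured note-sharing for next round
    return Counter(answers).most_common(1)[0][0]
```

In a real system each `Agent` would wrap a model call with a different system prompt and tool set (code interpreter, search, or neither); here any callables with the right signature work.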
Let's discuss the results
Under inference budgets comparable to strong tool-augmented baselines (Self-MoA, Symbolic-MoE, DEI, SciMaster, GSA), TUMIX delivers the best average accuracy; a scaled variant (TUMIX+) pushes further with more compute:
- HLE (Humanity's Last Exam): Pro: 21.6% → 34.1% (TUMIX+); Flash: 9.7% → 23.1%. (HLE is a difficult, multi-domain, 2,500-question benchmark finalized in 2025.)
- GPQA-Diamond: Pro: up to 88.3%; Flash: up to 82.1%. (GPQA-Diamond is the hardest 198-question subset, authored by domain experts.)
- AIME 2024/25: Pro: 96.7%; Flash: 86.7% with TUMIX(+) at test time.
Across tasks, TUMIX averages +3.55% over the best prior tool-augmented test-time scaling baseline at comparable cost, and +7.8% / +17.4% over no-scaling for Pro/Flash, respectively.

Our Comments
TUMIX is a smart move from Google because it frames test-time scaling as a search problem over heterogeneous tool policies rather than brute-force sampling. The parallel committee (text, code, search) improves candidate coverage, while the LLM judge enables early stopping that preserves diversity and reduces token/tool spend, which is useful under latency budgets. The HLE gains (34.1% with Gemini-2.5 Pro) align with the benchmark's finalized 2,500-question design, and the ~12–15 agent-type "sweet spot" suggests that selection, not generation, is the limiting factor.
Check out the Paper.