
Poetiq’s Meta-System Automatically Builds a Model-Agnostic Harness That Improved Every LLM Tested on LiveCodeBench Pro Without Fine-Tuning

Poetiq has just released some very interesting results showing that its Meta-System reached a new state of the art on LiveCodeBench Pro (LCB Pro), a competitive coding benchmark, by automatically building and optimizing its own inference harness, without fine-tuning any underlying model or accessing model internals.

The result: GPT 5.5 High with Poetiq's harness scores 93.9% on LCB Pro (25Q2), up from its baseline of 89.6%. Gemini 3.1 Pro, the model the harness was specifically optimized on, jumps from 78.6% to 90.9%, surpassing Google's own Gemini 3 Deep Think (88.8%), a model that isn't even available via API for external verification.

https://poetiq.ai/posts/recursive_self_improvement_coding/

What is LiveCodeBench Pro?

Before getting into the mechanics, it helps to know why the benchmark matters. LiveCodeBench Pro (LCB Pro) is designed to test AI coding ability in a way that resists two common failure modes in benchmarks: data contamination and overfitting.

LCB Pro pulls problems from major competitive programming competitions and withholds public ground-truth code. Instead, solutions are validated against a comprehensive testing framework. Correct output alone isn't enough: solutions must also satisfy specific memory and runtime constraints. The benchmark is also subject to continuous updates, which distinguishes it from many standard benchmarks that become stale.
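LCB Pro's actual grading infrastructure isn't public, but validating output correctness together with resource limits is standard competitive-programming judging. A simplified, hypothetical sketch of that pattern (the function name and verdict strings are illustrative, not LCB Pro's API):

```python
# Simplified sketch of output-plus-resource-limit grading, not LCB Pro's
# actual grader. Runs a compiled solution on one test case under a
# wall-clock time limit; a real judge would also cap memory (e.g. via
# resource.setrlimit on POSIX) and check many hidden test cases.
import subprocess

def grade(binary: str, stdin_data: str, expected: str,
          time_limit_s: float = 2.0) -> str:
    try:
        result = subprocess.run([binary], input=stdin_data,
                                capture_output=True, text=True,
                                timeout=time_limit_s)
    except subprocess.TimeoutExpired:
        return "Time Limit Exceeded"
    if result.returncode != 0:
        return "Runtime Error"
    if result.stdout.strip() != expected.strip():
        return "Wrong Answer"
    return "Accepted"
```

The key point for a harness builder is that the only feedback signal is the verdict: the reference solution stays hidden, so a harness has to earn its score by actually producing correct, fast, memory-frugal code.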

The benchmark focuses on C++ challenges and emphasizes creative coding, testing a model's capacity for complex problem-solving and high-quality, performant procedural logic. This distinguishes it from datasets like SWEBench that evaluate tool use or bug-fixing workflows. Problems are categorized by difficulty (Easy, Medium, and Hard) based on competitive human solve rates.


Poetiq’s Strategic Framing: Three LLM Task Categories

This is Poetiq's third publicly reported benchmark, and the choice of LCB Pro was deliberate. The research team frames LLM performance around three distinct task categories: reasoning challenges (ARC-AGI is their benchmark here), retrieval challenges (Humanity's Last Exam, or HLE), and coding challenges, which, as the most pervasive commercial application for AI today, meld reasoning and retrieval with the generation of specialized procedural logic.

Their coding initiative had three specific, stated goals: first, show that an intelligent harness can boost performance without fine-tuning or special model access; second, validate the Meta-System's capacity for recursive self-improvement by creating that harness automatically; and third, demonstrate that the resulting harness is model-agnostic and can be applied to any model without modification. According to their results, all three were met.

What is a Harness, and Why Does It Matter?

In this context, a harness refers to the infrastructure wrapped around a language model to handle a specific task. Think of it as an orchestration layer: it controls how the model is prompted, how outputs are structured, how answers are assembled across multiple calls, and how solutions are evaluated.

Traditionally, these harnesses are hand-built by engineers. Poetiq's claim is that its Meta-System builds and optimizes these harnesses automatically, through recursive self-improvement. Internally, the Meta-System works by creating better strategies for deciding what to ask, refining sequential chains of questions, and devising new methods for assembling the answers. The system continually incorporates learnings from previous and current tasks and datasets to create new, customized task-specific harnesses, as well as agents and orchestrators for other task types.

How the Harness Was Built

Poetiq's Meta-System was given the LCB Pro task and built a harness from scratch using only Gemini 3.1 Pro as the base model. The Meta-System accounted for all three dimensions LCB Pro tests: accuracy, runtime, and memory constraints. The system built on insights from its earlier work on ARC-AGI and HLE when designing the harness. No fine-tuning of the underlying model was performed, and no access to internal model activations was required; only standard API access was used.

Once the harness was built and optimized for Gemini 3.1 Pro, it was applied to a broad set of other models from different providers and generations, both open-weights and proprietary, without any further optimization. Every model tested improved.

The Numbers

The benchmark results across difficulty tiers are worth examining in detail. On Hard problems, the category where gaps between models are largest, Gemini 3.1 Pro with Poetiq's harness scores 58.3%, up from its 7.7% baseline. GPT 5.5 High with the harness reaches 75.0% on Hard, up from 50.0%. Across the Easy and Medium categories, the harness also outperforms all base models.

Some of the smaller-model results are also notable. Gemini 3.0 Flash improves by 10 percentage points, going from 72.3% to 82.3%, overtaking Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.2 High, all larger and more expensive models. This mirrors a pattern Poetiq previously observed on ARC-AGI, where their optimization allowed a smaller, more economical model to surpass a larger one. Kimi K2.6 sees the biggest jump: from 50.0% to 79.9%, a roughly 30 percentage point improvement. Nemotron 3 Super 120B improves by 12.8 percentage points.

Accuracy numbers are reported directly from the LCB Pro leaderboard at livecodebenchpro.com (25Q2). For models not featured on the leaderboard, Poetiq conducted its own evaluations, cross-validating its experimental setup by replicating official leaderboard accuracies for baseline models.

Key Takeaways

  • Poetiq's Meta-System automatically builds task-specific harnesses through recursive self-improvement, with no model fine-tuning or internal model access
  • GPT 5.5 High with the harness reaches 93.9% on LCB Pro (25Q2), up 4.3 points from its 89.6% baseline; Gemini 3.1 Pro jumps 12.3 points (78.6% → 90.9%)
  • The harness is model-agnostic: optimized using only Gemini 3.1 Pro, it improved every other model tested, open-weights and proprietary alike, without modification
  • Gemini 3.0 Flash gains 10 percentage points with the harness (72.3% → 82.3%), surpassing Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.2 High despite being smaller and cheaper
  • Kimi K2.6 shows the largest gain at ~30 percentage points (50.0% → 79.9%); Nemotron 3 Super 120B improves by 12.8 points



The post Poetiq's Meta-System Automatically Builds a Model-Agnostic Harness That Improved Every LLM Tested on LiveCodeBench Pro Without Fine-Tuning appeared first on MarkTechPost.
