OpenAI Releases GPT-5.5, a Fully Retrained Agentic Model That Scores 82.7% on Terminal-Bench 2.0 and 84.9% on GDPval

OpenAI has launched GPT-5.5, its most capable model to date and the first fully retrained base model since GPT-4.5. GPT-5.5 is designed to complete complex, multi-step computer tasks with minimal human direction. Think of it as the difference between an assistant who needs a checklist and one who understands the underlying goal and figures out the steps themselves. The release is rolling out today to Plus, Pro, Business, and Enterprise subscribers across ChatGPT and Codex.

What ‘Agentic’ Actually Means Here

An agentic model doesn’t just respond to a single prompt: it takes a sequence of actions, uses tools (like browsing the web, writing code, running scripts, or operating software), checks its own work, and keeps going until the task is done. Prior models often stalled at handoff points, requiring the user to re-prompt or correct course. GPT-5.5 is built to reduce those interruptions.

OpenAI released GPT-5.5 as a model targeted at agentic computer use: it writes and debugs code, browses the web, fills out spreadsheets, and keeps working through multi-step tasks without requiring a human to supervise every move.
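The act, check, and retry cycle described above can be sketched as a small loop. This is only an illustration of the general pattern, not OpenAI's implementation: `run_tests`, `propose_fix`, and `agent_loop` are hypothetical stand-ins, and a real agent would call a model and real tools where these stubs return canned results.

```python
from dataclasses import dataclass, field


def run_tests(code: str) -> str:
    """Hypothetical tool: a pretend test runner that passes once the bug is fixed."""
    return "pass" if "a + b" in code else "fail: add(2, 3) returned -1"


def propose_fix(code: str, feedback: str) -> str:
    """Hypothetical stand-in for the model proposing an edit from tool feedback."""
    return code.replace("a - b", "a + b")


@dataclass
class AgentRun:
    code: str
    log: list = field(default_factory=list)  # (step, tool result) pairs
    done: bool = False


def agent_loop(code: str, max_steps: int = 5) -> AgentRun:
    """Check the work, act on the feedback, and repeat until done or out of steps."""
    run = AgentRun(code=code)
    for step in range(max_steps):
        result = run_tests(run.code)  # check its own work
        run.log.append((step, result))
        if result == "pass":
            run.done = True  # task finished, stop acting
            break
        run.code = propose_fix(run.code, result)  # take the next action
    return run


run = agent_loop("def add(a, b):\n    return a - b\n")
print(run.done, len(run.log))
```

The `max_steps` cap is a design choice worth keeping in any such loop: it bounds cost when the agent cannot make progress, which is the situation where earlier models would have handed control back to the user.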

The Four Domains Where Gains Are Concentrated

The gains are concentrated in four areas: agentic coding, computer use, knowledge work, and early scientific research, domains OpenAI describes as those ‘where progress depends on reasoning across context and taking action over time.’

For software engineers, the most immediately relevant benchmark is SWE-Bench Pro, which evaluates real-world GitHub issue resolution across four programming languages. GPT-5.5 resolves 58.6% of tasks end-to-end in a single pass. Worth noting: Claude Opus 4.7 scores higher at 64.3% on this same benchmark, though OpenAI has noted that Anthropic reported signs of memorization on a subset of those problems, which may affect the comparison.

For long-horizon coding specifically, OpenAI also reports results on Expert-SWE, an internal benchmark measuring tasks with a median estimated human completion time of 20 hours. GPT-5.5 outperforms GPT-5.4 on Expert-SWE. This benchmark is significant because it reflects the kind of extended, multi-session engineering work (large refactors, feature builds, debugging deep in a codebase) that agentic tools are increasingly being asked to handle autonomously.

Developers who tested the system early said GPT-5.5 has a better understanding of the “shape” of a software system, and can better understand why something is failing, where the fix is needed, and what else in the codebase would be affected.

https://openai.com/index/introducing-gpt-5-5/

For ML engineers and data scientists who spend significant time in terminal environments orchestrating pipelines and debugging scripts, the Terminal-Bench 2.0 results are the most compelling signal. GPT-5.5 scores 82.7% on Terminal-Bench 2.0, which tests complex command-line workflows requiring planning, iteration, and tool coordination, beating Claude Opus 4.7 at 69.4% and Gemini 3.1 Pro at 68.5%. That is a lead of more than 13 percentage points over the nearest competitor, not a marginal one.

For broader knowledge work, GPT-5.5 scores 84.9% on GDPval, which tests agents across 44 knowledge-work occupations. On OSWorld-Verified, a benchmark measuring whether a model can autonomously operate real computer environments, it reaches 78.7%.

GPT-5.5 also ships with a Pro variant built for harder tasks that demand higher accuracy. On BrowseComp, which tests a model’s ability to track down hard-to-find information across the web, GPT-5.5 Pro scores 90.1%, ahead of Gemini 3.1 Pro at 85.9%. The model is also the top-ranked system on the Artificial Analysis Intelligence Index.

Speed and Token Efficiency

One concern with more capable models is that they tend to be slower or more expensive to run. OpenAI addressed this directly. GPT-5.5 matches GPT-5.4’s per-token latency in real-world serving while performing better across nearly every evaluation measured. It also uses significantly fewer tokens to complete the same Codex tasks, meaning shorter, more efficient runs even on complex agentic workflows.

On pricing, the standard GPT-5.5 API will be priced at $5 per million input tokens and $30 per million output tokens. For context, GPT-5.4 was priced at $2.50 per million input tokens and $15 per million output tokens, so the per-token price has doubled. The OpenAI team argued that token efficiency gains offset the cost, since GPT-5.5 completes the same Codex tasks with fewer tokens, meaning cheaper runs overall even at the higher per-token rate. GPT-5.5 Pro, the higher-accuracy variant, is priced at $30 per million input tokens and $180 per million output tokens in the API.

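For teams running Codex at scale, the net math is what matters: if GPT-5.5 completes a task in materially fewer tokens than GPT-5.4, the effective cost per completed workflow can still come out lower despite the higher rate. Because both the input and output rates exactly doubled, the break-even point is roughly half the tokens for the same task, assuming a similar mix of input and output tokens.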
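The cost comparison can be checked with a few lines of arithmetic. The sketch below uses the list prices quoted above; the token counts and the 40% and 60% reductions are hypothetical numbers chosen for illustration, not measurements, and the calculation ignores any caching or batch discounts.

```python
# List prices in USD per million tokens, as quoted in the article.
PRICES = {
    "gpt-5.4": {"input": 2.50, "output": 15.00},
    "gpt-5.5": {"input": 5.00, "output": 30.00},
}


def workflow_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one workflow at the model's list prices."""
    rates = PRICES[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000


def scaled(tokens: int, reduction: float) -> int:
    """Token count after a fractional reduction (0.4 means 40% fewer tokens)."""
    return round(tokens * (1 - reduction))


# Hypothetical baseline workload on GPT-5.4 (assumed, not from the article).
base_in, base_out = 400_000, 60_000
baseline = workflow_cost("gpt-5.4", base_in, base_out)

# The same task on GPT-5.5 under two assumed token reductions.
cost_40 = workflow_cost("gpt-5.5", scaled(base_in, 0.4), scaled(base_out, 0.4))
cost_60 = workflow_cost("gpt-5.5", scaled(base_in, 0.6), scaled(base_out, 0.6))

print(f"GPT-5.4 baseline:        ${baseline:.2f}")
print(f"GPT-5.5, 40% fewer tok.: ${cost_40:.2f}")
print(f"GPT-5.5, 60% fewer tok.: ${cost_60:.2f}")
```

Under these assumptions, a 40% token reduction still leaves the GPT-5.5 run more expensive than the GPT-5.4 baseline, while a 60% reduction makes it cheaper. Teams can substitute their own measured token counts to see which side of the break-even point their workloads fall on.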

Scale and Adoption

OpenAI has seen a surge in Codex usage, with about 4 million developers using the tool weekly. That scale matters for understanding the deployment context: GPT-5.5 is not a research preview but a production model being pushed to a large, active developer base immediately on release.

Key Takeaways

GPT-5.5 is OpenAI’s first fully retrained base model since GPT-4.5, designed specifically for agentic workflows: it can understand complex goals, use tools, check its own work, and carry multi-step tasks through to completion with minimal human direction.
The biggest performance gains are in agentic coding, computer use, knowledge work, and early scientific research. GPT-5.5 scores 82.7% on Terminal-Bench 2.0, 84.9% on GDPval, and 78.7% on OSWorld-Verified, outperforming both Claude Opus 4.7 and Gemini 3.1 Pro on several key benchmarks.
GPT-5.5 matches GPT-5.4’s per-token latency while being more capable across nearly every benchmark. It also uses significantly fewer tokens to complete the same Codex tasks, meaning better results without a proportional increase in latency or cost per completed workflow.
API pricing increases to $5/M input tokens and $30/M output tokens (up from $2.50 and $15 for GPT-5.4), with GPT-5.5 Pro priced at $30/M input and $180/M output. The OpenAI team argues token efficiency gains offset the higher per-token rate for most workloads.
GPT-5.5 is rolling out today to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, with roughly 4 million developers already using Codex weekly.

The post OpenAI Releases GPT-5.5, a Fully Retrained Agentic Model That Scores 82.7% on Terminal-Bench 2.0 and 84.9% on GDPval appeared first on MarkTechPost.
