OpenAI Introduces GPT 5.2: A Long Context Workhorse For Agents, Coding And Knowledge Work

OpenAI has just launched GPT-5.2, its most advanced frontier model for professional work and long-running agents, and is rolling it out across ChatGPT and the API.

GPT-5.2 is a family of three variants. In ChatGPT, users see ChatGPT-5.2 Instant, Thinking and Pro. In the API, the corresponding models are gpt-5.2-chat-latest, gpt-5.2, and gpt-5.2-pro. Instant targets everyday assistance and learning, Thinking targets complex multi-step work and agents, and Pro allocates more compute for hard technical and analytical tasks.
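The mapping between ChatGPT variants and API model names can be captured in a small lookup. This is a minimal sketch: the model identifiers come from the article, while the `MODEL_IDS` dictionary and the `pick_model` helper are hypothetical names introduced here for illustration.

```python
# Model identifiers as listed in the article; the helper names are hypothetical.
MODEL_IDS = {
    "instant": "gpt-5.2-chat-latest",
    "thinking": "gpt-5.2",
    "pro": "gpt-5.2-pro",
}


def pick_model(variant: str) -> str:
    """Return the API model name for a ChatGPT variant (case-insensitive)."""
    key = variant.strip().lower()
    if key not in MODEL_IDS:
        raise ValueError(f"unknown GPT-5.2 variant: {variant!r}")
    return MODEL_IDS[key]


print(pick_model("Thinking"))  # gpt-5.2
```

Keeping the lookup in one place means an application can switch between variants without scattering model strings through the codebase.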

Benchmark profile, from GDPval to SWE Bench

GPT-5.2 Thinking is positioned as the main workhorse for real-world knowledge work. On GDPval, an evaluation of well-specified knowledge tasks across 44 occupations in 9 large industries, it beats or ties top industry professionals on 70.9 percent of comparisons, while producing outputs at more than 11 times the speed and under 1 percent of the estimated expert cost. For engineering teams this means the model can reliably generate artifacts such as presentations, spreadsheets, schedules, and diagrams given structured instructions.

On an internal benchmark of junior investment banking spreadsheet modeling tasks, average scores rise from 59.1 percent with GPT-5.1 to 68.4 percent with GPT-5.2 Thinking and 71.7 percent with GPT-5.2 Pro. These tasks include three-statement models and leveraged buyout models with constraints on formatting and citations, which is representative of many structured enterprise workflows.

In software engineering, GPT-5.2 Thinking reaches 55.6 percent on SWE-Bench Pro and 80.0 percent on SWE-bench Verified. SWE-Bench Pro evaluates repository-level patch generation over multiple languages, while SWE-bench Verified focuses on Python.

Long context and agentic workflows

Long context is a core design target. GPT-5.2 Thinking sets a new state of the art on OpenAI MRCRv2, a benchmark that inserts multiple identical "needle" queries into long dialogue "haystacks" and measures whether the model can reproduce the correct answer. It is the first model reported to reach near 100 percent accuracy on the 4-needle MRCR variant out to 256k tokens.
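To make the needle-in-a-haystack idea concrete, here is a toy sketch of a multi-needle setup. It is not OpenAI's benchmark code: the `build_haystack` and `score` helpers, the example prompts, and the use of `difflib.SequenceMatcher` for scoring are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def build_haystack(needle_request, needle_answers, filler_turns):
    """Interleave identical needle requests (with distinct answers) among filler turns."""
    turns = []
    needles = list(needle_answers)
    # Spread needles evenly: one needle after each chunk of filler turns.
    chunk = max(1, len(filler_turns) // max(1, len(needles)))
    for i, filler in enumerate(filler_turns):
        turns.append(filler)
        if (i + 1) % chunk == 0 and needles:
            turns.append((needle_request, needles.pop(0)))
    # Append any needles that did not fit in the loop above.
    turns.extend((needle_request, answer) for answer in needles)
    return turns


def score(predicted, expected):
    """String similarity in [0, 1]; identical strings score 1.0."""
    return SequenceMatcher(None, predicted, expected).ratio()


request = "write a poem about tapirs"
answers = [f"poem draft {n}" for n in range(1, 5)]  # four needles
filler = [(f"question {i}", f"reply {i}") for i in range(8)]
haystack = build_haystack(request, answers, filler)

# The task: reproduce the answer given to the 2nd occurrence of the request.
needle_hits = [answer for query, answer in haystack if query == request]
target = needle_hits[1]
print(score("poem draft 2", target))  # 1.0 for an exact reproduction
```

Because every needle uses the same request text, the model cannot find the answer by matching the query alone. It has to track which occurrence is being asked about, which is what makes the task harder than single-needle retrieval.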

For workloads that exceed even that context, GPT-5.2 Thinking integrates with the Responses /compact endpoint, which performs context compaction to extend the effective window for tool-heavy, long-running jobs. This is relevant if you are building agents that iteratively call tools over many steps and need to maintain state beyond the raw token limit.
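The article does not describe the request format or internal behavior of the /compact endpoint, so the following is only a local toy illustration of the general idea of context compaction: folding the oldest turns into a short summary once a budget is exceeded. The word-count token estimate, the placeholder `summarize` function, and the `compact` helper are all assumptions, not the endpoint's actual mechanism.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (not a real tokenizer)."""
    return len(text.split())


def summarize(messages):
    """Placeholder summarizer: keeps the first three words of each message.
    A real agent would ask a model to write this summary."""
    return " | ".join(" ".join(m.split()[:3]) for m in messages)


def compact(history, budget):
    """Fold the oldest messages into one summary line until the history fits the budget."""
    recent, old = list(history), []

    def current():
        head = ["[summary] " + summarize(old)] if old else []
        return head + recent

    # Move messages from the front of `recent` into `old` while over budget,
    # always keeping at least the most recent message verbatim.
    while len(recent) > 1 and sum(map(estimate_tokens, current())) > budget:
        old.append(recent.pop(0))
    return current()


history = [
    "tool call one returned a long list of flight options",
    "user asked to rebook on the earliest flight",
    "tool call two confirmed seat 12A",
    "user asked about compensation",
]
compacted = compact(history, budget=20)
for line in compacted:
    print(line)
```

The design choice worth noting is that the most recent turns stay verbatim while older ones are compressed, on the assumption that an agent usually needs exact detail only for its latest steps.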

On tool usage, GPT-5.2 Thinking reaches 98.7 percent on Tau2-bench Telecom, a multi-turn customer support benchmark where the model must orchestrate tool calls across a realistic workflow. The official examples from OpenAI's launch post show scenarios like a traveler with a delayed flight, missed connection, lost bag and medical seating requirement, where GPT-5.2 manages rebooking, special assistance seating and compensation in a consistent sequence while GPT-5.1 leaves steps unfinished.

Vision, science and math

Vision quality also moves up. GPT-5.2 Thinking roughly halves error rates on chart reasoning and user interface understanding benchmarks like CharXiv Reasoning and ScreenSpot Pro when a Python tool is enabled. The model also shows improved spatial understanding of images: for example, when labeling motherboard components with approximate bounding boxes, GPT-5.2 identifies more regions with tighter placement than GPT-5.1.
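The article does not say how placement tightness was judged. One common way to quantify how closely a predicted bounding box matches a reference box is intersection over union (IoU). The sketch below assumes that metric purely for illustration; the box coordinates are made up, and boxes are given as `(x_min, y_min, x_max, y_max)`.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


reference = (10, 10, 50, 50)  # hypothetical ground-truth component outline
loose = (0, 0, 60, 60)        # box that covers the component with a wide margin
tight = (12, 12, 50, 50)      # box that hugs the component more closely

print(iou(reference, loose) < iou(reference, tight))  # True
```

A higher IoU means the predicted box hugs the reference more closely, so under this metric a tighter box scores better than a loose one that merely contains the component.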

For scientific workloads, GPT-5.2 Pro scores 93.2 percent and GPT-5.2 Thinking 92.4 percent on GPQA Diamond, and GPT-5.2 Thinking solves 40.3 percent of FrontierMath Tier 1 to Tier 3 problems with Python tools enabled. These benchmarks cover graduate-level physics, chemistry, biology and expert mathematics, and OpenAI highlights early use where GPT-5.2 Pro contributed to a proof in statistical learning theory under human verification.

Comparison Table

| Model | Primary positioning | Context window / max output | Knowledge cutoff | Notable benchmarks (Thinking / Pro vs GPT-5.1 Thinking) |
| --- | --- | --- | --- | --- |
| GPT-5.1 | Flagship model for coding and agentic tasks with configurable reasoning effort | 400,000 tokens context, 128,000 max output | 2024-09-30 | SWE-Bench Pro 50.8 percent, SWE-bench Verified 76.3 percent, ARC-AGI-1 72.8 percent, ARC-AGI-2 17.6 percent |
| GPT-5.2 (Thinking) | New flagship model for coding and agentic tasks across industries and for long-running agents | 400,000 tokens context, 128,000 max output | 2025-08-31 | GDPval wins or ties 70.9 percent vs industry professionals, SWE-Bench Pro 55.6 percent, SWE-bench Verified 80.0 percent, ARC-AGI-1 86.2 percent, ARC-AGI-2 52.9 percent |
| GPT-5.2 Pro | Higher-compute version of GPT-5.2 for the hardest reasoning and scientific workloads, produces smarter and more precise responses | 400,000 tokens context, 128,000 max output | 2025-08-31 | GPQA Diamond 93.2 percent vs 92.4 percent for GPT-5.2 Thinking and 88.1 percent for GPT-5.1 Thinking, ARC-AGI-1 90.5 percent and ARC-AGI-2 54.2 percent |
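The gains from GPT-5.1 Thinking to GPT-5.2 Thinking can be read off the comparison as percentage-point differences. The short sketch below does that arithmetic using only the scores reported in this article; the `SCORES` dictionary and `point_gain` helper are illustrative names.

```python
# (GPT-5.1 Thinking, GPT-5.2 Thinking) scores in percent, as reported in the article.
SCORES = {
    "SWE-Bench Pro": (50.8, 55.6),
    "SWE-bench Verified": (76.3, 80.0),
    "ARC-AGI-1": (72.8, 86.2),
    "ARC-AGI-2": (17.6, 52.9),
    "GPQA Diamond": (88.1, 92.4),
}


def point_gain(old: float, new: float) -> float:
    """Absolute improvement in percentage points, rounded to one decimal."""
    return round(new - old, 1)


gains = {name: point_gain(old, new) for name, (old, new) in SCORES.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

The largest jump is on ARC-AGI-2 (35.3 points), while the two SWE-bench variants improve by 4.8 and 3.7 points respectively.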

Key Takeaways

  1. GPT-5.2 Thinking is the new default workhorse model: It replaces GPT-5.1 Thinking as the main model for coding, knowledge work and agents, keeping the same 400k context and 128k max output but with clearly higher benchmark performance across GDPval, SWE-Bench, ARC-AGI and scientific QA.
  2. Substantial accuracy jump over GPT-5.1 at similar scale: On key benchmarks, GPT-5.2 Thinking moves from 50.8 percent to 55.6 percent on SWE-Bench Pro, from 76.3 percent to 80.0 percent on SWE-bench Verified, from 72.8 percent to 86.2 percent on ARC-AGI-1 and from 17.6 percent to 52.9 percent on ARC-AGI-2, while keeping the same token limits.
  3. GPT-5.2 Pro is targeted at high-end reasoning and science: GPT-5.2 Pro is a higher-compute variant that primarily improves hard reasoning and scientific tasks, for example reaching 93.2 percent on GPQA Diamond versus 92.4 percent for GPT-5.2 Thinking and 88.1 percent for GPT-5.1 Thinking, along with higher scores on both ARC-AGI benchmarks.

The post OpenAI Introduces GPT 5.2: A Long Context Workhorse For Agents, Coding And Knowledge Work appeared first on MarkTechPost.
