OpenAI Releases LifeSciBench, a 750-Task Benchmark Grading AI Models on Real Life-Science Research With Expert-Written Rubric

Most biology benchmarks ask narrow, fact-based questions with clear answers. Working scientists, by contrast, weigh imperfect evidence and make decisions. OpenAI has released LifeSciBench, which targets that gap directly.

Even the strongest model passes roughly one task in three. The benchmark is far from saturated.

What is LifeSciBench

LifeSciBench contains 750 expert-authored tasks. They span seven workflows and seven biological domains. Each task pairs a prompt, supporting artifacts, and a grading rubric.
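As a rough illustration of that prompt, artifacts, and rubric pairing, a task record could be modeled as below. This is a minimal sketch: the class names, field names, file name, and point values are all hypothetical and are not taken from the LifeSciBench release.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    """One gradeable rubric item (hypothetical schema)."""
    id: str
    description: str
    pts: int


@dataclass
class Task:
    """A benchmark task: prompt, optional artifacts, grading rubric (hypothetical schema)."""
    prompt: str
    workflow: str
    domain: str
    artifacts: list[str] = field(default_factory=list)   # file names or paths
    rubric: list[Criterion] = field(default_factory=list)

    def total_points(self) -> int:
        # Maximum score a response can earn on this task.
        return sum(c.pts for c in self.rubric)


# Illustrative values only; not a real LifeSciBench task.
example = Task(
    prompt="Cluster the spots and annotate the dominant cell type per cluster.",
    workflow="Analysis",
    domain="Genomics",
    artifacts=["visium_counts.h5"],
    rubric=[
        Criterion("c1", "Uses four k-means clusters", 3),
        Criterion("c2", "Annotates each cluster", 5),
    ],
)
print(example.total_points())  # 8
```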

The seven workflows cover evidence handling and analysis. They also include design and optimization, scientific reasoning, validation and operations, translation, and scientific communication.

The seven domains run from genomics and medicinal chemistry to medical and translational science.

Tasks are written as a scientist would brief a colleague. They are free-response, not multiple-choice. Around 79% require multiple reasoning or decision-making steps, averaging four steps each.

How the Benchmark was Built

A cohort of 173 expert scientists wrote the tasks. Each held a Ph.D. and had biotechnology or pharmaceutical experience. Accepted tasks averaged six automated review cycles and at least two expert reviews.

Many tasks ship with artifacts. The benchmark includes 1,062 attached artifacts in total. About 53% of tasks require at least one artifact. Types include sequences, figures, tables, PDFs, and chemical structures.

A separate cohort validated quality. There were 453 reviewers, and 97% held doctorates. Overall agreement exceeded 96% on relevance, reasoning, grounding, and usefulness.

The Rubric System

Rubrics are the core mechanic here. They contain 19,020 criteria across the benchmark. That is roughly 25 criteria per task.

Each criterion rewards one concrete property. Examples include a specific fact, a reasoning step, or a numeric answer within tolerance. Grading runs against the rubric, not a single reference string.
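To make the "numeric answer within tolerance" case concrete, here is a small sketch of how such a criterion might be checked. The function name and the 5% relative tolerance are assumptions for illustration; the article does not state how LifeSciBench sets tolerances.

```python
import math


def numeric_criterion_met(answer: float, expected: float, rel_tol: float = 0.05) -> bool:
    """Return True if `answer` is within a relative tolerance of `expected`.

    Hypothetical helper. math.isclose measures the difference relative to
    the larger of the two magnitudes.
    """
    return math.isclose(answer, expected, rel_tol=rel_tol)


print(numeric_criterion_met(10.3, 10.0))  # True: off by 0.3, inside 5%
print(numeric_criterion_met(12.0, 10.0))  # False: off by 2.0, outside 5%
```

A check like this credits the numeric property on its own, so a response with the right number but weak reasoning still earns that criterion's points.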

Two metrics summarize performance. Normalized rubric score divides awarded points by total points. Task pass rate counts tasks scoring at or above 70%.

This separation matters for interpretation. A response can earn partial credit while still failing the task. The pass threshold is strict by design.

Here is the scoring logic in plain Python:

def grade(rubric, awarded_ids):
    """Score one response against a task rubric."""
    total = sum(c["pts"] for c in rubric)
    earned = sum(c["pts"] for c in rubric if c["id"] in awarded_ids)
    normalized = earned / total          # partial credit
    passed = normalized >= 0.70          # task-level success
    return normalized, passed
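The same logic extends from one task to a whole benchmark run. The sketch below aggregates per-task results into the two headline metrics. It assumes the benchmark-level normalized score is a simple mean of per-task scores, which the article does not confirm, and the `summarize` helper and sample numbers are hypothetical.

```python
def summarize(results, threshold=0.70):
    """Aggregate per-task (earned_pts, total_pts) pairs.

    Returns (mean normalized score, task pass rate). Averaging per-task
    scores is an assumption, not a documented LifeSciBench rule.
    """
    scores = [earned / total for earned, total in results]
    mean_score = sum(scores) / len(scores)
    pass_rate = sum(s >= threshold for s in scores) / len(scores)
    return mean_score, pass_rate


# Four made-up tasks, each worth 20 points.
results = [(18, 20), (13, 20), (10, 20), (15, 20)]
mean_score, pass_rate = summarize(results)
print(f"{mean_score:.2f} {pass_rate:.2f}")  # 0.70 0.50
```

In this made-up run the mean score is 0.70 but only half the tasks pass, which is why the two metrics can tell different stories.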

How the Models Performed

OpenAI evaluated five models in a single-turn setting. Each model saw the prompt and artifacts once. Unrestricted web browsing was permitted.

Model	Normalized score	Task pass rate
GPT-Rosalind	0.576	36.1%
GPT-5.5	0.519	25.7%
Gemini 3.1 Pro	0.515	23.6%
GPT-5.4	0.479	20.7%
Grok 4.3	0.399	13.0%

GPT-Rosalind, OpenAI’s domain-specialized model, led overall. It had the highest per-task mean on 386 of 750 tasks. It also lifted the overall pass rate over GPT-5.5, from 25.7% to 36.1%. Pass rates stayed modest across every model.
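The size of that lift is easy to check from the table. The snippet below is a plain arithmetic sketch using the reported figures; the variable names are illustrative.

```python
# (normalized score, task pass rate in %) as reported in the table above.
leaderboard = {
    "GPT-Rosalind": (0.576, 36.1),
    "GPT-5.5": (0.519, 25.7),
    "Gemini 3.1 Pro": (0.515, 23.6),
    "GPT-5.4": (0.479, 20.7),
    "Grok 4.3": (0.399, 13.0),
}

# Model with the highest task pass rate.
best = max(leaderboard, key=lambda m: leaderboard[m][1])

# Absolute lift in percentage points, and lift relative to GPT-5.5.
lift = leaderboard["GPT-Rosalind"][1] - leaderboard["GPT-5.5"][1]
relative = lift / leaderboard["GPT-5.5"][1]

print(best)                                   # GPT-Rosalind
print(f"{lift:.1f} points, {relative:.0%}")   # 10.4 points, 40%
```

So the specialized model passes about 40% more tasks than GPT-5.5 in relative terms, while still failing nearly two tasks in three.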

Rankings are not the whole story. Gemini 3.1 Pro uniquely led on 214 tasks. Aggregate scores can hide task-specific strengths.

Where Models Win, and Where They Fall Short

Models were strongest on structured judgment. GPT-Rosalind reached a 0.712 mean score on Translation. Scientific Communication scored 0.718, but that category is small, so read it cautiously.

Two workflows stayed hard. Design, Optimization, and Prediction was among the hardest, with GPT-Rosalind passing 30.7%. Analysis was close behind at 30.3%.

Artifact use was a clear bottleneck. GPT-Rosalind dropped from 45.1% on text-only tasks to 28.1% on artifact tasks. GPT-5.5 fell the same way, from 29.9% to 21.9%.
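A quick calculation shows how large the artifact penalty is in both absolute and relative terms. This is only arithmetic on the reported pass rates; the `artifact_gap` helper is a hypothetical name.

```python
def artifact_gap(text_only: float, with_artifacts: float) -> tuple[float, float]:
    """Return (drop in percentage points, drop relative to the text-only rate)."""
    drop = text_only - with_artifacts
    return drop, drop / text_only


# Reported pass rates in %: (text-only tasks, artifact tasks).
pass_rates = {"GPT-Rosalind": (45.1, 28.1), "GPT-5.5": (29.9, 21.9)}

for model, (text_only, with_artifacts) in pass_rates.items():
    drop, relative = artifact_gap(text_only, with_artifacts)
    print(f"{model}: -{drop:.1f} points ({relative:.0%} relative)")
# GPT-Rosalind: -17.0 points (38% relative)
# GPT-5.5: -8.0 points (27% relative)
```

GPT-Rosalind loses more in both absolute and relative terms, although it still passes more artifact tasks than GPT-5.5.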

Exact outputs were hardest of all. Sequence and structure criterion success ranged from 46.9% to 18.0% across models. GPT-Rosalind’s gain over GPT-5.5 on generate/assemble items was just +0.001.

Models also stalled mid-task. For GPT-Rosalind, 109 tasks earned at least 50% rubric credit but still had a pass rate below 20%.
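One way to surface such "near-miss" tasks is sketched below. It assumes each task was run several times, so that a per-task mean credit and a per-task pass rate both exist; the article implies this but does not describe the protocol. The function name, cutoffs as parameters, and sample scores are hypothetical.

```python
def near_misses(runs_by_task, threshold=0.70, min_credit=0.50, max_pass_rate=0.20):
    """Return IDs of tasks with substantial mean credit but a low pass rate.

    runs_by_task maps a task ID to a list of normalized scores, one per run.
    """
    flagged = []
    for task_id, scores in runs_by_task.items():
        mean_credit = sum(scores) / len(scores)
        pass_rate = sum(s >= threshold for s in scores) / len(scores)
        if mean_credit >= min_credit and pass_rate < max_pass_rate:
            flagged.append(task_id)
    return flagged


# Made-up scores for three tasks, five runs each.
runs = {
    "t1": [0.62, 0.66, 0.58, 0.64, 0.60],  # solid credit, never reaches 70%
    "t2": [0.80, 0.75, 0.90, 0.72, 0.85],  # passes every run
    "t3": [0.20, 0.10, 0.30, 0.25, 0.15],  # little credit at all
}
print(near_misses(runs))  # ['t1']
```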

Headroom remains large. No model passed 171 tasks (22.8%). And 261 tasks (34.8%) had a best-model pass rate below 20%.

Strengths and Weaknesses

Strengths:

Broad coverage across seven workflows and seven biological domains
Expert-authored rubrics with 19,020 atomic, gradeable criteria
Realistic artifacts: sequences, figures, tables, PDFs, and structures
Independent validation by 453 expert reviewers, 97% with doctorates

Weaknesses:

Single-turn only; real research is iterative and multi-turn
Built by OpenAI, which also supplies most evaluated models
Public release may be limited by safety and licensing constraints
750 tasks can’t cover every scientific specialty

Try It: Interactive Rubric Grader Demo

The demo walks through rubric-based grading on a real benchmark task: marking the criteria a model “got right” updates the normalized score against the 70% pass threshold.

Task (Analysis, Spatial Transcriptomics): Using attached Visium data from an FFPE cervical cancer slide, cluster the spots into four k-means groups, annotate the dominant cell type per cluster, and recommend the 1–2 most promising targeted therapies (ADC, TCE, or CAR-T) based on antigen expression differences between tumor and non-tumor regions.

The rubric for this task totals 76 points, so the 70% pass threshold sits at 53.2 points. With no criteria awarded, the response scores 0 / 76 (0%) and fails.

A response can gain partial credit but still fail the task. That gap is exactly what LifeSciBench measures.
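The 76-point rubric in the demo makes that gap easy to work through. The sketch below assumes criteria carry whole-number points, which the article does not state; the constant and function names are illustrative.

```python
import math

TOTAL_PTS = 76       # rubric total shown in the demo
THRESHOLD = 0.70     # task-level pass threshold

# Smallest whole-number score that clears the threshold
# (assumes integer point values per criterion).
min_passing = math.ceil(THRESHOLD * TOTAL_PTS)


def passes(earned: int) -> bool:
    """True if the normalized score is at or above the pass threshold."""
    return earned / TOTAL_PTS >= THRESHOLD


print(min_passing)                                  # 54
print(f"{50 / TOTAL_PTS:.1%}", passes(50))          # 65.8% False
print(f"{54 / TOTAL_PTS:.1%}", passes(54))          # 71.1% True
```

A response that earns 50 of 76 points has done most of the work and still fails, which is the partial-credit case the two metrics are designed to separate.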

Single-turn evaluation; unrestricted web browsing permitted. GPT-Rosalind led overall but uniquely topped only 386 of 750 tasks; Gemini 3.1 Pro uniquely led on 214.

Built by Marktechpost · Data: OpenAI LifeSciBench preprint & release

Verified Jun 17, 2026
