OpenAI Releases LifeSciBench, a 750-Task Benchmark Grading AI Models on Real Life-Science Research With Expert-Written Rubrics

Most biology benchmarks ask narrow, fact-based questions with clear answers. Working scientists, by contrast, weigh imperfect evidence and make decisions. OpenAI released LifeSciBench to target that gap directly.

Even the strongest model passes roughly one task in three. The benchmark is far from saturated.

What is LifeSciBench

LifeSciBench contains 750 expert-authored tasks. They span seven workflows and seven biological domains. Each task pairs a prompt, supporting artifacts, and a grading rubric.

The seven workflows cover evidence handling and analysis. They also include design and optimization, scientific reasoning, validation and operations, translation, and scientific communication.

The seven domains run from genomics and medicinal chemistry to medical and translational science.

Tasks are written the way a scientist would brief a colleague. They are free-response, not multiple-choice. Around 79% require multiple reasoning or decision-making steps, averaging four steps each.

How the Benchmark was Built

A cohort of 173 expert scientists wrote the tasks. Each held a Ph.D. and had biotechnology or pharmaceutical experience. Accepted tasks averaged six automated review cycles and at least two expert reviews.

Many tasks ship with artifacts. The benchmark includes 1,062 attached artifacts in total. About 53% of tasks require at least one artifact. Types include sequences, figures, tables, PDFs, and chemical structures.

A separate cohort validated quality. There were 453 reviewers, and 97% held doctorates. Overall agreement exceeded 96% on relevance, reasoning, grounding, and usefulness.

The Rubric System

Rubrics are the core mechanic here. They contain 19,020 criteria across the benchmark. That is roughly 25 criteria per task.

Each criterion rewards one concrete property. Examples include a specific fact, a reasoning step, or a numeric answer within tolerance. Grading runs against the rubric, not a single reference string.
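
As a rough illustration of what "one concrete property per criterion" could look like, the sketch below checks a numeric answer against a tolerance. The field names (`target`, `tol`, `pts`), the helper names, and the all-or-nothing award are assumptions made for the example; they are not LifeSciBench's actual schema or grader.

```python
import math

def numeric_within_tolerance(answer, target, tol):
    """Return True if answer lies within an absolute tolerance of target."""
    return math.isclose(answer, target, rel_tol=0.0, abs_tol=tol)

# Hypothetical criterion: 2 points for a value within 0.5 of 12.0.
criterion = {"id": "c1", "pts": 2, "target": 12.0, "tol": 0.5}

def award(criterion, answer):
    """Points earned for one numeric criterion (assumed all-or-nothing)."""
    ok = numeric_within_tolerance(answer, criterion["target"], criterion["tol"])
    return criterion["pts"] if ok else 0

print(award(criterion, 12.3), award(criterion, 13.0))
```

Because every criterion is scored independently, a response with one wrong number can still collect points for the facts and reasoning steps it gets right.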

Two metrics summarize performance. Normalized rubric score divides awarded points by total points. Task pass rate counts tasks scoring at or above 70%.

This separation matters for interpretation. A response can earn partial credit while still failing the task. The pass threshold is strict by design.

Here is the scoring logic in plain Python:

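def grade(rubric, awarded_ids):
    """Score one response against a rubric of {"id", "pts"} criteria."""
    total = sum(c["pts"] for c in rubric)
    # Points are earned only for criteria the grader marked as satisfied.
    earned = sum(c["pts"] for c in rubric if c["id"] in awarded_ids)
    normalized = earned / total          # partial credit
    passed = normalized >= 0.70          # task-level success
    return normalized, passed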
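
To see how the two metrics can diverge across a set of tasks, here is a minimal sketch. It assumes each task result is an (earned, total) pair of rubric points and that the benchmark-level normalized score is the mean of per-task scores. The article does not spell out the aggregation, so treat that as an assumption; the `summarize` helper and the sample numbers are invented.

```python
def summarize(results, threshold=0.70):
    """results: list of (earned_pts, total_pts) pairs, one per task."""
    scores = [earned / total for earned, total in results]
    mean_score = sum(scores) / len(scores)
    # A task counts as passed only when its own score reaches the threshold.
    pass_rate = sum(s >= threshold for s in scores) / len(scores)
    return mean_score, pass_rate

# Three made-up tasks: two earn partial credit but fall short of 70%.
mean_score, pass_rate = summarize([(16, 20), (13, 20), (10, 20)])
print(f"mean score {mean_score:.3f}, pass rate {pass_rate:.1%}")
```

In this toy case the mean score is 0.65 while only one task in three passes, the same kind of gap visible in the reported results.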

How the Models Performed

OpenAI evaluated five models in a single-turn setting. Each model saw the prompt and artifacts once. Unrestricted web browsing was permitted.

Model             Normalized score    Task pass rate
GPT-Rosalind      0.576               36.1%
GPT-5.5           0.519               25.7%
Gemini 3.1 Pro    0.515               23.6%
GPT-5.4           0.479               20.7%
Grok 4.3          0.399               13.0%

GPT-Rosalind, OpenAI's domain-specialized model, led overall. It had the highest per-task mean on 386 of 750 tasks. It also lifted the overall pass rate over GPT-5.5, from 25.7% to 36.1%. Pass rates stayed modest for every model.

Rankings are not the whole story. Gemini 3.1 Pro uniquely led on 214 tasks. Aggregate scores can hide task-specific strengths.
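
A count of tasks "uniquely led" by a model can be derived from per-task mean scores along the following lines. The data layout, the `unique_leaders` helper, the model names, and the choice to skip tied tasks are all assumptions for illustration; the article does not describe how ties were handled.

```python
from collections import Counter

def unique_leaders(per_task_scores):
    """per_task_scores: {task_id: {model_name: mean_score}}.

    Returns a Counter of tasks each model led outright; tied tasks are skipped.
    """
    wins = Counter()
    for scores in per_task_scores.values():
        best = max(scores.values())
        leaders = [m for m, s in scores.items() if s == best]
        if len(leaders) == 1:
            wins[leaders[0]] += 1
    return wins

# Made-up scores for three tasks and two models.
demo = {
    "t1": {"model_a": 0.80, "model_b": 0.60},
    "t2": {"model_a": 0.40, "model_b": 0.70},
    "t3": {"model_a": 0.50, "model_b": 0.50},  # tie: no unique leader
}
print(unique_leaders(demo))
```

A model can top the aggregate table and still lose outright on a large share of individual tasks, which is why per-task counts are worth reporting alongside the averages.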

Where Models Win, and Where They Fall Short

Models were strongest on structured judgment. GPT-Rosalind reached a 0.712 mean score on Translation. Scientific Communication scored 0.718, but that category is small, so read it cautiously.

Two workflows stayed hard. Design, Optimization, and Prediction was among the hardest, with GPT-Rosalind passing 30.7%. Analysis was close behind at 30.3%.

Artifact use was a clear bottleneck. GPT-Rosalind dropped from 45.1% on text-only tasks to 28.1% on artifact tasks. GPT-5.5 fell the same way, from 29.9% to 21.9%.

Exact outputs were hardest of all. Sequence and structure criterion success ranged from 46.9% to 18.0% across models. GPT-Rosalind's gain over GPT-5.5 on generate/assemble items was just +0.001.

Models also stalled mid-task. For GPT-Rosalind, 109 tasks earned at least 50% rubric credit but still had a pass rate below 20%.

Headroom remains large. No model passed 171 tasks (22.8%). And 261 tasks (34.8%) had a best-model pass rate below 20%.
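
The size of the artifact penalty is easier to compare across models as both an absolute and a relative drop. The short calculation below uses only the pass rates quoted above; the `artifact_drop` helper is an illustrative name.

```python
# Pass rates (%) from the article: (text-only tasks, artifact tasks).
pass_rates = {
    "GPT-Rosalind": (45.1, 28.1),
    "GPT-5.5": (29.9, 21.9),
}

def artifact_drop(text_only, with_artifacts):
    """Absolute drop in percentage points and relative drop as a fraction."""
    absolute = text_only - with_artifacts
    return absolute, absolute / text_only

for model, (text_only, with_artifacts) in pass_rates.items():
    absolute, relative = artifact_drop(text_only, with_artifacts)
    print(f"{model}: -{absolute:.1f} points ({relative:.0%} relative)")
```

GPT-Rosalind loses about 17 points (roughly 38% of its text-only pass rate) and GPT-5.5 about 8 points (roughly 27%), so the stronger model gives up more ground when artifacts are involved.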
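
Headroom figures like these can be computed from a table of per-task pass rates, as sketched below. The sketch assumes each model has a pass rate per task (for example, from repeated runs), which the article implies but does not describe; the `headroom` helper and the sample data are invented.

```python
def headroom(task_pass_rates, low=0.20):
    """task_pass_rates: {task_id: {model_name: pass_rate in [0, 1]}}.

    Returns the share of tasks no model ever passed and the share whose
    best-model pass rate is below `low`.
    """
    best = {t: max(rates.values()) for t, rates in task_pass_rates.items()}
    n = len(best)
    never_passed = sum(b == 0 for b in best.values())
    below_low = sum(b < low for b in best.values())
    return never_passed / n, below_low / n

# Made-up pass rates for four tasks and two models.
demo = {
    "t1": {"a": 0.0, "b": 0.0},
    "t2": {"a": 0.1, "b": 0.0},
    "t3": {"a": 0.9, "b": 0.5},
    "t4": {"a": 0.3, "b": 0.2},
}
print(headroom(demo))

# Sanity check of the article's percentages out of 750 tasks.
print(f"{171 / 750:.1%}", f"{261 / 750:.1%}")
```

The second share always includes the first, since a task no model passed also has a best-model pass rate below 20%.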

Strengths and Weaknesses

Strengths:

  • Broad coverage across seven workflows and seven biological domains
  • Expert-authored rubrics with 19,020 atomic, gradeable criteria
  • Realistic artifacts: sequences, figures, tables, PDFs, and structures
  • Independent validation by 453 expert reviewers, 97% with doctorates

Weaknesses:

  • Single-turn only; real research is iterative and multi-turn
  • Built by OpenAI, which also supplies most of the evaluated models
  • Public release may be limited by safety and licensing constraints
  • 750 tasks can't cover every scientific specialty

Try It: Interactive Rubric Grader Demo






LifeSciBench — Interactive Demo
Rubric Grader & Model Leaderboard
See how rubric-based grading works on a real benchmark task. Toggle the criteria a model "got right" and watch the normalized score and 70% pass threshold update live.


Task (Analysis, Spatial Transcriptomics): Using attached Visium data from an FFPE cervical cancer slide, cluster the spots into 4 k-means groups, annotate the dominant cell type per cluster, and propose the 1–2 most promising targeted therapies (ADC, TCE, or CAR-T) based on antigen expression differences between tumor and non-tumor regions.
The rubric for this task totals 76 points, so the 70% pass threshold falls at 53.2 points. With no criteria selected, the score is 0 of 76 points (0%), which is a fail.
A response can gain partial credit yet still fail the task. That gap is exactly what LifeSciBench measures.
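
For the 76-point rubric in this demo, the threshold arithmetic works out as in the sketch below. It assumes criteria are worth whole points, so the smallest passing total is the threshold rounded up; `min_points_to_pass` is an illustrative helper, not part of the benchmark.

```python
def min_points_to_pass(total_pts, threshold_pct=70):
    """Smallest whole number of points that meets the pass threshold."""
    # Integer ceiling division avoids floating-point rounding at the boundary.
    return -(-threshold_pct * total_pts // 100)

total = 76
needed = min_points_to_pass(total)
print("points needed:", needed)

for earned in (50, 53, 54):
    # Same comparison as earned / total >= 0.70, done in integers.
    passed = earned * 100 >= 70 * total
    print(earned, f"{earned / total:.1%}", "PASS" if passed else "FAIL")
```

Under that assumption a response needs 54 of 76 points. Earning 53 points, nearly 70% of the rubric, still counts as a fail.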


Single-turn evaluation; unrestricted web browsing permitted. GPT-Rosalind led overall but uniquely topped only 386 of 750 tasks; Gemini 3.1 Pro uniquely led on 214.

Built by Marktechpost · Data: OpenAI LifeSciBench preprint & launch
Verified Jun 17, 2026
