OpenAI Releases LifeSciBench, a 750-Task Benchmark Grading AI Models on Real Life-Science Research With Expert-Written Rubrics
Most biology benchmarks ask narrow, fact-based questions with clear answers. Working scientists instead weigh imperfect evidence and make decisions. OpenAI's new LifeSciBench targets that gap directly.
Even the strongest model passes roughly one task in three. The benchmark is far from saturated.
What is LifeSciBench
LifeSciBench contains 750 expert-authored tasks. They span seven workflows and seven biological domains. Each task pairs a prompt, supporting artifacts, and a grading rubric.
The seven workflows cover evidence handling and analysis. They also include design and optimization, scientific reasoning, validation and operations, translation, and scientific communication.
The seven domains run from genomics and medicinal chemistry to medical and translational science.
Tasks are written the way a scientist would brief a colleague. They are free-response, not multiple-choice. Around 79% require multiple reasoning or decision-making steps, averaging four steps each.
How the Benchmark was Built
A cohort of 173 expert scientists wrote the tasks. Each held a Ph.D. and had biotechnology or pharmaceutical experience. Accepted tasks averaged six automated review cycles and at least two expert reviews.
Many tasks ship with artifacts. The benchmark includes 1,062 attached artifacts in total. About 53% of tasks require at least one artifact. Types include sequences, figures, tables, PDFs, and chemical structures.
A separate cohort validated quality. There were 453 reviewers, and 97% held doctorates. Overall agreement exceeded 96% on relevance, reasoning, grounding, and usefulness.
The Rubric System
Rubrics are the core mechanic here. They comprise 19,020 criteria across the benchmark. That is roughly 25 criteria per task.
Each criterion rewards one concrete property. Examples include a specific fact, a reasoning step, or a numeric answer within tolerance. Grading runs against the rubric, not a single reference string.
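To make the "numeric answer within tolerance" idea concrete, here is a minimal sketch of how one such criterion might be checked. The field names (`target`, `rel_tol`, `pts`) and the function `score_numeric_criterion` are hypothetical; the article does not describe how LifeSciBench graders are actually implemented.

```python
import math

def score_numeric_criterion(answer, criterion):
    """Return the points earned for one hypothetical numeric criterion."""
    # Full points only if the answer falls within the relative tolerance.
    within = math.isclose(answer, criterion["target"], rel_tol=criterion["rel_tol"])
    return criterion["pts"] if within else 0

# Hypothetical criterion: expected value 2.5, accepted within 5%.
criterion = {"id": "c1", "target": 2.5, "rel_tol": 0.05, "pts": 2}
close_answer = score_numeric_criterion(2.45, criterion)  # inside the tolerance
far_answer = score_numeric_criterion(3.10, criterion)    # outside the tolerance
```

Because each criterion is scored on its own, a response can miss this one and still collect points from the others.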
Two metrics summarize performance. Normalized rubric score divides awarded points by total points. Task pass rate counts tasks scoring at or above 70%.
This separation matters for interpretation. A response can earn partial credit while still failing the task. The pass threshold is strict by design.
Here is the scoring logic in plain Python:
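def grade(rubric, awarded_ids):
    """Return (normalized score, pass flag) for one task."""
    # Points available across all rubric criteria.
    total = sum(c["pts"] for c in rubric)
    # Points for the criteria the grader marked as satisfied.
    earned = sum(c["pts"] for c in rubric if c["id"] in awarded_ids)
    normalized = earned / total   # partial credit
    passed = normalized >= 0.70   # task-level success
    return normalized, passed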
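Across a whole benchmark the two metrics can diverge. The sketch below aggregates a list of per-task normalized scores into a mean score and a pass rate. The function name `summarize` and the sample scores are illustrative assumptions, not benchmark data.

```python
def summarize(task_scores, threshold=0.70):
    """Return (mean normalized score, task pass rate) for a list of scores."""
    mean_score = sum(task_scores) / len(task_scores)
    # A task passes only if its normalized score reaches the threshold.
    pass_rate = sum(s >= threshold for s in task_scores) / len(task_scores)
    return mean_score, pass_rate

# Hypothetical normalized scores for four tasks.
scores = [0.65, 0.60, 0.75, 0.40]
mean_score, pass_rate = summarize(scores)
```

In this made-up case the mean score is about 0.60 while only one task in four passes, which is the same shape as the gap between the two columns in the results table.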
How the Models Performed
OpenAI evaluated five models in a single-turn setting. Each model saw the prompt and artifacts once. Unrestricted web browsing was permitted.
| Model | Normalized score | Task pass rate |
|---|---|---|
| GPT-Rosalind | 0.576 | 36.1% |
| GPT-5.5 | 0.519 | 25.7% |
| Gemini 3.1 Pro | 0.515 | 23.6% |
| GPT-5.4 | 0.479 | 20.7% |
| Grok 4.3 | 0.399 | 13.0% |
GPT-Rosalind, OpenAI’s domain-specialized model, led overall. It had the highest per-task mean on 386 of 750 tasks. It also lifted the overall pass rate over GPT-5.5, from 25.7% to 36.1%. Pass rates stayed modest for every model.
Rankings are not the whole story. Gemini 3.1 Pro uniquely led on 214 tasks. Aggregate scores can hide task-specific strengths.
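A per-task "unique leader" count like the one above can be derived from per-task mean scores. This is a minimal sketch on made-up data; the model names, scores, and the `unique_leaders` helper are hypothetical, and the article does not say how ties were handled.

```python
from collections import Counter

def unique_leaders(per_task_scores):
    """Count tasks on which exactly one model has the highest mean score."""
    wins = Counter()
    for scores in per_task_scores.values():
        best = max(scores.values())
        leaders = [m for m, s in scores.items() if s == best]
        if len(leaders) == 1:  # assumption: ties do not count as a unique lead
            wins[leaders[0]] += 1
    return wins

# Hypothetical per-task mean scores for two models (not benchmark data).
per_task = {
    "task_a": {"model_x": 0.80, "model_y": 0.55},
    "task_b": {"model_x": 0.40, "model_y": 0.70},
    "task_c": {"model_x": 0.60, "model_y": 0.60},  # tie, no unique leader
}
wins = unique_leaders(per_task)
```

A model with a lower aggregate score can still collect many unique leads, which is why the per-task view complements the leaderboard.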
Where Models Win, and Where They Fall Short
Models were strongest on structured judgment. GPT-Rosalind reached a 0.712 mean score on Translation. Scientific Communication scored 0.718, but that category is small, so read it cautiously.
Two workflows stayed hard. Design, Optimization, and Prediction was among the hardest, with GPT-Rosalind passing 30.7%. Analysis was close behind at 30.3%.
Artifact use was a clear bottleneck. GPT-Rosalind dropped from 45.1% on text-only tasks to 28.1% on artifact tasks. GPT-5.5 fell the same way, from 29.9% to 21.9%.
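The size of the artifact penalty is easier to compare when expressed both in percentage points and relative to the text-only figure. The sketch below does that arithmetic using only the four percentages quoted above, which are assumed here to be task pass rates. The helper `artifact_gap` is illustrative.

```python
# Percentages quoted in the text: (text-only tasks, artifact tasks).
pass_rates = {
    "GPT-Rosalind": (45.1, 28.1),
    "GPT-5.5": (29.9, 21.9),
}

def artifact_gap(text_only, with_artifacts):
    """Return the drop in percentage points and as a share of the text-only rate."""
    drop = text_only - with_artifacts
    return round(drop, 1), round(drop / text_only, 3)

gaps = {model: artifact_gap(*rates) for model, rates in pass_rates.items()}
# GPT-Rosalind: 17.0 points, about 38% of its text-only rate.
# GPT-5.5: 8.0 points, about 27% of its text-only rate.
```

By this measure the specialized model loses more on artifact tasks than GPT-5.5, in both absolute and relative terms, though it still finishes higher.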
Exact outputs were hardest of all. Sequence and structure criterion success ranged from 46.9% to 18.0% across models. GPT-Rosalind’s gain over GPT-5.5 on generate/assemble items was just +0.001.
Models also stalled mid-task. For GPT-Rosalind, 109 tasks earned at least 50% rubric credit but still had a pass rate below 20%.
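One way to flag such "stalled" tasks is sketched below. It assumes each task was graded over several runs, which the article implies by quoting a per-task pass rate but does not spell out. The function `is_stalled`, its parameter names, and the sample scores are hypothetical.

```python
def is_stalled(run_scores, credit_floor=0.50, pass_threshold=0.70, pass_ceiling=0.20):
    """Flag a task with solid partial credit that rarely clears the pass bar."""
    mean_credit = sum(run_scores) / len(run_scores)
    # Share of runs whose normalized score reaches the 70% pass threshold.
    pass_rate = sum(s >= pass_threshold for s in run_scores) / len(run_scores)
    return mean_credit >= credit_floor and pass_rate < pass_ceiling

# Hypothetical normalized scores from five runs of a single task.
stalled = is_stalled([0.62, 0.58, 0.66, 0.55, 0.60])     # decent credit, never passes
not_stalled = is_stalled([0.75, 0.80, 0.60])             # passes most runs
```

Tasks flagged this way are the ones where a model gets much of the reasoning right but misses the criteria needed to cross the threshold.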
Headroom remains large. No model passed 171 of the tasks (22.8%). And 261 tasks (34.8%) had a best-model pass rate below 20%.
Strengths and Weaknesses
Strengths:
- Broad coverage across seven workflows and seven biological domains
- Expert-authored rubrics with 19,020 atomic, gradeable criteria
- Realistic artifacts: sequences, figures, tables, PDFs, and structures
- Independent validation by 453 expert reviewers, 97% with doctorates
Weaknesses:
- Single-turn only; real research is iterative and multi-turn
- Built by OpenAI, which also supplies most of the evaluated models
- Public release may be limited by safety and licensing constraints
- 750 tasks can’t cover every scientific specialty
