BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning

📢 Latest News
- ✨ [2026-03-31] Data Split Update! We have officially released the Train/Test splits for each task (PQA, ORD, ERR, GEN, REA), making it easier for the community to train and evaluate models consistently.
- 🔥 [2026-03-18] Our BioProAgent is now live on AI4S LAB! Try it out and order wet-lab experiments there.
- 🎉 [2026-03-03] Our BioProAgent has been accepted by the ICLR 2026 LLA Workshop!
- 📝 [2026-01-21] The BioProBench paper has been updated with new experimental results on arXiv.
- 🚀 [2025-12-01] Code and dataset (v1.0) are released on GitHub.
🌟 Introduction
BioProBench is the first large-scale, integrated multi-task benchmark for biological protocol understanding and reasoning, specifically designed for Large Language Models (LLMs). It moves beyond simple QA to encompass a comprehensive suite of tasks critical for procedural text comprehension in life sciences.
Key Features:
- 📚 Large-scale Data: Built upon 27K original biological protocols, yielding nearly 556K high-quality structured instances.
- 🎯 Comprehensive Tasks: 5 core tasks: PQA (Question Answering), ORD (Step Ordering), ERR (Error Correction), GEN (Generation), and REA (Reasoning).
- 🧬 Broad Domain Coverage: Covers 16 biological subdomains from 6 major repositories.
- 🔬 Standardized Evaluation: A robust framework combining NLP metrics with novel domain-specific measures.
📊 Dataset Structure & Tasks
We provide standardized JSON files for each task, now including Train and Test splits:
| Task | Description | Files |
|------|-------------|-------|
| PQA | Protocol Question Answering | PQA_train.json, PQA_test.json |
| ORD | Step Ordering | ORD_train.json, ORD_test.json |
| ERR | Error Correction | ERR_train.json, ERR_test.json |
| GEN | Protocol Generation | GEN_train.json, GEN_test.json |
| Raw | Full Protocol Corpus | Bio-protocol.json, Protocol-io.json, etc. |
🔬 Key Findings
We evaluated 12 mainstream LLMs. Our findings reveal:
- Surface vs. Deep Understanding: Models perform well on QA (~70% accuracy) but struggle with deep procedural logic.
- Reasoning Bottleneck: Performance drops significantly on Step Ordering and Protocol Generation (BLEU < 15%), highlighting the difficulty of managing temporal dependencies.
- Bio-specific Models: Interestingly, some bio-specific models lag behind general LLMs in capturing intricate procedural dependencies, suggesting a need for larger reasoning capacity.
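To make the GEN evaluation concrete, here is an illustrative n-gram precision in the spirit of BLEU. This is not the full metric (no brevity penalty, smoothing, or multi-order averaging) and is not BioProBench's official scoring code; it only shows why reordered steps in a generated protocol lower overlap-based scores.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Modified n-gram precision: fraction of candidate n-grams found in the reference."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped counts
    total = max(sum(cand_ngrams.values()), 1)
    return overlap / total

# Two protocol steps with the same content but swapped phrase order:
ref = "add 500 ul of lysis buffer and incubate on ice for 10 min"
hyp = "add 500 ul of lysis buffer and incubate for 10 min on ice"
print(round(ngram_precision(hyp, ref, n=2), 2))
```

Even this small reordering breaks two bigrams, illustrating how strongly sequence-sensitive metrics penalize models that get the steps right but the order wrong.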
🤝 Contributing & Contact
We welcome contributions such as new protocol sources, additional domains, or novel tasks!
📜 Citation
```bibtex
@misc{bioprotocolbench2025,
  title={BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning},
  author={Yuyang Liu and Liuzhenghao Lv and Xiancheng Zhang and Jingya Wang and Li Yuan and Yonghong Tian},
  year={2025},
  url={https://arxiv.org/pdf/2505.07889}
}
```