
BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning

ArXiv · Hugging Face · GitHub · Project Page · License: CC BY 4.0


📢 Latest News

  • [2026-03-31] Data Split Update! We have officially released the Train/Test splits for each task (PQA, ORD, ERR, GEN, REA), making it easier for the community to train and evaluate models consistently.
  • 🔥 [2026-03-18] Our BioProAgent is now live on AI4S LAB! Try it out and order wet-lab experiments here.
  • 🎉 [2026-03-03] Our BioProAgent has been accepted by the ICLR 2026 LLA Workshop!
  • 📝 [2026-01-21] The BioProBench paper has been updated on arXiv with new experimental results.
  • 🚀 [2025-12-01] Code and dataset (v1.0) are released on GitHub.

🌟 Introduction

BioProBench is the first large-scale, integrated multi-task benchmark for biological protocol understanding and reasoning, specifically designed for Large Language Models (LLMs). It moves beyond simple QA to encompass a comprehensive suite of tasks critical for procedural text comprehension in life sciences.

(Figure: BioProBench overview)

Key Features:

  • 📚 Large-scale Data: Built upon 27K original biological protocols, yielding nearly 556K high-quality structured instances.
  • 🎯 Comprehensive Tasks: 5 core tasks: PQA (Question Answering), ORD (Step Ordering), ERR (Error Correction), GEN (Generation), and REA (Reasoning).
  • 🧬 Broad Domain Coverage: Covers 16 biological subdomains from 6 major repositories.
  • 🔬 Standardized Evaluation: A robust framework combining NLP metrics with novel domain-specific measures.
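As one illustration of a domain-specific measure for procedural text: step-ordering tasks like ORD are commonly scored with a rank correlation between the predicted and gold step orders. The exact metrics BioProBench uses are defined in the paper, so the Kendall's tau sketch below is an assumed example, not the benchmark's official scorer.

```python
def kendall_tau(pred, gold):
    """Kendall's tau rank correlation between two step orderings.

    pred, gold: lists containing the same step IDs, possibly in different
    orders. Returns a value in [-1, 1]; 1.0 means identical ordering.
    """
    pos = {step: i for i, step in enumerate(gold)}  # gold position of each step
    ranks = [pos[s] for s in pred]                  # predicted order mapped to gold ranks
    n = len(ranks)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if ranks[i] < ranks[j]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# A perfectly ordered prediction scores 1.0; a fully reversed one scores -1.0.
print(kendall_tau(["s1", "s2", "s3", "s4"], ["s1", "s2", "s3", "s4"]))  # 1.0
print(kendall_tau(["s4", "s3", "s2", "s1"], ["s1", "s2", "s3", "s4"]))  # -1.0
```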

📊 Dataset Structure & Tasks

(Figure: BioProBench samples)

We provide standardized JSON files for each task, now including Train and Test splits:

| Task | Description                 | Files                                       |
| ---- | --------------------------- | ------------------------------------------- |
| PQA  | Protocol Question Answering | `PQA_train.json`, `PQA_test.json`           |
| ORD  | Step Ordering               | `ORD_train.json`, `ORD_test.json`           |
| ERR  | Error Correction            | `ERR_train.json`, `ERR_test.json`           |
| GEN  | Protocol Generation         | `GEN_train.json`, `GEN_test.json`           |
| Raw  | Full Protocol Corpus        | `Bio-protocol.json`, `Protocol-io.json`, etc. |
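The per-task JSON files can be read with the standard library. A minimal loader sketch is below; the record fields shown (`protocol_id`, `question`, `options`, `answer`) are a hypothetical PQA-style schema for illustration only — consult the released files for the authoritative field names.

```python
import json

def load_task_split(path):
    """Load one task split (e.g. PQA_test.json).

    Assumes the file is a JSON array of instance dicts; check the
    released files for the actual schema.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Hypothetical PQA-style record, for illustration only.
sample = {
    "protocol_id": "bio-protocol-0001",
    "question": "Which buffer is used in the wash step?",
    "options": ["PBS", "TBST", "Water", "Ethanol"],
    "answer": "TBST",
}
print(sample["question"])
```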

🔬 Key Findings

We evaluated 12 mainstream LLMs. Our findings reveal:

  • Surface vs. Deep Understanding: Models perform well on QA (~70% accuracy) but struggle with deep procedural logic.
  • Reasoning Bottleneck: Performance drops significantly on Step Ordering and Protocol Generation (BLEU < 15%), highlighting the difficulty of managing temporal dependencies.
  • Bio-specific Models: Interestingly, some bio-specific models lag behind general LLMs in capturing intricate procedural dependencies, suggesting a need for larger reasoning capacity.
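The BLEU figure quoted for GEN is the standard n-gram overlap metric. As a simplified sketch of its core term (clipped n-gram precision, without the geometric mean over orders or the brevity penalty of full BLEU):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision, the core term of BLEU.

    candidate, reference: token lists. Counts of each candidate n-gram
    are clipped to their count in the reference before averaging.
    """
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not cand:
        return 0.0
    clipped = sum(min(c, ref[g]) for g, c in Counter(cand).items())
    return clipped / len(cand)

gen = "add buffer then incubate at 37 C".split()
ref = "add buffer and incubate at 37 C".split()
print(ngram_precision(gen, ref, 1))  # 6 of 7 unigrams match
```

Low scores on this kind of overlap metric reflect how hard it is for models to reproduce the exact wording and step order of reference protocols.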

🤝 Contributing & Contact

We welcome contributions such as new protocol sources, additional domains, or novel tasks!

📜 Citation

@misc{bioprotocolbench2025,
  title={BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning},
  author={Yuyang Liu and Liuzhenghao Lv and Xiancheng Zhang and Jingya Wang and Li Yuan and Yonghong Tian},
  year={2025},
  url={https://arxiv.org/pdf/2505.07889}
}
