Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis • arXiv:2505.13227 • Published May 19, 2025
OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation • arXiv:2506.07977 • Published Jun 9, 2025
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? • arXiv:2506.11928 • Published Jun 13, 2025
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification • arXiv:2506.15569 • Published Jun 18, 2025
MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation • arXiv:2506.14028 • Published Jun 16, 2025
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents • arXiv:2506.11763 • Published Jun 13, 2025
VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning • arXiv:2506.09049 • Published Jun 10, 2025
Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers • arXiv:2507.02694 • Published Jul 3, 2025
REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once • arXiv:2507.10541 • Published Jul 14, 2025
AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs • arXiv:2507.08616 • Published Jul 11, 2025
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations • arXiv:2507.13302 • Published Jul 17, 2025
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research • arXiv:2507.13300 • Published Jul 17, 2025
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering • arXiv:2507.11527 • Published Jul 15, 2025
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers • arXiv:2507.10787 • Published Jul 14, 2025
WideSearch: Benchmarking Agentic Broad Info-Seeking • arXiv:2508.07999 • Published Aug 11, 2025
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents • arXiv:2508.13186 • Published Aug 14, 2025
AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions • arXiv:2508.16402 • Published Aug 22, 2025
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers • arXiv:2508.14704 • Published Aug 20, 2025
NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model • arXiv:2508.14444 • Published Aug 20, 2025
T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation • arXiv:2508.17472 • Published Aug 24, 2025
ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks • arXiv:2508.15804 • Published Aug 14, 2025
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers • arXiv:2508.20453 • Published Aug 28, 2025
DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks • arXiv:2509.01396 • Published Sep 1, 2025
On Robustness and Reliability of Benchmark-Based Evaluation of LLMs • arXiv:2509.04013 • Published Sep 4, 2025
SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge • arXiv:2509.07968 • Published Sep 9, 2025
LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering • arXiv:2509.09614 • Published Sep 11, 2025
ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark • arXiv:2501.01290 • Published Jan 2, 2025
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use • arXiv:2509.24002 • Published Sep 28, 2025
OceanGym: A Benchmark Environment for Underwater Embodied Agents • arXiv:2509.26536 • Published Sep 30, 2025
PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs • arXiv:2510.09507 • Published Oct 10, 2025
PICABench: How Far Are We from Physically Realistic Image Editing? • arXiv:2510.17681 • Published Oct 20, 2025
LiveTradeBench: Seeking Real-World Alpha with Large Language Models • arXiv:2511.03628 • Published Nov 5, 2025
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks • arXiv:2511.15065 • Published Nov 19, 2025
M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark • arXiv:2511.17729 • Published Nov 21, 2025
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward • arXiv:2511.20561 • Published Nov 25, 2025
RefineBench: Evaluating Refinement Capability of Language Models via Checklists • arXiv:2511.22173 • Published Nov 27, 2025
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle • arXiv:2512.04324 • Published Dec 3, 2025
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents • arXiv:2512.12730 • Published Dec 14, 2025
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value • arXiv:2512.14051 • Published Dec 16, 2025
MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments • arXiv:2512.19432 • Published Dec 22, 2025
FrontierCS: Evolving Challenges for Evolving Intelligence • arXiv:2512.15699 • Published Dec 17, 2025
GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models • arXiv:2512.15560 • Published Dec 17, 2025