GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond • Paper • 2309.16583 • Published Sep 28, 2023 • 13
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models • Paper • 2310.08491 • Published Oct 12, 2023 • 57
SO-Bench: A Structural Output Evaluation of Multimodal LLMs • Paper • 2511.21750 • Published Nov 23, 2025 • 6
LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics • Paper • 2512.21010 • Published Dec 24, 2025 • 4
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks • Paper • 2602.12670 • Published 28 days ago • 54
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale • Paper • 2602.23866 • Published 14 days ago • 83
MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants • Paper • 2603.09652 • Published 2 days ago • 11