Benchmark and Evaluation

Humanity's Last Exam
Paper • 2501.14249 • Published • 77 upvotes

Benchmarking LLMs for Political Science: A United Nations Perspective
Paper • 2502.14122 • Published • 2 upvotes

IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval
Paper • 2503.04644 • Published • 21 upvotes

ExpertGenQA: Open-ended QA generation in Specialized Domains
Paper • 2503.02948 • Published

Toward Stable and Consistent Evaluation Results: A New Methodology for Base Model Evaluation
Paper • 2503.00812 • Published

Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content
Paper • 2503.16031 • Published • 3 upvotes

JudgeLRM: Large Reasoning Models as a Judge
Paper • 2504.00050 • Published • 62 upvotes

FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents
Paper • 2504.13128 • Published • 7 upvotes

Cost-of-Pass: An Economic Framework for Evaluating Language Models
Paper • 2504.13359 • Published • 4 upvotes

Benchmarking LLMs' Swarm Intelligence
Paper • 2505.04364 • Published • 20 upvotes

A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models
Paper • 2505.07591 • Published • 11 upvotes

On Robustness and Reliability of Benchmark-Based Evaluation of LLMs
Paper • 2509.04013 • Published • 4 upvotes