AccessEval: Benchmarking Disability Bias in Large Language Models Paper • 2509.22703 • Published Sep 22 • 20
PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications Paper • 2509.23879 • Published Sep 28 • 20
RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks Paper • 2509.23673 • Published Sep 28 • 20
Aligning LLMs for Multilingual Consistency in Enterprise Applications Paper • 2509.23659 • Published Sep 28 • 20
🐺🐦‍⬛ LLM Comparison/Test: Phi-4, Qwen2 VL 72B Instruct, Aya Expanse 32B in my updated MMLU-Pro CS benchmark Article • Jan 10 • 8
BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation Paper • 2506.00482 • Published May 31 • 8