admarcosai's Collections: Benchmarks
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Paper
• 2311.12022
• Published
• 35
GAIA: a benchmark for General AI Assistants
Paper
• 2311.12983
• Published
• 246
Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models
Paper
• 2312.04724
• Published
• 21
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Paper
• 2401.03065
• Published
• 11
Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers
Paper
• 2401.04695
• Published
• 13
reasoning-machines/gsm-hard
Viewer
• Updated
• 1.32k • 1.51k
• 63
TravelPlanner: A Benchmark for Real-World Planning with Language Agents
Paper
• 2402.01622
• Published
• 38
Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios
Paper
• 2401.17167
• Published
• 1
Language Models, Agent Models, and World Models: The LAW for Machine Reasoning and Planning
Paper
• 2312.05230
• Published
LongAlign: A Recipe for Long Context Alignment of Large Language Models
Paper
• 2401.18058
• Published
• 24
Premise Order Matters in Reasoning with Large Language Models
Paper
• 2402.08939
• Published
• 28
In Search of Needles in a 10M Haystack: Recurrent Memory Finds What LLMs Miss
Paper
• 2402.10790
• Published
• 42
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
Paper
• 2412.15204
• Published
• 38
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
Paper
• 2412.08972
• Published
• 11