EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies Paper • 2602.09514 • Published 10 days ago • 9
AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios Paper • 2601.20613 • Published 23 days ago • 10
xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations Paper • 2506.13651 • Published Jun 16, 2025 • 8