Reasoning Models Struggle to Control their Chains of Thought Paper • 2603.05706 • Published 13 days ago • 31
Running 90 Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks • Evaluate multilingual models using FineTasks
FrenchBench Evaluation datasets Collection These datasets are used to evaluate models on French performance using: https://github.com/EleutherAI/lm-evaluation-harness (from CroissantLLM paper) • 11 items • Updated Jun 7, 2024 • 8
Running on CPU Upgrade 13.9k Open LLM Leaderboard • Track, rank and evaluate open LLMs and chatbots
Running on CPU Upgrade 104 Open LLM Leaderboard • Track, rank and evaluate open LLMs and chatbots