Try it now: FINAL-Bench/Gemma-4-Multi
Two Models, One Space
Switch between both Gemma 4 variants in a single interface:
- Gemma 4 26B-A4B: MoE with 128 experts, only 3.8B active params. 95% of the 31B's quality at ~8x faster inference. AIME 88.3%, GPQA 82.3%.
- Gemma 4 31B: Dense 30.7B. Best quality in the Gemma 4 family. AIME 89.2%, GPQA 84.3%, Codeforces 2150. Arena open-model top 3.
Features
- Vision: Upload images for analysis, OCR, chart reading, document parsing
- Thinking Mode: Toggle chain-of-thought reasoning with Gemma 4's native <|channel> thinking tokens
- System Prompts: 6 presets (General, Code, Math, Creative, Translate, Research) or write your own
- Streaming: Real-time token-by-token response via ZeroGPU
- Apache 2.0: Fully open, no restrictions
Technical Details
Built with the dev build of transformers (5.5.0.dev0) for full Gemma 4 support, including multimodal apply_chat_template, variable-resolution image processing, and native thinking mode. Runs on HF ZeroGPU with @spaces.GPU, so no dedicated GPU is needed. Both models support a 256K context window and 140+ languages out of the box.
🧬 Darwin-35B-A3B-Opus: The Child That Surpassed Both Parents
What if a merged model could beat both its parents? We proved it can. Darwin-35B-A3B-Opus is a 35B MoE model (3B active) built with our Darwin V5 engine, the first evolution system that CT-scans parent models before merging them.
🤗 Model: FINAL-Bench/Darwin-35B-A3B-Opus
The result speaks for itself: GPQA Diamond 90.0%, versus Father (Qwen3.5-35B-A3B) at 84.2% and Mother (Claude 4.6 Opus Distilled) at 85.0%. That's a relative gain of +6.9% over Father and +5.9% over Mother (+5.8 and +5.0 percentage points). Not a tradeoff, but a genuine leap. Meanwhile, MMMLU sits at 85.0% (Father: 85.2%), multimodal is fully intact, and all 201 languages are preserved.
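The quoted gains only match the scores if they are read as relative improvements rather than percentage-point differences. A minimal Python check, using only the three GPQA Diamond scores stated in the post (the function names are ours, for illustration):

```python
def relative_gain_pct(child: float, parent: float) -> float:
    """Relative improvement of child over parent, in percent."""
    return (child - parent) / parent * 100.0


def absolute_gain_points(child: float, parent: float) -> float:
    """Difference between the two scores, in percentage points."""
    return child - parent


child_score = 90.0
parent_scores = {"Father": 84.2, "Mother": 85.0}

# Map each parent to (percentage points, relative percent), rounded to 1 decimal.
gains = {
    name: (
        round(absolute_gain_points(child_score, score), 1),
        round(relative_gain_pct(child_score, score), 1),
    )
    for name, score in parent_scores.items()
}

for name, (points, relative) in gains.items():
    print(f"{name}: +{points} points, +{relative}% relative")
```

The relative figures round to 6.9% and 5.9%, which is how the post's numbers should be read.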
How? Model MRI changed everything. Traditional merging is guesswork. Darwin V4 added evolution. Darwin V5 added X-ray vision. Model MRI scans each parent layer by layer and discovers: Mother's L34–L38 is the reasoning engine (peak cosine distance), 50–65% of Mother's experts are dead (killed by text-only distillation), and Father is a healthy generalist with every expert alive. The prescription: transplant Mother's reasoning brain at L38 (90% weight), replace her dead experts with Father's living ones, and let Father's router handle the output layer. Reasoning went up. Versatility stayed intact. No tradeoff, just evolution.
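Darwin V5's actual algorithm has not been published, so the following is only a toy sketch of two generic ideas named in the paragraph: measuring cosine distance between the parents layer by layer, and blending the most divergent layer with a 90% weight toward Mother. The tiny vectors, the function names, and the linear blending rule are all assumptions for illustration, not the Darwin implementation.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity between two flattened weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def blend(mother: list[float], father: list[float], mother_weight: float) -> list[float]:
    """Linear interpolation between two parents (assumed merge rule)."""
    return [mother_weight * m + (1.0 - mother_weight) * f for m, f in zip(mother, father)]


# Hypothetical flattened weights for two layers of each parent.
father_layers = {"L37": [1.0, 0.0, 0.0], "L38": [1.0, 0.0, 0.0]}
mother_layers = {"L37": [1.0, 0.1, 0.0], "L38": [0.0, 1.0, 0.0]}

# "Scan" step: find the layer where the parents differ most.
distances = {
    name: cosine_distance(mother_layers[name], father_layers[name])
    for name in father_layers
}
peak_layer = max(distances, key=distances.get)

# "Transplant" step: weight Mother at 90% on the peak layer only.
merged_peak = blend(mother_layers[peak_layer], father_layers[peak_layer], 0.9)
print(peak_layer, merged_peak)
```

A real merge would operate on full tensors and would also have to handle expert replacement and routing, which this sketch does not attempt.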
35B total, 3B active (MoE) · GPQA Diamond 90.0% · MMMLU 85.0% (201 languages) · Multimodal Image & Video · 262K native context · 147.8 tok/s on H100 · Runs on a single RTX 4090 (Q4) · Apache 2.0
Darwin V5's full algorithm and technical details will be released alongside an upcoming paper.
World Model Bench: does your world model actually think?
FID measures realism. FVD measures smoothness. But neither tells you whether the model understood the scene.
We just released WM Bench, the first benchmark for cognitive intelligence in world models. The core question: when a beast charges from 3 meters away, does the model know to sprint, not walk? Does it respond differently to a human vs an animal? Does it remember the left corridor was blocked two steps ago?
Those are cognitive questions. No existing benchmark asks them. So we built one.
- P1 Perception (25%): Can it read the scene?
- P2 Cognition (45%): Does it predict threats, escalate emotions, utilize memory?
- P3 Embodiment (30%): Does the body respond with the right motion?
All evaluation is via simple JSON I/O: no 3D engine, no special hardware. Any model with an API can participate.
We also built PROMETHEUS as a live reference implementation. It runs in your browser on a T4, no install needed, and combines FloodDiffusion motion generation with an LLM cognitive brain (Perceive → Predict → Decide → Act). It scored 726/1000 (Grade B) on Track C, making it the only directly verified model so far. Submissions from other teams are very welcome.
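To make the pillar weights concrete, here is a rough Python sketch of how a 1000-point total could be assembled from per-pillar results delivered as JSON. The 25/45/30 weights and the 1000-point scale come from the post. The JSON field names, the per-pillar fractions, and the assumption that the total is a simple weighted sum are ours, not the official WM Bench schema or scoring rule.

```python
import json

# Pillar weights as stated in the post.
PILLAR_WEIGHTS = {
    "P1_perception": 0.25,
    "P2_cognition": 0.45,
    "P3_embodiment": 0.30,
}
MAX_SCORE = 1000  # total scale used in the post (e.g. 726/1000)


def total_score(pillar_fractions: dict[str, float]) -> int:
    """Weighted sum of per-pillar fractions (0.0 to 1.0), scaled to 1000 points."""
    weighted = sum(
        PILLAR_WEIGHTS[name] * pillar_fractions[name] for name in PILLAR_WEIGHTS
    )
    return round(MAX_SCORE * weighted)


# Hypothetical result payload; the field names are illustrative only.
payload = '{"P1_perception": 0.9, "P2_cognition": 0.6, "P3_embodiment": 0.5}'
score = total_score(json.loads(payload))
print(f"{score}/{MAX_SCORE}")
```

Because Cognition carries 45% of the weight, a model that perceives well but reasons poorly is capped well below the top of the scale under this assumed rule.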
Hundreds of AI leaderboards exist on HuggingFace. Knowing which ones the community actually trusts has never been easy, until now.
Leaderboard of Leaderboards (LoL) ranks the leaderboards themselves, using live HuggingFace trending scores and cumulative likes as the signal. No editorial curation. No manual selection. Just what the global AI research community is actually visiting and endorsing, surfaced in real time.
Sort by trending to see what is capturing attention right now, or by likes to see what has built lasting credibility over time. Nine domain filters let you zero in on what matters most to your work, and every entry shows both its rank within this collection and its real-time global rank across all HuggingFace Spaces.
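The two sort modes and the domain filter can be sketched in a few lines of Python. This is an illustration of the idea only: the record fields, the sample entries, and the rank function are hypothetical, and the real Space reads live HuggingFace signals rather than a hard-coded list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Leaderboard:
    name: str
    domain: str
    trending: float  # current trending score
    likes: int       # cumulative likes


def rank(boards: list[Leaderboard], key: str = "trending",
         domain: Optional[str] = None) -> list[Leaderboard]:
    """Filter by domain (if given), then sort descending by the chosen signal."""
    selected = [b for b in boards if domain is None or b.domain == domain]
    return sorted(selected, key=lambda b: getattr(b, key), reverse=True)


# Made-up sample entries for illustration.
boards = [
    Leaderboard("board-a", "llm", trending=12.0, likes=9000),
    Leaderboard("board-b", "code", trending=40.0, likes=1500),
    Leaderboard("board-c", "llm", trending=25.0, likes=300),
]

by_trending = [b.name for b in rank(boards, key="trending")]
by_likes = [b.name for b in rank(boards, key="likes")]
llm_only = [b.name for b in rank(boards, key="trending", domain="llm")]
print(by_trending, by_likes, llm_only)
```

The same entry can sit at very different positions in the two orderings, which is the distinction between current attention and accumulated credibility described above.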
The collection spans well-established standards like Open LLM Leaderboard, Chatbot Arena, MTEB, and BigCodeBench alongside frameworks worth watching. FINAL Bench targets AGI-level evaluation across 100 tasks in 15 domains and recently reached the global top 5 in HuggingFace dataset rankings. Smol AI WorldCup runs tournament-format competitions for sub-8B models scored via FINAL Bench criteria. ALL Bench aggregates results across frameworks into a unified ranking that resists the overfitting risks of any single standard.
The deeper purpose is not convenience. It is transparency. How we measure AI matters as much as the AI we measure.
Smol AI WorldCup: A 4B Model Just Beat 8B. Here's the Data
We evaluated 18 small language models from 12 makers on 125 questions across 7 languages. The results challenge the assumption that bigger is always better.
- A 1.3B model fabricates confident fake content 80% of the time when prompted with nonexistent entities. Qwen3 family hits 100% trap detection across all sizes.
- Qwen3-1.7B (1.2GB) outscores Mistral-7B, Llama-3.1-8B, and DeepSeek-R1-14B. Latest architecture at 1.7B beats older architecture at 14B.
What makes this benchmark different?
Most benchmarks ask "how smart?" We measure five axes simultaneously: Size, Honesty, Intelligence, Fast, Thrift (SHIFT). Our ranking metric, WCS = sqrt(SHIFT x PIR_norm), rewards models that are both high-quality AND efficient. Smart but massive? Low rank. Tiny but poor? Also low.
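WCS as written is a geometric mean of two terms, which is why a model cannot rank highly by excelling on only one of them. A small Python sketch of that behavior follows. The post does not say how SHIFT and PIR_norm are computed or scaled, so the 0 to 1 range and the sample values here are assumptions chosen purely to show the shape of the formula.

```python
import math


def wcs(shift: float, pir_norm: float) -> float:
    """WCS = sqrt(SHIFT x PIR_norm), as given in the post.

    Both inputs are assumed to lie in [0, 1] for this illustration.
    """
    return math.sqrt(shift * pir_norm)


# Hypothetical (SHIFT, PIR_norm) pairs.
lopsided_high_shift = wcs(0.9, 0.1)
lopsided_high_pir = wcs(0.1, 0.9)
balanced = wcs(0.6, 0.6)

print(lopsided_high_shift, lopsided_high_pir, balanced)
```

Under these assumed values both lopsided profiles land near 0.3 while the balanced one reaches 0.6, even though the arithmetic mean of the two inputs is lower for the balanced case.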