7 85

gen

ginigini

AI & ML interests

None yet

Recent Activity

liked a Space 2 days ago

FINAL-Bench/Gemma-4-Multi

reacted to SeaWolf-AI's post with 🔥 2 days ago

💎 Gemma 4 Playground — Dual Model Demo on ZeroGPU We just launched a Gemma 4 Playground that lets you chat with Google DeepMind's latest open models — directly on Hugging Face Spaces with ZeroGPU. https://huggingface.co/spaces/FINAL-Bench/Gemma-4-Multi 👉 Try it now: FINAL-Bench/Gemma-4-Multi Two Models, One Space Switch between both Gemma 4 variants in a single interface: ⚡ Gemma 4 26B-A4B — MoE with 128 experts, only 3.8B active params. 95% of the 31B's quality at ~8x faster inference. AIME 88.3%, GPQA 82.3%. 🏆 Gemma 4 31B — Dense 30.7B. Best quality among Gemma 4 family. AIME 89.2%, GPQA 84.3%, Codeforces 2150. Arena open-model top 3. Features Vision — Upload images for analysis, OCR, chart reading, document parsing Thinking Mode — Toggle chain-of-thought reasoning with Gemma 4's native <|channel> thinking tokens System Prompts — 6 presets (General, Code, Math, Creative, Translate, Research) or write your own Streaming — Real-time token-by-token response via ZeroGPU Apache 2.0 — Fully open, no restrictions Technical Details Built with the dev build of transformers (5.5.0.dev0) for full Gemma 4 support including multimodal apply_chat_template, variable-resolution image processing, and native thinking mode. Runs on HF ZeroGPU with @spaces.GPU — no dedicated GPU needed. Both models support 256K context window and 140+ languages out of the box. Links - 🤗 Space: [FINAL-Bench/Gemma-4-Multi](https://huggingface.co/spaces/FINAL-Bench/Gemma-4-Multi) - 📄 Gemma 4 26B-A4B: [google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it) - 📄 Gemma 4 31B: [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) - 🔬 DeepMind Blog: [Gemma 4 Launch](https://deepmind.google/blog/gemma-4-byte-for-byte-the-most-capable-open-models/)

liked a model 2 days ago

FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF

View all activity

Organizations

None yet

liked a Space 2 days ago

Gemma-4 Multichat

👀

Gemma 4 — MoE 26B or Dense 31B, Vision, Thinking

reactedto SeaWolf-AI's post with 🔥 2 days ago

Post

3063

💎 Gemma 4 Playground — Dual Model Demo on ZeroGPU

We just launched a Gemma 4 Playground that lets you chat with Google DeepMind's latest open models — directly on Hugging Face Spaces with ZeroGPU.

FINAL-Bench/Gemma-4-Multi

👉 Try it now: FINAL-Bench/Gemma-4-Multi
Two Models, One Space
Switch between both Gemma 4 variants in a single interface:

⚡ Gemma 4 26B-A4B — MoE with 128 experts, only 3.8B active params. 95% of the 31B's quality at ~8x faster inference. AIME 88.3%, GPQA 82.3%.
🏆 Gemma 4 31B — Dense 30.7B. Best quality among Gemma 4 family. AIME 89.2%, GPQA 84.3%, Codeforces 2150. Arena open-model top 3.

Features

Vision — Upload images for analysis, OCR, chart reading, document parsing
Thinking Mode — Toggle chain-of-thought reasoning with Gemma 4's native <|channel> thinking tokens
System Prompts — 6 presets (General, Code, Math, Creative, Translate, Research) or write your own
Streaming — Real-time token-by-token response via ZeroGPU
Apache 2.0 — Fully open, no restrictions

Technical Details
Built with the dev build of transformers (5.5.0.dev0) for full Gemma 4 support including multimodal apply_chat_template, variable-resolution image processing, and native thinking mode. Runs on HF ZeroGPU with @spaces .GPU — no dedicated GPU needed.
Both models support 256K context window and 140+ languages out of the box.

Links

- 🤗 Space: [FINAL-Bench/Gemma-4-Multi]( FINAL-Bench/Gemma-4-Multi)
- 📄 Gemma 4 26B-A4B: [google/gemma-4-26B-A4B-it]( google/gemma-4-26B-A4B-it)
- 📄 Gemma 4 31B: [google/gemma-4-31B-it]( google/gemma-4-31B-it)
- 🔬 DeepMind Blog: [Gemma 4 Launch](https://deepmind.google/blog/gemma-4-byte-for-byte-the-most-capable-open-models/)

liked 2 models 2 days ago

FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF

Text Generation • 35B • Updated 2 days ago • 639 • 13

bartowski/FINAL-Bench_Darwin-35B-A3B-Opus-GGUF

Image-Text-to-Text • 35B • Updated 3 days ago • 6.88k • 15

upvoted an article 3 days ago

Article

"The Child That Surpassed Both Parents Through MRI-Guided Evolutionary Merge"

5 days ago

•

reactedto SeaWolf-AI's post with 👍 4 days ago

Post

2115

🧬 Darwin-35B-A3B-Opus — The Child That Surpassed Both Parents

What if a merged model could beat both its parents? We proved it can.
Darwin-35B-A3B-Opus is a 35B MoE model (3B active) built with our Darwin V5 engine — the first evolution system that CT-scans parent models before merging them.
🤗 Model: FINAL-Bench/Darwin-35B-A3B-Opus

The result speaks for itself: GPQA Diamond 90.0%, versus Father (Qwen3.5-35B-A3B) at 84.2% and Mother (Claude 4.6 Opus Distilled) at 85.0%. That's +6.9% over Father and +5.9% over Mother. Not a tradeoff — a genuine leap. Meanwhile, MMMLU sits at 85.0% (Father: 85.2%), multimodal is fully intact, and all 201 languages are preserved.

How? Model MRI changed everything. Traditional merging is guesswork. Darwin V4 added evolution. Darwin V5 added X-ray vision. Model MRI scans each parent layer by layer and discovers: Mother's L34–L38 is the reasoning engine (peak cosine distance), 50–65% of Mother's experts are dead (killed by text-only distillation), and Father is a healthy generalist with every expert alive. The prescription: transplant Mother's reasoning brain at L38 (90% weight), replace her dead experts with Father's living ones, and let Father's router handle the output layer. Reasoning went up. Versatility stayed intact. No tradeoff — just evolution.

35B total, 3B active (MoE) · GPQA Diamond 90.0% · MMMLU 85.0% (201 languages) · Multimodal Image & Video · 262K native context · 147.8 tok/s on H100 · Runs on a single RTX 4090 (Q4) · Apache 2.0
Darwin V5's full algorithm and technical details will be released alongside an upcoming paper.

🚀 Live Demo: FINAL-Bench/Darwin-35B-A3B-Opus

🏆 FINAL Bench Leaderboard: FINAL-Bench/Leaderboard

📊 ALL Bench Leaderboard: FINAL-Bench/all-bench-leaderboard

Built by VIDRAFT · Supported by the Korean Government GPU Support Program

8 replies

upvoted an article 6 days ago

Article

Introducing WM Bench: A Benchmark for Cognitive Intelligence in World Models

6 days ago

•

reactedto SeaWolf-AI's post with 🔥 6 days ago

Post

4624

🌍 World Model Bench — does your world model actually think?

FID measures realism. FVD measures smoothness. But neither tells you whether the model understood the scene.

We just released WM Bench — the first benchmark for cognitive intelligence in world models. The core question: when a beast charges from 3 meters away, does the model know to sprint — not walk? Does it respond differently to a human vs an animal? Does it remember the left corridor was blocked two steps ago?

Those are cognitive questions. No existing benchmark asks them. So we built one.

3 Pillars · 10 Categories · 100 Scenarios · 1,000-point scale

- 👁 P1 Perception (25%) — Can it read the scene?
- 🧠 P2 Cognition (45%) — Does it predict threats, escalate emotions, utilize memory?
- 🔥 P3 Embodiment (30%) — Does the body respond with the right motion?

All evaluation is via simple JSON I/O — no 3D engine, no special hardware. Any model with an API can participate.

We also built PROMETHEUS as a live reference implementation — runs in your browser on a T4, no install needed. Combines FloodDiffusion motion generation with a LLM cognitive brain (Perceive → Predict → Decide → Act). Scored 726/1000 (Grade B) on Track C — the only directly verified model so far. Submissions from other teams very welcome.

---

🗂 Dataset → FINAL-Bench/World-Model
🌍 Demo → FINAL-Bench/World-Model
🏆 Leaderboard → FINAL-Bench/worldmodel-bench
📝 Article → https://huggingface.co/blog/FINAL-Bench/world-model

Part of the FINAL Bench Family — alongside FINAL Bench (Feb 2026). Feedback on rubrics and missing models always welcome!

liked 2 Spaces 6 days ago

WORLD MODEL Leaderboard

💻

WORLD MODEL Bench

PROMETHEUS v1.0 — World Model Interactive Demo

🔥

World-first embodied AI world model

liked a dataset 6 days ago

FINAL-Bench/World-Model

Viewer • Updated 6 days ago • 100 • 1.25k • 25

reactedto SeaWolf-AI's post with 🤝 6 days ago

Post

4624

liked a Space 20 days ago

SiteAgent - AI 웹 어시스턴트

🤖

어떤 웹 페이지에서든 동작하는 AI 어시스턴트.

liked a Space 23 days ago

Leaderboard of Leaderboards

🔥

Real-time rankings of the most trusted leaderboard

reactedto mayafree's post with ❤️🚀🔥 23 days ago

Post

5859

Leaderboard of Leaderboards — A Real-Time Meta-Ranking of AI Benchmarks

MAYA-AI/all-leaderboard

Hundreds of AI leaderboards exist on HuggingFace. Knowing which ones the community actually trusts has never been easy — until now.

Leaderboard of Leaderboards (LoL) ranks the leaderboards themselves, using live HuggingFace trending scores and cumulative likes as the signal. No editorial curation. No manual selection. Just what the global AI research community is actually visiting and endorsing, surfaced in real time.

Sort by trending to see what is capturing attention right now, or by likes to see what has built lasting credibility over time. Nine domain filters let you zero in on what matters most to your work, and every entry shows both its rank within this collection and its real-time global rank across all HuggingFace Spaces.

The collection spans well-established standards like Open LLM Leaderboard, Chatbot Arena, MTEB, and BigCodeBench alongside frameworks worth watching. FINAL Bench targets AGI-level evaluation across 100 tasks in 15 domains and recently reached the global top 5 in HuggingFace dataset rankings. Smol AI WorldCup runs tournament-format competitions for sub-8B models scored via FINAL Bench criteria. ALL Bench aggregates results across frameworks into a unified ranking that resists the overfitting risks of any single standard.

The deeper purpose is not convenience. It is transparency. How we measure AI matters as much as the AI we measure.

5 replies

reactedto SeaWolf-AI's post with 🔥 25 days ago

Post

11111

🏟️ Smol AI WorldCup: A 4B Model Just Beat 8B — Here's the Data

We evaluated 18 small language models from 12 makers on 125 questions across 7 languages. The results challenge the assumption that bigger is always better.

Community Article: https://huggingface.co/blog/FINAL-Bench/smol-worldcup
Live Leaderboard: ginigen-ai/smol-worldcup
Dataset: ginigen-ai/smol-worldcup

What we found:

→ Gemma-3n-E4B (4B, 2GB RAM) outscores Qwen3-8B (8B, 5.5GB). Doubling parameters gained only 0.4 points. RAM cost: 2.75x more.

→ GPT-OSS-20B fits in 1.5GB yet matches Champions-league dense models requiring 8.5GB. MoE architecture is the edge AI game-changer.

→ Thinking models hurt structured output. DeepSeek-R1-7B scores 8.7 points below same-size Qwen3-8B and runs 2.7x slower.

→ A 1.3B model fabricates confident fake content 80% of the time when prompted with nonexistent entities. Qwen3 family hits 100% trap detection across all sizes.

→ Qwen3-1.7B (1.2GB) outscores Mistral-7B, Llama-3.1-8B, and DeepSeek-R1-14B. Latest architecture at 1.7B beats older architecture at 14B.

What makes this benchmark different?

Most benchmarks ask "how smart?" — we measure five axes simultaneously: Size, Honesty, Intelligence, Fast, Thrift (SHIFT). Our ranking metric WCS = sqrt(SHIFT x PIR_norm) rewards models that are both high-quality AND efficient. Smart but massive? Low rank. Tiny but poor? Also low.

Top 5 by WCS:
1. GPT-OSS-20B — WCS 82.6 — 1.5GB — Raspberry Pi tier
2. Gemma-3n-E4B — WCS 81.8 — 2.0GB — Smartphone tier
3. Llama-4-Scout — WCS 79.3 — 240 tok/s — Fastest model
4. Qwen3-4B — WCS 76.6 — 2.8GB — Smartphone tier
5. Qwen3-1.7B — WCS 76.1 — 1.2GB — IoT tier

Built in collaboration with the FINAL Bench research team. Interoperable with ALL Bench Leaderboard for full small-to-large model comparison.

Dataset is open under Apache 2.0 (125 questions, 7 languages). We welcome new model submissions.