ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?
Abstract
The ViGoR benchmark addresses limitations in current AIGC evaluation by introducing a comprehensive framework for assessing visual generative reasoning across multiple modalities and cognitive dimensions.
Beneath the stunning visual fidelity of modern AIGC models lies a "logical desert", where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a "performance mirage" that overlooks the generative process. To address this, we introduce ViGoR (Vision-Generative Reasoning-centric Benchmark), a unified framework designed to dismantle this mirage. ViGoR distinguishes itself through four key innovations: 1) holistic cross-modal coverage bridging Image-to-Image and Video tasks; 2) a dual-track mechanism evaluating both intermediate processes and final results; 3) an evidence-grounded automated judge ensuring high alignment with human judgments; and 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions. Experiments on over 20 leading models reveal that even state-of-the-art systems harbor significant reasoning deficits, establishing ViGoR as a critical "stress test" for the next generation of intelligent vision models. A demo is available at https://vincenthancoder.github.io/ViGoR-Bench/
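To make the dual-track idea concrete, the sketch below shows one plausible way to combine a judge's score for the intermediate generative process with a score for the final output, then aggregate per cognitive dimension. All names, the 0.5 weighting, and the dimension labels are illustrative assumptions, not the paper's actual scoring protocol.

```python
# Hypothetical sketch of ViGoR-style dual-track scoring.
# Field names, weights, and dimensions are illustrative assumptions.
from dataclasses import dataclass
from statistics import mean


@dataclass
class SampleScores:
    process: float   # judge score for the intermediate generative process, in [0, 1]
    result: float    # judge score for the final image/video output, in [0, 1]
    dimension: str   # cognitive dimension, e.g. "physical", "causal", "spatial"


def dual_track_score(samples: list[SampleScores], w_process: float = 0.5) -> dict[str, float]:
    """Blend process/result scores per sample, then average within each dimension."""
    by_dim: dict[str, list[float]] = {}
    for s in samples:
        combined = w_process * s.process + (1.0 - w_process) * s.result
        by_dim.setdefault(s.dimension, []).append(combined)
    return {dim: mean(scores) for dim, scores in by_dim.items()}


if __name__ == "__main__":
    demo = [
        SampleScores(process=0.8, result=0.6, dimension="physical"),
        SampleScores(process=0.4, result=0.9, dimension="spatial"),
    ]
    print(dual_track_score(demo))  # {'physical': 0.7, 'spatial': 0.65}
```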
Community
Interesting breakdown of this paper on arXivLens: https://arxivlens.com/PaperView/Details/vigor-bench-how-far-are-visual-generative-models-from-zero-shot-visual-reasoners-6197-7f3bb8bb
Covers the executive summary, detailed methodology, and practical applications.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Visual-ERM: Reward Modeling for Visual Equivalence (2026)
- Improving Visual Reasoning with Iterative Evidence Refinement (2026)
- SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild? (2026)
- RISE-Video: Can Video Generators Decode Implicit World Rules? (2026)
- MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints (2026)
- HOCA-Bench: Beyond Semantic Perception to Predictive World Modeling via Hegelian Ontological-Causal Anomalies (2026)
- Navigating the Mirage: A Dual-Path Agentic Framework for Robust Misleading Chart Question Answering (2026)
