LoopViT: Scaling Visual ARC with Looped Transformers
Abstract
Loop-ViT introduces a recursive vision transformer architecture that decouples reasoning depth from model capacity through weight-tied recurrence and dynamic exit mechanisms, achieving superior visual reasoning performance with fewer parameters.
Recent advances in visual reasoning have leveraged vision transformers to tackle the ARC-AGI benchmark. However, we argue that the feed-forward architecture, where computational depth is strictly bound to parameter size, falls short of capturing the iterative, algorithmic nature of human induction. In this work, we propose a recursive architecture called Loop-ViT, which decouples reasoning depth from model capacity through weight-tied recurrence. Loop-ViT iterates a weight-tied Hybrid Block, combining local convolutions and global attention, to form a latent chain of thought. Crucially, we introduce a parameter-free Dynamic Exit mechanism based on predictive entropy: the model halts inference when its internal state ``crystallizes" into a low-uncertainty attractor. Empirical results on the ARC-AGI-1 benchmark validate this perspective: our 18M model achieves 65.8% accuracy, outperforming massive 73M-parameter ensembles. These findings demonstrate that adaptive iterative computation offers a far more efficient scaling axis for visual reasoning than simply increasing network width. The code is available at https://github.com/WenjieShu/LoopViT.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Reasoning is a Modality (2026)
- Forest Before Trees: Latent Superposition for Efficient Visual Reasoning (2026)
- LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning (2026)
- Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space (2025)
- Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization (2026)
- Energy-Entropy Regularization: The True Power of Minimal Looped Transformers (2026)
- Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper