The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding Paper • 2512.19693 • Published Dec 22, 2025 • 65
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale Paper • 2510.14979 • Published Oct 16, 2025 • 67
Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation Paper • 2510.08673 • Published Oct 9, 2025 • 126
FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing Paper • 2503.16153 • Published Mar 20, 2025 • 2
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate Paper • 2410.07167 • Published Oct 9, 2024 • 39