OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation Paper • 2601.15369 • Published Jan 21 • 21
Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model Paper • 2601.15892 • Published Jan 22 • 53
ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration Paper • 2511.21689 • Published Nov 26, 2025 • 125
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models Paper • 2511.17487 • Published Nov 21, 2025 • 12
Kimi Linear: An Expressive, Efficient Attention Architecture Paper • 2510.26692 • Published Oct 30, 2025 • 127
The End of Manual Decoding: Towards Truly End-to-End Language Models Paper • 2510.26697 • Published Oct 30, 2025 • 117
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM Paper • 2510.15870 • Published Oct 17, 2025 • 91
SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models Paper • 2510.08559 • Published Oct 9, 2025 • 9
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models Paper • 2508.06471 • Published Aug 8, 2025 • 206
TimeScope: How Long Can Your Video Large Multimodal Model Go? Article • Jul 23, 2025 • 48
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics Paper • 2506.01844 • Published Jun 2, 2025 • 151
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning Paper • 2505.14231 • Published May 20, 2025 • 53
Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math Paper • 2504.21233 • Published Apr 30, 2025 • 49
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation Paper • 2504.17502 • Published Apr 24, 2025 • 55
Describe Anything: Detailed Localized Image and Video Captioning Paper • 2504.16072 • Published Apr 22, 2025 • 64
FlowReasoner: Reinforcing Query-Level Meta-Agents Paper • 2504.15257 • Published Apr 21, 2025 • 47