Small Vision-Language Models are Smart Compressors for Long Video Understanding Paper • 2604.08120 • Published 12 days ago • 20
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web Paper • 2604.08516 • Published 12 days ago • 42
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models Paper • 2604.08546 • Published 12 days ago • 114
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver Paper • 2604.08377 • Published 12 days ago • 280
ClawBench: Can AI Agents Complete Everyday Online Tasks? Paper • 2604.08523 • Published 12 days ago • 255
ELT: Elastic Looped Transformers for Visual Generation Paper • 2604.09168 • Published 11 days ago • 19
WildDet3D: Scaling Promptable 3D Detection in the Wild Paper • 2604.08626 • Published 12 days ago • 239
Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music Paper • 2604.10905 • Published 8 days ago • 28
Strips as Tokens: Artist Mesh Generation with Native UV Segmentation Paper • 2604.09132 • Published 11 days ago • 50
The Ultra-Scale Playbook 🌌 • 3.79k • The ultimate guide to training LLMs on large GPU clusters