From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
Haiwen Diao
Paranioar
AI & ML interests
Vision-and-Language, Parameter-efficient Transfer Learning, Multi-modal Large Language Model
Recent Activity
upvoted a paper about 3 hours ago
VLANeXt: Recipes for Building Strong VLA Models
upvoted a paper about 5 hours ago
A Very Big Video Reasoning Suite
upvoted a paper 6 days ago
UniT: Unified Multimodal Chain-of-Thought Test-time Scaling