arXiv:2512.02536

WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens

Published on Dec 2, 2025
Authors:

Abstract

Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods that use a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and the Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with a linear projection to recover fine-grained image details. Experimental results confirm that our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.

AI-generated summary

Recent advances in multimodal large language models highlight the challenge of efficiently connecting vision-language models with diffusion models: fixed query tokens cause task generalization collapse, so this work proposes noisy query tokens and a VAE branch to improve continual learning and detail recovery.
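
The abstract does not give implementation details; purely as an illustration, below is a minimal PyTorch sketch of one plausible reading of the two ideas: learnable query tokens perturbed with Gaussian noise during training, pooled from VLM hidden states via cross-attention, plus a parallel VAE-latent branch with a linear projection. All module names, dimensions, and the noise scheme are assumptions, not the paper's actual design.

```python
# Hypothetical sketch of the bridging idea described in the abstract.
# Module names, dimensions, and the noise scheme are illustrative assumptions.
import torch
import torch.nn as nn

class NoisyQueryBridge(nn.Module):
    """Bridges a VLM to a diffusion model via query tokens that are
    perturbed with Gaussian noise during training (assumed reading)."""

    def __init__(self, num_queries=64, vlm_dim=1024, cond_dim=768, noise_std=0.1):
        super().__init__()
        # Learnable query tokens, as in fixed-query bridging methods.
        self.queries = nn.Parameter(torch.randn(num_queries, vlm_dim) * 0.02)
        self.noise_std = noise_std
        # Cross-attention pools information from the VLM hidden states.
        self.cross_attn = nn.MultiheadAttention(vlm_dim, num_heads=8, batch_first=True)
        # Project pooled queries into the diffusion model's conditioning space.
        self.to_cond = nn.Linear(vlm_dim, cond_dim)

    def forward(self, vlm_hidden):  # vlm_hidden: (B, seq_len, vlm_dim)
        b = vlm_hidden.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        if self.training:
            # Perturb the queries so they cover a distributed region of the
            # interface space rather than a single fixed point.
            q = q + torch.randn_like(q) * self.noise_std
        pooled, _ = self.cross_attn(q, vlm_hidden, vlm_hidden)
        return self.to_cond(pooled)  # (B, num_queries, cond_dim)

class VAEDetailBranch(nn.Module):
    """Linear projection of VAE latents into extra conditioning tokens,
    intended to carry fine-grained image detail (again, an assumed form)."""

    def __init__(self, latent_dim=4 * 32 * 32, cond_dim=768, num_tokens=16):
        super().__init__()
        self.proj = nn.Linear(latent_dim, num_tokens * cond_dim)
        self.num_tokens, self.cond_dim = num_tokens, cond_dim

    def forward(self, vae_latent):  # vae_latent: (B, C, H, W) from the image VAE
        flat = vae_latent.flatten(1)
        return self.proj(flat).view(-1, self.num_tokens, self.cond_dim)

# Shape-only usage example with placeholder tensors.
bridge, detail = NoisyQueryBridge(), VAEDetailBranch()
vlm_hidden = torch.randn(2, 77, 1024)   # placeholder VLM hidden states
vae_latent = torch.randn(2, 4, 32, 32)  # placeholder VAE latent
cond = torch.cat([bridge(vlm_hidden), detail(vae_latent)], dim=1)
print(cond.shape)  # torch.Size([2, 80, 768]) -- combined conditioning tokens
```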
