MoKus: Leveraging Cross-Modal Knowledge Transfer for Knowledge-Aware Concept Customization
Abstract
Knowledge-aware concept customization binds textual knowledge to visual concepts through a two-stage framework that learns visual anchors and updates textual knowledge for high-fidelity generation, supported by a new benchmark and cross-modal transfer capabilities.
Concept customization typically binds rare tokens to a target concept. Unfortunately, these approaches often suffer from unstable performance because such rare tokens seldom appear in the pretraining data. Moreover, rare tokens fail to convey the inherent knowledge of the target concept. We therefore introduce Knowledge-aware Concept Customization, a novel task that binds diverse textual knowledge to target visual concepts. The task requires the model to identify the knowledge contained in the text prompt and perform high-fidelity customized generation, while efficiently binding all the textual knowledge to the target concept. To this end, we propose MoKus, a novel framework for knowledge-aware concept customization. Our framework relies on a key observation, cross-modal knowledge transfer: modifying knowledge within the text modality naturally transfers to the visual modality during generation. Inspired by this observation, MoKus operates in two stages: (1) visual concept learning, where we learn an anchor representation that stores the visual information of the target concept; and (2) textual knowledge updating, where we redirect the answers of knowledge queries to the anchor representation, enabling high-fidelity customized generation. To comprehensively evaluate MoKus on this new task, we introduce the first benchmark for knowledge-aware concept customization: KnowCusBench. Extensive evaluations demonstrate that MoKus outperforms state-of-the-art methods. Moreover, cross-modal knowledge transfer allows MoKus to be easily extended to other knowledge-aware applications such as virtual concept creation and concept erasure. We also show that our method yields improvements on world-knowledge benchmarks.
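The two-stage idea in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation: the class and function names (`ToyTextEncoder`, `learn_visual_anchor`, `bind_knowledge_to_anchor`), the choice of a pooled-embedding text encoder, and the plain MSE objectives are all assumptions made purely to show how an anchor embedding could first be learned from visual features and how knowledge queries could then be redirected toward that anchor.

```python
# Illustrative sketch only -- not the MoKus implementation.
# Stage 1: optimize an "anchor" embedding from reference visual features.
# Stage 2: fine-tune a toy text encoder so that knowledge-query prompts
#          map close to the anchor, mimicking "textual knowledge updating".
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyTextEncoder(nn.Module):
    """Toy stand-in for the text encoder of a text-to-image model."""

    def __init__(self, vocab, dim=64):
        super().__init__()
        self.vocab = {tok: i for i, tok in enumerate(vocab)}
        self.embed = nn.Embedding(len(vocab), dim)

    def forward(self, tokens):
        ids = torch.tensor([self.vocab[t] for t in tokens])
        return self.embed(ids).mean(dim=0)  # crude pooled prompt embedding


def learn_visual_anchor(reference_features, dim=64, steps=200, lr=1e-2):
    """Stage 1: fit an anchor embedding to features of the target concept."""
    anchor = nn.Parameter(torch.randn(dim))
    opt = torch.optim.Adam([anchor], lr=lr)
    for _ in range(steps):
        loss = F.mse_loss(anchor, reference_features)  # placeholder objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return anchor.detach()


def bind_knowledge_to_anchor(encoder, knowledge_phrases, anchor, steps=200, lr=1e-2):
    """Stage 2: update the text encoder so knowledge queries land near the anchor."""
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _ in range(steps):
        loss = sum(F.mse_loss(encoder(p), anchor) for p in knowledge_phrases) / len(knowledge_phrases)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return encoder


if __name__ == "__main__":
    # Hypothetical knowledge queries describing the same target concept.
    phrases = [["the", "mascot", "of", "our", "lab"],
               ["the", "plush", "owl", "on", "my", "desk"]]
    vocab = sorted({t for p in phrases for t in p})
    enc = ToyTextEncoder(vocab)
    ref = torch.randn(64)  # stands in for visual features of the target concept
    anchor = learn_visual_anchor(ref)
    enc = bind_knowledge_to_anchor(enc, phrases, anchor)
    print("distance after binding:", F.mse_loss(enc(phrases[0]), anchor).item())
```

In a real system the anchor would condition a diffusion model rather than a toy encoder, but the sketch captures the separation the abstract describes: visual information is stored once in the anchor, and multiple pieces of textual knowledge are then bound to it.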
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens (2026)
- TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval (2026)
- WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition (2026)
- Ego: Embedding-Guided Personalization of Vision-Language Models (2026)
- UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing (2026)
- Mario: Multimodal Graph Reasoning with Large Language Models (2026)
- Evolving Prompt Adaptation for Vision-Language Models (2026)