arxiv:2604.16902

Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

Published on Apr 18

Abstract

Research reveals that native omni-modal large language models exhibit visual preference over text, with modality preference emerging progressively in mid-to-late layers and enabling diagnosis of cross-modal hallucinations.

AI-generated summary

Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this gap, we first systematically quantify the modality preference of OLLMs using a newly curated conflict-based benchmark and the modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the "text-dominance" of traditional VLMs, most OLLMs exhibit a pronounced visual preference. To further understand the underlying mechanism, we conduct layer-wise probing and demonstrate that such modality preference is not static but emerges progressively in the mid-to-late layers. Building upon these insights, we leverage these internal signals to diagnose cross-modal hallucinations, achieving competitive performance across three downstream multi-modal benchmarks without task-specific data. Our work provides both a mechanistic understanding and a practical tool for building more trustworthy OLLMs. Our code and related resources are publicly available at: https://github.com/icip-cas/OmniPreference
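The abstract names a modality selection rate metric without defining it here. One plausible reading, offered only as a sketch, is the fraction of conflict samples on which the model's answer agrees with each modality's content. The function name, the label strings, and the input format below are assumptions for illustration, not the paper's definition.

```python
from collections import Counter

def modality_selection_rate(followed):
    """Share of conflict samples on which the model followed each modality.

    `followed` is a hypothetical input: one label per conflict sample naming
    the modality whose content the model's answer agreed with, for example
    "text", "vision", "audio", or "other" when it matched none of them.
    """
    if not followed:
        raise ValueError("need at least one conflict sample")
    counts = Counter(followed)
    total = len(followed)
    return {modality: counts[modality] / total for modality in sorted(counts)}

# Toy example: four conflict samples, two resolved in favor of vision.
rates = modality_selection_rate(["vision", "vision", "text", "audio"])
```

Under this reading, a visual preference would show up as the "vision" rate exceeding the others on the same set of conflict samples.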

Community

Bowieee

Paper submitter about 17 hours ago

Proposing a modality preference evaluation framework for OLLMs: Constructing a tri-modal semantic conflict dataset with quantitative metrics to systematically measure model modality preferences.
Revealing the modality preference landscape of OLLMs: Unlike the "text-dominance" of traditional VLMs, most OLLMs exhibit a pronounced visual preference.
Investigating the internal evolution patterns of modality preference: Employing layer-wise linear probing to reveal that modality preference signals are absent in shallow layers and gradually emerge in mid-to-late layers.
Leveraging linear probes for hallucination detection: Discovering that hallucination generation is accompanied by abnormally elevated preference probability toward the interfering modality, enabling effective hallucination detection via linear probes.
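
The layer-wise linear probing described above can be illustrated with a small self-contained sketch. It does not use the paper's code or data: the hidden states are synthetic stand-ins in which the label signal is absent at layer 0 and grows with depth, and every name (`train_linear_probe`, `synthetic_hidden_states`, the layer count, the dimensions) is an assumption made for illustration. The probe is an ordinary logistic regression fitted by gradient descent, one per layer.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_linear_probe(X, y, lr=0.1, steps=200):
    """Fit a binary logistic-regression probe with batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of samples whose sign of the probe score matches the label."""
    correct = sum(
        int((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == bool(label))
        for x, label in zip(X, y)
    )
    return correct / len(X)

def synthetic_hidden_states(n_layers, n, dim, rng):
    """Synthetic stand-in for per-layer activations (not real model data).

    The binary label plays the role of "which modality the model follows";
    its signal strength is zero at layer 0 and grows linearly with depth.
    """
    y = [rng.randint(0, 1) for _ in range(n)]
    layers = []
    for layer in range(n_layers):
        strength = 3.0 * layer / (n_layers - 1)
        X = []
        for label in y:
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            x[0] += strength * (2 * label - 1)
            X.append(x)
        layers.append(X)
    return layers, y

rng = random.Random(0)
layers, y = synthetic_hidden_states(n_layers=6, n=200, dim=8, rng=rng)
split = 150  # first 150 samples train the probe, the rest evaluate it
accuracies = []
for X in layers:
    w, b = train_linear_probe(X[:split], y[:split])
    accuracies.append(probe_accuracy(X[split:], y[split:], w, b))
print([round(a, 2) for a in accuracies])
```

On real activations, the per-layer accuracy curve is what would show where the preference signal emerges. For hallucination detection, the same probe's predicted probability toward the interfering modality could be compared against a threshold, though the threshold and the exact procedure are not specified in this summary.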
