arXiv:2603.19466

ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models

Published on Mar 19 · Submitted by Thomas De Min on Mar 23

AI-generated summary

MLLMs demonstrate limited proactive behavior in requesting user interventions on challenging tasks, and their performance is further hindered by biases from conversational context and in-context learning, though reinforcement-learning fine-tuning shows potential for learning such behavior.

Abstract

Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit similar "proactive" behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench, showing that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; and (iii) "hinting" at proactiveness yields only marginal gains. Surprisingly, we find that conversation histories and in-context learning introduce negative biases, hindering performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning: its results suggest that proactiveness can be learned and even generalizes to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.

Community

Paper submitter

We introduce ProactiveBench, a benchmark that evaluates whether MLLMs can ask for help when faced with unanswerable visual queries, e.g., by suggesting that the user move an occluding object rather than hallucinating or abstaining. We repurpose 7 datasets into 7 distinct proactive scenarios (occlusion removal, camera movement, image quality enhancement, sketch completion, and more), totaling 108k+ images across 18k samples.
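As a rough, hypothetical illustration of what scoring proactiveness involves, the sketch below queries a model on each sample and checks whether the response requests a user intervention instead of guessing. The `query_mllm` callable, the sample fields, and the cue list are placeholders for illustration, not ProactiveBench's actual interface or judging protocol:

```python
# Hypothetical sketch of a proactiveness check; not the paper's
# actual evaluation protocol. `query_mllm` and the sample fields
# ("image", "question") are assumptions for illustration.
from typing import Callable, Dict, Iterable

INTERVENTION_CUES = (
    "could you move", "please remove", "can you adjust",
    "remove the obstruction", "move closer", "improve the lighting",
)

def is_proactive(response: str) -> bool:
    """Heuristic: does the answer ask the user to intervene
    (e.g., remove an occluder) instead of guessing or abstaining?"""
    text = response.lower()
    return any(cue in text for cue in INTERVENTION_CUES)

def proactiveness_rate(
    samples: Iterable[Dict[str, str]],
    query_mllm: Callable[[str, str], str],
) -> float:
    """Fraction of unanswerable queries where the model asks for help."""
    samples = list(samples)
    hits = sum(
        is_proactive(query_mllm(s["image"], s["question"])) for s in samples
    )
    return hits / len(samples)
```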

Evaluating 22 MLLMs, we find that models broadly lack proactiveness regardless of size, and that hinting yields only marginal gains while conversation history and in-context learning can even hurt performance. Encouragingly, we show that proactiveness can be learned via RL fine-tuning (GRPO) and generalizes to unseen scenarios. We release ProactiveBench as a first step toward building more collaborative multimodal models.
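To make the GRPO step concrete, here is a minimal sketch of the group-relative advantage computation that drives such fine-tuning, paired with a toy proactiveness reward; the reward rule and the `asks_for_help` heuristic are assumptions, not the paper's training recipe:

```python
# Hedged sketch of GRPO's group-relative advantages with a toy
# proactiveness reward; the reward rule is an assumption, not the
# paper's exact recipe.
import statistics

def asks_for_help(response: str) -> bool:
    # Toy stand-in for a real judge of intervention requests.
    cues = ("could you", "please move", "please remove", "can you")
    return any(cue in response.lower() for cue in cues)

def reward(response: str, needs_intervention: bool) -> float:
    """+1 when the model asks for help exactly when it should."""
    return float(asks_for_help(response) == needs_intervention)

def group_advantages(responses: list[str], needs_intervention: bool) -> list[float]:
    """GRPO normalizes rewards within a group of sampled responses:
    A_i = (r_i - mean(r)) / (std(r) + eps); these advantages then
    weight the policy-gradient update in place of a learned critic."""
    rewards = [reward(r, needs_intervention) for r in responses]
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]
```

In practice, each group of responses would come from sampling the policy several times on the same benchmark prompt, so the normalization compares a rollout only against its siblings.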
