arXiv:2603.19466

ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models

Published on Mar 19 · Submitted by Thomas De Min on Mar 23

AI-generated summary

MLLMs demonstrate limited proactive behavior in requesting user interventions on challenging tasks, and their performance is further hindered by biases from conversational context and in-context learning, though reinforcement-learning fine-tuning shows potential for learning such behavior.

Abstract

Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit similar "proactive" behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench, showing that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; and (iii) "hinting" at proactiveness yields only marginal gains. Surprisingly, we find that conversation histories and in-context learning introduce negative biases, hindering performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning: its results suggest that proactiveness can be learned and even generalizes to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.

Community

Paper submitter

We introduce ProactiveBench, a benchmark that evaluates whether MLLMs can ask for help when faced with unanswerable visual queries, e.g., by suggesting that the user move an occluding object rather than hallucinating or abstaining. We repurpose 7 datasets into 7 distinct proactive scenarios (occlusion removal, camera movement, image quality enhancement, sketch completion, and more), totaling 108k+ images across 18k samples.
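As a rough, hypothetical illustration of what scoring proactiveness involves, the sketch below queries a model on each sample and checks whether the response requests a user intervention instead of guessing. The `query_mllm` callable, the sample fields, and the cue list are placeholders for illustration, not ProactiveBench's actual interface or judging protocol:

```python
# Hypothetical sketch of a proactiveness check; not the paper's
# actual evaluation protocol. `query_mllm` and the sample fields
# ("image", "question") are assumptions for illustration.
from typing import Callable, Dict, Iterable

INTERVENTION_CUES = (
    "could you move", "please remove", "can you adjust",
    "remove the obstruction", "move closer", "improve the lighting",
)

def is_proactive(response: str) -> bool:
    """Heuristic: does the answer ask the user to intervene
    (e.g., remove an occluder) instead of guessing or abstaining?"""
    text = response.lower()
    return any(cue in text for cue in INTERVENTION_CUES)

def proactiveness_rate(
    samples: Iterable[Dict[str, str]],
    query_mllm: Callable[[str, str], str],
) -> float:
    """Fraction of unanswerable queries where the model asks for help."""
    samples = list(samples)
    hits = sum(
        is_proactive(query_mllm(s["image"], s["question"])) for s in samples
    )
    return hits / len(samples)
```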

Evaluating 22 MLLMs, we find that models broadly lack proactiveness regardless of size, and that hinting yields only marginal gains while conversation history and in-context learning can even hurt performance. Encouragingly, we show that proactiveness can be learned via RL fine-tuning (GRPO) and generalizes to unseen scenarios. We release ProactiveBench as a first step toward building more collaborative multimodal models.
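To make the GRPO step concrete, here is a minimal sketch of the group-relative advantage computation that drives such fine-tuning, paired with a toy proactiveness reward; the reward rule and the `asks_for_help` heuristic are assumptions, not the paper's training recipe:

```python
# Hedged sketch of GRPO's group-relative advantages with a toy
# proactiveness reward; the reward rule is an assumption, not the
# paper's exact recipe.
import statistics

def asks_for_help(response: str) -> bool:
    # Toy stand-in for a real judge of intervention requests.
    cues = ("could you", "please move", "please remove", "can you")
    return any(cue in response.lower() for cue in cues)

def reward(response: str, needs_intervention: bool) -> float:
    """+1 when the model asks for help exactly when it should."""
    return float(asks_for_help(response) == needs_intervention)

def group_advantages(responses: list[str], needs_intervention: bool) -> list[float]:
    """GRPO normalizes rewards within a group of sampled responses:
    A_i = (r_i - mean(r)) / (std(r) + eps); these advantages then
    weight the policy-gradient update in place of a learned critic."""
    rewards = [reward(r, needs_intervention) for r in responses]
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]
```

In practice, each group of responses would come from sampling the policy several times on the same benchmark prompt, so the normalization compares a rollout only against its siblings.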
