RationalRewards: Reasoning Rewards Scale Visual Generation at Both Training and Test Time
Abstract
Training reward models to generate multi-dimensional critiques improves visual generation through both enhanced reinforcement learning rewards and test-time refinement loops, achieving state-of-the-art performance with reduced training data requirements.
Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.
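The test-time Generate-Critique-Refine loop can be sketched concretely. The Python below is a minimal illustration under assumed interfaces, not the authors' released implementation: `generate`, `critique`, and `revise` are hypothetical stand-ins for a frozen image generator, the RationalRewards critic (returning a structured rationale plus an overall score), and a prompt reviser; `max_rounds` and `accept_score` are illustrative knobs.

```python
from typing import Any, Callable, Dict, Tuple

def generate_critique_refine(
    prompt: str,
    generate: Callable[[str], Any],                  # hypothetical: frozen text-to-image generator
    critique: Callable[[str, Any], Dict[str, Any]],  # hypothetical: reward model -> rationale + score
    revise: Callable[[str, str, str], str],          # hypothetical: rewrites the prompt from the critique
    max_rounds: int = 3,
    accept_score: float = 0.9,
) -> Tuple[Any, float]:
    """Refine outputs by revising the prompt only; no generator weights are updated."""
    current_prompt, best_image, best_score = prompt, None, float("-inf")
    for _ in range(max_rounds):
        image = generate(current_prompt)
        # Critique against the *original* intent, not the revised prompt,
        # so revisions cannot drift away from what the user asked for.
        report = critique(prompt, image)
        score = report["overall_score"]
        if score > best_score:
            best_image, best_score = image, score
        if score >= accept_score:
            break  # the critique deems the output good enough
        # Turn the critique's named failure modes into a targeted prompt revision.
        current_prompt = revise(prompt, current_prompt, report["rationale"])
    return best_image, best_score
```

Note that the loop only rewrites the prompt between rounds, which is what lets it match RL fine-tuning on some benchmarks without any parameter updates.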
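PARROT's training-data pipeline can likewise be sketched at a high level from the three stages the abstract names (anchored generation, consistency filtering, distillation). Everything below is an assumption-labeled illustration: `rationalize` and `favors_preferred` are hypothetical helpers, and the exact filtering rule is a guess at the stated idea.

```python
from typing import Any, Callable, Dict, List, Tuple

def build_rationale_dataset(
    preference_pairs: List[Tuple[str, Any, Any]],            # (prompt, preferred, rejected)
    rationalize: Callable[[str, Any, Any], Dict[str, Any]],  # hypothetical: anchored rationale generator
    favors_preferred: Callable[[Dict[str, Any]], bool],      # hypothetical: which output the rationale favors
) -> List[Dict[str, Any]]:
    """Sketch of PARROT: recover rationales from preference labels,
    keep only rationales consistent with the human label, and use
    the survivors as distillation targets for the reward model."""
    kept: List[Dict[str, Any]] = []
    for prompt, preferred, rejected in preference_pairs:
        # Anchored generation: the rationalizer is shown the human preference,
        # so it explains why the preferred output wins instead of guessing.
        rationale = rationalize(prompt, preferred, rejected)
        # Consistency filtering: drop rationales whose implied verdict
        # contradicts the human label.
        if favors_preferred(rationale):
            kept.append({"prompt": prompt, "chosen": preferred,
                         "rejected": rejected, "rationale": rationale})
    # Distillation (not shown): fine-tune the reward model to reproduce
    # the surviving rationales and their scores.
    return kept
```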
Community
RationalRewards is a reasoning reward model that scales image generation quality at both training and test time. A quick test-time prompt revision matches or beats full RL fine-tuning (the kind that eats 400 GPU hours) with almost zero extra delay. This suggests modern visual generators are secretly packed with massive dormant superpowers… we just needed the right key to wake them up. 🔥